Model accuracy test V2¶

In the accuracy testing of Model accuracy test V1, the process was not streamlined enough. We listened to feedback from the community developers and developed Model Accuracy Test V2.

In the V2 version, we no longer need to use an inference engine to start a service, nor do we need to break the testing into multiple steps.

Our goal is to make downstream accuracy testing equivalent to PPL testing. Running a program from llmc will, after completing the algorithm execution, directly conduct PPL testing and simultaneously perform the corresponding downstream accuracy testing.

To achieve the above goals, we only need to add an opencompass setting in the existing configuration.

base:
    seed: &seed 42
model:
    type: Llama
    path: model path
    torch_dtype: auto
calib:
    name: pileval
    download: False
    path: calib data path
    n_samples: 128
    bs: -1
    seq_len: 512
    preproc: pileval_awq
    seed: *seed
eval:
    eval_pos: [pretrain, fake_quant]
    name: wikitext2
    download: False
    path: eval data path
    bs: 1
    seq_len: 2048
quant:
    method: Awq
    weight:
        bit: 4
        symmetric: False
        granularity: per_group
        group_size: 128
    special:
        weight_clip: False
save:
    save_trans: True
    save_path: ./save
opencompass:
    cfg_path: opencompass config path
    output_path: ./oc_output

The cfg_path in opencompass needs to point to a configuration path for opencompass.

Here, we have provided the configurations for both the base model and the chat model regarding the human-eval test as a reference for everyone.

It is important to note that the configuration provided by opencompass needs to have the path key. However, in this case, we do not need this key because llmc will default to using the model path in the save path of trans

Of course, since the save path of trans model is required, you need to set save_trans to True if you want to test in opencompass.

The max_num_workers in opencompass refers to the maximum number of inference instances.

If the model is running on a single GPU, then max_num_workers refers to the number of inference instances to be started, meaning it will occupy max_num_workers number of GPUs.

If the model is running on multiple GPUs, as in the case of multi-GPU parallel testing (as mentioned below), for example, if the model is running inference on 2 GPUs, then max_num_workers refers to the number of inference instances to be started, meaning it will occupy 2 * max_num_workers number of GPUs.

In summary, the required number of GPUs = number of PP (pipeline parallelism) * max_num_workers.

If the required number of GPUs exceeds the actual number of available GPUs, then some workers will have to wait in a queue.

max_num_workers not only starts multiple inference instances but also splits each dataset into max_num_workers parts, which can be understood as data parallelism.

Therefore, the optimal setting is to make the required number of GPUs equal to the number of available GPUs.

For example:

On a machine with 8 GPUs, if a model runs on a single GPU, then max_num_workers=8. On a machine with 8 GPUs, if a model runs on 4 GPUs, then max_num_workers=2. We should try to lower the number of PPs while increasing max_num_workers, because PP parallelism tends to be slower. PP should only be used when the model cannot run on a single GPU, such as for a 70B model that cannot run on a single GPU. In this case, we can set PP=4 and use four 80GB GPUs to run it.

The output_path in opencompass is used to set the output directory for the evaluation logs of opencompass.

In this log directory, OpenCompass will output logs for inference and evaluation, detailed inference results, and the final evaluation accuracy.

Before running the llmc program, you also need to install the version of opencompass that has been adapted for llmc.

git clone https://github.com/ModelTC/opencompass.git -b opencompass-llmc
cd opencompass
pip install -v -e .
pip install human-eval

According to the opencompass documentation, prepare the dataset and place it in the current directory where you execute the command.

Finally, you can load the above configuration and perform model compression and accuracy testing just like running a regular llmc program.

Multi-GPU parallel test¶

If the model is too large to fit on a single GPU for evaluation, and multi-GPU evaluation is needed, we support using pipeline parallelism when running opencompass.

What you need to do is:

Identify which GPUs are available, add them to CUDA_VISIBLE_DEVICES at the beginning of your run script
Modify the file pointed to by cfg_path under opencompass, setting the num_gpus to the desired number.

Model accuracy test V2¶

Multi-GPU parallel test¶

Docs