Configs’ brief description¶

All configurations can be found here

Here’s a brief config example

base:
    seed: &seed 42 # Set random seed
model:
    type: Llama # Type of model
    path: model path # Model path
    tokenizer_mode: fast # The tokenizer type of the model
    torch_dtype: auto # Model dtype
calib:
    name: pileval # Calibration data set name
    download: False # Whether the calibration dataset can be downloaded online
    path: calib data path # Calibration dataset path
    n_samples: 512 # Number of calibration samples
    bs: 1 # Batch size of calibration dataset
    seq_len: 512 # Sequence length of calibration dataset
    preproc: pileval_smooth # Pre-procession of the calibration dataset
    seed: *seed # Random seed for calibration dataset
eval:
    eval_pos: [pretrain, transformed, fake_quant] # eval positon
    name: wikitext2 # The name of the evaluation dataset
    download: False # Whether the evaluation dataset can be downloaded online
    path: eval data path # Path to evaluation dataset
    bs: 1 # The batch size of the evaluation dataset
    seq_len: 2048 # Sequence length of the evaluation dataset
quant:
    method: SmoothQuant # Compression method
    weight:
        bit: 8 # The number of quantified bits of the weight
        symmetric: True # Is weight quantization a symmetric quantization
        granularity: per_channel # The granularity of weight quantification
    act:
        bit: 8 # Number of activated quantization bits
        symmetric: True # Whether activation quantization is symmetric quantization
        granularity: per_token # The granularity of activation quantification
save:
    save_trans: False # Whether to save the adjusted model
    save_path: ./save # Save path

Configs’ detailed description¶

base¶

base.seed

Set Random Seed, which is used to set all random seeds for the entire frame

model¶

model.type

The type of model, which can support Llama, Qwen2, Llava, Gemma2 and other models, you can check all the models supported by llmc from here.

model.path

Currently, LLMC only supports models in Hugging Face format, and you can use the following code to check whether the model can be loaded normally.

from transformers import AutoModelForCausalLM, AutoConfig


model_path = # model path
model_config = AutoConfig.from_pretrained(
    model_path, trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    config=model_config,
    trust_remote_code=True,
    torch_dtype="auto",
    low_cpu_mem_usage=True,
)

print(model)

If the above code does not load the model you give, may be:

Your model format is not hugging face format
Your version of tansformers is too low and you can execute pip install transformers --upgrade to upgrade it.

Before llmc runs, make sure that the above code can load your model successfully, otherwise llmc will not be able to load your model.

model.tokenizer_mode

Choose whether to use a Slow or Fast tokenizer

model.torch_dtype

You can set the data types of model weights:

auto
torch.float16
torch.bfloat16
torch.float32

where auto will follow the original data type setting of the weight file

calib¶

calib.name

The name of the calibration dataset. Currently supported by the following types of calibration datasets:

pileval
wikitext2
c4
ptb
custom

where custom indicates the use of user-defined calibration datasets, refer to the Custom Calibration Dataset section of the advanced usage document for specific instructions

calib.download

Indicates whether the calibration dataset needs to be downloaded online at runtime

If you set True, you do not need to set calib.path, llmc will automatically download the dataset online

If you set False, you need to set calib.path, and llmc will read the dataset from this address, and you don’t need to run llmc on the Internet

calib.path

If calib.download is set to False, you need to set calib.path, which indicates the path where the calibration dataset is stored

The data stored in this path must be a dataset in arrow format

To download the dataset in Arrow format from Hugging Face, you can use the following code

from datasets import load_dataset
calib_dataset = load_dataset(...)
calib_dataset.save_to_disk(...)

Load datasets in that format can be used

from datasets import load_from_disk
data = load_from_disk(...)

The LLMC has provided a download script for the above dataset

The calibration dataset can be downloaded here.

The execution command is python download_calib_dataset.py --save_path [calib dataset save path]

The test dataset can be downloaded here.

The execution command is python download_eval_dataset.py --save_path [eval dataset save path]

If you want to use more datasets, you can refer to the download method of the arrow format dataset above and modify it yourself

calib.n_samples

Select n_samples pieces of data for calibration

calib.bs

Set the calibration data to calib.bs as the batch size, if it is -1, all the data is packaged into a batch of data

calib.seq_len

The sequence length of the calibration data

calib.preproc

The preprocessing methods of calibration data are currently implemented by llmc in a variety of preprocessing methods

wikitext2_gptq
ptb_gptq
c4_gptq
pileval_awq
pileval_smooth
pileval_omni
general
random_truncate_txt

With the exception of general, the rest of the preprocessing can be found here

general is implemented in the general_preproc function in the base_dataset

calib.seed

The random seed in the data preprocessing follows the base.seed setting by default

eval¶

eval.eval_pos

Indicates the eval positions, and currently supports three positions that can be evaluated

pretrain
transformed
fake_quant

eval_pos need to give a list, the list can be empty, and an empty list means that no tests are being performed

eval.name

The name of the eval dataset is supported by the following types of test datasets:

wikitext2
c4
ptb

For details about how to download the test dataset, see calib.name calibration dataset

eval.download

Indicates whether the eval dataset needs to be downloaded online at runtime, see calib.download

eval.path

Refer to calib.path

eval.bs

Eval batch size

eval.seq_len

The sequence length of the eval data

eval.inference_per_block

If your model is too large and the gpu memory of a single card cannot cover the entire model during the eval, then you need to open the inference_per_block for inference, and at the same time, on the premise of not exploding the gpu memory, appropriately increase the bs to improve the inference speed.

Here’s a config example

bs: 10
inference_per_block: True

Eval multiple datasets at the same time

LLMC also supports the simultaneous evaluation of multiple datasets

Below is an example of evaluating a single wikitext2 dataset

eval:
    name: wikitext2
    path: wikitext2 path

Here’s an example of evaluating multiple datasets

eval:
    name: [wikitext2, c4, ptb]
    path: The common upper directory of these data sets

It should be noted that the names of multiple dataset evaluations need to be represented in the form of a list, and the following directory rules need to be followed

upper-level directory
- wikitext2
- c4
- ptb

If you use the LLMC download script directly, the shared upper-level directory is the --save_path specified dataset storage path

quant¶

quant.method

The names of the quantization algorithms used, and all the quantization algorithms supported by the LLMC, can be viewed here.

quant.weight

Quantization settings for weights

quant.weight.bit

The quantized number of bits of the weight

quant.weight.symmetric

Quantitative symmetry of weights

quant.weight.granularity

The quantification granularity of the weights supports the following granularities

per tensor
per channel
per group

quant.act

Activated quantization settings

quant.act.bit

Activated quantized bit digits

quant.act.symmetric

Quantified symmetry or not

quant.act.granularity

The quantization granularity of the activation supports the following granularities

per tensor
per token
per head

If quant.method is set to RTN, activating quantization can support static per tensor settings, and the following is a W8A8 configuration that activates static per tensor quantization

quant:
    method: RTN
    weight:
        bit: 8
        symmetric: True
        granularity: per_channel
    act:
        bit: 8
        symmetric: True
        granularity: per_tensor
        static: True

save¶

save.save_trans

Whether to save the adjusted model weights

The saved weight is the weight that is more suitable for quantization after adjustment, and it is still saved in the form of FP16, and when it is deployed in the inference engine, you need to enable NAIVE quantization to achieve quantitative inference

save.save_path

Save the path of the model, which needs to be a new directory path that does not exist, otherwise the llmc will terminate the operation with a corresponding error message