Configs’ brief description
All configurations can be found here
Here’s a brief config example
base:
    seed: &seed 42 # Random seed
model:
    type: Llama # Model type
    path: model path # Model path
    tokenizer_mode: fast # Tokenizer type
    torch_dtype: auto # Model dtype
calib:
    name: pileval # Calibration dataset name
    download: False # Whether to download the calibration dataset online
    path: calib data path # Calibration dataset path
    n_samples: 512 # Number of calibration samples
    bs: 1 # Batch size for the calibration data
    seq_len: 512 # Sequence length of the calibration data
    preproc: pileval_smooth # Preprocessing method for the calibration data
    seed: *seed # Random seed for calibration data preprocessing
eval:
    eval_pos: [pretrain, transformed, fake_quant] # Evaluation positions
    name: wikitext2 # Evaluation dataset name
    download: False # Whether to download the evaluation dataset online
    path: eval data path # Evaluation dataset path
    bs: 1 # Batch size for evaluation
    seq_len: 2048 # Sequence length of the evaluation data
quant:
    method: SmoothQuant # Compression method
    weight:
        bit: 8 # Number of bits for weight quantization
        symmetric: True # Whether weight quantization is symmetric
        granularity: per_channel # Granularity of weight quantization
    act:
        bit: 8 # Number of bits for activation quantization
        symmetric: True # Whether activation quantization is symmetric
        granularity: per_token # Granularity of activation quantization
save:
    save_trans: False # Whether to save the adjusted model
    save_path: ./save # Save path
Configs’ detailed description
base
base.seed
The random seed, used to seed all random number generators throughout the framework
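As a rough illustration of what seeding "all random number generators" usually involves in a PyTorch-based tool, here is a minimal sketch. The helper name seed_everything is hypothetical and not part of llmc; numpy and torch are seeded only if they are installed.

```python
import os
import random


def seed_everything(seed: int) -> None:
    """Seed every RNG this process is likely to use (hypothetical helper, not llmc's API)."""
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)  # only affects subprocesses started later
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass


seed_everything(42)
first = [random.random() for _ in range(3)]
seed_everything(42)
second = [random.random() for _ in range(3)]
print(first == second)  # True: same seed, same sequence
```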
model
model.type
The model type. Supported types include Llama, Qwen2, Llava, Gemma2, and others; the full list of models supported by llmc can be found here.
model.path
Currently, LLMC only supports models in Hugging Face format. You can use the following code to check whether your model loads correctly:
from transformers import AutoModelForCausalLM, AutoConfig

model_path = "/path/to/your/model"  # replace with your model path
model_config = AutoConfig.from_pretrained(
    model_path, trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    config=model_config,
    trust_remote_code=True,
    torch_dtype="auto",
    low_cpu_mem_usage=True,
)
print(model)
If the code above fails to load your model, possible causes are:
Your model is not in Hugging Face format.
Your version of transformers is too old; upgrade it by running pip install transformers --upgrade.
Before running llmc, make sure the code above loads your model successfully; otherwise llmc will not be able to load it either.
model.tokenizer_mode
Selects whether to use a slow or fast tokenizer
model.torch_dtype
The data type of the model weights. Supported values:
auto
torch.float16
torch.bfloat16
torch.float32
auto keeps the data type stored in the original weight files.
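A minimal sketch of how the setting could be resolved to a concrete dtype name, assuming "auto" reads the dtype recorded in the checkpoint's config.json (many checkpoints store it as torch_dtype; recent transformers releases may write dtype instead). resolve_dtype is a hypothetical helper; when no dtype is recorded, transformers roughly falls back to inspecting the weights, whereas this sketch just raises.

```python
import json
import os
import tempfile

# Explicit values listed above for model.torch_dtype.
EXPLICIT = {"torch.float16", "torch.bfloat16", "torch.float32"}


def resolve_dtype(setting: str, model_dir: str) -> str:
    """Return the dtype name the weights would be loaded in (hypothetical helper)."""
    if setting in EXPLICIT:
        return setting.split(".", 1)[1]  # "torch.float16" -> "float16"
    if setting != "auto":
        raise ValueError(f"unsupported torch_dtype: {setting!r}")
    with open(os.path.join(model_dir, "config.json")) as f:
        cfg = json.load(f)
    stored = cfg.get("torch_dtype") or cfg.get("dtype")
    if stored is None:
        # transformers would inspect the weights instead; this sketch just stops
        raise ValueError("config.json records no dtype")
    return stored


with tempfile.TemporaryDirectory() as model_dir:  # stand-in checkpoint directory
    with open(os.path.join(model_dir, "config.json"), "w") as f:
        json.dump({"torch_dtype": "bfloat16"}, f)
    print(resolve_dtype("auto", model_dir))           # bfloat16
    print(resolve_dtype("torch.float16", model_dir))  # float16
```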
calib
calib.name
The name of the calibration dataset. The following calibration datasets are currently supported:
pileval
wikitext2
c4
ptb
custom
custom means a user-defined calibration dataset; see the Custom Calibration Dataset section of the advanced usage document for details.
calib.download
Whether the calibration dataset is downloaded online at runtime.
If set to True, you do not need to set calib.path; llmc downloads the dataset automatically.
If set to False, you must set calib.path; llmc reads the dataset from that path, so it does not need Internet access.
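The download/path rule above can be written as a small validation sketch. check_data_cfg is a hypothetical helper, not llmc's API, and the paths are placeholders; the same rule applies to the eval section.

```python
from typing import Optional


def check_data_cfg(section: str, cfg: dict) -> Optional[str]:
    """Check the download/path pair of a calib or eval section (hypothetical helper).

    Returns the local path to read from, or None when the dataset is
    downloaded online at runtime.
    """
    if cfg.get("download", False):
        return None  # no local path needed; the dataset is fetched online
    path = cfg.get("path")
    if not path:
        raise ValueError(f"{section}.path must be set when {section}.download is False")
    return path


print(check_data_cfg("calib", {"download": True}))                            # None
print(check_data_cfg("calib", {"download": False, "path": "/data/pileval"}))  # /data/pileval
```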
calib.path
If calib.download is set to False, you must set calib.path to the location where the calibration dataset is stored.
The data at this path must be a dataset in Arrow format.
To download a dataset from Hugging Face and save it in Arrow format, you can use the following code:
from datasets import load_dataset

calib_dataset = load_dataset(...)  # dataset name and arguments go here
calib_dataset.save_to_disk(...)    # directory to write the Arrow files to
A dataset saved this way can be loaded with:
from datasets import load_from_disk
data = load_from_disk(...)
LLMC provides download scripts for the datasets above:
The calibration datasets can be downloaded with the script here.
Run it with python download_calib_dataset.py --save_path [calib dataset save path]
The evaluation datasets can be downloaded with the script here.
Run it with python download_eval_dataset.py --save_path [eval dataset save path]
To use other datasets, adapt the Arrow-format download code above.
calib.n_samples
The number of samples used for calibration
calib.bs
The batch size for the calibration data. If set to -1, all calibration samples are packed into a single batch.
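A minimal sketch of the calib.bs rule, including the -1 case. make_batches is a hypothetical helper; how llmc treats a final batch smaller than bs is not specified here, and this sketch simply keeps it.

```python
def make_batches(samples: list, bs: int) -> list:
    """Group calibration samples into batches of size bs (hypothetical helper).

    bs == -1 packs every sample into a single batch.
    """
    if bs == -1:
        return [samples]
    if bs <= 0:
        raise ValueError("bs must be a positive integer or -1")
    return [samples[i:i + bs] for i in range(0, len(samples), bs)]


samples = list(range(5))          # stand-ins for 5 tokenized samples
print(make_batches(samples, 2))   # [[0, 1], [2, 3], [4]]
print(make_batches(samples, -1))  # [[0, 1, 2, 3, 4]]
```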
calib.seq_len
The sequence length of the calibration data
calib.preproc
The preprocessing method for the calibration data. llmc currently implements the following preprocessing methods:
wikitext2_gptq
ptb_gptq
c4_gptq
pileval_awq
pileval_smooth
pileval_omni
general
random_truncate_txt
Except for general, the implementations of these preprocessing methods can be found here.
general is implemented in the general_preproc function in base_dataset.
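To make n_samples, seq_len, and the seed concrete, here is a sketch of random-window sampling in the style of the original GPTQ reference code: draw a random start position in one long token stream and take seq_len tokens, n_samples times. It works on a plain list of token ids, sample_windows is a hypothetical helper, and llmc's own preprocessing functions may differ in detail.

```python
import random


def sample_windows(token_ids: list, n_samples: int, seq_len: int, seed: int) -> list:
    """Draw n_samples random windows of seq_len tokens (hypothetical helper).

    GPTQ-style random-window sampling; an illustration, not llmc's code.
    """
    if len(token_ids) <= seq_len:
        raise ValueError("token stream is shorter than seq_len")
    rng = random.Random(seed)  # local RNG, seeded like calib.seed
    windows = []
    for _ in range(n_samples):
        start = rng.randint(0, len(token_ids) - seq_len - 1)  # inclusive bounds
        windows.append(token_ids[start:start + seq_len])
    return windows


tokens = list(range(1000))  # stand-in for a tokenized corpus
a = sample_windows(tokens, n_samples=4, seq_len=16, seed=42)
b = sample_windows(tokens, n_samples=4, seq_len=16, seed=42)
print(len(a), len(a[0]), a == b)  # 4 16 True
```

Using a local random.Random instance keeps the sampling reproducible from the seed alone, without depending on what else has consumed the global RNG.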
calib.seed
The random seed used in data preprocessing. By default it follows base.seed; in the example above, calib.seed reuses base.seed through the YAML anchor &seed and alias *seed.
eval
eval.eval_pos
The positions at which the model is evaluated. Three positions are currently supported:
pretrain
transformed
fake_quant
eval_pos must be a list. The list may be empty, in which case no evaluation is performed.
eval.name
The name of the evaluation dataset. The following evaluation datasets are currently supported:
wikitext2
c4
ptb
For how to download the evaluation datasets, see the download instructions under calib.path.
eval.download
Whether the evaluation dataset is downloaded online at runtime; see calib.download.
eval.path
Refer to calib.path
eval.bs
Eval batch size
eval.seq_len
The sequence length of the eval data
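One common way eval.seq_len and eval.bs are applied is sketched below as an assumption (it is how GPTQ-style perplexity scripts split wikitext2, not a description of llmc's code): the tokenized evaluation text is cut into non-overlapping windows of seq_len tokens, leftover tokens are dropped, and the windows are grouped into batches of bs. eval_windows is a hypothetical helper.

```python
def eval_windows(token_ids: list, seq_len: int, bs: int) -> list:
    """Cut a token stream into non-overlapping seq_len windows, then batch them.

    Trailing tokens that do not fill a whole window are dropped
    (an assumption of this sketch).
    """
    n = len(token_ids) // seq_len
    windows = [token_ids[i * seq_len:(i + 1) * seq_len] for i in range(n)]
    return [windows[i:i + bs] for i in range(0, n, bs)]


batches = eval_windows(list(range(10_000)), seq_len=2048, bs=2)
print([len(b) for b in batches])  # [2, 2]
```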
eval.inference_per_block
If your model is too large to fit on a single GPU during evaluation, enable inference_per_block to run inference block by block. As long as GPU memory allows, you can also increase bs to speed up inference.
Here’s a config example:
eval:
    bs: 10
    inference_per_block: True
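The loop order behind inference_per_block can be sketched as follows, under the assumption that it means running every batch through one block before moving on to the next. The blocks are plain Python functions standing in for transformer layers, so the example only shows that the block-outer order gives the same result as running the whole model per sample; it does no GPU memory management, and the function names are hypothetical.

```python
from typing import Callable, List

Block = Callable[[float], float]


def run_per_block(blocks: List[Block], batches: List[List[float]]) -> List[List[float]]:
    """Push every batch through block 0, then block 1, and so on."""
    hidden = [list(batch) for batch in batches]
    for block in blocks:  # outer loop over blocks: load one block, process all data
        hidden = [[block(x) for x in batch] for batch in hidden]
    return hidden


def run_whole_model(blocks: List[Block], batches: List[List[float]]) -> List[List[float]]:
    """Reference order: each sample passes through all blocks before the next sample."""
    out = []
    for batch in batches:
        res = []
        for x in batch:
            for block in blocks:
                x = block(x)
            res.append(x)
        out.append(res)
    return out


blocks = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
batches = [[1.0, 2.0], [3.0, 4.0]]
print(run_per_block(blocks, batches) == run_whole_model(blocks, batches))  # True
```

In a real model only the current block would need to be on the GPU; a larger bs pushes more samples through that block per step, which is faster but uses more memory, matching the advice above.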
Evaluating multiple datasets at the same time
LLMC also supports evaluating multiple datasets at the same time.
Below is an example of evaluating a single dataset, wikitext2:
eval:
    name: wikitext2
    path: wikitext2 path
Here’s an example of evaluating multiple datasets:
eval:
    name: [wikitext2, c4, ptb]
    path: The common upper-level directory of these datasets
Note that when evaluating multiple datasets, name must be given as a list, and the datasets must be laid out as follows:
upper-level directory
├── wikitext2
├── c4
└── ptb
If you use the LLMC download script, the shared upper-level directory is the dataset storage path given by --save_path.
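The directory rule for multiple datasets can be expressed as a small path-resolution sketch. resolve_eval_paths is a hypothetical helper, not an llmc function; it only joins each dataset name onto the shared directory and checks that the sub-directory exists.

```python
import os
import tempfile
from typing import Dict, List, Union


def resolve_eval_paths(name: Union[str, List[str]], path: str) -> Dict[str, str]:
    """Map each eval dataset name to its directory (hypothetical helper).

    A single name uses path directly; a list of names expects one
    sub-directory per dataset under the shared upper-level directory.
    """
    if isinstance(name, str):
        return {name: path}
    resolved = {n: os.path.join(path, n) for n in name}
    missing = [n for n, p in resolved.items() if not os.path.isdir(p)]
    if missing:
        raise FileNotFoundError(f"missing dataset directories under {path}: {missing}")
    return resolved


with tempfile.TemporaryDirectory() as root:  # stand-in for the --save_path directory
    for n in ["wikitext2", "c4", "ptb"]:
        os.mkdir(os.path.join(root, n))
    paths = resolve_eval_paths(["wikitext2", "c4", "ptb"], root)
    print(sorted(paths))  # ['c4', 'ptb', 'wikitext2']
```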
quant
quant.method
The name of the quantization algorithm to use. All quantization algorithms supported by LLMC can be viewed here.
quant.weight
Quantization settings for weights
quant.weight.bit
The number of bits used to quantize the weights
quant.weight.symmetric
Whether weight quantization is symmetric
quant.weight.granularity
The granularity of weight quantization. The following granularities are supported:
per_tensor
per_channel
per_group
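The three weight settings (bit, symmetric, granularity) can be illustrated with a minimal fake-quantization sketch (quantize, then dequantize) on a small weight matrix whose rows are output channels. The rounding, clamping, and zero-point conventions are common choices and may not match llmc's quantizer exactly; fake_quant_weight and its group_size argument belong to this sketch only.

```python
from typing import List

Matrix = List[List[float]]


def _fake_quant(values: List[float], bit: int, symmetric: bool) -> List[float]:
    """Quantize a group of values with one shared scale, then dequantize."""
    if symmetric:
        qmax = 2 ** (bit - 1) - 1  # e.g. 7 for 4 bits
        qmin = -qmax - 1           # e.g. -8 for 4 bits
        amax = max(abs(v) for v in values)
        scale = amax / qmax if amax > 0 else 1.0
        zero = 0
    else:
        qmin, qmax = 0, 2 ** bit - 1
        lo, hi = min(values), max(values)
        scale = (hi - lo) / qmax if hi > lo else 1.0
        zero = round(-lo / scale)  # zero point: offset that maps lo to code 0
    out = []
    for v in values:
        q = min(max(round(v / scale) + zero, qmin), qmax)  # clamp to the integer range
        out.append((q - zero) * scale)                     # dequantize back to float
    return out


def fake_quant_weight(w: Matrix, bit: int, symmetric: bool,
                      granularity: str, group_size: int = 2) -> Matrix:
    """Fake-quantize a weight matrix whose rows are output channels."""
    if granularity == "per_tensor":   # one scale for the whole matrix
        cols = len(w[0])
        flat = _fake_quant([v for row in w for v in row], bit, symmetric)
        return [flat[i:i + cols] for i in range(0, len(flat), cols)]
    if granularity == "per_channel":  # one scale per output channel (row)
        return [_fake_quant(row, bit, symmetric) for row in w]
    if granularity == "per_group":    # one scale per group_size consecutive weights in a row
        return [
            [q for g in range(0, len(row), group_size)
             for q in _fake_quant(row[g:g + group_size], bit, symmetric)]
            for row in w
        ]
    raise ValueError(f"unknown granularity: {granularity}")


w = [[7.0, -7.0, 3.0, 1.0],
     [0.1, -0.2, 0.15, 0.05]]
for g in ["per_tensor", "per_channel", "per_group"]:
    print(g, fake_quant_weight(w, bit=4, symmetric=True, granularity=g)[1])
```

With a single per-tensor scale, the large values in the first row set the scale and the small second row rounds entirely to zero; per-channel scales preserve it. Finer granularity stores more scales and usually, though not always, lowers quantization error.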
quant.act
Quantization settings for activations
quant.act.bit
The number of bits used to quantize the activations
quant.act.symmetric
Whether activation quantization is symmetric
quant.act.granularity
The granularity of activation quantization. The following granularities are supported:
per_tensor
per_token
per_head
If quant.method is set to RTN, activation quantization also supports a static per-tensor setting. The following W8A8 configuration uses static per-tensor activation quantization:
quant:
    method: RTN
    weight:
        bit: 8
        symmetric: True
        granularity: per_channel
    act:
        bit: 8
        symmetric: True
        granularity: per_tensor
        static: True
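The difference between dynamic per-token scales and the static per-tensor setting in the RTN example can be sketched as follows. The function names are hypothetical, and the max-abs calibration rule is an assumption of this sketch; static calibration in practice may use other statistics.

```python
from typing import List


def sym_scale(values: List[float], bit: int) -> float:
    """Symmetric scale mapping the largest magnitude to the top integer code."""
    amax = max(abs(v) for v in values)
    return amax / (2 ** (bit - 1) - 1) if amax > 0 else 1.0


def quantize(values: List[float], scale: float, bit: int) -> List[int]:
    """Round to integer codes and clamp to the signed bit-width range."""
    qmax = 2 ** (bit - 1) - 1
    return [min(max(round(v / scale), -qmax - 1), qmax) for v in values]


def per_token_scales(acts: List[List[float]], bit: int) -> List[float]:
    """Dynamic per-token: one scale per token (row), computed at inference time."""
    return [sym_scale(token, bit) for token in acts]


def calibrate_static_scale(calib_acts: List[List[List[float]]], bit: int) -> float:
    """Static per-tensor: one scale fixed ahead of time from calibration batches."""
    return sym_scale([v for batch in calib_acts for token in batch for v in token], bit)


calib = [[[0.5, -1.0], [2.0, 0.1]],
         [[-4.0, 1.5], [0.3, 0.2]]]                  # 2 batches x 2 tokens x 2 features
static_scale = calibrate_static_scale(calib, bit=8)  # 4.0 / 127
new_acts = [[0.02, -0.01], [6.0, 1.0]]               # activations seen at inference time
print(per_token_scales(new_acts, bit=8))             # a small and a large scale
print(quantize(new_acts[1], static_scale, bit=8))    # [127, 32]: 6.0 is clipped
```

A static scale is computed once and reused, so no scale has to be computed at inference time, but values outside the calibrated range (6.0 here) are clipped; per-token scales adapt to each token at the cost of computing them on the fly.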
save
save.save_trans
Whether to save the adjusted model weights
The saved weights are the adjusted weights, which are better suited to quantization. They are still stored in FP16, so when deploying them in an inference engine you need to enable NAIVE quantization to run quantized inference.
save.save_path
The path where the model is saved. It must be a new directory that does not exist yet; otherwise llmc terminates with a corresponding error message.
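A minimal sketch of the save_path rule, checking up front that the directory does not exist yet. check_save_path is hypothetical; whether llmc creates the directory itself is not stated here, and creating it is this sketch's choice.

```python
import os
import tempfile


def check_save_path(save_path: str) -> None:
    """Refuse an existing save_path, then create it (hypothetical helper)."""
    if os.path.exists(save_path):
        raise FileExistsError(f"save_path already exists: {save_path}")
    os.makedirs(save_path)


with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "save")
    check_save_path(target)      # first call: the directory is new, so it is created
    try:
        check_save_path(target)  # second call: the directory now exists
    except FileExistsError as err:
        print("refused:", err)
```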