AWQ + OmniQuant
OmniQuant uses Learnable Weight Clipping (LWC) and Learnable Equivalent Transformation (LET) to optimize quantized models, and it often outperforms non-learning-based algorithms. However, its training is unstable and sensitive to hyperparameters, so tuning them takes significant time. This raises training costs and can still lead to suboptimal results.
To address these issues, we have improved OmniQuant in LLMC. We use AWQ to generate clipping parameters and transformation parameters, which then initialize OmniQuant's LWC and LET, respectively. This high-quality initialization significantly reduces OmniQuant's training time while improving its accuracy.
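The idea behind an equivalent transformation can be illustrated with a minimal, framework-free sketch. It assumes a single linear layer without bias and a per-input-channel scale vector `s`. The function names and values are illustrative and are not part of LLMC. Dividing each activation channel by its scale and multiplying the matching weight row by the same scale leaves the layer output unchanged, which is why scales found by AWQ can serve as a starting point for LET.

```python
def linear(x, w):
    """Compute y = x @ w for a vector x (length n) and a matrix w (n x m)."""
    n_out = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(n_out)]


def apply_equivalent_transform(x, w, s):
    """Divide input channel i by s[i] and multiply weight row i by s[i]."""
    x_t = [xi / si for xi, si in zip(x, s)]
    w_t = [[wij * si for wij in row] for row, si in zip(w, s)]
    return x_t, w_t


x = [8.0, 0.5, -1.0]                        # channel 0 carries an outlier
w = [[0.2, -0.1], [1.5, 0.3], [-0.7, 0.9]]
s = [4.0, 1.0, 0.5]                         # hypothetical per-channel scales

x_t, w_t = apply_equivalent_transform(x, w, s)
y_ref = linear(x, w)      # output of the original layer
y_new = linear(x_t, w_t)  # output after the transformation

# The two outputs agree up to floating-point error.
max_diff = max(abs(a - b) for a, b in zip(y_ref, y_new))
```

In this example the outlier in channel 0 shrinks from 8.0 to 2.0 while the product stays the same, so the quantization difficulty is shifted between activations and weights without changing what the layer computes.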
1.1 Weight-only Quantization
We provide configuration files that combine AWQ and OmniQuant, using the w4a16g128 setting as an example.
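In this naming scheme, `w4a16g128` reads as 4-bit weights, 16-bit activations, and a weight group size of 128. The sketch below shows what group-wise fake quantization (quantize, then dequantize) does to one weight row. It assumes asymmetric min-max quantization and uses a tiny group size for readability. The function name is illustrative and is not an LLMC API.

```python
def fake_quant_groups(weights, bits=4, group_size=128):
    """Quantize then dequantize `weights`, with one scale and zero point per group."""
    levels = 2 ** bits - 1
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels
        if scale == 0:
            scale = 1.0  # constant group: any scale reproduces it
        zero = round(-lo / scale)
        for w in group:
            q = min(max(round(w / scale) + zero, 0), levels)
            out.append((q - zero) * scale)
    return out


row = [-1.0, -0.2, 0.3, 1.0, 0.01, 0.02, 0.03, 0.04]

per_group = fake_quant_groups(row, bits=4, group_size=4)  # two groups of 4
per_row = fake_quant_groups(row, bits=4, group_size=8)    # one group for the whole row

# Absolute error on the four small-magnitude weights at the end of the row.
err_group = sum(abs(a - b) for a, b in zip(per_group[4:], row[4:]))
err_row = sum(abs(a - b) for a, b in zip(per_row[4:], row[4:]))
```

In this example the small weights in the second half of the row are tracked more closely when they get their own group, because their scale is no longer dictated by the large values in the first half.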
1.1.1 Run AWQ
Step one: run the AWQ configuration file. In this step, set the save_trans parameter to True so that the transformed model is saved.
# configs/quantization/combination/awq_comb_omni/w4a16g128/step_1_awq.yml
save:
    # Save the AWQ-transformed model for OmniQuant.
    save_trans: True
    save_fake: False
    save_path: /path/to/save_awq_trans/
Run the script:
# scripts/run_llmc.sh
llmc=llmc_path
export PYTHONPATH=$llmc:$PYTHONPATH
task_name=step_1_awq
config=${llmc}/configs/quantization/combination/awq_comb_omni/w4a16g128/step_1_awq.yml
1.1.2 Run OmniQuant
Step two: load the AWQ-transformed model and run the OmniQuant configuration file. In this step, set the search_clip_init parameter to True so that LWC is initialized from the clipping parameters found by AWQ's grid search.
# configs/quantization/combination/awq_comb_omni/w4a16g128/step_2_omniq.yml
model:
    type: model_type
    # Load the AWQ-transformed model.
    path: /path/to/save_awq_trans/transformed_model
    torch_dtype: auto
quant:
    special:
        search_clip_init: True
Run the script:
# scripts/run_llmc.sh
llmc=llmc_path
export PYTHONPATH=$llmc:$PYTHONPATH
task_name=step_2_omniq
config=${llmc}/configs/quantization/combination/awq_comb_omni/w4a16g128/step_2_omniq.yml
By running these two steps, LLMC can achieve better weight-only quantization results than those reported in the original OmniQuant paper. More importantly, LLMC needs only 5 epochs to do so, far fewer than the 20 or 40 epochs used in the original paper, which significantly reduces training time.
Note that in weight-only quantization, AWQ's clipping parameters and transformation parameters do not need to be stored for OmniQuant; only the transformed model needs to be saved. LET mainly addresses outliers in activation quantization, so OmniQuant does not need LET in the weight-only setting. Initializing LWC from AWQ's clipping parameters is handled automatically by OmniQuant in LLMC.
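To make the initialization idea concrete, here is a minimal sketch of how a searched clipping ratio could seed learnable clipping factors. It assumes the LWC formulation in which the clipped range is a sigmoid-bounded fraction of the weight minimum and maximum, and it assumes AWQ's result can be expressed as a single ratio in (0, 1]. How LLMC performs this step internally is not shown in this document, so the names `logit_from_ratio`, `lwc_range`, and `awq_clip_ratio` are hypothetical.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def logit_from_ratio(ratio, eps=1e-4):
    """Inverse sigmoid, clamped so ratios of exactly 0 or 1 stay finite."""
    r = min(max(ratio, eps), 1.0 - eps)
    return math.log(r / (1.0 - r))


def lwc_range(w_min, w_max, gamma_logit, beta_logit):
    """Clipped (min, max) with factors kept inside (0, 1) by a sigmoid."""
    return sigmoid(beta_logit) * w_min, sigmoid(gamma_logit) * w_max


awq_clip_ratio = 0.85  # hypothetical value found by a grid search

# Start the learnable parameters where the search ended, instead of at a fixed default.
gamma0 = beta0 = logit_from_ratio(awq_clip_ratio)
clipped_min, clipped_max = lwc_range(-2.0, 3.0, gamma0, beta0)
```

Training would then update the two logits by gradient descent. Starting near a good clipping range is one plausible reason fewer epochs are needed.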
1.2 Weight-Activation Quantization
We provide configuration files that combine AWQ and OmniQuant, using the w8a8 setting as an example.
1.2.1 Run AWQ
Step one: run the AWQ configuration file. In this step, set the save_clip and save_scale parameters to True to save the clipping parameters and transformation parameters. Also set calib_algo to learnable for the weights, since only learnable supports saving and loading the clipping parameters.
# configs/quantization/combination/awq_comb_omni/w8a8/step_1_awq.yml
quant:
    weight:
        bit: 8
        symmetric: False
        granularity: per_channel
        group_size: -1
        calib_algo: learnable
    act:
        bit: 8
        symmetric: False
        granularity: per_token
        calib_algo: minmax
save:
    save_scale: True
    scale_path: /path/to/scale/awq_w8a8.pth
    save_clip: True
    clip_path: /path/to/clip/awq_w8a8.pth
Run the script:
# scripts/run_llmc.sh
llmc=llmc_path
export PYTHONPATH=$llmc:$PYTHONPATH
task_name=step_1_awq
config=${llmc}/configs/quantization/combination/awq_comb_omni/w8a8/step_1_awq.yml
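For readers unfamiliar with the granularity settings in this configuration, the following sketch illustrates per-token activation quantization with asymmetric min-max calibration at 8 bits. Each row of the activation matrix (one token) gets its own scale and zero point. This is a simplified illustration with hypothetical function names, not LLMC's implementation. Per-channel weight quantization follows the same pattern, with one scale per weight channel instead of per token.

```python
def fake_quant_row(row, bits=8):
    """Asymmetric min-max quantize/dequantize one row with its own scale."""
    levels = 2 ** bits - 1
    lo, hi = min(row), max(row)
    scale = (hi - lo) / levels
    if scale == 0:
        return list(row)  # constant row: nothing to quantize
    zero = round(-lo / scale)
    quantized = [min(max(round(v / scale) + zero, 0), levels) for v in row]
    return [(q - zero) * scale for q in quantized]


def fake_quant_per_row(matrix, bits=8):
    """Apply fake_quant_row independently to every row (token or channel)."""
    return [fake_quant_row(row, bits) for row in matrix]


# Two tokens whose value ranges differ by more than two orders of magnitude.
acts = [[0.1, -0.2, 0.05], [40.0, -35.0, 12.0]]
acts_q = fake_quant_per_row(acts, bits=8)

# Largest absolute error per token.
errors = [max(abs(a - b) for a, b in zip(q_row, row))
          for q_row, row in zip(acts_q, acts)]
```

Because each token is calibrated on its own range, the small-magnitude token is not forced onto the coarse grid needed by the large-magnitude one.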
1.2.2 Run OmniQuant
Step two: run the OmniQuant configuration file. It loads the clipping parameters and transformation parameters generated by AWQ and uses them to initialize training of OmniQuant's LWC and LET.
# configs/quantization/combination/awq_comb_omni/w8a8/step_2_omniq.yml
quant:
    special:
        # Use AWQ's searched clip factors to initialize OmniQuant's clip factors,
        # then refine them through learning (LWC).
        search_clip_init: True
        load_clip: True
        clip_path: /path/to/clip/awq_w8a8.pth
        # Use AWQ's searched scale factors to initialize OmniQuant's scale factors,
        # then refine them through learning (LET).
        search_scale_init: True
        scale_path: /path/to/scale/awq_w8a8.pth
In this step, set both search_scale_init and search_clip_init to True to use the clipping parameters and transformation parameters generated by AWQ to initialize LWC and LET.
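Because step 2 must read exactly the files that step 1 wrote, a small consistency check can catch mismatched paths or missing flags before a long run. The sketch below is a hypothetical helper, not part of LLMC: `check_step2_against_step1` is an invented name, and the configurations are hand-written dictionaries that mirror the YAML keys shown in this section. Real files could be loaded with a YAML parser such as PyYAML's `yaml.safe_load`.

```python
def check_step2_against_step1(step1, step2):
    """Return a list of problems; an empty list means the configs agree."""
    problems = []
    save = step1.get("save", {})
    special = step2.get("quant", {}).get("special", {})

    # Step 1 must actually write the files.
    for flag in ("save_clip", "save_scale"):
        if save.get(flag) is not True:
            problems.append(f"step 1: {flag} should be True")

    # Step 2 must be told to use them.
    for flag in ("search_clip_init", "load_clip", "search_scale_init"):
        if special.get(flag) is not True:
            problems.append(f"step 2: {flag} should be True")

    # Each file must be loaded from the path it was saved to.
    for key in ("clip_path", "scale_path"):
        if save.get(key) != special.get(key):
            problems.append(
                f"{key}: step 1 saves {save.get(key)!r}, "
                f"step 2 loads {special.get(key)!r}"
            )
    return problems


step1 = {"save": {"save_scale": True, "scale_path": "/path/to/scale/awq_w8a8.pth",
                  "save_clip": True, "clip_path": "/path/to/clip/awq_w8a8.pth"}}

step2_ok = {"quant": {"special": {
    "search_clip_init": True, "load_clip": True,
    "clip_path": "/path/to/clip/awq_w8a8.pth",
    "search_scale_init": True,
    "scale_path": "/path/to/scale/awq_w8a8.pth"}}}

# Same flags, but the clip and scale paths are swapped by mistake.
step2_swapped = {"quant": {"special": dict(
    step2_ok["quant"]["special"],
    clip_path="/path/to/scale/awq_w8a8.pth",
    scale_path="/path/to/clip/awq_w8a8.pth")}}

ok_problems = check_step2_against_step1(step1, step2_ok)
swapped_problems = check_step2_against_step1(step1, step2_swapped)
```

A swapped pair of paths is easy to miss by eye because both files share the same file name, which is the kind of mistake this check is meant to surface.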
Run the script:
# scripts/run_llmc.sh
llmc=llmc_path
export PYTHONPATH=$llmc:$PYTHONPATH
task_name=step_2_omniq
config=${llmc}/configs/quantization/combination/awq_comb_omni/w8a8/step_2_omniq.yml
By running these two steps, LLMC can achieve better results in weight-activation quantization than those reported in the original paper, and it only requires 5 epochs.