SmoothQuant

LMDeploy provides functions for quantization and inference of large language models using 8-bit integers (INT8). For GPUs such as the NVIDIA H100, LMDeploy also supports 8-bit floating point (FP8).

The following NVIDIA GPUs support INT8 and FP8 inference, respectively (a quick capability check is sketched after the list):

  • INT8

    • Volta (sm70): V100

    • Turing (sm75): 20 series, T4

    • Ampere (sm80, sm86): 30 series, A10, A16, A30, A100

    • Ada Lovelace (sm89): 40 series

    • Hopper (sm90): H100

  • FP8

    • Ada Lovelace (sm89): 40 series

    • Hopper (sm90): H100
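
If you are unsure which compute capability your GPU reports, a quick check with PyTorch can help you choose between int8 and fp8. The sm thresholds below simply mirror the support table above; this is an illustrative sketch, not part of lmdeploy itself:

import torch

# Query the compute capability of the current CUDA device, e.g. (8, 0) for A100.
major, minor = torch.cuda.get_device_capability(0)
sm = major * 10 + minor

# Mirror the table above: INT8 needs sm70 or newer, FP8 needs sm89 or newer.
print(f"sm{sm}: INT8 supported: {sm >= 70}, FP8 supported: {sm >= 89}")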

First of all, run the following command to install lmdeploy:

pip install lmdeploy[all]

8-bit Weight Quantization

Performing 8-bit weight quantization involves three steps:

  1. Smooth Weights: Start by smoothing the weights of the large language model (LLM). This process makes the weights more amenable to quantization (a minimal sketch of the idea follows this list).

  2. Replace Modules: Locate the DecoderLayers and replace the RMSNorm and nn.Linear modules with QRMSNorm and QLinear modules respectively. These ‘Q’ modules are available in the lmdeploy/pytorch/models/q_modules.py file.

  3. Save the Quantized Model: Once you’ve made the necessary replacements, save the new quantized model.
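
To make step 1 concrete, below is a minimal, illustrative sketch of the smoothing idea, not lmdeploy’s actual implementation: a per-input-channel scale migrates activation outliers into the weights so that both become easier to quantize. The function name, tensor shapes, and the alpha default are assumptions for illustration only.

import torch

def smooth_linear_weight(weight, act_absmax, alpha=0.5, eps=1e-5):
    # weight:     (out_features, in_features) weight of an nn.Linear
    # act_absmax: (in_features,) per-channel max |activation| collected on calibration data
    # alpha:      migration strength between activations and weights; 0.5 is a common default
    w_absmax = weight.abs().amax(dim=0).clamp(min=eps)  # per input channel
    scale = act_absmax.clamp(min=eps) ** alpha / w_absmax ** (1 - alpha)
    # Activations are divided by `scale` (folded into the preceding RMSNorm), while the
    # weight absorbs it, so the layer output stays mathematically unchanged.
    return weight * scale.unsqueeze(0), scale

w = torch.randn(4096, 4096)
act_stats = torch.rand(4096) * 20  # pretend a few channels carry large outliers
smoothed_w, scale = smooth_linear_weight(w, act_stats)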

LMDeploy provides the lmdeploy lite smooth_quant command to accomplish all three tasks detailed above. Note that the --quant-dtype argument determines whether int8 or fp8 weight quantization is performed. For more information about the CLI usage, run lmdeploy lite smooth_quant --help

Here are two examples:

  • int8

    lmdeploy lite smooth_quant internlm/internlm2_5-7b-chat --work-dir ./internlm2_5-7b-chat-int8 --quant-dtype int8
    
  • fp8

    lmdeploy lite smooth_quant internlm/internlm2_5-7b-chat --work-dir ./internlm2_5-7b-chat-fp8 --quant-dtype fp8
    

Inference

With the following code, you can perform batched offline inference with the quantized model:

from lmdeploy import pipeline, PytorchEngineConfig

engine_config = PytorchEngineConfig(tp=1)
pipe = pipeline("internlm2_5-7b-chat-int8", backend_config=engine_config)
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)
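
If you want to control sampling, the pipeline also accepts a GenerationConfig. The parameter values below are arbitrary examples, not recommended settings:

from lmdeploy import pipeline, GenerationConfig, PytorchEngineConfig

pipe = pipeline("internlm2_5-7b-chat-int8",
                backend_config=PytorchEngineConfig(tp=1))
# Sampling parameters are illustrative; tune them for your use case.
gen_config = GenerationConfig(max_new_tokens=256, temperature=0.8, top_p=0.95)
response = pipe(["Hi, pls intro yourself", "Shanghai is"], gen_config=gen_config)
print(response)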

Service

LMDeploy’s api_server enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI’s interfaces. Below is an example of service startup:

lmdeploy serve api_server ./internlm2_5-7b-chat-int8 --backend pytorch

The default port of api_server is 23333. After the server is launched, you can communicate with the server from the terminal through api_client:

lmdeploy serve api_client http://0.0.0.0:23333

You can overview and try out the api_server APIs online through the Swagger UI at http://0.0.0.0:23333, or read the API specification here.
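
Because the RESTful APIs follow OpenAI’s interfaces, you can also query the server programmatically with the openai Python client. The sketch below assumes the server is running on the default port; the api_key is a placeholder and the model name is taken from whatever the server reports:

from openai import OpenAI

# Any non-empty string works as the key when the server does not enforce API keys.
client = OpenAI(api_key="none", base_url="http://0.0.0.0:23333/v1")

# Use the model name reported by the server (see GET /v1/models).
model_name = client.models.list().data[0].id
resp = client.chat.completions.create(
    model=model_name,
    messages=[{"role": "user", "content": "Hi, pls intro yourself"}],
    temperature=0.8,
)
print(resp.choices[0].message.content)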