W8A8 LLM Model Deployment

LMDeploy provides functions for quantization and inference of large language models using 8-bit integers.

Before starting inference, ensure that lmdeploy and openai/triton are correctly installed. If they are not, install them with the following commands:

pip install lmdeploy
pip install "triton>=2.1.0"
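
To verify the installation, you can check that both packages import and report their versions. This one-liner assumes both packages expose a __version__ attribute, which recent releases do:

python -c "import lmdeploy, triton; print(lmdeploy.__version__, triton.__version__)"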

8-bit Weight Model Inference

To perform 8-bit weight model inference, you can directly download the pre-quantized 8-bit weight models from LMDeploy’s model zoo. For instance, the 8-bit InternLM-Chat-7B model can be downloaded directly:

git-lfs install
git clone https://huggingface.co/lmdeploy/internlm-chat-7b-w8 (coming soon)

Alternatively, you can manually convert the original 16-bit weights into 8-bit weights by following the “8-bit Weight Quantization” section below, saving them to the internlm-chat-7b-w8 directory with the following command:

lmdeploy lite smooth_quant internlm/internlm-chat-7b --work-dir ./internlm-chat-7b-w8

Afterwards, use the following command to interact with the model via the terminal:

lmdeploy chat ./internlm-chat-7b-w8 --backend pytorch
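
If you prefer to run inference from Python rather than the interactive terminal, the sketch below uses lmdeploy’s pipeline API with the PyTorch backend. It assumes a recent lmdeploy release in which pipeline and PytorchEngineConfig are importable from the top-level package; the model path and prompt are placeholders:

from lmdeploy import pipeline, PytorchEngineConfig

# Load the 8-bit weights with the PyTorch engine (the same backend as the chat command above).
pipe = pipeline('./internlm-chat-7b-w8', backend_config=PytorchEngineConfig())

# Run a single prompt and print the generated text.
responses = pipe(['Please introduce yourself.'])
print(responses[0].text)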

Launching Gradio Service

Coming soon…

Inference Speed

Coming soon…

8-bit Weight Quantization

Performing 8-bit weight quantization involves three steps:

  1. Smooth Weights: Start by smoothing the weights of the large language model (LLM). This process makes the weights more amenable to quantization.

  2. Replace Modules: Locate the DecoderLayers and replace the RMSNorm and nn.Linear modules with QRMSNorm and QLinear modules respectively. These ‘Q’ modules are available in the lmdeploy/pytorch/models/q_modules.py file (a conceptual sketch of this step follows the list).

  3. Save the Quantized Model: Once you’ve made the necessary replacements, save the new quantized model.
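
The following sketch illustrates the module-replacement step conceptually. It is not the actual lmdeploy implementation (see lmdeploy/lite/apis/smooth_quant.py for that); the helper name replace_with_q_modules is hypothetical, and the QLinear.from_float / QRMSNorm.from_float constructors are assumed from q_modules.py:

import torch.nn as nn
from lmdeploy.pytorch.models.q_modules import QLinear, QRMSNorm

def replace_with_q_modules(layer: nn.Module) -> None:
    # Illustrative only: recursively swap float modules inside a DecoderLayer
    # for their 8-bit counterparts.
    for name, child in layer.named_children():
        if isinstance(child, nn.Linear):
            setattr(layer, name, QLinear.from_float(child))
        elif type(child).__name__ == 'RMSNorm':  # the model-specific RMSNorm class
            setattr(layer, name, QRMSNorm.from_float(child))
        else:
            replace_with_q_modules(child)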

The script lmdeploy/lite/apis/smooth_quant.py accomplishes all three tasks detailed above. For example, you can obtain the quantized InternLM-Chat-7B weights by running the following command:

lmdeploy lite smooth_quant internlm/internlm-chat-7b --work-dir ./internlm-chat-7b-w8

After saving, you can instantiate your quantized model by calling the from_pretrained interface.
