SmoothQuant#
LMDeploy provides functions for quantization and inference of large language models using 8-bit integers (INT8). On GPUs such as the NVIDIA H100, LMDeploy also supports 8-bit floating point (FP8).
The following NVIDIA GPUs support INT8 and FP8 inference, respectively:
INT8
Volta(sm70): V100
Turing(sm75): 20 series, T4
Ampere(sm80,sm86): 30 series, A10, A16, A30, A100
Ada Lovelace(sm89): 40 series
Hopper(sm90): H100
FP8
Ada Lovelace(sm89): 40 series
Hopper(sm90): H100
First of all, run the following command to install lmdeploy:
pip install lmdeploy[all]
8-bit Weight Quantization#
Performing 8-bit weight quantization involves three steps:
Smooth Weights: Start by smoothing the weights of the Large Language Model (LLM). This process makes the weights more amenable to quantization (see the sketch after this list).
Replace Modules: Locate DecoderLayers and replace the RMSNorm and nn.Linear modules with QRMSNorm and QLinear modules, respectively. These ‘Q’ modules are available in the lmdeploy/pytorch/models/q_modules.py file.
Save the Quantized Model: Once you’ve made the necessary replacements, save the new quantized model.
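For intuition, here is a minimal, self-contained sketch of symmetric per-channel INT8 weight quantization in the style of a QLinear module. It is illustrative only, not LMDeploy's actual implementation; the class name SketchQLinear and the quantization scheme are assumptions made for this example.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SketchQLinear(nn.Module):
    # Illustrative stand-in for a QLinear-style module, NOT LMDeploy's implementation:
    # symmetric per-output-channel INT8 weight quantization.
    def __init__(self, linear: nn.Linear):
        super().__init__()
        w = linear.weight.data  # shape: [out_features, in_features]
        # One scale per output channel, mapping max |w| onto the INT8 range [-127, 127]
        self.scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
        self.qweight = torch.round(w / self.scale).to(torch.int8)
        self.bias = linear.bias

    def forward(self, x):
        # Dequantize on the fly; a real kernel would run an INT8 GEMM instead
        w = self.qweight.float() * self.scale
        return F.linear(x, w, self.bias)
A production INT8 path also quantizes activations; the smoothing in step 1 migrates activation outliers into the weights so that both are easier to represent in 8 bits.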
lmdeploy provides the lmdeploy lite smooth_quant command to accomplish all three tasks detailed above. Note that the argument --quant-dtype determines whether INT8 or FP8 weight quantization is performed. For more information about the usage of the CLI, run lmdeploy lite smooth_quant --help.
Here are two examples:
int8
lmdeploy lite smooth_quant internlm/internlm2_5-7b-chat --work-dir ./internlm2_5-7b-chat-int8 --quant-dtype int8
fp8
lmdeploy lite smooth_quant internlm/internlm2_5-7b-chat --work-dir ./internlm2_5-7b-chat-fp8 --quant-dtype fp8
Inference#
With the following code, you can perform batched offline inference with the quantized model:
from lmdeploy import pipeline, PytorchEngineConfig

# Load the INT8 model quantized above with the PyTorch engine (tensor parallelism = 1)
engine_config = PytorchEngineConfig(tp=1)
pipe = pipeline("./internlm2_5-7b-chat-int8", backend_config=engine_config)
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)
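Generation behavior can be tuned through GenerationConfig; the sampling values below (top_p, temperature, max_new_tokens) are illustrative, not recommendations:
from lmdeploy import pipeline, GenerationConfig, PytorchEngineConfig

pipe = pipeline("./internlm2_5-7b-chat-int8",
                backend_config=PytorchEngineConfig(tp=1))
# Illustrative sampling settings; adjust to your use case
gen_config = GenerationConfig(top_p=0.8, temperature=0.8, max_new_tokens=256)
response = pipe(["Hi, pls intro yourself", "Shanghai is"], gen_config=gen_config)
print(response)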
Service#
LMDeploy’s api_server enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI’s interfaces. Below is an example of service startup:
lmdeploy serve api_server ./internlm2_5-7b-chat-int8 --backend pytorch
The default port of api_server is 23333. After the server is launched, you can communicate with the server in the terminal through api_client:
lmdeploy serve api_client http://0.0.0.0:23333
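Since the RESTful APIs are OpenAI-compatible, you can also query the server directly over HTTP. A sketch with curl against the chat completions endpoint; the "model" field is assumed here and must match the name the server reports at /v1/models:
curl http://0.0.0.0:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "internlm2_5-7b-chat-int8",
        "messages": [{"role": "user", "content": "Hi, pls intro yourself"}]
      }'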
You can overview and try out the api_server APIs online through the Swagger UI at http://0.0.0.0:23333, or you can read the API specification here.
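Alternatively, because the interfaces are OpenAI-compatible, the official openai Python client works as well. A minimal sketch, assuming the openai package is installed; the api_key value is a placeholder, as the server does not require one by default:
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:23333/v1", api_key="none")
# Discover the name the server registered for the deployed model
model_name = client.models.list().data[0].id
resp = client.chat.completions.create(
    model=model_name,
    messages=[{"role": "user", "content": "Hi, pls intro yourself"}],
)
print(resp.choices[0].message.content)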