SmoothQuant

LMDeploy provides functions for quantization and inference of large language models using 8-bit integers (INT8). For GPUs such as the NVIDIA H100, LMDeploy also supports 8-bit floating point (FP8).

The following NVIDIA GPUs support INT8 and FP8 inference, respectively (a quick capability check is sketched after the list):

  • INT8

    • Volta (sm70): V100

    • Turing (sm75): 20 series, T4

    • Ampere (sm80, sm86): 30 series, A10, A16, A30, A100

    • Ada Lovelace (sm89): 40 series

    • Hopper (sm90): H100

  • FP8

    • Ada Lovelace (sm89): 40 series

    • Hopper (sm90): H100
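
If you are unsure which compute capability your GPU reports, a quick check with PyTorch can help you choose between int8 and fp8. The sm thresholds below simply mirror the support table above; this is an illustrative sketch, not part of lmdeploy itself:

import torch

# Query the compute capability of the current CUDA device, e.g. (8, 0) for A100.
major, minor = torch.cuda.get_device_capability(0)
sm = major * 10 + minor

# Mirror the table above: INT8 needs sm70 or newer, FP8 needs sm89 or newer.
print(f"sm{sm}: INT8 supported: {sm >= 70}, FP8 supported: {sm >= 89}")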

First of all, run the following command to install lmdeploy:

pip install lmdeploy[all]

8-bit Weight Quantization

Performing 8-bit weight quantization involves three steps:

  1. Smooth Weights: Start by smoothing the weights of the large language model (LLM). This process makes the weights more amenable to quantization (a minimal sketch of the idea follows this list).

  2. Replace Modules: Locate the DecoderLayers and replace the RMSNorm and nn.Linear modules with QRMSNorm and QLinear modules respectively. These ‘Q’ modules are available in the lmdeploy/pytorch/models/q_modules.py file.

  3. Save the Quantized Model: Once you’ve made the necessary replacements, save the new quantized model.
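
To make step 1 concrete, below is a minimal, illustrative sketch of the smoothing idea, not lmdeploy’s actual implementation: a per-input-channel scale migrates activation outliers into the weights so that both become easier to quantize. The function name, tensor shapes, and the alpha default are assumptions for illustration only.

import torch

def smooth_linear_weight(weight, act_absmax, alpha=0.5, eps=1e-5):
    # weight:     (out_features, in_features) weight of an nn.Linear
    # act_absmax: (in_features,) per-channel max |activation| collected on calibration data
    # alpha:      migration strength between activations and weights; 0.5 is a common default
    w_absmax = weight.abs().amax(dim=0).clamp(min=eps)  # per input channel
    scale = act_absmax.clamp(min=eps) ** alpha / w_absmax ** (1 - alpha)
    # Activations are divided by `scale` (folded into the preceding RMSNorm), while the
    # weight absorbs it, so the layer output stays mathematically unchanged.
    return weight * scale.unsqueeze(0), scale

w = torch.randn(4096, 4096)
act_stats = torch.rand(4096) * 20  # pretend a few channels carry large outliers
smoothed_w, scale = smooth_linear_weight(w, act_stats)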

LMDeploy provides the lmdeploy lite smooth_quant command to accomplish all three tasks detailed above. Note that the --quant-dtype argument determines whether int8 or fp8 weight quantization is performed. For more information about the CLI usage, run lmdeploy lite smooth_quant --help

Here are two examples:

  • int8

    lmdeploy lite smooth_quant internlm/internlm2_5-7b-chat --work-dir ./internlm2_5-7b-chat-int8 --quant-dtype int8
    
  • fp8

    lmdeploy lite smooth_quant internlm/internlm2_5-7b-chat --work-dir ./internlm2_5-7b-chat-fp8 --quant-dtype fp8
    

Inference

With the following code, you can perform batched offline inference with the quantized model:

from lmdeploy import pipeline, PytorchEngineConfig

engine_config = PytorchEngineConfig(tp=1)
pipe = pipeline("internlm2_5-7b-chat-int8", backend_config=engine_config)
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)
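
If you want to control sampling, the pipeline also accepts a GenerationConfig. The parameter values below are arbitrary examples, not recommended settings:

from lmdeploy import pipeline, GenerationConfig, PytorchEngineConfig

pipe = pipeline("internlm2_5-7b-chat-int8",
                backend_config=PytorchEngineConfig(tp=1))
# Sampling parameters are illustrative; tune them for your use case.
gen_config = GenerationConfig(max_new_tokens=256, temperature=0.8, top_p=0.95)
response = pipe(["Hi, pls intro yourself", "Shanghai is"], gen_config=gen_config)
print(response)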

Service

LMDeploy’s api_server enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI’s interfaces. Below is an example of service startup:

lmdeploy serve api_server ./internlm2_5-7b-chat-int8 --backend pytorch

The default port of api_server is 23333. After the server is launched, you can communicate with the server from the terminal through api_client:

lmdeploy serve api_client http://0.0.0.0:23333

You can overview and try out the api_server APIs online through the Swagger UI at http://0.0.0.0:23333, or read the API specification here.
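
Because the RESTful APIs follow OpenAI’s interfaces, you can also query the server programmatically with the openai Python client. The sketch below assumes the server is running on the default port; the api_key is a placeholder and the model name is taken from whatever the server reports:

from openai import OpenAI

# Any non-empty string works as the key when the server does not enforce API keys.
client = OpenAI(api_key="none", base_url="http://0.0.0.0:23333/v1")

# Use the model name reported by the server (see GET /v1/models).
model_name = client.models.list().data[0].id
resp = client.chat.completions.create(
    model=model_name,
    messages=[{"role": "user", "content": "Hi, pls intro yourself"}],
    temperature=0.8,
)
print(resp.choices[0].message.content)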