# SmoothQuant

LMDeploy provides functions for quantization and inference of large language models using 8-bit integers (INT8). For GPUs such as the NVIDIA H100, LMDeploy also supports 8-bit floating point (FP8).

The following NVIDIA GPUs support INT8 and FP8 inference respectively:

- INT8
  - Volta(sm70): V100
  - Turing(sm75): 20 series, T4
  - Ampere(sm80,sm86): 30 series, A10, A16, A30, A100
  - Ada Lovelace(sm89): 40 series
  - Hopper(sm90): H100
- FP8
  - Ada Lovelace(sm89): 40 series
  - Hopper(sm90): H100
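
If you are unsure which architecture your GPU belongs to, you can query its compute capability with PyTorch (already a dependency of LMDeploy's PyTorch engine):

```python
import torch

# Compute capability of GPU 0, e.g. (8, 0) -> sm80 (Ampere), (9, 0) -> sm90 (Hopper)
major, minor = torch.cuda.get_device_capability(0)
print(f"sm{major}{minor}")
```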

First of all, run the following command to install lmdeploy:

```shell
pip install lmdeploy[all]
```

## 8-bit Weight Quantization

Performing 8-bit weight quantization involves three steps:

1. **Smooth Weights**: Start by smoothing the weights of the large language model (LLM). This process makes the weights more amenable to quantization (see the sketch after this list).
2. **Replace Modules**: Locate the DecoderLayers and replace the RMSNorm and nn.Linear modules with QRMSNorm and QLinear modules respectively. These 'Q' modules are available in the `lmdeploy/pytorch/models/q_modules.py` file.
3. **Save the Quantized Model**: Once you've made the necessary replacements, save the new quantized model.
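
The smoothing step follows the SmoothQuant approach of migrating quantization difficulty from activations into the weights via a per-channel scale. Below is a minimal, illustrative sketch of this idea; the helper name, shapes, and calibration statistics are assumptions for illustration, not lmdeploy's internal API:

```python
import torch

def smooth_linear(act_absmax: torch.Tensor, weight: torch.Tensor, alpha: float = 0.5):
    """Illustrative SmoothQuant-style smoothing; not lmdeploy's internal API.

    act_absmax: per-input-channel activation abs-max from calibration, shape [in_features]
    weight:     nn.Linear weight, shape [out_features, in_features]
    """
    w_absmax = weight.abs().amax(dim=0)  # abs-max per input channel
    # s_j = max|X_j|^alpha / max|W_j|^(1 - alpha); dividing activations by s and
    # multiplying weights by s keeps Y = (X / s) @ (W * s).T mathematically unchanged.
    scale = (act_absmax.pow(alpha) / w_absmax.pow(1 - alpha)).clamp(min=1e-5)
    return scale, weight * scale.unsqueeze(0)
```

Here `alpha` (0.5 by default in the SmoothQuant paper) controls how much quantization difficulty is shifted from the activations onto the weights.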

LMDeploy provides the `lmdeploy lite smooth_quant` command to accomplish all three tasks detailed above. The `--quant-dtype` argument determines whether int8 or fp8 weight quantization is performed. For more information about the CLI usage, run `lmdeploy lite smooth_quant --help`.

Here are two examples:

- int8

  ```shell
  lmdeploy lite smooth_quant internlm/internlm2_5-7b-chat --work-dir ./internlm2_5-7b-chat-int8 --quant-dtype int8
  ```

- fp8

  ```shell
  lmdeploy lite smooth_quant internlm/internlm2_5-7b-chat --work-dir ./internlm2_5-7b-chat-fp8 --quant-dtype fp8
  ```

## Inference

With the following code, you can perform batched offline inference with the quantized model:

```python
from lmdeploy import pipeline, PytorchEngineConfig

engine_config = PytorchEngineConfig(tp=1)
pipe = pipeline("internlm2_5-7b-chat-int8", backend_config=engine_config)
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)
```
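
Sampling behavior can optionally be controlled through `GenerationConfig`; a minimal sketch, with illustrative parameter values:

```python
from lmdeploy import GenerationConfig, PytorchEngineConfig, pipeline

pipe = pipeline("internlm2_5-7b-chat-int8",
                backend_config=PytorchEngineConfig(tp=1))
# gen_config is optional; omit it to fall back to the default sampling settings.
gen_config = GenerationConfig(max_new_tokens=256, temperature=0.8, top_p=0.8)
response = pipe(["Hi, pls intro yourself", "Shanghai is"], gen_config=gen_config)
print(response)
```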

## Service

LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of starting the service:

```shell
lmdeploy serve api_server ./internlm2_5-7b-chat-int8 --backend pytorch
```

The default port of `api_server` is `23333`. After the server is launched, you can communicate with the server in a terminal through `api_client`:

```shell
lmdeploy serve api_client http://0.0.0.0:23333
```

You can explore and try out the `api_server` APIs online via the Swagger UI at `http://0.0.0.0:23333`, or read the API specification [here](../llm/api_server.md).
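
Because the APIs are OpenAI-compatible, you can also query the server programmatically. Below is a minimal sketch using the `openai` Python package (installed separately with `pip install openai`); the API key is an arbitrary placeholder, assuming the server runs without authentication:

```python
from openai import OpenAI

# Point the client at the local api_server; any key works when the server
# is started without authentication.
client = OpenAI(api_key="none", base_url="http://0.0.0.0:23333/v1")
model_name = client.models.list().data[0].id  # discover the served model name
response = client.chat.completions.create(
    model=model_name,
    messages=[{"role": "user", "content": "Hi, pls intro yourself"}],
)
print(response.choices[0].message.content)
```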
