# W8A8 LLM 模型部署

LMDeploy 提供了使用 8-bit 整数(INT8)和浮点数(FP8)对神经网络模型进行量化和推理的功能。

可用于 INT8 和 FP8 推理的 NVIDIA GPU 分别为：

- INT8
  - V100(sm70): V100
  - Turing(sm75): 20 series, T4
  - Ampere(sm80,sm86): 30 series, A10, A16, A30, A100
  - Ada Lovelace(sm89): 40 series
  - Hopper(sm90): H100
- FP8
  - Ada Lovelace(sm89): 40 series
  - Hopper(sm90): H100

首先，执行如下命令安装lmdeploy：

```shell
pip install lmdeploy[all]
```

## 8-bit 权重量化

进行 8-bit 权重量化需要经历以下三步：

1. **权重平滑**：首先对语言模型的权重进行平滑处理，以便更好地进行量化。
2. **模块替换**：使用 `QRMSNorm` 和 `QLinear` 模块替换原模型 `DecoderLayer` 中的 `RMSNorm` 模块和 `nn.Linear` 模块。`lmdeploy/pytorch/models/q_modules.py` 文件中定义了这些量化模块。
3. **保存量化模型**：完成上述必要的替换后，我们即可保存新的量化模型。

lmdeploy 提供了命令行工具 `lmdeploy lite smooth_quant` 实现了以上三个步骤。并且其中命令行参数 `--quant-dtype` 可以用来控制是进行8-bit整数还是浮点数类型的量化。更多命令行工具使用方式，请执行 `lmdeploy lite smooth_quant --help` 查看。

以下示例演示了进行 int8 或 fp8 的量化命令。

- int8

  ```shell
  lmdeploy lite smooth_quant internlm/internlm2_5-7b-chat --work-dir ./internlm2_5-7b-chat-int8 --quant-dtype int8
  ```

- fp8

  ```shell
  lmdeploy lite smooth_quant internlm/internlm2_5-7b-chat --work-dir ./internlm2_5-7b-chat-fp8 --quant-dtype fp8
  ```

## 模型推理

量化后的模型，通过以下几行简单的代码，可以实现离线推理：

```python
from lmdeploy import pipeline, PytorchEngineConfig

engine_config = PytorchEngineConfig(tp=1)
pipe = pipeline("internlm2_5-7b-chat-int8", backend_config=engine_config)
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)
```

关于 pipeline 的详细介绍，请参考[这里](../llm/pipeline.md)

## 推理服务

LMDeploy `api_server` 支持把模型一键封装为服务，对外提供的 RESTful API 兼容 openai 的接口。以下为服务启动的示例：

```shell
lmdeploy serve api_server ./internlm2_5-7b-chat-int8 --backend pytorch
```

服务默认端口是23333。在 server 启动后，你可以在终端通过`api_client`与server进行对话：

```shell
lmdeploy serve api_client http://0.0.0.0:23333
```

还可以通过 Swagger UI `http://0.0.0.0:23333` 在线阅读和试用 `api_server` 的各接口，也可直接查阅[文档](../llm/api_server.md)，了解各接口的定义和使用方法。