
Key-Value (KV) Cache Quantization

Since v0.4.0, LMDeploy has supported online key-value (kv) cache quantization with int4 and int8 numerical precision, utilizing an asymmetric quantization method that is applied on a per-head, per-token basis. The original kv offline quantization method has been removed.

Intuitively, quantizing the kv cache reduces memory usage. Compared to fp16, the memory for int4/int8 kv can be reduced to 1/4 and 1/2, respectively. This means that under the same memory budget, the system can serve significantly more concurrent requests after kv quantization, thereby ultimately enhancing throughput.
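
To make the per-head, per-token asymmetric scheme and the memory arithmetic concrete, below is a minimal NumPy sketch. The tensor shape and helper names are illustrative assumptions for this document, not LMDeploy's internal kernels.

import numpy as np

def quantize_kv_int8(kv: np.ndarray):
    """Asymmetric int8 quantization with one (scale, zero_point) pair per head and per token."""
    kv32 = kv.astype(np.float32)
    kv_min = kv32.min(axis=-1, keepdims=True)            # shape: (heads, tokens, 1)
    kv_max = kv32.max(axis=-1, keepdims=True)
    scale = np.maximum(kv_max - kv_min, 1e-6) / 255.0     # 2^8 - 1 quantization levels
    zero_point = np.round(-kv_min / scale)
    q = np.clip(np.round(kv32 / scale + zero_point), 0, 255).astype(np.uint8)
    return q, scale.astype(np.float16), zero_point.astype(np.float16)

def dequantize_kv_int8(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point.astype(np.float32)) * scale.astype(np.float32)

# Illustrative cache shape: 32 heads, 1024 cached tokens, head_dim 128
kv = np.random.randn(32, 1024, 128).astype(np.float16)
q, scale, zp = quantize_kv_int8(kv)
print(kv.nbytes // 1024, "KiB fp16 ->", q.nbytes // 1024, "KiB int8")  # int8 storage is half of fp16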

However, quantization typically introduces some loss of model accuracy. We have used OpenCompass to evaluate the accuracy of several models after applying int4/int8 kv quantization: int8 kv is almost lossless, while int4 kv shows a slight degradation. The detailed results are presented in the Evaluation section. You can refer to them and choose the precision that best fits your requirements.

LMDeploy inference with quantized kv supports the following NVIDIA GPU models:

  • Volta architecture (sm70): V100

  • Turing architecture (sm75): 20 series, T4

  • Ampere architecture (sm80, sm86): 30 series, A10, A16, A30, A100

  • Ada Lovelace architecture (sm89): 40 series

  • Hopper architecture (sm90): H100, H200

In summary, LMDeploy kv quantization has the following advantages:

  1. Data-free online quantization

  2. Support for all NVIDIA GPUs with the Volta architecture (sm70) and above

  3. Almost lossless accuracy with kv int8, and kv int4 accuracy within an acceptable range

  4. Efficient inference: with int8/int4 kv quantization applied to llama2-7b, RPS improves by approximately 30% and 40%, respectively, compared to fp16

In the next section, we take the internlm2-chat-7b model as an example to introduce the usage of kv quantization and inference with LMDeploy. But before that, please ensure that LMDeploy is installed:

pip install lmdeploy

Usage

Applying kv quantization and inference via LMDeploy is quite straightforward. Simply set the quant_policy parameter.

LMDeploy specifies that quant_policy=4 stands for 4-bit kv, whereas quant_policy=8 indicates 8-bit kv.

Offline inference

from lmdeploy import pipeline, TurbomindEngineConfig
engine_config = TurbomindEngineConfig(quant_policy=8)  # enable online int8 kv cache quantization
pipe = pipeline("internlm/internlm2-chat-7b", backend_config=engine_config)
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)
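
Switching to 4-bit kv cache only requires changing quant_policy; the rest of the pipeline call stays the same:

from lmdeploy import pipeline, TurbomindEngineConfig
engine_config = TurbomindEngineConfig(quant_policy=4)  # enable online int4 kv cache quantization
pipe = pipeline("internlm/internlm2-chat-7b", backend_config=engine_config)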

Serving

lmdeploy serve api_server internlm/internlm2-chat-7b --quant-policy 8
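
kv quantization is transparent to clients, so the server can be queried like any other api_server instance. Below is a minimal sketch using the OpenAI-compatible endpoint; the port (23333, the default) and the served model name are assumptions for this example.

from openai import OpenAI

# assumes the api_server above is listening on its default port
client = OpenAI(base_url="http://localhost:23333/v1", api_key="none")
resp = client.chat.completions.create(
    model="internlm/internlm2-chat-7b",
    messages=[{"role": "user", "content": "Hi, pls intro yourself"}],
)
print(resp.choices[0].message.content)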

Evaluation

We apply LMDeploy's kv quantization to several LLM models and use OpenCompass to evaluate inference accuracy. The results are shown in the table below:

|             |         |               | llama2-7b-chat |         |         | internlm2-chat-7b |         |         | qwen1.5-7b-chat |         |         |
| dataset     | version | metric        | kv fp16        | kv int8 | kv int4 | kv fp16           | kv int8 | kv int4 | kv fp16         | kv int8 | kv int4 |
| ----------- | ------- | ------------- | -------------- | ------- | ------- | ----------------- | ------- | ------- | --------------- | ------- | ------- |
| ceval       | -       | naive_average | 28.42          | 27.96   | 27.58   | 60.45             | 60.88   | 60.28   | 70.56           | 70.49   | 68.62   |
| mmlu        | -       | naive_average | 35.64          | 35.58   | 34.79   | 63.91             | 64      | 62.36   | 61.48           | 61.56   | 60.65   |
| triviaqa    | 2121ce  | score         | 56.09          | 56.13   | 53.71   | 58.73             | 58.7    | 58.18   | 44.62           | 44.77   | 44.04   |
| gsm8k       | 1d7fe4  | accuracy      | 28.2           | 28.05   | 27.37   | 70.13             | 69.75   | 66.87   | 54.97           | 56.41   | 54.74   |
| race-middle | 9a54b6  | accuracy      | 41.57          | 41.78   | 41.23   | 88.93             | 88.93   | 88.93   | 87.33           | 87.26   | 86.28   |
| race-high   | 9a54b6  | accuracy      | 39.65          | 39.77   | 40.77   | 85.33             | 85.31   | 84.62   | 82.53           | 82.59   | 82.02   |

For detailed evaluation methods, please refer to this guide. Remember to pass quant_policy to the inference engine in the config file.

Performance

| model             | kv type | test settings                            | RPS   | v.s. kv fp16 |
| ----------------- | ------- | ---------------------------------------- | ----- | ------------ |
| llama2-chat-7b    | fp16    | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 14.98 | 1.0          |
| -                 | int8    | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 19.01 | 1.27         |
| -                 | int4    | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 20.81 | 1.39         |
| llama2-chat-13b   | fp16    | tp1 / ratio 0.9 / bs 128 / prompts 10000 | 8.55  | 1.0          |
| -                 | int8    | tp1 / ratio 0.9 / bs 256 / prompts 10000 | 10.96 | 1.28         |
| -                 | int4    | tp1 / ratio 0.9 / bs 256 / prompts 10000 | 11.91 | 1.39         |
| internlm2-chat-7b | fp16    | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 24.13 | 1.0          |
| -                 | int8    | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 25.28 | 1.05         |
| -                 | int4    | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 25.80 | 1.07         |

The performance data was obtained with benchmark/profile_throughput.py.
