Profile Token Latency and Throughput

We profile the latency and throughput of generated tokens with a fixed batch size and fixed numbers of input and output tokens.

Before running the profiling script, please install the lmdeploy precompiled package and download the script:

pip install lmdeploy
git clone --depth=1


LMDeploy records test results such as first token latency, token throughput (tokens/s), percentiles of per-token latency (P50, P75, P95, P99), GPU memory usage, etc.
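To make the percentile figures concrete, here is a minimal sketch of how P50, P75, P95 and P99 could be derived from a list of per-token latencies. It uses the nearest-rank method, which is an assumption: the source does not state which percentile definition LMDeploy applies. The function name `percentile` and the latency values are hypothetical and exist only for illustration.

```python
import math


def percentile(latencies, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not latencies:
        raise ValueError("latencies must not be empty")
    ordered = sorted(latencies)
    # Rank is 1-based; clamp to 1 so that very small p still maps to a sample.
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical per-token latencies in seconds (made-up values, not measurements).
token_latencies = [0.021, 0.019, 0.025, 0.020, 0.030,
                   0.022, 0.018, 0.024, 0.050, 0.023]

summary = {f"P{p}": percentile(token_latencies, p) for p in (50, 75, 95, 99)}
print(summary)
```

With only ten samples, P95 and P99 both resolve to the slowest token, which is why tail percentiles are more informative on longer runs.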

first_token_latency is only reported in the case of streaming inference.
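The reason is that a non-streaming call returns the whole response at once, so the arrival time of the first token cannot be observed separately. The sketch below illustrates the idea with a generic token iterator. It is not LMDeploy's implementation: `measure_stream`, the `clock` parameter, and the fake timestamps are hypothetical names and values chosen to keep the example deterministic.

```python
import time


def measure_stream(token_stream, clock=time.perf_counter):
    """Consume a token iterator and record when each token arrives,
    relative to the moment the request was started."""
    start = clock()
    arrival_times = []
    for _ in token_stream:
        arrival_times.append(clock() - start)
    if not arrival_times:
        raise ValueError("the stream produced no tokens")
    return {
        "first_token_latency": arrival_times[0],
        "total_time": arrival_times[-1],
        "num_tokens": len(arrival_times),
    }


# A fake clock replaces real timing so the result is reproducible:
# request starts at 0.0 s, tokens arrive at 0.30 s, 0.32 s and 0.34 s.
fake_times = iter([0.0, 0.30, 0.32, 0.34])
result = measure_stream(iter(["Hello", ",", " world"]),
                        clock=lambda: next(fake_times))
print(result)
```

In real use the default `time.perf_counter` clock would be kept and `token_stream` would be whatever iterator the streaming API yields.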

The formula for calculating throughput is:

\[\text{TokenThroughput} = \frac{\text{Number of generated tokens}}{\text{TotalTime}}\]

Total time includes prefill time.
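As a worked example of the formula, the following sketch computes throughput from a token count and a total time that is split into prefill and decode phases. The split into two timings, the function name `token_throughput`, and the numbers are assumptions made for illustration; the source only states that the total time includes prefill.

```python
def token_throughput(num_generated_tokens, prefill_time_s, decode_time_s):
    """Generated tokens per second, with prefill counted in the total time."""
    total_time_s = prefill_time_s + decode_time_s
    if total_time_s <= 0:
        raise ValueError("total time must be positive")
    return num_generated_tokens / total_time_s


# Hypothetical run: 2048 generated tokens, 0.5 s of prefill, 7.5 s of decoding.
throughput = token_throughput(2048, prefill_time_s=0.5, decode_time_s=7.5)
print(throughput)  # 2048 / 8.0 = 256.0 tokens/s

# For comparison, ignoring prefill would overstate the figure.
decode_only = 2048 / 7.5
print(round(decode_only, 1))
```

Because prefill is part of the denominator, runs with long prompts and short outputs will report noticeably lower throughput than the decode rate alone would suggest.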

During the test, no GPU on the node should be running any other program; otherwise the GPU memory statistics will be inaccurate.


In this section, we take internlm/internlm-7b as an example to show how to profile the inference engines of LMDeploy.

Profile turbomind engine

cd lmdeploy/benchmark
python3 internlm/internlm-7b

Profile pytorch engine

cd lmdeploy/benchmark
python3 internlm/internlm-7b --backend pytorch

For the detailed argument specification, such as batch size, the numbers of input and output tokens, and so on, please run the help command python3 -h.