
Profile Triton Inference Server

Triton Inference Server (TIS) is another serving method supported by LMDeploy besides api_server. Its performance testing methods and metrics are similar to those of api_server.

The profiling script is profile_serving.py. Before running it, please install the lmdeploy precompiled package, and download both the profiling script and the test dataset:

pip install 'lmdeploy[serve]'
git clone --depth=1 https://github.com/InternLM/lmdeploy
cd lmdeploy/benchmark
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

Metrics

LMDeploy records performance metrics such as first token latency, token throughput (tokens/s), and request throughput (RPM).

first_token_latency is only reported in the case of streaming inference.

The formula for calculating token throughput is:

$$ \text{TokenThroughput} = \frac{\text{number of generated tokens}}{\text{TotalTime}} $$

And the formula for calculating request throughput is:

$$ \text{RPM (requests per minute)} = \frac{\text{number of prompts}}{\text{TotalTime}} \times 60 $$

Total time includes prefill time.
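
As a quick illustration of the two formulas, suppose a run completes 1000 prompts in 300 seconds of wall-clock time (prefill included) and generates 240,000 tokens in total. The numbers below are made up purely for the arithmetic:

# Hypothetical numbers, only to illustrate the formulas above.
num_prompts = 1000          # prompts sent to the server
generated_tokens = 240_000  # total generated tokens across all prompts
total_time_s = 300.0        # wall-clock time of the whole run, prefill included

token_throughput = generated_tokens / total_time_s   # 800.0 tokens/s
rpm = num_prompts / total_time_s * 60                 # 200.0 requests per minute

print(f"token throughput: {token_throughput:.1f} tokens/s")
print(f"request throughput: {rpm:.1f} req/min")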

Profile

In this section, we take internlm/internlm-7b as an example to show the benchmark procedure.

Launch Triton Inference Server

Before launching the server, the LLM model must first be converted to the turbomind format:

lmdeploy convert internlm internlm/internlm-7b --dst-path ./internlm-7b --trust-remote-code

Then, the triton inference server can be launched by:

bash ./internlm-7b/service_docker_up.sh
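
Before profiling, you can optionally confirm that the server is up. A minimal readiness check using the tritonclient package (pip install 'tritonclient[grpc]'), assuming the gRPC endpoint 0.0.0.0:33337 used in the profiling command below:

import tritonclient.grpc as grpcclient

# Assumed endpoint: the gRPC address passed to profile_serving.py in this guide.
client = grpcclient.InferenceServerClient(url="0.0.0.0:33337")
print("server ready:", client.is_server_ready())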

Profile

python3 profile_serving.py 0.0.0.0:33337 ./internlm-7b/triton_models/tokenizer ./ShareGPT_V3_unfiltered_cleaned_split.json

For the detailed argument specification of profile_serving.py, such as request concurrency, sampling parameters, and so on, please run the help command python3 profile_serving.py -h.
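
Conceptually, the script replays ShareGPT conversations against the server from several concurrent workers and records, per request, the time to the first streamed token and the time to the last one; aggregating these gives the metrics described above. The sketch below shows only that bookkeeping. It is not the actual implementation of profile_serving.py, and stream_generate is a hypothetical stand-in for the real Triton client call:

import time

def profile_one_request(stream_generate, prompt):
    # stream_generate is a hypothetical callable that yields generated tokens
    # one by one; the real script talks to the Triton server via its client API.
    start = time.perf_counter()
    first_token_latency = None
    num_tokens = 0
    for _ in stream_generate(prompt):
        if first_token_latency is None:
            # first_token_latency is only meaningful for streaming inference
            first_token_latency = time.perf_counter() - start
        num_tokens += 1
    elapsed = time.perf_counter() - start  # prefill + decode time for this request
    return first_token_latency, num_tokens, elapsed

# Aggregating over all requests and the total wall-clock time of the run gives:
#   token throughput = sum(num_tokens) / TotalTime
#   RPM              = number of prompts / TotalTime * 60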
