Inference Pipeline¶

In this tutorial, we first present a series of examples that introduce the usage of lmdeploy.pipeline.

Then, we will describe the pipeline API in detail.

Usage¶

An example using default parameters:

from lmdeploy import pipeline

pipe = pipeline('internlm/internlm2-chat-7b')
response = pipe(['Hi, pls intro yourself', 'Shanghai is'])
print(response)

In this example, the pipeline by default allocates a predetermined percentage of GPU memory for storing k/v cache. The ratio is dictated by the parameter TurbomindEngineConfig.cache_max_entry_count.

The strategy for setting the k/v cache ratio has changed across LMDeploy releases. The history is as follows:

v0.2.0 <= lmdeploy <= v0.2.1

TurbomindEngineConfig.cache_max_entry_count defaults to 0.5, meaning that 50% of the total GPU memory is allocated for the k/v cache. Out Of Memory (OOM) errors may occur if a 7B model is deployed on a GPU with less than 40G of memory. If you encounter an OOM error, decrease the k/v cache ratio as follows:

from lmdeploy import pipeline, TurbomindEngineConfig

# decrease the ratio of the k/v cache occupation to 20%
backend_config = TurbomindEngineConfig(cache_max_entry_count=0.2)

pipe = pipeline('internlm/internlm2-chat-7b',
                backend_config=backend_config)
response = pipe(['Hi, pls intro yourself', 'Shanghai is'])
print(response)

lmdeploy > v0.2.1

The allocation strategy for the k/v cache was changed to reserve a proportion of the free GPU memory rather than of the total. The default value of TurbomindEngineConfig.cache_max_entry_count was changed to 0.8. If an OOM error occurs, reduce the ratio, as in the example above, to lower the memory usage of the k/v cache.
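To make the difference between the two strategies concrete, here is a small arithmetic sketch. The GPU size and weight footprint are hypothetical numbers chosen for illustration, the helper names are not part of LMDeploy, and the sketch assumes that "free memory" is measured after the model weights have been loaded.

```python
def kv_cache_from_total(total_gib, ratio=0.5):
    """v0.2.0 to v0.2.1: the ratio is applied to the total GPU memory."""
    return total_gib * ratio


def kv_cache_from_free(free_gib, ratio=0.8):
    """Later versions: the ratio is applied to the free GPU memory."""
    return free_gib * ratio


# Hypothetical setup: a 24 GiB GPU and about 14 GiB of fp16 weights (7B model).
total_gib = 24.0
weights_gib = 14.0

# Old strategy: cache size ignores how much memory the weights already use.
old_cache = kv_cache_from_total(total_gib)
old_fits = weights_gib + old_cache <= total_gib

# New strategy: cache is carved out of what is left (assumed: after weights).
free_gib = total_gib - weights_gib
new_cache = kv_cache_from_free(free_gib)
new_fits = weights_gib + new_cache <= total_gib

print(f'old: cache={old_cache:.1f} GiB, fits={old_fits}')
print(f'new: cache={new_cache:.1f} GiB, fits={new_fits}')
```

Under these assumed numbers the old strategy asks for more memory than the card has, which is the OOM situation described above, while the new strategy always stays within the free memory as long as the ratio is below 1.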

An example showing how to set the tensor parallelism degree (tp):

from lmdeploy import pipeline, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(tp=2)
pipe = pipeline('internlm/internlm2-chat-7b',
                backend_config=backend_config)
response = pipe(['Hi, pls intro yourself', 'Shanghai is'])
print(response)

An example for setting sampling parameters:

from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(tp=2)
gen_config = GenerationConfig(top_p=0.8,
                              top_k=40,
                              temperature=0.8,
                              max_new_tokens=1024)
pipe = pipeline('internlm/internlm2-chat-7b',
                backend_config=backend_config)
response = pipe(['Hi, pls intro yourself', 'Shanghai is'],
                gen_config=gen_config)
print(response)

An example for OpenAI format prompt input:

from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(tp=2)
gen_config = GenerationConfig(top_p=0.8,
                              top_k=40,
                              temperature=0.8,
                              max_new_tokens=1024)
pipe = pipeline('internlm/internlm2-chat-7b',
                backend_config=backend_config)
prompts = [[{
    'role': 'user',
    'content': 'Hi, pls intro yourself'
}], [{
    'role': 'user',
    'content': 'Shanghai is'
}]]
response = pipe(prompts,
                gen_config=gen_config)
print(response)

An example for streaming mode:

from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(tp=2)
gen_config = GenerationConfig(top_p=0.8,
                              top_k=40,
                              temperature=0.8,
                              max_new_tokens=1024)
pipe = pipeline('internlm/internlm2-chat-7b',
                backend_config=backend_config)
prompts = [[{
    'role': 'user',
    'content': 'Hi, pls intro yourself'
}], [{
    'role': 'user',
    'content': 'Shanghai is'
}]]
for item in pipe.stream_infer(prompts, gen_config=gen_config):
    print(item)

Below is an example for the PyTorch backend. Please install triton first.

pip install "triton>=2.1.0"

from lmdeploy import pipeline, GenerationConfig, PytorchEngineConfig

backend_config = PytorchEngineConfig(session_len=2048)
gen_config = GenerationConfig(top_p=0.8,
                              top_k=40,
                              temperature=0.8,
                              max_new_tokens=1024)
pipe = pipeline('internlm/internlm-chat-7b',
                backend_config=backend_config)
prompts = [[{
    'role': 'user',
    'content': 'Hi, pls intro yourself'
}], [{
    'role': 'user',
    'content': 'Shanghai is'
}]]
response = pipe(prompts, gen_config=gen_config)
print(response)

`pipeline` API¶

The pipeline function is a higher-level API designed for users to easily instantiate and use the AsyncEngine.

Init parameters:¶

Parameter	Type	Description	Default
model_path	str	Path to the model. Can be a path to a local directory storing a Turbomind model, or a model_id for models hosted on huggingface.co.	N/A
model_name	Optional[str]	Name of the model when the model_path points to a Pytorch model on huggingface.co.	None
backend_config	TurbomindEngineConfig | PytorchEngineConfig | None	Configuration object for the backend. It can be either TurbomindEngineConfig or PytorchEngineConfig depending on the backend chosen.	None, running turbomind backend by default
chat_template_config	Optional[ChatTemplateConfig]	Configuration for chat template.	None
log_level	str	The level of logging.	'ERROR'

Invocation¶

Parameter Name	Data Type	Default Value	Description
prompts	List[str]	None	A batch of prompts.
gen_config	GenerationConfig or None	None	An instance of GenerationConfig. Default is None.
do_preprocess	bool	True	Whether to pre-process the messages. Default is True, which means chat_template will be applied.
request_output_len	int	512	The number of output tokens. This parameter will be deprecated. Please use the gen_config parameter instead.
top_k	int	40	The number of the highest probability vocabulary tokens to keep for top-k-filtering. This parameter will be deprecated. Please use the gen_config parameter instead.
top_p	float	0.8	If set to a float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation. This parameter will be deprecated. Please use the gen_config parameter instead.
temperature	float	0.8	Used to modulate the next token probability. This parameter will be deprecated. Please use the gen_config parameter instead.
repetition_penalty	float	1.0	The parameter for repetition penalty. 1.0 means no penalty. This parameter will be deprecated. Please use the gen_config parameter instead.
ignore_eos	bool	False	Indicator for ignoring end-of-string (eos). This parameter will be deprecated. Please use the gen_config parameter instead.

Response¶

Parameter Name	Type	Description
text	str	The text response from the server. If the output text is an empty string and the finish_reason is 'length', it means the maximum session length has been reached.
generate_token_len	int	The number of tokens in the response.
input_token_len	int	The number of tokens in the input prompt. Note that this may include the chat template part.
session_id	int	The ID for running a session. Basically, it refers to the index position of the input request batch.
finish_reason	Optional[Literal['stop', 'length']]	The reason the model stopped generating tokens. This will be set to 'stop' if the model encounters a stop word; if the maximum number of tokens specified in the request is reached or the session length is reached, it will be set to 'length'.
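As a rough illustration of how the fields above can be read together, the following sketch summarizes a response-like object. The `describe` helper is hypothetical and not part of LMDeploy, and the demo uses `SimpleNamespace` stand-ins that carry only the field names from the table, so that it runs without a model.

```python
from types import SimpleNamespace


def describe(response):
    """Return a one-line summary of a response-like object (hypothetical helper)."""
    if response.finish_reason == 'length':
        if response.text == '':
            # Empty text plus 'length' means the session length was reached.
            status = 'session length reached before any text was produced'
        else:
            status = 'truncated by the token limit or the session length'
    elif response.finish_reason == 'stop':
        status = 'stopped on a stop word'
    else:
        status = 'no finish reason reported'
    return (f'[{response.session_id}] '
            f'{response.generate_token_len} tokens, {status}')


# Stand-ins for pipeline responses; real ones come from pipe(prompts).
finished = SimpleNamespace(text='Hello!', generate_token_len=3,
                           input_token_len=12, session_id=0,
                           finish_reason='stop')
exhausted = SimpleNamespace(text='', generate_token_len=0,
                            input_token_len=2048, session_id=1,
                            finish_reason='length')

for item in (finished, exhausted):
    print(describe(item))
```

With a real pipeline, the same helper could be applied to each element of the list returned by `pipe(prompts)`, assuming those objects expose the attributes listed in the table.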

TurbomindEngineConfig¶

Description¶

This class provides the configuration parameters for TurboMind backend.

Arguments¶

Parameter	Type	Description	Default
model_name	str, Optional	The chat template name of the deployed model. Deprecated; it has no effect in versions later than v0.2.1.	None
model_format	str, Optional	The layout of the deployed model. Can be one of the following values: hf, llama, awq.	None
tp	int	The number of GPU cards used in tensor parallelism.	1
session_len	int, Optional	The maximum session length of a sequence.	None
max_batch_size	int	The maximum batch size during inference.	128
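cache_max_entry_count	float	The proportion of GPU memory occupied by the k/v cache: of total GPU memory for v0.2.0 to v0.2.1, of free GPU memory for later versions.	0.5 (v0.2.0 to v0.2.1), 0.8 (later versions)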
quant_policy	int	Set it to 4 when k/v is quantized into 8 bits.	0
rope_scaling_factor	float	Scaling factor used for dynamic ntk. TurboMind follows the implementation of transformer LlamaAttention.	0.0
use_logn_attn	bool	Whether or not to use logarithmic attention.	False
download_dir	str, optional	Directory to download and load the weights, default to the default cache directory of huggingface.	None
revision	str, optional	The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version.	None

PytorchEngineConfig¶

Description¶

This class provides the configuration parameters for Pytorch backend.

Arguments¶

Parameter	Type	Description	Default
model_name	str	The chat template name of the deployed model	''
tp	int	Tensor Parallelism.	1
session_len	int	Maximum session length.	None
max_batch_size	int	Maximum batch size.	128
eviction_type	str	Action to perform when kv cache is full. Options are ['recompute', 'copy'].	'recompute'
prefill_interval	int	Interval to perform prefill.	16
block_size	int	Paging cache block size.	64
num_cpu_blocks	int	Number of CPU blocks. If the number is 0, cache would be allocated according to the current environment.	0
num_gpu_blocks	int	Number of GPU blocks. If the number is 0, cache would be allocated according to the current environment.	0
adapters	dict	The path configs to lora adapters.	None
download_dir	str	Directory to download and load the weights, default to the default cache directory of huggingface.	None
revision	str	The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version.	None

GenerationConfig¶

Description¶

This class contains the generation parameters used by inference engines.

Arguments¶

Parameter	Type	Description	Default
n	int	Number of chat completion choices to generate for each input message. Currently, only 1 is supported	1
max_new_tokens	int	Maximum number of tokens that can be generated in chat completion.	512
top_p	float	Nucleus sampling, where the model considers the tokens with top_p probability mass.	1.0
top_k	int	The model considers the top_k tokens with the highest probability.	1
temperature	float	Sampling temperature.	0.8
repetition_penalty	float	Penalty to prevent the model from generating repeated words or phrases. A value larger than 1 discourages repetition.	1.0
ignore_eos	bool	Indicator to ignore the eos_token_id or not.	False
random_seed	int	Seed used when sampling a token.	None
stop_words	List[str]	Words that stop generating further tokens.	None
bad_words	List[str]	Words that the engine will never generate.	None
min_new_tokens	int	The minimum numbers of tokens to generate, ignoring the number of tokens in the prompt.	None
skip_special_tokens	bool	Whether or not to remove special tokens in the decoding.	True
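
To give an intuition for how temperature, top_k and top_p interact, here is a minimal pure-Python sketch of the usual filtering steps. It is an illustration only, not the code used by either backend: the function name `filter_probs` is hypothetical, and the order of operations (temperature, then top-k, then top-p) is an assumption borrowed from common sampling implementations.

```python
import math


def filter_probs(logits, top_k=40, top_p=0.8, temperature=0.8):
    """Return {token_index: probability} for the tokens that survive filtering."""
    # Temperature: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most probable tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    ranked = ranked[:top_k]

    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


logits = [2.0, 1.0, 0.0, -1.0]
print(filter_probs(logits, top_k=40, top_p=0.8, temperature=1.0))
print(filter_probs(logits, top_k=1, top_p=0.8, temperature=1.0))
```

With top_k=1 only the single most probable token survives, so sampling becomes greedy regardless of top_p and temperature.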

FAQs¶

RuntimeError: context has already been set. If you get this error with tp>1 in the pytorch backend, please make sure the Python script contains the following guard:
```
if __name__ == '__main__':
```
In general, with multi-threading or multi-processing, initialization code may need to run only once. Placing it under if __name__ == '__main__': ensures that it runs only in the main program and is not repeated in each newly created process or thread.
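
The error message matches the one raised by Python's standard multiprocessing module when a start method is set a second time. Whether this is the exact code path inside the pytorch backend is an assumption, but the following standard-library sketch reproduces the message without LMDeploy or a GPU:

```python
import multiprocessing as mp

# First call: force=True so the demo works even if a start method
# was already chosen earlier in this interpreter.
mp.set_start_method('spawn', force=True)

# Second call without force: the context is already fixed, so this raises.
try:
    mp.set_start_method('spawn')
    message = None
except RuntimeError as err:
    message = str(err)

print(message)  # context has already been set
```

When worker processes are spawned, each child re-imports the main script; any unguarded top-level code, including code that configures multiprocessing, then runs again in the child, which is why keeping it under the guard avoids the error.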
