Shortcuts

Inference pipeline

pipeline

lmdeploy.pipeline(model_path: str, model_name: Optional[str] = None, backend_config: Optional[Union[lmdeploy.messages.PytorchEngineConfig, lmdeploy.messages.TurbomindEngineConfig]] = None, chat_template_config: Optional[lmdeploy.model.ChatTemplateConfig] = None, log_level='ERROR', **kwargs)[source]
Parameters
  • model_path (str) –

    The path of a model. It can be one of the following options:

      i) A local directory path of a turbomind model which is converted by the lmdeploy convert command or downloaded from ii) and iii).

      ii) The model_id of an lmdeploy-quantized model hosted inside a model repo on huggingface.co, such as “InternLM/internlm-chat-20b-4bit”, “lmdeploy/llama2-chat-70b-4bit”, etc.

      iii) The model_id of a model hosted inside a model repo on huggingface.co, such as “internlm/internlm-chat-7b”, “Qwen/Qwen-7B-Chat”, “baichuan-inc/Baichuan2-7B-Chat” and so on.

  • model_name (str) – needed when model_path is a pytorch model on huggingface.co, such as “internlm/internlm-chat-7b”, “Qwen/Qwen-7B-Chat”, “baichuan-inc/Baichuan2-7B-Chat” and so on.

  • backend_config (TurbomindEngineConfig | PytorchEngineConfig) – backend config instance. Defaults to None.

  • chat_template_config (ChatTemplateConfig) – chat template configuration. Defaults to None.

  • log_level (str) – the log level, one of [CRITICAL, ERROR, WARNING, INFO, DEBUG].

Examples

>>> # LLM
>>> import lmdeploy
>>> pipe = lmdeploy.pipeline('internlm/internlm-chat-7b')
>>> response = pipe(['hi','say this is a test'])
>>> print(response)
>>>
>>> # VLM
>>> from lmdeploy.vl import load_image
>>> from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
>>> pipe = pipeline('liuhaotian/llava-v1.5-7b',
...                 backend_config=TurbomindEngineConfig(session_len=8192),
...                 chat_template_config=ChatTemplateConfig(model_name='vicuna'))
>>> im = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
>>> response = pipe([('describe this image', [im])])
>>> print(response)

serving

lmdeploy.serve(model_path: str, model_name: Optional[str] = None, backend: Literal['turbomind', 'pytorch'] = 'turbomind', backend_config: Optional[Union[lmdeploy.messages.PytorchEngineConfig, lmdeploy.messages.TurbomindEngineConfig]] = None, chat_template_config: Optional[lmdeploy.model.ChatTemplateConfig] = None, server_name: str = '0.0.0.0', server_port: int = 23333, log_level: str = 'ERROR', api_keys: Optional[Union[str, List[str]]] = None, ssl: bool = False, **kwargs)[source]

This will run the api_server in a subprocess.

Parameters
  • model_path (str) –

    The path of a model. It can be one of the following options:

      i) A local directory path of a turbomind model which is converted by the lmdeploy convert command or downloaded from ii) and iii).

      ii) The model_id of an lmdeploy-quantized model hosted inside a model repo on huggingface.co, such as “InternLM/internlm-chat-20b-4bit”, “lmdeploy/llama2-chat-70b-4bit”, etc.

      iii) The model_id of a model hosted inside a model repo on huggingface.co, such as “internlm/internlm-chat-7b”, “Qwen/Qwen-7B-Chat”, “baichuan-inc/Baichuan2-7B-Chat” and so on.

  • model_name (str) – needed when model_path is a pytorch model on huggingface.co, such as “internlm/internlm-chat-7b”, “Qwen/Qwen-7B-Chat”, “baichuan-inc/Baichuan2-7B-Chat” and so on.

  • backend (str) – either the turbomind or the pytorch backend. Defaults to the turbomind backend.

  • backend_config (TurbomindEngineConfig | PytorchEngineConfig) – backend config instance. Defaults to None.

  • chat_template_config (ChatTemplateConfig) – chat template configuration. Default to None.

  • server_name (str) – host ip for serving

  • server_port (int) – server port

  • log_level (str) – the log level, one of [CRITICAL, ERROR, WARNING, INFO, DEBUG].

  • api_keys (List[str] | str | None) – optional list of API keys. A single string is accepted as one API key. Defaults to None, which means no API key is applied.

  • ssl (bool) – enable SSL. Requires the OS environment variables ‘SSL_KEYFILE’ and ‘SSL_CERTFILE’.
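
The api_keys and ssl descriptions above imply two small checks a caller may want to run before starting a server. The following is a minimal sketch using only the standard library; normalize_api_keys and missing_ssl_variables are hypothetical helper names, not part of lmdeploy, and the sketch makes no claim about how lmdeploy handles these values internally.

```python
import os
from typing import List, Mapping, Optional, Union


def normalize_api_keys(
        api_keys: Optional[Union[str, List[str]]]) -> Optional[List[str]]:
    """Hypothetical helper: represent api_keys uniformly as a list (or None)."""
    if api_keys is None:
        return None  # None means no API key is applied
    if isinstance(api_keys, str):
        return [api_keys]  # a single string counts as one API key
    return list(api_keys)


def missing_ssl_variables(
        environ: Mapping[str, str] = os.environ) -> List[str]:
    """Hypothetical helper: names of the required SSL variables that are unset."""
    required = ("SSL_KEYFILE", "SSL_CERTFILE")
    return [name for name in required if not environ.get(name)]


if __name__ == "__main__":
    print(normalize_api_keys("my-secret-key"))
    print(missing_ssl_variables())
```

Checking the environment up front gives a clearer error than a failure at server start; whether the check is needed depends on the deployment.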

Returns

A client chatbot for LLaMA series models.

Return type

APIClient

Examples

>>> import lmdeploy
>>> client = lmdeploy.serve('internlm/internlm-chat-7b', 'internlm-chat-7b')
>>> for output in client.chat('hi', 1):
...    print(output)

lmdeploy.client(api_server_url: str = 'http://0.0.0.0:23333', api_key: Optional[str] = None, **kwargs)[source]
Parameters
  • api_server_url (str) – communicating address ‘http://<ip>:<port>’ of api_server

  • api_key (str | None) – api key. Default to None, which means no api key will be used.

Returns

Chatbot for LLaMA series models with turbomind as inference engine.
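
The api_server_url parameter expects the form ‘http://<ip>:<port>’. As a rough illustration, the sketch below builds and validates such an address with the standard library. Both function names are hypothetical and exist only for this example; lmdeploy may validate the URL differently, or not at all.

```python
from typing import Tuple
from urllib.parse import urlsplit


def build_api_server_url(ip: str, port: int, scheme: str = "http") -> str:
    """Hypothetical helper: format an address as '<scheme>://<ip>:<port>'."""
    return f"{scheme}://{ip}:{port}"


def split_api_server_url(url: str) -> Tuple[str, int]:
    """Hypothetical helper: return (host, port) or raise ValueError."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme in {url!r}")
    if parts.hostname is None or parts.port is None:
        raise ValueError(f"expected '<scheme>://<ip>:<port>', got {url!r}")
    return parts.hostname, parts.port


if __name__ == "__main__":
    url = build_api_server_url("0.0.0.0", 23333)
    print(url, split_api_server_url(url))
```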

PytorchEngineConfig

class lmdeploy.PytorchEngineConfig(model_name: str = '', tp: int = 1, session_len: Optional[int] = None, max_batch_size: int = 128, cache_max_entry_count: float = 0.8, eviction_type: str = 'recompute', prefill_interval: int = 16, block_size: int = 64, num_cpu_blocks: int = 0, num_gpu_blocks: int = 0, adapters: Optional[Dict[str, str]] = None, max_prefill_token_num: int = 4096, thread_safe: bool = False, download_dir: Optional[str] = None, revision: Optional[str] = None)[source]

PyTorch Engine Config.

Parameters
  • model_name (str) – name of the given model.

  • tp (int) – Tensor Parallelism. default 1.

  • session_len (int) – Max session length. Default None.

  • max_batch_size (int) – Max batch size. Default 128.

  • cache_max_entry_count (float) – the percentage of GPU memory occupied by the k/v cache. For lmdeploy versions greater than v0.2.1, it defaults to 0.8, signifying the percentage of FREE GPU memory to be reserved for the k/v cache.

  • eviction_type (str) – the action to perform when the k/v cache is full, one of [‘recompute’, ‘copy’]. Default ‘recompute’.

  • prefill_interval (int) – interval at which prefill is performed. Default 16.

  • block_size (int) – paging cache block size. Default 64.

  • num_cpu_blocks (int) – number of CPU blocks. If it is 0, the cache is allocated according to the current environment.

  • num_gpu_blocks (int) – number of GPU blocks. If it is 0, the cache is allocated according to the current environment.

  • adapters (dict) – The path configs to lora adapters.

  • max_prefill_token_num (int) – tokens per iteration.

  • thread_safe (bool) – thread safe engine instance.

  • download_dir (str) – Directory to download and load the weights, default to the default cache directory of huggingface.

  • revision (str) – The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version.
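
To make the interplay of cache_max_entry_count and block_size more concrete, here is a back-of-the-envelope sketch of how many paged cache blocks might fit on one GPU. It is an illustration under stated assumptions, not lmdeploy's allocation logic: the per-token size uses the common estimate of 2 (keys and values) × layers × k/v heads × head dimension × bytes per element, and the model shape and free-memory figure are made up for the example.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Assumed estimate of k/v cache bytes per token (factor 2 = keys + values)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def estimate_gpu_blocks(free_gpu_bytes: int, cache_max_entry_count: float,
                        block_size: int, bytes_per_token: int) -> int:
    """Blocks that fit when a fraction of FREE memory is given to the cache."""
    cache_bytes = int(free_gpu_bytes * cache_max_entry_count)
    return cache_bytes // (block_size * bytes_per_token)


if __name__ == "__main__":
    # Assumed model shape: 32 layers, 32 k/v heads, head_dim 128, fp16.
    per_token = kv_bytes_per_token(32, 32, 128)
    # Assumed 24 GiB of free GPU memory, defaults 0.8 and block_size 64.
    blocks = estimate_gpu_blocks(24 * 1024**3, 0.8, 64, per_token)
    print(per_token, blocks, blocks * 64)
```

Under these assumptions each block holds block_size tokens, so the last printed value is a rough upper bound on the number of cached tokens across all sessions.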

TurbomindEngineConfig

class lmdeploy.TurbomindEngineConfig(model_name: Optional[str] = None, model_format: Optional[str] = None, tp: int = 1, session_len: Optional[int] = None, max_batch_size: int = 128, cache_max_entry_count: float = 0.8, cache_block_seq_len: int = 64, quant_policy: int = 0, rope_scaling_factor: float = 0.0, use_logn_attn: bool = False, download_dir: Optional[str] = None, revision: Optional[str] = None, max_prefill_token_num: int = 8192)[source]

TurboMind Engine config.

Parameters
  • model_name (str) – the name of the deployed model, deprecated and has no effect when version > 0.2.1

  • model_format (str) – the layout of the deployed model. It can be one of the following values [hf, llama, awq], hf meaning hf_llama, llama meaning meta_llama, awq meaning the quantized model by AWQ.

  • tp (int) – the number of GPU cards used in tensor parallelism, default to 1

  • session_len (int) – the max session length of a sequence, default to None

  • max_batch_size (int) – the max batch size during inference, default to 128

  • cache_max_entry_count (float) – the percentage of GPU memory occupied by the k/v cache. For versions of lmdeploy between v0.2.0 and v0.2.1, it defaults to 0.5, denoting the percentage of TOTAL GPU memory to be allocated to the k/v cache. For lmdeploy versions greater than v0.2.1, it defaults to 0.8, signifying the percentage of FREE GPU memory to be reserved for the k/v cache.

  • cache_block_seq_len (int) – the length of the token sequence in a k/v block, default to 64

  • quant_policy (int) – defaults to 0. Set it to 4 when the k/v cache is quantized to 8 bits

  • rope_scaling_factor (float) – scaling factor used for dynamic NTK, defaults to 0. TurboMind follows the implementation of transformer LlamaAttention

  • use_logn_attn (bool) – whether or not to use logn attention, defaults to False

  • download_dir (str) – Directory to download and load the weights, default to the default cache directory of huggingface.

  • revision (str) – The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version.

  • max_prefill_token_num (int) – the number of tokens processed in each iteration during prefill, defaults to 8192
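
The two meanings of cache_max_entry_count described above (a share of TOTAL memory in v0.2.0 to v0.2.1, a share of FREE memory afterwards) can lead to noticeably different cache budgets. The sketch below only does the arithmetic for both readings. The function name and the memory figures are assumptions for illustration; it does not query a GPU or call lmdeploy.

```python
GIB = 1024**3


def kv_cache_budget(total_gpu_bytes: int, free_gpu_bytes: int,
                    cache_max_entry_count: float,
                    ratio_of_free: bool) -> float:
    """Hypothetical helper: bytes given to the k/v cache under either reading.

    ratio_of_free=False: the ratio applies to TOTAL GPU memory.
    ratio_of_free=True:  the ratio applies to FREE GPU memory.
    """
    base = free_gpu_bytes if ratio_of_free else total_gpu_bytes
    return base * cache_max_entry_count


if __name__ == "__main__":
    # Assumed card: 80 GiB in total, 60 GiB free once the weights are loaded.
    old = kv_cache_budget(80 * GIB, 60 * GIB, 0.5, ratio_of_free=False)
    new = kv_cache_budget(80 * GIB, 60 * GIB, 0.8, ratio_of_free=True)
    print(f"{old / GIB:.1f} GiB vs {new / GIB:.1f} GiB")
```

With the FREE-memory reading, the budget shrinks automatically when the weights or other processes already occupy the card, which is one reason a ratio tuned under the older reading may need revisiting.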

GenerationConfig

class lmdeploy.GenerationConfig(n: int = 1, max_new_tokens: int = 512, top_p: float = 1.0, top_k: int = 1, temperature: float = 0.8, repetition_penalty: float = 1.0, ignore_eos: bool = False, random_seed: Optional[int] = None, stop_words: Optional[List[str]] = None, bad_words: Optional[List[str]] = None, min_new_tokens: Optional[int] = None, skip_special_tokens: bool = True)[source]

generation parameters used by inference engines.

Parameters
  • n (int) – Define how many chat completion choices to generate for each input message

  • max_new_tokens (int) – The maximum number of tokens that can be generated in the chat completion

  • top_p (float) – An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass

  • top_k (int) – An alternative to sampling with temperature, where the model considers the top_k tokens with the highest probability

  • temperature (float) – Sampling temperature

  • repetition_penalty (float) – Penalty to prevent the model from generating repeated words or phrases. A value larger than 1 discourages repetition

  • ignore_eos (bool) – Indicator to ignore the eos_token_id or not

  • random_seed (int) – Seed used when sampling a token

  • stop_words (List[str]) – Words that stop generating further tokens

  • bad_words (List[str]) – Words that the engine will never generate

  • min_new_tokens (int) – The minimum number of tokens to generate, ignoring the number of tokens in the prompt.

  • skip_special_tokens (bool) – Whether or not to remove special tokens during decoding. Defaults to True.
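
As a rough illustration of how temperature, top_k, top_p and random_seed interact, here is a toy sampler over a three-token vocabulary in plain Python. It is a sketch of the general technique, not the code the inference engines run: the function names are hypothetical, the sketch requires top_k >= 1, and it applies top_p to the probabilities left after the top_k cut, which is one of several possible orderings.

```python
import math
import random
from typing import Dict, Optional


def filter_candidates(logits: Dict[str, float], top_k: int, top_p: float,
                      temperature: float) -> Dict[str, float]:
    """Renormalised probabilities of the tokens surviving top_k and top_p."""
    if top_k < 1:
        raise ValueError("this sketch requires top_k >= 1")
    # Temperature scaling followed by a numerically stable softmax.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda item: item[1], reverse=True)
    # top_k: keep only the k most probable tokens.
    ranked = ranked[:top_k]
    # top_p: keep the shortest prefix whose cumulative mass reaches top_p.
    kept, mass = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        mass += prob
        if mass >= top_p:
            break
    norm = sum(prob for _, prob in kept)
    return {tok: prob / norm for tok, prob in kept}


def sample_token(logits: Dict[str, float], top_k: int = 1, top_p: float = 1.0,
                 temperature: float = 0.8,
                 random_seed: Optional[int] = None) -> str:
    """Draw one token; a fixed random_seed makes the draw repeatable."""
    candidates = filter_candidates(logits, top_k, top_p, temperature)
    rng = random.Random(random_seed)
    tokens = list(candidates)
    weights = [candidates[tok] for tok in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy_logits = {"a": 2.0, "b": 1.0, "c": 0.0}
    print(sample_token(toy_logits))  # top_k=1 leaves only the best token
    print(filter_candidates(toy_logits, top_k=3, top_p=0.7, temperature=1.0))
```

In this sketch the defaults from the signature above (top_k=1) leave a single candidate, so temperature and top_p only start to matter once top_k is raised.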

ChatTemplateConfig

class lmdeploy.ChatTemplateConfig(model_name: str, system: Optional[str] = None, meta_instruction: Optional[str] = None, eosys: Optional[str] = None, user: Optional[str] = None, eoh: Optional[str] = None, assistant: Optional[str] = None, eoa: Optional[str] = None, separator: Optional[str] = None, capability: Optional[Literal['completion', 'infilling', 'chat', 'python']] = None, stop_words: Optional[List[str]] = None)[source]

Parameters for chat template.

Parameters
  • model_name (str) – the name of the deployed model. It determines which chat template is applied. Run lmdeploy list to see all chat template names.

  • system (str | None) – begin of the system prompt

  • meta_instruction (str | None) – system prompt

  • eosys (str | None) – end of the system prompt

  • user (str | None) – begin of the user prompt

  • eoh (str | None) – end of the user prompt

  • assistant (str | None) – begin of the assistant prompt

  • eoa (str | None) – end of the assistant prompt

  • capability (‘completion’ | ‘infilling’ | ‘chat’ | ‘python’ | None) – defaults to None.
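
The begin and end markers listed above read most naturally as pieces that wrap each part of a conversation. The sketch below shows one plausible way such fields could be concatenated into a first-turn prompt. TemplateSketch is a hypothetical stand-in written for this example, the marker strings are invented, and the concatenation order is an assumption rather than a statement of how lmdeploy renders its templates.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TemplateSketch:
    """Hypothetical stand-in for the fields above; not lmdeploy's class."""
    system: str = ""                         # begin of the system prompt
    meta_instruction: Optional[str] = None   # the system prompt itself
    eosys: str = ""                          # end of the system prompt
    user: str = ""                           # begin of the user prompt
    eoh: str = ""                            # end of the user prompt
    assistant: str = ""                      # begin of the assistant prompt
    eoa: str = ""                            # end of the assistant prompt

    def render_first_turn(self, prompt: str) -> str:
        """Assumed layout: system block, then the user turn, then the cue."""
        head = ""
        if self.meta_instruction is not None:
            head = f"{self.system}{self.meta_instruction}{self.eosys}"
        return f"{head}{self.user}{prompt}{self.eoh}{self.assistant}"

    def render_reply(self, answer: str) -> str:
        """Assumed layout: a finished assistant reply is closed with eoa."""
        return f"{answer}{self.eoa}"


if __name__ == "__main__":
    # Invented marker strings, chosen only to make the layout visible.
    template = TemplateSketch(system="[SYS]", meta_instruction="Be brief.",
                              eosys="[/SYS]", user="[USER]", eoh="[/USER]",
                              assistant="[BOT]", eoa="[/BOT]")
    print(template.render_first_turn("hi"))
    print(template.render_reply("hello"))
```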