Restful API¶
Launch Service¶
Open the HTTP URL printed by the following command in a browser.
Please check that URL for detailed API usage!
lmdeploy serve api_server ./workspace --server-name 0.0.0.0 --server-port ${server_port} --tp 1
The parameters supported by api_server can be viewed through the command line lmdeploy serve api_server -h.
We provide some RESTful APIs. Three of them are in OpenAI format.
/v1/chat/completions
/v1/models
/v1/completions
However, we recommend that users try our own API, /v1/chat/interactive, which exposes more arguments for users to modify. Its performance is comparatively better.
Note that if you want to launch multiple requests, you should set a different session_id for each one, for both the
/v1/chat/completions and /v1/chat/interactive APIs. Otherwise, we will assign them random values.
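One way to follow this advice is to give every request body its own session_id before sending. The sketch below only builds the JSON bodies; the helper name `build_interactive_payloads` and the starting id are illustrative assumptions, while the field names (`prompt`, `session_id`, `interactive_mode`) are the ones used in the cURL example on this page.

```python
import itertools
import json


def build_interactive_payloads(prompts, first_session_id=1):
    """Return one /v1/chat/interactive request body per prompt.

    Each body gets its own positive session_id so that separate
    requests do not share a session on the server.
    (Hypothetical helper, not part of LMDeploy.)
    """
    session_ids = itertools.count(first_session_id)
    return [
        {"prompt": prompt, "session_id": sid, "interactive_mode": False}
        for prompt, sid in zip(prompts, session_ids)
    ]


payloads = build_interactive_payloads(["Hello!", "What is LMDeploy?"])
for body in payloads:
    print(json.dumps(body))
```

Any scheme that yields distinct ids works equally well; a simple counter is used here only because it is easy to read.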
python¶
We have integrated the client-side functionalities of these services into the APIClient class. Below are some examples demonstrating how to invoke the api_server service on the client side.
If you want to use the /v1/chat/completions endpoint, you can try the following code:
from lmdeploy.serve.openai.api_client import APIClient
api_client = APIClient('http://{server_ip}:{server_port}')
model_name = api_client.available_models[0]
messages = [{"role": "user", "content": "Say this is a test!"}]
for item in api_client.chat_completions_v1(model=model_name, messages=messages):
    print(item)
If you want to use the /v1/completions endpoint, you can try:
from lmdeploy.serve.openai.api_client import APIClient
api_client = APIClient('http://{server_ip}:{server_port}')
model_name = api_client.available_models[0]
for item in api_client.completions_v1(model=model_name, prompt='hi'):
    print(item)
LMDeploy supports maintaining session histories on the server for the /v1/chat/interactive API. This feature is
disabled by default.
In interactive mode, the chat history is kept on the server. In a multi-round conversation, you should set
interactive_mode = True and the same session_id (it can't be -1, which is the default value) on every request to /v1/chat/interactive.
In normal mode, no chat history is kept on the server.
The interactive mode can be controlled by the interactive_mode boolean parameter. The following is an example of normal mode. If you want to experience the interactive mode, simply pass in interactive_mode=True.
from lmdeploy.serve.openai.api_client import APIClient
api_client = APIClient('http://{server_ip}:{server_port}')
for item in api_client.chat_interactive_v1(prompt='hi'):
    print(item)
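For a multi-round conversation, the same idea can be sketched as a small helper that reuses one session_id and passes interactive_mode=True on every round. The helper name `chat_rounds` is hypothetical, and passing `session_id` as a keyword argument to `chat_interactive_v1` is an assumption that should be checked against the API page of your server.

```python
def chat_rounds(client, prompts, session_id=1):
    """Send each prompt in the same server-side session and collect the outputs.

    `client` is expected to be an APIClient instance. Passing session_id as a
    keyword argument is an assumption; verify it against your LMDeploy version.
    """
    if session_id == -1:
        raise ValueError("session_id must not be -1, which is the default")
    outputs = []
    for prompt in prompts:
        # interactive_mode=True asks the server to keep the chat history.
        items = list(
            client.chat_interactive_v1(
                prompt=prompt, session_id=session_id, interactive_mode=True
            )
        )
        outputs.append(items)
    return outputs


# Usage against a running server (not executed here):
# from lmdeploy.serve.openai.api_client import APIClient
# api_client = APIClient('http://{server_ip}:{server_port}')
# for items in chat_rounds(api_client, ['hi', 'what did I just say?']):
#     print(items)
```

Each round is drained fully before the next prompt is sent, so the server has finished the previous turn before the history is extended.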
Java/Golang/Rust¶
You can use openapi-generator-cli to convert http://{server_ip}:{server_port}/openapi.json into a Java, Rust, or Golang client.
Here is an example:
$ docker run -it --rm -v ${PWD}:/local openapitools/openapi-generator-cli generate -i /local/openapi.json -g rust -o /local/rust
$ ls rust/*
rust/Cargo.toml rust/git_push.sh rust/README.md
rust/docs:
ChatCompletionRequest.md EmbeddingsRequest.md HttpValidationError.md LocationInner.md Prompt.md
DefaultApi.md GenerateRequest.md Input.md Messages.md ValidationError.md
rust/src:
apis lib.rs models
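The docker command above reads /local/openapi.json, so the spec presumably has to be saved in the current directory first. A minimal sketch of that step using only the Python standard library follows; the helper names `openapi_url` and `save_openapi_spec` are hypothetical, and the download call is left commented out so the snippet runs without a server.

```python
import urllib.request
from pathlib import Path


def openapi_url(base_url):
    """Return the OpenAPI spec URL for an api_server base URL."""
    return base_url.rstrip("/") + "/openapi.json"


def save_openapi_spec(base_url, dest="openapi.json", timeout=10):
    """Download the spec and write it to dest; return the path written."""
    with urllib.request.urlopen(openapi_url(base_url), timeout=timeout) as resp:
        data = resp.read()
    path = Path(dest)
    path.write_bytes(data)
    return path


print(openapi_url("http://localhost:23333"))

# With a running api_server (address assumed for illustration):
# save_openapi_spec("http://localhost:23333")
```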
cURL¶
cURL is a tool for observing the output of the api.
List Models:
curl http://{server_ip}:{server_port}/v1/models
Interactive Chat:
curl http://{server_ip}:{server_port}/v1/chat/interactive \
-H "Content-Type: application/json" \
-d '{
"prompt": "Hello! How are you?",
"session_id": 1,
"interactive_mode": true
}'
Chat Completions:
curl http://{server_ip}:{server_port}/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "internlm-chat-7b",
"messages": [{"role": "user", "content": "Hello! How are you?"}]
}'
Text Completions:
curl http://{server_ip}:{server_port}/v1/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "llama",
"prompt": "two steps to build a house:"
}'
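The same requests can be assembled without cURL. The following is a minimal sketch using only the Python standard library; the helper name `make_json_request` is hypothetical, the address is the example one used elsewhere on this page, and the call that would actually send the request is commented out so the snippet runs without a server.

```python
import json
import urllib.request


def make_json_request(base_url, path, body):
    """Build a POST request with a JSON body, like the cURL calls above."""
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = make_json_request(
    "http://localhost:23333",  # assumed address; use your own server_ip and server_port
    "/v1/chat/interactive",
    {"prompt": "Hello! How are you?", "session_id": 1, "interactive_mode": True},
)
print(request.get_method(), request.full_url)

# To send it against a running api_server:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```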
CLI client¶
A CLI client script is provided for the RESTful API server.
# api_server_url is the URL printed by api_server, e.g. http://localhost:23333
lmdeploy serve api_client api_server_url
webui through gradio¶
You can also test the RESTful API through a web UI.
# api_server_url is the URL printed by api_server, e.g. http://localhost:23333
# --server-name and --server-port here are for the gradio UI
# example: lmdeploy serve gradio http://localhost:23333 --server-name localhost --server-port 6006
lmdeploy serve gradio api_server_url --server-name ${gradio_ui_ip} --server-port ${gradio_ui_port}
webui through OpenAOE¶
You can use OpenAOE for seamless integration with LMDeploy.
pip install -U openaoe
openaoe -f /path/to/your/config-template.yaml
Please refer to the guidance for more deployment information.
FAQ¶
When the user gets "finish_reason":"length", it means the session is too long to be continued. The session length can be modified by passing --session_len to api_server.
When OOM appears on the server side, please reduce the cache_max_entry_count of backend_config when launching the service.
When a request with the same session_id to /v1/chat/interactive gets an empty return value and a negative tokens value, please consider setting interactive_mode=false to restart the session.
The /v1/chat/interactive API disables multi-round conversation by default. The input argument prompt can be either a single string or an entire chat history.
If you need to adjust other default parameters of the session, such as the content of fields like system, you can directly pass in the initialization parameters of the dialogue template. For example, for the internlm-chat-7b model, you can set the --meta-instruction parameter when starting the api_server.
Regarding stop words, we only support characters that encode into a single index. Furthermore, there may be multiple indexes that decode into results containing the stop word. In such cases, if the number of these indexes is too large, we will only use the index encoded by the tokenizer. If you want to use a stop symbol that encodes into multiple indexes, you may consider performing string matching on the streaming client side. Once a successful match is found, you can then break out of the streaming loop.
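The client-side string matching suggested for multi-index stop symbols could look like the sketch below. It works on plain text chunks; how the text is pulled out of each streamed item depends on the endpoint's response format and is not shown. The function name `stream_until_stop` and the `<END>` marker are illustrative assumptions.

```python
def stream_until_stop(chunks, stop):
    """Yield text from chunks until the stop string appears.

    The stop string may be split across chunk boundaries, so the last
    len(stop) - 1 characters are held back until more text arrives.
    """
    if not stop:
        raise ValueError("stop must be a non-empty string")
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        index = buffer.find(stop)
        if index != -1:
            if index:
                yield buffer[:index]
            return  # match found: break out of the streaming loop
        # Everything except a possible partial match at the end is safe to emit.
        safe = len(buffer) - (len(stop) - 1)
        if safe > 0:
            yield buffer[:safe]
            buffer = buffer[safe:]
    if buffer:
        yield buffer  # stream ended without a match; flush what was held back


chunks = ["Hello ", "wor", "ld<E", "ND> ignored"]
text = "".join(stream_until_stop(chunks, "<END>"))
print(text)  # Hello world
```

Holding back a short tail costs a few characters of latency but avoids printing the first half of a stop symbol that arrives split across two chunks.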
request distribution service¶
Please refer to our request distributor server