
Serving VLM with OpenAI Compatible Server

This article focuses on deploying a single large vision-language model across multiple GPUs on a single node, exposing a service compatible with the OpenAI interface, and on how to use the service API. For convenience, we refer to this service as api_server. For serving multiple models in parallel, please refer to the guide about the Request Distribution Server.

In the following sections, we first introduce two methods for starting the service; choose the appropriate one based on your application scenario.

Next, we focus on the definition of the service’s RESTful API, explore the various ways to interact with the interface, and demonstrate how to try the service through the Swagger UI or LMDeploy CLI tools.

Finally, we showcase how to integrate the service into a WebUI, providing a reference for easily setting up a demo.

Launch Service

Take the llava-v1.6-vicuna-7b model hosted on the huggingface hub as an example; you can choose one of the following methods to start the service.

Option 1: Launching with lmdeploy CLI

lmdeploy serve api_server liuhaotian/llava-v1.6-vicuna-7b --server-port 23333

The arguments of api_server can be viewed with the command lmdeploy serve api_server -h, for instance, --tp to set the tensor parallelism degree, --session-len to specify the maximum context window length, --cache-max-entry-count to adjust the GPU memory ratio reserved for the k/v cache, and so on.
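
For example, a launch command combining these options might look like the following sketch; the values shown are purely illustrative placeholders, not recommendations, and should be adjusted to your hardware and model:

lmdeploy serve api_server liuhaotian/llava-v1.6-vicuna-7b \
    --server-port 23333 \
    --tp 2 \
    --session-len 8192 \
    --cache-max-entry-count 0.5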

Option 2: Deploying with docker

With LMDeploy official docker image, you can run OpenAI compatible server as follows:

docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
    -p 23333:23333 \
    --ipc=host \
    openmmlab/lmdeploy:latest \
    lmdeploy serve api_server liuhaotian/llava-v1.6-vicuna-7b

The parameters of api_server are the same as those mentioned in the “Option 1” section.

Each model may require specific dependencies not included in the Docker image. If you run into issues, you may need to install those yourself on a case-by-case basis. If in doubt, refer to the specific model’s project for documentation.

For example, for Llava:

FROM openmmlab/lmdeploy:latest

RUN apt-get update && apt-get install -y python3 python3-pip git

WORKDIR /app

RUN pip3 install --upgrade pip
RUN pip3 install timm
RUN pip3 install git+https://github.com/haotian-liu/LLaVA.git --no-deps

COPY . .

CMD ["lmdeploy", "serve", "api_server", "liuhaotian/llava-v1.6-34b"]

RESTful API

LMDeploy’s RESTful API is compatible with the following three OpenAI interfaces:

  • /v1/chat/completions

  • /v1/models

  • /v1/completions

The interface for image interaction is /v1/chat/completions, which is consistent with OpenAI.
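
For instance, once the service is up, you can call this endpoint directly over HTTP. The request below is a minimal sketch; replace <model_name> with one of the names returned by /v1/models:

curl http://0.0.0.0:23333/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{
          "model": "<model_name>",
          "messages": [{
            "role": "user",
            "content": [
              {"type": "text", "text": "Describe the image please"},
              {"type": "image_url", "image_url": {"url": "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg"}}
            ]
          }]
        }'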

You can also browse and try out the offered RESTful APIs through the Swagger UI at http://0.0.0.0:23333 after launching the service successfully, as shown in the image below.

[Screenshot: the Swagger UI of api_server]

If you need to integrate the service into your own projects or products, we recommend the following approach:

Integrate with OpenAI

Here is an example of interacting with the /v1/chat/completions endpoint via the openai package. Before running it, please install the openai package with pip install openai.

from openai import OpenAI

client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role': 'user',
        'content': [{
            'type': 'text',
            'text': 'Describe the image please',
        }, {
            'type': 'image_url',
            'image_url': {
                'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg',
            },
        }],
    }],
    temperature=0.8,
    top_p=0.8)
print(response)
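
If the image lives on your local disk rather than at a public URL, the OpenAI message format also allows embedding it as a base64 data URL in the image_url field. The snippet below is a minimal sketch assuming the server accepts data URLs; the file name tiger.jpeg is a placeholder for your own image.

import base64

from openai import OpenAI

client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id

# Read a local image and encode it as a base64 data URL (the path is a placeholder).
with open('tiger.jpeg', 'rb') as f:
    b64_image = base64.b64encode(f.read()).decode('utf-8')

response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role': 'user',
        'content': [{
            'type': 'text',
            'text': 'Describe the image please',
        }, {
            'type': 'image_url',
            'image_url': {
                'url': f'data:image/jpeg;base64,{b64_image}',
            },
        }],
    }],
    temperature=0.8,
    top_p=0.8)
print(response)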

Integrate with lmdeploy APIClient

Below are some examples demonstrating how to access the service through the APIClient.

If you want to use the /v1/chat/completions endpoint, you can try the following code:

from lmdeploy.serve.openai.api_client import APIClient

api_client = APIClient('http://0.0.0.0:23333')
model_name = api_client.available_models[0]
messages = [{
    'role': 'user',
    'content': [{
        'type': 'text',
        'text': 'Describe the image please',
    }, {
        'type': 'image_url',
        'image_url': {
            'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg',
        },
    }],
}]
for item in api_client.chat_completions_v1(model=model_name,
                                           messages=messages):
    print(item)

Integrate with Java/Golang/Rust

You may use openapi-generator-cli to convert http://{server_ip}:{server_port}/openapi.json into a Java/Rust/Golang client. Here is an example:

$ docker run -it --rm -v ${PWD}:/local openapitools/openapi-generator-cli generate -i /local/openapi.json -g rust -o /local/rust

$ ls rust/*
rust/Cargo.toml  rust/git_push.sh  rust/README.md

rust/docs:
ChatCompletionRequest.md  EmbeddingsRequest.md  HttpValidationError.md  LocationInner.md  Prompt.md
DefaultApi.md             GenerateRequest.md    Input.md                Messages.md       ValidationError.md

rust/src:
apis  lib.rs  models