Deployment Guide of openPangu-VL-7B Based on vllm-ascend
Deployment Environment Description
On Atlas 800T A2 (64GB), openPangu-VL-7B can be deployed on 1, 2, 4, or 8 cards.
Docker Boot and Inference Code
Use the vllm-ascend community image v0.9.1.
Pull the image with the following command:
docker pull quay.io/ascend/vllm-ascend:v0.9.1
The following operations need to be performed on each node. Start the container.
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.9.1 # Use correct image id
export NAME=vllm-ascend # Custom docker name
# Run the container using the defined variables
# Note: if you are running Docker with a bridge network, please expose the ports needed for multi-node communication in advance
# To prevent device interference from other docker containers, add the argument "--privileged"
docker run --rm \
--name $NAME \
--network host \
--ipc=host \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /mnt/sfs_turbo/.cache:/root/.cache \
-it $IMAGE bash
Ensure that the model checkpoint and the project code are accessible within the container. If not inside the container, enter the container as the root user:
docker exec -itu root $NAME /bin/bash
PD Aggregation Inference
Example startup script: LOAD_CKPT_DIR=xxx bash examples/start_serving_openpangu_vl_7b.sh. This example script uses 8 cards (TENSOR_PARALLEL_SIZE_LOCAL=8) to deploy the openPangu-VL-7B model. After the service starts, send requests to the first node (the master node).
Send Testing Requests
After the service is started, you can send test requests. It is recommended to use the system prompt provided in the examples.
Example: image + text
import base64
import json
import os

import requests
def encode_image_to_base64(img_path, img_name):
    # Load an image file and return it as a base64-encoded string
    try:
        with open(os.path.join(img_path, img_name), 'rb') as img_file:
            img_data = img_file.read()
        base64_str = base64.b64encode(img_data).decode('utf-8')
        return base64_str
    except Exception as e:
        print(f"image load failed: {e}")
        return None
base64_image = encode_image_to_base64("/image_path", "image_name.jpg")
payload_image_example = json.dumps({
    "messages": [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are a multimodal large model developed by Huawei, named openPangu-VL-7B. You can process both text and visual inputs and generate text outputs."},
            ]
        },
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}},
                {"type": "text", "text": "Please describe this picture."},
            ]
        }
    ],
    "model": "pangu_vl",
    "max_tokens": 500,
    "temperature": 1.0,
    "stream": False,
})
url = "http://127.0.0.1:8000/v1/chat/completions"
headers = {
    'Content-Type': 'application/json'
}
response_image_example = requests.post(url, headers=headers, data=payload_image_example)
print(f"the response of image example is {response_image_example.text}")
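The server replies in the OpenAI-compatible chat-completions format. As a minimal sketch of pulling the generated text out of the response body (the `sample_response_text` here is illustrative, not real model output, and `extract_reply` is our own helper name):

```python
import json

# Illustrative response body in the OpenAI-compatible chat-completions
# shape; a real reply from the service will carry different content.
sample_response_text = json.dumps({
    "choices": [
        {"message": {"role": "assistant", "content": "A cat sitting on a windowsill."}}
    ]
})

def extract_reply(response_text):
    # Extract the assistant message from a chat-completions response body
    data = json.loads(response_text)
    return data["choices"][0]["message"]["content"]

print(extract_reply(sample_response_text))
```

In the examples above you would call `extract_reply(response_image_example.text)` instead of feeding it the sample string.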
Example: video + text
import base64
import json
import os

import requests
def encode_video_to_base64(video_path, video_name):
    # Load a video file and return it as a base64-encoded string
    try:
        with open(os.path.join(video_path, video_name), 'rb') as video_file:
            video_data = video_file.read()
        base64_str = base64.b64encode(video_data).decode('utf-8')
        return base64_str
    except Exception as e:
        print(f"video load failed: {e}")
        return None
base64_video = encode_video_to_base64("/video_path", "video_name.mp4")
payload_video_example = json.dumps({
    "messages": [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are a multimodal large model developed by Huawei, named openPangu-VL-7B. You can process both text and visual inputs and generate text outputs."},
            ]
        },
        {
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": f"data:video/mp4;base64,{base64_video}"}},
                {"type": "text", "text": "Please describe this video."},
            ]
        }
    ],
    "model": "pangu_vl",
    "max_tokens": 500,
    "temperature": 1.0,
    "stream": False,
})
url = "http://127.0.0.1:8000/v1/chat/completions"
headers = {
    'Content-Type': 'application/json'
}
response_video_example = requests.post(url, headers=headers, data=payload_video_example)
print(f"the response of video example is {response_video_example.text}")
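Both examples set "stream": False. With "stream": True the server instead streams the completion as server-sent events, one `data: {...}` line per chunk with a final `data: [DONE]`. A minimal sketch of a chunk parser, assuming the OpenAI-compatible streaming format (`parse_sse_chunk` is our own helper, and the `sample` line is illustrative):

```python
import json

def parse_sse_chunk(line):
    # Decode one server-sent-events line from a streaming response;
    # returns the delta text, or None for blank lines and [DONE].
    if not line or not line.startswith(b"data: "):
        return None
    payload = line[len(b"data: "):]
    if payload.strip() == b"[DONE]":
        return None
    chunk = json.loads(payload)
    return chunk["choices"][0]["delta"].get("content")

# Illustrative chunk in the shape a streaming server sends back
sample = b'data: {"choices": [{"delta": {"content": "Hello"}}]}'
print(parse_sse_chunk(sample))
```

With requests, pass stream=True to requests.post and feed each line of response.iter_lines() through such a parser.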
128k Sequence Video Inference
Add the following fields to /preprocessor_config.json; the input video will then be sampled into 768 frames.
"num_frames": 768,
"sample_fps": -1.0
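The two fields above can also be merged into an existing preprocessor_config.json programmatically. A small sketch, where the function name is ours and `config_path` is assumed to point at the checkpoint's preprocessor_config.json:

```python
import json

def add_video_sampling_fields(config_path):
    # Merge the 128k video-sampling fields into preprocessor_config.json,
    # keeping every other entry in the file untouched.
    with open(config_path, 'r', encoding='utf-8') as f:
        config = json.load(f)
    config["num_frames"] = 768   # values taken from the fields listed above
    config["sample_fps"] = -1.0
    with open(config_path, 'w', encoding='utf-8') as f:
        json.dump(config, f, indent=2)
```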
In the startup script (/inference/vllm_ascend/examples/start_serving_openpangu_vl_7b.sh), set the parameters as follows:
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
MAX_MODEL_LEN=128000
MAX_NUM_BATCHED_TOKENS=100000
GPU_MEMORY_UTILIZATION=0.7
--no-enable-chunked-prefill \
--no-enable-prefix-caching \
Int8 Inference
ModelSlim Quantization
The openPangu-VL-7B model supports quantization with the open-source ModelSlim quantization framework; please refer to [ModelSlim_openPangu-VL-7B_README]. The current model supports W8A8 quantization.
openPangu-VL-7B W8A8 Dynamic quantization
export QUANT_PATH=your_quant_save_dir
export MODEL_PATH=your_model_ckpt_dir
export CALI_DATASET=your_cali_dataset_dir
python quant_pangu_vl.py \
--model_path $MODEL_PATH --calib_images $CALI_DATASET \
--save_directory $QUANT_PATH --w_bit 8 --a_bit 8 --device_type npu \
--trust_remote_code True --anti_method m2 --act_method 3 --is_dynamic True
Compared with the BF16 model, the following field is added to the config.json file of the int8 quantized model:
"quantize": "w8a8_dynamic",
After the ModelSlim quantization script generates the quantized model, this field is added to config.json automatically.
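Since the quantized checkpoint is distinguished by this config.json field, one way to check which kind of checkpoint a directory holds is a small sketch like the following (the helper name is ours):

```python
import json
import os

def detect_quantization(ckpt_dir):
    # Return the "quantize" field from the checkpoint's config.json,
    # e.g. "w8a8_dynamic" for an int8 model, or None for a BF16 model.
    with open(os.path.join(ckpt_dir, "config.json"), encoding='utf-8') as f:
        config = json.load(f)
    return config.get("quantize")
```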
Int8 Inference
Compared with BF16 model inference, int8 quantized model inference uses the same startup script, requiring only:
- reducing the number of nodes and NPUs;
- modifying the model checkpoint path.