## Deployment Guide of openPangu-VL-7B Based on [vllm-ascend](https://github.com/vllm-project/vllm-ascend)

### Deployment Environment Description

On Atlas 800T A2 (64GB), openPangu-VL-7B can be deployed on 1, 2, 4, or 8 cards.

### Docker Boot and Inference Code

Use the vllm-ascend community image v0.9.1. Pull the image with the following command:

```bash
docker pull quay.io/ascend/vllm-ascend:v0.9.1
```

The following operations need to be performed on each node. Start the container:

```bash
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.9.1  # Use the correct image ID
export NAME=vllm-ascend  # Custom docker name

# Run the container using the defined variables.
# Note: if you are running a bridge network with docker, expose the ports needed
# for multi-node communication in advance.
# To prevent device interference from other docker containers, add the argument "--privileged".
docker run --rm \
  --name $NAME \
  --network host \
  --ipc=host \
  --device /dev/davinci0 \
  --device /dev/davinci1 \
  --device /dev/davinci2 \
  --device /dev/davinci3 \
  --device /dev/davinci4 \
  --device /dev/davinci5 \
  --device /dev/davinci6 \
  --device /dev/davinci7 \
  --device /dev/davinci_manager \
  --device /dev/devmm_svm \
  --device /dev/hisi_hdc \
  -v /usr/local/dcmi:/usr/local/dcmi \
  -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
  -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
  -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
  -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
  -v /etc/ascend_install.info:/etc/ascend_install.info \
  -v /mnt/sfs_turbo/.cache:/root/.cache \
  -it $IMAGE bash
```

Ensure that the model checkpoint and the project code are accessible within the container. If you are not inside the container, enter it as the root user:

```bash
docker exec -itu root $NAME /bin/bash
```

### PD Aggregation Inference

Example startup script: `LOAD_CKPT_DIR=xxx bash examples/start_serving_openpangu_vl_7b.sh`. This example script requires 8 cards (`TENSOR_PARALLEL_SIZE_LOCAL=8`) to deploy the openPangu-VL-7B model. After starting the service, requests can be sent to the first node (master node).

### Send Testing Requests

After the service is started, we can send testing requests. It is recommended to use the system prompt provided in the examples.
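Before sending a full request, you can optionally verify that the server is up. vLLM's OpenAI-compatible server exposes a `GET /v1/models` endpoint, so a quick readiness check from the master node might look like this (port 8000 matches the `url` used in the examples below):

```bash
# Query the served model list; a JSON response listing "pangu_vl" indicates the server is ready.
curl http://127.0.0.1:8000/v1/models
```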
Example: image + text

```python
import base64
import json
import os

import requests


def encode_image_to_base64(img_path, img_name):
    # Load the image file and encode it as a base64 string.
    try:
        with open(os.path.join(img_path, img_name), 'rb') as img_file:
            img_data = img_file.read()
        base64_str = base64.b64encode(img_data).decode('utf-8')
        return base64_str
    except Exception as e:
        print(f"image load failed: {e}")
        return None


base64_image = encode_image_to_base64("/image_path", "image_name.jpg")

payload_image_example = json.dumps({
    "messages": [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are a multimodal large model developed by Huawei, named openPangu-VL-7B. You can process both text and visual inputs and generate text outputs."},
            ]
        },
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}},
                {"type": "text", "text": "Please describe this picture."},
            ]
        }
    ],
    "model": "pangu_vl",
    "max_tokens": 500,
    "temperature": 1.0,
    "stream": False,
})

url = "http://127.0.0.1:8000/v1/chat/completions"
headers = {'Content-Type': 'application/json'}
response_image_example = requests.post(url, headers=headers, data=payload_image_example)
print(f"the response of image example is {response_image_example.text}")
```

Example: video + text

```python
import base64
import json
import os

import requests


def encode_video_to_base64(video_path, video_name):
    # Load the video file and encode it as a base64 string.
    try:
        with open(os.path.join(video_path, video_name), 'rb') as video_file:
            video_data = video_file.read()
        base64_str = base64.b64encode(video_data).decode('utf-8')
        return base64_str
    except Exception as e:
        print(f"video load failed: {e}")
        return None


base64_video = encode_video_to_base64("/video_path", "video_name.mp4")

payload_video_example = json.dumps({
    "messages": [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are a multimodal large model developed by Huawei, named openPangu-VL-7B. You can process both text and visual inputs and generate text outputs."},
            ]
        },
        {
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": f"data:video/mp4;base64,{base64_video}"}},
                {"type": "text", "text": "Please describe this video."},
            ]
        }
    ],
    "model": "pangu_vl",
    "max_tokens": 500,
    "temperature": 1.0,
    "stream": False,
})

url = "http://127.0.0.1:8000/v1/chat/completions"
headers = {'Content-Type': 'application/json'}
response_video_example = requests.post(url, headers=headers, data=payload_video_example)
print(f"the response of video example is {response_video_example.text}")
```

### 128k Sequence Video Inference

Add the following fields to `/preprocessor_config.json`; the input video will then be sampled into 768 frames:

```
"num_frames": 768,
"sample_fps": -1.0
```

In the startup script (`/inference/vllm_ascend/examples/start_serving_openpangu_vl_7b.sh`), set the parameters as follows:

```
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
MAX_MODEL_LEN=128000
MAX_NUM_BATCHED_TOKENS=100000
GPU_MEMORY_UTILIZATION=0.7
--no-enable-chunked-prefill \
--no-enable-prefix-caching \
```

### Int8 Inference

#### ModelSlim Quantization

The openPangu-VL-7B model supports quantization with the open-source ModelSlim quantization framework; please refer to [[ModelSlim_openPangu-VL-7B_README]](https://gitcode.com/Ascend/msit/blob/msModelslim_Pangu_VL/msmodelslim/example/multimodal_vlm/openPangu-VL/ReadMe.md). The current model supports W8A8 quantization.

##### openPangu-VL-7B W8A8 Dynamic Quantization

```bash
export QUANT_PATH=your_quant_save_dir
export MODEL_PATH=your_model_ckpt_dir
export CALI_DATASET=your_cali_dataset_dir

python quant_pangu_vl.py \
    --model_path $MODEL_PATH --calib_images $CALI_DATASET \
    --save_directory $QUANT_PATH --w_bit 8 --a_bit 8 --device_type npu \
    --trust_remote_code True --anti_method m2 --act_method 3 --is_dynamic True
```

Compared with the BF16 model, the following field is added to the config.json file of the int8 quantized model:

```
"quantize": "w8a8_dynamic",
```

After the ModelSlim quantization script generates a quantized model, the preceding field is added to config.json automatically.
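As a quick sanity check before serving the quantized checkpoint, you can confirm that the flag was actually written (`QUANT_PATH` is the save directory from the quantization step above):

```bash
# Print the quantization flag from the quantized checkpoint's config.json.
grep '"quantize"' "$QUANT_PATH/config.json"
# Expected output: "quantize": "w8a8_dynamic",
```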
#### Int8 Inference

Compared to BF16 model inference, int8 quantized model inference uses the same startup script, requiring only:

* reducing the number of nodes and cards (see the sketch below);
* modifying the model checkpoint path.
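For illustration, here is a minimal launch sketch. It assumes the startup script reads `LOAD_CKPT_DIR` and `TENSOR_PARALLEL_SIZE_LOCAL` from the environment, as suggested by the PD aggregation example above; the checkpoint path and card count are hypothetical placeholders:

```bash
# Serve the W8A8 quantized checkpoint on 4 cards of a single node.
# /path/to/w8a8_quant_ckpt is a placeholder for your quantized checkpoint directory.
LOAD_CKPT_DIR=/path/to/w8a8_quant_ckpt \
TENSOR_PARALLEL_SIZE_LOCAL=4 \
bash examples/start_serving_openpangu_vl_7b.sh
```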