lbourdois committed · Commit 30020cf · verified · Parent: 42d18fe

Improve language tag


Hi! Since the model is multilingual, this PR adds languages other than English to the language tag to improve referencing. Note that the README announces 29 languages, but only 13 are explicitly listed, so I was only able to add those 13.

Files changed (1)

1. README.md +269 -258

README.md (after the change; the previous frontmatter listed only `zh` and `en` under `language:`):
---
license: mit
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
pipeline_tag: text-generation
library_name: transformers
tags:
- Context
- Qwen2.5-1.5B
---

# Qwen2.5-1.5B-Instruct-CTX-Int8

This version of Qwen2.5-1.5B-Instruct-CTX-Int8 has been converted to run on the Axera NPU using **w8a16** quantization.

Compatible with Pulsar2 version 4.0 (not yet released).

## Features

- Supports longer contexts; in this sample, 2.5k tokens
- Supports multi-turn context dialogue
- Supports caching the system prompt as a KV cache

## Conversion tool links

If you are interested in model conversion, you can try exporting an axmodel from the original repo: https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GPTQ-Int4

[Pulsar2: How to convert an LLM from Hugging Face to axmodel](https://pulsar2-docs.readthedocs.io/en/latest/appendix/build_llm.html)

[AXera NPU AXEngine LLM Runtime](https://github.com/ZHEQIUSHUI/ax-llm/tree/prefill_kvcaches_context)

[AXera NPU AXCL LLM Runtime](https://github.com/ZHEQIUSHUI/ax-llm/tree/axcl-context-kvcache)

## Supported Platforms

- AX650
  - AX650N DEMO Board
  - [M4N-Dock(爱芯派Pro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
  - [M.2 Accelerator card](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)
- AX630C
  - *TBD*

|Chips|w8a16|w4a16| DDR | Flash |
|--|--|--|--|--|
|AX650| 11 tokens/sec | *TBD* | 2.3 GB | 2.3 GB |

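The 2.3 GB figure in the table is consistent with a rough estimate of the w8a16 weight footprint. A back-of-envelope sketch; the total parameter count and storage layout below are assumptions, not taken from this repo:

```python
# Rough estimate of the weight footprint under w8a16.
# Assumptions: ~1.54B total parameters (Qwen2.5-1.5B class model), the
# 151936 x 1536 embedding table kept in bfloat16 (per the
# model.embed_tokens.weight.bfloat16.bin file used below), and all
# remaining weights stored as 8-bit.

VOCAB, HIDDEN = 151_936, 1_536
TOTAL_PARAMS = 1.54e9  # assumed, not from the repo

embed_bytes = VOCAB * HIDDEN * 2               # bf16: 2 bytes/param
other_bytes = (TOTAL_PARAMS - VOCAB * HIDDEN)  # int8: 1 byte/param

total_gb = (embed_bytes + other_bytes) / 1e9
print(f"embedding: {embed_bytes / 1e9:.2f} GB, weights total: {total_gb:.2f} GB")
```

Weights alone come to roughly 1.8 GB under these assumptions; the remaining headroom up to 2.3 GB plausibly covers the KV cache and runtime buffers.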
## How to use

Download all files from this repository to the device.

```
root@ax650:/mnt/qtang/llm-test/Qwen2.5-1.5B-Instruct-CTX-Int8# tree -L 1
.
├── kvcache
├── main
├── main_axcl_aarch64
├── main_axcl_x86
├── post_config.json
├── qwen2.5-1.5b-ctx-ax650
├── qwen2.5_tokenizer
├── qwen2.5_tokenizer_uid.py
├── run_qwen2.5_1.5b_ctx_ax650.sh
├── run_qwen2.5_1.5b_ctx_axcl_aarch64.sh
└── run_qwen2.5_1.5b_ctx_axcl_x86.sh
```

#### Start the Tokenizer service

```
root@ax650:/mnt/qtang/llm-test/Qwen2.5-1.5B-Instruct-CTX-Int8# python qwen2.5_tokenizer_uid.py
Server running at http://0.0.0.0:12345
```
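Before launching the runtime, it can be useful to confirm the tokenizer service is actually listening on port 12345. A minimal sketch; the `is_port_open` helper is hypothetical, not part of this repo:

```python
# Hypothetical helper: check that the tokenizer service started above is
# accepting TCP connections before launching the main runtime.
import socket

def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. is_port_open("127.0.0.1", 12345) before running the run_*.sh script
```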

#### System prompt cache

- The system prompt can be preset via the `--system_prompt` option
- The system prompt can be cached as a KV cache in the folder given by `--kvcache_path`, so it can be loaded quickly on the next run
- This folder must be created manually before running, for example `mkdir kvcache`

```
(base) axera@raspberrypi:~/samples/qwen2.5-1.5b-ctx $ cat run_qwen2.5_1.5b_ctx_axcl_aarch64.sh
./main_axcl_aarch64 \
--system_prompt "你的名字叫小智(allen),你是一个人畜无害的AI助手。深圳市今天(4月1日)阴天,愚人节,气温在14°C至19°C之间,微风。" \
--kvcache_path "./kvcache" \
--template_filename_axmodel "qwen2.5-1.5b-ctx-ax650/qwen2_p128_l%d_together.axmodel" \
--axmodel_num 28 \
--tokenizer_type 2 \
--url_tokenizer_model "http://127.0.0.1:12345" \
--filename_post_axmodel "qwen2.5-1.5b-ctx-ax650/qwen2_post.axmodel" \
--filename_tokens_embed "qwen2.5-1.5b-ctx-ax650/model.embed_tokens.weight.bfloat16.bin" \
--tokens_embed_num 151936 \
--tokens_embed_size 1536 \
--use_mmap_load_embed 1 \
--live_print 1 \
--devices 0
```
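In the script above, `--template_filename_axmodel` contains a `%d` placeholder that the runtime fills in for each of the `--axmodel_num` (28) per-layer models, presumably with layer indices 0 through 27. A sketch of the filenames this would expand to:

```python
# Expand the per-layer axmodel filename template the same way the runtime
# presumably does: one file per transformer layer, indexed 0..axmodel_num-1.
template = "qwen2.5-1.5b-ctx-ax650/qwen2_p128_l%d_together.axmodel"
axmodel_num = 28  # value passed via --axmodel_num above

layer_files = [template % i for i in range(axmodel_num)]
print(layer_files[0])    # first per-layer model file
print(len(layer_files))  # 28
```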

#### Inference with AX650 Host, such as M4N-Dock(爱芯派Pro) or AX650N DEMO Board

Open another terminal and run `run_qwen2.5_1.5b_ctx_ax650.sh`.

```
root@ax650:/mnt/qtang/llm-test/Qwen2.5-1.5B-Instruct-CTX-Int8# mkdir -p kvcache
root@ax650:/mnt/qtang/llm-test/Qwen2.5-1.5B-Instruct-CTX-Int8# ./run_qwen2.5_1.5b_ctx_ax650.sh
[I][ Init][ 107]: LLM init start
[I][ Init][ 34]: connect http://127.0.0.1:12345 ok
bos_id: -1, eos_id: 151645
3% | ██ | 1 / 31 [0.21s<6.39s, 4.85 count/s] tokenizer init ok
[I][ Init][ 26]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ | 31 / 31 [5.04s<5.04s, 6.15 count/s] init post axmodel ok,remain_cmm(9656 MB)
[I][ Init][ 185]: max_token_len : 2559
[I][ Init][ 190]: kv_cache_size : 256, kv_cache_num: 2559
[I][ Init][ 198]: prefill_token_num : 128
[I][ Init][ 202]: grp: 1, prefill_max_token_num : 1
[I][ Init][ 202]: grp: 2, prefill_max_token_num : 512
[I][ Init][ 202]: grp: 3, prefill_max_token_num : 1024
[I][ Init][ 202]: grp: 4, prefill_max_token_num : 1536
[I][ Init][ 202]: grp: 5, prefill_max_token_num : 2048
[I][ load_config][ 282]: load config:
{
"enable_repetition_penalty": false,
"enable_temperature": true,
"enable_top_k_sampling": true,
"enable_top_p_sampling": false,
"penalty_window": 20,
"repetition_penalty": 1.2,
"temperature": 0.9,
"top_k": 10,
"top_p": 0.8
}

[I][ Init][ 213]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
[E][ load_kvcache][ 101]: k_cache ./kvcache/k_cache_0.bin or v_cache ./kvcache/v_cache_0.bin not exist
[W][ main][ 217]: load kvcache from path: ./kvcache failed,generate kvcache
100% | ████████████████████████████████ | 53 / 53 [4.12s<4.12s, 12.85 token/s]
[I][ GetKVCache][ 325]: precompute_len:53
[I][ main][ 224]: generate kvcache to path: ./kvcache
[I][ main][ 226]: precompute_len: 53
[I][ main][ 227]: system_prompt: 你的名字叫小智(allen),你是一个人畜无害的AI助手。深圳市今天(4月1日)阴天,愚人节,气温在14°C至19°C之间,微风。
prompt >> who are you?
[I][ SetKVCache][ 354]: prefill_grpid:2 kv_cache_num:512 precompute_len:53 input_num_token:12
[I][ Run][ 527]: input_embed_num(12)
[I][ Run][ 642]: ttft: 537.06 ms
我是Allen,一个能够回答问题、提供信息和执行任务的虚拟助手。我可以帮助你解决各种问题、做计划、玩游戏、甚至是进行一些娱乐活动。请问有什么我能帮助你的吗?

[N][ Run][ 756]: hit eos,avg 11.09 token/s

[I][ GetKVCache][ 325]: precompute_len:108
prompt >> 今天是几号,天气怎么样
[I][ SetKVCache][ 354]: prefill_grpid:2 kv_cache_num:512 precompute_len:108 input_num_token:15
[I][ Run][ 527]: input_embed_num(15)
[I][ Run][ 642]: ttft: 536.81 ms
今天是4月1日,愚人节。根据您所描述的深圳天气情况,气温在14°C至19°C之间,气温较低,建议穿着适当。希望您今天愉快!

[N][ Run][ 756]: hit eos,avg 11.17 token/s

[I][ GetKVCache][ 325]: precompute_len:166
```
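In the log above, the first run fails to load `./kvcache` (the `[E]`/`[W]` lines) and regenerates it; later runs can reuse the cached files. A hypothetical helper (not part of the repo) that checks whether a cache looks complete, assuming one `k_cache_%d.bin`/`v_cache_%d.bin` pair per layer as the logged filenames suggest:

```python
# Hypothetical helper: decide whether a previously generated system-prompt
# KV cache is present under --kvcache_path, assuming one k/v file pair per
# transformer layer (indexed like the per-layer axmodel files).
from pathlib import Path

def kvcache_ready(path: str, num_layers: int = 28) -> bool:
    """True if every layer's k_cache_<i>.bin and v_cache_<i>.bin exist."""
    d = Path(path)
    return all(
        (d / f"k_cache_{i}.bin").exists() and (d / f"v_cache_{i}.bin").exists()
        for i in range(num_layers)
    )
```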

#### Inference with M.2 Accelerator card

[What is the M.2 Accelerator card?](https://axcl-pi5-examples-cn.readthedocs.io/zh-cn/latest/index.html) This demo runs on a Raspberry Pi 5.

```
(base) axera@raspberrypi:~/samples/Qwen2.5-1.5B-Instruct-CTX-Int8 $ mkdir -p kvcache
(base) axera@raspberrypi:~/samples/Qwen2.5-1.5B-Instruct-CTX-Int8 $ ./run_qwen2.5_1.5b_ctx_axcl_aarch64.sh
[I][ Init][ 134]: LLM init start
[I][ Init][ 41]: connect http://127.0.0.1:12345 ok
bos_id: -1, eos_id: 151645
3% | ██ | 1 / 31 [0.46s<14.11s, 2.20 count/s] tokenizer init ok
[I][ Init][ 45]: LLaMaEmbedSelector use mmap
6% | ███ | 2 / 31 [0.46s<7.05s, 4.40 count/s] embed_selector init ok
[I][ run][ 30]: AXCLWorker start with devid 0
100% | ████████████████████████████████ | 31 / 31 [29.18s<29.18s, 1.06 count/s] init post axmodel ok,remain_cmm(-1 MB)
[I][ Init][ 235]: max_token_len : 2559
[I][ Init][ 238]: kv_cache_size : 256, kv_cache_num: 2559
[I][ Init][ 246]: prefill_token_num : 128
[I][ Init][ 250]: grp: 1, prefill_max_token_num : 1
[I][ Init][ 250]: grp: 2, prefill_max_token_num : 512
[I][ Init][ 250]: grp: 3, prefill_max_token_num : 1024
[I][ Init][ 250]: grp: 4, prefill_max_token_num : 1536
[I][ Init][ 250]: grp: 5, prefill_max_token_num : 2048
________________________
| ID| remain cmm(MB)|
========================
| 0| -1|
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
[I][ load_config][ 282]: load config:
{
"enable_repetition_penalty": false,
"enable_temperature": true,
"enable_top_k_sampling": true,
"enable_top_p_sampling": false,
"penalty_window": 20,
"repetition_penalty": 1.2,
"temperature": 0.9,
"top_k": 10,
"top_p": 0.8
}

[I][ Init][ 275]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
[E][ load_kvcache][ 100]: k_cache ./kvcache/k_cache_0.bin or v_cache ./kvcache/v_cache_0.bin not exist
[W][ main][ 223]: load kvcache from path: ./kvcache failed,generate kvcache
100% | ████████████████████████████████ | 53 / 53 [5.06s<5.06s, 10.47 token/s]
[I][ GetKVCache][ 419]: precompute_len:53
[I][ main][ 230]: generate kvcache to path: ./kvcache
[I][ main][ 232]: precompute_len: 53
[I][ main][ 233]: system_prompt: 你的名字叫小智(allen),你是一个人畜无害的AI助手。深圳市今天(4月1日)阴天,愚人节,气温在14°C至19°C之间,微风。
prompt >> 你是谁
[I][ SetKVCache][ 448]: prefill_grpid:2 kv_cache_num:512 precompute_len:53 input_num_token:10
[I][ Run][ 722]: input token num : 10
[I][ Run][ 823]: ttft: 548.23 ms
我是深圳市气象局发布的天气预报,我叫小智,是为了解答大家关于天气的问题而设计的。如果你对天气有疑问,欢迎随时询问!

[N][ Run][ 975]: hit eos,avg 9.04 token/s

[I][ GetKVCache][ 419]: precompute_len:98
prompt >> 你能干什么
[I][ SetKVCache][ 448]: prefill_grpid:2 kv_cache_num:512 precompute_len:98 input_num_token:10
[I][ Run][ 722]: input token num : 10
[I][ Run][ 823]: ttft: 548.07 ms
我能回答关于天气、生活、科技、文化、娱乐、历史等方面的很多问题。如果你有任何想知道的内容,都可以问我哦!

[N][ Run][ 975]: hit eos,avg 9.03 token/s

[I][ GetKVCache][ 419]: precompute_len:135
prompt >> q
[I][ run][ 80]: AXCLWorker exit with devid 0


>> q

(base) axera@raspberrypi:~ $ axcl-smi
+------------------------------------------------------------------------------------------------+
| AXCL-SMI V2.25.0_20250117163029 Driver V2.25.0_20250117163029 |
+-----------------------------------------+--------------+---------------------------------------+
| Card Name Firmware | Bus-Id | Memory-Usage |
| Fan Temp Pwr:Usage/Cap | CPU NPU | CMM-Usage |
|=========================================+==============+=======================================|
| 0 AX650N V2.25.0 | 0000:01:00.0 | 188 MiB / 945 MiB |
| -- 37C -- / -- | 1% 0% | 2335 MiB / 7040 MiB |
+-----------------------------------------+--------------+---------------------------------------+

+------------------------------------------------------------------------------------------------+
| Processes: |
| Card PID Process Name NPU Memory Usage |
|================================================================================================|
| 0 147835 /home/axera/samples/qwen2.5-1.5b-ctx/main_axcl_aarch64 1990172 KiB |
+------------------------------------------------------------------------------------------------+
(base) axera@raspberrypi:~ $
```
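The throughput figures in the logs are simply tokens divided by wall-clock time; for example, the M.2 prefill line above reports 53 tokens in 5.06 s:

```python
# Reproduce the throughput arithmetic from the M.2 prefill progress line:
# 53 tokens processed in 5.06 seconds of wall-clock time.
tokens, seconds = 53, 5.06
print(f"{tokens / seconds:.2f} token/s")  # prints 10.47 token/s, matching the log
```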