Running Inference with TensorRT-LLM on Triton Server in Containers

Author: 陳少文 (Chen Shaowen), 2024-02-04

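1. Compiling a Model with TensorRT-LLM

1.1 Introduction to TensorRT-LLM

To use TensorRT, you usually convert the model to ONNX, convert the ONNX model to a TensorRT engine, and then run inference in TensorRT or Triton Server.

This conversion is not easy. It often fails with assorted errors, and fixing them requires familiarity with the model architecture and the platform's operators, plus the ability to debug the conversion. TensorRT-LLM aims to reduce that complexity and make it easier to run large models on the TensorRT engine.

Note that TensorRT targets specific hardware: each GPU model needs its own compiled TensorRT engine. This is very different from ONNX, which is positioned as a portable model format.

In addition, TensorRT-LLM does not support every GPU model; it only supports cards such as the H100, L40S, A100, A30, and V100.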
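Because an engine is tied to the GPU it was built for, it can help to keep engines for different cards in separate directories. The following is a minimal sketch of that idea, not part of TensorRT-LLM: the function name `engine_dir_for` and the `<model>-trt-engines-<gpu>` naming scheme are assumptions made for illustration, and the supported list is simply the one quoted in this article.

```python
import re

# GPU families this article lists as supported (the article's list is not exhaustive).
SUPPORTED_GPUS = ("H100", "L40S", "A100", "A30", "V100")


def engine_dir_for(model_name: str, gpu_name: str) -> str:
    """Return a per-GPU engine directory name (hypothetical naming scheme)."""
    # Split "NVIDIA A100-SXM4-40GB" into ["NVIDIA", "A100", "SXM4", "40GB"] so that
    # "A30" cannot accidentally match a different card such as "A3000".
    tokens = re.split(r"[\s\-]+", gpu_name.upper())
    for family in SUPPORTED_GPUS:
        if family in tokens:
            return f"{model_name}-trt-engines-{family.lower()}"
    raise ValueError(f"{gpu_name!r} is not in the supported list {SUPPORTED_GPUS}")


print(engine_dir_for("Baichuan2-7B", "NVIDIA A100-SXM4-40GB"))
```

The GPU name string would typically come from a tool such as nvidia-smi; here it is passed in by hand.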

1.2 Setting Up the Build Environment

docker run --gpus device=0 -v $PWD:/app/tensorrt_llm/models -it --rm hubimage/nvidia-tensorrt-llm:v0.7.1 bash

--gpus device=0 selects GPU 0. The image hubimage/nvidia-tensorrt-llm:v0.7.1 corresponds to the TensorRT-LLM v0.7.1 release.

Building the image yourself is tedious, so here are several prebuilt image versions to choose from:

hubimage/nvidia-tensorrt-llm:v0.7.1
hubimage/nvidia-tensorrt-llm:v0.7.0
hubimage/nvidia-tensorrt-llm:v0.6.1
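If the build environment is started from a script, the docker run command can be assembled as an argument list. This is only a sketch under stated assumptions: `build_env_command` is a hypothetical helper, the host directory is passed explicitly instead of relying on `$PWD`, and the tag list is the one given in this article.

```python
import shlex

# Image tags listed in this article.
AVAILABLE_TAGS = ("v0.7.1", "v0.7.0", "v0.6.1")


def build_env_command(models_dir: str, tag: str = "v0.7.1", gpu: int = 0) -> list[str]:
    """Build the docker run argument list for the TensorRT-LLM build container."""
    if tag not in AVAILABLE_TAGS:
        raise ValueError(f"unknown tag {tag!r}, expected one of {AVAILABLE_TAGS}")
    return [
        "docker", "run",
        "--gpus", f"device={gpu}",                      # which GPU the container sees
        "-v", f"{models_dir}:/app/tensorrt_llm/models",  # mount the host models directory
        "-it", "--rm",
        f"hubimage/nvidia-tensorrt-llm:{tag}",
        "bash",
    ]


# Print the command instead of running it, so it can be reviewed first.
print(shlex.join(build_env_command("/data/models")))
```

The list can be handed to `subprocess.run` as-is; printing it first makes it easy to compare with the command shown earlier.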

1.3 Building the TensorRT Engine

Inside the container started above, run:

python examples/baichuan/build.py --model_version v2_7b \
                --model_dir ./models/Baichuan2-7B-Chat \
                --dtype float16 \
                --parallel_build \
                --use_inflight_batching \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --use_gpt_attention_plugin float16 \
                --output_dir ./models/Baichuan2-7B-trt-engines

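Three main files are generated:

baichuan_float16_tp1_rank0.engine: the model's computation graph with the weights embedded
config.json: detailed configuration, including model structure, precision, and plugins
model.cache: the build cache, which speeds up subsequent builds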
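Before moving on, it can be worth confirming that the build actually produced these files. A minimal sketch follows; `check_engine_dir` is a hypothetical helper, and the file names it looks for are the ones listed above.

```python
from pathlib import Path


def check_engine_dir(engine_dir: str) -> dict[str, bool]:
    """Report which of the expected build outputs exist in engine_dir."""
    root = Path(engine_dir)
    return {
        # The engine file name encodes dtype and rank, so match by extension.
        "engine": any(root.glob("*.engine")),
        "config.json": (root / "config.json").is_file(),
        "model.cache": (root / "model.cache").is_file(),
    }
```

For example, `check_engine_dir("./models/Baichuan2-7B-trt-engines")` returns a dictionary whose values should all be True after a successful build.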

1.4 Inference Test

python examples/run.py --input_text "世界上第二高的山峰是哪座？" \
                 --max_output_len=200 \
                 --tokenizer_dir ./models/Baichuan2-7B-Chat \
                 --engine_dir=./models/Baichuan2-7B-trt-engines

[02/03/2024-10:02:58] [TRT-LLM] [W] Found pynvml==11.4.1. Please use pynvml>=11.5.0 to get accurate memory usage
Input [Text 0]: "世界上第二高的山峰是哪座？"
Output [Text 0 Beam 0]: "
珠穆朗瑪峰（Mount Everest）是地球上最高的山峰，海拔高度為8,848米（29,029英尺）。第二高的山峰是喀喇昆侖山脈的喬戈里峰（K2），海拔高度為8,611米（28,251英尺）。"

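1.5 Checking for Severe Degradation

Inference optimization can rely on techniques such as operator replacement, quantization, and stripping out backpropagation, but one baseline must hold: the model must not degrade significantly.

Optimization is only worthwhile when the accuracy loss stays within an acceptable range. The summarize.py script in the TensorRT-LLM project runs some tests and scores the model. rouge1, rouge2, rougeL, and rougeLsum are metrics for the quality of generated text, so they can be used to assess inference quality.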
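To make the metric concrete, here is a simplified ROUGE-N F1 computation. It is a sketch for intuition only: it uses plain whitespace tokenization and no stemming, so it will not reproduce the numbers from the rouge_score package that summarize.py relies on, and the function name `rouge_n_f1` is an assumption. It returns values between 0 and 1, whereas the scores in the logs in this article appear to be on a 0 to 100 scale.

```python
from collections import Counter


def rouge_n_f1(reference: str, candidate: str, n: int = 1) -> float:
    """Simplified ROUGE-N F1 based on overlapping word n-grams."""

    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, cand = ngrams(reference), ngrams(candidate)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())  # share of candidate n-grams found in the reference
    recall = overlap / sum(ref.values())      # share of reference n-grams found in the candidate
    return 2 * precision * recall / (precision + recall)


print(rouge_n_f1("the cat sat on the mat", "the cat"))
```

rouge1 and rouge2 correspond to n=1 and n=2; rougeL and rougeLsum are based on the longest common subsequence and are not covered by this sketch.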

Get the ROUGE metrics for the original-format model

pip install datasets nltk rouge_score -i https://pypi.tuna.tsinghua.edu.cn/simple

optimum does not currently support the Baichuan model, so you need to edit examples/summarize.py and comment out model.to_bettertransformer(). This has been fixed in the latest TensorRT-LLM code, but I am using the latest release at the time of writing (v0.7.1).

python examples/summarize.py --test_hf \
                    --hf_model_dir ./models/Baichuan2-7B-Chat \
                    --data_type fp16 \
                    --engine_dir ./models/Baichuan2-7B-trt-engines

Output:

[02/03/2024-10:21:45] [TRT-LLM] [I] Hugging Face (total latency: 31.27020287513733 sec)
[02/03/2024-10:21:45] [TRT-LLM] [I] HF beam 0 result
[02/03/2024-10:21:45] [TRT-LLM] [I]   rouge1 : 28.847385241217726
[02/03/2024-10:21:45] [TRT-LLM] [I]   rouge2 : 9.519352831698162
[02/03/2024-10:21:45] [TRT-LLM] [I]   rougeL : 20.85486489462602
[02/03/2024-10:21:45] [TRT-LLM] [I]   rougeLsum : 24.090111126907733

Get the ROUGE metrics for the TensorRT-format model

python examples/summarize.py --test_trt_llm \
                    --hf_model_dir ./models/Baichuan2-7B-Chat \
                    --data_type fp16 \
                    --engine_dir ./models/Baichuan2-7B-trt-engines

Output:

[02/03/2024-10:23:16] [TRT-LLM] [I] TensorRT-LLM (total latency: 28.360705375671387 sec)
[02/03/2024-10:23:16] [TRT-LLM] [I] TensorRT-LLM beam 0 result
[02/03/2024-10:23:16] [TRT-LLM] [I]   rouge1 : 26.557043897453102
[02/03/2024-10:23:16] [TRT-LLM] [I]   rouge2 : 8.28672928021811
[02/03/2024-10:23:16] [TRT-LLM] [I]   rougeL : 19.13639628365737
[02/03/2024-10:23:16] [TRT-LLM] [I]   rougeLsum : 22.0436013250798

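After compilation with TensorRT-LLM, rougeLsum drops from about 24.1 to 22.0, so there is some loss of capability. As long as the loss stays within an acceptable range, the model is still usable, because inference is faster (in this test, total latency fell from about 31.3 s to 28.4 s).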
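The two log excerpts can be compared mechanically. The sketch below parses the summarize.py output quoted in this article and reports the relative ROUGE drop and the latency ratio. The helper names `parse_summary` and `relative_drop_percent` are assumptions, and the regular expressions only target the log format shown here.

```python
import re

HF_LOG = """\
[02/03/2024-10:21:45] [TRT-LLM] [I] Hugging Face (total latency: 31.27020287513733 sec)
[02/03/2024-10:21:45] [TRT-LLM] [I]   rouge1 : 28.847385241217726
[02/03/2024-10:21:45] [TRT-LLM] [I]   rouge2 : 9.519352831698162
[02/03/2024-10:21:45] [TRT-LLM] [I]   rougeL : 20.85486489462602
[02/03/2024-10:21:45] [TRT-LLM] [I]   rougeLsum : 24.090111126907733
"""

TRT_LOG = """\
[02/03/2024-10:23:16] [TRT-LLM] [I] TensorRT-LLM (total latency: 28.360705375671387 sec)
[02/03/2024-10:23:16] [TRT-LLM] [I]   rouge1 : 26.557043897453102
[02/03/2024-10:23:16] [TRT-LLM] [I]   rouge2 : 8.28672928021811
[02/03/2024-10:23:16] [TRT-LLM] [I]   rougeL : 19.13639628365737
[02/03/2024-10:23:16] [TRT-LLM] [I]   rougeLsum : 22.0436013250798
"""


def parse_summary(log: str) -> dict[str, float]:
    """Extract total latency and ROUGE scores from summarize.py log text."""
    metrics: dict[str, float] = {}
    latency = re.search(r"total latency: ([\d.]+) sec", log)
    if latency:
        metrics["latency"] = float(latency.group(1))
    for name, value in re.findall(r"(rouge\w+) : ([\d.]+)", log):
        metrics[name] = float(value)
    return metrics


def relative_drop_percent(before: float, after: float) -> float:
    """Relative decrease from before to after, in percent."""
    return (before - after) / before * 100


hf, trt = parse_summary(HF_LOG), parse_summary(TRT_LOG)
for key in ("rouge1", "rouge2", "rougeL", "rougeLsum"):
    print(f"{key}: {relative_drop_percent(hf[key], trt[key]):.1f}% lower")
print(f"latency ratio (HF / TRT-LLM): {hf['latency'] / trt['latency']:.2f}")
```

What counts as an acceptable drop depends on the application; the script only puts the quality loss and the latency gain side by side.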

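Once this step is done, you can exit the container; inference runs in a different container.

2. Triton Server Configuration

2.1 Introduction to Triton Server

Triton Server is an inference framework that lets users run inference at scale. Its main features are:

Multiple backends: tensorrt, onnxruntime, pytorch, python, vllm, tensorrtllm, and others. Custom backends are also possible; all that is needed is the corresponding shared library.
HTTP and gRPC interfaces for external clients
Batching: inference can run on batches, and with dynamic batching enabled, multiple requests are merged and run together for higher throughput
Pipelines: one Triton Server can serve several models at once, models can be chained together, and Concurrent Model Execution lets them run in parallel
Observability: metrics are exposed so inference can be monitored in real time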

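[Figure: Triton Server architecture diagram]

The figure above shows the Triton Server architecture. Put simply, Triton Server is an end-to-end inference framework, from the model at one end to the application at the other. It manages the lifecycle around inference, so once the models are configured it can serve the application layer directly.

2.2 Configuring Triton Server

The examples from the Triton community usually contain these four directories: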

.
├── ensemble
│   ├── 1
│   └── config.pbtxt
├── postprocessing
│   ├── 1
│   │   └── model.py
│   └── config.pbtxt
├── preprocessing
│   ├── 1
│   │   └── model.py
│   └── config.pbtxt
└── tensorrt_llm
    ├── 1
    └── config.pbtxt

9 directories, 6 files

To Triton Server, this layout defines four models: preprocessing, tensorrt_llm, postprocessing, and ensemble. The ensemble is a composite model that chains the other models together.

The ensemble exists because tensorrt_llm itself does not do text-to-text inference. Using Triton Server's pipeline capability, preprocessing tokenizes the input and postprocessing detokenizes the output, which yields end-to-end inference. Otherwise, a client using TensorRT-LLM directly would have to handle the mapping between tokens and IDs in both directions itself.

The four models do the following:

preprocessing: preprocesses the input text, mainly tokenization into token IDs, similar to a text2vec step.
tensorrt_llm: runs vec2vec inference on the TensorRT engine, from input token IDs to output token IDs
postprocessing: postprocesses the output, for example alignment and truncation of the generated text, similar to a vec2text step.
ensemble: combines the three models above to provide text2text inference
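The division of labor between the four models can be sketched with plain functions. Everything here is a toy stand-in: the vocabulary is made up, and the `tensorrt_llm` function simply reverses its input instead of generating tokens. The point is only to show which stage handles text and which handles token IDs.

```python
# Toy vocabulary standing in for a real tokenizer (hypothetical, for illustration only).
VOCAB = {"<unk>": 0, "hello": 1, "world": 2}
ID_TO_TOKEN = {token_id: token for token, token_id in VOCAB.items()}


def preprocessing(text: str) -> list[int]:
    """text -> token IDs (tokenizing)."""
    return [VOCAB.get(token, VOCAB["<unk>"]) for token in text.lower().split()]


def tensorrt_llm(input_ids: list[int]) -> list[int]:
    """token IDs -> token IDs. A real engine would generate new tokens;
    this stand-in just reverses the input."""
    return list(reversed(input_ids))


def postprocessing(output_ids: list[int]) -> str:
    """token IDs -> text (detokenizing)."""
    return " ".join(ID_TO_TOKEN[token_id] for token_id in output_ids)


def ensemble(text: str) -> str:
    """text -> text, by chaining the three stages in order."""
    return postprocessing(tensorrt_llm(preprocessing(text)))


print(ensemble("hello world"))
```

In Triton the chaining is declared in the ensemble's config.pbtxt rather than written as a function, but the data flow is the same.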

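Each model above has a directory named 1, which stands for version 1. Model files go in the version directory, and a config.pbtxt in the model directory describes inference parameters such as input, output, and version.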
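A quick check of this layout can catch mistakes before the server is started. The sketch below assumes only the two rules just described (a config.pbtxt per model directory and at least one numeric version directory); `find_layout_problems` is a hypothetical helper and does not validate the contents of config.pbtxt.

```python
from pathlib import Path


def find_layout_problems(repo: str) -> list[str]:
    """Return a list of layout problems found in a model repository directory."""
    problems = []
    # Every subdirectory of the repository is treated as one model.
    for model_dir in sorted(p for p in Path(repo).iterdir() if p.is_dir()):
        if not (model_dir / "config.pbtxt").is_file():
            problems.append(f"{model_dir.name}: missing config.pbtxt")
        versions = [d for d in model_dir.iterdir() if d.is_dir() and d.name.isdigit()]
        if not versions:
            problems.append(f"{model_dir.name}: no numeric version directory")
    return problems
```

An empty list means every model directory has the expected shape.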

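2.3 Controlling Model Loading

Triton Server uses the --model-control-mode flag to control how models are loaded. There are currently three modes:

none: load all models in the repository
explicit: load only the specified models, named with the --load-model flag
poll: periodically poll the repository and load all models in it; the polling interval is set with --repository-poll-secs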
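The three modes translate into slightly different command lines. The sketch below builds the tritonserver arguments for each mode using the flags named above; `tritonserver_args` is a hypothetical helper, and the default polling interval of 15 seconds is an arbitrary value chosen for the example.

```python
def tritonserver_args(repo: str, mode: str = "none",
                      load_models: tuple[str, ...] = (), poll_secs: int = 15) -> list[str]:
    """Build tritonserver arguments for a given model control mode."""
    args = ["tritonserver", f"--model-repository={repo}", f"--model-control-mode={mode}"]
    if mode == "explicit":
        # One --load-model flag per model that should be loaded at startup.
        args += [f"--load-model={name}" for name in load_models]
    elif mode == "poll":
        args.append(f"--repository-poll-secs={poll_secs}")
    elif mode != "none":
        raise ValueError(f"unknown model control mode: {mode!r}")
    return args


print(" ".join(tritonserver_args("/models/Baichuan2-7B-Chat", "explicit", ("ensemble",))))
```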

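2.4 Model Version Management

In a model's config.pbtxt, Triton Server provides a Version Policy, so several versions of a model can coexist. By default only the latest version (the one with the highest version number) is served. There are currently three version policies:

Serve all versions at once

version_policy: { all: {}}

Serve only the latest n versions

version_policy: { latest: { num_versions: 3}}

Serve only the specified versions

version_policy: { specific: { versions: [1, 3, 5]}}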
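The effect of each policy can be modeled in a few lines. This is a simplified model of the selection rule, not Triton's implementation: `served_versions` is a hypothetical function, the policy is passed as a Python dictionary mirroring the config.pbtxt structure, and "latest" is taken to mean the numerically highest version numbers.

```python
def served_versions(available: list[int], policy: dict) -> list[int]:
    """Return which of the available versions a version policy would serve."""
    versions = sorted(available)
    if "all" in policy:
        return versions
    if "latest" in policy:
        n = policy["latest"].get("num_versions", 1)
        # Guard n == 0, because versions[-0:] would return the whole list.
        return versions[-n:] if n > 0 else []
    if "specific" in policy:
        wanted = set(policy["specific"]["versions"])
        return [v for v in versions if v in wanted]
    raise ValueError(f"unsupported version policy: {policy!r}")


print(served_versions([1, 2, 3, 5], {"latest": {"num_versions": 3}}))
```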

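3. Using TensorRT-LLM in Triton Server

3.1 Cloning the Configuration Files

The configuration used in this article is available on GitHub. After copying the models into the expected directories, you can run inference right away.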

git clone https://github.com/shaowenchen/modelops

3.2 Organizing the Inference Directory

Copy the TensorRT engine

cp Baichuan2-7B-trt-engines/* modelops/triton-tensorrtllm/Baichuan2-7B-Chat/tensorrt_llm/1/

Copy the source model

cp -r Baichuan2-7B-Chat modelops/triton-tensorrtllm/downloads

The directory structure now looks like this:

tree modelops/triton-tensorrtllm

modelops/triton-tensorrtllm
├── Baichuan2-7B-Chat
│   ├── end_to_end_grpc_client.py
│   ├── ensemble
│   │   ├── 1
│   │   └── config.pbtxt
│   ├── postprocessing
│   │   ├── 1
│   │   │   ├── model.py
│   │   │   └── __pycache__
│   │   │       └── model.cpython-310.pyc
│   │   └── config.pbtxt
│   ├── preprocessing
│   │   ├── 1
│   │   │   ├── model.py
│   │   │   └── __pycache__
│   │   │       └── model.cpython-310.pyc
│   │   └── config.pbtxt
│   └── tensorrt_llm
│       ├── 1
│       │   ├── baichuan_float16_tp1_rank0.engine
│       │   ├── config.json
│       │   └── model.cache
│       └── config.pbtxt
└── downloads
    └── Baichuan2-7B-Chat
        ├── Baichuan2 模型社区许可协议.pdf
        ├── Community License for Baichuan2 Model.pdf
        ├── config.json
        ├── configuration_baichuan.py
        ├── generation_config.json
        ├── generation_utils.py
        ├── modeling_baichuan.py
        ├── pytorch_model.bin
        ├── quantizer.py
        ├── README.md
        ├── special_tokens_map.json
        ├── tokenization_baichuan.py
        ├── tokenizer_config.json
        └── tokenizer.model

13 directories, 26 files

3.3 Starting the Inference Service

docker run --gpus device=0 --rm -p 38000:8000 -p 38001:8001 -p 38002:8002 \
    -v $PWD/modelops/triton-tensorrtllm:/models \
    hubimage/nvidia-triton-trt-llm:v0.7.1 \
    tritonserver --model-repository=/models/Baichuan2-7B-Chat \
    --disable-auto-complete-config \
    --backend-config=python,shm-region-prefix-name=prefix0_:

If several Triton servers run on the same machine, use shm-region-prefix-name=prefix0_ to give each one a distinct shared-memory prefix. See https://github.com/triton-inference-server/server/issues/4145 for details.
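When several instances are started from a script, the prefix and the host ports can be derived from an instance index. The sketch below is one possible scheme, not a recommendation from the Triton project: `triton_run_command` is a hypothetical helper, the port layout (38000, 39000, ...) and the one-GPU-per-instance mapping are assumptions, and the prefix value mirrors the form used in the command above.

```python
def triton_run_command(index: int, repo_host_dir: str,
                       model: str = "Baichuan2-7B-Chat") -> list[str]:
    """Build a docker run argument list for the index-th Triton instance."""
    base = 38000 + index * 1000  # hypothetical port scheme: 38000, 39000, ...
    return [
        "docker", "run", "--gpus", f"device={index}", "--rm",
        "-p", f"{base}:8000",      # HTTP
        "-p", f"{base + 1}:8001",  # gRPC
        "-p", f"{base + 2}:8002",  # metrics
        "-v", f"{repo_host_dir}:/models",
        "hubimage/nvidia-triton-trt-llm:v0.7.1",
        "tritonserver", f"--model-repository=/models/{model}",
        "--disable-auto-complete-config",
        # A distinct shared-memory prefix per instance avoids name collisions.
        f"--backend-config=python,shm-region-prefix-name=prefix{index}_:",
    ]


for i in range(2):
    print(" ".join(triton_run_command(i, "/srv/modelops/triton-tensorrtllm")))
```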

Startup log:

I0129 10:27:31.658112 1 server.cc:619]
+-------------+-----------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend     | Path                                                            | Config                                                                                                                                                                                              |
+-------------+-----------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| python      | /opt/tritonserver/backends/python/libtriton_python.so           | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","shm-region-prefix-name":"prefix0_:","default-max-batch-size":"4"}} |
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}}                                      |
+-------------+-----------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0129 10:27:31.658192 1 server.cc:662]
+----------------+---------+--------+
| Model          | Version | Status |
+----------------+---------+--------+
| ensemble       | 1       | READY  |
| postprocessing | 1       | READY  |
| preprocessing  | 1       | READY  |
| tensorrt_llm   | 1       | READY  |
+----------------+---------+--------+
...
I0129 10:27:31.745587 1 grpc_server.cc:2513] Started GRPCInferenceService at 0.0.0.0:8001
I0129 10:27:31.745810 1 http_server.cc:4497] Started HTTPService at 0.0.0.0:8000
I0129 10:27:31.787129 1 http_server.cc:270] Started Metrics Service at 0.0.0.0:8002

Once all four models are in the READY state, the server is ready for inference.
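In a startup script the same check can be automated by parsing the status table from the log. The sketch below assumes the table format shown in the startup log in this article; `parse_model_status` is a hypothetical helper and would need adjusting if the log format changes.

```python
STATUS_TABLE = """\
+----------------+---------+--------+
| Model          | Version | Status |
+----------------+---------+--------+
| ensemble       | 1       | READY  |
| postprocessing | 1       | READY  |
| preprocessing  | 1       | READY  |
| tensorrt_llm   | 1       | READY  |
+----------------+---------+--------+
"""


def parse_model_status(log: str) -> dict[str, str]:
    """Map model name to status from a Triton startup status table."""
    status = {}
    for line in log.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        # Data rows have three cells with a numeric version in the middle;
        # header and border rows do not, so they are skipped.
        if len(cells) == 3 and cells[1].isdigit():
            status[cells[0]] = cells[2]
    return status


status = parse_model_status(STATUS_TABLE)
print(all(state == "READY" for state in status.values()))
```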

View the model configuration

curl localhost:38000/v2/models/ensemble/config

{"name":"ensemble","platform":"ensemble","backend":"","version_policy":{"latest":{"num_versions":1}},"max_batch_size":32,"input":[{"name":"text_input","data_type":"TYPE_STRING",...

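This shows the model's inference parameters. If auto-complete-config is enabled, this endpoint can also be used to export the configuration that Triton Server generated automatically, for editing and debugging.

Check whether Triton is running properly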

curl -v localhost:38000/v2/health/ready

< HTTP/1.1 200 OK
< Content-Length: 0
< Content-Type: text/plain
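The same readiness check can be done from Python, for example in a script that waits for the server before sending requests. This is a minimal sketch using only the standard library; `is_server_ready` is a hypothetical helper, and the only endpoint it relies on is the /v2/health/ready path used in the curl command above.

```python
import urllib.request

# Bypass any HTTP proxy configured in the environment, since the server is local.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def is_server_ready(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/v2/health/ready answers with HTTP 200."""
    try:
        with _OPENER.open(f"{base_url}/v2/health/ready", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, refused connections and timeouts are OSError subclasses.
        return False
```

With the port mapping used in this article, the call would be `is_server_ready("http://localhost:38000")`.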

3.4 Calling from a Client

Install dependencies

pip install tritonclient[grpc] -i https://pypi.tuna.tsinghua.edu.cn/simple

Triton's gRPC interface performs significantly better than its HTTP interface. I also could not find an HTTP example in the container, so gRPC is used here.

Inference test

wget https://raw.githubusercontent.com/shaowenchen/modelops/master/triton-tensorrtllm/Baichuan2-7B-Chat/end_to_end_grpc_client.py

python3 ./end_to_end_grpc_client.py -u 127.0.0.1:38001 -p "世界上第三高的山峰是哪座？" -S -o 128


珠穆朗瑪峰（Mount Everest）是世界上最高的山峰，海拔高度為8,848米（29,029英尺）。在世界上，珠穆朗瑪峰之后，第二高的山峰是喀喇昆侖山脈的喬戈里峰（K2，又稱(chēng)K2峰），海拔高度為8,611米（28,251英尺）。第三高的山峰是喜馬拉雅山脈的坎欽隆加峰（Kangchenjunga），海拔高度為8,586米（28,169英尺）。</s>

3.5 Viewing Metrics

Triton Server exposes inference metrics on port 8002, which is mapped to port 38002 in this example.

curl -v localhost:38002/metrics

nv_inference_request_success{model="ensemble",version="1"} 1
nv_inference_request_success{model="tensorrt_llm",version="1"} 1
nv_inference_request_success{model="preprocessing",version="1"} 1
nv_inference_request_success{model="postprocessing",version="1"} 128
# HELP nv_inference_request_failure Number of failed inference requests, all batch sizes
# TYPE nv_inference_request_failure counter
nv_inference_request_failure{model="ensemble",version="1"} 0
nv_inference_request_failure{model="tensorrt_llm",version="1"} 0
nv_inference_request_failure{model="preprocessing",version="1"} 0
nv_inference_request_failure{model="postprocessing",version="1"} 0
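The metrics endpoint returns plain text in the Prometheus exposition format, so it is easy to read from a script. The sketch below parses labeled samples like the ones above into a dictionary. It is a simplified parser under stated assumptions: `parse_metrics` is a hypothetical helper, it ignores comment lines and samples without labels, and it keys results only by metric name and the model label.

```python
import re

SAMPLE = """\
nv_inference_request_success{model="ensemble",version="1"} 1
nv_inference_request_success{model="postprocessing",version="1"} 128
# HELP nv_inference_request_failure Number of failed inference requests, all batch sizes
# TYPE nv_inference_request_failure counter
nv_inference_request_failure{model="ensemble",version="1"} 0
"""

# Matches lines of the form: name{label="value",...} number
METRIC_LINE = re.compile(r'^(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')


def parse_metrics(text: str) -> dict[tuple[str, str], float]:
    """Map (metric name, model label) to the sample value."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = METRIC_LINE.match(line)
        if not match:
            continue  # skip samples without labels
        labels = dict(re.findall(r'(\w+)="([^"]*)"', match.group("labels")))
        result[(match.group("name"), labels.get("model", ""))] = float(match.group("value"))
    return result


for (name, model), value in parse_metrics(SAMPLE).items():
    print(name, model, value)
```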

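The dashboard at https://grafana.com/grafana/dashboards/18737-triton-inference-server/ can be imported into Grafana to view the metrics:

[Figure: Grafana dashboard showing Triton Inference Server metrics]

4. Summary

This article is a record of learning to run inference with TensorRT and Triton Server. The main points are:

TensorRT is a more efficient model inference engine for Nvidia GPU hardware
TensorRT-LLM makes it quicker to get large models running on the TensorRT engine
Triton Server is an end-to-end inference framework that supports most model frameworks and helps users quickly build inference services at scale
An example of running inference with TensorRT-LLM under Triton Server

5. References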

https://mmdeploy.readthedocs.io/zh-cn/latest/tutorial/03_pytorch2onnx.html
https://docs.nvidia.com/deeplearning/tensorrt/container-release-notes/running.html#running
https://github.com/NVIDIA/TensorRT-LLM
https://github.com/triton-inference-server/triton-tensorrtllm
https://zhuanlan.zhihu.com/p/663748373

Editor: 武曉燕 (Wu Xiaoyan). Source: 陳少文 (Chen Shaowen)
