自拍偷在线精品自拍偷,亚洲欧美中文日韩v在线观看不卡

<cite id="57p0k"></cite>

<style id="57p0k"></style>

<legend id="57p0k"><track id="57p0k"></track></legend>

AI.x社區(qū)

軟考社區(qū)

免費(fèi)課

企業(yè)培訓(xùn)

鴻蒙開發(fā)者社區(qū)

WOT技術(shù)大會(huì)

公眾號(hào)矩陣

移動(dòng)端

視頻課免費(fèi)課排行榜短視頻直播課軟考學(xué)堂

全部課程軟考華為認(rèn)證廠商認(rèn)證 IT技術(shù)PMP項(xiàng)目管理免費(fèi)題庫

在線學(xué)習(xí)

文章資源問答課堂專欄直播

51CTO

鴻蒙開發(fā)者社區(qū)

51CTO技術(shù)棧

51CTO官微

51CTO學(xué)堂

51CTO博客

CTO訓(xùn)練營

鴻蒙開發(fā)者社區(qū)訂閱號(hào)

51CTO軟考

51CTO學(xué)堂APP

51CTO學(xué)堂企業(yè)版APP

鴻蒙開發(fā)者社區(qū)視頻號(hào)

51CTO軟考題庫

AI.x社區(qū)

登錄/注冊(cè)
51CTO

中國優(yōu)質(zhì)的IT技術(shù)網(wǎng)站

51CTO博客

專業(yè)IT技術(shù)創(chuàng)作平臺(tái)

51CTO學(xué)堂

IT職業(yè)在線教育平臺(tái)

A Pitfall Guide to Deploying the Full DeepSeek R1 with vLLM 0.7.1

Published 2025-2-6 15:33

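Today I saw the vLLM team announce on WeChat Moments that pipeline parallelism (PP) is now supported for DeepSeek R1, so I started tinkering right away. If the huge MoE I am training ever goes live, I had better have the technical groundwork ready, right? Here are the pitfalls I ran into, hopefully in plainer language than the official docs.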

Distributed Inference and Serving: https://docs.vllm.ai/en/latest/serving/distributed_serving.html#running-vllm-on-multiple-nodes

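Zhihu user @游凱超 said the whole process must be made silky smooth, so the two of us ran a few verifications together. You should now only need Step 0 and Step 3 to get it running; if you hit autoscaler-related errors, Step 1 should resolve them.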

Step 0 Prepare weights & Environment

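The weights are so large that downloading them directly is not recommended even if your bandwidth is good: a direct download will most likely time out or saturate your public IP. Fetch a copy from HF and/or through a proxy first. Today's demo is a multi-node, multi-GPU deployment on two 8xH20 machines, corresponding to TP size 8 and PP size 2, so you need two such machines. One assumption: the two machines can reach each other over the network (IB is not required) and share storage (NAS or OSS both work). Once this is ready, you can move on to the first step.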
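The sizing above is simple arithmetic: each pipeline stage holds one tensor-parallel group, so the deployment needs TP x PP GPUs in total. A minimal sketch of that calculation follows; the function name `plan_cluster` and its parameters are hypothetical and not part of vLLM.

```python
def plan_cluster(tp_size: int, pp_size: int, gpus_per_node: int) -> dict:
    """Return the GPU and node counts implied by a TP x PP layout."""
    world_size = tp_size * pp_size  # one GPU worker per (TP rank, PP stage)
    nodes, leftover = divmod(world_size, gpus_per_node)
    if leftover:
        nodes += 1  # a partially used node still has to be provisioned
    return {"world_size": world_size, "nodes": nodes}


# The layout used in this article: TP 8, PP 2, 8 GPUs per machine.
print(plan_cluster(tp_size=8, pp_size=2, gpus_per_node=8))
# {'world_size': 16, 'nodes': 2}
```

With 8 GPUs per machine, keeping TP at 8 means each tensor-parallel group stays inside one machine and only the pipeline boundary crosses the network.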

Step 1 Set up Ray & Cluster

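The official docs only touch on this part briefly, but it is where I was stuck the longest. What the docs ask for is this: prepare two nodes and use the ray start CLI to form a Ray cluster between them, since the later steps depend on it. There are two pitfalls, though. The first is that the launch command seems to be slightly off: in my first few attempts I kept hitting errors from Ray's autoscaler: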

(autoscaler +1m19s) Error: No available node types can fulfill resource request {'node:33.18.26.153': 0.001, 'GPU': 1.0}. Add suitable node types to this cluster to resolve this issue.
(autoscaler +1m54s) Error: No available node types can fulfill resource request {'GPU': 1.0, 'node:33.18.26.153': 0.001}. Add suitable node types to this cluster to resolve this issue.
(autoscaler +2m29s) Error: No available node types can fulfill resource request {'GPU': 1.0, 'node:33.18.26.153': 0.001}. Add suitable node types to this cluster to resolve this issue.
INFO 02-02 09:39:14 ray_utils.py:212] Waiting for creating a placement group of specs for 150 seconds. specs=[{'node:33.18.26.153': 0.001, 'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}]. Check `ray status` to see if you have enough resources.

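This looks odd, because the resource vLLM requests from the Ray cluster is a custom resource, 'node:33.18.26.153': 0.001, which can be read as vLLM asking for the driver node first. As far as I remember, though, this kind of resource has to be set by yourself when starting Ray: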

https://docs.ray.io/en/latest/ray-core/scheduling/resources.html#custom-resources

Only then does such a resource exist. The underlying cause is that a machine with multiple (virtual) NICs sits on multiple subnets, and vLLM assumes the pod IP is used to address the Ray master.
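One way to confirm this kind of mismatch is to compare the `node:<ip>` keys the cluster actually exposes with the IP that appears in the error message. The sketch below is only illustrative: the helper name and the sample values are hypothetical, and on a live cluster the dictionary would come from `ray.cluster_resources()` after `ray.init(address="auto")`.

```python
def find_node_resources(resources: dict, expected_ip: str) -> tuple[bool, list[str]]:
    """Check a Ray resource dict for the 'node:<ip>' key that vLLM asks for."""
    node_keys = sorted(k for k in resources if k.startswith("node:"))
    return f"node:{expected_ip}" in node_keys, node_keys


# Sample data shaped like the output of ray.cluster_resources(); values are made up.
sample = {"CPU": 192.0, "GPU": 16.0, "node:10.0.0.5": 1.0, "node:10.0.0.6": 1.0}

found, keys = find_node_resources(sample, "33.18.26.153")
print(found, keys)
# False ['node:10.0.0.5', 'node:10.0.0.6']
```

If the expected key is missing while other `node:` keys are present, the IP vLLM resolved for the driver probably belongs to a different interface than the one Ray registered.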

Solution 1: set VLLM_HOST_IP

# Get local IP address and set on every node before Ray start
VLLM_HOST_IP=$(hostname -I | awk '{print $1}')
export VLLM_HOST_IP

Solution 2: patch the Ray startup logic

import socket

import ray

def get_actual_ip():
    """Get the actual IP address of the current machine."""
    try:
        # "Connecting" a UDP socket sends no packets; it only makes the OS
        # choose the outgoing interface, whose address is then read back.
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.connect(('8.8.8.8', 80))
            return s.getsockname()[0]
    except OSError:
        # Fallback to hostname-based IP resolution
        return socket.gethostbyname(socket.gethostname())

def start_ray_cluster():
    free_ports = get_free_ports()
    port = free_ports[0]
    node_manager_port = free_ports[1]
    master_addr = get_master_addr()
    rank = get_rank()
    node_ip = get_actual_ip()  # Use the new function to get actual IP
    
    # Define custom resource based on node IP
    resource_spec = f'--resources=\'{{"node:{node_ip}": 1}}\''
    
    if rank == 0:
        cmd = f"ray start --head --port={port} --node-ip-address={master_addr} --node-manager-port {node_manager_port} --node-name={master_addr} {resource_spec}"
    else:
        cmd = f"ray start --address={master_addr}:{port} --node-manager-port {node_manager_port} --node-name={get_addr()} {resource_spec}"
    
    if ray.is_initialized():
        print("Ray is already initialized, skipping node level init.")
    else:
        stop_cmd = "ray stop"
        execute(stop_cmd, check=True)
        print(f"Executing Ray start command: {cmd}")
        execute(cmd, check=True)

Here execute can be written like this:

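import time
import subprocess

def execute(cmd, check=False, retry=1):
    # Run without check=True so that a failure can be retried before raising.
    ret = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    state = ret.returncode == 0
    msg = ret.stdout if state else ret.stderr
    if not state and retry > 1:
        print(f"execute {cmd} got error {msg}, retry...")
        time.sleep(1)
        return execute(cmd, check, retry - 1)
    if not state and check:
        # All attempts failed and the caller asked for a hard failure.
        raise subprocess.CalledProcessError(ret.returncode, cmd, ret.stdout, ret.stderr)
    return state, msg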

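A quick note on how Ray is usually run: people rarely use Ray on bare metal, and most deep learning resources are handed out by k8s combined with plugins such as kubeflow or volcano. The environment variables carry information such as the current rank and the head node's master_addr, so you can implement these helper functions to suit your own setup. As for the tricky {resource_spec} part, I have already filled in that pit for you.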
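As an illustration of those helpers, the sketch below reads the rank and the head address from environment variables. The variable names RANK and MASTER_ADDR are assumptions (they are common in PyTorch-style jobs, but your scheduler may inject different names), and the defaults are only there so the code runs on a single machine.

```python
import os


def get_rank(env=os.environ) -> int:
    """Rank of this node; RANK is an assumed variable name."""
    return int(env.get("RANK", "0"))


def get_master_addr(env=os.environ) -> str:
    """Address of the head node; MASTER_ADDR is an assumed variable name."""
    return env.get("MASTER_ADDR", "127.0.0.1")


# A made-up environment, standing in for what the scheduler would inject.
fake_env = {"RANK": "1", "MASTER_ADDR": "10.0.0.5"}
print(get_rank(fake_env), get_master_addr(fake_env))
# 1 10.0.0.5
```

Passing the environment as a parameter keeps the helpers easy to test outside the cluster; in the job itself they can be called without arguments.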

Step 2 Other small bugs

Two more errors came up along the way, and fixing them took a little time:

Traceback (most recent call last):
  File "/usr/local/bin/vllm", line 5, in <module>
    from vllm.scripts import main
  File "/usr/local/lib/python3.10/dist-packages/vllm/__init__.py", line 4, in <module>
    from vllm.engine.async_llm_engine import AsyncLLMEngine
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 15, in <module>
    from vllm.engine.llm_engine import (DecoderPromptComponents, LLMEngine,
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 24, in <module>
    from vllm.engine.output_processor.interfaces import (
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/output_processor/interfaces.py", line 6, in <module>
    from vllm.engine.output_processor.stop_checker import StopChecker
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/output_processor/stop_checker.py", line 6, in <module>
    from vllm.transformers_utils.tokenizer import AnyTokenizer
  File "/usr/local/lib/python3.10/dist-packages/vllm/transformers_utils/tokenizer.py", line 13, in <module>
    from vllm.transformers_utils.tokenizers import (BaichuanTokenizer,
  File "/usr/local/lib/python3.10/dist-packages/vllm/transformers_utils/tokenizers/__init__.py", line 2, in <module>
    from vllm.transformers_utils.tokenizers.mistral import MistralTokenizer
  File "/usr/local/lib/python3.10/dist-packages/vllm/transformers_utils/tokenizers/mistral.py", line 9, in <module>
    from mistral_common.tokens.tokenizers.mistral import ChatCompletionRequest
  File "/usr/local/lib/python3.10/dist-packages/mistral_common/tokens/tokenizers/mistral.py", line 32, in <module>
    from mistral_common.tokens.tokenizers.multimodal import (
  File "/usr/local/lib/python3.10/dist-packages/mistral_common/tokens/tokenizers/multimodal.py", line 6, in <module>
    import cv2
  File "/usr/local/lib/python3.10/dist-packages/cv2/__init__.py", line 181, in <module>
    bootstrap()
  File "/usr/local/lib/python3.10/dist-packages/cv2/__init__.py", line 175, in bootstrap
    if __load_extra_py_code_for_module("cv2", submodule, DEBUG):
  File "/usr/local/lib/python3.10/dist-packages/cv2/__init__.py", line 28, in __load_extra_py_code_for_module
    py_module = importlib.import_module(module_name)
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/usr/local/lib/python3.10/dist-packages/cv2/typing/__init__.py", line 171, in <module>
    LayerId = cv2.dnn.DictValue
AttributeError: module 'cv2.dnn' has no attribute 'DictValue'

This one is a legacy OpenCV issue; pinning the opencv version fixes it:

pip install opencv-python-headless==4.5.4.58

The other is a TypeError raised after the weights are loaded:

[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek_v3.py", line 472, in forward
[rank0]:     kv_c, k_pe = self.kv_a_proj_with_mqa(hidden_states)[0].split(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 246, in forward
[rank0]:     output = self.quant_method.apply(self, x, bias)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/fp8.py", line 357, in apply
[rank0]:     return apply_w8a8_block_fp8_linear(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/fp8_utils.py", line 61, in apply_w8a8_block_fp8_linear
[rank0]:     output = w8a8_block_fp8_matmul(q_input,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/fp8_utils.py", line 470, in w8a8_block_fp8_matmul
[rank0]:     configs = get_w8a8_block_fp8_configs(N, K, block_size[0], block_size[1])
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/fp8_utils.py", line 407, in get_w8a8_block_fp8_configs
[rank0]:     device_name = current_platform.get_device_name().replace(" ", "_")
[rank0]: TypeError: a bytes-like object is required, not 'str'

Upgrading pynvml fixes it:

pip install pynvml -U
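Since both fixes come down to package versions, it can help to print what is actually installed on every node before launching. The sketch below uses only the standard library; the helper name is hypothetical, and the package names are the ones from the two pip commands above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print(installed_versions(["opencv-python-headless", "pynvml"]))
```

Running this on both machines is a quick way to spot a node whose image differs from the other.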

Step 3 Run the model

This step turns out to be the easiest:

vllm serve /your/path/to_checkpoint_deepseek-r1/ --tensor-parallel-size 8 --pipeline-parallel-size 2 --trust-remote-code --host 0.0.0.0
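Loading a checkpoint of this size takes a while, so a script that talks to the server needs to wait until it is up. A minimal polling sketch, assuming the server listens on the default port 8000 and answers on the OpenAI-compatible `/v1/models` route used by the client code later in this article; the function name and timeout values are hypothetical.

```python
import time
import urllib.request


def wait_until_ready(url: str, timeout_s: float = 1800.0, interval_s: float = 10.0) -> bool:
    """Poll url until it answers HTTP 200 or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # connection refused, timeout, or HTTP error: not ready yet
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Example call (blocks until the server answers or 30 minutes pass):
# wait_until_ready("http://localhost:8000/v1/models")
```

The function returns False instead of raising, so the caller can decide whether to keep waiting or to inspect the server logs.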

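With PP in place, those without IB can also try pushing the sequence length and batch size up a little. Using the Reasoning Output feature contributed by gaoce, try it from the same machine, or from another machine after changing localhost: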

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

# Round 1
messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]
response = client.chat.completions.create(model=model, messages=messages)

reasoning_content = response.choices[0].message.reasoning_content
content = response.choices[0].message.content

print("reasoning_content:", reasoning_content)
print("content:", content)
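The reasoning_content attribute is only present when the server separates the reasoning from the answer. To my knowledge, vLLM around this version does so when launched with `--enable-reasoning --reasoning-parser deepseek_r1`, which is worth checking against the docs for your version. As a fallback, here is a sketch that parses the raw text instead, assuming the model wraps its reasoning in `<think>...</think>`; both helper names are hypothetical.

```python
from typing import Optional


def split_reasoning(text: str) -> tuple[Optional[str], str]:
    """Split '<think>...</think>answer' into (reasoning, answer)."""
    start, end = "<think>", "</think>"
    if end not in text:
        return None, text.strip()
    reasoning, answer = text.split(end, 1)
    # The opening tag may be missing if the chat template already emitted it.
    reasoning = reasoning.replace(start, "", 1)
    return reasoning.strip(), answer.strip()


def read_message(message) -> tuple[Optional[str], str]:
    """Prefer the reasoning_content field; otherwise parse the raw content."""
    reasoning = getattr(message, "reasoning_content", None)
    content = message.content or ""
    if reasoning is not None:
        return reasoning, content
    return split_reasoning(content)


print(split_reasoning("<think>compare the decimals</think>9.8 is greater."))
# ('compare the decimals', '9.8 is greater.')
```

With the client code above, `read_message(response.choices[0].message)` would replace the two direct attribute accesses.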

Right, it is not stuck; your wallet just is not thick enough. Switch to the server side and you can see the following for this prompt:

INFO 02-02 14:18:52 metrics.py:453] Avg prompt throughput: 1.7 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 02-02 14:18:57 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 20.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 02-02 14:19:02 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 20.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 02-02 14:19:07 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 20.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 02-02 14:19:12 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 20.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 02-02 14:19:17 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 02-02 14:19:22 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 02-02 14:19:27 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
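To put numbers like these in perspective, the throughput can be pulled out of the log and turned into a rough waiting time. The sketch below parses two shortened lines taken from the log above; the helper names are hypothetical, and the 2000-token response length is an illustrative assumption, not a measurement.

```python
import re

# Matches the generation throughput field of the metrics lines shown above.
PATTERN = re.compile(r"Avg generation throughput: ([\d.]+) tokens/s")


def generation_throughputs(log_text: str) -> list[float]:
    """Extract every generation throughput value (tokens/s) from log text."""
    return [float(value) for value in PATTERN.findall(log_text)]


def seconds_for(tokens: int, tokens_per_s: float) -> float:
    """Rough decode time for a response of the given length."""
    return tokens / tokens_per_s


log = """\
INFO 02-02 14:18:52 metrics.py:453] Avg prompt throughput: 1.7 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs
INFO 02-02 14:18:57 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 20.7 tokens/s, Running: 1 reqs
"""

print(generation_throughputs(log))
# [0.1, 20.7]
print(seconds_for(2000, 20.0))
# 100.0
```

At roughly 20 tokens/s for a single request, a long chain of reasoning takes on the order of minutes, which is why the client appears to hang.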

Wait a moment and it will tell you that 9.8 is greater.

Good luck with your tinkering, and thanks to the vLLM community for their work.

https://github.com/vllm-project/vllm/pull/12679

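凱超 is truly awesome, doing hands-on customer support here over the Spring Festival, haha. RL folks, whether you originally majored in the humanities or the sciences, go study infra first.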

This article is reposted from NLP工作站. Author: 曹宇
