Exploring DeepSeek-Style Reasoning Built on Qwen2.5

2025-02-27 08:00:00

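As an internet technology enthusiast, I have always been passionate about large language models and efficient inference. This article builds DeepSeek-style reasoning on top of Qwen2.5.

It uses the unsloth framework, a lightweight, efficient, and easy-to-learn tool, together with a Chinese SFT dataset, to test reasoning in the medical domain.

Along the way I also ran into new concepts such as GRPO. They were both challenging and instructive, and they gave me a deeper understanding of the whole system. Below, I will walk you through this technical world, which is tedious and fun in equal measure, from a first-hand perspective.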

Introduction to the Qwen2.5 Model

Qwen2.5 is one of the most closely watched large language models of recent years. Compared with earlier models, it shows clear gains in semantic understanding, generation quality, and inference efficiency. It is well suited to Chinese-language use and handles complex language tasks with ease. My goal was to use this model as a solid foundation for DeepSeek-style reasoning.

In practice I found that Qwen2.5 not only responds quickly but also adapts to some degree. In specialized domains such as medicine in particular, its logical rigor and sensitivity to data are impressive. This, of course, rests on large-scale pretraining and a carefully designed architecture.

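Overview of DeepSeek-Style Reasoning

DeepSeek reasoning, as the term is used here, is an emerging capability whose core goal is to parse the input efficiently and accurately and then reason in depth on top of the model's pretrained knowledge. Put simply, its strength is that it can not only answer common questions but also infer less obvious information through a multi-step chain of logic. In plain terms, the model learns not just to "answer" but to "think".

In this project I used DeepSeek-style reasoning for a series of reasoning tasks in the medical domain. The aim is a reasonably complete chain of reasoning for medical diagnosis that can support doctors' diagnostic decisions. The process tests the model's language understanding and also places demands on the reasoning algorithm behind it.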

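Choosing the unsloth Framework and Its Advantages

No discussion of the implementation is complete without the unsloth framework. The name may sound lazy, but the framework is built for fast, memory-efficient fine-tuning and inference of large language models. I chose it for the following reasons:

Efficiency: unsloth remains stable under heavy load and includes many optimizations for resource usage. For a system like mine that needs real-time inference, this is a major advantage.

Extensibility: the modular design lets me adjust the architecture to actual needs. For example, adding a new reasoning strategy later takes only minor changes.

Compatibility: unsloth works well with mainstream large language models such as Qwen2.5, which made later debugging and tuning much easier.

During development I ran into some resource bottlenecks, and it was unsloth's flexibility that let me locate the problems quickly and adjust accordingly.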

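About the Dataset Used Here

Data is the fuel of artificial intelligence. In the medical domain, accurate Chinese-language data is especially important.

The Chinese subset of FreedomIntelligence/medical-o1-reasoning-SFT was created to meet this need. It contains a large number of annotated medical cases with reasoning paths (24,772 training examples according to the load log later in this article), so the model can draw on the logic of realistic cases when it reasons about medical questions.

I fine-tuned on this dataset so that Qwen2.5 would have not only general language understanding but also a deeper grasp of medical terminology and domain knowledge. The richness and diversity of the dataset give the model enough training samples to improve in medical scenarios.

In practice, I found that after this round of fine-tuning the model's reasoning on complex symptom descriptions looked more reasonable and more convincing.

Dataset URL (we use the Chinese subset here):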

https://huggingface.co/datasets/FreedomIntelligence/medical-o1-reasoning-SFT/viewer/zh
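
To make the dataset's role concrete, here is a minimal sketch of how one record could be turned into a chat prompt plus a reference answer. The field names Question, Complex_CoT, and Response are the ones used by the data-preparation code in this article. The record itself is a made-up placeholder rather than a real dataset entry, and to_chat_example is a hypothetical helper name.

```python
# Hypothetical record with the three fields the dataset exposes.
# The text values are placeholders, not real dataset content.
record = {
    "Question": "placeholder question",
    "Complex_CoT": "placeholder step-by-step reasoning",
    "Response": "placeholder final answer",
}

SYSTEM_PROMPT = (
    "Respond in the following format:\n"
    "<reasoning>\n...\n</reasoning>\n"
    "<answer>\n...\n</answer>\n"
)


def to_chat_example(rec: dict) -> dict:
    """Map one raw record to a prompt/answer pair."""
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": rec["Question"].strip()},
        ],
        # Only the final response is kept as the reference answer here;
        # the chain of thought (Complex_CoT) is not used as a target.
        "answer": rec["Response"].strip(),
    }


example = to_chat_example(record)
print(example["prompt"][1]["content"])
print(example["answer"])
```

The system prompt asks the model to produce its own reasoning block, so the reference answer is all that is needed for scoring in this sketch.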

Implementation

01. Environment Setup

%%capture
# This cell magic hides the cell's output in Jupyter Notebook,
# so that noisy output such as the `pip install` log does not clutter the results


# Install the Unsloth and vLLM libraries
!pip install unsloth vllm


# Upgrade Pillow (Python Imaging Library) to the latest version
!pip install --upgrade pillow

from unsloth import FastLanguageModel, PatchFastRL
# Import FastLanguageModel (the accelerated model wrapper) and PatchFastRL (the RL patching helper) from unsloth


PatchFastRL("GRPO", FastLanguageModel)
# Patch FastLanguageModel so that it works with the GRPO trainer

02. Loading the Base Model

from unsloth import is_bfloat16_supported  # Checks whether the GPU supports bfloat16
import torch  # PyTorch for deep learning computation


# Model settings
max_seq_length = 1024  # Maximum sequence length; increase for longer reasoning traces
lora_rank = 64  # LoRA rank; larger values give the adapter more capacity but make it slower


# Load the pretrained model through FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",  # Qwen2.5-3B-Instruct as the base model
    max_seq_length=max_seq_length,  # Maximum sequence length
    load_in_4bit=True,  # 4-bit quantization to save VRAM; set to False for 16-bit LoRA
    fast_inference=True,  # Enable vLLM-backed fast inference
    max_lora_rank=lora_rank,  # Maximum LoRA rank
    gpu_memory_utilization=0.5,  # Fraction of GPU memory given to vLLM; lower it if you run out of memory
)


# Attach LoRA (low-rank adaptation) adapters via PEFT (parameter-efficient fine-tuning)
model = FastLanguageModel.get_peft_model(
    model,
    r=lora_rank,  # LoRA rank; typical choices are 8, 16, 32, 64, 128
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention projections: Q, K, V and output
        "gate_proj", "up_proj", "down_proj",  # MLP projections
    ],  # If VRAM is tight, the QKVO modules can be removed
    lora_alpha=lora_rank,  # LoRA scaling factor alpha, set equal to r here
    use_gradient_checkpointing="unsloth",  # Gradient checkpointing to cut VRAM use; helps with long-context fine-tuning
    random_state=3407,  # Random seed for reproducibility
)

# Output
==((====))==  Unsloth 2025.2.12: Fast Qwen2 patching. Transformers: 4.48.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit with actual GPU utilization = 49.66%
Unsloth: Your GPU has CUDA compute capability 7.5 with VRAM = 14.74 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 1024. Num Sequences = 192.
Unsloth: vLLM's KV Cache can use up to 4.9 GB. Also swap space = 2 GB.
WARNING 02-17 07:17:06 config.py:2386] Casting torch.bfloat16 to torch.float16.
INFO 02-17 07:17:19 config.py:542] This model supports multiple tasks: {'classify', 'generate', 'embed', 'reward', 'score'}. Defaulting to 'generate'.
Unsloth: vLLM Bitsandbytes config using kwargs = {'load_in_8bit': False, 'load_in_4bit': True, 'bnb_4bit_compute_dtype': 'float16', 'bnb_4bit_quant_storage': 'uint8', 'bnb_4bit_quant_type': 'nf4', 'bnb_4bit_use_double_quant': True, 'llm_int8_enable_fp32_cpu_offload': False, 'llm_int8_has_fp16_weight': False, 'llm_int8_skip_modules': ['lm_head', 'multi_modal_projector', 'merger', 'modality_projection', 'model.layers.2.mlp', 'model.layers.3.mlp', 'model.layers.30.mlp'], 'llm_int8_threshold': 6.0}
INFO 02-17 07:17:19 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.2) with config: model='unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit', speculative_config=None, tokenizer='unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.BITSANDBYTES, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda:0, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":0,"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":192}, use_cached_outputs=False, 
tokenizer_config.json: 100% 7.36k/7.36k [00:00<00:00, 451kB/s]
vocab.json: 100% 2.78M/2.78M [00:00<00:00, 8.77MB/s]
merges.txt: 100% 1.67M/1.67M [00:00<00:00, 6.92MB/s]
tokenizer.json: 100% 11.4M/11.4M [00:00<00:00, 42.1MB/s]
added_tokens.json: 100% 605/605 [00:00<00:00, 37.9kB/s]
special_tokens_map.json: 100% 614/614 [00:00<00:00, 54.5kB/s]
generation_config.json: 100% 271/271 [00:00<00:00, 17.2kB/s]
INFO 02-17 07:17:23 cuda.py:179] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 02-17 07:17:23 cuda.py:227] Using XFormers backend.
INFO 02-17 07:17:24 model_runner.py:1110] Starting to load model unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit...
INFO 02-17 07:17:24 loader.py:1102] Loading weights with BitsAndBytes quantization.  May take a while ...
INFO 02-17 07:17:25 weight_utils.py:252] Using model weights format ['*.safetensors']
model.safetensors: 100% 2.36G/2.36G [00:15<00:00, 165MB/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:04<00:00,  4.67s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.17s/it]
INFO 02-17 07:17:48 model_runner.py:1115] Loading model weights took 2.2160 GB
INFO 02-17 07:17:48 punica_selector.py:18] Using PunicaWrapperGPU.
INFO 02-17 07:17:58 worker.py:267] Memory profiling takes 9.77 seconds
INFO 02-17 07:17:58 worker.py:267] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.50) = 7.32GiB
INFO 02-17 07:17:58 worker.py:267] model weights take 2.22GiB; non_torch_memory takes 0.05GiB; PyTorch activation peak memory takes 1.05GiB; the rest of the memory reserved for KV Cache is 4.01GiB.
INFO 02-17 07:17:58 executor_base.py:110] # CUDA blocks: 7300, # CPU blocks: 3640
INFO 02-17 07:17:58 executor_base.py:115] Maximum concurrency for 1024 tokens per request: 114.06x
INFO 02-17 07:18:01 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 27/27 [00:43<00:00,  1.60s/it]
INFO 02-17 07:18:44 model_runner.py:1562] Graph capturing finished in 43 secs, took 0.62 GiB
INFO 02-17 07:18:44 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 56.33 seconds
tokenizer_config.json: 100% 7.36k/7.36k [00:00<00:00, 650kB/s]
vocab.json: 100% 2.78M/2.78M [00:00<00:00, 14.2MB/s]
merges.txt: 100% 1.67M/1.67M [00:00<00:00, 26.5MB/s]
added_tokens.json: 100% 605/605 [00:00<00:00, 45.0kB/s]
special_tokens_map.json: 100% 614/614 [00:00<00:00, 32.3kB/s]
tokenizer.json: 100% 11.4M/11.4M [00:00<00:00, 40.2MB/s]
Unsloth 2025.2.12 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
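
The memory figures in the log above can be checked by hand, and the sketch below redoes the arithmetic. The log supplies the totals (7.32 GiB budget, 2.22 GiB of weights, 7300 CUDA blocks). The block size of 16 tokens and the attention shape (2 key/value heads of dimension 128) are assumptions based on vLLM's usual default and the published Qwen2.5-3B configuration; they are not stated in this article.

```python
GIB = 1024 ** 3

# Figures copied from the log above
budget_gib = 7.32           # total_gpu_memory x gpu_memory_utilization
weights_gib = 2.22
non_torch_gib = 0.05
activation_peak_gib = 1.05
cuda_blocks = 7300
max_seq_len = 1024

# Assumptions (not stated in the article)
block_size_tokens = 16      # assumed vLLM KV-cache block size
num_layers = 36             # consistent with "patched 36 layers" in the log
num_kv_heads = 2            # assumed grouped-query attention setting
head_dim = 128              # assumed
bytes_per_value = 2         # float16

# 1) Memory left for the KV cache after weights and activations
kv_budget_gib = budget_gib - weights_gib - non_torch_gib - activation_peak_gib

# 2) KV cache size implied by the number of blocks (keys and values, hence the factor 2)
kv_bytes_per_token = num_layers * 2 * num_kv_heads * head_dim * bytes_per_value
kv_tokens = cuda_blocks * block_size_tokens
kv_cache_gib = kv_tokens * kv_bytes_per_token / GIB

# 3) How many full-length requests fit in the cache at once
max_concurrency = kv_tokens / max_seq_len

print(f"KV budget from subtraction: {kv_budget_gib:.2f} GiB")
print(f"KV cache from block count:  {kv_cache_gib:.2f} GiB")
print(f"Max concurrency:            {max_concurrency:.2f}x")
```

Under these assumptions both routes land on roughly 4 GiB, in line with the 4.01 GiB and the 114.06x concurrency that the log reports. Lowering gpu_memory_utilization shrinks the budget in step 1 and therefore the number of requests that fit.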

03. Preparing the Training Data

import re
from datasets import load_dataset, Dataset

# Load and prepare the dataset
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""


def extract_xml_answer(text: str) -> str:
    """Extract the content inside the <answer> tags."""
    match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else text.strip()


def get_medical_questions(split="train") -> Dataset:
    """Load the Chinese subset of FreedomIntelligence/medical-o1-reasoning-SFT and format it."""
    data = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", 'zh')[split]

    def format_example(x):
        xml_answer = f"""\
<reasoning>
{x['Complex_CoT'].strip()}
</reasoning>
<answer>
{x['Response'].strip()}
</answer>"""

        return {
            'prompt': [
                {'role': 'system', 'content': SYSTEM_PROMPT},
                {'role': 'user', 'content': x['Question']}
            ],
            'answer': extract_xml_answer(xml_answer)  # Reference answer, i.e. the Response text
        }

    data = data.map(format_example)
    return data


# Load the dataset
dataset = get_medical_questions()
# Reward functions
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    """Reward 2.0 when the extracted answer exactly matches the reference answer."""
    # Note: exact string equality is a very strict test for free-text medical answers.
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]


def int_reward_func(completions, **kwargs) -> list[float]:
    """Reward 0.5 when the extracted answer is an integer."""
    # Note: free-text medical answers are rarely pure digits, so this reward is usually 0.
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    return [0.5 if r.isdigit() else 0.0 for r in extracted_responses]


def strict_format_reward_func(completions, **kwargs) -> list[float]:
    """Strict format check: the tags must sit on their own lines."""
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    responses = [completion[0]["content"] for completion in completions]
    # re.DOTALL lets `.` match newlines, so multi-line reasoning is accepted
    return [0.5 if re.match(pattern, r, re.DOTALL) else 0.0 for r in responses]


def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """Loose format check: line breaks around the tags are optional."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    # re.DOTALL lets `.` match newlines, so multi-line reasoning is accepted
    return [0.5 if re.match(pattern, r, re.DOTALL) else 0.0 for r in responses]


def count_xml(text) -> float:
    """Score how complete the XML structure is; text after the closing tag is penalized."""
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1]) * 0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1) * 0.001
    return count


def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    """Compute the XML structure score for each completion."""
    contents = [completion[0]["content"] for completion in completions]
    return [count_xml(c) for c in contents]

# Output
medical_o1_sft_Chinese.json: 100% 64.8M/64.8M [00:00<00:00, 120MB/s]
Generating train split: 100% 24772/24772 [00:01<00:00, 12662.63 examples/s]
Map: 100% 24772/24772 [00:04<00:00, 5735.65 examples/s]
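
A quick way to see how the format rewards behave is to score a hand-written completion. The sketch below is a simplified stand-in for the reward functions in this section, not a copy of them: format_reward is a hypothetical name and the sample completion is invented. It also shows why the regular expression needs re.DOTALL once the reasoning spans several lines, since `.` does not match a newline by default.

```python
import re

# Invented completion in the <reasoning>/<answer> layout
completion = (
    "<reasoning>\n"
    "step one\n"
    "step two\n"
    "</reasoning>\n"
    "<answer>\n"
    "final answer\n"
    "</answer>\n"
)

PATTERN = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"


def format_reward(text: str, flags: int = 0) -> float:
    """Return 0.5 when a reasoning block is followed by an answer block."""
    return 0.5 if re.search(PATTERN, text, flags) else 0.0


def extract_answer(text: str) -> str:
    """Return the text inside <answer>...</answer>, or the whole text if the tags are missing."""
    match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else text.strip()


print(format_reward(completion))             # default flags: `.` stops at newlines
print(format_reward(completion, re.DOTALL))  # DOTALL: `.` also matches newlines
print(extract_answer(completion))
```

Without the flag, a well-formed multi-line completion earns no format reward, which would leave the model without any signal for following the requested layout.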

04. Training the Model
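
GRPO (Group Relative Policy Optimization) samples several completions for the same prompt, scores each one with the reward functions, and then judges every completion relative to its own group. The sketch below illustrates only that normalization step, using made-up reward values for a group of 8 completions (the same group size as num_generations=8 in the configuration that follows). The function name and the epsilon value are illustrative assumptions, and the KL penalty and the policy update that a real trainer performs are not shown.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation; a choice made for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up total rewards for 8 completions of one prompt
rewards = [0.0, 0.5, 0.5, 1.0, 0.125, 0.5, 2.5, 0.0]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:5.3f}  advantage={a:+.3f}")

# If every completion earns the same reward, all advantages are zero
# and the group carries no learning signal.
print(group_relative_advantages([0.5] * 8))
```

Completions that score above their group's average get a positive advantage and are reinforced, and those below it are discouraged. This is why reward functions that rarely fire contribute little: a reward that is zero for every completion in a group changes nothing.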

from trl import GRPOConfig, GRPOTrainer  # GRPO training configuration and trainer


# Training arguments for GRPO (Group Relative Policy Optimization)
training_args = GRPOConfig(
    use_vllm=True,  # Use vLLM to generate completions faster during training
    learning_rate=5e-6,  # Learning rate; kept small for stable updates
    adam_beta1=0.9,  # beta1 of the AdamW optimizer (momentum term)
    adam_beta2=0.99,  # beta2 of the AdamW optimizer (smoothing of squared gradients)
    weight_decay=0.1,  # Weight decay, to limit overfitting
    warmup_ratio=0.1,  # Fraction of steps used for warmup, to avoid instability from a high initial learning rate
    lr_scheduler_type="cosine",  # Cosine learning-rate schedule
    optim="adamw_8bit",  # 8-bit AdamW optimizer, to save VRAM
    logging_steps=1,  # Log every step
    bf16=is_bfloat16_supported(),  # Use bfloat16 if the GPU supports it
    fp16=not is_bfloat16_supported(),  # Otherwise fall back to fp16
    per_device_train_batch_size=1,  # Batch size per device; set to 1 when VRAM is tight
    gradient_accumulation_steps=1,  # Gradient accumulation steps; raising this to 4 gives smoother training
    num_generations=8,  # Completions sampled per prompt; lower it to reduce VRAM use
    max_prompt_length=256,  # Maximum prompt length
    max_completion_length=200,  # Maximum completion length
    # num_train_epochs=1,  # Train for 1 epoch; adjust for real runs
    max_steps=10,  # Maximum training steps (short test run); 250 for a normal run
    save_steps=10,  # Save a checkpoint every 10 steps; 250 for a normal run
    max_grad_norm=0.1,  # Gradient clipping, to prevent exploding gradients
    report_to="none",  # Do not report training logs to W&B (Weights & Biases)
    output_dir="outputs",  # Directory for training outputs
)

trainer = GRPOTrainer(
    model=model,  # Qwen2.5-3B-Instruct loaded through FastLanguageModel
    processing_class=tokenizer,  # Tokenizer
    reward_funcs=[  # Reward functions
        xmlcount_reward_func,  # XML structure score
        soft_format_reward_func,  # Loose format check
        strict_format_reward_func,  # Strict format check
        int_reward_func,  # Checks whether the answer is an integer
        correctness_reward_func,  # Checks whether the answer is correct
    ],
    args=training_args,  # Training arguments
    train_dataset=dataset,  # Training dataset (medical question answering)
)


trainer.train()  # Run GRPO reinforcement-learning training

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 24,772 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 8 | Gradient Accumulation steps = 1
\        /    Total batch size = 8 | Total steps = 10
 "-____-"     Number of trainable parameters = 119,734,272
 [10/10 03:46, Epoch 0/1]
Step | Training Loss | reward | reward_std | completion_length | kl | rewards / xmlcount_reward_func | rewards / soft_format_reward_func | rewards / strict_format_reward_func | rewards / int_reward_func | rewards / correctness_reward_func
... (rows omitted because the output is too long)
TrainOutput(global_step=10, training_loss=2.103237093251664e-05, metrics={'train_runtime': 254.9268, 'train_samples_per_second': 0.314, 'train_steps_per_second': 0.039, 'total_flos': 0.0, 'train_loss': 2.103237093251664e-05})
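
The 119,734,272 trainable parameters reported in the log above can be reproduced by hand. A LoRA adapter of rank r on a linear layer with d_in inputs and d_out outputs adds r × (d_in + d_out) weights. The sketch below sums that over the seven target modules and 36 layers. The rank of 64 and the module list come from the code in section 02, while the layer shapes (hidden size 2048, MLP size 11008, 16 query heads and 2 key/value heads of dimension 128) are assumptions taken from the published Qwen2.5-3B configuration, not from this article.

```python
rank = 64
num_layers = 36  # consistent with "patched 36 layers" in the load log

# Assumed Qwen2.5-3B shapes (not stated in the article)
hidden = 2048
intermediate = 11008
head_dim = 128
q_out = 16 * head_dim   # 16 query heads
kv_out = 2 * head_dim   # 2 key/value heads (grouped-query attention)

# (d_in, d_out) of each LoRA target module in one transformer layer
shapes = {
    "q_proj": (hidden, q_out),
    "k_proj": (hidden, kv_out),
    "v_proj": (hidden, kv_out),
    "o_proj": (q_out, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Weights added by one adapter: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, rank) for d_in, d_out in shapes.values())
total = per_layer * num_layers

print(f"per layer: {per_layer:,}")
print(f"total:     {total:,}")
```

Most of the adapter weights sit in the three MLP projections, because their intermediate dimension is much larger than the attention dimensions. Halving the rank halves the trainable parameter count.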

05. Question Answering Without the GRPO-Trained Model

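text = tokenizer.apply_chat_template([
    # The question is kept in Chinese to match the dataset. In English:
    # "A 10-year-old boy develops eyelid edema, reduced urine output, and hypertension
    # after pyoderma. As a doctor, which drug would you choose first?"
    {"role" : "user", "content" : "一个10岁男孩在患脓皮症后出现眼睑水肿、尿少和高血压等症状，作为医生，你认为首选的治疗药物是什么?"},
], tokenize = False, add_generation_prompt = True)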
from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    [text],
    sampling_params = sampling_params,
    lora_request = None,
)[0].outputs[0].text
output

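# Answer (translated from the model's Chinese output)
Processed prompts: 100%|██████████| 1/1 [00:08<00:00,  8.94s/it, est. speed input: 6.94 toks/s, output: 37.73 toks/s]
'
Pyoderma (also called impetigo or pustulosis) is a skin infection caused by bacteria. Although pyoderma has little connection with symptoms such as eyelid edema, reduced urine output, and hypertension, these symptoms may point to other problems, such as a complication of the pyoderma or a separate disease. A clear diagnosis therefore first requires a detailed medical history, a physical examination, laboratory tests (such as blood and urine tests), and imaging (where necessary).\n\nIf pyoderma is confirmed, treatment usually relies on antibiotics. For pyoderma in a 10-year-old boy, the first-choice drugs are usually:\n\n1. **Penicillins**: for most pyoderma patients, penicillin-class antibiotics (such as amoxicillin) are the first choice, especially for children.\n2. **Macrolides**: if the patient is allergic to penicillin, macrolide antibiotics (such as erythromycin) can be considered.\n\nIf eyelid edema, reduced urine output, and hypertension appear, further evaluation and treatment may be needed, which may include:\n\n- **Diuretics**: to reduce fluid retention and relieve edema.\n- **Angiotensin-converting enzyme inhibitors (ACEI) or angiotensin receptor blockers (ARB)**: to lower blood pressure, especially when the hypertension results from kidney damage.\n\nNote that these symptoms may indicate complications or another systemic disease, so evaluation and treatment must first be carried out by medical professionals. Be sure to consult a doctor for treatment advice tailored to the specific situation. Do not give the child any medication on your own without a doctor's guidance.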

06. Question Answering With the GRPO-Trained Model

model.save_lora("grpo_saved_lora")

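text = tokenizer.apply_chat_template([
    {"role" : "system", "content" : SYSTEM_PROMPT},
    # The question is kept in Chinese to match the dataset. In English:
    # "A 10-year-old boy develops eyelid edema, reduced urine output, and hypertension
    # after pyoderma. As a doctor, which drug would you choose first?"
    {"role" : "user", "content" : "一个10岁男孩在患脓皮症后出现眼睑水肿、尿少和高血压等症状，作为医生，你认为首选的治疗药物是什么?"},
], tokenize = False, add_generation_prompt = True)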
from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request = model.load_lora("grpo_saved_lora"),
)[0].outputs[0].text
output

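# Answer (translated from the model's Chinese output)
Processed prompts: 100%|██████████| 1/1 [00:09<00:00,  9.17s/it, est. speed input: 7.42 toks/s, output: 34.04 toks/s]
'
<reasoning>\nBased on the history, this 10-year-old boy has eyelid edema, reduced urine output, and hypertension. Pyoderma is one possible cause of these symptoms, but the main consideration is that the edema and hypertension may be related to an infection-associated systemic inflammatory response. In that case, the first-choice drug needs to control the infection (pyoderma), control the inflammatory response, and manage the hypertension. Given the child's age, the first-choice drug should be safe and address all three problems.
\n\n1. Antibiotics: pyoderma is a bacterial infection and needs antibiotic treatment. For a 10-year-old boy, penicillins are usually the first choice because these antibiotics have fewer side effects in children.
\n\n2. Non-steroidal anti-inflammatory drugs (NSAIDs): these can reduce the inflammatory response and help control the pyoderma. Ibuprofen, for example, is relatively safe, has few side effects, and is suitable for children.
\n\n3. Antihypertensive drugs: if hypertension is present, antihypertensive drugs such as ACE inhibitors or calcium channel blockers may be needed. However, given the special considerations for a 10-year-old child, they must be used with caution and under a pediatrician's guidance.
\n\nIn summary, the first-choice drug should be a penicillin-class antibiotic to control the pyoderma, possibly with NSAIDs and, under a pediatrician's guidance, antihypertensive drugs depending on the condition.
\n...\n<answer>\nThe first-choice drug is a penicillin-class antibiotic to control the pyoderma. Other drugs that may be needed include NSAIDs and antihypertensive drugs, but these require a pediatrician's guidance.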

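The text after reasoning is the model's chain of thought, and the text after answer is its final reply. Because this run was trained far too briefly (only 10 steps), the model's answer is still a long way from the annotated data.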

Editor: Pang Guiyu. Source: 写代码的中年人 (A Middle-Aged Coder)
