Machine Learning | Building a Large Model from Scratch: DeepSeek's GRPO

周末程序猿

Published 2025-2-12 14:21


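Recently, the release of DeepSeek-R1 has brought credit to Chinese-developed large models (it is remarkably strong). The GRPO algorithm, however, originated with the DeepSeekMath 7B model, which achieved excellent results on the MATH benchmark. The paper was published in February 2024: https://huggingface.co/papers/2402.03300. Its abstract reads: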

Mathematical reasoning poses a significant challenge for language models due to its complex and structured nature. In this paper, we introduce DeepSeekMath 7B, which continues pre-training DeepSeek-Coder-Base-v1.5 7B with 120B math-related tokens sourced from Common Crawl, together with natural language and code data. DeepSeekMath 7B has achieved an impressive score of 51.7% on the competition-level MATH benchmark without relying on external toolkits and voting techniques, approaching the performance level of Gemini-Ultra and GPT-4. Self-consistency over 64 samples from DeepSeekMath 7B achieves 60.9% on MATH. The mathematical reasoning capability of DeepSeekMath is attributed to two key factors: First, we harness the significant potential of publicly available web data through a meticulously engineered data selection pipeline. Second, we introduce Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO), that enhances mathematical reasoning abilities while concurrently optimizing the memory usage of PPO.


[Figure: comparison data]

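1. What is GRPO

GRPO is an online learning algorithm. Its core idea is to estimate the baseline from the relative rewards within a group, which avoids the need for a separate value function model. It improves iteratively by training on data that the model being trained generates itself. GRPO aims to maximize the advantage of the generated completions while keeping the model close to the reference policy. The figure below is the algorithm diagram from the paper: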

[Figure: GRPO]

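GRPO is a variant of PPO (Proximal Policy Optimization, a reinforcement learning algorithm proposed by OpenAI in 2017 to address the training instability of policy gradient methods). The main differences are:

GRPO omits the value function model
GRPO changes the reward computation: for one question q it generates multiple responses r, and each response is then scored by the reward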
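For reference, the objective that GRPO maximizes can be sketched as below. The formula is transcribed from the DeepSeekMath paper rather than from this article, so the symbols are the paper's: G is the group size, o_i is the i-th sampled output for question q, ε is the clipping range, β is the KL coefficient, and Â is the group-relative advantage (shown here for the outcome-reward case).

```latex
\mathcal{J}_{GRPO}(\theta)
  = \mathbb{E}_{q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O \mid q)}
    \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|}
    \left\{
      \min\!\left[
        \rho_{i,t}\, \hat{A}_{i,t},\
        \operatorname{clip}\!\left(\rho_{i,t},\, 1-\varepsilon,\, 1+\varepsilon\right) \hat{A}_{i,t}
      \right]
      - \beta\, \mathbb{D}_{KL}\!\left[\pi_\theta \,\|\, \pi_{ref}\right]
    \right\},
\qquad
\rho_{i,t} = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} \mid q, o_{i,<t})},
\qquad
\hat{A}_{i,t} = \frac{r_i - \operatorname{mean}(\{r_1,\dots,r_G\})}{\operatorname{std}(\{r_1,\dots,r_G\})}
```

Compared with PPO, the advantage comes from the group statistics instead of a learned value function, and the KL term against the reference policy sits in the objective itself.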

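The GRPO algorithm flow:

Sample a group of outputs and compute the reward of each output
Normalize the rewards within the group
Use the normalized rewards to compute the advantage
Update the policy model by maximizing the objective function
Iterate the training to improve the policy model step by step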
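The normalization and advantage steps above can be sketched in a few lines. This is a minimal illustration, not the trainer's implementation: the function name `group_relative_advantages`, the use of the population standard deviation, and the small `eps` added to avoid division by zero are all assumptions made for the example.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one group: (r_i - mean) / (std + eps).

    `rewards` holds one scalar reward per sampled output for the same prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled outputs for one prompt: two scored 2.0, two scored 0.0.
group_rewards = [2.0, 0.0, 0.0, 2.0]
advantages = group_relative_advantages(group_rewards)
print(advantages)
```

Outputs that score above the group mean get a positive advantage and the rest a negative one. If every output in the group gets the same reward, all advantages are zero and that group contributes no learning signal.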

[Figure: pseudocode from the paper]

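2. Reward design

Hugging Face's trl library provides GRPOTrainer, which can be used to train with GRPO directly. Its parameters include the reward models and reward functions to use.

2.1 Reward model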

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",
    reward_funcs="weqweasdas/RM-Gemma-2B",
    args=training_args,
    train_dataset=dataset,
    peft_config=LoraConfig(task_type="CAUSAL_LM"),
)

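The reward_funcs parameter here can take a reward model.

2.2 Reward functions

GRPOTrainer lets users define custom reward functions by writing a function that returns a list of floats.

Input of a reward function: completions (the generated completions) and prompts (the prompts)
Output of a reward function: a list of floats, one reward per completion

(1) Reward function for longer completions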

def completion_reward(completions, **kwargs):
    '''Reward function that gives higher scores to longer completions.'''
    # Each completion here is a plain string; its reward is its length / 100.
    return [float(len(completion)) / 100 for completion in completions]

prompts = ["The sky is", "The sun is"]
completions = [" blue.", " in the sky."]
print("completion_reward: ", completion_reward(prompts=prompts, completions=completions))

(2) Reward function for correct format

import re

def format_reward(completions, **kwargs):
    '''Format reward: 0.5 if the completion follows <think>...</think><answer>...</answer>.'''
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    # Completions are chat-formatted: each one is a list holding one message dict.
    responses = [completion[0]["content"] for completion in completions]
    # re.DOTALL lets "." span line breaks, so multi-line reasoning still matches.
    matches = [re.match(pattern, response, re.DOTALL) for response in responses]
    return [0.5 if match else 0.0 for match in matches]

prompts = [
    [{"role": "user", "content": "What is the result of (1 + 2) * 4?"}],
    [{"role": "user", "content": "What is the result of (3 + 1) * 2?"}],
]
completions = [
    [{"role": "assistant", "content": "<think>The sum of 1 and 2 is 3, which we multiply by 4 to get 12.</think><answer>(1 + 2) * 4 = 12</answer>"}],
    [{"role": "assistant", "content": "The sum of 3 and 1 is 4, which we multiply by 2 to get 8. So (3 + 1) * 2 = 8."}],
]
print("format_reward: ", format_reward(prompts=prompts, completions=completions))

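Following the reward examples above, you can design reward functions for different datasets, for example:

Check whether the content contains digits
Check whether the answer draws on the knowledge base content from web pages
...

Then pass these functions to GRPOTrainer, as in the following code: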

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[
        ...
        format_reward,
        completion_reward,
    ],
    args=training_args,
    train_dataset=data,
    ...
)
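When several reward functions are passed as a list, each completion ends up with one score per function, and those scores have to be combined into a single reward. The sketch below assumes a plain unweighted sum per completion; the helper `total_rewards` and the two toy reward functions are hypothetical names written for this example and are not part of trl.

```python
def length_reward(completions, **kwargs):
    """Toy reward: longer completions score higher (length / 100)."""
    return [len(c) / 100 for c in completions]

def contains_digit_reward(completions, **kwargs):
    """Toy reward: 0.5 if the completion contains at least one digit."""
    return [0.5 if any(ch.isdigit() for ch in c) else 0.0 for c in completions]

def total_rewards(reward_funcs, prompts, completions):
    """Sum the per-completion scores of every reward function."""
    per_func = [f(prompts=prompts, completions=completions) for f in reward_funcs]
    # zip(*per_func) regroups the scores so each tuple belongs to one completion.
    return [sum(scores) for scores in zip(*per_func)]

prompts = ["2 + 2 =", "The sky is"]
completions = [" 4", " blue"]
totals = total_rewards([length_reward, contains_digit_reward], prompts, completions)
print(totals)
```

Because the scores are added, the relative scale of each function matters: a reward that returns values near 2.0 will dominate one that returns values near 0.05.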

3. Training a model with GRPO

There are already many reproductions of DeepSeek-R1-Zero on GitHub. If you are interested, take a look at these open-source projects (their cost is mostly kept under 500):

https://github.com/datawhalechina/unlock-deepseek
https://github.com/Jiayi-Pan/TinyZero

3.1 Training code

To demonstrate how to train a model with GRPO, this article also provides the complete training code. The flow is as follows:

[Figure: training flow]

Use Qwen/Qwen2.5-3B-Instruct as the base model
Use swulling/gsm8k_chinese as the training dataset

import re
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

SYSTEM_PROMPT = """
Respond in the following format:
<think>
...
</think>
<answer>
...
</answer>
"""

def process_data(data):
    return data.map(
        lambda x: {
            "prompt": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": x["question_zh-cn"]},
            ],
            "answer": x["answer_only"],
        }
    )

def extract_answer(text):
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

def correctness_reward(completions, answer, **kwargs):
    '''Correctness reward: 2.0 when the extracted answer equals the reference answer.'''
    responses = [completion[0]["content"] for completion in completions]
    extracted_responses = [extract_answer(r) for r in responses]
    return [2.0 if response == str(ans) else 0.0 for response, ans in zip(extracted_responses, answer)]

def completion_reward(completions, **kwargs):
    '''Reward function that gives higher scores to longer completions.'''
    # Completions are chat-formatted here (a list holding one message dict),
    # so measure the length of the text rather than the length of the list.
    return [float(len(completion[0]["content"])) / 100 for completion in completions]

def digit_reward(completions, **kwargs):
    '''Digit reward: 0.5 when the extracted answer consists only of digits.'''
    responses = [completion[0]["content"] for completion in completions]
    extracted_responses = [extract_answer(r) for r in responses]
    return [0.5 if response.isdigit() else 0.0 for response in extracted_responses]

def format_reward(completions, **kwargs):
    '''Format reward: 0.5 when the completion follows the <think>/<answer> template.'''
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    # SYSTEM_PROMPT asks for line breaks inside the tags, so "." must also
    # match newlines; without re.DOTALL a multi-line answer never matches.
    matches = [re.match(pattern, response, re.DOTALL) for response in responses]
    return [0.5 if match else 0.0 for match in matches]

def mark_reward(completions, **kwargs):
    '''Tag reward (mitigates the sparsity of the format reward)'''
    def mark_num(text):
        reward = 0
        if text.count("<think>\n") == 1:
            reward += 0.125

        if text.count("</think>\n") == 1:
            reward += 0.125

        if text.count("<answer>\n") == 1:
            reward += 0.125

        if text.count("</answer>\n") == 1:
            reward += 0.125 * 2

        return reward

    responses = [completion[0]["content"] for completion in completions]
    return [mark_num(response) for response in responses]

if __name__ == "__main__":
    model_name = "Qwen/Qwen2.5-3B-Instruct"
    model = AutoModelForCausalLM.from_pretrained(model_name, cache_dir="./model")
    model.cuda()
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    ds = load_dataset("swulling/gsm8k_chinese", cache_dir="./dataset")
    data = process_data(ds["train"])
    output_dir = "output"
    training_args = GRPOConfig(
        output_dir=output_dir,
        learning_rate=5e-6,
        adam_beta1=0.9,
        adam_beta2=0.99,
        weight_decay=0.1,
        warmup_ratio=0.1,
        lr_scheduler_type="cosine",
        logging_steps=1,
        bf16=True,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        num_generations=16,  # completions sampled per prompt (the group size)
        max_prompt_length=256,
        max_completion_length=200,
        num_train_epochs=1,
        save_steps=100,
        max_grad_norm=0.1,
        log_on_each_node=False,
        use_vllm=False,
        report_to="tensorboard",
    )

    trainer = GRPOTrainer(
        model=model,
        processing_class=tokenizer,
        reward_funcs=[
            mark_reward,
            format_reward,
            digit_reward,
            completion_reward,
            correctness_reward,
        ],
        args=training_args,
        train_dataset=data,
        peft_config=LoraConfig(task_type="CAUSAL_LM"),
    )
    trainer.train()
    trainer.save_model(output_dir)
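The mark_reward docstring says it mitigates the sparsity of the format reward. A small comparison on hand-written completions shows what that means. The two helpers below are single-string variants written for this sketch (their names are hypothetical), using the same tag credits as mark_reward and assuming the format pattern is matched with re.DOTALL.

```python
import re

def format_reward_one(text):
    """Sparse reward: all-or-nothing match of the full template."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 0.5 if re.match(pattern, text, re.DOTALL) else 0.0

def mark_reward_one(text):
    """Dense reward: partial credit for each tag that appears exactly once."""
    credits = [("<think>\n", 0.125), ("</think>\n", 0.125),
               ("<answer>\n", 0.125), ("</answer>\n", 0.25)]
    return sum(credit for tag, credit in credits if text.count(tag) == 1)

# A completion that reasons inside <think> but forgets the <answer> tags.
partial = "<think>\n3 + 4 = 7\n</think>\n7"
# A completion that follows the whole template.
full = "<think>\n3 + 4 = 7\n</think>\n<answer>\n7\n</answer>\n"

print(format_reward_one(partial), mark_reward_one(partial))
print(format_reward_one(full), mark_reward_one(full))
```

The partially formatted completion earns nothing from the format reward but still gets credit for the tags it did produce, which gives the policy a gradient toward the template early in training.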

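3.2 Output

(1) Output of Qwen/Qwen2.5-3B-Instruct:

User: 
Using the numbers [44, 19, 35], create an equation that equals 98. You can use basic arithmetic operations (+, -, *, /) once or multiple times, but each number can only be used once. Show your thinking in <think> </think> tags, and return the final equation in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>. Think step by step inside the <think> tags.

A:  
<think>First we need to find a way to combine the given numbers to get 98. We can try different combinations, such as addition and subtraction. Note that 98 can be split into the sum of 90 and 8, so we can consider using 44 and 35 to construct these two numbers.</think>
<think>Observe that the difference between 44 and 35 is 9. If we subtract this difference from 35, we get 8. That is, 35 - (44 - 35) = 8.</think>
<think>Now we have 8 and 90 and can add them to get 98. So the final equation is 35 + (44 - 35).</think>
<answer> 35 + (44 - 35) </answer>

(2) Output of the trained model:

User: 
Using the numbers [44, 19, 35], create an equation that equals 98. You can use basic arithmetic operations (+, -, *, /) once or multiple times, but each number can only be used once. Show your thinking in <think> </think> tags, and return the final equation in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>. Think step by step inside the <think> tags.

A:  
<think>
First, I need to use the numbers 44, 19 and 35 with basic arithmetic operations (+, -, *, /) to construct an equation that equals 98, using each number only once.

I start by adding or multiplying the larger numbers to get close to the target of 98. For example, 44 plus 19 is 63, and adding 35 gives 98. The calculation is:

44 + 19 = 63  
63 + 35 = 98  

Therefore, the final equation can be written as 44 + 19 + 35 = 98.
</think>

<answer> 44 + 19 + 35 = 98 </answer>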
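The difference between the two outputs can be checked mechanically. The sketch below is a hypothetical verifier for this kind of task, not code from the article or from any of the referenced projects: `check_equation` tests that the expression uses each given number exactly once and evaluates to the target, parsing the arithmetic with Python's `ast` module instead of calling `eval`.

```python
import ast
import operator
import re

# Only the four basic operations allowed by the task.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def evaluate(node):
    """Evaluate an arithmetic AST made only of numbers and + - * /."""
    if isinstance(node, ast.Expression):
        return evaluate(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
    raise ValueError("unsupported expression")

def check_equation(answer, numbers, target):
    """True if the expression uses each number exactly once and equals target."""
    expression = answer.split("=")[0].strip()  # drop a trailing "= 98", if any
    used = sorted(int(n) for n in re.findall(r"\d+", expression))
    if used != sorted(numbers):
        return False
    try:
        value = evaluate(ast.parse(expression, mode="eval"))
    except (ValueError, SyntaxError, ZeroDivisionError):
        return False
    return abs(value - target) < 1e-6

numbers, target = [44, 19, 35], 98
base_ok = check_equation("35 + (44 - 35)", numbers, target)
trained_ok = check_equation("44 + 19 + 35 = 98", numbers, target)
print(base_ok, trained_ok)
```

The base model's answer is rejected because it uses 35 twice and never uses 19, while the trained model's answer passes both checks. A check of this shape could also serve as a correctness reward for such a task.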

References

(1) https://github.com/Jiayi-Pan/TinyZero

(2) https://github.com/huggingface/open-r1

(3) DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (https://arxiv.org/pdf/2501.12948)

(4) https://zhuanlan.zhihu.com/p/20021693569

(5) https://zhuanlan.zhihu.com/p/19949917958

(6) https://blog.csdn.net/qq_38961840/article/details/145387854

This article is reposted from 周末程序猿. Author: 周末程序猿


Last modified 2025-2-12 15:15:51
