No labeled dataset? How to instruction-tune a large model: introducing a promising model for generating labeled datasets

Published 2024-6-20 09:49


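When building LLM applications, there are usually two ways to improve results: building an external knowledge base and using RAG, or fine-tuning. RAG is not a cure-all. Domain-specific LLM applications, and cases where the model must complete a specific task without being given examples, call for fine-tuning. Compared with RAG, fine-tuning needs more compute and a longer cycle, but the bigger bottleneck is that it needs labeled sample data. Many enterprises find it hard to accumulate data of that quality. Their data is usually unlabeled, perhaps a collection of articles or policy documents, and does not exist as question-answer pairs.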

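To fine-tune, the traditional approach is to construct question-answer pairs by hand. Building on this, the Stanford team behind Alpaca proposed having a strong model such as GPT-4 imitate seed examples to generate a labeled dataset (the original Alpaca release used text-davinci-003).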


https://arxiv.org/pdf/2402.18334

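This article introduces Bonito (https://github.com/BatsResearch/bonito), a new project for generating sample data. Bonito is an open-source model for conditional task generation: it converts unannotated text into task-specific training datasets for instruction tuning. According to the paper, the model is fine-tuned from mistralai/Mistral-7B-v0.1 on a dataset of 1.65 million examples (https://huggingface.co/datasets/BatsResearch/ctga-v1) and supports multiple task types, including multiple-choice question answering, yes-no question answering, natural language inference, and topic classification.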


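The Bonito project itself is an LLM application for data generation. The model is accelerated by vLLM and is fairly simple to use. The basic process is to extract the document content (from PDFs, for example) into a datasets Dataset, specify the task type to generate, and pass both to bonito.generate_tasks.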
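The extraction step can be sketched in plain Python. This is a minimal illustration, assuming the document text has already been pulled out of its source file; the helper name paragraphs_to_records and the min_chars threshold are hypothetical and not part of Bonito.

```python
from typing import Dict, List


def paragraphs_to_records(
    document: str, context_col: str = "input", min_chars: int = 20
) -> List[Dict[str, str]]:
    """Split raw text on blank lines and wrap each paragraph as a record.

    Hypothetical helper, not part of Bonito: it only prepares rows whose
    column name matches the context_col later given to generate_tasks.
    """
    records = []
    for block in document.split("\n\n"):
        # Collapse internal line breaks and surrounding whitespace.
        paragraph = " ".join(block.split())
        # Skip fragments too short to serve as context (headings, page marks).
        if len(paragraph) >= min_chars:
            records.append({context_col: paragraph})
    return records


document = """Confidential Information shall mean any data or document
that is designated as confidential at the time of disclosure.

Page 2

The Recipient shall not disclose Confidential Information to any third party."""

records = paragraphs_to_records(document)
print(len(records))
```

The resulting list of dicts can be wrapped with datasets.Dataset.from_list(records) and passed to generate_tasks with context_col="input".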

Bonito definition:

class Bonito(LLM, AbstractBonito):
    def generate_tasks(
        self,
        text_dataset: Dataset,
        context_col: str,
        task_type: str,
        sampling_params: SamplingParams,
        **kwargs,
    ):
        """
        Generates tasks using the Bonito model.


        This method takes a text dataset, a context column name,
        a task type, and sampling parameters, and generates tasks
        using the Bonito model. It processes the input dataset,
        generates outputs, collects multiple generations into
        one dataset object, and filters out the examples that
        cannot be parsed.


        Args:
            text_dataset (Dataset): The dataset that provides the text
                for the tasks.
            context_col (str): The name of the column in the dataset
                that provides the context for the tasks.
            task_type (str): The type of the tasks. This can be a
                short form or a full form.
            sampling_params (SamplingParams): The parameters for
                sampling.
            **kwargs: Additional keyword arguments.


        Returns:
            Dataset: The synthetic dataset with the generated tasks.
        """
        processed_dataset = self._prepare_bonito_input(
            text_dataset, task_type, context_col, **kwargs
        )
        outputs = self.generate(processed_dataset["input"], sampling_params)


        # collect multiple generations into one dataset object
        examples = []
        for i, example in enumerate(text_dataset.to_list()):
            for output in outputs[i].outputs:
                examples.append(
                    {"context": example[context_col], "prediction": output.text.strip()}
                )


        synthetic_dataset = Dataset.from_list(examples)


        # filter out the examples that cannot be parsed
        synthetic_dataset = self._postprocess_dataset(
            synthetic_dataset, context_col="context", **kwargs
        )


        return synthetic_dataset

Basic usage:

from bonito import Bonito
from vllm import SamplingParams
from datasets import load_dataset

# Initialize the Bonito model
bonito = Bonito("BatsResearch/bonito-v1")

# load dataset with unannotated text
unannotated_text = load_dataset(
    "BatsResearch/bonito-experiment",
    "unannotated_contract_nli"
)["train"].select(range(10))

# Generate synthetic instruction tuning dataset
sampling_params = SamplingParams(max_tokens=256, top_p=0.95, temperature=0.5, n=1)
synthetic_dataset = bonito.generate_tasks(
    unannotated_text,
    context_col="input",
    task_type="nli",
    sampling_params=sampling_params
)
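Once a synthetic dataset exists, it usually has to be written out in whatever format the fine-tuning tool expects. The sketch below assumes each generated row carries "input" and "output" fields and writes JSON Lines; the prompt/completion key names and both helper functions are illustrative choices, not something Bonito prescribes.

```python
import json
from typing import Dict, Iterable, List


def to_instruction_records(rows: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep rows with non-empty input and output, renamed to prompt/completion."""
    records = []
    for row in rows:
        prompt = (row.get("input") or "").strip()
        completion = (row.get("output") or "").strip()
        # Drop rows where either side is missing or blank.
        if prompt and completion:
            records.append({"prompt": prompt, "completion": completion})
    return records


def to_jsonl(records: Iterable[Dict[str, str]]) -> str:
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Hypothetical rows standing in for synthetic_dataset.to_list().
rows = [
    {"input": "Based on the passage, is a written communication confidential?", "output": "Yes"},
    {"input": "A passage whose generated answer came back empty.", "output": ""},
]
records = to_instruction_records(rows)
jsonl = to_jsonl(records)
print(jsonl)
```

In practice rows would come from synthetic_dataset.to_list(), and the string would be written to a .jsonl file.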

To run on a GPU with less memory, such as a T4, the model can be quantized.

from typing import List, Dict
from datasets import Dataset
from awq import AutoAWQForCausalLM
from bonito import AbstractBonito
from transformers import AutoTokenizer


class QuantizedBonito(AbstractBonito):
    def __init__(self, model_name_or_path):
        self.model = AutoAWQForCausalLM.from_quantized(model_name_or_path, fuse_layers=True).cuda()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)


    def generate_task(
        self,
        unannotated_paragraph: str,
        task_type: str,
        sampling_params: dict,
    ) -> Dict:
        """
        Generates synthetic instruction tuning pair using the Quantized Bonito model.
        This method takes a text unannotated text, a task type, and sampling parameters,
        and generates synthetic input-output pair.


        Args:
            unannotated_paragraph (str): The unannotated text or a paragraph
            task_type (str): The type of the tasks. This can be a
                short form or a full form.
            sampling_params (dict): The parameters for
                sampling.


        Returns:
            Dict: The synthetic input-output pair for the task type.
        """


        text_dataset = Dataset.from_list([{"input": unannotated_paragraph}])


        processed_dataset = self._prepare_bonito_input(
            text_dataset, task_type, context_col="input"
        )


        outputs = self._generate_text(processed_dataset["input"], sampling_params)
        examples = []
        for i, example in enumerate(text_dataset.to_list()):
            output = outputs[i]
            example["prediction"] = output.strip()
            examples.append(example)


        synthetic_dataset = Dataset.from_list(examples)


        # filter out the examples that cannot be parsed
        synthetic_dataset_dict = self._postprocess_dataset(
            synthetic_dataset, context_col="input"
        ).to_list()[0]


        return synthetic_dataset_dict


    def _generate_text(
        self,
        dataset: Dataset,
        sampling_params: dict,
        ) -> List[str]:
        """
        Generate text using huggingface transformers generate function.


        This method takes a dataset of prompts, encodes them,
        generates text using the model, decodes the generated
        text, and appends it to a list.


        Args:
            dataset (Dataset): A dataset containing prompts for text generation.
            sampling_params (dict): Parameters for sampling during generation.


        Returns:
            List[str]: A list of generated texts corresponding to the prompts.
        """
        generated_texts = []


        for prompt in dataset:
            input_ids = self.tokenizer.encode(prompt, return_tensors="pt")
            input_ids = input_ids.cuda()


            output = self.model.generate(
                input_ids,
                do_sample=True,
                **sampling_params
            )


            generated_text = self.tokenizer.decode(output[0][len(input_ids[0]):], skip_special_tokens=True)
            generated_texts.append(generated_text)


        return generated_texts

Taking task_type "ynqa" (yes-or-no question answering) as an example, the generated result is as follows:

from pprint import pprint

# `bonito` is a QuantizedBonito instance and `unannotated_paragraph` is a
# string of unlabeled text; both are assumed to be created beforehand.
sampling_params = {'max_new_tokens':256, 'top_p':0.95, 'temperature':0.7, 'num_return_sequences':1}
synthetic_dataset = bonito.generate_task(
    unannotated_paragraph,
    task_type="ynqa",
    sampling_params=sampling_params
)
pprint("----Generated Instructions----")
pprint(f'Input: {synthetic_dataset["input"]}')
pprint(f'Output: {synthetic_dataset["output"]}')


'----Generated Instructions----'
('Input: Based on the following passage, is a written communication '
 'confidential? 1. “Confidential Information”, whenever used in this '
 'Agreement, shall mean any data, document, specification and other '
 'information or material, that is delivered or disclosed by UNHCR to the '
 'Recipient in any form whatsoever, whether orally, visually in writing or '
 'otherwise (including computerized form), and that, at the time of disclosure '
 'to the Recipient, is designated as confidential.')
'Output: Yes'
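The vLLM example and the quantized example pass sampling settings under different names: max_tokens and n for vLLM's SamplingParams, max_new_tokens and num_return_sequences for the transformers-style generate call. A small renaming helper, sketched here as a hypothetical convenience that covers only the fields used in this article, keeps one set of settings usable in both paths.

```python
from typing import Any, Dict

# vLLM SamplingParams field -> transformers generate() keyword,
# limited to the fields that appear in the examples in this article.
_VLLM_TO_HF = {"max_tokens": "max_new_tokens", "n": "num_return_sequences"}


def to_hf_generate_kwargs(vllm_style: Dict[str, Any]) -> Dict[str, Any]:
    """Rename vLLM-style sampling keys; keys shared by both pass through."""
    return {_VLLM_TO_HF.get(key, key): value for key, value in vllm_style.items()}


vllm_style = {"max_tokens": 256, "top_p": 0.95, "temperature": 0.5, "n": 1}
hf_kwargs = to_hf_generate_kwargs(vllm_style)
print(hf_kwargs)
# {'max_new_tokens': 256, 'top_p': 0.95, 'temperature': 0.5, 'num_return_sequences': 1}
```

The resulting dict has the shape that QuantizedBonito.generate_task expects for its sampling_params argument.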

The task types supported by task_type are as follows:

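Extractive question answering (exqa): answer a question from a given passage, extracting the answer directly from the text.
Multiple-choice question answering (mcqa): provide answers to a set of multiple-choice questions.
Question generation (qg): create questions from the provided text.
Question answering without choices (qa): answer questions without being given multiple-choice options.
Yes-no question answering (ynqa): generate a yes or no answer to a question.
Coreference resolution (coref): identify expressions in the text that refer to the same entity.
Paraphrase generation (paraphrase): rewrite a sentence or phrase in different wording while preserving the original meaning.
Paraphrase identification (paraphrase_id): determine whether two sentences or phrases convey the same meaning.
Sentence completion (sent_comp): fill in the missing part of a sentence.
Sentiment analysis (sentiment): identify the sentiment expressed in the text, such as positive, negative, or neutral.
Summarization (summarization): condense a longer text into a shorter summary that captures the main points.
Text generation (text_gen): create coherent, contextually relevant text from a prompt.
Topic classification (topic_class): classify text into predefined topics.
Word sense disambiguation (wsd): determine the meaning of a word from its context.
Textual entailment (te): predict whether a given text logically follows from another.
Natural language inference (nli): determine the relationship between two texts, such as contradiction, entailment, or neutral.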
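Since task_type accepts either a short form or a full form, a lookup table helps catch typos before a GPU job is launched. The sketch below builds one from the list above. The full-form strings are English renderings used for illustration and are an assumption; they may not match the exact strings the library accepts, and resolve_task_type is a hypothetical helper.

```python
# Short form -> full form, following the list above. The full-form
# strings are illustrative and may differ from the library's own.
TASK_TYPES = {
    "exqa": "extractive question answering",
    "mcqa": "multiple-choice question answering",
    "qg": "question generation",
    "qa": "question answering without choices",
    "ynqa": "yes-no question answering",
    "coref": "coreference resolution",
    "paraphrase": "paraphrase generation",
    "paraphrase_id": "paraphrase identification",
    "sent_comp": "sentence completion",
    "sentiment": "sentiment analysis",
    "summarization": "summarization",
    "text_gen": "text generation",
    "topic_class": "topic classification",
    "wsd": "word sense disambiguation",
    "te": "textual entailment",
    "nli": "natural language inference",
}


def resolve_task_type(task_type: str) -> str:
    """Return the short form for a short or full task name, else raise."""
    key = task_type.strip().lower()
    if key in TASK_TYPES:
        return key
    for short, full in TASK_TYPES.items():
        if key == full:
            return short
    raise ValueError(f"Unknown task type: {task_type!r}")


print(resolve_task_type("ynqa"))
print(resolve_task_type("Natural Language Inference"))
```

Validating up front like this fails fast on a misspelled task name instead of after the model has been loaded.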

In terms of performance, compared with the GPT-4 approach, Bonito outperformed GPT-4 on two of the three datasets.


Summary:

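Compared with using GPT-4 to generate labeled samples, Bonito, a model fine-tuned specifically for dataset generation, supports zero-shot sample generation and builds on an open-source model, which gives it a clear advantage in openness, cost, and performance.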

As fine-tuning becomes more widespread, the quality and production cost of data samples will likely receive more and more attention, and dataset generation models such as Bonito should see further development.

Reposted from AI工程化; author: ully
