
No labeled dataset? How to instruction-tune a large model: introducing a promising model for generating labeled datasets

Published 2024-6-20 09:49

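When building LLM applications, there are generally two ways to improve results. One is to build an external knowledge base and use RAG; the other is fine-tuning. RAG is not a cure-all: for domain-specific LLM applications, or when a specific task must be completed without in-prompt examples, fine-tuning is needed. Compared with RAG, fine-tuning takes more compute and time, but the bigger bottleneck is that it requires labeled sample data. Many enterprises have not accumulated such high-quality data. Their data is usually unlabeled, in the form of individual articles or rules and regulations, rather than question-answer pairs.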

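To make fine-tuning possible, the traditional approach is to construct question-answer pairs by hand. Building on this, a Stanford research team proposed Alpaca, which has a strong model such as GPT-4 imitate seed examples to generate a labeled dataset (the original Alpaca release used OpenAI's text-davinci-003).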


Paper: https://arxiv.org/pdf/2402.18334

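This article introduces a newer project for generating sample data, Bonito (https://github.com/BatsResearch/bonito), an open-source model for conditional task generation. It converts unannotated text into task-specific training datasets for instruction tuning. According to the paper, the model is fine-tuned from mistralai/Mistral-7B-v0.1 on a dataset of 1.65 million examples (https://huggingface.co/datasets/BatsResearch/ctga-v1) and supports multiple task types, including multiple-choice question answering, yes-no question answering, natural language inference, and topic classification.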



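The Bonito project itself is an LLM application for data generation. Inference is accelerated with vLLM, and usage is fairly simple. The basic flow is to extract the document content (from PDFs, for example) into a datasets Dataset, specify the task type to generate, and pass both to bonito.generate_tasks.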
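The extraction step can be sketched with the standard library alone. This is a minimal sketch, assuming the document has already been converted to plain text; the helper split_into_records, the max_chars limit, and the "input" column name are illustrative choices, not part of Bonito.

```python
from typing import Dict, List


def split_into_records(text: str, max_chars: int = 1000) -> List[Dict[str, str]]:
    """Split plain text into paragraph-sized records with an "input" column.

    Paragraphs are separated by blank lines. Consecutive paragraphs are
    merged while the combined length stays within max_chars (a hypothetical
    limit); a single paragraph longer than max_chars is kept whole.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    records: List[Dict[str, str]] = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the limit: close the chunk.
            records.append({"input": current})
            current = paragraph
        else:
            current = candidate
    if current:
        records.append({"input": current})
    return records


document = "First clause of a policy.\n\nSecond clause.\n\n" + "x" * 50
records = split_into_records(document, max_chars=45)
# With the `datasets` library installed, the records become a dataset via:
#     from datasets import Dataset
#     text_dataset = Dataset.from_list(records)
```

The resulting dataset can then be passed to generate_tasks with context_col="input".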

Bonito definition:

class Bonito(LLM, AbstractBonito):
    def generate_tasks(
        self,
        text_dataset: Dataset,
        context_col: str,
        task_type: str,
        sampling_params: SamplingParams,
        **kwargs,
    ):
        """
        Generates tasks using the Bonito model.


        This method takes a text dataset, a context column name,
        a task type, and sampling parameters, and generates tasks
        using the Bonito model. It processes the input dataset,
        generates outputs, collects multiple generations into
        one dataset object, and filters out the examples that
        cannot be parsed.


        Args:
            text_dataset (Dataset): The dataset that provides the text
                for the tasks.
            context_col (str): The name of the column in the dataset
                that provides the context for the tasks.
            task_type (str): The type of the tasks. This can be a
                short form or a full form.
            sampling_params (SamplingParams): The parameters for
                sampling.
            **kwargs: Additional keyword arguments.


        Returns:
            Dataset: The synthetic dataset with the generated tasks.
        """
        processed_dataset = self._prepare_bonito_input(
            text_dataset, task_type, context_col, **kwargs
        )
        outputs = self.generate(processed_dataset["input"], sampling_params)


        # collect multiple generations into one dataset object
        examples = []
        for i, example in enumerate(text_dataset.to_list()):
            for output in outputs[i].outputs:
                examples.append(
                    {"context": example[context_col], "prediction": output.text.strip()}
                )


        synthetic_dataset = Dataset.from_list(examples)


        # filter out the examples that cannot be parsed
        synthetic_dataset = self._postprocess_dataset(
            synthetic_dataset, context_col="context", **kwargs
        )


        return synthetic_dataset
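The last two steps of generate_tasks, collecting generations and dropping outputs that cannot be parsed, can be illustrated without the model. The sketch below is not Bonito's _postprocess_dataset. It assumes the raw prediction holds an instruction and an answer separated by a delimiter (written here as <|pipe|>) and may contain a {{context}} placeholder. Both markers and the helper name parse_predictions are assumptions for illustration.

```python
from typing import Dict, List

DELIMITER = "<|pipe|>"        # assumed separator between instruction and answer
PLACEHOLDER = "{{context}}"   # assumed marker where the source text is inserted


def parse_predictions(examples: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only predictions that split into exactly one input and one output."""
    parsed = []
    for example in examples:
        parts = example["prediction"].split(DELIMITER)
        if len(parts) != 2:
            continue  # unparseable generation: drop it
        instruction, answer = (part.strip() for part in parts)
        if not instruction or not answer:
            continue  # one side is empty: also treated as unparseable
        parsed.append(
            {
                "input": instruction.replace(PLACEHOLDER, example["context"]),
                "output": answer,
            }
        )
    return parsed


raw = [
    {"context": "Cats sleep a lot.",
     "prediction": "{{context}} Do cats sleep a lot? <|pipe|> Yes"},
    {"context": "Cats sleep a lot.",
     "prediction": "no delimiter in this generation"},
]
pairs = parse_predictions(raw)
```

Only the first generation survives, which is why the synthetic dataset can end up smaller than the number of input texts.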

Basic usage:

from bonito import Bonito
from vllm import SamplingParams
from datasets import load_dataset


# Initialize the Bonito model
bonito = Bonito("BatsResearch/bonito-v1")


# load dataset with unannotated text
unannotated_text = load_dataset(
    "BatsResearch/bonito-experiment",
    "unannotated_contract_nli"
)["train"].select(range(10))


# Generate synthetic instruction tuning dataset
sampling_params = SamplingParams(max_tokens=256, top_p=0.95, temperature=0.5, n=1)
synthetic_dataset = bonito.generate_tasks(
    unannotated_text,
    context_col="input",
    task_type="nli",
    sampling_params=sampling_params
)
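The temperature and top_p values passed to SamplingParams control how varied the generated tasks are. The toy sketch below, which is not vLLM's implementation, shows the usual idea on a hand-made four-token distribution: temperature rescales the logits before the softmax, and nucleus (top-p) filtering keeps the smallest set of most likely tokens whose cumulative probability reaches top_p. The function names and the example logits are made up for illustration.

```python
import math
from typing import Dict


def softmax_with_temperature(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = {token: value / temperature for token, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {token: math.exp(value - peak) for token, value in scaled.items()}
    total = sum(exps.values())
    return {token: value / total for token, value in exps.items()}


def top_p_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the most likely tokens until their cumulative mass reaches top_p, then renormalize."""
    kept: Dict[str, float] = {}
    cumulative = 0.0
    for token, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[token] = prob
        cumulative += prob
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {token: prob / total for token, prob in kept.items()}


logits = {"Yes": 2.0, "No": 1.0, "Maybe": 0.0, "Unknown": -1.0}  # made-up values
warm = softmax_with_temperature(logits, temperature=1.0)
cool = softmax_with_temperature(logits, temperature=0.5)
nucleus = top_p_filter(cool, top_p=0.95)
```

With these made-up logits, lowering the temperature to 0.5 concentrates probability on the top token, and the 0.95 nucleus keeps only the two most likely tokens.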

To run on a GPU with less memory, such as a T4, the model can be quantized.

from typing import Optional, List, Dict
from datasets import Dataset
from awq import AutoAWQForCausalLM
from bonito import AbstractBonito
from transformers import AutoTokenizer




class QuantizedBonito(AbstractBonito):
    def __init__(self, model_name_or_path):
        self.model = AutoAWQForCausalLM.from_quantized(model_name_or_path, fuse_layers=True).cuda()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)


    def generate_task(
        self,
        unannotated_paragraph: str,
        task_type: str,
        sampling_params: dict,
    ) -> Dict:
        """
        Generates a synthetic instruction tuning pair using the Quantized Bonito model.
        This method takes an unannotated text, a task type, and sampling parameters,
        and generates a synthetic input-output pair.


        Args:
            unannotated_paragraph (str): The unannotated text or a paragraph
            task_type (str): The type of the tasks. This can be a
                short form or a full form.
            sampling_params (dict): The parameters for
                sampling.


        Returns:
            Dict: The synthetic input-output pair for the task type.
        """


        text_dataset = Dataset.from_list([{"input": unannotated_paragraph}])


        processed_dataset = self._prepare_bonito_input(
            text_dataset, task_type, context_col="input"
        )


        outputs = self._generate_text(processed_dataset["input"], sampling_params)
        examples = []
        for i, example in enumerate(text_dataset.to_list()):
            output = outputs[i]
            example["prediction"] = output.strip()
            examples.append(example)


        synthetic_dataset = Dataset.from_list(examples)


        # filter out the examples that cannot be parsed
        synthetic_dataset_dict = self._postprocess_dataset(
            synthetic_dataset, context_col="input"
        ).to_list()[0]


        return synthetic_dataset_dict


    def _generate_text(
        self,
        dataset: Dataset,
        sampling_params: dict,
    ) -> List[str]:
        """
        Generate text using huggingface transformers generate function.


        This method takes a dataset of prompts, encodes them,
        generates text using the model, decodes the generated
        text, and appends it to a list.


        Args:
            dataset (Dataset): A dataset containing prompts for text generation.
            sampling_params (dict): Parameters for sampling during generation.


        Returns:
            List[str]: A list of generated texts corresponding to the prompts.
        """
        generated_texts = []


        for prompt in dataset:
            input_ids = self.tokenizer.encode(prompt, return_tensors="pt")
            input_ids = input_ids.cuda()


            output = self.model.generate(
                input_ids,
                do_sample=True,
                **sampling_params
            )


            generated_text = self.tokenizer.decode(output[0][len(input_ids[0]):], skip_special_tokens=True)
            generated_texts.append(generated_text)


        return generated_texts
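A rough back-of-the-envelope estimate shows why quantization helps on a T4. The sketch assumes about 7 billion parameters (Bonito is fine-tuned from Mistral-7B), 16-bit weights for the unquantized model, 4-bit weights after AWQ quantization, and a nominal 16 GB of memory on the T4. It counts weights only and ignores activations, the KV cache, and quantization metadata, so real usage is higher.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Memory needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 7e9   # assumed: roughly 7 billion parameters
T4_MEMORY_GB = 16  # assumed: nominal memory of an NVIDIA T4

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # 16-bit weights
int4_gb = weight_memory_gb(NUM_PARAMS, 4)   # 4-bit weights

print(f"fp16 weights: {fp16_gb:.1f} GB, headroom on T4: {T4_MEMORY_GB - fp16_gb:.1f} GB")
print(f"4-bit weights: {int4_gb:.1f} GB, headroom on T4: {T4_MEMORY_GB - int4_gb:.1f} GB")
```

Under these assumptions the 16-bit weights alone take about 14 GB, leaving little room for anything else, while 4-bit weights take about 3.5 GB.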

Taking task_type ynqa (yes-or-no questions) as an example, the generated result is as follows:

from pprint import pprint

# Here bonito is a QuantizedBonito instance, and unannotated_paragraph is a
# plain string holding the text to generate a task from.
sampling_params = {'max_new_tokens':256, 'top_p':0.95, 'temperature':0.7, 'num_return_sequences':1}
synthetic_dataset = bonito.generate_task(
    unannotated_paragraph,
    task_type="ynqa",
    sampling_params=sampling_params
)
pprint("----Generated Instructions----")
pprint(f'Input: {synthetic_dataset["input"]}')
pprint(f'Output: {synthetic_dataset["output"]}')


'----Generated Instructions----'
('Input: Based on the following passage, is a written communication '
 'confidential? 1. “Confidential Information”, whenever used in this '
 'Agreement, shall mean any data, document, specification and other '
 'information or material, that is delivered or disclosed by UNHCR to the '
 'Recipient in any form whatsoever, whether orally, visually in writing or '
 'otherwise (including computerized form), and that, at the time of disclosure '
 'to the Recipient, is designated as confidential.')
'Output: Yes'
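Once generated, pairs like the one above need to be written in a format a fine-tuning script can read. A minimal sketch, assuming each example is a dict with input and output keys as in the printed result. The instruction/response field names and the to_jsonl helper are illustrative, since the expected schema depends on the training framework.

```python
import json
from typing import Dict, Iterable


def to_jsonl(examples: Iterable[Dict[str, str]]) -> str:
    """Serialize input/output pairs as one JSON object per line."""
    lines = []
    for example in examples:
        record = {
            "instruction": example["input"],  # hypothetical field name
            "response": example["output"],    # hypothetical field name
        }
        # ensure_ascii=False keeps non-ASCII text (such as curly quotes) readable
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


examples = [
    {
        # Passage abbreviated here for brevity.
        "input": "Based on the following passage, is a written communication confidential? ...",
        "output": "Yes",
    }
]
jsonl_text = to_jsonl(examples)
# To save: open("train.jsonl", "w", encoding="utf-8").write(jsonl_text + "\n")
```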

The task types supported by task_type are as follows:

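  1. Extractive question answering (exqa): generates answers to questions based on a given passage, extracting the answer directly from the text.
  2. Multiple-choice question answering (mcqa): answers a question by choosing from a set of options.
  3. Question generation (qg): creates questions from the provided text.
  4. Question answering without choices (qa): answers questions without multiple-choice options.
  5. Yes-no question answering (ynqa): generates a yes or no answer to a question.
  6. Coreference resolution (coref): identifies expressions in the text that refer to the same entity.
  7. Paraphrase generation (paraphrase): rewrites a sentence or phrase in different wording while preserving its meaning.
  8. Paraphrase identification (paraphrase_id): determines whether two sentences or phrases convey the same meaning.
  9. Sentence completion (sent_comp): fills in the missing part of a sentence.
  10. Sentiment analysis (sentiment): identifies the sentiment expressed in a text, such as positive, negative, or neutral.
  11. Summarization (summarization): condenses a longer text into a shorter summary that captures the main points.
  12. Text generation (text_gen): produces coherent, contextually relevant text from a prompt.
  13. Topic classification (topic_class): assigns text to predefined topics.
  14. Word sense disambiguation (wsd): determines the meaning of a word from its context.
  15. Textual entailment (te): predicts whether a given text logically follows from another.
  16. Natural language inference (nli): determines the relationship between two texts: contradiction, entailment, or neutral.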
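Since task_type accepts either a short or a full form, a lookup table is a natural way to normalize it. The sketch below mirrors the list above. The full-form strings and the resolve_task_type helper are illustrative and may not match the strings Bonito uses internally.

```python
# Short form -> full form, following the list above (full forms are assumed).
SHORT_TO_FULL = {
    "exqa": "extractive question answering",
    "mcqa": "multiple-choice question answering",
    "qg": "question generation",
    "qa": "question answering without choices",
    "ynqa": "yes-no question answering",
    "coref": "coreference resolution",
    "paraphrase": "paraphrase generation",
    "paraphrase_id": "paraphrase identification",
    "sent_comp": "sentence completion",
    "sentiment": "sentiment",
    "summarization": "summarization",
    "text_gen": "text generation",
    "topic_class": "topic classification",
    "wsd": "word sense disambiguation",
    "te": "textual entailment",
    "nli": "natural language inference",
}


def resolve_task_type(task_type: str) -> str:
    """Return the full-form task name for a short or full form, or raise ValueError."""
    key = task_type.strip().lower()
    if key in SHORT_TO_FULL:
        return SHORT_TO_FULL[key]
    if key in SHORT_TO_FULL.values():
        return key
    raise ValueError(f"Unsupported task type: {task_type!r}")


full_name = resolve_task_type("ynqa")
```

Rejecting unknown task types up front avoids spending GPU time on a prompt the model was never trained to follow.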


In terms of performance, compared with the GPT-4-based approach, Bonito outperformed GPT-4 on two of the three datasets.



Summary:

Compared with using GPT-4 to generate labeled samples, Bonito, a model fine-tuned specifically for dataset generation, supports zero-shot sample generation and is itself open source, which gives it strong advantages in openness, cost, and performance.

As fine-tuning becomes more widespread, the quality and production cost of data samples will likely receive more and more attention, and dataset generation models like Bonito should see further development.

This article is reposted from AI工程化. Author: ully
