自拍偷在线精品自拍偷,亚洲欧美中文日韩v在线观看不卡

<legend id="hbirb"><li id="hbirb"><menuitem id="hbirb"></menuitem></li></legend>

AI.x社區(qū)

軟考社區(qū)

企業(yè)培訓(xùn)

鴻蒙開發(fā)者社區(qū)

WOT技術(shù)大會

公眾號矩陣

移動端

視頻課免費課排行榜短視頻直播課軟考學(xué)堂

全部課程軟考華為認證廠商認證 IT技術(shù)PMP項目管理免費題庫

在線學(xué)習(xí)

文章資源問答課堂專欄直播

51CTO

鴻蒙開發(fā)者社區(qū)

51CTO技術(shù)棧

51CTO官微

51CTO學(xué)堂

51CTO博客

CTO訓(xùn)練營

鴻蒙開發(fā)者社區(qū)訂閱號

51CTO軟考

51CTO學(xué)堂APP

51CTO學(xué)堂企業(yè)版APP

鴻蒙開發(fā)者社區(qū)視頻號

51CTO軟考題庫

賬號設(shè)置退出

Tokenization?指南：字節(jié)對編碼，WordPiece等方法Python代碼詳解

作者：佚名 2024-01-17 16:29:59

在2022年11月OpenAI的ChatGPT發(fā)布之后，大型語言模型(llm)變得非常受歡迎。從那時起，這些語言模型的使用得到了爆炸式的發(fā)展，這在一定程度上得益于HuggingFace的Transformer庫和PyTorch等庫。

在2022年11月OpenAI的ChatGPT發(fā)布之后，大型語言模型(llm)變得非常受歡迎。從那時起，這些語言模型的使用得到了爆炸式的發(fā)展，這在一定程度上得益于HuggingFace的Transformer庫和PyTorch等庫。

計算機要處理語言，首先需要將文本轉(zhuǎn)換成數(shù)字形式。這個過程由一個稱為標記化 Tokenization。

標記化分為2個過程：

1、將輸入文本劃分為token

標記器首先獲取文本并將其分成更小的部分，可以是單詞、單詞的部分或單個字符。這些較小的文本片段被稱為標記。Stanford NLP Group[2]將標記更嚴格地定義為:

在某些特定的文檔中，作為一個有用的語義處理單元組合在一起的字符序列實例。

2、為每個標記分配一個ID

標記器將文本劃分為標記后，可以為每個標記分配一個稱為標記ID的整數(shù)。例如，單詞cat被賦值為15，因此輸入文本中的每個cat標記都用數(shù)字15表示。用數(shù)字表示替換文本標記的過程稱為編碼。類似地將已編碼的記號轉(zhuǎn)換回文本的過程稱為解碼。

使用單個數(shù)字表示記號有其缺點，因此要進一步處理這些編碼以創(chuàng)建詞嵌入，這個不在本文的范圍內(nèi)，我們后面介紹。

標記方法

將文本劃分為標記的主要方法有三種:

1、基于單詞:

基于單詞的標記化是三種標記化方法中最簡單的一種。標記器將通過拆分每個空格字符(有時稱為“基于空白的標記化”)或通過類似的規(guī)則集(如基于標點的標記化)將句子分成單詞[12]。

例如，這個句子:

Cats are great, but dogs are better!

通過空格可以拆分為:

['Cats', 'are', 'great,', 'but', 'dogs', 'are', 'better!']

通過分隔標點和可以拆分為:

['Cats', 'are', 'great', ',', 'but', 'dogs', 'are', 'better', '!']

這里可以看到，用于確定分割的規(guī)則非常重要。空格方法可以更好地提供潛在的稀有標記!，而通過標點割則使兩個不太罕見的標記更加突出!這里要說明下不要完全去掉標點符號，因為它們可以承載非常特殊的含義?！褪且粋€例子，它可以區(qū)分單詞的復(fù)數(shù)形式和所有格形式。例如，“book’s”指的是一本書的某些屬性，而“books”指的是許多書。

生成標記后，每個標記都會可以分配一個編號。下一次生成標記器已經(jīng)看到的標記時，可以簡單地為該標記分配為該單詞指定的數(shù)字。例如，如果在上面的句子中，標記great被賦值為1，那么great的所有后續(xù)實例也將被賦值為1[3]。

優(yōu)缺點:

基于單詞的方法生成的標記包含高度的信息，因為每個標記都包含語義和上下文信息。但是這種方法最大的缺點之一是非常相似的單詞被視為完全獨立的標記。例如，cat和cats之間的聯(lián)系將是不存在的，因此它們將被視為單獨的單詞。這在包含許多單詞的大規(guī)模應(yīng)用程序中成為一個問題，因為模型詞匯表中可能出現(xiàn)的標記數(shù)量(模型所看到的標記總數(shù))可能會變得非常大。英語大約有17萬個單詞，就會導(dǎo)致所謂的詞匯爆炸問題。這方面的一個例子是TransformerXL標記器，它使用基于空白的分割。這導(dǎo)致詞匯量超過25萬[4]。

解決這個問題的一種方法是對模型可以學(xué)習(xí)的標記數(shù)量施加硬限制(例如10,000)。這將把10,000個最常見的標記之外的任何單詞分類為詞匯表外(OOV)，并將標記值分配為UNKNOWN而不是數(shù)值(通常縮寫為UNK)。在存在許多未知單詞的情況下，這會導(dǎo)致性能下降，但如果數(shù)據(jù)中包含的大多是常見單詞，這可能是一種合適的折衷方法。[5]

2、基于字符的分詞器

基于字符的標記法根據(jù)每個字符拆分文本，包括:字母、數(shù)字和標點符號等特殊字符。這大大減少了詞匯量的大小，英語可以用大約256個標記來表示，而不是基于單詞的方法所需的170,000多個[5]。即使是東亞語言，如漢語和日語，其詞匯量也會顯著減少，盡管它們的書寫系統(tǒng)中包含數(shù)千個獨特的字符。

在基于字符的標記器中，以下句子:

Cats are great, but dogs are better!

會被拆分成：

['C', 'a', 't', 's', ' ', 'a', 'r', 'e', ' ', 'g', 'r', 'e', 'a', 't', ',', ' ', 'b', 'u', 't', ' ', 'd', 'o', 'g', 's', ' ', 'a', 'r', 'e', ' ', 'b', 'e', 't', 't', 'e', 'r', '!'`]

優(yōu)缺點:

與基于單詞的方法相比，基于字符的方法的詞匯表大小要小得多，而且詞匯表外的標記也要少得多。它可以對拼寫錯誤的單詞進行標記(盡管與單詞的正確形式不同)。

但是這種方法也有一些缺點。使用基于字符的方法生成的單個標記中存儲的信息非常少。這是因為與基于單詞的方法中的標記不同，沒有捕獲語義或上下文含義(特別是在使用基于字母的書寫系統(tǒng)的語言中，如英語)。這種方法限制了可以輸入語言模型的標記化輸入的大小，因為需要許多數(shù)字來編碼輸入文本。

3、基于子詞的分詞器

基于子詞的標記化可以實現(xiàn)基于詞和基于字符的方法的優(yōu)點，同時最大限度地減少它們的缺點?；谧釉~的方法采取了折中的方案，將單詞中的文本分開，創(chuàng)建具有語義意義的標記，即使它們不是完整的單詞。例如，符號ing和ed雖然本身不是單詞，但它們具有語法意義。

這種方法產(chǎn)生的詞匯表大小小于基于單詞的方法，但大于基于字符的方法。對于每個標記中存儲的信息量也是如此，它也位于前兩個方法生成的標記之間。

只拆分不常用的單詞，可以使詞形、復(fù)數(shù)形式等分解成它們的組成部分，同時保留符號之間的關(guān)系。例如，cat可能是數(shù)據(jù)集中非常常見的單詞，但cats可能不太常見。所以cats將被分成cat和s，其中cats現(xiàn)在被賦予與其他所有cats標記相同的值，而s被賦予不同的值，這可以編碼復(fù)數(shù)的含義。另一個例子是單詞tokenization，它可以分為詞根token和后綴ization。這種方法可以保持句法和語義的相似性[6]。由于這些原因，基于子詞的標記器在今天的NLP模型中非常常用。

標準化和預(yù)標記化

標記化過程需要一些預(yù)處理和后處理步驟，這些步驟組成了標記化管道。其中標記化方法(基于子詞，基于字符等)發(fā)生在模型步驟[7]中。

當使用Hugging Face的transformer庫中的標記器時，標記化管道的所有步驟都會自動處理。整個管道由一個名為Tokenizer的對象執(zhí)行。本節(jié)將深入研究大多數(shù)用戶在處理NLP任務(wù)時不需要手動處理的代碼的內(nèi)部工作原理。還將介紹在標記器庫中自定義基標記器類的步驟，這樣可以在需要時為特定任務(wù)專門構(gòu)建標記器。

1、規(guī)范化方法

規(guī)范化是在將文本拆分為標記之前清理文本的過程。這包括將每個字符轉(zhuǎn)換為小寫，從字符中刪除重復(fù)，刪除不必要的空白等步驟。例如，字符串Thís is áN examplise sénteNCE。不同的規(guī)范化程序?qū)?zhí)行不同的步驟，

Hugging Face的Normalizers包包含幾個基本的Normalizers，一般常用的有：

NFC:不轉(zhuǎn)換大小寫或移除口音

Lower:轉(zhuǎn)換大小寫，但不移除口音

BERT:轉(zhuǎn)換大小寫并移除口音

我們可以看看上面三種方法的對比：

from tokenizers.normalizers import NFC, Lowercase, BertNormalizer
 
 # Text to normalize
 text = 'Thís is áN ExaMPlé     sénteNCE'
 
 # Instantiate normalizer objects
 NFCNorm = NFC()
 LowercaseNorm = Lowercase()
 BertNorm = BertNormalizer()
 
 # Normalize the text
 print(f'NFC:   {NFCNorm.normalize_str(text)}')
 print(f'Lower: {LowercaseNorm.normalize_str(text)}')
 print(f'BERT: {BertNorm.normalize_str(text)}')
 
 #NFC:   Thís is áN ExaMPlé     sénteNCE
 #Lower: thís is án examplé     séntence
 #BERT: this is an example     sentence

下面的示例可以看到，只有NFC刪除了不必要的空白。

from transformers import FNetTokenizerFast, CamembertTokenizerFast, \
                          BertTokenizerFast
 
 # Text to normalize
 text = 'Thís is áN ExaMPlé     sénteNCE'
 
 # Instantiate tokenizers
 FNetTokenizer = FNetTokenizerFast.from_pretrained('google/fnet-base')
 CamembertTokenizer = CamembertTokenizerFast.from_pretrained('camembert-base')
 BertTokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
 
 # Normalize the text
 print(f'FNet Output:     \
    {FNetTokenizer.backend_tokenizer.normalizer .normalize_str(text)}')
 
 print(f'CamemBERT Output: \
    {CamembertTokenizer.backend_tokenizer.normalizer.normalize_str(text)}')
 
 print(f'BERT Output:     \
    {BertTokenizer.backend_tokenizer.normalizer.normalize_str(text)}')
     
 #FNet Output:     Thís is áN ExaMPlé sénteNCE
 #CamemBERT Output: Thís is áN ExaMPlé     sénteNCE
 #BERT Output:     this is an example     sentence

2、預(yù)標記化

預(yù)標記化步驟是標記化原始文本的第一次分割。執(zhí)行分割是為了給出的最終標記的上限。一個句子可以在預(yù)標記步驟中被分割成幾個詞，然后在模型步驟中，根據(jù)標記方法(例如基于子詞的方法)，將其中的一些詞進一步分割。因此，預(yù)先標記的文本表示標記化后仍然可能保留的最大標記。

例如，一個句子可以根據(jù)每個空格拆分，每個空格加一些標點，或者每個空格加每個標點。

下面顯示了基本的Whitespacesplit預(yù)標記器和稍微復(fù)雜一點的BertPreTokenizer之間的比較。pre_tokenizers包?？瞻最A(yù)標記器的輸出保留標點完整，并且仍然連接到鄰近的單詞。例如，includes:被視為單個單詞。而BERT預(yù)標記器將標點符號視為單個單詞[8]。

from tokenizers.pre_tokenizers import WhitespaceSplit, BertPreTokenizer
 
 # Text to normalize 
 text = ("this sentence's content includes: characters, spaces, and " \
        "punctuation.")
 
 # Define helper function to display pre-tokenized output
 def print_pretokenized_str(pre_tokens):
    for pre_token in pre_tokens:
        print(f'"{pre_token[0]}", ', end='')
 
 # Instantiate pre-tokenizers
 wss = WhitespaceSplit()
 bpt = BertPreTokenizer()
 
 # Pre-tokenize the text
 print('Whitespace Pre-Tokenizer:')
 print_pretokenized_str(wss.pre_tokenize_str(text))
 
 #Whitespace Pre-Tokenizer:
 #"this", "sentence's", "content", "includes:", "characters,", "spaces,", 
 #"and", "punctuation.", 
 
 
 print('\n\nBERT Pre-Tokenizer:')
 print_pretokenized_str(bpt.pre_tokenize_str(text))
 
 #BERT Pre-Tokenizer:
 #"this", "sentence", "'", "s", "content", "includes", ":", "characters", 
 #",", "spaces", ",", "and", "punctuation", ".",

我們可以直接從常見的標記器(如GPT-2和ALBERT (A Lite BERT)標記器)調(diào)用預(yù)標記化方法。這些方法與上面所示的標準BERT預(yù)標記器略有不同，因為在分割標記時不會刪除空格字符。它們被替換為表示空格所在位置的特殊字符。這樣做的好處是，在進一步處理時可以忽略空格字符，但如果需要，可以檢索原始句子。GPT-2模型使用?字符，其特征是大寫G上面有一個點。ALBERT模型使用下劃線字符。

from transformers import AutoTokenizer
 
 # Text to pre-tokenize
 text = ("this sentence's content includes: characters, spaces, and " \
        "punctuation.")
 
 # Instatiate the pre-tokenizers
 GPT2_PreTokenizer = AutoTokenizer.from_pretrained('gpt2').backend_tokenizer \
                    .pre_tokenizer
 
 Albert_PreTokenizer = AutoTokenizer.from_pretrained('albert-base-v1') \
                      .backend_tokenizer.pre_tokenizer
 
 # Pre-tokenize the text
 print('GPT-2 Pre-Tokenizer:')
 print_pretokenized_str(GPT2_PreTokenizer.pre_tokenize_str(text))
 
 #GPT-2 Pre-Tokenizer:
 #"this", "?sentence", "'s", "?content", "?includes", ":", "?characters", ",",
 #"?spaces", ",", "?and", "?punctuation", ".", 
 
 print('\n\nALBERT Pre-Tokenizer:')
 print_pretokenized_str(Albert_PreTokenizer.pre_tokenize_str(text))
 
 #ALBERT Pre-Tokenizer:
 #"▁this", "▁sentence's", "▁content", "▁includes:", "▁characters,", "▁spaces,",
 #"▁and", "▁punctuation.",

下面顯示了同一個示例句子上的BERT預(yù)標記步驟的結(jié)果，返回的對象是一個包含元組的Python列表。每個元組對應(yīng)一個預(yù)標記，其中第一個元素是預(yù)標記字符串，第二個元素是一個元組，包含原始輸入文本中字符串的開始和結(jié)束的索引。

from tokenizers.pre_tokenizers import WhitespaceSplit, BertPreTokenizer
 
 # Text to pre-tokenize
 text = ("this sentence's content includes: characters, spaces, and " \
        "punctuation.")
 
 # Instantiate pre-tokenizer
 bpt = BertPreTokenizer()
 
 # Pre-tokenize the text
 bpt.pre_tokenize_str(example_sentence)

結(jié)果如下：

[('this', (0, 4)),
  ('sentence', (5, 13)),
  ("'", (13, 14)),
  ('s', (14, 15)),
  ('content', (16, 23)),
  ('includes', (24, 32)),
  (':', (32, 33)),
  ('characters', (34, 44)),
  (',', (44, 45)),
  ('spaces', (46, 52)),
  (',', (52, 53)),
  ('and', (54, 57)),
  ('punctuation', (58, 69)),
  ('.', (69, 70))]

子詞標記化方法

在完成了分詞和預(yù)標記后，就可以開始合并標記了，對于transformer模型，有三種通常用于實現(xiàn)基于子詞的方法。它們都使用略微不同的技術(shù)將不常用的單詞分成更小的標記。

1、字節(jié)對編碼 Byte Pair Encoding

字節(jié)對編碼算法是一種常用的標記器，例如GPT和GPT-2模型(OpenAI)， BART (Lewis等人)等[9-10]。它最初被設(shè)計為一種文本壓縮算法，但人們發(fā)現(xiàn)它在語言模型的標記化任務(wù)中工作得非常好。BPE算法將一串文本分解為在參考語料庫(用于訓(xùn)練標記化模型的文本)中頻繁出現(xiàn)的子詞單元[11]。BPE模型的訓(xùn)練方法如下:

a)構(gòu)建語料庫

輸入文本被提供給規(guī)范化和預(yù)標記化模型，創(chuàng)建干凈的單詞列表。然后將這些單詞交給BPE模型，模型確定每個單詞的頻率，并將該數(shù)字與單詞一起存儲在稱為語料庫的列表中。

b)構(gòu)建詞匯

然后語料庫中的單詞被分解成單個字符，并添加到一個稱為詞匯表的空列表中。該算法將在每次確定哪些字符對可以合并在一起時迭代地添加該詞匯表。

c)找出字符對的頻率

然后記錄語料庫中每個單詞的字符對頻率。例如，單詞cat將具有ca, at和ts的字符對。所有單詞都以這種方式進行檢查，并貢獻給全局頻率計數(shù)器。在任何標記中找到的ca實例都會增加ca對的頻率計數(shù)器。

d)創(chuàng)建合并規(guī)則

當每個字符對的頻率已知時，最頻繁的字符對被添加到詞匯表中。詞匯表現(xiàn)在由符號中的每個字母以及最常見的字符對組成。這也提供了一個模型可以使用的合并規(guī)則。例如，如果模型學(xué)習(xí)到ca是最常見的字符對，它已經(jīng)學(xué)習(xí)到語料庫中所有相鄰的c和a實例可以合并以得到ca?，F(xiàn)在可以將其作為單個字符ca處理其余步驟。

重復(fù)步驟c和d，找到更多合并規(guī)則，并向詞匯表中添加更多字符對。這個過程一直持續(xù)到詞匯表大小達到訓(xùn)練開始時指定的目標大小。

下面是BPE算法的Python實現(xiàn)

class TargetVocabularySizeError(Exception):
    def __init__(self, message):
        super().__init__(message)
 
 class BPE:
    '''An implementation of the Byte Pair Encoding tokenizer.'''
 
    def calculate_frequency(self, words):
        ''' Calculate the frequency for each word in a list of words.
 
            Take in a list of words stored as strings and return a list of
            tuples where each tuple contains a string from the words list,
            and an integer representing its frequency count in the list.
 
            Args:
                words (list): A list of words (strings) in any order.
 
            Returns:
                corpus (list[tuple(str, int)]): A list of tuples where the
                  first element is a string of a word in the words list, and
                  the second element is an integer representing the frequency
                  of the word in the list.
        '''
        freq_dict = dict()
 
        for word in words:
            if word not in freq_dict:
                freq_dict[word] = 1
            else:
                freq_dict[word] += 1
 
        corpus = [(word, freq_dict[word]) for word in freq_dict.keys()]
 
        return corpus
 
 
    def create_merge_rule(self, corpus):
        ''' Create a merge rule and add it to the self.merge_rules list.
 
            Args:
                corpus (list[tuple(list, int)]): A list of tuples where the
                    first element is a list of a word in the words list (where
                    the elements are the individual characters (or subwords in
                    later iterations) of the word), and the second element is
                    an integer representing the frequency of the word in the
                    list.
 
            Returns:
                None
        '''
        pair_frequencies = self.find_pair_frequencies(corpus)
        most_frequent_pair = max(pair_frequencies, key=pair_frequencies.get)
        self.merge_rules.append(most_frequent_pair.split(','))
        self.vocabulary.append(most_frequent_pair)
 
 
    def create_vocabulary(self, words):
        ''' Create a list of every unique character in a list of words.
 
            Args:
                words (list): A list of strings containing the words of the
                    input text.
 
            Returns:
                vocabulary (list): A list of every unique character in the list
                    of input words.
        '''
        vocabulary = list(set(''.join(words)))
        return vocabulary
 
    def find_pair_frequencies(self, corpus):
        ''' Find the frequency of each character pair in the corpus.
 
            Loop through the corpus and calculate the frequency of each pair
            of adjacent characters across every word. Return a dictionary of
            each character pair as the keys and the corresponding frequency as
            the values.
 
            Args:
                corpus (list[tuple(list, int)]): A list of tuples where the
                    first element is a list of a word in the words list (where
                    the elements are the individual characters (or subwords in
                    later iterations) of the word), and the second element is
                    an integer representing the frequency of the word in the
                    list.
 
            Returns:
                pair_freq_dict (dict): A dictionary where the keys are the
                    character pairs from the input corpus and the values are an
                    integer representing the frequency of the pair in the
                    corpus.
        '''
        pair_freq_dict = dict()
 
        for word, word_freq in corpus:
            for idx in range(len(word)-1):
 
                char_pair = f'{word[idx]},{word[idx+1]}'
 
                if char_pair not in pair_freq_dict:
                    pair_freq_dict[char_pair] = word_freq
                else:
                    pair_freq_dict[char_pair] += word_freq
 
        return pair_freq_dict
 
 
    def get_merged_chars(self, char_1, char_2):
        ''' Merge the highest score pair and return to the self.merge method.
 
            This method is abstracted so that the BPE class can be used as the
            base class for other Tokenizers, and so the merging method can be
            easily overwritten. For example, in the BPE algorithm the
            characters can simply be concatenated and returned. However in the
            WordPiece algorithm, the # symbols must first be stripped.
 
            Args:
                char_1 (str): The first character in the highest-scoring pair.
                char_2 (str): The second character in the highest-scoring pair.
 
            Returns:
                merged_chars (str): Merged characters.
        '''
        merged_chars = char_1 + char_2
        return merged_chars
 
 
    def initialize_corpus(self, words):
        ''' Split each word into characters and count the word frequency.
 
            Split each word in the input word list on every character. For each
            word, store the split word in a list as the first element inside a
            tuple. Store the frequency count of the word as an integer as the
            second element of the tuple. Create a tuple for every word in this
            fashion and store the tuples in a list called 'corpus', then return
            then corpus list.
 
            Args:
                None
 
            Returns:
                corpus (list[tuple(list, int)]): A list of tuples where the
                    first element is a list of a word in the words list (where
                    the elements are the individual characters of the word),
                    and the second element is an integer representing the
                    frequency of the word in the list.
        '''
        corpus = self.calculate_frequency(words)
        corpus = [([*word], freq) for (word, freq) in corpus]
        return corpus
 
 
    def merge(self, corpus):
        ''' Loop through the corpus and perform the latest merge rule.
 
            Args:
                corpus (list[tuple(list, int)]): A list of tuples where the
                    first element is a list of a word in the words list (where
                    the elements are the individual characters (or subwords in
                    later iterations) of the word), and the second element is
                    an integer representing the frequency of the word in the
                    list.
 
            Returns:
                new_corpus (list[tuple(list, int)]): A modified version of the
                    input argument where the most recent merge rule has been
                    applied to merge the most frequent adjacent characters.
        '''
        merge_rule = self.merge_rules[-1]
        new_corpus = []
 
        for word, word_freq in corpus:
            new_word = []
            idx = 0
 
            while idx < len(word):
                # If a merge pattern has been found
                if (len(word) != 1) and (word[idx] == merge_rule[0]) and\
                (word[idx+1] == merge_rule[1]):
 
                    new_word.append(self.get_merged_chars(word[idx],word[idx+1]))
                    idx += 2
                # If a merge patten has not been found
                else:
                    new_word.append(word[idx])
                    idx += 1
 
            new_corpus.append((new_word, word_freq))
 
        return new_corpus
 
 
    def train(self, words, target_vocab_size):
        ''' Train the model.
 
            Args:
                words (list[str]): A list of words to train the model on.
 
                target_vocab_size (int): The number of words in the vocabulary
                    to be used as the stopping condition when training.
 
            Returns:
                None.
        '''
        self.words = words
        self.target_vocab_size = target_vocab_size
        self.corpus = self.initialize_corpus(self.words)
        self.corpus_history = [self.corpus]
        self.vocabulary = self.create_vocabulary(self.words)
        self.vocabulary_size = len(self.vocabulary)
        self.merge_rules = []
 
        # Iteratively add vocabulary until reaching the target vocabulary size
        if len(self.vocabulary) > self.target_vocab_size:
            raise TargetVocabularySizeError(f'Error: Target vocabulary size \
            must be greater than the initial vocabulary size \
            ({len(self.vocabulary)})')
 
        else:
            while len(self.vocabulary) < self.target_vocab_size:
                try:
                    self.create_merge_rule(self.corpus)
                    self.corpus = self.merge(self.corpus)
                    self.corpus_history.append(self.corpus)
 
                # If no further merging is possible
                except ValueError:
                    print('Exiting: No further merging is possible')
                    break
 
 
    def tokenize(self, text):
        ''' Take in some text and return a list of tokens for that text.
 
            Args:
                text (str): The text to be tokenized.
 
            Returns:
                tokens (list): The list of tokens created from the input text.
        '''
        tokens = [*text]
 
        for merge_rule in self.merge_rules:
 
            new_tokens = []
            idx = 0
 
            while idx < len(tokens):
                # If a merge pattern has been found
                if (len(tokens) != 1) and (tokens[idx] == merge_rule[0]) and \
                    (tokens[idx+1] == merge_rule[1]):
 
                    new_tokens.append(self.get_merged_chars(tokens[idx],
                                                            tokens[idx+1]))
                    idx += 2
                # If a merge patten has not been found
                else:
                    new_tokens.append(tokens[idx])
                    idx += 1
 
            tokens = new_tokens
 
        return tokens

使用的詳細步驟：

# Training set
 words = ['cat', 'cat', 'cat', 'cat', 'cat',
          'cats', 'cats',
          'eat', 'eat', 'eat', 'eat', 'eat', 'eat', 'eat', 'eat', 'eat', 'eat',
          'eating', 'eating', 'eating',
          'running', 'running',
          'jumping',
          'food', 'food', 'food', 'food', 'food', 'food']
 
 # Instantiate the tokenizer
 bpe = BPE()
 bpe.train(words, 21)
 
 # Print the corpus at each stage of the process, and the merge rule used
 print(f'INITIAL CORPUS:\n{bpe.corpus_history[0]}\n')
 for rule, corpus in list(zip(bpe.merge_rules, bpe.corpus_history[1:])):
    print(f'NEW MERGE RULE: Combine "{rule[0]}" and "{rule[1]}"')
    print(corpus, end='\n\n')

結(jié)果輸出

INITIAL CORPUS:
 [(['c', 'a', 't'], 5), (['c', 'a', 't', 's'], 2), (['e', 'a', 't'], 10),
 (['e', 'a', 't', 'i', 'n', 'g'], 3), (['r', 'u', 'n', 'n', 'i', 'n', 'g'], 2), 
 (['j', 'u', 'm', 'p', 'i', 'n', 'g'], 1), (['f', 'o', 'o', 'd'], 6)]
 
 NEW MERGE RULE: Combine "a" and "t"
 [(['c', 'at'], 5), (['c', 'at', 's'], 2), (['e', 'at'], 10), 
 (['e', 'at', 'i', 'n', 'g'], 3), (['r', 'u', 'n', 'n', 'i', 'n', 'g'], 2), 
 (['j', 'u', 'm', 'p', 'i', 'n', 'g'], 1), (['f', 'o', 'o', 'd'], 6)]
 
 NEW MERGE RULE: Combine "e" and "at"
 [(['c', 'at'], 5), (['c', 'at', 's'], 2), (['eat'], 10), 
 (['eat', 'i', 'n', 'g'], 3), (['r', 'u', 'n', 'n', 'i', 'n', 'g'], 2), 
 (['j', 'u', 'm', 'p', 'i', 'n', 'g'], 1), (['f', 'o', 'o', 'd'], 6)]
 
 NEW MERGE RULE: Combine "c" and "at"
 [(['cat'], 5), (['cat', 's'], 2), (['eat'], 10), (['eat', 'i', 'n', 'g'], 3), 
 (['r', 'u', 'n', 'n', 'i', 'n', 'g'], 2), 
 (['j', 'u', 'm', 'p', 'i', 'n', 'g'], 1), (['f', 'o', 'o', 'd'], 6)]
 
 NEW MERGE RULE: Combine "i" and "n"
 [(['cat'], 5), (['cat', 's'], 2), (['eat'], 10), (['eat', 'in', 'g'], 3), 
 (['r', 'u', 'n', 'n', 'in', 'g'], 2), (['j', 'u', 'm', 'p', 'in', 'g'], 1), 
 (['f', 'o', 'o', 'd'], 6)]
 
 NEW MERGE RULE: Combine "in" and "g"
 [(['cat'], 5), (['cat', 's'], 2), (['eat'], 10), (['eat', 'ing'], 3), 
 (['r', 'u', 'n', 'n', 'ing'], 2), (['j', 'u', 'm', 'p', 'ing'], 1), 
 (['f', 'o', 'o', 'd'], 6)]

我們的代碼只是為了學(xué)習(xí)流程，在實際應(yīng)用中可以直接使用transformer庫

BPE標記器只能識別出現(xiàn)在訓(xùn)練數(shù)據(jù)中的字符（characters）。如果出現(xiàn)不包含的詞匯,會將這個字符轉(zhuǎn)換為一個未知的字符。如果模型被用來標記真實數(shù)據(jù)。但是BPE錯誤處理沒有添加未知的字符的標記,所以有的productionized模型是會產(chǎn)生崩潰。

但是GPT-2和RoBERTa中使用的BPE標記器沒有這個問題。它們不是基于Unicode字符分析訓(xùn)練數(shù)據(jù)，而是分析字符的字節(jié)。這被稱為字節(jié)級BPE Byte-Level BPE，它允許一個小的基本詞匯表能夠標記模型可能看到的所有字符。

2、WordPiece

WordPiece是Google為的BERT模型開發(fā)的一種標記化方法，并用于其衍生模型，如DistilBERT和MobileBERT。

WordPiece算法的全部細節(jié)尚未完全向公眾公布，因此本文介紹的方法是基于Hugging Face[12]給出的解釋。WordPiece算法類似于BPE，但使用不同的度量來確定合并規(guī)則。系統(tǒng)不會選擇出現(xiàn)頻率最高的字符對，而是為每對字符計算一個分數(shù)，分數(shù)最高的字符對決定合并哪些字符。WordPiece的訓(xùn)練如下:

a)構(gòu)建語料庫

輸入文本被提供給規(guī)范化和預(yù)標記化模型，以創(chuàng)建干凈的單詞。

b)構(gòu)建詞匯

與BPE一樣，語料庫中的單詞隨后被分解為單個字符，并添加到稱為詞匯表的空列表中。但是這一次不是簡單地存儲每個單獨的字符，而是使用兩個#符號作為標記來確定該字符是在單詞的開頭還是在單詞的中間/結(jié)尾找到的。例如，單詞cat在BPE中會被分成['c'， 'a'， 't']，但在WordPiece中它看起來像['c'， '##a'， '##t']。單詞開頭的c和單詞中間或結(jié)尾的##c將被區(qū)別對待。每次算法確定哪些字符對可以合并在一起時，都會迭代地向這個詞匯表中添加內(nèi)容。

c)計算每個相鄰字符對的配對得分

與BPE模型不同，這次為每個字符對計算一個分數(shù)。識別語料庫中每個相鄰的字符對。'c##a'， ##a##t等，并計算頻率。每個字符單獨出現(xiàn)的頻率也是確定的。已知這些值后，可以根據(jù)以下公式計算配對得分:

這個指標會給經(jīng)常一起出現(xiàn)的字符分配更高的分數(shù)，但單獨出現(xiàn)或與其他字符一起出現(xiàn)的頻率較低。這是WordPiece和BPE的主要區(qū)別，因為BPE不考慮單個字符本身的總體頻率。

d)創(chuàng)建合并規(guī)則

高分代表通常一起出現(xiàn)的字符對。也就是說，如果c##a的配對得分很高，那么c和a在語料庫中經(jīng)常一起出現(xiàn)，而不是單獨出現(xiàn)。與BPE一樣，合并規(guī)則是由得分最高的字符對決定的，但這次不是由頻率決定得分，而是由字符對得分決定。

然后重復(fù)步驟c和d，找到更多合并規(guī)則，并向詞匯表添加更多字符對。這個過程一直持續(xù)到詞匯表大小達到訓(xùn)練開始時指定的目標大小。

簡單代碼示例如下：

class WordPiece(BPE):
 
    def add_hashes(self, word):
        ''' Add # symbols to every character in a word except the first.
 
            Take in a word as a string and add # symbols to every character
            except the first. Return the result as a list where each element is
            a character with # symbols in front, except the first character
            which is just the plain character.
 
            Args:
                word (str): The word to add # symbols to.
 
            Returns:
                hashed_word (list): A list of the characters with # symbols
                    (except the first character which is just the plain
                    character).
        '''
        hashed_word = [word[0]]
 
        for char in word[1:]:
            hashed_word.append(f'##{char}')
 
        return hashed_word
 
 
    def create_merge_rule(self, corpus):
        ''' Create a merge rule and add it to the self.merge_rules list.
 
            Args:
                corpus (list[tuple(list, int)]): A list of tuples where the
                    first element is a list of a word in the words list (where
                    the elements are the individual characters (or subwords in
                    later iterations) of the word), and the second element is
                    an integer representing the frequency of the word in the
                    list.
 
            Returns:
                None
        '''
        pair_frequencies = self.find_pair_frequencies(corpus)
        char_frequencies = self.find_char_frequencies(corpus)
        pair_scores = self.find_pair_scores(pair_frequencies, char_frequencies)
 
        highest_scoring_pair = max(pair_scores, key=pair_scores.get)
        self.merge_rules.append(highest_scoring_pair.split(','))
        self.vocabulary.append(highest_scoring_pair)
 
 
    def create_vocabulary(self, words):
        ''' Create a list of every unique character in a list of words.
 
            Unlike the BPE algorithm where each character is stored normally,
            here a distinction is made by characters that begin a word
            (unmarked), and characters that are in the middle or end of a word
            (marked with a '##'). For example, the word 'cat' will be split
            into ['c', '##a', '##t'].
 
            Args:
                words (list): A list of strings containing the words of the
                    input text.
 
            Returns:
                vocabulary (list): A list of every unique character in the list
                    of input words, marked accordingly with ## to denote if the
                    character was featured in the middle/end of a word, instead
                    of as the first character of the word.
        '''
        vocabulary = set()
        for word in words:
            vocabulary.add(word[0])
            for char in word[1:]:
                vocabulary.add(f'##{char}')
 
        # Convert to list so the vocabulary can be appended to later
        vocabulary = list(vocabulary)
        return vocabulary
 
 
    def find_char_frequencies(self, corpus):
        ''' Find the frequency of each character in the corpus.
 
            Loop through the corpus and calculate the frequency of characters.
            Note that 'c' and '##c' are different characters, since the first
            represents a 'c' at the start of a word, and '##c' represents a 'c'
            in the middle/end of a word. Return a dictionary of each character
            pair as the keys and the corresponding frequency as the values.
 
            Args:
                corpus (list[tuple(list, int)]): A list of tuples where the
                    first element is a list of a word in the words list (where
                    the elements are the individual characters (or subwords in
                    later iterations) of the word), and the second element is
                    an integer representing the frequency of the word in the
                    list.
 
            Returns:
                pair_freq_dict (dict): A dictionary where the keys are the
                    characters from the input corpus and the values are an
                    integer representing the frequency.
        '''
        char_frequencies = dict()
 
        for word, word_freq in corpus:
            for char in word:
                if char in char_frequencies:
                    char_frequencies[char] += word_freq
                else:
                    char_frequencies[char] = word_freq
 
        return char_frequencies
 
 
    def find_pair_scores(self, pair_frequencies, char_frequencies):
        ''' Find the pair score for each character pair in the corpus.
 
            Loops through the pair_frequencies dictionary and calculate the
            pair score for each pair of adjacent characters in the corpus.
            Store the scores in a dictionary and return it.
 
            Args:
                pair_frequencies (dict): A dictionary where the keys are the
                    adjacent character pairs in the corpus and the values are
                    the frequencies of each pair.
 
                char_frequencies (dict): A dictionary where the keys are the
                    characters in the corpus and the values are corresponding
                    frequencies.
 
            Returns:
                pair_scores (dict): A dictionary where the keys are the
                    adjacent character pairs in the input corpus and the values
                    are the corresponding pair score.
        '''
        pair_scores = dict()
 
        for pair in pair_frequencies.keys():
            char_1 = pair.split(',')[0]
            char_2 = pair.split(',')[1]
            denominator = (char_frequencies[char_1]*char_frequencies[char_2])
            score = (pair_frequencies[pair]) / denominator
            pair_scores[pair] = score
 
        return pair_scores
 
 
    def get_merged_chars(self, char_1, char_2):
        ''' Merge the highest score pair and return to the self.merge method.
 
            Remove the # symbols as necessary and merge the highest scoring
            pair then return the merged characters to the self.merge method.
 
 
            Args:
                char_1 (str): The first character in the highest-scoring pair.
                char_2 (str): The second character in the highest-scoring pair.
 
            Returns:
                merged_chars (str): Merged characters.
        '''
        if char_2.startswith('##'):
            merged_chars = char_1 + char_2[2:]
        else:
            merged_chars = char_1 + char_2
 
        return merged_chars
 
 
    def initialize_corpus(self, words):
        ''' Split each word into characters and count the word frequency.
 
            Split each word in the input word list on every character. For each
            word, store the split word in a list as the first element inside a
            tuple. Store the frequency count of the word as an integer as the
            second element of the tuple. Create a tuple for every word in this
            fashion and store the tuples in a list called 'corpus', then return
            then corpus list.
 
            Args:
                None.
 
            Returns:
                corpus (list[tuple(list, int)]): A list of tuples where the
                    first element is a list of a word in the words list (where
                    the elements are the individual characters of the word),
                    and the second element is an integer representing the
                    frequency of the word in the list.
        '''
        corpus = self.calculate_frequency(words)
        corpus = [(self.add_hashes(word), freq) for (word, freq) in corpus]
        return corpus
 
    def tokenize(self, text):
        ''' Take in some text and return a list of tokens for that text.
 
            Args:
                text (str): The text to be tokenized.
 
            Returns:
                tokens (list): The list of tokens created from the input text.
        '''
        # Create cleaned vocabulary list without # and commas to check against
        clean_vocabulary = [word.replace('#', '').replace(',', '') 
                            for word in self.vocabulary]
        clean_vocabulary.sort(key=lambda word: len(word))
        clean_vocabulary = clean_vocabulary[::-1]
 
        # Break down the text into the largest tokens first, then smallest
        remaining_string = text
        tokens = []
        keep_checking = True
 
        while keep_checking:
            keep_checking = False
            for vocab in clean_vocabulary:
                if remaining_string.startswith(vocab):
                    tokens.append(vocab)
                    remaining_string = remaining_string[len(vocab):]
                    keep_checking = True
 
        if len(remaining_string) > 0:
            tokens.append(remaining_string)
 
        return tokens

WordPiece與BPE算法學(xué)習(xí)的標記非常不同?？梢郧宄乜吹剑琖ordPiece更傾向于這樣的組合:字符相互出現(xiàn)的頻率比單獨出現(xiàn)的頻率更高，因此m和p會立即合并，因為它們只一起存在于數(shù)據(jù)集中，而不是單獨存在。

wp = WordPiece()
 wp.train(words, 30)
 
 print(f'INITIAL CORPUS:\n{wp.corpus_history[0]}\n')
 for rule, corpus in list(zip(wp.merge_rules, wp.corpus_history[1:])):
    print(f'NEW MERGE RULE: Combine "{rule[0]}" and "{rule[1]}"')
    print(corpus, end='\n\n')

結(jié)果：

INITIAL CORPUS:
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), 
 (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##i', '##n', '##g'], 3), 
 (['r', '##u', '##n', '##n', '##i', '##n', '##g'], 2), 
 (['j', '##u', '##m', '##p', '##i', '##n', '##g'], 1), 
 (['f', '##o', '##o', '##d'], 6)]
 
 NEW MERGE RULE: Combine "##m" and "##p"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), 
 (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##i', '##n', '##g'], 3), 
 (['r', '##u', '##n', '##n', '##i', '##n', '##g'], 2), 
 (['j', '##u', '##mp', '##i', '##n', '##g'], 1), 
 (['f', '##o', '##o', '##d'], 6)]
 
 NEW MERGE RULE: Combine "r" and "##u"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), 
 (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##i', '##n', '##g'], 3), 
 (['ru', '##n', '##n', '##i', '##n', '##g'], 2), 
 (['j', '##u', '##mp', '##i', '##n', '##g'], 1), 
 (['f', '##o', '##o', '##d'], 6)]
 
 NEW MERGE RULE: Combine "j" and "##u"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), 
 (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##i', '##n', '##g'], 3), 
 (['ru', '##n', '##n', '##i', '##n', '##g'], 2), 
 (['ju', '##mp', '##i', '##n', '##g'], 1), (['f', '##o', '##o', '##d'], 6)]
 
 NEW MERGE RULE: Combine "ju" and "##mp"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), 
 (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##i', '##n', '##g'], 3), 
 (['ru', '##n', '##n', '##i', '##n', '##g'], 2), 
 (['jump', '##i', '##n', '##g'], 1), (['f', '##o', '##o', '##d'], 6)]
 
 NEW MERGE RULE: Combine "jump" and "##i"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), 
 (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##i', '##n', '##g'], 3), 
 (['ru', '##n', '##n', '##i', '##n', '##g'], 2), (['jumpi', '##n', '##g'], 1), 
 (['f', '##o', '##o', '##d'], 6)]
 
 NEW MERGE RULE: Combine "##i" and "##n"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), 
 (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##in', '##g'], 3), 
 (['ru', '##n', '##n', '##in', '##g'], 2), (['jumpi', '##n', '##g'], 1), 
 (['f', '##o', '##o', '##d'], 6)]
 
 NEW MERGE RULE: Combine "ru" and "##n"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), 
 (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##in', '##g'], 3), 
 (['run', '##n', '##in', '##g'], 2), (['jumpi', '##n', '##g'], 1), 
 (['f', '##o', '##o', '##d'], 6)]
 
 NEW MERGE RULE: Combine "run" and "##n"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), 
 (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##in', '##g'], 3), 
 (['runn', '##in', '##g'], 2), (['jumpi', '##n', '##g'], 1), 
 (['f', '##o', '##o', '##d'], 6)]
 
 NEW MERGE RULE: Combine "jumpi" and "##n"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), 
 (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##in', '##g'], 3), 
 (['runn', '##in', '##g'], 2), (['jumpin', '##g'], 1), 
 (['f', '##o', '##o', '##d'], 6)]
 
 NEW MERGE RULE: Combine "runn" and "##in"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), 
 (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##in', '##g'], 3), 
 (['runnin', '##g'], 2), (['jumpin', '##g'], 1), 
 (['f', '##o', '##o', '##d'], 6)]
 
 NEW MERGE RULE: Combine "##in" and "##g"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), 
 (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##ing'], 3), 
 (['runnin', '##g'], 2), (['jumpin', '##g'], 1), 
 (['f', '##o', '##o', '##d'], 6)]
 
 NEW MERGE RULE: Combine "runnin" and "##g"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), 
 (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##ing'], 3), 
 (['running'], 2), (['jumpin', '##g'], 1), (['f', '##o', '##o', '##d'], 6)]
 
 NEW MERGE RULE: Combine "jumpin" and "##g"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), 
 (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##ing'], 3), 
 (['running'], 2), (['jumping'], 1), (['f', '##o', '##o', '##d'], 6)]
 
 NEW MERGE RULE: Combine "f" and "##o"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), 
 (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##ing'], 3), 
 (['running'], 2), (['jumping'], 1), (['fo', '##o', '##d'], 6)]

盡管訓(xùn)練數(shù)據(jù)有限，但模型仍然設(shè)法學(xué)習(xí)了一些有用的標記，比如單詞jumper開始。首先，字符串被分解成['jump'，'er']，因為jump是訓(xùn)練集中可以在單詞開頭找到的最大token。接下來，字符串er被分解成單個字符，因為模型還沒有學(xué)會將字符e和r組合在一起。

print(wp.tokenize('jumper'))
 #['jump', 'e', 'r']

3、Unigram

Unigram標記器采用與BPE和WordPiece不同的方法，從一個大詞匯表開始，然后迭代地減少它，直到達到所需的大小。

Unigram模型使用統(tǒng)計方法，其中考慮句子中每個單詞或字符的概率。這些列表中的每個元素都可以被認為是一個標記t，而一系列標記t1, t2，…，tn出現(xiàn)的概率由下式給出:

a)構(gòu)建語料庫

與往常一樣，輸入文本被提供給規(guī)范化和預(yù)標記化模型，以創(chuàng)建干凈的單詞

b)構(gòu)建詞匯

Unigram模型的詞匯表大小一開始非常大，然后迭代地減少，直到達到所需的大小。要構(gòu)造初始詞匯表，請在語料庫中找到所有可能的子字符串。例如，如果語料庫中的第一個單詞是cats，則子字符串['c'， 'a'， 't'， 's'， 'ca'， 'at'， 'ts'， 'cat'， 'ats']將被添加到詞匯表中。

c)計算每個標記的概率

通過查找語料庫中標記的出現(xiàn)次數(shù)，然后除以標記出現(xiàn)的總次數(shù)，可以近似地計算出標記出現(xiàn)的概率。

d)找出單詞的所有可能的切分

假設(shè)訓(xùn)練語料庫中的一個單詞是cat。這可以通過以下方式進行細分:

['c'， 'a'， 't']

(“ca”、“t”)

[' c ', 'at']

(“cat”)

e)計算語料庫中每個分割出現(xiàn)的近似概率

結(jié)合上面的方程式將給出每個系列標記的概率。

由于段['ca'， 't']具有最高的概率得分，因此這是用于標記單詞的段。單詞cat將被標記為['ca'， 't']。可以想象，對于像tokenization這樣的較長的單詞，拆分可能出現(xiàn)在整個單詞的多個位置，例如['token'， 'iza'， tion]或['token'， 'ization]。

f)計算損失

這里的損失是指模型的分數(shù)，如果從詞匯表中刪除一個重要的標記，則損失會大大增加，但如果刪除一個不太重要的標記，則損失不會增加太多。通過計算每個標記被刪除后在模型中的損失，可以找到詞匯表中最沒用的標記。這可以迭代地重復(fù)，直到詞匯表大小減少到只剩下訓(xùn)練集語料庫中最有用的標記。

這里的損失計算公式如下：

一旦刪除了足夠的字符，使詞匯表減少到所需的大小，訓(xùn)練就完成了，模型就可以用于對單詞進行標記。

比較BPE、WordPiece和Unigram

根據(jù)訓(xùn)練集和要標記的數(shù)據(jù)，一些標記器可能比其他標記器表現(xiàn)得更好。在為語言模型選擇標記器時，最好使用用于特定用例的訓(xùn)練集進行實驗，看看哪個能提供最好的結(jié)果。

在這三種方法中，BPE似乎是當前語言模型標記器中最流行的選擇。盡管在這樣一個瞬息萬變的領(lǐng)域，這種變化在未來是很有可能發(fā)生的。但是其他子詞標記器，如sentencepece，近年來越來越受歡迎[13]。

與BPE和Unigram相比，WordPiece似乎產(chǎn)生了更多的單詞標記，但無論模型選擇如何，隨著詞匯量的增加，所有標記器似乎都產(chǎn)生了更少的標記[14]。

標記器的選擇取決于打算與模型一起使用的數(shù)據(jù)集。這里的建議是嘗試BPE或sentencepece進行實驗。

后處理

標記化的最后一步是后處理，如果有必要，可以對輸出進行任何最終修改。BERT使用這一步驟添加了兩種額外類型的標記:

[CLS] -這個標記代表“分類”，用于標記輸入文本的開始。這在BERT中是必需的，因為它被訓(xùn)練的任務(wù)之一是分類(因此標記的名稱)。即使不用于分類任務(wù)，該標記仍然是模型所期望的。

[SEP] -這個標記代表“分隔”，用于分隔輸入中的句子。這對于BERT執(zhí)行的許多任務(wù)都很有用，包括在同一提示符中同時處理多條指令[15]。

tokenizers庫

tokenizers庫使得使用預(yù)訓(xùn)練的tokenizer非常容易。只需導(dǎo)入Tokenizer類，調(diào)用from_pretrained方法，并傳入要使用Tokenizer from的模型名稱。模型列表見[16]。

from tokenizers import Tokenizer
 
 tokenizer = Tokenizer.from_pretrained('bert-base-cased')

我們可以直接使用下面的實現(xiàn)

BertWordPieceTokenizer - The famous Bert tokenizer, using WordPiece
 CharBPETokenizer - The original BPE
 ByteLevelBPETokenizer - The byte level version of the BPE
 SentencePieceBPETokenizer - A BPE implementation compatible with the one used by SentencePiece

h愛可以使用train方法進行自定義的訓(xùn)練。訓(xùn)練完成后使用save方法保存訓(xùn)練好的標記器，這樣就不必再次執(zhí)行訓(xùn)練。

# Import a tokenizer
 from tokenizers import BertWordPieceTokenizer, CharBPETokenizer, \
                        ByteLevelBPETokenizer, SentencePieceBPETokenizer
 
 # Instantiate the model
 tokenizer = CharBPETokenizer()
 
 # Train the model
 tokenizer.train(['./path/to/files/1.txt', './path/to/files/2.txt'])
 
 # Tokenize some text
 encoded = tokenizer.encode('I can feel the magic, can you?')
 
 # Save the model
 tokenizer.save('./path/to/directory/my-bpe.tokenizer.json')

下面是一個完整的自定義訓(xùn)練的流程代碼：

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, \
                        processors
 
 # Initialize a tokenizer
 tokenizer = Tokenizer(models.BPE())
 
 # Customize pre-tokenization and decoding
 tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
 tokenizer.decoder = decoders.ByteLevel()
 tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)
 
 # And then train
 trainer = trainers.BpeTrainer(
    vocab_size=20000,
    min_frequency=2,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
 )
 tokenizer.train([
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
 ], trainer=trainer)
 
 # And Save it
 tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)

總結(jié)

標記化管道是語言模型的關(guān)鍵部分，在決定使用哪種類型的標記器時應(yīng)該仔細考慮。雖然Hugging Face為了我們處理了這部分的工作，但是對標記方法的深刻理解對于微調(diào)模型和在不同數(shù)據(jù)集上獲得的性能是非常重要的。

責(zé)任編輯：華軒來源： DeepHub IMBA

OpenAI 大型語言模型 Python

51CTO技術(shù)棧公眾號

業(yè)務(wù)
速覽

媒體

51CTO CIOAge HC3i

社區(qū)

51CTO博客鴻蒙開發(fā)者社區(qū) AI.x社區(qū)

教育

51CTO學(xué)堂精培企業(yè)培訓(xùn) CTO訓(xùn)練營

<cite id="j04xv"><rp id="j04xv"><form id="j04xv"></form></rp></cite><ruby id="j04xv"></ruby>

<legend id="j04xv"><track id="j04xv"></track></legend>

^{<blockquote id="j04xv"></blockquote>}