
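Build GPT from Scratch in 60 Lines of Code! The Most Complete Hands-On Guide Is Here

Author: 新智元 2024-03-01 13:49:00

GPT has long been the foundation of the large-model era. A developer abroad has published a hands-on guide that builds GPT in just 60 lines of code.

Build GPT from scratch in 60 lines of code?

Recently, a developer put together a practical guide that implements GPT from scratch in NumPy.

You can also load the GPT-2 model weights released by OpenAI into the GPT you build and generate some text.

Without further ado, let's start building GPT.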

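What is a GPT?

GPT stands for Generative Pre-trained Transformer, a neural network architecture based on the Transformer.

- Generative: a GPT generates text.

- Pre-trained: a GPT is trained on large amounts of text from books, the internet, and so on.

- Transformer: a GPT is a decoder-only Transformer neural network.

Large models such as OpenAI's GPT-3, Google's LaMDA, and Cohere's Command XLarge are all GPTs under the hood. What makes them special is that they are 1) very large (billions of parameters) and 2) trained on a lot of data (hundreds of gigabytes of text).

Put simply, a GPT generates text given a prompt.

Even with this very simple API (input = text, output = text), a well-trained GPT can do some pretty impressive things, such as writing an email, summarizing a book, suggesting ideas for Instagram posts, explaining black holes to a five-year-old, writing SQL, and even drafting a will.

That is the high-level overview of GPT and what it can do. Let's dig into the details.

Input / Output

The function signature for a GPT looks roughly like this: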

def gpt(inputs: list[int]) -> list[list[float]]:
    # inputs has shape [n_seq]
    # output has shape [n_seq, n_vocab]
    output = ...  # beep boop neural network magic (placeholder for the model)
    return output

The input is some text represented as a sequence of integers that map to the tokens in the text:

# integers represent tokens in our text, for example:
# text   = "not all heroes wear capes":
# tokens = "not"  "all" "heroes" "wear" "capes"
inputs =   [1,     0,    2,      4,     6]

Tokens are sub-pieces of the text, produced by a tokenizer. We can map tokens to integers using a vocabulary:

# the index of a token in the vocab represents the integer id for that token
# i.e. the integer id for "heroes" would be 2, since vocab[2] = "heroes"
vocab = ["all", "not", "heroes", "the", "wear", ".", "capes"]


# a pretend tokenizer that tokenizes on whitespace
tokenizer = WhitespaceTokenizer(vocab)


# the encode() method converts a str -> list[int]
ids = tokenizer.encode("not all heroes wear") # ids = [1, 0, 2, 4]

# we can see what the actual tokens are via our vocab mapping
tokens = [tokenizer.vocab[i] for i in ids] # tokens = ["not", "all", "heroes", "wear"]

# the decode() method converts back a list[int] -> str
text = tokenizer.decode(ids) # text = "not all heroes wear"

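In short:

- We have a string.

- We use a tokenizer to break it down into smaller pieces called tokens.

- We use a vocabulary to map those tokens to integers.

In practice, we use more advanced tokenization methods than simply splitting on whitespace, such as Byte-Pair Encoding (BPE) or WordPiece, but the principle is the same:

- There is a vocab that maps string tokens to integer indices.

- There is an encode method that converts str -> list[int].

- There is a decode method that converts list[int] -> str.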
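
The WhitespaceTokenizer used above is only a pretend class. As a rough illustration of the vocab / encode / decode contract, here is a minimal sketch of how it could be written. The attribute name token_to_id is an assumption made for this example, and out-of-vocabulary words simply raise an error.

```python
class WhitespaceTokenizer:
    """Toy tokenizer: splits on whitespace and looks tokens up in a fixed vocab."""

    def __init__(self, vocab: list[str]):
        self.vocab = vocab
        # reverse mapping: token string -> integer id
        self.token_to_id = {token: i for i, token in enumerate(vocab)}

    def encode(self, text: str) -> list[int]:
        # raises KeyError for tokens that are not in the vocab
        return [self.token_to_id[token] for token in text.split()]

    def decode(self, ids: list[int]) -> str:
        return " ".join(self.vocab[i] for i in ids)


vocab = ["all", "not", "heroes", "the", "wear", ".", "capes"]
tokenizer = WhitespaceTokenizer(vocab)
ids = tokenizer.encode("not all heroes wear")
print(ids)                    # [1, 0, 2, 4]
print(tokenizer.decode(ids))  # not all heroes wear
```

A real tokenizer such as BPE differs mainly in how encode splits the string; the interface stays the same.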

Output

The output is a 2D array, where output[i][j] is the model's predicted probability that the token at vocab[j] is the next token, inputs[i+1]. For example:

vocab = ["all", "not", "heroes", "the", "wear", ".", "capes"]
inputs = [1, 0, 2, 4] # "not" "all" "heroes" "wear"
output = gpt(inputs)
#              ["all", "not", "heroes", "the", "wear", ".", "capes"]
# output[0] =  [0.75    0.1     0.0       0.15    0.0   0.0    0.0  ]
# given just "not", the model predicts the word "all" with the highest probability


#              ["all", "not", "heroes", "the", "wear", ".", "capes"]
# output[1] =  [0.0     0.0      0.8     0.1    0.0    0.0   0.1  ]
# given the sequence ["not", "all"], the model predicts the word "heroes" with the highest probability


#              ["all", "not", "heroes", "the", "wear", ".", "capes"]
# output[-1] = [0.0     0.0     0.0     0.1     0.0    0.05  0.85  ]
# given the whole sequence ["not", "all", "heroes", "wear"], the model predicts the word "capes" with the highest probability

To get a next-token prediction for the whole sequence, we simply take the token with the highest probability in output[-1]:

vocab = ["all", "not", "heroes", "the", "wear", ".", "capes"]
inputs = [1, 0, 2, 4] # "not" "all" "heroes" "wear"
output = gpt(inputs)
next_token_id = np.argmax(output[-1]) # next_token_id = 6
next_token = vocab[next_token_id] # next_token = "capes"

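Taking the token with the highest probability as our prediction is known as greedy decoding or greedy sampling.

The task of predicting the next logical word in a sequence is called language modeling. As such, we can call a GPT a language model.

Generating a single word is cool, but what about whole sentences, paragraphs, and so on?

Generating Text

Autoregressive

We can generate full sentences by iteratively getting the next-token prediction from our model. At each iteration, we append the predicted token back to the input: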

def generate(inputs, n_tokens_to_generate):
    for _ in range(n_tokens_to_generate): # auto-regressive decode loop
        output = gpt(inputs) # model forward pass
        next_id = np.argmax(output[-1]) # greedy sampling
        inputs.append(int(next_id)) # append prediction to input
    return inputs[len(inputs) - n_tokens_to_generate :]  # only return generated ids


input_ids = [1, 0] # "not" "all"
output_ids = generate(input_ids, 3) # output_ids = [2, 4, 6]
output_tokens = [vocab[i] for i in output_ids] # "heroes" "wear" "capes"

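This process of predicting a future value (regression) and adding it back into the input (auto) is why you might see a GPT described as autoregressive.

Sampling

Instead of decoding greedily, we can sample from the probability distribution to introduce some randomness into our generations: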

inputs = [1, 0, 2, 4] # "not" "all" "heroes" "wear"
output = gpt(inputs)
np.random.choice(np.arange(vocab_size), p=output[-1]) # capes
np.random.choice(np.arange(vocab_size), p=output[-1]) # hats
np.random.choice(np.arange(vocab_size), p=output[-1]) # capes
np.random.choice(np.arange(vocab_size), p=output[-1]) # capes
np.random.choice(np.arange(vocab_size), p=output[-1]) # pants

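This lets us generate different sentences from the same input.

When combined with techniques such as top-k, top-p, and temperature, which modify the distribution before sampling, the quality of our outputs improves greatly.

These techniques also introduce hyperparameters that we can use to get different generation behaviors (for example, raising the temperature makes the model take more risks and thus be more "creative").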
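
To make temperature and top-k concrete, here is a minimal sketch of how a distribution could be modified before sampling. The function name sample_next_token and its arguments are assumptions made for this example, not part of the tutorial's code. Temperature divides the log-probabilities before renormalizing, and top-k masks out everything except the k most likely tokens (ties at the cutoff are all kept).

```python
import numpy as np

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

def sample_next_token(probs, temperature=1.0, top_k=None, rng=None):
    """Sample a token id from `probs` after temperature scaling and optional top-k filtering."""
    rng = rng if rng is not None else np.random.default_rng()
    with np.errstate(divide="ignore"):  # log(0) = -inf is fine here: that token stays impossible
        logits = np.log(np.asarray(probs, dtype=np.float64)) / temperature
    if top_k is not None:
        # keep only the top_k largest logits, mask out the rest
        kth_largest = np.sort(logits)[-top_k]
        logits = np.where(logits < kth_largest, -np.inf, logits)
    adjusted = softmax(logits)
    return int(rng.choice(len(adjusted), p=adjusted))

probs = [0.0, 0.0, 0.0, 0.1, 0.0, 0.05, 0.85]  # same values as output[-1] in the earlier example
rng = np.random.default_rng(0)
print(sample_next_token(probs, top_k=1, rng=rng))  # 6: top_k=1 is the same as greedy decoding
print([sample_next_token(probs, temperature=2.0, rng=rng) for _ in range(5)])
```

A temperature above 1 flattens the distribution (more variety), while a temperature below 1 sharpens it toward the greedy choice.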

Training

We train a GPT like any other neural network, using gradient descent with respect to some loss function. For a GPT, we take the cross-entropy loss over the language modeling task:

def lm_loss(inputs: list[int], params) -> float:
    # the labels y are just the input shifted 1 to the left
    #
    # inputs = [not,     all,   heroes,  wear,   capes]
    #      x = [not,     all,   heroes,  wear]
    #      y = [all,  heroes,     wear,  capes]
    #
    # of course, we don't have a label for inputs[-1], so we exclude it from x
    #
    # as such, for N inputs, we have N - 1 language modeling example pairs
    x, y = inputs[:-1], inputs[1:]

    # forward pass
    # all the predicted next token probability distributions at each position
    output = gpt(x, params)  # [N - 1, n_vocab]

    # cross entropy loss
    # pick the probability assigned to the correct next token at each position,
    # then average the negative log-probabilities over all N-1 examples
    loss = np.mean(-np.log(np.asarray(output)[np.arange(len(y)), y]))

    return loss


def train(texts: list[str], params):  # returns the updated params
    for text in texts:
        inputs = tokenizer.encode(text)
        loss = lm_loss(inputs, params)
        gradients = compute_gradients_via_backpropagation(loss, params)
        params = gradient_descent_update_step(gradients, params)
    return params

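This is a heavily simplified training setup, but it illustrates the point.

Notice that we added params to the gpt function signature (we left it out in the previous sections for simplicity). During each iteration of the training loop:

- We compute the language modeling loss for the given input text example.

- The loss determines our gradients, which we compute via backpropagation.

- We use the gradients to update our model parameters so that the loss is minimized (gradient descent).

Notice that we don't use explicitly labeled data. Instead, we can produce the input/label pairs from the raw text itself. This is referred to as self-supervised learning.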
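
As a small worked example of how the input/label pairs fall out of the raw token sequence, the sketch below builds the shifted pairs and evaluates the cross-entropy loss on a made-up probability matrix. The probabilities are invented purely for illustration; no model is involved.

```python
import numpy as np

vocab = ["all", "not", "heroes", "the", "wear", ".", "capes"]
inputs = [1, 0, 2, 4, 6]  # "not" "all" "heroes" "wear" "capes"

# self-supervised pairs: the label at each position is simply the next token
x, y = inputs[:-1], inputs[1:]
pairs = [(vocab[a], vocab[b]) for a, b in zip(x, y)]
print(pairs)  # [('not', 'all'), ('all', 'heroes'), ('heroes', 'wear'), ('wear', 'capes')]

# made-up model output for x: one probability distribution over the vocab per position
output = np.full((len(x), len(vocab)), 0.05)
output[np.arange(len(y)), y] = 0.7  # pretend the model puts 0.7 on each correct token

# cross entropy: average negative log-probability of the correct next tokens
loss = np.mean(-np.log(output[np.arange(len(y)), y]))
print(round(float(loss), 4))  # 0.3567
```

Note the indexing output[np.arange(len(y)), y]: it selects one probability per position, namely the one assigned to that position's correct next token.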

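Self-supervision lets us scale up the training data massively: just get as much raw text as possible and feed it to the model. For example, GPT-3 was trained on 300 billion tokens of text from the internet and books.

Of course, you need a sufficiently large model to learn from all that data, which is why GPT-3 has 175 billion parameters and probably cost between $1 million and $10 million in compute to train.

This self-supervised training step is called pre-training, since we can reuse the "pre-trained" model weights to further train the model on downstream tasks. Pre-trained models are sometimes also called "foundation models".

Training the model on a downstream task is called fine-tuning, since the model weights have already been pre-trained to understand language and are only being fine-tuned for the specific task at hand.

The "pre-train on a general task, then fine-tune on a specific task" strategy is called transfer learning.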

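Prompting

In principle, the original GPT paper was only about the benefits of pre-training a Transformer model for transfer learning.

The paper showed that a pre-trained 117M-parameter GPT achieved state-of-the-art performance on a variety of NLP tasks when fine-tuned on labeled datasets.

It wasn't until the GPT-2 and GPT-3 papers that we realized a GPT pre-trained on enough data with enough parameters can perform a wide range of tasks by itself, with no fine-tuning needed.

Just prompt the model, perform autoregressive language modeling, and, like magic, the model gives an appropriate response. This is referred to as in-context learning, because the model uses only the context of the prompt to perform the task.

In-context learning can be zero-shot, one-shot, or few-shot.

Generating text given a prompt is also called conditional generation, since our model generates some output conditioned on some input.

GPTs are not limited to NLP tasks.

You can condition the model on anything you want. For example, you can turn a GPT into a chatbot (such as ChatGPT) by conditioning it on the conversation history.

With that out of the way, let's finally get to the actual implementation.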

Setup

Clone the repository for this tutorial:

git clone https://github.com/jaymody/picoGPT
cd picoGPT

Then install the dependencies:

pip install -r requirements.txt

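Note: this code was tested with Python 3.9.10.

A quick breakdown of each file:

- encoder.py contains the code for OpenAI's BPE tokenizer, taken straight from the gpt-2 repo.

- utils.py contains the code to download and load the GPT-2 model weights, tokenizer, and hyperparameters.

- gpt2.py contains the actual GPT model and generation code, which we can run as a Python script.

- gpt2_pico.py is the same as gpt2.py, but in even fewer lines of code.

We'll be reimplementing gpt2.py from scratch, so let's delete it and recreate it as an empty file: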

rm gpt2.py
touch gpt2.py

First, paste the following code into gpt2.py:

import numpy as np

def gpt2(inputs, wte, wpe, blocks, ln_f, n_head):
    pass # TODO: implement this

def generate(inputs, params, n_head, n_tokens_to_generate):
    from tqdm import tqdm

    for _ in tqdm(range(n_tokens_to_generate), "generating"):  # auto-regressive decode loop
        logits = gpt2(inputs, **params, n_head=n_head)  # model forward pass
        next_id = np.argmax(logits[-1])  # greedy sampling
        inputs.append(int(next_id))  # append prediction to input

    return inputs[len(inputs) - n_tokens_to_generate :]  # only return generated ids

def main(prompt: str, n_tokens_to_generate: int = 40, model_size: str = "124M", models_dir: str = "models"):
    from utils import load_encoder_hparams_and_params

    # load encoder, hparams, and params from the released open-ai gpt-2 files
    encoder, hparams, params = load_encoder_hparams_and_params(model_size, models_dir)

    # encode the input string using the BPE tokenizer
    input_ids = encoder.encode(prompt)

    # make sure we are not surpassing the max sequence length of our model
    assert len(input_ids) + n_tokens_to_generate < hparams["n_ctx"]

    # generate output ids
    output_ids = generate(input_ids, params, hparams["n_head"], n_tokens_to_generate)

    # decode the ids back into a string
    output_text = encoder.decode(output_ids)

    return output_text

if __name__ == "__main__":
    import fire

    fire.Fire(main)

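Breaking down each of the four sections:

- The gpt2 function is the actual GPT code we'll be implementing. You'll notice that the function signature includes some extra arguments in addition to inputs:

  - wte, wpe, blocks, and ln_f are the parameters of our model.

  - n_head is a hyperparameter that is needed during the forward pass.

- The generate function is the autoregressive decoding algorithm we saw earlier. We use greedy sampling for simplicity. tqdm is a progress bar that helps us visualize the decoding process as it generates one token at a time.

- The main function handles:

  - Loading the tokenizer (encoder), model weights (params), and hyperparameters (hparams)

  - Encoding the input prompt into token IDs using the tokenizer

  - Calling the generate function

  - Decoding the output IDs into a string

- fire.Fire(main) simply turns our file into a CLI application, so we can eventually run our code with: python gpt2.py "some prompt here"

Let's take a closer look at encoder, hparams, and params. In a notebook or an interactive Python session, run: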

from utils import load_encoder_hparams_and_params
encoder, hparams, params = load_encoder_hparams_and_params("124M", "models")

This will download the necessary model and tokenizer files to models/124M and load encoder, hparams, and params into our code.

Encoder

encoder is the BPE tokenizer used by GPT-2:

ids = encoder.encode("Not all heroes wear capes.")
ids
[3673, 477, 10281, 5806, 1451, 274, 13]


encoder.decode(ids)
"Not all heroes wear capes."

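Using the vocabulary of the tokenizer (stored in encoder.decoder), we can take a peek at what the actual tokens look like:

[encoder.decoder[i] for i in ids]
['Not', 'Ġall', 'Ġheroes', 'Ġwear', 'Ġcap', 'es', '.']

Notice that our tokens are sometimes whole words (e.g. Not), sometimes words with a space in front of them (e.g. Ġall, where Ġ represents a space), sometimes parts of words (e.g. capes is split into Ġcap and es), and sometimes punctuation (e.g. .).

One nice thing about BPE is that it can encode any arbitrary string. If it encounters something that is not present in the vocabulary, it just breaks it down into substrings it does understand: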

[encoder.decoder[i] for i in encoder.encode("zjqfl")]
['z', 'j', 'q', 'fl']

We can also check the size of the vocabulary:

len(encoder.decoder)
50257

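The vocabulary, as well as the byte-pair merges that determine how strings are broken down, is obtained by training the tokenizer.

When we load the tokenizer, we load the already-trained vocabulary and byte-pair merges from files, which were downloaded alongside the model files when we ran load_encoder_hparams_and_params.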
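
To give a feel for how a list of byte-pair merges determines the split, here is a simplified sketch that applies ranked merges to a single word. The merge table is hypothetical and chosen only so the outputs resemble the examples above. The real GPT-2 encoder works on bytes mapped to printable characters and uses its own trained merge file, so treat this as an illustration of the idea rather than a reimplementation of encoder.py.

```python
def apply_bpe_merges(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Greedily apply ranked merges (earlier in `merges` = higher priority) to one word."""
    rank = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)  # start from individual characters
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # pick the adjacent pair with the best (lowest) rank
        best = min(pairs, key=lambda p: rank.get(p, float("inf")))
        if best not in rank:
            break  # no learned merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# hypothetical merge table, for illustration only (not GPT-2's real merges)
toy_merges = [("c", "a"), ("e", "s"), ("ca", "p"), ("f", "l")]
print(apply_bpe_merges("capes", toy_merges))  # ['cap', 'es']
print(apply_bpe_merges("zjqfl", toy_merges))  # ['z', 'j', 'q', 'fl']
```

Because every single character is itself a valid symbol, a string with no applicable merges still encodes; it just falls back to smaller pieces.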

Hyperparameters

hparams is a dictionary that contains the hyperparameters of our model:

>>> hparams
{
  "n_vocab": 50257, # number of tokens in our vocabulary
  "n_ctx": 1024, # maximum possible sequence length of the input
  "n_embd": 768, # embedding dimension (determines the "width" of the network)
  "n_head": 12, # number of attention heads (n_embd must be divisible by n_head)
  "n_layer": 12 # number of layers (determines the "depth" of the network)
}

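We'll use these symbols in code comments to show the underlying shape of things. We'll also use n_seq to denote the length of the input sequence (i.e. n_seq = len(inputs)).

Parameters

params is a nested dictionary that holds the trained weights of our model. The leaf nodes of the dictionary are NumPy arrays. Printing the shapes gives: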

>>> import numpy as np
>>> def shape_tree(d):
...     if isinstance(d, np.ndarray):
...         return list(d.shape)
...     elif isinstance(d, list):
...         return [shape_tree(v) for v in d]
...     elif isinstance(d, dict):
...         return {k: shape_tree(v) for k, v in d.items()}
...     else:
...         raise ValueError("uh oh")
...
>>> print(shape_tree(params))
{
    "wpe": [1024, 768],
    "wte": [50257, 768],
    "ln_f": {"b": [768], "g": [768]},
    "blocks": [
        {
            "attn": {
                "c_attn": {"b": [2304], "w": [768, 2304]},
                "c_proj": {"b": [768], "w": [768, 768]},
            },
            "ln_1": {"b": [768], "g": [768]},
            "ln_2": {"b": [768], "g": [768]},
            "mlp": {
                "c_fc": {"b": [3072], "w": [768, 3072]},
                "c_proj": {"b": [768], "w": [3072, 768]},
            },
        },
        ... # repeat for n_layers
    ]
}
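
As a sanity check on these shapes, the total number of weights can be computed from hparams alone. The sketch below assumes only the shapes printed above (for instance that 2304 = 3 * n_embd and 3072 = 4 * n_embd); the function name count_params is made up for this example.

```python
hparams = {"n_vocab": 50257, "n_ctx": 1024, "n_embd": 768, "n_head": 12, "n_layer": 12}

def count_params(hparams: dict) -> int:
    """Count the weights implied by the parameter shapes listed above."""
    n_vocab, n_ctx = hparams["n_vocab"], hparams["n_ctx"]
    n_embd, n_layer = hparams["n_embd"], hparams["n_layer"]

    embeddings = n_vocab * n_embd + n_ctx * n_embd                              # wte + wpe
    layer_norm = 2 * n_embd                                                      # g + b
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)      # c_attn + c_proj
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)   # c_fc + c_proj
    block = attn + mlp + 2 * layer_norm                                          # plus ln_1 and ln_2
    return embeddings + n_layer * block + layer_norm                             # plus the final ln_f

print(count_params(hparams))  # 124439808
```

The result, roughly 124 million, lines up with the "124M" model size name used when loading the weights.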

These are loaded from the original OpenAI TensorFlow checkpoint:

import tensorflow as tf
tf_ckpt_path = tf.train.latest_checkpoint("models/124M")
for name, _ in tf.train.list_variables(tf_ckpt_path):
    arr = tf.train.load_variable(tf_ckpt_path, name).squeeze()
    print(f"{name}: {arr.shape}")
model/h0/attn/c_attn/b: (2304,)
model/h0/attn/c_attn/w: (768, 2304)
model/h0/attn/c_proj/b: (768,)
model/h0/attn/c_proj/w: (768, 768)
model/h0/ln_1/b: (768,)
model/h0/ln_1/g: (768,)
model/h0/ln_2/b: (768,)
model/h0/ln_2/g: (768,)
model/h0/mlp/c_fc/b: (3072,)
model/h0/mlp/c_fc/w: (768, 3072)
model/h0/mlp/c_proj/b: (768,)
model/h0/mlp/c_proj/w: (3072, 768)
model/h1/attn/c_attn/b: (2304,)
model/h1/attn/c_attn/w: (768, 2304)
...
model/h9/mlp/c_proj/b: (768,)
model/h9/mlp/c_proj/w: (3072, 768)
model/ln_f/b: (768,)
model/ln_f/g: (768,)
model/wpe: (1024, 768)
model/wte: (50257, 768)

The code in utils.py converts the TensorFlow variables above into our params dictionary.

For reference, here are the shapes of params, but with the numbers replaced by the hparams they represent:

{
    "wpe": [n_ctx, n_embd],
    "wte": [n_vocab, n_embd],
    "ln_f": {"b": [n_embd], "g": [n_embd]},
    "blocks": [
        {
            "attn": {
                "c_attn": {"b": [3*n_embd], "w": [n_embd, 3*n_embd]},
                "c_proj": {"b": [n_embd], "w": [n_embd, n_embd]},
            },
            "ln_1": {"b": [n_embd], "g": [n_embd]},
            "ln_2": {"b": [n_embd], "g": [n_embd]},
            "mlp": {
                "c_fc": {"b": [4*n_embd], "w": [n_embd, 4*n_embd]},
                "c_proj": {"b": [n_embd], "w": [4*n_embd, n_embd]},
            },
        },
        ... # repeat for n_layers
    ]
}

Basic Layers

Last thing before we get into the actual GPT architecture itself: let's implement some of the more basic neural network layers that are not specific to GPTs.

GELU

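The non-linearity (activation function) of choice for GPT-2 is GELU (Gaussian Error Linear Unit), an alternative to ReLU.

It is approximated by the following function: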

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

Like ReLU, GELU operates element-wise on the input:

gelu(np.array([[1, 2], [-2, 0.5]]))
array([[ 0.84119,  1.9546 ],
       [-0.0454 ,  0.34571]])
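
To see how close the tanh approximation is, the sketch below compares it with the exact definition of GELU, x * Φ(x), where Φ is the standard normal CDF. The helper gelu_exact is added here for comparison only and is not part of the tutorial's code.

```python
import math
import numpy as np

def gelu(x):
    # tanh approximation used in the article
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def gelu_exact(x):
    # exact definition: x * Phi(x), with Phi written in terms of the error function
    erf = np.vectorize(math.erf)
    return 0.5 * x * (1 + erf(x / math.sqrt(2)))

x = np.linspace(-5, 5, 1001)
max_abs_error = np.max(np.abs(gelu(x) - gelu_exact(x)))
print(max_abs_error < 1e-2)  # True
```

Over this range the two curves agree closely, which is why the cheaper tanh form is a reasonable stand-in.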

Softmax

Good ole softmax:

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

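We use the max(x) trick for numerical stability.

Softmax is used to convert a set of real numbers (between -∞ and ∞) into probabilities (between 0 and 1, with all the numbers summing to 1). We apply softmax over the last axis of the input.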

x = softmax(np.array([[2, 100], [-5, 0]]))
x
array([[0.00034, 0.99966],
       [0.26894, 0.73106]])
x.sum(axis=-1)
array([1., 1.])
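
To show why the max(x) trick matters, the sketch below compares the stable version with a naive one on large inputs. naive_softmax is written here only for contrast. Subtracting the row maximum leaves the result mathematically unchanged but keeps every exponent at or below zero, so nothing overflows.

```python
import numpy as np

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

def naive_softmax(x):
    exp_x = np.exp(x)  # overflows to inf for large inputs
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

x = np.array([1000.0, 1001.0])
with np.errstate(over="ignore", invalid="ignore"):  # silence the expected overflow warnings
    print(naive_softmax(x))  # [nan nan], since exp(1000) overflows and inf / inf is nan
print(softmax(x))            # approximately [0.26894 0.73106]
```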

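Layer Normalization

Layer normalization standardizes values to have a mean of 0 and a variance of 1:

def layer_norm(x, g, b, eps: float = 1e-5):
    mean = np.mean(x, axis=-1, keepdims=True)
    variance = np.var(x, axis=-1, keepdims=True)
    x = (x - mean) / np.sqrt(variance + eps)  # normalize x to have mean=0 and var=1 over last axis
    return g * x + b  # scale and offset with gamma/beta params

Layer normalization ensures that the inputs to each layer always fall within a consistent range, which speeds up and stabilizes training.

As with batch normalization, the normalized output is then scaled and offset with two learnable vectors, gamma and beta. The small epsilon term in the denominator is used to avoid a division-by-zero error.

Transformers use layer norm instead of batch norm, for various reasons.

We apply layer normalization over the last axis of the input.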

>>> x = np.array([[2, 2, 3], [-5, 0, 1]])
>>> x = layer_norm(x, g=np.ones(x.shape[-1]), b=np.zeros(x.shape[-1]))
>>> x
array([[-0.70709, -0.70709,  1.41418],
       [-1.397  ,  0.508  ,  0.889  ]])
>>> x.var(axis=-1)
array([0.99996, 1.     ]) # floating point shenanigans
>>> x.mean(axis=-1)
array([-0., -0.])

Linear

Your standard matrix multiplication plus bias:

def linear(x, w, b):  # [m, in], [in, out], [out] -> [m, out]
    return x @ w + b

Linear layers are often referred to as projections (since they project from one vector space to another).

>>> x = np.random.normal(size=(64, 784)) # input dim = 784, batch/sequence dim = 64
>>> w = np.random.normal(size=(784, 10)) # output dim = 10
>>> b = np.random.normal(size=(10,))
>>> x.shape # shape before linear projection
(64, 784)
>>> linear(x, w, b).shape # shape after linear projection
(64, 10)

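GPT Architecture

The GPT architecture follows that of the Transformer.

At a high level, the GPT architecture has three parts:

- Text + positional embeddings

- A transformer decoder stack

- A projection to vocab step

In code, it looks like this: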

def gpt2(inputs, wte, wpe, blocks, ln_f, n_head):  # [n_seq] -> [n_seq, n_vocab]
    # token + positional embeddings
    x = wte[inputs] + wpe[range(len(inputs))]  # [n_seq] -> [n_seq, n_embd]


    # forward pass through n_layer transformer blocks
    for block in blocks:
        x = transformer_block(x, **block, n_head=n_head)  # [n_seq, n_embd] -> [n_seq, n_embd]


    # projection to vocab
    x = layer_norm(x, **ln_f)  # [n_seq, n_embd] -> [n_seq, n_embd]
    return x @ wte.T  # [n_seq, n_embd] -> [n_seq, n_vocab]
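
The gpt2 function above calls transformer_block, which this article does not show. The following is a minimal sketch of what such a block could look like, assuming pre-norm multi-head causal self-attention followed by a position-wise feed-forward network, with argument names chosen to match the nesting of params["blocks"] shown earlier. The helper names ffn, attention, mha, and dense are assumptions made for illustration; the gpt2.py in the cloned repository is the authoritative version.

```python
import numpy as np

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

def layer_norm(x, g, b, eps: float = 1e-5):
    mean = np.mean(x, axis=-1, keepdims=True)
    variance = np.var(x, axis=-1, keepdims=True)
    return g * (x - mean) / np.sqrt(variance + eps) + b

def linear(x, w, b):  # [m, in], [in, out], [out] -> [m, out]
    return x @ w + b

def ffn(x, c_fc, c_proj):  # [n_seq, n_embd] -> [n_seq, n_embd]
    return linear(gelu(linear(x, **c_fc)), **c_proj)  # project up, GELU, project back down

def attention(q, k, v, mask):  # scaled dot-product attention
    return softmax(q @ k.T / np.sqrt(q.shape[-1]) + mask) @ v

def mha(x, c_attn, c_proj, n_head):  # [n_seq, n_embd] -> [n_seq, n_embd]
    qkv = np.split(linear(x, **c_attn), 3, axis=-1)  # 3 x [n_seq, n_embd]
    q_heads, k_heads, v_heads = (np.split(t, n_head, axis=-1) for t in qkv)
    # causal mask: position i may only attend to positions <= i
    causal_mask = (1 - np.tri(x.shape[0], dtype=x.dtype)) * -1e10
    heads = [attention(q, k, v, causal_mask) for q, k, v in zip(q_heads, k_heads, v_heads)]
    return linear(np.hstack(heads), **c_proj)  # merge heads, then project

def transformer_block(x, mlp, attn, ln_1, ln_2, n_head):  # [n_seq, n_embd] -> [n_seq, n_embd]
    x = x + mha(layer_norm(x, **ln_1), **attn, n_head=n_head)  # attention sub-layer + residual
    x = x + ffn(layer_norm(x, **ln_2), **mlp)                  # feed-forward sub-layer + residual
    return x

# random weights with the same nesting as one entry of params["blocks"], at a tiny size
rng = np.random.default_rng(0)
n_seq, n_embd, n_head = 5, 8, 2

def dense(n_in, n_out):
    return {"w": rng.normal(size=(n_in, n_out)) * 0.1, "b": np.zeros(n_out)}

block = {
    "attn": {"c_attn": dense(n_embd, 3 * n_embd), "c_proj": dense(n_embd, n_embd)},
    "ln_1": {"g": np.ones(n_embd), "b": np.zeros(n_embd)},
    "ln_2": {"g": np.ones(n_embd), "b": np.zeros(n_embd)},
    "mlp": {"c_fc": dense(n_embd, 4 * n_embd), "c_proj": dense(4 * n_embd, n_embd)},
}
x = rng.normal(size=(n_seq, n_embd))
print(transformer_block(x, **block, n_head=n_head).shape)  # (5, 8)
```

The causal mask is what makes this a decoder block: the output at position i depends only on positions up to i, which is what lets output[i] be read as a prediction for inputs[i+1].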

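Putting It All Together

Putting all of this together, we get gpt2.py, which in its entirety is a mere 120 lines of code (60 lines if you remove comments and whitespace).

We can test our implementation with:

python gpt2.py \
    "Alan Turing theorized that computers would one day become" \
    --n_tokens_to_generate 8

which gives the output: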

the most powerful machines on the planet.

It works!

We can check that our implementation gives the same result as OpenAI's official GPT-2 repo using the following Dockerfile:

docker build -t "openai-gpt-2" "https://gist.githubusercontent.com/jaymody/9054ca64eeea7fad1b58a185696bb518/raw/Dockerfile"
docker run -dt --name "openai-gpt-2-app" "openai-gpt-2"
docker exec -it "openai-gpt-2-app" /bin/bash -c 'python3 src/interactive_conditional_samples.py --length 8 --model_type 124M --top_k 1'
# paste "Alan Turing theorized that computers would one day become" when prompted

This should give an identical result:

the most powerful machines on the planet.

What Next?

This implementation is cool, but it's missing a ton of bells and whistles:

GPU/TPU Support

Replace NumPy with JAX:

import jax.numpy as np

You can now use the code with GPUs and even TPUs! Just make sure you have installed JAX correctly.

Backpropagation

Again, if we replace NumPy with JAX:

import jax.numpy as np

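Then computing the gradients is as simple as:

import jax

def lm_loss(params, inputs, n_head) -> float:
    # inputs is assumed to be an integer array of token ids
    x, y = inputs[:-1], inputs[1:]
    logits = gpt2(x, **params, n_head=n_head)  # [n_seq - 1, n_vocab]
    probs = jax.nn.softmax(logits, axis=-1)  # gpt2 returns logits, so normalize first
    loss = np.mean(-np.log(probs[np.arange(len(y)), y]))
    return loss

grads = jax.grad(lm_loss)(params, inputs, n_head)

Batching

Once again, if we replace NumPy with JAX: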

import jax.numpy as np

Then making our gpt2 function batched is as easy as:

gpt2_batched = jax.vmap(gpt2, in_axes=[0, None, None, None, None, None])
gpt2_batched(batched_inputs, wte, wpe, blocks, ln_f, n_head) # [batch, seq_len] -> [batch, seq_len, vocab]

Inference Optimization

Our implementation is quite inefficient. Beyond GPU and batching support, the quickest and most impactful optimization you can make is to implement a KV cache.
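
The idea behind a KV cache is that, during autoregressive decoding, the keys and values of earlier tokens never change, so only the newest token needs to be projected at each step. The sketch below demonstrates this on a single attention head with random weights. The function names and the dictionary-based cache are assumptions made for illustration and are not taken from the tutorial's repository.

```python
import numpy as np

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

def causal_attention(x, w_q, w_k, w_v):
    """Full recomputation: attention over the whole sequence, [n_seq, d] -> [n_seq, d]."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    mask = (1 - np.tri(x.shape[0])) * -1e10
    return softmax(q @ k.T / np.sqrt(q.shape[-1]) + mask) @ v

def cached_attention_step(x_new, w_q, w_k, w_v, cache):
    """Incremental step: only the newest token is projected; older keys/values come from the cache."""
    cache["k"].append(x_new @ w_k)
    cache["v"].append(x_new @ w_v)
    k, v = np.stack(cache["k"]), np.stack(cache["v"])  # [n_cached, d]
    q = x_new @ w_q                                     # [d]
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v  # [d]

rng = np.random.default_rng(0)
d = 4
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
x = rng.normal(size=(6, d))  # six token embeddings

full = causal_attention(x, w_q, w_k, w_v)
cache = {"k": [], "v": []}
incremental = np.stack([cached_attention_step(row, w_q, w_k, w_v, cache) for row in x])
print(np.allclose(full, incremental))  # True
```

Both paths give the same outputs, but the cached version does a fixed amount of projection work per new token instead of reprocessing the whole sequence on every decoding step.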

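Training

Training a GPT is pretty standard for a neural network (gradient descent with respect to a loss function).

Of course, you also need to use the standard bag of tricks when training a GPT (e.g. using the Adam optimizer, finding the optimal learning rate, regularizing via dropout and/or weight decay, using a learning rate scheduler, using the correct weight initialization, batching, and so on).

The real secret sauce to training a good GPT model is the ability to scale the data and the model, which is also where the real challenge lies.

For scaling data, you'll want a corpus of text that is big, high quality, and diverse.

- Big means billions of tokens (terabytes of data).

- High quality means you want to filter out duplicate examples, unformatted text, incoherent text, garbage text, and so on.

- Diverse means varying sequence lengths, about lots of different topics, from different sources, with differing perspectives, and so on.

Evaluation

How to evaluate an LLM is a really hard problem.

Stopping Generation

Our current implementation requires us to specify the exact number of tokens we'd like to generate ahead of time. This is not a good approach, as our generations end up being too long, too short, or cut off mid-sentence.

To resolve this, we can introduce a special end-of-sentence (EOS) token.

During pre-training, we append the EOS token to the end of our input (i.e. tokens = ["not", "all", "heroes", "wear", "capes", ".", "<|EOS|>"]).

During generation, we stop whenever we encounter the EOS token (or if we hit some maximum sequence length):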

def generate(inputs, eos_id, max_seq_len):
    prompt_len = len(inputs)
    while inputs[-1] != eos_id and len(inputs) < max_seq_len:
        output = gpt(inputs)
        next_id = np.argmax(output[-1])
        inputs.append(int(next_id))
    return inputs[prompt_len:]

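GPT-2 was not pre-trained with an EOS token, so we can't use this approach in our code.

Unconditional Generation

Generating text with our model requires us to condition it with a prompt.

However, we can also make our model perform unconditional generation, where the model generates text without any kind of input prompt.

This is achieved by prepending a special beginning-of-sentence (BOS) token to the start of the input during pre-training (i.e. tokens = ["<|BOS|>", "not", "all", "heroes", "wear", "capes", "."]).

Then, to generate text unconditionally, we input a list that contains only the BOS token: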

def generate_unconditioned(bos_id, n_tokens_to_generate):
    inputs = [bos_id]
    for _ in range(n_tokens_to_generate):
        output = gpt(inputs)
        next_id = np.argmax(output[-1])
        inputs.append(int(next_id))
    return inputs[1:]

GPT-2 was pre-trained with a BOS token (which goes by the name <|endoftext|>), so running unconditional generation with our implementation is as easy as changing the line that encodes the prompt to:

input_ids = encoder.encode(prompt) if prompt else [encoder.encoder["<|endoftext|>"]]

And then running:

python gpt2.py ""

which generates:

The first time I saw the new version of the game, I was so excited. I was so excited to see the new version of the game, I was so excited to see the new version

因?yàn)槲覀兪褂玫氖秦澙凡蓸?，所以輸出不是很好（重?fù)），而且是確定性的（即，每次我們運(yùn)行代碼時(shí)都是相同的輸出）。為了得到質(zhì)量更高且不確定的生成，我們需要直接從分布中抽樣（理想情況下，在應(yīng)用類似top-p的方法之后）。

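Unconditional generation is not particularly useful, but it's a fun way of demonstrating the abilities of a GPT.

Fine-tuning

We briefly touched on fine-tuning in the training section. Recall that fine-tuning is when we reuse the pre-trained weights to train the model on some downstream task. We call this process transfer learning.

In theory, we could use zero-shot or few-shot prompting to get the model to complete our task. However, if you have access to a labeled dataset, fine-tuning a GPT is going to yield better results (results that can scale given additional data and higher-quality data).

There are a few different topics related to fine-tuning; I've broken them down below:

Classification Fine-tuning

In classification fine-tuning, we give the model some text and ask it to predict which class it belongs to.

For example, consider the IMDB dataset, which contains movie reviews that rate the movie as either good or bad: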

--- Example 1 ---
Text: I wouldn't rent this one even on dollar rental night.
Label: Bad
--- Example 2 ---
Text: I don't know why I like this movie so well, but I never get tired of watching it.
Label: Good
--- Example 3 ---
...

To fine-tune our model, we replace the language modeling head with a classification head, which we apply to the last token output:

def gpt2(inputs, wte, wpe, blocks, ln_f, cls_head, n_head):
    x = wte[inputs] + wpe[range(len(inputs))]
    for block in blocks:
        x = transformer_block(x, **block, n_head=n_head)
    x = layer_norm(x, **ln_f)


    # project to n_classes
    # [n_embd] @ [n_embd, n_classes] -> [n_classes]
    return x[-1] @ cls_head

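We only use the last token output x[-1] because we only need to produce a single probability distribution for the entire input, instead of n_seq distributions as in language modeling.

We take the last token in particular because the last token is the only token that is allowed to attend to the entire sequence, and thus has information about the input text as a whole.

As usual, we optimize with respect to the cross-entropy loss: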

def single_example_loss_fn(inputs: list[int], label: int, params) -> float:
    logits = gpt(inputs, **params)
    probs = softmax(logits)
    loss = -np.log(probs[label]) # cross entropy loss
    return loss

We can also perform multi-label classification by applying sigmoid instead of softmax and taking the binary cross-entropy loss with respect to each class.
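
The multi-label variant could be sketched as follows. Each class gets its own independent yes/no probability via sigmoid, and the loss is the binary cross entropy averaged over classes. The function name multilabel_loss, the eps guard, and the example logits are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def multilabel_loss(logits, labels, eps=1e-12):
    """Mean binary cross entropy, treating every class as an independent yes/no decision."""
    probs = sigmoid(np.asarray(logits, dtype=np.float64))
    labels = np.asarray(labels, dtype=np.float64)
    # eps keeps log() finite if a probability saturates at exactly 0 or 1
    bce = -(labels * np.log(probs + eps) + (1 - labels) * np.log(1 - probs + eps))
    return float(np.mean(bce))

# hypothetical 3-class example: one logit per class from the classification head
logits = [2.0, -1.0, 0.0]
labels = [1, 0, 1]  # several classes can be active at once
print(multilabel_loss(logits, labels))
```

With softmax the class probabilities compete and must sum to 1; with sigmoid they do not, which is what allows more than one label to be true for the same input.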

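Generative Fine-tuning

Some tasks can't be neatly framed as classification. Summarization is one example.

We can fine-tune on these kinds of tasks by simply performing language modeling on the input concatenated with the label. For example, here is what a single summarization training sample might look like: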

--- Article ---
This is an article I would like to summarize.
--- Summary ---
This is the summary.

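We train the model as we do during pre-training (optimizing with respect to the language modeling loss).

At prediction time, we feed the model everything up to --- Summary --- and then perform autoregressive language modeling to generate the summary.

The choice of the delimiters --- Article --- and --- Summary --- is arbitrary. How you format the text is up to you, as long as it is consistent between training and inference.

Note that we can also formulate classification tasks as generative tasks (for example, with IMDB):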

--- Text ---
I wouldn't rent this one even on dollar rental night.
--- Label ---
Bad

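Instruction Fine-tuning

These days, most state-of-the-art large models also undergo an additional instruction fine-tuning step after pre-training.

In this step, the model is fine-tuned (generatively) on thousands of human-labeled instruction prompt + completion pairs. Instruction fine-tuning can also be referred to as supervised fine-tuning, since the data is human-labeled.

So what is the benefit of instruction fine-tuning?

While predicting the next word in a Wikipedia article makes the model good at continuing sentences, it doesn't make it particularly good at following instructions, holding a conversation, or summarizing a document (all the things we would like a GPT to do).

Fine-tuning it on human-labeled instruction + completion pairs is a way to teach the model how to be more useful and to make it easier to interact with.

This is called AI alignment, since we are aligning the model to behave the way we want it to.

Parameter-Efficient Fine-tuning

When we talked about fine-tuning in the sections above, we assumed that we were updating all of the model parameters.

While this yields the best performance, it is costly both in terms of compute (we need to backpropagate through the entire model) and in terms of storage (each fine-tuned model needs its own brand-new copy of the parameters).

The simplest way to address this is to update only the head and freeze (i.e. make untrainable) the rest of the model.

While this speeds up training and greatly reduces the number of new parameters, it doesn't perform particularly well, since we lose the "deep" in deep learning.

Instead, we could selectively freeze specific layers, which helps restore the depth. This performs a lot better, but we become much less parameter-efficient and lose some of the training speed gains.

It's worth noting that we can also make use of parameter-efficient fine-tuning methods.

Take the Adapters paper as an example. In this approach, we add an additional "adapter" layer after the FFN and MHA layers in the transformer block.

The adapter layer is just a simple two-layer fully connected neural network, where the input and output dimensions are n_embd and the hidden dimension is smaller than n_embd.

The size of the hidden dimension is a hyperparameter that we can set, enabling us to trade off parameter count against performance.

The paper showed that, for a BERT model, this approach can reduce the number of trained parameters to 2% while sustaining only a small hit in performance (<1%) compared to full fine-tuning.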
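
A bottleneck adapter of the kind described above could be sketched as follows. The ReLU non-linearity, the residual connection around the adapter, the zero initialization of the up-projection, and the hidden size of 32 are assumptions made for illustration; the text above only specifies a two-layer network with n_embd inputs and outputs and a smaller hidden dimension.

```python
import numpy as np

def adapter(x, w_down, b_down, w_up, b_up):
    """Bottleneck adapter: [n_seq, n_embd] -> [n_seq, n_embd]."""
    h = np.maximum(0, x @ w_down + b_down)  # project down to n_hidden, then a ReLU non-linearity
    return x + (h @ w_up + b_up)            # project back up and add a residual connection

n_embd, n_hidden = 768, 32                  # n_hidden is the tunable bottleneck size
rng = np.random.default_rng(0)
w_down, b_down = rng.normal(size=(n_embd, n_hidden)) * 0.01, np.zeros(n_hidden)
w_up, b_up = np.zeros((n_hidden, n_embd)), np.zeros(n_embd)  # zero init: adapter starts as identity

x = rng.normal(size=(4, n_embd))
print(np.allclose(adapter(x, w_down, b_down, w_up, b_up), x))  # True

# new trainable parameters per adapter, versus one frozen FFN layer of GPT-2 124M
adapter_params = (n_embd * n_hidden + n_hidden) + (n_hidden * n_embd + n_embd)
ffn_params = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
print(adapter_params, ffn_params)  # 49952 4722432
```

During fine-tuning only the adapter weights (and typically the task head) would receive gradient updates, while the pre-trained weights stay frozen, so each new task adds only a small number of stored parameters.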

Editor: 張燕妮  Source: 新智元
