Implementing a Large Language Model from Scratch: GPT-2 Instruction Fine-tuning
Related reading: The Annotated Transformer, extended annotation edition
The previous three articles implemented the Transformer, BERT, and GPT-2 pre-training, i.e. Stage 1 and Stage 2 in the figure above, and visualized the pre-training and inference process by printing the shapes of the intermediate data.
At this point GPT-2 can predict the next token, but it does not follow human instructions well. If we want it to translate when asked to translate, or summarize when asked to summarize, it still needs instruction fine-tuning.
In this article we instruction-fine-tune the GPT-2 model pre-trained earlier.
The figure below gives an overview of this article's content.
The complete code for the previous three articles and for this one is collected in a single repository; please read this article alongside the code.
https://github.com/AIDajiangtang/LLM-from-scratch
https://github.com/AIDajiangtang/LLM-from-scratch/blob/main/GPT2_instruction_finetuning_from_scratch.ipynb
0. Download the Training Data
import json
import os
import urllib.request


def download_and_load_file(file_path, url):
    if not os.path.exists(file_path):
        with urllib.request.urlopen(url) as response:
            text_data = response.read().decode("utf-8")
        with open(file_path, "w", encoding="utf-8") as file:
            file.write(text_data)
    else:
        with open(file_path, "r", encoding="utf-8") as file:
            text_data = file.read()

    with open(file_path, "r") as file:
        data = json.load(file)

    return data
file_path = "instruction-data.json"
url = "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch07/01_main-chapter-code/instruction-data.json"
data = download_and_load_file(file_path, url)
print("Number of entries:", len(data))
The instruction fine-tuning dataset contains 1,100 training examples; one of them is printed below.
{'instruction': 'Identify the correct spelling of the following word.', 'input': 'Ocassion', 'output': "The correct spelling is 'Occasion.'"}
Instruction fine-tuning is a supervised learning method. Each training example consists of an instruction, an input, and an output, which are formatted into a prompt; the figure below shows two common formatting styles.
In this article we use the Alpaca-style prompt format.
Apart from the training data, instruction fine-tuning is essentially the same as pre-training.
def format_input(entry):
    instruction_text = (
        f"Below is an instruction that describes a task. "
        f"Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )

    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""

    return instruction_text + input_text
Next, let's print one formatted sample.
model_input = format_input(data[50])
desired_response = f"\n\n### Response:\n{data[50]['output']}"
print(model_input + desired_response)
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Identify the correct spelling of the following word.
### Input:
Ocassion
### Response:
The correct spelling is 'Occasion.'
The input field is optional; here is a formatted sample without one.
### Instruction:
What is an antonym of 'complicated'?
### Response:
An antonym of 'complicated' is 'simple'.
Next, split the 1,100 examples into training, test, and validation sets.
train_portion = int(len(data) * 0.85) # 85% for training
test_portion = int(len(data) * 0.1) # 10% for testing
val_portion = len(data) - train_portion - test_portion # Remaining 5% for validation
train_data = data[:train_portion]
test_data = data[train_portion:train_portion + test_portion]
val_data = data[train_portion + test_portion:]
1. Prepare the Training Data
Preparing the training data involves the following steps.
First format the input, then convert it into token IDs.
import torch
from torch.utils.data import Dataset


class InstructionDataset(Dataset):
    def __init__(self, data, tokenizer):
        self.data = data

        # Pre-tokenize texts
        self.encoded_texts = []
        for entry in data:
            instruction_plus_input = format_input(entry)
            response_text = f"\n\n### Response:\n{entry['output']}"
            full_text = instruction_plus_input + response_text
            self.encoded_texts.append(
                tokenizer.encode(full_text)
            )

    def __getitem__(self, index):
        return self.encoded_texts[index]

    def __len__(self):
        return len(self.data)
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")
print(tokenizer.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))
[50256]
Divide the samples into batches. Within each batch, take the length of the longest sequence as the batch length and pad all other sequences to it, using <|endoftext|> (token ID 50256) as the padding token.
Shift the inputs one position to the right to construct the labels.
Finally, replace the padding tokens in the labels with -100 so that they are ignored when computing the loss, but keep one <|endoftext|> token as the end-of-sequence marker.
def custom_collate_fn(
    batch,
    pad_token_id=50256,
    ignore_index=-100,
    allowed_max_length=None,
    device="cpu"
):
    # Find the longest sequence in the batch
    batch_max_length = max(len(item)+1 for item in batch)

    # Pad and prepare inputs and targets
    inputs_lst, targets_lst = [], []

    for item in batch:
        new_item = item.copy()
        # Add an <|endoftext|> token
        new_item += [pad_token_id]
        # Pad sequences to max_length
        padded = new_item + [pad_token_id] * (batch_max_length - len(new_item))
        inputs = torch.tensor(padded[:-1])  # Truncate the last token for inputs
        targets = torch.tensor(padded[1:])  # Shift +1 to the right for targets

        # New: Replace all but the first padding tokens in targets by ignore_index
        mask = targets == pad_token_id
        indices = torch.nonzero(mask).squeeze()
        if indices.numel() > 1:
            targets[indices[1:]] = ignore_index

        # New: Optionally truncate to maximum sequence length
        if allowed_max_length is not None:
            inputs = inputs[:allowed_max_length]
            targets = targets[:allowed_max_length]

        inputs_lst.append(inputs)
        targets_lst.append(targets)

    # Convert list of inputs and targets to tensors and transfer to target device
    inputs_tensor = torch.stack(inputs_lst).to(device)
    targets_tensor = torch.stack(targets_lst).to(device)

    return inputs_tensor, targets_tensor
In practice, the label positions that do not belong to the response (i.e. the instruction and input part) are often also replaced with -100 so that they do not contribute to the loss.
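Below is a minimal sketch of what that could look like, assuming the length of the instruction-plus-input prefix is obtained with the same tokenizer; the helper name mask_instruction_tokens and its arguments are hypothetical and not part of the code used in this article.

def mask_instruction_tokens(targets, prefix_lengths, ignore_index=-100):
    # Hypothetical helper: for each sample, mask the label positions that
    # correspond to the instruction + input prefix, so that only the
    # "### Response:" part contributes to the loss.
    targets = targets.clone()
    for i, prefix_len in enumerate(prefix_lengths):
        # targets are shifted one position to the left relative to inputs,
        # so the first prefix_len - 1 label positions cover the prefix
        targets[i, :prefix_len - 1] = ignore_index
    return targets

# prefix_lengths could be computed per sample, for example as:
# prefix_lengths = [len(tokenizer.encode(format_input(entry))) for entry in batch_entries]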
from functools import partial
from torch.utils.data import DataLoader

num_workers = 0
batch_size = 8

torch.manual_seed(123)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Bind the device to the collate function defined above and cap sequences
# at the model's 1024-token context length
customized_collate_fn = partial(custom_collate_fn, device=device, allowed_max_length=1024)

train_dataset = InstructionDataset(train_data, tokenizer)
train_loader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    collate_fn=customized_collate_fn,
    shuffle=True,
    drop_last=True,
    num_workers=num_workers
)
With batch_size = 8, let's check the shapes of the inputs and labels in each batch.
print("Train loader:")
for inputs, targets in train_loader:
print(inputs.shape, targets.shape)
Train loader:
torch.Size([8, 61]) torch.Size([8, 61])
torch.Size([8, 76]) torch.Size([8, 76])
torch.Size([8, 73]) torch.Size([8, 73])
torch.Size([8, 68]) torch.Size([8, 68])
torch.Size([8, 65]) torch.Size([8, 65])
torch.Size([8, 72]) torch.Size([8, 72])
torch.Size([8, 80]) torch.Size([8, 80])
torch.Size([8, 67]) torch.Size([8, 67])
torch.Size([8, 62]) torch.Size([8, 62])
torch.Size([8, 75]) torch.Size([8, 75])
torch.Size([8, 62]) torch.Size([8, 62])
torch.Size([8, 68]) torch.Size([8, 68])
torch.Size([8, 67]) torch.Size([8, 67])
torch.Size([8, 77]) torch.Size([8, 77])
torch.Size([8, 69]) torch.Size([8, 69])
torch.Size([8, 79]) torch.Size([8, 79])
torch.Size([8, 71]) torch.Size([8, 71])
torch.Size([8, 66]) torch.Size([8, 66])
torch.Size([8, 83]) torch.Size([8, 83])
Print one sample to verify it is correct.
Input:
tensor([21106, 318, 281, 12064, 326, 8477, 257, 4876, 13, 19430,
257, 2882, 326, 20431, 32543, 262, 2581, 13, 198, 198,
21017, 46486, 25, 198, 30003, 6525, 262, 6827, 1262, 257,
985, 576, 13, 198, 198, 21017, 23412, 25, 198, 464,
5156, 318, 845, 13779, 13, 198, 198, 21017, 18261, 25,
198, 464, 5156, 318, 355, 13779, 355, 257, 4936, 13,
50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256],
device='cuda:0')
Labels:
tensor([ 318, 281, 12064, 326, 8477, 257, 4876, 13, 19430, 257,
2882, 326, 20431, 32543, 262, 2581, 13, 198, 198, 21017,
46486, 25, 198, 30003, 6525, 262, 6827, 1262, 257, 985,
576, 13, 198, 198, 21017, 23412, 25, 198, 464, 5156,
318, 845, 13779, 13, 198, 198, 21017, 18261, 25, 198,
464, 5156, 318, 355, 13779, 355, 257, 4936, 13, 50256,
-100, -100, -100, -100, -100, -100, -100, -100, -100],
device='cuda:0')
2. Load the Pre-trained Model
Load the GPT-2 pre-trained weights.
from gpt_download import download_and_load_gpt2
from previous_chapters import GPTModel, load_weights_into_gpt


BASE_CONFIG = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "drop_rate": 0.0,        # Dropout rate
    "qkv_bias": True         # Query-key-value bias
}

model_configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

CHOOSE_MODEL = "gpt2-medium (355M)"

BASE_CONFIG.update(model_configs[CHOOSE_MODEL])

model_size = CHOOSE_MODEL.split(" ")[-1].lstrip("(").rstrip(")")
settings, params = download_and_load_gpt2(model_size=model_size, models_dir="gpt2")

model = GPTModel(BASE_CONFIG)
load_weights_into_gpt(model, params)
model.eval();
Before starting instruction fine-tuning, let's first check how the pre-trained model performs.
torch.manual_seed(123)
input_text = format_input(val_data[0])
print(input_text)
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Convert the active sentence to passive: 'The chef cooks the meal every day.'
from previous_chapters import (
    generate,
    text_to_token_ids,
    token_ids_to_text
)

token_ids = generate(
    model=model,
    idx=text_to_token_ids(input_text, tokenizer),
    max_new_tokens=35,
    context_size=BASE_CONFIG["context_length"],
    eos_id=50256,
)
generated_text = token_ids_to_text(token_ids, tokenizer)

response_text = generated_text[len(input_text):].strip()
print(response_text)
### Response:
The chef cooks the meal every day.
### Instruction:
Convert the active sentence to passive: 'The chef cooks the
As the result shows, the pre-trained model does not follow the instruction: although it produces output, it simply repeats the input and the instruction.
3. Instruction Fine-tuning
As mentioned earlier, apart from how the training data is constructed, the instruction fine-tuning process is basically the same as pre-training.
3.1 Token Embeddings
Suppose the first batch of inputs X and labels has shape [8, 61]. These [8, 61] token IDs are converted into embeddings; with the hyperparameter "emb_dim": 768 (the shapes in this walkthrough use the gpt2-small configuration for illustration), the embedding layer outputs [8, 61, 768].
In addition, the attention computation itself does not take the relative positions of tokens into account, so a positional encoding is added to the token embeddings; the positional encoding has the same dimension as the token embedding.
The final output is an [8, 61, 768] embedding tensor.
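A minimal sketch of this step is shown below; the layer names tok_emb and pos_emb follow common GPT-2 implementations and are assumptions, not necessarily the exact attribute names inside the GPTModel used here.

import torch
import torch.nn as nn

vocab_size, emb_dim, context_length = 50257, 768, 1024
tok_emb = nn.Embedding(vocab_size, emb_dim)      # token embedding table
pos_emb = nn.Embedding(context_length, emb_dim)  # learned positional embeddings

input_ids = torch.randint(0, vocab_size, (8, 61))  # stand-in for one [8, 61] batch of token IDs
positions = torch.arange(input_ids.shape[1])       # 0 .. 60
x = tok_emb(input_ids) + pos_emb(positions)        # broadcast over the batch dimension
print(x.shape)                                     # torch.Size([8, 61, 768])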
3.2 TransformerBlock
According to the hyperparameter "n_layers": 12, the data passes through 12 TransformerBlock modules that share the same structure but have independent parameters.
A TransformerBlock is composed of MultiHeadAttention, FeedForward, and LayerNorm.
Let's look at how the data flows through these layers; a structural sketch follows below.
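The sketch below shows one way such a block can be wired together, using the pre-norm residual layout of GPT-2. Note that it substitutes PyTorch's built-in nn.MultiheadAttention and an nn.Sequential MLP for the article's own MultiHeadAttention and FeedForward classes from previous_chapters, so it only illustrates the structure.

import torch
import torch.nn as nn

class TransformerBlockSketch(nn.Module):
    """Illustrative wiring only; the real block uses the classes from previous_chapters."""
    def __init__(self, emb_dim=768, n_heads=12):
        super().__init__()
        self.norm1 = nn.LayerNorm(emb_dim)
        self.att = nn.MultiheadAttention(emb_dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(emb_dim)
        self.ff = nn.Sequential(
            nn.Linear(emb_dim, 4 * emb_dim),
            nn.GELU(),
            nn.Linear(4 * emb_dim, emb_dim),
        )

    def forward(self, x):                                   # x: [8, 61, 768]
        t = x.size(1)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.norm1(x)
        att_out, _ = self.att(h, h, h, attn_mask=causal, need_weights=False)
        x = x + att_out                                      # attention + residual
        x = x + self.ff(self.norm2(x))                       # feed-forward + residual
        return x                                             # [8, 61, 768]

blocks = nn.Sequential(*[TransformerBlockSketch() for _ in range(12)])  # 12 blocks, independent weights
print(blocks(torch.randn(8, 61, 768)).shape)                 # torch.Size([8, 61, 768])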
3.3 MultiHeadAttention
The input embeddings [8, 61, 768] are first transformed by three [768, 768] matrices to obtain q, k, and v, each of shape [8, 61, 768].
According to the hyperparameter "n_heads": 12, q, k, and v are reshaped to [8, 61, 12, 64] and then transposed to [8, 12, 61, 64]. Splitting the original 768-dimensional embeddings across 12 heads of 64 dimensions each is what implements multi-head attention.
Attention is then computed within each head, giving an attention-score matrix of shape [8, 12, 61, 61].
To prevent attending to future tokens, a [61, 61] upper-triangular mask is constructed whose entries above the diagonal are True. The corresponding positions in the attention-score matrix are set to negative infinity, so that after softmax they become effectively zero and attention to future positions is masked out.
self.register_buffer(
    'mask',
    torch.triu(torch.ones(
        context_length,  # 61
        context_length,  # 61
    ), diagonal=1)
)

mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

# Mask the attention scores
attention_scores.masked_fill_(mask_bool, -torch.inf)
The (softmax-normalized) attention matrix [8, 12, 61, 61] is then multiplied by the value matrix v [8, 12, 61, 64], producing [8, 12, 61, 64].
Finally, the outputs of the heads are transposed and reshaped back to [8, 61, 768], passed through a [768, 768] linear layer, and added to the block input via a residual connection, giving [8, 61, 768].
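The following shape trace reproduces these steps with raw tensor operations; it is a sketch of the computation, not the exact MultiHeadAttention implementation imported from previous_chapters.

import torch
import torch.nn as nn

b, t, d, n_heads = 8, 61, 768, 12
head_dim = d // n_heads                                        # 64
x = torch.randn(b, t, d)                                       # [8, 61, 768] embeddings

W_q, W_k, W_v, W_o = (nn.Linear(d, d) for _ in range(4))       # three projections + output layer
q = W_q(x).view(b, t, n_heads, head_dim).transpose(1, 2)       # [8, 12, 61, 64]
k = W_k(x).view(b, t, n_heads, head_dim).transpose(1, 2)       # [8, 12, 61, 64]
v = W_v(x).view(b, t, n_heads, head_dim).transpose(1, 2)       # [8, 12, 61, 64]

scores = q @ k.transpose(-2, -1) / head_dim ** 0.5             # [8, 12, 61, 61]
mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, -torch.inf)                  # hide future positions
weights = torch.softmax(scores, dim=-1)                        # [8, 12, 61, 61]

context = weights @ v                                          # [8, 12, 61, 64]
context = context.transpose(1, 2).reshape(b, t, d)             # merge heads -> [8, 61, 768]
out = x + W_o(context)                                         # output projection + residual
print(out.shape)                                               # torch.Size([8, 61, 768])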
3.4 LayerNorm
The purpose of LayerNorm is numerical stability; it does not change the shape, so both the input and the output of a LayerNorm layer are [8, 61, 768].
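A quick check of this behavior, using PyTorch's built-in nn.LayerNorm rather than the article's own LayerNorm implementation:

import torch
import torch.nn as nn

x = torch.randn(8, 61, 768)
layer_norm = nn.LayerNorm(768)
y = layer_norm(x)                    # normalizes each 768-dim embedding independently
print(y.shape)                       # torch.Size([8, 61, 768]) -- shape unchanged
print(round(y[0, 0].mean().item(), 4), round(y[0, 0].std().item(), 4))  # ~0 mean, ~1 std per token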
3.5 FeedForward
FeedForward is an MLP. The [8, 61, 768] output of the preceding LayerNorm is processed in parallel for all 8*61 token embeddings: each is first projected up to 4*768 dimensions and then back down to 768, with a GELU nonlinearity in between.
The MLP does not change the [8, 61, 768] shape, but its nonlinear transformation further refines the embedding values, improving the model's representational power and producing higher-level abstract features.
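A sketch of this structure with nn.Sequential, following the expand-GELU-project shape described above (the exact FeedForward class in previous_chapters may be implemented differently):

import torch
import torch.nn as nn

emb_dim = 768
feed_forward = nn.Sequential(
    nn.Linear(emb_dim, 4 * emb_dim),    # expand: 768 -> 3072
    nn.GELU(),                          # nonlinearity
    nn.Linear(4 * emb_dim, emb_dim),    # project back: 3072 -> 768
)

x = torch.randn(8, 61, emb_dim)
print(feed_forward(x).shape)            # torch.Size([8, 61, 768]) -- shape unchanged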
3.6 Output
The MLP output of the last block [8, 61, 768] first passes through a final LayerNorm.
Then the 8*61 tokens pass in parallel through an output linear layer [768, n_vocab], mapping [8, 61, 768] to [8, 61, n_vocab], where n_vocab is the vocabulary size.
In other words, each token position outputs a score (logit) for every word in the vocabulary; after softmax, these n_vocab values represent the probability that the next token is each of the n_vocab words.
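A minimal sketch of the final normalization and output projection described above; the names final_norm and out_head are assumptions, the real layers live inside GPTModel.

import torch
import torch.nn as nn

emb_dim, n_vocab = 768, 50257
final_norm = nn.LayerNorm(emb_dim)
out_head = nn.Linear(emb_dim, n_vocab, bias=False)         # the [768, n_vocab] output layer

h = torch.randn(8, 61, emb_dim)                            # output of the last TransformerBlock
logits = out_head(final_norm(h))                           # [8, 61, 50257]
next_token_probs = torch.softmax(logits[:, -1, :], dim=-1) # distribution over the next token
print(logits.shape, next_token_probs.shape)                # [8, 61, 50257] [8, 50257]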
3.7 Computing the Loss
During training, the loss is computed to update the parameters. How is the loss computed from the [8, 61, n_vocab] output?
The labels were already constructed when preparing the training data; they have the same shape as the input X, namely [8, 61].
def calc_loss_batch(input_batch, target_batch, model, device):
    """
    Calculates the loss for a single batch.
    """
    input_batch = input_batch.to(device)
    target_batch = target_batch.to(device)

    # Run the model
    logits = model(input_batch)
    print("target_batch loss")
    print(target_batch.flatten().shape)
    print("logits.flatten(0, 1)")
    print(logits.flatten(0, 1).shape)

    # Calculate the loss
    loss = torch.nn.functional.cross_entropy(
        logits.flatten(0, 1),
        target_batch.flatten(),
    )
    return loss
input_batch is the input X with shape [8, 61], and target_batch is the labels with shape [8, 61]. The input passes through the model to produce [8, 61, n_vocab]; flattening gives [488, n_vocab] for the logits and [488] for the labels, where each label element is an index into the vocabulary.
cross_entropy treats each label as a class index (equivalent to a one-hot target): it applies softmax to the logits and takes the negative log-likelihood of the correct token, averaging over the positions whose label is not -100 (the ignore_index).
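A small worked example confirms this behavior, including the -100 ignore_index used when building the labels (the numbers here are toy values):

import torch
import torch.nn.functional as F

# Toy case: vocabulary of 5 tokens, 3 flattened positions,
# the last position is padding and labelled with -100.
logits = torch.tensor([[2.0, 0.5, 0.1, 0.1, 0.1],
                       [0.1, 0.1, 3.0, 0.1, 0.1],
                       [1.0, 1.0, 1.0, 1.0, 1.0]])
targets = torch.tensor([0, 2, -100])

loss = F.cross_entropy(logits, targets)    # ignore_index defaults to -100
# Equivalent to averaging -log(softmax(logits)[i, targets[i]])
# over the positions whose target is not -100:
manual = -(F.log_softmax(logits[:2], dim=-1)[torch.arange(2), targets[:2]]).mean()
print(loss.item(), manual.item())          # the two values match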
This article is reposted from the WeChat official account 人工智能大講堂.
