Machine Learning | Building a Large Model from Scratch: The Attention Mechanism
Whether in machine learning, deep learning, or artificial intelligence, we are always looking for mechanisms that mimic the human brain, and the attention mechanism is inspired by it. When we read a book, the parts we find interesting leave a deep impression and we may reread those passages several times, while content we are not interested in gets skipped. This shows that the brain assigns weights when processing signals, and the attention mechanism imitates exactly this core capability.
1. What is the attention mechanism
Convolutional and fully connected networks only consider nonvolitional cues, simply attending to the most salient features. The attention mechanism instead considers all the cues and focuses on the most important ones among them (the general weighted-sum form is given after the figure below):
- The volitional cue is called the query.
- Each input is a pair of a value and a nonvolitional cue, the key.
- An attention pooling layer uses the query to bias the selection toward certain inputs.
(Figure: the attention mechanism)
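In its most general form (the standard textbook formulation, not something specific to the code in this article), attention pooling outputs a weighted sum of the values, with weights obtained by comparing the query against each key:

f(q) = \sum_{i=1}^{n} \alpha(q, k_i)\, v_i, \qquad \sum_{i=1}^{n} \alpha(q, k_i) = 1

where \alpha(q, k_i) is the attention weight assigned to the key-value pair (k_i, v_i).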
(1) Nonparametric attention
Nonparametric attention computes the attention weights from the query and the key-value pairs and returns a weighted average as the output. The simplest possible estimator just averages all the values:

f(x) = \frac{1}{n}\sum_{i=1}^{n} y_i

(Figure: fitted curve of the average-pooling estimator)

Here x acts as the query, and each training pair (x_i, y_i) provides a key x_i and a value y_i. Whatever x we feed in, the output is always the same average, so this design is clearly not reasonable. Nadaraya and Watson proposed a better idea:

f(x) = \sum_{i=1}^{n} \frac{K(x - x_i)}{\sum_{j=1}^{n} K(x - x_j)}\, y_i

The kernel K maps the distance between x and x_i to a weight: the closer a key x_i is to the given query x, the larger the weight given to its value y_i, and the farther away, the smaller. With a Gaussian kernel this reduces to a softmax over -\frac{1}{2}(x - x_i)^2, which is exactly what the code below computes, and it lets the model control how much attention each point receives. Its fitted curve:

(Figure: fitted curve of Nadaraya-Watson kernel regression)
The verification code for both estimators is given below (adapted from Dive into Deep Learning):
import torch
from torch import nn

n_train = 50  # number of training samples
test_len = 10
x_train, _ = torch.sort(torch.rand(n_train) * test_len)  # sorted training inputs

def f(x):
    return 2 * torch.sin(x) + x**0.8

y_train = f(x_train) + torch.normal(0.0, 0.5, (n_train,))  # noisy training outputs
x_test = torch.arange(0, test_len, 0.1)  # test inputs
y_truth = f(x_test)                      # ground-truth test outputs
n_test = len(x_test)                     # number of test samples
print(n_test)

# Average pooling: every query gets the same mean of all training outputs
y_hat = torch.repeat_interleave(y_train.mean(), n_test)
print(y_hat)

# Nadaraya-Watson kernel regression with a Gaussian kernel
# X_repeat shape: (n_test, n_train), each row repeats one test query
X_repeat = x_test.repeat_interleave(n_train).reshape((-1, n_train))
# attention_weights shape: (n_test, n_train), softmax over the training keys
attention_weights = nn.functional.softmax(-(X_repeat - x_train)**2 / 2, dim=1)
# Weighted average of the training values
y_hat = torch.matmul(attention_weights, y_train)
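Since the curve figures from the original post are not reproduced here, a minimal plotting sketch (assuming matplotlib is installed and reusing the variables defined above) can visualize the fit:

import matplotlib.pyplot as plt

# Plot the ground truth, the noisy training points, and the kernel-regression prediction
plt.plot(x_test, y_truth, label='truth')
plt.plot(x_test, y_hat, label='Nadaraya-Watson prediction')
plt.scatter(x_train, y_train, s=10, alpha=0.5, label='training samples')
plt.legend()
plt.show()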
(2) Parametric attention
The difference between the two is that parametric attention inserts a learnable parameter into the kernel (here a single scalar w rather than a full parameter matrix), whereas the nonparametric version uses a fixed kernel. The formula becomes:

f(x) = \sum_{i=1}^{n} \mathrm{softmax}\left(-\frac{1}{2}\big((x - x_i)\, w\big)^2\right) y_i

The parameter w has to be learned so that the fitted curve is as good as possible, and the simplest way to learn it is by training. The training code is as follows (adapted from Dive into Deep Learning):
class NWKernelRegression(nn.Module):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.w = nn.Parameter(torch.rand((1,), requires_grad=True))

    def forward(self, queries, keys, values):
        # queries and attention_weights have shape (#queries, #key-value pairs)
        queries = queries.repeat_interleave(keys.shape[1]).reshape((-1, keys.shape[1]))
        self.attention_weights = nn.functional.softmax(
            -((queries - keys) * self.w)**2 / 2, dim=1)
        # values has shape (#queries, #key-value pairs)
        return torch.bmm(self.attention_weights.unsqueeze(1),
                         values.unsqueeze(-1)).reshape(-1)

# For each training example, use every other training example as its keys/values
X_tile = x_train.repeat((n_train, 1))
Y_tile = y_train.repeat((n_train, 1))
keys = X_tile[(1 - torch.eye(n_train)).type(torch.bool)].reshape((n_train, -1))
values = Y_tile[(1 - torch.eye(n_train)).type(torch.bool)].reshape((n_train, -1))

net = NWKernelRegression()
loss = nn.MSELoss(reduction='none')
trainer = torch.optim.SGD(net.parameters(), lr=0.5)
for epoch in range(100):
    trainer.zero_grad()
    l = loss(net(x_train, keys, values), y_train)
    l.sum().backward()
    trainer.step()
    print(f'epoch {epoch + 1}, loss {float(l.sum()):.6f}')

# Prediction: every test query attends to all training key-value pairs
keys = x_train.repeat((n_test, 1))
values = y_train.repeat((n_test, 1))
y_hat = net(x_test, keys, values).unsqueeze(1).detach()
# Output:
epoch 1, loss 211.842209
epoch 2, loss 29.818174
epoch 3, loss 29.818165
epoch 4, loss 29.818157
epoch 5, loss 29.818144
...
Its fitted curve:
(Figure: prediction curve of the parametric kernel regression)
2. Multi-head attention
The previous sections showed how a query is combined with keys and values to produce an output, but if the whole sequence goes through a single attention pooling, the resulting fit may not be accurate enough. Multi-head attention addresses this by running several attention poolings in parallel:

h_i = f\big(W_i^{(q)} q,\; W_i^{(k)} k,\; W_i^{(v)} v\big), \quad i = 1, \ldots, h

\text{output} = W_o\,[h_1; h_2; \ldots; h_h]

Each W is a learnable parameter, so multiple (query, key, value) projections can be combined to produce outputs; the head outputs are concatenated and passed through a final fully connected layer to obtain the result. The design looks like this:

(Figure: multi-head attention)
Since this is used when building large models, the code is given here (adapted from Dive into Deep Learning):
from d2l import torch as d2l  # provides DotProductAttention (scaled dot-product attention)

def transpose_qkv(X, num_heads):
    """Reshape for parallel computation of multiple attention heads."""
    # Input X shape: (batch_size, #queries or #key-value pairs, num_hiddens)
    # After reshape: (batch_size, #queries or #key-value pairs,
    #                 num_heads, num_hiddens/num_heads)
    X = X.reshape(X.shape[0], X.shape[1], num_heads, -1)
    # After permute: (batch_size, num_heads, #queries or #key-value pairs,
    #                 num_hiddens/num_heads)
    X = X.permute(0, 2, 1, 3)
    # Final shape: (batch_size*num_heads, #queries or #key-value pairs,
    #               num_hiddens/num_heads)
    return X.reshape(-1, X.shape[2], X.shape[3])

def transpose_output(X, num_heads):
    """Reverse the transformation of transpose_qkv."""
    X = X.reshape(-1, num_heads, X.shape[1], X.shape[2])
    X = X.permute(0, 2, 1, 3)
    return X.reshape(X.shape[0], X.shape[1], -1)

class MultiHeadAttention(nn.Module):
    """Multi-head attention."""
    def __init__(self, key_size, query_size, value_size, num_hiddens,
                 num_heads, dropout, bias=False, **kwargs):
        super(MultiHeadAttention, self).__init__(**kwargs)
        self.num_heads = num_heads
        self.attention = d2l.DotProductAttention(dropout)  # scaled dot-product attention
        self.W_q = nn.Linear(query_size, num_hiddens, bias=bias)
        self.W_k = nn.Linear(key_size, num_hiddens, bias=bias)
        self.W_v = nn.Linear(value_size, num_hiddens, bias=bias)
        self.W_o = nn.Linear(num_hiddens, num_hiddens, bias=bias)

    def forward(self, queries, keys, values, valid_lens):
        # queries, keys, values shape:
        #   (batch_size, #queries or #key-value pairs, num_hiddens)
        # valid_lens shape: (batch_size,) or (batch_size, #queries)
        # After the transform, queries, keys, values have shape:
        #   (batch_size*num_heads, #queries or #key-value pairs, num_hiddens/num_heads)
        queries = transpose_qkv(self.W_q(queries), self.num_heads)
        keys = transpose_qkv(self.W_k(keys), self.num_heads)
        values = transpose_qkv(self.W_v(values), self.num_heads)
        if valid_lens is not None:
            # Along axis 0, copy the first item (scalar or vector) num_heads times,
            # then the second item, and so on.
            valid_lens = torch.repeat_interleave(
                valid_lens, repeats=self.num_heads, dim=0)
        # output shape: (batch_size*num_heads, #queries, num_hiddens/num_heads)
        output = self.attention(queries, keys, values, valid_lens)
        # output_concat shape: (batch_size, #queries, num_hiddens)
        output_concat = transpose_output(output, self.num_heads)
        return self.W_o(output_concat)
def attention_nhead():
    num_hiddens, num_heads = 100, 5
    attention = MultiHeadAttention(num_hiddens, num_hiddens, num_hiddens,
                                   num_hiddens, num_heads, 0.5)
    attention.eval()
    batch_size, num_queries = 2, 4
    num_kvpairs, valid_lens = 6, torch.tensor([3, 2])
    X = torch.ones((batch_size, num_queries, num_hiddens))
    Y = torch.ones((batch_size, num_kvpairs, num_hiddens))
    output = attention(X, Y, Y, valid_lens)
    print(output)

attention_nhead()
# Output:
tensor([[[-3.6907e-01, -1.1405e-04, 3.2671e-01, -1.7356e-01, -8.1225e-01,
-3.7096e-01, 2.7797e-01, -2.6977e-01, -2.5845e-01, -2.3081e-01,
3.0618e-01, 2.7673e-01, -2.6381e-01, -8.4385e-02, 6.8697e-01,
-3.0869e-01, -2.6311e-01, 3.3698e-01, 2.0350e-02, -1.1740e-01,
-2.9579e-01, -2.3887e-01, -1.3595e-01, 1.6481e-01, 3.6974e-01,
-1.2254e-01, -4.8702e-01, -3.3106e-01, 1.9889e-01, 4.6272e-04,
-3.0664e-01, 1.0336e-01, 1.5175e-01, 5.1327e-02, -1.7456e-01,
1.0848e-01, -2.1586e-01, -1.3530e-01, 1.4878e-01, 2.2182e-01,
-1.8205e-01, 4.2394e-02, -1.2869e-01, -6.1095e-02, 1.1372e-01,
-2.4854e-01, 9.8994e-02, -4.2462e-01, -1.9857e-02, -1.0072e-01,
7.6214e-01, 1.4569e-01, 2.4027e-01, -1.4111e-01, -3.5483e-01,
1.2154e-02, -4.0619e-01, -1.7395e-01, 1.2091e-02, 1.2583e-01,
4.5608e-01, -2.2189e-01, 1.1187e-01, -2.2936e-01, 2.6352e-01,
-2.1522e-02, 1.7198e-01, 2.4890e-01, -5.9914e-01, -3.3339e-01,
-5.0526e-03, 2.5246e-01, -5.5496e-02, 8.2241e-02, 2.3885e-01,
-6.4767e-02, 4.5753e-01, 1.4007e-01, 3.2348e-01, -2.9186e-01,
-2.0273e-01, 7.9331e-01, 2.4528e-01, -2.3202e-01, 6.0938e-01,
-3.4037e-01, -3.0914e-01, 2.0632e-01, -1.1952e-01, -1.4625e-01,
5.5157e-01, -1.5517e-01, 5.0877e-01, 1.9026e-01, -3.7252e-02,
-1.7278e-01, -2.9345e-01, -1.2168e-01, 1.7021e-01, 7.7886e-01],
...
From the code alone, the shape changes from input to output at each step are not easy to follow, so here is an explanatory diagram found online:

(Figure: shape changes inside multi-head attention)
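If the diagram is unavailable, a minimal sketch (reusing transpose_qkv and transpose_output defined above, with the same example sizes as attention_nhead) prints the shape at each stage:

# Trace the tensor shapes through the head-splitting and merging helpers
batch_size, num_queries, num_hiddens, num_heads = 2, 4, 100, 5
X = torch.ones((batch_size, num_queries, num_hiddens))
print(X.shape)    # torch.Size([2, 4, 100])  -> (batch_size, #queries, num_hiddens)
Xh = transpose_qkv(X, num_heads)
print(Xh.shape)   # torch.Size([10, 4, 20])  -> (batch_size*num_heads, #queries, num_hiddens/num_heads)
Xo = transpose_output(Xh, num_heads)
print(Xo.shape)   # torch.Size([2, 4, 100])  -> back to the original layout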
3. An attention acceleration library: FlashAttention
Let us examine the computational complexity of the attention mechanism above:
(1) For an input sequence of length n, each position is encoded as a d-dimensional vector, so the query matrix Q, the key matrix K and the value matrix V each have shape n x d.
(2) Linear projections: the input sequence is transformed into Q, K and V; with an embedding dimension of d per token, this costs O(n·d^2).
(3) Attention scores: computing the attention weights compares every position with every other position, which costs O(n^2·d).
(4) The final weighted sum over the values costs another O(n^2·d).
The total time complexity is therefore O(n^2·d), which grows quadratically with the input length. To speed this up, PyTorch 2.0 and later integrate FlashAttention.
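As a rough, hypothetical example of why the quadratic term matters: with n = 4096 tokens, the attention weight matrix alone has n^2 ≈ 16.8 million entries per head, and that cost quadruples every time the context length doubles.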
Since the principle behind FlashAttention's speedup is fairly involved (it is an attention optimization built on GPU hardware characteristics), it is not covered here (a separate write-up will follow once the material is organized); only its usage is shown:
import torch
from torch import nn
import torch.nn.functional as F

def attention_scaled_dot_product():
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Random (batch, seq_len, head_dim) tensors for query, key and value
    query, key, value = (torch.randn(2, 3, 8, device=device),
                         torch.randn(2, 3, 8, device=device),
                         torch.randn(2, 3, 8, device=device))
    # scaled_dot_product_attention automatically dispatches to an optimized
    # backend (FlashAttention, memory-efficient attention, or the math fallback)
    output = F.scaled_dot_product_attention(query, key, value)
    print(output)

attention_scaled_dot_product()
# Output:
tensor([[[-0.2161, 0.1339, 0.0048, -0.4695, -0.9136, -0.6143, 0.7153,
-0.5775],
[ 0.0440, 0.3198, 0.3169, -0.4145, -0.6033, -0.4155, 0.4611,
-0.3980],
[-0.0162, 0.3195, 0.3146, -0.4202, -0.6638, -0.4621, 0.6024,
-0.4443]],
[[ 0.6024, -0.3102, -0.2522, -1.0542, 0.6863, 0.5142, 1.6795,
0.1051],
[ 0.5328, -0.4506, -0.3581, -1.1292, 1.0069, 0.3114, 1.9865,
-0.0842],
[-1.1632, -1.6378, 0.7211, 1.0084, 0.0335, 1.1377, 1.3419,
-1.2655]]])
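Depending on the PyTorch version, the backend can also be pinned explicitly. Below is a minimal sketch (reusing query/key/value tensors like those created above) based on the 2.0/2.1-era torch.backends.cuda.sdp_kernel context manager; it only works on CUDA, and newer releases expose torch.nn.attention.sdpa_kernel instead:

# Force the FlashAttention backend and disable the other implementations;
# this errors out if FlashAttention cannot handle the given inputs.
with torch.backends.cuda.sdp_kernel(enable_flash=True,
                                    enable_math=False,
                                    enable_mem_efficient=False):
    output = F.scaled_dot_product_attention(query, key, value)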
References
(1) Aliyun notebook: https://tianchi.aliyun.com/lab/home?notebookLabId=793630&notebookId=872766
(2) Dive into Deep Learning (《動手學深度學習》): https://zh.d2l.ai/chapter_attention-mechanisms/bahdanau-attention.html
(3) https://www.cvmart.net/community/detail/8302