Knowing TF and PyTorch Is Not Enough: How to Move from PyTorch to JAX, the Automatic Differentiation Powerhouse

Author: compiled by 机器之心 (Synced), 2020-05-15 08:18:51

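Any discussion of today's deep learning frameworks inevitably comes around to TensorFlow and PyTorch. Beyond these two, however, some newcomers deserve attention, and JAX is one of them. It supports both forward-mode and reverse-mode automatic differentiation and is very good at computing higher-order derivatives. How usable is this up-and-coming framework? How can it be used to expose the gradient updates and backpropagation inside a neural network? This article is a tutorial that explains the logic underlying JAX, to make migrating from PyTorch and similar frameworks easier.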
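
To make the claim about forward mode, reverse mode, and higher-order derivatives concrete, here is a minimal sketch. The functions `f` and `g` are illustrative choices, not taken from the original post; `jax.grad`, `jax.jacfwd`, and `jax.jacrev` are the JAX transformations being demonstrated.

```python
import jax
import jax.numpy as jnp

def f(x):
    # Illustrative scalar function with easy-to-check derivatives: f(x) = x**3
    return x ** 3

# Higher-order derivatives are obtained by nesting jax.grad.
d1 = jax.grad(f)    # 3 * x**2
d2 = jax.grad(d1)   # 6 * x
d3 = jax.grad(d2)   # 6

x = 2.0
first, second, third = d1(x), d2(x), d3(x)

def g(v):
    # Illustrative vector-valued function R^2 -> R^2
    return jnp.array([v[0] * v[1], jnp.sin(v[0])])

v = jnp.array([1.0, 2.0])
jac_fwd = jax.jacfwd(g)(v)   # Jacobian via forward-mode differentiation
jac_rev = jax.jacrev(g)(v)   # Jacobian via reverse-mode differentiation
```

Both modes should produce the same 2x2 Jacobian; which one is cheaper depends on the number of inputs relative to the number of outputs.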

JAX is a Python library developed by Google for machine learning and numerical computing. From its release, JAX has described itself as a Python+NumPy package that can differentiate, vectorize, and JIT-compile programs to run on GPUs and TPUs. In short, it is NumPy on accelerators, with automatic differentiation. Researchers such as Skye Wanderman-Milne presented JAX at last year's NeurIPS 2019 conference.
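
The three features named above (differentiation, vectorization, JIT compilation) can be sketched in a few lines. The `predict` function, its weights, and the batch below are hypothetical and only serve the illustration.

```python
import jax
import jax.numpy as jnp

def predict(w, x):
    # NumPy-style code: a single tanh unit with one weight vector
    return jnp.tanh(jnp.dot(w, x))

w = jnp.array([0.5, -1.0, 2.0])
xs = jnp.ones((4, 3))   # a batch of 4 inputs with 3 features each

# Differentiate: gradient of the scalar output with respect to w
grad_w = jax.grad(predict)(w, xs[0])

# Vectorize: map predict over the leading axis of xs while keeping w fixed
batched = jax.vmap(predict, in_axes=(None, 0))(w, xs)

# JIT-compile: the same computation, compiled through XLA
fast = jax.jit(jax.vmap(predict, in_axes=(None, 0)))(w, xs)
```

The transformations compose freely, which is why `jax.jit(jax.vmap(...))` works without changing `predict` itself.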

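But for developers already comfortable with PyTorch or TensorFlow 2.X, moving to JAX is a big change: the frameworks differ fundamentally in how they build computations and perform backpropagation. PyTorch builds up a computation graph during the forward pass. Calling backward() on a result node then propagates through that graph, attaching to each intermediate node the gradient of the result with respect to it.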

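JAX is different: you express the computation as a Python function and transform it with grad() into a gradient function that you can evaluate like the original. Instead of returning the result, it returns the gradient of the result. The original post contrasts the two approaches in its intro figure.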

This changes how you write programs and build models. In PyTorch you rely on tape-based automatic differentiation and stateful objects. JAX may come as a surprise, because grad() treats differentiation itself as a function transformation.

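Perhaps you have already decided to look at JAX-based libraries such as flax, trax, or haiku. Reading their examples, such as ResNet, you will notice that the code looks different from other frameworks. Beyond defining layers and running training, what is the underlying logic? How do these small NumPy programs train a huge architecture?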

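The original post is a tutorial on building models in JAX; 机器之心 (Synced) has excerpted two parts of it:

A quick recap of an LSTM-LM implementation in PyTorch;
A look at PyTorch-style code (based on mutating state), and how models are built from pure functions instead (JAX).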

An LSTM language model in PyTorch

We start by implementing an LSTM language model in PyTorch. Here is the code:

import torch 
class LSTMCell(torch.nn.Module):  
    def __init__(self, in_dim, out_dim):  
        super(LSTMCell, self).__init__()  
        self.weight_ih = torch.nn.Parameter(torch.rand(4*out_dim, in_dim))  
        self.weight_hh = torch.nn.Parameter(torch.rand(4*out_dim, out_dim))  
        self.bias = torch.nn.Parameter(torch.zeros(4*out_dim,))   
 
    def forward(self, inputs, h, c):  
        ifgo = self.weight_ih @ inputs + self.weight_hh @ h + self.bias  
        i, f, g, o = torch.chunk(ifgo, 4)  
        i = torch.sigmoid(i)  
        f = torch.sigmoid(f)  
        g = torch.tanh(g)  
        o = torch.sigmoid(o)  
        new_c = f * c + i * g  
        new_h = o * torch.tanh(new_c)  
        return (new_h, new_c)

Next, we build a single-layer network on top of this LSTM cell. It has an embedding layer and a learnable initial state (h, c)_0, which together show how individual parameters change.

class LSTMLM(torch.nn.Module):  
    def __init__(self, vocab_size, dim=17):  
        super().__init__()  
        self.cell = LSTMCell(dim, dim)  
        self.embeddings = torch.nn.Parameter(torch.rand(vocab_size, dim))  
        self.c_0 = torch.nn.Parameter(torch.zeros(dim)) 
 
    @property  
    def hc_0(self):  
        return (torch.tanh(self.c_0), self.c_0) 
 
    def forward(self, seq, hc):
        loss = torch.tensor(0.)
        for idx in seq:
            loss -= torch.log_softmax(self.embeddings @ hc[0], dim=-1)[idx]
            hc = self.cell(self.embeddings[idx,:], *hc)
        return loss, hc
 
    def greedy_argmax(self, hc, length=6):  
        with torch.no_grad():  
            idxs = []  
            for i in range(length):  
                idx = torch.argmax(self.embeddings @ hc[0])  
                idxs.append(idx.item())  
                hc = self.cell(self.embeddings[idx,:], *hc)  
        return idxs

With the model built, we train it:

torch.manual_seed(0) 
# As training data, we will have indices of words/wordpieces/characters, 
# we just assume they are tokenized and integerized (toy example obviously). 
vocab_size = 43 # prime trick! :)
training_data = torch.tensor([4, 8, 15, 16, 23, 42])

lm = LSTMLM(vocab_size=vocab_size)
print("Sample before:", lm.greedy_argmax(lm.hc_0)) 
 
bptt_length = 3 # to illustrate hc.detach-ing 
 
for epoch in range(101):  
    hc = lm.hc_0  
    totalloss = 0.  
    for start in range(0, len(training_data), bptt_length):  
        batch = training_data[start:start+bptt_length]  
        loss, (h, c) = lm(batch, hc)  
        hc = (h.detach(), c.detach())  
        if epoch % 50 == 0:  
            totalloss += loss.item()  
        loss.backward()  
        for name, param in lm.named_parameters():  
            if param.grad is not None:  
                param.data -= 0.1 * param.grad  
                del param.grad  
    if totalloss:
        print("Loss:", totalloss)
          
print("Sample after:", lm.greedy_argmax(lm.hc_0)) 
Sample before: [42, 34, 34, 34, 34, 34] 
Loss: 25.953862190246582 
Loss: 3.7642268538475037 
Loss: 1.9537211656570435 
Sample after: [4, 8, 15, 16, 23, 42]

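As you can see, the PyTorch code is already fairly clear, but some issues remain. However careful I am, I still have to keep an eye on the number of nodes in the computation graph. Those intermediate nodes need to be cleared at the right time, which is what the hc.detach() calls above are for.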

Pure functions

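To understand how JAX handles this problem, we first need the concept of a pure function. If you have done functional programming before, the following will be familiar: a pure function is like a function or formula in mathematics. It defines how to obtain output values from some input values. Importantly, it has no "side effects": no part of the function accesses or mutates any global state.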

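Code written in PyTorch is full of intermediate variables and state, and that state changes frequently, which makes reasoning and optimization very tricky. JAX therefore chooses to restrict programmers to pure functions, so that this cannot happen.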
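
One practical reason for this restriction shows up with `jax.jit`: Python side effects run only while JAX traces the function, not on later calls that reuse the compiled version. The sketch below uses a hypothetical global counter, `trace_count`, purely for illustration.

```python
import jax
import jax.numpy as jnp

trace_count = 0   # hypothetical global state, for illustration only

@jax.jit
def impure_double(x):
    global trace_count
    trace_count += 1   # side effect: executed only while the function is traced
    return 2 * x

a = impure_double(jnp.array(1.0))   # first call: traced and compiled
b = impure_double(jnp.array(2.0))   # same shape and dtype: compiled version is reused
```

After both calls the counter has been incremented once, not twice, so the function's visible behavior no longer matches what the Python source suggests. A pure function cannot run into this mismatch.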

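Before going deeper into JAX, let's look at a few examples of pure functions. A pure function must satisfy the following conditions:

In what context and at what time you execute the function should not affect the output: as long as the inputs are the same, the output is the same;
Whether we executed the function 0 times, once, or many times should be indistinguishable after the fact.

Each of the impure functions below violates at least one of these conditions: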

import random 
import time 
nr_executions = 0 
 
def pure_fn_1(x):  
    return 2 * x 
     
def pure_fn_2(xs):  
    ys = []  
    for x in xs:  
        # Mutating stateful variables *inside* the function is fine!  
        ys.append(2 * x)  
    return ys 
 
def impure_fn_1(xs):  
    # Mutating arguments has lasting consequences outside the function! :(  
    xs.append(sum(xs))  
    return xs 
 
def impure_fn_2(x):  
    # Very obviously mutating global state is bad...
    global nr_executions
    nr_executions += 1
    return 2 * x 
 
def impure_fn_3(x):  
    # ...but just accessing it is, too, because now the function depends on the  
    # execution context!  
    return nr_executions * x 
 
def impure_fn_4(x):  
    # Things like IO are classic examples of impurity.  
    # All three of the following lines are violations of purity:  
    print("Hello!")  
    user_input = input()  
    execution_time = time.time()  
    return 2 * x 
 
def impure_fn_5(x):  
    # Which constraint does this violate? Both, actually! You access the current  
    # state of randomness *and* advance the number generator!  
    p = random.random()  
    return p * x 
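
The impurity of `impure_fn_5` can be avoided by making the random state an explicit argument, which is how `jax.random` is designed. This is a minimal sketch; the function name `pure_random_scale` is made up for this example.

```python
import jax

def pure_random_scale(key, x):
    # The random state is an explicit input, so the output depends only on (key, x).
    p = jax.random.uniform(key)
    return p * x

key = jax.random.PRNGKey(0)
# Derive a fresh subkey instead of advancing a hidden global generator.
key, subkey = jax.random.split(key)

first = pure_random_scale(subkey, 3.0)
again = pure_random_scale(subkey, 3.0)   # same key and input give the same output
```

Calling the function twice with the same key returns the same value, so both purity conditions hold.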

Let's see a pure function that JAX operates on: the example from the intro figure.
 
# (almost) 1-D linear regression 
def f(w, x):  
    return w * x 
 
print(f(13., 42.)) 
546.0

So far, nothing surprising. JAX now lets you transform this function into another function that, instead of returning the result, returns the gradient of the result with respect to the function's first argument.

import jax 
import jax.numpy as jnp 
 
# Gradient: with respect to weights! JAX uses the first argument by default. 
df_dw = jax.grad(f) 
 
def manual_df_dw(w, x):  
    return x 
     
assert df_dw(13., 42.) == manual_df_dw(13., 42.) 
 
print(df_dw(13., 42.)) 
42.0
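
If both the result and its gradient are needed, or the gradient with respect to a different argument, `jax.value_and_grad` and the `argnums` parameter of `jax.grad` cover those cases. A short sketch using the same `f`:

```python
import jax

def f(w, x):
    return w * x

# Result and gradient with respect to the first argument (w) in one call
value, dw = jax.value_and_grad(f)(13., 42.)

# Gradient with respect to the second argument (x) instead
dx = jax.grad(f, argnums=1)(13., 42.)
```

Here `value` is f(13, 42), `dw` is the derivative with respect to `w` (which equals `x`), and `dx` is the derivative with respect to `x` (which equals `w`).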

Everything so far you have probably seen in the JAX README, and it all makes sense. But how do we get from here to big modules like the ones in the PyTorch code?

First, let's add a bias term and try to wrap the one-dimensional linear regressor in an object of the kind we are used to, a sort of linear regression "layer" (LinearRegressor):

class LinearRegressor():  
    def __init__(self, w, b):
        self.w = w
        self.b = b
     
    def predict(self, x):  
        return self.w * x + self.b  
         
    def rms(self, xs: jnp.ndarray, ys: jnp.ndarray):  
        return jnp.sqrt(jnp.sum(jnp.square(self.w * xs + self.b - ys))) 
         
my_regressor = LinearRegressor(13., 0.) 
 
# A kind of loss function, used for training
xs = jnp.array([42.0]) 
ys = jnp.array([500.0]) 
print(my_regressor.rms(xs, ys)) 
 
# Prediction for test data 
print(my_regressor.predict(42.)) 
46.0 
546.0

How do we now use gradients for training? We need a pure function that takes the model weights as input arguments, perhaps like this:

def loss_fn(w, b, xs, ys):  
    my_regressor = LinearRegressor(w, b)  
    return my_regressor.rms(xs=xs, ys=ys)
     
# We use argnums=(0, 1) to tell JAX to give us 
# gradients wrt first and second parameter. 
grad_fn = jax.grad(loss_fn, argnums=(0, 1)) 
 
print(loss_fn(13., 0., xs, ys)) 
print(grad_fn(13., 0., xs, ys)) 
46.0 
(DeviceArray(42., dtype=float32), DeviceArray(1., dtype=float32))
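
The printed gradient can be checked by hand. With a single data point, the rms loss reduces to an absolute value, so for x = 42, y = 500, w = 13, b = 0:

```latex
\begin{aligned}
L(w, b) &= \sqrt{(w x + b - y)^2} = \lvert w x + b - y \rvert \\
w x + b - y &= 13 \cdot 42 + 0 - 500 = 46 > 0 \\
\frac{\partial L}{\partial w} &= \operatorname{sign}(w x + b - y)\, x = 42,
\qquad
\frac{\partial L}{\partial b} = \operatorname{sign}(w x + b - y) = 1
\end{aligned}
```

These match the loss of 46.0 and the gradient pair (42., 1.) shown in the output.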

Convince yourself that this is correct. This works, but clearly, enumerating every parameter in the signature of loss_fn is not feasible.

Fortunately, JAX can differentiate not only with respect to scalars, vectors, and matrices, but also with respect to many tree-like data structures. Such a structure is called a pytree, and Python dicts are one example:

def loss_fn(params, xs, ys):  
    my_regressor = LinearRegressor(params['w'], params['b'])  
    return my_regressor.rms(xs=xs, ys=ys)
 
grad_fn = jax.grad(loss_fn) 
 
print(loss_fn({'w': 13., 'b': 0.}, xs, ys)) 
print(grad_fn({'w': 13., 'b': 0.}, xs, ys)) 
46.0 
{'b': DeviceArray(1., dtype=float32), 'w': DeviceArray(42., dtype=float32)}

This already looks much nicer! We could write a training loop like this:

params = {'w': 13., 'b': 0.} 
 
for _ in range(15):  
    print(loss_fn(params, xs, ys))  
    grads = grad_fn(params, xs, ys)  
    for name in params.keys():  
        params[name] -= 0.002 * grads[name] 
         
# Now, predict: 
LinearRegressor(params['w'], params['b']).predict(42.) 
46.0 
42.47003 
38.940002 
35.410034 
31.880066 
28.350098 
24.820068 
21.2901 
17.760132 
14.230164 
10.700165 
7.170166 
3.6401978 
0.110198975 
3.4197998 
DeviceArray(500.1102, dtype=float32)
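
The flat dict used above is the simplest kind of pytree. Nested dicts, lists, and tuples work the same way, and the gradient comes back with the same nesting as the parameters. The two-"layer" parameter tree and loss below are hypothetical and only illustrate the structure.

```python
import jax
import jax.numpy as jnp

# Hypothetical nested parameter tree: dicts and lists are both pytree containers.
params = {
    'layer1': {'w': jnp.ones((2, 3)), 'b': jnp.zeros(2)},
    'layer2': [jnp.ones(2), 0.5],
}

def loss_fn(params, x):
    h = jnp.tanh(params['layer1']['w'] @ x + params['layer1']['b'])
    scale, offset = params['layer2']
    return jnp.sum(scale * h) + offset

# Flat list of all leaves (arrays and scalars) in the tree
leaves = jax.tree_util.tree_leaves(params)

# The gradient has the same nesting as params
grads = jax.grad(loss_fn)(params, jnp.ones(3))

same_structure = (jax.tree_util.tree_structure(grads)
                  == jax.tree_util.tree_structure(params))
```

Because `grads` mirrors `params` leaf for leaf, an update rule only has to be written once per leaf rather than once per named parameter.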

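Note that we can already use further JAX helpers to perform the update: since the parameters and the gradients share a common (tree-like) structure, we can imagine laying one on top of the other and creating a new tree whose value at every position is a "combination" of the two trees, as follows: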

def update_combiner(param, grad, lr=0.002):  
    return param - lr * grad 
     
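# Older JAX releases called this jax.tree_multimap; current releases use
# jax.tree_util.tree_map, which accepts several trees.
params = jax.tree_util.tree_map(update_combiner, params, grads)
# instead of:
# for name in params.keys():
#     params[name] -= 0.002 * grads[name]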
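
Putting the pieces together, the training loop can be written as a pure update step that returns new parameters instead of mutating old ones. This is a sketch under a few assumptions: the helper name `train_step` is made up here, the loss is written without the LinearRegressor wrapper, and `jax.jit` and `jax.value_and_grad` are added for convenience rather than taken from the excerpt.

```python
import jax
import jax.numpy as jnp

def loss_fn(params, xs, ys):
    # Same rms loss as above, without the LinearRegressor wrapper
    return jnp.sqrt(jnp.sum(jnp.square(params['w'] * xs + params['b'] - ys)))

@jax.jit
def train_step(params, xs, ys, lr=0.002):
    loss, grads = jax.value_and_grad(loss_fn)(params, xs, ys)
    # Combine the two trees leaf by leaf; returns a new tree and mutates nothing
    new_params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
    return new_params, loss

xs = jnp.array([42.0])
ys = jnp.array([500.0])
params = {'w': 13., 'b': 0.}

losses = []
for _ in range(15):
    params, loss = train_step(params, xs, ys)
    losses.append(float(loss))

prediction = params['w'] * 42. + params['b']
```

The loop rebinds `params` to the returned tree on every iteration, so all state changes are visible at the call site, which is the pure-function style the article describes.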

Reference: https://sjmielke.com/jax-purify.htm

[This article is an original translation by 机器之心 (Synced), a 51CTO column contributor; WeChat official account: 机器之心 (id: almosthuman2014)]

Editor: 趙寧寧 (Zhao Ningning). Source: 51CTO column
