Training a Recurrent Neural Network Language Model with TensorFlow
After spending the better part of an afternoon reading the TensorFlow Recurrent Neural Network tutorial and its PTB implementation, I found the official code rather hard to follow. So, borrowing parts of it, I wrote a simplified language model, with the overall approach inspired by Keras's LSTM text generation example.
Code: Github
Please credit Gaussic when reposting.
Language Model
The main idea of a language model is: given the preceding words, infer the most likely next word. For example, given The fat cat sat on the, we judge the next word to be mat rather than hat, because a cat is more likely to sit on a mat than on a hat.

You may regard this as common sense, but in natural language processing the task can be described with a probabilistic model. Take The fat cat sat on the mat as an example. We can estimate the probability of the first word, $p(\text{The})$, the conditional probability that fat follows The, $p(\text{fat}\mid\text{The})$, and from them the joint probability of The fat occurring together:
$$p(\text{The},\text{fat}) = p(\text{The}) \cdot p(\text{fat}\mid\text{The})$$

Similarly, the joint probability of The fat cat is

$$p(\text{The},\text{fat},\text{cat}) = p(\text{The}) \cdot p(\text{fat}\mid\text{The}) \cdot p(\text{cat}\mid\text{The},\text{fat})$$

and, conversely, a conditional probability can be recovered from joint probabilities:

$$p(\text{cat}\mid\text{The},\text{fat}) = \frac{p(\text{The},\text{fat},\text{cat})}{p(\text{The},\text{fat})}$$

By the chain rule, the probability of a whole sentence $S = w_1 w_2 \cdots w_n$ is

$$p(S) = p(w_1,w_2,\cdots,w_n) = p(w_1) \cdot p(w_2\mid w_1) \cdot p(w_3\mid w_1,w_2) \cdots p(w_n\mid w_1,w_2,\cdots,w_{n-1})$$

Conditioning on the full history is impractical to estimate, so n-gram models make a Markov assumption and condition each word only on the previous $n-1$ words. With $n=2$ (a bigram model):

$$p(S) = p(w_1,w_2,\cdots,w_n) = p(w_1) \cdot p(w_2\mid w_1) \cdot p(w_3\mid w_2) \cdot p(w_4\mid w_3) \cdots p(w_n\mid w_{n-1})$$

With $n=3$ (a trigram model):

$$p(S) = p(w_1,w_2,\cdots,w_n) = p(w_1) \cdot p(w_2\mid w_1) \cdot p(w_3\mid w_1,w_2) \cdot p(w_4\mid w_2,w_3) \cdots p(w_n\mid w_{n-2},w_{n-1})$$
Conditional probabilities truncated in this way are easy to estimate, but they discard much of the information carried by earlier words, which can sometimes hurt the results. The challenge is therefore to choose an effective n that keeps the computation simple while preserving most of the contextual information.

All of the above describes traditional language models. Without digging into the details, our task is: given the previous n words, compute the probability distribution of the next word, and use the language model to generate new text.

In this post we are more interested in how to use an RNN to predict the next word.
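To make the counting view above concrete, here is a minimal sketch (not part of the original code) that estimates bigram probabilities $p(w_i \mid w_{i-1})$ from a toy corpus by simple counting; the corpus and the bigram_probs name are made up for illustration:

```python
from collections import Counter, defaultdict

def bigram_probs(tokens):
    """Estimate p(next | prev) by counting bigrams and unigrams."""
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens, tokens[1:]))
    probs = defaultdict(dict)
    for (prev, nxt), c in bigram.items():
        probs[prev][nxt] = c / unigram[prev]
    return probs

corpus = "the fat cat sat on the mat the fat cat sat on the hat".split()
probs = bigram_probs(corpus)
print(probs['the'])   # {'fat': 0.5, 'mat': 0.25, 'hat': 0.25}
```

An RNN replaces these explicit counts with a learned hidden state that summarizes the entire preceding context.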
Download and unpack the PTB dataset used by the TensorFlow tutorial:

```bash
$ wget http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz
$ tar xvf simple-examples.tgz
```
The data files under simple-examples/data contain one sentence per line; rare words are already replaced with <unk> and numbers with N. A few lines from the training set:

```
we 're talking about years ago before anyone heard of asbestos having any questionable properties
there is no asbestos in our products now
neither <unk> nor the researchers who studied the workers were aware of any research on smokers of the kent cigarettes
we have no useful information on whether users are at risk said james a. <unk> of boston 's <unk> cancer institute
the total of N deaths from malignant <unk> lung cancer and <unk> was far higher than expected the researchers said
```
First, read a file, replace newlines with the <eos> end-of-sentence marker, and split the text into a list of words:

```python
import os
from collections import Counter

import numpy as np
import tensorflow as tf


def _read_words(filename):
    with open(filename, 'r', encoding='utf-8') as f:
        return f.read().replace('\n', '<eos>').split()


f = _read_words('simple-examples/data/ptb.train.txt')
print(f[:20])
```

```
['aer', 'banknote', 'berlitz', 'calloway', 'centrust', 'cluett', 'fromstein', 'gitano', 'guterman', 'hydro-quebec', 'ipo', 'kia', 'memotec', 'mlx', 'nahb', 'punts', 'rake', 'regatta', 'rubens', 'sim']
```
Next, build the vocabulary: count word frequencies, sort by decreasing frequency, and map each word to an integer id:

```python
def _build_vocab(filename):
    data = _read_words(filename)

    counter = Counter(data)
    count_pairs = sorted(counter.items(), key=lambda x: -x[1])

    words, _ = list(zip(*count_pairs))
    word_to_id = dict(zip(words, range(len(words))))

    return words, word_to_id


words, words_to_id = _build_vocab('simple-examples/data/ptb.train.txt')
print(words[:10])
print(list(map(lambda x: words_to_id[x], words[:10])))
```

```
('the', '<unk>', '<eos>', 'N', 'of', 'to', 'a', 'in', 'and', "'s")
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```
With the vocabulary in place, a whole file can be converted into a list of word ids (words not in the vocabulary are skipped):

```python
def _file_to_word_ids(filename, word_to_id):
    data = _read_words(filename)
    return [word_to_id[x] for x in data if x in word_to_id]


words_in_file = _file_to_word_ids('simple-examples/data/ptb.train.txt', words_to_id)
print(words_in_file[:20])
```

```
[9980, 9988, 9981, 9989, 9970, 9998, 9971, 9979, 9992, 9997, 9982, 9972, 9993, 9991, 9978, 9983, 9974, 9986, 9999, 9990]
```
A small helper converts a sequence of ids back into words:

```python
def to_words(sentence, words):
    return list(map(lambda x: words[x], sentence))
```
ptb_raw_data() ties the above together: it builds the vocabulary from the training set and converts the training, validation, and test files into id lists:

```python
def ptb_raw_data(data_path=None):
    train_path = os.path.join(data_path, 'ptb.train.txt')
    valid_path = os.path.join(data_path, 'ptb.valid.txt')
    test_path = os.path.join(data_path, 'ptb.test.txt')

    words, word_to_id = _build_vocab(train_path)

    train_data = _file_to_word_ids(train_path, word_to_id)
    valid_data = _file_to_word_ids(valid_path, word_to_id)
    test_data = _file_to_word_ids(test_path, word_to_id)

    return train_data, valid_data, test_data, words, word_to_id
```
ptb_producer() slices the id list into fixed-length input windows and their next-word targets, then groups them into batches. Its parameters are described after the code:

```python
def ptb_producer(raw_data, batch_size=64, num_steps=20, stride=1):
    data_len = len(raw_data)

    sentences = []   # input windows of num_steps words
    next_words = []  # the word that follows each window
    for i in range(0, data_len - num_steps, stride):
        sentences.append(raw_data[i:(i + num_steps)])
        next_words.append(raw_data[i + num_steps])

    sentences = np.array(sentences)
    next_words = np.array(next_words)

    batch_len = len(sentences) // batch_size
    x = np.reshape(sentences[:(batch_len * batch_size)],
                   [batch_len, batch_size, -1])
    y = np.reshape(next_words[:(batch_len * batch_size)],
                   [batch_len, batch_size])

    return x, y
```
- raw_data: the id list produced by ptb_raw_data()
- batch_size: the network is trained with stochastic gradient descent, so the data is emitted in batches; this is the number of samples per batch
- num_steps: the length of each input window, corresponding to the n described earlier; in a recurrent network this is also called the number of time steps
- stride: the step size used when slicing windows, which determines how many samples are produced
Walkthrough:

This function turns the raw id list into batched data of shape [batch_len, batch_size, num_steps].

Each sample takes num_steps consecutive words as a sentence x, and the single word that follows them as the prediction target y. The raw data is thus arranged into batch_len * batch_size pairs of x and y, much like a classification problem where x is known and y is to be predicted.

To support stochastic gradient descent, the samples are further grouped into mini-batches, and one batch at a time is fed to TensorFlow to update the weights. The inputs therefore end up with the shape [batch_len, batch_size, num_steps] and the targets with [batch_len, batch_size]. A toy illustration of this windowing follows.
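To make the shapes concrete, here is a standalone toy run of the same windowing logic (not part of the model code) on 12 fake word ids with num_steps=3, stride=1, and batch_size=4:

```python
import numpy as np

raw = list(range(12))          # toy "word ids": [0, 1, 2, ..., 11]
num_steps, stride, batch_size = 3, 1, 4

sentences, next_words = [], []
for i in range(0, len(raw) - num_steps, stride):
    sentences.append(raw[i:i + num_steps])   # e.g. [0, 1, 2]
    next_words.append(raw[i + num_steps])    # e.g. 3

x = np.array(sentences)        # shape (9, 3): 9 windows of 3 ids each
y = np.array(next_words)       # shape (9,)

batch_len = len(x) // batch_size             # 2 full batches, 1 leftover sample dropped
x = x[:batch_len * batch_size].reshape(batch_len, batch_size, num_steps)
y = y[:batch_len * batch_size].reshape(batch_len, batch_size)
print(x.shape, y.shape)        # (2, 4, 3) (2, 4)
```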
Print some of the data:

```python
train_data, valid_data, test_data, words, word_to_id = ptb_raw_data('simple-examples/data')
x_train, y_train = ptb_producer(train_data)
print(x_train.shape)
print(y_train.shape)
```

```
(14524, 64, 20)
(14524, 64)
```

```python
print(' '.join(to_words(x_train[100, 3], words)))
```

```
despite steady sales growth <eos> magna recently cut its quarterly dividend in half and the company 's class a shares
```

```python
print(words[np.argmax(y_train[100, 3])])
```

```
the
```
The hyperparameters are collected in a configuration class. Note that the later code also reads config.rnn_model, so it is included here; config.vocab_size is filled in at run time once the vocabulary has been built:

```python
class LMConfig(object):
    """Language model configuration."""
    batch_size = 64       # number of samples per batch
    num_steps = 20        # length of each sentence (number of time steps)
    stride = 3            # stride used when slicing the data

    embedding_dim = 64    # word embedding dimension
    hidden_dim = 128      # dimension of the RNN hidden layer
    num_layers = 2        # number of RNN layers
    rnn_model = 'gru'     # which cell to use: 'lstm' or 'gru'

    learning_rate = 0.05  # learning rate
    dropout = 0.2         # dropout probability after each layer
```
PTBInput reads the data batch by batch and converts the targets of each batch to one-hot encoding:

```python
class PTBInput(object):
    """Reads the data batch by batch."""
    def __init__(self, config, data):
        self.batch_size = config.batch_size
        self.num_steps = config.num_steps
        self.vocab_size = config.vocab_size  # vocabulary size
        self.stride = config.stride          # stride when slicing the data

        self.input_data, self.targets = ptb_producer(data,
            self.batch_size, self.num_steps, self.stride)

        self.batch_len = self.input_data.shape[0]  # total number of batches
        self.cur_batch = 0  # index of the current batch

    def next_batch(self):
        """Returns the next batch."""
        x = self.input_data[self.cur_batch]
        y = self.targets[self.cur_batch]

        # convert the targets to one-hot encoding
        y_ = np.zeros((y.shape[0], self.vocab_size), dtype=np.bool)
        for i in range(y.shape[0]):
            y_[i][y[i]] = 1

        # wrap around to the first batch after the last one
        self.cur_batch = (self.cur_batch + 1) % self.batch_len

        return x, y_
```
The model itself: word ids are embedded, fed through a multi-layer RNN, and the output at the last time step is passed to a dense layer plus softmax to predict the next word:

```python
class PTBModel(object):
    def __init__(self, config, is_training=True):
        self.num_steps = config.num_steps
        self.vocab_size = config.vocab_size

        self.embedding_dim = config.embedding_dim
        self.hidden_dim = config.hidden_dim
        self.num_layers = config.num_layers
        self.rnn_model = config.rnn_model

        self.learning_rate = config.learning_rate
        self.dropout = config.dropout

        self.placeholders()  # input placeholders
        self.rnn()           # rnn model construction
        self.cost()          # cost function
        self.optimize()      # optimizer
        self.error()         # error rate

    def placeholders(self):
        """Placeholders for the input data."""
        self._inputs = tf.placeholder(tf.int32, [None, self.num_steps])
        self._targets = tf.placeholder(tf.int32, [None, self.vocab_size])

    def input_embedding(self):
        """Converts the inputs to word embeddings."""
        with tf.device("/cpu:0"):
            embedding = tf.get_variable(
                "embedding", [self.vocab_size,
                    self.embedding_dim], dtype=tf.float32)
            _inputs = tf.nn.embedding_lookup(embedding, self._inputs)

        return _inputs

    def rnn(self):
        """Builds the rnn model."""
        def lstm_cell():  # basic lstm cell
            return tf.contrib.rnn.BasicLSTMCell(self.hidden_dim,
                state_is_tuple=True)

        def gru_cell():   # gru cell, faster than lstm
            return tf.contrib.rnn.GRUCell(self.hidden_dim)

        def dropout_cell():  # wrap each cell with dropout
            if (self.rnn_model == 'lstm'):
                cell = lstm_cell()
            else:
                cell = gru_cell()
            return tf.contrib.rnn.DropoutWrapper(cell,
                output_keep_prob=1.0 - self.dropout)  # keep probability = 1 - dropout

        cells = [dropout_cell() for _ in range(self.num_layers)]
        cell = tf.contrib.rnn.MultiRNNCell(cells, state_is_tuple=True)  # multi-layer rnn

        _inputs = self.input_embedding()
        _outputs, _ = tf.nn.dynamic_rnn(cell=cell,
            inputs=_inputs, dtype=tf.float32)

        # _outputs has shape [batch_size, num_steps, hidden_dim]
        last = _outputs[:, -1, :]  # only the output at the last time step is needed

        # dense + softmax turn that output into per-word probabilities
        logits = tf.layers.dense(inputs=last, units=self.vocab_size)
        prediction = tf.nn.softmax(logits)

        self._logits = logits
        self._pred = prediction

    def cost(self):
        """Cross-entropy cost function."""
        cross_entropy = tf.nn.softmax_cross_entropy_with_logits(
            logits=self._logits, labels=self._targets)
        cost = tf.reduce_mean(cross_entropy)
        self.cost = cost

    def optimize(self):
        """Adam optimizer."""
        optimizer = tf.train.AdamOptimizer(learning_rate=self.learning_rate)
        self.optim = optimizer.minimize(self.cost)

    def error(self):
        """Error rate."""
        mistakes = tf.not_equal(
            tf.argmax(self._targets, 1), tf.argmax(self._pred, 1))
        self.errors = tf.reduce_mean(tf.cast(mistakes, tf.float32))
```
Finally, run_epoch() wires everything together: it loads the training data, builds the model, and trains it batch by batch, printing the loss and a few predictions every 500 batches:

```python
def run_epoch(num_epochs=10):
    config = LMConfig()  # load the configuration

    # load the raw data; only the training set is needed here
    train_data, _, _, words, word_to_id = \
        ptb_raw_data('simple-examples/data')
    config.vocab_size = len(words)

    # split the data into batches
    input_train = PTBInput(config, train_data)
    batch_len = input_train.batch_len

    # build the model
    model = PTBModel(config)

    # create the session and initialize the variables
    sess = tf.Session()
    sess.run(tf.global_variables_initializer())

    print('Start training...')
    for epoch in range(num_epochs):   # training epochs
        for i in range(batch_len):    # batches per epoch
            x_batch, y_batch = input_train.next_batch()

            # fetch one batch of data and run one optimization step
            feed_dict = {model._inputs: x_batch, model._targets: y_batch}
            sess.run(model.optim, feed_dict=feed_dict)

            # every 500 batches, print intermediate results
            if i % 500 == 0:
                cost = sess.run(model.cost, feed_dict=feed_dict)

                msg = "Epoch: {0:>3}, batch: {1:>6}, Loss: {2:>6.3}"
                print(msg.format(epoch + 1, i + 1, cost))

                # print some predictions
                pred = sess.run(model._pred, feed_dict=feed_dict)
                word_ids = np.argmax(pred, 1)
                print('Predicted:', ' '.join(words[w] for w in word_ids))
                true_ids = np.argmax(y_batch, 1)
                print('True:', ' '.join(words[w] for w in true_ids))

    print('Finish training...')
    sess.close()
```
It takes many epochs of training before the model produces reasonably sensible results.
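As mentioned at the beginning, a language model can also be used to generate new text. Below is a minimal sketch of how one might greedily sample from the trained model inside run_epoch (before sess.close()); the generate_text helper and its seed handling are my own additions for illustration, not part of the original code:

```python
def generate_text(sess, model, words, seed_ids, length=50):
    """Greedily generates `length` words, starting from a window of
    num_steps seed word ids, by repeatedly predicting the next word."""
    window = list(seed_ids)  # must contain at least model.num_steps ids
    generated = []
    for _ in range(length):
        x = np.array([window[-model.num_steps:]])           # shape [1, num_steps]
        pred = sess.run(model._pred, feed_dict={model._inputs: x})
        next_id = int(np.argmax(pred[0]))                   # pick the most likely word
        generated.append(words[next_id])
        window.append(next_id)
    return ' '.join(generated)

# Example (inside run_epoch, after training):
# print(generate_text(sess, model, words, x_batch[0], length=30))
```

Greedy argmax decoding tends to repeat itself; sampling from the softmax distribution, as the Keras text-generation example does, usually produces more varied output.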