Training a Recurrent Neural Network Language Model with TensorFlow
After spending the better part of an afternoon reading the TensorFlow Recurrent Neural Network tutorial and its PTB implementation, I found the official code rather hard to follow. So, borrowing parts of it, I wrote a simplified language model, with the overall approach inspired by Keras's LSTM text generation example.
Code: Github
Please credit Gaussic when reposting.
Language Model
The main idea of a language model is: given the preceding words, infer the most likely next word. For example, given The fat cat sat on the, we judge the next word to be mat rather than hat, because a cat is more likely to sit on a mat than on a hat.

You may regard this as common sense, but in natural language processing the task can be described with a probabilistic model. Take The fat cat sat on the mat as an example. We can estimate the probability of the first word, $p(\text{The})$, the conditional probability that fat follows The, $p(\text{fat}\mid\text{The})$, and from them the joint probability of The fat occurring together:
$$p(\text{The},\text{fat}) = p(\text{The}) \cdot p(\text{fat}\mid\text{The})$$

Similarly, the joint probability of The fat cat is

$$p(\text{The},\text{fat},\text{cat}) = p(\text{The}) \cdot p(\text{fat}\mid\text{The}) \cdot p(\text{cat}\mid\text{The},\text{fat})$$

and, conversely, a conditional probability can be recovered from joint probabilities:

$$p(\text{cat}\mid\text{The},\text{fat}) = \frac{p(\text{The},\text{fat},\text{cat})}{p(\text{The},\text{fat})}$$

By the chain rule, the probability of a whole sentence $S = w_1 w_2 \cdots w_n$ is

$$p(S) = p(w_1,w_2,\cdots,w_n) = p(w_1) \cdot p(w_2\mid w_1) \cdot p(w_3\mid w_1,w_2) \cdots p(w_n\mid w_1,w_2,\cdots,w_{n-1})$$

Conditioning on the full history is impractical to estimate, so n-gram models make a Markov assumption and condition each word only on the previous $n-1$ words. With $n=2$ (a bigram model):

$$p(S) = p(w_1,w_2,\cdots,w_n) = p(w_1) \cdot p(w_2\mid w_1) \cdot p(w_3\mid w_2) \cdot p(w_4\mid w_3) \cdots p(w_n\mid w_{n-1})$$

With $n=3$ (a trigram model):

$$p(S) = p(w_1,w_2,\cdots,w_n) = p(w_1) \cdot p(w_2\mid w_1) \cdot p(w_3\mid w_1,w_2) \cdot p(w_4\mid w_2,w_3) \cdots p(w_n\mid w_{n-2},w_{n-1})$$
Conditional probabilities truncated in this way are easy to estimate, but they discard much of the information carried by earlier words, which can sometimes hurt the results. The challenge is therefore to choose an effective n that keeps the computation simple while preserving most of the contextual information.

All of the above describes traditional language models. Without digging into the details, our task is: given the previous n words, compute the probability distribution of the next word, and use the language model to generate new text.

In this post we are more interested in how to use an RNN to predict the next word.
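To make the counting view above concrete, here is a minimal sketch (not part of the original code) that estimates bigram probabilities $p(w_i \mid w_{i-1})$ from a toy corpus by simple counting; the corpus and the bigram_probs name are made up for illustration:

```python
from collections import Counter, defaultdict

def bigram_probs(tokens):
    """Estimate p(next | prev) by counting bigrams and unigrams."""
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens, tokens[1:]))
    probs = defaultdict(dict)
    for (prev, nxt), c in bigram.items():
        probs[prev][nxt] = c / unigram[prev]
    return probs

corpus = "the fat cat sat on the mat the fat cat sat on the hat".split()
probs = bigram_probs(corpus)
print(probs['the'])   # {'fat': 0.5, 'mat': 0.25, 'hat': 0.25}
```

An RNN replaces these explicit counts with a learned hidden state that summarizes the entire preceding context.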
Download and unpack the PTB dataset used by the TensorFlow tutorial:

```bash
$ wget http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz
$ tar xvf simple-examples.tgz
```
The data files under simple-examples/data contain one sentence per line; rare words are already replaced with <unk> and numbers with N. A few lines from the training set:

```
we 're talking about years ago before anyone heard of asbestos having any questionable properties
there is no asbestos in our products now
neither <unk> nor the researchers who studied the workers were aware of any research on smokers of the kent cigarettes
we have no useful information on whether users are at risk said james a. <unk> of boston 's <unk> cancer institute
the total of N deaths from malignant <unk> lung cancer and <unk> was far higher than expected the researchers said
```
First, read a file, replace newlines with the <eos> end-of-sentence marker, and split the text into a list of words:

```python
import os
from collections import Counter

import numpy as np
import tensorflow as tf


def _read_words(filename):
    with open(filename, 'r', encoding='utf-8') as f:
        return f.read().replace('\n', '<eos>').split()


f = _read_words('simple-examples/data/ptb.train.txt')
print(f[:20])
```

```
['aer', 'banknote', 'berlitz', 'calloway', 'centrust', 'cluett', 'fromstein', 'gitano', 'guterman', 'hydro-quebec', 'ipo', 'kia', 'memotec', 'mlx', 'nahb', 'punts', 'rake', 'regatta', 'rubens', 'sim']
```
Next, build the vocabulary: count word frequencies, sort by decreasing frequency, and map each word to an integer id:

```python
def _build_vocab(filename):
    data = _read_words(filename)

    counter = Counter(data)
    count_pairs = sorted(counter.items(), key=lambda x: -x[1])

    words, _ = list(zip(*count_pairs))
    word_to_id = dict(zip(words, range(len(words))))

    return words, word_to_id


words, words_to_id = _build_vocab('simple-examples/data/ptb.train.txt')
print(words[:10])
print(list(map(lambda x: words_to_id[x], words[:10])))
```

```
('the', '<unk>', '<eos>', 'N', 'of', 'to', 'a', 'in', 'and', "'s")
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```
With the vocabulary in place, a whole file can be converted into a list of word ids (words not in the vocabulary are skipped):

```python
def _file_to_word_ids(filename, word_to_id):
    data = _read_words(filename)
    return [word_to_id[x] for x in data if x in word_to_id]


words_in_file = _file_to_word_ids('simple-examples/data/ptb.train.txt', words_to_id)
print(words_in_file[:20])
```

```
[9980, 9988, 9981, 9989, 9970, 9998, 9971, 9979, 9992, 9997, 9982, 9972, 9993, 9991, 9978, 9983, 9974, 9986, 9999, 9990]
```
A small helper converts a sequence of ids back into words:

```python
def to_words(sentence, words):
    return list(map(lambda x: words[x], sentence))
```
ptb_raw_data() ties the above together: it builds the vocabulary from the training set and converts the training, validation, and test files into id lists:

```python
def ptb_raw_data(data_path=None):
    train_path = os.path.join(data_path, 'ptb.train.txt')
    valid_path = os.path.join(data_path, 'ptb.valid.txt')
    test_path = os.path.join(data_path, 'ptb.test.txt')

    words, word_to_id = _build_vocab(train_path)

    train_data = _file_to_word_ids(train_path, word_to_id)
    valid_data = _file_to_word_ids(valid_path, word_to_id)
    test_data = _file_to_word_ids(test_path, word_to_id)

    return train_data, valid_data, test_data, words, word_to_id
```
ptb_producer() slices the id list into fixed-length input windows and their next-word targets, then groups them into batches. Its parameters are described after the code:

```python
def ptb_producer(raw_data, batch_size=64, num_steps=20, stride=1):
    data_len = len(raw_data)

    sentences = []   # input windows of num_steps words
    next_words = []  # the word that follows each window
    for i in range(0, data_len - num_steps, stride):
        sentences.append(raw_data[i:(i + num_steps)])
        next_words.append(raw_data[i + num_steps])

    sentences = np.array(sentences)
    next_words = np.array(next_words)

    batch_len = len(sentences) // batch_size
    x = np.reshape(sentences[:(batch_len * batch_size)],
                   [batch_len, batch_size, -1])
    y = np.reshape(next_words[:(batch_len * batch_size)],
                   [batch_len, batch_size])

    return x, y
```
- raw_data: the id list produced by ptb_raw_data()
- batch_size: the network is trained with stochastic gradient descent, so the data is emitted in batches; this is the number of samples per batch
- num_steps: the length of each input window, corresponding to the n described earlier; in a recurrent network this is also called the number of time steps
- stride: the step size used when slicing windows, which determines how many samples are produced
Walkthrough:

This function turns the raw id list into batched data of shape [batch_len, batch_size, num_steps].

Each sample takes num_steps consecutive words as a sentence x, and the single word that follows them as the prediction target y. The raw data is thus arranged into batch_len * batch_size pairs of x and y, much like a classification problem where x is known and y is to be predicted.

To support stochastic gradient descent, the samples are further grouped into mini-batches, and one batch at a time is fed to TensorFlow to update the weights. The inputs therefore end up with the shape [batch_len, batch_size, num_steps] and the targets with [batch_len, batch_size]. A toy illustration of this windowing follows.
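To make the shapes concrete, here is a standalone toy run of the same windowing logic (not part of the model code) on 12 fake word ids with num_steps=3, stride=1, and batch_size=4:

```python
import numpy as np

raw = list(range(12))          # toy "word ids": [0, 1, 2, ..., 11]
num_steps, stride, batch_size = 3, 1, 4

sentences, next_words = [], []
for i in range(0, len(raw) - num_steps, stride):
    sentences.append(raw[i:i + num_steps])   # e.g. [0, 1, 2]
    next_words.append(raw[i + num_steps])    # e.g. 3

x = np.array(sentences)        # shape (9, 3): 9 windows of 3 ids each
y = np.array(next_words)       # shape (9,)

batch_len = len(x) // batch_size             # 2 full batches, 1 leftover sample dropped
x = x[:batch_len * batch_size].reshape(batch_len, batch_size, num_steps)
y = y[:batch_len * batch_size].reshape(batch_len, batch_size)
print(x.shape, y.shape)        # (2, 4, 3) (2, 4)
```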
Print some of the data:

```python
train_data, valid_data, test_data, words, word_to_id = ptb_raw_data('simple-examples/data')
x_train, y_train = ptb_producer(train_data)
print(x_train.shape)
print(y_train.shape)
```

```
(14524, 64, 20)
(14524, 64)
```

```python
print(' '.join(to_words(x_train[100, 3], words)))
```

```
despite steady sales growth <eos> magna recently cut its quarterly dividend in half and the company 's class a shares
```

```python
print(words[np.argmax(y_train[100, 3])])
```

```
the
```
The hyperparameters are collected in a configuration class. Note that the later code also reads config.rnn_model, so it is included here; config.vocab_size is filled in at run time once the vocabulary has been built:

```python
class LMConfig(object):
    """Language model configuration."""
    batch_size = 64       # number of samples per batch
    num_steps = 20        # length of each sentence (number of time steps)
    stride = 3            # stride used when slicing the data

    embedding_dim = 64    # word embedding dimension
    hidden_dim = 128      # dimension of the RNN hidden layer
    num_layers = 2        # number of RNN layers
    rnn_model = 'gru'     # which cell to use: 'lstm' or 'gru'

    learning_rate = 0.05  # learning rate
    dropout = 0.2         # dropout probability after each layer
```
PTBInput reads the data batch by batch and converts the targets of each batch to one-hot encoding:

```python
class PTBInput(object):
    """Reads the data batch by batch."""
    def __init__(self, config, data):
        self.batch_size = config.batch_size
        self.num_steps = config.num_steps
        self.vocab_size = config.vocab_size  # vocabulary size
        self.stride = config.stride          # stride when slicing the data

        self.input_data, self.targets = ptb_producer(data,
            self.batch_size, self.num_steps, self.stride)

        self.batch_len = self.input_data.shape[0]  # total number of batches
        self.cur_batch = 0  # index of the current batch

    def next_batch(self):
        """Returns the next batch."""
        x = self.input_data[self.cur_batch]
        y = self.targets[self.cur_batch]

        # convert the targets to one-hot encoding
        y_ = np.zeros((y.shape[0], self.vocab_size), dtype=np.bool)
        for i in range(y.shape[0]):
            y_[i][y[i]] = 1

        # wrap around to the first batch after the last one
        self.cur_batch = (self.cur_batch + 1) % self.batch_len

        return x, y_
```
The model itself: word ids are embedded, fed through a multi-layer RNN, and the output at the last time step is passed to a dense layer plus softmax to predict the next word:

```python
class PTBModel(object):
    def __init__(self, config, is_training=True):
        self.num_steps = config.num_steps
        self.vocab_size = config.vocab_size

        self.embedding_dim = config.embedding_dim
        self.hidden_dim = config.hidden_dim
        self.num_layers = config.num_layers
        self.rnn_model = config.rnn_model

        self.learning_rate = config.learning_rate
        self.dropout = config.dropout

        self.placeholders()  # input placeholders
        self.rnn()           # rnn model construction
        self.cost()          # cost function
        self.optimize()      # optimizer
        self.error()         # error rate

    def placeholders(self):
        """Placeholders for the input data."""
        self._inputs = tf.placeholder(tf.int32, [None, self.num_steps])
        self._targets = tf.placeholder(tf.int32, [None, self.vocab_size])

    def input_embedding(self):
        """Converts the inputs to word embeddings."""
        with tf.device("/cpu:0"):
            embedding = tf.get_variable(
                "embedding", [self.vocab_size,
                    self.embedding_dim], dtype=tf.float32)
            _inputs = tf.nn.embedding_lookup(embedding, self._inputs)

        return _inputs

    def rnn(self):
        """Builds the rnn model."""
        def lstm_cell():  # basic lstm cell
            return tf.contrib.rnn.BasicLSTMCell(self.hidden_dim,
                state_is_tuple=True)

        def gru_cell():   # gru cell, faster than lstm
            return tf.contrib.rnn.GRUCell(self.hidden_dim)

        def dropout_cell():  # wrap each cell with dropout
            if (self.rnn_model == 'lstm'):
                cell = lstm_cell()
            else:
                cell = gru_cell()
            return tf.contrib.rnn.DropoutWrapper(cell,
                output_keep_prob=1.0 - self.dropout)  # keep probability = 1 - dropout

        cells = [dropout_cell() for _ in range(self.num_layers)]
        cell = tf.contrib.rnn.MultiRNNCell(cells, state_is_tuple=True)  # multi-layer rnn

        _inputs = self.input_embedding()
        _outputs, _ = tf.nn.dynamic_rnn(cell=cell,
            inputs=_inputs, dtype=tf.float32)

        # _outputs has shape [batch_size, num_steps, hidden_dim]
        last = _outputs[:, -1, :]  # only the output at the last time step is needed

        # dense + softmax turn that output into per-word probabilities
        logits = tf.layers.dense(inputs=last, units=self.vocab_size)
        prediction = tf.nn.softmax(logits)

        self._logits = logits
        self._pred = prediction

    def cost(self):
        """Cross-entropy cost function."""
        cross_entropy = tf.nn.softmax_cross_entropy_with_logits(
            logits=self._logits, labels=self._targets)
        cost = tf.reduce_mean(cross_entropy)
        self.cost = cost

    def optimize(self):
        """Adam optimizer."""
        optimizer = tf.train.AdamOptimizer(learning_rate=self.learning_rate)
        self.optim = optimizer.minimize(self.cost)

    def error(self):
        """Error rate."""
        mistakes = tf.not_equal(
            tf.argmax(self._targets, 1), tf.argmax(self._pred, 1))
        self.errors = tf.reduce_mean(tf.cast(mistakes, tf.float32))
```
Finally, run_epoch() wires everything together: it loads the training data, builds the model, and trains it batch by batch, printing the loss and a few predictions every 500 batches:

```python
def run_epoch(num_epochs=10):
    config = LMConfig()  # load the configuration

    # load the raw data; only the training set is needed here
    train_data, _, _, words, word_to_id = \
        ptb_raw_data('simple-examples/data')
    config.vocab_size = len(words)

    # split the data into batches
    input_train = PTBInput(config, train_data)
    batch_len = input_train.batch_len

    # build the model
    model = PTBModel(config)

    # create the session and initialize the variables
    sess = tf.Session()
    sess.run(tf.global_variables_initializer())

    print('Start training...')
    for epoch in range(num_epochs):   # training epochs
        for i in range(batch_len):    # batches per epoch
            x_batch, y_batch = input_train.next_batch()

            # fetch one batch of data and run one optimization step
            feed_dict = {model._inputs: x_batch, model._targets: y_batch}
            sess.run(model.optim, feed_dict=feed_dict)

            # every 500 batches, print intermediate results
            if i % 500 == 0:
                cost = sess.run(model.cost, feed_dict=feed_dict)

                msg = "Epoch: {0:>3}, batch: {1:>6}, Loss: {2:>6.3}"
                print(msg.format(epoch + 1, i + 1, cost))

                # print some predictions
                pred = sess.run(model._pred, feed_dict=feed_dict)
                word_ids = np.argmax(pred, 1)
                print('Predicted:', ' '.join(words[w] for w in word_ids))
                true_ids = np.argmax(y_batch, 1)
                print('True:', ' '.join(words[w] for w in true_ids))

    print('Finish training...')
    sess.close()
```
It takes many epochs of training before the model produces reasonably sensible results.
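As mentioned at the beginning, a language model can also be used to generate new text. Below is a minimal sketch of how one might greedily sample from the trained model inside run_epoch (before sess.close()); the generate_text helper and its seed handling are my own additions for illustration, not part of the original code:

```python
def generate_text(sess, model, words, seed_ids, length=50):
    """Greedily generates `length` words, starting from a window of
    num_steps seed word ids, by repeatedly predicting the next word."""
    window = list(seed_ids)  # must contain at least model.num_steps ids
    generated = []
    for _ in range(length):
        x = np.array([window[-model.num_steps:]])           # shape [1, num_steps]
        pred = sess.run(model._pred, feed_dict={model._inputs: x})
        next_id = int(np.argmax(pred[0]))                   # pick the most likely word
        generated.append(words[next_id])
        window.append(next_id)
    return ' '.join(generated)

# Example (inside run_epoch, after training):
# print(generate_text(sess, model, words, x_batch[0], length=30))
```

Greedy argmax decoding tends to repeat itself; sampling from the softmax distribution, as the Keras text-generation example does, usually produces more varied output.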