Generating Image Captions with Machine Learning

Author: deephub 2021-04-25 16:21:32

Artificial Intelligence, Machine Learning

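Image captioning is the process of producing a suitable textual description for an image. For humans this seems easy, something even a five-year-old can do, but how do we write a computer program that takes an image as input and produces a caption as output?

Before the recent advances in deep neural networks, this problem was out of reach even for the best people in the field. With deep neural networks, and provided we have the required dataset, it is entirely feasible.

For example, for the accompanying image a network model could generate any of the following captions: "A white dog in a grassy area", "white dog with brown spots", or even "A dog on grass and some pink flowers".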

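Dataset

We chose the "Flickr 8k" dataset. It is easy to obtain, small enough to train on an ordinary PC, and still large enough to train a network that generates reasonable captions. The data is split into three sets: a training set of 6k images, a development set of 1k images, and a test set of 1k images. Each image comes with 5 captions. One example: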


A child in a pink dress is climbing up a set of stairs in an entryway.

A girl going into a wooden building.

A little girl climbing into a wooden playhouse.

A little girl climbing the stairs to her playhouse.

A little girl in a pink dress going into a wooden cabin.

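Data Cleaning

The first and most important step in any machine learning project is to clean the data and remove everything that is not needed. For the caption text we apply basic cleaning steps: convert all letters to lowercase (to a computer, "Hey" and "hey" are two completely different words), remove special tokens and punctuation (such as *, (, £, $, %), and drop every word that contains a digit.

We first build a vocabulary of all unique words across the 8000 (images) * 5 (captions per image) = 40000 captions in the dataset, which comes to 8763 words. Most of these words occur only once or twice, and we do not want them in the model because they would make it less robust to outliers. We therefore keep only words that occur at least 10 times, which leaves 1652 unique words.

We also add two tokens to every description to mark where a caption begins and ends: "startseq" and "endseq".

First, import all the required libraries: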

import numpy as np   
from numpy import array   
import pandas as pd   
import matplotlib.pyplot as plt   
import string   
import os   
from PIL import Image   
import glob   
import pickle   
from time import time   
from keras.preprocessing import sequence   
from keras.models import Sequential   
from keras.layers import LSTM, Embedding, Dense, Flatten, Reshape, concatenate, Dropout   
from keras.optimizers import Adam   
from keras.layers import add
from keras.applications.inception_v3 import InceptionV3   
from keras.preprocessing import image   
from keras.models import Model   
from keras import Input, layers   
from keras.applications.inception_v3 import preprocess_input   
from keras.preprocessing.sequence import pad_sequences   
from keras.utils import to_categorical

Let's define some helper functions:

# read a whole text file into a string
def load_doc(filename):
    with open(filename, 'r') as file:
        text = file.read()
    return text


# map each image id to the list of its raw captions
def load_descriptions(doc):
    mapping = dict()
    for line in doc.split('\n'):
        tokens = line.split()
        if len(line) < 2:
            continue
        image_id, image_desc = tokens[0], tokens[1:]
        # drop the file extension (and caption number) from the image id
        image_id = image_id.split('.')[0]
        image_desc = ' '.join(image_desc)
        if image_id not in mapping:
            mapping[image_id] = list()
        mapping[image_id].append(image_desc)
    return mapping


# lowercase, strip punctuation, drop one-character and non-alphabetic words
def clean_descriptions(descriptions):
    table = str.maketrans('', '', string.punctuation)
    for key, desc_list in descriptions.items():
        for i in range(len(desc_list)):
            desc = desc_list[i]
            desc = desc.split()
            desc = [word.lower() for word in desc]
            desc = [w.translate(table) for w in desc]
            desc = [word for word in desc if len(word) > 1]
            desc = [word for word in desc if word.isalpha()]
            desc_list[i] = ' '.join(desc)
    return descriptions


# save descriptions to file, one per line
def save_descriptions(descriptions, filename):
    lines = list()
    for key, desc_list in descriptions.items():
        for desc in desc_list:
            lines.append(key + ' ' + desc)
    data = '\n'.join(lines)
    with open(filename, 'w') as file:
        file.write(data)


# load clean descriptions into memory, keeping only images in `dataset`
def load_clean_descriptions(filename, dataset):
    doc = load_doc(filename)
    descriptions = dict()
    for line in doc.split('\n'):
        tokens = line.split()
        # skip empty lines, which would otherwise raise an IndexError
        if len(tokens) < 1:
            continue
        image_id, image_desc = tokens[0], tokens[1:]
        if image_id in dataset:
            if image_id not in descriptions:
                descriptions[image_id] = list()
            # wrap every caption in the start and end tokens
            desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
            descriptions[image_id].append(desc)
    return descriptions


# load the set of image identifiers listed in a split file
def load_set(filename):
    doc = load_doc(filename)
    dataset = list()
    for line in doc.split('\n'):
        if len(line) < 1:
            continue
        identifier = line.split('.')[0]
        dataset.append(identifier)
    return set(dataset)


# load training dataset
filename = "dataset/Flickr8k_text/Flickr8k.token.txt"
doc = load_doc(filename)
descriptions = load_descriptions(doc)
descriptions = clean_descriptions(descriptions)
save_descriptions(descriptions, 'descriptions.txt')
filename = 'dataset/Flickr8k_text/Flickr_8k.trainImages.txt'
train = load_set(filename)
train_descriptions = load_clean_descriptions('descriptions.txt', train)

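Let's go through them one by one:

load_doc: takes the path of a file and returns its contents

load_descriptions: takes the contents of the file that holds the descriptions and builds a dictionary with the image id as key and the list of descriptions as value

clean_descriptions: cleans the descriptions by converting all letters to lowercase and dropping numbers, punctuation, and words that consist of a single character

save_descriptions: writes the description dictionary to disk as a text file

load_set: loads all unique image identifiers from a text file

load_clean_descriptions: loads all cleaned descriptions for the unique identifiers extracted above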
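To make the cleaning rules concrete, here is a minimal sketch that applies them to a single caption. The function name clean_caption and the sample sentence are assumptions made for illustration; the article's own code applies the same steps to a whole dictionary in clean_descriptions.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def clean_caption(caption):
    """Hypothetical single-caption version of the cleaning rules."""
    words = caption.split()
    words = [w.lower() for w in words]                 # lowercase everything
    words = [w.translate(PUNCT_TABLE) for w in words]  # strip punctuation
    words = [w for w in words if len(w) > 1]           # drop one-character words
    words = [w for w in words if w.isalpha()]          # drop words with digits
    return ' '.join(words)


raw = "A little girl, in a pink dress, climbs 12 stairs!"
cleaned = clean_caption(raw)
print(cleaned)
# The start and end tokens are added when the cleaned captions are loaded.
print('startseq ' + cleaned + ' endseq')
```

Note that the length filter also removes the article "a", so cleaned captions read slightly telegraphically.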

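Data Preprocessing

Next we preprocess the images and the captions. The images are essentially our feature vectors, i.e. the input to the network, so we need to convert them to fixed-size vectors before passing them to the neural network. For this we use transfer learning with the Inception V3 model (a convolutional neural network) created by Google Research [3]. The model was trained on the 'ImageNet' dataset [4] to classify images into 1000 classes. Since classification is not our goal, we remove the final softmax layer and extract a fixed-size vector of length 2048 for each image.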


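The captions are the output of our model, i.e. what we have to predict. The prediction does not happen all at once; the caption is predicted word by word. For that we need to encode each word as a fixed-size vector (done in the next section). We first create two dictionaries: "word to index", which maps each word to an index (1 to 1652 in our case), and "index to word", which maps each index back to its word. The last thing to do is compute the length of the longest description in the dataset so that all others can be padded to that fixed length. In our case this length is 34.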
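The article does not show the code that computes the maximum caption length, so the following is only a sketch of how the two dictionaries and that length could be derived. The two toy captions are illustrative data, and the name max_length is an assumption; the training code later refers to a variable called max_length1, which appears to hold this value.

```python
# Toy captions, already wrapped in the start/end tokens (illustrative data only).
captions = [
    'startseq little girl climbing into wooden playhouse endseq',
    'startseq girl going into wooden building endseq',
]

# Collect the vocabulary in first-seen order so the indices are reproducible.
vocab = []
for caption in captions:
    for word in caption.split():
        if word not in vocab:
            vocab.append(word)

# Index 0 is reserved for padding, so real words start at 1.
wordtoix = {word: ix for ix, word in enumerate(vocab, start=1)}
ixtoword = {ix: word for word, ix in wordtoix.items()}

# Length (in words) of the longest caption; shorter ones are padded to this.
max_length = max(len(caption.split()) for caption in captions)

print(len(wordtoix), max_length)
```

On the real training captions the same max(...) expression would be run over every description in train_descriptions.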

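Word Embeddings

As mentioned, we map each word to a fixed-size vector (of length 200) using the pre-trained GloVe model. Finally, we build an embedding matrix for all 1652 words in the vocabulary, containing one fixed-size vector per word.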

# Create a list of all the training captions
all_train_captions = []
for key, val in train_descriptions.items():
    for cap in val:
        all_train_captions.append(cap)


# Consider only words which occur at least 10 times in the corpus
word_count_threshold = 10
word_counts = {}
nsents = 0
for sent in all_train_captions:
    nsents += 1
    for w in sent.split(' '):
        word_counts[w] = word_counts.get(w, 0) + 1

vocab = [w for w in word_counts if word_counts[w] >= word_count_threshold]
print('Preprocessed words {} -> {}'.format(len(word_counts), len(vocab)))


ixtoword = {}
wordtoix = {}

ix = 1
for w in vocab:
    wordtoix[w] = ix
    ixtoword[ix] = w
    ix += 1

vocab_size = len(ixtoword) + 1 # one for appended 0's

# Load Glove vectors
glove_dir = 'glove.6B'
embeddings_index = {}
f = open(os.path.join(glove_dir, 'glove.6B.200d.txt'), encoding="utf-8")

for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

embedding_dim = 200

# Get 200-dim dense vector for each of the words in our vocabulary
embedding_matrix = np.zeros((vocab_size, embedding_dim))

for word, i in wordtoix.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

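Let's walk through this code:

Lines 1-5: collect all descriptions of all training images into one list

Lines 9-18: keep only the words that occur at least 10 times in the corpus

Lines 21-30: create the word-to-index and index-to-word dictionaries

Lines 33-42: load the GloVe embeddings into a dictionary with the word as key and the embedding vector as value

Lines 44-52: build the embedding matrix for the words in the vocabulary from the embeddings loaded above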

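Data Preparation

This is one of the most important parts of the project. For the images, we convert them to fixed-size vectors with the Inception V3 model, as described above.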

# Below path contains all the images
all_images_path = 'dataset/Flickr8k_Dataset/Flicker8k_Dataset/'
# Create a list of all image names in the directory
all_images = glob.glob(all_images_path + '*.jpg')

# Create a list of all the training and testing images with their full path names
def create_list_of_images(file_path):
    image_names = set(open(file_path, 'r').read().strip().split('\n'))
    images = []

    for image in all_images:
        if image[len(all_images_path):] in image_names:
            images.append(image)

    return images


train_images_path = 'dataset/Flickr8k_text/Flickr_8k.trainImages.txt'
test_images_path = 'dataset/Flickr8k_text/Flickr_8k.testImages.txt'

train_images = create_list_of_images(train_images_path)
test_images = create_list_of_images(test_images_path)

# Preprocess an image into the input format expected by Inception V3
def preprocess(image_path):
    img = image.load_img(image_path, target_size=(299, 299))
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)
    return x

# Load the inception v3 model
model = InceptionV3(weights='imagenet')

# Create a new model, by removing the last layer (output layer) from the inception v3
model_new = Model(model.input, model.layers[-2].output)

# Encoding a given image into a vector of size (2048, )
def encode(image):
    image = preprocess(image)
    fea_vec = model_new.predict(image)
    fea_vec = np.reshape(fea_vec, fea_vec.shape[1])
    return fea_vec


encoding_train = {}
for img in train_images:
    encoding_train[img[len(all_images_path):]] = encode(img)


encoding_test = {}
for img in test_images:
    encoding_test[img[len(all_images_path):]] = encode(img)

# Save the bottleneck features to disk
with open("encoded_files/encoded_train_images.pkl", "wb") as encoded_pickle:
    pickle.dump(encoding_train, encoded_pickle)

with open("encoded_files/encoded_test_images.pkl", "wb") as encoded_pickle:
    pickle.dump(encoding_test, encoded_pickle)


train_features = pickle.load(open("encoded_files/encoded_train_images.pkl", "rb"))

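Lines 1-22: load the paths of the training and test images into separate lists
Lines 25-53: loop over every image in the training and test sets, load it at a fixed size, preprocess it, extract its features with the InceptionV3 model, and finally reshape the result
Lines 56-63: save the extracted features to disk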

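We do not predict the whole caption at once; we do not simply hand the image to the computer and ask it to produce the text. Instead we give it the image feature vector together with the first word of the caption and let it predict the second word. Then we give it the first two words and let it predict the third, and so on. Consider the image from the dataset section and the caption "A girl going into a wooden building". After adding the tokens "startseq" and "endseq", the inputs (Xi) and outputs (Yi) are built from successively longer prefixes of the caption, each paired with the word that follows it.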
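The expansion of one caption into several training pairs can be sketched as follows. The helper name make_pairs is hypothetical, and the caption is shown after the cleaning rules described earlier (which drop the one-letter word "a"). In the real pipeline each pair is additionally accompanied by the image's 2048-element feature vector.

```python
def make_pairs(caption):
    """Split one caption into (partial caption, next word) training pairs."""
    words = caption.split()
    pairs = []
    for i in range(1, len(words)):
        # Input Xi: the first i words; output Yi: the word that follows them.
        pairs.append((' '.join(words[:i]), words[i]))
    return pairs


caption = 'startseq girl going into wooden building endseq'
pairs = make_pairs(caption)
for partial, target in pairs:
    print('{!r} -> {!r}'.format(partial, target))
```

A caption of n tokens therefore yields n - 1 training pairs, which is why the full training set grows far beyond the number of captions.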


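Next we use the "word to index" dictionary to replace each word in the inputs and outputs with its index. For batching we want all sequences to have the same length, which is why each sequence is padded with zeros up to the maximum length (34, as computed above). As you can see, this produces a large amount of data, and loading all of it into memory at once is not feasible. We therefore use a data generator that loads the data in small chunks to reduce memory usage.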

# data generator, intended to be used in a call to model.fit_generator()
def data_generator(descriptions, photos, wordtoix, max_length, num_photos_per_batch):
    X1, X2, y = list(), list(), list()
    n = 0
    # loop forever over images
    while 1:
        for key, desc_list in descriptions.items():
            n += 1
            # retrieve the photo feature
            photo = photos[key+'.jpg']
            for desc in desc_list:
                # encode the sequence
                seq = [wordtoix[word] for word in desc.split(' ') if word in wordtoix]
                # split one sequence into multiple X, y pairs
                for i in range(1, len(seq)):
                    # split into input and output pair
                    in_seq, out_seq = seq[:i], seq[i]
                    # pad input sequence
                    in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
                    # encode output sequence
                    out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
                    # store
                    X1.append(photo)
                    X2.append(in_seq)
                    y.append(out_seq)
            # yield the batch data
            if n == num_photos_per_batch:
                yield [[array(X1), array(X2)], array(y)]
                X1, X2, y = list(), list(), list()
                n = 0

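The code above iterates over all images and descriptions and produces the input and output pairs described earlier. yield makes the function resume from the same line the next time a batch is requested, which lets us load the data in batches.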
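How yield pauses and resumes a function can be seen in a stripped-down sketch. The function batch_generator and the image names are made up for illustration; it keeps only the batching logic of the data generator and leaves out the feature lookup and sequence encoding.

```python
def batch_generator(items, batch_size):
    """Yield lists of `batch_size` items, cycling over `items` forever."""
    batch = []
    while True:
        for item in items:
            batch.append(item)
            if len(batch) == batch_size:
                yield batch  # execution pauses here ...
                batch = []   # ... and resumes here on the next request


gen = batch_generator(['img1', 'img2', 'img3', 'img4'], batch_size=3)
first = next(gen)
second = next(gen)
print(first)
print(second)
```

Because the outer loop never ends, the second batch wraps around to the start of the list; only one batch is ever held in memory at a time.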

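Model Architecture and Training

As mentioned, the model has two inputs at each step: the image feature vector and the partial caption. We first apply Dropout of 0.5 to the image vector and connect it to a layer of 256 neurons. The partial caption is first fed to an embedding layer that uses the weights of the pre-trained GloVe embedding matrix described above; we then apply Dropout of 0.5 and an LSTM (long short-term memory). Finally, the two branches are added together and connected to a layer of 256 neurons, followed by a softmax layer that predicts a probability for each word in the vocabulary.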


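The following hyperparameters were chosen for training: the loss is "categorical cross-entropy" and the optimizer is "Adam". The model was trained for 30 epochs in total: for the first 20 epochs the learning rate and the batch size (in images) were 0.001 and 3, and for the following 10 epochs they were 0.0001 and 6.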

inputs1 = Input(shape=(2048,))
fe1 = Dropout(0.5)(inputs1)
fe2 = Dense(256, activation='relu')(fe1)
inputs2 = Input(shape=(max_length1,))  # max_length1: length of the longest caption
se1 = Embedding(vocab_size, embedding_dim, mask_zero=True)(inputs2)
se2 = Dropout(0.5)(se1)
se3 = LSTM(256)(se2)
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)
model = Model(inputs=[inputs1, inputs2], outputs=outputs)

model.layers[2].set_weights([embedding_matrix])
model.layers[2].trainable = False

model.compile(loss='categorical_crossentropy', optimizer='adam')

epochs = 20
number_pics_per_batch = 3
steps = len(train_descriptions)//number_pics_per_batch

generator = data_generator(train_descriptions, train_features, wordtoix, max_length1, number_pics_per_batch)
history = model.fit_generator(generator, epochs=epochs, steps_per_epoch=steps, verbose=1)


model.optimizer.lr = 0.0001  # lower the learning rate for the second phase
epochs = 10
number_pics_per_batch = 6
steps = len(train_descriptions)//number_pics_per_batch

generator = data_generator(train_descriptions, train_features, wordtoix, max_length1, number_pics_per_batch)
history1 = model.fit_generator(generator, epochs=epochs, steps_per_epoch=steps, verbose=1)
model.save('saved_model/model_' + str(30) + '.h5')

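Let's go through the code:

Lines 1-11: define the model architecture

Lines 13-14: set the weights of the embedding layer to the embedding matrix created above and set trainable = False so that this layer is not trained any further

Lines 16-33: train the model in two separate phases with the hyperparameters described above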
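The number of generator steps per epoch follows directly from the split size and the number of images per batch. The short calculation below assumes exactly 6000 training images, as stated in the dataset section; the variable names are chosen for illustration.

```python
num_train_images = 6000  # size of the Flickr 8k training split stated above

# (epochs, images per batch) for the two training phases
phases = [(20, 3), (10, 6)]

total_steps = 0
steps_per_phase = []
for epochs, pics_per_batch in phases:
    # Same integer division as in the training code.
    steps = num_train_images // pics_per_batch
    steps_per_phase.append(steps)
    total_steps += epochs * steps
    print(epochs, pics_per_batch, steps)

print(total_steps)
```

Each step covers every caption prefix of the images in that batch, so a single step contains many more training samples than images.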

Inference

The training loss was recorded for the first 20 epochs and then for the following 10 epochs.


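For inference we write a function that, given our model, always picks the word with the highest predicted probability as the next word (i.e. greedy search).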

def greedySearch(photo):
    in_text = 'startseq'
    for i in range(max_length1):
        # encode the caption generated so far and pad it to the fixed length
        sequence = [wordtoix[w] for w in in_text.split() if w in wordtoix]
        sequence = pad_sequences([sequence], maxlen=max_length1)
        yhat = model.predict([photo, sequence], verbose=0)
        # greedy choice: take the most probable next word
        yhat = np.argmax(yhat)
        word = ixtoword[yhat]
        in_text += ' ' + word
        if word == 'endseq':
            break
    # drop the start token and the last generated token
    final = in_text.split()
    final = final[1:-1]
    final = ' '.join(final)
    return final

# caption the test image at index 999
pic = list(encoding_test.keys())[999]
photo = encoding_test[pic].reshape((1, 2048))
x = plt.imread(all_images_path + pic)
plt.imshow(x)
plt.show()
print("Greedy:", greedySearch(photo))


The result is quite good.
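The greedy decoding loop can be tried without a trained network. In the sketch below, TOY_MODEL is a hypothetical stand-in for model.predict with made-up probabilities, and greedy_decode is an illustrative name; only the word-selection logic mirrors the function used for inference.

```python
# Hypothetical stand-in for model.predict: maps the text generated so far
# to a probability for each candidate next word. The numbers are made up.
TOY_MODEL = {
    'startseq': {'dog': 0.6, 'cat': 0.3, 'endseq': 0.1},
    'startseq dog': {'runs': 0.7, 'sits': 0.2, 'endseq': 0.1},
    'startseq dog runs': {'endseq': 0.8, 'fast': 0.2},
}


def greedy_decode(model, max_length):
    text = 'startseq'
    for _ in range(max_length):
        probs = model.get(text)
        if probs is None:  # the toy table has no entry for this prefix
            break
        # Greedy step: always take the single most probable next word.
        word = max(probs, key=probs.get)
        text += ' ' + word
        if word == 'endseq':
            break
    words = text.split()[1:]  # drop the start token
    if words and words[-1] == 'endseq':
        words = words[:-1]    # drop the end token if it was produced
    return ' '.join(words)


caption = greedy_decode(TOY_MODEL, max_length=10)
print(caption)
```

Greedy search commits to one word at a time, so it can miss captions whose first word is less likely but whose overall probability is higher.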

Editor: 華軒 Source: 今日頭條
