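Training a small text-to-video model from scratch on NVIDIA T4 and A10 GPUs in a few hours
This article is reprinted with permission from the 計(jì)算機(jī)視覺(jué)研究院 (Computer Vision Research Institute) WeChat public account. Please contact the source before reprinting.
OpenAI's Sora, Stability AI's Stable Video Diffusion, and the many other text-to-video models that have been released or are on the way are among the most popular AI trends of 2024, following large language models (LLMs).
In this blog post, the author shows how to build a small-scale text-to-video model from scratch, covering everything from the theoretical concepts to coding the whole architecture and generating the final result.
Because the author does not have a high-end GPU, only a small-scale architecture was coded. Below is a comparison of the time needed to train the model on different processors.
The author notes that training on a CPU obviously takes much longer. If you need to test code changes quickly and see the results, a CPU is not the best choice. A T4 GPU on Colab or Kaggle is recommended for faster, more efficient training.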
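What we are building
We follow the same approach as traditional machine learning or deep learning models: train on a dataset, then test on unseen data. In the text-to-video setting, suppose we have a training dataset of 100,000 videos of dogs fetching balls and cats chasing mice, and we then train the model to generate videos of a cat fetching a ball or a dog chasing a mouse.
Image source: iStock, GettyImages
Although such training datasets are easy to find on the internet, the compute they require is extremely high. We will therefore use a dataset of moving-object videos generated by Python code, and we will build the model with a GAN (generative adversarial network) architecture rather than the diffusion model used by OpenAI's Sora.
We also tried a diffusion model, but its memory requirements exceeded what we had available. A GAN, on the other hand, is easier and faster to train and test.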
Prerequisites
We will use OOP (object-oriented programming), so you need a basic understanding of it and of neural networks. Prior knowledge of GANs (generative adversarial networks) is not required, because their architecture is introduced briefly here.
- OOP: https://www.youtube.com/watch?v=q2SGW2VgwAM
- Neural network theory: https://www.youtube.com/watch?v=Jy4wM2X21u0
- GAN architecture: https://www.youtube.com/watch?v=TpMIssRdhco
- Python basics: https://www.youtube.com/watch?v=eWRfhZUzrAc
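Understanding the GAN architecture
What is a GAN?
A generative adversarial network is a deep learning model in which two neural networks compete: one creates new data (such as images or music) from a given dataset, and the other judges whether the data is real or fake. The process continues until the generated data can no longer be told apart from the original.
Real-world applications
- Image generation: GANs create realistic images from text prompts or modify existing images, for example by enhancing resolution or adding color to black-and-white photos.
- Data augmentation: GANs generate synthetic data for training other machine learning models, for example fraudulent transaction data for a fraud detection system.
- Filling in missing information: GANs can fill in missing data, for example by generating subsurface images from terrain maps for energy applications.
- Generating 3D models: GANs convert 2D images into 3D models, which is useful in fields such as healthcare, where they can create realistic organ images for surgical planning.
How a GAN works
A GAN consists of two deep neural networks: a generator and a discriminator. The two networks are trained together in an adversarial setting, where one generates new data and the other evaluates whether the data is real or fake.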
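GAN training example
Let's explain the GAN model using image-to-image translation as an example, focusing on modifying human faces.
1. Input image: the input is a real image of a human face.
2. Attribute modification: the generator modifies an attribute of the face, for example by adding sunglasses over the eyes.
3. Generated images: the generator creates a set of images with sunglasses added.
4. The discriminator's task: the discriminator receives a mix of real images (people wearing sunglasses) and generated images (faces with sunglasses added).
5. Evaluation: the discriminator tries to tell the real images from the generated ones.
6. Feedback loop: if the discriminator correctly identifies the fake images, the generator adjusts its parameters to produce more realistic ones. If the generator fools the discriminator, the discriminator updates its parameters to improve its detection.
Through this adversarial process both networks keep improving. The generator gets better at producing realistic images and the discriminator gets better at spotting fakes, until they reach an equilibrium in which the discriminator can no longer distinguish real images from generated ones. At that point the GAN has learned to generate realistic modified images.
Setting the stage
We will use a range of Python libraries, so let's import them.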
# Operating System module for interacting with the operating system
import os
# Module for generating random numbers
import random
# Module for numerical operations
import numpy as np
# OpenCV library for image processing
import cv2
# Python Imaging Library for image processing
from PIL import Image, ImageDraw, ImageFont
# PyTorch library for deep learning
import torch
# Dataset class for creating custom datasets in PyTorch
from torch.utils.data import Dataset
# Module for image transformations
import torchvision.transforms as transforms
# Neural network module in PyTorch
import torch.nn as nn
# Optimization algorithms in PyTorch
import torch.optim as optim
# Function for padding sequences in PyTorch
from torch.nn.utils.rnn import pad_sequence
# Function for saving images in PyTorch
from torchvision.utils import save_image
# Module for plotting graphs and images
import matplotlib.pyplot as plt
# Module for displaying rich content in IPython environments
from IPython.display import clear_output, display, HTML
# Module for encoding and decoding binary data to text
import base64
Now that all the libraries are imported, the next step is to define the training data used to train the GAN architecture.
Coding the training data
We need at least 10,000 videos as training data. Why? Because I tested with fewer videos and the results were very poor, with almost nothing recognizable. The next important question is: what do these videos contain? Our training dataset consists of videos of a circle moving in different directions and with different kinds of motion. Let's write the code, generate 10,000 videos, and see how it looks.
# Create a directory named 'training_dataset'
os.makedirs('training_dataset', exist_ok=True)
# Define the number of videos to generate for the dataset
num_videos = 10000
# Define the number of frames per video (1 Second Video)
frames_per_video = 10
# Define the size of each image in the dataset
img_size = (64, 64)
# Define the size of the shapes (Circle)
shape_size = 10
With the basic parameters set, we next define the text prompts for the training dataset and generate the training videos from them.
# Define text prompts and corresponding movements for circles
prompts_and_movements = [
("circle moving down", "circle", "down"), # Move circle downward
("circle moving left", "circle", "left"), # Move circle leftward
("circle moving right", "circle", "right"), # Move circle rightward
("circle moving diagonally up-right", "circle", "diagonal_up_right"), # Move circle diagonally up-right
("circle moving diagonally down-left", "circle", "diagonal_down_left"), # Move circle diagonally down-left
("circle moving diagonally up-left", "circle", "diagonal_up_left"), # Move circle diagonally up-left
("circle moving diagonally down-right", "circle", "diagonal_down_right"), # Move circle diagonally down-right
("circle rotating clockwise", "circle", "rotate_clockwise"), # Rotate circle clockwise
("circle rotating counter-clockwise", "circle", "rotate_counter_clockwise"), # Rotate circle counter-clockwise
("circle shrinking", "circle", "shrink"), # Shrink circle
("circle expanding", "circle", "expand"), # Expand circle
("circle bouncing vertically", "circle", "bounce_vertical"), # Bounce circle vertically
("circle bouncing horizontally", "circle", "bounce_horizontal"), # Bounce circle horizontally
("circle zigzagging vertically", "circle", "zigzag_vertical"), # Zigzag circle vertically
("circle zigzagging horizontally", "circle", "zigzag_horizontal"), # Zigzag circle horizontally
("circle moving up-left", "circle", "up_left"), # Move circle up-left
("circle moving down-right", "circle", "down_right"), # Move circle down-right
("circle moving down-left", "circle", "down_left"), # Move circle down-left
]
These prompts define several motion trajectories for the circle. Now we need to write some math so that the circle moves according to the prompt.
# Define function with parameters
def create_image_with_moving_shape(size, frame_num, shape, direction):
    # Create a new RGB image with specified size and white background
    img = Image.new('RGB', size, color=(255, 255, 255))
    # Create a drawing context for the image
    draw = ImageDraw.Draw(img)
    # Calculate the center coordinates of the image
    center_x, center_y = size[0] // 2, size[1] // 2
    # Initialize position with center for all movements
    position = (center_x, center_y)
    # Define a dictionary mapping directions to their respective position adjustments or image transformations
    direction_map = {
        # Adjust position downwards based on frame number
        "down": (0, frame_num * 5 % size[1]),
        # Adjust position to the left based on frame number
        "left": (-frame_num * 5 % size[0], 0),
        # Adjust position to the right based on frame number
        "right": (frame_num * 5 % size[0], 0),
        # Adjust position diagonally up and to the right
        "diagonal_up_right": (frame_num * 5 % size[0], -frame_num * 5 % size[1]),
        # Adjust position diagonally down and to the left
        "diagonal_down_left": (-frame_num * 5 % size[0], frame_num * 5 % size[1]),
        # Adjust position diagonally up and to the left
        "diagonal_up_left": (-frame_num * 5 % size[0], -frame_num * 5 % size[1]),
        # Adjust position diagonally down and to the right
        "diagonal_down_right": (frame_num * 5 % size[0], frame_num * 5 % size[1]),
        # Rotate the image clockwise based on frame number
        "rotate_clockwise": img.rotate(frame_num * 10 % 360, center=(center_x, center_y), fillcolor=(255, 255, 255)),
        # Rotate the image counter-clockwise based on frame number
        "rotate_counter_clockwise": img.rotate(-frame_num * 10 % 360, center=(center_x, center_y), fillcolor=(255, 255, 255)),
        # Adjust position for a bouncing effect vertically
        "bounce_vertical": (0, center_y - abs(frame_num * 5 % size[1] - center_y)),
        # Adjust position for a bouncing effect horizontally
        "bounce_horizontal": (center_x - abs(frame_num * 5 % size[0] - center_x), 0),
        # Adjust position for a zigzag effect vertically
        "zigzag_vertical": (0, center_y - frame_num * 5 % size[1]) if frame_num % 2 == 0 else (0, center_y + frame_num * 5 % size[1]),
        # Adjust position for a zigzag effect horizontally
        "zigzag_horizontal": (center_x - frame_num * 5 % size[0], center_y) if frame_num % 2 == 0 else (center_x + frame_num * 5 % size[0], center_y),
        # Adjust position upwards and to the right based on frame number
        "up_right": (frame_num * 5 % size[0], -frame_num * 5 % size[1]),
        # Adjust position upwards and to the left based on frame number
        "up_left": (-frame_num * 5 % size[0], -frame_num * 5 % size[1]),
        # Adjust position downwards and to the right based on frame number
        "down_right": (frame_num * 5 % size[0], frame_num * 5 % size[1]),
        # Adjust position downwards and to the left based on frame number
        "down_left": (-frame_num * 5 % size[0], frame_num * 5 % size[1])
    }
    # Check if direction is in the direction map
    if direction in direction_map:
        # Check if the direction maps to a position adjustment
        if isinstance(direction_map[direction], tuple):
            # Update position based on the adjustment
            position = tuple(np.add(position, direction_map[direction]))
        else:  # If the direction maps to an image transformation
            # Update the image based on the transformation
            img = direction_map[direction]
            # Rebind the drawing context so it targets the transformed image
            draw = ImageDraw.Draw(img)
    # Draw the circle at the computed position.
    # NOTE: this step is missing from the listing as published, which leaves `draw`,
    # `shape`, and `shape_size` unused; the blue fill color is an assumption.
    if shape == "circle":
        x, y = int(position[0]), int(position[1])
        radius = shape_size // 2
        draw.ellipse([x - radius, y - radius, x + radius, y + radius], fill=(0, 0, 255))
    # Return the image as a numpy array
    return np.array(img)
The function above moves our circle in each frame according to the chosen direction. We only need to call it in a loop, once per frame, for every video we want to generate.
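To see what the offset expressions in `direction_map` actually produce, the short sketch below evaluates the "right" and "left" entries for the ten frames of a 64-pixel-wide image. It relies only on two Python rules: unary minus binds tighter than `*` and `%`, and `%` with a positive divisor never returns a negative result. The helper name `x_positions` is hypothetical and not part of the original code.

```python
# Sketch: horizontal center of the circle per frame, mirroring the
# "right" and "left" entries of direction_map (width 64, step 5).
WIDTH = 64
CENTER_X = WIDTH // 2
FRAMES = 10


def x_positions(direction):
    """Return the circle's x center for each frame (hypothetical helper)."""
    positions = []
    for frame_num in range(FRAMES):
        if direction == "right":
            offset = frame_num * 5 % WIDTH
        elif direction == "left":
            # Parsed as ((-frame_num) * 5) % WIDTH, which is never negative
            offset = -frame_num * 5 % WIDTH
        else:
            raise ValueError(f"unsupported direction: {direction}")
        positions.append(CENTER_X + offset)
    return positions


right = x_positions("right")
left = x_positions("left")
print("right:", right)
print("left: ", left)
# Frames whose center lies beyond the right edge of the canvas
print("left frames off-canvas:", [f for f, x in enumerate(left) if x >= WIDTH])
```

Under these expressions the "left" offsets wrap around to large positive values, so the circle's center sits beyond the right edge for several frames and then re-enters while moving leftward. This appears to be one source of samples in which the circle is outside the frame or only partially visible.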
# Iterate over the number of videos to generate
for i in range(num_videos):
    # Randomly choose a prompt and movement from the predefined list
    prompt, shape, direction = random.choice(prompts_and_movements)
    # Create a directory for the current video
    video_dir = f'training_dataset/video_{i}'
    os.makedirs(video_dir, exist_ok=True)
    # Write the chosen prompt to a text file in the video directory
    with open(f'{video_dir}/prompt.txt', 'w') as f:
        f.write(prompt)
    # Generate frames for the current video
    for frame_num in range(frames_per_video):
        # Create an image with a moving shape based on the current frame number, shape, and direction
        img = create_image_with_moving_shape(img_size, frame_num, shape, direction)
        # Save the generated image as a PNG file in the video directory
        cv2.imwrite(f'{video_dir}/frame_{frame_num}.png', img)
Running the code above generates the entire training dataset. Below is the structure of the training dataset files.
Each training video folder contains its frames and the corresponding text prompt. Let's look at a sample from our training dataset.
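As a quick way to picture that layout, the sketch below lists the files that the generation loop writes for a single video. The paths are taken from the f-strings in that loop; the helper name `expected_files` is hypothetical.

```python
# Sketch: relative paths written for one video by the generation loop
frames_per_video = 10


def expected_files(video_index):
    """List the prompt file and frame files for one video (hypothetical helper)."""
    video_dir = f"training_dataset/video_{video_index}"
    files = [f"{video_dir}/prompt.txt"]
    files += [f"{video_dir}/frame_{n}.png" for n in range(frames_per_video)]
    return files


for path in expected_files(0):
    print(path)
```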
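Our training dataset does not include a motion in which the circle moves up and then to the right. We will use this as the test prompt to evaluate how the trained model performs on unseen data.
One more important point: our training data contains many samples in which the object moves out of the scene or is only partially visible in front of the camera, similar to what we observed in OpenAI's Sora demo videos.
We include such samples to test whether the model can stay consistent, without breaking the circle's shape, when the circle enters the scene from a corner.
Now that the training data has been generated, we need to convert the training videos into tensors, the main data type used by deep learning frameworks such as PyTorch. In addition, transformations such as normalization, which scale the data to a smaller range, help improve the convergence and stability of training.
Preprocessing the training data
We have to write a dataset class for the text-to-video task that reads video frames and their corresponding text prompts from the training dataset directory, so that they can be used in PyTorch.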
# Define a dataset class inheriting from torch.utils.data.Dataset
class TextToVideoDataset(Dataset):
    def __init__(self, root_dir, transform=None):
        # Initialize the dataset with root directory and optional transform
        self.root_dir = root_dir
        self.transform = transform
        # List all subdirectories in the root directory
        self.video_dirs = [os.path.join(root_dir, d) for d in os.listdir(root_dir) if os.path.isdir(os.path.join(root_dir, d))]
        # Initialize lists to store frame paths and corresponding prompts
        self.frame_paths = []
        self.prompts = []
        # Loop through each video directory
        for video_dir in self.video_dirs:
            # List all PNG files in the video directory and store their paths
            frames = [os.path.join(video_dir, f) for f in os.listdir(video_dir) if f.endswith('.png')]
            self.frame_paths.extend(frames)
            # Read the prompt text file in the video directory and store its content
            with open(os.path.join(video_dir, 'prompt.txt'), 'r') as f:
                prompt = f.read().strip()
            # Repeat the prompt for each frame in the video and store in prompts list
            self.prompts.extend([prompt] * len(frames))

    # Return the total number of samples in the dataset
    def __len__(self):
        return len(self.frame_paths)

    # Retrieve a sample from the dataset given an index
    def __getitem__(self, idx):
        # Get the path of the frame corresponding to the given index
        frame_path = self.frame_paths[idx]
        # Open the image using PIL (Python Imaging Library)
        image = Image.open(frame_path)
        # Get the prompt corresponding to the given index
        prompt = self.prompts[idx]
        # Apply transformation if specified
        if self.transform:
            image = self.transform(image)
        # Return the transformed image and the prompt
        return image, prompt
Before moving on to the architecture code, we need to normalize the training data. We use a batch size of 16 and shuffle the data to introduce more randomness.
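Implementing the text embedding layer
As you may have seen in the Transformer architecture, the starting point is to convert the text input into embeddings, which are then processed further by multi-head attention. Similarly, we have to write a text embedding layer here. Using this layer, the GAN is trained on our embedded text together with the image tensors.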
# Define a class for text embedding
class TextEmbedding(nn.Module):
    # Constructor method with vocab_size and embed_size parameters
    def __init__(self, vocab_size, embed_size):
        # Call the superclass constructor
        super(TextEmbedding, self).__init__()
        # Initialize embedding layer
        self.embedding = nn.Embedding(vocab_size, embed_size)

    # Define the forward pass method
    def forward(self, x):
        # Return embedded representation of input
        return self.embedding(x)
The vocabulary size depends on our training data and is computed later. The embedding size will be 10. If you work with a larger dataset, you can also use an existing embedding model from Hugging Face.
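To make the role of this layer concrete, here is a small sketch of how one prompt becomes a single fixed-size vector. It uses a hypothetical three-word vocabulary, and the averaging over tokens mirrors the `.mean(dim=0)` call that appears in the training code later in this article.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # only to make the sketch repeatable

# Hypothetical three-word vocabulary; the article builds its own from the prompts
vocab = {"circle": 0, "moving": 1, "down": 2}
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=10)

# Map each word of the prompt to its vocabulary index
token_ids = torch.tensor([vocab[word] for word in "circle moving down".split()])

per_token = embedding(token_ids)       # shape (3, 10): one row per word
prompt_vector = per_token.mean(dim=0)  # shape (10,): one vector per prompt

print(per_token.shape, prompt_vector.shape)
```

Averaging discards word order, which is acceptable for this tiny prompt set but is one reason a pretrained text encoder may work better on richer prompts.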
Implementing the generator layer
Now that we know what the generator does in a GAN, let's code this layer and then go through its contents.
class Generator(nn.Module):
    def __init__(self, text_embed_size):
        super(Generator, self).__init__()
        # Fully connected layer that takes noise and text embedding as input
        self.fc1 = nn.Linear(100 + text_embed_size, 256 * 8 * 8)
        # Transposed convolutional layers to upsample the input
        self.deconv1 = nn.ConvTranspose2d(256, 128, 4, 2, 1)
        self.deconv2 = nn.ConvTranspose2d(128, 64, 4, 2, 1)
        self.deconv3 = nn.ConvTranspose2d(64, 3, 4, 2, 1)  # Output has 3 channels for RGB images
        # Activation functions
        self.relu = nn.ReLU(True)  # ReLU activation function
        self.tanh = nn.Tanh()  # Tanh activation function for final output

    def forward(self, noise, text_embed):
        # Concatenate noise and text embedding along the feature dimension
        x = torch.cat((noise, text_embed), dim=1)
        # Fully connected layer followed by reshaping to 4D tensor
        x = self.fc1(x).view(-1, 256, 8, 8)
        # Upsampling through transposed convolution layers with ReLU activation
        x = self.relu(self.deconv1(x))
        x = self.relu(self.deconv2(x))
        # Final layer with Tanh activation to ensure output values are between -1 and 1 (for images)
        x = self.tanh(self.deconv3(x))
        return x
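The Generator class creates video frames from a combination of random noise and a text embedding, with the aim of producing realistic frames that match the given text description. The network starts with a fully connected layer (nn.Linear) that combines the noise vector and the text embedding into a single feature vector. This vector is then reshaped and passed through a series of transposed convolution layers (nn.ConvTranspose2d) that progressively upsample the feature map to the desired frame size.
These layers use ReLU activations (nn.ReLU) for non-linearity, and the last layer uses a Tanh activation (nn.Tanh) to scale the output to the range [-1, 1]. The generator thus turns an abstract, high-dimensional input into a coherent video frame that visually represents the input text.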
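The upsampling path can be checked with a few lines of arithmetic. The sketch below applies the output-size formula for `nn.ConvTranspose2d` (with dilation 1 and no output padding, which are the defaults the generator relies on) to the three layers above; the helper name is hypothetical.

```python
def conv_transpose2d_out(size, kernel, stride, padding):
    """Output size of nn.ConvTranspose2d with dilation=1 and output_padding=0."""
    return (size - 1) * stride - 2 * padding + kernel


size = 8  # spatial size after reshaping the fully connected output to (256, 8, 8)
sizes = [size]
for _ in range(3):  # deconv1, deconv2, deconv3 all use kernel 4, stride 2, padding 1
    size = conv_transpose2d_out(size, kernel=4, stride=2, padding=1)
    sizes.append(size)

print(sizes)
```

Each layer doubles the spatial size, so the 8x8 feature map ends at 64x64, matching `img_size` in the dataset code.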
Implementing the discriminator layer
With the generator coded, we need to implement the other half, the discriminator.
class Discriminator(nn.Module):
    def __init__(self):
        super(Discriminator, self).__init__()
        # Convolutional layers to process input images
        self.conv1 = nn.Conv2d(3, 64, 4, 2, 1)  # 3 input channels (RGB), 64 output channels, kernel size 4x4, stride 2, padding 1
        self.conv2 = nn.Conv2d(64, 128, 4, 2, 1)  # 64 input channels, 128 output channels, kernel size 4x4, stride 2, padding 1
        self.conv3 = nn.Conv2d(128, 256, 4, 2, 1)  # 128 input channels, 256 output channels, kernel size 4x4, stride 2, padding 1
        # Fully connected layer for classification
        self.fc1 = nn.Linear(256 * 8 * 8, 1)  # Input size 256x8x8 (output size of last convolution), output size 1 (binary classification)
        # Activation functions
        self.leaky_relu = nn.LeakyReLU(0.2, inplace=True)  # Leaky ReLU activation with negative slope 0.2
        self.sigmoid = nn.Sigmoid()  # Sigmoid activation for final output (probability)

    def forward(self, input):
        # Pass input through convolutional layers with LeakyReLU activation
        x = self.leaky_relu(self.conv1(input))
        x = self.leaky_relu(self.conv2(x))
        x = self.leaky_relu(self.conv3(x))
        # Flatten the output of convolutional layers
        x = x.view(-1, 256 * 8 * 8)
        # Pass through fully connected layer with Sigmoid activation for binary classification
        x = self.sigmoid(self.fc1(x))
        return x
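The Discriminator class acts as a binary classifier that distinguishes real video frames from generated ones. Its purpose is to assess how realistic a frame is, which in turn guides the generator toward more realistic output. The network consists of convolutional layers (nn.Conv2d) that extract hierarchical features from the input frame, with Leaky ReLU activations (nn.LeakyReLU) that add non-linearity while still allowing a small gradient for negative values.
The feature map is then flattened and passed through a fully connected layer (nn.Linear), and a sigmoid activation (nn.Sigmoid) outputs a probability score indicating whether the frame is real or fake.
As the discriminator is trained to classify frames accurately, the generator is trained at the same time to create more convincing frames that fool it.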
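The `256 * 8 * 8` input size of the fully connected layer follows from the convolution arithmetic. As a rough check, the sketch below applies the `nn.Conv2d` output-size formula (dilation 1) to a 64x64 frame; the helper name is hypothetical.

```python
def conv2d_out(size, kernel, stride, padding):
    """Output size of nn.Conv2d with dilation=1 (floor division, as in PyTorch)."""
    return (size + 2 * padding - kernel) // stride + 1


size = 64  # height and width of an input frame
sizes = [size]
for _ in range(3):  # conv1, conv2, conv3 all use kernel 4, stride 2, padding 1
    size = conv2d_out(size, kernel=4, stride=2, padding=1)
    sizes.append(size)

# conv3 has 256 output channels, so this is the flattened feature count
flattened = 256 * size * size
print(sizes, flattened)
```

If the frame size were changed, this flattened count would change with it, and both `fc1` and the `view` call in `forward` would need to be updated to match.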
Coding the training parameters
We have to set up the basic components for training the GAN, such as the loss function and the optimizers.
# Check for GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Create a simple vocabulary for text prompts
all_prompts = [prompt for prompt, _, _ in prompts_and_movements] # Extract all prompts from prompts_and_movements list
vocab = {word: idx for idx, word in enumerate(set(" ".join(all_prompts).split()))} # Create a vocabulary dictionary where each unique word is assigned an index
vocab_size = len(vocab) # Size of the vocabulary
embed_size = 10 # Size of the text embedding vector
def encode_text(prompt):
    # Encode a given prompt into a tensor of indices using the vocabulary
    return torch.tensor([vocab[word] for word in prompt.split()])
# Initialize models, loss function, and optimizers
text_embedding = TextEmbedding(vocab_size, embed_size).to(device) # Initialize TextEmbedding model with vocab_size and embed_size
netG = Generator(embed_size).to(device) # Initialize Generator model with embed_size
netD = Discriminator().to(device) # Initialize Discriminator model
criterion = nn.BCELoss().to(device) # Binary Cross Entropy loss function
optimizerD = optim.Adam(netD.parameters(), lr=0.0002, betas=(0.5, 0.999)) # Adam optimizer for Discriminator
optimizerG = optim.Adam(netG.parameters(), lr=0.0002, betas=(0.5, 0.999)) # Adam optimizer for Generator
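This is the part where the code is moved to the GPU, if one is available. We have written code to compute vocab_size, and we use the Adam optimizer for both the generator and the discriminator. You can choose your own optimizer. Here the learning rate is set to a small value, 0.0002, and the embedding size is 10, which is far smaller than that of publicly available Hugging Face models.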
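One detail worth noting about the vocabulary: it is built by enumerating a `set`, and Python randomizes string hashing per process by default, so the word-to-index mapping can differ from one run to the next. That is harmless while training and inference share one session, but it matters if the models are reloaded in a new process. A minimal sketch of a deterministic alternative, assuming it is acceptable to sort the words first, is shown below; `encode_text_ids` is a hypothetical helper that returns a plain list rather than a tensor.

```python
# A few prompts copied from prompts_and_movements
all_prompts = [
    "circle moving down",
    "circle moving left",
    "circle moving diagonally up-right",
]

# sorted() fixes the word order, so indices are identical in every Python process
words = sorted(set(" ".join(all_prompts).split()))
vocab = {word: idx for idx, word in enumerate(words)}


def encode_text_ids(prompt):
    """Return vocabulary indices as a plain list (hypothetical helper)."""
    return [vocab[word] for word in prompt.split()]


print(vocab)
print(encode_text_ids("circle moving up-right"))
```

Note that a prompt can only be encoded if every one of its words occurs in the training prompts; an unknown word raises a `KeyError`.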
Coding the training loop
We code the training of the GAN architecture in much the same way as for any other neural network.
# Number of epochs
num_epochs = 13
# Iterate over each epoch
for epoch in range(num_epochs):
    # Iterate over each batch of data
    for i, (data, prompts) in enumerate(dataloader):
        # Move real data to device
        real_data = data.to(device)
        # Convert prompts to list
        prompts = [prompt for prompt in prompts]
        # Update Discriminator
        netD.zero_grad()  # Zero the gradients of the Discriminator
        batch_size = real_data.size(0)  # Get the batch size
        labels = torch.ones(batch_size, 1).to(device)  # Create labels for real data (ones)
        output = netD(real_data)  # Forward pass real data through Discriminator
        lossD_real = criterion(output, labels)  # Calculate loss on real data
        lossD_real.backward()  # Backward pass to calculate gradients
        # Generate fake data
        noise = torch.randn(batch_size, 100).to(device)  # Generate random noise
        text_embeds = torch.stack([text_embedding(encode_text(prompt).to(device)).mean(dim=0) for prompt in prompts])  # Encode prompts into text embeddings
        fake_data = netG(noise, text_embeds)  # Generate fake data from noise and text embeddings
        labels = torch.zeros(batch_size, 1).to(device)  # Create labels for fake data (zeros)
        output = netD(fake_data.detach())  # Forward pass fake data through Discriminator (detach to avoid gradients flowing back to Generator)
        lossD_fake = criterion(output, labels)  # Calculate loss on fake data
        lossD_fake.backward()  # Backward pass to calculate gradients
        optimizerD.step()  # Update Discriminator parameters
        # Update Generator
        netG.zero_grad()  # Zero the gradients of the Generator
        labels = torch.ones(batch_size, 1).to(device)  # Create labels for fake data (ones) to fool Discriminator
        output = netD(fake_data)  # Forward pass fake data through the just-updated Discriminator
        lossG = criterion(output, labels)  # Calculate loss for Generator based on Discriminator's response
        lossG.backward()  # Backward pass to calculate gradients
        optimizerG.step()  # Update Generator parameters
    # Print epoch information (losses of the last batch in the epoch)
    print(f"Epoch [{epoch + 1}/{num_epochs}] Loss D: {lossD_real + lossD_fake}, Loss G: {lossG}")
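Through backpropagation, the losses are used to adjust the generator and the discriminator. We used 13 epochs in the training loop. We tested other values, but training for more epochs did not change the results much, and the risk of overfitting is high. If the dataset were more diverse, with more motions and shapes, more epochs could be worth considering, but that is not the case here.
When we run this code, training starts and the generator and discriminator losses are printed after each epoch.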
## OUTPUT ##
Epoch [1/13] Loss D: 0.8798642754554749, Loss G: 1.300612449645996
Epoch [2/13] Loss D: 0.8235711455345154, Loss G: 1.3729925155639648
Epoch [3/13] Loss D: 0.6098687052726746, Loss G: 1.3266581296920776
...
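To help read these numbers, the sketch below writes out binary cross-entropy for a single prediction and evaluates the reference point at which the discriminator outputs 0.5 for every frame. It is a per-sample illustration in plain Python; `nn.BCELoss` averages the same quantity over the batch. The function names are hypothetical.

```python
import math


def bce(prediction, target):
    """Binary cross-entropy for one probability and a 0/1 target."""
    return -(target * math.log(prediction) + (1 - target) * math.log(1 - prediction))


def d_loss(p_real, p_fake):
    """Discriminator loss as printed: real frames vs. target 1 plus fakes vs. target 0."""
    return bce(p_real, 1) + bce(p_fake, 0)


def g_loss(p_fake):
    """Generator loss: fake frames scored against target 1."""
    return bce(p_fake, 1)


# Reference point: a discriminator that cannot tell real from fake
print(d_loss(0.5, 0.5))  # equals 2 * ln(2)
print(g_loss(0.5))       # equals ln(2)
```

A Loss D below roughly 1.386 together with a Loss G above roughly 0.693, as in the log above, is consistent with a discriminator that is currently ahead of the generator. This is only a rule of thumb for reading the log, not a convergence test.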
Saving the trained model
Once training is complete, we need to save the discriminator and generator of the trained GAN, which takes just two lines of code.
# Save the Generator model's state dictionary to a file named 'generator.pth'
torch.save(netG.state_dict(), 'generator.pth')
# Save the Discriminator model's state dictionary to a file named 'discriminator.pth'
torch.save(netD.state_dict(), 'discriminator.pth')
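The inference function later in this article also calls `text_embedding`, which the two lines above do not save, so generating videos in a fresh session would presumably need that module (and the same vocabulary) as well. The sketch below shows a state-dict save and load round trip on a small stand-in module; the file name and the sizes are assumptions for illustration.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Small stand-in module; in the article this would be netG, netD, or text_embedding
model = nn.Embedding(num_embeddings=5, embedding_dim=10)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "text_embedding.pth")  # hypothetical file name
    torch.save(model.state_dict(), path)

    # Rebuild the same architecture, then load the saved weights into it
    restored = nn.Embedding(num_embeddings=5, embedding_dim=10)
    restored.load_state_dict(torch.load(path))
    restored.eval()  # switch to inference mode

print(torch.equal(model.weight, restored.weight))
```

A state dict holds only the weights, so the module has to be constructed with the same sizes before `load_state_dict` is called.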
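Generating the AI video
As discussed, testing the model on unseen data follows the earlier example in which the training data shows dogs fetching balls and cats chasing mice, and the test prompts might then involve a cat fetching a ball or a dog chasing a mouse.
In our case, the motion in which the circle moves up and then to the right does not exist in the training data, so the model is not familiar with it. The model has, however, been trained on other motions. We can use this motion as the prompt to test our trained model and observe how it performs.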
# Inference function to generate a video based on a given text prompt
def generate_video(text_prompt, num_frames=10):
    # Create a directory for the generated video frames based on the text prompt
    os.makedirs(f'generated_video_{text_prompt.replace(" ", "_")}', exist_ok=True)
    # Encode the text prompt into a text embedding tensor
    text_embed = text_embedding(encode_text(text_prompt).to(device)).mean(dim=0).unsqueeze(0)
    # Generate frames for the video
    for frame_num in range(num_frames):
        # Generate random noise
        noise = torch.randn(1, 100).to(device)
        # Generate a fake frame using the Generator network
        with torch.no_grad():
            fake_frame = netG(noise, text_embed)
        # Save the generated fake frame as an image file
        save_image(fake_frame, f'generated_video_{text_prompt.replace(" ", "_")}/frame_{frame_num}.png')
# usage of the generate_video function with a specific text prompt
generate_video('circle moving up-right')
Running the code above creates a directory containing all the frames of the generated video. We then need a little more code to merge these frames into a short video.
# Define the path to your folder containing the PNG frames
folder_path = 'generated_video_circle_moving_up-right'
# Get the list of all PNG files in the folder
image_files = [f for f in os.listdir(folder_path) if f.endswith('.png')]
# Sort the images by frame number (files are named frame_<n>.png), so frame_10 would follow frame_9
image_files.sort(key=lambda name: int(name.split('_')[1].split('.')[0]))
# Create a list to store the frames
frames = []
# Read each image and append it to the frames list
for image_file in image_files:
    image_path = os.path.join(folder_path, image_file)
    frame = cv2.imread(image_path)
    frames.append(frame)
# Convert the frames list to a numpy array for easier processing
frames = np.array(frames)
# Define the frame rate (frames per second)
fps = 10
# Create a video writer object
fourcc = cv2.VideoWriter_fourcc(*'XVID')
out = cv2.VideoWriter('generated_video.avi', fourcc, fps, (frames[0].shape[1], frames[0].shape[0]))
# Write each frame to the video
for frame in frames:
    out.write(frame)
# Release the video writer
out.release()
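Make sure the folder path points to the location of your newly generated frames. After running this code you will have created your AI video. Let's see what it looks like.
We trained the model several times, each time with the same number of epochs. In both cases shown, the circle starts at the bottom and is half visible. The good news is that in both cases the model attempts the up-right motion.
For example, in attempt 1 the circle moves diagonally upward and then straight up, while in attempt 2 it moves diagonally while shrinking in size. In neither case does the circle move left or disappear completely, which is a good sign.
Finally, the author reports having tested various aspects of the architecture and found that the training data is the key. Including more motions and shapes in the dataset increases variability and improves the model's performance. Because the data is generated by code, producing more varied data does not take much time, so you can focus on refining the logic instead.
In addition, the GAN architecture discussed in this article is relatively simple. You can make it more sophisticated by integrating advanced techniques or by using language model (LLM) embeddings instead of a basic neural network embedding. Tuning parameters such as the embedding size can also significantly affect how well the model works.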