
From Scratch: Training a Small Text-to-Video Model on NVIDIA T4 and A10 GPUs in a Few Hours

Author: Computer Vision Research Institute, 2024-07-03 10:20:25


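This article is republished with permission from the Computer Vision Research Institute WeChat official account. For reprints, please contact the original source.

OpenAI's Sora, Stability AI's Stable Video Diffusion, and many other text-to-video models that have been released or are on the way are among the most popular AI trends of 2024, following large language models (LLMs).

In this blog post, the author shows how to build a small-scale text-to-video model from scratch, covering everything from the theory, through coding the full architecture, to generating the final result.

Because the author has no access to high-end GPUs, only a small-scale architecture was coded. The original post compared the time required to train the model on different processors.

The author notes that training on a CPU obviously takes much longer. If you need to test code changes quickly and see the results, a CPU is not the best choice. The author therefore recommends the T4 GPU on Colab or Kaggle for faster, more efficient training.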

What We Are Building

We follow the same approach as traditional machine learning or deep learning models: train on a dataset, then test on unseen data. In the text-to-video setting, suppose we have a training dataset of 100,000 videos of dogs fetching balls and cats chasing mice. We then train the model to generate videos of a cat fetching a ball or a dog chasing a mouse.

Image source: iStock, GettyImages

Although such training datasets are easy to find on the internet, the compute they require is extremely high. We will therefore use a dataset of moving-object videos generated with Python code, and we will build the model with a GAN (generative adversarial network) architecture rather than the diffusion model used by OpenAI Sora.

We also tried a diffusion model, but its memory requirements were beyond our means. GANs, by contrast, are easier and faster to train and test.

Prerequisites

We will use OOP (object-oriented programming), so you need a basic understanding of it and of neural networks. Prior knowledge of GANs is not required, because their architecture is briefly introduced here.

OOP: https://www.youtube.com/watch?v=q2SGW2VgwAM
Neural network theory: https://www.youtube.com/watch?v=Jy4wM2X21u0
GAN architecture: https://www.youtube.com/watch?v=TpMIssRdhco
Python basics: https://www.youtube.com/watch?v=eWRfhZUzrAc

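Understanding the GAN Architecture

What Is a GAN?

A generative adversarial network is a deep learning model in which two neural networks compete: one creates new data (such as images or music) from a given dataset, and the other judges whether the data is real or fake. The process continues until the generated data is indistinguishable from the original data.

Real-World Applications

Image generation: GANs create realistic images from text prompts or modify existing images, for example by increasing resolution or colorizing black-and-white photos.
Data augmentation: GANs generate synthetic data for training other machine learning models, for example fraudulent transaction data for a fraud detection system.
Filling in missing information: GANs can fill in missing data, for example generating subsurface images from terrain maps for energy applications.
3D model generation: GANs convert 2D images into 3D models, which is useful in fields such as healthcare, for example to create realistic organ images for surgical planning.

How GANs Work

A GAN consists of two deep neural networks: a generator and a discriminator. The two networks are trained together in an adversarial setup, where one generates new data and the other evaluates whether the data is real or fake.

A GAN Training Example

Let's explain the GAN model with an image-to-image translation example that focuses on modifying human faces.

1. Input image: the input is a real image of a face.

2. Attribute modification: the generator modifies an attribute of the face, such as adding sunglasses over the eyes.

3. Generated images: the generator creates a set of images with sunglasses added.

4. The discriminator's task: the discriminator receives a mix of real images (people wearing sunglasses) and generated images (faces with sunglasses added).

5. Evaluation: the discriminator tries to tell the real images from the generated ones.

6. Feedback loop: if the discriminator correctly identifies the fake images, the generator adjusts its parameters to produce more convincing images. If the generator fools the discriminator, the discriminator updates its parameters to improve its detection.

Through this adversarial process, both networks keep improving. The generator gets better at producing realistic images and the discriminator gets better at spotting fakes, until an equilibrium is reached where the discriminator can no longer distinguish real images from generated ones. At that point, the GAN has learned to generate realistic modified images.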
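The feedback loop can be made concrete with binary cross-entropy, the loss that the training code later in this article uses (nn.BCELoss). The following is a minimal sketch in plain Python, not the article's code; the helper names bce, discriminator_loss, and generator_loss and the example probabilities are illustrative assumptions.

```python
import math


def bce(prediction, target):
    """Binary cross-entropy for one probability in (0, 1) and a 0/1 target."""
    return -(target * math.log(prediction) + (1 - target) * math.log(1 - prediction))


def discriminator_loss(d_real, d_fake):
    # The discriminator wants D(real) close to 1 and D(fake) close to 0
    return bce(d_real, 1) + bce(d_fake, 0)


def generator_loss(d_fake):
    # The generator wants the discriminator to output 1 for its fakes
    return bce(d_fake, 1)


# Assumed early-training values: the discriminator spots fakes easily
early_d = discriminator_loss(d_real=0.9, d_fake=0.1)
early_g = generator_loss(d_fake=0.1)

# At equilibrium the discriminator outputs 0.5 for every input
eq_d = discriminator_loss(d_real=0.5, d_fake=0.5)
eq_g = generator_loss(d_fake=0.5)

print(f"early:       loss D = {early_d:.3f}, loss G = {early_g:.3f}")
print(f"equilibrium: loss D = {eq_d:.3f}, loss G = {eq_g:.3f}")
```

At equilibrium the discriminator loss is 2 ln 2 (about 1.386) and the generator loss is ln 2 (about 0.693), which gives a rough reference point when reading the losses printed during training.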

Setting the Stage

We will use a number of Python libraries, so let's import them.

# Operating System module for interacting with the operating system
import os


# Module for generating random numbers
import random


# Module for numerical operations
import numpy as np


# OpenCV library for image processing
import cv2


# Python Imaging Library for image processing
from PIL import Image, ImageDraw, ImageFont


# PyTorch library for deep learning
import torch


# Dataset class for creating custom datasets in PyTorch
from torch.utils.data import Dataset


# Module for image transformations
import torchvision.transforms as transforms


# Neural network module in PyTorch
import torch.nn as nn


# Optimization algorithms in PyTorch
import torch.optim as optim


# Function for padding sequences in PyTorch
from torch.nn.utils.rnn import pad_sequence


# Function for saving images in PyTorch
from torchvision.utils import save_image


# Module for plotting graphs and images
import matplotlib.pyplot as plt


# Module for displaying rich content in IPython environments
from IPython.display import clear_output, display, HTML


# Module for encoding and decoding binary data to text
import base64

Now that all the libraries are imported, the next step is to define the training data used to train our GAN architecture.

Coding the Training Data

We need at least 10,000 videos as training data. Why? Because I tested with smaller numbers and the results were very poor, with almost nothing recognizable. The next important question is: what do these videos contain? Our training dataset consists of videos of a circle moving in different directions and with different kinds of motion. Let's code it, generate 10,000 videos, and see what they look like.

# Create a directory named 'training_dataset'
os.makedirs('training_dataset', exist_ok=True)


# Define the number of videos to generate for the dataset
num_videos = 10000


# Define the number of frames per video (1 Second Video)
frames_per_video = 10


# Define the size of each image in the dataset
img_size = (64, 64)


# Define the size of the shapes (Circle)
shape_size = 10

With these basic parameters set, we next define the text prompts for the training dataset, from which the training videos are generated.

# Define text prompts and corresponding movements for circles
prompts_and_movements = [
 ("circle moving down", "circle", "down"), # Move circle downward
 ("circle moving left", "circle", "left"), # Move circle leftward
 ("circle moving right", "circle", "right"), # Move circle rightward
 ("circle moving diagonally up-right", "circle", "diagonal_up_right"), # Move circle diagonally up-right
 ("circle moving diagonally down-left", "circle", "diagonal_down_left"), # Move circle diagonally down-left
 ("circle moving diagonally up-left", "circle", "diagonal_up_left"), # Move circle diagonally up-left
 ("circle moving diagonally down-right", "circle", "diagonal_down_right"), # Move circle diagonally down-right
 ("circle rotating clockwise", "circle", "rotate_clockwise"), # Rotate circle clockwise
 ("circle rotating counter-clockwise", "circle", "rotate_counter_clockwise"), # Rotate circle counter-clockwise
 ("circle shrinking", "circle", "shrink"), # Shrink circle
 ("circle expanding", "circle", "expand"), # Expand circle
 ("circle bouncing vertically", "circle", "bounce_vertical"), # Bounce circle vertically
 ("circle bouncing horizontally", "circle", "bounce_horizontal"), # Bounce circle horizontally
 ("circle zigzagging vertically", "circle", "zigzag_vertical"), # Zigzag circle vertically
 ("circle zigzagging horizontally", "circle", "zigzag_horizontal"), # Zigzag circle horizontally
 ("circle moving up-left", "circle", "up_left"), # Move circle up-left
 ("circle moving down-right", "circle", "down_right"), # Move circle down-right
 ("circle moving down-left", "circle", "down_left"), # Move circle down-left
]

We have used these prompts to define several motion paths for the circle. Now we need to write some math to move the circle according to the prompt.

# Build one frame: a circle placed according to the frame number and direction
def create_image_with_moving_shape(size, frame_num, shape, direction):

    # Create a new RGB image with specified size and white background
    img = Image.new('RGB', size, color=(255, 255, 255))

    # Calculate the center coordinates of the image
    center_x, center_y = size[0] // 2, size[1] // 2

    # Initialize position with center for all movements
    position = (center_x, center_y)

    # Define a dictionary mapping directions to their respective position adjustments or image transformations
    direction_map = {
        # Adjust position downwards based on frame number
        "down": (0, frame_num * 5 % size[1]),
        # Adjust position to the left based on frame number
        "left": (-frame_num * 5 % size[0], 0),
        # Adjust position to the right based on frame number
        "right": (frame_num * 5 % size[0], 0),
        # Adjust position diagonally up and to the right
        "diagonal_up_right": (frame_num * 5 % size[0], -frame_num * 5 % size[1]),
        # Adjust position diagonally down and to the left
        "diagonal_down_left": (-frame_num * 5 % size[0], frame_num * 5 % size[1]),
        # Adjust position diagonally up and to the left
        "diagonal_up_left": (-frame_num * 5 % size[0], -frame_num * 5 % size[1]),
        # Adjust position diagonally down and to the right
        "diagonal_down_right": (frame_num * 5 % size[0], frame_num * 5 % size[1]),
        # Rotate the image clockwise based on frame number
        "rotate_clockwise": img.rotate(frame_num * 10 % 360, center=(center_x, center_y), fillcolor=(255, 255, 255)),
        # Rotate the image counter-clockwise based on frame number
        "rotate_counter_clockwise": img.rotate(-frame_num * 10 % 360, center=(center_x, center_y), fillcolor=(255, 255, 255)),
        # Adjust position for a bouncing effect vertically
        "bounce_vertical": (0, center_y - abs(frame_num * 5 % size[1] - center_y)),
        # Adjust position for a bouncing effect horizontally
        "bounce_horizontal": (center_x - abs(frame_num * 5 % size[0] - center_x), 0),
        # Adjust position for a zigzag effect vertically
        "zigzag_vertical": (0, center_y - frame_num * 5 % size[1]) if frame_num % 2 == 0 else (0, center_y + frame_num * 5 % size[1]),
        # Adjust position for a zigzag effect horizontally
        "zigzag_horizontal": (center_x - frame_num * 5 % size[0], center_y) if frame_num % 2 == 0 else (center_x + frame_num * 5 % size[0], center_y),
        # Adjust position upwards and to the right based on frame number
        "up_right": (frame_num * 5 % size[0], -frame_num * 5 % size[1]),
        # Adjust position upwards and to the left based on frame number
        "up_left": (-frame_num * 5 % size[0], -frame_num * 5 % size[1]),
        # Adjust position downwards and to the right based on frame number
        "down_right": (frame_num * 5 % size[0], frame_num * 5 % size[1]),
        # Adjust position downwards and to the left based on frame number
        "down_left": (-frame_num * 5 % size[0], frame_num * 5 % size[1])
    }

    # Check if direction is in the direction map
    if direction in direction_map:
        # Check if the direction maps to a position adjustment
        if isinstance(direction_map[direction], tuple):
            # Update position based on the adjustment (cast NumPy integers to plain ints)
            position = tuple(int(v) for v in np.add(position, direction_map[direction]))
        else:  # If the direction maps to an image transformation
            # Update the image based on the transformation
            img = direction_map[direction]

    # NOTE: the listing as published never draws the shape, so every frame would be blank.
    # The drawing step below is an assumed reconstruction; the black fill is an assumption.
    if shape == "circle":
        draw = ImageDraw.Draw(img)
        radius = shape_size // 2
        draw.ellipse([position[0] - radius, position[1] - radius,
                      position[0] + radius, position[1] + radius], fill=(0, 0, 0))

    # Return the image as a numpy array
    return np.array(img)

The function above moves the circle in each frame according to the chosen direction. We just need to run it in a loop, once per video, until all the videos are generated.
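To see what the offsets in direction_map evaluate to, the sketch below prints the horizontal center of the circle for the "right" and "left" directions over the 10 frames of a 64-pixel-wide image. It only re-evaluates the expressions from the function above; the helper name x_positions is an assumption. Note that Python parses -frame_num * 5 % 64 as ((-frame_num) * 5) % 64, which is never negative.

```python
WIDTH = 64
CENTER_X = WIDTH // 2
FRAMES = 10


def x_positions(direction):
    """Horizontal circle centers per frame, using the article's offset expressions."""
    positions = []
    for frame_num in range(FRAMES):
        if direction == "right":
            offset = frame_num * 5 % WIDTH
        elif direction == "left":
            offset = -frame_num * 5 % WIDTH  # parsed as ((-frame_num) * 5) % WIDTH
        else:
            raise ValueError(f"unsupported direction: {direction}")
        positions.append(CENTER_X + offset)
    return positions


right = x_positions("right")
left = x_positions("left")
print("right:", right)  # [32, 37, 42, 47, 52, 57, 62, 67, 72, 77]
print("left: ", left)   # [32, 91, 86, 81, 76, 71, 66, 61, 56, 51]
```

For "left", the center jumps past the right edge after the first frame and then drifts back in, so the circle spends several frames outside or only partly inside the 64-pixel image. This is consistent with the article's remark that many training samples show the object leaving the scene or only partially visible.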

# Iterate over the number of videos to generate
for i in range(num_videos):
    # Randomly choose a prompt and movement from the predefined list
    prompt, shape, direction = random.choice(prompts_and_movements)
    
    # Create a directory for the current video
    video_dir = f'training_dataset/video_{i}'
    os.makedirs(video_dir, exist_ok=True)
    
    # Write the chosen prompt to a text file in the video directory
    with open(f'{video_dir}/prompt.txt', 'w') as f:
        f.write(prompt)
    
    # Generate frames for the current video
    for frame_num in range(frames_per_video):
        # Create an image with a moving shape based on the current frame number, shape, and direction
        img = create_image_with_moving_shape(img_size, frame_num, shape, direction)
        
        # Save the generated image as a PNG file in the video directory
        cv2.imwrite(f'{video_dir}/frame_{frame_num}.png', img)

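Running the code above generates the entire training dataset under training_dataset/, with one video_{i} folder per video.

Each training video folder contains its frames (frame_0.png through frame_9.png) together with the corresponding text prompt in prompt.txt.

Our training dataset does not include the motion of the circle moving up and then to the right. We will use it as the test prompt to evaluate how the trained model performs on unseen data.

One more important point: our training data contains many samples in which the object moves out of the scene or appears only partially in front of the camera, similar to what we observed in the OpenAI Sora demo videos.

The reason for including such samples is to test whether the model can keep the circle consistent, without breaking its shape, when it enters the scene from a corner.

Now that the training data is generated, we need to convert the training videos to tensors, the primary data type in deep learning frameworks such as PyTorch. In addition, transformations such as normalization, which scale the data to a smaller range, help improve the convergence and stability of training.

Preprocessing the Training Data

We have to write a dataset class for the text-to-video task that reads the video frames and their corresponding text prompts from the training dataset directory so that they can be used in PyTorch.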

# Define a dataset class inheriting from torch.utils.data.Dataset
class TextToVideoDataset(Dataset):
    def __init__(self, root_dir, transform=None):
        # Initialize the dataset with root directory and optional transform
        self.root_dir = root_dir
        self.transform = transform
        # List all subdirectories in the root directory
        self.video_dirs = [os.path.join(root_dir, d) for d in os.listdir(root_dir) if os.path.isdir(os.path.join(root_dir, d))]
        # Initialize lists to store frame paths and corresponding prompts
        self.frame_paths = []
        self.prompts = []


        # Loop through each video directory
        for video_dir in self.video_dirs:
            # List all PNG files in the video directory and store their paths
            frames = [os.path.join(video_dir, f) for f in os.listdir(video_dir) if f.endswith('.png')]
            self.frame_paths.extend(frames)
            # Read the prompt text file in the video directory and store its content
            with open(os.path.join(video_dir, 'prompt.txt'), 'r') as f:
                prompt = f.read().strip()
            # Repeat the prompt for each frame in the video and store in prompts list
            self.prompts.extend([prompt] * len(frames))


    # Return the total number of samples in the dataset
    def __len__(self):
        return len(self.frame_paths)


    # Retrieve a sample from the dataset given an index
    def __getitem__(self, idx):
        # Get the path of the frame corresponding to the given index
        frame_path = self.frame_paths[idx]
        # Open the image using PIL (Python Imaging Library)
        image = Image.open(frame_path)
        # Get the prompt corresponding to the given index
        prompt = self.prompts[idx]


        # Apply transformation if specified
        if self.transform:
            image = self.transform(image)


        # Return the transformed image and the prompt
        return image, prompt

Before coding the architecture, we need to normalize the training data. We use a batch size of 16 and shuffle the data to introduce more randomness.

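Implementing the Text Embedding Layer

You may have seen that in the Transformer architecture, the starting point is converting the text input into embeddings for further processing in multi-head attention. Similarly, we have to code a text embedding layer here. The GAN is then trained on the embedded text together with the image tensors.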

# Define a class for text embedding
class TextEmbedding(nn.Module):
    # Constructor method with vocab_size and embed_size parameters
    def __init__(self, vocab_size, embed_size):
        # Call the superclass constructor
        super(TextEmbedding, self).__init__()
        # Initialize embedding layer
        self.embedding = nn.Embedding(vocab_size, embed_size)


    # Define the forward pass method
    def forward(self, x):
        # Return embedded representation of input
        return self.embedding(x)

The vocabulary size depends on our training data and is computed later. The embedding size is 10. With a larger dataset, you could also use one of the embedding models already available on Hugging Face.
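As a rough illustration of how this layer is used later in the training loop, the sketch below embeds each word of a prompt and averages the word vectors into one fixed-size vector. The three-word vocabulary is a hypothetical stand-in for the one computed from the training prompts, and the seed is set only to make the sketch reproducible.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # only for reproducibility of this sketch

embed_size = 10
# Hypothetical three-word vocabulary for illustration
vocab = {"circle": 0, "moving": 1, "down": 2}
embedding = nn.Embedding(len(vocab), embed_size)

prompt = "circle moving down"
indices = torch.tensor([vocab[word] for word in prompt.split()])

per_word = embedding(indices)          # one 10-dimensional vector per word
prompt_vector = per_word.mean(dim=0)   # averaged into a single 10-dimensional vector

print(per_word.shape, prompt_vector.shape)
```

Averaging yields a vector of the same size regardless of prompt length, at the cost of discarding word order.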

Implementing the Generator Layer

Now that we know what the generator does in a GAN, let's code this layer and then walk through its contents.

class Generator(nn.Module):
    def __init__(self, text_embed_size):
        super(Generator, self).__init__()

        # Fully connected layer that takes noise and text embedding as input
        self.fc1 = nn.Linear(100 + text_embed_size, 256 * 8 * 8)

        # Transposed convolutional layers to upsample the input
        self.deconv1 = nn.ConvTranspose2d(256, 128, 4, 2, 1)
        self.deconv2 = nn.ConvTranspose2d(128, 64, 4, 2, 1)
        self.deconv3 = nn.ConvTranspose2d(64, 3, 4, 2, 1)  # Output has 3 channels for RGB images

        # Activation functions
        self.relu = nn.ReLU(True)  # ReLU activation function
        self.tanh = nn.Tanh()  # Tanh activation function for final output

    def forward(self, noise, text_embed):
        # Concatenate noise and text embedding along the channel dimension
        x = torch.cat((noise, text_embed), dim=1)

        # Fully connected layer followed by reshaping to 4D tensor
        x = self.fc1(x).view(-1, 256, 8, 8)

        # Upsampling through transposed convolution layers with ReLU activation
        x = self.relu(self.deconv1(x))
        x = self.relu(self.deconv2(x))

        # Final layer with Tanh activation to ensure output values are between -1 and 1 (for images)
        x = self.tanh(self.deconv3(x))

        return x

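The Generator class creates video frames from a combination of random noise and the text embedding, aiming to produce realistic frames that match the given text description. The network starts with a fully connected layer (nn.Linear) that combines the noise vector and the text embedding into a single feature vector. That vector is then reshaped and passed through a series of transposed convolution layers (nn.ConvTranspose2d), which progressively upsample the feature maps to the desired frame size.

These layers use ReLU activations (nn.ReLU) for non-linearity, and the final layer uses a Tanh activation (nn.Tanh) to scale the output to the range [-1, 1]. The generator thus transforms an abstract, high-dimensional input into a coherent video frame that visually represents the input text.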
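The upsampling path can be checked with the output-size formula for a transposed convolution with dilation 1 and no output padding: out = (in - 1) * stride - 2 * padding + kernel. The sketch below applies it to the three layers above (kernel 4, stride 2, padding 1); conv_transpose_out is a hypothetical helper, not a PyTorch function.

```python
def conv_transpose_out(size, kernel=4, stride=2, padding=1):
    """Output size of a transposed convolution with dilation 1 and no output padding."""
    return (size - 1) * stride - 2 * padding + kernel


fc_outputs = 256 * 8 * 8   # features produced by fc1 before reshaping
sizes = [8]                # spatial size after reshaping to (256, 8, 8)
for _ in range(3):         # deconv1, deconv2, deconv3
    sizes.append(conv_transpose_out(sizes[-1]))

print(fc_outputs)  # 16384
print(sizes)       # [8, 16, 32, 64]
```

Each layer doubles the spatial size, so the 8x8 feature map reaches the 64x64 frame size used for the training images.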

Implementing the Discriminator Layer

With the generator coded, we need to implement the other half: the discriminator.

class Discriminator(nn.Module):
    def __init__(self):
        super(Discriminator, self).__init__()
        
        # Convolutional layers to process input images
        self.conv1 = nn.Conv2d(3, 64, 4, 2, 1)   # 3 input channels (RGB), 64 output channels, kernel size 4x4, stride 2, padding 1
        self.conv2 = nn.Conv2d(64, 128, 4, 2, 1) # 64 input channels, 128 output channels, kernel size 4x4, stride 2, padding 1
        self.conv3 = nn.Conv2d(128, 256, 4, 2, 1) # 128 input channels, 256 output channels, kernel size 4x4, stride 2, padding 1
        
        # Fully connected layer for classification
        self.fc1 = nn.Linear(256 * 8 * 8, 1)  # Input size 256x8x8 (output size of last convolution), output size 1 (binary classification)
        
        # Activation functions
        self.leaky_relu = nn.LeakyReLU(0.2, inplace=True)  # Leaky ReLU activation with negative slope 0.2
        self.sigmoid = nn.Sigmoid()  # Sigmoid activation for final output (probability)


    def forward(self, input):
        # Pass input through convolutional layers with LeakyReLU activation
        x = self.leaky_relu(self.conv1(input))
        x = self.leaky_relu(self.conv2(x))
        x = self.leaky_relu(self.conv3(x))
        
        # Flatten the output of convolutional layers
        x = x.view(-1, 256 * 8 * 8)
        
        # Pass through fully connected layer with Sigmoid activation for binary classification
        x = self.sigmoid(self.fc1(x))
        
        return x

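The Discriminator class acts as a binary classifier that distinguishes real video frames from generated ones. Its purpose is to assess the authenticity of frames and thereby guide the generator toward more realistic output. The network consists of convolutional layers (nn.Conv2d) that extract hierarchical features from the input frame, with Leaky ReLU activations (nn.LeakyReLU) adding non-linearity while allowing a small gradient for negative values.

The feature maps are then flattened and passed through a fully connected layer (nn.Linear), ending in a sigmoid activation (nn.Sigmoid) that outputs a probability score indicating whether the frame is real or fake.

As the discriminator is trained to classify frames accurately, the generator is trained at the same time to create more convincing frames that fool it.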
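The hard-coded 256 * 8 * 8 in fc1 ties the discriminator to 64x64 inputs. A small sketch, using the standard convolution output-size formula out = floor((in + 2 * padding - kernel) / stride) + 1, shows where that number comes from and what would happen with a different frame size; conv_out and flattened_features are hypothetical helpers.

```python
def conv_out(size, kernel=4, stride=2, padding=1):
    """Output size of a convolution with dilation 1."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_features(frame_size, channels=256, num_layers=3):
    """Number of features entering fc1 for a square input frame."""
    size = frame_size
    for _ in range(num_layers):  # conv1, conv2, conv3
        size = conv_out(size)
    return channels * size * size


print(flattened_features(64))   # 16384, equal to 256 * 8 * 8
print(flattened_features(128))  # 65536, so fc1 would need to be resized
```

Changing img_size therefore requires changing both the discriminator's fc1 input size and the generator's fc1 output size.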

Coding the Training Parameters

We have to set up the basic components for training the GAN, such as the loss function and the optimizers.

# Check for GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


# Create a simple vocabulary for text prompts
all_prompts = [prompt for prompt, _, _ in prompts_and_movements]  # Extract all prompts from prompts_and_movements list
vocab = {word: idx for idx, word in enumerate(sorted(set(" ".join(all_prompts).split())))}  # Create a vocabulary dictionary where each unique word is assigned an index (sorted so the mapping is the same in every run)
vocab_size = len(vocab)  # Size of the vocabulary
embed_size = 10  # Size of the text embedding vector


def encode_text(prompt):
    # Encode a given prompt into a tensor of indices using the vocabulary
    return torch.tensor([vocab[word] for word in prompt.split()])


# Initialize models, loss function, and optimizers
text_embedding = TextEmbedding(vocab_size, embed_size).to(device)  # Initialize TextEmbedding model with vocab_size and embed_size
netG = Generator(embed_size).to(device)  # Initialize Generator model with embed_size
netD = Discriminator().to(device)  # Initialize Discriminator model
criterion = nn.BCELoss().to(device)  # Binary Cross Entropy loss function
optimizerD = optim.Adam(netD.parameters(), lr=0.0002, betas=(0.5, 0.999))  # Adam optimizer for Discriminator
optimizerG = optim.Adam(netG.parameters(), lr=0.0002, betas=(0.5, 0.999))  # Adam optimizer for Generator

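This is the part where the code is set up to run on a GPU if one is available. We compute vocab_size from the prompts and use the Adam optimizer for both the generator and the discriminator; you can choose your own optimizer. We set the learning rate to a small value of 0.0002 and the embedding size to 10, which is much smaller than that of publicly available Hugging Face models.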
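Two details matter for encode_text in practice. The word-to-index mapping should be identical in every Python process, which iterating over an unsorted set of strings does not guarantee, and a prompt containing a word outside the vocabulary fails with a bare KeyError. The sketch below addresses both in plain Python; encode_checked is a hypothetical variant of encode_text that returns a list of indices rather than a tensor, and only three of the article's prompts are repeated so that the block runs on its own.

```python
# A few of the article's training prompts, repeated so the sketch is self-contained
all_prompts = [
    "circle moving down",
    "circle moving diagonally up-right",
    "circle bouncing vertically",
]

# sorted() gives every run the same word-to-index mapping
vocab = {word: idx for idx, word in enumerate(sorted(set(" ".join(all_prompts).split())))}


def encode_checked(prompt):
    """Hypothetical variant of encode_text that reports unknown words clearly."""
    unknown = [word for word in prompt.split() if word not in vocab]
    if unknown:
        raise ValueError(f"words not in training vocabulary: {unknown}")
    return [vocab[word] for word in prompt.split()]


print(vocab)
print(encode_checked("circle moving up-right"))  # [1, 4, 5]
```

The test prompt "circle moving up-right" encodes without error because each of its words appears in some training prompt, even though the full phrase does not.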

Coding the Training Loop

As with any other neural network, we code the GAN training in a similar way.

# Number of epochs
num_epochs = 13


# Iterate over each epoch
for epoch in range(num_epochs):
    # Iterate over each batch of data
    for i, (data, prompts) in enumerate(dataloader):
        # Move real data to device
        real_data = data.to(device)
        
        # Convert prompts to list
        prompts = [prompt for prompt in prompts]


        # Update Discriminator
        netD.zero_grad()  # Zero the gradients of the Discriminator
        batch_size = real_data.size(0)  # Get the batch size
        labels = torch.ones(batch_size, 1).to(device)  # Create labels for real data (ones)
        output = netD(real_data)  # Forward pass real data through Discriminator
        lossD_real = criterion(output, labels)  # Calculate loss on real data
        lossD_real.backward()  # Backward pass to calculate gradients
       
        # Generate fake data
        noise = torch.randn(batch_size, 100).to(device)  # Generate random noise
        text_embeds = torch.stack([text_embedding(encode_text(prompt).to(device)).mean(dim=0) for prompt in prompts])  # Encode prompts into text embeddings
        fake_data = netG(noise, text_embeds)  # Generate fake data from noise and text embeddings
        labels = torch.zeros(batch_size, 1).to(device)  # Create labels for fake data (zeros)
        output = netD(fake_data.detach())  # Forward pass fake data through Discriminator (detach to avoid gradients flowing back to Generator)
        lossD_fake = criterion(output, labels)  # Calculate loss on fake data
        lossD_fake.backward()  # Backward pass to calculate gradients
        optimizerD.step()  # Update Discriminator parameters


        # Update Generator
        netG.zero_grad()  # Zero the gradients of the Generator
        labels = torch.ones(batch_size, 1).to(device)  # Create labels for fake data (ones) to fool Discriminator
        output = netD(fake_data)  # Forward pass fake data (now updated) through Discriminator
        lossG = criterion(output, labels)  # Calculate loss for Generator based on Discriminator's response
        lossG.backward()  # Backward pass to calculate gradients
        optimizerG.step()  # Update Generator parameters
    
    # Print epoch information
    print(f"Epoch [{epoch + 1}/{num_epochs}] Loss D: {(lossD_real + lossD_fake).item()}, Loss G: {lossG.item()}")

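Through backpropagation, the losses are used to adjust both the generator and the discriminator. We used 13 epochs in the training loop. We tested other values, but going beyond this number did not change the results much, and the risk of overfitting is high. If our dataset were more diverse, with more motions and shapes, a higher number of epochs could be considered, but that is not the case here.

When we run this code, training starts and the generator and discriminator losses are printed after each epoch.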

## OUTPUT ##


Epoch [1/13] Loss D: 0.8798642754554749, Loss G: 1.300612449645996
Epoch [2/13] Loss D: 0.8235711455345154, Loss G: 1.3729925155639648
Epoch [3/13] Loss D: 0.6098687052726746, Loss G: 1.3266581296920776


...
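Because the print statement sits outside the inner loop, each printed value is the loss of the last batch in that epoch, which can be noisy. One option is to average the per-batch losses instead. The following is a small sketch in plain Python; RunningMean is a hypothetical helper and the batch losses are made-up numbers.

```python
class RunningMean:
    """Accumulates values and reports their mean."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value):
        self.total += float(value)
        self.count += 1

    @property
    def mean(self):
        return self.total / self.count if self.count else float("nan")


# Made-up per-batch generator losses for one epoch
batch_losses = [1.5, 1.0, 2.0, 0.5]

epoch_loss_g = RunningMean()
for loss in batch_losses:
    # Inside the training loop this would be epoch_loss_g.update(lossG.item())
    epoch_loss_g.update(loss)

print(f"last batch: {batch_losses[-1]}, epoch mean: {epoch_loss_g.mean}")
```

A fresh RunningMean would be created at the start of each epoch and its mean printed at the end.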

Saving the Trained Model

After training, we need to save the discriminator and the generator of the trained GAN, which takes just two lines of code.

# Save the Generator model's state dictionary to a file named 'generator.pth'
torch.save(netG.state_dict(), 'generator.pth')


# Save the Discriminator model's state dictionary to a file named 'discriminator.pth'
torch.save(netD.state_dict(), 'discriminator.pth')
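Loading works in the reverse direction: create a model with the same architecture, then call load_state_dict. Note that the generator is conditioned on vectors from text_embedding, whose weights are randomly initialized and are not saved above, so reusing generator.pth in a fresh session would also need those embedding weights and the same vocabulary. The following is a minimal sketch of the round trip, using a small nn.Embedding as a hypothetical stand-in for the article's models and an in-memory buffer in place of a file such as 'text_embedding.pth'.

```python
import io

import torch
import torch.nn as nn

# Hypothetical stand-in for a trained module such as text_embedding
trained = nn.Embedding(5, 10)

# Save the state dict (a buffer stands in for a file path)
buffer = io.BytesIO()
torch.save(trained.state_dict(), buffer)
buffer.seek(0)

# Later: rebuild the same architecture, then load the saved weights
restored = nn.Embedding(5, 10)
restored.load_state_dict(torch.load(buffer))
restored.eval()  # switch to inference mode

print(torch.equal(trained.weight, restored.weight))
```

The same pattern applies to netG and netD: construct Generator(embed_size) or Discriminator() first, then load the corresponding state dict.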

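Generating AI Video

As discussed earlier, we test the model on unseen data in the same spirit as the example where the training data shows dogs fetching balls and cats chasing mice. The test prompt could then be a cat fetching a ball or a dog chasing a mouse.

In our case, the motion of the circle moving up and then to the right is not in the training data, so the model is unfamiliar with it, although it has been trained on other motions. We can use this motion as a prompt to test our trained model and observe how it performs.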

# Inference function to generate a video based on a given text prompt
def generate_video(text_prompt, num_frames=10):
    # Create a directory for the generated video frames based on the text prompt
    os.makedirs(f'generated_video_{text_prompt.replace(" ", "_")}', exist_ok=True)

    # Encode the text prompt into a text embedding tensor
    text_embed = text_embedding(encode_text(text_prompt).to(device)).mean(dim=0).unsqueeze(0)

    # Generate frames for the video
    for frame_num in range(num_frames):
        # Generate random noise
        noise = torch.randn(1, 100).to(device)

        # Generate a fake frame using the Generator network
        with torch.no_grad():
            fake_frame = netG(noise, text_embed)

        # Save the generated fake frame as an image file
        save_image(fake_frame, f'generated_video_{text_prompt.replace(" ", "_")}/frame_{frame_num}.png')


# usage of the generate_video function with a specific text prompt
generate_video('circle moving up-right')

Running the code above creates a directory containing all the frames of our generated video. We need a bit more code to merge these frames into a short video.

# Define the path to your folder containing the PNG frames
folder_path = 'generated_video_circle_moving_up-right'

# Get the list of all PNG files in the folder
image_files = [f for f in os.listdir(folder_path) if f.endswith('.png')]


# Sort the images by name (assuming they are numbered sequentially)
image_files.sort()


# Create a list to store the frames
frames = []


# Read each image and append it to the frames list
for image_file in image_files:
 image_path = os.path.join(folder_path, image_file)
 frame = cv2.imread(image_path)
 frames.append(frame)


# Convert the frames list to a numpy array for easier processing
frames = np.array(frames)


# Define the frame rate (frames per second)
fps = 10


# Create a video writer object
fourcc = cv2.VideoWriter_fourcc(*'XVID')
out = cv2.VideoWriter('generated_video.avi', fourcc, fps, (frames[0].shape[1], frames[0].shape[0]))


# Write each frame to the video
for frame in frames:
 out.write(frame)


# Release the video writer
out.release()

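Make sure the folder path points to where your newly generated video frames are located. After running this code, your AI video will be created. Let's see what it looks like.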
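One detail of the merging script is worth noting: image_files.sort() orders names as strings. That is correct for the 10 frames generated here (frame_0.png to frame_9.png), but in a longer video frame_10.png would be placed before frame_2.png. The following is a small sketch of a numeric sort key, with frame_index as a hypothetical helper.

```python
import re


def frame_index(filename):
    """Extract the integer frame number from names like 'frame_12.png'."""
    match = re.search(r"(\d+)", filename)
    return int(match.group(1)) if match else -1


image_files = [f"frame_{i}.png" for i in range(12)]

lexicographic = sorted(image_files)
numeric = sorted(image_files, key=frame_index)

print(lexicographic[:4])  # ['frame_0.png', 'frame_1.png', 'frame_10.png', 'frame_11.png']
print(numeric[:4])        # ['frame_0.png', 'frame_1.png', 'frame_2.png', 'frame_3.png']
```

In the script above, this would amount to replacing image_files.sort() with image_files.sort(key=frame_index).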

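We ran training several times with the same number of epochs. In both cases, the circle starts at the bottom, half visible. The good news is that in both cases the model attempts an up-right motion.

For example, in attempt 1 the circle moves diagonally upward and then moves straight up, while in attempt 2 the circle moves diagonally while shrinking in size. In neither case does the circle move left or disappear completely, which is a good sign.

Finally, the author reports having tested various aspects of this architecture and found that the training data is the key. By including more motions and shapes in the dataset, you can increase variability and improve the model's performance. Because the data is generated by code, producing more varied data does not take much time, so you can focus on refining the logic instead.

Moreover, the GAN architecture discussed in this article is relatively simple. You can make it more sophisticated by integrating advanced techniques or by using language model (LLM) embeddings instead of a basic neural network embedding. Tuning parameters such as the embedding size can also significantly affect the model's effectiveness.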

Editor: Zhang Yanni. Source: Computer Vision Research Institute
