A Look at the Four Most Common Language Model Compression Techniques
Can you make a large language model (LLM) smaller without sacrificing performance? Although there is a constant appetite for ever-larger language models, MistralAI has shown us that the importance of size is relative, and the growing interest in edge computing pushes us to get decent results out of small language models. Another route is compression. In this article I explain these techniques and illustrate them with short code snippets.
Model compression is the practice of minimizing the size of a machine learning model without compromising its effectiveness. It works well on large neural networks because they are frequently over-parameterized and therefore contain redundant computational units.
Compression means reducing the number of parameters or the overall memory footprint, which yields a smaller model (for example, going from 10 GB down to 9 GB). The process makes models more efficient in terms of storage and inference speed, and therefore easier to deploy in resource-constrained environments. Common model compression techniques include:
- Quantization: reduces the memory footprint by changing the precision of the model weights (for example, from 32-bit floating point to 8-bit integers).
- Pruning: removes less important weights or neurons, reducing the number of parameters.
- Knowledge distillation: trains a smaller model (the student) to mimic a larger model (the teacher), distilling its knowledge into a compressed version with similar performance.
- Weight sharing: uses shared weights across different layers to reduce storage requirements, either by design or applied after training.
Model Quantization
Model quantization compresses an LLM by changing the precision of its weights or activations, converting the usual 32-bit or 16-bit representation into a lower-precision one (for example 8-bit, 4-bit, or even binary). We can quantize the weights, the activations, or apply other tricks:
- Weight quantization: the weights of a neural network are usually stored as 32-bit or 16-bit floating-point numbers. Quantization reduces them to a lower bit width, such as 8-bit integers (INT8) or 4-bit integers (INT4). This is done by mapping the original range of the weights onto a smaller range with fewer bits (a minimal sketch of this mapping follows this list), which significantly reduces memory usage.
- Activation quantization: similar to weights, the activations (the outputs of each layer during inference) can be quantized to lower precision. Representing activations with fewer bits reduces the model's memory footprint during inference.
- Quantization-aware training (QAT): in QAT the model is trained while simulating quantization, which allows it to adapt to lower precision. This helps preserve accuracy because the model learns to be more robust to quantization effects (see Tailor et al. on arXiv).
- Post-training quantization (PTQ): this approach trains the model normally at full precision and applies quantization only afterwards. While PTQ is simpler and faster, it can lead to a larger drop in accuracy than QAT (see Wang et al., NeurIPS 2021).
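To make the weight-mapping idea concrete, here is a minimal, self-contained sketch of symmetric INT8 quantization applied to a toy weight tensor (the random tensor simply stands in for one layer's weights; this illustrates the idea rather than any particular library API):
import torch

# Toy weight tensor standing in for the weights of a single layer
weights = torch.randn(4, 4)

# Symmetric INT8 quantization: map [-max|w|, +max|w|] onto the integer range [-127, 127]
scale = weights.abs().max() / 127
q_weights = torch.clamp(torch.round(weights / scale), -127, 127).to(torch.int8)

# Dequantize to inspect the rounding error introduced by the lower precision
deq_weights = q_weights.float() * scale
print("Max absolute error:", (weights - deq_weights).abs().max().item())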
Weight quantization can be implemented easily with bitsandbytes. Install the libraries:
pip install torch transformers bitsandbytes
For example, for GPT2 run the following code:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Specify the model you want to use
model_name = "gpt2" # You can replace this with any other LLM model
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load the model with 8-bit quantization using bitsandbytes
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,   # Enable 8-bit quantization
    device_map="auto"    # Automatically allocate to available device (CPU/GPU)
)
# Example text for inference
input_text = "Weight Quantization is an efficient technique for compressing language models."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
# Generate text
with torch.no_grad():
    output_ids = model.generate(input_ids, max_length=50)
# Decode and print the generated text
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(output_text)
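The same loading path also supports 4-bit weights via BitsAndBytesConfig. The snippet below is a sketch assuming a reasonably recent transformers and bitsandbytes version (and, like the 8-bit example, a CUDA GPU):
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16  # Dtype used for the actual matmuls
)

model_4bit = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    quantization_config=bnb_config,
    device_map="auto"
)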
Pruning
Pruning removes unnecessary or less important weights, neurons, or entire layers, much like trimming superfluous branches from a tree. It reduces the model size, speeds up inference, and lowers memory and compute requirements, making the model more efficient while preserving as much of the original performance as possible.
This is less straightforward than quantization, because we first need to find the redundant parts: in practice we identify the redundant parameters and then fine-tune the model without them.
Most commonly we remove weights, neurons, or layers, but there is growing interest in attention-head pruning (specific to Transformer-based models) as a form of structured pruning (see Wang et al. on arXiv). Here, each attention layer has multiple heads; some heads contribute more to the model's performance than others, so attention-head pruning removes the less important ones.
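As a quick sketch, transformers exposes a prune_heads method for structured head pruning (assuming a version that still supports it for GPT-2); the heads dropped below are an arbitrary, illustrative choice rather than an importance-based one:
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")

# Remove heads 0 and 1 in layer 0 and head 2 in layer 5 (illustrative choice, not importance-based)
model.prune_heads({0: [0, 1], 5: [2]})

# The pruned heads are recorded in the config, and the attention projections shrink accordingly
print(model.config.pruned_heads)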
Example code for pruning might look like the following, where we remove a certain percentage of the weights from a GPT2 model:
import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.pytorch_utils import Conv1D  # GPT-2 blocks use Conv1D modules instead of nn.Linear
# Load the pretrained model and tokenizer
model_name = "gpt2" # You can replace this with any other LLM model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Define a pruning method (here we use L1 unstructured pruning)
def prune_model_layer(layer, amount=0.3):
    # Prune 30% of the weights with the lowest L1 norm in the linear layers
    # (checking for both nn.Linear and Conv1D, since GPT-2 uses Conv1D internally)
    for name, module in layer.named_modules():
        if isinstance(module, (torch.nn.Linear, Conv1D)):
            prune.l1_unstructured(module, name="weight", amount=amount)
            print(f"Pruned layer {name} with amount {amount}")

# Apply pruning to all transformer layers in the model
for layer in model.transformer.h:
    prune_model_layer(layer, amount=0.3)  # Prune 30% of the weights

# Check the sparsity of the model
total_params = 0
pruned_params = 0
for name, module in model.named_modules():
    if isinstance(module, (torch.nn.Linear, Conv1D)):
        total_params += module.weight.nelement()
        pruned_params += torch.sum(module.weight == 0).item()

print(f"Total parameters: {total_params}")
print(f"Pruned parameters: {pruned_params}")
print(f"Sparsity: {pruned_params / total_params:.2%}")

# Test the pruned model on a sample input
input_text = "Pruning is an effective way to compress language models."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

# Generate text using the pruned model
with torch.no_grad():
    output_ids = model.generate(input_ids, max_length=50)

# Decode and print the generated text
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(output_text)
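One caveat about the example above: l1_unstructured only masks the weights (it keeps a weight_orig tensor plus a weight_mask), so nothing is actually removed yet. To bake the zeros into the weights and drop the masks, you can follow up with prune.remove, continuing from the code above:
# Make the pruning permanent: fold each mask into its weight and drop the reparametrization
for name, module in model.named_modules():
    if hasattr(module, "weight_orig"):  # Only modules that were actually pruned
        prune.remove(module, "weight")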
Model Distillation
Model distillation is a technique that transfers "knowledge" from a large, more complex model (the teacher) to a smaller, simpler one (the student) with fewer parameters. It lets the student approach the teacher's performance while staying much smaller or faster, as promised at the beginning.
The process starts from a large, pre-trained LLM that acts as the teacher, for example GPT2 or Llama. Such a model is usually very accurate but needs substantial compute for inference.
A smaller, more efficient model (the student) is then trained to mimic the teacher's behaviour, such as a miniGPT2 or TinyLlama (even though TinyLlama was actually built in a different way). The student learns both from the original training data and from the outputs generated by the teacher (the soft labels).
Here is an example of the teacher-student interaction in Python, with GPT2 as the teacher:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
import torch.nn.functional as F
# Load the teacher (large) and student (smaller) models
teacher_model_name = "gpt2" # You can replace this with any large LLM
student_model_name = "tiny-gpt2" # A smaller variant to act as the student
# Load the teacher model and tokenizer
teacher_model = AutoModelForCausalLM.from_pretrained(teacher_model_name).to("cuda")
teacher_tokenizer = AutoTokenizer.from_pretrained(teacher_model_name)
# Load the student model and tokenizer
student_model = AutoModelForCausalLM.from_pretrained(student_model_name).to("cuda")
student_tokenizer = AutoTokenizer.from_pretrained(student_model_name)

# GPT-2 tokenizers have no pad token by default; reuse the EOS token so padding works
teacher_tokenizer.pad_token = teacher_tokenizer.eos_token
student_tokenizer.pad_token = student_tokenizer.eos_token
# Load a dataset for training (e.g., Wikitext for language modeling)
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
# Set training parameters
learning_rate = 5e-5
epochs = 3
optimizer = torch.optim.AdamW(student_model.parameters(), lr=learning_rate)
student_model.train()  # from_pretrained returns models in eval mode; the student must be in training mode
# Set temperature for softening probabilities
temperature = 2.0
alpha = 0.5 # Weighting factor for combining loss functions
# Training loop for knowledge distillation
for epoch in range(epochs):
    for i, example in enumerate(dataset):
        # Get the input text
        input_text = example["text"]

        # Skip empty lines
        if not input_text.strip():
            continue

        # Tokenize the input text for the teacher and student models
        teacher_inputs = teacher_tokenizer(input_text, return_tensors="pt", truncation=True, padding="max_length", max_length=32).to("cuda")
        student_inputs = student_tokenizer(input_text, return_tensors="pt", truncation=True, padding="max_length", max_length=32).to("cuda")

        # Get teacher predictions (soft labels)
        with torch.no_grad():
            teacher_outputs = teacher_model(**teacher_inputs)
            teacher_logits = teacher_outputs.logits / temperature
            teacher_probs = F.softmax(teacher_logits, dim=-1)

        # Get student predictions
        student_outputs = student_model(**student_inputs)
        student_logits = student_outputs.logits

        # Calculate distillation loss (Kullback-Leibler divergence)
        distillation_loss = F.kl_div(
            input=F.log_softmax(student_logits / temperature, dim=-1),
            target=teacher_probs,
            reduction="batchmean",
            log_target=False
        ) * (temperature ** 2)

        # Calculate student task loss (cross-entropy with the true next tokens,
        # shifting logits and labels by one position as usual for causal LMs)
        target_labels = student_inputs["input_ids"]
        shift_logits = student_logits[:, :-1, :].contiguous()
        shift_labels = target_labels[:, 1:].contiguous()
        task_loss = F.cross_entropy(
            shift_logits.view(-1, shift_logits.size(-1)),
            shift_labels.view(-1),
            ignore_index=student_tokenizer.pad_token_id
        )

        # Combined loss
        loss = alpha * distillation_loss + (1 - alpha) * task_loss

        # Backpropagation and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Print training progress
        if i % 100 == 0:
            print(f"Epoch [{epoch + 1}/{epochs}], Step [{i}], Loss: {loss.item():.4f}")
print("Knowledge distillation completed!")
Weight Sharing
By sharing parameters across several model components, we can reduce the memory footprint of a neural network. When some or all layers share a single set of weights instead of each layer or component holding its own, the number of parameters the model has to store drops considerably. Weight sharing can either be designed into the architecture up front or applied after training as a compression technique. For example, one possibility is to cluster the weights, as in the code below:
import torch
import numpy as np
from sklearn.cluster import KMeans
def apply_weight_sharing(model, num_clusters=16):
    # Iterate through each parameter in the model
    # (clustering every tensor of a full model this way is slow; in practice you would
    # restrict it to selected weight matrices or subsample the weights before fitting)
    for name, param in model.named_parameters():
        if param.requires_grad:  # Only consider trainable parameters
            # Flatten the weights into a 1D array for clustering
            weights = param.data.cpu().numpy().flatten().reshape(-1, 1)

            # Apply k-means clustering
            kmeans = KMeans(n_clusters=num_clusters)
            kmeans.fit(weights)

            # Replace weights with their corresponding cluster centroids
            cluster_centroids = kmeans.cluster_centers_
            labels = kmeans.labels_

            # Map the original weights to their shared values
            shared_weights = np.array(cluster_centroids[labels]).reshape(param.data.shape)

            # Update the model's parameters with the shared weights
            param.data = torch.tensor(shared_weights, dtype=param.data.dtype).to(param.device)
    return model
# Example usage with a pre-trained model
from transformers import GPT2LMHeadModel
model = GPT2LMHeadModel.from_pretrained("gpt2")
model = apply_weight_sharing(model, num_clusters=16) # Apply weight sharing with 16 clusters
print("Weight sharing applied to the model!")