Deep Learning: Model Compression and Faster Model Inference

Author: 二旺 2023-11-19 23:36:50


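This article explores several model compression methods for speeding up the model inference phase, which can be a critical requirement for models in production.

Introduction

When a machine learning model is deployed to production, it usually has to meet requirements that were not considered during prototyping. For example, a production model must handle a large number of requests from different users. You will therefore want to optimize it for lower latency and/or higher throughput.

Latency: the time a task takes to complete, like the time a web page takes to load after you click a link. It is the wait between starting a task and seeing the result.
Throughput: the number of requests a system can handle in a given amount of time.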
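
To make the two definitions concrete, here is a minimal sketch of how both quantities could be measured for a prediction function. `fake_predict` and the request list are hypothetical stand-ins for a real model and real traffic, not part of the original article.

```python
import time

def fake_predict(x):
    # Hypothetical stand-in for a model's predict call
    return sum(v * v for v in x)

def measure(predict_fn, requests):
    """Return (average latency in seconds, throughput in requests per second)."""
    latencies = []
    start = time.perf_counter()
    for request in requests:
        t0 = time.perf_counter()
        predict_fn(request)
        latencies.append(time.perf_counter() - t0)
    total = time.perf_counter() - start
    avg_latency = sum(latencies) / len(latencies)  # time per request
    throughput = len(requests) / total             # requests per unit of time
    return avg_latency, throughput

# 200 made-up requests of 1000 features each
requests = [[0.1] * 1000 for _ in range(200)]
avg_latency, throughput = measure(fake_predict, requests)
print(f"average latency: {avg_latency * 1e3:.3f} ms")
print(f"throughput: {throughput:.1f} requests/s")
```

With sequential processing as in this sketch, throughput is roughly the inverse of latency; batching or parallel serving can raise throughput without lowering the latency of an individual request.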

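This means a machine learning model must make predictions very quickly. Various techniques can speed up model inference, and this article covers some of the most important ones.

Model Compression

Some techniques aim to make models smaller and are therefore called model compression techniques, while others focus on making models faster at inference time and belong to the field of model optimization. Making a model smaller usually also speeds up inference, though, so the line between the two research areas is quite blurry.

1. Low-Rank Factorization

This is the first method we look at. It is being studied extensively, and many papers on it have been published recently.

The basic idea is to replace the matrices of a neural network (the matrices that represent its layers) with lower-dimensional ones (tensors would be the more accurate term, since these arrays often have more than two dimensions). This reduces the number of network parameters and thereby speeds up inference.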
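
One common way to obtain such smaller matrices is a truncated singular value decomposition (SVD). The sketch below is an illustration under assumptions rather than the article's own method: the layer sizes, the rank `r`, and the synthetic weight matrix `W` are all made up, and the matrix is deliberately built to be close to low rank so that the approximation works well.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer with 512 inputs and 256 outputs, built to be close to rank 16
d_out, d_in, true_rank = 256, 512, 16
W = rng.normal(size=(d_out, true_rank)) @ rng.normal(size=(true_rank, d_in))
W += 0.01 * rng.normal(size=(d_out, d_in))  # small noise on top

# Truncated SVD: keep only the r largest singular values
r = 16
U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * S[:r]  # shape (d_out, r), columns scaled by the singular values
B = Vt[:r, :]         # shape (r, d_in)

# The layer W @ x is replaced by two smaller products A @ (B @ x)
x = rng.normal(size=(d_in,))
y_full = W @ x
y_low_rank = A @ (B @ x)

rel_error = np.linalg.norm(y_full - y_low_rank) / np.linalg.norm(y_full)
params_full = W.size
params_low_rank = A.size + B.size
print(f"parameters: {params_full} -> {params_low_rank}")
print(f"relative output error: {rel_error:.4f}")
```

Real weight matrices are not guaranteed to be this close to low rank, so the rank has to be chosen by checking how much accuracy is lost.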

A trivial example is replacing 3x3 convolutions with 1x1 convolutions in a CNN. This technique is used in architectures such as SqueezeNet.
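
The saving from this replacement follows from simple arithmetic on the parameter count of a convolution layer. The helper `conv2d_params` below is a hypothetical name introduced for illustration, and the channel counts are assumed values.

```python
def conv2d_params(kernel_size, in_channels, out_channels, bias=True):
    """Number of trainable parameters in a standard 2D convolution layer."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + (out_channels if bias else 0)

# Assumed example: 64 input channels and 64 output channels
params_3x3 = conv2d_params(3, 64, 64)
params_1x1 = conv2d_params(1, 64, 64)
print(params_3x3, params_1x1)  # 36928 4160
```

The 1x1 layer has roughly nine times fewer weights, at the cost of no longer looking at neighbouring pixels.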

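More recently, similar ideas have been applied for other purposes, such as fine-tuning large language models with limited resources. Fine-tuning a pretrained model for a downstream task normally still means training all of the pretrained model's parameters, which can be very expensive.

The idea behind the method called Low-Rank Adaptation of Large Language Models (LoRA) is therefore to keep the original weights frozen and represent the update to each adapted weight matrix as the product of two much smaller matrices (a matrix decomposition). Only these new matrices need to be trained to adapt the pretrained model to a downstream task.

[Figure: matrix decomposition in LoRA]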
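
The decomposition can be sketched numerically as follows. This is a minimal NumPy illustration, not the PEFT implementation: the layer sizes are made up, while `r=8` and `lora_alpha=32` are chosen only to mirror typical configuration values. The frozen weight `W0` is left untouched, and the trainable update is the product `B @ A`, scaled by `lora_alpha / r`.

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in = 64, 128   # illustrative layer size
r, lora_alpha = 8, 32   # rank and scaling factor

W0 = rng.normal(size=(d_out, d_in))         # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))  # trainable, small random init
B = np.zeros((d_out, r))                    # trainable, zero init: the update starts at 0

def lora_forward(x):
    # Original path plus the scaled low-rank update
    return W0 @ x + (lora_alpha / r) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
same_as_pretrained = np.allclose(lora_forward(x), W0 @ x)

trainable = A.size + B.size
frozen = W0.size
print(f"output unchanged at initialization: {same_as_pretrained}")
print(f"trainable parameters: {trainable} of {frozen + trainable}")
```

Because `B` starts at zero, the adapted layer initially behaves exactly like the pretrained one, and training only has to move the 1,536 entries of `A` and `B` instead of the 8,192 entries of `W0`.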

Now let's see how to fine-tune with LoRA using Hugging Face's PEFT library. Suppose we want to fine-tune bigscience/mt0-large with LoRA. First, we make sure to import what we need.

# Install the dependencies (notebook syntax; drop the leading "!" in a shell)
!pip install peft
!pip install transformers

from transformers import AutoModelForSeq2SeqLM
from peft import get_peft_model, LoraConfig, TaskType

model_name_or_path = "bigscience/mt0-large"
tokenizer_name_or_path = "bigscience/mt0-large"

The next step is to create the LoRA configuration that will be applied during fine-tuning.

peft_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1
)

We then instantiate the model from the Transformers base model and the configuration object we created for LoRA.

model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

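2. Knowledge Distillation

This is another method that lets us put a "small" model into production. The idea is to have a large model, called the teacher, and a smaller model, called the student, and to use the teacher's knowledge to teach the student how to make predictions. That way, only the student has to be deployed to production.

A classic example is DistilBERT, a student model of BERT developed this way. DistilBERT is 40% smaller than BERT, yet retains 97% of its language understanding capability and is 60% faster at inference. One drawback of this approach is that you still need a large teacher model to train the student, and you may not have the resources to train a model of the teacher's size.

Let's look at a simple example of knowledge distillation in Python. A key concept is the Kullback–Leibler divergence, a mathematical measure of the difference between two probability distributions. In our case we want to measure the difference between the predictions of the two models, so the training loss is based on it.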
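
As a standalone illustration of the divergence itself, separate from any training script, here is a minimal pure-Python sketch. The logits are made-up values, and the `temperature` argument is an assumption that reflects the common practice of softening both distributions before comparing them.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher_logits = [4.0, 1.0, 0.5]  # made-up teacher outputs for one sample
student_logits = [2.5, 1.5, 0.2]  # made-up student outputs for the same sample

for T in (1.0, 4.0):
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    print(f"T={T}: KL(teacher || student) = {kl_divergence(p, q):.4f}")
```

The divergence is zero only when the two distributions are identical, which is why minimizing it pushes the student's predictions toward the teacher's.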

import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical
import numpy as np

# Load the MNIST dataset
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Preprocess the data
train_images = train_images.reshape((60000, 28, 28, 1)).astype('float32') / 255
test_images = test_images.reshape((10000, 28, 28, 1)).astype('float32') / 255
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

# Define the teacher model (a larger model)
teacher_model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax')
])

teacher_model.compile(optimizer='adam',
                      loss='categorical_crossentropy',
                      metrics=['accuracy'])

# Train the teacher model
teacher_model.fit(train_images, train_labels, epochs=5, batch_size=64, validation_split=0.2)

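# Define the student model (a smaller model)
student_model = models.Sequential([
    layers.Flatten(input_shape=(28, 28, 1)),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax')
])

# Knowledge distillation step: transfer knowledge from the teacher to the student.
# The teacher's predicted class probabilities are used as soft targets.
teacher_probs = teacher_model.predict(train_images, batch_size=64)

temperature = 2.0  # Values above 1 soften both distributions (adjust as needed)

def distillation_loss(y_true, y_pred):
    # y_true holds the teacher's probabilities, y_pred the student's.
    # softmax(log(p) / T) matches the softmax of the underlying logits divided by T.
    eps = 1e-7  # avoids log(0)
    soft_teacher = tf.nn.softmax(tf.math.log(y_true + eps) / temperature, axis=1)
    soft_student = tf.nn.softmax(tf.math.log(y_pred + eps) / temperature, axis=1)
    return tf.keras.losses.KLDivergence()(soft_teacher, soft_student)

# The loss is set in compile(); fit() does not accept a loss argument
student_model.compile(optimizer='adam',
                      loss=distillation_loss,
                      metrics=['accuracy'])

# Train the student model on the teacher's soft targets.
# During training, the accuracy metric measures agreement with the teacher.
student_model.fit(train_images, teacher_probs, epochs=10, batch_size=64,
                  validation_split=0.2)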

# Evaluate the student model
test_loss, test_acc = student_model.evaluate(test_images, test_labels)
print(f'Test accuracy: {test_acc * 100:.2f}%')

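3. Pruning

Pruning is a model compression method I studied in my graduate thesis, and I previously published an article on implementing pruning in Julia: Iterative Pruning Methods for Artificial Neural Networks in Julia.

Pruning originated as a way to combat overfitting in decision trees, where branches are cut off to reduce the depth of the tree. The concept was later applied to neural networks, where edges and/or nodes of the network are removed (edges in unstructured pruning, nodes in structured pruning).

If you remove entire nodes from the network, the matrices representing the layers get smaller, so the model becomes smaller and therefore faster. If instead we remove individual edges, the matrix sizes stay the same, but zeros are placed at the positions of the removed edges, which yields very sparse matrices. The advantage of unstructured pruning therefore lies less in speed than in memory, since a sparse matrix stored in a sparse format takes less space than a dense one.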
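
The difference between the two variants shows up directly in the shape of the weight matrix. The NumPy sketch below assumes a tiny made-up layer and uses weight magnitude as the pruning criterion, which is one simple choice among several.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer: 6 inputs, 4 output neurons (one column per neuron)
W = rng.normal(size=(6, 4))

# Unstructured pruning: zero out the 50% of weights with the smallest magnitude
threshold = np.quantile(np.abs(W), 0.5)
W_unstructured = np.where(np.abs(W) >= threshold, W, 0.0)

# Structured pruning: remove the 2 output neurons whose columns have the smallest L2 norm
col_norms = np.linalg.norm(W, axis=0)
keep = np.sort(np.argsort(col_norms)[2:])
W_structured = W[:, keep]

sparsity_level = float(np.mean(W_unstructured == 0))
print("unstructured:", W_unstructured.shape, f"sparsity={sparsity_level:.2f}")
print("structured:  ", W_structured.shape)
```

The unstructured result keeps its 6x4 shape and only gains zeros, while the structured result is a genuinely smaller 6x2 matrix that any dense matrix multiplication can exploit.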

But which nodes or edges should we prune? Usually the least necessary ones. Two papers worth studying here are "Optimal Brain Damage" and "Optimal Brain Surgeon and general network pruning".

Let's look at a Python script that implements pruning on a simple MNIST model.

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical
from tensorflow_model_optimization.sparsity import keras as sparsity
import numpy as np

# Load the MNIST dataset
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Preprocess the data
train_images = train_images.reshape((60000, 28, 28, 1)).astype('float32') / 255
test_images = test_images.reshape((10000, 28, 28, 1)).astype('float32') / 255
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

# Create a simple neural network model
def create_model():
    model = Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    return model

# Create and compile the original model
model = create_model()
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Train the original model
model.fit(train_images, train_labels, epochs=5, batch_size=64, validation_split=0.2)

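# Prune the model
# The pruning schedule should end with fine-tuning: 2 epochs over the 80% training split
epochs = 2
batch_size = 64
num_train_samples = int(train_images.shape[0] * 0.8)
end_step = int(np.ceil(num_train_samples / batch_size)) * epochs

# Specify the pruning parameters
pruning_params = {
    'pruning_schedule': sparsity.PolynomialDecay(initial_sparsity=0.50,
                                                 final_sparsity=0.90,
                                                 begin_step=0,
                                                 end_step=end_step,
                                                 frequency=100)
}

# Wrap the trained model so its low-magnitude weights are pruned during fine-tuning
pruned_model = sparsity.prune_low_magnitude(model, **pruning_params)

# Compile the pruned model
pruned_model.compile(optimizer='adam',
                     loss='categorical_crossentropy',
                     metrics=['accuracy'])

# Train the pruned model (fine-tuning).
# The UpdatePruningStep callback is required to advance the pruning schedule.
pruned_model.fit(train_images, train_labels, epochs=epochs, batch_size=batch_size,
                 validation_split=0.2, callbacks=[sparsity.UpdatePruningStep()])

# Strip the pruning wrappers to get a plain Keras model with sparse weights
final_model = sparsity.strip_pruning(pruned_model)

# The stripped model has to be compiled again before it can be evaluated
final_model.compile(optimizer='adam',
                    loss='categorical_crossentropy',
                    metrics=['accuracy'])

# Evaluate the final pruned model
test_loss, test_acc = final_model.evaluate(test_images, test_labels)
print(f'Test accuracy after pruning: {test_acc * 100:.2f}%')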

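4. Quantization

I think it is fair to say that quantization is probably the most widely used compression technique at the moment. Again, the basic idea is simple. We normally represent the parameters of a neural network with 32-bit floating-point numbers. But what if we used lower-precision values? We can use 16 bits, 8 bits, 4 bits, or even 1 bit, which gives a binary network!

What does this mean? With lower-precision numbers the model becomes lighter and smaller, but it also loses precision and produces more approximate results than the original model. The technique is used often when deploying to edge devices, especially on special hardware such as smartphones, because it can shrink the network considerably. Many frameworks make quantization easy to apply, for example TensorFlow Lite, PyTorch, or TensorRT.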
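
The trade-off between size and precision can be seen on a single weight vector. The following NumPy sketch assumes an affine (scale and zero-point) mapping onto 8-bit integers, which is one common scheme; the weights are random made-up values, and real frameworks add details such as per-channel scales.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.5, size=1000).astype(np.float32)  # made-up float32 weights

# Affine quantization onto the int8 range [-128, 127]
qmin, qmax = -128, 127
scale = float(weights.max() - weights.min()) / (qmax - qmin)
zero_point = int(round(qmin - float(weights.min()) / scale))

# Quantize: map each float to the nearest integer level
q = np.clip(np.round(weights / scale) + zero_point, qmin, qmax).astype(np.int8)

# Dequantize: map the integers back to approximate float values
dequantized = (q.astype(np.float32) - zero_point) * scale

max_error = float(np.abs(weights - dequantized).max())
print(f"size: {weights.nbytes} bytes -> {q.nbytes} bytes")
print(f"scale: {scale:.6f}, max absolute error: {max_error:.6f}")
```

The stored tensor is four times smaller, and each weight is off by at most about half a quantization step, which is the precision loss described above.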

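Quantization can be applied before training, in which case we work directly with a truncated network whose parameters can only take values in a certain range, or after training, in which case the parameter values are rounded at the end. Here again, let's take a quick look at how to apply quantization in Python.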


import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Dropout
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical
import numpy as np

# Load the MNIST dataset
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Preprocess the data
train_images = train_images.reshape((60000, 28, 28, 1)).astype('float32') / 255
test_images = test_images.reshape((10000, 28, 28, 1)).astype('float32') / 255
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

# Create a simple neural network model
def create_model():
    model = Sequential([
        Flatten(input_shape=(28, 28, 1)),
        Dense(128, activation='relu'),
        Dropout(0.2),
        Dense(64, activation='relu'),
        Dropout(0.2),
        Dense(10, activation='softmax')
    ])
    return model

# Create and compile the original model
model = create_model()
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Train the original model
model.fit(train_images, train_labels, epochs=5, batch_size=64, validation_split=0.2)

# Quantize the model after training. With only Optimize.DEFAULT set, this is
# dynamic-range quantization: weights are stored as 8-bit integers.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
quantized_model = converter.convert()

# Save the quantized model to a file
with open('quantized_model.tflite', 'wb') as f:
    f.write(quantized_model)

# Load the quantized model for inference
interpreter = tf.lite.Interpreter(model_path='quantized_model.tflite')
interpreter.allocate_tensors()

# Look up the input and output tensor indices once, outside the loop
input_index = interpreter.get_input_details()[0]['index']
output_index = interpreter.get_output_details()[0]['index']

# Evaluate the quantized model one sample at a time
test_loss, test_acc = 0.0, 0.0
for i in range(len(test_images)):
    input_data = np.array([test_images[i]], dtype=np.float32)
    interpreter.set_tensor(input_index, input_data)
    interpreter.invoke()
    # Drop the batch dimension so the prediction has the same shape as the label
    output_data = interpreter.get_tensor(output_index)[0]
    test_loss += tf.keras.losses.categorical_crossentropy(test_labels[i], output_data).numpy()
    test_acc += np.argmax(test_labels[i]) == np.argmax(output_data)

test_loss /= len(test_images)
test_acc /= len(test_images)

print(f'Test accuracy after quantization: {test_acc * 100:.2f}%')

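Conclusion

In this article we explored several model compression methods for speeding up the inference phase, which can be a critical requirement for models in production. In particular, we looked at low-rank factorization, knowledge distillation, pruning, and quantization, explained the basic ideas, and showed simple implementations in Python. Model compression is also very useful for deploying models on specific hardware with limited resources (RAM, GPU, and so on), such as smartphones.

Editor: 趙寧寧 Source: 小白玩轉Python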
