No need to fear training large models: the lightweight TorchShard library reduces GPU memory consumption, with the same API as PyTorch

Author: Kaiyu Yue 2021-07-27 11:20:10

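How can you reduce GPU memory consumption gracefully when training large models? Consider trying the TorchShard library, which combines model parallelism with data parallelism and has the same API design as PyTorch.

Model parallelism can improve performance on vision tasks. So far, however, there has been no standard library that lets us adopt model parallelism as easily as other SOTA techniques such as mixed precision.

Recently, Kaiyu Yue, a researcher in the Department of Computer Science at the University of Maryland, College Park, open-sourced TorchShard, a lightweight engine for slicing PyTorch tensors into parallel shards. When a model has a large number of linear layers (for example BERT or GPT) or a very large number of classes (millions), TorchShard can reduce GPU memory use and scale up training. It has the same API design as PyTorch.

Project address: https://github.com/KaiyuYue/torchshard

Very large models such as BERT and GPT are becoming the trend in NLP applications. Training them runs into memory limits, however. To work around this, researchers scale up training with the model parallelism of Megatron-LM and PyTorch-Lightning. Of the two, Megatron-LM focuses only on large-scale training of language models, while PyTorch-Lightning relies only on sharded optimizer states and gradients, as DeepSpeed does.

In computer vision we hit the same problem when training Transformer-based or MLP-based models, or when training a model over millions of classes. TorchShard has two goals:

Build a standard PyTorch extension library for scaling up training with model parallelism;
Use PyTorch in a simple, natural way.

TorchShard is a complete rewrite of the model parallel unit (mpu), the core of Megatron-LM. Most importantly, TorchShard has the same API design as PyTorch, which means that all subclasses and sub-functions stay the same as in PyTorch. For example, to make the original linear layer torch.nn.Linear parallel, change torch to ts and call the subclass nn.ParallelLinear with a dim argument, as shown below: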

import torch
import torchshard as ts

ts.init_process_group(group_size=2)  # init parallel groups

m = torch.nn.Sequential(
    torch.nn.Linear(20, 30, bias=True),
    ts.nn.ParallelLinear(30, 30, bias=True, dim=None),  # equal to nn.Linear()
    ts.nn.ParallelLinear(30, 30, bias=True, dim=0),     # parallel in row dimension
    ts.nn.ParallelLinear(30, 30, bias=True, dim=1),     # parallel in column dimension
).cuda()

# x is the input batch and y the target labels, both defined elsewhere
x = m(x)  # forward
loss = ts.nn.functional.parallel_cross_entropy(x, y)  # parallel loss function
loss.backward()  # backward

# save model state
torch.save(ts.collect_state_dict(m, m.state_dict()), 'm.pt')
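To make the idea of a "parallel" linear layer more concrete, here is a minimal single-process sketch of the underlying arithmetic. It is not TorchShard code: plain Python lists stand in for tensors, the helper names are made up for illustration, and the weight is assumed to be laid out as [in_features][out_features]. It only shows that splitting a weight by output features (concatenate the partial results) or by input features (sum the partial results) reproduces the unsharded output. Which of these corresponds to which dim value in TorchShard is not asserted here.

```python
# Plain-Python sketch of sharding a linear layer y = x @ W (bias omitted).
# W has shape [in_features][out_features]; lists stand in for tensors.

def matmul(x, w):
    """Multiply an [n][k] matrix by a [k][m] matrix."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def split_by_outputs(w, parts):
    """Each shard keeps every input row but only a slice of the output columns."""
    step = len(w[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]

def split_by_inputs(w, parts):
    """Each shard keeps a slice of the input rows and every output column."""
    step = len(w) // parts
    return [w[i * step:(i + 1) * step] for i in range(parts)]

def forward_split_outputs(x, w, parts):
    # Every simulated rank computes some output features; results are concatenated.
    outs = [matmul(x, shard) for shard in split_by_outputs(w, parts)]
    return [sum((o[r] for o in outs), []) for r in range(len(x))]

def forward_split_inputs(x, w, parts):
    # Every simulated rank sees some input features; partial outputs are summed.
    step = len(w) // parts
    outs = [matmul([row[i * step:(i + 1) * step] for row in x], shard)
            for i, shard in enumerate(split_by_inputs(w, parts))]
    return [[sum(vals) for vals in zip(*(o[r] for o in outs))] for r in range(len(x))]

x = [[1, 2, 3, 4], [5, 6, 7, 8]]
w = [[1, 0, 2, 1], [0, 1, 1, 3], [2, 1, 0, 1], [1, 1, 1, 0]]
full = matmul(x, w)
print(forward_split_outputs(x, w, 2) == full)
print(forward_split_inputs(x, w, 2) == full)
```

In a real multi-GPU run the concatenation and the summation would be collective communication steps between ranks; here they are ordinary list operations.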

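In addition, TorchShard supports a range of features for use together with DDP: saving and loading sharded checkpoints, initializing sharded parameters, and handling tensors across multiple machines and GPUs. Specifically:

torchshard contains the necessary functions and operations, like the torch package;
torchshard.nn contains the basic building blocks for graphs, like the torch.nn package;
torchshard.nn.functional contains the functional operations that correspond to torchshard.nn, like the torch.nn.functional package;
torchshard.distributed contains the basic functions for handling distributed tensors and groups, like the torch.distributed package but easier to use.

How do you get started with TorchShard?

Requirements: Python 3.6 or later and PyTorch 1.9.0 or later. Install the TorchShard library with pip: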

pip install torchshard

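We use training ResNet-50 on ImageNet as an example to show that only a few lines of code are needed to use TorchShard in a project. The ResNet-50 design usually has two parts: the convolutional blocks and the fully connected layer, as shown in Figure 1. The size of the final linear layer depends on the number of classes in the dataset, so when there are very many classes this layer has more parameters than the convolutional blocks. We therefore slice the final linear layer and check how large it can be made.

Figure 1: Forward training flow of DDP and of DDP + TorchShard.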
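The parameter-count argument can be put into rough numbers. The sketch below is only back-of-the-envelope arithmetic: it assumes the final layer of ResNet-50 takes 2048 input features, stores parameters in fp32, and is split evenly across shards. It counts parameters only, so gradients, optimizer state and activations come on top of these figures.

```python
# Back-of-the-envelope size of the final linear layer for different class counts.
IN_FEATURES = 2048      # assumed input width of ResNet-50's final linear layer
BYTES_PER_FP32 = 4

def fc_param_count(num_classes, in_features=IN_FEATURES):
    """Weights plus biases of a linear layer with num_classes outputs."""
    return in_features * num_classes + num_classes

def fc_param_mib(num_classes, num_shards=1):
    """fp32 parameter memory per rank in MiB, assuming an even split."""
    return fc_param_count(num_classes) * BYTES_PER_FP32 / num_shards / 2**20

for classes in (1_000, 1_000_000):
    print(f"{classes:>9} classes: {fc_param_count(classes):>13,} params, "
          f"{fc_param_mib(classes):8.1f} MiB replicated, "
          f"{fc_param_mib(classes, num_shards=8):8.1f} MiB per rank over 8 shards")
```

With a million classes, the replicated layer alone would take up most of a 12 GiB card under these assumptions, which is why slicing it across ranks matters.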

In Figure 1, the left side shows the conventional DDP training paradigm. Suppose we have two ranks: DDP forces each rank to hold a full copy of the model parameters. TorchShard instead slices the layer parameters across the ranks, which reduces overall GPU memory use. We now add some code to the official ImageNet training script. The modified version is already part of the TorchShard project.

First, import torchshard:

import torchshard as ts

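Next, initialize the process group for model parallelism, in the same way as a DDP process group is initialized. Only one argument needs to be set, telling torchshard how many shards the target layer should be sliced into.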

ts.distributed.init_process_group(group_size=args.world_size)

Then convert the model to its parallel version. The whole model can be passed directly to the conversion helper function, with no special handling.

import resnet

model = resnet.__dict__[args.arch](pretrained=args.pretrained)

# convert the linear layers of the model to parallel linear layers
ts.nn.ParallelLinear.convert_parallel_linear(
    model, dim=args.model_parallel_dim
)
print("=> paralleling model '{}'".format(args.arch))

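Also, do not forget the loss function torchshard.nn.ParallelCrossEntropyLoss. It switches between the original PyTorch mode and the parallel mode depending on its input tensor. For example, if the input tensor was produced by a torchshard parallel layer, torchshard.nn.ParallelCrossEntropyLoss computes the loss in parallel.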

criterion = ts.nn.ParallelCrossEntropyLoss().cuda(args.gpu)
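One common way to compute cross entropy when the class logits are split across ranks is sketched below. This is a general technique and not necessarily how TorchShard implements its loss: each rank holds a slice of the logits, a global maximum and a global sum of exponentials are obtained by reductions, and only the rank that owns the target class contributes the target logit. The function names are illustrative, and the reductions are simulated in a single process with plain Python.

```python
import math

def cross_entropy(logits, target):
    """Reference: numerically stable cross entropy over the full logit vector."""
    m = max(logits)
    return m + math.log(sum(math.exp(v - m) for v in logits)) - logits[target]

def sharded_cross_entropy(shards, target):
    """Same loss, computed from per-rank logit slices given in class order."""
    # Stands in for an all-reduce(max) across ranks.
    global_max = max(max(s) for s in shards)
    # Each rank sums exponentials over its own classes ...
    local_sums = [sum(math.exp(v - global_max) for v in s) for s in shards]
    # ... and an all-reduce(sum) would combine them.
    total = sum(local_sums)
    # Only the shard that owns the target class supplies its logit.
    offset, target_logit = 0, 0.0
    for s in shards:
        if offset <= target < offset + len(s):
            target_logit = s[target - offset]
        offset += len(s)
    return global_max + math.log(total) - target_logit

logits = [2.0, -1.0, 0.5, 3.0, 0.0, 1.5]
shards = [logits[:3], logits[3:]]          # two simulated ranks, three classes each
for t in range(len(logits)):
    assert abs(cross_entropy(logits, t) - sharded_cross_entropy(shards, t)) < 1e-9
```

The full logit vector is never assembled on any single rank in this scheme, which is what keeps memory flat as the number of classes grows.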

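When model parallelism (TorchShard) and data parallelism (DDP) work together, we need to handle the input of the parallel layers. Both the parameters and the training data differ from rank to rank. We therefore gather the input tensors before the parallel linear layer in the ResNet forward pass.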

x = ts.distributed.gather(x, dim=0)  # gather input along the dim of batch size
x = self.fc(x)

Likewise, we gather the target tensors before computing the loss.

output = model(images)

if args.enable_model_parallel:
    target = ts.distributed.gather(target, dim=0)
loss = criterion(output, target)
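The reason both gathers are needed can be seen from the shapes alone. The following sketch simulates two ranks in one process with plain lists. The gather helper here is a stand-in written for this illustration, not the TorchShard function.

```python
def gather(chunks):
    """Concatenate per-rank chunks along the batch dimension (dim 0)."""
    return [item for chunk in chunks for item in chunk]

# Two simulated ranks, each holding a different mini-batch of 2 samples.
features_per_rank = [[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6], [0.7, 0.8]]]
targets_per_rank = [[3, 0], [1, 2]]

features = gather(features_per_rank)   # 4 samples, now visible to every rank
targets = gather(targets_per_rank)     # 4 labels, in the same order

# Each rank scores all 4 samples against its own slice of the classes,
# so the loss needs all 4 labels rather than only the 2 local ones.
print(len(features), len(targets))
```

If only the inputs were gathered, the output would have more rows than there are local labels and the loss could not be computed.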

Finally, saving and loading checkpoints with TorchShard functions is straightforward. TorchShard provides the basic function torchshard.collect_state_dict for saving checkpoints and torchshard.relocate_state_dict for loading them.

Saving a checkpoint:

state_dict = model.state_dict()

# collect states across all ranks
state_dict = ts.collect_state_dict(model, state_dict)

if ts.distributed.get_rank() == 0:
    torch.save(state_dict, 'resnet50.pt')  # save as before

Loading a checkpoint:

if ts.distributed.get_rank() == 0:
    state_dict = torch.load('resnet50.pt')

# relocate state_dict() for all ranks
state_dict = ts.relocate_state_dict(model, state_dict)

model.load_state_dict(state_dict)  # load as before
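Conceptually, collecting merges the per-rank slices of a sharded parameter into one full tensor, and relocating cuts the full tensor back into the slice each rank owns. The sketch below illustrates that round trip with plain lists. The helper names and the even row-wise split are assumptions made for illustration and say nothing about TorchShard's internals.

```python
def collect(shards):
    """Merge per-rank row slices of a weight into the full weight."""
    return [row for shard in shards for row in shard]

def relocate(full, rank, world_size):
    """Cut out the row slice owned by one rank, assuming an even split."""
    step = len(full) // world_size
    return full[rank * step:(rank + 1) * step]

# Two simulated ranks, each holding two rows of a 4-row weight.
shards = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]

full = collect(shards)                                  # what a checkpoint would store
restored = [relocate(full, r, 2) for r in range(2)]     # what each rank would load back
print(restored == shards)
```

Storing the full tensor keeps the checkpoint independent of the number of shards it was trained with.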

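We have now finished adding the code for sharded training on ImageNet, and can scale it up by increasing the number of classes, that is, the output feature dimension of the final linear layer. The training scripts can be found in torchshard/project/imagenet. The figure below shows how training ResNet-50 scales on 8 NVIDIA TITAN-XP (12196 MiB) GPUs with up to 1,000,000 classes, and on 16 GPUs with 2,000,000 classes.

Figure 2: GPU memory cost under different parallel strategies with the standard ResNet training settings (input size 224 and batch size 256).

Using AMP and ZeRO

TorchShard can be combined with other techniques, such as automatic mixed precision (AMP) and ZeRO, in a simple and natural PyTorch way.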

# gradscaler
scaler = torch.cuda.amp.GradScaler(enabled=args.enable_amp_mode)

with torch.cuda.amp.autocast(enabled=args.enable_amp_mode):
    # compute output
    output = model(images)

    if args.enable_model_parallel:
        target = ts.distributed.gather(target, dim=0)
    loss = criterion(output, target)

# compute gradient and do SGD step
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()

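Figure 3: GPU memory cost under different parallel strategies with AMP, using the standard ResNet training settings (input size 224 and batch size 256).

ZeRO is the core of DeepSpeed, and it works with PyTorch >= 1.9.0. To test this feature, install a sufficiently recent version before running the script. The code is as follows: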

from torch.distributed.optim import ZeroRedundancyOptimizer

if args.enable_zero_optim:
    print('=> using ZeroRedundancyOptimizer')
    optimizer = ZeroRedundancyOptimizer(
        model.parameters(),
        optimizer_class=torch.optim.SGD,
        lr=args.lr,
        momentum=args.momentum,
        weight_decay=args.weight_decay)
else:
    optimizer = torch.optim.SGD(model.parameters(), args.lr,
                                momentum=args.momentum,
                                weight_decay=args.weight_decay)
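The memory saving from a ZeRO-style optimizer comes from each rank keeping optimizer state, such as SGD momentum buffers, only for a subset of the parameters. The sketch below shows one simple way such a partition could be formed: a greedy scheme that hands each parameter tensor to the currently lightest rank. The greedy rule and the parameter names are assumptions chosen for illustration, not a description of the actual partitioning code.

```python
def partition_greedy(param_sizes, world_size):
    """Assign each parameter tensor to the rank with the smallest load so far.

    param_sizes maps a parameter name to its number of elements.
    Returns the per-rank name lists and the per-rank element counts.
    """
    buckets = [[] for _ in range(world_size)]
    loads = [0] * world_size
    # Place the largest tensors first so the loads stay balanced.
    for name, size in sorted(param_sizes.items(), key=lambda kv: kv[1], reverse=True):
        rank = loads.index(min(loads))
        buckets[rank].append(name)
        loads[rank] += size
    return buckets, loads

# Hypothetical parameter names with element counts typical of a ResNet.
sizes = {
    "conv1.weight": 9408,
    "layer4.conv.weight": 2359296,
    "fc.weight": 2048000,
    "fc.bias": 1000,
}
buckets, loads = partition_greedy(sizes, world_size=2)
for rank, (names, load) in enumerate(zip(buckets, loads)):
    print(f"rank {rank}: {load:,} elements of optimizer state for {names}")
```

Each rank then updates only its own bucket, and the updated parameters are shared with the other ranks afterwards.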

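Figure 4: GPU memory cost under different parallel strategies with the ZeRO optimizer, using the standard ResNet training settings (input size 224 and batch size 256).

In addition, TorchShard provides basic Python APIs and matching template files to simplify the implementation of custom parallel layers.

The researcher will continue to develop TorchShard. The next planned feature is a new data sampler, torchshard.utils.data.DistributedGroupSampler, named after torch.utils.data.DistributedSampler. It is intended to help users set up M-way data parallelism together with N-way model parallelism as easily as using DistributedSampler with DDP. The only thing the user has to do is set the model-parallel group number, and DistributedGroupSampler then ensures that the modules in the same model-parallel group receive the same training data.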
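Because the sampler is described only as a planned feature, the following is a purely hypothetical sketch of the behaviour stated above, not its API. It assumes that ranks in one model-parallel group are numbered consecutively and that the dataset is split between groups in a strided fashion, so that all ranks in a group read the same indices.

```python
def group_indices(rank, world_size, model_parallel_size, dataset_len):
    """Dataset indices a rank would read under M-way data, N-way model parallelism.

    Hypothetical helper: ranks in the same model-parallel group share indices.
    """
    if world_size % model_parallel_size != 0:
        raise ValueError("world_size must be a multiple of model_parallel_size")
    num_groups = world_size // model_parallel_size    # M data-parallel replicas
    group_id = rank // model_parallel_size            # consecutive ranks form a group
    return list(range(group_id, dataset_len, num_groups))

# 4 GPUs, 2-way model parallelism, hence 2-way data parallelism.
for rank in range(4):
    print(rank, group_indices(rank, world_size=4, model_parallel_size=2, dataset_len=8))
```

With matching data inside each group, the explicit gather of inputs and targets before the parallel layer would apply only across groups, if at all.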

Editor: Zhang Yanni. Source: 机器之心Pro
