Not long ago, Meta had barely finished releasing its open-source large language model LLaMA when netizens posted a no-strings-attached download link, leaving the model "tragically" open to all.
As soon as the news broke, the community lit up, and everyone rushed to download the model and start testing.
But those without a top-tier GPU could only gaze at the model and sigh.
No matter, though: Georgi Gerganov recently built a project called "llama.cpp" that runs LLaMA without any GPU at all.

Project repo: https://github.com/ggerganov/llama.cpp
Yes, that includes Macs with Apple silicon. The project even picked up a supportive retweet from LeCun.

Running LLaMA on an M1/M2 Mac
At the moment there are two fairly comprehensive tutorials, based on Apple's M1 and M2 processors respectively.

The first: https://dev.l1x.be/posts/2023/03/12/using-llama-with-m1-mac/

The second: https://til.simonwillison.net/llms/llama-7b-m2
Without further ado, let's look at the results first.
For example, ask the smallest 7B model: "The first man on the moon was".
A few seconds later, the result was out.

-p 'The first man on the moon was '
The first man on the moon was 38-year-old astronaut Neil A. Armstrong.
Apollo 11 landed on the moon on July 20, 1969.
Apollo 11 was the first manned mission to land on the moon, and it was the culmination of the Space Race. In July 1969, Neil Armstrong and Edwin "Buzz" Aldrin became the first humans to land on the moon.
Apollo 11 launched from the Kennedy Space Center on July 16.
Judging by Armstrong's age at the landing, his middle initial, and the dates, there are no obvious factual errors here. For a model this small, that is no small feat.
With the prompt below, it can also generate some practical Python code.
-p 'def open_and_return_content(filename):'
def open_and_return_content(filename):
    """ Opens file (returning the content) and performs basic sanity checks """
    if os.path.isfile(filename):
        with open(filename) as f:
            content = f.read()
            return content
    else:
        print('WARNING: file "{}" does not exist'.format(filename), file=sys.stderr)
        return ''

def get_file_info(filename, fullpath):
    """ Get file information (i.e., permission, owner, group, size) """
Next, let's look at how to do this step by step.
Step 1: Download the model
The first thing to do is download the LLaMA model itself.
You can submit a request to Meta via the official form, or grab the weights directly from links shared by netizens.
Either way, once you are done you will see a pile of files like this:
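Roughly speaking, the release is laid out like this (a sketch of the typical layout, not a verbatim listing):
7B/    13B/    30B/    65B/
tokenizer.model
tokenizer_checklist.chk
Each model folder holds one or more consolidated.*.pth weight files, a params.json, and a checklist.chk.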

As you can see, each model sits in its own folder. Every model has a params.json containing details about that model. For example:
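For the 7B model the contents look roughly like this (reconstructed from the conversion output shown later in this post):
{'dim': 4096, 'multiple_of': 256, 'n_heads': 32, 'n_layers': 32, 'norm_eps': 1e-06, 'vocab_size': 32000}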

Step 2: Install dependencies
First, you need Xcode in order to compile the C++ project.
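If you do not want the full Xcode install, Apple's command-line tools are usually enough for a make/clang build (an assumption on my part; the tutorial itself simply says to install Xcode):
xcode-select --install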
Next come the dependencies for building the C++ project (pkgconfig and cmake).
brew install pkgconfig cmake
For the environment setup, if you are using Python 3.11 you can create a virtual environment:
/opt/homebrew/bin/python3.11 -m venv venv
Then activate the venv (if your shell is something other than fish, just drop the .fish suffix):
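Concretely, assuming the venv directory created above, activation looks like this (fish first, bash/zsh second):
. venv/bin/activate.fish
source venv/bin/activate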
Finally, install Torch.
pip3 install --pre torch torchvision --extra-index-url https://download.pytorch.org/whl/nightly/cpu
If you are interested in GPU-accelerated training with the new Metal Performance Shaders (MPS) backend, you can verify it by running the following. This is not required for running LLaMA on an M1, though.
python
Python 3.11.2 (main, Feb 16 2023, 02:55:59) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch; torch.backends.mps.is_available()
True
Step 3: Compile llama.cpp
First, clone the repository:
git clone git@github.com:ggerganov/llama.cpp.git
With all the dependencies installed, you can run make:
make
I llama.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS: -framework Accelerate
I CC: Apple clang version 14.0.0 (clang-1400.0.29.202)
I CXX: Apple clang version 14.0.0 (clang-1400.0.29.202)
cc -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -DGGML_USE_ACCELERATE -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread -c utils.cpp -o utils.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread main.cpp ggml.o utils.o -o main -framework Accelerate
./main -h
usage: ./main [options]
options:
-h, --help show this help message and exit
-s SEED, --seed SEED RNG seed (default: -1)
-t N, --threads N number of threads to use during computation (default: 4)
-p PROMPT, --prompt PROMPT
prompt to start generation with (default: random)
-n N, --n_predict N number of tokens to predict (default: 128)
--top_k N top-k sampling (default: 40)
--top_p N top-p sampling (default: 0.9)
--temp N temperature (default: 0.8)
-b N, --batch_size N batch size for prompt processing (default: 8)
-m FNAME, --model FNAME
model path (default: models/llama-7B/ggml-model.bin)
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread quantize.cpp ggml.o utils.o -o quantize -framework Accelerate
Step 4: Convert the model
Assuming you have already placed the models under models/ in the llama.cpp repo, run the conversion script (the trailing 1 selects f16 output):
python convert-pth-to-ggml.py models/7B 1
You should then see output like this:
{'dim': 4096, 'multiple_of': 256, 'n_heads': 32, 'n_layers': 32, 'norm_eps': 1e-06, 'vocab_size': 32000}
n_parts = 1
Processing part 0
Processing variable: tok_embeddings.weight with shape: torch.Size([32000, 4096]) and type: torch.float16
Processing variable: norm.weight with shape: torch.Size([4096]) and type: torch.float16
Converting to float32
Processing variable: output.weight with shape: torch.Size([32000, 4096]) and type: torch.float16
Processing variable: layers.0.attention.wq.weight with shape: torch.Size([4096, 4096]) and type: torch.float16
Processing variable: layers.0.attention.wk.weight with shape: torch.Size([4096, 4096]) and type: torch.float16
Processing variable: layers.0.attention.wv.weight with shape: torch.Size([4096, 4096]) and type: torch.float16
Processing variable: layers.0.attention.wo.weight with shape: torch.Size([4096, 4096]) and type: torch.float16
Processing variable: layers.0.feed_forward.w1.weight with shape: torch.Size([11008, 4096]) and type: torch.float16
Processing variable: layers.0.feed_forward.w2.weight with shape: torch.Size([4096, 11008]) and type: torch.float16
Processing variable: layers.0.feed_forward.w3.weight with shape: torch.Size([11008, 4096]) and type: torch.float16
Processing variable: layers.0.attention_norm.weight with shape: torch.Size([4096]) and type: torch.float16
...
Done. Output file: models/7B/ggml-model-f16.bin, (part 0 )
The next step is quantization (the final argument, 2, selects the quantization type, q4_0 in this build):
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2
The output is as follows:
llama_model_quantize: loading model from './models/7B/ggml-model-f16.bin'
llama_model_quantize: n_vocab = 32000
llama_model_quantize: n_ctx   = 512
llama_model_quantize: n_embd  = 4096
llama_model_quantize: n_mult  = 256
llama_model_quantize: n_head  = 32
llama_model_quantize: n_layer = 32
llama_model_quantize: f16     = 1
...
layers.31.attention_norm.weight - [ 4096, 1], type = f32 size = 0.016 MB
layers.31.ffn_norm.weight - [ 4096, 1], type = f32 size = 0.016 MB
llama_model_quantize: model size = 25705.02 MB
llama_model_quantize: quant size = 4017.27 MB
llama_model_quantize: hist: 0.000 0.022 0.019 0.033 0.053 0.078 0.104 0.125 0.134 0.125 0.104 0.078 0.053 0.033 0.019 0.022
main: quantize time = 29389.45 ms
main: total time = 29389.45 ms
Step 5: Run the model
./main -m ./models/7B/ggml-model-q4_0.bin \
-t 8 \
-n 128 \
-p 'The first president of the USA was '
main: seed = 1678615879
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 4529.34 MB
llama_model_load: memory_size =   512.00 MB, n_mem = 16384
llama_model_load: loading model part 1/1 from './models/7B/ggml-model-q4_0.bin'
llama_model_load: .................................... done
llama_model_load: model size =  4017.27 MB / num tensors = 291
main: prompt: 'The first president of the USA was '
main: number of tokens in prompt = 9
     1 -> ''
  1576 -> 'The'
   937 -> ' first'
  6673 -> ' president'
   310 -> ' of'
   278 -> ' the'
  8278 -> ' USA'
   471 -> ' was'
 29871 -> ' '
sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000
The first president of the USA was 57 years old when he assumed office (George Washington). Nowadays, the US electorate expects the new president to be more young at heart. President Donald Trump was 70 years old when he was inaugurated. In contrast to his predecessors, he is physically fit, healthy and active. And his fitness has been a prominent theme of his presidency. During the presidential campaign, he famously said he
would be the “most active president ever” — a statement Trump has not yet achieved, but one that fits his approach to the office. His tweets demonstrate his physical activity.
main: mem per token = 14434244 bytes
main: load time = 1311.74 ms
main: sample time = 278.96 ms
main: predict time = 7375.89 ms / 54.23 ms per token
main: total time = 9216.61 ms
Resource usage
The second blogger notes that, while running, the 13B model used about 4GB of memory and 748% CPU. (Which is as configured: the model was set to use 8 CPU cores.)
No instruction tuning
One of the key reasons GPT-3 and ChatGPT work so well is that they have both been instruction-tuned.
This extra training gives them the ability to respond effectively to human instructions, such as "summarize this", "write a poem about an otter", or "extract the key points from this article".
The blogger who wrote the tutorial says that, as far as he can tell, LLaMA has no such ability.
In other words, a LLaMA prompt needs to take the classic form of "some text that will be completed by ...". That makes prompt engineering considerably harder.
For example, the blogger still has not figured out a prompt that gets LLaMA to summarize a piece of text.
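To make the distinction concrete, here is a hypothetical sketch (my own illustration, not a working recipe from either tutorial) of how a summarization attempt has to be phrased as a completion rather than an instruction:
./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 128 \
-p 'Article: the quick brown fox jumped over the lazy dog.
One-sentence summary of the article: '
The flags mirror the run shown above; whether the model actually continues with a usable summary is exactly what the blogger found unreliable.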