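Building a Multimodal RAG System with CLIP and an LLM
In this article we explore how to build a retrieval-augmented generation (RAG) system with an open-source large multimodal language model. The focus is on doing this without LangChain or LlamaIndex, which keeps framework dependencies to a minimum.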
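What Is RAG
In artificial intelligence, retrieval-augmented generation (RAG) is a technique that extends what large language models (LLMs) can do. In essence, RAG makes a model's responses more specific by letting it retrieve up-to-date information from external sources at query time.
The architecture pairs generation with a dynamic retrieval step, so the system can keep up with changing information across domains. Unlike fine-tuning or retraining, RAG is a cost-effective way to give a model current, relevant information without changing the model itself.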
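The retrieve-then-generate idea can be sketched in a few lines. This is a toy illustration only: the `embed` function is a hypothetical bag-of-words stand-in for a real embedding model, and `DOCS`, `retrieve`, and `build_rag_prompt` are names invented for this sketch rather than anything used later in the article.

```python
import math
import re
from collections import Counter

# Toy knowledge base; a real system would keep these in a vector store.
DOCS = [
    "Tulips are spring-blooming bulbs that prefer well-drained soil.",
    "Roses are woody perennials with prickly stems and showy flowers.",
    "Bellflowers belong to the genus Campanula and have bell-shaped blooms.",
]

def embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, docs, k=1):
    """Return the k documents most similar to the question."""
    q = embed(question)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_rag_prompt(question, docs):
    """Put the retrieved context in front of the question for the generator."""
    context = "\n".join(retrieve(question, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

prompt = build_rag_prompt("How do I grow tulips?", DOCS)
print(prompt)
```

The model itself is never modified; only the prompt changes as the knowledge base changes, which is why RAG is cheaper than retraining.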
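What RAG Offers
1. Better accuracy and reliability
RAG addresses the unpredictability of large language models (LLMs) by pointing them at authoritative knowledge sources. This lowers the risk of false or outdated information and makes responses more accurate and reliable.
2. More transparency and trust
Generative models such as LLMs tend to be opaque, which makes their output hard to trust. RAG gives organizations more control over the generated text, which helps with concerns about bias, reliability, and compliance.
3. Fewer hallucinations
LLMs are prone to hallucination: responses that are coherent but inaccurate or fabricated. By grounding responses in authoritative sources, RAG reduces the risk of misleading advice in critical sectors.
4. Cost-effective adaptability
RAG improves AI output without extensive retraining or fine-tuning. Fetching specific details on demand keeps information current and relevant, so the system adapts as information changes.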
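Multimodal Models
A multimodal model takes several kinds of input and combines them into a single output. Take CLIP as an example: it is trained on text-image pairs, and through contrastive learning it learns which texts and images match.
The model produces the same (or very similar) embedding vectors for different inputs that represent the same thing.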
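To make the shared embedding space concrete, here is a small sketch of how matching works once images and texts live in the same space. The 3-dimensional vectors are made up for illustration (real CLIP embeddings have hundreds of dimensions), and `best_caption` is a hypothetical helper, not part of CLIP.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    """Cosine similarity: the dot product of the unit-length vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

# Made-up embeddings for illustration only.
image_embeddings = {
    "tulip.jpg": [0.9, 0.1, 0.2],
    "rose.jpg": [0.1, 0.8, 0.3],
}
text_embeddings = {
    "a photo of a tulip": [0.8, 0.2, 0.1],
    "a photo of a rose": [0.2, 0.9, 0.2],
}

def best_caption(image_name):
    """Pick the text whose embedding is closest to the image embedding."""
    img = image_embeddings[image_name]
    return max(text_embeddings, key=lambda t: cosine(img, text_embeddings[t]))

for name in image_embeddings:
    print(name, "->", best_caption(name))
```

Contrastive training pushes matching pairs toward high cosine similarity and mismatched pairs toward low similarity, which is what makes this nearest-neighbor lookup meaningful.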
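Multimodal Large Language Models
GPT-4V and Gemini Vision are multimodal large language models (MLLMs) that integrate several data types, including images, text, language, and audio. Text-only language models such as GPT-3, BERT, and RoBERTa perform well on text tasks but struggle to understand and process other data types. Multimodal models address this limitation by combining modalities, which gives them a fuller understanding of varied data.
Multimodal large language models go beyond the traditional text-only approach. Models such as GPT-4 can process both images and text, and so understand information more completely.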
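Combining with RAG
Here we use CLIP to embed images and text, and store the embeddings in the ChromaDB vector database. A large model then uses the retrieved information to take part in a chat session with the user.
We use images from Kaggle and information from Wikipedia to build a flower-expert chatbot.
First, install the packages: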
!pip install -q timm einops wikipedia chromadb open_clip_torch
!pip install -q transformers==4.36.0
!pip install -q bitsandbytes==0.41.3 accelerate==0.25.0
Preprocessing is simple: put the images and the text files in one folder.
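As a rough sketch of that preprocessing step, the helper below copies a few images per flower class into one folder and writes one text file per class. Everything here is an assumption for illustration: the function name `build_data_folder`, the one-subfolder-per-class source layout, and the `fetch_summary` callable that supplies the article text are not taken from the original code.

```python
import shutil
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}

def build_data_folder(src_root, dst, fetch_summary, per_class=5):
    """Copy a few images per class into `dst` and write one .txt article per class.

    Assumes `src_root` holds one subfolder per flower class (hypothetical layout).
    `fetch_summary` is any callable that maps a class name to article text.
    """
    dst = Path(dst)
    dst.mkdir(parents=True, exist_ok=True)
    class_dirs = sorted(p for p in Path(src_root).iterdir() if p.is_dir())
    for class_dir in class_dirs:
        images = sorted(
            f for f in class_dir.iterdir() if f.suffix.lower() in IMAGE_SUFFIXES
        )
        for image in images[:per_class]:
            # Prefix with the class name so files from different classes cannot collide.
            shutil.copy(image, dst / f"{class_dir.name}_{image.name}")
        article = fetch_summary(class_dir.name)
        (dst / f"{class_dir.name}.txt").write_text(article, encoding="utf-8")
    return dst
```

One option for `fetch_summary` is the `summary` function of the `wikipedia` package installed above, for example `build_data_folder(src, dst, wikipedia.summary)`; whether the resulting article matches the flower depends on the class names in your dataset.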
Any vector database would work; here we use ChromaDB.
import chromadb
from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction
from chromadb.utils.data_loaders import ImageLoader
from chromadb.config import Settings
client = chromadb.PersistentClient(path="DB")
embedding_function = OpenCLIPEmbeddingFunction()
image_loader = ImageLoader()  # required when reading images from URIs
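ChromaDB can also take a custom embedding function, which follows this template (embeddings is a placeholder for whatever your model computes):
from chromadb import Documents, EmbeddingFunction, Embeddings
class MyEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        # embed the documents (or images) somehow
        return embeddings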
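To show what a filled-in embedding function might look like, here is a toy version that hashes words into buckets. It is a sketch under stated assumptions: the class `HashingEmbeddingFunction` is hypothetical, it is a plain class rather than a subclass of ChromaDB's `EmbeddingFunction` so that it runs without ChromaDB installed, and word hashing is only a placeholder for a real model such as CLIP.

```python
import hashlib
import math

class HashingEmbeddingFunction:
    """Toy embedding function: hashes each word into one of `dim` buckets."""

    def __init__(self, dim=16):
        self.dim = dim

    def __call__(self, input):
        # `input` is a list of documents; return one vector per document.
        embeddings = []
        for doc in input:
            vec = [0.0] * self.dim
            for word in doc.lower().split():
                digest = hashlib.md5(word.encode("utf-8")).hexdigest()
                vec[int(digest, 16) % self.dim] += 1.0
            # Normalize to unit length so distances are comparable across documents.
            norm = math.sqrt(sum(x * x for x in vec)) or 1.0
            embeddings.append([x / norm for x in vec])
        return embeddings

ef = HashingEmbeddingFunction(dim=16)
vectors = ef(["red rose", "rose red", "blue tulip"])
print(len(vectors), len(vectors[0]))
```

The shape of the contract is the point: a list of inputs goes in, and a list of equal-length float vectors comes out, one per input.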
We create two collections here, one for text and one for images.
collection_images = client.create_collection(
    name='multimodal_collection_images',
    embedding_function=embedding_function,
    data_loader=image_loader)
collection_text = client.create_collection(
    name='multimodal_collection_text',
    embedding_function=embedding_function,
)
# Get the images
import os
IMAGE_FOLDER = '/kaggle/working/all_data'
image_uris = sorted([os.path.join(IMAGE_FOLDER, image_name) for image_name in os.listdir(IMAGE_FOLDER) if not image_name.endswith('.txt')])
ids = [str(i) for i in range(len(image_uris))]
collection_images.add(ids=ids, uris=image_uris)  # now we have the images collection
With CLIP, we can retrieve images with a text query like this.
from matplotlib import pyplot as plt
retrieved = collection_images.query(query_texts=["tulip"], include=['data'], n_results=3)
for img in retrieved['data'][0]:
    plt.imshow(img)
    plt.axis("off")
    plt.show()
An image can also be used to retrieve related images.
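A minimal sketch of image-to-image retrieval follows. It assumes a collection created with an image data loader, like `collection_images` above, and uses the `query_uris` argument of `Collection.query`. The helper name `similar_images` and its `skip_self` flag are hypothetical.

```python
def similar_images(collection, image_path, n_results=3, skip_self=True):
    """Return up to `n_results` stored images most similar to `image_path`.

    If the query image is itself in the collection it usually comes back as
    the top hit, so `skip_self` asks for one extra result and drops the first.
    """
    extra = 1 if skip_self else 0
    hits = collection.query(
        query_uris=[image_path],   # the collection's data loader reads the file
        include=["data"],          # return the decoded images themselves
        n_results=n_results + extra,
    )
    return hits["data"][0][extra:]
```

With the collection built above, `similar_images(collection_images, image_uris[0])` would return the nearest neighbors of the first stored image, which can then be displayed with `plt.imshow` as in the text-query example.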
The text collection is built like this:
# Now the text DB
from chromadb.utils import embedding_functions
# Not used below: collection_text embeds with the OpenCLIP function set at creation
default_ef = embedding_functions.DefaultEmbeddingFunction()
text_pth = sorted([os.path.join(IMAGE_FOLDER, image_name) for image_name in os.listdir(IMAGE_FOLDER) if image_name.endswith('.txt')])
list_of_text = []
for path in text_pth:
    with open(path, 'r') as f:
        list_of_text.append(f.read())
ids_txt_list = ['id'+str(i) for i in range(len(list_of_text))]
ids_txt_list
collection_text.add(
    documents=list_of_text,
    ids=ids_txt_list
)
Then query the text collection:
results = collection_text.query(
    query_texts=["What is the bellflower?"],
    n_results=1
)
results
The result is:
{'ids': [['id0']],
'distances': [[0.6072186183744086]],
'metadatas': [[None]],
'embeddings': None,
'documents': [['Campanula () is the type genus of the Campanulaceae family of flowering plants. Campanula are commonly known as bellflowers and take both their common and scientific names from the bell-shaped flowers—campanula is Latin for "little bell".\nThe genus includes over 500 species and several subspecies, distributed across the temperate and subtropical regions of the Northern Hemisphere, with centers of diversity in the Mediterranean region, Balkans, Caucasus and mountains of western Asia. The range also extends into mountains in tropical regions of Asia and Africa.\nThe species include annual, biennial and perennial plants, and vary in habit from dwarf arctic and alpine species under 5 cm high, to large temperate grassland and woodland species growing to 2 metres (6 ft 7 in) tall.']],
'uris': None,
'data': None}
Or use an image to retrieve text.
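import numpy as np
from PIL import Image
query_image = '/kaggle/input/flowers/flowers/rose/00f6e89a2f949f8165d5222955a5a37d.jpg'
raw_image = Image.open(query_image)
doc = collection_text.query(
    # Embed the image itself; a path string would not be loaded as an image
    query_embeddings=embedding_function([np.array(raw_image)]),
    n_results=1,
)['documents'][0][0]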
For this rose image, the result is:
A rose is either a woody perennial flowering plant of the genus Rosa (), in the family Rosaceae (), or the flower it bears. There are over three hundred species and tens of thousands of cultivars. They form a group of plants that can be erect shrubs, climbing, or trailing, with stems that are often armed with sharp prickles. Their flowers vary in size and shape and are usually large and showy, in colours ranging from white through yellows and reds. Most species are native to Asia, with smaller numbers native to Europe, North America, and northwestern Africa. Species, cultivars and hybrids are all widely grown for their beauty and often are fragrant. Roses have acquired cultural significance in many societies. Rose plants range in size from compact, miniature roses, to climbers that can reach seven meters in height. Different species hybridize easily, and this has been used in the development of the wide range of garden roses.
That completes the text-image matching, all of which is CLIP's work. Next we bring in the LLM.
from huggingface_hub import hf_hub_download
hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="configuration_llava.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="configuration_phi.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="modeling_llava.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="modeling_phi.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="processing_llava.py", local_dir="./", force_download=True)
We use visheratin/LLaVA-3b.
from modeling_llava import LlavaForConditionalGeneration
import torch
model = LlavaForConditionalGeneration.from_pretrained("visheratin/LLaVA-3b")
model = model.to("cuda")
Load the tokenizer.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("visheratin/LLaVA-3b")
Then define the processor so it is easy to call later.
from processing_llava import LlavaProcessor, OpenCLIPImageProcessor
image_processor = OpenCLIPImageProcessor(model.config.preprocess_config)
processor = LlavaProcessor(image_processor, tokenizer)
Now everything can be used directly.
import numpy as np
question = 'Answer with organized answers: What type of rose is in the picture? Mention some of its characteristics and how to take care of it?'
query_image = '/kaggle/input/flowers/flowers/rose/00f6e89a2f949f8165d5222955a5a37d.jpg'
raw_image = Image.open(query_image)
doc = collection_text.query(
    # Embed the image itself; a path string would not be loaded as an image
    query_embeddings=embedding_function([np.array(raw_image)]),
    n_results=1,
)['documents'][0][0]
plt.imshow(raw_image)
plt.show()
imgs = collection_images.query(query_uris=query_image, include=['data'], n_results=3)
# Skip the first hit, which is the closest match to the query image itself
for img in imgs['data'][0][1:]:
    plt.imshow(img)
    plt.axis("off")
    plt.show()
The retrieved results contain most of the information we need.
That completes the integration; the last step is to create the chat template.
prompt = """<|im_start|>system
A chat between a curious human and an artificial intelligence assistant.
The assistant is an expert in flowers, and gives helpful, detailed, and polite answers to the human's questions.
The assistant does not hallucinate and pays very close attention to the details.<|im_end|>
<|im_start|>user
<image>
{question} Use the following article as an answer source. Do not write outside its scope unless you find your answer better {article} if you think your answer is better add it after document.<|im_end|>
<|im_start|>assistant
""".format(question=question, article=doc)
We won't go into how to build the chat loop here; the full code is available at:
https://github.com/nadsoft-opensource/RAG-with-open-source-multi-modal
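For readers who want a starting point before opening the repository, the template above can be wrapped in a small helper that also carries earlier turns. This is a sketch, not the repository's code: `build_chat_prompt`, the `history` argument, and the shortened instruction text are assumptions, and how LLaVA-3b behaves with multi-turn history is not covered in this article.

```python
SYSTEM = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant is an expert in flowers, and gives helpful, detailed, and "
    "polite answers to the human's questions."
)

def build_chat_prompt(question, article, history=()):
    """Assemble a ChatML-style prompt.

    `history` holds earlier (user, assistant) text turns. The <image> tag is
    placed in the current user turn only, which is an assumption of this sketch.
    """
    parts = [f"<|im_start|>system\n{SYSTEM}<|im_end|>"]
    for user_msg, assistant_msg in history:
        parts.append(f"<|im_start|>user\n{user_msg}<|im_end|>")
        parts.append(f"<|im_start|>assistant\n{assistant_msg}<|im_end|>")
    parts.append(
        f"<|im_start|>user\n<image>\n{question} "
        f"Use the following article as an answer source: {article}<|im_end|>"
    )
    # Leave the assistant turn open so the model writes the answer.
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

example = build_chat_prompt("What flower is this?", "Roses are woody perennials.")
print(example)
```

In a chat loop, each answer would be appended to `history` together with its question, and the retrieved article would be refreshed for every new query.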