
A Guide to Building Multimodal RAG: Opening Up More Possibilities for AI Systems

Author: Jing Yan 2024-12-06 08:20:26


Translator | Jing Yan

Reviewer | Chong Lou

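This article is an in-depth guide to building a multimodal RAG system with Milvus, and to the possibilities such a system opens up for AI applications.

Working with a single data format is increasingly outdated. As businesses rely more and more on information to make critical decisions, they need to be able to compare data across formats. Fortunately, traditional AI systems limited to one data type have given way to multimodal systems that can understand and process complex information.

Multimodal search and multimodal retrieval-augmented generation (RAG) systems have made great progress in this area in recent years. These systems can process multiple types of data, including text, images, and audio, to provide context-aware responses.

In this article, we discuss how developers can build their own multimodal RAG system with Milvus. We also walk through building one that handles text and image data, performs similarity search, and uses a language model to refine the output.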

What Is Milvus?

A vector database is a special type of database for storing, indexing, and retrieving vector embeddings. Vector embeddings are mathematical representations of data (such as images, text, and audio) that let you compare items not only for equality but also for semantic similarity. Milvus is an open-source, high-performance vector database. It is available on GitHub under the Apache-2.0 license and has more than 30,000 stars.

Milvus gives developers a flexible way to manage and query large-scale vector data. Its efficiency makes it a good fit for applications built on deep learning models, such as retrieval-augmented generation (RAG), multimodal search, recommendation engines, and anomaly detection.

Milvus offers several deployment options. Milvus Lite is a lightweight version that runs inside a Python application and is well suited to prototyping in a local environment. Milvus Standalone and Milvus Distributed are the scalable, production-ready options (that is, tested and optimized for use in production environments).

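Multimodal RAG: Extending Beyond Text

Before building the system, it helps to understand traditional text-based RAG and how it evolved into multimodal RAG.

Retrieval-augmented generation (RAG) is a method of retrieving contextual information from external sources so that a large language model (LLM) can generate more accurate output. Traditional RAG is a very effective strategy for improving LLM output, but it is limited to text data. In many real-world applications, data extends beyond text: images, charts, and other modalities carry critical context.

Multimodal RAG addresses this limitation by supporting different data types, giving the LLM better context.

Put simply, in a multimodal RAG system the retrieval component searches for relevant information across data modalities, and the generation component produces more accurate results based on what was retrieved.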
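
The split between a retrieval component and a generation component can be sketched in a few lines. This is only an illustration under stated assumptions: `Document`, `retrieve`, and `build_prompt` are hypothetical names, and the word-overlap scoring is a stand-in for a real vector search. The prompt it builds is what a real system would pass to the LLM.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """One retrievable item; `modality` marks it as text, image, and so on."""
    doc_id: str
    modality: str
    description: str


def retrieve(query: str, corpus: list[Document], top_k: int = 2) -> list[Document]:
    """Toy retriever: rank items of any modality by words shared with the query.

    A real system would compare vector embeddings instead of words.
    """
    query_words = set(query.lower().split())

    def overlap(doc: Document) -> int:
        return len(query_words & set(doc.description.lower().split()))

    return sorted(corpus, key=overlap, reverse=True)[:top_k]


def build_prompt(query: str, context: list[Document]) -> str:
    """Assemble the retrieved context and the question into a prompt for the generator."""
    lines = [f"[{doc.modality}] {doc.description}" for doc in context]
    return "Context:\n" + "\n".join(lines) + f"\n\nQuestion: {query}"


corpus = [
    Document("t1", "text", "review of a leopard print phone case"),
    Document("i1", "image", "photo of a leopard print phone case"),
    Document("i2", "image", "photo of a stainless steel kettle"),
]
context = retrieve("leopard phone case", corpus)
print(build_prompt("leopard phone case", context))
```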

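Understanding Vector Embeddings and Similarity Search

Vector embeddings and similarity search are the two fundamental concepts behind multimodal RAG. Let's look at each.

Vector Embeddings

As mentioned earlier, vector embeddings are mathematical (numerical) representations of data. Machines use these representations to capture the semantic meaning of different data types, such as text, images, and audio.

In natural language processing (NLP), document chunks are converted into vectors, and semantically similar words are mapped to nearby points in the vector space. The same holds for images, where embeddings represent semantic features. This lets us capture properties such as color, texture, and object shape in numeric form.

The main purpose of vector embeddings is to preserve the relationships and similarities between different pieces of data.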

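Similarity Search

Similarity search finds and locates data in a given dataset. In the context of vector embeddings, it finds the vectors in the dataset that are closest to a query vector.

Several methods are commonly used to measure the similarity between vectors: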

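Euclidean distance: measures the straight-line distance between two points in the vector space.
Cosine similarity: measures the cosine of the angle between two vectors (focusing on their direction rather than their magnitude).
Dot product: the sum of the products of corresponding elements.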
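
The three measures are short enough to write out directly. The sketch below uses only the Python standard library; the function names and the sample vectors are chosen here for illustration.

```python
import math


def dot_product(a: list[float], b: list[float]) -> float:
    """Sum of the products of corresponding elements."""
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; ignores their magnitudes."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)


a = [1.0, 2.0, 2.0]
b = [2.0, 4.0, 4.0]  # same direction as a, twice the length

print(dot_product(a, b))         # 18.0
print(euclidean_distance(a, b))  # 3.0
print(cosine_similarity(a, b))   # 1.0
```

The two vectors point in the same direction, so their cosine similarity is 1.0 even though their Euclidean distance is not zero. This is the practical difference between a direction-based and a distance-based measure.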

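The choice of similarity metric usually depends on the application's data and on how the developer approaches the problem.

Similarity search over large datasets demands significant compute and resources. This is where approximate nearest neighbor (ANN) algorithms come in. ANN algorithms trade a small amount of accuracy for a significant speedup, which makes them a suitable choice for large-scale applications.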
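
The accuracy-for-speed trade can be seen in a toy inverted-file (IVF-style) index: vectors are grouped by their nearest centroid, and a query scans only the few closest groups instead of the whole dataset. This is a simplified sketch for intuition, not how Milvus implements its indexes. The function names, the randomly sampled centroids (a crude stand-in for k-means), and the `nprobe` value are all assumptions made for the example.

```python
import math
import random


def distance(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_search(query: list[float], vectors: list[list[float]]) -> int:
    """Brute force: compare the query with every vector."""
    return min(range(len(vectors)), key=lambda i: distance(query, vectors[i]))


def build_index(vectors: list[list[float]], centroids: list[list[float]]) -> dict[int, list[int]]:
    """Assign each vector to the bucket of its nearest centroid."""
    buckets: dict[int, list[int]] = {c: [] for c in range(len(centroids))}
    for i, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: distance(vec, centroids[c]))
        buckets[nearest].append(i)
    return buckets


def approximate_search(query, vectors, centroids, buckets, nprobe: int = 1) -> int:
    """Scan only the `nprobe` buckets whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: distance(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in buckets[c]]
    if not candidates:  # every probed bucket was empty: fall back to a full scan
        candidates = list(range(len(vectors)))
    return min(candidates, key=lambda i: distance(query, vectors[i]))


random.seed(0)
vectors = [[random.random(), random.random()] for _ in range(1000)]
centroids = random.sample(vectors, 10)
buckets = build_index(vectors, centroids)

query = [0.5, 0.5]
exact = exact_search(query, vectors)
approx = approximate_search(query, vectors, centroids, buckets, nprobe=2)
print("exact neighbor:", exact, "approximate neighbor:", approx)
```

With `nprobe` equal to the number of centroids the search scans every bucket and matches the brute-force result. Smaller values scan fewer vectors and may occasionally miss the true nearest neighbor.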

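Milvus uses advanced ANN algorithms, including HNSW and DiskANN, to run efficient similarity searches over large vector embedding datasets so that developers can find relevant data points quickly. It also supports other index types, such as IVF and CAGRA, which makes it a more efficient vector search solution.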

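Building a Multimodal RAG System with Milvus

Now that we have covered the concepts, it is time to build a multimodal RAG system with Milvus. In the example below, we use Milvus Lite (the lightweight version of Milvus, well suited to experimentation and prototyping) for vector storage and retrieval, BGE for image processing and embedding, and GPT-4o for reranking the results.

Prerequisites

First, you need a Milvus instance to store your data. You can set up Milvus Lite with pip, run a local instance with Docker, or sign up for a free hosted Milvus account through Zilliz Cloud.

Second, you need an LLM for your RAG pipeline, so head to OpenAI and get an API key. The free tier is enough to run this code.

Next, create a new directory and a Python virtual environment (or take whatever steps you normally use to manage Python).

For this tutorial, you also need to install the pymilvus library (the official Python SDK for Milvus) and a few common tools.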

Set Up Milvus Lite

pip install -U pymilvus

Install Dependencies

pip install --upgrade pymilvus openai datasets opencv-python timm einops ftfy peft tqdm
git clone https://github.com/FlagOpen/FlagEmbedding.git
pip install -e FlagEmbedding

Download the Data

The commands below download the sample data and extract it to the local folder "./images_folder", which includes:

Images: a subset of Amazon Reviews 2023 containing about 900 images from the "Appliance", "Cell_Phones_and_Accessories", and "Electronics" categories.
Example query image: leopard.jpg

wget https://github.com/milvus-io/bootcamp/releases/download/data/amazon_reviews_2023_subset.tar.gz
tar -xzf amazon_reviews_2023_subset.tar.gz

Load the Embedding Model

We use the Visualized BGE model "bge-visualized-base-en-v1.5" to generate embeddings for both images and text.

Now download the weights from HuggingFace.

wget https://huggingface.co/BAAI/bge-visualized/resolve/main/Visualized_base_en_v1.

Then, let's build an encoder.

import torch
from visual_bge.modeling import Visualized_BGE


class Encoder:
    """Thin wrapper around Visualized BGE that returns embeddings as plain lists."""

    def __init__(self, model_name: str, model_path: str):
        self.model = Visualized_BGE(model_name_bge=model_name, model_weight=model_path)
        self.model.eval()  # inference mode

    def encode_query(self, image_path: str, text: str) -> list[float]:
        # Embed an image together with a text instruction (a composed query).
        with torch.no_grad():
            query_emb = self.model.encode(image=image_path, text=text)
        return query_emb.tolist()[0]

    def encode_image(self, image_path: str) -> list[float]:
        # Embed a single image.
        with torch.no_grad():
            query_emb = self.model.encode(image=image_path)
        return query_emb.tolist()[0]


model_name = "BAAI/bge-base-en-v1.5"
model_path = "./Visualized_base_en_v1.5.pth"  # Change to your own value if using a different model path
encoder = Encoder(model_name, model_path)

Generate Embeddings and Load Data into Milvus

This section shows how to load the sample images and their corresponding embeddings into the database.

Generate Embeddings

First, we need to create embeddings for all the images in the dataset.

Load all the images from the data directory and convert them to embeddings.

import os
from glob import glob

from tqdm import tqdm

data_dir = (
    "./images_folder"  # Change to your own value if using a different data directory
)
image_list = glob(
    os.path.join(data_dir, "images", "*.jpg")
)  # We will only use images ending with ".jpg"

# Map each image path to its embedding; images that fail to encode are skipped.
image_dict = {}
for image_path in tqdm(image_list, desc="Generating image embeddings: "):
    try:
        image_dict[image_path] = encoder.encode_image(image_path)
    except Exception as e:
        print(f"Failed to generate embedding for {image_path}: {e}. Skipped.")
        continue
print("Number of encoded images:", len(image_dict))
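
The search step later in this article uses `milvus_client` and `collection_name`, but the text as given never defines them. The sketch below shows one way the load step might look with Milvus Lite. The file name `./milvus_demo.db`, the collection name `multimodal_rag_demo`, and the helper names `build_rows` and `load_into_milvus` are assumptions introduced here, not values from the article. The pymilvus import sits inside the function so that the row-building helper can be used and checked without a database.

```python
def build_rows(image_dict: dict[str, list[float]]) -> list[dict]:
    """Turn {image_path: embedding} into one row per image, ready for insertion."""
    return [{"image_path": path, "vector": emb} for path, emb in image_dict.items()]


def load_into_milvus(
    image_dict: dict[str, list[float]],
    uri: str = "./milvus_demo.db",              # assumed local file; selects Milvus Lite
    collection_name: str = "multimodal_rag_demo",  # assumed collection name
):
    """Create a collection sized to the embeddings and insert every image row."""
    from pymilvus import MilvusClient

    dim = len(next(iter(image_dict.values())))  # embedding dimension
    client = MilvusClient(uri=uri)
    client.create_collection(
        collection_name=collection_name,
        dimension=dim,
        metric_type="COSINE",       # matches the metric used at search time
        auto_id=True,
        enable_dynamic_field=True,  # lets rows carry the extra "image_path" field
    )
    client.insert(collection_name=collection_name, data=build_rows(image_dict))
    return client


# Usage, once image_dict has been filled by the loop above:
# collection_name = "multimodal_rag_demo"
# milvus_client = load_into_milvus(image_dict, collection_name=collection_name)
```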

Run a Multimodal Search and Rerank the Results

In this section, we first search for relevant images with a multimodal query, then use an LLM service to rerank the retrieved results and pick the best image, with an explanation.

Run the Multimodal Search

We are now ready to run an advanced multimodal search with a query composed of an image and a text instruction.

# milvus_client and collection_name must already exist here: a connected
# MilvusClient and the name of the collection holding the image embeddings.
query_image = os.path.join(
    data_dir, "leopard.jpg"
)  # Change to your own query image path
query_text = "phone case with this image theme"

# Embed the image and the text instruction together as one query vector.
query_vec = encoder.encode_query(image_path=query_image, text=query_text)

search_results = milvus_client.search(
    collection_name=collection_name,
    data=[query_vec],
    output_fields=["image_path"],
    limit=9,  # Max number of search results to return
    search_params={"metric_type": "COSINE", "params": {}},  # Search parameters
)[0]

retrieved_images = [hit.get("entity").get("image_path") for hit in search_results]
print(retrieved_images)

The results are as follows:

['./images_folder/images/518Gj1WQ-RL._AC_.jpg', 
'./images_folder/images/41n00AOfWhL._AC_.jpg'
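
Each hit carries a similarity score alongside the stored fields, which can be used to drop weak matches before reranking. The sketch below assumes hits shaped like dictionaries with `distance` and `entity` keys, the same shape the `hit.get("entity")` call above relies on. The sample hits, the file names in them, and the `min_score` threshold are made-up values for illustration.

```python
def filter_hits(hits: list[dict], min_score: float = 0.0) -> list[tuple[str, float]]:
    """Return (image_path, score) pairs whose cosine similarity is at least min_score."""
    pairs = [(hit["entity"]["image_path"], hit["distance"]) for hit in hits]
    return [(path, score) for path, score in pairs if score >= min_score]


# Hypothetical hits, hand-written to mirror the assumed result shape.
sample_hits = [
    {"id": 1, "distance": 0.82, "entity": {"image_path": "./images_folder/images/a.jpg"}},
    {"id": 2, "distance": 0.41, "entity": {"image_path": "./images_folder/images/b.jpg"}},
]
for path, score in filter_hits(sample_hits, min_score=0.5):
    print(f"{score:.2f}  {path}")
```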

Rerank the Results with GPT-4o

Now we use GPT-4o to rank the retrieved images and find the best match. The LLM also explains the reason for its ranking.

1. Create a panoramic view.

import numpy as np
import cv2
from PIL import Image

img_height = 300
img_width = 300
row_count = 3


def create_panoramic_view(query_image_path: str, retrieved_images: list) -> np.ndarray:
    """
    Create a panoramic image: the query image in a left-hand column, followed by
    a 3x3 grid of the retrieved images, each labeled with its index.

    args:
        query_image_path: path to the query image
        retrieved_images: paths of the retrieved images (at most 9)

    returns:
        np.ndarray: the panoramic view image (BGR)
    """
    panoramic_width = img_width * row_count
    panoramic_height = img_height * row_count
    panoramic_image = np.full(
        (panoramic_height, panoramic_width, 3), 255, dtype=np.uint8
    )

    # create and resize the query image with a blue border
    query_image_null = np.full((panoramic_height, img_width, 3), 255, dtype=np.uint8)
    query_image = Image.open(query_image_path).convert("RGB")
    query_array = np.array(query_image)[:, :, ::-1]  # RGB -> BGR for OpenCV
    resized_image = cv2.resize(query_array, (img_width, img_height))
    border_size = 10
    blue = (255, 0, 0)  # blue color in BGR
    bordered_query_image = cv2.copyMakeBorder(
        resized_image,
        border_size,
        border_size,
        border_size,
        border_size,
        cv2.BORDER_CONSTANT,
        value=blue,
    )
    query_image_null[img_height * 2 : img_height * 3, 0:img_width] = cv2.resize(
        bordered_query_image, (img_width, img_height)
    )

    # add the label "query" just above the query image; the canvas is only
    # img_height * 3 tall, so text placed below the image would be clipped
    text = "query"
    font_scale = 1
    font_thickness = 2
    text_org = (10, img_height * 2 - 10)
    cv2.putText(
        query_image_null,
        text,
        text_org,
        cv2.FONT_HERSHEY_SIMPLEX,
        font_scale,
        blue,
        font_thickness,
        cv2.LINE_AA,
    )

    # combine the rest of the images into the panoramic view
    retrieved_imgs = [
        np.array(Image.open(img).convert("RGB"))[:, :, ::-1] for img in retrieved_images
    ]
    for i, image in enumerate(retrieved_imgs):
        # shrink by 4 px so the 2 px black border restores the full cell size
        image = cv2.resize(image, (img_width - 4, img_height - 4))
        row = i // row_count
        col = i % row_count
        start_row = row * img_height
        start_col = col * img_width
        border_size = 2
        bordered_image = cv2.copyMakeBorder(
            image,
            border_size,
            border_size,
            border_size,
            border_size,
            cv2.BORDER_CONSTANT,
            value=(0, 0, 0),
        )
        panoramic_image[
            start_row : start_row + img_height, start_col : start_col + img_width
        ] = bordered_image

        # add a red index number on a white patch in each image's top-left corner
        text = str(i)
        org = (start_col + 50, start_row + 30)
        (font_width, font_height), baseline = cv2.getTextSize(
            text, cv2.FONT_HERSHEY_SIMPLEX, 1, 2
        )
        top_left = (org[0] - 48, start_row + 2)
        bottom_right = (org[0] - 48 + font_width + 5, org[1] + baseline + 5)
        cv2.rectangle(
            panoramic_image, top_left, bottom_right, (255, 255, 255), cv2.FILLED
        )
        cv2.putText(
            panoramic_image,
            text,
            (start_col + 10, start_row + 30),
            cv2.FONT_HERSHEY_SIMPLEX,
            1,
            (0, 0, 255),
            2,
            cv2.LINE_AA,
        )

    # combine the query image with the panoramic view
    panoramic_image = np.hstack([query_image_null, panoramic_image])
    return panoramic_image
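
The placement of each retrieved image in the grid comes down to integer division and remainder. Here is a standalone sketch of that index arithmetic, using the same three-per-row layout and 300-pixel cells as the function above. `cell_origin` is a name introduced here for illustration.

```python
def cell_origin(
    index: int, row_count: int = 3, img_width: int = 300, img_height: int = 300
) -> tuple[int, int]:
    """Return the (start_row, start_col) pixel offset of the grid cell for `index`."""
    row = index // row_count  # integer division picks the grid row
    col = index % row_count   # remainder picks the column within that row
    return row * img_height, col * img_width


for i in range(9):
    print(i, cell_origin(i))
```

Index 4, for example, lands in the middle cell at offset (300, 300). An index of 9 or more would fall outside a 3x3 canvas, which is why the search above limits results to 9.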

2. Combine the query image and the retrieved images, with their indices, in one panoramic view.

from PIL import Image

combined_image_path = os.path.join(data_dir, "combined_image.jpg")
panoramic_image = create_panoramic_view(query_image, retrieved_images)
cv2.imwrite(combined_image_path, panoramic_image)

# Open the saved panorama and show a downsized preview.
combined_image = Image.open(combined_image_path)
show_combined_image = combined_image.resize((300, 300))
show_combined_image.show()

(Figure: multimodal search results)

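3. Rerank the results and give an explanation

We send the combined image to a multimodal LLM service with an appropriate prompt to rank the retrieved results and explain the ranking. Note: to use GPT-4o as the LLM, you need to have your OpenAI API key ready in advance.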

import base64
from typing import Optional

import requests

openai_api_key = "sk-***"  # Change to your OpenAI API Key


def generate_ranking_explanation(
    combined_image_path: str, caption: str, infos: Optional[dict] = None
) -> tuple[list[int], str]:
    # Encode the panoramic image as base64 so it can be sent inline.
    with open(combined_image_path, "rb") as image_file:
        base64_image = base64.b64encode(image_file.read()).decode("utf-8")

    information = (
        "You are responsible for ranking results for a Composed Image Retrieval. "
        "The user retrieves an image with an 'instruction' indicating their retrieval intent. "
        "For example, if the user queries a red car with the instruction 'change this car to blue,' a similar type of car in blue would be ranked higher in the results. "
        "Now you would receive instruction and query image with blue border. Every item has its red index number in its top left. Do not misunderstand it. "
        f"User instruction: {caption} \n\n"
    )

    # add additional information for each image
    if infos:
        for i, info in enumerate(infos["product"]):
            information += f"{i}. {info}\n"

    information += (
        "Provide a new ranked list of indices from most suitable to least suitable, followed by an explanation for the top 1 most suitable item only. "
        "The format of the response has to be 'Ranked list: []' with the indices in brackets as integers, followed by 'Reasons:' plus the explanation why this most fit user's query intent."
    )

    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {openai_api_key}",
    }
    payload = {
        "model": "gpt-4o",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": information},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                    },
                ],
            }
        ],
        "max_tokens": 300,
    }

    response = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers=headers,
        json=payload,
        timeout=60,
    )
    response.raise_for_status()  # fail early on HTTP errors such as an invalid API key
    result = response.json()["choices"][0]["message"]["content"]

    # parse the ranked indices from the response
    start_idx = result.find("[")
    end_idx = result.find("]")
    ranked_indices_str = result[start_idx + 1 : end_idx].split(",")
    ranked_indices = [int(index.strip()) for index in ranked_indices_str]

    # extract explanation
    explanation = result[end_idx + 1 :].strip()
    return ranked_indices, explanation
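
The bracket-based parsing above assumes the model follows the requested format exactly. If the reply has no bracketed list, or lists an index that does not exist, the later lookup into `retrieved_images` fails. The sketch below is a more defensive variant, not part of the original tutorial. `parse_ranking` and the sample reply are introduced here for illustration.

```python
import re


def parse_ranking(result: str, num_items: int) -> tuple[list[int], str]:
    """Extract the ranked indices and the explanation from the model's reply.

    Raises ValueError if no bracketed list is found. Indices outside
    0..num_items-1 and repeated indices are dropped.
    """
    match = re.search(r"\[([^\]]*)\]", result)
    if match is None:
        raise ValueError("no ranked list found in response")
    ranked: list[int] = []
    for token in re.findall(r"\d+", match.group(1)):
        index = int(token)
        if index < num_items and index not in ranked:
            ranked.append(index)
    explanation = result[match.end():].strip()
    return ranked, explanation


reply = "Ranked list: [6, 2, 2, 11, 0]\nReasons: item 6 matches the leopard theme."
print(parse_ranking(reply, num_items=9))
```

Here the duplicate 2 and the out-of-range 11 are discarded, so the caller can index into the nine retrieved images safely.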

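Get the ranked image indices and the reason for the best result:

ranked_indices, explanation = generate_ranking_explanation(
    combined_image_path, query_text
)

4. Show the best result with its explanation

print(explanation)

# The first index in the ranking is the best match.
best_index = ranked_indices[0]
best_img = Image.open(retrieved_images[best_index])
best_img = best_img.resize((150, 150))
best_img.show()

Result:

"Reasons: The item that best fits the user's query intent is index 6, because the instruction specifies a phone case with the theme of the image, which is a leopard. The phone case at index 6 has a design resembling a leopard pattern, making it the closest match to the user's request for a phone case with the image theme."

(Figure: leopard-print phone case, the best result)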

Conclusion

In this article, we walked through building a multimodal RAG system with Milvus, an open-source vector database. We covered how to set up Milvus, load image data, run a similarity search, and use an LLM to rerank the retrieved results for a more accurate response.

Multimodal RAG solutions open up many possibilities for AI systems that need to understand and process several forms of data. Common examples include better image search engines and more context-driven results, and there is more left to explore.

Original title: Want to Search for Something With an Image and a Text Description? Try a Multimodal RAG, by Jiang Chen

Editor: Jiang Hua. Source: 51CTO Featured Content
