
Is OpenAI's Built-in Retrieval Actually Good? A Quantitative Evaluation Takes a Deep Look

Using the Ragas evaluation framework, we ran a detailed comparison and analysis of the OpenAI assistant against an open-source, vector-database-based RAG solution. Although the OpenAI assistant does retrieve reasonably well, it falls short of the vector-based RAG approach in answer quality, recall performance, and other respects, and the individual Ragas metrics quantify that conclusion.

Has a serious rival to vector databases arrived? Is another batch of startups in this space about to fold?

……

These are some of the reactions from the tech community after OpenAI launched the Assistant retrieval feature. The reason: the feature gives users RAG (Retrieval-Augmented Generation) capability for question answering over a knowledge base, whereas previously most people relied on a vector database as the key component of a RAG solution for reducing LLM "hallucinations".

So here is the question: which comes out on top, OpenAI's built-in Assistant retrieval or an open-source RAG solution built on a vector database?

In the spirit of rigorous verification, we ran a quantitative evaluation of this question, and the result is interesting: OpenAI really is strong!

But next to an open-source RAG solution built on a vector database, it still comes up a bit short!

In what follows, I will walk through the entire evaluation process. It is worth stressing that running such an evaluation is not easy: a handful of test samples simply cannot measure the various aspects of a RAG application in any meaningful way.

We therefore need a fair, objective RAG evaluation tool, a suitable dataset to test on, a quantitative analysis of the results, and reproducibility of the whole process.

Without further ado, here is the process!

I. Evaluation Tool

Ragas (https://docs.ragas.io/en/latest/) is an open-source framework dedicated to evaluating the effectiveness of RAG applications. The user only needs to supply a few pieces of information from the RAG run, such as the question, contexts, and answer, and Ragas uses them to compute quantitative scores for multiple metrics. Install Ragas with pip and the evaluation takes only a few lines of code:

Python
from ragas import evaluate
from datasets import Dataset

# prepare your huggingface dataset in the format
# dataset = Dataset({
#     features: ['question', 'contexts', 'answer', 'ground_truths'],
#     num_rows: 25
# })
results = evaluate(dataset)

# {'ragas_score': 0.860, 'context_precision': 0.817,
#  'faithfulness': 0.892, 'answer_relevancy': 0.874}

Ragas offers scoring metrics in several sub-categories, for example:

- From the generation angle: Faithfulness, which measures how well the answer is grounded, and Answer relevancy, which measures how relevant the answer is to the question.

- From the retrieval angle: Context precision, which measures the precision of the retrieved knowledge; Context recall, which measures how much of the required knowledge was retrieved; and Context relevancy, which measures the relevance of the retrieved content.

- From the angle of comparing the answer to the ground truth: Answer semantic similarity, which measures how close the answer is, and Answer correctness, which measures whether it is right.

- From the angle of the answer itself: the various Aspect Critique metrics.

[Figure: overview of the Ragas metrics]

Image source: https://docs.ragas.io/en/latest/concepts/metrics/index.html

Each metric measures a different aspect. Take answer correctness as an example: it is outcome-oriented and directly measures whether the RAG application's answer is correct. Below is a contrast between a high and a low answer correctness score:

Plain Text
Ground truth: Einstein was born in 1879 in Germany.
High answer correctness: In 1879, in Germany, Einstein was born.
Low answer correctness: In Spain, Einstein was born in 1879.

For details on the other metrics, see the official documentation:

(https://docs.ragas.io/en/latest/concepts/metrics/index.html).

Importantly, because each metric looks at a different angle, users can assess a RAG application's quality from all sides and multiple perspectives.
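If you only care about some of these angles, Ragas lets you pass an explicit metric list to evaluate. The sketch below is a hedged illustration of that usage; the metric names follow the Ragas documentation, but their availability may depend on the installed Ragas version, and the sample row is made up.

Python
# Minimal sketch: score only selected metrics instead of the full default set.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,        # generation: is the answer grounded in the contexts?
    answer_relevancy,    # generation: does the answer address the question?
    context_precision,   # retrieval: are the recalled chunks on point?
    context_recall,      # retrieval: was the ground-truth knowledge recalled?
    answer_correctness,  # vs. ground truth: is the answer factually right?
)

# A single made-up row, just to show the expected column layout.
dataset = Dataset.from_dict({
    "question": ["Are personal finance classes taught in high school?"],
    "contexts": [["In Houston, Texas a private high school offered an optional half-semester class..."]],
    "answer": ["Yes, some high schools offer them, though availability varies widely."],
    "ground_truths": [["In Houston, Texas USA ... they had a half-semester class in personal finance ..."]],
})

results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision,
             context_recall, answer_correctness],
)
print(results)  # e.g. {'faithfulness': ..., 'answer_relevancy': ..., ...}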

II. Evaluation Dataset

We use the Financial Opinion Mining and Question Answering (fiqa) dataset (https://sites.google.com/view/fiqa/) as the test set, mainly for the following reasons:

- It is a specialized financial dataset with very diverse sources and human-written answers. It covers quite obscure financial knowledge that most likely does not appear in GPT's training data, which makes it well suited as an external knowledge base and a sharp contrast with an LLM that has never seen this material.

- The dataset was originally built to evaluate Information Retrieval (IR), so it already comes with annotated knowledge passages that can serve directly as the ground truth for retrieval.

- Ragas itself treats it as a standard getting-started test set (https://docs.ragas.io/en/latest/getstarted/evaluation.html#the-data) and provides a script for building it (https://github.com/explodinggradients/ragas/blob/main/experiments/baselines/fiqa/dataset-exploration-and-baseline.ipynb). It therefore has a community footing and broad acceptance, which makes it a reasonable baseline.

We first run the conversion script to turn the raw fiqa dataset into a format that Ragas can work with easily. A quick look at the evaluation set: it contains 647 finance-related queries, and each question's list of source knowledge passages forms its ground_truths, usually 1 to 4 knowledge snippets per question.

[Figure: sample rows from the fiqa dataset]

At this point the test data is ready. All we have to do is send the question column to the RAG application, then merge the application's answers and retrieved contexts with the ground truths and hand everything to Ragas for scoring.
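The glue code for that loop is short. Below is a minimal sketch under a few assumptions: fiqa_eval stands for the converted fiqa dataset from the previous step, and ask_rag is a hypothetical adapter for whichever RAG application is being tested, returning the generated answer plus the retrieved context strings.

Python
# Sketch of the evaluation loop: ask each fiqa question, collect the answer and
# retrieved contexts, attach the annotated ground_truths, then score with Ragas.
from datasets import Dataset
from ragas import evaluate

def ask_rag(question: str):
    """Hypothetical adapter: returns (answer, list_of_retrieved_contexts)."""
    raise NotImplementedError  # plug in the OpenAI assistant or the Milvus pipeline here

records = {"question": [], "answer": [], "contexts": [], "ground_truths": []}
for row in fiqa_eval:  # fiqa_eval: the converted fiqa evaluation dataset
    answer, contexts = ask_rag(row["question"])
    records["question"].append(row["question"])
    records["answer"].append(answer)
    records["contexts"].append(contexts)
    records["ground_truths"].append(row["ground_truths"])

scores = evaluate(Dataset.from_dict(records))
print(scores)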

III. RAG Comparison Setup

Next we build the two RAG applications to be benchmarked against each other: the OpenAI assistant and a custom RAG pipeline built on a vector database.

1. OpenAI assistant

We create the assistant and upload the knowledge files following OpenAI's official assistant retrieval guide (https://platform.openai.com/docs/assistants/tools/knowledge-retrieval), and we extract the answer and the retrieved contexts via the officially documented message annotations (https://platform.openai.com/docs/assistants/how-it-works/message-annotations). Everything else is left at its defaults.
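For orientation, here is a condensed sketch of that setup. It follows the Assistants API as documented at the time of this test (v1, with the built-in retrieval tool); method names may differ in later SDK versions, and the file name is a placeholder, so treat this as an approximation rather than the exact benchmark code.

Python
# Rough sketch of the OpenAI assistant setup (Assistants API v1, "retrieval" tool,
# defaults everywhere else).
import time
from openai import OpenAI

client = OpenAI()

# 1. Upload the knowledge file and create a retrieval-enabled assistant.
knowledge_file = client.files.create(
    file=open("fiqa_corpus.txt", "rb"),  # placeholder export of the fiqa corpus
    purpose="assistants",
)
assistant = client.beta.assistants.create(
    name="fiqa-rag",
    model="gpt-4-1106-preview",
    tools=[{"type": "retrieval"}],
    file_ids=[knowledge_file.id],
)

# 2. Ask one question and wait for the run to finish.
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id, role="user",
    content="Are personal finance / money management classes taught in high school, anywhere?",
)
run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)
while run.status not in ("completed", "failed", "expired"):
    time.sleep(2)
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

# 3. Read the answer and pull the retrieved snippets out of the message annotations.
message = client.beta.threads.messages.list(thread_id=thread.id).data[0]
answer = message.content[0].text.value
contexts = [
    ann.file_citation.quote
    for ann in message.content[0].text.annotations
    if getattr(ann, "file_citation", None) is not None
]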

2. RAG pipeline built on a vector database

We then put together a RAG pipeline based on vector retrieval: the Milvus (https://milvus.io/) vector database stores the knowledge, the BAAI/bge-base-en model from HuggingFaceEmbeddings (https://python.langchain.com/docs/integrations/platforms/huggingface) produces the embeddings, and LangChain (https://python.langchain.com/docs/get_started/introduction) components handle document ingestion and agent construction.
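A condensed, hedged sketch of that pipeline is shown below. Import paths follow the LangChain releases available when this test was run; the Milvus endpoint, collection name, and fiqa_corpus_texts are placeholders; and where the real benchmark builds a LangChain agent, this sketch uses a plain RetrievalQA chain for brevity.

Python
# Sketch of the custom pipeline: BAAI/bge-base-en embeddings, Milvus as the
# vector store, and a LangChain RetrievalQA chain over gpt-4-1106-preview.
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Milvus
from langchain.chains import RetrievalQA
from langchain.docstore.document import Document

# 1. Chunk the fiqa corpus and embed the chunks into Milvus.
docs = [Document(page_content=t) for t in fiqa_corpus_texts]  # assumed list of raw passages
chunks = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64).split_documents(docs)
embedding = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en")
vectorstore = Milvus.from_documents(
    chunks, embedding,
    connection_args={"host": "localhost", "port": "19530"},  # placeholder Milvus endpoint
    collection_name="fiqa_knowledge",
)

# 2. Wire the retriever and the LLM into a QA chain.
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-4-1106-preview", temperature=0),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    return_source_documents=True,  # keep the retrieved chunks for Ragas scoring
)

result = qa_chain({"query": "Pros / cons of being more involved with IRA investments"})
answer = result["result"]
contexts = [d.page_content for d in result["source_documents"]]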

The two setups are compared side by side below:

[Table: side-by-side comparison of the two RAG setups]

Note that both setups use gpt-4-1106-preview as the LLM; since OpenAI's assistant is closed-source, its other strategies presumably differ from ours in many respects. For reasons of space we will not go into further implementation details here; the full code is available at https://github.com/milvus-io/bootcamp/tree/master/evaluation.

IV. Results and Analysis

1. Experimental Results

We scored both applications on several Ragas metrics and obtained the following per-metric comparison:

[Figure: per-metric comparison]

Across the five metrics we tracked, the OpenAI assistant beats the custom RAG pipeline only on answer_similarity; on every other metric it scores slightly lower.

Ragas can also combine the individual metrics into a single overall score, the Ragas score, by taking their harmonic mean. The point of using the harmonic mean is that it penalizes low-scoring metrics. Overall, the OpenAI assistant's Ragas score is likewise lower than that of the custom RAG pipeline.

[Figure: Ragas score comparison]
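To make the penalty effect of the harmonic mean concrete, here is a toy calculation with made-up scores (not the actual benchmark numbers):

Python
# Toy illustration: the harmonic mean drags the aggregate toward the weakest metric.
def harmonic_mean(scores):
    return len(scores) / sum(1.0 / s for s in scores)

balanced = [0.85, 0.85, 0.85, 0.85]   # every metric decent
one_weak = [0.95, 0.95, 0.95, 0.40]   # one weak metric among strong ones

print(round(harmonic_mean(balanced), 2))  # 0.85
print(round(harmonic_mean(one_weak), 2))  # ~0.71 -- the weak metric dominates the aggregate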

Why does this happen? After comparing a large number of individual results, we found that the RAG pipeline built on a vector database has the following advantages over the OpenAI assistant:

1) The OpenAI assistant prefers to answer from its own knowledge rather than from the retrieved knowledge.

This is probably a consequence of its internal agent setup: when it meets a fact GPT thinks it already knows, it becomes more confident and relies on knowledge from its training instead of the retrieved knowledge, even though in that moment the retrieved knowledge may be exactly what contradicts its training. An example:

Question:

Plain Text
Are personal finance / money management classes taught in high school, anywhere?

The OpenAI assistant's answer:

Plain Text
Personal finance and money management classes are indeed taught in some high schools, although the availability and requirement of such classes can vary greatly depending on the country and even within regions of a country. In the United States, for example, the Council for Economic Education (CEE) conducts surveys and reports on the state of financial education in K-12 schools. Their findings suggest that as of my last update, a growing number of states have integrated personal finance into their educational standards, and some have mandated that students take a personal finance course in order to graduate.

Although I don't have access to a live database or the most recent studies, you can often find information about financial education in schools by:

1. Checking government education websites.
2. Looking at reports from financial education nonprofits and councils.
3. Contacting local school boards or departments of education.

If you are interested in specific locations or have a particular focus in mind, you could specify that so I could provide more directed information, if available.

Is this answer wrong? Taken on its own, it reads fine. But note the passage where it stresses that it has no access to a live database or the most recent studies: this tells us it did not use the content of the uploaded knowledge documents at all.

Now look at the ground truth, i.e. the correct answer annotated in the dataset:

Plain Text
["In Houston, Texas USA where I went to a private high school they had a half-semester class in personal finance, but it was optional and didn't give you any credits towards graduation. You are right though, it should be a standard class. After all, who doesn't need that information in their adult lives, and not everyone goes to college."
...
 '"We had a ""civics"" class when I was a freshman in high school. This was in the Ann Arbor, MI public schools. It covered the very basics (how to balance your checkbook, what are stocks, how do income taxes work, what is interest, etc.) of money management along with an overview of politics and the legal system. It was a really light class, though, and didn\'t go deeply into personal finance and money management. I agree that such a class would be very valuable, as would cooking, nutrition, and basic home and car repair."']

The documents contain many concrete scenarios that answer this question, and OpenAI's answer indeed uses none of them. It judged the question simple enough to answer directly and ignored the information the knowledge base could have provided.

Next, let's look at the answer from the vector-database-based RAG pipeline:

Plain Text
Personal finance and money management classes are indeed taught in some high schools, but the availability and requirement of such classes can vary greatly. For example, in Houston, Texas, a private high school offered an optional half-semester class in personal finance that did not provide credits towards graduation. Staples High School in Westport, Connecticut, introduced a personal financial management class focused on using mathematics to develop financial literacy skills, covering topics such as earnings, banking, credit cards, loans, taxes, insurance, investing, budgeting, and buying personal property. In Ann Arbor, Michigan, public schools offered a 'civics' class that included basic money management topics. Additionally, in the UK, the 'Every Child Matters' framework includes 'achieving economic well-being' as an important element, and some schools use a School Rewards System to teach kids about finance and economy. While these examples show that such classes exist, they are not universally required or standardized, and the depth and focus of the curriculum can vary.

The RAG answer pulls in content straight from the ground truth and weaves that knowledge smoothly into the response, which is exactly what users need.

2) OpenAI's chunking and retrieval of the knowledge need work; the open-source custom solution does better.

By inspecting the intermediate chunks the assistant retrieves, we can analyze its document-chunking strategy and infer something about the quality of its embedding model. For example:

Question:

JSON
Pros / cons of being more involved with IRA investments [duplicate]

The OpenAI assistant's intermediate retrieved chunk:

JSON
['PROS: CONS']

This is clearly a bad retrieval, and it is the only chunk that was retrieved. First, the chunking is unreasonable: the content that should follow has been cut off. Second, the embedding model failed to retrieve the more important chunks that could actually answer the question; it only returned a fragment that is lexically similar to the query.

The custom RAG pipeline's retrieved chunks:

Plain Text
['in the tax rate, there\'s also a significant difference in the amount being taxed. Thus, withdrawing from IRA is generally not a good idea, and you will never be better off with withdrawing from IRA than with cashing out taxable investments (from tax perspective). That\'s by design."'
 "Sounds like a bad idea. The IRA is built on the power of compounding. Removing contributions will hurt your retirement savings, and you will never be able to make that up. Instead, consider tax-free investments. State bonds, Federal bonds, municipal bonds, etc. For example, I invest in California muni bonds fund which gives me ~3-4% annual dividend income - completely tax free. In addition - there's capital appreciation of your fund holdings. There are risks, of course, for example rate changes will affect yields and capital appreciation, so consult with someone knowledgeable in this area (or ask another question here, for the basics). This will give you the same result as you're expecting from your Roth IRA trick, without damaging your retirement savings potential."
 "In addition to George Marian's excellent advice, I'll add that if you're hitting the limits on IRA contributions, then you'd go back to your 401(k). So, put enough into your 401(k) to get the match, then max out IRA contributions to give you access to more and better investment options, then go back to your 401(k) until you top that out as well, assuming you have that much available to invest for retirement."
 "While tax deferral is a nice feature, the 401k is not the Holy Grail.  I've seen plenty of 401k's where the investment options are horrible: sub-par performance, high fees, limited options.   That's great that you've maxed out your Roth IRA.  I commend you for that.   As long as the investment options in your 401k are good, then I would stick with it."
 "retirement plans which offer them good cheap index funds. These people probably don't need to worry quite as much. Finally, having two accounts is more complicated. Please contact someone who knows more about taxes than I am to figure out what limitations apply for contributing to both IRAs and 401(k)s in the same year."]

The self-built RAG pipeline retrieved many passages about IRA investing, and this content was effectively incorporated into the LLM's final answer.

Also worth noting: vector retrieval shows an effect similar to token-based retrieval such as BM25. The retrieved passages all contain the key term "IRA", so vector retrieval is effective not only at the level of overall semantics but also holds its own against term-frequency retrieval at the level of individual keywords.

2. Other Aspects

Beyond the experimental results, the OpenAI assistant has some clear disadvantages compared with the more flexible custom open-source RAG approach:

- The OpenAI assistant exposes no parameters of the RAG process; it is a black box internally, which also makes it impossible to optimize. A custom RAG solution lets you adjust components and parameters such as top_k, chunk size, and the embedding model, and thus tune for your own data (a brief sketch of these knobs follows below).

- OpenAI limits how many files you can store, whereas a vector database can hold a huge amount of knowledge. Each file uploaded to OpenAI is capped at 512 MB and must not exceed 2,000,000 tokens.

As a result, OpenAI cannot serve RAG workloads with more complex business logic, larger data volumes, or heavier customization.
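As a hedged illustration of the tuning flexibility mentioned above (reusing the vectorstore built in the earlier pipeline sketch; the model name and parameter values are arbitrary examples, not tuned settings):

Python
# Hypothetical example of knobs a self-hosted pipeline exposes but the assistant
# does not: chunking, embedding model, and retrieval depth.
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=256,   # smaller chunks for dense, Q&A-style corpora
    chunk_overlap=32,
)
embedding = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en")  # swap embedding models freely
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})      # widen or narrow recall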

V. Summary

Using the Ragas evaluation framework, we made a detailed comparison and analysis of the OpenAI assistant against an open-source, vector-database-based RAG solution. Although the OpenAI assistant does retrieve reasonably well, it falls short of the vector-based RAG approach in answer quality, recall performance, and other respects, and the individual Ragas metrics quantify that conclusion.

So, for building stronger and better-performing RAG applications, developers should consider building custom retrieval on top of vector databases such as Milvus (https://zilliz.com/what-is-milvus) or Zilliz Cloud (https://cloud.zilliz.com.cn/signup), which offer better results and more flexible choices.
