How to Do RAG over Tables? TableRAG: A Framework for Enhanced Large-Scale Table Understanding
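Earlier installments covered RAG methods for dense-document scenarios. Today we look at how RAG is done when the data consists of many tables.
Existing LLM-based methods usually need the entire table as input, which causes problems such as positional bias and context-length limits, especially with large tables. To address this, the paper proposes the TableRAG framework, which uses query expansion combined with schema and cell retrieval to pinpoint the key information before it is handed to the LLM. This allows more efficient data encoding and more precise retrieval, significantly shortening the prompt and reducing information loss.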
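Comparison of table prompting techniques for LLMs
(a) Read Table
The language model reads the entire table, so the data provided to it (the shaded region) covers all rows and columns. This is the most direct approach, but it is usually infeasible for large tables because they exceed the model's token limit.
(b) Read Schema
The language model reads only the table schema, that is, the column names and data types. Nothing about the actual table content is included, so information about the content is lost.
(c) Row-Column Retrieval
Rows and columns are encoded and then selected by their similarity to the question. Only the intersection of the selected rows and columns is presented to the language model. For large tables, encoding every row and column is still infeasible.
(d) Schema-Cell Retrieval (Ours)
Column names and cells are encoded and retrieved by their relevance to queries that the language model generates from the question. Only the retrieved schema entries and cells, meaning column names and cell values, are provided to the language model. This makes both encoding and reasoning more efficient.
(e) Retrieval Performance on ArcadeQA
This panel shows the retrieval results of the different methods on the ArcadeQA dataset. TableRAG outperforms the other methods on both column and cell retrieval, which in turn improves the downstream table reasoning.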
Method
TableRAG Example
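The core idea is to combine schema retrieval and cell retrieval so that a program-aided LLM receives just the information it needs to solve the problem. There is no need to give the whole table to the LLM. The key information usually sits in the specific column names, data types, and cell values that relate directly to the question. For example, consider the question "What is the average price for wallets?" To answer it, a program only has to extract the rows related to "wallets" and then average the price column. Knowing the relevant column names and how "wallets" is represented in the table is enough to write that program. This is how TableRAG works around the context-length limit that RAG faces on tables.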
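TableRAG pipeline: the table is used to build a schema database and a cell database. The LLM then expands the question into multiple schema queries and cell queries. These queries are used in turn to retrieve schema entries and column-cell pairs. The top K candidates for each query are combined and placed in the prompt of the LLM solver, which answers the question.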
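The "top K candidates per query, then combine" step can be sketched as follows. This is a minimal illustration, not the paper's implementation: `toy_encode` is a hypothetical stand-in (hashed character trigrams) for a real pretrained encoder, and `retrieve_top_k`, `DIM`, and the column list are names chosen only for this example.

```python
import zlib
import numpy as np

DIM = 256  # size of the toy embedding space (arbitrary choice)

def toy_encode(text: str) -> np.ndarray:
    """Stand-in encoder (assumption): hashed character trigrams, L2-normalised."""
    vec = np.zeros(DIM)
    padded = f"  {text.lower()}  "
    for i in range(len(padded) - 2):
        bucket = zlib.crc32(padded[i:i + 3].encode("utf-8")) % DIM
        vec[bucket] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def retrieve_top_k(queries, candidates, k=2):
    """Return the union of every query's top-k candidates, in first-seen order."""
    cand_matrix = np.stack([toy_encode(c) for c in candidates])
    selected = []
    for query in queries:
        scores = cand_matrix @ toy_encode(query)  # cosine similarity (unit vectors)
        for idx in np.argsort(-scores)[:k]:
            if candidates[idx] not in selected:
                selected.append(candidates[idx])
    return selected

# Hypothetical column names for illustration
columns = ["order_no", "item_total", "quantity", "order_status", "description"]
combined = retrieve_top_k(["price", "product name"], columns, k=2)
print(combined)
```

With a real sentence encoder in place of `toy_encode`, the same loop would apply unchanged to both the schema database and the cell database.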
TableRAG core components
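- Tabular Query Expansion
To operate on a table effectively, it is essential to identify precisely which column names and cell values the query needs. Unlike earlier methods, TableRAG does not use the question alone as a single query; it generates separate queries for the schema and for the cell values. For example, for the question "What is the average price for wallets?", the model is prompted to generate candidate queries for column names (such as "product" and "price") and for relevant cell values (such as "wallet"). These queries are then used to retrieve the relevant schema entries and cell values from the table.
- Schema Retrieval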
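After the queries are generated, schema retrieval uses a pretrained encoder $ f_{enc} $ to fetch the relevant column names. The encoder matches each query against the encoded column names to determine relevance. The retrieved schema data includes the column name, the data type, and example values. For columns identified as numeric or datetime, the minimum and maximum values are shown as examples; for categorical columns, the three most frequent categories are shown. The retrieved schema therefore gives a structured overview of the table's format and content, which is then used for more targeted data extraction.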
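The schema entries described above (column name, data type, and either min/max or the three most frequent values) could be built from a pandas DataFrame roughly as follows. This is a sketch under assumptions: `build_schema_entries` is a hypothetical helper, the toy DataFrame is invented, and the dictionary keys simply mirror the schema results shown in the solver prompt later in this article.

```python
import pandas as pd
from pandas.api import types as ptypes

def build_schema_entries(df: pd.DataFrame, n_examples: int = 3) -> list[dict]:
    """Describe each column with its dtype plus min/max or frequent examples."""
    entries = []
    for name in df.columns:
        col = df[name]
        entry = {"column_name": str(name), "dtype": str(col.dtype)}
        is_numeric = ptypes.is_numeric_dtype(col) and not ptypes.is_bool_dtype(col)
        if is_numeric or ptypes.is_datetime64_any_dtype(col):
            # Numeric and datetime columns: the value range serves as the example
            entry["min"] = col.min()
            entry["max"] = col.max()
        else:
            # Categorical columns: the most frequent values serve as examples
            entry["cell_examples"] = col.value_counts().head(n_examples).index.tolist()
        entries.append(entry)
    return entries

# Toy data for illustration only
df = pd.DataFrame({
    "item_total": ["$449.00", "$399.00", "$449.00", "$549.00"],
    "quantity": [1, 4, 2, 1],
})
schema_entries = build_schema_entries(df)
for entry in schema_entries:
    print(entry)
```

Each entry would then be encoded (for example as its column name) and matched against the schema queries.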
The corresponding prompt is as follows:
========================================= Prompt =========================================
Given a large table regarding "amazon seller order status prediction orders data", I want
to answer a question: "What is the average price for leather wallets?"
Since I cannot view the table directly, please suggest some column names that might contain
the necessary data to answer this question.
Please answer with a list of column names in JSON format without any additional explanation
.
Example:
["column1", "column2", "column3"]
======================================= Completion =======================================
["product_name", "category", "price"]
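The prompt and completion above can be wired together with a small amount of code. The sketch below fills the prompt template and parses the JSON list out of a completion; the function names are hypothetical, the LLM call itself is left out, and the completion string is simply the one quoted above.

```python
import json

SCHEMA_QUERY_TEMPLATE = (
    'Given a large table regarding "{table_desc}", I want to answer a question: '
    '"{question}"\n'
    "Since I cannot view the table directly, please suggest some column names that "
    "might contain the necessary data to answer this question.\n"
    "Please answer with a list of column names in JSON format without any "
    "additional explanation.\n"
    'Example:\n["column1", "column2", "column3"]'
)

def build_schema_query_prompt(table_desc: str, question: str) -> str:
    """Fill the schema-query prompt with a table description and a question."""
    return SCHEMA_QUERY_TEMPLATE.format(table_desc=table_desc, question=question)

def parse_query_list(completion: str) -> list[str]:
    """Parse the JSON array between the first '[' and the last ']'; [] on failure."""
    start, end = completion.find("["), completion.rfind("]")
    if start == -1 or end < start:
        return []
    try:
        items = json.loads(completion[start:end + 1])
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    return [item.strip() for item in items if isinstance(item, str) and item.strip()]

prompt = build_schema_query_prompt(
    "amazon seller order status prediction orders data",
    "What is the average price for leather wallets?",
)
completion = '["product_name", "category", "price"]'  # completion quoted above
schema_queries = parse_query_list(completion)
print(schema_queries)
```

Returning an empty list on malformed output is one possible design choice; a caller could instead retry the model or fall back to using the question itself as the query.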
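- Cell Retrieval
After schema retrieval, cell retrieval extracts the specific cell values needed to answer the query. This involves building a database of the distinct column-value pairs in table $ T $, denoted $ V = \{(C_j, v_{ij})\} $, where $ C_j $ is the name of the $ j $-th column and $ v_{ij} $ is a cell value in that column. In practice, the number of distinct values is usually far smaller than the total number of cells, which makes cell retrieval much more efficient.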
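A minimal sketch of building the distinct column-value database from a pandas DataFrame is shown below. `build_cell_database` is a hypothetical helper and the toy data is invented. The optional `budget` argument is an assumption about how the encoding budget $ B $, described under "Cell Retrieval with Encoding Budget" further down, might be applied: keep only the most frequent pairs.

```python
from collections import Counter
import pandas as pd

def build_cell_database(df: pd.DataFrame, budget=None) -> list[dict]:
    """Collect distinct (column, value) pairs, most frequent first."""
    counts = Counter()
    for name in df.columns:
        # value_counts() yields each distinct non-null value with its frequency
        for value, freq in df[name].value_counts().items():
            counts[(str(name), value)] += int(freq)
    pairs = counts.most_common(budget)  # budget=None keeps every distinct pair
    return [
        {"column_name": column, "cell_value": value, "count": count}
        for (column, value), count in pairs
    ]

# Toy data for illustration only
df = pd.DataFrame({
    "order_status": ["Delivered to buyer", "Delivered to buyer",
                     "Returned to seller", "Delivered to buyer"],
    "quantity": [1, 1, 2, 1],
})
full = build_cell_database(df)
limited = build_cell_database(df, budget=2)
print(f"{df.size} cells, {len(full)} distinct pairs, {len(limited)} kept under budget")
```

Only the retained pairs would need to be encoded, which is where the saving over encoding every cell comes from.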
Cell retrieval plays a crucial role in TableRAG:
- Cell identification: it lets the language model accurately detect whether specific keywords are present in the table, which is essential for effective indexing.
- Cell-column association: it also lets the language model associate a specific cell with its column name, which is essential when the question involves a specific attribute.
The corresponding prompt is as follows:
========================================= Prompt =========================================
Given a large table regarding "amazon seller order status prediction orders data", I want
to answer a question: "What is the average price for leather wallets?"
Please extract some keywords which might appear in the table cells and help answer the
question.
The keywords should be categorical values rather than numerical values.
The keywords should be contained in the question.
Please answer with a list of keywords in JSON format without any additional explanation.
Example:
["keyword1", "keyword2", "keyword3"]
======================================= Completion =======================================
["leather wallets", "average price", "amazon seller", "order status prediction", "orders
data"]
- Cell Retrieval with Encoding Budget
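In the worst case, the number of distinct values can equal the total number of cells. To keep TableRAG feasible in that situation, a cell encoding budget $ B $ is introduced. If the number of distinct values exceeds $ B $, encoding is restricted to the $ B $ most frequent pairs, which keeps processing efficient on large tables.
- Program-Aided Solver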
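Once the column names and cell values relevant to the question have been obtained, the language model can use them to interact with the table effectively. TableRAG is compatible with language-model agents that interact with tables programmatically. In this work the authors use ReAct, a popular approach for extending language-model capabilities that recent work has used to reach state-of-the-art results on table QA benchmarks.
The corresponding prompt is as follows: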
========================================= Prompt =========================================
You are working with a pandas dataframe regarding "amazon seller order status prediction
orders data" in Python. The name of the dataframe is ‘df‘. Your task is to use ‘
python_repl_ast‘ to answer the question: "What is the average price for leather wallets?"
Tool description:
- ‘python_repl_ast‘: A Python interactive shell. Use this to execute python commands. Input
should be a valid single line python command.
Since you cannot view the table directly, here are some schemas and cell values retrieved
from the table.
Schema Retrieval Results:
{"column_name": "item_total", "dtype": "object", "cell_examples": ['$449.00', '$399.00', '$549.00']}
{"column_name": "quantity", "dtype": "int64", "min": 1, "max": 4}
{"column_name": "order_no", "dtype": "object", "cell_examples": ['402-4845680-8041921', '405-9763961-5211537', '404-3964908-7850720']}
...
Cell Retrieval Queries: leather wallets, average price, order status, prediction, amazon
seller
Cell Retrieval Results:
{"column_name": "order_status", "cell_value": "Delivered to buyer"}
{"column_name": "description", "cell_value": "Pure Leather Camel Color Gent’s Wallet with
Coin Compartment and Card Holders | Men’s Ultra Slim Money Organiser (1 pc)"}
...
Strictly follow the given format to respond:
Thought: you should always think about what to do
Action: the single line Python command to execute
Observation: the result of the action
... (this Thought/Action/Observation can repeat N times)
Thought: before giving the final answer, you should think about the observations
Final Answer: the final answer to the original input question (Answer1, Answer2, ...)
Notes:
- Do not use markdown or any other formatting in your responses.
- Ensure the last line is only "Final Answer: Answer1, Answer2, ..." form, no other form.
- Directly output the Final Answer rather than outputting by Python.
- Ensure to have a concluding thought that verifies the table, observations and the
question before giving the final answer.
Now, given a table regarding "amazon seller order status prediction orders data", please
use ‘python_repl_ast‘ with the column names and cell values above to answer the question: "
What is the average price for leather wallets?"
Begin!
======================================= Completion =======================================
Thought: I need to remove the comma in the item_total values before converting them to
float.
Action: df['item_total'] = df['item_total'].str.replace('$','').str.replace(',','').astype(float)
Observation: success!
Thought: Now, I can filter out the rows with leather wallets and calculate the average
price.
Action: df[df['description'].str.contains('wallet', case=False)]['item_total'].mean()
Observation: 442.7916666666667
Thought: before giving the final answer, I should consider the observations and ensure that
the calculations are accurate.
Final Answer: The average price for leather wallets is $442.79.
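The Action steps in the transcript above rely on a tool that runs one line of Python against `df` and returns an observation. The sketch below is a simplified stand-in for such a tool, not the actual `python_repl_ast` implementation: `run_action` is a hypothetical name, the toy DataFrame is invented, and `regex=False` is added so that `'$'` is treated literally on every pandas version.

```python
import ast
import pandas as pd

def run_action(command: str, namespace: dict) -> str:
    """Run one line of Python; return an expression's value or 'success!'."""
    try:
        tree = ast.parse(command.strip())
        last = tree.body[0] if len(tree.body) == 1 else None
        if isinstance(last, ast.Expr):
            # A bare expression: evaluate it and report its value
            code = compile(ast.Expression(last.value), "<action>", "eval")
            return str(eval(code, namespace))
        # Anything else (e.g. an assignment): execute it for its side effects
        exec(compile(tree, "<action>", "exec"), namespace)
        return "success!"
    except Exception as exc:
        # Errors are returned as observations so the agent can react to them
        return f"{type(exc).__name__}: {exc}"

# Toy data for illustration only
df = pd.DataFrame({
    "description": ["Pure Leather Gent's Wallet", "Cotton Scarf", "Slim Leather WALLET"],
    "item_total": ["$449.00", "$1,299.00", "$399.00"],
})
namespace = {"df": df}
obs1 = run_action(
    "df['item_total'] = df['item_total'].str.replace('$', '', regex=False)"
    ".str.replace(',', '', regex=False).astype(float)",
    namespace,
)
obs2 = run_action(
    "df[df['description'].str.contains('wallet', case=False)]['item_total'].mean()",
    namespace,
)
print(obs1)
print(obs2)
```

Executing model-generated code with `eval` and `exec` is unsafe outside a sandbox, so a real deployment would isolate this step.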
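Experimental Results
Datasets: to verify that TableRAG scales to large tables, the authors built two new benchmark datasets, ArcadeQA and BirdQA, derived from the Arcade and BIRD-SQL datasets respectively. They also generated synthetic data from the TabFact dataset by expanding its tables to larger sizes.
TableRAG is compared against four other methods: ReadTable, ReadSchema, RandRowSampling, and RowColRetrieval. All methods are implemented on the same PyReAct solver.
TableRAG's retrieval design significantly reduces computational cost and token usage while maintaining high performance.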
References
TableRAG: Million-Token Table Understanding with Language Models, https://arxiv.org/abs/2410.04739v1
This article is reposted from the WeChat official account 大模型自然語言處理. Author: 余俊暉
