Translator | Li Rui
Reviewer | Chonglou
Developers need to build and understand a new text clustering approach and use the DeepSeek reasoning model to explain the inference results.
This article explores reasoning in large language models (LLMs) and introduces DeepSeek, a tool that helps explain inference results and build machine learning systems that end users can trust.
By default, a machine learning model is a black box that provides no out-of-the-box explanation (XAI) for its decisions. This article shows how to use the DeepSeek model to bring explanation, or reasoning, capabilities into the machine learning workflow.
Approach
First, custom embeddings and an embedding function are built to create a vector data store, and the DeepSeek model is then used to perform reasoning.
The simple flowchart below illustrates the entire process.
Data
(1) A news article dataset is used to identify the category of new articles. The dataset can be downloaded from Kaggle.
(2) From this dataset, the short_description field is used to create the vector embeddings, and the category feature assigns the appropriate label to each article.
(3) The dataset is fairly clean and requires no preprocessing.
(4) The dataset is loaded with the pandas library and split into training and test sets with scikit-learn.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_json('./News_Category_Dataset_v3.json', lines=True)

# Separate features (X) and target (y)
X = df.drop('category', axis=1)
y = df['category']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

train_df = pd.concat([X_train, y_train], axis=1)
test_df = pd.concat([X_test, y_test], axis=1)
Generating Text Embeddings
The following libraries are used for the text embeddings:
- langchain — used to create example prompts and the semantic similarity selector.
- langchain_chroma — used to create the embeddings and store them in a data store.
from chromadb import Documents, EmbeddingFunction, Embeddings

from langchain_chroma import Chroma
from langchain_core.example_selectors import SemanticSimilarityExampleSelector
from langchain_core.prompts import FewShotPromptTemplate, PromptTemplate
Next, a custom embeddings class and embedding function are built. These custom classes make it possible to query a model deployed on a local or a remote instance.
For a model deployed on a remote instance, readers can add the necessary security mechanisms (HTTPS, data encryption, and so on) and call the REST endpoint to retrieve the model embeddings.
import json

import requests
import yaml


class MyEmbeddings(Embeddings):

    def __init__(self):
        # Server address and port (replace with your actual values)
        self.url = ""
        # Request headers
        self.headers = {
            "Content-Type": "application/json"
        }

        self.data = {
            # Use any text embedding model of your choice
            "model": "text-embedding-nomic-embed-text-v1.5",
            "input": None,
            "encoding_format": "float"
        }

    def embed_documents(self, texts):
        # Embed each document individually and collect the vectors.
        embeddings = []
        for text in texts:
            embeddings.append(self.embed_query(text))
        return embeddings

    def embed_query(self, input):
        # Send the text to the embedding endpoint and parse the returned vector.
        self.data['input'] = input
        with requests.post(self.url, headers=self.headers, data=json.dumps(self.data)) as response:
            res = response.text
            yaml_object = yaml.safe_load(res)
            embeddings = yaml_object['data'][0]['embedding']
            return embeddings


class MyEmbeddingFunction(EmbeddingFunction):

    def __call__(self, input: Documents) -> Embeddings:
        # Delegate to the custom embeddings class so callers receive actual vectors.
        return MyEmbeddings().embed_documents(input)
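As a quick sanity check, the class can be exercised on its own. The sketch below (not part of the original code) embeds a single sentence; the endpoint URL is a hypothetical placeholder for whichever local or remote embedding server is being used, and the server is assumed to return a JSON body of the form {"data": [{"embedding": [...]}]}, which is what the parsing code above expects.

# Minimal sketch: verify that the custom embeddings class returns a vector.
# The URL below is a hypothetical placeholder; replace it with your actual endpoint.
embedder = MyEmbeddings()
embedder.url = "http://localhost:1234/v1/embeddings"

vector = embedder.embed_query("Shakespeare in the Park returns to the Delacorte Theater.")
print(len(vector))  # dimensionality of the returned embedding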
Next, a simple function is defined that creates a semantic similarity selector for the news articles. The selector is used to create the vector embeddings from the training dataset.
def create_semantic_similarity_selector(train_df):

    example_prompt = PromptTemplate(
        input_variables=["input", "output"],
        template="Input: {input}\nOutput: {output}",
    )

    # Examples built from the training set: article description in, category out.
    examples = []

    for _, row in train_df.iterrows():
        example = {}
        example['input'] = row['short_description']
        example['output'] = row['category']
        examples.append(example)

    semantic_similarity_selector = SemanticSimilarityExampleSelector.from_examples(
        # The list of examples available to select from.
        examples,
        # The embedding class used to produce embeddings which are used to measure semantic similarity.
        MyEmbeddings(),
        # The VectorStore class that is used to store the embeddings and do a similarity search over.
        Chroma,
        # The number of examples to produce.
        k=1,
    )

    return semantic_similarity_selector
Call the function above to generate the embeddings for the news articles. Note that this step can be time-consuming; it can be parallelized to run faster.
semantic_similarity_selector = create_semantic_similarity_selector(train_df)
The Chroma vector data store holds the vector representations of the news articles along with their associated labels. The embeddings in the data store are then used to run a semantic similarity search against the articles in the test dataset and to check the accuracy of the approach.
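To make that lookup concrete, here is a minimal sketch (not part of the original code) that classifies a single test article by its nearest training example; it assumes the semantic_similarity_selector and test_df objects from the earlier steps, and the variable names are otherwise illustrative.

# Minimal sketch: classify one test article by semantic similarity.
sample = test_df.iloc[0]
prediction = semantic_similarity_selector.select_examples(
    {"input": sample['short_description']}
)[0]['output']

print("predicted:", prediction)
print("actual:   ", sample['category'])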
The DeepSeek REST endpoint is then called with the response received from the semantic similarity selector and the actual result from the test dataset. A context is built that contains the information the DeepSeek model needs in order to reason about the result.
def explain_model_result(text, model_answer, actual_answer):
    # REST end point for deepseek model.
    url = ""

    # Request headers
    headers = {
        "Content-Type": "application/json"
    }

    promptJson = {
        "question": 'Using the text, can you explain why the model answer and actual answer match or do not match ?',
        "model_answer": model_answer,
        "actual_answer": actual_answer,
        "context": text,
    }
    prompt = json.dumps(promptJson)

    # Request data (replace with your prompt)
    data = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "stream": True
    }
    captured_explanation = ""
    with requests.post(url, headers=headers, data=json.dumps(data), stream=True) as response:
        if response.status_code == 200:
            for chunk in response.iter_content(chunk_size=None):
                if chunk:
                    # Attempt to decode the chunk as UTF-8
                    decoded_chunk = chunk.decode('utf-8')
                    # Process the chunk as json or yaml to extract the explanation and concat it with the captured_explanation object.
                    captured_explanation += yaml.safe_load(decoded_chunk)['data']['choices'][0]['delta']['content']
        else:
            print(f"Request failed with status code {response.status_code}")

    return captured_explanation
The following code iterates over the test dataset and obtains an explanation from the DeepSeek model for each article.
results_df = pd.DataFrame(columns=['input', 'model_answer', 'actual_answer', 'explanation'])

for _, row in test_df.iterrows():
    example = {}
    example['input'] = row['short_description']
    model_result_category = semantic_similarity_selector.select_examples(example)
    example['explanation'] = explain_model_result(example['input'], model_result_category[0]['output'], row['category'])
    example['model_answer'] = model_result_category[0]['output']
    example['actual_answer'] = row['category']
    results_df.loc[len(results_df)] = example
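Once the loop completes, the accuracy mentioned above can be computed directly from results_df. The snippet below is a minimal sketch, assuming an exact label match is what should be counted; the output file name is illustrative.

# Minimal sketch: how often the semantic similarity label matches the ground truth.
accuracy = (results_df['model_answer'] == results_df['actual_answer']).mean()
print(f"Semantic similarity accuracy on the test set: {accuracy:.2%}")

# Keep the explanations for later inspection (file name is illustrative).
results_df.to_csv('results_with_explanations.csv', index=False)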
Results
The results for a few scenarios from the test dataset are shared below.
The first example is a case where the answer retrieved through semantic similarity does not match the actual answer in the test dataset. This kind of insight makes it possible to judge whether the model was reasonable in making its prediction. The think tags show the DeepSeek model's initial thought process as it works through the problem statement before composing the final answer for the end user.
In this first example, DeepSeek can explain why the model arrived at the wrong answer. The model only considered the TRAVEL aspect of the sentence and not the ENTERTAINMENT aspect, which suggests the embedding model needs more ENTERTAINMENT examples for training.
<think>

Okay, so I'm trying to figure out why the model answer is "TRAVEL" and the actual answer is "ENTERTAINMENT." The context given is about going to Central Park and the Delacorte Theater to see Shakespeare under the night sky, with rain expected.

First, I need to understand what each term refers to. TRAVEL relates to visiting places, moving around, or tourism in general. ENTERTAINMENT is more about enjoyable activities, like theater, concerts, movies, etc., that provide pleasure.

Looking at the context: the person went to Central Park, which is a place they traveled to. Then they went to the Delacorte Theater for a Shakespeare performance. So, the main activities are both TRAVEL and ENTERTAINMENT because they're going out, seeing a play.

But why would there be a mismatch between model and actual answers? Maybe the model didn't consider that travel can also include attending theater or other performances as part of the entertainment aspect. So the actual answer is broader, including both aspects, while the model focuses only on TRAVEL.

Alternatively, perhaps the context doesn't emphasize one over the other. The key points are moving to a place (TRAVEL) and attending an event for enjoyment (ENTERTAINMENT). Depending on the focus, either could be considered the main answer, but typically, when someone goes somewhere to attend an event, both travel and entertainment elements come into play.

So, the model answer might have only picked TRAVEL, not considering ENTERTAINMENT. The actual answer includes both aspects, hence the mismatch.

</think>

The discrepancy between the model answer "TRAVEL" and the actual answer "ENTERTAINMENT" arises from the context provided. While travel is indeed part of the narrative—visiting Central Park—the primary focus is on attending a theater performance, which falls under entertainment. The actual answer captures both elements: moving to a place (travel) and engaging in an enjoyable event (entertainment). Thus, the model may have overlooked the entertainment aspect, considering only travel, whereas the actual response encompasses both activities.

**Answer:**

The model answer "TRAVEL" focuses on visiting places, while the actual answer "ENTERTAINMENT" includes attending a theater performance. The context highlights both elements, leading to the mismatch in answers.
Next, consider a scenario where the model's answer and the actual answer do not match, yet the model is still reasonable in its prediction.
In the second example, an armed robbery that takes place in France is a CRIME, so a model labeling the article as crime is not necessarily wrong. DeepSeek's reasoning is able to provide a sound justification for the result.
<think>

Okay, so I'm trying to figure out why the model answer is "CRIME" and the actual answer is "WORLD NEWS" given the context: "French police recovered the merchandise stolen in an armed robbery on Wednesday."

First, I need to understand what each part of this question is asking. The user wants me to explain why these two answers match or don't match based on the provided text. So, maybe I should start by analyzing the context and seeing how it relates to both CRIME and WORLD NEWS.

Looking at the context: It's about French police recovering stolen merchandise from an armed robbery. An armed robbery is definitely a crime, so that makes sense why "CRIME" is an answer. The model answer being CRIME aligns with the event described because it's a criminal act—a robbery.

Now, why is WORLD NEWS also an answer? Well, if I think about news in general, events like this are newsworthy. So perhaps the actual answer is considering that this incident is part of world news because it involves police action and theft, which can be significant enough to be reported internationally.

But wait, does the context specify whether it's considered world news or not? It just mentions French police recovering stolen goods in an armed robbery on Wednesday. Without more details about the scale or international implications, I'm not sure if it's explicitly world news. Maybe WORLD NEWS is a broader category that includes such events as part of their reporting.

So, why do they match and why don't they? They both are relevant because CRIME is directly related to the event, while WORLD NEWS is considering the newsworthiness of the event. But perhaps the model answer is more focused on the specific event being a crime, whereas WORLD NEWS is taking a broader view of news events.

I'm also thinking about how each answer approaches the context differently. The model answer is zeroing in on the type of event (crime), while the actual answer is considering it as a newsworthy event that might be reported in world news sections or channels.

So, to sum up, both answers are relevant but approach the context from different angles—Crime specifically relates to the event's nature, whereas WORLD NEWS broadens it into the realm of news reporting.

</think>

The model answer "CRIME" and the actual answer "WORLD NEWS" both address the provided context, but they do so from different perspectives.

**CRIME**: The context explicitly describes an armed robbery, which is a criminal act. Therefore, "CRIME" directly relates to the nature of the event.

**WORLD NEWS**: While the event is indeed newsworthy, the term "WORLD NEWS" broadens the scope beyond the specific details, considering it as part of international or general news reporting.

In essence, both answers are relevant but focus on different aspects: "CRIME" highlights the event's nature as a criminal act, while "WORLD NEWS" recognizes its potential newsworthiness.
Finally, consider the reasoning when the model's answer and the actual answer match.
The reasoning behind a matching result strengthens trust and builds confidence in the model's predictions.
<think>

Okay, so I'm trying to understand why the model answer and the actual answer both say "MEDIA" when explaining the context. The question is asking me to explain whether they match or not based on the given text.

First, let me look at the context provided: "The newsroom is awash with fears about the casino mogul's intentions after his refusal to admit ownership." This sentence mentions a casino mogul and their refusal to own up to something. That refusal is causing fear in the newsroom regarding their intentions.

Now, looking at both the model answer and the actual answer, they both are "MEDIA". I think this might be an abbreviation or shorthand for media-related reasons. The context talks about the newsroom, which is a place associated with media organizations or journalists. So, when there's fear about someone's intentions, especially related to ownership, it makes sense that the media would be concerned and involved.

Wait, but why specifically does it say "MEDIA"? Maybe it's short for something else or maybe it's just a coincidence. However, given that the context is in a newsroom setting, it's likely referring to media-related issues.

Let me break it down: The casino mogul refusing ownership could imply some kind of hidden agenda or control. Since this is in a newsroom, which deals with reporting and information, the refusal might be seen as trying to hide something. This fear could lead to media coverage or reporting on it, hence the mention of MEDIA.

So, putting it together, both answers refer to the involvement or concern of media outlets regarding the casino mogul's actions. Therefore, they match because the context clearly ties into a media-related situation.

</think>

The model answer and actual answer both state "MEDIA" because the context provided refers to a newsroom setting, where concerns about a casino mogul's intentions are likely related to media involvement or reporting. The refusal to admit ownership sparks fear, suggesting possible media coverage or concerns, thus aligning both answers with the media-related context.

**Answer:** Both answers match because the context involves a newsroom, indicating media-related concerns over a casino mogul's intentions.
Conclusion
The power of reasoning gives users valuable insight into how the model works and highlights its strengths and weaknesses. It deepens understanding of the system and points to ways of improving it.
Imagine a model that only delivered its decisions without any explanation: users would never understand why the model made them and would find it hard to build confidence in it.
Original title: Text Clustering With Deepseek Reasoning, by Kalpan Dharamshi