One remark from an OpenAI researcher upends assumptions: whether AI work succeeds really comes down to the choice of dataset

Author: AI寒武紀 2025-01-06 07:45:00

Before drawing a conclusion like "method X doesn't work", be careful to make sure the dataset used for testing actually exercises that method.

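OpenAI researcher Jason Wei has just published a post about an underrated but crucial skill in current AI research: finding a dataset that actually shows whether a new method works. The skill did not exist ten years ago, but today it can make or break a research project.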

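A common example is the question "on what datasets does chain of thought (CoT) improve performance?" A recent paper even argued that CoT mainly helps on math and logic tasks. Wei sees that view as a failure of imagination and a lack of diverse evals. If we simply test a CoT model on 100 random user chat prompts, we may see no clear difference, but only because those prompts were already solvable without CoT. In fact, CoT brings large gains on specific slices of data, such as math and coding tasks and any task with asymmetry of verification.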

In other words, before asserting that "method X doesn't work", make sure the dataset used for testing can actually show that method's value.

Wei's post stresses that, as model capabilities keep growing, the choice of dataset in AI research has become more subtle and more critical.

Full text

Jason Wei, AI researcher @OpenAI

An underrated but occasionally make-or-break skill in AI research (that didn’t really exist ten years ago) is the ability to find a dataset that actually exercises a new method you are working on. Back in the day when the bottleneck in AI was learning, many methods were dataset-agnostic; for example, a better optimizer would be expected to improve on both ImageNet and CIFAR-10. Nowadays language models are so multi-task that the answer to whether something works is almost always “it depends on the dataset”.

A common example of this is the question, “on what datasets does chain of thought improve performance?” A recent paper even argued (will link below) that CoT mainly helps on math/logic, and I think that is both a failure of imagination and a lack of diverse evals. Naively you might try CoT models on 100 random user chat prompts and not see much difference, but this is because the prompts were already solvable without CoT. In fact there is a small and very important slice of data where CoT makes a big difference—the obvious examples are math and coding, but include almost any task with asymmetry of verification. For example, generating a poem that fits a list of constraints is hard on the first try but much easier if you can draft and revise using CoT.
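
The point about aggregate scores hiding a slice-level gain can be made concrete with a small sketch. Everything here is an illustrative assumption rather than anything from Wei's post: the record fields (`slice`, `baseline_correct`, `method_correct`), the slice names, and the toy counts are made up to show the shape of the analysis, not real results.

```python
from collections import defaultdict


def accuracy_by_slice(records):
    """Return {slice: (baseline_acc, method_acc, n)}, plus an "ALL" row."""
    counts = defaultdict(lambda: [0, 0, 0])  # baseline hits, method hits, n
    for r in records:
        for key in (r["slice"], "ALL"):
            counts[key][0] += r["baseline_correct"]
            counts[key][1] += r["method_correct"]
            counts[key][2] += 1
    return {k: (b / n, m / n, n) for k, (b, m, n) in counts.items()}


def make(slice_name, n, baseline_hits, method_hits):
    """Build n toy records with the given number of correct answers."""
    return [
        {
            "slice": slice_name,
            "baseline_correct": i < baseline_hits,
            "method_correct": i < method_hits,
        }
        for i in range(n)
    ]


# Made-up toy data: ordinary chat prompts are already solved without CoT,
# while a small constrained-writing slice is where the method matters.
records = make("chat", 90, 88, 88) + make("constrained_poem", 10, 2, 8)

for name, (base, meth, n) in sorted(accuracy_by_slice(records).items()):
    print(f"{name:18s} n={n:3d} baseline={base:.2f} method={meth:.2f}")
```

On this toy data the aggregate row moves by only a few points while the small slice changes a great deal, which is why reporting per-slice numbers alongside the overall score is a reasonable default.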

As another made-up example, let’s say you want to know if browsing improves performance on geology exams. Maybe using browsing on some random geology dataset didn’t improve performance. The important thing to do here would be to see if the without-browsing model was actually suffering due to lack of world knowledge—if it wasn’t, then this was the wrong dataset to try browsing on.
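
One way to run the check described above is to measure how much of the baseline's error is even attributable to the thing the method supplies. The sketch below assumes each baseline result carries a hand-annotated `failure_reason` field; that field, the `missing_knowledge` label, and the toy counts are hypothetical and only illustrate the idea.

```python
def headroom(baseline_results, reason):
    """Estimate how much a method targeting `reason` could possibly help.

    `max_gain` is an upper bound: it assumes the method fixes every error
    with that reason and breaks nothing that was already correct.
    """
    if not baseline_results:
        raise ValueError("need at least one example")
    total = len(baseline_results)
    misses = [r for r in baseline_results if not r["correct"]]
    attributable = sum(r["failure_reason"] == reason for r in misses)
    return {
        "baseline_accuracy": (total - len(misses)) / total,
        "error_share": attributable / len(misses) if misses else 0.0,
        "max_gain": attributable / total,
    }


# Made-up toy annotations for a without-browsing model on a geology set.
baseline_results = (
    [{"correct": True, "failure_reason": None}] * 14
    + [{"correct": False, "failure_reason": "reasoning"}] * 5
    + [{"correct": False, "failure_reason": "missing_knowledge"}] * 1
)

report = headroom(baseline_results, "missing_knowledge")
print(report)
```

If `max_gain` comes out small, as in this toy case, a flat result says little about browsing itself; the dataset simply left the method almost nothing to fix.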

In other words you should hesitate to draw a conclusion like “X method doesn’t work” without ensuring that the dataset used for testing actually exercises that method. The inertia from five years ago is to take existing benchmarks and try to solve them, but nowadays there is a lot more flexibility and sometimes it even makes sense to create a custom dataset to showcase the initial usefulness of an idea. Obviously the danger with doing this is that a contrived dataset may not represent a substantial portion of user queries. But if the method is in principle general I think this is a good way to start and something people should do more often.

Editor: Zhang Yanni. Source: AI寒武紀
