Benchmarking Seven Major Keyword Extraction Algorithms in Python
I have been looking for an effective algorithm for the keyword extraction task. The goal is to find one that extracts keywords efficiently while balancing extraction quality against execution time, because my corpus is growing quickly and has already reached millions of rows. One key requirement is that the extracted keywords must always be meaningful on their own: they should convey something even when taken out of context.
In this article I test and compare several well-known keyword extraction algorithms on a corpus of 2,000 documents.
Libraries used
I used the following Python libraries for this study:
- NLTK, to help during the preprocessing phase and with some helper functions
- RAKE
- YAKE
- PKE
- KeyBERT
- Spacy
- Pandas and Matplotlib, plus other general-purpose libraries
Experimental procedure
The benchmark works as follows:

First, we import the dataset that contains our text data. Then, for each algorithm, we create a separate function that encapsulates its extraction logic:
algorithm_name(text: str) → [keyword1, keyword2, ..., keywordn]
Next, we create a function that extracts keywords from the entire corpus:
extract_keywords_from_corpus(algorithm, corpus) → {algorithm, corpus_keywords, elapsed_time}
After that, Spacy helps us define a matcher object that decides whether a keyword is meaningful for our task; it returns True or False.
Finally, we wrap everything into a function that produces the final report.
The dataset
I'm using a dataset of small texts collected from the internet. Here is a sample:
['To follow up from my previous questions. . Here is the result!\n',
 'European mead competitions?\nI’d love some feedback on my mead, but entering the Mazer Cup isn’t an option for me, since shipping alcohol to the USA from Europe is illegal. (I know I probably wouldn’t get caught/prosecuted, but any kind of official record of an issue could screw up my upcoming citizenship application and I’m not willing to risk that).\n\nAre there any European mead comps out there? Or at least large beer comps that accept entries in the mead categories and are likely to have experienced mead judges?',
 'Orange Rosemary Booch\n',
 'Well folks, finally happened. Went on vacation and came home to mold.\n',
 'I’m opening a gelato shop in London on Friday so we’ve been up non-stop practicing flavors - here’s one of our most recent attempts!\n',
 "Does anyone have resources for creating shelf stable hot sauce? Ferment and then water or pressure can?\nI have dozens of fresh peppers I want to use to make hot sauce, but the eventual goal is to customize a recipe and send it to my buddies across the States. I believe canning would be the best way to do this, but I'm not finding a lot of details on it. Any advice?",
 'what is the practical difference between a wine filter and a water filter?\nwondering if you could use either',
 'What is the best custard base?\nDoes someone have a recipe that tastes similar to Culver’s frozen custard?',
 'Mold?\n'
Most of the documents are food-related. We'll use a sample of 2,000 documents to test our algorithms.
We haven't preprocessed the texts yet, because some of the algorithms rely on stopwords and punctuation to produce their results.
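The article does not show how the corpus is loaded into the texts list used later on, so here is a minimal loading sketch, assuming the documents live in a CSV file with a single text column (both the file name food_posts.csv and the column name are hypothetical):

# a minimal loading sketch (not from the original article); "food_posts.csv" and its
# "text" column are hypothetical names used only for illustration
import pandas as pd

df = pd.read_csv("food_posts.csv")
texts = df["text"].tolist()  # the list of documents the benchmark will consume later
print(len(texts), texts[0][:80])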
The algorithms
Let's define the keyword extraction functions.
# imports assumed by the extractors below; the article lists RAKE, YAKE, pke and KeyBERT,
# and rake_nltk is assumed as the RAKE implementation since its API matches the calls here
from rake_nltk import Rake
import yake
import pke
from keybert import KeyBERT

# initiate BERT outside of the functions
bert = KeyBERT()

# 1. RAKE
def rake_extractor(text):
    """
    Uses Rake to extract the top 5 keywords from a text
    Arguments: text (str)
    Returns: list of keywords (list)
    """
    r = Rake()
    r.extract_keywords_from_text(text)
    return r.get_ranked_phrases()[:5]

# 2. YAKE
def yake_extractor(text):
    """
    Uses YAKE to extract the top 5 keywords from a text
    Arguments: text (str)
    Returns: list of keywords (list)
    """
    keywords = yake.KeywordExtractor(lan="en", n=3, windowsSize=3, top=5).extract_keywords(text)
    results = []
    # YAKE returns (keyword, score) tuples; keep only the keyword strings
    for scored_keywords in keywords:
        for keyword in scored_keywords:
            if isinstance(keyword, str):
                results.append(keyword)
    return results

# 3. PositionRank
def position_rank_extractor(text):
    """
    Uses PositionRank to extract the top 5 keywords from a text
    Arguments: text (str)
    Returns: list of keywords (list)
    """
    # define the valid parts of speech to occur in the graph
    pos = {'NOUN', 'PROPN', 'ADJ', 'ADV'}
    extractor = pke.unsupervised.PositionRank()
    extractor.load_document(text, language='en')
    extractor.candidate_selection(pos=pos, maximum_word_number=5)
    # weight the candidates using the sum of their words' scores, computed with a
    # random walk biased by the position of the words in the document. In the graph,
    # nodes are words that are connected if they occur within a window of 3 words.
    extractor.candidate_weighting(window=3, pos=pos)
    # get the 5 highest-scored candidates as keyphrases
    keyphrases = extractor.get_n_best(n=5)
    results = []
    # get_n_best returns (keyphrase, score) tuples; keep only the keyphrase strings
    for scored_keywords in keyphrases:
        for keyword in scored_keywords:
            if isinstance(keyword, str):
                results.append(keyword)
    return results

# 4. SingleRank
def single_rank_extractor(text):
    """
    Uses SingleRank to extract the top 5 keywords from a text
    Arguments: text (str)
    Returns: list of keywords (list)
    """
    pos = {'NOUN', 'PROPN', 'ADJ', 'ADV'}
    extractor = pke.unsupervised.SingleRank()
    extractor.load_document(text, language='en')
    extractor.candidate_selection(pos=pos)
    extractor.candidate_weighting(window=3, pos=pos)
    keyphrases = extractor.get_n_best(n=5)
    results = []
    for scored_keywords in keyphrases:
        for keyword in scored_keywords:
            if isinstance(keyword, str):
                results.append(keyword)
    return results

# 5. MultipartiteRank
def multipartite_rank_extractor(text):
    """
    Uses MultipartiteRank to extract the top 5 keywords from a text
    Arguments: text (str)
    Returns: list of keywords (list)
    """
    extractor = pke.unsupervised.MultipartiteRank()
    extractor.load_document(text, language='en')
    pos = {'NOUN', 'PROPN', 'ADJ', 'ADV'}
    extractor.candidate_selection(pos=pos)
    # build the Multipartite graph and rank candidates using a random walk;
    # alpha controls the weight adjustment mechanism, see TopicRank for the
    # threshold/method parameters
    extractor.candidate_weighting(alpha=1.1, threshold=0.74, method='average')
    keyphrases = extractor.get_n_best(n=5)
    results = []
    for scored_keywords in keyphrases:
        for keyword in scored_keywords:
            if isinstance(keyword, str):
                results.append(keyword)
    return results

# 6. TopicRank
def topic_rank_extractor(text):
    """
    Uses TopicRank to extract the top 5 keywords from a text
    Arguments: text (str)
    Returns: list of keywords (list)
    """
    extractor = pke.unsupervised.TopicRank()
    extractor.load_document(text, language='en')
    pos = {'NOUN', 'PROPN', 'ADJ', 'ADV'}
    extractor.candidate_selection(pos=pos)
    extractor.candidate_weighting()
    keyphrases = extractor.get_n_best(n=5)
    results = []
    for scored_keywords in keyphrases:
        for keyword in scored_keywords:
            if isinstance(keyword, str):
                results.append(keyword)
    return results

# 7. KeyBERT
def keybert_extractor(text):
    """
    Uses KeyBERT to extract the top 5 keywords from a text
    Arguments: text (str)
    Returns: list of keywords (list)
    """
    keywords = bert.extract_keywords(text, keyphrase_ngram_range=(3, 5), stop_words="english", top_n=5)
    results = []
    # KeyBERT returns (keyword, score) tuples; keep only the keyword strings
    for scored_keywords in keywords:
        for keyword in scored_keywords:
            if isinstance(keyword, str):
                results.append(keyword)
    return results
Each extractor takes a text as input and returns a list of keywords, so usage is very straightforward.
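As a quick sanity check, we can run a couple of the extractors on a string adapted from one of the sample documents shown earlier (the output shapes noted in the comments are indicative, not exact):

# quick usage sketch: apply two of the extractors to a sample document
sample = "What is the best custard base? Does someone have a recipe that tastes similar to Culver's frozen custard?"
print(rake_extractor(sample))     # up to 5 ranked phrases
print(keybert_extractor(sample))  # up to 5 keyphrases of 3-5 words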
Note: for some reason I could not initialize all the extractor objects outside of their functions. Whenever I did, TopicRank and MultipartiteRank threw errors. This is not ideal for performance, but the benchmark can still be completed.

We have already restricted the acceptable grammatical patterns by passing pos = {'NOUN', 'PROPN', 'ADJ', 'ADV'}: together with Spacy, this ensures that almost all the keywords make sense from a human-language point of view. We also want the keywords to contain three words, simply to get more specific keywords and avoid ones that are too generic.
Extracting keywords from the whole corpus
Now let's define a function that applies a single extractor to the entire corpus while logging some useful information.
import logging
import time
from tqdm import tqdm

# make the logging.info messages below visible when running as a script
logging.basicConfig(level=logging.INFO)

def extract_keywords_from_corpus(extractor, corpus):
    """This function uses an extractor to retrieve keywords from a list of documents"""
    extractor_name = extractor.__name__.replace("_extractor", "")
    logging.info(f"Starting keyword extraction with {extractor_name}")
    corpus_kws = {}
    start = time.time()
    # logging.info("Timer initiated.")  # <-- uncomment this if you want to log the start of the timer
    for idx, text in tqdm(enumerate(corpus), desc="Extracting keywords from corpus..."):
        corpus_kws[idx] = extractor(text)
    end = time.time()
    # logging.info("Timer stopped.")  # <-- uncomment this if you want to log the end of the timer
    elapsed = time.strftime("%H:%M:%S", time.gmtime(end - start))
    logging.info(f"Time elapsed: {elapsed}")
    return {"algorithm": extractor.__name__,
            "corpus_kws": corpus_kws,
            "elapsed_time": elapsed}
All this function does is bundle the extractor's output with a series of useful pieces of information (such as how long the task took) into a dictionary, which makes it easy to build the final report later.
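As an illustration, the dictionary returned for one run might look roughly like this (the keywords and timing below are invented placeholders based on the sample documents, not real benchmark output):

# illustrative only: the shape of the dictionary returned by extract_keywords_from_corpus
example_output = {
    "algorithm": "rake_extractor",
    "corpus_kws": {
        0: ["european mead competitions", "official record", "upcoming citizenship application"],
        1: ["orange rosemary booch"],
    },
    "elapsed_time": "00:00:02",
}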
The grammar matching function
This function ensures that the keywords returned by the extractors are always (or almost always) meaningful. Consider the following example:

We can clearly see that the first three keywords can stand on their own and are perfectly meaningful; we need no extra information to understand them. The fourth one, however, means nothing in isolation, so we want to avoid results like that as much as possible.
Spacy and its Matcher object help us do exactly this. We'll define a match function that takes a keyword and returns True if one of the defined patterns matches, and False otherwise.
import spacy
from spacy.matcher import Matcher

# the article does not say which spaCy model it uses; the small English model is assumed here
nlp = spacy.load("en_core_web_sm")

def match(keyword):
    """This function checks whether a keyword matches one of the defined POS patterns"""
    patterns = [
        [{'POS': 'PROPN'}, {'POS': 'VERB'}, {'POS': 'VERB'}],
        [{'POS': 'NOUN'}, {'POS': 'VERB'}, {'POS': 'NOUN'}],
        [{'POS': 'VERB'}, {'POS': 'NOUN'}],
        [{'POS': 'ADJ'}, {'POS': 'ADJ'}, {'POS': 'NOUN'}],
        [{'POS': 'NOUN'}, {'POS': 'VERB'}],
        [{'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'PROPN'}],
        [{'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'NOUN'}],
        [{'POS': 'ADJ'}, {'POS': 'NOUN'}],
        [{'POS': 'ADJ'}, {'POS': 'NOUN'}, {'POS': 'NOUN'}, {'POS': 'NOUN'}],
        [{'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'ADV'}, {'POS': 'PROPN'}],
        [{'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'VERB'}],
        [{'POS': 'PROPN'}, {'POS': 'PROPN'}],
        [{'POS': 'NOUN'}, {'POS': 'NOUN'}],
        [{'POS': 'ADJ'}, {'POS': 'PROPN'}],
        [{'POS': 'PROPN'}, {'POS': 'ADP'}, {'POS': 'PROPN'}],
        [{'POS': 'PROPN'}, {'POS': 'ADJ'}, {'POS': 'NOUN'}],
        [{'POS': 'PROPN'}, {'POS': 'VERB'}, {'POS': 'NOUN'}],
        [{'POS': 'NOUN'}, {'POS': 'ADP'}, {'POS': 'NOUN'}],
        [{'POS': 'PROPN'}, {'POS': 'NOUN'}, {'POS': 'PROPN'}],
        [{'POS': 'VERB'}, {'POS': 'ADV'}],
        [{'POS': 'PROPN'}, {'POS': 'NOUN'}],
    ]
    matcher = Matcher(nlp.vocab)
    matcher.add("pos-matcher", patterns)
    # create the spacy doc for the keyword
    doc = nlp(keyword)
    # run the matcher over the doc
    matches = matcher(doc)
    # if matches is not empty, at least one pattern was found
    if len(matches) > 0:
        return True
    return False
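A quick illustrative check of the matcher (the expected outputs assume a standard English spaCy model and may vary with the tagger):

# standalone, meaningful keyphrases should match one of the POS patterns,
# while fragments that need more context generally should not
print(match("gelato shop"))    # NOUN NOUN         -> expected True
print(match("hot sauce"))      # ADJ NOUN          -> expected True
print(match("is likely to"))   # no pattern fits   -> expected False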
The benchmark function
We're almost done: this is the last step before launching the script and collecting the results.
We'll define a benchmark function that takes our corpus and a boolean controlling whether the data is shuffled. For each extractor it calls the extract_keywords_from_corpus function, which returns a dictionary with that extractor's results, and we store that value in a list.
For each algorithm in the list, we then compute:
- the average number of extracted keywords per document
- the average number of matched keywords per document
- a score, defined as the average number of matched keywords divided by the time it took to run the extraction
We store all this data in a Pandas DataFrame and then export it to .csv.
import random
import numpy as np
import pandas as pd

def get_sec(time_str):
    """Get seconds from a HH:MM:SS time string."""
    h, m, s = time_str.split(':')
    return int(h) * 3600 + int(m) * 60 + int(s)

def benchmark(corpus, shuffle=True):
    """This function runs the benchmark for the keyword extraction algorithms"""
    logging.info("Starting benchmark...\n")

    # shuffle the corpus
    if shuffle:
        random.shuffle(corpus)

    # extract keywords from the corpus with every extractor
    results = []
    extractors = [
        rake_extractor,
        yake_extractor,
        topic_rank_extractor,
        position_rank_extractor,
        single_rank_extractor,
        multipartite_rank_extractor,
        keybert_extractor,
    ]
    for extractor in extractors:
        result = extract_keywords_from_corpus(extractor, corpus)
        results.append(result)

    # compute the average number of extracted keywords per document
    for result in results:
        len_of_kw_list = []
        for kws in result["corpus_kws"].values():
            len_of_kw_list.append(len(kws))
        result["avg_keywords_per_document"] = np.mean(len_of_kw_list)

    # match keywords against the POS patterns
    for result in results:
        for idx, kws in result["corpus_kws"].items():
            match_results = []
            for kw in kws:
                match_results.append(match(kw))
            result["corpus_kws"][idx] = match_results

    # compute the average number of matched keywords per document
    for result in results:
        len_of_matching_kws_list = []
        for idx, kws in result["corpus_kws"].items():
            len_of_matching_kws_list.append(len([kw for kw in kws if kw]))
        result["avg_matched_keywords_per_document"] = np.mean(len_of_matching_kws_list)
        # compute the average percentage of matched keywords, rounded to 2 decimals
        result["avg_percentage_matched_keywords"] = round(result["avg_matched_keywords_per_document"] / result["avg_keywords_per_document"], 2)

    # create a score based on the avg number of matched keywords divided by the time elapsed (in seconds)
    for result in results:
        elapsed_seconds = get_sec(result["elapsed_time"]) + 0.1
        # weigh the score based on the time elapsed
        result["performance_score"] = round(result["avg_matched_keywords_per_document"] / elapsed_seconds, 2)

    # delete corpus_kws before building the report
    for result in results:
        del result["corpus_kws"]

    # create the results dataframe and export it
    df = pd.DataFrame(results)
    df.to_csv("results.csv", index=False)
    logging.info("Benchmark finished. Results saved to results.csv")
    return df
Results
results = benchmark(texts[:2000], shuffle=True)

Here is the resulting report:

Let's visualize it:
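The original chart isn't reproduced here, but a bar chart of the performance score per algorithm can be drawn along these lines (a sketch that only assumes the results DataFrame returned by benchmark above):

import matplotlib.pyplot as plt

# one bar per algorithm, sorted by the time-weighted performance score
plot_df = results.sort_values("performance_score", ascending=False)
plt.figure(figsize=(10, 5))
plt.bar(plot_df["algorithm"], plot_df["performance_score"])
plt.ylabel("performance_score")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()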

Based on the score formula we defined (avg_matched_keywords_per_document / time_elapsed_in_seconds), Rake processes 2,000 documents in about 2 seconds; even though its accuracy is lower than KeyBERT's, the time factor makes it the winner.
If we consider accuracy alone, computed as the ratio between avg_matched_keywords_per_document and avg_keywords_per_document, we get these results:
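In code, that just means ranking the algorithms by the avg_percentage_matched_keywords column the benchmark already computes (a small sketch):

# rank the algorithms by accuracy only (matched keywords / extracted keywords)
print(results.sort_values("avg_percentage_matched_keywords", ascending=False)[
    ["algorithm", "avg_percentage_matched_keywords"]])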

Rake also performs quite well from an accuracy standpoint. If we leave time out of the picture, KeyBERT is clearly the most accurate algorithm and the one that extracts the most meaningful keywords; Rake comes second on accuracy, but by a wide margin.
If accuracy is what you need, KeyBERT is definitely the first choice; if speed matters, Rake is the obvious pick, since it is fast and its accuracy is still acceptable.