Full-Text Content Recommendation Engine: Chinese Word Segmentation

Author: 閆濤 2011-08-16 16:24:28


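A content-based recommendation engine can be implemented in two ways: one works from item metadata (metadata can be thought of as attributes), the other from the items' text descriptions. This series first describes the full-text approach based on item descriptions, and then the content recommendation engine based on metadata.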


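For a content recommendation engine based on item text descriptions, plenty of reference material is available. The basic steps are as follows. First, tokenize the text: extract the words, remove stop words (such as 的, 地, 得), add synonyms, and, for English, strip plural and past-participle forms. Second, compute how often each term occurs in each article and across all articles, i.e. TF/IDF. Third, compute a vector for each article. Finally, apply an automatic clustering algorithm to the items, which makes it possible to recommend similar products to users.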
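The second and third steps can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the code used in the project: the class name TfIdfSketch, the length-normalized term frequency, the ln(N/df) form of IDF, and the use of cosine similarity to compare article vectors are all choices made here for the example. The documents are assumed to be tokenized already, and the clustering step is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class TfIdfSketch {

    /** Term frequency: occurrences of each term divided by the document length. */
    public static Map<String, Double> termFrequency(List<String> tokens) {
        Map<String, Double> tf = new HashMap<>();
        for (String t : tokens) {
            tf.merge(t, 1.0, Double::sum);
        }
        for (Map.Entry<String, Double> e : tf.entrySet()) {
            e.setValue(e.getValue() / tokens.size());
        }
        return tf;
    }

    /** idf(t) = ln(N / df(t)), where df(t) is the number of documents containing t. */
    public static Map<String, Double> inverseDocumentFrequency(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String t : new HashSet<>(doc)) { // count each term once per document
                df.merge(t, 1, Integer::sum);
            }
        }
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            idf.put(e.getKey(), Math.log((double) docs.size() / e.getValue()));
        }
        return idf;
    }

    /** Sparse TF-IDF vector of one document. */
    public static Map<String, Double> vector(List<String> tokens, Map<String, Double> idf) {
        Map<String, Double> v = new HashMap<>();
        for (Map.Entry<String, Double> e : termFrequency(tokens).entrySet()) {
            v.put(e.getKey(), e.getValue() * idf.getOrDefault(e.getKey(), 0.0));
        }
        return v;
    }

    /** Cosine similarity of two sparse vectors; 0 when either vector is all zeros. */
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            na += e.getValue() * e.getValue();
        }
        for (double x : b.values()) {
            nb += x * x;
        }
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }

    public static void main(String[] args) {
        // Three tiny, already tokenized documents (made-up sample data).
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("lucene", "index", "search"),
            Arrays.asList("lucene", "search", "query"),
            Arrays.asList("recipe", "noodle", "soup"));
        Map<String, Double> idf = inverseDocumentFrequency(docs);
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (List<String> d : docs) {
            vectors.add(vector(d, idf));
        }
        System.out.println("doc0 vs doc1: " + cosine(vectors.get(0), vectors.get(1)));
        System.out.println("doc0 vs doc2: " + cosine(vectors.get(0), vectors.get(2)));
    }
}
```

Documents that share weighted terms score above zero, while documents with no terms in common score exactly zero. A clustering algorithm would then group the vectors by such a similarity measure.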

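One important problem remains unsolved here, however: Chinese word segmentation. Most of that material assumes English text. In English, splitting out words is simple because whitespace serves as the delimiter, whereas Chinese has no spaces between words. Second, English has many plural and past-participle forms that need to be reduced to the singular present form, a problem that basically does not exist in Chinese. Finally, English requires recognizing long phrases after tokenization, a step that Chinese does not need either.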

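To address these difficulties, my project uses the MMSeg4j Chinese word segmentation module. It bundles a dictionary of more than 100,000 words, reportedly taken from the Sogou input method (as is well known, the dictionary is the key to Chinese word segmentation).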
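To illustrate why the dictionary matters so much, here is a minimal sketch of dictionary-based segmentation using plain forward maximum matching. This is a deliberately simplified stand-in and not MMSeg4j's actual algorithm or API; the class MaxMatchSketch, the segment method, and the four-entry dictionary are hypothetical and exist only for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MaxMatchSketch {

    /**
     * Forward maximum matching: at each position take the longest dictionary
     * word that starts there; fall back to a single character when none matches.
     */
    public static List<String> segment(String text, Set<String> dict, int maxWordLen) {
        List<String> words = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLen);
            // Shrink the candidate until it is in the dictionary or one character long.
            while (end > pos + 1 && !dict.contains(text.substring(pos, end))) {
                end--;
            }
            words.add(text.substring(pos, end));
            pos = end;
        }
        return words;
    }

    public static void main(String[] args) {
        // Tiny made-up dictionary; a real one holds on the order of 100,000 entries.
        Set<String> dict = new HashSet<>(Arrays.asList("中文", "分詞", "詞庫", "關鍵"));
        System.out.println(segment("中文分詞的關鍵是詞庫", dict, 4));
    }
}
```

With this toy dictionary, characters that are not covered by any entry come out as single-character tokens. Adding a longer entry such as 中文分詞 to dict makes the segmenter emit that four-character word as one token instead of two, so the result depends directly on what the dictionary contains.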

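In addition, I wanted the Chinese segmenter to be shared by the full-text search engine and the full-text content recommendation engine. Because the search engine uses Apache Lucene 3.x, the segmentation module has to fit Lucene's architecture. Fortunately, MMSeg4j provides the Tokenizer implementation that Lucene requires. The following issues also needed particular attention: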

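Opening the index files is slow, so the whole program shares a single indexer and searcher
To meet near-real-time requirements, the reopen mechanism of the newer Lucene versions is used, reading in the index changes before each query
Lucene's default locking mechanism is used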

In the project I defined the following full-text search engine class:

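import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.util.Version;

import com.chenlb.mmseg4j.analysis.MMSegAnalyzer;

public class FteEngine {

    // Must be called once, before the first call to getInstance().
    public static void initFteEngine(String _indexPathname) {
        indexPathname = _indexPathname;
    }

    // Singleton pattern; synchronized so that concurrent callers create only one engine.
    public static synchronized FteEngine getInstance() {
        if (null == engine) {
            engine = new FteEngine();
        }
        return engine;
    }

    public IndexWriter getIndexWriter() {
        return writer;
    }

    public synchronized IndexSearcher getIndexSearcher() {
        try {
            // Read in index changes added since the last call (near-real-time search).
            IndexReader newReader = reader.reopen();
            if (newReader != reader) { // reopen() returns the same instance when nothing changed
                reader.close();
                reader = newReader;
                searcher = null;
            }
            if (searcher == null) {
                searcher = new IndexSearcher(reader);
            }
        } catch (CorruptIndexException e) {
            // ....
        } catch (IOException e) {
            // ....
        }
        return searcher;
    }

    public Analyzer getAnalyzer() {
        return analyzer;
    }

    public void stop() {
        try {
            if (searcher != null) {
                searcher.close();
            }
            reader.close();
            writer.close();
            indexDir.close();
        } catch (IOException e) {
            // ....
        }
    }

    private FteEngine() {
        analyzer = new MMSegAnalyzer(); // initializes the Chinese segmenter, which loads the Chinese dictionary
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_31, analyzer);
        iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
        try {
            indexDir = FSDirectory.open(new File(indexPathname));
            writer = new IndexWriter(indexDir, iwc); // writer and reader are shared by the whole program
            reader = IndexReader.open(writer, true);
        } catch (CorruptIndexException e) {
            // ......
        } catch (LockObtainFailedException e) {
            // ......
        } catch (IOException e) {
            // .....
        }
    }

    private static FteEngine engine = null;
    private static String indexPathname = null;
    private Directory indexDir = null;
    private IndexWriter writer = null;
    private IndexSearcher searcher = null;
    private Analyzer analyzer = null;
    private IndexReader reader = null;
}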
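Because a single engine instance is shared by the whole program, lazy creation of that instance deserves some care when several threads may request it at the same time. One common option is the initialization-on-demand holder idiom, sketched below in plain Java. SharedEngine and its createdAtNanos field are hypothetical placeholders for a class holding expensive shared state; they are not part of the project code. As with any lazily created singleton, configuration such as an index path would still have to be set before the first getInstance() call.

```java
public class SharedEngine {

    // Hypothetical stand-in for expensive shared state such as an open index.
    private final long createdAtNanos;

    private SharedEngine() {
        createdAtNanos = System.nanoTime();
    }

    // The JVM initializes Holder on first access only, and class
    // initialization is thread-safe, so no explicit locking is needed.
    private static final class Holder {
        static final SharedEngine INSTANCE = new SharedEngine();
    }

    public static SharedEngine getInstance() {
        return Holder.INSTANCE;
    }

    public long createdAtNanos() {
        return createdAtNanos;
    }

    public static void main(String[] args) throws InterruptedException {
        final SharedEngine[] seen = new SharedEngine[2];
        Thread t1 = new Thread(() -> seen[0] = SharedEngine.getInstance());
        Thread t2 = new Thread(() -> seen[1] = SharedEngine.getInstance());
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println("same instance: " + (seen[0] == seen[1]));
    }
}
```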
 
The segmentation itself can be run with the following code:
 
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Wrapper class added so the snippet compiles; the class name is illustrative.
public class SegmentDemo {
    // Assumes FteEngine.initFteEngine(...) has already been called with the index path.
    public static void main(String[] args) {
        FteEngine fteEngine = FteEngine.getInstance();
        Analyzer analyzer = fteEngine.getAnalyzer();
        String text = "測試2011年如java有意見 分岐其中華人民共合國，oracle咬死獵人的狗！";
        TokenStream tokenStrm = analyzer.tokenStream("contents", new StringReader(text));
        CharTermAttribute charTermAttr = tokenStrm.addAttribute(CharTermAttribute.class);
        try {
            tokenStrm.reset();
            while (tokenStrm.incrementToken()) {
                // toString() returns exactly the current term; the raw buffer()
                // may still hold stale characters beyond length().
                System.out.println(charTermAttr.toString());
            }
            tokenStrm.end();
            tokenStrm.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

The output is as follows:

測試 2011 年如 java 有意見分岐其中華人民共合國 oracle 咬死獵人的狗

After the words 分岐 and 中華人民共合國 are added to the default dictionary, the segmentation result becomes:

測試 2011 年如 java 有意見分岐其中華人民共合國 oracle 咬死獵人的狗

This shows that Chinese word segmentation can be improved steadily by refining the Chinese dictionary.

Original article: http://www.cnblogs.com/yantao7589/archive/2011/08/16/2140399.html


Editor: 艾婧. Source: 閆濤's blog
