Lucene in Action: The Scoring Algorithm Isn't That Hard!
As an open-source project, Lucene has drawn a huge response from the open-source community since its release. Programmers not only use it to build concrete full-text search applications, but also integrate it into all kinds of system software and web applications; some commercial products even adopt Lucene as the core of their internal full-text search subsystem. The Apache Software Foundation website uses Lucene as its full-text search engine, IBM's open-source Eclipse adopted Lucene as the full-text indexing engine of its help subsystem starting with version 2.1, and IBM's commercial WebSphere uses Lucene as well. Thanks to its open-source nature, excellent index structure, and sound architecture, Lucene keeps gaining adoption.
As a full-text search engine, Lucene has the following notable strengths:
(1) The index file format is independent of the application platform. Lucene defines a byte-oriented index file format, so compatible systems and applications on different platforms can share the index files they build.
(2) On top of the inverted index used by traditional full-text search engines, it implements segmented indexing: new documents are indexed into small segment files, which speeds up indexing, and those segments are later merged with the existing index for optimization.
(3) An excellent object-oriented architecture lowers the learning curve for extending Lucene and makes it easy to add new features.
(4) A text-analysis interface that is independent of language and file format: the indexer builds the index from a stream of Tokens, so supporting a new language or file format only requires implementing the text-analysis interface.
(5) A powerful query engine is provided out of the box, so users get strong query capabilities without writing any code; the default query implementation supports Boolean operators, fuzzy search, grouping queries, and more (see the short sketch after this list).
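As a quick illustration of that built-in query syntax, here is a minimal sketch; the class name, the field name "title", and the query terms are made up purely for illustration:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class QuerySyntaxDemo {
    public static void main(String[] args) throws ParseException {
        // "+" marks a required clause, "~" a fuzzy term, parentheses group clauses
        QueryParser parser = new QueryParser("title", new StandardAnalyzer());
        Query query = parser.parse("+lucene scoring~2 (bm25 OR tfidf)");
        System.out.println(query);
    }
}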
Lucene's index structure
1. Preparation
1.1 Download the latest source code: https://github.com/apache/lucene-solr
1.2 Build it following the instructions, using ant (I used ant eclipse).
1.3 Import the built project into Eclipse, STS, or IDEA.
2. Create a test class
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.KeywordTokenizer;
import org.apache.lucene.analysis.ngram.NGramTokenFilter;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public void test() throws IOException, ParseException {
    Analyzer analyzer = new NGramAnalyzer();
    // Store the index in memory:
    Directory directory = new RAMDirectory();
    // To store an index on disk, use this instead:
    //Path path = FileSystems.getDefault().getPath("E:\\demo\\data", "access.data");
    //Directory directory = FSDirectory.open(path);
    IndexWriterConfig config = new IndexWriterConfig(analyzer);
    IndexWriter iwriter = new IndexWriter(directory, config);
    Document doc = new Document();
    String text = "我是中國人.";
    doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
    iwriter.addDocument(doc);
    iwriter.close();
    // Now search the index:
    DirectoryReader ireader = DirectoryReader.open(directory);
    IndexSearcher isearcher = new IndexSearcher(ireader);
    isearcher.setSimilarity(new BM25Similarity());
    // Parse a simple query that searches for "text":
    QueryParser parser = new QueryParser("fieldname", analyzer);
    Query query = parser.parse("中國,人");
    ScoreDoc[] hits = isearcher.search(query, 1000).scoreDocs;
    // Iterate through the results:
    for (int i = 0; i < hits.length; i++) {
        Document hitDoc = isearcher.doc(hits[i].doc);
        System.out.println(hitDoc.getFields().toString());
    }
    ireader.close();
    directory.close();
}

private static class NGramAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        final Tokenizer tokenizer = new KeywordTokenizer();
        return new TokenStreamComponents(tokenizer, new NGramTokenFilter(tokenizer, 1, 4, true));
    }
}

Here, tokenization uses the custom NGramAnalyzer, which extends Analyzer. An Analyzer analyzes text and converts it into a TokenStream. Its Javadoc explains the details:
/**
 * An Analyzer builds TokenStreams, which analyze text. It thus represents a
 * policy for extracting index terms from text.
 * <p>
 * In order to define what analysis is done, subclasses must define their
 * {@link TokenStreamComponents TokenStreamComponents} in {@link #createComponents(String)}.
 * The components are then reused in each call to {@link #tokenStream(String, Reader)}.
 * <p>
 * Simple example:
 * <pre class="prettyprint">
 * Analyzer analyzer = new Analyzer() {
 *   {@literal @Override}
 *   protected TokenStreamComponents createComponents(String fieldName) {
 *     Tokenizer source = new FooTokenizer(reader);
 *     TokenStream filter = new FooFilter(source);
 *     filter = new BarFilter(filter);
 *     return new TokenStreamComponents(source, filter);
 *   }
 *   {@literal @Override}
 *   protected TokenStream normalize(TokenStream in) {
 *     // Assuming FooFilter is about normalization and BarFilter is about
 *     // stemming, only FooFilter should be applied
 *     return new FooFilter(in);
 *   }
 * };
 * </pre>
 * For more examples, see the {@link org.apache.lucene.analysis Analysis package documentation}.
 * <p>
 * For some concrete implementations bundled with Lucene, look in the analysis modules:
 * <ul>
 *   <li><a href="{@docRoot}/../analyzers-common/overview-summary.html">Common</a>:
 *       Analyzers for indexing content in different languages and domains.
 *   <li><a href="{@docRoot}/../analyzers-icu/overview-summary.html">ICU</a>:
 *       Exposes functionality from ICU to Apache Lucene.
 *   <li><a href="{@docRoot}/../analyzers-kuromoji/overview-summary.html">Kuromoji</a>:
 *       Morphological analyzer for Japanese text.
 *   <li><a href="{@docRoot}/../analyzers-morfologik/overview-summary.html">Morfologik</a>:
 *       Dictionary-driven lemmatization for the Polish language.
 *   <li><a href="{@docRoot}/../analyzers-phonetic/overview-summary.html">Phonetic</a>:
 *       Analysis for indexing phonetic signatures (for sounds-alike search).
 *   <li><a href="{@docRoot}/../analyzers-smartcn/overview-summary.html">Smart Chinese</a>:
 *       Analyzer for Simplified Chinese, which indexes words.
 *   <li><a href="{@docRoot}/../analyzers-stempel/overview-summary.html">Stempel</a>:
 *       Algorithmic Stemmer for the Polish Language.
 * </ul>
 *
 * @since 3.1
 */

ClassicSimilarity is a wrapper around TFIDFSimilarity; because TFIDFSimilarity is an abstract class, it cannot be instantiated directly with new. That algorithm was Lucene's default scoring implementation in earlier versions.
Put the test class into the lucene-solr source tree and step through it with the debugger. If you want to analyze the TF-IDF algorithm instead, simply new a ClassicSimilarity and set it on the IndexSearcher; other similarities can be swapped in the same way.
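For example, swapping in the classic TF-IDF scorer is a one-line change in the test above (ClassicSimilarity lives in org.apache.lucene.search.similarities):

isearcher.setSimilarity(new ClassicSimilarity()); // score with the legacy TF-IDF formula instead of BM25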
3. Algorithm overview
Newer versions of Lucene use BM25Similarity as the default scoring implementation, and the test above sets it explicitly. The algorithm is documented in detail in [1]; a brief overview follows:
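In the form given in [1], with q1..qn being the terms of query Q, the score is:

$$\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{avgdl}\right)}$$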
where:
D is the document, Q is the query, and score(D, Q) is the score that query Q assigns to document D.
IDF is the inverse document frequency: a measure of how rare the term is across the whole document collection, not how often it appears in a single document; qi is a term produced by tokenizing Q.
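Lucene's BM25Similarity computes this factor as (see the idf() method in section 4 below):

$$\mathrm{IDF}(q_i) = \ln\left(1 + \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}\right)$$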
where N is the total number of documents and n(qi) is the number of documents that contain the term qi.
f(qi, D) is the frequency of term qi in document D.
k1 and b are tunable parameters, with default values 1.2 and 0.75.
|D| is the length of the document in terms, and avgdl is the average document length across the collection.
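As a sanity check on the formula, here is a tiny standalone sketch computing the single-term score with made-up numbers; the class name and all values are purely illustrative:

import static java.lang.Math.log;

public class Bm25Example {

    /** Single-term BM25, mirroring the formula above. */
    static double score(long n, long N, double freq, double docLen, double avgdl, double k1, double b) {
        double idf = log(1 + (N - n + 0.5) / (n + 0.5));
        double norm = k1 * (1 - b + b * docLen / avgdl);
        return idf * freq * (k1 + 1) / (freq + norm);
    }

    public static void main(String[] args) {
        // 3 documents in total, the term appears in 1 of them, twice in that document;
        // that document has 10 terms, the average document length is 8; defaults k1=1.2, b=0.75
        System.out.println(score(1, 3, 2, 10, 8, 1.2, 0.75)); // prints roughly 1.26
    }
}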
4. Implementation
1. IDF implementation
The per-term IDF:
/** Implemented as <code>log(1 + (docCount - docFreq + 0.5)/(docFreq + 0.5))</code>. */
protected float idf(long docFreq, long docCount) {
    return (float) Math.log(1 + (docCount - docFreq + 0.5D)/(docFreq + 0.5D));
}

The aggregate IDF over a set of terms:
@Override
public final SimWeight computeWeight(float boost, CollectionStatistics collectionStats, TermStatistics... termStats) {
    Explanation idf = termStats.length == 1 ? idfExplain(collectionStats, termStats[0]) : idfExplain(collectionStats, termStats);
    float avgdl = avgFieldLength(collectionStats);

    float[] oldCache = new float[256];
    float[] cache = new float[256];
    for (int i = 0; i < cache.length; i++) {
        oldCache[i] = k1 * ((1 - b) + b * OLD_LENGTH_TABLE[i] / avgdl);
        cache[i] = k1 * ((1 - b) + b * LENGTH_TABLE[i] / avgdl);
    }
    return new BM25Stats(collectionStats.field(), boost, idf, avgdl, oldCache, cache);
}

/**
 * Computes a score factor for a phrase.
 *
 * <p>
 * The default implementation sums the idf factor for
 * each term in the phrase.
 *
 * @param collectionStats collection-level statistics
 * @param termStats term-level statistics for the terms in the phrase
 * @return an Explain object that includes both an idf
 *         score factor for the phrase and an explanation
 *         for each term.
 */
public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats[]) {
    double idf = 0d; // sum into a double before casting into a float
    List<Explanation> details = new ArrayList<>();
    for (final TermStatistics stat : termStats) {
        Explanation idfExplain = idfExplain(collectionStats, stat);
        details.add(idfExplain);
        idf += idfExplain.getValue();
    }
    return Explanation.match((float) idf, "idf(), sum of:", details);
}
2. The k1 and b parameters
public BM25Similarity(float k1, float b) {
    if (Float.isFinite(k1) == false || k1 < 0) {
        throw new IllegalArgumentException("illegal k1 value: " + k1 + ", must be a non-negative finite value");
    }
    if (Float.isNaN(b) || b < 0 || b > 1) {
        throw new IllegalArgumentException("illegal b value: " + b + ", must be between 0 and 1");
    }
    this.k1 = k1;
    this.b = b;
}

/** BM25 with these default values:
 * <ul>
 *   <li>{@code k1 = 1.2}</li>
 *   <li>{@code b = 0.75}</li>
 * </ul>
 */
public BM25Similarity() {
    this(1.2f, 0.75f);
}

3. Computing the average document length avgdl
/** The default implementation computes the average as <code>sumTotalTermFreq / docCount</code> */
protected float avgFieldLength(CollectionStatistics collectionStats) {
    final long sumTotalTermFreq;
    if (collectionStats.sumTotalTermFreq() == -1) {
        // frequencies are omitted (tf=1), its # of postings
        if (collectionStats.sumDocFreq() == -1) {
            // theoretical case only: remove!
            return 1f;
        }
        sumTotalTermFreq = collectionStats.sumDocFreq();
    } else {
        sumTotalTermFreq = collectionStats.sumTotalTermFreq();
    }
    final long docCount = collectionStats.docCount() == -1 ? collectionStats.maxDoc() : collectionStats.docCount();
    return (float) (sumTotalTermFreq / (double) docCount);
}

4. Computing the weight
/** Cache of decoded bytes. */
private static final float[] OLD_LENGTH_TABLE = new float[256];
private static final float[] LENGTH_TABLE = new float[256];

static {
    for (int i = 1; i < 256; i++) {
        float f = SmallFloat.byte315ToFloat((byte)i);
        OLD_LENGTH_TABLE[i] = 1.0f / (f*f);
    }
    OLD_LENGTH_TABLE[0] = 1.0f / OLD_LENGTH_TABLE[255]; // otherwise inf

    for (int i = 0; i < 256; i++) {
        LENGTH_TABLE[i] = SmallFloat.byte4ToInt((byte) i);
    }
}

@Override
public final SimWeight computeWeight(float boost, CollectionStatistics collectionStats, TermStatistics... termStats) {
    Explanation idf = termStats.length == 1 ? idfExplain(collectionStats, termStats[0]) : idfExplain(collectionStats, termStats);
    float avgdl = avgFieldLength(collectionStats);

    float[] oldCache = new float[256];
    float[] cache = new float[256];
    for (int i = 0; i < cache.length; i++) {
        oldCache[i] = k1 * ((1 - b) + b * OLD_LENGTH_TABLE[i] / avgdl);
        cache[i] = k1 * ((1 - b) + b * LENGTH_TABLE[i] / avgdl);
    }
    return new BM25Stats(collectionStats.field(), boost, idf, avgdl, oldCache, cache);
}

Each cache entry corresponds to the k1 * (1 - b + b * |D|/avgdl) part of the formula, precomputed for each of the 256 possible encoded document lengths.
5. Computing weightValue
BM25Stats(String field, float boost, Explanation idf, float avgdl, float[] oldCache, float[] cache) {
    this.field = field;
    this.boost = boost;
    this.idf = idf;
    this.avgdl = avgdl;
    this.weight = idf.getValue() * boost;
    this.oldCache = oldCache;
    this.cache = cache;
}

BM25DocScorer(BM25Stats stats, int indexCreatedVersionMajor, NumericDocValues norms) throws IOException {
    this.stats = stats;
    this.weightValue = stats.weight * (k1 + 1);
    this.norms = norms;
    if (indexCreatedVersionMajor >= 7) {
        lengthCache = LENGTH_TABLE;
        cache = stats.cache;
    } else {
        lengthCache = OLD_LENGTH_TABLE;
        cache = stats.oldCache;
    }
}

This corresponds to multiplying together the document-independent factors of the formula: weightValue = IDF(qi) * boost * (k1 + 1).
6. Computing the overall score
@Override
public float score(int doc, float freq) throws IOException {
    // if there are no norms, we act as if b=0
    float norm;
    if (norms == null) {
        norm = k1;
    } else {
        if (norms.advanceExact(doc)) {
            norm = cache[((byte) norms.longValue()) & 0xFF];
        } else {
            norm = cache[0];
        }
    }
    return weightValue * freq / (freq + norm);
}

Here norm is looked up from the cache, which holds the k1 * (1 - b + b * |D|/avgdl) values precomputed in computeWeight and indexed by the encoded document length.
With that, all the pieces of the formula are in place.
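Substituting weightValue = IDF(qi) * boost * (k1 + 1), freq = f(qi, D), and norm = k1 * (1 - b + b * |D|/avgdl) back into score = weightValue * freq / (freq + norm) recovers the full per-term BM25 expression (boost defaults to 1):

$$\mathrm{score}(q_i, D) = \mathrm{IDF}(q_i) \cdot boost \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{avgdl}\right)}$$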
7. Digging deeper
The inputs to scoring come from CollectionStatistics, TermStatistics, and freq. Where are they obtained?
SynonymWeight(Query query, IndexSearcher searcher, float boost) throws IOException {
    super(query);
    CollectionStatistics collectionStats = searcher.collectionStatistics(terms[0].field()); //1
    long docFreq = 0;
    long totalTermFreq = 0;
    termContexts = new TermContext[terms.length];
    for (int i = 0; i < termContexts.length; i++) {
        termContexts[i] = TermContext.build(searcher.getTopReaderContext(), terms[i]);
        TermStatistics termStats = searcher.termStatistics(terms[i], termContexts[i]); //2
        docFreq = Math.max(termStats.docFreq(), docFreq);
        if (termStats.totalTermFreq() == -1) {
            totalTermFreq = -1;
        } else if (totalTermFreq != -1) {
            totalTermFreq += termStats.totalTermFreq();
        }
    }
    TermStatistics[] statics = new TermStatistics[terms.length];
    for (int i = 0; i < terms.length; i++) {
        TermStatistics pseudoStats = new TermStatistics(terms[i].bytes(), docFreq, totalTermFreq, query.getKeyword());
        statics[i] = pseudoStats;
    }
    this.similarity = searcher.getSimilarity(true);
    this.simWeight = similarity.computeWeight(boost, collectionStats, statics);
}

Where CollectionStatistics comes from:
/**
 * Returns {@link CollectionStatistics} for a field.
 *
 * This can be overridden for example, to return a field's statistics
 * across a distributed collection.
 * @lucene.experimental
 */
public CollectionStatistics collectionStatistics(String field) throws IOException {
    final int docCount;
    final long sumTotalTermFreq;
    final long sumDocFreq;

    assert field != null;

    Terms terms = MultiFields.getTerms(reader, field);
    if (terms == null) {
        docCount = 0;
        sumTotalTermFreq = 0;
        sumDocFreq = 0;
    } else {
        docCount = terms.getDocCount();
        sumTotalTermFreq = terms.getSumTotalTermFreq();
        sumDocFreq = terms.getSumDocFreq();
    }
    return new CollectionStatistics(field, reader.maxDoc(), docCount, sumTotalTermFreq, sumDocFreq);
}

Where TermStatistics comes from:
/**
 * Returns {@link TermStatistics} for a term.
 *
 * This can be overridden for example, to return a term's statistics
 * across a distributed collection.
 * @lucene.experimental
 */
public TermStatistics termStatistics(Term term, TermContext context) throws IOException {
    return new TermStatistics(term.bytes(), context.docFreq(), context.totalTermFreq(), term.text());
}

Where freq (tf) comes from:
@Override
protected float score(DisiWrapper topList) throws IOException {
    return similarity.score(topList.doc, tf(topList));
}

/** combines TF of all subs. */
final int tf(DisiWrapper topList) throws IOException {
    int tf = 0;
    for (DisiWrapper w = topList; w != null; w = w.next) {
        tf += ((TermScorer) w.scorer).freq();
    }
    return tf;
}

The underlying implementation:
Lucene50PostingsReader.BlockPostingsEnum
@Override
public int nextDoc() throws IOException {
    if (docUpto == docFreq) {
        return doc = NO_MORE_DOCS;
    }
    if (docBufferUpto == BLOCK_SIZE) {
        refillDocs();
    }

    accum += docDeltaBuffer[docBufferUpto];
    freq = freqBuffer[docBufferUpto];
    posPendingCount += freq;
    docBufferUpto++;
    docUpto++;

    doc = accum;
    position = 0;
    return doc;
}

8. Summary
BM25's full name is Okapi BM25. It is an extension of the binary independence model and can also be used for relevance ranking in search. This post walked through Lucene's BM25Similarity implementation to understand the whole scoring formula in depth.
On top of that, it analyzed how the CollectionStatistics, TermStatistics, and freq inputs are computed.
Given all of this, customizing the scoring only requires extending the Similarity or SimilarityBase class and implementing your business-specific scoring formula there.
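A minimal sketch of such a custom similarity built on SimilarityBase; the class name DavidSimilarity and the formula inside score() are placeholders for real business logic, and the float-based score() signature is the one from the Lucene 7.x line used throughout this post:

import org.apache.lucene.search.similarities.BasicStats;
import org.apache.lucene.search.similarities.SimilarityBase;

public class DavidSimilarity extends SimilarityBase {

    @Override
    protected float score(BasicStats stats, float freq, float docLen) {
        // placeholder formula: a saturated tf multiplied by a BM25-style idf
        float idf = (float) Math.log(1 + (stats.getNumberOfDocuments() - stats.getDocFreq() + 0.5)
                / (stats.getDocFreq() + 0.5));
        return stats.getBoost() * idf * (freq / (freq + 1));
    }

    @Override
    public String toString() {
        return "DavidSimilarity";
    }
}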
Note: Solr cannot use a custom Similarity class directly; you need to put it under the org.apache.solr.search.similarities package. To use it,
configure managed-schema as follows:
<similarity class="solr.DavidSimilarityFactory"/>

Note that the class is referenced as solr.DavidSimilarityFactory, not org.apache.solr.search.similarities.DavidSimilarityFactory; using org.apache.solr.search.similarities.DavidSimilarityFactory fails with a ClassNotFoundException.
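A minimal sketch of the corresponding factory, assuming the hypothetical DavidSimilarity class above; it follows the pattern of the factories that ship in org.apache.solr.search.similarities:

package org.apache.solr.search.similarities;

import org.apache.lucene.search.similarities.Similarity;
import org.apache.solr.schema.SimilarityFactory;

public class DavidSimilarityFactory extends SimilarityFactory {

    @Override
    public Similarity getSimilarity() {
        // return the custom similarity sketched above
        return new DavidSimilarity();
    }
}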
References
[1] https://en.wikipedia.org/wiki/Okapi_BM25
[2] https://www.elastic.co/cn/blog/found-bm-vs-lucene-default-similarity
[3] http://www.blogjava.net/hoojo/archive/2012/09/06/387140.html
[4] https://cwiki.apache.org/confluence/display/GEODE/Lucene+Internals
Reposted from: https://www.cnblogs.com/davidwang456/p/10439112.html