22: Lucene.Net full-text retrieval

Pain point

When you search in Baidu, the keywords you type are rarely exact, yet the engine still finds content similar to them; searching for goods on JD.com works the same way.
A relational database cannot do this, because database queries are exact matches.
In real business, even though the user's input is imprecise, we still want to return similar content instead of simply answering "not found".

Solution

Full-text index:
Take the phrase "The People's Republic of China".
Word segmentation:
Split a long search sentence into individual words, then match those words to find the data.
Queries no longer run against the relational database; they run against the full-text index, e.g.:
1. Republic
2. People
3. Chinese
4. China
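
To make the idea concrete, here is a minimal sketch of dictionary-based segmentation in plain C#, using the reverse maximum matching strategy that the PanGu analyzer is described as using later in this post. The `Segmenter` class, the dictionary, and the space-free input are all illustrative inventions, not part of Lucene.Net or PanGu:

```csharp
using System;
using System.Collections.Generic;

// Reverse maximum matching: scan the text from the end and, at each position,
// take the longest dictionary word that ends there, falling back to a single
// character when nothing matches.
static class Segmenter
{
    public static List<string> Segment(string text, HashSet<string> dict, int maxWordLen = 4)
    {
        var words = new List<string>();
        int end = text.Length;
        while (end > 0)
        {
            int len = Math.Min(maxWordLen, end);
            // Shrink the candidate window until it matches a dictionary word.
            while (len > 1 && !dict.Contains(text.Substring(end - len, len)))
                len--;
            words.Insert(0, text.Substring(end - len, len));
            end -= len;
        }
        return words;
    }
}

class Program
{
    static void Main()
    {
        var dict = new HashSet<string> { "people", "republic", "china" };
        // A space-free toy input standing in for a Chinese sentence.
        var tokens = Segmenter.Segment("peoplerepublicchina", dict, 8);
        Console.WriteLine(string.Join("/", tokens)); // people/republic/china
    }
}
```

Real analyzers also handle stop words, punctuation, and unknown words; this only shows the core matching loop.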

What is Lucene.Net?

Lucene.Net is a high-performance, full-featured full-text search engine framework class library, developed entirely in C#. It is a technology that can be embedded in any application that needs full-text retrieval.
Lucene was originally written in Java; the C# port came later.
Version: 3.0.3 (it no longer seems to be updated, and the latest beta versions still have many bugs)
http://lucenenet.apache.org/download/download.html

If you need to build full-text retrieval or a log system, it is recommended to use ElasticSearch + Logstash + Kibana together with the NEST client; there is official documentation you can study.
PS: ElasticSearch's port 9200 should not be exposed to the public network.

lucene.net and various query methods

One in, one out:
In: generate the Lucene index from the relational database (content is stored after word segmentation).
Out: query by matching the segmented query terms against the Lucene index (queries no longer run against the relational database, only against the index).
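
Under the stated assumptions (Lucene.Net 3.0.3 from NuGet, with StandardAnalyzer standing in for the PanGu analyzer used later in this post), the "one in, one out" flow can be sketched end to end against an in-memory index; the field name and sample titles are made up for the demo:

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

class Program
{
    static void Main()
    {
        var directory = new RAMDirectory(); // in-memory stand-in for FSDirectory

        // In: analyze the documents and write them into the index.
        using (var writer = new IndexWriter(directory, new StandardAnalyzer(Version.LUCENE_30),
                                            true, IndexWriter.MaxFieldLength.LIMITED))
        {
            var doc1 = new Document();
            doc1.Add(new Field("title", "lucene full text search", Field.Store.YES, Field.Index.ANALYZED));
            writer.AddDocument(doc1);

            var doc2 = new Document();
            doc2.Add(new Field("title", "relational database match", Field.Store.YES, Field.Index.ANALYZED));
            writer.AddDocument(doc2);
        }

        // Out: segment a keyword into a query and match it against the index.
        var searcher = new IndexSearcher(directory);
        var parser = new QueryParser(Version.LUCENE_30, "title", new StandardAnalyzer(Version.LUCENE_30));
        TopDocs docs = searcher.Search(parser.Parse("lucene search"), 10);
        Console.WriteLine(docs.TotalHits); // only the first document matches: 1
    }
}
```

The rest of the post replaces RAMDirectory with FSDirectory and StandardAnalyzer with PanGuAnalyzer, but the shape of the flow is the same.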

  1. Lucene.Net.Documents
  2. Lucene.Net.Analysis
  3. Lucene.Net.Index
  4. Lucene.Net.QueryParsers
  5. Lucene.Net.Search
  6. Lucene.Net.Store
  7. Lucene.Net.Util

lucene storage structure

Documents

Provides a simple document class. A Document is just a set of named Fields, whose contents can be text or a TextReader instance.
A Document corresponds to one record in the database; when indexing, there is usually more than one Document.

Analysis

Defines an abstract Analyzer API for converting text from a Reader into a TokenStream, an enumeration of Tokens. A TokenStream is produced by applying TokenFilters to the output of a Tokenizer. A few standard implementations are provided, including a standard analyzer and a stop-word-based analyzer.
The analyzer is the word segmenter: it splits sentences into terms.

Index

Provides two main classes: IndexWriter, which creates indexes and adds Documents to them, and IndexReader, which accesses the data in an index.
This namespace handles the reading and writing of the index.

QueryParsers

Implements a QueryParser.
When querying, you need to build up various query conditions; given a long sentence, the parser segments it and generates the corresponding query clauses.

Search

Provides the data structures that represent queries: TermQuery for individual terms, PhraseQuery for phrases, and BooleanQuery for combining queries through boolean logic. The abstract Searcher turns queries into hits; IndexSearcher implements search over a single IndexReader.
This is what extracts data from the index according to the conditions you supply.

Store

Defines an abstract class for storing persistent data: the Directory, a collection of named files that are written by an IndexOutput and read by an IndexInput. Two implementations are provided: FSDirectory stores the files in a file-system directory, while RAMDirectory keeps the data structures resident in memory.
Used to save the index data, i.e. the index folder and its files.

Search

Search TermQuery

TermQuery: single-term query, e.g. new Term("title", "Zhang San")
Query syntax: title:Zhang San

Search BooleanQuery

BooleanQuery:
new Term("title", "Zhang San") AND new Term("title", "Li Si")
title:Zhang San +title:Li Si      (AND == +)

new Term("title", "Zhang San") OR new Term("title", "Li Si")
title:Zhang San title:Li Si       (OR == space)

Search WildcardQuery

WildcardQuery: wildcard query
new Term("title", "Zhang?")   title:Zhang?   matches "Zhang" followed by exactly one character ('*' matches any number of characters)

Search PrefixQuery

PrefixQuery: prefix query, matches terms starting with the given text
title:Zhang*

Search PhraseQuery

PhraseQuery: phrase query with a slop (interval distance); the terms must appear within the given distance of each other, e.g. "Republic" and "China" no more than 5 positions apart:
title:"Republic China"~5

Search FuzzyQuery

FuzzyQuery: approximate query; a misspelling such as "ipone" can still match "iphone"
title:iphone~

Search RangeQuery

RangeQuery: range query; [1 TO 100] is a closed interval, {1 TO 100} is an open interval
Typically used on price or time fields
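
The query types above can also be built with the Lucene.Net 3.0.3 object API instead of query-string syntax. The field names and term values below are placeholders, not data from this post's index:

```csharp
using System;
using Lucene.Net.Index;
using Lucene.Net.Search;

class QueryZoo
{
    static void Main()
    {
        // title:zhangsan -- single-term query
        Query term = new TermQuery(new Term("title", "zhangsan"));

        // +title:zhangsan +title:lisi -- AND == MUST (+), OR == SHOULD (space)
        var boolean = new BooleanQuery();
        boolean.Add(new TermQuery(new Term("title", "zhangsan")), Occur.MUST);
        boolean.Add(new TermQuery(new Term("title", "lisi")), Occur.MUST);

        // title:zhang? -- '?' matches exactly one character, '*' any run
        Query wildcard = new WildcardQuery(new Term("title", "zhang?"));

        // title:zhang* -- terms starting with "zhang"
        Query prefix = new PrefixQuery(new Term("title", "zhang"));

        // title:"republic china"~5 -- terms at most 5 positions apart
        var phrase = new PhraseQuery();
        phrase.Add(new Term("title", "republic"));
        phrase.Add(new Term("title", "china"));
        phrase.Slop = 5;

        // title:iphone~ -- edit-distance match ("ipone" would still hit)
        Query fuzzy = new FuzzyQuery(new Term("title", "iphone"));

        // price:[1 TO 100] -- numeric range, true/true = inclusive endpoints
        Query range = NumericRangeQuery.NewIntRange("price", 1, 100, true, true);

        Console.WriteLine(boolean); // +title:zhangsan +title:lisi
    }
}
```

Each of these can be passed to IndexSearcher.Search exactly like the parsed queries in the code further below.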

lucene usage process

NuGet package import

  1. Lucene.Net //the core dll, installed via NuGet
  2. PanGu //together with its companion packages, implements Chinese word segmentation
    After referencing the PanGu analyzer, the output directory gains a Dict folder containing the dictionaries PanGu uses.

Insert data into the index

        /// <summary>
        /// Initialize the index
        /// </summary>
        public static void InitIndex()
        {
            List<Commodity> commodityList = GetList();//get the data source
            FSDirectory directory = FSDirectory.Open(StaticConstant.TestIndexPath);//local folder the index is stored in
            //After word segmentation, the content is written to disk.
            //PanGuAnalyzer is the PanGu word segmenter: it matches words against its dictionary from back to front, then stores them.
            //The dictionaries can be maintained by hand, e.g. to add popular new words PanGu does not know by default; see the PanGu project site.
            using (IndexWriter writer = new IndexWriter(directory, new PanGuAnalyzer(), true, IndexWriter.MaxFieldLength.LIMITED))//index writer
            {
                foreach (Commodity commdity in commodityList)
                {
                    int k = 22;//arbitrary offset used to vary the demo data
                    Document doc = new Document();//one record
                    //Field: column name, value, whether the value is stored, whether the value is analyzed (segmented)
                    doc.Add(new Field("id", commdity.Id.ToString(), Field.Store.NO, Field.Index.NOT_ANALYZED));
                    doc.Add(new Field("title", commdity.Title, Field.Store.YES, Field.Index.ANALYZED));
                    doc.Add(new Field("url", commdity.Url, Field.Store.NO, Field.Index.NOT_ANALYZED));
                    doc.Add(new Field("imageurl", commdity.ImageUrl, Field.Store.NO, Field.Index.NOT_ANALYZED));
                    doc.Add(new Field("content", "this is lucene working,powerful tool " + k, Field.Store.YES, Field.Index.ANALYZED));
                    doc.Add(new NumericField("price", Field.Store.YES, true).SetDoubleValue((double)(commdity.Price + k)));
                    doc.Add(new NumericField("time", Field.Store.YES, true).SetIntValue(int.Parse(DateTime.Now.ToString("yyyyMMdd")) + k));
                    writer.AddDocument(doc);//write the document into the index
                }
                writer.Optimize();//optimizing == merging segments
            }
        }

Query data

        public static void Show()
        {
            FSDirectory dir = FSDirectory.Open(StaticConstant.TestIndexPath);
            IndexSearcher searcher = new IndexSearcher(dir);//searcher
            {
                FuzzyQuery query = new FuzzyQuery(new Term("title", "Coffee products are also high school politics"));
                //TermQuery query = new TermQuery(new Term("title", "anniversary"));//exact term match
                TopDocs docs = searcher.Search(query, null, 10000);//matched data
                foreach (ScoreDoc sd in docs.ScoreDocs)
                {
                    Document doc = searcher.Doc(sd.Doc);
                    Console.WriteLine("***************************************");
                    Console.WriteLine(string.Format("id={0}", doc.Get("id")));
                    Console.WriteLine(string.Format("title={0}", doc.Get("title")));
                    Console.WriteLine(string.Format("time={0}", doc.Get("time")));
                    Console.WriteLine(string.Format("price={0}", doc.Get("price")));
                    Console.WriteLine(string.Format("content={0}", doc.Get("content")));
                }
                Console.WriteLine("1: total hits = {0}", docs.TotalHits);
            }

            QueryParser parser = new QueryParser(Version.LUCENE_30, "title", new PanGuAnalyzer());//parser
            {
                string keyword = "The new curriculum standard of high school politics electives: legal knowledge in life";
                {
                    Query query = parser.Parse(keyword);
                    TopDocs docs = searcher.Search(query, null, 10000);//matched data

                    int i = 0;
                    foreach (ScoreDoc sd in docs.ScoreDocs)
                    {
                        if (i++ < 1000)
                        {
                            Document doc = searcher.Doc(sd.Doc);
                            Console.WriteLine("***************************************");
                            Console.WriteLine(string.Format("id={0}", doc.Get("id")));
                            Console.WriteLine(string.Format("title={0}", doc.Get("title")));
                            Console.WriteLine(string.Format("time={0}", doc.Get("time")));
                            Console.WriteLine(string.Format("price={0}", doc.Get("price")));
                        }
                    }
                    Console.WriteLine($"2: total hits = {docs.TotalHits}");
                }
                {
                    Query query = parser.Parse(keyword);
                    NumericRangeFilter<int> timeFilter = NumericRangeFilter.NewIntRange("time", 20090101, 20201231, true, true);//filter on the time field
                    SortField sortPrice = new SortField("price", SortField.DOUBLE, false);//false: ascending
                    SortField sortTime = new SortField("time", SortField.INT, true);//true: descending
                    Sort sort = new Sort(sortTime, sortPrice);//sort by time first, then by price

                    TopDocs docs = searcher.Search(query, timeFilter, 10000, sort);//matched data

                    //With filtering and sorting in place, you can also implement paging here.
                    int i = 0;
                    foreach (ScoreDoc sd in docs.ScoreDocs)
                    {
                        if (i++ < 1000)
                        {
                            Document doc = searcher.Doc(sd.Doc);
                            Console.WriteLine("***************************************");
                            Console.WriteLine(string.Format("id={0}", doc.Get("id")));
                            Console.WriteLine(string.Format("title={0}", doc.Get("title")));
                            Console.WriteLine(string.Format("time={0}", doc.Get("time")));
                            Console.WriteLine(string.Format("price={0}", doc.Get("price")));
                        }
                    }
                    Console.WriteLine("3: total hits = {0}", docs.TotalHits);
                }
            }
        }
  1. While using it, I found a problem: for the same sentence, simply adding spaces between the words of the title changed how much data was returned.
  2. Problem 1 can be worked around by adding the spaces manually; the results can be seen at the end.
  3. Even so, ElasticSearch is still the recommendation: Lucene has little documentation, fewer and fewer application scenarios, and wrapping it yourself takes too much effort.

Added by Siggles on Wed, 09 Feb 2022 05:19:04 +0200