Full text search Lucene

  • Getting started with Lucene
      ◦ What is Lucene
      ◦ Role of Lucene
      ◦ Usage scenarios
      ◦ Advantages and disadvantages
  • Lucene application
      ◦ Indexing process
      ◦ Search process
      ◦ Use of the Field domain
      ◦ Index library maintenance
      ◦ Word splitters
      ◦ Advanced search cases
  • Lucene advanced
      ◦ Lucene underlying storage structure
      ◦ Dictionary sorting algorithm
      ◦ Lucene optimization
      ◦ Precautions for using Lucene

1. Theoretical basis of search technology

1.1. Why learn Lucene

Implemented the original way, the search process is as follows:

[figure: image/001.png]

The figure above shows the original approach to search. When there are few users and little data in the database, it is common for enterprises to implement search this way.

However, as the data grows, the pressure on the database becomes heavy and queries become very slow. We need a better solution to take load off the database.

The current scheme (using Lucene) is shown in the following figure

[figure: image/002.png]

To solve the pressure and speed problems, we replace direct database queries with an index library and use Lucene's API to operate the index library on the server. Searching is thus completely decoupled from the database.

1.2. Data query method

1.2.1. Sequential scanning method

Algorithm description:

Sequential scanning means that, to find files whose content contains a certain string, we examine the documents one by one: each document is read from beginning to end, and if it contains the string it is one of the files we want; we then move on to the next file, until every file has been scanned.

**Advantages:**

High query accuracy

**Disadvantages:**

Query speed degrades as the amount of data grows

Usage scenario:

  • LIKE fuzzy keyword queries in a database
  • the Ctrl+F search function of a text editor
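
As a minimal illustration (a sketch, not how any database implements LIKE), sequential scanning in Java is just a loop that checks every document in turn:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SequentialScan {
    // Check every document from beginning to end; cost grows linearly with the data.
    static List<String> scan(List<String> documents, String keyword) {
        List<String> matches = new ArrayList<>();
        for (String doc : documents) {
            if (doc.contains(keyword)) {
                matches.add(doc);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("HUAWEI maimang 7 mobile phone", "vivo X23 mobile phone");
        System.out.println(scan(docs, "vivo")); // [vivo X23 mobile phone]
    }
}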

1.2.2. Inverted index

Let's start with an example:

For example, when we look up a Chinese character in the Xinhua dictionary, the dictionary provides a radical table of contents (the index). We first find the radical in that table, and through it locate the character's page (the document).

Lucene will create an inverted index of documents

1. Extract key information from resources and establish index (directory)

2. When searching, locate resources by keyword through the index (directory)

Algorithm description:

Before any query is run, the content to be searched is extracted to form documents (the body), and the documents are segmented into words to form an index (the directory). The index is linked to the documents. At query time, the index is consulted first; the process of finding documents through the index is called full-text retrieval.

Word segmentation cuts a sentence into individual words and removes stop words (the Chinese particles 的, 地, 得; English words such as a, an, the). It also removes spaces and punctuation, converts uppercase letters to lowercase, and removes duplicate words.
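
The idea can be sketched in a few lines of Java (an illustrative toy, not Lucene's actual implementation): map each word to the set of document ids that contain it, then answer queries by looking the word up in the map.

import java.util.*;

public class TinyInvertedIndex {
    // term -> ids of the documents containing that term
    private final Map<String, Set<Integer>> index = new HashMap<>();

    // Crude "analysis": lowercase and split on non-word characters.
    // A real analyzer would also remove stop words and punctuation.
    public void addDocument(int docId, String content) {
        for (String term : content.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    public Set<Integer> search(String term) {
        return index.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        TinyInvertedIndex idx = new TinyInvertedIndex();
        idx.addDocument(1, "HUAWEI maimang 7 mobile phone");
        idx.addDocument(2, "vivo X23 mobile phone");
        System.out.println(idx.search("mobile")); // [1, 2]
        System.out.println(idx.search("vivo"));   // [2]
    }
}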

Why is inverted index faster than sequential scan?

**Intuition:** because the index removes duplicate words, its size is bounded: the commonly used Chinese words roughly fill a dictionary or two, and common English is covered by the Oxford dictionary. At computer speed, searching content the size of a few dictionaries is very fast. The number of articles, however, is unbounded, while the index never grows beyond roughly dictionary-plus-dictionary size. So finding documents by first querying the index and then following the index-to-document links is fast, whereas sequential scanning must read countless articles one by one, which is slow even for a computer.

**Advantages:**

High query accuracy

Query speed is fast and does not degrade as the amount of content grows

**Disadvantages:**

Index files take up additional disk space.

Usage scenario:

Massive data query

1.3. Application scenario of full-text retrieval technology

Application scenario:

1. On-site search (Baidu Tieba, forums, jd.com, taobao)

2. Vertical domain search (818 worknet)

3. Professional search engine companies (Google, Baidu)

2. Introduction to Lucene

2.1. What is full text retrieval

A computer indexing program scans every word in an article and builds an index entry for each word, recording how many times and where the word appears in the article. When the user queries, the retrieval program searches the pre-built index and returns the matching results to the user.

2.2. What is Lucene

[figure: image/003.png]

This is Doug Cutting, the founder of Lucene, Nutch, Hadoop, and other projects.

Lucene is a subproject of the Apache Software Foundation's Jakarta project. It is an open-source full-text search engine toolkit: not a complete full-text search engine, but the architecture of one, providing a complete query engine, an index engine, and some text analysis engines (English and German).

Lucene's purpose is to provide software developers with a simple, easy-to-use toolkit for adding full-text retrieval to a target system, or for building a complete full-text retrieval engine on top of it.

At present, many applications' search functions are based on Lucene, such as the search in Eclipse's help system. Lucene can index textual data, so as long as you can convert the data you want to index into text, Lucene can index and search your documents. For example, to index HTML or PDF documents, you first convert them into text, hand the converted content to Lucene for indexing, save the created index to disk or memory, and finally run user queries against that index. Because Lucene does not prescribe the format of the documents to be indexed, it is applicable to almost all search applications.

  • Lucene is an open source library for full-text retrieval and search, which is supported and provided by the Apache Software Foundation

  • Lucene provides a simple but powerful application program interface, which can do full-text indexing and search. In the Java development environment, Lucene is a mature free and open source tool

  • Lucene is not a ready-made search engine product, but it can be used to make search engine products

2.3. Lucene official website

Official website: http://lucene.apache.org/

[figure: image/007.png]

3. Lucene full text retrieval process

3.1. Index and search flowchart

[figure: image/clip_image008.jpg]

1. Green indicates the indexing process. Index the original content to be searched and build an index library. The indexing process includes:

Determine the original content, that is, the content to search

  • Get documents
  • Create document
  • Analysis document
  • Index document

2. Red indicates the search process. Search content from the index library. The search process includes:

The user enters keywords through the search interface

  • Create query
  • Perform a search from the index library
  • Render search results

3.2. Indexing process

In the process of document indexing, the document content to be searched by the user is indexed, and the index is stored in the index library (index).

3.2.1. Original content

The original content refers to the content to be indexed and searched.

The original content includes web pages on the Internet, data in the database, files on disk, etc.

3.2.2. Get documents (collect data)

The process of obtaining the original information to be searched from the Internet, database and file system is information collection. The purpose of collecting data is to index the original content.

Classification of collected data:

1. For web pages on the Internet, you can use tools to grab web pages locally and generate html files.

2. The data in the database can be directly connected to the database to read the data in the table.

3. For a file in the file system, the contents of the file can be read through I/O operation.

The software that collects information on the Internet is usually called crawler or spider, also known as network robot. Crawler visits every web page on the Internet and stores the obtained web page content.

3.2.3. Create documents

The purpose of obtaining the original content is to index it. Before indexing, the original content needs to be turned into Documents. A Document consists of Fields, and the content is stored in those Fields.

Here, we can treat a file on the disk as a document. The document includes some fields, as shown in the following figure:

[figure: image/008.png]

Note: each Document can have multiple Fields, different Documents can have different Fields, and the same Document can contain duplicate Fields (same field name and same field value).

3.2.4. Analysis document

Once the original content has been turned into Documents containing Fields, the content of each Field must be analyzed, that is, broken down into individual words.

For example, the following documents are analyzed as follows:

Content of original document:

vivo X23 8GB+128GB magic night blue all Netcom 4G mobile phone

HUAWEI Huawei maimang 7 6G+64G bright black all Netcom 4G mobile phone

Words after analysis:

vivo, x23, 8GB, 128GB, magic night, magic night blue, all network, all Netcom, Netcom, 4G, mobile phone, HUAWEI, HUAWEI, maimang 7....

3.2.5. Index document

Index the vocabulary units obtained by analyzing all the documents. The purpose of the index is searching: in the end only the indexed vocabulary units are searched, and through them the documents are found.

Creating an index means indexing the vocabulary units so that documents can be found through words. This structure is called an inverted index.

The inverted index structure is to find documents according to the content (vocabulary), as shown in the following figure:

[figure: image/009.png]

The inverted index structure, also called the reverse index structure, consists of the index and the documents. The index is the vocabulary: it is small in scale, while the document collection is large.

3.2.6. Lucene underlying storage structure

[figure: image/004.png]

3.3. Search process

Search is the process in which the user enters keywords and queries the index: the index is searched by keyword, the corresponding documents are found through the index, and thus the desired content is located.

3.3.1. User

That is, whoever uses the search. The user can be a natural person or a remotely invoked program.

3.3.2. User search interface

The full-text retrieval system provides a user search interface for users to submit search keywords and to display the results once the search completes. As shown below:

[figure: image/clip_image013.jpg]

Lucene does not provide a user search interface; one must be developed according to your own needs.

3.3.3. Create query

Before executing a search, a query object must be built. It can specify the query keywords and the document Field to search, and it corresponds to a concrete query syntax, for example:

**name:mobile** means to search for documents whose name Field contains "mobile".

**name:Huawei AND mobile** means to search for documents that contain both the keywords "Huawei" and "mobile".

3.3.4. Perform search

The search process over the index:

1. According to the query syntax, look up each search term in the inverted index dictionary to find the document list linked to that term.

For example, the search syntax **name:Huawei AND mobile phone** means that matching documents must contain both "Huawei" and "mobile phone".

[figure: image/010.png]

2. Because the operator is AND, the document lists for "Huawei" and "mobile phone" are intersected; a matching document must appear in the list of every search term.

3. Get the Field field data in the document.
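
Conceptually, step 2 is the classic merge-style intersection of two sorted document-id lists (an illustrative sketch in Java, not Lucene's internal code):

import java.util.ArrayList;
import java.util.List;

// Intersect two ascending lists of document ids, as an AND query does conceptually.
static List<Integer> intersect(List<Integer> a, List<Integer> b) {
    List<Integer> result = new ArrayList<>();
    int i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        int da = a.get(i), db = b.get(j);
        if (da == db) {           // document contains both terms
            result.add(da);
            i++; j++;
        } else if (da < db) {     // advance whichever list is behind
            i++;
        } else {
            j++;
        }
    }
    return result;
}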

3.3.5. Render results

Display the query results to the user in a friendly interface. Users locate the information they want from the search results. To help them do so quickly, many display aids are provided, such as highlighting keywords in the results or the page snapshots offered by Baidu.

[figure: image/clip_image016.jpg]

4. Introduction to Lucene

4.1. Lucene ready

Lucene can be downloaded from the official website. Lucene files have been prepared for the course. We use version 7.7.2. The file location is as follows:

[figure: image/005.png]

Effect after decompression:

[figure: image/006.png]

With the jar packages of these three files, Lucene's functionality can be used.

4.2. Development environment

JDK: 1.8 (Lucene 7 and above require JDK 1.8 or above)

Database: MySQL

The database script location is shown in the figure below:

[figure: image/011.png]

The effect of importing to MySQL is as follows:

[figure: image/012.png]

4.3. Create Java project

Create a Maven project without an archetype; it can then be tested. The result is as follows:

[figure: image/013.png]

4.4. Indexing process

4.4.1. Data acquisition

In an e-commerce website, the data source for full-text retrieval is the database; the contents of the sku table need to be read via JDBC.

4.4.1.1. Create pojo

public class Sku {
    
	//Commodity primary key id
    private String id;
    //Trade name
    private String name;
    //Price
    private Integer price;
    //Inventory quantity
    private Integer num;
    //picture
    private String image;
    //Classification name
    private String categoryName;
    //Brand name
    private String brandName;
    //Specifications
    private String spec;
    //sales volume
    private Integer saleNum;

	// getters and setters omitted

}

4.4.1.2. Create DAO interface

public interface SkuDao {

    /**
     * Query all Sku data
     * @return
     **/

    public List<Sku> querySkuList();
}

4.4.1.3. Create DAO interface implementation class

Implementation using JDBC:

public class SkuDaoImpl implements SkuDao {

    public List<Sku> querySkuList() {
        // Database connection
        Connection connection = null;
        // Precompiled statement
        PreparedStatement preparedStatement = null;
        // Result set
        ResultSet resultSet = null;
        // Product list
        List<Sku> list = new ArrayList<Sku>();

        try {
            // Load database driver
            Class.forName("com.mysql.jdbc.Driver");
            // Connect database
            connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/lucene", "root", "admin");

            // SQL statement
            String sql = "SELECT * FROM tb_sku";
            // Create preparedStatement
            preparedStatement = connection.prepareStatement(sql);
            // Get result set
            resultSet = preparedStatement.executeQuery();
            // Result set analysis
            while (resultSet.next()) {
                Sku sku = new Sku();
                sku.setId(resultSet.getString("id"));
                sku.setName(resultSet.getString("name"));
                sku.setSpec(resultSet.getString("spec"));
                sku.setBrandName(resultSet.getString("brand_name"));
                sku.setCategoryName(resultSet.getString("category_name"));
                sku.setImage(resultSet.getString("image"));
                sku.setNum(resultSet.getInt("num"));
                sku.setPrice(resultSet.getInt("price"));
                sku.setSaleNum(resultSet.getInt("sale_num"));
                list.add(sku);
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Close JDBC resources in reverse order of creation
            try { if (resultSet != null) resultSet.close(); } catch (Exception ignored) {}
            try { if (preparedStatement != null) preparedStatement.close(); } catch (Exception ignored) {}
            try { if (connection != null) connection.close(); } catch (Exception ignored) {}
        }

        return list;
    }
}

4.4.2. Implement indexing process

  1. Collect data
  2. Create Document object
  3. Create parser (word splitter)
  4. Create IndexWriterConfig configuration information class
  5. Create a Directory object and declare the repository location
  6. Create IndexWriter write object
  7. Write Document to index library
  8. Release resources
public class TestManager {

    @Test
    public void createIndexTest() throws Exception {
        // 1. Collect data
        SkuDao skuDao = new SkuDaoImpl();
        List<Sku> skuList = skuDao.querySkuList();

        // 2. Create Document object
        List<Document> documents = new ArrayList<Document>();
        for (Sku sku : skuList) {
            Document document = new Document();

            // Add Fields to the Document
            // Item Id
            // Store.YES: the value is stored in the document and can be retrieved from search results
            document.add(new TextField("id", sku.getId(), Field.Store.YES));
            // Trade name
            document.add(new TextField("name", sku.getName(), Field.Store.YES));
            // commodity price
            document.add(new TextField("price", sku.getPrice().toString(), Field.Store.YES));
            // Brand name
            document.add(new TextField("brandName", sku.getBrandName(), Field.Store.YES));
            // Classification name
            document.add(new TextField("categoryName", sku.getCategoryName(), Field.Store.YES));
            // Picture address
            document.add(new TextField("image", sku.getImage(), Field.Store.YES));

            // Put the Document in the list
            documents.add(document);
        }

        // 3. Create Analyzer word splitter, analyze documents and segment documents
        Analyzer analyzer = new StandardAnalyzer();

        // 4. Create a Directory object and declare the location of the index library
        Directory directory = FSDirectory.open(Paths.get("E:\\dir"));

        // 5. Create an IndexWriteConfig object and write the configuration required by the index
        IndexWriterConfig config = new IndexWriterConfig(analyzer);

        // 6. Create IndexWriter write object
        IndexWriter indexWriter = new IndexWriter(directory, config);

        // 7. Write to the index library and add the document object document through IndexWriter
        for (Document doc : documents) {
            indexWriter.addDocument(doc);
        }

        // 8. Release resources
        indexWriter.close();
    }

}

Execution effect:

The following files appear in the folder, indicating that the index was created successfully

[figure: image/014.png]

4.5. Using Luke to view the index

Luke is a tool from the Lucene toolkit ( http://www.getopt.org/luke/ ) that lets you query and modify index files through a GUI.

Luke's location is shown in the following figure:

[figure: image/015.png]

Put the contents of luke-swing-8.0.0 into a folder in the root of the drive, with no spaces or Chinese characters in the path.

Run luke.bat

[figure: image/016.png]

After opening, use it as shown in the following figure:

[figure: image/017.png]

The following figure shows the display effect of index fields:

[figure: image/018.png]

The following figure shows the display effect of document field

[figure: image/019.png]

4.6. Search process

4.6.1. Enter query statement

In Lucene, query statements are entered through a query object. Like SQL for a database, Lucene has its own fixed query syntax:

The most basic ones are: AND, OR, NOT, etc. (must be capitalized)

For example:

The user wants to find documents whose name field contains the keyword "hand" or "machine".

The corresponding query statement: name:hand OR name:machine

The following figure shows an example of using luke search:

[figure: image/020.png]

4.6.1.1. Search word segmentation

As in the indexing process, the keywords entered by the user must be segmented; generally the same analyzer is used for indexing and searching.

For example, enter the search keyword "java learning". After word segmentation, there are two words: Java and learning. The contents related to Java and learning are searched out, as follows:

[figure: image/clip_image042.jpg]

4.6.2. Code implementation

  1. Create Query search object
  2. Create a Directory stream object and declare the location of the index library
  3. Create index read object IndexReader
  4. Create index search object IndexSearcher
  5. Use the index to search the object, execute the search, and return the result set TopDocs
  6. Parse result set
  7. Release resources

IndexSearcher's search methods are as follows:

| Method | Description |
| --- | --- |
| indexSearcher.search(query, n) | Search by Query and return the n highest-scoring records |
| indexSearcher.search(query, filter, n) | Search by Query with a filtering strategy, returning the n highest-scoring records |
| indexSearcher.search(query, n, sort) | Search by Query with a sorting strategy, returning the top n records |
| indexSearcher.search(booleanQuery, filter, n, sort) | Search by Query with both filtering and sorting strategies, returning the top n records |
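
For example, a search with a sorting strategy (a sketch; sorting on price requires that the price Field also be indexed with the NumericDocValuesField shown in section 5.3.2, since Lucene sorts via DocValues):

// Top 10 results ordered by price, descending (true = reverse order).
// Requires at index time: document.add(new NumericDocValuesField("price", sku.getPrice()));
Sort sort = new Sort(new SortField("price", SortField.Type.INT, true));
TopDocs topDocs = indexSearcher.search(query, 10, sort);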

Code implementation:

public class TestSearch {

    @Test
    public void testIndexSearch() throws Exception {
        // 1. Create Query search object
        // Create word breaker
        Analyzer analyzer = new StandardAnalyzer();
        // Create a search parser. The first parameter is the default Field field, and the second parameter is the word splitter
        QueryParser queryParser = new QueryParser("brandName", analyzer);

        // Create search object
        Query query = queryParser.parse("name:mobile phone AND Huawei");

        // 2. Create a Directory stream object and declare the location of the index library
        Directory directory = FSDirectory.open(Paths.get("E:\\dir"));

        // 3. Create index reading object IndexReader
        IndexReader reader = DirectoryReader.open(directory);

        // 4. Create index search object
        IndexSearcher searcher = new IndexSearcher(reader);

        // 5. Use the index to search the object, execute the search, and return the result set TopDocs
        // The first parameter: search object, and the second parameter: the number of data returned, specifying the top n data returned in the query result
        TopDocs topDocs = searcher.search(query, 10);
        System.out.println("The total number of data queried is:" + topDocs.totalHits);
        // Get query result set
        ScoreDoc[] docs = topDocs.scoreDocs;

        // 6. Parse result set
        for (ScoreDoc scoreDoc : docs) {
            // Get document
            int docID = scoreDoc.doc;
            Document doc = searcher.doc(docID);

            System.out.println("=============================");
            System.out.println("docID:" + docID);
            System.out.println("id:" + doc.get("id"));
            System.out.println("name:" + doc.get("name"));
            System.out.println("price:" + doc.get("price"));
            System.out.println("brandName:" + doc.get("brandName"));
            System.out.println("image:" + doc.get("image"));
        }
        // 7. Release resources
        reader.close();
    }
}

5. Field field type

5.1. Field property

A Field is a domain in a Document, consisting of a field name and a field value. A Document can contain multiple Fields; the Document is just a carrier of Fields. The field value is the content that is indexed and searched.

  • Tokenized

Yes: tokenize, i.e. segment the Field value into words. The purpose of tokenization is indexing.

For example: product name, product description. Users search these contents by entering keywords; since the content is long and varied, it must be segmented into vocabulary units, which are then indexed.

No: no word segmentation

For example, commodity id, order number, id number, etc.

  • Indexed

Yes: index. The words produced by segmenting the Field, or the whole Field value, are indexed and stored in the index domain. The purpose of indexing is searching.

For example: product names and product descriptions are analyzed and then indexed; order numbers and ID numbers need no analysis but should still be indexed, since they can be search criteria.

No: do not index.

For example, image paths and file paths are not used as query conditions, so they do not need to be indexed.

  • Stored

Yes: store the Field value in the document; a stored Field's value can later be retrieved from the Document.

For example: product name, order number, and any other field whose value will later be retrieved from the Document should be stored.

No: do not store the Field value

For example, the product description has a large content and does not need to be stored. If you want to show the product description to users, you can get it from the relational database of the system.

5.2. Field common types

The Field types commonly used in development are listed below. Note each Field's attributes and choose according to your requirements:

| Field class | Data type | Analyzed | Indexed | Stored | Description |
| --- | --- | --- | --- | --- | --- |
| StringField(FieldName, FieldValue, Store.YES) | String | N | Y | Y or N | Builds a string Field that is not tokenized; the whole string is indexed, e.g. order number, ID number. Store.YES or Store.NO decides whether it is stored in the document |
| FloatPoint(FieldName, FieldValue) | Float | Y | Y | N | Builds a Float numeric Field that is tokenized and indexed but not stored, e.g. price |
| DoublePoint(FieldName, FieldValue) | Double | Y | Y | N | Builds a Double numeric Field that is tokenized and indexed but not stored |
| LongPoint(FieldName, FieldValue) | Long | Y | Y | N | Builds a Long numeric Field that is tokenized and indexed but not stored |
| IntPoint(FieldName, FieldValue) | Integer | Y | Y | N | Builds an Integer numeric Field that is tokenized and indexed but not stored |
| StoredField(FieldName, FieldValue) | overloaded, supports multiple types | N | N | Y | Builds Fields of various types that are neither analyzed nor indexed, but are stored in the document |
| TextField(FieldName, FieldValue, Store.NO) or TextField(FieldName, reader) | String or stream | Y | Y | Y or N | If a Reader is supplied, Lucene assumes the content is large and uses the unstored strategy |
| NumericDocValuesField(FieldName, FieldValue) | numeric | - | - | - | Used together with another field to support sorting |

5.3. Field modification

5.3.1. Modify analysis

Book id:

Word segmentation: no, because books will not be searched by book id

Index or not: do not index, because there is no need to search by book ID

Store or not: it needs to be stored, because the query result page needs to use the value of id.

Book Name:

Word segmentation: word segmentation is necessary, because we need to search according to the keyword of the book name.

Index: to index.

Whether to store: to store.

Book price:

Word segmentation: yes. For numeric values, whenever there is a search requirement (such as range queries on price), Lucene applies its special numeric analysis, which requires tokenizing and indexing.

Index: to index

Store: to store

Book picture address:

Word segmentation: no word segmentation

Index: no index

Store: to store

Book Description:

Word segmentation: word segmentation is required

Index: to index

Whether to store: no. The book description is large and is not displayed directly on the result page, so it is not stored.

A value that is not stored is not recorded in Lucene's document store, which saves index file space.

If you want to display the description on the details page, the solution:

Take out the book id from lucene, and query the book table in relational database (MySQL) according to the book id to get the description information.

5.3.2. Code modification

Modify the createIndexTest() method written earlier.

Code snippet:

Document document = new Document();

// Add Fields to the Document
// Commodity Id, no word segmentation, index, storage
document.add(new StringField("id", sku.getId(), Field.Store.YES));
// Commodity name, word segmentation, index, storage
document.add(new TextField("name", sku.getName(), Field.Store.YES));

// Commodity price, word segmentation, index, no storage, no sorting
document.add(new FloatPoint("price", sku.getPrice()));
//Add price storage support
document.add(new StoredField("price", sku.getPrice()));
//Add price sorting support
//document.add(new NumericDocValuesField("price",sku.getPrice()));

// Brand name, no word segmentation, index, storage
document.add(new StringField("brandName", sku.getBrandName(), Field.Store.YES));
// Classification name, no word segmentation, index, storage
document.add(new StringField("categoryName", sku.getCategoryName(), Field.Store.YES));
// Picture address, no word segmentation, no index, storage
document.add(new StoredField("image", sku.getImage()));

// Put the Document in the list
documents.add(document);

6. Index maintenance

6.1. Requirement

Administrators change book information through the e-commerce system, which updates the relational database. If Lucene is used to search book information, the Lucene index library must be updated promptly whenever book data in the database changes.

6.2. Add index

Call indexWriter.addDocument(doc) to add a document to the index.

Refer to the index-creation code in the getting-started program.

6.3. Modify index

Updating the index means deleting first and then adding; this is the recommended approach for update requirements. To make sure an existing entry is updated, you can query first to confirm the record exists, then perform the update.

If the target document to update does not exist, an add is performed instead.

Code:

@Test
public void testIndexUpdate() throws Exception {
    // Create word breaker
    Analyzer analyzer = new StandardAnalyzer();
    // Create Directory stream object
    Directory directory = FSDirectory.open(Paths.get("E:\\dir"));
    // Create an IndexWriteConfig object and write the configuration required by the index
    IndexWriterConfig config = new IndexWriterConfig(analyzer);
    // Create write object
    IndexWriter indexWriter = new IndexWriter(directory, config);

    // Create Document
    Document document = new Document();
    document.add(new TextField("id", "1202790956", Field.Store.YES));
    document.add(new TextField("name", "lucene test test 002", Field.Store.YES));

    // updateDocument deletes all documents matching the Term, then adds the new document.
    indexWriter.updateDocument(new Term("id", "1202790956"), document);

    // Release resources
    indexWriter.close();
}

6.4. Delete index

6.4.1. Deletes the specified index

Delete the index according to the Term item, and all that meet the conditions will be deleted.

@Test
public void testIndexDelete() throws Exception {
    // Create word breaker
    Analyzer analyzer = new StandardAnalyzer();
    // Create Directory stream object
    Directory directory = FSDirectory.open(Paths.get("E:\\dir"));
    // Create an IndexWriteConfig object and write the configuration required by the index
    IndexWriterConfig config = new IndexWriterConfig(analyzer);
    // Create write object
    IndexWriter indexWriter = new IndexWriter(directory, config);

    // Delete from the index library according to the Term id:998188
    indexWriter.deleteDocuments(new Term("id", "998188"));

    // Release resources
    indexWriter.close();
}

The effect is as follows: the index field has not changed

[figure: image/021.png]

The document field data is deleted

[figure: image/022.png]

6.4.2. Delete all indexes (use with caution)

Deletes all index data in the index directory, immediately and completely; it cannot be recovered.

As with deleting from a relational database by primary key, it is recommended to create a primary-key Field when building the index and to delete by that Field.

In Lucene 3.x, deleted index data went to a Lucene "recycle bin" and deleted documents could be restored; after 3.x they cannot be recovered.

Code:

@Test
public void testIndexDeleteAll() throws Exception {
    // Create word breaker
    Analyzer analyzer = new StandardAnalyzer();
    // Create Directory stream object
    Directory directory = FSDirectory.open(Paths.get("E:\\dir"));
    // Create an IndexWriteConfig object and write the configuration required by the index
    IndexWriterConfig config = new IndexWriterConfig(analyzer);
    // Create write object
    IndexWriter indexWriter = new IndexWriter(directory, config);

    // Delete all
    indexWriter.deleteAll();

    // Release resources
    indexWriter.close();
}

The index domain data is cleared

[figure: image/023.png]

The document field data is also cleared

[figure: image/024.png]

7. Word splitter

7.1. Word segmentation comprehension

Before the content of a Document's Fields is indexed, it must be analyzed by a word splitter. The purpose of analysis is ultimately searching. Analysis consists of two main steps: tokenization, then filtering.

  • Tokenization: the collected data is stored in the Fields of the Document object; tokenization cuts each Field's value into individual words.

  • Filtering: includes removing punctuation, removing stop words (a, an, the, etc.), converting uppercase to lowercase, and stemming/lemmatization (plural to singular, past tense to present tense), and so on.

What is a stop word? To save storage space and improve search efficiency, search engines automatically ignore certain characters or words when indexing pages or processing search requests; these are called stop words. Modal particles, adverbs, prepositions, and conjunctions usually have no distinct meaning of their own and only play a role within a complete sentence, such as the common Chinese "的", "在", "是", "啊".

Different languages have different segmentation rules, and Lucene ships analyzers for various languages in its toolkit.
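
As a small sketch of how stop words plug into analysis, StandardAnalyzer accepts a custom stop-word set (with no argument it uses its built-in English defaults):

import java.util.Arrays;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// Tokens matching these entries are dropped by the analyzer's stop filter.
CharArraySet stopWords = new CharArraySet(Arrays.asList("a", "an", "the"), true); // true = ignore case
Analyzer analyzer = new StandardAnalyzer(stopWords);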

7.2. Analyzer usage timing

7.2.1. Use Analyzer when indexing

When a user enters a keyword to search, the keyword must be matched against the words contained in the document's Field contents, so the Field contents must first be processed by the Analyzer into vocabulary units (Tokens). The object the Analyzer works on is the Field of the Document; a Field's value is analyzed when its tokenized attribute is true, as shown in the following figure:

[figure: image/clip_image058.jpg]

For some fields, analysis can be omitted:

1. Contents not used as query criteria, such as file path

2. Content that is matched as a whole Field rather than word by word, such as order numbers and ID numbers.

7.2.2. Use Analyzer when searching

Search keywords are analyzed the same way as index content: the Analyzer segments the keywords, and each resulting word is used to search. For example, the keyword "Spring web" is segmented into spring and web; each word is looked up in the index dictionary, the index's link to Documents is followed, and the Document content is read.

Queries that match an entire Field, such as order number or ID number queries, need no analysis before searching.

Note: the analyzer used in search should be consistent with that used in index.

7.3. Lucene native word splitter

The following is the built-in word splitter in Lucene

7.3.1. StandardAnalyzer

**Features:**

Lucene's standard analyzer can segment English by words. For Chinese it performs single-character segmentation, i.e. each character is treated as one word.

Here is part of the source code of org.apache.lucene.analysis.standard.StandardAnalyzer:

protected TokenStreamComponents createComponents(String fieldName) {
    final StandardTokenizer src = new StandardTokenizer();
    src.setMaxTokenLength(this.maxTokenLength);
    TokenStream tok = new LowerCaseFilter(src);
    tok = new StopFilter(tok, this.stopwords); // reassign: filters form a chain, each wrapping the previous stream
    return new TokenStreamComponents(src, tok) {
        protected void setReader(Reader reader) {
            src.setMaxTokenLength(StandardAnalyzer.this.maxTokenLength);
            super.setReader(reader);
        }
    };
}

A Tokenizer is responsible for converting a Reader into vocabulary units, i.e. performing the segmentation. Lucene provides many Tokenizers, and third-party ones, such as the Chinese analyzer IKAnalyzer, can also be used.

A TokenFilter is a token filter responsible for filtering vocabulary units; TokenFilters can be chained. Lucene provides many TokenFilters, e.g. for case conversion and stop-word removal.

The following figure shows the generation process of vocabulary unit:

[figure: image/clip_image054.jpg]

Starting from a Reader character stream, a Tokenizer consumes the Reader, and vocabulary units (Tokens) are produced after passing through three TokenFilters.

For example, the following documents are analyzed by the analyzer as follows:

Content of original document:

[figure: image/clip_image055.gif]

Multiple vocabulary units obtained after analysis:

[figure: image/clip_image056.gif]
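
To observe these vocabulary units directly without building an index, you can pull the TokenStream from an analyzer and print each term (a small sketch using the standard TokenStream API; the field name "name" is arbitrary here):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class PrintTokens {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        try (TokenStream ts = analyzer.tokenStream("name", "vivo X23 8GB+128GB Magic night blue")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();                 // mandatory before the first incrementToken()
            while (ts.incrementToken()) {
                System.out.println(term.toString());
            }
            ts.end();                   // mandatory after the last token
        }
    }
}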

7.3.2. WhitespaceAnalyzer

**Features:**

Splits only on whitespace and does nothing else; Chinese is not supported.

Test code:

@Test
public void TestWhitespaceAnalyzer() throws Exception{
    // 1. Create a word splitter, analyze the document, and segment the document
    Analyzer analyzer = new WhitespaceAnalyzer();

    // 2. Create a Directory object and declare the location of the index library
    Directory directory = FSDirectory.open(Paths.get("E:\\dir"));

    // 3. Create an IndexWriteConfig object and write the configuration required by the index
    IndexWriterConfig config = new IndexWriterConfig(analyzer);

    // 4. Create IndexWriter write object
    IndexWriter indexWriter = new IndexWriter(directory, config);

    // 5. Write to the index library and add the document object document through IndexWriter
    Document doc = new Document();
    doc.add(new TextField("name", "vivo X23 8GB+128GB Magic night blue", Field.Store.YES));
    indexWriter.addDocument(doc);

    // 6. Release resources
    indexWriter.close();
}

result:

[figure: image/025.png]

7.3.3. SimpleAnalyzer

**Features:**

Removes everything except letters and converts all letters to lowercase. Note that this analyzer also removes digits, and it does not support Chinese.

Test:

@Test
public void TestSimpleAnalyzer() throws Exception{
    // 1. Create a word splitter, analyze the document, and segment the document
    Analyzer analyzer = new SimpleAnalyzer();

    // 2. Create a Directory object and declare the location of the index library
    Directory directory = FSDirectory.open(Paths.get("E:\\dir"));

    // 3. Create an IndexWriteConfig object and write the configuration required by the index
    IndexWriterConfig config = new IndexWriterConfig(analyzer);

    // 4. Create IndexWriter write object
    IndexWriter indexWriter = new IndexWriter(directory, config);

    // 5. Write to the index library and add the document object document through IndexWriter
    Document doc = new Document();
    doc.add(new TextField("name", "vivo,X23.  8GB+128GB; Magic night blue", Field.Store.YES));
    indexWriter.addDocument(doc);

    // 6. Release resources
    indexWriter.close();
}

result:

[figure: image/026.png]

7.3.4. CJKAnalyzer

**Features:**

CJK stands for Chinese, Japanese, and Korean, the three scripts this analyzer supports. For Chinese it performs bigram (two-character) segmentation and removes spaces and punctuation. Personally, I feel its Chinese support is still quite poor.

code:

@Test
public void TestCJKAnalyzer() throws Exception{
    // 1. Create a word splitter, analyze the document, and segment the document
    Analyzer analyzer = new CJKAnalyzer();

    // 2. Create a Directory object and declare the location of the index library
    Directory directory = FSDirectory.open(Paths.get("E:\\dir"));

    // 3. Create an IndexWriteConfig object and write the configuration required by the index
    IndexWriterConfig config = new IndexWriterConfig(analyzer);

    // 4. Create IndexWriter write object
    IndexWriter indexWriter = new IndexWriter(directory, config);

    // 5. Write to the index library and add the document object document through IndexWriter
    Document doc = new Document();
    doc.add(new TextField("name", "vivo,X23.  8GB+128GB; Magic night blue", Field.Store.YES));
    indexWriter.addDocument(doc);

    // 6. Release resources
    indexWriter.close();
}

result:

[figure: image/027.png]

7.3.5. SmartChineseAnalyzer

**Features:**

Its Chinese support is mediocre and its extensibility is poor: expanding the vocabulary or configuring stop-word and custom dictionaries is difficult.

code:

@Test
public void TestSmartChineseAnalyzer() throws Exception{
    // 1. Create a word splitter, analyze the document, and segment the document
    Analyzer analyzer = new SmartChineseAnalyzer();

    // 2. Create a Directory object and declare the location of the index library
    Directory directory = FSDirectory.open(Paths.get("E:\\dir"));

    // 3. Create an IndexWriteConfig object and write the configuration required by the index
    IndexWriterConfig config = new IndexWriterConfig(analyzer);

    // 4. Create IndexWriter write object
    IndexWriter indexWriter = new IndexWriter(directory, config);

    // 5. Write to the index library and add the document object document through IndexWriter
    Document doc = new Document();
    doc.add(new TextField("name", "vivo,X23.  8GB+128GB; Magic night blue", Field.Store.YES));
    indexWriter.addDocument(doc);

    // 6. Release resources
    indexWriter.close();
}

result:

[figure: image/028.png]

7.4. Third party Chinese word splitter

7.4.1. What is a Chinese word splitter

Anyone who has studied English knows that English is word-based: words are separated by spaces, commas, and periods. So for English, a program can simply split on spaces to identify words; in "I love China", "love" and "China" are easy for a program to pick out.

Chinese is character-based: characters form words, and words form sentences. The Chinese for "I love China" is different: a computer cannot tell whether "China" is one word or whether "love China" is.

Cutting a Chinese sentence into meaningful words is Chinese word segmentation, also called word cutting. Segmenting "I love China" yields: I, love, China.

7.4.2. Introduction to the third-party Chinese word splitter

  • Paoding: hosted at https://code.google.com/p/paoding/ ; supports at most Lucene 3.0, and the latest code was committed on 2008-06-03 (the most recent svn commits date from 2010). It is outdated and not considered here.

  • mmseg4j: the latest version has moved from https://code.google.com/p/mmseg4j/ to https://github.com/chenlb/mmseg4j-solr ; supports Lucene 4.10, with the latest github commits from June 2014. From 2009 to 2014 it released 18 versions in total, roughly three per year, showing high activity. It uses the mmseg algorithm.

  • IK Analyzer: the latest version is at https://code.google.com/p/ik-analyzer/ and supports Lucene 4.10. Since version 1.0 launched in December 2006, IKAnalyzer has gone through four major versions. It began as a Chinese word segmentation component for Lucene, combining dictionary-based segmentation with grammar analysis. From version 3.0 on, IK became a general-purpose Java segmentation component independent of Lucene, while still providing a default optimized Lucene integration. The 2012 version implemented a simple disambiguation algorithm, marking IK's evolution from plain dictionary segmentation toward simulated semantic segmentation. It has not been updated since December 2012.

  • ansj_seg: the latest version is at https://github.com/NLPchina/ansj_seg ; its tags only reach version 1.1, updated six times from 2012 to 2014. On 2014-10-10 the author stated that he "may not have the energy to maintain ansj_seg in the future"; it is now managed by NLPchina and was updated in November 2014. It does not state whether Lucene is supported. Its segmentation is based on the CRF (conditional random field) algorithm.

  • imdict-chinese-analyzer: the latest version is at https://code.google.com/p/imdict-chinese-analyzer/ , last updated in May 2009. Its source code does not support Lucene 4.10. It uses the HMM (hidden Markov model) algorithm.

  • Jcseg: the latest version is at git.oschina.net/lionsoul/jcseg and supports Lucene 4.10; the author is very active. It uses the mmseg algorithm.

7.4.3. Using the Chinese word splitter IKAnalyzer

IKAnalyzer extends Lucene's Analyzer abstract class and is used the same way as any Lucene analyzer: change the Analyzer in the test code to IKAnalyzer to see the Chinese segmentation effect.

If you use the Chinese analyzer IKAnalyzer, the same analyzer must be used in both the indexing program and the search program.

  1. Add the dependency to pom.xml:
<dependency>
    <groupId>org.wltea.ik-analyzer</groupId>
    <artifactId>ik-analyzer</artifactId>
    <version>8.1.0</version>
</dependency>
  2. Add the configuration file:

    [figure: image/029.png]

  3. Test code:

@Test
public void TestIKAnalyzer() throws Exception{
    // 1. Create a word splitter, analyze the document, and segment the document
    Analyzer analyzer = new IKAnalyzer();

    // 2. Create a Directory object and declare the location of the index library
    Directory directory = FSDirectory.open(Paths.get("E:\\dir"));

    // 3. Create an IndexWriteConfig object and write the configuration required by the index
    IndexWriterConfig config = new IndexWriterConfig(analyzer);

    // 4. Create IndexWriter write object
    IndexWriter indexWriter = new IndexWriter(directory, config);

    // 5. Write to the index library and add the document object document through IndexWriter
    Document doc = new Document();
    doc.add(new TextField("name", "vivo X23 8GB+128GB Magic night blue,Water drop screen,Game phone.China Mobile Unicom Telecom all Netcom 4 G mobile phone", Field.Store.YES));
    indexWriter.addDocument(doc);

    // 6. Release resources
    indexWriter.close();

}
  4. Test result:

    [figure: image/030.png]

7.4.4. Expand Chinese Thesaurus

If you want to configure extension words and stop words, create extension word files and stop word files.

Copy the configuration file from the ikanalyzer package

[figure: image/031.png]

Copy them into the resources folder:

[figure: image/029.png]

IKAnalyzer.cfg.xml configuration file

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">  
<properties>  
	<comment>IK Analyzer Extended configuration</comment>
	<!--Users can configure their own extended dictionary here 	-->
	<entry key="ext_dict">ext.dic;</entry> 

	<!--Users can configure their own extended stop word dictionary here-->
	<entry key="ext_stopwords">stopword.dic;</entry> 
	
</properties>


**Stop-word dictionary (stopword.dic):**

Words listed in the stop-word dictionary, such as a, an, the, 的, etc., are filtered out during segmentation.

**Extension dictionary (ext.dic):**

Holds extra words: proper nouns such as Chuanzhi Podcast, Dark Horse Programmer, or Guizhou Maotai. In Chinese, many company names, industry terms, categories, and brands are proper nouns rather than ordinary dictionary words; analyzers do not recognize them by default, so they are added to the extension dictionary, after which each is forcibly kept together as a single word.
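
For illustration, ext.dic is a plain UTF-8 text file with one term per line; hypothetical contents covering the examples above might be:

传智播客
黑马程序员
贵州茅台

stopword.dic has the same one-entry-per-line format.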

8. Lucene advanced search

8.1. Text search

QueryParser supports a default search Field; the first constructor parameter is that default Field.

When the parse method executes, if the query syntax names a Field, that Field is searched; if only keywords are given, the default search Field is used.

**Requirement:** find documents whose name contains the keywords "Huawei mobile phone".

Test code:

@Test
public void testIndexSearch() throws Exception {
    // 1. Create Query search object
    // Create word breaker
    Analyzer analyzer = new IKAnalyzer();
    // Create a search parser. The first parameter is the default Field field, and the second parameter is the word splitter
    QueryParser queryParser = new QueryParser("brandName", analyzer);

    // Create search object
    Query query = queryParser.parse("name:Huawei Mobile");

    // 2. Create a Directory stream object and declare the location of the index library
    Directory directory = FSDirectory.open(Paths.get("E:\\dir"));

    // 3. Create index reading object IndexReader
    IndexReader reader = DirectoryReader.open(directory);

    // 4. Create index search object
    IndexSearcher searcher = new IndexSearcher(reader);

    // 5. Use the index to search the object, execute the search, and return the result set TopDocs
    // The first parameter: search object, and the second parameter: the number of data returned, specifying the top n data returned in the query result
    TopDocs topDocs = searcher.search(query, 50);
    System.out.println("The total number of data queried is:" + topDocs.totalHits);
    // Get query result set
    ScoreDoc[] docs = topDocs.scoreDocs;

    // 6. Parse result set
    for (ScoreDoc scoreDoc : docs) {
        // Get document
        int docID = scoreDoc.doc;
        Document doc = searcher.doc(docID);

        System.out.println("=============================");
        System.out.println("docID:" + docID);
        System.out.println("id:" + doc.get("id"));
        System.out.println("name:" + doc.get("name"));
        System.out.println("price:" + doc.get("price"));
        System.out.println("brandName:" + doc.get("brandName"));
        System.out.println("image:" + doc.get("image"));
    }
    // 7. Release resources
    reader.close();
}

8.2. Numerical range search

**Requirement:** query goods with a price greater than or equal to 100 and less than or equal to 1000

Test code:

@Test
public void testNumberSearch() throws Exception {
    // 1. Create Query search object
    Query query = FloatPoint.newRangeQuery("price", 100, 1000);

    // 2. Create a Directory stream object and declare the location of the index library
    Directory directory = FSDirectory.open(Paths.get("E:\\dir"));

    // 3. Create index reading object IndexReader
    IndexReader reader = DirectoryReader.open(directory);

    // 4. Create index search object
    IndexSearcher searcher = new IndexSearcher(reader);

    // 5. Use the index to search the object, execute the search, and return the result set TopDocs
    // The first parameter: search object, and the second parameter: the number of data returned, specifying the top n data returned in the query result
    TopDocs topDocs = searcher.search(query, 10);
    System.out.println("The total number of data queried is:" + topDocs.totalHits);
    // Get query result set
    ScoreDoc[] docs = topDocs.scoreDocs;

    // 6. Parse result set
    for (ScoreDoc scoreDoc : docs) {
        // Get document
        int docID = scoreDoc.doc;
        Document doc = searcher.doc(docID);

        System.out.println("=============================");
        System.out.println("docID:" + docID);
        System.out.println("id:" + doc.get("id"));
        System.out.println("name:" + doc.get("name"));
        System.out.println("price:" + doc.get("price"));
        System.out.println("brandName:" + doc.get("brandName"));
        System.out.println("image:" + doc.get("image"));
    }
    // 7. Release resources
    reader.close();
}

8.3. Combined search

**Requirement:** query goods whose price is between 100 and 1000 inclusive and whose name does not contain the keywords "Huawei mobile phone"

BooleanClause.Occur.MUST — the clause must match (logical AND)

BooleanClause.Occur.MUST_NOT — the clause must not match (logical NOT)

BooleanClause.Occur.SHOULD — the clause should match (logical OR)

**Note:** a boolean query whose only clause is MUST_NOT, or whose clauses are all MUST_NOT, is invalid and returns no results; see the sketch below for a workaround.
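A common workaround (a sketch, not part of the original requirement) is to pair the MUST_NOT clause with a MatchAllDocsQuery, so the query has at least one positive clause:

Analyzer analyzer = new IKAnalyzer();
QueryParser queryParser = new QueryParser("name", analyzer);

BooleanQuery.Builder builder = new BooleanQuery.Builder();
// A positive clause that matches every document...
builder.add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST);
// ...from which the unwanted documents are then subtracted.
builder.add(queryParser.parse("Huawei Mobile"), BooleanClause.Occur.MUST_NOT);
Query query = builder.build();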

@Test
public void testBooleanSearch() throws Exception {
    // Create word breaker
    Analyzer analyzer = new IKAnalyzer();
    // Create value range search object
    Query query1 = FloatPoint.newRangeQuery("price", 100, 1000);

    // Create text search object
    QueryParser queryParser = new QueryParser("name", analyzer);
    // Create search object
    Query query2 = queryParser.parse("Huawei Mobile");

    //Create composite search object
    BooleanQuery.Builder builder = new BooleanQuery.Builder();
    builder.add(new BooleanClause(query1, BooleanClause.Occur.MUST));
    builder.add(new BooleanClause(query2, BooleanClause.Occur.MUST_NOT));

    // 2. Create a Directory stream object and declare the location of the index library
    Directory directory = FSDirectory.open(Paths.get("E:\\dir"));

    // 3. Create index reading object IndexReader
    IndexReader reader = DirectoryReader.open(directory);

    // 4. Create index search object
    IndexSearcher searcher = new IndexSearcher(reader);

    // 5. Use the index to search the object, execute the search, and return the result set TopDocs
    // The first parameter: search object, and the second parameter: the number of data returned, specifying the top n data returned in the query result
    TopDocs topDocs = searcher.search(builder.build(), 10);
    System.out.println("The total number of data queried is:" + topDocs.totalHits);
    // Get query result set
    ScoreDoc[] docs = topDocs.scoreDocs;

    // 6. Parse result set
    for (ScoreDoc scoreDoc : docs) {
        // Get document
        int docID = scoreDoc.doc;
        Document doc = searcher.doc(docID);

        System.out.println("=============================");
        System.out.println("docID:" + docID);
        System.out.println("id:" + doc.get("id"));
        System.out.println("name:" + doc.get("name"));
        System.out.println("price:" + doc.get("price"));
        System.out.println("brandName:" + doc.get("brandName"));
        System.out.println("image:" + doc.get("image"));
    }
    // 7. Release resources
    reader.close();
}

9. Search case

Finished product effect:

(image: image2.png)

9.1. Introduce dependency

Add the following dependencies to the project's pom.xml:

<properties>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
    <skipTests>true</skipTests>
</properties>

<parent>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-parent</artifactId>
    <version>2.1.4.RELEASE</version>
</parent>

<dependencies>
    <dependency>
        <groupId>commons-io</groupId>
        <artifactId>commons-io</artifactId>
        <version>2.6</version>
    </dependency>
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-core</artifactId>
        <version>7.7.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-analyzers-common</artifactId>
        <version>7.7.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-queryparser</artifactId>
        <version>7.7.2</version>
    </dependency>

    <!-- test -->
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
        <scope>test</scope>
    </dependency>
    <!-- mysql Database driven -->
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>5.1.48</version>
    </dependency>

    <!-- IK Chinese word splitter -->
    <dependency>
        <groupId>org.wltea.ik-analyzer</groupId>
        <artifactId>ik-analyzer</artifactId>
        <version>8.1.0</version>
    </dependency>

    <!--web Start dependence-->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <!-- introduce thymeleaf -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-thymeleaf</artifactId>
    </dependency>
    <!-- Json Conversion tool -->
    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>1.2.51</version>
    </dependency>
</dependencies>

9.2. Project add pages and resources

Copy the pages and static resources from Lucene course materials\resources into the project's resources directory:

(image: image3.png)

9.3. Create package and startup classes

Create a directory and add a startup class:

(image: image4.png)

Startup class code:

@SpringBootApplication
public class LuceneApplication {
    public static void main(String[] args) {
        SpringApplication.run(LuceneApplication.class, args);
    }
}

9.4. Configuration file

Create application.yml in the project's resources directory with the following contents:

spring:
  thymeleaf:
    cache: false

9.5. Business code

9.5.1. Encapsulate pojo

Add the ResultModel entity class under the pojo package:

public class ResultModel {

	// Product list
	private List<Sku> skuList;
	// Total number of goods
	private Long recordCount;
	// PageCount 
	private Long pageCount;
	// Current page
	private long curPage;
    
	// getters and setters omitted
}

9.5.2. controller code

@Controller
@RequestMapping("/list")
public class SearchController {

    @Autowired
    private SearchService searchService;

    @RequestMapping
    public String list(String queryString, String price, Integer page, Model model) throws Exception {
        if (page == null || page <= 0) {
            page = 1;
        }

        ResultModel resultModel = searchService.search(queryString, price, page);

        model.addAttribute("result", resultModel);
        model.addAttribute("queryString", queryString);
        model.addAttribute("price", price);
        model.addAttribute("page", page);
        return "search";
    }
}

9.5.3. service code

service interface:

public interface SearchService {

    /**
     * Full text search based on keywords
     *
     * @param queryString   Query keyword
     * @param price         Price filter conditions
     * @param page          Current page
     */
    public ResultModel search(String queryString, String price, Integer page) throws Exception;
}

service implementation class:

@Service
public class SearchServiceImpl implements SearchService {

    //Set to query 20 pieces of data per page
    public final static Integer PAGE_SIZE = 20;

    @Override
    public ResultModel search(String queryString, String price, Integer page) throws Exception{
        /**
         * 1. Object encapsulation required
         */
        ResultModel resultModel = new ResultModel();
        List<Sku> skuList = new ArrayList<>();
        // Index of the first result on this page
        Integer start = (page - 1) * PAGE_SIZE;
        // Index one past the last result on this page
        Integer end = page * PAGE_SIZE;
        
        // Create word breaker
        Analyzer analyzer = new IKAnalyzer();
        // Create composite search object
        BooleanQuery.Builder builder = new BooleanQuery.Builder();

        /**
         * 2. Encapsulate according to keyword search criteria
         */
        // Create text search object
        QueryParser queryParser = new QueryParser("name", analyzer);
        Query query1 = null;
        if (StringUtils.isEmpty(queryString)) {
            // If the query keyword is empty, query all
            query1 = queryParser.parse("*:*");
        } else {
            // Set according to keyword query criteria
            query1 = queryParser.parse(queryString);
        }
        //Add text search object to combined query object
        builder.add(new BooleanClause(query1, BooleanClause.Occur.MUST));

        /**
         * 3. Filter queries by number range
         */
        if (!StringUtils.isEmpty(price)) {
            //Search the price and cut out the minimum and maximum values
            String[] split = price.split("-");
            // Create value range search object
            Query query2 = FloatPoint.newRangeQuery("price", Float.parseFloat(split[0]), Float.parseFloat(split[1]));
            builder.add(new BooleanClause(query2, BooleanClause.Occur.MUST));
        }

        // 4. Create a Directory stream object and declare the location of the index library
        Directory directory = FSDirectory.open(Paths.get("E:\\dir"));

        // 5. Create index reading object IndexReader
        IndexReader reader = DirectoryReader.open(directory);

        // 6. Create index search object
        IndexSearcher searcher = new IndexSearcher(reader);

        // 7. Use the index to search the object, execute the search, and return the result set TopDocs
        // The first parameter: search object, and the second parameter: the number of data returned, specifying the top n data returned in the query result
        TopDocs topDocs = searcher.search(builder.build(), end);

        System.out.println("The total number of data queried is:" + topDocs.totalHits);
        //8. Get the total number of queries
        resultModel.setRecordCount(topDocs.totalHits);

        //9. Get query result set
        ScoreDoc[] docs = topDocs.scoreDocs;

        //10. Parse result set
        for (int i = start; i < Math.min(end, docs.length); i++) {
            // Read the document (guard against pages past the end of the results)
            Document doc = reader.document(docs[i].doc);
            //Encapsulate the query result data
            Sku sku = new Sku();
            sku.setId(doc.get("id"));
            sku.setPrice((int) Float.parseFloat(doc.get("price")));  // price is stored as a float
            sku.setImage(doc.get("image"));
            sku.setName(doc.get("name"));
            sku.setBrandName(doc.get("brandName"));
            sku.setCategoryName(doc.get("categoryName"));
            skuList.add(sku);
        }

        /**
         * 11. Encapsulates the returned result set
         */
        // Result set
        resultModel.setSkuList(skuList);
        //Current page
        resultModel.setCurPage(page);
        //PageCount 
        Long pageCount = topDocs.totalHits % PAGE_SIZE > 0 ? (topDocs.totalHits / PAGE_SIZE) + 1 : topDocs.totalHits / PAGE_SIZE;
        resultModel.setPageCount(pageCount);
       
        //12. Release resources
        reader.close();
        return resultModel;
    }
}

10. Lucene underlying storage structure (Advanced)

10.1. Detailed understanding of lucene storage structure

**Storage structure:**

(image: image6.png)

**Index:**

  • In Lucene, one index corresponds to one directory: the whole index is stored in a single folder.

**Segment:**

  • An index (a logical index) consists of multiple segments; segments can be merged to reduce disk IO when reading.
  • Writes in Lucene first go to an in-memory buffer. When the buffer holds enough data, it is flushed to a new segment. Each segment has its own independent index and can be queried on its own, but its data is never changed in place. This design avoids random writes: data is added in batches, which gives high throughput. A document written to a segment cannot be modified, but it can be deleted; deletion does not rewrite the segment file but records the DocID of the deleted document in a separate file, so the data files stay immutable. A query over the index must search every segment, merge the results, and filter out deleted documents. To keep queries fast, Lucene merges segments according to a merge policy.

**Document:**

  • Document is the basic unit of indexing. Different documents are saved in different segments. A segment can contain multiple documents.
  • The newly added documents are saved separately in a newly generated segment. With the merging of segments, different documents will be merged into the same segment.

**Field:**

  • A document contains different types of information and can be indexed separately, such as title, time, text, description, etc., which can be saved in different fields.
  • Different domains can be indexed differently.

**Term:**

  • A term is the smallest unit of the index: a string produced by lexical analysis and language processing.

10.2. Index library physical file

(image: image5.png)

10.3. Index library file extension cross reference table

| Name | File extension | Short description |
|---|---|---|
| Segments File | segments_N | Stores information about a commit point |
| Lock File | write.lock | Prevents multiple IndexWriters from writing to the same index at the same time |
| Segment Info | .si | Stores metadata about an index segment |
| Compound File | .cfs, .cfe | An optional "virtual" file that packs all the index data into compound files |
| Fields | .fnm | Stores information about the fields |
| Field Index | .fdx | Pointers into the field data |
| Field Data | .fdt | The stored field values of the documents |
| Term Dictionary | .tim | The term dictionary, storing term information |
| Term Index | .tip | Index into the term dictionary |
| Frequencies | .doc | The list of documents containing each term, with frequencies |
| Positions | .pos | Position information about where terms occur in the index |
| Payloads | .pay | Additional per-position metadata, such as character offsets and user payloads |
| Norms | .nvd, .nvm | .nvm stores metadata about the index-field weighting factors; .nvd stores the weighting data |
| Per-Document Values | .dvd, .dvm | .dvm stores metadata about document scoring factors; .dvd stores the scoring data |
| Term Vector Index | .tvx | Offsets into the term-vector document data file |
| Term Vector Documents | .tvd | Per-document information about term vectors |
| Term Vector Fields | .tvf | Field-level information about term vectors |
| Live Documents | .liv | Information about which documents are live (not deleted) |
| Point Values | .dii, .dim | Holds indexed points, if any |

10.4. Construction of dictionary

Lucene searches quickly over large data sets for two reasons:

  • the inverted-index storage structure at the bottom, and
  • the index structure of the term dictionary, which makes keyword lookup fast.

10.4.1. Dictionary data structure comparison

The dictionary of an inverted index lives in memory, so its structure matters a great deal. There are many possible dictionary structures, each with its own trade-offs. The simplest is a sorted array searched by binary search; a hash table is faster; for disk-based lookup there are B-trees and B+ trees. An inverted-index structure that has to support terabytes of data, however, needs a balance between time and space. The table below lists the advantages and disadvantages of some common dictionary structures:

| Data structure | Advantages and disadvantages |
|---|---|
| Skip list | Small, tunable memory footprint, but poor support for fuzzy queries |
| Sorted array / list | Lookup by binary search; hard to keep balanced under updates |
| Trie (dictionary tree) | Query cost depends on string length; really only suits English dictionaries |
| Hash table | High performance, but large memory use, roughly three times the raw data |
| Double-array trie | Small memory footprint, well suited to Chinese dictionaries; used by many word-segmentation tools |
| Finite State Transducer (FST) | A finite state machine; Lucene 4 ships an open-source implementation that is widely used |
| B-tree | Disk-based index that is easy to update but slower to search; mostly used in databases |

Before version 3.0, Lucene used a skip-list structure for the dictionary and later switched to FST; skip lists are still used elsewhere in Lucene, for example in posting-list merging and the document-number index.

10.4.2. Skip list principle

The skip-list structure used before Lucene 3.0 was replaced by the FST.

**Advantages:** simple structure; the skip interval and number of levels are controllable. Skip lists are still used elsewhere in Lucene, such as posting-list merging and the document-number index.
**Disadvantages:** poor support for fuzzy queries

Singly linked list:

In a singly linked list, even when the elements are ordered, we cannot use binary search to cut the lookup time; we have to walk the list node by node in order.

For example, finding node 85 below takes 7 comparisons.

(image: image0.png)

Skip list:

In the skip list below, finding node 85 takes only 6 comparisons in total (a code sketch of the lookup follows the figure):

  1. On level 3: 3 comparisons; the next pointer overshoots, so the search drops down at node 37.
  2. On level 2: starting from node 37, 2 comparisons; it drops down again at node 71.
  3. On level 1: starting from node 71, 1 comparison reaches node 85.

(image: image1.png)
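To make the lookup concrete, here is a minimal, self-contained sketch of skip-list search; the node layout and names are illustrative, not Lucene's actual implementation:

// A minimal skip-list search sketch (illustrative, not Lucene's implementation).
class SkipListDemo {

    // A node holds a value plus one forward pointer per level (level 0 = base list).
    static class SkipNode {
        int value;
        SkipNode[] forward;
        SkipNode(int value, int levels) {
            this.value = value;
            this.forward = new SkipNode[levels];
        }
    }

    // Start at the top level; advance while the next node is smaller than the
    // target, and drop one level whenever the next step would overshoot.
    static boolean contains(SkipNode head, int levels, int target) {
        SkipNode cur = head;  // head is a sentinel that precedes all values
        for (int level = levels - 1; level >= 0; level--) {
            while (cur.forward[level] != null && cur.forward[level].value < target) {
                cur = cur.forward[level];  // move right on this level
            }
            // next node on this level is null or >= target: drop down one level
        }
        SkipNode next = cur.forward[0];
        return next != null && next.value == target;
    }
}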

10.4.3. FST principle in brief

The data structure Lucene's dictionary actually uses is the FST. Its characteristics:
**Advantages:** low memory usage, with compression ratios generally between 3x and 20x; good fuzzy-query support; fast lookup
**Disadvantages:** complex structure; the input must be sorted; hard to update

Since the FST requires sorted input, Lucene sorts the parsed document terms in advance and then builds the FST. Assuming the inputs are abd, abe, acf and acg, the whole construction process is as follows:

(image: image8.png)

(image: image9.png)

input data:

String inputValues[] = {"hei","ma","cheng","xu","yuan","good"};
long outputValues[] = {0,1,2,3,4,5}; 

The data entered are as follows:

hei/0
ma/1
cheng/2
xu/3
yuan/4
good/5

The storage results are as follows:

(image: image7.png)

11. Lucene optimization (Advanced)

11.1. Reducing heavy disk IO

  • config.setMaxBufferedDocs(100000); controls how many documents are buffered in memory before a new segment is written. A larger number speeds up indexing.

    The higher the value, the faster the indexing, but the more memory is consumed.

  • indexWriter.forceMerge(maxNumSegments); merges the index down to at most N segments (note that the parameter is a maximum segment count, not a document count).

    The larger the value, the faster indexing and the slower searching; the smaller the value, the slower indexing and the faster searching.

    A higher value means lower merge overhead during indexing, but slower searches, because the index typically contains more segments. A lower value yields faster searches at the cost of more merge work, which Lucene performs concurrently while indexing. It is best to benchmark both high and low values and let the actual performance of your machine decide the optimum.

Create index code optimization test:

@Test
public void createIndexTest() throws Exception {
    // 1. Collect data
    SkuDao skuDao = new SkuDaoImpl();
    List<Sku> skuList = skuDao.querySkuList();

    // 2. Create Document object
    List<Document> documents = new ArrayList<Document>();
    for (Sku sku : skuList) {
        Document document = new Document();

        // Add Field field Field to Document
        // Commodity Id, no word segmentation, index, storage
        document.add(new StringField("id", sku.getId(), Field.Store.YES));
        // Commodity name, word segmentation, index, storage
        document.add(new TextField("name", sku.getName(), Field.Store.YES));

        // Commodity price, word segmentation, index, no storage, no sorting
        document.add(new FloatPoint("price", sku.getPrice()));
        //Add price storage support
        document.add(new StoredField("price", sku.getPrice()));
        //Add price sorting support
        //document.add(new NumericDocValuesField("price",sku.getPrice()));


        // Brand name, no word segmentation, index, storage
        document.add(new StringField("brandName", sku.getBrandName(), Field.Store.YES));
        // Classification name, no word segmentation, index, storage
        document.add(new StringField("categoryName", sku.getCategoryName(), Field.Store.YES));
        // Picture address, no word segmentation, no index, storage
        document.add(new StoredField("image", sku.getImage()));

        // Put the Document in the list
        documents.add(document);
    }

    long startTime = System.currentTimeMillis();

    // 3. Create Analyzer word splitter, analyze documents and segment documents
    Analyzer analyzer = new IKAnalyzer();

    // 4. Create a Directory object and declare the location of the index library
    Directory directory = FSDirectory.open(Paths.get("E:\\dir"));

    // 5. Create an IndexWriteConfig object and write the configuration required by the index
    IndexWriterConfig config = new IndexWriterConfig(analyzer);

    //Control how many documents are buffered in memory before a new segment is written. A larger number speeds up indexing.
    config.setMaxBufferedDocs(100000);

    // 6. Create IndexWriter write object
    IndexWriter indexWriter = new IndexWriter(directory, config);

    //Merge the index down to at most 100000 segments (a maximum segment count, not a document count)
    indexWriter.forceMerge(100000);

    // 7. Write to the index library and add the document object document through IndexWriter
    for (Document doc : documents) {
        indexWriter.addDocument(doc);
    }

    // 8. Release resources
    indexWriter.close();
    long endTime = System.currentTimeMillis();

    System.out.println("======Run time is:===" + (endTime - startTime) + "ms");
}

11.2. Choose the right word splitter

Different word splitters produce different segmentations and take different amounts of time.

Although StandardAnalyzer segments text faster than IKAnalyzer, it handles Chinese poorly, so for good segmentation quality and query accuracy IKAnalyzer is the practical choice. IKAnalyzer supports a stop-word dictionary and an extended dictionary, and adjusting their contents improves match accuracy. A quick way to compare word splitters is to print the tokens they produce, as sketched below.
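A minimal sketch for inspecting a word splitter's output (the field name and sample text are arbitrary; swap in StandardAnalyzer to compare):

Analyzer analyzer = new IKAnalyzer();
try (TokenStream tokenStream = analyzer.tokenStream("name", "Huawei mobile phone")) {
    CharTermAttribute termAttr = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset();                          // required before incrementToken()
    while (tokenStream.incrementToken()) {
        System.out.println(termAttr.toString());  // one token per line
    }
    tokenStream.end();
}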

11.3. Choose an appropriate place to store the index library

| Class | Read operation | Write operation | Characteristics |
|---|---|---|---|
| SimpleFSDirectory | java.io.RandomAccessFile | java.io.RandomAccessFile | Simple implementation; poor concurrency |
| NIOFSDirectory | java.nio.FileChannel | FSDirectory.FSIndexOutput | Good concurrency; has a severe bug on the Windows platform |
| MMapDirectory | Memory mapping | FSDirectory.FSIndexOutput | Reads are served from memory |

Test code modification:

Directory directory = MMapDirectory.open(Paths.get("E:\\dir"));

11.4. Choosing the right search API

  1. Prefer TermQuery over QueryParser where possible; see the sketch below.
  2. Avoid date-range queries that span very wide ranges.
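A minimal sketch of the first point: TermQuery looks up one exact term with no parsing or analysis, which suits untokenized StringField values such as brandName (the query values here are illustrative):

// Exact-term lookup: no query-syntax parsing, no analyzer involved.
Query byBrand = new TermQuery(new Term("brandName", "Huawei"));

// QueryParser tokenizes the input and interprets query syntax: more flexible,
// but more work per query.
Query parsed = new QueryParser("name", new IKAnalyzer()).parse("Huawei Mobile");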

12. Lucene relevance ranking (Advanced)

12.1. What is relevance ranking

Lucene scores each index document's relevance to the query keywords; documents with higher scores rank first.

12.1.1. How scoring works

Lucene computes scores in real time from the user's search keywords, in two steps:

  1. Compute the weight of each term
  2. Compute the document's relevance score from those term weights

12.1.2. What is term weight

As explained earlier, the smallest unit of the index is the term (a word in the index dictionary). A search starts from terms and locates documents through them. The importance of a term to a document is called its weight, and two factors affect it:

  • Term Frequency (tf):
    How many times the term appears in this document. The larger the tf, the more important the term is to the document. For example, if the word "Lucene" appears many times in a document, the document is probably about Lucene.
  • Document Frequency (df):
    How many documents contain the term. The larger the df, the less important the term.
    For example, in English documents, the more documents contain the word "this", the less useful "this" is for telling documents apart, so the less important it is.
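For reference, Lucene's classic TF-IDF similarity (the default scorer before BM25) combines these two factors roughly as follows:

tf(t, d)   = sqrt( frequency of t in d )
idf(t)     = 1 + ln( numDocs / (docFreq(t) + 1) )
score(q,d) ≈ sum over query terms t of tf(t, d) * idf(t)^2, times boosts and norms

so a term contributes more the more often it appears in the document (tf) and the rarer it is across the whole index (idf).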

12.1.3. How to influence relevance ranking

boost is a weighting factor (1.0f by default) that feeds into the weight calculation.

  • At index time, give a field in a document a higher boost, and that document is more likely to rank near the front when searched.
  • At query time, boost a particular field; in a multi-field query, matches in the highly boosted field produce a higher final relevance score.

boost is set on a field or on a query. Note that index-time field boosts were removed in Lucene 7, so with the version used in this chapter boosting is applied at query time, as sketched below.
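A query-time boosting sketch using BoostQuery (the field names and the factor 10 are arbitrary examples):

// Matches on "name" count ten times as much as matches on "brandName".
Query onName  = new TermQuery(new Term("name", "Huawei"));
Query onBrand = new TermQuery(new Term("brandName", "Huawei"));

Query boosted = new BooleanQuery.Builder()
        .add(new BoostQuery(onName, 10f), BooleanClause.Occur.SHOULD)
        .add(onBrand, BooleanClause.Occur.SHOULD)
        .build();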

12.2. Influencing relevance ranking manually

At query time, you can influence the results by setting per-field boost weights:

@Test
public void testIndexSearch() throws Exception {

    long startTime = System.currentTimeMillis();

    // 1. Create Query search object
    // Create word breaker
    Analyzer analyzer = new IKAnalyzer();

    //Query domain name
    String[] fields = {"name","brandName","categoryName"};
    //Set field boosts
    Map<String, Float> boosts = new HashMap<>();
    boosts.put("categoryName", 10000000f);
    // Search across multiple fields
    MultiFieldQueryParser queryParser = new MultiFieldQueryParser(fields, analyzer, boosts);
    // Create search object
    Query query = queryParser.parse("mobile phone");

    // 2. Create a Directory stream object and declare the location of the index library
    Directory directory = MMapDirectory.open(Paths.get("E:\\dir"));

    // 3. Create index reading object IndexReader
    IndexReader reader = DirectoryReader.open(directory);

    // 4. Create index search object
    IndexSearcher searcher = new IndexSearcher(reader);

    // 5. Use the index to search the object, execute the search, and return the result set TopDocs
    // The first parameter: search object, and the second parameter: the number of data returned, specifying the top n data returned in the query result
    TopDocs topDocs = searcher.search(query, 50);
    System.out.println("The total number of data queried is:" + topDocs.totalHits);
    // Get query result set
    ScoreDoc[] docs = topDocs.scoreDocs;

    // 6. Parse result set
    for (ScoreDoc scoreDoc : docs) {
        // Get document
        int docID = scoreDoc.doc;
        Document doc = searcher.doc(docID);

        System.out.println("=============================");
        System.out.println("docID:" + docID);
        System.out.println("id:" + doc.get("id"));
        System.out.println("name:" + doc.get("name"));
        System.out.println("price:" + doc.get("price"));
        System.out.println("brandName:" + doc.get("brandName"));
        System.out.println("image:" + doc.get("image"));
    }
    // 7. Release resources
    reader.close();

    long endTime = System.currentTimeMillis();
    System.out.println("==========Time consuming:============" + (startTime - endTime) + "ms");
}

13. Precautions for Lucene use (Advanced)

  • Keywords are case sensitive
    Keywords such as OR, AND and TO are case sensitive; Lucene treats their lowercase forms as ordinary search terms.
  • Read/write mutual exclusion
    Only one write operation may run against an index at a time, but searches can run concurrently with writing.
  • File lock
    If the process is forced to exit while writing the index, a lock file is left in the tmp directory and blocks future writes until it is deleted manually (see the sketch after this list for a defensive pattern).
  • Time format
    Lucene supports only the yyMMddHHmmss time format; a value formatted as yy-MM-dd HH:mm:ss is not treated as a time.
  • Setting boost
    Sometimes a field should carry more weight in a search. For example, articles with the keyword in the title may be more valuable than those with it only in the body; give the title field a higher boost and title matches will rank first.
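A defensive pattern for the file-lock issue (a sketch; directory and analyzer as in earlier examples): opening the IndexWriter in a try-with-resources block releases write.lock even when indexing fails part-way.

Directory directory = FSDirectory.open(Paths.get("E:\\dir"));
IndexWriterConfig config = new IndexWriterConfig(new IKAnalyzer());

// The writer (and its lock) is closed automatically, even on exceptions.
try (IndexWriter indexWriter = new IndexWriter(directory, config)) {
    Document doc = new Document();
    doc.add(new TextField("name", "sample text", Field.Store.YES));
    indexWriter.addDocument(doc);
}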
