Lucene core code analysis report 13

2021SC@SDUSC

Close the IndexWriter object

code:

writer.close(); 
--> IndexWriter.closeInternal(boolean) 
 --> (1) Writes index information from memory to disk: flush(waitForMerges, true, true); 
 --> (2) Merge segments: mergeScheduler.merge(this); 
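For context, here is a minimal usage sketch of the close() path against the Lucene 3.x era API this series analyzes; the index path and field values are hypothetical:

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class CloseDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("/tmp/demo-index")); //hypothetical path
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_30),
                true, IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("title", "hello lucene", Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        //close() flushes the in-memory index to disk and triggers merges,
        //following the call chain shown above
        writer.close();
    }
}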

The merging of segments will be discussed in a later chapter. Here we only discuss the process of writing index information from memory to disk.
code:

IndexWriter.flush(boolean triggerMerge, boolean flushDocStores, boolean flushDeletes) 
--> IndexWriter.doFlush(boolean flushDocStores, boolean flushDeletes) 
 --> IndexWriter.doFlushInternal(boolean flushDocStores, boolean flushDeletes) 

Writing an index to disk involves the following steps (see the commented outline after this list):

1. Get the name of the segment to write: String segment = docWriter.getSegment();
2. DocumentsWriter writes the cached information to the segment: docWriter.flush(flushDocStores);
3. Generate a new segment information object: newSegment = new SegmentInfo(segment, flushedDocCount, directory, false, true, docStoreOffset, docStoreSegment, docStoreIsCompoundFile, docWriter.hasProx());
4. Prepare to delete documents: docWriter.pushDeletes();
5. Generate the cfs compound file: docWriter.createCompoundFile(segment);
6. Delete documents: applyDeletes();
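Putting these steps together, here is a simplified, commented outline of IndexWriter.doFlushInternal, reconstructed from the calls listed above (locking, error handling and the surrounding conditionals are omitted, so this is a sketch rather than the verbatim source):

//Simplified outline; not the verbatim Lucene source
String segment = docWriter.getSegment();                //1. segment name, e.g. "_0"
int flushedDocCount = docWriter.flush(flushDocStores);  //2. write the cached documents
SegmentInfo newSegment = new SegmentInfo(segment, flushedDocCount,
    directory, false, true, docStoreOffset, docStoreSegment,
    docStoreIsCompoundFile, docWriter.hasProx());       //3. describe the new segment
docWriter.pushDeletes();                                //4. queue the pending deletes
docWriter.createCompoundFile(segment);                  //5. pack the files into a cfs
applyDeletes();                                         //6. apply the deletes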

Get the segment name to write

code:

SegmentInfo newSegment = null; 
final int numDocs = docWriter.getNumDocsInRAM(); //Total number of documents
String docStoreSegment = docWriter.getDocStoreSegment(); //Name of the segment to which stored fields and term vectors are written, "_0"
int docStoreOffset = docWriter.getDocStoreOffset(); //Offset within that segment at which the stored fields and term vectors are written
String segment = docWriter.getSegment(); //Segment name, "_0" 

The chapter on Lucene's index file structure describes in detail that stored fields and term vectors can be kept in a different segment (the doc store) from the indexed fields.
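To make the shared doc store concrete, here is a hedged illustration using the SegmentInfo constructor shown above: two segments whose stored fields and term vectors live in the same doc-store segment "_0", distinguished only by their offsets (all values hypothetical):

//Hypothetical values: segments _0 and _1 share the doc store "_0"
SegmentInfo si0 = new SegmentInfo("_0", 1000, directory, false, true,
        0,      //docStoreOffset: this segment starts at document 0 of the doc store
        "_0",   //docStoreSegment: the shared doc-store segment name
        false,  //docStoreIsCompoundFile
        true);  //hasProx
SegmentInfo si1 = new SegmentInfo("_1", 1000, directory, false, true,
        1000,   //docStoreOffset: this segment starts at document 1000
        "_0", false, true);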

Write cached content to the segment

code:

flushedDocCount = docWriter.flush(flushDocStores); 

This process includes the following two stages:
1. Close the stored fields and term vector streams according to the basic index chain
2. Write the index results to the segment according to the structure of the basic index chain

Close the stored fields and term vector streams according to the basic index chain

The code is:

closeDocStore(); 
flushState.numDocsInStore = 0; 

It mainly closes the stored fields and term vector streams along the basic index chain structure:

consumer(DocFieldProcessor).closeDocStore(flushState);
consumer(DocInverter).closeDocStore(state);
consumer(TermsHash).closeDocStore(state);
consumer(FreqProxTermsWriter).closeDocStore(state);
if (nextTermsHash != null) nextTermsHash.closeDocStore(state);
consumer(TermVectorsTermsWriter).closeDocStore(state);
endConsumer(NormsWriter).closeDocStore(state);
fieldsWriter(StoredFieldsWriter).closeDocStore(state);
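The "basic index chain" is a chain-of-responsibility: each consumer does its own work and then delegates to the consumers below it. A minimal sketch of the pattern (names simplified; these are not the actual Lucene classes, which also carry per-thread and per-field state):

import java.io.IOException;

abstract class ChainConsumer {
    abstract void closeDocStore() throws IOException;
}

class DocInverterSketch extends ChainConsumer {
    private final ChainConsumer consumer;    //plays the role of TermsHash
    private final ChainConsumer endConsumer; //plays the role of NormsWriter

    DocInverterSketch(ChainConsumer consumer, ChainConsumer endConsumer) {
        this.consumer = consumer;
        this.endConsumer = endConsumer;
    }

    @Override
    void closeDocStore() throws IOException {
        consumer.closeDocStore();    //delegate down the chain first
        endConsumer.closeDocStore(); //then notify the end consumer
    }
}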
Of these calls, the following two closeDocStore implementations do the substantive work:

Closing the term vectors:

TermVectorsTermsWriter.closeDocStore(SegmentWriteState) 
void closeDocStore(final SegmentWriteState state) throws IOException { 
 if (tvx != null) { 
 //Write zeros into the tvd file for documents that saved no term vectors; even a document without term vectors gets a placeholder entry in tvx and tvd
 fill(state.numDocsInStore - docWriter.getDocStoreOffset()); 
 //Close the write streams of the tvx, tvf and tvd files 
 tvx.close(); 
 tvf.close(); 
 tvd.close(); 
 tvx = null; 
 //Record the written file names; when the cfs file is generated later, these files will be combined into a single cfs file. 
 state.flushedFiles.add(state.docStoreSegmentName + "." + 
IndexFileNames.VECTORS_INDEX_EXTENSION); 
 state.flushedFiles.add(state.docStoreSegmentName + "." + 
IndexFileNames.VECTORS_FIELDS_EXTENSION); 
 state.flushedFiles.add(state.docStoreSegmentName + "." + 
IndexFileNames.VECTORS_DOCUMENTS_EXTENSION); 
 //Remove them from the DocumentsWriter member openFiles so that they can later be deleted by IndexFileDeleter
 docWriter.removeOpenFile(state.docStoreSegmentName + "." + 
IndexFileNames.VECTORS_INDEX_EXTENSION); 
 docWriter.removeOpenFile(state.docStoreSegmentName + "." + 
IndexFileNames.VECTORS_FIELDS_EXTENSION); 
 docWriter.removeOpenFile(state.docStoreSegmentName + "." + 
IndexFileNames.VECTORS_DOCUMENTS_EXTENSION); 
 lastDocID = 0; 
 } 
} 
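The fill() call above pads the vector files for documents that stored no term vectors. A hedged sketch of that padding, simplified from TermVectorsTermsWriter (the real method's details may differ; tvx, tvd and tvf are assumed to be open streams):

import java.io.IOException;
import org.apache.lucene.store.IndexOutput;

class VectorPaddingSketch {
    IndexOutput tvx, tvd, tvf; //assumed open term-vector output streams
    int lastDocID;             //first document not yet recorded in tvx

    void fill(int docID) throws IOException {
        final long tvfPosition = tvf.getFilePointer();
        while (lastDocID < docID) {
            tvx.writeLong(tvd.getFilePointer()); //tvx slot -> offset in tvd
            tvd.writeVInt(0);                    //zero fields: no vectors for this doc
            tvx.writeLong(tvfPosition);          //tvx slot -> offset in tvf
            lastDocID++;
        }
    }
}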

Closing the stored fields:

StoredFieldsWriter.closeDocStore(SegmentWriteState) 
public void closeDocStore(SegmentWriteState state) throws IOException { 
 //Close the fdx and fdt write streams
 fieldsWriter.close(); 
 --> fieldsStream.close(); 
 --> indexStream.close(); 
 fieldsWriter = null; 
 lastDocID = 0; 
 //Record the file name written 
 state.flushedFiles.add(state.docStoreSegmentName + "." + 
IndexFileNames.FIELDS_EXTENSION); 
 state.flushedFiles.add(state.docStoreSegmentName + "." + 
IndexFileNames.FIELDS_INDEX_EXTENSION); 
 state.docWriter.removeOpenFile(state.docStoreSegmentName + "." + 
IndexFileNames.FIELDS_EXTENSION); 
 state.docWriter.removeOpenFile(state.docStoreSegmentName + "." + 
IndexFileNames.FIELDS_INDEX_EXTENSION); 
} 
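The fdx and fdt files closed here are what IndexReader later serves stored fields from. A minimal Lucene 3.x era sketch of the reading side (index path and field name hypothetical, and the index is assumed to contain at least one document):

import java.io.File;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ReadStoredFields {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("/tmp/demo-index"));
        IndexReader reader = IndexReader.open(dir, true); //read-only
        Document doc = reader.document(0); //fdx locates the record, fdt holds the field data
        System.out.println(doc.get("title"));
        reader.close();
    }
}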

Write the index results to the segment according to the structure of the basic index chain

The code is:

consumer(DocFieldProcessor).flush(threads, flushState); 
 //Reclaim fieldHash for the next round of indexing. In order to improve efficiency, the objects in the index chain are reused.
 Map<DocFieldConsumerPerThread, Collection<DocFieldConsumerPerField>> 
childThreadsAndFields = new HashMap<DocFieldConsumerPerThread, 
Collection<DocFieldConsumerPerField>>(); 
 for ( DocConsumerPerThread thread : threads) { 
 DocFieldProcessorPerThread perThread = (DocFieldProcessorPerThread) thread; 
 childThreadsAndFields.put(perThread.consumer, perThread.fields()); 
 perThread.trimFields(state); 
 } 
 //Write the stored fields
 --> fieldsWriter(StoredFieldsWriter).flush(state); 
 //Write the indexed fields
 --> consumer(DocInverter).flush(childThreadsAndFields, state); 
 //Write the field metadata (fnm) and record the written file name so that the cfs file can be generated later
 --> final String fileName = state.segmentFileName(IndexFileNames.FIELD_INFOS_EXTENSION); 
 --> fieldInfos.write(state.directory, fileName); 
 --> state.flushedFiles.add(fileName); 

This process is also based on the basic index chain:
consumer(DocFieldProcessor).flush(...);
consumer(DocInverter).flush(...);
consumer(TermsHash).flush(...);
consumer(FreqProxTermsWriter).flush(...);
if (nextTermsHash != null) nextTermsHash.flush(...);
consumer(TermVectorsTermsWriter).flush(...);
endConsumer(NormsWriter).flush(...);
fieldsWriter(StoredFieldsWriter).flush(...);

Write the stored fields

The code is:

StoredFieldsWriter.flush(SegmentWriteState state) { 
 if (state.numDocsInStore > 0) { 
 initFieldsWriter(); 
 fill(state.numDocsInStore - docWriter.getDocStoreOffset()); 
 } 
 if (fieldsWriter != null) 
 fieldsWriter.flush(); 
 } 

As the code shows, this is where the fdx and fdt files would be written, but they were already written in closeDocStore above: state.numDocsInStore was set to zero and fieldsWriter to null there, so in fact nothing is done here.

Write the indexed fields

The code is:

DocInverter.flush(Map<DocFieldConsumerPerThread,Collection<DocFieldConsumerPerField>>, 
SegmentWriteState) 
 //Write the postings (inverted index) and term vector information
 --> consumer(TermsHash).flush(childThreadsAndFields, state); 
 //Write the norms (normalization factors)
 --> endConsumer(NormsWriter).flush(endChildThreadsAndFields, state); 
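A norm is one byte per document per field: a lossy encoding of a float. A hedged round-trip sketch using the Lucene 3.x Similarity statics (the example value is arbitrary):

import org.apache.lucene.search.Similarity;

public class NormsDemo {
    public static void main(String[] args) {
        float lengthNorm = 0.25f; //e.g. 1/sqrt(16) for a 16-term field
        byte encoded = Similarity.encodeNorm(lengthNorm); //what NormsWriter stores
        float decoded = Similarity.decodeNorm(encoded);   //what scoring reads back
        System.out.println(encoded + " -> " + decoded);   //note the precision loss
    }
}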

Write the postings and term vector information

The code is:

TermsHash.flush(Map<InvertedDocConsumerPerThread,Collection<InvertedDocConsumerPerField>>, 
SegmentWriteState) 
 //Write the postings information
 --> consumer(FreqProxTermsWriter).flush(childThreadsAndFields, state); 
 //Recycle the RawPostingList objects 
 --> shrinkFreePostings(threadsAndFields, state); 
 //Write the term vector information
 --> if (nextTermsHash != null) nextTermsHash.flush(nextThreadsAndFields, state); 
 --> consumer(TermVectorsTermsWriter).flush(childThreadsAndFields, state); 

Write the postings information

The code is:

FreqProxTermsWriter.flush(Map<TermsHashConsumerPerThread, 
 Collection<TermsHashConsumerPerField>>, SegmentWriteState) 

(a) All fields are sorted by name so that fields with the same name can be processed together

 Collections.sort(allFields); 
 final int numAllFields = allFields.size(); 

(b) Create the write object that generates the postings

 final FormatPostingsFieldsConsumer consumer = new FormatPostingsFieldsWriter(state, 
fieldInfos); 
 int start = 0; 

(c) For each field

 while(start < numAllFields) { 

(c-1) Find all fields with the same name

 final FieldInfo fieldInfo = allFields.get(start).fieldInfo; 
 final String fieldName = fieldInfo.name; 
 int end = start+1; 
 while(end < numAllFields && allFields.get(end).fieldInfo.name.equals(fieldName)) 
 end++; 
 FreqProxTermsWriterPerField[] fields = new FreqProxTermsWriterPerField[end-start]; 
 for(int i=start;i<end;i++) { 
 fields[i-start] = allFields.get(i); 
 fieldInfo.storePayloads |= fields[i-start].hasPayloads; 
 } 

(c-2) Append the postings of the same-named fields to the file

 appendPostings(fields, consumer); 

(c-3) Free the space

 for(int i=0;i<fields.length;i++) { 
 TermsHashPerField perField = fields[i].termsHashPerField; 
 int numPostings = perField.numPostings;
 perField.reset(); 
 perField.shrinkHash(numPostings); 
 fields[i].reset(); 
 } 
 start = end; 
 } 

(d) Close the postings write object

 consumer.finish(); 
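The grouping idiom in step (c) is worth isolating: after sorting, advance end past every entry that shares start's name, process the group [start, end), then continue from end. A standalone sketch with hypothetical field names:

import java.util.Arrays;
import java.util.List;

public class GroupByNameSketch {
    public static void main(String[] args) {
        //already sorted, as Collections.sort(allFields) guarantees above
        List<String> fields = Arrays.asList("body", "title", "title", "title");
        int start = 0;
        while (start < fields.size()) {
            int end = start + 1;
            while (end < fields.size() && fields.get(end).equals(fields.get(start)))
                end++;
            System.out.println(fields.get(start) + ": " + (end - start) + " per-thread fields");
            start = end; //move on to the next group
        }
    }
}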

(b) Create the write object that generates the postings

The code is:

public FormatPostingsFieldsWriter(SegmentWriteState state, FieldInfos fieldInfos) throws 
IOException { 
 dir = state.directory; 
 segment = state.segmentName; 
 totalNumDocs = state.numDocs; 
 this.fieldInfos = fieldInfos; 
 //Used to write the tii and tis files 
 termsOut = new TermInfosWriter(dir, segment, fieldInfos, state.termIndexInterval); 
 //Skip list writer for the freq and prox data 
 skipListWriter = new DefaultSkipListWriter(termsOut.skipInterval, termsOut.maxSkipLevels, 
totalNumDocs, null, null); 
 //Record the written file names 
 state.flushedFiles.add(state.segmentFileName(IndexFileNames.TERMS_EXTENSION)); 
 state.flushedFiles.add(state.segmentFileName(IndexFileNames.TERMS_INDEX_EXTENSION)); 
 //Use the two write objects above to write the segment in the proper format 
 termsWriter = new FormatPostingsTermsWriter(state, this); 
} 
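The skipInterval and maxSkipLevels parameters control a multi-level skip list: level k holds one skip entry per skipInterval^(k+1) documents, capped at maxSkipLevels levels. A hedged arithmetic sketch with illustrative values (not read from a real index):

public class SkipLevelsSketch {
    public static void main(String[] args) {
        int skipInterval = 16, maxSkipLevels = 10, docFreq = 100000;
        long span = skipInterval;
        for (int level = 0; level < maxSkipLevels && span <= docFreq; level++) {
            System.out.println("level " + level + ": " + (docFreq / span) + " skip entries");
            span *= skipInterval;
        }
    }
}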

(c-2) Appending the postings of the same-named fields to the file

The code is:

FreqProxTermsWriter.appendPostings(FreqProxTermsWriterPerField[], 
FormatPostingsFieldsConsumer) { 
 int numFields = fields.length; 
 final FreqProxFieldMergeState[] mergeStates = new FreqProxFieldMergeState[numFields]; 
 for(int i=0;i<numFields;i++) { 
 FreqProxFieldMergeState fms = mergeStates[i] = new FreqProxFieldMergeState(fields[i]); 
 boolean result = fms.nextTerm(); //For each field, advance to its first Term 
 } 
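The appendPostings code above (shown only in part) performs a k-way merge of the sorted term streams of all same-named per-thread fields. A standalone sketch of that merge idiom using java.util.PriorityQueue (terms hypothetical; the real code drives FreqProxFieldMergeState objects and compares term text directly):

import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class TermMergeSketch {
    static class Stream implements Comparable<Stream> {
        final Iterator<String> it;
        String current;
        Stream(List<String> terms) { it = terms.iterator(); current = it.next(); } //assumes non-empty
        boolean advance() { current = it.hasNext() ? it.next() : null; return current != null; }
        public int compareTo(Stream o) { return current.compareTo(o.current); }
    }

    public static void main(String[] args) {
        PriorityQueue<Stream> pq = new PriorityQueue<Stream>();
        pq.add(new Stream(Arrays.asList("apache", "lucene")));
        pq.add(new Stream(Arrays.asList("index", "lucene", "segment")));
        while (!pq.isEmpty()) {
            Stream top = pq.poll();
            System.out.println(top.current); //emit the smallest term; equal terms group together
            if (top.advance()) pq.add(top);  //refill with that stream's next term
        }
    }
}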
