2021SC@SDUSC
Close the IndexWriter object
code:
writer.close();
--> IndexWriter.closeInternal(boolean)
     --> (1) Write index information from memory to disk: flush(waitForMerges, true, true);
     --> (2) Merge segments: mergeScheduler.merge(this);
The merging of segments will be discussed in later chapters. Here, we only discuss the process of writing index information from memory to disk.
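To see where this chain is triggered from user code, here is a minimal sketch assuming the Lucene 2.9/3.0-era API that this series analyzes (the index path "index" and the field name "content" are made up for illustration):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class CloseDemo {
  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.open(new File("index"));
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30),
        true, IndexWriter.MaxFieldLength.LIMITED);
    Document doc = new Document();
    doc.add(new Field("content", "hello lucene", Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(doc); //buffered in DocumentsWriter's RAM
    writer.close();          //flush the in-RAM index to disk, then trigger merges
  }
}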
code:
IndexWriter.flush(boolean triggerMerge, boolean flushDocStores, boolean flushDeletes)
--> IndexWriter.doFlush(boolean flushDocStores, boolean flushDeletes)
     --> IndexWriter.doFlushInternal(boolean flushDocStores, boolean flushDeletes)
Writing an index to disk involves the following steps:
1. Get the name of the segment to write: String segment = docWriter.getSegment();
2. DocumentsWriter writes its cached information to the segment: docWriter.flush(flushDocStores);
3. Generate a new segment information object: newSegment = new SegmentInfo(segment, flushedDocCount, directory, false, true, docStoreOffset, docStoreSegment, docStoreIsCompoundFile, docWriter.hasProx());
4. Prepare the deletion of documents: docWriter.pushDeletes();
5. Generate the cfs compound file segment: docWriter.createCompoundFile(segment);
6. Delete documents: applyDeletes();
Get the segment name to write
code:
SegmentInfo newSegment = null;
final int numDocs = docWriter.getNumDocsInRAM(); //Total number of documents
String docStoreSegment = docWriter.getDocStoreSegment(); //Name of the segment to which the stored fields and term vectors are written, "_0"
int docStoreOffset = docWriter.getDocStoreOffset(); //Offset within that segment at which the stored fields and term vectors are written
String segment = docWriter.getSegment(); //Segment name, "_0"
The chapter on Lucene's index file structure describes in detail how the stored fields and term vectors can be kept in a different segment from the indexed fields.
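As an aside, names like "_0" come from a per-index counter: in the 2.9/3.0-era sources, IndexWriter.newSegmentName() renders the counter in base 36, so segments are named _0, _1, ..., _9, _a, _b, and so on. A minimal sketch of that naming scheme:

static String newSegmentName(int counter) {
  //"_" followed by the counter in base 36 (Character.MAX_RADIX == 36)
  return "_" + Integer.toString(counter, Character.MAX_RADIX);
}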
Write cached content to the segment
code:
flushedDocCount = docWriter.flush(flushDocStores);
This process includes the following two stages:
1. Close the stored fields and term vectors according to the basic index chain
2. Write the index result to the segment according to the structure of the basic index chain
Close the stored fields and term vectors according to the basic index chain
The code is:
closeDocStore();
flushState.numDocsInStore = 0;
This mainly closes the stored fields and term vectors along the basic index chain structure:
consumer(DocFieldProcessor).closeDocStore(flushState);
  consumer(DocInverter).closeDocStore(state);
    consumer(TermsHash).closeDocStore(state);
      consumer(FreqProxTermsWriter).closeDocStore(state);
      if (nextTermsHash != null) nextTermsHash.closeDocStore(state);
        consumer(TermVectorsTermsWriter).closeDocStore(state);
    endConsumer(NormsWriter).closeDocStore(state);
  fieldsWriter(StoredFieldsWriter).closeDocStore(state);
Among them, the following two closeDocStore calls do substantive work:
Closing the term vectors:
TermVectorsTermsWriter.closeDocStore(SegmentWriteState)

void closeDocStore(final SegmentWriteState state) throws IOException {
  if (tvx != null) {
    //Write zeros into the tvd file for documents that have no term vectors; even a
    //document without term vectors still occupies a slot in tvx and tvd.
    fill(state.numDocsInStore - docWriter.getDocStoreOffset());
    //Close the write streams of the tvx, tvf and tvd files.
    tvx.close();
    tvf.close();
    tvd.close();
    tvx = null;
    //Record the written file names; when the cfs file is generated later, these
    //files are combined into a single cfs file.
    state.flushedFiles.add(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_INDEX_EXTENSION);
    state.flushedFiles.add(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_FIELDS_EXTENSION);
    state.flushedFiles.add(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_DOCUMENTS_EXTENSION);
    //Remove them from DocumentsWriter's member variable openFiles, so that they
    //may be deleted by IndexFileDeleter.
    docWriter.removeOpenFile(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_INDEX_EXTENSION);
    docWriter.removeOpenFile(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_FIELDS_EXTENSION);
    docWriter.removeOpenFile(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_DOCUMENTS_EXTENSION);
    lastDocID = 0;
  }
}
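The key step above is fill(), which reserves slots for documents that have no term vectors. For reference, in the 2.9/3.0-era sources it looks roughly like the following (quoted from memory, so treat it as a sketch): for every missing document it writes a tvd pointer and a tvf pointer into tvx, plus a zero field count into tvd:

void fill(int docID) throws IOException {
  final int docStoreOffset = docWriter.getDocStoreOffset();
  final int end = docID + docStoreOffset;
  if (lastDocID < end) {
    final long tvfPosition = tvf.getFilePointer();
    while (lastDocID < end) {
      tvx.writeLong(tvd.getFilePointer()); //pointer into tvd for this doc
      tvd.writeVInt(0);                    //zero fields: no term vectors stored
      tvx.writeLong(tvfPosition);          //pointer into tvf for this doc
      lastDocID++;
    }
  }
}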
Closing the stored fields:
StoredFieldsWriter.closeDocStore(SegmentWriteState)

public void closeDocStore(SegmentWriteState state) throws IOException {
  //Close the write streams of the fdx and fdt files.
  fieldsWriter.close();
  --> fieldsStream.close();
  --> indexStream.close();
  fieldsWriter = null;
  lastDocID = 0;
  //Record the written file names.
  state.flushedFiles.add(state.docStoreSegmentName + "." + IndexFileNames.FIELDS_EXTENSION);
  state.flushedFiles.add(state.docStoreSegmentName + "." + IndexFileNames.FIELDS_INDEX_EXTENSION);
  state.docWriter.removeOpenFile(state.docStoreSegmentName + "." + IndexFileNames.FIELDS_EXTENSION);
  state.docWriter.removeOpenFile(state.docStoreSegmentName + "." + IndexFileNames.FIELDS_INDEX_EXTENSION);
}
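The fdx file closed here is simply a fixed-width index into fdt: one 8-byte pointer per document. A simplified sketch of how a reader resolves a document's stored-field record (this ignores the docStoreOffset handling and any format header a real segment may contain, so it is illustrative only):

long pointerForDoc(int docID, IndexInput fdx) throws IOException {
  fdx.seek((long) docID * 8); //fixed-width index: docID -> offset of the record in fdt
  return fdx.readLong();
}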
Write the index result to the segment according to the structure of the basic index chain
The code is:
consumer(DocFieldProcessor).flush(threads, flushState);

//Reclaim fieldHash for the next round of indexing; to improve efficiency, the
//objects in the index chain are reused.
Map<DocFieldConsumerPerThread, Collection<DocFieldConsumerPerField>> childThreadsAndFields =
    new HashMap<DocFieldConsumerPerThread, Collection<DocFieldConsumerPerField>>();
for (DocConsumerPerThread thread : threads) {
  DocFieldProcessorPerThread perThread = (DocFieldProcessorPerThread) thread;
  childThreadsAndFields.put(perThread.consumer, perThread.fields());
  perThread.trimFields(state);
}
//Write the stored fields.
--> fieldsWriter(StoredFieldsWriter).flush(state);
//Write the indexed fields.
--> consumer(DocInverter).flush(childThreadsAndFields, state);
//Write the field metadata (fnm) and record the written file name so that the cfs
//file can be generated later.
--> final String fileName = state.segmentFileName(IndexFileNames.FIELD_INFOS_EXTENSION);
--> fieldInfos.write(state.directory, fileName);
--> state.flushedFiles.add(fileName);
This process is also based on the basic index chain:
consumer(DocFieldProcessor).flush(...);
  consumer(DocInverter).flush(...);
    consumer(TermsHash).flush(...);
      consumer(FreqProxTermsWriter).flush(...);
      if (nextTermsHash != null) nextTermsHash.flush(...);
        consumer(TermVectorsTermsWriter).flush(...);
    endConsumer(NormsWriter).flush(...);
  fieldsWriter(StoredFieldsWriter).flush(...);
Write the stored fields
The code is:
StoredFieldsWriter.flush(SegmentWriteState state) {
  if (state.numDocsInStore > 0) {
    initFieldsWriter();
    fill(state.numDocsInStore - docWriter.getDocStoreOffset());
  }
  if (fieldsWriter != null)
    fieldsWriter.flush();
}
As the code shows, this is where the fdx and fdt files would be written; however, they were already written in closeDocStore above, state.numDocsInStore was set to zero, and fieldsWriter was set to null, so in fact nothing is done here.
Write the indexed fields
The code is:
DocInverter.flush(Map<DocFieldConsumerPerThread, Collection<DocFieldConsumerPerField>>, SegmentWriteState)

//Write the posting lists and term vectors.
--> consumer(TermsHash).flush(childThreadsAndFields, state);
//Write the normalization factors (norms).
--> endConsumer(NormsWriter).flush(endChildThreadsAndFields, state);
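The normalization factors written by NormsWriter take one byte per document per field: the float norm is compressed with a lossy 3-bit-mantissa, 5-bit-exponent encoding. A small self-contained sketch assuming the 2.9/3.0-era Similarity API, where encodeNorm and decodeNorm were static methods:

import org.apache.lucene.search.Similarity;

public class NormDemo {
  public static void main(String[] args) {
    //DefaultSimilarity's lengthNorm for a 4-term field is 1/sqrt(4).
    float norm = 1.0f / (float) Math.sqrt(4);
    byte encoded = Similarity.encodeNorm(norm);     //lossy one-byte encoding
    float decoded = Similarity.decodeNorm(encoded); //approximate round-trip
    System.out.println(norm + " -> " + encoded + " -> " + decoded);
  }
}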
Write the posting lists and term vectors
The code is:
TermsHash.flush(Map<InvertedDocConsumerPerThread, Collection<InvertedDocConsumerPerField>>, SegmentWriteState)

//Write the posting lists.
--> consumer(FreqProxTermsWriter).flush(childThreadsAndFields, state);
//Recycle the RawPostingList objects.
--> shrinkFreePostings(threadsAndFields, state);
//Write the term vectors.
--> if (nextTermsHash != null) nextTermsHash.flush(nextThreadsAndFields, state);
     --> consumer(TermVectorsTermsWriter).flush(childThreadsAndFields, state);
Write the posting lists
The code is:
FreqProxTermsWriter.flush(Map<TermsHashConsumerPerThread, Collection<TermsHashConsumerPerField>>, SegmentWriteState)
(a) Sort all fields by name, so that fields with the same name can be processed together (the compareTo that drives this sort is shown after step (d))

Collections.sort(allFields);
final int numAllFields = allFields.size();
(b) Create the writer object that generates the posting lists

final FormatPostingsFieldsConsumer consumer = new FormatPostingsFieldsWriter(state, fieldInfos);
int start = 0;
(c) For each field
while(start < numAllFields) {
(c-1) Find all fields with the same name
final FieldInfo fieldInfo = allFields.get(start).fieldInfo;
final String fieldName = fieldInfo.name;
int end = start + 1;
while (end < numAllFields && allFields.get(end).fieldInfo.name.equals(fieldName))
  end++;
FreqProxTermsWriterPerField[] fields = new FreqProxTermsWriterPerField[end - start];
for (int i = start; i < end; i++) {
  fields[i - start] = allFields.get(i);
  fieldInfo.storePayloads |= fields[i - start].hasPayloads;
}
(c-2) Append the posting lists of the same-named fields to the file
appendPostings(fields, consumer);
(c-3) Free the space
for (int i = 0; i < fields.length; i++) {
  TermsHashPerField perField = fields[i].termsHashPerField;
  int numPostings = perField.numPostings;
  perField.reset();
  perField.shrinkHash(numPostings);
  fields[i].reset();
}
start = end;
}
(d) Close the posting list writer
consumer.finish();
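The sort in step (a) works because FreqProxTermsWriterPerField implements Comparable; in the 2.9/3.0-era sources the ordering is simply by field name, which is exactly what makes same-named fields from different threads end up adjacent (quoted from memory, so treat it as a sketch):

public int compareTo(FreqProxTermsWriterPerField other) {
  return fieldInfo.name.compareTo(other.fieldInfo.name);
}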
(b) The writer object that generates the posting lists
The code is:
public FormatPostingsFieldsWriter(SegmentWriteState state, FieldInfos fieldInfos) throws IOException {
  dir = state.directory;
  segment = state.segmentName;
  totalNumDocs = state.numDocs;
  this.fieldInfos = fieldInfos;
  //Used to write the tii and tis files.
  termsOut = new TermInfosWriter(dir, segment, fieldInfos, state.termIndexInterval);
  //Skip-list writer for the frq and prx files.
  skipListWriter = new DefaultSkipListWriter(termsOut.skipInterval, termsOut.maxSkipLevels, totalNumDocs, null, null);
  //Record the written file names.
  state.flushedFiles.add(state.segmentFileName(IndexFileNames.TERMS_EXTENSION));
  state.flushedFiles.add(state.segmentFileName(IndexFileNames.TERMS_INDEX_EXTENSION));
  //Use the two writer objects above to write the segment in the proper format.
  termsWriter = new FormatPostingsTermsWriter(state, this);
}
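The termIndexInterval passed to TermInfosWriter controls the tis/tii relationship: every termIndexInterval-th term written to .tis (128 by default, settable via IndexWriter.setTermIndexInterval in this era) is mirrored into the much smaller .tii, which readers keep in memory and binary-search before scanning at most termIndexInterval entries of .tis. A self-contained sketch of that sampling pattern, where the lists merely stand in for the two files:

import java.util.ArrayList;
import java.util.List;

public class TermIndexSketch {
  public static void main(String[] args) {
    int termIndexInterval = 128;                //Lucene's default
    List<String> tis = new ArrayList<String>(); //stands in for the .tis file
    List<String> tii = new ArrayList<String>(); //stands in for the .tii file
    for (int i = 0; i < 1000; i++) {
      String term = String.format("term%04d", i);
      tis.add(term);
      if (i % termIndexInterval == 0)
        tii.add(term);                          //sparse index entry
    }
    System.out.println("tis=" + tis.size() + " entries, tii=" + tii.size() + " entries");
  }
}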
(c) Append the posting lists of fields with the same name to the file
The code is:
FreqProxTermsWriter.appendPostings(FreqProxTermsWriterPerField[], FormatPostingsFieldsConsumer) {
  int numFields = fields.length;
  final FreqProxFieldMergeState[] mergeStates = new FreqProxFieldMergeState[numFields];
  for (int i = 0; i < numFields; i++) {
    FreqProxFieldMergeState fms = mergeStates[i] = new FreqProxFieldMergeState(fields[i]);
    //For every field, advance to its first term (Term).
    boolean result = fms.nextTerm();
  }
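The rest of appendPostings (not quoted here) repeatedly picks the smallest current term across all mergeStates, merges the postings of every field positioned on that term, and advances those states: in other words, a k-way merge. The following standalone sketch shows the same pattern on plain sorted string lists; note that the real code merges entries with equal terms into one posting list rather than emitting them twice:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class KWayMergeSketch {
  //Merge k sorted lists by always taking the smallest head element.
  public static List<String> merge(final List<List<String>> sortedLists) {
    PriorityQueue<int[]> pq = new PriorityQueue<int[]>(
        Comparator.comparing((int[] e) -> sortedLists.get(e[0]).get(e[1])));
    for (int i = 0; i < sortedLists.size(); i++)
      if (!sortedLists.get(i).isEmpty()) pq.add(new int[] { i, 0 });
    List<String> out = new ArrayList<String>();
    while (!pq.isEmpty()) {
      int[] e = pq.poll(); //e[0] = list index, e[1] = position within that list
      out.add(sortedLists.get(e[0]).get(e[1]));
      if (e[1] + 1 < sortedLists.get(e[0]).size())
        pq.add(new int[] { e[0], e[1] + 1 });
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(merge(Arrays.asList(
        Arrays.asList("apache", "lucene"),
        Arrays.asList("index", "lucene", "segment"))));
  }
}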