Preface
Last time, prodded by the product manager, Xiao Ming wrote a Chinese and English spelling correction tool: https://github.com/houbb/word-checker.
He thought the problem was solved once and for all. Then yesterday, with nothing better to do, he stumbled upon another open-source project. Its description reads:
Spelling correction & fuzzy search: 1 million times faster through Symmetric Delete spelling correction algorithm. The Symmetric Delete spelling correction algorithm reduces the complexity of edit candidate generation and dictionary lookup for a given Damerau-Levenshtein distance. It is six orders of magnitude faster (than the standard approach with deletes + transposes + replaces + inserts) and language independent.
A million times faster? Xiao Ming thought that boast was a bit much.
In the spirit of neither believing nor spreading rumors, Xiao Ming set off on his journey of studying the algorithm.
Ideas behind word spelling correction algorithms
There are several algorithms for correcting English spelling:
Computing edit distance in real time
Given two strings $s_1$ and $s_2$, the edit distance between them is the minimum number of edit operations required to convert $s_1$ into $s_2$.
The most commonly allowed edit operations are: (i) inserting a character into a string; (ii) deleting a character from a string; and (iii) replacing a character in a string with another character. With these operations, the edit distance is also known as the Levenshtein distance.
This algorithm is the easiest to think of, but computing the distance against every dictionary entry in real time is very expensive, so it is generally not used in industry.
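For reference, here is a minimal sketch of the classic dynamic-programming computation of Levenshtein distance (an illustration only, not the word-checker source). A spell checker built on it would have to run this against every dictionary entry for every query, which is exactly the cost problem described above:

```java
// Minimal Levenshtein distance sketch (illustrative, not the word-checker source).
public class Levenshtein {

    public static int distance(String s1, String s2) {
        int[][] dp = new int[s1.length() + 1][s2.length() + 1];
        for (int i = 0; i <= s1.length(); i++) {
            dp[i][0] = i; // delete all characters of s1
        }
        for (int j = 0; j <= s2.length(); j++) {
            dp[0][j] = j; // insert all characters of s2
        }
        for (int i = 1; i <= s1.length(); i++) {
            for (int j = 1; j <= s2.length(); j++) {
                int cost = s1.charAt(i - 1) == s2.charAt(j - 1) ? 0 : 1;
                dp[i][j] = Math.min(
                        Math.min(dp[i - 1][j] + 1,      // delete
                                 dp[i][j - 1] + 1),     // insert
                        dp[i - 1][j - 1] + cost);       // replace
            }
        }
        return dp[s1.length()][s2.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("goox", "good")); // 1
    }
}
```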
Peter Norvig's spelling algorithm
Generate all possible terms within the given edit distance (delete + transpose + replace + insert) from the query word and look each of them up in the dictionary.
For a word of length n, an alphabet of size a, and an edit distance d=1, there are n deletions, n-1 transpositions, a*n replacements, and a*(n+1) insertions, for a total of 2n+2an+a-1 candidate terms.
This is the algorithm Xiao Ming chose at the beginning, and its performance is much better than the previous one.
However, the search is still expensive (114,324 candidate terms for n=9, a=36, d=2) and language dependent (because the alphabet is used to generate the candidates, and alphabets differ between languages).
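As an illustration, here is a minimal sketch of Norvig-style candidate generation at edit distance 1 (using the lowercase English alphabet as an assumption; this is not the word-checker source):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of Norvig-style candidate generation at edit distance 1.
public class NorvigEdits {

    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz";

    public static Set<String> edits1(String word) {
        Set<String> edits = new HashSet<>();
        for (int i = 0; i <= word.length(); i++) {
            String left = word.substring(0, i);
            String right = word.substring(i);
            // deletes: n
            if (!right.isEmpty()) {
                edits.add(left + right.substring(1));
            }
            // transposes: n - 1
            if (right.length() > 1) {
                edits.add(left + right.charAt(1) + right.charAt(0) + right.substring(2));
            }
            for (char c : ALPHABET.toCharArray()) {
                // replaces: a * n
                if (!right.isEmpty()) {
                    edits.add(left + c + right.substring(1));
                }
                // inserts: a * (n + 1)
                edits.add(left + c + right);
            }
        }
        return edits;
    }

    public static void main(String[] args) {
        // For "good" (n=4, a=26) the formula gives 2n+2an+a-1 = 241 generated
        // strings; the set is slightly smaller because duplicates collapse.
        System.out.println(edits1("good").size());
    }
}
```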
An idea for improving performance
If Xiao Ming were to improve this algorithm himself, the natural idea is to trade space for time:
generate the delete + transpose + replace + insert variants of every correct word in advance, and insert them all into the dictionary.
But there is a big problem: the dictionary produced by this preprocessing is far too large to be acceptable.
So is there no way out?
Let's meet the protagonist of this article.
Symmetric Delete spelling correction (SymSpell)
Algorithm description
Generate terms within the maximum edit distance (deletes only) from each dictionary term and add them to the dictionary together with the original term.
This has to be performed only once, in a precomputation step.
At query time, generate terms within the maximum edit distance (again deletes only) from the input term and look them up in the dictionary.
For a word of length n, an alphabet of size a, and edit distance 1, there are only n delete variants, so only n terms to look up (compare the 2n+2an+a-1 candidates of the naive approach above).
The cost of this approach is the precomputation time and storage space for the x deletes of each original dictionary entry, which is acceptable in most cases.
The number x of deletes per dictionary entry depends on the maximum edit distance.
By using deletes only, instead of delete + transpose + replace + insert, the Symmetric Delete spelling correction algorithm reduces the complexity of both edit-candidate generation and dictionary lookup. It is six orders of magnitude faster (at edit distance 3) and language independent.
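To make "deletes only" concrete, here is a minimal sketch of generating all delete variants of a term up to a maximum edit distance (a hypothetical helper; the article's own d=1 implementation appears later):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: all variants of a term obtained by deleting up to maxDistance characters.
public class Deletes {

    public static Set<String> deletes(String term, int maxDistance) {
        Set<String> results = new HashSet<>();
        collect(term, maxDistance, results);
        return results;
    }

    private static void collect(String term, int remaining, Set<String> results) {
        if (remaining == 0 || term.length() <= 1) {
            return;
        }
        for (int i = 0; i < term.length(); i++) {
            String deleted = term.substring(0, i) + term.substring(i + 1);
            if (results.add(deleted)) {
                // recurse only on newly seen variants
                collect(deleted, remaining - 1, results);
            }
        }
    }

    public static void main(String[] args) {
        // e.g. [god, goo, ood] (set order not guaranteed)
        System.out.println(deletes("good", 1));
    }
}
```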
Some remarks
To make this easier to understand, the original author added several notes.
Note 1: during precomputation, different words in the dictionary may produce the same delete variant: delete(sun,1) == delete(sin,1) == sn.
Although we generate only one new dictionary entry (sn), internally we must store both original terms as spelling-correction suggestions (sun, sin).
Note 2: there are four different types of comparison pairs:
(1) dictionary entry == input entry
(2) delete(dictionary entry, p1) == input entry   // precomputed
(3) dictionary entry == delete(input entry, p2)
(4) delete(dictionary entry, p1) == delete(input entry, p2)
Only replaces and transposes require the last comparison type.
However, we then need to check whether a suggested dictionary term really is a replace or an adjacent transpose of the input term, to prevent false positives at a higher edit distance (bank == bnak and bank == bink, but bank != kanb, bank != xban, and bank != baxn).
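One way to perform that check (a sketch, not the SymSpell reference implementation) is to compute the restricted Damerau-Levenshtein distance, i.e. the earlier Levenshtein dynamic program extended with an adjacent-transpose case, and discard any candidate whose true distance exceeds the threshold:

```java
// Sketch: verify suggestions with the restricted Damerau-Levenshtein distance
// (optimal string alignment) to weed out false positives such as bank vs. xban.
public class Verify {

    public static int damerauLevenshtein(String s1, String s2) {
        int[][] dp = new int[s1.length() + 1][s2.length() + 1];
        for (int i = 0; i <= s1.length(); i++) dp[i][0] = i;
        for (int j = 0; j <= s2.length(); j++) dp[0][j] = j;
        for (int i = 1; i <= s1.length(); i++) {
            for (int j = 1; j <= s2.length(); j++) {
                int cost = s1.charAt(i - 1) == s2.charAt(j - 1) ? 0 : 1;
                dp[i][j] = Math.min(Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1),
                        dp[i - 1][j - 1] + cost);
                if (i > 1 && j > 1
                        && s1.charAt(i - 1) == s2.charAt(j - 2)
                        && s1.charAt(i - 2) == s2.charAt(j - 1)) {
                    dp[i][j] = Math.min(dp[i][j], dp[i - 2][j - 2] + 1); // transpose
                }
            }
        }
        return dp[s1.length()][s2.length()];
    }

    public static void main(String[] args) {
        System.out.println(damerauLevenshtein("bank", "bnak")); // 1: keep at d=1
        System.out.println(damerauLevenshtein("bank", "xban")); // 2: reject at d=1
    }
}
```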
Note 3: we use the search engine index itself, not a dedicated spelling dictionary.
This has several benefits:
It is updated dynamically: every newly indexed word whose frequency exceeds a certain threshold is automatically used for spelling correction as well.
Since we need to search the index anyway, spelling correction adds almost no extra cost.
When a term is found in the index as a misspelling (i.e. it is not marked as correct there), we correct the spelling on the fly and index the page under the correct term as well.
Note 4: we implemented query suggestion/completion in a similar way.
This is a good way to prevent spelling errors in the first place.
Every newly indexed word whose frequency exceeds a certain threshold is stored as a suggestion for all of its prefixes (the prefixes are created in the index if they do not already exist).
Since we provide instant search, looking up suggestions costs almost nothing extra. Multiple suggested terms are ranked by the number of results stored in the index.
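As a rough illustration of the prefix idea (hypothetical data structures, not the search engine's actual code), storing every indexed word under all of its prefixes could look like this:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: store a word as a completion suggestion under every one of its prefixes.
public class PrefixSuggestions {

    private final Map<String, List<String>> suggestions = new HashMap<>();

    public void index(String word) {
        for (int len = 1; len <= word.length(); len++) {
            String prefix = word.substring(0, len);
            suggestions.computeIfAbsent(prefix, k -> new ArrayList<>()).add(word);
        }
    }

    public List<String> suggest(String prefix) {
        return suggestions.getOrDefault(prefix, Collections.emptyList());
    }

    public static void main(String[] args) {
        PrefixSuggestions ps = new PrefixSuggestions();
        ps.index("good");
        ps.index("google");
        System.out.println(ps.suggest("go")); // [good, google]
    }
}
```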
Reasoning
The SymSpell algorithm exploits the fact that the edit distance between two terms is symmetric:
we can generate all terms with an edit distance < 2 from the query term (trying to undo the query term's error) and check them against all dictionary terms, or
we can generate all terms with an edit distance < 2 from each dictionary term (trying to reproduce the query term's error) and check the query term against them.
By transforming the correct dictionary terms toward the erroneous input and the erroneous input term toward the correct terms, we can combine both directions and meet in the middle.
Because adding a character on the dictionary side is equivalent to deleting a character from the input string (and vice versa), we can restrict the transformation on both sides to deletes only.
Example
This passage left Xiao Ming a little confused, so here is an example to make it easier to follow.
Suppose the user types goox,
the dictionary of correct words contains good,
and the maximum edit distance is 1.
After delete-based preprocessing, the entries stored for good are: {ood = good, god = good, goo = good} (deleting either o yields the same god, which is stored only once), alongside the original word good.
When checking the user input:
(1) goox itself is not in the dictionary;
(2) the deletes of goox are:
oox, gox, goo
and goo hits the stored entry goo = good, so good is found as a suggestion.
Reading this far, you have surely noticed the cleverness of the algorithm:
by applying deletes to the dictionary side as well, it essentially achieves the effect of the delete + insert + replace operations of the original algorithm.
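Here is a self-contained sketch of the whole example, using plain hash maps rather than the word-checker source:

```java
import java.util.*;

// End-to-end sketch of the goox -> good example using plain maps.
public class SymSpellDemo {

    public static void main(String[] args) {
        // Precomputation: map each delete of "good" back to the original word.
        Map<String, List<String>> deletes = new HashMap<>();
        for (String dictWord : Collections.singletonList("good")) {
            for (String d : deletes1(dictWord)) {
                deletes.computeIfAbsent(d, k -> new ArrayList<>()).add(dictWord);
            }
        }

        // Lookup: "goox" is not in the dictionary, so try its deletes.
        String input = "goox";
        Set<String> candidates = new TreeSet<>();
        for (String d : deletes1(input)) {
            List<String> hits = deletes.get(d);
            if (hits != null) {
                candidates.addAll(hits); // "goo" hits "good"
            }
        }
        System.out.println(candidates); // [good]
    }

    // All strings obtained by deleting exactly one character.
    private static Set<String> deletes1(String word) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i < word.length(); i++) {
            result.add(word.substring(0, i) + word.substring(i + 1));
        }
        return result;
    }
}
```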
Edit distance
Among the edit-distance variants, we use variant 3, because delete-only transformations are language independent and about three orders of magnitude cheaper.
Where does the speed come from?
Precomputation, i.e. generating possible misspelling variants (deletes only) and storing them at index time, is the first prerequisite.
Fast index access at search time, via a hash table with an average lookup complexity of O(1), is the second prerequisite.
But only the Symmetric Delete spelling correction on top of these two brings this O(1) speed to spell checking, because it dramatically reduces the number of spelling-error candidates that have to be precomputed (generated and indexed).
Applying precomputation to Norvig's approach is not feasible, because precomputing all delete + transpose + replace + insert candidates of all terms would consume enormous time and space.
Computational complexity
The SymSpell algorithm runs in constant time (O(1)), i.e. independent of the dictionary size (but dependent on the average term length and the maximum edit distance), because our index is based on a hash table with an average lookup complexity of O(1).
Code implementation
Talk without practice is just fake kung fu.
After reading all this, Xiao Ming adjusted his original algorithm overnight.
Thesaurus preprocessing
Previously, for a word-frequency list like the following:
the,23135851162
of,13151942776
and,12997637966
it was enough to build a freqMap from each word to its frequency.
Now we also need the delete variants of each word at edit distance 1:
```java
/**
 * Symmetric-delete spelling correction thesaurus.
 * <p>
 * 1. If the word length is less than or equal to 1, it is not processed.
 * 2. Remove one letter at a time; the remainder is the key,
 *    and the value is the list of original candidates.
 * 3. How to make this more elegant?
 * 4. How to sort more elegantly?
 * <p>
 * If a custom thesaurus is not needed, the thesaurus could be preprocessed offline,
 * but that only reduces initialization time, so it matters little.
 *
 * @param freqMap    frequency map
 * @param resultsMap result map
 * @since 0.1.0
 */
static synchronized void initSymSpellMap(Map<String, Long> freqMap,
                                         Map<String, List<CandidateDto>> resultsMap) {
    if (MapUtil.isEmpty(freqMap)) {
        return;
    }
    for (Map.Entry<String, Long> entry : freqMap.entrySet()) {
        String key = entry.getKey();
        Long count = entry.getValue();
        // Length check; can be adjusted later according to the edit distance
        int len = key.length();
        if (len <= 1) {
            continue;
        }
        char[] chars = key.toCharArray();
        Set<String> tempSet = new HashSet<>(chars.length);
        for (int i = 0; i < chars.length; i++) {
            String text = buildString(chars, i);
            // Skip duplicate delete variants
            if (tempSet.contains(text)) {
                continue;
            }
            List<CandidateDto> candidateDtos = resultsMap.get(text);
            if (candidateDtos == null) {
                candidateDtos = new ArrayList<>();
            }
            // The original key is stored as the value
            candidateDtos.add(new CandidateDto(key, count));
            // The deleted text is used as the key
            resultsMap.put(text, candidateDtos);
            tempSet.add(text);
        }
    }
    // Sort everything in one pass at the end
    for (Map.Entry<String, List<CandidateDto>> entry : resultsMap.entrySet()) {
        String key = entry.getKey();
        List<CandidateDto> list = entry.getValue();
        if (list.size() > 1) {
            Collections.sort(list);
            resultsMap.put(key, list);
        }
    }
}
```
The helper that builds the deleted string is straightforward:
```java
/**
 * Build a string with one character removed.
 *
 * @param chars        character array
 * @param excludeIndex index to exclude
 * @return the resulting string
 * @since 0.1.0
 */
public static String buildString(char[] chars, int excludeIndex) {
    StringBuilder stringBuilder = new StringBuilder(chars.length - 1);
    for (int i = 0; i < chars.length; i++) {
        if (i == excludeIndex) {
            continue;
        }
        stringBuilder.append(chars[i]);
    }
    return stringBuilder.toString();
}
```
A few points worth noting here:
(1) If the word's length is less than or equal to the edit distance, there is no need to delete anything, because nothing would be left.
(2) Take care to skip duplicate delete variants. For good, for example, deleting either o produces the same god.
(3) Sorting everything once during preprocessing is still necessary, to improve the performance of real-time queries.
Of course, Xiao Ming also realized that if the thesaurus is fixed, the preprocessed result itself could be shipped directly, which would greatly improve loading speed.
But that is merely a nice-to-have; the overall impact is small.
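For reference, the CandidateDto used above is essentially a word/frequency pair ordered by descending frequency. A minimal sketch of its assumed shape (the actual class in word-checker may differ):

```java
// Assumed shape of CandidateDto: a word plus its frequency, ordered by
// descending frequency (the real word-checker class may differ).
public class CandidateDto implements Comparable<CandidateDto> {

    private final String word;
    private final long count;

    public CandidateDto(String word, long count) {
        this.word = word;
        this.count = count;
    }

    public static CandidateDto of(String word, long count) {
        return new CandidateDto(word, count);
    }

    @Override
    public int compareTo(CandidateDto other) {
        // Higher frequency sorts first, so the best suggestion comes out on top.
        return Long.compare(other.count, this.count);
    }

    @Override
    public String toString() {
        return word + "=" + count;
    }
}
```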
Adjusting the core algorithm
The core algorithm obtains the candidate list by querying directly according to the four comparison cases given above.
freqData holds the frequency information of the correct dictionary;
symSpellData holds the delete-based dictionary produced by preprocessing.
```java
/**
 * dictionary entry == input entry
 * delete(dictionary entry, p1) == input entry   // precomputed
 * dictionary entry == delete(input entry, p2)
 * delete(dictionary entry, p1) == delete(input entry, p2)
 *
 * For performance reasons, quick returns are used here.
 * This could be made configurable later; it is left as-is for now.
 *
 * @param word    word
 * @param context context
 * @return result
 * @since 0.1.0
 */
@Override
protected List<CandidateDto> getAllCandidateList(String word, IWordCheckerContext context) {
    IWordData wordData = context.wordData();
    Map<String, Long> freqData = wordData.freqData();
    Map<String, List<CandidateDto>> symSpellData = wordData.symSpellData();
    // 0. The original dictionary contains the word: return it directly.
    if (freqData.containsKey(word)) {
        CandidateDto dto = CandidateDto.of(word, freqData.get(word));
        return Collections.singletonList(dto);
    }
    // Words of length 1 get the fixed candidates "a" and "i".
    if (word.length() <= 1) {
        CandidateDto dtoA = CandidateDto.of("a", 9081174698L);
        CandidateDto dtoI = CandidateDto.of("i", 3086225277L);
        return Arrays.asList(dtoA, dtoI);
    }
    List<CandidateDto> resultList = new ArrayList<>();
    // 1. The symmetric-delete dictionary contains the input word.
    List<CandidateDto> symSpellList = symSpellData.get(word);
    if (CollectionUtil.isNotEmpty(symSpellList)) {
        resultList.addAll(symSpellList);
    }
    // All single-delete variants of the input word.
    Set<String> subWordSet = InnerWordDataUtil.buildStringSet(word.toCharArray());
    // 2. A delete of the input word exists in the original dictionary.
    for (String subWord : subWordSet) {
        if (freqData.containsKey(subWord)) {
            CandidateDto dto = CandidateDto.of(subWord, freqData.get(subWord));
            resultList.add(dto);
        }
    }
    // 3. A delete of the input word exists in the symmetric-delete dictionary.
    for (String subWord : subWordSet) {
        if (symSpellData.containsKey(subWord)) {
            resultList.addAll(symSpellData.get(subWord));
        }
    }
    if (CollectionUtil.isNotEmpty(resultList)) {
        return resultList;
    }
    // 4. Fall back to replaces and other single edits.
    //    To keep the edit distance at 1, only the original dictionary is consulted.
    List<String> edits = edits(word);
    for (String edit : edits) {
        if (freqData.containsKey(edit)) {
            CandidateDto dto = CandidateDto.of(edit, freqData.get(edit));
            resultList.add(dto);
        }
    }
    return resultList;
}
```
The following points are worth noting:
(1) If the original dictionary already contains the word, it is returned directly: the word is spelled correctly.
(2) If the length is 1, the fixed candidates a and i can be returned.
(3) Every other scenario can also return early if performance is a concern.
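The InnerWordDataUtil.buildStringSet helper referenced in the code is not shown in this article. Based on buildString above, it presumably looks something like the following sketch (an assumption, not the actual word-checker source):

```java
// Assumed sketch of InnerWordDataUtil.buildStringSet: all single-delete
// variants of a word, deduplicated (the real implementation may differ).
public static Set<String> buildStringSet(char[] chars) {
    Set<String> results = new HashSet<>(chars.length);
    for (int i = 0; i < chars.length; i++) {
        // Reuses buildString from above: drop the character at index i.
        results.add(buildString(chars, i));
    }
    return results;
}
```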
A server upgrade will never give you a 1000x improvement, but an algorithm can. (A salary, unfortunately, cannot.)
Summary
A good algorithm brings a very significant improvement to a program.
There is plenty more here worth studying.
The code in this article has been greatly simplified for readability. Interested readers can browse the source code themselves:
https://github.com/houbb/word-checker
I'm Lao Ma. I look forward to seeing you again next time.