1, Prefix tree
Generally, when designing websites, there will be problem publishing or content publishing functions. A very important point of these functions is how to realize sensitive word filtering. Otherwise, there may be bad information publishing, or the published content may contain code fragments with malicious functions. The basic algorithm for sensitive word filtering is prefix tree algorithm, Prefix tree is also called dictionary tree. Prefix tree matching can speed up the matching of sensitive words.
Prefix tree is also called Trie, dictionary tree and lookup tree. The main features are: high search efficiency, but large memory consumption; It is mainly used in string retrieval, word frequency statistics, string sorting, etc.
What exactly is a prefix tree? How is the function of prefix tree implemented?
Take a specific example: if there is a string "xwabfabcff" and the sensitive words are "abc", "bf" and "be", detect the string. If there is a sensitive word, replace the sensitive word with "*" to implement an algorithm.
Characteristics of prefix tree:
1. The following node is an empty node without any characters.
2. Except for the root node, each node has only one character.
3. Each node contains different child nodes. For example, the child node of root has two b's, but we only keep one
4. Make a mark at the end node of each sensitive word to indicate that the string composed from the root node to this node is a sensitive word, and the unmarked node in the middle and the string in the middle of the root node do not constitute a sensitive word.
Algorithm logic of prefix tree:
1. Preparation: we need three pointers: ① the pointer points to the prefix tree and points to the root node by default; ② ③ the pointer points to the string to be detected (the same direction ruler distance method, ② go from beginning to end, mark the beginning of the sensitive word, ③ move with ②, mark the end of the sensitive word), and points to the first character of the string by default. We also need a string (StringBuilder) to store the detection results.
2. ① Access the first layer of the tree and find that there is no 'x', then ② and ③ go down and store the 'x' in the StringBuilder string. " w 'similarly.
3. At this time, ② and ③ point to 'a', ① access the first layer of the tree and find 'a', but 'a' is not marked, so it is not a sensitive word, so 'a' is stored in the StringBuilder string. Then ② do not move, ① and ③ continue to go down until they reach the marked node or do not match, ① return, ② go down one step, ③ return to the place pointed by ② at this time. Repeat the above steps.
4. If a sensitive word is detected, store "* * *" in StringBuilder and make ② skip this sensitive word, and ② and ③ jointly point to the next position of the original ③.
5. ② ③ when it reaches the end of the string, the detection is completed. The final result is "xwa******ff".
2, Sensitive word filter
When we redevelop the project, we need to develop a reusable tool for filtering sensitive words, which becomes a sensitive word filter, so that it can be reused in the project.
There are three main steps to develop a sensitive word filter:
1. Define prefix tree
2. Initialize the prefix tree according to the sensitive words
3. Write methods for filtering sensitive words
The code implementation is as follows:
import org.apache.commons.lang3.CharUtils; import org.apache.commons.lang3.StringUtils; import org.slf4j.Logger; import org.slf4j.LoggerFactory; import org.springframework.stereotype.Component; import javax.annotation.PostConstruct; import java.io.BufferedReader; import java.io.IOException; import java.io.InputStream; import java.io.InputStreamReader; import java.util.HashMap; import java.util.Map; @Component public class SensitiveFilter { // Log private static final Logger logger = LoggerFactory.getLogger(SensitiveFilter.class); // Substitution character private static final String REPLACEMENT = "***"; // Initialize root node private TrieNode rootNode = new TrieNode(); /** * 2. Initialize the prefix tree according to the sensitive words */ @PostConstruct// When the container instantiates the Bean when the server starts, and calls the Bean's construction method, the method will be called automatically public void init() { try ( // Load the sensitive words file Txt is a self built file for storing sensitive words InputStream is = this.getClass().getClassLoader().getResourceAsStream("sensitive-words.txt"); // Byte stream -- > character stream -- > character buffer stream BufferedReader reader = new BufferedReader(new InputStreamReader(is)); ) { String keyword; while((keyword = reader.readLine()) != null){ // Add to the prefix tree. addKeyword is a custom method to add a sensitive word to the prefix tree this.addKeyword(keyword); } } catch (IOException e) { logger.error("Failed to load sensitive word file:" + e.getMessage()); } } // Encapsulation method: add a sensitive word to the prefix tree private void addKeyword(String keyword){ TrieNode tempNode = rootNode; for (int i = 0; i < keyword.length(); i++) { char c = keyword.charAt(i); TrieNode subNode = tempNode.getSubNode(c); if(subNode == null){ // If there is no such character in the child node, the child node is initialized with this character and assembled into the tree subNode = new TrieNode(); tempNode.addSubNode(c,subNode); } // Point to the word node and enter the next layer cycle tempNode = subNode; // Set end ID if(i == keyword.length() -1){ tempNode.setKeywordEnd(true); } } } /** * 3. Retrieve and filter sensitive words * @param text Text to be filtered * @return Filtered text */ public String filter(String text){ if(StringUtils.isBlank(text)){ return null; } // Pointer ① TrieNode tempNode = rootNode; // Pointer ② int begin = 0; // Pointer ③ int position = 0; // Storage results StringBuilder sb = new StringBuilder(); while(position < text.length()){ char c = text.charAt(position); // Skip symbol if(isSymbol(c)){ // If the pointer ① is at the root node, this symbol is included in the result and the pointer ② moves down one step if(tempNode == rootNode){ sb.append(c); begin++; } // Whether the symbol appears when it is not detected or when it is being detected, the pointer ③ always goes down one step // (when it is not detected, go down one step together with pointer ②. When it is detected, pointer ② does not move, and pointer ③ goes down one step) position++; continue; } // Check child nodes tempNode = tempNode.getSubNode(c); if(tempNode == null){ // String starting with begin is not a sensitive word sb.append(text.charAt(begin)); // Go to the next position begin++; position = begin; // The pointer ① returns and points to the root node again tempNode = rootNode; }else if (tempNode.isKeywordEnd()){ // If sensitive words are found, replace the begin~position string sb.append(REPLACEMENT); // Go to the next position position++; begin = position; // The pointer ① returns to its position and points back to you tempNode = rootNode; }else { // Check next character position++; } } // Count the last batch of characters into the result: pointer ③ reaches the middle end point before pointer ②, and the string between them is not a sensitive word sb.append(text.substring(begin)); return sb.toString(); } // Packaging method: judge whether it is a special symbol private boolean isSymbol(Character c){ // 0x2E80~0x9FFF are East Asian characters and should not be treated as special symbols return !CharUtils.isAsciiAlphanumeric(c) && (c < 0x2E80 || c > 0x9FFF); } /** * 1. Define prefix tree */ private class TrieNode { // Sensitive word (keyword) end identifier private boolean isKeywordEnd = false; // Child node (key is a child character and value is a child node) private Map<Character, TrieNode> subNodes = new HashMap<>(); public boolean isKeywordEnd() { return isKeywordEnd; } public void setKeywordEnd(boolean keywordEnd) { isKeywordEnd = keywordEnd; } // Add child node public void addSubNode(Character c, TrieNode node) { subNodes.put(c, node); } // Get child nodes public TrieNode getSubNode(Character c) { return subNodes.get(c); } } }