How does SpringBoot use prefix trees to filter sensitive words

1, Prefix tree

Generally, when designing websites, there will be problem publishing or content publishing functions. A very important point of these functions is how to realize sensitive word filtering. Otherwise, there may be bad information publishing, or the published content may contain code fragments with malicious functions. The basic algorithm for sensitive word filtering is prefix tree algorithm, Prefix tree is also called dictionary tree. Prefix tree matching can speed up the matching of sensitive words.
Prefix tree is also called Trie, dictionary tree and lookup tree. The main features are: high search efficiency, but large memory consumption; It is mainly used in string retrieval, word frequency statistics, string sorting, etc.

What exactly is a prefix tree? How is the function of prefix tree implemented?
Take a specific example: if there is a string "xwabfabcff" and the sensitive words are "abc", "bf" and "be", detect the string. If there is a sensitive word, replace the sensitive word with "*" to implement an algorithm.

Characteristics of prefix tree:
        1. The following node is an empty node without any characters.
        2. Except for the root node, each node has only one character.
        3. Each node contains different child nodes. For example, the child node of root has two b's, but we only keep one
        4. Make a mark at the end node of each sensitive word to indicate that the string composed from the root node to this node is a sensitive word, and the unmarked node in the middle and the string in the middle of the root node do not constitute a sensitive word.

Algorithm logic of prefix tree:
        1. Preparation: we need three pointers: ① the pointer points to the prefix tree and points to the root node by default; ② ③ the pointer points to the string to be detected (the same direction ruler distance method, ② go from beginning to end, mark the beginning of the sensitive word, ③ move with ②, mark the end of the sensitive word), and points to the first character of the string by default. We also need a string (StringBuilder) to store the detection results.
        2. ① Access the first layer of the tree and find that there is no 'x', then ② and ③ go down and store the 'x' in the StringBuilder string. " w 'similarly.
        
        3. At this time, ② and ③ point to 'a', ① access the first layer of the tree and find 'a', but 'a' is not marked, so it is not a sensitive word, so 'a' is stored in the StringBuilder string. Then ② do not move, ① and ③ continue to go down until they reach the marked node or do not match, ① return, ② go down one step, ③ return to the place pointed by ② at this time. Repeat the above steps.
        4. If a sensitive word is detected, store "* * *" in StringBuilder and make ② skip this sensitive word, and ② and ③ jointly point to the next position of the original ③.
        
        5. ② ③ when it reaches the end of the string, the detection is completed. The final result is "xwa******ff".

2, Sensitive word filter

When we redevelop the project, we need to develop a reusable tool for filtering sensitive words, which becomes a sensitive word filter, so that it can be reused in the project.
There are three main steps to develop a sensitive word filter:

1. Define prefix tree

2. Initialize the prefix tree according to the sensitive words

3. Write methods for filtering sensitive words

The code implementation is as follows:

import org.apache.commons.lang3.CharUtils;
import org.apache.commons.lang3.StringUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

import javax.annotation.PostConstruct;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

@Component
public class SensitiveFilter {

    // Log
    private static final Logger logger = LoggerFactory.getLogger(SensitiveFilter.class);

    // Substitution character
    private static final String REPLACEMENT = "***";

    // Initialize root node
    private TrieNode rootNode = new TrieNode();

    /**
     * 2. Initialize the prefix tree according to the sensitive words
     */
    @PostConstruct// When the container instantiates the Bean when the server starts, and calls the Bean's construction method, the method will be called automatically
    public void init() {
        try (
                // Load the sensitive words file Txt is a self built file for storing sensitive words
                InputStream is = this.getClass().getClassLoader().getResourceAsStream("sensitive-words.txt");
                // Byte stream -- > character stream -- > character buffer stream
                BufferedReader reader = new BufferedReader(new InputStreamReader(is));
        ) {
            String keyword;
            while((keyword = reader.readLine()) != null){
                // Add to the prefix tree. addKeyword is a custom method to add a sensitive word to the prefix tree
                this.addKeyword(keyword);
            }

        } catch (IOException e) {
            logger.error("Failed to load sensitive word file:" + e.getMessage());
        }

    }

    // Encapsulation method: add a sensitive word to the prefix tree
    private void addKeyword(String keyword){
        TrieNode tempNode = rootNode;
        for (int i = 0; i < keyword.length(); i++) {
            char c = keyword.charAt(i);
            TrieNode subNode = tempNode.getSubNode(c);

            if(subNode == null){
                // If there is no such character in the child node, the child node is initialized with this character and assembled into the tree
                subNode = new TrieNode();
                tempNode.addSubNode(c,subNode);
            }

            // Point to the word node and enter the next layer cycle
            tempNode = subNode;

            // Set end ID
            if(i == keyword.length() -1){
                tempNode.setKeywordEnd(true);
            }
        }
    }

    /**
     * 3. Retrieve and filter sensitive words
     * @param text Text to be filtered
     * @return Filtered text
     */
    public String filter(String text){
        if(StringUtils.isBlank(text)){
            return null;
        }

        // Pointer ①
        TrieNode tempNode = rootNode;
        // Pointer ②
        int begin = 0;
        // Pointer ③
        int position = 0;
        // Storage results
        StringBuilder sb = new StringBuilder();

        while(position < text.length()){
            char c = text.charAt(position);

            // Skip symbol
            if(isSymbol(c)){
                // If the pointer ① is at the root node, this symbol is included in the result and the pointer ② moves down one step
                if(tempNode == rootNode){
                    sb.append(c);
                    begin++;
                }
                // Whether the symbol appears when it is not detected or when it is being detected, the pointer ③ always goes down one step
                // (when it is not detected, go down one step together with pointer ②. When it is detected, pointer ② does not move, and pointer ③ goes down one step)
                position++;
                continue;
            }

            // Check child nodes
            tempNode = tempNode.getSubNode(c);
            if(tempNode == null){
                // String starting with begin is not a sensitive word
                sb.append(text.charAt(begin));
                // Go to the next position
                begin++;
                position = begin;
                // The pointer ① returns and points to the root node again
                tempNode = rootNode;
            }else if (tempNode.isKeywordEnd()){
                // If sensitive words are found, replace the begin~position string
                sb.append(REPLACEMENT);
                // Go to the next position
                position++;
                begin = position;
                // The pointer ① returns to its position and points back to you
                tempNode = rootNode;
            }else {
                // Check next character
                position++;
            }
        }
        // Count the last batch of characters into the result: pointer ③ reaches the middle end point before pointer ②, and the string between them is not a sensitive word
        sb.append(text.substring(begin));

        return sb.toString();
    }

    // Packaging method: judge whether it is a special symbol
    private boolean isSymbol(Character c){
        // 0x2E80~0x9FFF are East Asian characters and should not be treated as special symbols
        return !CharUtils.isAsciiAlphanumeric(c) && (c < 0x2E80 || c > 0x9FFF);
    }

    /**
     * 1. Define prefix tree
     */
    private class TrieNode {

        // Sensitive word (keyword) end identifier
        private boolean isKeywordEnd = false;

        // Child node (key is a child character and value is a child node)
        private Map<Character, TrieNode> subNodes = new HashMap<>();

        public boolean isKeywordEnd() {
            return isKeywordEnd;
        }

        public void setKeywordEnd(boolean keywordEnd) {
            isKeywordEnd = keywordEnd;
        }

        // Add child node
        public void addSubNode(Character c, TrieNode node) {
            subNodes.put(c, node);
        }

        // Get child nodes
        public TrieNode getSubNode(Character c) {
            return subNodes.get(c);
        }
    }

}

Keywords: Java Spring data structure Back-end filter

Added by j0n on Tue, 18 Jan 2022 10:08:00 +0200