Filter sensitive words using prefix tree

Filter sensitive words using prefix tree

Introduction to prefix tree

Prefix tree is also called word lookup tree, Trie tree , is a kind of tree structure , is a variant of hash tree. The typical application is to count, sort and save a large number of data character String (but not limited to string), so it is often used by search engine system for text word frequency statistics. Its advantages are: using the common prefix of string to reduce the query time, minimize unnecessary string comparison, and the query efficiency is higher than that of hash tree.

It has three basic properties:

1. The root node does not contain characters, and each node except the root node contains only one character;

2. From the root node to a node, the characters passing through the path are connected to the corresponding string of the node;

3. All child nodes of each node contain different characters.

1. Import sensitive words can be stored in the database or in files

2. Defines the prefix tree using internal classes

  • isEndNode / / check whether it is the last node
  • Map < character, trienode > subnodes defines a child node. key is a child character and value is a child node
  • Add Child
  • Obtain child nodes through characters
private class TrieNode{
    //See if it's the last node
    private boolean isEndNode=false;
    //Define a child node. key is a child character and value is a child node
    private Map<Character,TrieNode> subNodes=new HashMap<>();


    //Add child node
    public void addSubNode(Character character,TrieNode trieNode)
    {
        subNodes.put(character,trieNode);
    }

    //Get child nodes by characters
    public TrieNode getSubNode(Character character)
    {
        return subNodes.get(character);

    }
    public boolean isEndNode() {
        return isEndNode;
    }

    public void setEndNode(boolean endNode) {
        isEndNode = endNode;
    }
}

3. Initialize prefix tree based on sensitive words

@ PostConstruct this annotation is used to modify a non static void () method. The method modified by @ PostConstruct will be loaded on the server Servlet And will only be executed once by the server. PostConstruct executes automatically after the constructor.

  • Get the compiled sensitive word file
  • Convert byte stream into character stream and then into buffer stream
  • Read line by line, and add the read sensitive words to the prefix tree
//Initialize prefix tree
@PostConstruct
public void init()
{
    try ( InputStream inputStream=this.getClass().getClassLoader().getResourceAsStream("sensitive-words.txt");
          BufferedReader reader=new BufferedReader(new InputStreamReader(inputStream));
          )
    {
        String word;
        while ((word=reader.readLine())!=null)
        {
            //Add sensitive words to prefix tree
            this.addKeyword(word);

        }
    } catch (Exception e) {
        logger.error("Failed to load sensitive word file"+e.getMessage());
    }
}

Add a sensitive word to the prefix tree

  • When creating a temporary node, point to the root node first
  • In character units, read one character at a time. First call getSubNode to see if there are child nodes with the same character
  • If there are no child nodes with the same characters, you need to initialize and add child nodes
  • Point the pointer to the current child node for the next cycle
  • Set the end ID. if the current node is the last node, set its ID to true
//Add a sensitive word to the prefix tree
private void addKeyword(String word)
{
    //When creating a temporary node, point to the root node first
    TrieNode TempNode=rootNode;
    for(int i=0;i<word.length();i++)
    {
        char c=word.charAt(i);
        TrieNode subNode = TempNode.getSubNode(c);
        //If the obtained child node is empty, add it. Otherwise, it indicates that there are already child nodes with the same characters
        if(subNode==null){
            subNode=new TrieNode();
            TempNode.addSubNode(c,subNode);
        }
        //Point to the child node to enter the next cycle
        TempNode=subNode;

        //Set end flag
        if(i==word.length()-1)
        {
            TempNode.setEndNode(true);
        }
        
    }
}

4. Writing methods for filtering sensitive words

  • ❤ bet ❤ Bo ❤
  • Define three pointers. Pointer 1 points to the root node. Pointer 2 points to the beginning of the text. Pointer 3 points to the beginning of the text
  • stringbuildeer stores the filtering results
  • Whether the pointer may end with the second condition to the third condition
  • If the character is a symbol and is the root node at this time, the symbol will be added to continue the next cycle
  • Get the child node judgment of the root node. If it is empty, it indicates that the word beginning with this character is not a sensitive word begin + + and adds this character to the result. If it is not empty, it indicates that it may be a sensitive word position + + to continue judgment
// Judge whether it is a symbol
private boolean isSymbol(Character c) {
    // 0x2E80~0x9FFF are East Asian characters
    return !CharUtils.isAsciiAlphanumeric(c) && (c < 0x2E80 || c > 0x9FFF);
}
/**
 * Filter sensitive words
 *
 * @param text Text to be filtered
 * @return Filtered text
 */
public String filter(String text) {
  //If position is equal to the length of text, there is no sensitive word
            if(position==text.length())
            {
                stringBuilder.append(text.charAt(begin));
                position=++begin;
            }
    //If the text is empty, it is not filtered
   if(StringUtils.isBlank(text))
   {
       return null;
   }
   //Definition pointer 1 points to the root node
    TrieNode tempNode=rootNode;
   //Definition pointer 2 points to the beginning of the text
    int begin=0;
    //Definition pointer 3 points to the beginning of the text
    int position=0;
    //Define stringbuildeer to store filtering results
    StringBuilder stringBuilder=new StringBuilder();
    //Take whether the second pointer reaches the end as the loop condition, and take the third pointer as the condition may be omitted
    while (begin<text.length())
    {
        //Gets the character of the position position
        char c=text.charAt(position);

        //Skip symbol
        //such as ❤ bet ❤ Bo ❤
        if(isSymbol(c))
        {
            //If the character is a symbol and is the root node at this time, the symbol will be added to continue the next cycle
            if(tempNode==rootNode)
            {
                position=++begin;
                stringBuilder.append(c);

            }
            // Pointer 3 goes one step down regardless of whether the symbol is at the beginning or in the middle
            position++;
            continue;
        }
        //Get the child node judgment of the root node. If it is empty, it means that the word beginning with this character is not a sensitive word begin + +, and add this character to the result
        //If it is not empty, it may be a sensitive word. position + + continue to judge
        tempNode=tempNode.getSubNode(c);
        if(tempNode==null)
        {
            position=++begin;
            stringBuilder.append(c);
            //Re point to the root node
            tempNode=rootNode;
        }else if(tempNode.isEndNode) {
            //replace
            stringBuilder.append(REPLACE_WORD);
            //Go to the next position
            begin=++position;
            //Re point to the root node
            tempNode=rootNode;
        }else {
            // Check next character
            position++;
        }
    }
    return stringBuilder.toString();

}

5. Testing

@RunWith(SpringRunner.class)
@SpringBootTest
@ContextConfiguration(classes = CommunityApplication.class)
public class SensitiveFilterTest {

    @Autowired
    private SensitiveFilter sensitiveFilter;

    @Test
    public void testSensitiveFilter() {
        String text = "You can gamble here,You can whore,Can take drugs,Can be invoiced,Ha ha ha!";
        text = sensitiveFilter.filter(text);
        System.out.println(text);

        text = "Here you can☆bet☆Bo☆,sure☆Whoring☆Whore☆,sure☆absorb☆poison☆,sure☆open☆ticket☆,Ha ha ha!";
        text = sensitiveFilter.filter(text);
        System.out.println(text);
    }

}



*****Output results
 Here you can***,sure***,sure***,sure***,Ha ha ha!
Here you can☆***☆,sure☆***☆,sure☆***☆,sure☆***☆,Ha ha ha!


Keywords: Java Spring Boot data structure Back-end

Added by quetz67 on Sun, 27 Feb 2022 05:37:12 +0200