Filter sensitive words using prefix tree
Introduction to prefix tree
Prefix tree is also called word lookup tree, Trie tree , is a kind of tree structure , is a variant of hash tree. The typical application is to count, sort and save a large number of data character String (but not limited to string), so it is often used by search engine system for text word frequency statistics. Its advantages are: using the common prefix of string to reduce the query time, minimize unnecessary string comparison, and the query efficiency is higher than that of hash tree.
It has three basic properties:
1. The root node does not contain characters, and each node except the root node contains only one character;
2. From the root node to a node, the characters passing through the path are connected to the corresponding string of the node;
3. All child nodes of each node contain different characters.
1. Import sensitive words can be stored in the database or in files
2. Defines the prefix tree using internal classes
- isEndNode / / check whether it is the last node
- Map < character, trienode > subnodes defines a child node. key is a child character and value is a child node
- Add Child
- Obtain child nodes through characters
private class TrieNode{ //See if it's the last node private boolean isEndNode=false; //Define a child node. key is a child character and value is a child node private Map<Character,TrieNode> subNodes=new HashMap<>(); //Add child node public void addSubNode(Character character,TrieNode trieNode) { subNodes.put(character,trieNode); } //Get child nodes by characters public TrieNode getSubNode(Character character) { return subNodes.get(character); } public boolean isEndNode() { return isEndNode; } public void setEndNode(boolean endNode) { isEndNode = endNode; } }
3. Initialize prefix tree based on sensitive words
@ PostConstruct this annotation is used to modify a non static void () method. The method modified by @ PostConstruct will be loaded on the server Servlet And will only be executed once by the server. PostConstruct executes automatically after the constructor.
- Get the compiled sensitive word file
- Convert byte stream into character stream and then into buffer stream
- Read line by line, and add the read sensitive words to the prefix tree
//Initialize prefix tree @PostConstruct public void init() { try ( InputStream inputStream=this.getClass().getClassLoader().getResourceAsStream("sensitive-words.txt"); BufferedReader reader=new BufferedReader(new InputStreamReader(inputStream)); ) { String word; while ((word=reader.readLine())!=null) { //Add sensitive words to prefix tree this.addKeyword(word); } } catch (Exception e) { logger.error("Failed to load sensitive word file"+e.getMessage()); } }
Add a sensitive word to the prefix tree
- When creating a temporary node, point to the root node first
- In character units, read one character at a time. First call getSubNode to see if there are child nodes with the same character
- If there are no child nodes with the same characters, you need to initialize and add child nodes
- Point the pointer to the current child node for the next cycle
- Set the end ID. if the current node is the last node, set its ID to true
//Add a sensitive word to the prefix tree private void addKeyword(String word) { //When creating a temporary node, point to the root node first TrieNode TempNode=rootNode; for(int i=0;i<word.length();i++) { char c=word.charAt(i); TrieNode subNode = TempNode.getSubNode(c); //If the obtained child node is empty, add it. Otherwise, it indicates that there are already child nodes with the same characters if(subNode==null){ subNode=new TrieNode(); TempNode.addSubNode(c,subNode); } //Point to the child node to enter the next cycle TempNode=subNode; //Set end flag if(i==word.length()-1) { TempNode.setEndNode(true); } } }
4. Writing methods for filtering sensitive words
- ❤ bet ❤ Bo ❤
- Define three pointers. Pointer 1 points to the root node. Pointer 2 points to the beginning of the text. Pointer 3 points to the beginning of the text
- stringbuildeer stores the filtering results
- Whether the pointer may end with the second condition to the third condition
- If the character is a symbol and is the root node at this time, the symbol will be added to continue the next cycle
- Get the child node judgment of the root node. If it is empty, it indicates that the word beginning with this character is not a sensitive word begin + + and adds this character to the result. If it is not empty, it indicates that it may be a sensitive word position + + to continue judgment
// Judge whether it is a symbol private boolean isSymbol(Character c) { // 0x2E80~0x9FFF are East Asian characters return !CharUtils.isAsciiAlphanumeric(c) && (c < 0x2E80 || c > 0x9FFF); }
/** * Filter sensitive words * * @param text Text to be filtered * @return Filtered text */ public String filter(String text) { //If position is equal to the length of text, there is no sensitive word if(position==text.length()) { stringBuilder.append(text.charAt(begin)); position=++begin; } //If the text is empty, it is not filtered if(StringUtils.isBlank(text)) { return null; } //Definition pointer 1 points to the root node TrieNode tempNode=rootNode; //Definition pointer 2 points to the beginning of the text int begin=0; //Definition pointer 3 points to the beginning of the text int position=0; //Define stringbuildeer to store filtering results StringBuilder stringBuilder=new StringBuilder(); //Take whether the second pointer reaches the end as the loop condition, and take the third pointer as the condition may be omitted while (begin<text.length()) { //Gets the character of the position position char c=text.charAt(position); //Skip symbol //such as ❤ bet ❤ Bo ❤ if(isSymbol(c)) { //If the character is a symbol and is the root node at this time, the symbol will be added to continue the next cycle if(tempNode==rootNode) { position=++begin; stringBuilder.append(c); } // Pointer 3 goes one step down regardless of whether the symbol is at the beginning or in the middle position++; continue; } //Get the child node judgment of the root node. If it is empty, it means that the word beginning with this character is not a sensitive word begin + +, and add this character to the result //If it is not empty, it may be a sensitive word. position + + continue to judge tempNode=tempNode.getSubNode(c); if(tempNode==null) { position=++begin; stringBuilder.append(c); //Re point to the root node tempNode=rootNode; }else if(tempNode.isEndNode) { //replace stringBuilder.append(REPLACE_WORD); //Go to the next position begin=++position; //Re point to the root node tempNode=rootNode; }else { // Check next character position++; } } return stringBuilder.toString(); }
5. Testing
@RunWith(SpringRunner.class) @SpringBootTest @ContextConfiguration(classes = CommunityApplication.class) public class SensitiveFilterTest { @Autowired private SensitiveFilter sensitiveFilter; @Test public void testSensitiveFilter() { String text = "You can gamble here,You can whore,Can take drugs,Can be invoiced,Ha ha ha!"; text = sensitiveFilter.filter(text); System.out.println(text); text = "Here you can☆bet☆Bo☆,sure☆Whoring☆Whore☆,sure☆absorb☆poison☆,sure☆open☆ticket☆,Ha ha ha!"; text = sensitiveFilter.filter(text); System.out.println(text); } } *****Output results Here you can***,sure***,sure***,sure***,Ha ha ha! Here you can☆***☆,sure☆***☆,sure☆***☆,sure☆***☆,Ha ha ha!