P5357 "[template] AC automata (secondary enhanced version)"

1. Title

Title Link: P5357 "[template] AC automata (secondary enhanced version)" .

Title Description

Give you a text string s and N pattern strings T_{1..n}, please find out each mode string t separately_ I) number of occurrences in S.

Input format

The first line contains a positive integer n, which indicates the number of pattern strings.

Next, line n, line I contains a string t composed of lowercase English letters_ i​ .

The last line contains a string S composed of lowercase English letters.

The data does not guarantee that any two mode strings are different.

Output format

The output contains n lines, where line i contains a non negative integer representing T_i number of occurrences in S.

Sample input and output

Enter #1

5
a
bb
aa
abaa
abaaa
abaaabaa

Output #1

6
0
3
2
1

Description / tips

1 \leq n \leq 2 \times 10^5,T_ The total length of {1.. n} does not exceed 2 \ times 10 ^ 5, and the length of S does not exceed 2 \times 10^6.

2. Problem solving

analysis

  • Ordinary queries are obviously not good (TLE), so we need to consider how to optimize ordinary queries. The main reason for tle caused by ordinary queries is the recursive jump when jumping the fail pointer. For strings like aaaa \cdots aaaa, it is equivalent to recursively jumping the fail pointer every time a character is searched forward, and each jump of fail only leads to the depth minus 1, resulting in the worst time complexity of O(n*m) (where n is the main string length and M is the pattern string length).
  • Since the whole dictionary tree is determined and the fail pointer is also determined, that is, if a node is updated, the nodes in the dictionary tree that can match the true suffix of the corresponding string of the node are determined (i.e. the nodes reached by recursive jump fail), and will be updated in the process of recursive jump fail. In this case, we can not be in such a hurry to update all nodes, but consider marking the nodes that match the longest string. Finally, after processing, we can uniformly recursively jump fail to update all matching nodes.
  • After matching a complete main string, all the nodes of the longest string are marked. Which nodes should start recursively jumping the fail pointer at this time? Note that the process of recursively jumping the fail pointer is essentially a process from the leaf node of the tree to the root node. Here, the tree refers to the directed tree built by the fail pointer, and the root node is the root node of the dictionary tree (because fail[0] = 0). Therefore, for the directed tree constructed by the fail pointer, the in degree and out degree of the leaf node are 0 and 1 (the position pointed to by the fail pointer of a node is fixed and unique). The first thing we need to deal with is all the leaf nodes, and then the parent node pointed to by the leaf node, that is, the parent node can be processed only after all the child nodes associated with the input edge of the parent node are processed, And so on, up to the root node, which is the order of topological sorting.

According to this idea, the worst complexity after optimization is reduced to O(n+m).

code

#include <bits/stdc++.h>
using namespace std;

// AC automata
struct Automaton {
    #ifndef _AUTOMATON_
    #define ll int
    #define MAXN 400005
    #define MAXCHAR 26
    #endif
    ll cnt; 
    ll trie[MAXN][MAXCHAR];
    ll exist[MAXN];
    ll fail[MAXN];
    ll mark[MAXN];          // Mode string number
    ll in[MAXN];            // Penetration
    Automaton(): cnt(0) {}
    // Insert string
    void insert(char *str, ll i) {
        ll p = 0;
        for(ll i = 0; str[i]; ++i) {
            ll c = str[i] - 'a';
            if(!trie[p][c]) {
                trie[p][c] = ++cnt;
            }
            p = trie[p][c];
        }
        mark[i] = p;
    }
    // Build mismatch pointer
    void build() {
        queue <ll> q;
        for(ll i = 0; i < MAXCHAR; ++i) {
            if(trie[0][i]) {
                q.push(trie[0][i]);
            }
        }
        // BFS 
        while(q.size()) {
            ll p = q.front();
            q.pop();
            for(ll i = 0; i < MAXCHAR; ++i) {
                if(trie[p][i]) {
                    fail[trie[p][i]] = trie[fail[p]][i];
                    ++in[trie[fail[p]][i]];
                    q.push(trie[p][i]);
                } else {
                    trie[p][i] = trie[fail[p]][i];
                }
            }
        }
    }
    // Match main string
    void match(char *str) {
        ll p = 0;
        for(ll i = 0; str[i]; ++i) {
            ll c = str[i] - 'a';
            ll u = trie[p][c];
            ++exist[u];
            p = trie[p][c];
        }
    }
    // Calculate the number of occurrences of all mode strings
    void solve() {
        queue <ll> q;
        for(ll i = 1; i <= cnt; ++i) {
            if(!in[i])  q.push(i);
        }
        while(q.size()) {
            ll p = q.front();
            q.pop();
            if(fail[p]) {
                exist[fail[p]] += exist[p];
                --in[fail[p]];
                if(!in[fail[p]])    q.push(fail[p]);
            }
        }
    }
};


int main()
{
    ll n;
    static char str[MAXN*10];
    static Automaton ac;
    scanf("%d", &n);
    for(ll i = 0; i < n; ++i) {
        scanf("%s", str);
        ac.insert(str, i);
    }
    scanf("%s", str);
    ac.build();
    ac.match(str);
    ac.solve();
    for(ll i = 0; i < n; ++i) {
        printf("%d\n", ac.exist[ac.mark[i]]);
    }
    return 0;
}

Added by Toadums on Wed, 02 Mar 2022 14:44:04 +0200