SAM complexity proof

Proof of the complexity of $SAM $(mostly my own understanding and opinion of the blog)

This part is my memory, which can be omitted

Recall $SAM first$

In my understanding of $SAM $, first pick a picture

Initial string $aabbabd$

First, it is found that a straight line of $s - > 9 $in the figure below is $aabbabd $and $is the original string

From here, we can see the $endpos $relationship, which is different from $AC $automata

If it is found that the endings of some substrings are the same, you can share a node. If the $endpos $of all substrings that can be represented from the starting point to this point are the same, you can obviously share this point, which is energy saving in space

Because this $SAM $is to represent all substrings or suffixes, if $endpos $is the same, that is, it will continue to extend backward at this point after the end of this state

Let's have a rough understanding first. Can we understand that if we want to insert a suffix, we need to save space. Then if a point ending in it passes through multiple substrings, we can use this point multiple times

Take $4 $in the figure below as an example. If there are three suffixes that meet the conditions (explained below), this node can be used three times in the three suffixes

The $endpos $represented by this $4 $node is only $endpos=4$

For example, why is the suffix $ab $not on a path of $s - > 7 $, which should have appeared. It is found that $ab $is on the path of $s - > 8 $, and this is separated. Why, because $endpos (ab)= 7, $should appear because his $endpos $has $7 $, but it doesn't appear because it's a new type. If it's classified into one category, it can't be satisfied. After that, it's unified at this point, so it can only be a family of its own

It doesn't matter if you have your own family. After all, the $endpos $set has repeated parts. Obviously, this $endpos $set is orderly and incremental. When a string is another string suffix, the shorter the string, the larger the $endpos $and the larger the set must contain a small set. Then this thing can be determined through a pointing relationship

The transfer edge is also equivalent to the transfer edge of $AC $automata, which is the next $endpos $set that can be reached by this $endpos $set. As mentioned above, we put all the unification that only ends at this point together, so we can start here and transfer to all the $endpos $that can be reached

Let's summarize the above and think about how this thing is constructed

Find all suffixes and insert $\ xcancel{\huge NO} one by one$

Incremental method to construct $\ checkmark$

Think about complexity in the incremental process

First, clarify what we maintain in the automata and what we change when modifying

Maintenance $len_{max},len_{min},trans,link$

$SAM can find a few things quickly. It's certainly not used here

Construct automata, suppose we insert $s at present_ k$

We need to add a character after building the previous automata, that is, there are $k-1 $more characters to represent, that is, all substrings containing the last character

First, one more node on the suffix automata means $endpos=k $. Obviously, all new strings from large to small. First, the largest $endpoz=k $, then the $endpos $of the small string may not only be $k $, but may also appear in the front. The embodiment on the automata is $tran [now] [s [k]]= 0 $, that is to say, the suffix has been expressed. At this time, look at the parts that need to be changed. First, the $endpos $of the expressed suffix has changed, and there is one more position. If there is no change in this state, you need to separate this state. It is proved above that the longer the string $endpos $is, the smaller it will be divided into two parts, the changed and the unchanged. Note $$ At this time, insert a string, either add a node or keep it unchanged, and no more nodes will be added. Then what we need to solve is the complexity of jumping and finding (to be honest, I haven't seen this carefully...)

The time complexity is related to the number of states you add new characters each time and how many times you need to jump

As proved above, the number of States is $O(2\times n)$

Put a copy of the code

//review 
//Link The link position of a string when all suffixes are transformed 
//trans Is the state to which one character is added
//There are only many strings in a state, but there is only one character on the edge of the transition
// Multiple characters on one side are suffix trees
//actually SAM The edge on the represents a state transition
//Since the same states are merged, the space is better 
#include<bits/stdc++.h>
#define MAXN 3100000
using namespace std;
string s;
int cnt[MAXN],tr[MAXN][30],len[MAXN],fa[MAXN],sz[MAXN];
int last=1,tot=1;
int ans=0;
vector<int>road[MAXN];
void add(int c)
{
     int p=last; //Location of the new point of the last increment
     int now;
     now=last=++tot; //Update new node location
     cnt[now]=1;
     len[now]=len[p]+1; //Current node maxlen,If not split, then maxlen Must be the last length+1
     for(;p&&!tr[p][c];p=fa[p]) tr[p][c]=now;
     //to update trans,If there is no such state, it will become the current state
     if(!p) fa[now]=1; //fa Used to jump link
     //Link Jump forward, the string is smaller and smaller, and the state is more and more 
     //If!p Prove all suffixes of this string(new suffix)of(endpos)Same so link Direct to 1
     else
     {
            //Start disassembly point
           int q=tr[p][c];
           //There is a transition state at this time
           //Equivalent to the original ababc Transfer, then abab plus c There is already a transition state
           if(len[q]==len[p]+1) fa[now]=q;
           //If you find this point, it happens to be last+1 of maxlen，Then it's equivalent to
           //If you have a transfer point, you can directly now of Link Just point to the past
           else
           {
                  //Vigorously remove the point
               int spilt=++tot;
               for(int i=0;i<=25;i++)
               {
                      //To break it down, break it down into one maxlen Big x And a small one y
                   //Because the smaller it is, put it in the front
                   //The small one is connected with the original one first for information replication
                   //Obviously, a state plus a character goes to a new state
                   tr[spilt][i]=tr[q][i]; 
               } 
               fa[spilt]=fa[q];
               //spilt Is small, inherited state 
               len[spilt]=len[p]+1;
               //Found that now in fact only need to change Link
               //trans Its function is to transfer existence, so it doesn't matter
               //Because there is one more position at this time
               //So from the location of the breakpoint Link Must change
               //Then change Link That's it
               //It is found that although there is this transfer edge, the state is different
               //Then you can think about both spilt
               fa[q]=fa[now]=spilt;
               for(;p&&tr[p][c]==q;p=fa[p]) tr[p][c]=spilt;
               //Change the front transfer edge 
           } 
     } 
}
void dfs(int x)
{
     for(int i=0;i<road[x].size();i++)
     {
          int y=road[x][i];
          dfs(y);
          cnt[x]+=cnt[y];
     }
     if(cnt[x]!=1) ans=max(ans,cnt[x]*len[x]);
//     cout<<cnt[x]*len[x]<<endl;
} 
int main()
{
    cin>>s;
    int len=s.size();
    s=' '+s;
    for(int i=1;i<=len;i++)
    {
        add(s[i]-'a');
    }
    for(int i=2;i<=tot;i++)
    {
        road[fa[i]].push_back(i);
        //Suffix tree 
    }
    dfs(1);
    cout<<ans;
}

Take a look at the $add $function

The rest are $O(1), $except for a few cycles. Look at these cycles

First cycle

for(;p&&!tr[p][c];p=fa[p]) tr[p][c]=now;

Each point has a $trans $. Each point is assigned at most once and shared equally. Each point is operated only once. The number of points is the number of States, $O(|S|)$

In fact, what you change is a continuous part. You will change a continuous section every time. There will never be a section that has been passed many times. Then each point will be passed at most once

Second cycle

for(int i=0;i<=25;i++)
{
//To break it down, break it down into one maxlen Big x And a small one y
//Because the smaller it is, put it in the front
//The small one is connected with the original one first for information replication
//Obviously, a state plus a character goes to a new state
tr[spilt][i]=tr[q][i];
}

If you copy one state at a time, the complexity is $O(25|S|) $, although it is a constant of $25 $

Third cycle

for(;p&&tr[p][c]==q;p=fa[p]) tr[p][c]=spilt;
//Change the front transfer edge

This seems so troublesome

First of all, this thing is to change the transfer edge. Isn't it divided into two parts now, one is the unchanged part and the other is the changed part

So changing $tran $is to get to the old state. You need to get these transfers to the old state (newly opened $split $points), which is equivalent to copying it again

What we change is all the edges connected with the old state. Then we consider all the next operations. The essence is to split this point. Then we consider the following splitting operations because the $endpos $of this point changes. It also proves that the shorter the string is, the easier it is to change. Then it is definitely not this point that is considered to change, but another point obtained by this splitting operation. Let's understand it sensibly, If you only split one point at a time, you can't put aside the easy to become and split the hard to become. That is, each node can be traversed at most once, and the edges of each node can be traversed at most once. Then the complexity is related to the number of edges

The number of sides, that is, the number of $tran $, is $O(|S|) $in the whole $SAM $(how come the proof I see is not human...)

Or consider building a spanning tree. The current $SAM $is not complete

The function of $trans $is to traverse all substrings and run back from the termination node to all substrings that end with it (running backwards). If you find that it cannot be expressed, you can add edges. If you consider adding edges, it is because $endpos $is different (if it is the same, you can run down), then you can add up to $endpos $set size edges

Then, for each termination node, run it again, adding $\ sum(|endpos|) $(that is, the sum of the $endpos $set size) and $O(n) $at most

So $trans $is also $O(n) $

Start writing from $3.7,21:00 $and have a simulation match in the middle until $3.8,13:05 $is finished

Facing a proof card for a long time ~, like zz,hhh

Keywords: string

Added by Rommeo on Tue, 08 Mar 2022 07:24:19 +0200

Programming VIP

SAM complexity proof

Popular Keywords