CTC introduction
This article does not cover the background or specific applications of CTC. The basics of CTC can be learned from the following:
Hannun Awni's classic blog.
General idea
Initially, we are given an input matrix. Its rows correspond to time steps (length T), and its columns give the probabilities of the different characters at each time step (length alphabet_size, including the blank character). We also have a label sequence, which represents the correct output.
What we need to do is run a forward pass over this matrix to compute the probability of the label sequence, storing the probability of each character at each time step in the matrix α(t,u); then run the backward pass and store the results in the matrix β(t,u); and finally compute the gradient of each input using the formula we derive.
In this article, the implementation handles a single sample as an example.
Preparation
1. softmax
The input matrix we receive does not represent probabilities, so we first apply softmax to each row of the matrix.
Softmax formula:

$$S_i=\dfrac{e^{v_i}}{\sum_j e^{v_j}}$$
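As a minimal sketch (my own illustration, not part of the article's class, which assumes its input is already normalized), a row-wise softmax over the T × alphabet_size matrix might look like this; subtracting the row maximum is a common trick for numerical stability:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>
using namespace std;

// Normalize each row of the T x alphabet_size matrix in place.
void softmax_rows(vector<vector<double>>& m)
{
    for (auto& row : m)
    {
        double mx = *max_element(row.begin(), row.end()); // subtract the max for stability
        double denom = 0;
        for (double v : row) denom += exp(v - mx);
        for (double& v : row) v = exp(v - mx) / denom;
    }
}
```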
2. Label extension
The CTC algorithm requires extending the original label: a blank symbol is inserted at the beginning and end of the label and between every two characters. If the original label is {1,3,5,5,7}, the extended sequence is:
{0,1,0,3,0,5,0,5,0,7,0}. Note: the numbers here are indices into the symbol table, and index 0 is the blank character by default. A minimal sketch of this extension is shown below.
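This sketch assumes the blank index is 0; the prepare() function shown later performs the same interleaving on raw arrays:

```cpp
#include <vector>
using namespace std;

// {1,3,5,5,7} -> {0,1,0,3,0,5,0,5,0,7,0}
vector<int> with_blanks(const vector<int>& labels, int blank = 0)
{
    vector<int> out(2 * labels.size() + 1, blank);
    for (size_t i = 0; i < labels.size(); i++)
        out[2 * i + 1] = labels[i];
    return out;
}
```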
3. Taking logarithms
During the computation of the CTC loss, a large number of probabilities are multiplied together. These floating-point values are often tiny, so underflow is likely. To keep the computation numerically stable, we first take the logarithm of these probabilities and convert multiplication and addition of probabilities into log-domain operations:
Multiplication of probabilities:

$$\log(a\cdot b)=\log(a)+\log(b)$$
Addition of probabilities:

$$\log(a+b)=\log(a)+\log\left(1+e^{\log(b)-\log(a)}\right)$$
Special care is needed because log(0) cannot be represented, and the computation involves many operations on log(0). So we use -INFINITY for it, and all log-domain values must be initialized to -INFINITY, which represents log(0).

```cpp
// log(a+b) = log(a) + log(1 + exp(log(b) - log(a)))
double log_sum(double a, double b)
{
    if (a == -INFINITY) return b;   // a represents probability 0
    if (b == -INFINITY) return a;
    return a + log(1 + exp(b - a));
}

// log(a*b) = log(a) + log(b); if either factor is probability 0,
// the product is 0, i.e. -INFINITY in the log domain.
double log_prod(double a, double b)
{
    if (a == -INFINITY || b == -INFINITY) return -INFINITY;
    return a + b;
}
```
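A quick numeric sanity check of the two helpers (my own example, not from the article): adding 0.2 and 0.3 in the log domain should recover log(0.5), and a factor of probability 0 should annihilate a product.

```cpp
#include <cassert>
#include <cmath>

int main()
{
    double a = log(0.2), b = log(0.3);
    assert(fabs(log_sum(a, b) - log(0.5)) < 1e-12);         // log-domain addition
    assert(fabs(log_prod(a, b) - log(0.2 * 0.3)) < 1e-12);  // log-domain multiplication
    assert(log_sum(-INFINITY, a) == a);                     // adding probability 0 changes nothing
    assert(log_prod(-INFINITY, a) == -INFINITY);            // multiplying by probability 0 gives 0
    return 0;
}
```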
4. Relevant code
For the concrete implementation, I mainly follow the work of Baidu, which open-sourced its warp-ctc code in 2016: warp_ctc.
Data structure (class) design:
```cpp
class cpu_ctc
{
public:
    cpu_ctc(vector<int> labels, vector<vector<double>> Prob);
    void compute();
    double compute_alphas();
    double compute_betas();
    void compute_gradients();

private:
    int* s_inc;
    int* e_inc;
    int blank_label = 0;   // Index of the blank character in the character table; defaults to 0.
    double prob_;          // Probability computed by the forward pass (in log form).
    int L, S, T;           // L: label sequence length; S: extended sequence length; T: number of input time steps.
    int repeats;           // Number of repeated labels in the label sequence.
    vector<vector<double>> alphas;     // Forward probabilities.
    vector<vector<double>> betas;      // Backward probabilities.
    vector<vector<double>> gradients;  // Gradients.
    vector<vector<double>> Prob;
    vector<vector<double>> tmp_alphas; // Forward probabilities in the log domain.
    vector<vector<double>> tmp_betas;  // Backward probabilities in the log domain.
    vector<vector<double>> tmp_gradients;
    int* labels_with_blanks; // Note: a repeated character effectively counts as one extra label,
                             // since a blank must be emitted between the repeats!
    int prepare(vector<int> labels, vector<vector<double>> Prob);
};
```
The precise meaning of the uncommented members will be explained in detail later.
Initialization work:
```cpp
// Initialize the data members in the constructor; the code is straightforward.
cpu_ctc::cpu_ctc(vector<int> labels, vector<vector<double>> Prob)
{
    // Copy the input probability matrix.
    for (int i = 0; i < Prob.size(); i++)
    {
        vector<double> single;
        for (int j = 0; j < Prob[i].size(); j++)
            single.push_back(Prob[i][j]);
        this->Prob.push_back(single);
    }
    // The alpha and beta arrays hold ordinary probabilities, initialized to 0.
    for (int i = 0; i < Prob.size(); i++)
    {
        vector<double> tmp;
        for (int j = 0; j < 2 * labels.size() + 1; j++)
            tmp.push_back(0);
        this->alphas.push_back(tmp);
        this->betas.push_back(tmp);
    }
    // tmp_alphas and tmp_betas hold log-domain probabilities, initialized to -INFINITY.
    for (int i = 0; i < Prob.size(); i++)
    {
        vector<double> tmp;
        for (int j = 0; j < 2 * labels.size() + 1; j++)
            tmp.push_back(-INFINITY);
        this->tmp_alphas.push_back(tmp);
        this->tmp_betas.push_back(tmp);
    }
    // Initialize the gradient arrays.
    for (int i = 0; i < Prob.size(); i++)
    {
        vector<double> tmp;
        vector<double> tmp_;
        for (int j = 0; j < Prob[0].size(); j++)
        {
            tmp.push_back(0);
            tmp_.push_back(-INFINITY);
        }
        this->tmp_gradients.push_back(tmp_);
        this->gradients.push_back(tmp);
    }
    L = labels.size();
    T = Prob.size();
    repeats = this->prepare(labels, Prob);
}
```
prepare function:
The prepare function has two main purposes:
1. Generate the s_inc and e_inc arrays; these two arrays are used when computing the forward and backward probabilities.
2. Extend the labels with blank symbols.
```cpp
// Mainly follows Baidu's open-source warp-ctc code.
int cpu_ctc::prepare(vector<int> labels, vector<vector<double>> Prob)
{
    S = 2 * labels.size() + 1;
    s_inc = (int*)malloc(sizeof(int) * (labels.size() * 2 + 1));
    e_inc = (int*)malloc(sizeof(int) * (labels.size() * 2 + 1));
    labels_with_blanks = (int*)malloc(sizeof(int) * (labels.size() * 2 + 1));
    int s_counter = 0;
    int e_counter = 0;
    this->s_inc[s_counter++] = 1;
    int repeats = 0;
    for (int i = 1; i < labels.size(); i++)
    {
        if (labels[i - 1] == labels[i])  // repeated character
        {
            repeats++;
            s_inc[s_counter++] = 1;
            s_inc[s_counter++] = 1;
            e_inc[e_counter++] = 1;
            e_inc[e_counter++] = 1;
        }
        else
        {
            this->s_inc[s_counter++] = 2;
            this->e_inc[e_counter++] = 2;
        }
    }
    this->e_inc[e_counter++] = 1;
    // Interleave blanks with the original labels.
    for (int i = 0; i < labels.size(); i++)
    {
        this->labels_with_blanks[2 * i] = this->blank_label;
        this->labels_with_blanks[2 * i + 1] = labels[i];
    }
    labels_with_blanks[2 * labels.size()] = this->blank_label;
    return repeats;
}
```
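For concreteness, here is a hand trace of prepare() on the earlier example label (my own trace, not from the original article):

```cpp
// prepare() on labels = {1, 3, 5, 5, 7}  =>  S = 11, repeats = 1
//   labels_with_blanks = {0, 1, 0, 3, 0, 5, 0, 5, 0, 7, 0}
//   s_inc = {1, 2, 2, 1, 1, 2}   // step sizes by which `start` advances as time runs out
//   e_inc = {2, 2, 1, 1, 2, 1}   // step sizes by which `end` advances in the early steps
// The 1-entries around the repeated 5 mean the valid window can only move one
// slot at a time there, because a blank must be emitted between the two 5s.
```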
For a more detailed understanding, refer to these two articles:
CTC implementation - compute ctc loss (1).
CTC implementation - compute ctc loss (2).
Forward calculation
Principle
Forward propagation has two main tasks: one is to compute the probability of the label $l$ given the input $x$, i.e. $p(l|x)$; the other is to store the forward variables $\alpha(t,u)$ that are needed later for the gradient. A sequence $l$ may be produced through many different path mappings, and as the length of $l$ grows, the number of corresponding paths grows exponentially, so we need an efficient algorithm.
Let the length of the original label be $U$, so the extended sequence $l'$ has length $2U+1$. For a specific sequence $l$ we define the forward variable

$$\alpha(t,u)=\sum_{\pi\in V(t,u)}\prod_{i=1}^{t}y_{\pi_i}^{i}$$
where $V(t,u)$ denotes the set of all paths of length $t$ that map to $l$ and whose output at step $t$ is $l'_u$.
Since every state of $\alpha(t,u)$ depends only on transitions from states at the previous time step, it can be solved with a dynamic programming algorithm, and the $\alpha$ array has size $(2U+1)\cdot T$.
Initial state:
Every valid path can only begin with a blank or with the first character of the label, so the initialization is:
$$\alpha(1,1)=y_{0}^{1}\\ \alpha(1,2)=y_{1}^{1}\\ \alpha(1,u)=0 \quad u>2$$
Recurrence relation:
$$\alpha(t,u)=y_{l'_u}^{t}\sum_{i=f(u)}^{u}\alpha(t-1,i)$$
where:

$$f(u)=\left\{\begin{array}{ll} u-1 & \text{if } l'_{u}=\mathrm{blank} \ \text{or} \ l'_{u-2}=l'_{u}\\ u-2 & \text{otherwise} \end{array}\right.$$
That is, when the current character is a blank, or is identical to the character two positions back in the extended sequence (in which case the path must pass through the blank between them), the current state can only be reached from two states of the previous time step: the same position and the one before it. In the extended label {0,1,0,3,0,5,0,5,0,7,0}, for example, the second 5 can be reached only from itself or from the blank before it, never directly from the first 5.
In all other cases the state can be reached from three states.
Finally, since a valid path must end in the last blank or the last character of the extended sequence:

$$p(l|x)=\alpha(T-1,S-1)+\alpha(T-1,S-2)$$

where $S=2U+1$ is the extended length (indices are 0-based).
Concrete implementation
Note that there are both ordinary probability calculations, i.e. those on the alphas array, and log-domain calculations, i.e. those on tmp_alphas.
The logic again follows Baidu's implementation.
```cpp
double cpu_ctc::compute_alphas()
{
    int start = ((L + repeats - T) < 0) ? 0 : 1,
        end = (L * 2 + 1) > 1 ? 2 : 1;
    for (int i = start; i < end; i++)
    {
        alphas[0][i] = Prob[0][labels_with_blanks[i]];
        tmp_alphas[0][i] = log(Prob[0][labels_with_blanks[i]]);
    }
    for (int t = 1; t < T; ++t)  // iterate over the time steps
    {
        // Difference between the symbols still to be emitted and the remaining time;
        // once it is non-negative, the start of the valid window must advance.
        int remain = L + repeats - (T - t);
        if (remain >= 0)
            start += this->s_inc[remain];
        if (t <= L + repeats)  // after t > L + repeats, end has reached the last position
            end += e_inc[t - 1];
        int startloop = start;
        if (start == 0)
        {
            alphas[t][0] = alphas[t - 1][0] * Prob[t][blank_label];  // probability multiplication
            tmp_alphas[t][0] = tmp_alphas[t - 1][0] + log(Prob[t][blank_label]);
            startloop += 1;
        }
        for (int i = startloop; i < end; i++)
        {
            double prev_sum = alphas[t - 1][i] + alphas[t - 1][i - 1];
            double tmp_sum = log_sum(tmp_alphas[t - 1][i], tmp_alphas[t - 1][i - 1]);
            // A third incoming state exists unless the current symbol is a blank
            // or repeats the symbol two positions back.
            if (labels_with_blanks[i] != blank_label && i != 1 &&
                labels_with_blanks[i - 2] != labels_with_blanks[i])
            {
                prev_sum += alphas[t - 1][i - 2];
                tmp_sum = log_sum(tmp_sum, tmp_alphas[t - 1][i - 2]);
            }
            alphas[t][i] = prev_sum * Prob[t][labels_with_blanks[i]];
            tmp_alphas[t][i] = tmp_sum + log(Prob[t][labels_with_blanks[i]]);
        }
    }
    double score = 0;
    for (int i = start; i < end; i++)
        score += alphas[T - 1][i];
    double tmp_score = -INFINITY;
    for (int i = start; i < end; i++)
        tmp_score = log_sum(tmp_score, tmp_alphas[T - 1][i]);
    prob_ = tmp_score;  // store the log-domain score; more stable than log(score)
    return score;
}
```
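As a usage sketch (my own toy example; the numbers and sizes are made up), construct the class with a tiny pre-normalized T = 3, alphabet-size-3 matrix and the label {1,2}, and read off the forward score:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>
using namespace std;

// Assumes cpu_ctc and the log-domain helpers defined above are in scope.
int main()
{
    // T = 3 time steps, alphabet = {blank, 1, 2}; each row already sums to 1.
    vector<vector<double>> Prob = {
        {0.2, 0.5, 0.3},
        {0.4, 0.3, 0.3},
        {0.1, 0.2, 0.7},
    };
    vector<int> labels = {1, 2};  // extended internally to {0,1,0,2,0}

    cpu_ctc ctc(labels, Prob);
    double p = ctc.compute_alphas();               // p(l|x)
    printf("p(l|x) = %f, loss = %f\n", p, -log(p));
    return 0;
}
```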
Backward calculation
Principle
The main task of the backward pass is to compute and store the $\beta(t,u)$ array, and to use it together with the forward probability to verify the final probability.
where:

$$\beta(t,u)=\sum_{\pi\in W(t,u)}\prod_{i=1}^{T-t-1}y_{\pi_i}^{i+t}$$
$W(t,u)$ denotes the set of suffixes, from time $t$ to the end, of all valid paths whose output at time $t$ is $l'_u$.
So $\beta(t,u)$ is the sum of the probabilities of all valid paths (those that map to $l$) from time $t+1$ to the end of the sequence, excluding time $t$.
In this way, the value of $\alpha(t,u)\cdot\beta(t,u)$ is the sum of the probabilities of all valid paths that output $l'_u$ at time step $t$.
The backward array can be understood as the reverse of the forward process, and its overall structure is similar.
Initialization:

$$\beta(T-1,U')=\beta(T-1,U'-1)=1\\ \beta(T-1,u)=0\quad u<U'-1$$
The recurrence formula is:
$$\beta(t,u)=y_{l'_u}^{t}\sum_{i=u}^{g(u)}\beta(t+1,i)$$
where:

$$g(u)=\left\{\begin{array}{ll} u+1 & \text{if } l'_{u}=\mathrm{blank} \ \text{or} \ l'_{u+2}=l'_{u}\\ u+2 & \text{otherwise} \end{array}\right.$$
Concrete implementation
The implementation of the backward pass closely mirrors the forward pass; it is essentially its inverse.
Both the ordinary-probability and the log-domain versions are implemented:
```cpp
double cpu_ctc::compute_betas()
{
    int start = S > 1 ? (S - 2) : 0,
        end = (T > (S / 2) + repeats) ? S : S - 1;
    for (int i = start; i < end; i++)
    {
        betas[T - 1][i] = Prob[T - 1][labels_with_blanks[i]];
        tmp_betas[T - 1][i] = log(Prob[T - 1][labels_with_blanks[i]]);
    }
    for (int t = T - 2; t >= 0; t--)
    {
        int remain = (S / 2) + repeats - (T - t);
        if (remain >= -1)
            start -= s_inc[remain + 1];
        if (t < (S / 2) + repeats)
            end -= e_inc[t];
        int endloop = end;
        for (int i = start; i < endloop; ++i)
        {
            double next_sum = 0;
            double tmp_sum = -INFINITY;
            if (i == S - 1)
            {
                next_sum = betas[t + 1][i];
                tmp_sum = tmp_betas[t + 1][i];
            }
            else
            {
                next_sum = betas[t + 1][i] + betas[t + 1][i + 1];
                tmp_sum = log_sum(tmp_betas[t + 1][i], tmp_betas[t + 1][i + 1]);
            }
            // A third successor exists unless the current symbol is a blank
            // or is repeated two positions ahead.
            if (labels_with_blanks[i] != blank_label && i != (S - 2) &&
                labels_with_blanks[i] != labels_with_blanks[i + 2])
            {
                next_sum += betas[t + 1][i + 2];
                tmp_sum = log_sum(tmp_sum, tmp_betas[t + 1][i + 2]);
            }
            betas[t][i] = next_sum * Prob[t][labels_with_blanks[i]];
            tmp_betas[t][i] = tmp_sum + log(Prob[t][labels_with_blanks[i]]);
        }
    }
    double score = 0;
    double tmp_score = -INFINITY;
    for (int i = start; i < end; i++)
    {
        score += betas[0][i];
        tmp_score = log_sum(tmp_score, tmp_betas[0][i]);
    }
    // Divide out the emission probability at time t, since alpha * beta
    // would otherwise count y at time t twice.
    for (int i = 0; i < tmp_betas.size(); i++)
        for (int j = 0; j < tmp_betas[0].size(); j++)
            if (tmp_betas[i][j] != -INFINITY)
                tmp_betas[i][j] -= log(Prob[i][labels_with_blanks[j]]);
    for (int i = 0; i < betas.size(); i++)
        for (int j = 0; j < betas[i].size(); j++)
            betas[i][j] = betas[i][j] / Prob[i][labels_with_blanks[j]];
    return score;
}
```
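As a consistency check (my own addition, continuing the toy example from the forward section), the forward and backward passes should agree on the total probability:

```cpp
// Continuing the toy main() above (add #include <cassert> for the assert):
double p_fwd = ctc.compute_alphas();
double p_bwd = ctc.compute_betas();
assert(fabs(p_fwd - p_bwd) < 1e-9);  // both equal p(l|x), up to floating-point error
```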
Note that after computing the beta array we divide it by Prob[i][labels_with_blanks[j]], because otherwise the product of the alpha and beta arrays would contain the emission probability $y_{k}^{t}$ at time $t$ twice.
Loss function
The loss function of CTC is defined as follows:
$$L(S)=-\ln\prod_{(x,z)\in S}p(z|x)=-\sum_{(x,z)\in S}\ln p(z|x)$$
where $S$ is the training set. The loss can be interpreted as follows: given the inputs and their label sequences, it measures the probability of producing the correct output sequences. To make the computation convenient we take the negative logarithm of this probability; minimizing the negative log loss is then equivalent to maximizing the output probability. For the single sample handled in this implementation, the loss is simply $-\ln p(z|x)$, i.e. the negative of the log-domain score prob_ computed in the forward pass.
As mentioned earlier, for a given $t$ and $u$, the product $\alpha(t,u)\,\beta(t,u)$ has a concrete meaning. Specifically:
$$\alpha(t,u)\,\beta(t,u)=\sum_{\pi\in X(t,u)}\prod_{t=1}^{T}y_{\pi_t}^{t}$$
$X(t,u)$ denotes the set of all paths that pass through $l'_u$ at time $t$.
Further, this can be rewritten as

$$\alpha(t,u)\,\beta(t,u)=\sum_{\pi\in X(t,u)} p(\pi|x)$$
Therefore, at any time $t$, the probability can be computed as:

$$p(z|x)=\sum_{u=0}^{|z'|-1}\alpha(t,u)\,\beta(t,u)$$
Its meaning: at any time step, the sum of the products $\alpha\beta$ over every position of the extended sequence, from the first to the last, equals the probability $p(z|x)$. This is easy to see intuitively once the meaning of the formulas above is clear: at any time $t$, every valid path passes through exactly one $l'_u$, so each path's probability is counted exactly once in the sum, and the sum of all the products is the final probability. This is also easily verified in the program, as sketched below.
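A sketch of that check (my own, hypothetical member function; it assumes compute_alphas() and compute_betas() have already run, so betas has already been divided by the emission probabilities):

```cpp
// Hypothetical member function: verify that the sum of alpha(t,u) * beta(t,u)
// over all positions u of the extended sequence is the same p(z|x) at every t.
void cpu_ctc::check_alpha_beta(double p_zx)
{
    for (int t = 0; t < T; t++)
    {
        double sum = 0;
        for (int u = 0; u < S; u++)
            sum += alphas[t][u] * betas[t][u];
        assert(fabs(sum - p_zx) < 1e-9);  // the same p(z|x) at every time step
    }
}
```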
Gradient calculation
Having followed the forward and backward passes, you probably understand how these quantities are computed, but may not yet see what all these intermediate results are for. In fact, everything we have done so far is preparation for computing the final gradient with respect to the input.
(to be continued...)