Explanation of BILSTM-CRF code

BILSTM-CRF code

Code from: Named Entity Recognition (NER): introduction to the principle of BiLSTM-CRF + PyTorch tutorial code parsing
Part I: Imports
1. The torch.nn package mainly contains the Modules used to build each layer (fully connected, two-dimensional convolution, pooling, etc.); it also contains a series of useful loss functions.
2. The torch.optim package mainly contains the optimization algorithms used to update parameters, such as SGD, AdaGrad, RMSProp and Adam.
3. torch.autograd is imported as autograd; it is used later for autograd.Variable in the Viterbi decoder.
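For reference, a minimal version of the import block the rest of the code assumes (a sketch reconstructed from the three modules listed above):

import torch
import torch.autograd as autograd  # used later for autograd.Variable in _viterbi_decode
import torch.nn as nn              # layers such as Embedding, LSTM, Linear, plus loss functions
import torch.optim as optim        # optimizers such as SGD, AdaGrad, RMSProp, Adam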

Part II: Setting the random seed
This fixes the seed of the random number generator, so the same random numbers (and therefore the same random initializations) are produced on every run.
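In code this is a single call; a sketch, assuming the seed value 1:

torch.manual_seed(1)  # fix PyTorch's random number generator so every run produces the same random numbers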

1. Set the tags and the BiLSTM hidden size
First, determine the tag set: B, I, O plus START and STOP, 5 tags in total.
The hidden dimension of the BiLSTM is 4; since it is bidirectional, each direction gets 2.

START_TAG = "<START>"
STOP_TAG = "<STOP>"
EMBEDDING_DIM = 5  # the embedding dimension is 5; it happens to equal the number of tags (B / I / O / START / STOP)
HIDDEN_DIM = 4  # the hidden dimension of the BiLSTM; because it is bidirectional, each direction gets HIDDEN_DIM // 2 = 2

2. Dataset
Two sentences are used here for testing; each word in each sentence comes with its correct tag.

training_data = [(
    "the wall street journal reported today that apple corporation made money".split(),
    "B I I I O O O B I O O".split()
), (
    "georgia tech is a university in georgia".split(),
    "B I O O O O B".split()
)]

3. Build the vocabulary: go through the words in the dataset and index each distinct word.
A dictionary word_to_ix stores every word that appears in the sentences.
We iterate over the training data, unpacking each example into sentence and tags; for example, the first iteration yields the sentence "the wall ... money" and its tags "B I ... O".
Only the words themselves are stored here; the tags are ignored. The code:

word_to_ix = {}
for sentence, tags in training_data:
    for word in sentence:
        if word not in word_to_ix:  # only add the word if it has not been seen before
            word_to_ix[word] = len(word_to_ix)

if word not in word_to_ix:
    word_to_ix[word] = len(word_to_ix)

This means: whenever a word is not yet in word_to_ix, it is added and assigned the current length of the dictionary as its index, so indices are handed out as 0, 1, 2, ... in order of first appearance.

The final result is stored in word_to_ix:
word_to_ix: {'the': 0, 'wall': 1, 'street': 2, 'journal': 3, 'reported': 4, 'today': 5, 'that': 6, 'apple': 7, 'corporation': 8, 'made': 9, 'money': 10, 'georgia': 11, 'tech': 12, 'is': 13, 'a': 14, 'university': 15, 'in': 16}

4. Save the 5 tags in the tag_to_ix dictionary.

tag_to_ix = {"B": 0, "I": 1, "O": 2, START_TAG: 3, STOP_TAG: 4}

5. A few small utility functions

def prepare_sequence(seq, to_ix):
    idxs = [to_ix[w] for w in seq]  # convert each word w of the word sequence [w1, w2, w3, ...] into its vocabulary index to_ix[w]
    return torch.tensor(idxs, dtype=torch.long)

def argmax(vec):
    # return the index of the maximum value along dimension 1
    _, idx = torch.max(vec, 1)
    return idx.item()

def log_sum_exp(vec):
    max_score = vec[0, argmax(vec)]  # max_score is a scalar: the largest entry of vec
    max_score_broadcast = max_score.view(1, -1).expand(1, vec.size()[1])  # broadcast it to shape 1 x tagset_size (1 x 5)
    return max_score + torch.log(torch.sum(torch.exp(vec - max_score_broadcast)))
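log_sum_exp subtracts the maximum before exponentiating so that exp() cannot overflow; the result is mathematically equal to log(sum(exp(vec))). A quick sanity check (a sketch; torch.logsumexp is PyTorch's built-in equivalent):

vec = torch.randn(1, 5)
print(log_sum_exp(vec))             # the hand-rolled, numerically stable version above
print(torch.logsumexp(vec, dim=1))  # built-in equivalent; the two values should match up to floating-point error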

6. Next, instantiate the BiLSTM-CRF model:

model = BiLSTM_CRF(len(word_to_ix), tag_to_ix, EMBEDDING_DIM, HIDDEN_DIM)

Part III: the BiLSTM-CRF model
Let's take a look at the BiLSTM_CRF model class first:

class BiLSTM_CRF(nn.Module):
    def __init__(self, vocab_size, tag_to_ix, embedding_dim, hidden_dim):
        super(BiLSTM_CRF, self).__init__()
        self.embedding_dim = embedding_dim # Embedded dimension
        self.hidden_dim = hidden_dim  # Hidden layer dimension
        self.vocab_size = vocab_size # Vocabulary size
        self.tag_to_ix = tag_to_ix  # Label to subscript
        self.tagset_size = len(tag_to_ix)  # Target value range size

        self.word_embeds = nn.Embedding(vocab_size, embedding_dim)  # Embedded layer
        self.lstm = nn.LSTM(embedding_dim, hidden_dim // 2, num_layers=1, bidirectional=True) # bidirectional LSTM
        self.hidden2tag = nn.Linear(hidden_dim, self.tagset_size)
        # Transition matrix: transitions[i][j] is the score of transitioning from tag j to tag i.
        # It is randomly initialized and learned during training.
        self.transitions = nn.Parameter(torch.randn(self.tagset_size, self.tagset_size))

        self.transitions.data[tag_to_ix[START_TAG], :] = -10000  # transitioning into START_TAG from any tag is impossible
        self.transitions.data[:, tag_to_ix[STOP_TAG]] = -10000  # transitioning out of STOP_TAG to any tag is impossible

        self.hidden = self.init_hidden()  # randomly initialized LSTM state (h_0, c_0)

1. Defining the model: __init__()

class BiLSTM_CRF(nn.Module):
    def __init__(self, vocab_size, tag_to_ix, embedding_dim, hidden_dim):

This follows the standard pattern of a class Xxx(nn.Module) with a def __init__() method (see the reference linked in the original post).
vocab_size is len(word_to_ix), i.e. the number of distinct words; tag_to_ix is passed in as well; embedding_dim is the embedding size (5, which here equals the number of tags); hidden_dim is the hidden size of the BiLSTM.

super(BiLSTM_CRF, self).__init__()

As with any custom model, this line calls the constructor of the parent class.
(2) Define the model's layers and parameters; they are explained one by one below.

self.word_embeds = nn.Embedding(vocab_size, embedding_dim) # embedding layer

Here is how to understand nn.Embedding:
simply put, it is a lookup table of word vectors whose weight has shape (vocab_size, embedding_dim); looking up a sentence of word indices returns one embedding_dim-dimensional vector per word.
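A small shape check (a sketch using this tutorial's sizes: 17 words in word_to_ix and embedding_dim = 5):

embedding = nn.Embedding(17, 5)    # a lookup table with weight shape (vocab_size=17, embedding_dim=5)
indices = torch.tensor([0, 1, 2])  # e.g. the indices of "the wall street"
print(embedding(indices).shape)    # torch.Size([3, 5]): one 5-dimensional vector per word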

self.lstm = nn.LSTM(embedding_dim, hidden_dim // 2, num_layers=1, bidirectional=True)

Here is how to understand nn.LSTM:
the input size is the word-vector dimension embedding_dim = 5, the hidden size is hidden_dim // 2 = 4 // 2 = 2 per direction, the number of recurrent layers is 1, and bidirectional=True makes it a bidirectional LSTM.

Similarly, PyTorch's nn.Linear() defines the fully connected layer that maps the LSTM output (hidden_dim = 4) to the tagset_size = 5 tag scores; a shape walkthrough is sketched below.
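A sketch of how the shapes flow through these two layers, using the tutorial's sizes (embedding_dim = 5, hidden_dim = 4, an 11-word sentence, batch size 1):

lstm = nn.LSTM(5, 2, num_layers=1, bidirectional=True)  # 2 hidden units per direction
hidden2tag = nn.Linear(4, 5)                            # 4 = 2 directions * 2 units, mapped to 5 tags

x = torch.randn(11, 1, 5)                               # (seq_len, batch, embedding_dim)
h0 = (torch.randn(2, 1, 2), torch.randn(2, 1, 2))       # (num_layers * num_directions, batch, hidden_dim // 2)
out, _ = lstm(x, h0)                                    # out has shape (11, 1, 4)
print(hidden2tag(out.view(11, 4)).shape)                # torch.Size([11, 5]): one row of tag scores per word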

The following lines define the transition scores, i.e. the score of moving from one tag to another.
The transition matrix is randomly initialized and will be updated iteratively during training.

self.transitions = nn.Parameter(torch.randn(self.tagset_size, self.tagset_size))
self.transitions.data[tag_to_ix[START_TAG], :] = -10000  # transitioning into START_TAG from any tag is impossible
self.transitions.data[:, tag_to_ix[STOP_TAG]] = -10000  # transitioning out of STOP_TAG to any tag is impossible

(1) The init_hidden() function:

    def init_hidden(self):
        return (torch.randn(2, 1, self.hidden_dim // 2),
                torch.randn(2, 1, self.hidden_dim // 2))

These are the initial hidden and cell states of the LSTM. Because the LSTM is bidirectional, two slices are generated (the first dimension is 2, one per direction), and each slice has shape (1, self.hidden_dim // 2). They are initialized randomly. After self.hidden = self.init_hidden() is called, self.hidden looks like:

    (tensor([[[ 1.1456, -0.0378]],[[ 0.9321,  0.5295]]]), 
    tensor([[[-1.5295,  0.2228]],[[-1.4066, -0.2268]]]))

(2) _forward_alg(): compute the total score of all paths

    def _forward_alg(self, feats):  # forward algorithm; feats is the LSTM output for all time steps
        '''
        Input: the emission matrix (emission scores), i.e. the LSTM output -- for each word of the
               sentence, its score for every tag after passing through the BiLSTM
        Output: the log of the sum of the scores of all possible paths (normalization factor /
                partition function / Z(x))
        '''
        init_alphas = torch.full((1, self.tagset_size), -10000.)  # initialize all scores to -10000 ...
        init_alphas[0][self.tag_to_ix[START_TAG]] = 0.  # ... except START_TAG, whose alpha is 0 (similar to the first step of Dijkstra's shortest-path algorithm)

        # wrap in a variable so that automatic back-propagation works
        forward_var = init_alphas
        for feat in feats:  # loop over the words w_i
            alphas_t = []
            for next_tag in range(self.tagset_size):  # next_tag ranges over the tagset_size possible tags of the current word
                # emission score of next_tag at time t, broadcast to shape 1 x tagset_size so it can be
                # added to the transition scores from each of the tagset_size previous tags at time t-1
                emit_score = feat[next_tag].view(1, -1).expand(1, self.tagset_size)  # 1 x 5: feat[j] is the score of labelling the current word with tag j
                # transition scores from each of the tags at time t-1 to next_tag
                trans_score = self.transitions[next_tag].view(1, -1)  # 1 x 5: transitions[m2][m1] is the score of moving from the previous tag m1 to the current tag m2

                next_tag_var = forward_var + trans_score + emit_score
                # log-sum-exp over the previous tags advances the recursion from w_(t-1) to w_t
                alphas_t.append(log_sum_exp(next_tag_var).view(1))
            forward_var = torch.cat(alphas_t).view(1, -1)  # shape 1 x 5, i.e. [1, tagset size]
            # after the inner loop, forward_var[j] is the log-sum of the scores of all paths ending at
            # the current word with tag j; when the outer loop ends it covers the whole word sequence

        # finally add the transition scores from the last word to STOP_TAG
        terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]
        alpha = log_sum_exp(terminal_var)
        return alpha

This function uses forward propagation with dynamic programming to compute a score from the (randomly initialized) transition matrix. Because the transitions are random, the computed value is large (a score greater than 20 here). It is the log of the sum over all possible paths. The method is introduced in the article "CRF calculation details in BiLSTM-CRF"; understanding that method is necessary to understand the code.
Alternatively, see the code analysis in the blog post "BiLSTM_CRF model source code learning in PyTorch",
or "BILSTM_CRF code parsing".
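In formula form, _forward_alg implements the standard CRF forward recursion (a sketch; here e_t(j) denotes the emission score of tag j at step t and T_{i→j} the transition score from tag i to tag j, notation introduced just for this explanation; alpha is initialized so that only START_TAG has weight):

    \alpha_t(j) = \log \sum_i \exp\left( \alpha_{t-1}(i) + T_{i \to j} + e_t(j) \right)
    \log Z(x) = \log \sum_i \exp\left( \alpha_n(i) + T_{i \to \mathrm{STOP}} \right)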

Code analysis:

init_alphas = torch.full((1, self.tagset_size), -10000.)

torch.full creates a matrix of shape (1, self.tagset_size) in which every entry is -10000.
All initial scores are therefore -10000, and the entry for START_TAG is then set to 0, similar to the first step of Dijkstra's shortest-path algorithm.

for feat in feats:

This traverses the sentence: feats is a matrix of shape [sentence length, tagset size], and the number of time steps equals the sentence length, so this loop runs over 11 words (there are 11 words in the first sentence). The 5 tag scores of the first word look like:

    tensor([-0.2095,  0.1737, -0.3876,  0.4378, -0.3475])

for next_tag in range(self.tagset_size):

Then loop over each of the five tags in turn. As the linked article explains, the emission score of a single tag must be expanded to shape (1, 5) before computing

next_tag_var = forward_var + trans_score + emit_score

The shapes and values look like this:

 forward_var: tensor([[-10000., -10000., -10000.,      0., -10000.]])
 trans_score: tensor([[-1.1811e-01, -1.4420e+00, -1.1108e+00, -1.1187e+00, -1.0000e+04]],grad_fn=<ViewBackward>)
 emit_score: tensor([[-0.2095, -0.2095, -0.2095, -0.2095, -0.2095]],grad_fn=<ExpandBackward>)

alphas_t.append(log_sum_exp(next_tag_var).view(1))

Each next_tag_var (the sum of three 1 x 5 matrices, so itself 1 x 5) is reduced by log_sum_exp to a single number and appended to alphas_t; after the inner loop finishes, alphas_t therefore has 5 entries, i.e. shape (1, 5).
alphas_t can be read as P_mt (m = number of tags, t = index of the current word): for each of the m possible tags of word t it stores the accumulated path score up to that word, so there are m partial paths at this point.

forward_var = torch.cat(alphas_t).view(1, -1)

At this point the inner loop over tags has finished: the scores for all m tags of word t have been computed.
forward_var has shape [1, tagset size]; entry j is the score of the current word being tagged with tag j. It is updated at every time step, and after the outer loop finishes it holds the scores at the end of the word sequence.
The loop then moves on to the second word, reusing the updated forward_var, and exits once every word in the sentence has been processed.

terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]
alpha = log_sum_exp(terminal_var)

At the end, the score of transitioning from the last word to STOP_TAG is added, and log_sum_exp of the result gives the total score of all possible paths.

(3) _get_lstm_features(): run the LSTM layer to obtain the feats:

    def _get_lstm_features(self, sentence):
        self.hidden = self.init_hidden()  # initialize the state vectors (h_0, c_0)
        embeds = self.word_embeds(sentence)  # look up the word vector of every word in the sentence
        embeds = embeds.unsqueeze(1)  # insert a dimension of size 1 at position 1 (the batch dimension)
        lstm_out, self.hidden = self.lstm(embeds, self.hidden)  # the LSTM takes the sentence and state vectors and returns an output vector and new state vectors
        lstm_out = lstm_out.view(len(sentence), self.hidden_dim)  # reshape the output to [sentence length, hidden dimension]
        lstm_feats = self.hidden2tag(lstm_out)  # the fully connected layer maps each hidden vector to tagset_size scores
        return lstm_feats

So the function passes the sentence through the embedding, LSTM and linear layers, producing a matrix computed from the LSTM output. For the first sentence this is an 11 x 5 tensor, and this 11 x 5 tensor is the emission matrix!!! Emission matrix!!! (emission matrix)
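A quick way to verify this (a sketch; it assumes the model instance and prepare_sequence defined earlier):

with torch.no_grad():
    sent = prepare_sequence(training_data[0][0], word_to_ix)  # the 11 word indices of the first sentence
    feats = model._get_lstm_features(sent)
    print(feats.shape)  # torch.Size([11, 5]): the emission matrix, one row of 5 tag scores per word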
(4) _score_sentence(): compute the score of a given tag sequence under the current model

    def _score_sentence(self, feats, tags):
        # Gives the score of a provided tag sequence
        score = torch.zeros(1)
        tags = torch.cat([torch.tensor([self.tag_to_ix[START_TAG]], dtype=torch.long), tags])
        for i, feat in enumerate(feats):
            score = score + \
                self.transitions[tags[i + 1], tags[i]] + feat[tags[i + 1]]  # accumulate the path score: transition score + emission score
        score = score + self.transitions[self.tag_to_ix[STOP_TAG], tags[-1]]  # add the score of transitioning from the sentence's last tag to STOP_TAG
        return score

This is the score computed from the real (gold) tag sequence. How does it differ from _forward_alg(self, feats) above? Both use the same (randomly initialized) transition matrix, but _forward_alg sums the scores of every possible tagging path, while _score_sentence scores only the single path given by the real tags. For example, if the real tags are N V but, because the transitions are random, the model currently favours N N, the two scores will differ. Back-propagation later updates the transitions so that the transition matrix approaches the "true" one. (Personal understanding.)
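In formula form, with e_t(y_t) the emission score of the gold tag y_t at step t, T the transition scores as before, and y_0 = START (a sketch), _score_sentence computes:

    \mathrm{score}(x, y) = \sum_{t=1}^{n} \left( T_{y_{t-1} \to y_t} + e_t(y_t) \right) + T_{y_n \to \mathrm{STOP}}

which is exactly the loop above: one transition score plus one emission score per word, plus the final transition to STOP_TAG.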
(5) Viterbi algorithm

    #Decoding to obtain the predicted sequence and the score of the predicted sequence
    def _viterbi_decode(self, feats):
        backpointers = []
 
        # Initialize the viterbi variables in log space
        init_vvars = torch.Tensor(1, self.tagset_size).fill_(-10000.)
        init_vvars[0][self.tag_to_ix[START_TAG]] = 0
 
        # forward_var at step i holds the viterbi variables for step i-1
        forward_var = autograd.Variable(init_vvars)
        for feat in feats:
            bptrs_t = []  # holds the backpointers for this step
            viterbivars_t = []  # holds the viterbi variables for this step
 
            for next_tag in range(self.tagset_size):
                # next_tag_var[i] holds the viterbi variable for tag i at the
                # previous step, plus the score of transitioning
                # from tag i to next_tag.
                # We don't include the emission scores here because the max
                # does not depend on them (we add them in below)
                next_tag_var = forward_var + self.transitions[next_tag]  # scores of transitioning from every tag (B, I, O, START, STOP) to next_tag
                best_tag_id = argmax(next_tag_var)
                bptrs_t.append(best_tag_id)
                viterbivars_t.append(next_tag_var[0][best_tag_id].view(1))
            # Now add in the emission scores, and assign forward_var to the set
            # of viterbi variables we just computed
            forward_var = (torch.cat(viterbivars_t) + feat).view(1, -1)#The maximum score of each of the five sequences from step 0 to step(i-1)
            backpointers.append(bptrs_t) #bptrs_t has 5 elements
 
        # Transition to STOP_TAG
        terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]  # scores of transitioning from every tag to STOP_TAG
        best_tag_id = argmax(terminal_var)
        path_score = terminal_var[0][best_tag_id]
 
        # Follow the back pointers to decode the best path.
        best_path = [best_tag_id]
        for bptrs_t in reversed(backpointers):#Go back and find a best path
            best_tag_id = bptrs_t[best_tag_id]
            best_path.append(best_tag_id)
        # Pop off the start tag (we dont want to return that to the caller)
        start = best_path.pop()
        assert start == self.tag_to_ix[START_TAG]  # Sanity check
        best_path.reverse()  # reverse the path so it runs from the first word to the last
        return path_score, best_path

Viterbi decoding is used at prediction time: it outputs the score of the best path and the path itself.
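In formula form, _viterbi_decode replaces the forward algorithm's log-sum-exp with a max and remembers, via the backpointers, which previous tag achieved it (a sketch of the standard Viterbi recursion, same notation as above):

    \delta_t(j) = \max_i \left( \delta_{t-1}(i) + T_{i \to j} \right) + e_t(j), \qquad \psi_t(j) = \arg\max_i \left( \delta_{t-1}(i) + T_{i \to j} \right)

The best final tag is argmax_i( delta_n(i) + T_{i→STOP} ), and the best path is recovered by following the backpointers psi backwards, which is what the reversed(backpointers) loop does.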
(6) The loss function - very important!!! Training iterates on the difference between the total-path score and the real-path score; the formula is:

loss = log( sum over all paths y' of exp(score(x, y')) ) - score(x, y_real), i.e. forward_score - gold_score

def neg_log_likelihood(self, sentence, tags):
    feats = self._get_lstm_features(sentence)  # the 11 x 5 matrix produced by LSTM + Linear; it is the input to the CRF layer
    forward_score = self._forward_alg(feats)  # a scalar score; with the random transitions it is around 20 here
    gold_score = self._score_sentence(feats, tags)  # e.g. tensor([4.5836])

    return forward_score - gold_score  # the difference between the two scores; back-propagation is driven directly by this difference

(7) forward function

    def forward(self, sentence):
        '''
        In the decoding process, Viterbi decodes and selects the annotation path with the maximum probability
        '''
        lstm_feats = self._get_lstm_features(sentence)

        score, tag_seq = self._viterbi_decode(lstm_feats)
        return score, tag_seq

def forward(self, sentence): the forward function is only used for prediction; it is not called during training (training calls neg_log_likelihood instead).

Part IV: Training on the dataset
(1) Gradient descent

optimizer = optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

for epoch in range(300):
    for sentence, tags in training_data:
        model.zero_grad()

        # input
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = torch.tensor([tag_to_ix[t] for t in tags], dtype=torch.long)

        # Get loss
        loss = model.neg_log_likelihood(sentence_in, targets)

        # Back propagation
        loss.backward()
        optimizer.step()

Keywords: Pytorch Deep Learning NLP
