1, Principle introduction
An RNN can remember context information, so it is often used to process sequential data such as text. In theory an RNN can memorize arbitrarily long histories, but in practice the gradients vanish or explode as they are propagated back through many time steps, so a plain RNN can only effectively remember the most recent few words. GRU and LSTM extend the basic RNN structure with a gating mechanism that controls the flow of information, improving the memory of useful historical information while discarding redundant, irrelevant information. This article uses a relatively simple bidirectional LSTM (BiLSTM) to perform sentiment analysis.
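For reference, the gating mechanism of an LSTM can be summarized by the standard update equations below (background only; the code later simply calls nn.LSTM). The input gate i_t, forget gate f_t and output gate o_t control how the cell state c_t and hidden state h_t are updated at each time step:

\begin{aligned}
i_t &= \sigma(W_{xi}x_t + W_{hi}h_{t-1} + b_i) \\
f_t &= \sigma(W_{xf}x_t + W_{hf}h_{t-1} + b_f) \\
o_t &= \sigma(W_{xo}x_t + W_{ho}h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_{xc}x_t + W_{hc}h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}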
2, Data processing
This article uses the aclImdb (IMDB movie review) dataset, which contains 50,000 reviews labeled as either positive or negative, split evenly (1:1) into a training set and a test set. First, import the required packages:
import collections
import os
import random
import tarfile
import torch
from torch import nn
import torchtext.vocab as Vocab
import torch.utils.data as Data
import sys
sys.path.append("..")

os.environ["CUDA_VISIBLE_DEVICES"] = "0"
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
DATA_ROOT = 'C:/Users/wuyang/Datasets'
The following code reads the data and converts the labels into 0/1 scalars (1 for positive, 0 for negative).
fname = os.path.join(DATA_ROOT, "aclImdb_v1.tar.gz")
if not os.path.exists(os.path.join(DATA_ROOT, "aclImdb")):
    print("Extracting the dataset from the archive...")
    with tarfile.open(fname, 'r') as f:
        f.extractall(DATA_ROOT)

from tqdm import tqdm

def read_imdb(folder='train', data_root='C:/Users/wuyang/Datasets/aclImdb'):
    data = []
    for label in ['pos', 'neg']:
        folder_name = os.path.join(data_root, folder, label)
        for file in tqdm(os.listdir(folder_name)):
            with open(os.path.join(folder_name, file), 'rb') as f:
                review = f.read().decode('utf-8').replace('\n', '').lower()
                data.append([review, 1 if label == 'pos' else 0])
    random.shuffle(data)
    return data

train_data, test_data = read_imdb('train'), read_imdb('test')
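As an optional sanity check (not part of the original pipeline), we can print the sample counts and look at one review:

# Optional sanity check of the loaded data.
print(len(train_data), len(test_data))           # 25000 25000 for the standard aclImdb split
print(train_data[0][1], train_data[0][0][:60])   # label and the first 60 characters of a review
print(sum(label for _, label in train_data))     # number of positive samples, should be 12500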
Next, the reviews are tokenized. There are many ways to tokenize English text; simply splitting on spaces already gives good results here.
def get_tokenized_imdb(data):  # tokenization
    """
    data: list of [string, label]
    """
    def tokenizer(text):
        return [tok.lower() for tok in text.split(' ')]
    return [tokenizer(review) for review, _ in data]
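For example (a hypothetical input, just to illustrate the output format):

# Small illustration of the tokenizer output (hypothetical example).
sample = [["It is a GREAT movie", 1]]
print(get_tokenized_imdb(sample))
# [['it', 'is', 'a', 'great', 'movie']]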
Next, use collections.Counter to count word frequencies in the training set and build the vocabulary, filtering out words that appear fewer than 5 times.
def get_vocab_imdb(data):
    # Count word frequencies and filter out words with frequency less than 5
    tokenized_data = get_tokenized_imdb(data)
    counter = collections.Counter([tk for st in tokenized_data for tk in st])
    return Vocab.Vocab(counter, min_freq=5)

vocab = get_vocab_imdb(train_data)
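We can inspect the resulting vocabulary; with min_freq=5 it typically contains roughly 46,000 words (the exact number may vary). The sketch below uses the legacy torchtext Vocab API with .stoi / .itos, as in the rest of this article:

# Inspect the vocabulary built above.
print('words in vocab:', len(vocab))
print(vocab.stoi['movie'])   # index of a common word
print(vocab.itos[:10])       # the first few tokens; the first entries are special tokens such as <unk>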
Since the RNN expects fixed-length input sequences while each review has a different length, we need to fix the sequence length. The approach is to choose a standard length, truncate longer reviews to that length, and pad shorter reviews with 0 up to it. Here the fixed length is set to 500.
def preprocess_imdb(data, vocab):
    # Pad shorter reviews and truncate longer ones to obtain sequences of uniform length
    max_l = 500

    def pad(x):
        return x[:max_l] if len(x) > max_l else x + [0] * (max_l - len(x))

    tokenized_data = get_tokenized_imdb(data)
    features = torch.tensor([pad([vocab.stoi[word] for word in words])
                             for words in tokenized_data])
    labels = torch.tensor([score for _, score in data])
    return features, labels
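An optional check of the output shapes (this recomputes the features purely for illustration):

# Optional: verify the shapes produced by preprocess_imdb.
features, labels = preprocess_imdb(train_data, vocab)
print(features.shape)  # torch.Size([25000, 500])
print(labels.shape)    # torch.Size([25000])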
Next, we create the training and test data iterators with a batch size of 64.
batch_size = 64  # batch size
train_set = Data.TensorDataset(*preprocess_imdb(train_data, vocab))
test_set = Data.TensorDataset(*preprocess_imdb(test_data, vocab))
train_iter = Data.DataLoader(train_set, batch_size, shuffle=True)
test_iter = Data.DataLoader(test_set, batch_size)
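Printing one batch confirms that each X has shape (batch_size, 500):

# Take one batch from the training iterator and check its shape.
for X, y in train_iter:
    print('X', X.shape, 'y', y.shape)  # X torch.Size([64, 500]) y torch.Size([64])
    break
print('#batches:', len(train_iter))    # 391 batches of size 64 for 25000 samples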
3, Model building
Next, we use PyTorch to build a BiRNN network. Note that the input of the network has shape (batch_size, seq_length), i.e. batch size and sequence length. The input sequence first passes through an Embedding layer, which converts each word into a fixed-dimensional vector; here embed_size is set to 100. After the Embedding layer we obtain an intermediate result of shape (batch_size, seq_length, embed_size).
We then transpose this intermediate result into a tensor of shape (seq_length, batch_size, embed_size) and feed it into the BiLSTM module. The BiLSTM outputs the hidden state of its last layer at every time step, with shape (seq_length, batch_size, num_directions * hidden_size), where num_directions is 2. We then concatenate the outputs at the initial and final time steps to obtain a tensor of shape (batch_size, 2 * num_directions * hidden_size), i.e. (batch_size, 4 * hidden_size). Finally, this tensor is fed into a fully connected layer and converted into a (batch_size, 2) vector to complete the classification of the sequence.
class BiRNN(nn.Module):
    def __init__(self, vocab, embed_size, num_hidden, num_layers):
        super(BiRNN, self).__init__()
        self.embedding = nn.Embedding(len(vocab), embed_size)
        self.encoder = nn.LSTM(input_size=embed_size,
                               hidden_size=num_hidden,
                               num_layers=num_layers,
                               bidirectional=True)
        self.decoder = nn.Linear(4 * num_hidden, 2)

    def forward(self, x):
        # The shape of the input is (batch_size, seq_len). Because the LSTM takes the
        # sequence length (seq_len) as the first dimension, the input is transposed first.
        # The embedding output has shape (seq_len, batch_size, embed_size).
        embeddings = self.embedding(x.permute(1, 0))
        # outputs has shape (seq_len, batch_size, 2 * num_hidden)
        outputs, _ = self.encoder(embeddings)
        # Concatenate the outputs at the initial and final time steps as the input of the
        # fully connected layer. Its shape is (batch_size, 4 * num_hidden).
        encoding = torch.cat((outputs[0], outputs[-1]), -1)
        outs = self.decoder(encoding)
        return outs

embed_size, num_hiddens, num_layers = 100, 100, 2
net = BiRNN(vocab, embed_size, num_hiddens, num_layers)
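Printing the model gives a quick overview of its structure (the exact numbers and formatting depend on the vocabulary size and PyTorch version):

# Inspect the model structure.
print(net)
# BiRNN(
#   (embedding): Embedding(<vocab size>, 100)
#   (encoder): LSTM(100, 100, num_layers=2, bidirectional=True)
#   (decoder): Linear(in_features=400, out_features=2, bias=True)
# )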
The model above uses an Embedding layer whose word vectors would need to be trained from scratch. This can take a long time and the results are often unsatisfactory. A better choice is to load pretrained word vectors and use them directly. The following code loads pretrained GloVe vectors.
# Load the pretrained GloVe word vectors (100-dimensional, trained on 6B tokens)
glove_vocab = Vocab.GloVe(name='6B', dim=100, cache=os.path.join(DATA_ROOT, "glove"))

def load_pretrained_embedding(words, pretrained_vocab):
    # Build an embedding matrix from the pretrained word vectors for the given word list
    embed = torch.zeros(len(words), pretrained_vocab.vectors[0].shape[0])
    oov_count = 0  # words in our vocabulary that are missing from the pretrained vectors
    for i, word in enumerate(words):
        try:
            idx = pretrained_vocab.stoi[word]
            embed[i, :] = pretrained_vocab.vectors[idx]
        except KeyError:
            oov_count += 1
    if oov_count > 0:
        print('There are %d oov words.' % oov_count)
    return embed

net.embedding.weight.data.copy_(load_pretrained_embedding(vocab.itos, glove_vocab))
net.embedding.weight.requires_grad = False  # the pretrained embedding is not fine-tuned
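After loading, we can verify that the embedding matrix has the expected shape and is frozen, so it will not be updated during training:

# The embedding weight now has shape (len(vocab), 100) and requires no gradient.
print(net.embedding.weight.shape)          # torch.Size([<vocab size>, 100])
print(net.embedding.weight.requires_grad)  # False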
4, Model training
Training this model is no different from training other networks; the process is as follows, so I won't elaborate too much.
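One note: the training loop below calls an evaluate_accuracy helper to compute the accuracy on the test set. Its definition is not shown in this article, so here is a minimal sketch of what such a helper could look like (an assumption, not necessarily the author's original implementation):

# Minimal accuracy-evaluation helper (sketch): computes classification accuracy
# over a data iterator with the network in eval mode.
def evaluate_accuracy(data_iter, net, device=None):
    if device is None:
        device = next(net.parameters()).device
    net.eval()  # switch off dropout etc. during evaluation
    acc_sum, n = 0.0, 0
    with torch.no_grad():
        for X, y in data_iter:
            X, y = X.to(device), y.to(device)
            acc_sum += (net(X).argmax(dim=1) == y).float().sum().item()
            n += y.shape[0]
    net.train()
    return acc_sum / n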
import time

lr, num_epochs = 0.01, 5
# Filter out the embedding parameters, which require no gradient: the pretrained word
# vectors are frozen, so the embedding layer is not updated during training.
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, net.parameters()), lr=lr)
loss = nn.CrossEntropyLoss()

def train(train_iter, test_iter, net, loss, optimizer, device, num_epochs):
    net = net.to(device)
    print("training on ", device)
    batch_count = 0
    for epoch in range(num_epochs):
        train_l_sum, train_acc_sum, n, start = 0.0, 0.0, 0, time.time()
        for X, y in train_iter:
            X = X.to(device)
            y = y.to(device)
            y_hat = net(X)
            l = loss(y_hat, y)
            optimizer.zero_grad()
            l.backward()
            optimizer.step()
            train_l_sum += l.cpu().item()
            train_acc_sum += (y_hat.argmax(dim=1) == y).sum().cpu().item()
            n += y.shape[0]
            batch_count += 1
        test_acc = evaluate_accuracy(test_iter, net)
        print('epoch %d, loss %.4f, train acc %.3f, test acc %.3f, time %.1f sec'
              % (epoch + 1, train_l_sum / batch_count, train_acc_sum / n,
                 test_acc, time.time() - start))

train(train_iter, test_iter, net, loss, optimizer, device, num_epochs)
Here are the training results:
training on cpu
epoch 1, loss 0.6216, train acc 0.629, test acc 0.808, time 2335.8 sec
epoch 2, loss 0.2077, train acc 0.812, test acc 0.823, time 2106.4 sec
epoch 3, loss 0.1243, train acc 0.836, test acc 0.836, time 2053.3 sec
epoch 4, loss 0.0849, train acc 0.854, test acc 0.828, time 1915.6 sec
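After training, the network can be used to classify new sentences. Below is a small sketch; the predict_sentiment helper is an illustration added here, not part of the original article:

# Sketch: classify a new sentence with the trained model (predict_sentiment is an
# illustrative helper, not part of the original article).
def predict_sentiment(net, vocab, sentence):
    device = next(net.parameters()).device
    tokens = [vocab.stoi[word] for word in sentence.lower().split(' ')]
    x = torch.tensor(tokens, device=device).unsqueeze(0)  # shape (1, seq_len)
    label = net(x).argmax(dim=1).item()
    return 'positive' if label == 1 else 'negative'

print(predict_sentiment(net, vocab, 'this movie is so great'))  # typically 'positive'
print(predict_sentiment(net, vocab, 'this movie is so bad'))    # typically 'negative'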