Preface
Learning objectives
- Learn about news topic classification and the related data
- Master the process of building a news topic classifier with a shallow network
About the news topic classification task:
- The model takes the text description of a news report as input and helps us judge which topic the news most likely belongs to. This is a typical text classification problem. We assume the topics are mutually exclusive, i.e. each text description belongs to exactly one topic; for example, a report on oil prices and the stock market should be classified only as 'Business'.
News topic classification data:
- Get data through torchtext:
```python
# Import the related torch toolkits
import torch
import torchtext
# Import the text classification tasks from torchtext.datasets
from torchtext.datasets import text_classification
import os

# Define the data download path: a data folder under the current path
load_data_path = "./data"
# If the path does not exist, create it
if not os.path.isdir(load_data_path):
    os.mkdir(load_data_path)

# Select the text classification dataset 'AG_NEWS' in torchtext, which is the
# news topic classification data, and save it in the specified directory.
# The numerically mapped training and test data are loaded into memory.
train_dataset, test_dataset = text_classification.DATASETS['AG_NEWS'](root=load_data_path)
```
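As a quick check (a sketch based on the legacy torchtext API used above; the exact sample format may differ between versions), each element of train_dataset should be a pair of a 0-based label and a tensor of token ids:

```python
# Sketch, assuming the legacy text_classification dataset loaded above:
# each sample is a (label, token-id tensor) pair after numerical mapping.
print(len(train_dataset))    # expected: 120000
label, token_ids = train_dataset[0]
print(label)                 # an integer in 0..3 (0-based topic label)
print(token_ids[:10])        # the first ten token ids of the first news item
```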
Data file Preview:
```
data/
    ag_news_csv.tar.gz
    ag_news_csv/
        classes.txt
        readme.txt
        test.csv
        train.csv
```
File descriptions:
- train.csv contains the training data, 120,000 rows in total;
- test.csv contains the validation data, 7,600 rows in total;
- classes.txt maps each label to its meaning (news topic): the four words 'World', 'Sports', 'Business' and 'Sci/Tech' represent the four news topics;
- readme.txt is the English description of the dataset.
train.csv Preview:
"3","Wall St. Bears Claw Back Into the Black (Reuters)","Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again." "3","Carlyle Looks Toward Commercial Aerospace (Reuters)","Reuters - Private investment firm Carlyle Group,\which has a reputation for making well-timed and occasionally\controversial plays in the defense industry, has quietly placed\its bets on another part of the market." "3","Oil and Economy Cloud Stocks' Outlook (Reuters)","Reuters - Soaring crude prices plus worries\about the economy and the outlook for earnings are expected to\hang over the stock market next week during the depth of the\summer doldrums." "3","Iraq Halts Oil Exports from Main Southern Pipeline (Reuters)","Reuters - Authorities have halted oil export\flows from the main pipeline in southern Iraq after\intelligence showed a rebel militia could strike\infrastructure, an oil official said on Saturday." "3","Oil prices soar to all-time record, posing new menace to US economy (AFP)","AFP - Tearaway world oil prices, toppling records and straining wallets, present a new economic menace barely three months before the US presidential elections." "3","Stocks End Up, But Near Year Lows (Reuters)","Reuters - Stocks ended slightly higher on Friday\but stayed near lows for the year as oil prices surged past #36;46\a barrel, offsetting a positive outlook from computer maker\Dell Inc. (DELL.O)" "3","Money Funds Fell in Latest Week (AP)","AP - Assets of the nation's retail money market mutual funds fell by #36;1.17 billion in the latest week to #36;849.98 trillion, the Investment Company Institute said Thursday." "3","Fed minutes show dissent over inflation (USATODAY.com)","USATODAY.com - Retail sales bounced back a bit in July, and new claims for jobless benefits fell last week, the government said Thursday, indicating the economy is improving from a midsummer slump." "3","Safety Net (Forbes.com)","Forbes.com - After earning a PH.D. in Sociology, Danny Bazil Riley started to work as the general manager at a commercial real estate firm at an annual base salary of #36;70,000. Soon after, a financial planner stopped by his desk to drop off brochures about insurance benefits available through his employer. But, at 32, ""buying insurance was the furthest thing from my mind,"" says Riley." "3","Wall St. Bears Claw Back Into the Black"," NEW YORK (Reuters) - Short-sellers, Wall Street's dwindling band of ultra-cynics, are seeing green again."
Description of the file contents:
- train.csv consists of three columns separated by ',', representing the label, the news title and the news brief respectively; the labels are "1", "2", "3" and "4", corresponding in order to the topics listed in classes.txt.
- test.csv has the same format and meaning as train.csv.
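For illustration only, a minimal sketch of reading the first row with Python's built-in csv module, assuming the tarball has been extracted to the directory shown in the preview above:

```python
# Sketch, assuming ./data/ag_news_csv/train.csv exists after extraction.
# Each row has three fields: label, title, description.
import csv

with open('./data/ag_news_csv/train.csv', newline='', encoding='utf-8') as f:
    label, title, description = next(csv.reader(f))
    print(label, title)
```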
The implementation of the whole case can be divided into the following five steps:
- Step 1: build a text classification model with an Embedding layer
- Step 2: batch-process the data
- Step 3: build the training and validation functions
- Step 4: run model training and validation
- Step 5: inspect the word vectors learned by the embedding layer
1. Build a text classification model with an Embedding layer
```python
# Import the necessary torch model-building tools
import torch.nn as nn
import torch.nn.functional as F

# Specify BATCH_SIZE
BATCH_SIZE = 16

# Detect the available device; use the GPU first if one is present
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class TextSentiment(nn.Module):
    """Text classification model"""

    def __init__(self, vocab_size, embed_dim, num_class):
        """
        description: class initialization function
        :param vocab_size: total number of distinct words in the whole corpus
        :param embed_dim: dimension of the word embeddings
        :param num_class: total number of text categories
        """
        super().__init__()
        # Instantiate the embedding layer; sparse=True means only part of the
        # weights is updated each time the gradient is computed for this layer
        self.embedding = nn.Embedding(vocab_size, embed_dim, sparse=True)
        # Instantiate the linear layer with parameters embed_dim and num_class
        self.fc = nn.Linear(embed_dim, num_class)
        # Initialize the weights of the layers
        self.init_weights()

    def init_weights(self):
        """Initialize the weights"""
        # Specify the range of the initial weights
        initrange = 0.5
        # Initialize the weight parameters of each layer with a uniform distribution
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        # Initialize the bias to 0
        self.fc.bias.data.zero_()

    def forward(self, text):
        """
        :param text: the numerically mapped text
        :return: a tensor with as many entries as there are categories,
                 used to decide the text category
        """
        # Get the embedding result embedded
        # >>> embedded.shape
        # (m, 32), where m is the total number of words in this batch of data
        embedded = self.embedding(text)
        # Next we need to turn (m, 32) into (BATCH_SIZE, 32)
        # so that the loss can be computed after the fc layer.
        # We know m is much larger than BATCH_SIZE = 16;
        # dividing m by BATCH_SIZE, m contains c groups of BATCH_SIZE words
        c = embedded.size(0) // BATCH_SIZE
        # Keep the first c * BATCH_SIZE vectors of embedded so that
        # the number of vectors is divisible by BATCH_SIZE
        embedded = embedded[:BATCH_SIZE * c]
        # We want to use average pooling to average every c rows of embedded,
        # but average pooling works along the last dimension and needs a 3-D input,
        # so we transpose the new embedded and add a leading dimension
        embedded = embedded.transpose(1, 0).unsqueeze(0)
        # Then call average pooling with kernel size c,
        # i.e. average every c elements into one value
        embedded = F.avg_pool1d(embedded, kernel_size=c)
        # Finally remove the extra dimension, transpose back and feed the fc layer
        return self.fc(embedded[0].transpose(1, 0))
```
Instantiate the model:

```python
# Get the total number of distinct words in the whole corpus
VOCAB_SIZE = len(train_dataset.get_vocab())
# Specify the word embedding dimension
EMBED_DIM = 32
# Get the total number of categories
NUM_CLASS = len(train_dataset.get_labels())
# Instantiate the model
model = TextSentiment(VOCAB_SIZE, EMBED_DIM, NUM_CLASS).to(device)
```
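As a quick sanity check (a sketch, not part of the original tutorial), we can feed the model a random batch of token ids whose length is a multiple of BATCH_SIZE and confirm that the output has shape (BATCH_SIZE, NUM_CLASS):

```python
# Sketch of a sanity check: random token ids standing in for a real batch.
dummy_text = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE * 10,)).to(device)
with torch.no_grad():
    out = model(dummy_text)
print(out.shape)  # expected: torch.Size([16, 4])
```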
2. Batch-process the data
```python
def generate_batch(batch):
    """
    description: function that generates a batch of data
    :param batch: a list of length batch_size whose elements are tuples of a
                  label and the corresponding sample tensor, e.g.
                  [(label1, sample1), (label2, sample2), ..., (labelN, sampleN)]
    :return: the sample tensors and labels, each as a single tensor, e.g.
             text = tensor([sample1, sample2, ..., sampleN])
             label = tensor([label1, label2, ..., labelN])
    """
    # Obtain the label tensor from the batch
    label = torch.tensor([entry[0] for entry in batch])
    # Obtain the sample tensors from the batch and concatenate them
    text = [entry[1] for entry in batch]
    text = torch.cat(text)
    # Return the results
    return text, label
```
Call:
```python
# Assume an input:
batch = [(1, torch.tensor([3, 23, 2, 8])), (0, torch.tensor([3, 45, 21, 6]))]
res = generate_batch(batch)
print(res)
```
Output effect:
```
# The two input samples are concatenated accordingly
(tensor([ 3, 23,  2,  8,  3, 45, 21,  6]), tensor([1, 0]))
```
3. Build the training and validation functions
```python
# Import the data loader from torch
from torch.utils.data import DataLoader


def train(train_data):
    """Model training function"""
    # Initialize the training loss and accuracy to 0
    train_loss = 0
    train_acc = 0

    # Use DataLoader to generate batches of BATCH_SIZE for training;
    # data is a generator of batches produced by the generate_batch function
    data = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True,
                      collate_fn=generate_batch)

    # Loop over the data and update the parameters with each batch
    for i, (text, cls) in enumerate(data):
        # Move the batch to the same device as the model
        text, cls = text.to(device), cls.to(device)
        # Reset the optimizer gradients to 0
        optimizer.zero_grad()
        # Feed one batch into the model and obtain the output
        output = model(text)
        # Compute the loss from the true labels and the model output
        loss = criterion(output, cls)
        # Add the loss of this batch to the total loss
        train_loss += loss.item()
        # Back-propagate the error
        loss.backward()
        # Update the parameters
        optimizer.step()
        # Add the number of correct predictions in this batch to the total
        train_acc += (output.argmax(1) == cls).sum().item()

    # Adjust the optimizer learning rate
    scheduler.step()

    # Return the average loss and average accuracy of this training round
    return train_loss / len(train_data), train_acc / len(train_data)


def valid(valid_data):
    """Model validation function"""
    # Initialize the validation loss and accuracy to 0
    valid_loss = 0
    valid_acc = 0

    # As in training, use DataLoader to obtain a validation data generator
    data = DataLoader(valid_data, batch_size=BATCH_SIZE, collate_fn=generate_batch)

    # Take out the data batch by batch for validation
    for text, cls in data:
        # Move the batch to the same device as the model
        text, cls = text.to(device), cls.to(device)
        # No gradients are computed in the validation phase
        with torch.no_grad():
            # Obtain the output with the model
            output = model(text)
            # Compute the loss
            loss = criterion(output, cls)
            # Add the loss and the number of correct predictions to the totals
            valid_loss += loss.item()
            valid_acc += (output.argmax(1) == cls).sum().item()

    # Return the average loss and average accuracy of this validation round
    return valid_loss / len(valid_data), valid_acc / len(valid_data)
```
4. Run model training and validation
```python
# Import the time toolkit
import time
# Import the random data splitting tool
from torch.utils.data.dataset import random_split

# Specify the number of training epochs
N_EPOCHS = 10

# Define the initial validation loss
min_valid_loss = float('inf')

# Choose the loss function; here the predefined cross-entropy loss is used
criterion = torch.nn.CrossEntropyLoss().to(device)
# Choose the stochastic gradient descent optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=4.0)
# Choose the StepLR scheduler to decay the learning rate
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.9)

# Take 95% of train_dataset as the training set; first compute its length
train_len = int(len(train_dataset) * 0.95)
# Then use random_split to shuffle and split it into the training and validation sets
sub_train_, sub_valid_ = \
    random_split(train_dataset, [train_len, len(train_dataset) - train_len])

# Loop over the epochs
for epoch in range(N_EPOCHS):
    # Record the start time of this training epoch
    start_time = time.time()
    # Call the train and valid functions to get the average losses and accuracies
    train_loss, train_acc = train(sub_train_)
    valid_loss, valid_acc = valid(sub_valid_)

    # Compute the total time spent on training and validation (in seconds)
    secs = int(time.time() - start_time)
    # Express it in minutes and seconds
    mins = secs / 60
    secs = secs % 60

    # Print the elapsed time, average loss and average accuracy
    print('Epoch: %d' % (epoch + 1),
          " | time in %d minutes, %d seconds" % (mins, secs))
    print(f'\tLoss: {train_loss:.4f}(train)\t|\tAcc: {train_acc * 100:.1f}%(train)')
    print(f'\tLoss: {valid_loss:.4f}(valid)\t|\tAcc: {valid_acc * 100:.1f}%(valid)')
```
Output effect:
```
120000lines [00:06, 17834.17lines/s]
120000lines [00:11, 10071.77lines/s]
7600lines [00:00, 10432.95lines/s]
Epoch: 1  | time in 0 minutes, 36 seconds
    Loss: 0.0592(train) | Acc: 63.9%(train)
    Loss: 0.0005(valid) | Acc: 69.2%(valid)
Epoch: 2  | time in 0 minutes, 37 seconds
    Loss: 0.0507(train) | Acc: 71.3%(train)
    Loss: 0.0005(valid) | Acc: 70.7%(valid)
Epoch: 3  | time in 0 minutes, 36 seconds
    Loss: 0.0484(train) | Acc: 72.8%(train)
    Loss: 0.0005(valid) | Acc: 71.4%(valid)
Epoch: 4  | time in 0 minutes, 36 seconds
    Loss: 0.0474(train) | Acc: 73.4%(train)
    Loss: 0.0004(valid) | Acc: 72.0%(valid)
Epoch: 5  | time in 0 minutes, 36 seconds
    Loss: 0.0455(train) | Acc: 74.8%(train)
    Loss: 0.0004(valid) | Acc: 72.5%(valid)
Epoch: 6  | time in 0 minutes, 36 seconds
    Loss: 0.0451(train) | Acc: 74.9%(train)
    Loss: 0.0004(valid) | Acc: 72.3%(valid)
Epoch: 7  | time in 0 minutes, 36 seconds
    Loss: 0.0446(train) | Acc: 75.3%(train)
    Loss: 0.0004(valid) | Acc: 72.0%(valid)
Epoch: 8  | time in 0 minutes, 36 seconds
    Loss: 0.0437(train) | Acc: 75.9%(train)
    Loss: 0.0004(valid) | Acc: 71.4%(valid)
Epoch: 9  | time in 0 minutes, 36 seconds
    Loss: 0.0431(train) | Acc: 76.2%(train)
    Loss: 0.0004(valid) | Acc: 72.7%(valid)
Epoch: 10  | time in 0 minutes, 36 seconds
    Loss: 0.0426(train) | Acc: 76.6%(train)
    Loss: 0.0004(valid) | Acc: 72.6%(valid)
```
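Note that min_valid_loss is defined above but not used inside the loop. One hedged sketch (an addition, not part of the original code) of how it is typically used is to save a checkpoint at the end of each epoch whenever the validation loss improves:

```python
# Sketch, to be placed at the end of each epoch (the file name is hypothetical):
# keep the best checkpoint by comparing against the lowest validation loss so far.
if valid_loss < min_valid_loss:
    min_valid_loss = valid_loss
    torch.save(model.state_dict(), './best_ag_news_model.pt')
```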
5. Inspect the word vectors learned by the embedding layer
```python
# Print the embedding matrix obtained from the model's state dict
print(model.state_dict()['embedding.weight'])
```
Output effect:
```
tensor([[ 0.4401, -0.4177, -0.4161,  ...,  0.2497, -0.4657, -0.1861],
        [-0.2574, -0.1952,  0.1443,  ..., -0.4687, -0.0742,  0.2606],
        [-0.1926, -0.1153, -0.0167,  ..., -0.0954,  0.0134, -0.0632],
        ...,
        [-0.0780, -0.2331, -0.3656,  ..., -0.1899,  0.4083,  0.3002],
        [-0.0696,  0.4396, -0.1350,  ...,  0.1019,  0.2792, -0.4749],
        [-0.2978,  0.1872, -0.1994,  ...,  0.3435,  0.4729, -0.2608]])
```
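To inspect the vector of a single word, one possible sketch (assuming the legacy torchtext vocabulary returned by get_vocab() above, whose stoi mapping goes from token to index) is to map the word to its index and then index the embedding matrix:

```python
# Sketch, assuming the legacy torchtext Vocab object returned by get_vocab():
# look up one word's embedding vector via its index in the vocabulary.
vocab = train_dataset.get_vocab()
idx = vocab.stoi['sports']          # 'sports' is just an example word
word_vector = model.state_dict()['embedding.weight'][idx]
print(word_vector.shape)            # expected: torch.Size([32])
```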
Summary
Some of the methods used in this code have since been deprecated; please refer to the latest training code: [NLP] Text classification with TorchText in practice - AG_NEWS news topic classification task (PyTorch version)
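For orientation only, a hedged sketch of what data loading looks like with the newer, non-legacy torchtext API (details vary between versions, so treat this as a pointer rather than a drop-in replacement):

```python
# Sketch under assumptions: recent torchtext versions expose AG_NEWS directly
# as iterable datasets of (label, raw_text) pairs; exact behaviour is version-dependent.
from torchtext.datasets import AG_NEWS
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

train_iter = AG_NEWS(split='train')
tokenizer = get_tokenizer('basic_english')

def yield_tokens(data_iter):
    # Tokenize the raw text of every (label, text) sample
    for _, text in data_iter:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=['<unk>'])
vocab.set_default_index(vocab['<unk>'])
```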
Keep going, thank you, and keep striving!