[NLP] news topic classification task

preface

Learning objectives

  • Learn about news topic classification and relevant data

  • Master the implementation process of constructing a news topic classifier using a shallow network

  • About the news topic classification task:

    • The text description of a news report is taken as the input, and the model is used to judge which type of news it most likely belongs to. This is a typical text classification problem. We assume the types are mutually exclusive, i.e. each text description has one and only one type.

News topic classification data:

  • Get data through torchtext:
# Import the relevant torch toolkits
import torch
import torchtext
# Import the text classification datasets from torchtext.datasets
from torchtext.datasets import text_classification
import os

# Define the data download path as the data folder under the current path
load_data_path = "./data"
# If the path does not exist, create it
if not os.path.isdir(load_data_path):
    os.mkdir(load_data_path)

# Select the 'AG_NEWS' text classification dataset in torchtext, i.e. the news topic classification data, and save it in the specified directory
# The numerically mapped training and test data are loaded into memory
train_dataset, test_dataset = text_classification.DATASETS['AG_NEWS'](root=load_data_path)
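
Peek at one loaded sample (a minimal sketch, not part of the original tutorial; it assumes the legacy torchtext text_classification API used above, where each item is a (label, token-index tensor) pair and the CSV labels 1-4 are stored 0-indexed):

# Each dataset item is a (label, token-index tensor) pair
label, tokens = train_dataset[0]
print(label)        # an integer class index in 0..3
print(tokens[:10])  # the first 10 word indices of the numerically mapped text
# The dataset sizes should match the file description below
print(len(train_dataset), len(test_dataset))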

Data file Preview:

- data/
    - ag_news_csv.tar.gz
    - ag_news_csv/
        classes.txt
        readme.txt
        test.csv
        train.csv

Document description:

  • train.csv contains the training data, 120000 rows in total;
  • test.csv contains the validation data, 7600 rows in total;
  • classes.txt lists the label (news topic) meanings; the four words 'World', 'Sports', 'Business' and 'Sci/Tech' represent the four news topics;
  • readme.txt is the English description of the dataset

train.csv Preview:

"3","Wall St. Bears Claw Back Into the Black (Reuters)","Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again."
"3","Carlyle Looks Toward Commercial Aerospace (Reuters)","Reuters - Private investment firm Carlyle Group,\which has a reputation for making well-timed and occasionally\controversial plays in the defense industry, has quietly placed\its bets on another part of the market."
"3","Oil and Economy Cloud Stocks' Outlook (Reuters)","Reuters - Soaring crude prices plus worries\about the economy and the outlook for earnings are expected to\hang over the stock market next week during the depth of the\summer doldrums."
"3","Iraq Halts Oil Exports from Main Southern Pipeline (Reuters)","Reuters - Authorities have halted oil export\flows from the main pipeline in southern Iraq after\intelligence showed a rebel militia could strike\infrastructure, an oil official said on Saturday."
"3","Oil prices soar to all-time record, posing new menace to US economy (AFP)","AFP - Tearaway world oil prices, toppling records and straining wallets, present a new economic menace barely three months before the US presidential elections."
"3","Stocks End Up, But Near Year Lows (Reuters)","Reuters - Stocks ended slightly higher on Friday\but stayed near lows for the year as oil prices surged past  #36;46\a barrel, offsetting a positive outlook from computer maker\Dell Inc. (DELL.O)"
"3","Money Funds Fell in Latest Week (AP)","AP - Assets of the nation's retail money market mutual funds fell by  #36;1.17 billion in the latest week to  #36;849.98 trillion, the Investment Company Institute said Thursday."
"3","Fed minutes show dissent over inflation (USATODAY.com)","USATODAY.com - Retail sales bounced back a bit in July, and new claims for jobless benefits fell last week, the government said Thursday, indicating the economy is improving from a midsummer slump."
"3","Safety Net (Forbes.com)","Forbes.com - After earning a PH.D. in Sociology, Danny Bazil Riley started to work as the general manager at a commercial real estate firm at an annual base salary of  #36;70,000. Soon after, a financial planner stopped by his desk to drop off brochures about insurance benefits available through his employer. But, at 32, ""buying insurance was the furthest thing from my mind,"" says Riley."
"3","Wall St. Bears Claw Back Into the Black"," NEW YORK (Reuters) - Short-sellers, Wall Street's dwindling  band of ultra-cynics, are seeing green again."

Description of document content:

  • train.csv consists of three columns separated by ',', representing the label, the news title, and the news brief respectively; the labels are "1", "2", "3" and "4", which correspond in order to the entries in classes.txt (a quick check with Python's csv module is sketched below)
  • test.csv has the same format and meaning as train.csv
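
The CSV layout can be checked directly with Python's csv module (a minimal sketch outside the torchtext pipeline; the file paths follow the data file preview above):

import csv

# Read the four topic names; line i (1-based) gives the meaning of label i
with open("./data/ag_news_csv/classes.txt") as f:
    classes = [line.strip() for line in f]   # ['World', 'Sports', 'Business', 'Sci/Tech']

# Read the first row of train.csv: label, news title, news brief
with open("./data/ag_news_csv/train.csv") as f:
    label, title, brief = next(csv.reader(f))
print(label, classes[int(label) - 1], title)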

The implementation of the whole case can be divided into the following five steps:

  • Step 1: build a text classification model with Embedding layer
  • Step 2: batch process the data
  • Step 3: construct the training and verification function
  • Step 4: model training and verification
  • Step 5: view the word vectors learned by the Embedding layer

1. Build a text classification model with Embedding layer

# Import the necessary torch model building tools
import torch.nn as nn
import torch.nn.functional as F

# Specify BATCH_SIZE
BATCH_SIZE = 16

# Detect available devices. If there is a GPU, the GPU will be used first
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class TextSentiment(nn.Module):
    """Text classification model"""
    def __init__(self, vocab_size, embed_dim, num_class):
        """
        description: Class initialization function
        :param vocab_size: The total number of different words contained in the whole corpus
        :param embed_dim: Specifies the dimension in which the word is embedded
        :param num_class: Total number of categories for text categorization
        """ 
        super().__init__()
        # Instantiate the embedding layer; sparse=True means only part of the weights are updated each time gradients are computed for this layer
        self.embedding = nn.Embedding(vocab_size, embed_dim, sparse=True)
        # Instantiate the linear layer; its parameters are embed_dim and num_class
        self.fc = nn.Linear(embed_dim, num_class)
        # Initialize weights for layers
        self.init_weights()

    def init_weights(self):
        """Initialize weight function"""
        # Specify the range of the initial weights
        initrange = 0.5
        # Initialize the weight parameters of each layer with a uniform distribution
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        # Initialize the bias to 0
        self.fc.bias.data.zero_()

    def forward(self, text):
        """
        :param text: The result of numerically mapping the text
        :return: A tensor with as many scores as there are categories, used to determine the text category
        """
        # Get the embedding result embedded
        # >>> embedded.shape
        # (m, 32), where m is the total number of words in the BATCH_SIZE pieces of data
        embedded = self.embedding(text)
        # Next, we need to convert (m, 32) into (BATCH_SIZE, 32)
        # so that the corresponding loss can be computed after the fc layer
        # First, we know that m is much larger than BATCH_SIZE=16;
        # dividing m by BATCH_SIZE, m contains c BATCH_SIZEs in total
        c = embedded.size(0) // BATCH_SIZE
        # Then take the first c * BATCH_SIZE vectors from embedded to get a new embedded
        # whose number of vectors is divisible by BATCH_SIZE
        embedded = embedded[:BATCH_SIZE*c]
        # We want to use average pooling to compute the mean over every c rows of embedded,
        # but avg_pool1d works along the last dimension and requires a three-dimensional input,
        # so we transpose the new embedded and add an extra dimension
        embedded = embedded.transpose(1, 0).unsqueeze(0)
        # Then call average pooling with kernel size c,
        # i.e. take every c elements and compute their mean as one result
        embedded = F.avg_pool1d(embedded, kernel_size=c)
        # Finally, remove the added dimension, transpose back, and pass the result through the fc layer
        return self.fc(embedded[0].transpose(1, 0))
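
A quick shape walk-through of the forward pass (a minimal sketch with a small hypothetical vocabulary of 1000 words; it only illustrates the pooling trick described in the comments above):

# Build a throwaway model just to trace tensor shapes
demo_model = TextSentiment(vocab_size=1000, embed_dim=32, num_class=4)
# 70 hypothetical word indices, i.e. the concatenated words of one batch
demo_text = torch.randint(0, 1000, (70,))
# With m=70 and BATCH_SIZE=16 we get c=4: the 70 vectors are truncated to 64=16*4,
# average pooling with kernel size 4 leaves 16 pooled vectors,
# and the fc layer maps each one to 4 class scores
print(demo_model(demo_text).shape)   # torch.Size([16, 4])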

Instantiate the model:

# Get the total number of different words contained in the whole corpus
VOCAB_SIZE = len(train_dataset.get_vocab())
# Specify word embedding dimension
EMBED_DIM = 32
# Get the total number of categories
NUM_CLASS = len(train_dataset.get_labels())
# Instantiate the model
model = TextSentiment(VOCAB_SIZE, EMBED_DIM, NUM_CLASS).to(device)

2. Batch process the data

def generate_batch(batch):
    """
    description: Function that generates batch data
    :param batch: A list of length batch_size whose elements are tuples of a label and the corresponding sample tensor,
                  shaped like:
                  [(label1, sample1), (label2, sample2), ..., (labelN, sampleN)]
    return: The sample tensors concatenated into one tensor and the labels collected into another,
             shaped like:
             text = tensor([sample1, sample2, ..., sampleN])
             label = tensor([label1, label2, ..., labelN])
    """
    # Obtain the label tensor from batch
    label = torch.tensor([entry[0] for entry in batch])
    # Obtain the sample tensor from batch
    text = [entry[1] for entry in batch]
    text = torch.cat(text)
    # Return results
    return text, label

Call:

# Assume an input:
batch = [(1, torch.tensor([3, 23, 2, 8])), (0, torch.tensor([3, 45, 21, 6]))]
res = generate_batch(batch)
print(res)

Output effect:

# The two sample tensors are concatenated into one, and the labels are collected into another tensor
(tensor([ 3, 23,  2,  8,  3, 45, 21,  6]), tensor([1, 0]))

3. Build the training and verification functions

# Method of importing data loader in torch
from torch.utils.data import DataLoader

def train(train_data):
    """Model training function"""
    # Initialize the training loss and accuracy to 0
    train_loss = 0
    train_acc = 0

    # Use the DataLoader to generate batches of BATCH_SIZE data for training
    # data is a generator of batches produced by the generate_batch function
    data = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True,
                      collate_fn=generate_batch)

    # Loop through the data and update the parameters with the data of each batch
    for i, (text, cls) in enumerate(data):
        # Move the batch to the same device as the model
        text, cls = text.to(device), cls.to(device)
        # Set the optimizer's initial gradient to 0
        optimizer.zero_grad()
        # The model takes a batch of data as input and produces the output
        output = model(text)
        # Calculate the loss based on the real label and model output
        loss = criterion(output, cls)
        # Add the loss of this batch to the total loss
        train_loss += loss.item()
        # Error back propagation
        loss.backward()
        # Parameters are updated
        optimizer.step()
        # Add the number of correct predictions in this batch to the running total
        train_acc += (output.argmax(1) == cls).sum().item()

    # Adjust optimizer learning rate  
    scheduler.step()

    # Return the average loss and average accuracy of this round of training
    return train_loss / len(train_data), train_acc / len(train_data)

def valid(valid_data):
    """Model validation function"""
    # Initialization verification loss and accuracy is 0
    loss = 0
    acc = 0

    # As with training, use DataLoader to obtain training data generator
    data = DataLoader(valid_data, batch_size=BATCH_SIZE, collate_fn=generate_batch)
    # Take out data validation by batch
    for text, cls in data:
        # In the verification phase, the gradient is no longer solved
        with torch.no_grad():
            # Get output using model
            output = model(text)
            # Calculate loss
            loss = criterion(output, cls)
            # Add loss and accuracy to total loss and accuracy
            loss += loss.item()
            acc += (output.argmax(1) == cls).sum().item()

    # Return the average loss and average accuracy of this round of verification
    return loss / len(valid_data), acc / len(valid_data)

4. Conduct model training and verification

# Import time Kit
import time

# Import the random data splitting tool
from torch.utils.data.dataset import random_split

# Specify the number of training rounds
N_EPOCHS = 10

# Define the initial validation loss
min_valid_loss = float('inf')

# Select the loss function; here the predefined cross-entropy loss is used
criterion = torch.nn.CrossEntropyLoss().to(device)
# Select the stochastic gradient descent optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=4.0)
# Select StepLR as the learning rate scheduler to decay the learning rate
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.9)

# Take 0.95 of train_dataset as the training set; first compute its length
train_len = int(len(train_dataset) * 0.95)

# Then use random_split to split the data randomly into the corresponding training and validation sets
sub_train_, sub_valid_ = \
    random_split(train_dataset, [train_len, len(train_dataset) - train_len])

# Run each training epoch
for epoch in range(N_EPOCHS):
    # Record the start time of this epoch
    start_time = time.time()
    # Call train and valid functions to get the average loss and average accuracy of training and verification
    train_loss, train_acc = train(sub_train_)
    valid_loss, valid_acc = valid(sub_valid_)

    # Calculate the total time taken for training and verification (seconds)
    secs = int(time.time() - start_time)
    # Express it in minutes and seconds
    mins = secs // 60
    secs = secs % 60

    # Print training and verification time-consuming, average loss, average accuracy
    print('Epoch: %d' %(epoch + 1), " | time in %d minutes, %d seconds" %(mins, secs))
    print(f'\tLoss: {train_loss:.4f}(train)\t|\tAcc: {train_acc * 100:.1f}%(train)')
    print(f'\tLoss: {valid_loss:.4f}(valid)\t|\tAcc: {valid_acc * 100:.1f}%(valid)')

Output effect:

120000lines [00:06, 17834.17lines/s]
120000lines [00:11, 10071.77lines/s]
7600lines [00:00, 10432.95lines/s]

Epoch: 1  | time in 0 minutes, 36 seconds
    Loss: 0.0592(train) |   Acc: 63.9%(train)
    Loss: 0.0005(valid) |   Acc: 69.2%(valid)
Epoch: 2  | time in 0 minutes, 37 seconds
    Loss: 0.0507(train) |   Acc: 71.3%(train)
    Loss: 0.0005(valid) |   Acc: 70.7%(valid)
Epoch: 3  | time in 0 minutes, 36 seconds
    Loss: 0.0484(train) |   Acc: 72.8%(train)
    Loss: 0.0005(valid) |   Acc: 71.4%(valid)
Epoch: 4  | time in 0 minutes, 36 seconds
    Loss: 0.0474(train) |   Acc: 73.4%(train)
    Loss: 0.0004(valid) |   Acc: 72.0%(valid)
Epoch: 5  | time in 0 minutes, 36 seconds
    Loss: 0.0455(train) |   Acc: 74.8%(train)
    Loss: 0.0004(valid) |   Acc: 72.5%(valid)
Epoch: 6  | time in 0 minutes, 36 seconds
    Loss: 0.0451(train) |   Acc: 74.9%(train)
    Loss: 0.0004(valid) |   Acc: 72.3%(valid)
Epoch: 7  | time in 0 minutes, 36 seconds
    Loss: 0.0446(train) |   Acc: 75.3%(train)
    Loss: 0.0004(valid) |   Acc: 72.0%(valid)
Epoch: 8  | time in 0 minutes, 36 seconds
    Loss: 0.0437(train) |   Acc: 75.9%(train)
    Loss: 0.0004(valid) |   Acc: 71.4%(valid)
Epoch: 9  | time in 0 minutes, 36 seconds
    Loss: 0.0431(train) |   Acc: 76.2%(train)
    Loss: 0.0004(valid) |   Acc: 72.7%(valid)
Epoch: 10  | time in 0 minutes, 36 seconds
    Loss: 0.0426(train) |   Acc: 76.6%(train)
    Loss: 0.0004(valid) |   Acc: 72.6%(valid)
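
Note that test_dataset, loaded at the beginning, is not used in the loop above, and min_valid_loss is initialized but never updated. A minimal sketch for evaluating on the test data and saving the trained weights afterwards (the save path ./ag_news_model.pth is just an illustrative name):

# Reuse the valid function to evaluate on the held-out test data
test_loss, test_acc = valid(test_dataset)
print(f'Test - Loss: {test_loss:.4f} | Acc: {test_acc * 100:.1f}%')

# Persist the trained weights for later use
torch.save(model.state_dict(), './ag_news_model.pth')

Inside the epoch loop, min_valid_loss could similarly be compared against valid_loss to keep only the best checkpoint.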

5. View the word vectors in the Embedding layer

# Print the embedding matrix obtained from the model's state dictionary
print(model.state_dict()['embedding.weight'])

Output effect:

tensor([[ 0.4401, -0.4177, -0.4161,  ...,  0.2497, -0.4657, -0.1861],
        [-0.2574, -0.1952,  0.1443,  ..., -0.4687, -0.0742,  0.2606],
        [-0.1926, -0.1153, -0.0167,  ..., -0.0954,  0.0134, -0.0632],
        ...,
        [-0.0780, -0.2331, -0.3656,  ..., -0.1899,  0.4083,  0.3002],
        [-0.0696,  0.4396, -0.1350,  ...,  0.1019,  0.2792, -0.4749],
        [-0.2978,  0.1872, -0.1994,  ...,  0.3435,  0.4729, -0.2608]])
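
Individual word vectors can also be looked up through the vocabulary (a minimal sketch; it assumes the legacy Vocab object returned by get_vocab() exposes the stoi string-to-index mapping, and 'sports' is just an example word):

# Map a word to its vocabulary index, then index into the embedding matrix
vocab = train_dataset.get_vocab()
idx = vocab.stoi['sports']
print(model.state_dict()['embedding.weight'][idx])   # the 32-dimensional vector for 'sports'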

summary

Some of the methods used in this code have been deprecated; please refer to the newer training code in: [NLP] text classification TorchText actual combat - AG_NEWS news topic classification task (PyTorch version)

Keep going, thank you, and keep striving!

Keywords: Python Deep Learning NLP
