PaddlePaddle regular season: Chinese News Text Title Classification Baseline (PaddleNLP)

1, Solution overview

1.1 Competition introduction:

Text classification uses a computer to automatically categorize and label a collection of texts (or other entities/objects) according to a given classification scheme or standard. This competition targets news title classification: given the news titles and category labels provided, contestants train a classification model and then classify the titles in the test set. The evaluation metric is accuracy: Accuracy = number of correctly classified samples / total number of samples to classify. Contestants are also required to use the PaddlePaddle framework and PaddleNLP, PaddlePaddle's core development library for the text domain. PaddleNLP provides concise, easy-to-use end-to-end APIs for text tasks, application examples for many scenarios, and a rich set of pre-trained models, and is fully adapted to PaddlePaddle 2.x.
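
As a quick illustration of how the metric is computed (the predictions and labels below are invented purely for demonstration, not competition data):

# Toy illustration of the evaluation metric: accuracy = correct / total
preds  = ["Sports", "game", "Finance and Economics", "Sports"]   # hypothetical model predictions
labels = ["Sports", "game", "shares", "Sports"]                  # hypothetical true labels
accuracy = sum(p == l for p, l in zip(preds, labels)) / len(labels)
print(accuracy)  # 0.75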

Competition portal: Regular season: Chinese News Text Title Classification

1.2 Data introduction:

THUCNews was generated from the historical data of the Sina News RSS subscription channels from 2005 to 2011. It contains 740,000 news documents (2.19 GB), all in UTF-8 plain text. Based on the original Sina News classification system, the competition dataset was re-organized into 14 candidate categories: finance, lottery, real estate, stock, home, education, science and technology, society, fashion, current politics, sports, constellation, games and entertainment. A total of 832,471 labeled samples (training plus validation) are provided.

Dataset format provided by the competition: training and validation sets: original title + \t + label; test set: original title only.
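
For intuition, a single training/validation line can be parsed with one split on the tab character (a minimal sketch; the title below is one of the dataset samples shown in section 2):

# Illustrative parsing of one training/validation line in the "title \t label" format
line = "Sony 46 inch new LED LCD special price promotion\tscience and technology"
text_a, label = line.strip().split("\t")
print(text_a)  # the news title
print(label)   # the category label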

1.3 Baseline approach:

This competition is a fairly standard short-text multi-class classification task. The project is built mainly on PaddleNLP: the pre-trained RoBERTa model is fine-tuned on the provided training data to obtain a 14-class news classification model, which is then used to predict the test data and generate the submission result file.

Note that this project needs to run in the advanced GPU environment! If GPU memory is insufficient, reduce the batch size appropriately!

BERT background reading: [Principle] The classic pre-trained model - BERT

2, Data reading and analysis

# Enter the game data set storage directory
%cd /home/aistudio/data/data103654/
/home/aistudio/data/data103654
# Reading datasets using pandas
import pandas as pd
train = pd.read_table('train.txt', sep='\t',header=None)  # Training set
dev = pd.read_table('dev.txt', sep='\t',header=None)      # Validation set
test = pd.read_table('test.txt', sep='\t',header=None)    # Test set
# Adding column names facilitates better processing of data
train.columns = ["text_a",'label']
dev.columns = ["text_a",'label']
test.columns = ["text_a"]
# View the training set: 752,471 rows in total
train
        text_a | label
0       Netease's third quarter results were lower than analysts' expectations | science and technology
1       Barcelona's hell reappeared a year ago, but this time it is heaven. It will turn over when they go to the devil's away game again | Sports
2       The United States says it supports emergency humanitarian assistance to North Korea | Current politics
3       Capital increase bank of communications Kanglian Bank of communications won the first order of participating insurers | shares
4       Midday: the raw materials sector led the market | shares
...     ... | ...
752466  The source of the miracle of Tianjin Women's volleyball team is on the sidelines. He is the real core of the five champions | Sports
752467  Nortel network patent auction postponed: 6 parts may be split for auction | science and technology
752468  Determination of issuance price of Spirit AeroSystems bonds | shares
752469  Lu Huiming must start: Frankfurt has no victory over Manchester United and Inter have passed smoothly | lottery
752470  Sony 46 inch new LED LCD special price promotion | science and technology

752471 rows × 2 columns

# View the validation set: 80,000 rows in total
dev
        text_a | label
0       What if you win 90 million after the netizens and citizens collectively fantasize about winning the prize | lottery
1       PVC futures are expected to be listed in May | Finance and Economics
2       A new work at the third quarter of the afternoon: the record of magic God - fatalistic love | game
3       OSRAM LLFY network provides one-stop lighting solutions | Home Furnishing
4       On where the Beijing property market is going: the endless queue is not enough to raise the price | house property
...     ... | ...
79995   Wang Dalei looked at the predicted score of the national football match. I think it's 2-0 or 3-1 | Sports
79996   Crazza's return to the Raptors was overwhelming. Hill was expelled and the Suns were defeated by 51 points | Sports
79997   Wang Jianzhou will create 4G network business opportunities with Taiwan businessmen | science and technology
79998   Putin made a surprise visit to the food supermarket to investigate his dissatisfaction with the high price of pork (picture) | Current politics
79999   High altitude overlooking female star sexy cleavage (Group pictures) (7) | fashion

80000 rows × 2 columns

# View the test set: 83,599 rows in total
test
        text_a
0       Beijing Juntai department store is full of bright autumn, saving 353020 yuan
1       Ministry of Education: learning sexual knowledge will begin in the upper grade of primary school
2       Professional SLR Camera Canon 7D unit price 9280 yuan
3       DBS Bank sued mainland customers, but the bank's tough customers were helpless
4       Divorced from China's actual pressure, a sharp appreciation of the RMB can only be a dream
...     ...
83594   Razer cup DotA elite challenge came out in August
83595   The improvement of economic data dispelled the expectation of RMB devaluation
83596   Mortgage rate and collateral dual control policy Liu Mingkang supports real estate loans
83597   80 megapixel Rito releases Aptus-II 12 digital back
83598   The Ministry of Education announced the list of more than 10000 regular schools in 33 countries

83599 rows × 1 columns

# Concatenate the training and validation sets for statistical analysis
total = pd.concat([train,dev],axis=0)
# Total category label distribution statistics
total['label'].value_counts()
science and technology    162245
shares                    153949
Sports                    130982
entertainment              92228
Current politics           62867
Sociology                  50541
education                  41680
Finance and Economics      36963
Home Furnishing            32363
game                       24283
house property             19922
fashion                    13335
lottery                     7598
constellation               3515
Name: label, dtype: int64
# Text length statistics: the titles are short, with a maximum length of 48 in the training/validation data
total['text_a'].map(len).describe()
count    832471.000000
mean         19.388112
std           4.097139
min           2.000000
25%          17.000000
50%          20.000000
75%          23.000000
max          48.000000
Name: text_a, dtype: float64
# Length statistics for the test set show a distribution similar to the training data (note the maximum length here is 84)
test['text_a'].map(len).describe()
count    83599.000000
mean        19.815022
std          3.883845
min          3.000000
25%         17.000000
50%         20.000000
75%         23.000000
max         84.000000
Name: text_a, dtype: float64
# Save the processed dataset file
train.to_csv('train.csv', sep='\t', index=False)  # Save the training set (columns: text_a, label; tab-separated)
dev.to_csv('dev.csv', sep='\t', index=False)      # Save the validation set (columns: text_a, label; tab-separated)
test.to_csv('test.csv', sep='\t', index=False)    # Save the test set (column: text_a)

3, Building the baseline model with PaddleNLP

3.1 Environment preparation

# Import the required third-party libraries
import math
import numpy as np
import os
import collections
from functools import partial
import random
import time
import inspect
import importlib
from tqdm import tqdm
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
from paddle.io import IterableDataset
from paddle.utils.download import get_path_from_url
# Install the latest version of paddlenlp
!pip install --upgrade paddlenlp
# Import the related packages required for paddlenlp
import paddlenlp as ppnlp
from paddlenlp.data import JiebaTokenizer, Pad, Stack, Tuple, Vocab
from paddlenlp.datasets import MapDataset
from paddle.dataset.common import md5file
from paddlenlp.datasets import DatasetBuilder

3.2 Define the pre-trained model to fine-tune

# This project uses roberta-wwm-ext-large, which works well on Chinese tasks. With pre-trained models, bigger usually helps: a large model generally gives better results than the base model
MODEL_NAME = "roberta-wwm-ext-large"
# Specifying the model name and the number of classes is enough to define the fine-tuning network: a fully connected classification layer is appended on top of the pre-trained model
model = ppnlp.transformers.RobertaForSequenceClassification.from_pretrained(MODEL_NAME, num_classes=14) # The classification task is 14, so num_classes is set to 14
# Define the tokenizer that matches the model. The tokenizer converts the raw input text into the input format the model accepts. Note that the tokenizer class must correspond to the selected model; see the PaddleNLP documentation for details
tokenizer = ppnlp.transformers.RobertaTokenizer.from_pretrained(MODEL_NAME)
[2021-09-06 23:36:10,711] [    INFO] - Downloading https://paddlenlp.bj.bcebos.com/models/transformers/roberta_large/roberta_chn_large.pdparams and saved to /home/aistudio/.paddlenlp/models/roberta-wwm-ext-large
[2021-09-06 23:36:10,767] [    INFO] - Downloading roberta_chn_large.pdparams from https://paddlenlp.bj.bcebos.com/models/transformers/roberta_large/roberta_chn_large.pdparams
100%|██████████| 1271615/1271615 [00:18<00:00, 69830.07it/s]
[2021-09-06 23:36:41,190] [    INFO] - Downloading https://paddlenlp.bj.bcebos.com/models/transformers/roberta_large/vocab.txt and saved to /home/aistudio/.paddlenlp/models/roberta-wwm-ext-large
[2021-09-06 23:36:41,193] [    INFO] - Downloading vocab.txt from https://paddlenlp.bj.bcebos.com/models/transformers/roberta_large/vocab.txt
100%|██████████| 107/107 [00:00<00:00, 3538.63it/s]
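
As a quick sanity check of what the tokenizer returns (a minimal sketch; the real titles are Chinese, and the exact ids depend on the downloaded vocabulary):

# The tokenizer turns a raw title into the dict of ids that the model consumes
sample = tokenizer(text="Sony 46 inch new LED LCD special price promotion", max_seq_len=48)
print(sample.keys())             # expected keys: input_ids, token_type_ids
print(sample["input_ids"][:10])  # the first few token ids (values depend on the vocabulary)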

Besides RoBERTa, PaddleNLP also supports ERNIE, BERT, ELECTRA and other pre-trained models. For details, see: PaddleNLP models.

The table below summarizes the pre-trained models currently supported by PaddleNLP. They can be used for question answering, sequence classification, token classification and other tasks. In total, 22 sets of pre-trained weights are provided, including weights for 11 Chinese models.

Model: BERT
Tokenizer: BertTokenizer
Supported Task: BertModel, BertForQuestionAnswering, BertForSequenceClassification, BertForTokenClassification
Model Name: bert-base-uncased, bert-large-uncased, bert-base-multilingual-uncased, bert-base-cased, bert-base-chinese, bert-base-multilingual-cased, bert-large-cased, bert-wwm-chinese, bert-wwm-ext-chinese

Model: ERNIE
Tokenizer: ErnieTokenizer, ErnieTinyTokenizer
Supported Task: ErnieModel, ErnieForQuestionAnswering, ErnieForSequenceClassification, ErnieForTokenClassification
Model Name: ernie-1.0, ernie-tiny, ernie-2.0-en, ernie-2.0-large-en

Model: RoBERTa
Tokenizer: RobertaTokenizer
Supported Task: RobertaModel, RobertaForQuestionAnswering, RobertaForSequenceClassification, RobertaForTokenClassification
Model Name: roberta-wwm-ext, roberta-wwm-ext-large, rbt3, rbtl3

Model: ELECTRA
Tokenizer: ElectraTokenizer
Supported Task: ElectraModel, ElectraForSequenceClassification, ElectraForTokenClassification
Model Name: electra-small, electra-base, electra-large, chinese-electra-small, chinese-electra-base

Note: the Chinese pre-trained models include bert-base-chinese, bert-wwm-chinese, bert-wwm-ext-chinese, ernie-1.0, ernie-tiny, roberta-wwm-ext, roberta-wwm-ext-large, rbt3, rbtl3, chinese-electra-base, chinese-electra-small, etc.
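
For example, switching this baseline to ERNIE would only require changing the model and tokenizer definitions; the rest of the pipeline stays the same (a hedged sketch, not actually used in this project):

# Hypothetical alternative: fine-tune ernie-1.0 instead of roberta-wwm-ext-large
ernie_model = ppnlp.transformers.ErnieForSequenceClassification.from_pretrained(
    "ernie-1.0", num_classes=14)   # 14 news categories, as above
ernie_tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained("ernie-1.0")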

3.3 Data reading and processing

# Define 14 categories to classify
label_list=list(train.label.unique())
print(label_list)
['science and technology', 'Sports', 'Current politics', 'shares', 'entertainment', 'education', 'Home Furnishing', 'Finance and Economics', 'house property', 'Sociology', 'game', 'lottery', 'constellation', 'fashion']
# Define the file corresponding to the dataset and its file storage format
class NewsData(DatasetBuilder):
    SPLITS = {
        'train': 'train.csv',  # Training set
        'dev': 'dev.csv',      # Validation set
    }

    def _get_data(self, mode, **kwargs):
        filename = self.SPLITS[mode]
        return filename

    def _read(self, filename):
        """Read data"""
        with open(filename, 'r', encoding='utf-8') as f:
            head = None
            for line in f:
                data = line.strip().split("\t")    # Columns are separated by '\t'
                if not head:
                    head = data
                else:
                    text_a, label = data
                    yield {"text_a": text_a, "label": label}  # The format of the data set this time is: text_ a. Label, which can be modified according to the specific situation

    def get_labels(self):
        return label_list   # Category label
# Define dataset loading functions
def load_dataset(name=None,
                 data_files=None,
                 splits=None,
                 lazy=None,
                 **kwargs):
   
    reader_cls = NewsData  # Load defined dataset format
    print(reader_cls)
    if not name:
        reader_instance = reader_cls(lazy=lazy, **kwargs)
    else:
        reader_instance = reader_cls(lazy=lazy, name=name, **kwargs)

    datasets = reader_instance.read_datasets(data_files=data_files, splits=splits)
    return datasets
# Load training and validation sets
train_ds, dev_ds = load_dataset(splits=["train", "dev"])
<class '__main__.NewsData'>
# Define data loading and processing functions
def convert_example(example, tokenizer, max_seq_length=128, is_test=False):
    qtconcat = example["text_a"]
    encoded_inputs = tokenizer(text=qtconcat, max_seq_len=max_seq_length)  # The tokenizer converts the text into the input format the model accepts
    input_ids = encoded_inputs["input_ids"]
    token_type_ids = encoded_inputs["token_type_ids"]

    if not is_test:
        label = np.array([example["label"]], dtype="int64")
        return input_ids, token_type_ids, label
    else:
        return input_ids, token_type_ids

# Define the data loading function dataloader
def create_dataloader(dataset,
                      mode='train',
                      batch_size=1,
                      batchify_fn=None,
                      trans_fn=None):
    if trans_fn:
        dataset = dataset.map(trans_fn)

    shuffle = True if mode == 'train' else False
    # The training set is randomly shuffled; validation/test data are not shuffled
    if mode == 'train':
        batch_sampler = paddle.io.DistributedBatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)
    else:
        batch_sampler = paddle.io.BatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)

    return paddle.io.DataLoader(
        dataset=dataset,
        batch_sampler=batch_sampler,
        collate_fn=batchify_fn,
        return_list=True)
# Parameter setting:
# Batch size; if GPU memory is insufficient, reduce this value appropriately
batch_size = 300
# Maximum truncation length of the text sequence. It should be chosen according to the actual text lengths and must not exceed 512. The length analysis shows the longest training title is 48 characters, so it is set to 48 here (a few test titles are longer and will be truncated)
max_seq_length = 48
# Process the data into a data format that the model can read in
trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length)

batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
    Stack()  # labels
): [data for data in fn(samples)]

# Training set iterator
train_data_loader = create_dataloader(
    train_ds,
    mode='train',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)

# Validation set iterator
dev_data_loader = create_dataloader(
    dev_ds,
    mode='dev',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)

3.4 Set up the fine-tuning optimization strategy and evaluation metric

Transformer models such as BERT are typically trained with a dynamic learning rate schedule that includes a warmup phase.
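
A small sketch of what this schedule does, using LinearDecayWithWarmup with illustrative step counts (the actual values used for training are configured below):

# Illustration only: the learning rate ramps up linearly during the warmup phase,
# then decays linearly towards 0 over the remaining steps
from paddlenlp.transformers import LinearDecayWithWarmup

demo_scheduler = LinearDecayWithWarmup(4e-5, 100, 0.1)  # peak lr, total steps, warmup proportion
lrs = []
for _ in range(100):
    lrs.append(demo_scheduler.get_lr())
    demo_scheduler.step()
print(max(lrs))  # roughly 4e-5, reached right after the warmup phase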

# Define hyperparameters, loss, optimizers, etc
from paddlenlp.transformers import LinearDecayWithWarmup

# Define training configuration parameters:
# Define the maximum learning rate during training
learning_rate = 4e-5
# Training rounds
epochs = 4
# Learning rate warmup proportion
warmup_proportion = 0.1
# Weight decay coefficient, which acts like a regularization term to help prevent overfitting
weight_decay = 0.01

num_training_steps = len(train_data_loader) * epochs
lr_scheduler = LinearDecayWithWarmup(learning_rate, num_training_steps, warmup_proportion)

# AdamW optimizer
optimizer = paddle.optimizer.AdamW(
    learning_rate=lr_scheduler,
    parameters=model.parameters(),
    weight_decay=weight_decay,
    apply_decay_param_fun=lambda x: x in [
        p.name for n, p in model.named_parameters()
        if not any(nd in n for nd in ["bias", "norm"])
    ])

criterion = paddle.nn.loss.CrossEntropyLoss()  # Cross entropy loss function
metric = paddle.metric.Accuracy()              # accuracy evaluation index

3.5 Model training and evaluation

ps: while the model is training, you can check GPU memory usage by running the nvidia-smi command in a terminal or via the "performance monitoring" panel at the bottom, and adjust the batch size appropriately to avoid the job being interrupted by an out-of-memory error.
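
For example, a cell like the following can be run inside the notebook (standard NVIDIA tooling, independent of PaddleNLP):

# Check current GPU memory usage without leaving the notebook
!nvidia-smi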

# Define the evaluation function used on the validation set during training
@paddle.no_grad()
def evaluate(model, criterion, metric, data_loader):
    model.eval()
    metric.reset()
    losses = []
    for batch in data_loader:
        input_ids, token_type_ids, labels = batch
        logits = model(input_ids, token_type_ids)
        loss = criterion(logits, labels)
        losses.append(loss.numpy())
        correct = metric.compute(logits, labels)
        metric.update(correct)
        accu = metric.accumulate()
    print("eval loss: %.5f, accu: %.5f" % (np.mean(losses), accu))  # Evaluate effect on output validation set
    model.train()
    metric.reset()
    return accu  # Return accuracy
# Fix random seeds so results can be reproduced
seed = 1024
random.seed(seed)
np.random.seed(seed)
paddle.seed(seed)
<paddle.fluid.core_avx.Generator at 0x7f7c85b26a30>

ps: during training, you can also check GPU memory usage with the nvidia-smi command in a terminal or via the performance monitoring panel at the lower right; if memory is insufficient, reduce the batch size.

# Model training:
import paddle.nn.functional as F

save_dir = "checkpoint"
if not  os.path.exists(save_dir):
    os.makedirs(save_dir)

pre_accu=0
accu=0
global_step = 0
for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        input_ids, segment_ids, labels = batch
        logits = model(input_ids, segment_ids)
        loss = criterion(logits, labels)
        probs = F.softmax(logits, axis=1)
        correct = metric.compute(probs, labels)
        metric.update(correct)
        acc = metric.accumulate()

        global_step += 1
        if global_step % 10 == 0 :
            print("global step %d, epoch: %d, batch: %d, loss: %.5f, acc: %.5f" % (global_step, epoch, step, loss, acc))
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.clear_grad()
    # The validation set is evaluated at the end of each round
    accu = evaluate(model, criterion, metric, dev_data_loader)
    print(accu)
    if accu > pre_accu:
        # Save the model parameters when this round beats the best validation accuracy so far
        save_param_path = os.path.join(save_dir, 'model_state.pdparams')  # Save model parameters
        paddle.save(model.state_dict(), save_param_path)
        pre_accu=accu
tokenizer.save_pretrained(save_dir)
# Load the model parameters from the round that performed best on the validation set
import os
import paddle

params_path = 'checkpoint/model_state.pdparams'
if params_path and os.path.isfile(params_path):
    # Load model parameters
    state_dict = paddle.load(params_path)
    model.set_dict(state_dict)
    print("Loaded parameters from %s" % params_path)
Loaded parameters from checkpoint/model_state.pdparams
# Evaluate the best saved model parameters on the validation set
evaluate(model, criterion, metric, dev_data_loader)
eval loss: 0.01434, accu: 0.99598
0.995975

3.6 Model prediction

# Define model prediction function
def predict(model, data, tokenizer, label_map, batch_size=1):
    examples = []
    # Process the input data (list format) into a format acceptable to the model
    for text in data:
        input_ids, segment_ids = convert_example(
            text,
            tokenizer,
            max_seq_length=128,
            is_test=True)
        examples.append((input_ids, segment_ids))

    batchify_fn = lambda samples, fn=Tuple(
        Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input id
        Pad(axis=0, pad_val=tokenizer.pad_token_id),  # segment id
    ): fn(samples)

    # Separate the data into batches.
    batches = []
    one_batch = []
    for example in examples:
        one_batch.append(example)
        if len(one_batch) == batch_size:
            batches.append(one_batch)
            one_batch = []
    if one_batch:
        # The last batch whose size is less than the config batch_size setting.
        batches.append(one_batch)

    results = []
    model.eval()
    for batch in batches:
        input_ids, segment_ids = batchify_fn(batch)
        input_ids = paddle.to_tensor(input_ids)
        segment_ids = paddle.to_tensor(segment_ids)
        logits = model(input_ids, segment_ids)
        probs = F.softmax(logits, axis=1)
        idx = paddle.argmax(probs, axis=1).numpy()
        idx = idx.tolist()
        labels = [label_map[i] for i in idx]
        results.extend(labels)
    return results  # Return forecast results
# Define categories to classify
label_list=list(train.label.unique())
label_map = { 
    idx: label_text for idx, label_text in enumerate(label_list)
}
print(label_map)
{0: 'science and technology', 1: 'Sports', 2: 'Current politics', 3: 'shares', 4: 'entertainment', 5: 'education', 6: 'Home Furnishing', 7: 'Finance and Economics', 8: 'house property', 9: 'Sociology', 10: 'game', 11: 'lottery', 12: 'constellation', 13: 'fashion'}
# Read the test set file to predict
test = pd.read_csv('./test.csv',sep='\t')  
# Define a preprocessing function that converts the list of titles into the input format expected by the predict function
def preprocess_prediction_data(data):
    examples = []
    for text_a in data:
        examples.append({"text_a": text_a})
    return examples

# Format test set data
data1 = list(test.text_a)
examples = preprocess_prediction_data(data1)
# Predict the test set
results = predict(model, examples, tokenizer, label_map, batch_size=16)   
# Write the prediction results (a list) to a txt file; the submission format requires one category label per line
def write_results(labels, file_path):
    with open(file_path, "w", encoding="utf8") as f:
        f.writelines("\n".join(labels))

write_results(results, "./result.txt")
# The submission must be a zip file, so compress the result file into submission.zip
!zip 'submission.zip' 'result.txt'
  adding: result.txt (deflated 89%)
# Copy the submission file from the data directory to the home directory
!cp -r /home/aistudio/data/data103654/submission.zip /home/aistudio/

Note that the submission must be a zip file: find the generated submission.zip in the home directory, download it locally, and submit it on the competition page!

4, Directions for improvement:

1. Apply data augmentation to enlarge the training data and improve the model's generalization ability. See: NLP Chinese Data Augmentation, a one-click Chinese data augmentation tool.

2. Starting from the baseline model, the score can be improved further by hyperparameter tuning and model optimization. See: Practice of fine-tuning tricks in text classification.

3. Try different pre-trained models, such as ERNIE and NEZHA, and fuse the predictions of multiple models by voting (a sketch of such a fusion follows this list). See: Competition tricks - result fusion.

4. After concatenating the training and validation sets, you can define your own train/validation split; the differing models can then be fused, or 5-fold cross-validation can be used (a second sketch follows this list).

5. Use the predictions on which multiple models agree as pseudo-labels for further training. Pseudo-labelling is generally only helpful when the models are already accurate, so beginners should use it with caution.

6. If you have the resources, try further pre-training on the competition corpus or modifying the model's network structure to push the score higher.

7. For more techniques, study the top solutions shared for other short-text classification competitions, e.g. Zero-based introduction to NLP - News Text Classification.
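
For point 3, a minimal majority-vote fusion over several result files could look like the following sketch (the file names are hypothetical):

# Hypothetical majority-vote fusion of prediction files produced by several models.
# Each file contains one predicted label per line, in the same order as test.csv.
from collections import Counter

result_files = ["result_roberta.txt", "result_ernie.txt", "result_nezha.txt"]  # hypothetical file names
all_preds = []
for path in result_files:
    with open(path, encoding="utf8") as f:
        all_preds.append([line.strip() for line in f])

fused = [Counter(votes).most_common(1)[0][0] for votes in zip(*all_preds)]  # most frequent label wins
with open("result.txt", "w", encoding="utf8") as f:
    f.write("\n".join(fused))

For point 4, a sketch of a stratified 5-fold split of the concatenated data, assuming scikit-learn is available in the environment:

# Hypothetical 5-fold split of the concatenated train + dev data ("total" from section 2).
# Each fold trains one model; the per-fold predictions can then be fused.
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1024)
for fold, (train_idx, valid_idx) in enumerate(skf.split(total["text_a"], total["label"])):
    total.iloc[train_idx].to_csv(f"train_fold{fold}.csv", sep="\t", index=False)
    total.iloc[valid_idx].to_csv(f"dev_fold{fold}.csv", sep="\t", index=False)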

For how to use PaddleNLP, it is recommended to read the latest official documentation: PaddleNLP documentation.

PaddleNLP GitHub repository: https://github.com/PaddlePaddle/PaddleNLP. If you have questions, open an issue on GitHub and someone will answer it.

5, Personal introduction

Nickname: Alchemist 233

Main focus: development, especially NLP and data-mining related competitions and projects

https://aistudio.baidu.com/aistudio/personalcenter/thirdview/330406 Follow me, and I will share more interesting projects next time!

Keywords: NLP paddlepaddle
