PaddlePaddle regular season: Chinese News Text Title Classification - No. 1 solution in December

Regular season: Chinese News Text Title Classification

I Scheme introduction

1.1 introduction to the competition:

Text classification means automatically assigning texts (or other entities or objects) to the categories of a given classification system or standard with the help of a computer. This competition targets news title classification: participants train a classification model on the provided news titles and category labels, then classify the news titles of the test set. The evaluation metric is accuracy = number of correctly classified titles / total number of titles to classify. Participants must use the PaddlePaddle framework and PaddleNLP, PaddlePaddle's core development library for text. PaddleNLP offers concise, easy-to-use APIs covering the whole text-processing workflow, application examples for many scenarios, and a rich set of pre-trained models, and it is closely adapted to PaddlePaddle 2.x.
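The accuracy metric described above can be written out in a few lines. This is a minimal sketch, not the competition's official scoring script; the function name accuracy and the sample labels are made up for illustration.

```python
def accuracy(predicted, expected):
    """Fraction of titles whose predicted category equals the reference label."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("at least one sample is required")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)

# Hypothetical labels, for illustration only
preds = ["sports", "stock", "game", "sports"]
golds = ["sports", "stock", "fashion", "sports"]
print(accuracy(preds, golds))  # 3 of 4 correct -> 0.75
```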

Competition page: Regular season: Chinese News Text Title Classification

1.2 data introduction:

THUCNews was built by filtering the historical data of the Sina News RSS subscription channels from 2005 to 2011. It contains 740000 news documents (2.19 GB), all in UTF-8 plain text format. Starting from the original Sina News classification system, the competition dataset was reorganized into 14 candidate categories: finance, lottery, real estate, stock, home, education, science and technology, society, fashion, current politics, sports, constellation, game and entertainment. A total of 832471 labeled titles are provided (training set plus validation set).

Format of the dataset provided by the competition: each line of the training and validation sets is "original title + \t + label"; each line of the test set is the original title only.

1.3 model idea:

This is a fairly conventional short-text classification task. With PaddleNLP, nine pre-trained models are first fine-tuned into 14-class news title classifiers; the predictions of the different models are then combined through ensemble learning (plurality voting, the maximum-score method, etc.); finally, the result files are analyzed to generate pseudo-labels, which iteratively enlarge the training set.
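The two ensemble rules mentioned above can be sketched in plain Python. This is an illustration under assumptions, not the author's code: plurality_vote and max_score are hypothetical helper names, and "maximum-score method" is interpreted here as picking the label that receives the highest probability from any single model.

```python
from collections import Counter

def plurality_vote(predictions):
    """predictions: one list of predicted labels per model, all the same length."""
    voted = []
    for labels in zip(*predictions):
        # The label with the most votes wins; on a tie, the label that
        # appears first (i.e. from the earliest model in the list) is kept
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted

def max_score(probabilities):
    """probabilities: one list of {label: probability} dicts per model."""
    chosen = []
    for per_model in zip(*probabilities):
        # Flatten all (label, probability) pairs of this sample across models
        pairs = [(label, p) for scores in per_model for label, p in scores.items()]
        chosen.append(max(pairs, key=lambda item: item[1])[0])
    return chosen

# Hypothetical outputs of three models on three titles
model_a = ["sports", "stock", "game"]
model_b = ["sports", "finance", "game"]
model_c = ["society", "finance", "fashion"]
print(plurality_vote([model_a, model_b, model_c]))
# ['sports', 'finance', 'game']

# Hypothetical probabilities of two models on two titles
probs_a = [{"sports": 0.9, "game": 0.1}, {"sports": 0.4, "game": 0.6}]
probs_b = [{"sports": 0.6, "game": 0.4}, {"sports": 0.8, "game": 0.2}]
print(max_score([probs_a, probs_b]))
# ['sports', 'sports']
```

Voting only needs the predicted labels of each model, while the maximum-score rule needs the per-class probabilities to be saved as well.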

Pre-trained models consume a lot of GPU memory, so choose the highest-tier ("supreme") GPU environment when running. If GPU memory is insufficient, reduce the batch size accordingly.

Background reading on BERT: [Principle] Classic pre-trained model - BERT

II Data reading and analysis

# Enter the directory where the competition dataset is stored
%cd /home/aistudio/dataset/

/home/aistudio/dataset

# Read the datasets with pandas
import pandas as pd
train = pd.read_table('train.txt', sep='\t',header=None)  # Training set
dev = pd.read_table('dev.txt', sep='\t',header=None)      # Validation set (official split)
test = pd.read_table('test.txt', sep='\t',header=None)    # Test set

# Manually add column names to the dataset
train.columns = ["text_a",'label']
dev.columns = ["text_a",'label']
test.columns = ["text_a"]

# Splice training and validation sets for statistical analysis
total = pd.concat([train,dev],axis=0)
print(total.label.unique())
print(total.label.unique().tolist())

['science and technology' 'Sports' 'Current politics' 'shares' 'entertainment' 'education' 'Home Furnishing' 'Finance and Economics' 'house property' 'Sociology' 'game' 'lottery' 'constellation' 'fashion']
['science and technology', 'Sports', 'Current politics', 'shares', 'entertainment', 'education', 'Home Furnishing', 'Finance and Economics', 'house property', 'Sociology', 'game', 'lottery', 'constellation', 'fashion']

# Concatenate the training and validation sets for statistical analysis
total = pd.concat([train,dev],axis=0)
print("The number and proportion of labels in different categories are as follows:")
# Per-category counts (training + validation set) and proportions (training set only), both sorted in descending order. Each is a "pandas.core.series.Series"
series_Count_desc = total['label'].value_counts()
series_Ratio_desc = train['label'].value_counts(normalize=True)
# The number and proportion of labels in different categories are merged, and they are still arranged in descending order
series_Info_desc = pd.concat([series_Count_desc,series_Ratio_desc],axis=1)
# print(series_Info_desc)
# print(series_Info_desc.keys())
# Define categories to classify
label_list = total.label.unique().tolist()
# Establish a mapping table from classification label to ID
Label_ID_map = {Val:ID for ID,Val in enumerate(label_list)}
series_Info_data = pd.Series(Label_ID_map)
# print(series_Info_data.keys())
# Rearrange the number and proportion of labels in different categories according to the order of ID
df_Info_data = pd.concat([series_Info_data,series_Info_desc],axis=1)
# Set column name
df_Info_data.columns=['ID','Count','Ratio']
print(df_Info_data)
# print(type(df_Info_data))

The number and proportion of labels in different categories are as follows:
                    ID   Count     Ratio
Technology           0  162245  0.194874
Physical education   1  130982  0.157401
Current politics     2   62867  0.075455
Stock                3  153949  0.184670
Entertainment        4   92228  0.110792
Education            5   41680  0.050159
Home                 6   32363  0.038976
Caijing              7   36963  0.044372
Property             8   19922  0.023981
Social               9   50541  0.060820
Game                10   24283  0.029152
Lottery             11    7598  0.009077
Constellation       12    3515  0.004281
Fashion             13   13335  0.015990

# Compute per-class weights for the loss function: the reciprocal of each class proportion, plus a version normalized to sum to 1
series_Recip_desc = 1/series_Ratio_desc  # reciprocal
series_Weight_desc = series_Recip_desc
series_WeightNorm_desc = series_Recip_desc/sum(series_Recip_desc)  # normalization
# print(series_Recip_desc)
# print(series_Weight_desc)
# Append both weight columns to the per-category summary table
df_Info_data = pd.concat([df_Info_data,series_Weight_desc],axis=1)
df_Info_data = pd.concat([df_Info_data,series_WeightNorm_desc],axis=1)
# Set column name
df_Info_data.columns=['ID','Count','Ratio','Weight','Weight Norm']
print(df_Info_data)

                    ID   Count     Ratio      Weight  Weight Norm
Technology           0  162245  0.194874    5.131522     0.008467
Physical education   1  130982  0.157401    6.353183     0.010482
Current politics     2   62867  0.075455   13.252862     0.021867
Stock                3  153949  0.184670    5.415058     0.008935
Entertainment        4   92228  0.110792    9.025897     0.014892
Education            5   41680  0.050159   19.936703     0.032895
Home                 6   32363  0.038976   25.657085     0.042333
Caijing              7   36963  0.044372   22.536494     0.037184
Property             8   19922  0.023981   41.699695     0.068802
Social               9   50541  0.060820   16.442063     0.027129
Game                10   24283  0.029152   34.303018     0.056598
Lottery             11    7598  0.009077  110.171449     0.181778
Constellation       12    3515  0.004281  233.614095     0.385452
Fashion             13   13335  0.015990   62.539146     0.103187
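The weighting arithmetic behind the Weight and Weight Norm columns can be checked on a toy example. This is a sketch with made-up class counts; the helper name inverse_frequency_weights is hypothetical and does not appear in the notebook.

```python
def inverse_frequency_weights(counts):
    """Return (reciprocal-of-proportion weights, weights normalized to sum to 1)."""
    total = sum(counts.values())
    ratios = {label: n / total for label, n in counts.items()}
    weights = {label: 1.0 / r for label, r in ratios.items()}
    norm = sum(weights.values())
    normalized = {label: w / norm for label, w in weights.items()}
    return weights, normalized

# Hypothetical class counts: A is frequent, C is rare
counts = {"A": 60, "B": 30, "C": 10}
weights, normalized = inverse_frequency_weights(counts)
for label in counts:
    print(label, round(weights[label], 4), round(normalized[label], 4))
```

The rarest class receives the largest weight, which is why the constellation and lottery categories dominate the Weight Norm column above.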

# Global settings to solve the problem of Chinese display errors in matplotlib. Refer to: https://aistudio.baidu.com/aistudio/projectdetail/1658980
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.font_manager as font_manager

# Set display Chinese
matplotlib.rcParams['font.sans-serif'] = ['FZSongYi-Z13S'] # Specifies the default font
matplotlib.rcParams['axes.unicode_minus'] = False # Solve the problem that the negative sign '-' is displayed as a square in the saved image
# Set font size
matplotlib.rcParams['font.size'] = 16

# Draw the distribution of different types of labels (in ascending order of ID)
plt.subplot2grid((2,2),(0,0),colspan=2)
df_Info_data['Count'].plot(kind='bar');
# Draw the distribution of different types of labels (in descending order according to the number of samples)
plt.subplot2grid((2,2),(1,0),colspan=2)
series_Count_desc.plot(kind='bar');

# Length statistics of the titles in "training set + validation set"
print("The length information of Train&&Dev is as follows:")
print(total['text_a'].map(len).describe())
# Length statistics of the titles in the test set
print("The length information of Test is as follows:")
print(test['text_a'].map(len).describe())

The length information of Train&&Dev is as follows:
count    832471.000000
mean         19.388112
std           4.097139
min           2.000000
25%          17.000000
50%          20.000000
75%          23.000000
max          48.000000
Name: text_a, dtype: float64
The length information of Test is as follows:
count    83599.000000
mean        19.815022
std          3.883845
min          3.000000
25%         17.000000
50%         20.000000
75%         23.000000
max         84.000000
Name: text_a, dtype: float64
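Length statistics like these are what a maximum sequence length for the tokenizer is usually chosen from. The sketch below estimates how many titles a given limit would truncate. It rests on two assumptions that are not stated in the source: each character becomes roughly one token, and the tokenizer reserves two positions for special tokens such as [CLS] and [SEP]. The function name truncation_rate and the sample lengths are hypothetical.

```python
def truncation_rate(lengths, max_seq_length, num_special_tokens=2):
    """Share of texts longer than the token budget left after special tokens."""
    budget = max_seq_length - num_special_tokens
    truncated = sum(1 for n in lengths if n > budget)
    return truncated / len(lengths)

# Hypothetical title lengths (in characters), for illustration only
lengths = [17, 20, 23, 48, 84]
print(truncation_rate(lengths, max_seq_length=48))  # 48 and 84 exceed 46 -> 0.4
```

With the real data, passing total['text_a'].map(len) or test['text_a'].map(len) as lengths would give the corresponding share for each split.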

# Whether to train on all labeled data (training set + validation set)
useTotalData = False
# Whether to add pseudo-labeled data to the training set
useFakeData = False
# Whether to train on pseudo-labeled data only (incremental learning)
useFakeOnly = False
# Note: fakeData1 (the pseudo-labeled data, expected to have the columns text_a and label) is not defined in this section; load it before enabling useFakeData or useFakeOnly
if useTotalData:
    # Train on all labeled data (training set + validation set)
    train = pd.concat([train,dev],axis=0)
if useFakeData:
    # Train with the pseudo-labels added
    train = pd.concat([train,fakeData1],axis=0)
if useFakeOnly:
    # Train on the pseudo-labels only (incremental learning)
    train = fakeData1
# Save the processed dataset files
train.to_csv('train.csv', sep='\t', index=False)  # Save the training set with the columns text_a, label, separated by \t
dev.to_csv('dev.csv', sep='\t', index=False)      # Save the validation set with the columns text_a, label, separated by \t
test.to_csv('test.csv', sep='\t', index=False)    # Save the test set with the single column text_a
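The source does not show how the pseudo-labeled data (fakeData1) is produced. One common approach, given here purely as an assumption, is to keep only those test titles on which a large enough share of the fine-tuned models agree. The helper build_pseudo_labels, the min_agreement threshold and the sample data are all hypothetical.

```python
from collections import Counter

def build_pseudo_labels(titles, model_predictions, min_agreement=1.0):
    """Keep titles whose most frequent predicted label reaches the agreement share."""
    rows = []
    for title, labels in zip(titles, zip(*model_predictions)):
        label, votes = Counter(labels).most_common(1)[0]
        if votes / len(labels) >= min_agreement:
            rows.append({"text_a": title, "label": label})
    return rows

# Hypothetical predictions of three models on three test titles
titles = ["title 1", "title 2", "title 3"]
model_predictions = [
    ["sports", "stock", "game"],
    ["sports", "stock", "fashion"],
    ["sports", "finance", "game"],
]
pseudo = build_pseudo_labels(titles, model_predictions)
print(pseudo)  # only "title 1" is predicted unanimously
```

Rows in this shape can be turned into a DataFrame with pd.DataFrame(rows), which has the same two columns as train and can therefore be concatenated with it.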

III Build a baseline model based on PaddleNLP

3.1 setting the operating environment

# Import the required third-party libraries
import math
import numpy as np
import os
import collections
from functools import partial
import random
import time
import inspect
import importlib
from tqdm import tqdm
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
from paddle.io import IterableDataset
from paddle.utils.download import get_path_from_url

# Install PaddleNLP 2.1. The latest version (2.2) causes problems in the steps below
!pip install --upgrade paddlenlp==2.1

# Import related packages required by paddlenlp
import paddlenlp as ppnlp
from paddlenlp.data import JiebaTokenizer, Pad, Stack, Tuple, Vocab
from paddlenlp.datasets import MapDataset
from paddle.dataset.common import md5file
from paddlenlp.datasets import DatasetBuilder

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already up-to-date: paddlenlp==2.1 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (2.1.0)

3.2 define the pre-trained model to be fine-tuned

For model selection, 7 pre-trained models that perform well on different Chinese tasks were chosen. After fine-tuning on the training set, their submission scores are as follows:

roberta-wwm-ext-large: 89.08

nezha-large-wwm-chinese: 88.96

skep_ernie_1.0_large_ch: 88.82

bert-wwm-ext-chinese: 88.62

macbert-large-chinese: 88.75

huhuiwen/mengzi-bert-base: 88.64

junnyu/hfl-chinese-electra-180g-base-discriminator: 88.28

# Use the roberta-wwm-ext-large model
MODEL_NAME = "roberta-wwm-ext-large"
# Specifying the model name and the number of classes is enough to define the fine-tuning network: a fully connected layer is appended to the pre-trained model for classification
model = ppnlp.transformers.RobertaForSequenceClassification.from_pretrained(MODEL_NAME, num_classes=14) # There are 14 news categories, so num_classes is set to 14
# Define the tokenizer that matches the model. The tokenizer converts raw input text into the input format the model accepts. The tokenizer class must correspond to the selected model; see the PaddleNLP documentation for details
tokenizer = ppnlp.transformers.RobertaTokenizer.from_pretrained(MODEL_NAME)

[2022-01-20 11:12:00,457] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/roberta-wwm-ext-large/roberta_chn_large.pdparams
[2022-01-20 11:12:05,655] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/roberta-wwm-ext-large/vocab.txt

# Use the nezha-large-wwm-chinese model
# Load the model by specifying its name
#model = ppnlp.transformers.NeZhaForSequenceClassification.from_pretrained('nezha-large-wwm-chinese', num_classes=14)
# Likewise, load the matching tokenizer by name; it processes the text, e.g. splits it into tokens and converts them to token ids
#tokenizer = ppnlp.transformers.NeZhaTokenizer.from_pretrained('nezha-large-wwm-chinese')

# Use the skep_ernie_1.0_large_ch model
# Load the model by specifying its name
#model = ppnlp.transformers.SkepForSequenceClassification.from_pretrained(pretrained_model_name_or_path="skep_ernie_1.0_large_ch", num_classes=14)
# Likewise, load the matching tokenizer by name; it processes the text, e.g. splits it into tokens and converts them to token ids
#tokenizer = ppnlp.transformers.SkepTokenizer.from_pretrained(pretrained_model_name_or_path="skep_ernie_1.0_large_ch")

# Use the bert-wwm-ext-chinese model
# Note: this block is not commented out, so when it runs after the roberta-wwm-ext-large block it overwrites model and tokenizer; keep only the model you want to train uncommented
MODEL_NAME = "bert-wwm-ext-chinese"
# Specifying the model name and the number of classes is enough to define the fine-tuning network: a fully connected layer is appended to the pre-trained model for classification
model = ppnlp.transformers.BertForSequenceClassification.from_pretrained(MODEL_NAME, num_classes=14) # There are 14 news categories, so num_classes is set to 14
# Define the tokenizer that matches the model; the tokenizer class must correspond to the selected model
tokenizer = ppnlp.transformers.BertTokenizer.from_pretrained(MODEL_NAME)

# Use the macbert-large-chinese model
#MODEL_NAME = "macbert-large-chinese"
# Specifying the model name and the number of classes is enough to define the fine-tuning network: a fully connected layer is appended to the pre-trained model for classification
#model = ppnlp.transformers.BertForSequenceClassification.from_pretrained(MODEL_NAME, num_classes=14) # There are 14 news categories, so num_classes is set to 14
# Define the tokenizer that matches the model; the tokenizer class must correspond to the selected model
#tokenizer = ppnlp.transformers.BertTokenizer.from_pretrained(MODEL_NAME)

# Use the huhuiwen/mengzi-bert-base model
#MODEL_NAME = "huhuiwen/mengzi-bert-base"
# Specifying the model name and the number of classes is enough to define the fine-tuning network: a fully connected layer is appended to the pre-trained model for classification
#model = ppnlp.transformers.BertForSequenceClassification.from_pretrained(MODEL_NAME, num_classes=14) # There are 14 news categories, so num_classes is set to 14
# Define the tokenizer that matches the model; the tokenizer class must correspond to the selected model
#tokenizer = ppnlp.transformers.BertTokenizer.from_pretrained(MODEL_NAME)

# Use the junnyu/hfl-chinese-electra-180g-base-discriminator model
# Load the model by specifying its name
#model = ppnlp.transformers.ElectraForSequenceClassification.from_pretrained('junnyu/hfl-chinese-electra-180g-base-discriminator', num_classes=14)
# Likewise, load the matching tokenizer by name; it processes the text, e.g. splits it into tokens and converts them to token ids
#tokenizer = ppnlp.transformers.ElectraTokenizer.from_pretrained('junnyu/hfl-chinese-electra-180g-base-discriminator')

PaddleNLP supports not only the RoBERTa pre-trained model but also ERNIE, BERT, ELECTRA and other pre-trained models. For details, see: PaddleNLP model

The following table summarizes the pre-trained models currently supported by PaddleNLP. The models can be used for question answering, sequence classification, token classification and other tasks. In total, 22 sets of pre-trained weights are provided, 11 of which are for Chinese language models.

Model	Tokenizer	Supported Task	Model Name
BERT	BertTokenizer	BertModel BertForQuestionAnswering BertForSequenceClassification BertForTokenClassification	bert-base-uncased bert-large-uncased bert-base-multilingual-uncased bert-base-cased bert-base-chinese bert-base-multilingual-cased bert-large-cased bert-wwm-chinese bert-wwm-ext-chinese
ERNIE	ErnieTokenizer ErnieTinyTokenizer	ErnieModel ErnieForQuestionAnswering ErnieForSequenceClassification ErnieForTokenClassification	ernie-1.0 ernie-tiny ernie-2.0-en ernie-2.0-large-en
RoBERTa	RobertaTokenizer	RobertaModel RobertaForQuestionAnswering RobertaForSequenceClassification RobertaForTokenClassification	roberta-wwm-ext roberta-wwm-ext-large rbt3 rbtl3
ELECTRA	ElectraTokenizer	ElectraModel ElectraForSequenceClassification ElectraForTokenClassification	electra-small electra-base electra-large chinese-electra-small chinese-electra-base

Note: the Chinese pre-trained models include bert-base-chinese, bert-wwm-chinese, bert-wwm-ext-chinese, ernie-1.0, ernie-tiny, roberta-wwm-ext, roberta-wwm-ext-large, rbt3, rbtl3, chinese-electra-base, chinese-electra-small, etc.

3.3 data reading and processing

# Define the file corresponding to the dataset and its file storage format
class NewsData(DatasetBuilder):
    #File names corresponding to training set and verification set
    SPLITS = {
        'train': 'train.csv',  # Training set
        'dev': 'dev.csv',      # Validation set
    }

    #Get the file names of training set and verification set
    def _get_data(self, mode, **kwargs):
        filename = self.SPLITS[mode]
        return filename

    # Read the data from a file, skipping the header line
    def _read(self, filename):
        with open(filename, 'r', encoding='utf-8') as f:
            head = None
            for line in f:
                data = line.strip().split("\t")    # Split the columns on '\t'
                if not head:
                    head = data
                else:
                    text_a, label = data
                    yield {"text_a": text_a, "label": label}  # Each record has the fields text_a and label; adapt this to your own data if needed

    #Get category label
    def get_labels(self):
        return label_list   # Category label

# Define dataset loading function
def load_dataset(name=None,
                 data_files=None,
                 splits=None,
                 lazy=None,
                 **kwargs):
   
    reader_cls = NewsData  # Load defined dataset format
    print(reader_cls)
    #Dataset load instance
    if not name:
        reader_instance = reader_cls(lazy=lazy, **kwargs)
    else:
        reader_instance = reader_cls(lazy=lazy, name=name, **kwargs)
    
    #Load dataset by instance
    datasets = reader_instance.read_datasets(data_files=data_files, splits=splits)
    return datasets

# Load training and validation sets
train_ds, dev_ds = load_dataset(splits=["train", "dev"])

<class '__main__.NewsData'>

# Convert one example into the inputs expected by the model
def convert_example(example, tokenizer, max_seq_length=128, is_test=False):
    qtconcat = example["text_a"]
    encoded_inputs = tokenizer(text=qtconcat, max_seq_len=max_seq_length)  # The tokenizer converts the text into the input format accepted by the model
    input_ids = encoded_inputs["input_ids"]
    token_type_ids = encoded_inputs["token_type_ids"]

    #If it is not a test set, you need to return the label
    if not is_test:
        label = np.array([example["label"]], dtype="int64")
        return input_ids, token_type_ids, label
    else:#The test set does not need to return a label
        return input_ids, token_type_ids

# Define the data loading function dataloader
def create_dataloader(dataset,
                      mode='train',
                      batch_size=1,
                      batchify_fn=None,
                      trans_fn=None):
    if trans_fn:
        dataset = dataset.map(trans_fn)

    shuffle = mode == 'train'
    # Shuffle the training set; keep the evaluation sets in their original order
    if mode == 'train':
        batch_sampler = paddle.io.DistributedBatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)
    else:
        batch_sampler = paddle.io.BatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)

    return paddle.io.DataLoader(
        dataset=dataset,
        batch_sampler=batch_sampler,
        collate_fn=batchify_fn,
        return_list=True)

# Parameter settings:
# Batch size. If GPU memory is insufficient, reduce this value
# Note: in this scenario, set batch_size to 256 for nezha-large-wwm-chinese, 128 for chinese-xlnet-large and 300 for the other models; otherwise GPU memory is likely to run out
batch_size = 256
# Maximum sequence length after truncation. Choose it according to the actual text lengths; it must not exceed 512. The length analysis shows that the longest title in the training and validation sets has 48 characters, so it is set to 48 here
max_seq_length = 48

# Process the data into a data format that the model can read in
trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length)

batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
    Stack()  # labels
): [data for data in fn(samples)]

# Training set iterator
train_data_loader = create_dataloader(
    train_ds,
    mode='train',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)

# Validation set iterator
dev_data_loader = create_dataloader(
    dev_ds,
    mode='dev',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
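What batchify_fn does to the input_ids of a batch can be illustrated in plain Python. The helper pad_batch below is hypothetical and is not PaddleNLP's Pad; it only mimics the behavior assumed here, namely padding every sequence with pad_val up to the longest sequence in the batch so that the batch forms a rectangular array.

```python
def pad_batch(sequences, pad_val=0):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_val] * (max_len - len(seq)) for seq in sequences]

# Hypothetical token ids for three titles of different lengths
batch_input_ids = [[101, 7, 8, 102], [101, 9, 102], [101, 5, 6, 4, 102]]
padded = pad_batch(batch_input_ids, pad_val=0)
print(padded)
# [[101, 7, 8, 102, 0], [101, 9, 102, 0, 0], [101, 5, 6, 4, 102]]
```

Padding per batch rather than to a fixed global length keeps batches of short titles small, which matters here because most titles are around 20 characters long.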

3.4 set the fine-tuning optimization strategy and the evaluation metric

The class distribution is imbalanced, so a Focal Loss is implemented as a candidate loss function (in practice it scored lower than Paddle's weighted cross entropy, which is what the training code uses).
To avoid unstable training (oscillation), a warm-up strategy (WarmUp) is used: the learning rate first increases gradually from low to high and then decays. The code is as follows:

# Focal Loss implemented from its definition; in practice it performed worse than Paddle's built-in cross entropy
import paddle.nn.functional as F
class MultiCEFocalLoss(nn.Layer):
    # class_num: number of classes; reduction: how the per-sample losses are aggregated ('mean' or 'sum', default 'mean')
    # use_softmax: whether to apply softmax to the input
    def __init__(self, class_num, gamma=2, alpha=None, reduction='mean',use_softmax=True):
        super(MultiCEFocalLoss, self).__init__()
        if alpha is None:
            self.alpha = paddle.ones((class_num, 1))
        elif isinstance(alpha,list):
            self.alpha = paddle.to_tensor(alpha,dtype=paddle.float32)
        elif isinstance(alpha,(float,int)):  # isinstance needs a type or a tuple of types, not a list
            self.alpha = paddle.full((class_num, 1), float(alpha))  # use the same scalar weight for every class
        else:
            raise TypeError("alpha must be None, a list, a float or an int")
  
        self.gamma = gamma
        self.reduction = reduction
        self.class_num =  class_num
        self.use_softmax = use_softmax

    def forward(self, preds, labels):
        #print("predict",preds)
        #print("label",labels)
        labels = labels.cast(paddle.int64)
        # print("The dimension of preds is:",preds.ndim)
        # print("The dimension of labels is:",labels.ndim)
        if labels.ndim > 1:#Remove the dimension of single value array in labels
            labels = labels.squeeze(1)
        #List of predicted values of each type in the sample
        if self.use_softmax:#Prediction probability based on softmax
            pt = F.softmax(preds,axis=-1)
        else:
            pt = preds
        # print("Processed predict",pt)
        #One Hot code of sample label
        class_mask = F.one_hot(labels, self.class_num)
        # print("One hot of each Label ID",class_mask)
        #Convert to 1 column, and the element is the ID of the sample label
        ids = labels.reshape((-1, 1))
        # print("Label's ID",ids)
        # print("Label's alpha",self.alpha)
        # Class weight of each sample's label; one entry per sample
        alpha = self.alpha[ids.reshape((-1,1))].reshape((-1,1))
        # print("Label's alpha of each sample",alpha)
        # Predicted probability of each sample's true label (taken from pt); one entry per sample
        probs = (pt * class_mask).sum(1).reshape((-1, 1)) # Use the one-hot encoding as a mask to extract the matching entry of pt
        # print("Label's predict of each sample",probs)
        log_p = probs.log()
        # print("Label's log predict of each sample",log_p)
        # Combine the class weight alpha, the log-probability log_p and the modulating factor (1 - p)^gamma into the loss
        loss = -alpha * (paddle.pow((1 - probs), self.gamma)) * log_p
        # print(loss)
        
        #Fusion of different types of loss function contributions
        if self.reduction == 'mean':#average value
            loss = loss.mean()
        elif self.reduction == 'sum':#the sum
            loss = loss.sum()
        return loss
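To see what the modulating factor does, the per-sample formula used in forward, loss = -alpha * (1 - p)^gamma * log(p), can be evaluated in plain Python. This is a standalone sketch with hypothetical helper names (softmax, focal_loss) and made-up logits; it is not a replacement for the Paddle layer above.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def focal_loss(logits, label, alpha=1.0, gamma=2.0):
    """Focal loss of one sample: -alpha * (1 - p)^gamma * log(p)."""
    p = softmax(logits)[label]
    return -alpha * (1.0 - p) ** gamma * math.log(p)

# Hypothetical logits for a 3-class problem
logits = [4.0, 0.0, 0.0]
easy = focal_loss(logits, label=0)   # true class is predicted confidently
hard = focal_loss(logits, label=1)   # true class gets a low probability
ce_easy = focal_loss(logits, label=0, gamma=0.0)  # gamma = 0 gives cross entropy
print(easy, hard, ce_easy)
```

With gamma = 0 the expression reduces to (weighted) cross entropy; with gamma > 0 the confidently classified sample contributes far less than it would under cross entropy, so training focuses on hard samples.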

# Define hyperparameters, loss, optimizers, etc
from paddlenlp.transformers import LinearDecayWithWarmup, CosineDecayWithWarmup

# Training configuration:
# Peak learning rate during training
learning_rate = 4e-5
# Number of training epochs
epochs = 4
# Warm-up proportion of the learning rate: controls where the peak of the "learning rate vs. step" curve lies. Here the peak is reached after 10% of all steps, after which the rate decays
warmup_proportion = 0.1
# Weight decay coefficient, a regularization strategy that helps avoid overfitting
weight_decay = 0.0

# Total number of training steps
num_training_steps = len(train_data_loader) * epochs
print(num_training_steps)
# Learning rate schedule with linear warm-up: during the first warmup_proportion * num_training_steps steps the learning rate rises linearly from 0 to learning_rate, then follows a cosine decay to 0
lr_scheduler = CosineDecayWithWarmup(learning_rate, num_training_steps, warmup_proportion)

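# AdamW optimizer; weight decay is not applied to bias and normalization parameters
# Build the list of parameter names once instead of on every call of the lambda
decay_params = [
    p.name for n, p in model.named_parameters()
    if not any(nd in n for nd in ["bias", "norm"])
]
optimizer = paddle.optimizer.AdamW(
    learning_rate=lr_scheduler,
    parameters=model.parameters(),
    weight_decay=weight_decay,
    apply_decay_param_fun=lambda x: x in decay_params)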

# Custom Focal Loss (alternative, not used)
#criterion = MultiCEFocalLoss(class_num=14, alpha=(df_Info_data['Weight'].values.tolist()))
# Cross entropy loss weighted by the per-class weights from the data analysis
criterion = paddle.nn.CrossEntropyLoss(weight=paddle.to_tensor(df_Info_data['Weight'].values.astype("float32")))
#criterion = paddle.nn.CrossEntropyLoss()
metric = paddle.metric.Accuracy()              # Accuracy metric
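The shape of the learning-rate curve configured above (linear warm-up to the peak, then cosine decay to 0) can be reproduced with a few lines of plain Python. This is an approximation written for illustration, not PaddleNLP's implementation of CosineDecayWithWarmup; the function name warmup_cosine_lr and the step count of 1000 are assumptions.

```python
import math

def warmup_cosine_lr(step, peak_lr, total_steps, warmup_proportion):
    """Learning rate at a given step: linear warm-up, then cosine decay to 0."""
    warmup_steps = int(total_steps * warmup_proportion)
    if step < warmup_steps:
        # Linear increase from 0 to peak_lr
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase that has elapsed, in [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Hypothetical run of 1000 steps with the settings used above
for step in (0, 50, 100, 550, 1000):
    print(step, warmup_cosine_lr(step, 4e-5, 1000, 0.1))
```

With these settings the rate is 0 at step 0, reaches the peak of 4e-5 at step 100, is back to about half the peak midway through the decay phase, and ends at 0.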

3.5 model training and evaluation

Note: during training, watch how the loss and accuracy change to judge whether the training parameters are appropriate and whether the model is overfitting.
Run the nvidia-smi command in the terminal, or click the "performance monitoring" option at the bottom, to check video memory usage, and adjust the batch size accordingly so that training does not abort when video memory runs out.
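
One way to make the overfitting check above less manual is to watch the validation accuracy for a plateau. The sketch below is a patience-based stopping rule, which is not part of the original scheme; the function name should_stop, the patience value and the accuracy numbers are hypothetical.

```python
def should_stop(val_accuracies, patience=3):
    """Return True when the last `patience` evaluations did not beat the earlier best."""
    if len(val_accuracies) <= patience:
        # Not enough history yet to judge
        return False
    best_before = max(val_accuracies[:-patience])
    best_recent = max(val_accuracies[-patience:])
    return best_recent <= best_before

# Hypothetical validation accuracies recorded after successive evaluations
history = [0.90, 0.95, 0.94, 0.95, 0.93]
print(should_stop(history, patience=3))
```

In a training loop such a check would be called after each evaluation; a rising training accuracy combined with a stalled validation accuracy is the usual sign of overfitting.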

# Define the evaluation function used for validation during training
@paddle.no_grad()
def evaluate(model, criterion, metric, data_loader):
    model.eval()
    metric.reset()
    losses = []
    for batch in data_loader:
        input_ids, token_type_ids, labels = batch
        logits = model(input_ids, token_type_ids)
        loss = criterion(logits, labels)
        losses.append(loss.numpy())
        correct = metric.compute(logits, labels)
        metric.update(correct)
    # Accumulate the accuracy once, after all batches have been processed
    accu = metric.accumulate()
    print("eval loss: %.5f, accu: %.5f" % (np.mean(losses), accu))  # Report the result on the validation set
    model.train()
    metric.reset()
    return accu  # Return accuracy

# Fixed random seeds facilitate the reproduction of results
seed = 1024
random.seed(seed)
np.random.seed(seed)
paddle.seed(seed)

<paddle.fluid.core_avx.Generator at 0x7f02da321fb0>

# Model training:
import paddle.nn.functional as F
import matplotlib.pyplot as plt

#Save path of the model with the highest accuracy on the validation set
save_dir = "checkpoint"
os.makedirs(save_dir, exist_ok=True)

#Save path of the model after the last epoch
save_final_dir = "checkpoint_final"
os.makedirs(save_final_dir, exist_ok=True)

pre_accu=0
accu=0
global_step = 0
#Record the Step and Epoch values when the effect is better
best_global_step = 0
best_epoch = 0
#Handle Epoch one by one
for epoch in range(1, epochs + 1):
    losses = []
    #Process Step by Step
    for step, batch in enumerate(train_data_loader, start=1):
        input_ids, segment_ids, labels = batch
        logits = model(input_ids, segment_ids)
        # print(len(logits[0]))
        # print(len(labels[0]))
        # print(logits)
        # print(labels)
        loss = criterion(logits, labels)
        losses.append(loss.numpy())
        probs = F.softmax(logits, axis=1)
        correct = metric.compute(probs, labels)
        metric.update(correct)
        acc = metric.accumulate()

        global_step += 1
        #Count and output the results every 10 steps
        if global_step % 10 == 0:
            print("global step %d, epoch: %d, batch: %d, loss: %.5f, avgLoss: %.5f, acc: %.5f" % (global_step, epoch, step, loss, np.mean(losses), acc))
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.clear_grad()

        if not useTotalData:
            #When the validation set is held out (training does not use training set + validation set), evaluate every 300 steps so that the best model in the middle of an epoch is not missed
            if global_step % 300 == 0:
                accu = evaluate(model, criterion, metric, dev_data_loader)
                if accu > pre_accu:
                    # Save the better model parameters
                    save_param_path = os.path.join(save_dir, 'model_state.pdparams')
                    paddle.save(model.state_dict(), save_param_path)
                    pre_accu=accu
                    #Record the step and epoch at which the best result so far was reached
                    best_global_step = global_step
                    best_epoch = epoch
                    #Note: "batch" in this message is the global step
                    print("The best model is found in epoch: %d, batch: %d" % (best_epoch, best_global_step))

    if not useTotalData:
        #When the validation set is held out, also evaluate on it at the end of each epoch
        accu = evaluate(model, criterion, metric, dev_data_loader)
        print(accu)
        if accu > pre_accu:
            #Save the model parameters that beat the best result so far
            save_param_path = os.path.join(save_dir, 'model_state.pdparams')
            paddle.save(model.state_dict(), save_param_path)
            pre_accu=accu
            #Record the step and epoch at which the best result so far was reached
            best_global_step = global_step
            best_epoch = epoch
        print("The best model is found in epoch: %d, batch: %d" % (best_epoch, best_global_step))
    else:
        #When all data is used for training, save this epoch's parameters so that an interruption does not lose them
        save_param_path = os.path.join(save_dir, 'model_state.pdparams')
        paddle.save(model.state_dict(), save_param_path)
    tokenizer.save_pretrained(save_dir)

    #Draw the loss curve of this epoch
    plt.figure()
    #Curve title
    plt.title("The loss at the different step")
    #X-axis and Y-axis names
    plt.xlabel("Steps")
    plt.ylabel("Loss")
    plt.plot(losses)
    #Display the figure
    plt.show()

#Save the model after the last Epoch
save_final_param_path = os.path.join(save_final_dir, 'model_state.pdparams')
paddle.save(model.state_dict(), save_final_param_path)
tokenizer.save_pretrained(save_final_dir)

The best model is found in epoch: 3, batch: 8400

# Load saved model parameters. This path is the checkpoint saved after the last epoch; checkpoint/model_state.pdparams holds the model with the best accuracy on the validation set.
import os
import paddle

params_path = 'checkpoint_final/model_state.pdparams'
if params_path and os.path.isfile(params_path):
    # Load model parameters
    state_dict = paddle.load(params_path)
    model.set_dict(state_dict)
    print("Loaded parameters from %s" % params_path)

Loaded parameters from checkpoint_final/model_state.pdparams

# Score the loaded model parameters on the validation set
evaluate(model, criterion, metric, dev_data_loader)

eval loss: 0.06091, accu: 0.96970





0.9697

3.6 model prediction

# Define categories to classify
label_list = train.label.unique().tolist()
label_map = { 
    idx: label_text for idx, label_text in enumerate(label_list)
}
print(label_map)

{0: 'science and technology', 1: 'Sports', 2: 'Current politics', 3: 'shares', 4: 'entertainment', 5: 'education', 6: 'Home Furnishing', 7: 'Finance and Economics', 8: 'house property', 9: 'Sociology', 10: 'game', 11: 'lottery', 12: 'constellation', 13: 'fashion'}

# Prediction function definition
def predict(model, data, tokenizer, label_map, batch_size=1):
    examples = []
    # Convert the input data (list format) into the format the model accepts
    for text in data:
        input_ids, segment_ids = convert_example(
            text,
            tokenizer,
            max_seq_length=128,
            is_test=True)
        examples.append((input_ids, segment_ids))

    batchify_fn = lambda samples, fn=Tuple(
        Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input id
        Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # segment id
    ): fn(samples)

    # Split the dataset into batches
    batches = []
    one_batch = []
    #Process samples one by one
    for example in examples:
        #Add each sample to one_batch; once it holds batch_size samples, append it to batches as one batch
        one_batch.append(example)
        if len(one_batch) == batch_size:
            batches.append(one_batch)
            one_batch = []
    if one_batch:
        #Handle the final one_batch, which may hold fewer than batch_size samples
        batches.append(one_batch)

    results = []
    model.eval()
    #Process data batch by batch, without tracking gradients
    with paddle.no_grad():
        for batch in batches:
            input_ids, segment_ids = batchify_fn(batch)
            input_ids = paddle.to_tensor(input_ids)
            segment_ids = paddle.to_tensor(segment_ids)
            logits = model(input_ids, segment_ids)
            probs = F.softmax(logits, axis=1)
            idx = paddle.argmax(probs, axis=1).numpy()
            idx = idx.tolist()
            labels = [label_map[i] for i in idx]
            results.extend(labels)
    return results  # Return the predicted labels
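
The manual batching loop in predict can also be written as a slice-based generator. A minimal sketch, assuming the examples are held in a list; the helper name chunked is hypothetical and is not part of PaddleNLP.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items`, each holding at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# The last batch is shorter when the length is not a multiple of batch_size
print(list(chunked(list(range(5)), 2)))  # [[0, 1], [2, 3], [4]]
```

Slicing handles the short final batch without a separate special case.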

# Define the preprocessing function for data, and specify the list format for model input
def preprocess_prediction_data(data):
    examples = []
    for text_a in data:
        examples.append({"text_a": text_a})
    return examples

# Write the predicted labels (a list) to a txt file in the submission format: one category per line
def write_results(labels, file_path):
    with open(file_path, "w", encoding="utf8") as f:
        f.write("\n".join(labels))

# Read the test set file to predict
test = pd.read_csv('./test.csv',sep='\t')

# Format the test set data
test_data = list(test.text_a)
test_example = preprocess_prediction_data(test_data)

# Predict the test set
results = predict(model, test_example, tokenizer, label_map, batch_size=16)

#Save the prediction results as txt files that meet the format requirements
write_results(results, "./result.txt")
#zip file
!zip 'submission.zip' 'result.txt'
!cp -r submission.zip /home/aistudio/

IV Ensemble learning

Voting is used to fuse the predictions of multiple models into a single result, resolving the cases where the models disagree.

Fusing the results of the seven models with equal weights gives a submission score of 0.89494.

Fusing the results of the seven models with weights 158, 146, 132, 112, 125, 114 and 78 gives a submission score of 0.89527.
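
The internals of the kaggle_vote.py script used in this section are not shown, so the following is only a minimal sketch of equal-weight and weighted plurality voting in general. The function name weighted_vote, the toy labels and the weights are assumptions made for the example.

```python
from collections import defaultdict

def weighted_vote(predictions, weights=None):
    """Fuse per-model label lists into one list by (weighted) plurality vote.

    predictions: one list of labels per model, all of the same length
    weights: one weight per model; None means equal weights
    """
    if weights is None:
        weights = [1] * len(predictions)
    fused = []
    # zip(*predictions) groups the labels that all models gave to one sample
    for sample_labels in zip(*predictions):
        scores = defaultdict(float)
        for label, weight in zip(sample_labels, weights):
            scores[label] += weight
        # On a tie, the label that was seen first wins
        fused.append(max(scores, key=scores.get))
    return fused

# Toy predictions from three hypothetical models on three samples
preds = [
    ["sports", "game", "stock"],
    ["sports", "stock", "stock"],
    ["game", "game", "finance"],
]
print(weighted_vote(preds))             # ['sports', 'game', 'stock']
print(weighted_vote(preds, [1, 1, 3]))  # ['game', 'game', 'finance']
```

With equal weights the majority label wins; giving the third model a weight of 3 lets it outvote the other two whenever they do not side with it.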

#When calculating the result correlation, the label column can only be ID, so you need to convert the original name to ID
#Establish a mapping table from classification label to ID
label_ID_map = {Val:ID for ID,Val in enumerate(label_list)}
print(label_ID_map)

{'science and technology': 0, 'Sports': 1, 'Current politics': 2, 'shares': 3, 'entertainment': 4, 'education': 5, 'Home Furnishing': 6, 'Finance and Economics': 7, 'house property': 8, 'Sociology': 9, 'game': 10, 'lottery': 11, 'constellation': 12, 'fashion': 13}

#Each line of a prediction result file holds only a label, so preprocess it into several files
#predfile: file produced by model prediction, containing only a label column
#label_ID_map: mapping table from classification label to ID; weight: voting weight of this result file
#resultfile: output file with columns ('id','label')
#resultIDfile: output file with columns ('id','labelID'), where each label is converted to its ID
def preprocess_prediction_file(predfile, label_ID_map, weight, resultfile, resultIDfile):
    dftab = pd.read_table(predfile,header=None)
    dftab.columns = ['label']
    #Add a 1-based id index to the data
    dftab['id'] = range(1, len(dftab) + 1)

    #DataFrame consisting of ID and Label
    dftab = dftab[['id','label']]
    #Save results
    dftab.to_csv(resultfile, index=False)
    #Save a copy whose file name carries the weight, e.g. _w158_method1.csv
    resultWeightfile = "_w" + str(int(weight)) + "_" + resultfile
    dftab.to_csv(resultWeightfile, index=False)

    #Add the label ID to the data
    dftab['labelID'] = dftab['label'].map(label_ID_map)

    #DataFrame consisting of ID and LabelID
    dfIDtab = dftab[['id','labelID']]
    #Save results
    dfIDtab.to_csv(resultIDfile, index=False)

%cd /home/aistudio/merge_result/
#Clear existing csv files
!rm *.csv

/home/aistudio/merge_result
rm: cannot remove '*.csv': No such file or directory

#Preprocess the model prediction result file
preprocess_prediction_file("./result89.08.txt", label_ID_map, 158, "method1.csv", "method1_ID.csv")
preprocess_prediction_file("./result88.96.txt", label_ID_map, 146, "method2.csv", "method2_ID.csv")
preprocess_prediction_file("./result88.82.txt", label_ID_map, 132, "method3.csv", "method3_ID.csv")
preprocess_prediction_file("./result88.62.txt", label_ID_map, 112, "method4.csv", "method4_ID.csv")
preprocess_prediction_file("./result88.75.txt", label_ID_map, 125, "method5.csv", "method5_ID.csv")
preprocess_prediction_file("./result88.64.txt", label_ID_map, 114, "method6.csv", "method6_ID.csv")
preprocess_prediction_file("./result88.28.txt", label_ID_map, 78, "method7.csv", "method7_ID.csv")

# Correlation of calculation results
!python correlations.py "./method1_ID.csv" "./method2_ID.csv"

Finding correlation between: ./method1_ID.csv and ./method2_ID.csv
Column to be measured: labelID
Pearson's correlation score: 0.9269307860693886
Kendall's correlation score: 0.9219420409220379
Spearman's correlation score: 0.9367103330656568
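
The correlations.py script is not shown here. Because class IDs are nominal, a correlation coefficient computed on them depends on how the labels happen to be numbered, so the share of samples on which two models agree is a useful complement. A minimal sketch in plain Python follows; the function names and the toy ID lists are assumptions made for the example.

```python
import math

def agreement_rate(a, b):
    """Fraction of positions at which two prediction lists give the same value."""
    if len(a) != len(b):
        raise ValueError("prediction lists must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def pearson(a, b):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Toy label IDs predicted by two hypothetical models
ids_1 = [0, 1, 2, 3, 4, 5]
ids_2 = [0, 1, 2, 3, 4, 0]
print(agreement_rate(ids_1, ids_2))
print(pearson(ids_1, ids_2))
```

Models that agree less with each other tend to contribute more to a voting ensemble, which is the reason for checking this before fusion.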

# Equal weight Voting Fusion
!python kaggle_vote.py "./method?.csv" "./merge.csv"
# Weighted voting fusion
!python kaggle_vote.py "./_w*.csv" "./merge_weight.csv" "weighted"

parsing: ./method3.csv
parsing: ./method2.csv
parsing: ./method1.csv
parsing: ./method5.csv
parsing: ./method7.csv
parsing: ./method4.csv
parsing: ./method6.csv
wrote to ./merge.csv
parsing: ./_w78_method7.csv
Using weight: 78
parsing: ./_w146_method2.csv
Using weight: 146
parsing: ./_w112_method4.csv
Using weight: 112
parsing: ./_w158_method1.csv
Using weight: 158
parsing: ./_w125_method5.csv
Using weight: 125
parsing: ./_w132_method3.csv
Using weight: 132
parsing: ./_w114_method6.csv
Using weight: 114
wrote to ./merge_weight.csv

# Because the result of fusion does not meet the format requirements, the following simple format processing is also required
# df1 = pd.read_csv('./_w4_method1.csv')
# df2 = pd.read_csv('./merge_weight.csv')
df1 = pd.read_csv('./method1.csv')
df2 = pd.read_csv('./merge.csv')
# Filling the filtered data into the new table mainly solves the problem of id disorder
df1['label'] = df1['id'].map(df2.set_index('id')['label'])
df1 = df1['label']
# Save forecast result file
df1.to_csv('result.txt',index=False,header=None)
# Compress the result file into a zip file
!zip 'result.zip' 'result.txt'

  adding: result.txt (deflated 89%)

V Transductive learning

Transductive learning turns the test samples with reliable predictions into training data, which increases the number of effective samples.

With the pseudo-labels generated this way, two further rounds of training gave submission scores of 90.01 and 90.06447.
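
The selection rule behind these pseudo-labels, keeping only the test samples on which the models agree, can be sketched in plain Python. This is a simplified illustration with in-memory lists rather than the csv files used in this section; the function name select_unanimous and the toy data are assumptions made for the example.

```python
def select_unanimous(texts, predictions):
    """Return (text, label) pairs for the samples on which every model agrees.

    texts: list of test titles
    predictions: one list of predicted labels per model, aligned with texts
    """
    pseudo_labeled = []
    for text, labels in zip(texts, zip(*predictions)):
        # A set with a single element means all models predicted the same label
        if len(set(labels)) == 1:
            pseudo_labeled.append((text, labels[0]))
    return pseudo_labeled

# Toy titles and predictions from three hypothetical models
texts = ["title 1", "title 2", "title 3"]
preds = [
    ["sports", "game", "stock"],
    ["sports", "game", "finance"],
    ["sports", "stock", "stock"],
]
print(select_unanimous(texts, preds))  # [('title 1', 'sports')]
```

Requiring unanimity trades quantity for label quality: fewer samples are added, but each carries a label that several models confirmed.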

#Each line of a prediction result file holds only a label, so add an ID index
#predfile: file produced by model prediction, containing only a label column
#resultfile: output file with columns ('id','label')
def addIDIndex_prediction_file(predfile, resultfile):
    dftab = pd.read_table(predfile,header=None)
    dftab.columns = ['label']
    #Add a 1-based id index to the data
    dftab['id'] = range(1, len(dftab) + 1)

    #DataFrame consisting of ID and Label
    dftab = dftab[['id','label']]
    #Save results
    dftab.to_csv(resultfile, index=False)

#Convert a file with columns ('id','label') into a KV table {'id': 'label'}
def iDLabelFile2KVTable(iDLabelFile):
    KVTab = {}
    with open(iDLabelFile,'r',encoding='utf-8') as result_file:
        result_reader = csv.reader(result_file)
        #Skip the header row ('id','label')
        next(result_reader, None)
        #Convert line by line
        for rows in result_reader:
            KVTab[rows[0]] = rows[1]
    return KVTab

%cd /home/aistudio/fake_data/
#Clear existing csv files
!rm *.csv

/home/aistudio/fake_data
rm: cannot remove '*.csv': No such file or directory

#Preprocess the model prediction result file
addIDIndex_prediction_file("./result89.08.txt", "method1.csv")
addIDIndex_prediction_file("./result88.96.txt", "method2.csv")
addIDIndex_prediction_file("./result88.82.txt", "method3.csv")
addIDIndex_prediction_file("./result88.62.txt", "method4.csv")
addIDIndex_prediction_file("./result88.75.txt", "method5.csv")
addIDIndex_prediction_file("./result88.64.txt", "method6.csv")
addIDIndex_prediction_file("./result88.28.txt", "method7.csv")

import csv
import numpy as np
import pandas as pd
#Convert each file with columns ('id','label') into a KV table {'id': 'label'}
f1 = iDLabelFile2KVTable('method1.csv')
f2 = iDLabelFile2KVTable('method2.csv')
f3 = iDLabelFile2KVTable('method3.csv')
#Note: f4-f7 reload method1-method3, so only three distinct models are compared here;
#point them at method4.csv ... method7.csv to require agreement among all seven
f4 = iDLabelFile2KVTable('method1.csv')
f5 = iDLabelFile2KVTable('method2.csv')
f6 = iDLabelFile2KVTable('method3.csv')
f7 = iDLabelFile2KVTable('method3.csv')

# Select the samples on which the models' predictions agree
x1 = set(f1.items()).intersection(set(f2.items())).intersection(set(f3.items()))
x2 = set(f4.items()).intersection(set(f5.items())).intersection(set(f6.items()))
x3 = set(f7.items())
x = x1.intersection(x2.intersection(x3))
# Sort the (id, label) pairs so that the DataFrame is built from an ordered sequence
x = pd.DataFrame(sorted(x), columns=["id", "label"])
# Drop the header row ('id','label') if it was read in as data
x = x[x['id'] != 'id'].copy()

# Join text_a with label to build the pseudo-labeled data
t1 = x
t2 = pd.read_table('./test.txt',header=None)
t2.columns = ["text_a"]
# Add a 1-based id index
t2['id'] = range(1, len(t2) + 1)
t2 = t2[['id','text_a']]

t1['id'] = t1['id'].astype(str)
t1['label'] = t1['label'].astype(str)
t2['id'] = t2['id'].astype(str)
t2['text_a'] = t2['text_a'].astype(str)

t3 = pd.merge(t1, t2[['id', 'text_a']], on='id', how='left')
t3 = t3[['text_a','label']]
t3.to_csv('fakeData1.csv',index=False,sep=',')

/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/ipykernel_launcher.py:23: UserWarning: Boolean Series key will be reindexed to match DataFrame index.

Keywords: NLP paddlepaddle

Added by scofansnags on Thu, 03 Feb 2022 08:07:42 +0200
