Sentiment analysis competition on the Thousand Words (Qianyan) datasets, implemented with the PaddleNLP SKEP pre-trained model

Sentiment analysis competition: score 0.7943

Here the PaddlePaddle high-level API is used to quickly build models and submit results for the sentiment analysis competition. For the underlying principles and analysis, refer to NLP camp practice lesson 5: text sentiment analysis. The project is divided into three parts: sentence-level sentiment analysis (NLPCC14-SC, ChnSentiCorp); target-level sentiment analysis (SE-ABSA16_PHNS, SE-ABSA16_CAME); and opinion extraction (COTE-BD, COTE-DP, COTE-MFW). For an introduction to the individual datasets, see the competition link.

Using the project is very simple: change data_name in the corresponding section, adjust batch_size and epochs yourself to get the best training result, then run all the code in that section to obtain predictions for the corresponding dataset. Once all datasets have been predicted, download the submission folder and submit it.

!pip install --upgrade paddlenlp -i https://pypi.org/simple 

1. Sentence-level sentiment analysis

Sentence-level sentiment analysis judges the sentiment polarity of an input text, generally positive (1) or negative (0).

As we all know, human natural language carries rich emotional color: it expresses emotions (such as sadness and happiness), feelings (such as burnout and depression), preferences (such as like and dislike), personality traits, standpoints, and more. Sentiment analysis is applied in scenarios such as product preference, purchase decision-making and public opinion analysis. Automatically analyzing these sentiment tendencies not only helps enterprises understand how consumers feel about their products and provides a basis for product improvement, it also helps enterprises analyze the attitudes of business partners in order to make better business decisions.

The best-known sentiment analysis task is classifying a piece of text, for example a three-class problem over positive, negative and other polarity:


Sentiment analysis task
  • Positive: positive emotions, such as happiness, surprise, expectation, etc.
  • Negative: negative emotions, such as sadness, anger, panic, etc.
  • Other: other types of emotions.

The familiar sentiment analysis task described above is in fact a sentence-level sentiment analysis task.

Sentiment analysis tasks can be further divided into sentence-level sentiment analysis, target-level sentiment analysis, and so on.

1.0 loading model and Tokenizer

Calling paddlenlp.transformers.SkepForSequenceClassification.from_pretrained('skep_ernie_1.0_large_ch', num_classes=...) defines the model network; only the model name and the number of text classification categories need to be specified.

PaddleNLP supports not only the SKEP pre-trained model but also BERT, RoBERTa, ELECTRA and other pre-trained models.
The following table summarizes the pre-trained models currently supported by PaddleNLP. Users can use these models to complete tasks such as text classification, sequence labeling and question answering. At the same time, 22 sets of pre-trained parameter weights are provided, including pre-trained weights for 11 Chinese language models.

| Model | Tokenizer | Supported Task | Model Name |
|-------|-----------|----------------|------------|
| BERT | BertTokenizer | BertModel, BertForQuestionAnswering, BertForSequenceClassification, BertForTokenClassification | bert-base-uncased, bert-large-uncased, bert-base-multilingual-uncased, bert-base-cased, bert-base-chinese, bert-base-multilingual-cased, bert-large-cased, bert-wwm-chinese, bert-wwm-ext-chinese |
| ERNIE | ErnieTokenizer, ErnieTinyTokenizer | ErnieModel, ErnieForQuestionAnswering, ErnieForSequenceClassification, ErnieForTokenClassification | ernie-1.0, ernie-tiny, ernie-2.0-en, ernie-2.0-large-en |
| RoBERTa | RobertaTokenizer | RobertaModel, RobertaForQuestionAnswering, RobertaForSequenceClassification, RobertaForTokenClassification | roberta-wwm-ext, roberta-wwm-ext-large, rbt3, rbtl3 |
| ELECTRA | ElectraTokenizer | ElectraModel, ElectraForSequenceClassification, ElectraForTokenClassification | electra-small, electra-base, electra-large, chinese-electra-small, chinese-electra-base |

Note: the Chinese pre-trained models include bert-base-chinese, bert-wwm-chinese, bert-wwm-ext-chinese, ernie-1.0, ernie-tiny, roberta-wwm-ext, roberta-wwm-ext-large, rbt3, rbtl3, chinese-electra-base, chinese-electra-small, etc.

For more pre-trained models, see: https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/model_zoo/transformers.rst
For more ways of fine-tuning pre-trained models on downstream tasks, see: https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples

import paddlenlp
from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer
# from paddlenlp.transformers import ErnieForSequenceClassification, ErnieTokenizer
# from paddlenlp.transformers import BertForSequenceClassification, BertTokenizer
print(paddlenlp.__version__)

1.1 data processing

Although some of these datasets are already built into PaddleNLP, for consistency of data processing they are all read here from the uploaded dataset files. For datasets that PaddleNLP already provides, loading them directly through the API is strongly recommended and very convenient.
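As a reference for that API-based route, the snippet below is a minimal sketch assuming the load_dataset interface of PaddleNLP 2.x; it is not used in the rest of this project.

# A minimal sketch (assumes paddlenlp>=2.0 ships the built-in chnsenticorp dataset);
# not used below -- this project reads the uploaded TSV files instead.
from paddlenlp.datasets import load_dataset

train_ds, dev_ds, test_ds = load_dataset('chnsenticorp', splits=['train', 'dev', 'test'])
print(train_ds[0])  # a dict with 'text' and 'label' fields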

# Decompress data
!unzip -o datasets/ChnSentiCorp
!unzip -o datasets/NLPCC14-SC

Data internal structure analysis:

ChnSentiCorp:

train: 
label		text_a
0		The room is too small. Everything else is average.........
1		Light, easy to carry, good performance, can meet the usual work needs, very good for business travelers

dev:
qid		label		text_a
0		1		The environment and service attitude of this hotel are also quite good,But the room space is too small~...

test:
qid		text_a
0		This hotel is rather old, and the special rooms are also very ordinary. Generally speaking, it is average
...		...

NLPCC14-SC:

train:
label		text_a
1		Excuse me, isn't this machine equipped with a remote control?
0		It's all truth

test:
qid		text_a
0		I finally found someone in the same line~~~~Since junior high school, I have been...
...		...

As can be seen above, the two datasets can share the same reading method, but NLPCC14-SC has no dev set, so no dev data is defined for it.

# Get dataset dictionary
# Read a TSV file: skip the header line and keep only lines with at least two tab-separated fields
def open_func(file_path):
    return [line.strip() for line in open(file_path, 'r', encoding='utf8').readlines()[1:] if len(line.strip().split('\t')) >= 2]

data_dict = {'chnsenticorp': {'test': open_func('ChnSentiCorp/test.tsv'),
                              'dev': open_func('ChnSentiCorp/dev.tsv'),
                              'train': open_func('ChnSentiCorp/train.tsv')},
             'nlpcc14sc': {'test': open_func('NLPCC14-SC/test.tsv'),
                           'train': open_func('NLPCC14-SC/train.tsv')}}

1.2 defining data readers

# Define dataset
from paddle.io import Dataset, DataLoader
from paddlenlp.data import Pad, Stack, Tuple
import numpy as np
label_list = [0, 1]

# Note: token_type_ids are not needed for this task, so they are not handled here and the model fills them in by itself.
class MyDataset(Dataset):
    def __init__(self, data, tokenizer, max_len=512, for_test=False):
        super().__init__()
        self._data = data
        self._tokenizer = tokenizer
        self._max_len = max_len
        self._for_test = for_test
    
    def __len__(self):
        return len(self._data)
    
    def __getitem__(self, idx):
        samples = self._data[idx].split('\t')
        # Training lines are (label, text_a); test lines are (qid, text_a), in which case
        # the value parsed as "label" here is just the qid and is discarded below.
        label = samples[-2]
        text = samples[-1]
        label = int(label)
        text = self._tokenizer.encode(text, max_seq_len=self._max_len)['input_ids']
        if self._for_test:
            return np.array(text, dtype='int64')
        else:
            return np.array(text, dtype='int64'), np.array(label, dtype='int64')

def batchify_fn(for_test=False):
    if for_test:
        return lambda samples, fn=Pad(axis=0, pad_val=tokenizer.pad_token_id): np.row_stack([data for data in fn(samples)])
    else:
        return lambda samples, fn=Tuple(Pad(axis=0, pad_val=tokenizer.pad_token_id),
                                        Stack()): [data for data in fn(samples)]

def get_data_loader(data, tokenizer, batch_size=32, max_len=512, for_test=False):
    dataset = MyDataset(data, tokenizer, max_len, for_test)
    shuffle = True if not for_test else False
    data_loader = DataLoader(dataset=dataset, batch_size=batch_size, collate_fn=batchify_fn(for_test), shuffle=shuffle)
    return data_loader
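Before training, a quick sanity check can confirm that the reader produces what we expect. The snippet below is a sketch: it loads the SKEP tokenizer ahead of section 1.3 (batchify_fn reads the global tokenizer, so one must exist before building a loader) and inspects one ChnSentiCorp batch.

# Sanity check (a sketch): build a small loader and look at the shapes of one batch.
from paddlenlp.transformers import SkepTokenizer

tokenizer = SkepTokenizer.from_pretrained('skep_ernie_1.0_large_ch')
check_loader = get_data_loader(data_dict['chnsenticorp']['train'], tokenizer, batch_size=4, max_len=128)
input_ids, labels = next(iter(check_loader))
print(input_ids.shape, labels.shape)  # padded input ids [4, seq_len] and labels [4]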

1.3 model building and training

The model itself is very simple: we only need to call the corresponding sequence classification class. To make training convenient, the high-level API paddle.Model is used directly.

import paddle
from paddle.static import InputSpec

# Model and word segmentation
model = SkepForSequenceClassification.from_pretrained('skep_ernie_1.0_large_ch', num_classes=2)
tokenizer = SkepTokenizer.from_pretrained('skep_ernie_1.0_large_ch')

# model = ErnieForSequenceClassification.from_pretrained('ernie-1.0', num_classes=2)
# tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')   

# model = BertForSequenceClassification.from_pretrained('bert-wwm-ext-chinese', num_classes=2)
# tokenizer = BertTokenizer.from_pretrained('bert-wwm-ext-chinese')
# Parameter setting: change data_name to switch datasets (chnsenticorp, nlpcc14sc)
# data_name = 'chnsenticorp' 
data_name = 'nlpcc14sc'

# Training related
epochs = 5
learning_rate = 2e-5        # chnsenticorp 2e-5  /nlpcc14sc 2e-5
batch_size = 48    #chnsenticorp 64 / nlpcc14sc 128
max_len = 224     # 92 / 96

## Data correlation
train_dataloader = get_data_loader(data_dict[data_name]['train'], tokenizer, batch_size, max_len, for_test=False)
if data_name == 'chnsenticorp':
    dev_dataloader = get_data_loader(data_dict[data_name]['dev'], tokenizer, batch_size, max_len, for_test=False)
else:
    dev_dataloader = None

input = InputSpec((-1, -1), dtype='int64', name='input')
label = InputSpec((-1, 2), dtype='int64', name='label')
model = paddle.Model(model, [input], [label])

# Model preparation
# Data set chnsenticorp, add regularization 5e-4
# Dataset nlpcc14sc, add regularization 6e-4
optimizer = paddle.optimizer.Adam(learning_rate=learning_rate, parameters=model.parameters(),
            weight_decay=paddle.regularizer.L2Decay(5e-4))
model.prepare(optimizer, loss=paddle.nn.CrossEntropyLoss(), metrics=[paddle.metric.Accuracy()])
print(len(train_dataloader))

Train chnsenticorp

# Start training chnsenticorp
model.fit(train_dataloader, dev_dataloader, batch_size, epochs=12, save_freq=20,verbose=2, save_dir='./ckpt/chnsenticorp')

Training nlpcc14sc

# Start training nlpcc14sc
model.fit(train_dataloader, dev_dataloader, batch_size, epochs=8, save_freq=20,verbose=2, save_dir='./ckpt/nlpcc14sc')

1.4 prediction and saving

import os
# Switch data_name to predict the corresponding dataset (chnsenticorp, nlpcc14sc)
# data_name = 'chnsenticorp'
data_name = 'nlpcc14sc'

# Import pre training model
checkpoint_path = "./ckpt/" + data_name +  "/final"
model = SkepForSequenceClassification.from_pretrained('skep_ernie_1.0_large_ch', num_classes=2)
# model = ErnieForSequenceClassification.from_pretrained('ernie-1.0', num_classes=2)
# model = BertForSequenceClassification.from_pretrained('bert-wwm-ext-chinese', num_classes=2)
input = InputSpec((-1, -1), dtype='int64', name='input')
model = paddle.Model(model, input)
model.load(checkpoint_path)

# Import test set
test_dataloader = get_data_loader(data_dict[data_name]['test'], tokenizer, batch_size, max_len, for_test=True)
# Save forecast results
save_dir = './submission'
save_file = {'chnsenticorp': 'ChnSentiCorp.tsv', 'nlpcc14sc': 'NLPCC14-SC.tsv'}
if not os.path.exists(save_dir):
    os.makedirs(save_dir)
predicts = []
for batch in test_dataloader:
    predict = model.predict_batch(batch)
    predicts += predict[0].argmax(axis=-1).tolist()

with open(os.path.join(save_dir,save_file[data_name]), 'w', encoding='utf8') as f:
    f.write("index\tprediction\n")
    for idx, sample in enumerate(data_dict[data_name]['test']):
        qid = sample.split('\t')[0]
        f.write(qid + '\t' + str(predicts[idx]) + '\n')
    f.close()

2. Target-level sentiment analysis

Target-level sentiment analysis extends the sentiment polarity of a whole sentence to the polarities of multiple specific aspects. In essence it is still sequence classification, but the same sequence has to be classified several times, once per aspect. The idea here is to feed the targeted aspect into the model as part of the input and predict the sentiment polarity for that aspect.
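Concretely, the aspect is passed to the tokenizer as the first text and the review as the second, so the model sees a single sequence of the form [CLS] aspect [SEP] review [SEP], with token_type_ids marking the two segments. The snippet below is a rough sketch of that encoding; the aspect/review pair is a made-up example and the SKEP tokenizer is the same one loaded again in section 2.3.

# A rough sketch of the text-pair encoding used for target-level sentiment analysis.
# The aspect string and the review text are hypothetical examples.
from paddlenlp.transformers import SkepTokenizer

tokenizer = SkepTokenizer.from_pretrained('skep_ernie_1.0_large_ch')
out = tokenizer.encode('phone#design_features', 'The size and width are the same as the 4S.', max_seq_len=64)
print(out['input_ids'])       # [CLS] aspect tokens [SEP] review tokens [SEP]
print(out['token_type_ids'])  # 0 for the aspect segment, 1 for the review segment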

In recent years, a large number of studies have shown that pre-trained models (PTMs) trained on large corpora can learn universal language representations, which benefits downstream NLP tasks and avoids training models from scratch. With the growth of computing power, the emergence of deep models (i.e. the Transformer) and improved training techniques, PTMs have kept developing from shallow to deep.

SKEP (Sentiment Knowledge Enhanced Pre-training for sentiment analysis) is a sentiment pre-training model. SKEP uses sentiment knowledge to enhance pre-training and comprehensively surpasses the previous SOTA on 14 typical Chinese and English sentiment analysis tasks; the work was accepted at ACL 2020. SKEP is a sentiment-knowledge-enhanced pre-training algorithm proposed by the Baidu research team. The algorithm mines sentiment knowledge automatically in an unsupervised way and then uses that knowledge to construct the pre-training objectives, so that the machine learns to understand sentiment semantics. SKEP provides a unified and powerful sentiment semantic representation for all kinds of sentiment analysis tasks.

Paper: https://arxiv.org/abs/2005.05635


The Baidu research team further verified the effect of the sentiment pre-training model SKEP on 14 Chinese and English datasets across three typical sentiment analysis tasks: sentence-level sentiment classification, aspect-level sentiment classification and opinion role labeling.

For detailed experimental results, see: https://github.com/baidu/Senta#skep

2.0 loading model and Tokenizer

import paddlenlp
from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer
# from paddlenlp.transformers import ErnieForSequenceClassification, ErnieTokenizer
# from paddlenlp.transformers import BertForSequenceClassification,BertTokenizer

2.1 data processing

# Decompress data
!unzip -o datasets/SE-ABSA16_CAME
!unzip -o datasets/SE-ABSA16_PHNS
with open("SE-ABSA16_CAME/train.tsv", 'r',encoding="UTF-8") as f:
    lines = f.readlines()
    for line in lines[:5]:
        print(line)

Analysis of internal data structure (the structures of the two data sets are the same):

train:
label		text_a		text_b
1		phone#design_features 		 Today, I was lucky to get the real Hong Kong version of white iPhone 5. I tried it. Let's talk about my feelings: 1 The size and width of the real machine are consistent with 4/4s, and there is no change
0		software#operation_performance 		 Apple's new iPhone 5 comes to hand. Compared with the 4S, it feels 1 and looks good. At first, looking at the press conference and online photos, I was the same as most people: little change, a little disappointed

test:
qid		text_a		text_b
0		software#usability 		 Just started 8600, experience. Just bought from Taobao, 1635 yuan (including mail). 1. Brand new
...		...		...


# Get dataset dictionary
def open_func(file_path):
    return [line.strip() for line in open(file_path, 'r', encoding='utf8').readlines()[1:] if len(line.strip().split('\t')) >= 2]

data_dict = {'seabsa16phns': {'test': open_func('SE-ABSA16_PHNS/test.tsv'),
                              'train': open_func('SE-ABSA16_PHNS/train.tsv')},
             'seabsa16came': {'test': open_func('SE-ABSA16_CAME/test.tsv'),
                              'train': open_func('SE-ABSA16_CAME/train.tsv')}}

2.2 defining data readers

The method is similar to 1.2 and is basically copied from it. Note that two texts are required here, and token_type_ids have to be taken into account.

# Define dataset
from paddle.io import Dataset, DataLoader
from paddlenlp.data import Pad, Stack, Tuple
import numpy as np
label_list = [0, 1]

# Consider token_type_id
class MyDataset(Dataset):
    def __init__(self, data, tokenizer, max_len=512, for_test=False):
        super().__init__()
        self._data = data
        self._tokenizer = tokenizer
        self._max_len = max_len
        self._for_test = for_test
    
    def __len__(self):
        return len(self._data)
    
    def __getitem__(self, idx):
        samples = self._data[idx].split('\t')
        label = samples[-3]
        text_b = samples[-1]
        text_a = samples[-2]
        label = int(label)
        encoder_out = self._tokenizer.encode(text_a, text_b, max_seq_len=self._max_len)
        text = encoder_out['input_ids']
        token_type = encoder_out['token_type_ids']
        if self._for_test:
            return np.array(text, dtype='int64'), np.array(token_type, dtype='int64')
        else:
            return np.array(text, dtype='int64'), np.array(token_type, dtype='int64'), np.array(label, dtype='int64')

def batchify_fn(for_test=False):
    if for_test:
        return lambda samples, fn=Tuple(Pad(axis=0, pad_val=tokenizer.pad_token_id),
                                        Pad(axis=0, pad_val=tokenizer.pad_token_type_id)): [data for data in fn(samples)]
    else:
        return lambda samples, fn=Tuple(Pad(axis=0, pad_val=tokenizer.pad_token_id),
                                        Pad(axis=0, pad_val=tokenizer.pad_token_type_id),
                                        Stack()): [data for data in fn(samples)]


def get_data_loader(data, tokenizer, batch_size=32, max_len=512, for_test=False):
    dataset = MyDataset(data, tokenizer, max_len, for_test)
    shuffle = True if not for_test else False
    data_loader = DataLoader(dataset=dataset, batch_size=batch_size, collate_fn=batchify_fn(for_test), shuffle=shuffle)
    return data_loader

2.3 model building and training

Copied from 1.3; pay attention to the dataset name and to adding the token_type_ids input.

import paddle
from paddle.static import InputSpec

# Model and word segmentation
# model = ErnieForSequenceClassification.from_pretrained('ernie-1.0', num_classes=2)
# tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')    # 0.5652

model = SkepForSequenceClassification.from_pretrained('skep_ernie_1.0_large_ch', num_classes=2)
tokenizer = SkepTokenizer.from_pretrained('skep_ernie_1.0_large_ch')

# model = BertForSequenceClassification.from_pretrained('bert-wwm-ext-chinese', num_classes=2)
# tokenizer = BertTokenizer.from_pretrained('bert-wwm-ext-chinese')

# Parameter setting, change this option to change the dataset
# data_name = 'seabsa16phns'
data_name = 'seabsa16came'
## Training related
epochs = 20
learning_rate = 1e-5   # seabsa16phns 2e-5 /seabsa16came  2e-5
batch_size = 48  #  seabsa16phns 36 36/seabsa16came 32 36
max_len = 204   # seabsa16phns 114  124/  seabsa16came 128 96

# Data correlation
train_dataloader = get_data_loader(data_dict[data_name]['train'], tokenizer, batch_size, max_len, for_test=False)

input = InputSpec((-1, -1), dtype='int64', name='input')
token_type = InputSpec((-1, -1), dtype='int64', name='token_type')
label = InputSpec((-1, 2), dtype='int64', name='label')
model = paddle.Model(model, [input, token_type], [label])

# Model preparation
#Dataset seabsa16phns seabsa16cam
# step_each_epoch = len(train_dataloader)
# lr = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=learning_rate,
#                                                   T_max=step_each_epoch * epochs)
# optimizer = paddle.optimizer.Adam(learning_rate=lr, parameters=model.parameters())

optimizer = paddle.optimizer.AdamW(weight_decay=0.01, learning_rate=learning_rate,parameters=model.parameters())
model.prepare(optimizer, loss=paddle.nn.CrossEntropyLoss(), metrics=[paddle.metric.Accuracy()])

Training seabsa16phns

# Start training
model.fit(train_dataloader, batch_size=batch_size, epochs=epochs, save_freq=epochs, verbose=2,save_dir='./ckpt/seabsa16phns')

Training seabsa16came

# Start training
model.fit(train_dataloader, batch_size=batch_size, epochs=20, save_freq=20, verbose=2,save_dir='./ckpt/seabsa16came')

2.4 prediction and saving

# Import pre training model
checkpoint_path = "./ckpt/" + data_name +  "/final"
model = SkepForSequenceClassification.from_pretrained('skep_ernie_1.0_large_ch', num_classes=2)
# model = ErnieForSequenceClassification.from_pretrained('ernie-1.0', num_classes=2)

input = InputSpec((-1, -1), dtype='int64', name='input')
token_type = InputSpec((-1, -1), dtype='int64', name='token_type')
model = paddle.Model(model, [input, token_type])
model.load(checkpoint_path)

# Import test set
test_dataloader = get_data_loader(data_dict[data_name]['test'], tokenizer, batch_size, max_len, for_test=True)
# Forecast saving
save_file = {'seabsa16phns': './submission/SE-ABSA16_PHNS.tsv', 'seabsa16came': './submission/SE-ABSA16_CAME.tsv'}
predicts = []
for batch in test_dataloader:
    predict = model.predict_batch(batch)
    predicts += predict[0].argmax(axis=-1).tolist()

with open(save_file[data_name], 'w', encoding='utf8') as f:
    f.write("index\tprediction\n")
    for idx, sample in enumerate(data_dict[data_name]['test']):
        qid = sample.split('\t')[0]
        f.write(qid + '\t' + str(predicts[idx]) + '\n')
    f.close()

3. Opinion extraction

Information extraction aims to extract structured knowledge, such as entities, relations and events, from unstructured natural language text.

3.0 loading model and Tokenizer

import paddlenlp
from paddlenlp.transformers import SkepForTokenClassification, SkepTokenizer
# from paddlenlp.transformers import ErnieForTokenClassification, ErnieTokenizer
# from paddlenlp.transformers import BertForTokenClassification, BertTokenizer

3.1 data processing

# Decompress data
!unzip -o datasets/COTE-BD
!unzip -o datasets/COTE-DP
!unzip -o datasets/COTE-MFW
with open("COTE-DP/train.tsv", 'r',encoding="UTF-8") as f:
    lines = f.readlines()
    print('Data size:',len(lines))
    for line in lines[:3]:
        print(line)
with open("COTE-BD/train.tsv", 'r',encoding="UTF-8") as f:
    lines = f.readlines()
    print('Data size:',len(lines))
    for line in lines[:3]:
        print(line)
with open("COTE-MFW/train.tsv", 'r',encoding="UTF-8") as f:
    lines = f.readlines()
    print('Data size:',len(lines))
    for line in lines[:3]:
        print(line)

Analysis of internal data structure (the structures of the three data sets are the same):

train:
label		text_a
bird man		The book "bird man" takes the experience of Dr. bird as the main line, and mainly describes various absurd experiences of Dr. bird after he came out of campus.
...		...
test:
qid		text_a
0		The scenery of bipenggou has been heard for a long time, especially in autumn. However, it was late this time. All the red leaves fell off, and the yellow leaves could not be seen. It only snowed...
...		...
# Get dataset dictionary
def open_func(file_path):
    return [line.strip() for line in open(file_path, 'r', encoding='utf8').readlines()[1:] if len(line.strip().split('\t')) >= 2]

data_dict = {'cotebd': {'test': open_func('COTE-BD/test.tsv'),
                        'train': open_func('COTE-BD/train.tsv')},
             'cotedp': {'test': open_func('COTE-DP/test.tsv'),
                        'train': open_func('COTE-DP/train.tsv')},
             'cotemfw': {'test': open_func('COTE-MFW/test.tsv'),
                        'train': open_func('COTE-MFW/train.tsv')}}

3.2 defining data readers

The idea is similar. Note that this time it is token-level classification: in the data reader the labels are written in BIO form, with one label per token.
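For intuition, the labeling scheme looks like this (a toy illustration with hypothetical tokens; it is not the output of the real tokenizer):

# Toy illustration of the BIO scheme built by MyDataset below (hypothetical tokens):
# the first token of the entity gets B, the rest of the entity I, and everything else
# (including [CLS]/[SEP]) gets O.
tokens = ['[CLS]', '鸟', '人', '这', '本', '书', '很', '好', '看', '[SEP]']
labels = ['O',     'B',  'I',  'O',  'O',  'O',  'O',  'O',  'O',  'O']   # entity: 鸟人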

# Define dataset
from paddle.io import Dataset, DataLoader
from paddlenlp.data import Pad, Stack, Tuple
import numpy as np
label_list = {'B': 0, 'I': 1, 'O': 2}
index2label = {0: 'B', 1: 'I', 2: 'O'}

# Consider token_type_id
class MyDataset(Dataset):
    def __init__(self, data, tokenizer, max_len=512, for_test=False):
        super().__init__()
        self._data = data
        self._tokenizer = tokenizer
        self._max_len = max_len
        self._for_test = for_test
    
    def __len__(self):
        return len(self._data)
    
    def __getitem__(self, idx):
        samples = self._data[idx].split('\t')
        label = samples[-2]
        text = samples[-1]
        if self._for_test:
            origin_enc = self._tokenizer.encode(text, max_seq_len=self._max_len)['input_ids']
            return np.array(origin_enc, dtype='int64')
        else:
            
            # The entity may not align with token boundaries, so a simple approach is used:
            # split the text on the label, encode the label and the remaining segments
            # separately, and finally stitch everything back together.
            texts = text.split(label)
            label_enc = self._tokenizer.encode(label)['input_ids']
            cls_enc = label_enc[0]
            sep_enc = label_enc[-1]
            label_enc = label_enc[1:-1]
            
            # merge
            origin_enc = []
            label_ids = []
            for index, text in enumerate(texts):
                text_enc = self._tokenizer.encode(text)['input_ids']
                text_enc = text_enc[1:-1]
                origin_enc += text_enc
                label_ids += [label_list['O']] * len(text_enc)
                if index != len(texts) - 1:
                    origin_enc += label_enc
                    label_ids += [label_list['B']] + [label_list['I']] * (len(label_enc) - 1)

            origin_enc = [cls_enc] + origin_enc + [sep_enc]
            label_ids = [label_list['O']] + label_ids + [label_list['O']]
            
            # truncation
            if len(origin_enc) > self._max_len:
                origin_enc = origin_enc[:self._max_len-1] + origin_enc[-1:]
                label_ids = label_ids[:self._max_len-1] + label_ids[-1:]
            return np.array(origin_enc, dtype='int64'), np.array(label_ids, dtype='int64')


def batchify_fn(for_test=False):
    if for_test:
        return lambda samples, fn=Pad(axis=0, pad_val=tokenizer.pad_token_id): np.row_stack([data for data in fn(samples)])
    else:
        return lambda samples, fn=Tuple(Pad(axis=0, pad_val=tokenizer.pad_token_id),
                                        Pad(axis=0, pad_val=label_list['O'])): [data for data in fn(samples)]


def get_data_loader(data, tokenizer, batch_size=32, max_len=512, for_test=False):
    dataset = MyDataset(data, tokenizer, max_len, for_test)
    shuffle = True if not for_test else False
    data_loader = DataLoader(dataset=dataset, batch_size=batch_size, collate_fn=batchify_fn(for_test), shuffle=shuffle)
    return data_loader

3.3 model building and training

The difference is that the model is replaced with a token classification model. Since Accuracy is no longer applicable to token classification here, we use Perplexity to roughly gauge prediction quality (the closer to 1 the better).

import paddle
from paddle.static import InputSpec
from paddlenlp.metrics import Perplexity

# Model and word segmentation 
model = SkepForTokenClassification.from_pretrained('skep_ernie_1.0_large_ch', num_classes=3)
tokenizer = SkepTokenizer.from_pretrained('skep_ernie_1.0_large_ch')

# model = ErnieForTokenClassification.from_pretrained('ernie-1.0', num_classes=3)
# tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')   

# model = BertForTokenClassification.from_pretrained('bert-wwm-ext-chinese', num_classes=3)
# tokenizer = BertTokenizer.from_pretrained('bert-wwm-ext-chinese')

# Parameter setting, change this option to change the dataset
# data_name = 'cotedp'
# data_name = 'cotebd'
data_name = 'cotemfw'
# Training related
epochs = 10        # cotedp / cotebd 10 16 /
learning_rate = 2e-5     # 2e-5 / 2e-5 / 4e-5
batch_size = 56  # cotedp 128 156 / cotebd 128 156 / cotemfw 156 196
max_len = 196      # 96 144 / 96 / 96

## Data correlation
train_dataloader = get_data_loader(data_dict[data_name]['train'], tokenizer, batch_size, max_len, for_test=False)

input = InputSpec((-1, -1), dtype='int64', name='input')
label = InputSpec((-1, -1, 3), dtype='int64', name='label')
model = paddle.Model(model, [input], [label])

# Model preparation    # 1 2e-5
# step_each_epoch = len(train_dataloader)
# lr = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=learning_rate,
#                                                   T_max=step_each_epoch * epochs)
# optimizer = paddle.optimizer.Adam(learning_rate=lr, parameters=model.parameters()
# ,weight_decay=paddle.regularizer.L2Decay(3e-5))    # weight_decay=paddle.regularizer.L2Decay(3e-4)

# optimizer = paddle.optimizer.Adam(learning_rate=learning_rate, parameters=model.parameters(),weight_decay=paddle.regularizer.L2Decay(3e-4))

optimizer = paddle.optimizer.AdamW(learning_rate=learning_rate,parameters=model.parameters(),weight_decay=0.01)
model.prepare(optimizer, loss=paddle.nn.CrossEntropyLoss(), metrics=[Perplexity()]) 

Training cotedp

# Start training cotedp
model.fit(train_dataloader, batch_size=batch_size, epochs=epochs, save_freq=epochs, save_dir='./ckpt/cotedp')

Training cotebd

# Start training cotebd
model.fit(train_dataloader, batch_size=batch_size, epochs=epochs, save_freq=epochs, save_dir='./ckpt/cotebd')

Training cotemfw

# Start training cotemfw
model.fit(train_dataloader, batch_size=batch_size, epochs=epochs, save_freq=epochs, save_dir='./ckpt/cotemfw')

3.4 prediction and saving

# Parameter setting, change this option to change the dataset cotedp cotebd cotemfw
# data_name = 'cotedp'
# data_name = 'cotebd'
data_name = 'cotemfw'
# Import pre training model
checkpoint_path = "./ckpt/" + data_name +  "/final"  # Fill in the saving path of the pre training model

model = SkepForTokenClassification.from_pretrained('skep_ernie_1.0_large_ch', num_classes=3)
# model = ErnieForTokenClassification.from_pretrained('ernie-1.0', num_classes=3)

input = InputSpec((-1, -1), dtype='int64', name='input')
model = paddle.Model(model, [input])
model.load(checkpoint_path)

# Import test set
test_dataloader = get_data_loader(data_dict[data_name]['test'], tokenizer, batch_size, max_len, for_test=True)
# Save results
save_file = {'cotebd': './submission/COTE_BD.tsv', 'cotedp': './submission/COTE_DP.tsv', 'cotemfw': './submission/COTE_MFW.tsv'}
predicts = []
input_ids = []
for batch in test_dataloader:
    predict = model.predict_batch(batch)
    predicts += predict[0].argmax(axis=-1).tolist()
    input_ids += batch.numpy().tolist()

# First find the position of B (label 0), then collect all following I positions (label 1) as the entity.
def find_entity(prediction, input_ids):
    entity = []
    entity_ids = []
    for index, idx in enumerate(prediction):
        if idx == label_list['B']:
            entity_ids = [input_ids[index]]
        elif idx == label_list['I']:
            if entity_ids:
                entity_ids.append(input_ids[index])
        elif idx == label_list['O']:
            if entity_ids:
                entity.append(''.join(tokenizer.convert_ids_to_tokens(entity_ids)))
                entity_ids = []
    return entity

with open(save_file[data_name], 'w', encoding='utf8') as f:
    f.write("index\tprediction\n")
    for idx, sample in enumerate(data_dict[data_name]['test']):
        qid = sample.split('\t')[0]
        entity = find_entity(predicts[idx], input_ids[idx])
        f.write(qid + '\t' + '\x01'.join(entity) + '\n')
    f.close()
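A quick way to convince yourself that find_entity behaves as intended is a toy check like the one below (a sketch; the sentence and the predicted tags are made up, and it reuses the tokenizer and label_list defined in 3.2/3.3).

# Toy check of find_entity (a sketch; sentence and predicted tags are made up).
toy_ids = tokenizer.convert_tokens_to_ids(list('鸟人这本书很好看'))
toy_pred = [label_list['B'], label_list['I']] + [label_list['O']] * (len(toy_ids) - 2)
print(find_entity(toy_pred, toy_ids))  # expected: ['鸟人']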


Compress the prediction results into a zip file and submit them on the Thousand Words (Qianyan) competition website.

Note: the NLPCC14-SC.tsv, SE-ABSA16_CAME.tsv, COTE_BD.tsv, COTE_MFW.tsv and COTE_DP.tsv files in the results folder are included only to make the submission complete; their results still need to be improved.

#Compress the forecast results into a zip file and submit
!zip -r submission.zip submission

References & Courses

1) Baidu PaddlePaddle: Natural Language Processing based on deep learning
2) NLP live lesson Day 5: the sentiment analysis pre-trained model SKEP
