SKEP model for fake and real news classification based on PaddleNLP

1, Fake and real news classification for the US general election based on PaddleNLP (2): a classification task based on the SKEP model

0. Preface

Originally this series was left unfinished. Then someone asked where Part 2 was, and I had to admit it had never been published. So, with tears, I am filling the hole I dug. Next time I will never split a title into Part 1 and Part 2 again; it is far too easy to never finish......

1. Introduction

The news media have become the main channel for conveying what is happening in the world to people everywhere. People usually assume that everything reported in the news is true; in some cases even news channels admit that their reporting is not as truthful as it is written. Some news has a significant impact not only on the public or the government but also on the economy: a single news story can move the market up or down depending on people's emotions and the political situation.

Distinguishing fake news from real news is therefore important. The problem can be tackled with natural language processing tools, and this article shows how to classify news as fake or real based on historical data.

2. Problem description

For print and digital media alike, the authenticity of information has long been a problem affecting business and society. On social networks, the reach and impact of information spread at such speed and are amplified so rapidly that distorted, inaccurate, or false information can affect millions of real-world users within minutes. Recently, people have raised concerns about this problem and proposed methods to alleviate it.

Throughout the history of information broadcasting, there have been imprecise, eye-catching, and captivating headlines designed to attract the audience's attention in order to sell information. On social networking sites, however, the reach and impact of information spreading are amplified so significantly, and the spread happens so fast, that distorted, inaccurate, or false news can have a real impact on millions of users within minutes.

3. Objectives

  • Our only goal is to classify the news in the dataset as fake or real
  • Perform a detailed EDA of the news
  • Select and build a powerful classification model

4. Data

Data address: https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset

2, Data processing

1. PaddleNLP environment

!pip install -U paddlenlp

2. Decompress data

# Run the decompression once, then comment it out
# !unzip "data/data27271/true and false news dataset.zip"

3. Import basic libraries

# Basic data packages: pandas and numpy
import pandas as pd 
import numpy as np 
import os
import paddle
import paddle.nn.functional as F

4. Load data

# Read the datasets
fake_news = pd.read_csv('Fake.csv')
true_news = pd.read_csv('True.csv')
# Size and fields of the fake news dataset
print("Size and fields of the fake news dataset (rows, columns): " + str(fake_news.shape))
print(fake_news.info())
print("\n --------------------------------------- \n")
# Size and fields of the real news dataset
print("Size and fields of the real news dataset (rows, columns): " + str(true_news.shape))
print(true_news.info())
Size and fields of the fake news dataset (rows, columns): (23481, 4)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23481 entries, 0 to 23480
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    23481 non-null  object
 1   text     23481 non-null  object
 2   subject  23481 non-null  object
 3   date     23481 non-null  object
dtypes: object(4)
memory usage: 733.9+ KB
None

 --------------------------------------- 

Size and fields of the real news dataset (rows, columns): (21417, 4)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21417 entries, 0 to 21416
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    21417 non-null  object
 1   text     21417 non-null  object
 2   subject  21417 non-null  object
 3   date     21417 non-null  object
dtypes: object(4)
memory usage: 669.4+ KB
None

5. Data consolidation

# Label conversion: 0 = real news, 1 = fake news
true_news['label'] = 0
fake_news['label'] = 1

# Data merging
news_all = pd.concat([true_news, fake_news], ignore_index=True)
# Note: without parentheses this echoes the bound method, whose repr prints the full
# DataFrame below; call news_all.info() for the column summary instead
news_all.info
<bound method DataFrame.info of                                                    title  \
0      As U.S. budget fight looms, Republicans flip t...   
1      U.S. military to accept transgender recruits o...   
2      Senior U.S. Republican senator: 'Let Mr. Muell...   
3      FBI Russia probe helped by Australian diplomat...   
4      Trump wants Postal Service to charge 'much mor...   
...                                                  ...   
44893  McPain: John McCain Furious That Iran Treated ...   
44894  JUSTICE? Yahoo Settles E-mail Privacy Class-ac...   
44895  Sunnistan: US and Allied 'Safe Zone' Plan to T...   
44896  How to Blow $700 Million: Al Jazeera America F...   
44897  10 U.S. Navy Sailors Held by Iranian Military ...   

                                                    text       subject  \
0      WASHINGTON (Reuters) - The head of a conservat...  politicsNews   
1      WASHINGTON (Reuters) - Transgender people will...  politicsNews   
2      WASHINGTON (Reuters) - The special counsel inv...  politicsNews   
3      WASHINGTON (Reuters) - Trump campaign adviser ...  politicsNews   
4      SEATTLE/WASHINGTON (Reuters) - President Donal...  politicsNews   
...                                                  ...           ...   
44893  21st Century Wire says As 21WIRE reported earl...   Middle-east   
44894  21st Century Wire says It s a familiar theme. ...   Middle-east   
44895  Patrick Henningsen  21st Century WireRemember ...   Middle-east   
44896  21st Century Wire says Al Jazeera America will...   Middle-east   
44897  21st Century Wire says As 21WIRE predicted in ...   Middle-east   

                     date  label  
0      December 31, 2017       0  
1      December 29, 2017       0  
2      December 31, 2017       0  
3      December 30, 2017       0  
4      December 29, 2017       0  
...                   ...    ...  
44893    January 16, 2016      1  
44894    January 16, 2016      1  
44895    January 15, 2016      1  
44896    January 14, 2016      1  
44897    January 12, 2016      1  

[44898 rows x 5 columns]>
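
A quick sanity check on the merged labels (this snippet is my own addition; the expected counts follow from the dataset shapes printed above):

# Check the label distribution after merging; 1 = fake, 0 = real
print(news_all['label'].value_counts())
# Expected: 1 -> 23481, 0 -> 21417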

6. Dataset division

# Custom reader method
from paddlenlp.datasets import load_dataset
from paddle.io import Dataset, Subset
from paddlenlp.datasets import MapDataset

def read(pd_data):
    # Concatenate title and body text as the model input
    for index, item in pd_data.iterrows():
        yield {'text': item['title'] + '. ' + item['text'], 'label': item['label'], 'qid': index}

# Load the full dataset, then partition it: every index with i % 10 == 1 goes to dev,
# the rest to train (roughly a 90/10 split)
all_ds = load_dataset(read, pd_data=news_all, lazy=False)
train_ds = Subset(dataset=all_ds, indices=[i for i in range(len(all_ds)) if i % 10 != 1])
dev_ds = Subset(dataset=all_ds, indices=[i for i in range(len(all_ds)) if i % 10 == 1])

# Convert to MapDataset type
train_ds = MapDataset(train_ds)
dev_ds = MapDataset(dev_ds)
print(len(train_ds))
print(len(dev_ds))
40408
4490

3, SKEP model loading

from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer

# Specify the model name and load the model with one click
model = SkepForSequenceClassification.from_pretrained(pretrained_model_name_or_path="skep_ernie_2.0_large_en", num_classes=2)
# Similarly, specifying the model name loads the corresponding Tokenizer with one click; it processes text data, e.g. splitting text into tokens and converting tokens to token_ids
tokenizer = SkepTokenizer.from_pretrained(pretrained_model_name_or_path="skep_ernie_2.0_large_en")
[2021-07-26 00:43:18,015] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/skep_ernie_2.0_large_en/skep_ernie_2.0_large_en.pdparams
[2021-07-26 00:43:28,446] [    INFO] - Found /home/aistudio/.paddlenlp/models/skep_ernie_2.0_large_en/skep_ernie_2.0_large_en.vocab.txt

4, NLP data processing

1. Add logging

# Import VisualDL's LogWriter for training visualization
from visualdl import LogWriter

writer = LogWriter("./log")
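
Note that writer is created here but never called in the training loop below. A minimal, self-contained sketch of how it could be used with VisualDL's add_scalar API (the tag names are my own choice):

# Illustrative VisualDL usage: log one scalar per step
with LogWriter(logdir="./log/demo") as demo_writer:
    for step in range(3):
        demo_writer.add_scalar(tag="demo/loss", step=step, value=1.0 / (step + 1))
# In the training loop one could similarly call, e.g.:
#   writer.add_scalar(tag="train/loss", step=global_step, value=float(loss))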

2. SkepTokenizer data processing

The SKEP model processes text at token granularity. We can use the SkepTokenizer built into PaddleNLP to complete this processing in one step.
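
As a quick sanity check (the example sentence is my own, not from the dataset), the tokenizer can be called directly:

# Tokenize one sentence and inspect the fields the model will consume
sample = tokenizer(text="Fake news spreads fast.", max_seq_len=16)
print(sample["input_ids"])       # token ids, wrapped with [CLS] ... [SEP]
print(sample["token_type_ids"])  # all zeros for a single-sentence input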

import os
from functools import partial


import numpy as np
import paddle
import paddle.nn.functional as F
from paddlenlp.data import Stack, Tuple, Pad

from utils import create_dataloader

def convert_example(example,
                    tokenizer,
                    max_seq_length=512,
                    is_test=False):

    # Process the raw data into a format the model can read in; encoded_inputs is a
    # dict that contains input_ids, token_type_ids and other fields
    encoded_inputs = tokenizer(
        text=example["text"], max_seq_len=max_seq_length)

    # input_ids: the vocabulary ids of the tokens the text is split into
    input_ids = encoded_inputs["input_ids"]
    # token_type_ids: whether the current token belongs to sentence 1 or sentence 2,
    # i.e. the segment ids
    token_type_ids = encoded_inputs["token_type_ids"]

    if not is_test:
        # label: the news category (0 = real, 1 = fake)
        label = np.array([example["label"]], dtype="int64")
        return input_ids, token_type_ids, label
    else:
        # qid: the id of each example
        qid = np.array([example["qid"]], dtype="int64")
        return input_ids, token_type_ids, qid
# Batch data size
batch_size = 10
# Maximum length of text sequence
max_seq_length = 512

# Process the data into a data format that the model can read in
trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length)

# Assemble samples into batch data, e.g.
# pad text sequences of different lengths to the longest length in the batch,
# and stack the labels of the samples together
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
    Stack()  # labels
): [data for data in fn(samples)]
train_data_loader = create_dataloader(
    train_ds,
    mode='train',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
dev_data_loader = create_dataloader(
    dev_ds,
    mode='dev',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
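
create_dataloader is imported from the accompanying utils.py, which this post does not show. A typical implementation, following the pattern in PaddleNLP's example scripts (a sketch, not necessarily identical to the author's file):

# Sketch of utils.create_dataloader in the style of PaddleNLP's examples
def create_dataloader(dataset, mode='train', batch_size=1, batchify_fn=None, trans_fn=None):
    if trans_fn:
        dataset = dataset.map(trans_fn)  # apply convert_example to every sample
    shuffle = (mode == 'train')  # shuffle only during training
    if mode == 'train':
        batch_sampler = paddle.io.DistributedBatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)
    else:
        batch_sampler = paddle.io.BatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)
    return paddle.io.DataLoader(
        dataset=dataset,
        batch_sampler=batch_sampler,
        collate_fn=batchify_fn,
        return_list=True)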

5, Model training and evaluation

1. Start training

import time

from utils import evaluate

# Training rounds
epochs = 3
# Folder to save model parameters during training
ckpt_dir = "skep_ckpt"
# len(train_data_loader) is the number of steps needed for one training epoch
num_training_steps = len(train_data_loader) * epochs

# AdamW optimizer
optimizer = paddle.optimizer.AdamW(
    learning_rate=2e-5,
    parameters=model.parameters())
# Cross entropy loss function
criterion = paddle.nn.loss.CrossEntropyLoss()
# Accuracy evaluation metric
metric = paddle.metric.Accuracy()
# Start training
global_step = 0
tic_train = time.time()
for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        input_ids, token_type_ids, labels = batch
        # Feed data to model
        logits = model(input_ids, token_type_ids)
        # Calculate loss function value
        loss = criterion(logits, labels)
        # Predicted classification probability value
        probs = F.softmax(logits, axis=1)
        # Calculate acc
        correct = metric.compute(probs, labels)
        metric.update(correct)
        acc = metric.accumulate()

        global_step += 1
        if global_step % 10 == 0:
            print(
                "global step %d, epoch: %d, batch: %d, loss: %.5f, accu: %.5f, speed: %.2f step/s"
                % (global_step, epoch, step, loss, acc,
                    10 / (time.time() - tic_train)))
            tic_train = time.time()
        
        # Backpropagate gradients and update the parameters
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()

        if global_step % 100 == 0:
            save_dir = os.path.join(ckpt_dir, "model_%d" % global_step)
            if not os.path.exists(save_dir):
                os.makedirs(save_dir)
            # Evaluate the model at the current training step
            evaluate(model, criterion, metric, dev_data_loader)
            # Save the current model parameters
            model.save_pretrained(save_dir)
            # Save the tokenizer vocabulary, etc.
            tokenizer.save_pretrained(save_dir)
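
evaluate also lives in the unshown utils.py. A typical implementation from PaddleNLP's example code, whose print format matches the "eval loss ..., accu ..." line in the log below (again a sketch, assuming the standard utility):

# Sketch of utils.evaluate in the style of PaddleNLP's examples
@paddle.no_grad()
def evaluate(model, criterion, metric, data_loader):
    model.eval()
    metric.reset()
    losses = []
    for batch in data_loader:
        input_ids, token_type_ids, labels = batch
        logits = model(input_ids, token_type_ids)
        losses.append(criterion(logits, labels).numpy())
        correct = metric.compute(logits, labels)
        metric.update(correct)
    print("eval loss: %.5f, accu: %.5f" % (np.mean(losses), metric.accumulate()))
    model.train()
    metric.reset()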

2. Training log

global step 110, epoch: 1, batch: 110, loss: 0.00653, accu: 0.98000, speed: 0.05 step/s
global step 120, epoch: 1, batch: 120, loss: 0.00180, accu: 0.99000, speed: 0.95 step/s
global step 130, epoch: 1, batch: 130, loss: 0.00236, accu: 0.99000, speed: 0.94 step/s
global step 140, epoch: 1, batch: 140, loss: 0.00210, accu: 0.99250, speed: 0.94 step/s
global step 150, epoch: 1, batch: 150, loss: 0.00216, accu: 0.99400, speed: 0.95 step/s
global step 160, epoch: 1, batch: 160, loss: 0.00651, accu: 0.99500, speed: 0.95 step/s
global step 170, epoch: 1, batch: 170, loss: 0.00105, accu: 0.99571, speed: 0.95 step/s
global step 180, epoch: 1, batch: 180, loss: 0.00092, accu: 0.99625, speed: 0.94 step/s
global step 190, epoch: 1, batch: 190, loss: 0.00065, accu: 0.99667, speed: 0.94 step/s
global step 200, epoch: 1, batch: 200, loss: 0.00058, accu: 0.99700, speed: 0.95 step/s
eval loss: 0.00571, accu: 0.99866
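
The post stops at training, but as a usage note, the fine-tuned model can classify a single article like this (a sketch of my own; predict_one is a hypothetical helper, not from the post):

# Hypothetical helper: classify one piece of text with the trained model
def predict_one(text, model, tokenizer, max_seq_len=512):
    model.eval()
    encoded = tokenizer(text=text, max_seq_len=max_seq_len)
    input_ids = paddle.to_tensor([encoded["input_ids"]])
    token_type_ids = paddle.to_tensor([encoded["token_type_ids"]])
    with paddle.no_grad():
        logits = model(input_ids, token_type_ids)
    probs = F.softmax(logits, axis=1)
    label = int(paddle.argmax(probs, axis=1).numpy()[0])  # 0 = real, 1 = fake
    return label, probs.numpy()[0]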

6, Summary

1. Data processing

  • Merge the datasets
  • Generate new labels
  • Partition the dataset

2. PaddleNLP custom reader method

Previously, most custom readers opened the files directly and returned their contents; here the data is read with pandas and yielded from the DataFrame, which is more convenient and fast.

3. SKEP model application

For a classification model like this one, max_seq_length is the maximum number of tokens, at most 512. For longer texts there are various tricks, such as keeping the front of the text, keeping the back, and so on, but some information is inevitably lost; see the sketch below.
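
A sketch of what such truncation tricks look like (the helper and the strategy names are my own illustration):

# Hypothetical truncation strategies for token lists longer than max_len
def truncate(tokens, max_len, strategy="head"):
    if len(tokens) <= max_len:
        return tokens
    if strategy == "head":   # keep the beginning, drop the rest
        return tokens[:max_len]
    if strategy == "tail":   # keep the end, drop the beginning
        return tokens[-max_len:]
    # "head+tail": keep part of the beginning and part of the end
    head = max_len // 2
    return tokens[:head] + tokens[-(max_len - head):]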

Keywords: NLP paddlepaddle
