1, Classification of True and False News of the US General Election Based on PaddleNLP (2): Classification Task Based on the SKEP Model
0. Preface
Honestly, this part was left unfinished. Then someone asked where Part 2 was, and I had to admit it had not been published yet, so with tears in my eyes I am now filling the hole I dug. Next time I will not casually split a title into parts one and two; it is much easier to finish things in one go...
1. Introduction
The news media has become a channel for conveying information about what is happening in the world to people everywhere. People usually assume that everything reported in the news is true, yet in some cases even news channels admit that their reports are not as truthful as written. Some news has a significant impact not only on the public or the government but also on the economy: a single piece of news can push the market curve up or down depending on people's emotions and the political situation.
It is therefore very important to distinguish false news from real news. This problem can be tackled with natural language processing tools, and this article shows how to identify false or real news from historical data.
2. Problem description
For both print and digital media, the authenticity of information has long been a problem affecting businesses and society. On social networks, the reach and speed of information dissemination are amplified so rapidly that distorted, inaccurate, or false information can affect millions of real-world users within minutes. Recently, concerns about this problem have grown, and several methods to mitigate it have been proposed.
Throughout the history of information broadcasting there have been imprecise, eye-catching, and sensational headlines designed to grab the audience's attention and sell information. On social networking sites, however, the reach and impact of such information are amplified dramatically, and it spreads so fast that distorted, inaccurate, or false information can do real damage to millions of users within minutes.
3. Objectives
- Our only goal is to classify each news item in the dataset as fake or real
- Perform a detailed EDA of the news
- Select and build a powerful classification model
4. Data
Data address: https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset
2, Data processing
1.PaddleNLP environment
!pip install -U paddlenlp
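If the installation succeeds, a quick optional sanity check (not part of the original notebook) is to print the installed version:

import paddlenlp
print(paddlenlp.__version__)  # a 2.x release is expected for the SKEP APIs used below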
2. Decompress data
# Run the decompression once and comment it out later
# !unzip 'data/data27271/true and false news dataset.zip'
3. Import basic libraries
# Basic data packages: pandas and numpy
import pandas as pd
import numpy as np
import os

import paddle
import paddle.nn.functional as F
4. Load data
import pandas as pd

# Read datasets
fake_news = pd.read_csv('Fake.csv')
true_news = pd.read_csv('True.csv')

# Size and field of false news dataset
print("Size and field of false news dataset (row, column):" + str(fake_news.shape))
print(fake_news.info())
print("\n --------------------------------------- \n")

# Size and field of real news dataset
print("Size and field of real news dataset (row, column):" + str(true_news.shape))
print(true_news.info())
Size and field of false news dataset (row, column):(23481, 4)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23481 entries, 0 to 23480
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   title    23481 non-null  object
 1   text     23481 non-null  object
 2   subject  23481 non-null  object
 3   date     23481 non-null  object
dtypes: object(4)
memory usage: 733.9+ KB
None

 ---------------------------------------

Size and field of real news dataset (row, column):(21417, 4)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21417 entries, 0 to 21416
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   title    21417 non-null  object
 1   text     21417 non-null  object
 2   subject  21417 non-null  object
 3   date     21417 non-null  object
dtypes: object(4)
memory usage: 669.4+ KB
None
5. Data consolidation
# Label conversion
true_news['label'] = 0
fake_news['label'] = 1

# Data merging
news_all = pd.concat([true_news, fake_news], ignore_index=True)
news_all.info
<bound method DataFrame.info of                                                    title  \
0      As U.S. budget fight looms, Republicans flip t...
1      U.S. military to accept transgender recruits o...
2      Senior U.S. Republican senator: 'Let Mr. Muell...
3      FBI Russia probe helped by Australian diplomat...
4      Trump wants Postal Service to charge 'much mor...
...                                                  ...
44893  McPain: John McCain Furious That Iran Treated ...
44894  JUSTICE? Yahoo Settles E-mail Privacy Class-ac...
44895  Sunnistan: US and Allied 'Safe Zone' Plan to T...
44896  How to Blow $700 Million: Al Jazeera America F...
44897  10 U.S. Navy Sailors Held by Iranian Military ...

                                                    text       subject  \
0      WASHINGTON (Reuters) - The head of a conservat...  politicsNews
1      WASHINGTON (Reuters) - Transgender people will...  politicsNews
2      WASHINGTON (Reuters) - The special counsel inv...  politicsNews
3      WASHINGTON (Reuters) - Trump campaign adviser ...  politicsNews
4      SEATTLE/WASHINGTON (Reuters) - President Donal...  politicsNews
...                                                  ...           ...
44893  21st Century Wire says As 21WIRE reported earl...   Middle-east
44894  21st Century Wire says It s a familiar theme. ...   Middle-east
44895  Patrick Henningsen 21st Century WireRemember ...    Middle-east
44896  21st Century Wire says Al Jazeera America will...   Middle-east
44897  21st Century Wire says As 21WIRE predicted in ...   Middle-east

                    date  label
0      December 31, 2017      0
1      December 29, 2017      0
2      December 31, 2017      0
3      December 30, 2017      0
4      December 29, 2017      0
...                  ...    ...
44893   January 16, 2016      1
44894   January 16, 2016      1
44895   January 15, 2016      1
44896   January 14, 2016      1
44897   January 12, 2016      1

[44898 rows x 5 columns]>
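As an optional sanity check (not in the original notebook), we can confirm that the merged DataFrame contains both labels in the expected proportions:

# Count samples per label: 0 = real news, 1 = fake news
print(news_all['label'].value_counts())
# Expected from the shapes above: 23481 rows with label 1 and 21417 rows with label 0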
6. Data set division
# Custom reader method
from paddlenlp.datasets import load_dataset
from paddle.io import Dataset, Subset
from paddlenlp.datasets import MapDataset

def read(pd_data):
    for index, item in pd_data.iterrows():
        yield {'text': item['title'] + '. ' + item['text'], 'label': item['label'], 'qid': index}
# Partition dataset
all_ds = load_dataset(read, pd_data=news_all, lazy=False)
train_ds = Subset(dataset=all_ds, indices=[i for i in range(len(all_ds)) if i % 10 != 1])
dev_ds = Subset(dataset=all_ds, indices=[i for i in range(len(all_ds)) if i % 10 == 1])

# Converting to MapDataset type
train_ds = MapDataset(train_ds)
dev_ds = MapDataset(dev_ds)
print(len(train_ds))
print(len(dev_ds))
40408
4490
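The condition `i % 10 != 1` keeps roughly 90% of the 44,898 samples for training and assigns the remaining ~10% (every index ending in 1) to the dev set, which matches the 40408 / 4490 counts above. To peek at one processed sample (an optional check, not in the original notebook):

# Each sample is the dict produced by read(): text, label and qid
print(train_ds[0])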
3, SKEP model loading
from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer

# Specify the model name and load the model with one click
model = SkepForSequenceClassification.from_pretrained(
    pretrained_model_name_or_path="skep_ernie_2.0_large_en", num_classes=2)
# In the same way, load the corresponding tokenizer by model name; it is used to process
# the text data, e.g. splitting it into tokens and converting tokens to token ids
tokenizer = SkepTokenizer.from_pretrained(
    pretrained_model_name_or_path="skep_ernie_2.0_large_en")
[2021-07-26 00:43:18,015] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/skep_ernie_2.0_large_en/skep_ernie_2.0_large_en.pdparams
[2021-07-26 00:43:28,446] [    INFO] - Found /home/aistudio/.paddlenlp/models/skep_ernie_2.0_large_en/skep_ernie_2.0_large_en.vocab.txt
4, NLP data processing
1. Add log
# VisualDL introduction
from visualdl import LogWriter

writer = LogWriter("./log")
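Note that the training loop in section 5 never actually writes to this LogWriter. If you want loss and accuracy curves in VisualDL, calls like the following (illustrative only; the tag names are arbitrary) can be added inside the loop, where global_step, loss and acc are defined:

# Illustrative, to be placed inside the training loop of section 5:
# writer.add_scalar(tag="train/loss", step=global_step, value=float(loss))
# writer.add_scalar(tag="train/acc", step=global_step, value=acc)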
2.SkepTokenizer data processing
The SKEP model processes text at token (word) granularity. We can use the SkepTokenizer built into PaddleNLP to complete this processing in one step.
import os
from functools import partial

import numpy as np
import paddle
import paddle.nn.functional as F
from paddlenlp.data import Stack, Tuple, Pad

from utils import create_dataloader

def convert_example(example, tokenizer, max_seq_length=512, is_test=False):
    # Process the raw data into a format the model can read in.
    # encoded_inputs is a dict containing input_ids, token_type_ids and other fields
    encoded_inputs = tokenizer(text=example["text"], max_seq_len=max_seq_length)

    # input_ids: ids of the tokens in the vocabulary after the text is split into tokens
    input_ids = encoded_inputs["input_ids"]
    # token_type_ids: whether the current token belongs to sentence 1 or sentence 2,
    # i.e. the segment ids
    token_type_ids = encoded_inputs["token_type_ids"]

    if not is_test:
        # label: the class label (real or fake)
        label = np.array([example["label"]], dtype="int64")
        return input_ids, token_type_ids, label
    else:
        # qid: the id of each sample
        qid = np.array([example["qid"]], dtype="int64")
        return input_ids, token_type_ids, qid
# Batch data size
batch_size = 10
# Maximum length of text sequence
max_seq_length = 512

# Process the data into a data format that the model can read in
trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length)

# Form data into batch data, such as
# padding text sequences of different lengths to the maximum length of the batch
# and stacking the labels together
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),       # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
    Stack()                                            # labels
): [data for data in fn(samples)]

train_data_loader = create_dataloader(
    train_ds,
    mode='train',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
dev_data_loader = create_dataloader(
    dev_ds,
    mode='dev',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
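`create_dataloader` is imported from the project's `utils.py`, which is not listed in this article. A minimal sketch of what it roughly does, assuming it follows the common PaddleNLP example code, is:

import paddle

def create_dataloader(dataset, mode='train', batch_size=1, batchify_fn=None, trans_fn=None):
    # Apply the per-example transform (tokenization) to every sample
    if trans_fn:
        dataset = dataset.map(trans_fn)

    # Shuffle only during training
    shuffle = mode == 'train'
    if mode == 'train':
        batch_sampler = paddle.io.DistributedBatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)
    else:
        batch_sampler = paddle.io.BatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)

    # collate_fn pads and stacks a list of samples into batch tensors
    return paddle.io.DataLoader(
        dataset=dataset,
        batch_sampler=batch_sampler,
        collate_fn=batchify_fn,
        return_list=True)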
5, Model training and evaluation
1. Start training
import time

from utils import evaluate

# Number of training epochs
epochs = 3
# Folder in which to save model parameters during training
ckpt_dir = "skep_ckpt"
# len(train_data_loader) is the number of steps needed for one epoch of training
num_training_steps = len(train_data_loader) * epochs

# AdamW optimizer
optimizer = paddle.optimizer.AdamW(
    learning_rate=2e-5,
    parameters=model.parameters())
# Cross entropy loss function
criterion = paddle.nn.loss.CrossEntropyLoss()
# Accuracy evaluation metric
metric = paddle.metric.Accuracy()
# Start training
global_step = 0
tic_train = time.time()
for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        input_ids, token_type_ids, labels = batch
        # Feed the data to the model
        logits = model(input_ids, token_type_ids)
        # Compute the loss
        loss = criterion(logits, labels)
        # Predicted class probabilities
        probs = F.softmax(logits, axis=1)
        # Compute accuracy
        correct = metric.compute(probs, labels)
        metric.update(correct)
        acc = metric.accumulate()

        global_step += 1
        if global_step % 10 == 0:
            print(
                "global step %d, epoch: %d, batch: %d, loss: %.5f, accu: %.5f, speed: %.2f step/s"
                % (global_step, epoch, step, loss, acc,
                   10 / (time.time() - tic_train)))
            tic_train = time.time()

        # Back-propagate gradients and update parameters
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()

        if global_step % 100 == 0:
            save_dir = os.path.join(ckpt_dir, "model_%d" % global_step)
            if not os.path.exists(save_dir):
                os.makedirs(save_dir)
            # Evaluate the current model
            evaluate(model, criterion, metric, dev_data_loader)
            # Save the current model parameters, etc.
            model.save_pretrained(save_dir)
            # Save the tokenizer vocabulary, etc.
            tokenizer.save_pretrained(save_dir)
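The `evaluate` helper called every 100 steps also comes from `utils.py`, which is not listed here. A minimal sketch of what it typically looks like in the PaddleNLP SKEP examples (an assumption, since the file is not shown) is:

import numpy as np
import paddle

@paddle.no_grad()
def evaluate(model, criterion, metric, data_loader):
    # Switch to eval mode and reset the accumulated metric
    model.eval()
    metric.reset()
    losses = []
    for batch in data_loader:
        input_ids, token_type_ids, labels = batch
        logits = model(input_ids, token_type_ids)
        losses.append(criterion(logits, labels).numpy())
        correct = metric.compute(logits, labels)
        metric.update(correct)
    print("eval loss: %.5f, accu: %.5f" % (np.mean(losses), metric.accumulate()))
    # Switch back to training mode for the rest of the loop
    model.train()
    metric.reset()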
2. Training log
global step 110, epoch: 1, batch: 110, loss: 0.00653, accu: 0.98000, speed: 0.05 step/s
global step 120, epoch: 1, batch: 120, loss: 0.00180, accu: 0.99000, speed: 0.95 step/s
global step 130, epoch: 1, batch: 130, loss: 0.00236, accu: 0.99000, speed: 0.94 step/s
global step 140, epoch: 1, batch: 140, loss: 0.00210, accu: 0.99250, speed: 0.94 step/s
global step 150, epoch: 1, batch: 150, loss: 0.00216, accu: 0.99400, speed: 0.95 step/s
global step 160, epoch: 1, batch: 160, loss: 0.00651, accu: 0.99500, speed: 0.95 step/s
global step 170, epoch: 1, batch: 170, loss: 0.00105, accu: 0.99571, speed: 0.95 step/s
global step 180, epoch: 1, batch: 180, loss: 0.00092, accu: 0.99625, speed: 0.94 step/s
global step 190, epoch: 1, batch: 190, loss: 0.00065, accu: 0.99667, speed: 0.94 step/s
global step 200, epoch: 1, batch: 200, loss: 0.00058, accu: 0.99700, speed: 0.95 step/s
eval loss: 0.00571, accu: 0.99866
6, Summary
1. Data processing
- Merge the two datasets
- Generate the new label column
- Split the dataset into train and dev sets
2.PaddleNLP custom reader method
In the past, most custom readers opened the files directly and yielded samples from them. Recently I have been loading the data with pandas first and yielding rows from the DataFrame, which is more convenient and faster; a sketch of the older file-based style is shown below for contrast.
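For contrast, the traditional file-based reader looks roughly like this (a sketch only; the file name and the tab-separated format are hypothetical):

from paddlenlp.datasets import load_dataset

def read_from_file(data_path):
    # Read a tab-separated file directly, one sample per line: text \t label
    with open(data_path, 'r', encoding='utf-8') as f:
        for line in f:
            text, label = line.rstrip('\n').split('\t')
            yield {'text': text, 'label': int(label)}

# file_ds = load_dataset(read_from_file, data_path='train.tsv', lazy=False)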
3.SKEP model application
Classification tasks use the model just like this. max_seq_length is the maximum number of tokens and cannot exceed 512. If the text is longer, various tricks can be used, such as keeping only the head, only the tail, or the head plus the tail; in every case some content is lost. A sketch of these tricks follows below.
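A minimal sketch of those truncation tricks (illustrative only; in this project the tokenizer simply cuts the text to max_seq_len):

def truncate_tokens(tokens, max_len=512, mode='head'):
    # Keep the first max_len tokens, the last max_len tokens,
    # or a mix of head and tail - a common trick for long documents
    if len(tokens) <= max_len:
        return tokens
    if mode == 'head':
        return tokens[:max_len]
    if mode == 'tail':
        return tokens[-max_len:]
    if mode == 'head_tail':
        head = max_len // 2
        return tokens[:head] + tokens[-(max_len - head):]
    raise ValueError("unknown truncation mode: %s" % mode)

# Example: keep the first 256 and last 256 tokens of a long article
# tokens = tokenizer.tokenize(long_text)
# tokens = truncate_tokens(tokens, max_len=512, mode='head_tail')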