Text classification based on MLP

Recently, I studied the MLP, CNN, and RNN network models under the PyTorch framework and ran text classification experiments on product review data obtained from GitHub. This post describes how to build an MLP under the PyTorch framework to classify that data. The data set looks roughly as follows:

1. Import modules

import pandas as pd
import numpy as np
import jieba
import keras
import re
import spacy
from keras.preprocessing.text import Tokenizer
import gensim
from gensim import models
from torch.nn.utils.rnn import pad_sequence
import torch

2. Import data

data = pd.read_csv('./data.tsv',sep='\t',index_col=0).astype(str)
data.head()

3. Text data preprocessing

3.1 data cleaning

First of all, the data set needs routine cleaning. On a real e-commerce platform, users sometimes leave no comment after a purchase, and merchants post automatic five-star reviews for their own products in order to "brush orders", so missing values and duplicate values have to be checked for and the corresponding rows deleted, to reduce their unwanted influence on the analysis. At the same time, because very short comments have little research value, comments whose text is 5 characters or fewer are filtered out. The whole cleaning process is packaged into a clean function as follows:

def clean(data):
    #Remove missing values
    Nan = data.dropna()
    print("After removing missing values:",Nan.shape)
    #Remove duplicate values
    dup = Nan.drop_duplicates()
    print("After removing duplicate values:",dup.shape)
    #Keep only comments whose text is longer than 5 characters
    clean_data = dup[dup['text'].str.len() > 5]
    print("After filtering out comments of 5 characters or fewer:",clean_data.shape)
    clean_data.reset_index(drop=True,inplace=True)   #drop=True keeps the old index out of the columns

    return clean_data


After running it, we find that there are no missing values in the data, but there are 10 duplicate values. After deleting comments of 5 characters or fewer, 9702 comments remain.
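The post never shows how the cleaned text column and the labels are pulled out before the next steps, so here is a minimal sketch, assuming the review text sits in a 'text' column and the class label in a 'label' column (both names inferred from the surrounding code):

#Hypothetical glue code: run the cleaning step and extract the pieces used later
clean_data = clean(data)
newdata = clean_data['text'].tolist()   #list of review strings, processed in section 3.2
label = clean_data['label']             #label column, merged back in section 3.3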

3.2 word segmentation and removal of stop words

First, we need to exclude the text that is not Chinese characters, such as numbers and letters, and use the re module to define find_ The Chinese function is as follows:

#Exclude numbers, letters and other data that are not Chinese characters
def find_chinese(file):
    pattern = re.compile(r'[^\u4e00-\u9fa5]')
    chinese = re.sub(pattern, '', file)
    return chinese

for number in range(len(newdata)):
    newdata[number]=find_chinese(newdata[number])
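As a quick sanity check, the regex keeps only characters in the Chinese range \u4e00-\u9fa5, so digits, letters, and punctuation are all stripped (made-up input):

print(find_chinese('2021-10 很好aaa!'))   #prints '很好'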

Then we use the jieba module to segment the text and store the results in a list, read in an xlsx file containing Chinese stop-word data, and use an if statement to drop the stop words (much faster than spacy...). These two steps can be packaged into the following two functions (a usage sketch follows them):

#Remove stop words
def del_stop(texts):
    stop_words=pd.read_excel("stop_words.xlsx",header=None)[0].tolist()
    texts = list(map(find_chinese,texts))

    texts_temp = []

    for sentence in texts:
        doc = ''
        sentence_list = jieba.lcut(sentence)
        for word in sentence_list:
            if word not in stop_words:
                doc += word   #concatenate the kept words back into one string
        texts_temp.append(doc)

    return texts_temp

#Word segmentation
def split_words(texts):
    texts_temp=[]

    for sentence in texts:

        sentence_list=jieba.lcut(sentence)
        texts_temp.append(" ".join(sentence_list))

    return texts_temp
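The post does not show how the two helpers are invoked; since section 3.3 splits every entry on spaces, they are presumably chained like this (a sketch, with the variable name result assumed from the later code):

#Hypothetical pipeline: drop stop words first, then segment into space-separated tokens
result = split_words(del_stop(newdata))
print(result[:10])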

Print out the top ten items of the result list, as shown in the following figure:

3.3 text vectorization

Because the ultimate goal of the experiment is to classify the text, the text has to be converted into a mathematical representation, a vector; this is word embedding. This post uses the Doc2Vec method to encode each comment as a vector, training with the Doc2Vec class from the gensim module. To match the input format this class expects, the data processed above needs one more conversion:

words = []

for i in result:
    a = i.split(' ')
    words.append(a)
    
print(words[:10])

It looks like this:

Then you need to tag the data with the TaggedDocument class:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
train_docs = []
for i, text in enumerate(words):
    document = TaggedDocument(text, tags=[i])
    train_docs.append(document)

After tagging, the data looks like the following figure:

Once tagged, the documents can be thrown into the model for training ~ this post sets the vector dimension vector_size to 50, the number of worker threads to 4, and the number of training iterations epochs to 5:

model_dm = Doc2Vec(train_docs, dm=1, min_count=1, window=3, vector_size=50, sample=1e-3, negative=5, workers=4)
model_dm.train(train_docs, total_examples=model_dm.corpus_count, epochs=5)
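Once trained, the model can also embed unseen comments via gensim's infer_vector, which expects a list of tokens (the tokens below are made up for illustration):

#Illustrative only: project a new, already-segmented comment into the same 50-d space
new_vec = model_dm.infer_vector(['物流', '很快', '质量', '不错'])
print(new_vec.shape)   #(50,)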

Because training only produces a vector for each comment, the original label values need to be attached before the data is passed into the MLP model for training:

doc_vec = []

for i in range(len(train_docs)):
    doc_vec.append(model_dm.dv[i].tolist())

data = pd.DataFrame(doc_vec)
newdata = pd.concat([data,label],axis=1)
newdata.head(3)

newdata.to_csv('doc2vec.txt',sep = '\t',header = False)

After vectorization, the result looks as follows:

The merged data shape is 9702 rows by 51 columns (the last column is the label), and the result is saved to a local txt file.
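If a later session resumes from the saved file rather than the in-memory DataFrame, it can be read back along these lines (a sketch; the file was written with header=False and the default index, and the iloc-based indexing in section 4 works without column names):

#Hypothetical reload of the saved vectors and labels
newdata = pd.read_csv('doc2vec.txt', sep='\t', header=None, index_col=0)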

4. Build and train models

4.1 building MLP network

This part draws on code from textbooks and my teachers. Specifically, the MLP has two hidden layers plus one classification layer, and defines the forward propagation path (the finer details will not be explained here):

import torch
import torch.nn as nn
import matplotlib.pyplot as plt
from torchviz import make_dot
class MLPclassifica(nn.Module):
    def __init__(self):
        super(MLPclassifica,self).__init__()   ###Inherit the parent class's __init__; see https://blog.csdn.net/a__int__/article/details/104600972 for an explanation of parent-class inheritance
        #####Define the first hidden layer
        self.hidden1 = nn.Sequential(
            nn.Linear(
              in_features = 50,   ####Number of input features
              out_features = 30,  ####Number of output features
              bias = True,  ###bias
          ),
            nn.ReLU()
        )
        #####Define the second hidden layer
        self.hidden2 = nn.Sequential(
            nn.Linear(30,10),
            nn.ReLU()
        )
        ######Classification layer
        self.classifica = nn.Sequential(
            nn.Linear(10,2),
            nn.Sigmoid()
        )
    #####Define forward propagation path
    def forward(self, x):
        fc1 = self.hidden1(x)
        fc2 = self.hidden2(fc1)
        output = self.classifica(fc2)
        ###Return both hidden layers' activations along with the classification output
        return fc1,fc2,output
mlpc = MLPclassifica()
##visualization
x = torch.randn(1,50,requires_grad = True)
y = mlpc(x)
Mymlpcvis = make_dot(y, params=dict(list(mlpc.named_parameters())+ [('x',x)]))
Mymlpcvis

The network can be visualized as follows (seemingly incomplete):
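Independently of the torchviz graph, a quick forward pass with random input confirms the layer sizes (illustrative):

#A batch of 4 random 50-d document vectors through the network
x = torch.randn(4, 50)
fc1, fc2, out = mlpc(x)
print(fc1.shape, fc2.shape, out.shape)   #torch.Size([4, 30]) torch.Size([4, 10]) torch.Size([4, 2])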

4.2 training model

First, import the relevant modules and split the data into a training set and a test set:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler,MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report
from sklearn.manifold import TSNE


X_train, X_test, y_train, y_test = train_test_split(np.array(newdata.iloc[:,:50]),
                                                    np.array(newdata.iloc[:,50]),test_size=0.25, random_state=123)
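With 9702 samples and a 25% hold-out, the split should yield roughly 7276 training and 2426 test rows, which is easy to verify:

print(X_train.shape, X_test.shape)   #expected: (7276, 50) (2426, 50)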

Then the training data and test data are each converted into tensors. After conversion, X and y are combined into a dataset, and a data loader is defined to process the data in batches:

import torch.utils.data as Data

X_train_t = torch.from_numpy(X_train.astype(np.float32))
y_train_t = torch.from_numpy(y_train.astype(np.int64))
X_test_t = torch.from_numpy(X_test.astype(np.float32))
y_test_t = torch.from_numpy(y_test.astype(np.int64))

train_data = Data.TensorDataset(X_train_t,y_train_t)

train_loader = Data.DataLoader(
    dataset = train_data ,
    batch_size = 64 ,
    shuffle = True, ###Shuffle the data before each epoch
    num_workers = 3,   ##Number of subprocesses for data loading; 0 means load in the main process
)
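Pulling a single batch from the loader is a cheap way to confirm the wiring before training (illustrative):

b_x, b_y = next(iter(train_loader))
print(b_x.shape, b_y.shape)   #torch.Size([64, 50]) torch.Size([64])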

Next, you need to define an optimizer and visualize the results using a Canvas. Finally, start training the model, with the number of training epochs set to 15:

###Load the data and train the whole network
from torch.optim import SGD,Adam
import seaborn as sns
import hiddenlayer as hl

###Define optimizer
optimizer = torch.optim.Adam(mlpc.parameters(),lr=0.01)
loss_func = nn.CrossEntropyLoss()    ###Cross-entropy loss (two classes here)

####Record the indicators of the training process
history1 = hl.History()

####Visualization using Canvas
canvas1 = hl.Canvas()
print_step = 25

###Train the model iteratively, making one pass over all the data per epoch
for epoch in range(15):
    ##Iterative calculation of the loader of training data
    for step, (b_x , b_y) in enumerate(train_loader):
        ##Calculate the loss per batch
        _, _, output = mlpc(b_x)   ###MLP output on training batch
        train_loss = loss_func(output , b_y)   ####Cross-entropy loss on this batch
        optimizer.zero_grad()     #####The gradient of each iteration is initialized to 0
        train_loss.backward()     #####The loss is propagated backward and the gradient is calculated
        optimizer.step()         #####Optimization using gradients
        niter = epoch*len(train_loader) + step +1
        if niter % print_step == 0:
            _, _, output = mlpc(X_test_t)
            _, pre_lab = torch.max(output, 1)
            test_accuracy = accuracy_score(y_test_t, pre_lab)
            ## Add epoch, loss, and precision to history
            history1.log(niter, train_loss=train_loss, test_accuracy=test_accuracy)
            ###Two images are used to visualize the loss function and accuracy
            with canvas1:
                canvas1.draw_plot(history1['train_loss'])
                canvas1.draw_plot(history1['test_accuracy'])

The results are as follows:

It can be seen that the accuracy of the test set is about 0.78.
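The confusion_matrix and classification_report imported in section 4.2 go unused above; a short sketch of a fuller evaluation on the test set:

#Per-class precision/recall and the confusion matrix on the held-out data
with torch.no_grad():
    _, _, output = mlpc(X_test_t)
pre_lab = torch.argmax(output, 1)
print(classification_report(y_test_t.numpy(), pre_lab.numpy()))
print(confusion_matrix(y_test_t.numpy(), pre_lab.numpy()))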

5. Improvement direction

1. This post removes stop words with a fixed stop-word file and does not account for the influence of topic-specific words when the texts cover different topics
2. To improve the accuracy of the model, one could try increasing the vector dimension, the number of training epochs, and so on

Keywords: Python PyTorch Deep Learning NLP
