[news text classification] (task3) text representation (fastText)

Learning summary

(1) Learn the principle and use of FastText, and split the data set with 10-fold cross validation.
(2) Note the value returned by predict on the model trained with fasttext.train_supervised. Because we want the label with the largest probability, the code ends up with the lump model.predict(x)[0][0].split('__')[-1]: take the first (highest-probability) returned label, then split away the pieces separated by the '__' that comes from the label prefix; the last piece is the label we want, e.g. '__label__baking' becomes 'baking'.

1, Defects of existing text representations

One hot, Bag of Words, N-gram, TF-IDF and similar methods share some problems: the converted vectors have a very high dimension and need a long training time, and they only collect statistics without considering the relationships between words.

Deep learning can instead be used for text representation, mapping text into a low-dimensional space, as in FastText, Word2Vec and BERT.

2, FastText algorithm

2.1 Algorithm introduction

FastText is a three-layer neural network: input layer, hidden layer and output layer. Words are mapped into a dense space by the embedding layer, the embeddings of all words in the sentence are averaged, and the averaged vector is used to perform the classification.
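
Below is a minimal sketch of that architecture, not the real fastText implementation: the vocabulary size, embedding dimension, class count and random weights are made up purely for illustration.

import numpy as np

vocab_size, embed_dim, num_classes = 10000, 100, 14
embedding = np.random.randn(vocab_size, embed_dim) * 0.01   # input -> hidden (embedding layer)
W_out = np.random.randn(embed_dim, num_classes) * 0.01      # hidden -> output (classifier)

def forward(word_ids):
    """word_ids: list of word indices for one document."""
    doc_vec = embedding[word_ids].mean(axis=0)        # average the word embeddings
    logits = doc_vec @ W_out                          # linear output layer
    return np.exp(logits) / np.exp(logits).sum()      # softmax over the classes

print(forward([3, 42, 7]).shape)   # (14,) class probabilities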

Specific papers: Bag of Tricks for Efficient Text Classification, https://arxiv.org/abs/1607.01759

FastText is superior to TF-IDF in text classification tasks:

  • FastText classifies documents with the document vector obtained by superimposing (averaging) the word Embeddings, so similar sentences end up in the same category
  • The embedding space learned by FastText has a relatively low dimension, so training is fast

The first step is to install the fasttext package. If pip install fasttext fails in Anaconda's prompt, you can go to the package download site, find the .whl file matching your Python interpreter version and download it. Many blogs then say you can install the file from cmd with pip install, but that raised an error when I tried, so I went back to the Anaconda prompt and installed it there successfully.

One more detail to note: because the code imports fasttext at the beginning, your own file must not be named fasttext.py, otherwise it conflicts with the package and raises the following error:

AttributeError: partially initialized module 'fasttext' has no attribute 'train_supervised' (most likely due to a circular import)

2.2 Why fastText is fast

1) Multithreaded training: fastText trains with multiple threads, and the training threads do not lock when updating the shared parameters. This adds some noise to the parameter updates but does not affect the final result; neither Google's word2vec implementation nor the fastText library uses locks. The default number of threads is 12 and can be set manually.

2) Hierarchical softmax: when computing the softmax, fastText uses hierarchical softmax, which greatly improves efficiency.

3) Hierarchical softmax actually relies on a Huffman tree structure: each leaf node of the tree is a word, and the softmax result is simply a probability. To obtain the probability of a word, we multiply the probabilities along the path from the root to that word's leaf.
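
Here is a toy illustration of that idea, not fastText's actual code: the node vectors, path and directions are invented, the point is only that the probability of a word is a product of binary decisions along its Huffman path (about log2(|V|) terms instead of a |V|-way softmax).

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_probability(hidden, path_node_vectors, path_directions):
    """P(word) = product, over the internal nodes on the word's path, of the
    probability of branching towards the word at that node.
    hidden:            averaged document/context vector
    path_node_vectors: one parameter vector per internal node on the path
    path_directions:   +1 or -1 depending on which child leads to the word
    """
    prob = 1.0
    for node_vec, direction in zip(path_node_vectors, path_directions):
        prob *= sigmoid(direction * np.dot(hidden, node_vec))
    return prob

hidden = np.random.randn(100)
path = [np.random.randn(100) for _ in range(3)]     # a path of depth 3
print(hs_probability(hidden, path, [+1, -1, +1]))   # a single probability value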

2.3 Interface parameters

First, look at the parameters of fasttext.train_supervised, which will be used later:

input_file                 Training file path (required)
output                     Output file path (required)
label_prefix               Label prefix, default __label__
lr                         Learning rate, default 0.1
lr_update_rate             Learning rate update rate, default 100
dim                        Word vector dimension, default 100
ws                         Context window size, default 5
epoch                      Number of epochs, default 5
min_count                  Minimum word frequency, default 5
word_ngrams                Word n-gram setting, default 1
loss                       Loss function {ns, hs, softmax}, default softmax
minn                       Minimum character n-gram length, default 0
maxn                       Maximum character n-gram length, default 0
thread                     Number of threads, default 12
t                          Sampling threshold, default 0.0001
silent                     Disable C++ extension log output, default 1
encoding                   Encoding of input_file, default utf-8
pretrained_vectors         Path to an existing .vec word vector file to use, default None

Before tuning parameters, note the required format of the training samples:
each text + "\t" + label_prefix + label
That is, the label comes after each text sample, prefixed with label_prefix (the column label_ft is used for this below).
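
A minimal sketch of writing such a file; the file name toy_train.csv and the sample texts/labels here are made up:

# one sample per line: "text<TAB>__label__<label>"
samples = [("some news text tokens", "2"), ("another document", "11")]
with open("toy_train.csv", "w", encoding="utf-8") as f:
    for text, label in samples:
        f.write(text + "\t" + "__label__" + label + "\n")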

3, fastText quick start

Learn from Facebook's official fastText documentation here; the Chinese documentation has somewhat less content than the English one, so the English documentation is recommended.
Chinese documents: http://fasttext.apachecn.org/#/doc/zh/supervised-tutorial

3.1 n-gram and n-char

(1) n-gram
Example: "who am I?", with the n-gram length set to 2.
The word n-gram features are the bigrams "who am" and "am I" (in addition to the single words themselves).
(2)n-char

Example: "where", n=3, with start/end markers < and >.
The n-char (character n-gram) features are: <wh, whe, her, ere, re>

Therefore, for Chinese the words themselves may not need to be split into subwords, so the n-char length can be 0; but adjacent words are related in meaning, so the word n-gram length can be set according to the situation.
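
A minimal sketch of extracting both kinds of features; the helper functions word_ngrams and char_ngrams are just illustrative names, not part of the fastText API:

def word_ngrams(sentence, n=2):
    # word n-grams, e.g. the bigrams of "who am I"
    tokens = sentence.split()
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(word, n=3):
    # character n-grams of one word padded with the < and > markers
    padded = '<' + word + '>'
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(word_ngrams('who am I', 2))   # ['who am', 'am I']
print(char_ngrams('where', 3))      # ['<wh', 'whe', 'her', 'ere', 're>']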

3.2 Multi-label classification

There is an official example to learn from.
First, import the package and the data and train the model.
The idea for handling multiple labels here is to train an independent binary classifier for each label, which is done by passing the one-vs-all loss (loss='ova') to fasttext.train_supervised:

>>> import fasttext
>>> model = fasttext.train_supervised(input="cooking.train", lr=0.5, epoch=25, wordNgrams=2, bucket=200000, dim=50, loss='ova')
Read 0M words
Number of words:  14543
Number of labels: 735
Progress: 100.0% words/sec/thread:   72104 lr:  0.000000 loss:  4.340807 ETA:   0h 0m

If you want to predict as many labels as possible, set the k parameter to -1 (return all labels) and set threshold to 0.5 so that only labels with probability greater than 0.5 are kept:

>>> model.predict("Which baking dish is best to bake a banana bread ?", k=-1, threshold=0.5)
((u'__label__baking', u'__label__bananas', u'__label__bread'), array([1.00000, 0.939923, 0.592677]))

>>> model.test("cooking.valid", k=-1)
(3000L, 0.702, 0.2)
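
As far as I know, model.test returns a tuple of (number of samples, precision at k, recall at k), so the 3000 / 0.702 / 0.2 above can be unpacked like this (a small sketch, not from the original example):

n_samples, precision_at_k, recall_at_k = model.test("cooking.valid", k=-1)
print(n_samples, precision_at_k, recall_at_k)   # e.g. 3000 0.702 0.2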

Note the return value of predict here. Because we want the label with the largest probability, the code ends up as model.predict(x)[0][0].split('__')[-1]: take the first returned label, then split away the pieces separated by the '__' introduced by the label prefix; the last piece is the desired label, e.g. '__label__baking' becomes 'baking'.
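
A short sketch of that parsing, reusing the cooking example above:

labels, probs = model.predict("Which baking dish is best to bake a banana bread ?")
top = labels[0]                 # '__label__baking'
plain = top.split('__')[-1]     # 'baking'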

4, Text classification example

See the notes for details.

# -*- coding: utf-8 -*-
"""
Created on Fri Nov  5 09:04:43 2021

@author: 86493
"""
import pandas as pd
from sklearn.metrics import f1_score 
import fasttext

# Convert to the format required by FastText
train_df = pd.read_csv('train_set.csv',
                       sep = '\t',
                       nrows =15000)
# astype casts the label column to string before concatenating the prefix
train_df['label_ft'] = '__label__' + train_df['label'].astype(str)
# all but the last 5000 samples are written out as the training file
train_df[['text', 'label_ft']].iloc[:-5000].to_csv('train.csv',
                                                   index = None,
                                                   header = None,
                                                   sep = '\t')
# wordNgrams=2 here does not mean the same thing as ngram_range in the previous task
model = fasttext.train_supervised('train.csv',
                                  lr = 1.0,        # Learning rate
                                  wordNgrams = 2,  # Word n-gram length
                                  verbose = 2,     # Verbosity of the training log
                                  minCount = 1,    # Filter words appearing fewer than minCount times
                                  epoch = 25,      # Number of epochs
                                  loss = 'hs')     # Hierarchical softmax (the default loss is softmax)
# The last 5000 samples are used as the validation set; collect the predicted labels in a list
val_pred = [model.predict(x)[0][0].split('__')[-1] for x in train_df.iloc[-5000:]['text']]
print(f1_score(train_df['label'].values[-5000:].astype(str), # same as values[10000:] since nrows=15000
               val_pred,
               average = 'macro'))

During training fastText prints its progress; the final macro F1 score is 0.8229238895393863:

Read 9M words
Number of words:  5341
Number of labels: 14
Progress:   0.1% words/sec/thread:   29827 lr:  0.999370 avg.loss:  2.461391 ETA:   0h18m 9s
Progress:   0.2% words/sec/thread:   34835 lr:  0.998136 avg.loss:  2.420802 ETA:   0h15m34s
Progress:   0.3% words/sec/thread:   32120 lr:  0.997353 avg.loss:  2.450788 ETA:   0h16m49s
Progress:   0.6% words/sec/thread:   56698 lr:  0.993602 avg.loss:  2.382696 ETA:   0h 9m30s
Progress:   1.1% words/sec/thread:   78496 lr:  0.988754 avg.loss:  2.180967 ETA:   0h 6m49s
Progress:   1.4% words/sec/thread:   79590 lr:  0.986145 avg.loss:  2.087085 ETA:   0h 6m43s
Progress:   2.7% words/sec/thread:  130232 lr:  0.973349 avg.loss:  1.691365 ETA:   0h 4m 3s
. . . . . . . 
Progress:  97.4% words/sec/thread:  601597 lr:  0.025779 avg.loss:  0.150194 ETA:   0h 0m 1s
Progress:  99.8% words/sec/thread:  603704 lr:  0.004354 avg.loss:  0.147263 ETA:   0h 0m 0s
Progress: 100.0% words/sec/thread:  603532 lr:  0.000000 avg.loss:  0.146719 ETA:   0h 0m 0s
0.8229238895393863

5, Using validation sets to tune parameters


Tuning the parameters:
(1) By reading the documentation, figure out the general meaning of each parameter and which ones increase the complexity of the model.
(2) By checking the model's accuracy on the validation set, determine whether the model is overfitting or underfitting.

With 10-fold cross validation, each fold uses 9/10 of the data for training and the remaining 1/10 as the validation set to evaluate the model. Note that every fold must be split so that its label distribution stays consistent with the distribution of the whole dataset.

# Group the sample indices by label so the folds can be stratified
all_labels = train_df['label'].values
total = len(all_labels)

label2id = {}
for i in range(total):
    label = str(all_labels[i])
    if label not in label2id:
        label2id[label] = [i]
    else:
        label2id[label].append(i)

After splitting into 10 folds, we pick the last split to run the remaining experiments, i.e. the fold with index 9 is the validation set and the folds with indices 0-8 form the training set; the hyperparameters are then tuned based on the validation results to make the model perform better.
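
A minimal sketch of building the stratified folds from label2id and selecting fold 9 as the validation set; the round-robin assignment used here is one simple way (an assumption, not the only scheme) to keep each fold's label distribution close to the full dataset:

# distribute each label's indices round-robin over the 10 folds
num_folds = 10
folds = [[] for _ in range(num_folds)]
for label, indices in label2id.items():
    for j, idx in enumerate(indices):
        folds[j % num_folds].append(idx)

val_idx = folds[9]                                      # fold 9 -> validation set
train_idx = [i for k in range(9) for i in folds[k]]     # folds 0-8 -> training set
train_part = train_df.iloc[train_idx]
val_part = train_df.iloc[val_idx]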

Reference

(1) Ali Tianchi platform
(2) fastText training and use
(3) fastText Chinese documentation: http://fasttext.apachecn.org/#/
(4) https://github.com/apachecn/fasttext-doc-zh/
(5) fastText official website: https://fasttext.cc/
