[Natural Language Processing] Text Information Extractor: CNN

Main content of this article

  • Briefly introduces how a convolutional neural network (CNN) processes text information
  • Uses a CNN for text classification, with commented code
  • Article code: [https://github.com/540117253/Chinese-Text-Classification]

1, CNN overview

Figure 1 CNN text encoder

The structure of the CNN text encoder is shown in Figure 1. In the first layer, the word-mapping function $f: M \rightarrow \mathbb{R}^{d}$ maps each word of the comment to a $d$-dimensional vector, converting the given comment text into a word-embedding matrix of fixed length $T$ (only the first $T$ words of the text are kept; if the text is shorter than $T$, it is padded).

After the word-mapping layer comes the convolution layer, which contains $m$ neurons. The convolution kernel $K \in \mathbb{R}^{t \times d}$ associated with each neuron is convolved over the word vectors to extract features. Let $V_{1:T}$ be the word-embedding matrix of a text of length $T$. The feature extracted by the $j$-th neuron is:

$$z_j = \mathrm{ReLU}(V_{1:T} * K_j + b_j)$$

where $b_j$ is the bias term, $*$ is the convolution operation, and $\mathrm{ReLU}$ is the nonlinear activation function.

Finally, as the window of size $t$ slides over the text, the $j$-th neuron produces the features $z_j^1, z_j^2, \dots, z_j^{T-t+1}$. Max pooling is then applied to capture the most important feature, i.e. the one with the maximum value:

$$o_j = \max(z_j^1, z_j^2, \dots, z_j^{T-t+1})$$

The final output of the convolution layer is the concatenation of the $m$ neuron outputs:

$$O = [o_1, o_2, \dots, o_m]$$
Generally, $O$ is then fed into a fully connected layer with weight matrix $W \in \mathbb{R}^{n \times m}$ and bias term $g \in \mathbb{R}^{n}$:

$$X = \mathrm{ReLU}(WO + g)$$
Overall, the convolution kernel size in a text CNN is usually 3 or 5 (i.e. a single convolution operation only covers 3 or 5 words). One kernel scans the whole text through a sliding window, so every window position shares the same kernel parameters, which saves a lot of memory. However, each convolution operation only sees the words inside its window: the longer the input text, the more of the earlier information is lost by the time the window slides to the end. For this reason, RNNs are generally preferred to CNNs for extracting information from text.
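To make the formulas above concrete, here is a minimal NumPy sketch of the encoder; the sizes and random values are purely illustrative and are not taken from the article's code:

import numpy as np

# Illustrative sizes: text length T, embedding dim d, window size t, m kernels, n FC units
T, d, t, m, n = 10, 8, 3, 4, 5
rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

V = rng.normal(size=(T, d))              # word-embedding matrix V_{1:T}
K = rng.normal(size=(m, t, d))           # m convolution kernels of shape t x d
b = np.zeros(m)                          # bias terms b_j

O = np.empty(m)
for j in range(m):
    # z_j^i = ReLU(V_{i:i+t} * K_j + b_j) for every window position i
    z_j = np.array([relu(np.sum(V[i:i + t] * K[j]) + b[j]) for i in range(T - t + 1)])
    O[j] = z_j.max()                     # max pooling: o_j

# Fully connected layer: X = ReLU(W O + g)
W = rng.normal(size=(n, m))
g = np.zeros(n)
X = relu(W @ O + g)
print(X.shape)                           # (n,)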

2, CNN text classification example

2.1 data set introduction

1. Download address:

  [https://github.com/skdjfla/toutiao-text-classfication-dataset]

2. Format:

6552431613437805063_!_102_!_news_entertainment_!_Xie Na clarifies the Internet rumors for Li Haofei, and her two follow-up actions win her extra points_!_Tong Liya, Internet rumors, Happy Camp, Li Haofei, Xie Na, audience

Each line is divided by the delimiter _!_ into the following fields: news ID, category code (see below), category name (see below), news text (title only), and news keywords.
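For example, a line can be parsed by splitting on the _!_ delimiter (a minimal sketch; the variable names are just illustrative):

line = "6552431613437805063_!_102_!_news_entertainment_!_<title>_!_<keywords>"
news_id, code, name, title, keywords = line.strip().split("_!_")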

Classification code and name:

100  people's livelihood stories  news_story
101  culture                      news_culture
102  entertainment                news_entertainment
103  sports                       news_sports
104  finance                      news_finance
106  real estate                  news_house
107  automobile                   news_car
108  education                    news_edu
109  technology                   news_tech
110  military                     news_military
112  travel                       news_travel
113  international                news_world
114  stocks                       stock
115  agriculture                  news_agriculture
116  e-sports                     news_game

2.2 pre training word vector

The pre-trained word vectors are the Baidu Encyclopedia (Baidu Baike) vectors from the Chinese-Word-Vectors project (ACL 2018).

Download address: [https://github.com/Embedding/Chinese-Word-Vectors]
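A minimal sketch for loading the downloaded vectors, assuming the file is in plain-text word2vec format as distributed by that project (the file name below is just an example):

from gensim.models import KeyedVectors

# The file name depends on which archive you downloaded and unpacked.
wv = KeyedVectors.load_word2vec_format("sgns.baidubaike.bigram-char", binary=False)
print(wv.vector_size)    # dimension d of the pre-trained vectors
vec = wv["中国"]          # embedding of a single word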

2.3 data preprocessing

  1. Remove useless characters and segment the text into words
  2. Build a dictionary over the whole dataset: key = word, value = the word's index (ID)
  3. Truncate, or pad with 0, so that every sample has length maxlen
  4. Serialize the sample labels, e.g. map the category "news_sports" to 1 and "news_entertainment" to 2
  5. Convert the processed data to DataFrame format and save it to disk (a sketch of these steps follows this list)
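Below is a minimal preprocessing sketch following the five steps above. It assumes jieba for Chinese word segmentation and pandas for saving; the file names, the character-filtering regular expression, and the maxlen value are assumptions for illustration, not taken from the article's code:

import re
import jieba
import pandas as pd

maxlen = 30                                   # assumed maximum number of words per title

def tokenize(title):
    # Step 1: remove useless characters, then segment into words.
    title = re.sub(r"[^\u4e00-\u9fa5a-zA-Z0-9]", " ", title)
    return [w for w in jieba.lcut(title) if w.strip()]

samples = []                                  # (word list, category name) pairs
with open("toutiao_cat_data.txt", encoding="utf-8") as f:
    for line in f:
        _, _, name, title, _ = line.strip().split("_!_")
        samples.append((tokenize(title), name))

# Step 2: build the dictionary, key = word, value = word index (0 is reserved for padding).
word2id = {}
for words, _ in samples:
    for w in words:
        word2id.setdefault(w, len(word2id) + 1)

# Step 3: truncate or pad with 0 so that every sample has length maxlen.
def encode(words):
    ids = [word2id[w] for w in words][:maxlen]
    return ids + [0] * (maxlen - len(ids))

# Step 4: serialize the labels, e.g. news_sports -> 1, news_entertainment -> 2, ...
label2id = {name: i for i, name in enumerate(sorted({n for _, n in samples}))}

# Step 5: save the processed data as a DataFrame.
df = pd.DataFrame({"x": [encode(w) for w, _ in samples],
                   "y": [label2id[n] for _, n in samples]})
df.to_pickle("preprocessed.pkl")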

2.4 definition of CNN model

# TensorFlow 1.x API is used below (tf.placeholder, tf.train.AdamOptimizer, etc.)
import tensorflow as tf
from tensorflow import keras

'''
    Text => CNN => Fully_Connected => Softmax
    
    Parameters:
    filter_sizes: The size of convolution kernel
    num_filters: The number of convolution kernels
    embedded_size: The dimension of word vector
    dict_size: Number of words in the dataset
    maxlen: Maximum number of words per sample
    label_num: Number of sample categories
    learning_rate: Initial learning rate of gradient optimizer
'''
class CNN:
    def __init__(self, filter_sizes, num_filters, embedded_size,
                 dict_size, maxlen, label_num, learning_rate): 

        # print('model_Name:', 'CNN')

        self.dropout_keep_prob = 0.5  # probability of keeping a unit in dropout (TF1 keep_prob)
        
        '''
            Convolutional Neural Network
        '''
        def cnn(input_emb, filter_sizes, num_filters):
            pooled_outputs = []
            for i, filter_size in enumerate(filter_sizes):
                filter_shape = [filter_size, embedded_size, 1, num_filters]
                W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="W")
                b = tf.Variable(tf.constant(0.1, shape=[num_filters]), name="b")         
                conv = tf.nn.conv2d(
                    input_emb,
                    W,
                    strides=[1, 1, 1, 1],
                    padding="VALID",
                    name="conv") # shape(conv) = [None, sequence_length - filter_size + 1, 1, num_filters]
                h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu")
                word_num = input_emb.shape.as_list()[1]
                pooled = tf.nn.max_pool(
                    h,
                    ksize=[1, word_num - filter_size + 1, 1, 1],
                    strides=[1, 1, 1, 1],
                    padding='VALID',
                    name="pool") # shape(pooled) = [None, 1, 1, num_filters]
                pooled_outputs.append(pooled)
            num_filters_total = num_filters * len(filter_sizes)

            h_pool = tf.concat(pooled_outputs,3) 

            h_pool_flat = tf.reshape(h_pool, [-1, num_filters_total])  # shape = [None,num_filters_total] 

            cnn_fea = tf.nn.dropout(h_pool_flat, keep_prob=self.dropout_keep_prob)

            return cnn_fea
        
        self.X = tf.placeholder(tf.int32, [None, maxlen], name='input_x')
        self.Y = tf.placeholder(tf.int64, [None])
        
        # Word-embedding table; trainable=False means it is not updated during training.
        self.encoder_embeddings = tf.Variable(tf.random_uniform([dict_size, embedded_size], -1, 1), trainable=False)
        encoder_embedded = tf.nn.embedding_lookup(self.encoder_embeddings, self.X)
        
        # Since conv2d requires a four-dimensional input data, a dimension needs to be added manually.
        encoder_embedded = tf.expand_dims(encoder_embedded, -1) # shape(encoder_embedded) = [None, maxlen, embedded_size, 1]

        outputs = cnn(input_emb = encoder_embedded, filter_sizes = filter_sizes, num_filters = num_filters)
 
        self.logits = keras.layers.Dense(label_num, use_bias=True)(outputs)
        self.probability = tf.nn.softmax(self.logits, name='probability')

        self.cost = tf.nn.sparse_softmax_cross_entropy_with_logits(
                                                                    labels = self.Y, 
                                                                    logits = self.logits)
        self.cost = tf.reduce_mean(self.cost)
        self.optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(self.cost)
        self.pre_y = tf.argmax(self.logits, 1, name='pre_y')
        correct_pred = tf.equal(self.pre_y, self.Y)
        self.accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
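A minimal usage sketch for this class, assuming TensorFlow 1.x and illustrative hyperparameters (the filter sizes, dictionary size, batch contents, etc. below are assumptions, not the article's settings):

import numpy as np
import tensorflow as tf

tf.reset_default_graph()
model = CNN(filter_sizes=[3, 4, 5], num_filters=128, embedded_size=300,
            dict_size=50000, maxlen=30, label_num=15, learning_rate=1e-3)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # batch_x: int32 [batch, maxlen] word ids; batch_y: int64 [batch] label ids
    batch_x = np.zeros((32, 30), dtype=np.int32)
    batch_y = np.zeros(32, dtype=np.int64)
    _, loss, acc = sess.run([model.optimizer, model.cost, model.accuracy],
                            feed_dict={model.X: batch_x, model.Y: batch_y})
    print(loss, acc)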

2.5 training model

  1. The preprocessed dataset is split into 80% training set, 10% validation set, and 10% test set
  2. After each pass over the training set, the model is evaluated on the validation set
  3. When the validation accuracy has dropped 5 times in a row, step 2 is stopped, and the result on the test set is reported as the final performance of the model (see the sketch after this list)
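A minimal early-stopping sketch of this procedure; train_one_epoch and evaluate are hypothetical helpers standing in for the actual training and evaluation code:

best_val_acc, drops, patience = 0.0, 0, 5

while True:
    train_one_epoch(model, train_set)      # hypothetical helper: one pass over the training set
    val_acc = evaluate(model, val_set)     # hypothetical helper: accuracy on the validation set
    if val_acc > best_val_acc:
        best_val_acc, drops = val_acc, 0
    else:
        drops += 1
        if drops >= patience:              # validation accuracy dropped 5 times in a row
            break

test_acc = evaluate(model, test_set)       # final reported performance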
