Chapter 3 word2vec
The topic of this chapter is, once again, the distributed representation of words. In the previous chapter we obtained distributed representations of words with a count-based method; in this chapter we discuss the alternative, the inference-based method.
As the name suggests, an inference-based method performs inference, and here that inference is carried out by a neural network. This is the chapter in which the famous word2vec appears.
The goal of this chapter is to implement a simple word2vec. This simple word2vec prioritizes comprehensibility at the expense of processing efficiency, so we will not use it on large datasets; for small datasets it poses no problem.
3.1 Inference-Based Methods and Neural Networks
Research on representing words as vectors has been flourishing. The more successful approaches fall into two camps: count-based methods and inference-based methods.
3.1.1 Problems with Count-Based Methods
A count-based method uses statistics over the whole corpus (a co-occurrence matrix, PPMI, and so on) and obtains distributed representations of words in a single batch of processing (SVD, etc.). An inference-based method uses a neural network and typically learns on mini-batches of data: the network looks at only a portion of the training data (a mini-batch) at a time and updates its weights repeatedly.
A count-based method processes all of the training data at once; in contrast, an inference-based method learns step by step, using a portion of the training data at a time.
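The contrast can be sketched in code. Below is a minimal, hypothetical mini-batch learning loop; the model and optimizer arguments stand in for the classes from Chapter 1 (forward()/backward() and update(params, grads)), and none of this is from the listings in this chapter.

import numpy as np

def train_with_mini_batches(data, labels, model, optimizer,
                            batch_size=32, max_iters=1000):
    # A count-based method would instead process the whole corpus once
    # (e.g., build a co-occurrence matrix and apply SVD).
    data_size = len(data)
    for _ in range(max_iters):
        idx = np.random.choice(data_size, batch_size)  # look at only part of the data
        loss = model.forward(data[idx], labels[idx])   # loss on the mini-batch
        model.backward()                               # compute gradients
        optimizer.update(model.params, model.grads)    # update the weights repeatedly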
3.1.2 Overview of Inference-Based Methods
The main operation of an inference-based method is, as the name suggests, inference.
3.1.3 Processing Words in Neural Networks
As shown in Figure 3-5, the input layer consists of 7 neurons, one for each of the 7 words in the vocabulary (the first neuron corresponds to "you", the second to "say", and so on).
Now things become simple: once words are represented as vectors, those vectors can be processed by the various layers that make up a neural network. For example, a word represented as a one-hot vector can be transformed by a fully connected layer, as shown in Figure 3-6 (bias omitted).
Let's look at the code. The fully connected layer transformation can be written as follows:
import numpy as np

c = np.array([[1, 0, 0, 0, 0, 0, 0]])  # input
W = np.random.randn(7, 3)              # weight
h = np.dot(c, W)                       # intermediate node
print(W)
print(h)

# Output
# [[-1.30982294  0.19485001 -0.49979452]
#  [-1.77466539 -2.67810488  2.7992046 ]
#  [-0.20747764 -0.68246166  0.7149981 ]
#  [ 0.18558413 -0.61176428  2.25844791]
#  [-0.70263837  0.63946127 -0.33276184]
#  [-0.31945603  0.07161013  1.18615179]
#  [ 2.01949978 -0.5961003  -1.01233551]]
# [[-1.30982294  0.19485001 -0.49979452]]
class MatMul:
    def __init__(self, W):
        self.params = [W]
        self.grads = [np.zeros_like(W)]
        self.x = None

    def forward(self, x):
        W, = self.params
        out = np.dot(x, W)
        self.x = x
        return out

    def backward(self, dout):
        W, = self.params
        dx = np.dot(dout, W.T)
        dW = np.dot(self.x.T, dout)
        self.grads[0][...] = dW
        return dx
It may not seem very efficient to compute a full matrix product just to extract a single row of the weight matrix; we will improve this point in Section 4.1. Incidentally, the same computation can also be carried out with the MatMul layer implemented in Chapter 1 (shown above):
c = np.array([[1, 0, 0, 0, 0, 0, 0]])
W = np.random.randn(7, 3)
layer = MatMul(W)
h = layer.forward(c)
print(h)
# [[-0.70012195  0.25204755 -0.79774592]]
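Incidentally, the point made above, that multiplying a one-hot vector by W merely extracts the corresponding row of W, can be checked directly. This small verification is not part of the original listing:

import numpy as np

c = np.array([[1, 0, 0, 0, 0, 0, 0]])  # one-hot vector for word ID 0
W = np.random.randn(7, 3)

h = np.dot(c, W)             # matrix product
print(np.allclose(h, W[0]))  # True: h is exactly the 0th row of W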
3.2 A Simple word2vec
word2vec originally referred to a program or tool, but as the term spread it also came to refer, in some contexts, to the neural network models themselves. Strictly speaking, the CBOW model and the skip-gram model are the two neural networks used in word2vec.
3.2.1 Inference with the CBOW Model
The CBOW model is a neural network that predicts a target word from its context (the "target word" is the word in the middle, and the words around it are its "context"). By training the CBOW model so that its predictions are as accurate as possible, we obtain distributed representations of words.
The input to the CBOW model is the context, given as a list of words such as ['you', 'goodbye']. We convert it to a one-hot representation so that the CBOW model can process it. With that, the network of the CBOW model can be drawn as shown in Figure 3-9.
Figure 3-9 shows the network of the CBOW model. It has two input layers, which pass through an intermediate layer to the output layer. The transformation from the input layers to the intermediate layer is performed by the same fully connected layer (with weight W_in), and the transformation from the intermediate layer to the output layer is performed by another fully connected layer (with weight W_out).
Sometimes the neurons obtained after passing the scores through a Softmax layer are called the output layer. Here, we call the nodes that output the scores the output layer.
The processing of a fully connected layer without bias corresponds to the forward propagation of the MatMul layer, which computes the matrix product internally.
Now let's implement inference with the CBOW model. The implementation is as follows:
# coding: utf-8
import sys
sys.path.append('..')
import numpy as np
from common.layers import MatMul

# Sample context data
c0 = np.array([[1, 0, 0, 0, 0, 0, 0]])
c1 = np.array([[0, 0, 1, 0, 0, 0, 0]])

# Initialize the weights
W_in = np.random.randn(7, 3)
W_out = np.random.randn(3, 7)

# Create the layers
in_layer0 = MatMul(W_in)
in_layer1 = MatMul(W_in)
out_layer = MatMul(W_out)

# Forward propagation
h0 = in_layer0.forward(c0)
h1 = in_layer1.forward(c1)
h = 0.5 * (h0 + h1)
s = out_layer.forward(h)
print(s)
# [[-0.33304998  3.19700011  1.75226542  1.36880744  1.68725368  2.38521564  0.81187955]]
That is the inference process of the CBOW model. The CBOW model seen here is a simple network without activation functions; there is nothing difficult about it, except that multiple input layers share the same weight.
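The scores s are not yet probabilities. Purely for illustration, and assuming the softmax function implemented in Chapter 1 (common/functions.py), you could convert them into a probability distribution and read off the word ID that the untrained model currently ranks highest:

from common.functions import softmax  # softmax as implemented in Chapter 1

probs = softmax(s)       # turn the scores into a probability distribution
print(np.argmax(probs))  # word ID with the highest probability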
3.2.2 Learning with the CBOW Model
Learning in the CBOW model means adjusting the weights so that its predictions become accurate. As a result, the weight W_in (to be precise, both W_in and W_out) learns vectors that capture the patterns in which words occur. According to past experiments, the distributed representations obtained with the CBOW model (and the skip-gram model), especially those learned from large corpora such as Wikipedia, agree with our intuitions about word meaning and grammar in many cases.
The CBOW model only learns the patterns in which words occur in its corpus. If the corpus is different, the learned distributed representations will also differ. For example, the distributed representations obtained using only sports-related articles will be very different from those obtained using only music-related articles.
3.2.3 word2vec Weights and Distributed Representations
For word2vec (the skip-gram model in particular), the most popular choice is scheme A, that is, using only the input-side weights. Following this convention, we also use W_in as the distributed representation of words.
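Concretely, after learning, each row of W_in is the distributed representation of one word. A minimal illustration with toy stand-ins (the word_to_id dictionary and W_in below are hypothetical placeholders, not learned values):

import numpy as np

word_to_id = {'you': 0, 'say': 1, 'goodbye': 2}  # toy vocabulary
W_in = np.random.randn(len(word_to_id), 5)       # stands in for the learned input-side weight

say_vec = W_in[word_to_id['say']]  # the row for 'say' is its distributed representation
print(say_vec.shape)               # (5,)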
3.3 Preparing the Training Data
Before starting to train word2vec, let's prepare the training data. Here we again use the one-sentence corpus "You say goodbye and I say hello." as our example.
3.3.1 Contexts and Target Words
The input of the neural network used in word2vec is the contexts, and its supervision labels (the correct answers) are the words surrounded by those contexts, that is, the target words.
import numpy as np

def preprocess(text):
    text = text.lower()
    text = text.replace('.', ' .')
    words = text.split(' ')

    word_to_id = {}
    id_to_word = {}
    for word in words:
        if word not in word_to_id:
            new_id = len(word_to_id)
            word_to_id[word] = new_id
            id_to_word[new_id] = word

    corpus = np.array([word_to_id[w] for w in words])
    return corpus, word_to_id, id_to_word
We will now implement a function that generates contexts and target words from a corpus. Before that, let's briefly review the previous chapter: first, the text of the corpus is converted into word IDs.
import sys
sys.path.append('..')
from common.util import preprocess

text = 'You say goodbye and I say hello.'
corpus, word_to_id, id_to_word = preprocess(text)

print(corpus)
# [0 1 2 3 4 1 5 6]

print(id_to_word)
# {0: 'you', 1: 'say', 2: 'goodbye', 3: 'and', 4: 'i', 5: 'hello', 6: '.'}
Next, we implement the function create_contexts_target(corpus, window_size), which generates contexts and target words.
def create_contexts_target(corpus, window_size=1):
    '''Generate contexts and target words

    :param corpus: corpus (a list of word IDs)
    :param window_size: window size (a window size of 1 means 1 word on each side is the context)
    :return: NumPy arrays of contexts and target words
    '''
    target = corpus[window_size:-window_size]
    contexts = []

    for idx in range(window_size, len(corpus) - window_size):
        cs = []
        for t in range(-window_size, window_size + 1):
            if t == 0:
                continue
            cs.append(corpus[idx + t])
        contexts.append(cs)

    return np.array(contexts), np.array(target)
Using this function on the corpus prepared above gives the following results.
contexts, target = create_contexts_target(corpus, window_size=1)

print(contexts)
# [[0 2]
#  [1 3]
#  [2 4]
#  [3 1]
#  [4 5]
#  [1 6]]

print(target)
# [1 2 3 4 1 5]
3.3.2 Converting to One-Hot Representation
def convert_one_hot(corpus, vocab_size):
    '''Convert to one-hot representation

    :param corpus: list of word IDs (a 1D or 2D NumPy array)
    :param vocab_size: vocabulary size
    :return: one-hot representation (a 2D or 3D NumPy array)
    '''
    N = corpus.shape[0]

    if corpus.ndim == 1:
        one_hot = np.zeros((N, vocab_size), dtype=np.int32)
        for idx, word_id in enumerate(corpus):
            one_hot[idx, word_id] = 1

    elif corpus.ndim == 2:
        C = corpus.shape[1]
        one_hot = np.zeros((N, C, vocab_size), dtype=np.int32)
        for idx_0, word_ids in enumerate(corpus):
            for idx_1, word_id in enumerate(word_ids):
                one_hot[idx_0, idx_1, word_id] = 1

    return one_hot
With the following code, the preparation of the training data is complete.
text = 'You say goodbye and I say hello.'
corpus, word_to_id, id_to_word = preprocess(text)

contexts, target = create_contexts_target(corpus, window_size=1)

vocab_size = len(word_to_id)
target = convert_one_hot(target, vocab_size)
contexts = convert_one_hot(contexts, vocab_size)

contexts
# Output
# array([[[1, 0, 0, 0, 0, 0, 0],
#         [0, 0, 1, 0, 0, 0, 0]],
#
#        [[0, 1, 0, 0, 0, 0, 0],
#         [0, 0, 0, 1, 0, 0, 0]],
#
#        [[0, 0, 1, 0, 0, 0, 0],
#         [0, 0, 0, 0, 1, 0, 0]],
#
#        [[0, 0, 0, 1, 0, 0, 0],
#         [0, 1, 0, 0, 0, 0, 0]],
#
#        [[0, 0, 0, 0, 1, 0, 0],
#         [0, 0, 0, 0, 0, 1, 0]],
#
#        [[0, 1, 0, 0, 0, 0, 0],
#         [0, 0, 0, 0, 0, 0, 1]]])
3.4 Implementing the CBOW Model
import sys
sys.path.append('..')
import numpy as np
from common.layers import MatMul, SoftmaxWithLoss


class SimpleCBOW:
    def __init__(self, vocab_size, hidden_size):
        V, H = vocab_size, hidden_size  # vocabulary size and number of neurons in the intermediate layer

        # Initialize the weights with small random values.
        # astype('f') makes them 32-bit floating-point numbers.
        W_in = 0.01 * np.random.randn(V, H).astype('f')
        W_out = 0.01 * np.random.randn(H, V).astype('f')

        # Create the layers: two input-side MatMul layers, one output-side MatMul layer,
        # and one Softmax with Loss layer
        self.in_layer0 = MatMul(W_in)
        self.in_layer1 = MatMul(W_in)
        self.out_layer = MatMul(W_out)
        self.loss_layer = SoftmaxWithLoss()

        # Collect all weights and gradients into lists
        layers = [self.in_layer0, self.in_layer1, self.out_layer]
        self.params, self.grads = [], []
        for layer in layers:
            self.params += layer.params
            self.grads += layer.grads

        # Keep the distributed representation of the words as a member variable
        self.word_vecs = W_in

    # Forward propagation of the network
    def forward(self, contexts, target):
        h0 = self.in_layer0.forward(contexts[:, 0])
        h1 = self.in_layer1.forward(contexts[:, 1])
        h = (h0 + h1) * 0.5
        score = self.out_layer.forward(h)
        loss = self.loss_layer.forward(score, target)
        return loss

    def backward(self, dout=1):
        ds = self.loss_layer.backward(dout)
        da = self.out_layer.backward(ds)
        da *= 0.5
        self.in_layer1.backward(da)
        self.in_layer0.backward(da)
        return None
Here, the number of MatMul layers used to process the contexts on the input side equals the number of words in the context (two in this example). Note also that both input-side MatMul layers are initialized with the same weight.
Finally, the weight parameters and gradients used by the network are collected in the member variables params and grads, each of which is a list.
Because multiple layers share the same weight, the params list contains the same weight more than once. When the params list contains duplicate weights, however, optimizers such as Adam and Momentum behave unexpectedly (at least with our code). For this reason, the Trainer class performs a simple deduplication of parameters internally before updating them. The details are omitted here; interested readers can look at remove_duplicate(params, grads) in common/trainer.py.
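As a rough illustration of what such a deduplication does, here is a simplified sketch only; the actual remove_duplicate in common/trainer.py is more elaborate:

def remove_duplicate_sketch(params, grads):
    '''Merge list entries that refer to the very same weight array, summing their gradients.'''
    params, grads = params[:], grads[:]  # work on copies of the lists
    i = 0
    while i < len(params):
        j = i + 1
        while j < len(params):
            if params[i] is params[j]:  # the same array object is shared by several layers
                grads[i] += grads[j]    # accumulate the gradient of the shared weight
                params.pop(j)
                grads.pop(j)
            else:
                j += 1
        i += 1
    return params, grads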
Here, we assume that the contexts argument is a three-dimensional NumPy array with the shape (6, 2, 7) from the example in Figure 3-18 of the previous section: the size of dimension 0 is the mini-batch size, the size of dimension 1 is the number of context words, and dimension 2 holds the one-hot vectors. Likewise, target has a two-dimensional shape such as (6, 7).
With this, the implementation of backpropagation is complete. We have stored the gradient of each weight parameter in the member variable grads, so calling forward() and then backward() updates the gradients held in the grads list. Next, let's move on to training the SimpleCBOW class.
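For reference, one learning step driven by hand, without the Trainer class, would look roughly like this. This is only a sketch; it assumes the contexts, target, and vocab_size prepared in Section 3.3 and the Adam optimizer from Chapter 1:

from common.optimizer import Adam

model = SimpleCBOW(vocab_size, hidden_size=5)
optimizer = Adam()

loss = model.forward(contexts, target)       # forward pass: compute the loss
model.backward()                             # backward pass: fill model.grads
optimizer.update(model.params, model.grads)  # update the weights using the gradients
print(loss)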
Implementing the learning code
Learning in the CBOW model works exactly like learning in an ordinary neural network: first, prepare the training data and feed it to the network; then compute the gradients and gradually update the weight parameters. Here we use the Trainer class introduced in Chapter 1 to carry out the learning process.
import sys
sys.path.append('..')  # settings for importing files from the parent directory
from common.trainer import Trainer
from common.optimizer import Adam
from simple_cbow import SimpleCBOW
from common.util import preprocess, create_contexts_target, convert_one_hot

window_size = 1
hidden_size = 5
batch_size = 3
max_epoch = 1000

text = 'You say goodbye and I say hello.'
corpus, word_to_id, id_to_word = preprocess(text)

vocab_size = len(word_to_id)
contexts, target = create_contexts_target(corpus, window_size)
target = convert_one_hot(target, vocab_size)
contexts = convert_one_hot(contexts, vocab_size)

model = SimpleCBOW(vocab_size, hidden_size)
optimizer = Adam()
trainer = Trainer(model, optimizer)

trainer.fit(contexts, target, max_epoch, batch_size)
trainer.plot()

# The weights of the input-side MatMul layer are stored in the member variable word_vecs
word_vecs = model.word_vecs
for word_id, word in id_to_word.items():
    print(word, word_vecs[word_id])
you [ 1.1550112 1.2552509 -1.1116056 1.1482503 -1.2046812]
say [-1.2141827 -1.2367412 1.2163384 -1.2366292 0.67279905]
goodbye [ 0.7116186 0.55987084 -0.8319744 0.749239 -0.8436555 ]
and [-0.94652086 -0.8535852 0.55927175 -0.6934891 2.0411916 ]
i [ 0.7177702 0.5699475 -0.8368816 0.7513028 -0.8419596]
hello [ 1.1411363 1.2600429 -1.105042 1.1401148 -1.2044929]
. [-1.1948084 -1.2921802 1.4119368 -1.345656 -1.5299033]
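If you want to compare two of the resulting vectors, the cos_similarity function implemented in Chapter 2 (common/util.py) can be used. A small example; the value depends on the random initialization, so treat it only as a sketch:

from common.util import cos_similarity

you_vec = word_vecs[word_to_id['you']]
i_vec = word_vecs[word_to_id['i']]
print(cos_similarity(you_vec, i_vec))  # cosine similarity between 'you' and 'i'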
Unfortunately, the small corpus used here does not give good results. The main reason, of course, is that the corpus is too small; with a larger, more realistic corpus the results should improve. New problems then arise in processing speed, however, because the current implementation of the CBOW model has several efficiency issues. In the next chapter we will improve this simple CBOW model and implement a "real" CBOW model.
3.5 Supplementary Remarks on word2vec
3.5.1 The CBOW Model and Probability
In this book, a probability is written P(·); for example, the probability of event A is written P(A). A joint probability is written P(A, B) and denotes the probability that events A and B occur together. A posterior probability is written P(A|B); literally, it is the "probability after the fact", and it can also be read as the probability of event A given the information that event B has occurred.
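With this notation, the CBOW model can be stated compactly. Following the standard formulation (target word w_t, context words w_{t-1} and w_{t+1}), the model outputs the posterior probability of the target given its context, and learning minimizes the corresponding negative log likelihood (the cross-entropy error), averaged over a corpus of T words:

P(w_t \mid w_{t-1}, w_{t+1})

L = -\log P(w_t \mid w_{t-1}, w_{t+1})

L = -\frac{1}{T} \sum_{t=1}^{T} \log P(w_t \mid w_{t-1}, w_{t+1})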
3.5.2 The skip-gram Model
import sys
sys.path.append('..')
import numpy as np
from common.layers import MatMul, SoftmaxWithLoss


class SimpleSkipGram:
    def __init__(self, vocab_size, hidden_size):
        V, H = vocab_size, hidden_size

        # Initialize the weights
        W_in = 0.01 * np.random.randn(V, H).astype('f')
        W_out = 0.01 * np.random.randn(H, V).astype('f')

        # Create the layers
        self.in_layer = MatMul(W_in)
        self.out_layer = MatMul(W_out)
        self.loss_layer1 = SoftmaxWithLoss()
        self.loss_layer2 = SoftmaxWithLoss()

        # Collect all weights and gradients into lists
        layers = [self.in_layer, self.out_layer]
        self.params, self.grads = [], []
        for layer in layers:
            self.params += layer.params
            self.grads += layer.grads

        # Keep the distributed representation of the words as a member variable
        self.word_vecs = W_in

    def forward(self, contexts, target):
        h = self.in_layer.forward(target)
        s = self.out_layer.forward(h)
        l1 = self.loss_layer1.forward(s, contexts[:, 0])
        l2 = self.loss_layer2.forward(s, contexts[:, 1])
        loss = l1 + l2
        return loss

    def backward(self, dout=1):
        dl1 = self.loss_layer1.backward(dout)
        dl2 = self.loss_layer2.backward(dout)
        ds = dl1 + dl2
        dh = self.out_layer.backward(ds)
        self.in_layer.backward(dh)
        return None
from common.functions import softmax, cross_entropy_error  # helper functions from common/functions.py


class MatMul:
    def __init__(self, W):
        self.params = [W]
        self.grads = [np.zeros_like(W)]
        self.x = None

    def forward(self, x):
        W, = self.params
        out = np.dot(x, W)
        self.x = x
        return out

    def backward(self, dout):
        W, = self.params
        dx = np.dot(dout, W.T)
        dW = np.dot(self.x.T, dout)
        self.grads[0][...] = dW
        return dx


class SoftmaxWithLoss:
    def __init__(self):
        self.params, self.grads = [], []
        self.y = None  # output of softmax
        self.t = None  # supervision labels

    def forward(self, x, t):
        self.t = t
        self.y = softmax(x)

        # If the supervision labels are one-hot vectors, convert them to the indices of the correct labels
        if self.t.size == self.y.size:
            self.t = self.t.argmax(axis=1)

        loss = cross_entropy_error(self.y, self.t)
        return loss

    def backward(self, dout=1):
        batch_size = self.t.shape[0]

        dx = self.y.copy()
        dx[np.arange(batch_size), self.t] -= 1
        dx *= dout
        dx = dx / batch_size
        return dx
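The SimpleSkipGram class can be trained in exactly the same way as SimpleCBOW: the Trainer passes contexts and target to forward(), and the skip-gram model simply uses the target as its input and the contexts as its labels. Below is a sketch of the training script, mirroring the CBOW script above; the module name simple_skip_gram is an assumption about where the class is saved:

import sys
sys.path.append('..')
from common.trainer import Trainer
from common.optimizer import Adam
from common.util import preprocess, create_contexts_target, convert_one_hot
from simple_skip_gram import SimpleSkipGram  # assumed file name for the class above

window_size = 1
hidden_size = 5
batch_size = 3
max_epoch = 1000

text = 'You say goodbye and I say hello.'
corpus, word_to_id, id_to_word = preprocess(text)

vocab_size = len(word_to_id)
contexts, target = create_contexts_target(corpus, window_size)
target = convert_one_hot(target, vocab_size)
contexts = convert_one_hot(contexts, vocab_size)

model = SimpleSkipGram(vocab_size, hidden_size)
optimizer = Adam()
trainer = Trainer(model, optimizer)
trainer.fit(contexts, target, max_epoch, batch_size)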
3.5.3 Count-Based vs. Inference-Based Methods
In addition, after word2vec, the GloVe method [27] was proposed. GloVe combines the inference-based and count-based approaches: the idea is to incorporate statistics of the whole corpus into the loss function and then learn with mini-batches (see the paper [27] for details). In this way, the two methodologies were successfully fused.
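For reference, the loss function proposed in the GloVe paper [27] has roughly the following form (summarized here from the paper, not derived in this chapter), where X_{ij} is the co-occurrence count of words i and j collected from the whole corpus, w_i and \tilde{w}_j are word vectors, b_i and \tilde{b}_j are biases, and f is a weighting function that damps very frequent pairs:

J = \sum_{i,j} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2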
3.6 Summary
In this chapter, we explained the CBOW model of word2vec in detail and implemented it. The CBOW model is basically a two-layer neural network with a very simple structure. We built it from MatMul layers and a Softmax with Loss layer and confirmed its learning process on a small corpus. Unfortunately, at this stage the CBOW model still has problems with processing efficiency. However, having understood the CBOW model in this chapter, we are only one step away from the real word2vec. In the next chapter, we will improve the CBOW model.
What we learned in this chapter
- Inference-based methods aim at prediction and obtain distributed representations of words as a by-product
- word2vec is an inference-based method built from a simple two-layer neural network
- word2vec provides the skip-gram model and the CBOW model
- The CBOW model predicts one word (the target word) from multiple words (the context)
- The skip-gram model, conversely, predicts multiple words (the context) from one word (the target word)
- Because word2vec can learn its weights incrementally, it can efficiently update or add distributed representations of words