Using pre-trained word embeddings in a Keras model

The original article can be found here.

What are word embeddings?

Word embeddings are a family of natural language processing techniques aiming at mapping semantic meaning into a geometric space. This is done by associating a numeric vector with every word in a dictionary, such that the distance between any two vectors (e.g. the L2 distance or, more commonly, the cosine distance) captures part of the semantic relationship between the two associated words. The geometric space formed by these vectors is called an embedding space.
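
As a quick illustration of how such distances are computed, here is a minimal sketch of the cosine similarity between two embedding vectors; the random 100-dimensional vectors are placeholders for real pre-trained vectors such as the GloVe ones loaded later in this post:

import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two embedding vectors: higher values
    # indicate more closely related words when real embeddings are used.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Placeholder vectors; with real embeddings these would come from a
# pre-trained model such as GloVe.
v_kitchen = np.random.rand(100)
v_dinner = np.random.rand(100)
print(cosine_similarity(v_kitchen, v_dinner))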

For example, "coconut" and "polar bear" are words that are semantically quite different, so a reasonable embedding space would represent them as vectors that are far apart. But "kitchen" and "dinner" are related words, so they should be embedded close to each other.

Ideally, in a good embedding space, the "path" (a vector) going from "dinner" to "kitchen" would capture precisely the semantic relationship between these two concepts. In this case the relationship is "where x occurs", so you would expect the vector kitchen - dinner (the difference of the two embedding vectors, i.e. the path from dinner to kitchen) to capture this "where x occurs" relationship. Basically, we should have the vectorial identity: dinner + (where x occurs) = kitchen (at least approximately). If that is indeed the case, then we can use such a relationship vector to answer questions. For instance, starting from a new vector, e.g. "work", and applying this relationship vector, we should get something meaningful, e.g. work + (where x occurs) = office, answering "where does work occur?".
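
A minimal sketch of this kind of analogy arithmetic, assuming a dictionary embeddings_index that maps words to NumPy vectors (such as the one built from GloVe later in this post) and a small, arbitrary candidate vocabulary to search over:

import numpy as np

def most_similar(target, candidates, embeddings_index):
    # Return the candidate word whose vector is closest (by cosine
    # similarity) to the target vector.
    best_word, best_score = None, -1.0
    for w in candidates:
        v = embeddings_index[w]
        score = np.dot(target, v) / (np.linalg.norm(target) * np.linalg.norm(v))
        if score > best_score:
            best_word, best_score = w, score
    return best_word

# Approximate "work + (where x occurs)" as work + (kitchen - dinner);
# with good embeddings the nearest candidate should be "office".
relation = embeddings_index['kitchen'] - embeddings_index['dinner']
query = embeddings_index['work'] + relation
print(most_similar(query, ['office', 'bedroom', 'car'], embeddings_index))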

Word embeddings are computed by applying dimensionality reduction techniques to datasets of co-occurrence statistics between words in a corpus of text. This can be done via neural networks (the "word2vec" technique) or via matrix factorization.
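
To make the matrix-factorization flavor concrete, here is a minimal toy sketch that builds a tiny word-word co-occurrence matrix and reduces it with a truncated SVD; the corpus and the choice of 2 dimensions are purely illustrative:

import numpy as np

# Toy corpus: two "sentences" of related words.
corpus = [['kitchen', 'dinner', 'food'], ['office', 'work', 'desk']]
vocab = sorted({w for sentence in corpus for w in sentence})
index = {w: i for i, w in enumerate(vocab)}

# Count how often two words occur in the same sentence (co-occurrence).
cooc = np.zeros((len(vocab), len(vocab)))
for sentence in corpus:
    for w1 in sentence:
        for w2 in sentence:
            if w1 != w2:
                cooc[index[w1], index[w2]] += 1

# Truncated SVD: keep the top 2 singular vectors as 2D word embeddings.
U, S, Vt = np.linalg.svd(cooc)
embeddings = U[:, :2] * S[:2]
print(dict(zip(vocab, embeddings.round(2))))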

GloVe word embeddings

We will be using GloVe embeddings, which you can read about here. GloVe stands for "Global Vectors for Word Representation". It is a popular embedding technique based on factorizing a matrix of word co-occurrence statistics.

Specifically, we will use the 100-dimensional GloVe embeddings of 400k words computed on a 2014 dump of English Wikipedia. You can download them here (warning: following the link will start an 822MB download).

The 20 Newsgroup dataset

The task we will try to solve is to classify posts coming from 20 different newsgroups into their original 20 categories: the infamous "20 Newsgroup dataset". You can read about the dataset and download the raw text data here.
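
As a side note, if you would rather not handle the raw files yourself, the same data can also be fetched programmatically. A minimal sketch using scikit-learn's built-in loader (an alternative to the manual parsing used in the rest of this post):

from sklearn.datasets import fetch_20newsgroups

# Downloads and caches the 20 Newsgroups data; stripping headers is similar
# to the manual header-skipping done later in this post.
newsgroups = fetch_20newsgroups(subset='all', remove=('headers',))
print(len(newsgroups.data), 'texts in', len(newsgroups.target_names), 'categories')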

The categories are fairly distinct semantically, so they will have quite different words associated with them. Here are a few sample categories:

comp.sys.ibm.pc.hardware
comp.graphics
comp.os.ms-windows.misc
comp.sys.mac.hardware
comp.windows.x

rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey

Approach

Here is how we will solve the classification problem:

Convert all text samples in the dataset into sequences of word indices. A "word index" is simply an integer ID for a word. We will only consider the 20,000 most frequent words in the dataset, and truncate the sequences to a maximum length of 1,000 words (the constants used throughout the code below are gathered in the sketch right after this list).
Prepare an "embedding matrix" which will contain, at index i, the embedding vector for the word of index i in our word index.
Load this embedding matrix into a Keras Embedding layer, set to be frozen (its weights, the embedding vectors, will not be updated during training).
Build a 1D convolutional neural network on top of it, ending in a softmax output over our 20 categories.
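
The code below relies on a handful of constants. The vocabulary size, sequence length and embedding dimensionality follow directly from the description above; the validation split and the directory paths are assumptions you should adapt to your own setup:

GLOVE_DIR = 'glove.6B'          # assumed: directory containing the unzipped GloVe vectors
TEXT_DATA_DIR = '20_newsgroup'  # assumed: directory containing the unpacked newsgroup posts

MAX_SEQUENCE_LENGTH = 1000  # truncate each post to at most 1000 words
MAX_NB_WORDS = 20000        # only consider the 20,000 most frequent words
EMBEDDING_DIM = 100         # dimensionality of the GloVe vectors used here
VALIDATION_SPLIT = 0.2      # assumed: fraction of the data held out for validation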

Preparing the text data

First, we will simply iterate over the folders in which our text samples are stored, and format them into a list of samples. We will also prepare, at the same time, a list of class indices matching the samples:

import os
import sys

texts = []  # list of text samples
labels_index = {}  # dictionary mapping label name to numeric id
labels = []  # list of label ids
for name in sorted(os.listdir(TEXT_DATA_DIR)):
    path = os.path.join(TEXT_DATA_DIR, name)
    if os.path.isdir(path):
        label_id = len(labels_index)
        labels_index[name] = label_id
        for fname in sorted(os.listdir(path)):
            if fname.isdigit():
                fpath = os.path.join(path, fname)
                if sys.version_info < (3,):
                    f = open(fpath)
                else:
                    f = open(fpath, encoding='latin-1')
                t = f.read()
                i = t.find('\n\n')  # skip header
                if 0 < i:
                    t = t[i:]
                texts.append(t)
                f.close()
                labels.append(label_id)

print('Found %s texts.' % len(texts))

Then we can format our text samples and labels into tensors that can be fed into a neural network. To do this, we will rely on the Keras utilities keras.preprocessing.text.Tokenizer and keras.preprocessing.sequence.pad_sequences.

import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

labels = to_categorical(np.asarray(labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

# split the data into a training set and a validation set
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
nb_validation_samples = int(VALIDATION_SPLIT * data.shape[0])

x_train = data[:-nb_validation_samples]
y_train = labels[:-nb_validation_samples]
x_val = data[-nb_validation_samples:]
y_val = labels[-nb_validation_samples:]

Preparing the Embedding layer

Next, we compute an index that maps words to known embeddings, by parsing the data dump of the pre-trained embeddings:

embeddings_index = {}
f = open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'))
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

print('Found %s word vectors.' % len(embeddings_index))
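
A quick sanity check, assuming the file parsed correctly and the word is part of GloVe's 400k-word vocabulary, is to look up a vector and confirm its dimensionality:

# Each GloVe vector should have EMBEDDING_DIM (here, 100) entries.
vector = embeddings_index.get('kitchen')
print(vector.shape)  # expected: (100,)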

At this point we can leverage our embeddings_index dictionary and our word_index to compute our embedding matrix:

embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector
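
Since words missing from GloVe are left as all-zero rows, it can be useful to check how much of our vocabulary is actually covered. A minimal sketch of such a check:

# Count how many words in our vocabulary received a pre-trained vector.
hits = sum(1 for word in word_index if word in embeddings_index)
print('Covered %d of %d words (%.1f%%)'
      % (hits, len(word_index), 100.0 * hits / len(word_index)))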

We load this embedding matrix into an Embedding layer. Note that we set trainable=False so as to keep the embeddings fixed (their weights will not be updated during training).

from keras.layers import Embedding

embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

An Embedding layer should be fed sequences of integers, i.e. a 2D input of shape (samples, indices). These input sequences should be padded so that they all have the same length in a batch of input data (although an Embedding layer is capable of processing sequences of heterogeneous length if you don't pass an explicit input_length argument to the layer).

All that the Embedding layer does is map the integer inputs to the vectors found at the corresponding index in the embedding matrix, i.e. the sequence [1, 2] would be converted to [embeddings[1], embeddings[2]]. This means that the output of the Embedding layer will be a 3D tensor of shape (samples, sequence_length, embedding_dim).
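
A minimal sketch that demonstrates this output shape on a dummy batch, assuming the embedding_layer and constants defined above:

import numpy as np
from keras.layers import Input
from keras.models import Model

# Two dummy padded sequences of word indices.
dummy_input = np.random.randint(1, 100, size=(2, MAX_SEQUENCE_LENGTH))

inputs = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
outputs = embedding_layer(inputs)
probe = Model(inputs, outputs)
print(probe.predict(dummy_input).shape)  # (2, MAX_SEQUENCE_LENGTH, EMBEDDING_DIM)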

Training a 1D convnet

Finally, we can build a small 1D convolutional network to solve our classification problem:

from keras.layers import Input, Conv1D, MaxPooling1D, Flatten, Dense
from keras.models import Model

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(35)(x)  # global max pooling
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
preds = Dense(len(labels_index), activation='softmax')(x)

model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])

# happy learning!
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=2, batch_size=128)

This model reaches about 95% classification accuracy on the validation set after only two epochs. You could likely get a higher accuracy by using some regularization mechanism (such as dropout) or by fine-tuning the Embedding layer (i.e. training it rather than keeping it frozen); one possible way to add dropout is sketched below.
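
For instance, here is one possible way to add dropout to the network above; the placement and the 0.5 rate are arbitrary choices to experiment with, not values from the original post:

from keras.layers import Dropout

x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(35)(x)  # global max pooling
x = Flatten()(x)
x = Dropout(0.5)(x)  # assumed rate: randomly drop half the units during training
x = Dense(128, activation='relu')(x)
preds = Dense(len(labels_index), activation='softmax')(x)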

We can also test how well we would have performed by not using pre-trained word embeddings, but instead initializing our Embedding layer from scratch and learning its weights during training. We just need to replace our Embedding layer with the following:

embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            input_length=MAX_SEQUENCE_LENGTH)

After two epochs, this approach only gets us to 90% validation accuracy, less than what the previous model could reach in just one epoch. Our pre-trained embeddings were definitely buying us something. In general, using pre-trained embeddings is especially relevant for natural language processing tasks where little training data is available (functionally, the embeddings act as an injection of outside information which might prove useful for your model).
