Using Seq2Seq for Chinese-English translation

1. Introduction

1.1 Deep NLP

Natural Language Processing (NLP) is an interdisciplinary field of computer science, artificial intelligence, and linguistics. Its goal is to enable computers to process or understand natural language, for tasks such as machine translation and question answering. NLP is often considered difficult because of the complexity involved in expressing, learning, and using language. In recent years, with the rise of Deep Learning (DL), people have kept trying to apply DL to NLP, an area known as Deep NLP, and have made many breakthroughs. One of them is the Seq2Seq model.

1.2 Origin

The Seq2Seq (Sequence to Sequence) model, also known as the Encoder-Decoder model, is based on two papers published in 2014:

  • Sequence to Sequence Learning with Neural Networks by Sutskever et al.
  • Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation by Cho et al.

Sutskever observed that Deep Neural Networks (DNNs) cannot handle sequences of unknown, variable length, because they require the dimensionality of their inputs and outputs to be fixed in advance, while many important problems are naturally expressed as sequences whose lengths are not known ahead of time. This shows that a new approach to the problem of unknown-length sequences was needed, and the Seq2Seq model was proposed as that innovation. Let's see what the model looks like.

2. The Exploration Behind the Seq2Seq Model

Why is it innovative? The Seq2Seq model was settled on by Sutskever only after three modeling attempts, and the final design is very clever. Let's first retrace the author's exploration. A Language Model (LM) uses conditional probability to compute the next word given the previous words; this is the prediction principle underlying the Seq2Seq model. Because sequences are context-dependent, sentence-like, and naturally described by conditional probabilities, the author first chose an RNN-LM (Recurrent Neural Network Language Model).
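
To make the prediction principle concrete: a language model factorizes the probability of a sentence with the chain rule,

P(y1, y2, ..., yT) = P(y1) * P(y2 | y1) * ... * P(yT | y1, ..., y(T-1)),

and a Seq2Seq model additionally conditions every factor on the encoded source sentence, i.e. it models P(y1, ..., yT | x1, ..., xS).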

Above is a simple RNN cell. An RNN feeds the result of the previous step back in as part of the current input, which makes it suitable for modeling context dependencies in sequences of any length. The problem is that the input and output sequences must be aligned in advance, and it is not clear how to apply RNNs to sequences of different lengths with complex, non-monotonic relationships. To solve the alignment problem, the authors propose a theoretically feasible approach: use two RNNs. One RNN maps the input to a fixed-length vector, from which the other RNN predicts the output sequence.
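
For reference, a basic RNN cell updates its hidden state at every time step as

h_t = tanh(W * x_t + U * h_(t-1) + b),

so through h_(t-1) the state at step t depends on all earlier inputs.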

Why only theoretically feasible? Sutskever's Ph.D. thesis, Training Recurrent Neural Networks, points out that RNNs are hard to train. Because of the RNN's own network structure, its output at the current time step must account for the inputs at all previous time steps, so once the input sequence becomes very long, the vanishing gradient problem easily appears. To overcome this training difficulty, the authors use an LSTM (Long Short-Term Memory) network instead.

Above is the internal structure of an LSTM cell. LSTM was proposed to address the vanishing gradient problem of RNNs. Its key innovation is the forget gate, which allows the LSTM to forget earlier, irrelevant parts of the input instead of carrying the entire input history. After these three attempts, with LSTM finally in place, a simple Seq2Seq model is established.
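
For reference, the forget gate outputs a value between 0 and 1 for every entry of the cell state,

f_t = sigmoid(W_f * [h_(t-1), x_t] + b_f),

and the new cell state is c_t = f_t ⊙ c_(t-1) + i_t ⊙ c̃_t (⊙ is the element-wise product), so a forget-gate value close to 0 simply drops that part of the history.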

Shown above, a simple Seq2Seq model consists of three parts: an Encoder-LSTM, a Decoder-LSTM, and the context. The input sequence is ABC; the Encoder-LSTM processes it and returns the hidden state of the entire input sequence from its last neuron, also known as the context (C). The Decoder-LSTM then predicts the next character of the target sequence step by step from that hidden state, finally producing the output sequence WXYZ. It is worth mentioning that Sutskever designed this particular Seq2Seq model around his specific task: the input sequence is fed in reverse order, which helps the model handle long sentences and improves accuracy.
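
As a quick illustration (the tutorial code below does not use this trick), reversing a source sequence is a one-line operation in Python:

# Feed the source in reverse order, as in Sutskever et al.
source = 'ABC'
reversed_source = source[::-1]   # 'CBA' is given to the encoder instead of 'ABC'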

The image above is the actual model designed by Sutskever, and it stands out in three ways. First, it uses two LSTMs, one for encoding and one for decoding; this is also the result of the author's exploration and experiments. Second, it uses a deep LSTM (4 layers); compared with a shallow network, each additional layer reduced perplexity by nearly 10%. Third, the input sequence is reversed, which improves the LSTM's ability to handle long sequences.

3. Chinese-English translation

Now that we understand the Seq2Seq model described above, it's time to get hands-on and build a simple Chinese-English translation model.

3.1 Dataset

We use a Chinese-English dataset from the manythings website, which has also been uploaded to the Mo platform. The dataset format is: English sentence + tab + Chinese sentence.
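
As an illustration of the format (a made-up example line, not necessarily taken verbatim from the file), each line can be split on the tab character:

# One line of the dataset: English sentence + tab + Chinese sentence
line = 'Hello.\t你好。'
english, chinese = line.split('\t')
print(english, chinese)   # Hello. 你好。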

3.2 Processing data

from keras.models import Model
from keras.layers import Input, LSTM, Dense
import numpy as np

batch_size = 64  # Batch size for training.
epochs = 100  # Number of epochs to train for.
latent_dim = 256  # Latent dimensionality of the encoding space.
num_samples = 10000  # Number of samples to train on.
# Path to the data txt file on disk.
data_path = 'cmn.txt'

# Vectorize the data.
input_texts = []
target_texts = []
input_characters = set()
target_characters = set()
with open(data_path, 'r', encoding='utf-8') as f:
    lines = f.read().split('\n')
for line in lines[: min(num_samples, len(lines) - 1)]:
    # some versions of cmn.txt carry an extra attribution column; keep only the first two fields
    input_text, target_text = line.split('\t')[:2]
    # We use "tab" as the "start sequence" character
    # for the targets, and "\n" as "end sequence" character.
    target_text = '\t' + target_text + '\n'
    input_texts.append(input_text)
    target_texts.append(target_text)
    for char in input_text:
        if char not in input_characters:
            input_characters.add(char)
    for char in target_text:
        if char not in target_characters:
            target_characters.add(char)

input_characters = sorted(list(input_characters))
target_characters = sorted(list(target_characters))
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])

print('Number of samples:', len(input_texts))
print('Number of unique input tokens:', num_encoder_tokens)
print('Number of unique output tokens:', num_decoder_tokens)
print('Max sequence length for inputs:', max_encoder_seq_length)
print('Max sequence length for outputs:', max_decoder_seq_length)

3.3 Encoder-LSTM

# map each token (character) to an index, to make vectorization easy
input_token_index = dict([(char, i) for i, char in enumerate(input_characters)])
target_token_index = dict([(char, i) for i, char in enumerate(target_characters)])

# np.zeros(shape, dtype) -- shape is a tuple, here 3-dimensional
encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length, num_encoder_tokens),
    dtype='float32')
decoder_input_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')
decoder_target_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')

# input_texts contains all English sentences
# target_texts contains all Chinese sentences
# zip('ABC', 'xyz') ==> ('A','x'), ('B','y'), ('C','z')
# the aim: vectorize the text into 3D one-hot arrays
for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, char in enumerate(input_text):
        # one-hot: set the position of this char along the last axis to 1.0
        encoder_input_data[i, t, input_token_index[char]] = 1.
    for t, char in enumerate(target_text):
        # decoder_target_data is ahead of decoder_input_data by one timestep
        decoder_input_data[i, t, target_token_index[char]] = 1.
        if t > 0:
            # decoder_target_data will be ahead by one timestep
            # and will not include the start character.
            # ignore t=0 (the '\t' start character): targets are shifted one step earlier
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.
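
To make the one-timestep offset concrete, here is a minimal sketch (not part of the training script) of the decoder input and target for a single Chinese sentence:

# Teacher forcing: at each step the decoder sees the previous correct
# character and must predict the next one.
sample_target = '\t' + '你好。' + '\n'
decoder_in = list(sample_target)       # ['\t', '你', '好', '。', '\n']
decoder_out = list(sample_target[1:])  # ['你', '好', '。', '\n'], shifted left by one step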

3.4 Context (hidden state)

# Define an input sequence and process it.
# Input() produces a Keras tensor, which is what a Keras Model expects.
# Each timestep is a one-hot vector of length num_encoder_tokens (73 here),
# so encoder_inputs has shape (batch_size, timesteps, num_encoder_tokens).
encoder_inputs = Input(shape=(None, num_encoder_tokens))

# units=256, return the last state in addition to the output
encoder_lstm = LSTM(latent_dim, return_state=True)

# calling the LSTM on a tensor returns (outputs, last hidden state h, last cell state c)
encoder_outputs, state_h, state_c = encoder_lstm(encoder_inputs)

# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]
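
As a quick sanity check (a sketch, assuming the cell above has been run), each state carries one latent_dim-dimensional vector per sample:

print(state_h.shape)   # (batch_size, 256) -- the encoder's last hidden state
print(state_c.shape)   # (batch_size, 256) -- the encoder's last cell state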

3.5 Decoder-LSTM

# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None, num_decoder_tokens))

# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)

# obtain output
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)

# fully connected layer with num_decoder_tokens units (2580 here)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')

# the softmax turns each decoder output vector into a probability
# distribution over the target characters, so the next character can be chosen
decoder_outputs = decoder_dense(decoder_outputs)

# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`.
# Model(inputs, outputs) groups layers into an object with training
# and inference features.
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
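
# Optional sanity check (not in the original script): print the layer
# structure and parameter counts of the model we just defined.
model.summary()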

# Run training
# compile -> configure model for training
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
# fit -> train the model on the vectorized data
model.fit([encoder_input_data, decoder_input_data], 
          decoder_target_data,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.2)
# Save model
model.save('seq2seq.h5')
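
If you want to reload the trained model later without retraining it (a minimal sketch, assuming the seq2seq.h5 file saved above is available):

from keras.models import load_model

model = load_model('seq2seq.h5')   # restores architecture, weights and optimizer state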

3.6 Decoding sequences

# Define sampling models
encoder_model = Model(encoder_inputs, encoder_states)
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)

# Reverse-lookup token index to decode sequences back to
# something readable.
reverse_input_char_index = dict(
    (i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict(
    (i, char) for char, i in target_token_index.items())

def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)

    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0, target_token_index['\t']] = 1.
    # this target_seq is the first input token fed to the decoder

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        # Sample a token
        # argmax: Returns the indices of the maximum values along an axis
        # i.e. pick the most probable character
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        # find char using index
        sampled_char = reverse_target_char_index[sampled_token_index]
        # and append it to the decoded sentence
        decoded_sentence += sampled_char

        # Exit condition: either hit max length
        # or find stop character.
        if (sampled_char == '\n' or len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True

        # Update the target sequence (of length 1).
        # build a new length-1 target sequence containing only the
        # character we just sampled, to feed in at the next step
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.

        # Update states
        # carry the decoder's returned states over to the next step
        states_value = [h, c]

    return decoded_sentence

3.7 Prediction

for seq_index in range(100, 200):
    # Take one sequence (part of the training set)
    # for trying out decoding.
    input_seq = encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print('Input sentence:', input_texts[seq_index])
    print('Decoded sentence:', decoded_sentence)
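
The loop above only decodes samples that are already in encoder_input_data. To translate a new English sentence, it first has to be one-hot encoded in the same way; below is a minimal sketch (encode_sentence is a hypothetical helper, not part of the original script, and it only handles characters that appeared in the training data):

def encode_sentence(text):
    # One-hot encode a new English sentence using the token index
    # and dimensions built during preprocessing.
    seq = np.zeros((1, max_encoder_seq_length, num_encoder_tokens), dtype='float32')
    for t, char in enumerate(text):
        seq[0, t, input_token_index[char]] = 1.
    return seq

print('Decoded sentence:', decode_sequence(encode_sentence('Run.')))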

The project has been made public on the Mo platform as Chinese-English translation of Seq2Seq; training on a GPU is recommended.
It also shows off a very handy and practical feature of the Mo platform: the API Doc (the second item in the right-hand bar of the development interface).



When writing code on the Mo platform, you can easily view multiple windows side by side by dragging a window's title bar.

4. Summary and Outlook

Proposing the classical Seq2Seq model was a remarkable achievement: it solved many important and difficult problems that NLP previously could not, in fields such as machine translation and speech recognition, and it is a milestone in applying deep learning to NLP. Many improvements and optimizations have since been built on top of it, such as the Attention mechanism. We are confident that new and significant discoveries are not far off, and we look forward to them.
Project source address (feel free to fork it on a computer): https://momodel.cn/explore/5d38500a1afd94479891643a?type=app

5. References

Paper: Sequence to Sequence Learning with Neural Networks
Blog: Understanding LSTM Networks
Code: A ten-minute introduction to sequence-to-sequence learning in Keras
