[NLP] [XI] Text generation based on RNN and tf.keras


This article follows the official TensorFlow tutorial (https://tensorflow.google.cn/tutorials/sequences/text_generation), with some details added.

[2] Overview

1. tf.keras differs from standalone Keras in three main ways:

1) The optimizer must come from the tf.train module, not from Keras.

2) By default, tf.keras saves models in the TensorFlow checkpoint format, not h5.

3) When training or running inference, tf.keras accepts a tf.data.Dataset directly as input data.

2. The official TensorFlow example generates text at the character level: given a sequence of characters, the model predicts the next character. The model therefore has no notion of words or spelling. It only learns which character is likely to come next, so it may generate words that do not exist.

3. The model has only three layers (character embedding, GRU, fully connected), but it has a large number of parameters and trains slowly (about half an hour per epoch on an i7 CPU). The character embedding is trained from scratch together with the rest of the model; it is not pretrained with fastText or gensim and then fine-tuned.
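To make the size of the three layers concrete, the arithmetic can be sketched in plain Python. This is a rough estimate, not output from the model itself: vocab_size = 65 is an assumption about the number of distinct characters in the Shakespeare text, the helper function names are hypothetical, and the GRU formula assumes the classic formulation with one bias vector per gate (implementations that keep two bias vectors per gate add another 3 * units parameters).

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector of length embedding_dim per character.
    return vocab_size * embedding_dim


def gru_params(input_dim, units):
    # Three gates (update, reset, candidate); each has an input kernel,
    # a recurrent kernel, and one bias vector (assumed classic GRU).
    return 3 * (units * (input_dim + units) + units)


def dense_params(input_dim, output_dim):
    # Weight matrix plus one bias per output unit.
    return input_dim * output_dim + output_dim


vocab_size = 65      # assumption: distinct characters in the corpus
embedding_dim = 256  # from the listing in this article
rnn_units = 1024     # from the listing in this article

counts = {
    "embedding": embedding_params(vocab_size, embedding_dim),
    "gru": gru_params(embedding_dim, rnn_units),
    "dense": dense_params(rnn_units, vocab_size),
}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name:>9}: {n:,}")
print(f"{'total':>9}: {total:,}")
```

Under these assumptions the GRU layer accounts for almost all of the roughly four million parameters, which is consistent with the slow CPU training noted above.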

[3] The code is as follows:

# -*- coding:utf-8 -*-
import tensorflow as tf
import numpy as np
import os
import time

# 1. Data download
path = tf.keras.utils.get_file('shakespeare.txt',

# 2. Data preprocessing
with open(path) as f:
    # text is a string
    text = f.read()
# 3. Extract all the characters that make up the text. Note that vocab is a list
vocab = sorted(set(text))

# 4. Build the char -> int mapping and the reverse int -> char lookup
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)
text_as_int = np.array([char2idx[c] for c in text])

# 5. Use the Dataset batch method to split the text into fixed-length sequences
seq_length = 100
examples_per_epoch = len(text)//seq_length
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
# The batch size is seq_length+1 because each chunk yields both inputs and labels, and the labels are the inputs shifted by one character
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)
# 6. Split each chunk into inputs and labels. For example, for "Hello": inputs = "Hell", labels = "ello"
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text
dataset = sequences.map(split_input_target)

# 7. Group the sequences into batches
# BATCH_SIZE and BUFFER_SIZE are not defined in the original listing; the values below are assumptions
BATCH_SIZE = 64
BUFFER_SIZE = 10000
steps_per_epoch = examples_per_epoch//BATCH_SIZE
# drop_remainder is generally set to True, so that a final group too small to fill a batch is discarded
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

# 8. Model building
# Length of the vocabulary in chars
vocab_size = len(vocab)
# The embedding dimension
embedding_dim = 256
# Number of RNN units
rnn_units = 1024
model = tf.keras.Sequential()
# This is the character embedding, so its weight matrix has size vocab_size * embedding_dim


# 9. Model configuration
# The optimizer must come from tf.train, not from Keras

# 10. Set up the callbacks
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir='./logs/')

# 11. Train the model. repeat() makes the dataset loop indefinitely; otherwise the data may run out before 30 epochs are finished

# 12. Save the model
# Save in Keras model format
# Save in TensorFlow format

# 13. Generate text with the model
def generate_text(model, start_string):
    # Evaluation step (generating text using the learned model)

    # Number of characters to generate
    num_generate = 1000

    # start_string is supplied by the caller; change the argument to experiment
    # (the original reassigned it to 'ROMEO' here, which silently ignored the argument)

    # Converting our start string to numbers (vectorizing)
    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)

    # Empty string to store our results
    text_generated = []

    # Low temperatures result in more predictable text.
    # Higher temperatures result in more surprising text.
    # Experiment to find the best setting.
    temperature = 1.0

    # Here batch size == 1
    for i in range(num_generate):
        predictions = model(input_eval)
        # remove the batch dimension
        predictions = tf.squeeze(predictions, 0)

        # sample the next character id from the multinomial distribution given by the model's output
        predictions = predictions / temperature
        predicted_id = tf.multinomial(predictions, num_samples=1)[-1, 0].numpy()

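        # We pass the predicted character as the next input to the model
        # along with the previous hidden state
        input_eval = tf.expand_dims([predicted_id], 0)

        # Record the predicted character; without this step only start_string is returned
        text_generated.append(idx2char[predicted_id])
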
    return (start_string + ''.join(text_generated))

print(generate_text(model, start_string="ROMEO: "))
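
The temperature step in generate_text can be tried out without TensorFlow. The following is a minimal, dependency-free sketch rather than the tutorial's code: the helper names softmax and sample_with_temperature are hypothetical, the logits are made-up numbers, and Python's random.choices stands in for tf.multinomial.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn unnormalized scores into probabilities after dividing by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Draw one index at random, weighted by the temperature-scaled probabilities."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Made-up scores for a three-character vocabulary (illustrative only).
example_logits = [2.0, 1.0, 0.1]

for t in (0.1, 1.0, 10.0):
    rounded = [round(p, 3) for p in softmax(example_logits, t)]
    print(f"temperature={t}: probabilities={rounded}")

rng = random.Random(0)  # fixed seed so the draws are repeatable
print([sample_with_temperature(example_logits, 1.0, rng) for _ in range(10)])
```

At a low temperature nearly all of the probability goes to the highest-scoring character, so the output is predictable. At a high temperature the distribution flattens toward uniform and the output becomes more surprising.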

[4] Summary

1. For more information about tf.keras, see the official guide (https://tensorflow.google.cn/guide/keras).

2. For more information about tf.data, see the official guide (https://tensorflow.google.cn/guide/datasets) and another blog post (https://my.oschina.net/u/3800567/blog/1637798).

3. tf.keras can fully replace standalone Keras. The two are consistent in functionality and interface, and tf.keras provides more support for TensorFlow.

