[1] Statement
This article is based on TensorFlow's official tutorial (https://tensorflow.google.cn/tutorials/sequences/text_generation), with some details added.
[2] Overview
1. tf.keras and keras have three major differences (a minimal sketch illustrating them follows this list):
1) The optimizer must come from the tf.train module, not from keras.
2) The default saving format of a tf.keras model is the TensorFlow checkpoint format, not h5.
3) When training and running inference, a tf.keras model can take a tf.data.Dataset directly as its input data.
2. The official TensorFlow example is character-level text generation: given a sequence of characters, the model predicts the next character. The model therefore has no notion of how to spell a word or form a phrase; because it works at the character level, it only learns to predict the next character, so it may generate words that do not exist.
3. The model has only three layers (character embedding, GRU, fully connected layer), but it has a large number of parameters and training is very slow (about half an hour per epoch on an i7 CPU). Also, the character embedding here is trained from scratch, rather than being pretrained with fastText or gensim and then fine-tuned (a sketch of the pretrained alternative is shown after the code in [3]).
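The following is a minimal sketch (written for TensorFlow 1.x with eager execution enabled) illustrating the three differences from point 1; the toy data, layer sizes, and output file names are made up purely for illustration:

import numpy as np
import tensorflow as tf

tf.enable_eager_execution()

# Toy data: 100 samples with 10 features and binary labels (illustrative only)
x = np.random.rand(100, 10).astype(np.float32)
y = np.random.randint(0, 2, size=(100,))

# 3) A tf.data.Dataset can be passed to fit() directly
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(32, drop_remainder=True)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(10,)),
    tf.keras.layers.Dense(2)
])

# 1) The optimizer is taken from tf.train, not from keras
model.compile(optimizer=tf.train.AdamOptimizer(),
              loss=tf.losses.sparse_softmax_cross_entropy)
model.fit(dataset.repeat(), epochs=1, steps_per_epoch=3)

# 2) save_weights defaults to the TensorFlow checkpoint format;
#    pass save_format='h5' to write a Keras h5 file instead
model.save_weights('./demo_checkpoint')
model.save_weights('./demo_weights.h5', save_format='h5')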
[3] The code is as follows:
# -*- coding:utf-8 -*-
import tensorflow as tf
import numpy as np
import os
import time

tf.enable_eager_execution()

# 1. Download the data
path = tf.keras.utils.get_file('shakespeare.txt',
                               'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

# 2. Preprocess the data
with open(path) as f:
    # text is a single string
    text = f.read()

# 3. Extract all the distinct characters that make up the text. Note that vocab is a list
vocab = sorted(set(text))

# 4. Create the text -> int mapping
char2idx = {u: i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)
text_as_int = np.array([char2idx[c] for c in text])

# 5. Use the Dataset batch method to cut the text into fixed-length sequences
seq_length = 100
examples_per_epoch = len(text) // seq_length
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
# seq_length is increased by 1 because each chunk is split into an input and a label
# that are shifted against each other by one character
sequences = char_dataset.batch(seq_length + 1, drop_remainder=True)

# 6. Split each sequence into inputs and labels. For example: "Hello" -> inputs = "Hell", label = "ello"
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

# 7. Group the sequences into batches
BATCH_SIZE = 64
steps_per_epoch = examples_per_epoch // BATCH_SIZE
BUFFER_SIZE = 10000
# drop_remainder is generally set to True, meaning that when the last group of data
# is not enough to fill a batch, it is discarded
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

# 8. Build the model
# Length of the vocabulary in chars
vocab_size = len(vocab)
# The embedding dimension
embedding_dim = 256
# Number of RNN units
rnn_units = 1024

model = tf.keras.Sequential()
# This is a character embedding, so the weight matrix is vocab_size * embedding_dim
model.add(tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim,
                                    batch_input_shape=[BATCH_SIZE, None]))
model.add(tf.keras.layers.GRU(units=rnn_units, return_sequences=True,
                              recurrent_initializer='glorot_uniform', stateful=True))
model.add(tf.keras.layers.Dense(units=vocab_size))
model.summary()

# 9. Configure the model
# The optimizer must come from tf.train, not from keras
model.compile(optimizer=tf.train.AdamOptimizer(),
              loss=tf.losses.sparse_softmax_cross_entropy)

# 10. Set up the callbacks
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir='./logs/')

# 11. Train the model. repeat() loops the dataset indefinitely; otherwise the data
# may not be enough for 30 epochs
model.fit(dataset.repeat(), epochs=30, steps_per_epoch=steps_per_epoch,
          callbacks=[checkpoint_callback, tensorboard_callback])

# 12. Save the model
os.makedirs('./models', exist_ok=True)  # make sure the output directory exists
# Save in Keras h5 format
model.save_weights(filepath='./models/gen_text_with_char_on_rnn.h5', save_format='h5')
# Save in TensorFlow checkpoint format
model.save_weights(filepath='./models/gen_text_with_char_on_rnn_check_point')

# 13. Generate text with the model
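# Note: the training model above is stateful with batch_input_shape=[BATCH_SIZE, None],
# so it cannot be fed a single sequence at generation time. Following the official
# tutorial, rebuild the same architecture with batch size 1 and load the latest
# checkpoint saved during training.
model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim,
                                    batch_input_shape=[1, None]))
model.add(tf.keras.layers.GRU(units=rnn_units, return_sequences=True,
                              recurrent_initializer='glorot_uniform', stateful=True))
model.add(tf.keras.layers.Dense(units=vocab_size))
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1, None]))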
def generate_text(model, start_string):
    # Evaluation step (generating text using the learned model)
    # Number of characters to generate
    num_generate = 1000
    # Converting our start string to numbers (vectorizing)
    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)
    # Empty list to store our results
    text_generated = []
    # Low temperature results in more predictable text.
    # Higher temperature results in more surprising text.
    # Experiment to find the best setting.
    temperature = 1.0
    # Here batch size == 1
    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        # remove the batch dimension
        predictions = tf.squeeze(predictions, 0)
        # use a multinomial distribution to sample the next character from the model output
        predictions = predictions / temperature
        predicted_id = tf.multinomial(predictions, num_samples=1)[-1, 0].numpy()
        # We pass the predicted character as the next input to the model
        # along with the previous hidden state
        input_eval = tf.expand_dims([predicted_id], 0)
        text_generated.append(idx2char[predicted_id])
    return (start_string + ''.join(text_generated))

# You can change the start string to experiment
print(generate_text(model, start_string="ROMEO: "))
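As noted in point 3 of the overview, the character embedding above is trained from scratch. Purely as an illustration (this is not part of the official tutorial), the sketch below shows how a pretrained embedding matrix could be plugged in and fine-tuned instead; pretrained_matrix is a hypothetical placeholder standing in for vectors exported from gensim or fastText, and vocab_size, embedding_dim, and BATCH_SIZE are reused from the script above:

# Hypothetical pretrained character vectors; replace this random placeholder with a
# real (vocab_size, embedding_dim) matrix exported from gensim/fastText
pretrained_matrix = np.random.rand(vocab_size, embedding_dim).astype(np.float32)

embedding_layer = tf.keras.layers.Embedding(
    input_dim=vocab_size,
    output_dim=embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(pretrained_matrix),
    trainable=True,  # True = fine-tune the pretrained vectors, False = freeze them
    batch_input_shape=[BATCH_SIZE, None])
# embedding_layer would replace the Embedding layer added in step 8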
[4] Summary
1. For more information about tf.keras, please refer to the official website (https://tensorflow.google.cn/guide/keras)
2. For more information about tf.data, please refer to the official website (https://tensorflow.google.cn/guide/datasets) and another blog post (https://my.oschina.net/u/3800567/blog/1637798)
3. tf.keras can be used as a full replacement for keras: the two have consistent interfaces, and tf.keras provides better support for TensorFlow