Original Link: https://blog.csdn.net/songbinxu/article/details/80136836
TensorFlow data reading and writing: saving NumPy data as a TFRecord file and reading it back
When training a model with TensorFlow, there are three common ways to read data:
- Feed in-memory NumPy data into a placeholder each epoch/batch; this only suits small datasets and can be quite memory intensive.
- Read from a txt or csv file on disk, where the IO operations are time consuming.
- Read from TFRecord files, the format recommended by TensorFlow.
A TFRecord file is a binary file that stores data and labels together; it makes better use of memory and can be copied, moved, and read faster within a TensorFlow graph. A TFRecord file contains tf.train.Example protocol buffers: the data is serialized into strings, filled into the protocol buffer, and written to the TFRecord file by a TFRecordWriter.
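Before the full write function in the next section, here is a minimal sketch (not from the original post) of what a single tf.train.Example looks like and how it round-trips through SerializeToString/ParseFromString; the feature names "data" and "label" simply follow the convention used throughout this article.

```python
import numpy as np
import tensorflow as tf

# Build one Example holding a float32 vector (as raw bytes) and an int64 label.
vec = np.arange(5, dtype=np.float32)
example = tf.train.Example(features=tf.train.Features(feature={
    "data": tf.train.Feature(bytes_list=tf.train.BytesList(value=[vec.tostring()])),
    "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[1])),
}))

serialized = example.SerializeToString()  # the string a TFRecordWriter would write to disk

# Round-trip: recover the original values from the serialized protocol buffer.
parsed = tf.train.Example()
parsed.ParseFromString(serialized)
raw = parsed.features.feature["data"].bytes_list.value[0]
print(np.frombuffer(raw, dtype=np.float32))                   # [0. 1. 2. 3. 4.]
print(parsed.features.feature["label"].int64_list.value[0])   # 1
```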
Saving NumPy data to TFRecord
```python
import numpy as np
import tensorflow as tf

def save_tfrecords(data, label, desfile):
    with tf.python_io.TFRecordWriter(desfile) as writer:
        for i in range(len(data)):
            features = tf.train.Features(
                feature={
                    # stored as float32 so it matches tf.decode_raw(..., tf.float32) when reading back
                    "data": tf.train.Feature(bytes_list=tf.train.BytesList(value=[data[i].astype(np.float32).tostring()])),
                    "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label[i]]))
                }
            )
            example = tf.train.Example(features=features)
            serialized = example.SerializeToString()
            writer.write(serialized)
```
Usage example

For example, we store a dataset of 10 samples of varying lengths, together with their labels, in a TFRecord file.
```python
# Pad every sample with zeros to a fixed length
def padding(data, maxlen=10):
    for i in range(len(data)):
        data[i] = np.hstack([data[i], np.zeros((maxlen - len(data[i])))])

lens = np.random.randint(low=3, high=10, size=(10,))
data = [np.arange(l) for l in lens]
padding(data)
label = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

save_tfrecords(data, label, "./data.tfrecords")
```
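As an optional sanity check (not part of the original post), the records just written can be inspected with tf.python_io.tf_record_iterator, which in the TF 1.x API yields the raw serialized strings stored in a TFRecord file:

```python
# Count the records in the file and decode the first one.
count = 0
for serialized in tf.python_io.tf_record_iterator("./data.tfrecords"):
    example = tf.train.Example()
    example.ParseFromString(serialized)
    if count == 0:
        raw = example.features.feature["data"].bytes_list.value[0]
        print(np.frombuffer(raw, dtype=np.float32))                   # first padded sample
        print(example.features.feature["label"].int64_list.value[0])  # its label
    count += 1
print("records:", count)  # expected: 10
```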
Reading a TFRecord file back into NumPy
```python
def _parse_function(example_proto):
    features = {"data": tf.FixedLenFeature((), tf.string),
                "label": tf.FixedLenFeature((), tf.int64)}
    parsed_features = tf.parse_single_example(example_proto, features)
    data = tf.decode_raw(parsed_features['data'], tf.float32)
    return data, parsed_features["label"]

def load_tfrecords(srcfile):
    sess = tf.Session()

    dataset = tf.data.TFRecordDataset(srcfile)  # load the tfrecord file
    dataset = dataset.map(_parse_function)      # parse each record into tensors
    dataset = dataset.repeat(2)                 # repeat for 2 epochs
    dataset = dataset.batch(5)                  # set batch_size = 5

    iterator = dataset.make_one_shot_iterator()
    next_data = iterator.get_next()

    while True:
        try:
            data, label = sess.run(next_data)
            print(data)
            print(label)
        except tf.errors.OutOfRangeError:
            break
```
Usage example
```python
load_tfrecords(srcfile="./data.tfrecords")
```
Output
```
# 10 samples, 2 epochs, equivalent to 20 samples, 5 samples per batch
[[0. 1. 2. 3. 4. 0. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 5. 6. 7. 8. 0.]
 [0. 1. 2. 3. 4. 5. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 5. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 0. 0. 0. 0. 0.]]
[0 0 0 0 0]
[[0. 1. 2. 3. 4. 0. 0. 0. 0. 0.]
 [0. 1. 2. 3. 0. 0. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 5. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 0. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 5. 6. 0. 0. 0.]]
[1 1 1 1 1]
[[0. 1. 2. 3. 4. 0. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 5. 6. 7. 8. 0.]
 [0. 1. 2. 3. 4. 5. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 5. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 0. 0. 0. 0. 0.]]
[0 0 0 0 0]
[[0. 1. 2. 3. 4. 0. 0. 0. 0. 0.]
 [0. 1. 2. 3. 0. 0. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 5. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 0. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 5. 6. 0. 0. 0.]]
[1 1 1 1 1]
```
Training on TFRecord data with the Dataset API

Preparing the dataset

For training, we use the iris dataset and store it as a TFRecord file.
```python
from sklearn.datasets import load_iris

iris = load_iris()
data = iris.data
label = iris.target  # the attribute is `target`, not `label`
save_tfrecords(data, label, "./iris.tfrecord")
```
Designing the model

Here we simply use a two-layer neural network with ReLU as the activation function.
```python
def model_function(X=None, Y=None):
    # data & label
    if X is None or Y is None:
        X = tf.placeholder(tf.float32, [None, 4])
        Y = tf.placeholder(tf.int64, [None,])
    # params
    W1 = tf.Variable(tf.random_normal([4, 32], 0.0, 0.01))
    b1 = tf.Variable(tf.zeros([32,]))
    W2 = tf.Variable(tf.random_normal([32, 3], 0.0, 0.01))
    b2 = tf.Variable(tf.zeros([3,]))
    # transform
    H1 = tf.nn.relu(tf.matmul(X, W1) + b1)
    H2 = tf.nn.relu(tf.matmul(H1, W2) + b2)
    cross_entropy = tf.losses.sparse_softmax_cross_entropy(Y, H2)
    return X, Y, cross_entropy
```
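model_function only returns the loss. If you also want to monitor classification accuracy, a small hypothetical helper like the one below could be used; it assumes model_function is additionally made to return the logits H2 (not part of the original post):

```python
def accuracy_op(logits, labels):
    # Predicted class = index of the largest logit; labels are int64 class ids.
    predictions = tf.argmax(logits, axis=1)
    correct = tf.equal(predictions, labels)
    return tf.reduce_mean(tf.cast(correct, tf.float32))
```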
Routine Training
First, as a baseline, here is the usual placeholder-based training, which feeds each batch's data into the network from memory. This method consumes a lot of memory, but in theory it should be faster because no additional IO operations are required.
```python
import time

def common_training():
    iris = load_iris()
    data = iris.data
    label = iris.target

    with tf.Session() as sess:
        X, Y, loss = model_function()
        training_op = tf.train.AdamOptimizer().minimize(loss)
        tf.global_variables_initializer().run()

        start = time.time()
        for epoch in range(1000):
            S = 0
            for batch in range(3):
                index = range(batch * 50, (batch + 1) * 50)
                batch_x, batch_y = data[index], label[index]
                L, _ = sess.run([loss, training_op], feed_dict={X: batch_x, Y: batch_y})
                S += L
            if epoch % 100 == 0:
                print(S / 3.0, len(index), len(batch_x))
        print(time.time() - start, 's')
```
Reading data from TFRecord and feeding it into model training

Initialize a tf.data.TFRecordDataset with the TFRecord file, set the batch size and number of epochs, and simply run(loss) during training; the data advances to the next batch automatically. Note that when the input pipeline reaches its end an error is thrown, so `except tf.errors.OutOfRangeError` is needed to catch it.
```python
def tfrecord_training():
    sess = tf.Session()

    iris = tf.data.TFRecordDataset("./iris.tfrecord")
    iris = iris.map(_parse_function)
    iris = iris.batch(50)
    iris = iris.repeat(1000)

    iterator = iris.make_one_shot_iterator()
    next_example, next_label = iterator.get_next()

    _, _, loss = model_function(next_example, next_label)
    training_op = tf.train.AdamOptimizer().minimize(loss)
    sess.run(tf.global_variables_initializer())  # must initialize variables

    start = time.time()
    for epoch in range(1000):
        S = 0
        for batch in range(3):
            try:
                L, _ = sess.run([loss, training_op])
            except tf.errors.OutOfRangeError:
                break
            S += L
        if epoch % 100 == 0:
            print(S, S / 3.0)
    print(time.time() - start, 's')
```
Speed comparison

common_training takes 4 seconds, while tfrecord_training takes 8 seconds.

I expected Dataset to be faster, but it is actually slower, and the slowdown shows up in every batch of training.

My guess is that Dataset.map() and Dataset.batch() only set up a function interface, so the map operation is applied to each batch at run time and the file may not even be read until then; this adds IO and parsing time to every step, whereas common_training simply takes data from memory.

Thus, the advantage of tfrecord_training should be that it does not need to read all the data into memory beforehand, which makes it better suited to training on large datasets, at the cost of per-step processing time.
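If the per-batch overhead of the input pipeline is the bottleneck, the TF 1.x Dataset API also provides `num_parallel_calls` on map() and `prefetch()`, which overlap parsing/IO with training. Below is a hedged sketch (not from the original post, and it assumes TF >= 1.4) of how the pipeline in tfrecord_training could be rearranged; whether it actually closes the gap would have to be measured, and for a dataset as small as iris the in-memory approach will likely remain faster.

```python
def make_input_pipeline(srcfile, batch_size=50, epochs=1000):
    # Hypothetical variant of the pipeline used in tfrecord_training.
    dataset = tf.data.TFRecordDataset(srcfile)
    dataset = dataset.map(_parse_function, num_parallel_calls=4)  # parse records in parallel
    dataset = dataset.repeat(epochs)
    dataset = dataset.batch(batch_size)
    dataset = dataset.prefetch(1)  # prepare the next batch while the current one is training
    return dataset.make_one_shot_iterator().get_next()
```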