TFRecord processing for NumPy

Original Link: https://blog.csdn.net/songbinxu/article/details/80136836

TensorFlow data I/O: storing NumPy arrays as TFRecord files and reading them back

When training a model with TensorFlow, there are three ways to read data:

  • Feed in-memory NumPy data into placeholders each epoch/batch; this only works for small datasets and can be quite memory intensive.
  • Read from a txt or csv file on disk; the IO operations are time consuming.
  • Read from TFRecord files, the format recommended by TensorFlow.

A TFRecord file is a binary file that stores data and labels together; it makes better use of memory and can be copied, moved, and read quickly inside a TensorFlow graph. A TFRecord file contains tf.train.Example protocol buffers: the data is serialized to strings, filled into the protocol buffer, and written to the TFRecord file by a TFRecordWriter.
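As a minimal sketch of that round trip (my own illustration, assuming TensorFlow 1.x, where these protocol-buffer classes live under tf.train):

import numpy as np
import tensorflow as tf

# Build one Example: raw bytes for the data, an int64 for the label
vec = np.arange(4, dtype=np.float32)
example = tf.train.Example(features=tf.train.Features(feature={
    "data": tf.train.Feature(bytes_list=tf.train.BytesList(value=[vec.tostring()])),
    "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[1])),
}))

serialized = example.SerializeToString()            # the bytes a TFRecordWriter would write
restored = tf.train.Example.FromString(serialized)  # parse them back into a protocol buffer
print(np.frombuffer(restored.features.feature["data"].bytes_list.value[0],
                    dtype=np.float32))              # -> [0. 1. 2. 3.]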

Storing NumPy data as a TFRecord

import numpy as np
import tensorflow as tf

def save_tfrecords(data, label, desfile):
    with tf.python_io.TFRecordWriter(desfile) as writer:
        for i in range(len(data)):
            features = tf.train.Features(
                feature = {
                    # cast to float32 so the bytes match tf.decode_raw(..., tf.float32) on the read side
                    "data":tf.train.Feature(bytes_list = tf.train.BytesList(value = [data[i].astype(np.float32).tostring()])),
                    "label":tf.train.Feature(int64_list = tf.train.Int64List(value = [label[i]]))
                }
            )
            example = tf.train.Example(features = features)
            serialized = example.SerializeToString()
            writer.write(serialized)

Example usage

For example, we store a dataset of 10 samples of varying lengths in a TFRecord file along with their labels.

# Pad each sample with zeros to a fixed length
def padding(data, maxlen=10):
    for i in range(len(data)):
        data[i] = np.hstack([data[i], np.zeros((maxlen-len(data[i])))])

lens = np.random.randint(low=3,high=10,size=(10,))
data = [np.arange(l) for l in lens]
padding(data)
label = [0,0,0,0,0,1,1,1,1,1]

save_tfrecords(data, label, "./data.tfrecords")
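With the file written, we can sanity-check its contents at the protocol-buffer level without building a graph. A small sketch of my own, using TF 1.x's tf.python_io.tf_record_iterator (the writer above stores the data as float32, so we decode with that dtype):

import numpy as np
import tensorflow as tf

# Walk the file record by record and parse each serialized Example
for serialized in tf.python_io.tf_record_iterator("./data.tfrecords"):
    example = tf.train.Example.FromString(serialized)
    feat = example.features.feature
    vec = np.frombuffer(feat["data"].bytes_list.value[0], dtype=np.float32)
    lab = feat["label"].int64_list.value[0]
    print(vec, lab)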

Reading a TFRecord back into NumPy

def _parse_function(example_proto):
    features = {"data": tf.FixedLenFeature((), tf.string),
                "label": tf.FixedLenFeature((), tf.int64)}
    parsed_features = tf.parse_single_example(example_proto, features)
    # dtype here must match the dtype the bytes were written with (float32 above)
    data = tf.decode_raw(parsed_features['data'], tf.float32)
    return data, parsed_features["label"]

def load_tfrecords(srcfile):
    sess = tf.Session()

    dataset = tf.data.TFRecordDataset(srcfile) # load the tfrecord file
    dataset = dataset.map(_parse_function)     # parse each record into tensors
    dataset = dataset.repeat(2)                # repeat for 2 epochs
    dataset = dataset.batch(5)                 # set batch_size = 5

    iterator = dataset.make_one_shot_iterator()
    next_data = iterator.get_next()

    while True:
        try:
            data, label = sess.run(next_data)
            print(data)
            print(label)
        except tf.errors.OutOfRangeError: # raised once the dataset is exhausted
            break

Example usage

load_tfrecords(srcfile="./data.tfrecords")

Output

# 10 samples, 2 epochs, i.e. 20 samples in total, 5 samples per batch
[[0. 1. 2. 3. 4. 0. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 5. 6. 7. 8. 0.]
 [0. 1. 2. 3. 4. 5. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 5. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 0. 0. 0. 0. 0.]]
[0 0 0 0 0]
[[0. 1. 2. 3. 4. 0. 0. 0. 0. 0.]
 [0. 1. 2. 3. 0. 0. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 5. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 0. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 5. 6. 0. 0. 0.]]
[1 1 1 1 1]
[[0. 1. 2. 3. 4. 0. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 5. 6. 7. 8. 0.]
 [0. 1. 2. 3. 4. 5. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 5. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 0. 0. 0. 0. 0.]]
[0 0 0 0 0]
[[0. 1. 2. 3. 4. 0. 0. 0. 0. 0.]
 [0. 1. 2. 3. 0. 0. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 5. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 0. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 5. 6. 0. 0. 0.]]
[1 1 1 1 1]
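One detail worth noting: tf.decode_raw returns a 1-D tensor whose length is not known to the graph at build time. When every sample has been padded to the same length, it can help to pin the static shape down in the parse function. A possible variant (the helper name _parse_fixed_length and its length argument are my own, not from the original post):

def _parse_fixed_length(example_proto, length=10):
    # Same parsing as _parse_function, but with the static shape fixed,
    # which makes downstream shape checks (e.g. dense layers) easier.
    features = {"data": tf.FixedLenFeature((), tf.string),
                "label": tf.FixedLenFeature((), tf.int64)}
    parsed = tf.parse_single_example(example_proto, features)
    data = tf.decode_raw(parsed["data"], tf.float32)
    data = tf.reshape(data, [length])  # assumes every sample was padded to `length`
    return data, parsed["label"]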

Training by reading TFRecord with tf.data.Dataset

Preparing the dataset

For training, we pick the iris dataset and store it as a TFRecord file.

from sklearn.datasets import load_iris
iris = load_iris()
data = iris.data
label = iris.target
save_tfrecords(data, label, "./iris.tfrecord")

Designing the model

Here we simply use a two-layer neural network with ReLU as the activation function.

def model_function(X=None, Y=None):
    # data & label: build placeholders unless tensors are passed in
    if X is None or Y is None:
        X = tf.placeholder(tf.float32, [None, 4])
        Y = tf.placeholder(tf.int64, [None,])

    # params
    W1 = tf.Variable(tf.random_normal([4,32], 0.0, 0.01))
    b1 = tf.Variable(tf.zeros([32,]))
    W2 = tf.Variable(tf.random_normal([32,3], 0.0, 0.01))
    b2 = tf.Variable(tf.zeros([3,]))

    # transform
    H1 = tf.nn.relu(tf.matmul(X, W1) + b1)
    H2 = tf.nn.relu(tf.matmul(H1, W2) + b2)

    cross_entropy = tf.losses.sparse_softmax_cross_entropy(Y, H2)

    return X, Y, cross_entropy

Regular training with feed_dict

First, as a baseline, we train with placeholders: each batch's data is read from memory and fed into the network through feed_dict. The problem with this method is that it consumes a lot of memory, but in theory it should be faster because no additional IO operations are required.

import time

def common_training():

    iris = load_iris()
    data = iris.data
    label = iris.target

    with tf.Session() as sess:
        X,Y,loss = model_function()
        training_op = tf.train.AdamOptimizer().minimize(loss)
        tf.global_variables_initializer().run()

        start = time.time()
        for epoch in range(1000):
            S = 0
            for batch in range(3):
                index = range(batch*50, (batch+1)*50)
                batch_x, batch_y = data[index], label[index]
                L, _ = sess.run([loss, training_op], feed_dict={X:batch_x, Y:batch_y})
                S += L
            if epoch % 100 == 0:
                print(S / 3.0, len(index), len(batch_x))
        print(time.time() - start, 's')

Reading data with TFRecord and feeding it into the model for training

Initialize a tf.data.TFRecordDataset object with the TFRecord file and set the batch size and number of epochs; during training you simply run(loss) and the data automatically advances to the next batch. Note that an error is thrown when the file queue reaches the end, so an except tf.errors.OutOfRangeError clause is needed to catch it.

def tfrecord_training():
    sess = tf.Session()
    iris = tf.data.TFRecordDataset("./iris.tfrecord")
    iris = iris.map(_parse_function)
    iris = iris.batch(50)
    iris = iris.repeat(1000)

    iterator = iris.make_one_shot_iterator()
    next_example, next_label = iterator.get_next()

    _, _, loss = model_function(next_example, next_label)
    training_op = tf.train.AdamOptimizer().minimize(loss)

    sess.run(tf.global_variables_initializer()) # must initialize

    start = time.time()
    for epoch in range(1000):
        S = 0
        for batch in range(3):
            try:
                L, _ = sess.run([loss, training_op])
            except tf.errors.OutOfRangeError:
                break
            S += L
        if epoch % 100 == 0:
            print(S, S/3.0)
    print(time.time()-start, 's')
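Example usage (a small driver of my own, not from the original post): assuming ./iris.tfrecord has already been written in the "Preparing the dataset" step, the two functions can be run back to back, resetting the default graph in between so the two models do not collide.

if __name__ == "__main__":
    common_training()           # feed_dict baseline, data held in memory
    tf.reset_default_graph()    # start a fresh graph for the second model
    tfrecord_training()         # tf.data pipeline reading ./iris.tfrecord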

Speed comparison

common_training takes about 4 seconds, while tfrecord_training takes about 8 seconds.

I expected the Dataset version to be faster, but it is actually slower, and the extra time shows up in every batch of training.

My guess is that Dataset.map() and Dataset.batch() only set up a function interface, so the map operation is applied to each batch as it is consumed rather than ahead of time; the file may not even be read until then. The IO time is therefore paid during training, whereas common_training simply takes data from memory.

Thus, the advantage of tfrecord_training is that it does not need to load the data into memory beforehand, which makes it better suited to training on large datasets, at the expense of some processing time.
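If IO during training really is the bottleneck, a possible mitigation (not something the original post measured; num_parallel_calls and prefetch are available in later TF 1.x releases) is to parallelize the parsing and prefetch batches so that reading overlaps with training:

iris = tf.data.TFRecordDataset("./iris.tfrecord")
iris = iris.map(_parse_function, num_parallel_calls=4)  # parse records in parallel
iris = iris.batch(50)
iris = iris.repeat(1000)
iris = iris.prefetch(1)  # keep one batch ready while the current one trains

Whether this closes the gap on a dataset as small as iris is another question; the pattern matters more when the data genuinely does not fit in memory.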
