TensorFlow 2.0 Technology Analysis and Practice: The Dataset Class in Detail

Data pipeline: Dataset


Knowledge tree

1. Dataset class operations

1.1 Creating a dataset with the Dataset class

The tf.data.Dataset class is used to create and instantiate datasets.
The most common factory methods are:

  • tf.data.Dataset.from_tensors(): creates a dataset object by merging the input into a single element, returning a dataset containing exactly one element.
  • tf.data.Dataset.from_tensor_slices(): creates a dataset object by slicing the input along its first dimension. The input can be one or more tensors; multiple tensors must be assembled as a tuple or dictionary.
  • tf.data.Dataset.from_generator(): generates the required dataset iteratively from a Python generator; generally used when the amount of data is large.

Note: a Dataset can be seen as an ordered list of "elements" of the same type. In practice, a single "element" can be a vector, a string, an image, or even a tuple or dict.
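The difference between from_tensors() and from_tensor_slices() is easiest to see side by side. A minimal sketch (assuming TensorFlow 2.x with eager execution):

```python
import tensorflow as tf

t = tf.constant([[1, 2], [3, 4]])

# from_tensors: the whole tensor becomes ONE element of shape (2, 2)
ds1 = tf.data.Dataset.from_tensors(t)

# from_tensor_slices: sliced along dimension 0 -> TWO elements of shape (2,)
ds2 = tf.data.Dataset.from_tensor_slices(t)

print(len(list(ds1)))  # 1
print(len(list(ds2)))  # 2
```

Here the same 2x2 tensor yields a one-element dataset in the first case and a two-element dataset in the second.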

1.2 Transforming data with the Dataset class

Dataset provides a rich set of transformation functions:

  • map(f): applies the function f to each element of the dataset to produce a new dataset (this is often combined with tf.io for reading, writing, and decoding files, and tf.image for image processing).
  • shuffle(buffer_size): shuffles the dataset using a fixed-size buffer: the first buffer_size elements are loaded into the buffer, elements are sampled randomly from it, and each sampled element is replaced by a subsequent one from the dataset.
  • repeat(count): repeats the dataset count times.
  • batch(batch_size): splits the dataset into batches, i.e. for every batch_size elements, uses tf.stack() to merge them along dimension 0 into a single element.
  • flat_map(): maps the given function over each element of the Dataset and flattens the resulting nested Datasets.
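The first four transformations above are typically chained into an input pipeline. A minimal sketch (the ordering and parameter values here are illustrative, not prescriptive):

```python
import tensorflow as tf

# A typical input pipeline chaining map, shuffle, repeat, and batch
ds = tf.data.Dataset.range(10)
ds = ds.map(lambda x: x * 2)    # apply f to every element -> 0, 2, ..., 18
ds = ds.shuffle(buffer_size=5)  # randomize order within a 5-element buffer
ds = ds.repeat(2)               # iterate the dataset twice
ds = ds.batch(4)                # stack every 4 elements into one batch

for batch in ds:
    print(batch.numpy())
```

The exact order of printed values varies because of shuffle, but each run emits 20 even values (every element of the mapped dataset, twice), grouped into batches of up to 4.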

flat_map() program case 1:

import tensorflow as tf
a = tf.data.Dataset.range(1, 6)  # ==> [ 1, 2, 3, 4, 5 ]
b = a.flat_map(lambda x: tf.data.Dataset.from_tensors(x).repeat(6))
for item in b:
    print(item.numpy(), end=', ')

Running the program produces:

1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 

flat_map() program case 2:

import tensorflow as tf
dataset = tf.data.Dataset.from_tensor_slices([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
dataset_flat = dataset.flat_map(lambda x: tf.data.Dataset.from_tensor_slices(x))
# Iterate the ORIGINAL dataset here, for contrast with case 3
for line in dataset:
    print(line)

Running the program produces:

tf.Tensor([1 2 3], shape=(3,), dtype=int32)
tf.Tensor([4 5 6], shape=(3,), dtype=int32)
tf.Tensor([7 8 9], shape=(3,), dtype=int32)

flat_map() program case 3:

import tensorflow as tf
dataset = tf.data.Dataset.from_tensor_slices([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
dataset_flat = dataset.flat_map(lambda x: tf.data.Dataset.from_tensor_slices(x))
for line in dataset_flat:
    print(line)

Running the program produces:

tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)
  • interleave(): similar to flat_map(), but can interleave data from different sources.
  • take(): takes the first count elements of the dataset.
  • filter(): filters out elements that do not satisfy a predicate.
  • zip(): zips two datasets of the same length together element-wise.
    zip program case 1:
a = tf.data.Dataset.range(1, 4)  # ==> [ 1, 2, 3 ]
b = tf.data.Dataset.range(4, 7) # ==> [ 4, 5, 6 ]
ds = tf.data.Dataset.zip((a, b))
for line in ds:
    print(line)

Running the program produces:

(<tf.Tensor: id=182, shape=(), dtype=int64, numpy=1>, <tf.Tensor: id=183, shape=(), dtype=int64, numpy=4>)
(<tf.Tensor: id=184, shape=(), dtype=int64, numpy=2>, <tf.Tensor: id=185, shape=(), dtype=int64, numpy=5>)
(<tf.Tensor: id=186, shape=(), dtype=int64, numpy=3>, <tf.Tensor: id=187, shape=(), dtype=int64, numpy=6>)

zip program case 2:

a = tf.data.Dataset.range(1, 4)  # ==> [ 1, 2, 3 ]
b = tf.data.Dataset.range(4, 7) # ==> [ 4, 5, 6 ]
ds = tf.data.Dataset.zip((b, a))
for line in ds:
    print(line)

Running the program produces:

(<tf.Tensor: id=194, shape=(), dtype=int64, numpy=4>, <tf.Tensor: id=195, shape=(), dtype=int64, numpy=1>)
(<tf.Tensor: id=196, shape=(), dtype=int64, numpy=5>, <tf.Tensor: id=197, shape=(), dtype=int64, numpy=2>)
(<tf.Tensor: id=198, shape=(), dtype=int64, numpy=6>, <tf.Tensor: id=199, shape=(), dtype=int64, numpy=3>)
  • concatenate(): appends one dataset to the end of another.

concatenate program case:

a = tf.data.Dataset.range(1, 4)  # ==> [ 1, 2, 3 ]
b = tf.data.Dataset.range(4, 7) # ==> [ 4, 5, 6 ]
ds = a.concatenate(b)
for line in ds:
    print(line)

Running the program produces:

tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(5, shape=(), dtype=int64)
tf.Tensor(6, shape=(), dtype=int64)
  • reduce(): aggregates the dataset's elements into a single value.
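take(), filter(), and reduce() have no examples above, so here is a minimal sketch of all three (assuming TensorFlow 2.1+ for as_numpy_iterator()):

```python
import tensorflow as tf

ds = tf.data.Dataset.range(10)  # ==> [ 0, 1, ..., 9 ]

first_three = ds.take(3)                 # keep only the first 3 elements
evens = ds.filter(lambda x: x % 2 == 0)  # keep elements matching a predicate
# reduce(initial_state, reduce_func): fold all elements into one value
total = ds.reduce(tf.constant(0, tf.int64), lambda acc, x: acc + x)

print(list(first_three.as_numpy_iterator()))  # [0, 1, 2]
print(list(evens.as_numpy_iterator()))        # [0, 2, 4, 6, 8]
print(total.numpy())                          # 45
```

Note that take() and filter() return new Dataset objects (lazy), while reduce() eagerly consumes the dataset and returns a single tensor.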

interleave() is a method of the Dataset class, so it operates on a Dataset. The method first takes cycle_length elements from the Dataset and applies map_func to each of them, producing cycle_length new Dataset objects. It then pulls data from these newly generated Dataset objects in turn, taking block_length elements from each per round. When one of the newly generated Datasets is exhausted, another element is taken from the original Dataset, map_func is applied to it, and so on.
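This behavior can be sketched with a small example (parameter values are illustrative):

```python
import tensorflow as tf

a = tf.data.Dataset.range(1, 4)  # ==> [ 1, 2, 3 ]
# For each element x, build a sub-dataset [x, x, x]; pull from
# cycle_length=2 sub-datasets at a time, block_length=2 per turn.
ds = a.interleave(
    lambda x: tf.data.Dataset.from_tensors(x).repeat(3),
    cycle_length=2,
    block_length=2)

print([int(v) for v in ds])  # [1, 1, 2, 2, 1, 2, 3, 3, 3]
```

Two blocks of two come from the first two sub-datasets, then their single leftovers, and only then does the third sub-dataset (built from the next element of the original Dataset) enter the cycle.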


Added by Steven_belfast on Mon, 22 Jun 2020 10:09:24 +0300