TensorFlow2 function -- tf.data.Dataset.padded_batch

Category: TensorFlow2 functions explained in simple terms

Function:

padded_batch(batch_size, padded_shapes=None, padding_values=None, drop_remainder=False, name=None)

This function combines consecutive elements of the dataset into padded batches: multiple consecutive elements of the input dataset are merged into a single element. Like tf.data.Dataset.batch, the components of the result gain an additional outer dimension of size batch_size (or N % batch_size for the last element, if batch_size does not divide the number of input elements N evenly and drop_remainder is False). If your program depends on every batch having the same outer dimension, set the drop_remainder argument to True to prevent the smaller final batch from being produced.
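
For example, a minimal sketch (using a small made-up dataset of five variable-length rows) of how drop_remainder affects the number of batches:

import tensorflow as tf

ds = (tf.data.Dataset
      .range(1, 6, output_type=tf.int32)
      .map(lambda x: tf.fill([x], x)))          # 5 elements of lengths 1..5

# batch_size=2 does not divide 5 evenly, so the last batch has only 1 element.
print(len(list(ds.padded_batch(2).as_numpy_iterator())))                        # 3
# With drop_remainder=True the smaller final batch is discarded.
print(len(list(ds.padded_batch(2, drop_remainder=True).as_numpy_iterator())))   # 2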

Unlike tf.data.Dataset.batch, the input elements to be batched may have different shapes, and this function pads each component to the corresponding shape in padded_shapes. The padded_shapes argument determines the resulting size of each dimension of each component in an output element:

  • If the dimension is set to a constant, the component is padded out to that length in that dimension.
  • If the dimension is left unset (unknown), the component is padded out to the maximum length of all elements in that dimension.

The examples below start from a dataset whose elements have different lengths:
import tensorflow as tf

A = (tf.data.Dataset
     .range(1, 5, output_type=tf.int32)
     .map(lambda x: tf.fill([x], x)))
for element in A.as_numpy_iterator():
  print(element)

Output:

[1]
[2 2]
[3 3 3]
[4 4 4 4]

Pad each batch to the smallest per-batch size that fits all of its elements:

B = A.padded_batch(2)
for element in B.as_numpy_iterator():
  print(element)

Output:

[[1 0]
 [2 2]]
[[3 3 3 0]
 [4 4 4 4]]

Pad to a fixed size with padded_shapes:

C = A.padded_batch(2, padded_shapes=5)
for element in C.as_numpy_iterator():
  print(element)

Output:

[[1 0 0 0 0]
 [2 2 0 0 0]]
[[3 3 3 0 0]
 [4 4 4 4 0]]

Pad with a custom value via padding_values:

D = A.padded_batch(2, padded_shapes=5, padding_values=-1)
for element in D.as_numpy_iterator():
  print(element)

Output:

[[ 1 -1 -1 -1 -1]
 [ 2  2 -1 -1 -1]]
[[ 3  3  3 -1 -1]
 [ 4  4  4  4 -1]]

Components of nested elements can be padded independently:

elements = [([1, 2, 3], [10]),
            ([4, 5], [11, 12])]
dataset = tf.data.Dataset.from_generator(
    lambda: iter(elements), (tf.int32, tf.int32))

for element in dataset.as_numpy_iterator():
  print(element)

Output:

(array([1, 2, 3], dtype=int32), array([10], dtype=int32))
(array([4, 5], dtype=int32), array([11, 12], dtype=int32))
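
A side note: the positional output_types form of from_generator used above is deprecated in recent TensorFlow releases. A sketch of the equivalent call using output_signature (assuming both components are 1-D int32 vectors of unknown length) would be:

dataset = tf.data.Dataset.from_generator(
    lambda: iter(elements),
    output_signature=(
        tf.TensorSpec(shape=(None,), dtype=tf.int32),   # variable-length first component
        tf.TensorSpec(shape=(None,), dtype=tf.int32)))  # variable-length second component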

Pad the first component of each tuple to length 4 with padding value -1, and pad the second component to the smallest size that fits, using padding value 100:

dataset = dataset.padded_batch(2,
    padded_shapes=([4], [None]),
    padding_values=(-1, 100))
list(dataset.as_numpy_iterator())
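
Output:

[(array([[ 1,  2,  3, -1],
         [ 4,  5, -1, -1]], dtype=int32),
  array([[ 10, 100],
         [ 11,  12]], dtype=int32))]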

Pad multiple components with a single padding value (the scalar is broadcast to every component):

E = tf.data.Dataset.zip((A, A)).padded_batch(2, padding_values=-1)
for element in E.as_numpy_iterator():
  print(element)

Output:

(array([[ 1, -1],
       [ 2,  2]], dtype=int32), array([[ 1, -1],
       [ 2,  2]], dtype=int32))
(array([[ 3,  3,  3, -1],
       [ 4,  4,  4,  4]], dtype=int32), array([[ 3,  3,  3, -1],
       [ 4,  4,  4,  4]], dtype=int32))
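
As a more practical sketch (the token ids and labels below are made up for illustration), a common pattern is to batch variable-length sequences together with scalar labels; only the sequence component actually gets padded, but each component still needs an entry in the padded_shapes structure:

import tensorflow as tf

# Hypothetical token-id sentences of different lengths, with one scalar label each.
sentences = [[3, 7, 1], [5, 2], [9, 4, 6, 8]]
labels = [0, 1, 0]

ds = tf.data.Dataset.from_generator(
    lambda: zip(sentences, labels),
    output_signature=(
        tf.TensorSpec(shape=(None,), dtype=tf.int32),  # variable-length tokens
        tf.TensorSpec(shape=(), dtype=tf.int32)))      # scalar label

# Only the token component is padded; the scalar label keeps shape [].
batched = ds.padded_batch(2, padded_shapes=([None], []), padding_values=(0, 0))

for tokens, lbls in batched.as_numpy_iterator():
    print(tokens, lbls)
# [[3 7 1]
#  [5 2 0]] [0 1]
# [[9 4 6 8]] [0]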

Parameters:

  • batch_size: A tf.int64 scalar tf.Tensor, the number of consecutive elements of this dataset to combine in a single batch.
  • padded_shapes: (Optional.) A (nested) structure of tf.TensorShape or tf.int64 vector tensor-like objects, giving the shape to which the respective component of each input element should be padded prior to batching. Any unknown dimension is padded to the maximum size of that dimension in each batch. If unset, all dimensions of all components are padded to the maximum size in the batch. padded_shapes must be set if any component has an unknown rank.
  • padding_values: (Optional.) A (nested) structure of scalar-shaped tf.Tensor, the padding values to use for the respective components. None means the structure is padded with default values: 0 for numeric types and the empty string for string types (see the short sketch after this list). padding_values should have the same structure as the input dataset; if it is a single element and the dataset has multiple components, the same value is used to pad every component. A scalar padding value is broadcast to match the shape of each component.
  • drop_remainder: (Optional.) A tf.bool scalar tf.Tensor, whether the last batch should be dropped if it has fewer than batch_size elements. Defaults to False.
  • name: (Optional.) A name for the tf.data operation.
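
For example, a minimal sketch (the strings are made up) showing the default padding value for string components:

words = tf.data.Dataset.from_generator(
    lambda: iter([["a"], ["b", "c"]]),
    output_signature=tf.TensorSpec(shape=(None,), dtype=tf.string))
for batch in words.padded_batch(2).as_numpy_iterator():
    print(batch)
# [[b'a' b'']
#  [b'b' b'c']]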

Return value:

  • Dataset: A tf.data.Dataset.

Exceptions:

  • ValueError: If a component has an unknown rank and the padded_shapes argument is not set (see the sketch after this list).
  • TypeError: If padding_values does not match the type of the corresponding component.
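
A minimal sketch of the ValueError case, assuming a generator whose element rank is deliberately left unknown:

unknown_rank = tf.data.Dataset.from_generator(
    lambda: iter([[1], [2, 3]]),
    output_signature=tf.TensorSpec(shape=None, dtype=tf.int32))  # rank unknown

try:
    unknown_rank.padded_batch(2)        # no padded_shapes -> ValueError
except ValueError as err:
    print(err)

ok = unknown_rank.padded_batch(2, padded_shapes=[None])  # fine once a shape is given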

Function implementation:

  def padded_batch(self,
                   batch_size,
                   padded_shapes=None,
                   padding_values=None,
                   drop_remainder=False,
                   name=None):
    """Combines consecutive elements of this dataset into padded batches.
    This transformation combines multiple consecutive elements of the input
    dataset into a single element.
    Like `tf.data.Dataset.batch`, the components of the resulting element will
    have an additional outer dimension, which will be `batch_size` (or
    `N % batch_size` for the last element if `batch_size` does not divide the
    number of input elements `N` evenly and `drop_remainder` is `False`). If
    your program depends on the batches having the same outer dimension, you
    should set the `drop_remainder` argument to `True` to prevent the smaller
    batch from being produced.
    Unlike `tf.data.Dataset.batch`, the input elements to be batched may have
    different shapes, and this transformation will pad each component to the
    respective shape in `padded_shapes`. The `padded_shapes` argument
    determines the resulting shape for each dimension of each component in an
    output element:
    * If the dimension is a constant, the component will be padded out to that
      length in that dimension.
    * If the dimension is unknown, the component will be padded out to the
      maximum length of all elements in that dimension.
    >>> A = (tf.data.Dataset
    ...      .range(1, 5, output_type=tf.int32)
    ...      .map(lambda x: tf.fill([x], x)))
    >>> # Pad to the smallest per-batch size that fits all elements.
    >>> B = A.padded_batch(2)
    >>> for element in B.as_numpy_iterator():
    ...   print(element)
    [[1 0]
     [2 2]]
    [[3 3 3 0]
     [4 4 4 4]]
    >>> # Pad to a fixed size.
    >>> C = A.padded_batch(2, padded_shapes=5)
    >>> for element in C.as_numpy_iterator():
    ...   print(element)
    [[1 0 0 0 0]
     [2 2 0 0 0]]
    [[3 3 3 0 0]
     [4 4 4 4 0]]
    >>> # Pad with a custom value.
    >>> D = A.padded_batch(2, padded_shapes=5, padding_values=-1)
    >>> for element in D.as_numpy_iterator():
    ...   print(element)
    [[ 1 -1 -1 -1 -1]
     [ 2  2 -1 -1 -1]]
    [[ 3  3  3 -1 -1]
     [ 4  4  4  4 -1]]
    >>> # Components of nested elements can be padded independently.
    >>> elements = [([1, 2, 3], [10]),
    ...             ([4, 5], [11, 12])]
    >>> dataset = tf.data.Dataset.from_generator(
    ...     lambda: iter(elements), (tf.int32, tf.int32))
    >>> # Pad the first component of the tuple to length 4, and the second
    >>> # component to the smallest size that fits.
    >>> dataset = dataset.padded_batch(2,
    ...     padded_shapes=([4], [None]),
    ...     padding_values=(-1, 100))
    >>> list(dataset.as_numpy_iterator())
    [(array([[ 1,  2,  3, -1], [ 4,  5, -1, -1]], dtype=int32),
      array([[ 10, 100], [ 11,  12]], dtype=int32))]
    >>> # Pad with a single value and multiple components.
    >>> E = tf.data.Dataset.zip((A, A)).padded_batch(2, padding_values=-1)
    >>> for element in E.as_numpy_iterator():
    ...   print(element)
    (array([[ 1, -1],
           [ 2,  2]], dtype=int32), array([[ 1, -1],
           [ 2,  2]], dtype=int32))
    (array([[ 3,  3,  3, -1],
           [ 4,  4,  4,  4]], dtype=int32), array([[ 3,  3,  3, -1],
           [ 4,  4,  4,  4]], dtype=int32))
    See also `tf.data.experimental.dense_to_sparse_batch`, which combines
    elements that may have different shapes into a `tf.sparse.SparseTensor`.
    Args:
      batch_size: A `tf.int64` scalar `tf.Tensor`, representing the number of
        consecutive elements of this dataset to combine in a single batch.
      padded_shapes: (Optional.) A (nested) structure of `tf.TensorShape` or
        `tf.int64` vector tensor-like objects representing the shape to which
        the respective component of each input element should be padded prior
        to batching. Any unknown dimensions will be padded to the maximum size
        of that dimension in each batch. If unset, all dimensions of all
        components are padded to the maximum size in the batch. `padded_shapes`
        must be set if any component has an unknown rank.
      padding_values: (Optional.) A (nested) structure of scalar-shaped
        `tf.Tensor`, representing the padding values to use for the respective
        components. None represents that the (nested) structure should be padded
        with default values.  Defaults are `0` for numeric types and the empty
        string for string types. The `padding_values` should have the same
        (nested) structure as the input dataset. If `padding_values` is a single
        element and the input dataset has multiple components, then the same
        `padding_values` will be used to pad every component of the dataset.
        If `padding_values` is a scalar, then its value will be broadcasted
        to match the shape of each component.
      drop_remainder: (Optional.) A `tf.bool` scalar `tf.Tensor`, representing
        whether the last batch should be dropped in the case it has fewer than
        `batch_size` elements; the default behavior is not to drop the smaller
        batch.
      name: (Optional.) A name for the tf.data operation.
    Returns:
      Dataset: A `Dataset`.
    Raises:
      ValueError: If a component has an unknown rank, and the `padded_shapes`
        argument is not set.
      TypeError: If a component is of an unsupported type. The list of supported
        types is documented in
        https://www.tensorflow.org/guide/data#dataset_structure.
    """
    if padded_shapes is None:
      padded_shapes = get_legacy_output_shapes(self)
      for i, shape in enumerate(nest.flatten(padded_shapes)):
        # A `tf.TensorShape` is only false if its *rank* is unknown.
        if not shape:
          raise ValueError(f"You must provide `padded_shapes` argument because "
                           f"component {i} has unknown rank.")
    return PaddedBatchDataset(
        self,
        batch_size,
        padded_shapes,
        padding_values,
        drop_remainder,
        name=name)

Keywords: Machine Learning, AI, TensorFlow, Deep Learning
