Data Orchestration Summit
www.alluxio.io/data-orchestration-summit-2019
November 7, 2019
tf.data: TensorFlow Input Pipeline
Speakers:
Jiri Simsa, Google
3. Why input pipeline API?
- data might not fit into memory
- data might require (randomized) pre-processing
- efficiently utilize hardware
- decouple loading + pre-processing from distribution
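The motivations above can be sketched without any framework: a generator streams records one at a time and applies randomized pre-processing on the fly, so the full dataset never needs to fit in memory and the pre-processing step stays decoupled from the training loop. This is a minimal illustration, not the tf.data API; the names (`stream_records`, `augment`, `input_pipeline`) are hypothetical.

```python
import random

def stream_records(num_records):
    """Yield records one at a time; the full dataset never lives in memory."""
    for i in range(num_records):
        yield {"pixels": [i % 256] * 4, "label": i % 10}

def augment(record, rng):
    """Randomized pre-processing applied per record (e.g. a random flip)."""
    pixels = record["pixels"]
    if rng.random() < 0.5:
        pixels = pixels[::-1]
    return {"pixels": pixels, "label": record["label"]}

def input_pipeline(num_records, seed=0):
    """Chain extraction and transformation lazily, record by record."""
    rng = random.Random(seed)
    for record in stream_records(num_records):
        yield augment(record, rng)

# The training loop consumes pre-processed records lazily.
first_labels = [r["label"] for _, r in zip(range(3), input_pipeline(1000))]
```

Because everything is a generator, swapping in a different data source or augmentation changes one function, not the training loop.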
4. tf.data: TensorFlow Input Pipeline
Extract:
- read data from memory / storage
- parse file format
Transform:
- text vectorization
- image transformations
- video temporal sampling
- shuffling, batching, …
Load:
- transfer data to the accelerator
[Figure: flops over time — CPU vs. accelerators]
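The Load phase is about keeping the accelerator busy: a bounded buffer lets the CPU-side extract/transform work run ahead of (and overlap with) consumption. The sketch below is a pure-Python stand-in for that prefetching idea using a background thread and a `queue.Queue`; it is illustrative only and does not reflect tf.data's actual implementation.

```python
import queue
import threading

def producer(records, buf):
    """CPU side: extract + transform, filling a bounded buffer."""
    for r in records:
        buf.put(r * 2)          # stand-in for per-record pre-processing
    buf.put(None)               # sentinel: end of stream

def consume_with_prefetch(records, buffer_size=4):
    """Consumer (the "accelerator") overlaps with the producer via the buffer."""
    buf = queue.Queue(maxsize=buffer_size)
    t = threading.Thread(target=producer, args=(records, buf))
    t.start()
    out = []
    while True:
        item = buf.get()
        if item is None:
            break
        out.append(item)
    t.join()
    return out

result = consume_with_prefetch(range(5))
```

The bounded buffer is the key design choice: it caps memory use while still letting production run ahead of consumption, which is what `Dataset.prefetch` provides in tf.data.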
24. TFDS: TensorFlow Datasets
- https://www.tensorflow.org/datasets/datasets
- canned datasets, ready to use
import tensorflow as tf
import tensorflow_datasets as tfds
# See available datasets
print(tfds.list_builders())
# Construct a tf.data.Dataset
dataset = tfds.load(name="mnist", split=tfds.Split.TRAIN)
# Customize your input pipeline
dataset = dataset.shuffle(1024).batch(32)
for features in dataset.take(1):
  image, label = features["image"], features["label"]