Data Orchestration Summit
www.alluxio.io/data-orchestration-summit-2019
November 7, 2019
tf.data: TensorFlow Input Pipeline
Speakers:
Jiri Simsa, Google
3. Why input pipeline API?
- data might not fit into memory
- data might require (randomized) pre-processing
- efficiently utilize hardware
- decouple loading + pre-processing from distribution
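The motivations above can be sketched without any framework: a generator streams records one at a time and applies randomized pre-processing on the fly, so the full dataset never needs to fit in memory and the pre-processing step stays decoupled from the training loop. This is a minimal illustration, not the tf.data API; the names (`stream_records`, `augment`, `input_pipeline`) are hypothetical.

```python
import random

def stream_records(num_records):
    """Yield records one at a time; the full dataset never lives in memory."""
    for i in range(num_records):
        yield {"pixels": [i % 256] * 4, "label": i % 10}

def augment(record, rng):
    """Randomized pre-processing applied per record (e.g. a random flip)."""
    pixels = record["pixels"]
    if rng.random() < 0.5:
        pixels = pixels[::-1]
    return {"pixels": pixels, "label": record["label"]}

def input_pipeline(num_records, seed=0):
    """Chain extraction and transformation lazily, record by record."""
    rng = random.Random(seed)
    for record in stream_records(num_records):
        yield augment(record, rng)

# The training loop consumes pre-processed records lazily.
first_labels = [r["label"] for _, r in zip(range(3), input_pipeline(1000))]
```

Because everything is a generator, swapping in a different data source or augmentation changes one function, not the training loop.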
4. tf.data: TensorFlow Input Pipeline
Extract:
- read data from memory / storage
- parse file format
Transform:
- text vectorization
- image transformations
- video temporal sampling
- shuffling, batching, …
Load:
- transfer data to the accelerator
[Figure: flops over time — CPU vs. accelerators]
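The Load phase is about keeping the accelerator busy: a bounded buffer lets the CPU-side extract/transform work run ahead of (and overlap with) consumption. The sketch below is a pure-Python stand-in for that prefetching idea using a background thread and a `queue.Queue`; it is illustrative only and does not reflect tf.data's actual implementation.

```python
import queue
import threading

def producer(records, buf):
    """CPU side: extract + transform, filling a bounded buffer."""
    for r in records:
        buf.put(r * 2)          # stand-in for per-record pre-processing
    buf.put(None)               # sentinel: end of stream

def consume_with_prefetch(records, buffer_size=4):
    """Consumer (the "accelerator") overlaps with the producer via the buffer."""
    buf = queue.Queue(maxsize=buffer_size)
    t = threading.Thread(target=producer, args=(records, buf))
    t.start()
    out = []
    while True:
        item = buf.get()
        if item is None:
            break
        out.append(item)
    t.join()
    return out

result = consume_with_prefetch(range(5))
```

The bounded buffer is the key design choice: it caps memory use while still letting production run ahead of consumption, which is what `Dataset.prefetch` provides in tf.data.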
24. TFDS: TensorFlow Datasets
- https://www.tensorflow.org/datasets/datasets
- canned datasets, ready to use
import tensorflow as tf
import tensorflow_datasets as tfds
# See available datasets
print(tfds.list_builders())
# Construct a tf.data.Dataset
dataset = tfds.load(name="mnist", split=tfds.Split.TRAIN)
# Customize your input pipeline
dataset = dataset.shuffle(1024).batch(32)
for features in dataset.take(1):
  image, label = features["image"], features["label"]