Working with tf.data (TF 2)

Introduction to tf.data (TF2)
H2O Meetup
Galvanize San Francisco
02/19/2020
Oswald Campesato
ocampesato@yahoo.com

Highlights/Overview
What is tf.data?
Working with TF 2 tf.data.Dataset
Intermediate operators
Terminal operators
filter() and map()
zip() and batch()
Working with TF 2 generators

tf.data: TF Input Pipeline
 An input pipeline is useful:
 for streaming data
 when data is too big to fit in memory
 when data requires preprocessing
 when you need to shuffle large data
 when the work must scale to multiple hosts
 => ETL functionality
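The ETL idea can be sketched in a few lines. This is a minimal illustration, assuming a small in-memory NumPy array (the name `features` is hypothetical) in place of a real data source:

```python
import numpy as np
import tensorflow as tf

# Extract: hypothetical in-memory data standing in for a real source.
features = np.arange(10, dtype=np.float32)
ds = tf.data.Dataset.from_tensor_slices(features)

# Transform: preprocess each element, then shuffle and batch.
ds = ds.map(lambda v: v / 10.0)
ds = ds.shuffle(buffer_size=10, seed=0)
ds = ds.batch(4)

# Load: iterate over the batches; a training loop would consume them here.
batches = [batch.numpy() for batch in ds]
for batch in batches:
    print(batch)
```

The 10 values arrive as batches of 4, 4, and 2; the order inside the batches depends on the shuffle.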

What are tf.data.Datasets
 Simple example:
 1) define a Numpy array of numbers
 2) create a TF Dataset ds
 3) iterate through the dataset ds

What are TF Datasets
import tensorflow as tf # tf-dataset1.py
import numpy as np
x = np.array([1,2,3,4,5])
ds = tf.data.Dataset.from_tensor_slices(x)
# iterate through the elements:
for value in ds.take(len(x)):
    print(value)

What are Lambda Expressions
 a lambda expression is an anonymous function
 use lambda expressions to define local functions
 pass lambda expressions as arguments
 return them as the value of function calls
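The three uses listed above can be shown in plain Python. This is a small sketch; the names `add_one`, `make_multiplier`, and `triple` are made up for illustration:

```python
# Define a local function with a lambda expression.
add_one = lambda x: x + 1

# Pass a lambda expression as an argument.
squares = list(map(lambda x: x * x, [1, 2, 3]))

# Return a lambda expression as the value of a function call.
def make_multiplier(factor):  # hypothetical helper
    return lambda x: x * factor

triple = make_multiplier(3)
print(add_one(4), squares, triple(5))
```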

Some tf.data “lazy operators”
map()
filter()
flat_map()
batch()
take()
zip()
flatten()
=> Combined via “method chaining”
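A short sketch of method chaining, assuming a toy dataset built with `tf.data.Dataset.range`. Building the chain only describes the pipeline; elements are produced when the dataset is iterated:

```python
import tensorflow as tf

# Chain several operators; nothing is computed at this point.
ds = (tf.data.Dataset.range(10)
      .filter(lambda v: tf.equal(v % 2, 0))  # keep 0, 2, 4, 6, 8
      .map(lambda v: v * v)                  # square each value
      .batch(2)                              # group into pairs
      .take(2))                              # keep the first two batches

# Iterating the dataset triggers the actual work.
result = [batch.numpy().tolist() for batch in ds]
print(result)
```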

tf.data “lazy operators”
 filter():
 uses a Boolean condition to keep only the dataset elements that satisfy it
 map(): a projection
 applies a function (often a lambda expression) to each input element
 flat_map():
 maps each element of the input dataset to a Dataset of elements, then flattens the results into one dataset
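Since flat_map() is the least familiar of the three, here is a minimal sketch of it, assuming a small 2x2 input. Each row becomes its own Dataset, and the rows are flattened into one stream of scalars:

```python
import tensorflow as tf

rows = tf.data.Dataset.from_tensor_slices([[1, 2], [3, 4]])

# Each row is turned into a Dataset of its items; flat_map concatenates them.
flat = rows.flat_map(lambda row: tf.data.Dataset.from_tensor_slices(row))

flat_values = [int(v.numpy()) for v in flat]
print(flat_values)
```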

tf.data “lazy operators”
 batch(n):
 groups n consecutive elements into one batch per iteration
 repeat(n):
 repeats the dataset n times
 take(n):
 keeps at most the first n elements
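The three operators can be compared side by side on the same toy dataset (a sketch; the variable names are illustrative):

```python
import tensorflow as tf

ds = tf.data.Dataset.range(5)  # 0, 1, 2, 3, 4

# batch(2): the last batch is smaller because 5 is not divisible by 2.
batched = [b.numpy().tolist() for b in ds.batch(2)]

# repeat(2): the whole dataset is emitted twice.
repeated = [int(v.numpy()) for v in ds.repeat(2)]

# take(3): only the first three elements are kept.
taken = [int(v.numpy()) for v in ds.take(3)]

print(batched, repeated, taken)
```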

tf.data.Dataset.from_tensors()
 import tensorflow as tf
 #combine the input into one element
 t1 = tf.constant([[1, 2], [3, 4]])
 ds1 = tf.data.Dataset.from_tensors(t1)
 # output: [[1, 2], [3, 4]]

tf.data.Dataset.from_tensor_slices()
 import tensorflow as tf
 # separate element for each item
 t2 = tf.constant([[1, 2], [3, 4]])
 ds1 = tf.data.Dataset.from_tensor_slices(t2)
 # output: [1, 2], [3, 4]
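The difference between the two constructors is easiest to see by counting elements. A minimal sketch using the same 2x2 tensor:

```python
import tensorflow as tf

t = tf.constant([[1, 2], [3, 4]])

# from_tensors: the whole tensor is a single dataset element.
whole = list(tf.data.Dataset.from_tensors(t))

# from_tensor_slices: the tensor is split along its first axis.
rows = list(tf.data.Dataset.from_tensor_slices(t))

print(len(whole), whole[0].shape)  # one element of shape (2, 2)
print(len(rows), rows[0].shape)    # two elements of shape (2,)
```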

TF2 Datasets: code sample
import numpy as np
import tensorflow as tf
x = np.arange(0, 10)
# create a dataset from a NumPy array
ds = tf.data.Dataset.from_tensor_slices(x)

TF filter() operator: ex #1
import tensorflow as tf # tf2_filter1.py
import numpy as np
x = np.array([1,2,3,4,5])
ds = tf.data.Dataset.from_tensor_slices(x)
print("First iteration:")
for value in ds:
    print("value:",value)

 First iteration:
 value: tf.Tensor(1, shape=(), dtype=int64)

import tensorflow as tf # tf2_filter2.py
import numpy as np
x = np.array([1,2,3,4,5])
ds = tf.data.Dataset.from_tensor_slices(x)

# use tf.math.equal(x, y) for an element-wise
# equality comparison inside the filter function
def filter_fn(x):
    return tf.math.equal(x, 1)
ds = ds.filter(filter_fn)
print("Second iteration:")
for value in ds:
    print("value:",value)

 Second iteration:

What are Lambda Expressions
 a lambda expression takes an input variable
 performs an operation on that variable
 A "bare bones" lambda expression:
lambda x: x + 1
 => this adds 1 to an input variable x

 import tensorflow as tf # tf2_filter3.py
 ds = tf.data.Dataset.from_tensor_slices([1,2,3,4,5])
 ds = ds.filter(lambda x: x < 4) # [1,2,3]

 # "tf.math.equal(x, y)" is required
 # for equality comparison
 return tf.math.equal(x, 1)
 print("Second iteration:")

 Second iteration:

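def filter_fn(x): # keep the even values
    return tf.equal(x % 2, 0)
x = np.array([1,2,3,4,5,6,7,8,9,10])
ds = tf.data.Dataset.from_tensor_slices(x).filter(filter_fn)

def filter_fn(x): # also keeps the even values; tf.reshape(..., []) returns a scalar
    return tf.reshape(tf.not_equal(x % 2, 1), [])
x = np.array([1,2,3,4,5,6,7,8,9,10])
ds = tf.data.Dataset.from_tensor_slices(x).filter(filter_fn)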

TF map() operator: ex #1
import tensorflow as tf # tf2-map.py
import numpy as np
x = np.array([[1],[2],[3],[4]])
ds = tf.data.Dataset.from_tensor_slices(x)
ds = ds.map(lambda x: x*2)
for value in ds:
    print("value:",value)

 value: tf.Tensor([2], shape=(1,), dtype=int64)

TF map() and filter() operators
import tensorflow as tf # tf2_map_filter.py
import numpy as np
x = np.array([1,2,3,4,5,6,7,8,9,10])
ds = tf.data.Dataset.from_tensor_slices(x)
# filter first, then square
ds1 = ds.filter(lambda x: tf.equal(x % 4, 0))
ds1 = ds1.map(lambda x: x*x)
# square first, then filter
ds2 = ds.map(lambda x: x*x)
ds2 = ds2.filter(lambda x: tf.equal(x % 4, 0))

for value1 in ds1:
    print("value1:",value1)
for value2 in ds2:
    print("value2:",value2)

 value1: tf.Tensor(16, shape=(), dtype=int64)

import tensorflow as tf # tf2-map2.py
import numpy as np
x = np.array([[1],[2],[3],[4]])
ds = tf.data.Dataset.from_tensor_slices(x)
# METHOD #1: THE LONG WAY
# a lambda expression to double each value
#ds = ds.map(lambda x: x*2)
# a lambda expression to add one to each value
#ds = ds.map(lambda x: x+1)
# a lambda expression to cube each value
#ds = ds.map(lambda x: x**3)

# METHOD #2: A SHORTER WAY
# an example of "Method Chaining"
ds = ds.map(lambda x: x*2).map(lambda x: x+1).map(lambda x: x**3)
for value in ds:
    print("value:",value)

TF take() operator: ex #1
import tensorflow as tf # tf2-take.py
import numpy as np
ds = tf.data.Dataset.from_tensor_slices(tf.range(8))
ds = ds.take(5)
# take(20) yields only the 5 elements that are available
for value in ds.take(20):
    print("value:",value)

import tensorflow as tf # tf2_take.py
import numpy as np
x = np.array([[1],[2],[3],[4]])
# make a ds from a numpy array
ds = tf.data.Dataset.from_tensor_slices(x)
ds = ds.map(lambda x: x*2).map(lambda x: x+1).map(lambda x: x**3)

TF 2 map() and take(): output

TF zip() operator: ex #1
import tensorflow as tf # tf2_zip1.py
dx = tf.data.Dataset.from_tensor_slices([0,1,2,3,4])
dy = tf.data.Dataset.from_tensor_slices([1,1,2,3,5])
# zip the two datasets together
d2 = tf.data.Dataset.zip((dx, dy))
for value in d2:
    print("value:",value)

 value:
(<tf.Tensor: id=11, shape=(), dtype=int32, numpy=0>,
<tf.Tensor: id=12, shape=(), dtype=int32, numpy=1>)
 value:
 value:
=> Plus two more rows of output

import tensorflow as tf # tf2_zip_take.py
import numpy as np
x = np.arange(0, 10)
y = np.arange(1, 11)
dx = tf.data.Dataset.from_tensor_slices(x)
dy = tf.data.Dataset.from_tensor_slices(y)
# zip the two datasets together
d2 = tf.data.Dataset.zip((dx, dy)).batch(3)
for value in d2.take(8):
    print("value:",value)

 value: (<tf.Tensor: id=11, shape=(), dtype=int32,
numpy=0>, <tf.Tensor: id=12, shape=(), dtype=int32,
numpy=1>)
numpy=1>)
numpy=2>)
numpy=3>)
numpy=5>)

import tensorflow as tf # tf2_zip_batch.py
ds1 = tf.data.Dataset.range(100)
ds2 = tf.data.Dataset.range(0, -100, -1)
ds3 = tf.data.Dataset.zip((ds1, ds2))
ds4 = ds3.batch(4)
for value in ds4.take(4):
    print("value:",value)

 value: (<tf.Tensor: id=21, shape=(4,), dtype=int64,
numpy=array([0, 1, 2, 3])>, <tf.Tensor: id=22,
shape=(4,), dtype=int64, numpy=array([ 0, -1, -2, -
3])>)
shape=(4,), dtype=int64, numpy=array([-4, -5, -6, -
7])>)
numpy=array([ 8, 9, 10, 11])>, <tf.Tensor: id=30,
shape=(4,), dtype=int64, numpy=array([ -8, -9, -10,
-11])>)
shape=(4,), dtype=int64, numpy=array([-12, -13, -14,
-15])>)

TF 2 generators
Python functions
Containing your custom code
Specified in the dataset definition
Invoked when data is requested
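The last point, that the generator is invoked only when data is requested, can be checked with a small sketch. It assumes TF 2.4 or later for the `output_signature` argument; the `calls` list and the `count_up` generator are hypothetical names used only for illustration:

```python
import tensorflow as tf

calls = []  # records each time the generator function is started

def count_up():  # hypothetical generator that yields 0, 1, 2
    calls.append("started")
    for i in range(3):
        yield i

# Defining the dataset does not run the generator.
ds = tf.data.Dataset.from_generator(
    count_up,
    output_signature=tf.TensorSpec(shape=(), dtype=tf.int64))
calls_before = len(calls)

# Iterating the dataset requests data, which starts the generator.
values = [int(v.numpy()) for v in ds]
calls_after = len(calls)
print(calls_before, values, calls_after)
```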

Generator Functions (1)
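import tensorflow as tf # tf2_generator1.py
import numpy as np
x = np.arange(0, 10) # assumed sample data; the original array is not shown
def gener():
    i = 0
    while(i < len(x)):
        yield i
        i += 1

ds = tf.data.Dataset.from_generator(gener, (tf.int64))
size = 2*len(x)
# the generator is exhausted after len(x) values
for value in ds.take(size):
    print("value:",value)
# value: tf.Tensor(0, shape=(), dtype=int64)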

import tensorflow as tf # tf2-timesthree.py
import numpy as np
x = np.arange(0, 5) # 0, 1, 2, 3, 4
def gener():
    for i in x:
        yield (3*i)
ds = tf.data.Dataset.from_generator(gener, (tf.int64))
for value in ds.take(len(x)):
    print("1value:",value)
for value in ds.take(2*len(x)):
    print("2value:",value)

 1value: tf.Tensor(0, shape=(), dtype=int64)

import tensorflow as tf # tf2_generator3.py
import numpy as np
x = np.arange(0, 12)
def gener():
    i = 0
    while(i < len(x)):
        yield (i, i+1, i+2)
        i += 3

ds = tf.data.Dataset.from_generator(
    gener,
    (tf.int64,tf.int64,tf.int64))
third = int(len(x)/3)
for value in ds.take(third):
    print("value:",value)
#value:
#(<tf.Tensor: id=35, shape=(),dtype=int64,numpy=0>,
# <tf.Tensor: id=36, shape=(),dtype=int64,numpy=1>,
# <tf.Tensor: id=37, shape=(),dtype=int64,numpy=2>)
#value:
#(<tf.Tensor: id=41, shape=(),dtype=int64,numpy=3>,
# <tf.Tensor: id=42, shape=(),dtype=int64,numpy=4>,

Processing Text Files (1)
 define a TF Dataset with lines in file.txt
 skip lines that start with a “#” character
 then display only the first two lines
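The three steps can be tried without an existing file. This sketch writes a temporary file first; its name and contents are made up for illustration and are not the `file.txt` from the slides:

```python
import os
import tempfile
import tensorflow as tf

# Hypothetical sample file; the contents are invented for this sketch.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as f:
    f.write("# a comment\nfirst line\n# another comment\nsecond line\nthird line\n")

# Define a dataset with the lines of the file.
ds = tf.data.TextLineDataset(path)

# Skip lines that start with a "#" character.
ds = ds.filter(lambda line: tf.not_equal(tf.strings.substr(line, 0, 1), "#"))

# Display only the first two remaining lines.
lines = [line.numpy().decode("utf-8") for line in ds.take(2)]
print(lines)
```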

Contents of file.txt
 #this is file line #1
 this is file line #3
 this is file line #5

import tensorflow as tf # tf2_flatmap_filter.py
filenames = ["file.txt"]
ds = tf.data.Dataset.from_tensor_slices(filenames)

ds = ds.flat_map(
    lambda filename: (
        tf.data.TextLineDataset(filename)
        .skip(1)
        .filter(lambda line:
            tf.not_equal(tf.strings.substr(line,0,1),"#"))))
# display only the first two lines
for value in ds.take(2):
    print("value:",value)

Text Output (first two lines)
('value:', <tf.Tensor: id=16, shape=(5,), dtype=string,
numpy=array(['this', 'is', 'file', 'line', '#3'], dtype=object)>)
('value:', <tf.Tensor: id=18, shape=(5,), dtype=string,
numpy=array(['this', 'is', 'file', 'line', '#5'], dtype=object)>)

Tokenizers and tf.text
import tensorflow as tf # NB: requires TF 2
import tensorflow_text as text
# pip3 install -q tensorflow-text
docs = tf.data.Dataset.from_tensor_slices(
    [['Chicago Pizza'],
     ["how are you"]])
tokenizer = text.WhitespaceTokenizer()
tokenized_docs = docs.map(
    lambda x: tokenizer.tokenize(x))

Tokenizers and tf.text
iterator = iter(tokenized_docs)
print(next(iterator).to_list())
print(next(iterator).to_list())
# output:
# [[b'Chicago', b'Pizza']]
# [[b'how', b'are', b'you']]

tf.data and MNIST
import tensorflow as tf # tf2_mnist.py
train, test = tf.keras.datasets.mnist.load_data()
mnist_x, mnist_y = train
mnist_ds = tf.data.Dataset.from_tensor_slices(mnist_x)
print(mnist_ds)
for value in mnist_ds.take(2):
    print("value:",value)

tf.data and MNIST
 value: tf.Tensor(
 [[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0
 0 0 0 0 0 0 0 0 0 0]
 [ 0 0 0 0 0 0 0 0 0 0 0 0 3 18
18 18 126 136
 175 26 166 255 247 127 0 0 0 0]
 [ 0 0 0 0 0 0 0 0 30 36 94 154 170 253
253 253 253 253
 225 172 253 242 195 64 0 0 0 0]
 [ 0 0 0 0 0 0 0 49 238 253 253 253 253 253
253 253 253 251
 93 82 82 56 39 0 0 0 0 0]

TF2 generator Example
import tensorflow as tf # tf2_generator2.py
import numpy as np
x = np.arange(0, 12)
def gener():
    i = 0
    while(i < len(x)):
        yield (i, i+1, i+2) # three integers at a time
        i += 3
ds = tf.data.Dataset.from_generator(gener, (tf.int64,tf.int64,tf.int64))
third = int(len(x)/3)
for value in ds.take(third):
    print("value:",value)

TF2 generator Example
 value:
(<tf.Tensor: id=35, shape=(), dtype=int64,
numpy=0>,
<tf.Tensor: id=36, shape=(), dtype=int64,
numpy=1>,
numpy=2>)
 value:
(<tf.Tensor: id=38, shape=(), dtype=int64,
numpy=3>,
numpy=4>,

TF 2 tf.data.TFRecordDataset
 # tf_records: a list of TFRecord file names; preprocess: your parsing function
 ds = tf.data.TFRecordDataset(tf_records)
 ds = ds.map(preprocess)
 ds = ds.batch(batch_size=32)
 OR:
 ds = (tf.data.TFRecordDataset(tf_records)
       .map(preprocess)
       .batch(batch_size=32))
 model = ... # a Keras model
 model.fit(ds, epochs=20)
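A runnable version of the read, map, and batch steps might look like the sketch below. The record layout (a single int64 feature named "x"), the temporary file, and the `preprocess` function are all assumptions made for illustration; the Keras `model.fit` step is left out:

```python
import os
import tempfile
import tensorflow as tf

# Write a hypothetical TFRecord file with five records, each holding one int64.
path = os.path.join(tempfile.mkdtemp(), "sample.tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    for i in range(5):
        example = tf.train.Example(features=tf.train.Features(feature={
            "x": tf.train.Feature(int64_list=tf.train.Int64List(value=[i])),
        }))
        writer.write(example.SerializeToString())

# Describe how to decode one serialized record.
feature_spec = {"x": tf.io.FixedLenFeature([], tf.int64)}

def preprocess(record):  # hypothetical preprocessing: parse, then double
    parsed = tf.io.parse_single_example(record, feature_spec)
    return parsed["x"] * 2

ds = tf.data.TFRecordDataset(path).map(preprocess).batch(2)
batches = [b.numpy().tolist() for b in ds]
print(batches)
```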

Use prefetch() for Performance
 .prefetch(buffer_size=X)
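A minimal sketch of where prefetch() usually sits, assuming a toy dataset and an arbitrary buffer size of 2. Prefetching overlaps the production of upcoming elements with their consumption and does not change the values:

```python
import tensorflow as tf

ds = (tf.data.Dataset.range(6)
      .map(lambda v: v * 10)
      .prefetch(buffer_size=2))  # prepare up to 2 elements ahead of the consumer

prefetched = [int(v.numpy()) for v in ds]
print(prefetched)
```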

Parallelize Data Transformations
 .map(preprocess,num_parallel_calls=Y)
 => uses background threads & internal buffer
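As a small sketch, assuming a trivial squaring function in place of real preprocessing and an arbitrary value of 4 parallel calls:

```python
import tensorflow as tf

def preprocess(v):  # hypothetical stand-in for real preprocessing
    return v * v

# Up to 4 elements are processed concurrently; by default the output order
# still matches the input order.
ds = tf.data.Dataset.range(8).map(preprocess, num_parallel_calls=4)

squared = [int(v.numpy()) for v in ds]
print(squared)
```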

Parallelize Data “Readers”
 ds = (tf.data.TFRecordDataset(tf_records,
           num_parallel_reads=Z)
       .map(preprocess,num_parallel_calls=Y))
 => for sharded data
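Reading sharded data in parallel can be sketched with two small shard files. The file names and record contents are invented for illustration; records from different shards may interleave, so the sketch sorts them before printing:

```python
import os
import tempfile
import tensorflow as tf

# Hypothetical shards: two small TFRecord files holding raw byte strings.
shard_dir = tempfile.mkdtemp()
shard_paths = []
for shard in range(2):
    path = os.path.join(shard_dir, "shard-%d.tfrecord" % shard)
    with tf.io.TFRecordWriter(path) as writer:
        for i in range(3):
            writer.write(("shard%d-record%d" % (shard, i)).encode("utf-8"))
    shard_paths.append(path)

# Read both shards concurrently.
ds = tf.data.TFRecordDataset(shard_paths, num_parallel_reads=2)
records = sorted(r.numpy().decode("utf-8") for r in ds)
print(records)
```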

Parallelize Data “Readers”
 Selecting optimal values:
 tf.data.experimental.AUTOTUNE
 Uses Reinforcement Learning
 To tune values during data input
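Instead of hand-picked numbers, the tuning constant can be passed wherever a parallelism or buffer size is expected. A minimal sketch on a toy dataset (newer TF releases also expose the same constant as `tf.data.AUTOTUNE`):

```python
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

ds = (tf.data.Dataset.range(6)
      .map(lambda v: v + 1, num_parallel_calls=AUTOTUNE)  # parallelism chosen at runtime
      .batch(2)
      .prefetch(buffer_size=AUTOTUNE))                    # buffer size chosen at runtime

tuned_batches = [b.numpy().tolist() for b in ds]
print(tuned_batches)
```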

tf.data.Options
 Setting global options:
 Deterministic/non-deterministic
 Statistics
 Optimizations (ex: autotuning)
 Threading (ex: private thread pool)

tf.data.Options
op = tf.data.Options()
op.experimental_optimization.map_fusion=True
ds = ds.with_options(op)

TF 2: Built-in Datasets
tf.keras.datasets.boston_housing
tf.keras.datasets.cifar10
tf.keras.datasets.cifar100
tf.keras.datasets.fashion_mnist
tf.keras.datasets.imdb
tf.keras.datasets.mnist
tf.keras.datasets.reuters

About Me: Recent Books
 1) Python3 and Machine Learning (2020)
 2) Angular 9 and Deep Learning (2020)
 3) Angular 8 & Machine Learning (2020)
 4) AI/ML/DL: Concepts and Code (2020)
 5) Bash Programming on Mac (2020)
 6) TensorFlow 2 Pocket Primer (2019)
 7) TensorFlow 1.x Pocket Primer (2019)
 8) Python for TensorFlow (2019)
 9) C Programming Pocket Primer (2019)

About Me: Less Recent Books
 10) RegEx Pocket Primer (2018)
 11) Data Cleaning Pocket Primer (2018)
 12) Angular Pocket Primer (2017)
 13) Android Pocket Primer (2017)
 14) CSS3 Pocket Primer (2016)
 15) SVG Pocket Primer (2016)
 16) Python Pocket Primer (2015)
 17) D3 Pocket Primer (2015)
 18) HTML5 Mobile Pocket Primer (2014)

About Me: Older Books
 19) jQuery, CSS3, and HTML5 (2013)
 20) HTML5 Pocket Primer (2013)
 21) jQuery Pocket Primer (2013)
 22) HTML5 Canvas (2012)
 23) Flash on Android (2011)
 24) Web 2.0 Fundamentals (2010)
 25) MS Silverlight Graphics (2008)
 26) Fundamentals of SVG (2003)
 27) Java Graphics Library (2002)

ML/DL/NLP/DRL Classes
ML/DL/NLP/RL Instructor at UCSC (Santa Clara):
Deep Learning (TF2/Keras) (01/30/2020) 10 weeks
NLP (Transformer/BERT/etc) (02/21/2020) 10 weeks
ML (ML, NLP, RL) (05/01/2020) 10 weeks
DRL (PPO/A3C/SAC/etc) (05/07/2020) 10 weeks
ML (ML, NLP, RL) (06/16/2020) 10 weeks
NLP (Transformer/BERT/etc) (06/29/2020) 10 weeks
UCSC link:
https://www.ucsc-extension.edu/certificate-program/offering/deep-learning-and-artificial-intelligence-tensorflow
