Session 2
Professional Machine Learning Engineer
Vasudev
@vasudevmaduri
Agenda
1. Where are we on our journey
2. Session 2 Content Review
3. Sample Question Review
4. Q&A
5. Preview actions for next week
Where are we on our journey
Professional Machine Learning Certification Learning Journey
Organized by Google Developer Groups Surrey, co-hosting with GDG Seattle

Session 1 (Feb 24, 2024, virtual)
  Talk: Lightning talk + Kick-off & Machine Learning Basics + Q&A
  Prep: Review the Professional ML Engineer Exam Guide
  Complete course: Introduction to AI and Machine Learning on Google Cloud; Launching into Machine Learning

Session 2 (Mar 2, 2024, virtual)
  Talk: Lightning talk + GCP - TensorFlow & Feature Engineering + Q&A
  Prep: Review the Professional ML Engineer Sample Questions
  Complete course: TensorFlow on Google Cloud; Feature Engineering

Session 3 (Mar 9, 2024, virtual)
  Talk: Lightning talk + Enterprise Machine Learning + Q&A
  Prep: Go through Google Cloud Platform Big Data and Machine Learning Fundamentals
  Complete course: Machine Learning in the Enterprise

Session 4 (Mar 16, 2024, virtual)
  Talk: Lightning talk + Production Machine Learning with Google Cloud + Q&A
  Prep: Hands On Lab Practice: Perform Foundational Data, ML, and AI Tasks in Google Cloud (Skill Badge) - 7hrs
  Hands On Lab Practice: Production Machine Learning Systems; Computer Vision Fundamentals with Google Cloud

Session 5 (Mar 23, 2024, virtual)
  Talk: Lightning talk + NLP & Recommendation Systems on GCP + Q&A
  Prep: Build and Deploy ML Solutions on Vertex AI (Skill Badge) - 8hrs
  Complete course: Natural Language Processing on Google Cloud; Recommendation Systems on GCP

Session 6 (Apr 6, 2024, virtual)
  Talk: Lightning talk + MLOps & ML Pipelines on GCP + Q&A
  Prep: Self study (and potential exam)
  Complete course: ML Ops - Getting Started; ML Pipelines on Google Cloud
  Check readiness: Professional ML Engineer Sample Questions
Session 2 Content Review

Session 2 Study Group
Preparation and Processing
- Data ingestion.
- Data exploration (EDA).
- Design data pipelines.
- Build data pipelines.
- Feature engineering.
Solution: TensorFlow Extended (TFX)
An end-to-end tool for deploying production ML systems (tensorflow.org/tfx)

Pipeline components: Data Ingestion > TensorFlow Data Validation > TensorFlow Transform > Estimator or Keras Model > TensorFlow Model Analysis > TensorFlow Serving > Logging

Shared infrastructure: Pipeline Storage; Shared Utilities for Garbage Collection and Data Access Controls; Shared Configuration Framework and Job Orchestration; Integrated Frontend for Job Management, Monitoring, Debugging, and Data/Model/Evaluation Visualization
Data Ingestion Challenges
- Data might not fit into memory
- Data might require (randomized) pre-processing
- Efficiently utilize hardware
- Decouple loading + pre-processing from deployment
tf.data: TensorFlow Input Pipeline

Extract:
- read data from memory / storage
- parse file format

Transform:
- text vectorization
- image transformations
- video temporal sampling
- shuffling, batching, ...

Load:
- transfer data to the accelerator

[Chart: utilization (flops) over time for the CPU and accelerators]
Data Ingestion Example

import tensorflow as tf

def preprocess(record):
    ...

dataset = tf.data.TFRecordDataset(".../*.tfrecord")  # reads data from storage
dataset = dataset.map(preprocess)                    # applies user-defined preprocessing
dataset = dataset.batch(batch_size=32)               # batches data for training efficiency

model = ...
model.fit(dataset, epochs=10)  # training APIs natively support tf.data
Efficient Resource Utilization

Input Pipeline Performance

import tensorflow as tf

def preprocess(record):
    ...

dataset = tf.data.TFRecordDataset(".../*.tfrecord")
dataset = dataset.map(preprocess)
dataset = dataset.batch(batch_size=32)
dataset = dataset.prefetch(buffer_size=X)  # keeps the next batches ready while the model trains

model = ...
model.fit(dataset, epochs=10)
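What prefetch buys you can be illustrated without TensorFlow: a background producer keeps a small buffer of preprocessed items ready while the consumer is still busy with the current one. A framework-agnostic sketch (the buffer size and the toy "training" step are illustrative):

```python
import queue
import threading

def prefetch(batches, buffer_size):
    """Yield items from `batches` while a background thread keeps
    up to `buffer_size` items ready ahead of the consumer."""
    q = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # marks the end of the stream

    def producer():
        for b in batches:
            q.put(b)  # blocks when the buffer is full
        q.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            return
        yield item

# The "training" loop consumes batches while the producer stays ahead of it.
result = [b * 2 for b in prefetch(range(5), buffer_size=2)]
```

Because the producer only blocks when the buffer is full, input preparation overlaps with consumption, which is exactly the idle-time-filling behavior `dataset.prefetch` provides.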
Parallel Transformation

import tensorflow as tf

def preprocess(record):
    ...

dataset = tf.data.TFRecordDataset(".../*.tfrecord")
dataset = dataset.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)  # runs preprocessing in parallel, tuned automatically
dataset = dataset.batch(batch_size=32)
dataset = dataset.prefetch(buffer_size=X)

model = ...
model.fit(dataset, epochs=10)
Parallel Extraction

import tensorflow as tf

def preprocess(record):
    ...

dataset = tf.data.TFRecordDataset(".../*.tfrecord", num_parallel_reads=N)  # reads multiple files in parallel
dataset = dataset.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
dataset = dataset.batch(batch_size=32)
dataset = dataset.prefetch(buffer_size=X)

model = ...
model.fit(dataset, epochs=10)
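The effect of num_parallel_calls has a direct analogue in plain Python: mapping the preprocessing function over records with a thread pool instead of one at a time. A minimal sketch (the preprocess body and worker count are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def preprocess(record):
    # stand-in for an expensive per-record transformation
    return record.strip().lower()

records = ["  Alpha", "BETA ", " Gamma  "]

# Sequential map: one record at a time.
sequential = [preprocess(r) for r in records]

# Parallel map: several records in flight at once, output order preserved,
# analogous to dataset.map(preprocess, num_parallel_calls=4).
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(preprocess, records))
```

Like tf.data, `pool.map` preserves input order even though the calls run concurrently; tf.data's AUTOTUNE additionally picks the parallelism level for you at runtime.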
TensorFlow Data Validation (TFDV)
- Helps developers understand, validate, and monitor their ML data at scale
- Used to analyze and validate petabytes of data at Google every day
- Has a proven track record in maintaining the health of production ML pipelines
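TFDV's core idea, infer a schema from training data and then flag serving data that deviates from it, can be sketched in plain Python. This is a conceptual sketch only, not TFDV's actual API; the schema fields and anomaly messages below are illustrative:

```python
def infer_schema(rows):
    """Record, per column, the observed type and value range."""
    schema = {}
    for row in rows:
        for col, val in row.items():
            s = schema.setdefault(col, {"type": type(val), "min": val, "max": val})
            s["min"] = min(s["min"], val)
            s["max"] = max(s["max"], val)
    return schema

def validate(rows, schema):
    """Return anomaly strings for values that fall outside the schema."""
    anomalies = []
    for row in rows:
        for col, val in row.items():
            s = schema.get(col)
            if s is None:
                anomalies.append(f"unexpected column: {col}")
            elif not isinstance(val, s["type"]):
                anomalies.append(f"{col}: wrong type {type(val).__name__}")
            elif not (s["min"] <= val <= s["max"]):
                anomalies.append(f"{col}: {val} outside [{s['min']}, {s['max']}]")
    return anomalies

train = [{"age": 25}, {"age": 40}]
schema = infer_schema(train)
serving = [{"age": 30}, {"age": 95}]  # 95 falls outside the training range
anomalies = validate(serving, schema)
```

TFDV does this over petabyte-scale data with much richer statistics (distributions, missingness, drift metrics), but the analyze-then-validate loop is the same shape.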
Feature columns

- NUMERIC: a number, passed through as a 1-element vector.
- BUCKETIZED: partition by range; a number is assigned to a bucket (e.g. M / L / XL) and 1-hot encoded, e.g. [0 1 0].
- CATEGORICAL: a fixed vocabulary (e.g. Red, Green, Blue) 1-hot encoded, specified with a vocab list, vocab files, or identity.
- CROSS: capture feature interactions; crossing [Red, Green, Blue] with [M, L, XL] yields N x M categories (Red-M, Red-L, Red-XL, ..., Blue-XL).
- EMBEDDING: learn new representations for many categories; each category's 1-hot index selects a row of trainable params (e.g. the embedding for Cat.3 is [α6 α7 α8]).
- HASHING: limit vocabulary size when there are effectively infinite categories (Cat.1 ... Cat.1K ... Cat.1M) by hashing them into a fixed number of buckets.
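These transformations are easy to sketch without any framework. A minimal illustration (the bucket boundaries, vocabularies, and bucket count are made up for the example):

```python
def one_hot(value, vocab):
    """CATEGORICAL: 1-hot encode against a fixed vocabulary."""
    return [1 if v == value else 0 for v in vocab]

def bucketize(number, boundaries):
    """BUCKETIZED: partition by range, then 1-hot the bucket index."""
    idx = sum(number >= b for b in boundaries)
    return [1 if i == idx else 0 for i in range(len(boundaries) + 1)]

def cross(a, b, vocab_a, vocab_b):
    """CROSS: one category out of the N x M combinations."""
    return one_hot((a, b), [(x, y) for x in vocab_a for y in vocab_b])

def hash_bucket(value, num_buckets):
    """HASHING: map an unbounded vocabulary into a fixed number of buckets.
    (Python's built-in hash is process-seeded; real systems use a stable hash.)"""
    return hash(value) % num_buckets

colors, sizes = ["Red", "Green", "Blue"], ["M", "L", "XL"]
encoded = one_hot("Green", colors)           # [0, 1, 0]
bucket = bucketize(7, boundaries=[5, 10])    # 7 falls in the middle bucket
crossed = cross("Red", "XL", colors, sizes)  # 1 of 9 cross categories
```

An embedding then replaces the long 1-hot cross vector with a short trainable vector looked up by the same index, which is why embeddings are the usual follow-up to high-cardinality crosses and hashes.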
There are three possible places to do feature engineering, each of which has its pros and cons. The pipeline runs: Inputs > Preprocessing > Feature creation > Train model (with hyper-parameter tuning) > Model, with preprocessed features feeding training.

1. In TensorFlow itself (inside the model's input path):
   *efficient  *tf methods only  *this input only
2. With tf.transform:
   *efficient  *tf methods only  *aggregates (full-pass statistics over the training data)
3. With Beam/Dataflow:
   *in a pipeline  *Python/Java code  *time-windows
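The key difference between option 1 (per-example, in-model) and option 2 (tf.transform-style) is that the latter can use full-pass aggregates computed in a separate analyze phase, then apply the same per-example transform at both training and serving time. A framework-agnostic sketch of z-score scaling (the choice of statistic is illustrative):

```python
def analyze(values):
    """Full pass over the training data: compute aggregate statistics.
    tf.transform runs this once (e.g. on Beam/Dataflow) before training."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return {"mean": mean, "std": var ** 0.5}

def transform(value, stats):
    """Per-example step: identical at training and serving time,
    which avoids training/serving skew."""
    return (value - stats["mean"]) / stats["std"]

train = [2.0, 4.0, 6.0, 8.0]
stats = analyze(train)                        # mean = 5.0, std = sqrt(5)
scaled = [transform(v, stats) for v in train]
```

A purely per-example transform (option 1) could never compute that mean, which is why aggregate features push you toward tf.transform or a Beam/Dataflow pipeline.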
Sample Questions Review
Different cities in California have markedly different housing prices. Suppose you must create a model to predict housing prices. Which of the following sets of features or feature crosses could learn city-specific relationships between roomsPerPerson and housing price?

A. Three separate binned features: [binned latitude], [binned longitude], [roomsPerPerson]
B. Two feature crosses: [binned latitude x roomsPerPerson] and [binned longitude x roomsPerPerson]
C. One feature cross: [latitude x longitude x roomsPerPerson]
D. One feature cross: [binned latitude x binned longitude x binned roomsPerPerson]
Correct answer: D. Crossing binned latitude with binned longitude approximates individual cities, and crossing that with binned roomsPerPerson lets the model learn per-city relationships between roomsPerPerson and price. Binning matters: crossing raw continuous values, as in C, does not produce meaningful categories.
Q&A
Preview actions for next week
By our next meeting
1. Complete
a. Machine Learning in the Enterprise
Link to badge
Redeem your participation badge
Thank you for joining the event
Thank you for tuning in!
For any operational questions about access to Cloud Skills Boost or the Road to Google Developers Certification program, contact: gdg-support@google.com