MLOps & ML Pipelines on GCP - Session 6, RGDC

Session 6
Professional Machine Learning Engineer
Vasudev
@vasudevmaduri

Where are we on our journey
1
Session 6 Content Review
2
Sample Questions Review
4
Q&A
5
Exam Information
3
6 Next Steps

Professional Machine Learning Certification
Learning Journey organized by Google Developer Groups Surrey, co-hosted with GDG Seattle
Session 1 (Feb 24, 2024, Virtual): Lightning talk + Kick-off & Machine Learning Basics + Q&A
  Complete course: Introduction to AI and Machine Learning on Google Cloud; Launching into Machine Learning
Session 2 (Mar 2, 2024, Virtual): Lightning talk + GCP - TensorFlow & Feature Engineering + Q&A
  Complete course: TensorFlow on Google Cloud; Feature Engineering
Session 3 (Mar 9, 2024, Virtual): Lightning talk + Enterprise Machine Learning + Q&A
  Complete course: Machine Learning in the Enterprise
Session 4 (Mar 16, 2024, Virtual): Production ML Systems and Computer Vision with Google Cloud + Q&A
  Hands On Lab Practice: Production Machine Learning Systems; Computer Vision Fundamentals with Google Cloud
Session 5 (Mar 23, 2024, Virtual): Lightning talk + NLP & Recommendation Systems on GCP + Q&A
  Complete course: Natural Language Processing on Google Cloud; Recommendation Systems on GCP
Session 6 (Apr 6, 2024, Virtual): MLOps & ML Pipelines on GCP + Q&A
  Complete course: ML Ops - Getting Started; ML Pipelines on Google Cloud
Review the Professional ML Engineer Exam Guide
Review the Professional ML Engineer Sample Questions
Go through: Google Cloud Platform Big Data and Machine Learning Fundamentals
Hands On Lab Practice: Perform Foundational Data, ML, and AI Tasks in Google Cloud (Skill Badge) - 7hrs
Build and Deploy ML Solutions on Vertex AI
Self study (and potential exam)
Check Readiness: Professional ML Engineer Sample Questions

Session 6
Study Group
Exam Tips
- Review
- Registering for the Exam
- Tips for managing your time / test taking strategy

Summarizing the four ML options
                           Pre-Built APIs    BigQuery ML      AutoML            Vertex AI
Data type                  Tabular, image,   Tabular          Tabular, image,   No limits
                           text, and video                    text, and video
Training data size         No data required  Medium to large  Medium            Medium to large
ML and coding expertise    Low               Medium           Low               High
Flexibility to tune        None              Medium           None              High
hyperparameters
Time to train a model      None              Medium           Medium            Long

Section 1: ML Problem Framing
Translate business challenge into ML use
case. Considerations include:
● Defining business problems
● Identifying non-ML solutions
● Defining output use
● Managing incorrect results
● Identifying data sources
● Mapping business problem to ML
problem. What should our label be?
What should our features be?
● Does this problem actually require
ML? (Ex: Find the avg. number of units
manufactured by month for the last
year)
● Making sure label maps to business
decision
● Knowing that we need labelled data
for supervised ML

Define ML problem. Considerations include:
● Defining problem type (classification,
regression, clustering, etc.)
● Defining outcome of model
predictions
● Defining the input (features) and
predicted output format
● General ML terminology
● Regression = numeric / continuous
label
● Classification = discrete class label
● Outputs of classification models are
probabilities for each class (sigmoid for
binary classification, softmax for
N-class)
● Features must be numeric, so one-hot
encode categorical features
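
The sigmoid, softmax, and one-hot bullets above can be illustrated with a minimal pure-Python sketch. The function names and the `colors` vocabulary are hypothetical and chosen only for illustration; in practice a framework such as TensorFlow would provide these operations.

```python
import math


def sigmoid(logit):
    """Map a single logit to a probability for the positive class."""
    return 1.0 / (1.0 + math.exp(-logit))


def softmax(logits):
    """Map N logits to N probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot(value, vocabulary):
    """Encode a categorical value as a 0/1 vector over a fixed vocabulary."""
    return [1 if value == item else 0 for item in vocabulary]


colors = ["red", "green", "blue"]       # hypothetical categorical vocabulary
binary_prob = sigmoid(0.0)              # binary classification output
class_probs = softmax([2.0, 1.0, 0.1])  # 3-class classification output
encoded = one_hot("green", colors)      # numeric form of a categorical feature
```

A logit of 0 maps to a probability of 0.5, and the softmax outputs keep the same ordering as the logits while summing to 1.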

Define business success criteria.
Considerations include:
● Success metrics
● Key results
● Determination of when a model is
deemed unsuccessful
● Precision = TP / (TP + FP)
○ Positive predictive value: Of all
examples the model predicted
positive, what percentage were
actually positive?
● Recall = TP / (TP + FN)
○ True positive rate: Of all positive
examples, what percentage did
my model correctly predict as
positive?
● AUC ROC = Area under the ROC curve,
which plots TPR against FPR
○ Threshold independent
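
As a rough illustration of the metrics above, the sketch below computes precision and recall at a chosen threshold, and AUC ROC through its pairwise-ranking interpretation (the fraction of positive/negative pairs in which the positive example receives the higher score). The labels and scores are made-up toy data, and the function names are hypothetical.

```python
def precision_recall(labels, scores, threshold):
    """Precision = TP / (TP + FP) and recall = TP / (TP + FN) at one threshold."""
    tp = fp = fn = 0
    for label, score in zip(labels, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif not predicted_positive and label == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 0, 1, 0, 0]              # toy ground truth
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]  # toy model scores

low_threshold = precision_recall(labels, scores, 0.5)
high_threshold = precision_recall(labels, scores, 0.75)
auc = roc_auc(labels, scores)
```

Raising the threshold from 0.5 to 0.75 changes precision on this toy data, while the AUC value is computed once from the scores alone, which is what "threshold independent" refers to.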

Identify risks to feasibility and
implementation of ML solution.
● Assessing and communicating
business impact
● Assessing ML solution readiness
● Assessing data readiness
● Aligning with Google AI principles and
practices (e.g. different biases)
● ML Readiness = data + infra
● Important section of AI Principles: “AI
algorithms and datasets can reflect,
reinforce, or reduce unfair biases.”

Section 2: ML Solution Architecture
Design reliable, scalable, highly available ML
solutions. Considerations include:
● Optimizing data use and storage
● Data connections
● Automation of data preparation and
model training/deployment
● SDLC best practices
● GCS = Unstructured Data (or
structured data in Parquet, Avro, etc.)
● BigQuery = Structured Data
● Bigtable = Structured Data
○ Low latency & High throughput
● ML is an iterative process that
follows concrete steps:
○ Data ingest/analysis/exploration
○ Data validation
○ Feature engineering
○ Model training
○ Model evaluation/validation
○ Model deployment

Choose appropriate Google Cloud software
components. Considerations include:
● A variety of component types - data
collection; data management
● Exploration/analysis
● Feature engineering
● Logging/management
● Automation
● Monitoring
● Serving
● Data Collection/Management:
○ BigQuery = Batch or Stream
■ 100,000 rows/second with
insert ID (1M without)
■ Latency ~1-2s
○ GCS = Unstructured (usually)
○ Pub/Sub = Stream Ingest
○ Dataflow = Batch or Stream
processing
● Exploration/analysis:
○ BigQuery (SQL)
○ Vertex Workbench Notebooks
(Python)
○ Dataprep (Visual ETL)

● Feature Engineering:
○ TF Transform (Dataflow)
■ Most scalable
○ BigQuery Transform
■ BQML models
○ Keras Lambda Layers
■ Baked into TF graph
(somewhat easier to
implement than tft)
● Logging/management:
○ Understand what gets logged
when using different products.
Ex: BigQuery slot usage, AI
Platform job status, etc.

● Automation:
○ Scheduled BigQuery
ML.PREDICT, ML.FORECAST
queries
○ Kubeflow scheduled runs
○ Vertex Pipelines
○ Composer scheduled runs
○ Cloud Functions for event
triggered runs
● Serverless Serving Infrastructures:
○ Vertex AI (good default choice
for batch/online)
○ Cloud Run (leverage containers,
deploy model as part of app)
○ BigQuery (batch preds on BQML
model)

Design architecture that complies with
regulatory and security concerns.
● Building secure ML systems
● Privacy implications of data usage
● Identifying potential regulatory issues
● Data Loss Prevention (DLP) API for
identifying sensitive data
● Ways to handle sensitive data:
○ Throw it away (not great)
○ Masking / hashing to anonymize
○ Coarsen
■ Ex: Use first 3 digits of ZIP
code instead of full zip
code
■ Conceptually this is
bucketizing to make a
feature non-identifiable at
an individual level
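
A minimal sketch of the masking/hashing and coarsening options above, using only the Python standard library. It does not call the DLP API; the record, the salt value, and the function names are hypothetical, and a salted hash on its own may not meet every regulatory definition of anonymization.

```python
import hashlib


def hash_identifier(value, salt):
    """Replace an identifier with a salted SHA-256 digest (masking/hashing)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def coarsen_zip(zip_code, keep_digits=3):
    """Keep only the leading digits of a ZIP code (coarsening/bucketizing)."""
    return zip_code[:keep_digits]


# Hypothetical raw record containing sensitive fields.
record = {"email": "user@example.com", "zip": "98101", "purchases": 7}

# Training-ready record: identifier hashed, ZIP coarsened, non-sensitive field kept.
safe_record = {
    "email_hash": hash_identifier(record["email"], salt="demo-salt"),
    "zip3": coarsen_zip(record["zip"]),
    "purchases": record["purchases"],
}
```

The coarsened `zip3` value groups many individuals into one bucket, which is the sense in which the feature becomes non-identifiable at an individual level.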

Section 3: Data Preparation and Processing
Data ingestion. Considerations include:
● Ingestion of various file types (e.g.
CSV, JSON, img, Parquet or databases,
Hadoop/Spark)
● Database migration
● Streaming data (e.g. from IoT devices)
● Serializing to TFRecords
● Analytics workloads -> BigQuery
● Hadoop/Spark migration -> Dataproc
● Streaming Data
○ Ingest with Pub/Sub
○ Process with Dataflow

Data exploration (EDA). Considerations
include:
● Visualization
● Statistical fundamentals at scale
● Evaluation of data quality and
feasibility
● Vertex Workbench Notebooks w/
BigQuery magic to sample and explore
your data in Python (Pandas, Matplotlib,
Seaborn)
● Numeric input + numeric output =
Pearson Correlation
● Numeric input + categorical output =
ANOVA
● Categorical input + categorical output =
chi-squared
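
To make two of the statistical tests above concrete, here is a small pure-Python sketch of the Pearson correlation coefficient (numeric input, numeric output) and the chi-squared statistic for a contingency table (categorical input, categorical output). ANOVA is omitted for brevity. The data is made up, and in a notebook one would more likely use an existing statistics library.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def chi_squared(table):
    """Chi-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


positive_r = pearson([1, 2, 3, 4], [2, 4, 6, 8])  # perfectly increasing
negative_r = pearson([1, 2, 3, 4], [8, 6, 4, 2])  # perfectly decreasing
chi2 = chi_squared([[10, 20], [20, 10]])          # toy 2x2 counts
```

A larger chi-squared statistic indicates that the observed counts deviate more from what independence between the two categorical variables would predict.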

Design/build data pipelines. Considerations
include:
● Batching and streaming data pipelines
at scale
● Data privacy and compliance
● Monitoring/changing deployed
pipelines
● Handling missing data
● Handling outliers
● Managing large samples (TFRecords)
● Transformations (TensorFlow
Transform)
● For scalable production systems,
leverage Dataflow (TF Transform) for
transformation
● Vertex Pipelines with Kubeflow
Pipelines or TFX (TensorFlow Extended)
● Options with missing data
○ Throw it away (not great)
○ Impute missing numeric features
○ Create a separate bucket for
missing categorical features
● Outliers
○ Clipping at a max/min value
○ Bucketize and have a “catch-all”
bucket. Try to ensure equal
number of samples in each
bucket
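
The missing-data and outlier options above can be sketched in plain Python as follows. Mean imputation, the "MISSING" bucket name, and quantile-based boundaries are illustrative choices, not the only ones; at production scale the same logic would typically live in TF Transform or a similar tool.

```python
def impute_mean(values):
    """Replace missing numeric values (None) with the mean of the present values."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]


def fill_missing_category(values, missing_bucket="MISSING"):
    """Send missing categorical values to their own bucket."""
    return [missing_bucket if v is None else v for v in values]


def clip(values, low, high):
    """Clip outliers to a minimum and maximum value."""
    return [min(max(v, low), high) for v in values]


def quantile_boundaries(values, num_buckets):
    """Pick boundaries so each bucket holds roughly the same number of samples."""
    ordered = sorted(values)
    return [ordered[len(ordered) * i // num_buckets] for i in range(1, num_buckets)]


def bucketize(value, boundaries):
    """Return the bucket index; the last bucket acts as a catch-all for large values."""
    return sum(value >= b for b in boundaries)


imputed = impute_mean([1.0, None, 3.0])
categories = fill_missing_category(["a", None, "b"])
clipped = clip([-5, 3, 250], 0, 100)

training_values = [1, 2, 3, 4, 5, 6, 7, 8]
boundaries = quantile_boundaries(training_values, 4)
bucket_ids = [bucketize(v, boundaries) for v in training_values]
outlier_bucket = bucketize(1000, boundaries)
```

With these toy values each of the four buckets receives two training samples, and an extreme value such as 1000 falls into the top catch-all bucket instead of distorting the feature scale.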

Feature engineering. Considerations
include:
● Data leakage and augmentation
● Encoding structured data types
● Feature selection
● Class imbalance
● Feature crosses
● Dataset Augmentation (common with
images): tweak existing data in some
small way to create a larger training set
● Feature crosses
○ Capture feature interactions
○ Lead to sparsity
○ Frequently combined with an
embedding layer
● Feature selection
○ See Correlation/ANOVA
○ L1 Regularization

Section 4: ML Model Development
Build a model. Considerations include:
● Choice of framework and model
● Modeling techniques given
interpretability requirements
● Transfer learning
● Model generalization
● Overfitting
● Model Architectures
○ Boosted Trees: Good for
structured data (frequently as
good as DNNs)
○ LSTMs: Time series data
○ Transformers: Popular with NLP
○ CNNs: Images
● Transfer Learning: Repurposing a model
trained on one task to do another task.
Take existing model and train it a bit
more with your data.
● Regularization = techniques to help
model generalize (L1/L2 Reg, Dropout)
● Overfitting = memorizing training data.
Low train loss, high test loss

Train/test a model. Considerations include:
● Productionizing
● Training a model as a job in different
environments
● Tracking metrics during training
● Retraining/redeployment evaluation
● Model performance against baselines,
simpler models, and across the time
dimension
● Model explainability on Vertex AI
● Best practice is to use serverless,
distributed training products (like
Vertex AI)
● Model checkpointing & early stopping
are important
● Always have a common sense baseline
to compare your models with
● Model explainability
○ SHAP (good for Boosted Trees because
they are not differentiable)
○ Integrated Gradients (model must
be piecewise differentiable)
○ Both techniques are supported by
Vertex AI
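
The checkpointing and early-stopping bullet above can be illustrated with a framework-free sketch. It assumes a list of per-epoch validation losses is already available and a hypothetical `patience` setting; a real training loop would compute the loss each epoch and save model weights where the comment indicates.

```python
def train_with_early_stopping(val_losses, patience):
    """Return (best_epoch, last_epoch_run) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: this is where a checkpoint of the model would be saved.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                # Stop early and fall back to the checkpoint from best_epoch.
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


# Toy validation losses: improvement stalls after epoch 2.
losses = [0.9, 0.7, 0.6, 0.62, 0.65, 0.7, 0.5]
best_epoch, stopped_at = train_with_early_stopping(losses, patience=2)
```

Here training stops at epoch 4 and the checkpoint from epoch 2 is kept; the later loss of 0.5 is never reached, which shows the trade-off involved in choosing the patience value.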

Scale model training and serving.
● Distributed training
● Scaling prediction service (e.g. Vertex
AI Prediction, containerized serving)
● Know when to use GPUs / TPUs and
leverage distributed training (Ex:
Training an image classifier with 5M
images on a single machine would take
forever)
● Distribution strategies with TensorFlow
in Vertex AI
● Serverless Serving Infrastructures:
○ Vertex AI
○ Cloud Run (leverage containers,
deploy model as part of app)
○ BigQuery (batch preds on BQML
model)

Section 5: ML Pipeline Automation and Orchestration
Designing and implementing training
pipeline. Considerations include:
● Identification of components,
parameters, triggers, and compute
needs (Cloud Build, Cloud Run)
● Orchestration framework
● Hybrid or multi-cloud strategies
● System design with TFX
components/Kubeflow DSL
● Vertex Pipelines
○ Kubeflow
■ Lightweight Python Components
■ Custom Components
○ TFX
■ Focused on e2e ML Ops, works great
with TF, but also with others
● Cloud Composer/Airflow (generic
orchestrator)
○ Cloud functions to trigger DAG runs
● Cloud Build
○ Github triggers to run Vertex
pipelines
● Constructing a pipeline with Kubeflow
SDK or with TFX
● Vertex Metadata (TFX)

Implement serving pipeline. Considerations
include:
● Serving (online, batch, caching)
● Google Cloud serving options
● Testing for target performance
● Configuring trigger and pipeline
schedules
● Models can be serialized in different
formats. TF SavedModel is the default for
TF models. Scikit-learn models are
commonly saved with pickle (.pkl) or joblib
● Serving options (again):
○ Vertex AI predictions
○ App Engine (some customers do it)
○ Cloud Run
○ BigQuery (batch preds)

Track and audit metadata. Considerations
include:
● Organization and tracking
experiments and pipeline runs
● Hooking into model and dataset
versioning
● Model/dataset lineage
● ML Metadata (TFX)
● Model versioning with Vertex AI
● Vertex Pipelines for model/data lineage

Section 6: ML Solution Monitoring, Optimization, and Maintenance
Monitor ML solutions. Considerations
include:
● Performance and business quality of
ML model predictions
● Logging strategies
● Establishing continuous evaluation
metrics
● Understand GCP permissions model
(IAM)
● Common training and serving errors
(TensorFlow)
● ML model failure and biases
● See what is captured in Vertex AI logs
(job failures, performance metrics, etc.)
● Continuous evaluation metrics - bake
evaluation into the orchestration itself
(inside Vertex Pipelines or a Composer DAG)
● Common training/serving errors
○ Train-serve skew
○ Almost always an issue in the data
● Biases
○ Look at subgroups
○ What-if tool

Tune performance of ML solutions for
training & serving in production.
● Optimization and simplification of
input pipeline for training
● Simplification technique
● Avoid useless intermediary steps
● Start small: Sample your data,
experiment, get a baseline before using
your whole dataset
● Simulate how the model's performance
would degrade over time to influence
retraining policy
○ Retraining policy is a balance of
cost-to-retrain and business
value gained from retraining
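
The retraining trade-off above can be sketched with a toy simulation. It assumes, purely for illustration, that the model loses a fixed number of accuracy points per week since its last retraining, that each lost point costs a fixed amount of business value per week, and that each retraining run has a fixed cost. All numbers and function names are hypothetical.

```python
def average_weekly_cost(interval_weeks, retrain_cost, decay_per_week, value_per_point):
    """Average weekly cost when retraining every `interval_weeks` weeks."""
    # Value lost in a week grows linearly with the model's age (assumed decay shape).
    lost_value = sum(
        decay_per_week * age * value_per_point for age in range(interval_weeks)
    )
    return (retrain_cost + lost_value) / interval_weeks


def best_retraining_interval(max_weeks, retrain_cost, decay_per_week, value_per_point):
    """Pick the interval (in weeks) with the lowest average weekly cost."""
    return min(
        range(1, max_weeks + 1),
        key=lambda weeks: average_weekly_cost(
            weeks, retrain_cost, decay_per_week, value_per_point
        ),
    )


# Hypothetical inputs: 1000 per retraining run, 0.5 accuracy points lost per week,
# 100 of business value per accuracy point per week.
best_interval = best_retraining_interval(
    max_weeks=26, retrain_cost=1000.0, decay_per_week=0.5, value_per_point=100.0
)
weekly_retrain_cost = average_weekly_cost(1, 1000.0, 0.5, 100.0)
```

Under these assumptions retraining every week is dominated by the retraining cost, while very long intervals are dominated by lost value, so the minimum falls in between.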

Key facts
● Taken online or in person
● Exam length: 2 hours
● 50 multiple-choice or multiple-select
questions.
● Register at
https://webassessor.com/wa.do?page=publicHome&branding=GOOGLECLOUD

Tips and tricks
● Apply your experience.
● Read the questions
carefully.
● Mark questions and
review them later.

You need to build an object detection model for a small startup company to identify if and where
the company’s logo appears in an image. You were given a large repository of images, some with
logos and some without. These images are not yet labelled. You need to label these pictures, and
then train and deploy the model. What should you do?
A. Use Google Cloud Data Labelling Service to label your data. Use AutoML Object Detection
to train and deploy the model.
B. Use Vision API to detect and identify logos in pictures and use it as a label. Use AI Platform
to build and train a convolutional neural network.
C. Create two folders: one where the logo appears and one where it doesn’t. Manually place
images in each folder. Use AI Platform to build and train a convolutional neural network.
D. Create two folders: one where the logo appears and one where it doesn’t. Manually place
images in each folder. Use AI Platform to build and train a real time object detection model.

You work for a textile manufacturer and have been asked to build a model to detect and classify
fabric defects. You trained a machine learning model with high recall based on high resolution
images taken at the end of the production line. You want quality control inspectors to gain trust
in your model. Which technique should you use to understand the rationale of your classifier?
A. Use K-fold cross validation to understand how the model performs on different test
datasets.
B. Use the Integrated Gradients method to efficiently compute feature attributions for each
predicted image.
C. Use PCA (Principal Component Analysis) to reduce the original feature set to a smaller set
of easily understood features.
D. Use k-means clustering to group similar images together, and calculate the Davies-Bouldin
index to evaluate the separation between clusters.

You need to write a generic test to verify whether Dense Neural Network (DNN)
models automatically released by your team have a sufficient number of
parameters to learn the task for which they were built. What should you do?
A. Train the model for a few iterations, and check for NaN values.
B. Train the model for a few iterations, and verify that the loss is constant.
C. Train a simple linear model, and determine if the DNN model outperforms
it.
D. Train the model with no regularization, and verify that the loss function is
close to zero.

Your team is using a TensorFlow Inception-v3 CNN model pretrained on ImageNet for
an image classification prediction challenge on 10,000 images. You will use AI Platform
to perform the model training. What TensorFlow distribution strategy and AI Platform
training job configuration should you use to train the model and optimize for wall-clock
time?
A. Default Strategy; Custom tier with a single master node and four v100 GPUs.
B. One Device Strategy; Custom tier with a single master node and four v100 GPUs.
C. One Device Strategy; Custom tier with a single master node and eight v100 GPUs.
D. MirroredStrategy; Custom tier with a single master node and four v100 GPUs.

You work for a maintenance company and have built and trained a deep learning model
that identifies defects based on thermal images of underground electric cables. Your
dataset contains 10,000 images, 100 of which contain visible defects. How should you
evaluate the performance of the model on a test dataset?
A. Calculate the Area Under the Curve (AUC) value.
B. Calculate the number of true positive results predicted by the model.
C. Calculate the fraction of images predicted by the model to have a visible defect.
D. Calculate the Cosine Similarity to compare the model’s performance on the test
dataset to the model’s performance on the training dataset.

You work for a large financial institution that is planning to use Dialogflow to create a chatbot for the
company’s mobile app. You have reviewed old chat logs and tagged each conversation for intent
based on each customer’s stated intention for contacting customer service. About 70% of customer
inquiries are simple requests that are solved within 10 intents. The remaining 30% of inquiries require
much longer and more complicated requests. Which intents should you automate first?
A. Automate a blend of the shortest and longest intents to be representative of all intents.
B. Automate the more complicated requests first because those require more of the agents’ time.
C. Automate the 10 intents that cover 70% of the requests so that live agents can handle the more
complicated requests.
D. Automate intents in places where common words such as “payment” only appear once to avoid
confusing the software.

You work for a gaming company that develops and manages a popular massively multiplayer online
(MMO) game. The game’s environment is open-ended, and a large number of positions and moves
can be taken by a player. Your team has developed an ML model with TensorFlow that predicts the
next move of each player. Edge deployment is not possible, but low-latency serving is required. How
should you configure the deployment?
A. Use a Cloud TPU to optimize model training speed.
B. Use AI Platform Prediction with a NVIDIA GPU to make real-time predictions.
C. Use AI Platform Prediction with a high-CPU machine type to get a batch prediction for the
players.
D. Use AI Platform Prediction with a high-memory machine type to get a batch prediction for the
players.

You work for a large retailer. You want to use ML to forecast future sales leveraging 10 years of
historical sales data. The historical data is stored in Cloud Storage in Avro format. You want to rapidly
experiment with all the available data. How should you build and train your model for the sales
forecast?
A. Load data into BigQuery and use the ARIMA model type on BigQuery ML.
B. Convert the data into CSV format and create a regression model on AutoML Tables.
C. Convert the data into TFRecords and create an RNN model on TensorFlow on AI Platform
Notebooks.
D. Convert and refactor the data into CSV format and use the built-in XGBoost algorithm on AI
Platform Training.

You are an ML engineer at a media company. You want to use machine learning to
analyze video content, identify objects, and alert users if there is inappropriate
content. Which Google Cloud products should you use to build this project?
A. Pub/Sub, Cloud Function, Cloud Vision API
B. Pub/Sub, Cloud IoT, Dataflow, Cloud Vision API, Cloud Logging
C. Pub/Sub, Cloud Function, Video Intelligence API, Cloud Logging
D. Pub/Sub, Cloud Function, AutoML Video Intelligence, Cloud Logging

You work on a team where the process for deploying a model into production starts with data
scientists training different versions of models in a Kubeflow pipeline. The workflow then stores the
new model artifact into the corresponding Cloud Storage bucket. You need to build the next steps of
the pipeline after the submitted model is ready to be tested and deployed in production on AI
Platform. How should you configure the architecture before deploying the model to production?
A. Deploy model in test environment -> Evaluate and test model -> Create a new AI Platform
model version
B. Validate model -> Deploy model in test environment -> Create a new AI Platform model version
C. Create a new AI Platform model version -> Evaluate and test model -> Deploy model in test
environment
D. Create a new AI Platform model version -> Deploy model in test environment -> Validate model

Stay on track
This week (Session 6):
● Take the PMLE Sample Questions
○ ML Ops - Getting Started
○ ML Pipelines on Google Cloud
○ Build and Deploy ML Solutions on Vertex AI
○ Perform Foundational Data, ML, and AI Tasks in Google Cloud
● Share your result on Slack group if you want.
Next week:
● You’re ready to take the exam!
○ Register at: webassessor.com/googlecloud

Link to badge
Redeem your participation badge
Thank you for joining the event

Thank you for
tuning in!
For any operational questions about access to
Cloud Skills Boost or the Road to Google
Developers Certification program contact:
gdg-support@google.com
