Production ML Systems and Computer Vision with Google Cloud

Session 4
Vikram Tiwari
Professional Machine Learning Engineer
Margaret Maynard-Reid

Professional Machine Learning Certification
Learning Journey Organized by Google Developer Groups Surrey co hosting with GDG Seattle
Session 1
Feb 24, 2024
Virtual
Session 2
Mar 2, 2024
Virtual
Session 3
Mar 9, 2024
Virtual
Session 4
Mar 16, 2024
Virtual
Session 5
Mar 23, 2024
Virtual
Session 6
Apr 6, 2024
Virtual Review the
Professional ML
Engineer Exam
Guide
Review the
Professional ML
Engineer Sample
Questions
Go through:
Google Cloud
Platform Big Data
and Machine
Learning
Fundamentals
Hands On Lab
Practice:
Perform
Foundational Data,
ML, and AI Tasks in
Google Cloud
(Skill Badge) - 7hrs
Build and Deploy ML
Solutions on Vertex
AI
(Skill Badge) - 8hrs
Self
study
(and
potential
exam)
Lightning talk +
Kick-off & Machine
Learning Basics +
Q&A
Lightning talk +
GCP- Tensorflow &
Feature Engineering
+ Q&A
Lightning talk +
Enterprise Machine
Learning + Q&A
Production ML
Systems and
Computer Vision
with Google Cloud +
Q&A
Lightning talk + NLP
& Recommendation
Systems on GCP +
Q&A
Lightning talk + MOPs
& ML Pipelines on GCP
+ Q&A
Complete course:
Introduction to AI and
Machine Learning on
Google Cloud
Launching into
Machine Learning
Complete course:
TensorFlow on Google
Cloud
Feature
Engineering
Complete course:
Machine Learning in
the Enterprise
Hands On Lab
Practice:
Production Machine
Learning Systems
Computer Vision
Fundamentals with
Google Cloud
Complete course:
Natural Language
Processing on Google
Cloud
Recommendation
Systems on GCP
Complete course:
ML Ops - Getting
Started
ML Pipelines on Google
Cloud
Check Readiness:
Professional ML
Engineer Sample
Questions

Session 4
Study Group
Computer Vision
● Vision API & AutoML Vision
● Beyond the course
Model Development
● Build a model.
● Train a model.
● Test a model.
● Scale model training and serving.

Computer Vision
on Google Cloud

ML GDE (Google Developer Expert)
GDG Seattle organizer
3D artist
Fashion Designer
Instructor at UW
About me
margaretmz.art
5

Computer Vision on Google Cloud (Cloud Skills Boost)
Course Overview
Module 1
Introduction
to Computer
Vision
Module 2
Vertex AI
AutoML Vision
Module 3
Custom
Training
Moule 4
Convolutional
Neural
Network
Module 5
Working with
Image Data

● What is computer vision
● Different types of computer vision use cases
● Various ML tools on Google Cloud
● Experiment pre-built APIs
Intro to Computer Vision
Module 1

Computer Vision Use Cases
Complexity
Image classification
(single-label)
Classify an image to
a class
Examples
Painting style or
artist
Van Gogh
Image classification
(multi-label)
Classify an image to
multiple classes
Examples
Movie poster genre
action, sci-fi
Feature extraction
Extracting latent
features of an
image with CNN
models
Examples
Visual search
find similar fashion
Object detection
Identify one or
multiple objects
within an image and
their locations with
bounding boxes.
detect UI elements
Segmentation
Classify whether
each pixel of the
image belongs to a
certain class
segment UI elements
Generative models
Computer Vision (+
NLP)
Examples
- Generate new images
- Super resolution
- Image-to-image
- Text-to-image etc.
an generated image
8

Google Cloud Vision API
Module 1 labs
Lab 1
Detecting Labels, Faces and
Landmarks in Images with the Cloud
Vision API
● Create a bucket
● Upload an image (public access)
● Send json request
● Receive json response
Lab 2
Extracting Text from the Images
using the Google Cloud Vision API
● Cloud Functions
● Upload images to Cloud
Storage
● Extract, translate and save
text

● Intro to Vertex AI - Google’s unified AI platform
● Automated ML pipeline with AutoML
● AutoML Vision
● AutoML example
● Options: Vision API vs AutoML Vision vs custom training
Module 2
Vertex AI & Auto ML Vision

● Image classification
● Custom Image classifier with 5-flowers dataset
● TensorFlow:
○ Linear network
○ Neural network
○ Deep Neural Networks (DNN)
● Dropout and Batch Normalization
Module 3
Custom Training

● How to use CNN
● What makes CNN different?
● Key CNN model parameters: filters, # of channels, kernel size etc
● Working with Polling Layers
● Implement CNNs on Vertex with pre-built TensorFlow container using
Vertex workbench
Module 4
Convolutional Neural Networks (CNN)

● Preprocessing (with Keras and TensorFlow dataset)
● Data scarcity problem:
○ Image Augmentation
○ Transfer learning
Module 5 - Image Data

● Why transfer learning?
○ Less data and faster training
● What is transfer learning?
● How to use transfer learning
Transfer learning
Why, What & How
“I know Kungfu…”

Core Services
Cloud Vision API
-Image labeling
-Face detection
-Landmark detection
-Text extraction
(OCR)
AutoML Vision
-Image classification
-Object Detection
Specialized Services
Video Intelligence
API
-Shot change
detection
-Object tracking
-Text detection
Document AI
-Form parsing
-Invoice/receipt
processing
Vertex AI
Gemini Pro Vision
-Visual analysis
-Multimodal Q&A
Computer Vision on Google Cloud
Imagen 2
-Image generation
-Image editing
-Visual captioning
-Visual Q&A

Production ML
@Vikram_Tiwari
is hard

ML lifecycle - Infrastructure
In a notebook Local machine/VM On the cloud
AI Platform
Exploration Phase
(Component Wise Test)
Development Phase
(ML Pipeline Test as a Whole)
Production Phase
(Integrate with Other Products)

23
ML lifecycle - Data
Skew and Drift are the silent killers of
your ML models
Training-Serving Skew
Feature at training: green banana
Feature at serving: yellow banana
Prediction Drift
Feature at serving: changing from
green to yellow

24
Research Production
Objectives Model performance Different stakeholders have different
objectives
Computational priority Fast training, high throughput Fast inference, low latency
Data Static Constantly shifting
Fairness Good to have (sadly) Important
Interpretability Good to have Important
Research vs Production

Session 4
@Vikram_Tiwari
@chipro
@hanneshapke
Production ML
- Architecting production ML systems
- Designing adaptable ML systems
- Designing High-performance ML systems
- Building Hybrid ML systems

Three ways Google Cloud can help you benefit from ML
Retrained models:
your data + our models
Pre-Trained models:
our data + our models
Custom models:
your data + your model
Vision
Translation Natural
Language
Speech-
to-Text
Job Discovery
Video
Intelligence
Text-to
Speech
AutoML
Easy-to-Use, for non-ML engineers Customizable, for Data Scientists
3
1 2
Compute
Engine
GPU Cloud TPU
Cloud
Dataproc
Kubernetes
Engine
BigQuery
AI Platform
Training & Prediction
Translate
NLP
Speech
Vision
Tables
Recomm
dation
Dialogflow
Enterprise

End-to-end environment for AI inside GCP
console
Offers an integrated tool chain from data
engineering to model deployment with “no lock-
in”
Allows you to run on-premises or on Google Cloud
without significant code changes.
Access to cutting-edge Google AI technology
like TensorFlow, TPUs, and TFX tools as you deploy
your AI applications to production.
What is AI Platform?
AI
Platfor
m

What is included?
AI Platform
Integrated with
Deep Learning
VM Images
Cloud
Dataflow
Cloud
Dataproc
Google
BigQuery
Cloud
Dataprep
Google Data
Studio
Notebooks
Data Labeling
Training Predictions
Pre-built
Algorithms
For data
warehousing
For data
transformation
For data
cleansing
For Hadoop and
Spark clusters
For BI
dashboards

What is included?
Kubeflow
(On premises)
AI Platform
Integrated with
Pipelines
Cloud
Dataflow
Cloud
Dataproc
Google
BigQuery
Cloud
Dataprep
Google Data
Studio
Notebooks
Data Labeling
Pre-built
Algorithms
For data
warehousing
For data
transformation
For data
cleansing
For Hadoop and
Spark clusters
For BI
dashboards
AI Hub

AI Platform Notebooks
A hosted Jupyter notebook solution that makes
it easy for Data Scientists to spin up
JupyterLab; and gives DevOps teams the
controls they need.
Centrally managed: DevOps teams can easily
manage and secure these environments.
Get started quickly: Latest data science and
machine learning frameworks are pre-configured.
No learning curve: Uses the industry standard
JupyterLab interface.
Scalable & cost-effective: Pick the hardware you
need; and scale up and down easily.
GCP integration: It’s easy to access and use GCP
services from within your notebooks.
Easily build, train, and deploy models: Supports
the full ML lifecycle through integration with the
most popular ML frameworks and tools.

One-Click
Deployment
Spin up a JuypterLab instance, pre-
configured with the latest machine
learning and data science
frameworks in one click.
Get started quickly

JupyterLab
AI Platform Notebooks uses the latest
open-source version of the industry-
standard JupyterLab.
No learning curve

Scale
On Demand
You can easily change hardware
including adding and removing GPUs.
Scalable & cost-effective

● Serverless and no-ops ML training
● Distributed training infrastructure that
supports CPUs, GPUs and TPUs
● Hyperparameter tuning
● Train and tune TensorFlow models,
Scikit-learn models, XGBoost models
and custom containers
● Multiple runtime versions for different
frameworks
● Prebuilt algorithms (TensorFlow linear
learner and wide&deep algorithm,
XGBoost algorithm)
AI Platform Training

Package TensorFlow trainer
https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/cloudml-template
Project root
directory

Package Model trainer
https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/cloudml-template
Project root directory
python task.py
--train-files $TRAIN_DATA
--eval-files $EVAL_DATA
task.py
➔ Accept command line arguments
➔ Upload the model to GCS

Training locally
gcloud ai-platform local train
--module-name trainer.task --package-path trainer/
--
--train-files $TRAIN_DATA --eval-files $EVAL_DATA --job-dir $MODEL_DIR
training
data evaluation
data
output
directory
train locally
Local path

gcloud ai-platform jobs submit training $JOB_NAME --job-dir $OUTPUT_PATH
--runtime-version 1.13 --module-name trainer.task --package-path trainer --region $REGION
--scale-tier BASIC
--
--train-files $TRAIN_DATA --eval-files $EVAL_DATA
single worker
https://cloud.google.com/ai-platform/training/docs/machine-types
Training in the cloud
with single node

Training in the cloud at scale
--scale-tier BASIC_GPU
--
single GPU
with GPUs (K80/P100/V100 - availability by region)

--scale-tier BASIC_TPU
--
TPU Device
https://cloud.google.com/ai-platform/training/docs/using-tpus
with TPUs

--scale-tier CUSTOM --config config.yaml
--
custom cluster
trainingInput:
scaleTier: CUSTOM
masterType: complex_model_l
workerType: complex_model_l_gpu
workerCount: 10
parameterServerType: large_model
https://cloud.google.com/ai-platform/training/pricing
with custom cluster specs

--runtime-version 1.13 --module-name trainer.task --package-path trainer/ --region $REGION

--scale-tier PREMIUM_1 --config config.yaml
--
hypertuning
Hyperparameter tuning

● Automatic hyperparameter tuning
service
● Google-developed “black-box” search
(Bayesian Optimisation) algorithm
● In addition to Random Search and Grid
Search
● Supports numeric, discrete, and
categorical params
● Early stopping & resumability
Objective
We want to find this
Not these
https://cloud.google.com/blog/big-data/2017/08/hyperparameter-tuning-in-cloud-machine-learning-engine-using-bayesian-optimization
https://cloud.google.com/blog/big-data/2018/03/hyperparameter-tuning-on-google-cloud-platform-is-now-faster-and-smarter

trainingInput:
hyperparameters:
goal: MAXIMIZE
hyperparameterMetricTag: accuracy
maxTrials: 40
enableTrialEarlyStopping: True
maxParallelTrials: 2
algorithm: UNSPECIFIED
params:
- parameterName: learning-rate
type: FLOAT
minValue: 0.001
maxValue: 0.1
scaleType: UNIT_LOG_SCALE
...
...
# Initialise the optimizer for the DNN
optimizer = tf.train.AdagradOptimizer(
learning_rate=hparams.learning_rate)
...
parser.add_argument(
'--learning-rate',
help='Learning rate used by the DNN
optimizer',
default=0.01,
type=float
)
...
config.yaml
task.py

What is included?
AI Platform
Integrated with
Pipelines
Cloud
Dataflow
Google
BigQuery
Cloud
Dataprep
Data Labeling
Training
Pre-built
Algorithms
For data
warehousing
For data
transformation
For data
cleansing

Built-in Algorithms
Start an ML Engine training job using the built-
in algorithms.
No coding required! Just use the provided UI.

Training in 4 easy steps
Training algorithm Training data Algorithm arguments Job settings
1 2 3 4

What is included?
AI Platform
Integrated with
Pipelines
Cloud
Dataflow
Cloud
Dataproc
Google
BigQuery
Cloud
Dataprep
Google Data
Studio
Notebooks
Data Labeling
For data
warehousing
For data
transformation
For data
cleansing
For Hadoop and
Spark clusters
For BI
dashboards
Pre-built
Algorithms

● Serverless and no-ops ML serving
● Batch prediction for TensorFlow
models on CPUs and GPUs
● Online prediction for Scikit-learn
models, XGBoost models and Custom
prediction routines
● Explainability using different methods
(Integrated Gradients (TF),
TreeSHAP(XGB), Sampled Shapley,
Exact Shapley)
● Data Services for Data Labeling and
Continues Evaluation
AI Platform Prediction

Deploy the trained TF model
# Creating model
gcloud ai-platform models create $NAME --regions $REGION
# Creating versions
gcloud ai-platform versions create $VERSION --model $NAME --origin $MODEL_DIR
--runtime-version 1.7
gcloud command line tool:

Predicting
POST https://ml.googleapis.com/v1/projects/your-project-id/
models/${model-name}/
versions/${version}:predict
Request Body:
{
"instances": [
[0.0, 1.1, 2.2],
[3.3, 4.4, 5.5],
...
]
}
batch prediction*: online prediction*:
cloud ai-platform jobs submit prediction
$JOB_NAME
--model $NAME
--version $VERSION
--data-format TEXT
--input-paths $GCS_DATA_DIR
--output-path $GCS_OUT_DIR
*gcloud commands and APIs exists for both methods

Simply choose an explanation method
when you set up a model, and Cloud AI
Platform will tell on every prediction how
much each feature affected the final result
Explainable AI (XAI):
Cloud AI Platform
provides analysis with
every prediction
Cloud AI Platform
Prediction Service
Data
Model

Supported AI Platform explanation methods
Support
Method Frameworks Data types Paper link
Integrated
Gradients
TensorFlow Tabular, image, text
(differentiable
modes)
arxiv.org/abs/1703.01365
Sampled Shapley TensorFlow Tabular arxiv.org/pdf/1306.4265
XRAI TensorFlow Image arxiv.org/abs/1906.02825

Deploying explainable models on AI Platform
!gcloud beta ai-platform versions create $VERSION
--model $MODEL
--origin $export_path
--framework 'TENSORFLOW'
--python-version 3.7
--machine-type n1-standard-4
--explanation-method 'integrated-gradients'
--num-integral-steps 25

Deploy model
Request predictions ...with explanations
...with explanations
gcloud beta ai-platform versions create $VERSION
--model $MODEL
--framework TENSORFLOW
gcloud beta ai-platform versions create $VERSION
--model $MODEL
--framework TENSORFLOW
--explanation-method 'integrated-gradients'
--num-integral-steps 25
gcloud beta ai-platform predict
--model $MODEL
--version $VERSION
--json-instances='data.txt'
gcloud beta ai-platform explain
--model $MODEL
--version $VERSION
--json-instances='data.txt'

Steps to train a TensorFlow model - Docker user journey
1. Develop a TensorFlow model and training code
2. Create a Dockerfile with your model code
3. Build the image
4. Push it to a container registry (e.g. Google Container Registry)
5. Kick off your AI Platform training job

First: Create your model and training code
model = tf.keras.Sequential(
[
Dense(100, activation=relu,
input_shape=(input_dim,)),
Dense(75, activation=relu),
Dense(1, activation=sigmoid)
])
Sample code showing structure in cloudml-samples repo
# Train model
keras_model.fit(
training_dataset,
steps_per_epoch=int(num_train_examples /
args.batch_size),
epochs=args.num_epochs,
validation_data=validation_dataset,
validation_steps=1,
verbose=1,
callbacks=[lr_decay_cb, tensorboard_cb])
model.py task.py

Second: Create your Dockerfile
FROM gcr.io/deeplearning-platform-release/tf2-ent-latest-gpu
WORKDIR /root
COPY model.py /root/model.py
COPY task.py /root/task.py
ENTRYPOINT ["python", "task.py"]
Extend DLVM Image

Third: Build, test, and push image
IMAGE = "gcr.io/MY-PROJECT/MY-REPO:MY_IMAGE"
# Build image
docker build -f Dockerfile -t $IMAGE
# Test locally
docker run $IMAGE --lr 0.1
# Push to container registry
docker push $IMAGE
Run locally to test. You can pass custom model
parameters (e.g. learning rate) into the image.

Fourth: Submit Training Job
gcloud ai-platform jobs submit training my-job
--region us-west1
--master-image-uri gcr.io/my-project/my-repo:my-image
--
--lr=0.1
Standard parameters
for the AI Platform
command
Everything after the are custom parameters that
your training code is designed to accept

Optional: Add hyper-parameter tuning
trainingInput:
hyperparameters:
goal: MINIMIZE
hyperparameterMetricTag: "my_loss"
maxTrials: 20
maxParallelTrials: 5
enableTrialEarlyStopping: True
params:
- parameterName: lr
type: DOUBLE
minValue: 0.0001
maxValue: 0.1
Add parameter --config config.yaml to training job
...
parser.add_argument(
'--lr',
type=float,
default=0.01,
metavar='LR',
help='learning rate (default: 0.01)')
...
config.yaml task.py

What is included?
AI Platform
Integrated with
Pipelines
Cloud
Dataflow
Cloud
Dataproc
Google
BigQuery
Cloud
Dataprep
Notebooks
Data Labeling
Pre-built
Algorithms
For data
warehousing
For data
transformation
For data
cleansing
For Hadoop and
Spark clusters

Data Labeling
Service
Custom Instructions: Provide your own custom
instructions to labelers
Human Labeled Data: Get high quality human
labeled data to train and evaluate for your ML
models
Labeling tasks for unstructured data: Task focusing
on images, videos and text
Continues Evaluation: Record sample predictions
on BQ and sent for evaluation

Production ML Systems and Computer Vision with Google Cloud

Recommended

Recommended

More Related Content

Similar to Production ML Systems and Computer Vision with Google Cloud

Similar to Production ML Systems and Computer Vision with Google Cloud (20)

More from gdgsurrey

More from gdgsurrey (6)

Recently uploaded

Recently uploaded (20)

Production ML Systems and Computer Vision with Google Cloud