Deep Learning - Continuous Operations

FullStack Developers Israel
CONTINUOUS OPERATIONS
DEEP LEARNING | HAGGAI PHILIP ZAGURY

Tikal Knowledge
TIKAL INTRO
WHO WE ARE?
▸ Tikal helps ISVs in Israel & abroad with their technological challenges.
▸ Our engineers are full-stack developers with expertise in Android, DevOps, Java, JS, Python and ML.
▸ We are passionate about technology and specialise in open-source technologies.
▸ Our tech and group leaders help establish & enhance existing software teams with innovative & creative thinking.
https://www.meetup.com/full-stack-developer-il/

SELF INTRODUCTION
▸ My open-thinking, open-techniques ideology is driven by open-source technologies and the collaborative manner defining my M.O.
▸ My solution-driven approach is strongly based on hands-on experience and a deep understanding of operating systems, application stacks, software languages, networking, and cloud in general, and today more and more on cloud-native solutions.
▸ Technologies:
▸ Linux { just pick a flavour … }
▸ *Scripting
▸ Git
▸ Python/Go
▸ Cloud { public/private/hybrid }
▸ Docker
▸ Kubernetes 
HAGGAI PHILIP ZAGURY - DEVOPS ARCHITECT AND GROUP TECH LEAD

THE STORY …
MACHINE LEARNING | CONTINUOUS OPERATIONS

WE NEED “CI/CD” FOR OUR MODEL TRAINING …
▸ What he didn’t say is …
▸ In-browser training
▸ Backend training
▸ TensorFlow training
▸ TensorFlow serving
▸ Storage [ for raw data & model ] …

THE LEARNING CURVE

A RELATIVELY SIMPLE USE CASE …
[Diagram, steps 1 to 6]
Components: client (serve frontend app, collect images, train, infer), server, TensorFlow training, object store
Flows: upload images, serve model, get trained model, enrich model with new data, upload images, serve Protobuf

A CLASSIC APP
[Diagram, steps 1, 2, 5 and 6 only]
Components: client (serve frontend app, collect images, train, infer), server, object store
Flows: upload images, serve model, get trained model, upload images

MODEL TRAINING …
‣ If you’re using a pre-trained model, it’s no different from using a backend / an API endpoint!
‣ Training processes are complex and require Infrastructure as a Service, on demand
‣ Scalability
‣ Faster time to market vs. faster results
‣ Scaling costs …

STAGE #1
‣ python train_model.py
Total data size: 332
Train X: (298, 7, 7, 256)
Train Y: (298, 2)
Test X: (34, 7, 7, 256)
Test Y: (34, 2)
Train on 298 samples, validate on 34 samples
Epoch 1/10
298/298 [==============================] - 1s 3ms/step - loss: 0.5061 - acc: 0.7651 - val_loss: 0.2331 - val_acc: 0.9118
Epoch 2/10
… 1.0000
Epoch 3/10
… 1.0000
Epoch 4/10
… 1.0000
Epoch 5/10
… 1.0000
Epoch 6/10
…
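The shapes in the log are consistent with a roughly 90/10 hold-out split of the 332 samples (298 for training, 34 for validation). The slides do not show how `train_model.py` splits the data, so the following is only a sketch of that arithmetic; `split_sizes`, the `test_fraction` value and the round-up rule are assumptions.

```python
import math


def split_sizes(total, test_fraction=0.1):
    """Return (train, test) sample counts for a simple hold-out split.

    Assumption: the test set size is rounded up; the real script may differ.
    """
    test = math.ceil(total * test_fraction)
    return total - test, test


train_count, test_count = split_sizes(332)
print(train_count, test_count)
```

With these assumptions the counts match the `Train X` and `Test X` shapes printed above.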

STAGE #2 - DOCKERIZE & PARAMETERIZE …
‣ docker run -v "${PWD}/data:/opt/data" tikal/webcam-controller-model:latest
Train X: (298, 7, 7, 256)
Train Y: (298, 2)
Test X: (34, 7, 7, 256)
Test Y: (34, 2)
Epoch 1/10
… 0.9118
Epoch 2/10
… 1.0000
Epoch 3/10
… 1.0000
Epoch 4/10
… 1.0000
Epoch 5/10
… 1.0000
Epoch 6/10
…
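One way to make a containerized training script parameterizable is to read each setting from a command-line flag and fall back to an environment variable, so the same image can be driven by `docker run -e …` or by a scheduler. This is a minimal sketch, not the code inside the image: the environment variable names (`DATA_DIR`, `EPOCHS`, `LR`) and the `parse_args` helper are hypothetical, while the defaults (`/opt/data`, 10 epochs, learning rate 0.0001) are taken from the slides.

```python
import argparse
import os


def parse_args(argv=None, env=None):
    """Read training parameters from CLI flags, falling back to environment variables."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="Train the model (sketch)")
    parser.add_argument("--data-dir", default=env.get("DATA_DIR", "/opt/data"))
    parser.add_argument("--epochs", type=int, default=int(env.get("EPOCHS", "10")))
    parser.add_argument("--lr", type=float, default=float(env.get("LR", "0.0001")))
    return parser.parse_args(argv)


# Explicit argv and env so the example does not depend on the real process state.
args = parse_args(["--epochs", "20"], env={"LR": "0.001"})
print(args.data_dir, args.epochs, args.lr)
```

If the image's entrypoint were written this way, a flag would override the environment, and the environment would override the built-in default.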

CONTINUOUS INTEGRATION
‣ A Jenkins pipeline
‣ Build - get sample data / updated data
‣ Deploy model to CPU/GPU
‣ Train and record results
‣ Promote - upload new model for the “space invaders” microservice backend
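The stages above can be sketched as plain Python to show the control flow, in particular that promotion is gated on the recorded results. This is not the Jenkins pipeline from the talk: `run_pipeline`, its callbacks and the `min_val_acc` threshold are hypothetical, and the stand-in stages only echo numbers from the training log.

```python
def run_pipeline(fetch_data, train, upload, min_val_acc=0.9):
    """Run build -> train -> promote; promote only if the model is good enough."""
    data = fetch_data()        # Build: get sample / updated data
    metrics = train(data)      # Deploy and train, returning the recorded results
    promoted = metrics["val_acc"] >= min_val_acc
    if promoted:
        upload(metrics)        # Promote: publish the new model
    return {"metrics": metrics, "promoted": promoted}


# Stand-in stages so the sketch runs without Jenkins or TensorFlow.
uploaded = []
result = run_pipeline(
    fetch_data=lambda: list(range(332)),
    train=lambda data: {"samples": len(data), "val_acc": 0.9118},
    upload=uploaded.append,
)
print(result["promoted"], len(uploaded))
```

Passing the stages in as callables keeps the gate testable on a laptop before any CPU/GPU target is involved.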

THE GAME IS JUST A MEANS TO AN END …
TensorFlow training inputs: # epochs, lr, more flags
import tensorflow as tf  # TensorFlow 1.x API

# Help strings ending in … are cut off on the slide.
flags = tf.app.flags
flags.DEFINE_float("lr", 0.0001, "Learning Rate")
flags.DEFINE_string("units", "((50, 0.2), (40, 0.1))", "Configuration of hidden un…"
                    "Expected: tuple of tuple pairs. Each pair represent one hidde…"
                    "For instance: '((100, 0.2), (50, 0.3))' will create dense h…"
                    "dropout layer with rate of 0.2. Afterwards, it will create de…"
                    "dropout layer with rate of 0.3. If you wish to have hidden la…"
                    "second value. Example: '((100,), (50, 0.3))'")
flags.DEFINE_integer("epochs", 10, "Number of epochs")
flags.DEFINE_float("batch_frac", 0.3, "The fraction of training examples to consid…"
                   "For instance, 0.1 will divide the training to 10 batches")
flags.DEFINE_boolean("draw_plot", False, "Whether to draw a plot at the end")
flags.DEFINE_boolean("export_js", False, "Whether to export to a tensorflow.js mode…")
FLAGS = flags.FLAGS
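The `units` and `batch_frac` flags carry structure inside plain values, so the training script has to decode them. The sketch below shows one plausible decoding using only the standard library; `parse_units` and `batch_size` are hypothetical helpers, and both the "missing second value means no dropout" rule and the round-up of the batch size are assumptions read from the partly truncated help text.

```python
import ast
import math


def parse_units(spec):
    """Turn the units string into a list of (units, dropout_rate_or_None) pairs."""
    layers = []
    for pair in ast.literal_eval(spec):
        units = pair[0]
        dropout = pair[1] if len(pair) > 1 else None  # "(100,)" means no dropout
        layers.append((units, dropout))
    return layers


def batch_size(n_train, batch_frac):
    """Number of training examples per batch for a given fraction of the data."""
    return max(1, math.ceil(n_train * batch_frac))


print(parse_units("((100,), (50, 0.3))"))
print(batch_size(298, 0.3))
```

`ast.literal_eval` only accepts Python literals, which makes it a safer choice than `eval` for a value that arrives from the command line.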
‣ We need to train our model with different parameters to reach the optimal model parameters …
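Training with different parameters amounts to enumerating combinations of flag values and launching one run per combination. A minimal sketch, assuming the script accepts `--name=value` flags as the `tf.app.flags` definitions suggest; `training_commands` and the value grid are hypothetical.

```python
import itertools


def training_commands(grid, script="train_model.py"):
    """Yield one command line per combination of flag values."""
    names = sorted(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        flags = " ".join(f"--{name}={value}" for name, value in zip(names, values))
        yield f"python {script} {flags}"


grid = {"lr": [0.0001, 0.001], "epochs": [10, 20]}
commands = list(training_commands(grid))
for command in commands:
    print(command)
```

Each generated command could equally be passed as the arguments of a `docker run` or of a job in an orchestrator.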

SCALING / MULTIPLEXING … TENSORFLOW SUPPORTS MULTI-PART / DISTRIBUTED FLOWS
‣ Run the same model with different parameters to choose between the most efficient, the most accurate and the most cost-effective pipeline!
‣ Most efficient # of epochs / params
https://www.tensorflow.org/performance/datasets_performance
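Choosing between the most efficient, most accurate and cheapest run is a small selection problem once each run's results are recorded. The sketch below assumes every run reports validation accuracy, duration and cost; `select_runs` is a hypothetical helper and the numbers are invented for illustration, not measurements from the talk.

```python
def select_runs(runs, min_val_acc):
    """Pick the most accurate, fastest and cheapest runs above an accuracy floor."""
    good = [run for run in runs if run["val_acc"] >= min_val_acc]
    if not good:
        return None
    return {
        "most_accurate": max(good, key=lambda run: run["val_acc"])["name"],
        "fastest": min(good, key=lambda run: run["minutes"])["name"],
        "cheapest": min(good, key=lambda run: run["cost_usd"])["name"],
    }


# Hypothetical numbers, for illustration only.
runs = [
    {"name": "cpu-10-epochs", "val_acc": 0.93, "minutes": 12, "cost_usd": 0.02},
    {"name": "gpu-10-epochs", "val_acc": 0.93, "minutes": 3, "cost_usd": 0.05},
    {"name": "gpu-20-epochs", "val_acc": 0.97, "minutes": 6, "cost_usd": 0.10},
    {"name": "cpu-2-epochs", "val_acc": 0.70, "minutes": 2, "cost_usd": 0.01},
]
choice = select_runs(runs, min_val_acc=0.9)
print(choice)
```

The accuracy floor keeps a fast but useless run (here the two-epoch one) from winning on time or cost.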

A/B TESTING / CANARY RELEASES ?!
[Diagram: three model versions (1.0, 1.7, 2.0) served from a storage provider, with traffic split 60% / 30% / 10%; in-browser training is collected]
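A weighted split like the one in the diagram can be sketched with a weighted random choice per request. The pairing of percentages to versions (60% to 1.0, 30% to 1.7, 10% to 2.0) is an assumption based on the order of the labels, and `pick_version` is a hypothetical helper; in practice this routing would live in a gateway rather than in application code.

```python
import random
from collections import Counter

# Assumed pairing of model version to traffic share (percent).
WEIGHTS = {"1.0": 60, "1.7": 30, "2.0": 10}


def pick_version(weights, rng=random):
    """Choose a model version at random, proportionally to its weight."""
    versions = list(weights)
    return rng.choices(versions, weights=[weights[v] for v in versions], k=1)[0]


rng = random.Random(42)  # seeded so repeated runs give the same counts
counts = Counter(pick_version(WEIGHTS, rng) for _ in range(10_000))
print(counts)
```

Over many requests the observed shares approach the configured weights, which is what makes a small canary share safe to observe before promoting a version.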

TRANSLATION …
▸ A flexible training model
▸ Parameterized flow
▸ Model Testing
▸ Promotion mechanism
▸ Data Import and preprocessing
▸ Post Processing

REQUIREMENTS DRIVEN
SOLUTION(S)

OPTIONS - AWS ML
▸ Use custom DL AMIs [ we used them to get started … ]

OPTIONS - GCP ML/DL
▸ Assume you develop in the cloud / on the cloud
▸ Consume CPUs / GPUs / TPUs constantly
▸ Adjust your workflow to Google patterns (which isn’t a bad thing …)

OPTIONS - GCP ML/DL
▸ TPU lock-in?
▸ Wouldn’t it be nice to benchmark TPU & GPU on another provider?!

OPTIONS - AZURE ML/DL

IT’S ALL ABOUT THE PIPELINE / WORKFLOW

‣ You might be able to make this work …
‣ But !

THERE’S A PATTERN HERE …
[Diagram, steps 1 to 6: IDE, parameter injection, parameterized training, training orchestrator, model storage, model serving]

STAGE #3 - ADJUST OUR DOCKERIZED APP TO MY VENDOR …
[Screenshot repeated from Stage #2: the `…model:latest` container printing the same training log (Train X: (298, 7, 7, 256) … Epoch 6/10)]

DO I CARE ABOUT VENDOR LOCK-IN ?! - LET’S TALK MULTI-CLOUD
my laptop / cloud
I need CPU / GPU / TPU
Adjust / wrap our code to suit the vendor
[Diagram: three TensorFlow training boxes]

IT’S NOT ONLY A MATTER OF VENDOR LOCK-IN! - IT’S MULTI-CLOUD
CPU, GPU, TPU (TPU: only in Google ATM)
my laptop / cloud
I need CPU / GPU / TPU

OPERATORS

TF [TENSORFLOW] OPERATOR

STAGE #4 - WRAP CODE TO SUPPORT WORKER | ADMIN | PS OPERATOR PATTERN
[Screenshot repeated from Stage #2: the `…model:latest` container printing the same training log (Train X: (298, 7, 7, 256) … Epoch 6/10)]
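Distributed TensorFlow conventionally tells each replica its role through the `TF_CONFIG` environment variable, a JSON document with `cluster` and `task` keys, and as far as I know the Kubeflow TF operator sets it on each replica it starts. Wrapping the training code then starts with reading that variable. The sketch below shows only this step; `read_tf_config` is a hypothetical helper, the host names are invented, and port 2222 is the `tfPort` shown later in the job spec.

```python
import json
import os


def read_tf_config(env=None):
    """Return (task_type, task_index, cluster) from TF_CONFIG, or a local default."""
    env = os.environ if env is None else env
    config = json.loads(env.get("TF_CONFIG", "{}"))
    task = config.get("task", {})
    return task.get("type", "local"), int(task.get("index", 0)), config.get("cluster", {})


# Hypothetical value, shaped like what a replica of a distributed job would see.
example_env = {
    "TF_CONFIG": json.dumps({
        "cluster": {
            "worker": ["wcm-worker-0:2222", "wcm-worker-1:2222"],
            "ps": ["wcm-ps-0:2222"],
        },
        "task": {"type": "worker", "index": 1},
    })
}
task_type, task_index, cluster = read_tf_config(example_env)
print(task_type, task_index, sorted(cluster))
```

Falling back to a `local` role when the variable is absent lets the same script keep running as the plain `python train_model.py` of Stage #1.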

ML/DL AS A SERVICE - ON YOUR INFRASTRUCTURE
‣ Package model
‣ Package configuration

PRE PACKAGE MODELS FOR TRAINING / SERVING
‣ Apply to Kubernetes via ksonnet

MODEL TRAINING
[Diagram: DevEnv; push TensorFlow container to registry; create tfjob; store results]
Sources:
https://www.slideshare.net/barbarafusinska/hassle-free-scalable-machine-learning-learning-with-kubeflow
https://codelabs.developers.google.com/codelabs/kubeflow-introduction/index.html?index=..%2F..%2Fio2018#2

MODEL SERVING
Consume / use the model in local development or in the cloud
[Diagram: DevEnv; push application; deploy app to K8s; use results; use & improve model]

MODEL TRAINING & SERVING
[Diagram, steps 1 to 6: DevEnv; push TensorFlow; train model in Kubeflow; store results; push application; deploy app to K8s; use results; use & improve model]

A/B TESTING
[Diagram, steps 1 to 7: DevEnv; push TensorFlow; train model in Kubeflow; store results; push application; deploy app to K8s; use results; use & improve model; step 7: use Ambassador for A/B testing]

A ONE STOP SHOP FOR EVERYTHING …
On-prem / cloud
“PaaS” on K8s
▸ Job
▸ Cron Job
▸ Pod
▸ Replica sets (multi-step / distributed)

TFJOB CRD - CUSTOM RESOURCE DEFINITION
hagzag@model-tarining 👉 kubectl get tfjob
NAME AGE
wcm 1d

OUR IMAGE IN KUBEFLOW …
…
  clusterName: "minikube"
  creationTimestamp: 2018-06-23T07:31:54Z
  generation: 1
  labels:
    app.kubernetes.io/deploy-manager: ksonnet
  name: wcm
  namespace: wcm
  resourceVersion: "94971"
  selfLink: /apis/kubeflow.org/v1alpha1/namespaces/wcm/tfjobs/wcm
  uid: 80ab9472-76b7-11e8-be6d-0800279cc216
spec:
  RuntimeId: werb
  replicaSpecs:
  - replicas: 3
    template:
      metadata:
        creationTimestamp: null
      spec:
        containers:
        - image: tikal/webcam-controller-model:latest
          name: tensorflow
          resources: {}
        restartPolicy: OnFailure
    tfPort: 2222
    tfReplicaType: WORKER
  - replicas: 2
    template:
‣ Next step is to wrap our model with some Operator / TF data so Kubeflow can display it …
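A job definition of this shape can also be generated programmatically, which helps when the image or replica counts come from a pipeline. The sketch below builds a manifest mirroring the fields visible in the output above and prints it as JSON, a format `kubectl` accepts alongside YAML. It targets the `kubeflow.org/v1alpha1` schema shown on the slide, which later Kubeflow releases replaced; `tfjob_manifest` is a hypothetical helper, and the `PS` type for the second replica group is an assumption because the slide cuts off before that field.

```python
import json


def tfjob_manifest(name, namespace, image, workers=3, ps=2, port=2222):
    """Build a TFJob manifest in the v1alpha1 shape shown on the slide."""

    def replica(kind, count):
        return {
            "replicas": count,
            "tfPort": port,
            "tfReplicaType": kind,
            "template": {
                "spec": {
                    "containers": [{"image": image, "name": "tensorflow"}],
                    "restartPolicy": "OnFailure",
                }
            },
        }

    return {
        "apiVersion": "kubeflow.org/v1alpha1",
        "kind": "TFJob",
        "metadata": {"name": name, "namespace": namespace},
        # Assumption: the second replica group is the parameter server (PS).
        "spec": {"replicaSpecs": [replica("WORKER", workers), replica("PS", ps)]},
    }


manifest = tfjob_manifest("wcm", "wcm", "tikal/webcam-controller-model:latest")
print(json.dumps(manifest, indent=2))
```

The printed document could be saved to a file and applied to a cluster that still serves this API version.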

USE S3 AND TENSORBOARD …
‣ Reuse training results and display them in your common TensorFlow tooling.

WANT MORE
‣ Demo model -> https://github.com/tikalk/webcam-controller-model
‣ Kubeflow - the main “engine” kubeflow.io
‣ It also supports other tools … https://github.com/dwhitena/kubeflow_pachyderm
‣ https://github.com/SeldonIO/seldon-core

EVEN MORE
Preprocess | ingest data
Serve
Train
Store
