FullStack Developers Israel
CONTINUOUS OPERATIONS
DEEP LEARNING | HAGGAI PHILIP ZAGURY
Tikal Knowledge
TIKAL INTRO
WHO WE ARE?
▸ Tikal helps ISVs in Israel & abroad with their technological challenges.
▸ Our engineers are full-stack developers with expertise in Android, DevOps, Java, JS, Python and ML.
▸ We are passionate about technology and specialise in open-source technologies.
▸ Our tech and group leaders help establish & enhance existing software teams with innovative & creative thinking.
https://www.meetup.com/full-stack-developer-il/
SELF INTRODUCTION
▸ My open thinking and open techniques ideology is driven by Open Source technologies and the collaborative manner defining my M.O.
▸ My solution-driven approach is strongly based on hands-on experience and a deep understanding of operating systems, application stacks and software languages, networking, cloud in general and, today, more and more cloud-native solutions.
▸ Technologies:
▸ Linux { just pick a flavour …}
▸ *Scripting
▸ Git
▸ Python/Go
▸ Cloud { public/private/hybrid }
▸ Docker
▸ Kubernetes

HAGGAI PHILIP ZAGURY - DEVOPS ARCHITECT AND GROUP TECH LEAD
THE STORY …
MACHINE LEARNING | CONTINUOUS OPERATIONS
WE NEED “CI/CD” FOR OUR MODEL TRAINING …
▸ What he didn’t say is …
▸ In-browser training
▸ Backend training
▸ Tensorflow training
▸ Tensorflow serving
▸ Storage [ for raw data & model ] …
THE LEARNING CURVE
A RELATIVELY SIMPLE USE CASE …
[Diagram, steps 1-6. Components: client (serve frontend app, collect images, train, infer), server, TensorFlow training server, object store. Flow labels: upload images, serve model, get trained model, enrich model with new data, serve Protobuf.]
A CLASSIC APP
[Diagram, steps 1, 2, 5, 6. Components: client (serve frontend app, collect images, train, infer), server, object store. Flow labels: upload images, serve model, get trained model.]
MODEL TRAINING …
‣ If you're using a pre-trained model, it's no different from using a backend / an API endpoint!
‣ Training processes are complex and require on-demand Infrastructure as a Service
‣ Scalability
‣ Faster time to market vs. faster results
‣ Scaling costs …
STAGE #1
‣ python train_model.py

Total data size: 332
Train X: (298, 7, 7, 256)
Train Y: (298, 2)
Test X: (34, 7, 7, 256)
Test Y: (34, 2)
Train on 298 samples, validate on 34 samples
Epoch 1/10
298/298 [==============================] - 1s 3ms/step - loss: 0.5061 - acc: 0.7651 - val_loss: 0.2331 - val_acc: 0.9118
Epoch 2/10
298/298 [==============================] - 0s 1ms/step - loss: 0.1361 - acc: 0.9765 - val_loss: 0.0763 - val_acc: 1.0000
Epoch 3/10
298/298 [==============================] - 0s 1ms/step - loss: 0.0471 - acc: 0.9966 - val_loss: 0.0365 - val_acc: 1.0000
Epoch 4/10
298/298 [==============================] - 0s 1ms/step - loss: 0.0172 - acc: 1.0000 - val_loss: 0.0196 - val_acc: 1.0000
Epoch 5/10
298/298 [==============================] - 0s 1ms/step - loss: 0.0123 - acc: 1.0000 - val_loss: 0.0113 - val_acc: 1.0000
Epoch 6/10
298/298 [==============================] - 0s 1ms/step - loss: 0.0065 - acc: 1.0000 - val_loss: 0.0080 - val_acc:
STAGE #2 - DOCKERIZE & PARAMETERIZE …
‣ docker run -v "${PWD}/data:/opt/data" tikal/webcam-controller-model:latest

Total data size: 332
Train X: (298, 7, 7, 256)
Train Y: (298, 2)
Test X: (34, 7, 7, 256)
Test Y: (34, 2)
Train on 298 samples, validate on 34 samples
Epoch 1/10
298/298 [==============================] - 1s 3ms/step - loss: 0.5061 - acc: 0.7651 - val_loss: 0.2331 - val_acc: 0.9118
Epoch 2/10
298/298 [==============================] - 0s 1ms/step - loss: 0.1361 - acc: 0.9765 - val_loss: 0.0763 - val_acc: 1.0000
Epoch 3/10
298/298 [==============================] - 0s 1ms/step - loss: 0.0471 - acc: 0.9966 - val_loss: 0.0365 - val_acc: 1.0000
Epoch 4/10
298/298 [==============================] - 0s 1ms/step - loss: 0.0172 - acc: 1.0000 - val_loss: 0.0196 - val_acc: 1.0000
Epoch 5/10
298/298 [==============================] - 0s 1ms/step - loss: 0.0123 - acc: 1.0000 - val_loss: 0.0113 - val_acc: 1.0000
Epoch 6/10
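One way to parameterize the containerized run is to build the `docker run` command from a set of flags. This is only a sketch under assumptions the slides do not confirm: it assumes the image's entrypoint forwards extra arguments to train_model.py, and `docker_train_command` is a hypothetical helper. The flag names `lr` and `epochs` are taken from the training script's flag definitions shown later in the deck.

```python
import os
import shlex


def docker_train_command(data_dir, image="tikal/webcam-controller-model:latest", **flags):
    """Build the argv for a parameterized training run (hypothetical helper)."""
    # Mount the host data directory at /opt/data, as in the command above.
    argv = ["docker", "run", "-v", f"{os.path.abspath(data_dir)}:/opt/data", image]
    # Assumption: the image's entrypoint passes extra arguments to train_model.py.
    for name, value in sorted(flags.items()):
        argv.append(f"--{name}={value}")
    return argv


cmd = docker_train_command("data", lr=0.001, epochs=20)
print(shlex.join(cmd))
```

Returning a list rather than a string keeps the command safe to hand to `subprocess.run` without shell quoting issues.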
CONTINUOUS INTEGRATION
‣ A Jenkins pipeline
‣ Build - get sample data / updated data
‣ Deploy model to CPU/GPU
‣ Train and record results
‣ Promote - upload the new model for the “space invaders” microservice backend
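The "train and record results" and "promote" steps could look roughly like the sketch below. It parses the metrics from a Keras progress line of the kind train_model.py prints and applies a simple promotion gate. The gate (`should_promote`, comparing validation accuracy) is a hypothetical illustration, not the pipeline's actual rule, and the baseline value is made up.

```python
import re

# Matches "name: value" pairs such as "val_acc: 0.9118" in a Keras progress line.
METRIC_RE = re.compile(r"(\w+): (\d+\.\d+)")


def parse_metrics(line):
    """Extract the metrics from one epoch line of the training log."""
    return {name: float(value) for name, value in METRIC_RE.findall(line)}


def should_promote(candidate, current, min_gain=0.0):
    """Hypothetical promotion gate: promote only if validation accuracy improves."""
    return candidate["val_acc"] > current["val_acc"] + min_gain


line = ("298/298 [====] - 1s 3ms/step - loss: 0.5061 - acc: 0.7651 "
        "- val_loss: 0.2331 - val_acc: 0.9118")
metrics = parse_metrics(line)
print(metrics)
# The baseline accuracy below is an illustrative placeholder.
print(should_promote(metrics, {"val_acc": 0.85}))
```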
THE GAME IS JUST A MEANS TO AN END …
[Diagram: TensorFlow training, parameterized by # epochs, lr and more flags]

flags = tf.app.flags
flags.DEFINE_float("lr", 0.0001, "Learning Rate")
flags.DEFINE_string("units", "((50, 0.2), (40, 0.1))", "Configuration of hidden un…
                    "Expected: tuple of tuple pairs. Each pair represent one hidde…
                    "For instance: "((100, 0.2), (50, 0.3))" will create dense h…
                    "dropout layer with rate of 0.2. Afterwards, it will create de…
                    "dropout layer with rate of 0.3. If you wish to have hidden la…
                    "second value. Example: "((100,), (50, 0.3))"")
flags.DEFINE_integer("epochs", 10, "Number of epochs")
flags.DEFINE_float("batch_frac", 0.3, "The fraction of training examples to consid…
                   "For instance, 0.1 will divide the training to 10 batches")
flags.DEFINE_boolean("draw_plot", False, "Whether to draw a plot at the end")
flags.DEFINE_boolean("export_js", False, "Whether to export to a tenorflow.js mode…
FLAGS = flags.FLAGS

‣ We need to train our model with different parameters to reach the optimal model parameters …
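Training with different parameters can be sketched as a grid of command lines, one per combination. This is a minimal sketch: the flag names (`lr`, `epochs`, `batch_frac`) come from the flag definitions above, while the candidate values and the helper `grid_commands` are illustrative assumptions.

```python
import itertools

# Candidate values per flag; the flag names come from the training script,
# the values are illustrative assumptions.
GRID = {
    "lr": [0.0001, 0.001],
    "epochs": [10, 20],
    "batch_frac": [0.1, 0.3],
}


def grid_commands(grid, script="train_model.py"):
    """Yield one command line (as an argv list) per parameter combination."""
    names = sorted(grid)
    for values in itertools.product(*(grid[n] for n in names)):
        yield ["python", script] + [f"--{n}={v}" for n, v in zip(names, values)]


commands = list(grid_commands(GRID))
for argv in commands:
    print(" ".join(argv))
```

Each argv list can be run locally, wrapped in a container, or handed to a CI job, so the same grid drives every environment.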
SCALING / MULTIPLEXING … TENSORFLOW SUPPORTS MULTI-PART / DISTRIBUTED FLOWS
‣ Running the same model with different parameters in order to choose the most efficient vs. most accurate vs. most cost-effective pipeline!
‣ Most efficient # of epochs / params
https://www.tensorflow.org/performance/datasets_performance
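Choosing between the most efficient, most accurate and most cost-effective run can be expressed as a small selection rule: set an accuracy floor, then minimise cost or time. The sketch below is one possible rule, not the one used in the talk. The `Run` record, the `pick_run` helper and all numbers are made-up placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass
class Run:
    name: str
    val_acc: float   # validation accuracy of the trained model
    seconds: float   # wall-clock training time
    cost_usd: float  # what the run cost on the chosen infrastructure


def pick_run(runs, min_acc, objective="cost_usd"):
    """Return the cheapest (or fastest) run that still meets the accuracy floor."""
    eligible = [r for r in runs if r.val_acc >= min_acc]
    if not eligible:
        return None
    return min(eligible, key=lambda r: getattr(r, objective))


# Placeholder runs; every value here is invented for illustration.
runs = [
    Run("cpu-10-epochs", 0.97, 120.0, 0.02),
    Run("gpu-10-epochs", 0.97, 15.0, 0.05),
    Run("gpu-50-epochs", 0.99, 70.0, 0.25),
]
print(pick_run(runs, min_acc=0.95).name)
print(pick_run(runs, min_acc=0.95, objective="seconds").name)
```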
A/B TESTING / CANARY RELEASES ?!
[Diagram labels: MODEL VER 1.0, MODEL VER 1.7, MODEL VER 2.0; traffic split 60% / 30% / 10%; Storage Provider; collect in-browser training]
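A weighted split like the one in the diagram can be sketched with a weighted random choice per request. The pairing of versions to percentages below is an assumption read off the order of the labels, and `pick_version` is a hypothetical helper; a real deployment would more likely do this in the routing layer than in application code.

```python
import random
from collections import Counter

# Traffic weights per model version; the pairing of versions and percentages
# is an assumption read off the diagram's ordering.
WEIGHTS = {"1.0": 60, "1.7": 30, "2.0": 10}


def pick_version(rng, weights=WEIGHTS):
    """Choose the model version that serves one request."""
    versions = list(weights)
    return rng.choices(versions, weights=[weights[v] for v in versions], k=1)[0]


rng = random.Random(42)  # seeded so the sample is reproducible
counts = Counter(pick_version(rng) for _ in range(10_000))
print(counts)
```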
TRANSLATION …
▸ A flexible training model
▸ Parameterized flow
▸ Model Testing
▸ Promotion mechanism
▸ Data Import and preprocessing
▸ Post Processing
REQUIREMENTS DRIVEN SOLUTION(S)
OPTIONS - AWS ML
▸ Use custom DL AMIs [ we used them to get started … ]
OPTIONS - GCP ML/DL
▸ Assume you develop in the cloud / on the cloud
▸ Consume CPUs / GPUs / TPUs constantly
▸ Adjust your workflow to Google patterns (which isn't a bad thing …)
OPTIONS - GCP ML/DL
▸ TPU lock-in?
▸ Wouldn't it be nice to benchmark TPU & GPU on another provider?!
OPTIONS - AZURE ML/DL
IT’S ALL ABOUT THE PIPELINE / WORKFLOW
‣ You might be able to make this work …
‣ But !
THERE'S A PATTERN HERE …
[Diagram, steps 1-6. Components: IDE, training orchestrator, parameter injection, parameterized training, model storage, model serving.]
STAGE #3 - ADJUST OUR DOCKERIZED APP TO MY VENDOR …
‣ docker run -v "${PWD}/data:/opt/data" tikal/webcam-controller-model:latest
DO I CARE ABOUT VENDOR LOCK-IN ?! - LET’S TALK MULTI-CLOUD
[Diagram labels: my laptop, cloud; TensorFlow training (shown three times); I need CPU / GPU / TPU; adjust / wrap our code to suit the vendor]
IT’S NOT ONLY A MATTER OF VENDOR LOCK-IN! - IT’S MULTI-CLOUD
[Diagram labels: my laptop, cloud; I need CPU / GPU / TPU; CPU, GPU, TPU; annotation: "Only in Google ATM"]
OPERATORS
TF [TENSORFLOW] OPERATOR
STAGE #4 - WRAP CODE TO SUPPORT WORKER | ADMIN | PS OPERATOR PATTERN
‣ docker run -v "${PWD}/data:/opt/data" tikal/webcam-controller-model:latest
ML/DL AS A SERVICE - ON YOUR INFRASTRUCTURE
‣ Package model
‣ Package configuration
PRE PACKAGE MODELS FOR TRAINING / SERVING
‣ Apply to Kubernetes via ksonnet
MODEL TRAINING
[Diagram labels: DevEnv; push TensorFlow container to registry; create tfjob; store results]
https://www.slideshare.net/barbarafusinska/hassle-free-scalable-machine-learning-learning-with-kubeflow
https://codelabs.developers.google.com/codelabs/kubeflow-introduction/index.html?index=..%2F..%2Fio2018#2
MODEL SERVING
[Diagram labels: DevEnv; push application container to registry; deploy app to K8s; consume / use model in local development or in the cloud; use results; use & improve model]
MODEL TRAINING & SERVING
[Diagram, steps 1-6. Labels: DevEnv; push TensorFlow container to registry; train model in Kubeflow; store results; push application container to registry; deploy app to K8s; consume / use model in local development or in the cloud; use results; use & improve model]
A/B TESTING
[Diagram, steps 1-7. Labels: DevEnv; push TensorFlow container to registry; train model in Kubeflow; store results; push application container to registry; deploy app to K8s; consume / use model in local development or in the cloud; use results; use & improve model; step 7: use Ambassador for A/B testing]
A ONE STOP SHOP FOR EVERYTHING …
On-prem / cloud
"PaaS" on K8s
▸ Job
▸ Cron Job
▸ Pod
▸ Replica sets (multi-step / distributed)
TFJOB CRD - CUSTOM RESOURCE DEFINITION
hagzag@model-tarining 👉 kubectl get tfjob
NAME AGE
wcm 1d
OUR IMAGE IN KUBEFLOW …
…
  clusterName: "minikube"
  creationTimestamp: 2018-06-23T07:31:54Z
  generation: 1
  labels:
    app.kubernetes.io/deploy-manager: ksonnet
  name: wcm
  namespace: wcm
  resourceVersion: "94971"
  selfLink: /apis/kubeflow.org/v1alpha1/namespaces/wcm/tfjobs/wcm
  uid: 80ab9472-76b7-11e8-be6d-0800279cc216
spec:
  RuntimeId: werb
  replicaSpecs:
  - replicas: 3
    template:
      metadata:
        creationTimestamp: null
      spec:
        containers:
        - image: tikal/webcam-controller-model:latest
          name: tensorflow
          resources: {}
        restartPolicy: OnFailure
    tfPort: 2222
    tfReplicaType: WORKER
  - replicas: 2
    template:
…

‣ Next step is to wrap our model with some Operator / TF data so Kubeflow can display it …
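For reference, a manifest with the same shape as the worker replica spec in the listing can be generated from code. The sketch below mirrors only the fields visible in the listing (the second replica spec is truncated there, so it is left out). `tfjob_manifest` is a hypothetical helper, the `kubeflow.org/v1alpha1` group/version is read from the `selfLink` above, and later Kubeflow releases use a different TFJob schema.

```python
import json


def tfjob_manifest(name, namespace, image, workers):
    """Build a TFJob manifest as a plain dict (kubeflow.org/v1alpha1 layout)."""
    return {
        "apiVersion": "kubeflow.org/v1alpha1",
        "kind": "TFJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "replicaSpecs": [
                {
                    "replicas": workers,
                    "tfReplicaType": "WORKER",
                    "tfPort": 2222,
                    "template": {
                        "spec": {
                            "containers": [{"name": "tensorflow", "image": image}],
                            "restartPolicy": "OnFailure",
                        }
                    },
                }
            ]
        },
    }


manifest = tfjob_manifest("wcm", "wcm", "tikal/webcam-controller-model:latest", workers=3)
print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed document can be piped to `kubectl apply -f -` on a cluster where the TFJob CRD is installed.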
USE S3 AND TENSORBOARD …
‣ Reuse training results and display them in your common TensorFlow tooling.
WANT MORE
‣ Demo model -> https://github.com/tikalk/webcam-controller-model
‣ Kubeflow - the main “engine”: kubeflow.io
‣ It also supports other tools … https://github.com/dwhitena/kubeflow_pachyderm
‣ https://github.com/SeldonIO/seldon-core
EVEN MORE
Preprocess | ingest data
Serve
Train
Store