AGENDA - DATA SCIENCE IN POC
• It's not about you, it's about us!
• What's being covered and why?
• Data science vs Azure
• Demo per case-based scenario
Targeted audience:
(1) Data scientists who are interested in using the Azure Cloud for their daily tasks
(2) Inner-circle team members who interact with data scientists on a frequent basis
(3) Stakeholders and managers who want to understand/gain control of
IT'S NOT ABOUT YOU, IT'S ABOUT US!
Data Engineer
Data Analyst/Statistician
Database professionals
Data Scientist
Business Intelligence Manager/Advocate
Stakeholder (decision maker)
Developers (App & Infra)
Innovation Officer
Data Protection Officer (GDPR)
Data Scientist: "I want to experiment with new models from that paper I read yesterday"
Data Engineer: "Let's automate the data pipeline with the Hadoop ecosystem!"
Data Analyst/Statistician: "Let's explore the data with histograms, boxplots and density plots"
Database professionals: "I want to try out machine learning models in a SQL-friendly way, is it possible?"
Business Intelligence Manager/Advocate: "Let's make some pretty dashboards to show others..." / "Build me something cool, I will go and show/tell others"
Stakeholder (decision maker): "Does any of your models make $ for the company?"
Developers (App & Infra): "Give me a model so I can embed it into an app!"
Innovation Officer: "Let's run a POC to understand how data science works"
Data Protection Officer (GDPR): "Please make sure all data processes are compliant with GDPR"
WHAT THE OTHERS WANT
Gather inputs → prioritize → map deliverables.
Stakeholder: "Does your model make $ for the company?"
Manager/Advocate: "Build me something cool, I will go and show/tell others"
Innovation Officer: "Let's run a POC to understand how data science works"
Ensure the model-development process resonates with the deliverables, and map outputs to deliverables:
(1) Ensure intermediate data-process outputs are archived for dashboarding and reproducibility
(2) Make sure the selected model:
    a) is interpretable to a certain degree
    b) can be embedded behind an API call or can do batch prediction
    c) can scale, and its life cycle can be managed
(3) Make sure the model performance can somehow be translated to $
Document all activities, making sure to include:
1. why the model works
2. what it looks like in production
3. how to scale it & integrate with IT
Make sure the deliverables include a killer-looking dashboard/app so I can easily show/tell others.
WHAT I (= DATA SCIENTIST) WANT
A common, sharable, centralized workspace where I can...
• have all desired toolsets pre-installed, or need only a minimal install
• connect different data sources into/out of the workspace
• scale up/down as I see fit
• train the model fast
• automate the workflow
• demo my model
• prep & deploy my model
WHAT'S BEING COVERED
• Framework: Python focused (why?)
• Data: healthcare data (public datasets or artificial data)
• Azure Machine Learning services vs data science activities
• Constraints:
  1. No live streaming prediction, only batch-sized prediction (excluding backend infra; API call only)
  2. No model production-pipeline management (the focus is on POCs only)
• Demo constraint: if a run would take more than 20 minutes, I will walk through the Jupyter notebook as the demo instead!
FORMAT OF THE DEMO
The following slide shows the format in which we will walk through the demo for each case-based scenario.
DATA SCIENCE ACTIVITIES VS AZURE SERVICES
Each case walks through four data science activities:
(1) Get data in/out of Azure
(2) Model building
(3) Demo your model
(4) Productionalize the model
For the Azure service in each case, the slides cover:
• Who: who the service is intended for
• Why: a justification of why/when we would use it
• Data type: structured, unstructured, or both
• Scenario: e.g. "this architecture is used in a database-centric scenario..."
• Case-specific properties: pros and cons
• Data: what data is used
• Model: the type of machine learning model
• Operationalize: batch prediction or API call
• A screenshot of the Azure service's model output
In short: compacted data science activities vs what the Azure services provide.
CASES AND DEMOS
CASE 1 – CITIZEN DATA SCIENTIST'S PLAYGROUND (Azure Machine Learning Studio)
Who: anyone who wants to understand what a data scientist does and the data science process
Why: explain an end-to-end toy data science journey on a single UI
Data type: mainly structured; limited cases where unstructured data can be used
Scenario: to gain awareness internally, one needs a tool that makes the process easy to demonstrate
Pros vs cons:
• Pro: upload data to AML Studio. Con: small data only.
• Pro: drag & drop (no coding). Con: limited model selection.
• Pro: results viewable directly via the UI. Con: none noted.
• Pro: can be deployed directly as an API call. Con: prototype only, not built for production.
Data: breast cancer CSV
Model: SVM
Operationalize: yes, available as a web service (= API call)
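The "available as a web service" step boils down to an HTTP POST. Below is a minimal caller sketch in the classic AML Studio request shape; the endpoint URL and API key are placeholders you would copy from the service's Consume page, and the column names/values are toy stand-ins:

```python
import json
import urllib.request

# Placeholders: copy the real values from the web service's Consume page.
URL = "https://example.azureml.net/workspaces/<ws>/services/<id>/execute?api-version=2.0"
API_KEY = "<your-api-key>"

def build_payload(columns, rows):
    """Build the request body in the classic AML Studio format."""
    return {
        "Inputs": {
            "input1": {"ColumnNames": columns, "Values": rows},
        },
        "GlobalParameters": {},
    }

def score(columns, rows):
    """POST one batch of rows to the deployed web service."""
    body = json.dumps(build_payload(columns, rows)).encode("utf-8")
    req = urllib.request.Request(
        URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + API_KEY,
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Example payload for two rows of (toy) breast-cancer features:
payload = build_payload(["radius", "texture"], [[14.1, 19.3], [20.6, 25.7]])
```

Calling `score(...)` with real credentials returns the service's JSON response; `build_payload` alone is enough to inspect the request shape.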
CASE 2 – SQL DATABASE ONLY (SQL Server Machine Learning Services)
Who: SQL Server admins / SQL professionals
Why: SQL admins/engineers who want to use machine learning models in their ETL tasks
Data type: structured only
Scenario: all data sources reside in a SQL environment and you want to embed the ML model into the ETL process
Pros vs cons:
• Pro: optimized for data in the SQL DB. Con: limited support for data outside the SQL DB.
• Pro: R/Python code embedded directly in SQL. Con: not all models are supported.
• Pro: demo via SQL Server Management Studio. Con: can't demo anywhere else.
• Pro: can be embedded in the ETL process. Con: can't do stream prediction.
Azure Machine Learning Studio
Data: breast cancer CSV
Model: Naive Bayes
Operationalize: yes, available as in-database batch prediction (= built into the ETL process)
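In-database ML means SQL Server runs a Python script over the rows a T-SQL query feeds it (via `sp_execute_external_script`). The T-SQL wrapper is omitted here; this is only a sketch of the kind of Python body that would run in-database, with toy column names and data:

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB

def train_and_score(input_df: pd.DataFrame) -> pd.DataFrame:
    """The kind of logic an in-database script runs: the input frame
    comes from a T-SQL query, the returned frame goes back to SQL."""
    features = input_df.drop(columns=["diagnosis"])
    model = GaussianNB().fit(features, input_df["diagnosis"])
    out = input_df.copy()
    out["predicted"] = model.predict(features)
    return out

# Toy stand-in for rows selected from a breast-cancer table:
df = pd.DataFrame({
    "radius": [14.1, 20.6, 12.5, 19.8],
    "texture": [19.3, 25.7, 17.1, 24.2],
    "diagnosis": [0, 1, 0, 1],
})
scored = train_and_score(df)
```

In a real deployment this body becomes the `@script` argument of `sp_execute_external_script`, so the scoring step can sit directly inside an ETL stored procedure.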
CASE 3 – PARALLEL TRAINING ON MULTIPLE DATASETS (Azure Batch)
Who: data engineers / data scientists / data analysts / IT devs
Why: automate training across multiple datasets that share identical features
Data type: structured, unstructured, or both
Scenario: a data scientist has developed a model and wants to train it across multiple datasets at scale (say you have many hospitals' data with exactly the same features = columns to train on and predict on)
Pros vs cons:
• Pro: deals with any kind of data uploaded to Azure Blob. Con: data I/O can be slow.
• Pro: build the model locally in your preferred environment. Con: models requiring a GPU are not supported.
• Pro: demo via command line / Jupyter notebook. Con: not a visually appealing demo.
• Pro: trains & predicts on multiple datasets and can be automated for batch prediction. Con: can't do stream prediction.
Outputs the trained model .pkl files as well as the prediction CSV files.
Job training from the command line: launch it and go have a coffee!
Data: hospital 1/2/3 breast cancer data CSVs
Model: RandomForest classifier
Operationalize: yes, available as batch-size prediction (can be scheduled to automate the process)
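The per-dataset training step each Batch node would run can be sketched without any Azure plumbing. Job submission and blob I/O are omitted; the hospital CSVs below are in-memory toy stand-ins with the shared column layout the scenario assumes:

```python
import pickle
from io import StringIO

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Stand-ins for hospital_1.csv, hospital_2.csv, ...: every file shares
# the exact same columns, which is what makes the loop trivial.
RAW = {
    "hospital_1": "radius,texture,diagnosis\n14.1,19.3,0\n20.6,25.7,1\n12.5,17.1,0\n19.8,24.2,1\n",
    "hospital_2": "radius,texture,diagnosis\n13.0,18.0,0\n21.1,26.3,1\n11.9,16.4,0\n18.7,23.5,1\n",
}

def train_one(df: pd.DataFrame) -> bytes:
    """Train one model per dataset and return it pickled, ready to be
    written out as a model.pkl next to the prediction CSV."""
    X, y = df[["radius", "texture"]], df["diagnosis"]
    model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
    return pickle.dumps(model)

models = {name: train_one(pd.read_csv(StringIO(text)))
          for name, text in RAW.items()}
```

On Azure Batch, each dictionary entry would instead be one task reading its CSV from blob storage, so all hospitals train in parallel.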
CASE 4 – TRAIN DEEP LEARNING MODELS + GPU ACCELERATION (Batch AI)
Who: data engineers / data scientists / data analysts / IT devs
Why: a GPU-enabled model-training environment is required
Data type: structured, unstructured, or both
Scenario: a model that requires training on a GPU (otherwise it would take ages to train); an example is training an image classification model, the kind of cool stuff an Advocate can use to showcase the data science team's work
Pros vs cons:
• Pro: deals with any kind of data uploaded to Azure File Share. Con: data I/O can be slow.
• Pro: build the model (optimized for GPU) locally in your preferred environment. Con: session time-outs can be problematic.
• Pro: demo via command line / Jupyter notebook. Con: not a visually appealing demo.
• Pro: batch-sized prediction can be automated (scheduled). Con: can't do stream prediction.
Data: DICOM CT-scan image data
Model: CNN (convolutional neural network) classifier
Operationalize: yes, available as batch-size prediction (can be scheduled to automate the process)
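The real case trains the CNN with a GPU framework on Batch AI; as a dependency-light illustration of what one CNN layer actually computes, here is a toy NumPy forward pass (the kernel, shapes, and the linear head are all made up for illustration):

```python
import numpy as np

def conv2d(img, kernel):
    """Valid 2-D cross-correlation, the core op a CNN layer applies."""
    kh, kw = kernel.shape
    h = img.shape[0] - kh + 1
    w = img.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def tiny_cnn(img, kernel, weights, bias):
    """One conv layer -> ReLU -> global average pool -> linear score."""
    fmap = np.maximum(conv2d(img, kernel), 0.0)   # ReLU activation
    pooled = fmap.mean()                          # global average pooling
    return pooled * weights + bias                # linear classifier head

rng = np.random.default_rng(0)
scan = rng.random((8, 8))                         # stand-in for one CT-scan slice
edge = np.array([[1.0, -1.0], [1.0, -1.0]])       # toy edge-detecting kernel
score = tiny_cnn(scan, edge, weights=2.0, bias=-0.1)
```

The GPU matters because a real model repeats this convolution over millions of weights and thousands of images per epoch; the arithmetic itself is just this.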
CASE 5 – SPEED MATTERS (Databricks)
Who: data engineers / data scientists / data analysts
Why: when the speed of data processing matters and parallelization can be applied; it can easily grow into a production model pipeline for an end-to-end data science workflow
Data type: structured, unstructured, or both
Scenario: when processing GBs of data, a local PC no longer suffices; one needs a lightning-fast environment for data processing
Pros vs cons:
• Pro: any type of data, fine-tuned for large datasets. Con: unnecessary for small datasets.
• Pro: a user-friendly Databricks notebook. Con: limited support for models outside the Spark ecosystem.
• Pro: demo via Databricks notebook. Con: not a visually appealing demo.
• Pro: can be developed to do stream prediction. Con: the model deployment pipeline is rather complex.
Data: membrane DICOM image
Model: U-Net (deep learning, image segmentation)
Operationalize: yes; ACI and AKS services for deep learning models are in preview, check back
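PySpark itself is not shown here; the partition-and-map idea that makes Databricks fast can be sketched with the standard library. A thread pool stands in for cluster executors and (because of the GIL) illustrates only the programming model, not a real speed-up:

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

# Toy partitions of a larger dataset; on Databricks each partition
# would live on a different executor and the map would be a Spark job.
partitions = [list(range(i, i + 1000)) for i in range(0, 10000, 1000)]

def normalize(chunk):
    """Per-partition processing step (here: mean-centre the values)."""
    m = mean(chunk)
    return [x - m for x in chunk]

# Apply the same step to every partition concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    processed = list(pool.map(normalize, partitions))
```

The Spark equivalent would be roughly `rdd.mapPartitions(normalize)`: the win comes from the partitions being processed on many machines at once.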
Any model that can be trained & saved via azure.ml can be deployed to a Docker image → web service.
Feed test data to the web service → the web service outputs predictions; note it correctly predicts classes 0,0,1,1.
You will get a proper URL to call.
Data: breast cancer CSV data
Model: PySpark GBT classifier
Operationalize: yes, API calls available
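Deployed images of this kind typically run a scoring script exposing `init()` and `run()` entry points: `init()` loads the saved model once at container start-up, and `run()` handles each request. A sketch under that assumption, with a hypothetical threshold model standing in for the pickled classifier:

```python
import json

class ThresholdModel:
    """Toy stand-in for the trained classifier saved as model.pkl."""
    def predict(self, rows):
        # Hypothetical rule: first feature below 15.0 -> class 0, else 1.
        return [0 if row[0] < 15.0 else 1 for row in rows]

model = None

def init():
    """Called once when the web service container starts."""
    global model
    model = ThresholdModel()  # a real script would unpickle model.pkl here

def run(raw_data: str) -> str:
    """Called per HTTP request with the request body as a JSON string."""
    rows = json.loads(raw_data)["data"]
    return json.dumps({"prediction": model.predict(rows)})

init()
reply = run(json.dumps({"data": [[14.1], [14.2], [20.6], [19.8]]}))
# reply is '{"prediction": [0, 0, 1, 1]}', matching the demo's classes
```

The URL you "get to call" is just an HTTP front door to `run()`; POSTing the same JSON body to it returns the same reply.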
CASE 6 – SANDBOX ENV. (BASIC) (Data Science Virtual Machine, DSVM)
Who: data engineers / data scientists / data analysts
Why: a one-stop shop for all data science toolset needs, plus a sharable environment
Data type: structured, unstructured, or both (MBs ~ GBs of data)
Scenario: a hassle-free environment where you can get to work immediately → a sandbox for data scientists + data engineers + data analysts
Pros vs cons:
• Pro: any type of data that can be uploaded to the file share. Con: uploading/downloading data can be slow.
• Pro: pre-built, all-in-one toolsets in one environment. Con: could be costly if a GPU environment is needed.
• Pro: any kind of demo you like. Con: needs an internet connection at all times.
• Pro: other Azure services can be plugged in to deploy the model. Con: model deployment requires plugging in those other services.
OTHER STUFF
FOR APP DEV (DEMO): for one-off demos only!
FOR BI PEOPLE (DEMO):

Data Science and Azure (v1.0)


Editor's Notes

  • #13 Conda env name : py36_zBatchAI
  • #16 https://cloudblogs.microsoft.com/sqlserver/2017/09/26/in-database-machine-learning-in-sql-server-2017/
  • #18 https://cloudblogs.microsoft.com/sqlserver/2017/09/26/in-database-machine-learning-in-sql-server-2017/
  • #19 Step by step guide : https://www.linkedin.com/pulse/part-iva-operationalize-python-ml-model-via-batch-predict-charpy/
  • #20 https://cloudblogs.microsoft.com/sqlserver/2017/09/26/in-database-machine-learning-in-sql-server-2017/
  • #21 Step by step guide : https://www.linkedin.com/pulse/part-ii-azure-batch-service-dsvm-pytorch-time-series-zenodia-charpy/
  • #22 https://cloudblogs.microsoft.com/sqlserver/2017/09/26/in-database-machine-learning-in-sql-server-2017/
  • #23 Step by step guide : https://www.linkedin.com/pulse/part-iib-azure-batchai-docker-deep-learning-model-trainning-charpy/
  • #24 https://cloudblogs.microsoft.com/sqlserver/2017/09/26/in-database-machine-learning-in-sql-server-2017/
  • #25 Databrick – Unet_membrane
  • #26 Step by step guide : https://www.linkedin.com/pulse/part-ivb-operationalize-your-model-databrick-via-mmlspark-charpy/
  • #27 https://cloudblogs.microsoft.com/sqlserver/2017/09/26/in-database-machine-learning-in-sql-server-2017/
  • #28 https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/
  • #30 Step by step guide : https://www.linkedin.com/pulse/azure-data-science-virtual-machine-lets-you-get-work-directly-charpy/
  • #31 Step by step guide to build this : https://www.linkedin.com/pulse/easy-5-steps-build-your-shiny-app-interact-ml-models-decision-charpy/