AGENDA - DATA SCIENCE IN POC
• It's not about you, it's about us!
• What's being covered and why?
• Data science vs Azure
• Demo per case-based scenario
Targeted audience:
(1) Data scientists who are interested in using the Azure Cloud for their daily tasks
(2) Inner-circle team members who interact with data scientists on a frequent basis
(3) Stakeholders and managers who want to understand/gain control of
IT'S NOT ABOUT YOU, IT'S ABOUT US!
Data Engineer
Data Analyst/Statistician
Database professionals
Data Scientist
Business Intelligence Manager/Advocate
Stakeholder (decision maker)
Developers (App & Infra)
Innovation Officer
Data Protection Officer (GDPR)
Data Scientist: "I want to experiment with new models from that paper I read yesterday"
Data Engineer: "Let's automate the data pipeline with the Hadoop ecosystem!"
Data Analyst/Statistician: "Let's explore the data with histograms, boxplots and density plots"
Database professionals: "I want to try out machine learning models in a SQL-friendly way, is it possible?"
Business Intelligence Manager/Advocate: "Let's make some pretty dashboards to show others..." / "Build me something cool, I will go and show/tell others"
Stakeholder (decision maker): "Does any of your models make $ for the company?"
Developers (App & Infra): "Give me a model so I can embed it into an app!"
Innovation Officer: "Let's run a POC to understand how data science works"
Data Protection Officer (GDPR): "Please make sure all data processes are compliant with GDPR"
WHAT THE OTHERS WANT
Gather inputs → prioritize → map deliverables.
Stakeholder: "Does your model make $ for the company?"
Manager/Advocate: "Build me something cool, I will go and show/tell others"
Innovation Officer: "Let's run a POC to understand how data science works"
Ensure the model-development process resonates with the deliverables, and map outputs to deliverables:
(1) Ensure intermediate data-process outputs are archived for dashboarding and reproducibility
(2) Make sure the selected model:
    a) is interpretable to a certain degree
    b) can be embedded behind an API call or can do batch prediction
    c) can scale, and its life cycle can be managed
(3) Make sure the model performance can somehow be translated to $
Document all activities, making sure to include:
1. why the model works
2. what it looks like in production
3. how to scale it & integrate with IT
Make sure the deliverables include a killer-looking dashboard/app so I can easily show/tell others.
WHAT I (= DATA SCIENTIST) WANT
A common, sharable, centralized workspace where I can...
• have all desired toolsets pre-installed, or need only a minimal install
• connect different data sources into/out of the workspace
• scale up/down as I see fit
• train the model fast
• automate the workflow
• demo my model
• prep & deploy my model
WHAT'S BEING COVERED
• Framework: Python focused (why?)
• Data: healthcare data (public datasets or artificial data)
• Azure Machine Learning services vs data science activities
• Constraints:
  1. No live streaming prediction, only batch-sized prediction (excluding backend infra; API call only)
  2. No model production-pipeline management (the focus is on POCs only)
• Demo constraint: if a run would take more than 20 minutes, I will walk through the Jupyter notebook as the demo instead!
FORMAT OF THE DEMO
The following slide shows the format in which we will walk through the demo for each case-based scenario.
DATA SCIENCE ACTIVITIES VS AZURE SERVICES
Each case walks through four data science activities:
(1) Get data in/out of Azure
(2) Model building
(3) Demo your model
(4) Productionalize the model
For the Azure service in each case, the slides cover:
• Who: who the service is intended for
• Why: a justification of why/when we would use it
• Data type: structured, unstructured, or both
• Scenario: e.g. "this architecture is used in a database-centric scenario..."
• Case-specific properties: pros and cons
• Data: what data is used
• Model: the type of machine learning model
• Operationalize: batch prediction or API call
• A screenshot of the Azure service's model output
In short: compacted data science activities vs what the Azure services provide.
CASES AND DEMOS
CASE 1 – CITIZEN DATA SCIENTIST'S PLAYGROUND (Azure Machine Learning Studio)
Who: anyone who wants to understand what a data scientist does and the data science process
Why: explain an end-to-end toy data science journey on a single UI
Data type: mainly structured; limited cases where unstructured data can be used
Scenario: to gain awareness internally, one needs a tool that makes the process easy to demonstrate
Pros vs cons:
• Pro: upload data to AML Studio. Con: small data only.
• Pro: drag & drop (no coding). Con: limited model selection.
• Pro: results viewable directly via the UI. Con: none noted.
• Pro: can be deployed directly as an API call. Con: prototype only, not built for production.
Data: breast cancer CSV
Model: SVM
Operationalize: yes, available as a web service (= API call)
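The "available as a web service" step boils down to an HTTP POST. Below is a minimal caller sketch in the classic AML Studio request shape; the endpoint URL and API key are placeholders you would copy from the service's Consume page, and the column names/values are toy stand-ins:

```python
import json
import urllib.request

# Placeholders: copy the real values from the web service's Consume page.
URL = "https://example.azureml.net/workspaces/<ws>/services/<id>/execute?api-version=2.0"
API_KEY = "<your-api-key>"

def build_payload(columns, rows):
    """Build the request body in the classic AML Studio format."""
    return {
        "Inputs": {
            "input1": {"ColumnNames": columns, "Values": rows},
        },
        "GlobalParameters": {},
    }

def score(columns, rows):
    """POST one batch of rows to the deployed web service."""
    body = json.dumps(build_payload(columns, rows)).encode("utf-8")
    req = urllib.request.Request(
        URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + API_KEY,
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Example payload for two rows of (toy) breast-cancer features:
payload = build_payload(["radius", "texture"], [[14.1, 19.3], [20.6, 25.7]])
```

Calling `score(...)` with real credentials returns the service's JSON response; `build_payload` alone is enough to inspect the request shape.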
CASE 2 – SQL DATABASE ONLY (SQL Server Machine Learning Services)
Who: SQL Server admins / SQL professionals
Why: SQL admins/engineers who want to use machine learning models in their ETL tasks
Data type: structured only
Scenario: all data sources reside in a SQL environment and you want to embed the ML model into the ETL process
Pros vs cons:
• Pro: optimized for data in the SQL DB. Con: limited support for data outside the SQL DB.
• Pro: R/Python code embedded directly in SQL. Con: not all models are supported.
• Pro: demo via SQL Server Management Studio. Con: can't demo anywhere else.
• Pro: can be embedded in the ETL process. Con: can't do stream prediction.
Azure Machine Learning Studio
Data: breast cancer CSV
Model: Naive Bayes
Operationalize: yes, available as in-database batch prediction (= built into the ETL process)
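In-database ML means SQL Server runs a Python script over the rows a T-SQL query feeds it (via `sp_execute_external_script`). The T-SQL wrapper is omitted here; this is only a sketch of the kind of Python body that would run in-database, with toy column names and data:

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB

def train_and_score(input_df: pd.DataFrame) -> pd.DataFrame:
    """The kind of logic an in-database script runs: the input frame
    comes from a T-SQL query, the returned frame goes back to SQL."""
    features = input_df.drop(columns=["diagnosis"])
    model = GaussianNB().fit(features, input_df["diagnosis"])
    out = input_df.copy()
    out["predicted"] = model.predict(features)
    return out

# Toy stand-in for rows selected from a breast-cancer table:
df = pd.DataFrame({
    "radius": [14.1, 20.6, 12.5, 19.8],
    "texture": [19.3, 25.7, 17.1, 24.2],
    "diagnosis": [0, 1, 0, 1],
})
scored = train_and_score(df)
```

In a real deployment this body becomes the `@script` argument of `sp_execute_external_script`, so the scoring step can sit directly inside an ETL stored procedure.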
CASE 3 – PARALLEL TRAINING ON MULTIPLE DATASETS (Azure Batch)
Who: data engineers / data scientists / data analysts / IT devs
Why: automate training across multiple datasets that share identical features
Data type: structured, unstructured, or both
Scenario: a data scientist has developed a model and wants to train it across multiple datasets at scale (say you have many hospitals' data with exactly the same features = columns to train on and predict on)
Pros vs cons:
• Pro: deals with any kind of data uploaded to Azure Blob. Con: data I/O can be slow.
• Pro: build the model locally in your preferred environment. Con: models requiring a GPU are not supported.
• Pro: demo via command line / Jupyter notebook. Con: not a visually appealing demo.
• Pro: trains & predicts on multiple datasets and can be automated for batch prediction. Con: can't do stream prediction.
Outputs the trained model .pkl files as well as the prediction CSV files.
Job training from the command line: launch it and go have a coffee!
Data: hospital 1/2/3 breast cancer data CSVs
Model: RandomForest classifier
Operationalize: yes, available as batch-size prediction (can be scheduled to automate the process)
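The per-dataset training step each Batch node would run can be sketched without any Azure plumbing. Job submission and blob I/O are omitted; the hospital CSVs below are in-memory toy stand-ins with the shared column layout the scenario assumes:

```python
import pickle
from io import StringIO

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Stand-ins for hospital_1.csv, hospital_2.csv, ...: every file shares
# the exact same columns, which is what makes the loop trivial.
RAW = {
    "hospital_1": "radius,texture,diagnosis\n14.1,19.3,0\n20.6,25.7,1\n12.5,17.1,0\n19.8,24.2,1\n",
    "hospital_2": "radius,texture,diagnosis\n13.0,18.0,0\n21.1,26.3,1\n11.9,16.4,0\n18.7,23.5,1\n",
}

def train_one(df: pd.DataFrame) -> bytes:
    """Train one model per dataset and return it pickled, ready to be
    written out as a model.pkl next to the prediction CSV."""
    X, y = df[["radius", "texture"]], df["diagnosis"]
    model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
    return pickle.dumps(model)

models = {name: train_one(pd.read_csv(StringIO(text)))
          for name, text in RAW.items()}
```

On Azure Batch, each dictionary entry would instead be one task reading its CSV from blob storage, so all hospitals train in parallel.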
CASE 4 – TRAIN DEEP LEARNING MODELS + GPU ACCELERATION (Batch AI)
Who: data engineers / data scientists / data analysts / IT devs
Why: a GPU-enabled model-training environment is required
Data type: structured, unstructured, or both
Scenario: a model that requires training on a GPU (otherwise it would take ages to train); an example is training an image classification model, the kind of cool stuff an Advocate can use to showcase the data science team's work
Pros vs cons:
• Pro: deals with any kind of data uploaded to Azure File Share. Con: data I/O can be slow.
• Pro: build the model (optimized for GPU) locally in your preferred environment. Con: session time-outs can be problematic.
• Pro: demo via command line / Jupyter notebook. Con: not a visually appealing demo.
• Pro: batch-sized prediction can be automated (scheduled). Con: can't do stream prediction.
Data: DICOM CT-scan image data
Model: CNN (convolutional neural network) classifier
Operationalize: yes, available as batch-size prediction (can be scheduled to automate the process)
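The real case trains the CNN with a GPU framework on Batch AI; as a dependency-light illustration of what one CNN layer actually computes, here is a toy NumPy forward pass (the kernel, shapes, and the linear head are all made up for illustration):

```python
import numpy as np

def conv2d(img, kernel):
    """Valid 2-D cross-correlation, the core op a CNN layer applies."""
    kh, kw = kernel.shape
    h = img.shape[0] - kh + 1
    w = img.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def tiny_cnn(img, kernel, weights, bias):
    """One conv layer -> ReLU -> global average pool -> linear score."""
    fmap = np.maximum(conv2d(img, kernel), 0.0)   # ReLU activation
    pooled = fmap.mean()                          # global average pooling
    return pooled * weights + bias                # linear classifier head

rng = np.random.default_rng(0)
scan = rng.random((8, 8))                         # stand-in for one CT-scan slice
edge = np.array([[1.0, -1.0], [1.0, -1.0]])       # toy edge-detecting kernel
score = tiny_cnn(scan, edge, weights=2.0, bias=-0.1)
```

The GPU matters because a real model repeats this convolution over millions of weights and thousands of images per epoch; the arithmetic itself is just this.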
CASE 5 – SPEED MATTERS (Databricks)
Who: data engineers / data scientists / data analysts
Why: when the speed of data processing matters and parallelization can be applied; it can easily grow into a production model pipeline for an end-to-end data science workflow
Data type: structured, unstructured, or both
Scenario: when processing GBs of data, a local PC no longer suffices; one needs a lightning-fast environment for data processing
Pros vs cons:
• Pro: any type of data, fine-tuned for large datasets. Con: unnecessary for small datasets.
• Pro: a user-friendly Databricks notebook. Con: limited support for models outside the Spark ecosystem.
• Pro: demo via Databricks notebook. Con: not a visually appealing demo.
• Pro: can be developed to do stream prediction. Con: the model deployment pipeline is rather complex.
Data: membrane DICOM image
Model: U-Net (deep learning, image segmentation)
Operationalize: yes; ACI and AKS services for deep learning models are in preview, check back
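PySpark itself is not shown here; the partition-and-map idea that makes Databricks fast can be sketched with the standard library. A thread pool stands in for cluster executors and (because of the GIL) illustrates only the programming model, not a real speed-up:

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

# Toy partitions of a larger dataset; on Databricks each partition
# would live on a different executor and the map would be a Spark job.
partitions = [list(range(i, i + 1000)) for i in range(0, 10000, 1000)]

def normalize(chunk):
    """Per-partition processing step (here: mean-centre the values)."""
    m = mean(chunk)
    return [x - m for x in chunk]

# Apply the same step to every partition concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    processed = list(pool.map(normalize, partitions))
```

The Spark equivalent would be roughly `rdd.mapPartitions(normalize)`: the win comes from the partitions being processed on many machines at once.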
Any model that can be trained & saved via azure.ml can be deployed to a Docker image → web service.
Feed test data to the web service → the web service outputs predictions; note it correctly predicts classes 0,0,1,1.
You will get a proper URL to call.
Data: breast cancer CSV data
Model: PySpark GBT classifier
Operationalize: yes, API calls available
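Deployed images of this kind typically run a scoring script exposing `init()` and `run()` entry points: `init()` loads the saved model once at container start-up, and `run()` handles each request. A sketch under that assumption, with a hypothetical threshold model standing in for the pickled classifier:

```python
import json

class ThresholdModel:
    """Toy stand-in for the trained classifier saved as model.pkl."""
    def predict(self, rows):
        # Hypothetical rule: first feature below 15.0 -> class 0, else 1.
        return [0 if row[0] < 15.0 else 1 for row in rows]

model = None

def init():
    """Called once when the web service container starts."""
    global model
    model = ThresholdModel()  # a real script would unpickle model.pkl here

def run(raw_data: str) -> str:
    """Called per HTTP request with the request body as a JSON string."""
    rows = json.loads(raw_data)["data"]
    return json.dumps({"prediction": model.predict(rows)})

init()
reply = run(json.dumps({"data": [[14.1], [14.2], [20.6], [19.8]]}))
# reply is '{"prediction": [0, 0, 1, 1]}', matching the demo's classes
```

The URL you "get to call" is just an HTTP front door to `run()`; POSTing the same JSON body to it returns the same reply.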
CASE 6 – SANDBOX ENV. (BASIC) (Data Science Virtual Machine, DSVM)
Who: data engineers / data scientists / data analysts
Why: a one-stop shop for all data science toolset needs, plus a sharable environment
Data type: structured, unstructured, or both (MBs ~ GBs of data)
Scenario: a hassle-free environment where you can get to work immediately → a sandbox for data scientists + data engineers + data analysts
Pros vs cons:
• Pro: any type of data that can be uploaded to the file share. Con: uploading/downloading data can be slow.
• Pro: pre-built, all-in-one toolsets in one environment. Con: could be costly if a GPU environment is needed.
• Pro: any kind of demo you like. Con: needs an internet connection at all times.
• Pro: other Azure services can be plugged in to deploy the model. Con: model deployment requires plugging in those other services.
OTHER STUFF
FOR APP DEV (DEMO): for one-off demos only!
FOR BI PEOPLE (DEMO):

Data Science and Azure (v1.0)


Editor's Notes

  • #13 Conda env name : py36_zBatchAI
  • #16 https://cloudblogs.microsoft.com/sqlserver/2017/09/26/in-database-machine-learning-in-sql-server-2017/
  • #18 https://cloudblogs.microsoft.com/sqlserver/2017/09/26/in-database-machine-learning-in-sql-server-2017/
  • #19 Step by step guide : https://www.linkedin.com/pulse/part-iva-operationalize-python-ml-model-via-batch-predict-charpy/
  • #20 https://cloudblogs.microsoft.com/sqlserver/2017/09/26/in-database-machine-learning-in-sql-server-2017/
  • #21 Step by step guide : https://www.linkedin.com/pulse/part-ii-azure-batch-service-dsvm-pytorch-time-series-zenodia-charpy/
  • #22 https://cloudblogs.microsoft.com/sqlserver/2017/09/26/in-database-machine-learning-in-sql-server-2017/
  • #23 Step by step guide : https://www.linkedin.com/pulse/part-iib-azure-batchai-docker-deep-learning-model-trainning-charpy/
  • #24 https://cloudblogs.microsoft.com/sqlserver/2017/09/26/in-database-machine-learning-in-sql-server-2017/
  • #25 Databrick – Unet_membrane
  • #26 Step by step guide : https://www.linkedin.com/pulse/part-ivb-operationalize-your-model-databrick-via-mmlspark-charpy/
  • #27 https://cloudblogs.microsoft.com/sqlserver/2017/09/26/in-database-machine-learning-in-sql-server-2017/
  • #28 https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/
  • #30 Step by step guide : https://www.linkedin.com/pulse/azure-data-science-virtual-machine-lets-you-get-work-directly-charpy/
  • #31 Step by step guide to build this : https://www.linkedin.com/pulse/easy-5-steps-build-your-shiny-app-interact-ml-models-decision-charpy/