Data Science in the Cloud
November 2nd, 2017
Data Science Transition to
the Cloud
Various Cloud Offerings
Team Data Science
Data Science Transition to the Cloud
Data Scientists must have many tools at their
disposal
– R
– Python
– SQL
– Data Analysis
– Big Data
Now they must begin to consider another tool
to help them build and test out their models
– Cloud
Data Scientists and their Toolsets
Modeling and Validation take time and resources
Back in the day, data scientists would wait hours to run models and validate ideal performance
All this was done on a personal local machine with limited memory
Data Science Models are resource intensive
Something had to give
Minimize the volume of data in the model
– Limits model performance
Minimize the complexity of the model
– Limits variety and innovation
Has the performance of personal machines increased enough over the last couple of years to
handle both volume and complexity?
Data Science Models are resource intensive
Laptops can now be purchased with a 2 TB SSD and 32 GB of memory
You’re looking at at least $4K
What if you could run the same model you run on your top-of-the-line laptop in the cloud for a
few dollars each month?
Data Science Models are resource intensive
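The laptop-versus-cloud question above comes down to simple arithmetic. Here is a minimal sketch of it, assuming a flat monthly cloud bill. The $4,000 laptop price is the figure from the slide; the $50/month cloud cost and the function name `breakeven_months` are made up for illustration and are not quotes from any provider.

```python
import math


def breakeven_months(laptop_price: float, cloud_monthly_cost: float) -> int:
    """Number of months of cloud spend needed to reach the laptop's price."""
    if cloud_monthly_cost <= 0:
        raise ValueError("cloud_monthly_cost must be positive")
    return math.ceil(laptop_price / cloud_monthly_cost)


# $4,000 comes from the slide; $50/month is a hypothetical cloud bill.
months = breakeven_months(4000, 50)
print(f"Cloud spend matches the laptop price after {months} months "
      f"(about {months / 12:.1f} years)")
```

The sketch ignores usage-based pricing, data storage, and resale value of the laptop, so treat the result as a rough order of magnitude rather than a budget.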
Many models benefit from larger datasets, especially deep learning models
How many GPUs does your local machine have?
A CPU has a few cores; a GPU has hundreds of cores
GPUs can take huge batches of data and perform the same operation over and over again
CPUs can handle only a few threads at a time
GPU vs CPU
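To make the core-count comparison concrete, here is a deliberately idealized sketch. It assumes every core applies the same operation to one item per pass and that all cores run at the same speed, which real hardware does not do (clock speeds, memory transfer, and scheduling all matter). The function name and the core counts are illustrative assumptions, not measurements.

```python
import math


def sequential_rounds(batch_size: int, cores: int) -> int:
    """Passes needed if each core handles one item of the batch per pass."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return math.ceil(batch_size / cores)


batch = 10_000  # items that all need the same operation applied

# Hypothetical core counts: a "few" cores versus "hundreds" of cores.
for label, cores in [("CPU-like (4 cores)", 4), ("GPU-like (512 cores)", 512)]:
    print(f"{label}: {sequential_rounds(batch, cores)} passes over the batch")
```

Under these assumptions the many-core device needs far fewer passes, which is the intuition behind running batch-heavy workloads such as deep learning on GPUs.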
Free users from hardware and local constraints
Collaborative
– Models can be accessed by anyone with an account
– Models can be shared across developers
Advantages to Machine Learning in the Cloud
Compliance and Security
Offline access
Disadvantages to Machine Learning in the Cloud
Various Cloud Offerings
Microsoft Azure
Amazon Web Services
Google Cloud Platform
Azure vs AWS vs Google Cloud
Magic Quadrant for Cloud According to
Gartner in 2014
AWS was on an island by itself
Cloud according to Gartner in 2014
For our purposes, we want to compare the Machine Learning offerings of Azure,
Amazon Web Services, and Google Cloud
But what about Machine Learning?
One of the most automated solutions on the market
Can load data from Amazon RDS, Redshift, and CSV files
Data can automatically be identified as categorical or numeric during preparation
Modeling does not support Unsupervised Learning, only Supervised out of the box
Predictions fall under only 2 main areas:
1. Binary and Multiclass classification
2. Regression
AWS Machine Learning
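The idea of automatically identifying columns as categorical or numeric, and then picking one of the supported prediction types, can be sketched as follows. This is a simple heuristic written for illustration only; it is not how AWS Machine Learning implements the step, and the function names `infer_column_type` and `suggest_task` are hypothetical.

```python
def _is_number(text: str) -> bool:
    """True if the string parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def infer_column_type(values: list) -> str:
    """Guess 'numeric' or 'categorical' for a column of raw CSV strings."""
    non_empty = [v for v in values if v.strip()]
    if non_empty and all(_is_number(v) for v in non_empty):
        return "numeric"
    return "categorical"


def suggest_task(target_values: list) -> str:
    """Map a target column to one of the supervised task types on the slide."""
    distinct = {v.strip() for v in target_values if v.strip()}
    if len(distinct) == 2:
        return "binary classification"
    if infer_column_type(target_values) == "numeric":
        return "regression"
    return "multiclass classification"


print(infer_column_type(["1.5", "2", "", "3e2"]))  # numeric column with a gap
print(suggest_task(["yes", "no", "yes"]))          # two distinct labels
print(suggest_task(["10.5", "11", "9.2"]))         # continuous target
```

A real service would also need to handle dates, free text, and numeric-looking identifiers, which this sketch does not attempt.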
Similar to the Amazon Machine Learning offering
Predictions fall under similar categories out of the box:
1. Binary and Multiclass Classification
2. Regression
Google offers pre-trained models as templates that can be used as starting points for model
development
Google Prediction API
Almost all operations are manual, unlike the Google and Amazon offerings
Supports a user-friendly graphical interface to visualize each step within a workflow for building
a model
– Looks very much like Visio or PowerPoint
Supports both Supervised and Unsupervised models
Azure ML
Variety of Algorithms available, not just limited to Classification and Regression:
– Classification: Binary and Multiclass
– Regression
– Anomaly Detection
– Recommendation
– Text Analysis
Azure also has a set of prebuilt templates available for reuse in the Cortana Intelligence
Gallery
Azure ML
Team Data Science
From a 2014 Capgemini Report
Only 27% of the big data projects are regarded as successful
Only 13% of organizations have achieved full-scale production for their big data
implementations
Only 8% of the big data projects are regarded as VERY successful
Only 17% of survey respondents said they had a well-developed Predictive/Prescriptive
Analytics Program in place
Some Sobering Statistics
Why is this happening?
Most data scientists unfortunately are working in silos
Silos within an organization
Silos within their tools
Some Sobering Statistics
The Team Data Science Process in Azure ML (AML) is an agile, iterative process for delivering
Machine Learning solutions effectively across enterprise Data Science Teams
Released in April 2017
One developer picks up where the other one dropped off
Team Data Science Process in Azure ML
The Team Data Science Process comprises the following four parts:
1. A standard data science life cycle definition
2. A standard template for documentation, structure, and reporting
3. Infrastructure for project execution and code repositories
4. Tools for implementing task lists, version control, code review, data exploration and
modeling
Team Data Science Process in Azure ML
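As one way to picture the "standard template for documentation, structure, and reporting", the sketch below scaffolds a shared project layout so every team member starts from the same folders. The folder and file names are hypothetical placeholders chosen for this example; they are not the official Team Data Science Process template.

```python
import tempfile
from pathlib import Path

# Hypothetical layout: folder -> files to create inside it.
LAYOUT = {
    "docs": ["project_charter.md", "data_report.md", "model_report.md"],
    "code/data_preparation": [],
    "code/modeling": [],
    "code/deployment": [],
    "sample_data": [],
}


def scaffold(root: Path, layout: dict = LAYOUT) -> list:
    """Create the folders and empty files under root; return created paths."""
    created = []
    for folder, files in layout.items():
        directory = root / folder
        directory.mkdir(parents=True, exist_ok=True)
        created.append(directory)
        for name in files:
            path = directory / name
            path.touch(exist_ok=True)
            created.append(path)
    return created


# Build the layout in a throwaway directory and list what was created.
with tempfile.TemporaryDirectory() as tmp:
    for path in scaffold(Path(tmp) / "example_project"):
        print(path.relative_to(tmp))
```

Checking a layout like this into the team's code repository is what lets one developer pick up where another left off.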
The next time your boss asks if you want to upgrade your laptop,
you may instead want to ask for an iPad Pro or a Surface Notebook with a cloud
subscription
Key Takeaway
Questions?
THANK YOU!
What questions do you have?

How Cloud is Affecting Data Scientists
