Data Science as a Service: Intersection of Cloud Computing and Data Science

Dr. Pouria Amirian (Pouria.Amirian@ndm.ox.ac.uk)
Big Data Project Coordinator, The Global Health Network, University of Oxford

Outline
 Data Science
   What data science is
   Steps in a Data Science project
 Experiments
   Using AzureML
 Big Data issues
   In a data science project
   Methods in analysis

What is Data Science?
 Practice of obtaining useful insights from data
 3 Vs of Big Data:
   Volume
   Variety
   Velocity
   + other Vs
 It applies to large-volume data (volume)
 It applies to semi-structured and unstructured data (variety)
 It sometimes applies to real-time or fast-changing data (velocity)
 It also applies to small, traditional static data

Data Science as a team sport
Math
Statistical Learning
Linguistics
Machine Learning
Signal Processing
Programming
Storage/Data Structure
Operations Research
Distributed and High Performance Computing

Data Science from analytics point of view
 Analytics Spectrum:
Descriptive: What happened?
Diagnostic: Why did it happen?
Predictive: What will happen?
Prescriptive: What should I do?

Data Science Vs. Business Intelligence
 Analytics Spectrum:
Descriptive: What happened?
Diagnostic: Why did it happen?
Predictive: What will happen?
Prescriptive: What should I do?
Traditional BI

Why is it so popular? Why does it matter?
 A) More Available and Usable Data
   McKinsey: Organizations that use data science to make decisions are more productive and deliver higher ROI
   Gartner: Organizations that invest in modern data infrastructure will outperform their peers by up to 20%
 B) Increased Awareness of Machine Learning Techniques
   A subset of machine learning algorithms is now more widely understood, since they have been tried and tested by early adopters such as Netflix and Amazon (recommendation engines).
   While many people may not know the details of the algorithms used, they increasingly understand their research and business value.
 C) More Accurate Analysis
   The large volumes of data being collected also enable you to build more accurate predictive models.
   The larger the sample size, the smaller the margin of error. This in turn increases the accuracy of predictions from your model.
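The sample-size point can be made concrete with the standard margin-of-error formula for a sample mean, z * sigma / sqrt(n). The sketch below is only an illustration: the 95% z-value of 1.96 and the standard deviation of 10 are assumed values, not figures from the talk.

```python
import math

def margin_of_error(std_dev, n, z=1.96):
    """Half-width of an approximate 95% confidence interval for a sample mean."""
    return z * std_dev / math.sqrt(n)

# Assumed population standard deviation, chosen only for illustration.
STD_DEV = 10.0

for n in (100, 10_000, 1_000_000):
    print(f"n={n:>9,}  margin of error = {margin_of_error(STD_DEV, n):.4f}")
```

Under these assumptions, every 100-fold increase in sample size shrinks the margin of error by a factor of 10.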

 D) Faster and Cheaper Computation
 Today, a smartphone’s processor is up to five times more
powerful than that of a desktop computer 20 years ago.
 The price of computation has decreased
 The capacity of computation has increased
 The result is dramatic gains in technology, productivity, and innovation

The Data Science Workflow
Problem Definition: Critical
Data Collection and Preparation: Very Important, Time Consuming
Model Development: Fun :D, Iterative
Model Deployment: Cumbersome :(
Performance Improvement: Critical

Problem Definition
• Domain Knowledge
• Separation of Concerns
• Prioritize each problem
Data Collection and Preparation
• Selection of the right data
• Data Transformation
• Missing Values
• Exploratory analysis
Model Development
• Right algorithm
• Test accuracy
• Test other algorithms
• Validate
Model Deployment
• Turning the data scientist's model into developer code (R to C#)
Performance Improvement
• Monitor the performance of the deployed model
• Re-training the model
• Re-deploying the model
• Re-monitoring

Big Data Issues (I)
Problem Definition
Data Collection and Preparation
Model Development
Model Deployment
Performance Improvement

Solutions to overcome the big data issues
 1- Use advanced research computing (http://www.arc.ox.ac.uk/)

 2- Create and use a Hadoop cluster
 Open source (Apache)
 It is based on two components
   HDFS (distributed storage)
   MapReduce (distributed processing)
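The MapReduce idea can be sketched in a few lines of plain Python. This is an in-process simulation of the map, shuffle, and reduce phases, not the Hadoop API; the function names and the input lines are hypothetical, and on a real cluster the input would be blocks of a file stored in HDFS and the phases would run on many machines.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)

def word_count(lines):
    """Run the three phases one after another on a list of text lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

# Hypothetical input standing in for lines read from HDFS.
lines = ["big data is not about data", "the value in big data is in analytics"]
counts = word_count(lines)
print(counts)
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread across the nodes of a cluster.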

Open source won't prevent vendor lock-in!

Third Solution
 Microsoft’s Cloud Computing

AzureML (Azure Machine Learning)
 AzureML provides an easy-to-use and powerful set of cloud-based data transformation and machine learning tools.
 AzureML Studio (or Studio for short)
   It has many modules for data transformation, analysis, visualization, …
   It supports R and Python
   It is under heavy development
   www.studio.azureml.net

AzureML Workflow
Data Input
Data Transformation (Project)
Split Data (training and test)
Learning Algorithm
Train the Learning Algorithm
Validate the Algorithm (Score)
Evaluate Model Performance
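The same sequence of steps can be sketched outside the Studio in plain Python. This is not how AzureML modules are implemented; it is a minimal illustration in which the (engine size, price) rows are made-up values and simple linear regression stands in for the learning algorithm.

```python
import math

# 1. Data input: hypothetical (engine_size, price) rows; None marks a missing value.
rows = [(97, 7000), (109, 9500), (None, 8800), (130, 13500),
        (141, 15000), (152, 16500), (164, 18800), (183, 22000),
        (120, 11800), (209, 26500)]

# 2. Data transformation: drop rows that contain a missing value.
clean = [(x, y) for x, y in rows if x is not None and y is not None]

# 3. Split data: every third row is held out for testing.
train = [row for i, row in enumerate(clean) if i % 3 != 2]
test = [row for i, row in enumerate(clean) if i % 3 == 2]

# 4-5. Learning algorithm and training: simple linear regression by least squares.
def fit(data):
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit(train)

# 6. Score: predict prices for the held-out rows.
predictions = [slope * x + intercept for x, _ in test]

# 7. Evaluate: root mean squared error on the test set.
rmse = math.sqrt(sum((p - y) ** 2 for p, (_, y) in zip(predictions, test)) / len(test))
print(f"price = {slope:.1f} * engine_size + {intercept:.1f}, test RMSE = {rmse:.1f}")
```

Keeping the test rows out of training is what makes the final error an honest estimate of how the model behaves on unseen cars.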

First Experiment: Predicting the Price of a Car
AutomobileFullModuleModel02-03-2015

Second Experiment: Using R in ML Studio
AutomobileRTransformation02-03-2015

Third Experiment: Comparing Two Models
AutomobileFullModuleTwoModel02-03-2015

Fourth Experiment: Creating a Web Service
 Very easy: just a few clicks!
 Make: bmw
 Engine-size: 164
 Horse-power: 121
 highway-mpg: 25
 Its actual price is 24,565
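Calling a published web service with these inputs might look roughly like the sketch below. The endpoint URL, the API key, the bearer-token header, and the shape of the JSON payload are all assumptions made for illustration; the real values and the exact request format come from the client code that the Studio generates for the published service. The request is only built here, not sent.

```python
import json
import urllib.request

# Hypothetical endpoint and key; the real ones are shown on the published service's page.
SERVICE_URL = "https://example.invalid/score"
API_KEY = "YOUR-API-KEY"

def build_request(features):
    """Build (but do not send) a JSON POST request for one car."""
    # Assumed payload shape: a list of input records under an "inputs" key.
    body = json.dumps({"inputs": [features]}).encode("utf-8")
    headers = {"Content-Type": "application/json",
               "Authorization": "Bearer " + API_KEY}
    return urllib.request.Request(SERVICE_URL, data=body, headers=headers, method="POST")

# Feature values taken from the slide.
car = {"make": "bmw", "engine-size": 164, "horse-power": 121, "highway-mpg": 25}
request = build_request(car)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# Sending it would be: urllib.request.urlopen(request).read()
```

The predicted price returned by the service can then be compared with the actual price of 24,565.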

Tips
 Data input can come from a variety of data interfaces, including HTTP connections (any file-sharing service such as Dropbox, Google Drive, or OneDrive), SQL Azure, and Hive Query.
 You can use functionality from all supported R packages (410)
 You can write your own utility functions and upload them as another module
 It is under heavy development
   Two weeks ago the process for web service publication changed
   Two months ago there was no support for Python
   Two months ago around 400 R packages were supported
   …

Big Data Issues (II)
 High-dimensional data (wide data)
 Using various methods requires knowledge of those methods
 Traditional methods are not efficient enough (unstable)
   Least squares, for example
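One way least squares can become unstable is when predictors are nearly collinear, which is common in wide data. The slide does not say which kind of instability is meant, so this is an assumed illustration with made-up numbers: two almost identical predictors, and a response that is then changed by only 0.01 per observation.

```python
def least_squares_2(x1, x2, y):
    """Ordinary least squares for y = b1*x1 + b2*x2 (no intercept), via the normal equations."""
    s11 = sum(a * a for a in x1)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s22 = sum(b * b for b in x2)
    t1 = sum(a * c for a, c in zip(x1, y))
    t2 = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12  # close to zero when x1 and x2 are nearly collinear
    return (s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det

# Hypothetical data: x2 is almost a copy of x1.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.001, 1.999, 3.001, 3.999]
y = [a + b for a, b in zip(x1, x2)]  # generated with coefficients (1, 1)

# Shift each response by only about 0.01, along the tiny difference between x2 and x1.
y_perturbed = [c + 10 * (b - a) for a, b, c in zip(x1, x2, y)]

print(least_squares_2(x1, x2, y))
print(least_squares_2(x1, x2, y_perturbed))
```

The first fit recovers coefficients close to (1, 1), while the second lands close to (-9, 11): a change of about 0.01 in the data moves each coefficient by about 10.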

Advantages of AzureML
 Solutions can be quickly deployed as web services.
 Models run in a highly scalable cloud environment.
 You can use the R and Python languages for solution-specific functionality.
 It generates minimal code for consuming the web service in R and Python (and C#)
 It can be run from anywhere

“Big Data is not about Data. The value in big data is in Analytics.”
GARY KING
Thanks for your attention
Time for Q/A
