The Business Economics and Opportunity of Open Source Data Science

• Open Source
• Big Data
• Advanced Analytics & Data Science
• Cloud

• Freedom to use, for any purpose
• Freedom to tinker
• Freedom to redistribute copies
• Freedom to share modifications

• Cost Reduction (freedom to use / redistribute)
• Time-to-market (freedom to share)
• Innovation (freedom to tinker)

http://commons.wikimedia.org/wiki/File:Google%E2%80%
99s_First_Production_Server.jpg
CC-BY-2.0
1996:
10x 4Gb Hard
Drives
2000:
5000 Linux PCs
Today:
> 2 billion servers
(estimated)
“I don't think the web
would exist without
open source and
Linux. So there would
have been no
Google.”
— Chris DiBona, Google

"Unlike prior eras in which industry players lacking
technical competencies effectively outsourced the job
of software creation to third party commercial software
organizations, companies like Amazon, Facebook and
Google looked around and quickly determined that
help was not coming from that direction – and even if it
did, the economics of traditional software licensing
would be a non-starter in scale-out environments.“
— Stephen O’Grady, Redmonk
http://redmonk.com/sogrady/2015/03/17/open-source-and-aas/

BIG DATA:
THE ELEPHANT
IN THE ROOM

• Born at Yahoo! in mid-2000’s to enable web-scale
search
• First successful massively-distributed, failure-resistant
data store
• Open source, running on commodity hardware
• Invention of Map-Reduce ushers in the age of Big
Data Analytics

• ETL
• Marketing channel data
• Behavioral variables
• Promotional data
• Overlay data
• Exploratory data analysis
• Time-to-event models
• GAM survival models
• Scoring for inference
• Scoring for prediction
• 5 billion scores per day
per retailer
CUSTOM DATA
FORMAT
CUSTOM VARIABLES
(PMML)

Drew Conway
http://www.dataists.com/2010/09/the-data-
science-venn-diagram/
Data Integration
Mashups
Applications
Models
Visualization
Predictions
Uncertainty
Problems
Data Sources
Credibility
Effective
Data
Applications

What
happened?
Why did
it happen?
What will
happen?
How can we
make it happen?
Traditional BI Advanced Analytics

Facebook
• Exploratory Data Analysis
• Experimental Analysis
“Generally, we use R to move
fast when we get a new data
set. With R, we don’t need to
develop custom tools or write
a bunch of code. Instead, we
can just go about cleaning and
exploring the data.” —
Solomon Messing, data
scientist at Facebook

The New York Times
Interactive Features
• Election Forecast
• Dialect Quiz
Data Journalism
• NFL Draft Picks
• Wealth distribution in USA

• Credit Risk Analysis • Financial Networks

Data Scientist
Interact directly with data
Built-in to SQL Server
Data Developer/DBA
Manage data and
analytics together
SQL Server 2016
Built-in in-database analytics
Example Solutions
• Fraud detection
• Salesforecasting
• Warehouse efficiency
• Predictive maintenance
Relational Data
Analytic Library
T-SQL Interface
Extensibility
R
RIntegration
010010
100100
010101
Microsoft Azure
Machine Learning Marketplace
New R scripts
010010
100100
010101
010010
100100
010101
010010
100100
010101
010010
100100
010101
010010
100100
010101

• Easy to use: add 2 lines to the top of each script
• For the package author:
• For a script collaborator:
21

Software Revenues New License Revenues
http://redmonk.com/sogrady/2013/11/21/selling-software/ 24

The Azure Cloud
Operational Announced
Central US
Iowa
West US
California
North Europe
Ireland
East US
Virginia
East US 2
Virginia
US Gov
Virginia
North Central US
Illinois
US Gov
Iowa
South Central US
Texas
Brazil South
Sao Paulo
West Europe
Netherlands
China North *
Beijing
China South *
Shanghai
Japan East
Saitama
Japan West
OsakaIndia West
TBD
India East
TBD
East Asia
Hong Kong
SE Asia
Singapore
Australia West
Melbourne
Australia East
Sydney
* Operated by 21Vianet

• Exposing the expertise of data scientists as APIs
• Bringing the utility of data science to applications
• Addressing the Data Science talent gap

http://blog.revolutionanalytics.com/2015/06/r-build-keynote.html/

Building a genetic disease risk application with R
Data
• Public genome data from 1000 Genomes
• About 2TB of raw data
Processing
• VariantTools variant caller in R
• Match against NHGRI GWAS catalog
Analytics
• Risk association
• Ancestry prediction
Presentation
• Expose as API
• Web page, phone app, etc
BAM BAM BAM BAM
VariantTools
GWAS
BAM
Platform
• HDInsights Hadoop 1800 Nodes
• Raw genome sequence data in HDFS
• Revolution R Enterprise

HDInsights Cluster
rxEXEC
Task
Task
Task
Finalizer
Initiator
1,800
Nodes
Revolution R Enterprise
 … load a large dataset into Hadoop
rxSetComputeContext(RxHadoopMR(…)
…. rxEXEC (R script,…..)
1000
Genomes
rxEXEC
Distribute the Script
Across 1,800 Nodes

https://github.com/rstudio/d3heatmap

• Calculate population-level risks on
2TB of data
• Create R function to calculate
individual risk
• Build Windows Phone application
• Supply DNA sequence to app

cloud
computing
2011  2016 5x increase
data
science
Universities filling
300,000 US talent gap
90% of the data in the world
today has been created in
the last two years alone
big
data
open
source
including R, Linux, Hadoop

David Smith
R Community Lead
@revodavid
davidsmi@microsoft.com
blog.revolutionanalytics.com

36
mran.revolutionanalytics.com

The Business Economics and Opportunity of Open Source Data Science

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to The Business Economics and Opportunity of Open Source Data Science

Similar to The Business Economics and Opportunity of Open Source Data Science (20)

More from Revolution Analytics

More from Revolution Analytics (20)

Recently uploaded

Recently uploaded (20)

The Business Economics and Opportunity of Open Source Data Science

Editor's Notes