The Business Economics and Opportunity of Open Source Data Science
17,296 views

Keynote presentation by David Smith (R Community Lead, Microsoft) at EARL Boston Conference, November 3 2015

Published in: Technology

  1. • Open Source • Big Data • Advanced Analytics & Data Science • Cloud
  2. OPENING WITH OPEN SOURCE
  3. • Freedom to use, for any purpose • Freedom to tinker • Freedom to redistribute copies • Freedom to share modifications
  4. • Cost Reduction (freedom to use / redistribute) • Time-to-market (freedom to share) • Innovation (freedom to tinker)
  5. http://commons.wikimedia.org/wiki/File:Google%E2%80%99s_First_Production_Server.jpg CC-BY-2.0 1996: 10x 4GB Hard Drives 2000: 5000 Linux PCs Today: > 2 billion servers (estimated) “I don't think the web would exist without open source and Linux. So there would have been no Google.” - Chris DiBona, Google
  6. “Unlike prior eras in which industry players lacking technical competencies effectively outsourced the job of software creation to third party commercial software organizations, companies like Amazon, Facebook and Google looked around and quickly determined that help was not coming from that direction – and even if it did, the economics of traditional software licensing would be a non-starter in scale-out environments.” - Stephen O’Grady, Redmonk http://redmonk.com/sogrady/2015/03/17/open-source-and-aas/
  7. BIG DATA: THE ELEPHANT IN THE ROOM
  8. • Born at Yahoo! in the mid-2000s to enable web-scale search • First successful massively-distributed, failure-resistant data store • Open source, running on commodity hardware • Invention of Map-Reduce ushers in the age of Big Data Analytics
  9. • ETL • Marketing channel data • Behavioral variables • Promotional data • Overlay data • Exploratory data analysis • Time-to-event models • GAM survival models • Scoring for inference • Scoring for prediction • 5 billion scores per day per retailer CUSTOM DATA FORMAT CUSTOM VARIABLES (PMML)
  10. THE DATA SCIENCE REVOLUTION
  11. Drew Conway http://www.dataists.com/2010/09/the-data-science-venn-diagram/ Data Integration Mashups Applications Models Visualization Predictions Uncertainty Problems Data Sources Credibility Effective Data Applications
  12. What happened? Why did it happen? What will happen? How can we make it happen? Traditional BI Advanced Analytics
  13. Facebook • Exploratory Data Analysis • Experimental Analysis “Generally, we use R to move fast when we get a new data set. With R, we don’t need to develop custom tools or write a bunch of code. Instead, we can just go about cleaning and exploring the data.” - Solomon Messing, data scientist at Facebook
  14. The New York Times Interactive Features • Election Forecast • Dialect Quiz Data Journalism • NFL Draft Picks • Wealth distribution in USA
  15. • Credit Risk Analysis • Financial Networks
  16. Data Scientist Interact directly with data Built-in to SQL Server Data Developer/DBA Manage data and analytics together SQL Server 2016 Built-in in-database analytics Example Solutions • Fraud detection • Sales forecasting • Warehouse efficiency • Predictive maintenance Relational Data Analytic Library T-SQL Interface Extensibility R Integration Microsoft Azure Machine Learning Marketplace New R scripts
  17. • Easy to use: add 2 lines to the top of each script • For the package author: • For a script collaborator:
  18. CLOUDY, WITH A CHANGE
  19. On Premises Cloud
  20. Software Revenues New License Revenues http://redmonk.com/sogrady/2013/11/21/selling-software/
  21. The Azure Cloud Operational Announced Central US Iowa West US California North Europe Ireland East US Virginia East US 2 Virginia US Gov Virginia North Central US Illinois US Gov Iowa South Central US Texas Brazil South Sao Paulo West Europe Netherlands China North * Beijing China South * Shanghai Japan East Saitama Japan West Osaka India West TBD India East TBD East Asia Hong Kong SE Asia Singapore Australia West Melbourne Australia East Sydney * Operated by 21Vianet
  22. • Exposing the expertise of data scientists as APIs • Bringing the utility of data science to applications • Addressing the Data Science talent gap
  23. http://blog.revolutionanalytics.com/2015/06/r-build-keynote.html/
  24. Building a genetic disease risk application with R Data • Public genome data from 1000 Genomes • About 2TB of raw data Processing • VariantTools variant caller in R • Match against NHGRI GWAS catalog Analytics • Risk association • Ancestry prediction Presentation • Expose as API • Web page, phone app, etc BAM BAM BAM BAM VariantTools GWAS BAM Platform • HDInsight Hadoop 1800 Nodes • Raw genome sequence data in HDFS • Revolution R Enterprise
  25. HDInsight Cluster rxExec Initiator Task Task Task Finalizer 1,800 Nodes Revolution R Enterprise: … load a large dataset into Hadoop; rxSetComputeContext(RxHadoopMR(…)); …; rxExec(R script, …) 1000 Genomes rxExec: Distribute the Script Across 1,800 Nodes
  26. https://github.com/rstudio/d3heatmap
  27. • Calculate population-level risks on 2TB of data • Create R function to calculate individual risk • Build Windows Phone application • Supply DNA sequence to app
  28. cloud computing 2011 to 2016: 5x increase data science Universities filling 300,000 US talent gap 90% of the data in the world today has been created in the last two years alone big data open source including R, Linux, Hadoop
  29. David Smith R Community Lead @revodavid davidsmi@microsoft.com blog.revolutionanalytics.com
  30. mran.revolutionanalytics.com
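
The Initiator, Task, Finalizer layout on slide 25 (and the Map-Reduce idea on slide 8) follows a general fan-out/fan-in pattern: split the work, run the same script on each piece in parallel, then merge the partial results. The sketch below illustrates that pattern only. It is written in Python rather than R, it is not RevoScaleR code, and it does not reproduce the talk's genome pipeline. The function names (`initiator`, `task`, `finalizer`, `run_distributed`) and the toy variant IDs are hypothetical, and a local thread pool stands in for the cluster nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def initiator(records, n_tasks):
    """Split the input into n_tasks chunks, one per task (round-robin)."""
    return [records[i::n_tasks] for i in range(n_tasks)]


def task(chunk):
    """Map step: count how often each variant ID appears in one chunk."""
    return Counter(chunk)


def finalizer(partials):
    """Reduce step: merge the per-task counts into a single total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run_distributed(records, n_tasks=3):
    """Run initiator -> parallel tasks -> finalizer on a local thread pool."""
    chunks = initiator(records, n_tasks)
    # Assumption: threads on one machine stand in for separate cluster nodes.
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        partials = list(pool.map(task, chunks))
    return finalizer(partials)


# Toy input: made-up variant IDs, not real genome data.
toy_records = ["v1", "v2", "v1", "v3", "v2", "v1"]
result = run_distributed(toy_records)
print(result["v1"], result["v2"], result["v3"])  # prints: 3 2 1
```

Because each task only sees its own chunk and the finalizer only sees partial counts, the same structure would in principle scale from three local threads to many nodes; only the executor would need to change.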
