Big Data matters because we want to use it to make sense of our world. It’s tempting to think there’s some “magic bullet” for analyzing big data, but simple “data distillation” often isn’t enough, and unsupervised machine-learning systems can be dangerous. (Like, bringing-down-the-entire-financial-system dangerous.) Data Science is the key to unlocking insight from Big Data: by combining computer-science skills with statistical analysis and a deep understanding of the data and the problem, we can not only make better predictions, but also fill gaps in our knowledge, and even find answers to questions we hadn’t thought to ask yet.
Big data is a phenomenon brought about by rapid data growth; complex, new, and changing data types; and parallel advances in technology, and it brings huge possibilities. By exploiting these enormous volumes of structured and unstructured data, CSPs are in a unique position to capture these opportunities and create new revenue streams.
This presentation focuses on the progression from “Data” to “Big Data” to “Bigger Data”, and on the challenges, opportunities, and solutions arising from these trends:
What challenges does this massive data bring to the table?
What opportunities does this data provide?
Some solutions for handling this data.
A data monetization framework from Accenture Interactive. Three questions your company should answer to start realizing revenue opportunities from your data.
My keynote talk at the San Diego Superdata conference, looking at the history and current state of Analytics and Data Mining, and examining the effects of Big Data.
W-JAX Keynote - Big Data and Corporate Evolution (jstogdill)
A look at corporate evolution from the industrial revolution to the information age - with a focus on how Big Data will make an impact.
Presented at the W-JAX Java Conference in Munich, Germany, 11-8-11.
Data visualization trends in Business Intelligence: Allison Sapka at Analytic... (Fitzgerald Analytics, Inc.)
Allison Sapka's presentation at the Analytics and Data in Financial Services Meetup in December 2012. Allison discusses trends in data visualization, including why visualization is so powerful when implemented well, and confusing or misleading when done badly.
The enterprise software stack is undergoing a once-in-a-generation refresh, largely driven by virtualization, the data explosion, infrastructure commoditization, socialization, unlimited connectivity, and online services. With an ever-growing security perimeter and expanding attack vectors, enterprises are looking for ways to secure information access without compromising the business agility unleashed by these forces. This presentation focuses on the emerging opportunities in the enterprise space that entrepreneurs can leverage to build the technology giants of tomorrow.
Semantics, Deep Learning, and the Transformation of Business (Steve Omohundro)
Deep learning is likely to have a big impact on business. McKinsey predicts that AI and robotics will create $50 trillion of value over the next 10 years. Over $1 billion of venture investment has gone to 250 deep learning startups over the past year. Deep learning systems have recently broken records in speech recognition, image recognition, image captioning, translation, drug discovery and other tasks. Why is this happening now and how is it likely to play out? We review the development of AI and the pendulum swings between the "neats" and the "scruffies". We describe traditional approaches to semantics through logics and grammars and the new deep learning vector semantics. We relate it to Roger Shepard's cognitive geometry and the structure of biological networks. We also describe limitations of deep learning for safety and regulation. We show how it fits into the rational agent framework and discuss what the next steps may be.
An introductory lecture on how international media are using data visualization to tell stories. Some live demonstrations in the class are not reflected in the slides, and the in-class exercises are not included.
Today, we have data – lots of it. We can process information – in many ways. And with these two tools and a little bit of creativity, we are discovering the vast depths of human behavior and by extension, a way to accurately predict the future -- and our future happiness. In fact, we can quantify human movement, behaviors, desires, and even moods on a scale that wasn’t possible before a series of advances in processing power, developments in psychology and social network science, and most importantly, access to data.
In advertising, industry, and humanity, we have experienced the evolution from Web 1.0 (informational) to Web 2.0 (platform) to Web 3.0 (semantic) to elements of Web 4.0 (anticipatory) – In this anticipatory era, what can we dream of next? Beyond addressability and increasing ad relevance, how can businesses utilize these advances in product development and other market initiatives? Can we make the leap from inductive logic to human-paralleled intuition? Can this make up for our human brain mechanics that make predicting our own happiness so difficult?
In this talk we’ll cover the evolutions in data access, models for information processing, and the science of collaboration to see not only how they have been leveraged in businesses but also how they are used to better understand human behavior, and hopefully in the near future, a little bit of happiness.
Data Science ATL Meetup - Risk I/O Security Data Science (Michael Roytman)
This is a talk about data science operations and the application of Risk I/O's insights to the security industry: how we went about mining insights from our large dataset.
A global revolution is in full swing, and the Sustainable Brands Conference is where sustainability, brand and innovation leaders gather to learn, share and strategize to shape the future. SB'12 was the largest gathering to date, a kinetic convergence of innovators from more than 150 companies from around the world finding new ways to create monumental disruption in traditional models of commerce and consumption.
Using big data and implementing Hadoop is a trend that people jump to all too quickly. Instead, understanding the run-time complexity of one's algorithms, reducing that complexity, and managing the process from start to finish in a lean and agile way can yield massive cost savings, or even save your organization.
An introductory presentation about the possibilities that Big Data opens up to a public-safety company, e.g. taking advantage of smart-city grids and crime and accident databases.
Over the past weeks we have been examining the inference process- big.docx (lmark1)
Over the past weeks we have been examining the inference process, big data (which feeds this process), and now the interdependent nature of digital devices in a digital age. I invite you to follow the links below and explore the constructs of Information Overload and Machine Data. Find others of your own as well.
Do either of these constructs resonate with you? How do they interact with the erosion of boundaries between work and home life that many of you have referenced? How does our dependence on digital devices feed machine data and big data? How does it compromise our personal security?
Try weaving one of these ideas into your thinking about ICT.
The links are:
Big Data Means Information Overload
https://www.youtube.com/watch?v=6MpfVD-c-QI
http://www.forbes.com/sites/laurashin/2014/11/14/10-steps-to-conquering-information-overload/#7bd535d424fe
Machine Data and Operational Intelligence
http://www.splunk.com/en_us/resources/machine-data.html
http://internetofthingsagenda.techtarget.com/definition/machine-data
The REAL Impact of Big Data on Privacy (Claudiu Popa)
The awesome promise of Big Data is tempered by the need to protect personal information. Data scientists must expertly navigate the legislative waters and acquire the skills to protect privacy and security. This talk provides enterprise leaders with answers and suggests questions to ask when the time comes to consider the vast opportunities offered by big data.
Moderator:
Richard Villars, Vice President, Information & Cloud, IDC
Panelists:
Andrew Stokes, Chief Scientist, Deutsche Bank Global Technology
Elad Yoran, Chairman and CEO, Vaultive, Inc.
Gordon Haff, Cloud Evangelist, Red Hat
Greg Brown, VP and CTO, Cloud and Data Center Solutions, McAfee
John Engates, CTO, Rackspace
Reuven Cohen, Senior Vice President, Virtustream
Everybody has heard of Big Data and its promise as the next great frontier for innovation. However, Big Data is neither new nor easily defined. What are the key drivers that make Big Data so critically important today? What is the single idea behind Big Data that promises such game-changing outcomes for capable organizations? And who are the skilled people that deliver Big Data results?
This presentation briefly reviews the opportunities, motivations, and trends that are driving Big Data disruption. Data science is introduced as the enabling engine for Big Data transformation via the creation of new Data Products. The data scientist is defined, and their tools, workflow, and challenges are reviewed. Finally, practical tips are presented for approaching data product development.
Key takeaways include:
- Big Data disruption is driven by four megatrends
- Data is the essential raw material for creating valuable Data Products
- Data scientists are heterogeneous by role & skill set, but share common tools, workflows and challenges
- Data science talent is more important than raw data for Big Data success
These slides are modified from an invited presentation for the Gwinnett Chamber of Commerce on March 18, 2014. An excerpt was presented at the Georgia Pacific Social Media Working Session on March 19, 2014.
Presented to eRum (Budapest), May 2018
There are many common workloads in R that are "embarrassingly parallel": group-by analyses, simulations, and cross-validation of models are just a few examples. In this talk I'll describe the doAzureParallel package, a backend to the "foreach" package that automates the process of spawning a cluster of virtual machines in the Azure cloud to process iterations in parallel. This will include an example of optimizing hyperparameters for a predictive model using the "caret" package.
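The foreach pattern described above can be sketched locally (a minimal illustration, assuming the foreach and doParallel packages are installed; doAzureParallel would simply register an Azure VM cluster as the backend in place of the local one):

```r
# Minimal sketch of an embarrassingly parallel workload with foreach.
# doParallel stands in here for doAzureParallel, which would register
# a cluster of Azure virtual machines instead of local workers.
library(foreach)
library(doParallel)

cl <- makeCluster(2)          # local stand-in for a cloud cluster
registerDoParallel(cl)

# Each iteration is independent, e.g. one simulation (or one
# cross-validation fold) per iteration.
results <- foreach(i = 1:8, .combine = c) %dopar% {
  mean(rnorm(10000, mean = i))
}
stopCluster(cl)

print(results)   # approximately 1, 2, ..., 8
```

Because each iteration touches no shared state, swapping the backend (local cores, an Azure cluster) requires no change to the loop itself.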
By David Smith. Presented at Microsoft Build (Seattle), May 7 2018.
Your data scientists have created predictive models using open-source tools, proprietary software, or some combination of both, and now you are interested in lifting and shifting those models to the cloud. In this talk, I'll describe how data scientists can transition their existing workflows — while using mostly the same tools and processes — to train and deploy machine learning models based on open source frameworks to Azure. I'll provide guidance on keeping connections to data sources up-to-date, evaluating and monitoring models, and deploying applications that make use of those models.
Presentation delivered by David Smith to NY R Conference https://www.rstats.nyc/, April 2018:
Minecraft is an open-world creativity game, and a hit with kids. To get kids interested in learning to program with R, we created the "miner" package. This package is a collection of simple functions that allow you to connect with a Minecraft instance, manipulate the world within by creating blocks and controlling the player, and to detect events within the world and react accordingly.
The miner package is intended mainly for kids, to inspire them to learn R while playing Minecraft. But the development of the package also provides some useful insights into how to build an R package to interface with a persistent API, and how to instruct others on its use. In this talk I'll describe how to set up your own Minecraft server, and how to use and extend the package. I'll also provide a few examples of the package in action in a live Minecraft session.
While Python is a widely used tool for AI development, in this talk I'll make the case for considering R as a platform for developing models for intelligent applications. Firstly, R provides a first-class experience for working with deep learning frameworks via its keras integration. Equally importantly, it provides the most comprehensive suite of statistical data analysis tools, which are extremely useful for many intelligent applications such as transfer learning. I'll give a few high-level examples in this talk, and we'll go into further detail in the accompanying interactive code lab.
There are many common workloads in R that are "embarrassingly parallel": group-by analyses, simulations, and cross-validation of models are just a few examples. In this talk I'll describe several techniques available in R to speed up workloads like these, by running multiple iterations simultaneously, in parallel.
Many of these techniques require the use of a cluster of machines running R, and I'll provide examples of using cloud-based services to provision clusters for parallel computations. In particular, I will describe how you can use the SparklyR package to distribute data manipulations using the dplyr syntax, on a cluster of servers provisioned in the Azure cloud.
Presented by David Smith at Data Day Texas in Austin, January 27 2018.
A look at the changing perceptions of R, from the early days of the R project to today. Microsoft sponsor talk, presented by David Smith to the useR!2017 conference in Brussels, July 5 2017.
Predicting Loan Delinquency at One Million Transactions per Second (Revolution Analytics)
Real-time applications of predictive models must be able to generate predictions at the rate that transactions are generated. Previously, such applications of models trained using R needed to be converted to other languages like C++ or Java to achieve the required throughput. In this talk, I’ll describe how to use the in-database R processing capabilities of Microsoft R Server to detect fraud in a SQL Server database of loan records at a rate exceeding one million transactions per second. I will also show the process of training the underlying gradient-boosted tree model on a large training set using the out-of-memory algorithms of Microsoft R.
Presented by David Smith at The Data Science Summit, Chicago, April 20 2017.
The ability to independently reproduce results is a critical issue within the scientific community today, and is equally important for collaboration and compliance in business. In this talk, I'll introduce several features available in R that help you make reproducibility a standard part of your data science workflow. The talk will include tips on working with data and files, combining code and output, and managing R's changing package ecosystem.
Presented by David Smith, R Community Lead (Microsoft), at Monktoberfest October 2016.
The value of open source isn’t just in the software itself. The communities that form around open source software provide just as much value and sometimes even more: in ongoing development, in documentation, in support, in marketing, and as a supply of ready-trained employees. Companies who build on open source tend to focus on the software, but neglect communities at their peril.
In this talk, I share some of my experiences in building community for an open-source software company, Revolution Analytics, and perspectives since the acquisition by Microsoft in 2015.
R is more than just a language. Many of the reasons why R has become such a popular tool for data science come from the ecosystem surrounding the R project. R users benefit from the many resources and packages created by the community, while commercial companies (including Microsoft) provide tools to extend and support R, and services to help people use R.
In this talk, I will give an overview of the R Ecosystem and describe how it has been a critical component of R’s success, and include several examples of Microsoft’s contributions to the ecosystem.
(Presented to EARL London, September 2016)
(Presented by David Smith at useR!2016, June 2016. Recording: https://channel9.msdn.com/Events/useR-international-R-User-conference/useR2016/R-at-Microsoft )
Since the acquisition of Revolution Analytics in April 2015, Microsoft has embarked upon a project to build R technology into many Microsoft products, so that developers and data scientists can use the R language and R packages to analyze data in their data centers and in cloud environments.
In this talk I will give an overview (and a demo or two) of how R has been integrated into various Microsoft products. Microsoft data scientists are also big users of R, and I'll describe a couple of examples of R being used to analyze operational data at Microsoft. I'll also share some of my experiences in working with open source projects at Microsoft, and my thoughts on how Microsoft works with open source communities including the R Project.
Hadoop is famously scalable. Cloud Computing is famously scalable. R – the thriving and extensible open source Data Science software – not so much. But what if we seamlessly combined Hadoop, Cloud Computing, and R to create a scalable Data Science platform? Imagine exploring, transforming, modeling, and scoring data at any scale from the comfort of your favorite R environment. Now, imagine calling a simple R function to operationalize your predictive model as a scalable, cloud-based Web Service. Learn how to leverage the magic of Hadoop on-premises or in the cloud to run your R code, thousands of open source R extension packages, and distributed implementations of the most popular machine learning algorithms at scale.
The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough
1. Revolution Confidential
The Rise of Data Science in the Age of Big Data Analytics
Why Data Distillation and Machine Learning Aren’t Enough
David M Smith
VP Marketing and Community
Revolution Analytics
2. Today, we’ll discuss:
What is Data Science?
Why machine learning isn’t enough
Why Data Science works
The Data Scientist’s Toolkit
The Future of Big Data Analytics
Closing thoughts and resources
4. Where is it safe to fish near San Francisco?
San Francisco Estuary Institute
http://www.sfei.org/tools/wqt
5. Hurricane Sandy
Bob Rudis
http://rud.is/b/2012/10/28/watch-sandy-in-r-including-forecast-cone/
6. Hurricane Sandy
Ed Chen
http://blog.echen.me/hurricane-sandy-outages/
7. When did Michael Jackson have his biggest hits?
New York Times, June 25 2009 (3 hours after Michael Jackson’s death)
http://www.nytimes.com/interactive/2009/06/25/arts/0625-jackson-graphic.html
8. Three Essential Skills of Data Scientists
(Diagram: the three essential skills of data scientists, with labels Models, Predictions, Data Integration, Mashups, Visualization, Uncertainty, Problems, Data Sources, Credibility, and Effective Data Applications.)
Drew Conway
http://www.dataists.com/2010/09/the-data-science-venn-diagram/
10. Machine learning (ML) for predictions
(Diagram: Building the Model: features and responses are used to learn ML rules. Scoring new data: the rules are applied to new data to produce predictions (scores). Validating the Model: the rules score a validation set, and predictions are compared against responses to estimate “accuracy”.)
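The build / score / validate loop on this slide can be sketched with base R's glm (a minimal illustration on simulated data, not the deck's own code):

```r
# Sketch of the slide's build / score / validate loop using base R.
# Data here is simulated purely for illustration.
set.seed(42)
n <- 200
features <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
response <- as.integer(features$x1 + features$x2 + rnorm(n) > 0)
d <- cbind(features, y = response)

train <- d[1:150, ]        # building the model: training features + responses
valid <- d[151:200, ]      # held-out validation set

model <- glm(y ~ x1 + x2, data = train, family = binomial)   # the "ML rules"

scores <- predict(model, newdata = valid, type = "response")  # scoring
preds  <- as.integer(scores > 0.5)                            # predictions

accuracy <- mean(preds == valid$y)   # validate: compare predictions to responses
print(accuracy)
```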
15. Answer Unasked Questions
Revolutions blog: “The Uncanny Valley of Big Data”
http://blog.revolutionanalytics.com/2012/02/the-uncanny-valley-of-big-data.html
16. Fill in knowledge gaps
“Companies that have massive amounts of data without massive amounts of clue are going to be displaced by startups that have less data but more clue.” -- Tim O’Reilly
“More data beats better algorithms, every time” – Google
Google Research, “The Unreasonable Effectiveness of Data”:
http://googleresearch.blogspot.com/2009/03/unreasonable-effectiveness-of-data.html
Tim O’Reilly on Google+: https://plus.google.com/107033731246200681024/posts/4Xa76AtxYwd
TechnoCalifornia: http://technocalifornia.blogspot.com/2012/07/more-data-or-better-models.html
19. 0. Data (Big & Messy)
20. 1. A language for programming with data Revolution Confidential
Download the White Paper
R is Hot
bit.ly/r-is-hot
21. Data import and pre-processing
User-defined functions
Internet API interface
XML parsing
Iterative data processing
Custom graphics
Example: Grant awards to homeless veterans FY09 (Data: Data.gov; Analysis: Drew Conway)
22. 2. Speed. Lots and lots of speed.
(Diagram: a modeling pipeline: data sampling and aggregation, variable transformation, feature selection, model estimation, and predictions, with model comparison / refinement and benchmarking feeding back into the process.)
23. Use all available computing cycles
(Diagram: data flows from disk into shared memory and is processed concurrently by threads 0 through n, one per core of a multicore processor with 4, 8, 16+ cores.)
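The multicore idea on this slide can be sketched with base R's parallel package (a generic illustration; the worker count and the workload are arbitrary choices for the example):

```r
# Sketch of spreading independent work across cores with the base
# "parallel" package (illustrative; not the deck's own code).
library(parallel)

cl <- makeCluster(2)   # ideally one worker per core; 2 for portability

# Each worker computes a chunk of the iterations simultaneously
squares <- parLapply(cl, 1:8, function(i) i^2)
stopCluster(cl)

print(unlist(squares))   # 1 4 9 16 25 36 49 64
```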
24. 3. Algorithms that don’t choke on Big Data
(Diagram: BIG DATA is split into data partitions, each processed by a compute node under the coordination of a master node.)
PEMAs: Parallel External-Memory Algorithms
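The external-memory idea behind a PEMA can be illustrated with a toy chunked mean in base R: process the data one chunk at a time, keep only small partial results in memory, and combine them at the end (a conceptual sketch, not Revolution's implementation):

```r
# Toy PEMA-style computation: a mean computed over chunks, holding
# only running sums in memory. In a real PEMA the chunks would live
# on disk or on separate compute nodes. (Conceptual sketch only.)
chunked_mean <- function(x, chunk_size = 100) {
  total <- 0
  count <- 0
  for (start in seq(1, length(x), by = chunk_size)) {
    chunk <- x[start:min(start + chunk_size - 1, length(x))]
    total <- total + sum(chunk)     # partial result per chunk
    count <- count + length(chunk)
  }
  total / count                     # combine partials at the "master"
}

x <- 1:1000
print(chunked_mean(x) == mean(x))   # TRUE: same answer as in-memory
```

The key property is that the per-chunk work is independent and the partial results are tiny, so the same algorithm parallelizes across nodes without ever loading the full dataset.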
25. Drink less coffee!
(Chart: runtime of single-threaded, non-optimized algorithms compared with optimized, parallelized algorithms.)
26. 4. Move code to data (not vice versa)
Map-Reduce
RHadoop: http://bit.ly/RHadoop
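The map-reduce pattern behind this slide can be sketched in plain R (a local, conceptual illustration of what RHadoop distributes across a Hadoop cluster; the word-count data here is invented for the example):

```r
# Conceptual map-reduce in plain R: the same shape of computation that
# RHadoop would push out to a Hadoop cluster, run locally here.
words <- c("big", "data", "big", "analytics", "data", "big")

# Map: split the input into partitions and count keys within each
# partition ("the code moves to each partition of the data").
partitions <- split(words, rep(1:2, length.out = length(words)))
mapped <- lapply(partitions, table)

# Reduce: combine the per-partition counts by key.
keys <- unique(unlist(lapply(mapped, names)))
counts <- sapply(keys, function(k)
  sum(sapply(mapped, function(m) if (k %in% names(m)) m[[k]] else 0)))

print(counts)   # big: 3, data: 2, analytics: 1
```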
27. Big Data Appliances
More info: http://bit.ly/R-Netezza
28. Play Nice with Others
Presentation Layer
• Business Intelligence Tools
• Web-based data apps
• Reporting / Spreadsheets
Analytics Layer
• R
Data Layer
• Relational datastores
• Unstructured datastores
29. What every data scientist needs
Feature (Open-Source R / Revolution R Enterprise):
Interface with multiple data sources: ✓ / ✓✓
Exploratory data analysis: ✓✓ / ✓✓
Wide range of statistical methods: ✓✓ / ✓✓
High-speed computation: ✘ / ✓✓
Big Data support: ✘ / ✓✓
Data/code locality (Hadoop, etc.): ✘ / ✓✓
Print-quality data visualization: ✓ / ✓
Scheduled batch production: ✓ / ✓✓
Works in a multi-tool ecosystem: ✓✓ / ✓✓
Integration into Data Apps: ✘ / ✓✓
30. Revolution R Enterprise: Big-Data R
Feature (Open-Source R / Revolution R Enterprise):
Interface with multiple data sources: ✓ / ✓✓
Exploratory data analysis: ✓✓ / ✓✓
Wide range of statistical methods: ✓✓ / ✓✓
High-speed computation: ✘ / ✓✓
Big Data support: ✘ / ✓✓
Data/code locality (Hadoop, etc.): ✘ / ✓✓
Print-quality data visualization: ✓✓ / ✓✓
Scheduled batch production: ✓ / ✓✓
Works in a multi-tool ecosystem: ✓✓ / ✓✓
Integration into Data Apps: ✘ / ✓✓
www.revolutionanalytics.com/products
32. And … the future?
Even more data
Cloud computing
Demand for Data Scientists
Diverging paradigms for data analytics
http://www.indeed.com/jobtrends
33. Diverging data paradigms
(Diagram: diverging paradigms: files, data appliances, and clusters versus Hadoop and NoSQL, spanning exploration, modeling, storage, preprocessing, and production; one direction offers more data and better fault tolerance, the other easier programming and better performance.)
34. Data Science in Production
Real-time Big Data Analytics: From Deployment to Production
Thursday, November 29, 2012, 10:00AM - 11:00AM Pacific Time
www.revolutionanalytics.com/news-events/free-webinars/
35. Building Data Science Teams
DJ Patil in O’Reilly Radar: http://oreil.ly/I3H5fI
Statistics and Data Science graduates
Kaggle and Chorus
Revolution Analytics R Training:
http://www.revolutionanalytics.com/services/training/
36. Closing Thoughts
The Data Science process leads to more powerful, and more useful, models
Data Scientists need a technology platform to think about, explore, and model data
Revolution R Enterprise is R for Big Data
37. Resources
Revolution R Enterprise: R for Big Data
www.revolutionanalytics.com/products
RHadoop: Connecting R and Hadoop
bit.ly/r-hadoop
Contact David Smith
david@revolutionanalytics.com
@revodavid
blog.revolutionanalytics.com
38. Thank you.
The leading commercial provider of software and support for the popular open source R statistics language.
www.revolutionanalytics.com 650.646.9545 Twitter: @RevolutionR