Revolution Confidential

Data Science
Not just for big data!
David Smith
Revolution Analytics
@revodavid
October 16, 2013

Big Data: the new oil?

Photo: Sarah&Boston (flickr: pocheco) Creative Commons BY-SA 2.0

Big Data is just raw material

• Data Distillation
    • Extract quantities of interest
    • Find complete cases
    • Derive missing information
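
The three distillation steps can be illustrated on a toy table. This is only a minimal sketch: the records, the field names, and the choice of mean imputation as the way to "derive missing information" are all assumptions made for illustration, not something the slides prescribe.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value.
records = [
    {"id": 1, "height_cm": 170.0, "weight_kg": 65.0},
    {"id": 2, "height_cm": None, "weight_kg": 80.0},
    {"id": 3, "height_cm": 182.0, "weight_kg": None},
    {"id": 4, "height_cm": 158.0, "weight_kg": 52.0},
]

def complete_cases(rows):
    """Keep only the rows in which every field has a value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def impute_mean(rows, field):
    """Fill missing values of `field` with the mean of the observed values.

    Mean imputation is a crude stand-in chosen for brevity.
    """
    fill = mean(r[field] for r in rows if r[field] is not None)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]

def bmi(row):
    """Extract a derived quantity of interest from one complete row."""
    return row["weight_kg"] / (row["height_cm"] / 100) ** 2

complete = complete_cases(records)
filled = impute_mean(records, "height_cm")
bmis = [bmi(r) for r in complete]
```

Even in this tiny example, half of the rows drop out of the complete-case set, which is one way a "big" data set shrinks after refining.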

• Big Data Pitfalls:
    • Data cleanliness & accuracy
    • Observational bias
    • Do the data I have represent the population I’m interested in?
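
A small simulation can make the observational-bias pitfall concrete. The sketch below assumes a made-up population and a made-up logging mechanism in which larger values are more likely to be recorded. It then compares a large biased data set with a small simple random sample.

```python
import random

random.seed(0)

# Hypothetical population: 100,000 values of some quantity of interest.
population = [random.gauss(50, 10) for _ in range(100_000)]

def is_logged(value):
    """Assumed observation mechanism: larger values are more likely to be recorded."""
    return random.random() < min(1.0, max(0.0, (value - 20) / 60))

def average(values):
    return sum(values) / len(values)

# "Big" observational data: many rows, but collected with a bias.
big_biased = [v for v in population if is_logged(v)]
# Small designed data: a simple random sample of 1,000 rows.
small_random = random.sample(population, 1_000)

true_mean = average(population)
biased_error = abs(average(big_biased) - true_mean)
random_error = abs(average(small_random) - true_mean)

print("biased rows:", len(big_biased), "error:", round(biased_error, 2))
print("random rows:", len(small_random), "error:", round(random_error, 2))
```

Under these assumptions the biased data set is far larger, yet its mean sits further from the population mean than the mean of the small random sample. More rows do not correct a biased collection mechanism.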

Surveys & Experiments

• Even with Big Data, the data you need isn’t always in the building!
• … so ask (survey)!
    • Survey design
    • Stratified sampling
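
As an illustration of stratified sampling, the sketch below draws the same fraction from every stratum so that small groups are not lost by chance. The function name `stratified_sample`, the customer list, and the `region` field are hypothetical. Proportional allocation is only one of several possible allocation rules.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Draw the same fraction of rows from every stratum defined by key(row)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for members in strata.values():
        # Proportional allocation, with at least one row per stratum.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical survey frame: 900 customers in region "A", 100 in region "B".
customers = [{"id": i, "region": "A" if i < 900 else "B"} for i in range(1000)]
sample = stratified_sample(customers, key=lambda r: r["region"], fraction=0.1)
```

A simple random sample of 100 customers would contain about 10 from region "B" only on average. The stratified draw fixes that count by design.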

• … or experiment!
    • A/B Testing
    • Experimental Design
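
One common way to read the result of an A/B test is a two-proportion z-test. The sketch below assumes users were randomly assigned to the two variants. The conversion counts are invented for illustration, and the function name is hypothetical.

```python
from math import erfc, sqrt

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability under the standard normal distribution.
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value

# Hypothetical experiment: variant A converts 200 of 5,000 users, B converts 250 of 5,000.
z, p_value = two_proportion_z_test(200, 5000, 250, 5000)
print(round(z, 2), round(p_value, 3))
```

The normal approximation used here is reasonable for counts of this size. For very small samples, an exact test would be the safer choice.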

Data Exploration & Visualization

• Limited by pixels
    • Big data = a big black blob

• Extract signal from noise
    • Aggregations
    • Heat maps
    • Smoothing
    • Small multiples
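
The aggregation idea behind a heat map can be sketched without any plotting library: count the points in each cell of a grid, then shade each cell by its count. The data, grid size, and text shading below are assumptions chosen to keep the example self-contained.

```python
import random
from collections import Counter

def bin_counts(points, lo, hi, bins):
    """Count points per cell of a bins x bins grid covering [lo, hi) on both axes."""
    counts = Counter()
    width = (hi - lo) / bins
    for x, y in points:
        if lo <= x < hi and lo <= y < hi:
            col = min(bins - 1, int((x - lo) / width))
            row = min(bins - 1, int((y - lo) / width))
            counts[(row, col)] += 1
    return counts

def render(counts, bins, shades=" .:*#"):
    """Print the grid as text, with darker characters for fuller cells."""
    peak = max(counts.values())
    for row in range(bins):
        print("".join(
            shades[counts[(row, col)] * (len(shades) - 1) // peak]
            for col in range(bins)
        ))

random.seed(0)
# Hypothetical data: 50,000 points that would overplot into a solid blob.
points = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(50_000)]
grid = bin_counts(points, lo=-4.0, hi=4.0, bins=20)
render(grid, bins=20)
```

Fifty thousand points reduce to at most 400 cell counts, which is the quantity a heat map actually displays.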

Statistical Modeling & Forecasting

• You don’t always need big data
    • Sampling can help with observational bias

• Model selection
    • Feature extraction
    • Confounding?
    • Interactions?

• Model validation
    • Overfitting

• Prediction
    • Extrapolation
    • Confidence
http://xkcd.com/605/
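
Model validation against held-out data is one way to detect overfitting. The sketch below uses simulated data with an assumed linear trend plus noise. It compares a straight-line fit with a model that simply memorizes the training set (a 1-nearest-neighbor lookup, picked here only because it overfits in an obvious way).

```python
import random

random.seed(1)

def make_data(n):
    """Simulated data: an assumed linear trend plus Gaussian noise."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    return [(x, 2.0 * x + 1.0 + random.gauss(0, 2.0)) for x in xs]

def fit_line(data):
    """Ordinary least squares for y = slope * x + intercept."""
    mx = sum(x for x, _ in data) / len(data)
    my = sum(y for _, y in data) / len(data)
    slope = (sum((x - mx) * (y - my) for x, y in data)
             / sum((x - mx) ** 2 for x, _ in data))
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def fit_nearest(data):
    """Memorizing model: predict the y of the closest training x."""
    return lambda x: min(data, key=lambda p: abs(p[0] - x))[1]

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

train, test = make_data(50), make_data(500)
line, nearest = fit_line(train), fit_nearest(train)

for name, model in [("line", line), ("nearest", nearest)]:
    print(name, "train:", round(mse(model, train), 2),
          "test:", round(mse(model, test), 2))
```

The memorizing model reproduces the training data perfectly, so its training error says nothing about how it predicts new points. Only the held-out error shows that it has fitted the noise.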

Summary

• Big Data is great, but think of it as the “raw materials” for data science
    • After refining, “big” isn’t always so “Big”

• Use statistical insight to avoid pitfalls:
    • Inferences: Observational bias / Sampling bias
    • Predictions: Confounding / Overfitting
    • Think about variances and means (risk!)

• Some data scientists may miss these issues
    • Look for statistical expertise

• Further reading:
    • ComputerWorld: 12 predictive analytics screw-ups