The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough
Big Data matters because we want to use it to make sense of our world. It’s tempting to think there’s a “magic bullet” for analyzing big data, but simple “data distillation” often isn’t enough, and unsupervised machine-learning systems can be dangerous. (Like, bringing-down-the-entire-financial-system dangerous.) Data Science is the key to unlocking insight from Big Data: by combining computer science skills with statistical analysis and a deep understanding of the data and the problem, we can not only make better predictions but also fill in gaps in our knowledge, and even find answers to questions we hadn’t thought of yet.

Usage Rights

© All Rights Reserved


Presentation Transcript

  • The Rise of Data Science in the Age of Big Data Analytics: Why Data Distillation and Machine Learning Aren’t Enough. David M Smith, VP Marketing and Community, Revolution Analytics. Revolution Confidential
  • Today, we’ll discuss: What is Data Science? Why machine learning isn’t enough. Why Data Science works. The Data Scientist’s Toolkit. The Future of Big Data Analytics. Closing thoughts and resources
  • © Dov Harrington, CC BY 2.0: http://www.flickr.com/photos/idovermani/4110546683/
  • Where is it safe to fish near San Francisco? San Francisco Estuary Institute: http://www.sfei.org/tools/wqt
  • Hurricane Sandy. Bob Rudis: http://rud.is/b/2012/10/28/watch-sandy-in-r-including-forecast-cone/
  • Hurricane Sandy. Ed Chen: http://blog.echen.me/hurricane-sandy-outages/
  • When did Michael Jackson have his biggest hits? New York Times, June 25 2009 (3 hours after Michael Jackson’s death): http://www.nytimes.com/interactive/2009/06/25/arts/0625-jackson-graphic.html
  • Three Essential Skills of Data Scientists. Diagram labels: Models Data Integration Visualization Mashups Predictions Applications Uncertainty Problems Effective Data Sources Data Credibility Applications. Drew Conway: http://www.dataists.com/2010/09/the-data-science-venn-diagram/
  • Image © Abode of Chaos, CC BY 2.0: http://www.flickr.com/photos/home_of_chaos/6418989233/
  • Machine learning (ML) for predictions. Building the Model: Features and Responses go into ML, which produces scoring rules. Scoring new data: New Data plus the scoring rules give Predictions (scores). Validating the Model: the Validation set plus the scoring rules give Predictions, which are compared with the Response to measure “Accuracy”
  • Problem: A lack of perspective. Image © 2010 David M Smith. Some rights reserved, CC BY 2.0
  • Problem: Lack of credibility
  • Problem: Complexity
  • Data Science to the Rescue!
  • Answer Unasked Questions. Revolutions blog, “The Uncanny Valley of Big Data”: http://blog.revolutionanalytics.com/2012/02/the-uncanny-valley-of-big-data.html
  • Fill in knowledge gaps. “Companies that have massive amounts of data without massive amounts of clue are going to be displaced by startups that have less data but more clue.” -- Tim O’Reilly. “More data beats better algorithms, every time” -- Google. Google Research, “The Unreasonable Effectiveness of Data”: http://googleresearch.blogspot.com/2009/03/unreasonable-effectiveness-of-data.html Tim O’Reilly on Google+: https://plus.google.com/107033731246200681024/posts/4Xa76AtxYwd TechnoCalifornia: http://technocalifornia.blogspot.com/2012/07/more-data-or-better-models.html
  • Avoid ineffective reactions. S&P 500. Stupid Data Miner Tricks: http://nerdsonwallstreet.typepad.com/my_weblog/files/dataminejune_2000.pdf
  • © Henricks Photos, CC BY-ND 2.0: http://www.flickr.com/photos/hendricksphotos/3240667626/
  • 0. Data (Big & Messy)
  • 1. A language for programming with data. Download the White Paper “R is Hot”: bit.ly/r-is-hot
  • Data import and pre-processing. User-defined functions. Internet API interface. XML parsing. Iterative data processing. Custom graphics. Example: Grant awards to homeless veterans FY09. Data: Data.gov. Analysis: Drew Conway
  • 2. Speed. Lots and lots of speed. Diagram labels: Data Sampling, Aggregation, Variable Transformation, Feature Selection, Model Estimation, Model Refinement, Model Comparison / Benchmarking, Predictions
  • Use all available computing cycles. Diagram labels: Disk; Shared Memory; Data; Core 0 (Thread 0), Core 1 (Thread 1), Core 2 (Thread 2), Core n (Thread n); Multicore Processor (4, 8, 16+ cores)
  • 3. Algorithms that don’t choke on Big Data. PEMAs: Parallel External-Memory Algorithms. Diagram labels: BIG DATA; Data Partition (four times); Compute Node (four times); Master Node
  • Drink less coffee! Single Threaded, Non-optimized algorithms. Optimized, Parallelized Algorithms
  • 4. Move code to data (not vice versa). Map-Reduce. RHadoop: http://bit.ly/RHadoop
  • Big Data Appliances. More info: http://bit.ly/R-Netezza
  • Play Nice with Others. Presentation Layer: Business Intelligence Tools, Web-based data apps, Reporting / Spreadsheets. Analytics Layer: R. Data Layer: Relational datastores, Unstructured datastores
  • What every data scientist needs (marks shown as Open-Source R / Revolution R Enterprise):
    Interface with multiple data sources: ✓ / ✓✓
    Exploratory data analysis: ✓✓ / ✓✓
    Wide range of statistical methods: ✓✓ / ✓✓
    High-speed computation: ✘ / ✓✓
    Big Data support: ✘ / ✓✓
    Data/code locality (Hadoop, etc.): ✘ / ✓✓
    Print-quality data visualization: ✓ / ✓
    Scheduled batch production: ✓ / ✓✓
    Works in a multi-tool ecosystem: ✓✓ / ✓✓
    Integration into Data Apps: ✘ / ✓✓
  • Revolution R Enterprise: Big-Data R (marks shown as Open-Source R / Revolution R Enterprise):
    Interface with multiple data sources: ✓ / ✓✓
    Exploratory data analysis: ✓✓ / ✓✓
    Wide range of statistical methods: ✓✓ / ✓✓
    High-speed computation: ✘ / ✓✓
    Big Data support: ✘ / ✓✓
    Data/code locality (Hadoop, etc.): ✘ / ✓✓
    Print-quality data visualization: ✓✓ / ✓✓
    Scheduled batch production: ✓ / ✓✓
    Works in a multi-tool ecosystem: ✓✓ / ✓✓
    Integration into Data Apps: ✘ / ✓✓
    www.revolutionanalytics.com/products
  • Image © www.tinyplanetphotography.com
  • And … the future? Even more data. Cloud computing. Demand for Data Scientists. Diverging paradigms for data analytics. http://www.indeed.com/jobtrends
  • Diverging data paradigms. Diagram labels: More data, better fault tolerance; Easier programming, better performance; Files; Hadoop; NoSQL; Clusters; Data Appliances; Exploration; Storage; Modeling; Preprocessing; Production
  • Data Science in Production. Real-time Big Data Analytics: From Deployment to Production. Thursday, November 29, 2012, 10:00AM - 11:00AM Pacific Time. www.revolutionanalytics.com/news-events/free-webinars/
  • Building Data Science Teams. DJ Patil in O’Reilly Radar: http://oreil.ly/I3H5fI Statistics and Data Science graduates. Kaggle and Chorus. Revolution Analytics R Training: http://www.revolutionanalytics.com/services/training/
  • Closing Thoughts. Data Science process leads to more powerful, and more useful models. Data Scientists need a technology platform to think about, explore, and model data. Revolution R Enterprise is R for Big Data
  • Resources. Revolution R Enterprise, R for Big Data: www.revolutionanalytics.com/products RHadoop, Connecting R and Hadoop: bit.ly/r-hadoop Contact David Smith: david@revolutionanalytics.com, @revodavid, blog.revolutionanalytics.com
  • Thank you. The leading commercial provider of software and support for the popular open source R statistics language. www.revolutionanalytics.com 650.646.9545 Twitter: @RevolutionR
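
The “Machine learning (ML) for predictions” slide describes three steps: build a model from features and responses, validate it on held-out data, and score new data. A minimal sketch of that workflow follows. It is written in Python rather than the R the deck centers on, and everything in it is an illustrative assumption rather than anything from the slides: the synthetic data, the one-feature threshold rule standing in for “ML”, and the hypothetical helper names `fit_threshold`, `score`, and `accuracy`.

```python
import random


def fit_threshold(features, responses):
    """Build the model: pick the cutoff with the best training accuracy."""
    best_cut, best_acc = None, -1.0
    for cut in sorted(set(features)):
        acc = sum((x >= cut) == y for x, y in zip(features, responses)) / len(features)
        if acc > best_acc:
            best_cut, best_acc = cut, acc
    return best_cut


def score(cut, features):
    """Apply the scoring rule: predict True when the feature reaches the cutoff."""
    return [x >= cut for x in features]


def accuracy(predictions, responses):
    """Fraction of predictions that match the known responses."""
    return sum(p == y for p, y in zip(predictions, responses)) / len(responses)


# Synthetic data (an assumption for illustration): the feature tends to be
# larger when the response is True, with overlapping noise.
random.seed(0)
data = []
for _ in range(400):
    y = random.random() < 0.5
    x = random.gauss(2.0 if y else 0.0, 1.0)
    data.append((x, y))

train, valid = data[:300], data[300:]

# Building the model: features + responses -> scoring rule.
cut = fit_threshold([x for x, _ in train], [y for _, y in train])

# Validating the model: score the held-out set and compare with its responses.
valid_acc = accuracy(score(cut, [x for x, _ in valid]), [y for _, y in valid])

# Scoring new data: apply the same rule to unseen feature values.
new_scores = score(cut, [-2.0, 4.0])

print(f"cutoff={cut:.2f} validation accuracy={valid_acc:.2f} new scores={new_scores}")
```

The validation accuracy here only says how well the rule reproduces the held-out responses. It says nothing about whether the feature is the right one to look at, which is the gap the later “Problem” slides point to.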
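
The “Algorithms that don’t choke on Big Data” slide names PEMAs (Parallel External-Memory Algorithms): the data is split into partitions, each partition is summarized independently, and a master step combines the summaries. A rough sketch of that idea for a mean and variance follows, in Python. It is not Revolution R Enterprise's implementation; the chunk size, the in-memory list standing in for data on disk, and the helper names `chunks`, `summarize`, and `combine` are assumptions made for illustration. The chunks are processed sequentially here, but because each summary depends only on its own chunk, the `summarize` calls could be handed to separate cores or nodes.

```python
from functools import reduce


def chunks(values, size):
    """Yield successive partitions, as if reading blocks from disk."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


def summarize(chunk):
    """Per-partition work: count, mean, and sum of squared deviations."""
    n = len(chunk)
    mean = sum(chunk) / n
    m2 = sum((x - mean) ** 2 for x in chunk)
    return n, mean, m2


def combine(a, b):
    """Master step: merge two partition summaries into one."""
    na, mean_a, m2a = a
    nb, mean_b, m2b = b
    n = na + nb
    delta = mean_b - mean_a
    mean = mean_a + delta * nb / n
    m2 = m2a + m2b + delta ** 2 * na * nb / n
    return n, mean, m2


# Stand-in data set (an assumption); only one chunk is summarized at a time.
data = [float(i % 17) for i in range(10_000)]

n, mean, m2 = reduce(combine, map(summarize, chunks(data, 1_000)))
variance = m2 / (n - 1)  # sample variance from the combined summary

print(f"n={n} mean={mean:.4f} variance={variance:.4f}")
```

Only the small `(n, mean, m2)` summaries ever need to be held together, so the full data set never has to fit in memory at once.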