Building analytical apps on Hadoop


  1. Building Analytical Applications on Hadoop • Josh Wills | Director of Data Science • November 2012
  2. About Me
  3. What are ‘Analytical Applications?’
  4. The Humble Dashboard
  5. Crossfilter with Flight Information
  6. New York Times Electoral Vote Map
  7. New York Times Electoral Vote Map (Detail)
  8. Analytical Applications vs. Frameworks
  9. Developing Analytical Applications: A Case Study
  10. 2012: The Predicting of the President
  11. RealClearPolitics • Simple Average of Polls • Transparent • Simple Interactions
  12. FiveThirtyEight • “Foxy” Model • Opaque • Simple Interactions with a richer UI
  13. Princeton Election Consortium • Medians and Polynomials • Transparent • Rich Interactions
  14. How Did They Do?
  15. A Few of These, Because They’re Fun
  16. A Few of These, Because They’re Fun
  17. A Few of These, Because They’re Fun
  18. Here’s the Rub: One Expert Beat Nate
  19. Index Funds, Hedge Funds, and Warren Buffett
  20. A Brief Introduction to Hadoop
  21. Data Storage in 2001: Databases • Structured schemas • Intensive processing done where data is stored • Somewhat reliable • Expensive at scale
  22. Data Storage in 2001: Filers • No schemas, stores any kind of file • No data processing capability • Reliable • Expensive at scale
  23. And Then, This Happened
  24. Data Economics: Return on Byte
  25. Big Data Economics • No individual record is particularly valuable • Having every record is incredibly valuable • Web index • Recommendation systems • Sensor data • Market basket analysis • Online advertising
  26. Introduction to Hadoop
  27. The Hadoop Distributed File System • Based on the Google File System • Data stored in large files • Large block size: 64 MB to 256 MB per block • Blocks are replicated to multiple nodes in the cluster
  28. Simple, Reliable Processing: MapReduce • Map Stage: embarrassingly parallel • Shuffle Stage: large-scale distributed sort • Reduce Stage: process all of the values that have the same key in a single step • Process the data where it is stored • Write once and you’re done.
  29. Developing Analytical Applications with Hadoop
  30. Novelty is the Enemy of Adoption
  31. The Best Way to Get Started: Apache Hive • Data Warehouse System on top of Hadoop • SQL-based query language • SELECT, INSERT, CREATE TABLE • Includes some MapReduce-specific extensions
  32. Borrowing Abstractions
  33. Improving the UX (http://github.com/cloudera/impala)
  34. Moving Beyond the Abstractions
  35. Making the Abstract Concrete
  36. Cloudera’s Data Science Course
  37. Analytical Applications I Love
  38. The Experiments Dashboard
  39. Adverse Drug Events
  40. Gene Sequencing and Analytics
  41. The Doctor’s Perspective
  42. A Couple of Themes • 1. Structure the data in the way that makes sense for the problem. • 2. Interactive inputs, not just interactive outputs. • 3. Simpler interfaces that yield more sophisticated answers.
  43. Working Towards The Dream
  44. Developing Analytical Applications: Moving Beyond MapReduce
  45. The Cambrian Explosion…of Frameworks
  46. It’s Frameworks All The Way Down: Spark • Developed at Berkeley’s AMP Lab • Defines operations on distributed in-memory collections • Written in Scala • Supports reading from and writing to HDFS
  47. IFATWD: GraphLab • Developed at CMU • Lower-level primitives (but higher than MPI) • Map/Reduce => Update/Sort • Flexible, allows for asynchronous computations • Reads from HDFS
  48. Playing with YARN
  49. BranchReduce (http://github.com/cloudera/branchreduce)
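
The HDFS slide (27) says data is stored in large blocks of 64 MB to 256 MB that are replicated to multiple nodes. A rough sketch of the arithmetic behind that, not taken from the deck: the function name `hdfs_footprint` is hypothetical, the 128 MB block size is simply one value inside the slide's range, and the replication factor of 3 is an assumption (it is a common default, but the slide only says "multiple nodes").

```python
def hdfs_footprint(file_size_bytes, block_size_bytes=128 * 1024**2, replication=3):
    """Estimate how a single file is laid out under block storage with replication.

    Returns (number of blocks, total block replicas, raw bytes stored).
    Both defaults are assumptions chosen for illustration.
    """
    # Ceiling division in integer arithmetic: a partial final block still counts.
    num_blocks = -(-file_size_bytes // block_size_bytes)
    # Every block is copied `replication` times across the cluster.
    total_replicas = num_blocks * replication
    # The final block is not padded, so raw storage is just size times copies.
    raw_bytes = file_size_bytes * replication
    return num_blocks, total_replicas, raw_bytes


# A 1 GiB file with 128 MiB blocks and three copies of each block.
blocks, replicas, raw = hdfs_footprint(1024**3)
print(blocks, replicas, raw)
```

With these assumed settings a 1 GiB file splits into 8 blocks, held as 24 block replicas. Large blocks keep the block count per file small, which is one reason the design favors big files.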

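
The three stages on the MapReduce slide (28) can be sketched in a single process. This is only an illustration of the map, shuffle, and reduce steps, not Hadoop's API: every function name here is hypothetical, the word-count job and the sample lines are invented for the example, and a real cluster would run the map and reduce calls in parallel on the nodes that hold the data.

```python
from itertools import groupby
from operator import itemgetter


def map_stage(records, mapper):
    # Each record is handled independently, which is why this stage
    # is embarrassingly parallel on a real cluster.
    for record in records:
        yield from mapper(record)


def shuffle_stage(pairs):
    # Sort by key so that all values for one key sit next to each other.
    # On a cluster this is the large-scale distributed sort.
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


def reduce_stage(grouped, reducer):
    # The reducer sees every value for a given key in a single call.
    for key, values in grouped:
        yield key, reducer(key, values)


def run_job(records, mapper, reducer):
    return dict(reduce_stage(shuffle_stage(map_stage(records, mapper)), reducer))


# Hypothetical word-count job used only to exercise the three stages.
def word_count_mapper(line):
    for word in line.lower().split():
        yield word, 1


def word_count_reducer(word, counts):
    return sum(counts)


lines = ["big data big cluster", "data where it is stored"]
counts = run_job(lines, word_count_mapper, word_count_reducer)
print(counts)
```

The job author writes only the mapper and the reducer. The sort-and-group step in the middle is the same for every job, which is what lets the framework take responsibility for it.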