Building analytical apps on Hadoop

  • They are applications that allow users to work with and make decisions from data.
  • It seems like there should be a UX equivalent of Clippy, maybe a tiny picture of Edward Tufte, that pops up whenever someone decides to use a 3D pie chart.
  • http://square.github.com/crossfilter/
  • http://elections.nytimes.com/2012/results/president (Click on “Shift from 2008”)
  • Click on a state to zoom in
  • Frameworks != analytical applications, for our purposes today. It’s not an analytical application until you put some data in it.
  • Several models were developed to predict the 2012 presidential election; let’s consider a few of them.
  • http://www.realclearpolitics.com/epolls/2012/president/2012_elections_electoral_college_map.html
  • http://fivethirtyeight.blogs.nytimes.com/
  • http://election.princeton.edu/
  • http://isnatesilverawitch.com/ Everyone predicted the election correctly. The RCP model got every state but Florida, PEC called Florida a tossup, and 538 got every single state right.
  • Markos Moulitsas over at the Daily Kos did even better than Nate at predicting the share of the vote within the swing states. Don’t think that math can always outperform an expert armed with good data. http://news.cnet.com/8301-13578_3-57546778-38/among-the-top-election-quants-nate-silver-reigns-supreme/
  • Index fund == simple average. Hedge fund == 538. Warren Buffett == expert with good data.
  • Classical data economics: If the value I can extract from a byte is less than the cost to store it, then I throw it away or store it on tape.
  • We use metaphors that help us understand new technology in terms of the old. We translate desktop tools and metaphors onto Hadoop, even when we’re working with specialized data types: http://blog.cloudera.com/blog/2012/01/seismic-data-science-hadoop-use-case/
  • It’s a data warehousing metaphor, not an actual data warehouse: schema on read vs. schema on write, for example. It is non-interactive for the most part. Think of ELT, not interactive queries.
  • We borrow these abstractions because they make it easy to get started, but they don’t necessarily conform to the user’s expectations of how Hadoop will work. If you think of Hadoop as a really big database, or as a spreadsheet that goes on forever and ever, then you have failed to understand Hadoop.
  • Impala is about fulfilling those abstractions, esp. for interactive queries of relational-style data on Hadoop.
  • But we can also go beyond the abstractions and study how Hadoop can be effective for new kinds of analytic applications.
  • Step 1: Study real problems. Especially real problems where non-sophisticated users (e.g., people who don’t even know SQL) need to do sophisticated analysis on large quantities of information.
  • I realized earlier this year that other people do not use Hive the way that I use Hive, and so we created the data science course to take people through the problem of building an analytical application from start to finish on Hadoop. http://blog.cloudera.com/blog/2012/10/data-science-training/
  • They are applications that allow users to work with and make decisions from data.
  • http://blog.cloudera.com/blog/2011/11/using-hadoop-to-analyze-adverse-drug-events/
  • http://www.slideshare.net/cloudera/7-leveraging-h-base-for-the-worlds-largest-curated-genomic-data-collection-satnam-alag-nextbio-finalupdatedlastminute
  • The truth is that building tools for unsophisticated users typically requires incredibly sophisticated development.
  • An open-source version of Wolfram Alpha for useful data.
  • https://github.com/cloudera/kitten
  • http://github.com/cloudera/branchreduce
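The "schema on read vs. schema on write" distinction in the note on the data warehousing metaphor can be illustrated with a small sketch. This is plain Python, not Hive or Hadoop code; the sample records, the `read_with_schema` helper, and both schemas are hypothetical names made up for illustration. The raw lines are stored untouched, and a schema is applied only when the data is read, so the same bytes can be read under more than one schema.

```python
import csv

# Hypothetical raw records, stored exactly as they arrived (no schema enforced on write).
RAW_LINES = [
    "2012-11-06,OH,50.7",
    "2012-11-06,FL,not-a-number",  # a malformed record is still stored
    "2012-11-06,VA,51.2",
]


def read_with_schema(lines, schema):
    """Apply a schema, given as (column name, converter) pairs, at read time.

    Records that do not fit the schema are skipped when reading,
    instead of being rejected when they were written.
    """
    rows = []
    for fields in csv.reader(lines):
        try:
            rows.append(
                {name: convert(value) for (name, convert), value in zip(schema, fields)}
            )
        except ValueError:
            continue  # this record does not match the schema used for this read
    return rows


# Two illustrative schemas over the same stored bytes.
TYPED_SCHEMA = [("date", str), ("state", str), ("share", float)]
TEXT_SCHEMA = [("date", str), ("state", str), ("share", str)]

typed_rows = read_with_schema(RAW_LINES, TYPED_SCHEMA)
text_rows = read_with_schema(RAW_LINES, TEXT_SCHEMA)
```

Under the typed schema the malformed Florida record is dropped at read time; under the text schema all three records come back. A schema-on-write system would have had to reject or repair that record before storing it.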
Building analytical apps on Hadoop

    1. Building Analytical Applications on Hadoop • Josh Wills | Director of Data Science • November 2012
    2. About Me
    3. What are ‘Analytical Applications?’
    4. The Humble Dashboard
    5. Crossfilter with Flight Information
    6. New York Times Electoral Vote Map
    7. New York Times Electoral Vote Map (Detail)
    8. Analytical Applications vs. Frameworks
    9. Developing Analytical Applications: A Case Study
    10. 2012: The Predicting of the President
    11. RealClearPolitics • Simple Average of Polls • Transparent • Simple Interactions
    12. FiveThirtyEight • “Foxy” Model • Opaque • Simple Interactions with a richer UI
    13. Princeton Election Consortium • Medians and Polynomials • Transparent • Rich Interactions
    14. How Did They Do?
    15. A Few of These, Because They’re Fun
    16. A Few of These, Because They’re Fun
    17. A Few of These, Because They’re Fun
    18. Here’s the Rub: One Expert Beat Nate
    19. Index Funds, Hedge Funds, and Warren Buffett
    20. A Brief Introduction to Hadoop
    21. Data Storage in 2001: Databases • Structured schemas • Intensive processing done where data is stored • Somewhat reliable • Expensive at scale
    22. Data Storage in 2001: Filers • No schemas, stores any kind of file • No data processing capability • Reliable • Expensive at scale
    23. And Then, This Happened
    24. Data Economics: Return on Byte
    25. Big Data Economics • No individual record is particularly valuable • Having every record is incredibly valuable • Web index • Recommendation systems • Sensor data • Market basket analysis • Online advertising
    26. Introduction to Hadoop
    27. The Hadoop Distributed File System • Based on the Google File System • Data stored in large files • Large block size: 64MB to 256MB per block • Blocks are replicated to multiple nodes in the cluster
    28. Simple, Reliable Processing: MapReduce • Map Stage • Embarrassingly parallel • Shuffle Stage: Large-scale distributed sort • Reduce Stage • Process all of the values that have the same key in a single step • Process the data where it is stored • Write once and you’re done.
    29. Developing Analytical Applications with Hadoop
    30. Novelty is the Enemy of Adoption
    31. The Best Way to Get Started: Apache Hive • Apache Hive • Data Warehouse System on top of Hadoop • SQL-based query language • SELECT, INSERT, CREATE TABLE • Includes some MapReduce-specific extensions
    32. Borrowing Abstractions
    33. Improving the UX (http://github.com/cloudera/impala)
    34. Moving Beyond the Abstractions
    35. Making the Abstract Concrete
    36. Cloudera’s Data Science Course
    37. Analytical Applications I Love
    38. The Experiments Dashboard
    39. Adverse Drug Events
    40. Gene Sequencing and Analytics
    41. The Doctor’s Perspective
    42. A Couple of Themes 1. Structure the data in the way that makes sense for the problem. 2. Interactive inputs, not just interactive outputs. 3. Simpler interfaces that yield more sophisticated answers.
    43. Working Towards The Dream
    44. Developing Analytical Applications: Moving Beyond MapReduce
    45. The Cambrian Explosion…of Frameworks
    46. It’s Frameworks All The Way Down: Spark • Developed at Berkeley’s AMP Lab • Defines operations on distributed in-memory collections • Written in Scala • Supports reading from and writing to HDFS
    47. IFATWD: GraphLab • Developed at CMU • Lower-level primitives (but higher than MPI) • Map/Reduce => Update/Sort • Flexible, allows for asynchronous computations • Reads from HDFS
    48. Playing with YARN
    49. BranchReduce (http://github.com/cloudera/branchreduce)
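The three MapReduce stages listed on the "Simple, Reliable Processing: MapReduce" slide can be imitated on a single machine to show how they fit together. The following is a minimal sketch in plain Python, not Hadoop's API: the function names (`map_stage`, `shuffle_stage`, `reduce_stage`, `run_job`) and the word-count task are assumptions chosen for illustration, and a real job would distribute each stage across the cluster.

```python
from collections import defaultdict
from itertools import chain


def map_stage(record):
    """Map: emit (key, value) pairs for one record, independently of all other records."""
    for word in record.lower().split():
        yield word, 1


def shuffle_stage(pairs):
    """Shuffle: group values by key and sort by key (a stand-in for the distributed sort)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_stage(key, values):
    """Reduce: process all of the values that share a key in a single step."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence over an in-memory list of records."""
    mapped = chain.from_iterable(map_stage(record) for record in records)
    return [reduce_stage(key, values) for key, values in shuffle_stage(mapped)]


# Hypothetical input records.
counts = run_job([
    "Hadoop stores data",
    "Hadoop processes data where data is stored",
])
```

Because `map_stage` looks at one record at a time, the map calls could run in any order or on any machine, which is what makes that stage embarrassingly parallel. The reduce step only needs the guarantee that the shuffle has brought together every value for a given key.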
