Finding the needles in the haystack. An Overview of Analyzing Big Data with Hadoop


A presentation I gave to R&D Informatics, broadly introducing large-scale data processing with Hadoop and focusing on HDFS, MapReduce, Pig, and Hive.



1. Finding the needles in the haystack
   An overview of analyzing big data with Hadoop
   Presented by Chris Baglieri, Architect, Research Architecture & Development
   Technology Forum, August 30th, 2010
2. Hadoop-alooza (ahem, Agenda)
   Opening Act (Big Data):
   - Overview
   - Analytical challenges
   - It's not just <insert big web company name here>'s problem
   Headliner (Hadoop Core):
   - Overview
   - Hadoop Distributed File System
   - Hadoop MapReduce
   Secondary Stage (Tooling Atop MapReduce):
   - Overview
   - Pig
   - Hive
3. Setting the Stage; What If We Could…
   … centrally log instrument activity across R&D?
   - Are there times when a lab is exceptionally efficient or inefficient?
   - Do certain instruments that look like sound investments cost us more because of downtime, or rarely produce actionable data?
   - Do certain instruments fail following a predictable chain of events?
   … centrally log service or application activity across informatics?
   - Do certain applications fail following a predictable chain of events?
   - Are two seemingly disparate applications performing similar tasks, and can we extract those tasks into a common service?
   … collect exhaust data on samples?
   - What's the logical starting point for analyzing a given sample?
   - Could we construct a sample's analytical resume on the spot?
   - Could we identify friction in our discovery process?
4. Oh, there's the big data!
   Facebook:
   - 36 PB of uncompressed data
   - 2,250 machines; 23,000 cores; 32 GB of RAM per machine
   - Processing 80-90 TB/day
   Yahoo!:
   - 70 PB of data in distributed storage; 170 PB spread across the globe
   - 34,000 servers
   - Processing 3 PB/day
   Twitter:
   - 7 TB/day into distributed storage
   LinkedIn:
   - 120 billion relationships
   - 16 TB of intermediate data
   * All figures pulled from Hadoop Summit 2010
5. We're not them, right?
   Research pushes boundaries:
   - Sequencing (25 GB/day)
   - Gene Expression Omnibus (>2 TB)
   - Studies consisting of 5 billion rows
   - Aggregating log data across R&D
   - More collaborations, more partners
   AWS Public Data Sets:
   - GenBank (~200 GB)
   - Ensembl genomic data (~200 GB)
   - PubChem library (~230 GB)
   Alright, so we're not exactly close to them, but we'll still run into similar problems sooner rather than later.
6. So what if we store everything?
   A typical drive from 1990 could store 1,370 MB and had a transfer speed of 4.4 MB/s, so reading the entire disk took about 5 minutes. Twenty years later, 1 TB or more of storage is the norm, while the transfer rate has only climbed to ~100 MB/s, yielding a full read time of more than 2.5 hours.
   * Graph extrapolated from data generated on Tom's Hardware
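The slide's arithmetic checks out; a quick sketch using the capacities and transfer rates quoted above:

```python
# Time to read an entire disk sequentially: capacity divided by transfer rate.
def full_read_minutes(capacity_mb, rate_mb_per_s):
    return capacity_mb / rate_mb_per_s / 60.0

t_1990 = full_read_minutes(1370, 4.4)       # 1990 drive: about 5 minutes
t_2010 = full_read_minutes(1_000_000, 100)  # 2010 drive: almost 3 hours
```

Capacity grew roughly six orders of magnitude while bandwidth grew barely two, which is exactly why Hadoop spreads a read across many disks in parallel.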
7. Our business is data…
   … and it's already:
   - Big (just not as big).
   - Very complex.
   - Highly relational.
   - Very diverse.
   … and it may not:
   - Fit nicely in a database.
   - Scale nicely in a database.
   … and we need tooling to:
   - Aggregate it.
   - Organize it.
   - Analyze it.
   - Share it.
   - Do more in the area of metrics.
8. Enter the elephant!
   At the highest level, Hadoop is a…
   - … reliable shared storage system.
   - … batch analytical processing engine.
   - … toy elephant.
   Technically speaking, Hadoop is…
   - … an open source, Apache-licensed project with numerous sub-projects.
   - … based on Google infrastructure (GFS, MapReduce, Sawzall, BigTable, …).
   - … written in Java.
   - … under active development.
   - … distributed, fault tolerant, and incredibly scalable.
   - … built with commodity hardware in mind.
   - … efficient with big data.
   - … an ecosystem*.
   - … a market*.
9. Hadoop (and a Small Fraction of) The Ecosystem
10. Hadoop (and a Small Fraction of) The Market
11. Hadoop, the Apache Project
    [Stack diagram of sub-projects: HDFS (unstructured storage), HBase (structured storage), MapReduce (raw processing), Pig and Hive (compiled processing), alongside Zookeeper, Chukwa, and Hadoop Commons]
12. Hadoop Distributed File System (HDFS)
    HDFS is a file system designed for storing very large files (with hooks for smaller files) with streaming data access patterns, running on clusters of commodity hardware.
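For flavor, a few illustrative HDFS shell commands; these require a running cluster, and the paths here are hypothetical examples:

```shell
hadoop fs -mkdir /data
hadoop fs -put results.csv /data/results.csv   # copy a local file into HDFS
hadoop fs -ls /data                            # list a directory
hadoop fs -cat /data/results.csv | head        # stream a file back out
```

The interface deliberately mimics a Unix file system, even though blocks are replicated across many machines underneath.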
13. Hadoop Distributed File System (HDFS)
14. Hadoop Distributed File System (HDFS)
15. HDFS Anti-Patterns
    Low-latency data access:
    - Applications requiring access times in the tens of milliseconds should look elsewhere.
    - HDFS is optimized for high data throughput at the expense of latency.
    - HBase is better suited for these access patterns.
    Lots of small files:
    - The capacity of the namenode is a limiting factor.
    - Millions of files is feasible; billions is beyond the capability of today's hardware.
    Multiple writers, arbitrary file modifications:
    - A file may be written to by only a single writer at a given point in time.
    - Writes are always made at the end of the file.
16. MapReduce
    Generalities:
    - A simple programming model.
    - A framework for processing large data sets.
    - Abstracts away the complexities of parallel programming.
    - Computation is routed to the data.
    A language unto itself:
    - A MapReduce job is a unit of work.
    - It consists of input data, a MapReduce program, and configuration.
    - Hadoop runs the job by dividing it into tasks, namely map tasks and reduce tasks.
    - Two node types control execution: the jobtracker and the tasktrackers.
    - Hadoop divides the input to a job into fixed-size pieces called splits.
    - Hadoop creates one map task per split (often the size of a block).
    - Hadoop employs a data locality optimization.
    - Map task results are intermediate; they are stored locally, not on HDFS.
17. MapReduce
    Basic steps (the mapper and reducer change to fit the problem):
    - Read the data.
    - Extract something you care about from each record (mapping).
    - Shuffle and sort the extracted data.
    - Aggregate, summarize, filter, or transform (reducing).
    - Write the results.
18. MapReduce (Job Configuration)
19. MapReduce (Map)
    - General form: (k1, v1) -> list(k2, v2)
    - key: offset from the beginning of the file
    - value: a line of text
    - output: { word == Text, 1 == IntWritable }
    - reporter: container that holds metadata about the job
20. MapReduce (Reduce)
    - General form: (k2, list(v2)) -> list(k3, v3)
    - key: a word emitted by the map phase
    - value: collection of word counts, where each count is 1
    - output: { word == Text, number of times it appeared in the full text == IntWritable }
    - reporter: container that holds metadata about the job
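The map and reduce signatures above describe the classic word count. A minimal, Hadoop-free simulation of the same (k1, v1) -> list(k2, v2) and (k2, list(v2)) -> list(k3, v3) flow (plain Python, no Hadoop APIs; illustrative only):

```python
from collections import defaultdict

# map: (k1, v1) -> list(k2, v2); here (line offset, line text) -> [(word, 1), ...]
def map_fn(offset, line):
    return [(word, 1) for word in line.lower().split()]

# shuffle and sort: group intermediate values by key, as Hadoop does between phases
def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

# reduce: (k2, list(v2)) -> (k3, v3); here (word, [1, 1, ...]) -> (word, count)
def reduce_fn(key, values):
    return key, sum(values)

lines = ["the quick brown fox", "the lazy dog"]
intermediate = []
for offset, line in enumerate(lines):
    intermediate.extend(map_fn(offset, line))
counts = dict(reduce_fn(k, vs) for k, vs in shuffle(intermediate))
# counts["the"] == 2; every other word appears once
```

In real Hadoop the mapper and reducer would extend the framework's Mapper and Reducer classes and run across the cluster; the data flow, however, is exactly this.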
21. Hadoop-able Problems (Courtesy of Cloudera)
    Properties of the data:
    - Complex data.
    - Multiple data sources.
    - Lots of data.
    Types of analysis:
    - Text mining.
    - Index building.
    - Graph creation and analysis.
    - Pattern recognition.
    - Collaborative filtering.
    - Prediction models.
    - Sentiment analysis.
    - Risk assessment.
    Examples:
    - Modeling true risk.
    - Customer churn analysis.
    - Recommendation engines.
    - Ad targeting.
    - Point-of-sale transaction analysis.
    - Failure prediction.
    - Threat analysis.
    - Trade surveillance.
    - Search quality.
    - Data "sandboxes".
22. Tooling Atop MapReduce
    Challenges:
    - MapReduce is incredibly powerful but quite verbose.
    - It's a developer tool through and through.
    - Common tasks are implemented over and over and over again.
    - Better tooling is needed for data preparation and presentation.
    Higher-order abstractions are emerging:
    - Writing a job in a higher-order MapReduce language requires orders of magnitude less code, and therefore orders of magnitude less time, at an acceptable cost in execution time.
23. Hadoop Pig Overview
    - Born at Yahoo!; roughly 40% of Yahoo!'s Hadoop jobs are scripted in Pig.
    - Operates directly on data in HDFS via…
      - … a command line.
      - … scripts.
      - … plug-ins (Eclipse has one called PigPen).
    - Well suited for "data factory" (e.g. ETL) work.
      - Raw data in, data ready for consumers out.
    - Users are often engineers, data specialists, and researchers.
    - Consists of a…
      - … data flow language (Pig Latin); 10 lines of Pig Latin ~= 200 lines of Java.
      - … shell (Grunt).
      - … server with a JDBC-like interface (Pig Server).
    - Has lots of constructs suitable for data manipulation:
      - JOIN
      - MERGE
      - DISTINCT
      - UNION
      - LIMIT
24. Pig Makes Something Like This…
25. …Look Like This
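The side-by-side code these two slides showed is lost with the images. As a stand-in, the canonical word count in Pig Latin looks roughly like this (input and output paths are hypothetical):

```pig
-- hypothetical input: one record per line of text
lines   = LOAD '/data/input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;
STORE counts INTO '/data/wordcount';
```

Five lines that Pig compiles into the same map, shuffle, and reduce phases a hand-written Java job would need a couple of hundred lines to express.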
26. A More Interesting Pig Example
    A more interesting example: the above script (taken from a presentation given by Twitter's analytics lead) breaks down where users are tweeting from (e.g. the API, the front page, their profile page, etc.). Note the use of scripts from Twitter's "PiggyBank".
27. Hadoop Hive Overview
    - Born at Facebook.
    - Does not operate directly on HDFS data, but rather on a Hive metastore.
      - HDFS data is pre-configured and loaded into the metastore.
      - You may have something like /home/hive/warehouse on HDFS.
      - Tables are created and stored as sub-directories in the warehouse.
      - Contents are stored as files within table sub-directories.
    - Well suited for "data presentation" (e.g. warehouse) work.
    - Users are often engineers using data for their systems, analysts, or decision-makers.
    - Consists of a…
      - … SQL-like data language.
      - … shell.
      - … server.
    - Mirrors SQL semantics:
      - Table creation.
      - SELECT clauses.
      - JOIN clauses.
      - Etc.
28. Working With Hive
29. Working With Hive
30. Working With Hive
31. Working With Hive
32. Working With Hive
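The Hive walkthrough on these slides was a series of screenshots. A representative HiveQL session consistent with the overview above would look something like this (the table name, columns, and path are hypothetical):

```sql
-- Hive maps this table to a sub-directory under the warehouse on HDFS
CREATE TABLE page_views (user_id STRING, url STRING, view_time STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA INPATH '/data/page_views.tsv' INTO TABLE page_views;

-- familiar SQL semantics; Hive compiles this into MapReduce jobs
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC
LIMIT 10;
```

To an analyst it reads like a database; underneath, each query fans out across the cluster as MapReduce.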
33. Closing Thoughts
    re: Data
    - We already have big (or at the very least medium) data.
    - Our data is inherently diverse, complex, and highly related.
    - Our data footprint is growing fast; can informatics keep up?
    re: Technology
    - The Hadoop ecosystem and market are maturing and expanding rapidly.
    - The Hadoop community is strong (oh yeah, and helpful too).
    - The cost to get in the game is not outrageous in terms of software or hardware.
    - Provisioning "hardware" is drastically easier, especially given our AWS efforts.
    re: Opportunity
    - Capturing data doesn't always need to be structured or regimented.
    - R&D could benefit from additional "free form" data exploration.
    - While relatively inexpensive to get in the game, it does require an investment.
34. Questions
    "In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers." (Grace Hopper)
    Fun fact: she is also credited with "It's easier to ask forgiveness than it is to get permission."