
Hadoop and Pig at Twitter__HadoopSummit2010

Hadoop Summit 2010 - Developers Track
Hadoop and Pig at Twitter
Kevin Weil, Twitter

Published in: Technology

  1. 1. Hadoop at Twitter <ul><li>Kevin Weil -- @kevinweil </li></ul><ul><li>Analytics Lead, Twitter </li></ul>
  2. 2. The Twitter Data Lifecycle <ul><li>Data Input </li></ul><ul><li>Data Storage </li></ul><ul><li>Data Analysis </li></ul><ul><li>Data Products </li></ul>
  3. 3. The Twitter Data Lifecycle <ul><li>Data Input: Scribe , Crane </li></ul><ul><li>Data Storage: Elephant Bird , HBase </li></ul><ul><li>Data Analysis: Pig , Oink </li></ul><ul><li>Data Products: Birdbrain </li></ul>Legend: 1 = Community Open Source; 2 = Twitter Open Source (or soon)
  4. 4. My Background <ul><li>Studied Mathematics and Physics at Harvard, Physics at Stanford </li></ul><ul><li>Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data </li></ul><ul><li>Cooliris (web media): Hadoop and Pig for analytics, TBs of data </li></ul><ul><li>Twitter : Hadoop, Pig, machine learning, visualization, social graph analysis, (soon) PBs of data </li></ul>
  5. 5. The Twitter Data Lifecycle <ul><li>Data Input: Scribe , Crane </li></ul><ul><li>Data Storage </li></ul><ul><li>Data Analysis </li></ul><ul><li>Data Products </li></ul>
  6. 6. What Data? <ul><li>Two main kinds of raw data </li></ul><ul><li>Logs </li></ul><ul><li>Tabular data </li></ul>
  7. 7. Logs <ul><li>Started with syslog-ng </li></ul><ul><li>As our volume grew, it didn’t scale </li></ul>
  8. 8. Logs <ul><li>Started with syslog-ng </li></ul><ul><li>As our volume grew, it didn’t scale </li></ul><ul><li>Resources overwhelmed </li></ul><ul><li>Lost data </li></ul>
  9. 9. Scribe <ul><li>Scribe daemon runs locally; reliable in network outage </li></ul><ul><li>Nodes only know downstream writer; hierarchical, scalable </li></ul><ul><li>Pluggable outputs, per category </li></ul>[Diagram: FE, FE, FE -> Agg, Agg -> HDFS, File]
  10. 10. Scribe at Twitter <ul><li>Solved our problem, opened new vistas </li></ul><ul><li>Currently 57 different categories logged from multiple sources </li></ul><ul><li>FE: Javascript, Ruby on Rails </li></ul><ul><li>Middle tier: Ruby on Rails, Scala </li></ul><ul><li>Backend: Scala, Java, C++ </li></ul><ul><li>7 TB/day into HDFS </li></ul><ul><li>Log first, ask questions later. </li></ul>
  11. 11. Scribe at Twitter <ul><li>We’ve contributed to it as we’ve used it [1] </li></ul><ul><li>Improved logging, monitoring, writing to HDFS, compression </li></ul><ul><li>Added ZooKeeper-based config </li></ul><ul><li>Continuing to work with FB on patches </li></ul><ul><li>Also: working with Cloudera to evaluate Flume </li></ul>[1] http://github.com/traviscrawford/scribe
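
The store-and-forward behavior described on the Scribe slides (a local daemon that survives a network outage, and nodes that know only their downstream writer) can be sketched in a few lines of Python. This is a minimal illustration, not Scribe's implementation or API: the `Node` class, its `up` flag, and the in-memory `store` are hypothetical names invented for the sketch.

```python
from collections import defaultdict, deque


class Node:
    """Hypothetical Scribe-like relay that knows only its downstream writer."""

    def __init__(self, downstream=None):
        self.downstream = downstream    # None marks the final store (e.g. HDFS)
        self.up = True                  # whether upstream nodes can reach this node
        self.buffer = deque()           # local queue of (category, message) pairs
        self.store = defaultdict(list)  # used only by the final store, per category

    def log(self, category, message):
        """Accept a message locally, then try to push everything downstream."""
        self.buffer.append((category, message))
        self.flush()

    def flush(self):
        """Forward buffered messages in order; stop if downstream is unreachable."""
        while self.buffer:
            category, message = self.buffer[0]
            if self.downstream is None:
                self.store[category].append(message)
            elif self.downstream.up:
                self.downstream.log(category, message)
            else:
                return  # outage: keep the message buffered locally
            self.buffer.popleft()


# Three tiers, mirroring the FE -> Agg -> HDFS picture on the slide.
hdfs = Node()
agg = Node(downstream=hdfs)
fe = Node(downstream=agg)

agg.up = False
fe.log("search", "q=hadoop")  # stays in fe.buffer during the outage
agg.up = True
fe.log("search", "q=pig")     # both messages are delivered, in order
```

Because each node holds a reference only to the next hop, tiers can be added or swapped without touching the front ends, which is the scalability point the slide makes.
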
  12. 12. Tabular Data <ul><li>Most site data is in MySQL </li></ul><ul><li>Tweets, users, devices, client applications, etc </li></ul><ul><li>Need to move it between MySQL and HDFS </li></ul><ul><ul><ul><ul><ul><li>Also between MySQL and HBase, or MySQL and MySQL </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>Crane: configuration driven ETL tool </li></ul></ul></ul></ul></ul>
  13. 13. Crane [Architecture diagram; labels: Driver, Configuration/Batch Management, ZooKeeper Registration, Source, Extract, Protobuf P1, Transform, Protobuf P2, Load, Sink]
  14. 14. Crane <ul><li>Extract </li></ul><ul><li>MySQL, HDFS, HBase, Flock, GA, Facebook Insights </li></ul>
  15. 15. Crane <ul><li>Extract </li></ul><ul><li>MySQL, HDFS, HBase, Flock, GA, Facebook Insights </li></ul><ul><li>Transform </li></ul><ul><li>IP/Phone -> Geo, canonicalize dates, cleaning, arbitrary logic </li></ul>
  16. 16. Crane <ul><li>Extract </li></ul><ul><li>MySQL, HDFS, HBase, Flock, GA, Facebook Insights </li></ul><ul><li>Transform </li></ul><ul><li>IP/Phone -> Geo, canonicalize dates, cleaning, arbitrary logic </li></ul><ul><li>Load </li></ul><ul><li>MySQL, Local file, Stdout, HDFS, HBase </li></ul>
  17. 17. Crane <ul><li>Extract </li></ul><ul><li>MySQL, HDFS, HBase, Flock, GA, Facebook Insights </li></ul><ul><li>Transform </li></ul><ul><li>IP/Phone -> Geo, canonicalize dates, cleaning, arbitrary logic </li></ul><ul><li>Load </li></ul><ul><li>MySQL, Local file, Stdout, HDFS, HBase </li></ul><ul><li>ZooKeeper coordination, intelligent date management </li></ul><ul><li>Run all the time from multiple servers, self healing </li></ul>
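
Crane itself is not shown in the slides, so the following is only a sketch of what "configuration-driven ETL" can look like: a job is a plain config that names an extractor, a chain of transforms, and a loader, and a small driver wires them together. Every name here (`EXTRACTORS`, `TRANSFORMS`, `LOADERS`, `run_job`, the `"memory"` and `"list"` plugin types) is hypothetical and chosen for illustration.

```python
from datetime import datetime


def canonicalize_date(record):
    """Rewrite a US-style 'MM/DD/YYYY' date as ISO 8601 (assumed input format)."""
    out = dict(record)
    parsed = datetime.strptime(record["created_at"], "%m/%d/%Y")
    out["created_at"] = parsed.date().isoformat()
    return out


def strip_whitespace(record):
    """Basic cleaning step: trim every string field."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}


# Plugin registries: a real tool would register MySQL, HDFS, HBase, and so on.
EXTRACTORS = {"memory": lambda spec: iter(spec["rows"])}
TRANSFORMS = {
    "strip_whitespace": strip_whitespace,
    "canonicalize_date": canonicalize_date,
}
LOADERS = {"list": lambda spec, records: spec["target"].extend(records)}


def run_job(config):
    """Run extract -> transform(s) -> load as described by the config."""
    records = EXTRACTORS[config["extract"]["type"]](config["extract"])
    for name in config["transforms"]:
        records = map(TRANSFORMS[name], records)
    LOADERS[config["load"]["type"]](config["load"], records)


sink = []
run_job({
    "extract": {"type": "memory",
                "rows": [{"user": " kevinweil ", "created_at": "06/30/2010"}]},
    "transforms": ["strip_whitespace", "canonicalize_date"],
    "load": {"type": "list", "target": sink},
})
```

Keeping the pipeline in data rather than code is what lets one driver serve many source and sink combinations.
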
  18. 18. The Twitter Data Lifecycle <ul><li>Data Input </li></ul><ul><li>Data Storage: Elephant Bird , HBase </li></ul><ul><li>Data Analysis </li></ul><ul><li>Data Products </li></ul>
  19. 19. Storage Basics <ul><li>Incoming data: 7 TB/day </li></ul><ul><li>LZO encode everything </li></ul><ul><li>Save 3-4x on storage, pay little CPU </li></ul><ul><li>Splittable! [1] </li></ul><ul><li>IO-bound jobs ==> 3-4x perf increase </li></ul>[1] http://www.github.com/kevinweil/hadoop-lzo
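
As a quick check of the storage figures above, the arithmetic can be written out. The 7 TB/day input and the 3-4x ratio come from the slide; the function name is made up for this sketch, and HDFS replication is deliberately ignored, so these are pre-replication numbers.

```python
def stored_tb_per_day(raw_tb_per_day, compression_ratio):
    """Daily on-disk volume after compression, before any HDFS replication."""
    return raw_tb_per_day / compression_ratio


RAW_TB_PER_DAY = 7  # from the slide

low = stored_tb_per_day(RAW_TB_PER_DAY, 4)   # best case: 4x compression
high = stored_tb_per_day(RAW_TB_PER_DAY, 3)  # worst case: 3x compression
print(f"{low:.2f} to {high:.2f} TB/day on disk")
```

The same ratio explains the performance claim: an IO-bound job that reads 3-4x fewer bytes from disk can finish roughly 3-4x sooner, provided decompression stays cheap.
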
  20. 20. Elephant Bird [1] (photo: http://www.flickr.com/photos/jagadish/3072134867/) [1] http://github.com/kevinweil/elephant-bird
  21. 21. Elephant Bird <ul><li>We have data coming in as protocol buffers via Crane... </li></ul>
  22. 22. Elephant Bird <ul><li>We have data coming in as protocol buffers via Crane... </li></ul><ul><li>Protobufs: codegen for efficient ser-de of data structures </li></ul>
  23. 23. Elephant Bird <ul><li>We have data coming in as protocol buffers via Crane... </li></ul><ul><li>Protobufs: codegen for efficient ser-de of data structures </li></ul><ul><li>Why shouldn’t we just continue, and codegen more glue? </li></ul>
  24. 24. Elephant Bird <ul><li>We have data coming in as protocol buffers via Crane... </li></ul><ul><li>Protobufs: codegen for efficient ser-de of data structures </li></ul><ul><li>Why shouldn’t we just continue, and codegen more glue? </li></ul><ul><li>InputFormats, OutputFormats, Pig LoadFuncs, Pig StoreFuncs, Pig HBaseLoaders </li></ul>
  25. 25. Elephant Bird <ul><li>We have data coming in as protocol buffers via Crane... </li></ul><ul><li>Protobufs: codegen for efficient ser-de of data structures </li></ul><ul><li>Why shouldn’t we just continue, and codegen more glue? </li></ul><ul><li>InputFormats, OutputFormats, Pig LoadFuncs, Pig StoreFuncs, Pig HBaseLoaders </li></ul><ul><li>Also now does part of this with Thrift, soon Avro </li></ul><ul><li>And JSON, W3C Logs </li></ul>
  26. 26. Challenge: Mutable Data <ul><li>HDFS is write-once: no seek on write, no append (yet) </li></ul><ul><li>Logs are easy. </li></ul><ul><li>But our tables change. </li></ul>
  27. 27. Challenge: Mutable Data <ul><li>HDFS is write-once: no seek on write, no append (yet) </li></ul><ul><li>Logs are easy. </li></ul><ul><li>But our tables change. </li></ul><ul><li>Handling rapidly changing data in HDFS: not trivial. </li></ul><ul><li>Don’t worry about updated data </li></ul><ul><li>Refresh entire dataset </li></ul><ul><li>Download updates, tombstone old versions of data, ensure jobs only run over current versions of data, occasionally rewrite full dataset </li></ul>
  28. 28. Challenge: Mutable Data <ul><li>HDFS is write-once: no seek on write, no append (yet) </li></ul><ul><li>Logs are easy. </li></ul><ul><li>But our tables change. </li></ul><ul><li>Handling changing data in HDFS: not trivial. </li></ul>
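
The third strategy listed for mutable data (download updates, tombstone old versions, run jobs only over current versions, occasionally rewrite the full dataset) can be sketched over an append-only list of records. This is an illustration under stated assumptions, not Twitter's code: records are `(key, version, value)` tuples, a higher version supersedes a lower one, and a value of `None` is treated as a deletion tombstone.

```python
def current_view(records):
    """Return {key: value} using only the newest version of each key.

    Older versions are ignored, and keys whose newest version is a
    tombstone (value None) are dropped.
    """
    latest = {}
    for key, version, value in records:
        if key not in latest or version > latest[key][0]:
            latest[key] = (version, value)
    return {key: value for key, (_, value) in latest.items() if value is not None}


def compact(records):
    """Occasional full rewrite: keep one live record per key, sorted by key."""
    latest = {}
    for key, version, value in records:
        if key not in latest or version > latest[key][0]:
            latest[key] = (version, value)
    return sorted(
        (key, version, value)
        for key, (version, value) in latest.items()
        if value is not None
    )


# The log is append-only, which matches write-once storage.
log = [
    ("user:1", 1, "alice"),
    ("user:2", 1, "bob"),
    ("user:1", 2, "alice_renamed"),  # update: supersedes version 1
    ("user:2", 2, None),             # tombstone: user:2 was deleted
]
```

Jobs would read through `current_view`, while `compact` keeps the number of dead records from growing without bound.
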
  29. 29. HBase <ul><li>Has already solved the update problem </li></ul><ul><li>Bonus: low-latency query API </li></ul><ul><li>Bonus: rich, BigTable-style data model based on column families </li></ul>
  30. 30. HBase at Twitter <ul><li>Crane loads data directly into HBase </li></ul><ul><li>One CF for protobuf bytes, one CF to denormalize columns for indexing or quicker batch access </li></ul><ul><li>Processing updates transparent, so we always have accurate data in HBase </li></ul><ul><li>Pig Loader for HBase in Elephant Bird makes integration with existing analyses easy </li></ul>
  31. 31. HBase at Twitter <ul><li>Crane loads data directly into HBase </li></ul><ul><li>One CF for protobuf bytes, one CF to denormalize columns for indexing or quicker batch access </li></ul><ul><li>Processing updates transparent, so we always have accurate data in HBase </li></ul><ul><li>Pig Loader for HBase in Elephant Bird </li></ul>
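
The two-column-family layout described above can be pictured with plain Python dictionaries. No HBase client is used here, and several details are assumptions made for illustration: the family names `pb` and `cols`, the qualifier `data`, and the use of JSON bytes as a stand-in for the serialized protocol buffer.

```python
import json


def to_hbase_row(user):
    """Model one row: full serialized record plus denormalized columns."""
    # Stand-in for protobuf serialization (assumption: JSON instead of protobuf).
    blob = json.dumps(user, sort_keys=True).encode("utf-8")
    return {
        "row_key": str(user["id"]).encode("utf-8"),
        # Family 1: the whole record as opaque bytes.
        "pb": {"data": blob},
        # Family 2: one column per field, for indexing or quicker batch access.
        "cols": {name: str(value).encode("utf-8") for name, value in user.items()},
    }


row = to_hbase_row({"id": 42, "screen_name": "kevinweil", "lang": "en"})
```

A batch job that needs a single field can read one small column from the second family instead of deserializing every full record.
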
  32. 32. The Twitter Data Lifecycle <ul><li>Data Input </li></ul><ul><li>Data Storage </li></ul><ul><li>Data Analysis: Pig , Oink </li></ul><ul><li>Data Products </li></ul>
  33. 33. Enter Pig <ul><li>High level language </li></ul><ul><li>Transformations on sets of records </li></ul><ul><li>Process data one step at a time </li></ul><ul><li>UDFs are first-class citizens </li></ul><ul><li>Easier than SQL? </li></ul>
  34. 34. Why Pig? <ul><li>Because I bet you can read the following script. </li></ul>
  35. 35. A Real Pig Script <ul><li>Now, just for fun... the same calculation in vanilla Hadoop MapReduce. </li></ul>
  36. 36. No, seriously.
  37. 37. Pig Democratizes Large-scale Data Analysis <ul><li>The Pig version is: </li></ul><ul><ul><li>5% of the code </li></ul></ul><ul><ul><li>5% of the time </li></ul></ul><ul><ul><li>Within 30% of the execution time. </li></ul></ul><ul><ul><li>Innovation increasingly driven from large-scale data analysis </li></ul></ul><ul><ul><li>Need fast iteration to understand the right questions </li></ul></ul><ul><ul><li>More minds contributing = more value from your data </li></ul></ul>
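
The script shown on the slides did not survive in this transcript, so the following is not that script. It is a small Python sketch of the style the Pig slides describe: transformations on sets of records, applied one step at a time. The tab-separated log format and the function name are assumptions; the comments name the Pig Latin operators each step roughly corresponds to.

```python
from collections import Counter


def top_clients(lines, n=2):
    """Count successful requests per client and return the n most active."""
    # LOAD: parse each tab-separated line into (client, status) fields.
    records = (line.split("\t") for line in lines)
    # FILTER: keep only successful requests.
    ok = (r for r in records if r[1] == "200")
    # GROUP ... FOREACH ... COUNT: tally requests per client.
    counts = Counter(r[0] for r in ok)
    # ORDER ... LIMIT: most active clients first.
    return counts.most_common(n)


sample = [
    "web\t200",
    "iphone\t200",
    "web\t200",
    "web\t500",
    "android\t200",
    "web\t200",
    "iphone\t200",
]
```

Each intermediate name holds one step's output, which is what makes a data flow like this easy to read and to debug a step at a time.
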
  38. 38. Pig Examples <ul><li>Using the HBase Loader </li></ul><ul><li>Using the protobuf loaders </li></ul>
  39. 39. Pig Workflow <ul><li>Oink: framework around Pig for loading, combining, running, post-processing </li></ul><ul><li>Everyone I know has one of these </li></ul><ul><li>Points to an opening for innovation; discussion beginning </li></ul><ul><li>Something we’re looking at: Ruby DSL for Pig, Piglet [1] </li></ul>[1] http://github.com/ningliang/piglet
  40. 40. Counting Big Data <ul><li>standard counts, min, max, std dev </li></ul><ul><li>How many requests do we serve in a day? </li></ul><ul><li>What is the average latency? 95% latency? </li></ul><ul><li>Group by response code. What is the hourly distribution? </li></ul><ul><li>How many searches happen each day on Twitter? </li></ul><ul><li>How many unique queries, how many unique users? </li></ul><ul><li>What is their geographic distribution? </li></ul>
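
The counting questions above (requests per day, average and 95% latency, grouping by response code, hourly distribution) reduce to a handful of aggregates. The sketch below computes them in memory over a toy list; the record shape `(hour, status, latency_ms)` and the function names are assumptions, and the percentile uses the simple nearest-rank definition.

```python
from collections import Counter
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = (len(ordered) * pct + 99) // 100  # integer ceil(n * pct / 100)
    return ordered[max(rank, 1) - 1]


def summarize(requests):
    """Aggregate a list of (hour, status, latency_ms) request records."""
    latencies = [latency for _, _, latency in requests]
    return {
        "requests": len(requests),
        "avg_latency_ms": mean(latencies),
        "p95_latency_ms": percentile(latencies, 95),
        "by_status": Counter(status for _, status, _ in requests),
        "by_hour": Counter(hour for hour, _, _ in requests),
    }


requests = [(0, 200, 10), (0, 200, 20), (1, 500, 30), (1, 200, 40)]
summary = summarize(requests)
```

At cluster scale the same aggregates would be expressed as grouped operations over the full logs rather than Python lists, but the definitions are unchanged.
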
  41. 41. Correlating Big Data <ul><li>How does usage differ for mobile users? </li></ul><ul><li>How about for users with 3rd party desktop clients? </li></ul><ul><li>Cohort analyses </li></ul><ul><li>Site problems: what goes wrong at the same time? </li></ul><ul><li>Which features get users hooked? </li></ul><ul><li>Which features do successful users use often? </li></ul><ul><li>Search corrections, search suggestions </li></ul><ul><li>A/B testing </li></ul><ul><li>probabilities, covariance, influence </li></ul>
  42. 42. Research on Big Data <ul><li>What can we tell about a user from their tweets? </li></ul><ul><li>From the tweets of those they follow? </li></ul><ul><li>From the tweets of their followers? </li></ul><ul><li>From the ratio of followers/following? </li></ul><ul><li>What graph structures lead to successful networks? </li></ul><ul><li>User reputation </li></ul><ul><li>prediction, graph analysis, natural language </li></ul>
  43. 43. Research on Big Data <ul><li>Sentiment analysis </li></ul><ul><li>What features get a tweet retweeted? </li></ul><ul><li>How deep is the corresponding retweet tree? </li></ul><ul><li>Long-term duplicate detection </li></ul><ul><li>Machine learning </li></ul><ul><li>Language detection </li></ul><ul><li>... the list goes on. </li></ul><ul><li>prediction, graph analysis, natural language </li></ul>
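
One of the questions above, how deep a retweet tree is, has a compact answer once the retweet edges are available. The sketch assumes a hypothetical input shape: a mapping from each retweet's ID to the ID of the tweet it retweeted, forming a tree with no cycles.

```python
def retweet_depth(parent_of, root):
    """Return the number of retweet levels below the root tweet.

    parent_of maps retweet_id -> id of the tweet that was retweeted.
    Assumes the edges form a tree (no cycles).
    """
    # Invert the edges so each tweet lists its direct retweets.
    children = {}
    for child, parent in parent_of.items():
        children.setdefault(parent, []).append(child)

    # Breadth-first walk, one level of retweets per iteration.
    depth, frontier = 0, [root]
    while True:
        next_level = [c for node in frontier for c in children.get(node, [])]
        if not next_level:
            return depth
        depth += 1
        frontier = next_level


# Tweet 1 is retweeted by 2 and 3; 4 retweets 2; 5 retweets 4.
edges = {2: 1, 3: 1, 4: 2, 5: 4}
```

The level-by-level structure matters at scale: each iteration is a join of the current frontier against the edge set, which is the kind of step a batch system can run over the full graph.
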
  44. 44. The Twitter Data Lifecycle <ul><li>Data Input </li></ul><ul><li>Data Storage </li></ul><ul><li>Data Analysis </li></ul><ul><li>Data Products: Birdbrain </li></ul>
  45. 45. Data Products <ul><li>Ad Hoc Analyses </li></ul><ul><li>Answer questions to keep the business agile, do research </li></ul><ul><li>Online Products </li></ul><ul><li>Name search, other upcoming products </li></ul><ul><li>Company Dashboard </li></ul><ul><li>Birdbrain </li></ul>
  46. 46. Questions? Follow me at twitter.com/kevinweil <ul><li>P.S. We’re hiring. Help us build the next step: realtime big data analytics. </li></ul>
