Hadoop and Pig at Twitter__HadoopSummit2010

Hadoop Summit 2010 - Developers Track
Hadoop and Pig at Twitter
Kevin Weil, Twitter

Published in: Technology

Transcript of "Hadoop and Pig at Twitter__HadoopSummit2010"

  1. Hadoop at Twitter
       • Kevin Weil -- @kevinweil
       • Analytics Lead, Twitter
  2. The Twitter Data Lifecycle
       • Data Input
       • Data Storage
       • Data Analysis
       • Data Products
  3. The Twitter Data Lifecycle
       • Data Input: Scribe, Crane
       • Data Storage: Elephant Bird, HBase
       • Data Analysis: Pig, Oink
       • Data Products: Birdbrain
       • Footnotes: 1 = Community Open Source; 2 = Twitter Open Source (or soon)
  4. My Background
       • Studied Mathematics and Physics at Harvard, Physics at Stanford
       • Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data
       • Cooliris (web media): Hadoop and Pig for analytics, TBs of data
       • Twitter: Hadoop, Pig, machine learning, visualization, social graph analysis, (soon) PBs of data
  5. The Twitter Data Lifecycle
       • Data Input: Scribe, Crane
       • Data Storage
       • Data Analysis
       • Data Products
       • Footnotes: 1 = Community Open Source; 2 = Twitter Open Source
  6. What Data?
       • Two main kinds of raw data
       • Logs
       • Tabular data
  7. Logs
       • Started with syslog-ng
       • As our volume grew, it didn’t scale
  8. Logs
       • Started with syslog-ng
       • As our volume grew, it didn’t scale
       • Resources overwhelmed
       • Lost data
  9. Scribe
       • Scribe daemon runs locally; reliable in network outage
       • Nodes only know downstream writer; hierarchical, scalable
       • Pluggable outputs, per category
       • [Diagram labels: FE, FE, FE, Agg, Agg, HDFS, File]
  10. Scribe at Twitter
       • Solved our problem, opened new vistas
       • Currently 57 different categories logged from multiple sources
       • FE: Javascript, Ruby on Rails
       • Middle tier: Ruby on Rails, Scala
       • Backend: Scala, Java, C++
       • 7 TB/day into HDFS
       • Log first, ask questions later.
  11. Scribe at Twitter
       • We’ve contributed to it as we’ve used it [1]
       • Improved logging, monitoring, writing to HDFS, compression
       • Added ZooKeeper-based config
       • Continuing to work with FB on patches
       • Also: working with Cloudera to evaluate Flume
       • [1] http://github.com/traviscrawford/scribe
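The Scribe slides describe a local daemon that stays reliable during a network outage and only knows its downstream writer. A minimal Python sketch of that store-and-forward idea follows. It is not Scribe's actual code or API: the class names `Aggregator` and `LocalDaemon`, the `up` flag, and the in-memory buffer are all hypothetical stand-ins chosen for illustration.

```python
from collections import defaultdict, deque


class Aggregator:
    """Hypothetical downstream node: stores messages per category."""

    def __init__(self):
        self.up = True                    # simulates network reachability
        self.store = defaultdict(list)    # category -> list of messages

    def receive(self, category, message):
        if not self.up:
            raise ConnectionError("aggregator unreachable")
        self.store[category].append(message)


class LocalDaemon:
    """Hypothetical local daemon: buffers messages until downstream accepts them."""

    def __init__(self, downstream):
        self.downstream = downstream      # the only node this daemon knows about
        self.buffer = deque()

    def log(self, category, message):
        self.buffer.append((category, message))
        self.flush()

    def flush(self):
        # Forward in arrival order; stop at the first failure and keep the rest.
        while self.buffer:
            category, message = self.buffer[0]
            try:
                self.downstream.receive(category, message)
            except ConnectionError:
                return
            self.buffer.popleft()
```

In this sketch a message leaves the buffer only after the downstream call succeeds, so an outage delays delivery rather than losing data. A real deployment would presumably buffer to disk rather than memory.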
  12. Tabular Data
       • Most site data is in MySQL
       • Tweets, users, devices, client applications, etc.
       • Need to move it between MySQL and HDFS
           • Also between MySQL and HBase, or MySQL and MySQL
           • Crane: configuration-driven ETL tool
  13. Crane
       • [Diagram labels: Driver, Configuration/Batch Management, Extract, Transform, Load, Protobuf P1, Protobuf P2, Source, Sink, ZooKeeper Registration]
  14. Crane
       • Extract: MySQL, HDFS, HBase, Flock, GA, Facebook Insights
  15. Crane
       • Extract: MySQL, HDFS, HBase, Flock, GA, Facebook Insights
       • Transform: IP/Phone -> Geo, canonicalize dates, cleaning, arbitrary logic
  16. Crane
       • Extract: MySQL, HDFS, HBase, Flock, GA, Facebook Insights
       • Transform: IP/Phone -> Geo, canonicalize dates, cleaning, arbitrary logic
       • Load: MySQL, Local file, Stdout, HDFS, HBase
  17. Crane
       • Extract: MySQL, HDFS, HBase, Flock, GA, Facebook Insights
       • Transform: IP/Phone -> Geo, canonicalize dates, cleaning, arbitrary logic
       • Load: MySQL, Local file, Stdout, HDFS, HBase
       • ZooKeeper coordination, intelligent date management
       • Run all the time from multiple servers, self healing
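Crane is described above only as a configuration-driven extract/transform/load tool, so the following is a generic Python sketch of that pattern, not Crane itself. The config keys, the transform names, the row fields (`name`, `created_at` as epoch seconds), and the list-based source and sink are all assumptions made for illustration.

```python
from datetime import datetime, timezone


def canonicalize_date(row):
    """Assumed transform: turn an epoch-seconds field into a YYYY-MM-DD string."""
    out = dict(row)
    stamp = datetime.fromtimestamp(row["created_at"], tz=timezone.utc)
    out["created_at"] = stamp.strftime("%Y-%m-%d")
    return out


def drop_empty_names(row):
    """Assumed cleaning transform: returning None drops the row."""
    return row if row.get("name") else None


# Registry that lets a job config refer to transforms by name.
TRANSFORMS = {
    "canonicalize_date": canonicalize_date,
    "drop_empty_names": drop_empty_names,
}


def run_job(config, source, sink):
    """Extract rows from source, apply the configured transforms, load into sink."""
    transforms = [TRANSFORMS[name] for name in config["transforms"]]
    for row in source:                    # extract
        for transform in transforms:      # transform
            row = transform(row)
            if row is None:
                break                     # row was filtered out
        else:
            sink.append(row)              # load


# Usage: the job is described by data, not by code.
config = {"transforms": ["drop_empty_names", "canonicalize_date"]}
rows = [{"name": "a", "created_at": 0}, {"name": "", "created_at": 86400}]
loaded = []
run_job(config, rows, loaded)
```

Keeping the job definition in a config means that adding a new source, sink, or transform would not require changing the driver loop, which is the usual motivation for this design.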
  18. The Twitter Data Lifecycle
       • Data Input
       • Data Storage: Elephant Bird, HBase
       • Data Analysis
       • Data Products
       • Footnotes: 1 = Community Open Source; 2 = Twitter Open Source
  19. Storage Basics
       • Incoming data: 7 TB/day
       • LZO encode everything
       • Save 3-4x on storage, pay little CPU
       • Splittable! [1]
       • IO-bound jobs ==> 3-4x perf increase
       • [1] http://www.github.com/kevinweil/hadoop-lzo
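As a quick worked check of the storage numbers on the slide above, the following Python snippet computes the on-disk volume for 7 TB/day at the stated 3-4x compression ratios. The function name is hypothetical, and the calculation assumes the ratio applies uniformly to all incoming data.

```python
def stored_tb_per_day(incoming_tb, compression_ratio):
    """On-disk volume per day after compression, in TB."""
    return incoming_tb / compression_ratio


# 7 TB/day incoming at the slide's 3x and 4x ratios.
low = stored_tb_per_day(7, 4)    # best case
high = stored_tb_per_day(7, 3)   # worst case
print(f"{low:.2f} to {high:.2f} TB/day on disk")
```

Under that assumption roughly 1.75 to 2.33 TB reaches disk each day. An IO-bound job reads correspondingly fewer bytes, which is consistent with the 3-4x speedup the slide claims.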
  20. Elephant Bird [1]
       • Photo: http://www.flickr.com/photos/jagadish/3072134867/
       • [1] http://github.com/kevinweil/elephant-bird
  21. Elephant Bird
       • We have data coming in as protocol buffers via Crane...
  22. Elephant Bird
       • We have data coming in as protocol buffers via Crane...
       • Protobufs: codegen for efficient ser-de of data structures
  23. Elephant Bird
       • We have data coming in as protocol buffers via Crane...
       • Protobufs: codegen for efficient ser-de of data structures
       • Why shouldn’t we just continue, and codegen more glue?
  24. Elephant Bird
       • We have data coming in as protocol buffers via Crane...
       • Protobufs: codegen for efficient ser-de of data structures
       • Why shouldn’t we just continue, and codegen more glue?
       • InputFormats, OutputFormats, Pig LoadFuncs, Pig StoreFuncs, Pig HBaseLoaders
  25. Elephant Bird
       • We have data coming in as protocol buffers via Crane...
       • Protobufs: codegen for efficient ser-de of data structures
       • Why shouldn’t we just continue, and codegen more glue?
       • InputFormats, OutputFormats, Pig LoadFuncs, Pig StoreFuncs, Pig HBaseLoaders
       • Also now does part of this with Thrift, soon Avro
       • And JSON, W3C Logs
  26. Challenge: Mutable Data
       • HDFS is write-once: no seek on write, no append (yet)
       • Logs are easy.
       • But our tables change.
  27. Challenge: Mutable Data
       • HDFS is write-once: no seek on write, no append (yet)
       • Logs are easy.
       • But our tables change.
       • Handling rapidly changing data in HDFS: not trivial.
       • Don’t worry about updated data
       • Refresh entire dataset
       • Download updates, tombstone old versions of data, ensure jobs only run over current versions of data, occasionally rewrite full dataset
  28. Challenge: Mutable Data
       • HDFS is write-once: no seek on write, no append (yet)
       • Logs are easy.
       • But our tables change.
       • Handling changing data in HDFS: not trivial.
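The third option on slide 27 (download updates, tombstone old versions, run jobs only over current versions, occasionally rewrite the full dataset) can be sketched in a few lines of Python. This is an illustration of the general idea only: the record fields `id`, `version`, and `deleted`, and both function names, are assumptions rather than anything stated in the talk.

```python
def current_records(records):
    """Return the newest version of each record, skipping deleted ones.

    `records` is an append-only sequence of dicts with assumed fields
    `id`, `version`, and an optional `deleted` flag (the tombstone).
    """
    latest = {}
    for rec in records:
        key = rec["id"]
        if key not in latest or rec["version"] > latest[key]["version"]:
            latest[key] = rec
    return {k: r for k, r in latest.items() if not r.get("deleted", False)}


def compact(records):
    """The 'occasionally rewrite full dataset' step: keep only current records."""
    return list(current_records(records).values())


# Usage: base snapshot followed by appended updates.
log = [
    {"id": 1, "version": 1, "name": "alice"},
    {"id": 2, "version": 1, "name": "bob"},
    {"id": 1, "version": 2, "name": "alice_renamed"},  # update
    {"id": 2, "version": 2, "deleted": True},          # tombstone
]
view = current_records(log)
```

Because nothing is modified in place, this fits a write-once store. The cost is that every job must filter stale versions until the next compaction, which is part of why the slide calls the problem non-trivial.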
  29. HBase
       • Has already solved the update problem
       • Bonus: low-latency query API
       • Bonus: rich, BigTable-style data model based on column families
  30. HBase at Twitter
       • Crane loads data directly into HBase
       • One CF for protobuf bytes, one CF to denormalize columns for indexing or quicker batch access
       • Processing updates transparent, so we always have accurate data in HBase
       • Pig Loader for HBase in Elephant Bird makes integration with existing analyses easy
  31. HBase at Twitter
       • Crane loads data directly into HBase
       • One CF for protobuf bytes, one CF to denormalize columns for indexing or quicker batch access
       • Processing updates transparent, so we always have accurate data in HBase
       • Pig Loader for HBase in Elephant Bird
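The two-column-family layout mentioned above (one CF holding the serialized record, one CF holding denormalized columns) might look roughly like the Python sketch below. It models a row as a plain dict and uses JSON bytes as a stand-in for protobuf bytes so the example stays self-contained. The family names `pb` and `cols`, the field names, and the function name are all hypothetical.

```python
import json


def to_row(user):
    """Build one row as {"family:qualifier": bytes} from a user record.

    "pb:data" holds the whole serialized record (JSON here, standing in
    for protobuf bytes); "cols:*" repeats selected fields individually so
    they can be read or indexed without deserializing the full record.
    """
    payload = json.dumps(user, sort_keys=True).encode("utf-8")
    return {
        "pb:data": payload,
        "cols:screen_name": user["screen_name"].encode("utf-8"),
        "cols:created_at": str(user["created_at"]).encode("utf-8"),
    }


# Usage
row = to_row({"id": 12, "screen_name": "example", "created_at": 1277856000})
```

The trade-off in a layout like this is extra storage for the duplicated fields in exchange for cheaper access to the columns that batch jobs or indexes read most often.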
  32. The Twitter Data Lifecycle
       • Data Input
       • Data Storage
       • Data Analysis: Pig, Oink
       • Data Products
       • Footnotes: 1 = Community Open Source; 2 = Twitter Open Source
  33. Enter Pig
       • High level language
       • Transformations on sets of records
       • Process data one step at a time
       • UDFs are first-class citizens
       • Easier than SQL?
  34. Why Pig?
       • Because I bet you can read the following script.
  35. A Real Pig Script
       • Now, just for fun... the same calculation in vanilla Hadoop MapReduce.
  36. No, seriously.
  37. Pig Democratizes Large-scale Data Analysis
       • The Pig version is:
           • 5% of the code
           • 5% of the time
           • Within 30% of the execution time.
           • Innovation increasingly driven from large-scale data analysis
           • Need fast iteration to understand the right questions
           • More minds contributing = more value from your data
  38. Pig Examples
       • Using the HBase Loader
       • Using the protobuf loaders
  39. Pig Workflow
       • Oink: framework around Pig for loading, combining, running, post-processing
       • Everyone I know has one of these
       • Points to an opening for innovation; discussion beginning
       • Something we’re looking at: Ruby DSL for Pig, Piglet [1]
       • [1] http://github.com/ningliang/piglet
  40. Counting Big Data
       • standard counts, min, max, std dev
       • How many requests do we serve in a day?
       • What is the average latency? 95% latency?
       • Group by response code. What is the hourly distribution?
       • How many searches happen each day on Twitter?
       • How many unique queries, how many unique users?
       • What is their geographic distribution?
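The counting questions on the slide above (request count, average latency, 95% latency, grouping by response code) are simple aggregations. The Python sketch below shows them on a small in-memory list to make the definitions concrete. At Twitter's scale these would run as Pig jobs over HDFS, and the field names `latency_ms` and `status` are assumed for illustration.

```python
from collections import Counter


def summarize(requests):
    """Count requests, average latency, 95th percentile latency, per-status counts.

    The percentile uses the nearest-rank method: the value at position
    ceil(0.95 * n) in the sorted list, computed with integer arithmetic.
    Expects a non-empty list of dicts with `latency_ms` and `status`.
    """
    latencies = sorted(r["latency_ms"] for r in requests)
    n = len(latencies)
    rank = (95 * n + 99) // 100          # ceil(95 * n / 100)
    return {
        "count": n,
        "mean_latency_ms": sum(latencies) / n,
        "p95_latency_ms": latencies[rank - 1],
        "by_status": Counter(r["status"] for r in requests),
    }


# Usage: 20 requests with latencies 1..20 ms, two of them server errors.
sample = [
    {"latency_ms": i, "status": 500 if i > 18 else 200}
    for i in range(1, 21)
]
stats = summarize(sample)
```

For this sample the summary reports 20 requests, a mean latency of 10.5 ms, a 95% latency of 19 ms, and 18 responses with status 200 against 2 with status 500.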
  41. Correlating Big Data
       • How does usage differ for mobile users?
       • How about for users with 3rd party desktop clients?
       • Cohort analyses
       • Site problems: what goes wrong at the same time?
       • Which features get users hooked?
       • Which features do successful users use often?
       • Search corrections, search suggestions
       • A/B testing
       • probabilities, covariance, influence
  42. Research on Big Data
       • What can we tell about a user from their tweets?
       • From the tweets of those they follow?
       • From the tweets of their followers?
       • From the ratio of followers/following?
       • What graph structures lead to successful networks?
       • User reputation
       • prediction, graph analysis, natural language
  43. Research on Big Data
       • Sentiment analysis
       • What features get a tweet retweeted?
       • How deep is the corresponding retweet tree?
       • Long-term duplicate detection
       • Machine learning
       • Language detection
       • ... the list goes on.
       • prediction, graph analysis, natural language
  44. The Twitter Data Lifecycle
       • Data Input
       • Data Storage
       • Data Analysis
       • Data Products: Birdbrain
       • Footnotes: 1 = Community Open Source; 2 = Twitter Open Source
  45. Data Products
       • Ad Hoc Analyses
       • Answer questions to keep the business agile, do research
       • Online Products
       • Name search, other upcoming products
       • Company Dashboard
       • Birdbrain
  46. Questions?
       • Follow me at twitter.com/kevinweil
       • P.S. We’re hiring. Help us build the next step: realtime big data analytics.
