Hadoop and Protocol Buffers at Twitter
Kevin Weil -- @kevinweil
Analytics Lead, Twitter
Twitter Protobufs And Hadoop Hug 021709

  1. Hadoop and Protocol Buffers at Twitter
     Kevin Weil -- @kevinweil
     Analytics Lead, Twitter
     Wednesday, February 17, 2010
  2. Outline
     ‣ Problem Statement
     ‣ CSV? XML? JSON? Regex?
     ‣ Protocol Buffers
     ‣ Codegen, Hadoop and You
     ‣ Applications
     ‣ Conclusions and Next Steps
  3. My Background
     ‣ Studied Mathematics and Physics at Harvard, Physics at Stanford
     ‣ Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data
     ‣ Cooliris (web media): Hadoop and Pig for analytics, TBs of data
     ‣ Twitter: Hadoop, Pig, HBase, large-scale data analysis and visualization, social graph analysis, machine learning, lots more data
  5. The Challenge
     ‣ Store some tweets
  6. The Challenge
     ‣ Store 100 billion tweets (revised from "Store some tweets")
  7-10. The Challenge
      ‣ Store 100 billion tweets in a way that is
        ‣ Robust to changes
        ‣ Efficient in size and speed
        ‣ Amenable to large-scale analysis
        ‣ Reusable (especially for other classes of data, like logs, where the size gets really large)
  11. The System
      ‣ Your (friend’s) Hadoop cluster
  12. The Data
      ‣ Each tweet has 12 fields, 3 of which (user, geo, contributors) have subfields
      ‣ It can change as we add new features

      kevin@tw-mbp-kweil ~ $ curl http://api.twitter.com/1/statuses/show/9225259353.xml
      <?xml version="1.0" encoding="UTF-8"?>
      <status>
        <created_at>Wed Feb 17 08:01:13 +0000 2010</created_at>
        <id>9225259353</id>
        <text>Preparing slides for tomorrow's talk at Y! at the Hadoop User Group: Protobufs and Hadoop at Twitter. See you there? http://bit.ly/9DJcd9</text>
        <source>&lt;a href=&quot;http://www.tweetdeck.com/&quot; rel=&quot;nofollow&quot;&gt;TweetDeck&lt;/a&gt;</source>
        <truncated>false</truncated>
        <in_reply_to_status_id></in_reply_to_status_id>
        <in_reply_to_user_id></in_reply_to_user_id>
        <favorited>false</favorited>
        <in_reply_to_screen_name></in_reply_to_screen_name>
        <user>
          <id>3452911</id>
          <name>Kevin Weil</name>
          <screen_name>kevinweil</screen_name>
          <location>Portola Valley, CA</location>
          <description>Analytics Lead at Twitter. Ultra-marathons, cycling, hadoop, lolcats.</description>
          <profile_image_url>http://a3.twimg.com/profile_images/220257539/n206489_34325699_8572_normal.jpg</profile_image_url>
          <url></url>
          <protected>false</protected>
          <followers_count>3122</followers_count>
          <profile_background_color>B2DFDA</profile_background_color>
          <profile_text_color>333333</profile_text_color>
          <profile_link_color>93A644</profile_link_color>
          <profile_sidebar_fill_color>ffffff</profile_sidebar_fill_color>
          <profile_sidebar_border_color>eeeeee</profile_sidebar_border_color>
          <friends_count>436</friends_count>
          <created_at>Wed Apr 04 19:29:46 +0000 2007</created_at>
          <favourites_count>721</favourites_count>
          <utc_offset>-28800</utc_offset>
          <time_zone>Pacific Time (US &amp; Canada)</time_zone>
          <profile_background_image_url>http://s.twimg.com/a/1266345225/images/themes/theme13/bg.gif</profile_background_image_url>
          <profile_background_tile>false</profile_background_tile>
          <notifications>false</notifications>
          <geo_enabled>true</geo_enabled>
          <verified>false</verified>
          <following>false</following>
          <statuses_count>2556</statuses_count>
          <lang>en</lang>
          <contributors_enabled>false</contributors_enabled>
        </user>
        <geo/>
        <contributors/>
      </status>
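The field count quoted on this slide can be checked mechanically. The following is a minimal Python sketch, not part of the talk: it parses an abbreviated copy of the `<status>` document (the XML declaration and most `<user>` subfields are dropped, and the `<text>` and `<source>` values are shortened) and lists the top-level fields. The names `SAMPLE_STATUS`, `top_level_fields` and `fields_with_subfields` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the <status> document from the slide. The XML
# declaration is dropped and most <user> subfields are omitted for brevity.
SAMPLE_STATUS = """<status>
  <created_at>Wed Feb 17 08:01:13 +0000 2010</created_at>
  <id>9225259353</id>
  <text>Preparing slides for tomorrow's talk at Y! at the Hadoop User Group</text>
  <source>TweetDeck</source>
  <truncated>false</truncated>
  <in_reply_to_status_id></in_reply_to_status_id>
  <in_reply_to_user_id></in_reply_to_user_id>
  <favorited>false</favorited>
  <in_reply_to_screen_name></in_reply_to_screen_name>
  <user>
    <id>3452911</id>
    <name>Kevin Weil</name>
    <screen_name>kevinweil</screen_name>
  </user>
  <geo/>
  <contributors/>
</status>"""


def top_level_fields(xml_text):
    """Return the tag names of the direct children of the root element."""
    root = ET.fromstring(xml_text)
    return [child.tag for child in root]


def fields_with_subfields(xml_text):
    """Return the top-level tags that contain child elements in this document."""
    root = ET.fromstring(xml_text)
    return [child.tag for child in root if len(child) > 0]


if __name__ == "__main__":
    print(len(top_level_fields(SAMPLE_STATUS)))
    print(fields_with_subfields(SAMPLE_STATUS))
```

In this particular tweet `<geo/>` and `<contributors/>` are empty, so only `user` shows up as nested; the slide's count of three refers to the schema, not to this instance.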
  13-19. The Requirements
      ‣ Splittability
      ‣ Parsing efficiency
      ‣ Reusability
      ‣ Ability to add new fields
      ‣ Ability to ignore unused fields
      ‣ Small data size
      ‣ Hierarchical
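To make "splittability" concrete: a format is splittable when a reader that is handed an arbitrary byte range of a file can find record boundaries on its own. The Python sketch below illustrates one common convention for newline-delimited records, where a record belongs to the split that contains its first byte. It is an illustration under that assumption, not the code from the talk; `split_ranges` and `read_split` are hypothetical names.

```python
def split_ranges(size, chunk):
    """Cut a file of `size` bytes into consecutive [start, end) byte ranges."""
    return [(start, min(start + chunk, size)) for start in range(0, size, chunk)]


def read_split(data, start, end):
    """Return the newline-delimited records whose first byte lies in [start, end)."""
    if start == 0:
        pos = 0
    else:
        # A record starts right after a newline. Searching from start - 1
        # also catches a record that begins exactly at `start`.
        newline = data.find(b"\n", start - 1)
        if newline == -1:
            return []
        pos = newline + 1
    records = []
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        if newline == -1:
            newline = len(data)  # last record without a trailing newline
        # The final record of a split may extend past `end`; read it whole.
        records.append(data[pos:newline])
        pos = newline + 1
    return records


if __name__ == "__main__":
    expected = [f"tweet-{i}".encode() for i in range(20)]
    data = b"\n".join(expected) + b"\n"
    for chunk in (1, 7, 16, 1000):
        seen = []
        for start, end in split_ranges(len(data), chunk):
            seen.extend(read_split(data, start, end))
        print(chunk, seen == expected)
```

Every record is read exactly once regardless of where the split boundaries fall. A multi-line XML document like the one above has no comparably cheap boundary marker.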
  21-24. Common Formats
      (comparison table; cell values are not present in the transcript)
      Columns: Splittable, Parsing efficiency, Reusability, Add new fields, Ignore unused fields, Small data size, Hierarchical
      Rows: XML, JSON, CSV, Custom regex (Apache)
  26. Enter Protocol Buffers
      ‣ “Protocol Buffers are a way of encoding structured data in an efficient yet extensible format. Google uses Protocol Buffers for almost all of its internal RPC protocols and file formats.”
      ‣ http://code.google.com/p/protobuf
      ‣ You write IDL describing your data structure
      ‣ It generates code in your languages of choice to construct, serialize, deserialize, reflect across, etc, your data structure
      ‣ Like Thrift, but richer and more efficient (except no RPC)
      ‣ Avro is an exciting up-and-coming alternative
  27. Protobuf IDL Example
      message Status {
        optional string created_at = 1;
        optional int64 id = 2;
        optional string text = 3;
        optional string source = 4;
        optional bool truncated = 5;
        optional int64 in_reply_to_status_id = 6;
        optional int64 in_reply_to_user_id = 7;
        optional bool favorited = 8;
        optional string in_reply_to_screen_name = 9;
        optional User user = 10;
        optional Geo geo = 11;
        optional Contributors contributors = 12;

        message User {
          optional int64 id = 1;
          optional string name = 2;
          ...
        }
        message Geo { ... }
        message Contributors { ... }
      }
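The field numbers in the IDL are what end up on the wire: each field is written as a key, `(field_number << 3) | wire_type`, followed by the value, with integers stored as base-128 varints. The following Python sketch hand-encodes two fields of `Status` to show why the result is compact. It is an illustration of the wire format only; real code would use the classes generated by `protoc`, and the helper names here are made up.

```python
WIRE_VARINT = 0            # int32, int64, bool, ...
WIRE_LENGTH_DELIMITED = 2  # string, bytes, nested messages


def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint (low 7 bits first)."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_key(field_number, wire_type):
    """Encode the field key that precedes every value on the wire."""
    return encode_varint((field_number << 3) | wire_type)


def encode_int64_field(field_number, value):
    return encode_key(field_number, WIRE_VARINT) + encode_varint(value)


def encode_string_field(field_number, text):
    payload = text.encode("utf-8")
    return (encode_key(field_number, WIRE_LENGTH_DELIMITED)
            + encode_varint(len(payload)) + payload)


if __name__ == "__main__":
    # `optional int64 id = 2;` carrying the tweet id from the XML example.
    encoded_id = encode_int64_field(2, 9225259353)
    xml_id = b"<id>9225259353</id>"
    print(len(encoded_id), "bytes as protobuf vs", len(xml_id), "bytes as XML")
```

The id field takes 6 bytes here (1 key byte plus a 5-byte varint) against 19 bytes for the XML element.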
  28. Protobuf Generated Code
      ‣ The generated code is:
        ‣ Efficient (Google quotes 80x vs. |-delimited format) [1, 2]
        ‣ Extensible
        ‣ Backwards compatible
        ‣ Polymorphic (in Java, C++, Python)
        ‣ Metadata-rich
      [1] http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext
      [2] http://code.google.com/p/thrift-protobuf-compare/wiki/Benchmarking
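"Extensible" and "backwards compatible" follow from the wire format: every value is prefixed with its field number and wire type, so a reader built against an older schema can skip fields it does not know. The Python sketch below shows the idea with a hand-rolled parser. It is a simplified illustration (no groups, no nested decoding), not the generated code, and field 99 is a hypothetical field added by a newer writer.

```python
def decode_varint(buf, pos):
    """Decode a base-128 varint starting at `pos`; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7


def parse_known_fields(buf, known):
    """Parse a message, keeping only field numbers listed in `known`.

    `known` maps field number -> name. Unknown fields are skipped using
    only the wire type, which is what lets old readers accept new data.
    """
    result = {}
    pos = 0
    while pos < len(buf):
        key, pos = decode_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:      # varint
            value, pos = decode_varint(buf, pos)
        elif wire_type == 1:    # fixed 64-bit
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:    # length-delimited
            length, pos = decode_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:    # fixed 32-bit
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        if field_number in known:
            result[known[field_number]] = value
    return result


# Field 2 (varint 150), field 3 (string "hi"), and a hypothetical field 99
# (varint 1) that an older schema does not declare.
MESSAGE = b"\x10\x96\x01" + b"\x1a\x02hi" + b"\x98\x06\x01"

if __name__ == "__main__":
    print(parse_known_fields(MESSAGE, {2: "id", 3: "text"}))
```

The old reader returns `id` and `text` and silently passes over field 99, which covers both "add new fields" and "ignore unused fields" from the requirements list.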
  29. Common Formats
      (comparison table; cell values are not present in the transcript)
      Columns: Splittable, Parsing efficiency, Reusability, Add new fields, Ignore unused fields, Small data size, Hierarchical
      Rows: XML, JSON, CSV, Custom regex (Apache), Protocol Buffers
  31-37. But Wait, There’s More
      ‣ Codegen for data structures is nice...
      ‣ Next step: codegen for all Hadoop-related code
        ‣ Protocol Buffer InputFormats
        ‣ OutputFormats
        ‣ Writables
        ‣ Pig LoadFuncs and StoreFuncs
        ‣ Cascading, Streaming, Dumbo, etc
        ‣ Per Protocol Buffer
  38. Protocol Buffer InputFormats
      ‣ All objects (hierarchical data, inheritance, etc)
      ‣ All automatically generated
      ‣ Efficient, extensible storage and serialization
  39. Pig LoadFuncs
      ‣ All objects (hierarchical data, inheritance, etc)
      ‣ All automatically generated
      ‣ Even the load statement itself is codegen
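As a rough idea of what "the load statement itself is codegen" can mean, the Python sketch below renders a Pig `LOAD ... USING ... AS (...)` statement from a list of scalar protobuf fields. This is a guess at the general shape, not Twitter's generator: the loader class name, the input path and the type mapping are assumptions made for illustration, and nested messages are not handled.

```python
# Assumed mapping from protobuf scalar types to Pig types (illustrative only).
PIG_TYPES = {
    "string": "chararray",
    "int64": "long",
    "int32": "int",
    "double": "double",
    "float": "float",
}


def pig_load_statement(alias, path, loader_class, fields):
    """Render a Pig LOAD statement for a flat list of (name, proto_type) fields."""
    columns = []
    for name, proto_type in fields:
        if proto_type not in PIG_TYPES:
            raise ValueError(f"no Pig type mapping for {proto_type!r}")
        columns.append(f"{name}: {PIG_TYPES[proto_type]}")
    return f"{alias} = LOAD '{path}' USING {loader_class}() AS ({', '.join(columns)});"


if __name__ == "__main__":
    # Hypothetical loader class and path; the field list follows the Status IDL.
    print(pig_load_statement(
        "statuses",
        "/data/statuses",
        "com.example.pig.StatusProtobufLoader",
        [("created_at", "string"), ("id", "int64"), ("text", "string")],
    ))
```

Since the field list comes straight from the IDL, a schema change regenerates the statement instead of requiring a hand edit.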
  40. Where do these work?
      ‣ Java MapReduce APIs (InputFormats, OutputFormats, Writables)
      ‣ Deprecated Java MapReduce APIs (same)
      ‣ Enables Streaming, Dumbo, Cascading
      ‣ Pig
      ‣ HBase
  42. Counting Big Data
      (standard counts, min, max, std dev)
      ‣ How many requests do we serve in a day?
      ‣ What is the average latency? 95% latency?
      ‣ Group by response code. What is the hourly distribution?
      ‣ How many searches happen each day on Twitter?
      ‣ How many unique queries, how many unique users?
      ‣ What is their geographic distribution?
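On a single machine, the first three questions reduce to a count, a mean, a percentile and a group-by. The Python sketch below runs them over a handful of made-up `(response_code, latency_ms)` records; at Twitter's scale the same aggregations would run as Pig or MapReduce jobs. The percentile uses the nearest-rank definition, which is one of several in common use.

```python
from collections import Counter


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct * n / 100)
    return ordered[max(rank, 1) - 1]


def summarize(requests):
    """Aggregate (response_code, latency_ms) records."""
    latencies = [latency for _, latency in requests]
    return {
        "total": len(requests),
        "by_code": dict(Counter(code for code, _ in requests)),
        "mean_latency": sum(latencies) / len(latencies),
        "p95_latency": percentile(latencies, 95),
    }


if __name__ == "__main__":
    # Made-up sample: 18 successful requests plus one 404 and one 500.
    sample = [(200, float(ms)) for ms in range(1, 19)] + [(404, 19.0), (500, 20.0)]
    print(summarize(sample))
```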
  43. Correlating Big Data
      (probabilities, covariance, influence)
      ‣ How does usage differ for mobile users?
      ‣ How about for users with 3rd party desktop clients?
      ‣ Cohort analyses
      ‣ Site problems: what goes wrong at the same time?
      ‣ Which features get users hooked?
      ‣ Which features do successful users use often?
      ‣ Search corrections, search suggestions
      ‣ A/B testing
  44. Research on Big Data
      (prediction, graph analysis, natural language)
      ‣ What can we tell about a user from their tweets?
      ‣ From the tweets of those they follow?
      ‣ From the tweets of their followers?
      ‣ From the ratio of followers/following?
      ‣ What graph structures lead to successful networks?
      ‣ User reputation
  45. Research on Big Data
      (prediction, graph analysis, natural language)
      ‣ Sentiment analysis
      ‣ What features get a tweet retweeted?
      ‣ How deep is the corresponding retweet tree?
      ‣ Long-term duplicate detection
      ‣ Machine learning
      ‣ Language detection
      ‣ ... the list goes on.
  47. Resolution
      ‣ All we do now is write IDL for the data schema
      ‣ Get efficient, forward/backwards compatible, splittable data structures automatically generated for us
      ‣ Get loaders, input formats, output formats, writables, and schemas automatically generated for us
      ‣ Helps the Twitter analytics team stay agile
      ‣ Can handle new, complex data without the need for new code, new tests, new bugs
      ‣ Focus on the analysis, not data formats
  48. Twitter Open Source
      ‣ Coming soon! (1-2 weeks) http://github.com/kevinweil
      ‣ All base classes for InputFormats, OutputFormats, Writables, Pig Loaders, etc
      ‣ For new and deprecated MapReduce API
      ‣ With and without LZO compression (see http://github.com/kevinweil/hadoop-lzo)
      ‣ Protobuf reflection helpers
      ‣ Serialized block storage format for HDFS
  49. Questions?
      Follow me at twitter.com/kevinweil
      ‣ If this sounded interesting to you -- that’s because it is. And we’re hiring.
