Twitter Protobufs And Hadoop Hug 021709

  • 4,303 views
Uploaded on

 

  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
No Downloads

Views

Total Views
4,303
On Slideshare
0
From Embeds
0
Number of Embeds
3

Actions

Shares
Downloads
0
Comments
0
Likes
13

Embeds 0

No embeds

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide

Transcript

  • 1. Hadoop and Protocol Buffers at Twitter Kevin Weil -- @kevinweil Analytics Lead, Twitter TM Wednesday, February 17, 2010
  • 2. Outline ‣ Problem Statement ‣ CSV? XML? JSON? Regex? ‣ Protocol Buffers ‣ Codegen, Hadoop and You ‣ Applications ‣ Conclusions and Next Steps Wednesday, February 17, 2010
  • 3. My Background ‣ Studied Mathematics and Physics at Harvard, Physics at Stanford ‣ Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data ‣ Cooliris (web media): Hadoop and Pig for analytics, TBs of data ‣ Twitter: Hadoop, Pig, HBase, large-scale data analysis and visualization, social graph analysis, machine learning, lots more data Wednesday, February 17, 2010
  • 4. Outline ‣ Problem Statement ‣ CSV? XML? JSON? Regex? ‣ Protocol Buffers ‣ Codegen, Hadoop and You ‣ Applications ‣ Conclusions and Next Steps Wednesday, February 17, 2010
  • 5. The Challenge ‣ Store some tweets Wednesday, February 17, 2010
  • 6. The Challenge ‣ Store some tweets Store 100 billion tweets Wednesday, February 17, 2010
  • 7. The Challenge ‣ Store 100 billion tweets in a way that is ‣ Robust to changes Wednesday, February 17, 2010
  • 8. The Challenge ‣ Store 100 billion tweets in a way that is ‣ Robust ‣ Efficient in size and speed Wednesday, February 17, 2010
  • 9. The Challenge ‣ Store 100 billion tweets in a way that is ‣ Robust ‣ Efficient ‣ Amenable to large-scale analysis Wednesday, February 17, 2010
  • 10. The Challenge ‣ Store 100 billion tweets in a way that is ‣ Robust ‣ Efficient ‣ Amenable to large-scale analysis ‣ Reusable (especially for other classes of data, like logs, where the size gets really large) Wednesday, February 17, 2010
  • 11. The System ‣ Your (friend’s) hadoop cluster Wednesday, February 17, 2010
  • 12. The Data ‣ kevin@tw-mbp-kweil ~ $ curl http:// ‣ ‣ <?xml version="1.0" encoding="UTF-8"?> <status> api.twitter.com/1/statuses/show/9225259353.xml ‣ <created_at>Wed Feb 17 08:01:13 +0000 2010</created_at> ‣ <id>9225259353</id> ‣ <text>Preparing slides for tomorrow's talk at Y! at the Hadoop User Group: Protobufs and Hadoop at Twitter. See you there? http://bit.ly/9DJcd9</text> ‣ <source>&lt;a href=&quot;http://www.tweetdeck.com/&quot; rel=&quot;nofollow&quot;&gt;TweetDeck&lt;/a&gt;</source> ‣ <truncated>false</truncated> ‣ <in_reply_to_status_id></in_reply_to_status_id> <in_reply_to_user_id></in_reply_to_user_id> Each tweet has 12 fields, 3 of which (user, geo, ‣ ‣ ‣ <favorited>false</favorited> ‣ <in_reply_to_screen_name></in_reply_to_screen_name> ‣ <user> contributors) have subfields ‣ <id>3452911</id> ‣ <name>Kevin Weil</name> ‣ <screen_name>kevinweil</screen_name> ‣ <location>Portola Valley, CA</location> ‣ <description>Analytics Lead at Twitter. Ultra-marathons, cycling, hadoop, lolcats.</description> ‣ <profile_image_url>http://a3.twimg.com/profile_images/220257539/n206489_34325699_8572_normal.jpg</profile_image_url> ‣ <url></url> ‣ <protected>false</protected> ‣ <followers_count>3122</followers_count> ‣ <profile_background_color>B2DFDA</profile_background_color> ‣ <profile_text_color>333333</profile_text_color> ‣ It can change as we add new features ‣ <profile_link_color>93A644</profile_link_color> ‣ <profile_sidebar_fill_color>ffffff</profile_sidebar_fill_color> ‣ <profile_sidebar_border_color>eeeeee</profile_sidebar_border_color> ‣ <friends_count>436</friends_count> ‣ <created_at>Wed Apr 04 19:29:46 +0000 2007</created_at> ‣ <favourites_count>721</favourites_count> ‣ <utc_offset>-28800</utc_offset> ‣ <time_zone>Pacific Time (US &amp; Canada)</time_zone> ‣ <profile_background_image_url>http://s.twimg.com/a/1266345225/images/themes/theme13/bg.gif</profile_background_image_url> ‣ <profile_background_tile>false</profile_background_tile> ‣ <notifications>false</notifications> ‣ <geo_enabled>true</geo_enabled> ‣ <verified>false</verified> ‣ <following>false</following> ‣ <statuses_count>2556</statuses_count> ‣ <lang>en</lang> ‣ <contributors_enabled>false</contributors_enabled> ‣ </user> ‣ <geo/> ‣ <contributors/> ‣ </status> ‣ Wednesday, February 17, 2010
  • 13. The Requirements ‣ Splittability ‣ Parsing efficiency ‣ Reusability ‣ Ability to add new fields ‣ Ability to ignore unused fields ‣ Small data size ‣ Hierarchical Wednesday, February 17, 2010
  • 14. The Requirements ‣ Splittability ‣ Parsing efficiency ‣ Reusability ‣ Ability to add new fields ‣ Ability to ignore unused fields ‣ Small data size ‣ Hierarchical Wednesday, February 17, 2010
  • 15. The Requirements ‣ Splittability ‣ Parsing efficiency ‣ Reusability ‣ Ability to add new fields ‣ Ability to ignore unused fields ‣ Small data size ‣ Hierarchical Wednesday, February 17, 2010
  • 16. The Requirements ‣ Splittability ‣ Parsing efficiency ‣ Reusability ‣ Ability to add new fields ‣ Ability to ignore unused fields ‣ Small data size ‣ Hierarchical Wednesday, February 17, 2010
  • 17. The Requirements ‣ Splittability ‣ Parsing efficiency ‣ Reusability ‣ Ability to add new fields ‣ Ability to ignore unused fields ‣ Small data size ‣ Hierarchical Wednesday, February 17, 2010
  • 18. The Requirements ‣ Splittability ‣ Parsing efficiency ‣ Reusability ‣ Ability to add new fields ‣ Ability to ignore unused fields ‣ Small data size ‣ Hierarchical Wednesday, February 17, 2010
  • 19. The Requirements ‣ Splittability ‣ Parsing efficiency ‣ Reusability ‣ Ability to add new fields ‣ Ability to ignore unused fields ‣ Small data size ‣ Hierarchical Wednesday, February 17, 2010
  • 20. Outline ‣ Problem Statement ‣ CSV? XML? JSON? Regex? ‣ Protocol Buffers ‣ Codegen, Hadoop and You ‣ Applications ‣ Conclusions and Next Steps Wednesday, February 17, 2010
  • 21. Common Formats Parsing Ignore unused Splittable Reusability Add new fields Small data size Hierarchical efficiency fields XML JSON CSV Custom regex (Apache) Wednesday, February 17, 2010
  • 22. Common Formats Parsing Ignore unused Splittable Reusability Add new fields Small data size Hierarchical efficiency fields XML JSON CSV Custom regex (Apache) Wednesday, February 17, 2010
  • 23. Common Formats Parsing Ignore unused Splittable Reusability Add new fields Small data size Hierarchical efficiency fields XML JSON CSV Custom regex (Apache) Wednesday, February 17, 2010
  • 24. Common Formats Parsing Ignore unused Splittable Reusability Add new fields Small data size Hierarchical efficiency fields XML JSON CSV Custom regex (Apache) Wednesday, February 17, 2010
  • 25. Outline ‣ Problem Statement ‣ CSV? XML? JSON? Regex? ‣ Protocol Buffers ‣ Codegen, Hadoop and You ‣ Applications ‣ Conclusions and Next Steps Wednesday, February 17, 2010
  • 26. Enter Protocol Buffers ‣ “Protocol Buffers are a way of encoding structured data in an efficient yet extensible format. Google uses Protocol Buffers for almost all of its internal RPC protocols and file formats.” ‣ http://code.google.com/p/protobuf ‣ You write IDL describing your data structure ‣ It generates code in your languages of choice to construct, serialize, deserialize, reflect across, etc, your data structure ‣ Like Thrift, but richer and more efficient (except no RPC) ‣ Avro is an exciting up-and-coming alternative Wednesday, February 17, 2010
  • 27. Protobuf IDL Example ‣ message Status { ‣ optional string created_at = 1; ‣ optional int64 id = 2; ‣ optional string text = 3; ‣ optional string source = 4; ‣ optional bool truncated = 5; ‣ optional int64 in_reply_to_status_id = 6; ‣ optional int64 in_reply_to_user_id = 7; ‣ optional bool favorited = 8; ‣ optional string in_reply_to_screen_name = 9; ‣ optional message User = 10; ‣ optional message Geo = 11; ‣ optional message Contributors = 12; ‣ message User { ‣ optional int64 id = 1; ‣ optional string name = 2; ‣ ... ‣ } ‣ message Geo { ... } ‣ message Contributors { ... } ‣ } Wednesday, February 17, 2010
  • 28. Protobuf Generated Code ‣ The generated code is: ‣ Efficient (Google quotes 80x vs. |-delimited format)1,2 ‣ Extensible ‣ Backwards compatible ‣ Polymorphic (in Java, C++, Python) ‣ Metadata-rich 1. http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext 2. http://code.google.com/p/thrift-protobuf-compare/wiki/Benchmarking Wednesday, February 17, 2010
  • 29. Common Formats Parsing Ignore unused Splittable Reusability Add new fields Small data size Hierarchical efficiency fields XML JSON CSV Custom regex (Apache) Protocol Buffers Wednesday, February 17, 2010
  • 30. Outline ‣ Problem Statement ‣ CSV? XML? JSON? Regex? ‣ Protocol Buffers ‣ Codegen, Hadoop and You ‣ Applications ‣ Conclusions and Next Steps Wednesday, February 17, 2010
  • 31. But Wait, There’s More ‣ Codegen for data structures is nice... ‣ Next step: codegen for all Hadoop-related code Wednesday, February 17, 2010
  • 32. But Wait, There’s More ‣ Codegen for data structures is nice... ‣ Next step: codegen for all Hadoop-related code ‣ Protocol Buffer InputFormats Wednesday, February 17, 2010
  • 33. But Wait, There’s More ‣ Codegen for data structures is nice... ‣ Next step: codegen for all Hadoop-related code ‣ Protocol Buffer InputFormats ‣ OutputFormats Wednesday, February 17, 2010
  • 34. But Wait, There’s More ‣ Codegen for data structures is nice... ‣ Next step: codegen for all Hadoop-related code ‣ Protocol Buffer InputFormats ‣ OutputFormats ‣ Writables Wednesday, February 17, 2010
  • 35. But Wait, There’s More ‣ Codegen for data structures is nice... ‣ Next step: codegen for all Hadoop-related code ‣ Protocol Buffer InputFormats ‣ OutputFormats ‣ Writables ‣ Pig LoadFuncs and StoreFuncs Wednesday, February 17, 2010
  • 36. But Wait, There’s More ‣ Codegen for data structures is nice... ‣ Next step: codegen for all Hadoop-related code ‣ Protocol Buffer InputFormats ‣ OutputFormats ‣ Writables ‣ Pig LoadFuncs and StoreFuncs ‣ Cascading, Streaming, Dumbo, etc Wednesday, February 17, 2010
  • 37. But Wait, There’s More ‣ Codegen for data structures is nice... ‣ Next step: codegen for all Hadoop-related code ‣ Protocol Buffer InputFormats ‣ OutputFormats ‣ Writables ‣ Pig LoadFuncs and StoreFuncs ‣ Cascading, Streaming, Dumbo, etc ‣ Per Protocol Buffer Wednesday, February 17, 2010
  • 38. Protocol Buffer InputFormats ‣ All objects (hierarchical data, inheritance, etc) ‣ All automatically generated ‣ Efficient, extensible storage and serialization Wednesday, February 17, 2010
  • 39. Pig LoadFuncs ‣ All objects (hierarchical data, inheritance, etc) ‣ All automatically generated ‣ Even the load statement itself is codegen Wednesday, February 17, 2010
  • 40. Where do these work? ‣ Java MapReduce APIs (InputFormats, OutputFormats, Writables) ‣ Deprecated Java MapReduce APIs (same) ‣ Enables Streaming, Dumbo, Cascading ‣ Pig ‣ HBase Wednesday, February 17, 2010
  • 41. Outline ‣ Problem Statement ‣ CSV? XML? JSON? Regex? ‣ Protocol Buffers ‣ Codegen, Hadoop and You ‣ Applications ‣ Conclusions and Next Steps Wednesday, February 17, 2010
  • 42. Counting Big Data ‣ standard counts, min, max, std dev ‣ How many requests do we serve in a day? ‣ What is the average latency? 95% latency? ‣ Group by response code. What is the hourly distribution? ‣ How many searches happen each day on Twitter? ‣ How many unique queries, how many unique users? ‣ What is their geographic distribution? Wednesday, February 17, 2010
  • 43. Correlating Big Data ‣ probabilities, covariance, influence ‣ How does usage differ for mobile users? ‣ How about for users with 3rd party desktop clients? ‣ Cohort analyses ‣ Site problems: what goes wrong at the same time? ‣ Which features get users hooked? ‣ Which features do successful users use often? ‣ Search corrections, search suggestions ‣ A/B testing Wednesday, February 17, 2010
  • 44. Research on Big Data ‣ prediction, graph analysis, natural language ‣ What can we tell about a user from their tweets? ‣ From the tweets of those they follow? ‣ From the tweets of their followers? ‣ From the ratio of followers/following? ‣ What graph structures lead to successful networks? ‣ User reputation Wednesday, February 17, 2010
  • 45. Research on Big Data ‣ prediction, graph analysis, natural language ‣ Sentiment analysis ‣ What features get a tweet retweeted? ‣ How deep is the corresponding retweet tree? ‣ Long-term duplicate detection ‣ Machine learning ‣ Language detection ‣ ... the list goes on. Wednesday, February 17, 2010
  • 46. Outline ‣ Problem Statement ‣ CSV? XML? JSON? Regex? ‣ Protocol Buffers ‣ Codegen, Hadoop and You ‣ Applications ‣ Conclusions and Next Steps Wednesday, February 17, 2010
  • 47. Resolution ‣ All we do now is write IDL for the data schema ‣ Get efficient, forward/backwards compatible, splittable data structures automatically generated for us ‣ Get loaders, input formats, output formats, writables, and schemas automatically generated for us ‣ Helps the Twitter analytics team stay agile ‣ Can handle new, complex data without the need for new code, new tests, new bugs ‣ Focus on the analysis, not data formats Wednesday, February 17, 2010
  • 48. Twitter Open Source ‣ Coming soon! (1-2 weeks) http://github.com/kevinweil ‣ All base classes for InputFormats, OutputFormats, Writables, Pig Loaders, etc ‣ For new and deprecated MapReduce API ‣ With and without LZO compression (see http://github.com/ kevinweil/hadoop-lzo) ‣ Protobuf reflection helpers ‣ Serialized block storage format for HDFS Wednesday, February 17, 2010
  • 49. Questions? Follow me at twitter.com/kevinweil ‣ If this sounded interesting to you -- that’s because it is. And we’re hiring. TM Wednesday, February 17, 2010