Hadoop and Protocol Buffers at Twitter
Kevin Weil -- @kevinweil
Analytics Lead, Twitter




Protocol Buffers and Hadoop at Twitter

How Twitter uses Hadoop and Protocol Buffers for efficient, flexible data storage and fast MapReduce/Pig jobs.

Published in: Technology
  • Transcript of "Protocol Buffers and Hadoop at Twitter"

    1. 1. Hadoop and Protocol Buffers at Twitter ‣ Kevin Weil -- @kevinweil ‣ Analytics Lead, Twitter
    2. 2. Outline ‣ Problem Statement ‣ CSV? XML? JSON? Regex? ‣ Protocol Buffers ‣ Codegen, Hadoop and You ‣ Applications ‣ Conclusions and Next Steps
    3. 3. My Background ‣ Studied Mathematics and Physics at Harvard, Physics at Stanford ‣ Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data ‣ Cooliris (web media): Hadoop and Pig for analytics, TBs of data ‣ Twitter: Hadoop, Pig, HBase, large-scale data analysis and visualization, social graph analysis, machine learning, lots more data
    10. 10. The Challenge ‣ Store 100 billion tweets (not just "some tweets") in a way that is ‣ Robust to changes ‣ Efficient in size and speed ‣ Amenable to large-scale analysis ‣ Reusable (especially for other classes of data, like logs, where the size gets really large)
    11. 11. The System ‣ Your (friend’s) hadoop cluster
    12. 12. The Data ‣ Each tweet has 12 fields, 3 of which (user, geo, contributors) have subfields ‣ It can change as we add new features
        kevin@tw-mbp-kweil ~ $ curl http://api.twitter.com/1/statuses/show/9225259353.xml
        <?xml version="1.0" encoding="UTF-8"?>
        <status>
          <created_at>Wed Feb 17 08:01:13 +0000 2010</created_at>
          <id>9225259353</id>
          <text>Preparing slides for tomorrow's talk at Y! at the Hadoop User Group: Protobufs and Hadoop at Twitter. See you there? http://bit.ly/9DJcd9</text>
          <source>&lt;a href=&quot;http://www.tweetdeck.com/&quot; rel=&quot;nofollow&quot;&gt;TweetDeck&lt;/a&gt;</source>
          <truncated>false</truncated>
          <in_reply_to_status_id></in_reply_to_status_id>
          <in_reply_to_user_id></in_reply_to_user_id>
          <favorited>false</favorited>
          <in_reply_to_screen_name></in_reply_to_screen_name>
          <user>
            <id>3452911</id>
            <name>Kevin Weil</name>
            <screen_name>kevinweil</screen_name>
            <location>Portola Valley, CA</location>
            <description>Analytics Lead at Twitter. Ultra-marathons, cycling, hadoop, lolcats.</description>
            <profile_image_url>http://a3.twimg.com/profile_images/220257539/n206489_34325699_8572_normal.jpg</profile_image_url>
            <url></url>
            <protected>false</protected>
            <followers_count>3122</followers_count>
            <profile_background_color>B2DFDA</profile_background_color>
            <profile_text_color>333333</profile_text_color>
            <profile_link_color>93A644</profile_link_color>
            <profile_sidebar_fill_color>ffffff</profile_sidebar_fill_color>
            <profile_sidebar_border_color>eeeeee</profile_sidebar_border_color>
            <friends_count>436</friends_count>
            <created_at>Wed Apr 04 19:29:46 +0000 2007</created_at>
            <favourites_count>721</favourites_count>
            <utc_offset>-28800</utc_offset>
            <time_zone>Pacific Time (US &amp; Canada)</time_zone>
            <profile_background_image_url>http://s.twimg.com/a/1266345225/images/themes/theme13/bg.gif</profile_background_image_url>
            <profile_background_tile>false</profile_background_tile>
            <notifications>false</notifications>
            <geo_enabled>true</geo_enabled>
            <verified>false</verified>
            <following>false</following>
            <statuses_count>2556</statuses_count>
            <lang>en</lang>
            <contributors_enabled>false</contributors_enabled>
          </user>
          <geo/>
          <contributors/>
        </status>
    13. 13. The Requirements ‣ Splittability ‣ Parsing efficiency ‣ Reusability ‣ Ability to add new fields ‣ Ability to ignore unused fields ‣ Small data size ‣ Hierarchical
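Splittability is the least obvious requirement on this list, so here is a minimal Python sketch of what it means in practice. It assumes newline-delimited records, which is an illustration and not a layout the talk specifies. `read_split` is a hypothetical helper. Each worker is handed a byte range, skips forward to the first record boundary, and reads past the end of its range to finish the last record it started. This mirrors the convention used by Hadoop's line-oriented input formats.

```python
import io


def read_split(data: bytes, start: int, end: int) -> list[bytes]:
    """Return the newline-delimited records owned by the byte range [start, end)."""
    stream = io.BytesIO(data)
    stream.seek(start)
    if start != 0:
        # The line that straddles (or begins exactly at) `start` belongs to
        # the previous split, so discard it.
        stream.readline()
    records = []
    # Own every record that begins at or before `end`; the next split
    # discards its first line, so nothing is read twice.
    while stream.tell() <= end:
        line = stream.readline()
        if not line:  # end of data
            break
        records.append(line.rstrip(b"\n"))
    return records


if __name__ == "__main__":
    data = b"a\nbb\nccc\ndddd\n"
    # Three workers, each given an arbitrary 5-byte slice of the file.
    for start in range(0, len(data), 5):
        print(start, read_split(data, start, min(start + 5, len(data))))
```

Whatever split size is chosen, every record is read exactly once. A format without cheap record boundaries (a single large XML or JSON document, for example) cannot be divided this way without parsing from the beginning.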
    21. 21. Common Formats [comparison table; cell values not preserved in the transcript] ‣ Columns: Splittable, Parsing efficiency, Reusability, Add new fields, Ignore unused fields, Small data size, Hierarchical ‣ Rows: XML, JSON, CSV, Custom regex (Apache)
    26. 26. Enter Protocol Buffers ‣ “Protocol Buffers are a way of encoding structured data in an efficient yet extensible format. Google uses Protocol Buffers for almost all of its internal RPC protocols and file formats.” ‣ http://code.google.com/p/protobuf ‣ You write IDL describing your data structure ‣ It generates code in your languages of choice to construct, serialize, deserialize, reflect across, etc, your data structure ‣ Like Thrift, but richer and more efficient (except no RPC) ‣ Avro is an exciting up-and-coming alternative
    27. 27. Protobuf IDL Example
        message Status {
          optional string created_at = 1;
          optional int64 id = 2;
          optional string text = 3;
          optional string source = 4;
          optional bool truncated = 5;
          optional int64 in_reply_to_status_id = 6;
          optional int64 in_reply_to_user_id = 7;
          optional bool favorited = 8;
          optional string in_reply_to_screen_name = 9;
          optional User user = 10;
          optional Geo geo = 11;
          optional Contributors contributors = 12;
          message User {
            optional int64 id = 1;
            optional string name = 2;
            ...
          }
          message Geo { ... }
          message Contributors { ... }
        }
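To show where the small data size comes from, here is a minimal Python sketch of how two fields of the `Status` message look on the wire. It follows the publicly documented Protocol Buffers encoding: each field is a key, `(field_number << 3) | wire_type`, followed by a base-128 varint or a length-prefixed payload. The helper names are hypothetical. Real code would use the classes generated by `protoc` and never hand-roll this.

```python
def encode_varint(value: int) -> bytes:
    """Base-128 varint: 7 payload bits per byte, high bit set on all but the last."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_key(field_number: int, wire_type: int) -> bytes:
    """Field key: the field number from the IDL plus a 3-bit wire type."""
    return encode_varint((field_number << 3) | wire_type)


def encode_int64_field(field_number: int, value: int) -> bytes:
    """Wire type 0 (varint), used for int64, int32 and bool fields."""
    return encode_key(field_number, 0) + encode_varint(value)


def encode_string_field(field_number: int, text: str) -> bytes:
    """Wire type 2 (length-delimited), used for strings and nested messages."""
    payload = text.encode("utf-8")
    return encode_key(field_number, 2) + encode_varint(len(payload)) + payload


if __name__ == "__main__":
    # `id = 2` and `text = 3` from the Status message above.
    id_field = encode_int64_field(2, 9225259353)
    text_field = encode_string_field(3, "hello")
    xml_id = b"<id>9225259353</id>"
    print(len(id_field), "bytes as protobuf vs", len(xml_id), "bytes as XML")
    print((id_field + text_field).hex(" "))
```

Field names never appear in the encoded bytes, only the field numbers from the IDL. That is why the numbers must stay stable once data has been written.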
    28. 28. Protobuf Generated Code ‣ The generated code is: ‣ Efficient (Google quotes 80x vs. |-delimited format) [1, 2] ‣ Extensible ‣ Backwards compatible ‣ Polymorphic (in Java, C++, Python) ‣ Metadata-rich ‣ [1] http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext ‣ [2] http://code.google.com/p/thrift-protobuf-compare/wiki/Benchmarking
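"Extensible" and "backwards compatible" follow from the wire format. Every field carries its own number and wire type, so a reader built from an older schema can step over fields it has never heard of. The Python sketch below illustrates that with a hand-written decoder. `KNOWN_FIELDS` and `decode_known` are hypothetical names, and fields 13 and 14 are invented "newer" fields used only for the demonstration. Generated protobuf code does this for you.

```python
def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode one base-128 varint starting at `pos`; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


# The fields an "old" reader was compiled with (numbers from the Status IDL).
KNOWN_FIELDS = {2: "id", 3: "text"}


def decode_known(buf: bytes) -> dict:
    """Decode known fields and silently skip everything else."""
    fields = {}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:    # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:  # fixed 64-bit
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:  # length-delimited
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:  # fixed 32-bit
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        name = KNOWN_FIELDS.get(field_number)
        if name is not None:  # unknown field numbers are simply dropped
            fields[name] = value
    return fields


if __name__ == "__main__":
    # id=150, (new) field 13 = 1, text="hi", (new) field 14 = "x"
    newer_message = bytes.fromhex("10 96 01 68 01 1a 02 68 69 72 01 78")
    print(decode_known(newer_message))
```

The wire type alone tells the reader how many bytes to skip, so adding a field to the IDL does not break jobs compiled against the previous schema.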
    29. 29. Common Formats [comparison table; cell values not preserved in the transcript] ‣ Columns: Splittable, Parsing efficiency, Reusability, Add new fields, Ignore unused fields, Small data size, Hierarchical ‣ Rows: XML, JSON, CSV, Custom regex (Apache), Protocol Buffers
    37. 37. But Wait, There’s More ‣ Codegen for data structures is nice... ‣ Next step: codegen for all Hadoop-related code ‣ Protocol Buffer InputFormats ‣ OutputFormats ‣ Writables ‣ Pig LoadFuncs and StoreFuncs ‣ Cascading, Streaming, Dumbo, etc ‣ Per Protocol Buffer
    38. 38. Protocol Buffer InputFormats ‣ All objects (hierarchical data, inheritance, etc) ‣ All automatically generated ‣ Efficient, extensible storage and serialization
    39. 39. Pig LoadFuncs ‣ All objects (hierarchical data, inheritance, etc) ‣ All automatically generated ‣ Even the load statement itself is codegen
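As an illustration of "even the load statement itself is codegen", here is a minimal Python sketch that derives a Pig `LOAD` statement from a list of protobuf field definitions. It is not the generator described in the talk. The type mapping, the `pig_load_statement` helper, the HDFS path and the loader class name `com.example.pig.StatusProtobufLoader` are all assumptions made for the example.

```python
# Assumed mapping from protobuf scalar types to Pig types (illustrative only).
PIG_TYPES = {"string": "chararray", "int64": "long", "int32": "int"}


def pig_load_statement(alias: str, path: str, loader_class: str,
                       fields: list[tuple[str, str]]) -> str:
    """Build `alias = LOAD 'path' USING loader() AS (name:type, ...);`."""
    columns = ", ".join(f"{name}:{PIG_TYPES[ptype]}" for name, ptype in fields)
    return f"{alias} = LOAD '{path}' USING {loader_class}() AS ({columns});"


if __name__ == "__main__":
    # A few scalar fields from the Status message; nested messages are omitted.
    status_fields = [("created_at", "string"), ("id", "int64"), ("text", "string")]
    print(pig_load_statement(
        "statuses",
        "/tables/statuses",                      # hypothetical HDFS path
        "com.example.pig.StatusProtobufLoader",  # hypothetical loader class
        status_fields,
    ))
```

Because the column list is generated from the same schema as the data classes, adding a field to the IDL updates the Pig schema without hand-editing any script.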
    40. 40. Where do these work? ‣ Java MapReduce APIs (InputFormats, OutputFormats, Writables) ‣ Deprecated Java MapReduce APIs (same) ‣ Enables Streaming, Dumbo, Cascading ‣ Pig ‣ HBase
    42. 42. Counting Big Data ‣ standard counts, min, max, std dev ‣ How many requests do we serve in a day? ‣ What is the average latency? 95% latency? ‣ Group by response code. What is the hourly distribution? ‣ How many searches happen each day on Twitter? ‣ How many unique queries, how many unique users? ‣ What is their geographic distribution?
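The counting questions reduce to a group-by plus a few aggregates. The Python sketch below runs on a tiny in-memory sample to make the arithmetic concrete. The request data is invented for illustration, and the 95% latency uses the nearest-rank definition, which is one of several common conventions. At Twitter's scale the same logic would run as a MapReduce or Pig job, not in a single process.

```python
from collections import Counter


def nearest_rank_percentile(values: list[int], pct: int) -> int:
    """Smallest value with at least `pct` percent of the data at or below it."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct * n / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: (HTTP response code, latency in milliseconds).
requests = [(200, ms) for ms in range(10, 26)] + [
    (404, 5), (404, 6), (500, 300), (500, 400),
]

counts_by_code = Counter(code for code, _ in requests)
latencies = [ms for _, ms in requests]
mean_latency = sum(latencies) / len(latencies)
p95_latency = nearest_rank_percentile(latencies, 95)

if __name__ == "__main__":
    print("requests by response code:", dict(counts_by_code))
    print("mean latency (ms):", mean_latency)
    print("95% latency (ms):", p95_latency)
```

The mean and the 95% latency differ sharply on this sample because two slow error responses dominate the average. That gap is the reason both questions appear on the slide.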
    43. 43. Correlating Big Data ‣ probabilities, covariance, influence ‣ How does usage differ for mobile users? ‣ How about for users with 3rd party desktop clients? ‣ Cohort analyses ‣ Site problems: what goes wrong at the same time? ‣ Which features get users hooked? ‣ Which features do successful users use often? ‣ Search corrections, search suggestions ‣ A/B testing
    44. 44. Research on Big Data ‣ prediction, graph analysis, natural language ‣ What can we tell about a user from their tweets? ‣ From the tweets of those they follow? ‣ From the tweets of their followers? ‣ From the ratio of followers/following? ‣ What graph structures lead to successful networks? ‣ User reputation
    45. 45. Research on Big Data ‣ prediction, graph analysis, natural language ‣ Sentiment analysis ‣ What features get a tweet retweeted? ‣ How deep is the corresponding retweet tree? ‣ Long-term duplicate detection ‣ Machine learning ‣ Language detection ‣ ... the list goes on.
    47. 47. Resolution ‣ All we do now is write IDL for the data schema ‣ Get efficient, forward/backwards compatible, splittable data structures automatically generated for us ‣ Get loaders, input formats, output formats, writables, and schemas automatically generated for us ‣ Helps the Twitter analytics team stay agile ‣ Can handle new, complex data without the need for new code, new tests, new bugs ‣ Focus on the analysis, not data formats
    48. 48. Twitter Open Source ‣ Coming soon! (1-2 weeks) http://github.com/kevinweil ‣ All base classes for InputFormats, OutputFormats, Writables, Pig Loaders, etc ‣ For new and deprecated MapReduce API ‣ With and without LZO compression (see http://github.com/kevinweil/hadoop-lzo) ‣ Protobuf reflection helpers ‣ Serialized block storage format for HDFS
    49. 49. Questions? Follow me at twitter.com/kevinweil ‣ If this sounded interesting to you -- that’s because it is. And we’re hiring.