karl long, social strategy, open innovation and design professional at Consulting
Fascinating presentation Kevin, I was really interested in some of the business requirements, especially around user behavior. Personally I think the two most important questions you're asking are:
* Which features get people hooked?
* Which features do successful users use often?
I did a quick post on this here: http://experiencecurve.com/archives/the-most-important-questions-twitter-is-asking-of-its-data
Protocol Buffers and Hadoop at Twitter
Presentation Transcript
Hadoop and Protocol Buffers at Twitter
Kevin Weil -- @kevinweil
Analytics Lead, Twitter
Outline
‣ Problem Statement
‣ CSV? XML? JSON? Regex?
‣ Protocol Buffers
‣ Codegen, Hadoop and You
‣ Applications
‣ Conclusions and Next Steps
My Background
‣ Studied Mathematics and Physics at Harvard, Physics at Stanford
‣ Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data
‣ Cooliris (web media): Hadoop and Pig for analytics, TBs of data
‣ Twitter: Hadoop, Pig, HBase, large-scale data analysis and visualization, social graph analysis, machine learning, lots more data
The Challenge
‣ Store some tweets
The Challenge
‣ Store 100 billion tweets in a way that is
‣ Robust to changes
‣ Efficient in size and speed
‣ Amenable to large-scale analysis
‣ Reusable (especially for other classes of data, like logs, where the size gets really large)
The System
‣ Your (friend’s) Hadoop cluster
The Data
‣ Each tweet has 12 fields, 3 of which (user, geo, contributors) have subfields
‣ It can change as we add new features
kevin@tw-mbp-kweil ~ $ curl http://api.twitter.com/1/statuses/show/9225259353.xml
<?xml version="1.0" encoding="UTF-8"?>
<status>
  <created_at>Wed Feb 17 08:01:13 +0000 2010</created_at>
  <id>9225259353</id>
  <text>Preparing slides for tomorrow's talk at Y! at the Hadoop User Group: Protobufs and Hadoop at Twitter. See you there? http://bit.ly/9DJcd9</text>
  <source><a href="http://www.tweetdeck.com/" rel="nofollow">TweetDeck</a></source>
  <truncated>false</truncated>
  <in_reply_to_status_id></in_reply_to_status_id>
  <in_reply_to_user_id></in_reply_to_user_id>
  <favorited>false</favorited>
  <in_reply_to_screen_name></in_reply_to_screen_name>
  <user>
    <id>3452911</id>
    <name>Kevin Weil</name>
    <screen_name>kevinweil</screen_name>
    <location>Portola Valley, CA</location>
    <description>Analytics Lead at Twitter. Ultra-marathons, cycling, hadoop, lolcats.</description>
    <profile_image_url>http://a3.twimg.com/profile_images/220257539/n206489_34325699_8572_normal.jpg</profile_image_url>
    <url></url>
    <protected>false</protected>
    <followers_count>3122</followers_count>
    <profile_background_color>B2DFDA</profile_background_color>
    <profile_text_color>333333</profile_text_color>
    <profile_link_color>93A644</profile_link_color>
    <profile_sidebar_fill_color>ffffff</profile_sidebar_fill_color>
    <profile_sidebar_border_color>eeeeee</profile_sidebar_border_color>
    <friends_count>436</friends_count>
    <created_at>Wed Apr 04 19:29:46 +0000 2007</created_at>
    <favourites_count>721</favourites_count>
    <utc_offset>-28800</utc_offset>
    <time_zone>Pacific Time (US & Canada)</time_zone>
    <profile_background_image_url>http://s.twimg.com/a/1266345225/images/themes/theme13/bg.gif</profile_background_image_url>
    <profile_background_tile>false</profile_background_tile>
    <notifications>false</notifications>
    <geo_enabled>true</geo_enabled>
    <verified>false</verified>
    <following>false</following>
    <statuses_count>2556</statuses_count>
    <lang>en</lang>
    <contributors_enabled>false</contributors_enabled>
  </user>
  <geo/>
  <contributors/>
</status>
The Requirements
‣ Splittability
‣ Parsing efficiency
‣ Reusability
‣ Ability to add new fields
‣ Ability to ignore unused fields
‣ Small data size
‣ Hierarchical
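To make the splittability requirement concrete, here is a minimal sketch of how a reader handed an arbitrary byte range of a file can still recover whole records. It assumes newline-delimited records (the convention Hadoop's text input uses); the function name `read_split` and the sample data are hypothetical and not from the talk.

```python
def read_split(data: bytes, start: int, end: int) -> list:
    """Return the newline-delimited records that START inside [start, end)."""
    pos = start
    if start != 0:
        # We may have landed mid-record; the previous split owns that record.
        # Searching from start - 1 keeps a record that begins exactly at `start`.
        newline = data.find(b"\n", start - 1)
        if newline == -1:
            return []
        pos = newline + 1
    records = []
    # Read every record that begins before `end`, even if it runs past `end`.
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        if newline == -1:
            newline = len(data)
        records.append(data[pos:newline])
        pos = newline + 1
    return records


if __name__ == "__main__":
    sample = b"a,1\nbb,2\nccc,3\ndddd,4\n"
    # Cut the file at arbitrary offsets, as a cluster would cut it into blocks.
    for split_start in range(0, len(sample), 7):
        split_end = min(split_start + 7, len(sample))
        print(split_start, split_end, read_split(sample, split_start, split_end))
```

Each record is read by exactly one split, which is what lets many mappers work on one large file in parallel. A format with no recognizable record boundary (a single large XML or JSON document, for instance) does not allow this.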
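Common Formats
[Table: each format rated against each requirement; the ratings are not preserved in this transcript]
Columns: Splittable, Parsing efficiency, Reusability, Add new fields, Ignore unused fields, Small data size, Hierarchical
Rows: XML, JSON, CSV, Custom regex (Apache)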
Enter Protocol Buffers
‣ “Protocol Buffers are a way of encoding structured data in an efficient yet extensible format. Google uses Protocol Buffers for almost all of its internal RPC protocols and file formats.”
  ‣ http://code.google.com/p/protobuf
‣ You write IDL describing your data structure
‣ It generates code in your languages of choice to construct, serialize, deserialize, reflect across, etc, your data structure
‣ Like Thrift, but richer and more efficient (except no RPC)
‣ Avro is an exciting up-and-coming alternative
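As a rough illustration of what the generated serializers emit, the sketch below hand-encodes two fields using the Protocol Buffers wire format: each field is a varint key, `(field_number << 3) | wire_type`, followed by the value. The field numbers chosen for the tweet id and text are hypothetical; in practice the IDL assigns them and generated code does this work, so this is not how you would use the library.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (7 payload bits per byte)."""
    if n < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: last byte
            return bytes(out)


def encode_uint_field(field_number: int, value: int) -> bytes:
    """Key with wire type 0 (varint), then the value as a varint."""
    return encode_varint((field_number << 3) | 0) + encode_varint(value)


def encode_string_field(field_number: int, text: str) -> bytes:
    """Key with wire type 2 (length-delimited), then byte length, then UTF-8 bytes."""
    payload = text.encode("utf-8")
    return encode_varint((field_number << 3) | 2) + encode_varint(len(payload)) + payload


# Hypothetical field numbers for illustration only.
TWEET_ID, TWEET_TEXT = 1, 2

if __name__ == "__main__":
    message = (encode_uint_field(TWEET_ID, 9225259353)
               + encode_string_field(TWEET_TEXT, "See you there?"))
    xml = "<id>9225259353</id><text>See you there?</text>"
    print(len(message), "bytes as protobuf vs", len(xml.encode("utf-8")), "bytes as XML")
```

Field names never appear in the encoded bytes, only small integer tags, which is where much of the size saving over XML comes from.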
Protobuf Generated Code
‣ The generated code is:
  ‣ Efficient (Google quotes 80x vs. |-delimited format) [1, 2]
  ‣ Extensible
  ‣ Backwards compatible
  ‣ Polymorphic (in Java, C++, Python)
  ‣ Metadata-rich
1. http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext
2. http://code.google.com/p/thrift-protobuf-compare/wiki/Benchmarking
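The "extensible" and "backwards compatible" properties follow from the wire format: every field carries its own number and wire type, so a reader can step over fields it does not know. The sketch below is a simplified hand-written decoder handling only wire types 0 and 2. The schema dictionary and the extra field number 3 are hypothetical, and the real library retains unknown fields rather than dropping them as this sketch does.

```python
def read_varint(buf: bytes, pos: int):
    """Decode one base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):  # high bit clear: this was the last byte
            return result, pos
        shift += 7


def decode_known_fields(buf: bytes, known: dict) -> dict:
    """Decode fields whose numbers appear in `known`; step over all others."""
    out = {}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x7
        if wire_type == 0:    # varint value
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:  # length-delimited value
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        if field_number in known:
            out[known[field_number]] = value
        # Unknown field: its bytes were consumed above, so parsing simply continues.
    return out


# A newer writer added field 3; an older reader only knows fields 1 and 2.
new_message = b"\x08\x96\x01" + b"\x12\x07testing" + b"\x18\x01"
old_reader_schema = {1: "id", 2: "text"}
decoded = decode_known_fields(new_message, old_reader_schema)

if __name__ == "__main__":
    print(decoded)
```

The same mechanism covers the "ability to ignore unused fields" requirement: old jobs keep running against data written with a newer schema.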
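Common Formats
[Table: each format rated against each requirement; the ratings are not preserved in this transcript]
Columns: Splittable, Parsing efficiency, Reusability, Add new fields, Ignore unused fields, Small data size, Hierarchical
Rows: XML, JSON, CSV, Custom regex (Apache), Protocol Buffers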
But Wait, There’s More
‣ Codegen for data structures is nice...
‣ Next step: codegen for all Hadoop-related code
  ‣ Protocol Buffer InputFormats
  ‣ OutputFormats
  ‣ Writables
  ‣ Pig LoadFuncs and StoreFuncs
  ‣ Cascading, Streaming, Dumbo, etc
  ‣ Per Protocol Buffer
Protocol Buffer InputFormats
‣ All objects (hierarchical data, inheritance, etc)
‣ All automatically generated
‣ Efficient, extensible storage and serialization
Pig LoadFuncs
‣ All objects (hierarchical data, inheritance, etc)
‣ All automatically generated
‣ Even the load statement itself is codegen
Where do these work?
‣ Java MapReduce APIs (InputFormats, OutputFormats, Writables)
‣ Deprecated Java MapReduce APIs (same)
  ‣ Enables Streaming, Dumbo, Cascading
‣ Pig
‣ HBase
Counting Big Data
‣ standard counts, min, max, std dev
‣ How many requests do we serve in a day?
‣ What is the average latency? 95% latency?
‣ Group by response code. What is the hourly distribution?
‣ How many searches happen each day on Twitter?
‣ How many unique queries, how many unique users?
‣ What is their geographic distribution?
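The counting questions above reduce to a handful of aggregates. The sketch below computes them in plain Python over a tiny in-memory sample; the record layout `(hour, response_code, latency_ms)` and all values are made up for illustration, and at Twitter's scale the same logic would run as a Pig or MapReduce job rather than a single-process script.

```python
import math
import statistics
from collections import defaultdict

# Hypothetical request log records: (hour_of_day, response_code, latency_ms)
requests = [
    (0, 200, 12.0), (0, 200, 18.0), (0, 500, 250.0),
    (1, 200, 15.0), (1, 404, 9.0), (1, 200, 22.0),
]


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


latencies = [latency for _, _, latency in requests]
summary = {
    "count": len(latencies),
    "min": min(latencies),
    "max": max(latencies),
    "mean": statistics.mean(latencies),
    "std_dev": statistics.pstdev(latencies),  # population standard deviation
    "p95": percentile(latencies, 95),
}

# Group by response code, then by hour, to get the hourly distribution per code.
by_code_and_hour = defaultdict(int)
for hour, code, _ in requests:
    by_code_and_hour[(code, hour)] += 1

if __name__ == "__main__":
    print(summary)
    print(dict(by_code_and_hour))
```

Counts, min, max, and sums combine trivially across mappers; an exact percentile needs the sorted values (or an approximation), which is one reason it is the more expensive question on the list.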
Correlating Big Data
‣ probabilities, covariance, influence
‣ How does usage differ for mobile users?
‣ How about for users with 3rd party desktop clients?
‣ Cohort analyses
‣ Site problems: what goes wrong at the same time?
‣ Which features get users hooked?
‣ Which features do successful users use often?
‣ Search corrections, search suggestions
‣ A/B testing
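For the correlation questions, the basic quantities are covariance and the Pearson correlation coefficient. The following is a minimal sketch with hypothetical per-user numbers (sessions from a mobile client versus tweets posted); it is not drawn from the talk and says nothing about how Twitter actually computes these.

```python
import math


def covariance(xs, ys):
    """Population covariance of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


if __name__ == "__main__":
    # Hypothetical per-user data for illustration only.
    mobile_sessions = [1, 4, 2, 8, 5]
    tweets_posted = [0, 3, 1, 9, 4]
    print("covariance:", covariance(mobile_sessions, tweets_posted))
    print("pearson r:", pearson(mobile_sessions, tweets_posted))
```

A coefficient near 1 or -1 indicates a strong linear relationship, but not which way the influence runs; that is what the A/B testing bullet addresses.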
Research on Big Data
‣ prediction, graph analysis, natural language
‣ What can we tell about a user from their tweets?
‣ From the tweets of those they follow?
‣ From the tweets of their followers?
‣ From the ratio of followers/following?
‣ What graph structures lead to successful networks?
‣ User reputation
Research on Big Data
‣ prediction, graph analysis, natural language
‣ Sentiment analysis
‣ What features get a tweet retweeted?
‣ How deep is the corresponding retweet tree?
‣ Long-term duplicate detection
‣ Machine learning
‣ Language detection
‣ ... the list goes on.
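As one small example from this list, the depth of a retweet tree can be found with a breadth-first walk. This sketch assumes a hypothetical mapping from each retweet's id to the id of the tweet it retweeted, and assumes that mapping forms a tree with no cycles; the talk does not describe the actual data model.

```python
from collections import defaultdict, deque


def retweet_tree_depth(root, retweet_of):
    """Depth of the retweet tree under `root` (0 if nothing retweeted it).

    `retweet_of` maps each retweet id to the id of the tweet it retweeted.
    Assumes the mapping is a tree (no cycles).
    """
    children = defaultdict(list)
    for child, parent in retweet_of.items():
        children[parent].append(child)

    deepest = 0
    frontier = deque([(root, 0)])
    while frontier:
        tweet_id, depth = frontier.popleft()
        deepest = max(deepest, depth)
        for child in children[tweet_id]:
            frontier.append((child, depth + 1))
    return deepest


if __name__ == "__main__":
    # Hypothetical ids: 2 and 3 retweet 1, 4 retweets 2, 5 retweets 4.
    sample = {2: 1, 3: 1, 4: 2, 5: 4}
    print(retweet_tree_depth(1, sample))
```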
Resolution
‣ All we do now is write IDL for the data schema
‣ Get efficient, forward/backwards compatible, splittable data structures automatically generated for us
‣ Get loaders, input formats, output formats, writables, and schemas automatically generated for us
‣ Helps the Twitter analytics team stay agile
  ‣ Can handle new, complex data without the need for new code, new tests, new bugs
  ‣ Focus on the analysis, not data formats
Twitter Open Source
‣ Coming soon! (1-2 weeks) http://github.com/kevinweil
‣ All base classes for InputFormats, OutputFormats, Writables, Pig Loaders, etc
‣ For new and deprecated MapReduce API
‣ With and without LZO compression (see http://github.com/kevinweil/hadoop-lzo)
‣ Protobuf reflection helpers
‣ Serialized block storage format for HDFS
Questions? Follow me at twitter.com/kevinweil
‣ If this sounded interesting to you -- that’s because it is. And we’re hiring.