Hadoop and Protocol Buffers at Twitter
Kevin Weil -- @kevinweil
Analytics Lead, Twitter




Outline
‣   Problem Statement
‣   CSV? XML? JSON? Regex?
‣   Protocol Buffers
‣   Codegen, Hadoop and You
‣   Applications
‣   Conclusions and Next Steps
My Background
‣   Studied Mathematics and Physics at Harvard, Physics at
    Stanford
‣   Tropos Networks (city-wide wireless): mesh routing algorithms,
    GBs of data
‣   Cooliris (web media): Hadoop and Pig for analytics, TBs of data
‣   Twitter: Hadoop, Pig, HBase, large-scale data analysis and
    visualization, social graph analysis, machine learning, lots more
    data
The Challenge
‣   Store 100 billion tweets in a way that is
‣       Robust to changes
‣       Efficient in size and speed
‣       Amenable to large-scale analysis
‣       Reusable (especially for other classes of data, like logs, where the size gets really large)
The System
‣   Your (friend’s) hadoop
    cluster
The Data
‣   kevin@tw-mbp-kweil ~ $ curl http://api.twitter.com/1/statuses/show/9225259353.xml
    <?xml version="1.0" encoding="UTF-8"?>
    <status>
      <created_at>Wed Feb 17 08:01:13 +0000 2010</created_at>
      <id>9225259353</id>
      <text>Preparing slides for tomorrow's talk at Y! at the Hadoop User Group: Protobufs and Hadoop at Twitter.   See you there?   http://bit.ly/9DJcd9</text>
      <source>&lt;a href=&quot;http://www.tweetdeck.com/&quot; rel=&quot;nofollow&quot;&gt;TweetDeck&lt;/a&gt;</source>
      <truncated>false</truncated>
      <in_reply_to_status_id></in_reply_to_status_id>
      <in_reply_to_user_id></in_reply_to_user_id>
      <favorited>false</favorited>
      <in_reply_to_screen_name></in_reply_to_screen_name>
      <user>
        <id>3452911</id>
        <name>Kevin Weil</name>
        <screen_name>kevinweil</screen_name>
        <location>Portola Valley, CA</location>
        <description>Analytics Lead at Twitter. Ultra-marathons, cycling, hadoop, lolcats.</description>
        <profile_image_url>http://a3.twimg.com/profile_images/220257539/n206489_34325699_8572_normal.jpg</profile_image_url>
        <url></url>
        <protected>false</protected>
        <followers_count>3122</followers_count>
        <profile_background_color>B2DFDA</profile_background_color>
        <profile_text_color>333333</profile_text_color>
        <profile_link_color>93A644</profile_link_color>
        <profile_sidebar_fill_color>ffffff</profile_sidebar_fill_color>
        <profile_sidebar_border_color>eeeeee</profile_sidebar_border_color>
        <friends_count>436</friends_count>
        <created_at>Wed Apr 04 19:29:46 +0000 2007</created_at>
        <favourites_count>721</favourites_count>
        <utc_offset>-28800</utc_offset>
        <time_zone>Pacific Time (US &amp; Canada)</time_zone>
        <profile_background_image_url>http://s.twimg.com/a/1266345225/images/themes/theme13/bg.gif</profile_background_image_url>
        <profile_background_tile>false</profile_background_tile>
        <notifications>false</notifications>
        <geo_enabled>true</geo_enabled>
        <verified>false</verified>
        <following>false</following>
        <statuses_count>2556</statuses_count>
        <lang>en</lang>
        <contributors_enabled>false</contributors_enabled>
      </user>
      <geo/>
      <contributors/>
    </status>
‣   Each tweet has 12 fields, 3 of which (user, geo, contributors) have subfields
‣   It can change as we add new features
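A minimal sketch of how the structure of such a response can be inspected, using Python's standard xml.etree.ElementTree. The `SAMPLE` string is a trimmed stand-in for the response above (most user subfields and the full text are dropped to keep it short), and the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the API response (assumption: only two <user>
# subfields are kept and the text/source values are shortened).
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<status>
  <created_at>Wed Feb 17 08:01:13 +0000 2010</created_at>
  <id>9225259353</id>
  <text>Preparing slides for tomorrow's talk</text>
  <source>TweetDeck</source>
  <truncated>false</truncated>
  <in_reply_to_status_id></in_reply_to_status_id>
  <in_reply_to_user_id></in_reply_to_user_id>
  <favorited>false</favorited>
  <in_reply_to_screen_name></in_reply_to_screen_name>
  <user>
    <id>3452911</id>
    <screen_name>kevinweil</screen_name>
  </user>
  <geo/>
  <contributors/>
</status>"""

# Parse from bytes so the parser applies the declared encoding itself.
root = ET.fromstring(SAMPLE.encode("utf-8"))

# Direct children of <status> are the top-level tweet fields.
top_level = [child.tag for child in root]

# A field "has subfields" here if it contains child elements.
nested = [child.tag for child in root if len(child) > 0]

screen_name = root.findtext("user/screen_name")
```

In this particular tweet geo and contributors are empty, so only user shows up in `nested`; the schema allows subfields in all three.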
The Requirements
‣   Splittability
‣   Parsing efficiency
‣   Reusability
‣   Ability to add new fields
‣   Ability to ignore unused fields
‣   Small data size
‣   Hierarchical
Common Formats
‣   Comparison table (the per-cell marks were graphics and are not recoverable here)
‣   Criteria: Splittable, Parsing efficiency, Reusability, Add new fields, Ignore unused fields, Small data size, Hierarchical
‣   Formats compared: XML, JSON, CSV, Custom regex (Apache)
Enter Protocol Buffers
‣   “Protocol Buffers are a way of encoding structured data in an
    efficient yet extensible format. Google uses Protocol Buffers for
    almost all of its internal RPC protocols and file formats.”
‣       http://code.google.com/p/protobuf
‣   You write IDL describing your data structure
‣   It generates code in your languages of choice to construct, serialize,
    deserialize, reflect across, etc, your data structure
‣   Like Thrift, but richer and more efficient (except no RPC)
‣   Avro is an exciting up-and-coming alternative
Protobuf IDL Example
    message Status {
      optional string created_at              = 1;
      optional int64 id                       = 2;
      optional string text                    = 3;
      optional string source                  = 4;
      optional bool truncated                 = 5;
      optional int64 in_reply_to_status_id    = 6;
      optional int64 in_reply_to_user_id      = 7;
      optional bool favorited                 = 8;
      optional string in_reply_to_screen_name = 9;
      // Message-typed fields: the type is the nested message name.
      optional User user                      = 10;
      optional Geo geo                        = 11;
      optional Contributors contributors      = 12;

      message User {
        optional int64 id                     = 1;
        optional string name                  = 2;
        ...
      }
      message Geo { ... }
      message Contributors { ... }
    }
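To give a feel for why the encoded form stays small, here is a minimal sketch of how the protobuf wire format lays out two of the Status fields above (id = 2 and text = 3). It hand-rolls the varint and tag encoding in plain Python instead of using generated classes; the helper names are made up for this illustration, and negative integers are deliberately not handled.

```python
WIRE_VARINT = 0  # used for int32, int64, bool, ...
WIRE_LEN = 2     # used for string, bytes, nested messages


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative values")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_tag(field_number: int, wire_type: int) -> bytes:
    """A field key is the field number and wire type packed into a varint."""
    return encode_varint((field_number << 3) | wire_type)


def encode_int64_field(field_number: int, value: int) -> bytes:
    return encode_tag(field_number, WIRE_VARINT) + encode_varint(value)


def encode_string_field(field_number: int, text: str) -> bytes:
    payload = text.encode("utf-8")
    return encode_tag(field_number, WIRE_LEN) + encode_varint(len(payload)) + payload


# Status.id is field 2 and Status.text is field 3 in the IDL.
id_bytes = encode_int64_field(2, 9225259353)
status_bytes = id_bytes + encode_string_field(3, "hello")
```

Here the id field takes 6 bytes (one tag byte plus a 5-byte varint), against 19 characters for `<id>9225259353</id>` in the XML form.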
Protobuf Generated Code
‣   The generated code is:
‣       Efficient (Google quotes 80x vs. |-delimited format) [1, 2]
‣       Extensible
‣       Backwards compatible
‣       Polymorphic (in Java, C++, Python)
‣       Metadata-rich



1. http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext
2. http://code.google.com/p/thrift-protobuf-compare/wiki/Benchmarking
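The "extensible" and "backwards compatible" properties follow from the wire format: every field carries its number and wire type, so a reader built from an older schema can step over fields it does not know. Below is a minimal hand-written sketch of that skipping logic in Python. It is not the generated code, it covers only the varint and length-delimited wire types, and the function names and sample bytes are assumptions for illustration.

```python
def read_varint(buf: bytes, pos: int):
    """Return (value, next_pos) for the varint starting at buf[pos]."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of the varint
            return result, pos
        shift += 7


def read_known_fields(buf: bytes, known: dict) -> dict:
    """Decode fields listed in `known` (number -> name) and skip the rest."""
    out = {}
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:      # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:    # length-delimited
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        if field_number in known:
            out[known[field_number]] = value
        # Unknown field numbers were still consumed above, so they are skipped.
    return out


# id (field 2) = 150, text (field 3) = "hi", then a varint field 13 = 1
# written by a hypothetical newer producer.
message = b"\x10\x96\x01" + b"\x1a\x02hi" + b"\x68\x01"
old_reader = read_known_fields(message, {2: "id", 3: "text"})
```

The old reader returns only the id and text values; the extra field 13 is consumed and ignored rather than causing a parse failure.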
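Common Formats
‣   Comparison table (the per-cell marks were graphics and are not recoverable here)
‣   Criteria: Splittable, Parsing efficiency, Reusability, Add new fields, Ignore unused fields, Small data size, Hierarchical
‣   Formats compared: XML, JSON, CSV, Custom regex (Apache), Protocol Buffers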
But Wait, There’s More
‣   Codegen for data structures is nice...
‣   Next step: codegen for all Hadoop-related code
‣       Protocol Buffer InputFormats
‣       OutputFormats
‣       Writables
‣       Pig LoadFuncs and StoreFuncs
‣       Cascading, Streaming, Dumbo, etc
‣       Per Protocol Buffer
Protocol Buffer InputFormats
‣   All objects (hierarchical data, inheritance, etc)
‣   All automatically generated
‣   Efficient, extensible storage and serialization
Pig LoadFuncs
‣   All objects (hierarchical data, inheritance, etc)
‣   All automatically generated
‣   Even the load statement itself is codegen
Where do these work?
‣   Java MapReduce APIs (InputFormats, OutputFormats, Writables)
‣   Deprecated Java MapReduce APIs (same)
‣       Enables Streaming, Dumbo, Cascading
‣   Pig
‣   HBase
Counting Big Data
‣   Standard counts, min, max, std dev
‣   How many requests do we serve in a day?
‣   What is the average latency? 95% latency?
‣   Group by response code. What is the hourly distribution?
‣   How many searches happen each day on Twitter?
‣   How many unique queries, how many unique users?
‣   What is their geographic distribution?
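As a toy illustration of the counting questions above, the sketch below computes a request count, mean and 95th-percentile latency, and a group-by on response code over a handful of in-memory records. The records, their layout, and the nearest-rank percentile definition are all assumptions made for the example; it shows the shape of the computation, not how it is run over a cluster.

```python
import math
import statistics
from collections import Counter

# Hypothetical request log entries: (response_code, latency_ms)
requests = [
    (200, 12.0), (200, 15.0), (200, 11.0), (404, 8.0),
    (200, 30.0), (500, 120.0), (200, 14.0), (200, 13.0),
]

latencies = sorted(latency for _, latency in requests)


def percentile_nearest_rank(sorted_values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    rank = math.ceil(pct / 100 * len(sorted_values))
    return sorted_values[max(rank, 1) - 1]


total = len(requests)                              # how many requests?
mean_latency = statistics.mean(latencies)          # average latency
stdev_latency = statistics.pstdev(latencies)       # population std dev
p95_latency = percentile_nearest_rank(latencies, 95)
by_code = Counter(code for code, _ in requests)    # group by response code
```

With so few records the 95th percentile is simply the slowest request, which is a reminder that tail metrics only become meaningful at volume.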
Correlating Big Data
‣   Probabilities, covariance, influence
‣   How does usage differ for mobile users?
‣   How about for users with 3rd party desktop clients?
‣   Cohort analyses
‣   Site problems: what goes wrong at the same time?
‣   Which features get users hooked?
‣   Which features do successful users use often?
‣   Search corrections, search suggestions
‣   A/B testing
Research on Big Data
‣   Prediction, graph analysis, natural language
‣   What can we tell about a user from their tweets?
‣     From the tweets of those they follow?
‣     From the tweets of their followers?
‣     From the ratio of followers/following?
‣   What graph structures lead to successful networks?
‣   User reputation
Research on Big Data
‣   Prediction, graph analysis, natural language
‣   Sentiment analysis
‣   What features get a tweet retweeted?
‣     How deep is the corresponding retweet tree?
‣   Long-term duplicate detection
‣   Machine learning
‣   Language detection
‣   ... the list goes on.
Resolution
‣   All we do now is write IDL for the data schema
‣   Get efficient, forward/backwards compatible, splittable data structures
    automatically generated for us
‣   Get loaders, input formats, output formats, writables, and schemas
    automatically generated for us
‣   Helps the Twitter analytics team stay agile
‣       Can handle new, complex data without the need for new code, new tests, new bugs
‣       Focus on the analysis, not data formats
Twitter Open Source
‣   Coming soon! (1-2 weeks) http://github.com/kevinweil
‣   All base classes for InputFormats, OutputFormats, Writables, Pig
    Loaders, etc
‣   For new and deprecated MapReduce API
‣   With and without LZO compression (see http://github.com/kevinweil/hadoop-lzo)
‣   Protobuf reflection helpers
‣   Serialized block storage format for HDFS
Questions?
‣   Follow me at twitter.com/kevinweil
‣   If this sounded interesting to you -- that’s because it is. And we’re hiring.

More Related Content

What's hot

Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotFlink Forward
 
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Introduction to DataFusion  An Embeddable Query Engine Written in RustIntroduction to DataFusion  An Embeddable Query Engine Written in Rust
Introduction to DataFusion An Embeddable Query Engine Written in RustAndrew Lamb
 
Data Pipelines with Apache Kafka
Data Pipelines with Apache KafkaData Pipelines with Apache Kafka
Data Pipelines with Apache KafkaBen Stopford
 
Unique ID generation in distributed systems
Unique ID generation in distributed systemsUnique ID generation in distributed systems
Unique ID generation in distributed systemsDave Gardner
 
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry confluent
 
Data Streaming Ecosystem Management at Booking.com
Data Streaming Ecosystem Management at Booking.com Data Streaming Ecosystem Management at Booking.com
Data Streaming Ecosystem Management at Booking.com confluent
 
Best Practices for Enabling Speculative Execution on Large Scale Platforms
Best Practices for Enabling Speculative Execution on Large Scale PlatformsBest Practices for Enabling Speculative Execution on Large Scale Platforms
Best Practices for Enabling Speculative Execution on Large Scale PlatformsDatabricks
 
Microservices Manchester: Authentication in Microservice Systems by David Borsos
Microservices Manchester: Authentication in Microservice Systems by David BorsosMicroservices Manchester: Authentication in Microservice Systems by David Borsos
Microservices Manchester: Authentication in Microservice Systems by David BorsosOpenCredo
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured StreamingKnoldus Inc.
 
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Databricks
 
Scaling up uber's real time data analytics
Scaling up uber's real time data analyticsScaling up uber's real time data analytics
Scaling up uber's real time data analyticsXiang Fu
 
Devoxx : being productive with JHipster
Devoxx : being productive with JHipsterDevoxx : being productive with JHipster
Devoxx : being productive with JHipsterJulien Dubois
 
Maintaining Consistency Across Data Centers (Randy Fradin, BlackRock) | Cassa...
Maintaining Consistency Across Data Centers (Randy Fradin, BlackRock) | Cassa...Maintaining Consistency Across Data Centers (Randy Fradin, BlackRock) | Cassa...
Maintaining Consistency Across Data Centers (Randy Fradin, BlackRock) | Cassa...DataStax
 
Dapr - A 10x Developer Framework for Any Language
Dapr - A 10x Developer Framework for Any LanguageDapr - A 10x Developer Framework for Any Language
Dapr - A 10x Developer Framework for Any LanguageBilgin Ibryam
 
Introduction to Kafka Streams
Introduction to Kafka StreamsIntroduction to Kafka Streams
Introduction to Kafka StreamsGuozhang Wang
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaJiangjie Qin
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsSpark Summit
 
Kafka Tutorial - introduction to the Kafka streaming platform
Kafka Tutorial - introduction to the Kafka streaming platformKafka Tutorial - introduction to the Kafka streaming platform
Kafka Tutorial - introduction to the Kafka streaming platformJean-Paul Azar
 
Storing 16 Bytes at Scale
Storing 16 Bytes at ScaleStoring 16 Bytes at Scale
Storing 16 Bytes at ScaleFabian Reinartz
 

What's hot (20)

Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Introduction to DataFusion  An Embeddable Query Engine Written in RustIntroduction to DataFusion  An Embeddable Query Engine Written in Rust
Introduction to DataFusion An Embeddable Query Engine Written in Rust
 
Data Pipelines with Apache Kafka
Data Pipelines with Apache KafkaData Pipelines with Apache Kafka
Data Pipelines with Apache Kafka
 
Unique ID generation in distributed systems
Unique ID generation in distributed systemsUnique ID generation in distributed systems
Unique ID generation in distributed systems
 
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
 
Data Streaming Ecosystem Management at Booking.com
Data Streaming Ecosystem Management at Booking.com Data Streaming Ecosystem Management at Booking.com
Data Streaming Ecosystem Management at Booking.com
 
Best Practices for Enabling Speculative Execution on Large Scale Platforms
Best Practices for Enabling Speculative Execution on Large Scale PlatformsBest Practices for Enabling Speculative Execution on Large Scale Platforms
Best Practices for Enabling Speculative Execution on Large Scale Platforms
 
Microservices Manchester: Authentication in Microservice Systems by David Borsos
Microservices Manchester: Authentication in Microservice Systems by David BorsosMicroservices Manchester: Authentication in Microservice Systems by David Borsos
Microservices Manchester: Authentication in Microservice Systems by David Borsos
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured Streaming
 
Flink vs. Spark
Flink vs. SparkFlink vs. Spark
Flink vs. Spark
 
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
 
Scaling up uber's real time data analytics
Scaling up uber's real time data analyticsScaling up uber's real time data analytics
Scaling up uber's real time data analytics
 
Devoxx : being productive with JHipster
Devoxx : being productive with JHipsterDevoxx : being productive with JHipster
Devoxx : being productive with JHipster
 
Maintaining Consistency Across Data Centers (Randy Fradin, BlackRock) | Cassa...
Maintaining Consistency Across Data Centers (Randy Fradin, BlackRock) | Cassa...Maintaining Consistency Across Data Centers (Randy Fradin, BlackRock) | Cassa...
Maintaining Consistency Across Data Centers (Randy Fradin, BlackRock) | Cassa...
 
Dapr - A 10x Developer Framework for Any Language
Dapr - A 10x Developer Framework for Any LanguageDapr - A 10x Developer Framework for Any Language
Dapr - A 10x Developer Framework for Any Language
 
Introduction to Kafka Streams
Introduction to Kafka StreamsIntroduction to Kafka Streams
Introduction to Kafka Streams
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache Kafka
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Kafka Tutorial - introduction to the Kafka streaming platform
Kafka Tutorial - introduction to the Kafka streaming platformKafka Tutorial - introduction to the Kafka streaming platform
Kafka Tutorial - introduction to the Kafka streaming platform
 
Storing 16 Bytes at Scale
Storing 16 Bytes at ScaleStoring 16 Bytes at Scale
Storing 16 Bytes at Scale
 

Viewers also liked

Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Kevin Weil
 
Rest style web services (google protocol buffers) prasad nirantar
Rest style web services (google protocol buffers)   prasad nirantarRest style web services (google protocol buffers)   prasad nirantar
Rest style web services (google protocol buffers) prasad nirantarIndicThreads
 
Data Serialization Using Google Protocol Buffers
Data Serialization Using Google Protocol BuffersData Serialization Using Google Protocol Buffers
Data Serialization Using Google Protocol BuffersWilliam Kibira
 
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)Adam Kawa
 
What Makes a Great Open API?
What Makes a Great Open API?What Makes a Great Open API?
What Makes a Great Open API?John Musser
 
Large Scale Hierarchical Text Classification
Large Scale Hierarchical Text ClassificationLarge Scale Hierarchical Text Classification
Large Scale Hierarchical Text ClassificationHammad Haleem
 
Measuring CDN performance and why you're doing it wrong
Measuring CDN performance and why you're doing it wrongMeasuring CDN performance and why you're doing it wrong
Measuring CDN performance and why you're doing it wrongFastly
 
Introduction to protocol buffer
Introduction to protocol bufferIntroduction to protocol buffer
Introduction to protocol bufferTim (文昌)
 
Startupinformatik
StartupinformatikStartupinformatik
StartupinformatikDirk Riehle
 
Experience protocol buffer on android
Experience protocol buffer on androidExperience protocol buffer on android
Experience protocol buffer on androidRichard Chang
 
Scalable Event Analytics with MongoDB & Ruby on Rails
Scalable Event Analytics with MongoDB & Ruby on RailsScalable Event Analytics with MongoDB & Ruby on Rails
Scalable Event Analytics with MongoDB & Ruby on RailsJared Rosoff
 
Hadoop summit 2010 frameworks panel elephant bird
Hadoop summit 2010 frameworks panel elephant birdHadoop summit 2010 frameworks panel elephant bird
Hadoop summit 2010 frameworks panel elephant birdKevin Weil
 
An Empirical Evaluation of VoIP Playout Buffer Dimensioning in Skype, Google ...
An Empirical Evaluation of VoIP Playout Buffer Dimensioning in Skype, Google ...An Empirical Evaluation of VoIP Playout Buffer Dimensioning in Skype, Google ...
An Empirical Evaluation of VoIP Playout Buffer Dimensioning in Skype, Google ...Academia Sinica
 
Illustration of TextSecure's Protocol Buffer usage
Illustration of TextSecure's Protocol Buffer usageIllustration of TextSecure's Protocol Buffer usage
Illustration of TextSecure's Protocol Buffer usageChristine Corbett Moran
 
Hadoop at Twitter (Hadoop Summit 2010)
Hadoop at Twitter (Hadoop Summit 2010)Hadoop at Twitter (Hadoop Summit 2010)
Hadoop at Twitter (Hadoop Summit 2010)Kevin Weil
 
Spatial Analytics, Where 2.0 2010
Spatial Analytics, Where 2.0 2010Spatial Analytics, Where 2.0 2010
Spatial Analytics, Where 2.0 2010Kevin Weil
 

Viewers also liked (20)

Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)
 
Rest style web services (google protocol buffers) prasad nirantar
Rest style web services (google protocol buffers)   prasad nirantarRest style web services (google protocol buffers)   prasad nirantar
Rest style web services (google protocol buffers) prasad nirantar
 
Data Serialization Using Google Protocol Buffers
Data Serialization Using Google Protocol BuffersData Serialization Using Google Protocol Buffers
Data Serialization Using Google Protocol Buffers
 
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
 
3 apache-avro
3 apache-avro3 apache-avro
3 apache-avro
 
What Makes a Great Open API?
What Makes a Great Open API?What Makes a Great Open API?
What Makes a Great Open API?
 
Large Scale Hierarchical Text Classification
Large Scale Hierarchical Text ClassificationLarge Scale Hierarchical Text Classification
Large Scale Hierarchical Text Classification
 
Protocol Buffers
Protocol BuffersProtocol Buffers
Protocol Buffers
 
Measuring CDN performance and why you're doing it wrong
Measuring CDN performance and why you're doing it wrongMeasuring CDN performance and why you're doing it wrong
Measuring CDN performance and why you're doing it wrong
 
Introduction to protocol buffer
Introduction to protocol bufferIntroduction to protocol buffer
Introduction to protocol buffer
 
Startupinformatik
StartupinformatikStartupinformatik
Startupinformatik
 
Experience protocol buffer on android
Experience protocol buffer on androidExperience protocol buffer on android
Experience protocol buffer on android
 
Scalable Event Analytics with MongoDB & Ruby on Rails
Scalable Event Analytics with MongoDB & Ruby on RailsScalable Event Analytics with MongoDB & Ruby on Rails
Scalable Event Analytics with MongoDB & Ruby on Rails
 
Hadoop summit 2010 frameworks panel elephant bird
Hadoop summit 2010 frameworks panel elephant birdHadoop summit 2010 frameworks panel elephant bird
Hadoop summit 2010 frameworks panel elephant bird
 
An Empirical Evaluation of VoIP Playout Buffer Dimensioning in Skype, Google ...
An Empirical Evaluation of VoIP Playout Buffer Dimensioning in Skype, Google ...An Empirical Evaluation of VoIP Playout Buffer Dimensioning in Skype, Google ...
An Empirical Evaluation of VoIP Playout Buffer Dimensioning in Skype, Google ...
 
Protocol buffers
Protocol buffersProtocol buffers
Protocol buffers
 
Illustration of TextSecure's Protocol Buffer usage
Illustration of TextSecure's Protocol Buffer usageIllustration of TextSecure's Protocol Buffer usage
Illustration of TextSecure's Protocol Buffer usage
 
Protocol Buffer.ppt
Protocol Buffer.pptProtocol Buffer.ppt
Protocol Buffer.ppt
 
Hadoop at Twitter (Hadoop Summit 2010)
Hadoop at Twitter (Hadoop Summit 2010)Hadoop at Twitter (Hadoop Summit 2010)
Hadoop at Twitter (Hadoop Summit 2010)
 
Spatial Analytics, Where 2.0 2010
Spatial Analytics, Where 2.0 2010Spatial Analytics, Where 2.0 2010
Spatial Analytics, Where 2.0 2010
 

Protocol Buffers and Hadoop at Twitter

  • 1. Hadoop and Protocol Buffers at Twitter ‣ Kevin Weil -- @kevinweil ‣ Analytics Lead, Twitter
  • 2. Outline ‣ Problem Statement ‣ CSV? XML? JSON? Regex? ‣ Protocol Buffers ‣ Codegen, Hadoop and You ‣ Applications ‣ Conclusions and Next Steps
  • 3. My Background ‣ Studied Mathematics and Physics at Harvard, Physics at Stanford ‣ Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data ‣ Cooliris (web media): Hadoop and Pig for analytics, TBs of data ‣ Twitter: Hadoop, Pig, HBase, large-scale data analysis and visualization, social graph analysis, machine learning, lots more data
  • 5-10. The Challenge ‣ Store 100 billion tweets in a way that is ‣ Robust to changes ‣ Efficient in size and speed ‣ Amenable to large-scale analysis ‣ Reusable (especially for other classes of data, like logs, where the size gets really large)
  • 11. The System ‣ Your (friend’s) Hadoop cluster
  • 12. The Data ‣ Each tweet has 12 fields, 3 of which (user, geo, contributors) have subfields ‣ It can change as we add new features ‣ kevin@tw-mbp-kweil ~ $ curl http://api.twitter.com/1/statuses/show/9225259353.xml
      <?xml version="1.0" encoding="UTF-8"?>
      <status>
        <created_at>Wed Feb 17 08:01:13 +0000 2010</created_at>
        <id>9225259353</id>
        <text>Preparing slides for tomorrow's talk at Y! at the Hadoop User Group: Protobufs and Hadoop at Twitter. See you there? http://bit.ly/9DJcd9</text>
        <source>&lt;a href=&quot;http://www.tweetdeck.com/&quot; rel=&quot;nofollow&quot;&gt;TweetDeck&lt;/a&gt;</source>
        <truncated>false</truncated>
        <in_reply_to_status_id></in_reply_to_status_id>
        <in_reply_to_user_id></in_reply_to_user_id>
        <favorited>false</favorited>
        <in_reply_to_screen_name></in_reply_to_screen_name>
        <user>
          <id>3452911</id>
          <name>Kevin Weil</name>
          <screen_name>kevinweil</screen_name>
          <location>Portola Valley, CA</location>
          <description>Analytics Lead at Twitter. Ultra-marathons, cycling, hadoop, lolcats.</description>
          <profile_image_url>http://a3.twimg.com/profile_images/220257539/n206489_34325699_8572_normal.jpg</profile_image_url>
          <url></url>
          <protected>false</protected>
          <followers_count>3122</followers_count>
          <profile_background_color>B2DFDA</profile_background_color>
          <profile_text_color>333333</profile_text_color>
          <profile_link_color>93A644</profile_link_color>
          <profile_sidebar_fill_color>ffffff</profile_sidebar_fill_color>
          <profile_sidebar_border_color>eeeeee</profile_sidebar_border_color>
          <friends_count>436</friends_count>
          <created_at>Wed Apr 04 19:29:46 +0000 2007</created_at>
          <favourites_count>721</favourites_count>
          <utc_offset>-28800</utc_offset>
          <time_zone>Pacific Time (US &amp; Canada)</time_zone>
          <profile_background_image_url>http://s.twimg.com/a/1266345225/images/themes/theme13/bg.gif</profile_background_image_url>
          <profile_background_tile>false</profile_background_tile>
          <notifications>false</notifications>
          <geo_enabled>true</geo_enabled>
          <verified>false</verified>
          <following>false</following>
          <statuses_count>2556</statuses_count>
          <lang>en</lang>
          <contributors_enabled>false</contributors_enabled>
        </user>
        <geo/>
        <contributors/>
      </status>
  • 13. The Requirements ‣ Splittability ‣ Parsing efficiency ‣ Reusability ‣ Ability to add new fields ‣ Ability to ignore unused fields ‣ Small data size ‣ Hierarchical
  • 21-24. Common Formats (comparison table; the per-cell ratings did not survive in this transcript) ‣ Columns: Splittable, Parsing efficiency, Reusability, Add new fields, Ignore unused fields, Small data size, Hierarchical ‣ Rows: XML, JSON, CSV, Custom regex (Apache)
  • 26. Enter Protocol Buffers ‣ “Protocol Buffers are a way of encoding structured data in an efficient yet extensible format. Google uses Protocol Buffers for almost all of its internal RPC protocols and file formats.” ‣ http://code.google.com/p/protobuf ‣ You write IDL describing your data structure ‣ It generates code in your languages of choice to construct, serialize, deserialize, reflect across, etc, your data structure ‣ Like Thrift, but richer and more efficient (except no RPC) ‣ Avro is an exciting up-and-coming alternative
  • 27. Protobuf IDL Example
      message Status {
        optional string created_at = 1;
        optional int64 id = 2;
        optional string text = 3;
        optional string source = 4;
        optional bool truncated = 5;
        optional int64 in_reply_to_status_id = 6;
        optional int64 in_reply_to_user_id = 7;
        optional bool favorited = 8;
        optional string in_reply_to_screen_name = 9;
        // Message-typed fields are declared as "<Type> <name> = <tag>".
        optional User user = 10;
        optional Geo geo = 11;
        optional Contributors contributors = 12;

        message User {
          optional int64 id = 1;
          optional string name = 2;
          ...
        }
        message Geo { ... }
        message Contributors { ... }
      }
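To make the size argument concrete, here is a minimal Python sketch of how a few `Status` fields would look in the Protocol Buffers wire format. It hand-rolls the encoding instead of using the generated classes, so the helper names (`encode_varint`, `encode_int64_field`, and so on) are illustrative assumptions, not part of the protobuf library. It only handles non-negative integers, strings, and bools.

```python
WIRE_VARINT = 0  # wire type for int32/int64/bool
WIRE_LEN = 2     # wire type for length-delimited values (strings, bytes, nested messages)


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (7 bits per byte)."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_key(field_number: int, wire_type: int) -> bytes:
    """A field key is the tag number shifted left by 3, OR-ed with the wire type."""
    return encode_varint((field_number << 3) | wire_type)


def encode_int64_field(field_number: int, value: int) -> bytes:
    return encode_key(field_number, WIRE_VARINT) + encode_varint(value)


def encode_bool_field(field_number: int, flag: bool) -> bytes:
    return encode_key(field_number, WIRE_VARINT) + encode_varint(1 if flag else 0)


def encode_string_field(field_number: int, text: str) -> bytes:
    data = text.encode("utf-8")
    return encode_key(field_number, WIRE_LEN) + encode_varint(len(data)) + data


# Tag numbers follow the IDL above: id = 2, text = 3, truncated = 5.
id_field = encode_int64_field(2, 9225259353)
status = id_field + encode_string_field(3, "hello") + encode_bool_field(5, False)

xml_id = b"<id>9225259353</id>"
print(len(id_field), "bytes on the wire vs", len(xml_id), "bytes as XML")
```

Field names never appear in the encoded bytes, only the small integer tags, which is where most of the saving over XML or JSON comes from.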
  • 28. Protobuf Generated Code ‣ The generated code is: ‣ Efficient (Google quotes 80x vs. |-delimited format) [1, 2] ‣ Extensible ‣ Backwards compatible ‣ Polymorphic (in Java, C++, Python) ‣ Metadata-rich ‣ [1] http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext ‣ [2] http://code.google.com/p/thrift-protobuf-compare/wiki/Benchmarking
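The "extensible" and "backwards compatible" properties come from the wire format itself: every field carries its tag number and wire type, so a reader built against an older schema can skip fields it does not know. The sketch below illustrates that idea in plain Python. `parse_known_fields` and `OLD_SCHEMA` are hypothetical names for illustration, and the generated protobuf code does considerably more (typing, nested messages, unknown-field retention).

```python
def read_varint(buf: bytes, pos: int):
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse_known_fields(buf: bytes, known: dict) -> dict:
    """Return {name: raw value} for tags listed in `known`; skip everything else."""
    fields = {}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:        # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:      # fixed 64-bit
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:      # length-delimited
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:      # fixed 32-bit
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        if field_number in known:
            fields[known[field_number]] = value
        # Unknown tags were still consumed above, so parsing continues cleanly.
    return fields


# Hypothetical old reader that only knows tags 1 and 2.
OLD_SCHEMA = {1: "id", 2: "text"}

# A newer writer added tag 99 (a string) between the two known fields.
message = (
    bytes.fromhex("089601")            # tag 1, varint 150
    + bytes.fromhex("9a0603") + b"new"  # tag 99, 3-byte string
    + bytes.fromhex("1202") + b"hi"     # tag 2, 2-byte string
)

parsed = parse_known_fields(message, OLD_SCHEMA)
print(parsed)
```

Because the wire type says how long each value is, the old reader never needs to understand tag 99 to step over it.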
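  • 29. Common Formats (comparison table; the per-cell ratings did not survive in this transcript) ‣ Columns: Splittable, Parsing efficiency, Reusability, Add new fields, Ignore unused fields, Small data size, Hierarchical ‣ Rows: XML, JSON, CSV, Custom regex (Apache), Protocol Buffers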
  • 37. But Wait, There’s More ‣ Codegen for data structures is nice... ‣ Next step: codegen for all Hadoop-related code ‣ Protocol Buffer InputFormats ‣ OutputFormats ‣ Writables ‣ Pig LoadFuncs and StoreFuncs ‣ Cascading, Streaming, Dumbo, etc ‣ Per Protocol Buffer
  • 38. Protocol Buffer InputFormats ‣ All objects (hierarchical data, inheritance, etc) ‣ All automatically generated ‣ Efficient, extensible storage and serialization
  • 39. Pig LoadFuncs ‣ All objects (hierarchical data, inheritance, etc) ‣ All automatically generated ‣ Even the load statement itself is codegen
  • 40. Where do these work? ‣ Java MapReduce APIs (InputFormats, OutputFormats, Writables) ‣ Deprecated Java MapReduce APIs (same) ‣ Enables Streaming, Dumbo, Cascading ‣ Pig ‣ HBase
  • 42. Counting Big Data ‣ standard counts, min, max, std dev ‣ How many requests do we serve in a day? ‣ What is the average latency? 95% latency? ‣ Group by response code. What is the hourly distribution? ‣ How many searches happen each day on Twitter? ‣ How many unique queries, how many unique users? ‣ What is their geographic distribution?
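As a small-scale illustration of the counting questions above, the following Python sketch computes the standard summaries over a handful of request records. The records are made-up sample data, and at the volumes described in this talk the same aggregations would run as Pig or MapReduce jobs, not in a single process. The `percentile` helper uses the nearest-rank definition, which is one of several common conventions.

```python
import math
import statistics
from collections import Counter

# Hypothetical (response_code, latency_ms) records standing in for request logs.
requests = [
    (200, 12.0), (200, 15.0), (404, 8.0), (200, 40.0), (500, 120.0),
    (200, 18.0), (200, 22.0), (404, 9.0), (200, 14.0), (200, 16.0),
]


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


latencies = [latency for _, latency in requests]

summary = {
    "count": len(latencies),
    "min": min(latencies),
    "max": max(latencies),
    "mean": statistics.fmean(latencies),
    "std_dev": statistics.pstdev(latencies),  # population standard deviation
    "p95": percentile(latencies, 95),
}

# "Group by response code" becomes a simple counter here.
by_code = Counter(code for code, _ in requests)

print(summary)
print(dict(by_code))
```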
  • 43. Correlating Big Data ‣ probabilities, covariance, influence ‣ How does usage differ for mobile users? ‣ How about for users with 3rd party desktop clients? ‣ Cohort analyses ‣ Site problems: what goes wrong at the same time? ‣ Which features get users hooked? ‣ Which features do successful users use often? ‣ Search corrections, search suggestions ‣ A/B testing
  • 44. Research on Big Data ‣ prediction, graph analysis, natural language ‣ What can we tell about a user from their tweets? ‣ From the tweets of those they follow? ‣ From the tweets of their followers? ‣ From the ratio of followers/following? ‣ What graph structures lead to successful networks? ‣ User reputation
  • 45. Research on Big Data ‣ prediction, graph analysis, natural language ‣ Sentiment analysis ‣ What features get a tweet retweeted? ‣ How deep is the corresponding retweet tree? ‣ Long-term duplicate detection ‣ Machine learning ‣ Language detection ‣ ... the list goes on.
  • 47. Resolution ‣ All we do now is write IDL for the data schema ‣ Get efficient, forward/backwards compatible, splittable data structures automatically generated for us ‣ Get loaders, input formats, output formats, writables, and schemas automatically generated for us ‣ Helps the Twitter analytics team stay agile ‣ Can handle new, complex data without the need for new code, new tests, new bugs ‣ Focus on the analysis, not data formats
  • 48. Twitter Open Source ‣ Coming soon! (1-2 weeks) http://github.com/kevinweil ‣ All base classes for InputFormats, OutputFormats, Writables, Pig Loaders, etc ‣ For new and deprecated MapReduce API ‣ With and without LZO compression (see http://github.com/kevinweil/hadoop-lzo) ‣ Protobuf reflection helpers ‣ Serialized block storage format for HDFS
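The slides do not describe the serialized block storage format, so the following is only a generic sketch of how a binary record file can be made splittable, the first requirement listed earlier. It assumes a fixed sync marker before every length-prefixed block, similar in spirit to the sync markers in Hadoop SequenceFiles. `SYNC`, `write_blocks`, and `read_split` are hypothetical names, and the sketch assumes the marker bytes never occur inside a payload.

```python
import struct

# Hypothetical 16-byte sync marker; real formats typically use a random value
# that is very unlikely to appear inside record payloads.
SYNC = bytes.fromhex("a1b2c3d4e5f60718293a4b5c6d7e8f90")


def write_blocks(payloads):
    """Serialize payloads as: SYNC, 4-byte big-endian length, payload bytes."""
    out = bytearray()
    for payload in payloads:
        out += SYNC + struct.pack(">I", len(payload)) + payload
    return bytes(out)


def read_split(data: bytes, start: int, end: int):
    """Yield payloads of every block whose sync marker begins in [start, end)."""
    pos = data.find(SYNC, start)  # skip ahead to the first block boundary
    while pos != -1 and pos < end:
        header = pos + len(SYNC)
        (length,) = struct.unpack_from(">I", data, header)
        body = header + 4
        yield data[body:body + length]
        pos = data.find(SYNC, body + length)


payloads = [b"tweet-1", b"tweet-22", b"tweet-333"]
data = write_blocks(payloads)

# Two "mappers" each take half of the byte range, cut at an arbitrary offset.
mid = len(data) // 2
first_half = list(read_split(data, 0, mid))
second_half = list(read_split(data, mid, len(data)))
print(first_half, second_half)
```

Each reader owns the blocks that start inside its byte range, so every block is processed exactly once even though the split point falls in the middle of a record.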
  • 49. Questions? Follow me at twitter.com/kevinweil ‣ If this sounded interesting to you -- that’s because it is. And we’re hiring.