Hadoop at Twitter
Kevin Weil -- @kevinweil
Analytics Lead, Twitter
The Twitter Data Lifecycle
Data Input -> Data Storage -> Data Analysis -> Data Products
The Twitter Data Lifecycle
Data Input: Scribe, Crane
Data Storage: Elephant Bird, HBase
Data Analysis: Pig, Oink
Data Products: Birdbrain
A mix of community open source and Twitter open source (released or soon)
My Background
Studied mathematics and physics at Harvard, physics at Stanford
Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data
Cooliris (web media): Hadoop and Pig for analytics, TBs of data
Twitter: Hadoop, Pig, machine learning, visualization, social graph analysis, (soon) PBs of data
The Twitter Data Lifecycle: Data Input (Scribe, Crane)
What Data?
Two main kinds of raw data: logs and tabular data
Logs
Started with syslog-ng; as our volume grew, it didn't scale
Resources were overwhelmed and we lost data
Scribe
Scribe daemon runs locally; reliable in network outages
Nodes only know their downstream writer; hierarchical and scalable
Pluggable outputs, per category
(Diagram: FE x3 -> Agg x2 -> HDFS / file)
Scribe at Twitter
Solved our problem, opened new vistas
Currently 57 different categories logged from multiple sources
FE: JavaScript, Ruby on Rails
Middle tier: Ruby on Rails, Scala
Backend: Scala, Java, C++
7 TB/day into HDFS
Log first, ask questions later.
Scribe at Twitter
We've contributed to it as we've used it (http://github.com/traviscrawford/scribe)
Improved logging, monitoring, writing to HDFS, compression
Added ZooKeeper-based config
Continuing to work with Facebook on patches
Also: working with Cloudera to evaluate Flume
Tabular Data
Most site data is in MySQL: tweets, users, devices, client applications, etc.
Need to move it between MySQL and HDFS
Also between MySQL and HBase, or MySQL and MySQL
Crane: a configuration-driven ETL tool
Crane
(Architecture diagram: a driver handles configuration/batch management and ZooKeeper registration; data flows source -> extract -> protobuf P1 -> transform -> protobuf P2 -> load -> sink)
Crane
Extract: MySQL, HDFS, HBase, Flock, GA, Facebook Insights
Transform: IP/phone -> geo, canonicalize dates, cleaning, arbitrary logic
Load: MySQL, local file, stdout, HDFS, HBase
ZooKeeper coordination, intelligent date management
Runs all the time from multiple servers; self-healing
The Twitter Data Lifecycle: Data Storage (Elephant Bird, HBase)
Storage Basics
Incoming data: 7 TB/day
LZO-encode everything (http://github.com/kevinweil/hadoop-lzo)
Save 3-4x on storage, pay little CPU
Splittable! IO-bound jobs ==> 3-4x perf increase
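Downstream jobs then read the compressed data like any other input. A minimal sketch, assuming a hypothetical log path and the LzoTextLoader from Elephant Bird (the companion project introduced next):

    -- Load LZO-compressed logs; with an LZO block index in place,
    -- Hadoop computes input splits on block boundaries.
    raw = LOAD '/logs/web/2010-06-28'
          USING com.twitter.elephantbird.pig.load.LzoTextLoader()
          AS (line: chararray);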
Elephant Bird (http://github.com/kevinweil/elephant-bird)
(Slide image credit: http://www.flickr.com/photos/jagadish/3072134867/)
Elephant Bird
We have data coming in as protocol buffers via Crane...
Protobufs: codegen for efficient ser-de of data structures
Why shouldn't we just continue, and codegen more glue?
InputFormats, OutputFormats, Pig LoadFuncs, Pig StoreFuncs, Pig HBaseLoaders (sketch below)
Also now does part of this with Thrift, soon Avro
And JSON, W3C logs
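In Pig, the generated glue surfaces as one LoadFunc per message type. A sketch with a hypothetical generated loader for a Status protobuf (the actual generated class and field names may differ):

    -- Load LZO-compressed, base64-encoded Status protobufs with a codegen'd loader.
    statuses = LOAD '/tables/statuses/2010-06-28'
               USING com.twitter.elephantbird.pig.load.LzoProtobufB64LinePigLoader('Status');
    -- Protobuf fields appear as Pig columns; count statuses per user.
    by_user  = GROUP statuses BY user_id;
    counts   = FOREACH by_user GENERATE group AS user_id, COUNT(statuses) AS num_statuses;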
Challenge: Mutable Data
HDFS is write-once: no seek on write, no append (yet)
Logs are easy, but our tables change
Handling rapidly changing data in HDFS is not trivial. Options:
Don't worry about updated data
Refresh the entire dataset
Download updates, tombstone old versions of data, ensure jobs only run over current versions, occasionally rewrite the full dataset
HBase
Has already solved the update problem
Bonus: low-latency query API
Bonus: rich, BigTable-style data model based on column families
HBase at Twitter
Crane loads data directly into HBase
One column family (CF) for protobuf bytes, one CF to denormalize columns for indexing or quicker batch access
Update processing is transparent, so we always have accurate data in HBase
The Pig loader for HBase in Elephant Bird makes integration with existing analyses easy
The Twitter Data Lifecycle: Data Analysis (Pig, Oink)
Enter Pig
High-level language
Transformations on sets of records
Process data one step at a time
UDFs are first-class citizens
Easier than SQL?
Why Pig? Because I bet you can read the following script.
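The slide's actual script isn't reproduced in this text; as a stand-in, a minimal script in the same spirit (paths and field names hypothetical):

    -- Count status updates per user per day.
    statuses = LOAD '/tables/statuses' USING PigStorage('\t')
               AS (user_id: long, created_date: chararray, text: chararray);
    grouped  = GROUP statuses BY (user_id, created_date);
    counts   = FOREACH grouped GENERATE FLATTEN(group) AS (user_id, created_date),
                                        COUNT(statuses) AS num_statuses;
    STORE counts INTO '/analysis/statuses_per_user_per_day';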
A Real Pig Script
Now, just for fun... the same calculation in vanilla Hadoop MapReduce.
No, seriously.
Pig Democratizes Large-Scale Data Analysis
The Pig version is 5% of the code and 5% of the time, within 30% of the execution time
Innovation is increasingly driven by large-scale data analysis
Need fast iteration to understand the right questions
More minds contributing = more value from your data
Pig Examples
Using the HBase loader (sketch below)
Using the protobuf loaders (see the Elephant Bird sketch above)
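The exact Elephant Bird HBase-loader invocation isn't captured here; Pig's stock HBaseStorage shows the shape (table and column names hypothetical; argument syntax varies by Pig version):

    -- Read selected columns from an HBase table into Pig.
    users = LOAD 'users'
            USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:screen_name info:followers')
            AS (screen_name: chararray, followers: long);
    top   = ORDER users BY followers DESC;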
Pig Workflow
Oink: framework around Pig for loading, combining, running, post-processing
Everyone I know has one of these
Points to an opening for innovation; discussion beginning
Something we're looking at: a Ruby DSL for Pig, Piglet (http://github.com/ningliang/piglet)
Counting Big Data
Standard counts, min, max, std dev
How many requests do we serve in a day?
What is the average latency? 95th-percentile latency? (sketched below)
Group by response code: what is the hourly distribution?
How many searches happen each day on Twitter?
How many unique queries, how many unique users?
What is their geographic distribution?
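For example, the latency questions reduce to a few lines over a hypothetical request log (a true 95th percentile would come from a UDF; UDFs are first-class in Pig):

    requests = LOAD '/logs/web/2010-06-28' USING PigStorage('\t')
               AS (time: chararray, response_code: int, latency_ms: long);
    by_code  = GROUP requests BY response_code;
    stats    = FOREACH by_code GENERATE group AS response_code,
                                        COUNT(requests)          AS num_requests,
                                        AVG(requests.latency_ms) AS avg_latency_ms,
                                        MAX(requests.latency_ms) AS max_latency_ms;
    STORE stats INTO '/analysis/latency_by_response_code';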
Correlating Big Data
Probabilities, covariance, influence
How does usage differ for mobile users?
How about for users with 3rd-party desktop clients?
Cohort analyses (sketched below)
Site problems: what goes wrong at the same time?
Which features get users hooked?
Which features do successful users use often?
Search corrections, search suggestions
A/B testing
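A cohort analysis is just a join plus a group. A sketch over hypothetical user and activity tables:

    -- Average activity by signup month.
    users    = LOAD '/tables/users'    AS (user_id: long, signup_month: chararray);
    activity = LOAD '/tables/activity' AS (user_id: long, num_statuses: long);
    joined   = JOIN users BY user_id, activity BY user_id;
    cohorts  = GROUP joined BY users::signup_month;
    summary  = FOREACH cohorts GENERATE group AS signup_month,
                                        AVG(joined.activity::num_statuses) AS avg_statuses;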
Research on Big Data
Prediction, graph analysis, natural language
What can we tell about a user from their tweets?
From the tweets of those they follow?
From the tweets of their followers?
From the ratio of followers/following? (sketched below)
What graph structures lead to successful networks?
User reputation
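That ratio, for instance, is a one-line transform (field names hypothetical):

    users  = LOAD '/tables/users' AS (user_id: long, followers: long, following: long);
    -- Add 1 to the denominator to avoid division by zero.
    ratios = FOREACH users GENERATE user_id,
             (double) followers / ((double) following + 1.0) AS follow_ratio;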
Research on Big Data
Sentiment analysis
What features get a tweet retweeted?
How deep is the corresponding retweet tree?
Long-term duplicate detection
Machine learning
Language detection
... the list goes on.
The Twitter Data Lifecycle: Data Products (Birdbrain)
Data Products
Ad hoc analyses: answer questions to keep the business agile, do research
Online products: name search, other upcoming products
Company dashboard: Birdbrain
Questions?
Follow me at twitter.com/kevinweil
P.S. We're hiring. Help us build the next step: realtime big data analytics.
