Hadoop at Twitter
Kevin Weil -- @kevinweil
Analytics Lead, Twitter
The Twitter Data Lifecycle
Data Input -> Data Storage -> Data Analysis -> Data Products
The Twitter Data Lifecycle
Data Input: Scribe, Crane
Data Storage: Elephant Bird, HBase
Data Analysis: Pig, Oink
Data Products: Birdbrain
A mix of community open source and Twitter open source (released or soon)
My Background
Studied mathematics and physics at Harvard, physics at Stanford
Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data
Cooliris (web media): Hadoop and Pig for analytics, TBs of data
Twitter: Hadoop, Pig, machine learning, visualization, social graph analysis, (soon) PBs of data
The Twitter Data Lifecycle: Data Input (Scribe, Crane)
What Data?
Two main kinds of raw data: logs and tabular data
Logs
Started with syslog-ng; as our volume grew, it didn't scale
Resources were overwhelmed and we lost data
Scribe
Scribe daemon runs locally; reliable in network outages
Nodes only know their downstream writer; hierarchical and scalable
Pluggable outputs, per category
(Diagram: FE x3 -> Agg x2 -> HDFS / file)
Scribe at Twitter
Solved our problem, opened new vistas
Currently 57 different categories logged from multiple sources
FE: JavaScript, Ruby on Rails
Middle tier: Ruby on Rails, Scala
Backend: Scala, Java, C++
7 TB/day into HDFS
Log first, ask questions later.
Scribe at Twitter
We've contributed to it as we've used it (http://github.com/traviscrawford/scribe)
Improved logging, monitoring, writing to HDFS, compression
Added ZooKeeper-based config
Continuing to work with Facebook on patches
Also: working with Cloudera to evaluate Flume
Tabular Data
Most site data is in MySQL: tweets, users, devices, client applications, etc.
Need to move it between MySQL and HDFS
Also between MySQL and HBase, or MySQL and MySQL
Crane: a configuration-driven ETL tool
Crane
(Architecture diagram: a driver handles configuration/batch management and ZooKeeper registration; data flows source -> extract -> protobuf P1 -> transform -> protobuf P2 -> load -> sink)
Crane
Extract: MySQL, HDFS, HBase, Flock, GA, Facebook Insights
Transform: IP/phone -> geo, canonicalize dates, cleaning, arbitrary logic
Load: MySQL, local file, stdout, HDFS, HBase
ZooKeeper coordination, intelligent date management
Runs all the time from multiple servers; self-healing
The Twitter Data Lifecycle: Data Storage (Elephant Bird, HBase)
Storage Basics
Incoming data: 7 TB/day
LZO-encode everything (http://github.com/kevinweil/hadoop-lzo)
Save 3-4x on storage, pay little CPU
Splittable! IO-bound jobs ==> 3-4x perf increase
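Downstream jobs then read the compressed data like any other input. A minimal sketch, assuming a hypothetical log path and the LzoTextLoader from Elephant Bird (the companion project introduced next):

    -- Load LZO-compressed logs; with an LZO block index in place,
    -- Hadoop computes input splits on block boundaries.
    raw = LOAD '/logs/web/2010-06-28'
          USING com.twitter.elephantbird.pig.load.LzoTextLoader()
          AS (line: chararray);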
Elephant Bird (http://github.com/kevinweil/elephant-bird)
(Slide image credit: http://www.flickr.com/photos/jagadish/3072134867/)
Elephant Bird
We have data coming in as protocol buffers via Crane...
Protobufs: codegen for efficient ser-de of data structures
Why shouldn't we just continue, and codegen more glue?
InputFormats, OutputFormats, Pig LoadFuncs, Pig StoreFuncs, Pig HBaseLoaders (sketch below)
Also now does part of this with Thrift, soon Avro
And JSON, W3C logs
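In Pig, the generated glue surfaces as one LoadFunc per message type. A sketch with a hypothetical generated loader for a Status protobuf (the actual generated class and field names may differ):

    -- Load LZO-compressed, base64-encoded Status protobufs with a codegen'd loader.
    statuses = LOAD '/tables/statuses/2010-06-28'
               USING com.twitter.elephantbird.pig.load.LzoProtobufB64LinePigLoader('Status');
    -- Protobuf fields appear as Pig columns; count statuses per user.
    by_user  = GROUP statuses BY user_id;
    counts   = FOREACH by_user GENERATE group AS user_id, COUNT(statuses) AS num_statuses;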
Challenge: Mutable Data
HDFS is write-once: no seek on write, no append (yet)
Logs are easy, but our tables change
Handling rapidly changing data in HDFS is not trivial. Options:
Don't worry about updated data
Refresh the entire dataset
Download updates, tombstone old versions of data, ensure jobs only run over current versions, occasionally rewrite the full dataset
HBase
Has already solved the update problem
Bonus: low-latency query API
Bonus: rich, BigTable-style data model based on column families
HBase at Twitter
Crane loads data directly into HBase
One column family (CF) for protobuf bytes, one CF to denormalize columns for indexing or quicker batch access
Update processing is transparent, so we always have accurate data in HBase
The Pig loader for HBase in Elephant Bird makes integration with existing analyses easy
The Twitter Data Lifecycle: Data Analysis (Pig, Oink)
Enter Pig
High-level language
Transformations on sets of records
Process data one step at a time
UDFs are first-class citizens
Easier than SQL?
Why Pig? Because I bet you can read the following script.
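The slide's actual script isn't reproduced in this text; as a stand-in, a minimal script in the same spirit (paths and field names hypothetical):

    -- Count status updates per user per day.
    statuses = LOAD '/tables/statuses' USING PigStorage('\t')
               AS (user_id: long, created_date: chararray, text: chararray);
    grouped  = GROUP statuses BY (user_id, created_date);
    counts   = FOREACH grouped GENERATE FLATTEN(group) AS (user_id, created_date),
                                        COUNT(statuses) AS num_statuses;
    STORE counts INTO '/analysis/statuses_per_user_per_day';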
A Real Pig Script
Now, just for fun... the same calculation in vanilla Hadoop MapReduce.
No, seriously.
Pig Democratizes Large-Scale Data Analysis
The Pig version is 5% of the code and 5% of the time, within 30% of the execution time
Innovation is increasingly driven by large-scale data analysis
Need fast iteration to understand the right questions
More minds contributing = more value from your data
Pig Examples
Using the HBase loader (sketch below)
Using the protobuf loaders (see the Elephant Bird sketch above)
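The exact Elephant Bird HBase-loader invocation isn't captured here; Pig's stock HBaseStorage shows the shape (table and column names hypothetical; argument syntax varies by Pig version):

    -- Read selected columns from an HBase table into Pig.
    users = LOAD 'users'
            USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:screen_name info:followers')
            AS (screen_name: chararray, followers: long);
    top   = ORDER users BY followers DESC;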
Pig Workflow
Oink: framework around Pig for loading, combining, running, post-processing
Everyone I know has one of these
Points to an opening for innovation; discussion beginning
Something we're looking at: a Ruby DSL for Pig, Piglet (http://github.com/ningliang/piglet)
Counting Big Data
Standard counts, min, max, std dev
How many requests do we serve in a day?
What is the average latency? 95th-percentile latency? (sketched below)
Group by response code: what is the hourly distribution?
How many searches happen each day on Twitter?
How many unique queries, how many unique users?
What is their geographic distribution?
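For example, the latency questions reduce to a few lines over a hypothetical request log (a true 95th percentile would come from a UDF; UDFs are first-class in Pig):

    requests = LOAD '/logs/web/2010-06-28' USING PigStorage('\t')
               AS (time: chararray, response_code: int, latency_ms: long);
    by_code  = GROUP requests BY response_code;
    stats    = FOREACH by_code GENERATE group AS response_code,
                                        COUNT(requests)          AS num_requests,
                                        AVG(requests.latency_ms) AS avg_latency_ms,
                                        MAX(requests.latency_ms) AS max_latency_ms;
    STORE stats INTO '/analysis/latency_by_response_code';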
Correlating Big Data
Probabilities, covariance, influence
How does usage differ for mobile users?
How about for users with 3rd-party desktop clients?
Cohort analyses (sketched below)
Site problems: what goes wrong at the same time?
Which features get users hooked?
Which features do successful users use often?
Search corrections, search suggestions
A/B testing
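A cohort analysis is just a join plus a group. A sketch over hypothetical user and activity tables:

    -- Average activity by signup month.
    users    = LOAD '/tables/users'    AS (user_id: long, signup_month: chararray);
    activity = LOAD '/tables/activity' AS (user_id: long, num_statuses: long);
    joined   = JOIN users BY user_id, activity BY user_id;
    cohorts  = GROUP joined BY users::signup_month;
    summary  = FOREACH cohorts GENERATE group AS signup_month,
                                        AVG(joined.activity::num_statuses) AS avg_statuses;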
Research on Big Data
Prediction, graph analysis, natural language
What can we tell about a user from their tweets?
From the tweets of those they follow?
From the tweets of their followers?
From the ratio of followers/following? (sketched below)
What graph structures lead to successful networks?
User reputation
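That ratio, for instance, is a one-line transform (field names hypothetical):

    users  = LOAD '/tables/users' AS (user_id: long, followers: long, following: long);
    -- Add 1 to the denominator to avoid division by zero.
    ratios = FOREACH users GENERATE user_id,
             (double) followers / ((double) following + 1.0) AS follow_ratio;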
Research on Big Data
Sentiment analysis
What features get a tweet retweeted?
How deep is the corresponding retweet tree?
Long-term duplicate detection
Machine learning
Language detection
... the list goes on.
The Twitter Data Lifecycle: Data Products (Birdbrain)
Data Products
Ad hoc analyses: answer questions to keep the business agile, do research
Online products: name search, other upcoming products
Company dashboard: Birdbrain
Questions?
Follow me at twitter.com/kevinweil
P.S. We're hiring. Help us build the next step: realtime big data analytics.
