Hadoop and Pig at Twitter
Kevin Weil, Twitter
Hadoop Summit 2010 - Developers Track

Presentation Transcript

  • Hadoop at Twitter
    • Kevin Weil -- @kevinweil
    • Analytics Lead, Twitter
  • The Twitter Data Lifecycle
    • Data Input: Scribe , Crane
    • Data Storage: Elephant Bird , HBase
    • Data Analysis: Pig , Oink
    • Data Products: Birdbrain
    1 Community Open Source 2 Twitter Open Source (or soon)
  • My Background
    • Studied Mathematics and Physics at Harvard, Physics at Stanford
    • Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data
    • Cooliris (web media): Hadoop and Pig for analytics, TBs of data
    • Twitter : Hadoop, Pig, machine learning, visualization, social graph analysis, (soon) PBs of data
  • The Twitter Data Lifecycle
    • Data Input: Scribe , Crane
    • Data Storage
    • Data Analysis
    • Data Products
    1 Community Open Source 2 Twitter Open Source
  • What Data?
    • Two main kinds of raw data
    • Logs
    • Tabular data
  • Logs
    • Started with syslog-ng
    • As our volume grew, it didn’t scale
    • Resources overwhelmed
    • Lost data
  • Scribe
    • Scribe daemon runs locally; reliable in network outage
    • Nodes only know their downstream writer; hierarchical, scalable
    • Pluggable outputs, per category
    (Diagram: FE nodes → aggregators → HDFS file)
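The forwarding model described above (a local daemon buffering through outages, each node knowing only its single downstream writer, with pluggable per-category outputs) can be sketched in a few lines of Python. This is an illustrative model, not Scribe's actual API; all class and function names here are hypothetical.

```python
# Minimal sketch of Scribe's forwarding model: each node buffers
# messages locally and only knows its one downstream writer.
# Names are illustrative, not Scribe's API.

class ScribeNode:
    def __init__(self, downstream=None, outputs=None):
        self.downstream = downstream     # the one node we forward to
        self.outputs = outputs or {}     # pluggable sinks, keyed by category
        self.buffer = []                 # local buffer survives outages

    def log(self, category, message):
        self.buffer.append((category, message))
        self.flush()

    def flush(self):
        while self.buffer:
            category, message = self.buffer[0]
            if category in self.outputs:          # terminal writer for category
                self.outputs[category].append(message)
            elif self.downstream is not None:     # otherwise forward downstream
                self.downstream.log(category, message)
            else:
                return                            # no path yet: keep buffering
            self.buffer.pop(0)

# Front-end -> aggregator -> "HDFS" (a list standing in for the file sink)
hdfs_sink = []
agg = ScribeNode(outputs={"web": hdfs_sink})
fe = ScribeNode(downstream=agg)
fe.log("web", "GET /home 200")
```

Because a front-end node never needs the full topology, adding aggregator layers scales the tree without reconfiguring the leaves.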
  • Scribe at Twitter
    • Solved our problem, opened new vistas
    • Currently 57 different categories logged from multiple sources
    • FE: Javascript, Ruby on Rails
    • Middle tier: Ruby on Rails, Scala
    • Backend: Scala, Java, C++
    • 7 TB/day into HDFS
    • Log first, ask questions later.
  • Scribe at Twitter
    • We’ve contributed to it as we’ve used it 1
    • Improved logging, monitoring, writing to HDFS, compression
    • Added ZooKeeper-based config
    • Continuing to work with FB on patches
    • Also: working with Cloudera to evaluate Flume
    1 http://github.com/traviscrawford/scribe
  • Tabular Data
    • Most site data is in MySQL
    • Tweets, users, devices, client applications, etc
    • Need to move it between MySQL and HDFS
    • Also between MySQL and HBase, or MySQL and MySQL
    • Crane: configuration driven ETL tool
  • (Diagram: Crane driver with configuration/batch management; Extract → Transform → Load over protobufs P1/P2, from Source to Sink, with ZooKeeper registration)
  • Crane
    • Extract
    • MySQL, HDFS, HBase, Flock, GA, Facebook Insights
    • Transform
    • IP/Phone -> Geo, canonicalize dates, cleaning, arbitrary logic
    • Load
    • MySQL, Local file, Stdout, HDFS, HBase
    • ZooKeeper coordination, intelligent date management
    • Run all the time from multiple servers, self healing
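The extract/transform/load steps above are wired together from configuration rather than code, which is the heart of a configuration-driven ETL tool. A hedged Python sketch of that idea follows; the registries, config keys, and toy data are all illustrative and not Crane's actual interface.

```python
# Sketch of a configuration-driven ETL step in Crane's spirit: a config
# dict names an extract source, an ordered list of transforms, and a
# load sink. Names are illustrative, not Crane's API.

def run_etl(config, extractors, transforms, loaders):
    """Pull rows from the configured source, apply each configured
    transform in order, then push the result into the configured sink."""
    rows = extractors[config["extract"]]()
    for name in config["transform"]:
        rows = [transforms[name](r) for r in rows]
    loaders[config["load"]](rows)

# Toy source/transform/sink registries.
sink = []
extractors = {"mysql": lambda: [{"date": "2010/06/29", "ip": "1.2.3.4"}]}
transforms = {"canonicalize_date":
              lambda r: {**r, "date": r["date"].replace("/", "-")}}
loaders = {"hdfs": sink.extend}

run_etl({"extract": "mysql",
         "transform": ["canonicalize_date"],
         "load": "hdfs"},
        extractors, transforms, loaders)
```

New pipelines then become new config entries instead of new code, which is what makes it practical to run the tool continuously from multiple servers.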
  • The Twitter Data Lifecycle
    • Data Input
    • Data Storage: Elephant Bird , HBase
    • Data Analysis
    • Data Products
    1 Community Open Source 2 Twitter Open Source
  • Storage Basics
    • Incoming data: 7 TB/day
    • LZO encode everything
    • Save 3-4x on storage, pay little CPU
    • Splittable! 1
    • IO-bound jobs ==> 3-4x perf increase
    1 http://www.github.com/kevinweil/hadoop-lzo
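The storage numbers above can be sanity-checked with simple arithmetic. This sketch assumes a 3.5x ratio (the midpoint of the quoted 3-4x) and a 128 MB HDFS block size, both assumptions for illustration; the 7 TB/day figure is from the slide.

```python
# Back-of-the-envelope numbers: ~7 TB/day of raw logs, LZO saving 3-4x
# on storage. The 3.5x midpoint and the block size are assumptions.

RAW_TB_PER_DAY = 7.0
COMPRESSION_RATIO = 3.5        # midpoint of the quoted 3-4x
BLOCK_MB = 128                 # a common HDFS block size

stored_tb = RAW_TB_PER_DAY / COMPRESSION_RATIO    # ~2 TB/day on disk

# Because indexed LZO is splittable, each HDFS block of the compressed
# file can still become its own map task:
map_tasks = stored_tb * 1024 * 1024 / BLOCK_MB    # blocks in one day's data
```

Splittability is the key point: with a non-splittable codec, one day's data would be a handful of giant map tasks instead of thousands of parallel ones, and the 3-4x I/O savings would not translate into job speedup.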
  • Elephant Bird
    1 http://github.com/kevinweil/elephant-bird
  • Elephant Bird
    • We have data coming in as protocol buffers via Crane...
    • Protobufs: codegen for efficient ser-de of data structures
    • Why shouldn’t we just continue, and codegen more glue?
    • InputFormats, OutputFormats, Pig LoadFuncs, Pig StoreFuncs, Pig HBaseLoaders
    • Also now does part of this with Thrift, soon Avro
    • And JSON, W3C Logs
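Elephant Bird's core idea, as described above, is that once a schema exists, the repetitive Hadoop/Pig glue can be generated instead of hand-written per type. A minimal Python sketch of that idea follows; the template, class names, and decoder are illustrative, not Elephant Bird's actual generated output.

```python
# Sketch of schema-driven code generation: given a message name and its
# fields, emit the names of the glue classes plus a trivial decoder.
# Everything here is illustrative, not Elephant Bird's real output.

def codegen_glue(message_name, fields):
    glue = [f"Lzo{message_name}{suffix}"
            for suffix in ("InputFormat", "OutputFormat",
                           "PigLoader", "PigStorage")]

    def decode(record):
        # Stand-in for protobuf deserialization: map positional
        # values onto the schema's field names.
        return dict(zip(fields, record))

    return glue, decode

glue_classes, decode_status = codegen_glue("Status", ["id", "user", "text"])
tweet = decode_status([42, "kevinweil", "hello hadoop"])
```

The payoff is that adding a new message type costs one schema definition, and the InputFormats, OutputFormats, and Pig load/store functions come for free.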
  • Challenge: Mutable Data
    • HDFS is write-once: no seek on write, no append (yet)
    • Logs are easy.
    • But our tables change.
    • Handling rapidly changing data in HDFS: not trivial.
    • Don’t worry about updated data
    • Refresh entire dataset
    • Download updates, tombstone old versions of data, ensure jobs only run over current versions of data, occasionally rewrite full dataset
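The tombstoning approach in the last bullet (append every update as a new immutable record, mark deletes, and have reads keep only the latest live version per key) can be sketched directly. The record layout and names below are illustrative assumptions, not the actual on-disk format.

```python
# Sketch of "tombstone old versions" on a write-once store: updates are
# new immutable records; reads resolve to the latest version per key,
# honoring tombstones. The tuple layout is an assumption.

def current_view(records):
    """records: append-only (key, version, value, deleted) tuples.
    Returns the live value per key."""
    latest = {}
    for key, version, value, deleted in records:
        if key not in latest or version > latest[key][0]:
            latest[key] = (version, value, deleted)
    return {k: v for k, (_, v, dead) in latest.items() if not dead}

log = [
    ("user:1", 1, {"name": "ev"}, False),
    ("user:2", 1, {"name": "biz"}, False),
    ("user:1", 2, {"name": "evan"}, False),   # update = new version
    ("user:2", 2, None, True),                # tombstone = delete
]
```

The "occasionally rewrite the full dataset" step then amounts to materializing `current_view` back out and discarding the superseded records.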
  • HBase
    • Has already solved the update problem
    • Bonus: low-latency query API
    • Bonus: rich, BigTable-style data model based on column families
  • HBase at Twitter
    • Crane loads data directly into HBase
    • One CF for protobuf bytes, one CF to denormalize columns for indexing or quicker batch access
    • Processing updates transparent, so we always have accurate data in HBase
    • Pig Loader for HBase in Elephant Bird makes integration with existing analyses easy
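The two-column-family layout described above (one family for the opaque protobuf bytes, one denormalizing a few columns for indexing and quick batch scans) can be sketched as a row-construction helper. The family names, row-key scheme, and columns here are assumptions for illustration.

```python
# Sketch of the two-column-family row layout: CF 1 holds the raw
# protobuf bytes; CF 2 denormalizes selected columns for scanning.
# Family names, row keys, and columns are assumed, not Twitter's schema.

def make_row(row_key, proto_bytes, indexed_fields):
    return {
        "row": row_key,
        "raw": {"proto": proto_bytes},        # CF 1: opaque serialized bytes
        "index": dict(indexed_fields),        # CF 2: scannable columns
    }

row = make_row(
    row_key="user:12345",
    proto_bytes=b"\x08\xb9\x60",              # stand-in serialized protobuf
    indexed_fields={"screen_name": "kevinweil",
                    "created_at": "2010-06-29"},
)
```

Batch jobs that only need an indexed column can scan the small denormalized family without deserializing every protobuf, while full-fidelity consumers read the raw family.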
  • The Twitter Data Lifecycle
    • Data Input
    • Data Storage
    • Data Analysis: Pig , Oink
    • Data Products
    1 Community Open Source 2 Twitter Open Source
  • Enter Pig
    • High level language
    • Transformations on sets of records
    • Process data one step at a time
    • UDFs are first-class citizens
    • Easier than SQL?
  • Why Pig?
    • Because I bet you can read the following script.
  • A Real Pig Script
    • Now, just for fun... the same calculation in vanilla Hadoop MapReduce.
  • No, seriously.
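The Pig and MapReduce code slides are images and do not survive in this transcript. As an illustrative stand-in for what the audience saw, here is the Pig style of computation (one named relation per transformation step: load, filter, group, generate) written in Python; the data and field names are invented for the example.

```python
# The Pig idiom, step at a time, in Python. In Pig Latin each line
# below would be a LOAD / FILTER / GROUP / FOREACH statement binding
# a named relation. Data and fields are illustrative.

from collections import Counter

logs = [                                    # logs   = LOAD 'requests' ...
    {"status": 200, "path": "/home"},
    {"status": 500, "path": "/search"},
    {"status": 200, "path": "/search"},
]
ok = [r for r in logs if r["status"] == 200]   # ok     = FILTER logs BY status == 200
by_path = Counter(r["path"] for r in ok)       # grouped = GROUP ok BY path
counts = sorted(by_path.items())               # counts = FOREACH grouped GENERATE ...
```

Each line names an intermediate set, which is exactly what makes a Pig script readable top to bottom; the equivalent raw MapReduce job buries the same four steps in mapper and reducer boilerplate.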
  • Pig Democratizes Large-scale Data Analysis
    • The Pig version is:
      • 5% of the code
      • 5% of the time
      • Within 30% of the execution time
    • Innovation increasingly driven from large-scale data analysis
    • Need fast iteration to understand the right questions
    • More minds contributing = more value from your data
  • Pig Examples
    • Using the HBase Loader
    • Using the protobuf loaders
  • Pig Workflow
    • Oink: framework around Pig for loading, combining, running, post-processing
    • Everyone I know has one of these
    • Points to an opening for innovation; discussion beginning
    • Something we’re looking at: Ruby DSL for Pig, Piglet 1
    1 http://github.com/ningliang/piglet
  • Counting Big Data
    • standard counts, min, max, std dev
    • How many requests do we serve in a day?
    • What is the average latency? 95% latency?
    • Group by response code. What is the hourly distribution?
    • How many searches happen each day on Twitter?
    • How many unique queries, how many unique users?
    • What is their geographic distribution?
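The counting questions above all reduce to a handful of aggregations: totals, means, percentile latencies, and group-bys on response code. A sketch over toy request records follows; the field names and nearest-rank percentile choice are assumptions.

```python
# Aggregations behind the slide's counting questions, over toy records.
# Field names and the nearest-rank percentile method are assumptions.

from collections import Counter

requests = ([{"code": 200, "ms": m} for m in range(1, 100)]
            + [{"code": 503, "ms": 900}])

total = len(requests)                               # requests served
avg_ms = sum(r["ms"] for r in requests) / total     # average latency

# Nearest-rank 95th-percentile latency:
latencies = sorted(r["ms"] for r in requests)
p95 = latencies[int(0.95 * len(latencies)) - 1]

by_code = Counter(r["code"] for r in requests)      # group by response code
```

At Twitter's scale the same shapes run as Pig GROUP/FOREACH jobs rather than in-memory loops, but the aggregations are identical.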
  • Correlating Big Data
    • How does usage differ for mobile users?
    • How about for users with 3rd party desktop clients?
    • Cohort analyses
    • Site problems: what goes wrong at the same time?
    • Which features get users hooked?
    • Which features do successful users use often?
    • Search corrections, search suggestions
    • A/B testing
    • probabilities, covariance, influence
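Correlation questions like the ones above come down to joint statistics over per-user metrics. A sketch computing the covariance between two usage signals for a toy cohort follows; the cohort data and metric names are invented for illustration.

```python
# Population covariance between two per-user usage signals.
# The cohort values and metric names are illustrative.

def covariance(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

# e.g. tweets/day vs. days-active for a small cohort of users
tweets_per_day = [1.0, 2.0, 3.0, 4.0]
days_active    = [10.0, 12.0, 16.0, 18.0]
cov = covariance(tweets_per_day, days_active)
```

A positive covariance here would be the first hint that a feature correlates with retention, the kind of signal the cohort analyses and A/B tests above chase at full scale.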
  • Research on Big Data
    • What can we tell about a user from their tweets?
    • From the tweets of those they follow?
    • From the tweets of their followers?
    • From the ratio of followers/following?
    • What graph structures lead to successful networks?
    • User reputation
    • prediction, graph analysis, natural language
  • Research on Big Data
    • Sentiment analysis
    • What features get a tweet retweeted?
    • How deep is the corresponding retweet tree?
    • Long-term duplicate detection
    • Machine learning
    • Language detection
    • ... the list goes on.
    • prediction, graph analysis, natural language
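One of the questions above, "how deep is the retweet tree?", is a concrete graph computation: tweets are nodes and "B retweeted A" is an edge from A to B, so the answer is the depth of the tree rooted at the original tweet. A BFS sketch follows; the toy edges are illustrative.

```python
# Depth of a retweet tree via breadth-first search from the original
# tweet. The edge data is illustrative.

from collections import deque

def tree_depth(root, children):
    """Depth of the tree rooted at `root` (root alone has depth 0)."""
    depth = 0
    queue = deque([(root, 0)])
    while queue:
        node, d = queue.popleft()
        depth = max(depth, d)
        for child in children.get(node, []):
            queue.append((child, d + 1))
    return depth

retweets = {"t1": ["t2", "t3"], "t3": ["t4"]}   # t4 RT'd t3, which RT'd t1
depth = tree_depth("t1", retweets)
```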
  • The Twitter Data Lifecycle
    • Data Input
    • Data Storage
    • Data Analysis
    • Data Products: Birdbrain
    1 Community Open Source 2 Twitter Open Source
  • Data Products
    • Ad Hoc Analyses
    • Answer questions to keep the business agile, do research
    • Online Products
    • Name search, other upcoming products
    • Company Dashboard
    • Birdbrain
  • Questions? Follow me at twitter.com/kevinweil
    • P.S. We’re hiring. Help us build the next step: realtime big data analytics.