Hadoop and Pig at Twitter__HadoopSummit2010
 

Hadoop Summit 2010 - Developers Track

Hadoop and Pig at Twitter
Kevin Weil, Twitter

    Presentation Transcript

    • Hadoop at Twitter
      • Kevin Weil -- @kevinweil
      • Analytics Lead, Twitter
    • The Twitter Data Lifecycle
      • Data Input
      • Data Storage
      • Data Analysis
      • Data Products
    • The Twitter Data Lifecycle
      • Data Input: Scribe, Crane
      • Data Storage: Elephant Bird, HBase
      • Data Analysis: Pig, Oink
      • Data Products: Birdbrain
      1 Community Open Source 2 Twitter Open Source (or soon)
    • My Background
      • Studied Mathematics and Physics at Harvard, Physics at Stanford
      • Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data
      • Cooliris (web media): Hadoop and Pig for analytics, TBs of data
      • Twitter : Hadoop, Pig, machine learning, visualization, social graph analysis, (soon) PBs of data
    • The Twitter Data Lifecycle
      • Data Input: Scribe, Crane
      • Data Storage
      • Data Analysis
      • Data Products
      1 Community Open Source 2 Twitter Open Source
    • What Data?
      • Two main kinds of raw data
      • Logs
      • Tabular data
    • Logs
      • Started with syslog-ng
      • As our volume grew, it didn’t scale
      • Resources overwhelmed
      • Lost data
    • Scribe
      • Scribe daemon runs locally; reliable in network outage
      • Nodes only know their downstream writer; hierarchical, scalable
      • Pluggable outputs, per category
      (Diagram: FE nodes → aggregator nodes → HDFS files)
    • Scribe at Twitter
      • Solved our problem, opened new vistas
      • Currently 57 different categories logged from multiple sources
      • FE: Javascript, Ruby on Rails
      • Middle tier: Ruby on Rails, Scala
      • Backend: Scala, Java, C++
      • 7 TB/day into HDFS
      • Log first, ask questions later.
    • Scribe at Twitter
      • We’ve contributed to it as we’ve used it 1
      • Improved logging, monitoring, writing to HDFS, compression
      • Added ZooKeeper-based config
      • Continuing to work with FB on patches
      • Also: working with Cloudera to evaluate Flume
      1 http://github.com/traviscrawford/scribe
    • Tabular Data
      • Most site data is in MySQL
      • Tweets, users, devices, client applications, etc
      • Need to move it between MySQL and HDFS
              • Also between MySQL and HBase, or MySQL and MySQL
              • Crane: configuration driven ETL tool
    • Crane
      (Diagram: driver with configuration/batch management; Source → Extract → Transform (Protobuf P1 → Protobuf P2) → Load → Sink; ZooKeeper registration)
    • Crane
      • Extract
      • MySQL, HDFS, HBase, Flock, GA, Facebook Insights
      • Transform
      • IP/Phone -> Geo, canonicalize dates, cleaning, arbitrary logic
      • Load
      • MySQL, Local file, Stdout, HDFS, HBase
      • ZooKeeper coordination, intelligent date management
      • Run all the time from multiple servers, self healing
    • The Twitter Data Lifecycle
      • Data Input
      • Data Storage: Elephant Bird, HBase
      • Data Analysis
      • Data Products
      1 Community Open Source 2 Twitter Open Source
    • Storage Basics
      • Incoming data: 7 TB/day
      • LZO encode everything
      • Save 3-4x on storage, pay little CPU
      • Splittable! 1
      • IO-bound jobs ==> 3-4x perf increase
      1 http://www.github.com/kevinweil/hadoop-lzo
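      Reading LZO-compressed data from Pig might look like the following sketch; the path and schema are illustrative, and the loader class assumes the elephant-bird/hadoop-lzo jars described on the next slides are registered.

      ```pig
      -- Sketch: loading LZO-compressed log lines in Pig.
      -- Path, schema, and jar name are illustrative, not Twitter's actual setup.
      REGISTER 'elephant-bird.jar';
      raw = LOAD '/logs/web/2010-06-29'
            USING com.twitter.elephantbird.pig.load.LzoTextLoader()
            AS (line: chararray);
      ```

      Because LZO files are splittable (via the indexer in hadoop-lzo), each compressed file can still be processed by many map tasks in parallel.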
    • Elephant Bird
      (Photo: http://www.flickr.com/photos/jagadish/3072134867/)
      1 http://github.com/kevinweil/elephant-bird
    • Elephant Bird
      • We have data coming in as protocol buffers via Crane...
      • Protobufs: codegen for efficient ser-de of data structures
      • Why shouldn’t we just continue, and codegen more glue?
      • InputFormats, OutputFormats, Pig LoadFuncs, Pig StoreFuncs, Pig HBaseLoaders
      • Also now does part of this with Thrift, soon Avro
      • And JSON, W3C Logs
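      A sketch of what loading protobuf-encoded records with a generated elephant-bird LoadFunc might look like; the `Status` message, loader class name, and path are hypothetical stand-ins for whatever the codegen emits in a given setup.

      ```pig
      -- Sketch: loading protobuf records via a codegen'd elephant-bird loader.
      -- Message type, class name, and path are hypothetical.
      REGISTER 'elephant-bird.jar';
      statuses = LOAD '/tables/statuses/2010-06-29'
                 USING com.twitter.elephantbird.pig.load.LzoProtobufB64LinePigLoader('Status');
      ```

      The point of the codegen approach is that this glue (InputFormats, LoadFuncs, etc.) never has to be written by hand for each new protobuf message.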
    • Challenge: Mutable Data
      • HDFS is write-once: no seek on write, no append (yet)
      • Logs are easy.
      • But our tables change.
      • Handling rapidly changing data in HDFS: not trivial.
      • Don’t worry about updated data
      • Refresh entire dataset
      • Download updates, tombstone old versions of data, ensure jobs only run over current versions of data, occasionally rewrite full dataset
    • HBase
      • Has already solved the update problem
      • Bonus: low-latency query API
      • Bonus: rich, BigTable-style data model based on column families
    • HBase at Twitter
      • Crane loads data directly into HBase
      • One CF for protobuf bytes, one CF to denormalize columns for indexing or quicker batch access
      • Processing updates transparent, so we always have accurate data in HBase
      • Pig Loader for HBase in Elephant Bird makes integration with existing analyses easy
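      A sketch of what reading an HBase table from Pig via the elephant-bird loader might look like; the table name, column family, and loader signature here are illustrative rather than Twitter's actual schema.

      ```pig
      -- Sketch: reading HBase columns from Pig with the elephant-bird loader.
      -- Table, column family, and columns are illustrative.
      REGISTER 'elephant-bird.jar';
      users = LOAD 'hbase://users'
              USING com.twitter.elephantbird.pig.load.HBaseLoader('meta:screen_name meta:created_at')
              AS (screen_name: chararray, created_at: chararray);
      ```

      Loading straight from HBase means existing Pig analyses see the latest row versions, so the mutable-data problem stays inside HBase.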
    • The Twitter Data Lifecycle
      • Data Input
      • Data Storage
      • Data Analysis: Pig, Oink
      • Data Products
      1 Community Open Source 2 Twitter Open Source
    • Enter Pig
      • High level language
      • Transformations on sets of records
      • Process data one step at a time
      • UDFs are first-class citizens
      • Easier than SQL?
    • Why Pig?
      • Because I bet you can read the following script.
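      The script itself appeared as a slide image that isn't in this transcript; a representative sketch of the kind of script shown (all file names, schemas, and relations are illustrative):

      ```pig
      -- Representative sketch: top 5 most active tweeters aged 18-25.
      -- All names and paths are illustrative.
      users  = LOAD 'users.tsv'  AS (user_id: long, age: int);
      tweets = LOAD 'tweets.tsv' AS (user_id: long, text: chararray);
      young  = FILTER users BY age >= 18 AND age <= 25;
      joined = JOIN young BY user_id, tweets BY user_id;
      grouped = GROUP joined BY young::user_id;
      counts  = FOREACH grouped GENERATE group AS user_id, COUNT(joined) AS n;
      ordered = ORDER counts BY n DESC;
      top5    = LIMIT ordered 5;
      STORE top5 INTO 'top_tweeters';
      ```

      Each statement is one transformation on a set of records, which is what makes the script readable step by step.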
    • A Real Pig Script
      • Now, just for fun... the same calculation in vanilla Hadoop MapReduce.
    • No, seriously.
    • Pig Democratizes Large-scale Data Analysis
      • The Pig version is:
        • 5% of the code
        • 5% of the time
        • Within 30% of the execution time.
        • Innovation increasingly driven from large-scale data analysis
        • Need fast iteration to understand the right questions
        • More minds contributing = more value from your data
    • Pig Examples
      • Using the HBase Loader
      • Using the protobuf loaders
    • Pig Workflow
      • Oink: framework around Pig for loading, combining, running, post-processing
      • Everyone I know has one of these
      • Points to an opening for innovation; discussion beginning
      • Something we’re looking at: Ruby DSL for Pig, Piglet 1
      1 http://github.com/ningliang/piglet
    • Counting Big Data
      • standard counts, min, max, std dev
      • How many requests do we serve in a day?
      • What is the average latency? 95% latency?
      • Group by response code. What is the hourly distribution?
      • How many searches happen each day on Twitter?
      • How many unique queries, how many unique users?
      • What is their geographic distribution?
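      Questions like the hourly distribution by response code reduce to a few lines of Pig; this is a hedged sketch with a hypothetical log schema, not the production script:

      ```pig
      -- Sketch: hourly request counts and average latency by response code.
      -- Log path and schema are hypothetical.
      logs = LOAD '/logs/web' AS (ts: chararray, code: int, latency_ms: long);
      by_hour = GROUP logs BY (SUBSTRING(ts, 0, 13), code);
      stats = FOREACH by_hour GENERATE
              FLATTEN(group) AS (hour, code),
              COUNT(logs) AS requests,
              AVG(logs.latency_ms) AS avg_latency_ms;
      ```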
    • Correlating Big Data
      • How does usage differ for mobile users?
      • How about for users with 3rd party desktop clients?
      • Cohort analyses
      • Site problems: what goes wrong at the same time?
      • Which features get users hooked?
      • Which features do successful users use often?
      • Search corrections, search suggestions
      • A/B testing
      • probabilities, covariance, influence
    • Research on Big Data
      • What can we tell about a user from their tweets?
      • From the tweets of those they follow?
      • From the tweets of their followers?
      • From the ratio of followers/following?
      • What graph structures lead to successful networks?
      • User reputation
      • prediction, graph analysis, natural language
    • Research on Big Data
      • Sentiment analysis
      • What features get a tweet retweeted?
      • How deep is the corresponding retweet tree?
      • Long-term duplicate detection
      • Machine learning
      • Language detection
      • ... the list goes on.
      • prediction, graph analysis, natural language
    • The Twitter Data Lifecycle
      • Data Input
      • Data Storage
      • Data Analysis
      • Data Products: Birdbrain
      1 Community Open Source 2 Twitter Open Source
    • Data Products
      • Ad Hoc Analyses
      • Answer questions to keep the business agile, do research
      • Online Products
      • Name search, other upcoming products
      • Company Dashboard
      • Birdbrain
    • Questions? Follow me at twitter.com/kevinweil
      • P.S. We’re hiring. Help us build the next step: realtime big data analytics.