Hadoop at Twitter (Hadoop Summit 2010)
 

Statistics

Total Views: 32,568 (29,173 on SlideShare, 3,395 embedded)
Likes: 66
Downloads: 955
Comments: 3



Upload Details

Uploaded as Apple Keynote

Usage Rights

© All Rights Reserved

3 Comments

  • Very good, Thanks for sharing.
  • YDN Theater: Hadoop2010: Hadoop and Pig at Twitter
    > http://developer.yahoo.net/blogs/theater/archives/2010/07/hadoop_and_pig_at_twitter.html

  • I'm sure I saw a video link of this talk [or very similar] somewhere, but can't find it now... do you have a pointer?

Presentation Transcript

  • Hadoop at Twitter Kevin Weil -- @kevinweil Analytics Lead, Twitter TM
  • The Twitter Data Lifecycle ‣ Data Input ‣ Data Storage ‣ Data Analysis ‣ Data Products
  • The Twitter Data Lifecycle ‣ Data Input: Scribe, Crane ‣ Data Storage: Elephant Bird, HBase ‣ Data Analysis: Pig, Oink ‣ Data Products: Birdbrain 1 Community Open Source 2 Twitter Open Source (or soon)
  • My Background ‣ Studied Mathematics and Physics at Harvard, Physics at Stanford ‣ Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data ‣ Cooliris (web media): Hadoop and Pig for analytics, TBs of data ‣ Twitter: Hadoop, Pig, machine learning, visualization, social graph analysis, (soon) PBs of data
  • The Twitter Data Lifecycle ‣ Data Input: Scribe, Crane ‣ Data Storage ‣ Data Analysis ‣ Data Products 1 Community Open Source 2 Twitter Open Source
  • What Data? ‣ Two main kinds of raw data ‣ Logs ‣ Tabular data
  • Logs ‣ Started with syslog-ng ‣ As our volume grew, it didn’t scale ‣ Resources overwhelmed ‣ Lost data
  • Scribe ‣ Scribe daemon runs locally; reliable in network outages ‣ Nodes only know their downstream writer; hierarchical, scalable ‣ Pluggable outputs, per category [Diagram: FE nodes → aggregators → File / HDFS]
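The Scribe design above can be sketched in a few lines: each node buffers per-category messages locally and only knows its single downstream writer, so a network outage just leaves messages buffered until the next flush. This is an illustrative sketch, not Scribe's actual API; the class and method names are invented.

```python
from collections import defaultdict

class ScribeLikeLogger:
    """Minimal sketch of Scribe-style category routing (illustrative only)."""

    def __init__(self, downstream=None):
        self.downstream = downstream          # one hop; a hierarchy is a chain of these
        self.buffers = defaultdict(list)      # local buffer survives network outages

    def log(self, category, message):
        self.buffers[category].append(message)

    def flush(self):
        """Forward buffered messages downstream; keep them if forwarding fails."""
        for category, messages in list(self.buffers.items()):
            try:
                if self.downstream is not None:
                    for m in messages:
                        self.downstream.log(category, m)
                self.buffers[category] = []   # drained only on success
            except ConnectionError:
                pass                          # retry on next flush

# Frontend node forwarding to an aggregator node
agg = ScribeLikeLogger()
fe = ScribeLikeLogger(downstream=agg)
fe.log("web_events", "GET /home 200")
fe.flush()
```

The pluggable-output idea corresponds to what `downstream` would be in the real system: another Scribe node, a local file, or an HDFS writer, selected per category.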
  • Scribe at Twitter ‣ Solved our problem, opened new vistas ‣ Currently 57 different categories logged from multiple sources ‣ FE: JavaScript, Ruby on Rails ‣ Middle tier: Ruby on Rails, Scala ‣ Backend: Scala, Java, C++ ‣ 7 TB/day into HDFS ‣ Log first, ask questions later.
  • Scribe at Twitter ‣ We’ve contributed to it as we’ve used it¹ ‣ Improved logging, monitoring, writing to HDFS, compression ‣ Added ZooKeeper-based config ‣ Continuing to work with FB on patches ‣ Also: working with Cloudera to evaluate Flume ¹ http://github.com/traviscrawford/scribe
  • Tabular Data ‣ Most site data is in MySQL ‣ Tweets, users, devices, client applications, etc ‣ Need to move it between MySQL and HDFS ‣ Also between MySQL and HBase, or MySQL and MySQL ‣ Crane: configuration driven ETL tool
  • Crane [Architecture diagram: Driver handles configuration/batch management and ZooKeeper registration; data flows Source → Extract → Transform → Load → Sink as protobuf records P1, P2]
  • Crane ‣ Extract ‣ MySQL, HDFS, HBase, Flock, GA, Facebook Insights ‣ Transform ‣ IP/Phone -> Geo, canonicalize dates, cleaning, arbitrary logic ‣ Load ‣ MySQL, Local file, Stdout, HDFS, HBase ‣ ZooKeeper coordination, intelligent date management ‣ Run all the time from multiple servers, self healing
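The configuration-driven extract/transform/load flow above can be sketched as a tiny pipeline. Crane's real configuration format and connectors are not shown in the talk, so the config dict, field names, and transform here are invented stand-ins for the pattern: a source, a chain of transforms, and a sink, all chosen by configuration rather than code.

```python
# Hypothetical sketch of a configuration-driven ETL step in the spirit of Crane.

def extract_rows(source):
    # Stand-in for a MySQL/HDFS/HBase extractor
    return iter(source)

def canonicalize(row):
    # Example transform (assumed row shape): normalize a date field
    row = dict(row)
    row["date"] = row["date"].replace("/", "-")
    return row

def load_rows(rows, sink):
    # Stand-in for a MySQL/HDFS/HBase/stdout loader
    sink.extend(rows)

config = {
    "source": [{"id": 1, "date": "2010/06/29"}],
    "transforms": [canonicalize],
    "sink": [],
}

rows = extract_rows(config["source"])
for t in config["transforms"]:
    rows = (t(r) for r in rows)
load_rows(rows, config["sink"])
# config["sink"] now holds [{"id": 1, "date": "2010-06-29"}]
```

The ZooKeeper coordination and self-healing the slide mentions would sit around this loop (leader election, tracking which date partitions have already been loaded), not inside it.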
  • The Twitter Data Lifecycle ‣ Data Input ‣ Data Storage: Elephant Bird, HBase ‣ Data Analysis ‣ Data Products 1 Community Open Source 2 Twitter Open Source
  • Storage Basics ‣ Incoming data: 7 TB/day ‣ LZO encode everything ‣ Save 3-4x on storage, pay little CPU ‣ Splittable!¹ ‣ IO-bound jobs ==> 3-4x perf increase ¹ http://www.github.com/kevinweil/hadoop-lzo
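The splittability point deserves a concrete picture: hadoop-lzo makes LZO files splittable by compressing data in independent blocks and keeping an index of block offsets, so a mapper can start decompressing at any block boundary. The toy format below illustrates the idea only; it uses zlib as a stand-in for LZO, and the framing and index layout are invented.

```python
import zlib

def write_blocks(records, block_size=2):
    """Compress records in independent blocks; return the data and an offset index."""
    blocks, index, offset = [], [], 0
    for i in range(0, len(records), block_size):
        blob = zlib.compress("\n".join(records[i:i + block_size]).encode())
        index.append(offset)
        blocks.append(blob)
        offset += len(blob)
    return b"".join(blocks), index

def read_block(data, index, n):
    """Decompress block n without touching any earlier block."""
    end = index[n + 1] if n + 1 < len(index) else len(data)
    return zlib.decompress(data[index[n]:end]).decode().split("\n")

records = ["r0", "r1", "r2", "r3", "r4"]
data, index = write_blocks(records)
# A mapper assigned the second split decompresses it independently of block 0:
second_split = read_block(data, index, 1)   # ['r2', 'r3']
```

A whole-file stream compressor (like plain gzip) has no such boundaries, which is why it forces one mapper per file; block-level compression plus an index restores the parallelism that makes the 3-4x speedup on IO-bound jobs possible.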
  • Elephant Bird¹ (photo: http://www.flickr.com/photos/jagadish/3072134867/) ¹ http://github.com/kevinweil/elephant-bird
  • Elephant Bird ‣ We have data coming in as protocol buffers via Crane... ‣ Protobufs: codegen for efficient ser-de of data structures ‣ Why shouldn’t we just continue, and codegen more glue? ‣ InputFormats, OutputFormats, Pig LoadFuncs, Pig StoreFuncs, Pig HBaseLoaders ‣ Also now does part of this with Thrift, soon Avro ‣ And JSON, W3C Logs
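The "codegen more glue" idea can be sketched as follows: given a record schema, generate the per-type boilerplate mechanically instead of writing it by hand. Everything here is an invented illustration; Elephant Bird actually generates Hadoop InputFormats/OutputFormats and Pig Load/StoreFuncs from protobuf definitions, not the toy JSON ser-de shown.

```python
import json

def make_record_type(name, fields):
    """'Code-generate' a record class with serialize/deserialize glue from a schema."""
    def __init__(self, **kwargs):
        for f in fields:
            setattr(self, f, kwargs.get(f))

    def serialize(self):
        return json.dumps({f: getattr(self, f) for f in fields})

    @classmethod
    def deserialize(cls, blob):
        return cls(**json.loads(blob))

    return type(name, (), {
        "__init__": __init__,
        "serialize": serialize,
        "deserialize": deserialize,
    })

# One schema in, one fully wired record type out -- no hand-written glue per type.
Tweet = make_record_type("Tweet", ["user_id", "text"])
t = Tweet(user_id=42, text="hello")
roundtrip = Tweet.deserialize(t.serialize())
```

The payoff is the same as in Elephant Bird: adding a new record type means adding a schema, and all the surrounding formats and loaders come for free.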
  • Challenge: Mutable Data ‣ HDFS is write-once: no seek on write, no append (yet) ‣ Logs are easy. ‣ But our tables change. ‣ Handling rapidly changing data in HDFS: not trivial. ‣ Don’t worry about updated data ‣ Refresh entire dataset ‣ Download updates, tombstone old versions of data, ensure jobs only run over current versions of data, occasionally rewrite full dataset
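The tombstone approach above can be made concrete: every update lands as a new append-only record, deletes land as tombstones, and jobs resolve to the newest non-deleted version per key (with an occasional full rewrite to compact away the history). The record shape and field names below are illustrative, not Twitter's actual format.

```python
def resolve_current(records):
    """Keep the newest version of each key; drop keys whose newest record
    is a tombstone."""
    latest = {}
    for rec in records:
        key = rec["key"]
        if key not in latest or rec["version"] > latest[key]["version"]:
            latest[key] = rec
    return {k: r for k, r in latest.items() if not r.get("deleted", False)}

# Append-only history in HDFS: updates are new records, deletes are tombstones.
history = [
    {"key": "user:1", "version": 1, "name": "old name"},
    {"key": "user:1", "version": 2, "name": "new name"},
    {"key": "user:2", "version": 1, "name": "gone"},
    {"key": "user:2", "version": 2, "deleted": True},   # tombstone
]
current = resolve_current(history)
# current == {"user:1": {"key": "user:1", "version": 2, "name": "new name"}}
```

This is exactly the bookkeeping that HBase, on the next slide, does for you.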
  • HBase ‣ Has already solved the update problem ‣ Bonus: low-latency query API ‣ Bonus: rich, BigTable-style data model based on column families
  • HBase at Twitter ‣ Crane loads data directly into HBase ‣ One CF for protobuf bytes, one CF to denormalize columns for indexing or quicker batch access ‣ Processing updates transparent, so we always have accurate data in HBase ‣ Pig Loader for HBase in Elephant Bird makes integration with existing analyses easy
  • The Twitter Data Lifecycle ‣ Data Input ‣ Data Storage ‣ Data Analysis: Pig, Oink ‣ Data Products 1 Community Open Source 2 Twitter Open Source
  • Enter Pig ‣ High level language ‣ Transformations on sets of records ‣ Process data one step at a time ‣ UDFs are first-class citizens ‣ Easier than SQL?
  • Why Pig? ‣ Because I bet you can read the following script.
  • A Real Pig Script ‣ Now, just for fun... the same calculation in vanilla Hadoop MapReduce.
  • No, seriously.
  • Pig Democratizes Large-scale Data Analysis ‣ The Pig version is: ‣ 5% of the code ‣ 5% of the time ‣ Within 30% of the execution time. ‣ Innovation increasingly driven from large-scale data analysis ‣ Need fast iteration to understand the right questions ‣ More minds contributing = more value from your data
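The actual Pig script and its MapReduce equivalent were shown on slides and aren't in this transcript, so here is a stand-in for the contrast being quantified: the same aggregation (count records per key) written once as a declarative whole-dataset transformation, Pig-style, and once as explicit map, shuffle, and reduce phases. Plain Python on one machine, purely to show the code-size gap.

```python
from collections import defaultdict
from itertools import groupby

records = ["a", "b", "a", "c", "a", "b"]

# Pig-style: declare a transformation over the whole dataset in one step.
pig_style = {k: len(list(g)) for k, g in groupby(sorted(records))}

# MapReduce-style: spell out every phase yourself.
def mapper(record):
    yield (record, 1)

def reducer(key, values):
    return key, sum(values)

shuffle = defaultdict(list)
for rec in records:
    for k, v in mapper(rec):
        shuffle[k].append(v)
mr_style = dict(reducer(k, vs) for k, vs in shuffle.items())

assert pig_style == mr_style == {"a": 3, "b": 2, "c": 1}
```

Even in this toy, the declarative version is one line against roughly ten, and real Hadoop jobs add driver classes, key/value types, and job configuration on top; that is the 5%-of-the-code claim in miniature.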
  • Pig Examples ‣ Using the HBase Loader ‣ Using the protobuf loaders
  • Pig Workflow ‣ Oink: framework around Pig for loading, combining, running, post-processing ‣ Everyone I know has one of these ‣ Points to an opening for innovation; discussion beginning ‣ Something we’re looking at: Ruby DSL for Pig, Piglet¹ ¹ http://github.com/ningliang/piglet
  • Counting Big Data ‣ standard counts, min, max, std dev ‣ How many requests do we serve in a day? ‣ What is the average latency? 95% latency? ‣ Group by response code. What is the hourly distribution? ‣ How many searches happen each day on Twitter? ‣ How many unique queries, how many unique users? ‣ What is their geographic distribution?
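One of the counting questions above, sketched concretely: 95th-percentile latency. The nearest-rank method below is one common definition (others interpolate between neighbors); in a Pig job the same logic would run per group after a GROUP BY.

```python
def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the
    data at or below it."""
    ordered = sorted(values)
    rank = max(1, -(-len(ordered) * p // 100))   # ceil(len * p / 100)
    return ordered[rank - 1]

# Illustrative request latencies in milliseconds (made-up numbers)
latencies_ms = [12, 15, 14, 250, 18, 16, 13, 17, 900, 14]
p50 = percentile(latencies_ms, 50)   # 15
p95 = percentile(latencies_ms, 95)   # 900
```

Note how the tail percentile surfaces the outliers that the average hides, which is exactly why the slide asks for the 95% latency alongside the mean.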
  • Correlating Big Data ‣ probabilities, covariance, influence ‣ How does usage differ for mobile users? ‣ How about for users with 3rd party desktop clients? ‣ Cohort analyses ‣ Site problems: what goes wrong at the same time? ‣ Which features get users hooked? ‣ Which features do successful users use often? ‣ Search corrections, search suggestions ‣ A/B testing
  • Research on Big Data ‣ prediction, graph analysis, natural language ‣ What can we tell about a user from their tweets? ‣ From the tweets of those they follow? ‣ From the tweets of their followers? ‣ From the ratio of followers/following? ‣ What graph structures lead to successful networks? ‣ User reputation
  • Research on Big Data ‣ prediction, graph analysis, natural language ‣ Sentiment analysis ‣ What features get a tweet retweeted? ‣ How deep is the corresponding retweet tree? ‣ Long-term duplicate detection ‣ Machine learning ‣ Language detection ‣ ... the list goes on.
  • The Twitter Data Lifecycle ‣ Data Input ‣ Data Storage ‣ Data Analysis ‣ Data Products: Birdbrain 1 Community Open Source 2 Twitter Open Source
  • Data Products ‣ Ad Hoc Analyses ‣ Answer questions to keep the business agile, do research ‣ Online Products ‣ Name search, other upcoming products ‣ Company Dashboard ‣ Birdbrain
  • Questions? Follow me at twitter.com/kevinweil ‣ P.S. We’re hiring. Help us build the next step: realtime big data analytics. TM