Hadoop at Twitter
Kevin Weil -- @kevinweil
Analytics Lead, Twitter
The Twitter Data Lifecycle
‣   Data Input
‣   Data Storage
‣   Data Analysis
‣   Data Products
The Twitter Data Lifecycle
‣   Data Input: Scribe, Crane
‣   Data Storage: Elephant Bird, HBase
‣   Data Analysis: Pig, Oink
‣   Data Products: Birdbrain
1 Community Open Source    2 Twitter Open Source (or soon)
My Background
‣   Studied Mathematics and Physics at Harvard, Physics at Stanford
‣   Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data
‣   Cooliris (web media): Hadoop and Pig for analytics, TBs of data
‣   Twitter: Hadoop, Pig, machine learning, visualization, social graph analysis, (soon) PBs of data
The Twitter Data Lifecycle
‣   Data Input: Scribe, Crane
‣   Data Storage
‣   Data Analysis
‣   Data Products
1 Community Open Source    2 Twitter Open Source
What Data?
‣   Two main kinds of raw data
‣       Logs
‣       Tabular data
Logs
‣   Started with syslog-ng
‣   As our volume grew, it didn’t scale
‣   Resources overwhelmed
‣   Lost data
Scribe
‣   Scribe daemon runs locally; reliable in network outage
‣   Nodes only know downstream writer; hierarchical, scalable
‣   Pluggable outputs, per category
(diagram: FE nodes → aggregators → File / HDFS)
Scribe at Twitter
‣   Solved our problem, opened new vistas
‣   Currently 57 different categories logged from multiple sources
‣       FE: Javascript, Ruby on Rails
‣       Middle tier: Ruby on Rails, Scala
‣       Backend: Scala, Java, C++
‣   7 TB/day into HDFS
‣   Log first, ask questions later.
Scribe at Twitter
‣   We’ve contributed to it as we’ve used it1
‣       Improved logging, monitoring, writing to HDFS, compression
‣       Added ZooKeeper-based config
‣   Continuing to work with FB on patches
‣   Also: working with Cloudera to evaluate Flume
1 http://github.com/traviscrawford/scribe
Tabular Data
‣   Most site data is in MySQL
‣       Tweets, users, devices, client applications, etc
‣   Need to move it between MySQL and HDFS
‣       Also between MySQL and HBase, or MySQL and MySQL
‣   Crane: configuration-driven ETL tool
Crane
(architecture diagram: Driver handles configuration/batch management and ZooKeeper registration; data flows Source → Extract → Transform → Load → Sink, carried as protobufs P1/P2 in between)
Crane
‣   Extract
‣       MySQL, HDFS, HBase, Flock, GA, Facebook Insights
‣   Transform
‣       IP/Phone -> Geo, canonicalize dates, cleaning, arbitrary logic
‣   Load
‣       MySQL, Local file, Stdout, HDFS, HBase
‣   ZooKeeper coordination, intelligent date management
‣   Run all the time from multiple servers, self healing
The Twitter Data Lifecycle
‣   Data Input
‣   Data Storage: Elephant Bird, HBase
‣   Data Analysis
‣   Data Products
1 Community Open Source    2 Twitter Open Source
Storage Basics
‣   Incoming data: 7 TB/day
‣   LZO encode everything
‣       Save 3-4x on storage, pay little CPU
‣       Splittable!1
‣       IO-bound jobs ==> 3-4x perf increase
1 http://www.github.com/kevinweil/hadoop-lzo
Elephant Bird
(photo: http://www.flickr.com/photos/jagadish/3072134867/)
1 http://github.com/kevinweil/elephant-bird
Elephant Bird
‣   We have data coming in as protocol buffers via Crane...
‣   Protobufs: codegen for efficient ser-de of data structures
‣   Why shouldn’t we just continue, and codegen more glue?
‣       InputFormats, OutputFormats, Pig LoadFuncs, Pig StoreFuncs, Pig HBaseLoaders (see the sketch below)
‣   Also now does part of this with Thrift, soon Avro
‣       And JSON, W3C Logs
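To give a feel for what the generated glue buys you from Pig, here is a minimal, hypothetical sketch. The loader class, protobuf class, jar names, paths, and fields below are all assumptions for illustration; the actual classes elephant-bird ships or generates may be named differently.

    REGISTER 'elephant-bird.jar';        -- assumed jar name
    REGISTER 'status-protobufs.jar';     -- hypothetical jar holding the codegen'd Status message

    -- Load LZO-compressed, protobuf-encoded records. The loader class is illustrative only.
    statuses = LOAD '/tables/statuses/2010-06-29'
               USING com.twitter.elephantbird.pig.load.LzoProtobufB64LinePigLoader('com.example.proto.Status');

    -- Fields come straight from the protobuf definition; no hand-written parsing code.
    texts = FOREACH statuses GENERATE user_id, text;
    STORE texts INTO '/tmp/status_texts';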
Challenge: Mutable Data
‣   HDFS is write-once: no seek on write, no append (yet)
‣       Logs are easy.
‣       But our tables change.
‣   Handling rapidly changing data in HDFS: not trivial. Options:
‣       Don’t worry about updated data
‣       Refresh entire dataset
‣       Download updates, tombstone old versions of data, ensure jobs only run over current versions of data, occasionally rewrite full dataset
HBase
‣   Has already solved the update problem
‣       Bonus: low-latency query API
‣       Bonus: rich, BigTable-style data model based on column families
HBase at Twitter
‣   Crane loads data directly into HBase
‣       One CF for protobuf bytes, one CF to denormalize columns for indexing or quicker batch access
‣   Processing updates is transparent, so we always have accurate data in HBase
‣   Pig Loader for HBase in Elephant Bird makes integration with existing analyses easy
The Twitter Data Lifecycle
‣   Data Input
‣   Data Storage
‣   Data Analysis: Pig, Oink
‣   Data Products
1 Community Open Source    2 Twitter Open Source
Enter Pig
‣   High-level language
‣   Transformations on sets of records
‣   Process data one step at a time
‣   UDFs are first-class citizens
‣   Easier than SQL?
Why Pig?
‣   Because I bet you can read the following script.
A Real Pig Script
‣   (The script appears as an image on the original slide; a representative sketch appears a few lines below.)
‣   Now, just for fun... the same calculation in vanilla Hadoop MapReduce.
No, seriously.
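The script itself survives only as an image on the slide. Purely to convey the flavor, here is a small Pig Latin sketch of the same kind of job; the paths and field names are made up and this is not the original calculation.

    -- Illustrative only: not the script from the slide.
    logs   = LOAD '/logs/api/2010-06-29' USING PigStorage('\t')
             AS (ts:chararray, user_id:long, action:chararray, latency_ms:int);
    by_act = GROUP logs BY action;
    counts = FOREACH by_act GENERATE group                 AS action,
                                     COUNT(logs)           AS requests,
                                     AVG(logs.latency_ms)  AS avg_latency_ms;
    top    = ORDER counts BY requests DESC;
    STORE top INTO '/tmp/requests_by_action';

A handful of relational statements like these is what the slide contrasts with the equivalent hand-written MapReduce job, which is the point the next slide makes with numbers.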
Pig Democratizes Large-scale Data Analysis
‣   The Pig version is:
‣       5% of the code
‣       5% of the time
‣       Within 30% of the execution time
‣   Innovation increasingly driven from large-scale data analysis
‣   Need fast iteration to understand the right questions
‣   More minds contributing = more value from your data
Pig Examples
‣   Using the HBase Loader (shown as code on the original slide; see the sketch below)
‣   Using the protobuf loaders (shown as code on the original slide; see the earlier Elephant Bird sketch)
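A hedged sketch of the HBase side: the deck refers to the Pig loader for HBase in Elephant Bird, but to keep this illustration to classes that definitely exist it uses Pig's built-in HBaseStorage as a stand-in. The table, column family, and column names are made up (the 'attrs' family standing in for the denormalized CF mentioned earlier).

    -- Illustrative stand-in for the elephant-bird HBase loader.
    -- Depending on Pig version, the load location may be the bare table name
    -- rather than an hbase:// URI.
    users = LOAD 'hbase://users'
            USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('attrs:screen_name attrs:followers')
            AS (screen_name:chararray, followers:long);
    big   = FILTER users BY followers > 1000000;
    DUMP big;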
Pig Workflow
‣   Oink: framework around Pig for loading, combining, running, post-processing
‣       Everyone I know has one of these
‣       Points to an opening for innovation; discussion beginning
‣   Something we’re looking at: Ruby DSL for Pig, Piglet1
1 http://github.com/ningliang/piglet
Counting Big Data
‣   standard counts, min, max, std dev (see the sketch below)
‣   How many requests do we serve in a day?
‣   What is the average latency? 95th-percentile latency?
‣   Group by response code. What is the hourly distribution?
‣   How many searches happen each day on Twitter?
‣   How many unique queries, how many unique users?
‣   What is their geographic distribution?
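As a concrete, entirely illustrative example of the counting workloads above, the following Pig sketch computes a day's request total, average latency, and the hourly distribution by response code. The log path, fields, and timestamp format are assumptions.

    -- Illustrative only: made-up path and fields for a web request log.
    logs = LOAD '/logs/web/2010-06-29' USING PigStorage('\t')
           AS (ts:chararray, response_code:int, latency_ms:int);

    -- Requests and average latency for the day.
    all_logs = GROUP logs ALL;
    totals   = FOREACH all_logs GENERATE COUNT(logs)          AS requests,
                                         AVG(logs.latency_ms) AS avg_latency_ms;

    -- Hourly distribution by response code, assuming ts like '2010-06-29T14:05:11'.
    hours = FOREACH logs GENERATE SUBSTRING(ts, 11, 13) AS hour, response_code;
    by_rc = GROUP hours BY (response_code, hour);
    dist  = FOREACH by_rc GENERATE FLATTEN(group) AS (response_code, hour),
                                   COUNT(hours)   AS requests;

    STORE totals INTO '/tmp/daily_totals';
    STORE dist   INTO '/tmp/hourly_by_response_code';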
Correlating Big Data
‣   probabilities, covariance, influence
‣   How does usage differ for mobile users? (see the sketch below)
‣   How about for users with 3rd party desktop clients?
‣   Cohort analyses
‣   Site problems: what goes wrong at the same time?
‣   Which features get users hooked?
‣   Which features do successful users use often?
‣   Search corrections, search suggestions
‣   A/B testing
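In the same spirit, a hedged sketch of a cohort-style comparison such as "how does usage differ for mobile users?". Both input tables and their fields are made up for illustration.

    -- Illustrative only: made-up tables of per-user activity and of known mobile users.
    activity = LOAD '/tables/user_activity' AS (user_id:long, tweets_per_day:double);
    mobile   = LOAD '/tables/mobile_users'  AS (user_id:long);

    joined = JOIN activity BY user_id LEFT OUTER, mobile BY user_id;
    tagged = FOREACH joined GENERATE activity::tweets_per_day            AS tpd,
                                     (mobile::user_id IS NULL ? 0 : 1)   AS is_mobile;

    -- Compare the two cohorts.
    grp   = GROUP tagged BY is_mobile;
    usage = FOREACH grp GENERATE group           AS is_mobile,
                                 COUNT(tagged)   AS users,
                                 AVG(tagged.tpd) AS avg_tweets_per_day;
    DUMP usage;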
Research on Big Data
‣   prediction, graph analysis, natural language
‣   What can we tell about a user from their tweets?
‣   From the tweets of those they follow?
‣   From the tweets of their followers?
‣   From the ratio of followers/following?
‣   What graph structures lead to successful networks?
‣   User reputation
Research on Big Data
‣   prediction, graph analysis, natural language
‣   Sentiment analysis
‣   What features get a tweet retweeted?
‣   How deep is the corresponding retweet tree?
‣   Long-term duplicate detection
‣   Machine learning
‣   Language detection
‣   ... the list goes on.
The Twitter Data Lifecycle
‣   Data Input
‣   Data Storage
‣   Data Analysis
‣   Data Products: Birdbrain
1 Community Open Source    2 Twitter Open Source
Data Products
‣   Ad Hoc Analyses
‣       Answer questions to keep the business agile, do research
‣   Online Products
‣       Name search, other upcoming products
‣   Company Dashboard
‣       Birdbrain
Questions?
‣   Follow me at twitter.com/kevinweil
‣   P.S. We’re hiring. Help us build the next step: realtime big data analytics.