Hadoop at Twitter (Hadoop Summit 2010)

Comments

  • http://dbmanagement.info/Tutorials/Hadoop.htm
  • Very good, Thanks for sharing.
  • YDN Theater: Hadoop2010: Hadoop and Pig at Twitter
    > http://developer.yahoo.net/blogs/theater/archives/2010/07/hadoop_and_pig_at_twitter.html

  • I'm sure I saw a video link of this talk [or very similar] somewhere, but can't find it now... do you have a pointer?

  • Transcript

    • 1. Hadoop at Twitter -- Kevin Weil (@kevinweil), Analytics Lead, Twitter
    • 2. The Twitter Data Lifecycle ‣ Data Input ‣ Data Storage ‣ Data Analysis ‣ Data Products
    • 3. The Twitter Data Lifecycle ‣ Data Input: Scribe, Crane ‣ Data Storage: Elephant Bird, HBase ‣ Data Analysis: Pig, Oink ‣ Data Products: Birdbrain [1 = Community Open Source, 2 = Twitter Open Source (or soon)]
    • 4. My Background ‣ Studied Mathematics and Physics at Harvard, Physics at Stanford ‣ Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data ‣ Cooliris (web media): Hadoop and Pig for analytics, TBs of data ‣ Twitter: Hadoop, Pig, machine learning, visualization, social graph analysis, (soon) PBs of data
    • 5. The Twitter Data Lifecycle ‣ Data Input: Scribe, Crane ‣ Data Storage ‣ Data Analysis ‣ Data Products [1 = Community Open Source, 2 = Twitter Open Source]
    • 6. What Data? ‣ Two main kinds of raw data ‣ Logs ‣ Tabular data
    • 7. Logs ‣ Started with syslog-ng ‣ As our volume grew, it didn’t scale
    • 8. Logs ‣ Started with syslog-ng ‣ As our volume grew, it didn’t scale ‣ Resources overwhelmed ‣ Lost data
    • 9. Scribe ‣ Scribe daemon runs locally; reliable in network outage ‣ Nodes only know downstream writer; hierarchical, scalable ‣ Pluggable outputs, per category [diagram: FE nodes → aggregators → File / HDFS]
    • 10. Scribe at Twitter ‣ Solved our problem, opened new vistas ‣ Currently 57 different categories logged from multiple sources ‣ FE: Javascript, Ruby on Rails ‣ Middle tier: Ruby on Rails, Scala ‣ Backend: Scala, Java, C++ ‣ 7 TB/day into HDFS ‣ Log first, ask questions later.
    • 11. Scribe at Twitter ‣ We’ve contributed to it as we’ve used it (http://github.com/traviscrawford/scribe) ‣ Improved logging, monitoring, writing to HDFS, compression ‣ Added ZooKeeper-based config ‣ Continuing to work with FB on patches ‣ Also: working with Cloudera to evaluate Flume
    • 12. Tabular Data ‣ Most site data is in MySQL ‣ Tweets, users, devices, client applications, etc ‣ Need to move it between MySQL and HDFS ‣ Also between MySQL and HBase, or MySQL and MySQL ‣ Crane: configuration driven ETL tool
    • 13. Crane [architecture diagram: a Driver handles configuration/batch management and ZooKeeper registration; data flows Source → Extract → Transform (Protobuf P1 → Protobuf P2) → Load → Sink]
    • 14. Crane ‣ Extract ‣ MySQL, HDFS, HBase, Flock, GA, Facebook Insights
    • 15. Crane ‣ Extract ‣ MySQL, HDFS, HBase, Flock, GA, Facebook Insights ‣ Transform ‣ IP/Phone -> Geo, canonicalize dates, cleaning, arbitrary logic
    • 16. Crane ‣ Extract ‣ MySQL, HDFS, HBase, Flock, GA, Facebook Insights ‣ Transform ‣ IP/Phone -> Geo, canonicalize dates, cleaning, arbitrary logic ‣ Load ‣ MySQL, Local file, Stdout, HDFS, HBase
    • 17. Crane ‣ Extract ‣ MySQL, HDFS, HBase, Flock, GA, Facebook Insights ‣ Transform ‣ IP/Phone -> Geo, canonicalize dates, cleaning, arbitrary logic ‣ Load ‣ MySQL, Local file, Stdout, HDFS, HBase ‣ ZooKeeper coordination, intelligent date management ‣ Run all the time from multiple servers, self healing
    • 18. The Twitter Data Lifecycle ‣ Data Input ‣ Data Storage: Elephant Bird, HBase ‣ Data Analysis ‣ Data Products [1 = Community Open Source, 2 = Twitter Open Source]
    • 19. Storage Basics ‣ Incoming data: 7 TB/day ‣ LZO encode everything ‣ Save 3-4x on storage, pay little CPU ‣ Splittable! (http://www.github.com/kevinweil/hadoop-lzo) ‣ IO-bound jobs ==> 3-4x perf increase (an illustrative Pig-over-LZO sketch follows the transcript)
    • 20. Elephant Bird (http://github.com/kevinweil/elephant-bird); photo credit: http://www.flickr.com/photos/jagadish/3072134867/
    • 21. Elephant Bird ‣ We have data coming in as protocol buffers via Crane...
    • 22. Elephant Bird ‣ We have data coming in as protocol buffers via Crane... ‣ Protobufs: codegen for efficient ser-de of data structures
    • 23. Elephant Bird ‣ We have data coming in as protocol buffers via Crane... ‣ Protobufs: codegen for efficient ser-de of data structures ‣ Why shouldn’t we just continue, and codegen more glue?
    • 24. Elephant Bird ‣ We have data coming in as protocol buffers via Crane... ‣ Protobufs: codegen for efficient ser-de of data structures ‣ Why shouldn’t we just continue, and codegen more glue? ‣ InputFormats, OutputFormats, Pig LoadFuncs, Pig StoreFuncs, Pig HBaseLoaders
    • 25. Elephant Bird ‣ We have data coming in as protocol buffers via Crane... ‣ Protobufs: codegen for efficient ser-de of data structures ‣ Why shouldn’t we just continue, and codegen more glue? ‣ InputFormats, OutputFormats, Pig LoadFuncs, Pig StoreFuncs, Pig HBaseLoaders ‣ Also now does part of this with Thrift, soon Avro ‣ And JSON, W3C Logs
    • 26. Challenge: Mutable Data ‣ HDFS is write-once: no seek on write, no append (yet) ‣ Logs are easy. ‣ But our tables change.
    • 27. Challenge: Mutable Data ‣ HDFS is write-once: no seek on write, no append (yet) ‣ Logs are easy. ‣ But our tables change. ‣ Handling rapidly changing data in HDFS: not trivial. ‣ Don’t worry about updated data ‣ Refresh entire dataset ‣ Download updates, tombstone old versions of data, ensure jobs only run over current versions of data, occasionally rewrite full dataset
    • 28. Challenge: Mutable Data ‣ HDFS is write-once: no seek on write, no append (yet) ‣ Logs are easy. ‣ But our tables change. ‣ Handling changing data in HDFS: not trivial.
    • 29. HBase ‣ Has already solved the update problem ‣ Bonus: low-latency query API ‣ Bonus: rich, BigTable-style data model based on column families
    • 30. HBase at Twitter ‣ Crane loads data directly into HBase ‣ One CF for protobuf bytes, one CF to denormalize columns for indexing or quicker batch access ‣ Processing updates transparent, so we always have accurate data in HBase ‣ Pig Loader for HBase in Elephant Bird makes integration with existing analyses easy
    • 31. HBase at Twitter ‣ Crane loads data directly into HBase ‣ One CF for protobuf bytes, one CF to denormalize columns for indexing or quicker batch access ‣ Processing updates transparent, so we always have accurate data in HBase ‣ Pig Loader for HBase in Elephant Bird
    • 32. The Twitter Data Lifecycle ‣ Data Input ‣ Data Storage ‣ Data Analysis: Pig, Oink ‣ Data Products [1 = Community Open Source, 2 = Twitter Open Source]
    • 33. Enter Pig ‣ High level language ‣ Transformations on sets of records ‣ Process data one step at a time ‣ UDFs are first-class citizens ‣ Easier than SQL?
    • 34. Why Pig? ‣ Because I bet you can read the following script. (The script itself is an image in the deck; an illustrative sketch in the same spirit follows the transcript.)
    • 35. A Real Pig Script ‣ Now, just for fun... the same calculation in vanilla Hadoop MapReduce.
    • 36. No, seriously.
    • 37. Pig Democratizes Large-scale Data Analysis ‣ The Pig version is: ‣ 5% of the code ‣ 5% of the time ‣ Within 30% of the execution time. ‣ Innovation increasingly driven from large-scale data analysis ‣ Need fast iteration to understand the right questions ‣ More minds contributing = more value from your data
    • 38. Pig Examples ‣ Using the HBase Loader ‣ Using the protobuf loaders (both examples are images in the deck; illustrative sketches of each follow the transcript)
    • 39. Pig Workflow ‣ Oink: framework around Pig for loading, combining, running, post-processing ‣ Everyone I know has one of these ‣ Points to an opening for innovation; discussion beginning ‣ Something we’re looking at: Ruby DSL for Pig, Piglet (http://github.com/ningliang/piglet)
    • 40. Counting Big Data ‣ standard counts, min, max, std dev ‣ How many requests do we serve in a day? ‣ What is the average latency? 95% latency? ‣ Group by response code. What is the hourly distribution? ‣ How many searches happen each day on Twitter? ‣ How many unique queries, how many unique users? ‣ What is their geographic distribution? (an illustrative counting sketch follows the transcript)
    • 41. Correlating Big Data ‣ probabilities, covariance, influence ‣ How does usage differ for mobile users? ‣ How about for users with 3rd party desktop clients? ‣ Cohort analyses ‣ Site problems: what goes wrong at the same time? ‣ Which features get users hooked? ‣ Which features do successful users use often? ‣ Search corrections, search suggestions ‣ A/B testing
    • 42. Research on Big Data ‣ prediction, graph analysis, natural language ‣ What can we tell about a user from their tweets? ‣ From the tweets of those they follow? ‣ From the tweets of their followers? ‣ From the ratio of followers/following? ‣ What graph structures lead to successful networks? ‣ User reputation
    • 43. Research on Big Data ‣ prediction, graph analysis, natural language ‣ Sentiment analysis ‣ What features get a tweet retweeted? ‣ How deep is the corresponding retweet tree? ‣ Long-term duplicate detection ‣ Machine learning ‣ Language detection ‣ ... the list goes on.
    • 44. The Twitter Data Lifecycle ‣ Data Input ‣ Data Storage ‣ Data Analysis ‣ Data Products: Birdbrain [1 = Community Open Source, 2 = Twitter Open Source]
    • 45. Data Products ‣ Ad Hoc Analyses ‣ Answer questions to keep the business agile, do research ‣ Online Products ‣ Name search, other upcoming products ‣ Company Dashboard ‣ Birdbrain
    • 46. Questions? Follow me at twitter.com/kevinweil ‣ P.S. We’re hiring. Help us build the next step: realtime big data analytics.
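
Slide 19 mentions LZO-encoding everything and keeping it splittable. Below is a minimal, hypothetical sketch of what reading those logs from Pig might look like; the jar name, HDFS paths, and the LzoTextLoader class name are assumptions to check against the hadoop-lzo and Elephant Bird repos, and splitting compressed files across mappers assumes the .lzo files have been indexed first (hadoop-lzo ships an indexer for this).

    -- Illustrative only: register Elephant Bird so Pig can see its LZO loaders
    -- (jar name and paths are assumptions)
    REGISTER 'elephant-bird.jar';

    -- Read LZO-compressed, line-oriented logs; with LZO index files present,
    -- the input splits across many mappers just like uncompressed text
    raw    = LOAD '/logs/web/2010-06-29'
             USING com.twitter.elephantbird.pig.load.LzoTextLoader()
             AS (line:chararray);
    sample = LIMIT raw 10;
    DUMP sample;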
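Slides 21-25 and 38 describe Elephant Bird's generated protobuf loaders for Pig. The loader class, protobuf message, field names, and paths below are all assumptions for illustration; the real loader names depend on your .proto definitions and the Elephant Bird version, so treat this as the shape of the code rather than the exact API.

    -- Hypothetical: load base64/LZO-encoded protobufs written by a Crane-style ETL job
    REGISTER 'elephant-bird.jar';
    REGISTER 'my-protos.jar';

    statuses = LOAD '/tables/statuses/2010-06-29'
               USING com.twitter.elephantbird.pig.load.LzoProtobufB64LinePigLoader('com.example.proto.Status');

    -- Protobuf fields surface as ordinary Pig fields, so the rest is plain Pig Latin
    recent = FILTER statuses BY created_at >= 1277769600L;  -- 2010-06-29 00:00 UTC, in epoch seconds
    DUMP recent;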
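Slides 30-31 and 38 mention a Pig loader for HBase and a two-column-family layout (one family for raw protobuf bytes, one for denormalized columns). The transcript doesn't name the Elephant Bird loader class, so this stand-in uses Pig's stock HBaseStorage to show the pattern; the table name, the 'indexed' column family, and the columns are hypothetical, and HBaseStorage requires a Pig build that ships it.

    -- Stand-in for the Elephant Bird HBase loader: scan the denormalized column family
    users = LOAD 'hbase://users'
            USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('indexed:screen_name indexed:followers')
            AS (screen_name:chararray, followers:long);

    popular = FILTER users BY followers > 100000;
    STORE popular INTO '/results/popular_users';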
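The "real Pig script" on slides 34-36, and its MapReduce equivalent, are images in the original deck and do not survive in this transcript. The example below is not that script; it is a small, made-up query in the same spirit (paths, schema, and field names are assumptions) meant only to show why the Pig version stays readable.

    -- Hypothetical: which client applications produce the most tweets in a day,
    -- and how many followers do their authors have on average?
    tweets = LOAD '/tables/statuses/2010-06-29'
             AS (user_id:long, text:chararray, source:chararray);
    users  = LOAD '/tables/users'
             AS (user_id:long, screen_name:chararray, followers:long);

    -- Join tweets to their authors, then aggregate per client application ("source")
    joined    = JOIN tweets BY user_id, users BY user_id;
    by_source = GROUP joined BY tweets::source;
    stats     = FOREACH by_source GENERATE group AS source,
                                           COUNT(joined) AS num_tweets,
                                           AVG(joined.users::followers) AS avg_followers;

    ranked = ORDER stats BY num_tweets DESC;
    STORE ranked INTO '/results/tweets_by_client';

A hand-written MapReduce version of the same query typically needs a reduce-side join plus additional chained jobs for the grouping and ordering, which is the code-size contrast slides 35-37 are drawing.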
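Slide 40's "standard counts" questions (requests per day, latency, hourly distribution by response code) map directly onto GROUP/FOREACH in Pig. A minimal sketch, assuming a hypothetical web-log schema:

    -- Hypothetical log schema: epoch seconds, URI, HTTP status code, request latency in ms
    logs = LOAD '/logs/web/2010-06-29'
           AS (epoch:long, uri:chararray, status:int, latency_ms:int);

    -- Hourly (UTC) distribution of request counts and mean latency, grouped by response code
    hourly  = FOREACH logs GENERATE (epoch / 3600) % 24 AS hour, status, latency_ms;
    grouped = GROUP hourly BY (hour, status);
    dist    = FOREACH grouped GENERATE FLATTEN(group) AS (hour, status),
                                       COUNT(hourly) AS requests,
                                       AVG(hourly.latency_ms) AS avg_latency_ms;

    STORE dist INTO '/results/hourly_status_distribution';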
