Hadoop at Twitter (Hadoop Summit 2010)

Transcript

  • 1. Hadoop at Twitter ‣ Kevin Weil (@kevinweil), Analytics Lead, Twitter
  • 2. The Twitter Data Lifecycle ‣ Data Input ‣ Data Storage ‣ Data Analysis ‣ Data Products
  • 3. The Twitter Data Lifecycle ‣ Data Input: Scribe, Crane ‣ Data Storage: Elephant Bird, HBase ‣ Data Analysis: Pig, Oink ‣ Data Products: Birdbrain [Legend: 1 = Community Open Source, 2 = Twitter Open Source (or soon)]
  • 4. My Background ‣ Studied Mathematics and Physics at Harvard, Physics at Stanford ‣ Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data ‣ Cooliris (web media): Hadoop and Pig for analytics, TBs of data ‣ Twitter: Hadoop, Pig, machine learning, visualization, social graph analysis, (soon) PBs of data
  • 5. The Twitter Data Lifecycle ‣ Data Input: Scribe, Crane ‣ Data Storage ‣ Data Analysis ‣ Data Products
  • 6. What Data? ‣ Two main kinds of raw data ‣ Logs ‣ Tabular data
  • 8. Logs ‣ Started with syslog-ng ‣ As our volume grew, it didn’t scale ‣ Resources overwhelmed ‣ Lost data
  • 9. Scribe ‣ Scribe daemon runs locally; reliable in network outage ‣ Nodes only know downstream writer; hierarchical, scalable ‣ Pluggable outputs, per category [Diagram: FE, FE, FE → Agg, Agg → File, HDFS]
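
The forwarding model on slide 9 can be sketched in Python as follows. This is an illustrative toy and not Scribe's implementation: the `Node` class, its `up` flag (a simulated network state), and the in-memory per-category sinks are all assumptions made for the example.

```python
from collections import defaultdict, deque


class Node:
    """One hop in a hypothetical Scribe-like forwarding tree."""

    def __init__(self, name, downstream=None):
        self.name = name
        self.downstream = downstream    # the only other node this one knows about
        self.up = True                  # simulated reachability of this node
        self.buffer = deque()           # local store used while downstream is down
        self.sinks = defaultdict(list)  # per-category output on the terminal node

    def log(self, category, message):
        """Accept a message, then try to push everything pending downstream."""
        self.buffer.append((category, message))
        self.flush()

    def flush(self):
        if self.downstream is None:
            # Terminal node: write each message to its category's sink.
            while self.buffer:
                category, message = self.buffer.popleft()
                self.sinks[category].append(message)
            return
        # Forward in order; keep buffering while downstream is unreachable.
        while self.buffer and self.downstream.up:
            category, message = self.buffer.popleft()
            self.downstream.log(category, message)


def demo():
    store = Node("store")                # stands in for the File/HDFS outputs
    agg = Node("agg", downstream=store)
    fe = Node("fe1", downstream=agg)

    fe.log("search", "q=hadoop")
    agg.up = False                       # simulate a network outage
    fe.log("search", "q=pig")            # stays in fe1's local buffer
    buffered = len(fe.buffer)
    agg.up = True
    fe.flush()                           # replay once the aggregator is back
    return buffered, store.sinks["search"]
```

Each node holds a reference only to its downstream writer, which is what lets the tree grow by adding hops without reconfiguring the front ends.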
  • 10. Scribe at Twitter ‣ Solved our problem, opened new vistas ‣ Currently 57 different categories logged from multiple sources ‣ FE: Javascript, Ruby on Rails ‣ Middle tier: Ruby on Rails, Scala ‣ Backend: Scala, Java, C++ ‣ 7 TB/day into HDFS ‣ Log first, ask questions later.
  • 11. Scribe at Twitter ‣ We’ve contributed to it as we’ve used it (http://github.com/traviscrawford/scribe) ‣ Improved logging, monitoring, writing to HDFS, compression ‣ Added ZooKeeper-based config ‣ Continuing to work with FB on patches ‣ Also: working with Cloudera to evaluate Flume
  • 12. Tabular Data ‣ Most site data is in MySQL ‣ Tweets, users, devices, client applications, etc ‣ Need to move it between MySQL and HDFS ‣ Also between MySQL and HBase, or MySQL and MySQL ‣ Crane: configuration driven ETL tool
  • 13. Crane [Architecture diagram; labels: Driver, Configuration/Batch Management, ZooKeeper Registration, Source, Extract, Transform, Load, Sink, Protobuf P1, Protobuf P2]
  • 17. Crane ‣ Extract ‣ MySQL, HDFS, HBase, Flock, GA, Facebook Insights ‣ Transform ‣ IP/Phone -> Geo, canonicalize dates, cleaning, arbitrary logic ‣ Load ‣ MySQL, Local file, Stdout, HDFS, HBase ‣ ZooKeeper coordination, intelligent date management ‣ Run all the time from multiple servers, self healing
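
A configuration-driven extract/transform/load job of the kind described for Crane might look like the sketch below. Crane's real configuration format and plugin names are not given in the slides, so the registries (`EXTRACTORS`, `TRANSFORMS`, `LOADERS`), the config keys, and the two sample transforms are hypothetical.

```python
# Hypothetical plugin registries; the names are illustrative, not Crane's.
EXTRACTORS = {
    "inline": lambda cfg: iter(cfg["rows"]),   # rows embedded in the config
}


def canonicalize_date(row):
    """Rewrite an M/D/YYYY date as YYYY-MM-DD."""
    month, day, year = row["created"].split("/")
    return {**row, "created": f"{year}-{month.zfill(2)}-{day.zfill(2)}"}


def drop_empty_names(row):
    """Cleaning step: returning None removes the row from the job."""
    return row if row["name"].strip() else None


TRANSFORMS = {
    "canonicalize_date": canonicalize_date,
    "drop_empty_names": drop_empty_names,
}

LOADERS = {
    "list": lambda cfg: cfg["target"].append,  # append rows to a Python list
}


def run_job(config):
    """Wire up extract, transform, and load purely from the config dict."""
    rows = EXTRACTORS[config["extract"]["type"]](config["extract"])
    steps = [TRANSFORMS[name] for name in config["transform"]]
    load = LOADERS[config["load"]["type"]](config["load"])
    loaded = 0
    for row in rows:
        for step in steps:
            row = step(row)
            if row is None:
                break
        if row is not None:
            load(row)
            loaded += 1
    return loaded


def demo():
    sink = []
    config = {
        "extract": {"type": "inline", "rows": [
            {"name": "alice", "created": "7/4/2010"},
            {"name": "  ", "created": "6/29/2010"},
        ]},
        "transform": ["drop_empty_names", "canonicalize_date"],
        "load": {"type": "list", "target": sink},
    }
    return run_job(config), sink
```

Adding a new source or sink in this shape means registering one more function, with no change to `run_job`.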
  • 18. The Twitter Data Lifecycle ‣ Data Input ‣ Data Storage: Elephant Bird, HBase ‣ Data Analysis ‣ Data Products
  • 19. Storage Basics ‣ Incoming data: 7 TB/day ‣ LZO encode everything ‣ Save 3-4x on storage, pay little CPU ‣ Splittable! (http://www.github.com/kevinweil/hadoop-lzo) ‣ IO-bound jobs ==> 3-4x perf increase
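
Why "splittable" matters can be shown with a small sketch: if a file is compressed as independent blocks and an index of block offsets is kept, each task can decode just its own range of blocks. The code below illustrates that general idea only. `zlib` stands in for LZO (which is not in the Python standard library), and the `(offset, length)` index is an assumption for the example, not hadoop-lzo's actual index format.

```python
import zlib


def write_blocks(records, records_per_block=2):
    """Compress records in independent blocks; return (data, index).

    index holds one (offset, length) pair per block, so a reader can start
    at any block without decompressing what comes before it.
    """
    data = bytearray()
    index = []
    for start in range(0, len(records), records_per_block):
        chunk = "\n".join(records[start:start + records_per_block])
        block = zlib.compress(chunk.encode("utf-8"))
        index.append((len(data), len(block)))
        data.extend(block)
    return bytes(data), index


def read_split(data, index, first_block, last_block):
    """Decode only blocks first_block..last_block-1, as one task would."""
    out = []
    for offset, length in index[first_block:last_block]:
        chunk = zlib.decompress(data[offset:offset + length])
        out.extend(chunk.decode("utf-8").split("\n"))
    return out


def demo():
    records = ["r0", "r1", "r2", "r3", "r4"]
    data, index = write_blocks(records)
    # Two "tasks" each read a disjoint range of blocks.
    return read_split(data, index, 0, 1), read_split(data, index, 1, 3)
```

Without the index, a single reader would have to decompress the whole file from the start, so one large file could not be shared among many map tasks.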
  • 20. Elephant Bird (http://github.com/kevinweil/elephant-bird) [Photo: http://www.flickr.com/photos/jagadish/3072134867/]
  • 25. Elephant Bird ‣ We have data coming in as protocol buffers via Crane... ‣ Protobufs: codegen for efficient ser-de of data structures ‣ Why shouldn’t we just continue, and codegen more glue? ‣ InputFormats, OutputFormats, Pig LoadFuncs, Pig StoreFuncs, Pig HBaseLoaders ‣ Also now does part of this with Thrift, soon Avro ‣ And JSON, W3C Logs
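
The "codegen more glue" idea can be illustrated with a toy generator: given a message name and its fields, emit the source of a loader function instead of writing it by hand. This is only a sketch of the technique. Elephant Bird generates Java classes from protocol buffer definitions, whereas the tab-separated input format, the schema shape, and the function names here are all hypothetical.

```python
def generate_loader_source(message_name, fields):
    """Return Python source for a function that parses one tab-separated line.

    fields is a list of (name, type_name) pairs; type_name is "int" or "str".
    """
    lines = [
        f"def load_{message_name.lower()}(line):",
        "    parts = line.rstrip('\\n').split('\\t')",
        "    return {",
    ]
    for position, (name, type_name) in enumerate(fields):
        lines.append(f"        {name!r}: {type_name}(parts[{position}]),")
    lines.append("    }")
    return "\n".join(lines) + "\n"


def build_loader(message_name, fields):
    """Generate the source, execute it, and return the new function."""
    namespace = {}
    exec(generate_loader_source(message_name, fields), namespace)
    return namespace[f"load_{message_name.lower()}"]


def demo():
    load_status = build_loader("Status", [("id", "int"), ("text", "str")])
    return load_status("42\thello\n")
```

The payoff is the same as on the slide: one schema definition drives every piece of glue, so adding a field means regenerating code rather than editing several hand-written readers.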
  • 27. Challenge: Mutable Data ‣ HDFS is write-once: no seek on write, no append (yet) ‣ Logs are easy. ‣ But our tables change. ‣ Handling rapidly changing data in HDFS: not trivial. ‣ Don’t worry about updated data ‣ Refresh entire dataset ‣ Download updates, tombstone old versions of data, ensure jobs only run over current versions of data, occasionally rewrite full dataset
  • 28. Challenge: Mutable Data ‣ HDFS is write-once: no seek on write, no append (yet) ‣ Logs are easy. ‣ But our tables change. ‣ Handling changing data in HDFS: not trivial.
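
The third strategy on slide 27 (download updates, tombstone old versions, read only current versions, occasionally rewrite the full dataset) can be sketched over an append-only list. The record layout `(key, version, value)` and the separate tombstone set are assumptions chosen for the illustration; the slides do not describe Twitter's actual layout.

```python
def apply_updates(records, tombstones, updates):
    """Append new versions; never rewrite existing records.

    records:    append-only list of (key, version, value) tuples
    tombstones: set of (key, version) pairs marking superseded versions
    updates:    iterable of (key, new_value) pairs
    """
    latest = {}
    for key, version, _ in records:
        if (key, version) not in tombstones:
            latest[key] = version
    for key, value in updates:
        version = 1
        if key in latest:
            tombstones.add((key, latest[key]))   # mark the old version dead
            version = latest[key] + 1
        records.append((key, version, value))
        latest[key] = version


def current_view(records, tombstones):
    """What a job should see: only versions that are not tombstoned."""
    return {key: value for key, version, value in records
            if (key, version) not in tombstones}


def compact(records, tombstones):
    """The occasional full rewrite: drop every tombstoned version."""
    return [rec for rec in records if (rec[0], rec[1]) not in tombstones]


def demo():
    records, tombstones = [], set()
    apply_updates(records, tombstones, [("u1", "alice"), ("u2", "bob")])
    apply_updates(records, tombstones, [("u1", "alice2")])
    return len(records), current_view(records, tombstones), compact(records, tombstones)
```

The bookkeeping is simple here but grows quickly with data volume and update rate, which is the "not trivial" part the slide refers to.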
  • 29. HBase ‣ Has already solved the update problem ‣ Bonus: low-latency query API ‣ Bonus: rich, BigTable-style data model based on column families
  • 30. HBase at Twitter ‣ Crane loads data directly into HBase ‣ One CF for protobuf bytes, one CF to denormalize columns for indexing or quicker batch access ‣ Processing updates transparent, so we always have accurate data in HBase ‣ Pig Loader for HBase in Elephant Bird makes integration with existing analyses easy
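
The two-column-family layout on slide 30 can be modeled with nested dictionaries. This is a sketch, not HBase client code: a dict stands in for the table, JSON stands in for protobuf bytes, and the family names `pb` and `idx` and the chosen index columns are hypothetical.

```python
import json


def make_row(user):
    """Build one row with two column families.

    'pb'  holds the whole record as opaque bytes (JSON stands in for protobuf).
    'idx' repeats a few fields as plain columns so scans can filter on them
          without decoding the bytes.
    """
    return {
        "pb": {"bytes": json.dumps(user, sort_keys=True).encode("utf-8")},
        "idx": {"screen_name": user["screen_name"], "lang": user["lang"]},
    }


def put(table, user):
    """Write by row key; a later put for the same key supersedes the earlier one."""
    table[user["id"]] = make_row(user)


def scan_by_lang(table, lang):
    """Filter on the denormalized column; decode bytes only for matching rows."""
    return [json.loads(row["pb"]["bytes"])
            for _, row in sorted(table.items())
            if row["idx"]["lang"] == lang]


def demo():
    table = {}
    put(table, {"id": 1, "screen_name": "a", "lang": "en"})
    put(table, {"id": 2, "screen_name": "b", "lang": "ja"})
    put(table, {"id": 1, "screen_name": "a", "lang": "ja"})   # an update
    return scan_by_lang(table, "en"), scan_by_lang(table, "ja")
```

Because a write to an existing row key replaces what readers see, updates need none of the tombstone bookkeeping required on plain HDFS files.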
  • 32. The Twitter Data Lifecycle ‣ Data Input ‣ Data Storage ‣ Data Analysis: Pig, Oink ‣ Data Products
  • 33. Enter Pig ‣ High level language ‣ Transformations on sets of records ‣ Process data one step at a time ‣ UDFs are first-class citizens ‣ Easier than SQL?
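
The style described on slide 33 (whole-set transformations, applied one named step at a time, with user functions plugged in) can be imitated in plain Python. This is an analogy to convey the shape of a Pig dataflow, not Pig Latin; the helper names and the sample records are made up for the example.

```python
from collections import defaultdict


def foreach(records, fn):
    """Apply a function to every record (the function plays the role of a UDF)."""
    return [fn(rec) for rec in records]


def filter_by(records, predicate):
    """Keep only the records for which predicate is true."""
    return [rec for rec in records if predicate(rec)]


def group_by(records, key):
    """Collect records into lists keyed by one field."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    return dict(groups)


def normalize_client(rec):
    """A user-defined function: trim and lower-case the client name."""
    return {**rec, "client": rec["client"].strip().lower()}


def demo():
    tweets = [
        {"user": "a", "client": " Web "},
        {"user": "b", "client": "web"},
        {"user": "c", "client": "API"},
        {"user": "d", "client": ""},
    ]
    # Each step takes a whole set of records and yields a new one.
    cleaned = foreach(tweets, normalize_client)
    known = filter_by(cleaned, lambda rec: rec["client"] != "")
    grouped = group_by(known, "client")
    return {client: len(recs) for client, recs in grouped.items()}
```

Every intermediate result has a name and can be inspected on its own, which is the property the "one step at a time" bullet is pointing at.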
  • 34. Why Pig? ‣ Because I bet you can read the following script.
  • 35. A Real Pig Script [script image not captured in the transcript] ‣ Now, just for fun... the same calculation in vanilla Hadoop MapReduce.
  • 36. No, seriously.
  • 37. Pig Democratizes Large-scale Data Analysis ‣ The Pig version is: ‣ 5% of the code ‣ 5% of the time ‣ Within 30% of the execution time. ‣ Innovation increasingly driven from large-scale data analysis ‣ Need fast iteration to understand the right questions ‣ More minds contributing = more value from your data
  • 38. Pig Examples ‣ Using the HBase Loader ‣ Using the protobuf loaders
  • 39. Pig Workflow ‣ Oink: framework around Pig for loading, combining, running, post-processing ‣ Everyone I know has one of these ‣ Points to an opening for innovation; discussion beginning ‣ Something we’re looking at: Ruby DSL for Pig, Piglet (http://github.com/ningliang/piglet)
  • 40. Counting Big Data ‣ standard counts, min, max, std dev ‣ How many requests do we serve in a day? ‣ What is the average latency? 95% latency? ‣ Group by response code. What is the hourly distribution? ‣ How many searches happen each day on Twitter? ‣ How many unique queries, how many unique users? ‣ What is their geographic distribution?
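
The counting questions on slide 40 reduce to a handful of aggregates. The sketch below computes them over a tiny in-memory sample; the record fields (`hour`, `code`, `latency_ms`) are assumed for illustration, and the percentile uses the nearest-rank definition, which is one of several common conventions.

```python
from collections import Counter


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100.

    Returns the smallest value such that at least pct percent of the data
    is less than or equal to it.
    """
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)   # integer ceiling
    return ordered[rank - 1]


def summarize(requests):
    """Total requests, mean and 95th-percentile latency, and two group-bys."""
    latencies = [req["latency_ms"] for req in requests]
    return {
        "requests": len(requests),
        "mean_latency_ms": sum(latencies) / len(latencies),
        "p95_latency_ms": percentile(latencies, 95),
        "by_code": dict(Counter(req["code"] for req in requests)),
        "by_hour": dict(Counter(req["hour"] for req in requests)),
    }


def demo():
    requests = [
        {"hour": 0, "code": 200, "latency_ms": 10},
        {"hour": 0, "code": 200, "latency_ms": 20},
        {"hour": 1, "code": 500, "latency_ms": 30},
        {"hour": 1, "code": 200, "latency_ms": 40},
    ]
    return summarize(requests)
```

At cluster scale the same aggregates would be expressed as group and count steps in Pig rather than computed in a single process.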
  • 41. Correlating Big Data ‣ probabilities, covariance, influence ‣ How does usage differ for mobile users? ‣ How about for users with 3rd party desktop clients? ‣ Cohort analyses ‣ Site problems: what goes wrong at the same time? ‣ Which features get users hooked? ‣ Which features do successful users use often? ‣ Search corrections, search suggestions ‣ A/B testing
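
For the "what goes wrong at the same time?" question, a first pass is often the covariance or correlation between two time-aligned series, such as hourly error counts from two services. The functions below are a minimal sketch using the sample (n - 1) definitions; the series in `demo` are invented numbers, not Twitter data.

```python
def covariance(xs, ys):
    """Sample covariance of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance scaled into the range [-1, 1]."""
    return covariance(xs, ys) / (covariance(xs, xs) * covariance(ys, ys)) ** 0.5


def demo():
    # Invented hourly error counts for two hypothetical services.
    service_a = [1, 2, 3, 4]
    service_b = [2, 4, 6, 8]     # rises together with service_a
    service_c = [8, 6, 4, 2]     # falls as service_a rises
    return correlation(service_a, service_b), correlation(service_a, service_c)
```

A correlation near 1 or -1 only flags series that move together; deciding which failure causes which still takes a closer look.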
  • 42. Research on Big Data ‣ prediction, graph analysis, natural language ‣ What can we tell about a user from their tweets? ‣ From the tweets of those they follow? ‣ From the tweets of their followers? ‣ From the ratio of followers/following? ‣ What graph structures lead to successful networks? ‣ User reputation
  • 43. Research on Big Data ‣ prediction, graph analysis, natural language ‣ Sentiment analysis ‣ What features get a tweet retweeted? ‣ How deep is the corresponding retweet tree? ‣ Long-term duplicate detection ‣ Machine learning ‣ Language detection ‣ ... the list goes on.
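
The retweet-tree question on slide 43 is a tree-depth computation. The sketch below assumes retweets are available as `(retweet_id, source_id)` pairs forming a tree with no cycles; that input shape is an assumption for the example.

```python
from collections import defaultdict


def retweet_depth(edges, root):
    """Number of retweet levels below root (0 if nobody retweeted it).

    edges is an iterable of (retweet_id, source_id) pairs, assumed acyclic.
    """
    children = defaultdict(list)
    for child, parent in edges:
        children[parent].append(child)

    depth = 0
    frontier = [root]
    while True:
        # Walk the tree one level at a time (breadth-first).
        next_level = [c for node in frontier for c in children.get(node, [])]
        if not next_level:
            return depth
        depth += 1
        frontier = next_level


def demo():
    edges = [("b", "a"), ("c", "a"), ("d", "b"), ("e", "d")]
    return retweet_depth(edges, "a"), retweet_depth(edges, "c")
```

On a real graph the edges would not fit in one process, so the level-by-level expansion would become one join per level.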
  • 44. The Twitter Data Lifecycle ‣ Data Input ‣ Data Storage ‣ Data Analysis ‣ Data Products: Birdbrain
  • 45. Data Products ‣ Ad Hoc Analyses ‣ Answer questions to keep the business agile, do research ‣ Online Products ‣ Name search, other upcoming products ‣ Company Dashboard ‣ Birdbrain
  • 46. Questions? Follow me at twitter.com/kevinweil ‣ P.S. We’re hiring. Help us build the next step: realtime big data analytics.