Hadoop Ecosystem at Twitter - Kevin Weil - Hadoop World 2010


Kevin Weil
Twitter

Learn more @ http://www.cloudera.com/hadoop/

Transcript

  • 1. The Hadoop Ecosystem at Twitter. Hadoop World 2010. Kevin Weil, @kevinweil
  • 2. The Data Hierarchy [layer diagram, top to bottom: Data Products / Data Analysis / Data Input / Data Formats / HDFS]
  • 3. My Background ‣ Studied Mathematics and Physics at Harvard, Physics at Stanford ‣ Tropos Networks (city-wide wireless): GBs of data ‣ Cooliris (web media): Hadoop for analytics, TBs of data ‣ Twitter: Hadoop, Pig, machine learning, visualization, social graph analysis, PBs of data
  • 5. Data Formats [layer diagram: Data Formats / HDFS]
  • 12. ‣ Challenge: store some tweets... 1 trillion tweets ‣ Requirements: ‣ Robust to changes ‣ Efficient in size, speed ‣ Amenable to large-scale analysis ‣ Reusable: same techniques apply for other data [layer diagram: Data Formats / HDFS]
  • 13. ‣ curl http://api.twitter.com/1/statuses/show/26096606401.xml?include_entities=true ‣ Each tweet has 17 fields ‣ 6 fields have substructure ‣ Some have sub-substructure ‣ It changes over time! ‣ XML? CSV? JSON?
  • 14. Protocol Buffers ‣ “Protocol Buffers are a way of encoding structured data in an efficient yet extensible format.” http://code.google.com/p/protobuf ‣ You write: IDL ‣ It generates code in your language of choice to construct, serialize, deserialize, describe, etc your data structure ‣ Avro and Thrift are nearly identical in theme
  • 15. IDL Example

        message Tweet {
          optional string created_at = 1;
          optional int64 id = 2;
          optional string text = 3;
          optional string source = 4;
          optional bool truncated = 5;
          optional int64 in_reply_to_status_id = 6;
          optional int64 in_reply_to_user_id = 7;
          optional bool favorited = 8;
          optional string in_reply_to_screen_name = 9;
          optional User user = 10;
          optional Geo geo = 11;
          optional Contributors contributors = 12;

          message User {
            optional int64 id = 1;
            optional string name = 2;
            ...
          }
          message Geo { ... }
          message Contributors { ... }
        }
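
    From this IDL, protoc emits a Tweet class with builder, serializer, and parser methods. A minimal Java sketch of what using that generated code looks like (the builder/serialize/parse calls are standard protobuf-generated API; the package name and field values here are illustrative, not from the talk):

        import com.example.protos.Tweet;  // hypothetical package for the generated class

        public class TweetRoundTrip {
            public static void main(String[] args) throws Exception {
                // Construct via the generated builder
                Tweet tweet = Tweet.newBuilder()
                    .setId(26096606401L)
                    .setText("Hello, Hadoop World!")
                    .setCreatedAt("Tue Oct 12 18:00:00 +0000 2010")
                    .build();

                // Serialize to the compact binary wire format
                byte[] bytes = tweet.toByteArray();

                // Deserialize; fields added to the schema later are carried
                // through as unknown fields, which is what makes the format
                // robust to change
                Tweet parsed = Tweet.parseFrom(bytes);
                System.out.println(parsed.getText());
            }
        }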
  • 16. Protobuf Codegen ‣ Generated code is: ‣ Efficient (Google quotes 80x vs |-delimited format)1,2 ‣ Extensible ‣ Backwards-compatible ‣ Metadata-rich 1. http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext 2. http://code.google.com/p/thrift-protobuf-compare/wiki/Benchmarking
  • 23. Elephant Bird1 ‣ Codegen for data structures is nice ‣ Next step: codegen for all Hadoop-related code ‣ Protocol Buffer InputFormats ‣ OutputFormats ‣ Writables ‣ Pig LoadFuncs and StoreFuncs ‣ Hive SerDes ‣ Add your own! 1. http://www.github.com/kevinweil/elephant-bird
  • 24. Protobuf InputFormats ‣ All objects (hierarchical data, inheritance, etc) ‣ All automatically generated ‣ Efficient, extensible storage and serialization
  • 25. Pig LoadFuncs ‣ All objects (hierarchical data, inheritance, etc) ‣ All automatically generated ‣ Even the load statement itself is generated code
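
    A sketch of how that generated code plugs into a job. The Hadoop driver API below is standard; the Elephant Bird class names follow the generated-per-message pattern the slides describe and are hypothetical, so consult the elephant-bird repo for the real ones:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

        public class TweetJobDriver {
            public static void main(String[] args) throws Exception {
                Job job = new Job(new Configuration(), "tweet-analysis");
                job.setJarByClass(TweetJobDriver.class);

                // Generated InputFormat: reads LZO-compressed files of
                // serialized Tweet protobufs and hands each record to the
                // mapper as a typed message (class name hypothetical)
                job.setInputFormatClass(LzoTweetProtobufBlockInputFormat.class);
                FileInputFormat.addInputPath(job, new Path("/logs/tweets/2010/10/12"));

                // mapper, reducer, and output setup elided
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }

        // On the Pig side, the generated LoadFunc makes loading a one-liner:
        //   tweets = LOAD '/logs/tweets/2010/10/12'
        //       USING com.twitter.elephantbird.pig.load.TweetProtobufB64LinePigLoader();
        // (loader name hypothetical, same generated-code pattern)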
  • 26. Elephant Bird ‣ Also, with and without LZO compression (http://www.github.com/kevinweil/hadoop-lzo) ‣ Support for Hive/Pig/Hadoop readers for other data ‣ JSON, Regex-based, W3C logs, Thrift ‣ Many great outside contributions from the Hadoop community!
  • 30. ‣ Structured Data: LZO-encoded protobufs ‣ Logs: LZO-encoded protobufs, LZO-encoded JSON ‣ 12 TB/day [layer diagram: Data Input (Elephant Bird) / Data Formats (Hadoop-LZO) / HDFS]
  • 35. Scribe ‣ Facebook needed to collect large amounts of log data ‣ They built a solution, open-sourced it ‣ We contributed to it, use it everywhere ‣ Log collection framework over Thrift ‣ You “scribe” log lines, with categories ‣ It does the rest ‣ Note: Flume is a strong contender here
  • 38. Scribe ‣ Runs locally; reliable in network outage ‣ Nodes only know downstream writer; hierarchical, scalable ‣ Pluggable outputs [diagram: FE nodes feed Agg aggregators, which write to File / HDFS]
  • 39. Scribe at Twitter ‣ Currently 80 different categories logged from JavaScript, Ruby, Scala, Java, etc. ‣ We improved logging, monitoring, behavior during failure conditions, writes to HDFS, etc. ‣ Continuing to work with FB to make it better ‣ Painlessly logging 12 TB into HDFS each day http://github.com/traviscrawford/scribe
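
    As a rough illustration of "scribing" a log line with a category (slide 35): the Thrift plumbing below is standard, while the scribe.Client and LogEntry classes come from code generated off scribe's Thrift IDL (their Java package depends on how you run the generator), and the host, port, and category are illustrative:

        import java.util.Collections;
        import org.apache.thrift.protocol.TBinaryProtocol;
        import org.apache.thrift.transport.TFramedTransport;
        import org.apache.thrift.transport.TSocket;

        public class ScribeOneLine {
            public static void main(String[] args) throws Exception {
                // Scribe speaks framed binary Thrift; 1463 is its conventional port
                TFramedTransport transport =
                    new TFramedTransport(new TSocket("localhost", 1463));
                transport.open();
                scribe.Client client = new scribe.Client(new TBinaryProtocol(transport));

                // A category plus a log line; Scribe routes on the category
                LogEntry entry = new LogEntry("web_events", "GET /timeline 200 87ms");
                client.Log(Collections.singletonList(entry));
                transport.close();
            }
        }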
  • 41. ‣ Now we have data. How should we analyze it? ‣ Java MapReduce, Pig, Hive, HBase? ‣ Use each for its strengths [layer diagram: Data Analysis / Data Input (Scribe, Crane, Elephant Bird) / Data Formats (Hadoop-LZO) / HDFS]
  • 42. Java MapReduce ‣ Efficient ‣ Flexible ‣ Send RPCs to Flock1 (distributed social graph store) in the mapper to compute graph aggregates! ‣ Easier to control iterative algorithms like PageRank 1. http://www.github.com/twitter/flockdb
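
    A minimal sketch of that pattern: a plain Hadoop mapper over tweets that is free to mix in an RPC per record. The Hadoop types are real; the Flock call is a placeholder, not FlockDB's actual client API:

        import java.io.IOException;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        public class GraphAggregateMapper
                extends Mapper<LongWritable, Text, LongWritable, LongWritable> {
            @Override
            protected void map(LongWritable offset, Text tweet, Context context)
                    throws IOException, InterruptedException {
                long userId = extractUserId(tweet);  // parsing elided
                // Full Java means you can call external services mid-job,
                // e.g. ask the social graph store for a per-user aggregate:
                // long followers = flockClient.count(userId);  // placeholder API
                context.write(new LongWritable(userId), new LongWritable(1L));
            }

            private long extractUserId(Text tweet) {
                return 0L;  // stub for illustration
            }
        }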
  • 43. But... ‣ Analysis typically in Java ‣ Projections, filters: custom code ‣ Joins lengthy, error-prone ‣ Single-input, two-stage data flow is rigid ‣ n-stage jobs hard to manage ‣ Data exploration requires compilation
  • 44. Pig ‣ High level language ‣ Transformations on sets of records ‣ Process data one step at a time ‣ Easier than SQL?
  • 45. Why Pig? Because I bet you can read the following script
  • 46. A Real Pig Script ‣ Just for fun... the same calculation in Java next
  • 47. No, Seriously.
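
    The script on that slide was an image and didn't survive into this transcript. As a stand-in, here is a representative Pig-style aggregation (tweets per user) driven through Pig's Java API; the input path and schema are made up for the example:

        import org.apache.pig.ExecType;
        import org.apache.pig.PigServer;

        public class TweetsPerUser {
            public static void main(String[] args) throws Exception {
                PigServer pig = new PigServer(ExecType.MAPREDUCE);
                // A few lines of Pig replace pages of hand-written MapReduce
                pig.registerQuery(
                    "tweets = LOAD '/logs/tweets' AS (user_id:long, text:chararray);");
                pig.registerQuery("grouped = GROUP tweets BY user_id;");
                pig.registerQuery(
                    "counts = FOREACH grouped GENERATE group AS user_id, COUNT(tweets);");
                pig.store("counts", "/tmp/tweets_per_user");
            }
        }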
  • 53. Pig Makes it Easy ‣ 5% of the code ‣ 5% of the dev time ‣ Within 20% of the running time ‣ Readable, reusable ‣ As Pig improves, your calculations run faster ‣ Elephant Bird integrates with every calculation
  • 56. One Thing I’ve Learned ‣ It’s easy to answer questions ‣ It’s hard to ask the right questions ‣ Value the system that promotes innovation and iteration ‣ More minds contributing = more value from your data
  • 57. Workflow: Oozie1 ‣ Hadoop jobs are getting more complex ‣ Multi-stage, multi-input jobs, each with its own failure scenarios ‣ An enterprise-y workflow engine for Hadoop 1. http://yahoo.github.com/oozie
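
    For a sense of scale, here is what kicking off a workflow looks like through Oozie's Java client. OozieClient, createConfiguration(), APP_PATH, and run() are Oozie's published client API; the server URL and app path are illustrative:

        import java.util.Properties;
        import org.apache.oozie.client.OozieClient;

        public class SubmitWorkflow {
            public static void main(String[] args) throws Exception {
                OozieClient oozie = new OozieClient("http://localhost:11000/oozie");
                Properties conf = oozie.createConfiguration();
                // Points at a deployed workflow.xml describing the multi-stage job
                conf.setProperty(OozieClient.APP_PATH,
                    "hdfs://namenode:8020/user/hadoop/tweet-pipeline");
                String jobId = oozie.run(conf);  // submit and start in one call
                System.out.println("Started workflow " + jobId);
            }
        }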
  • 59. ‣ More and more of our products rely on Hadoop [layer diagram: Data Products / Data Analysis (Java MR; Pig, Hive, Oozie) / Data Input (Scribe, Crane, Elephant Bird) / Data Formats (Hadoop-LZO) / HDFS]
  • 62. Data Products ‣ Ad Hoc Analyses ‣ Answer questions to keep the business agile ‣ Every team across Twitter runs Hadoop jobs ‣ Online Products ‣ Name search, others under the hood ‣ Company Dashboard
  • 63. Questions? Follow me at twitter.com/kevinweil
  • 70. Counting Big Data ‣ How many requests per day? ‣ Average latency? 95% latency? ‣ Response code distribution per hour? ‣ Twitter searches per day? ‣ Unique users searching, unique queries? ‣ Links tweeted per day? By domain? ‣ Geographic distribution of all of the above
  • 76. Correlating Big Data ‣ Usage difference for mobile users? ‣ ... for users on desktop clients? ‣ ... for users of #newtwitter? ‣ Cohort analyses ‣ What features get users hooked? ‣ What features do Twitter power users use often?
  • 84. Research on Big Data ‣ What can we tell from a user’s tweets? ‣ ... from the tweets of their followers? ‣ ... from the tweets of those they follow? ‣ What influences retweets? Depth of the retweet tree? ‣ Duplicate detection (spam) ‣ Language detection (search) ‣ Machine learning ‣ Natural language processing