Hadoop Ecosystem at Twitter - Kevin Weil - Hadoop World 2010

Kevin Weil
Twitter

Learn more @ http://www.cloudera.com/hadoop/


Uploaded as Apple Keynote
© All Rights Reserved

Hadoop Ecosystem at Twitter - Kevin Weil - Hadoop World 2010: Presentation Transcript

  • The Hadoop Ecosystem at Hadoop World 2010 Kevin Weil @kevinweil
  • The Data Hierarchy Data Products Data Analysis Data Input Data Formats HDFS
  • My Background ‣ Studied Mathematics and Physics at Harvard, Physics at Stanford ‣ Tropos Networks (city-wide wireless): GBs of data ‣ Cooliris (web media): Hadoop for analytics, TBs of data ‣ Twitter: Hadoop, Pig, machine learning, visualization, social graph analysis, PBs of data
  • ‣ Challenge: store some tweets 1 trillion tweets ‣ Requirements: ‣ Robust to changes ‣ Efficient in size, speed ‣ Amenable to large-scale analysis ‣ Reusable: same techniques apply for other data Data Formats HDFS
  • ‣ curl http://api.twitter.com/1/statuses/show/26096606401.xml?include_entities=true ‣ Each tweet has 17 fields ‣ 6 fields have substructure ‣ Some have sub-substructure ‣ It changes over time! ‣ XML? CSV? JSON?
  • Protocol Buffers ‣ “Protocol Buffers are a way of encoding structured data in an efficient yet extensible format.” http://code.google.com/p/protobuf ‣ You write: IDL ‣ It generates code in your language of choice to construct, serialize, deserialize, describe, etc your data structure ‣ Avro and Thrift are nearly identical in theme
  • IDL Example

    message Tweet {
      optional string created_at = 1;
      optional int64 id = 2;
      optional string text = 3;
      optional string source = 4;
      optional bool truncated = 5;
      optional int64 in_reply_to_status_id = 6;
      optional int64 in_reply_to_user_id = 7;
      optional bool favorited = 8;
      optional string in_reply_to_screen_name = 9;
      optional User user = 10;
      optional Geo geo = 11;
      optional Contributors contributors = 12;

      message User {
        optional int64 id = 1;
        optional string name = 2;
        ...
      }
      message Geo { ... }
      message Contributors { ... }
    }
  • Protobuf Codegen ‣ Generated code is: ‣ Efficient (Google quotes 80x vs |-delimited format)1,2 ‣ Extensible ‣ Backwards-compatible ‣ Metadata-rich 1. http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext 2. http://code.google.com/p/thrift-protobuf-compare/wiki/Benchmarking
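Part of where that wire efficiency comes from is varint encoding, which stores small integers in as few bytes as they need rather than a fixed width. As a rough illustration only (a from-scratch sketch of the encoding, not the generated code), here is a protobuf-style varint in Python:

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative int as a protobuf-style varint:
    7 bits of payload per byte, MSB set on all but the last byte."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(data: bytes) -> int:
    """Reassemble the 7-bit groups, least significant group first."""
    result = shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        shift += 7
        if not (byte & 0x80):
            break
    return result

# A tweet id like 26096606401 fits in 5 bytes instead of a fixed 8-byte int64.
encoded = encode_varint(26096606401)
print(len(encoded), decode_varint(encoded))  # → 5 26096606401
```

Small, common values (flags, short ids, field tags) thus cost one or two bytes each, which adds up at a trillion records.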
  • Elephant Bird1 ‣ Codegen for data structures is nice ‣ Next step: codegen for all Hadoop-related code ‣ Protocol Buffer InputFormats ‣ OutputFormats ‣ Writables ‣ Pig LoadFuncs and StoreFuncs ‣ Hive SerDes ‣ Add your own! 1. http://www.github.com/kevinweil/elephant-bird
  • Protobuf InputFormats ‣ All objects (hierarchical data, inheritance, etc) ‣ All automatically generated ‣ Efficient, extensible storage and serialization
  • Pig LoadFuncs ‣ All objects (hierarchical data, inheritance, etc) ‣ All automatically generated ‣ Even the load statement itself is generated code
  • Elephant Bird ‣ Also, with and without LZO compression (http://www.github.com/kevinweil/hadoop-lzo) ‣ Support for Hive/Pig/Hadoop readers for other data ‣ JSON, Regex-based, W3C logs, Thrift ‣ Many great outside contributions from the Hadoop community!
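The point of hadoop-lzo is that it makes LZO files splittable by building an index of compressed-block offsets, so each Hadoop task can seek straight to a block instead of scanning from the start. The sketch below shows the indexing idea only; it uses zlib as a stand-in because LZO bindings are not in the Python standard library, and all names here are hypothetical:

```python
import zlib

def compress_blocks(records, block_size=2):
    """Compress records in independent blocks and record each block's
    starting byte offset, so a reader can seek to any block directly."""
    blob = bytearray()
    index = []  # byte offset of each compressed block
    for i in range(0, len(records), block_size):
        index.append(len(blob))
        block = "\n".join(records[i:i + block_size]).encode()
        blob += zlib.compress(block)
    return bytes(blob), index

def read_block(blob, index, n):
    """Decompress only block n, using the offset index to find it."""
    start = index[n]
    end = index[n + 1] if n + 1 < len(index) else len(blob)
    return zlib.decompress(blob[start:end]).decode().split("\n")

records = ["tweet %d" % i for i in range(6)]
blob, index = compress_blocks(records)
print(read_block(blob, index, 2))  # seek straight to the third block
```

In the real system the index lives in a sidecar file next to the `.lzo` file, and the InputFormat uses it to assign block-aligned splits to mappers.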
  • ‣ Structured Data: LZO-encoded protobufs ‣ Logs: LZO-encoded protobufs, LZO-encoded JSON ‣ 12 TB/day Data Input Elephant Bird Data Formats Hadoop-LZO HDFS
  • Scribe ‣ Facebook needed to collect large amounts of log data ‣ They built a solution, open-sourced it ‣ We contributed to it, use it everywhere ‣ Log collection framework over Thrift ‣ You “scribe” log lines, with categories ‣ It does the rest ‣ Note: Flume is a strong contender here
  • Scribe ‣ Runs locally; reliable in network outage ‣ Nodes only know their downstream writer; hierarchical, scalable ‣ Pluggable outputs (diagram: FE nodes → Agg nodes → File/HDFS)
  • Scribe at Twitter ‣ Currently 80 different categories logged from JavaScript, Ruby, Scala, Java, etc ‣ We improved logging, monitoring, behavior during failure conditions, writes to HDFS, etc ‣ Continuing to work with FB to make it better ‣ Painlessly logging 12TB into HDFS each day http://github.com/traviscrawford/scribe
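Scribe's model, where clients log (category, message) pairs and the daemon buffers each category and flushes it to its own destination, can be caricatured in a few lines. This is a toy sketch with invented names, not Scribe's actual Thrift API:

```python
from collections import defaultdict

class ScribeLikeLogger:
    """Toy model of Scribe's flow: buffer messages per category,
    flush each category to its own sink once a threshold is reached."""
    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold
        self.buffers = defaultdict(list)
        self.sinks = defaultdict(list)  # stand-in for per-category HDFS files

    def log(self, category, message):
        self.buffers[category].append(message)
        if len(self.buffers[category]) >= self.flush_threshold:
            self.flush(category)

    def flush(self, category):
        self.sinks[category].extend(self.buffers[category])
        self.buffers[category].clear()

logger = ScribeLikeLogger()
for i in range(3):
    logger.log("web_events", "GET /timeline %d" % i)
logger.log("search", "q=hadoop")
print(logger.sinks["web_events"])  # web_events flushed after 3 messages
```

The real daemon adds what the toy leaves out: local spooling when the downstream writer is unreachable, and the hierarchical fan-in from front ends to aggregators described on the previous slide.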
  • ‣ Now we have data. How should we analyze it? ‣ Java MapReduce, Pig, Hive, HBase? ‣ Use each for their strengths Data Analysis Scribe Data Input Crane Elephant Bird Data Formats Hadoop-LZO HDFS
  • Java MapReduce ‣ Efficient ‣ Flexible ‣ Send RPCs to Flock1 (distributed social graph store) in the mapper to compute graph aggregates! ‣ Easier to control iterative algorithms like PageRank 1. http://www.github.com/twitter/flockdb
  • But... ‣ Analysis typically in Java ‣ Projections, filters: custom code ‣ Joins lengthy, error-prone ‣ Single-input, two-stage data flow is rigid ‣ n-stage jobs hard to manage ‣ Data exploration requires compilation
  • Pig ‣ High level language ‣ Transformations on sets of records ‣ Process data one step at a time ‣ Easier than SQL?
  • Why Pig? Because I bet you can read the following script
  • A Real Pig Script ‣ Just for fun... the same calculation in Java next
  • No, Seriously.
  • Pig Makes it Easy ‣ 5% of the code ‣ 5% of the dev time ‣ Within 20% of the running time ‣ Readable, reusable ‣ As Pig improves, your calculations run faster ‣ Elephant Bird integrates with every calculation
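The "one step at a time" style the slides describe, where a script names each transformation on a set of records, can be mimicked in plain Python. A hypothetical sketch (the records and field names are invented, and the Pig statements in the comments are only approximate equivalents):

```python
from collections import Counter

# Hypothetical records: (user, tweet_text) pairs standing in for loaded data.
records = [
    ("alice", "hadoop at scale"),
    ("bob", "lunch"),
    ("alice", "pig > java mapreduce"),
    ("carol", "hadoop world 2010"),
]

# Pig-style pipeline, one named step at a time:
#   loaded   = LOAD 'tweets' ...            (the records list above)
#   filtered = FILTER loaded BY text MATCHES '.*hadoop.*';
filtered = [(user, text) for user, text in records if "hadoop" in text]

#   counted  = FOREACH (GROUP filtered BY user) GENERATE group, COUNT(filtered);
counts = Counter(user for user, _ in filtered)
print(dict(counts))  # → {'alice': 1, 'carol': 1}
```

Each named intermediate (filtered, counts) corresponds to one relation in a Pig script; Pig compiles the whole chain into as few MapReduce jobs as it can, which is why the result stays within a small constant factor of hand-written Java.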
  • One Thing I’ve Learned ‣ It’s easy to answer questions ‣ It’s hard to ask the right questions ‣ Value the system that promotes innovation and iteration ‣ More minds contributing = more value from your data
  • Workflow: Oozie1 ‣ Hadoop jobs getting more complex ‣ Multi-stage, multi-input jobs, each with own failure scenarios ‣ Enterprise-y workflow engine for Hadoop 1. http://yahoo.github.com/oozie
  • ‣ More and more of our products rely on Hadoop Data Products Java MR Data Analysis Pig, Hive, Oozie Scribe Data Input Crane Elephant Bird Data Formats Hadoop-LZO HDFS
  • Data Products ‣ Ad Hoc Analyses ‣ Answer questions to keep the business agile ‣ Every team across Twitter runs Hadoop jobs ‣ Online Products ‣ Name search, others under the hood ‣ Company Dashboard
  • Questions? Follow me at twitter.com/kevinweil
  • Counting Big Data ‣ How many requests per day? ‣ Average latency? 95% latency? ‣ Response code distribution per hour? ‣ Twitter searches per day? ‣ Unique users searching, unique queries? ‣ Links tweeted per day? By domain? ‣ Geographic distribution of all of the above
  • Correlating Big Data ‣ Usage difference for mobile users? ‣ ... for users on desktop clients? ‣ ... for users of #newtwitter? ‣ Cohort analyses ‣ What features get users hooked? ‣ What features power Twitter users use often?
  • Research on Big Data ‣ What can we tell from a user’s tweets? ‣ ... from the tweets of their followers? ‣ ... from the tweets of those they follow? ‣ What influences retweets? Depth of the retweet tree? ‣ Duplicate detection (spam) ‣ Language detection (search) ‣ Machine learning ‣ Natural language processing