Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)

A look at Twitter's data lifecycle, some of the tools we use to handle big data, and some of the questions we answer from our data.

  • Transcript

    • 1. Analyzing Big Data at Twitter. Web 2.0 Expo, 2010. Kevin Weil, @kevinweil
    • 2. Three Challenges ‣ Collecting Data ‣ Large-Scale Storage and Analysis ‣ Rapid Learning over Big Data
    • 3. My Background ‣ Studied Mathematics and Physics at Harvard, Physics at Stanford ‣ Tropos Networks (city-wide wireless): GBs of data ‣ Cooliris (web media): Hadoop for analytics, TBs of data ‣ Twitter: Hadoop, Pig, machine learning, visualization, social graph analysis, PBs of data
    • 4. Three Challenges ‣ Collecting Data ‣ Large-Scale Storage and Analysis ‣ Rapid Learning over Big Data
    • 5. Data, Data Everywhere ‣ You guys generate a lot of data ‣ Anybody want to guess?
    • 6. Data, Data Everywhere ‣ You guys generate a lot of data ‣ Anybody want to guess? ‣ 12 TB/day (4+ PB/yr)
    • 7. Data, Data Everywhere ‣ You guys generate a lot of data ‣ Anybody want to guess? ‣ 12 TB/day (4+ PB/yr) ‣ 20,000 CDs
    • 8. Data, Data Everywhere ‣ You guys generate a lot of data ‣ Anybody want to guess? ‣ 12 TB/day (4+ PB/yr) ‣ 20,000 CDs ‣ 10 million floppy disks
    • 9. Data, Data Everywhere ‣ You guys generate a lot of data ‣ Anybody want to guess? ‣ 12 TB/day (4+ PB/yr) ‣ 20,000 CDs ‣ 10 million floppy disks ‣ 450 GB while I give this talk
    • 10. Syslog? ‣ Started with syslog-ng ‣ As our volume grew, it didn’t scale
    • 11. Syslog? ‣ Started with syslog-ng ‣ As our volume grew, it didn’t scale ‣ Resources overwhelmed ‣ Lost data
    • 12. Scribe ‣ Surprise! FB had same problem, built and open-sourced Scribe ‣ Log collection framework over Thrift ‣ You “scribe” log lines, with categories ‣ It does the rest
    • 13. Scribe ‣ Runs locally; reliable in network outage (diagram: three FE nodes)
    • 14. Scribe ‣ Runs locally; reliable in network outage ‣ Nodes only know downstream writer; hierarchical, scalable (diagram: FE nodes feeding Agg nodes)
    • 15. Scribe ‣ Runs locally; reliable in network outage ‣ Nodes only know downstream writer; hierarchical, scalable ‣ Pluggable outputs (diagram: FE nodes feeding Agg nodes, writing to File and HDFS)
    • 16. Scribe at Twitter ‣ Solved our problem, opened new vistas ‣ Currently 40 different categories logged from JavaScript, Ruby, Scala, Java, etc. ‣ We improved logging, monitoring, behavior during failure conditions, writes to HDFS, etc. ‣ Continuing to work with FB to make it better http://github.com/traviscrawford/scribe
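
    To make the “you scribe log lines, with categories” model concrete, here is a minimal client sketch in Java. It assumes the classes generated from Facebook's scribe.thrift IDL (LogEntry and scribe.Client) plus the Thrift runtime are on the classpath; the category name and message below are made up for illustration, not Twitter's actual categories.

        import java.util.Collections;
        import org.apache.thrift.protocol.TBinaryProtocol;
        import org.apache.thrift.transport.TFramedTransport;
        import org.apache.thrift.transport.TSocket;
        import org.apache.thrift.transport.TTransport;

        public class ScribeExample {
          public static void main(String[] args) throws Exception {
            // Connect to the local Scribe agent (1463 is Scribe's conventional port).
            TTransport transport = new TFramedTransport(new TSocket("localhost", 1463));
            TBinaryProtocol protocol = new TBinaryProtocol(transport, false, false);
            scribe.Client client = new scribe.Client(protocol);  // Thrift-generated client
            transport.open();

            // “Scribe” one log line under a category; the local agent handles
            // buffering and forwarding to the downstream aggregators.
            LogEntry entry = new LogEntry("web_clicks", "user=12345 action=tweet");
            client.Log(Collections.singletonList(entry));

            transport.close();
          }
        }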
    • 17. Three Challenges ‣ Collecting Data ‣ Large-Scale Storage and Analysis ‣ Rapid Learning over Big Data
    • 18. How Do You Store 12 TB/day? ‣ Single machine? ‣ What’s hard drive write speed?
    • 19. How Do You Store 12 TB/day? ‣ Single machine? ‣ What’s hard drive write speed? ‣ ~80 MB/s
    • 20. How Do You Store 12 TB/day? ‣ Single machine? ‣ What’s hard drive write speed? ‣ ~80 MB/s ‣ 42 hours to write 12 TB
    • 21. How Do You Store 12 TB/day? ‣ Single machine? ‣ What’s hard drive write speed? ‣ ~80 MB/s ‣ 42 hours to write 12 TB ‣ Uh oh.
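
    For reference, the arithmetic behind those last two bullets, assuming decimal units (1 TB = 1,000,000 MB): 12,000,000 MB / 80 MB/s = 150,000 s ≈ 41.7 hours, i.e. the roughly 42 hours quoted above.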
    • 22. Where Do I Put 12 TB/day? ‣ Need a cluster of machines ‣ ... which adds new layers of complexity
    • 23. Hadoop ‣ Distributed file system ‣ Automatic replication ‣ Fault tolerance ‣ Transparently read/write across multiple machines
    • 24. Hadoop ‣ Distributed file system ‣ Automatic replication ‣ Fault tolerance ‣ Transparently read/write across multiple machines ‣ MapReduce-based parallel computation ‣ Key-value based computation interface allows for wide applicability ‣ Fault tolerance, again
    • 25. Hadoop ‣ Open source: top-level Apache project ‣ Scalable: Y! has a 4000 node cluster ‣ Powerful: sorted 1TB random integers in 62 seconds ‣ Easy packaging/install: free Cloudera RPMs
    • 26. MapReduce Workflow ‣ Challenge: how many tweets per user, given tweets table? ‣ Input: key=row, value=tweet info ‣ Map: output key=user_id, value=1 ‣ Shuffle: sort by user_id ‣ Reduce: for each user_id, sum ‣ Output: user_id, tweet count ‣ With 2x machines, runs 2x faster (diagram: Inputs → Map → Shuffle/Sort → Reduce → Outputs)
    • 27.–32. MapReduce Workflow (progressive builds of the same diagram and bullets as slide 26)
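
    For readers who want to see what slides 26–32 describe in code, here is a minimal sketch of that tweets-per-user job against Hadoop's Java MapReduce API; the class names and the input layout (tab-separated records with user_id in the first field) are assumptions for illustration, not Twitter's actual job.

        import java.io.IOException;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;

        public class TweetsPerUser {
          // Map: input key = row offset, value = tweet record; emit (user_id, 1).
          public static class TweetMapper
              extends Mapper<LongWritable, Text, Text, LongWritable> {
            private static final LongWritable ONE = new LongWritable(1);
            private final Text userId = new Text();

            @Override
            protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
              userId.set(record.toString().split("\t")[0]);
              context.write(userId, ONE);
            }
          }

          // Reduce: the framework has already shuffled and sorted by user_id;
          // sum the 1s to get each user's tweet count.
          public static class SumReducer
              extends Reducer<Text, LongWritable, Text, LongWritable> {
            @Override
            protected void reduce(Text userId, Iterable<LongWritable> ones, Context context)
                throws IOException, InterruptedException {
              long total = 0;
              for (LongWritable one : ones) {
                total += one.get();
              }
              context.write(userId, new LongWritable(total));  // output: user_id, tweet count
            }
          }
        }

    Because each map and reduce task works on an independent slice of the data, adding machines adds task slots, which is where the “with 2x machines, runs 2x faster” bullet comes from.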
    • 33. Two Analysis Challenges ‣ Compute mutual followings in Twitter’s interest graph ‣ grep, awk? No way. ‣ If data is in MySQL... self join on an n- billion row table? ‣ n,000,000,000 x n,000,000,000 = ?
    • 34. Two Analysis Challenges ‣ Compute mutual followings in Twitter’s interest graph ‣ grep, awk? No way. ‣ If data is in MySQL... self join on an n- billion row table? ‣ n,000,000,000 x n,000,000,000 = ? ‣ I don’t know either.
    • 35. Two Analysis Challenges ‣ Large-scale grouping and counting ‣ select count(*) from users? maybe. ‣ select count(*) from tweets? uh... ‣ Imagine joining these two. ‣ And grouping. ‣ And sorting.
    • 36. Back to Hadoop ‣ Didn’t we have a cluster of machines? ‣ Hadoop makes it easy to distribute the calculation ‣ Purpose-built for parallel calculation ‣ Just a slight mindset adjustment
    • 37. Back to Hadoop ‣ Didn’t we have a cluster of machines? ‣ Hadoop makes it easy to distribute the calculation ‣ Purpose-built for parallel calculation ‣ Just a slight mindset adjustment ‣ But a fun one!
    • 38. Analysis at Scale ‣ Now we’re rolling ‣ Count all tweets: 20+ billion, 5 minutes ‣ Parallel network calls to FlockDB to compute interest graph aggregates ‣ Run PageRank across users and interest graph
    • 39. But... ‣ Analysis typically in Java ‣ Single-input, two-stage data flow is rigid ‣ Projections, filters: custom code ‣ Joins lengthy, error-prone ‣ n-stage jobs hard to manage ‣ Data exploration requires compilation
    • 40. Three Challenges ‣ Collecting Data ‣ Large-Scale Storage and Analysis ‣ Rapid Learning over Big Data
    • 41. Pig ‣ High level language ‣ Transformations on sets of records ‣ Process data one step at a time ‣ Easier than SQL?
    • 42. Why Pig? Because I bet you can read the following script
    • 43. A Real Pig Script ‣ Just for fun... the same calculation in Java next
    • 44. No, Seriously.
    • 45. Pig Makes it Easy ‣ 5% of the code
    • 46. Pig Makes it Easy ‣ 5% of the code ‣ 5% of the dev time
    • 47. Pig Makes it Easy ‣ 5% of the code ‣ 5% of the dev time ‣ Within 20% of the running time
    • 48. Pig Makes it Easy ‣ 5% of the code ‣ 5% of the dev time ‣ Within 20% of the running time ‣ Readable, reusable
    • 49. Pig Makes it Easy ‣ 5% of the code ‣ 5% of the dev time ‣ Within 20% of the running time ‣ Readable, reusable ‣ As Pig improves, your calculations run faster
    • 50. One Thing I’ve Learned ‣ It’s easy to answer questions ‣ It’s hard to ask the right questions
    • 51. One Thing I’ve Learned ‣ It’s easy to answer questions ‣ It’s hard to ask the right questions ‣ Value the system that promotes innovation and iteration
    • 52. One Thing I’ve Learned ‣ It’s easy to answer questions ‣ It’s hard to ask the right questions ‣ Value the system that promotes innovation and iteration ‣ More minds contributing = more value from your data
    • 53. Counting Big Data ‣ How many requests per day?
    • 54. Counting Big Data ‣ How many requests per day? ‣ Average latency? 95% latency?
    • 55. Counting Big Data ‣ How many requests per day? ‣ Average latency? 95% latency? ‣ Response code distribution per hour?
    • 56. Counting Big Data ‣ How many requests per day? ‣ Average latency? 95% latency? ‣ Response code distribution per hour? ‣ Twitter searches per day?
    • 57. Counting Big Data ‣ How many requests per day? ‣ Average latency? 95% latency? ‣ Response code distribution per hour? ‣ Twitter searches per day? ‣ Unique users searching, unique queries?
    • 58. Counting Big Data ‣ How many requests per day? ‣ Average latency? 95% latency? ‣ Response code distribution per hour? ‣ Twitter searches per day? ‣ Unique users searching, unique queries? ‣ Links tweeted per day? By domain?
    • 59. Counting Big Data ‣ How many requests per day? ‣ Average latency? 95% latency? ‣ Response code distribution per hour? ‣ Twitter searches per day? ‣ Unique users searching, unique queries? ‣ Links tweeted per day? By domain? ‣ Geographic distribution of all of the above
    • 60. Correlating Big Data ‣ Usage difference for mobile users?
    • 61. Correlating Big Data ‣ Usage difference for mobile users? ‣ ... for users on desktop clients?
    • 62. Correlating Big Data ‣ Usage difference for mobile users? ‣ ... for users on desktop clients? ‣ ... for users of #newtwitter?
    • 63. Correlating Big Data ‣ Usage difference for mobile users? ‣ ... for users on desktop clients? ‣ ... for users of #newtwitter? ‣ Cohort analyses
    • 64. Correlating Big Data ‣ Usage difference for mobile users? ‣ ... for users on desktop clients? ‣ ... for users of #newtwitter? ‣ Cohort analyses ‣ What features get users hooked?
    • 65. Correlating Big Data ‣ Usage difference for mobile users? ‣ ... for users on desktop clients? ‣ ... for users of #newtwitter? ‣ Cohort analyses ‣ What features get users hooked? ‣ What features power Twitter users use often?
    • 66. Research on Big Data ‣ What can we tell from a user’s tweets?
    • 67. Research on Big Data ‣ What can we tell from a user’s tweets? ‣ ... from the tweets of their followers?
    • 68. Research on Big Data ‣ What can we tell from a user’s tweets? ‣ ... from the tweets of their followers? ‣ ... from the tweets of those they follow?
    • 69. Research on Big Data ‣ What can we tell from a user’s tweets? ‣ ... from the tweets of their followers? ‣ ... from the tweets of those they follow? ‣ What influences retweets? Depth of the retweet tree?
    • 70. Research on Big Data ‣ What can we tell from a user’s tweets? ‣ ... from the tweets of their followers? ‣ ... from the tweets of those they follow? ‣ What influences retweets? Depth of the retweet tree? ‣ Duplicate detection (spam)
    • 71. Research on Big Data ‣ What can we tell from a user’s tweets? ‣ ... from the tweets of their followers? ‣ ... from the tweets of those they follow? ‣ What influences retweets? Depth of the retweet tree? ‣ Duplicate detection (spam) ‣ Language detection (search)
    • 72. Research on Big Data ‣ What can we tell from a user’s tweets? ‣ ... from the tweets of their followers? ‣ ... from the tweets of those they follow? ‣ What influences retweets? Depth of the retweet tree? ‣ Duplicate detection (spam) ‣ Language detection (search) ‣ Machine learning
    • 73. Research on Big Data ‣ What can we tell from a user’s tweets? ‣ ... from the tweets of their followers? ‣ ... from the tweets of those they follow? ‣ What influences retweets? Depth of the retweet tree? ‣ Duplicate detection (spam) ‣ Language detection (search) ‣ Machine learning ‣ Natural language processing
    • 74. Diving Deeper ‣ HBase and building products from Hadoop ‣ LZO Compression ‣ Protocol Buffers and Hadoop ‣ Our analytics-related open source: hadoop-lzo, elephant-bird ‣ Moving analytics to realtime http://github.com/kevinweil/hadoop-lzo http://github.com/kevinweil/elephant-bird
    • 75. Questions? Follow me at twitter.com/kevinweil