Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)

A look at Twitter's data lifecycle, some of the tools we use to handle big data, and some of the questions we answer from our data.

More in: Technology

Transcript

  • 1. Analyzing Big Data at Twitter. Web 2.0 Expo, 2010. Kevin Weil (@kevinweil)
  • 2. Three Challenges ‣ Collecting Data ‣ Large-Scale Storage and Analysis ‣ Rapid Learning over Big Data
  • 3. My Background ‣ Studied Mathematics and Physics at Harvard, Physics at Stanford ‣ Tropos Networks (city-wide wireless): GBs of data ‣ Cooliris (web media): Hadoop for analytics, TBs of data ‣ Twitter: Hadoop, Pig, machine learning, visualization, social graph analysis, PBs of data
  • 4. Three Challenges ‣ Collecting Data ‣ Large-Scale Storage and Analysis ‣ Rapid Learning over Big Data
  • 5. Data, Data Everywhere ‣ You guys generate a lot of data ‣ Anybody want to guess?
  • 6. Data, Data Everywhere ‣ You guys generate a lot of data ‣ Anybody want to guess? ‣ 12 TB/day (4+ PB/yr)
  • 7. Data, Data Everywhere ‣ You guys generate a lot of data ‣ Anybody want to guess? ‣ 12 TB/day (4+ PB/yr) ‣ 20,000 CDs
  • 8. Data, Data Everywhere ‣ You guys generate a lot of data ‣ Anybody want to guess? ‣ 12 TB/day (4+ PB/yr) ‣ 20,000 CDs ‣ 10 million floppy disks
  • 9. Data, Data Everywhere ‣ You guys generate a lot of data ‣ Anybody want to guess? ‣ 12 TB/day (4+ PB/yr) ‣ 20,000 CDs ‣ 10 million floppy disks ‣ 450 GB while I give this talk
  • 10. Syslog? ‣ Started with syslog-ng ‣ As our volume grew, it didn’t scale
  • 11. Syslog? ‣ Started with syslog-ng ‣ As our volume grew, it didn’t scale ‣ Resources overwhelmed ‣ Lost data
  • 12. Scribe ‣ Surprise! FB had the same problem, built and open-sourced Scribe ‣ Log collection framework over Thrift ‣ You “scribe” log lines, with categories ‣ It does the rest
  • 13. Scribe ‣ Runs locally; reliable in network outage [diagram: FE nodes]
  • 14. Scribe ‣ Runs locally; reliable in network outage ‣ Nodes only know downstream writer; hierarchical, scalable [diagram: FE nodes → Agg nodes]
  • 15. Scribe ‣ Runs locally; reliable in network outage ‣ Nodes only know downstream writer; hierarchical, scalable ‣ Pluggable outputs [diagram: FE → Agg → File / HDFS]
  • 16. Scribe at Twitter ‣ Solved our problem, opened new vistas ‣ Currently 40 different categories logged from JavaScript, Ruby, Scala, Java, etc. ‣ We improved logging, monitoring, behavior during failure conditions, writes to HDFS, etc. ‣ Continuing to work with FB to make it better http://github.com/traviscrawford/scribe
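
For readers who have not used Scribe: “scribing” a log line (slide 12) is one Thrift call against a local Scribe daemon. The sketch below is illustrative only, not Twitter's code; the package name scribe.thrift, the category web_requests, and the message are assumptions, the bindings come from generating Thrift code for scribe.thrift, and 1463 is only the conventional default port.

```java
import java.util.Collections;

import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;

// Hypothetical package; the actual namespace depends on how the Thrift bindings are generated.
import scribe.thrift.LogEntry;
import scribe.thrift.scribe;

public class ScribeLogExample {
    public static void main(String[] args) throws Exception {
        // Scribe speaks framed binary Thrift over TCP; 1463 is the conventional default port.
        TSocket socket = new TSocket("localhost", 1463);
        TFramedTransport transport = new TFramedTransport(socket);
        scribe.Client client = new scribe.Client(new TBinaryProtocol(transport));

        transport.open();
        // A log line is just (category, message); routing to aggregators is Scribe configuration.
        LogEntry entry = new LogEntry("web_requests", "GET /home 200 37ms");
        client.Log(Collections.singletonList(entry));
        transport.close();
    }
}
```

Each front end logs to its local daemon like this, and the daemons forward to aggregators and on to File/HDFS as in the diagram on slides 13-15.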
  • 17. Three Challenges ‣ Collecting Data ‣ Large-Scale Storage and Analysis ‣ Rapid Learning over Big Data
  • 18. How Do You Store 12 TB/day? ‣ Single machine? ‣ What’s hard drive write speed?
  • 19. How Do You Store 12 TB/day? ‣ Single machine? ‣ What’s hard drive write speed? ‣ ~80 MB/s
  • 20. How Do You Store 12 TB/day? ‣ Single machine? ‣ What’s hard drive write speed? ‣ ~80 MB/s ‣ 42 hours to write 12 TB
  • 21. How Do You Store 12 TB/day? ‣ Single machine? ‣ What’s hard drive write speed? ‣ ~80 MB/s ‣ 42 hours to write 12 TB ‣ Uh oh.
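
The 42-hour figure is just the division implied by the previous two bullets:

\[
\frac{12\ \mathrm{TB}}{80\ \mathrm{MB/s}} = \frac{12 \times 10^{12}\ \mathrm{B}}{8 \times 10^{7}\ \mathrm{B/s}} = 1.5 \times 10^{5}\ \mathrm{s} \approx 41.7\ \mathrm{hours}
\]

A single drive cannot even absorb one day's writes within a day, before any reads or reprocessing; hence the next slide's move to a cluster.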
  • 22. Where Do I Put 12 TB/day? ‣ Need a cluster of machines ‣ ... which adds new layers of complexity
  • 23. Hadoop ‣ Distributed file system ‣ Automatic replication ‣ Fault tolerance ‣ Transparently read/write across multiple machines
  • 24. Hadoop ‣ Distributed file system ‣ Automatic replication ‣ Fault tolerance ‣ Transparently read/write across multiple machines ‣ MapReduce-based parallel computation ‣ Key-value based computation interface allows for wide applicability ‣ Fault tolerance, again
  • 25. Hadoop ‣ Open source: top-level Apache project ‣ Scalable: Y! has a 4000 node cluster ‣ Powerful: sorted 1TB random integers in 62 seconds ‣ Easy packaging/install: free Cloudera RPMs
  • 26-32. MapReduce Workflow (built up over seven slides around one diagram: Inputs → Map → Shuffle/Sort → Reduce → Outputs) ‣ Challenge: how many tweets per user, given the tweets table? ‣ Input: key=row, value=tweet info ‣ Map: output key=user_id, value=1 ‣ Shuffle: sort by user_id ‣ Reduce: for each user_id, sum ‣ Output: user_id, tweet count ‣ With 2x machines, runs 2x faster
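
As a concrete sketch of that pipeline, here is what the tweets-per-user count looks like in plain Hadoop MapReduce Java code. This is illustrative only, not Twitter's implementation: the class names and the assumed input layout (tab-separated tweet records with user_id in the first field) are made up for the example.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TweetsPerUser {

    // Map: for every tweet record, emit (user_id, 1).
    public static class TweetMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text userId = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Assumed layout: tab-separated fields, user_id first.
            String[] fields = line.toString().split("\t", -1);
            if (fields.length > 0 && !fields[0].isEmpty()) {
                userId.set(fields[0]);
                context.write(userId, ONE);
            }
        }
    }

    // Reduce: the shuffle has already grouped by user_id, so summing the 1s gives the tweet count.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, LongWritable> {
        private final LongWritable count = new LongWritable();

        @Override
        protected void reduce(Text userId, Iterable<IntWritable> ones, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (IntWritable one : ones) {
                sum += one.get();
            }
            count.set(sum);
            context.write(userId, count);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "tweets-per-user");
        job.setJarByClass(TweetsPerUser.class);
        job.setMapperClass(TweetMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Even this trivial count is roughly fifty lines of Java plus a build and deploy step, which is exactly the pain the Pig section below (slides 41-49) addresses.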
  • 33. Two Analysis Challenges ‣ Compute mutual followings in Twitter’s interest graph ‣ grep, awk? No way. ‣ If data is in MySQL... self join on an n-billion row table? ‣ n,000,000,000 x n,000,000,000 = ?
  • 34. Two Analysis Challenges ‣ Compute mutual followings in Twitter’s interest graph ‣ grep, awk? No way. ‣ If data is in MySQL... self join on an n-billion row table? ‣ n,000,000,000 x n,000,000,000 = ? ‣ I don’t know either.
  • 35. Two Analysis Challenges ‣ Large-scale grouping and counting ‣ select count(*) from users? maybe. ‣ select count(*) from tweets? uh... ‣ Imagine joining these two. ‣ And grouping. ‣ And sorting.
  • 36. Back to Hadoop ‣ Didn’t we have a cluster of machines? ‣ Hadoop makes it easy to distribute the calculation ‣ Purpose-built for parallel calculation ‣ Just a slight mindset adjustment
  • 37. Back to Hadoop ‣ Didn’t we have a cluster of machines? ‣ Hadoop makes it easy to distribute the calculation ‣ Purpose-built for parallel calculation ‣ Just a slight mindset adjustment ‣ But a fun one!
  • 38. Analysis at Scale ‣ Now we’re rolling ‣ Count all tweets: 20+ billion, 5 minutes ‣ Parallel network calls to FlockDB to compute interest graph aggregates ‣ Run PageRank across users and interest graph
  • 39. But... ‣ Analysis typically in Java ‣ Single-input, two-stage data flow is rigid ‣ Projections, filters: custom code ‣ Joins lengthy, error-prone ‣ n-stage jobs hard to manage ‣ Data exploration requires compilation
  • 40. Three Challenges ‣ Collecting Data ‣ Large-Scale Storage and Analysis ‣ Rapid Learning over Big Data
  • 41. Pig ‣ High level language ‣ Transformations on sets of records ‣ Process data one step at a time ‣ Easier than SQL?
  • 42. Why Pig? Because I bet you can read the following script
  • 43. A Real Pig Script ‣ Just for fun... the same calculation in Java next
  • 44. No, Seriously.
  • 45. Pig Makes it Easy ‣ 5% of the code
  • 46. Pig Makes it Easy ‣ 5% of the code ‣ 5% of the dev time
  • 47. Pig Makes it Easy ‣ 5% of the code ‣ 5% of the dev time ‣ Within 20% of the running time
  • 48. Pig Makes it Easy ‣ 5% of the code ‣ 5% of the dev time ‣ Within 20% of the running time ‣ Readable, reusable
  • 49. Pig Makes it Easy ‣ 5% of the code ‣ 5% of the dev time ‣ Within 20% of the running time ‣ Readable, reusable ‣ As Pig improves, your calculations run faster
  • 50. One Thing I’ve Learned ‣ It’s easy to answer questions ‣ It’s hard to ask the right questions
  • 51. One Thing I’ve Learned ‣ It’s easy to answer questions ‣ It’s hard to ask the right questions ‣ Value the system that promotes innovation and iteration
  • 52. One Thing I’ve Learned ‣ It’s easy to answer questions ‣ It’s hard to ask the right questions ‣ Value the system that promotes innovation and iteration ‣ More minds contributing = more value from your data
  • 53. Counting Big Data ‣ How many requests per day?
  • 54. Counting Big Data ‣ How many requests per day? ‣ Average latency? 95% latency?
  • 55. Counting Big Data ‣ How many requests per day? ‣ Average latency? 95% latency? ‣ Response code distribution per hour?
  • 56. Counting Big Data ‣ How many requests per day? ‣ Average latency? 95% latency? ‣ Response code distribution per hour? ‣ Twitter searches per day?
  • 57. Counting Big Data ‣ How many requests per day? ‣ Average latency? 95% latency? ‣ Response code distribution per hour? ‣ Twitter searches per day? ‣ Unique users searching, unique queries?
  • 58. Counting Big Data ‣ How many requests per day? ‣ Average latency? 95% latency? ‣ Response code distribution per hour? ‣ Twitter searches per day? ‣ Unique users searching, unique queries? ‣ Links tweeted per day? By domain?
  • 59. Counting Big Data ‣ How many requests per day? ‣ Average latency? 95% latency? ‣ Response code distribution per hour? ‣ Twitter searches per day? ‣ Unique users searching, unique queries? ‣ Links tweeted per day? By domain? ‣ Geographic distribution of all of the above
  • 60. Correlating Big Data ‣ Usage difference for mobile users?
  • 61. Correlating Big Data ‣ Usage difference for mobile users? ‣ ... for users on desktop clients?
  • 62. Correlating Big Data ‣ Usage difference for mobile users? ‣ ... for users on desktop clients? ‣ ... for users of #newtwitter?
  • 63. Correlating Big Data ‣ Usage difference for mobile users? ‣ ... for users on desktop clients? ‣ ... for users of #newtwitter? ‣ Cohort analyses
  • 64. Correlating Big Data ‣ Usage difference for mobile users? ‣ ... for users on desktop clients? ‣ ... for users of #newtwitter? ‣ Cohort analyses ‣ What features get users hooked?
  • 65. Correlating Big Data ‣ Usage difference for mobile users? ‣ ... for users on desktop clients? ‣ ... for users of #newtwitter? ‣ Cohort analyses ‣ What features get users hooked? ‣ What features power Twitter users use often?
  • 66. Research on Big Data ‣ What can we tell from a user’s tweets?
  • 67. Research on Big Data ‣ What can we tell from a user’s tweets? ‣ ... from the tweets of their followers?
  • 68. Research on Big Data ‣ What can we tell from a user’s tweets? ‣ ... from the tweets of their followers? ‣ ... from the tweets of those they follow?
  • 69. Research on Big Data ‣ What can we tell from a user’s tweets? ‣ ... from the tweets of their followers? ‣ ... from the tweets of those they follow? ‣ What influences retweets? Depth of the retweet tree?
  • 70. Research on Big Data ‣ What can we tell from a user’s tweets? ‣ ... from the tweets of their followers? ‣ ... from the tweets of those they follow? ‣ What influences retweets? Depth of the retweet tree? ‣ Duplicate detection (spam)
  • 71. Research on Big Data ‣ What can we tell from a user’s tweets? ‣ ... from the tweets of their followers? ‣ ... from the tweets of those they follow? ‣ What influences retweets? Depth of the retweet tree? ‣ Duplicate detection (spam) ‣ Language detection (search)
  • 72. Research on Big Data ‣ What can we tell from a user’s tweets? ‣ ... from the tweets of their followers? ‣ ... from the tweets of those they follow? ‣ What influences retweets? Depth of the retweet tree? ‣ Duplicate detection (spam) ‣ Language detection (search) ‣ Machine learning
  • 73. Research on Big Data ‣ What can we tell from a user’s tweets? ‣ ... from the tweets of their followers? ‣ ... from the tweets of those they follow? ‣ What influences retweets? Depth of the retweet tree? ‣ Duplicate detection (spam) ‣ Language detection (search) ‣ Machine learning ‣ Natural language processing
  • 74. Diving Deeper ‣ HBase and building products from Hadoop ‣ LZO Compression ‣ Protocol Buffers and Hadoop ‣ Our analytics-related open source: hadoop-lzo, elephant-bird ‣ Moving analytics to realtime http://github.com/kevinweil/hadoop-lzo http://github.com/kevinweil/elephant-bird
  • 75. Questions? Follow me at twitter.com/kevinweil