Big Data at Twitter, Chirp 2010
  • 13,865 views

Slides from my talk on collecting, storing, and analyzing big data at Twitter for Chirp Hack Day at Twitter's Chirp conference.

Slides from my talk on collecting, storing, and analyzing big data at Twitter for Chirp Hack Day at Twitter's Chirp conference.

Statistics

Views

Total Views: 13,865
Views on SlideShare: 13,280
Embed Views: 585

Actions

Likes: 48
Downloads: 645
Comments: 4

32 Embeds (585 views)

http://escalabilidade.com 220
http://www.slideshare.net 98
http://mikeg.typepad.com 88
http://www.labnol.org 36
http://www.linkedin.com 35
https://www.linkedin.com 18
http://blog.sina.com.cn 17
http://www.devsundar.com 14
http://blog.devsundar.com 9
http://10.209.254.251 6
http://jisi.dreamblog.jp 5
http://localhost 5
http://zoernert.posterous.com 4
http://www.lifeyun.com 4
http://www.lmodules.com 3
http://www.webtrafficroi.com 3
http://www.cyber-junk.de 3
http://www.thehoard.org 2
https://hatenainfra.g.hatena.ne.jp 2
http://feeds.feedburner.com 1
http://twittertim.es 1
http://safe.tumblr.com 1
http://notes.ztaylor.org 1
https://twitter.com 1
http://webcache.googleusercontent.com 1
http://static.slidesharecdn.com 1
http://ztaylor.tumblr.com 1
http://translate.googleusercontent.com 1
http://paper.li 1
http://hoard.heroku.com 1
http://twitter.com 1
http://pmomale-ld1 1

Upload Details

Uploaded as Apple Keynote

Usage Rights

CC Attribution-ShareAlike License


Big Data at Twitter, Chirp 2010 Presentation Transcript

  • 1. Big Data at Twitter #chirpdata Kevin Weil @kevinweil Twitter Analytics
  • 2-3. Three Challenges • Collecting Data • Large-Scale Storage & Analysis • Rapid Learning over Big Data
  • 4-8. Data, Data Everywhere • You guys generate a lot of data • Anybody want to guess? • 7 TB/day (2+ PB/yr) • 10,000 CDs • 5 million floppy disks • 225 GB while I give this talk
  • 9-10. Syslog? • Started with syslog-ng • As our volume grew, it didn't scale • Resources overwhelmed • Lost data
  • 11. Scribe • Surprise! FB had same problem, built and open-sourced Scribe • Log collection framework over Thrift • You write log lines, with categories • It does the rest
  • 12-14. Scribe (diagram: FE nodes → Agg nodes → File/HDFS) • Runs locally; reliable in network outages • Nodes only know their downstream writer: hierarchical, scalable • Pluggable outputs (File, HDFS)
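    A minimal sketch of what "write log lines, with categories" looks like from the client side, assuming the Java classes the Thrift compiler generates from Scribe's IDL (LogEntry and the scribe service) and the conventional local agent port 1463; the package names and the category are illustrative, not Twitter's actual setup:

        import java.util.Arrays;
        import org.apache.thrift.protocol.TBinaryProtocol;
        import org.apache.thrift.transport.TFramedTransport;
        import org.apache.thrift.transport.TSocket;
        // Generated by the Thrift compiler from Scribe's scribe.thrift
        import scribe.thrift.LogEntry;
        import scribe.thrift.scribe;

        public class ScribeLogger {
            public static void main(String[] args) throws Exception {
                // The Scribe agent runs locally, so clients stay fast and survive network outages
                TFramedTransport transport = new TFramedTransport(new TSocket("localhost", 1463));
                scribe.Client client = new scribe.Client(new TBinaryProtocol(transport));
                transport.open();
                // Each entry carries a category; downstream Scribe nodes route on it
                client.Log(Arrays.asList(new LogEntry("web_requests", "GET /home 200 12ms")));
                transport.close();
            }
        }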
  • 15. Scribe at Twitter • Solved our problem, opened new vistas • Currently 30 different categories logged from JavaScript, Ruby on Rails, Scala, etc. • We improved logging, monitoring, writing to Hadoop, and compression
  • 16. Scribe at Twitter • Continuing to work with FB • GSoC project! Help make it more awesome. • http://github.com/traviscrawford/scribe • http://wiki.developers.facebook.com/index.php/User:GSoC
  • 17. Three Challenges • Collecting Data • Large-Scale Storage & Analysis • Rapid Learning over Big Data
  • 18-21. How do you store 7 TB/day? • Single machine? • What's HD write speed? • 80 MB/s • 24.3 hrs to write 7 TB • Uh oh.
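    The 24.3-hour figure is straight division: 7 TB/day at an 80 MB/s sequential write speed is 7,000,000 MB ÷ 80 MB/s = 87,500 s ≈ 24.3 hours, so a single disk spends more than a full day just writing one day's data.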
  • 22-23. Where do I put 7 TB/day? • Need a cluster of machines • ... which adds new layers of complexity
  • 24. Hadoop • Distributed file system • Automatic replication, fault tolerance • MapReduce-based parallel computation • Key-value based computation interface allows for wide applicability
  • 25. Hadoop • Open source: top-level Apache project • Scalable: Y! has a 4000 node cluster • Powerful: sorted 1TB random integers in 62 seconds • Easy packaging: free Cloudera RPMs
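    To make "distributed file system, automatic replication" concrete, a minimal sketch using Hadoop's Java FileSystem API; the path and the log line are made up, and replication and block placement happen transparently per the cluster's configured replication factor:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class HdfsWrite {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration(); // reads core-site.xml / hdfs-site.xml
                FileSystem fs = FileSystem.get(conf);
                // Writes look like one file; HDFS replicates each block across the cluster
                try (FSDataOutputStream out = fs.create(new Path("/logs/sample.log"))) {
                    out.writeBytes("one log line\n");
                }
            }
        }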
  • 26-32. MapReduce Workflow (diagram: Inputs → Map → Shuffle/Sort → Reduce → Outputs) • Challenge: how many tweets per user, given the tweets table? • Input: key=row, value=tweet info • Map: output key=user_id, value=1 • Shuffle: sort by user_id • Reduce: for each user_id, sum • Output: user_id, tweet count • With 2x machines, runs 2x faster
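    The workflow above maps almost line-for-line onto Hadoop's Java API. A sketch of the tweets-per-user job, assuming tab-separated input whose first field is the user id (the layout and paths are assumptions, and this uses the current org.apache.hadoop.mapreduce API rather than the 2010-era one):

        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class TweetsPerUser {
            // Map: for every tweet row, emit (user_id, 1)
            public static class TweetMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
                private static final LongWritable ONE = new LongWritable(1);
                private final Text userId = new Text();
                @Override
                protected void map(LongWritable offset, Text row, Context ctx)
                        throws IOException, InterruptedException {
                    userId.set(row.toString().split("\t")[0]);
                    ctx.write(userId, ONE);
                }
            }

            // Reduce: the shuffle has already grouped and sorted by user_id, so just sum
            public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
                @Override
                protected void reduce(Text userId, Iterable<LongWritable> ones, Context ctx)
                        throws IOException, InterruptedException {
                    long sum = 0;
                    for (LongWritable one : ones) sum += one.get();
                    ctx.write(userId, new LongWritable(sum));
                }
            }

            public static void main(String[] args) throws Exception {
                Job job = Job.getInstance(new Configuration(), "tweets-per-user");
                job.setJarByClass(TweetsPerUser.class);
                job.setMapperClass(TweetMapper.class);
                job.setCombinerClass(SumReducer.class); // pre-sum map-side to cut shuffle volume
                job.setReducerClass(SumReducer.class);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(LongWritable.class);
                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }

    Because map and reduce tasks are independent, adding machines adds slots for both, which is where the slide's "with 2x machines, runs 2x faster" comes from.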
  • 33-34. Two Analysis Challenges 1. Compute friendships in Twitter's social graph • grep, awk? No way. • Data is in MySQL... self join on an n-billion row table? • n,000,000,000 x n,000,000,000 = ? • I don't know either.
  • 35. Two Analysis Challenges 2. Large-scale grouping and counting • select count(*) from users? maybe. • select count(*) from tweets? uh... • Imagine joining them. • And grouping. • And sorting.
  • 36-37. Back to Hadoop • Didn't we have a cluster of machines? • Hadoop makes it easy to distribute the calculation • Purpose-built for parallel calculation • Just a slight mindset adjustment • But a fun one!
  • 38. Analysis at Scale • Now we’re rolling • Count all tweets: 12 billion, 5 minutes • Hit FlockDB in parallel to assemble social graph aggregates • Run pagerank across users to calculate reputations
  • 39. But... • Analysis typically in Java • Single-input, two-stage data flow is rigid • Projections, filters: custom code • Joins lengthy, error-prone • n-stage jobs: hard to manage • Exploration requires compilation
  • 40. Three Challenges • Collecting Data • Large-Scale Storage & Analysis • Rapid Learning over Big Data
  • 41. Pig • High level language • Transformations on sets of records • Process data one step at a time • Easier than SQL?
  • 42. Why Pig? Because I bet you can read the following script (the Pig script appears as an image on the slide and was not captured in this transcript)
  • 43. A Real Pig Script • Just for fun... the same calculation in Java (also a slide image)
  • 44. No, Seriously.
  • 45-48. Pig Makes it Easy • 5% of the code • 5% of the dev time • Within 25% of the running time • Readable, reusable
  • 49-51. One Thing I've Learned • It's easy to answer questions. • It's hard to ask the right questions. • Value the system that promotes innovation and iteration • More minds contributing = more value from your data
  • 52-57. Counting Big Data • How many requests per day? • Average latency? 95th-percentile latency? • Response code distribution per hour? • Searches per day? • Unique users searching, unique queries? • Geographic distribution of queries?
  • 58-62. Correlating Big Data • Usage difference for mobile users? • ... for users on desktop clients? • Cohort analyses • What features get users hooked? • What do successful users use often?
  • 63-68. Research on Big Data • What can we tell from a user's tweets? • ... from the tweets of their followers? • ... from the tweets of those they follow? • What influences retweet tree depth? • Duplicate detection, language detection • Machine learning
  • 69. If We Had More Time... • HBase backing namesearch • LZO compression • Protocol Buffers and Hadoop • Our open source: hadoop-lzo, elephant-bird • Realtime analytics with Cassandra
  • 70. Questions? Follow me at twitter.com/kevinweil