Big Data at Twitter, Chirp 2010

Slides from my talk on collecting, storing, and analyzing big data at Twitter for Chirp Hack Day at Twitter's Chirp conference.

    1. Big Data at Twitter #chirpdata • Kevin Weil • @kevinweil • Twitter Analytics
    2. Three Challenges • Collecting Data • Large-Scale Storage & Analysis • Rapid Learning over Big Data
    4. Data, Data Everywhere • You guys generate a lot of data • Anybody want to guess? • 7 TB/day (2+ PB/yr) • 10,000 CDs • 5 million floppy disks • 225 GB while I give this talk
    9. Syslog? • Started with syslog-ng • As our volume grew, it didn’t scale • Resources overwhelmed • Lost data
    11. Scribe • Surprise! FB had same problem, built and open-sourced Scribe • Log collection framework over Thrift • You write log lines, with categories • It does the rest
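
A minimal sketch of what “you write log lines, with categories” looks like from a Java client, since the deck’s visuals don’t survive in this transcript. The package and class names below come from Thrift codegen on scribe.thrift; the exact namespace (`com.facebook.scribe` here), along with the category, message, and port, are illustrative assumptions.

```java
// Minimal sketch of logging to a local Scribe agent over Thrift.
// The generated package name (com.facebook.scribe) is an assumption;
// it depends on the namespace declared in scribe.thrift.
import java.util.Arrays;

import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;

import com.facebook.scribe.LogEntry;  // generated struct: (category, message)
import com.facebook.scribe.scribe;    // generated service class

public class ScribeLogger {
    public static void main(String[] args) throws Exception {
        // Scribe agents conventionally listen on localhost:1463.
        TFramedTransport transport = new TFramedTransport(new TSocket("localhost", 1463));
        scribe.Client client = new scribe.Client(new TBinaryProtocol(transport));
        transport.open();

        // You write log lines, with categories; Scribe routes on the category.
        LogEntry entry = new LogEntry("web_requests", "GET /home 200 12ms");
        client.Log(Arrays.asList(entry));

        transport.close();
    }
}
```

Because the agent runs on the same box, an upstream network outage just means it buffers locally and retries later, which is the “reliable in network outage” point on the next slide.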
    12. Scribe [Diagram: front-end (FE) nodes feeding aggregator (Agg) nodes, which write to file/HDFS outputs] • Runs locally; reliable in network outage • Nodes only know their downstream writer; hierarchical, scalable • Pluggable outputs (file, HDFS)
    15. Scribe at Twitter • Solved our problem, opened new vistas • Currently 30 different categories logged from JavaScript, RoR, Scala, etc. • We improved logging, monitoring, writing to Hadoop, compression
    16. Scribe at Twitter • Continuing to work with FB • GSoC project! Help make it more awesome. • http://github.com/traviscrawford/scribe • http://wiki.developers.facebook.com/index.php/User:GSoC
    17. Three Challenges • Collecting Data • Large-Scale Storage & Analysis • Rapid Learning over Big Data
    18. How do you store 7TB/day? • Single machine? • What’s HD write speed? • 80 MB/s • 24.3 hrs to write 7 TB • Uh oh.
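
Checking the arithmetic behind that last bullet: 7 TB ≈ 7,000,000 MB, and 7,000,000 MB ÷ 80 MB/s = 87,500 s ≈ 24.3 hours, so a single disk can’t even absorb one day’s writes.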
    22. Where do I put 7TB/day? • Need a cluster of machines • ... which adds new layers of complexity
    24. Hadoop • Distributed file system • Automatic replication, fault tolerance • MapReduce-based parallel computation • Key-value based computation interface allows for wide applicability
    25. Hadoop • Open source: top-level Apache project • Scalable: Y! has a 4000 node cluster • Powerful: sorted 1TB random integers in 62 seconds • Easy packaging: free Cloudera RPMs
    26. MapReduce Workflow [Diagram: inputs → map tasks → shuffle/sort → reduce tasks → outputs] • Challenge: how many tweets per user, given tweets table? • Input: key=row, value=tweet info • Map: output key=user_id, value=1 • Shuffle: sort by user_id • Reduce: for each user_id, sum • Output: user_id, tweet count • With 2x machines, runs 2x faster
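
To make the workflow concrete, here is a minimal Hadoop MapReduce sketch of the tweets-per-user count the slide describes. The input layout (tab-separated rows with user_id in the first column) is an assumption for illustration, not Twitter’s actual schema.

```java
// Tweets-per-user: a minimal Hadoop MapReduce sketch of the workflow above.
// Assumes tab-separated input lines with the tweeting user's id in column 0;
// that layout is illustrative, not Twitter's real schema.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TweetsPerUser {
  // Map: for each tweet row, emit (user_id, 1).
  public static class TweetMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text userId = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] cols = value.toString().split("\t");
      userId.set(cols[0]);
      ctx.write(userId, ONE);  // the shuffle groups and sorts these by user_id
    }
  }

  // Reduce: for each user_id, sum the 1s.
  public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text userId, Iterable<LongWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable c : counts) sum += c.get();
      ctx.write(userId, new LongWritable(sum));  // output: user_id, tweet count
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "tweets-per-user");
    job.setJarByClass(TweetsPerUser.class);
    job.setMapperClass(TweetMapper.class);
    job.setCombinerClass(SumReducer.class);  // safe: addition is associative
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The combiner pre-sums counts on each map node, shrinking shuffle traffic, which is safe because addition is associative and commutative. And since input splits are processed independently, doubling machines roughly halves the wall-clock time: the “2x machines, 2x faster” point.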
    33. Two Analysis Challenges 1. Compute friendships in Twitter’s social graph • grep, awk? No way. • Data is in MySQL... self join on an n-billion row table? • n,000,000,000 x n,000,000,000 = ? • I don’t know either.
    35. Two Analysis Challenges 2. Large-scale grouping and counting • select count(*) from users? maybe. • select count(*) from tweets? uh... • Imagine joining them. • And grouping. • And sorting.
    36. Back to Hadoop • Didn’t we have a cluster of machines? • Hadoop makes it easy to distribute the calculation • Purpose-built for parallel calculation • Just a slight mindset adjustment • But a fun one!
    38. Analysis at Scale • Now we’re rolling • Count all tweets: 12 billion, 5 minutes • Hit FlockDB in parallel to assemble social graph aggregates • Run pagerank across users to calculate reputations
    39. But... • Analysis typically in Java • Single-input, two-stage data flow is rigid • Projections, filters: custom code • Joins lengthy, error-prone • n-stage jobs: hard to manage • Exploration requires compilation
    40. Three Challenges • Collecting Data • Large-Scale Storage & Analysis • Rapid Learning over Big Data
    41. Pig • High level language • Transformations on sets of records • Process data one step at a time • Easier than SQL?
    42. Why Pig? Because I bet you can read the following script
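
The script itself was shown as an image and isn’t reproduced in this transcript. As a stand-in, here is a representative Pig Latin sketch of the familiar tweets-per-user count; the relation and field names are made up for illustration.

```pig
-- Representative sketch only; not the script from the deck.
tweets  = LOAD 'tweets' AS (user_id: long, tweet_text: chararray);
by_user = GROUP tweets BY user_id;
counts  = FOREACH by_user GENERATE group AS user_id, COUNT(tweets) AS num_tweets;
STORE counts INTO 'tweets_per_user';
```

Four lines of transformations on sets of records, one step at a time, which is the readability argument the following slides make against the equivalent Java.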
    43. A Real Pig Script • Just for fun... the same calculation in Java
    44. No, Seriously.
    45. Pig Makes it Easy • 5% of the code • 5% of the dev time • Within 25% of the running time • Readable, reusable
    49. One Thing I’ve Learned • It’s easy to answer questions. • It’s hard to ask the right questions. • Value the system that promotes innovation and iteration • More minds contributing = more value from your data
    52. Counting Big Data • How many requests per day? • Average latency? 95% latency? • Response code distribution per hour? • Searches per day? • Unique users searching, unique queries? • Geographic distribution of queries?
    58. Correlating Big Data • Usage difference for mobile users? • ... for users on desktop clients? • Cohort analyses • What features get users hooked? • What do successful users use often?
    63. Research on Big Data • What can we tell from a user’s tweets? • ... from the tweets of their followers? • ... from the tweets of those they follow? • What influences retweet tree depth? • Duplicate detection, language detection • Machine learning
    69. If We Had More Time... • HBase backing namesearch • LZO compression • Protocol Buffers and Hadoop • Our open source: hadoop-lzo, elephant-bird • Realtime analytics with Cassandra
    70. Questions? Follow me at twitter.com/kevinweil