Big Data at Twitter, Chirp 2010

16,259 views
15,166 views

Published on

Slides from my talk on collecting, storing, and analyzing big data at Twitter for Chirp Hack Day at Twitter's Chirp conference.

Published in: Technology
4 Comments
47 Likes
Statistics
Notes
No Downloads
Views
Total views
16,259
On SlideShare
0
From Embeds
0
Number of Embeds
714
Actions
Shares
0
Downloads
653
Comments
4
Likes
47
Embeds 0
No embeds

No notes for slide







































































  • Big Data at Twitter, Chirp 2010

    1. Big Data at Twitter #chirpdata Kevin Weil @kevinweil Twitter Analytics
    2. Three Challenges • Collecting Data • Large-Scale Storage & Analysis • Rapid Learning over Big Data
    3. Three Challenges • Collecting Data • Large-Scale Storage & Analysis • Rapid Learning over Big Data
    4. Data, Data Everywhere • You guys generate a lot of data • Anybody want to guess?
    5. Data, Data Everywhere • You guys generate a lot of data • Anybody want to guess? • 7 TB/day (2+ PB/yr)
    6. Data, Data Everywhere • You guys generate a lot of data • Anybody want to guess? • 7 TB/day (2+ PB/yr) • 10,000 CDs
    7. Data, Data Everywhere • You guys generate a lot of data • Anybody want to guess? • 7 TB/day (2+ PB/yr) • 10,000 CDs • 5 million floppy disks
    8. Data, Data Everywhere • You guys generate a lot of data • Anybody want to guess? • 7 TB/day (2+ PB/yr) • 10,000 CDs • 5 million floppy disks • 225 GB while I give this talk
    9. Syslog? • Started with syslog-ng • As our volume grew, it didn’t scale
    10. Syslog? • Started with syslog-ng • As our volume grew, it didn’t scale • Resources overwhelmed • Lost data
    11. Scribe • Surprise! FB had same problem, built and open-sourced Scribe • Log collection framework over Thrift • You write log lines, with categories • It does the rest
    12. Scribe FE FE FE • Runs locally; reliable in network outage
    13. Scribe FE FE FE • Runs locally; reliable in network outage • Nodes only know downstream writer; Agg Agg hierarchical, scalable
    14. Scribe FE FE FE • Runs locally; reliable in network outage • Nodes only know downstream writer; Agg Agg hierarchical, scalable • Pluggable outputs File HDFS
    15. Scribe at Twitter • Solved our problem, opened new vistas • Currently 30 different categories logged from javascript, RoR, Scala, etc • We improved logging, monitoring, writing to Hadoop, compression
    16. Scribe at Twitter • Continuing to work with FB • GSoC project! Help make it more awesome. • http://github.com/traviscrawford/scribe • http://wiki.developers.facebook.com/index.php/User:GSoC
    17. Three Challenges • Collecting Data • Large-Scale Storage & Analysis • Rapid Learning over Big Data
    18. How do you store 7TB/day? • Single machine? • What’s HD write speed?
    19. How do you store 7TB/day? • Single machine? • What’s HD write speed? • 80 MB/s
    20. How do you store 7TB/day? • Single machine? • What’s HD write speed? • 80 MB/s • 24.3 hrs to write 7 TB
    21. How do you store 7TB/day? • Single machine? • What’s HD write speed? • 80 MB/s • 24.3 hrs to write 7 TB • Uh oh.
    22. Where do I put 7TB/day? • Need a cluster of machines
    23. Where do I put 7TB/day? • Need a cluster of machines • ... which adds new layers of complexity
    24. Hadoop • Distributed file system • Automatic replication, fault tolerance • MapReduce-based parallel computation • Key-value based computation interface allows for wide applicability
    25. Hadoop • Open source: top-level Apache project • Scalable: Y! has a 4000 node cluster • Powerful: sorted 1TB random integers in 62 seconds • Easy packaging: free Cloudera RPMs
    26. Inputs MapReduce Workflow Shuffle/ Map Sort • Challenge: how many tweets per Map Outputs user, given tweets table? Map Reduce • Input: key=row, value=tweet info • Map: output key=user_id, value=1 Map Reduce • Shuffle: sort by user_id Map Reduce • Reduce: for each user_id, sum Map • Output: user_id, tweet count • With 2x machines, runs 2x faster Map
    27. Inputs MapReduce Workflow Shuffle/ Map Sort • Challenge: how many tweets per Map Outputs user, given tweets table? Map Reduce • Input: key=row, value=tweet info • Map: output key=user_id, value=1 Map Reduce • Shuffle: sort by user_id Map Reduce • Reduce: for each user_id, sum Map • Output: user_id, tweet count • With 2x machines, runs 2x faster Map
    28. Inputs MapReduce Workflow Shuffle/ Map Sort • Challenge: how many tweets per Map Outputs user, given tweets table? Map Reduce • Input: key=row, value=tweet info • Map: output key=user_id, value=1 Map Reduce • Shuffle: sort by user_id Map Reduce • Reduce: for each user_id, sum Map • Output: user_id, tweet count • With 2x machines, runs 2x faster Map
    29. Inputs MapReduce Workflow Shuffle/ Map Sort • Challenge: how many tweets per Map Outputs user, given tweets table? Map Reduce • Input: key=row, value=tweet info • Map: output key=user_id, value=1 Map Reduce • Shuffle: sort by user_id Map Reduce • Reduce: for each user_id, sum Map • Output: user_id, tweet count • With 2x machines, runs 2x faster Map
    30. Inputs MapReduce Workflow Shuffle/ Map Sort • Challenge: how many tweets per Map Outputs user, given tweets table? Map Reduce • Input: key=row, value=tweet info • Map: output key=user_id, value=1 Map Reduce • Shuffle: sort by user_id Map Reduce • Reduce: for each user_id, sum Map • Output: user_id, tweet count • With 2x machines, runs 2x faster Map
    31. Inputs MapReduce Workflow Shuffle/ Map Sort • Challenge: how many tweets per Map Outputs user, given tweets table? Map Reduce • Input: key=row, value=tweet info • Map: output key=user_id, value=1 Map Reduce • Shuffle: sort by user_id Map Reduce • Reduce: for each user_id, sum Map • Output: user_id, tweet count • With 2x machines, runs 2x faster Map
    32. Inputs MapReduce Workflow Shuffle/ Map Sort • Challenge: how many tweets per Map Outputs user, given tweets table? Map Reduce • Input: key=row, value=tweet info • Map: output key=user_id, value=1 Map Reduce • Shuffle: sort by user_id Map Reduce • Reduce: for each user_id, sum Map • Output: user_id, tweet count • With 2x machines, runs 2x faster Map
    33. Two Analysis Challenges 1. Compute friendships in Twitter’s social graph • grep, awk? No way. • Data is in MySQL... self join on an n- billion row table? • n,000,000,000 x n,000,000,000 = ?
    34. Two Analysis Challenges 1. Compute friendships in Twitter’s social graph • grep, awk? No way. • Data is in MySQL... self join on an n- billion row table? • n,000,000,000 x n,000,000,000 = ? • I don’t know either.
    35. Two Analysis Challenges 2. Large-scale grouping and counting • select count(*) from users? maybe. • select count(*) from tweets? uh... • Imagine joining them. • And grouping. • And sorting.
    36. Back to Hadoop • Didn’t we have a cluster of machines? • Hadoop makes it easy to distribute the calculation • Purpose-built for parallel calculation • Just a slight mindset adjustment
    37. Back to Hadoop • Didn’t we have a cluster of machines? • Hadoop makes it easy to distribute the calculation • Purpose-built for parallel calculation • Just a slight mindset adjustment • But a fun one!
    38. Analysis at Scale • Now we’re rolling • Count all tweets: 12 billion, 5 minutes • Hit FlockDB in parallel to assemble social graph aggregates • Run pagerank across users to calculate reputations
    39. But... • Analysis typically in Java • Single-input, two-stage data flow is rigid • Projections, filters: custom code • Joins lengthy, error-prone • n-stage jobs: hard to manage • Exploration requires compilation
    40. Three Challenges • Collecting Data • Large-Scale Storage & Analysis • Rapid Learning over Big Data
    41. Pig • High level language • Transformations on sets of records • Process data one step at a time • Easier than SQL?
    42. Why Pig? Because I bet you can read the following script
    43. A Real Pig Script • Just for fun... the same calculation in Java
    44. No, Seriously.
    45. Pig Makes it Easy • 5% of the code
    46. Pig Makes it Easy • 5% of the code • 5% of the dev time
    47. Pig Makes it Easy • 5% of the code • 5% of the dev time • Within 25% of the running time
    48. Pig Makes it Easy • 5% of the code • 5% of the dev time • Within 25% of the running time • Readable, reusable
    49. One Thing I’ve Learned • It’s easy to answer questions. • It’s hard to ask the right questions
    50. One Thing I’ve Learned • It’s easy to answer questions. • It’s hard to ask the right questions. • Value the system that promotes innovation and iteration
    51. One Thing I’ve Learned • It’s easy to answer questions. • It’s hard to ask the right questions. • Value the system that promotes innovation and iteration • More minds contributing = more value from your data
    52. Counting Big Data • How many requests per day?
    53. Counting Big Data • How many requests per day? • Average latency? 95% latency?
    54. Counting Big Data • How many requests per day? • Average latency? 95% latency? • Response code distribution per hour?
    55. Counting Big Data • How many requests per day? • Average latency? 95% latency? • Response code distribution per hour? • Searches per day?
    56. Counting Big Data • How many requests per day? • Average latency? 95% latency? • Response code distribution per hour? • Searches per day? • Unique users searching, unique queries?
    57. Counting Big Data • How many requests per day? • Average latency? 95% latency? • Response code distribution per hour? • Searches per day? • Unique users searching, unique queries? • Geographic distribution of queries?
    58. Correlating Big Data • Usage difference for mobile users?
    59. Correlating Big Data • Usage difference for mobile users? • ... for users on desktop clients?
    60. Correlating Big Data • Usage difference for mobile users? • ... for users on desktop clients? • Cohort analyses
    61. Correlating Big Data • Usage difference for mobile users? • ... for users on desktop clients? • Cohort analyses • What features get users hooked?
    62. Correlating Big Data • Usage difference for mobile users? • ... for users on desktop clients? • Cohort analyses • What features get users hooked? • What do successful users use often?
    63. Research on Big Data • What can we tell from a user’s tweets?
    64. Research on Big Data • What can we tell from a user’s tweets? • ... from the tweets of their followers?
    65. Research on Big Data • What can we tell from a user’s tweets? • ... from the tweets of their followers? • ... from the tweets of those they follow?
    66. Research on Big Data • What can we tell from a user’s tweets? • ... from the tweets of their followers? • ... from the tweets of those they follow? • What influences retweet tree depth?
    67. Research on Big Data • What can we tell from a user’s tweets? • ... from the tweets of their followers? • ... from the tweets of those they follow? • What influences retweet tree depth? • Duplicate detection, language detection
    68. Research on Big Data • What can we tell from a user’s tweets? • ... from the tweets of their followers? • ... from the tweets of those they follow? • What influences retweet tree depth? • Duplicate detection, language detection • Machine learning
    69. If We Had More Time... • HBase backing namesearch • LZO compression • Protocol Buffers and Hadoop • Our open source: hadoop-lzo, elephant- bird • Realtime analytics with Cassandra
    70. Questions? Follow me at twitter.com/kevinweil

    ×