Big Data at Twitter
#chirpdata

     Kevin Weil
     @kevinweil

     Twitter Analytics
Three Challenges
• Collecting Data
• Large-Scale Storage & Analysis
• Rapid Learning over Big Data
Data, Data Everywhere
• You guys generate a lot of data
• Anybody want to guess?
• 7 TB/day (2+ PB/yr)
• 10,000 CDs
• 5 million floppy disks
• 225 GB while I give this talk
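A quick back-of-the-envelope check on those equivalences. The media capacities here are my assumptions (~700 MB per CD, ~1.44 MB per floppy), using decimal (SI) bytes throughout:

```python
# Back-of-the-envelope check of the slide's numbers; the ~700 MB/CD and
# ~1.44 MB/floppy capacities are assumptions, and bytes are decimal (SI).
TB = 10**12
daily = 7 * TB                       # 7 TB logged per day

per_year_pb = daily * 365 / 10**15   # just over 2.5 PB, i.e. "2+ PB/yr"
cds = daily / (700 * 10**6)          # -> 10000.0 CDs
floppies = daily / (1.44 * 10**6)    # -> roughly 4.9 million floppies

print(per_year_pb, cds, int(floppies))
```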
Syslog?
• Started with syslog-ng
• As our volume grew, it didn’t scale
• Resources overwhelmed
• Lost data
Scribe
• Surprise! FB had same problem, built
and open-sourced Scribe
• Log collection framework over Thrift
• You write log lines, with categories
• It does the rest

Scribe
[Diagram: FE nodes feed Agg aggregator nodes, which write to File/HDFS]
• Runs locally; reliable in network outage
• Nodes only know downstream writer;
hierarchical, scalable
• Pluggable outputs: File, HDFS
Scribe at Twitter
• Solved our problem, opened new vistas
• Currently 30 different categories logged
from javascript, RoR, Scala, etc
• We improved logging, monitoring, writing
to Hadoop, compression

Scribe at Twitter
• Continuing to work with FB
• GSoC project! Help make it more awesome.
• http://github.com/traviscrawford/scribe
• http://wiki.developers.facebook.com/index.php/User:GSoC
Three Challenges
• Collecting Data
• Large-Scale Storage & Analysis
• Rapid Learning over Big Data
How do you store 7TB/day?
• Single machine?
• What’s HD write speed?
• 80 MB/s
• 24.3 hrs to write 7 TB
• Uh oh.
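The 24.3-hour figure follows directly from the two numbers above:

```python
# 7 TB at a sustained 80 MB/s sequential write (decimal bytes).
bytes_to_write = 7 * 10**12
write_speed = 80 * 10**6   # bytes per second

hours = bytes_to_write / write_speed / 3600
print(round(hours, 1))     # -> 24.3: more than a day to write one day of data
```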
Where do I put 7TB/day?
• Need a cluster of machines
• ... which adds new layers of complexity
Hadoop
• Distributed file system
   • Automatic replication, fault tolerance
• MapReduce-based parallel computation
   • Key-value based computation interface
   allows for wide applicability

Hadoop
• Open source: top-level Apache project
• Scalable: Y! has a 4000 node cluster
• Powerful: sorted 1TB random integers in 62 seconds
• Easy packaging: free Cloudera RPMs
MapReduce Workflow
[Diagram: Inputs → Map → Shuffle/Sort → Reduce → Outputs]
• Challenge: how many tweets per user,
given tweets table?
• Input: key=row, value=tweet info
• Map: output key=user_id, value=1
• Shuffle: sort by user_id
• Reduce: for each user_id, sum
• Output: user_id, tweet count
• With 2x machines, runs 2x faster
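The map/shuffle/reduce steps above can be sketched as a toy in-memory version in Python (the records and field names here are illustrative, not Twitter’s actual schema):

```python
from itertools import groupby
from operator import itemgetter

# Toy input: (row_key, tweet_info) pairs standing in for the tweets table.
tweets = [(1, {"user_id": "a"}), (2, {"user_id": "b"}),
          (3, {"user_id": "a"}), (4, {"user_id": "a"})]

# Map: emit (user_id, 1) for every tweet.
mapped = [(info["user_id"], 1) for _, info in tweets]

# Shuffle: sort by user_id so each user's records are contiguous.
shuffled = sorted(mapped, key=itemgetter(0))

# Reduce: for each user_id, sum the 1s to get a tweet count.
counts = {uid: sum(v for _, v in group)
          for uid, group in groupby(shuffled, key=itemgetter(0))}

print(counts)  # -> {'a': 3, 'b': 1}
```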
Two Analysis Challenges
1. Compute friendships in Twitter’s social graph
   • grep, awk? No way.
   • Data is in MySQL... self join on an
   n-billion row table?
   • n,000,000,000 x n,000,000,000 = ?
   • I don’t know either.

Two Analysis Challenges
2. Large-scale grouping and counting
   • select count(*) from users? maybe.
   • select count(*) from tweets? uh...
   • Imagine joining them.
   • And grouping.
   • And sorting.
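To see why the naive self-join is hopeless, count the candidate pairs it would consider (taking n = 1 here; the slide's table is n billion rows):

```python
rows = 1_000_000_000            # a 1-billion-row social-graph table
candidate_pairs = rows * rows   # naive self-join comparison space

print(f"{candidate_pairs:e}")   # -> 1.000000e+18, a quintillion pairs
```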
Back to Hadoop
• Didn’t we have a cluster of machines?
• Hadoop makes it easy to distribute the calculation
• Purpose-built for parallel calculation
• Just a slight mindset adjustment
• But a fun one!
Analysis at Scale
• Now we’re rolling
• Count all tweets: 12 billion, 5 minutes
• Hit FlockDB in parallel to assemble
social graph aggregates
• Run pagerank across users to
calculate reputations
But...
• Analysis typically in Java
• Single-input, two-stage data flow is rigid
• Projections, filters: custom code
• Joins lengthy, error-prone
• n-stage jobs: hard to manage
• Exploration requires compilation
Three Challenges
• Collecting Data
• Large-Scale Storage & Analysis
• Rapid Learning over Big Data
Pig
• High level language
• Transformations on
sets of records
• Process data one
step at a time
• Easier than SQL?
Why Pig?
 Because I bet you can read
 the following script
A Real Pig Script

[Script image not captured in this transcript]

• Just for fun... the same calculation in Java
No, Seriously.
Pig Makes it Easy
• 5% of the code
• 5% of the dev time
• Within 25% of the running time
• Readable, reusable
One Thing I’ve Learned
• It’s easy to answer questions.
• It’s hard to ask the right questions.
• Value the system that promotes
innovation and iteration
• More minds contributing = more value
from your data
Counting Big Data
• How many requests per day?
• Average latency? 95% latency?
• Response code distribution per hour?
• Searches per day?
• Unique users searching, unique queries?
• Geographic distribution of queries?
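Each of these questions reduces to a simple aggregation once the logs sit in one place. A toy in-memory version of the first few (field names are illustrative, not Twitter's actual log schema; at scale this would be a Hadoop job over the Scribe logs in HDFS):

```python
import math
from collections import Counter

# Toy request log held in memory; field names here are illustrative.
requests = [
    {"latency_ms": 40,  "status": 200, "hour": 0},
    {"latency_ms": 55,  "status": 200, "hour": 0},
    {"latency_ms": 300, "status": 503, "hour": 1},
    {"latency_ms": 70,  "status": 200, "hour": 1},
]

total = len(requests)                                 # requests per day
latencies = sorted(r["latency_ms"] for r in requests)
avg_latency = sum(latencies) / total                  # average latency
p95_latency = latencies[math.ceil(0.95 * total) - 1]  # nearest-rank 95th pct
codes_per_hour = Counter((r["hour"], r["status"]) for r in requests)

print(total, avg_latency, p95_latency, dict(codes_per_hour))
```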
Correlating Big Data
• Usage difference for mobile users?
• ... for users on desktop clients?
• Cohort analyses
• What features get users hooked?
• What do successful users use often?
Research on Big Data
• What can we tell from a user’s tweets?
• ... from the tweets of their followers?
• ... from the tweets of those they follow?
• What influences retweet tree depth?
• Duplicate detection, language detection
• Machine learning
If We Had More Time...
• HBase backing namesearch
• LZO compression
• Protocol Buffers and Hadoop
• Our open source: hadoop-lzo, elephant-bird
• Realtime analytics with Cassandra
Questions?
        Follow me at
        twitter.com/kevinweil
Big Data at Twitter, Chirp 2010

Slides from my talk on collecting, storing, and analyzing big data at Twitter for Chirp Hack Day at Twitter's Chirp conference.