Big Data at Twitter
#chirpdata

Kevin Weil
@kevinweil

Twitter Analytics
Three Challenges
• Collecting Data
• Large-Scale Storage & Analysis
• Rapid Learning over Big Data
Data, Data Everywhere
• You guys generate a lot of data
• Anybody want to guess?
• 7 TB/day (2+ PB/yr)
  • 10,000 CDs
  • 5 million floppy disks
  • 225 GB while I give this talk
Syslog?
• Started with syslog-ng
• As our volume grew, it didn’t scale
• Resources overwhelmed
• Lost data
Scribe
• Surprise! FB had the same problem, built and open-sourced Scribe
• Log collection framework over Thrift
• You write log lines, with categories
• It does the rest
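
For flavor, a minimal client-side sketch (not Twitter's production code; it assumes classes generated from the stock scribe.thrift IDL and the conventional port 1463):

import java.util.Collections;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;

public class ScribeExample {
  public static void main(String[] args) throws Exception {
    // Scribe servers speak framed binary Thrift, conventionally on port 1463.
    TFramedTransport transport =
        new TFramedTransport(new TSocket("localhost", 1463));
    transport.open();
    scribe.Client client = new scribe.Client(new TBinaryProtocol(transport));

    // "You write log lines, with categories" -- Scribe routes on the category.
    LogEntry line = new LogEntry("web_requests", "GET /home 200 12ms");
    client.Log(Collections.singletonList(line));

    transport.close();
  }
}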
Scribe
[Diagram: front-end (FE) nodes feed aggregator (Agg) nodes, which write to File and HDFS outputs]
• Runs locally; reliable in network outage
• Nodes only know downstream writer; hierarchical, scalable
• Pluggable outputs
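
That hierarchy is wired up purely through per-node config. A sketch of a front-end node's config (assuming the open-source Scribe's config syntax; the aggregator hostname is made up): forward everything to the one downstream writer, and spill to local disk when the network is out:

port=1463

<store>
category=default
type=buffer

# Primary: the single downstream writer this node knows about.
<primary>
type=network
remote_host=agg1.example.com
remote_port=1463
</primary>

# Secondary: buffer to local disk during an outage, replay later.
<secondary>
type=file
fs_type=std
file_path=/var/tmp/scribe
base_filename=default
</secondary>
</store>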
Scribe at Twitter
• Solved our problem, opened new vistas
• Currently 30 different categories logged from JavaScript, RoR, Scala, etc.
• We improved logging, monitoring, writing to Hadoop, compression
Scribe at Twitter
• Continuing to work with FB
• GSoC project! Help make it more awesome.
• http://github.com/traviscrawford/scribe
• http://wiki.developers.facebook.com/index.php/User:GSoC
Three Challenges
• Collecting Data
• Large-Scale Storage & Analysis
• Rapid Learning over Big Data
How do you store 7 TB/day?
• Single machine?
• What’s HD write speed?
• 80 MB/s
• 24.3 hrs to write 7 TB
• Uh oh.
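
(Checking the arithmetic: 7 TB ≈ 7,000,000 MB, and 7,000,000 MB ÷ 80 MB/s = 87,500 s ≈ 24.3 hours.)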
Where do I put 7 TB/day?
• Need a cluster of machines
• ... which adds new layers of complexity
Hadoop
• Distributed file system
  • Automatic replication, fault tolerance
• MapReduce-based parallel computation
  • Key-value based computation interface allows for wide applicability
Hadoop
• Open source: top-level Apache project
• Scalable: Y! has a 4000-node cluster
• Powerful: sorted 1 TB of random integers in 62 seconds
• Easy packaging: free Cloudera RPMs
MapReduce Workflow
[Diagram: Inputs → Map tasks → Shuffle/Sort → Reduce tasks → Outputs]
• Challenge: how many tweets per user, given the tweets table?
• Input: key=row, value=tweet info
• Map: output key=user_id, value=1
• Shuffle: sort by user_id
• Reduce: for each user_id, sum
• Output: user_id, tweet count
• With 2x machines, runs 2x faster
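
In code, that's roughly the following (a sketch, not the deck's actual job; the tab-separated record layout with user_id in the first field is an assumption for illustration):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TweetsPerUser {
  // Map: for each row of the tweets table, emit (user_id, 1).
  public static class TweetMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text userId = new Text();

    @Override
    protected void map(LongWritable row, Text tweetInfo, Context ctx)
        throws IOException, InterruptedException {
      // Assumption: tab-separated tweet records, user_id in field 0.
      userId.set(tweetInfo.toString().split("\t", 2)[0]);
      ctx.write(userId, ONE);
    }
  }

  // Reduce: the shuffle has already grouped by user_id; just sum the 1s.
  public static class TweetReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text userId, Iterable<LongWritable> ones, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable one : ones) sum += one.get();
      ctx.write(userId, new LongWritable(sum)); // (user_id, tweet count)
    }
  }
}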
Two Analysis Challenges
1. Compute friendships in Twitter’s social graph
   • grep, awk? No way.
   • Data is in MySQL... self join on an n-billion row table?
   • n,000,000,000 x n,000,000,000 = ?
   • I don’t know either.
Two Analysis Challenges
2. Large-scale grouping and counting
   • select count(*) from users? maybe.
   • select count(*) from tweets? uh...
   • Imagine joining them.
   • And grouping.
   • And sorting.
Back to Hadoop
• Didn’t we have a cluster of machines?
• Hadoop makes it easy to distribute the calculation
• Purpose-built for parallel calculation
• Just a slight mindset adjustment
• But a fun one!
Analysis at Scale
• Now we’re rolling
• Count all tweets: 12 billion, 5 minutes
• Hit FlockDB in parallel to assemble social graph aggregates
• Run PageRank across users to calculate reputations
But...
• Analysis typically in Java
• Single-input, two-stage data flow is rigid
• Projections, filters: custom code
• Joins lengthy, error-prone
• n-stage jobs: hard to manage
• Exploration requires compilation
Three Challenges
• Collecting Data
• Large-Scale Storage & Analysis
• Rapid Learning over Big Data
Pig
• High-level language
• Transformations on sets of records
• Process data one step at a time
• Easier than SQL?
Why Pig?
Because I bet you can read the following script
A Real Pig Script
[The Pig script was shown as an image on the slide; not reproduced in this transcript]
• Just for fun... the same calculation in Java
[The equivalent Java code, several slides of it, was also shown as images]
No, Seriously.
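
The Java listing itself isn't reproduced here, but even the driver boilerplate alone, sketched against the tweets-per-user example from earlier (again an illustration, not the original code), hints at where the other 95% of the code goes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TweetsPerUserDriver {
  public static void main(String[] args) throws Exception {
    // Pure wiring: classes, key/value types, input/output paths.
    Job job = new Job(new Configuration(), "tweets-per-user");
    job.setJarByClass(TweetsPerUserDriver.class);
    job.setMapperClass(TweetsPerUser.TweetMapper.class);
    job.setCombinerClass(TweetsPerUser.TweetReducer.class); // sum is associative
    job.setReducerClass(TweetsPerUser.TweetReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}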
Pig Makes it Easy
• 5% of the code
• 5% of the dev time
• Within 25% of the running time
• Readable, reusable
One Thing I’ve Learned
• It’s easy to answer questions.
• It’s hard to ask the right questions.
• Value the system that promotes innovation and iteration
• More minds contributing = more value from your data
Counting Big Data
• How many requests per day?
• Average latency? 95th-percentile latency?
• Response code distribution per hour?
• Searches per day?
• Unique users searching, unique queries?
• Geographic distribution of queries?
Correlating Big Data
• Usage difference for mobile users?
• ... for users on desktop clients?
• Cohort analyses
• What features get users hooked?
• What do successful users use often?
Research on Big Data
• What can we tell from a user’s tweets?
• ... from the tweets of their followers?
• ... from the tweets of those they follow?
• What influences retweet tree depth?
• Duplicate detection, language detection
• Machine learning
If We Had More Time...
• HBase backing namesearch
• LZO compression
• Protocol Buffers and Hadoop
• Our open source: hadoop-lzo, elephant-bird
• Realtime analytics with Cassandra
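
A taste of the LZO piece, as a sketch (it assumes hadoop-lzo and its native libraries are installed on the cluster; the property names come from that project):

import org.apache.hadoop.conf.Configuration;

public class LzoSetup {
  // Register the hadoop-lzo codecs so jobs can read/write .lzo data.
  public static Configuration withLzo() {
    Configuration conf = new Configuration();
    conf.set("io.compression.codecs",
        "org.apache.hadoop.io.compress.DefaultCodec,"
        + "com.hadoop.compression.lzo.LzoCodec,"
        + "com.hadoop.compression.lzo.LzopCodec");
    conf.set("io.compression.codec.lzo.class",
        "com.hadoop.compression.lzo.LzoCodec");
    return conf;
  }
}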
Questions?
Follow me at twitter.com/kevinweil
