Analyzing Big Data at Twitter
Web 2.0 Expo NYC, September 2010
Kevin Weil (@kevinweil)

A look at Twitter’s data lifecycle, some of the tools we use to handle big data, and some of the questions we answer from our data.
Three Challenges
‣   Collecting Data
‣   Large-Scale Storage and Analysis
‣   Rapid Learning over Big Data
My Background
‣   Studied Mathematics and Physics at Harvard, Physics at Stanford
‣   Tropos Networks (city-wide wireless): GBs of data
‣   Cooliris (web media): Hadoop for analytics, TBs of data
‣   Twitter: Hadoop, Pig, machine learning, visualization, social graph analysis, PBs of data
Three Challenges
‣   Collecting Data
‣   Large-Scale Storage and Analysis
‣   Rapid Learning over Big Data
Data, Data Everywhere
‣   You guys generate a lot of data
‣   Anybody want to guess?
‣   12 TB/day (4+ PB/yr)
‣   20,000 CDs
‣   10 million floppy disks
‣   450 GB while I give this talk
Syslog?
‣   Started with syslog-ng
‣   As our volume grew, it didn’t scale
‣   Resources overwhelmed
‣   Lost data
Scribe
‣   Surprise! FB had the same problem, built and open-sourced Scribe
‣   Log collection framework over Thrift
‣   You “scribe” log lines, with categories
‣   It does the rest
Scribe
‣   Runs locally; reliable in network outage
‣   Nodes only know downstream writer; hierarchical, scalable
‣   Pluggable outputs: File, HDFS
    (diagram: FE, FE, FE front-end nodes feed Agg aggregator nodes, which write to File and HDFS outputs)
Scribe at Twitter
‣   Solved our problem, opened new vistas
‣   Currently 40 different categories logged from JavaScript, Ruby, Scala, Java, etc.
‣   We improved logging, monitoring, behavior during failure conditions, writes to HDFS, etc.
‣   Continuing to work with FB to make it better
    http://github.com/traviscrawford/scribe
Three Challenges
‣   Collecting Data
‣   Large-Scale Storage and Analysis
‣   Rapid Learning over Big Data
How Do You Store 12 TB/day?
‣   Single machine?
‣   What’s hard drive write speed?
‣   ~80 MB/s
‣   42 hours to write 12 TB
‣   Uh oh.
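Checking the arithmetic: 12 TB / (80 MB/s) = 12×10^12 B / (8×10^7 B/s) = 150,000 s ≈ 41.7 hours. Put differently, 12 TB/day is a sustained ~139 MB/s of incoming data, which already exceeds a single drive’s ~80 MB/s write speed before anyone reads a byte back.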
Where Do I Put 12 TB/day?
‣   Need a cluster of machines
‣   ... which adds new layers of complexity
Hadoop
‣   Distributed file system
    ‣   Automatic replication
    ‣   Fault tolerance
    ‣   Transparently read/write across multiple machines
‣   MapReduce-based parallel computation
    ‣   Key-value based computation interface allows for wide applicability
    ‣   Fault tolerance, again
Hadoop
‣   Open source: top-level Apache project
‣   Scalable: Y! has a 4,000-node cluster
‣   Powerful: sorted 1 TB of random integers in 62 seconds
‣   Easy packaging/install: free Cloudera RPMs
MapReduce Workflow
‣   Challenge: how many tweets per user, given the tweets table?
‣   Input: key=row, value=tweet info
‣   Map: output key=user_id, value=1
‣   Shuffle: sort by user_id
‣   Reduce: for each user_id, sum
‣   Output: user_id, tweet count
‣   With 2x machines, runs 2x faster
    (diagram: Inputs → Map tasks → Shuffle/Sort → Reduce tasks → Outputs)
Two Analysis Challenges
‣   Compute mutual followings in Twitter’s interest graph
    ‣   grep, awk? No way.
    ‣   If data is in MySQL... self-join on an n-billion-row table?
    ‣   n,000,000,000 x n,000,000,000 = ?
    ‣   I don’t know either.
Two Analysis Challenges
‣   Large-scale grouping and counting
    ‣   select count(*) from users? maybe.
    ‣   select count(*) from tweets? uh...
    ‣   Imagine joining these two.
    ‣   And grouping.
    ‣   And sorting.
Back to Hadoop
‣   Didn’t we have a cluster of machines?
‣   Hadoop makes it easy to distribute the calculation
‣   Purpose-built for parallel calculation
‣   Just a slight mindset adjustment
‣   But a fun one!
Analysis at Scale
‣   Now we’re rolling
‣   Count all tweets: 20+ billion, 5 minutes
‣   Parallel network calls to FlockDB to compute interest graph aggregates
‣   Run PageRank across users and the interest graph
But...
‣   Analysis typically in Java
‣   Single-input, two-stage data flow is rigid
‣   Projections, filters: custom code
‣   Joins lengthy, error-prone
‣   n-stage jobs hard to manage
‣   Data exploration requires compilation
Three Challenges
‣   Collecting Data
‣   Large-Scale Storage and Analysis
‣   Rapid Learning over Big Data
Pig
‣   High-level language
‣   Transformations on sets of records
‣   Process data one step at a time
‣   Easier than SQL?
Why Pig?
Because I bet you can read the following script
A Real Pig Script
    (the script appears on the slide as an image; an illustrative stand-in follows)
‣   Just for fun... the same calculation in Java next
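The original script isn’t preserved in this transcript, so as a stand-in, here is a minimal sketch in Pig Latin of the same calculation as the MapReduce example above (tweets per user); the paths and field names are assumptions:

    -- Load the tweets table, group by user, and count: the whole
    -- map/shuffle/reduce pipeline from the earlier slide.
    tweets  = LOAD '/tables/tweets' AS (user_id: long, tweet: chararray);
    grouped = GROUP tweets BY user_id;
    counts  = FOREACH grouped GENERATE group AS user_id, COUNT(tweets) AS num_tweets;
    STORE counts INTO '/results/tweets_per_user';

Pig compiles these four lines into the same map, shuffle, and reduce stages you would otherwise wire up by hand.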
No, Seriously.
Pig Makes it Easy
‣   5% of the code
‣   5% of the dev time
‣   Within 20% of the running time
‣   Readable, reusable
‣   As Pig improves, your calculations run faster
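To see why “readable, reusable” matters, here is a hedged sketch of the mutual-followings challenge from earlier as a Pig self-join: join the follow graph against its own reversed edges (edge-list path and field names are assumptions):

    -- (a, b) survives the join iff a follows b and b follows a.
    edges    = LOAD '/tables/followings' AS (src: long, dst: long);
    reversed = FOREACH edges GENERATE dst AS src, src AS dst;
    mutual   = JOIN edges BY (src, dst), reversed BY (src, dst);
    pairs    = FOREACH mutual GENERATE edges::src AS user_a, edges::dst AS user_b;
    -- each mutual pair comes out in both orders; keep one
    deduped  = FILTER pairs BY user_a < user_b;
    STORE deduped INTO '/results/mutual_followings';

No n-billion-row MySQL self-join required; Hadoop spreads the join across the cluster.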
One Thing I’ve Learned
‣   It’s easy to answer questions
‣   It’s hard to ask the right questions
‣   Value the system that promotes innovation and iteration
‣   More minds contributing = more value from your data
Counting Big Data
‣   How many requests per day?
‣   Average latency? 95th-percentile latency?
‣   Response code distribution per hour?
‣   Twitter searches per day?
‣   Unique users searching, unique queries?
‣   Links tweeted per day? By domain?
‣   Geographic distribution of all of the above
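A hedged sketch of the first few counting questions in Pig, assuming scribed web logs with a timestamp, response code, and latency per request (all paths and field names hypothetical):

    logs    = LOAD '/logs/web' AS (epoch_sec: long, response_code: int, latency_ms: int);
    -- requests per day, with average latency (the 95th percentile needs a UDF)
    by_day  = GROUP logs BY (epoch_sec / 86400);
    daily   = FOREACH by_day GENERATE group AS day, COUNT(logs) AS requests,
                                      AVG(logs.latency_ms) AS avg_latency;
    -- response code distribution per hour
    by_hour = GROUP logs BY (epoch_sec / 3600, response_code);
    hourly  = FOREACH by_hour GENERATE FLATTEN(group), COUNT(logs) AS requests;
    STORE daily  INTO '/results/requests_per_day';
    STORE hourly INTO '/results/response_codes_per_hour';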
Correlating Big Data
‣   Usage difference for mobile users?
‣   ... for users on desktop clients?
‣   ... for users of #newtwitter?
‣   Cohort analyses
‣   What features get users hooked?
‣   Which features do Twitter power users use often?
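One illustrative (and entirely hypothetical) shape for a cohort analysis in Pig: join usage events to users, then group by signup cohort and client type:

    users    = LOAD '/tables/users' AS (user_id: long, signup_week: chararray);
    usage    = LOAD '/logs/usage' AS (user_id: long, client: chararray, action: chararray);
    joined   = JOIN usage BY user_id, users BY user_id;
    cohorts  = GROUP joined BY (users::signup_week, usage::client);
    activity = FOREACH cohorts GENERATE FLATTEN(group) AS (signup_week, client),
                                        COUNT(joined) AS actions;
    STORE activity INTO '/results/cohort_activity';

Swapping the grouping key is a one-line change, which is what makes iterating on these questions cheap.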
Research on Big Data
‣   What can we tell from a user’s tweets?
‣   ... from the tweets of their followers?
‣   ... from the tweets of those they follow?
‣   What influences retweets? Depth of the retweet tree?
‣   Duplicate detection (spam)
‣   Language detection (search)
‣   Machine learning
‣   Natural language processing
Diving Deeper
‣   HBase and building products from Hadoop
‣   LZO Compression
‣   Protocol Buffers and Hadoop
‣   Our analytics-related open source: hadoop-lzo, elephant-bird
‣   Moving analytics to realtime
    http://github.com/kevinweil/hadoop-lzo
    http://github.com/kevinweil/elephant-bird
Questions?
Follow me at twitter.com/kevinweil