NoSQL at Twitter
Kevin Weil -- @kevinweil
Analytics Lead, Twitter

April 21, 2010




Introduction
‣   How We Arrived at NoSQL: A Crash Course
‣     Collecting Data (Scribe)
‣     Storing and Analyzing Data (Hadoop)
‣     Rapid Learning over Big Data (Pig)
‣   And More: Cassandra, HBase, FlockDB
My Background
‣   Studied Mathematics and Physics at Harvard, Physics at
    Stanford
‣   Tropos Networks (city-wide wireless): mesh routing algorithms,
    GBs of data
‣   Cooliris (web media): Hadoop and Pig for analytics, TBs of data
‣   Twitter: Hadoop, Pig, HBase, Cassandra, machine learning,
    visualization, social graph analysis, soon to be PBs of data
Data, Data Everywhere
‣   Twitter users generate a lot of data
‣   Anybody want to guess?
‣     7 TB/day (2+ PB/yr)
‣     10,000 CDs/day
‣     5 million floppy disks
‣     300 GB while I give this talk
‣   And doubling multiple times per year
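A quick back-of-the-envelope check of the figures above. The media sizes (700 MB per CD, 1.44 MB per floppy disk) and the use of decimal units are assumptions, not numbers from the talk.

```python
# Back-of-the-envelope check of the 7 TB/day figures.
# Assumed: decimal units, 700 MB per CD, 1.44 MB per floppy disk.
TB = 10**12
GB = 10**9
MB = 10**6

daily_bytes = 7 * TB

pb_per_year = daily_bytes * 365 / 10**15
cds_per_day = daily_bytes / (700 * MB)
floppies_per_day = daily_bytes / (1.44 * MB)
gb_per_hour = daily_bytes / 24 / GB

print(f"{pb_per_year:.2f} PB/yr")
print(f"{cds_per_day:,.0f} CDs/day")
print(f"{floppies_per_day / 1e6:.1f} million floppies/day")
print(f"{gb_per_hour:.0f} GB/hour")
```

Under these assumptions the results land close to the rounded slide figures, and 300 GB corresponds to roughly an hour of talk time.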
Syslog?
‣   Started with syslog-ng
‣   As our volume grew, it didn’t scale
‣   Resources overwhelmed
‣   Lost data
Scribe
‣   Surprise! FB had same problem, built and open-sourced Scribe
‣   Log collection framework over Thrift
‣   You write log lines, with categories
‣   It does the rest
Scribe
‣   Runs locally; reliable in network outage
‣   Nodes only know downstream writer; hierarchical, scalable
‣   Pluggable outputs

    [Diagram: FE nodes -> Agg (aggregator) nodes -> File / HDFS outputs]
Scribe at Twitter
‣   Solved our problem, opened new vistas
‣   Currently 30 different categories logged from multiple sources
‣     FE: JavaScript, Ruby on Rails
‣     Middle tier: Ruby on Rails, Scala
‣     Backend: Scala, Java, C++
Scribe at Twitter
‣   We’ve contributed to it as we’ve used it
‣       Improved logging, monitoring, writing to HDFS, compression
‣       Continuing to work with FB on patches
‣   GSoC project! Help make it more awesome.




    • http://github.com/traviscrawford/scribe
    • http://wiki.developers.facebook.com/index.php/User:GSoC
How do you store 7TB/day?
‣   Single machine?
‣   What’s HD write speed?
‣     ~80 MB/s
‣   24.3 hours to write 7 TB
‣   Uh oh.
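The 24.3-hour figure follows directly from the numbers on the slide. A minimal sketch of the arithmetic, assuming decimal units and a sustained 80 MB/s write rate; the ten-disk case is an idealized illustration, not a figure from the talk:

```python
# Time for one disk to absorb a day's worth of data at a sustained write rate.
# Assumed: decimal units (1 TB = 10**12 bytes, 1 MB = 10**6 bytes).
def hours_to_write(total_bytes: float, bytes_per_second: float) -> float:
    return total_bytes / bytes_per_second / 3600

single_disk = hours_to_write(7 * 10**12, 80 * 10**6)
print(f"{single_disk:.1f} hours")  # longer than the day that produced the data

# Idealized: spreading the writes evenly over n disks divides the time by n
# (ignores replication and network overhead).
ten_disks = single_disk / 10
```

A single disk cannot keep up with one day of data in one day, which is the motivation for a cluster.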
Where do I put 7TB/day?
‣   Need a cluster of machines


‣   ... which adds new layers of complexity
Hadoop
‣   Distributed file system
‣     Automatic replication, fault tolerance
‣   MapReduce-based parallel computation
‣     Key-value based computation interface allows for wide
    applicability
Hadoop
‣   Open source: top-level Apache project
‣   Scalable: Y! has a 4000 node cluster
‣   Powerful: sorted 1TB of random integers in 62 seconds


‣   Easy packaging: free Cloudera RPMs
MapReduce Workflow
    [Diagram: Inputs -> Map tasks -> Shuffle/Sort -> Reduce tasks -> Outputs]
‣   Challenge: how many tweets per user, given tweets table?
‣   Input: key=row, value=tweet info
‣   Map: output key=user_id, value=1
‣   Shuffle: sort by user_id
‣   Reduce: for each user_id, sum
‣   Output: user_id, tweet count
‣   With 2x machines, runs 2x faster
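The steps above can be sketched in a single process. This is not the Hadoop API; it only mimics the map, shuffle, and reduce phases in plain Python, and the sample tweets and function names are hypothetical.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical sample of the tweets table: (row_key, tweet_info) pairs.
tweets = [
    (1, {"user_id": 12, "text": "hello"}),
    (2, {"user_id": 20, "text": "world"}),
    (3, {"user_id": 12, "text": "again"}),
]

def map_fn(row_key, tweet):
    # Map: emit (user_id, 1) for every tweet.
    yield tweet["user_id"], 1

def shuffle(pairs):
    # Shuffle/sort: group values by key, keys in sorted order.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_fn(user_id, ones):
    # Reduce: sum the 1s for each user_id.
    return user_id, sum(ones)

mapped = chain.from_iterable(map_fn(k, v) for k, v in tweets)
counts = [reduce_fn(k, vs) for k, vs in shuffle(mapped)]
print(counts)  # [(12, 2), (20, 1)]
```

In a real cluster the map calls run in parallel across machines and the shuffle moves each key to one reducer, which is where the scaling comes from.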
Two Analysis Challenges
‣   1. Compute friendships in Twitter’s social graph
‣     grep, awk? No way.
‣     Data is in MySQL... self join on an n-billion row table?
‣        n,000,000,000 x n,000,000,000 = ?
‣        I don’t know either.
Two Analysis Challenges
‣   2. Large-scale grouping and counting?
‣    select count(*) from users? Maybe...
‣    select count(*) from tweets? Uh...
‣    Imagine joining them...
‣    ... and grouping...
‣    ... and sorting...
Back to Hadoop
‣   Didn’t we have a cluster of machines?
‣   Hadoop makes it easy to distribute the
    calculation
‣   Purpose-built for parallel computation
‣   Just a slight mindset adjustment
‣   But a fun and valuable one!
Analysis at scale
‣   Now we’re rolling
‣   Count all tweets: 12 billion, 5 minutes
‣   Hit FlockDB in parallel to assemble social graph aggregates
‣   Run PageRank across users to calculate reputations
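For readers unfamiliar with the last item, here is a minimal single-machine sketch of PageRank over a follow graph. The talk does not show how it was implemented on Hadoop; the damping factor of 0.85, the iteration count, and the toy graph are all assumptions.

```python
def pagerank(follows, damping=0.85, iterations=50):
    """follows maps each user to the list of users they follow."""
    users = set(follows) | {v for vs in follows.values() for v in vs}
    n = len(users)
    rank = {u: 1.0 / n for u in users}
    for _ in range(iterations):
        new_rank = {u: (1.0 - damping) / n for u in users}
        for u in users:
            targets = follows.get(u, [])
            if targets:
                # Each user passes an equal share of rank to everyone they follow.
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # Dangling user (follows nobody): spread rank evenly.
                for v in users:
                    new_rank[v] += damping * rank[u] / n
        rank = new_rank
    return rank

# Hypothetical toy graph: a and b both follow c; c follows a.
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
top_user = max(ranks, key=ranks.get)
```

Here `c` ends up with the highest score because two users follow it. At scale, each iteration maps naturally onto one MapReduce pass over the edge list.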
But...
‣   Analysis typically in Java
‣     “I need less Java in my life, not more.”
‣   Single-input, two-stage data flow is rigid
‣   Projections, filters: custom code
‣   Joins are lengthy, error-prone
‣   n-stage jobs hard to manage
‣   Exploration requires compilation!
Pig
‣   High-level language
‣   Transformations on sets of records
‣   Process data one step at a time
‣   Easier than SQL?
Why Pig?
‣   Because I bet you can read the following script.
A Real Pig Script
    [Slide image: Pig script, not captured in this transcript]

‣   Now, just for fun... the same calculation in vanilla Hadoop MapReduce.
No, seriously.
Pig Democratizes Large-scale Data Analysis
‣   The Pig version is:
‣     5% of the code
‣     5% of the development time
‣     Within 25% of the execution time
One Thing I’ve Learned
‣   It’s easy to answer questions
‣   It’s hard to ask the right questions


‣   Value the system that promotes innovation, iteration
‣   More minds contributing = more value from your data
The Hadoop Ecosystem at Twitter
‣   Running Cloudera’s free distro, CDH2 and Hadoop 0.20.1
‣   Heavily modified Scribe writing LZO-compressed to HDFS
‣     LZO: fast, splittable compression, ideal for HDFS*
‣   Data either as flat files (logs) or in protocol buffer format (newer
    logs, structured data, etc)
‣     Libs for reading/writing/more open-sourced as elephant-bird**
‣   Some Java-based MapReduce, some HBase, Hadoop streaming
‣   Most analysis, and most interesting analyses, done in Pig
‣   * http://www.github.com/kevinweil/hadoop-lzo
‣   ** http://www.github.com/kevinweil/elephant-bird
Data?
‣   Semi-structured: Apache logs
    (search, .com, mobile), search query
    logs, RoR logs, MySQL query logs, A/B
    testing logs, signup flow logging, and
    so on...
‣   Structured: tweets, users, blocks,
    phones, favorites, saved searches,
    retweets, geo, authentications, SMS,
    3rd party clients, followings
‣   Entangled: the social graph
So what do we do with it?
Counting Big Data
‣                standard counts, min, max, std dev
‣   How many requests do we serve in a day?
‣   What is the average latency? 95% latency?
‣   Group by response code. What is the hourly distribution?
‣   How many searches happen each day on Twitter?
‣   How many unique queries, how many unique users?
‣   What is their geographic distribution?
Counting Big Data
‣   Where are users querying from? The API, the front page, their
    profile page, etc?
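Several of the counting questions above (average latency, 95% latency, grouping by response code and hour) reduce to a few aggregations. A small single-machine sketch, assuming a hypothetical parsed request log of (hour, response_code, latency_ms) tuples and the nearest-rank definition of a percentile:

```python
import math
from collections import Counter
from statistics import mean

# Hypothetical parsed request log: (hour, response_code, latency_ms).
requests = [
    (0, 200, 12.0), (0, 200, 15.0), (0, 500, 250.0),
    (1, 200, 11.0), (1, 404, 9.0), (1, 200, 14.0),
]

latencies = sorted(r[2] for r in requests)
avg_latency = mean(latencies)

# Nearest-rank 95th percentile: smallest value with at least 95% of
# observations at or below it.
p95_latency = latencies[math.ceil(0.95 * len(latencies)) - 1]

# Group by response code, then by (hour, response code) for the hourly view.
by_code = Counter(code for _, code, _ in requests)
by_hour_and_code = Counter((hour, code) for hour, code, _ in requests)
```

Note how one slow request dominates the 95% latency while barely registering in the counts, which is why both views are worth asking for.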
Correlating Big Data
‣                 probabilities, covariance, influence
‣   How does usage differ for mobile users?
‣   How about for users with 3rd party desktop clients?
‣   Cohort analyses
‣   Site problems: what goes wrong at the same time?
‣   Which features get users hooked?
‣   Which features do successful users use often?
‣   Search corrections, search suggestions
‣   A/B testing
Correlating Big Data
‣   What is the correlation between users with registered phones
    and users that tweet?
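One way to put a number on this question is the Pearson correlation between two per-user 0/1 flags (for binary data this is also called the phi coefficient). The talk does not say which measure was used, so this is only an illustrative sketch with made-up flags:

```python
import math

def pearson(xs, ys):
    # Pearson correlation; undefined (division by zero) if either input is constant.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-user flags: 1 = has a registered phone / has tweeted.
has_phone = [1, 1, 1, 0, 0, 0, 1, 0]
has_tweeted = [1, 1, 0, 0, 0, 1, 1, 0]

r = pearson(has_phone, has_tweeted)
print(round(r, 2))  # 0.5
```

A value near 0 would suggest no linear relationship, and values near 1 or -1 a strong one; correlation alone says nothing about which way any causation runs.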
Research on Big Data
‣           prediction, graph analysis, natural language
‣   What can we tell about a user from their tweets?
‣     From the tweets of those they follow?
‣     From the tweets of their followers?
‣     From the ratio of followers/following?
‣   What graph structures lead to successful networks?
‣   User reputation
Research on Big Data
‣            prediction, graph analysis, natural language
‣   Sentiment analysis
‣   What features get a tweet retweeted?
‣     How deep is the corresponding retweet tree?
‣   Long-term duplicate detection
‣   Machine learning
‣   Language detection
‣   ... the list goes on.
Research on Big Data
‣   How well can we detect bots and other non-human tweeters?
HBase
‣   BigTable clone on top of HDFS
‣   Distributed, column-oriented, no datatypes
‣   Unlike the rest of Hadoop, designed for low-latency access
‣   Importantly, data is mutable
HBase at Twitter
‣   We began building real products based on Hadoop
‣   People search
‣     Old version: offline process on a single node
‣     New version: complex user calculations,
    hit extra services in real time, custom indexing
‣     Underlying data is mutable
‣     Mutable layer on top of HDFS --> HBase
People Search
‣   Import user data into HBase
‣   Periodic MapReduce job reading from HBase
‣     Hits FlockDB, multiple other internal services in mapper
‣     Custom partitioning
‣     Data sucked across to sharded, replicated, horizontally
    scalable, in-memory, low-latency Scala service
‣        Build a trie, do case folding/normalization, suggestions, etc
‣   See http://www.slideshare.net/al3x/building-distributed-systems-in-scala
    for more
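The trie and case-folding step can be illustrated with a small prefix index. The production service described above is written in Scala and is not shown in the talk; this Python sketch, its class and method names, and the sample names are all hypothetical.

```python
class Trie:
    def __init__(self):
        self.children = {}
        self.names = []  # original names that end at this node

    def insert(self, name):
        node = self
        for ch in name.casefold():  # case folding as the normalization step
            node = node.children.setdefault(ch, Trie())
        node.names.append(name)

    def suggest(self, prefix, limit=10):
        node = self
        for ch in prefix.casefold():
            node = node.children.get(ch)
            if node is None:
                return []
        # Depth-first walk below the prefix node, collecting stored names.
        results, stack = [], [node]
        while stack and len(results) < limit:
            current = stack.pop()
            results.extend(current.names)
            # Push children in reverse so the smallest character pops first.
            for ch in sorted(current.children, reverse=True):
                stack.append(current.children[ch])
        return results[:limit]

index = Trie()
for name in ["Alice Ng", "alice smith", "Bob Lee"]:  # hypothetical sample names
    index.insert(name)

matches = index.suggest("ALI")
print(matches)  # ['Alice Ng', 'alice smith']
```

Folding both the stored names and the query means "ALI", "Ali", and "ali" all reach the same node, while the original spelling is kept for display.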
HBase
‣   More products now being built on top of it
‣   Flexible, easy to connect to MapReduce/Pig
HBase vs Cassandra
‣   “Their origins reveal their strengths and weaknesses”
‣   HBase built on top of batch-oriented system, not low latency
‣   Cassandra built from ground up for low latency
‣   HBase easy to connect to batch jobs as input and output
‣   Cassandra not so much (but we’re working on it)
‣   HBase has SPOF in the namenode
HBase vs Cassandra
‣   Your mileage may vary
‣     At Twitter: HBase for analytics, analysis, dataset generation
‣     Cassandra for online systems




‣   As with all NoSQL systems: strengths in different situations
FlockDB
‣   Realtime, distributed social graph store
‣   NOT optimized for data mining
‣   Who follows who (nearly 8 orders of magnitude!)
‣   Intersection/set operations
‣   Cardinality
‣   Temporal index
    [Diagram labels: Intersection, Temporal, Counts]
‣   Note: the following slides largely come from @nk’s more
    complete talk at http://www.slideshare.net/nkallen/q-con-3770885
Set operations?
‣   This tweet needs to
    be delivered to people
    who follow both
    @aplusk (4.7M
    followers) and
    @foursquare (53K followers)
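The delivery set here is the intersection of two follower sets. A generic sketch of intersecting two sorted id lists with a linear merge follows; it is not FlockDB's implementation, and the follower ids are made up.

```python
def intersect_sorted(a, b):
    """Intersect two ascending lists of user ids with a linear merge."""
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return common

# Hypothetical follower id lists, each kept in ascending order.
followers_of_aplusk = [3, 8, 12, 20, 29, 34, 51]
followers_of_foursquare = [8, 20, 40, 51]

recipients = intersect_sorted(followers_of_aplusk, followers_of_foursquare)
print(recipients)  # [8, 20, 51]
```

When one list is far smaller than the other, as with 53K versus 4.7M followers, probing the larger set once per element of the smaller one may be cheaper than a full merge.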
Original solution
‣   MySQL table
‣   Indices on source_id and destination_id
‣   Couldn’t handle write throughput
‣   Indices too large for RAM

    source_id   destination_id
       20            12
       29            12
       34            16
Next Try
‣   MySQL still
‣   Denormalized
‣   Byte-packed
‣   Chunked
‣   Still temporally ordered
Next Try
‣   Problems
‣     O(n) deletes
‣     Data consistency challenges
‣     Inefficient intersections
‣   All of these manifested strongly
    for huge users like @aplusk
    or @lancearmstrong
FlockDB
‣   MySQL underneath still (like PNUTS from Y!)
‣   Partitioned by user_id, gizzard handles sharding/partitioning
‣   Edges stored in both directions, indexed by (src, dest)
‣   Denormalized counts stored

    Forward
    source_id   destination_id   updated_at   x
       20            12           20:50:14    x
       20            13           20:51:32
       20            16

    Backward
    destination_id   source_id   updated_at   x
         12             20        20:50:14    x
         12             32        20:51:32
         12             16
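To illustrate the idea of storing each edge in both directions alongside denormalized counts, here is a toy in-memory sketch. It is not FlockDB's code or schema (the real store sits on MySQL behind gizzard); the class and method names are hypothetical, and the sample edges are taken from the tables above.

```python
from collections import defaultdict

class ToyGraphStore:
    """In-memory stand-in for a two-directional edge store."""

    def __init__(self):
        self.forward = defaultdict(dict)   # source_id -> {destination_id: updated_at}
        self.backward = defaultdict(dict)  # destination_id -> {source_id: updated_at}
        self.following_count = defaultdict(int)
        self.follower_count = defaultdict(int)

    def add_edge(self, source_id, destination_id, updated_at):
        is_new = destination_id not in self.forward[source_id]
        # Write the edge in both directions so either side can be read directly.
        self.forward[source_id][destination_id] = updated_at
        self.backward[destination_id][source_id] = updated_at
        if is_new:
            # Denormalized counts: maintained on write, O(1) to read.
            self.following_count[source_id] += 1
            self.follower_count[destination_id] += 1

    def followers(self, user_id):
        return sorted(self.backward[user_id])

store = ToyGraphStore()
store.add_edge(20, 12, "20:50:14")
store.add_edge(20, 13, "20:51:32")
store.add_edge(32, 12, "20:51:32")
```

Paying for two writes per edge makes both "who does 20 follow?" and "who follows 12?" a direct lookup, and the stored counts avoid scanning edges to answer cardinality queries.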
FlockDB Timings
‣   Counts: 1ms
‣   Temporal Query: 2ms
‣   Writes: 1ms for journal, 16ms for durability
‣   Full walks: 100 edges/ms
FlockDB is Open Source
‣   We will maintain a community at
‣        http://www.github.com/twitter/flockdb
‣        http://www.github.com/twitter/gizzard



‣   See Nick Kallen’s QCon talk for more
‣       http://www.slideshare.net/nkallen/q-con-3770885
Cassandra
‣   Why Cassandra, for Twitter?
‣     Old/current: vertically, horizontally partitioned MySQL
‣     All kinds of caching layers, all application managed
‣     Alter table impossible, leads to bitfields, piggyback tables
‣     Hardware intensive, error prone, etc
‣     Not to mention, we hit MySQL write limits sometimes


‣   First goal: move all tweets to Cassandra
Cassandra
‣   Why Cassandra, for Twitter?
‣     Decentralized, fault-tolerant
‣     All kinds of caching layers, all application managed
‣     Flexible schema
‣     Elastic
‣     High write throughput


‣   First goal: move all tweets to Cassandra
Eventually Consistent?
‣   Twitter is already eventually consistent
‣   Your system may be even worse
‣     Ryan’s new term: “potential consistency”
‣     Do you have write-through caching?
‣     Do you ever have MySQL replication failures?
‣   There is no automatic consistency repair there, unlike Cassandra


‣   http://www.slideshare.net/ryansking/scaling-twitter-with-cassandra
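To make "automatic consistency repair" concrete, here is a simplified sketch of repair on read using last-write-wins timestamps. It assumes each replica is a plain dict mapping a key to a `(timestamp, value)` pair; `read_with_repair` is a hypothetical name, and real Cassandra resolves versions per column under configurable consistency levels rather than per whole value as shown here.

```python
def read_with_repair(key, replicas):
    """Return the newest value for key and push it to stale replicas.

    replicas: list of dicts mapping key -> (timestamp, value).
    Simplified illustration only, not Cassandra's implementation.
    """
    versions = [replica[key] for replica in replicas if key in replica]
    if not versions:
        return None
    # Last write wins: the highest timestamp is treated as current.
    newest = max(versions, key=lambda version: version[0])
    for replica in replicas:
        if replica.get(key) != newest:
            replica[key] = newest  # repair the stale or missing copy
    return newest[1]


replicas = [{"tweet:1": (2, "edited")}, {"tweet:1": (1, "original")}, {}]
print(read_with_repair("tweet:1", replicas))
```

After the call, all three replicas hold the same `(2, "edited")` pair, which is the property a failed MySQL replica does not get for free.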
Rolling out Cassandra
‣   1. Integrate Cassandra alongside MySQL
‣      100% reads/writes to MySQL
‣      Dynamic switches for % dark reads/writes to Cassandra
‣   2. Turn up traffic to Cassandra
‣   3. Find something that’s broken, set switch to 0%
‣   4. Fix it
‣   5. GOTO 2
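The loop above can be sketched as a percentage switch wrapped around the existing MySQL path. This is a minimal illustration under stated assumptions: "dark" is taken to mean that Cassandra results are never returned to users, plain dicts stand in for both stores, and `DarkTrafficSwitch`, `write_tweet`, and `read_tweet` are hypothetical names rather than Twitter's code.

```python
import random


class DarkTrafficSwitch:
    """Hypothetical runtime-adjustable percentage switch (0 to 100)."""

    def __init__(self, percent=0.0, rng=None):
        self._rng = rng or random.Random()
        self.set_percent(percent)

    def set_percent(self, percent):
        if not 0.0 <= percent <= 100.0:
            raise ValueError("percent must be between 0 and 100")
        self.percent = percent

    def enabled(self):
        # random() is in [0, 1), so 0% never fires and 100% always fires.
        return self._rng.random() * 100.0 < self.percent


def write_tweet(tweet_id, body, mysql, cassandra, switch, errors):
    mysql[tweet_id] = body  # MySQL remains the source of truth
    if switch.enabled():
        try:
            cassandra[tweet_id] = body  # dark write
        except Exception as exc:  # a Cassandra failure must not affect users
            errors.append(exc)


def read_tweet(tweet_id, mysql, cassandra, switch, mismatches):
    result = mysql.get(tweet_id)
    if switch.enabled():
        # Dark read: compare, record any difference, discard the result.
        if cassandra.get(tweet_id) != result:
            mismatches.append(tweet_id)
    return result


mysql, cassandra, errors, mismatches = {}, {}, [], []
switch = DarkTrafficSwitch(percent=0)
write_tweet(1, "hello", mysql, cassandra, switch, errors)  # MySQL only
switch.set_percent(100)                                    # step 2: turn up
write_tweet(2, "world", mysql, cassandra, switch, errors)  # both stores
read_tweet(1, mysql, cassandra, switch, mismatches)        # finds a gap
print(mismatches)
switch.set_percent(0)                                      # step 3: back off
```

Because the switch is consulted on every request, setting it to 0% takes Cassandra out of the path immediately without a deploy.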
Cassandra for Realtime Analytics
‣   Starting a project around realtime analytics
‣   Cassandra as the backing store
‣     Using, developing, testing Digg’s atomic incr patches
‣   More soon.
That was a lot of slides
‣   Thanks for sticking with me.
Questions?   Follow me at
             twitter.com/kevinweil




NoSQL at Twitter (NoSQL EU 2010)

  • 1. NoSQL at Twitter Kevin Weil -- @kevinweil Analytics Lead, Twitter April 21, 2010 TM
  • 2. Introduction ‣ How We Arrived at NoSQL: A Crash Course ‣ Collecting Data (Scribe) ‣ Storing and Analyzing Data (Hadoop) ‣ Rapid Learning over Big Data (Pig) ‣ And More: Cassandra, HBase, FlockDB
  • 3. My Background ‣ Studied Mathematics and Physics at Harvard, Physics at Stanford ‣ Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data ‣ Cooliris (web media): Hadoop and Pig for analytics, TBs of data ‣ Twitter: Hadoop, Pig, HBase, Cassandra, machine learning, visualization, social graph analysis, soon to be PBs data
  • 4. Introduction ‣ How We Arrived at NoSQL: A Crash Course ‣ Collecting Data (Scribe) ‣ Storing and Analyzing Data (Hadoop) ‣ Rapid Learning over Big Data (Pig) ‣ And More: Cassandra, HBase, FlockDB
  • 5. Data, Data Everywhere ‣ Twitter users generate a lot of data ‣ Anybody want to guess?
  • 6. Data, Data Everywhere ‣ Twitter users generate a lot of data ‣ Anybody want to guess? ‣ 7 TB/day (2+ PB/yr)
  • 7. Data, Data Everywhere ‣ Twitter users generate a lot of data ‣ Anybody want to guess? ‣ 7 TB/day (2+ PB/yr) ‣ 10,000 CDs/day
  • 8. Data, Data Everywhere ‣ Twitter users generate a lot of data ‣ Anybody want to guess? ‣ 7 TB/day (2+ PB/yr) ‣ 10,000 CDs/day ‣ 5 million floppy disks
  • 9. Data, Data Everywhere ‣ Twitter users generate a lot of data ‣ Anybody want to guess? ‣ 7 TB/day (2+ PB/yr) ‣ 10,000 CDs/day ‣ 5 million floppy disks ‣ 300 GB while I give this talk
  • 10. Data, Data Everywhere ‣ Twitter users generate a lot of data ‣ Anybody want to guess? ‣ 7 TB/day (2+ PB/yr) ‣ 10,000 CDs/day ‣ 5 million floppy disks ‣ 300 GB while I give this talk ‣ And doubling multiple times per year
  • 11. Syslog? ‣ Started with syslog-ng ‣ As our volume grew, it didn’t scale
  • 12. Syslog? ‣ Started with syslog-ng ‣ As our volume grew, it didn’t scale ‣ Resources overwhelmed ‣ Lost data
  • 13. Scribe ‣ Surprise! FB had same problem, built and open-sourced Scribe ‣ Log collection framework over Thrift ‣ You write log lines, with categories ‣ It does the rest
  • 14. Scribe ‣ Runs locally; reliable in network outage FE FE FE
  • 15. Scribe ‣ Runs locally; reliable in network outage ‣ Nodes only know downstream FE FE FE writer; hierarchical, scalable Agg Agg
  • 16. Scribe ‣ Runs locally; reliable in network outage ‣ Nodes only know downstream FE FE FE writer; hierarchical, scalable ‣ Pluggable outputs Agg Agg File HDFS
  • 17. Scribe at Twitter ‣ Solved our problem, opened new vistas ‣ Currently 30 different categories logged from multiple sources ‣ FE: Javascript, Ruby on Rails ‣ Middle tier: Ruby on Rails, Scala ‣ Backend: Scala, Java, C++
  • 18. Scribe at Twitter ‣ We’ve contributed to it as we’ve used it ‣ Improved logging, monitoring, writing to HDFS, compression ‣ Continuing to work with FB on patches ‣ GSoC project! Help make it more awesome. • http://github.com/traviscrawford/scribe • http://wiki.developers.facebook.com/index.php/User:GSoC
  • 19. Introduction ‣ How We Arrived at NoSQL: A Crash Course ‣ Collecting Data (Scribe) ‣ Storing and Analyzing Data (Hadoop) ‣ Rapid Learning over Big Data (Pig) ‣ And More: Cassandra, HBase, FlockDB
  • 20. How do you store 7TB/day? ‣ Single machine? ‣ What’s HD write speed?
  • 21. How do you store 7TB/day? ‣ Single machine? ‣ What’s HD write speed? ‣ ~80 MB/s
  • 22. How do you store 7TB/day? ‣ Single machine? ‣ What’s HD write speed? ‣ ~80 MB/s ‣ 24.3 hours to write 7 TB
  • 23. How do you store 7TB/day? ‣ Single machine? ‣ What’s HD write speed? ‣ ~80 MB/s ‣ 24.3 hours to write 7 TB ‣ Uh oh.
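The 24.3-hour figure follows directly from the two numbers on the slide, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and one disk writing sequentially at full speed:

```python
# Time for a single disk to write one day's data at the slide's ~80 MB/s.
bytes_to_write = 7 * 10**12    # 7 TB (decimal units assumed)
write_speed = 80 * 10**6       # ~80 MB/s sustained

seconds = bytes_to_write / write_speed
hours = seconds / 3600

print(f"{seconds:,.0f} s = {hours:.1f} hours")
```

Since that is more than 24 hours, a single disk falls behind even before reads, replication or growth are considered.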
  • 24. Where do I put 7TB/day? ‣ Need a cluster of machines
  • 25. Where do I put 7TB/day? ‣ Need a cluster of machines ‣ ... which adds new layers of complexity
  • 26. Hadoop ‣ Distributed file system ‣ Automatic replication, fault tolerance
  • 27. Hadoop ‣ Distributed file system ‣ Automatic replication, fault tolerance ‣ MapReduce-based parallel computation ‣ Key-value based computation interface allows for wide applicability
  • 28. Hadoop ‣ Open source: top-level Apache project ‣ Scalable: Y! has a 4000 node cluster ‣ Powerful: sorted 1TB of random integers in 62 seconds ‣ Easy packaging: free Cloudera RPMs
  • 29-35. MapReduce Workflow ‣ Challenge: how many tweets per user, given tweets table? ‣ Input: key=row, value=tweet info ‣ Map: output key=user_id, value=1 ‣ Shuffle: sort by user_id ‣ Reduce: for each user_id, sum ‣ Output: user_id, tweet count ‣ With 2x machines, runs 2x faster [Diagram: Inputs, Map, Shuffle/Sort, Reduce, Outputs]
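The tweets-per-user workflow above can be sketched in a single process to show how the three phases fit together. This is an in-memory illustration rather than Hadoop code; the sample tweets and the function names are made up for the example.

```python
from collections import defaultdict

# Hypothetical sample of the tweets table: (row key, tweet info).
rows = [
    (0, {"user_id": 12, "text": "hello"}),
    (1, {"user_id": 20, "text": "nosql"}),
    (2, {"user_id": 12, "text": "hadoop"}),
    (3, {"user_id": 12, "text": "pig"}),
]


def map_phase(rows):
    """Map: for each input row, emit key=user_id, value=1."""
    for _row_key, tweet in rows:
        yield tweet["user_id"], 1


def shuffle(pairs):
    """Shuffle/sort: group all values by key, ordered by user_id."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped):
    """Reduce: for each user_id, sum the emitted ones."""
    for user_id, ones in grouped:
        yield user_id, sum(ones)


tweet_counts = dict(reduce_phase(shuffle(map_phase(rows))))
print(tweet_counts)
```

In a real cluster the map and reduce functions run on many machines at once and the framework performs the shuffle, which is why adding machines shortens the run.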
  • 36. Two Analysis Challenges ‣ 1. Compute friendships in Twitter’s social graph ‣ grep, awk? No way. ‣ Data is in MySQL... self join on an n-billion row table? ‣ n,000,000,000 x n,000,000,000 = ?
  • 37. Two Analysis Challenges ‣ 1. Compute friendships in Twitter’s social graph ‣ grep, awk? No way. ‣ Data is in MySQL... self join on an n-billion row table? ‣ n,000,000,000 x n,000,000,000 = ? ‣ I don’t know either.
  • 38. Two Analysis Challenges ‣ 2. Large-scale grouping and counting? ‣ select count(*) from users? Maybe... ‣ select count(*) from tweets? Uh... ‣ Imagine joining them... ‣ ... and grouping... ‣ ... and sorting...
  • 39-40. Back to Hadoop ‣ Didn’t we have a cluster of machines?
  • 41. Back to Hadoop ‣ Didn’t we have a cluster of machines? ‣ Hadoop makes it easy to distribute the calculation ‣ Purpose-built for parallel computation ‣ Just a slight mindset adjustment
  • 42. Back to Hadoop ‣ Didn’t we have a cluster of machines? ‣ Hadoop makes it easy to distribute the calculation ‣ Purpose-built for parallel computation ‣ Just a slight mindset adjustment ‣ But a fun and valuable one!
  • 43. Analysis at scale ‣ Now we’re rolling ‣ Count all tweets: 12 billion, 5 minutes ‣ Hit FlockDB in parallel to assemble social graph aggregates ‣ Run pagerank across users to calculate reputations
  • 44. But... ‣ Analysis typically in Java ‣ “I need less Java in my life, not more.”
  • 45. But... ‣ Analysis typically in Java ‣ “I need less Java in my life, not more.” ‣ Single-input, two-stage data flow is rigid
  • 46. But... ‣ Analysis typically in Java ‣ “I need less Java in my life, not more.” ‣ Single-input, two-stage data flow is rigid ‣ Projections, filters: custom code
  • 47. But... ‣ Analysis typically in Java ‣ “I need less Java in my life, not more.” ‣ Single-input, two-stage data flow is rigid ‣ Projections, filters: custom code ‣ Joins are lengthy, error-prone
  • 48. But... ‣ Analysis typically in Java ‣ “I need less Java in my life, not more.” ‣ Single-input, two-stage data flow is rigid ‣ Projections, filters: custom code ‣ Joins are lengthy, error-prone ‣ n-stage jobs hard to manage
  • 49. But... ‣ Analysis typically in Java ‣ “I need less Java in my life, not more.” ‣ Single-input, two-stage data flow is rigid ‣ Projections, filters: custom code ‣ Joins are lengthy, error-prone ‣ n-stage jobs hard to manage ‣ Exploration requires compilation!
  • 50. Introduction ‣ How We Arrived at NoSQL: A Crash Course ‣ Collecting Data (Scribe) ‣ Storing and Analyzing Data (Hadoop) ‣ Rapid Learning over Big Data (Pig) ‣ And More: Cassandra, HBase, FlockDB
  • 51. Pig ‣ High-level language ‣ Transformations on sets of records ‣ Process data one step at a time ‣ Easier than SQL?
  • 52. Why Pig? ‣ Because I bet you can read the following script.
  • 53. A Real Pig Script
  • 54. A Real Pig Script ‣ Now, just for fun... the same calculation in vanilla Hadoop MapReduce.
  • 56. Pig Democratizes Large-scale Data Analysis ‣ The Pig version is: ‣ 5% of the code
  • 57. Pig Democratizes Large-scale Data Analysis ‣ The Pig version is: ‣ 5% of the code ‣ 5% of the time
  • 58. Pig Democratizes Large-scale Data Analysis ‣ The Pig version is: ‣ 5% of the code ‣ 5% of the time ‣ Within 25% of the execution time
  • 59. One Thing I’ve Learned ‣ It’s easy to answer questions ‣ It’s hard to ask the right questions
  • 60. One Thing I’ve Learned ‣ It’s easy to answer questions ‣ It’s hard to ask the right questions ‣ Value the system that promotes innovation, iteration
  • 61. One Thing I’ve Learned ‣ It’s easy to answer questions ‣ It’s hard to ask the right questions ‣ Value the system that promotes innovation, iteration ‣ More minds contributing = more value from your data
  • 62. The Hadoop Ecosystem at Twitter ‣ Running Cloudera’s free distro, CDH2 and Hadoop 0.20.1
  • 63. The Hadoop Ecosystem at Twitter ‣ Running Cloudera’s free distro, CDH2 and Hadoop 0.20.1 ‣ Heavily modified Scribe writing LZO-compressed to HDFS ‣ LZO: fast, splittable compression, ideal for HDFS* ‣ * http://www.github.com/kevinweil/hadoop-lzo
  • 64. The Hadoop Ecosystem at Twitter ‣ Running Cloudera’s free distro, CDH2 and Hadoop 0.20.1 ‣ Heavily modified Scribe writing LZO-compressed to HDFS ‣ LZO: fast, splittable compression, ideal for HDFS* ‣ Data either as flat files (logs) or in protocol buffer format (newer logs, structured data, etc) ‣ Libs for reading/writing/more open-sourced as elephant-bird** ‣ * http://www.github.com/kevinweil/hadoop-lzo ‣ ** http://www.github.com/kevinweil/elephant-bird
  • 65. The Hadoop Ecosystem at Twitter ‣ Running Cloudera’s free distro, CDH2 and Hadoop 0.20.1 ‣ Heavily modified Scribe writing LZO-compressed to HDFS ‣ LZO: fast, splittable compression, ideal for HDFS* ‣ Data either as flat files (logs) or in protocol buffer format (newer logs, structured data, etc) ‣ Libs for reading/writing/more open-sourced as elephant-bird** ‣ Some Java-based MapReduce, a little Hadoop streaming ‣ * http://www.github.com/kevinweil/hadoop-lzo ‣ ** http://www.github.com/kevinweil/elephant-bird
  • 66. The Hadoop Ecosystem at Twitter ‣ Running Cloudera’s free distro, CDH2 and Hadoop 0.20.1 ‣ Heavily modified Scribe writing LZO-compressed to HDFS ‣ LZO: fast, splittable compression, ideal for HDFS* ‣ Data either as flat files (logs) or in protocol buffer format (newer logs, structured data, etc) ‣ Libs for reading/writing/more open-sourced as elephant-bird** ‣ Some Java-based MapReduce, some HBase, Hadoop streaming ‣ Most analysis, and most interesting analyses, done in Pig ‣ * http://www.github.com/kevinweil/hadoop-lzo ‣ ** http://www.github.com/kevinweil/elephant-bird
  • 67. Data? ‣ Semi-structured: apache logs (search, .com, mobile), search query logs, RoR logs, mysql query logs, A/B testing logs, signup flow logging, and on...
  • 68. Data? ‣ Semi-structured: apache logs (search, .com, mobile), search query logs, RoR logs, mysql query logs, A/B testing logs, signup flow logging, and on... ‣ Structured: tweets, users, blocks, phones, favorites, saved searches, retweets, geo, authentications, sms, 3rd party clients, followings
  • 69. Data? ‣ Semi-structured: apache logs (search, .com, mobile), search query logs, RoR logs, mysql query logs, A/B testing logs, signup flow logging, and on... ‣ Structured: tweets, users, blocks, phones, favorites, saved searches, retweets, geo, authentications, sms, 3rd party clients, followings ‣ Entangled: the social graph
  • 70. So what do we do with it?
  • 71. Counting Big Data ‣ standard counts, min, max, std dev ‣ How many requests do we serve in a day?
  • 72. Counting Big Data ‣ standard counts, min, max, std dev ‣ How many requests do we serve in a day? ‣ What is the average latency? 95% latency?
  • 73. Counting Big Data ‣ standard counts, min, max, std dev ‣ How many requests do we serve in a day? ‣ What is the average latency? 95% latency? ‣ Group by response code. What is the hourly distribution?
  • 74. Counting Big Data ‣ standard counts, min, max, std dev ‣ How many requests do we serve in a day? ‣ What is the average latency? 95% latency? ‣ Group by response code. What is the hourly distribution? ‣ How many searches happen each day on Twitter?
  • 75. Counting Big Data ‣ standard counts, min, max, std dev ‣ How many requests do we serve in a day? ‣ What is the average latency? 95% latency? ‣ Group by response code. What is the hourly distribution? ‣ How many searches happen each day on Twitter? ‣ How many unique queries, how many unique users?
  • 76. Counting Big Data ‣ standard counts, min, max, std dev ‣ How many requests do we serve in a day? ‣ What is the average latency? 95% latency? ‣ Group by response code. What is the hourly distribution? ‣ How many searches happen each day on Twitter? ‣ How many unique queries, how many unique users? ‣ What is their geographic distribution?
  • 77. Counting Big Data ‣ Where are users querying from? The API, the front page, their profile page, etc?
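The counting questions on these slides (request totals, average and 95% latency, grouping by response code, hourly distribution) can be sketched on a handful of records. The record layout and the sample values are invented for illustration, and the percentile uses the nearest-rank definition, which is one of several in common use.

```python
import math
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical request log records: (hour, response_code, latency_ms).
requests = [
    (0, 200, 12.0),
    (0, 200, 15.0),
    (0, 500, 250.0),
    (1, 200, 11.0),
    (1, 404, 9.0),
    (1, 200, 14.0),
]


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


latencies = [latency for _hour, _code, latency in requests]
total_requests = len(requests)
average_latency = mean(latencies)
p95_latency = percentile(latencies, 95)

# Group by response code, then break the same counts down by hour.
by_code = Counter(code for _hour, code, _latency in requests)
hourly_by_code = defaultdict(Counter)
for hour, code, _latency in requests:
    hourly_by_code[hour][code] += 1

print(total_requests, average_latency, p95_latency)
print(dict(by_code))
print({hour: dict(codes) for hour, codes in hourly_by_code.items()})
```

The single slow request barely moves the request count but dominates the 95% latency, which is why the slide asks for both the average and the tail.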
  • 78. Correlating Big Data ‣ probabilities, covariance, influence ‣ How does usage differ for mobile users?
  • 79. Correlating Big Data ‣ probabilities, covariance, influence ‣ How does usage differ for mobile users? ‣ How about for users with 3rd party desktop clients?
  • 80. Correlating Big Data ‣ probabilities, covariance, influence ‣ How does usage differ for mobile users? ‣ How about for users with 3rd party desktop clients? ‣ Cohort analyses
  • 81. Correlating Big Data ‣ probabilities, covariance, influence ‣ How does usage differ for mobile users? ‣ How about for users with 3rd party desktop clients? ‣ Cohort analyses ‣ Site problems: what goes wrong at the same time?
  • 82. Correlating Big Data ‣ probabilities, covariance, influence ‣ How does usage differ for mobile users? ‣ How about for users with 3rd party desktop clients? ‣ Cohort analyses ‣ Site problems: what goes wrong at the same time? ‣ Which features get users hooked?
  • 83. Correlating Big Data ‣ probabilities, covariance, influence ‣ How does usage differ for mobile users? ‣ How about for users with 3rd party desktop clients? ‣ Cohort analyses ‣ Site problems: what goes wrong at the same time? ‣ Which features get users hooked? ‣ Which features do successful users use often?
  • 84. Correlating Big Data ‣ probabilities, covariance, influence ‣ How does usage differ for mobile users? ‣ How about for users with 3rd party desktop clients? ‣ Cohort analyses ‣ Site problems: what goes wrong at the same time? ‣ Which features get users hooked? ‣ Which features do successful users use often? ‣ Search corrections, search suggestions
  • 85. Correlating Big Data ‣ probabilities, covariance, influence ‣ How does usage differ for mobile users? ‣ How about for users with 3rd party desktop clients? ‣ Cohort analyses ‣ Site problems: what goes wrong at the same time? ‣ Which features get users hooked? ‣ Which features do successful users use often? ‣ Search corrections, search suggestions ‣ A/B testing
  • 86. Correlating Big Data ‣ What is the correlation between users with registered phones and users that tweet?
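One way to put a number on a question like this is the Pearson correlation between two 0/1 indicators per user (for binary data this is also called the phi coefficient). The talk does not say which measure was used, so this is only a sketch, and the eight sample users are invented.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(variance_x * variance_y)


# Hypothetical users: (has_registered_phone, has_tweeted), each 0 or 1.
users = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 1), (1, 1), (0, 0)]
has_phone = [phone for phone, _tweets in users]
has_tweeted = [tweets for _phone, tweets in users]

phone_tweet_correlation = pearson(has_phone, has_tweeted)
print(phone_tweet_correlation)
```

A positive value only says the two traits tend to occur together in the sample; it does not say that registering a phone causes tweeting.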
  • 87. Research on Big Data ‣ prediction, graph analysis, natural language ‣ What can we tell about a user from their tweets?
  • 88. Research on Big Data ‣ prediction, graph analysis, natural language ‣ What can we tell about a user from their tweets? ‣ From the tweets of those they follow?
  • 89. Research on Big Data ‣ prediction, graph analysis, natural language ‣ What can we tell about a user from their tweets? ‣ From the tweets of those they follow? ‣ From the tweets of their followers?
  • 90. Research on Big Data ‣ prediction, graph analysis, natural language ‣ What can we tell about a user from their tweets? ‣ From the tweets of those they follow? ‣ From the tweets of their followers? ‣ From the ratio of followers/following?
  • 91. Research on Big Data ‣ prediction, graph analysis, natural language ‣ What can we tell about a user from their tweets? ‣ From the tweets of those they follow? ‣ From the tweets of their followers? ‣ From the ratio of followers/following? ‣ What graph structures lead to successful networks?
  • 92. Research on Big Data ‣ prediction, graph analysis, natural language ‣ What can we tell about a user from their tweets? ‣ From the tweets of those they follow? ‣ From the tweets of their followers? ‣ From the ratio of followers/following? ‣ What graph structures lead to successful networks? ‣ User reputation
  • 93. Research on Big Data ‣ prediction, graph analysis, natural language ‣ Sentiment analysis
  • 94. Research on Big Data ‣ prediction, graph analysis, natural language ‣ Sentiment analysis ‣ What features get a tweet retweeted?
  • 95. Research on Big Data ‣ prediction, graph analysis, natural language ‣ Sentiment analysis ‣ What features get a tweet retweeted? ‣ How deep is the corresponding retweet tree?
  • 96. Research on Big Data ‣ prediction, graph analysis, natural language ‣ Sentiment analysis ‣ What features get a tweet retweeted? ‣ How deep is the corresponding retweet tree? ‣ Long-term duplicate detection
  • 97. Research on Big Data ‣ prediction, graph analysis, natural language ‣ Sentiment analysis ‣ What features get a tweet retweeted? ‣ How deep is the corresponding retweet tree? ‣ Long-term duplicate detection ‣ Machine learning
  • 98. Research on Big Data ‣ prediction, graph analysis, natural language ‣ Sentiment analysis ‣ What features get a tweet retweeted? ‣ How deep is the corresponding retweet tree? ‣ Long-term duplicate detection ‣ Machine learning ‣ Language detection
  • 99. Research on Big Data ‣ prediction, graph analysis, natural language ‣ Sentiment analysis ‣ What features get a tweet retweeted? ‣ How deep is the corresponding retweet tree? ‣ Long-term duplicate detection ‣ Machine learning ‣ Language detection ‣ ... the list goes on.
  • 100. Research on Big Data ‣ How well can we detect bots and other non-human tweeters?
  • 101. Introduction ‣ How We Arrived at NoSQL: A Crash Course ‣ Collecting Data (Scribe) ‣ Storing and Analyzing Data (Hadoop) ‣ Rapid Learning over Big Data (Pig) ‣ And More: Cassandra, HBase, FlockDB
  • 102. HBase ‣ BigTable clone on top of HDFS ‣ Distributed, column-oriented, no datatypes ‣ Unlike the rest of HDFS, designed for low-latency ‣ Importantly, data is mutable
  • 103. HBase at Twitter ‣ We began building real products based on Hadoop ‣ People search
  • 104. HBase at Twitter ‣ We began building real products based on Hadoop ‣ People search ‣ Old version: offline process on a single node
  • 105. HBase at Twitter ‣ We began building real products based on Hadoop ‣ People search ‣ Old version: offline process on a single node ‣ New version: complex user calculations, hit extra services in real time, custom indexing
  • 106. HBase at Twitter ‣ We began building real products based on Hadoop ‣ People search ‣ Old version: offline process on a single node ‣ New version: complex user calculations, hit extra services in real time, custom indexing ‣ Underlying data is mutable ‣ Mutable layer on top of HDFS --> HBase
  • 107. People Search ‣ Import user data into HBase
  • 108. People Search ‣ Import user data into HBase ‣ Periodic MapReduce job reading from HBase ‣ Hits FlockDB, multiple other internal services in mapper ‣ Custom partitioning
  • 109. People Search ‣ Import user data into HBase ‣ Periodic MapReduce job reading from HBase ‣ Hits FlockDB, multiple other internal services in mapper ‣ Custom partitioning ‣ Data sucked across to sharded, replicated, horizontally scalable, in-memory, low-latency Scala service ‣ Build a trie, do case folding/normalization, suggestions, etc
  • 110. People Search ‣ Import user data into HBase ‣ Periodic MapReduce job reading from HBase ‣ Hits FlockDB, multiple other internal services in mapper ‣ Custom partitioning ‣ Data sucked across to sharded, replicated, horizontally scalable, in-memory, low-latency Scala service ‣ Build a trie, do case folding/normalization, suggestions, etc ‣ See http://www.slideshare.net/al3x/building-distributed-systems-in-scala for more
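The last step of the pipeline (build a trie, fold case, serve suggestions) can be sketched as a small prefix tree. The production service was written in Scala; this Python version, its class names and its sample names are illustrative assumptions only.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # next character -> TrieNode
        self.names = []     # original names whose normalized form ends here


class NameTrie:
    """Prefix tree over case-folded names, returning original spellings."""

    def __init__(self):
        self.root = TrieNode()

    @staticmethod
    def normalize(text):
        # Case folding plus whitespace trimming; real normalization would do more.
        return text.casefold().strip()

    def insert(self, name):
        node = self.root
        for ch in self.normalize(name):
            node = node.children.setdefault(ch, TrieNode())
        node.names.append(name)

    def suggest(self, prefix, limit=5):
        """Return up to `limit` names starting with `prefix`, in alphabetical order."""
        node = self.root
        for ch in self.normalize(prefix):
            node = node.children.get(ch)
            if node is None:
                return []
        results = []
        stack = [node]
        while stack and len(results) < limit:
            current = stack.pop()
            results.extend(current.names)
            # Push children in reverse order so they are popped alphabetically.
            for ch in sorted(current.children, reverse=True):
                stack.append(current.children[ch])
        return results[:limit]


trie = NameTrie()
for name in ["Alice", "alicia", "Bob"]:
    trie.insert(name)

suggestions = trie.suggest("ALI")
print(suggestions)
```

Normalizing both the stored names and the query is what lets "ALI" match "Alice" and "alicia" while the original spellings are still returned.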
  • 111. HBase ‣ More products now being built on top of it ‣ Flexible, easy to connect to MapReduce/Pig
  • 112. HBase vs Cassandra ‣ “Their origins reveal their strengths and weaknesses”
  • 113. HBase vs Cassandra ‣ “Their origins reveal their strengths and weaknesses” ‣ HBase built on top of batch-oriented system, not low latency
  • 114. HBase vs Cassandra ‣ “Their origins reveal their strengths and weaknesses” ‣ HBase built on top of batch-oriented system, not low latency ‣ Cassandra built from ground up for low latency
  • 115. HBase vs Cassandra ‣ “Their origins reveal their strengths and weaknesses” ‣ HBase built on top of batch-oriented system, not low latency ‣ Cassandra built from ground up for low latency ‣ HBase easy to connect to batch jobs as input and output
  • 116. HBase vs Cassandra ‣ “Their origins reveal their strengths and weaknesses” ‣ HBase built on top of batch-oriented system, not low latency ‣ Cassandra built from ground up for low latency ‣ HBase easy to connect to batch jobs as input and output ‣ Cassandra not so much (but we’re working on it)
  • 117. HBase vs Cassandra ‣ “Their origins reveal their strengths and weaknesses” ‣ HBase built on top of batch-oriented system, not low latency ‣ Cassandra built from ground up for low latency ‣ HBase easy to connect to batch jobs as input and output ‣ Cassandra not so much (but we’re working on it) ‣ HBase has SPOF in the namenode
  • 118. HBase vs Cassandra ‣ Your mileage may vary ‣ At Twitter: HBase for analytics, analysis, dataset generation ‣ Cassandra for online systems
  • 119. HBase vs Cassandra ‣ Your mileage may vary ‣ At Twitter: HBase for analytics, analysis, dataset generation ‣ Cassandra for online systems ‣ As with all NoSQL systems: strengths in different situations
  • 120. FlockDB ‣ Realtime, distributed social graph store ‣ NOT optimized for data mining ‣ Note: the following slides largely come from @nk’s more complete talk at http://www.slideshare.net/nkallen/q-con-3770885
  • 121. FlockDB ‣ Realtime, distributed social graph store ‣ NOT optimized for data mining ‣ Who follows who (nearly 8 orders of magnitude!) ‣ Intersection/set operations ‣ Cardinality ‣ Temporal index [Diagram labels: Intersection, Temporal, Counts]
  • 122. Set operations? ‣ This tweet needs to be delivered to people who follow both @aplusk (4.7M followers) and @foursquare (53K followers)
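The delivery rule on this slide is a set intersection between two follower sets of very different sizes. A common way to keep that cheap is to walk the smaller set and probe the larger one, so the cost tracks the smaller account. The sketch below uses tiny made-up follower sets and is not FlockDB's implementation.

```python
def intersect_followers(followers_a, followers_b):
    """Return users present in both sets, iterating over the smaller one."""
    if len(followers_a) <= len(followers_b):
        small, large = followers_a, followers_b
    else:
        small, large = followers_b, followers_a
    return {user_id for user_id in small if user_id in large}


# Hypothetical follower id sets standing in for a huge and a small account.
big_account_followers = set(range(1, 1001))
small_account_followers = {5, 250, 999, 5000}

recipients = intersect_followers(big_account_followers, small_account_followers)
print(sorted(recipients))
```

With the sizes on the slide, this means roughly 53K membership checks instead of a pass over 4.7M followers.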
  • 123. Original solution ‣ MySQL table ‣ Indices on source_id and destination_id ‣ Couldn’t handle write throughput ‣ Indices too large for RAM [Table: source_id, destination_id: (20, 12), (29, 12), (34, 16)]
  • 124. Next Try ‣ MySQL still ‣ Denormalized ‣ Byte-packed ‣ Chunked ‣ Still temporally ordered
  • 125. Next Try ‣ Problems ‣ O(n) deletes ‣ Data consistency challenges ‣ Inefficient intersections ‣ All of these manifested strongly for huge users like @aplusk or @lancearmstrong
  • 126. FlockDB ‣ MySQL underneath still (like PNUTS from Y!) ‣ Partitioned by user_id, gizzard handles sharding/partitioning ‣ Edges stored in both directions, indexed by (src, dest) ‣ Denormalized counts stored [Forward table (source_id, destination_id, updated_at, x): (20, 12, 20:50:14, x), (20, 13, 20:51:32), (20, 16)] [Backward table (destination_id, source_id, updated_at, x): (12, 20, 20:50:14, x), (12, 32, 20:51:32), (12, 16)]
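Storing each edge in both directions and keeping denormalized counts can be sketched with two dictionaries and two counters. This in-memory toy ignores the MySQL storage, partitioning and gizzard entirely; the class, its method names and the counter names are hypothetical.

```python
from collections import defaultdict


class ToyGraphStore:
    """In-memory stand-in for an edge store indexed in both directions."""

    def __init__(self):
        self.forward = defaultdict(dict)    # source_id -> {destination_id: updated_at}
        self.backward = defaultdict(dict)   # destination_id -> {source_id: updated_at}
        self.out_count = defaultdict(int)   # denormalized count of edges per source
        self.in_count = defaultdict(int)    # denormalized count of edges per destination

    def add_edge(self, source_id, destination_id, updated_at):
        is_new = destination_id not in self.forward[source_id]
        self.forward[source_id][destination_id] = updated_at
        self.backward[destination_id][source_id] = updated_at
        if is_new:  # re-adding an existing edge only refreshes its timestamp
            self.out_count[source_id] += 1
            self.in_count[destination_id] += 1

    def remove_edge(self, source_id, destination_id):
        if destination_id in self.forward[source_id]:
            del self.forward[source_id][destination_id]
            del self.backward[destination_id][source_id]
            self.out_count[source_id] -= 1
            self.in_count[destination_id] -= 1


store = ToyGraphStore()
store.add_edge(20, 12, "20:50:14")
store.add_edge(20, 13, "20:51:32")
store.add_edge(32, 12, "20:51:32")
store.add_edge(20, 12, "20:52:00")   # duplicate edge: counts stay the same

print(store.out_count[20], store.in_count[12], sorted(store.backward[12]))
```

Every write touches both indices, which costs extra work up front but means either side of a relationship, and its count, can be read without scanning.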
  • 127. FlockDB Timings ‣ Counts: 1ms
  • 128. FlockDB Timings ‣ Counts: 1ms ‣ Temporal Query: 2ms
  • 129. FlockDB Timings ‣ Counts: 1ms ‣ Temporal Query: 2ms ‣ Writes: 1ms for journal, 16ms for durability
  • 130. FlockDB Timings ‣ Counts: 1ms ‣ Temporal Query: 2ms ‣ Writes: 1ms for journal, 16ms for durability ‣ Full walks: 100 edges/ms
  • 131. FlockDB is Open Source ‣ We will maintain a community at ‣ http://www.github.com/twitter/flockdb ‣ http://www.github.com/twitter/gizzard ‣ See Nick Kallen’s QCon talk for more ‣ http://www.slideshare.net/nkallen/q-con-3770885
  • 132. Cassandra ‣ Why Cassandra, for Twitter?
  • 133. Cassandra ‣ Why Cassandra, for Twitter? ‣ Old/current: vertically, horizontally partitioned MySQL
  • 134. Cassandra ‣ Why Cassandra, for Twitter? ‣ Old/current: vertically, horizontally partitioned MySQL ‣ All kinds of caching layers, all application managed
  • 135. Cassandra ‣ Why Cassandra, for Twitter? ‣ Old/current: vertically, horizontally partitioned MySQL ‣ All kinds of caching layers, all application managed ‣ Alter table impossible, leads to bitfields, piggyback tables
  • 136. Cassandra ‣ Why Cassandra, for Twitter? ‣ Old/current: vertically, horizontally partitioned MySQL ‣ All kinds of caching layers, all application managed ‣ Alter table impossible, leads to bitfields, piggyback tables ‣ Hardware intensive, error prone, etc
  • 137. Cassandra ‣ Why Cassandra, for Twitter? ‣ Old/current: vertically, horizontally partitioned MySQL ‣ All kinds of caching layers, all application managed ‣ Alter table impossible, leads to bitfields, piggyback tables ‣ Hardware intensive, error prone, etc ‣ Not to mention, we hit MySQL write limits sometimes
  • 138. Cassandra ‣ Why Cassandra, for Twitter? ‣ Old/current: vertically, horizontally partitioned MySQL ‣ All kinds of caching layers, all application managed ‣ Alter table impossible, leads to bitfields, piggyback tables ‣ Hardware intensive, error prone, etc ‣ Not to mention, we hit MySQL write limits sometimes ‣ First goal: move all tweets to Cassandra
  • 139-140. Cassandra ‣ Why Cassandra, for Twitter? ‣ Decentralized, fault-tolerant ‣ All kinds of caching layers, all application managed ‣ Alter table impossible, leads to bitfields, piggyback tables ‣ Hardware intensive, error prone, etc ‣ Not to mention, we hit MySQL write limits sometimes ‣ First goal: move all tweets to Cassandra
  • 141. Cassandra ‣ Why Cassandra, for Twitter? ‣ Decentralized, fault-tolerant ‣ All kinds of caching layers, all application managed ‣ Flexible schema ‣ Hardware intensive, error prone, etc ‣ Not to mention, we hit MySQL write limits sometimes ‣ First goal: move all tweets to Cassandra
  • 142. Cassandra ‣ Why Cassandra, for Twitter? ‣ Decentralized, fault-tolerant ‣ All kinds of caching layers, all application managed ‣ Flexible schema ‣ Elastic ‣ Not to mention, we hit MySQL write limits sometimes ‣ First goal: move all tweets to Cassandra
  • 143. Cassandra ‣ Why Cassandra, for Twitter? ‣ Decentralized, fault-tolerant ‣ All kinds of caching layers, all application managed ‣ Flexible schema ‣ Elastic ‣ High write throughput ‣ First goal: move all tweets to Cassandra
  • 144. Eventually Consistent? ‣ Twitter is already eventually consistent
  • 145. Eventually Consistent? ‣ Twitter is already eventually consistent ‣ Your system may be even worse
  • 146. Eventually Consistent? ‣ Twitter is already eventually consistent ‣ Your system may be even worse ‣ Ryan’s new term: “potential consistency” ‣ Do you have write-through caching? ‣ Do you ever have MySQL replication failures?
  • 147. Eventually Consistent? ‣ Twitter is already eventually consistent ‣ Your system may be even worse ‣ Ryan’s new term: “potential consistency” ‣ Do you have write-through caching? ‣ Do you ever have MySQL replication failures? ‣ There is no automatic consistency repair there, unlike Cassandra
  • 148. Eventually Consistent? ‣ Twitter is already eventually consistent ‣ Your system may be even worse ‣ Ryan’s new term: “potential consistency” ‣ Do you have write-through caching? ‣ Do you ever have MySQL replication failures? ‣ There is no automatic consistency repair there, unlike Cassandra ‣ http://www.slideshare.net/ryansking/scaling-twitter-with-cassandra
  • 149. Rolling out Cassandra ‣ 1. Integrate Cassandra alongside MySQL ‣ 100% reads/writes to MySQL ‣ Dynamic switches for % dark reads/writes to Cassandra
  • 150. Rolling out Cassandra ‣ 1. Integrate Cassandra alongside MySQL ‣ 100% reads/writes to MySQL ‣ Dynamic switches for % dark reads/writes to Cassandra ‣ 2. Turn up traffic to Cassandra
  • 151. Rolling out Cassandra ‣ 1. Integrate Cassandra alongside MySQL ‣ 100% reads/writes to MySQL ‣ Dynamic switches for % dark reads/writes to Cassandra ‣ 2. Turn up traffic to Cassandra ‣ 3. Find something that’s broken, set switch to 0%
  • 152. Rolling out Cassandra ‣ 1. Integrate Cassandra alongside MySQL ‣ 100% reads/writes to MySQL ‣ Dynamic switches for % dark reads/writes to Cassandra ‣ 2. Turn up traffic to Cassandra ‣ 3. Find something that’s broken, set switch to 0% ‣ 4. Fix it
  • 153. Rolling out Cassandra ‣ 1. Integrate Cassandra alongside MySQL ‣ 100% reads/writes to MySQL ‣ Dynamic switches for % dark reads/writes to Cassandra ‣ 2. Turn up traffic to Cassandra ‣ 3. Find something that’s broken, set switch to 0% ‣ 4. Fix it ‣ 5. GOTO 2
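The rollout loop above depends on a switch that sends a configurable percentage of traffic to the new store without letting it affect users. A minimal sketch of that idea follows, with plain dictionaries standing in for MySQL and Cassandra; the `DarkTrafficRouter` class and its fields are hypothetical and not Twitter's code.

```python
import random


class DarkTrafficRouter:
    """Serve everything from the primary store; mirror a percentage to a shadow store."""

    def __init__(self, primary, shadow, dark_percent=0.0, rng=None):
        self.primary = primary            # source of truth (stand-in for MySQL)
        self.shadow = shadow              # store under test (stand-in for Cassandra)
        self.dark_percent = dark_percent  # 0 to 100, adjustable at runtime
        self.rng = rng or random.Random()
        self.shadow_errors = 0

    def _go_dark(self):
        return self.rng.random() * 100 < self.dark_percent

    def write(self, key, value):
        self.primary[key] = value
        if self._go_dark():
            try:
                self.shadow[key] = value
            except Exception:
                self.shadow_errors += 1   # shadow failures never reach the caller

    def read(self, key):
        result = self.primary.get(key)
        if self._go_dark():
            try:
                self.shadow.get(key)      # dark read: exercised, result discarded
            except Exception:
                self.shadow_errors += 1
        return result


mysql, cassandra = {}, {}
router = DarkTrafficRouter(mysql, cassandra, dark_percent=0)
router.write("tweet:1", "hello")      # step 1: all traffic to the primary only

router.dark_percent = 100             # step 2: turn up dark traffic
router.write("tweet:2", "world")

router.dark_percent = 0               # step 3: something broke, switch back to 0%
router.write("tweet:3", "fixing")
```

Because callers only ever see results from the primary, the switch can be turned up and down repeatedly while problems in the new store are found and fixed.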
  • 154. Cassandra for Realtime Analytics ‣ Starting a project around realtime analytics ‣ Cassandra as the backing store ‣ Using, developing, testing Digg’s atomic incr patches ‣ More soon.
  • 155. That was a lot of slides ‣ Thanks for sticking with me.
  • 156. Questions? Follow me at twitter.com/kevinweil

Editor's Notes