Executing Queries on a Sharded
           Database

           Neha Narula
        September 25, 2012
Choosing and Scaling Your
       Datastore

        Neha Narula
     September 25, 2012
Who Am I?

       Froogle
       Blobstore
       Native Client
In This Talk
• What to think about when choosing a
  datastore for a web application
• Myths and legends
• Executing distributed queries
• My research: a consistent cache
Every so often…



Friends ask me for advice when they are
       building a new application
Friend
“Hi Neha! I am making a new application.
I have heard MySQL sucks and I should use
                NoSQL.
I am going to be iterating on my app a lot
I don't have any customers yet
and my current data set could fit on a thumb drive
                from 2004, but…
Can you tell me which NoSQL database I
              should use?
And how to shard it?"
Neha
http://knowyourmeme.com/memes/facepalm
What to Think about When You’re
     Choosing a Datastore

        Hint: Not Scaling
Development Cycle
• Prototyping
  – Ease of use
• Developing
  – Flexibility, multiple developers
• Running a real production system
  – Reliability
• SUCCESS! Sharding and specialized
  data storage
  – We should only be so lucky
Getting Started

   First five minutes – what’s it like?




Idea credit: Adam Marcus, The NoSQL Ecosystem, HPTS 2011

              via Justin Sheehy from Basho
First Five Minutes: Redis




             http://simonwillison.net/2009/Oct/22/redis/
First Five Minutes: MySQL
Developing

• Multiple people working on the same code
• Testing new features => new access patterns
• New person comes along…
Redis


Time to go get
    lunch
MySQL
Questions
•   What is your mix of reads and writes?
•   How much data do you have?
•   Do you need transactions?
•   What are your access patterns?
•   What will grow and change?
•   What do you already know?
Reads and Writes
• Write optimized vs. Read optimized
• MongoDB has a global write lock per process
• No concurrent writes!
Size of Data
• Does it fit in memory?
• Disk-based solutions will be slower
• Worst case, Redis needs 2X the memory of
  your data!
   – Background saves fork() and rely on copy-on-
     write; heavy writes can duplicate every page



echo 1 | sudo tee /proc/sys/vm/overcommit_memory
Requirements
• Performance
  – Latency tolerance
• Durability
  – Data loss tolerance
• Freshness
  – Staleness tolerance
• Uptime
  – Downtime tolerance
Performance
• Relational databases are considered slow
• But they are doing a lot of work!
• Sometimes you need that work, sometimes
  you don’t.
Simplest Scenario
•   Responding to user requests in real time
•   Frequently read data fits in memory
•   High read rate
•   No need to go to disk or scan lots of data

                   Datastore CPU is the bottleneck
Cost of Executing a Primary Key Lookup

Query cost:
• Receiving the message
• Instantiating a thread
• Parsing the query
• Optimizing
• Taking out read locks
• (Sometimes) a lookup in an index btree
• Responding to the client

…versus actually retrieving the data
Options to Make This Fast
• Query cache
• Prepared statements
• HandlerSocket
  – 750K lookups/sec on a single server


Lesson: Primary key lookups can be fast no
matter what datastore you use
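To make that lesson concrete, here is a minimal Python/sqlite3 sketch (the users table and its rows are invented for illustration) in which a parameterized primary-key lookup lets the engine skip most of the per-query overhead and do little beyond fetching the row:

```python
import sqlite3

# In-memory database with a simple keyed table (hypothetical schema).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])

def lookup(user_id):
    # Parameterized queries let sqlite3 cache the compiled statement,
    # so repeated lookups pay mostly the row-retrieval cost.
    row = db.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchone()
    return row[0] if row else None

print(lookup(2))  # -> bob
```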
Flexibility vs. Performance
• We might want to pay the overhead for query
  flexibility
• In a primary key datastore, we can only ask
  queries on primary key
• SQL gives us flexibility to change our queries
Durability
• Persistent datastores
  – Client: write
  – Server: flush to disk, then send “I completed your
    write”
  – CRASH
  – Recover: See the write
• By default, MongoDB does not fsync() before
  returning to client on write
  – Need j:true
• Before 5.5, MySQL defaulted to MyISAM
  (not crash-safe) instead of InnoDB
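The flush-before-acknowledge rule above can be sketched in a few lines of Python (the file name and record format are made up); the point is that the server may only report success after the data has reached stable storage:

```python
import os
import tempfile

def durable_append(path, record):
    """Append a record; only acknowledge after it is on disk."""
    with open(path, "a") as f:
        f.write(record + "\n")
        f.flush()             # user-space buffer -> OS page cache
        os.fsync(f.fileno())  # OS page cache -> stable storage
    return "I completed your write"  # safe to say only after fsync

log = os.path.join(tempfile.mkdtemp(), "wal.log")
ack = durable_append(log, "comment:100:hello")
# A crash after the ack cannot lose the record; recovery sees it.
print(open(log).read().strip())
```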
Specialization
• You know your
  – query access patterns and traffic
  – consistency requirements
• Design specialized lookups
  – Transactional datastore for consistent data
  – Memcached for static, mostly unchanging content
  – Redis for a data processing pipeline
• Know what tradeoffs to make
Ways to Scale
• Reads
  – Cache
  – Replicate
  – Partition
• Writes
  – Partition data amongst multiple servers
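Partitioning writes starts with a deterministic key-to-server mapping; a minimal sketch (the server names are hypothetical) that routes each partition key by hash:

```python
import hashlib

SERVERS = ["db0", "db1", "db2"]  # hypothetical shard names

def shard_for(key):
    """Route a key to one server; all writes for a key hit one shard."""
    h = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
    return SERVERS[h % len(SERVERS)]

# Every write for post 100 goes to the same shard:
print(shard_for(100) == shard_for(100))  # -> True
```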
Lots of Folklore
Myths of Sharded Datastores
• NoSQL scales better than a relational database
• You can’t do a JOIN on a sharded datastore
MYTH: NoSQL scales better than a
        relational database
• Scaling isn’t about the datastore, it’s about the
  application
  – Examples: FriendFeed, Quora, Facebook
• Complex queries that go to all shards don’t
  scale
• Simple queries that partition well and use only
  one shard do scale
Problem: Applications Look Up
    Data in Different Ways
HA!
Post Page

SQL:
SELECT *
FROM comments
WHERE post_id = 100

Redis:
zrange comments:100 0 -1
Example Partitioned Database

[Diagram: webservers route the comments table by
post_id across three databases holding post_ids
0-99, 100-199, and 200-299.]
Query Goes to One Partition

[Diagram: the query for comments on post 100 goes
only to the MySQL server holding 100-199; the
0-99 and 200-299 servers are untouched.]
Many Concurrent Queries

[Diagram: queries for comments on posts 52, 100,
and 289 each go to a different MySQL server
(0-99, 100-199, 200-299), spreading the load.]
User Comments Page
Fetch all of a user's comments:

SELECT * FROM comments WHERE user = 'sam'
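Because user is not the partition key, this query must be sent to every shard and the results merged; a toy Python sketch of that scatter-gather (the shard contents are invented):

```python
# Three shards, partitioned by post_id range; comments stored per shard.
shards = [
    [{"post_id": 52,  "user": "sam", "text": "hi"}],    # posts 0-99
    [{"post_id": 100, "user": "ann", "text": "yo"}],    # posts 100-199
    [{"post_id": 289, "user": "sam", "text": "nice"}],  # posts 200-299
]

def comments_by_user(user):
    """user is not the partition key, so every shard must be queried."""
    results = []
    for shard in shards:  # one round trip per shard; cost grows with N
        results += [c for c in shard if c["user"] == user]
    return results

print(len(comments_by_user("sam")))  # -> 2
```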
Query Goes to All Partitions

[Diagram: fetching Sam's comments queries every
MySQL server (0-99, 100-199, 200-299), because
user is not the partition key.]
Costs for Query Go Up When Adding a New Server

[Diagram: with a fourth server (300-399), Sam's
comments query now goes to four MySQL servers;
per-query cost grows as servers are added.]
CPU Cost on Server of Retrieving One Row

Two costs: query overhead and row retrieval

[Chart: in MySQL, query overhead is 97% of the CPU
cost of retrieving one row by primary key.]
Idea: Multiple Partitionings
Partition comments table on post_id and user.

Reads can choose appropriate copy so queries go to
only one server.

Writes go to all copies.
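A minimal Python sketch of the idea (in-memory dicts standing in for the two partitioned table copies): every write lands in both copies, and each read is served entirely by the copy whose partition key matches the query:

```python
# Two copies of the comments table, partitioned differently (toy data).
by_post = {}  # copy sharded on post_id
by_user = {}  # copy sharded on user

def write_comment(post_id, user, text):
    """Writes go to every table copy (cost grows with T copies)."""
    by_post.setdefault(post_id, []).append((user, text))
    by_user.setdefault(user, []).append((post_id, text))

def comments_for_post(post_id):  # served entirely by the post_id copy
    return by_post.get(post_id, [])

def comments_for_user(user):     # served entirely by the user copy
    return by_user.get(user, [])

write_comment(100, "vicki", "first!")
print(comments_for_post(100), comments_for_user("vicki"))
```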
Multiple Partitionings

[Diagram: two copies of the comments table across
three databases, one partitioned by post_id (0-99,
100-199, 200-299) and one by user (A-J, K-R, S-Z).]
All Read Queries Go To Only One Partition

[Diagram: comments on post 100 go to the 100-199
post_id partition; Sam's comments go to the S-Z
user partition. Each read touches one server.]
Writes Go to Both Table Copies

[Diagram: a new comment by Vicki on post 100 is
written twice: to the 100-199 post_id partition
and to the S-Z user partition.]
Scaling
• Reads scale by N – the number of servers
• Writes are slowed down by T – the number of
  table copies
• Big win if N >> T
• Table copies use more memory
  – Often don’t have to copy large tables – usually
    metadata
• How to execute these queries?
Dixie

• A query planner that executes SQL queries over a
  partitioned database

• Optimizes read queries to minimize query overhead and
  data retrieval costs

• Uses a novel cost formula which takes advantage of data
  partitioned in different ways
Wikipedia Workload with Dixie

• Database dump from 2008
• Real HTTP traces
• Sharded across 1, 2, 5, and 10 servers
• Compare by adding a copy of the page
  table (< 100 MB), sharded another way

[Chart: QPS vs. number of partitions. With one
copy of the page table, many queries go to all
servers; with Dixie and a copy partitioned on
page.title, each query goes to one server, a
3.2X throughput win at 10 partitions.]
Myths of Sharded Datastores
• NoSQL scales better than a relational database
• You can’t do a JOIN on a sharded datastore
MYTH: You can’t execute a JOIN on a
         sharded database
• What are the JOIN keys? What are your
  partition keys?
• Bad to move lots of data
• Not bad to lookup a few pieces of data
• Index lookup JOINs expressed as primary key
  lookups are fast
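Using the example posts and comments tables from this talk, a Python/sqlite3 sketch of an index lookup join: fetch the matching post ids first, then express the join as an IN list of cheap key lookups:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, author TEXT)")
db.execute("CREATE TABLE comments (post_id INTEGER, user TEXT, text TEXT)")
db.executemany("INSERT INTO posts VALUES (?, ?)",
               [(1, "Max"), (3, "Max"), (22, "Max"), (37, "Max")])
db.executemany("INSERT INTO comments VALUES (?, ?, ?)",
               [(3, "Alice", "First post!"), (6, "Alice", "Like."),
                (7, "Alice", "You think?"), (22, "Alice", "Nice work!")])

# Step 1: fetch the ids of Max's posts (a small intermediate result).
ids = [r[0] for r in
       db.execute("SELECT id FROM posts WHERE author = ?", ("Max",))]

# Step 2: turn the join into key lookups with an IN (...) list.
marks = ",".join("?" * len(ids))
rows = db.execute(
    f"SELECT text FROM comments WHERE post_id IN ({marks}) AND user = ?",
    ids + ["Alice"]).fetchall()
print([r[0] for r in rows])  # -> ['First post!', 'Nice work!']
```

The two steps are exactly the two queries shown on the Index Lookup Join slides, here run against one database for simplicity; on a sharded system each step is routed by its own partition key.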
Join Query

Fetch all of Alice's comments on Max's posts.
         comments table                 posts table
 post_id     user     text         id   author        link

    3        Alice   First post!   1     Max      http://…
    6        Alice   Like.         3     Max      www.
    7       Alice    You think?    22    Max      Ask Hacker
    22      Alice    Nice work!    37    Max      http://…
Distributed Query Planning
• R*
• Shore
• The State of the Art in Distributed Query
  Processing, by Donald Kossman
• Gizzard
Conventional Wisdom

Partition on JOIN keys; send the query as-is to
each server (each holding 0-99, 100-199, 200-299
of both tables).

 SELECT comments.text
 FROM posts, comments
 WHERE comments.post_id = posts.id
 AND posts.author = 'Max'
 AND comments.user = 'Alice'
Conventional Plan Scaling

[Diagram: the query for Alice's comments on Max's
posts fans out to every MySQL server.]
Index Lookup Join

Partition on filter keys: posts by author (A-I,
J-R, S-Z) and comments by post_id (0-99, 100-199,
200-299). Retrieve Max's posts, then retrieve
Alice's comments on Max's posts.

Step 1: fetch the ids of Max's posts.

 SELECT posts.id
 FROM posts
 WHERE posts.author = 'Max'

Step 2: fetch Alice's comments on those post ids.

 SELECT comments.text
 FROM comments
 WHERE comments.post_id IN […]
 AND comments.user = 'Alice'

The result is Alice's comments on Max's posts.
Index Lookup Join Plan Scaling

[Diagram: the query for Alice's comments on Max's
posts touches only a few MySQL servers.]
Comparison

Conventional Plan vs. Index Lookup Plan: the index
lookup plan pays for intermediate data, Max's
posts Alice DIDN'T comment on.
Dixie
• Cost model and query optimizer which
  predicts the costs of the two plans
• Query executor which executes the original
  query, designed for a single database, on a
  partitioned database with table copies
Dixie's Predictions for the Two Plans

• Varying the size of the intermediate data
  – Max's posts Alice didn't comment on
• Fixing the number of results returned
• Partitioned over 10 servers

[Chart: queries/sec vs. posts/author. The index
lookup (2-step join) plan sends each query to two
servers; the conventional (pushdown join) plan
sends each query to all servers. Dixie's cost
model estimates a curve for each plan.]
Performance of the Two Plans

[Chart: queries/sec vs. posts/author for the
2-step join and pushdown join, measured and
estimated. Past a crossover point the pushdown
join beats the 2-step join; Dixie predicts it.]
JOINs Are Not Evil
• When only transferring small amounts of data
• And with carefully partitioned data
Lessons Learned
• Don’t worry at the beginning: Use what you
  know
• Not about SQL vs. NoSQL systems – you can
  scale any system
• More about complex vs. simple queries
• When you do scale, try to make every query
  use one (or a few) shards
Research: Caching
 • Applications use a cache to store results
   computed from a webserver or database rows
 • Expiration or invalidation?
 • Annoying and difficult for app developers to
   invalidate cache items correctly


Work in progress with Bryan Kate, Eddie Kohler, Yandong Mao, and Robert Morris
                               Harvard and MIT
Solution: Dependency Tracking Cache
• Application indicates what data went into a
  cached result
• The system will take care of invalidations
• Don’t need to recalculate expired data if it has
  not changed
• Always get the freshest data when data has
  changed
• Track ranges, even if empty!
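A toy Python sketch of dependency tracking (the key names are illustrative, not the system's actual API): each cached item records the data keys it was computed from, and a write to any of those keys invalidates it automatically:

```python
cache = {}  # cache key -> cached value
deps = {}   # data key -> set of cache keys computed from it

def put(key, value, depends_on):
    """Cache a result and record which data it was computed from."""
    cache[key] = value
    for d in depends_on:
        deps.setdefault(d, set()).add(key)

def write(data_key):
    """The system, not the app, invalidates dependent cache entries."""
    for k in deps.pop(data_key, set()):
        cache.pop(k, None)

put("timeline:neha", ["tweet1", "tweet2"], depends_on=["tweets:bob"])
write("tweets:bob")              # bob tweets -> timeline invalidated
print("timeline:neha" in cache)  # -> False
```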
Example Application: Twitter
• Cache a user’s view of their twitter homepage
  (timeline)
• Only invalidate when a followed user tweets
• Even if many followed users tweet, only
  recalculate the timeline once when read
• No need to recalculate on read if no one has
  tweeted
Challenges
• Invalidation + expiration times
  – Reading stale data
• Distributed caching
• Cache coherence on the original database
  rows
Benefits
• Application gets all the benefits of caching
  – Saves on computation
  – Less traffic to the backend database
• Doesn’t have to worry about freshness
Summary
• Choosing a datastore
• Dixie
• Dependency tracking cache
Thanks!



 narula@mit.edu
http://nehanaru.la
      @neha

Webinar Slides: MySQL HA/DR/Geo-Scale - High Noon #4: MS Azure Database MySQL
 
Yes sql08 inmemorydb
Yes sql08 inmemorydbYes sql08 inmemorydb
Yes sql08 inmemorydb
 

Recently uploaded

How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...itnewsafrica
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024TopCSSGallery
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 

Recently uploaded (20)

How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 

Executing Queries on a Sharded Database

  • 1. Executing Queries on a Sharded Database Neha Narula September 25, 2012
  • 2. Choosing and Scaling Your Datastore Neha Narula September 25, 2012
  • 3. Who Am I? Froogle Blobstore Native Client
  • 4. In This Talk • What to think about when choosing a datastore for a web application • Myths and legends • Executing distributed queries • My research: a consistent cache
  • 5. Every so often… Friends ask me for advice when they are building a new application
  • 7. “Hi Neha! I am making a new application.
  • 8. I have heard MySQL sucks and I should use NoSQL.
  • 9. I am going to be iterating on my app a lot
  • 10. I don't have any customers yet
  • 11. and my current data set could fit on a thumb drive from 2004, but…
  • 12. Can you tell me which NoSQL database I should use?
  • 13. And how to shard it?"
  • 14. Neha
  • 16.
  • 17.
  • 18. What to Think about When You’re Choosing a Datastore Hint: Not Scaling
  • 19. Development Cycle • Prototyping – Ease of use • Developing – Flexibility, multiple developers • Running a real production system – Reliability • SUCCESS! Sharding and specialized data storage – We should only be so lucky
  • 20. Getting Started First five minutes – what’s it like? Idea credit: Adam Marcus, The NoSQL Ecosystem, HPTS 2011 via Justin Sheehy from Basho
  • 21. First Five Minutes: Redis http://simonwillison.net/2009/Oct/22/redis/
  • 23. Developing • Multiple people working on the same code • Testing new features => new access patterns • New person comes along…
  • 24. Redis Time to go get lunch
  • 25. MySQL
  • 26. Questions • What is your mix of reads and writes? • How much data do you have? • Do you need transactions? • What are your access patterns? • What will grow and change? • What do you already know?
  • 27. Reads and Writes • Write optimized vs. Read optimized • MongoDB has a global write lock per process • No concurrent writes!
  • 28. Size of Data • Does it fit in memory? • Disk-based solutions will be slower • Worst case, Redis needs 2X the memory of your data! – It forks with copy-on-write – Workaround: sudo echo 1 > /proc/sys/vm/overcommit_memory
  • 29. Requirements • Performance – Latency tolerance • Durability – Data loss tolerance • Freshness – Staleness tolerance • Uptime – Downtime tolerance
  • 30. Performance • Relational databases are considered slow • But they are doing a lot of work! • Sometimes you need that work, sometimes you don’t.
  • 31. Simplest Scenario • Responding to user requests in real time • Frequently read data fits in memory • High read rate • No need to go to disk or scan lots of data • Result: the datastore CPU is the bottleneck
  • 32. Cost of Executing a Primary Key Lookup • Query overhead: receiving the message, instantiating a thread, parsing the query, optimizing, taking out read locks, (sometimes) a lookup in an index b-tree, responding to the client • …versus actually retrieving the data
  • 33. Options to Make This Fast • Query cache • Prepared statements • Handler Socket – 750K lookups/sec on a single server Lesson: Primary key lookups can be fast no matter what datastore you use
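The fast path described above can be sketched in code. This is an illustrative stand-in, not the talk's HandlerSocket setup: sqlite3 plays the role of the database, and the point is that a parameterized primary-key lookup is a single b-tree probe whose statement can be reused across executions, so per-query overhead stays close to the raw row fetch.

```python
import sqlite3

# Sketch: one parameterized statement, executed many times.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "alice"), (2, "bob"), (3, "carol")])

LOOKUP = "SELECT name FROM users WHERE id = ?"  # reused, not re-built per query

def get_name(user_id):
    # Primary-key point lookup: a single index probe, no table scan.
    row = conn.execute(LOOKUP, (user_id,)).fetchone()
    return row[0] if row else None
```

The table name, columns, and `get_name` helper are invented for illustration; the technique (prepared, parameterized point lookups on the primary key) is what the slide recommends.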
  • 34. Flexibility vs. Performance • We might want to pay the overhead for query flexibility • In a primary key datastore, we can only ask queries on primary key • SQL gives us flexibility to change our queries
  • 35. Durability • Persistent datastores – Client: write – Server: flush to disk, then send “I completed your write” – CRASH – Recover: See the write • By default, MongoDB does not fsync() before returning to client on write – Need j:true • By default, MySQL uses MyISAM instead of InnoDB
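The durability contract on slide 35 can be made concrete with a tiny sketch: the server flushes to stable storage before acknowledging, so a crash after the ack can never lose the write. The class, file name, and API below are invented for illustration; the fsync-before-ack ordering is the point.

```python
import os
import tempfile

class DurableLog:
    """Minimal write-ahead log sketch: ack only after fsync."""
    def __init__(self, path):
        self.f = open(path, "ab")

    def write(self, record: bytes) -> str:
        self.f.write(record + b"\n")
        self.f.flush()               # push to the OS buffer cache...
        os.fsync(self.f.fileno())    # ...and force it to stable storage
        return "done"                # only now is it safe to ack the client

path = os.path.join(tempfile.mkdtemp(), "wal.log")
log = DurableLog(path)
ack = log.write(b"comment:100:nice post")
```

A datastore that acknowledges before the fsync (as MongoDB did by default without j:true) can silently drop acknowledged writes on a crash.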
  • 36. Specialization • You know your – query access patterns and traffic – consistency requirements • Design specialized lookups – Transactional datastore for consistent data – Memcached for static, mostly unchanging content – Redis for a data processing pipeline • Know what tradeoffs to make
  • 37. Ways to Scale • Reads – Cache – Replicate – Partition • Writes – Partition data amongst multiple servers
  • 39. Myths of Sharded Datastores • NoSQL scales better than a relational database • You can’t do a JOIN on a sharded datastore
  • 40. MYTH: NoSQL scales better than a relational database • Scaling isn’t about the datastore, it’s about the application – Examples: FriendFeed, Quora, Facebook • Complex queries that go to all shards don’t scale • Simple queries that partition well and use only one shard do scale
  • 41. Problem: Applications Look Up Data in Different Ways
  • 42. Post Page • SQL: SELECT * FROM comments WHERE post_id = 100 • Redis: zrange comments:100 0 -1
  • 43. Example Partitioned Database • comments table partitioned on post_id across three databases: 0-99, 100-199, 200-299 • Webservers route each query to the right partition
  • 44. Query Goes to One Partition • The query for comments on post 100 goes only to the MySQL server holding post_ids 100-199
  • 45. Many Concurrent Queries • Queries for comments on posts 52, 100, and 289 each go to a different partition (0-99, 100-199, 200-299), so they run in parallel
  • 46. User Comments Page Fetch all of a user's comments: SELECT * FROM comments WHERE user = 'sam'
  • 47. Query Goes to All Partitions • Sam's comments are scattered across post_id ranges, so the query must go to all servers (0-99, 100-199, 200-299)
  • 48. Costs for Query Go Up When Adding a New Server • With a fourth MySQL server (300-399), the query for Sam's comments now pays four queries' worth of overhead
  • 49. CPU Cost on Server of Retrieving One Row • Two costs: query overhead and row retrieval • Query overhead is ~97% of the CPU cost on MySQL
  • 50. Idea: Multiple Partitionings Partition comments table on post_id and user. Reads can choose appropriate copy so queries go to only one server. Writes go to all copies.
  • 51. Multiple Partitionings • comments table stored twice: one copy partitioned on post_id (0-99, 100-199, 200-299), one on user (A-J, K-R, S-Z)
  • 52. All Read Queries Go To Only One Partition • Comments on post 100 → the post_id 100-199 server; Sam's comments → the user S-Z server
  • 53. Writes Go to Both Table Copies • A new comment by Vicki on post 100 is written to the post_id 100-199 server and to the user S-Z server
  • 54. Scaling • Reads scale by N – the number of servers • Writes are slowed down by T – the number of table copies • Big win if N >> T • Table copies use more memory – Often don’t have to copy large tables – usually metadata • How to execute these queries?
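The multiple-partitionings idea on slides 50-54 can be sketched in a few lines. Plain dicts stand in for the shards, and the bucket boundaries copy the slides (post_id ranges of 100; users A-J, K-R, S-Z); everything else is invented for illustration.

```python
# Two copies of the comments table, partitioned two different ways.
by_post = {0: [], 1: [], 2: []}             # post_id 0-99 / 100-199 / 200-299
by_user = {"A-J": [], "K-R": [], "S-Z": []}

def user_bucket(user):
    c = user[0].upper()
    return "A-J" if c <= "J" else "K-R" if c <= "R" else "S-Z"

def write_comment(post_id, user, text):
    row = (post_id, user, text)
    by_post[post_id // 100].append(row)      # one shard in copy 1
    by_user[user_bucket(user)].append(row)   # one shard in copy 2 (T writes)

def comments_on_post(post_id):               # single-shard read on copy 1
    return [r for r in by_post[post_id // 100] if r[0] == post_id]

def comments_by_user(user):                  # single-shard read on copy 2
    return [r for r in by_user[user_bucket(user)] if r[1] == user]
```

Reads always hit one shard; each write costs T appends, one per table copy, which is the tradeoff the slide quantifies (big win if N >> T).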
  • 55. Dixie • Is a query planner which executes SQL queries over a partitioned database • Optimizes read queries to minimize query overhead and data retrieval costs • Uses a novel cost formula which takes advantage of data partitioned in different ways
  • 56. Wikipedia Workload with Dixie • Database dump from 2008, real HTTP traces • Sharded across 1, 2, 5, and 10 servers • Compare by adding a copy of the page table (< 100 MB, partitioned on page.title) sharded another way • With one copy, many queries go to all servers; with the extra copy, each query goes to one server, for 3.2X higher QPS (chart: QPS vs. number of partitions)
  • 57. Myths of Sharded Datastores • NoSQL scales better than a relational database • You can’t do a JOIN on a sharded datastore
  • 58. MYTH: You can’t execute a JOIN on a sharded database • What are the JOIN keys? What are your partition keys? • Bad to move lots of data • Not bad to lookup a few pieces of data • Index lookup JOINs expressed as primary key lookups are fast
  • 59. Join Query • Fetch all of Alice's comments on Max's posts • comments table (post_id, user, text): (3, Alice, First post!), (6, Alice, Like.), (7, Alice, You think?), (22, Alice, Nice work!) • posts table (id, author, link): (1, Max, http://…), (3, Max, www.), (22, Max, Ask Hacker), (37, Max, http://…)
  • 60. Join Query • Same tables, with the matching rows (post_ids 3 and 22) forming the join result
  • 61. Distributed Query Planning • R* • Shore • The State of the Art in Distributed Query Processing, by Donald Kossman • Gizzard
  • 62. Conventional Wisdom • Partition on JOIN keys; send the query as-is to each server: SELECT comments.text FROM posts, comments WHERE comments.post_id = posts.id AND posts.author = 'Max' AND comments.user = 'Alice'
  • 63. Conventional Plan Scaling • The query for Alice's comments on Max's posts goes to every MySQL server
  • 64. Index Lookup Join • Partition on filter keys; first retrieve Max's posts, then retrieve Alice's comments on those posts • Step 1: SELECT posts.id FROM posts WHERE posts.author = 'Max'
  • 65. Index Lookup Join • The ids of Max's posts come back as intermediate data
  • 66. Index Lookup Join • Step 2: SELECT comments.text FROM comments WHERE comments.post_id IN […] AND comments.user = 'Alice'
  • 67. Index Lookup Join • Result: Alice's comments on Max's posts
  • 68. Index Lookup Join Plan Scaling • Each step is routed by its filter key, so Alice's comments on Max's posts come from only the relevant MySQL servers
  • 69. Comparison • Conventional plan vs. index lookup plan • The index lookup plan's extra cost: intermediate data, the Max's posts Alice DIDN'T comment on
  • 70. Comparison • (Animation step of the previous slide)
  • 71. Dixie • Cost model and query optimizer which predicts the costs of the two plans • Query executor which executes the original query, designed for a single database, on a partitioned database with table copies
  • 72. Dixie's Predictions for the Two Plans • Varying the size of the intermediate data (Max's posts Alice didn't comment on) • Fixing the number of results returned • Partitioned over 10 servers • (Chart: queries/sec vs. posts/author; 2-step join estimate, each query going to two servers, vs. pushdown join estimate, each query going to all servers)
  • 73. Performance of the Two Plans • (Chart: measured and estimated queries/sec vs. posts/author for the 2-step and pushdown joins; the plans cross over as posts/author grows, and Dixie predicts the crossover)
  • 74. JOINs Are Not Evil • When only transferring small amounts of data • And with carefully partitioned data
  • 75. Lessons Learned • Don’t worry at the beginning: Use what you know • Not about SQL vs. NoSQL systems – you can scale any system • More about complex vs. simple queries • When you do scale, try to make every query use one (or a few) shards
  • 76. Research: Caching • Applications use a cache to store results computed from a webserver or database rows • Expiration or invalidation? • Annoying and difficult for app developers to invalidate cache items correctly Work in progress with Bryan Kate, Eddie Kohler, Yandong Mao, and Robert Morris Harvard and MIT
  • 77. Solution: Dependency Tracking Cache • Application indicates what data went into a cached result • The system will take care of invalidations • Don’t need to recalculate expired data if it has not changed • Always get the freshest data when data has changed • Track ranges, even if empty!
  • 78. Example Application: Twitter • Cache a user’s view of their twitter homepage (timeline) • Only invalidate when a followed user tweets • Even if many followed users tweet, only recalculate the timeline once when read • No need to recalculate on read if no one has tweeted
  • 79. Challenges • Invalidation + expiration times – Reading stale data • Distributed caching • Cache coherence on the original database rows
  • 80. Benefits • Application gets all the benefits of caching – Saves on computation – Less traffic to the backend database • Doesn’t have to worry about freshness
  • 81. Summary • Choosing a datastore • Dixie • Dependency tracking cache

Editor's Notes

  1. First, a bit about me. Former Software Engineer at Google – worked on Froogle, a product aggregator and shopping website, Blobstore, a storage system for large binary objects like gmail attachments and images, and Native Client, a way of running fast native code safely in the browser. Left to go be a PhD student at MIT, studying distributed systems at CSAIL (computer science and AI lab). I’ve been there for four years! This summer I decided to do an internship at news.me, as a data scientist, because I’m really interested in how we get our news and content these days. News.me then bought digg.com (interesting story) so what I really did last summer was work on powering real-time news by looking at social sharing on twitter and facebook, and relaunching digg in six weeks.
  2. Here’s what I’m going to cover in this talk.First,Then, some myths I’ve heard about NoSQL, SQL, and partitioning on the internet.
  10. But I kind of understand where they are coming from – there are so many options out there.
  11. Apologies if I left your favorite out.It’s totally overwhelming, and there is a lot of hype.
  12. So I’m here to tell you what to think about when you’re choosing a datastore. It’s definitely not scaling!
  13. We can see what is really important by looking at the development cycle of a project.When prototyping, we really care about iterating quickly, and how easy the storage system is to use and how soon you can get up and running with itWhen developing, we care a lot about flexibility, the ability to develop new features, and having a codebase that multiple developers can use.When we’re running your system in the real world, our first priority is reliability and performance.If we even get to the stage where we need to think about sharding, it means we’ve really succeeded, and it’s a great problem to have!But we might not even get there.
  14. One thing a lot of developers pay attention to is how easy it is to get started with a datastore.This is something I saw in someone else’s presentation, and I thought it was really great – what are the first five minutes like with your datastore?
  15. Here we have redis, and it’s pretty awesome. This guy has created a 30 second guide -- A few commands and I’m storing keys and getting them back out. Very little startup overhead.
  16. Contrast that with MySQL – I didn’t even show you the actual commands to run to get to the same point, because we have to go through so much other stuff first. Namely, setting up permissions, creating a database, and choosing a schema, which is something I might not even be ready to do yet!!!Redis seems like a big win here.
  17. But – what happens when we get to the next stage, developing? At this point, we might have multiple people working on the code, we’re testing new features and accessing the data in different ways.Someone new might want to see how your data is structured.Let’s go back and look at Redis and MySQL again
  18. This is a redis server I used at Digg for testing. I came in not knowing what was in this datastore – some of the key patterns were strewn about the code, which is not ideal. And the only way to tell the type of the key (set, sorted set, etc) was to look at the operations on it – zadd vs. set and get.How do I find out what’s going on?Redis has a keys command…. But this datastore is 24 GB of smallish keys. Running “keys” will take forever (click) and overload my server. This isn’t a good thing to do on a production system.So it’s hard to find out what’s going on here.
  19. Contrast that with MySQL – because SQL is a flexible query language, I can learn about the data really quickly. Here I’m learning about the table “cards”. I can describe it, and select a few rows to see what’s in it.I can also easily add an index if I want to look up stuff in different ways, no need to completely rewrite everything in the database or change my code a lot.So if it’s not scaling and it’s not initial impressions that are important, what is?
  20. What we should really think about are some of the following key questions. I’m going to present this as a list, and then go into detail on a few of themFirst off, we would want to choose a datastore optimized for how many reads/writes the app is doing.The size of our data is incredibly important when choosing from in-memory vs. disk optimized data stores. Do we require multi-statement transactions? If so, there are only a few datastores that provide that out of the box. However, it is possible to build out this layer in the application.Our access patterns are important too, as is how you predict your workload will grow and change, and I’ll describe this in more detail later in the talk when I talk about scaling.Finally, what do you already know? The best datastore is the one you’re most familiar with. A lot of them have pretty funny semantics, and I’ll give some examples.
  21. It’s important to figure out whether you need a read-optimized or a write-optimized datastore. For example, if your application needs to do a lot of writes, you might be in trouble if you choose MongoDB – [click] MongoDB has a global write lock per process. What this means is that you can’t do concurrent writes, so your write-heavy application will be really slowed down!
  22. So let’s talk about the size of our data. Disk based solutions are slower, but it’s really important to pick a datastore that can handle our data size if it might not fit in memory. And figuring this out can be a bit tricky – a store like Redis might actually require 2X the memory of our data size! This is because it forks a process to do background persistence to disk, and in the worst case, we might need to replicate all of that memory between the two processes. If Redis doesn’t get this memory, it will become totally unresponsive, and we might wish we went with a slower disk-based solution.There are ways around this, by changing the way our virtual memory manager allocates memory, but that’s scary. Also, it might not work if we’re changing all the keys while this happens
  23. Another great thing to do is lay out our application requirements. What can we tolerate, if we have to make a tradeoff? Here are some axes we should think about.First, performance, or latency tolerance. How slow is ok?Second, durability or data loss tolerance. How bad is it if we lose a few writes, in the case of a crash?Freshness, or staleness tolerance – really this means, can we serve stale data from a cache? For how long?Finally, uptime, or downtime tolerance. Do we need to be able to switch over to a replica in the case of failure immediately, or can we handle being read-only or down for a while?
  24. First, let’s discuss performance. Oftentimes relational databases are considered very slow.But, they are doing a lot of work! Some of which you might not be interested in. As an example, consider the base case for an application, doing very simple primary key lookups
  25. The application needs the data fast, since it is responding to user requests in real time. The data mostly fits in memory [click], with a high read rate [click] and it’s primary key lookups, so there is no need [click] to go to disk or scan lots of data.What this means is that we have removed the constraints of disk and network, and really the CPU is the bottleneck here.What does this mean for a relational database? How much does it cost in server CPU time to do one of these lookups?
26. If we don’t need all of the fancy query flexibility, we have some options. First off, the query cache: it skips parsing and optimization, which is good if we are doing the same query over and over. Prepared statements avoid parsing, but not the optimization phase, and can’t use the query cache, but they work for the same query with different arguments. Finally, HandlerSocket, made by Yoshinori Matsunobu, is incredibly fast. It bypasses all of the querying machinery and talks directly to the storage engine, so we still have all of the durability and error recovery that the storage engine guarantees! Also, what’s really cool is that we still have the flexibility of using SQL on that data if we want. Lesson:
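To make the prepared-statement idea concrete, here is a minimal sketch using Python’s stdlib sqlite3 rather than MySQL (the table and data are made up for illustration): the statement text is fixed and only the arguments vary, so the server can skip re-parsing it for each lookup.

```python
import sqlite3

# Toy schema standing in for the primary-key-lookup workload described above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "alice"), (2, "bob")])

# One parameterized statement, reused with different arguments.
lookup = "SELECT name FROM users WHERE id = ?"
names = [conn.execute(lookup, (uid,)).fetchone()[0] for uid in (1, 2)]
print(names)  # ['alice', 'bob']
```

HandlerSocket goes one step further than this: it skips the SQL layer entirely and issues the equivalent of these lookups straight against the storage engine.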
27. Now, you might be wondering WHY we would bother with a relational database if we’re just not going to use half of it. CLICK. Well, we might be willing to pay that query overhead for flexibility, so we can access our data in many different ways. [Read 2nd point] We have to know our queries up front; otherwise we might find ourselves in a situation where we can’t execute certain queries because they require scanning all the data. [click] SQL makes it easier to change what queries we are issuing, which is especially important if we don’t fully know all the features of our application yet.
  28. Cut this?
  29. Cut this?
30. Another really important issue is durability. How is this defined? Well, in a persistent datastore, here is what you should expect: when a client does a write, the server flushes that write to disk BEFORE sending “done” to the client. That way, if the server were to crash, when it recovers we can still see the write. [click, read 2nd point] In fact, it doesn’t even return to the client AT ALL after a write, so the client has no idea whether the write succeeded or not without making another request. [CLICK, mysql] The thing is, some applications might not care about every single write making it to disk; maybe they are just logging things to a database. But I really think making this the default, for supposedly persistent datastores, was a bad idea. This is something we really have to be aware of when choosing a datastore and building our application.
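The durability contract described above can be sketched in a few lines: a “durable” write is flushed and fsync’d to stable storage before the server acknowledges it. The function name here is illustrative, not any real datastore’s API.

```python
import os
import tempfile

def durable_write(path, data):
    """Append data and only acknowledge after it reaches stable storage."""
    with open(path, "ab") as f:
        f.write(data)
        f.flush()              # push from Python's buffer to the OS
        os.fsync(f.fileno())   # force the OS to write it to disk
    return "ok"                # only NOW is it safe to tell the client "done"

path = os.path.join(tempfile.mkdtemp(), "log")
ack = durable_write(path, b"write-1\n")
print(ack, open(path, "rb").read())
```

A datastore that skips the fsync (or, like the unacknowledged-write default discussed above, never replies at all) is trading exactly this guarantee away for speed.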
31. What happens when we’ve gotten to the point where we have so much traffic that we need to shard? This is the time when we start thinking about customized datastore solutions. At this point we know our application very well, [click] and we know where we use simple queries and where we need something more complex. We know where to create copies for fast reads and where we need to keep one copy for consistency. We can use a combination of datastores [click] to get the best performance, but we know what tradeoffs we can make to use one datastore for ease of development. Hopefully we will reach this stage of tremendous growth. I’ve told you before that we don’t need to worry about scaling in the beginning, and here’s why: no matter what we chose initially, we can scale it! I’ll show you what I mean.
32. First, let’s discuss some ways of scaling. When scaling reads, we have a few options: we can set up a cache, like memcached; we can replicate our servers, so reads can be spread across all of them, as with MongoDB replica sets or MySQL master-slave replication; or we can partition the data amongst different servers. For writes we only have the latter option, so I’m going to talk about partitioning when I talk about scaling.
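The first option, a cache in front of the database, is usually the cache-aside pattern. Here is a toy sketch with a plain dict standing in for memcached and another for the backend database (the keys and counters are invented for illustration): reads hit the cache first and only fall through to the database on a miss.

```python
cache = {}                      # stands in for memcached
db = {"user:1": "alice"}        # stands in for the backend database
stats = {"hits": 0, "misses": 0}

def read(key):
    if key in cache:            # fast path: serve from the cache
        stats["hits"] += 1
        return cache[key]
    stats["misses"] += 1
    value = db[key]             # slow path: expensive backend read
    cache[key] = value          # populate the cache for next time
    return value

read("user:1")
read("user:1")
print(stats)  # {'hits': 1, 'misses': 1}
```

Only the first read pays the database cost; this is exactly the read-scaling win, and the freshness problem it creates is the subject of the caching research later in the talk.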
  33. Now, there’s a lot of folklore out there about which systems do what. I’m going to address two myths around scaling.
34. First, the idea that we had to have chosen a NoSQL database from the beginning, because relational databases don’t scale well. This is not true. Second, that you can’t do a join if you have a sharded database. Also not true!
35. Scaling isn’t about the datastore, it’s about the application. [click] Plenty of applications have scaled using a relational database; they just use it in a different way. FriendFeed and Quora both appreciated the tested solution of MySQL, and built partitioning into their applications. Facebook has thousands of MySQL servers. (Though I guess Stonebraker would say they are trapped in a fate worse than death… handling 1B users is nothing to laugh at.) The issue here is complex vs. simple queries. [click] Complex queries that go to all shards don’t scale. [click] Simple … read [PAUSE] Great. So all we need to do is pick a good partition key and send every lookup and query to one server! But…
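The simple-vs-complex distinction can be sketched in a few lines of Python, with dicts standing in for the shard servers (shard count and key names are invented): a lookup by partition key touches exactly one shard, while a query without the partition key has to fan out to all of them.

```python
NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]  # each dict stands in for a server

def shard_for(key):
    # Note: a real system would use a stable hash (Python's hash() of a
    # string varies across processes) or a lookup table of key ranges.
    return hash(key) % NUM_SHARDS

def put(key, value):
    shards[shard_for(key)][key] = value

def get(key):
    # Simple query: the partition key tells us the one shard to ask.
    return shards[shard_for(key)].get(key)

def scan_all(pred):
    # Complex query: no partition key, so every shard must participate.
    return [v for s in shards for v in s.values() if pred(v)]

put("post:100", "hello")
print(get("post:100"))  # hello
```

Adding servers makes `get` scale linearly, while `scan_all` still costs a request to every shard, which is exactly why queries of the second kind don’t scale.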
  36. We still have a problem. LOTS of applications need to look up data in different ways. Nobody really talks about this. What do I mean? Well, let’s look at an example.
37. This is a post page on Hacker News. We need to get all of the comments on a post; let’s even ignore nested comments for now. Here is a SQL query, getting all the comments for post 100. And here’s how we might do it in Redis, if we had the comments stored in a sorted set by post id. I’d also just like to point out that I got this snapshot from Hacker News today, and it’s about a guy getting bitten by the fact that Mongo doesn’t return anything to the client after a write. HA.
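A hedged reconstruction of the slide’s example, since the queries themselves aren’t in these notes: the schema, column names, and Redis key name below are assumptions for illustration, using stdlib sqlite3 in place of MySQL.

```python
import sqlite3

# Assumed comments schema: (id, post_id, body).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER, post_id INTEGER, body TEXT)")
conn.executemany("INSERT INTO comments VALUES (?, ?, ?)",
                 [(1, 100, "first!"), (2, 100, "nice post"), (3, 101, "other")])

# SQL version: all comments for post 100.
rows = conn.execute(
    "SELECT body FROM comments WHERE post_id = ? ORDER BY id", (100,)).fetchall()
print([body for (body,) in rows])  # ['first!', 'nice post']

# The Redis version would keep one sorted set per post, scored by time,
# roughly (redis-py syntax, sketched):
#   r.zadd("comments:100", {comment_id: timestamp})
#   r.zrange("comments:100", 0, -1)
```

Either way the access path is keyed by post id, which is what makes this a “simple” query that can be answered by a single shard.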
38. [Read slide] Now it becomes a bit confusing to figure out how to use these table copies, particularly if you are still using SQL.
39. To solve this problem we created a system called Dixie. [Read 1st point] SQL queries are originally written as though for a single database with no table copies. [Read rest] Pause after this. So does Dixie actually help?
40. Well, to evaluate this, we ran an experiment using Wikipedia. Here is a graph showing how throughput scales as we add more servers, using one copy per table or using Dixie and table copies. First off, a bit about this experiment: we used a Wikipedia database dump from 2008, and real HTTP traces of traffic. We partitioned this across 1, 2, 5, and 10 servers, both with one copy of each table and then with an added copy of one of the tables, the page table, partitioned in a different way. The yellow bars are most queries going to all servers [click], while the blue bars are most queries going to one server [click] because of the extra table copy. On 10 servers, Dixie plus a copy provides a 3.2X improvement [click] in throughput over not having table copies. This shows that as we scale to more partitions, the benefit of table copies grows; when there are only one or two servers, table copies don’t help much.
  41. So we’ve shown that a SQL datastore can scale perfectly well, if we add table copies and make sure queries only need to go to one server. Now let’s talk about JOINs.
42. [Read slide] Let’s look at an example.
43. A little background on distributed query planning: there has been a lot of research in this area, and systems like Oracle and DB2 have sophisticated distributed query planners. There aren’t a ton of open-source solutions, and some of the things that are out there are very simple, like Gizzard.
44. This is what Dixie does! Dixie has a [read 1st point] and a [read 2nd point].
45. [Read slide] So now I’m going to change tack and talk about research I’m currently working on, involving caching.
46. And this is research with people from Harvard and MIT. [Read slide] The problem is that it’s often confusing whether we should set an expire time on the data, or whether we should pay the cost of invalidating it on writes.
  47. [last point] And we do tracking at the level of ranges, so that if a result is based on a range, it will get invalidated when new things are added to that range. Even if the range was originally empty!
48. [click] So here’s an example of how this might work. We cache a user’s view of their Twitter homepage, their timeline. [click] We only invalidate this cached object when another user the original user is following tweets. [click] So if a user reads their timeline infrequently, the invalidations don’t matter; we only recalculate the timeline once, on a read. [click] And if the people the user follows don’t tweet often, the timeline won’t change and won’t need to be recalculated.
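The timeline example above can be sketched as a toy in-memory version (the user names and data structures are invented for illustration; the real system is a distributed cache): a tweet invalidates the cached timelines of the tweeter’s followers, and a timeline is only recomputed lazily, on the next read that misses.

```python
follows = {"neha": {"alice", "bob"}}   # neha follows alice and bob
tweets = {"alice": [], "bob": []}
cache = {}                             # user -> cached timeline

def tweet(user, text):
    tweets[user].append(text)
    for reader, followed in follows.items():
        if user in followed:
            cache.pop(reader, None)    # invalidate dependent timelines

def timeline(user):
    if user not in cache:              # recompute only on a cache miss
        cache[user] = [t for u in follows[user] for t in tweets[u]]
    return cache[user]

tweet("alice", "hi")
print(timeline("neha"))                # ['hi'] -- computed once, now cached
tweet("bob", "yo")                     # invalidates neha's cached timeline
print(sorted(timeline("neha")))        # ['hi', 'yo'] -- recomputed on read
```

If neha never reads between tweets, the invalidations cost almost nothing; if alice and bob never tweet, repeated reads are pure cache hits. That is exactly the pair of properties described above.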
  49. There are many challenges involved in this work. First off, we want to offer applications the option of using invalidation and expiration times together, if they are willing to occasionally serve stale data.Next, we are creating a distributed cache, so there is all of the work that goes along with that.Finally, we need to properly do cache coherence on the original database rows, which are spread out amongst all of these distributed caches.
50. But this could be really helpful: it would mean that an application could get all of the benefits of caching, which include saving on computation, like not having to regenerate HTML pages, and sending less traffic to the backend database, while not having to worry about freshness or invalidating results. This is still a work in progress; hopefully we’ll have a paper soon.
51. So in summary, I’ve told you about choosing a datastore, scaling it, Dixie, and a dependency-tracking cache. I’d love to talk further about any of these things if you’re interested.
52. Thank you so much for your time! I’m really excited to talk about any of this stuff, and I don’t know anyone at Strange Loop, so please feel free to say hi! Here’s my contact information, with pointers to some of the things I talked about.
  53. Cut this?