Scaling your Database in the CloudSpring2011Presented by:Cory Isaacson, CEOCodeFutures Corporationhttp://www.dbshards.comView the video of this presentation here!
IntroductionWho I amCory Isaacson, CEO of CodeFuturesProviders of dbShardsAuthor of Software PipelinesPartnerships:RightscaleThe leading Cloud Management PlatformLeaders in database scalability, performance, and high-availability for the cloud……based on real-world experience with dozens of cloud-based applications…social networking, gaming, data collection, mobile, analyticsObjective is to provide useful experience you can apply to scaling (and managing) your database tier……especially for high volume applications
Challenges of cloud computingCloud provides highly attractive service environmentFlexible, scales with need (up or down)No need for dedicated IT staff, fixed facility costsPay-as-you-go modelCloud services occasionally failPartial network outagesServer failures……by their nature cloud servers are “transient”Disk volume issuesCloud-based resources are constrainedCPUI/O Rates……the “Cloud I/O Barrier”
Typical Application Architecture
Scaling in the CloudScaling Load Balancers is easyStateless routing to app serverCan add redundant Load Balancers if neededDNS round-robin or intelligent routing for larger sitesIf one goes down……failover to anotherScaling Application Servers is easyStatelessSessions can easily transition to another serverAdd or remove servers as need dictatesIf one goes down……failover to another
Scaling in the CloudScaling the Database tier is hard“Stateful” by definition (and necessity…)Large, integrated data sets…10s of GBs to TBs (or more…)Difficult to move, reloadI/O dependent……adversely affected by cloud service failures…and slow cloud I/OIf one goes down……ouch!Databases form the “last mile” of true application scalabilityInitially simple optimization produces the best resultImplement a follow-on scalability strategy for long-term performance goals……plus a high-availability strategy is a must
All CPUs wait at the same speed…The Cloud I/O Barrier
More Database scalability challengesDatabases have many other challenges that limit scalabilityACID compliance……especially Consistency…user contentionOperational challengesFailoverPlanned, unplannedMaintenanceIndex rebuildRestoreSpace reclamationLifecycleReliable Backup/RestoreMonitoringApplication UpdatesManagement
Database slowdown is not linear…
Challenges apply to all types of databasesTraditional RDBMS (MySQL, Postgres, Oracle…)I/O boundMulti-user, lock contentionHigh-availabilityLifecycle managementNoSQLDatabases In-memory, Caching, Document…)Reliability, High-availabilityLimits of a single server…and a single threadData dumps to diskLifecycle ManagementNo matter what the technology, big databases are hard to manage……elastic scaling is a real challenge…degradation from growth in size and volume is a certaintyApplication-specific database requirements add to the challenge…database design is key……balance performance vs. convenience vs. data size
The Laws of DatabasesLaw #1: Small Databases are fast…Law #2: Big Databases are slow…Law #3: Keep databases small
What is the answer?Database sharding is the only effective method for achieving scale, elasticity, reliability and easy management……regardless of your database technology
What is Database Sharding?“Horizontal partitioning is a database design principle whereby rows of a database table are held separately... Each partition forms part of a shard, which may in turn be located on a separate database server or physical location.” Wikipedia
The key to Database Sharding…
Database Sharding Architecture
Database Sharding Architecture
Database Sharding… the results
Why does Database Sharding work?Maximize CPU/Memory per database instance……as compared to database sizeReduce the size of index treesSpeeds writes dramaticallyNo contention between serversLocking, disk, memory, CPUAllows for intelligent parallel processing……Go Fish queries across shardsKeep CPUs busy and productive
Breaking the Cloud I/O Barrier
Database typesTraditional RDBMSProvenSQL language……although ORM can change thatDurable, transactional, dependable……perceived as inflexibleNoSQLTypically in-memory……the real speed advantageVarious flavors…Key/Value StoreDocument databaseHybridStore complex documents using a keyIn-memory and disk based
Black box vs. Relational ShardingBoth utilize sharding on a data key……typically modulus on a value or consistent hashBlack box sharding is automatic… …attempts to evenly shard across all available servers…no developer visibility or control…can work acceptably for simple, non-indexed NoSQL data stores…easily supports single-row/object resultsRelational sharding is defined by the developer……selective sharding of large data…explicit developer control and visibility…tunable as the database grows and matures…more efficient for result sets and searchable queries…preserves data relationships
How Database Sharding works for NoSQL
Relational ShardingShard-Tree Root TableShard-Tree Child TablesGlobal Tables
How Relational Sharding works
How Relational Sharding worksShard key recognition in SQLSELECT * FROM customerWHERE customer_id = 1234INSERT INTO customer(customer_id, first_name, last_name, addr_line1,…)VALUES(2345, ‘John’, ‘Jones’, ‘123 B Street’,…)UPDATE customerSET addr_line1 = ‘456 C Avenue’WHERE customer_id = 4567
What about Cross-Shard result sets?
Cross-shard result set exampleGo Fish (no shard key)SELECT country_id, count(*) FROM customerGROUP BY country_id
More on Cross-Shard result sets…Black Box approach requires “scatter gather” for multi-row/object result set……common with NoSQL engines…forces use of denormalized lists…must be maintained by developer/application code…Map Reduce processing helps with this (non-realtime)Relational sharding provides access to meaningful result sets of related data……aggregation, sort easier to perform…logical search operations more natural…related rows are stored together
Make your Application Shard-SafeFor sharding to be efficient, every sharded table should include a shard key.SELECT WHERE clause statements should include a shard key where possible so that the query goes directly to a single shard rather than being executed against multiple shards in parallel.INSERT, UPDATE and DELETE WHERE CLAUSE statements must include a shard key so that the data is written to the correct shard.Avoid cross-shard inner joins. For example, joining one customer to another is not feasible as both customers may be in different shards. This can be accomplished, but requires complex logic and can adversely affect performance.
What about High-Availability?Can you afford to take your databases offline:…for scheduled maintenance? …for unplanned failure? …can you accept some lost transactions?By definition Database Sharding adds failure points to the data tierA proven High-Availability strategy is a must……system outages…especially in the cloud…planned maintenance…a necessity for all database engines
Traditional asynch replication is inadequate…Replication “Lag” can result in lost transactions
No rapid failover for continuous operation
Maintenance difficult in production environments
Used by many NoSQL enginesDatabase Sharding and High-AvailabilityNeed a minimum of 2 active-active instances per shard…
…3 active-active instances for in-memory databases
Support for…
…fail-down

Scaling SQL and NoSQL Databases in the Cloud

  • 1.
    Scaling your Databasein the CloudSpring2011Presented by:Cory Isaacson, CEOCodeFutures Corporationhttp://www.dbshards.comView the video of this presentation here!
  • 2.
    IntroductionWho I amCoryIsaacson, CEO of CodeFuturesProviders of dbShardsAuthor of Software PipelinesPartnerships:RightscaleThe leading Cloud Management PlatformLeaders in database scalability, performance, and high-availability for the cloud……based on real-world experience with dozens of cloud-based applications…social networking, gaming, data collection, mobile, analyticsObjective is to provide useful experience you can apply to scaling (and managing) your database tier……especially for high volume applications
  • 3.
    Challenges of cloudcomputingCloud provides highly attractive service environmentFlexible, scales with need (up or down)No need for dedicated IT staff, fixed facility costsPay-as-you-go modelCloud services occasionally failPartial network outagesServer failures……by their nature cloud servers are “transient”Disk volume issuesCloud-based resources are constrainedCPUI/O Rates……the “Cloud I/O Barrier”
  • 4.
  • 5.
    Scaling in theCloudScaling Load Balancers is easyStateless routing to app serverCan add redundant Load Balancers if neededDNS round-robin or intelligent routing for larger sitesIf one goes down……failover to anotherScaling Application Servers is easyStatelessSessions can easily transition to another serverAdd or remove servers as need dictatesIf one goes down……failover to another
  • 6.
    Scaling in theCloudScaling the Database tier is hard“Stateful” by definition (and necessity…)Large, integrated data sets…10s of GBs to TBs (or more…)Difficult to move, reloadI/O dependent……adversely affected by cloud service failures…and slow cloud I/OIf one goes down……ouch!Databases form the “last mile” of true application scalabilityInitially simple optimization produces the best resultImplement a follow-on scalability strategy for long-term performance goals……plus a high-availability strategy is a must
  • 7.
    All CPUs waitat the same speed…The Cloud I/O Barrier
  • 8.
    More Database scalabilitychallengesDatabases have many other challenges that limit scalabilityACID compliance……especially Consistency…user contentionOperational challengesFailoverPlanned, unplannedMaintenanceIndex rebuildRestoreSpace reclamationLifecycleReliable Backup/RestoreMonitoringApplication UpdatesManagement
  • 9.
  • 10.
    Challenges apply toall types of databasesTraditional RDBMS (MySQL, Postgres, Oracle…)I/O boundMulti-user, lock contentionHigh-availabilityLifecycle managementNoSQLDatabases In-memory, Caching, Document…)Reliability, High-availabilityLimits of a single server…and a single threadData dumps to diskLifecycle ManagementNo matter what the technology, big databases are hard to manage……elastic scaling is a real challenge…degradation from growth in size and volume is a certaintyApplication-specific database requirements add to the challenge…database design is key……balance performance vs. convenience vs. data size
  • 11.
    The Laws ofDatabasesLaw #1: Small Databases are fast…Law #2: Big Databases are slow…Law #3: Keep databases small
  • 12.
    What is theanswer?Database sharding is the only effective method for achieving scale, elasticity, reliability and easy management……regardless of your database technology
  • 13.
    What is DatabaseSharding?“Horizontal partitioning is a database design principle whereby rows of a database table are held separately... Each partition forms part of a shard, which may in turn be located on a separate database server or physical location.” Wikipedia
  • 14.
    The key toDatabase Sharding…
  • 15.
  • 16.
  • 17.
  • 18.
    Why does DatabaseSharding work?Maximize CPU/Memory per database instance……as compared to database sizeReduce the size of index treesSpeeds writes dramaticallyNo contention between serversLocking, disk, memory, CPUAllows for intelligent parallel processing……Go Fish queries across shardsKeep CPUs busy and productive
  • 19.
  • 20.
    Database typesTraditional RDBMSProvenSQLlanguage……although ORM can change thatDurable, transactional, dependable……perceived as inflexibleNoSQLTypically in-memory……the real speed advantageVarious flavors…Key/Value StoreDocument databaseHybridStore complex documents using a keyIn-memory and disk based
  • 21.
    Black box vs.Relational ShardingBoth utilize sharding on a data key……typically modulus on a value or consistent hashBlack box sharding is automatic… …attempts to evenly shard across all available servers…no developer visibility or control…can work acceptably for simple, non-indexed NoSQL data stores…easily supports single-row/object resultsRelational sharding is defined by the developer……selective sharding of large data…explicit developer control and visibility…tunable as the database grows and matures…more efficient for result sets and searchable queries…preserves data relationships
  • 22.
    How Database Shardingworks for NoSQL
  • 23.
    Relational ShardingShard-Tree RootTableShard-Tree Child TablesGlobal Tables
  • 24.
  • 25.
    How Relational ShardingworksShard key recognition in SQLSELECT * FROM customerWHERE customer_id = 1234INSERT INTO customer(customer_id, first_name, last_name, addr_line1,…)VALUES(2345, ‘John’, ‘Jones’, ‘123 B Street’,…)UPDATE customerSET addr_line1 = ‘456 C Avenue’WHERE customer_id = 4567
  • 26.
  • 27.
    Cross-shard result setexampleGo Fish (no shard key)SELECT country_id, count(*) FROM customerGROUP BY country_id
  • 28.
    More on Cross-Shardresult sets…Black Box approach requires “scatter gather” for multi-row/object result set……common with NoSQL engines…forces use of denormalized lists…must be maintained by developer/application code…Map Reduce processing helps with this (non-realtime)Relational sharding provides access to meaningful result sets of related data……aggregation, sort easier to perform…logical search operations more natural…related rows are stored together
  • 29.
    Make your ApplicationShard-SafeFor sharding to be efficient, every sharded table should include a shard key.SELECT WHERE clause statements should include a shard key where possible so that the query goes directly to a single shard rather than being executed against multiple shards in parallel.INSERT, UPDATE and DELETE WHERE CLAUSE statements must include a shard key so that the data is written to the correct shard.Avoid cross-shard inner joins. For example, joining one customer to another is not feasible as both customers may be in different shards. This can be accomplished, but requires complex logic and can adversely affect performance.
  • 30.
    What about High-Availability?Canyou afford to take your databases offline:…for scheduled maintenance? …for unplanned failure? …can you accept some lost transactions?By definition Database Sharding adds failure points to the data tierA proven High-Availability strategy is a must……system outages…especially in the cloud…planned maintenance…a necessity for all database engines
  • 31.
    Traditional asynch replicationis inadequate…Replication “Lag” can result in lost transactions
  • 32.
    No rapid failoverfor continuous operation
  • 33.
    Maintenance difficult inproduction environments
  • 34.
    Used by manyNoSQL enginesDatabase Sharding and High-AvailabilityNeed a minimum of 2 active-active instances per shard…
  • 35.
    …3 active-active instancesfor in-memory databases
  • 36.
  • 37.
  • 38.
  • 39.
  • 40.
  • 41.
  • 42.
    Database Sharding…elastic shardsExpandthe number of shards……divide a single shard into N new shards…or rebalance existing shards across N new shardsContract the number of shards……consolidate N shards into a single shardDue to large data sizes, this takes time……regardless of data architectureRequires High-Availability to ensure no downtime……perform scaling on live replica in background…fail back to active instances
  • 43.
    Database Sharding…the futureAbilityto leverage proven database engines……SQL…NoSQL…CachingHybrid database infrastructures using the best of all technologiesIn-memory and disk-based solutions co-existingAllow developers to select the best database engine for a given set of application requirements……seamless context-switching within the application…use the API of choiceImproved management……monitoring…configuration “on-the-fly”…dynamic elastic shards based on demand
  • 44.
    Database Sharding summaryDatabaseSharding is the only effective tool for scaling your database tier in the cloud……or anywhereUse Database Sharding for any type of engine……SQL…NoSQL…CacheUse the best engine for a given application requirement……avoid the “one-size-fits-all” trap…each engine has its strengths to capitalize onEnsure your High-Availability is proven and bulletproof……especially critical on the cloud…must support failure and maintenance
  • 45.

Editor's Notes

  • #18 Writes are linear, reads can be faster – depending on your database architecture.
  • #19 How did your database perform 6 months or a year ago?
  • #21 NoSQL is attractive due to simple API, ease of use. But beware, you really need a solid data design for exactly what you want to do. Relationships are not handled by the database, they are handled by the developer.
  • #27 Difference between Black box sharding/NoSQL and App Aware sharding with an RDBMS – you can get sets of related data from the same shard, otherwise need to retrieve a row at a time and consolidate