
Database Architecture & Scaling Strategies, in the Cloud & on the Rack



In this webinar, Robbie Mihalyi, VP of Engineering at Clustrix, explores how to set up a SQL RDBMS architecture that scales out and is both elastic and consistent, while simultaneously delivering fault tolerance and ACID compliance.

He also covers how data gets distributed in this architecture, how the query processor works, how rebalancing happens, and other architectural elements. Examples cited include cloud deployments and e-commerce use cases.

In this webinar, you will learn:

1. Five RDBMS scaling strategies, along with their trade-offs
2. The importance of having no single point of failure for OLTP (fault tolerance)
3. The vagaries of the cloud and how they impact running an RDBMS there

Who should watch?

1. People interested in high performance, real-time database solutions
2. Companies that have MySQL in their infrastructure and are concerned that their growth will soon overwhelm MySQL's single-box design
3. DBAs who implement 'read slaves', 'multiple masters' and 'sharding' for MySQL databases and want to learn about better ways to scale



  1. © 2015 CLUSTRIX Database Scaling Strategies, in the Cloud & on the Rack Robbie Mihalyi @Clustrix
  2. SQL SCALE-OUT ClustrixDB Overview Resiliency Capacity Elasticity Cloud
  3. Cloud o Commoditized hardware resources  Rapid deployment and pay by the hour o Access  Publish your applications quickly  Use existing services from provider o Capacity  Scale resources as you need them Utility Computing (bare metal) Platform as a Service (PaaS) SaaS o Virtualized (Shared) Resources  You do not always get the performance envelope you ask for o Dedicated (Hardware) Resources  Available but expensive  Less flexible
  4. E-Commerce Applications: Example of a Great Match for Cloud o Need for capacity varies by seasonality and specific events  Some events can generate 10x normal traffic & increased conversion rates o Sensitive to performance characteristics  Throughput and latency o Up-time is most crucial at the busiest time  Every minute of downtime can mean thousands of dollars in lost revenue
  5. SQL SCALE-OUT Resiliency Capacity Elasticity
  6. SQL SCALE-OUT Resiliency Capacity Elasticity SCALE  Data, Users, Sessions THROUGHPUT  Concurrency, Transactions LATENCY  Response Time
  7. Application Scaling (App Layer Only) Easy Installation and Setup o Load Balancer  HAProxy or equivalent  Distributes incoming requests across commodity app servers o Scale out by adding servers  All servers are the same – no master o Redundant backend network  Low-latency cluster intercommunication
  8. Application Scaling (Database Layer) Database Scaling Is Very Hard o Data Consistency o Read vs. Write Scale o ACID Properties (if you care about them) o Throughput and Latency o Application Impact
  9. Non-Relational (NoSQL) Database Architectures o No imposed structure o Relaxed or no ACID properties  BASE – alternative to ACID o Fast and Scalable o Suited for specific applications  IoT, click-stream, object store, document  Good for insert workloads  Not good for read / query apps o An RDBMS can also provide a fast non-structured data store
  10. RDBMS SCALING
  11. Scaling-Up o Keep increasing the size of the (single) database server o Pros  Simple, no application changes needed o Cons  Expensive: at some point, you're paying 5x for 2x the performance  'Exotic' hardware (128 cores and above) becomes price-prohibitive  Eventually you 'hit the wall', and you literally cannot scale up anymore
  12. Scaling Reads: Master/Slave o Add 'Slave' read-server(s) to your 'Master' database server o Pros  Reasonably simple to implement  Read/write fan-out can be done at the proxy level o Cons  Only adds read performance  Data consistency issues can occur, especially if the application isn't coded to ensure reads from the slave are consistent with reads from the master
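The proxy-level read/write fan-out mentioned on the slide can be sketched in a few lines. This is a toy illustration, not any real proxy's API; production statement classification is far more careful:

```python
import random

class ReadWriteSplitter:
    """Toy proxy-level read/write splitter for a master/slave setup."""

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = slaves

    def route(self, sql):
        # Only plain SELECTs are safe to fan out to a read slave;
        # writes (and anything ambiguous) must go to the master.
        if sql.lstrip().upper().startswith("SELECT"):
            return random.choice(self.slaves)
        return self.master
```

Note the consistency caveat from the slide: a SELECT routed to a slave may not yet see a write the master just acknowledged, unless the application accounts for replication lag.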
  13. Scaling Writes: Master/Master o Add additional 'Master'(s) to your 'Master' database server o Pros  Adds write scaling without needing to shard o Cons  Adds write scaling at the cost of read-slaves  Adding read-slaves would add even more latency  Application changes are required to ensure data consistency / conflict resolution
  14. Scaling Reads & Writes: Sharding SHARD01 (A - K) SHARD02 (L - O) SHARD03 (P - S) SHARD04 (T - Z) o Partitioning tables across separate database servers o Pros  Adds both write and read scaling o Cons  Loses the RDBMS's ability to manage transactionality, referential integrity and ACID  ACID compliance & transactionality must be managed at the application level  Consistent backups across all the shards are very hard to manage  Reads and writes can be skewed / unbalanced  Application changes can be significant
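The application-level routing that sharding forces can be sketched as follows, using the A-K / L-O / P-S / T-Z split from the slide (shard names are illustrative). Everything an RDBMS would normally do across these servers, such as cross-shard transactions, joins, and consistent backups, now falls to the application:

```python
# Letter ranges and shard names mirror the slide's example split;
# the names themselves are hypothetical.
SHARD_RANGES = [
    ("A", "K", "shard01"),
    ("L", "O", "shard02"),
    ("P", "S", "shard03"),
    ("T", "Z", "shard04"),
]

def shard_for(key):
    """Return the shard whose letter range covers the key's first letter."""
    first = key[0].upper()
    for low, high, shard in SHARD_RANGES:
        if low <= first <= high:
            return shard
    raise ValueError(f"no shard covers key {key!r}")
```

Skew is visible even in this sketch: if most customers' names start with letters in A-K, shard01 carries a disproportionate share of reads and writes.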
  15. Scaling Reads & Writes: MySQL Cluster o Provides shared-nothing clustering and auto-sharding for MySQL (designed for Telco deployments: minimal cross-node transactions, HA emphasis) o Pros  Distributed, multi-master model  Provides high availability and high throughput o Cons  Only supports read-committed isolation  Long-running transactions can block a node restart  SBR replication not supported  Range scans are expensive and lower-performing than MySQL  Unclear how it scales with many nodes
  16. Application Workload Partitioning o Partition the entire application + RDBMS stack across several "pods" o Pros  Adds both write and read scaling  Flexible: can keep scaling by adding pods o Cons  No data consistency across pods (only suited for cases where it is not needed)  High overhead in DBMS maintenance and upgrades  Queries / reports across all pods can be very complex  Complex environment to set up and support
  17. SQL SCALE-OUT Resiliency Capacity Elasticity
  18. SQL SCALE-OUT Resiliency Capacity Elasticity Ease of ADDING and REMOVING resources Flex Up or Down  Capacity On-Demand Adapt Resources to Price-Performance Requirements
  19. Elasticity – flexing up and down

      Scaling Option            Flex UP                  Flex DOWN
      Application (only)        Easy                     Easy
      NoSQL databases           Easy                     Unclear if it is possible
      Scale-up                  Expensive                Not applicable
      Master – Slave            Reasonably simple        Turn off read slaves
      Master – Master           Involved                 Involved
      Sharding                  Expensive and complex    Not feasible
      MySQL Cluster             Involved                 Involved
      Application Partitioning  Expensive and complex    Expensive and complex
  20. SQL SCALE-OUT Resiliency Resilience to Failures  Hardware or Software Fault Tolerance and High Availability Capacity Elasticity
  21. Resiliency – high availability and fault tolerance

      Scaling Option            Resilience to Failures
      Application (only)        No single point of failure – failed node bypassed
      NoSQL databases           Support exists
      Scale-up                  One large machine: single point of failure
      Master – Slave            Fail-over to slave
      Master – Master           Resilient to one of the masters failing
      Sharding                  Multiple points of failure
      MySQL Cluster             No single point of failure
      Application Partitioning  Multiple points of failure
  22. RDBMS Capacity, Elasticity and Resiliency

      RDBMS Scaling    Capacity                      Resiliency                  Elasticity  Application Impact
      Scale-up         Many cores – very expensive   Single point of failure     No          None
      Master – Slave   Reads only                    Fail-over                   No          Yes – for read scale
      Master – Master  Read / write                  Yes                         No          High – update conflicts
      MySQL Cluster    Read / write                  Yes                         No          None (or minor)
      Sharding         Unbalanced reads/writes       Multiple points of failure  No          Very high
  24. ClustrixDB – Shared-Nothing Symmetric Architecture Each node contains: o Database Engine  All nodes can perform all database operations (no leader, aggregator, leaf, data-only, or special nodes) o Query Compiler  Distributes compiled partial query fragments to the node containing the ranking replica o Data: Table Slices  All table slices are auto-redistributed by the Rebalancer (default: replicas=2) o Data Map  All nodes know where all replicas are
  25. Intelligent Data Distribution o Tables (billions of rows) are auto-split into slices o Every slice has a replica on another server  Auto-distributed and auto-protected
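The slice-and-replica idea above can be sketched in miniature. The default of two replicas per slice comes from the deck; the naive round-robin placement is an illustrative stand-in for the real Rebalancer, which also balances capacity and access load:

```python
def place_replicas(slices, nodes, replicas=2):
    """Place each slice's replicas on distinct nodes, round-robin.

    Toy placement policy: consecutive nodes host the copies, so no
    node ever holds two replicas of the same slice.
    """
    assert replicas <= len(nodes), "need at least as many nodes as replicas"
    placement = {}
    for i, s in enumerate(slices):
        placement[s] = [nodes[(i + r) % len(nodes)] for r in range(replicas)]
    return placement
```

Because every slice has a copy on a second node, losing one node leaves at least one replica of each slice alive, which is what makes the fault-tolerance slide that follows possible.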
  26. Database Capacity and Elasticity o Easy and simple Flex Up (and Flex Down)  Flex multiple nodes at the same time o Data is automatically rebalanced across the cluster o All servers handle writes + reads o Application always sees a single database instance
  27. Built-in Fault Tolerance o No Single Point-of-Failure  No data loss  No downtime o Server node goes down…  Data is automatically rebalanced across the remaining nodes
  28. Distributed Query Processing o Queries are fielded by any peer node  Routed to the node holding the data o Complex queries are split into fragments processed in parallel  Automatically distributed for optimized performance
  29. Replication and Disaster Recovery o Asynchronous multi-point replication o Parallel backup, up to 10x faster o Replicate to any cloud, any datacenter, anywhere
  31. ClustrixDB key components enabling Scale-Out o Shared-nothing architecture  Eliminates potential bottlenecks o Independent Index Distribution  Hash each distribution key to a 64-bit number space divided into ranges, with a specific slice owning each range o Rebalancer  Ensures optimal data distribution across all nodes  Assigns slices to available nodes to balance data capacity and access o Query Optimizer  Distributed query planner, compiler, and distributed shared-nothing execution engine  Executes each query with maximum parallelism, and many queries concurrently o Evaluation Model  Parallelizes queries, which are distributed to the node(s) with the relevant data o Consistency and Concurrency Control  Uses Multi-Version Concurrency Control (MVCC) and Two-Phase Locking (2PL)
  32. Rebalancer Process o User tables are vertically partitioned into representations o Representations are horizontally partitioned into slices o The Rebalancer ensures:  Each representation has an appropriate number of slices  Slices are well distributed around the cluster on storage devices  Slices are not placed on server(s) that are being flexed down  Reads from each representation are balanced across the nodes
  33. ClustrixDB Rebalancer Tasks o Flex-UP  Re-distribute replicas to new nodes o Flex-DOWN  Move replicas from the flex-down nodes to other nodes in the cluster o Under-Protection – when a slice has fewer replicas than desired  Create a new copy of the slice on a different node o Slice Too Big  Split the slice into several new slices and re-distribute them
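The "Slice Too Big" task can be sketched as dividing the slice's hash range into equal sub-ranges that the Rebalancer can then redistribute. This is a simplified model under the range-ownership assumption above; the real task also moves the rows themselves:

```python
def split_range(low, high, parts=2):
    """Split one slice's hash range [low, high) into equal sub-ranges."""
    width = (high - low) // parts
    ranges = [(low + i * width, low + (i + 1) * width) for i in range(parts)]
    ranges[-1] = (ranges[-1][0], high)  # last sub-range absorbs rounding
    return ranges
```

Splitting by range keeps the mapping from hash value to slice a simple lookup, so reads can continue while the new slices are placed on other nodes.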
  34. ClustrixDB Query Optimizer o The ClustrixDB Query Optimizer is modeled on the Cascades optimization framework  Other RDBMSs that leverage Cascades include Tandem's NonStop SQL and Microsoft's SQL Server  Cost-driven  Extensible via a rule-based mechanism  Top-down approach o The Query Optimizer must answer the following for each SQL query:  In what order should the tables be joined?  Which indexes should be used?  Should the sort/aggregate be non-blocking?
  35. ClustrixDB Evaluation Model o Parallel query evaluation o Massively Parallel Processing (MPP) for analytic queries o The Fair Scheduler ensures OLTP is prioritized ahead of OLAP o Queries are broken into fragments (functions) o Joins require more data movement by their nature  ClustrixDB is able to achieve minimal data movement  Each representation (table or index) has its own distribution map, allowing direct look-ups of which node/slice to go to next, removing broadcasts  There is no central node orchestrating data motion; data moves directly to the next node it needs to go to, reducing hops to the minimum possible given the data distribution o Example: SELECT id, amount FROM donation WHERE id=15 compiles into FRAGMENT 1 (node := lookup id = 15; forward to node) and FRAGMENT 2 (SELECT id, amount; return)
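The slide's fragment example (SELECT id, amount FROM donation WHERE id=15) can be mimicked in miniature. The data map, node names, and row data below are made up for illustration:

```python
def run_point_select(data_map, tables, key, columns):
    """Evaluate a point SELECT as two fragments, as on the slide.

    Fragment 1 consults the distribution map to find the owning node
    and 'forwards' there; fragment 2 runs on that node and returns
    only the requested columns. No central coordinator is involved.
    """
    node = data_map[key]                 # FRAGMENT 1: node := lookup id
    row = tables[node][key]              # FRAGMENT 2: runs on that node
    return {c: row[c] for c in columns}  # <return>

# Hypothetical cluster state: the row with id=15 lives on node2.
data_map = {15: "node2"}
tables = {"node2": {15: {"id": 15, "amount": 250, "donor": "anon"}}}
result = run_point_select(data_map, tables, 15, ["id", "amount"])
```

The direct-lookup step is what the slide means by "removing broadcasts": the query never has to ask every node whether it holds id=15.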
  36. Concurrency Control o Readers never interfere with writers (or vice versa); writers use explicit locking for updates o MVCC maintains a version of each row as writers modify rows o Readers get lock-free snapshot isolation, while writers use 2PL to manage conflicts Lock Conflict Matrix  Reader vs. Reader: none  Reader vs. Writer: none  Writer vs. Writer: row lock (on a row conflict, one writer is blocked; readers see no conflict and no blocking)
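A toy model of the MVCC behavior described above: each write appends a new row version, and a reader pins a snapshot timestamp and sees only versions committed at or before it, so readers never block. Writer-writer conflicts, which the slide handles with 2PL row locks, are omitted from this sketch:

```python
class MVCCRow:
    """One row as a list of (commit_ts, value) versions, oldest first."""

    def __init__(self, value, ts=0):
        self.versions = [(ts, value)]

    def write(self, value, ts):
        # A committed write appends a new version; older versions stay
        # visible to readers holding earlier snapshots.
        self.versions.append((ts, value))

    def read(self, snapshot_ts):
        """Return the newest value committed at or before the snapshot."""
        visible = [v for t, v in self.versions if t <= snapshot_ts]
        return visible[-1]
```

A reader that started before a write keeps seeing the old version, which is exactly the "no conflict, no blocking" cell of the lock matrix.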
  38. Example: Huge Write Workload (AWS Deployment) The Application  Inserts: 254 million / day  Updates: 1.35 million / day  Reads: 252.3 million / day  Deletes: 7,800 / day The Database  Queries: 5-9k per sec  CPU load: 45-65%  Nodes: 10 nodes - 80 cores
  39. Example: Huge Update Workload (Bare-Metal Deployment) The Application  Inserts: 31.4 million / day  Updates: 3.7 billion / day  Reads: 1 billion / day  Deletes: 4,300 / day The Database  Queries: 35-55k per sec  CPU load: 25-35%  Nodes: 6 nodes - 120 cores
  40. CLUSTRIXDB IN DEVELOPMENT
  41. Next Release o Additional Performance Improvements  Further improvements to read and write scaling o Deployment and Provisioning Optimization  Cloud templates and deployment scripts  Instance testing and validation o New Admin architecture and much-improved Web UI  Services-based architecture with (RESTful) API  Simplified single-click FLEX management  Significant graphing and reporting improvements  Multi-cluster topology view and management
  42. New Web UI – Enhanced Dashboard
  43. New Web UI – Historical Workload Comparison
  44. New Web UI – FLEX Administration
  45. FINAL THOUGHTS
  46. ClustrixDB Capacity  Massive read/write scalability  Very high concurrency  Linear throughput scale Elasticity  Flex UP in minutes  Flex DOWN easily  Right-size resources on-demand Resiliency  Automatic, 100% fault tolerance  No single point of failure  Battle-tested performance Flexible Deployment  Cloud, VM, or bare metal  Virtual images available  Point/click scale-out
  47. Thank You. @clustrix
  48. Competitive Cluster Solutions o Most MySQL clustering solutions leverage Master/Master via replication:  MySQL Cluster  Galera (open-source library)  Percona XtraDB Cluster (leverages the Galera replication library)  Tungsten o ClustrixDB does NOT use replication to keep all the servers in sync  Replication cannot scale writes as highly as our own technology  Replication has inherent potential consistency and latency issues  Transactional workloads such as OLTP (e.g. e-commerce) are exactly the workloads that replication struggles with most
  49. MySQL Cluster o Provides shared-nothing clustering and auto-sharding for MySQL (designed for Telco deployments: minimal cross-node transactions, HA emphasis) o Pros:  Distributed, multi-master with no SPOF  Designed to provide high availability and high throughput with low latency, while allowing for near-linear scalability  Synchronous replication, 2-phase commit o Cons:  The global checkpoint runs every 2 seconds: "There are no guaranteed durable COMMITs to disk"  Only supports read-committed isolation  "MySQL cluster does not handle large transactions well"  Long-running transactions can block a node restart  Overflow of data in the replication stream drops a node from the cluster, with consistency loss  'True' HA requires multiple replication lines; "1 is not sufficient" for HA  DELETEs release memory only for the same table; full release requires a rolling cluster restart  Range scans are expensive and lower-performing than MySQL  No distributed table locks
  50. Galera Cluster o A multi-master topology using its own replication protocol (designed primarily for high availability, and secondarily for scale) o Pros:  Writes to any master are replicated synchronously to the other master(s), ensuring all masters have the same data  It is open source, and 24/7 support can be purchased for $7,950/yr/server. Percona also provides support, for a higher price. o Cons:  Write scale is limited. Galera support recommends that writes go to one master, rather than be distributed across the nodes. That helps with isolation issues, but increases consistency and latency issues across the nodes.  Snapshot isolation does NOT use first-committer-wins (and so fails Aphyr's Jepsen CAP tests). ClustrixDB does use first-committer-wins for snapshot consistency  Writesets are processed as a single memory-resident buffer, so extremely large transactions (e.g. LOAD DATA) may adversely affect node performance  Locking is lax with DDL. E.g., if your DML transaction uses a table and a parallel DDL statement is started, Galera won't wait for a metadata lock, causing potential consistency issues
  51. Percona XtraDB Cluster o An active/active high-availability and high-scalability open-source solution for MySQL clustering. It integrates Percona Server and Percona XtraBackup with the Galera replication library o Pros:  Synchronous replication  Multi-master replication support  Parallel replication  Automatic node provisioning o Cons:  Not designed for write scaling  SELECT FOR UPDATE can easily create deadlocks  Not true synchronous replication, but 'virtually synchronous': the data is committed on the originating node and an ack is sent to the application, but the other nodes commit asynchronously. This can lead to consistency issues for applications reading from the other nodes  "If multiple nodes are used, the ability to read your own writes is not guaranteed. In that case, a certified transaction, which is already committed on the originating node, can still sit in the receive queue of the node the application is reading from, waiting to be applied."
  52. Tungsten Replicator o An open-source replication engine compatible with MySQL, Oracle, and Amazon RDS; NoSQL stores such as MongoDB; and data warehouse stores such as Vertica, InfiniDB, and Hadoop o Pros:  Allows data to be exchanged between different databases and different database versions  During replication, information can be filtered and modified, and deployment can be between on-premise or cloud-based databases  For performance, Tungsten Replicator includes support for parallel replication and advanced topologies such as fan-in, star and multi-master, and can be used efficiently in cross-site deployments o Cons:  Very complicated to set up and maintain  No automated management, automated failover, transparent connections, or built-in conflict resolution  Only allows asynchronous replication  Cannot suppress slave-side triggers; each trigger must be altered to add an IF statement that prevents it from running on the slave