New databases that scales high

Introduction to NoSQL Databases, CAP theorem and comparison of some of the major DBMS. The presentation also proposes a decision tree at the end.

Published in: Technology

  1. 1. New Databases That Scales High William El Kaim Oct. 2016 - V 2.1
  2. 2. This presentation is part of the Enterprise Architecture Digital Codex http://www.eacodex.com/ Copyright © William El Kaim 2016
  3. 3. Plan • Why the Need for Databases That Scales High? • What is NoSQL? • NoSQL Database Taxonomy • From CAP to PACELC • NoSQL Database Comparisons and Decision Tree • NoSQL Data Modeling Techniques • Resources
  4. 4. The Need For Scalability • Today, most structured data is stored in relational databases. • Relational databases enforce a set of rules to keep data consistent and to make transactions atomic (each one either succeeds completely or fails completely). • With these rules in place, it becomes much harder to keep transactions consistent across multiple database servers and to spread data over multiple nodes to increase retrieval and processing speed. • The reason is that every write must be stored on each database server, which limits write throughput to the storage speed of the slowest server. • Transaction consistency may be critical for some systems, but when datasets reach extreme scale, traditional databases often cannot keep up, and alternative approaches to data storage and retrieval are required. • Newer databases offer different approaches to overcome these limitations as data grows beyond the reach of a single server or database cluster. Source: James Higginbotham
  5. 5. Issues with scaling up • When the dataset is too big, “vertical” scaling (increasing the computing power of a single machine) is not enough! • You need to scale out, i.e. apply “horizontal” scaling by adding more servers. • Two main approaches for horizontal scaling (multi-node databases): • Master/Slave: all writes go to the master; all reads are performed against the replicated slave databases; critical reads may return stale data because a write may not have been propagated yet; large datasets can pose problems because the master needs to duplicate data to the slaves; the master is a Single Point of Failure! • Sharding (partitioning): scales well for both reads and writes; the application needs to be partition-aware (no transparency); relationships/joins across partitions are no longer possible; referential integrity across shards is lost.
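
The stale-read risk of the master/slave approach can be sketched in a few lines. This is a minimal in-memory illustration, not a real replication protocol: the class names `Master` and `Replica` and the explicit `replicate()` call (standing in for asynchronous propagation) are assumptions made for the example.

```python
class Replica:
    """Serves reads only; its copy may lag behind the master."""

    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)


class Master:
    """Accepts every write and forwards it to the replicas later."""

    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas
        self.pending = []  # writes not yet propagated to the replicas

    def write(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))

    def replicate(self):
        """Push all pending writes to every replica (hypothetical sync step)."""
        for key, value in self.pending:
            for replica in self.replicas:
                replica.data[key] = value
        self.pending.clear()


replica = Replica()
master = Master([replica])

master.write("balance", 100)
stale = replica.read("balance")   # the write has not been propagated yet
master.replicate()
fresh = replica.read("balance")   # replication has caught up
```

Here `stale` is `None` and `fresh` is `100`: a read issued between the write and the propagation sees old data.
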
  6. 6. Other Ways To Scale Out • Multi-master replication: data is stored by a group of computers and can be updated by any member of the group; all members can respond to reads; the replication system is responsible for propagating the data modifications made by each member to the rest of the group and for resolving any conflicts that arise between concurrent changes made by different members. • Immutable, append-only data store (Big Data): INSERT only, no UPDATE or DELETE; all information is kept. • In-memory databases • No JOIN: this involves de-normalizing data.
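
The append-only idea (INSERT only, no UPDATE or DELETE) can be illustrated with a small log-backed store. This is a sketch under simple assumptions: `AppendOnlyStore` and its method names are hypothetical, and a real system would index the log rather than scan it.

```python
class AppendOnlyStore:
    """Hypothetical log-backed store: entries are only ever appended."""

    def __init__(self):
        self._log = []  # list of (key, value) in insertion order

    def insert(self, key, value):
        self._log.append((key, value))

    def current(self, key):
        """Latest value for key, found by scanning the log backwards."""
        for k, v in reversed(self._log):
            if k == key:
                return v
        return None

    def history(self, key):
        """Every value ever inserted for key, oldest first."""
        return [v for k, v in self._log if k == key]


store = AppendOnlyStore()
store.insert("price", 10)
store.insert("price", 12)   # a "change" is just another insert

latest = store.current("price")
all_versions = store.history("price")
```

`latest` is `12` while `all_versions` is `[10, 12]`: the earlier value is never overwritten, so the full history stays available.
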
  7. 7. Techniques to Scale Out Source: Felix Gessert
  9. 9. What is NoSQL? • NoSQL stands for “Not Only SQL”. • The term NoSQL was introduced by Carlo Strozzi in 1998 to name his file-based database. • It was reintroduced by Eric Evans when an event was organized to discuss open source distributed databases. • Eric states that “… but the whole point of seeking alternatives is that you need to solve a problem that relational databases are a bad fit for. …” • NoSQL is a non-relational approach to data management that supports dynamic and flexible schemas, storage optimized for web scale, and extreme performance, and that makes semi-structured and unstructured data easier to use and access. • Three major papers were the “seeds” of the NoSQL movement: • BigTable (Google) • Dynamo (Amazon) • CAP Theorem
  10. 10. What is NoSQL? • A class of non-relational data storage systems • Usually do not require a fixed table schema, nor do they use the concept of joins • All NoSQL offerings relax one or more of the ACID properties, typically the data consistency requirement • Cheap and easy to implement (open source) • Data is replicated to multiple nodes (copies are identical, which gives fault tolerance) and can be partitioned • Down nodes are easily replaced; no single point of failure • Can scale up and down
  12. 12. NoSQL Database Taxonomy Source: Highly Scalable
  13. 13. Key-value Store • A key-value store consists of a set of key-value pairs with unique keys. • Due to this simple structure, it only supports get and put operations. • Because the stored value is opaque to the database, pure key-value stores do not support operations beyond simple CRUD (Create, Read, Update, Delete). • Key-value stores are often referred to as schemaless. • Any assumptions about the structure of stored data are implicitly encoded in the application logic (schema-on-read) and not explicitly defined through a data definition language (schema-on-write). • The obvious advantage of this data model lies in its simplicity. • The very simple abstraction makes it easy to partition and query the data, so that the database system can achieve low latency as well as high throughput. • However, if an application demands more complex operations, this data model is not powerful enough. Source: Felix Gessert
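
A minimal sketch of the get/put model and of schema-on-read follows. The `KeyValueStore` class, the `user:42` key and the JSON encoding are assumptions chosen for illustration; the point is that the store only sees bytes and the application imposes the structure when it reads.

```python
import json


class KeyValueStore:
    """Hypothetical in-memory store: values are opaque bytes to the store."""

    def __init__(self):
        self._data = {}

    def put(self, key, value: bytes):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)


store = KeyValueStore()

# The application, not the store, decides that values are JSON user records.
store.put("user:42", json.dumps({"name": "Ada", "age": 36}).encode("utf-8"))

raw = store.get("user:42")                # the store returns bytes it never inspected
user = json.loads(raw.decode("utf-8"))    # schema-on-read: structure is imposed here
```

Because the store cannot look inside `raw`, a question such as "all users older than 30" would have to be answered in application code by fetching and decoding values.
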
  14. 14. Key-value Store • Properties • Focus on scaling to huge amounts of data • Designed to handle massive load • Based on Amazon’s Dynamo paper • Examples • (AP): DynamoDB, Riak, Voldemort • (CP): Redis, Scalaris Source: Felix Gessert
  15. 15. Wide-Column Store • Wide-column stores inherit their name from the image that is often used to explain the underlying data model: a relational table with many sparse columns. • Technically, however, a wide-column store is closer to a distributed multi-level sorted map: • The first-level keys, called row keys, identify rows, which themselves consist of key-value pairs. • The second-level keys are called column keys. • This storage scheme makes tables with an arbitrary number of columns feasible, because there is no column key without a corresponding value. • The set of all columns is partitioned into so-called column families to colocate on disk the columns that are usually accessed together. • On disk, wide-column stores do not colocate all data from each row, but only the values of the same column family and the same row. • Hence, an entity (a row) cannot be retrieved by one single lookup as in a document store, but has to be joined together from the columns of all column families. • However, this storage layout usually enables highly efficient data compression and makes retrieving only a portion of an entity very efficient. • The data is stored in lexicographic order of its keys, so that data accessed together is physically co-located, given a careful key design. Source: Felix Gessert
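
The multi-level map described above can be sketched as nested dictionaries, one per column family. This is an illustrative model only: `WideColumnTable` and its methods are hypothetical names, and the keys are sorted at scan time here, whereas a real store keeps them sorted on disk.

```python
from collections import defaultdict


class WideColumnTable:
    """Hypothetical model: column family -> row key -> column key -> value."""

    def __init__(self, column_families):
        # One separate map per column family, mimicking co-location on disk.
        self._families = {cf: defaultdict(dict) for cf in column_families}

    def put(self, row_key, family, column, value):
        self._families[family][row_key][column] = value

    def get_family(self, row_key, family):
        """Read one column family of a row: a single lookup."""
        return dict(self._families[family].get(row_key, {}))

    def get_row(self, row_key):
        """Reassemble a whole row by visiting every column family."""
        return {
            cf: dict(rows[row_key])
            for cf, rows in self._families.items()
            if row_key in rows
        }

    def scan(self, start, end):
        """Row keys in lexicographic order within [start, end)."""
        keys = {k for rows in self._families.values() for k in rows}
        return sorted(k for k in keys if start <= k < end)


table = WideColumnTable(["profile", "activity"])
table.put("user#001", "profile", "name", "Ada")
table.put("user#001", "activity", "last_login", "2016-10-01")
table.put("user#002", "profile", "name", "Bob")   # sparse: no activity columns

profile = table.get_family("user#001", "profile")
whole_row = table.get_row("user#001")
in_range = table.scan("user#001", "user#002")
```

Reading `profile` touches one map, while `whole_row` has to be assembled from both families, which mirrors the trade-off stated on the slide.
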
  16. 16. Wide-Column Store • Properties • Also called extensible record stores • Store data in records with an ability to hold very large numbers of dynamic columns • Name and format of the columns can vary from row to row in the same table • Since the column names as well as the record keys are not fixed, and since a record can have billions of columns, wide-column stores can be seen as two-dimensional key-value stores. • Examples • (AP): Apache Cassandra • (CP): Apache HBase, Apache Accumulo, Google Bigtable, Hypertable • (CA): Vertica Source: Felix Gessert
  17. 17. Document Store • Definition • A document store is a key-value store that restricts values to semi-structured formats such as JSON documents. • Properties • Compared to key-value stores, this restriction brings great flexibility in accessing the data. It is possible not only to fetch an entire document by its ID, but also to retrieve only parts of a document and to execute queries such as aggregation, query-by-example or even full-text search. • Can model more complex objects • Data model: collection of documents • Document: JSON, XML, other semi-structured formats. • Examples • (AP): Apache CouchDB, Riak, SimpleDB, Couchbase • (CP): MongoDB Source: Felix Gessert
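
The access patterns listed above (fetch by ID, partial retrieval, query-by-example) can be sketched with plain dictionaries. `DocumentStore` and its methods are hypothetical and do not mirror the API of any product named on the slide; matching is limited to top-level fields to keep the sketch short.

```python
class DocumentStore:
    """Hypothetical in-memory collection of JSON-like documents."""

    def __init__(self):
        self._docs = {}

    def insert(self, doc_id, document):
        self._docs[doc_id] = document

    def get(self, doc_id, fields=None):
        """Fetch a whole document, or only the requested top-level fields."""
        doc = self._docs[doc_id]
        if fields is None:
            return dict(doc)
        return {f: doc[f] for f in fields if f in doc}

    def find(self, example):
        """Query-by-example: ids of documents whose top-level fields match."""
        return sorted(
            doc_id
            for doc_id, doc in self._docs.items()
            if all(doc.get(k) == v for k, v in example.items())
        )


db = DocumentStore()
db.insert("a1", {"type": "article", "author": "ada", "tags": ["nosql"]})
db.insert("a2", {"type": "article", "author": "bob"})
db.insert("c1", {"type": "comment", "author": "ada"})

by_ada = db.find({"author": "ada"})
only_author = db.get("a1", fields=["author"])
```

Unlike a pure key-value store, the store can evaluate `find` itself because it understands the structure of the values.
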
  18. 18. Graph Databases • Properties • Focus on modeling the structure of data (interconnectivity) • Inspired by mathematical graph theory • Data consists of nodes and edges, with key-value pairs on both • Nodes may have properties (including ID) • Edges may have labels or roles • Examples • Apache Hama, FlockDB, InfoGrid, Neo4j, Pregel, Titan Source: Felix Gessert
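
A property graph of this kind can be sketched as a node map plus a list of labeled edges. The `Graph` class, the `FOLLOWS` label and the sample data are assumptions for illustration; real graph databases use indexed adjacency structures rather than a linear edge scan.

```python
class Graph:
    """Hypothetical property graph: key-value pairs on nodes and on edges."""

    def __init__(self):
        self.nodes = {}   # node id -> properties
        self.edges = []   # (source id, label, target id, properties)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, source, label, target, **props):
        self.edges.append((source, label, target, props))

    def neighbors(self, node_id, label):
        """Targets reachable from node_id over edges with the given label."""
        return [t for s, l, t, _ in self.edges if s == node_id and l == label]


g = Graph()
g.add_node("alice", name="Alice")
g.add_node("bob", name="Bob")
g.add_node("carol", name="Carol")
g.add_edge("alice", "FOLLOWS", "bob", since=2015)
g.add_edge("bob", "FOLLOWS", "carol", since=2016)

direct = g.neighbors("alice", "FOLLOWS")
# Two-hop traversal: whom do the people Alice follows follow?
two_hops = [t for n in direct for t in g.neighbors(n, "FOLLOWS")]
```

The traversal follows edges directly instead of joining tables, which is the access pattern graph databases are built around.
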
  20. 20. Brewer’s CAP Theorem • For any system sharing data (or multi-node database), it is “impossible” to guarantee all three of these properties simultaneously: • Consistency: all copies have the same value. 1. Strong consistency: ACID (Atomicity, Consistency, Isolation, Durability) • Atomicity: either the whole process is done or none of it is • Consistency: only valid data is written • Isolation: one operation at a time • Durability: once committed, it stays that way. 2. Weak consistency: BASE (Basically Available, Soft state, Eventual consistency) • Availability: reads and writes always succeed • Partition tolerance: system properties (consistency and/or availability) hold even when network failures prevent some machines from communicating with others • But to scale out, you need to partition! • That leaves a choice between consistency and availability. In almost all cases, availability is chosen over consistency!
  21. 21. BASE vs. ACID • Rise of the BASE (Basically Available, Soft state, Eventual consistency) model • Basically Available: the system seems to work all the time • Soft State: it doesn't have to be consistent all the time • Eventually Consistent: it becomes consistent at some later time • In other words: • When no updates occur for a long period of time, eventually all updates will propagate through the system and all the nodes will be consistent. • For a given accepted update and a given node, eventually either the update reaches the node or the node is removed from service. • Google, Yahoo, Facebook, Amazon and eBay all adopted CAP and BASE principles! • Read the blog post from the CTO of Amazon to learn more on the subject
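
Eventual consistency can be illustrated with two replicas that accept writes independently and later exchange their state. This is a sketch under stated assumptions: the version numbers, the "highest version wins" merge rule and the `sync` function are choices made for the example, not something the slide prescribes.

```python
class Replica:
    """Hypothetical replica storing key -> (version, value)."""

    def __init__(self):
        self.store = {}

    def write(self, key, value, version):
        current = self.store.get(key)
        # Assumed merge rule: keep the entry with the higher version.
        if current is None or version > current[0]:
            self.store[key] = (version, value)

    def read(self, key):
        entry = self.store.get(key)
        return entry[1] if entry else None


def sync(a, b):
    """One propagation round: exchange entries in both directions."""
    for key, (version, value) in list(a.store.items()):
        b.write(key, value, version)
    for key, (version, value) in list(b.store.items()):
        a.write(key, value, version)


r1, r2 = Replica(), Replica()
r1.write("cart", ["book"], version=1)
r2.write("cart", ["book", "pen"], version=2)

before = (r1.read("cart"), r2.read("cart"))   # replicas disagree (soft state)
sync(r1, r2)
after = (r1.read("cart"), r2.read("cart"))    # replicas agree once updates propagate
```

Both replicas stay available for reads and writes the whole time; they only become consistent after `sync` has run and no further updates arrive.
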
  22. 22. Another NoSQL Taxonomy
  23. 23. CAP Based Taxonomy 1. Relational databases choose Consistency and Availability (CA), ensuring writes are consistent and immediately available across all instances. 2. Many new database vendors are opting for Availability and Partition Tolerance (AP), accepting new or updated records without immediate confirmation from all instances (aka “eventually consistent”). 3. Other database vendors are opting for Consistency and Partition Tolerance (CP): the system tolerates arbitrary loss of messages between instances and keeps data consistent, at the cost of refusing some requests while a partition lasts. Source: Nathan Hurst
  24. 24. Criticisms of CAP Theorem • The first confusion is about the existence of CA systems, which pretend that partition tolerance is optional, or claim that partitions don't happen. • In reality, you can't sacrifice partition tolerance, because partitions happen in real large-scale systems all the time. • The second misconception is that the CAP theorem means you can't be consistent and available during partitions. That's not true. • Specifically, the CAP theorem only prevents everybody from being consistent and available, not anybody (some literature calls this always available). It doesn't prevent clients and replicas on the majority side of simple partitions from making progress, and experiencing both consistency and availability. • The third misconception is that the consistency in CAP is all or nothing, and that you can't offer any consistency guarantees at all during partitions. In reality, many very useful consistency models can be offered on all sides of a partition. • Implementation tricks like session stickiness and client-side caching can allow systems to offer useful models like read your writes, monotonic reads and even causal consistency. Bernstein and Das, and Bailis et al., have good overviews of some of the possibilities. Source: Marc Brooker
  25. 25. Criticisms of CAP Theorem • The fourth misconception is that eventual consistency is all about CAP, and that everybody would choose strong consistency for every application if it wasn't for the CAP theorem. • Another source of confusion is the different versions of the CAP theorem, from Brewer's original version, to his later writings, to Gilbert and Lynch's proof. • Some seem to call the former Brewer's Conjecture and the latter the CAP theorem, but this usage is far from universal. Typically, they're both just called CAP or the CAP theorem. Source: Marc Brooker
  26. 26. PACELC: An alternative CAP formulation • Daniel Abadi's “Consistency Tradeoffs in Modern Distributed Database System Design” proposes an alternative: PACELC. • Idea: classify systems according to their behavior both during network partitions and in normal operation. • If there is a partition (P), how does the system trade off availability and consistency (A and C)? Else (E), when the system is running normally in the absence of partitions, how does the system trade off latency (L) and consistency (C)? Source: Felix Gessert
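
The two PACELC questions can be encoded as a pair of choices. The sketch below assumes a shorthand of the form `PA/EL` (choice under partition, then choice otherwise); that notation and the function name `describe_pacelc` are assumptions for the example and do not appear on the slide.

```python
PARTITION_CHOICES = {"A": "availability", "C": "consistency"}
ELSE_CHOICES = {"L": "latency", "C": "consistency"}


def describe_pacelc(code):
    """Expand a shorthand such as 'PA/EL' into plain words."""
    partition_part, else_part = code.upper().split("/")
    if (len(partition_part) != 2 or partition_part[0] != "P"
            or partition_part[1] not in PARTITION_CHOICES):
        raise ValueError(f"expected PA or PC, got {partition_part!r}")
    if (len(else_part) != 2 or else_part[0] != "E"
            or else_part[1] not in ELSE_CHOICES):
        raise ValueError(f"expected EL or EC, got {else_part!r}")
    return (
        f"during a partition favors {PARTITION_CHOICES[partition_part[1]]}; "
        f"otherwise favors {ELSE_CHOICES[else_part[1]]}"
    )


summary = describe_pacelc("PA/EL")
```

With this encoding there are four possible classes (PA/EL, PA/EC, PC/EL, PC/EC), which makes explicit that the latency trade-off exists even when no partition occurs.
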
  28. 28. NoSQL Comparisons Sources: Ben Scofield, PWC
  29. 29. Form of Data Normalization Source: Lost … sorry
  30. 30. Database Evolution History Source: Robin Purohit
  31. 31. NoSQL Database Evolution History Source: Felix Gessert
  33. 33. 2016 Forrester Waves Source: Forrester
  34. 34. 2015 Magic Quadrant for Operational Database Management Systems
  35. 35. NoSQL Decision Tree Source: Felix Gessert
  38. 38. NoSQL Data Modeling Techniques Introduction • NoSQL data modeling often starts from the application-specific queries, as opposed to relational modeling: • Relational modeling is typically driven by the structure of available data. The main design theme is “What answers do I have?” • NoSQL data modeling is typically driven by application-specific access patterns, i.e. the types of queries to be supported. The main design theme is “What questions do I have?” • NoSQL data modeling often requires a deeper understanding of data structures and algorithms than relational database modeling does. • Data duplication and denormalization are first-class citizens. • Relational databases are not very convenient for hierarchical or graph-like data modeling and processing. • Graph databases are a natural fit for this area, but most NoSQL solutions are also surprisingly strong for such problems. Source: Highly Scalable Blog
  39. 39. NoSQL Data Modeling Techniques Denormalization • Denormalization can be defined as the copying of the same data into multiple documents or tables in order to simplify or optimize query processing, or to fit the user’s data into a particular data model. • In general, denormalization is helpful for the following trade-offs: • Query data volume or I/O per query vs. total data volume. • Using denormalization, one can group all data that is needed to process a query in one place. • This often means that for different query flows the same data will be accessed in different combinations. Hence we need to duplicate data, which increases total data volume. • Processing complexity vs. total data volume. • Modeling-time normalization and the consequent query-time joins increase the complexity of the query processor, especially in distributed systems. • Denormalization allows one to store data in a query-friendly structure to simplify query processing. • Applicability • Key-Value Stores, Document Databases, BigTable-style Databases Source: Highly Scalable Blog
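
The trade-off between I/O per query and total data volume can be shown with a small example. The `users` and `orders` data, the `user_name` field and the helper names are all hypothetical; the sketch copies a user's name into each order so that reading an order needs a single lookup.

```python
users = {"u1": {"name": "Ada", "city": "Paris"}}
orders = [
    {"order_id": "o1", "user_id": "u1", "total": 30},
    {"order_id": "o2", "user_id": "u1", "total": 12},
]


def denormalize(orders, users):
    """Copy the user's name into each order so a read needs no second lookup."""
    return [
        dict(order, user_name=users[order["user_id"]]["name"])
        for order in orders
    ]


orders_by_id = {o["order_id"]: o for o in denormalize(orders, users)}

# Read path: one lookup returns everything the "order details" query needs.
order_view = orders_by_id["o1"]


def rename_user(user_id, new_name):
    """Write path: the duplicated name must be updated in every copy."""
    users[user_id]["name"] = new_name
    touched = 0
    for order in orders_by_id.values():
        if order["user_id"] == user_id:
            order["user_name"] = new_name
            touched += 1
    return touched


copies_updated = rename_user("u1", "Ada L.")
```

The read becomes a single lookup, while the rename has to touch two order copies in addition to the user record: cheaper queries are paid for with more stored data and more work on writes.
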
  40. 40. NoSQL Data Modeling Techniques Aggregates • All major genres of NoSQL provide soft schema capabilities: • Key-Value Stores and Graph Databases typically do not place constraints on values, so values can be in an arbitrary format. • BigTable models support soft schema via a variable set of columns within a column family and a variable number of versions for one cell. • Document databases are inherently schema-less, although some of them allow one to validate incoming data using a user-defined schema. • The objective is to form classes of entities with complex internal structures (nested entities) and to vary the structure of particular entities in order to: • Minimize one-to-many relationships by means of nested entities and, consequently, reduce joins. • Mask “technical” differences between business entities and model heterogeneous business entities using one collection of documents or one table. • Applicability • Key-Value Stores, Document Databases, BigTable-style Databases Source: Highly Scalable Blog
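
Both objectives can be sketched with one collection of heterogeneous documents. The product data, field names and helper functions below are hypothetical: a book and an album share common fields, and the album nests its tracks instead of referencing a separate table.

```python
# One collection holds heterogeneous business entities (soft schema).
products = [
    {"id": "p1", "kind": "book", "title": "NoSQL Basics", "price": 25,
     "authors": ["A. Writer"]},
    {"id": "p2", "kind": "album", "title": "Blue Notes", "price": 15,
     # One-to-many relationship stored as a nested entity: no join needed.
     "tracks": [{"no": 1, "name": "Intro"}, {"no": 2, "name": "Theme"}]},
]


def titles_under(collection, max_price):
    """Query on the fields every entity shares, whatever its kind."""
    return [p["title"] for p in collection if p["price"] <= max_price]


def track_names(product):
    """Read the nested entities of one aggregate; empty if it has none."""
    return [t["name"] for t in product.get("tracks", [])]


cheap = titles_under(products, 20)
album_tracks = track_names(products[1])
```

`cheap` is computed across both kinds of product, and `album_tracks` is read straight from the aggregate without consulting a second collection.
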
  41. 41. NoSQL Data Modeling Techniques Aggregates example Source: Highly Scalable Blog
  42. 42. NoSQL Data Modeling Techniques Application Side Joins • Joins are rarely supported in NoSQL solutions. • Joins are therefore often handled at design time, as opposed to relational models where joins are handled at query execution time. • Query-time joins almost always mean a performance penalty, but in many cases one can avoid joins using Denormalization and Aggregates, i.e. embedding nested entities. • Of course, in many cases joins are inevitable and should be handled by the application: • Many-to-many relationships are often modeled by links and require joins. • Aggregates are often inapplicable when entity internals are subject to frequent modifications. It is usually better to keep a record that something happened and join the records at query time, as opposed to changing a value. • Applicability • Key-Value Stores, Document Databases, BigTable-style Databases, Graph Databases Source: Highly Scalable Blog
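
A many-to-many relationship joined in application code might look like the sketch below. The `users`, `groups` and `memberships` collections and the `groups_of` helper are hypothetical; each dictionary stands in for a separate collection that the store itself cannot join.

```python
users = {"u1": {"name": "Ada"}, "u2": {"name": "Bob"}}
groups = {"g1": {"title": "Databases"}, "g2": {"title": "Graphs"}}

# Many-to-many relationship modeled as explicit links.
memberships = [("u1", "g1"), ("u1", "g2"), ("u2", "g1")]


def groups_of(user_id):
    """Join performed by the application: read the links, then each group."""
    group_ids = [g for u, g in memberships if u == user_id]   # step 1: links
    return [groups[g]["title"] for g in group_ids]            # step 2: one get per group


ada_groups = groups_of("u1")
```

The join costs one extra lookup per linked group, which is the performance penalty the slide refers to; it is accepted here because embedding groups inside users would duplicate data that many users share.
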
  43. 43. NoSQL Data Modeling Techniques Others • Atomic Aggregates • Many, although not all, NoSQL solutions have limited transaction support. It is common to model data using the Aggregates technique to guarantee some of the ACID properties. • Aggregates allow one to store a single business entity as one document, row or key-value pair and update it atomically (whereas normalized data typically requires multi-place updates). • Applicability • Key-Value Stores, Document Databases, BigTable-style Databases • Enumerable Keys • Perhaps the greatest benefit of an unordered key-value data model is that entries can be partitioned across multiple servers by just hashing the key. • Sorting makes things more complex, but sometimes an application can take advantage of ordered keys even if the storage does not offer such a feature. • Applicability • Key-Value Stores Source: Highly Scalable Blog
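
Hash partitioning and enumerable keys can be combined in a short sketch. The four in-memory shards, the `msg:<n>` key format and the use of MD5 as a stable hash are assumptions for illustration; the application generates sequential ids itself, so it can walk a range even though no shard keeps its keys sorted.

```python
import hashlib


def shard_for(key, shard_count):
    """Pick a shard from a stable hash of the key (MD5 is an arbitrary choice)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


# Four hypothetical shards, each an unordered key-value map.
shards = [{} for _ in range(4)]


def put(key, value):
    shards[shard_for(key, len(shards))][key] = value


def get(key):
    return shards[shard_for(key, len(shards))].get(key)


# Enumerable keys: message ids are sequential, so a range can be fetched
# with plain gets, without any ordering support from the store.
for n in range(1, 6):
    put(f"msg:{n}", f"message {n}")


def fetch_range(first, last):
    return [get(f"msg:{n}") for n in range(first, last + 1)]


window = fetch_range(2, 4)
```

`window` contains messages 2 to 4 in order, although the entries may live on different shards and each shard is an unordered map.
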
  45. 45. Key Papers 1. The Amazon Dynamo paper is a classic. Almost everyone in the NoSQL world has read this paper. 2. Google's Bigtable paper. 3. Werner Vogels's "Eventually Consistent" (originally published in ACM Queue). 4. Brewer's CAP Theorem (a foundational bit of scalability theory) is well-explained here. Also see Brewer's original slides from his famous July 2000 PODC keynote. 5. The slideshows from the June 11, 2009 NoSQL meetup in San Francisco.
  46. 46. Resources • Excellent presentation from Felix Gessert • Martin Fowler's dedicated NoSQL site • PWC Technology Forecast: “Remapping the database landscape” • Highly Scalable Blog: “NoSQL data modeling techniques” • Introduction to NoSQL by Laurent Broudoux (French) • NoSQLDatabase.org: "Your Ultimate Guide to the Non-Relational Universe!"
  47. 47. Twitter: http://www.twitter.com/welkaim • SlideShare: http://www.slideshare.net/welkaim • EA Digital Codex: http://www.eacodex.com/ • LinkedIn: http://fr.linkedin.com/in/williamelkaim Claudine O'Sullivan
