Modern Databases in a Nutshell
Memory-based Distributed Transactional Databases:
• Remove the performance impediments of a traditional OLTP database by eliminating overheads such as concurrency control, locking and data logging, and by keeping the data in main memory
• Use main memory to store data (a departure from traditional databases, which store data on disk)
• Use a pattern known as anti-caching (the opposite of caching): memory is the primary store and cold data is evicted to disk
• Use command logging (not data logging as in traditional databases)
• Each execution engine is single-threaded; a node runs multiple single-threaded engines
• Distributed cluster of shared-nothing machines, with high availability
• High throughput: reported as up to 100x faster than traditional OLTP, while maintaining ACID transactions
• Support streaming analytics with millisecond latency
Ref: H-Store, VoltDB
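The command logging and single-threaded engine bullets above can be sketched in a few lines of Python. This is a minimal illustration, not how H-Store or VoltDB is implemented: the `Partition` class, the `deposit` and `transfer` procedures and the `recover` function are all hypothetical names. The point it shows is that the log holds the command (procedure name plus arguments) rather than the changed rows, and that replaying deterministic commands in order rebuilds the in-memory state.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stored procedures: deterministic functions over in-memory state.
def deposit(state: Dict[str, int], account: str, amount: int) -> None:
    state[account] = state.get(account, 0) + amount

def transfer(state: Dict[str, int], src: str, dst: str, amount: int) -> None:
    if state.get(src, 0) < amount:
        raise ValueError("insufficient funds")
    state[src] -= amount
    state[dst] = state.get(dst, 0) + amount

PROCEDURES: Dict[str, Callable[..., None]] = {
    "deposit": deposit,
    "transfer": transfer,
}

class Partition:
    """One single-threaded engine: commands run one at a time, so no locks."""

    def __init__(self) -> None:
        self.state: Dict[str, int] = {}
        self.command_log: List[Tuple[str, tuple]] = []

    def execute(self, name: str, *args) -> None:
        # Run the procedure, then record the command itself
        # (name + arguments), not the rows it changed.
        PROCEDURES[name](self.state, *args)
        self.command_log.append((name, args))

def recover(command_log: List[Tuple[str, tuple]]) -> Dict[str, int]:
    """Rebuild the state after a crash by replaying commands in order."""
    state: Dict[str, int] = {}
    for name, args in command_log:
        PROCEDURES[name](state, *args)
    return state

live = Partition()
live.execute("deposit", "alice", 100)
live.execute("deposit", "bob", 20)
live.execute("transfer", "alice", "bob", 30)

recovered = recover(live.command_log)
print(recovered == live.state)
```

A command log entry stays small however many rows a procedure touches, which is the trade-off against data logging: recovery has to re-execute the work.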
Column Stores:
• Store data by column; querying with a column executor is much faster than with a traditional row executor (reported as up to 100x faster than a relational DB), a technique called 'vector processing'
• Data compression is high because values within a column are similar and easy to compress; record headers are optimized
• Use a shared-nothing architecture; each node stores part of the data
• High availability, high elasticity
• Support a SQL-like query language, with built-in analytical functions
• Integrate with open source technologies such as Hadoop and R, and are cloud enabled
Ref: HP Vertica, SAP HANA
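A small Python sketch of the two ideas above: pivoting rows into columns so that an aggregate reads only the column it needs, and compressing a column with run-length encoding. The sample data and the helper names `to_columns` and `run_length_encode` are assumptions made for illustration; real column stores use their own storage formats and several encodings.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical fact table, stored row by row.
rows = [
    {"region": "EU", "product": "A", "sales": 10},
    {"region": "EU", "product": "B", "sales": 20},
    {"region": "EU", "product": "A", "sales": 5},
    {"region": "US", "product": "A", "sales": 7},
    {"region": "US", "product": "B", "sales": 8},
]

def to_columns(table: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Pivot a list of rows into one list of values per column."""
    columns: Dict[str, List[Any]] = {}
    for row in table:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

def run_length_encode(values: List[Any]) -> List[Tuple[Any, int]]:
    """Collapse consecutive repeats into (value, count) pairs."""
    encoded: List[Tuple[Any, int]] = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1] = (value, encoded[-1][1] + 1)
        else:
            encoded.append((value, 1))
    return encoded

columns = to_columns(rows)

# The aggregate touches only the "sales" column, not whole rows.
total_sales = sum(columns["sales"])

# A sorted, low-cardinality column compresses to a couple of pairs.
encoded_region = run_length_encode(columns["region"])

print(total_sales)
print(encoded_region)
```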
NoSQL Distributed Document Stores:
• High throughput and performance (millions of operations per second)
• Support database replication (primary and secondary nodes) and automatic failover
• Data is stored as JSON documents, so objects from a programming language can be persisted as-is (schema-later paradigm); MongoDB uses BSON serialization
• No JOINs and no ACID transactions across multiple documents (later MongoDB releases add $lookup and multi-document transactions)
• Use a JSON-like query language, with powerful querying and aggregation capabilities, including analytics
Ex: MongoDB, Apache CouchDB
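To make the JSON-like query idea concrete, here is a minimal in-memory sketch in Python. It does not use any real driver: `get_field`, `matches` and `find` are hypothetical helpers, and only equality and a `$gt` comparison are handled. The documents are self-contained, including a nested `address` object, and the query is itself a JSON-like structure.

```python
from collections import Counter
from typing import Any, Dict, List

# Hypothetical collection of self-contained documents.
documents = [
    {"_id": 1, "name": "Ada", "age": 36, "address": {"city": "London"}},
    {"_id": 2, "name": "Linus", "age": 28, "address": {"city": "Helsinki"}},
    {"_id": 3, "name": "Grace", "age": 45, "address": {"city": "London"}},
]

def get_field(doc: Dict[str, Any], path: str) -> Any:
    """Resolve a dotted path such as 'address.city'; None if missing."""
    current: Any = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current

def matches(doc: Dict[str, Any], query: Dict[str, Any]) -> bool:
    """Check a document against a query (equality and $gt only)."""
    for path, condition in query.items():
        value = get_field(doc, path)
        if isinstance(condition, dict):
            if "$gt" in condition:
                if value is None or not value > condition["$gt"]:
                    return False
        elif value != condition:
            return False
    return True

def find(collection: List[Dict[str, Any]], query: Dict[str, Any]) -> List[Dict[str, Any]]:
    return [doc for doc in collection if matches(doc, query)]

# Query expressed as a JSON-like document.
londoners_over_40 = find(documents, {"address.city": "London", "age": {"$gt": 40}})

# A simple aggregation: number of documents per city.
count_by_city = Counter(get_field(doc, "address.city") for doc in documents)

print([doc["name"] for doc in londoners_over_40])
print(dict(count_by_city))
```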
NoSQL Distributed Key-value Store:
• High performance, high throughput (especially on writes)
• Ideally suited to data centers in geographically different places, with fast replication
• No single point of failure, automatic failover
• Linearly scalable
• Supports a SQL-like query language; no joins allowed
• Data is modeled by query, using key-value stores organized into column families
Ex: Apache Cassandra
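"Modeling by query" can be sketched as follows, assuming a toy `ColumnFamily` class that groups rows by a partition key and returns them sorted by a clustering key. The class, the two tables and `post_message` are hypothetical and only imitate the shape of the idea: since joins are not available, each query gets its own table and the same fact is written to each of them.

```python
from typing import Any, Dict, List, Optional, Tuple

Row = Tuple[str, Dict[str, Any]]

class ColumnFamily:
    """Toy column family: partition key -> {clustering key -> columns}."""

    def __init__(self) -> None:
        self._partitions: Dict[str, Dict[str, Dict[str, Any]]] = {}

    def upsert(self, partition_key: str, clustering_key: str,
               columns: Dict[str, Any]) -> None:
        # Writing the same keys again overwrites the earlier columns.
        self._partitions.setdefault(partition_key, {})[clustering_key] = columns

    def read_partition(self, partition_key: str,
                       limit: Optional[int] = None) -> List[Row]:
        # One lookup by partition key, rows ordered by clustering key.
        rows = sorted(self._partitions.get(partition_key, {}).items())
        return rows[:limit]

# One table per known query, instead of one normalized table plus joins.
messages_by_user = ColumnFamily()   # query: "all messages sent by a user"
messages_by_room = ColumnFamily()   # query: "all messages in a room"

def post_message(user: str, room: str, sent_at: str, text: str) -> None:
    # Denormalize: write the same fact into every table a query needs.
    messages_by_user.upsert(user, sent_at, {"room": room, "text": text})
    messages_by_room.upsert(room, sent_at, {"user": user, "text": text})

post_message("ann", "ops", "2024-01-01T10:00", "disk alert")
post_message("bob", "ops", "2024-01-01T10:05", "looking into it")
post_message("ann", "dev", "2024-01-01T11:00", "deploy done")

ann_messages = messages_by_user.read_partition("ann")
ops_messages = messages_by_room.read_partition("ops")

print([columns["room"] for _, columns in ann_messages])
print([columns["user"] for _, columns in ops_messages])
```

The cost of this design is extra writes and storage, which fits a store that is strongest on write throughput.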
NoSQL Distributed Data Stores, using Apache Lucene:
• Store JSON documents (Elasticsearch) and multiple document formats such as email and PDF (Apache Solr)
• Can scale out to hundreds of servers and handle petabytes of data, while hiding the complexity of distributed systems; use shards (primary and replica) to scale horizontally
• Do not support ACID transactions across multiple documents
• Have extensive built-in querying and aggregation functions
• Use Lucene search and an inverted index (every field is indexed), providing very fast text-based search
• Limited joins allowed (a join can be used to restrict the output from one document type)
• Queries can be sent to any node in the cluster to trigger a full distributed search across all shards, with load balancing built in
Ex: Elasticsearch, Apache Solr
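The inverted index and the search across all shards can be illustrated with a short Python sketch. It is a simplification under stated assumptions: `Shard`, `Cluster` and `tokenize` are hypothetical names, documents are routed to shards by `doc_id % shard_count`, a query requires every term to match, and there is no ranking or replication.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

class Shard:
    """Holds part of the documents and its own inverted index."""

    def __init__(self) -> None:
        self.docs: Dict[int, str] = {}
        self.index: Dict[str, Set[int]] = defaultdict(set)  # term -> doc ids

    def add(self, doc_id: int, text: str) -> None:
        self.docs[doc_id] = text
        for term in tokenize(text):
            self.index[term].add(doc_id)

    def search(self, query: str) -> Set[int]:
        # A document matches only if it contains every query term.
        terms = tokenize(query)
        if not terms:
            return set()
        result = set(self.index.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.index.get(term, set())
        return result

class Cluster:
    """Routes writes to one shard and fans each query out to all shards."""

    def __init__(self, shard_count: int) -> None:
        self.shards = [Shard() for _ in range(shard_count)]

    def add(self, doc_id: int, text: str) -> None:
        self.shards[doc_id % len(self.shards)].add(doc_id, text)

    def search(self, query: str) -> List[int]:
        hits: Set[int] = set()
        for shard in self.shards:
            hits |= shard.search(query)
        return sorted(hits)

cluster = Cluster(shard_count=2)
cluster.add(1, "Distributed search with Lucene")
cluster.add(2, "Lucene builds an inverted index")
cluster.add(3, "Graph databases store relationships")
cluster.add(4, "An inverted index makes text search fast")

print(cluster.search("inverted index"))
print(cluster.search("lucene"))
```

Looking up a term is a dictionary access followed by set intersections, which is why text search stays fast as the number of documents grows.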
Distributed Data Store, supporting ACID transactions:
• Scalable; distributes data to multiple servers across geographical locations (proprietary clustering model)
• Supports ACID transactions and uses SQL
• Combines the features of a traditional RDBMS with elastic scalability and availability
• Cloud enabled
• High performance
Ex: NuoDB
Graph DB:
• Query performance on connected data is orders of magnitude better than an RDBMS, and does not deteriorate as the data grows
• Allows new relationships and node types to be added without any change to existing queries
• Provides native graph processing and storage
• Scales horizontally using shards
• Supports ACID transactions
• Has its own powerful query language (Cypher)
• Superior caching features
Ex: Neo4j
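The traversal behaviour described above can be sketched with plain adjacency sets in Python. This is an illustration only, not Neo4j's storage engine or Cypher: the `Graph` class, the `within_hops` function and the sample relationships are hypothetical. It shows that a query follows relationships outward from a start node, so its cost depends on the neighbourhood visited rather than on the total size of the data, and that adding a new relationship type leaves an existing query untouched.

```python
from collections import deque
from typing import Dict, Set

class Graph:
    """Directed graph: node -> relationship type -> set of neighbours."""

    def __init__(self) -> None:
        self.adjacency: Dict[str, Dict[str, Set[str]]] = {}

    def relate(self, source: str, rel_type: str, target: str) -> None:
        self.adjacency.setdefault(source, {}).setdefault(rel_type, set()).add(target)
        self.adjacency.setdefault(target, {})

    def neighbours(self, node: str, rel_type: str) -> Set[str]:
        return self.adjacency.get(node, {}).get(rel_type, set())

def within_hops(graph: Graph, start: str, rel_type: str, max_hops: int) -> Set[str]:
    """Breadth-first search along one relationship type, up to max_hops."""
    seen = {start}
    found: Set[str] = set()
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for neighbour in graph.neighbours(node, rel_type):
            if neighbour not in seen:
                seen.add(neighbour)
                found.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return found

graph = Graph()
graph.relate("alice", "KNOWS", "bob")
graph.relate("bob", "KNOWS", "carol")
graph.relate("carol", "KNOWS", "dave")

before = within_hops(graph, "alice", "KNOWS", 2)

# A new relationship type is added later; the KNOWS query is unchanged.
graph.relate("alice", "WORKS_AT", "acme")
after = within_hops(graph, "alice", "KNOWS", 2)

print(sorted(after))
```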
When to use which option:
• Use a Main Memory DB when you have the following requirements:
    • Response time needs to be very fast
    • You need very high throughput, in the order of millions of transactions per second
    • You need ACID transactions
    • Your data follows a relational pattern
• Use a Column Store DB when you have the following requirements:
    • Response time and throughput needs are very high
    • You have data warehouse fact tables with a very large number of columns and complex querying requirements
    • Data growth is rapid
    • You need to generate complex analytics with fast response times
• Use a NoSQL Document Store when you have the following requirements:
    • Your data can be described in self-contained documents, with few or no relations
    • You need to generate analytics using aggregation techniques
    • You need text-centric search capabilities and ranking of results
    • Your data can grow rapidly and you need elastic scalability
    • You do not have ACID transaction requirements when updating data
• Use a Distributed NoSQL Key-value Store when you have the following requirements:
    • Your data centers are geographically apart
    • You know your queries in advance and can model the data around them by defining column families
    • You have complex querying and analytics requirements
    • You have self-contained data structures with no relations
    • Your data can grow exponentially
    • You need high performance
• Use a Distributed Transactional DB when you have the following requirements:
    • You have data centers spread across geographical locations
    • Your data can grow rapidly
    • You need to maintain transactions with ACID properties
    • Your data is defined in relational form
• Use a Graph Database when you have the following requirements:
    • You have highly interconnected data, e.g. social networks and connections
    • Your data can grow exponentially
    • You need to maintain transactions with ACID properties
    • You need fast response times
    • Your data model can change very quickly over time
