Choosing the right data store

Before looking at the criteria for choosing a data store, let us briefly recap the fundamentals.

ACID Properties:
Atomicity: Every transaction follows an all-or-nothing rule. If one part of the transaction
fails, the entire transaction fails.

Consistency: Only valid data is written to the database. Data is committed only if it passes all
the rules, such as referential constraints, data types and triggers.

Isolation: Concurrent transactions do not affect each other’s execution.

Durability: Any transaction committed to the database will not be lost.
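
To make the all-or-nothing rule concrete, here is a minimal sketch using Python's built-in sqlite3 module. The accounts table, the account names and the transfer helper are hypothetical and exist only for illustration. The credit is applied first on purpose, so that when the debit violates the CHECK constraint the already-applied credit has to be rolled back.

```python
import sqlite3

def make_accounts_db():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "name TEXT PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                     [("alice", 100), ("bob", 50)])
    conn.commit()
    return conn

def transfer(conn, src, dst, amount):
    """Credit dst, then debit src, as one transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        # The debit broke the CHECK constraint, so the credit was undone too.
        return False

def balances(conn):
    """Return {name: balance} for every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

A transfer that would overdraw the source account returns False and leaves both balances exactly as they were, which is the behaviour the Atomicity and Consistency definitions describe.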

CAP Theorem:
Consistency: All nodes see the same data at the same time.

Availability: A guarantee that every request receives a response about whether it succeeded or
failed.

Partition Tolerance: The system continues to operate despite arbitrary message loss or failure
of part of the system.

ACID is a set of rules that a database can choose to follow; they guarantee how it handles
transactions and keeps data safe and reliable. CAP describes three properties of a distributed
storage system that cannot all be guaranteed at the same time. If ACID compliance is a must,
go with a relational data store. If you are designing a highly scalable distributed system and
are willing to give up at least one attribute of the CAP theorem, for example by accepting
eventual consistency, go with a NoSQL data store.
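
Eventual consistency can be pictured with a toy model. The sketch below is an assumption-laden illustration, not how any particular NoSQL product replicates data: the class name, the two-node layout and the explicit replicate step are all hypothetical. A write is acknowledged once the primary has it, and the replica only catches up later.

```python
from collections import deque

class EventuallyConsistentStore:
    """Toy two-node store: writes land on the primary, the replica lags."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self._pending = deque()  # writes not yet applied to the replica

    def write(self, key, value):
        # Acknowledged as soon as the primary has the value.
        self.primary[key] = value
        self._pending.append((key, value))

    def read_primary(self, key):
        return self.primary.get(key)

    def read_replica(self, key):
        # May return stale data (or None) until replicate() has run.
        return self.replica.get(key)

    def replicate(self):
        """Apply all queued writes to the replica, in write order."""
        while self._pending:
            key, value = self._pending.popleft()
            self.replica[key] = value
```

Between write and replicate, a read served by the replica disagrees with the primary. That window is what an application accepts when it is fine with eventual consistency.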

Relational vs NoSQL:
The table below captures the high-level differences between RDBMS and NoSQL. A table cannot
capture every granular detail, so architects must understand the application requirements
before choosing between a relational and a non-relational data store.

RDBMS | NoSQL
ACID Compliant | Follows CAP theorem
Persistent data store, Guaranteed Consistency | Heavily sharded, eventually consistent
Trends towards availability and consistency | Trends towards partition tolerance
Good fit for transactional databases where atomicity and integrity is needed and data is stored in rows and columns based on relations | Good fit for distributed databases based on key value pairs, documents, graphs or wide column stores
Recommended for mission critical data | Recommended for analytical data
Data access through SQL | Data access through Key
Limited scalability, or manual sharding | Highly Scalable with auto sharding
Predefined schema for structured data. Define your data types and structure first. Hard to change later as data grows. The schema change is expensive, if you decide to add one extra attribute you will have to modify the table schema first by alter command which would hold a lock. | Dynamic schema for structured, unstructured and semistructured data. No need to define structure first. The schema is dynamic in the sense you just need to save another document with modified / new attribute / key value pair.
Each business attribute is a column. | Each business attribute is a key.
create table EmployeeTable(name varchar(20), age int(2), sex char(6)); insert into EmployeeTable (name, age, sex) values ("Mike", 12, "male"); | db.EmployeeTable.save({"name":"Mike","age":12,"sex":"male"});
Database==>Tables==>Rows | Database==>Collection==>Documents
Vertically scalable by increasing the horsepower of the hardware | Horizontally scalable by adding more servers in the pool
Well instrumented. Relies on vendor support | Relies on dev expertise and community support
Mix of open source (MySQL, PostgreSQL) and commercial (Oracle, MSSQL) | All open source (MongoDB, Cassandra, HBase)
TCO high on CAPEX, for proprietary databases like Oracle the licensing cost is high. | TCO high on OPEX, depending on choice of NoSQL data store, the development cost could be high so should be reviewed for calculating total cost of ownership.
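
The schema row of the table can be sketched side by side. This is an illustration under stated assumptions: sqlite3 stands in for a relational database, a plain list of dicts stands in for a document collection, and the department attribute and the second employee are hypothetical. SQLite does not reproduce the locking cost of an alter command on a large production table; the point here is only the order of operations.

```python
import sqlite3

# Relational side: the table structure is declared up front.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, age INTEGER, sex TEXT)")
conn.execute("INSERT INTO employee VALUES ('Mike', 12, 'male')")

# A new attribute requires a schema change before any row can carry it.
conn.execute("ALTER TABLE employee ADD COLUMN department TEXT")
conn.execute("INSERT INTO employee VALUES ('Ann', 30, 'female', 'sales')")
rows = conn.execute(
    "SELECT name, department FROM employee ORDER BY name").fetchall()

# Document side: each document carries its own keys, so a new attribute
# is simply saved with the next document.
employees = [{"name": "Mike", "age": 12, "sex": "male"}]
employees.append(
    {"name": "Ann", "age": 30, "sex": "female", "department": "sales"})

# Readers must tolerate documents that lack the newer key.
departments = [doc.get("department") for doc in employees]
```

The flexibility on the document side moves work into the application: every reader has to handle the missing key, which is part of the development cost the table mentions under TCO.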

MySQL vs Oracle:
The DB-Engines ranking shows that 7 of the top 10 databases are relational. Let us compare
MySQL and Oracle, the two most popular database technologies for storing relational data.

When to choose MySQL:
Use MySQL when you need an ACID-compliant RDBMS for storing small to medium footprint
transactional and persistent data in the range of a few hundred gigabytes. Database access
should consist mostly of relatively simple lookups, with major functionality built in the middle
tier (Memcached, Redis, etc.) and the database not expected to do heavy processing. Whenever
there is a tradeoff between speed and capabilities, MySQL tends to keep its database engine fast.
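
One common way to keep database access down to simple lookups is a cache-aside pattern in the middle tier. The sketch below is illustrative only: a Python dict stands in for Memcached or Redis, sqlite3 stands in for MySQL, and the users table, class and method names are hypothetical.

```python
import sqlite3

def make_users_db():
    """Create an in-memory stand-in database with one hypothetical user."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO users VALUES (1, 'Mike')")
    conn.commit()
    return conn

class CachedLookup:
    """Cache-aside reads: check the cache first, fall back to the database."""

    def __init__(self, conn):
        self.conn = conn
        self.cache = {}   # stand-in for Memcached or Redis
        self.db_hits = 0  # counts lookups that reached the database

    def get_user(self, user_id):
        if user_id in self.cache:
            return self.cache[user_id]
        self.db_hits += 1
        row = self.conn.execute(
            "SELECT name FROM users WHERE id = ?", (user_id,)).fetchone()
        if row is None:
            return None
        self.cache[user_id] = row[0]
        return row[0]

    def rename_user(self, user_id, name):
        with self.conn:  # commit the write as one transaction
            self.conn.execute(
                "UPDATE users SET name = ? WHERE id = ?", (name, user_id))
        # Invalidate so the next read fetches the fresh value.
        self.cache.pop(user_id, None)
```

Repeated reads of the same key reach the database once; the database itself only ever sees a primary-key lookup or a single-row update.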

MySQL Advantages:
● Free, fast, lightweight, open source
● High on developer productivity, easy and quick to set up.
● Active open source development community, with more capabilities added in every
release.

MySQL Limitations:
● Single-threaded replication, unable to support heavy writes.
● Limited scalability on a single node, unable to handle large data sizes.
● Limited ability to provide high availability with data consistency under heavy writes.

MySQL Scalability Options:
● Scale out reads by adding slaves. MySQL 5.6 offers GTID-based parallel replication,
which is still per schema. MySQL 5.7 offers true parallel replication.
● Scale out reads and writes by sharding. This should be part of the architecture runway
discussion. Carefully review the sharding key, the sharding unit and the process to handle
data merges, and choose sharding only when your application growth absolutely warrants
application-based partitioning. Currently there is no mature auto-sharding technology, so
application logic is required to deal with the complexity of sharding: key lookups,
result set merges and additional aggregation logic. A few open source sharding
technologies to consider: Jetpants from Tumblr (Yahoo), Vitess from Google, MySQL Fabric
from Oracle, Tesora virtualization.
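
The application logic that sharding requires (key lookups, result set merges and aggregation) can be sketched as follows. This is a simplified illustration, not the design of any of the tools listed above: in-memory dicts stand in for separate MySQL instances, and the class name, the hash-modulo routing and the sample columns are assumptions.

```python
import hashlib
import heapq

class ShardedTable:
    """Application-level sharding over N in-memory stand-in shards."""

    def __init__(self, shard_count):
        self.shards = [dict() for _ in range(shard_count)]

    def _shard_for(self, key):
        # A stable hash of the sharding key picks the shard. Python's
        # built-in hash() is avoided because it varies between processes.
        digest = hashlib.sha256(str(key).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def put(self, key, row):
        self.shards[self._shard_for(key)][key] = row

    def get(self, key):
        # Key lookup: exactly one shard is consulted.
        return self.shards[self._shard_for(key)].get(key)

    def total(self, column):
        # Aggregation: each shard computes a partial sum, then we combine.
        return sum(sum(row[column] for row in shard.values())
                   for shard in self.shards)

    def top_n(self, n, column):
        # Result set merge: take the local top n per shard, then merge.
        partials = [heapq.nlargest(n, shard.values(), key=lambda r: r[column])
                    for shard in self.shards]
        merged = (row for part in partials for row in part)
        return heapq.nlargest(n, merged, key=lambda r: r[column])
```

Hash-modulo routing is the simplest choice but makes adding shards expensive, because most keys map to a different shard once the count changes. That is one reason the sharding key and sharding unit deserve careful review up front.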

When to choose Oracle:
Use Oracle when you need an ACID-compliant RDBMS for storing revenue-critical, medium to large
footprint transactional and persistent data. Oracle is preferred for large-scale enterprise data
marts and data warehouses where you need lots of functionality within the database, such as
partitioning, parallelism, resource manager, very large joins and analytic functions, and where
the DB storage requirement is in the range of hundreds of terabytes. RAC is also a good choice
for a large OLTP database with heavy read/write traffic, especially when sharding is not an
option for the application.

Oracle Advantages:
● Feature rich to support VLDBs
● HA with data consistency
● Well instrumented with detailed data dictionary for better performance monitoring and
tuning
● Well integrated with Hadoop for reporting and analytics

Oracle Limitations:
● H/W setup is rack dependent. All nodes within an Oracle cluster should be set up within
the same cabinet and connected to the same switch.
● Cluster installation depends on infrastructure expertise. Experienced DBAs, storage
admins and sys admins are needed to draw the cabling diagram and provide custom install
instructions to the data center team.
● Annual licensing cost

Oracle Scalability Options:
● Consider block storage SAN like EMC. Compared to file-based NFS storage like NetApp,
SAN provides higher throughput for IO-intensive workloads.
● Implement Oracle dNFS (it’s free and not a licensed feature) for NetApp-based storage
to provide multiple paths to NFS and increase the IO throughput.
● Add more nodes to the Oracle cluster.
● Consider engineered appliances like Exadata if the budget allows.

Feature | Oracle | MySQL
ACID | ✔ | ✔
Standard SQL Support | ✔ | ✔
Open Source | ✖ | ✔
Easier to install and manage | ✖ | ✔
Independent of Infrastructure expertise | ✖ | ✔
Popular with Web Apps | ✖ | ✔
VLDB | ✔ | ✖
Reporting Analytics | ✔ | ✖
Enterprise DW | ✔ | ✖
HA with data consistency | ✔ | ✖
Resource Manager | ✔ | ✖
Instrumentation | ✔ | ✖
Low Cost | ✖ | ✔

Big Data:
As the world becomes more instrumented, the volumes of data available to the enterprise are
growing by orders of magnitude, exceeding the processing capacity of traditional data
warehouses based on relational data stores like Oracle. There are enterprise solutions such as
Teradata, Netezza, Vertica, Exadata and Greenplum, to mention a few, but they are largely sold
at high price points.

Big Data Analytics:
As data sources became more diverse, spanning structured, unstructured (images, documents,
videos) and semistructured (XML, JSON, log files) data, Hadoop gained prominence as a platform
designed to use commodity hardware to build a massively parallel, highly scalable and
cost-effective data processing cluster. Inspired by GFS (Google File System), Hadoop is today
deployed in many organizations to store and analyze large amounts of log data. These low-value
event streams are converted into high-value aggregates for BI (business intelligence)
applications and drive actionable business insights. That is what big data analytics is all about.
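
The step from low-value event streams to high-value aggregates follows the map, shuffle and reduce pattern. The sketch below runs in a single Python process purely to show the shape of the computation; it does not use Hadoop, and the "timestamp user page" log format and the function names are assumptions made for illustration.

```python
from collections import defaultdict

def map_phase(log_lines):
    """Emit (page, 1) for every well-formed 'timestamp user page' line."""
    for line in log_lines:
        parts = line.split()
        if len(parts) == 3:  # skip malformed lines
            yield parts[2], 1

def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Collapse each key's values into one aggregate."""
    return {key: sum(values) for key, values in groups.items()}

def page_views(log_lines):
    """Turn raw log lines into a per-page view count."""
    return reduce_phase(shuffle(map_phase(log_lines)))
```

On a real cluster the map and reduce functions run in parallel across many machines and the shuffle moves data over the network, but the per-page counts a BI dashboard consumes are produced the same way.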

Hadoop Limitations:
Hadoop is best suited for big data batch processing. It is optimized for throughput, not
latency, and it is not a fit for small data. Unlike relational databases, it lacks joins and
doesn’t have a query optimizer.
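
To show what the absence of built-in joins means in practice, here is a hedged sketch of a reduce-side join written by hand in the map, shuffle and reduce style. It runs in plain Python rather than on Hadoop, and the users and orders inputs, the tags and the function name are hypothetical.

```python
from collections import defaultdict

def reduce_side_join(users, orders):
    """Inner-join (user_id, name) records with (user_id, amount) records."""
    # Map: tag every record with its source so the reducer can tell them apart.
    tagged = [(uid, ("user", name)) for uid, name in users]
    tagged += [(uid, ("order", amount)) for uid, amount in orders]

    # Shuffle: bring all records with the same join key together.
    groups = defaultdict(list)
    for key, value in tagged:
        groups[key].append(value)

    # Reduce: within each key, pair every user record with every order record.
    joined = []
    for uid in sorted(groups):
        names = [v for tag, v in groups[uid] if tag == "user"]
        amounts = [v for tag, v in groups[uid] if tag == "order"]
        joined.extend((uid, name, amount)
                      for name in names for amount in amounts)
    return joined
```

What a relational database expresses in one SQL statement, and lets its optimizer plan, becomes explicit tagging and grouping code that the developer writes and tunes.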

Summary:
Cassandra and MongoDB are the fastest growing NoSQL data stores. To overcome some of
the Hadoop limitations, Druid is innovating on low-latency data ingestion and fast
aggregations and could be considered for analytics dashboards. RocksDB, as an embedded
key-value store, saves network latency and provides fast storage. There are hundreds of
NoSQL data stores to choose from, so architects should carefully review their use case and,
most importantly, consider the development cost before making a decision. Many NoSQL options
lack proper instrumentation and documentation, which hinders widespread enterprise adoption.

Relational data stores, whether open source or commercial, are not growing as fast as NoSQL
but will always be needed to store relational metadata and for enterprise applications that
need strong consistency with data integrity and persistence. Despite the deluge of NoSQL
databases being created almost every other month, relational databases still dominate the
industry because of their stability, the availability of expertise (DBAs, of course) and
tons of documentation. Having said that, DBAs too need to adapt and keep honing their skill
sets; otherwise they will end up wondering “Who moved my Cheese?”

Both Relational and NoSQL can coexist in the database ecosystem and the healthy
competition should challenge every vendor and community to push the boundaries of innovation
and meet the ever demanding business requirements.
