Agenda
● History
● Relational databases
● Horizontal vs vertical scaling
● CAP theorem
● Document databases
● Key-value databases
● Graph databases
● Column-family databases
History
● Non-SQL (not a traditional tabular database)
● Facebook, Google, Amazon, etc. (big data and real-time applications)
● Horizontal scaling is a problem in relational databases
● Not only SQL (SQL-like queries)
Relational Databases :)
● MySQL, Oracle, SQL Server, Postgres, etc.
● The carpenter's hammer (the go-to tool)
● Easy & Popular
● Avoid data duplication, at the cost of complex queries
● Atomicity (transactions)
Relational Databases :(
● Predefined schema; optional attributes become NULLs
● Use joins to aggregate related data
● Large data VOLUME and high READ rates (scalability)
Scaling
source: https://commons.wikimedia.org/wiki/File:They_started_our_car_by_pushing_it_backwards_up_the_hill!_(3854246685).jpg
Scaling
source: http://slashnode.com/the-12-factor-php-app-part-2/
Horizontal (Sharding)
Horizontal (Master-Slave Replication)
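The two horizontal approaches above can be sketched in a few lines. This is a minimal illustration, not any database's actual implementation: the shard names, `shard_for`, and `ReplicatedShard` are hypothetical, and replication is done synchronously only to keep the sketch short.

```python
import hashlib

SHARDS = ["shard-a", "shard-b", "shard-c"]  # hypothetical shard names


def shard_for(key, shards=SHARDS):
    """Sharding: pick a shard deterministically from a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


class ReplicatedShard:
    """Master-slave replication: writes go to the master, reads can use a slave."""

    def __init__(self, n_slaves=1):
        self.master = {}
        self.slaves = [{} for _ in range(n_slaves)]

    def write(self, key, value):
        self.master[key] = value
        for slave in self.slaves:  # copied synchronously here, for simplicity
            slave[key] = value

    def read(self, key, slave_index=0):
        return self.slaves[slave_index].get(key)
```

Sharding spreads different keys over different machines (scaling writes and volume), while replication copies the same keys to several machines (scaling reads).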
CAP Theorem
● Consistency (all nodes see the same data at the same time)
● Availability (every request receives a response, whether success or failure)
● Partition tolerance (the system continues to operate despite network partitions)
Pick only “TWO”
source: http://www.abramsimon.com/
CAP Proof
Eventually Consistent
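A toy model of eventual consistency, assuming one primary and one replica with asynchronous replication. The class and method names are hypothetical; the point is only that a replica read can be stale until replication catches up.

```python
from collections import deque


class EventuallyConsistentStore:
    """Writes land on the primary at once and reach the replica later."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = deque()  # replication log not yet applied to the replica

    def write(self, key, value):
        self.primary[key] = value
        self.pending.append((key, value))

    def read_from_replica(self, key):
        return self.replica.get(key)

    def sync(self):
        """Apply all pending writes to the replica, in order."""
        while self.pending:
            key, value = self.pending.popleft()
            self.replica[key] = value


store = EventuallyConsistentStore()
store.write("x", 1)
stale = store.read_from_replica("x")  # replica has not caught up yet
store.sync()
fresh = store.read_from_replica("x")  # replica now agrees with the primary
```

The store stays available for reads on the replica the whole time; the price is the window in which `stale` does not reflect the latest write.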
SQL Vs NoSQL
Relational Databases                          | NoSQL Databases
Vertical scaling, limited horizontal scaling  | Horizontal scaling
Consistent                                    | Consistent or eventually consistent
Scalable reads                                | Scalable reads and writes
Transactions across multiple tables           | Transactions are difficult to support
No partition tolerance                        | Partition tolerance
Schema/tables                                 | Schemaless
Flexible queries (joins)                      | Limited queries
1) Document Databases
● Simple & popular
● Close to relational database
● MongoDB was a rising star in 2009
● Seven Databases in Seven Weeks
JSON Document Vs Row
● Document Vs Row
● Collection Vs Table
● Nesting instead of joins
● Query inside sub-documents
● Duplicate data to avoid joins
● Schemaless
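The document-versus-row contrast above, sketched with plain Python dicts and lists rather than a real database. The sample data and function names are made up for illustration.

```python
# Relational style: two tables, related through user_id (needs a join).
users = [{"id": 1, "name": "Alice"}]
addresses = [
    {"user_id": 1, "city": "Cairo"},
    {"user_id": 1, "city": "Giza"},
]


def cities_via_join(user_id):
    """Look up a user's cities by matching rows across the two tables."""
    return [a["city"] for a in addresses if a["user_id"] == user_id]


# Document style: the addresses are nested inside the user document.
user_doc = {
    "_id": 1,
    "name": "Alice",
    "addresses": [{"city": "Cairo"}, {"city": "Giza"}],
}


def cities_in_document(doc):
    """No join: everything needed is already inside the document."""
    return [a["city"] for a in doc["addresses"]]


def has_city(doc, city):
    """Query inside a sub-document."""
    return any(a["city"] == city for a in doc["addresses"])
```

Both lookups return the same cities; the document version trades the join for nesting, and may duplicate data if the same address belongs to several users.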
MongoDB CP
● Consistency: master-slave replication (with elections)
● CouchDB is AP
MongoDB Conclusion
● Simple
● Scalable
● Embedded document
● CP
● No joins
● May need to duplicate data
● Writes should go through master node
● Built-in Geo-spatial support
2) Key-Value Databases
● Light & compact
● Hash table (values: text, blob, JSON, image, etc.)
● Reads are fast, writes are faster
Key-Value Databases
● Redis Hash
Redis Complex Data Types
● List
● Blocking List
● Publish-Subscribe
● Set
● Expiry Caching
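To make the expiry-caching idea concrete, here is a small in-process stand-in written in plain Python. It is not the Redis client API; `ExpiringCache` and its injectable `clock` parameter are hypothetical, chosen so the behaviour can be checked without waiting for real time to pass.

```python
import time


class ExpiringCache:
    """Key-value cache where every entry carries a time-to-live."""

    def __init__(self, clock=time.monotonic):
        self._data = {}      # key -> (value, expiry timestamp)
        self._clock = clock  # injectable so tests can control time

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # lazily evict the expired entry
            return None
        return value
```

Usage: `cache.set("session:42", "alice", ttl_seconds=30)` followed by `cache.get("session:42")` returns the value until 30 seconds have elapsed, then `None`.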
Redis in Memory
● In memory: no immediate persistence by default
● Persist periodically by taking snapshots
Redis CP
● Sharding (A,B,C)
● Replication A => A1, B => B1, C => C1
● If master B fails, B1 is promoted to master
● Redis is NOT strongly consistent (e.g. if both A and A1 fail)
● Riak is AP
Redis Conclusion
● Light & Compact
● Key-value
● Complex data types
● Fast in memory
● Dataset should be less than RAM size
● Transforming data, caching, messaging
● CP but not strongly consistent
● Flexible persistence levels
● Rarely used alone
3) Graph Databases
● Directed graph
● Node has properties
● Relation has properties
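A directed property graph can be modelled with two plain Python structures. The node names, relation types, and the `neighbors` helper below are invented for illustration and do not follow any particular graph database's API.

```python
# Nodes carry properties.
nodes = {
    "alice": {"age": 30},
    "bob": {"age": 25},
    "carol": {"age": 41},
}

# Relations are directed (source -> target) and carry properties too.
edges = [
    ("alice", "bob", {"type": "FOLLOWS", "since": 2012}),
    ("alice", "carol", {"type": "WORKS_WITH", "since": 2010}),
    ("carol", "alice", {"type": "FOLLOWS", "since": 2013}),
]


def neighbors(node, edge_type):
    """Targets reachable from `node` over outgoing edges of the given type."""
    return [
        target
        for source, target, props in edges
        if source == node and props["type"] == edge_type
    ]
```

Because edges are directed, `neighbors("alice", "FOLLOWS")` finds bob, while `neighbors("bob", "FOLLOWS")` finds nobody.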
Graph Databases (AP)
● Tens of billions of nodes and edges
● No sharding; the whole graph is replicated
● High availability over consistency
● Elects a gold master, but writes can go to slaves directly
● Community edition is free, but the full version is NOT
4) Column-Family Databases
Row-oriented database:
● Many columns
● Disk seek operations
● Low compression rate
Column-Family Databases
● In an RDBMS, writes are heavy, so each row is stored together in bulk
● In column stores, reads are heavy, so each column is stored together
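The row-versus-column layout can be illustrated in memory with plain Python. The sample data and helper names are hypothetical, and run-length encoding stands in here as one simple compression scheme that benefits from storing similar values next to each other.

```python
from itertools import groupby

# Row layout: each record is stored together (cheap to write one record).
rows = [
    {"id": 1, "country": "EG", "price": 10},
    {"id": 2, "country": "EG", "price": 20},
    {"id": 3, "country": "US", "price": 30},
]


def to_columns(rows):
    """Column layout: all values of one column are stored together."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse runs of equal neighbouring values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)
total_from_rows = sum(row["price"] for row in rows)  # touches every record
total_from_columns = sum(columns["price"])           # touches one column only
encoded_country = run_length_encode(columns["country"])
```

Reading one attribute across many records only needs the `price` list in the column layout, and a column of repeated values such as `country` compresses well.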
HBase
● Database for HDFS (RDBMS vs files)
● Widely used with Hadoop
● Scalability! At least five nodes in production
● Facebook messaging system infrastructure (2010)
HBase Column Family
● Key-value pairs (map of maps)
● Column families must be defined up front, but the columns are schemaless
HBase Versioning
● Versioning
● It becomes a map of map of map (asc, asc, desc)
● Garbage collector for expired data
● Everything is binary
● Compression rate
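The "map of map of map" shape can be sketched as nested Python dicts: row key, then column, then timestamp. This is an in-memory illustration only, not the HBase client API; `VersionedTable` and `max_versions` are hypothetical names, and trimming old versions on write is a simplification of garbage collection.

```python
class VersionedTable:
    """row key -> column -> timestamp -> value, with all keys and values as bytes."""

    def __init__(self, max_versions=3):
        self.rows = {}
        self.max_versions = max_versions

    def put(self, row, column, timestamp, value):
        versions = self.rows.setdefault(row, {}).setdefault(column, {})
        versions[timestamp] = value
        # Drop the oldest versions beyond the limit (stand-in for garbage collection).
        for old in sorted(versions, reverse=True)[self.max_versions:]:
            del versions[old]

    def get(self, row, column):
        """Return the newest version of a cell."""
        versions = self.rows[row][column]
        return versions[max(versions)]

    def scan(self):
        """Yield cells ordered by row asc, column asc, timestamp desc."""
        for row in sorted(self.rows):
            for column in sorted(self.rows[row]):
                for ts in sorted(self.rows[row][column], reverse=True):
                    yield row, column, ts, self.rows[row][column][ts]
```

The descending timestamp order means the newest version of a cell is always the first one a scan meets.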
FB Messaging Index Table
● The row keys are user IDs
● Column qualifiers are words that appear in that user’s messages
● Timestamps are message IDs of messages that contain that word
● Value is offset of word in message
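The index layout described above, sketched with nested dicts. The function names are hypothetical, and two details are assumptions made for the sketch: the offset is counted in words rather than characters, and only the first occurrence of a word in a message is recorded.

```python
def index_message(index, user_id, message_id, text):
    """Add one message to index[user_id][word][message_id] = offset."""
    for offset, word in enumerate(text.lower().split()):
        cell = index.setdefault(user_id, {}).setdefault(word, {})
        cell.setdefault(message_id, offset)  # keep the first occurrence only


def search(index, user_id, word):
    """Message IDs of that user's messages containing the word, newest first."""
    versions = index.get(user_id, {}).get(word.lower(), {})
    return sorted(versions, reverse=True)


index = {}
index_message(index, "u1", 100, "hello world")
index_message(index, "u1", 101, "hello again")
```

Here `search(index, "u1", "hello")` lists message 101 before message 100, mirroring how versions within a cell are ordered newest first.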
HBase Vs Cassandra
● HBase on Hadoop, Cassandra is standalone
● HBase community is more active
● HBase is CP, Cassandra is AP
● Cassandra is more suitable for highly concurrent writes
The right tool for the right job

NoSQL Databases