Anatomy of NoSQL Databases
Date: 11/10/13
Amit Kumar
Agenda
+Background
+What are NoSQL Databases
+Relational vs NoSQL Databases
+HBase
+Cassandra
+Design Strategies behind NoSQL Databases
Background
+Traditional Applications
Limited Data
Top priority on consistency
Focus on average latency
Ideal fit for an RDBMS
Made good use of intrinsic DB features
A good part of the logic resided in the DB
+Next Gen Applications
Web Scale (~infinite)
ALWAYS available
High performance in ALL cases
Data in the form of key/value pairs
Logic lives in the application layer
RDBMS with Next Gen Apps – Failure
+Scale
Upper limit on how much data can be supported
Sharding is an option, but then RDBMS features are lost
+Economy
Requires large arrays of fast, expensive disks
Very expensive
+Availability is still an issue
NoSQL Databases
+Name is confusing
Not relational (not an RDBMS) at all
"NoREL" databases would be a better name
+Key Value Store
+Extremely scalable
+High performance
+Always available
+Weak Consistency (CAP Theorem)
+Distributed
Uses cheap commodity hardware
+Might not guarantee ACID properties
+Suited to specific use cases only; not a good fit for everything
RDBMS vs NoSQL Databases
+Go for RDBMS when
The system is small, simple and straightforward
You need joins, secondary indexing, referential integrity, group by/order by
+Go for NoSQL when
Data volume must scale
Read/write throughput must scale
Data model is flexible or semi-structured
NoSQL Current Limitations
+Maturity
+Support
+Analytics & Business Intelligence
+Administration
+Ease of Use
Some famous NoSQL Databases
+Open-source
HBase
Cassandra
Voldemort
Dynomite
Hypertable
CouchDB
VPork
MongoDB
Riak
+Closed-source
BigTable
Dynamo
PNUTS
HBase
+Based on Google BigTable
+Sparse distributed persistent multi-dimensional sorted map
+On top of Hadoop HDFS
+Master Slave Model
Single Master (SPOF)
+Especially good when
Objects are huge
Data production/consumption is distributed and is tunneled through map/reduce jobs
+Loose Data Model
Column Families
+Timestamp based versioning
+Not supported on Windows
+Major Users – Adobe, Twitter, Yahoo, Veoh, Streamy, Trend Micro
HBase Architecture & Table Structure
+Partitioned by sorted row-key ranges (rather than by consistent hashing)
+Table made up of regions
Region specified by startkey and endkey
A region may live on a different node.
+Tables sorted by Rows
+Schema defines column families only
Each family consists of any number of columns
Each column consists of any number of versions
Columns within a family are sorted & stored together
+Everything except the table name is a byte[]
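The table structure above can be pictured as nested sorted maps. The following is a minimal Java sketch of that idea, not the HBase implementation: it assumes String keys in place of byte[] for readability, and the names SparseTable, addRegion, put, get and regionFor are hypothetical.

```java
import java.util.Collections;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy model of an HBase-style table: row -> "family:column" -> timestamp -> value. */
public class SparseTable {
    // Rows are kept sorted by key, mirroring "Tables sorted by Rows".
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();
    // Region start key -> node name (hypothetical); the first region starts at "".
    private final NavigableMap<String, String> regionStarts = new TreeMap<>();

    public void addRegion(String startKey, String node) {
        regionStarts.put(startKey, node);
    }

    /** Stores one version of a cell; versions are ordered newest first. */
    public void put(String row, String column, long timestamp, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(column, c -> new TreeMap<Long, String>(Collections.reverseOrder()))
            .put(timestamp, value);
    }

    /** Returns the newest version of a cell, or null if the cell is absent (sparse). */
    public String get(String row, String column) {
        NavigableMap<String, NavigableMap<Long, String>> columns = rows.get(row);
        if (columns == null || !columns.containsKey(column)) {
            return null;
        }
        return columns.get(column).firstEntry().getValue();
    }

    /** Finds the region whose start key is the greatest one not above the row key. */
    public String regionFor(String row) {
        Map.Entry<String, String> region = regionStarts.floorEntry(row);
        return region == null ? null : region.getValue();
    }
}
```

Because regions are defined by key ranges, locating a row is a floor lookup on the sorted start keys, and absent cells cost nothing to store.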
Connecting to HBase
+Java Client API
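// Classic (0.9x-era) client API; newer releases use Connection/Table instead
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;
Configuration config = HBaseConfiguration.create();
HTable table = new HTable(config, "table_name");
// Write one cell: row key, then column family, column qualifier, value
Put p = new Put(Bytes.toBytes("key"));
p.add(Bytes.toBytes("columnfamily"), Bytes.toBytes("column"), Bytes.toBytes("value"));
table.put(p);
// Read the row back by key
Get g = new Get(Bytes.toBytes("key"));
Result r = table.get(g);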
+HBase Shell
$ ${HBASE_HOME}/bin/hbase shell
hbase> describe 'table_name'
hbase> put 'table_name', 'key', 'columnfamily:columnname', 'value'
hbase> get 'table_name', 'key'
hbase> scan 'table_name'
+Thrift Gateway
+REST Gateway
+Many other non-Java clients
Cassandra
+Based on Amazon Dynamo
+Open sourced by Facebook in 2008
+Peer to Peer Model
No Master Node
+Works on Windows as well
+Distributed Key/Value Store
+Configurable parameters for Consistency/Availability
+Especially suited if
Number of objects is huge
Objects are small (<1 MB)
+Major Users: Facebook, Digg, Twitter etc.
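The configurable consistency/availability trade-off is often explained with the quorum rule from the Dynamo paper: with N replicas, a read that waits for R replies and a write that waits for W acknowledgements are guaranteed to overlap in at least one replica when R + W > N. The sketch below only illustrates that arithmetic; the names QuorumCheck, readsSeeLatestWrite and quorum are hypothetical and are not part of Cassandra's API.

```java
public class QuorumCheck {
    /** True when any read set of size r must intersect any write set of size w among n replicas. */
    public static boolean readsSeeLatestWrite(int n, int r, int w) {
        if (n < 1 || r < 1 || w < 1 || r > n || w > n) {
            throw new IllegalArgumentException("need 1 <= r, w <= n");
        }
        return r + w > n;
    }

    /** Size of a majority quorum for n replicas. */
    public static int quorum(int n) {
        return n / 2 + 1;
    }

    public static void main(String[] args) {
        int n = 3;
        // R = W = 1: fastest and most available, but a read may miss the latest write.
        System.out.println(readsSeeLatestWrite(n, 1, 1));               // prints false
        // R = W = majority: read and write sets always share a replica.
        System.out.println(readsSeeLatestWrite(n, quorum(n), quorum(n))); // prints true
    }
}
```

Lowering R or W favours latency and availability; raising them favours consistency.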
NoSQL Databases – Assumptions
+Data size is huge
System must partition its data across multiple nodes
+Reliable
Data must be safe even when disks and nodes fail
System must replicate data
+Performance
Needs to perform well on cheap hardware and maintain low latency ALWAYS
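One common way to meet the partitioning and replication assumptions above is a consistent-hash ring, as used by Dynamo-style stores. The sketch below is a simplified illustration: the class name HashRing and its methods are hypothetical, MD5 is an arbitrary choice of hash, and each node gets a single ring position (production systems typically add virtual nodes to even out load).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class HashRing {
    // Ring position -> node name, kept sorted so we can walk clockwise.
    private final TreeMap<Long, String> ring = new TreeMap<>();

    /** Maps a string to a ring position using the first 8 bytes of its MD5 digest. */
    static long hash(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xffL);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public void addNode(String node) {
        ring.put(hash(node), node);
    }

    public void removeNode(String node) {
        ring.remove(hash(node));
    }

    /** Walks clockwise from the key's position and returns up to `replicas` nodes. */
    public List<String> nodesFor(String key, int replicas) {
        List<String> owners = new ArrayList<>();
        if (ring.isEmpty()) {
            return owners;
        }
        int wanted = Math.min(replicas, ring.size());
        Long start = ring.ceilingKey(hash(key));
        if (start == null) {
            start = ring.firstKey(); // wrap around the ring
        }
        for (String node : ring.tailMap(start, true).values()) {
            if (owners.size() < wanted) owners.add(node);
        }
        for (String node : ring.headMap(start, false).values()) {
            if (owners.size() < wanted) owners.add(node);
        }
        return owners;
    }
}
```

Adding or removing a node only changes ownership for keys adjacent to that node on the ring, which is why the scheme suits clusters whose membership changes.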
NoSQL Databases – Design Strategies
+Complex Distributed System
+Partitioning
Consistent Hashing
+Consistency
Eventual Consistency
Vector Clocks
+Data Models
Primary Key -> Value
Value can be semi-structured
Multi-version Storage
+Storage Layouts
Column storage with locality groups
Log-structured merge trees
+Cluster Management
Peer to Peer vs Master/Slave approach
Gossip
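Of the strategies listed above, vector clocks are the one most easily shown in code: each version of a value carries a per-node counter map, and two versions conflict when neither map dominates the other. The following is a minimal sketch under that definition; the class VectorClock and its methods tick, copy and compare are hypothetical names, not an API from any of the databases mentioned.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class VectorClock {
    public enum Order { BEFORE, AFTER, EQUAL, CONCURRENT }

    // Node name -> number of updates this version has seen from that node.
    private final Map<String, Integer> counters = new HashMap<>();

    /** Records one more update made at the given node. */
    public VectorClock tick(String node) {
        counters.merge(node, 1, Integer::sum);
        return this;
    }

    public VectorClock copy() {
        VectorClock c = new VectorClock();
        c.counters.putAll(counters);
        return c;
    }

    /** Compares this clock with another; CONCURRENT means the versions conflict. */
    public Order compare(VectorClock other) {
        boolean less = false;
        boolean greater = false;
        Set<String> nodes = new HashSet<>(counters.keySet());
        nodes.addAll(other.counters.keySet());
        for (String node : nodes) {
            int mine = counters.getOrDefault(node, 0);
            int theirs = other.counters.getOrDefault(node, 0);
            if (mine < theirs) less = true;
            if (mine > theirs) greater = true;
        }
        if (less && greater) return Order.CONCURRENT;
        if (less) return Order.BEFORE;
        if (greater) return Order.AFTER;
        return Order.EQUAL;
    }
}
```

In an eventually consistent store, a CONCURRENT result is the signal that the application (or a merge rule) has to reconcile two sibling versions.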
References
+Bigtable: A Distributed Storage System for Structured Data
http://labs.google.com/papers/bigtable-osdi06.pdf
+Dynamo: Amazon's Highly Available Key-value Store
http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
+NOSQL debrief, June 2009
http://static.last.fm/johan/nosql-20090611/intro_nosql.pdf
http://static.last.fm/johan/nosql-20090611/hbase_nosql.pdf
http://static.last.fm/johan/nosql-20090611/cassandra_nosql.ppt
+NoSQL Databases Official Site
http://nosql-database.org
+HBase – Hadoop Wiki
http://wiki.apache.org/hadoop/Hbase
+Apache Cassandra – Wikipedia
http://en.wikipedia.org/wiki/Apache_Cassandra
Questions + Answers
Thank You


Editor's Notes

  • #4 DB features like joins, db links, constraints, streams,