The NoSQL Landscape
  • Objective – A reasonable understanding of non-relational (NoSQL) data stores and how they relate to the relational databases (RDBMS) we are all used to working with.
About Me
  • Chief Architect – youwho.com
  • Former dot com CTO
  • NoSQL advocate
  • nosqltips.blogspot.com
  • @nosqltips on twitter
Agenda
  • What is NoSQL?
  • Landscape
  • Vocabulary and concepts
  • CAP Theorem
  • SQL vs NoSQL comparison
  • Overview of each type w/ examples
  • Question and Answer
 
 
 
 
 
Vocabulary
  • CAP Theorem – Consistency, Availability, Partition tolerance
  • ACID – Atomic, Consistent, Isolated, Durable
  • BASE – Basically Available, Soft state, Eventually consistent
  • RDF – Resource Description Framework
  • Sharding – Partitioning, distributed
  • Web Scale – Google, Twitter, Facebook, etc.
 
CAP Tuning NRW
  • N: Number of Data Copies
  • R: Read Quorum
  • W: Write Quorum
  • Hard Consistency – RDBMS
  • Soft Consistency – No Guarantees
  • Eventual Consistency – Most NoSQL
CAP Tuning Chart
  NRW | Outcome
  N=3 | Magic Number of Data Replicas
  W=N, R=1 | Read Optimized – Strong Consistency.
  W=1, R=N | Write Optimized – Strong Consistency.
  W+R > N | Strong Consistency on Read and Write.
  W+R <= N | Weak Eventual Consistency. Read may not see the latest Data.
  N > W > 1 | Eventual Consistency – Most NoSQL data stores live here.
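The W+R rule in the chart can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function name `classify_quorum` and the labels it returns are made up for illustration and do not come from any particular data store.

```python
def classify_quorum(n: int, r: int, w: int) -> str:
    """Classify an N/R/W setting using the W+R > N rule from the chart.

    n: number of data copies, r: read quorum, w: write quorum.
    """
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("R and W must each be between 1 and N")
    # If read and write quorums overlap in at least one replica,
    # every read contacts a replica that saw the latest write.
    if r + w > n:
        return "strong"
    return "eventual"


if __name__ == "__main__":
    for r, w in [(1, 3), (3, 1), (2, 2), (1, 2), (1, 1)]:
        print(f"N=3 R={r} W={w}: {classify_quorum(3, r, w)}")
```

With N=3, the settings W=N/R=1, W=1/R=N and W=2/R=2 all overlap, while W=2/R=1 does not, which is the "N > W > 1" row where a read may miss the latest write.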
Eventual Consistency
  • All replicas have same data – eventually
  • Milliseconds to seconds
  • Not all applications are compatible
  • Various ways to ensure latest data: Vector Clocks, Read Repair, Gossiping
  • Application determines correct data
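To make the vector clock idea concrete, here is a minimal sketch in Python. It assumes a clock is a plain dict of node name to counter; the helper names `increment` and `compare` are illustrative and not taken from any specific store.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a copy of the clock with this node's counter bumped by one."""
    new_clock = dict(clock)
    new_clock[node] = new_clock.get(node, 0) + 1
    return new_clock


def compare(a: dict, b: dict) -> str:
    """Relate clock a to clock b: 'before', 'after', 'equal' or 'concurrent'."""
    nodes = set(a) | set(b)
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        # Neither version descends from the other: a conflict that
        # the application (or a merge rule) has to resolve.
        return "concurrent"
    if a_ahead:
        return "after"
    if b_ahead:
        return "before"
    return "equal"


if __name__ == "__main__":
    v1 = increment({}, "node_a")
    v2 = increment(v1, "node_b")   # descends from v1
    v3 = increment(v1, "node_c")   # also descends from v1, unaware of v2
    print(compare(v1, v2), compare(v2, v3))
```

The "concurrent" case is where "application determines correct data" applies: both versions are returned and the caller picks or merges.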
 
Comparison
  SQL | NoSQL
  Prefers big-box, self redundant | Prefers commodity hardware, distributed
  Keep things from breaking | Assume things break or are broken
  Solidly in CA land | Mostly AP, some tunable
  P is difficult and expensive | P generally easy
  Query by SQL | Custom API, SQLish
  Stored procedures | Map/Reduce
Comparison
  SQL | NoSQL
  ACID transactions | BASE transactions
  Advanced indexing | Key only to Advanced
  Foreign key support | Usually none
  Strong lock support | Usually none
  Schema centric | Usually schema-less
  API – usually JPA or JDBC | Depends on implementation
  Strong access control | Usually none
Comparison
  SQL | NoSQL
  Complex disk store, random access | Usually append only, 1 seek, 1 read
  Easy for dev with JPA/Hibernate/SQL | Puts more work on application dev
  Multi-platform | Favors Linux/Unix
  General purpose | More special purpose
  Strong commercial support | Strong to no commercial support
  Great tool support | Not so much
 
Column Stores
  • Data stored by column instead of row
  • Schema-less
  • Non-relational, data is de-normalized
  • Column format stores sparse data efficiently
  • Column families cannot change
  • 10,000+ columns by 100 million+ rows
  • Easy sharding (partitioning)
  • Usually not ACID compliant
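A toy model shows why the column format handles sparse data well: only cells that were actually written take up space. This is an in-memory sketch, not the storage layout of any real column store; the class name `SparseTable` and its methods are hypothetical.

```python
from collections import defaultdict


class SparseTable:
    """Toy column-family table: fixed families, free-form columns per row."""

    def __init__(self, families):
        # Column families are fixed when the table is created.
        self.families = frozenset(families)
        # row_key -> {(family, qualifier): value}; absent cells cost nothing.
        self.rows = defaultdict(dict)

    def put(self, row_key, family, qualifier, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][(family, qualifier)] = value

    def get(self, row_key, family, qualifier, default=None):
        return self.rows.get(row_key, {}).get((family, qualifier), default)


if __name__ == "__main__":
    table = SparseTable(["info"])
    table.put("user1", "info", "email", "user1@example.com")
    table.put("user2", "info", "phone", "555-0100")
    # user1 never stored a phone, so no cell exists for it.
    print(table.get("user1", "info", "phone"))
```

Each row can carry a different set of columns inside a family, which is the schema-less part; the family list itself stays fixed.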
Column stores
  • BigTable – Google, 2006 paper
  • Hadoop/HBase – Part of Apache Hadoop
  • Cassandra – Facebook, LAN/WAN replication
  • Hypertable – Pluggable DFS, HQL
  • Vertica – Full SQL implementation
  • Amazon SimpleDB – Cloud store
Document Stores
  • CAP tunable
  • Either key/value or bucket/key/value
  • Easy/Auto sharding – Consistent hashing
  • Usually ACID compliant
  • Not SQL compliant, maybe custom query
  • Easy implementation via map or custom API
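Consistent hashing is what makes the "easy/auto sharding" bullet work: keys and nodes are hashed onto the same ring, and a key belongs to the next node point clockwise. The sketch below is a generic illustration; the class name `HashRing`, the `vnodes` parameter and the use of SHA-256 are assumptions, and real stores differ in the details.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a large integer position on the ring."""
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes   # virtual points per node, to even out load
        self._ring = []        # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def lookup(self, key: str) -> str:
        if not self._ring:
            raise LookupError("ring is empty")
        # First point at or after the key's position, wrapping around.
        idx = bisect.bisect_right(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


if __name__ == "__main__":
    ring = HashRing(["node1", "node2", "node3"])
    print(ring.lookup("user:42"))
```

The point of the design is that removing a node only moves the keys that node owned; every other key keeps its owner.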
Document stores
  • Amazon – Dynamo and S3 (cloud based)
  • Riak – CAP tunable, built in map/reduce
  • CouchDB – ACID, REST API
  • MongoDB – Indexing, query support
  • Voldemort – Java, pluggable serialization
  • MySQL – Key access, denormalize schema, kill indexes
Memory Stores
  • Mostly in the CA realm
  • P can be tough depending on implementation
  • Some are distributed, some local only
  • Usually key-value stores
  • Many are disk backed, append only files
  • Designed for very high-speed access
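The "disk backed, append only files" pattern can be sketched as an in-memory dict plus a log that is replayed on startup. This is a simplified illustration only; the class name `AppendOnlyStore` and the JSON-lines log format are assumptions, not the file format of any product listed here.

```python
import json
import os


class AppendOnlyStore:
    """In-memory key-value map backed by an append-only log file."""

    def __init__(self, path: str):
        self.path = path
        self.data = {}
        # Recover state by replaying the log; later entries win.
        if os.path.exists(path):
            with open(path, encoding="utf-8") as log:
                for line in log:
                    key, value = json.loads(line)
                    self.data[key] = value

    def set(self, key: str, value) -> None:
        # Writes only ever append, so no seek into the file is needed.
        with open(self.path, "a", encoding="utf-8") as log:
            log.write(json.dumps([key, value]) + "\n")
        self.data[key] = value

    def get(self, key: str, default=None):
        # Reads are served from memory.
        return self.data.get(key, default)
```

Overwritten keys leave stale entries in the log, which is why such files are periodically compacted by rewriting only the latest value per key.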
Memory stores
  • Couchbase – Membase + CouchDB
  • Memcached – Local map
  • Coherence – Commercial Oracle, distributed
  • Redis – Supports hash, list, set, and sorted set, data structure server
  • Tokyo/Kyoto Cabinet – disk backed map
  • Infinispan – JSR-107 JCache impl
  • Scalaris – Erlang, strong consistency
Graph/Triple Store
  • Model relationships well, bi-directional
  • Node/edges – edges can be weighted or not
  • RDF Triple – subject -> predicate -> object, W3C standard for semantic web
  • Many implement SPARQL, object API
  • Sharding can be difficult because of graph nature
  • Schema-less – nodes, edges, properties
  • Fast set operations
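The subject -> predicate -> object model is small enough to sketch directly. The example below stores triples in a set and answers pattern queries where any position may be left open; it is an illustration of the data model, not SPARQL, and the class name `TripleStore` is hypothetical.

```python
class TripleStore:
    """Toy triple store: a set of (subject, predicate, object) tuples."""

    def __init__(self):
        self._triples = set()

    def add(self, subject: str, predicate: str, obj: str) -> None:
        self._triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return sorted triples matching the pattern; None is a wildcard."""
        return sorted(
            t for t in self._triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )


if __name__ == "__main__":
    store = TripleStore()
    store.add("alice", "knows", "bob")
    store.add("bob", "knows", "carol")
    store.add("alice", "worksAt", "acme")
    print(store.match(subject="alice"))   # outgoing edges from alice
    print(store.match(obj="bob"))         # incoming edges to bob
```

Querying by object as easily as by subject is what the "bi-directional" bullet refers to. A real store would index each position rather than scan the whole set.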
Graph/Triple Stores
  • Neo4j – ACID transactions, object API
  • AllegroGraph – Reference impl of SPARQL
  • Bigdata – dynamic sharding
  • Trinity – Microsoft Research
  • Infinite Graph – Distributed, cross-platform
  • FlockDB – Twitter, fast set operations
  • Infogrid – Object based, REST API
Interesting Integrations
  • Lucene – Document Store with Search as Query Language
  • SOLR and Elastic Search – Scalable Lucene
  • Riak Search – Erlang impl of Lucene APIs
  • Solandra – Lucene on Cassandra backend
  • Couchdb-lucene – Integration
  • DistributedLucene – Lucene on Hadoop
  • Neo4j – Full Text Search on Graph Store
Worth Mentioning
  • Configuration DBs – ZooKeeper, Doozer
      Distributed configuration, locks, synchronization
      Used to make other apps scalable
  • XML DBs – eXist, BaseX, Xindice
      XML only, XQuery, XPath, ACID, GUI support
      Non-distributed
 
 
Case Study - HBase
  • Apache – part of Hadoop/HDFS
  • Requires ZooKeeper
  • Java based
  • Runs well on Amazon EC2
  • Excellent language support
  • Supports REST interface
HBase continued
  • Map/Reduce via Hadoop
  • Schema-less, column families fixed
  • Nearly unlimited columns and rows
  • HBQL – partial SQL + JDBC support
  • Some ACID support, atomicity, durability
  • Integration with Hive for data warehousing, ad-hoc query support – HiveQL
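For readers new to Map/Reduce, the three phases (map, group by key, reduce) can be shown in-process. This is a single-machine sketch in plain Python and does not use the Hadoop or HBase APIs; the names `map_reduce`, `count_by_city` and `total` are invented for the example.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper over every record, group emitted values by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        # Map phase: each record may emit any number of (key, value) pairs.
        for key, value in mapper(record):
            groups[key].append(value)   # shuffle: collect values per key
    # Reduce phase: collapse each key's values into one result.
    return {key: reducer(key, values) for key, values in groups.items()}


def count_by_city(row):
    yield row["city"], 1


def total(key, values):
    return sum(values)


if __name__ == "__main__":
    rows = [{"city": "Oslo"}, {"city": "Lima"}, {"city": "Oslo"}]
    print(map_reduce(rows, count_by_city, total))
```

In a distributed setting the map and reduce calls run on the nodes that hold the data, which is why this style suits stores without a general query language.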
Case Study - Riak
  • Data Model – Bucket/Key/Value
  • Value has MIME type, byte[]
  • Value supports one-way Links, basic graph
  • Erlang, Protocol Buffers, REST interfaces
  • Pre/Post Commit Hooks
  • CAP Tunable per bucket
  • Map/Reduce – Erlang and JavaScript
Riak Continued
  • Vector Clocks
  • Read repair for R < N
  • Peer-to-Peer, Nothing Shared Architecture
  • Replication across data centers
  • Pluggable storage
  • API for Most Languages + REST
  • Commercial Support
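Read repair can be sketched as: read from R replicas, pick the newest version, and write it back to any contacted replica that was stale. The sketch below deliberately simplifies by using an integer version number where Riak uses vector clocks, and models replicas as plain dicts; the function name `read_with_repair` is hypothetical.

```python
def read_with_repair(replicas, key, r):
    """Read key from the first r replicas and repair stale copies.

    replicas: list of dicts mapping key -> (version, value).
    Integer versions stand in for vector clocks (a simplification).
    """
    contacted = replicas[:r]
    found = [rep[key] for rep in contacted if key in rep]
    if not found:
        return None
    newest = max(found, key=lambda versioned: versioned[0])
    for rep in contacted:
        # Any contacted replica that is missing or behind gets the newest copy.
        if rep.get(key) != newest:
            rep[key] = newest
    return newest[1]


if __name__ == "__main__":
    replicas = [{"k": (1, "old")}, {"k": (2, "new")}, {}]
    print(read_with_repair(replicas, "k", r=2))
    print(replicas)
```

With R < N only the contacted replicas are repaired on a given read; the remaining copy catches up on a later read or through background synchronization.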
Case Study - Redis
  • Supports hash, list, set, and sorted set
  • Fast set operations
  • Atomic updates
  • Everything stored in memory
  • Persistence to disk – periodic save, append only file, can be compacted
  • Good API support, JDBC subset driver
Redis Continued
  • Master-slave replication: read scalability, redundancy, slave can sync to disk
  • Can swap out values, keys must be in memory
  • Can be used as pub/sub messaging system
  • Can send multiple commands in single request
  • Built to be extremely fast
  • Supports very high speed atomic counters
Case Study - Neo4j
  • Java based – cross platform
  • ACID transactions
  • Durable persistence
  • Handles billions of nodes/edges on a single machine
  • Supports bulk data loading
  • Good language support
Neo4j Continued
  • Spatial index support
  • RDF triples/OWL/SPARQL support
  • Replication and HA – commercial version
  • Object oriented API
  • Sharding at client level
  • Dual open source and commercial license
Resources
  • fallabs.com/tokyocabinet
  • fallabs.com/kyotocabinet
  • redis.io
  • www.membase.org
  • neo4j.org
  • en.wikipedia.org/wiki/Triplestore
  • en.wikipedia.org/wiki/Graph_theory
  • research.microsoft.com/en-us/projects/trinity
  • www.jboss.org/infinispan
  • basho.com
  • nosqlpedia.com/wiki/Consistency_models_in_nonrelational_dbs
  • www.hypertable.org
  • project-voldemort.com
  • www.allthingsdistributed.com/2007/10/amazons_dynamo.html
  • nosql-database.org
  • couchdb.apache.org
  • engineering.twitter.com/2010/05/introducing-flockdb.html
  • infinitegraph.com
  • http://www.w3.org/TR/rdf-concepts/

Editor's Notes

  • #5 NoSQL does not mean no SQL, or that it is against SQL or RDBMS databases. NoSQL is better characterized as non-RDBMS data stores, but even that is not completely true.
  • #6 SQL and NoSQL are very compatible and often used together. SQL usually takes the OLTP role while NoSQL slots in for special purposes.
  • #11 Brewer's Theorem (Inktomi): Consistency, Availability, Partition Tolerance. You can have any 2 but not all 3. C & A in a single-node system; add P and you must choose between C and A.
  • #25 Membase is a distributed (elastic) map; CouchDB is a document store. The companies combined to form Couchbase.
  • #26 RDF = Resource Description Framework
  • #39 RDF – Resource Description Framework. Triplestore – Subject, Predicate, Object; the predicate is the relationship. OWL – Web Ontology Language (semantic web).