Overview of NoSQL... motivation, technologies, should you care?
Overview
● Evolution of/motivation for NoSQL databases
● Characterization of NoSQL databases
● Classification of NoSQL databases
● Popularity/usage of NoSQL systems
A brief history of NoSQL
● Originally coined in 1998 by Strozzi for a specific non-relational database
  ○ easy to use, free, text-based data storage, easy manipulation of contents of db
● Reintroduced by Evans (Rackspace) in 2009 for a conference on open source distributed databases
  ○ in response to increased interest in non-RDBMS solutions
    ■ bringing together Cassandra, Mongo, Couch, etc.
● Has grown as a movement over the last 3 years
Current status
● Significant buzz within community in 2010
  ○ initial development of technology
  ○ pioneer deployments
  ○ lots of meetups/conferences/birds of a feather sessions
● Many key technologies evolved in late 2010 and 2011
  ○ more large deployments for some technologies
  ○ small companies with no legacy basing operations on NoSQL
Current status
● 2012
  ○ buzz/hype is fading
  ○ technology continues to mature
  ○ increased number of deployments
  ○ skills sought in job market
NoSQL - a negative definition
● NoSQL is simply defined by being non-relational
  ○ diverse set of technologies fall into NoSQL camp
● Motivations mixed
  ○ open source
  ○ scale (TB, PB), particularly for read/write latency
  ○ increased flexibility over RDBMS systems
  ○ ability to work with raw data
  ○ ACID not always most appropriate design choice
    ■ analytics data is an excellent example
● Results in many different NoSQL technologies
Typical characteristics
● Don't use SQL!
● Open source
● Intended to deliver performance
  ○ in some dimension
● Typically JOIN not supported
  ○ performance hit
● Consistency often relaxed
  ○ eventual consistency
● More flexibility in schema
  ○ if schema used at all!
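To make "eventual consistency" concrete, here is a minimal sketch with two in-memory replicas. The `Replica` class, the `sync` helper and the last-write-wins rule are assumptions chosen for illustration; real systems use a variety of reconciliation strategies.

```python
class Replica:
    """One copy of the data; stores (value, version) per key. Hypothetical."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, version):
        # Last-write-wins: only accept the entry with the highest version.
        current = self.data.get(key)
        if current is None or version > current[1]:
            self.data[key] = (value, version)

    def read(self, key):
        entry = self.data.get(key)
        return entry[0] if entry else None


def sync(source, target):
    """Copy every entry from source to target (a simple anti-entropy pass)."""
    for key, (value, version) in source.data.items():
        target.write(key, value, version)


a, b = Replica(), Replica()
a.write("user:1", "alice", version=1)

stale = b.read("user:1")   # replica b has not seen the write yet
sync(a, b)
fresh = b.read("user:1")   # after the sync both replicas agree
```

Between the write and the sync, a read from replica `b` returns stale data; this is the window that relaxed consistency accepts in exchange for availability and latency.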
Diversity of NoSQL databases
● 122 separate technologies listed on http://nosql-database.org/
  ○ mix of commercial, open source and some in between
● Vary in many dimensions:
  ○ architecture
  ○ interfaces
    ■ api/languages
  ○ internal data storage
  ○ distribution mechanisms
    ■ redundancy, reliability
  ○ usage - deployments & support community
  ○ maturity
Classification of NoSQL systems
● Column based solutions
● Document store solutions
● Key/Value solutions
● Graph based solutions
● Less significantly:
  ○ XML databases
  ○ Object databases
  ○ Multivalue databases
Column based solutions
● Structured data
  ○ similar to classical tables
● Generally much more flexible
  ○ no rigorous schema necessary
  ○ can typically add columns in ad hoc fashion
    ■ often without explicitly declaring column
● However, can result in very different usage
  ○ e.g. can have millions of columns associated with a given row
● Examples: Hadoop/HBase, Cassandra, Hypertable, SimpleDB
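The ad hoc column idea can be sketched with a toy in-memory table. `ColumnTable` and its methods are hypothetical names for illustration and do not correspond to the API of any of the systems listed above.

```python
from collections import defaultdict


class ColumnTable:
    """Toy wide-column table: row key -> {column name -> value}."""

    def __init__(self):
        self.rows = defaultdict(dict)

    def put(self, row_key, column, value):
        # A column comes into existence on first write; nothing is declared.
        self.rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        return self.rows.get(row_key, {}).get(column, default)

    def columns(self, row_key):
        return sorted(self.rows.get(row_key, {}))


table = ColumnTable()
table.put("user:1", "name", "alice")
table.put("user:1", "email", "alice@example.com")
table.put("user:2", "name", "bob")
table.put("user:2", "last_login", "2012-03-01")  # column only this row has
```

Each row carries its own set of columns, so two rows in the same table need not share a layout, and a row can grow to a very large number of columns.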
Document based solutions
● Less structured data
  ○ DB composed of documents containing arbitrary data
    ■ usually containing longer form content, e.g. CMS
● Documents contain some structure to support query/search/filter, etc.
● Somewhat less emphasis on a key
  ○ can be autogenerated
● Quite unlike classical databases
● Examples: MongoDB, CouchDB
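A minimal sketch of the document model, assuming a hypothetical in-memory `DocumentStore`; it is not the MongoDB or CouchDB API. Keys are autogenerated and queries filter on fields inside the documents.

```python
import uuid


class DocumentStore:
    """Toy document store: autogenerated id -> free-form dict."""

    def __init__(self):
        self.docs = {}

    def insert(self, doc):
        doc_id = uuid.uuid4().hex      # key is autogenerated, not user-chosen
        self.docs[doc_id] = dict(doc)  # store a copy of the document
        return doc_id

    def find(self, **criteria):
        # Return documents whose fields match every given criterion.
        return [
            doc for doc in self.docs.values()
            if all(doc.get(field) == value for field, value in criteria.items())
        ]


store = DocumentStore()
store.insert({"type": "article", "title": "Intro to NoSQL", "body": "..."})
store.insert({"type": "comment", "author": "bob", "text": "Nice overview"})

articles = store.find(type="article")
```

The two documents share no schema beyond the `type` field, yet both can be stored and filtered in the same collection.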
Key/value stores
● DBs inspired by memcache
  ○ simple, fast key/value stores
● Attempt to retain most of DB in memory
  ○ fast response times
● Different designs for scalability
  ○ single node/multi node
● Much emphasis on the keys in this type of DB
● Write usually overwrites entire previous entry
● Examples: Redis, Couchbase/Membase, DynamoDB, Riak
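The overwrite behaviour can be sketched with a dict-backed store. `KeyValueStore` is a hypothetical class for illustration, not the client API of any system named above.

```python
class KeyValueStore:
    """Toy in-memory key/value store."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        # The new value replaces the whole previous entry; nothing is merged.
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        # Return True if the key existed and was removed.
        if key in self._data:
            del self._data[key]
            return True
        return False


kv = KeyValueStore()
kv.set("session:42", {"user": "alice", "cart": ["book"]})
kv.set("session:42", {"user": "alice"})   # the "cart" field is gone now
current = kv.get("session:42")
```

Because the store treats values as opaque, updating one field means reading the entry, changing it in the application and writing the whole value back.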
Graph based solutions
● Obviously different from previous categories
  ○ focus specifically on graphs
● Queries supported are graph-specific
  ○ e.g. get nodes related to specified node
● Typically support for solving standard graph problems
  ○ e.g. shortest path, general graph traversal
● Can deliver very significant performance gains over non-graph-specific solutions
  ○ for graph problems!
● Examples: Neo4j
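As an illustration of the kind of query a graph database answers natively, here is a sketch of shortest path by breadth-first search over a plain adjacency dict. The graph data and function names are made up for the example; a graph database such as Neo4j would expose this through its own query interface.

```python
from collections import deque


def related(graph, node):
    """Nodes directly reachable from the given node."""
    return graph.get(node, [])


def shortest_path(graph, start, goal):
    """Breadth-first search; returns a list of nodes or None if unreachable."""
    if start == goal:
        return [start]
    previous = {start: None}          # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in related(graph, node):
            if nxt in previous:
                continue
            previous[nxt] = node
            if nxt == goal:
                # Walk the predecessor links back to the start.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(nxt)
    return None


graph = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": ["erin"],
    "erin": [],
}
path = shortest_path(graph, "alice", "erin")
```

Expressing the same traversal in a relational database typically needs one self-join per hop, which is where graph-specific storage pays off.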
It's a noisy space...
● Very many candidate technologies
● Relatively small number of real-world solutions
● Difference between the classifications above is one of emphasis...
  ○ column based and document based arrive at semi-structured sweet spot from opposite ends of spectrum
● ...although this results in different preferred use cases...
  ○ document based solution better for document problems, e.g. CMS
Common techniques used
● Hashing techniques used to map data to nodes in cluster
● Internode communication via Gossip
● Common replication techniques
● Thrift is used in a few cases
● MapReduce often used to search over distributed system
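One widely used hashing technique for mapping data to nodes is consistent hashing. The sketch below is an assumption about the general idea, not a description of any particular system; `HashRing`, the virtual-node count and the node names are all made up for the example.

```python
import bisect
import hashlib


def _hash(value):
    """Map a string to a large integer position on the ring."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes=64):
        # Each physical node is placed at several points on the ring.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key):
        # The owner is the first ring point clockwise from the key's hash.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


keys = [f"user:{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {key: ring.node_for(key) for key in keys}

bigger = HashRing(["node-a", "node-b", "node-c", "node-d"])
after = {key: bigger.node_for(key) for key in keys}
moved = [key for key in keys if before[key] != after[key]]
```

When a node is added, only the keys that now belong to the new node change owner; with a plain `hash(key) % number_of_nodes` scheme most keys would move.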
Horses for courses...
● SQL is a perfectly good solution for many problems
  ○ tried and tested
● Some problems require an alternative solution
  ○ typically driven by scale and/or flexibility
● NoSQL offers (many) alternatives
  ○ although relatively easy to identify realistic options
● Column based approaches good for mostly structured data with enhanced flexibility
● Document based approaches good for document-oriented problems
...so let's dive into one NoSQL database...
● Cassandra...