Data, data, data. I cannot make bricks without clay. Sherlock Holmes, Sherlock Holmes 
Data: Qualitative or quantitative attributes of a variable or set of variables. The lowest level of abstraction, from which information and then knowledge are derived. A representation of a fact, figure or idea.
A well organized newspaper or a clumsy, cluttered one?
Data explosion From Gigabytes to Terabytes to Petabytes to perhaps (I’m out of nomenclature)-bytes
What? (Continued…) Partly or completely independent of RDBMS concepts No specific implementation Breakthrough Approaches Key: Non-relational approach Non-ACIDness A STEP BACKWARDS, THEN MANY STEPS FORWARD
NoSQL, the ‘screwdriver’ Yet another tool in our repository to go along with the hammer
NoSQL is about choice Not all problems are nails. Not all screws are same. GOOD PROGRAMMING PRACTICE: Know your tools and use them appropriately
SQL Databases Data Relational Tabular – Rows/Columns Interface Sql Basic Design Inspiration Set Theory ACID Design Scale Up Design
Scalability
True Scalability: Horizontal scaling; transparency to the application; no single point of failure
Problems with SQL databases: Vertical scaling; partitioning (aka sharding); read slaves
Anti-patterns: Normalized data; joins; ACID transactions
No Breadcrumbs: CRUD is crude. Its delete/update strategy destroys history. CRA! Create, Read, Archive is the way ahead. Audit information is lost in CRUD but not in CRA
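One way to picture Create, Read, Archive is an append-only store in which every write adds a version and a retirement is recorded as a marker rather than a delete. The sketch below is only an illustration under that assumption; `ArchiveStore` and its method names are hypothetical and do not come from any particular product.

```python
import time
from collections import defaultdict


class ArchiveStore:
    """Append-only store: writes add versions, nothing is overwritten."""

    def __init__(self):
        # key -> list of version records, oldest first
        self._versions = defaultdict(list)

    def create(self, key, value):
        """Record a new version of `key` instead of updating in place."""
        self._versions[key].append(
            {"at": time.time(), "value": value, "archived": False}
        )

    def read(self, key):
        """Return the latest value, or None if the key is unknown or archived."""
        versions = self._versions.get(key, [])
        if not versions or versions[-1]["archived"]:
            return None
        return versions[-1]["value"]

    def archive(self, key):
        """Retire the key by appending an archive marker; history is kept."""
        self._versions[key].append(
            {"at": time.time(), "value": None, "archived": True}
        )

    def history(self, key):
        """Every value the key has ever held, oldest first (the audit trail)."""
        return [v["value"] for v in self._versions.get(key, []) if not v["archived"]]


store = ArchiveStore()
store.create("user:1", {"email": "old@example.com"})
store.create("user:1", {"email": "new@example.com"})
store.archive("user:1")
```

After the archive call, `store.read("user:1")` returns nothing, yet `store.history("user:1")` still lists both earlier values, which is the audit information an in-place update or delete would have discarded.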
Naive Data Support: Not designed for complex data structures: recursive, hierarchical, ordered list, circular, dynamic metadata
Logical/Physical separation concerns: The relational model is a logical model, but RDBMS implement it at the physical level using multiple indices. This creates artificial overhead in managing the database, such as frequently dropping and re-creating indexes to make the DB perform
Spinning Disk Storage: A design flaw for most RDBMS systems. With cheaper memory, a memory-based approach should also be included in the design. Defiance of Moore’s law: disk reads grew only 12.5 times in about 50 years, and disk writes grew much less. Disk writes are expensive, and RDBMS make things worse by writing more. ACID rains are UNHEALTHY
At Snail’s pace: RDBMS engine growth has been SLOW. Optimizations have been minor since the initial days. The majority of growth is due to Moore’s law: faster hardware, slightly faster storage, faster memory. What happens when Moore’s law tails off because of external factors such as the heat generated?
Database size limits: RDBMS are too slow over multi-terabyte and petabyte databases. Purpose-designed parallel processing would be needed to handle such volumes of data in an RDBMS.
RDBMS has been around for years and is a proven technology. What about NoSQL?
RDBMS grew fast, but its growth slowed down over time and might eventually stagnate. NoSQL, unarguably a new and immature tool, has been growing faster than RDBMS ever did and is being supported by the Big Players
...and more. Microsoft is considering NoSQL for Azure services, and so is Twitter. Are we next? Major IT companies have implemented or, even better, created their own NoSQL systems to manage huge data stores which couldn’t be managed by SQL databases.
We are used to SQL and relatedness, so why can’t they just fix RDBMS to handle Big Data? STORAGE SEEK RATES: large writes and ACID are a huge limitation. Big Data can be handled via Scale Out/Partitionability across multiple nodes
CAP Theorem Applies to distributed shared data system
A Deeper look
Consistency: The system is in a consistent state after an operation. All clients see the same data. Strong Consistency (ACID) vs. Eventual (BASE).
Availability: ‘Always On’ mode, no downtime. All clients can find some available replica. Software/hardware upgrade tolerance.
Partition Tolerance: The system continues to function even when split into disconnected subsets (by a network disruption). Reads and writes combined.
System is still available under partitioning but some of the data returned may be inaccurate
Atomicity: All of the operations in the transaction will complete, or none will.
Consistency: The database will be in a consistent state when the transaction begins and ends.
Isolation: The transaction will behave as if it is the only operation being performed upon the database.
Durability: Upon completion of the transaction, the operation will not be reversed.
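The all-or-nothing property can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch: the `accounts` table, the names and the `transfer` helper are assumptions made up for illustration. A connection used as a context manager commits when the block succeeds and rolls back when an exception escapes it.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates apply or neither does."""
    try:
        with conn:  # commit on success, roll back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                # Raising here undoes both UPDATE statements above.
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False


def balances(conn):
    """Current balance of every account as a dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


# Hypothetical sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()
```

A transfer that would overdraw the source account returns `False` and leaves both balances exactly as they were, because the partial debit and credit are rolled back together.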
Basically Available Soft State Eventually Consistent When Availability and Partitionability are prioritized over Consistency, think in terms of BASE
Eventual Consistency If no new updates are made to the object, eventually all accesses will return the last updated value. Ex: Domain Name System (DNS)
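The definition can be simulated in a few lines. In this sketch (all class names hypothetical) a write lands on one replica immediately and is queued for the others; until the queue drains, different replicas may answer differently, and once no new updates arrive and the queue is delivered they all return the last written value. Conflict handling is assumed to be a simple last-writer-wins on a counter, which real systems often replace with something more elaborate.

```python
from collections import deque


class Replica:
    def __init__(self):
        self.data = {}  # key -> (version, value)

    def apply(self, key, version, value):
        # Last-writer-wins: keep the update only if it is newer than what we hold.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


class EventuallyConsistentStore:
    def __init__(self, replica_count=3):
        self.replicas = [Replica() for _ in range(replica_count)]
        self.pending = deque()  # updates not yet delivered to a replica
        self.clock = 0

    def write(self, key, value):
        """Apply to replica 0 right away; queue the update for the rest."""
        self.clock += 1
        self.replicas[0].apply(key, self.clock, value)
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, self.clock, value))

    def read(self, key, replica_index):
        return self.replicas[replica_index].read(key)

    def propagate(self):
        """Deliver every queued update (the 'eventually' part)."""
        while self.pending:
            index, key, version, value = self.pending.popleft()
            self.replicas[index].apply(key, version, value)
```

Reading from replica 2 straight after a write returns nothing, while replica 0 already has the value; after `propagate()` every replica agrees.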
Types of Eventual Consistency Read-your-write consistency Session consistency Monotonic read consistency Monotonic write consistency Causal consistency Practically, Read-your-write consistency and monotonic read consistency are desirable in an eventually consistent system
Hash() Different Apps – Different CAP requirement Prioritize among Consistency – Availability Availability – Partitionability Consistency - Partitionability
WHERE? So will NoSQL eventually replace RDBMSs everywhere? No, RDBMS are here to stay. NoSQL is here to help.
Big Data Denormalize Shard Scale Out And look no further than NoSQL
Write Intensive Applications: IOPS of the best storage device <<< n * IOPS of relatively cheaper storage devices. In simple terms: ‘HARNESS THE POWER OF YOUR CLOUD’
Fast Key-Value Access NoSQL – ‘User, you are looking for $value’ RDBMS – ‘Query executing ….’ An O(1) hash operation versus O(log n) B/B+ tree traversals
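The difference in lookup cost can be sketched with a Python `dict` (a hash table, average O(1) per lookup) against a binary search over sorted keys, which stands in here for an O(log n) tree traversal. It is not an actual B-tree, and the key format is a made-up example.

```python
import bisect


def hash_lookup(table, key):
    """Average O(1): the key's hash leads straight to its slot."""
    return table.get(key)


def sorted_lookup(keys, values, key):
    """O(log n): binary search over sorted keys, similar in spirit to
    descending an ordered index."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return values[i]
    return None


# Zero-padded keys keep the list in sorted order.
keys = [f"user:{n:06d}" for n in range(100_000)]
values = [n * 2 for n in range(100_000)]
table = dict(zip(keys, values))
```

Both functions return the same answer for any key; the hash lookup does a roughly constant amount of work, while the binary search needs about 17 comparisons for 100,000 keys and grows with the logarithm of the data size.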
Flexible Schema and Data types ‘I once was an integer, then a string, then a date; what am I?’ - Field. RDBMS – ‘WTH! Whatever you are, you are beyond my scope’
Transient Data Data – ‘I’m here only for a while and want to get my work done fast’ RDBMS – ‘You are data and you shall be treated like the rest’ NoSQL – ‘Okay, I’ll allot you space in RAM using Memcached if available; otherwise you still have my cloud’
High Write Availability Warning - Incoming data …. NoSQL – ‘Anytime you like, user’ RDBMS – ‘This is insane, I’m already busy with other things’
ECONOMICS RDBMS – ‘I’m powered by a wonderful, beautiful rabbit’ NoSQL – ‘I’m powered by many cute little hamsters’
No Single Point of Failure Designed to run over Economic Commonly Available Unreliable hardware
Full table scan operations MapReduce: Map: split your problem into independent sub-problems which can be computed in parallel and reduced later. Reduce: merge the sub-problem solutions into the result. Divide and Conquer your way to Victory. Powered by MapReduce! Or something similar
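The two phases can be sketched with the classic word count. This is a single-process illustration, not a distributed framework: a thread pool stands in for the cluster, and the function names `map_phase`, `shuffle` and `reduce_phase` are chosen for the example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group the intermediate pairs by key so each key is reduced once."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            groups[word].append(count)
    return groups


def reduce_phase(word, counts):
    """Reduce: merge the partial results for one key."""
    return word, sum(counts)


def word_count(chunks):
    # Each chunk is independent, so the map calls can run in parallel.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_phase, chunks))
    grouped = shuffle(mapped)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())
```

Calling `word_count(["to be or", "not to be"])` maps each chunk separately and then merges the per-word counts, so `to` and `be` each end up with a count of 2.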
Ability to restore, maintain, repair itself No DBA required Design
HOW? Let us welcome Keys, Values, Collections, Data Structures, Objects, Documents Graphs
NoSQL View The basic approach at data: Key/Value store Run on multiple machines Partitions and Replication across these machines Relax consistency Aim at Eventual Consistency Asynchronous replication But not all NoSQL take the same path.
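Partitioning and replication across machines can be sketched as follows. The key's hash picks a starting node and the value is copied to the next nodes in the list. Everything here is an assumption made for illustration: the `Cluster` class, the node names, and the simple modulo placement (production systems tend to use consistent hashing so that adding a node moves fewer keys). For brevity the copies are written synchronously, whereas the slide describes asynchronous replication.

```python
import hashlib


def owner_nodes(key, nodes, replicas=2):
    """Pick `replicas` consecutive nodes, starting from the key's hash."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    start = int(digest, 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]


class Cluster:
    def __init__(self, node_names, replicas=2):
        self.nodes = list(node_names)
        self.replicas = replicas
        self.storage = {name: {} for name in self.nodes}  # one dict per machine

    def put(self, key, value):
        # Write the value to every replica responsible for this key.
        for name in owner_nodes(key, self.nodes, self.replicas):
            self.storage[name][key] = value

    def get(self, key, down=()):
        """Read from the first responsible node that is not in `down`."""
        for name in owner_nodes(key, self.nodes, self.replicas):
            if name not in down:
                return self.storage[name].get(key)
        raise RuntimeError("no replica available for this key")


cluster = Cluster(["node-a", "node-b", "node-c"], replicas=2)
cluster.put("user:1", "Ada")
```

With two copies of each key, a read still succeeds when the first responsible node is passed in `down`, which is the practical meaning of having no single point of failure.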
Document Store Key-Value Store Object NoSQL Multivalue Graph Stores BigTable Clones Tuple Store
Key-Value Stores One key, one value, no duplicates and crazy fast Distributed hash tables The value is stored as binary object – BLOB The DB doesn’t understand it and doesn’t want to Ex: Amazon Dynamo, MemcacheDB
[Figure: keys Key1 to Key4, each pointing to an opaque value. The key/value store doesn’t know what is in here.]
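A toy version of this contract: the store accepts only bytes and never looks inside them, so encoding and decoding are entirely the application's job. `KeyValueStore` is a hypothetical in-memory stand-in, not the API of Dynamo or MemcacheDB.

```python
import json


class KeyValueStore:
    """One key, one opaque value. The store never inspects the bytes."""

    def __init__(self):
        self._blobs = {}  # key -> bytes

    def put(self, key, blob):
        if not isinstance(blob, bytes):
            raise TypeError("value must be bytes; serialize it first")
        self._blobs[key] = blob  # a second put on the same key overwrites

    def get(self, key):
        return self._blobs.get(key)

    def delete(self, key):
        self._blobs.pop(key, None)


kv = KeyValueStore()
# The application chooses the encoding (JSON here); the store sees only bytes.
kv.put("session:42", json.dumps({"user": "matt", "cart": [1, 2]}).encode("utf-8"))
session = json.loads(kv.get("session:42").decode("utf-8"))
```

Because the value is a BLOB to the store, there is no way to ask it for "all sessions belonging to matt"; the only access path is the key.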
Document Store: A key-value store, but the value is structured and understood by the DB. Querying data is possible, and not just on the key. Ex: MongoDB, CouchDB, Riak, etc.
Each database has collections. Each collection has a set of documents. They are well designed for access through applications and suitable for web applications. A few document databases now provide a SQL-like query interface
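The step from a key-value store to a document store is that the database understands the value and can therefore query on its fields. The in-memory `Collection` below is a hypothetical sketch of that idea with exact-match queries only; it is not the MongoDB or CouchDB API.

```python
class Collection:
    """A set of documents (dicts) that can be queried by field, not just by key."""

    def __init__(self):
        self._docs = {}
        self._next_id = 1

    def insert(self, document):
        doc_id = self._next_id
        self._next_id += 1
        # Store a copy with its generated key attached.
        self._docs[doc_id] = dict(document, _id=doc_id)
        return doc_id

    def get(self, doc_id):
        """Key-based access, as in a plain key-value store."""
        return self._docs.get(doc_id)

    def find(self, **criteria):
        """Field-based access: every document whose fields equal the criteria."""
        return [
            doc
            for doc in self._docs.values()
            if all(doc.get(field) == value for field, value in criteria.items())
        ]


posts = Collection()
posts.insert({"title": "Hello", "author": "matt", "tags": ["intro"]})
posts.insert({"title": "CAP notes", "author": "april"})
posts.insert({"title": "More CAP", "author": "matt"})
```

`posts.find(author="matt")` returns the two documents written by matt even though they have different sets of fields, which is what the flexible schema allows.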
BigTable & its Clones: Database, tables, rows, columns and ‘SuperColumn’. A row consists of columns and SuperColumns. A few SuperColumns can be made mandatory. Each SuperColumn holds an arbitrary set of columns. Rows are typically versioned by a system-assigned timestamp.
Intended for tables with a huge number of columns; millions can be supported very easily. ‘a sparse, distributed multi-dimensional sorted map’. Also referred to as Wide Column stores. Ex: Google BigTable, Cassandra, HBase, Voldemort, Azure Tables
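The quoted description, a sparse multi-dimensional sorted map, can be sketched as a map from (row, column) to a list of timestamped versions. Only cells that were actually written take up space. `WideColumnTable` is a hypothetical single-machine illustration; the distribution across servers is left out, and a simple counter stands in for the system-assigned timestamp.

```python
import itertools


class WideColumnTable:
    """Sparse map: (row, column) -> versions ordered by timestamp."""

    def __init__(self):
        self._cells = {}  # (row, column) -> list of (timestamp, value), ascending
        self._clock = itertools.count(1)  # stand-in for a system timestamp

    def put(self, row, column, value):
        timestamp = next(self._clock)
        self._cells.setdefault((row, column), []).append((timestamp, value))
        return timestamp

    def get(self, row, column, timestamp=None):
        """Latest value, or the latest value at or before `timestamp`."""
        versions = self._cells.get((row, column), [])
        for ts, value in reversed(versions):
            if timestamp is None or ts <= timestamp:
                return value
        return None

    def row(self, row):
        """Latest value of each column present in the row, in column order."""
        return {
            column: versions[-1][1]
            for (r, column), versions in sorted(self._cells.items())
            if r == row
        }


table = WideColumnTable()
first = table.put("row1", "info:name", "Ada")
table.put("row1", "info:name", "Ada L.")
table.put("row1", "contact:email", "ada@example.com")
table.put("row2", "info:city", "London")
```

Here `row1` and `row2` have different columns and no empty cells are stored for the ones they lack, while `table.get("row1", "info:name", timestamp=first)` still returns the older version of that cell.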
Graph Databases: Nodes, Edges and Properties replace traditional tables, columns and rows. A graph database can be implemented in different ways: key/value store, columnar, BigTable clone or even a combination of these. Fields are used to directly store the id of another entity, forming the edge
Graph database is a multi-relational graph No need for secondary indexes Relationships in RDBMS are ‘weak’ Relationships in Graphs are ‘strong’ The rest don’t really care about relations at db level
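[Figure: example property graph. Nodes and properties include Matt (Age: 32), April, Honda, Address, Mobile, SSN, Model, City and registration; edge labels include ‘Is related to’, ‘Spouse’, ‘owns’ and ‘Drives’.]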
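A property graph along the lines of the example can be sketched with nodes that carry properties and labelled edges that store the id of the target node directly. The `PropertyGraph` class is hypothetical, and the exact edges below (who is spouse of whom, who owns or drives the car) are assumptions, since the original figure's layout is not recoverable from the text.

```python
class PropertyGraph:
    def __init__(self):
        self.nodes = {}      # node id -> properties
        self.out_edges = {}  # node id -> list of (label, target id, properties)

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties
        self.out_edges.setdefault(node_id, [])

    def add_edge(self, source, label, target, **properties):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        # The edge holds the target's id directly: no join table, no index lookup.
        self.out_edges[source].append((label, target, properties))

    def neighbours(self, source, label=None):
        """Ids reachable from `source`, optionally restricted to one edge label."""
        return [
            target
            for edge_label, target, _ in self.out_edges.get(source, [])
            if label is None or edge_label == label
        ]


graph = PropertyGraph()
graph.add_node("matt", name="Matt", age=32)
graph.add_node("april", name="April")
graph.add_node("honda", make="Honda")
graph.add_edge("matt", "spouse", "april")
graph.add_edge("april", "spouse", "matt")
graph.add_edge("matt", "drives", "honda")
graph.add_edge("april", "owns", "honda")
```

A question such as "what does Matt's spouse own?" becomes two hops along stored edges: `graph.neighbours("matt", "spouse")` followed by `graph.neighbours("april", "owns")`.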
[Figure: Size versus Complexity, positioning Key-Value Store, Document Store, BigTable Clone and Graph Databases.]
Too Many Cooks and Recipes No specific recipe! Major implementations: Graph Document store Tabular Key value store Eventually consistent Hierarchical Ordered Other Known Recipes: Multivalue Object Tuple Store
The Menu
On Disk: BigTable, Membase, Tokyo Cabinet
In RAM: Memcached, Velocity
Eventually Consistent: Cassandra, Dynamo, Riak
Hierarchical: GT.M
Ordered: Berkeley DB, NMDB, C-ISAM
Multivalue: eXe, OpenQM
Document Store: CouchDB, Lotus Notes, MongoDB
Graph: AllegroGraph, Neo4j, DEX
Tabular: BigTable, HBase, HyperTable
The list isn’t even a quarter of the whole
_theOpenSourceIssue Most of them are open source, and thus forkable like Linux. The first of the lot: Google’s BigTable, Amazon’s Dynamo. All in all, there are about 10 roots with 4 major ones.
MongoDB
Open Source, Non Relational, Scalable, Schemaless, Queryable
Document Store with JSON Storage
REST: not out of the box
Map/Reduce
Master-slave replication
Strong suite of query APIs
Good support for SQL
Work in Progress: Autosharding-based scalability, Failover support
Document Oriented: Mongo stores documents in collections. Documents are slightly enhanced JSON objects. Complex data structures are very much possible. Data modelling is a more natural process
Embeddable Objects Complexity.begin() Embed objects within a single document. A document is an enhanced form of object, as mentioned earlier. The same thing can be achieved in an RDBMS using multiple tables and joining them together. Consider a requirement to store a blog post with this information: Post (content, title, author) and Comments (order, content, author)
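One possible shape for such a document, written as a plain Python dict and serialized with the standard `json` module, is sketched below. The field names and sample values are assumptions for illustration. Comments are embedded as a list inside the post, so their position in the list carries the comment order and no separate table or join is needed.

```python
import json

# A single document holds the post and all of its comments.
post = {
    "title": "Hello NoSQL",
    "author": "matt",
    "content": "First post.",
    "comments": [  # list position is the comment order
        {"author": "april", "content": "Nice post."},
        {"author": "matt", "content": "Thanks."},
    ],
}


def add_comment(document, author, content):
    """Append a comment to the embedded list; it becomes the last in order."""
    document["comments"].append({"author": author, "content": content})


# Round trip through JSON, as a document store would persist and return it.
serialized = json.dumps(post)
restored = json.loads(serialized)
```

In a relational design the same data would typically live in a posts table and a comments table with a foreign key and an explicit ordering column, joined at read time; here one read returns the whole post.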