1. Data, data, data. I cannot make bricks without clay. Sherlock Holmes, Sherlock Holmes [2009]
2. Data Qualitative or Quantitative attributes of a variable or set of variables Lowest level of abstraction from which information and then knowledge are derived. Representation of a fact, figure and idea.
34. Why? (Continued…) RDBMS drawbacks: Scalability CRUD Performance Write Overhead Limited by single disk architecture Lack of In Memory design Rigid schema design And more …..
37. Scalability True Scalability Horizontal Scaling Transparency to the application No single point of failure Problems with SQL databases Vertical Scaling Partitioning aka Sharding Read Slaves Anti Patterns Normalized Data Joins ACID Transactions
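The partitioning (sharding) idea on this slide can be sketched roughly as follows. This is only an illustration, assuming hash-based routing over four in-memory dicts that stand in for nodes; `shard_for`, `put` and `get` are hypothetical names, not any product's API.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index with a stable hash (Python's hash() is salted per process)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Hypothetical cluster: four nodes, each modelled as a plain dict.
shards = [dict() for _ in range(4)]

def put(key, value):
    shards[shard_for(key, len(shards))][key] = value

def get(key):
    return shards[shard_for(key, len(shards))].get(key)

for i in range(1000):
    put(f"user:{i}", {"id": i})
```

Plain modulo routing moves most keys when the number of shards changes, which is one reason some distributed stores use consistent hashing instead.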
38. No Breadcrumbs CRUD is crude Delete/Update strategy is improper CRA! Create, Read, Archive – way to go ahead Audit information is lost in CRUD but not in the case of CRA
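One way to picture Create, Read, Archive is an append-only store in which a change adds a new version and a "delete" only hides the record. The sketch below is an assumption about how CRA could look, not a description of a specific system; `ArchiveStore` and its methods are hypothetical.

```python
class ArchiveStore:
    """Create, Read, Archive: nothing is overwritten or deleted in place."""

    def __init__(self):
        self._history = {}      # key -> list of versions, oldest first
        self._archived = set()  # keys hidden from normal reads

    def create(self, key, value):
        # A "change" is just a new version appended after the old ones.
        self._history.setdefault(key, []).append(value)
        self._archived.discard(key)

    def read(self, key):
        if key in self._archived or key not in self._history:
            return None
        return self._history[key][-1]

    def archive(self, key):
        # Hide the record from normal reads, but keep every version.
        self._archived.add(key)

    def audit(self, key):
        return list(self._history.get(key, []))

store = ArchiveStore()
store.create("acct:1", {"balance": 100})
store.create("acct:1", {"balance": 80})
latest_before_archive = store.read("acct:1")
store.archive("acct:1")
```

After `archive`, a normal read returns nothing, yet `audit` still shows both versions, which is the audit trail that update and delete would have destroyed.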
39. Naive Data Support Not designed for Complex Data Structures Recursive Hierarchical Ordered List Circular Dynamic Metadata
40. Logical/Physical separation concerns Relational model -> Logical Model RDBMS implement it at physical level Using Multiple indices Artificial overhead in managing the database Frequent drop and create index to make DB perform
41. Spinning Disk Storage Design flaw for most RDBMS systems With cheaper memory, a memory-based approach should also be included in the design Defiance of Moore’s law Disk reads grew only 12.5 times in about 50 years Disk writes grew much less. Disk writes are expensive. RDBMS make things worse by writing more. ACID rains are UNHEALTHY
43. At Snail’s pace RDBMS engine growth – SLOW Optimizations have been minor since the early days Majority of growth due to Moore’s law Faster hardware Slightly faster storage Faster memory What happens when Moore’s law runs out of steam because of physical limits such as heat generation?
44. Database size limits RDBMS are too slow Over multiterabyte and petabyte databases Purpose designed parallel processing would be needed to handle such capacities of data in a RDBMS.
45. RDBMS has been around for years and is a proven technology. What about NoSQL?
46. RDBMS grew fast, but growth slowed down over time and might eventually plateau. NoSQL, admittedly a new and immature tool, has been growing faster than RDBMS ever did and is being supported by the Big Players.
55. LinkedIn – Voldemort, and more. Microsoft is considering NoSQL as well for Azure services, and so is Twitter. Are we next? Major IT companies have implemented, or even better, created their own NoSQL systems to manage huge data stores which couldn’t be managed by SQL databases.
56. We are used to SQL and relatedness, so why can’t they just fix RDBMS to handle Big Data? STORAGE SEEK RATES Large writes and ACID are a huge limitation Big Data can be handled via Scale Out/Partitionability across Multiple Nodes
59. A Deeper look Consistency: The system is in a consistent state after an operation All clients see the same data Strong Consistency(ACID) vs. Eventual (BASE) Availability: ‘Always On’ mode, no downtime All clients can find some available replica Software/hardware upgrade tolerance Partition Tolerance: The system continues to function even when split into disconnected subsets (by a network disruption) Reads and Writes combined
63. Basically Available Soft State Eventually Consistent When Availability and Partitionability are prioritized over Consistency, think in terms of BASE
64. Eventual Consistency If no new updates are made to the object, eventually all accesses will return the last updated value. Ex: Domain Name System (DNS)
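The definition above can be illustrated with a toy simulation. It assumes three replicas, asynchronous delivery and a last-writer-wins rule based on a version number; all names are hypothetical and real systems resolve conflicts in more elaborate ways.

```python
class Replica:
    def __init__(self):
        self.value = None
        self.version = 0

    def apply(self, version, value):
        # Last-writer-wins: ignore updates older than what we already hold.
        if version > self.version:
            self.version, self.value = version, value

replicas = [Replica() for _ in range(3)]
pending = []  # (replica index, version, value) updates not yet delivered

def write(version, value):
    replicas[0].apply(version, value)        # acknowledged by one replica
    for i in range(1, len(replicas)):
        pending.append((i, version, value))  # the others hear about it later

def deliver_all():
    while pending:
        i, version, value = pending.pop()    # delivery order is not guaranteed
        replicas[i].apply(version, value)

write(1, "a")
write(2, "b")
stale = [r.value for r in replicas]       # replicas disagree right after the writes
deliver_all()
converged = [r.value for r in replicas]   # no new updates: all replicas agree
```

Between the writes and `deliver_all()`, a client reading from the second or third replica sees old data; once updates stop and delivery completes, every replica returns the last written value.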
65. Types of Eventual Consistency Read-your-write consistency Session consistency Monotonic read consistency Monotonic write consistency Causal consistency Practically, Read-your-write consistency and monotonic read consistency are desirable in an eventually consistent system
66. Hash() Different Apps – Different CAP requirement Prioritize among Consistency – Availability Availability – Partitionability Consistency - Partitionability
67. WHERE? So will NoSQL eventually replace RDBMSs everywhere? No, RDBMSs are here to stay. NoSQL is here to help.
70. Write Intensive Applications I/OpS of the Best storage device <<< n * I/OpS of relatively cheaper storage devices in simple terms: ‘HARNESS THE POWER OF YOUR CLOUD’
71. Fast Key-Value Access NoSQL – ‘User, you are looking for $value’ RDBMS – ‘Query executing ….’ An O(1) hash operation versus O(log n) B+/B-tree traversals
72. Flexible Schema and Data types ‘I once was an integer, then a string, then a date; what am I?’ - Field RDBMS – ‘WTH! Whatever you are, you are beyond my scope’
73. Transient Data Data – ‘I’m here only for a while and want to get my work done fast’ RDBMS – ‘You are data and you shall be treated like the rest’ NoSQL – ‘Okay, I’ll allot you space in RAM using Memcached if available; otherwise you still have my cloud’
74. High Write Availability Warning - Incoming data …. NoSQL – ‘Anytime you like, user’ RDBMS – ‘This is insane, I’m already busy with other things’
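The difference between the two access paths can be sketched with Python's built-ins. Here a `dict` stands in for a hash index and a sorted list searched with `bisect` stands in for a B-tree; this is an analogy for illustration only, and the function names are hypothetical.

```python
import bisect

records = {f"key{i:05d}": i for i in range(10_000)}

def hash_lookup(key):
    # Hash index: one hash computation, expected O(1).
    return records.get(key)

# Sorted index standing in for a B-tree: binary search, O(log n) comparisons.
sorted_keys = sorted(records)
sorted_vals = [records[k] for k in sorted_keys]

def tree_lookup(key):
    pos = bisect.bisect_left(sorted_keys, key)
    if pos < len(sorted_keys) and sorted_keys[pos] == key:
        return sorted_vals[pos]
    return None

def range_scan(lo, hi):
    # Ordered indexes answer range queries; a plain hash index cannot.
    start = bisect.bisect_left(sorted_keys, lo)
    end = bisect.bisect_right(sorted_keys, hi)
    return sorted_vals[start:end]
```

The trade-off is that the hash path is faster for exact-key access, while the ordered path also supports range scans.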
75. ECONOMICS RDBMS – ‘I’m powered by a wonderful, beautiful rabbit’ NoSQL – ‘I’m powered by many cute little hamsters’
76. No Single Point of Failure Designed to run over Economic Commonly Available Unreliable hardware
77. Full table scan operations MapReduce: Map: Split your problem into independent sub-problems which can be computed in parallel and reduced later Reduce: Merge the sub-problem solutions into the final result Divide and Conquer your way to Victory Powered by MapReduce! Or something similar
79. HOW? Let us welcome Keys, Values, Collections, Data Structures, Objects, Documents, Graphs
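A word count is the usual small example of the map and reduce steps. The sketch below runs in a single process, so it only shows the shape of the computation; a real framework would run the map calls on different machines. The function names and input chunks are illustrative.

```python
from collections import defaultdict

def map_phase(chunk):
    # Each chunk is processed independently, so chunks could run in parallel.
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    # Group all intermediate values by key before reducing.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Merge the partial results for one key into a single answer.
    return key, sum(values)

chunks = ["NoSQL is here to help", "RDBMS is here to stay"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```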
80. NoSQL View The basic approach at data: Key/Value store Run on multiple machines Partitions and Replication across these machines Relax consistency Aim at Eventual Consistency Asynchronous replication But not all NoSQL take the same path.
82. Key-Value Stores One key, one value, no duplicates and crazy fast Distributed hash tables The value is stored as binary object – BLOB The DB doesn’t understand it and doesn’t want to Ex: Amazon Dynamo, MemcacheDB
83. [Figure: keys Key1 to Key4 with their values. Caption: Key/Value store doesn’t know what is in here]
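The "opaque value" idea can be sketched as a store that accepts only bytes and leaves all interpretation to the application. `KeyValueStore` is a hypothetical in-memory stand-in, and the use of JSON for encoding is an assumption made for the example.

```python
import json

class KeyValueStore:
    """Toy key/value store: one value per key, stored as an uninterpreted blob."""

    def __init__(self):
        self._data = {}

    def put(self, key: str, blob: bytes) -> None:
        if not isinstance(blob, bytes):
            raise TypeError("values must be bytes; the store does not interpret them")
        self._data[key] = blob  # a second put on the same key replaces the value

    def get(self, key: str):
        return self._data.get(key)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)

store = KeyValueStore()
# The application, not the store, decides how the value is encoded.
store.put("user:1", json.dumps({"name": "Matt", "age": 32}).encode("utf-8"))
user = json.loads(store.get("user:1").decode("utf-8"))
```

Because the store never looks inside the blob, it cannot answer a question such as "all users aged 32"; only lookups by key are possible.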
84. Document Store Key-value store, but the value is structured and understood by the DB Querying data is possible On not just the key Ex: MongoDB, CouchDB, Riak, etc.
85. Each database has collections Each collection has a set of documents They are well-designed for access through applications Suitable for web applications A few document databases now provide a SQL-like query interface
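What "the value is understood by the DB" buys can be shown with a toy collection that matches documents on any field. This is a sketch in plain Python, not the query API of MongoDB or CouchDB; `Collection`, `find` and the sample documents are hypothetical.

```python
class Collection:
    """Toy in-memory collection; the store can see inside each document."""

    def __init__(self):
        self._docs = []

    def insert(self, doc):
        self._docs.append(dict(doc))  # copy so later edits by the caller do not leak in

    def find(self, **criteria):
        # Match on any field, not just the key.
        return [d for d in self._docs
                if all(d.get(field) == value for field, value in criteria.items())]

users = Collection()
# Documents in one collection need not share the same fields.
users.insert({"_id": 1, "name": "Matt", "age": 32, "city": "Pune"})
users.insert({"_id": 2, "name": "April", "city": "Pune", "phone": "555-0100"})
users.insert({"_id": 3, "name": "Sam", "age": 32})

in_pune = [d["name"] for d in users.find(city="Pune")]
aged_32 = [d["name"] for d in users.find(age=32)]
```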
87. BigTable & its Clones Database, tables, rows, columns and ‘SuperColumn’ Row consists of columns and SuperColumns A few supercolumns can be made mandatory Each supercolumn – arbitrary set of columns Rows are typically versioned by a system-assigned timestamp.
88. Intended for tables with a huge number of columns Millions can be supported very easily ‘a sparse, distributed multi-dimensional sorted map’ Also referred to as Wide Column stores Ex: Google BigTable, Cassandra, HBase, Azure Tables
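The phrase "sparse, multi-dimensional sorted map" can be read as a map from (row, column, timestamp) to a value. The sketch below models that on one machine, leaving out distribution entirely; `SparseTable` and the sample rows are hypothetical.

```python
class SparseTable:
    """Toy wide-column table: (row, column, timestamp) -> value."""

    def __init__(self):
        self._cells = {}  # (row, column) -> list of (timestamp, value), oldest first

    def put(self, row, column, value, timestamp):
        versions = self._cells.setdefault((row, column), [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0])

    def get(self, row, column, timestamp=None):
        versions = self._cells.get((row, column), [])
        if timestamp is not None:
            # Read the table as it looked at the given time.
            versions = [tv for tv in versions if tv[0] <= timestamp]
        return versions[-1][1] if versions else None

    def scan_row(self, row):
        # Only columns that were actually written exist: the table is sparse.
        return {c: v[-1][1] for (r, c), v in sorted(self._cells.items()) if r == row}

table = SparseTable()
table.put("com.example", "contents:html", "<v1>", timestamp=1)
table.put("com.example", "contents:html", "<v2>", timestamp=2)
table.put("com.example", "anchor:home", "Home", timestamp=1)
table.put("org.sample", "contents:html", "<x>", timestamp=1)
```

Each row stores only the columns it uses, so one row may have a handful of columns and another millions without any schema change.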
90. Graph Databases Nodes, Edges, Properties Replace traditional tables, columns, rows A graph database can be implemented in different ways Key/value store, columnar, BigTable clone or even a combination of these Fields are used to directly store the id of another entity, forming the edge
91. Graph database is a multi-relational graph No need for secondary indexes Relationships in RDBMS are ‘weak’ Relationships in Graphs are ‘strong’ The rest don’t really care about relations at db level
92. [Figure: example graph. Labels: Matt, April, Honda, Address, Age: 32, Mobile, SSN, Model, City, registration. Edge labels: Is related to, Spouse, owns, Drives]
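Nodes, edges and properties can be sketched with two dictionaries, where each edge stores the id of the target node. The data below is loosely based on the labels in the example figure; which labels are nodes and which are properties is an assumption, and `Graph` is a hypothetical class.

```python
class Graph:
    """Toy property graph: nodes carry properties, edges carry a label."""

    def __init__(self):
        self.nodes = {}  # node id -> properties
        self.edges = {}  # node id -> list of (label, target node id)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props
        self.edges.setdefault(node_id, [])

    def add_edge(self, source, label, target):
        # The edge is stored as the id of the other entity.
        self.edges[source].append((label, target))

    def neighbours(self, node_id, label=None):
        return [t for (l, t) in self.edges[node_id] if label is None or l == label]

g = Graph()
g.add_node("matt", name="Matt", age=32)
g.add_node("april", name="April")
g.add_node("honda", model="Honda")
g.add_edge("matt", "spouse", "april")
g.add_edge("april", "spouse", "matt")
g.add_edge("matt", "drives", "honda")

# Two-hop question: what does April's spouse drive?
cars = [g.nodes[car]["model"]
        for spouse in g.neighbours("april", "spouse")
        for car in g.neighbours(spouse, "drives")]
```

Following a relationship is a direct lookup from one node to the next, with no join over a separate table.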
94. Too Many Cooks and Recipes No specific recipe! Major implementations: Graph Document store Tabular Key value store Eventually consistent Hierarchical Ordered Other Known Recipes: Multivalue Object Tuple Store
95. The Menu On Disk: BigTable, Membase, Tokyo Cabinet; In RAM: Memcached, Velocity; Eventually Consistent: Cassandra, Dynamo, Riak; Hierarchical: GT.M; Ordered: Berkeley DB, NMDB, C-ISAM; Multivalue: eXe, OpenQM; Document Store: CouchDB, Lotus Notes, MongoDB; Graph: AllegroGraph, Neo4j, DEX; Tabular: BigTable, HBase, Hypertable. The list isn’t even a quarter of the whole
96. _theOpenSourceIssue Most of them are open source Thus fork-able like Linux The first of the lot Google’s BigTable Amazon’s Dynamo All in all, there are about 10 roots with 4 major ones.
100. MongoDB Document Store JSON Storage REST ….. Not out of the box Map/Reduce Master slave replication Strong suite of query APIs Good support for SQL Work in Progress: Autosharding based scalability Failover support Open Source Non Relational Scalable Schemaless Queryable
101. Document Oriented Mongo stores documents in collections Documents are slightly enhanced JSON Objects Complex data structures are very much possible Data Modelling is a more natural process
102. Embeddable Objects Complexity.begin() Embed objects within a single document A document is an enhanced form of object, as mentioned earlier The same thing in an RDBMS can be achieved using multiple tables and joining them together Suppose our requirement is to store a blog post with this information: Post Content, Post Title, Post Author, Comments (Comment order, Comment content, Comment author)
108. Schema-less No database-enforced schema Adding and deleting columns is simple It’s about how the application uses APIs Data definitions need not be specified up front.
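The blog post requirement can be written as a single nested document. The sketch uses a plain Python dict and JSON rather than a MongoDB driver, and the field names and sample text are assumptions for illustration.

```python
import json

post = {
    "title": "Why NoSQL?",
    "author": "alice",
    "content": "RDBMSs are here to stay. NoSQL is here to help.",
    # Comments are embedded; their position in the list is the comment order.
    "comments": [
        {"author": "bob", "content": "What about joins?"},
        {"author": "carol", "content": "Embed the data and you need fewer of them."},
    ],
}

# The whole post, comments included, is stored and fetched as one unit.
serialized = json.dumps(post)
restored = json.loads(serialized)
comment_authors = [c["author"] for c in restored["comments"]]
```

In a relational design the same data would typically live in a posts table and a comments table, with an explicit ordering column and a join to reassemble the post.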
109. Other Features Data Tagging Caching Real Time Analytics Image Storage Dynamic Queries Binary Storage
110. MongoDB - Why Not? Lacks transactions Doesn’t completely support SQL Lacks built-in revisioning system like CouchDB Lacks full text searching features
113. Calm down! Eventually Answered System All your questions will be answered eventually
Editor's Notes
SQL databases approach data in the form of sets and tables. Incidentally, this strength soon becomes their weakness. Assumptions made: Data is represented in the form of tables (rows and columns). Data in each table can be related to data in another. Data can be, and has to be, searchable through all columns. Strengths: Data manipulation through set theory. Relational constraints are enforced by the management system. Weaknesses: Relatedness becomes an overhead once the data grows very large. A large volume of writes in a SQL database puts a heavy burden on the DBMS as well as on the storage disk.
NoSQL is a collection of databases which sidestep the drawbacks of RDBMS without completely giving up on relational models. They are not stringent when it comes to certain core RDBMS concepts like ACID compliance and other integrity constraints. The priority is to support high levels of scalability through easy partitioning across multiple cheap commodity machines, by giving up some of the consistency that SQL databases aim to deliver, along with some of the relatedness of the data.
The CAP theorem states that any shared-data system can achieve at most two of the following three properties at once. Consistency: all database clients see the same data, even with concurrent updates. Availability: all database clients are able to access some version of the data. Partition tolerance: the database can be split over multiple servers and keeps working even when some of them cannot reach each other. References: http://www.julianbrowne.com/article/viewer/brewers-cap-theorem http://devblog.streamy.com/2009/08/24/cap-theorem/ http://www.royans.net/arch/brewers-cap-theorem-on-distributed-systems/