RDBMS to NoSQL. An overview.



A perspective on the rise of NoSQL systems and a comparison between RDBMS and NoSQL technologies.

The idea for this presentation originated while trying to understand the different alternatives available for managing data when building a fast, highly scalable, available, and reliable enterprise application.

Published in: Technology

  1. (R)Evolution in Database Systems
  2. - RDBMS – The Origins: Concepts, Architecture, and Principles
     - Golden Age – A Way of Life
     - Changing Times – New Problems, New Needs
     - Attack on the Citadel – Revisiting the Norms
     - Ignited Minds – Working Towards NoSQL Solutions
     - Way Ahead – It Is Cloudy Out There
  3. - Girish Narasimha Raghavan
     - Over 15 years of experience building distributed, large-scale, and highly available enterprise systems.
     - Current interests include building SAC (Social, Big Data Analytics, and Cloud) solutions.
     - Likes to write and discuss technologies and their applications to solving real-world problems.
     - http://randomtechthought.blogspot.com
  4. - In the world, data abounds. Always has and always will.
     - Record keeping is as old as the human race.
     - A consistent quest to improve the storing, accessing, and analyzing of records.
     - The early machines had serious shortcomings:
       - Only a very limited amount of program code and data could be stored in memory.
       - Electromagnetic data storage was feasible only at an extremely high cost.
     - Storing data was an issue:
       - Organizations had to store data related to administration, research, and operations.
       - Data was stored in proprietary formats – database systems did not exist.
       - Systems were plagued by data-integrity issues.
       - Non-standard application logic was needed for accessing stored data.
  5. - First attempt: file-based systems
       - Data sets were growing and accumulating.
       - Data had to be managed at a detailed transaction level.
       - Computing systems started to be used for critical business needs.
       - Data inconsistency and redundancy were common.
     - Enter database systems
       - Attempts to standardize the processes and rules to store and access data.
       - Intention to reuse, resell, and redeploy solutions across organizations (with significant customizations).
       - Attempts to proactively manage data integrity and quality.
  6. - Database systems and concepts evolve
       - Hierarchical DBMS – information represented using parent/child relationships; the tree is the primary data structure.
       - Network DBMS – relationships are represented in the form of a network; the graph is the primary data structure.
     - Challenges galore
       - Hardware dependency – software strongly dependent on the underlying hardware.
       - Modeling challenges – representing data under a common structure.
       - Integration issues – integrating across dependent packages was a nightmare.
       - Introducing new functionality and updates – solution providers struggled with this across customized software deployments.
  7. Father of the relational database model: Edgar F. Codd, a British computer scientist who made significant contributions to the theory of relational databases while working for IBM.
  8. - Landmark paper by Codd: "A Relational Model of Data for Large Shared Data Banks".
       - Independence of data from the hardware and storage implementation.
       - Automatic navigation to the data set through a high-level, non-procedural language for data access.
       - The concept of keys (primary, secondary).
       - A theoretical proposal; no practical design or implementation.
     - Codd's 12 rules for relational management systems:
       http://cims.clayton.edu/booth/ITDB%204201/Codd%20PDF.pdf
  9. [Diagram: multiple applications (1, 2, 3 – reporting solutions, future applications) access data storage through database management systems (DBMS).]
  10. - Data definition
        - Describing data and the data structures for handling it.
      - Data manipulation
        - Describing the operations associated with the data, such as storage, query, and change.
      - Data security and integrity
        - Ensuring secure and controlled access to the storage and manipulation of data.
        - Ensuring correctness, consistency, and reliability of the stored data.
      - Data recovery and concurrency
        - Providing and enforcing recovery and concurrency controls.
      - Data dictionary
        - Providing information about the data stored.
        - Liaising between the conceptual and the physical storage.
      - Performance
        - Ensuring all the above operations are performed efficiently and effectively.
  11. - External/User – how the user accesses and sees the data [tables, views]
      - Conceptual/Logical – how data is organized logically [table spaces]
      - Physical/Internal – how data is stored internally [data files]
  12. - Relation (table) – a set of tuples that have the same attributes.
      - Tuple (row) – usually represents an object and information about that object.
      - Attribute (column) – represents a particular characteristic of that object.
      - Domain – describes the set of permitted values for a given attribute; the set from which an attribute's values can be drawn.
      - Constraint – further restricts the domain of an attribute, binding it to a set of rules.
      - Primary key – a (set of) attribute(s) that uniquely identifies a tuple within a relation.
      - Foreign key – can be used to cross-reference tables.
      - Cardinality – expresses the number of instances of one entity that can be associated with another entity via a relation.
      - Index – a mechanism for providing quicker access to data; indices can be created on any combination of attributes of a relation.
  13. - Based on the perception that the real world can be modeled around base objects (entities) and the relationships among them.
      - Data is modeled in a top-down fashion:
        - Conceptual model – the highest and least granular model; defines the master reference data entities commonly used in the problem space.
        - Logical model – builds on the conceptual model by adding more granular details, such as operational and transactional data entities.
        - Physical model – specifies relational database objects such as tables, indexes (for example, unique key indexes), and constraints.
      - The models can be visualized through what are commonly known as ER diagrams.
  14. - The process of organizing the attributes and tables of a relational database to minimize redundancy and dependency.
      - Objectives (as specified by Codd):
        - To free the collection of relations from undesirable insertion, update, and deletion dependencies.
        - To reduce the need for restructuring the collection of relations as new types of data are introduced, and thus increase the life span of application programs.
        - To make the relational model more informative to users.
        - To make the collection of relations neutral to the query statistics, where these statistics are liable to change as time goes by.
      - Normal forms (NF):
        - 1NF – contains atomic values only.
        - 2NF – 1NF + every non-key attribute is dependent on the primary key.
        - 3NF – 2NF + every non-key attribute is non-transitively dependent on the primary key.
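As a sketch of the idea, the hypothetical tables below keep a customer's city in exactly one place (a 3NF decomposition) rather than repeating it on every order row, so a single UPDATE fixes it everywhere and a join reconstructs the combined view without stored redundancy:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# In an unnormalized design the customer's city would be repeated on every order
# row, creating update anomalies. After decomposition it is stored once.
conn.executescript("""
    CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders   (order_id INTEGER PRIMARY KEY,
                           cust_id  INTEGER REFERENCES customer(cust_id),
                           item     TEXT);
    INSERT INTO customer VALUES (1, 'Ada', 'London');
    INSERT INTO orders VALUES (100, 1, 'keyboard'), (101, 1, 'mouse');
""")

# One UPDATE corrects the city for every order, because it lives in one row.
conn.execute("UPDATE customer SET city = 'Cambridge' WHERE cust_id = 1")

# A join reproduces the denormalized view on demand.
rows = conn.execute("""
    SELECT o.order_id, c.name, c.city, o.item
    FROM orders o JOIN customer c ON o.cust_id = c.cust_id
    ORDER BY o.order_id
""").fetchall()
```

This is exactly the trade Codd's objectives describe: redundancy and its anomalies are removed at the cost of a join at read time.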
  15. - Properties that guarantee that database transactions are processed reliably.
      - A single logical operation (involving multiple steps) is called a transaction.
      - Properties:
        - Atomicity – "all or nothing": if one part of the transaction fails, the entire transaction fails.
        - Consistency – any data written to the database must be valid according to all defined rules and constraints.
        - Isolation – even during concurrent execution, the system ends in a state that is the same as if the transactions had been executed serially.
        - Durability – once a transaction has been committed, the results are stored permanently, irrespective of errors and crashes that occur after the commit.
      - In RDBMS, ACID properties are implemented using techniques such as locking and multi-versioning.
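Atomicity can be demonstrated with SQLite's transaction handling; the account table and amounts below are invented for illustration. The transfer is one logical transaction spanning two UPDATEs: when the second statement violates a constraint, the already-applied first statement is rolled back too.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER CHECK (balance >= 0))")
conn.execute("INSERT INTO account VALUES (1, 100), (2, 50)")
conn.commit()

try:
    with conn:  # sqlite3 context manager: commit on success, roll back on exception
        conn.execute("UPDATE account SET balance = balance + 200 WHERE id = 2")  # credit applied...
        conn.execute("UPDATE account SET balance = balance - 200 WHERE id = 1")  # ...debit fails CHECK
except sqlite3.IntegrityError:
    pass  # the whole transaction rolled back, including the credit (atomicity)

balances = dict(conn.execute("SELECT id, balance FROM account"))  # both balances unchanged
```

Note how consistency (the CHECK constraint) and atomicity cooperate: the constraint detects the invalid state, and the transaction machinery discards every partial effect.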
  16. - RDBMS-based solutions are generally the first choice for database storage and access needs.
      - RDBMS solutions are now mature and predictable.
      - An army of skilled specialists exists for using, managing, and maintaining RDBMS-based systems.
      - RDBMS has spawned an ecosystem of products that makes choosing RDBMS a no-brainer.
  17. - Ensures consistent behavior
        - With the table structure as the base, RDBMS provides a consistent mechanism for storing and accessing different data sets.
      - Removes redundancies
        - Through normal forms, redundancies in the data are removed, thereby addressing the errors that can arise from inconsistency of the stored data.
      - Avoids errors
        - Ensures data integrity and quality through consistent storage, enforced constraints and relationships, and the ability to check data as it is entered.
      - Facilitates easy analysis
        - With SQL-based queries as the foundation, analyzing different data sets is seamless. Given the history of RDBMS, users also have a vast repository of analysis tools.
      - Ensures robust maintenance and management
        - Database administrators have tools that let them easily maintain, test, repair, and back up the databases housed in the system.
      - Is secure
        - Offers a good level of security and access control; all or part of the data can be securely shared across multiple users (applications) based on the privileges granted to them.
  18. - Rise of social networks during the early 2000s
        - The World Wide Web acts as the foundation.
      - Shift in communication patterns
        - Sharing of personal information, and usage of the same.
        - Everyone turned into a publisher.
      - Increased focus on personalization
        - Recommendations, ratings, preferences, and personalized interfaces.
      - Big data flood
        - More data is being generated now than was generated throughout the previous history of humankind.
        - The need to store and process unstructured or semi-structured data at previously unanticipated volumes and frequencies.
  19. Ref: http://www.go-gulf.com/blog/60-seconds
  20. - Accessible by users across the globe
        - Geography is irrelevant.
        - Facebook, Google, Yahoo, Twitter, etc. have users across the world.
      - Highly networked and distributed systems
        - Systems are accessed and connected over the Internet.
      - Need to be highly scalable
        - Should be able to handle additional load without redesign.
        - Amazon sees a manifold increase in traffic during the holiday seasons.
      - Expected to be highly available
        - Systems must always be available for access and operations.
        - Google would incur a huge revenue and credibility loss if the site went down.
      - Handle large data sets hitting the systems at high frequency
        - The data needs to be stored and processed very quickly.
        - The number of likes and comments on Facebook exceeds 2.7 billion per day.
  21. - Brewer's CAP theorem: you can get only two of the following three.
        - Consistency – same as atomicity: you get "all or nothing".
        - Availability – the system must always be available for operations.
        - Partition tolerance – the system must keep working when some nodes are not reachable.
      - RDBMS were essentially designed for CA.
        - Latency (response time) is an unfortunate trade-off for consistency.
        - Partition tolerance becomes essential in distributed systems.
  22. - Beyond a point, you cannot afford to scale up storage.
        - It becomes very expensive to keep scaling up.
      - Is strict consistency really so important?
        - Ensuring consistency slows the system.
        - Google found that moving from a 10-result page loading in 0.4 seconds to a 30-result page loading in 0.9 seconds decreased traffic and ad revenues by 20% (Linden, 2006).
      - Redundancy can be managed.
        - Joining across normalized database tables is less efficient than reading from a data store.
      - Not all data is relational.
        - Fitting every kind of data into the rigid schema structure of an RDBMS is a challenge.
        - Mapping data read from an RDBMS back into its original model (say, a tree, graph, or key-value structure) puts significant stress on computing resources.
        - Attributes (columns) are restricted by their domain to store similar data.
        - Managing semi-structured and unstructured data, such as documents, becomes a challenge.
  23. - CRUD (Create, Read, Update, Delete) is crude.
        - Updates and deletes should never be allowed, as they destroy information.
      - Logical and physical separation of concerns is ignored.
        - The relational model is a logical model.
        - Database products implemented the relational model at the physical level as a set of B-tree files with multiple indexes.
        - This induces artificial overhead in managing the database.
      - It is all over spinning disks.
        - All RDBMS implementations assume that the data comes from disk.
        - A legacy of an era when memory was expensive.
        - Memory-based systems will be faster.
      - Databases are big and slow.
        - Fundamentally not designed for big data sets.
        - Long queries get slower with more data.
  24. - Core tenets
        - Basically Available – the system seems to work all the time.
        - Soft state – it does not have to be consistent all the time.
        - Eventual consistency – it becomes consistent eventually (at some later time).
      - Significance
        - BASE is diametrically opposed to ACID.
        - ACID is pessimistic and forces consistency at the end of every operation.
        - BASE is optimistic and accepts that database consistency will be in a state of flux.
        - Availability is achieved by supporting partial failures without total system failure: it is OK for the system to be available to 80% of users and limit failure to 20%.
        - Users should understand the implications of eventual consistency.
          - It factors in a probability of data loss; safety of the data is the trade-off.
          - You need to understand how eventual "eventual" is.
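A toy model of eventual consistency (not any particular product's protocol): two replicas accept writes independently and reconcile later with a last-writer-wins merge, so a read from the other replica is stale (soft state) until an anti-entropy sync runs.

```python
class Replica:
    """A replica holding key -> (timestamp, value); timestamps drive conflict resolution."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, ts):
        self.data[key] = (ts, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def sync(a, b):
    # Last-writer-wins merge: after the pass, both replicas hold the newest
    # (timestamp, value) pair for every key either of them has seen.
    for key in set(a.data) | set(b.data):
        newest = max(r.data[key] for r in (a, b) if key in r.data)
        a.data[key] = b.data[key] = newest


r1, r2 = Replica(), Replica()
r1.write("x", "v1", ts=1)   # the write lands on replica 1 only
stale = r2.read("x")        # replica 2 has not seen it yet: soft state
sync(r1, r2)                # anti-entropy pass
fresh = r2.read("x")        # now both replicas agree: eventual consistency
```

The gap between `stale` and `fresh` is exactly the window the slide warns about: "how eventual is eventual" depends on how often sync runs, and last-writer-wins can silently drop a concurrent older write, which is the data-safety trade-off.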
  25. - NoSQL – Not Only SQL
        - It is not SQL, and it is not relational.
      - Essential feature set
        - Elastic scaling – rely on scale-out rather than scale-up.
        - Big data – handle high volume, high velocity, and high variability.
        - Commoditized manageability – reduce dependence on highly skilled DBAs and lower administration costs.
        - Economics – build over commodity hardware.
        - Flexible data model – remove data-model-based restrictions.
      - Applicability
        - Performance and real-time behavior matter more than consistency.
        - High scalability is required.
        - Large data sets must be stored and retrieved.
        - A relational model is not required.
  26. - Key-value
        - The idea is a hash table with a unique key and a pointer to a particular item of data; the simplest to implement.
        - It is inefficient when you are only interested in querying or updating part of a value.
      - Column store
        - Created to store and process very large amounts of data distributed over many machines.
        - Still keyed, but keys point to multiple columns; the columns are arranged by column family.
      - Document
        - The model is essentially versioned documents that are collections of other key-value collections.
        - The semi-structured documents are stored in formats like JSON, allowing nested values associated with each key.
        - Document databases support more efficient querying.
      - Graph
        - A flexible graph model is used which, again, can scale across multiple machines.
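The key-value model above can be sketched in a few lines; this hypothetical in-memory store treats values as opaque blobs behind unique keys, which is exactly why partial reads or updates are inefficient:

```python
class KVStore:
    """Minimal key-value store sketch: a hash table of opaque values."""

    def __init__(self):
        self._data = {}          # hash table: key -> opaque value

    def put(self, key, value):
        self._data[key] = value  # whole-value write; the store never inspects the value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)


store = KVStore()
store.put("user:42", {"name": "Ada", "city": "London"})

# Updating one field means fetching and rewriting the whole value --
# the inefficiency the slide points out.
profile = store.get("user:42")
profile["city"] = "Cambridge"
store.put("user:42", profile)
```

A document store differs precisely here: because it understands the value's structure (e.g. JSON), it can query or update a nested field without round-tripping the whole document.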
  27. - Access interfaces – REST/HTTP, Thrift, MapReduce, language-specific APIs
      - Logical data model – key-value, column-family store, document, graph
      - Support and distribution – CAP support, multi-data-center support, dynamic provisioning, proactive monitoring
      - Data persistence – memory based, disk based, or a combination of memory and disk
  28. NoSQL examples by category:
      - Key-value: Memcached, Redis, SimpleDB, Tokyo Cabinet, Dynamo, Voldemort, Azure TS
      - Column store: SimpleDB, BigTable, HBase, Cassandra, HyperTable
      - Document: CouchDB, MongoDB, Lotus Domino, Riak
      - Graph: Neo4j, InfoGrid, FlockDB, InfiniteGraph
  29. - Not mature
        - RDBMS is mature, stable, and functionally rich.
        - Most NoSQL alternatives are in pre-production versions, with many key features yet to be implemented.
      - Support
        - Most NoSQL systems are open-source projects.
        - Support is mostly offered by startup companies whose reach and credibility are not on par with RDBMS vendors.
      - Analytics
        - NoSQL databases offer few facilities for ad-hoc query and analysis.
        - Even a simple query requires significant programming expertise.
        - At present, commonly used BI tools do not provide credible connectivity to NoSQL.
      - Administration and maintenance
        - The desired goal of zero maintenance is far away; in reality, significant effort is required to maintain these systems.
      - Expertise
        - Currently there is very limited awareness and knowledge.
  30. - Scalability
        - Master-slave – one master, many slaves: write to the master, read from any of the slaves.
        - Partitioning – group and localize related functions across nodes; partition vertically (by function) or horizontally (by key).
        - Caching – a memory-based cache in front of the database addresses scaling issues due to read and write loads.
      - High availability
        - Clustering – a group of systems responsible for a service; build redundancy into the cluster to eliminate single points of failure.
        - Mirroring and replication – maintain a hot standby to handle planned or unplanned downtime.
        - Recovery solutions – dependable data backup, restore, and recovery procedures; combine process with tools.
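Horizontal partitioning by key can be sketched as deterministic hash routing (the node names are invented). Every reader and writer computes the same node for the same key, so related data stays localized:

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical shard nodes


def node_for(key: str) -> str:
    """Route a key to one node deterministically and roughly uniformly."""
    digest = hashlib.md5(key.encode()).digest()  # stable hash across processes
    return NODES[int.from_bytes(digest, "big") % len(NODES)]


# The same key always routes to the same node, so writers and readers agree.
owner = node_for("user:42")
```

One caveat worth noting: this naive modulo scheme remaps most keys when a node is added or removed, which is why distributed stores typically use consistent hashing instead.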
  31. - Performance
        - Be open to denormalization, and accelerate reads: allow redundancy and duplicates to reduce joins.
        - Optimize your costly queries: analyze and optimize expensive queries using a mix of design strategy, indices, and query-optimization tools.
        - Invest in better hardware (storage and memory) – not a bad bet, as storage and memory costs have dropped significantly.
      - Rigid schemas – not all data is relational
        - Even the most schema-less model has some schema; the world revolves around structures.
        - If a key-value kind of store is needed, you can do the same in any RDBMS, with the added advantage of structured access and queries.
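The slide's point that a key-value style store can be emulated in any RDBMS can be sketched with a two-column SQLite table (the schema and helper functions are illustrative): the key is the primary key, and values are JSON-encoded blobs.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)")  # key-value as a relation


def put(key, value):
    # Upsert the JSON-encoded value; the PRIMARY KEY gives uniqueness for free.
    conn.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", (key, json.dumps(value)))


def get(key):
    row = conn.execute("SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
    return json.loads(row[0]) if row else None


put("user:42", {"name": "Ada"})
put("user:42", {"name": "Ada", "city": "London"})  # overwrite, key-value style
```

The added advantage the slide mentions is real: because the store is a relation, you still get SQL on top, for example `SELECT COUNT(*) FROM kv` or a `LIKE` scan over key prefixes.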
  32. - Systems will eventually gravitate towards one of these three:
        - Fast, agile, highly scalable data stores.
        - Handlers of complex transactional semantics.
        - Analytical processors and facilitators.
      - The world is never binary.
        - It is never either this or that; why fight over technicalities?
      - Drive decisions based on use cases.
        - Choose a model based on the use cases and scenarios.
        - Research and understand what your application needs.
        - Stay away from substituting hard work with rhetoric.
        - Be open to experimentation.
  33. References:
      - http://www.guug.de/lokal/muenchen/2007-05-14/rdbmsc.pdf
      - http://ansonalex.com/infographics/twitter-usage-statistics-2012-infographic/
      - http://www.mountainman.com.au/software/history/it1.html
      - http://www.slideshare.net/renguzi/codd
      - http://cims.clayton.edu/booth/ITDB%204201/Codd%20PDF.pdf
      - http://www.scribd.com/doc/19381895/RDBMS-Concepts
      - http://www.gitta.info/DBSysConcept/en/text/DBSysConcept.pdf
      - http://en.wikipedia.org/wiki/Relational_database
      - http://en.wikipedia.org/wiki/ACID
      - http://blogs.hbr.org/now-new-next/2009/05/the-social-data-revolution.html
      - http://www.go-gulf.com/blog/60-seconds
      - http://en.wikipedia.org/wiki/CAP_theorem
      - http://highscalability.com/drop-acid-and-think-about-data
      - http://queue.acm.org/detail.cfm?id=1394128
      - http://www.bailis.org/blog/safety-and-liveness-eventual-consistency-is-not-safe/
      - http://www.techrepublic.com/blog/10things/10-things-you-should-know-about-nosql-databases/1772
      - http://rebelic.nl/engineering/the-four-categories-of-nosql-databases/
      - http://www.slideshare.net/ksankar/nosql-4559402
      - http://www.thevirtualcircle.com/2008/11/10/6-reasons-why-relational-database-will-be-superseded/
      - http://www.slideshare.net/sbtourist/scale-your-database-and-be-happy

      Note: Many images used in this deck were found via Google image search. Although I have not been able to credit every image individually, I extend my sincere thanks to the owners of the images for making them available on the net.