NoSQL databases


Published on

NoSQL database paradigms, how to query NoSQL, about relationships, comparison with RDBMS

Published in: Technology
  • More than 5000 registered IT consultants and Corporates.Search for IT online training Providers at
    Are you sure you want to  Yes  No
    Your message goes here
  • Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement ---
    Are you sure you want to  Yes  No
    Your message goes here
  • Database Design for Mere Mortals: A Hands-On Guide to Relational Database Design (3rd Edition) ---
    Are you sure you want to  Yes  No
    Your message goes here
  • Fundamentals of Database Systems (7th Edition) ---
    Are you sure you want to  Yes  No
    Your message goes here
  • good !
    Are you sure you want to  Yes  No
    Your message goes here
No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide
  • Hello...
    Today, I’m going to talk about one of the latest buzz words, so called “NoSQL” databases.
    A little disclaimer: I personally have real-life experience with only one NoSQL database called CouchDB. I use CouchDB at my service.
    So, why did I want to talk about the subject: I wanted to learn this stuff myself. What options there are to CouchDB and should I perhaps pick another option for my next generation
    Also, I strongly feel that relational databases are going to be, if not replaced, they will at least get strong competition from these NoSQL solutions. This will not happen within all domains and services, but for simple consumer web services this will eventually happen.

  • Ok, lets quickly review the database paradigms we have to choose from.
    We have relational databases -- the ones you use here at Futurice every day in most of our projects.
    Then there is NoSQL. NoSQL a relatively new term now getting hot and popular on blogsphere and Twitter.
    Like many buzzwords out there, NoSQL does not have a definite definition. It could refer to all those databases where you don’t have to use SQL. Some more friendly and wise people would put it more softly “not only SQL”.
    Well, I am not even going to try to give a definition, but instead I will list what kind of data stores are usually categorized under this umbrella term.
    They are: ...
    These I am going to talk about today, but then there are also others such as Object Databases, and they are out of the scope of this presentation.
  • Let’s start with something familiar.
    The great promise of relational databases is they are ACID.
    They all support a very strong query language called SQL.
    And these are some familiar examples of relation database implementations.
  • This could be a visual representation of a relational database and the relationships between the tables.
    We have dogs. A dog must have a name and it may have mood, birthdate and a color. We all know that dogs can really speak, and they speak by barking. Dogs may also comment barks made by the others. Quite simple.
    The content of table ‘Dog’ could be something like this. We can see that Stella is a happy dog and has birth on April fool’s day. Nothing special here, either. You know the stuff.
  • Ok, that was quick. So, let’s get into the business :)
    The simplest form of NoSQL are the Key-value stores.
    They are really simple, you could say “One key...”
    All in all, a key value-store is just a persistent Hash.
    The value-part can be anything. The key-value store does not care a bit about the content of it.
    Example implementations.
    Amazon Dynamo is very important one, because those Amazon guys published some theoretical material, and other NoSQL databases are partly based on these writings.
  • If you want to visualize it, you can see that we have a string based key, and that the value could be a serialized Java-object, or anything at all.
  • Document databases. The basic idea is still quite simple. They are key-value stores, but the difference is that the value...
    Because of this, querying of data is somehow possible. There might by a query language like SQL, but that is not actually very common.
    The examples of document databases are...
  • What is the difference between this picture and the one two slides earlier? The “value” is now called “document”.
    That’s it!? Well, the document here has a structure. The structure could be JSON, XML or anything. Having a structure means that we might do something with the data. Not just return a binary blob.
    But having a structure does not mean it should have a predefined structure. With relational databases you have to define a schema before you can store data.
    If you compare this document here with our relational example, you can see that in the document we do not have a “color”. It could be there, but it does not need to be there. On the other hand, if we want to add a new attribute, we just add it. With relational database we need to adjust the structure of the table, and that’s not always so pleasant.
    What about relationships? Well, if the relationship is strong enough, you could add it as a part of the document itself. You could do...
  • ...this.
    Here, barks and comments are just a part of the dog-document.
    I would not say that this would be a smart move, because the document might become huge. In -service “dog” is one document, and “bark” is another. But comments made to a bark ARE part of a bark document.
  • The next category does not have widely accepted name. Some would call them “wide column stores” and others might say they are “BigTable clones”.
    Whatever you call them, they ideologically reside somewhere between key-value stores and relational databases. There is no schema, but the data is still semi-structured. Like in relational databases, you could think that there are rows, but they can have any number of columns, and there is no need to store NULL values.
    Again, if you think of it in terms of relational databases, and talk about “rows” and “tables”, you might just get more confused. I still am a bit confused myself :-)
    This is one definition... It might not get you any wiser.
    Hopefully an example will help, but before that let’s see the example implementations.
    Cassandra is probably the most interesting one, because it is getting a lot of attention. And the reference as the store under the hoods of Facebook is quite good reference, or what do you think :-)
  • Ok, here’s the example I promised.
    I could not figure out the best way to depict this, but let’s hope this works.
    Like in the previous examples, we have a dog called Stella. Stella’s ID is dog_12 and Stella has a number of attributes such as birthdate, mood and name. Internally, we could store this information into a table structure, much like in relational database.
    Now, if we want to add an attribute “color” to Stella, we can do it easily.
    Ok, let’s look at terms on the top of the picture. “Row-id” is the “key” to the stored item. ”Column family” and “title” together form a “column”. The difference between these is that the “title” is dynamic and you can define new titles on the fly. But the column family is more or less fixed. It is a bit like “table” in relational database and it is costly to update it’s name, for instance.
    The “time” is simply the version of an attribute. If we look at the attribute “mood” here, you will see that Stella has been angry from time 11 on, and she became happy at time 45. If you query an attribute without time, you would simply get the latest value.
    A record is not tied to a single column family. Like in the document database example, I could say “barks are attributes of a dog”. Like this. Perhaps this explains why they are called “wide column stores” as a record can easily consist of a large number of attributes.
  • Then there are Graph databases. Someone would perhaps leave them out from the “NoSQL” family, but the authors themselves shout that they should belong to the hype.
    I don’t really know too much about them.
    The Neo4j seems to be most popular one.
  • Here’s an example how a graph database could look like. Again, there is our dog_12 having a name, mood, birthdate and so on.
    The relationships within graph databases are strong.
    What I mean to say is...
  • Relationships in...
    There is no REAL relationship between the tables, you may...
    In graph databases, however, the relationships are... It means that you can do very efficient calculations you might need in some social applications. I’m talking about friends and friends-of-friends here.
    What about the other NoSQL databases. You could say that... BUT just like in relational databases, you may...
    The only difference between these sentences is that here I talk about “constraints” and here “validations”... and again, they are, in a sense, the same thing.
    Why, then, relational databases are so good handling relationships? It is because they have...

  • ...SQL. And great support for indexing!
    NoSQL databases, on the other hand, might...
    And they usually don’t...
    But, on positive side, you...
    However, compared to the power SQL, isn’t this quite disappointing?
  • Well, how do you query NoSQL then?
    Of course, you can access a document with a key you know.
    ...and with a BigTable also query per attribute like this.
    With graph databases you do “graph traversals”.
    Usually there is an API which might support queries such as “give me all the dogs”.
    Some NoSQL solutions also provide an SQL-like query language.
    Then you can integrate to various search engines and some NoSQL solutions may provide integration out-of-the box.
    ...but the most common way seems to be to use Map-Reduce functions.
  • When Google popularized the term Map-Reduce, the first sentence of the publication was: “...”
    The big idea is that without understanding about parallel and distributed programming, users may write quite simple Map and Reduce functions to solve any kinds of problems. These functions can then be run in a big cluster of processes and computers.
    These Map and Reduce functions could be written with any language, but quite often NoSQL databases provide a JavaScript based solution. The main point is not the language. Rather you just need a means to process a record, and then a way to return processed data back to the pipeline.
  • A very simple example of a Map function could be this. The parameter passed to the Map-function is always a single document.
    The example here checks that given document is a dog, and if it is, it will return dog’s mood as a key and birthdate as value.
    Now, think that we have 10 million dogs in our database. It will be very easy to distribute this function because given a single document, you should always get the same key/value pair back. It is a little bit like rendering an animated 3D movie, where you distribute the rendering of each individual frame to number of workers, and the result will be frame number as a key and the bitmap as value.
    From NoSQL databases point of view, you may use Map-functions to generate views for your data. Internally the database caches the results so that the data will be very fast to access. With this map function defined as a view, I could easily find the birth dates of happy dogs.

  • Reduce function exists to aggregate results from a dataset. A common scenario would be to calculate sum of values.
    Here’s an example of quite stupid reduce function. You always pass two parameters to reduce phase: a key and values for the given key. This reduce function would simply calculate the average ages of happy dogs, angry dogs, bored dogs and so on.
    And as a footnote, you know it is easy to write poorly performing SQL, but you can easily write poor reduce functions as well.
  • If you want to get deeper into the subject, you should probably google this keyword: CAP
    Shortly, CAP theorem says that: if you have a distributed data system, you can only get two out of these three features.
    Consistency: meaning that once you write something into the database, all the readers of the system will get the same result.
    Availability: meaning the database will function even if a single component in the system fails.
    Partition tolerance is easiest to explain using an example. Imagine we have a database distributed that have nodes all over Europe and United states. Now, if the network connections suddenly disappear between the continents, we will have two partitions. If the system is still able to operate normally allowing both reads and writes, and is able to recover once the connections between Europe and USA works again, the system is called “partition tolerant”.
    If you think of a relational database there is no such concept as partition tolerance at all. However, relational databases are very consistent and can be quite highly available.
    Most NoSQL databases focus on providing very high availability. Most of the systems are partition tolerant, but there are also system providing consistency over partition tolerance.
    Then, there are some databases that let you decide the level of CAP you want. If you are interested, you could try googling N/R/W.

  • Ok, doesn’t it sound quite bad if a database is not consistent?
    If you write your comment into Facebook, some of your friends will see it immediately and others only after couple of seconds? The database designers (and I) think that consistency is not always top priority.
    What matters is that the data will be..

  • ..eventually consistent.
    Eventual consistency means that the consistent state may happen instantly, after a few seconds or after a network connections have been restored. But it will eventually happen.
    If you get better performance, better availability and better scalability, the ACID level consistency might not be that important after all. And of course, this depends on the domain and the application you are doing. Eventual consistency might not be a good idea when you are transacting money or doing other such critical task.

  • Why you should thing a NoSQL solution for your next project?
    Being schema-free is a HUGE help for many problems. And it almost always makes the life of developer much more pleasant.
    If you need to store massive amount of data, or know you need to scale easily, NoSQL might be for you.
    I could also claim that some services are perhaps easier to implement with a NoSQL solution that with a relational database.
    And this very true for many simple Web 2.0 services.
  • Then, why NoSQL is not always the best choice.
    First, relational databases have been around for many decades, and they are mature. The developer tools are mature.
    NoSQL solutions, on the other hand, are often new projects with great ideas and even greater promises, but only “alpha” quality.
    If you need strong data consistency, or if you need uncompromised support for transactions, use a relational database.
    And the last point. We often tend to think of scalability issues too early. It might not be a bad choice to write an application with the tools you know best, and when the time hits, and you need to scale, then think if a NoSQL solution could solve some of your scalability issues.
  • This is my last slide.
    If you want to compare relational databases with NoSQL databases, these could be the main points.
    We talked about consistency. In relational model, with ACID operation, it is very strong.
    Relational databases support rather big datasets, but some NoSQL give you almost unlimited scalability in terms of data size.
    Scaling can mean many things, but you could safely say that NoSQL solutions usually scale up much better than relational databases do.
    SQL is wonderful query language and NoSQL solutions often do not support any kind of query language. If they do, the languages are likely not as expressive as SQL is. On the other hand, Map-Reduce can solve some problems very efficiently.
    And in theory at least, high availability is also easier achieved with many NoSQL solutions.
    And remember, NoSQL solutions are very different from each other. And this slide perhaps simplifies things too much. But it is still a very good slide to finish this presentation.
    Thank you!

  • NoSQL databases

    1. 1. Harri Kauhanen 2010-03-09 NoSQL databases
    2. 2. Database paradigms • Relational (RDBMS) • NoSQL • Key-value stores • Document databases • Wide column stores (BigTable and clones) • Graph databases • Others
    3. 3. Relational databases • ACID (Atomicity, Consistency, Isolation and Durability) • SQL • MySQL, PostreSQL, Oracle, ...
    4. 4. dog bark id: integer id: integer name: varchar dog_id: integer mood: varchar bark: text birthdate: date color: enum comment id: integer bark_id: integer dog_id: integer comment: text id name mood birth_date color 12 Stella Happy 2007-04-01 NULL 13 Wimma Hungry NULL black 9 Ninja NULL NULL NULL
    5. 5. Key-value stores • “One key, one value, no duplicates, and crazy fast.” • It is a Hash! • The value is binary object aka. “blob” – the DB does not understand it and does not want to understand it. • Amazon Dynamo, MemcacheDB, ...
    6. 6. “value” “key” ... name_€#_Stella^^^ mood_€#_Happy^^^ dog_12 birthdate%/// 135465645) ...
    7. 7. Document databases • Key-value store, but the value is (usually) structured and “understood” by the DB. • Querying data is possible (by other means than just a key). • Amazon SimpleDB, CouchDB, MongoDB, Riak, ...
    8. 8. “document” “key” { type: “Dog”, name: “Stella”, dog_12 mood: “Happy”, birthdate: 2007-04-01 } vs. id name mood birth_date color 12 Stella Happy 2007-04-01 NULL 13 Wimma Hungry NULL black 9 Ninja NULL NULL NULL
    9. 9. { type: “Dog”, name: “Stella”, mood: “Happy”, birthdate: 2007-04-01, barks: [ { bark: “I had to wear stupid..” comments: [ { dog_12 dog_id: “dog_4”, comment: “You look so cute!” }, { dog_id: “dog_14”, comment: “I hate it, too!” } ] } ] }
    10. 10. Wide column stores • Often referred as “BigTable clones” • "a sparse, distributed multi-dimensional sorted map" • Google BigTable, Cassandra (Facebook), HBase, ...
    11. 11. “column” “row-id” “column family” “title” “time” “value” dog_12 dog birthdate 15 2007-04-01 dog mood 11 Angry dog mood 45 Happy dog name 25 Stella dog color 34 Black bark text 11 I had to wear...
    12. 12. Graph databases • “Relation database is a collection loosely connected tables” whereas “Graph database is a multi-relational graph”. • Neo4j, InfoGrid, ...
    13. 13. Dog Stella type name mood dog_12 barks bark_59 I had to wear stupid... Happy birth_date comment_to 2007-04-01 comment_83 You look so Cute comments dog_4
    14. 14. • Relationships in RDBMS are “weak”. • You may “define” one by using constraints, documenting a relationship, writing code, using naming conventions etc. • Relationships in graph databases are first class citizens. • There are no relationships in key-value stores, document databases and wide column stores. • You may “define” one by using validations, documenting a relationship, writing code, using naming conventions etc.
    15. 15. • Relational databases have almost limitless indexing, and a very strong language for dynamic, cross-table, queries (SQL) • That’s why they handle all kinds of relationships well and dynamically. • NoSQL databases... • ...might have limited support for dynamic queries and indexing • ...don’t support JOIN like operations of SQL • ...but you can store some relationships into document itself
    16. 16. How to query NoSQL? • Key-Value • Row-id/column-family:title[/time] • “stella_12”/”dogs”:”name” → Stella • Graph traversal • API • Query-language • Integration to indexing and search engines • Map-Reduce
    17. 17. Map-Reduce • “MapReduce is a programming model and an associated implementation for processing and generating large data sets.” • Often JavaScript (NoSQL implementations)
    18. 18. Map-function function map(doc) { if (doc['type'] == 'Dog') { emit(doc['mood'], doc['birthdate']); } } • Generates “indexed view” of data/ documents • This view is just another hash, but both key and value can be “anything”
    19. 19. Reduce-function • Aggregate results for a “view” (after the Map-function) function reduce(mood, listOfBirthdates) { return averageBirthDate(listOfBirthdates); } • Map-phase is easy to distribute, but you is also easy to write poor Reduce-functions
    20. 20. Theorems • CAP • Consistency, Availability, Partition tolerance • “Pick two” • N/R/W (adjusting CAP)
    21. 21. No consistency?
    22. 22. Eventual consistency
    23. 23. Why NoSQL? • Schema-free • Massive data stores • Scalability • Some services simpler to implement than using RDBMS • Great fit for many “Web 2.0” services
    24. 24. Why NOT NoSQL • DRBMS databases and tools are mature • NoSQL implementations often “alpha” • Data consistency, transactions • “Don’t scale until you need it”
    25. 25. RDBMS vs. NoSQL • Strong consistency vs. Eventual consistency • Big dataset vs. HUGE datasets • Scaling is possible vs. Scaling is easy • SQL vs. Map-Reduce • Good availability vs.Very high availability
    26. 26. Questions?