NoSQL databases

My slides for the guest lecture on 24 October 2017 at the University of Groningen

Published in: Technology

  1. 1. NoSQL databases Filip Ilievski (f.ilievski@vu.nl) Vrije Universiteit Amsterdam
  2. 2. What will you hear about today 1) Databases: A historical perspective 2) Modern trends and why NoSQL? 3) Properties of NoSQL 4) Categories and instances of NoSQL databases 5) RDBMSs vs NoSQL 6) Use cases & The Semantic Web
  3. 3. The origin of Relational DBMSs “In 1970, Edgar F. Codd, an Oxford-educated mathematician working at the IBM San Jose Research Lab, published a paper showing how information stored in large databases could be accessed without knowing how the information was structured or where it resided in the database.”
  4. 4. The origin of Relational DBMSs “Until then, retrieving information required relatively sophisticated computer knowledge, or even the services of specialists who knew how to write programs to fetch specific information—a time-consuming and expensive task.” “What Codd did was open the door to a new world of data independence. Users wouldn’t have to be specialists, nor would they need to know where the information was or how the computer retrieved it. They could now concentrate more on their businesses and less on their computers.” (http://www-03.ibm.com/ibm/history/ibm100/us/en/icons/reldb/)
  5. 5. The origin of Relational DBMSs After Codd’s paper described the relational ideas in theory, practical implementations followed. Soon, relational DBMSs proved superior to the ad hoc flat-file databases and became the de facto standard for modelling, representing and accessing data. Relational databases were (and often still are) ideal when data storage is expensive and the data schemas are fairly simple.
  6. 6. But the circumstances change.
  7. 7. Trend #1: Increase in modeling complexity In the Big Data era, it is not uncommon for a database to have hundreds or even thousands of tables, many of which have to be combined (joined) to produce the requested information. => Trend #1: Queries get more complex and require complex SQL operations
  8. 8. Trend #2: Using tables is not always optimal Think of social media, where it is more intuitive to use something like a graph network => Trend #2: Some use cases require graphs or nested structures rather than tables
  10. 10. Trend #3: A fixed schema is not always optimal Our model might change over time. Applications in practice are typically developed incrementally (“agile”), so the initial design choices get altered, and this influences the data model. Having a fixed, pre-defined schema is not always desirable.
  12. 12. Trend #4: Increase in querying intensity In the 1980s, the Internet as we know it did not exist: only a handful of people would access and use a database. Many more users access the data now, a demand driven mainly by social networks and enabled by the increase in computing power. Additionally, the intensity of access by an individual user is higher (imagine that each Facebook like is yet another WRITE to the database). ● A single visit to a website can trigger tens to hundreds of database queries. => Trend #4: Many more queries have to be handled => Trend #4’: The querying intensity varies
  15. 15. Trend #5: Cheap storage Hardware is much cheaper, more accessible and more powerful now. When scaling, it often makes more sense to add more computers (horizontal scaling) than to upgrade a single machine (vertical scaling). => Trend #5: Usage of a lot of cheap (cloud) hardware instead of a single powerful machine
  16. 16. Modern trends: summary => Trend #1: Queries get more complex and require complex SQL operations => Trend #2: Some use cases require graphs or nested structures rather than tables => Trend #3: The data model might evolve over time => Trend #4: Many more queries have to be handled => Trend #4’: The querying intensity varies => Trend #5: Usage of a lot of cheap (cloud) hardware instead of a single powerful machine
  17. 17. Relational databases in the modern era Relational databases were not originally designed to handle very many concurrent users or very large data volumes. They were not designed to be distributed over multiple computers. They are not efficient with data that is not meant to be stored in tables. Sometimes, the relational database principles (e.g. normalization) become too complex and slow when there is a lot of data or a complex model.
  20. 20. NoSQL (=Not-Only-SQL) Main idea: adapt to the new trends/needs 1) To avoid making complex queries and joining many datasets, store as much as possible in a single record a) Consequently, the data structure in NoSQL is often not a table, but instead a dictionary, a tree, an array, etc. b) Yes, this leads to repetition and violates the normalization principle. 2) Use a data structure that is most appropriate for the problem (e.g. use a graph to model social networks) 3) To handle many users and many queries, and to use the cheap hardware, distribute the data over many simple machines.
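Point 1 above (store as much as possible in a single record, accepting repetition) can be sketched in plain Python. This is only an illustration: the `users` and `posts` data and the `denormalize` helper are made up for this example and do not come from any particular database.

```python
# Normalized, relational-style layout: two "tables" linked by user_id.
users = [
    {"user_id": 1, "name": "Ana"},
    {"user_id": 2, "name": "Bo"},
]
posts = [
    {"post_id": 10, "user_id": 1, "text": "Hello"},
    {"post_id": 11, "user_id": 1, "text": "NoSQL lecture today"},
    {"post_id": 12, "user_id": 2, "text": "Graphs are fun"},
]


def denormalize(users, posts):
    """Build one self-contained record per post, with the author copied in."""
    names = {u["user_id"]: u["name"] for u in users}
    return {
        p["post_id"]: {"text": p["text"], "author": names[p["user_id"]]}
        for p in posts
    }


post_docs = denormalize(users, posts)

# Reading a post now needs one lookup and no join,
# at the price of storing the author name once per post.
print(post_docs[11])
```

The trade-off is visible in the data itself: the name "Ana" is stored twice, so renaming a user would mean updating every one of that user's post records.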
  21. 21. What is NoSQL?
  22. 22. What is NoSQL? It is a group of databases that attempt to provide a more flexible way of storing data, while adapting to the new trends of intensive storage and accessible hardware. This idea started around 2009, and it has already been widely adopted.
  23. 23. What is NoSQL? NoSQL stands for Not-Only-SQL (yes, some NoSQL databases support SQL operations too). NoSQL databases are NOT meant to replace relational databases; both have their use cases.
  24. 24. Characteristics of a NoSQL database ● Flexible schema / schemaless Image from https://www.slideshare.net/mikecrabb/a-beginners-guide-to-nosql/30-SIDENOTEcolour_tabbyname_Gunthercolour_gingername_Mylocolour
  25. 25. Characteristics of a NoSQL database ● Flexible schema / schemaless ● Non-relational (forget about normalization today - but you have to know it for the exam :) )
  26. 26. Characteristics of a NoSQL database ● Flexible schema / schemaless ● Non-relational ● Simple access compared to SQL (but not standard across products)
  27. 27. Characteristics of a NoSQL database ● Flexible schema / schemaless ● Non-relational ● Simple access compared to SQL (but not standard across products) ● Generally no support for complicated queries ● Cheaper than RDBMSs ● Horizontally scalable ● Replicated ● Distributed
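The "horizontally scalable" and "distributed" characteristics usually rest on some form of partitioning (sharding): each key is assigned to one of several machines. Below is a minimal sketch, assuming simple hash-modulo partitioning and using in-memory dictionaries to stand in for machines. The names `shard_for` and `ShardedStore` are hypothetical, and real systems typically use more elaborate schemes and add replication on top.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


class ShardedStore:
    """Toy store that spreads records over several dicts ("machines")."""

    def __init__(self, num_shards: int):
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key: str, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str):
        # Only the one shard responsible for the key has to be contacted.
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(num_shards=4)
for i in range(100):
    store.put(f"user:{i}", i)

print(store.get("user:42"))
print([len(shard) for shard in store.shards])
```

A known weakness of plain modulo hashing is that changing the number of shards moves most keys to a different shard, which is one reason production systems tend to prefer other partitioning schemes.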
  28. 28. NoSQL databases Filip Ilievski (f.ilievski@vu.nl) Vrije Universiteit Amsterdam Half time
  29. 29. NoSQL DBs
  30. 30. A)KEY-VALUE DBs Each record in the database contains a unique key, which “hides” the value. Analogous to key-value pairs in dictionaries/JSON. Very simple and designed to work with a lot of data. Each record can have a different structure/type of data.
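To make the idea of a key that "hides" its value concrete, here is a toy in-memory sketch. It assumes values are serialized to bytes with JSON, so the store itself never looks inside them. The `KeyValueStore` class is hypothetical and is not the API of any real product.

```python
import json


class KeyValueStore:
    """Toy in-memory key-value store: each value is an opaque blob of bytes."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        # Serialize any JSON-compatible value; the store never inspects it.
        self._data[key] = json.dumps(value).encode("utf-8")

    def get(self, key):
        blob = self._data.get(key)
        return None if blob is None else json.loads(blob.decode("utf-8"))


store = KeyValueStore()
store.set("A", "circle")                            # a plain string
store.set("B", {"shape": "triangle", "sides": 3})   # a nested record
store.set("C", [1, 2, 3])                           # a list

print(store.get("B"))
```

Because the values are opaque, the only efficient access path is by key; a question such as "which records have three sides?" would require fetching and decoding every value.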
  34. 34. A)KEY-VALUE DBs: Code for Redis
      # Initialization (assumes a Redis server listening on localhost:6379)
      import redis
      pool = redis.ConnectionPool(host='localhost', port=6379, db=0)
      r = redis.Redis(connection_pool=pool)
      # Store a record
      my_key = "B"
      my_value = "triangle"
      r.set(my_key, my_value)
      # Obtain a record (returned as bytes unless decode_responses=True is set)
      my_key = "B"
      r.get(my_key)
  35. 35. B) DOCUMENT-STORE DBs Again, simple and designed to work with a lot of data. Very similar to a key-value database. The main difference is that you can actually see (and QUERY for) the values.
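The difference from a key-value store (being able to query on the values) can be sketched with a list of dictionaries and a query-by-example helper. The `people` data and the `find` function below are made up for illustration; the filter style loosely resembles the filter documents used by stores such as MongoDB, but this is not their API.

```python
# Documents in one collection do not have to share the same fields.
people = [
    {"_id": 1, "name": "Ana", "city": "Groningen", "languages": ["nl", "en"]},
    {"_id": 2, "name": "Bo", "city": "Amsterdam"},
    {"_id": 3, "name": "Cy", "city": "Groningen", "age": 31},
]


def find(collection, criteria):
    """Return the documents whose fields equal every key/value in criteria."""
    return [
        doc
        for doc in collection
        if all(key in doc and doc[key] == value for key, value in criteria.items())
    ]


# Query on a value, not on the key.
print([doc["name"] for doc in find(people, {"city": "Groningen"})])
```

This sketch scans every document; a real document store would normally maintain indexes on frequently queried fields to avoid a full scan.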
  39. 39. B) DOCUMENT-STORE DBs: MongoDB
  40. 40. C) GRAPH DBs Graph databases focus on modelling the structure of the data. Inspired by Euler’s graph theory, G=(V,E), where V is the set of vertices (nodes) and E the set of edges. Motivated mainly by social media networks.
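A graph with a vertex set V and an edge set E can be sketched with ordinary Python sets. The example assumes an undirected "friend" relation; the names and the helpers `build_adjacency` and `friends_of_friends` are hypothetical and only illustrate the kind of traversal that graph databases are built for.

```python
# V: the people; E: undirected "friend" edges between them.
vertices = {"Ana", "Bo", "Cy", "Di"}
edges = {("Ana", "Bo"), ("Bo", "Cy"), ("Cy", "Di")}


def build_adjacency(vertices, edges):
    """Turn G = (V, E) into a neighbour lookup table."""
    adjacency = {v: set() for v in vertices}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def friends_of_friends(adjacency, person):
    """People exactly two hops away: friends of friends who are not friends."""
    direct = adjacency[person]
    two_hops = set()
    for friend in direct:
        two_hops |= adjacency[friend]
    return two_hops - direct - {person}


adjacency = build_adjacency(vertices, edges)
print(friends_of_friends(adjacency, "Ana"))
```

In a relational model the same question typically needs a self-join on a friendship table for every extra hop, whereas here each hop is a direct neighbour lookup.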
  46. 46. C) GRAPH DBs: Neo4j
  47. 47. D) COLUMN STORE Column data is saved together, instead of row data. Super useful for data analytics. Inspired by Google BigTable.
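Why storing column data together helps analytics can be sketched by converting a row-oriented layout into a column-oriented one. The data and the `to_columns` helper are made up for this example, and the sketch assumes that every row has the same fields.

```python
# Row-oriented layout: all fields of one record are stored together.
rows = [
    {"id": 1, "country": "NL", "amount": 10.0},
    {"id": 2, "country": "MK", "amount": 20.0},
    {"id": 3, "country": "NL", "amount": 5.0},
]


def to_columns(rows):
    """Column-oriented layout: all values of one field are stored together."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one field touches a single list
# instead of reading every field of every record.
total = sum(columns["amount"])
print(total)
```

The benefit grows with the number of fields: an analytical query over one column of a wide table can skip all the other columns entirely.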
  48. 48. D) COLUMN STORE: Facebook’s Cassandra
  49. 49. It is mainly about the size
  50. 50. E) OTHER DATABASES Many other NoSQL databases exist that do not fall into these four categories: ● Text search databases (e.g. Elasticsearch) ● XML databases ● ...
  51. 51. Comparison of NoSQL categories
  52. 52. Popularity https://db-engines.com/en/ranking
  53. 53. So, which one to use? It depends. In general, an RDBMS is great for ensuring data consistency when data validity is very important (think of financial transactions and similar corporate applications). NoSQL is great for ensuring high availability and speed rather than validity. NoSQL is also great for many applications that are hard to imagine with RDBMSs: indexing text documents, social networks, storing coordinates, etc. Pick the right tool for the job!
  54. 54. Me and NoSQL: Map studio Bachelor thesis: Use of political maps to render statistical data on countries, provinces, areas, cities, etc. We drew the maps from scratch, so we had to keep a huge number of coordinates in our database. Typically the coordinates were stored only once and retrieved many times, so it was very important for the READ operation to be efficient; the WRITE operation mattered much less.
  55. 55. Me and NoSQL: Map studio Requirements: ● The main one is response time (=how quickly the database can return an answer/the data) ● The secondary goal is horizontal scalability (=the database should make it easy to split the data across multiple machines) What is less/not important: ● consistency ● concurrency
  56. 56. Map studio Solution:
  57. 57. Knowledge graphs
  59. 59. Knowledge graphs A knowledge graph acquires and integrates information into an ontology and applies a reasoner to derive new knowledge. ● It is a graph ● It is semantic = the meaning of the data is encoded alongside the data in the graph, in the form of the ontology. A knowledge graph is self-descriptive, or, simply put, it provides a single place to find the data and understand what it’s all about. ● It is smart = the ontology allows implicit information to be inferred from explicit data ● It is alive = it can grow and get updated
  60. 60. Knowledge graphs These principles are shared by the research on the Semantic Web and Linked Open Data, which aims to construct a Web of things. Each world fact is represented as a subject-predicate-object triple and stored as Linked Open Data. By the end of 2016, Google’s knowledge graph apparently contained 70 billion connected facts. The Linked Open Data cloud also contains billions of facts (our LOD Laundromat collection at VU might be among the largest -> last counted 40B, but the next run will contain many more).
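The subject-predicate-object representation, and the idea of inferring implicit facts from explicit ones, can be sketched with a set of tuples. Everything here is illustrative: the facts, the predicate names (`locatedIn`, `type`, `subClassOf`) and the helpers `match` and `infer_types` are assumptions for this example rather than a real RDF vocabulary or triple-store API, and only a single subclass rule is applied.

```python
# Each fact is a (subject, predicate, object) triple.
triples = {
    ("Groningen", "locatedIn", "Netherlands"),
    ("Amsterdam", "locatedIn", "Netherlands"),
    ("Groningen", "type", "City"),
    ("City", "subClassOf", "Place"),
}


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return {
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    }


def infer_types(triples):
    """One reasoning step: X type C and C subClassOf D imply X type D."""
    inferred = set()
    for (x, _, c) in match(triples, predicate="type"):
        for (_, _, d) in match(triples, subject=c, predicate="subClassOf"):
            inferred.add((x, "type", d))
    return inferred - triples


# Explicit facts: what is located in the Netherlands?
print(sorted(s for (s, _, _) in match(triples, predicate="locatedIn", obj="Netherlands")))

# Implicit fact derived from the tiny "ontology" above.
print(infer_types(triples))
```

The second query returns a fact that was never stored explicitly, which is the sense in which the ontology makes the graph "smart".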
  61. 61. Takeaways from today’s lecture 1) NoSQL fills the spectrum between flat files and relational DBs 2) NoSQL databases were introduced to fit the new trends: many users, many queries, complex data models, diverse data structures, and cheap hardware 3) The data structure in NoSQL is often not a table, but instead a dictionary, a tree, an array, etc. 4) Horizontal scaling, repetition of data, flexible schema 5) Classes of NoSQL DBs: graph, column, key-value, document, other 6) RDBMS or NoSQL? Pick the right tool for the job
  62. 62. Acknowledgements Some slides have been copied from existing slides on Slideshare ● https://www.slideshare.net/mikecrabb/a-beginners-guide-to-nosql/ ● https://www.slideshare.net/bhaskar_vk/introduction-to-nosql-databases-47768468 ● https://www.slideshare.net/RTigger/sql-vs-no-sql ● https://hackernoon.com/wtf-is-a-knowledge-graph-a16603a1a25f For an advanced presentation on NoSQL, see: ● https://www.slideshare.net/quipo/nosql-databases-why-what-and-when/34-
  63. 63. Thanks! Anything you would like to talk about?
