
NoSQL Databases


Published in: Technology

  1. 1. Web Data Management Course (SCOM7348), University of Birzeit, Palestine, January 2014
     Synthesis Paper Talk: NoSQL Databases
     Nader Abulhalaweh, Nagham Ghanim Hamad, Fayez Shayeb, and Dima Taji (students, Master of Computing, Birzeit University)
     This is a student talk for the Web Data Management course, in which each student is asked to present his/her synthesis paper. Course Page:
  2. 2. Outline
     • Introduction
     • General Characteristics of NoSQL Databases
     • CAP Theorem
     • Types of NoSQL Databases
     • Column Databases
     • Graph Databases
     • Key-value Databases
     • Document Databases
     • Comparison
     • Future Work
  3. 3. Introduction
     • What are NoSQL databases?
       – "Not Only SQL" databases
       – Mostly distributed, non-relational storage systems
     • Where do NoSQL databases prove effective?
       – For simple operations such as key lookups and reads and writes of one record or a small number of records
       – The records are usually electronic mail, personal profiles, wikis, or customer records
  4. 4. Introduction: NoSQL needs
     • The need for NoSQL arises from the following requirements:
       – Distributed storage (large data volumes)
       – Scalability
       – Performance (queries need to return answers quickly)
       – High availability (replication provides partition tolerance)
       – Mostly queries, few updates
       – Asynchronous inserts and updates
       – Schema-free (schema-less) data
       – ACID transactions are not needed; the alternative is BASE
       – CAP theorem in distributed systems
       – Open source development
       – Low cost
  5. 5. General Characteristics of NoSQL Databases
     • The ability to horizontally scale the throughput of simple operations over many servers.
     • The ability to replicate and to distribute (partition) data over many servers.
     • A simple call-level interface or protocol (in contrast to a SQL binding).
     • A weaker concurrency model than the ACID transactions of most relational (SQL) database systems.
     • Efficient use of distributed indices and RAM for data storage.
     • The ability to dynamically add new attributes to data records.
  6. 6. CAP Theorem
     • The three properties: Consistency, Availability, Partition tolerance
     • A NoSQL database can guarantee only two of the three properties at the same time.
     • Most NoSQL databases are partition-tolerant.
  7. 7. Types of NoSQL Databases
     1. Key-value
     2. Column
     3. Document
     4. Graph
  8. 8. Key-Value Data Stores
     • Definition: in simple terms, a hash table of keys and their values.
     • We call these systems key-value stores.
     • Provide fast access to small data items (MapReduce is used for big data).
     • Subdivided into in-memory and disk-based solutions.
     • The category with the largest number of members.
     • Not suited to highly connected data.
     • Provide replication, versioning, locking, transactions, and sorting.
     • The client interface provides inserts, deletes, and index lookups.
     • Examples: Riak, Voldemort
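The hash-table definition above can be illustrated with a minimal in-memory sketch. The class name `KeyValueStore` and its method names are hypothetical and chosen only for illustration; real stores such as Riak or Voldemort add replication, versioning, and persistence on top of this idea.

```python
class KeyValueStore:
    """Hypothetical in-memory key-value store backed by a dict (a hash table)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Insert a new value or overwrite the existing one for this key.
        self._data[key] = value

    def get(self, key, default=None):
        # Index lookup by key; returns `default` when the key is absent.
        return self._data.get(key, default)

    def delete(self, key):
        # Remove the key; report whether it was present.
        if key in self._data:
            del self._data[key]
            return True
        return False


store = KeyValueStore()
store.put("customer:12345", {"name": "Mr. Smith"})
```

The interface is deliberately limited to inserts, deletes, and lookups by key; there is no way to query by value without scanning every entry.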
  9. 9. MapReduce
     • A technique for indexing and searching large data volumes.
     • Two phases:
       – Map: select a subset of the data as (key, value) pairs, potentially in parallel on multiple machines.
       – Reduce: merge the subsets of (key, value) pairs and then sort the result.
     • The results from MapReduce may be useful for other searches.
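The two phases can be sketched on a single machine with a word count. This is only an illustration under simplifying assumptions: the function names `map_phase` and `reduce_phase` and the sample documents are made up, and a real framework would run the map calls in parallel on separate nodes.

```python
from collections import defaultdict


def map_phase(document):
    # Map: emit one (key, value) pair per word in the document.
    return [(word.lower(), 1) for word in document.split()]


def reduce_phase(pairs):
    # Reduce: merge the pairs that share a key, then sort by key.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


documents = ["NoSQL stores scale", "graph stores and document stores"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(pairs)
```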
  10. 10. Key-Value Data Stores - Riak
     • Riak is a key-value store written in Erlang by Basho. Basho describes it as both a "key-value store" and a "document store".
     • Features:
       – Riak objects can be fetched and stored in JSON format [2]:
         {
           "bucket": "customers",
           "key": "12345",
           "object": {
             "name": "Mr. Smith",
             "phone": "415-555-6524"
           },
           "links": [
             ["sales", "Mr. Salesguy", "salesrep"],
             ["cust-orders", "12345", "orders"]
           ],
           "vclock": "opaque-riak-vclock",
           "lastmod": "Mon, 03 Aug 2009 18:49:42 GMT"
         }
  11. 11. Key-Value Data Stores - Riak
     • Supports replication of objects and sharding by hashing on the primary key.
     • Supports MVCC.
     • Includes a map/reduce mechanism to split work over all the nodes in a cluster.
     • Offers programmatic interfaces for Erlang, Java, and other languages.
     • One unique feature of Riak is that it can store "links" between objects (documents).
  12. 12. Code Examples on KVS
     • Riak can be accessed from Java through its client library. Example Java code:
       // Create a Protocol Buffers client.
       IRiakClient client = RiakFactory.pbcClient();
       // Note: use this line instead of the former if using a local devrel cluster:
       // IRiakClient client = RiakFactory.pbcClient("", 10017);
       // Fetch the bucket named "test".
       Bucket myBucket = client.fetchBucket("test").execute();
       int val1 = 1;
       // Store the integer under the key "one".
       myBucket.store("one", val1).execute();
  13. 13. Riak
     • Disadvantages
       – Not implemented for the Windows operating system.
       – Does not support RAM storage.
  14. 14. Key-Value Data Stores - Project Voldemort
     • Project Voldemort is an advanced key-value store written in Java [2]. It was developed by LinkedIn.
     • To enable high performance and availability, only very simple key-value data access is allowed. Both keys and values can be complex compound objects, including lists or maps, but nonetheless the only supported queries are effectively the following:
  15. 15. Key-Value Data Stores - Project Voldemort
     • Features:
       – Provides multi-version concurrency control (MVCC) for updates.
       – Supports optimistic locking for consistent multi-record updates (conflicting updates can be backed out).
       – Provides an ordering on versions using vector clocks.
       – Supports automatic sharding of data. Consistent hashing is used to distribute data around a ring of nodes.
       – Can store data in RAM or in a pluggable storage engine (Berkeley DB).
       – Supports lists, maps, and records as values in addition to simple scalar values.
       – Nodes are easy to add to and remove from the cluster.
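How vector clocks order versions can be sketched as follows. This is a generic illustration of the technique, not Voldemort's actual implementation; clocks are modelled as plain dicts from a node name to a counter, and the function names are hypothetical.

```python
def descends(a, b):
    # Clock `a` descends from `b` if it has seen every event recorded in `b`.
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a, b):
    # Classify the relationship between two versions of the same key.
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "after"
    if descends(b, a):
        return "before"
    # Neither version has seen the other: the writes conflict.
    return "concurrent"


newer = {"node-a": 2, "node-b": 1}
older = {"node-a": 1, "node-b": 1}
```

A "concurrent" result is the case in which the store cannot pick a winner by itself and the conflicting versions have to be reconciled, for example by the client.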
  16. 16. KVS – Voldemort (continued)
     • Voldemort hashing:
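Consistent hashing around a ring of nodes can be sketched as below. This is a generic version of the technique under stated assumptions: the class name `HashRing`, the use of MD5, and the number of virtual points per node are illustrative choices, not details taken from Voldemort.

```python
import bisect
import hashlib


def ring_position(key):
    # Map any string to a position on the ring (a large integer).
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, points_per_node=100):
        # Each node owns several virtual points so that load spreads evenly.
        self._ring = sorted(
            (ring_position(f"{node}#{i}"), node)
            for node in nodes
            for i in range(points_per_node)
        )
        self._positions = [position for position, _ in self._ring]

    def node_for(self, key):
        # Walk clockwise to the first point at or after the key's position,
        # wrapping around to the start of the ring if necessary.
        index = bisect.bisect_right(self._positions, ring_position(key))
        return self._ring[index % len(self._ring)][1]


ring = HashRing(["node-a", "node-b", "node-c"])
bigger_ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
keys = [f"key-{i}" for i in range(200)]
```

The design benefit is that adding a node only moves the keys that the new node takes over; every other key keeps its previous owner.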
  17. 17. Voldemort (continued)
     • Disadvantages
       – No complex query filters
       – All joins must be done in code
       – No foreign key constraints
       – No triggers (no cascading)
       – Does not support a MapReduce mechanism
  18. 18. Discussion: Comparison
  19. 19. Column Databases
     • What are Column Databases
     • Structure and Patterns
     • Problems with Column Databases
     • Comparison
       – HBase
       – Cassandra
  20. 20. What are Column Databases
     • Also called extensible record stores.
     • Many consider column databases to be key-value databases.
     • A column can be part of a column family, which most closely resembles a relational row, but a column may appear in one row and not in the others.
     • Standard column families are schema-less, so each of their "rows" can contain a different number of columns, and even different column names.
  21. 21. What are Column Databases (cont.)
     • A column of a distributed data store is the lowest-level NoSQL object in a keyspace. It is a tuple (a key-value pair) consisting of three elements:
       – Unique name: used to reference the column.
       – Value: the content of the column. It can have different types, such as AsciiType, LongType, TimeUUIDType, and UTF8Type, among others.
       – Timestamp: the system timestamp used to determine the valid content.
  22. 22. What are Column Databases (Example)
       UserProfile = {
         Cassandra = { emailAddress:"" , age:"20" }
         TerryCho = { emailAddress:"" , gender:"male" }
         Cath = { emailAddress:"" , age:"20", gender:"female", address:"Seoul" }
       }
     where "Cassandra", "TerryCho", and "Cath" correspond to row keys, and "emailAddress", "age", "gender", and "address" correspond to the column names.
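The (name, value, timestamp) column and the schema-less rows of the UserProfile example can be sketched as follows. The `Column` type, the `write` helper, the integer timestamps, and the stale "Busan" write are all hypothetical, introduced only to show how the timestamp decides which content is valid.

```python
from typing import NamedTuple


class Column(NamedTuple):
    name: str       # unique name used to reference the column
    value: str      # content of the column
    timestamp: int  # decides which write holds the valid content


def write(row, column):
    # Keep the column only if it is at least as new as the stored version.
    current = row.get(column.name)
    if current is None or column.timestamp >= current.timestamp:
        row[column.name] = column


# Each row key maps to its own set of columns; rows need not share columns.
user_profile = {"Cassandra": {}, "Cath": {}}
write(user_profile["Cassandra"], Column("age", "20", 1))
write(user_profile["Cath"], Column("age", "20", 1))
write(user_profile["Cath"], Column("address", "Seoul", 2))
write(user_profile["Cath"], Column("address", "Busan", 1))  # stale write, ignored
```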
  23. 23. Structure and Patterns
     • Each column family is partitioned horizontally across different nodes, each node hosting partitions for all columns.
     • A single partition of a column family contains:
       – Time-stamped key/value pairs that are appended at the tail end of the file.
       – An index that sorts the pairs lexicographically and points to their current version.
     • This allows for in-order traversal of the data and efficient location of the most recent version of the data.
     • Since updates are appended to the data store rather than overwritten in place, old versions with stale data or deleted values in the column segments need to be garbage-collected periodically.
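The append-only partition with an index and periodic garbage collection can be sketched as below. This is a simplified model under stated assumptions: the class name `Partition` is hypothetical, the file is replaced by an in-memory list, and deletions (tombstones) are not modelled.

```python
class Partition:
    def __init__(self):
        self.log = []    # append-only list of (key, timestamp, value)
        self.index = {}  # key -> position of the current version in the log

    def put(self, key, timestamp, value):
        # Updates are appended at the tail; the old version stays in the log.
        self.log.append((key, timestamp, value))
        self.index[key] = len(self.log) - 1

    def get(self, key):
        return self.log[self.index[key]][2]

    def scan(self):
        # In-order traversal: visit keys lexicographically via the index.
        for key in sorted(self.index):
            yield key, self.get(key)

    def compact(self):
        # Garbage collection: keep only the entries the index still points to.
        live = [self.log[position] for position in sorted(self.index.values())]
        self.log = live
        self.index = {entry[0]: position for position, entry in enumerate(live)}


partition = Partition()
partition.put("b", 1, "x")
partition.put("a", 2, "y")
partition.put("b", 3, "z")  # supersedes the first version of "b"
```

Calling `partition.compact()` at this point would shrink the log from three entries to two without changing what `get` or `scan` return.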
  24. 24. Problems with Column Databases
     1. Inserted tuples have to be broken up into their component attributes, and each attribute must be written separately.
     2. The dense-packed data layout makes moving tuples within a page nearly impossible.
     3. Most queries access more than one attribute of an entity, but information about a logical entity is stored in multiple locations.
  25. 25. HBase vs. Cassandra
       Property | HBase | Cassandra
       Implementation language | Java | Java
       Influence | Google's BigTable | Amazon's Dynamo
       License | Apache 2.0 | Apache 2.0
       CAP theorem focus | Consistency & Availability | Availability & Partition Tolerance
       Optimized for | Reads | Writes
       Partitioning strategies | Ordered partitions | Random partitions
       Major use | Facebook's Messaging Platform | Facebook's Inbox Search
       Query language | HBase shell commands or Cloudera Impala query engine | CQL (Cassandra Query Language)
     Remaining rows (High Availability, Distributed, Transaction Processing, Persistence, Replication), with values in the order given: Yes; Yes, No; Yes; Major version upgrades require imports; Uses HDFS, Amazon S3, or Amazon Elastic Block Store; Yes
  26. 26. Document-oriented database
     • A document-oriented database is a computer program designed for storing, retrieving, and managing document-oriented, or semi-structured, information.
  27. 27. Comparison
     • Relational
       – Article: id, authorid, title, content
       – Author: id, name, email
       – Comment: id, articleid, message
     • Document DB
       – Article: _id, title, content, author (_id, name, email), comments[]
  28. 28. Concepts
     • Joins
       – No joins
       – Joins at "design time", not at "query time"
     • Constraints
       – No foreign key constraints
       – Unique indexes
     • Transactions
       – No commit/rollback
       – Atomic operations
       – Multiple actions inside the same document, including embedded documents
  29. 29. Benefits
     • Scalable: good for a lot of data and traffic
       – Horizontal scaling: to more nodes
       – Good for web apps
     • Performance
       – No joins and constraints
     • User-friendly
       – Data is modeled on how the app is going to use it
       – No conversion between object-oriented and relational models
       – No static schema = agile
  30. 30. One-to-many
     • Embedded array / array keys
       – Some queries get harder
       – You can index arrays!
     • Article
       – _id
       – content
       – tags: ["foo", "bar"]
       – comments: ["id1", "id2"]
  31. 31. Many-to-many
     • Using array keys
       – No join table
       – References on both sides
     • Article
       – _id
       – content
       – category_ids: ["id1", "id2"]
     • Category
       – _id
       – name
       – article_ids: ["id7", "id8"]
     • Advantage: simple queries
         articles.Where(p => p.CategoryIds.Contains(categoryId))
         categories.Where(c => c.ArticleIds.Contains(articleId))
     • Disadvantage: duplication; every change must update two documents
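The array-key approach can be sketched with plain dictionaries standing in for documents. The sample documents, field values, and function names below are hypothetical; the sketch only shows why queries stay simple while every link has to be written on both sides.

```python
articles = [
    {"_id": "id7", "content": "first article", "category_ids": ["id1", "id2"]},
    {"_id": "id8", "content": "second article", "category_ids": ["id1"]},
]
categories = [
    {"_id": "id1", "name": "databases", "article_ids": ["id7", "id8"]},
    {"_id": "id2", "name": "nosql", "article_ids": ["id7"]},
]


def articles_in_category(category_id):
    # Simple query: no join table, just a membership test on the array key.
    return [a["_id"] for a in articles if category_id in a["category_ids"]]


def link(article_id, category_id):
    # The disadvantage: the reference is duplicated, so two documents change.
    article = next(a for a in articles if a["_id"] == article_id)
    category = next(c for c in categories if c["_id"] == category_id)
    if category_id not in article["category_ids"]:
        article["category_ids"].append(category_id)
    if article_id not in category["article_ids"]:
        category["article_ids"].append(article_id)
```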
  32. 32. JSON
     • Stands for JavaScript Object Notation
     • Derived from the JavaScript scripting language
     • Used for representing simple data structures and associative arrays
         {
           "_id": "BCCD12CBB",
           "type": "person",
           "name": "Darth Vader",
           "age": 63,
           "headware": ["Helmet", "Sombrero"],
           "dark_side": true
         }
         {
           "_id": "BCCD12CBC",
           "type": "person",
           "name": "Luke",
           "age": 35,
           "powers": ["Pull", "Jedi Mind Trick"],
           "dark_side": false
         }
  33. 33. Replication
     • Only one server (the primary, or master) is active for writes at a given time; this allows strongly consistent (atomic) operations. Read operations can optionally be sent to the secondaries when eventual-consistency semantics are acceptable.
  34. 34. Sharding
     • Sharding is the partitioning of data among multiple machines in an order-preserving manner (horizontal scaling).
       Machine 1 | Machine 2 | Machine 3
       Alabama → Arizona | Colorado → Florida | Arkansas → California
       Indiana → Kansas | Idaho → Illinois | Georgia → Hawaii
       Maryland → Michigan | Kentucky → Maine | Minnesota → Missouri
       Montana → Montana | Nebraska → New Jersey | Ohio → Pennsylvania
       New Mexico → North Dakota | Rhode Island → South Dakota | Tennessee → Utah
       Vermont → West Virginia | Wisconsin → Wyoming |
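Order-preserving sharding amounts to a lookup in a sorted list of range boundaries. The sketch below uses three of the ranges from the slide; the machine assignment follows the slide's first row as read, and the simplification that each chunk extends up to the next lower bound is an assumption made for brevity.

```python
import bisect

# (lower bound of the chunk, machine that holds it), sorted by lower bound.
CHUNKS = [
    ("Alabama", "machine-1"),
    ("Arkansas", "machine-3"),
    ("Colorado", "machine-2"),
]
LOWER_BOUNDS = [low for low, _ in CHUNKS]


def machine_for(key):
    # Find the last chunk whose lower bound is <= key.
    index = bisect.bisect_right(LOWER_BOUNDS, key) - 1
    if index < 0:
        raise KeyError(f"{key!r} sorts before the first chunk")
    return CHUNKS[index][1]
```

Because neighbouring keys land in the same chunk, a range query such as "Alabama to Arizona" touches a single machine, which a hash-based partitioning could not guarantee.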
  35. 35. Replication & Sharding conclusion
     • Sharding is the tool for scaling a system, and replication is the tool for data safety, high availability, and disaster recovery. The two work in tandem yet are orthogonal concepts in the design.
  36. 36. Document Databases
  37. 37. What is MongoDB?
     • Creator: 10gen (founded by former DoubleClick staff)
     • Name: short for "humongous" (Arabic: عملاق)
     • Language: C++
  38. 38. What is MongoDB?
     • Definition: MongoDB is an open source, document-oriented database designed with both scalability and developer agility in mind. Instead of storing your data in tables and rows as you would with a relational database, in MongoDB you store JSON-like documents with dynamic schemas (schema-free, schemaless).
  39. 39. What is MongoDB?
     • Goal: bridge the gap between key-value stores (which are fast and scalable) and relational databases (which have rich functionality).
  40. 40. MongoDB terms
  41. 41. Term mapping
  42. 42. MongoDB Data model
     • Data model: using BSON (binary JSON), developers can easily map to modern object-oriented languages without a complicated ORM layer.
     • BSON is a binary format in which zero or more key/value pairs are stored as a single entity.
     • Lightweight, traversable, efficient
  43. 43. Schema design
     • RDBMS: join
  44. 44. Schema design
  45. 45. Schema design
  46. 46. Query: MongoDB vs. SQL
       SQL | MongoDB
       create table | db.createCollection()
       insert into | db.coll.insert()
       select | db.coll.find()
       update | db.coll.update()
       delete | db.coll.remove()
       drop | db.coll.drop()
  47. 47. Query: MongoDB vs. SQL
  48. 48. Query: MongoDB vs. SQL
  49. 49. APIs for MongoDB: C, C++, C#, Erlang, Haskell, Java, node.js, PHP, Perl, Python, Ruby, Scala (Casbah)
  50. 50. Outline
     • Graph databases
     • Neo4j graph databases
     • Allegro graph databases
     • Virtuoso graph databases
     • Comparison
     • Demo
  51. 51. Graph databases
     • What is a graph database?
     • A graph database is a model that describes real-world data, with the whole database represented as a structure dense with relationships between nodes. Graph databases are generally built for use with transactional (OLTP) systems.
  52. 52. Graph database
     • A graph contains nodes and relationships: a graph records data in nodes, which have properties.
     • Relationships organize the graph: nodes are organized by relationships, which also have properties.
     • Labels group the nodes: nodes are grouped by labels into sets.
     • Query a graph with a traversal: a traversal navigates a graph; it identifies paths, which order nodes.
     • Indexes look up nodes or relationships: an index maps from properties to either nodes or relationships.
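The vocabulary above (nodes, relationships, properties, labels, traversal, path) can be made concrete with a small in-memory sketch. The sample people, the `KNOWS` relationship type, and the function names are hypothetical; the traversal is an ordinary breadth-first search, not the algorithm of any particular graph database.

```python
from collections import deque

# Nodes carry labels and properties.
nodes = {
    1: {"labels": {"Person"}, "properties": {"name": "Alice"}},
    2: {"labels": {"Person"}, "properties": {"name": "Bob"}},
    3: {"labels": {"Person"}, "properties": {"name": "Carol"}},
    4: {"labels": {"Person"}, "properties": {"name": "Dave"}},
}
# Relationships: (start node, type, end node, properties).
relationships = [
    (1, "KNOWS", 2, {"since": 2010}),
    (1, "KNOWS", 3, {"since": 2012}),
    (2, "KNOWS", 3, {}),
    (3, "KNOWS", 4, {}),
]


def neighbours(node_id):
    # Follow outgoing relationships from a node.
    return [end for start, _type, end, _props in relationships if start == node_id]


def shortest_path(start, goal):
    # Breadth-first traversal; the returned path orders the visited nodes.
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbours(path[-1]):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None
```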
  53. 53. Graph databases
  54. 54. Why graph databases are better
     • Performance
     • Flexibility
  55. 55. Graph Database Models - Neo4j
     • Neo4j is an open source graph database implemented in Java. It represents complex and large data sets in a simple way using key-value properties.
     • Both nodes and relationships can contain properties.
     • Nodes are often used to represent entities, but depending on the domain, relationships may be used for that purpose as well.
     • Neo4j's Cypher language is purpose-built for working with graph data. It supports most of the algorithms used for searching and path finding in graphs, such as shortest path and graph pattern matching.
     • Schema-free
  56. 56. Graph Database Models - Neo4j
     • Architecture
     • The node and relationship stores are concerned only with the structure of the graph, not its property data. Both stores use fixed-size records so that any individual record's location within a store file can be rapidly computed given its ID.
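Why fixed-size records make lookups cheap can be shown with a short sketch: the byte offset of a record is simply its ID multiplied by the record size. The record layout below (an in-use flag plus two 4-byte pointers) is a hypothetical example and not Neo4j's actual on-disk format.

```python
import io
import struct

# Hypothetical layout: in-use flag, first relationship ID, first property ID.
RECORD_FORMAT = "<BII"
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)


def record_offset(record_id):
    # No index needed: the location follows directly from the ID.
    return record_id * RECORD_SIZE


def write_record(store, record_id, in_use, first_rel, first_prop):
    store.seek(record_offset(record_id))
    store.write(struct.pack(RECORD_FORMAT, in_use, first_rel, first_prop))


def read_record(store, record_id):
    store.seek(record_offset(record_id))
    return struct.unpack(RECORD_FORMAT, store.read(RECORD_SIZE))


store = io.BytesIO()  # stands in for a store file
for record_id in range(3):
    write_record(store, record_id, 1, record_id * 10, record_id * 100)
```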
  57. 57. Graph Database Models - AllegroGraph
     • AllegroGraph is a semantic web technology for storing RDF. It relies on disk-based storage and stores both data and metadata as triples.
     • One thing to note is that AllegroGraph doesn't restrict the contents of its triples to pure RDF.
     • It can represent any graph data structure by treating its nodes as subjects and objects, its edges as predicates, and creating a triple for every edge. The named-graph slot can be used to hold additional, application-specific information. Used this way, AllegroGraph becomes a powerful graph-oriented database.
     • AllegroGraph generates a 12-byte unique identifier for each string in the triple store.
     • The triples can be queried through various query APIs, such as SPARQL.
     • It has Java, Python, Lisp, and Prolog interfaces.
     • AllegroGraph's RDFS++ reasoning supports all the RDFS predicates and some of OWL's.
  58. 58. Graph Database Models - AllegroGraph
     • Architecture
       – AllegroGraph provides a REST protocol architecture.
       – AllegroGraph is designed to store RDF triples, a standard format for Linked Data.
  59. 59. Graph Database Models - AllegroGraph: how does it work?
     • AllegroGraph runs as a service.
     • It can work with a large variety of programming interfaces, such as Java and Lisp.
     • Although it is called a "triple", each triple has five fields:
       – Subject
       – Predicate
       – Object
       – Graph
       – ID (the triple identifier)
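The five-field "triple" can be sketched as a small record type with a pattern-matching helper. This is an in-memory illustration only and does not use the AllegroGraph client API; the sample data and the `match` function are hypothetical.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    object: str
    graph: str  # named-graph slot, usable for application-specific data
    id: int     # triple identifier


triples = [
    Triple("alice", "knows", "bob", "social", 1),
    Triple("bob", "knows", "carol", "social", 2),
    Triple("alice", "age", "30", "profile", 3),
]


def match(subject=None, predicate=None, object=None, graph=None):
    # None acts as a wildcard, like an unbound variable in a query pattern.
    pattern = (subject, predicate, object, graph)
    return [
        t for t in triples
        if all(want is None or want == have for want, have in zip(pattern, t[:4]))
    ]
```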
  60. 60. Graph Database Models - Virtuoso
     • Virtuoso Universal Server is a multi-protocol RDBMS.
     • Virtuoso provides web server support, which helps in querying and loading data over HTTP.
     • It supports the SPARQL query language, whose typecasting differs from SQL's; Virtuoso provides a QUIETCAST query hint that saves developers from writing long and complex cast expressions in SQL.
     • It implements a quad store (graph, subject, predicate, object).
  61. 61. Virtuoso Architecture
     • A high-performance, scalable, secure, and operating-system-independent server designed to handle contemporary challenges associated with standards-compliant data access, data integration, and data management.
  62. 62. Comparison
     • Indexing
     • SQL support, query languages
     • Storing
  63. 63. Indexing
     • Performance is gained by creating indexes, which improve the speed of looking up nodes in the database.
     • Neo4j does not support automatic indexing; you have to specify which properties to index. The Neo4j schema is optional, so the database can be used without any schema (schema-free).
     • AllegroGraph and Virtuoso, on the other hand, support dynamic and automatic indexing. AllegroGraph supports any index combination of S, P, O, and G. The default indices are SPOGI, POSGI, OSPGI, GSPOI, GPOSI, GOSPI, and I. The graph indices, namely GSPOI, GPOSI, and GOSPI, are used when the triple store is divided into subgraphs. The I index is a special type of index that lists all the triples by ID number.
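The idea behind index names such as SPOG or POSG is that each index stores the same quads sorted in a different field order, so a lookup on a known leading field becomes a range scan. The sketch below is a generic illustration with hypothetical data and function names; it is not how AllegroGraph or Virtuoso implement their indices.

```python
import bisect

# Quads as (subject, predicate, object, graph).
QUADS = [
    ("alice", "knows", "bob", "g1"),
    ("bob", "knows", "carol", "g1"),
    ("alice", "age", "30", "g2"),
]
FIELD = {"s": 0, "p": 1, "o": 2, "g": 3}


def build_index(quads, order):
    # Reorder each quad according to `order` (for example "posg") and sort.
    return sorted(tuple(quad[FIELD[c]] for c in order) for quad in quads)


def prefix_lookup(index, prefix):
    # Jump to the first entry >= prefix, then read while the prefix matches.
    start = bisect.bisect_left(index, prefix)
    results = []
    for entry in index[start:]:
        if entry[:len(prefix)] != prefix:
            break
        results.append(entry)
    return results


posg = build_index(QUADS, "posg")
spog = build_index(QUADS, "spog")
```

A query that fixes only the predicate is served by `posg`, while one that fixes only the subject is served by `spog`; this is why a store keeps several orderings of the same data.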
  64. 64. Indexing (cont.)
     • In Virtuoso, the indexing of RDF data includes two full indices over RDF quads and three partial indices. The indexing scheme consists of the following indices:
       – PSOG: primary key.
       – POGS: bitmap index for lookups on object value.
       – SP: partial index for cases where only S is specified.
       – OP: partial index for cases where only O is specified.
       – GS: partial index for cases where only G is specified.
  65. 65. Query languages
     • Neo4j does not support SQL queries; it supports the Cypher language.
     • Virtuoso supports SQL, SPARQL, XQuery, XPath, and RDF++ reasoning. AllegroGraph, on the other hand, supports SPARQL as its query language.
  66. 66. Storing
     • Neo4j is a native graph database implemented in Java, which means its internal database structure directly represents nodes, relationships, and properties as records in its database files. Neo4j represents nodes and relationships as Java objects in its embedded Java API and as ASCII art in its query language, Cypher.
     • AllegroGraph and Virtuoso support native data types, storing a wide range of data types directly in their low-level triple representation.
  67. 67. Loading and indexing
     • The author in [16] compared the previous models on loading and indexing triples and measured the time for each one.
     • The evaluation focuses on centralized triple stores with a single client making queries at a given time. The queries are executed one after the other.
       Triples | Allegro | Virtuoso | Neo4j
       40377 | 1.608 | 1.38 | 24.98
       374911 | 15.505 | 13.22 | 6.35
       1075626 | 45.135 | 55.74 | 1347
       1809874 | 77.627 | 90.48 | 2484
       24711725 | 1156.8 | 6408 | 53496
       9279230 | 3198 | 4860 | 175032
  68. 68. Loading and indexing (cont.)
     • AllegroGraph and Virtuoso perform best and similarly to each other on smaller datasets. For larger datasets, AllegroGraph seems to perform somewhat better than Virtuoso in our case, where the maximum number of triples tested was 50 million.
     • Neo4j shows the poorest performance among the triple stores compared: it has the longest loading times, and it also scales worst.
  69. 69. Cypher
  77. 77. NoSQL DBs vs. Relational DBs
     • Arguments for relational databases:
       – New relational systems can do everything that a NoSQL system can, with analogous performance and scalability, and with the convenience of transactions and SQL.
       – RDBMSs have taken and retained majority market share over other competitors in the past 30 years.
       – Successful RDBMSs have been built to handle other specific application loads in the past.
       – SQL provides a common interface that gives advantages in training, continuity, and data interchange.
  78. 78. NoSQL DBs vs. Relational DBs (cont.)
     • Arguments for NoSQL databases:
       – RDBMSs cannot achieve scaling comparable to NoSQL systems.
       – A key-value store is adequate for looking up objects based on a single key, and is probably easier to understand than an RDBMS (the learning curve covers only the level of complexity you require).
       – NoSQL systems provide a flexible schema that allows each object in a collection to have different attributes.
       – NoSQL systems make expensive operations either impossible or obviously expensive for programmers.
       – NoSQL products have established smaller but non-trivial markets in areas where there is a need for particular capabilities.
  79. 79. NoSQL DBs vs. Relational DBs (cont.)
       Data Model | Performance | Scalability | Flexibility | Complexity | Functionality
       Key–value Stores | high | high | high | low | variable (none)
       Column Store | high | high | moderate | low | minimal
       Document Store | high | variable (high) | high | low | variable (low)
       Graph Database | variable | variable | high | high | graph theory
       Relational Database | variable | variable | low | moderate | relational algebra
     Source: Ben Scofield, Wikipedia, NoSQL (
  80. 80. The Future of NoSQL DBs
     • NoSQL databases are popular enough that many developers may abandon globally ACID transactions.
     • NoSQL data stores will not be a passing fad, since they fill a market niche.
     • New RDBMSs will take a significant share of the scalable data storage market.
     • Scalable data stores will not prove to be enterprise-ready for a while.
     • There will be a major consolidation among some systems.
  81. 81. References 1
     1. Decker, S., Passant, A. and Breslin, J. (2009). The Social Semantic Web. Berlin, Heidelberg: Springer.
     2. Gruber, T. (1993). Toward Principles for the Design of Ontologies Used for Knowledge Sharing. International Journal Human-Computer Studies, 43, 907-928. Retrieved from
     3. Recordon, D. (2010). The Open Graph protocol: Understanding the design decisions, in WWW Conference 2010, April 29th, 2010,
     4. Brunati, M. (2010). Facebook Open Graph and the Semantic Web [PowerPoint slides], December, 2012 (
     5. Facebook Developers. Open Graph Protocol, Retrieved December 20, 2012 from
     6. Facebook Developers. Open Graph Concepts, Retrieved December 20, 2012 from
     7. Facebook. Open Graph protocol, Retrieved December 20, 2012 from
  82. 82. References 2
     8. Berners-Lee, Tim; Fischetti, Mark. Weaving the Web. HarperSanFrancisco. Chapter 12. (1999). ISBN 978-0-06-251587-2.
     9. Victoria Shannon. "A 'more revolutionary' Web". International Herald Tribune. (June 26, 2006). ( Retrieved December, 2012.
     10. "W3C Semantic Web Activity". World Wide Web Consortium (W3C). November 7, 2011. Retrieved December, 2012.
     11. "OWL Web Ontology Language Overview". World Wide Web Consortium (W3C). February 10, 2004. ( . Retrieved December, 2012.
     12. Bizer, Christian; Heath, Tom; Berners-Lee, Tim. "Linked Data - The Story So Far". International Journal on Semantic Web and Information Systems 5 (3): 1–22. doi:10.4018/jswis.2009081901. (2009). ISSN 15526283. Retrieved Dec, 2012.
     13. Heath and Christian Bizer. Linked Data: Evolving the Web into a Global Data Space, Synthesis Lectures on the Semantic Web: Theory and Technology, Morgan & Claypool. (2011)
     14. Linked open data: ( (Visited December 2012)
  83. 83. References 3
     15. Erwin Folmer, Paul Oude Luttighuis, Jos van Hillegersberg. Do semantic standards lack quality? A survey among 34 semantic standards, June 2011.
     16. [9] W3C: Linked Data. Webpage, Copyright 2012 ( December 2012)
     17. XHTML Metainformation Attributes Module, ( XHTML Metainformation Attributes Module) (Visited in Dec, 2012)
     18. RDFa in XHTML: Syntax and Processing, ( (Visited in Jan, 2013)
     19. Zollmann, Johannes. "NoSQL Databases." SigMod, 2011.
     20. Schindler, Jiri. "I/O Characteristics of NoSQL Databases." VLDB Endowment 5, no. 12 (August 2012): 2020-2021.
     21. Cattell, Rick. "Scalable SQL and NoSQL Data Stores." SigMod Record 39, no. 4 (December 2010): 12-27.
     22. Rabl, Tilmann, Mohammad Sadoghi, Hans-Arno Jacobsen, Sergio Gómez-Villamor, Victor Muntés-Mulero, and Serge Mankovskii. "Solving Big Data Challenges for Enterprise Application Performance Management." VLDB Endowment 5, no. 12 (2012): 1724-1735.
  84. 84. References 4
     15. Abadi, Daniel J., Peter A. Boncz, and Stavros Harizopoulos. "Column-oriented Database Systems." VLDB Endowment, 2009.
     16. Atzeni, Paolo, Christian S. Jensen, Giorgio Orsi, Sudha Ram, Letizia Tanca, and Riccardo Torlone. "The relational model is dead, SQL is dead, and I don't feel so good myself." SigMod Record 42, no. 2 (June 2013): 64-68.
     17. Stonebraker, M., S. Madden, D. J. Abadi, S. Harizopoulos, N. Hachem, and P. Helland. "The end of an architectural era: (it's time for a complete rewrite)." VLDB Endowment, 2007: 1150-1160.
  85. 85. Thank You