Graph databases

Published in: Software, Education, Technology

  • G – graph
    V – set of vertices
    E – set of edges
  • D3 - Data-Driven Documents
  • Graph databases

    1. Graph Databases
       Karol Grzegorczyk, June 10, 2014
    2. Graph Theory
       The Seven Bridges of Königsberg problem, analysed by Leonhard Euler in 1735: how to find a walk through the city that crosses each bridge once and only once? [© Google]
       Euler proved that no such walk exists.
       G = (V, E), E ⊆ V × V
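
Euler's argument can be checked mechanically. The sketch below assumes the usual modelling of Königsberg as a multigraph with four land masses and seven bridges; the labels A to D are illustrative names, not taken from the slides. It relies on the standard criterion that a connected multigraph has a walk using every edge exactly once only if it has zero or two vertices of odd degree.

```python
from collections import Counter

# Königsberg as a multigraph: four land masses (A-D, hypothetical labels)
# joined by seven bridges. Parallel bridges appear as repeated pairs.
BRIDGES = [
    ("A", "B"), ("A", "B"),
    ("A", "C"), ("A", "C"),
    ("A", "D"), ("B", "D"), ("C", "D"),
]


def odd_degree_vertices(edges):
    """Return the sorted list of vertices touched by an odd number of edges."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return sorted(v for v, d in degree.items() if d % 2 == 1)


def has_eulerian_walk(edges):
    """True if a walk using every edge exactly once can exist.

    Assumes the graph is connected; connectivity is not checked here.
    """
    return len(odd_degree_vertices(edges)) in (0, 2)
```

All four land masses have odd degree, so `has_eulerian_walk(BRIDGES)` is false, which matches Euler's result.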
    3. Storing Connected Data in a Relational Database
       ● Relationships do exist in relational databases, but only by means of joins and join tables.
       ● Logically, a join creates a Cartesian product of tables.
       ● Operations in relational databases are index-intensive. Retrieval based on an index is fast, but not constant-time (most often O(log n)).
       ● Traversal queries require hierarchical joins, which are costly. Deep traversal queries are infeasible: execution time increases exponentially with the depth of the join.
       ● For a given SQL query, the RDBMS creates an in-memory graph data structure.
       ● Relational databases are often normalized in order to organize data efficiently.
       ● Normalization increases the number of joins needed to query the database. Denormalization can be a partial solution.
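
To make the cost of join-based traversal concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `person` and `knows` tables and the sample names are assumptions made for illustration. A two-hop "friends of friends" query already needs two passes over the join table, and each extra hop adds another join.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a person table and a knows join table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE knows (
            src INTEGER NOT NULL REFERENCES person(id),
            dst INTEGER NOT NULL REFERENCES person(id)
        );
        INSERT INTO person VALUES (1, 'Anna'), (2, 'Jan'), (3, 'Ewa'), (4, 'Piotr');
        INSERT INTO knows VALUES (1, 2), (2, 3), (2, 4), (3, 1);
        """
    )
    return conn


def friends_of_friends(conn, name):
    """Two-hop traversal: one join on `knows` per hop, plus joins to `person`."""
    sql = """
        SELECT DISTINCT p3.name
        FROM person p1
        JOIN knows k1 ON k1.src = p1.id
        JOIN knows k2 ON k2.src = k1.dst
        JOIN person p3 ON p3.id = k2.dst
        WHERE p1.name = ? AND p3.id <> p1.id
        ORDER BY p3.name
    """
    return [row[0] for row in conn.execute(sql, (name,))]
```

Going one hop deeper would mean adding a `k3` join by hand, which is the pattern the slide describes as costly.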
    4. Database normalization
       ● Database normalization is the process of organizing the fields and tables of a relational database to minimize redundancy.
         – Normalization usually involves dividing large tables into smaller (and less redundant) tables and defining relationships between them.
       ● Normal forms
         – First normal form (each attribute contains only atomic values)
         – Second normal form (each non-key attribute depends on the whole primary key)
         – Third normal form (each non-key attribute depends on nothing but the primary key)
       ● A relational database table is often described as "normalized" if it is in 3NF.
       ● When a database is intended for OLAP rather than OLTP, it is typically denormalized.
       ● Denormalization is the process of attempting to optimize the read performance of a database by adding redundant data or by grouping data.
       ● Examples of denormalization techniques:
         – Materialized views
         – Star schemas
         – OLAP cubes
    5. Graph Database Highlights
       ● Graph data stores provide index-free adjacency, which gives much better traversal performance than a traditional RDBMS.
       ● Designed predominantly for traversal performance and for executing graph algorithms.
       ● A graph database is a more natural, direct representation of a domain than an RDBMS (no need for junction tables).
       ● There is no need to join tables, because the data structure is already "joined" by the edges that are defined.
       ● In graph databases, denormalization is not needed.
       ● Graph diagrams tend to contain specific instances of nodes and relationships, rather than classes or archetypes.
       ● The main purpose of graph databases is the analysis and visualization of graph data.
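
Index-free adjacency can be illustrated with a small in-memory sketch: each vertex holds direct references to its neighbours, so a hop is a pointer dereference rather than an index lookup. The `Vertex` class and `within_hops` function are hypothetical names for this illustration and say nothing about how any particular product stores its data.

```python
class Vertex:
    """A vertex that stores direct references to its neighbours."""

    def __init__(self, name):
        self.name = name
        self.neighbours = []

    def connect(self, other):
        """Add an undirected edge by storing a reference on both sides."""
        self.neighbours.append(other)
        other.neighbours.append(self)


def within_hops(start, hops):
    """Names of vertices reachable from `start` in at most `hops` steps.

    Only neighbour references are followed; no global index is consulted.
    """
    seen = {start}
    frontier = [start]
    found = []
    for _ in range(hops):
        next_frontier = []
        for vertex in frontier:
            for neighbour in vertex.neighbours:
                if neighbour not in seen:
                    seen.add(neighbour)
                    found.append(neighbour.name)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return found
```

The work done per hop depends on the number of neighbours visited, not on the total size of the graph.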
    6. Graph Database Models
       ● The Property Graph Model
         – The model is built of nodes and relationships.
         – Nodes contain key-value properties. Sometimes relationships do as well.
         – Relationships are named and directed, and always have a start and an end node.
       ● Hypergraphs
         – A generalization of the graph model.
         – A relationship can have any number of nodes at either end (many-to-many relationships).
       ● Triple stores
         – A triple expresses a relationship between two resources.
         – The triple is a subject-predicate-object data structure, e.g. Fred likes ice cream.
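
The property graph model described above maps onto a few plain data structures. This is a minimal sketch; the class and method names (`PropertyGraph`, `relate`, `outgoing`) are invented for the example and are not an API of any graph database.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node carrying key-value properties."""
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """A named, directed relationship with a start node and an end node."""
    name: str
    start: Node
    end: Node
    properties: dict = field(default_factory=dict)


@dataclass
class PropertyGraph:
    nodes: list = field(default_factory=list)
    relationships: list = field(default_factory=list)

    def add_node(self, **properties):
        node = Node(dict(properties))
        self.nodes.append(node)
        return node

    def relate(self, start, name, end, **properties):
        rel = Relationship(name, start, end, dict(properties))
        self.relationships.append(rel)
        return rel

    def outgoing(self, node, name):
        """End nodes of relationships called `name` that start at `node`."""
        return [r.end for r in self.relationships
                if r.start is node and r.name == name]
```

For example, `g.relate(fred, "likes", ice_cream, since=2014)` records the slide's "Fred likes ice cream" fact with a property on the relationship itself.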
    7. Triple stores
       ● The Resource Description Framework (RDF) is a framework for expressing information about resources.
       ● Resources can be anything, including documents, people, physical objects, and abstract concepts.
       ● RDF is intended for situations in which information on the Web needs to be processed by applications, rather than only being displayed to people.
       ● RDF is a building block of the Semantic Web movement.
       ● RDF is a set of W3C specifications.
         – SPARQL: SPARQL Protocol and RDF Query Language
       ● Disadvantages
         – Lack of index-free adjacency. Data is stored in the form of triples, which are independent artifacts. In order to traverse the graph, one needs to join multiple triples.
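
The disadvantage named above can be seen in a toy triple store. In this sketch the triples are plain tuples and `match` is a hypothetical pattern-matching helper; apart from "Fred likes ice cream", the sample facts are made up. A two-step traversal has to be expressed as a join of two independent pattern lookups.

```python
# Each triple is (subject, predicate, object).
TRIPLES = [
    ("Fred", "likes", "ice cream"),
    ("Fred", "knows", "Anna"),
    ("Anna", "likes", "surfing"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


def likes_of_acquaintances(triples, person):
    """Two-step traversal written as a join of two triple patterns."""
    result = []
    for _, _, friend in match(triples, subject=person, predicate="knows"):
        for _, _, thing in match(triples, subject=friend, predicate="likes"):
            result.append((friend, thing))
    return result
```

Each call to `match` scans the triples again, which is the join work that index-free adjacency avoids.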
    8. RDF example [G. Schreiber, Y. Raimond, RDF 1.1 Primer, W3C, 2014]
       In RDF, resources are identified by IRIs (Internationalized Resource Identifiers).
       RDF defines logical relationships. A number of different serialization formats exist for writing down RDF graphs:
       ● Turtle
       ● JSON-LD
       ● RDFa
       ● RDF/XML
       Popular RDF datasets:
       ● Wikidata
       ● DBpedia
       ● WordNet
       ● Europeana
       ● VIAF
    9. Hypergraphs [I. Robinson, J. Webber, E. Eifrem, Graph Databases, O’Reilly Media, 2013]
       HyperGraphDB: http://www.hypergraphdb.org
       Using hypergraphs, we lose the ability to add properties to individual relationships.
    10. The Property Graph Model
        ● The most popular variant of the graph model.
        ● Relationships are one-to-one only: each connects exactly one start node to one end node.
        ● Property graph databases are typically schema-less. There is no notion of a database schema.
        ● Querying is often done by example, i.e. by finding data (nodes and relationships) that match a specified pattern.
        ● Optimized for traversal.
        ● Popular solutions:
          – Neo4j (pure graph DBMS)
          – OrientDB (hybrid document and graph DBMS)
    11. Neo4j
        ● Written in Java, and uses some high-performance features of the JVM.
        ● Concepts:
          – Nodes (can have zero or more properties)
          – Relationships (always have a direction and a type; can have zero or more properties)
          – Labels for grouping nodes together (a node can have zero or more labels; labels have colors assigned)
        ● Neo4j is a schema-optional graph database (since version 2.0). There are two schema elements:
          – Indexes: you can create an index on a set of properties of nodes with a specific label (Apache Lucene)
          – Constraints: a constraint (currently only uniqueness) on a property of nodes with a given label (an index is added automatically)
        ● Two versions/modes:
          – Web server with a pure RESTful API and a rich web GUI
          – Embedded Java library
        ● The RESTful API was designed with discoverability in mind. Just start with a GET on the service root (e.g. http://localhost:7474/db/data) and you will get a list of hyperlinks to the available resources.
    12. Cypher Query Language basics
        ● Cypher is a declarative query language based on pattern matching.
        ● Basic SQL syntax structure:
            SELECT columns FROM table WHERE conditions
        ● Basic Cypher syntax structure:
            MATCH pattern WHERE conditions RETURN nodes
        ● Patterns are written as ASCII-art graphs, e.g.:
            MATCH (x)-->(y) RETURN x
        ● It is possible to create data with Cypher as well:
            CREATE ({key:"value"})
    13. Cypher basic examples
        ● Create a simple node
            CREATE ({name:"Anna"})
        ● Retrieve all the nodes
            MATCH (x) RETURN x
        ● Create a labeled node with some properties
            CREATE (x:Person {name:"Jan", from:"Poland"})
        ● Retrieve all the nodes labeled Person whose property from is "Poland"
            MATCH (y:Person) WHERE y.from = "Poland" RETURN y
        ● Create a relationship from Anna to every Person node
            MATCH (x) WHERE x.name = "Anna" MATCH (y:Person) CREATE (x)-[:knows]->(y)
    14. Traversal queries
        ● Find Jan's friends. Return him and his friends.
            MATCH (x:Person)-[:knows]-(friends)
            WHERE x.name = "Jan"
            RETURN x, friends
        ● Find friends of Jan's friends who like surfing.
            MATCH (x:Person)-[:knows]-()-[:knows]-(surfer)
            WHERE x.name = "Jan" AND surfer.hobby = "surfing"
            RETURN DISTINCT surfer
    15. Starting points
        ● Patterns often have starting points, i.e. nodes or relationships that are explicitly given.
        ● It is possible to specify the starting point using a WHERE clause (as on the previous slide), but this can be inefficient when there are no indexes.
        ● A more appropriate way of specifying the starting point (node or relationship) is the START keyword.
        ● These starting points are obtained via index lookups or, more rarely, accessed directly by node or relationship ID:
          – START n=node:index-name(key = "value")
          – START n=node(id)
    16. START clause example
        Find the mutual friends of the user named "Michael" [I. Robinson, J. Webber, E. Eifrem, Graph Databases, O’Reilly Media, 2013]
            START a=node:user(name='Michael')
            MATCH (c)-[:KNOWS]->(b)-[:KNOWS]->(a), (c)-[:KNOWS]->(a)
            RETURN b, c
    17. D3.js-based graph visualization of the example data set
    18. Transaction management
        ● Neo4j provides full ACID support.
        ● All relationships must have a valid start node and end node. In effect, this means that trying to delete a node that still has relationships attached will throw an exception upon commit.
        ● When updating or inserting massive amounts of data, the periodic commit query hint (USING PERIODIC COMMIT) can be helpful.
        ● Currently only one isolation level (READ_COMMITTED) is supported.
        ● In order to execute a query inside a transaction, POST the query to http://localhost:7474/db/data/transaction/{id}
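
As a rough illustration of the last point, the helper below only builds the URL and JSON body for such a POST and does not send anything. The URL shape comes from the slide. The `statements` / `statement` / `parameters` body layout is an assumption based on the Neo4j 2.x transactional HTTP endpoint and is not shown in the deck, and `transaction_request` is a hypothetical helper name.

```python
import json


def transaction_request(base_url, tx_id, statement, parameters=None):
    """Build (url, body) for running one Cypher statement in an open transaction.

    The body layout {"statements": [{"statement": ..., "parameters": ...}]}
    is assumed from the Neo4j 2.x HTTP API; check it against your version.
    """
    url = f"{base_url}/transaction/{tx_id}"
    body = {
        "statements": [
            {"statement": statement, "parameters": parameters or {}}
        ]
    }
    return url, json.dumps(body)
```

A call such as `transaction_request("http://localhost:7474/db/data", 7, "CREATE (n {name: {name}})", {"name": "Anna"})` yields a URL and body that an HTTP client of your choice could then POST.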
    19. Native Graph Storage [I. Robinson, J. Webber, E. Eifrem, Graph Databases, O’Reilly Media, 2013]
        There are separate stores for nodes, relationships and properties.
        So that a record's location can be computed at cost O(1), all stores are fixed-size record stores. Node records are 9 bytes long.
        Relationships are stored in doubly linked lists: firstPrevRelId and firstNextRelId point to the previous and next relationship records of the start node, and secondPrevRelId and secondNextRelId point to those of the end node.
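
The O(1) lookup follows directly from the fixed record size: the byte offset of a record is its ID multiplied by the record length. The sketch below uses the 9-byte node record size from the slide; the function names and the use of a `bytes` object as a stand-in for the store file are assumptions made for illustration.

```python
NODE_RECORD_SIZE = 9  # bytes per node record, as stated on the slide


def record_offset(record_id, record_size=NODE_RECORD_SIZE):
    """Byte offset of a record in a fixed-size record store: one multiplication."""
    return record_id * record_size


def read_record(store, record_id, record_size=NODE_RECORD_SIZE):
    """Slice one record out of `store` (a bytes object standing in for the file)."""
    start = record_offset(record_id, record_size)
    return store[start:start + record_size]
```

No search or index is involved: node 100 always starts at byte 900, however many nodes the store contains.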
    20. Scalability
        ● On a single server, Neo4j is capable of managing 34 × 10^9 nodes.
        ● Currently, only full database replication for read-only purposes is available.
          – Master-slave architecture to support fault tolerance
          – Horizontal scaling for read-mostly workloads
        ● Open transactions are not shared among members of an HA cluster. Therefore, if you use the transactional endpoint in an HA cluster, you must ensure that all requests for a given transaction are sent to the same Neo4j instance.
        ● As stated earlier, the data in a graph database are already "joined", so it is hard to partition (shard) a graph across multiple machines.
        ● The Neo4j team is working on this, but it is not ready yet. Ideally, tightly connected nodes (or nodes belonging to a common domain) would be kept together on the same machine, and loosely connected nodes (or nodes belonging to different domains) on separate machines.
        ● The problem is that a connection that is loose today can become tight in the future, and vice versa.
    21. Graph algorithms
        ● Both graph theory and graph algorithms are mature and well-understood fields of computer science, and both can be used to mine sophisticated information from graph databases.
        ● Neo4j supports both depth-first and breadth-first search.
          – The search type can be specified using BranchSelector and BranchOrderingPolicy.
        ● Graph algorithms available in Neo4j:
          – all paths (find all paths between two nodes)
          – all simple paths (find paths with no repeated nodes)
          – shortest paths (find paths with the fewest relationships); can find all shortest paths (if there is more than one) or just the first one
          – Dijkstra (find paths with the lowest cost)
          – A* (improved version of Dijkstra's algorithm)
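
Two of the listed algorithms are short enough to sketch in plain Python. This is a textbook version over a dictionary of the form `{node: {neighbour: cost}}`, not Neo4j's implementation, and the function names are hypothetical. Breadth-first search finds a path with the fewest relationships; Dijkstra finds the lowest total cost, assuming costs are non-negative.

```python
import heapq
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search: a path with the fewest relationships, or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent links back to the start and reverse them.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in graph.get(node, {}):
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None


def cheapest_path_cost(graph, start, goal):
    """Dijkstra: lowest total cost from start to goal, or None if unreachable."""
    best = {start: 0}
    heap = [(0, start)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == goal:
            return cost
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry; a cheaper route was already found
        for neighbour, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                heapq.heappush(heap, (new_cost, neighbour))
    return None
```

On a graph where the direct edge is expensive, the two answers differ: the path with the fewest relationships is not necessarily the cheapest one.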
    22. Example of finding the shortest path using the REST API
        Example request
            POST http://localhost:7474/db/data/node/35/path
            Accept: application/json; charset=UTF-8
            Content-Type: application/json

            {
              "to" : "http://localhost:7474/db/data/node/30",
              "max_depth" : 3,
              "relationships" : { "type" : "to", "direction" : "out" },
              "algorithm" : "shortestPath"
            }
        Example response
            200: OK
            Content-Type: application/json; charset=UTF-8

            {
              "start" : "http://localhost:7474/db/data/node/35",
              "nodes" : [
                "http://localhost:7474/db/data/node/35",
                "http://localhost:7474/db/data/node/31",
                "http://localhost:7474/db/data/node/30"
              ],
              "length" : 2,
              "relationships" : [
                "http://localhost:7474/db/data/relationship/26",
                "http://localhost:7474/db/data/relationship/32"
              ],
              "end" : "http://localhost:7474/db/data/node/30"
            }
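
The request and response above are plain JSON, so a client can assemble and interpret them with the standard library alone. The sketch below mirrors the field names from the slide; `shortest_path_request` and `node_ids` are hypothetical helpers, and nothing is sent over the network.

```python
import json

BASE = "http://localhost:7474/db/data"  # service root used on the slides


def shortest_path_request(from_id, to_id, rel_type, max_depth):
    """Build (url, body) for the path request shown on the slide."""
    url = f"{BASE}/node/{from_id}/path"
    body = {
        "to": f"{BASE}/node/{to_id}",
        "max_depth": max_depth,
        "relationships": {"type": rel_type, "direction": "out"},
        "algorithm": "shortestPath",
    }
    return url, json.dumps(body)


def node_ids(path_response):
    """Extract the numeric node IDs from the node URLs of a path response."""
    return [int(url.rsplit("/", 1)[1]) for url in path_response["nodes"]]
```

Applied to the example response, `node_ids` gives the route 35, 31, 30, and the reported `length` of 2 equals the number of relationships on that route.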
    23. Spring Data Neo4j
        Spring Data is an umbrella project that makes it easy to use new data access technologies, such as non-relational databases, map-reduce frameworks, and cloud-based data services.
        Spring Data Neo4j is an integration library for Neo4j, and it was the first Spring Data project.
            @NodeEntity
            public class Movie {
                @GraphId Long id;

                @Indexed(type = FULLTEXT, indexName = "search")
                String title;

                Person director;

                @RelatedTo(type="ACTS_IN", direction = INCOMING)
                Set<Person> actors;

                @Query("start movie=node({self}) match movie-->genre<--similar return similar")
                Iterable<Movie> similarMovies;
            }
    24. Bibliography
        ● I. Robinson, J. Webber, E. Eifrem, Graph Databases, O’Reilly Media, 2013
        ● R. Angles, C. Gutierrez, Survey of graph database models, ACM Computing Surveys (CSUR), 2008
        ● M. A. Rodriguez, P. Neubauer, The Graph Traversal Pattern, Graph Data Management: Techniques and Applications, 2011
        ● J. Partner, A. Vukotic, N. Watt, Neo4j in Action, Manning, 2014
        ● E. Redmond, J. R. Wilson, Seven Databases in Seven Weeks, The Pragmatic Bookshelf, 2012
        ● G. Schreiber, Y. Raimond, RDF 1.1 Primer, W3C, 2014
    25. Thank you!
