Semantic Data Management in Graph Databases
The tutorial describes existing approaches to model graph databases and different techniques implemented in RDF and Database engines including their main drawbacks when a large volume of interconnected data needs to be traversed.

Presentation Transcript

  • Semantic Data Management in Graph Databases
    Tutorial at ESWC 2013
    Maria-Esther Vidal, Edna Ruckhaus, Maribel Acosta, Cosmin Basca
    USB
  • Graphs …
    Examples: a network of friends in a high school; an annotation graph representing relations between clinical trials; air traffic between US cities; a fragment of Facebook; relationships between comments about a topic in Twitter.
  • Tasks to be Solved …
    A significant increase of graph data in the form of social and biological information.
    - Patterns of connections between people, to understand the functioning of society.
    - Topological properties of graphs can be used to identify patterns that reveal phenomena and anomalies, and potentially lead to a discovery.
  • Tasks to be Solved … (2)
    The importance of the information relies on the relations as much as, or more than, on the entities. [Anderson et al. 2012]
    - Relatedness between two nodes.
    - Best matching between two sets of sub-graphs.
    - Keyword queries.
    - Proximity patterns.
    - Frequent sub-graphs.
    - Graph ranking.
  • Basic Graph Operations
    Basic graph operations that need to be performed efficiently:
    - Graph traversal.
    - Measuring path length.
    - Finding the most interesting central node.
  • NoSQL-based Approaches
    These approaches can handle unstructured, unpredictable or messy data.
    - Wide-column stores: BigTable model of Google; Cassandra.
    - Document stores: semi-structured data; MongoDB.
    - Key-value stores: Berkeley DB.
    - Graph databases: graph-oriented data.
  • Agenda
    1. Basic Concepts & Background (25 min.)
    2. The Graph Data Management Paradigm (5 min.)
    3. Existing Graph Database Engines (45 min.)
       Coffee Break (30 min.)
    4. Existing RDF Graph Engines (50 min.)
    5. Hands-on session (25 min.)
    6. Questions & Discussion (5 min.)
    7. Summary & Closing (10 min.)
  • 1. BASIC CONCEPTS & BACKGROUND
  • Abstract Data Type Graph
    G = (V, E, Σ, L) is a graph:
    - V is a finite set of nodes or vertices, e.g., V = {Term, forOffice, Organization, …}
    - E is a set of edges representing binary relationships between elements in V, e.g., E = {(forOffice, Term), (forOffice, Organization), (Office, Organization), …}
    - Σ is a set of labels, e.g., Σ = {domain, range, sc, type, …}
    - L is a function L: V × V → Σ, e.g., L = {((forOffice, Term), domain), ((forOffice, Organization), range), …}
  • Abstract Data Type Multi-Graph
    G = (V, E, Σ, L) is a multi-graph:
    - V is a finite set of nodes or vertices, e.g., V = {Term, forOffice, Organization, …}
    - E is a set of edges representing binary relationships between elements in V, e.g., E = {(forOffice, Term), (forOffice, Organization), (Office, Organization), …}
    - Σ is a set of labels, e.g., Σ = {domain, range, sc, type, …}
    - L is a function L: V × V → PowerSet(Σ), e.g., L = {((forOffice, Term), {domain}), ((forOffice, Organization), {range}), ((_id0, AZ), {forOffice, forOrganization}), …}
  • Basic Operations
    Given a graph G, the following are operations over G:
    - AddNode(G,x): adds node x to the graph G.
    - DeleteNode(G,x): deletes the node x from graph G.
    - Adjacent(G,x,y): tests if there is an edge from x to y.
    - Neighbors(G,x): nodes y s.t. there is an edge from x to y.
    - AdjacentEdges(G,x,y): set of labels of edges from x to y.
    - Add(G,x,y,l): adds an edge between x and y with label l.
    - Delete(G,x,y,l): deletes an edge between x and y with label l.
    - Reach(G,x,y): tests if there is a path from x to y.
    - Path(G,x,y): a (shortest) path from x to y.
    - 2-hop(G,x): set of nodes y s.t. there is a path of length 2 from x to y, or from y to x.
    - n-hop(G,x): set of nodes y s.t. there is a path of length n from x to y, or from y to x.
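The operations listed above can be sketched over a labeled multi-graph kept in nested hash maps. This is only an illustration: the class `LabeledMultiGraph`, its method names and the use of `String` node identifiers are assumptions made here, not something taken from the tutorial or from any engine it covers.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class; not part of any engine discussed in the tutorial.
final class LabeledMultiGraph {
    // node -> (successor -> set of edge labels); models L: V x V -> PowerSet(Σ)
    private final Map<String, Map<String, Set<String>>> out = new HashMap<>();

    void addNode(String x) {
        out.computeIfAbsent(x, k -> new HashMap<>());
    }

    void deleteNode(String x) {
        out.remove(x);
        for (Map<String, Set<String>> successors : out.values()) {
            successors.remove(x); // also drop the edges that pointed to x
        }
    }

    void add(String x, String y, String label) {
        addNode(x);
        addNode(y);
        out.get(x).computeIfAbsent(y, k -> new HashSet<>()).add(label);
    }

    void delete(String x, String y, String label) {
        Map<String, Set<String>> successors = out.get(x);
        if (successors == null || !successors.containsKey(y)) {
            return;
        }
        successors.get(y).remove(label);
        if (successors.get(y).isEmpty()) {
            successors.remove(y);
        }
    }

    boolean adjacent(String x, String y) {
        return out.containsKey(x) && out.get(x).containsKey(y);
    }

    Set<String> neighbors(String x) {
        return out.containsKey(x) ? out.get(x).keySet() : Collections.<String>emptySet();
    }

    Set<String> adjacentEdges(String x, String y) {
        return adjacent(x, y) ? out.get(x).get(y) : Collections.<String>emptySet();
    }

    // Reach(G,x,y) via breadth-first search; a node is considered to reach itself.
    boolean reach(String x, String y) {
        Set<String> visited = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        visited.add(x);
        queue.add(x);
        while (!queue.isEmpty()) {
            String current = queue.poll();
            if (current.equals(y)) {
                return true;
            }
            for (String next : neighbors(current)) {
                if (visited.add(next)) {
                    queue.add(next);
                }
            }
        }
        return false;
    }
}
```

With hash maps, Adjacent and AdjacentEdges become expected constant-time lookups; that is a property of this sketch and differs from the list-based bounds quoted on the following slides.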
  • Implementation of Graphs
    - Adjacency List: for each node, a list of neighbors. If the graph is directed, the adjacency list of i contains only the outgoing nodes of i. Cheaper for obtaining the neighbors of a node. Not suitable for checking if there is an edge between two nodes.
    - Incidence List: vertices and edges are stored as records or objects. Each vertex stores its incident edges. Each edge stores its incident nodes.
    - Incidence Matrix: bidimensional representation of the graph. Rows represent vertices. Columns represent edges. An entry of 1 represents that the source vertex is incident to the edge.
    - Adjacency Matrix: bidimensional representation of the graph. Rows represent source vertices. Columns represent destination vertices. Each entry with 1 represents that there is an edge from the source node to the destination node.
    - Compressed Adjacency Matrix: differential encoding between two consecutive nodes.
    [Sakr and Pardede 2012]
  • Adjacency List
    [Figure: example graph with nodes V1 to V4 and edges labeled L1, L2, L3; the adjacency lists hold the entries (V1,{L2}), (V3,{L3}) and (V1,{L1}).]
    Properties:
    - Storage: O(|V|+|E|+|L|)
    - Adjacent(G,x,y): O(|E|)
    - Neighbors(G,x): O(|E|)
    - AdjacentEdges(G,x,y): O(|E|)
    - Add(G,x,y,l): O(|E|)
    - Delete(G,x,y,l): O(|E|)
  • Incidence List
    [Figure: example graph with nodes V1 to V4 and edges L1, L2, L3 over the node pairs (V4,V1), (V2,V1), (V2,V3); each vertex record stores (source, Li) or (destination, Li) entries for its incident edges.]
    Properties:
    - Storage: O(|V|+|E|+|L|)
    - Adjacent(G,x,y): O(|E|)
    - Neighbors(G,x): O(|E|)
    - AdjacentEdges(G,x,y): O(|E|)
    - Add(G,x,y,l): O(|E|)
    - Delete(G,x,y,l): O(|E|)
  • Adjacency Matrix
    [Figure: example graph with nodes V1 to V4 and edges labeled L1, L2, L3; a 4 x 4 matrix over V1 to V4 with the non-null entries {L2}, {L3} and {L1}.]
    Properties:
    - Storage: O(|V|^2)
    - Adjacent(G,x,y): O(1)
    - Neighbors(G,x): O(|V|)
    - AdjacentEdges(G,x,y): O(|E|)
    - Add(G,x,y,l): O(|E|)
    - Delete(G,x,y,l): O(|E|)
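To make the O(1) adjacency test and the O(|V|) neighbor scan concrete, here is a minimal adjacency-matrix sketch. The class name `AdjacencyMatrixGraph`, the integer node numbering (0 for V1 up to 3 for V4) and the edge placement used in `main` (V2→V1 with L2, V2→V3 with L3, V4→V1 with L1) are assumptions read off the example figure, not code from the tutorial.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration class; nodes are numbered 0..n-1.
final class AdjacencyMatrixGraph {
    private final int n;
    // Row-major n x n matrix; a null cell means "no edge".
    private final List<Set<String>> cells;

    AdjacencyMatrixGraph(int n) {
        this.n = n;
        this.cells = new ArrayList<>(Collections.nCopies(n * n, (Set<String>) null));
    }

    void add(int x, int y, String label) {
        Set<String> labels = cells.get(x * n + y);
        if (labels == null) {
            labels = new HashSet<>();
            cells.set(x * n + y, labels);
        }
        labels.add(label);
    }

    // One cell lookup, independent of the graph size.
    boolean adjacent(int x, int y) {
        return cells.get(x * n + y) != null;
    }

    // Scans the whole row of x, hence linear in the number of nodes.
    List<Integer> neighbors(int x) {
        List<Integer> result = new ArrayList<>();
        for (int y = 0; y < n; y++) {
            if (adjacent(x, y)) {
                result.add(y);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        AdjacencyMatrixGraph g = new AdjacencyMatrixGraph(4);
        g.add(1, 0, "L2"); // assumed: V2 -> V1
        g.add(1, 2, "L3"); // assumed: V2 -> V3
        g.add(3, 0, "L1"); // assumed: V4 -> V1
        System.out.println(g.neighbors(1));
    }
}
```

The quadratic storage is visible in the constructor: the matrix is allocated in full even though only three cells are ever filled.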
  • Incidence Matrix
    [Figure: example graph with nodes V1 to V4 and edges labeled L1, L2, L3; a matrix with rows V1 to V4 and columns L1, L2, L3 whose entries mark each vertex as source or destination of the edge.]
    Properties:
    - Storage: O(|V|x|E|)
    - Adjacent(G,x,y): O(|E|)
    - Neighbors(G,x): O(|V|x|E|)
    - AdjacentEdges(G,x,y): O(|E|)
    - Add(G,x,y,l): O(|V|)
    - Delete(G,x,y,l): O(|V|)
  • Compressed Adjacency List: Gap Encoding
    The ordered adjacency list A(x) = (a_1, a_2, a_3, …, a_n) is encoded as
    $(v(a_1 - x),\ a_2 - a_1 - 1,\ a_3 - a_2 - 1,\ \ldots,\ a_n - a_{n-1} - 1)$
    where
    $v(x) = \begin{cases} 2x & \text{if } x \ge 0 \\ 2|x| - 1 & \text{if } x < 0 \end{cases}$
    Original adjacency list (Node: Neighbors):
    - 1: 20, 23, 24, 25, 27
    - 2: 30, 32, 33, 34
    - 3: 40, 42, 45, 46, 47, 48
    - …
    Compressed adjacency list (Node: Successors):
    - 1: 38, 2, 0, 0, 1
    - 2: 56, 1, 0, 0
    - 3: 74, 1, 2, 0, 0, 0
    - …
    Properties:
    - Storage: O(|V|+|E|+|L|)
    - Adjacent(G,x,y): O(|E|)
    - Neighbors(G,x): O(|E|)
    - AdjacentEdges(G,x,y): O(|E|)
    - Add(G,x,y,l): O(|E|)
    - Delete(G,x,y,l): O(|E|)
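The gap encoding above can be checked with a few lines of code. The sketch below follows the formula on the slide; the class name `GapEncoding`, the helper `vInverse` and the restriction to strictly increasing neighbor lists are assumptions made for the illustration.

```java
import java.util.Arrays;

// Hypothetical illustration of the gap encoding; assumes strictly increasing neighbor ids.
final class GapEncoding {
    // Maps the (possibly negative) first gap to a non-negative integer.
    static int v(int x) {
        return x >= 0 ? 2 * x : 2 * Math.abs(x) - 1;
    }

    // Inverse of v: even codes are non-negative gaps, odd codes are negative gaps.
    static int vInverse(int code) {
        return code % 2 == 0 ? code / 2 : -(code + 1) / 2;
    }

    static int[] encode(int node, int[] sortedNeighbors) {
        int[] encoded = new int[sortedNeighbors.length];
        for (int i = 0; i < sortedNeighbors.length; i++) {
            encoded[i] = (i == 0)
                    ? v(sortedNeighbors[0] - node)
                    : sortedNeighbors[i] - sortedNeighbors[i - 1] - 1;
        }
        return encoded;
    }

    static int[] decode(int node, int[] encoded) {
        int[] neighbors = new int[encoded.length];
        for (int i = 0; i < encoded.length; i++) {
            neighbors[i] = (i == 0)
                    ? node + vInverse(encoded[0])
                    : neighbors[i - 1] + encoded[i] + 1;
        }
        return neighbors;
    }

    public static void main(String[] args) {
        // First row of the slide's table; prints [38, 2, 0, 0, 1]
        System.out.println(Arrays.toString(encode(1, new int[] {20, 23, 24, 25, 27})));
    }
}
```

Small gaps produce small integers, which is what makes the list compressible by a variable-length integer code afterwards; that last step is not shown here.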
  • Traversal Search
    - Breadth First Search: expands the shallowest unexpanded nodes first. Search is complete.
    - Depth First Search: expands the deepest unexpanded nodes first. Search may not be complete: it may fail in loops.
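Both strategies can be sketched over a plain adjacency map. The class `Traversals` and its method names are assumptions for this illustration; the visited set is what keeps the depth-first variant from looping forever on cyclic graphs, which is the failure mode mentioned above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class; the graph is a map from node to successor list.
final class Traversals {
    // Breadth-first: a FIFO queue yields nodes level by level.
    static List<Integer> bfs(Map<Integer, List<Integer>> adj, int start) {
        List<Integer> order = new ArrayList<>();
        Set<Integer> visited = new HashSet<>();
        Deque<Integer> queue = new ArrayDeque<>();
        visited.add(start);
        queue.add(start);
        while (!queue.isEmpty()) {
            int node = queue.poll();
            order.add(node);
            for (int next : adj.getOrDefault(node, Collections.emptyList())) {
                if (visited.add(next)) {
                    queue.add(next);
                }
            }
        }
        return order;
    }

    // Depth-first: recursion follows one branch to the end before backtracking.
    static List<Integer> dfs(Map<Integer, List<Integer>> adj, int start) {
        List<Integer> order = new ArrayList<>();
        dfsVisit(adj, start, new HashSet<>(), order);
        return order;
    }

    private static void dfsVisit(Map<Integer, List<Integer>> adj, int node,
                                 Set<Integer> visited, List<Integer> order) {
        if (!visited.add(node)) {
            return; // already expanded: this check breaks cycles
        }
        order.add(node);
        for (int next : adj.getOrDefault(node, Collections.emptyList())) {
            dfsVisit(adj, next, visited, order);
        }
    }
}
```

On a graph with edges 1→2, 1→3, 2→4, 3→4 and 4→1, starting at node 1, the breadth-first order is 1, 2, 3, 4 and the depth-first order is 1, 2, 4, 3.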
  • Breadth First Search
    [Figure: example graph with numbered nodes. Notation: starting node; first-level, second-level and third-level visited nodes.]
  • Depth First Search
    [Figure: example graph with numbered nodes. Notation: starting node; first-level, second-level and third-level visited nodes.]
  • 2. THE GRAPH DATA MANAGEMENT PARADIGM
  • The Graph Data Management Paradigm
    Graph database models:
    - Data models for schema and instances are graph models or generalizations of them.
    - Data manipulation is expressed as graph-based operations.
  • Survey of Graph Database Models
    A graph data model represents data and schema as:
    - Graphs, or
    - More general notions of graphs: hypergraphs, digraphs.
    [Angles and Gutiérrez 2008]
  • Survey of Graph Database Models
    Types of relationship supported by graph data models:
    - Attributes: properties, mono- or multi-valued.
    - Entities: groups of real-world objects.
    - Neighborhood relations: structures to represent neighborhoods of an entity.
    - Standard abstractions: part-of, composed-by, n-ary associations.
    - Derivation and inheritance: subclasses and superclasses, relations of instantiations.
    - Nested relations: recursively specified relations.
    [Angles and Gutiérrez 2008]
  • Representative Approaches
    - No index: only works for "toy" graphs, with vertices on the scale of 1K.
    - Node/edge index: create indexes on different nodes/edges. Examples: Hexastore, RDF-3X, BitMat, Neo4j.
    - Frequent subgraph index: indices on frequently queried subgraphs.
    - Reachability and neighborhood indices: 2-hop labeling schemas for index structures; every two nodes within distance L.
  • 3. GRAPH DATABASE ENGINES
  • Graph Databases: Advantages
    - Natural modeling of highly connected data.
    - Special graph storage structure.
    - Efficient schemaless graph algorithms.
    - Support for query languages.
    - Operators to query the graph structure.
  • Graph Databases
    - http://www.neo4j.org
    - http://www.hypergraphdb.org
    - http://www.sparsity-technologies.com/dex.php
  • Existing General Graph Database Engines
    - DEX: Java library for management of persistent and temporary graphs. Implementation relies on bitmaps and secondary structures (B+-tree).
    - HyperGraphDB: implements the hypergraph data model.
    - Neo4j: network-oriented model where relations are first-class objects. Native disk-based storage manager for graphs. Framework for graph traversal.
    Most graph databases implement an API instead of a query language.
  • DEX (1)
    - Labeled attribute multi-graph.
    - All node and edge objects belong to a type.
    - Edges have a direction. Tail: corresponds to the source of the edge. Head: corresponds to the destination of the edge.
    - Node and edge objects can have one or more attributes.
    - There are no restrictions on the number of edges between two nodes.
    - Loops are allowed.
    - Attributes are single valued.
    [Martínez-Basan et al. 2007] http://www.sparsity-technologies.com/dex_tutorials
  • DEX (2)
    - Attributes can have different indexes: basic attributes, indexed attributes, unique attributes.
    - Edges can index neighborhoods: neighborhood index.
    - The persistent database of DEX is a single file.
    - DEX can manage very large graphs.
    [Martínez-Basan et al. 2007] http://www.sparsity-technologies.com/dex_tutorials
  • DEX Architecture
    [Martínez-Basan et al. 2007] http://www.sparsity-technologies.com/dex_tutorials
  • DEX Internal Representation (1)
    DEX approach: Map + Bitmaps → Links
    Link: bidirectional association between values and OIDs.
    - Given a value → a set of OIDs.
    - Given an OID → the value.
    [Martínez-Basan et al. 2007] http://www.sparsity-technologies.com/dex_tutorials
  • DEX Internal Representation (2)
    A DEX graph is a combination of bitmaps:
    - One bitmap for each node or edge type.
    - One link for each attribute.
    - Two links for each type: out-going and in-going edges.
    Maps are B+-trees. A compressed UTF-8 storage is used for UNICODE strings.
    [Martínez-Basan et al. 2007] http://www.sparsity-technologies.com/dex_tutorials
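The idea of a link (value to set of OIDs, and OID back to value) can be sketched with standard Java collections. This is not DEX code: the class `AttributeLink`, the use of `java.util.BitSet` as the bitmap and of `TreeMap` as a stand-in for the B+-tree are all assumptions chosen to mirror the description above.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a "link" for one attribute; not DEX's actual implementation.
final class AttributeLink {
    // Sorted map standing in for the B+-tree: value -> bitmap of object ids.
    private final TreeMap<String, BitSet> valueToOids = new TreeMap<>();
    // Reverse direction: object id -> value (attributes are single valued).
    private final Map<Integer, String> oidToValue = new HashMap<>();

    void set(int oid, String value) {
        String old = oidToValue.put(oid, value);
        if (old != null) {
            BitSet oldBits = valueToOids.get(old);
            oldBits.clear(oid); // the object no longer carries the old value
            if (oldBits.isEmpty()) {
                valueToOids.remove(old);
            }
        }
        valueToOids.computeIfAbsent(value, k -> new BitSet()).set(oid);
    }

    // Given a value, the set of OIDs (returned as a copy of the bitmap).
    BitSet oidsOf(String value) {
        BitSet bits = valueToOids.get(value);
        return bits == null ? new BitSet() : (BitSet) bits.clone();
    }

    // Given an OID, the value (null if the attribute is not set).
    String valueOf(int oid) {
        return oidToValue.get(oid);
    }
}
```

Because both the per-type sets and the per-value sets are bitmaps in this picture, a selection such as "objects of type AUTHOR whose name equals some value" reduces to a bitwise AND of two bitmaps, e.g. `BitSet.and` in this sketch.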
  • DEX API
    Operations of the Abstract Data Type Graph:
    - Graph construction.
    - Definition of node and edge types.
    - Creation of node and edge objects.
    - Definition and use of attributes.
    - Query nodes and edges.
    - Management of edges and nodes.
    http://www.sparsity-technologies.com/dex_tutorials
  • DEX API (2): Graph construction
      DexConfig cfg = new DexConfig();
      cfg.setCacheMaxSize(2048); // 2 GB
      cfg.setLogFile("PUBLICATIONSDex.log");
      Dex dex = new Dex(cfg);
      Database db = dex.create("Publications.dex", "PUBLICATIONSDex");
      Session sess = db.newSession();
      Graph graph = sess.getGraph();
      …
      sess.close();
      db.close();
      dex.close();
    Operations on the graph database will be performed on the object graph.
  • Example
    [Figure: publication graph with author nodes (Peter Smith, Juan Perez, John Smith), paper nodes (Paper1 to Paper5) and conference nodes (ESWC, ISWC), connected by Wrote, Cites and Conference edges.]
  • DEX API (3): Adding nodes to the graph
      int authorTypeId = graph.newNodeType("AUTHOR");
      int paperTypeId = graph.newNodeType("PAPER");
      int conferenceTypeId = graph.newNodeType("CONFERENCE");
      long author1 = graph.newNode(authorTypeId);
      long author2 = graph.newNode(authorTypeId);
      long author3 = graph.newNode(authorTypeId);
      long paper1 = graph.newNode(paperTypeId);
      long paper2 = graph.newNode(paperTypeId);
      long paper3 = graph.newNode(paperTypeId);
      long paper4 = graph.newNode(paperTypeId);
      long paper5 = graph.newNode(paperTypeId);
      long conference1 = graph.newNode(conferenceTypeId);
      long conference2 = graph.newNode(conferenceTypeId);
  • DEX API (4): Adding edges to the graph
      int wroteTypeId = graph.newEdgeType("WROTE", true, true);
      int isWrittenTypeId = graph.newEdgeType("Is-Written", true, true);
      int citeTypeId = graph.newEdgeType("CITE", true, true);
      int isCitedByTypeId = graph.newEdgeType("IsCitedBy", true, true);
      // renamed so it does not clash with the CONFERENCE node type id
      int isConferenceTypeId = graph.newEdgeType("IsConference", true, true);
      long wrote1 = graph.newEdge(wroteTypeId, author1, paper1);
      long wrote2 = graph.newEdge(wroteTypeId, author1, paper3);
      long wrote3 = graph.newEdge(wroteTypeId, author1, paper5);
      long wrote4 = graph.newEdge(wroteTypeId, author2, paper3);
      long wrote5 = graph.newEdge(wroteTypeId, author2, paper4);
      long wrote6 = graph.newEdge(wroteTypeId, author3, paper2);
      long wrote7 = graph.newEdge(wroteTypeId, author3, paper4);
  • DEX API (5): Indexing
    - Basic attributes: there is no index associated with the attribute.
    - Indexed attributes: there is an index, automatically maintained by the system, associated with the attribute.
    - Unique attributes: the same as for indexed attributes but with an added integrity restriction: two different objects cannot have the same value, with the exception of the null value.
    - DEX operations: accessing the graph through an attribute will automatically use the defined index, significantly improving the performance of the operation.
    - A specific index can also be defined to improve certain navigational operations, for example, Neighborhood.
    - The AttributeKind enum class includes basic, indexed and unique attributes.
  • DEX API (6): Adding attributes to nodes of the graph
      int nameAttrId = graph.newAttribute(authorTypeId, "Name", DataType.String, AttributeKind.Indexed);
      int conferenceNameAttrId = graph.newAttribute(conferenceTypeId, "ConferenceName", DataType.String, AttributeKind.Indexed);
      Value v = new Value();
      graph.setAttribute(author1, nameAttrId, v.setString("Peter Smith"));
      graph.setAttribute(author2, nameAttrId, v.setString("Juan Perez"));
      graph.setAttribute(author3, nameAttrId, v.setString("John Smith"));
      graph.setAttribute(conference1, conferenceNameAttrId, v.setString("ESWC"));
      graph.setAttribute(conference2, conferenceNameAttrId, v.setString("ISWC"));
  • DEX API (7): Searching in the graph
      Value v = new Value();
      // retrieve all "AUTHOR" node objects
      Objects authorObjs1 = graph.select(authorTypeId);
      ...
      // retrieve Peter Smith from the graph, which is an "AUTHOR" node
      Objects authorObjs2 = graph.select(nameAttrId, Condition.Equal, v.setString("Peter Smith"));
      ...
      // retrieve all author node objects having "Smith" in the name.
      // It would retrieve Peter Smith and John Smith
      Objects authorObjs3 = graph.select(nameAttrId, Condition.Like, v.setString("Smith"));
      ...
      authorObjs1.close();
      authorObjs2.close();
      authorObjs3.close();
  • DEX API (8): Searching in the graph
      Graph graph = sess.getGraph();
      Value v = new Value();
      ...
      // retrieve all author node objects having a value for the name attribute
      // satisfying the "^J[^]*n$" regular expression
      Objects authorObjs = graph.select(nameAttrId, Condition.RegExp, v.setString("^J[^]*n$"));
      ...
      authorObjs.close();
  • DEX API (9): Searching in the graph
      Value v = new Value();
      Graph graph = sess.getGraph();
      sess.begin();
      int authorTypeId = graph.findType("AUTHOR");
      int nameAttrId = graph.findAttribute(authorTypeId, "NAME");
      v.setString("Peter Smith");
      Long peterSmith = graph.findObject(nameAttrId, v);
      sess.commit();
  • DEX API (10): Traversing the graph
    The navigational direction can be restricted through the edges. The EdgesDirection enum class:
    - Outgoing: only out-going edges, starting from the source node, will be valid for the navigation.
    - Ingoing: only in-going edges will be valid for the navigation.
    - Any: both out-going and in-going edges will be valid for the navigation.
  • DEX API (11): Traversing the graph
      Graph graph = sess.getGraph();
      ...
      Objects authorObjs1 = graph.select(authorTypeId);
      ...
      // retrieve Peter Smith from the graph, which is an "AUTHOR" node
      Objects node1 = graph.select(nameAttrId, Condition.Equal, v.setString("Peter Smith"));
      Objects node2 = graph.select(nameAttrId, Condition.Like, v.setString("Paper"));
      int wroteTypeId = graph.findType("WROTE");
      int citeTypeId = graph.findType("CITE");
      ...
      // retrieve all in-coming WROTE edges
      Objects edges = graph.explode(node1, wroteTypeId, EdgesDirection.Ingoing);
      ...
      // retrieve all nodes through CITE edges
      Objects cites = graph.neighbors(node2, citeTypeId, EdgesDirection.Any);
      ...
      edges.close();
      cites.close();
    Explode-based methods visit the edges of a given node. Neighbor-based methods visit the neighbor nodes of a given node identifier.
  • DEX API (12): Traversing the graph
      Graph graph = sess.getGraph();
      ...
      Objects node2 = graph.select(nameAttrId, Condition.Like, v.setString("Paper"));
      int citeTypeId = graph.findType("CITE");
      // 1-hop cites edges
      Objects cites1 = graph.neighbors(node2, citeTypeId, EdgesDirection.Any);
      // cites of cites (2-hop) edges
      Objects cites2 = graph.neighbors(cites1, citeTypeId, EdgesDirection.Any);
      …
      // close the result sets opened in this snippet
      cites1.close();
      cites2.close();
      node2.close();
  • DEX API Summary
    Operation: Description
    - CreateGraph: creates an empty graph.
    - DropGraph: removes a graph.
    - RegisterType: registers a new node or edge type.
    - GetTypes: returns the list of all node or edge types.
    - FindType: returns the id of a node or edge type.
    - NewNode: creates a new empty node of a given node type.
    - AddNode: copies a node.
    - AddEdge: adds a new edge of some edge type.
    - DropObject: removes a node or an edge.
    - Scan: returns all nodes or edges of a given type.
    - Select: returns the nodes or edges of a given type which have an attribute that evaluates true to a comparison operator over a value.
    - Related: returns all neighbors of a node following edges that are members of an edge type.
    - Neighbors: returns all neighbors.
  • HyperGraphDB (1)
    - Represents graph and hypergraph structures, a mathematical generalization of a graph, where several edges correspond to a hyperedge.
    - Hyperedges connect an arbitrary set of nodes (n-ary relations) and can point to other hyperedges (higher-order relations).
    - This representation model saves space.
    [Figure: two drawings over the nodes Researcher:Bob, Researcher:Mary, Area:Database, Area:AI and Area:SW.]
    [Iordanov 2010] http://www.kobrix.com/index.jsp
  • HyperGraphDB (2)
    - Edges are represented as ordered sets (directed graphs with hyperedges).
    - Nodes and edges are unified into something called an atom, which is a typed tuple.
    - The elements of a tuple are known as the Target Set.
    - The size of a Target Set is the arity of a tuple.
    - If a tuple x has 0 arity, it is a node; otherwise it is a link (or edge).
    - The set of atoms pointing to a tuple is called the incidence set of the tuple.
    - Every atom has a Handle (unique identifier).
    - Each entity has a type and a value.
    - A type is an atom conforming to a special interface; values are data managed and stored by a type atom.
    [Iordanov 2010] http://www.kobrix.com/index.jsp
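The atom model described above (typed tuples, target sets, arity, incidence sets, handles) can be mimicked in a few lines. The sketch below is not the HyperGraphDB API: the class `AtomStore`, integer handles and the untyped `Object` values are simplifying assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical miniature of the atom model; handles are plain ints here.
final class AtomStore {
    private final Map<Integer, Object> values = new HashMap<>();
    private final Map<Integer, List<Integer>> targetSets = new HashMap<>();
    private final Map<Integer, Set<Integer>> incidenceSets = new HashMap<>();
    private int nextHandle = 0;

    // Adds an atom; with no targets it is a node, otherwise a link (hyperedge).
    int add(Object value, int... targets) {
        int handle = nextHandle++;
        values.put(handle, value);
        List<Integer> targetSet = new ArrayList<>();
        for (int target : targets) {
            targetSet.add(target);
            // Every target learns that this new atom points to it.
            incidenceSets.computeIfAbsent(target, k -> new HashSet<>()).add(handle);
        }
        targetSets.put(handle, targetSet);
        return handle;
    }

    Object get(int handle) {
        return values.get(handle);
    }

    int arity(int handle) {
        return targetSets.get(handle).size();
    }

    boolean isNode(int handle) {
        return arity(handle) == 0;
    }

    // The atoms that point to the given atom.
    Set<Integer> incidenceSet(int handle) {
        return incidenceSets.getOrDefault(handle, Collections.emptySet());
    }
}
```

Since links are atoms themselves, a link can appear in the target set of another link, which is how higher-order relations arise in this model.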
  • HyperGraphDB Storage
    - Data is stored in the form of key-value pairs.
    - Berkeley DB is used for storage of key-value structures. It provides three access methods for data:
      - Hashes: linear hash.
      - B-Trees: data stored in leaves of a balanced tree.
      - Recno: assigns a logical record identifier to each pair (indexed for direct access).
    - Based on a key-value layer with duplicate support.
    - Handles are custom generated UUIDs, ints, longs, etc.
    - Several key-value stores: some are core, some are user-defined indices.
    http://www.kobrix.com/index.jsp
  • HyperGraphDB Operations and Indexing
    Essential graph queries: node/edge adjacency; pattern matching; graph traversal.
    Indexing by: properties; targets; user defined.
    Predefined indexers:
    - ByPartIndexer: indexes atoms with compound types along a given dimension.
    - ByTargetIndexer: indexes atoms by a specific target (a position in the target tuple).
    - CompositeIndexer: combines any two indexers into a single one. This essentially allows precomputing and maintaining an arbitrary join.
    - LinkIndexer: indexes atoms by their full target tuple.
    - TargetToTargetIndexer: given a link, indexes one of its targets by another.
    http://www.kobrix.com/index.jsp
  • HyperGraphDB API (1): Adding nodes to the graph
      HyperGraph graph = new HyperGraph(databaseLocation);
      AUTHOR author1 = new AUTHOR("Peter Smith");
      AUTHOR author2 = new AUTHOR("Juan Perez");
      AUTHOR author3 = new AUTHOR("John Smith");
      PAPER paper1 = new PAPER("Paper1");
      PAPER paper2 = new PAPER("Paper2");
      PAPER paper3 = new PAPER("Paper3");
      PAPER paper4 = new PAPER("Paper4");
      PAPER paper5 = new PAPER("Paper5");
      CONFERENCE conference1 = new CONFERENCE("ESWC");
      CONFERENCE conference2 = new CONFERENCE("ISWC");
      // graph.add returns the handle of the stored atom
      HGHandle paperHandle1 = graph.add(paper1);
      HGHandle paperHandle2 = graph.add(paper2);
      …
      HGHandle conferenceHandle2 = graph.add(conference2);
      …
      graph.close();
  • HyperGraphDB API (2): Adding edges to the graph
      HyperGraph graph = new HyperGraph(databaseLocation);
      AUTHOR author1 = new AUTHOR("Peter Smith");
      AUTHOR author2 = new AUTHOR("Juan Perez");
      AUTHOR author3 = new AUTHOR("John Smith");
      PAPER paper1 = new PAPER("Paper1");
      HGHandle paperHandle1 = graph.add(paper1);
      HGHandle yearHandle = graph.add(2013);
      // a link carrying the value "publication year" between the paper and the year
      HGValueLink link1 = new HGValueLink("publication year", paperHandle1, yearHandle);
      HGHandle linkHandle1 = graph.add(link1);
      graph.close();
  • HyperGraphDB API (3): Searching node properties
      HGQueryCondition condition = new And(
          new AtomTypeCondition(AUTHOR.class),
          new AtomPartCondition(new String[]{"name"}, "Peter Smith", ComparisonOperator.EQ));
      HGSearchResult<HGHandle> rs = graph.find(condition);
      while (rs.hasNext()) {
          HGHandle current = rs.next();
          AUTHOR author = graph.get(current);
          System.out.println(author.getBirthDate());
      }
  • HyperGraphDB API (4): Searching node properties
      List<AUTHOR> authors = hg.getAll(graph, hg.and(hg.type(AUTHOR.class), hg.eq("author", "Peter Smith")));
      for (AUTHOR p : authors) {
          System.out.println(p.getBirthDate());
      }
  • HyperGraphDB API (5): Searching node properties
      // follow CitedBy links; keep only PAPER atoms of the given track
      DefaultALGenerator algen = new DefaultALGenerator(graph,
          hg.type(CitedBy.class),
          hg.and(hg.type(PAPER.class), hg.eq("track", "Semantic Data Management")),
          true, false, false);
      // startingPaper is the handle of the paper where the traversal begins
      HGTraversal traversal = new HGBreadthFirstTraversal(startingPaper, algen);
      PAPER currentPaper = graph.get(startingPaper);
      while (traversal.hasNext()) {
          Pair<HGHandle, HGHandle> next = traversal.next();
          PAPER nextPaper = graph.get(next.getSecond());
          System.out.println("Paper " + currentPaper + " quotes " + nextPaper);
          currentPaper = nextPaper;
      }
    HGBreadthFirstTraversal traverses a graph breadth-first, given a startAtom and an adjListGenerator.
  • HyperGraphDB API (7): Searching node properties
      AUTHOR author1 = new AUTHOR("Peter Smith");
      // the traversal starts from the handle of the stored atom
      HGHandle authorHandle1 = graph.add(author1);
      HGDepthFirstTraversal traversal = new HGDepthFirstTraversal(authorHandle1, new SimpleALGenerator(graph));
      while (traversal.hasNext()) {
          Pair<HGHandle, HGHandle> current = traversal.next();
          HGLink l = (HGLink) graph.get(current.getFirst());
          Object atom = graph.get(current.getSecond());
          System.out.println("Visiting atom " + atom + " pointed to by " + l);
      }
    SimpleALGenerator produces all atoms linked to a given atom.
  • HyperGraphDB API Summary
    Class/Operation: Description
    - HGEnvironment: creates a hypergraph from a file.
    - Add: adds a new object to the hypergraph.
    - Update: updates an object in the hypergraph.
    - Remove: removes an object from the hypergraph.
    - HGHandle: a reference to a hypergraph atom.
    - HGPersistentHandle: an HGHandle that survives system downtime.
    - HGValueLink: adds a new hyperedge.
    - get: returns the object referred to by a given HGHandle.
    - HGQueryCondition: returns the nodes or edges of a given type which have an attribute that evaluates true to a comparison operator over a value.
    - HGQuery.hg.getAll: returns all the objects that meet a condition.
  • Neo4J Labeled attribute multigraph. Nodes and edges can have properties. There are norestrictions on the number of edgesbetween two nodes. Loops are allowed. Attributes are single valued. Different types of indexes: Nodes & relationships. Different types of traversal strategies. API for Java, Python.65[Robinson et al. 2013]http://neo4j.org/
  • Neo4J Architecture
    [Robinson et al. 2013] http://neo4j.org/
  • Neo4J Logical View
    [Robinson et al. 2013] http://neo4j.org/
  • Neo4J API (1): Creating a graph and adding nodes
      GraphDatabaseService db = new GraphDatabaseFactory().newEmbeddedDatabase(DB_PATH);
      Node author1 = db.createNode();
      Node author2 = db.createNode();
      Node author3 = db.createNode();
      Node paper1 = db.createNode();
      Node paper2 = db.createNode();
      Node paper3 = db.createNode();
      Node paper4 = db.createNode();
      Node paper5 = db.createNode();
      Node conference1 = db.createNode();
      Node conference2 = db.createNode();
  • Neo4J API (2): Adding attributes to nodes
      author1.setProperty("firstname", "Peter");
      author1.setProperty("lastname", "Smith");
      author2.setProperty("firstname", "Juan");
      author2.setProperty("lastname", "Perez");
      author3.setProperty("firstname", "John");
      author3.setProperty("lastname", "Smith");
      conference1.setProperty("name", "ESWC");
      conference2.setProperty("name", "ISWC");
  • Neo4J API (3): Adding relationships
      // DynamicRelationshipType.withName creates a relationship type from its name
      Relationship rel1 = author1.createRelationshipTo(paper1, DynamicRelationshipType.withName("WROTE"));
      Relationship rel2 = author1.createRelationshipTo(paper5, DynamicRelationshipType.withName("WROTE"));
      Relationship rel3 = author1.createRelationshipTo(paper3, DynamicRelationshipType.withName("WROTE"));
      …
      Relationship rel8 = paper1.createRelationshipTo(paper2, DynamicRelationshipType.withName("CITES"));
      Relationship rel9 = paper3.createRelationshipTo(paper1, DynamicRelationshipType.withName("CITES"));
      Relationship rel10 = paper3.createRelationshipTo(paper4, DynamicRelationshipType.withName("CITES"));
      Relationship rel11 = paper4.createRelationshipTo(paper2, DynamicRelationshipType.withName("CITES"));
      Relationship rel12 = paper1.createRelationshipTo(conference1, DynamicRelationshipType.withName("Conference"));
      Relationship rel13 = paper2.createRelationshipTo(conference2, DynamicRelationshipType.withName("Conference"));
      Relationship rel14 = paper3.createRelationshipTo(conference1, DynamicRelationshipType.withName("Conference"));
      Relationship rel15 = paper4.createRelationshipTo(conference1, DynamicRelationshipType.withName("Conference"));
      Relationship rel16 = paper5.createRelationshipTo(conference2, DynamicRelationshipType.withName("Conference"));
  • Neo4J API (4): Creating indexes
      IndexManager index = db.index();
      Index<Node> authorIndex = index.forNodes("authors"); // "authors" is the index name
      Index<Node> paperIndex = index.forNodes("papers");
      Index<Node> conferenceIndex = index.forNodes("conferences");
      RelationshipIndex wroteIndex = index.forRelationships("write");
      RelationshipIndex citeIndex = index.forRelationships("cite");
      RelationshipIndex conferenceRelIndex = index.forRelationships("conferenceRel");
  • Neo4J API (5): Indexing nodes
      author1.setProperty("firstname", "Peter");
      author1.setProperty("lastname", "Smith");
      // index the node under the key "lastname" with its property value
      authorIndex.add(author1, "lastname", author1.getProperty("lastname"));
      author2.setProperty("firstname", "Juan");
      author2.setProperty("lastname", "Perez");
      authorIndex.add(author2, "lastname", author2.getProperty("lastname"));
      author3.setProperty("firstname", "John");
      author3.setProperty("lastname", "Smith");
      authorIndex.add(author3, "lastname", author3.getProperty("lastname"));
      conference1.setProperty("name", "ESWC");
      conferenceIndex.add(conference1, "name", conference1.getProperty("name"));
      conference2.setProperty("name", "ISWC");
      conferenceIndex.add(conference2, "name", conference2.getProperty("name"));
  • Neo4J: Cypher Query Language (1)
    - START: specifies one or more starting points in the graph. Starting points can be defined as index lookups or just by an element ID.
    - MATCH: pattern matching, based on the starting point(s).
    - WHERE: filtering criteria.
    - RETURN: what is projected out from the evaluation of the query.
  • Neo4J: Cypher Query Language (2)
    Query: Is Peter Smith connected to John Smith?
      START author=node:authors("firstname:Peter AND lastname:Smith")
      MATCH (author)-[R*]->(m)
      RETURN count(R)
  • Neo4J: Cypher Query Language (2)
    Query: Papers written by Peter Smith.
      START author=node:authors("firstname:Peter AND lastname:Smith")
      MATCH (author)-[:WROTE]->(papers)
      RETURN papers
  • Neo4J: Cypher Query Language (3)
    Query: Papers cited by a paper written by Peter Smith.
    START author=node:authors("firstname:Peter AND lastname:Smith")
    MATCH (author)-[:WROTE]->()-[:CITES]->(papers)
    RETURN papers
  • Neo4J: Cypher Query Language (3)
    Query: Papers cited by a paper written by Peter Smith that have at most 20 cites.
    START author=node:authors("firstname:Peter AND lastname:Smith")
    MATCH (author)-[:WROTE]->()-[:CITES]->(papers)-[:CITES]->(cited)
    WITH papers, count(cited) AS cites
    WHERE cites < 21
    RETURN papers
  • Neo4J: Cypher Query Language (4)
    Query: Papers cited by a paper written by Peter Smith or cited by papers cited by a paper written by Peter Smith.
    START author=node:authors("firstname:Peter AND lastname:Smith")
    MATCH (author)-[:WROTE]->()-[:CITES*1..2]->(papers)
    RETURN papers
  • Neo4J: Cypher Query Language (5)
    Query: Number of papers cited by a paper written by Peter Smith or cited by papers cited by a paper written by Peter Smith.
    START author=node:authors("firstname:Peter AND lastname:Smith")
    MATCH (author)-[:WROTE]->()-[:CITES*1..2]->(papers)
    RETURN count(papers)
  • Neo4J: Cypher Query Language (6)
    Query: Number of papers cited by a paper written by Peter Smith or cited by papers cited by a paper written by Peter Smith, and have been published in ESWC.
    START author=node:authors("firstname:Peter AND lastname:Smith")
    MATCH (author)-[:WROTE]->()-[:CITES*1..2]->(papers)
    WHERE papers.conference! = 'ESWC'
    RETURN count(papers)
    The exclamation mark in papers.conference! ensures that papers without the conference attribute are not included in the output.
  • Neo4J: Cypher Query Language (7)
    Query: Number of papers cited by a paper written by Peter Smith or cited by papers cited by a paper written by Peter Smith, have been published in ESWC and have at most 4 co-authors.
    START author=node:authors("firstname:Peter AND lastname:Smith")
    MATCH (author)-[:WROTE]->()-[:CITES*1..2]->(papers)-[:`Was-Written`]->(authorFinal)
    WHERE papers.conference! = 'ESWC'
    WITH papers, count(authorFinal) AS authorFinalCount
    WHERE authorFinalCount <= 4
    RETURN count(papers)
    The exclamation mark in papers.conference! ensures that papers without the conference attribute are not included in the output.
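  • Neo4J API-Cypher
    Query: Peter Smith Neighborhood
    import java.util.Map;
    import org.neo4j.cypher.javacompat.ExecutionResult;

    // Neo4j appears to be a wrapper class from the tutorial code, not a class of the Neo4j API
    static Neo4j TestGraph;
    String Query = "start n=node:node_auto_index(idNode='PeterSmith') "
                 + "match n-[r]->m return m.idNode";
    ExecutionResult result = TestGraph.query(Query);
    boolean res = false;
    // Print every column value of every result row
    for (Map<String, Object> row : result) {
        for (Map.Entry<String, Object> column : row.entrySet()) {
            System.out.println(column.getValue().toString());
            res = true;
        }
    }
    if (!res) System.out.println("No result");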
  • Neo4J API Summary
    Operation | Description
    GraphDatabaseFactory | Creates an empty graph
    registerShutdownHook | Shuts the graph database down when the JVM exits
    createNode | Creates a new empty node of a given node type
    AddNode | Copies a node
    createRelationshipTo | Adds a new edge of some edge type
    forNodes | Index for a node
    IndexManager | Index manager
    forRelationships | Index for a relationship
    add | Adds an object to an index
    TraversalDescription | Different types of strategies to traverse a graph
    TraversalBranch Selector | preorderDepthFirst, postorderDepthFirst, preorderBreadthFirst, postorderBreadthFirst
  • COFFEE BREAK
  • 4. RDF GRAPH ENGINES
  • RDF-Based Graph Database Engines
    • Native storage: Hexastore, RDF3X, BitMat
    • DBMS-backed storage: Hexastore*, BHyper
  • In the beginning … there were dinosaurs
  • In the beginning …
    • Traditionally:
      – Use relational DBMSs to store data
      – Early approach: one large SPO table (physical)
        • Pros: simple schema, fast updates (when no index is present), fast single triple pattern queries
        • Cons: not suitable for multi-join queries, which result in multiple self-joins
  • In the beginning …
    • Traditionally:
      – Use relational DBMSs to store data
      – Further approaches: property tables (multiple flavors)
        • Pros: fast retrieval of subject / object pairs for given predicates
        • Cons: queries where the predicate is unbound are not easy to solve, and the schema is harder to maintain (a table for each known predicate)
  • RDF-tailored management approaches
    • Same conceptual approach: one large SPO table (but with RDF-specific physical organization):
      – Triples are indexed in all 3! = 6 possible permutations: SPO, SOP, PSO, POS, OSP, OPS
      – Logical extension of the vertical partitioning proposed in [Abadi et al 2008] (the PSO index is a generalization of the SO table)
      – Each index (e.g. SPO) stores partial triple pattern cardinalities
      – Resulting in 6 tree-based indexes, each 3 levels deep
    • First proposed in Hexastore [Weiss et al. 2008]
  • Hexastore / Conceptual Model
    • 2 main operations (input = tp: a partially bound triple pattern, e.g. <bound_subject> <some predicate> ?o):
      – cardinality = getCardinality(tp)
      – term_set = getSet(tp, set)
    • The resulting term_set is sorted
  • Hexastore / Conceptual Model
    • Pros:
      – The optimal index permutation and level is used (e.g. <bound_s> ?p ?o: SPO, level 1)
      – Ability to make use of fast merge-joins
    • Cons:
      – Higher upkeep cost: a theoretical upper bound of 5x space increase over the original (encoded) data
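    The two operations of the conceptual model can be illustrated with a small in-memory sketch. It is not the Hexastore implementation: the class name SextupleIndexSketch, the string names for the permutations, and the restriction to patterns with exactly two bound terms are assumptions made for illustration. Each permutation is kept as a three-level sorted structure, so getSet returns a sorted term set, as the slides describe.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** In-memory sketch of sextuple indexing over integer-encoded triples (illustrative only). */
final class SextupleIndexSketch {
    private static final String[] ORDERS = {"SPO", "SOP", "PSO", "POS", "OSP", "OPS"};

    // One three-level sorted structure per permutation:
    // first term -> second term -> sorted set of third terms.
    private final Map<String, TreeMap<Integer, TreeMap<Integer, TreeSet<Integer>>>> indexes =
            new HashMap<>();

    SextupleIndexSketch() {
        for (String order : ORDERS) {
            indexes.put(order, new TreeMap<>());
        }
    }

    /** Inserts the triple into all six permutation indexes. */
    void add(int s, int p, int o) {
        for (String order : ORDERS) {
            int[] k = permute(order, s, p, o);
            indexes.get(order)
                   .computeIfAbsent(k[0], x -> new TreeMap<>())
                   .computeIfAbsent(k[1], x -> new TreeSet<>())
                   .add(k[2]);
        }
    }

    /** Sorted terms in the third position of the given order, for bound first and second terms. */
    SortedSet<Integer> getSet(String order, int first, int second) {
        TreeMap<Integer, TreeSet<Integer>> level2 = indexes.get(order).get(first);
        if (level2 == null || !level2.containsKey(second)) {
            return Collections.emptySortedSet();
        }
        return Collections.unmodifiableSortedSet(level2.get(second));
    }

    /** Number of triples matching a pattern with two bound terms. */
    int getCardinality(String order, int first, int second) {
        return getSet(order, first, second).size();
    }

    // Reorders (s, p, o) according to the permutation name, e.g. "POS" -> (p, o, s).
    private static int[] permute(String order, int s, int p, int o) {
        int[] k = new int[3];
        for (int i = 0; i < 3; i++) {
            char c = order.charAt(i);
            k[i] = (c == 'S') ? s : (c == 'P') ? p : o;
        }
        return k;
    }

    public static void main(String[] args) {
        SextupleIndexSketch index = new SextupleIndexSketch();
        index.add(1, 12, 18); // ids are illustrative: paper 1, predicate 12, author 18
        index.add(3, 12, 18);
        index.add(3, 12, 19);
        System.out.println(index.getSet("POS", 12, 18));        // prints [1, 3]
        System.out.println(index.getCardinality("SPO", 3, 12)); // prints 2
    }
}
```

    Choosing the permutation whose prefix matches the bound terms is what lets a lookup touch only one index path; the price is that every insert writes six entries, which is the upkeep cost listed under the cons.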
  • An Example (Publications): logical representation
    [Figure: publication graph with authors Peter Smith, Juan Perez and John Smith; papers Paper1 to Paper5; conferences ESWC and ISWC; edges labeled Wrote, Cites and Conference]
  • RDF Subgraph for the Publication Graph
    [Figure: RDF subgraph over the resources conference/eswc/2012/paper/03, conference/iswc/2012/paper/05, conference/iswc/2012/paper/01, person/peter-smith, conference/eswc/, conference/iswc/ and semanticweb/ontology/ConferenceEvent, with the literals 2012, ESWC and ISWC; edge labels: ontoware/ontology#author, ontoware/ontology#year, ontoware/ontology#biblioReference, semanticweb/ontology#isPartOf, semanticweb/ontology#hasAcronym, w3c/rdf-syntax#type]
  • An Example (Publications): first step, encode strings
    Hexastore Mapping Dictionary
    ID | Resource
    1 | http://data.semanticweb.org/conference/eswc/2012/paper/01
    2 | http://data.semanticweb.org/conference/iswc/2012/paper/02
    3 | http://data.semanticweb.org/conference/eswc/2012/paper/03
    4 | http://data.semanticweb.org/conference/eswc/2012/paper/04
    5 | http://data.semanticweb.org/conference/iswc/2012/paper/05
    6 | "ESWC2012"
    7 | "ISWC2012"
    8 | http://data.semanticweb.org/conference/iswc/2012
    9 | http://data.semanticweb.org/conference/eswc/2012
    10 | http://data.semanticweb.org/ns/swc/ontology#ConferenceEvent
    11 | http://data.semanticweb.org/ns/swc/ontology#isPartOf
    12 | http://swrc.ontoware.org/ontology#author
    13 | http://swrc.ontoware.org/ontology#year
    14 | http://swrc.ontoware.org/ontology#biblioReference
    15 | "2012"
    16 | http://www.w3.org/1999/02/22-rdf-syntax-ns#type
    17 | http://data.semanticweb.org/ns/swc/ontology#hasAcronym
    18 | http://data.semanticweb.org/person/peter-smith
    19 | http://data.semanticweb.org/person/juan-perez
    20 | http://data.semanticweb.org/person/john-smith
  • An Example (Publications): first step, encode strings
    [Hexastore Mapping Dictionary, as on the previous slide]
    • Next, index the original encoded RDF file …
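    The encoding step can be sketched as follows. This is a minimal illustration, not the dictionary used by Hexastore: the class name MappingDictionarySketch and the policy of handing out ids in order of first appearance (starting at 1) are assumptions, and the ids it produces will not match the ones on the slide.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of a mapping dictionary that replaces RDF terms by integer ids (illustrative only). */
final class MappingDictionarySketch {
    private final Map<String, Integer> idByTerm = new HashMap<>();
    private final List<String> termById = new ArrayList<>();

    /** Returns the id of the term, assigning the next free id (starting at 1) on first sight. */
    int encode(String term) {
        Integer id = idByTerm.get(term);
        if (id == null) {
            termById.add(term);
            id = termById.size();
            idByTerm.put(term, id);
        }
        return id;
    }

    /** Returns the term that was assigned the given id. */
    String decode(int id) {
        return termById.get(id - 1);
    }

    /** Encodes one (subject, predicate, object) triple into three ids. */
    int[] encodeTriple(String s, String p, String o) {
        return new int[] {encode(s), encode(p), encode(o)};
    }

    public static void main(String[] args) {
        MappingDictionarySketch dict = new MappingDictionarySketch();
        int[] t = dict.encodeTriple(
                "http://data.semanticweb.org/conference/eswc/2012/paper/01",
                "http://swrc.ontoware.org/ontology#author",
                "http://data.semanticweb.org/person/peter-smith");
        System.out.println(t[0] + " " + t[1] + " " + t[2]); // prints 1 2 3
        System.out.println(dict.decode(t[2]));
    }
}
```

    The two maps play the role of the two dictionary directions (term to id, id to term); the indexes built in the next step then only ever store the integer ids.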
  • Hexastore / Physical implementation: native storage
    [Figure: S, P and O levels of an index over the encoded example]
    • Uses vector storage [Weiss et al 2009]
    • Each index level:
      – partially sorted vector (on disk, one file per level)
      – each record: <rdf term id, cardinality, pointer to set>
      – the first level gets special treatment:
        • fully sorted vector: search = binary search, O(log(N))
        • Hint: using an additional hash can circumvent the binary search
  • Hexastore / Physical implementation: native storage
    [Figure: S, P and O levels of an index over the encoded example]
    • Pros:
      – Optimized for read-intensive / static workloads
      – Optimal / fast triple pattern matching
      – No comparisons when scanning, since each (partial) cardinality is known!
    • Cons:
      – Not suitable for write-intensive workloads (the worst case can result in a rewrite of the entire index!)
      – Hypothesis (not yet tested): SSD media may be adequate…
  • Hexastore / Physical implementation: DBMS (managed) storage
    [Figure: S, SP and SPO index records over the encoded example]
    • Builds on top of sorted indexes:
      – B+Tree, LSM Tree, etc.
    • Each index level:
      – Same sorted index structure (but not necessarily)
      – each record: <rdf term id list, cardinality>
      – Search cost = as provided by the managed index structure
  • Hexastore / Physical implementation: DBMS (managed) storage
    [Figure: S, SP and SPO index records over the encoded example]
    • Pros:
      – Flexible design: each index can be configured for a different physical data structure:
        • e.g. PSO = B+Tree, OSP = LSM Tree
      – Separation of concerns (the underlying DBMS manages and optimizes for access patterns, cache, etc.)
    • Cons:
      – Less control over fine-grained data storage (up to the level permitted by the DBMS API)
      – Usually less potential for native optimization
  • RDF3x (1) [Neumann and Weikum 2008]
    • RISC-style operators.
    • Implements the traditional Optimize-then-Execute paradigm to identify left-linear plans of star-shaped sub-queries of basic triple patterns.
    • Makes use of secondary memory structures to locally store RDF data and reduce the overhead of I/O operations.
    • Registers aggregated counts (numbers of occurrences), and is able to detect if a query will return an empty answer.
  • RDF3x (2) [Neumann and Weikum 2008]
    • Triple table; each row represents a triple.
    • Mapping dictionary, replacing all the literal strings by an id; two dictionary indices.
    • Compressed clustered B+-tree to index all triples:
      – Six permutations of subject, predicate and object; each permutation in a different index.
      – Triples are sorted lexicographically in each B+-tree.
  • RDF3x (3)
    [Figure: mapping dictionary and triple table]
  • RDF3x (4) [Figure from Neumann et al 2010]
    • Compressed B+-trees.
    • One B+-tree per pattern.
    • Structure is updatable.
  • RDF3x (5)
    [Figure from Lei Zou, http://hotdb.info/slides/RDF-talk.pdf]
  • RDF3x (6)
  • RDF3x: Representation of the Publication Graph
    RDF3x Mapping Dictionary
    OID | Resource
    0 | http://www.w3.org/1999/02/22-rdf-syntax-ns#type
    1 | http://data.semanticweb.org/ns/swc/ontology#hasAcronym
    2 | http://data.semanticweb.org/ns/swc/ontology#isPartOf
    3 | http://swrc.ontoware.org/ontology#author
    4 | http://swrc.ontoware.org/ontology#year
    5 | http://swrc.ontoware.org/ontology#biblioReference
    6 | http://data.semanticweb.org/conference/iswc/2012
    7 | http://data.semanticweb.org/ns/swc/ontology#ConferenceEvent
    8 | ISWC2012
    9 | http://data.semanticweb.org/conference/eswc/2012
    10 | ESWC2012
    11 | http://data.semanticweb.org/conference/iswc/2012/paper/05
    12 | http://data.semanticweb.org/conference/iswc/2012/paper/02
    13 | http://data.semanticweb.org/conference/eswc/2012/paper/01
    14 | http://data.semanticweb.org/conference/eswc/2012/paper/03
    15 | http://data.semanticweb.org/conference/eswc/2012/paper/04
    16 | http://data.semanticweb.org/author/peter-smith
    17 | http://data.semanticweb.org/author/juan-perez
    18 | http://data.semanticweb.org/author/john-smith
    19 | 2012
    Triple Table
    S | P | O
    6 | 0 | 7
    9 | 0 | 7
    6 | 0 | 7
    9 | 1 | 8
    11 | 1 | 10
    12 | 2 | 6
    13 | 2 | 6
    14 | 2 | 9
    15 | 2 | 9
    11 | 3 | 16
    12 | 3 | 18
    13 | 3 | 16
    14 | 3 | 16
    14 | 3 | 17
    15 | 3 | 17
    15 | 3 | 18
    11 | 4 | 19
    12 | 4 | 19
    13 | 4 | 19
    14 | 4 | 19
    15 | 4 | 19
    13 | 5 | 12
    14 | 5 | 13
    14 | 5 | 15
    15 | 5 | 12
  • BitMat (1) [Atre et al. 2010]
    • 3D bit-cube representation where each dimension represents subjects, predicates and objects.
    • Each cell of the matrix is 0 or 1, and represents an RDF triple that can be formed by the combination of S, P, O.
    • The 3D bit-cube is sliced along a dimension to obtain 2D matrices.
    • A gap compression schema is implemented on each bit-row.
    • Basic operations over BitMat are defined to join basic triple patterns.
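    One common way to realize gap compression on a sparse bit-row is run-length encoding: store the value of the first bit, followed by the lengths of the alternating runs. The sketch below shows that idea only; whether it matches BitMat's actual on-disk layout is an assumption, and the class name GapCompressionSketch is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Run-length ("gap") encoding of a bit-row: first bit value, then alternating run lengths. */
final class GapCompressionSketch {

    /** Encodes the row as [first bit, run length, run length, ...]; an empty row gives an empty list. */
    static List<Integer> compress(boolean[] row) {
        List<Integer> out = new ArrayList<>();
        if (row.length == 0) {
            return out;
        }
        out.add(row[0] ? 1 : 0);
        int run = 1;
        for (int i = 1; i < row.length; i++) {
            if (row[i] == row[i - 1]) {
                run++;            // still inside the current run
            } else {
                out.add(run);     // the run ended, emit its length
                run = 1;
            }
        }
        out.add(run);             // emit the last run
        return out;
    }

    /** Rebuilds the original bit-row from its encoded form. */
    static boolean[] decompress(List<Integer> encoded) {
        int length = 0;
        for (int i = 1; i < encoded.size(); i++) {
            length += encoded.get(i);
        }
        boolean[] row = new boolean[length];
        if (encoded.isEmpty()) {
            return row;
        }
        boolean bit = encoded.get(0) == 1;
        int pos = 0;
        for (int i = 1; i < encoded.size(); i++) {
            for (int j = 0; j < encoded.get(i); j++) {
                row[pos++] = bit;
            }
            bit = !bit;           // runs alternate between 0 and 1
        }
        return row;
    }

    public static void main(String[] args) {
        boolean[] row = {false, false, true, true, false, false, false}; // 0011000
        System.out.println(compress(row)); // prints [0, 2, 2, 3]
    }
}
```

    The encoding pays off when rows are dominated by long runs of zeros, which is the typical case for one slice of a sparse subject-predicate-object cube.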
  • BitMat (2)
    [Figure: triple table, 3D-cube BitMat, and the S-O and O-S BitMats for each P]
  • BitMat (3)
    [Figure from Atre and Hendler 2009]
  • BitMat: Representation of the Publication Graph (1)
    sID Table
    OID | Resource
    1 | http://data.semanticweb.org/conference/eswc/2012
    2 | http://data.semanticweb.org/conference/eswc/2012/paper/01
    3 | http://data.semanticweb.org/conference/eswc/2012/paper/04
    4 | http://data.semanticweb.org/conference/iswc/2012
    5 | http://data.semanticweb.org/conference/iswc/2012/paper/02
    6 | http://data.semanticweb.org/conference/iswc/2012/paper/03
    7 | http://data.semanticweb.org/conference/eswc/2012/paper/05
    3D Cube BitMat (triples)
    S | P | O
    4 | 6 | 8
    4 | 1 | 12
    1 | 6 | 8
    1 | 1 | 7
    7 | 2 | 4
    5 | 2 | 4
    2 | 2 | 1
    6 | 2 | 1
    3 | 2 | 1
    2 | 3 | 11
    6 | 3 | 11
    6 | 3 | 10
    3 | 3 | 10
    3 | 3 | 9
    7 | 3 | 11
    5 | 3 | 9
    2 | 5 | 6
    5 | 5 | 6
    6 | 5 | 6
    3 | 5 | 6
    7 | 5 | 6
  • BitMat: Representation of the Publication Graph (2)
    3D Cube BitMat (triples): same S, P, O table as in Representation of the Publication Graph (1)
    pID Table
    OID | Resource
    1 | http://data.semanticweb.org/ns/swc/ontology#hasAcronym
    2 | http://data.semanticweb.org/ns/swc/ontology#isPartOf
    3 | http://swrc.ontoware.org/ontology#author
    4 | http://swrc.ontoware.org/ontology#biblioReference
    5 | http://swrc.ontoware.org/ontology#year
    6 | http://www.w3.org/1999/02/22-rdf-syntax-ns#type
  • BitMat: Representation of the Publication Graph (3)
    3D Cube BitMat (triples): same S, P, O table as in Representation of the Publication Graph (1)
    oID Table
    OID | Resource
    1 | http://www.w3.org/1999/02/22-rdf-syntax-ns#type
    2 | http://data.semanticweb.org/ns/swc/ontology#hasAcronym
    3 | http://data.semanticweb.org/ns/swc/ontology#isPartOf
    4 | http://swrc.ontoware.org/ontology#author
    5 | http://swrc.ontoware.org/ontology#year
    6 | http://swrc.ontoware.org/ontology#biblioReference
    7 | http://data.semanticweb.org/conference/iswc/2012
    8 | http://data.semanticweb.org/ns/swc/ontology#ConferenceEvent
    9 | ISWC2012
    10 | http://data.semanticweb.org/conference/eswc/2012
    11 | ESWC2012
    12 | http://data.semanticweb.org/conference/iswc/2012/paper/05
  • BHyper (1) [Vidal and Martínez 2011]
    • A hypergraph-based structure is used to represent RDF documents.
    • Mapping dictionary, replacing all the literal strings by an id; two dictionary indices, indexed with a hash index.
    • Encodings of RDF resources are implemented as nodes, and sets of triple patterns correspond to hyperedges.
    • A role function is used to denote the role played by a resource in a hyperedge.
    • Roles are represented as labels in the hyperedges.
    • Indices are used to index the hyperedges and the different roles played by the resources in the hyperedges:
      – B+-trees are used to index hyperedges.
    • BHyper is built on top of Tokyo Cabinet 1.4.45 (1).
    (1) http://sourceforge.net/projects/tokyocabinet/
  • BHyper (2) [Vidal and Martínez 2011]
    A hyper-edge can relate multiple edges.
  • BHyper (3) [Vidal and Martínez 2011]
  • 5. HANDS-ON SESSION
  • Experimental Set-Up
    The experiments presented in this section were executed on a machine with the following characteristics:
    • Model: Sun Fire x4100
    • Quantity of processors: 2
    • Processor specification: Dual-Core AMD Opteron™ Processor 2218, 64 bits
    • RAM memory: 16GB
    • Disk capacity: 2 disks, 135GB each
    • Network interface: 2 + ALOM
    • Operating System: CentOS 5.5, 64 bits
    • RDF3x version: 0.3.7
    • DEX version: 4.8 Very Large Databases
    • HyperGraphDB version: 1.2
    • Neo4J version: 1.8.2
    • Timeout: 3,600 secs. (reported in the portal as the value -2)
  • Benchmark Graphs
    Graph Name | #Nodes | #Edges | Description
    DSJC1000.1 | 1,000 | 49,629 | Graph density: 0.1 [Johnson91]
    DSJC1000.5 | 1,000 | 249,826 | Graph density: 0.5 [Johnson91]
    DSJC1000.9 | 1,000 | 449,449 | Graph density: 0.9 [Johnson91]
    SSCA2-17 | 131,072 | 3,907,909 | GTgraph (1)
    [Johnson91] Johnson, D., Aragon, C., McGeoch, L., and Schevon, C. Optimization by simulated annealing: an experimental evaluation; part II, graph coloring and number partitioning. Operations Research 39, 3 (1991), 378–406.
    GTgraph: a suite of three synthetic graph generators:
    (1) SSCA2: generates graphs used as input instances for the DARPA High Productivity Computing Systems SSCA#2 graph theory benchmark.
    http://www.cse.psu.edu/~madduri/software/GTgraph/index.html
  • Benchmark Tasks
    The benchmark tasks evaluate the engine performance while executing the following graph operations:
    Task | Description
    adjacentXP | Given a node X and an edge label P, find adjacent nodes Y.
    adjacentX | Given a node X, find adjacent nodes Y.
    edgeBetween | Given two nodes X and Y, find the labels of the edges between X and Y.
    2-hopX | Given a node X, find the 2-hop Y nodes.
    3-hopX | Given a node X, find the 3-hop Y nodes.
    4-hopX | Given a node X, find the 4-hop Y nodes.
  • Benchmark Tasks: RDF Engines
    The benchmark tasks were implemented in the RDF engines with the following SPARQL queries:
    Task | SPARQL Query
    adjacentXP | SELECT ?y WHERE { <http://graphdatabase.ldc.usb.ve/resource/6> <http://graphdatabase.ldc.usb.ve/resource/pr> ?y }
    adjacentX | SELECT ?y WHERE { <http://graphdatabase.ldc.usb.ve/resource/6> ?p ?y }
    edgeBetween | SELECT ?p WHERE { <http://graphdatabase.ldc.usb.ve/resource/6> ?p <http://graphdatabase.ldc.usb.ve/resource/8> }
    2-hopX | SELECT ?z WHERE { <http://graphdatabase.ldc.usb.ve/resource/6> ?p1 ?y1 . ?y1 ?p2 ?z }
    3-hopX | SELECT ?z WHERE { <http://graphdatabase.ldc.usb.ve/resource/6> ?p1 ?y1 . ?y1 ?p2 ?y2 . ?y2 ?p3 ?z }
    4-hopX | SELECT ?z WHERE { <http://graphdatabase.ldc.usb.ve/resource/6> ?p1 ?y1 . ?y1 ?p2 ?y2 . ?y2 ?p3 ?y3 . ?y3 ?p4 ?z }
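    For engines without a declarative query language, the k-hop tasks have to be written against the graph API. The following sketch shows the same computation on a plain adjacency map; it is not the benchmark code, and the class name KHopSketch and the map-based graph representation are assumptions. Like the chained triple patterns above, it follows exactly k directed edges, but it collapses duplicate end nodes, whereas the SPARQL queries return one solution per matching path.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Sketch of the k-hop task on an adjacency map (node -> successors); illustrative only. */
final class KHopSketch {

    /** Nodes reachable from start by following exactly k directed edges. */
    static Set<Integer> kHop(Map<Integer, List<Integer>> adjacency, int start, int k) {
        Set<Integer> frontier = new TreeSet<>();
        frontier.add(start);
        for (int hop = 0; hop < k; hop++) {
            Set<Integer> next = new TreeSet<>();
            for (int node : frontier) {
                // Expand every node of the current frontier by one edge
                next.addAll(adjacency.getOrDefault(node, Collections.emptyList()));
            }
            frontier = next;
        }
        return frontier;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> g = new HashMap<>();
        g.put(6, Arrays.asList(7, 8));
        g.put(7, Arrays.asList(9));
        g.put(8, Arrays.asList(9, 10));
        g.put(9, Arrays.asList(6));
        System.out.println(kHop(g, 6, 2)); // prints [9, 10]
    }
}
```

    Each additional hop expands the whole frontier, so in a dense graph the work per hop can grow by roughly the average out-degree.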
  • Graph Mining Tasks
    These tasks evaluate the engine performance while executing the following graph mining tasks:
    • Graph summarization.
    • Dense subgraph.
    • Shortest path.
  • Graph Summarization [Navlakha et al. 2008]
    [Figure: original graph and summarized graph over the nodes 0 to 9]
    Corrections: Add(8,1), Add(9,3), Add(0,3), Del(4,1)
    Cost = 1 (superedge) + 4 (corrections) = 5
  • Dense Subgraph [Khuller et al. 2010]
    [Figure: weighted graph over the nodes A to H]
  • Dense Subgraph [Khuller et al. 2010]
    [Figure: weighted graph over the nodes A to H]
    Given a subset of nodes E’, compute the densest subgraph containing them, i.e., the density of E’ is maximized:
    $\mathrm{density}(E) = \dfrac{\sum_{(a,b) \in E} \mathrm{weight}(a,b)}{|E|}$
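    The density measure can be evaluated directly from a weight matrix. The sketch below reads the formula as "sum of the weights of the edges whose two endpoints both lie in the node set, divided by the number of nodes in the set"; that reading, the undirected weight matrix, and the class name DensitySketch are assumptions. It only scores a given node set; it does not search for the densest one.

```java
/** Sketch: density of a node set as total internal edge weight divided by the number of nodes. */
final class DensitySketch {

    /**
     * weights[a][b] holds the weight of the undirected edge {a, b}, or 0 if there is no edge.
     * nodes lists the distinct node ids of the set to score.
     */
    static double density(double[][] weights, int[] nodes) {
        if (nodes.length == 0) {
            return 0.0;
        }
        double total = 0.0;
        // Visit every unordered pair of nodes in the set once
        for (int i = 0; i < nodes.length; i++) {
            for (int j = i + 1; j < nodes.length; j++) {
                total += weights[nodes[i]][nodes[j]];
            }
        }
        return total / nodes.length;
    }

    public static void main(String[] args) {
        double[][] w = new double[4][4];
        w[0][1] = w[1][0] = 1; // triangle 0-1-2 with weights 1, 2, 3
        w[1][2] = w[2][1] = 2;
        w[0][2] = w[2][0] = 3;
        w[0][3] = w[3][0] = 1; // pendant node 3
        System.out.println(density(w, new int[] {0, 1, 2}));    // prints 2.0
        System.out.println(density(w, new int[] {0, 1, 2, 3})); // prints 1.75
    }
}
```

    In the example, adding the weakly attached node 3 lowers the density, which is the effect a densest-subgraph search exploits when deciding which nodes to keep.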
  • Shortest Path
    • Given two nodes, find the shortest path between them.
    • Implemented using depth-first iterative deepening search (DFID).
    • Ten node pairs are taken at random from each graph, and then the Shortest Path algorithm is executed on each pair of nodes.
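    Iterative deepening runs a depth-limited depth-first search with limits 0, 1, 2, … and stops at the first limit that reaches the target, so the path found has the fewest possible edges. The sketch below shows the idea on an unweighted adjacency map; it is not the benchmark implementation, and the class name IterativeDeepeningSketch and the maxDepth cut-off parameter are assumptions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of shortest path search (fewest edges) by depth-first iterative deepening. */
final class IterativeDeepeningSketch {

    /** Returns a path with the fewest edges from source to target, or an empty list if none within maxDepth. */
    static List<Integer> shortestPath(Map<Integer, List<Integer>> adjacency,
                                      int source, int target, int maxDepth) {
        for (int limit = 0; limit <= maxDepth; limit++) {
            Deque<Integer> path = new ArrayDeque<>();
            path.addLast(source);
            if (depthLimited(adjacency, source, target, limit, path)) {
                return new ArrayList<>(path);
            }
        }
        return Collections.emptyList();
    }

    // Depth-first search that gives up once `limit` more edges have been followed.
    private static boolean depthLimited(Map<Integer, List<Integer>> adjacency,
                                        int node, int target, int limit, Deque<Integer> path) {
        if (node == target) {
            return true;
        }
        if (limit == 0) {
            return false;
        }
        for (int next : adjacency.getOrDefault(node, Collections.emptyList())) {
            if (path.contains(next)) {
                continue; // do not revisit a node already on the current path
            }
            path.addLast(next);
            if (depthLimited(adjacency, next, target, limit - 1, path)) {
                return true;
            }
            path.removeLast(); // backtrack
        }
        return false;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> g = new HashMap<>();
        g.put(1, Arrays.asList(2, 3));
        g.put(2, Arrays.asList(4));
        g.put(4, Arrays.asList(5));
        g.put(3, Arrays.asList(5));
        System.out.println(shortestPath(g, 1, 5, 10)); // prints [1, 3, 5]
    }
}
```

    The search needs memory only for the current path, at the cost of re-exploring the shallow levels once per depth limit.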
  • Experimental Results
    http://graphium.ldc.usb.ve/
    Collaborators (USB): José Sánchez, Valeria Pestana, José Piñeiro, Domingo De Abreu, Jonathan Queipo, Alejandro Flores, Guillermo Palma
  • Loading Time: RDF Engines
    Graph Name | #Nodes | #Edges | RDF3x Loading Time | Hexastore Loading Time
    DSJC1000.1 | 1,000 | 49,629 | 2.036s | 6.174s
    DSJC1000.5 | 1,000 | 249,826 | 5.594s | 30.809s
    DSJC1000.9 | 1,000 | 449,449 | 8.992s | 30.809s
    SSCA2-17 | 131,072 | 3,907,909 | 44.775s | 2m43.814s
  • Benchmark Tasks: Hexastore
    [Figure: results for the benchmark tasks on Hexastore]
    K-hop seems to grow exponentially.
  • Benchmark Tasks: RDF3x
    [Figure: results for the benchmark tasks on RDF3x]
    K-hop seems to grow exponentially.
  • Loading Time: Graph Database Engines
    Graph Name | #Nodes | #Edges | DEX Loading Time | Neo4J Loading Time | HyperGraphDB Loading Time
    DSJC1000.1 | 1,000 | 49,629 | 10.031s | 19.884s | 159.665s
    DSJC1000.5 | 1,000 | 249,826 | 41.677s | 113.174s | 812.135s
    DSJC1000.9 | 1,000 | 449,449 | 76.389s | 321.299s | 1,557.574s
    SSCA2-17 | 131,072 | 3,907,909 | 455.293s | 359.082s | 7,148.301s
  • Benchmark Tasks: Neo4J
    [Figure: results for the benchmark tasks on Neo4J]
    K-hop seems to grow exponentially.
  • Benchmark Tasks: DEX
    K-hop seems to grow exponentially.
  • Benchmark Tasks: HyperGraphDB
    K-hop seems to grow exponentially.
  • Graph Mining Tasks: DEX
  • Graph Mining Tasks: Neo4J
    [Figure: results for the graph mining tasks on Neo4J]
  • Graph Mining Tasks: HyperGraphDB
    [Figure: results for the graph mining tasks on HyperGraphDB]
  • 6. QUESTIONS & DISCUSSION
  • Graph Data Storing Features
    Columns: Graph Database | Main Memory | Persistent Memory | Backend Storage | Indexing
    DEX: X X X
    HyperGraphDB (1): X X X X
    Neo4j: X X X
    Hexastore (2): X X X X
    RDF3x: X X
    BitMat: X X
    BHyper (2): X X X X
    (1) Backend Storage: Berkeley DB
    (2) Backend Storage: Tokyo Cabinet
  • Graph Data Management Features
    Columns: Graph Database | Graph API | Query Language | Data Visualization | Data Definition | Data Manipulation | Standard Compliance
    DEX: X X X X
    HyperGraphDB: X X X
    Neo4j: X X X X X
    Hexastore: X X X X
    RDF3x: X
    BitMat: X X X
    BHyper: X X X
  • Graph-Based Operations
    • Adjacency Queries: node/edge adjacency; k-neighborhood of a node.
    • Reachability Queries: test whether two nodes are connected by a path; shortest path.
    • Pattern Matching Queries: find all sub-graphs of a data graph that match a given pattern.
    • Summarization Queries: aggregate the results of a graph-based query.
  • Graph-Based Operations
    Columns: Graph Database | Node/Edge Adjacency | K-neighborhood | Fixed-length Paths | Regular Simple Paths | Shortest Path | Pattern Matching | Summarization Queries
    DEX: X X X X X X
    HyperGraph: X X X
    Neo4j: X X X X X X
    Hexastore: X X
    RDF3x: X X
    BitMat: X X
    BHyper: X X
  • 7. SUMMARY & CLOSING
  • Conclusions (1)
    Data Storing Features:
    • Engines implement a wide variety of structures to efficiently represent large graphs.
    • RDF engines implement special-purpose structures to efficiently store RDF graphs.
  • Conclusions (2)
    Data Management Features:
    • General-purpose graph database engines provide APIs to manage and query data.
    • RDF engines support general data management operations.
  • Conclusions (3)
    Graph-Based Operations Support:
    • General-purpose graph database engines provide API methods to implement a wide variety of graph-based operations.
      – Only a few offer query languages to express pattern matching.
    • RDF engines are not customized for basic graph-based operations; however,
      – they efficiently implement the pattern-matching-based language SPARQL.
  • Future Directions
    • Extension of existing engines to support specific tasks of graph traversal and mining.
    • Definition of benchmarks to study the performance and quality of existing graph database engines, e.g.,
      – graph traversal, link prediction, pattern discovery, graph mining.
    • Empirical evaluation of the performance and quality of existing graph database engines.
  • References
    P. Anderson, A. Thor, J. Benik, L. Raschid, and M.-E. Vidal. Pang: finding patterns in annotation graphs. In SIGMOD Conference, pages 677–680, 2012.
    R. Angles and C. Gutiérrez. Survey of graph database models. ACM Comput. Surv., 40(1), 2008.
    M. Atre, V. Chaoji, M. Zaki, and J. Hendler. Matrix Bit loaded: a scalable lightweight join query processor for RDF data. In International Conference on World Wide Web, Raleigh, USA, 2010. ACM.
    B. Iordanov. Hypergraphdb: A generalized graph database. In WAIM Workshops, pages 25–36, 2010.
    N. Martínez-Bazan, V. Muntés-Mulero, S. Gómez-Villamor, J. Nin, M.-A. Sánchez-Martínez, and J.-L. Larriba-Pey. Dex: high-performance exploration on large graphs for information retrieval. In CIKM, pages 573–582, 2007.
    T. Neumann and G. Weikum. RDF-3X: a RISC-style engine for RDF. Proc. VLDB, 1(1), 2008.
  • M. K. Nguyen, C. Basca, and A. Bernstein. B+hash tree: optimizing query execution times for on-disk semantic web data structures. In Proceedings of the 6th International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS2010), 2010.
    I. Robinson, J. Webber, and E. Eifrem. Graph Databases. O’Reilly Media, Incorporated, 2013.
    S. Sakr and E. Pardede, editors. Graph Data Management: Techniques and Applications. IGI Global, 2011.
    D. Shimpi and S. Chaudhari. An overview of graph databases. In IJCA Proceedings on International Conference on Recent Trends in Information Technology and Computer Science 2012, 2013.
    M.-E. Vidal, A. Martínez, E. Ruckhaus, T. Lampo, and J. Sierra. On the efficiency of querying and storing RDF documents. In Graph Data Management, pages 354–385. 2011.
    C. Weiss, P. Karras, and A. Bernstein. Hexastore: Sextuple indexing for semantic web data management. Proc. of the 34th Intl Conf. on Very Large Data Bases (VLDB), 2008.
  • C. Weiss and A. Bernstein. On-disk storage techniques for Semantic Web data - Are B-Trees always the optimal solution? In Proceedings of the 5th International Workshop on Scalable Semantic Web Knowledge Base Systems, October 2009.