Distributed Graph Databases and the Emerging Web of Data

The World Wide Web is the de facto medium for publicly exposing a corpus of interrelated documents. In its current form, the World Wide Web is the Web of Documents. The next generation of the World Wide Web will support the Web of Data. The Web of Data utilizes the same Uniform Resource Identifier (URI) address space as the Web of Documents, but instead of exposing a graph of documents, the Web of Data exposes a graph of data. Given that the URI address space of the Web is distributed and infinite, the Web of Data provides a single unified space by which the world's data can be publicly exposed and interrelated. The Web of Data is supported by both graph databases (which structure the data) and distributed computing mechanisms (which process the data). This presentation will discuss the Web of Data, graph databases, and models of computing in this emerging space.

Transcript

  • 1. Distributed Graph Databases and the Emerging Web of Data Marko A. Rodriguez T-5, Center for Nonlinear Studies Los Alamos National Laboratory http://markorodriguez.com April 16, 2009
  • 2. Abstract The World Wide Web is the de facto medium for publicly exposing a corpus of interrelated documents. In its current form, the World Wide Web is the Web of Documents. The next generation of the World Wide Web will support the Web of Data. The Web of Data utilizes the same Uniform Resource Identifier (URI) address space as the Web of Documents, but instead of exposing a graph of documents, the Web of Data exposes a graph of data. Given that the URI address space of the Web is distributed and infinite, the Web of Data provides a single unified space by which the world's data can be publicly exposed and interrelated. The Web of Data is supported by both graph databases (which structure the data) and distributed computing mechanisms (which process the data). This presentation will discuss the Web of Data, graph databases, and models of computing in this emerging space. Computer Science Department Colloquium – University of New Mexico – April 16, 2009
  • 3. Outline • The Relational Database vs. the Graph Database • The Web of Documents vs. the Web of Data • Local Computing vs. Distributed Computing • Multi-Relational Network Analysis with Grammar Walkers
  • 4. Outline • The Relational Database vs. the Graph Database • The Web of Documents vs. the Web of Data • Local Computing vs. Distributed Computing • Multi-Relational Network Analysis with Grammar Walkers
  • 5. The Relational Database vs. the Graph Database • A relational database’s (e.g. MySQL, PostgreSQL, Oracle) data model is a collection of interlinked tables. • A graph database’s (e.g. OpenSesame, AllegroGraph, Neo4j) data model is a multi-relational graph. [Figure: a relational database beside a graph database whose vertices span the hosts 127.0.0.1 and 127.0.0.2.]
  • 6. Types of Graphs • Undirected single-relational graph: homogeneous set of symmetric links. • Directed single-relational graph: homogeneous set of links. • Directed multi-relational graph: heterogeneous set of links. [Figure: example drawings of the three graph types over vertices x, y, z.]
  • 7. Our Make Believe World - Phase 1 • Marko is a human and Fluffy is a dog.
  • 8. Our World Modeled in a Relational Database - Phase 1
    Object_Table
      ID    Name    Type   Legs  Fur
      0001  Marko   Human  2     false
      0002  Fluffy  Dog    4     true
  • 9. Our World Modeled in a Graph Database - Phase 1 [Figure: graph with edges 0001 -type-> Human, 0001 -name-> Marko, 0001 -legs-> 2, 0001 -fur-> false, 0002 -type-> Dog, 0002 -name-> Fluffy, 0002 -legs-> 4, 0002 -fur-> true.]
  • 10. Our Make Believe World - Phase 2 • Marko is a human and Fluffy is a dog. • Marko and Fluffy are good friends.
  • 11. Our World Modeled in a Relational Database - Phase 2
    Object_Table
      ID    Name    Type   Legs  Fur
      0001  Marko   Human  2     false
      0002  Fluffy  Dog    4     true
    Friendship_Table
      ID1   ID2
      0001  0002
      0002  0001
  • 12. Our World Modeled in a Graph Database - Phase 2 [Figure: the Phase 1 graph plus the edges 0001 -friend-> 0002 and 0002 -friend-> 0001.]
  • 13. Our Make Believe World - Phase 3 • Marko is a human and Fluffy is a dog. • Marko and Fluffy are good friends. • Human and dog are subclasses of mammal.
  • 14. Our World Modeled in a Relational Database - Phase 3
    Object_Table
      ID    Name    Type   Legs  Fur
      0001  Marko   Human  2     false
      0002  Fluffy  Dog    4     true
    Friendship_Table
      ID1   ID2
      0001  0002
      0002  0001
    Subclass_Table
      Type1  Type2
      Human  Mammal
      Dog    Mammal
  • 15. Our World Modeled in a Graph Database - Phase 3 [Figure: the Phase 2 graph plus the edges Human -subclassof-> Mammal and Dog -subclassof-> Mammal.]
  • 16. Our Make Believe World - Phase 4 • Marko is a human and Fluffy is a dog. • Marko and Fluffy are good friends. • Human and dog are subclasses of mammal. • Fluffy peed on the carpet.
  • 17. Our World Modeled in a Relational Database - Phase 4
    Object_Table
      ID    Name    Type    Legs  Fur
      0001  Marko   Human   2     false
      0002  Fluffy  Dog     4     true
      0003  My_Rug  Carpet  N/A   N/A
    Friendship_Table
      ID1   ID2
      0001  0002
      0002  0001
    Subclass_Table
      Type1  Type2
      Human  Mammal
      Dog    Mammal
    Pee_Table
      ID1   ID2
      0002  0003
  • 18. Our World Modeled in a Graph Database - Phase 4 [Figure: the Phase 3 graph plus the edges 0003 -type-> Carpet, 0003 -name-> My_Rug, and 0002 -peedOn-> 0003.]
  • 19. Our Make Believe World - Phase 5 • Marko is a human and Fluffy is a dog. • Marko and Fluffy are good friends. • Human and dog are subclasses of mammal. • Fluffy peed on the carpet. • Marko and Fluffy are both mammals.
  • 20. Our World Modeled in a Relational Database - Phase 5
    Object_Table
      ID    Name    Type    Legs  Fur
      0001  Marko   Human   2     false
      0002  Fluffy  Dog     4     true
      0003  My_Rug  Carpet  N/A   N/A
    Friendship_Table
      ID1   ID2
      0001  0002
      0002  0001
    Subclass_Table
      Type1  Type2
      Human  Mammal
      Dog    Mammal
    Pee_Table
      ID1   ID2
      0002  0003
    Type_Table
      ID    Type
      0001  Human
      0002  Dog
      0003  Carpet
      0001  Mammal
      0002  Mammal
  • 21. Our World Modeled in a Graph Database - Phase 5 [Figure: the Phase 4 graph plus the edges 0001 -type-> Mammal and 0002 -type-> Mammal.]
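The step from Phase 4 to Phase 5 can be sketched as a small edge-list computation. The slides simply add the two new type edges by hand; deriving them by following subclassof edges, as below, is an assumption made for illustration, and the names EDGES and infer_types are hypothetical.

```python
# Hypothetical edge-list encoding of the Phase 4 graph: (source, label, target).
EDGES = {
    ("Human", "subclassof", "Mammal"),
    ("Dog", "subclassof", "Mammal"),
    ("0001", "type", "Human"),
    ("0002", "type", "Dog"),
    ("0003", "type", "Carpet"),
    ("0001", "friend", "0002"),
    ("0002", "friend", "0001"),
    ("0002", "peedOn", "0003"),
}


def infer_types(edges):
    """Add (x, type, C2) whenever (x, type, C1) and (C1, subclassof, C2) hold."""
    closed = set(edges)
    changed = True
    while changed:
        changed = False
        for (x, label, c1) in list(closed):
            if label != "type":
                continue
            for (sub, rel, sup) in list(closed):
                if rel == "subclassof" and sub == c1 and (x, "type", sup) not in closed:
                    closed.add((x, "type", sup))
                    changed = True
    return closed


phase5 = infer_types(EDGES)
new_edges = sorted(phase5 - EDGES)
print(new_edges)
```

In the relational model the same change required a new Type_Table; in the graph it amounts to two more edges.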
  • 22. The Graph as the Natural World Model • The world is inherently (or perceived as) object-oriented. • The world is filled with objects and relations among them. • The multi-relational graph is a very natural representation of the world.
  • 23. The Graph as the Natural Programming Model • High-level computer languages are object-oriented. • There is almost no impedance mismatch between the multi-relational graph and the programming object. • It is easy to go from graph database to in-memory object.
      Human marko = new Human();
      marko.name = "Marko";
      marko.addFriend(fluffy);
      marko.setHasFur(false);
      marko.setLegs(2);
  • 24. SQL vs. SPARQL
    SQL:
      SELECT OTY.Name
      FROM Object_Table AS OTX, Object_Table AS OTY, Friendship_Table
      WHERE OTX.Name = "Marko"
        AND Friendship_Table.ID1 = OTY.ID
        AND Friendship_Table.ID2 = OTX.ID;
    SPARQL:
      SELECT ?z WHERE { ?x name "Marko" . ?y friend ?x . ?y name ?z }
    E. Prud’hommeaux and A. Seaborne. SPARQL Query Language for RDF, WWW Consortium, http://www.w3.org/TR/2004/WD-rdf-sparql-query-20041012/, 2004.
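To see the SQL query run, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names follow the slides; the column types and the choice of SQLite are assumptions, and the name "Marko" is passed as a bound parameter rather than a quoted literal.

```python
import sqlite3

# In-memory database holding the Phase 2 tables from the slides.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Object_Table (ID TEXT PRIMARY KEY, Name TEXT, Type TEXT);
    CREATE TABLE Friendship_Table (ID1 TEXT, ID2 TEXT);
    INSERT INTO Object_Table VALUES ('0001', 'Marko', 'Human');
    INSERT INTO Object_Table VALUES ('0002', 'Fluffy', 'Dog');
    INSERT INTO Friendship_Table VALUES ('0001', '0002');
    INSERT INTO Friendship_Table VALUES ('0002', '0001');
""")

# The self-join from the slide: find everyone who lists Marko as a friend.
rows = conn.execute(
    """
    SELECT OTY.Name
    FROM Object_Table AS OTX, Object_Table AS OTY, Friendship_Table
    WHERE OTX.Name = ?
      AND Friendship_Table.ID1 = OTY.ID
      AND Friendship_Table.ID2 = OTX.ID
    """,
    ("Marko",),
).fetchall()
friends = [name for (name,) in rows]
print(friends)
```

The query needs two aliases of Object_Table plus the join table, whereas the SPARQL version states the same three-edge pattern directly.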
  • 25. Outline • The Relational Database vs. the Graph Database • The Web of Documents vs. the Web of Data • Local Computing vs. Distributed Computing • Multi-Relational Network Analysis with Grammar Walkers
  • 26. Internet Address Spaces • The Uniform Resource Identifier (URI) is the superclass of the Uniform Resource Locator (URL) and Uniform Resource Name (URN).
  • 27. The Uniform Resource Locator • The set of all URLs is the address space of all resources that can be located and retrieved on the Web. URLs denote where a resource is.
    http://markorodriguez.com/index.html
      ∗ Domain name server (DNS): markorodriguez.com → 216.251.43.6
      ∗ http:// means GET at port 80
      ∗ /index.html means the resource to get at that Internet location
    [Figure: a web server at markorodriguez.com (216.251.43.6) serving index.html.]
  • 28. The Uniform Resource Name • The set of all URNs is the address space of all resources within the urn: namespace. Examples: urn:uuid:bd93def0-8026-11dd-842be54955baa12, urn:issn:0892-3310, urn:doi:10.1016/j.knosys.2008.03.030 • Named resources need not be retrievable through the Web. • URNs denote what a resource is.
  • 29. The Uniform Resource Identifier • The URI address space is an infinite space for all Internet resources. Examples: urn:issn:0892-3310, ftp://markorodriguez.com/private/markos_secrets.txt, http://www.lanl.gov#fluffy • Important: URIs can denote concepts, instances, and data (e.g. lanl:fluffy, lanl:fluffy_legs). Here lanl is a namespace prefix that expands to http://www.lanl.gov#.
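The prefix convention on this slide can be sketched in a few lines. The prefix table comes from the slide; the helper name expand is hypothetical, and urldefrag from the standard library is used only to show that the local name ends up in the URI fragment.

```python
from urllib.parse import urldefrag

# Prefix table taken from the slide; `expand` is a hypothetical helper name.
PREFIXES = {"lanl": "http://www.lanl.gov#"}


def expand(prefixed_name):
    """Expand a prefixed name such as 'lanl:fluffy' into a full URI."""
    prefix, _, local = prefixed_name.partition(":")
    return PREFIXES[prefix] + local


uri = expand("lanl:fluffy")
base, fragment = urldefrag(uri)  # split off the part after '#'
print(uri, base, fragment)
```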
  • 30. The Web of Documents • The Web of Documents is primarily concerned with the Hypertext Transfer Protocol (HTTP) and with retrievable resources in the URL address space. • These retrievable resources are files: HTML documents, images, audio, etc. The “web” is created when HTML documents contain URLs. [Figure: under http://markorodriguez.com/, index.html links by href to Resume.html, Home.html, and Research.html.]
  • 31. The Web of Data • The Web of Data is primarily concerned with URIs. • The Resource Description Framework (RDF) is the standard for representing the relationship between URIs and literals (e.g. float, string, date time, etc.).
      subject      predicate   object
      lanl:marko   foaf:knows  lanl:fluffy
      lanl:marko   foaf:name   "Marko A. Rodriguez"^^xsd:string
      lanl:fluffy  foaf:name   "Fluffy P. Everywhere"^^xsd:string
    C. Bizer, T. Heath, K. Idehen, and T. Berners-Lee. Linked Data on the Web, International World Wide Web Conference, 2008.
  • 32. Our Make Believe World in RDF
      lanl:Human   rdfs:subClassOf  lanl:Mammal
      lanl:Dog     rdfs:subClassOf  lanl:Mammal
      lanl:marko   rdf:type         lanl:Human
      lanl:marko   rdf:type         lanl:Mammal
      lanl:fluffy  rdf:type         lanl:Dog
      lanl:fluffy  rdf:type         lanl:Mammal
      lanl:marko   lanl:friend      lanl:fluffy
      lanl:fluffy  lanl:friend      lanl:marko
      lanl:marko   lanl:fur         "false"^^xsd:boolean
      lanl:marko   lanl:legs        "2"^^xsd:integer
      lanl:marko   foaf:name        "Marko A. Rodriguez"^^xsd:string
      lanl:fluffy  lanl:fur         "true"^^xsd:boolean
      lanl:fluffy  lanl:legs        "4"^^xsd:integer
      lanl:fluffy  foaf:name        "Fluffy P. Everywhere"^^xsd:string
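A toy illustration of how such triples answer a SPARQL-style pattern query follows. This is not an RDF library: TRIPLES, match, and query are hypothetical names, datatypes are dropped from the literals, and the join strategy is a naive left-to-right scan.

```python
# A few of the triples from the slide, with datatype suffixes dropped.
TRIPLES = [
    ("lanl:marko", "foaf:name", "Marko A. Rodriguez"),
    ("lanl:fluffy", "foaf:name", "Fluffy P. Everywhere"),
    ("lanl:marko", "lanl:friend", "lanl:fluffy"),
    ("lanl:fluffy", "lanl:friend", "lanl:marko"),
]


def match(pattern, bindings):
    """Yield extended bindings for one (s, p, o) pattern; '?x' marks a variable."""
    for triple in TRIPLES:
        new = dict(bindings)
        ok = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against its earlier binding.
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield new


def query(patterns):
    """Join the patterns left to right, like a basic graph pattern."""
    results = [{}]
    for pattern in patterns:
        results = [b for old in results for b in match(pattern, old)]
    return results


# Same shape as the SPARQL query on slide 24: who names Marko as a friend?
answers = query([
    ("?x", "foaf:name", "Marko A. Rodriguez"),
    ("?y", "lanl:friend", "?x"),
    ("?y", "foaf:name", "?z"),
])
names = [b["?z"] for b in answers]
print(names)
```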
  • 33. The Web of Data is a Distributed Database • The URI address space is distributed. • URIs can denote data. • RDF denotes the relationships between URIs. • The Web of Data’s foundational standard is RDF. • Therefore, the Web of Data is a distributed database.
  • 34. The Web of Documents vs. the Web of Data [Figure: on the Web of Documents, HTML on web servers at 127.0.0.1 and 127.0.0.2 is linked by href; on the Web of Data, graph databases at 127.0.0.1 and 127.0.0.2 are linked by lanl:friend.]
  • 35. The Current Web of Data - March 2009 [Figure: graph of the Linked Data cloud, with one vertex per data set (e.g. dbpedia, geonames, uniprot, musicbrainz, freebase, foafprofiles).] M.A. Rodriguez. A Graph Analysis of the Linked Data Cloud, in review, http://arxiv.org/abs/0903.0194, 2009.
  • 36. The Current Web of Data - March 2009
      data set          domain      | data set          domain      | data set            domain
      audioscrobbler    music       | govtrack          government  | pubguide            books
      bbclatertotp      music       | homologene        biology     | qdos                social
      bbcplaycountdata  music       | ibm               computer    | rae2001             computer
      bbcprogrammes     media       | ieee              computer    | rdfbookmashup       books
      budapestbme       computer    | interpro          biology     | rdfohloh            social
      chebi             biology     | jamendo           music       | resex               computer
      crunchbase        business    | laascnrs          computer    | riese               government
      dailymed          medical     | libris            books       | semanticweborg      computer
      dblpberlin        computer    | lingvoj           reference   | semwebcentral       social
      dblphannover      computer    | linkedct          medical     | siocsites           social
      dblprkbexplorer   computer    | linkedmdb         movie       | surgeradio          music
      dbpedia           general     | magnatune         music       | swconferencecorpus  computer
      doapspace         social      | musicbrainz       music       | taxonomy            reference
      drugbank          medical     | myspacewrapper    social      | umbel               general
      eurecom           computer    | opencalais        reference   | uniref              biology
      eurostat          government  | opencyc           general     | unists              biology
      flickrexporter    images      | openguides        reference   | uscensusdata        government
      flickrwrappr      images      | pdb               biology     | virtuososponger     reference
      foafprofiles      social      | pfam              biology     | w3cwordnet          reference
      freebase          general     | pisa              computer    | wikicompany         business
      geneid            biology     | prodom            biology     | worldfactbook       government
      geneontology      biology     | projectgutenberg  books       | yago                general
      geonames          geographic  | prosite           biology     | ...
  • 37. Cultural Differences that are Leading to Web-Based Data Management - Part 1 • Relational databases tend to not maintain public access points. • Relational database users tend to not publish their schemas. • Web of Data graph databases maintain public access points called SPARQL end-points or Linked Data URLs. • Web of Data graph database users tend to reuse and extend public schemas called ontologies.
  • 38. Cultural Differences that are Leading to Web-Based Data Management - Part 2 [Figure: in the Conventional Model, Applications 1 to 3 at 127.0.0.1 to 127.0.0.3 each process their own local structures; in the Web of Data Model, the same applications process shared Web of Data structures hosted at 127.0.0.4 to 127.0.0.6.]
  • 39. Outline • The Relational Database vs. the Graph Database • The Web of Documents vs. the Web of Data • Local Computing vs. Distributed Computing • Multi-Relational Network Analysis with Grammar Walkers
  • 40. SPARQLing a Data Provider - Local Computing
    Query sent by the client at 127.0.0.1 to the SPARQL end-point of the graph database at 127.0.0.2:
      SELECT ?x WHERE { lanl:marko lanl:friend ?x }
    Result: { lanl:fluffy }
    • The 127.0.0.1 client is querying the 127.0.0.2 server. • The query is any read-based SPARQL query. • The results are those resources that bind to the query variables.
  • 41. GETing Linked Data as RDF - Local Computing [Figure: the client at 127.0.0.1 issues HTTP GET requests for http://www.lanl.gov#marko and http://www.vub.edu#1010 and receives RDF subgraphs of the Web of Data; the edges shown are labeled lanl:friend, lanl:wrote, and lanl:cites and connect lanl:marko, lanl:fluffy, vub:1010, and ieee:2020.]
  • 42. Problem with the Current Web of Data Infrastructure • The only interfaces are SPARQL end-points and HTTP GETs of RDF subgraphs. • For human-based document retrieval, this is fine. For machine-based data processing, this does not scale. M.A. Rodriguez. A Distributed Process Infrastructure for a Distributed Data Structure. Semantic Web and Information Systems Bulletin, AIS Special Interest Group on Semantic Web and Information Systems, http://arxiv.org/abs/0807.3908, 2008.
  • 43. Problem with the Current Web of Data Infrastructure • We cannot rely on the “download and index” philosophy of the World Wide Web. As of March 2009, the Web of Data maintains 4.5 billion triples. • The Web of Data cannot rely on a single service provider: ∗ too much data. ∗ too many types of algorithms that can utilize this data. ∗ too many clock cycles to locally process this data.
  • 44. The Open Virtual Machine Farm [Figure: graph databases at 127.0.0.1 and 127.0.0.2 linked by lanl:friend, each with a virtual machine farm; code/machine migrates between the farms.] • Distributed computing through code/machine migration between farms. • Move the process to the data, not the data to the process. M.A. Rodriguez. General Purpose Computing on a Semantic Network Substrate. in Emergent Web Intelligence, eds. R. Chbeir, A. Hassanien, A. Abraham and Y. Badr, Springer-Verlag, http://arxiv.org/abs/0704.3395, 2009. M.A. Rodriguez. The RDF Virtual Machine, in review, LA-UR-08-03925, 2009.
  • 45. Neno RDF Programming Language - Code Serialization
    Neno method on demo:Human:
      xsd:int example(xsd:string a) {
        if(a == "marko") return 1;
        else return 2;
      }
    [Figure: the RDF serialization of this method as a graph of urn:uuid: nodes typed Method, Block, Branch, Equals, PushValue, LocalDirect, and Return, linked by hasMethod, hasMethodName, hasBlock, nextInst, trueInst, falseInst, hasLeft, hasRight, hasValue, and hasURI.]
  • 46. The Fhat RDF Virtual Machine - Machine Serialization [Figure: class diagram relating RVM, Fhat, Frame, Block, Instruction, and the operand, return, and block stacks through properties such as halt, methodReuse, programLocation, operandTop, returnTop, blockTop, currentFrame, hasFrame, forFrame, fromBlock, hasSymbol, and hasValue.]
  • 47. A Collection of Interlinked Graph Databases - Currently [Figure: interlinked graph databases at hosts 127.0.0.2 through 127.0.0.11.]
  • 48. A Collection of Interlinked Graph Databases and Processors - Future [Figure: the same hosts, 127.0.0.2 through 127.0.0.11, now drawn as interlinked graph databases and processors.]
  • 49. The Future of Web-Based Distributed Computing • The HTTP GET approach to Web of Data does not scale. • The Neno/Fhat (or any general-purpose computing) environment is unsafe. • The Web of Data needs an open, safe, flexible, and easy to adopt computing infrastructure.
  • 50. What Type of Processing? • Object-oriented programming: Web of Data as an object repository. • Logic: Web of Data as a knowledge-base. • Graph/network analysis: Web of Data as a multi-relational graph. • The future computing environment should support at least these popular processing models. • We will focus on graph/network analysis for the remainder of this presentation.
  • 51. Outline • The Relational Database vs. the Graph Database • The Web of Documents vs. the Web of Data • Local Computing vs. Distributed Computing • Multi-Relational Network Analysis with Grammar Walkers
  • 52. Introduction to Random Walkers • Random walkers can be used in single-relational networks to calculate: ∗ stationary probability distribution: primary eigenvector calculation ∗ spreading activation: search by means of diffusion • There is a continuous and a discrete form of the general random walk method.
  • 53. Random Walks in a Single-Relational Network • Suppose a single-relational network $G$, where $G = (V, E \subseteq (V \times V))$. • Let’s represent that network as a row-stochastic adjacency matrix $A \in [0,1]^{|V| \times |V|}$, where $A_{i,j} = \begin{cases} \frac{1}{\Gamma(i)} & \text{if } (i,j) \in E \\ 0 & \text{otherwise.} \end{cases}$ • Finally, assume an “energy vector” $\pi \in \mathbb{R}^{|V|}$.
  • 54. Random Walks in a Single-Relational Network
    [Figure: the network G over vertices a, b, c, d.]
    A:
           a     b     c     d
      a    0     0.5   0     0.5
      b    0     0     1     0
      c    0.5   0     0     0.5
      d    0     1     0     0
    π = (1, 0, 0, 0)
    • πA can be interpreted as the continuous form of propagating random walkers over G.
  • 55. Stationary Probability Distribution in a Single-Relational Network
    Repeatedly multiplying π by A over time:
      time   a     b     c     d
      π1     1     0     0     0
      π2     0     0.5   0     0.5
      π3     0     0.5   0.5   0
      π4     0.25  0     0.5   0.25
      π5     0.25  0.38  0     0.38
      π6     0     0.5   0.38  0.13
      ...
      π∞     0.15  0.31  0.31  0.23
  • 56. Stationary Probability Distribution in a Single-Relational Network • If G is strongly connected and aperiodic then there exists a π such that π = πA. • This stationary $\pi^\infty$ is the primary eigenvector of A. • PageRank computes the stationary π by forcing G (the Web citation graph) to be strongly connected and aperiodic.
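The iteration on slide 55 can be reproduced with a short power-iteration sketch. The matrix A is the one from slide 54; the function name step and the fixed count of 200 iterations are illustrative choices, not part of the presentation.

```python
# Row-stochastic adjacency matrix from slide 54 (rows and columns: a, b, c, d).
A = [
    [0.0, 0.5, 0.0, 0.5],  # a
    [0.0, 0.0, 1.0, 0.0],  # b
    [0.5, 0.0, 0.0, 0.5],  # c
    [0.0, 1.0, 0.0, 0.0],  # d
]


def step(pi, A):
    """One propagation step: returns the row vector pi * A."""
    n = len(A)
    return [sum(pi[i] * A[i][j] for i in range(n)) for j in range(n)]


pi = [1.0, 0.0, 0.0, 0.0]  # all energy starts at vertex a
for _ in range(200):
    pi = step(pi, A)

stationary = [round(x, 2) for x in pi]
print(stationary)  # [0.15, 0.31, 0.31, 0.23]
```

Because this G is strongly connected and aperiodic, the result does not depend on the starting vector, only on A.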
  • 57. Spreading Activation in a Single-Relational Network • Spreading activation can be thought of as a “local rank” algorithm, while calculating the stationary probability provides you a “global rank”. • With spreading activation, you iterate for only a certain number of timesteps. • Also, you record how much energy has flowed through each vertex. • Let’s demonstrate using a single discrete walker...
  • 58. Spreading Activation in a Single-Relational Network • The walker moves from vertex to vertex, with each choice dependent on the probability distribution of A. • At every step, if the walker is at vertex $i$ then $\pi_i = \pi_i + 1$.
    [Figure: the network G with the walker at a (time 1), b (time 2), c (time 3), and a again (time 4).]
      time   a  b  c  d
      π1     1  0  0  0
      π2     1  1  0  0
      π3     1  1  1  0
      π4     2  1  1  0
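A single discrete walker of this kind might be sketched as follows. The matrix is again the one from slide 54; the function name spread, the seed, and the use of random.Random.choices for the edge choice are assumptions made for illustration.

```python
import random

# Row-stochastic adjacency matrix from slide 54 (rows and columns: a, b, c, d).
A = [
    [0.0, 0.5, 0.0, 0.5],  # a
    [0.0, 0.0, 1.0, 0.0],  # b
    [0.5, 0.0, 0.0, 0.5],  # c
    [0.0, 1.0, 0.0, 0.0],  # d
]


def spread(A, start, steps, rng):
    """Visit `steps` vertices (the start included) and count visits per vertex."""
    n = len(A)
    energy = [0] * n
    current = start
    for _ in range(steps):
        energy[current] += 1  # pi_i = pi_i + 1
        # Pick the next vertex according to row `current` of A.
        current = rng.choices(range(n), weights=A[current])[0]
    return energy


energy = spread(A, start=0, steps=4, rng=random.Random(42))
print(energy)
```

The exact counts depend on the random choices; the slide's run a, b, c, a is one possible outcome, giving (2, 1, 1, 0).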
  • 59. Random Walks in a Multi-Relational Network • Suppose a multi-relational network $M$, where $M = (V, E = \{E_0, E_1, \ldots, E_k \subseteq (V \times V)\})$. • Represent it as a $\{0,1\}$-adjacency tensor $A \in \{0,1\}^{|V| \times |V| \times |E|}$, where $A^m_{i,j} = \begin{cases} 1 & \text{if } (i,j) \in E_m : 1 \leq m \leq k \\ 0 & \text{otherwise.} \end{cases}$ • Then assume an “energy vector” $\pi \in \mathbb{R}^{|V|}$. M.A. Rodriguez and J. Shinavier. Exposing Multi-Relational Networks to Single-Relational Network Analysis Algorithms, in review, http://arxiv.org/abs/0806.2274, 2009.
  • 60. Random Walks in a Multi-Relational Network [Figure: a multi-relational network M over vertices a, b, c, d with edge labels authored, cites, and contains; its adjacency tensor A with one slice per edge label; and an energy vector π.]
  • 61. The Operations of the Multi-Relational Path Algebra
    • $A \cdot B$: ordinary matrix multiplication determines the number of (A, B)-paths between vertices.
    • $A^\top$: matrix transpose inverts path directionality.
    • $A \circ B$: Hadamard, entry-wise multiplication applies a filter to selectively exclude paths.
    • $n(A)$: not generates the complement of a $\{0,1\}^{n \times n}$ matrix.
    • $c(A)$: clip generates a $\{0,1\}^{n \times n}$ matrix from a $\mathbb{R}_+^{n \times n}$ matrix.
    • $v^\pm(A)$: vertex generates a $\{0,1\}^{n \times n}$ matrix from a $\mathbb{R}_+^{n \times n}$ matrix, where only certain rows or columns contain non-zero values.
    • $\lambda A$: scalar multiplication weights the entries of a matrix.
    • $A + B$: matrix addition merges paths.
  • 62. The Traverse Operation • An interesting aspect of the single-relational adjacency matrix $A \in \{0,1\}^{n \times n}$ is that when it is raised to the $k$th power, the entry $A^{(k)}_{i,j}$ is equal to the number of paths of length $k$ that connect vertex $i$ to vertex $j$. • Given, by definition, that $A^{(1)}_{i,j}$ (i.e. $A_{i,j}$) represents the number of paths that go from $i$ to $j$ of length 1 (i.e. a single edge) and by the rules of ordinary matrix multiplication, $A^{(k)}_{i,j} = \sum_{l \in V} A^{(k-1)}_{i,l} \cdot A_{l,j} : k \geq 2$.
    Example, with rows and columns ordered a, b, c:
    $\begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix} = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}$
    There is a path of length 2 from a to c.
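The path-counting property can be checked directly on the three-vertex example. This is a minimal pure-Python sketch; matmul and power are hypothetical helper names.

```python
def matmul(X, Y):
    """Ordinary matrix product of two list-of-lists matrices."""
    return [
        [sum(X[i][l] * Y[l][j] for l in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


def power(A, k):
    """A raised to the k-th power (k >= 1) by repeated multiplication."""
    result = A
    for _ in range(k - 1):
        result = matmul(result, A)
    return result


# Adjacency matrix of the chain a -> b -> c (rows and columns: a, b, c).
A = [
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]

A2 = power(A, 2)
print(A2)  # [[0, 0, 1], [0, 0, 0], [0, 0, 0]]
```

Entry A2[0][2] is 1 because a to b to c is the only path of length 2; power(A, 3) is all zeros because the chain has no path of length 3.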
  • 63. The Traverse Operation ($A^1$: authored, $A^2$: cites, $A^3$: contains) $Z = A^1 \cdot A^2 \cdot A^{1\top}$, where $Z_{i,j}$ defines the number of paths from vertex $i$ to vertex $j$ such that a path goes from author $i$ to one of the articles he or she has authored, from that article to one of the articles it cites, and finally, from that cited article to its author $j$. Semantically, Z is an author-citation single-relational path matrix. [Figure: lanl:marko -lanl:authored-> vub:1010 -lanl:cites-> ieee:2020 <-lanl:authored- vub:fheyligh, giving the derived edge lanl:marko -lanl:author-citation-> vub:fheyligh.] * NOTE: All diagrams are with respect to a “source” vertex (the blue vertex) in order to preserve clarity. In reality, the operations operate on all vertices in parallel.
  • 64. The Filter Operation Various path filters can be defined and applied using the entry-wise Hadamard matrix product denoted $\circ$, where $A \circ B = \begin{pmatrix} A_{1,1} \cdot B_{1,1} & \cdots & A_{1,m} \cdot B_{1,m} \\ \vdots & \ddots & \vdots \\ A_{n,1} \cdot B_{n,1} & \cdots & A_{n,m} \cdot B_{n,m} \end{pmatrix}$.
    Example (path matrix $\circ$ path filter = filtered path matrix):
    $\begin{pmatrix} 24 & 1 & 0 & 0 & 0 \\ 0 & 72 & 0 & 4 & 0 \\ 23 & 0 & 0 & 0 & 0 \\ 0 & 0 & 15.3 & 0 & 0 \\ 0 & 0 & 0 & 0 & 12 \end{pmatrix} \circ \begin{pmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix} = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 72 & 0 & 0 & 0 \\ 23 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix}$
  • 65. The Filter Operation • $A \circ 1 = A$ • $A \circ 0 = 0$ • $A \circ B = B \circ A$ • $A \circ (B + C) = (A \circ B) + (A \circ C)$ • $A^\top \circ B^\top = (A \circ B)^\top$.
  • 66. The Not Filter The not filter is useful for excluding a set of paths to or from a vertex. $n : \{0,1\}^{n \times n} \to \{0,1\}^{n \times n}$ with a function rule of $n(A)_{i,j} = \begin{cases} 1 & \text{if } A_{i,j} = 0 \\ 0 & \text{otherwise.} \end{cases}$
    Example:
    $n \begin{pmatrix} 0 & 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 1 & 1 \\ 1 & 1 & 0 & 1 & 1 \\ 1 & 1 & 1 & 1 & 0 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}$
  • 67. The Not Filter If $A \in \{0,1\}^{n \times n}$, then • $n(n(A)) = A$ • $A \circ n(A) = 0$ • $n(A) \circ n(A) = n(A)$.
  • 68. The Not Filter ($A^1$: authored, $A^2$: cites, $A^3$: contains) A coauthorship path matrix is $Z = A^1 \cdot A^{1\top} \circ n(I)$. [Figure: lanl:marko and lanl:jbollen are both connected by lanl:authored to acm:0505, giving the derived edge lanl:marko -lanl:coauthor-> lanl:jbollen; the n(I) filter removes the lanl:coauthor self-loop on lanl:marko.]
  • 69. The Clip Filter The general purpose of clip is to take a path matrix and “clip”, or normalize, it to a $\{0,1\}^{n \times n}$ matrix. $c : \mathbb{R}_+^{n \times n} \to \{0,1\}^{n \times n}$ with $c(Z)_{i,j} = \begin{cases} 1 & \text{if } Z_{i,j} > 0 \\ 0 & \text{otherwise.} \end{cases}$
    Example:
    $c \begin{pmatrix} 24 & 1 & 0 & 0 & 0 \\ 0 & 72 & 0 & 4 & 0 \\ 23 & 0 & 0 & 0 & 0 \\ 0 & 0 & 15.3 & 0 & 0 \\ 0 & 0 & 0 & 0 & 12 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}$
  • 70. The Clip Filter If $A, B \in \{0,1\}^{n \times n}$ and $Y, Z \in \mathbb{R}_+^{n \times n}$, then • $c(A) = A$ • $c(n(A)) = n(c(A)) = n(A)$ • $c(Y \circ Z) = c(Y) \circ c(Z)$ • $n(A \circ B) = c(n(A) + n(B))$ • $n(A + B) = n(A) \circ n(B)$
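The not and clip identities can be spot-checked numerically on the example matrices from the filter and clip slides. This is a sketch, not a proof: hadamard, add, n, and c are hypothetical helper names, and a single pair of matrices only illustrates the identities.

```python
def hadamard(X, Y):
    """Entry-wise product of two equally sized matrices."""
    return [[x * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def add(X, Y):
    """Entry-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def n(A):
    """Not filter: 1 where the entry is 0, else 0."""
    return [[1 if a == 0 else 0 for a in row] for row in A]


def c(Z):
    """Clip filter: 1 where the entry is positive, else 0."""
    return [[1 if z > 0 else 0 for z in row] for row in Z]


# Path matrix and path filter from the filter-operation example.
Z = [[24, 1, 0, 0, 0], [0, 72, 0, 4, 0], [23, 0, 0, 0, 0],
     [0, 0, 15.3, 0, 0], [0, 0, 0, 0, 12]]
B = [[0, 1, 0, 0, 0], [0, 1, 0, 0, 0], [1, 0, 0, 0, 0],
     [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]
A = c(Z)  # a {0,1} matrix to test the not identities on
zero = [[0] * 5 for _ in range(5)]

checks = {
    "n(n(A)) = A": n(n(A)) == A,
    "A o n(A) = 0": hadamard(A, n(A)) == zero,
    "c(A) = A": c(A) == A,
    "c(Z o B) = c(Z) o c(B)": c(hadamard(Z, B)) == hadamard(c(Z), c(B)),
    "n(A o B) = c(n(A) + n(B))": n(hadamard(A, B)) == c(add(n(A), n(B))),
    "n(A + B) = n(A) o n(B)": n(add(A, B)) == hadamard(n(A), n(B)),
}
for name, holds in checks.items():
    print(name, holds)
```

The last two identities are the matrix form of De Morgan's laws, with Hadamard product acting as "and" and clipped addition acting as "or".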
  • 71. The Clip Filter ($A^1$: authored, $A^2$: cites, $A^3$: contains) Suppose we want to create an author citation path matrix that does not allow self citation or coauthor citations. $Z = \left(A^1 \cdot A^2 \cdot A^{1\top}\right) \circ \underbrace{n\left(c\left(A^1 \cdot A^{1\top} \circ n(I)\right)\right)}_{\text{no coauthors}} \circ \underbrace{n(I)}_{\text{no self cites}}$ [Figure: lanl:marko -lanl:authored-> lanl:3030 -lanl:cites-> lanl:4040, where lanl:4040 is authored by odu:nelson, by marko's coauthor lanl:jbollen, and by lanl:marko; only the lanl:author-citation edge from lanl:marko to odu:nelson survives the coauthor and self filters.]
  • 72. The Clip Filter ($A^1$: authored, $A^2$: cites, $A^3$: contains) However, using various theorems of the path algebra and abstract algebra in general, $Z = \left(A^1 \cdot A^2 \cdot A^{1\top}\right) \circ \underbrace{n\left(c\left(A^1 \cdot A^{1\top} \circ n(I)\right)\right)}_{\text{no coauthors}} \circ \underbrace{n(I)}_{\text{no self cites}}$ becomes $Z = \left(A^1 \cdot A^2 \cdot A^{1\top}\right) \circ n\left(c\left(A^1 \cdot A^{1\top}\right)\right) \circ n(I)$.
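The filtered author-citation matrix and its simplified form can be compared on a toy network. The six-vertex example below (three authors, three articles) is invented for illustration and is not the network drawn on the slides; all helper names are hypothetical, and the transposes are placed as in the path-algebra definition of author-citation.

```python
def matmul(X, Y):
    """Ordinary matrix product."""
    return [[sum(X[i][l] * Y[l][j] for l in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


def transpose(X):
    return [list(col) for col in zip(*X)]


def hadamard(X, Y):
    return [[x * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def n(A):
    return [[1 if a == 0 else 0 for a in row] for row in A]


def c(Z):
    return [[1 if z > 0 else 0 for z in row] for row in Z]


def adjacency(size, edges):
    """Build a {0,1} adjacency matrix from (source, target) index pairs."""
    M = [[0] * size for _ in range(size)]
    for i, j in edges:
        M[i][j] = 1
    return M


# Invented toy network. Vertex order: marko, jbollen, nelson, p1, p2, p3.
A1 = adjacency(6, [(0, 3), (1, 3), (0, 4), (2, 4), (1, 5)])  # authored
A2 = adjacency(6, [(3, 4), (4, 5)])                          # cites
I = adjacency(6, [(i, i) for i in range(6)])

raw = matmul(matmul(A1, A2), transpose(A1))  # author-citation paths
co = matmul(A1, transpose(A1))               # shared-article counts

Z_full = hadamard(hadamard(raw, n(c(hadamard(co, n(I))))), n(I))
Z_simple = hadamard(hadamard(raw, n(c(co))), n(I))
print(Z_full == Z_simple, Z_full[1][2], Z_full[2][1])
```

In this toy network only jbollen to nelson and nelson to jbollen survive: every citation leaving marko lands on marko himself or on one of his coauthors.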
  • 73. Other Filters and Operations... • Please refer to the article for more information on these filters and operations.
  • 74. Problems with the Path Algebra • As a matrix algebra, it is impossible (computationally speaking) to compute matrix operations over the entire Web of Data. • However, it is possible to approximate these calculations using “random” walkers.
  • 75. Mapping Paths to Grammar-Based Random Walkers • A grammar-based random walker is a walker that obeys a path description. • Able to compute “semantically rich” spreading activation and stationary probability distributions in a multi-relational network. • Able to approximate through the convergence properties of these operations. • Provides a convenient application to the Web of Data and linked graph databases. M.A. Rodriguez. Grammar-Based Random Walkers in Semantic Networks. Knowledge-Based Systems, 21(7), 727–739, 2008.
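As a rough illustration of a walker that obeys a path description, the sketch below releases walkers that follow an authored edge forward, then an authored edge backward, and discards walks that return to the source (the role n(I) plays in the coauthorship path). The toy data, the function names, and the uniform edge choice are assumptions; this is not the grammar formalism of the cited article.

```python
import random
from collections import Counter

# Invented toy data: author -> articles they authored.
AUTHORED = {"marko": ["p1", "p2"], "jbollen": ["p1", "p3"], "nelson": ["p2"]}


def inverse(edges):
    """Invert an adjacency mapping: article -> authors."""
    inv = {}
    for src, targets in edges.items():
        for target in targets:
            inv.setdefault(target, []).append(src)
    return inv


AUTHORED_BY = inverse(AUTHORED)


def coauthor_walk(source, walkers, rng):
    """Release walkers that obey authored, then inverse authored, minus self paths."""
    hits = Counter()
    for _ in range(walkers):
        article = rng.choice(AUTHORED[source])    # step 1: follow authored
        author = rng.choice(AUTHORED_BY[article])  # step 2: follow it backward
        if author != source:                       # the n(I) filter
            hits[author] += 1
    return hits


hits = coauthor_walk("marko", 1000, random.Random(7))
print(hits)
```

Each walker touches only the edges along its own path, which is why this style of computation can move between graph databases instead of requiring a global matrix. Note that the sampled frequencies reflect path probabilities rather than the exact path counts of the matrix formulation.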
  • 76. A Grammar Walker [Figure: a grammar walker obeying $A^1 \cdot A^{1\top} \circ n(I)$ at t=1, t=2, and t=3 as it moves across Web of Data structures hosted at 127.0.0.4, 127.0.0.5, and 127.0.0.6.]
  • 77. Grammar Walking the Web of Data [Figure: a walker's path, numbered 1 to 7, across graph databases at hosts 127.0.0.1 through 127.0.0.11.]
  • 78. Conclusion • Graph databases will increasingly support the Web of Data. • The Web of Data is about open, global-scale data management. • Distributed computing is required for global-scale data processing. • Grammar walkers can be used for distributed network analysis on the Web of Data.
  • 79. Thank You For Your Time My homepage: http://markorodriguez.com Neno/Fhat: http://neno.lanl.gov Collective Decision Making Systems: http://cdms.lanl.gov Faith in the Algorithm: http://faithinthealgorithm.net MESUR: http://www.mesur.org