Using MongoDB as a high performance graph database

  • Hi everyone, we've open sourced the tripod-php library, see https://github.com/talis/tripod-php

  1. Thursday, 21 June 12
  2. Using MongoDB as a high performance graph database. MongoDB UK, 20th June 2012. Chris Clarke, CTO, Talis Education Limited.
     Notes: Who is Talis? Using Mongo for about 8 months (since 2.0), 5 months in production.
  3. What this talk is not about
     Notes: This talk is not:
     • A blueprint for what you should do
     • A pitch to encourage you to take our approach
     • Providing or proving performance benchmarks
     • Evangelism for the semantic web or linked data
     • Encouraging you to contribute to, download or use an open source project
     • Optimised for your use case
     Although we can talk to you about any of the above (see me after).
  4. So, what is this talk about?
     Notes:
     • Our journey of using MongoDB as a high performance graph database
     • Specifically, the software wrapper we implemented on top of Mongo to give us a leg up in terms of scalability and performance
     • To give you some ideas for how to work with graph data models if you'd like to use document databases
  5. GRAPHS 101
     Notes: Apologies. Nodes and edges, or resources and properties. Really easy to represent facts.
  6. John knows Jane
     [Diagram: John and Jane joined by a "knows" edge]
     Notes: Ball and stick diagrams. This is an undirected graph. It implies that John knows Jane and Jane knows John. The property has no directional significance.
  7. John knows Jane. Jane knows John.
     [Diagram: John and Jane joined by a "knows" edge]
     Notes: This is an undirected graph. It implies that John knows Jane and Jane knows John. The property has no directional significance.
  8. John knows Jane. Jane ? John.
     [Diagram: a "knows" edge pointing from John to Jane]
     Notes: This is a directed graph. The relationship is one way. To add "Jane knows John" we need a second property. We will only use directed graphs from here on, as they are more specific.
  9. John knows Jane. Jane knows John.
     [Diagram: two "knows" edges, one from John to Jane and one from Jane to John]
  10. Triples + RDF 101
  11. Subject | Property | Object
      John | knows | Jane
      Notes: This is a triple. Property = predicate.
  12. Subject | Property | Object
      John | knows | Jane
      Jane | knows | John
      Notes: This is a second triple. The same resource can be a subject or an object.
  13. Subject | Property | Object
      http://example.com/John | http://xmlns.com/foaf/0.1/knows | http://example.com/Jane
      Notes: RDF. Resources and properties as URIs. URIs can be dereferenced. Can share common property descriptions (RDF Schemas). Here using FOAF: billions if not trillions of triples are defined using FOAF.
  14. PREFIX foaf: <http://xmlns.com/foaf/0.1/>
      Subject | Property | Object
      http://example.com/John | foaf:knows | http://example.com/Jane
      http://example.com/John | foaf:name | "John"
      Notes: Namespaces for readability. In RDF subjects are always URIs, but objects can be literals, i.e. plain text. Many RDF/graph databases allow you to further type literals as dates, numbers, etc.
  15. PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
      PREFIX foaf: <http://xmlns.com/foaf/0.1/>
      Subject | Property | Object
      http://example.com/John | rdf:type | foaf:Person
      http://example.com/John | foaf:name | "John"
      http://example.com/John | foaf:knows | http://example.com/Jane
      http://example.com/Jane | rdf:type | foaf:Person
      http://example.com/Jane | foaf:name | "Jane"
      http://example.com/Jane | foaf:knows | http://example.com/John
      Notes: Here we type John and Jane as foaf:Person using rdf:type. Note both John and Jane appear as subjects and as objects. This RDF graph represents six facts.
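The six facts on the slide above can be held in code as plain subject/property/object tuples. The following is a minimal sketch, not anything from the talk: the tuple layout and the helper names (`uri`, `literal`, `subjects`, `resourcesAsObjects`) are assumptions made for illustration.

```javascript
// Sketch only: the tuple layout and helper names are illustrative assumptions.
const RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
const FOAF = "http://xmlns.com/foaf/0.1/";
const EX = "http://example.com/";

// Objects are tagged so a URI can be told apart from a plain-text literal.
const uri = (value) => ({ type: "uri", value });
const literal = (value) => ({ type: "literal", value });

// Each triple is [subject, property, object].
const triples = [
  [EX + "John", RDF + "type", uri(FOAF + "Person")],
  [EX + "John", FOAF + "name", literal("John")],
  [EX + "John", FOAF + "knows", uri(EX + "Jane")],
  [EX + "Jane", RDF + "type", uri(FOAF + "Person")],
  [EX + "Jane", FOAF + "name", literal("Jane")],
  [EX + "Jane", FOAF + "knows", uri(EX + "John")],
];

// Distinct resources that appear in the subject position.
function subjects(graph) {
  return [...new Set(graph.map(([s]) => s))];
}

// Distinct resources that appear in the object position (literals excluded).
function resourcesAsObjects(graph) {
  const uriObjects = graph.filter(([, , o]) => o.type === "uri");
  return [...new Set(uriObjects.map(([, , o]) => o.value))];
}

console.log(triples.length, subjects(triples), resourcesAsObjects(triples));
```

John and Jane show up in both lists, which is the point the notes make about a resource being a subject in one triple and an object in another.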
  16. [Diagram: example:John and example:Jane each have rdf:type foaf:Person, are linked to each other by foaf:knows in both directions, and have the literal names "John" and "Jane"]
      Notes: Here it is in ball and stick.
  17. FFS! I can do that in two minutes in BSON
  18. > db.people.find()
      {
        _id: ObjectId('123'),
        name: 'John',
        knows: [ObjectId('456')]
      },
      {
        _id: ObjectId('456'),
        name: 'Jane',
        knows: [ObjectId('123')]
      }
      Notes: Yes, you can! Data only makes sense inside your db though.
  19. [Image from http://sheikspear.blogspot.co.uk/2011/07/simples.html]
      Notes: Talk over, right? We can all go home.
  20. Some useful stuff, using RDF
      Notes: Let's look at some reasons why we think RDF is good.
  21. [Image: the linked open data cloud, with attribution]
      Notes: This is the linked open data cloud. Linked data is a way of publishing RDF on the open web. Search "linked data TED" to hear why Tim Berners-Lee cares about this. Each blob on this diagram represents an open, interlinked dataset. The lines between them represent the interlinking between datasets. Billions of public "facts", growing exponentially, from sites such as the BBC, governments, Last.fm and Wikipedia.
  22. Merging data from different sources is really easy
      Notes: Because the format is subject, predicate, object, the shape of RDF is always the same. Because schemas are public and widely shared, the same properties are used all over the place. Really easy to use this data in your own app and remix.
  23. [Diagram: two datasets, A and B, each describing example:John; one holds example:John rdf:type foaf:Person, the other example:John foaf:name "John"]
  24. [Diagram: dataset A+B, a single example:John node with both rdf:type foaf:Person and foaf:name "John"]
      Notes: Really easy to merge graphs. "Designed in" to the data format. Lots of existing tooling to do this.
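Because every fact has the same three-part shape, merging two graphs amounts to a set union of their triples. Here is a minimal sketch of that idea; the function names are hypothetical, and which triple lives in dataset A versus dataset B is an assumption, since the slide diagram does not make it clear.

```javascript
// Sketch only: helper names and the split between datasets are assumptions.
const uri = (value) => ({ type: "uri", value });
const literal = (value) => ({ type: "literal", value });

const JOHN = "http://example.com/John";

const datasetA = [
  [JOHN, "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", uri("http://xmlns.com/foaf/0.1/Person")],
];
const datasetB = [
  [JOHN, "http://xmlns.com/foaf/0.1/name", literal("John")],
];

// A stable string key lets a Map detect duplicate triples.
function tripleKey([s, p, o]) {
  return JSON.stringify([s, p, o.type, o.value]);
}

// Union of any number of graphs; a triple present in several inputs is kept once.
function mergeGraphs(...graphs) {
  const merged = new Map();
  for (const graph of graphs) {
    for (const triple of graph) {
      merged.set(tripleKey(triple), triple);
    }
  }
  return [...merged.values()];
}

const merged = mergeGraphs(datasetA, datasetB);
console.log(merged.length); // two facts about the same subject
```

No schema reconciliation step is needed in this sketch because both datasets already use the same URI for John and shared property URIs.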
  25. RDF query language: SPARQL
  26. PREFIX foaf: <http://xmlns.com/foaf/0.1/>
      SELECT ?name ?email
      WHERE {
        ?person a foaf:Person.
        ?person foaf:name ?name.
        ?person foaf:mbox ?email.
      }
      ORDER BY ?name
      LIMIT 50
      Notes: SPARQL is mega flexible. Lots of functions for grouping, walking graphs, pattern matching, inference, UNIONs, geo extensions etc. etc. - all that shit. Most if not all of those datasets will have a SPARQL endpoint you can query.
  27. Query type | Result
      SELECT | Tabular
      DESCRIBE | Graph
      ASK | Boolean
      CONSTRUCT | Graph
      Notes: 4 main query types.
  28. FFS! That looks like SQL!
      PREFIX foaf: <http://xmlns.com/foaf/0.1/>
      SELECT ?name ?email
      WHERE {
        ?person a foaf:Person.
        ?person foaf:name ?name.
        ?person foaf:mbox ?email.
      }
      ORDER BY ?name
      LIMIT 50
      Notes: Yes it does. The WHERE clause is basically doing a shit load of joins. I'll come back to that.
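To see why the WHERE clause behaves like a series of joins, here is a deliberately naive sketch of matching triple patterns against an in-memory list of triples. It is not how a real SPARQL engine works internally; the function names and sample data are assumptions, and literals are stored as plain strings to keep the example short.

```javascript
// Sketch only: a naive nested-loop matcher, not a real SPARQL engine.
const TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";
const FOAF = "http://xmlns.com/foaf/0.1/";

const data = [
  ["http://example.com/John", TYPE, FOAF + "Person"],
  ["http://example.com/John", FOAF + "name", "John"],
  ["http://example.com/John", FOAF + "mbox", "mailto:john@example.com"],
  ["http://example.com/Jane", TYPE, FOAF + "Person"],
  ["http://example.com/Jane", FOAF + "name", "Jane"],
];

// Try to extend an existing variable binding so that the pattern matches the triple.
function unify(pattern, triple, binding) {
  const result = { ...binding };
  for (let i = 0; i < 3; i++) {
    const term = pattern[i];
    if (term.startsWith("?")) {
      if (term in result && result[term] !== triple[i]) return null;
      result[term] = triple[i];
    } else if (term !== triple[i]) {
      return null;
    }
  }
  return result;
}

// Each additional pattern joins the solutions so far against every triple.
function matchPatterns(triples, patterns) {
  let solutions = [{}];
  for (const pattern of patterns) {
    const next = [];
    for (const binding of solutions) {
      for (const triple of triples) {
        const extended = unify(pattern, triple, binding);
        if (extended) next.push(extended);
      }
    }
    solutions = next;
  }
  return solutions;
}

const solutions = matchPatterns(data, [
  ["?person", TYPE, FOAF + "Person"],
  ["?person", FOAF + "name", "?name"],
  ["?person", FOAF + "mbox", "?email"],
]);
console.log(solutions);
```

Every pattern added to the query is another pass over the data joined on the shared `?person` variable, which is the cost the notes allude to. Jane drops out of the result because she has no foaf:mbox triple.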
  29. [Diagram: Application DB (SQL or other) -> offline conversion process -> triple store + SPARQL]
      Notes: Most datasets on the LOD diagram don't exist natively as linked data and RDF. They are post-produced. The data is not held natively, so there is a conversion script, which needs to be maintained and updated every time the app schema changes. The data is not up to date (1 hour, 1 day, 1 month behind?).
  30. Our innovation: Native Linked Data Applications
      Notes: We started working on these applications back in 2008. They are natively linked data, so they solve the conversion and currency issues. There is no other "format" or schema the data is stored in, it's native RDF. When you have no schema, and you can integrate data from elsewhere on the web, it's addictive.
  31. Our problem: FFS! For applications, we need humongous scale and performance
      Notes: Those applications are becoming rather popular with our users... sub 50ms query time. Modern web apps need speed and data scale. We have outgrown the triple store and SPARQL. SPARQL is very flexible and expressive. It's also expensive. SPARQL is great for data sets where the questions you can ask are limitless, but our applications need a data layer where speed is measured in single digit ms. Complex caching (w/Memcache) to achieve performance and scalability. 90:10 read:write.
  32. Tripod
      Notes: It's a pod for our triples. A triple store designed for applications and scalability. Based on Mongo.
  33. Functional requirements:
      • Order of magnitude increase in perf/scale
      • Graph-orientated interface
      Non-functional requirements:
      • Strong community
      Notes: Existing code is very graph orientated.
  34. Agenda: Core data format | Tripod API | Dealing with complex queries | TripodTables | Free text search
      Notes: Walk through Tripod looking at 5 areas.
  35. {
        'http://example.com/John' : {
          'http://purl.org/dc/elements/1.1/name' : [ { value: 'John', type: 'literal' } ],
          'http://purl.org/dc/elements/1.1/knows' : [ { value: 'http://example.com/Jane', type: 'uri' } ]
        },
        'http://example.com/Jane' : {
          'http://purl.org/dc/elements/1.1/name' : [ { value: 'Jane', type: 'literal' } ],
          'http://purl.org/dc/elements/1.1/knows' : [
            { value: 'http://example.com/John', type: 'uri' },
            { value: 'http://example.com/James', type: 'uri' }
          ]
        }
      }
      Notes: RDF/JSON - a serialisation of RDF in JSON. Neither disk space efficient nor readable. Fully-formed property URIs are not compatible with Mongo (dot notation). Even single values sit inside an array (problems for compound indexing).
  36. > db.CBD_people.find()
      {
        _id: 'http://example.com/John',
        'foaf:name': {l: 'John'},
        'foaf:knows': {u: 'http://example.com/Jane'}
      },
      {
        _id: 'http://example.com/Jane',
        'foaf:name': {l: 'Jane'},
        'foaf:knows': [
          {u: 'http://example.com/John'},
          {u: 'http://example.com/James'}
        ]
      }
      Notes: Same semantics. 2 documents here. Concise bound descriptions (CBDs): all data known about a subject, one relationship deep. One document per subject per collection, keyed (and thus enforced) by subject URI. Property names are namespaced. CBD collections are deemed read/write in Tripod.
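The mapping from triples to the CBD document shape on the slide above can be sketched as follows. This is an illustration written from the slide alone, not Tripod's code: `toCbdDocuments`, `shorten` and the `PREFIXES` table are hypothetical names, and refusing unknown namespaces is a design choice made here because full URIs contain dots, which the previous slide flags as a problem for Mongo field names.

```javascript
// Sketch only: written from the slide, not taken from the Tripod library.
const PREFIXES = { foaf: "http://xmlns.com/foaf/0.1/" };

const uri = (value) => ({ type: "uri", value });
const literal = (value) => ({ type: "literal", value });

// Turn a full property URI into a namespaced key such as "foaf:name".
function shorten(propertyUri) {
  for (const [prefix, namespace] of Object.entries(PREFIXES)) {
    if (propertyUri.startsWith(namespace)) {
      return prefix + ":" + propertyUri.slice(namespace.length);
    }
  }
  // Full URIs contain dots, so this sketch refuses them as field names.
  throw new Error("No prefix registered for " + propertyUri);
}

// Group triples into one document per subject, keyed by the subject URI.
function toCbdDocuments(triples) {
  const docs = new Map();
  for (const [s, p, o] of triples) {
    if (!docs.has(s)) docs.set(s, { _id: s });
    const doc = docs.get(s);
    const key = shorten(p);
    const value = o.type === "uri" ? { u: o.value } : { l: o.value };
    if (!(key in doc)) {
      doc[key] = value; // a single value is stored bare, not in an array
    } else if (Array.isArray(doc[key])) {
      doc[key].push(value);
    } else {
      doc[key] = [doc[key], value]; // second value: promote to an array
    }
  }
  return [...docs.values()];
}

const FOAF = PREFIXES.foaf;
const docs = toCbdDocuments([
  ["http://example.com/John", FOAF + "name", literal("John")],
  ["http://example.com/John", FOAF + "knows", uri("http://example.com/Jane")],
  ["http://example.com/Jane", FOAF + "name", literal("Jane")],
  ["http://example.com/Jane", FOAF + "knows", uri("http://example.com/John")],
  ["http://example.com/Jane", FOAF + "knows", uri("http://example.com/James")],
]);
console.log(JSON.stringify(docs, null, 2));
```

Storing a single value bare and only promoting to an array on the second value mirrors the difference between John's and Jane's `foaf:knows` on the slide.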
  37. class MongoGraph extends SimpleGraph {
        function add_tripod_array($tarray)
        function to_tripod_array($docId)
      }
      Notes: All of our app already uses SimpleGraph from a library called Moriarty (Google Code). MongoGraph is a simple extension which can ingest/output the data format on the previous slide.
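The "ingest" direction, turning a stored CBD document back into triples, might look something like the sketch below. This is a guess at the general idea rather than a port of `add_tripod_array`, whose body is not shown in the talk; `fromCbdDocument`, `expand` and the prefix table are hypothetical names.

```javascript
// Sketch only: a guess at the idea, not a port of MongoGraph::add_tripod_array.
const PREFIXES = { foaf: "http://xmlns.com/foaf/0.1/" };

// Expand a namespaced key such as "foaf:name" back to a full property URI.
function expand(qname, prefixes) {
  const colon = qname.indexOf(":");
  const prefix = qname.slice(0, colon);
  if (colon < 0 || !(prefix in prefixes)) {
    throw new Error("Unknown prefix in " + qname);
  }
  return prefixes[prefix] + qname.slice(colon + 1);
}

// Emit one [subject, property, object] triple per stored value.
function fromCbdDocument(doc, prefixes) {
  const triples = [];
  for (const [key, raw] of Object.entries(doc)) {
    if (key === "_id") continue; // the _id is the subject, not a property
    const values = Array.isArray(raw) ? raw : [raw];
    for (const v of values) {
      const object = "u" in v
        ? { type: "uri", value: v.u }
        : { type: "literal", value: v.l };
      triples.push([doc._id, expand(key, prefixes), object]);
    }
  }
  return triples;
}

const jane = {
  _id: "http://example.com/Jane",
  "foaf:name": { l: "Jane" },
  "foaf:knows": [
    { u: "http://example.com/John" },
    { u: "http://example.com/James" },
  ],
};
const janeTriples = fromCbdDocument(jane, PREFIXES);
console.log(janeTriples);
```

Normalising single values and arrays to the same loop keeps the application-facing graph uniform even though the stored form differs.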
  38. Agenda: Core data format | Tripod API | Dealing with complex queries | TripodTables | Free text search
      Notes: Walk through Tripod looking at 5 areas.
  39. interface ITripod
      {
        public function select($query,$fields,$sortBy=null,$limit=null);
        public function describeResource($resource);
        public function describeResources(Array $resources);
        public function saveChanges($oldGraph, $newGraph);
        public function search($query);
      }
      Notes: Almost the same as our existing data access API onto a generic triple store. All of these methods return graphs, and all are mega-simple queries on the CBD collections. None of these methods support joins (the WHERE clause in SPARQL).
  40. public function describeResource($resource)
      {
        $query = array("_id"=>$resource);
        $bson = $this->getCollection()->findOne($query);
        $graph = new MongoGraph();
        $graph->add_tripod_array($bson);
        return $graph;
      }
      Notes: These methods are mega simple to implement as they translate to really simple Mongo queries on the CBD collections, returning single objects.
  41. interface ITripod
      {
        public function select($query,$fields,$sortBy=null,$limit=null);
        public function describeResource($resource);
        public function describeResources(Array $resources);
        public function saveChanges($oldGraph, $newGraph);
        public function search($query);
        public function getViewForResource($resource,$viewType);
        public function getViewForResources(Array $resources,$viewType);
        public function getViews(Array $filter,$viewType);
      }
      Notes: Some extra methods to deal with complex queries involving joins.
  42. Agenda: Core data format | Tripod API | Dealing with complex queries | TripodTables | Free text search
      Notes: 2 things we realised when looking at our applications.
  43. DESCRIBE <http://example.com/foo> ?sectionOrItem ?resource ?document ?authorList ?author ?usedBy ?creator ?libraryNote ?publisher
      WHERE {
        OPTIONAL {
          <http://example.com/foo> resource:contains ?sectionOrItem .
          OPTIONAL {
            ?sectionOrItem resource:resource ?resource .
            OPTIONAL { ?resource dcterms:isPartOf ?document . }
            OPTIONAL {
              ?resource bibo:authorList ?authorList .
              OPTIONAL { ?authorList ?p ?author . }
            }
            OPTIONAL { ?resource dcterms:publisher ?publisher . }
          }
          OPTIONAL { ?libraryNote bibo:annotates ?sectionOrItem }
        } .
        OPTIONAL { <http://example.com/foo> resource:usedBy ?usedBy } .
        OPTIONAL { <http://example.com/foo> sioc:has_creator ?creator }
      }
      Notes: Typical SPARQL query in our app. 9 "joins" in this query.
  44. [Same query as slide 43, with the URI <http://example.com/foo> highlighted]
      Notes: The only thing that changes at run time in this query is this URI. The flexibility of SPARQL is great for the developer but terrible here for system performance. The query engine needs to join 9 times! Flexibility costs us every time we run this query! This is why we hid it behind a cache.
  45. • join
      • count
      • follow sequences (n times)
      • join across databases
      • All the above with a condition
      • include certain properties
      • include all properties
      Notes: 2nd thing: we only make use of minimal SPARQL, and some of these aren't even well supported in SPARQL (sequences + join across databases).
  46. Materialised views, generated infrequently, read often
      Notes: Remember 90:10 read:update. View specifications are based on a subset of SPARQL. Views are for DESCRIBE-like queries where all the data is brought back in one hit (not tabular data).
  47. {
        _id: "v_resource_brief",
        from: "CBD_harvest",
        type: "http://talisaspire.com/schema#Resource",
        include: ["rdf:type", "dct:subject", "dct:isVersionOf", "searchterms:usedAt", "dc:identifier"],
        joins: {
          "acorn:preferredMetadata": [],
          "acorn:listReferences": { include: ["acorn:list"] },
          "acorn:bookmarkReferences": { include: ["acorn:bookmark"] },
          "dcterms:isPartOf": [],
          "acorn:partReferences": {
            include: ["dct:hasPart"],
            joins: {
              "dct:hasPart": {
                joins: { "acorn:preferredMetadata": [] }
              }
            }
          }
        }
      }
      Notes: A view specification - itself a document that can be stored in Mongo. 8 keywords: type, from, include, joins, ttl, followSequence, maxJoins, counts.
  48. Generated by incremental MapReduce when:
      1) Data is changed
      2) TTL expires
      Notes: Tripod can take these specifications and manage views in a special collection within the DB. They expire and are regenerated automatically (and incrementally). Incremental map reduce inside the DB. Fast, interleaves with reads.
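To make the `include` and `joins` keywords concrete, here is a sketch that materialises a view by walking a specification over an in-memory map of CBD documents. It is plain recursion in application code, not the incremental MapReduce the talk describes, and the semantics are assumed from the slides: `include` filters properties, an empty `[]` rule means "take everything, join no further", and `joins` follows URI values to other documents. The `from`, `type`, `ttl`, `followSequence`, `maxJoins` and `counts` keywords are ignored, and all sample data and names (`buildView`, `project`) are hypothetical.

```javascript
// Sketch only: in-memory recursion with assumed semantics, not Tripod's MapReduce.

// Keep only the listed properties; no include list means keep everything.
function project(doc, include) {
  if (!include || include.length === 0) return { ...doc };
  const out = { _id: doc._id };
  for (const property of include) {
    if (property in doc) out[property] = doc[property];
  }
  return out;
}

const asArray = (value) => (Array.isArray(value) ? value : [value]);

// Walk the spec from the root resource, collecting one projected CBD per visit.
function buildView(rootId, spec, cbd) {
  const graphs = [];
  const seen = new Set(); // each resource is included once, under the first rule that reaches it
  function visit(id, rule) {
    const doc = cbd.get(id);
    if (!doc || seen.has(id)) return;
    seen.add(id);
    graphs.push(project(doc, rule.include));
    for (const [predicate, childRule] of Object.entries(rule.joins || {})) {
      if (!(predicate in doc)) continue;
      for (const value of asArray(doc[predicate])) {
        if (value.u) visit(value.u, childRule);
      }
    }
  }
  visit(rootId, spec);
  return {
    _id: { "rdf:resource": rootId, type: spec._id },
    value: { graphs, impactIndex: [...seen] },
  };
}

const cbd = new Map([
  ["http://example.com/lists/1", {
    _id: "http://example.com/lists/1",
    "dct:title": { l: "Reading list" },
    "resource:contains": [
      { u: "http://example.com/items/1" },
      { u: "http://example.com/items/2" },
    ],
  }],
  ["http://example.com/items/1", {
    _id: "http://example.com/items/1",
    "dct:note": { l: "Read chapter 1" },
    "resource:resource": { u: "http://example.com/resources/1" },
  }],
  ["http://example.com/items/2", {
    _id: "http://example.com/items/2",
    "resource:resource": { u: "http://example.com/resources/2" }, // target not in the store
  }],
  ["http://example.com/resources/1", {
    _id: "http://example.com/resources/1",
    "dct:title": { l: "A book" },
    "dct:publisher": { u: "http://example.com/publishers/1" },
  }],
  ["http://example.com/publishers/1", {
    _id: "http://example.com/publishers/1",
    "foaf:name": { l: "A publisher" },
  }],
]);

const spec = {
  _id: "v_list_full",
  include: ["dct:title", "resource:contains"],
  joins: {
    "resource:contains": {
      include: ["resource:resource"],
      joins: { "resource:resource": [] },
    },
  },
};

const view = buildView("http://example.com/lists/1", spec, cbd);
console.log(JSON.stringify(view, null, 2));
```

The publisher document is left out because the specification never joins on `dct:publisher`, and the list of visited resources doubles as a watch list for deciding when the view is stale.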
  49. > db.views.findOne()
      {
        "_id" : {
          "rdf:resource" : "http://talisaspire.com/examples/1",
          "type" : "v_resource_full"
        },
        "value" : {
          "graphs" : [
            {
              "_id" : "http://talisaspire.com/examples/1",
              "rdf:type" : {
                "type" : "uri",
                "value" : "http://talisaspire.com/schema#Resource"
              }
            }
          ],
          "impactIndex" : [
            { "rdf:resource" : "http://talisaspire.com/examples/1" }
          ]
        }
      }
      Notes: This is what a view looks like. The ID is a composite key of the view type and root resource. Graphs is a collection of CBDs. The MongoGraph we displayed earlier can take this and represent it as a unified graph to the application. Impact index: a watch list of resources. When resources are saved, the impact index is queried to find views that need invalidating. TTL is an alternative: if a TTL is set in the view spec, a timestamp is stored in the view to determine when it can be invalidated.
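The two invalidation routes in the notes above, the impact index and the TTL timestamp, can be sketched like this. The field name `expires`, the plain-string impact index entries, and the function names are all assumptions for illustration; in Tripod the lookup would be a query against the views collection rather than a filter over an array.

```javascript
// Sketch only: field names and in-memory filtering are illustrative assumptions.

// A view is impacted if any resource on its watch list was just saved.
function isImpacted(view, savedIds) {
  return view.value.impactIndex.some((id) => savedIds.has(id));
}

// A view with a TTL carries a timestamp (hypothetical field "expires", in ms).
function isExpired(view, nowMs) {
  const expires = view.value.expires;
  return typeof expires === "number" && expires <= nowMs;
}

// Ids of views that need regenerating after a save at time nowMs.
function staleViewIds(views, savedResourceIds, nowMs) {
  const saved = new Set(savedResourceIds);
  return views
    .filter((view) => isImpacted(view, saved) || isExpired(view, nowMs))
    .map((view) => view._id);
}

const views = [
  {
    _id: { "rdf:resource": "http://example.com/lists/1", type: "v_list_full" },
    value: { impactIndex: ["http://example.com/lists/1", "http://example.com/items/1"] },
  },
  {
    _id: { "rdf:resource": "http://example.com/lists/2", type: "v_list_full" },
    value: { impactIndex: ["http://example.com/lists/2"] },
  },
  {
    _id: { "rdf:resource": "http://example.com/lists/3", type: "v_list_brief" },
    value: { impactIndex: ["http://example.com/lists/3"], expires: 1000 },
  },
];

const stale = staleViewIds(views, ["http://example.com/items/1"], 2000);
console.log(stale.map((id) => id["rdf:resource"]));
```

Saving item 1 invalidates the first view through its impact index, the third view is stale purely because its timestamp has passed, and the second view is left alone.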
  50. [Image: four items numbered 1 to 4, with attribution]
      Notes: Match views to data update rate.
  51. Agenda: Core data format | Tripod API | Dealing with complex queries | TripodTables | Free text search
      Notes: Tripod Tables are for larger datasets which cannot be brought back in one hit. They can be paged, or have individual columns indexed for fast sort capability.
  52. SELECT ?listName ?listUri
      WHERE {
        ?resource bibo:isbn10 "$isbn"
        UNION
        {
          ?resource bibo:isbn10 "$isbnLowerCase" .
        }
        ?item resource:resource ?resource .
        UNION
        {
          ?resourcePartOf bibo:isbn10 "$isbn" .
          UNION
          {
            ?resourcePartOf bibo:isbn10 "$isbnLowerCase" .
          }
          ?resourcePartOf dct:hasPart ?resource .
          ?item resource:resource ?resource .
        }
        ?listUri resource:contains ?item .
        ?listUri sioc:name ?listName .
        ?listUri rdf:type resource:List
      }
      LIMIT 10 OFFSET 40
      Notes: This is a SELECT query that brings back a two-column document. Note the OFFSET and LIMIT.
  53. <?xml version="1.0"?>
      <sparql xmlns="http://www.w3.org/2005/sparql-results#">
        <head>
          <variable name="label"/>
          <variable name="type"/>
        </head>
        <results>
          <result>
            <binding name="label">
              <literal>Tropical grassland</literal>
            </binding>
            <binding name="type">
              <uri>http://purl.org/ontology/wo/TerrestrialHabitat</uri>
            </binding>
          </result>
          <result>
            <binding name="label">
              <literal>Grassy field</literal>
            </binding>
            <binding name="type">
              <uri>http://purl.org/ontology/wo/TerrestrialHabitat</uri>
            </binding>
          </result>
        </results>
      </sparql>
      Notes: SPARQL SELECT results - tabular format - here in XML.
  54. > db.t_resource.findOne()
      {
        "_id" : "http://talisaspire.com/resources/3SplCtWGPqEyXcDiyhHQpA-2",
        "value" : {
          "type" : [
            "http://purl.org/ontology/bibo/Book",
            "http://talisaspire.com/schema#Resource"
          ],
          "isbn" : "9780393929690",
          "isbn13" : [
            "9780393929691",
            "9780393929691-2",
            "9780393929691-3"
          ],
          "impactIndex" : [
            "http://talisaspire.com/works/4d101f63c10a6"
          ]
        }
      }
      Notes: This time our map reduce doesn't create one doc, as with materialised views. We get one doc per row.
  55. Agenda: Core data format | Tripod API | Dealing with complex queries | TripodTables | Free text search
      Notes: Our triple store included free text search. We wanted to stream updates into ElasticSearch or another search solution. When documents are saved, the same specification language is used to build Search Document Format docs and submit them to an endpoint. We like ElasticSearch, but you could use Amazon CloudSearch.
  56. Limitations
      Notes: Map Reduce is used as a non-blocking db.eval(), and also to work around PHP's synchronous programming model. PHP only for now - our web apps were PHP. To get a SPARQL endpoint we are exporting data out to Fuseki, which solves the mapping problem but not the currency problem (for SPARQL).
  57. Future
      Notes:
      • Node JS port
      • Use as a server, not a library
      • Eliminate dependency on map reduce
      • Specification version control
      • Tap into the op log for a stream approach into Fuseki and other locations
      • Named graph support
      • Further optimisation of the data model
      • Maybe open source
  58. That's it
  59. Questions? Find us on:
      Web: talisaspire.com
      Twitter: @talisaspire
      YouTube: youtube.com/user/TalisAspire
      Facebook: facebook.com/talisaspire
      Support: support.talisaspire.com