Improving RDF Search Performance with Lucene and SIREN

5,498 views

Published on

Published in: Technology, Business
0 Comments
4 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
5,498
On SlideShare
0
From Embeds
0
Number of Embeds
14
Actions
Shares
0
Downloads
151
Comments
0
Likes
4
Embeds 0
No embeds

No notes for slide

Improving RDF Search Performance with Lucene and SIREN

  1. 1. INDEXING AND SEARCHING RDF DATASETSImproving Performance of Semantic Web Applications with Lucene, SIREn and RDF Mike Hugo Entagen, LLC
  2. 2. slides and sample code can be found athttps://github.com/mjhugo/rdf-lucene-siren-presentation
  3. 3. ENTAGENACCELERATING INSIGHT
  4. 4. 17
  5. 5. AGENDA
  6. 6. SPARQL
  7. 7. SPARQL LUCENE
  8. 8. SPARQL LUCENE SIREN
  9. 9. SPARQL LUCENE SIREN TripleMap.com
  10. 10. LINKING OPEN DATA
  11. 11. LIFE SCIENCE LINKED DATA
  12. 12. WHAT’S A TRIPLE?Subject Predicate Object
  13. 13. WHAT’S A TRIPLE?<Mike> <name> “Mike Hugo”
  14. 14. WHAT’S A TRIPLE? “Minneapolis” <lives_in_city><Mike> <name> “Mike Hugo”
  15. 15. WHAT’S A TRIPLE? “Minneapolis”<Mike> <lives_in_city><daughter> <name> “Mike Hugo” <Lydia>
  16. 16. WHAT’S A TRIPLE? “Minneapolis”<Mike> <lives_in_city> <name><daughter> “Mike Hugo” <Lydia> <name> “Lydia Hugo”
  17. 17. Ready, GO!
  18. 18. select id, labelfrom targetswhere label = ‘${queryValue}’
  19. 19. select id, labelfrom targetswhere label ilike ‘%${queryValue}%’
  20. 20. SELECT ?uri ?type ?label WHERE { ?uri rdfs:label ?label . ?uri rdf:type ?type . FILTER (?label = ${params.query})} LIMIT 10
  21. 21. SELECT ?uri ?type ?label WHERE { ?uri rdfs:label ?label . ?uri rdf:type ?type . FILTER regex(?label, Q${params.query}E, i)} LIMIT 10
  22. 22. SELECT ?uri ?type ?label WHERE { ?uri rdfs:label ?label . ?uri rdf:type ?type . FILTER regex(?label, Q${params.query}E, i)} LIMIT 10 case insensitive query as literal value
  23. 23. DEMOBaseline SPARQL Query Performance
  24. 24. FASTER!
  25. 25. Java APIIndexing and Searching Text
  26. 26. `http://wiki.apache.org/lucene-java/PoweredBy
  27. 27. indexing storage
  28. 28. Document
  29. 29. Document field value ID 2 name “Mike Hugo”company “Entagen” “lorem ipsum bio dolor sum etc...”
  30. 30. Index field field value value field field value value field field value name name field “mike value value hugo” “mike hugo” name name “mike hugo” “mike hugo” name nameid “mike hugo” “mike 2hugo”company company “Entagen” “Entagen” company company name “Entagen” “Entagen” “Mike Hugo” company company “Entagen” “Entagen” “Entagen” Indexed company “lorem ipsum “lorem ipsum bio bio “lorem ipsum “lorem etc...” bio bio “loremipsum dolorsum ipsum dolor“loremetc...” not bio bio sum ipsum dolorsum etc...”” dolor sum etc... ” bio dolor sum ipsum” “lorem etc... dolor sum etc... Stored dolor sum etc...”
  31. 31. Query: name: mike
  32. 32. Query: name: mike MatchingDocuments: field value idfield 2 value idfield 2 value idfield 2 value id 2
  33. 33. field value id 2
  34. 34. field value id 2
  35. 35. field value id 2 field value ID 2 name “Mike Hugo” company “Entagen” “lorem ipsum bio dolor sum etc...”
  36. 36. Simplest Solution
  37. 37. Lucene index of rdfs:label
  38. 38. Build the Index
  39. 39. String queryLabels = """ SELECT ?uri ?label WHERE { ?uri rdfs:label ?label . } Build a SPARQL""" query to find all the rdfs:label propertiessparqlQueryService.executeForEach(repo def doc = new Document() String uri = it.uri.stringValue() String label = it.label.stringValu doc.add(new Field(SUBJECT_URI_FIEL
  40. 40. sparqlQueryService.executeForEach (repository, queryLabels) { String uri = it.uri.stringValue() String label = it.label.stringValu Execute the def doc = new Document() SPARQL query doc.add(new Field(SUBJECT_URI_FIEL Field.Store.YES, Field.Ind doc.add(new Field(LABEL_FIELD, lab Field.Store.NO, Field.Inde writer.addDocument(doc)}
  41. 41. arqlQueryService.executeForEach(reposito String uri = it.uri.stringValue() String label = it.label.stringValue() Document doc = new Document() doc.add(new Field(SUBJECT_URI_FIELD, uri, Instantiate a new Lucene Field.Store.YES, Document Field.Index.ANALYZED)) doc.add(new Field(LABEL_FIELD, label, Field.Store.NO, Field.Index.ANALYZED)) writer.addDocument(doc)
  42. 42. key Document doc = new Document() doc.add(new Field(SUBJECT_URI_FIELD, value uri, Field.Store.YES, Field.Index.ANALYZED)) doc.add(new Field(LABEL_FIELD, label, Add the Subject Field.Store.NO, URI to the Document Field.Index.ANALYZED)) writer.addDocument(doc)lly {
  43. 43. Field.Store.YES, Field.Index.ANALYZED)) doc.add(new Field(LABEL_FIELD, key value label, Field.Store.NO, Field.Index.ANALYZED)) Add the Label field writer.addDocument(doc) document to the (but don’t store it)lly {iter.close() // Close index
  44. 44. doc.add(new Field(LABEL_FIELD, labe Field.Store.NO, Field.Index.ANALYZED)) writer.addDocument(doc) }inally { writer.close() // Closethe document Add index to the Index
  45. 45. Query the Index
  46. 46. f query = { Query query = new QueryParser( Version.LUCENE_CURRENT, LABEL_FIELD, query this field new StandardAnalyzer()) .parse(params.query); for this value def s Create a Lucene = new Date().time Query from user List results = executeQuery(query) input def e = new Date().time render(view: index, model: [results:
  47. 47. IndexSearcher searcher = luceneSearcheScoreDoc[] scoreDocs = searcher.search(query, 10).scoreDoList results = [] Search the index (limit 10) fordef connection = repository.connectionscoreDocs.each { matching documents Document doc = searcher.doc(it.doc String uri = doc[SUBJECT_URI_FIELD Map labelAndType = sparqlQueryServ results << [uri: uri, type: labelA}connection.close()return results
  48. 48. List results = []def connection = repository.connectionscoreDocs.each { Document doc = searcher.doc(it.doc) String uri = doc[SUBJECT_URI_FIELD] Map labelAndType = For each matching sparqlQueryService. document, get the getLabelAndType(uri, connection) doc and extract the results.add([ Subject URI uri: uri, type: labelAndType.type, label: labelAndType.label])}connection.close()return results
  49. 49. List results = []def connection = repository.connectionscoreDocs.each { Document doc = searcher.doc(it.doc) String uri = doc[SUBJECT_URI_FIELD] Map labelAndType = sparqlQueryService. getLabelAndType(uri, connection) results.add([ uri: uri, Using the Subject URI, load properties type: labelAndType.type, from the triplestore label: labelAndType.label])}connection.close()return results
  50. 50. List results = []def connection = repository.connectionscoreDocs.each { Document doc = searcher.doc(it.doc) return results containing Subject String uri = doc[SUBJECT_URI_FIELD] Map labelAndType Type, and Label URI, = sparqlQueryService. getLabelAndType(uri, connection) results.add([ uri: uri, type: labelAndType.type, label: labelAndType.label])}connection.close()return results
  51. 51. DEMOLucene Index of Searchable Labels
  52. 52. WHAT ABOUT ENTITY RELATIONSHIPS?
  53. 53. WHAT ABOUT OTHER PROPERTIES?
  54. 54. Lucene ExtensionIndexing and Searching Semi-Structured Data
  55. 55. Document
  56. 56. Documentfield value URI <DB00619> <DB00619> rdfs:label "Imatinib" . <DB00619> rdf:type <drugbank:drugs> .triples <DB00619> drugbank:brandName "Gleevec" . <DB00619> drugbank:target <targets/1588> .
  57. 57. Build the Index
  58. 58. Connection connection = repository.conny { String subjectUris = """ SELECT distinct ?uri WHERE { ?uri ?p ?o . } """ sparqlQueryService.executeForEach(rep def doc = new Select all Subject Document() URIs from the triplestore String subjectUri = it.uri.string doc.add(new Field(SUBJECT_URI_FIE subjectUri,
  59. 59. """sparqlQueryService.executeForEach( repository, subjectUris) { def doc = new Document() String subjectUri = it.uri.stringV doc.add(new Field(SUBJECT_URI_FIEL subjectUri, Field.Store.YES, Execute the Sparql Query Field.Index.ANALYZED)) For each URI, create a new Document StringWriter triplesStringWriter = NTriplesWriter nTriplesWriter = new NTriplesWriter(triplesStri
  60. 60. epository, subjectUris) { def doc = new Document() String subjectUri = it.uri.stringValue doc.add(new Field(SUBJECT_URI_FIELD, subjectUri, Field.Store.YES, Field.Index.ANALYZED)) StringWriter triplesStringWriter = new NTriplesWriter nTriplesWriter =URI Add the Subject to the Document new NTriplesWriter(triplesStringWr connection.exportStatements( new URIImpl(subjectUri), null, null, false,
  61. 61. Field.Index.ANALYZED))StringWriter triplesStringWriter = newNTriplesWriter nTriplesWriter = new NTriplesWriter(triplesStringWrconnection.exportStatements( new URIImpl(subjectUri), null, null, false, nTriplesWriter) Get an NTriplesdoc.add(new Field(TRIPLES_FIELD, string from the triplesStringWriter.toString() Field.Store.NO, triplestore Field.Index.ANALYZED))
  62. 62. new URIImpl(subjectUri), null, null, false, nTriplesWriter)doc.add(new Field(TRIPLES_FIELD, triplesStringWriter.toString() Field.Store.NO, Field.Index.ANALYZED)) Add the NTripleswriter.addDocument(doc) string to the document
  63. 63. doc.add(new Field(TRIPLES_FIELD, triplesStringWriter.toString() Field.Store.NO, Field.Index.ANALYZED))writer.addDocument(doc) Add the document to the index
  64. 64. Query the Index
  65. 65. SirenCellQuery predicate = new SirenCellQuery( new SirenTermQuery( new Term(TRIPLES_FIELD, RDFS.LABEL.stringValue())));predicate.constraint = PREDICATE_CELLSirenCellQuery object = query the Triples new SirenCellQuery( new SirenTermQuery( field new Term(TRIPLES_FIELD, params.query.toLowerCase()))object.constraint = OBJECT_CELL
  66. 66. SirenCellQuery predicate = new SirenCellQuery( new SirenTermQuery( new Term(TRIPLES_FIELD, RDFS.LABEL.stringValue())));predicate.constraint = PREDICATE_CELLSirenCellQuery object = new SirenCellQuery( a predicate for new SirenTermQuery( new Term(TRIPLES_FIELD, params.query.toLowerCase()))object.constraint = OBJECT_CELL
  67. 67. SirenCellQuery predicate = new SirenCellQuery( new SirenTermQuery( new Term(TRIPLES_FIELD, RDFS.LABEL.stringValue())));predicate.constraint = PREDICATE_CELL of rdfs:label *SirenCellQuery object = new SirenCellQuery( new SirenTermQuery( new Term(TRIPLES_FIELD, params.query.toLowerCase())) * note: could be any predicate!object.constraint = OBJECT_CELL
  68. 68. SirenCellQuery object = new SirenCellQuery( new SirenTermQuery( new Term(TRIPLES_FIELD, params.query.toLowerCase())object.constraint = OBJECT_CELLQuery query = new SirenTupleQuery() query the Triplesquery.add(predicate, field SirenTupleClause.Occur.MUST)query.add(object, SirenTupleClause.Occur.MUST)
  69. 69. SirenCellQuery object = new SirenCellQuery( new SirenTermQuery( new Term(TRIPLES_FIELD, params.query.toLowerCase())object.constraint = OBJECT_CELLQuery query = new SirenTupleQuery()query.add(predicate, for an object SirenTupleClause.Occur.MUST)query.add(object, SirenTupleClause.Occur.MUST)
  70. 70. SirenCellQuery object = new SirenCellQuery( new SirenTermQuery( new Term(TRIPLES_FIELD, params.query.toLowerCase())object.constraint = OBJECT_CELLQuery query = new SirenTupleQuery()query.add(predicate, matching the user input SirenTupleClause.Occur.MUST)query.add(object, SirenTupleClause.Occur.MUST)
  71. 71. field value URI <DB00619> <DB00619> rdfs:label "Imatinib" . <DB00619> rdf:type <drugbank:drugs> . triples <DB00619> drugbank:brandName "Gleevec" . <DB00619> drugbank:target <targets/1588> .Query: “imatinib”
  72. 72. field value URI <DB00619> <DB00619> rdfs:label "Imatinib" . <DB00619> rdf:type <drugbank:drugs> . triples <DB00619> drugbank:brandName "Gleevec" . <DB00619> drugbank:target <targets/1588> .Query: triples field
  73. 73. field value URI <DB00619> <DB00619> rdfs:label "Imatinib" . <DB00619> rdf:type <drugbank:drugs> . triples <DB00619> drugbank:brandName "Gleevec" . <DB00619> drugbank:target <targets/1588> . Query:predicate = rdfs:label
  74. 74. field value URI <DB00619> <DB00619> rdfs:label "Imatinib" . <DB00619> rdf:type <drugbank:drugs> . triples <DB00619> drugbank:brandName "Gleevec" . <DB00619> drugbank:target <targets/1588> . Query:predicate = rdfs:label object = “imatinib”
  75. 75. List executeQuery(Query query) { IndexSearcher searcher = sirenSearcherM ScoreDoc[] scoreDocs = searcher.search(query, 10).scoreDocs List results = [] def connection = repository.connection Search the index scoreDocs.each { (limit 10) for matching Document doc = searcher.doc(it.doc) documents String uri = doc[SUBJECT_URI_FIELD] Map labelAndType = sparqlQueryServi getLabelAndType(uri, connectio results.add([ uri: uri, type: labelAndType.type,
  76. 76. List results = []def connection = repository.connectionscoreDocs.each { Document doc = searcher.doc(it.doc) String uri = doc[SUBJECT_URI_FIELD] Map labelAndType = sparqlQueryServic For each matching getLabelAndType(uri, connection document, get the results.add([ doc and extract the uri: uri, Subject URI type: labelAndType.type, label: labelAndType.label])}connection.close()return results
  77. 77. connection = repository.connectionreDocs.each { Document doc = searcher.doc(it.doc) String uri = doc[SUBJECT_URI_FIELD] Map labelAndType = sparqlQueryService. getLabelAndType(uri, connection) results.add([ uri: uri, Using the Subject type: labelAndType.type, URI, load properties label: labelAndType.label]) from the triplestorenection.close()urn results
  78. 78. String uri = doc[SUBJECT_URI_FIELD] Map labelAndType = sparqlQueryService. getLabelAndType(uri, connection) results.add([ uri: uri, type: labelAndType.type, label: labelAndType.label])nection.close() return resultsurn results containing Subject URI, Type, and Label
  79. 79. DEMOSIREn Index of RDF Entities
  80. 80. FLEXIBILITY
  81. 81. field value URI <DB00619> <DB00619> rdfs:label "Imatinib" . <DB00619> rdf:type <drugbank:drugs> . triples <DB00619> drugbank:brandName "Gleevec" . <DB00619> drugbank:target <targets/1588> . Query:predicate = rdfs:label object = “imatinib”
  82. 82. field value URI <DB00619> <DB00619> rdfs:label "Imatinib" . <DB00619> rdf:type <drugbank:drugs> . triples <DB00619> drugbank:brandName "Gleevec" . <DB00619> drugbank:target <targets/1588> .Query: object = “imatinib”
  83. 83. field value URI <DB00619> <DB00619> rdfs:label "Imatinib" . <DB00619> rdf:type <drugbank:drugs> . triples <DB00619> drugbank:brandName "Gleevec" . <DB00619> drugbank:target <targets/1588> .Query:object = “imatinib” OR object = “gleevec”
  84. 84. MORE THAN LITERALS
  85. 85. field value URI <DB00619> <DB00619> rdfs:label "Imatinib" . <DB00619> rdf:type <drugbank:drugs> . triples <DB00619> drugbank:brandName "Gleevec" . <DB00619> drugbank:target <targets/1588> .Query: predicate = brandName
  86. 86. field value URI <DB00619> <DB00619> rdfs:label "Imatinib" . <DB00619> rdf:type <drugbank:drugs> . triples <DB00619> drugbank:brandName "Gleevec" . <DB00619> drugbank:target <targets/1588> .Query: predicate = target
  87. 87. RELATIONSHIPS
  88. 88. field value URI <DB00619> <DB00619> rdfs:label "Imatinib" . <DB00619> rdf:type <drugbank:drugs> . triples <DB00619> drugbank:brandName "Gleevec" . <DB00619> drugbank:target <targets/1588> .Query: object = <targets/1588>
  89. 89. DEMOSearching SIREn Index for Relationships
  90. 90. DistributedIndexing and Searching Semi-Structured Data
  91. 91. Replication
  92. 92. 400 Million Documents> 12 Billion Triples
  93. 93. Query Parser
  94. 94. Query Parsersubject predicate object
  95. 95. DEMOSIREn in action on TripleMap.com
  96. 96. DEMOSIREn in action on TripleMap.com
  97. 97. SPARQL LUCENE SIREN TripleMap.com
  98. 98. QUESTIONS? mike@entagen.com / twitter: @piragua TripleMaphttp://www.entagen.com http://www.triplemap.com

×