Improving RDF Search Performance with Lucene and SIREN

INDEXING AND SEARCHING
RDF DATASETS
Improving Performance of Semantic Web Applications with
Lucene, SIREn and RDF

Mike Hugo
Entagen, LLC

slides and sample code can be found at
https://github.com/mjhugo/rdf-lucene-siren-presentation

SPARQL

LUCENE

SIREN

TripleMap.com

WHAT’S A TRIPLE?

Subject

Predicate

Object

WHAT’S A TRIPLE?

<Mike> <name> “Mike Hugo”
<Mike> <lives_in_city> “Minneapolis”
<Mike> <daughter> <Lydia>
<Lydia> <name> “Lydia Hugo”
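
The statements on this slide can be modeled in a few lines of plain Java. This is only an illustrative sketch: the `TripleDemo` class and its `objectsOf` helper are hypothetical names, not part of the talk's sample code, and a real triplestore would answer the same question with a SPARQL query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** A minimal, hypothetical in-memory model of the triples on the slide. */
final class TripleDemo {

    /** One subject-predicate-object statement. */
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    /** The four statements from the slide, with URIs kept in their short <...> form. */
    static List<Triple> sampleGraph() {
        return Arrays.asList(
            new Triple("<Mike>", "<name>", "\"Mike Hugo\""),
            new Triple("<Mike>", "<lives_in_city>", "\"Minneapolis\""),
            new Triple("<Mike>", "<daughter>", "<Lydia>"),
            new Triple("<Lydia>", "<name>", "\"Lydia Hugo\""));
    }

    /** Returns the object of every triple matching the given subject and predicate. */
    static List<String> objectsOf(List<Triple> graph, String subject, String predicate) {
        List<String> objects = new ArrayList<String>();
        for (Triple t : graph) {
            if (t.subject.equals(subject) && t.predicate.equals(predicate)) {
                objects.add(t.object);
            }
        }
        return objects;
    }

    public static void main(String[] args) {
        List<Triple> graph = sampleGraph();
        // Follow <Mike> --<daughter>--> ?x --<name>--> ?name
        for (String daughter : objectsOf(graph, "<Mike>", "<daughter>")) {
            System.out.println(objectsOf(graph, daughter, "<name>"));
        }
    }
}
```

Because an object can itself be the subject of other triples, following `<daughter>` and then `<name>` walks the graph from one entity to the next.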

select id, label
from targets
where label = '${queryValue}'

select id, label
from targets
where label ilike '%${queryValue}%'

SELECT ?uri ?type ?label WHERE {
?uri rdfs:label ?label .
?uri rdf:type ?type .
FILTER (?label = '${params.query}')
} LIMIT 10

FILTER regex(?label,
'\\Q${params.query}\\E', 'i')
} LIMIT 10

'i' flag: case insensitive
\\Q ... \\E: query as literal value
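
The two annotations on the regex filter (case insensitive, query as literal value) can be tried out in plain Java. This is a sketch under one assumption: that the store evaluates `regex()` with Java's regular-expression engine, which the slides do not state. `LiteralRegexDemo` and `containsLiteral` are hypothetical names.

```java
import java.util.regex.Pattern;

/** Hypothetical demo of a case-insensitive, literal "contains" match. */
final class LiteralRegexDemo {

    /** True if label contains userInput as literal text, ignoring case. */
    static boolean containsLiteral(String label, String userInput) {
        // Pattern.quote returns a literal pattern for the input (it uses \Q...\E quoting),
        // so characters such as '.' or '(' are matched as themselves.
        Pattern pattern = Pattern.compile(Pattern.quote(userInput), Pattern.CASE_INSENSITIVE);
        // find() looks for the pattern anywhere in the label, like an unanchored regex filter.
        return pattern.matcher(label).find();
    }

    public static void main(String[] args) {
        System.out.println(containsLiteral("Imatinib", "imat"));
        System.out.println(containsLiteral("Imatinib", "i.a"));
    }
}
```

Without the quoting, user input such as `i.a` would be interpreted as a pattern and match "Imatinib"; with it, only labels containing those exact characters match.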

DEMO
Baseline SPARQL Query Performance

Java API
Indexing and Searching Text

http://wiki.apache.org/lucene-java/PoweredBy

Document

field    value
ID       2
name     “Mike Hugo”
company  “Entagen”
bio      “lorem ipsum dolor sum etc...”

Index

[Diagram: a stack of documents, each with id, name, company, and bio fields; labels: Indexed, not Stored]

Query: name: mike

Matching Documents:

field    value
id       2

field    value
ID       2
name     “Mike Hugo”
bio      “lorem ipsum dolor sum etc...”
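
The Document, Index, and Query slides can be tied together with a toy inverted index in plain Java. This is a sketch of the general idea only, not of Lucene's internals or API. `ToyIndex` and its methods are hypothetical, and the choice of which fields are indexed or stored is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical toy index: "indexed" fields are searchable, "stored" fields are retrievable. */
final class ToyIndex {
    // "field:token" -> ids of the documents containing that token in that field
    private final Map<String, TreeSet<Integer>> postings = new HashMap<String, TreeSet<Integer>>();
    // document id -> the field values that were stored verbatim
    private final Map<Integer, Map<String, String>> stored = new HashMap<Integer, Map<String, String>>();

    void addField(int docId, String field, String value, boolean index, boolean store) {
        if (index) {
            // Very rough analysis step: lowercase and split on non-word characters.
            for (String token : value.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!token.isEmpty()) {
                    postings.computeIfAbsent(field + ":" + token, k -> new TreeSet<Integer>()).add(docId);
                }
            }
        }
        if (store) {
            stored.computeIfAbsent(docId, k -> new LinkedHashMap<String, String>()).put(field, value);
        }
    }

    /** Ids of documents whose indexed field contains the term. */
    List<Integer> search(String field, String term) {
        TreeSet<Integer> hits = postings.get(field + ":" + term.toLowerCase(Locale.ROOT));
        return hits == null ? new ArrayList<Integer>() : new ArrayList<Integer>(hits);
    }

    /** Stored field values for a document (empty if nothing was stored). */
    Map<String, String> storedFields(int docId) {
        Map<String, String> fields = stored.get(docId);
        return fields == null ? new LinkedHashMap<String, String>() : fields;
    }

    /** The document from the slide; the index/store flags are assumptions. */
    static ToyIndex sample() {
        ToyIndex index = new ToyIndex();
        index.addField(2, "id", "2", false, true);
        index.addField(2, "name", "Mike Hugo", true, true);
        index.addField(2, "company", "Entagen", true, true);
        index.addField(2, "bio", "lorem ipsum dolor sum etc...", true, false);
        return index;
    }

    public static void main(String[] args) {
        ToyIndex index = sample();
        System.out.println(index.search("name", "mike"));
        System.out.println(index.storedFields(2));
    }
}
```

In this sketch a field that is indexed but not stored can still be matched by a query, but its original text cannot be read back from the index.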

// Build a SPARQL query to find all the rdfs:label properties
String queryLabels = """
SELECT ?uri ?label
WHERE {
    ...
}
"""

try {
    // Execute the SPARQL query
    sparqlQueryService.executeForEach(repository, queryLabels) {
        String uri = it.uri.stringValue()
        String label = it.label.stringValue()

        // Instantiate a new Lucene Document
        Document doc = new Document()

        // Add the Subject URI to the Document (key, value)
        doc.add(new Field(SUBJECT_URI_FIELD,
                uri,
                Field.Store.YES,
                Field.Index.ANALYZED))

        // Add the Label field to the document (but don't store it)
        doc.add(new Field(LABEL_FIELD,
                label,
                Field.Store.NO,
                ...

        // Add the document to the Index
        writer.addDocument(doc)
    }
} finally {
    writer.close() // Close index
}
def query = {
    // Create a Lucene Query from user input
    Query query = new QueryParser(
            Version.LUCENE_CURRENT,
            LABEL_FIELD,            // query this field
            new StandardAnalyzer())
            .parse(params.query);   // for this value

    def s = new Date().time
    List results = executeQuery(query)
    def e = new Date().time

    render(view: 'index', model: [results: ...
}

IndexSearcher searcher = luceneSearche...

// Search the index (limit 10) for matching documents
ScoreDoc[] scoreDocs =
        searcher.search(query, 10).scoreDocs

List results = []
def connection = repository.connection
scoreDocs.each {
    // For each matching document, get the doc and extract the Subject URI
    Document doc = searcher.doc(it.doc)
    String uri = doc[SUBJECT_URI_FIELD]

    // Using the Subject URI, load properties from the triplestore
    Map labelAndType =
            sparqlQueryService.
                    getLabelAndType(uri, connection)
    results.add([
            uri: uri,
            type: labelAndType.type,
            label: labelAndType.label])
}
connection.close()

// return results containing Subject URI, Type, and Label
return results

DEMO
Lucene Index of Searchable Labels

WHAT ABOUT ENTITY
RELATIONSHIPS?

WHAT ABOUT OTHER
PROPERTIES?

Lucene Extension

Indexing and Searching
Semi-Structured Data

Document

field    value

URI      <DB00619>
triples  <DB00619> rdfs:label "Imatinib" .
         <DB00619> rdf:type <drugbank:drugs> .
         <DB00619> drugbank:brandName "Gleevec" .
         <DB00619> drugbank:target <targets/1588> .
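
The document layout on this slide (one document per entity, with every statement about that entity in a single triples field) can be sketched in plain Java. This uses neither SIREn nor the triplestore API. `EntityDocuments` and `groupBySubject` are hypothetical names, and the statement about `<targets/1588>` is made up so that the sample has a second subject.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: build one "triples" field value per subject URI. */
final class EntityDocuments {

    /** Groups {subject, predicate, object} rows by subject; each value holds one statement per line. */
    static Map<String, String> groupBySubject(String[][] triples) {
        Map<String, StringBuilder> grouped = new LinkedHashMap<String, StringBuilder>();
        for (String[] t : triples) {
            grouped.computeIfAbsent(t[0], k -> new StringBuilder())
                   .append(t[0]).append(' ')
                   .append(t[1]).append(' ')
                   .append(t[2]).append(" .\n");
        }
        Map<String, String> documents = new LinkedHashMap<String, String>();
        for (Map.Entry<String, StringBuilder> entry : grouped.entrySet()) {
            documents.put(entry.getKey(), entry.getValue().toString());
        }
        return documents;
    }

    static String[][] sample() {
        return new String[][] {
            {"<DB00619>", "rdfs:label", "\"Imatinib\""},
            {"<DB00619>", "rdf:type", "<drugbank:drugs>"},
            {"<DB00619>", "drugbank:brandName", "\"Gleevec\""},
            {"<DB00619>", "drugbank:target", "<targets/1588>"},
            // Hypothetical statement, not from the slides, added only to show a second document.
            {"<targets/1588>", "rdfs:label", "\"hypothetical target label\""}
        };
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> doc : groupBySubject(sample()).entrySet()) {
            System.out.println("URI: " + doc.getKey());
            System.out.print(doc.getValue());
        }
    }
}
```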

Connection connection = repository.connection
try {
    // Select all Subject URIs from the triplestore
    String subjectUris = """
    SELECT distinct ?uri
    WHERE {
        ?uri ?p ?o .
    }
    """

    // Execute the Sparql Query
    sparqlQueryService.executeForEach(
            repository, subjectUris) {

        // For each URI, create a new Document
        def doc = new Document()

        // Add the Subject URI to the Document
        String subjectUri = it.uri.stringValue()
        doc.add(new Field(SUBJECT_URI_FIELD,
                subjectUri,
                Field.Store.YES,
                ...

        // Get an NTriples string from the triplestore
        StringWriter triplesStringWriter = new StringWriter()
        NTriplesWriter nTriplesWriter =
                new NTriplesWriter(triplesStringWriter)
        connection.exportStatements(
                new URIImpl(subjectUri),
                null, null, false,
                nTriplesWriter)

        // Add the NTriples string to the document
        doc.add(new Field(TRIPLES_FIELD,
                triplesStringWriter.toString(),
                Field.Store.NO,
                ...

        // Add the document to the index
        ...

// query the Triples field for a predicate of rdfs:label *
SirenCellQuery predicate =
        new SirenCellQuery(
                new SirenTermQuery(
                        new Term(TRIPLES_FIELD,
                                RDFS.LABEL.stringValue())));
predicate.constraint = PREDICATE_CELL

// query the Triples field for an object matching the user input
SirenCellQuery object =
        new SirenCellQuery(
                new SirenTermQuery(
                        new Term(TRIPLES_FIELD,
                                params.query.toLowerCase())))
object.constraint = OBJECT_CELL

Query query = new SirenTupleQuery()
query.add(predicate,
        SirenTupleClause.Occur.MUST)
query.add(object,
        ...

* note: could be any predicate!

field    value

URI      <DB00619>
triples

Query: “imatinib”

triples field
predicate = rdfs:label
object = “imatinib”
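
The query above constrains two cells of the same triple: the predicate must be rdfs:label and the object must contain the search term. A plain-Java sketch of that matching rule follows. It is not SIREn code. `TupleMatchDemo` is a hypothetical name, the cell positions 0, 1, 2 are an assumed subject-predicate-object layout, and the tokenizer is a rough stand-in for a real analyzer.

```java
import java.util.Locale;

/** Hypothetical sketch of a tuple query with per-cell constraints. */
final class TupleMatchDemo {
    // Assumed cell layout for this sketch: subject, predicate, object.
    static final int SUBJECT_CELL = 0;
    static final int PREDICATE_CELL = 1;
    static final int OBJECT_CELL = 2;

    /** True if a single tuple has the predicate in its predicate cell AND the term in its object cell. */
    static boolean matches(String[][] tuples, String predicate, String objectTerm) {
        String wanted = objectTerm.toLowerCase(Locale.ROOT);
        for (String[] tuple : tuples) {
            if (tuple[PREDICATE_CELL].equals(predicate)
                    && containsToken(tuple[OBJECT_CELL], wanted)) {
                return true;
            }
        }
        return false;
    }

    /** Lowercases the cell, splits it on non-word characters, and looks for an exact token match. */
    static boolean containsToken(String cell, String wanted) {
        for (String token : cell.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.equals(wanted)) {
                return true;
            }
        }
        return false;
    }

    /** The triples of the <DB00619> document from the slides. */
    static String[][] sample() {
        return new String[][] {
            {"<DB00619>", "rdfs:label", "\"Imatinib\""},
            {"<DB00619>", "rdf:type", "<drugbank:drugs>"},
            {"<DB00619>", "drugbank:brandName", "\"Gleevec\""},
            {"<DB00619>", "drugbank:target", "<targets/1588>"}
        };
    }

    public static void main(String[] args) {
        System.out.println(matches(sample(), "rdfs:label", "imatinib"));
        System.out.println(matches(sample(), "rdfs:label", "gleevec"));
    }
}
```

Both constraints must hold within one tuple, so in this sketch a search for "gleevec" under rdfs:label does not match even though "Gleevec" appears elsewhere in the same document, under drugbank:brandName.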

List executeQuery(Query query) {
    IndexSearcher searcher = sirenSearcherM...

    // Search the index (limit 10) for matching documents
    ScoreDoc[] scoreDocs =
            searcher.search(query, 10).scoreDocs

    List results = []
    def connection = repository.connection
    scoreDocs.each {
        // For each matching document, get the doc and extract the Subject URI
        ...

        // Using the Subject URI, load properties from the triplestore
        Map labelAndType = sparqlQueryService.
                getLabelAndType(uri, connection)
        results.add([
                uri: uri,
                ...
    }
    connection.close()

    // return results containing Subject URI, Type, and Label
    return results
}

DEMO
SIREn Index of RDF Entities

field    value

URI      <DB00619>
triples

Query: ... OR object = “gleevec”

Query: predicate = brandName

Query: predicate = target

Query: object = <targets/1588>

DEMO
Searching SIREn Index for Relationships

Distributed
Indexing and Searching
Semi-Structured Data

400 Million Documents
> 12 Billion Triples

Query Parser

subject

predicate object

DEMO
SIREn in action on TripleMap.com

QUESTIONS?

mike@entagen.com / twitter: @piragua

TripleMap

http://www.entagen.com http://www.triplemap.com
