In a complex and distributed data world, a source-selection middleware that lets a federated SPARQL query engine transparently execute SERVICE-less SPARQL queries in a distributed way.
4. RELATED WORK
Federated Service Endpoint Query Engines

| | FedX | SPLENDID | DARQ | CostFed | LHD | ANAPSID | ADERIS |
|---|---|---|---|---|---|---|---|
| Code availability | YES (Java jar) | YES (Java, Scala) | YES (Java) | YES (Python) | YES (Java) | YES (Python) | YES (Java) |
| Last update | 2016 | 2011 | 2006 | 2018 | 2013 | 2013 | N.D. |
| SERVICE SPARQL clause support | YES | NO | NO | YES | NO | YES | NO |
| Source selection | query (ASK) | query (ASK), index, SPARQL service descriptor, VoID | index (SPARQL service descriptor) | index, query cost estimation based on selectivity | query (ASK), index, SPARQL service descriptor, VoID | query (ASK), index | index |
| Join type | nested loop, bind | hash, bind | nested loop, bind | bound, symmetric hash join | hash, bind | adaptive | index-based nested loop |
| Cache | YES (ASK history) | YES | NO | NO | NO | NO | NO |
None of the existing federated engines exploits data-mining-based strategies to select
services and to execute SPARQL queries on SPARQL endpoints.
5. RELATED WORK (2)
- Topic models have been widely used for document identification
- Bhattacharya & Sil, 2016 first used LDA and a sparse-representation-based
classifier for information retrieval
- Wang et al., 2016 and Wei & Croft, 2006 model a query as a distribution
of terms over topics through LDA for information retrieval
- Röder et al., 2016 used LDA only to identify the topics of RDF datasets
through the extracted English labels
- None of these research contributions:
  - exploits LDA to determine the similarity between queries and services
  - builds the corpus from IRIs, i.e. structured and semantic data
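As a rough illustration of a corpus built from IRIs rather than from labels, the sketch below turns each dataset into a "document" whose terms are the IRIs it contains. It assumes the data is available as N-Triples text and that each whole IRI counts as one term; the function names and the sample data are hypothetical and not taken from the engine.

```python
import re

# Matches the content of <...> IRI references in N-Triples text.
# Simplification: a literal that itself contains "<...>" would also match.
IRI_PATTERN = re.compile(r"<([^<>\s]+)>")

def iri_terms(ntriples_text):
    """Return the IRIs found in N-Triples text, in order of appearance."""
    return IRI_PATTERN.findall(ntriples_text)

def build_corpus(datasets):
    """Map each dataset name to its document: the list of IRIs it uses."""
    return {name: iri_terms(text) for name, text in datasets.items()}

# Hypothetical sample dataset with two triples.
sample = {
    "Dn": (
        "<http://example.org/alice> <http://xmlns.com/foaf/0.1/knows> "
        "<http://example.org/bob> .\n"
        "<http://example.org/alice> <http://xmlns.com/foaf/0.1/name> "
        '"Alice" .'
    ),
}
corpus = build_corpus(sample)
```

The resulting term lists are the kind of input a topic-modelling library would then consume; the plain literal "Alice" is left out because only IRIs are collected.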
6. TOPIC FEDERATED QUERY ENGINE
- RDF datasets are intelligible documents, i.e. semantically based
- LDA is used to build the dataset topic model and to infer query topics
- Topics are dataset summaries
- Topic similarity between a query and a dataset indicates possible
relevance
- The Topic SPARQL Federated Query Engine
  - learns from the datasets hosted on services
  - infers query routing information
  - executes the federated query on the distributed architecture
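A minimal sketch of how topic similarity between a query and the datasets could be scored. It assumes that both are already described by probability vectors over the same topics and uses cosine similarity, which is an assumption made for illustration: the slides do not name the measure. All names and numbers are hypothetical.

```python
import math

def cosine(p, q):
    """Cosine similarity between two topic distributions of equal length."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def rank_datasets(query_topics, dataset_topics):
    """Return (dataset, similarity) pairs, most similar first."""
    scores = {name: cosine(query_topics, dist)
              for name, dist in dataset_topics.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical distributions over three topics (Tx, Ty, Tz).
dataset_topics = {
    "Dm": [0.8, 0.1, 0.1],
    "Dn": [0.1, 0.8, 0.1],
    "Dp": [0.2, 0.6, 0.2],
}
query_topics = [0.1, 0.7, 0.2]
ranking = rank_datasets(query_topics, dataset_topics)
```

With these numbers the query is dominated by Ty, so the datasets whose summaries are also dominated by Ty come first in the ranking.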
11. SOURCE SELECTION STRATEGIES
[Figure: four panels, each relating DTD_Di (one per dataset Di) to QTD over the topics Tx, Ty, Tz. Panels: BEST STRATEGY; ALL STRATEGY (threshold); ALL FILTERED STRATEGY (threshold, for some pattern); K-MEANS STRATEGY K=2 (DTD_Di * QTD, Centroid 1, Centroid 2, cluster delimitation over the datasets Dm, Dn, Do, Dp, Di).]
12. THE BEST STRATEGY
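SELECT * WHERE { Triple-pattern1 Triple-pattern2 Triple-pattern3 Triple-pattern4 }

| | BEST | ALL | ALL-FILT. | K-MEANS |
|---|---|---|---|---|
| Triple-pattern1 | Ty | | | |
| Triple-pattern2 | Ty | | | |
| Triple-pattern3 | Ty | | | |
| Triple-pattern4 | Ty | | | |

[Figure: BEST STRATEGY panel relating DTD_Di to QTD over the topics Tx, Ty, Tz.]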
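The BEST column keeps a single topic per triple pattern. One way to read this is "keep the highest-scoring topic", sketched below. The per-pattern topic scores are assumed to be available already (for example from the inferred query topics); the function name and the numbers are hypothetical.

```python
def best_strategy(pattern_scores):
    """For each triple pattern keep only the topic with the highest score."""
    return {
        pattern: [max(scores, key=scores.get)]
        for pattern, scores in pattern_scores.items()
    }

# Hypothetical topic scores for two triple patterns.
pattern_scores = {
    "Triple-pattern1": {"Tx": 0.1, "Ty": 0.7, "Tz": 0.2},
    "Triple-pattern2": {"Tx": 0.2, "Ty": 0.5, "Tz": 0.3},
}
best = best_strategy(pattern_scores)
```

Selecting one topic keeps the number of candidate sources small, at the risk of missing datasets that only match a secondary topic.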
13. THE ALL STRATEGY
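SELECT * WHERE { Triple-pattern1 Triple-pattern2 Triple-pattern3 Triple-pattern4 }

| | BEST | ALL | ALL-FILT. | K-MEANS |
|---|---|---|---|---|
| Triple-pattern1 | Ty | Ty, Tz | | |
| Triple-pattern2 | Ty | Ty, Tz | | |
| Triple-pattern3 | Ty | Ty, Tz | | |
| Triple-pattern4 | Ty | Ty, Tz | | |

[Figure: ALL STRATEGY panel relating DTD_Di to QTD over the topics Tx, Ty, Tz, with a threshold.]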
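The ALL column keeps every topic that passes the threshold shown in the figure. A minimal sketch, assuming per-pattern topic scores and a single global threshold; the threshold value, the function name and the scores are hypothetical.

```python
def all_strategy(pattern_scores, threshold):
    """For each triple pattern keep all topics scoring at least `threshold`,
    ordered from the highest to the lowest score."""
    selected = {}
    for pattern, scores in pattern_scores.items():
        ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
        selected[pattern] = [topic for topic, score in ordered if score >= threshold]
    return selected

# Hypothetical topic scores and threshold.
pattern_scores = {
    "Triple-pattern1": {"Tx": 0.1, "Ty": 0.7, "Tz": 0.2},
    "Triple-pattern2": {"Tx": 0.05, "Ty": 0.6, "Tz": 0.35},
}
all_topics = all_strategy(pattern_scores, threshold=0.15)
```

Compared with keeping only the best topic, this trades more candidate sources for a lower chance of missing relevant data.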
14. THE ALL-FILTERED STRATEGY
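SELECT * WHERE { Triple-pattern1 Triple-pattern2 Triple-pattern3 Triple-pattern4 }

| | BEST | ALL | ALL-FILT. | K-MEANS |
|---|---|---|---|---|
| Triple-pattern1 | Ty | Ty, Tz | Ty, Tz | |
| Triple-pattern2 | Ty | Ty, Tz | Ty, Tz | |
| Triple-pattern3 | Ty | Ty, Tz | Ty | |
| Triple-pattern4 | Ty | Ty, Tz | Ty | |

[Figure: ALL FILTERED STRATEGY panel relating DTD_Di to QTD over the topics Tx, Ty, Tz, with a threshold applied and a further filter for some patterns.]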
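The ALL-FILT. column starts from the thresholded topics and drops some of them for some patterns. The slides do not state the filtering criterion, so the sketch below leaves it as a caller-supplied predicate `keep(pattern, topic)`. That hook, the function name and the sample data are all hypothetical and stand in for whatever rule the engine applies.

```python
def all_filtered_strategy(pattern_scores, threshold, keep):
    """Keep the topics above `threshold`, then apply a per-pattern filter.

    `keep(pattern, topic)` is a placeholder for the unspecified criterion.
    """
    selected = {}
    for pattern, scores in pattern_scores.items():
        ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
        above = [topic for topic, score in ordered if score >= threshold]
        selected[pattern] = [topic for topic in above if keep(pattern, topic)]
    return selected

# Hypothetical scores; the same for every pattern to keep the example small.
scores = {"Tx": 0.1, "Ty": 0.7, "Tz": 0.2}
pattern_scores = {f"Triple-pattern{i}": dict(scores) for i in range(1, 5)}

# Illustrative filter only: drop Tz for the last two patterns.
def keep(pattern, topic):
    return not (topic == "Tz" and pattern in {"Triple-pattern3", "Triple-pattern4"})

filtered = all_filtered_strategy(pattern_scores, threshold=0.15, keep=keep)
```

The example filter is chosen only so that the output reproduces the ALL-FILT. column of the table.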
15. THE K-MEANS STRATEGY
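SELECT * WHERE { Triple-pattern1 Triple-pattern2 Triple-pattern3 Triple-pattern4 }

| | BEST | ALL | ALL-FILT. | K-MEANS |
|---|---|---|---|---|
| Triple-pattern1 | Ty | Ty, Tz | Ty, Tz | Dn, Dp |
| Triple-pattern2 | Ty | Ty, Tz | Ty, Tz | Dn, Dp |
| Triple-pattern3 | Ty | Ty, Tz | Ty | Dn, Dp |
| Triple-pattern4 | Ty | Ty, Tz | Ty | Dn, Dp |

[Figure: K-MEANS STRATEGY K=2 panel: DTD_Di * QTD, Centroid 1, Centroid 2, cluster delimitation over the datasets Dm, Dn, Do, Dp, Di.]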
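The K-MEANS column selects datasets directly by splitting them into two clusters. A minimal sketch, assuming each dataset has one scalar score (for instance a combination of DTD_Di and QTD) and that the cluster with the higher centroid is kept. The one-dimensional 2-means below, the scores and the function name are assumptions for illustration, not the engine's implementation.

```python
def k_means_strategy(scores, max_iter=50):
    """Split datasets into two clusters by score (1-D k-means with K=2)
    and return the names in the cluster with the higher centroid."""
    low_c, high_c = min(scores.values()), max(scores.values())
    high = set()
    for _ in range(max_iter):
        # Assign each dataset to the nearest centroid.
        new_high = {name for name, value in scores.items()
                    if abs(value - high_c) < abs(value - low_c)}
        if new_high == high or not new_high:
            break  # assignments are stable, or all scores are identical
        high = new_high
        low = set(scores) - high
        # Recompute both centroids as cluster means.
        high_c = sum(scores[name] for name in high) / len(high)
        low_c = sum(scores[name] for name in low) / len(low)
    # If no split is possible, fall back to keeping every dataset.
    return sorted(high) if high else sorted(scores)

# Hypothetical dataset scores.
scores = {"Dm": 0.10, "Dn": 0.85, "Do": 0.30, "Dp": 0.75, "Di": 0.15}
selected = k_means_strategy(scores)
```

Unlike a fixed threshold, the boundary between the two clusters adapts to the score distribution of each query.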
16. DATASET-QUERY TOPIC MATCHING
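SELECT * WHERE { Triple-pattern1 Triple-pattern2 Triple-pattern3 Triple-pattern4 }

BEST STRATEGY: topics replaced by the matching datasets

| | BEST | ALL | ALL-FILT. | K-MEANS |
|---|---|---|---|---|
| Triple-pattern1 | Dn | Ty, Tz | Ty, Tz | Dn, Dp |
| Triple-pattern2 | Dn | Ty, Tz | Ty, Tz | Dn, Dp |
| Triple-pattern3 | Dn | Ty, Tz | Ty | Dn, Dp |
| Triple-pattern4 | Dn | Ty, Tz | Ty | Dn, Dp |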
17. DATASET-QUERY TOPIC MATCHING (2)
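SELECT * WHERE { Triple-pattern1 Triple-pattern2 Triple-pattern3 Triple-pattern4 }

ALL STRATEGY: topics replaced by the matching datasets

| | BEST | ALL | ALL-FILT. | K-MEANS |
|---|---|---|---|---|
| Triple-pattern1 | Dn | Dn, Do, Dp | Ty, Tz | Dn, Dp |
| Triple-pattern2 | Dn | Dn, Do, Dp | Ty, Tz | Dn, Dp |
| Triple-pattern3 | Dn | Dn, Do, Dp | Ty | Dn, Dp |
| Triple-pattern4 | Dn | Dn, Do, Dp | Ty | Dn, Dp |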
18. DATASET-QUERY TOPIC MATCHING (3)
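SELECT * WHERE { Triple-pattern1 Triple-pattern2 Triple-pattern3 Triple-pattern4 }

ALL FILTERED STRATEGY: topics replaced by the matching datasets

| | BEST | ALL | ALL-FILT. | K-MEANS |
|---|---|---|---|---|
| Triple-pattern1 | Dn | Dn, Do, Dp | Dn, Do, Dp | Dn, Dp |
| Triple-pattern2 | Dn | Dn, Do, Dp | Dn, Do, Dp | Dn, Dp |
| Triple-pattern3 | Dn | Dn, Do, Dp | Dn | Dn, Dp |
| Triple-pattern4 | Dn | Dn, Do, Dp | Dn | Dn, Dp |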
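The matching step replaces the topics selected for each triple pattern with the datasets associated with those topics. A minimal sketch, assuming a precomputed index from topic to datasets; the index contents and the function name are hypothetical, chosen so that Ty maps to Dn and Tz maps to Do and Dp as in the tables.

```python
def match_datasets(selected_topics, topic_index):
    """Replace each pattern's topics by the datasets indexed under them,
    keeping first-seen order and dropping duplicates."""
    matched = {}
    for pattern, topics in selected_topics.items():
        datasets = []
        for topic in topics:
            for dataset in topic_index.get(topic, []):
                if dataset not in datasets:
                    datasets.append(dataset)
        matched[pattern] = datasets
    return matched

# Hypothetical topic-to-dataset index.
topic_index = {"Ty": ["Dn"], "Tz": ["Do", "Dp"]}

# Topics as selected by the ALL FILTERED strategy in the table.
selected_topics = {
    "Triple-pattern1": ["Ty", "Tz"],
    "Triple-pattern2": ["Ty", "Tz"],
    "Triple-pattern3": ["Ty"],
    "Triple-pattern4": ["Ty"],
}
matched = match_datasets(selected_topics, topic_index)
```

The K-MEANS column needs no such step because that strategy already yields datasets.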
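19. SERVICE SUBSTITUTION - AGGREGATION
SELECT * WHERE { Triple-pattern1 Triple-pattern2 Triple-pattern3 Triple-pattern4 }

| | BEST | ALL | ALL-FILTERED | K-MEANS |
|---|---|---|---|---|
| Triple-pattern1 | Sn | Sn, So, Sp | Sn, So, Sp | Sn, Sp |
| Triple-pattern2 | Sn | Sn, So, Sp | Sn, So, Sp | Sn, Sp |
| Triple-pattern3 | Sn | Sn, So, Sp | Sn | Sn, Sp |
| Triple-pattern4 | Sn | Sn, So, Sp | Sn | Sn, Sp |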
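A sketch of the last two steps: each dataset is replaced by the service that hosts it, and consecutive triple patterns routed to the same single service are aggregated into one group. Rendering each group as SERVICE blocks joined by UNION is one plausible output shape and an assumption of this example; the slides do not show the rewritten query. The service names are placeholders, not real endpoint IRIs.

```python
def substitute_services(pattern_datasets, dataset_service):
    """Replace each dataset by its hosting service, dropping duplicates."""
    return {
        pattern: list(dict.fromkeys(dataset_service[d] for d in datasets))
        for pattern, datasets in pattern_datasets.items()
    }

def aggregate(pattern_services):
    """Group consecutive patterns that go to the same single service."""
    groups = []
    for pattern, services in pattern_services.items():
        if groups and len(services) == 1 and groups[-1][0] == services:
            groups[-1][1].append(pattern)
        else:
            groups.append((services, [pattern]))
    return groups

def render(groups):
    """Render the groups as a federated query with SERVICE clauses."""
    lines = ["SELECT * WHERE {"]
    for services, patterns in groups:
        body = " . ".join(patterns)
        blocks = [f"{{ SERVICE <{s}> {{ {body} }} }}" for s in services]
        lines.append("  " + " UNION ".join(blocks))
    lines.append("}")
    return "\n".join(lines)

# Datasets per pattern as in the ALL-FILTERED column; hosting map is hypothetical.
pattern_datasets = {
    "Triple-pattern1": ["Dn", "Do", "Dp"],
    "Triple-pattern2": ["Dn", "Do", "Dp"],
    "Triple-pattern3": ["Dn"],
    "Triple-pattern4": ["Dn"],
}
dataset_service = {"Dn": "Sn", "Do": "So", "Dp": "Sp"}

groups = aggregate(substitute_services(pattern_datasets, dataset_service))
query = render(groups)
```

Aggregating patterns that share one service lets that endpoint evaluate their join locally instead of shipping intermediate results to the engine.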
25. CONCLUSION
- RDF datasets, once treated as documents, are processed with LDA to extract
their latent semantics.
- These latent semantics are represented by topics, which act as dataset summaries.
- The Topic SPARQL Federated Query Engine learns from the datasets hosted
on services how to split, route and execute SERVICE-less SPARQL queries in a
federated way.
- It is a middleware for transparently querying the Open Data world
- Work in progress:
- Benchmarking with other engines
- Evaluating index stability
- Improving performance and recall of the strategies
26. THANK YOU FOR YOUR ATTENTION!
ANY QUESTIONS?
Topic-based Federated Query Engine
Ester Giallonardo, Ciro Sorrentino, Eugenio Zimeo
ICWI - BUDAPEST - 2018