Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration


  1. Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration
     Daniel M. Herzig, Thanh Tran
     WWW2012
     Institute of Applied Informatics and Formal Description Methods (AIFB)
     KIT, University of the State of Baden-Württemberg and National Research Center of the Helmholtz Association
     www.kit.edu
  2. Agenda
     Motivation
     Problem Definition
     Existing Solutions
     Our Approach
         Entity Relevance Model (ERM)
         Ranking
         On-The-Fly Alignment
     Experiments
     Conclusion
     WWW2012, Daniel M. Herzig, Institute AIFB
  3. Company running a movie shopping website
     [Figure: the Movies Shopping Website, backed by the company's dataset Ds]
  4. Users search the website via forms.
     The search request is internally executed as a structured query (e.g. SQL, SPARQL).
     [Figure: structured query qs over Ds (IMDb, prefix i:): ?x type i:movie; ?x i:directors "Steven Spielberg"; ?x i:year 1982. Screenshot of http://www.imdb.com/search/title]
  5. The company discovers the plethora of Linked Data available on the Web and identifies data sources beneficial for its business.
     [Figure: qs and Ds next to the Linked Data on the Web diagram, http://richard.cyganiak.de/2007/10/lod/]
  6. Zero Star Mugs! vs.
  7. Problems of Data Integration arise…
     qs does not return results
     No links, no integration
     No knowledge about the external data schema
     External data might change often
  8. Problem Definition
     Find relevant entities in a set of target datasets Dt, given a source dataset Ds and a structured entity query qs adhering to the vocabulary of Ds.
     [Figure: the structured entity query qs is posed over the source dataset Ds; relevant entities are sought in the target datasets Dt1 and Dt2]
  9. Problem Setting
     Data model is a labeled directed graph
         Directly related to RDF
         RDF specifics, e.g. blank nodes, are omitted
     Entity query: SPARQL BGP query with one select variable
         Entity queries are the most frequent type of web search queries (Pound et al., WWW2010)
     Web Data scenario: data exhibits heterogeneity on the schema and data level
  10. Heterogeneous Web Data
      [Figure: movie entities in three datasets. Amazon (a:): entity ea of type a:Movie with a:Actors, a:Directors, a:ReleaseDate, a:Title and a:Binding. IMDb (i:): entity ei of type i:movie with i:actors, i:directors, i:title and i:producer. DBpedia (db:): entity ed of type db:Film with db:director, db:starring and rdfs:label. Example values include "Daniel Craig", "Eric Bana", "Steven Spielberg", "Munich", 2005, "DVD", "Coyote, Peter", "Spielberg, Steven (I)", "E.T.", db:Steven_Spielberg, "1941 (film)" and db:John_Candy_(actor).]
      Schema level: actors vs. starring
      Data level: Steven Spielberg vs. Spielberg, Steven
      Varying number of attributes per entity
  11. Aim: Integrate External Data into the Search Process
      [Figure: how can qs, posed over Ds, be answered on the target datasets Dt?]
      Keyword search
          Wang et al.: Semplore: A scalable IR approach to search the Web of Data. In: Journal of Web Semantics. (2009)
      Query rewriting based on up-front data integration
          Calì et al.: Query Rewriting and Answering under Constraints in Data Integration Systems. In: IJCAI. (2003)
  12. Existing Strategies: Keyword Search
      [Figure: the Amazon (a:) query qs (?x type a:Movie; a:Directors "Rainer Werner Fassbinder"; a:TheatricalReleaseDate 1982) is flattened into the keyword query "directors rainerwernerfassbinder theatrical release date 1982 type movie" and matched against bag-of-words representations of IMDb (i:) entities: e1 (title veronikavoss, director rainerwernerfassbinder, released 1982, type movie) and e2 (title schindlersliste 1994, director spielbergsteveni, type movie)]
      Transform qs into a keyword query
      Match against a bag-of-words representation of entities
      Bridges schema differences by neglecting the structure
      Baseline 1 (KW), IR baseline using Semplore (Lucene)
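The keyword baseline on slide 12 can be sketched in a few lines. This is only an illustration of the idea, not Semplore or Lucene: the function names, the tokenizer, and the term-overlap score are assumptions made for this example, and the entities are taken from the slide's figure.

```python
import re

def tokens(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def local_name(term):
    """Drop a vocabulary prefix such as 'a:' and split CamelCase names."""
    name = term.split(":", 1)[-1]
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name)

def to_keyword_query(pairs):
    """Flatten (attribute, value) pairs into a list of keywords."""
    words = []
    for attribute, value in pairs:
        words += tokens(local_name(attribute)) + tokens(local_name(str(value)))
    return words

def bag_of_words(entity):
    """Represent an entity (dict attribute -> value) as a set of tokens."""
    return set(to_keyword_query(entity.items()))

def score(query_words, entity):
    """Toy relevance score (assumption): query keywords found in the entity."""
    bag = bag_of_words(entity)
    return sum(1 for word in query_words if word in bag)

# Seed query in the Amazon vocabulary, two entities in the IMDb vocabulary.
qs = [("type", "a:Movie"),
      ("a:Directors", "Rainer Werner Fassbinder"),
      ("a:TheatricalReleaseDate", 1982)]
e1 = {"i:title": "Veronika Voss", "i:director": "Rainer Werner Fassbinder",
      "i:released": 1982, "type": "i:movie"}
e2 = {"i:title": "Schindlers Liste (1994)",
      "i:director": "Spielberg, Steven (I)", "type": "i:movie"}

keywords = to_keyword_query(qs)
ranking = sorted([("e1", e1), ("e2", e2)],
                 key=lambda item: score(keywords, item[1]), reverse=True)
```

Because attribute names and values end up in the same bag, the match on "rainer werner fassbinder" and "1982" is found even though a:Directors and i:director differ. The price is that structure is ignored entirely.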
  13. Existing Strategies: Query Rewriting
      [Figure: the Amazon and DBpedia schemas are fed to an ontology alignment tool, which produces mappings such as a:Directors = db:director, a:Title = db:name, a:Actor = db:starring, … = …. The Amazon (a:) query (?x type a:Movie; a:Directors "Rainer Werner Fassbinder"; a:TheatricalReleaseDate 1982) is rewritten into a DBpedia (db:) query over ?x with db:director ?y and further variables]
      Create mappings using ontology alignment tools (Falcon-AO)
      Rewrite the query using the mappings, omit missing mappings, replace constants with variables
      Reduces the search space; keyword search is performed on top
      Baseline 2 (QR), database-style baseline
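The rewriting step of slide 13 can be illustrated as follows. This is a minimal sketch, not the Falcon-AO pipeline: the triple representation, the function name `rewrite_query`, and the idea of returning the removed constants as keywords for the subsequent keyword search are assumptions. The mappings are the ones shown on the slide.

```python
def rewrite_query(triples, mappings):
    """Rewrite (subject, attribute, value) patterns into the target vocabulary.

    Patterns whose attribute has no mapping are omitted. Constant values are
    replaced with fresh variables and collected as keywords.
    """
    rewritten, keywords = [], []
    for index, (subject, attribute, value) in enumerate(triples):
        target_attribute = mappings.get(attribute)
        if target_attribute is None:
            continue  # omit patterns without a mapping
        if isinstance(value, str) and value.startswith("?"):
            rewritten.append((subject, target_attribute, value))
        else:
            rewritten.append((subject, target_attribute, f"?v{index}"))
            keywords.append(str(value))
    return rewritten, keywords

# Mappings as listed on the slide (Amazon a: to DBpedia db:).
mappings = {"a:Directors": "db:director",
            "a:Title": "db:name",
            "a:Actor": "db:starring"}

seed_query = [("?x", "a:Directors", "Rainer Werner Fassbinder"),
              ("?x", "a:TheatricalReleaseDate", 1982),
              ("?x", "type", "a:Movie")]

rewritten, keywords = rewrite_query(seed_query, mappings)
```

In this example only the director pattern survives, because the slide's mappings contain no counterpart for a:TheatricalReleaseDate. This shows how incomplete alignments weaken the rewritten query.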
  14. Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration
  15. Contributions
      (1) Novel approach for querying heterogeneous Web data sources
          No upfront data integration necessary
          Uses an Entity Relevance Model (ERM) for ranking and for computing mappings on the fly
      (2) Implementation of the approach
          Construction of an ERM and usage for alignment and ranking
          Best-effort algorithm for creating mappings during runtime
      (3) Large-scale evaluation with 3 real-world datasets
          Experiments show that our approach exceeds the KW and QR baselines by 120% and 54%, respectively, in terms of Mean Average Precision.
  16. Overview of our Approach
      Keyword search to cross vocabulary mismatches
      Model leveraging the structure of the data
      Relevance feedback
      Matching and ranking
      [Figure: qs is run against Ds to obtain the results Rs, from which the Entity Relevance Model is built; a keyword query retrieves candidate entities et from the target datasets Dt, which are then matched against the model and ranked]
  17. Entity Relevance Model (ERM)
      Based on the Structured Relevance Model (Lavrenko et al. 2007)
      Entity Relevance Model:
          Query-specific model
          Captures structure and content of relevant results
          Composite model consisting of language models weighted by occurrence
      Based on Lavrenko et al.: Information Retrieval on Empty Fields. In: HLT-NAACL. (2007)
  18. ERM (2)
      [Figure: running qs yields Rs = {e1, e2}, from which the ERM is built. e1: label "World on Wires", starring "Klaus Löwitsch" and "Barbara Valentin", released 1973, director "Rainer Werner Fassbinder", type Film. e2: label "Veronika Voss", director "Rainer Werner Fassbinder", released 1982, language German, type Film]
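Slides 17 and 18 describe the ERM as a composite of per-attribute language models weighted by occurrence. A minimal sketch of such a construction is shown below. The slides do not give the exact estimation, so the maximum-likelihood unigram models, the whitespace tokenizer, the weight definition (share of results carrying the attribute), and all names are assumptions. The two entities are those of slide 18.

```python
from collections import Counter, defaultdict

def language_model(texts):
    """Maximum-likelihood unigram model over the tokens of the given strings."""
    counts = Counter(tok for text in texts for tok in str(text).lower().split())
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}

def build_erm(results):
    """Build one ERM field per attribute from a list of result entities.

    Each entity is a dict mapping an attribute to a list of values. A field
    holds a language model over all values of the attribute and a weight,
    here the share of result entities that carry the attribute (assumption).
    """
    values = defaultdict(list)
    occurrences = Counter()
    for entity in results:
        for attribute, attribute_values in entity.items():
            values[attribute].extend(attribute_values)
            occurrences[attribute] += 1
    return {
        attribute: {
            "lm": language_model(values[attribute]),
            "weight": occurrences[attribute] / len(results),
        }
        for attribute in values
    }

e1 = {"label": ["World on Wires"],
      "starring": ["Klaus Löwitsch", "Barbara Valentin"],
      "released": ["1973"],
      "director": ["Rainer Werner Fassbinder"],
      "type": ["Film"]}
e2 = {"label": ["Veronika Voss"],
      "director": ["Rainer Werner Fassbinder"],
      "released": ["1982"],
      "language": ["German"],
      "type": ["Film"]}

erm = build_erm([e1, e2])
```

With these two results, the director field has weight 1.0 because both entities carry it, while starring and language each have weight 0.5.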
  19. Modelling Target Entities
      [Figure: IMDb (i:) entity ei of type i:movie with the attributes i:actors, i:directors, i:title and i:producer; values shown are "Coyote, Peter", "Spielberg, Steven (I)", "E.T." and "(1994)"]
      Modeled the same way as the ERM
      Language model for each attribute
  20. Ranking
      [Formula not preserved; its annotated components are: boosting of seed query attributes, cross entropy, and frequency of as]
      Idea: Rank candidate entities according to their similarity to the ERM
      Note: An alignment between the ERM and et is needed
      If no mapping is available, use max H.
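The ranking formula of slide 20 is not preserved in this transcript. The sketch below only combines the three ingredients the slide names (boosting, cross entropy, attribute frequency) in one plausible way. The weighted sum, the floor probability `EPSILON`, the definition of `MAX_H`, and all function names are assumptions, not the authors' formula.

```python
import math

EPSILON = 1e-6               # assumed floor probability for unseen terms
MAX_H = -math.log(EPSILON)   # cross entropy when no term of p occurs in q

def cross_entropy(p, q):
    """H(p, q) = -sum_w p(w) * log q(w), with q floored at EPSILON."""
    return -sum(pw * math.log(max(q.get(w, 0.0), EPSILON)) for w, pw in p.items())

def distance(erm, entity_lms, mapping, seed_attributes, b=2.0):
    """Weighted sum of cross entropies between ERM fields and entity attributes.

    Lower is more similar. Unmapped fields contribute MAX_H; fields that occur
    in the seed query are multiplied by the boosting parameter b.
    """
    total = 0.0
    for attribute, field in erm.items():
        target = mapping.get(attribute)
        if target in entity_lms:
            h = cross_entropy(field["lm"], entity_lms[target])
        else:
            h = MAX_H
        boost = b if attribute in seed_attributes else 1.0
        total += boost * field["weight"] * h
    return total

def rank(erm, candidates, mapping, seed_attributes, b=2.0):
    """Sort candidate names by increasing distance to the ERM."""
    return sorted(candidates,
                  key=lambda name: distance(erm, candidates[name], mapping,
                                            seed_attributes, b))

# Hypothetical toy data: two ERM fields and two candidate entities.
erm = {"director": {"lm": {"fassbinder": 1.0}, "weight": 1.0},
       "released": {"lm": {"1982": 1.0}, "weight": 1.0}}
candidates = {
    "veronika_voss": {"i:directors": {"fassbinder": 1.0}, "i:year": {"1982": 1.0}},
    "et": {"i:directors": {"spielberg": 1.0}, "i:year": {"1982": 1.0}},
}
mapping = {"director": "i:directors", "released": "i:year"}

ranking = rank(erm, candidates, mapping, seed_attributes={"director"})
```

Here the candidate whose director model matches the ERM ends up first, and the mismatch on the boosted director field is penalised twice as strongly as a mismatch on an unboosted field would be.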
  21. On The Fly Alignment
      as ~ at ??
      Idea: Compare all language models of et to a field of the ERM using cross entropy H.
      Establish a mapping if the lowest value for H is lower than a threshold t.
      Worst case: n · r comparisons; n and r are usually small
      Allows reuse of the computed cross entropies for subsequent ranking
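The alignment idea of slide 21 can be sketched as a nested loop over ERM fields and entity attributes. This is an illustration under assumptions: the smoothing constant `EPSILON`, the threshold value, the function names, and the toy language models are not from the slides.

```python
import math

EPSILON = 1e-6  # assumed floor probability for unseen terms

def cross_entropy(p, q):
    """H(p, q) = -sum_w p(w) * log q(w), with q floored at EPSILON."""
    return -sum(pw * math.log(max(q.get(w, 0.0), EPSILON)) for w, pw in p.items())

def align(erm_lms, entity_lms, t):
    """Map each ERM field to the entity attribute with the lowest cross entropy.

    A mapping is only established if that lowest value is below threshold t.
    All computed cross entropies are returned so ranking can reuse them.
    """
    mapping, computed = {}, {}
    for source, p in erm_lms.items():
        for target, q in entity_lms.items():
            computed[(source, target)] = cross_entropy(p, q)
        best = min(entity_lms, key=lambda target: computed[(source, target)],
                   default=None)
        if best is not None and computed[(source, best)] < t:
            mapping[source] = best
    return mapping, computed

# Hypothetical language models for three ERM fields and one candidate entity.
erm_lms = {
    "director": {"rainer": 1 / 3, "werner": 1 / 3, "fassbinder": 1 / 3},
    "released": {"1982": 1.0},
    "language": {"german": 1.0},
}
entity_lms = {
    "i:directors": {"fassbinder": 1 / 3, "rainer": 1 / 3, "werner": 1 / 3},
    "i:year": {"1982": 1.0},
    "i:title": {"veronika": 0.5, "voss": 0.5},
}

mapping, computed = align(erm_lms, entity_lms, t=2.0)
```

With three ERM fields and three entity attributes, nine cross entropies are computed, which matches the worst case on the slide. The language field stays unmapped because none of its cross entropies falls below the threshold.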
  22. EXPERIMENTS
  23. Datasets
      Three real-world, heterogeneous Web datasets:
      (1) DBpedia 3.5.1, structured representation of Wikipedia
      (2) IMDb, information about movies
      (3) Amazon, information about DVDs and videos
      (2) and (3) are crawled and transformed to RDF; provided by L3S
  24. Ground Truth
      [Figure: the same query in three vocabularies. Amazon (a:): ?x type a:Movie; a:Directors "Rainer Werner Fassbinder"; a:TheatricalReleaseDate 1982. DBpedia (db:): ?x type db:Film; db:director db:Rainer_Werner_Fassbinder; db:released 1982. IMDb (i:): ?x type i:movie; i:directors "Fassbinder, Rainer Werner"; i:year 1982]
      Goal is to find relevant entities in the target datasets
      The seed query qs is manually rewritten to obtain the relevant entities in the target datasets
      3 query sets, each with 23 corresponding entity BGP SPARQL queries
  25. IR Experiments
      Baseline KW: Keyword Search
      Baseline QR: Query Rewriting
      Three configurations of ERM:
          ERM: computes alignments on the fly
          ERMa: uses pre-computed alignments only
          ERMq: uses pre-computed alignments and creates mappings on top
      Six different retrieval settings.
  26. Results (1): Mean Average Precision
      ERM improves over KW by 120% and over QR by 54%
      ERMa performs slightly better than ERM
      ERMq performs best.
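For readers unfamiliar with the metric on slide 26, Mean Average Precision and a relative improvement such as "120%" can be computed as sketched below. The definitions are the standard ones; the rankings and the two MAP values in the example are made up for illustration and are not the results of the experiments.

```python
def average_precision(ranked, relevant):
    """Average of the precision values at each rank where a relevant item occurs."""
    hits, precision_sum = 0, 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / position
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """Mean of the average precision over (ranking, relevant set) pairs."""
    return sum(average_precision(ranked, relevant) for ranked, relevant in runs) / len(runs)

def relative_improvement(new, baseline):
    """Improvement of `new` over `baseline` as a fraction of the baseline."""
    return (new - baseline) / baseline

# Illustrative data only: two queries with their rankings and relevant sets.
runs = [(["a", "b", "c", "d"], {"a", "c"}),
        (["x", "y"], {"y"})]
map_score = mean_average_precision(runs)

# Illustrative numbers only: 0.22 against a baseline of 0.10 is a gain of about 120%.
gain = relative_improvement(0.22, 0.10)
```

An improvement of 120% therefore means the MAP more than doubled relative to the baseline, not that it rose by 1.2 absolute points.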
  27. Results (2): On The Fly Alignment
      Pooled mappings for n = 115k entities
      Average Precision = 0.7, Average Recall = 0.3 for relevant entities
      Pearson correlation ρ(MAP, Precision-Rel) = 0.98
  28. Results (3): Parameter and Runtime Analysis
      Analysis of the parameters of the model
          Sensitivity of retrieval performance in terms of MAP for varying parameter configurations
      Runtime analysis
          Execution takes less than 13 s on average
          Can be improved by moving tasks (e.g. computation of language models) to index time.
  29. Conclusion
      Novel approach for searching entities in a target dataset Dt with a structured query qs adhering to the vocabulary of Ds.
      The Entity Relevance Model is used for ranking and for creating mappings during runtime.
      Experiments showed that our approach is effective and exceeds the baselines substantially.
  30. Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration
      Daniel M. Herzig, Thanh Tran
      herzig@kit.edu
      Institute AIFB, Karlsruhe Institute of Technology, Germany
      THANK YOU!
      [Thumbnails: Scenario, Overview, Baseline Keyword Search, Query Rewriting]
      ACKNOWLEDGEMENTS: We thank our colleagues Philipp Sorg and Günter Ladwig for helpful discussions. Also, we thank Julien Gaugaz and the L3S Research Center for providing us their versions of the IMDb and Amazon datasets. This work was supported by the German Federal Ministry of Education and Research (BMBF) under the iGreen project (grant 01IA08005K).
  31. Execution Process of our Approach
      Run qs against Ds to obtain results Rs
      Build the ERM from Rs
      Obtain candidate entities et
      Compare et to the ERM
      Rank et according to similarity to the ERM
      [Figure: qs, Ds, Rs, the Entity Relevance Model, and candidate entities et from the target datasets Dt]
  32. Runtime Analysis
      Average execution time is less than 13 s for the parameter setting used in the IR experiments.
      Increasing parameter c (i.e. reducing the number of fields of the ERM) increases performance
      Our implementation performed some tasks at runtime which can be moved to index time
      Improvements are easily possible
  33. Parameter Analysis
      Model is robust in certain parameter ranges
      Boosting b: beneficial for similar datasets, not so for diverse ones
      Pruning c: small effect on effectiveness, larger effect on efficiency
  34. Boosting Parameter b
      If attribute as is present in the seed query, the boosting parameter is set to b in order to increase its influence during ranking.
  35. Alignment
      Compare LMs (probability distributions) by cross entropy
      [Figure: language models of the ERM compared with those of et]
  36. Related Work (excerpt)
      Keyword search
          Wang et al.: Semplore: A scalable IR approach to search the Web of Data. In: Journal of Web Semantics. (2009)
      Query rewriting
          Calì et al.: Query Rewriting and Answering under Constraints in Data Integration Systems. In: IJCAI. (2003)
      Our approach is based on
          Lavrenko et al.: Information Retrieval on Empty Fields. In: HLT-NAACL. (2007)
          Madhavan et al.: Web-scale Data Integration: You can afford to Pay As You Go. In: CIDR. (2007)