Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration


  1. 1. Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration. Daniel M. Herzig, Thanh Tran. WWW2012. Institute of Applied Informatics and Formal Description Methods (AIFB). KIT, University of the State of Baden-Württemberg and National Research Center of the Helmholtz Association. www.kit.edu
  2. 2. Agenda: Motivation; Problem Definition; Existing Solutions; Our Approach; Entity Relevance Model (ERM); Ranking; On-The-Fly Alignment; Experiments; Conclusion. WWW2012, Daniel M. Herzig, Institute AIFB
  3. 3. A company runs a movie shopping website. [Figure: the movie shopping website and the company's dataset Ds.]
  4. 4. Users search the website via forms. Each search request is executed internally as a structured query qs (e.g. SQL, SPARQL) against Ds. [Figure: screenshot of http://www.imdb.com/search/title and an example query qs over IMDb (prefix i:): ?x type i:movie, ?x i:directors "Steven Spielberg", ?x i:year 1982.]
  5. 5. The company discovers the plethora of Linked Data available on the Web and identifies data sources beneficial for its business. [Figure: qs and Ds next to the Linked Data cloud, http://richard.cyganiak.de/2007/10/lod/]
  6. 6. Zero Star Mugs! vs. [Figure: comparison images not preserved.]
  7. 7. Problems of data integration arise: qs does not return results; there are no links and no integration; there is no knowledge about the external data schema; external data might change often.
  8. 8. Problem Definition: Find relevant entities in a set of target datasets Dt, given a source dataset Ds and a structured entity query qs adhering to the vocabulary of Ds. [Figure: structured entity query qs, source dataset Ds, target datasets Dt1 and Dt2.]
  9. 9. Problem Setting. Data model: a labeled directed graph, directly related to RDF; RDF specifics, e.g. blank nodes, are omitted. Entity query: a SPARQL BGP query with one select variable. Entity queries are the most frequent type of web search queries (Pound et al., WWW2010). Web data scenario: the data exhibits heterogeneity on the schema level and on the data level.
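
The problem setting on this slide can be illustrated with a minimal sketch. It assumes a simplification: the entity query is star-shaped, i.e. every triple pattern has the select variable ?x as its subject. The entity identifiers m1 and m2 and the helper name answer_entity_query are hypothetical and do not come from the slides.

```python
# Hypothetical data: (subject, predicate, object) triples of a labeled directed graph.
TRIPLES = [
    ("m1", "type", "i:movie"),
    ("m1", "i:directors", "Spielberg, Steven (I)"),
    ("m1", "i:year", "1982"),
    ("m2", "type", "i:movie"),
    ("m2", "i:directors", "Spielberg, Steven (I)"),
    ("m2", "i:year", "2005"),
]


def answer_entity_query(triples, patterns):
    """Return the subjects that satisfy every (predicate, object) pattern.

    Each pattern stands for one triple pattern `?x predicate object`, so the
    query is a star-shaped BGP with the single select variable ?x.
    """
    facts = {}
    for subject, predicate, obj in triples:
        facts.setdefault(subject, set()).add((predicate, obj))
    return sorted(s for s, po in facts.items() if all(p in po for p in patterns))


QUERY = [
    ("type", "i:movie"),
    ("i:directors", "Spielberg, Steven (I)"),
    ("i:year", "1982"),
]
RESULT = answer_entity_query(TRIPLES, QUERY)
```

Only m1 matches all three patterns. A general BGP may join several variables; the star shape is chosen here only to keep the sketch short.
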
  10. 10. Heterogeneous Web Data. Schema level: actors vs. starring. Data level: "Steven Spielberg" vs. "Spielberg, Steven". Varying number of attributes per entity. [Figure: example entities ea, ei and ed from Amazon (a:), IMDb (i:) and DBpedia (db:). Amazon vocabulary: a:Actors, a:Directors, a:ReleaseDate, a:Title, a:Binding, type a:Movie. IMDb vocabulary: i:actors, i:directors, i:title, i:producer, type i:movie. DBpedia vocabulary: db:director, db:starring, rdfs:label, type db:Film.]
  11. 11. Aim: integrate external data into the search process. Existing strategies: keyword search (Wang et al.: Semplore: A scalable IR approach to search the Web of Data. In: Journal of Web Semantics (2009)) and query rewriting based on up-front data integration (Calì et al.: Query Rewriting and Answering under Constraints in Data Integration Systems. In: IJCAI (2003)).
  12. 12. Existing Strategies: Keyword Search. Transform qs into a keyword query. Match it against a bag-of-words representation of the entities. This bridges schema differences by neglecting the structure. Baseline 1 (KW): IR baseline using Semplore (Lucene). [Figure: a query in the Amazon vocabulary (a:) with ?x type a:Movie, a:Directors "Rainer Werner Fassbinder" and a:TheatricalReleaseDate 1982 becomes the keyword query "directors rainerwernerfassbinder theatrical release date 1982 type movie". It is matched against IMDb (i:) entities represented as bags of words: e1 (title veronikavoss, director rainerwernerfassbinder, released 1982, type movie) and e2 (title schindlersliste 1994, director spielbergsteveni, type movie).]
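
The keyword baseline on this slide can be sketched as follows. This is a toy illustration under stated assumptions, not the Semplore or Lucene implementation: the tokenizer, the helper names and the overlap score are hypothetical, and a real IR engine would use a weighted scoring function instead of a plain keyword count.

```python
import re
from collections import Counter


def strip_prefix(term):
    """Drop a leading namespace prefix such as 'a:' or 'i:'."""
    return re.sub(r"^[A-Za-z]+:", "", term)


def tokens(term):
    """Lowercase the term and split it on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", strip_prefix(term).lower()) if t]


def to_bag(pairs):
    """Flatten (attribute, value) pairs into a bag of words, ignoring structure."""
    bag = Counter()
    for attribute, value in pairs:
        bag.update(tokens(attribute) + tokens(value))
    return bag


def overlap(query_bag, entity_bag):
    """Toy score: number of distinct query keywords present in the entity."""
    return sum(1 for word in query_bag if word in entity_bag)


QUERY = [("type", "a:Movie"), ("a:Directors", "Rainer Werner Fassbinder"),
         ("a:TheatricalReleaseDate", "1982")]
E1 = [("type", "i:movie"), ("i:title", "Veronika Voss"),
      ("i:director", "Rainer Werner Fassbinder"), ("i:released", "1982")]
E2 = [("type", "i:movie"), ("i:title", "Schindlers Liste (1994)"),
      ("i:director", "Spielberg, Steven (I)")]

query_bag = to_bag(QUERY)
scores = {"e1": overlap(query_bag, to_bag(E1)), "e2": overlap(query_bag, to_bag(E2))}
```

Because structure is ignored, the director name and the year match e1 even though the attribute names differ between the two vocabularies; the same property also lets unrelated attributes match by accident.
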
  13. 13. Existing Strategies: Query Rewriting. Create mappings using ontology alignment tools (Falcon-AO). Rewrite the query using the mappings, omit parts with missing mappings, and replace constants with variables. This reduces the search space; keyword search is performed on top. Baseline 2 (QR): database-style baseline. [Figure: the Amazon and DBpedia schemas are passed to an ontology alignment tool, which produces mappings such as a:Directors = db:director, a:Title = db:name and a:Actor = db:starring; the query over Amazon (a:) is rewritten into a query over DBpedia (db:).]
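
The rewriting step described on this slide can be sketched as below. The mapping table reuses the three example mappings from the figure; the function name rewrite and the variable naming scheme ?v0, ?v1 are assumptions made for illustration, and the keyword search performed on top of the rewritten query is not shown.

```python
# Example mappings taken from the slide's figure (source attribute -> target attribute).
MAPPINGS = {
    "a:Directors": "db:director",
    "a:Title": "db:name",
    "a:Actor": "db:starring",
}


def rewrite(patterns, mappings):
    """Rewrite (attribute, constant) patterns into the target vocabulary.

    Patterns without a mapping are omitted, and every constant is replaced
    by a fresh variable, as described on the slide.
    """
    rewritten = []
    for attribute, _constant in patterns:
        if attribute in mappings:
            rewritten.append((mappings[attribute], f"?v{len(rewritten)}"))
    return rewritten


QUERY = [
    ("a:Directors", "Rainer Werner Fassbinder"),
    ("a:TheatricalReleaseDate", "1982"),
]
REWRITTEN = rewrite(QUERY, MAPPINGS)
```

Here the release date pattern is dropped because no mapping exists for it, so the rewritten query only requires that a director attribute is present.
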
  14. 14. Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration
  15. 15. Contributions. (1) Novel approach for querying heterogeneous Web data sources: no upfront data integration is necessary; an Entity Relevance Model (ERM) is used for ranking and for computing mappings on the fly. (2) Implementation of the approach: construction of an ERM and its usage for alignment and ranking; best-effort algorithm for creating mappings during runtime. (3) Large-scale evaluation with 3 real-world datasets: experiments show that our approach exceeds the KW and QR baselines by 120% and 54%, respectively, in terms of Mean Average Precision.
  16. 16. Overview of our Approach. Keyword search is used to cross vocabulary mismatches. A model leverages the structure of the data. [Figure: qs is run against Ds to obtain Rs, from which the Entity Relevance Model is built (relevance feedback); a keyword query retrieves candidate entities et from the target datasets Dt; the candidates are matched against the model and ranked.]
  17. 17. Entity Relevance Model (ERM). Based on the Structured Relevance Model (Lavrenko et al. 2007). The ERM is a query-specific model that captures the structure and content of relevant results. It is a composite model consisting of language models weighted by occurrence. Based on Lavrenko et al.: Information Retrieval on Empty Fields. In: HLT-NAACL (2007).
  18. 18. ERM (2). [Figure: qs returns Rs = {e1, e2}, from which the ERM is built. The example entities are of type Film and carry the attributes label, starring, released, director and language, with values including "World on Wires", "Klaus Löwitsch", "Barbara Valentin", 1973, "Rainer Werner Fassbinder", 1982, "German" and "Veronika Voss".]
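
A minimal sketch of how such a model could be built from Rs is given below. It assumes maximum-likelihood unigram language models and takes "weighted by occurrence" to mean the fraction of result entities that carry an attribute; both choices, the function name build_erm and the assignment of the figure's values to e1 and e2 are assumptions, not details confirmed by the slides.

```python
from collections import Counter, defaultdict


def build_erm(results):
    """Build one unigram language model per attribute from the result entities.

    Returns {attribute: (language_model, weight)}, where the weight is the
    fraction of result entities that carry the attribute.
    """
    term_counts = defaultdict(Counter)
    carriers = Counter()
    for entity in results:
        for attribute, values in entity.items():
            carriers[attribute] += 1
            for value in values:
                term_counts[attribute].update(value.lower().split())
    erm = {}
    for attribute, counter in term_counts.items():
        total = sum(counter.values())
        model = {term: count / total for term, count in counter.items()}
        erm[attribute] = (model, carriers[attribute] / len(results))
    return erm


# Assumed assignment of the figure's values to the two result entities.
E1 = {"type": ["Film"], "label": ["World on Wires"], "released": ["1973"],
      "starring": ["Klaus Löwitsch", "Barbara Valentin"],
      "director": ["Rainer Werner Fassbinder"]}
E2 = {"type": ["Film"], "label": ["Veronika Voss"], "released": ["1982"],
      "language": ["German"], "director": ["Rainer Werner Fassbinder"]}

ERM = build_erm([E1, E2])
```

In this toy example the director field appears in both results and gets weight 1.0, while language appears only in e2 and gets weight 0.5.
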
  19. 19. Modelling Target Entities. Target entities are modeled the same way as the ERM, with a language model for each attribute. [Figure: IMDb (i:) entity ei of type i:movie with the attributes i:actors, i:directors, i:title and i:producer; values include "Coyote, Peter", "Spielberg, Steven (I)" and "E.T.".]
  20. 20. Ranking. Idea: rank candidate entities according to their similarity to the ERM. Note: an alignment between the ERM and et is needed; if no mapping is available, use max H. [Formula not preserved; its annotations read: boosting of seed query attributes, cross entropy, frequency of as.]
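
Since the ranking formula itself did not survive in this transcript, the following is only a sketch of one plausible reading of its annotations: a sum over ERM fields of boost times field weight times cross entropy, where a lower score means higher similarity. The floor probability EPSILON, the default b = 2.0 and the exact way the terms are combined are assumptions.

```python
import math

EPSILON = 1e-6               # assumed floor probability for unseen terms
MAX_H = -math.log(EPSILON)   # cross entropy when no term of the field matches


def cross_entropy(p, q):
    """H(p, q) = -sum_w p(w) * log q(w), with unseen terms floored at EPSILON."""
    return -sum(pw * math.log(max(q.get(w, 0.0), EPSILON)) for w, pw in p.items())


def score(erm, entity, mapping, seed_attributes, b=2.0):
    """Weighted sum of cross entropies; lower means more similar to the ERM."""
    total = 0.0
    for attribute, (model, weight) in erm.items():
        target = mapping.get(attribute)
        # Without a mapping for this field, fall back to the maximum entropy.
        h = cross_entropy(model, entity[target]) if target in entity else MAX_H
        boost = b if attribute in seed_attributes else 1.0
        total += boost * weight * h
    return total


# Toy ERM with two fields, each given as (language model, weight).
ERM = {"director": ({"fassbinder": 1.0}, 1.0), "released": ({"1982": 1.0}, 1.0)}
MAPPING = {"director": "i:directors", "released": "i:year"}
CANDIDATES = {
    "good": {"i:directors": {"fassbinder": 1.0}, "i:year": {"1982": 1.0}},
    "bad": {"i:directors": {"spielberg": 1.0}, "i:year": {"1982": 1.0}},
}

SCORES = {name: score(ERM, e, MAPPING, {"director"}) for name, e in CANDIDATES.items()}
RANKED = sorted(SCORES, key=SCORES.get)
```

Under this reading, boosting multiplies the contribution of seed query attributes, so a mismatch on the director field costs the second candidate twice as much as a mismatch on a non-seed field would.
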
  21. 21. On The Fly Alignment (as ~ at?). Idea: compare all language models of et to a field of the ERM using cross entropy H. Establish a mapping if the lowest value of H is lower than a threshold t. Worst case: n·r comparisons; n and r are usually small. The computed cross entropies can be reused for the subsequent ranking.
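
The alignment idea on this slide can be sketched as follows. The smoothing by a floor probability, the threshold value 5.0 and the function names are assumptions; the sketch also returns all computed cross entropies so they could be reused for ranking, as the slide suggests.

```python
import math

EPSILON = 1e-6  # assumed floor probability for unseen terms


def cross_entropy(p, q):
    """H(p, q) = -sum_w p(w) * log q(w), with unseen terms floored at EPSILON."""
    return -sum(pw * math.log(max(q.get(w, 0.0), EPSILON)) for w, pw in p.items())


def align(erm_field, entity, threshold):
    """Map an ERM field to the entity attribute with the lowest cross entropy.

    Returns (attribute or None, all computed entropies). No mapping is
    established when the lowest entropy is not below the threshold.
    """
    entropies = {a: cross_entropy(erm_field, lm) for a, lm in entity.items()}
    if not entropies:
        return None, entropies
    best = min(entropies, key=entropies.get)
    return (best if entropies[best] < threshold else None), entropies


DIRECTOR_FIELD = {"rainer": 1 / 3, "werner": 1 / 3, "fassbinder": 1 / 3}
ENTITY = {
    "i:directors": {"fassbinder": 1 / 3, "rainer": 1 / 3, "werner": 1 / 3},
    "i:title": {"veronika": 0.5, "voss": 0.5},
}

MAPPED, ENTROPIES = align(DIRECTOR_FIELD, ENTITY, threshold=5.0)
UNMAPPED, _ = align({"german": 1.0}, ENTITY, threshold=5.0)
```

The director field maps to i:directors because its entropy (log 3) is below the threshold, while a language field finds no attribute with shared terms and stays unmapped.
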
  22. 22. EXPERIMENTS
  23. 23. Datasets. Three real-world, heterogeneous Web datasets: (1) DBpedia 3.5.1, a structured representation of Wikipedia; (2) IMDb, information about movies; (3) Amazon, information about DVDs/videos. Datasets (2) and (3) were crawled and transformed to RDF; provided by L3S.
  24. 24. Ground Truth. The goal is to find relevant entities in the target datasets. The ground truth is obtained by manually rewriting the seed query qs so that it returns the relevant entities in the target datasets. There are 3 query sets, each with 23 corresponding entity BGP SPARQL queries. [Figure: the same query in three vocabularies. Amazon (a:): ?x type a:Movie, a:Directors "Rainer Werner Fassbinder", a:TheatricalReleaseDate 1982. DBpedia (db:): ?x type db:Film, db:director db:Rainer_Werner_Fassbinder, db:released 1982. IMDb (i:): ?x type i:movie, i:directors "Fassbinder, Rainer Werner", i:year 1982.]
  25. 25. IR Experiments. Baseline KW: keyword search. Baseline QR: query rewriting. Three configurations of ERM: ERM computes alignments on the fly; ERMa uses pre-computed alignments only; ERMq uses pre-computed alignments and creates mappings on top. Six different retrieval settings.
  26. 26. Results (1): Mean Average Precision. ERM improves over KW by 120% and over QR by 54%. ERMa performs slightly better than ERM. ERMq performs best.
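
For readers unfamiliar with the metric, Mean Average Precision and a relative improvement can be computed as in the sketch below. The rankings and relevance sets are toy data invented for illustration; they are not results from the experiments.

```python
def average_precision(ranked, relevant):
    """Mean of precision@k over the ranks k at which a relevant item appears."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Average the per-query average precision over all (ranking, relevant) pairs."""
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)


def relative_improvement(new, old):
    """Relative gain of `new` over `old`, e.g. 1.2 for an improvement of 120%."""
    return (new - old) / old


# Toy rankings for two queries (not taken from the paper).
RUNS = [
    (["a", "b", "c"], {"a", "c"}),  # AP = (1/1 + 2/3) / 2
    (["x", "y"], {"y"}),            # AP = 1/2
]
MAP = mean_average_precision(RUNS)
```

An improvement of 120% in this sense means the new MAP is 2.2 times the baseline MAP.
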
  27. 27. Results (2): On The Fly Alignment. Pooled mappings for n = 115k entities. Average Precision = 0.7 and Average Recall = 0.3 for relevant entities. Pearson correlation ρ(MAP, Precision-Rel) = 0.98.
  28. 28. Results (3): Parameter and Runtime Analysis. Parameter analysis: sensitivity of retrieval performance in terms of MAP for varying parameter configurations. Runtime analysis: execution takes less than 13 s on average and can be improved by moving tasks (e.g. computation of language models) to index time.
  29. 29. Conclusion. Novel approach for searching entities in a target dataset Dt with a structured query qs adhering to the vocabulary of Ds. The Entity Relevance Model is used for ranking and for creating mappings during runtime. Experiments showed that our approach is effective and exceeds the baselines substantially.
  30. 30. THANK YOU! Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration. Daniel M. Herzig, Thanh Tran, herzig@kit.edu, Institute AIFB, Karlsruhe Institute of Technology, Germany. [Figure panels: Scenario, Overview, Baseline Keyword Search, Query Rewriting.] Acknowledgements: We thank our colleagues Philipp Sorg and Günter Ladwig for helpful discussions. Also, we thank Julien Gaugaz and the L3S Research Center for providing us their versions of the IMDb and Amazon datasets. This work was supported by the German Federal Ministry of Education and Research (BMBF) under the iGreen project (grant 01IA08005K).
  31. 31. Execution Process of our Approach. (1) Run qs against Ds to obtain results Rs. (2) Build the ERM from Rs. (3) Obtain candidate entities et. (4) Compare et to the ERM. (5) Rank et according to their similarity to the ERM.
  32. 32. Runtime Analysis. Average execution time is less than 13 s for the parameter setting used in the IR experiments. Increasing parameter c (i.e. reducing the number of fields of the ERM) improves performance. Our implementation performed some tasks at runtime that can be moved to index time, so improvements are easily possible.
  33. 33. Parameter Analysis. The model is robust in certain parameter ranges. Boosting b: beneficial for similar datasets, less so for diverse ones. Pruning c: small effect on effectiveness, larger effect on efficiency.
  34. 34. Boosting Parameter b. If attribute as is present in the seed query, the boosting parameter is set to b in order to increase its influence during ranking.
  35. 35. Alignment. Compare the language models (probability distributions) of the ERM and et by cross entropy.
  36. 36. Related Work (excerpt). Keyword search: Wang et al.: Semplore: A scalable IR approach to search the Web of Data. In: Journal of Web Semantics (2009). Query rewriting: Calì et al.: Query Rewriting and Answering under Constraints in Data Integration Systems. In: IJCAI (2003). Our approach is based on: Lavrenko et al.: Information Retrieval on Empty Fields. In: HLT-NAACL (2007); Madhavan et al.: Web-scale Data Integration: You can afford to pay as you go. In: CIDR (2007).
