Semantic Web Search - Searching Documents and Semantic Data on the Web

Presentation at Information Sciences Institute, USC

  • Web data comprises text, Linked Data, semi-structured RDF and hybrid data, all of which can be conceived as forming data graphs. We hear about Bob and Alice all the time (in the computer science literature) and want to find out more, so we build a Semantic Web search engine. The goal is to address complex information needs by exploiting Web data: an information need is interpreted as a set of constraints, some matched against structured data and some matched against text.
  • To give an impression of where we are towards accomplishing this goal, a demo comes first. Our current system supports the process of addressing complex information needs: it starts with keyword search and the interpretation of the query intent, followed by browsing, exploration and refinement of the result set via faceted search.
  • Upon selecting a specific result: resource-based navigation (instead of facet-based navigation).
  • TF-IDF is used to deal with the textual part of the data. We propose to also exploit the structure of keyword search results. Proximity-based ranking employs minimal-distance heuristics to maximize the structural compactness of results: when a JRT is more compact, it is assumed to be more meaningful and relevant. Intuition: the keywords specified by the user are closely related and thus should be connected over relatively short paths. Compactness is measured in terms of the length of the paths between nodes, i.e. the proximity; the longer the paths, the less relevant the overall result. Here n_i and n_j are nodes in the graph, sim(n_i, n_j) denotes the compactness between any two nodes, and sim(k_i, k_j) denotes the compactness between two keywords (taking into account the compactness of all pairs of nodes matching the two keywords), where C_ki denotes the set of all nodes that match k_i. The overall score of a JRT is an aggregation on the scores of its keyword pairs.
  • Schemas = summaries

Transcript

  • 1. Semantic Web Search: Searching Documents and Semantic Data on the Web. Presentation at Information Sciences Institute, USC. Semantic Search Group at the AIFB Institute: Thanh Tran, Günter Ladwig, Daniel M. Herzig, Andreas Wagner, Veli Bicer, Yongtao Ma and Rudi Studer. http://sites.google.com/site/kimducthanh KIT – the cooperation of Forschungszentrum Karlsruhe GmbH and Universität Karlsruhe (TH)
  • 2. Structure
    • Motivation
    • Previous and current work
    • Keyword query processing
    • Keyword query result ranking
    • Conclusion
  • 3. Motivation: Besides documents, there is an increasing amount of structured data on the Web, such as RDF, RDFa and Linked Data. How can we leverage this to enhance the search experience?
  • 4. RDFa (adapted from http://www.w3.org/TR/xhtml-rdfa-primer/)
    …
    <div about="/alice/posts/trouble_with_bob">
      <h2 property="dc:title">The trouble with Bob</h2>
      <h3 property="dc:creator">Alice</h3>
      Bob is a good friend of mine. We went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do:
      <div about="http://example.com/bob/photos/sunset.jpg">
        <img src="http://example.com/bob/photos/sunset.jpg" />
        <span property="dc:title">Beautiful Sunset</span> by <span property="dc:creator">Bob</span>.
      </div>
    </div>
    …
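To make the idea of data embedded in the page concrete, here is a minimal sketch that pulls (subject, property, literal) triples out of a shortened version of the snippet above. It only looks at the `about` and `property` attributes and is not a conforming RDFa processor: prefixes and relative URIs are left unresolved. The names `RdfaSketch`, `extract_triples` and `SAMPLE` are hypothetical.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they never hold literal text.
VOID_TAGS = {"img", "br", "hr", "meta", "link", "input"}


class RdfaSketch(HTMLParser):
    """Collects triples from `about` (subject) and `property` (predicate) attributes."""

    def __init__(self):
        super().__init__()
        self.stack = []    # one frame per open element
        self.triples = []  # (subject, property, literal)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attributes = dict(attrs)
        inherited = self.stack[-1]["subject"] if self.stack else None
        self.stack.append({
            "subject": attributes.get("about", inherited),
            "property": attributes.get("property"),
            "text": [],
        })

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.stack:
            return
        frame = self.stack.pop()
        if frame["property"]:
            literal = "".join(frame["text"]).strip()
            self.triples.append((frame["subject"], frame["property"], literal))

    def handle_data(self, data):
        # Text belongs to the innermost open element, if it carries a property.
        if self.stack and self.stack[-1]["property"]:
            self.stack[-1]["text"].append(data)


SAMPLE = """
<div about="/alice/posts/trouble_with_bob">
  <h2 property="dc:title">The trouble with Bob</h2>
  <h3 property="dc:creator">Alice</h3>
  <div about="http://example.com/bob/photos/sunset.jpg">
    <img src="http://example.com/bob/photos/sunset.jpg" />
    <span property="dc:title">Beautiful Sunset</span> by
    <span property="dc:creator">Bob</span>.
  </div>
</div>
"""


def extract_triples(html):
    parser = RdfaSketch()
    parser.feed(html)
    parser.close()
    return parser.triples


for triple in extract_triples(SAMPLE):
    print(triple)
```

The post and the photo end up as two subjects, each with a `dc:title` and a `dc:creator`, which is the kind of structure a search engine can match constraints against.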
  • 5. RDFa content: Bob is a good friend of mine. We went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do: (content adapted from http://www.w3.org/TR/xhtml-rdfa-primer/)
  • 6. Semantic Data (source: http://linkeddata.org/)
  • 7. Linked Data (adapted from http://www.w3.org/TR/xhtml-rdfa-primer/)
  • 8. Addressing Complex Information Needs
    • “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone in the field of Semantic Search working at KIT”.
    • Constraints: <friend of Alice>, <shared apartment in Berlin with Alice>, <knows someone in the field of Semantic Search working at KIT>
    • [Figure: data graph connecting the text of the post “The trouble with Bob” with nodes such as Alice, Bob, Peter, Thanh, sunset.jpg (“Beautiful Sunset”), FluidOps, KIT, Semantic Search and Germany]
  • 9. Data Sources in the SemanticSearch@AIFB Demo
    • English Wikipedia
    • Data from Linked Open Data
      • DBpedia
      • YAGO
      • Many more
    • Live data from Data.gov (US Government)
      • E.g. live data about earthquakes
  • 10. Search Intent Interpretation, Refinement and Exploration [Screenshot labels: Keywords, Query Completions, Term Completions, Facets]
  • 11. Result Inspection, Analysis and Browsing
  • 12. OVERVIEW OF WORK
  • 13. Search Concepts
    • Hybrid search (back-end): structured queries combined with keywords on structured and unstructured data in possibly remote (Linked Data) sources
    • Query interpretation: translation of keywords to hybrid queries
    • Keyword search (translated hybrid query) combined with faceted search (front-end): starting with keywords, followed by an iterative refinement process based on operations on facets
  • 14. Previous and Current Work
    • Semi-structured RDF data management [ISWC09] [TKDE12]
      • Inverted index for RDF data management
      • Structure index
    • Linked data management [ESWC10] [ISWC10] [ESWC11] [ISWC11]
      • Keyword query routing to find relevant sources / relevant combinations of sources
      • “Explorative” query processing and adaptive query optimization
      • Combining local and remote Linked Data
    • Search frontends [ICDE09] [CIKM11] [SIGIR11] [ISWC2011] [Dexa11]
      • Ontology and entity result summarization
      • Faceted and keyword search
    • Current work: hybrid data search
  • 15. KEYWORD QUERY PROCESSING [ICDE09]
  • 16. DB-style Keyword Search
    • Keyword query processing / translation: the information need “Articles of researchers at Stanford with Turing Award” is specified as the keyword query “Stanford Article Turing Award”
    • Keywords might produce a large number of matching elements in the data graph
    • The data graph might be large in size
    • Search complexity increases substantially with the size of the graph
    • Large number of results
    • [Figure: specification of the keyword query, then selection from a set of queries (Query 1, Query 2, …) leading to a set of results (Result 1, Result 2, …)]
  • 17. Query Space
    • Main idea
      • Exploration on a much reduced model of the data graph
      • Query space: a more compact representation (summary) of the data graph, constructed out of the schema graph
      • Substantially decreases complexity
      • Top-k procedure for graph exploration to compute only the top-k results
    • Online construction of the query space
      • Match keywords against labels of resources to find keyword elements
      • Connect keyword elements with elements of the schema to obtain the query space
    • Online top-k query graph exploration
    • [Figure: schema graph and query space]
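The two construction steps on this slide (match keywords against labels, then connect the matches to the schema) can be sketched on a toy example. The schema, the labels, the substring matching and the `type` edge name below are all illustrative assumptions, not the data structures of the actual system.

```python
# Hypothetical schema graph: (class, predicate, class) edges.
SCHEMA_EDGES = {
    ("Article", "author", "Researcher"),
    ("Researcher", "worksAt", "University"),
    ("Researcher", "won", "Award"),
}

# Hypothetical labelled elements: element id -> (label, schema class).
# A class of None means the element already is a schema element.
ELEMENTS = {
    "stanford_univ": ("Stanford University", "University"),
    "turing_award": ("Turing Award", "Award"),
    "Article": ("Article", None),
}


def build_query_space(schema_edges, elements, keywords):
    """Return (keyword elements per keyword, query space as a set of edges)."""
    # Step 1: match keywords against labels (naive substring match).
    keyword_elements = {
        keyword: sorted(
            element
            for element, (label, _) in elements.items()
            if keyword.lower() in label.lower()
        )
        for keyword in keywords
    }
    # Step 2: attach each matched data element to its schema class.
    query_space = set(schema_edges)
    for matched in keyword_elements.values():
        for element in matched:
            schema_class = elements[element][1]
            if schema_class is not None:
                query_space.add((element, "type", schema_class))
    return keyword_elements, query_space


matches, space = build_query_space(
    SCHEMA_EDGES, ELEMENTS, ["Stanford", "Article", "Turing", "Award"]
)
print(matches)
print(sorted(space))
```

The resulting graph is only slightly larger than the schema itself, which is the point of exploring the query space instead of the full data graph.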
  • 18. Top-k Query Graph Exploration on Query Space
    • Cost-directed exploration of Steiner graphs
    • Explore all possible distinct paths starting from keyword elements
    • At each exploration step, take the current path with the lowest cost
    • When a connecting element is found, merge paths to construct the query graph and add it to the candidate list
    • Top-k terminates when the highest cost in the candidate list (the cost of the k-th ranked query graph) is found to be lower than the lowest possible cost that can be achieved with paths in the queues yet to be explored
    • Result: the best k query interpretations to be shown to the user
    • [Figure: paths and their costs; the resulting query graph]
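The exploration loop can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm of the ICDE09 paper: it keeps only the cheapest path per keyword and node (so it yields at most one candidate per connecting element), uses a single shared priority queue, and assumes non-negative edge costs. The function name and the toy graph are hypothetical.

```python
import heapq


def top_k_query_graphs(graph, keyword_elements, k):
    """graph: node -> list of (neighbor, cost); one start node per keyword."""
    n = len(keyword_elements)
    queue = [(0.0, i, start, (start,)) for i, start in enumerate(keyword_elements)]
    heapq.heapify(queue)
    best = {}        # node -> {keyword index: (cost, path)}
    candidates = []  # (total cost, connecting element, paths)
    while queue:
        # Always continue with the cheapest path found so far.
        cost, i, node, path = heapq.heappop(queue)
        # Stop once the k-th candidate is cheaper than anything still queued.
        if len(candidates) >= k and sorted(c[0] for c in candidates)[k - 1] < cost:
            break
        reached = best.setdefault(node, {})
        if i in reached:
            continue  # a cheaper path from this keyword already got here
        reached[i] = (cost, path)
        if len(reached) == n:
            # Connecting element: reachable from every keyword, so merge paths.
            total = sum(c for c, _ in reached.values())
            candidates.append((total, node, [reached[j][1] for j in range(n)]))
        for neighbor, edge_cost in graph.get(node, []):
            if i not in best.get(neighbor, {}):
                heapq.heappush(queue, (cost + edge_cost, i, neighbor, path + (neighbor,)))
    return sorted(candidates, key=lambda c: c[0])[:k]


# Toy query space with unit costs (undirected edges listed in both directions).
GRAPH = {
    "Stanford": [("Researcher", 1)],
    "Article": [("Researcher", 1)],
    "TuringAward": [("Researcher", 1)],
    "Researcher": [("Stanford", 1), ("Article", 1), ("TuringAward", 1)],
}

for total, element, paths in top_k_query_graphs(GRAPH, ["Stanford", "Article", "TuringAward"], k=1):
    print(total, element, paths)
```

The early-exit test is sound in this sketch because any candidate not yet found must contain at least one path that is still in the queue, so its total cost cannot be below the cost at the head of the queue.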
  • 19. Evaluation – Performance
    • Comparison with bidirectional search [V. Kacholia et al.] and search based on graph indexing (1000 BFS, 1000 METIS, 300 BFS, 300 METIS in [H. He et al.])
    • Query computation + processing time until finding 10 answers
    • Outperforms bidirectional search by at least one order of magnitude
    • Performance comparable with indexing-based approaches, but requires less space
    • [Figure: Query Performance on DBLP Data; queries Q1 to Q10 on a logarithmic axis from 1 to 100000; series: Our Solution, Bidirect, 1000 BFS, 1000 METIS, 300 BFS, 300 METIS]
  • 20. KEYWORD QUERY RESULT RANKING [CIKM11]
  • 21. IR-based Ranking Schemes
    • TF*IDF based: Discover, EASE, SPARK
    • [Liu et al, SIGMOD06]:
      $Score(JRT) = \sum_{r \in JRT} Score(r)$
      $Score(r) = \sum_{v \in r \cap Q} Weight(v, r) \cdot Weight(v, Q)$
      $Weight(v, r) = \frac{ntf}{ndl} \cdot nidf$
      $ntf = 1 + \ln(1 + \ln(tf))$
      $ndl = (1 - s) + s \cdot \frac{dl}{avdl}$
      $nidf = \ln \frac{N + 1}{df}$
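A worked example may help in reading the normalized TF*IDF weight. The sketch below implements ntf, ndl and nidf as reconstructed from this slide; the slope s = 0.2, the function names and the toy statistics are assumptions made for illustration, and Weight(v, Q) is simply passed in as a number.

```python
import math


def weight_in_result(tf, dl, avdl, df, n, s=0.2):
    """Weight(v, r) = ntf / ndl * nidf, for a term with tf >= 1 and df >= 1."""
    ntf = 1 + math.log(1 + math.log(tf))   # dampened term frequency
    ndl = (1 - s) + s * dl / avdl          # pivoted length normalization
    nidf = math.log((n + 1) / df)          # inverse document frequency
    return ntf / ndl * nidf


def score_result(terms, dl, avdl, n, query_weights, s=0.2):
    """Score(r): sum over terms v in both r and Q of Weight(v, r) * Weight(v, Q).

    terms: term -> (tf, df) for the terms of result r.
    query_weights: term -> Weight(v, Q).
    """
    return sum(
        weight_in_result(tf, dl, avdl, df, n, s) * query_weights[v]
        for v, (tf, df) in terms.items()
        if v in query_weights
    )


def score_jrt(result_scores):
    """Score(JRT): sum of the scores of the results r joined in the JRT."""
    return sum(result_scores)


# Toy statistics: 9 results in total, this one has average length.
r1 = score_result({"stanford": (1, 1), "noise": (3, 2)}, dl=10, avdl=10, n=9,
                  query_weights={"stanford": 1.0})
print(r1, score_jrt([r1, 0.5]))
```

With tf = 1 and dl = avdl, both ntf and ndl are 1, so the weight reduces to nidf = ln(10) in this example.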
  • 22. Proximity-based Ranking Schemes
    • EASE, XRANK, BLINKS, etc.
    • EASE
      • Proximity between a pair of keywords
      • Overall score of a JRT is an aggregation on the scores of keyword pairs
    • XRANK
      • Ranking of XML documents / elements
      • Proximity here is defined based on w, the smallest text window in n that contains all search keywords
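The pairwise proximity idea can be illustrated with a small sketch. The concrete formulas are assumptions chosen for simplicity and are not EASE's: node similarity is taken as 1 / (1 + shortest path length), keyword similarity as the average over all matching node pairs, and the JRT score as the sum over keyword pairs.

```python
from collections import deque
from itertools import combinations


def path_length(adj, source, target):
    """Length of the shortest path in an unweighted graph, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, dist = frontier.popleft()
        for neighbor in adj.get(node, ()):
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, dist + 1))
    return None


def sim_nodes(adj, ni, nj):
    """Compactness of two nodes: shorter paths give higher similarity (assumed form)."""
    dist = path_length(adj, ni, nj)
    return 0.0 if dist is None else 1.0 / (1 + dist)


def sim_keywords(adj, nodes_ki, nodes_kj):
    """Average compactness over all pairs of nodes matching the two keywords."""
    pairs = [(a, b) for a in nodes_ki for b in nodes_kj]
    return sum(sim_nodes(adj, a, b) for a, b in pairs) / len(pairs)


def score_jrt(adj, matches):
    """Aggregate (here: sum) the similarity of every keyword pair in the result."""
    return sum(
        sim_keywords(adj, matches[ki], matches[kj])
        for ki, kj in combinations(sorted(matches), 2)
    )


# Path graph a - b - c.
ADJ = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(score_jrt(ADJ, {"k1": ["a"], "k2": ["b"]}))  # keywords one hop apart
print(score_jrt(ADJ, {"k1": ["a"], "k2": ["c"]}))  # keywords two hops apart
```

The more compact result (keywords one hop apart) receives the higher score, which is the heuristic the slide describes.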
  • 23. Prestige-based Ranking Schemes
    • Based on graph structure, i.e. PageRank-like methods to determine node prestige
    • XRANK [Guo et al, SIGMOD03]
    • ObjectRank [Balmin et al, VLDB04]: considers both global ObjectRank and keyword-specific ObjectRank
      • The probabilities that edges of different types will be visited are not uniform: requires manual fine-tuning to set the importance of different types of edges
    • Naive: indegree
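To show what non-uniform edge probabilities mean in practice, here is a generic PageRank-style power iteration in which each edge type carries a manually chosen weight. It is a sketch of the general idea only, not ObjectRank's authority transfer scheme; the edge types, weights and damping factor are assumptions.

```python
def prestige(edges, type_weight, damping=0.85, iterations=50):
    """edges: list of (source, edge type, target); returns node -> prestige."""
    nodes = sorted({n for source, _, target in edges for n in (source, target)})
    # Total outgoing weight per node, used to normalize transition probabilities.
    out_weight = {n: 0.0 for n in nodes}
    for source, edge_type, _ in edges:
        out_weight[source] += type_weight[edge_type]
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Nodes without outgoing weight spread their rank uniformly.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0)
        base = (1 - damping) / len(nodes) + damping * dangling / len(nodes)
        updated = {n: base for n in nodes}
        for source, edge_type, target in edges:
            if out_weight[source] > 0:
                share = type_weight[edge_type] / out_weight[source]
                updated[target] += damping * rank[source] * share
        rank = updated
    return rank


# Hypothetical bibliographic graph with two edge types.
EDGES = [
    ("paper1", "cites", "paper2"),
    ("paper3", "cites", "paper2"),
    ("paper2", "cites", "paper4"),
    ("paper2", "writtenBy", "author1"),
]
WEIGHTS = {"cites": 0.7, "writtenBy": 0.3}  # manually tuned importance per edge type

for node, value in sorted(prestige(EDGES, WEIGHTS).items()):
    print(node, round(value, 4))
```

Because `cites` outweighs `writtenBy`, paper2 passes more of its prestige to paper4 than to author1, and changing the weights changes the ranking, which is why such schemes need tuning.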
  • 24. Introduction
    • A recent study shows that the effectiveness of most approaches is below expectations (Coffman and Weaver, CIKM 2010)
    • Problems:
      • Proximity does not directly model relevance
      • Ad-hoc TF/IDF normalization does not capture the nature of keyword search results well (small document length, skewed word occurrence statistics)
      • PageRank is not directly applicable
  • 25. Overview of the Approach
    • Keyword queries are short and ambiguous, while data (and results) provide rich structure information that can be exploited!
    • Principled approach to relevance based on language models and pseudo-relevance feedback (PRF)
      • Estimate the model from the content and structure of PRF results
    • Adopt the relevance model as a fine-grained model representing both content and structure of relevant documents and queries (relevance class)
  • 26. Relevance Models [SIGIR 01]
    • Explicit notion of relevance
    • Queries and documents are samples from a latent representation space, i.e. the relevance model underlying the information need
  • 27. Relevance Models
    • The query words q1 = Israeli, q2 = Palestinian and q3 = raids are samples from a model M; which word w is sampled next?
    • Sample probabilities P(w|Q):
      P(w|Q)  w
      .077    palestinian
      .055    israel
      .034    jerusalem
      .033    protest
      .027    raid
      .011    clash
      .010    bank
      .010    west
      .010    troop
      …
    • $P(w \mid R) \approx P(w \mid q_1 \ldots q_k) = \frac{P(w, q_1 \ldots q_k)}{P(q_1 \ldots q_k)}$
    • $P(w, q_1 \ldots q_k) = \sum_{M \in U_M} P(M) \, P(w \mid M) \prod_{i=1}^{k} P(q_i \mid M)$
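The estimate on this slide can be traced in a few lines of code. The sketch below treats each document as one model M with a uniform prior P(M) and estimates P(w|M) by linear interpolation with the collection model; the prior, the smoothing weight `lam` and the toy documents are assumptions. It also assumes that every query word occurs somewhere in the collection.

```python
from collections import Counter
from math import prod


def relevance_model(query, docs, lam=0.5):
    """Return P(w | q1..qk) for every word w in the collection vocabulary."""
    models = [Counter(doc) for doc in docs]
    collection = Counter(word for doc in docs for word in doc)
    collection_size = sum(collection.values())

    def p(word, model):
        # Smoothed P(word | M): document estimate mixed with collection estimate.
        return (lam * model[word] / sum(model.values())
                + (1 - lam) * collection[word] / collection_size)

    prior = 1.0 / len(models)  # uniform P(M), an assumption
    joint = {
        word: sum(prior * p(word, m) * prod(p(q, m) for q in query) for m in models)
        for word in collection
    }
    # P(q1..qk) is the joint summed over all words w.
    normalizer = sum(joint.values())
    return {word: value / normalizer for word, value in joint.items()}


DOCS = [
    ["israeli", "palestinian", "raid", "protest"],
    ["palestinian", "raid", "clash"],
    ["weather", "sunny", "forecast"],
]

rm = relevance_model(["israeli", "palestinian"], DOCS)
for word, prob in sorted(rm.items(), key=lambda item: -item[1]):
    print(f"{prob:.3f}  {word}")
```

Words that co-occur with the query words in the same documents, such as "raid", receive more probability mass than unrelated words, mirroring the table on the slide.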
  • 28. Ranking with Relevance Models
    • Probability ranking principle: $\frac{P(D \mid R)}{P(D \mid N)} \approx \prod_{w \in D} \frac{P(w \mid R)}{P(w \mid N)}$
    • See the relevance model as query expansion
    • The rank of a document is based on the cross-entropy of its model and the relevance model:
      $H(R \,\|\, D) = -\sum_{w \in V} P(w \mid R) \log P(w \mid D)$
      $P(w \mid D) = \lambda_D \frac{n(w, D)}{|D|} + (1 - \lambda_D) \, P(w \mid C)$
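As a small illustration of the ranking step, the sketch below scores documents by the cross-entropy between a given relevance model and each smoothed document model, with lower values ranking higher. The relevance model values, `lam` and the toy documents are assumptions; every word of the relevance model is assumed to occur in the collection so that the logarithm is defined.

```python
import math
from collections import Counter


def smoothed_doc_model(doc, collection, lam):
    """P(w | D) = lam * n(w, D) / |D| + (1 - lam) * P(w | C)."""
    counts = Counter(doc)
    collection_size = sum(collection.values())

    def p(word):
        return lam * counts[word] / len(doc) + (1 - lam) * collection[word] / collection_size

    return p


def cross_entropy(relevance_model, doc, collection, lam=0.8):
    """H(R || D) = -sum_w P(w | R) * log P(w | D)."""
    p = smoothed_doc_model(doc, collection, lam)
    return -sum(prob * math.log(p(word))
                for word, prob in relevance_model.items() if prob > 0)


def rank_documents(relevance_model, docs, lam=0.8):
    """Indices of docs ordered from lowest to highest cross-entropy."""
    collection = Counter(word for doc in docs for word in doc)
    return sorted(range(len(docs)),
                  key=lambda i: cross_entropy(relevance_model, docs[i], collection, lam))


DOCS = [["weather", "sunny"], ["raid", "clash", "raid"]]
RM = {"raid": 0.6, "clash": 0.4}  # hypothetical relevance model
print(rank_documents(RM, DOCS))
```

The second document is ranked first because its language model assigns high probability to the words the relevance model expects.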
  • 29. Edge-Specific Relevance Models
    • Given a query Q = {q1, …, qn}, a set of PRF resources is retrieved from an inverted keyword index:
      • E.g. Q = {Hepburn, Holiday}, FR = {m1, p1, p4,m2, p2m2,m3}
    • Based on the PRF results, an edge-specific relevance model is constructed for each unique edge e based on:
  • 30. Edge-Specific Resource Models
    • Edge-specific resource model:
    • Smoothing with the model for the entire resource
    • The score of a resource is calculated based on the cross-entropy of the edge-specific RM and the edge-specific ResM:
    • Alpha allows controlling the importance of edges
  • 31. Ranking JRTs
    • Ranking aggregated JRTs:
      • The cross-entropy between the edge-specific RM (query model) and the geometric mean of the combined edge-specific ResMs:
    • The proposed ranking function is monotonic with respect to the individual resource scores (a necessary property for using top-k algorithms)
  • 32. Experiments
    • Datasets: subsets of Wikipedia, IMDB and Mondial Web databases
    • Queries: 50 queries for each dataset, including “TREC style” queries and “single resource” queries
    • Metrics: three metrics are used: (1) the number of top-1 relevant results, (2) reciprocal rank and (3) Mean Average Precision (MAP)
    • Baselines: BANKS, Bidirectional (proximity), Efficient, SPARK, CoveredDensity (TF-IDF)
    • RM-S: our approach
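For reference, the three metrics can be computed as in the following sketch, which uses their standard definitions; the function names and the toy ranking are illustrative and not taken from the evaluation framework used in the study.

```python
def top1_relevant(ranked, relevant):
    """1 if the first result is relevant, else 0 (summed over queries for metric 1)."""
    return int(bool(ranked) and ranked[0] in relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant result."""
    hits = 0
    precision_sum = 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / position
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked results, set of relevant results), one per query."""
    return sum(average_precision(ranked, relevant) for ranked, relevant in runs) / len(runs)


ranked = ["a", "b", "c", "d"]
relevant = {"b", "d"}
print(top1_relevant(ranked, relevant))      # 0
print(reciprocal_rank(ranked, relevant))    # 0.5
print(average_precision(ranked, relevant))  # 0.5
```

In the example the first relevant result is at rank 2, so the reciprocal rank is 1/2, and the precision values at ranks 2 and 4 are 1/2 and 2/4, giving an average precision of 0.5.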
  • 33. Experiments – Single Resource Queries
    • Proximity-based approaches perform well
    • Maximizing compactness (i.e. minimizing path lengths) results in single resources being ranked high
    • TF-IDF normalization is not as aggressive and not as effective
    • [Figure: Reciprocal rank for single resource queries]
  • 34. Experiments – TREC-style Queries
    • TF-IDF based approaches performed better
    • Our approach outperformed existing approaches also in this category, providing more stable performance over the entire precision-recall curve
    • [Figure: Precision-recall for TREC-style queries on Wikipedia]
  • 35. Experiment – All Queries
    • Our approach consistently shows superior performance
    • Encouraging, given that this is the first study that uses a general framework for evaluating keyword search ranking
    • [Figure: MAP scores for all queries]
  • 36. Conclusions / Future Work
    • Front-to-backend work on using structured data for enhancing the search experience
      • From backend data management to frontend search concepts
    • Current work / future directions
      • Managing hybrid data
      • Hybrid query processing / interfaces
      • Ranking hybrid results
  • 37. References (1)
    • Günter Ladwig, Thanh Tran: SIHJoin: Querying Remote and Local Linked Data. In 8th Extended Semantic Web Conference (ESWC11). Heraklion, Greece, June 2011 (full research paper, 23% acceptance rate).
    • Thanh Tran, Lei Zhang, Rudi Studer: Summary Models for Routing Keywords to Linked Data Sources. In Proceedings of the 9th International Semantic Web Conference (ISWC10). Shanghai, China, November 2010 (full research paper, 20% acceptance rate).
    • Günter Ladwig, Thanh Tran: Linked Data Query Processing Strategies. In Proceedings of the 9th International Semantic Web Conference (ISWC10). Shanghai, China, November 2010 (full research paper, 20% acceptance rate).
    • Duc Thanh Tran, Philipp Cimiano, Sebastian Rudolph, Rudi Studer: Ontology-based Interpretation of Keywords for Semantic Search. In Proceedings of the 6th International Semantic Web Conference (ISWC07), pp. 523-536. Busan, Korea, November 2007 (full paper, 19% acceptance rate).
  • 38. References (2)
    • Duc Thanh Tran, Haofen Wang, Sebastian Rudolph, Philipp Cimiano: Top-k Exploration of Query Graph Candidates for Efficient Keyword Search on RDF. In Proceedings of the 25th International Conference on Data Engineering (ICDE09). Shanghai, China, March 2009 (full research paper, 17% acceptance rate).
    • Haofen Wang, Duc Thanh Tran, Chang Liu: CE2 - Towards a Large Scale Hybrid Search Engine with Integrated Ranking Support. In Proceedings of the 17th Conference on Information and Knowledge Management (CIKM08). Napa Valley, USA, October 2008 (poster paper, 16% acceptance rate).
    • Haofen Wang, Qiaoling Liu, Thomas Penin, Linyun Fu, Lei Zhang, Thanh Tran, Yong Yu, Yue Pan: Semplore: A Scalable IR Approach to Search the Web of Data. In Journal of Web Semantics, 2009 (Impact Factor 3.4).
    • Thomas Penin, Haofen Wang, Duc Thanh Tran, Yong Yu: Snippet Generation for Semantic Web Search Engines. In Proceedings of the 3rd Asian Semantic Web Conference (ASWC08). December 2008 (full research paper, 31% acceptance rate).
    • Thanh Tran, Günter Ladwig: Structure Index for RDF. In SemData@VLDB Workshop (SemData10). Singapore, September 2010.
  • 39. Thanks! Tran Duc Thanh, ducthanh.tran@kit.edu, http://sites.google.com/site/kimducthanh/
  • 40. Backups
  • 41.
    • Agrawal, S., Chaudhuri, S., and Das, G. (2002). DBXplorer: A system for keyword-based search over relational databases. In ICDE, pages 5-16.
    • Amer-Yahia, S. and Shanmugasundaram, J. (2005). XML full-text search: Challenges and opportunities. In VLDB, page 1368.
    • Bao, Z., Ling, T. W., Chen, B., and Lu, J. (2009). Effective XML keyword search with relevance oriented ranking. In ICDE, pages 517-528.
    • Bhalotia, G., Nakhe, C., Hulgeri, A., Chakrabarti, S., and Sudarshan, S. (2002). Keyword searching and browsing in databases using BANKS. In ICDE, pages 431-440.
    • Bicer, V. and Tran, T. (2011). Ranking support for keyword search on structured data using relevance models. In CIKM.
    • Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., and Hellmann, S. (2009). DBpedia - A crystallization point for the Web of Data. J. Web Sem., 7(3):154-165.
    • Dalvi, B. B., Kshirsagar, M., and Sudarshan, S. (2008). Keyword search on external memory data graphs. PVLDB, 1(1):1189-1204.
    • Ding, B., Yu, J. X., Wang, S., Qin, L., Zhang, X., and Lin, X. (2007). Finding top-k min-cost connected trees in databases. In ICDE, pages 836-845.
    • Golenberg, K., Kimelfeld, B., and Sagiv, Y. (2008). Keyword proximity search in complex data graphs. In SIGMOD, pages 927-940.
    • Guo, L., Shao, F., Botev, C., and Shanmugasundaram, J. (2003). XRANK: Ranked keyword search over XML documents. In SIGMOD.
    • He, H., Wang, H., Yang, J., and Yu, P. S. (2007). BLINKS: Ranked keyword searches on graphs. In SIGMOD, pages 305-316.
    • Hristidis, V., Hwang, H., and Papakonstantinou, Y. (2008). Authority-based keyword search in databases. ACM Trans. Database Syst., 33(1):1-40.
  • 42.
    • Hristidis, V. and Papakonstantinou, Y. (2002). Discover: Keyword search in relational databases. In VLDB.
    • Kacholia, V., Pandit, S., Chakrabarti, S., Sudarshan, S., Desai, R., and Karambelkar, H. (2005). Bidirectional expansion for keyword search on graph databases. In VLDB, pages 505-516.
    • Kimelfeld, B. and Sagiv, Y. (2006). Finding and approximating top-k answers in keyword proximity search. In PODS, pages 173-182.
    • Ladwig, G. and Tran, T. (2011). Index structures and top-k join algorithms for native keyword search databases. In CIKM.
    • Lavrenko, V. and Croft, W. B. (2001). Relevance-based language models. In SIGIR, pages 120-127.
    • Li, G., Ooi, B. C., Feng, J., Wang, J., and Zhou, L. (2008). EASE: An effective 3-in-1 keyword search method for unstructured, semi-structured and structured data. In SIGMOD.
    • Liu, F., Yu, C., Meng, W., and Chowdhury, A. (2006). Effective keyword search in relational databases. In SIGMOD, pages 563-574.
    • Luo, Y., Lin, X., Wang, W., and Zhou, X. (2007). SPARK: Top-k keyword query in relational databases. In SIGMOD, pages 115-126.
    • Qin, L., Yu, J. X., and Chang, L. (2009). Keyword search in databases: The power of RDBMS. In SIGMOD, pages 681-694.
    • Sayyadian, M., LeKhac, H., Doan, A., and Gravano, L. (2007). Efficient keyword search across heterogeneous relational databases. In ICDE, pages 346-355.
    • Tran, T., Herzig, D., and Ladwig, G. (2011). SemSearchPro: Using semantics throughout the search process. In Journal of Web Semantics, 2011.
    • Tran, T., Wang, H., Rudolph, S., and Cimiano, P. (2009). Top-k exploration of query graph candidates for efficient keyword search on RDF. In ICDE.
    • Hristidis, V., Gravano, L., and Papakonstantinou, Y. (2003). Efficient IR-style keyword search over relational databases. In VLDB.