Semantic Search Tutorial     Introduction      Peter Mika|Yahoo! Research, Spain            pmika@yahoo-inc.com   Thanh Tr...
About the speakersPeter Mika  Senior Research Scientist  Head of Semantic Search group at Yahoo! Research in Barcelona  Se...
AgendaIntroduction (10 min)Semantic Web data (50 min)  The RDF data model  Publishing RDF  Crawling and indexing RDF dataQ...
Why Semantic Search? I.“We are at the beginning of search.“ (Marissa Mayer)  Solved large classes of queries, e.g. navigat...
Poorly solved information needsAmbiguous searches                                     Many of these queries would not  par...
Example: multiple interpretations
Why Semantic Search? II.The Semantic Web is now a reality  Large amounts of data published in RDF  Heterogeneous data of v...
Example: direct answers in search  Points of                   Faceted search                    Information  interest in ...
Novel search tasksAggregation of search results  e.g. price comparison across websitesAnalysis and prediction  e.g. world ...
Contextual (pervasive, ambient) search                                            Yahoo! Connected TV:                    ...
Interactive search and task completion
Document retrieval and data retrievalInformation Retrieval (IR) support the retrieval of documents(document retrieval)  Re...
Combination of document and data retrievalDocuments with metadata  Metadata may be embedded inside the document  I’m looki...
Semantic SearchTarget (combination of) document and data retrievalSemantic search is a retrieval paradigm that  Exploits t...
Semantic modelsSemantics is concerned with the meaning of the resources madeavailable for search  Various representations ...
Semantic Search – a process view                                                          Knowledge Representation        ...
Semantic Search systemsFor data / document retrieval, semantic search systems mightcombine a range of techniques, ranging ...
Example: Information WorkbenchAddressing the lifecycle ofinteracting with the Web of Data  Integration of data sources  Co...
Data Sources in the ApplicationEntire English WikipediaData from Linked Open Data  DBpedia  YAGO  …Data from Data.gov (US ...
Semantic SearchHybrid Search: Structured queries combined with keywordsacross structured and unstructured data sourcesQuer...
Search, Refinement and Navigation                                                     Keywords                            ...
Result Inspection, Analysis and Browsing
Semantic Web data
Data on the WebData on the Web is not directly accessible  Most web pages are generated from databases, but formatted for ...
Information extractionNatural Language Processing  Named entity recognition and disambiguation, sentiment analysis etc.Ext...
Semantic WebSharing data across the Web  Standard data model    RDF  A number of syntaxes (file formats)    RDF/XML, RDFa ...
Resource Description Framework (RDF)Each resource (thing, entity) is identified by a URI  Globally unique identifiers  URL...
Linking resourcesRoi’s homepage                                Friend-of-a-Friend ontology                                ...
OntologiesOntologies are the schemas for RDF graphs  Define the intended meaning of certain classes and relationships in a...
Example: schema.orgAgreement on a shared set of schemas for common types of webcontent  Bing, Google, and Yahoo! as initia...
Documentation and OWL ontology
Publishing RDFInterlinked RDF documents (Linked Data)  Each document describes a single resource with URIs pointing to  re...
Linked DataGrass-roots community effort to (re)publish open datasets in RDF  Centered around the Dbpedia project  Director...
Metadata in HTML anno 1995                      What does this term mean?<HTML><HEAD profile="http://dublincore.org/docume...
Microformats (anno 2003)  Mark-up for data in HTML pages     Reuse existing HTML elements (class, rel)  Microformats exist...
RDFa and RDFa Lite (2008-2012)W3C standards for embedding RDF data in HTML documents  A set of new HTML attributes to be u...
Example: Facebook’s Open Graph ProtocolThe ‘Like’ button provides publishers with a way to promote theircontent on Faceboo...
Example: Facebook’s Open Graph Protocol RDF vocabulary to be used in conjunction with RDFa    Simplify the work of develop...
Microdata HTML5    Currently under standardization at the W3C    Originally part of the HTML5 spec, but now a separate doc...
Current state of metadata on the Web31% of webpages, 5% of domains contain some metadata  Analysis of the Bing Crawl (US c...
Exponential growth in RDFa data                    Another five-fold increase between                    October 2010 and ...
Crawling and indexing
Crawling the Semantic WebLinked Data  Similar to HTML crawling, but the the crawler needs to parse  RDF/XML (and others) t...
Data fusionOntology matching  Widely studied in Semantic Web research, see e.g. list of publications at  ontologymatching....
Data quality assessment and curationHeterogeneity, quality of data is an even larger issue  Quality ranges from well-curat...
IndexingSearch requires matching and ranking  Matching selects a subset of the elements to be scoredThe goal of indexing i...
IR-style indexingIndex data as text  Create virtual documents from data  One virtual document per subgraph, resource or tr...
Horizontal index structureTwo fields (indices): one for terms, one for propertiesFor each term, store the property on the ...
Vertical index structureOne field (index) per propertyPositions are not required  But useful for phrase queriesQuery engin...
Distributed indexingMapReduce is ideal for building inverted indices  Map creates (term, {doc1}) pairs  Reduce collects al...
Query Processing / Matching
StructureTaxonomy of search approachesQuery processing / matching techniques for Semantic SearchTypes of semantic dataForm...
Taxonomy of search approaches (1)The search problem  A collection of resources, called data  Information needs expressed a...
Taxonomy of search approaches (2)                              Query engines (of databases)                               ...
Query processing for Semantic Search (1)Resources represented by semantic data ranging from  Structured data with well def...
Query processing for Semantic Search (2)                            NL                    Form- / facet-   Structured Quer...
Query processing for Semantic Search (3)                                   Textual Data                                   ...
Types of data models (1)Textual  Bag-of-words  Represent documents, text in structured data,…, real-world  objects (captur...
Types of data models (2)TextualStructured  Resource Description Framework (RDF)  Represent real-world objects, services, a...
Types of data models (3)TextualStructuredHybrid  RDF data embedded in text (RDFa)
Types of data models – RDFa (1)…<div about="/alice/posts/trouble_with_bob">   <h2 property="dc:title">The trouble with Bob...
Types of semantic data – RDFa (2)Bob is a good friend of mine. We went to        contentthe same university, and also shar...
Types of semantic data - conclusionSemantic data in general can be conceived as a graphwith text and structured data items...
Formalisms for querying semantic data (1)Example information need“Information about a friend of Alice, who shared an apart...
Formalisms for querying semantic data (2)Example information need“Information about a friend of Alice, who shared an apart...
Formalisms for querying semantic data (3)Example information need“Information about a friend of Alice, who shared an apart...
Formalisms for querying semantic data (4)Fully-structuredUnstructuredHybrid: content and structure constraints            ...
Formalisms for querying semantic data (5)Fully-structuredUnstructuredHybrid: content and structure constraints            ...
Formalisms for querying semantic data - conclusionSemantic search queries can be conceived as graph    patterns with nodes...
Processing hybrid graph patterns (1) Example information need “Information about a friend of Alice, who shared an apartmen...
Processing hybrid graph patterns (2)Matching hybrid graph patterns against data
Matching keyword query against text • Retrieve documents    • Inverted list (inverted index)            keyword       {<do...
Matching structured query against structured data    • Retrieve data for triple patterns       • Index on tables       • M...
Matching keyword query against structured data     • Retrieve keyword elements        • Using inverted index            ke...
Matching structured query against text• Based on offline IE (offline see Peter’s slides)• Based on online IE, i.e., “retri...
Matching structured query against text• Index   • Inverted index for document retrieval and pattern matching   • Join inde...
Query processing – main tasks                Retrieval                  Documents , data elements, triples, paths, Query  ...
Query processing – more tasks               More complex queries: disjunction,               aggregation, grouping, analyt...
Query processing on the Web - research challenges and opportunitiesLarge amount of semanticdata                           ...
Ranking
StructureProblem definitionTypes of ambiguitiesRanking paradigmsModel construction  Content-based  Structure-based
Ranking – problem definition  Query                      • Ambiguities arise when representation is                       ...
Content ambiguityapartment shared Berlin Alice                       ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT”    ...
Structure ambiguityapartment shared Berlin Alice                       ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT”  ...
AmbiguityRecall: query processing is matching at the level of syntax andsemanticsAmbiguities arise when data or query allo...
Features: What to use to deal with ambiguities?What is meant by “Berlin”? What is the connection between“Berlin” and “Alic...
Ranking paradigmsExplicit relevance model  Foundation: probability ranking principle  Ranking results by the posterior pro...
Ranking paradigms   No explicit notion of relevance: similarity between the query and   the document model     Vector spac...
Model constructionHow to obtain  Relevance models?  Weights for query / document terms?  Language models for document / qu...
Content-based model construction   Document statistics, e.g.            • An object is more likely about     Term frequenc...
Structure-based model construction Consider structure of objects during content-based modeling, i.e., to obtain structured...
Structure-based model construction   PageRank      Link analysis algorithm      Measuring relative importance of nodes    ...
Taxonomy of ranking approachesExplicitly vs. non-explicitly relevance-basedContent-based rankingStructure-based rankingCon...
Result Presentation
Search interfaceInput and output functionality  helping the user to formulate complex queries  presenting the results in a...
Query formulation“Snap-to-grid”: suggest the most likely interpretation of the query  Given the ontology or a summary of t...
Enhanced results/Rich SnippetsUse mark-up from the webpage to generate search snippets  Originally invented at Yahoo! (Sea...
Other result presentation tasksSelect the most relevant resources within an RDF document  Penin et al. Snippet Generation ...
Aggregated search: facets
Aggregated search: Sig.ma
Related entities                   Related actors and                   movies
Adaptive presentation:housing search
Evaluation
Semantic Search challenge (2010/2011)Two tasks  Entity Search    Queries where the user is looking for a single real world...
Evaluation form
Catching the bad guysPayment can be rejected for workers who try to game the system  An explanation is commonly expected, ...
Other evaluationsTREC Entity Track  Related Entity Finding    Entities related to a given entity through a particular rela...
Resources
ResourcesBooks  Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information  Retrieval. ACM Press. 2011Survey papers...
Upcoming SlideShare
Loading in …5
×

Semantic Search Tutorial at SemTech 2012

1,082 views
1,036 views

Published on

0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
1,082
On SlideShare
0
From Embeds
0
Number of Embeds
3
Actions
Shares
0
Downloads
50
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Semantic Search Tutorial at SemTech 2012

  1. 1. Semantic Search Tutorial Introduction Peter Mika|Yahoo! Research, Spain pmika@yahoo-inc.com Thanh Tran | Institute AIFB, KIT, Germany Tran@aifb.uni-karlsruhe.de
  2. 2. About the speakersPeter Mika Senior Research Scientist Head of Semantic Search group at Yahoo! Research in Barcelona Semantic Search, Web Object Retrieval, Natural Language ProcessingTran Duc Thanh Assistant Professor at AIFB, Karlsruhe Institute ofTechnology Head of Semantic Search group Semantic Search, Semantic / Linked Data Management
  3. 3. AgendaIntroduction (10 min)Semantic Web data (50 min) The RDF data model Publishing RDF Crawling and indexing RDF dataQuery processing (40 min)Ranking (30 min)Result presentation (15 min)Semantic Search evaluation (15 min)Questions (5 min)
  4. 4. Why Semantic Search? I.“We are at the beginning of search.“ (Marissa Mayer) Solved large classes of queries, e.g. navigational Heavy investment in computational power Remaining queries are hard, not solvable by brute force, and require a deep understanding of the world and human cognitionBackground knowledge and metadata can help to address poorlysolved queries
  5. 5. Poorly solved information needsAmbiguous searches Many of these queries would not paris hilton be asked by users, who learnedLong tail queries over time what search technology can and can not do. george bush (and I mean the beer brewer in Arizona)Multimedia search paris hilton sexyImprecise or overly precise searches jim hendler pictures of strong adventures peoplePrecise searches for descriptions countries in africa 32 year old computer scientist living in barcelona reliable digital camera under 300 dollars
  6. 6. Example: multiple interpretations
  7. 7. Why Semantic Search? II.The Semantic Web is now a reality Large amounts of data published in RDF Heterogeneous data of varying quality Users who are not skilled in writing complex queries (e.g. SPARQL) and may not be experts in the domainSearching data instead or in addition to searching documents Direct answers Novel search tasks
  8. 8. Example: direct answers in search Points of Faceted search Information interest in from the for Shopping Information box with Vienna, Austria results Knowledge content from and links Graph to Yahoo! Travel Since Aug, 2010, ‘regular’ search results are ‘Powered by Bing’
  9. 9. Novel search tasksAggregation of search results e.g. price comparison across websitesAnalysis and prediction e.g. world temperature by 2020Semantic profiling recommendations based on particular interestsSemantic log analysis understanding user behavior in terms of objectsSupport for complex tasks e.g. booking a vacation using a combination of services
  10. 10. Contextual (pervasive, ambient) search Yahoo! Connected TV: Widget engine embedded into the TV Yahoo! IntoNow: recognize audio and show related content
  11. 11. Interactive search and task completion
  12. 12. Document retrieval and data retrievalInformation Retrieval (IR) support the retrieval of documents(document retrieval) Representation based on lightweight syntax-centric models Work well for topical search Not so well for more complex information needs Web scaleDatabase (DB) and Knowledge-based Systems (KB) deliver more preciseanswers (data retrieval) More expressive models Allow for complex queries Retrieve concrete answers that precisely match queries Not just matching and filtering, but also joins Limitations in scalability
  13. 13. Combination of document and data retrievalDocuments with metadata Metadata may be embedded inside the document I’m looking for documents that mention countries in Africa.Data retrieval Structured data, but searchable text fields I’m looking for directors, who have directed movies where the synopsis mentions dinosaurs.
  14. 14. Semantic SearchTarget (combination of) document and data retrievalSemantic search is a retrieval paradigm that Exploits the structure/semantics of the data or explicit background knowledge to understand user intent and the meaning of content Incorporates the intent of the query and the meaning of content into the search process (semantic models)Wide range of semantic search systems Employ different semantic models, possibly at different steps of the search process and in order to support different tasks
  15. 15. Semantic modelsSemantics is concerned with the meaning of the resources madeavailable for search Various representations of meaningLinguistic models: models of relationships among words Taxonomies, thesauri, dictionaries of entity names Inference along linguistic relations, e.g. broader/narrower termsConceptual models: models of relationships among objects Ontologies capture entities in the world and their relationships Inference along domain-specific relationsWe will focus on conceptual models in this tutorial In particular, the RDF/OWL conceptual model for representing classes in a domain, and describing their instances
  16. 16. Semantic Search – a process view Knowledge Representation Semantic Models • Keywords Resources • Forms Query • NLConstruction • Formal language •IR-style matching & ranking •DB-style precise matching Query •KB-style matching & inferences Processing •Query visualization •Document and data presentation Documents Result •SummarizationPresentation •Implicit feedback •Explicit feedback Query •IncentivesRefinement Document Representation
  17. 17. Semantic Search systemsFor data / document retrieval, semantic search systems mightcombine a range of techniques, ranging from statistics-based IRmethods for ranking, database methods for efficientindexing and query processing, up to complex reasoningtechniques for making inferences!
  18. 18. Example: Information WorkbenchAddressing the lifecycle ofinteracting with the Web of Data Integration of data sources Content generation by the end user Search and Exploration Visualization Publishing User- generated WikipediaIntegrated management ofheterogeneous data sources DBpedia, Yago Structured and unstructured Earthquake (Data.gov) Published and user-generated Static and dynamic Open domain Structured Dynamic
  19. 19. Data Sources in the ApplicationEntire English WikipediaData from Linked Open Data DBpedia YAGO …Data from Data.gov (US Government) E.g. live data about earthquakesMany more
  20. 20. Semantic SearchHybrid Search: Structured queries combined with keywordsacross structured and unstructured data sourcesQuery interpretation: Translation of keywords into hybridqueriesKeyword search/query interpretation combined withfaceted search: iterative refinement process based on keywordsand operations on facets
  21. 21. Search, Refinement and Navigation Keywords Query Translations Term Completions Vorlesung Knowledge Discovery - Institut AIFB Facets21
  22. 22. Result Inspection, Analysis and Browsing
  23. 23. Semantic Web data
  24. 24. Data on the WebData on the Web is not directly accessible Most web pages are generated from databases, but formatted for human consumption APIs offer limited views over dataTwo solutions Extraction using Information Extraction (IE) techniques Out of scope for this tutorial Relying on publishers to expose structured data using standard Semantic Web formats
  25. 25. Information extractionNatural Language Processing Named entity recognition and disambiguation, sentiment analysis etc.Extraction of information about entities Suchanek et al. YAGO: A Core of Semantic Knowledge UnifyingWordNet and Wikipedia,WWW, 2007. Wu and Weld. Autonomously Semantifying Wikipedia, CIKM 2007.Extraction from HTML tables Cafarella et al. WebTables: Exploring the Power of Tables on theWeb.VLDB 2008Wrapper induction Kushmerick et al.Wrapper Induction for Information ExtractionText extraction. IJCAI 2007Filling web forms automatically (form-filling) Madhavan et al. Googles Deep-Web Crawl.VLDB 2008
  26. 26. Semantic WebSharing data across the Web Standard data model RDF A number of syntaxes (file formats) RDF/XML, RDFa Powerful, logic-based languages for schemas OWL, RIF Query languages and protocols HTTP, SPARQL
  27. 27. Resource Description Framework (RDF)Each resource (thing, entity) is identified by a URI Globally unique identifiers URLs a subset of URIs Often abbreviated using namespaces e.g. example:roi = http://example.org/roiRDF represents knowledge as a set of triples Each triple is a single fact about the entity (an attribute or a relationship)A set of triples forms an RDF graph RDF document type foaf:Person example:roi name “Roi Blanco”
  28. 28. Linking resourcesRoi’s homepage Friend-of-a-Friend ontology type example:roi foaf:Person name “Roi Blanco” knows sameAsYahoo!’s website type worksWith #roi2 #peter email “pmika@yahoo-inc.com”
  29. 29. OntologiesOntologies are the schemas for RDF graphs Define the intended meaning of certain classes and relationships in a domain e.g. the FOAF ontology defines the foaf:Person class and relationships such as foaf:knowsOntologies are published in standard languages such as OWL Language defines the constructs for modeling and their meaning e.g. subClassOf, sameAs Tools for editing ontologies such as Protégé, TopBraid ComposerReusing and extending well-known ontologies helps to interpretunknown data e.g. if X is subClassOf foaf:Person, and A is of type X, then we know that A is a person
  30. 30. Example: schema.orgAgreement on a shared set of schemas for common types of webcontent Bing, Google, and Yahoo! as initial supporters Similar in intent to sitemaps.org (2006) Use a single format to communicate the same information to all three search enginesSupport for microdataschema.org covers areas of interest to all search engines Business listings (local), creative works (video), recipes, reviews User defined extensionsEach search engine continues to develop its products
  31. 31. Documentation and OWL ontology
  32. 32. Publishing RDFInterlinked RDF documents (Linked Data) Each document describes a single resource with URIs pointing to related resources Common RDF file formats are RDF/XML and Turtle Mostly implemented as a wrapper around a database or Web serviceEmbedding RDF inside HTML RDFa, microdataSPARQL endpoints Triple stores are databases for managing RDF data SPARQL is a standard protocol and query language for accessing triple stores using HTTP
  33. 33. Linked DataGrass-roots community effort to (re)publish open datasets in RDF Centered around the Dbpedia project Directory of datasets at linkeddata.org, thedatahub.org
  34. 34. Metadata in HTML anno 1995 What does this term mean?<HTML><HEAD profile="http://dublincore.org/documents/dcq- html/"><META name="DC.author" content="Peter Mika"><LINK rel="DC.rights copyright" href="http://www.example.org/rights.html" /><LINK rel="meta" type="application/rdf+xml" title="FOAF" href= "http://www.cs.vu.nl/~pmika/foaf.rdf"></HEAD>…</HTML> How is this data related?
  35. 35. Microformats (anno 2003) Mark-up for data in HTML pages Reuse existing HTML elements (class, rel) Microformats exist for a limited set of objects Persons, organizations, reviews, events etc., see microformats.org No formal schemas Limited reuse, extensibility of schemas Unclear which combinations are allowed No identifiers for entities No interlinking between entities<div class="vcard"> <a class="email fn" href="mailto:jfriday@host.com">Joe Friday</a> <div class="tel">+1-919-555-7878</div> <div class="title">Area Administrator, Assistant</div></div>
  36. 36. RDFa and RDFa Lite (2008-2012)W3C standards for embedding RDF data in HTML documents A set of new HTML attributes to be used in head or body A specification of how to extract the data from these attributesRDFa is just a syntax, you have to choose a vocabulary separatelyRDFa family RDFa 1.0 Recommendation since October, 2008 RDFa 1.1 is a small update on RDFa to make it easier to use RDFa Primer (Working Draft, May 8, 2012) RDFa 1.1 Lite is a subset of RDFa 1.1 with the most commonly used features Proposed Recommendation (May 8, 2012)
  37. 37. Example: Facebook’s Open Graph ProtocolThe ‘Like’ button provides publishers with a way to promote theircontent on Facebook and build communities Shows up in profiles and news feed Site owners can later reach users who have liked an object Facebook Graph API allows 3rd party developers to access the dataOpen Graph Protocol is an RDFa-based format that allows todescribe the object that the user ‘Likes’
  38. 38. Example: Facebook’s Open Graph Protocol RDF vocabulary to be used in conjunction with RDFa Simplify the work of developers by restricting the freedom in RDFa Activities, Businesses, Groups, Organizations, People, Places, Products and Entertainment Only HTML <head> accepted<html xmlns:og="http://opengraphprotocol.org/schema/"><head> <title>The Rock (1996)</title> <meta property="og:title" content="The Rock" /> <meta property="og:type" content="movie" /> <meta property="og:url" content="http://www.imdb.com/title/tt0117500/" /> <meta property="og:image" content="http://ia.media- imdb.com/images/rock.jpg" /> …</head> ...
  39. 39. Microdata HTML5 Currently under standardization at the W3C Originally part of the HTML5 spec, but now a separate document Comparable to RDFa 1.1 Lite Key extensibility features (such as multiple types) missing HTML5 also has a number of “semantic” elements such as <time>, <video>, <article>…<div itemscope itemid=“http://www.yahoo.com/resource/person”><p>My name is <span itemprop="name">Neil</span>.</p><p>My band is called <span itemprop="band">Four Parts Water</span>. I was born on <time itemprop="birthday" datetime="2009-05-10"> May 10th 2009</time>. <img itemprop="image" src=”me.png" alt=”me”></p></div>
  40. 40. Current state of metadata on the Web31% of webpages, 5% of domains contain some metadata Analysis of the Bing Crawl (US crawl, January, 2012) RDFa is most common format By URL: 25% RDFa, 7% microdata, 9% microformat By eTLD (PLD): 4% RDFa, 0.3% microdata, 5.4% microformat Adoption is stronger among large publishers Especially for RDFa and microdataSee also P. Mika, T. Potter. Metadata Statistics for a Large Web Corpus, LDOW 2012 H.Mühleisen, C.Bizer.Web Data Commons - Extracting Structured Data from Two Large Web Corpora, LDOW 2012
  41. 41. Exponential growth in RDFa data Another five-fold increase between October 2010 and January, 2012 Five-fold increase between March, 2009 and October, 2010Percentage of URLs with embedded metadata in various formats
  42. 42. Crawling and indexing
  43. 43. Crawling the Semantic WebLinked Data Similar to HTML crawling, but the the crawler needs to parse RDF/XML (and others) to extract URIs to be crawled Semantic Sitemap/VOID descriptionsRDFa Same as HTML crawling, but data is extracted after crawling Mika et al. Investigating the Semantic Gap through Query Log Analysis, ISWC 2010.SPARQL endpoints Endpoints are not linked, need to be discovered by other means Semantic Sitemap/VOID descriptions
  44. 44. Data fusionOntology matching Widely studied in Semantic Web research, see e.g. list of publications at ontologymatching.org Unfortunately, not much of it is applicable in a Web context due to the quality of ontologiesEntity resolution Logic-based approaches in the Semantic Web Studied as record linkage in the database literature Machine learning based approaches, focusing on attributes Graph-based approaches, see e.g. the work of Lisa Getoor are applicable to RDF data Improvements over only attribute based matchingBlending Merging objects that represent the same real world entity and reconciling information from multiple sources
  45. 45. Data quality assessment and curationHeterogeneity, quality of data is an even larger issue Quality ranges from well-curated data sets (e.g. Freebase) to microformats In the worst of cases, the data becomes a graph of words Short amounts of text: prone to mistakes in data entry or extraction Example: mistake in a phone number or state codeQuality assessment and data curation Quality varies from data created by experts to user-generated content Automated data validation Against known-good data or using triangulation Validation against the ontology or using probabilistic models Data validation by trained professionals or crowdsourcing Sampling data for evaluation Curation based on user feedback
  46. 46. IndexingSearch requires matching and ranking Matching selects a subset of the elements to be scoredThe goal of indexing is to speed up matching Retrieval needs to be performed in milliseconds Without an index, retrieval would require streaming through the collectionThe type of index depends on the query model to support DB-style indexing IR-style indexing
  47. 47. IR-style indexingIndex data as text Create virtual documents from data One virtual document per subgraph, resource or triple typically: resourceKey differences to Text Retrieval RDF data is structured Minimally, queries on property values are required
  48. 48. Horizontal index structureTwo fields (indices): one for terms, one for propertiesFor each term, store the property on the same position in the propertyindex Positions are required even without phrase queriesQuery engine needs to support the alignment operator✓ Dictionary is number of unique terms + number of propertiesOccurrences is number of tokens * 2
  49. 49. Vertical index structureOne field (index) per propertyPositions are not required But useful for phrase queriesQuery engine needs to support fieldsDictionary is number of unique termsOccurrences is number of tokens✗ Number of fields is a problem for merging, query performance
  50. 50. Distributed indexingMapReduce is ideal for building inverted indices Map creates (term, {doc1}) pairs Reduce collects all docs for the same term: (term, {doc1, doc2…} Sub-indices are merged separately Term-partitioned indicesPeter Mika. Distributed Indexing for Semantic Search, SemSearch 2010.
  51. 51. Query Processing / Matching
  52. 52. StructureTaxonomy of search approachesQuery processing / matching techniques for Semantic SearchTypes of semantic dataFormalisms for querying semantic dataApproaches General task: hybrid graph pattern matching Matching keyword query against text Matching structured query against structured data Matching keyword query against structured data Matching structured query against text (a hybrid case)Main tasks, challenges and opportunities
  53. 53. Taxonomy of search approaches (1)The search problem A collection of resources, called data Information needs expressed as queries Search is the task of efficiently computing results from data that are relevant to queriesDocument data retrieval vs. structured data retrieval Differences in query and data representation and matching Efficiently retrieve structured data that exactly match formal information needs expressed as structured queries Effectively rank textual results that match ambiguous NL / keyword queries to a certain degree (notions of relevance)Semantic search: ranked retrieval of document and structureddata (given ambiguous queries / data)
  54. 54. Taxonomy of search approaches (2) Query engines (of databases) Exact Complete Query Sound • Approximate Matching • Not complete • Not sound • Ranked Data • Best effort • Top-k Search engines (stand-alone/database extensons)Query processing mainly focuses on efficiency of matching whereas rankingdeals with degree of matching (relevance)!
  55. 55. Query processing for Semantic Search (1)Resources represented by semantic data ranging from Structured data with well defined schemas Semi-structured data with incomplete or no schemas Data that largely comprise text Hybrid / embedded dataInformation needs of varying complexity, captured using differentformalisms and querying paradigms Natural language texts and keywords Form-based inputs Formal structured queries(Search is end-user oriented paradigm, requires “natural”, intuitive querying interfaces)Semantic search: efficiently computing results (query processing)from data that are relevant to queries (ranking)
  56. 56. Query processing for Semantic Search (2) NL Form- / facet- Structured Queries Keywords Questions based Inputs (SPARQL)Ambiquities Query Matching Data RDF data Semi-Structured Structured OWL ontologies with embedded in text RDF data RDF data rich, formal semantics (RDFa)Ambiquities: confidence degree, truth/trust value…
  57. 57. Query processing for Semantic Search (3) Textual Data Structured query Keyword query on on textual data , textual data, e.g. e.g. querying Web search extension for systems Search target different Semantic search systems? group of users, information needs,Unstructured and types of data. Query processing StructuredQuery Query for semantic searchStructured query is hybrid Keyword query on combination of techniques! on structured data structured data, e.g. standard e.g. search querying interface extensions for for databases / databases RDF stores Structured Data
  58. 58. Types of data models (1)Textual Bag-of-words Represent documents, text in structured data,…, real-world objects (captured as structured data) Lacks “structure” in text, e.g. linguistic structure, hyperlinks, (positional information) Structure in structured data representation term (statistics) combination Cloud In combination with Cloud Computing technologies, Technologies promising solutions for the management of `big data solutions management have emerged. Existing `big data industry solutions are able to support complex queries and industry solutions analytics tasks with terabytes of data. For example, using a support complex Greenplum. ……
  59. 59. Types of data models (2)TextualStructured Resource Description Framework (RDF) Represent real-world objects, services, applications, …. documents Resource attribute values and relationships between resources Schema creator Picture Person Bob
  60. 60. Types of data models (3)TextualStructuredHybrid RDF data embedded in text (RDFa)
  61. 61. Types of data models – RDFa (1)…<div about="/alice/posts/trouble_with_bob"> <h2 property="dc:title">The trouble with Bob</h2> <h3 property="dc:creator">Alice</h3> Bob is a good friend of mine. We went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is thathe takes much better photos than I do: <div about="http://example.com/bob/photos/sunset.jpg"> <img src="http://example.com/bob/photos/sunset.jpg" /> <span property="dc:title">Beautiful Sunset</span> by <span property="dc:creator">Bob</span>. </div></div>… adopted from : http://www.w3.org/TR/xhtml-rdfa-primer/
  62. 62. Types of semantic data – RDFa (2)Bob is a good friend of mine. We went to contentthe same university, and also shared anapartment in Berlin in 2008. The troublewith Bob is that he takes much betterphotos than I do: content adopted from : http://www.w3.org/TR/xhtml-rdfa-primer/
  63. 63. Types of semantic data - conclusionSemantic data in general can be conceived as a graphwith text and structured data items as nodes, and edges represent different types of relationships including explicit semantic relationships and vaguely specified ones such as hyperlinks!
  64. 64. Formalisms for querying semantic data (1)Example information need“Information about a friend of Alice, who shared an apartment withher in Berlin and knows someone working at KIT.” Unstructured queries Fully-structured queries Hybrid queries: unstructured + structured
  65. 65. Formalisms for querying semantic data (2)Example information need“Information about a friend of Alice, who shared an apartmentwith her in Berlin and knows someone working at KIT.” Unstructured NL Keywords shared apartment Berlin Alice
  66. 66. Formalisms for querying semantic data (3)Example information need“Information about a friend of Alice, who shared an apartmentwith her in Berlin and knows someone working at KIT.” Unstructured Fully-structured SPARQL: BGP, filter, optional, union, select, construct, ask, describe PREFIX ns: <http://example.org/ns#> SELECT ?x WHERE { ?x ns:knows ? y. ?y ns:name “Alice”. ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT” }
  67. 67. Formalisms for querying semantic data (4)Fully-structuredUnstructuredHybrid: content and structure constraints “shared apartment Berlin Alice” ?x ns:knows ? y. ?y ns:name “Alice”. ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT”
  68. 68. Formalisms for querying semantic data (5)Fully-structuredUnstructuredHybrid: content and structure constraints “shared apartment Berlin Alice” ?x ns:knows ? y. ?y ns:name “Alice”. ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT”
  69. 69. Formalisms for querying semantic data - conclusionSemantic search queries can be conceived as graph patterns with nodes referring to text and structured data items, and edges referring to relationships between these items!
  70. 70. Processing hybrid graph patterns (1) Example information need “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT.”apartment shared Berlin Alice ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT” ?y ns:name “Alice”. ?x ns:knows ? y trouble with bob FluidOps 34 Peter sunset.jpg Bob is a good friend of Beautiful mine. We went to the Sunset same university, and Germany Semantic Alice Search also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better Germany 2009 Bob photos than I do: Thanh KIT
  71. 71. Processing hybrid graph patterns (2)Matching hybrid graph patterns against data
  72. 72. Matching keyword query against text • Retrieve documents • Inverted list (inverted index) keyword {<doc1, pos, score, ...>, <doc2, pos, score, ...>, ...} • AND-semantics: top-k joinshared Berlin Alice shared Berlin Alice D1 D1 D1 shared = berlin = alice shared
  73. 73. Matching structured query against structured data • Retrieve data for triple patterns • Index on tables • Multiple “redundant” indexes to cover different access patterns • Join (conjunction of triples) • Blocking, e.g. linear merge join (required sorted input) • Non-blocking, e.g. symmetric hash-join • Materialized join indexes ?x ns:knows ?y. ?x ns:knows ?z.Per1 ns:works ?v ?v ns:name “KIT” SP-index PO-index ?z ns: works ?v. ?v ns:name “KIT” = = = Per1 ns:works Ins1 Ins1 ns:name KITPer1 ns:works Ins1 Ins1 ns:name KIT
  74. 74. Matching keyword query against structured data • Retrieve keyword elements • Using inverted index keyword {<el1, score, ...>, <el2, score, ...>,…} • Exploration / “Join” • Data indexes for triple lookup • Materialized index (paths up to graphs) • Top-k Steiner tree search, top-k subgraph exploration Alice Bob KIT Alice Bob KIT ↔ ↔ = =Alice ns:knows Bob Inst1 ns:name KIT Bob ns:works Inst1
  75. 75. Matching structured query against text• Based on offline IE (offline see Peter’s slides)• Based on online IE, i.e., “retrieve “ is as follows • Derive keywords to retrieve relevant documents • On-the-fly information extraction, i.e., phrase pattern matching “X name Y” • Retrieve extracted data for structured part • Retrieve documents for derived text patterns, e.g. sequence, windows, reg. exp. ?x ns:knows ?y. ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT” knows name KIT
  76. 76. Matching structured query against text• Index • Inverted index for document retrieval and pattern matching • Join index inverted index for storing materialized joins between keywords • Neighborhood indexes for phrase patterns ?x ns:knows ?y. ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT” KIT name knows name KIT
  77. 77. Query processing – main tasks Retrieval Documents , data elements, triples, paths, Query graphs Inverted index,…, but also other (B+ tree) Index documents, triples, materialized paths Matching Join Different join implementations, efficiency depends on availability of indexes Non-blocking join good for early result Data reporting and for “unpredictable” Linked Data / data streams scenario
  78. 78. Query processing – more tasks More complex queries: disjunction, aggregation, grouping, analytics… Query Join order optimization Approximate Approximate the search space Approximate the results (matching, join) Matching Parallelization Top-k Use only some entries in the input streams to produce k results Data Multiple sources Federation, routing On-the-fly mapping, similarity join Hybrid Join text and data
  79. 79. Query processing on the Web - research challenges and opportunitiesLarge amount of semanticdata • Optimization, parallelizationData inconsistent, • Approximationredundant, and low quality • Hybrid querying and dataLarge amount of data managementembedded in text • Federation, routingLarge amount of sources • Online schema mappingsLarge amount of links • Similarity joinbetween sources
  80. 80. Ranking
  81. 81. StructureProblem definitionTypes of ambiguitiesRanking paradigmsModel construction Content-based Structure-based
  82. 82. Ranking – problem definition Query • Ambiguities arise when representation is incomplete / imprecise • Ambiguities at the level of Matching • elements (content ambiguity) • structure between elements (structure ambiguity) DataDue to ambiguities in the representation of the information needs andthe underlying resources, the results cannot be guaranteed to exactlymatch the query. Ranking is the problem of determining the degreeof matching using some notions of relevance.
  83. 83. Content ambiguityapartment shared Berlin Alice ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT” ?y ns:name “Alice”. ?x ns:knows ? y trouble with bob FluidOps 34 Peter sunset.jpg Bob is a good friend of Beautiful mine. We went to the Sunset same university, and Germany Semantic Alice Search also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better Germany 2009 Bob photos than I do: Thanh KIT What is meant by “Berlin” in the query? What is meant by “KIT” in the query? What is meant by “Berlin” in the data? What is meant by “KIT” in the data? A city with the name Berlin? a person? A research group? a university? a location?
  84. 84. Structure ambiguityapartment shared Berlin Alice ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT” ?y ns:name “Alice”. ?x ns:knows ? y trouble with bob FluidOps 34 Peter sunset.jpg Bob is a good friend of Beautiful mine. We went to the Sunset same university, and Germany Semantic Alice Search also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better Germany 2009 Bob photos than I do: Thanh KIT What is the connection between “Berlin” and What is meant by “works”? “Alice”? Works at? employed? Friend? Co-worker?
  85. 85. AmbiguityRecall: query processing is matching at the level of syntax andsemanticsAmbiguities arise when data or query allow for multipleinterpretations, i.e. multiple matches Syntactic, e.g. works vs. works at Semantic, e.g. works vs. employ“Aboutness”, i.e., contain some elements which represent thecorrect interpretation Ambiguities arise when matching elements of different granularities Does i contains the interpretation for j, given some part(s) of i (syntactically/semantically) match j E.g. Berlin vs. “…we went to the same university, and also, we shared an apartment in Berlin in 2008…”Strictly speaking, ranking is performed after syntactic / semantic matching isdone!
  86. 86. Features: What to use to deal with ambiguities?What is meant by “Berlin”? What is the connection between“Berlin” and “Alice”? Content features Frequencies of terms: d more likely to be “about” a query term k when d more often, mentions k (probabilistic IR) Co-occurrences: terms K that often co-occur form a contextual interpretation, i.e., topics (cluster hypothesis) Structure features Consider relevance at level of fields Linked-based popularity
  87. 87. Ranking paradigmsExplicit relevance model Foundation: probability ranking principle Ranking results by the posterior probability (odds) of being observed in the relevant class: P(w|R) varies in different approaches, e.g., binary independence model, 2-poisson model, relevance model P( D | R ) = ∏ P ( w | R)∏ (1 − P( w | N )) P(D|R) P(D/N) w∈D w∉D k P( w | R) ≈ P ( w | q1 ,..., qk ) = ∑ P(m) P(w | m)∑ P(q k | m) m∈M i =1
  88. 88. Ranking paradigms No explicit notion of relevance: similarity between the query and the document model Vector space model (cosine similarity) Language models (KL divergence) Sim ( q , d ) = Cos ( ( w1, d ,..., w t , d ), ( w1, q ,..., w k , q )) P (t | θ q )Sim ( q , d ) = − KL (θ q ||θ d ) = − ∑ P (t | θ q ) log( t ∈V P (t | θ d )
  89. 89. Model constructionHow to obtain Relevance models? Weights for query / document terms? Language models for document / queries?
  90. 90. Content-based model construction Document statistics, e.g. • An object is more likely about Term frequency “Berlin”? Document length • When it contains a Collection statistics, e.g. relatively high number of mentions of the term Inverse document frequency “Berlin” Background language models • When the number of mentions of this term in the overall collection is relatively tf wt ,d = ∗ idf low |d | tfP (t | θ d ) = λ + (1 − λ ) P ( t | C ) |d |
  91. 91. Structure-based model construction Consider structure of objects during content-based modeling, i.e., to obtain structured content-based model Content-based model for structured objects, documents and for general tuples P (t | θ d ) = ∑α f P (t | θ f ) f ∈ Fd• An object is more likely about “Berlin”? • When one of its (important) fields contains a relatively high number of mentions of the term “Berlin”
  92. 92. Structure-based model construction PageRank Link analysis algorithm Measuring relative importance of nodes Link counts as a vote of support The PageRank of a node recursively depends on the number and PageRank of all nodes that link to it (incoming links) ObjectRank Types and semantics of links vary in structured data setting Authority transfer schema graph specifies connection strengths Recursively compute authority transfer data graph• An object about “Berlin” is more important than one another? •When a relatively large number of objects are linked to it
  93. 93. Taxonomy of ranking approachesExplicitly vs. non-explicitly relevance-basedContent-based rankingStructure-based rankingContent- and-structure-based ranking
  94. 94. Result Presentation
  95. 95. Search interfaceInput and output functionality helping the user to formulate complex queries presenting the results in an intelligent mannerSemantic Search brings improvements in Query formulation Snippet generation Suggesting related entities Adaptive and interactive presentation Presentation adapts to the kind of query and results presented Object results can be actionable, e.g. buy this product Aggregated search Grouping similar items, summarizing results in various ways Filtering (facets), possibly across different dimensions Task completion Help the user to fulfill the task by placing the query in a task context
  96. 96. Query formulation“Snap-to-grid”: suggest the most likely interpretation of the query Given the ontology or a summary of the data While the user is typing or after issuing the query Example: Freebase suggest, TrueKnowledge
  97. 97. Enhanced results/Rich SnippetsUse mark-up from the webpage to generate search snippets Originally invented at Yahoo! (SearchMonkey)Google,Yahoo!, Bing, Yandex now consume schema.org markup Validators available from Google and Bing
  98. 98. Other result presentation tasksSelect the most relevant resources within an RDF document Penin et al. Snippet Generation for Semantic Web Search Engines, ASWC 2010For each resource, rank the properties to be displayedNatural Language Generation (NLG) Verbalize, explain results
  99. 99. Aggregated search: facets
  100. 100. Aggregated search: Sig.ma
  101. 101. Related entities Related actors and movies
  102. 102. Adaptive presentation:housing search
  103. 103. Evaluation
  104. 104. Semantic Search challenge (2010/2011)Two tasks Entity Search Queries where the user is looking for a single real world object Pound et al. Ad-hoc Object Retrieval in the Web of Data, WWW 2010. List search (new in 2011) Queries where the user is looking for a class of objectsBillion Triples Challenge 2009 datasetEvaluated using Amazon’s Mechanical Turk Halpin et al. Evaluating Ad-Hoc Object Retrieval, IWEST 2010 Blanco et al. Repeatable and Reliable Search System Evaluation using Crowd-Sourcing, SIGIR2011
  105. 105. Evaluation form
  106. 106. Catching the bad guysPayment can be rejected for workers who try to game the system An explanation is commonly expected, though cheaters rarely complainWe opted to mix control questions into the real results Gold-win cases that are known to be perfect Gold-loose cases that are known to be badMetrics Avg. and std. dev on gold-win and gold-loose results Time to completeWorker Known Real Known Good Total Time to bad N complete N Mean N Mean N Mean (sec)badguy 20 2.556 200 2.738 20 2.684 240 29.6goodguy 13 1 130 2.038 13 3 156 95whoknows 1 1 21 1.571 2 3 24 83.5
  107. 107. Other evaluationsTREC Entity Track Related Entity Finding Entities related to a given entity through a particular relationship Retrieval over documents (ClueWeb 09 collection) Example: (Homepages of) airlines that fly Boeing 747 Entity List Completion Given some elements of a list of entities, complete the listQuestion Answering over Linked Data Retrieval over specific datasets (Dbpedia and MusicBrainz) Full natural language questions of different forms Correct results defined by an equivalent SPARQL query Example: Give me all actors starring in Batman Begins.
  108. 108. Resources
  109. 109. ResourcesBooks Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. ACM Press. 2011Survey papers ThanhTran, Peter Mika. Survey of Semantic Search Approaches. Under submission, 2012.Conferences and workshops ISWC, ESWC, WWW, SIGIR, CIKM, SemTech Semantic Search workshop series Exploiting Semantic Annotations in Information Retrieval (ESAIR) Entity-oriented Search (EOS) workshop

×