SemTech 2011 Semantic Search tutorial

SemTech 2011 tutorial on Semantic Search by Peter Mika (Yahoo! Research) and Thanh Tran (AIFB Institute at KIT)


Published in: Education

  • Search is a form of content aggregation
  • - In recent years we have witnessed tremendous interest and substantial economic exploitation of search technologies, both at web and enterprise scale. In this regard, technologies for Information Retrieval (IR) can be distinguished from solutions in the field of Database (DB) and Knowledge-based Expert Systems (KB). Whereas IR applications support the retrieval of documents (document retrieval), DB and KB systems deliver more precise answers (data retrieval). The technical differences between these two paradigms can be broken down into three main dimensions, i.e. the representation of the user need (query model), the underlying resources (data model) and the matching technique. The representation of user need and resource content in current IR systems is still almost exclusively achieved by the lightweight syntax-centric models such as the predominant keyword paradigm (i.e. keyword queries matched against bag-of-words document representation). While these systems have shown to work well for topical search, i.e. retrieve documents based on a topic, they usually fail to address more complex information needs. Using more expressive models for the representation of the user need, and leveraging the structure and semantics inherent in the data, DB and KB systems allow for complex queries, and for the retrieval of concrete answers that precisely match them.
  • Semantic search can be seen as a retrieval paradigm centered on the use of semantics. When a system incorporates the semantics entailed by the query and (or) the resources into the matching process, it essentially performs semantic search.
  • Another trend resulting from this convergence of textual, structured and semantic data is the need for hybrid semantic search systems. While standard IR focuses on the retrieval of documents, the DB and KB systems are built for data retrieval. As opposed to these types of systems, hybrid search systems manage the different types of data in a holistic way, and support the retrieval of answers that are integrated units of information, possibly assembled from different types of data. In such a system, there is not only a convergence at the
  • Close to the topic of keyword search in databases, except that knowledge bases have a schema-oblivious design. Different papers assume vastly different query needs even on the same type of data.
  • Intro session (motivation, overview etc.): 10 min; 1) Representation of the Search Space (M): 45; 2) Offline Preprocessing: Crawling and Indexing (P): 60; 3) Query Processing (T) and 4) Matching (T): 90; 5) Ranking: 60; 6) Result Presentation: 45; 7) Evaluation: 45; Demo session: 30; Wrap-up session: 10
  • Approximate matching produces many results, so ranking is needed. Ranking is also needed when query processing (qp) is complete and sound, but queries and data representations are so imprecise that we have to deal with too many results.
  • Misses structural information in texts: hyperlinks, linguistic structure, positional information
  • - Real world objects
  • SELECT: returns all, or a subset of, the variables bound in a query pattern match. CONSTRUCT: returns an RDF graph constructed by substituting variables in a set of triple templates. ASK: returns a boolean indicating whether a query pattern matches or not. DESCRIBE: returns an RDF graph that describes the resources found. Graph patterns are defined recursively. A graph pattern may have zero or more optional graph patterns, and any part of a query pattern may have an optional part. In this example, there are two optional graph patterns. Section 6 introduces the ability to make portions of a query optional; Section 7 introduces the ability to express the disjunction of alternative graph patterns; and Section 8 introduces the ability to constrain portions of a query to particular source graphs. Section 8 also presents SPARQL's mechanism for defining the source graphs for a query. Basic graph patterns allow applications to make queries where the entire query pattern must match for there to be a solution. For every solution of a query containing only group graph patterns with at least one basic graph pattern, every variable is bound to an RDF Term in a solution. However, regular, complete structures cannot be assumed in all RDF graphs. It is useful to be able to have queries that allow information to be added to the solution where the information is available, but do not reject the solution because some part of the query pattern does not match. Optional matching provides this facility: if the optional part does not match, it creates no bindings but does not eliminate the solution. The UNION pattern combines graph patterns; each alternative possibility can contain more than one triple pattern. SPARQL provides a means of combining graph patterns so that one of several alternative graph patterns may match. If more than one of the alternatives matches, all the possible pattern solutions are found.
  • Web data: text + Linked Data + semi-structured RDF + hybrid data that can be conceived as forming data graphs. We hear about Bob and Alice all the time (in the computer science literature) and want to find out more… so we build a Semantic Web search engine. To address complex information needs by exploiting Web data: query as a set of constraints; match structured data; match text.
  • - Less than 5 percent of IR papers deal with query processing and the aspect of efficiency
  • Partitioning has an impact on performance! Blocking: iterator-based approaches. Non-blocking: good for streaming, good when we cannot wait for some parts of the results to be completely worked off. Linked Data: we cannot wait for sources (some are slower than others), so it is better to push data into query processing as they come instead of pulling data and waiting (busy waiting). Top-k:
  • - Phrase patterns (e.g., “X is the capital of Y”) for large-scale extraction. Such simple patterns, when coupled with the richness and redundancy of the Web, can be very useful in scraping millions or even billions of facts from the Web. - Patterns: matched when keywords or data types Xi appear in sequence; matched if all keywords/data types/patterns appear within an m-word window. For extraction: relation patterns. For text search: entity patterns. - When not assuming that all relevant data can be extracted, such matching against text is still needed: hybrid search.
  • Given some materialized indexes → no joins at all. Given sorted inputs → sort-merge join.
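The note on sorted inputs can be illustrated with a minimal sort-merge join sketch. The function name, the `(key, value)` row layout, and the sample rows are assumptions made for illustration; they do not come from the tutorial.

```python
def sort_merge_join(left, right):
    """Join two lists of (key, value) rows that are already sorted by key."""
    results = []
    i = j = 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            i += 1
        elif lkey > rkey:
            j += 1
        else:
            # Find the run of equal keys on the right side, then pair it
            # with every left row that carries the same key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lkey:
                j_end += 1
            while i < len(left) and left[i][0] == lkey:
                for k in range(j, j_end):
                    results.append((lkey, left[i][1], right[k][1]))
                i += 1
            j = j_end
    return results


# Hypothetical sorted inputs: creators and titles keyed by resource id.
creators = [("photo1", "bob"), ("photo2", "alice")]
titles = [("photo1", "Beautiful Sunset"), ("photo3", "Harbour")]
joined = sort_merge_join(creators, titles)
```

Because both inputs are scanned once in key order, no hashing or re-sorting is needed, which is why sorted index lookups make this join attractive.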
  • Every task is a challenge in itself, some more and some less well elaborated. There are separate challenges for every problem.
  • Syntactic: “works” vs. “works at”. Semantic: “works” vs. “employ”.
  • A text that contains a large number of mentions of “Berlin” is likely to be about “Berlin”. An interpretation i is more likely to be the correct interpretation of K when terms in K co-occur in a large number of contexts (bags of words) associated with i. “Berlin” and “apartment” co-occur more often in the geographic location context/topic than in the context of people.
  • Transcript

    • 1. Peter Mika | Yahoo! Research, Spain
      pmika@yahoo-inc.com
      Thanh Tran | Institute AIFB, KIT, Germany
      Tran@aifb.uni-karlsruhe.de
      Semantic Search Tutorial: Introduction
    • 2. About the speakers
      Peter Mika
      Semantic Search group at Yahoo! Barcelona
      Semantic Search, Web Object Retrieval, Natural Language Processing
      Tran Duc Thanh
      Semantic Search group at AIFB
      Semantic Search, Semantic Data Management, Linked Data Query Processing
    • 3. Agenda
      Introduction (5 min)
      Semantic Web data (50 min)
      The RDF data model
      Publishing RDF
      Crawling and indexing RDF data
      Query processing (35 min)
      Ranking (25 min)
      Result presentation (15 min)
      Semantic Search evaluation (15 min)
      Demos (15 min)
      Questions (5 min)
    • 4. Why Semantic Search? I.
      “We are at the beginning of search.” (Marissa Mayer)
      Solved large classes of queries, e.g. navigational
      Heavy investment in computational power
      Remaining queries are hard, not solvable by brute force, and require a deep understanding of the world and human cognition
      Background knowledge and metadata can help to address poorly solved queries
    • 5. Poorly solved information needs
      Ambiguous searches
      paris hilton
      Long tail queries
      george bush (and I mean the beer brewer in Arizona)
      Multimedia search
      paris hilton sexy
      Imprecise or overly precise searches
      jim hendler
      pictures of strong adventures people
      Precise searches for descriptions
      countries in africa
      32 year old computer scientist living in barcelona
      reliable digital camera under 300 dollars
      Many of these queries would not be asked by users, who learned over time what search technology can and can not do.
    • 6. Example: multiple interpretations
    • 7. Why Semantic Search? II.
      The Semantic Web is now a reality
      Large amounts of data published in RDF
      Heterogeneous data of varying quality
      Users who are not skilled in writing complex queries (e.g. SPARQL) and may not be experts in the domain
      Searching data instead of or in addition to searching documents
      Direct answers
      Novel search tasks
    • 8. Information box with content from and links to Yahoo! Travel
      Example: direct answers in search
      Points of interest in Vienna, Austria
      Shopping results from
      Yahoo! Shopping
      Since Aug, 2010, ‘regular’ search results are ‘Powered by Bing’
    • 9. Novel search tasks
      Aggregation of search results
      e.g. price comparison across websites
      Analysis and prediction
      e.g. world temperature by 2020
      Semantic profiling
      recommendations based on particular interests
      Semantic log analysis
      understanding user behavior in terms of objects
      Support for complex tasks
      e.g. booking a vacation using a combination of services
    • 10. Document retrieval and data retrieval
      Information Retrieval (IR) supports the retrieval of documents (document retrieval)
      Representation based on lightweight syntax-centric models
      Work well for topical search
      Not so well for more complex information needs
      Web scale
      Database (DB) and Knowledge-based Systems (KB) deliver more precise answers (data retrieval)
      More expressive models
      Allow for complex queries
      Retrieve concrete answers that precisely match queries
      Not just matching and filtering, but also joins
      Limitations in scalability
    • 11. Combination of document and data retrieval
      Documents with metadata
      Metadata may be embedded inside the document
      I’m looking for documents that mention countries in Africa.
      Data retrieval
      Structured data, but searchable text fields
      I’m looking for directors, who have directed movies where the synopsis mentions dinosaurs.
    • 12. Semantic Search
      Target (combination of) document and data retrieval
      Semantic search is a retrieval paradigm that
      Exploits the structure/semantics of the data or explicit background knowledge to understand user intent and the meaning of content
      Incorporates the intent of the query and the meaning of content into the search process (semantic models)
      Wide range of semantic search systems
      Employ different semantic models, possibly at different steps of the search process and in order to support different tasks
    • 13. Semantic models
      Semantics is concerned with the meaning of the resources made available for search
      Various representations of meaning
      Linguistic models: models of relationships among words
      Taxonomies, thesauri, dictionaries of entity names
      Inference along linguistic relations, e.g. broader/narrower terms
      Natural language search
      Conceptual models: models of relationships among objects
      Ontologies capture entities in the world and their relationships
      Inference along domain-specific relations
      Knowledge-based search
      We will focus on conceptual models in this tutorial
      In particular, the RDF/OWL conceptual model for representing classes in a domain, and describing their instances
    • 14. Semantic Search – a process view
      Knowledge Representation
      Semantic Models
      Resources
      Documents
      Document Representation
    • 15. Semantic Search systems
      For data / document retrieval, semantic search systems might combine a range of techniques, ranging from statistics-based IR methods for ranking, database methods for efficient indexing and query processing, up to complex reasoning techniques for making inferences!
    • 16. Semantic Web data
    • 17. Semantic Web
      Sharing data across the Web
      Standard data model
      RDF
      A number of syntaxes (file formats)
      RDF/XML, RDFa
      Powerful, logic-based languages for schemas
      OWL, RIF
      Query languages and protocols
      HTTP, SPARQL
    • 18. Resource Description Framework (RDF)
      Each resource (thing, entity) is identified by a URI
      Globally unique identifiers
      Locators of information
      Data is broken down into individual facts
      Triples of (subject, predicate, object)
      A set of triples (an RDF graph) is published together in an RDF document
      Example RDF document:
      example:roi type foaf:Person
      example:roi name “Roi Blanco”
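The triple model on this slide can be sketched in a few lines of Python. This is only an illustration: the URI chosen for `example:roi` and the `match` helper are assumptions, while the FOAF namespace and `rdf:type` URIs are the standard ones.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
ROI = "http://example.org/roi"  # hypothetical URI standing in for example:roi

# An RDF graph is a set of (subject, predicate, object) triples.
graph = {
    (ROI, RDF_TYPE, FOAF + "Person"),
    (ROI, FOAF + "name", "Roi Blanco"),
}


def match(graph, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# All names recorded for the resource.
names = [o for (_, _, o) in match(graph, s=ROI, p=FOAF + "name")]
```

A single triple pattern with wildcards, as in `match`, is the building block that query languages such as SPARQL combine into larger graph patterns.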
    • 19. Linking resources
      Friend-of-a-Friend ontology
      Roi’s homepage
      type
      example:roi
      foaf:Person
      name
      “Roi Blanco”
      sameAs
      Yahoo
      type
      worksWith
      example:roi2
      example:peter
      email
      “pmika@yahoo-inc.com”
    • 20. Publishing RDF
      Linked Data
      Data published as RDF documents linked to other RDF documents
      Community effort to re-publish large public datasets (e.g. DBpedia, open government data)
      RDFa
      Data embedded inside HTML pages
      Recommended for site owners by Yahoo, Google, Facebook
      SPARQL endpoints
      Triple stores (RDF databases) that can be queried through the web
    • 21. Linked Data
      A web of RDF documents in parallel to the current Web
      often implemented as wrappers around databases or APIs
      The four rules of Linked Data:
      Use URIs to identify things.
      Use HTTP URIs so that these things can be referred to and looked up ("dereferenced") by people and user agents.
      Provide useful information about the thing when its URI is dereferenced, using standard formats such as RDF/XML.
      Include links to other, related URIs in the exposed data to improve discovery of other related information on the Web.
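Dereferencing a Linked Data URI usually relies on HTTP content negotiation. The sketch below only builds the request and does not send it; the helper name and the URI are hypothetical, and a real client would also handle redirects and other RDF serializations.

```python
import urllib.request


def rdf_request(uri):
    """Build (but do not send) an HTTP GET request that asks for RDF/XML."""
    return urllib.request.Request(
        uri, headers={"Accept": "application/rdf+xml"}
    )


# Hypothetical resource URI, used only for illustration.
req = rdf_request("http://example.org/resource/roi")
```

Passing `req` to `urllib.request.urlopen` would perform the lookup; a server that follows the rules above would answer with an RDF description of the thing rather than an HTML page.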
    • 22. Linked Data
      Advantages:
      No change to the publishing of the HTML documents
      Data can be published by a third party (e.g. DBpedia)
      Disadvantages:
      Web servers need to be configured to properly handle URIs that identify concepts instead of documents
      Not favored by search engines
      Lack of use cases
      Crawling needs to be changed
      Authority is difficult to determine
      Tools
      Triple stores (Virtuoso, Oracle etc.) and front-ends (Pubby)
      RDB-to-RDF mappers (e.g. D2RQ, Triplify)
      Validators (Vapour)
      Linked Data browsers (many)
    • 23. The state of Linked Data
      Rapidly growing community effort to (re)publish open datasets as Linked Data
      In particular, scientific and government datasets
      see linkeddata.org
      Less commercial interest, real usage
    • 24. Metadata in HTML
      1995: HTML meta tags
      1996: Simple HTML Ontology Extensions (SHOE)
      1998: RDF/XML
      RDF/XML in HTML
      RDF linked from HTML
      2003: Web 2.0
      Tagging
      Microformats
      Metadata in Wikipedia
      Machine tags in Flickr
      2005: eRDF
      2008: RDFa 1.0
      2011: RDFa 1.1
      2012: Microdata?
    • 25. HTML meta tags
      <HTML>
      <HEAD profile="http://dublincore.org/documents/dcq-html/">
      <META name="DC.author" content="Peter Mika">
      <LINK rel="DC.rights copyright" href="http://www.example.org/rights.html" />
      <LINK rel="meta" type="application/rdf+xml" title="FOAF"
      href= "http://www.cs.vu.nl/~pmika/foaf.rdf">
      </HEAD>

      </HTML>
    • 26. Microformats (μf)
      Agreements on the way to encode certain kinds of metadata in HTML
      Reuse of semantic-bearing HTML elements
      Based on existing standards
      Minimality
      Microformats exist for a limited set of objects
      hCard (persons and organizations)
      hCalendar (events)
      hResume
      hProduct
      hRecipe
      Varying degrees of support and stability
      hCard and rel-tag are widely supported
      Community centered around microformats.org
      Specifications and discussions are hosted there
    • 27. Microformats: limitations
      No shared syntax
      Each microformat has a separate syntax tailored to the vocabulary
      No formal schemas
      Limited reuse, extensibility of schemas
      Unclear which combinations are allowed
      No datatypes
      No namespaces, unique identifiers (URIs)
      no interlinking
      mapping between instances is required
      Always appears in the HTML <body>
    • 28. Example: the hCard microformat
      <div class="vcard">
      <a class="email fn" href="mailto:jfriday@host.com">Joe Friday</a>
      <div class="tel">+1-919-555-7878</div>
      <div class="title">Area Administrator, Assistant</div>
      </div>
      <cite class="vcard">
      <a class="fn url" rel="friend colleague met" href="http://meyerweb.com/">
      Eric Meyer</a> </cite> wrote a post (<cite>
      <a href="http://meyerweb.com/eric/thoughts/2005/12/16/tax-relief/">
      Tax Relief</a></cite>) about an unintentionally humorous letter he received from
      the <span class="vcard"> <a class="fn org url" href="http://irs.gov/">
      Internal Revenue Service</a>
      </span>.
    • 29. RDFa
      W3C standard for embedding RDF data in HTML documents
      A set of new HTML attributes to be used in head or body
      A specification of how to extract the data from these attributes
      RDFa is just a syntax, you have to choose a vocabulary separately
      RDFa 1.0 is a W3C Recommendation since October, 2008
      RDFa Primer
      RDFa 1.1 is a small update on RDFa to make it easier to use
      Currently Working Draft (March 31, 2011)
      Updated version of the RDFa Primer (April 19, 2011)
      RDFa API for accessing RDFa data in a webpage in the browser from JavaScript
      Currently Working Draft (April 19, 2011)
    • 30. RDFa 1.1
      Changes
      New vocab attribute to define the default namespace for the document or subtree
      Profile documents to define multiple namespace prefixes
      The prefix attribute as a recommended replacement of xmlns
      You can use URIs even where only CURIEs were allowed before
      RDFa 1.1 is backward compatible with RDFa 1.0
      RDFa 1.1 is recommended if you want to use HTML5
    • 31. When to use RDFa
      Choose microformats when you find a microformat that fits your needs and is supported by your consumers
      Microformats are first option because they are simple
      Yahoo supports all major microformats, see the documentation
      It’s a common misconception that RDFa requires XHTML or that it’s incompatible with HTML5
      It’s compatible with HTML4, HTML5, XHTML
      If you find none that perfectly fits your needs then you need RDFa
      Microformats have a fixed schema: you can not add your own attributes
      Example: a social networking site with user profiles
      VCard is a good candidate, but for example it doesn’t have a way to express the user’s social connections
      You either live without this, or go with RDFa
    • 32. Example: Facebook’s Like and the Open Graph Protocol
      The ‘Like’ button provides publishers with a way to promote their content on Facebook and build communities
      Shows up in profiles and news feed
      Site owners can later reach users who have liked an object
      Facebook Graph API allows 3rd party developers to access the data
      Open Graph Protocol is an RDFa-based format that allows publishers to describe the object that the user ‘Likes’
    • 33. Example: Facebook’s Open Graph Protocol
      RDF vocabulary to be used in conjunction with RDFa
      Simplify the work of developers by restricting the freedom in RDFa
      Activities, Businesses, Groups, Organizations, People, Places, Products and Entertainment
      Only HTML <head> accepted
      <html xmlns:og="http://opengraphprotocol.org/schema/">
      <head>
      <title>The Rock (1996)</title>
      <meta property="og:title" content="The Rock" />
      <meta property="og:type" content="movie" />
      <meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
      <meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" /> …
      </head> ...
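A consumer of this markup only needs to read the `og:` properties from the `<meta>` elements. The following is a minimal sketch using Python's standard `html.parser`; the class name is an assumption, and this is not a full RDFa processor (it ignores prefix declarations and everything outside `<meta>`).

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect og:* properties from <meta property=... content=...> tags."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property")
        if prop and prop.startswith("og:") and "content" in attr:
            self.properties[prop] = attr["content"]


page = """<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
<title>The Rock (1996)</title>
<meta property="og:title" content="The Rock" />
<meta property="og:type" content="movie" />
<meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
</head></html>"""

parser = OpenGraphParser()
parser.feed(page)
og = parser.properties
```

Restricting the markup to `<head>` and a flat property list is what makes such a simple extractor sufficient.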
    • 34. Example: Yahoo! Enhanced Results (was: SearchMonkey)
      Guide for publishers to mark-up their pages for common types of objects
      Product, Local, News, Video, Events, Documents, Discussion, Games
      Using popular microformats and RDF vocabularies
      Copy-paste code
      Validator
      Yahoo as a consumer
      See later
    • 35. Example: Google’s Rich Snippets
      Google accepts popular microformats and its own RDFa vocabulary
      Similar approach to RDFa as Facebook
      Validator to check if the markup is correct
      Google displays enhanced results based on this metadata
      Rich Snippets
    • 36. RDFa on the rise
      510% increase between March, 2009 and October, 2010
      Percentage of URLs with embedded metadata in various formats
    • 37. Microdata
      Currently under standardization at the W3C
      Originally part of the HTML5 spec, but now a separate document
      Similar to microformats, but with the extensibility of RDFa
      Introduce new terms using reverse domain names or full URIs
      HTML5 also has a number of “semantic” elements such as <time>, <video>, <article>…
    • 38. Microdata example
      <div itemscope itemid="http://www.yahoo.com/resource/person">
      <p>My name is <span itemprop="name">Neil</span>.</p>
      <p>My band is called
      <span itemprop="band">Four Parts Water</span>.
      I was born on
      <time itemprop="birthday" datetime="2009-05-10">May 10th 2009</time>.
      <img itemprop="image" src="me.png" alt="me">
      </p>
      </div>
    • 39. The state of metadata in HTML
      5-10% of webpages contain some explicit metadata
      Depending on how you count…
      Too many competing approaches
      Too many formats: microformats vs RDFa vs Microdata
      When using RDFa, publishers may need to use multiple different vocabularies to satisfy everyone
    • 40. Crawling the Semantic Web
      Linked Data
      Similar to HTML crawling, but the crawler needs to parse RDF/XML (and others) to extract URIs to be crawled
      Semantic Sitemap/VOID descriptions
      RDFa
      Same as HTML crawling, but data is extracted after crawling
      Mika et al. Investigating the Semantic Gap through Query Log Analysis, ISWC 2010.
      SPARQL endpoints
      Endpoints are not linked, need to be discovered by other means
      Semantic Sitemap/VOID descriptions
    • 41. Data fusion
      Ontology matching
      Widely studied in Semantic Web research, see e.g. list of publications at ontologymatching.org
      Unfortunately, not much of it is applicable in a Web context due to the quality of ontologies
      Entity resolution
      Logic-based approaches in the Semantic Web
      Studied as record linkage in the database literature
      Machine learning based approaches, focusing on attributes
      Graph-based approaches, see e.g. the work of Lise Getoor, are applicable to RDF data
      Improvements over only attribute based matching
      Blending
      Merging objects that represent the same real world entity and reconciling information from multiple sources
    • 42. Data quality assessment and curation
      Heterogeneity, quality of data is an even larger issue
      Quality ranges from well-curated data sets (e.g. Freebase) to microformats
      In the worst of cases, the data becomes a graph of words
      Short amounts of text: prone to mistakes in data entry or extraction
      Example: mistake in a phone number or state code
      Quality assessment and data curation
      Quality varies from data created by experts to user-generated content
      Automated data validation
      Against known-good data or using triangulation
      Validation against the ontology or using probabilistic models
      Data validation by trained professionals or crowdsourcing
      Sampling data for evaluation
      Curation based on user feedback
    • 43. Indexing
      Search requires matching and ranking
      Matching selects a subset of the elements to be scored
      The goal of indexing is to speed up matching
      Retrieval needs to be performed in milliseconds
      Without an index, retrieval would require streaming through the collection
      The type of index depends on the query model to support
      DB-style indexing
      IR-style indexing
    • 44. IR-style indexing
      Index data as text
      Create virtual documents from data
      One virtual document per subgraph, resource or triple
      typically: resource
      Key differences to Text Retrieval
      RDF data is structured
      Minimally, queries on property values are required
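Creating one virtual document per resource can be sketched as grouping triples by subject and keeping the property as a field name, so that queries on property values remain possible. The sample triples, the whitespace tokenizer, and the function name are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical sample data in the spirit of the earlier slides.
triples = [
    ("example:roi", "foaf:name", "Roi Blanco"),
    ("example:roi", "rdf:type", "foaf:Person"),
    ("example:peter", "foaf:name", "Peter Mika"),
]


def virtual_documents(triples):
    """Group triples by subject: {resource: {property: [tokens]}}."""
    docs = defaultdict(lambda: defaultdict(list))
    for subject, prop, value in triples:
        # Naive tokenization: lowercase and split on whitespace.
        docs[subject][prop].extend(value.lower().split())
    return {s: dict(fields) for s, fields in docs.items()}


docs = virtual_documents(triples)
```

Each resulting document can then be handed to a text index, with the property names deciding how the tokens are laid out (see the horizontal and vertical structures that follow).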
    • 45. Horizontal index structure
      Two fields (indices): one for terms, one for properties
      For each term, store the property on the same position in the property index
      Positions are required even without phrase queries
      Query engine needs to support the alignment operator
      ✓ Dictionary is number of unique terms + number of properties
      Occurrences is number of tokens * 2
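The horizontal layout can be sketched as two parallel token streams that share positions. The function names and sample values below are hypothetical, and the `matches` helper is only a stand-in for the alignment operator a real query engine would provide.

```python
def horizontal_document(properties):
    """Turn (property, literal) pairs into two position-aligned fields."""
    tokens, props = [], []
    for prop, value in properties:
        for term in value.lower().split():
            tokens.append(term)   # term field
            props.append(prop)    # property field, same position
    return tokens, props


def matches(tokens, props, term, prop):
    """Alignment check: does `term` occur at a position tagged with `prop`?"""
    return any(t == term and p == prop for t, p in zip(tokens, props))


# Made-up values for one resource.
tokens, props = horizontal_document([
    ("foaf:name", "Roi Blanco"),
    ("email", "roi@example.org"),
])
```

Every token is stored twice (once as a term, once as its property), which is where the "number of tokens * 2" occurrence count on the slide comes from.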
    • 46. Vertical index structure
      One field (index) per property
      Positions are not required
      But useful for phrase queries
      Query engine needs to support fields
      Dictionary is number of unique terms
      Occurrences is number of tokens
      ✗ Number of fields is a problem for merging, query performance
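The vertical layout can be sketched as one small inverted index per property. The nested-dictionary representation, the function name, and the sample documents are assumptions; positions are omitted, as the slide notes they are optional.

```python
from collections import defaultdict


def build_vertical_index(docs):
    """docs: {doc_id: {property: literal}} -> {property: {term: {doc_ids}}}"""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, fields in docs.items():
        for prop, value in fields.items():
            for term in value.lower().split():
                index[prop][term].add(doc_id)
    return index


# Sample documents based on the names used in the earlier slides.
index = build_vertical_index({
    "example:roi": {"foaf:name": "Roi Blanco"},
    "example:peter": {"foaf:name": "Peter Mika", "email": "pmika@yahoo-inc.com"},
})
hits = index["foaf:name"]["mika"]
```

A query on a property value becomes a lookup in that property's index, but every distinct property adds another field to maintain and merge, which is the drawback listed above.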
    • 47. Indexing using MapReduce
      MapReduce is the perfect model for building inverted indices
      Map creates (term, {doc1}) pairs
      Reduce collects all docs for the same term: (term, {doc1, doc2…})
      Sub-indices are merged separately
      Term-partitioned indices
      Peter Mika. Distributed Indexing for Semantic Search, SemSearch 2010.
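The map and reduce steps described on this slide can be simulated in a single process. This is a sketch under simplifying assumptions: the shuffle phase is replaced by an in-memory dictionary, the tokenizer is a plain whitespace split, and all names are hypothetical rather than taken from any MapReduce framework.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit a (term, {doc_id}) pair for every distinct term in the document."""
    for term in set(text.lower().split()):
        yield term, {doc_id}


def reduce_phase(term, postings):
    """Merge all partial postings for one term into a sorted posting list."""
    merged = set()
    for posting in postings:
        merged |= posting
    return term, sorted(merged)


def build_index(docs):
    grouped = defaultdict(list)  # stands in for the shuffle/sort step
    for doc_id, text in docs.items():
        for term, posting in map_phase(doc_id, text):
            grouped[term].append(posting)
    return dict(reduce_phase(t, ps) for t, ps in grouped.items())


index = build_index({"doc1": "semantic search", "doc2": "semantic web data"})
```

Because each reducer sees all postings for its terms, the output is naturally partitioned by term, matching the term-partitioned indices mentioned above.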
    • 48. Query Processing
    • 49. Structure
      Taxonomy of retrieval approaches
      Query processing for semantic search
      Types of semantic data
      Formalisms for querying semantic data
      Approaches
      General task: hybrid graph pattern matching
      Matching keyword query against text
      Matching structured query against structured data
      Matching keyword query against structured data
      Matching structured query against text (a hybrid case)
      Main tasks, challenges and opportunities
    • 50. Taxonomy of retrieval approaches (1)
      Data retrieval problem
      A collection of resources represented by the data model G
      Information needs expressed as queries in Q
      Retrieval is the task of efficiently computing results from G that are relevant to the queries in Q
      Document retrieval vs. data retrieval
      Differences in query and data representation and matching
      Efficiently retrieve structured data that exactly match formal information needs expressed as structured queries
      Effectively rank textual results that match ambiguous NL / keyword queries to a certain degree (notions of relevance)
    • 51. Taxonomy of retrieval approaches (2)
      Exact
      Complete
      Sound
      Query
      Matching
      Data
      Query processing mainly focuses on efficiency of matching whereas ranking deals with degree of matching (relevance)!
    • 57. Query processing for Semantic Search (1)
      The underlying collection of resources is represented by semantic data G ranging from
      Structured data with well defined schemas
      Semi-structured data with incomplete or no schemas
      Data that largely comprise text
      Hybrid / embedded data
      Targeted information needs Q are of varying complexity, captured using different formalisms and querying paradigms
      Natural language texts and keywords
      Form-based inputs
      Formal structured queries
      Semantic search, mainly, is the task of efficiently computing results (query processing) from G that are relevant to the queries in Q (ranking)
    • 58. Query processing for Semantic Search (2)
      Keywords
      NL Questions
      Form- / facet-based Inputs
      Structured Queries (SPARQL)
      Query
      Matching
      Data
      OWL ontologies with rich, formal semantics
      Structured RDF data
      Semi-Structured RDF data
      RDF data embedded in text (RDFa)
    • 59. Query processing for Semantic Search (3)
      Unstructured query, textual data: keyword query on textual data (IR / document retrieval)
      Structured query, textual data: structured query on textual data
      Unstructured query, structured data: keyword query on structured data
      Structured query, structured data: structured query on structured data (DB / data retrieval)
      Semantic search targets different groups of users, information needs, and types of data. Query processing for semantic search is a hybrid combination of techniques!
    • 60. Types of data models (1)
      Textual
      Bag-of-words
      Represent documents, text in structured data,…, real-world objects (captured as structured data)
      Lacks “structure”
      in text, e.g. linguistic structure, hyperlinks, (positional information)
      Structure in structured data representation
      term (statistics)
      In combination with Cloud Computing technologies, promising solutions for the management of `big data' have emerged. Existing industry solutions are able to support complex queries and analytics tasks with terabytes of data. For example, using a Greenplum.
      Terms: combination, Cloud, Computing, Technologies, solutions, management, `big data', industry, solutions, support, complex, ……
    • 61. Types of data models (2)
      Textual
      Structured
      Resource Description Framework (RDF)
      Represent real-world objects, services, applications, …. documents
      Resource attribute values and relationships between resources
      Schema
      Picture
      creator
      Person
      Bob
    • 62. Types of data models (3)
      Textual
      Structured
      Hybrid
      RDF data embedded in text (RDFa)
    • 63. Types of data models – RDFa (1)

      <div about="/alice/posts/trouble_with_bob">
      <h2 property="dc:title">The trouble with Bob</h2>
      <h3 property="dc:creator">Alice</h3>
      Bob is a good friend of mine. We went to the same university, and
      also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do:
      <div about="http://example.com/bob/photos/sunset.jpg">
      <img src="http://example.com/bob/photos/sunset.jpg" />
      <span property="dc:title">Beautiful Sunset</span>
      by <span property="dc:creator">Bob</span>.
      </div>
      </div>

      adapted from: http://www.w3.org/TR/xhtml-rdfa-primer/
    • 64. Types of semantic data – RDFa (2)
      Bob is a good friend of mine. We went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do:
      content
      content
      adapted from: http://www.w3.org/TR/xhtml-rdfa-primer/
    • 65. Types of semantic data - conclusion
      Semantic data in general can be conceived as a graph with text and structured data items as nodes, and edges represent different types of relationships including explicit semantic relationships and vaguely specified ones such as hyperlinks!
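      A minimal sketch of this graph view, assuming a toy representation in which every edge is a (source, label, target) tuple. The node identifiers and the "link" label are hypothetical and stand in for the RDFa example; they are not part of the tutorial.

```python
from collections import defaultdict

# Hypothetical toy data: text items and structured items are both plain nodes.
edges = [
    ("post", "dc:title", "The trouble with Bob"),      # explicit semantic relationship
    ("post", "dc:creator", "Alice"),
    ("post", "content", "Bob is a good friend of mine. ..."),  # text node
    ("sunset.jpg", "dc:title", "Beautiful Sunset"),
    ("sunset.jpg", "dc:creator", "Bob"),
    ("post", "link", "sunset.jpg"),                     # vaguely specified, hyperlink-like edge
]

def build_graph(triples):
    """Group outgoing (label, target) pairs by source node."""
    graph = defaultdict(list)
    for source, label, target in triples:
        graph[source].append((label, target))
    return dict(graph)

graph = build_graph(edges)
for label, target in graph["post"]:
    print(label, "->", target)
```

      Keeping text and structured values in the same edge list is only one possible design; it makes it easy to treat keyword matches and triple matches uniformly.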
    • 66. Formalisms for querying semantic data (1)
      Example information need
      “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT.”
      Unstructured queries
      Fully-structured queries
      Hybrid queries: unstructured + structured
    • 67. Formalisms for querying semantic data (2)
      Example information need
      “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT.”
      Unstructured
      NL
      Keywords
      apartment
      Berlin
      Alice
      shared
    • 68. Formalisms for querying semantic data (3)
      Example information need
      “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT.”
      Unstructured
      Fully-structured
      SPARQL: BGP, filter, optional, union, select, construct, ask, describe
      PREFIX ns: <http://example.org/ns#>
      SELECT ?x
      WHERE { ?x ns:knows ?y. ?y ns:name "Alice".
      ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name "KIT" }
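      To illustrate what matching a basic graph pattern (BGP) involves, here is a small sketch that evaluates the query's triple patterns against a toy triple set by nested-loop matching. The data, the node identifiers (alice_node, kit_node), and the helper names are assumptions made for illustration; a real SPARQL engine uses indexes and join optimization instead.

```python
def is_var(term):
    """Variables are written with a leading '?', as in SPARQL."""
    return term.startswith("?")

def match_pattern(pattern, triple, binding):
    """Return an extended copy of `binding` if `pattern` matches `triple`, else None."""
    extended = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # Bind the variable, or check consistency with an earlier binding.
            if extended.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return extended

def evaluate_bgp(patterns, triples):
    """Conjunctive evaluation: every pattern must match under one shared binding."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in triples
            if (extended := match_pattern(pattern, triple, binding)) is not None
        ]
    return bindings

# Hypothetical toy data.
triples = [
    ("bob", "knows", "alice_node"),
    ("alice_node", "name", "Alice"),
    ("bob", "knows", "thanh"),
    ("thanh", "works", "kit_node"),
    ("kit_node", "name", "KIT"),
    ("peter", "knows", "thanh"),
]
query = [
    ("?x", "knows", "?y"), ("?y", "name", "Alice"),
    ("?x", "knows", "?z"), ("?z", "works", "?v"), ("?v", "name", "KIT"),
]
results = sorted({b["?x"] for b in evaluate_bgp(query, triples)})
print(results)
```

      In this toy data only bob knows both Alice and someone who works at KIT, so peter is filtered out by the second pattern.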
    • 69. Formalisms for querying semantic data (4)
      Fully-structured
      Unstructured
      Hybrid: content and structure constraints
      “shared apartment Berlin Alice”
      ?x ns:knows ?y. ?y ns:name "Alice".
      ?x ns:knows ?z. ?z ns:works ?v.
      ?v ns:name "KIT"
    • 70. Formalisms for querying semantic data (5)
      Fully-structured
      Unstructured
      Hybrid: content and structure constraints
      “shared apartment Berlin Alice”
      ?x ns:knows ?y. ?y ns:name "Alice".
      ?x ns:knows ?z. ?z ns:works ?v.
      ?v ns:name "KIT"
    • 71. Formalisms for querying semantic data - conclusion
      Semantic search queries can be conceived as graph patterns with nodes referring to text and structured data items, and edges referring to relationships between these items!
    • 72. Processing hybrid graph patterns (1)
      Example information need
      “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT.”
      ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name "KIT"
      apartment shared Berlin Alice
      ?y ns:name "Alice". ?x ns:knows ?y
      [Figure: example data graph. Node labels: Alice, Bob, Peter, Thanh, KIT, FluidOps, Germany, Semantic Search, sunset.jpg, trouble with bob, Beautiful Sunset, 34, 2009, and the text “Bob is a good friend of mine. We went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do:”. Edge labels: age, works, author, title, creator, knows, year, located.]
    • 73. Processing hybrid graph patterns (2)
      Matching hybrid graph patterns against data
    • 74. Matching keyword query against text
      • Retrieve documents
      • 75. Inverted list (inverted index)
      keyword → {<doc1, pos, score, ...>, <doc2, pos, score, ...>, ...}
      • AND-semantics: top-k join
      [Figure: the inverted lists for shared, Berlin, and Alice each contain D1; joining the lists (=) yields D1 as a match for all three keywords]
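      The inverted list and the AND-semantics can be sketched as follows. This is a simplified illustration: postings hold only (document, position) pairs, scores and the top-k part of the join are left out, and the documents D2 and D3 are invented for contrast.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased term to a list of (doc_id, position) postings."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for pos, token in enumerate(docs[doc_id].lower().split()):
            term = token.strip(".,:;!?\"'")
            if term:
                index[term].append((doc_id, pos))
    return index

def and_query(index, keywords):
    """AND-semantics: keep only documents that appear in every keyword's list."""
    doc_sets = [{doc_id for doc_id, _ in index.get(k.lower(), [])} for k in keywords]
    if not doc_sets:
        return []
    return sorted(set.intersection(*doc_sets))

# Hypothetical toy collection.
docs = {
    "D1": "Bob and Alice shared an apartment in Berlin in 2008.",
    "D2": "Alice works in Karlsruhe.",
    "D3": "Berlin is a city in Germany.",
}
index = build_inverted_index(docs)
hits = and_query(index, ["shared", "Berlin", "Alice"])
print(hits)
```

      A set intersection is the simplest form of the join; with posting lists sorted by document identifier the same result can be computed by a linear merge.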
    • 76. Matching structured query against structured data
      • Retrieve data for triple patterns
      • 77. Index on tables
      • 78. Multiple “redundant” indexes to cover different access patterns
      • 79. Join (conjunction of triples)
      • 80. Blocking, e.g. linear merge join (requires sorted input)
      • 81. Non-blocking, e.g. symmetric hash-join
      • 82. Materialized join indexes
      ?x ns:knows ?y. ?x ns:knows ?z.
      ?z ns:works ?v. ?v ns:name "KIT"
      [Figure: triple patterns answered by index lookups and then joined (=)]
      Per1 ns:works ?v (SP-index)
      ?v ns:name "KIT" (PO-index)
      Per1 ns:works Ins1
      Ins1 ns:name KIT
      Join result: Per1 ns:works Ins1. Ins1 ns:name KIT
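      A rough sketch of the non-blocking case: a symmetric hash join that reads both inputs alternately and emits a result as soon as matching entries from both sides have been seen. The predicate index and PO-index below are plain dictionaries, and the triples for Per2 and Ins2 are invented; none of this reflects a particular triple store.

```python
from collections import defaultdict

def symmetric_hash_join(left, right, left_key, right_key):
    """Non-blocking join: one hash table per input, probed as items arrive."""
    left_table, right_table = defaultdict(list), defaultdict(list)
    left_iter, right_iter = iter(left), iter(right)
    left_done = right_done = False
    while not (left_done and right_done):
        if not left_done:
            item = next(left_iter, None)
            if item is None:
                left_done = True
            else:
                key = left_key(item)
                left_table[key].append(item)
                for match in right_table.get(key, []):
                    yield item, match
        if not right_done:
            item = next(right_iter, None)
            if item is None:
                right_done = True
            else:
                key = right_key(item)
                right_table[key].append(item)
                for match in left_table.get(key, []):
                    yield match, item

# Hypothetical toy data.
triples = [
    ("Per1", "ns:works", "Ins1"),
    ("Per2", "ns:works", "Ins2"),
    ("Ins1", "ns:name", "KIT"),
    ("Ins2", "ns:name", "FluidOps"),
]
p_index, po_index = defaultdict(list), defaultdict(list)
for s, p, o in triples:
    p_index[p].append((s, p, o))          # lookup by predicate
    po_index[(p, o)].append((s, p, o))    # lookup by predicate and object

works = p_index["ns:works"]                   # ?z ns:works ?v
named_kit = po_index[("ns:name", "KIT")]      # ?v ns:name "KIT"
joined = list(symmetric_hash_join(works, named_kit,
                                  left_key=lambda t: t[2],    # ?v as object
                                  right_key=lambda t: t[0]))  # ?v as subject
print(joined)
```

      Because neither input has to be fully read or sorted first, this style of join suits sources that deliver results at unpredictable rates; a merge join would instead require both inputs sorted on ?v.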
    • 83. Matching keyword query against structured data
      • Retrieve keyword elements
      • 84. Using inverted index
      keyword → {<el1, score, ...>, <el2, score, ...>, …}
      • Exploration / “Join”
      • 85. Data indexes for triple lookup
      • 86. Materialized index (paths up to graphs)
      • 87. Top-k Steiner tree search, top-k subgraph exploration
      [Figure: keyword elements Alice, Bob, and KIT, connected through joins (=) over the following triples]
      Alice ns:knows Bob
      Bob ns:works Inst1
      Inst1 ns:name KIT
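      The exploration step can be illustrated with a much simpler stand-in for top-k Steiner tree search: a breadth-first search that finds one shortest chain of triples connecting two keyword elements. This covers only the two-keyword case, ignores scores, and treats edges as undirected; the triple about Peter is invented for contrast.

```python
from collections import deque

def connect(triples, start, goal):
    """Return a shortest list of triples linking `start` to `goal`, or None."""
    neighbors = {}
    for s, p, o in triples:
        # Treat edges as undirected for exploration purposes.
        neighbors.setdefault(s, []).append((o, (s, p, o)))
        neighbors.setdefault(o, []).append((s, (s, p, o)))
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for nxt, triple in neighbors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [triple]))
    return None

# Hypothetical toy data.
triples = [
    ("Alice", "ns:knows", "Bob"),
    ("Bob", "ns:works", "Inst1"),
    ("Inst1", "ns:name", "KIT"),
    ("Peter", "ns:works", "FluidOps"),
]
path = connect(triples, "Alice", "KIT")
print(path)
```

      The start and goal nodes would come from the inverted index lookup of the keywords; a full solution would explore from all keyword elements at once and rank the resulting subgraphs.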
    • 88. Matching structured query against text
      • Based on offline IE (see Peter’s slides)
      • 89. Based on online IE, where retrieval proceeds as follows
      • 90. Derive keywords to retrieve relevant documents
      • 91. On-the-fly information extraction, i.e., phrase pattern matching “X title Y”
      • 92. Retrieve extracted data for structured part
      • 93. Retrieve documents for derived text patterns, e.g. sequence, windows, reg. exp.
      • 94. Index
      • 95. Inverted index for document retrieval and pattern matching
      • 96. Join index: inverted index for storing materialized joins between keywords
      • 97. Neighborhood indexes for phrase patterns
      ?x ns:knows ?y. ?x ns:knows ?z.
      ?z ns:works ?v. ?v ns:name "KIT"
      KIT
      name
      knows
      name
      KIT
      Hybrid case
    • 98. Query processing – main tasks
      Retrieval
      Documents, data elements, triples, paths, graphs
      Inverted index,…, but also other indexes (B+ tree)
      Index documents, triples, materialized join paths
      Join
      Different join implementations, efficiency depends on availability of indexes
      Non-blocking joins are good for early result reporting and for the “unpredictable” linked data scenario
      Query
      Matching
      Data
    • 99. Query processing – more tasks
      Disjunction, aggregation, grouping
      Join order optimization
      Approximate
      Approximate the search space
      Approximate the results (matching, join)
      Parallelization
      Top-k
      Use only some entries in the input streams to produce k results
      Multiple sources
      Federation, routing
      On-the-fly mapping, similarity join
      Hybrid
      Join text and data
      Query
      Matching
      Data
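      The top-k item above (using only some entries of the input streams) can be sketched with a Fagin-style threshold algorithm. The slides do not name a specific algorithm, so this choice, the sum aggregation, and all scores are assumptions for illustration.

```python
import heapq

def threshold_top_k(lists, k):
    """lists: one list of (doc_id, score) per keyword, sorted by descending score.

    The aggregate score is the sum over all lists; a document missing from a
    list contributes 0 there. Returns (top-k results, depth reached).
    """
    lookup = [dict(lst) for lst in lists]  # random access by doc_id
    totals = {}
    max_depth = max((len(lst) for lst in lists), default=0)
    depth = 0
    while depth < max_depth:
        threshold = 0.0  # upper bound on the score of any unseen document
        for lst in lists:
            if depth < len(lst):
                doc_id, score = lst[depth]
                threshold += score
                if doc_id not in totals:
                    totals[doc_id] = sum(t.get(doc_id, 0.0) for t in lookup)
        depth += 1
        top = heapq.nlargest(k, totals.items(), key=lambda item: item[1])
        # Stop early once the k-th best seen score cannot be beaten.
        if len(top) == k and top[-1][1] >= threshold:
            return top, depth
    return heapq.nlargest(k, totals.items(), key=lambda item: item[1]), depth

# Hypothetical scored inverted lists.
lists = [
    [("D1", 0.9), ("D4", 0.3), ("D5", 0.1)],   # shared
    [("D1", 0.8), ("D3", 0.7), ("D4", 0.2)],   # Berlin
    [("D2", 0.9), ("D1", 0.6), ("D4", 0.1)],   # Alice
]
top, depth = threshold_top_k(lists, k=1)
print(top, depth)
```

      In this example the best document is confirmed after reading two of the three entries in each list, which is the point of top-k processing: the tails of the input streams are never touched.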
    • 100. Query processing on the Web - research challenges and opportunities
      Large amount of semantic data
      Data inconsistent, redundant, and low quality
      Large amount of data embedded in text
      Large amount of sources
      Large amount of links between sources
      • Optimization, parallelization
      • 101. Approximation
      • 102. Hybrid querying and data management
      • 103. Federation, routing
      • 104. Online schema mappings
      • 105. Similarity join
    • Ranking
    • 106. Structure
      Problem definition
      Types of ambiguities
      Ranking paradigms
      Model construction
      Content-based
      Structure-based
    • 107. Ranking – problem definition
      Query
      • Ambiguities arise when representation is incomplete / imprecise
      • 108. Ambiguities at the level of
      • 109. elements (content ambiguity)
      • 110. structure between elements (structure ambiguity)
      Matching
      Data
      Due to ambiguities in the representation of the information needs and the underlying resources, the results cannot be guaranteed to exactly match the query. Ranking is the problem of determining the degree of matching using some notions of relevance.
    • 111. Content ambiguity
      ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name "KIT"
      apartment shared Berlin Alice
      ?y ns:name "Alice". ?x ns:knows ?y
      [Figure: the example data graph from slide 72, “Processing hybrid graph patterns (1)”.]
      What is meant by “Berlin” in the query?
      What is meant by “Berlin” in the data?
      A city with the name Berlin? a person?
      What is meant by “KIT” in the query?
      What is meant by “KIT” in the data?
      A research group? a university? a location?
    • 112. Structure ambiguity
      ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name "KIT"
      apartment shared Berlin Alice
      ?y ns:name "Alice". ?x ns:knows ?y
      [Figure: the example data graph from slide 72, “Processing hybrid graph patterns (1)”.]
      What is the connection between “Berlin” and “Alice”?
      Friend? Co-worker?
      What is meant by “works”?
      Works at? employed?
    • 113. Ambiguity
      Recall: query processing is matching at the level of syntax and semantics
      Ambiguities arise when data or query allow for multiple interpretations, i.e. multiple matches
      Syntactic, e.g. works vs. works at
      Semantic, e.g. works vs. employ
      “Aboutness”, i.e., contain some elements which represent the correct interpretation
      Ambiguities arise when matching elements of different granularities
      Does i contain the interpretation for j, given that some part(s) of i (syntactically/semantically) match j?
      E.g. Berlin vs. “…we went to the same university, and also, we shared an apartment in Berlin in 2008…”
      Strictly speaking, ranking is performed after syntactic / semantic matching is done!
    • 114. Features: What to use to deal with ambiguities?
      What is meant by “Berlin”? What is the connection between “Berlin” and “Alice”?
      Content features
      Frequencies of terms: d is more likely to be “about” a query term k when d mentions k more often (probabilistic IR)
      Co-occurrences: terms K that often co-occur form a contextual interpretation, i.e., topics (cluster hypothesis)
      Structure features
      Consider relevance at level of fields
      Link-based popularity
    • 115. Ranking paradigms
      Explicit relevance model
      Foundation: probability ranking principle
      Ranking results by the posterior probability (odds) of being observed in the relevant class:
      P(w|R) is estimated differently in different approaches, e.g., the binary independence model, the 2-Poisson model, the relevance model
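      One common way to write these odds is sketched below. It assumes term independence and writes N for the non-relevant class and D for a result with terms w; these symbols are assumptions, since only P(w|R) is named here.

```latex
\frac{P(R \mid D)}{P(N \mid D)}
  \;\propto\; \frac{P(D \mid R)}{P(D \mid N)}
  \;=\; \prod_{w \in D} \frac{P(w \mid R)}{P(w \mid N)}
```

      The proportionality drops the prior odds P(R)/P(N), which are the same for every result and therefore do not affect the ranking.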
    • 116. Ranking paradigms
      No explicit notion of relevance: similarity between the query and the document model
      Vector space model (cosine similarity)
      Language models (KL divergence)
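      Both similarity notions can be sketched in a few lines. The sketch assumes raw term counts as vector weights and additively smoothed unigram language models; the smoothing parameter mu and the toy documents are illustrative choices, not taken from the tutorial.

```python
import math
from collections import Counter

def cosine_similarity(query_terms, doc_terms):
    """Cosine of the angle between two raw term-count vectors."""
    q, d = Counter(query_terms), Counter(doc_terms)
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0

def language_model(terms, vocabulary, mu=1.0):
    """Unigram model with additive smoothing so that no probability is zero."""
    counts = Counter(terms)
    total = len(terms) + mu * len(vocabulary)
    return {t: (counts[t] + mu) / total for t in vocabulary}

def kl_divergence(p, q):
    """KL(p || q); lower means the document model is closer to the query model."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p if p[t] > 0)

# Hypothetical toy data.
query = ["berlin", "apartment"]
doc1 = ["shared", "apartment", "berlin", "berlin"]
doc2 = ["sunset", "photo", "bob"]
vocabulary = sorted(set(query) | set(doc1) | set(doc2))

query_lm = language_model(query, vocabulary)
kl_doc1 = kl_divergence(query_lm, language_model(doc1, vocabulary))
kl_doc2 = kl_divergence(query_lm, language_model(doc2, vocabulary))
print(cosine_similarity(query, doc1), cosine_similarity(query, doc2))
print(kl_doc1, kl_doc2)
```

      Note the opposite directions: a higher cosine means a better match, whereas a lower KL divergence means a better match.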
    • 117. Model construction
      How to obtain
      Relevance models?
      Weights for query / document terms?
      Language models for document / queries?
    • 118. Content-based model construction
      Document statistics, e.g.
      Term frequency
      Document length
      Collection statistics, e.g.
      Inverse document frequency
      Background language models
      • When is an object more likely to be about “Berlin”?
      • 119. When it contains a relatively high number of mentions of the term “Berlin”
      • 120. When the number of mentions of this term in the overall collection is relatively low
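      These two conditions correspond to a TF-IDF style weight. The sketch below assumes length-normalized term frequency and a smoothed logarithmic IDF; both formulas are one common variant among many, and the toy collection is invented.

```python
import math

def tf_idf(term, doc_tokens, collection):
    """Length-normalized term frequency times smoothed inverse document frequency."""
    if not doc_tokens:
        return 0.0
    tf = doc_tokens.count(term) / len(doc_tokens)            # document statistics
    df = sum(1 for tokens in collection if term in tokens)   # collection statistics
    idf = math.log((1 + len(collection)) / (1 + df))
    return tf * idf

# Hypothetical toy collection of tokenized objects.
d1 = ["berlin", "apartment", "berlin"]
d2 = ["alice", "works", "kit"]
d3 = ["berlin", "germany", "city", "capital"]
collection = [d1, d2, d3]

scores = {name: tf_idf("berlin", tokens, collection)
          for name, tokens in [("d1", d1), ("d2", d2), ("d3", d3)]}
print(scores)
```

      Dividing by the document length keeps long objects from winning merely because they contain more words, which is where the document length statistic enters.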
    • Structure-based model construction
      Consider structure of objects during content-based modeling, i.e., to obtain structured content-based model
      Content-based model for structured objects, documents and for general tuples
      • When is an object more likely to be about “Berlin”?
      • 121. When one of its (important) fields contains a relatively high number of mentions of the term “Berlin”
    • Structure-based model construction
      PageRank
      Link analysis algorithm
      Measuring relative importance of nodes
      Link counts as a vote of support
      The PageRank of a node recursively depends on the number and PageRank of all nodes that link to it (incoming links)
      ObjectRank
      Types and semantics of links vary in structured data setting
      Authority transfer schema graph specifies connection strengths
      Recursively compute authority transfer data graph
      • When is an object about “Berlin” more important than another?
      • 122. When a relatively large number of objects link to it
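      The recursive definition of PageRank can be approximated by power iteration. The sketch below assumes a damping factor of 0.85, a fixed number of iterations, and even redistribution of rank from nodes without outgoing links; the link graph is invented. ObjectRank would additionally weight each edge by its type, which is not shown.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each node to the list of nodes it links to."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Each outgoing link passes on an equal share: a vote of support.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / len(nodes)
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical link graph: three objects link to the "berlin" object.
links = {
    "post": ["berlin"],
    "photo": ["berlin"],
    "profile": ["berlin", "post"],
    "berlin": [],
}
rank = pagerank(links)
print(max(rank, key=rank.get))
```

      The ranks always sum to 1, so they can be read as the share of attention each node receives from a random surfer.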
    • Taxonomy of ranking approaches
      Explicitly vs. non-explicitly relevance-based
      Content-based ranking
      Structure-based ranking
      Content- and structure-based ranking
    • 123. Result Presentation
    • 124. Search interface
      Input and output functionality
      helping the user to formulate complex queries
      presenting the results in an intelligent manner
      Semantic Search brings improvements in
      Query formulation
      Snippet generation
      Adaptive and interactive presentation
      Presentation adapts to the kind of query and results presented
      Object results can be actionable, e.g. buy this product
      Aggregated search
      Grouping similar items, summarizing results in various ways
      Filtering (facets), possibly across different dimensions
      Task completion
      Help the user to fulfill the task by placing the query in a task context
    • 125. Query interpretation
      “Snap-to-grid”: find the most likely interpretation of the query given the ontology or a summary of the data
      See Query Processing
      Display the system’s interpretation of the user query
      Offer one or more interpretations, possibly while the user is typing
    • 126. Example: Freebase suggest
    • 127. Example: TrueKnowledge
      Q: “How many people live in Shanghai?”
      I: What is the population of Shanghai (Shanghainese: Zånhae), the metropolis in eastern China and a direct-controlled municipality of the People's Republic of China?
      A: The population of Shanghai on November 7th 2010 is approximately 19,300,389. (Extrapolated from a population of 18,884,600 in 2008 and a population of 19,210,000 on June 6th 2010.)
    • 128. Snippet generation using metadata
      Yahoo displays enriched search results for pages that contain microformat or RDFa markup using recognized ontologies
      Displaying data, images, video
      Example: GoodRelations for products
      Enhanced results also appear for sites from which we extract information ourselves
      Also used for generating facets that can be used to restrict search results by object type
      Example: “Shopping sites” facet for products
      Documentation and validator for developers
      http://developer.search.yahoo.com
      Formerly: SearchMonkey allowed developers to customize the result presentation and create new ones for any object type
    • 129. Example: Yahoo! Enhanced Results
      Enhanced result with deep links, rating, address.
    • 130. Automated snippet summarization
      Generate search result snippets given a query and a search result
      Penin et al. Snippet Generation for Semantic Web Search Engines, ASWC 2010
      Search results are ontologies
    • 131. Example: Facets in Yahoo! Search
      Click to restrict results to shopping sites
    • 132. Example: Yahoo! Vertical Intent Search
      Related actors and movies
    • 133. Adaptive presentation: semantic bookmarking
      Extract objects from pages tagged/bookmarked by a user
      Visualize the extracted objects
      Tabular display
      Sorting on attributes
      Map
      Tracking changes in data
      Alert me when the price drops below…
      Prototype: house search application
      Delicious profiles
      Extracting housing data from popular Spanish real-estate sites
    • 134. Adaptive presentation: semantic bookmarking
    • 135. Interactive presentation: Time Explorer
      Deliverable of the LivingKnowledge European Project
      Not a Yahoo product
      http://fbmya01.barcelonamedia.org:8080/future/
      Won the HCIR 2010 challenge
      Tool for understanding current news stories
      what are the events that led to a particular situation?
      what are the important entities for a given topic? (people, places, dates, etc.)
      what entities are important at a given time? How do their relationships change?
      what are the predictions made of a given topic?
    • 136. Interactive presentation: Time Explorer
      Technology
      Named Entity Recognition (persons, organizations)
      Temporal expression mining
      Inverted (sentence and document) index
      Forward index (archive) for retrieving relevant entities
      Ranking of both documents and relevant entities
      Display
      Two synchronized timelines showing relevant documents and the volume of documents
      Entity relationships
      Sentiments (future work)
    • 137. Example: Time Explorer
    • 138. Evaluation
    • 139. Semantic Search evaluation at SemSearch 2010/2011
      Started at SemSearch 2010
      Two tasks
      Entity Search
      Queries where the user is looking for a single real world object
      Pound et al. Ad-hoc Object Retrieval in the Web of Data, WWW 2010.
      List search (new in 2011)
      Queries where the user is looking for a class of objects
      Billion Triples Challenge 2009 dataset
      Evaluated using Amazon’s Mechanical Turk
      See Halpin et al. Evaluating Ad-Hoc Object Retrieval, IWEST 2010
      Prizes sponsored by Yahoo! Labs (2011)
    • 140. Entity Search Track
      Entity Search
      retrieval of data related to a single entity
      Queries
      Selected from the Search Query Tiny Sample v1.0 dataset, provided by the Yahoo! Webscope program
      Real web search queries sampled from the US query log of January, 2009
      Queries asked by at least three different users, with long number sequences removed (privacy reasons)
      50 selected queries that name an entity explicitly (but may also provide context)
      Last year: same type of queries, but a mix of Microsoft and Yahoo! Logs
    • 141. List query track
      List queries
      Queries that describe a set of entities
      The answer is a closed set
      Relatively small number of possible answers
      The answer is not likely to change
      Hand-picked but not hand-written
      Yahoo! Search logs
      Queries from the Tiny Sample v1.0 dataset
      Queries with clicks on Wikipedia
      TrueKnowledge
      Recent queries
    • 142. Data set
      Same as Billion Triples Challenge 2009 data set
      Blank nodes are encoded as URIs
      A data set combining crawls of multiple semantic search engines
      doesn’t necessarily match the current state of the Web
      doesn’t necessarily match the coverage of any particular search engine
      Final dataset
    • 143. Collecting the results
      Submissions via semsearch.yahoo.com
      NTNU, IIIT Hyderabad, DERI, U of Delaware, DA-IICT
      max. 3 submissions per team per track
      Pooling of results
      Top 20 results are evaluated
      Despite validation, still problems
      e.g. N-Triples encoded URIs, lowercased URIs
      Collecting triples for each result
      All triples where the URI is the subject
      Discarded URIs that didn’t appear as subject
      Rendering result display
      Values are clipped at 300 chars (last # or / for object-properties)
      RDF built-ins shown first
      Preference to English language values
    • 144. semsearch.yahoo.com
    • 145. Assessment with Amazon Mechanical Turk
      Evaluation using non-expert judges
      Paid $0.2 per 12 results
      Typically done in 1-2 minutes
      $6-$12 an hour
      Sponsored by the European SEALS project
      Each result is evaluated by 5 workers
      Workers are free to choose how many tasks they do
      Makes agreement difficult to compute
      Number of tasks completed per worker (2010)
    • 146. Evaluation form
    • 147. Evaluation form
    • 148. Catching the bad guys
      Payment can be rejected for workers who try to game the system
      An explanation is commonly expected, though cheaters rarely complain
      We opted to mix control questions into the real results
      Gold-win cases that are known to be perfect
      Gold-lose cases that are known to be bad
      Metrics
      Avg. and std. dev. on gold-win and gold-lose results
      Time to complete
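      A sketch of how the gold-question metrics might be computed per worker. The rating scale (1 to 3, higher is better), the record layout, and the rule for flagging a worker are all assumptions; the slides only state that averages and standard deviations on gold results were used.

```python
from statistics import mean, pstdev

def worker_gold_stats(judgments):
    """judgments: list of (worker, kind, rating) with kind 'gold-win' or 'gold-lose'."""
    by_worker = {}
    for worker, kind, rating in judgments:
        ratings = by_worker.setdefault(worker, {"gold-win": [], "gold-lose": []})
        ratings[kind].append(rating)
    return {
        worker: {
            kind: ((mean(vals), pstdev(vals)) if vals else None)
            for kind, vals in ratings.items()
        }
        for worker, ratings in by_worker.items()
    }

def looks_suspicious(worker_stats):
    """Hypothetical rule: flag workers who rate known-bad results no lower than known-good ones."""
    win, lose = worker_stats["gold-win"], worker_stats["gold-lose"]
    if win is None or lose is None:
        return False  # not enough evidence either way
    return win[0] <= lose[0]

# Hypothetical judgments on control questions.
judgments = [
    ("w1", "gold-win", 3), ("w1", "gold-win", 3),
    ("w1", "gold-lose", 1), ("w1", "gold-lose", 1),
    ("w2", "gold-win", 2), ("w2", "gold-win", 1),
    ("w2", "gold-lose", 3), ("w2", "gold-lose", 2),
]
stats = worker_gold_stats(judgments)
flagged = sorted(w for w, s in stats.items() if looks_suspicious(s))
print(flagged)
```

      In practice such a flag would be combined with the time-to-complete metric before rejecting a payment.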
    • 149. Lessons learned
      Finding complex queries is not easy
      Query logs from web search engines contain simple queries
      Semantic Web search engines are not used by the general public
      Computing agreement is difficult
      Each judge evaluates a different number of items
      Result rendering is critically important
      The Semantic Web is not necessarily all DBpedia
      sub30-RES.3 40.6% DBpedia
      sub31-run3 93.8% DBpedia
      Follow up experiments validated the Mechanical Turk approach
      Blanco et al. Repeatable and Reliable Search System Evaluation using Crowd-Sourcing, SIGIR 2011
    • 150. Next steps
      Achieve repeatability
      Simplify our process and publish our tools
      Automate as much as possible… except the Turks ;)
      Web site for evaluation
      Continuous submission?
      Positioning compared to other evaluation campaigns
      TREC Entity Track
      Question Answering over Linked Data
      SEALS campaigns
      Join the discussion at semsearcheval@yahoogroups.com
    • 151. Demos