Peter Mika's Presentation at SSSW 2011
  • In fact, some of these searches are so hard that users don’t even try them anymore.
  • With ads, the situation is even worse due to the sparsity problem. Note how poor the ads are…
  • Search is a form of content aggregation.
  • Semantic search can be seen as a retrieval paradigm centered on the use of semantics: it incorporates the semantics entailed by the query and/or the resources into the matching process.
  • Barcelona
  • Close to the topic of keyword search in databases, except that knowledge bases have a schema-oblivious design; different papers assume vastly different query needs, even on the same type of data.

Peter Mika's Presentation at SSSW 2011 Presentation Transcript

  • 1. Semantic Search Peter Mika Yahoo! Research
  • 2. Yahoo! serves over 680 million users in 25 countries
  • 3. Yahoo! Research: visit us at research.yahoo.com
  • 4. Yahoo! Research Barcelona
    • Established January, 2006
    • Led by Ricardo Baeza-Yates
    • Research areas
      • Web Mining
        • content, structure, usage
      • Social Media
      • Distributed Web retrieval
      • Geo information retrieval
      • Semantic search
  • 5. Search is really fast, without necessarily being intelligent
  • 6. Why Semantic Search? Part I
    • Improvements in IR are harder and harder to come by
      • Machine learning using hundreds of features
        • Text-based features for matching
        • Graph-based features provide authority
      • Heavy investment in computational power, e.g. real-time indexing and instant search
    • Remaining challenges are not computational, but in modeling user cognition
      • Need a deeper understanding of the query, the content and/or the world at large
      • Could Watson explain why the answer is Toronto?
  • 7. Poorly solved information needs
    • Multiple interpretations
      • paris hilton
    • Long tail queries
      • george bush (and I mean the beer brewer in Arizona)
    • Multimedia search
      • paris hilton sexy
    • Imprecise or overly precise searches
      • jim hendler
      • pictures of strong adventures people
    • Searches for descriptions
      • countries in africa
      • 32 year old computer scientist living in barcelona
      • reliable digital camera under 300 dollars
    Many of these queries would not be asked by users, who have learned over time what search technology can and cannot do.
  • 8. Dealing with sparse collections Note: don’t solve the sparsity problem where it doesn’t exist
  • 9. Contextual Search: content-based recommendations Hovering over an underlined phrase triggers a search for related news items.
  • 10. Contextual Search: personalization A machine-learning-based ‘search’ algorithm selects the main story and the three alternate stories based on the user’s demographics (age, gender etc.) and previous behavior. Display advertising is a similar top-1 search problem on the collection of advertisements.
  • 11. Contextual Search: new devices Show related content Connect to friends watching the same
  • 12. Aggregation across different dimensions Hyperlocal: showing content from across Yahoo that is relevant to a particular neighbourhood.
  • 13. Why Semantic Search? Part II
    • The Semantic Web is now a reality
      • Billions of triples
      • Thousands of schemas
      • Varying quality
    • End users
      • Keyword queries, not SPARQL
    • Searching data instead or in addition to searching documents
      • Providing direct answers (possibly with explanations)
      • Support for novel search tasks
  • 14. Direct answers in search [Screenshot callouts: information box with content from, and links to, Yahoo! Travel; points of interest in Vienna, Austria; since Aug 2010, ‘regular’ search results are ‘Powered by Bing’; products from Yahoo! Shopping]
  • 15. Novel search tasks
    • Aggregation of search results
      • e.g. price comparison across websites
    • Analysis and prediction
      • e.g. world temperature by 2020
    • Semantic profiling
      • recommendations based on particular interests
    • Semantic log analysis
      • understanding user behavior in terms of objects
    • Support for complex tasks (search apps)
      • e.g. booking a vacation using a combination of services
  • 16. Why Semantic Search? Part III
    • There is a business model
      • Publishers are (increasingly) interested in making their content searchable, linkable and easier to aggregate and reuse
      • So that social media sites and search engines can expose their content to the right users, in a rich and attractive form
    • This is about creating an ecosystem…
      • More advanced in some domains
      • In others, we still live in the tyranny of rate-limited APIs, licenses etc.
  • 17. This is not a business model http://en.wikipedia.org/wiki/Underpants_Gnomes
  • 18. Example: rNews
    • RDFa vocabulary for news articles
      • Easier to implement than NewsML
      • Easier to consume for news search and other readers, aggregators
    • Under development at the IPTC
      • March: v0.1 approved
      • Final version by Sept
  • 19. Example: Facebook’s Like and the Open Graph Protocol
    • The ‘Like’ button provides publishers with a way to promote their content on Facebook and build communities
      • Shows up in profiles and news feed
      • Site owners can later reach users who have liked an object
      • Facebook Graph API allows 3rd-party developers to access the data
    • Open Graph Protocol is an RDFa-based format for describing the object that the user ‘Likes’
  • 20. Example: Facebook’s Open Graph Protocol
    • RDF vocabulary to be used in conjunction with RDFa
      • Simplify the work of developers by restricting the freedom in RDFa
    • Activities, Businesses, Groups, Organizations, People, Places, Products and Entertainment
    • Only HTML <head> accepted
    • http://opengraphprotocol.org/
    <html xmlns:og="http://opengraphprotocol.org/schema/">
    <head>
      <title>The Rock (1996)</title>
      <meta property="og:title" content="The Rock" />
      <meta property="og:type" content="movie" />
      <meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
      <meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" />
      …
    </head>
    ...
  • 21. Semantic Search
  • 22. Semantic Search: a definition
    • Semantic search is a retrieval paradigm that
      • Makes use of the structure of the data or explicit schemas to understand user intent and the meaning of content
      • Exploits this understanding at some part of the search process
    • Combination of document and data retrieval
      • Documents with metadata
        • Metadata may be embedded inside the document
        • I’m looking for documents that mention countries in Africa.
      • Data retrieval
        • Structured data, but searchable text fields
        • I’m looking for directors , who have directed movies where the synopsis mentions dinosaurs.
    • Web search vs. vertical/enterprise/desktop search
    • Related fields:
      • XML retrieval
      • Keyword search in databases
      • Natural Language Retrieval
  • 23. Semantics at every step of the IR process [Diagram: the IR engine between the user’s query q and the documents of the Web — document processing, indexing, query interpretation, ranking θ(q,d) and result presentation can each make use of semantics.]
  • 24. Data on the Web
  • 25. Data on the Web
    • Most web pages on the Web are generated from structured data
      • Data is stored in relational databases (typically)
      • Queried through web forms
      • Presented as tables or simply as unstructured text
    • The structure and semantics (meaning) of the data is not directly accessible to search engines
    • Two solutions
      • Extraction using Information Extraction (IE) techniques (implicit metadata)
        • Supervised vs. unsupervised methods
      • Relying on publishers to expose structured data using standard Semantic Web formats (explicit metadata)
        • Particularly interesting for long tail content
  • 26. Information Extraction methods
    • Natural Language Processing
    • Extraction of triples
      • Suchanek et al. YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia, WWW, 2007.
      • Wu and Weld. Autonomously Semantifying Wikipedia, CIKM 2007.
    • Filling web forms automatically (form-filling)
      • Madhavan et al. Google's Deep-Web Crawl. VLDB 2008
    • Extraction from HTML tables
      • Cafarella et al. WebTables: Exploring the Power of Tables on the Web. VLDB 2008
    • Wrapper induction
      • Kushmerick et al. Wrapper Induction for Information Extraction. IJCAI 1997
  • 27. Semantic Web
    • Sharing data across the Web
      • Publish information in standard formats (RDF, RDFa)
      • Share the meaning using powerful, logic-based languages (OWL, RIF)
      • Query using standard languages and protocols (HTTP, SPARQL)
    • Two main forms of publishing
      • Linked Data
        • Data published as RDF documents linked to other RDF documents and/or using SPARQL end-points
        • Community effort to re-publish large public datasets (e.g. DBpedia, open government data)
      • RDFa
        • Data embedded inside HTML pages
        • Recommended for site owners by Yahoo, Google, Facebook
  • 28. Resource Description Framework (RDF)
    • Each resource (thing, entity) is identified by a URI
      • Globally unique identifiers
      • Locators of information
    • Data is broken down into individual facts
      • Triples of (subject, predicate, object)
    • A set of triples (an RDF graph) is published together in an RDF document
    [Diagram: an RDF document containing the triples (example:roi, name, “Roi Blanco”) and (example:roi, type, foaf:Person).]
  • 29. Linked Data: interlinked RDF documents [Diagram: Roi’s homepage describes example:roi with name “Roi Blanco”, type foaf:Person, worksWith example:peter and sameAs example:roi2; a Yahoo page describes example:peter with email “pmika@yahoo-inc.com” and type foaf:Person; both type links point to the Friend-of-a-Friend ontology.]
  • 30. RDFa: metadata embedded in HTML (from Roi’s homepage)
    …
    <p typeof="foaf:Person" about="http://example.org/roi">
      <span property="foaf:name">Roi Blanco</span>.
      <a rel="owl:sameAs" href="http://research.yahoo.com/roi">Roi Blanco</a>.
      You can contact him <a rel="foaf:mbox" href="mailto:roi@yahoo-inc.com">via email</a>.
    </p>
    …
  • 31. Crawling the Semantic Web
    • Linked Data
      • Similar to HTML crawling, but the crawler needs to parse RDF/XML (and other formats) to extract URIs to be crawled
      • Semantic Sitemap / VOID descriptions
    • RDFa
      • Same as HTML crawling, but data is extracted after crawling
      • Mika et al. Investigating the Semantic Gap through Query Log Analysis, ISWC 2010.
    • SPARQL endpoints
      • Endpoints are not linked, need to be discovered by other means
      • Semantic Sitemap / VOID descriptions
  • 32. Data fusion
    • Ontology alignment/matching
      • Widely studied in Semantic Web research
    • Entity resolution
      • Record linkage in databases
        • Machine learning based approaches, focusing on attributes
      • Graph-based approaches
        • e.g. the work of Lisa Getoor is applicable to RDF data
        • Improvements over only attribute based matching
    • Blending
      • Merging objects that represent the same real world entity and reconciling information from multiple sources
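The attribute-based side of entity resolution and blending described above can be sketched as follows. This is a toy illustration, not any of the cited systems: the `name`/`email` attributes, the string-similarity measure and the 0.85 threshold are all illustrative assumptions.

```python
from difflib import SequenceMatcher

def similar(a, b):
    """Illustrative string similarity between two attribute values."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def same_entity(r1, r2, threshold=0.85):
    """Attribute-based match: conflicting emails rule the pair out;
    otherwise decide on name similarity (threshold is an assumption)."""
    if r1.get("email") and r2.get("email") and r1["email"] != r2["email"]:
        return False
    return similar(r1["name"], r2["name"]) >= threshold

def blend(r1, r2):
    """Blending: merge two records for the same real-world entity,
    keeping r1's value when both sources supply an attribute."""
    merged = dict(r1)
    for k, v in r2.items():
        merged.setdefault(k, v)
    return merged

a = {"name": "Roi Blanco", "email": "roi@example.org"}
b = {"name": "roi blanco", "homepage": "http://example.org/roi"}
if same_entity(a, b):
    print(blend(a, b))
```

Real systems replace the hand-set threshold with the machine-learning or graph-based approaches listed above, which also exploit links between records rather than attributes alone.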
  • 33. Data quality
    • Data quality is a serious issue
      • Quality ranges from professional data to user-generated content
      • Mistakes in extraction or markup
      • Misuse of the ontology language (e.g. owl:sameAs)
      • Trust issues
        • Can we trust DBpedia.org to represent Wikipedia?
    • Quality assessment and data curation
      • Automated data validation
        • Against known-good data or using triangulation
        • Validation against the ontology or using probabilistic models
      • Manual validation by trained professionals or using crowd-sourcing
        • Sampling data for evaluation
      • Curation based on user feedback
  • 34. The role of ontologies in semantic search
    • Semantic Web data is extremely heterogeneous
      • Billion Triples Challenge dataset has 33,000 unique properties
    • Any search engine has a partial understanding of this data
      • A search engine may ‘support’ a handful of ontologies
      • The rest of the data is just a labeled graph
    • Reducing fragmentation in ontologies is critical for search
      • Decentralized ontology development doesn’t converge
      • schema.org is a step toward reducing the fragmentation
        • An agreement on the vocabularies supported by Bing/Google/Yahoo
    • Open research question whether reasoning is even useful to improve search
      • Mixed results in IR with query- and document expansion
  • 35. Query Interpretation
  • 36. Query Interpretation in Information Retrieval
    • Provide a higher level representation of queries in some conceptual space
      • Ideally, the same space in which documents are represented
      • Limited user involvement in traditional search engines
        • Examples: search assist, facets
    • Interpretation happens before the query is executed
      • Federation: determine where to send the query
      • Ranking feature
      • Deals with the sparseness of query streams
        • 88% of unique queries are singleton queries (Baeza-Yates et al.)
        • Spell correction (“we have also included results for…”), stemming
  • 37. Query interpretation in Semantic Search
    • Web search users will not formulate structured queries
      • Unable to formulate queries in SPARQL or DL
      • No knowledge of the schema of the data
      • No single ontology
        • Billion Triples Challenge dataset contains 33,000 properties
    • “Snap to grid”
      • Using the ontologies, a summary of the data or the whole data we find the most likely structured query matching the user input
        • Example: “starbucks nyc” -> company name:”Starbucks” in location:”New York City”
    • Larger user involvement
      • Guiding the user in constructing the query
        • Example: Freebase Suggest
      • Displaying back the interpretation of the query
        • Example: TrueKnowledge
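The “snap to grid” idea can be sketched as below: match query tokens against known entity and location names to produce the most likely structured interpretation. The gazetteers, field names and greedy matching are illustrative assumptions, not the implementation of any production engine.

```python
# Hypothetical gazetteers standing in for "the ontologies, a summary of
# the data or the whole data" mentioned above.
COMPANIES = {"starbucks": "Starbucks", "ibm": "IBM"}
LOCATIONS = {"nyc": "New York City", "barcelona": "Barcelona"}

def snap_to_grid(query):
    """Map a keyword query onto a structured query (field -> value);
    tokens that match nothing stay as plain keywords."""
    structured = {}
    for token in query.lower().split():
        if token in COMPANIES:
            structured["company_name"] = COMPANIES[token]
        elif token in LOCATIONS:
            structured["location"] = LOCATIONS[token]
        else:
            structured.setdefault("keywords", []).append(token)
    return structured

print(snap_to_grid("starbucks nyc"))
# {'company_name': 'Starbucks', 'location': 'New York City'}
```

A real interpreter would score many candidate interpretations against the data rather than take the first greedy match.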
  • 38. Indexing and Ranking
  • 39. Indexing
    • Search requires matching and ranking
      • Matching selects a subset of the elements to be scored
    • The goal of indexing is to speed up matching
      • Retrieval needs to be performed in milliseconds
      • Without an index, retrieval would require streaming through the collection
    • The type of index depends on the query language to support
      • SPARQL is a highly-expressive SQL-like query language for experts
        • DB-style indexing
      • End-users are accustomed to keyword queries with very limited structure (see Pound et al. WWW2010)
        • IR-style indexing
  • 40. DB-style indexing
    • Typical indexing in RDF triple stores
      • B+ trees to support the basic graph patterns
        • (s, p, o?), (?s, p, o), etc.
      • Query planning for efficient join processing
    • No ranking
    • Limited support to restrict results by keyword matching
      • SPARQL regular expression matching
        • { ?x dc:title ?title FILTER regex(?title, "web", "i") }
        • Implemented as a filtering of results
      • Magic predicates
        • Non-standard SPARQL extensions for full-text search
        • SELECT ?label WHERE { <abstract:> fts:exactMatch ?label . }
        • Implemented using the full-text search of the underlying DBMS or using a side-index (e.g. Lucene)
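The permutation-index idea behind DB-style triple stores can be sketched in a few lines. Plain dictionaries stand in for the B+ trees (only the SPO and POS orderings are shown), and `None` plays the role of a SPARQL variable; this is an illustrative sketch, not how any particular store is implemented.

```python
from collections import defaultdict

class TripleStore:
    def __init__(self):
        self.spo = defaultdict(lambda: defaultdict(set))  # s -> p -> {o}
        self.pos = defaultdict(lambda: defaultdict(set))  # p -> o -> {s}
        self.triples = set()

    def add(self, s, p, o):
        self.spo[s][p].add(o)
        self.pos[p][o].add(s)
        self.triples.add((s, p, o))

    def match(self, s=None, p=None, o=None):
        """Answer one basic graph pattern, picking the index whose
        sort order starts with a bound component."""
        if s is not None:                     # bound subject: SPO order
            return [(s, pp, oo) for pp, objs in self.spo[s].items()
                    for oo in objs if p in (None, pp) and o in (None, oo)]
        if p is not None:                     # bound predicate: POS order
            return [(ss, p, oo) for oo, subs in self.pos[p].items()
                    for ss in subs if o in (None, oo)]
        return [t for t in self.triples if o in (None, t[2])]  # scan

store = TripleStore()
store.add("ex:roi", "foaf:name", "Roi Blanco")
store.add("ex:roi", "rdf:type", "foaf:Person")
print(store.match(s="ex:roi", p="foaf:name"))
# [('ex:roi', 'foaf:name', 'Roi Blanco')]
```

A full engine adds the remaining orderings (e.g. OSP) and a join planner over multiple patterns; as the slide notes, nothing here produces a ranking.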
  • 41. IR-style indexing
    • Index data as text
      • Create virtual documents from data
      • One virtual document per subgraph, resource or triple
        • typically: resource
    • Key differences to Text Retrieval
      • RDF data is structured
      • Minimally, queries on property values are required
    • No join processing or limited join processing
      • e.g. tree-shaped queries
  • 42. Horizontal index structure
    • Two fields (indices): one for terms, one for properties
    • For each term, store the property on the same position in the property index
      • Positions are required even without phrase queries
    • Query engine needs to support the alignment operator
    • ✓ Dictionary is number of unique terms + number of properties
    • Occurrences is number of tokens * 2
  • 43. Vertical index structure
    • One field (index) per property
    • Positions are not required
      • But useful for phrase queries
    • Query engine needs to support fields
    • Dictionary is number of unique terms
    • Occurrences is number of tokens
    • ✗ Number of fields is a problem for merging, query performance
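For contrast, the vertical structure keeps one inverted index (field) per property, so a query on a single property touches only that field and no positional alignment is needed. Again an illustrative sketch with an assumed input format:

```python
def build_vertical(docs):
    """docs: {doc_id: [(property, term), ...]} ->
    {property: {term: set(doc_ids)}} — one field per property."""
    fields = {}
    for doc_id, tokens in docs.items():
        for prop, term in tokens:
            fields.setdefault(prop, {}).setdefault(term, set()).add(doc_id)
    return fields

docs = {1: [("foaf:name", "roi"), ("foaf:name", "blanco")],
        2: [("dc:title", "roi")]}
fields = build_vertical(docs)
print(fields["foaf:name"]["roi"])   # {1}
```

The slide's caveat shows up directly here: with tens of thousands of distinct properties, `fields` acquires one index per property, which hurts merging and query performance.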
  • 44. Ranking methods
    • The unit of retrieval is either an object or a sub-graph
      • The sub-graph induced by matching keywords in the query
    • Ranking methods from Information Retrieval
      • TF-IDF, BM25F, language models
      • Methods such as BM25F allow weighting by predicate
    • Machine learning is used to tune parameters and incorporate query-independent (static) features
      • Example: authority scores of datasets computed using PageRank (Harth et al. ISWC 2009)
    • Additional topics
      • Relevance feedback and personalization
      • Click-based ranking
      • De-duplication
      • Diversification
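The per-predicate weighting that BM25F-style methods allow can be sketched as below. This is a simplified scorer over per-property term frequencies; the field weights and the `k1` parameter are illustrative, and document-length normalization is omitted for brevity.

```python
import math

def bm25f_score(query_terms, doc_fields, field_weights, df, n_docs, k1=1.2):
    """doc_fields: {field: {term: tf}}. Combine per-field tfs into one
    weighted pseudo-frequency, then apply BM25-style saturation and IDF."""
    score = 0.0
    for term in query_terms:
        wtf = sum(field_weights.get(f, 1.0) * tfs.get(term, 0)
                  for f, tfs in doc_fields.items())
        if wtf == 0:
            continue                      # term absent from this document
        d = df.get(term, 0)
        idf = math.log((n_docs - d + 0.5) / (d + 0.5))
        score += idf * wtf / (k1 + wtf)   # saturating tf contribution
    return score

doc = {"foaf:name": {"roi": 1, "blanco": 1}, "dc:abstract": {"roi": 2}}
weights = {"foaf:name": 2.0, "dc:abstract": 0.5}  # boost name matches
s = bm25f_score(["roi"], doc, weights, df={"roi": 3}, n_docs=100)
```

In practice the weights (and the static features mentioned above) are tuned with machine learning rather than set by hand.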
  • 45. Evaluation
      • Harry Halpin, Daniel Herzig, Peter Mika, Jeff Pound, Henry Thompson, Roi Blanco, Thanh Tran Duc
  • 46. Semantic Search evaluation
    • Started at SemSearch 2010
    • Two tasks
      • Entity Search
        • Queries where the user is looking for a single real world object
        • Pound et al. Ad-hoc Object Retrieval in the Web of Data, WWW 2010.
      • List search (new in 2011)
        • Queries where the user is looking for a class of objects
    • Billion Triples Challenge 2009 dataset
    • Evaluated using Amazon’s Mechanical Turk
      • See Halpin et al. Evaluating Ad-Hoc Object Retrieval, IWEST 2010
    • Prizes sponsored by Yahoo! Labs (2011)
  • 47. Hosted by Yahoo! Labs at semsearch.yahoo.com
  • 48. Assessment with Amazon Mechanical Turk
      • Evaluation using non-expert judges
        • Paid $0.20 per 12 results
          • Typically done in 1-2 minutes
          • $6-$12 an hour
          • Sponsored by the European SEALS project
        • Each result is evaluated by 5 workers
      • Workers are free to choose how many tasks they do
        • Makes agreement difficult to compute
    Number of tasks completed per worker (2010)
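Because each worker judges a different, self-selected set of items, chance-corrected measures such as Fleiss' kappa do not apply directly; one simple fallback (an illustrative choice, not the measure used in the evaluation) is mean pairwise agreement over co-judged items:

```python
from itertools import combinations

def pairwise_agreement(judgments):
    """judgments: {item: {worker: label}} -> fraction of agreeing
    worker pairs, counted only where both judged the same item."""
    agree = total = 0
    for labels in judgments.values():
        for w1, w2 in combinations(sorted(labels), 2):
            total += 1
            agree += labels[w1] == labels[w2]
    return agree / total if total else 0.0

j = {"q1": {"a": "rel", "b": "rel", "c": "irr"},
     "q2": {"a": "rel", "b": "rel"}}
print(pairwise_agreement(j))   # 2 agreeing pairs out of 4 -> 0.5
```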
  • 49. Evaluation form
  • 50. Evaluation form
  • 51. Catching the bad guys
      • Payment can be rejected for workers who try to game the system
        • An explanation is commonly expected, though cheaters rarely complain
      • We opted to mix control questions into the real results
        • Gold-win cases that are known to be perfect
        • Gold-lose cases that are known to be bad
      • Metrics
        • Avg. and std. dev. on gold-win and gold-lose results
        • Time to complete
    Worker    | Known bad (N / mean) | Real (N / mean) | Known good (N / mean) | Total N | Time to complete (sec)
    badguy    | 20 / 2.556           | 200 / 2.738     | 20 / 2.684            | 240     | 29.6
    goodguy   | 13 / 1               | 130 / 2.038     | 13 / 3                | 156     | 95
    whoknows  | 1 / 1                | 21 / 1.571      | 2 / 3                 | 24      | 83.5
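The control-question idea can be sketched as follows: a worker whose average grade on known-bad (gold-lose) results is close to their average on known-good (gold-win) results is probably answering at random, like "badguy" in the table. The minimum-gap threshold is an illustrative assumption, not the cutoff used in the evaluation.

```python
from statistics import mean

def flag_worker(gold_win_grades, gold_lose_grades, min_gap=1.0):
    """Return True when the worker's grades on gold-win vs gold-lose
    results are too close, i.e. the work should be rejected."""
    gap = mean(gold_win_grades) - mean(gold_lose_grades)
    return gap < min_gap

print(flag_worker([3, 3, 3], [1, 1, 1]))    # False: behaves as expected
print(flag_worker([2.7, 2.7], [2.5, 2.6]))  # True: grades everything alike
```

Time to complete (the other metric above) catches the same workers from a different angle: 29.6 seconds per task versus 95 for a diligent worker.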
  • 52. Lessons learned
    • Finding complex queries is not easy
      • Query logs from web search engines contain simple queries
      • Semantic Web search engines are not used by the general public
    • Computing agreement is difficult
      • Each judge evaluates a different number of items
    • Result rendering is critically important
    • The Semantic Web is not necessarily all DBpedia
      • sub30-RES.3 40.6% DBpedia
      • sub31-run3 93.8% DBpedia
    • Follow up experiments validated the Mechanical Turk approach
      • Blanco et al. Repeatable and Reliable Search System Evaluation using Crowd-Sourcing, SIGIR2011
  • 53. Related work
    • TREC Entity Track
      • Document retrieval + extraction
    • Question Answering over Linked Data
      • Translation of natural language questions to SPARQL
    • SEALS campaigns
      • Interactive search
    • Join the discussion at semsearcheval@yahoogroups.com
  • 54. Search interface
  • 55. Search Interface
    • Semantic Search brings improvements in
      • Snippet generation
      • Navigation of the domain
      • Adaptive and interactive presentation
        • Presentation adapts to the kind of query and results presented
        • Object results can be actionable, e.g. buy this product
        • User can provide feedback on particular objects or data sources
      • Aggregated search
        • Grouping similar items, summarizing results in various ways
        • Filtering (facets), possibly across different dimensions
      • Query and task templates
        • Help the user to fulfill the task by placing the query in a task context
  • 56. Snippet generation using metadata
    • Yahoo displays enriched search results for pages that contain microformat or RDFa markup using recognized ontologies
      • Displaying data, images, video
      • Example: GoodRelations for products
      • Enhanced results also appear for sites from which we extract information ourselves
    • Also used for generating facets that can be used to restrict search results by object type
      • Example: “Shopping sites” facet for products
    • Documentation and validator for developers
      • http://developer.search.yahoo.com
    • Formerly: SearchMonkey allowed developers to customize the result presentation and create new ones for any object type
    • Haas et al. Enhanced Results in Web Search. SIGIR 2011
  • 57. Example: Yahoo! Enhanced Results Enhanced result with deep links, rating, address.
  • 58. Example: Yahoo! Vertical Intent Search Related actors and movies
  • 59. Example: Sig.ma Semantic Information Mashup (DERI)
  • 60. Future work in Semantic Web Search
    • Information Extraction for semantic search
    • Data quality assessment
    • Reasoning for search
    • Efficient processing of hybrid queries
    • Data display and visualization
    • Mobile search
    • Semantic advertising
    • Search log and website traffic analysis
  • 61. The End
    • Many thanks for contributions by members of the SemSearch group at Yahoo! Research in Barcelona
    • Contact
      • [email_address]
      • Internships available for PhD students (deadline in January)