Semantic Search Summer School 2009


Published in: Education, Technology
  • Designed for humans first and machines second, microformats are a set of simple, open data formats built upon existing and widely adopted standards. Instead of throwing away what works today, microformats intend to solve simpler problems first by adapting to current behaviors and usage patterns.
  • In fact, some of these searches are so hard that users don’t even try them anymore.
  • Results are good, but consider the ads: First ad says: Virgins. Looking for virgins? Find exactly what you want today. Second ad: Virgins. …Find cheap tickets for Virgins. Third ad: Adspam… these people buy Yahoo! traffic and sell it to Google.
  • SW: representing and reasoning with structured data on the Web; both a relational and a graph view on information. IR: aggregating information at the document level based on ad-hoc information needs. DB: representing and querying information in a relational model. NLP: from text to information. One reference to Semantic Search.
  • Entity-independent measures:
      M1: probability of fix given type
      M2: probability of fix given type, normalized by probability of fix (the more uncommon the fix, the better)
      M3: binary entropy function

    1. 1. Making the Web Searchable
        Peter Mika, Researcher and Data Architect, Yahoo! Research
    2. 2. Yahoo! Research (
    3. 3. Yahoo! Research Barcelona
        • Established January 2006
        • Led by Ricardo Baeza-Yates
        • Research areas
          • Web Mining
            • content, structure, usage
          • Distributed Web retrieval
          • Multimedia retrieval
          • NLP and Semantics
    4. 4. Yahoo! by numbers (April, 2007)
        • There are approximately 500 million users of Yahoo! branded services, meaning we reach 50 percent, or 1 out of every 2 users, online, the largest audience on the Internet (Yahoo! Internal Data).
        • Yahoo! is the most visited site online with nearly 4 billion visits and an average of 30 visits per user per month in the U.S. and leads all competitors in audience reach, frequency and engagement (comScore Media Metrix, US, Feb. 2007).
        • Yahoo! accounts for the largest share of time Americans spend on the Internet with 12 percent (comScore Media Metrix, US, Feb. 2007) and approximately 8 percent of the world’s online time (comScore WorldMetrix, Feb. 2007).
        • Yahoo! is the #1 home page with 85 million average daily visitors on Yahoo! homepages around the world, an increase of nearly 5 million visitors in a month (comScore WorldMetrix, Feb. 2007).
        • Yahoo!’s social media properties (Flickr, delicious, Answers, 360, Video, MyBlogLog, Jumpcut and Bix) have 115 million unique visitors worldwide (comScore WorldMetrix, Feb. 2007).
        • Yahoo! Answers is the largest collection of human knowledge on the Web with more than 90 million unique users and 250 million answers worldwide (Yahoo! Internal Data).
        • There are more than 450 million photos in Flickr in total and 1 million photos are uploaded daily. 80 percent of the photos are public (Yahoo! Internal Data).
        • Yahoo! Mail is the #1 Web mail provider in the world with 243 million users (comScore WorldMetrix, Feb. 2007) and nearly 80 million users in the U.S. (comScore Media Metrix, US, Feb. 2007).
        • Interoperability between Yahoo! Messenger and Windows Live Messenger has formed the largest IM community, approaching 350 million user accounts (Yahoo! Internal Data).
        • Yahoo! Messenger is the most popular in time spent with an average of 50 minutes per user, per day (comScore WorldMetrix, Feb. 2007).
        • Nearly 1 in 10 Internet users is a member of a Yahoo! Group (Yahoo! Internal Data).
        • Yahoo! is one of only 26 companies to be on both the Fortune 500 list and Fortune’s “Best Place to Work” list (2006).
    5. 5. Agenda
        • Publishing metadata on the Semantic Web
          • A brief history of publishing metadata in HTML
        • Semantic Search research and applications
    6. 6. The many faces of the Semantic Web
        • Six ways of publishing RDF
          • Linked RDF files (linked data)
          • Metadata inside webpages
          • SPARQL endpoints
          • Feeds
          • XSLT/GRDDL
          • Automated tools
        • Non-exclusive, but in practice most publishers choose one
    7. 7. Option 1: Standalone RDF documents
        • RDF documents linked to other RDF documents
          • Use rdfs:seeAlso to point to a related document
            • It says: go and look at that document if you want to know more
        • Advantages:
          • No change to the publishing of the HTML documents
          • Data can be published by a third party
        • Tools
          • RDB-to-RDF mappers such as D2RQ or Triplify
          • Linked Data browsers
        • Examples: most datasets in the Linked Data cloud
        [Figure: three linked RDF documents, each showing a graph with nodes #PeterM, #Bud and #Hun, literals “Peter Mika”, “Budapest” and “2,000,000”, and edges born, label, capital-of and population]
    8. 8. Option 1 (continued)
        • For discovery, the metadata is often linked from HTML pages
          • <link rel="meta" type="application/rdf+xml" title="FOAF" href="…" />
        • Additional advantages:
          • Discovery from the webpage
          • It’s clear that the metadata is a machine representation of the human-targeted content of the page
        • Examples: FOAF profiles, BestBuy
        [Figure: a web page with the text “Peter Mika was born in Budapest.” linked to an RDF graph with nodes #PeterM, #Bud and #Hun, literals “Peter Mika”, “Budapest” and “2,000,000”, and edges born, label, capital-of and population]
    9. 9. Option 2: Metadata inside web pages
        • Using microformats, RDFa, Microdata (more later)
        • Advantages:
          • No content negotiation required
          • No separate database export required
          • Browser plug-in friendly
          • Search engine friendly
          • Copy-paste friendly
        • Tools:
          • XML editors (e.g. Oxygen)
          • Triplr
          • RDFa Distiller
          • RDFa bookmarklet
          • Ubiquity RDFa plugin
          • Optimus microformat parser
        • Examples: many, including SlideShare, YouTube, LinkedIn, Digg, Myspace, Facebook…
        [Figure: two web pages with the text “Peter Mika was born in Budapest.”, each carrying an embedded RDF graph with nodes #PeterM, #Bud and #Hun, literals “Peter Mika”, “Budapest” and “2,000,000”, and edges born, label, capital-of and population]
    10. 10. Option 3: SPARQL endpoints
        • Query access to your RDF database
          • Similar to exposing your database on the Web and giving someone read-only SQL access
        • Advantages:
          • Most flexible and best performing access from a consumer perspective
        • Tools:
          • Triple stores (Oracle, Virtuoso, Sesame, Jena, OWLIM etc.)
          • RDB-to-RDF mappers such as D2RQ and Triplify
        [Figure: RDF graph with nodes #PeterM, #Bud and #Hun, literals “Peter Mika”, “Budapest” and “2,000,000”, and edges born, label, capital-of and population]
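A consumer talks to a SPARQL endpoint over plain HTTP by sending the query text as a `query` parameter. The sketch below only builds such a request URL; it sends nothing. The endpoint address, the `ex:` namespace and the property names (`ex:born`, `ex:label`, modelled loosely on the slide's graph) are all illustrative assumptions, not a real service or vocabulary.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Hypothetical endpoint; replace with the address of a real triple store.
ENDPOINT = "http://example.org/sparql"

# Hypothetical vocabulary: who was born in the resource ex:Bud, and what is their label?
QUERY = """
PREFIX ex: <http://example.org/vocab#>
SELECT ?person ?name WHERE {
  ?person ex:born ex:Bud ;
          ex:label ?name .
}
"""


def build_request_url(endpoint, query):
    """Return a GET URL that carries the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query.strip()})


def extract_query(url):
    """Decode the SPARQL text back out of a request URL (round-trip check)."""
    return parse_qs(urlparse(url).query)["query"][0]


if __name__ == "__main__":
    print(build_request_url(ENDPOINT, QUERY))
```

Building the URL by hand keeps the example free of third-party dependencies; in practice a SPARQL client library would also handle result formats and error codes.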
    11. 11. Option 4: Feeds
        • The equivalent of a database dump
        • No standard feed format for RDF
        • Advantages
          • Submit your data without making it public
        • Yahoo! consumes:
          • DataRSS
          • GoogleBase feeds
          • NewsML
        • Submit your feed using SiteExplorer
        [Figure: a feed containing three copies of an RDF graph with nodes #PeterM, #Bud and #Hun, literals “Peter Mika”, “Budapest” and “2,000,000”, and edges born, label, capital-of and population]
    12. 12. Option 5: XSLT
        • Publish the transformation from HTML to structured data
          • GRDDL is a standard for linking an HTML page to a transformation that produces RDF data
        • Advantages
          • No change to the page
        • Disadvantages
          • Transformation needs to be executed to get to the data
        • Tools
          • Intel MashMaker
          • Dapper
          • Glue API from AdaptiveBlue
        [Figure: an XSLT transformation applied to a page to produce structured data]
    13. 13. Option 6: Automatic markup
        • Restricted mostly to tagging entities with identifiers
        • Advantages
          • Less manual effort
        • Disadvantages
          • Limited to finding relevant entities in text
        • Tools
          • OpenCalais
          • Zemanta API
        • Example
          • Input: Peter Mika was born in Budapest.
          • Output: <person>Peter Mika</person> was born in <location>Budapest</location>.
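The input/output pair on this slide can be imitated with a simple dictionary lookup. This is a toy sketch only: the gazetteer and the function name are made up for illustration, and services such as OpenCalais or Zemanta rely on statistical recognition and disambiguation rather than exact string matching.

```python
import re

# Hypothetical gazetteer mapping known entity names to a type label.
GAZETTEER = {"Peter Mika": "person", "Budapest": "location"}


def tag_entities(text, gazetteer):
    """Wrap every known entity name in a tag named after its type."""
    if not gazetteer:
        return text
    # Try longer names first so that a long name wins over a shorter one it contains.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))

    def wrap(match):
        label = gazetteer[match.group(0)]
        return f"<{label}>{match.group(0)}</{label}>"

    return pattern.sub(wrap, text)


if __name__ == "__main__":
    print(tag_entities("Peter Mika was born in Budapest.", GAZETTEER))
```

The lookup illustrates the disadvantage named on the slide: it can only find entities it already knows and says nothing about the relation between them.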
    14. 14. Example: Zemanta
        • A personal writing assistant for bloggers
          • Plugin for popular blogging platforms and web mail clients
        • Analyzes text as you type and suggests hyperlinks, tags, categories, images and related articles
        • API available with the same functionality
    15. 15. Metadata in HTML
    16. 16. Brief history of the Annotated Web
        • 1995: HTML meta tags
        • 1996: Simple HTML Ontology Extensions (SHOE)
        • 1998: RDF/XML
          • RDF/XML in HTML
          • RDF linked from HTML
        • 2003: Web 2.0
          • Tagging
          • Microformats
          • Metadata in Wikipedia
          • Machine tags in Flickr
        • 2005: eRDF
        • 2008: RDFa
        • 2009: Microdata (?)
    17. 17. HTML meta tags
        <HTML>
        <HEAD profile="…">
        <META name="…" content="Peter Mika">
        <LINK rel="DC.rights copyright" href="…" />
        <LINK rel="meta" type="application/rdf+xml" title="FOAF" href="…">
        </HEAD>
        …
        </HTML>
    18. 18. SHOE example (Heflin & Hendler, 1996)
        <ONTOLOGY "our-ontology" VERSION="1.0">
        <ONTOLOGY-EXTENDS "organization-ontology" VERSION="2.1" PREFIX="org" URL="…">
        <ONTDEF CATEGORY="Person" ISA="org.Thing">
        <ONTDEF RELATION="lastName" ARGS="Person STRING">
        <ONTDEF RELATION="firstName" ARGS="Person STRING">
        <ONTDEF RELATION="marriedTo" ARGS="Person Person">
        <ONTDEF RELATION="employee" ARGS="org.Organization Person">
        </ONTOLOGY>

        <HEAD>
        <META HTTP-EQUIV="Instance-Key" CONTENT="…">
        <USE-ONTOLOGY "our-ontology" VERSION="1.0" PREFIX="our" URL="…">
        </HEAD>
        <BODY>
        <CATEGORY "our.Person">
        <RELATION "our.marriedTo" TO="…">
        <RELATION "our.employee" FROM="…">
        My name is <ATTRIBUTE "our.firstName">George</ATTRIBUTE> <ATTRIBUTE "our.lastName">Cook</ATTRIBUTE> and I live at...
    19. 19. SHOE system
    20. 20. SHOE Text-based query interface
    21. 21. SHOE Graphical Query Interface
    22. 22. Example: Creative Commons
        • Embedding CC license in HTML (now deprecated):
        <HTML>
        <HEAD>… </HEAD>
        <BODY>
        …
        <!--
        <rdf:RDF xmlns="…" xmlns:dc="…" xmlns:rdf="…">
          <Work rdf:about="…">
            <dc:title>The Law of Averages</dc:title>
            <dc:description>...because eventually i'll be right...</dc:description>
            <license rdf:resource="…" />
          </Work>
          <License rdf:about="…">
            <requires rdf:resource="…" />
            <permits rdf:resource="…" />
            <permits rdf:resource="…" />
            <prohibits rdf:resource="…" />
          </License>
        </rdf:RDF>
        -->
    23. 23. Example: Creative Commons
        • Current: rel attribute (HTML4)
        This work is licensed under a <a rel="license" href="…">Creative Commons Attribution 3.0 United States License</a>.
        • Use of the “rel” attribute for semantic annotation is the birth of the microformat…
    24. 24. Microformats (μf)
        • Community centered around …
          • Specifications and discussions are hosted there
        • Agreements on the way to encode certain kinds of metadata in HTML
          • Reuse of semantic-bearing HTML elements
          • Based on existing standards
          • Minimality
        • Microformats exist for a limited set of objects
          • hCard (persons and organizations)
          • hCalendar (events)
          • hResume
          • hProduct
          • hRecipe
        • Varying degrees of support and stability
          • hCard and rel-tag are widely supported
    25. 25. Example: microformats
        <cite class="vcard">
          <a class="fn url" rel="friend colleague met" href="…">Eric Meyer</a>
        </cite>
        wrote a post (<cite><a href="…">Tax Relief</a></cite>) about an unintentionally humorous letter he received from the
        <span class="vcard">
          <a class="fn org url" href="…">Internal Revenue Service</a>
        </span>.

        <div class="vcard">
          <a class="email fn" href="…">Joe Friday</a>
          <div class="tel">+1-919-555-7878</div>
          <div class="title">Area Administrator, Assistant</div>
        </div>
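Because microformats reuse ordinary class attributes, a consumer can recover the data with a plain HTML parser. The sketch below collects the text of every element whose class list contains `fn` (the hCard formatted name). It is a deliberately simplified illustration, not a conforming hCard parser: it does not check that the element sits inside a `vcard`, and the sample address in the usage lines is made up.

```python
from html.parser import HTMLParser


class HCardNameParser(HTMLParser):
    """Collect the text content of elements carrying the hCard 'fn' class."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._depth = 0   # > 0 while inside an 'fn' element (counts nested tags)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            # Nested element inside 'fn'. Simplification: unclosed void
            # elements such as <br> would break this depth count.
            self._depth += 1
        elif "fn" in classes:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.names.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


if __name__ == "__main__":
    sample = (
        '<div class="vcard">'
        '<a class="email fn" href="mailto:joe@example.org">Joe Friday</a>'
        '<div class="tel">+1-919-555-7878</div>'
        "</div>"
    )
    parser = HCardNameParser()
    parser.feed(sample)
    print(parser.names)
```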
    26. 26. Microformats: limitations
        • No shared syntax
          • Each microformat has a separate syntax tailored to the vocabulary
        • No formal schemas
          • Limited reuse, extensibility of schemas
          • Unclear which combinations are allowed
        • No datatypes
        • No namespaces, unique identifiers (URIs)
          • No interlinking
          • Mapping between instances is required
        • Relationship to page context is often unclear
    27. 27. RDFa
        • RDF-in-attributes
          • World Wide Web Consortium (W3C) recommendation for encoding RDF triples in HTML
          • Full RDF support
            • Recommendation specifies the algorithm for parsing the triples out of HTML
          • Requires XHTML in principle
            • In practice, no one cares
    28. 28. RDFa in a slide
        <p xmlns:rdfs="…" xmlns:foaf="…" typeof="foaf:Person" about="…">
          <span property="rdfs:label foaf:name">Jo Smith</span>.
          <span property="foaf:title">Web hacker</span> at
          <a rel="vcard:org" property="foaf:name" href="…">Acme Corp</a>.
          You can contact me <a rel="foaf:mbox" href="…">via email</a>.
        </p>
        ...
        Callouts:
        • Assign the prefixes rdfs and foaf to the RDFS and FOAF namespaces (as in XML, RDF/XML etc.)
        • Create a new resource of type foaf:Person
        • Assign a value to a property
        • Give it a URI
        • Link to another resource and assign a name to it
    29. 29. Microformats vs. RDFa
        • Choose microformats when you find a microformat that fits your needs (and it is supported by your favorite tool or search engine)
          • Microformats are the first option because they are simple
          • We support all major microformats, see the documentation
          • It’s a common misconception that RDFa requires XHTML: it doesn’t
        • If you find none that perfectly fits your needs, then you need RDFa
          • Microformats have a fixed schema: you cannot add your own attributes
        • Example: a social networking site with user profiles
          • VCard is a good candidate, but it doesn’t have a way to express the user’s social connections, for example
          • You either live without this, or go with RDFa
    30. 30. Keep an eye on HTML5
        • Currently under standardization at the W3C
          • Last Call in Fall 2009
        • Introduces Microdata
          • Similar to microformats
            • Some predefined vocabularies with central registration
          • Some of the flexibility of RDFa
          • Introduce new terms using reverse domain names or full URIs
        • Semantic HTML elements such as <time>, <video>, <article>…
    31. 31. Microdata example
        <div item="">
          <p>My name is <span itemprop="name">Neil</span>.</p>
          <p>My band is called <span itemprop="band">Four Parts Water</span>.
             I was born on <time itemprop="birthday" datetime="2009-05-10">May 10th 2009</time>.
             <img itemprop="image" src="me.png" alt="me">
          </p>
        </div>
    32. 32. The process of annotating with RDFa
        • Invest in familiarizing yourself with the RDFa syntax by reading the RDFa Primer
          • It is also highly recommended that you read the RDF Primer. RDF is the data model used by RDFa.
        • Choose a vocabulary from the SearchMonkey documentation that fits your needs
          • A vocabulary describes a set of types and attributes within a given domain
          • If you don’t find a good candidate, extend an existing one or create a new one
        • Annotate your page
          • Before you start, you might want to validate your page for (X)HTML conformance using the W3C’s (X)HTML Validator to reduce the chance of errors. Choose Document Type XHTML + RDFa.
          • No specific tool support. If you have an HTML or XML editor that supports DTDs, you will have syntax checking and highlighting.
          • Use the RDFa Distiller to check which data can be extracted from your page.
          • If you like, use the RDF Validator to visualize the resulting RDF graph.
        • Put the annotated page online. The data will be extracted the next time your page is crawled
          • No need to explicitly submit anything
          • No notification when your site is crawled
        • See … for new tools and APIs
    33. 33. Choosing a vocabulary
        • Look at SearchMonkey objects
          • Video, Games, Presentations, Events, News, Businesses, Products, Discussion
        • Search the Web or ask for advice on mailing lists
          • [email_address]
          • [email_address]
        • Wikis
          • …
          • …
        • Beware of people who claim to have the vocabulary of everything
          • Preferably you want something small and targeted
        • Never a 100% fit: you will need to introduce vocabulary terms (classes and properties)
          • Do not introduce new classes/properties in existing namespaces
          • Example: the namespace … is used by the FOAF project. Try not to introduce a new term without contacting the owner, i.e. the membership of the FOAF mailing list.
    34. 34. Advanced topic: creating a vocabulary
        • Get advice on methodology
          • … and …
        • Choose a namespace and a prefix
          • Give sensible names, e.g. name it after your site, but don’t call it searchmonkey
          • Namespace ends either with a slash or a hash
        • Create an RDF or OWL document describing your classes and properties
          • Use an ontology editor such as Protégé 4.0
          • Follow naming conventions
        • Publish your vocabulary
          • Make sure the URIs of your properties and classes are resolvable
            • E.g. myvocab:digicam should resolve to a document containing the definition of myvocab:digicam
        • Convince others to adopt your vocabulary
          • If you are in fishing, convince other fishing businesses
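The namespace rule on this slide (end with a slash or a hash) matters because a prefixed name such as myvocab:digicam is shorthand for the namespace URI with the local name appended. A minimal sketch of that expansion follows; the namespace URI is a placeholder assumption and the helper names are hypothetical.

```python
# Hypothetical prefix table; the URI is a placeholder, not a real vocabulary.
PREFIXES = {"myvocab": "http://example.org/myvocab#"}


def is_valid_namespace(uri):
    """A namespace should end with '/' or '#' so that appended terms stay separable."""
    return uri.endswith("/") or uri.endswith("#")


def expand(prefixed_name, prefixes):
    """Turn 'prefix:term' into the full URI that should resolve to the term's definition."""
    prefix, separator, local_name = prefixed_name.partition(":")
    if not separator or prefix not in prefixes:
        raise ValueError(f"unknown or missing prefix in {prefixed_name!r}")
    namespace = prefixes[prefix]
    if not is_valid_namespace(namespace):
        raise ValueError(f"namespace {namespace!r} should end with '/' or '#'")
    return namespace + local_name


if __name__ == "__main__":
    print(expand("myvocab:digicam", PREFIXES))
```

The resulting URI is the one that should be resolvable, per the "Publish your vocabulary" step.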
    35. 35. Exercise (
        • Explore data on the Web
          • Microformats
            • Search for pages on Yahoo using …
            • Try Operator Firefox Plug-in
            • Try Optimus
          • RDFa
            • Create yourself or search for pages on Yahoo using …
            • Try RDFa bookmarklet to highlight RDFa
            • Try RDFa Distiller to extract RDF from HTML
            • Try RDF Validator to visualize your RDF data
        • Mark up your webpage using RDFa
          • Use RDFa Distiller to test
        • Try automated annotation using Zemanta or OpenCalais
    36. 36. Semantic Search
    37. 37. Why semantic search? (P. Raghavan)
        • Old battles are won
          • Main driver of user perception of search quality used to be: precision of navigational queries
          • The technical prowess was about crawling and spam
          • The hard core was indexing and retrieval
        • Currently, the biggest bottlenecks in IR are not computational, but in modeling user cognition
          • If only we could find a computationally expensive way to solve the problem
            • then we should be able to make it go faster
    38. 38. Searches that show a ‘semantic gap’
        Queries that require a deeper understanding of the query, the content and/or the world at large
        • Ambiguous searches
          • Paris Hilton
        • Multimedia search
          • Images of Paris Hilton
        • Imprecise or overly precise searches
          • Publications by Jim Hendler
          • Find images of strong and adventurous people (Lenat)
        • Searches for descriptions
          • Search for yourself without using your name
          • Product search (ads!)
        • Searches that require aggregation
          • Size of the Eiffel Tower (Lenat)
          • Public opinion on Britney Spears
          • World temperature by 2020
    39. 39. Not just search
    40. 40. Semantic Search
        • Def.: matching the user’s query with the Web’s content at a conceptual level, often with the help of world knowledge
          • R. Guha, R. McCool: Semantic Search, WWW2003
        • Related disciplines
          • Semantic Web, IR, Databases, NLP, IE
        • As a field
          • ISWC/ESWC/ASWC, WWW, SIGIR
          • Exploring Semantic Annotations in Information Retrieval (ECIR08, WSDM09)
          • Semantic Search Workshop (ESWC08, WWW09)
          • Future of Web Search: Semantic Search (FoWS09)
    41. 41. Semantics at every step of the IR process
        [Figure: the IR engine between the user and the Web and the Semantic Web, with the components Document processing, Query processing, Ranking θ(q,d) and Search interface]
    42. 42. Document processing
        • Goal: provide a higher level representation of text in some conceptual space
        • Diverse methods
          • Document classification
          • Information Extraction
            • Named-entity recognition, word-sense disambiguation, semantic role labeling, wrapper induction, form filling, etc.
        • Previously: open source toolkits such as GATE, OpenNLP
          • Require an expert user to operate and train
        • NEW: Online tools
          • For non-expert users
          • Embed metadata inside documents and/or link entities in documents to the cloud
          • Often using existing metadata as background knowledge
    43. 43. Examples: online NLP
        • OpenCalais (Thomson Reuters)
          • Online interface and API for named-entity recognition, (partial) disambiguation and relationship extraction
          • HTML annotator (microformats), Firefox plug-in, WordPress plug-in, Yahoo Pipes service, etc.
          • Works best on the news domain
        • Zemanta
          • Zemanta is a ‘personal writing assistant’ that recognizes and disambiguates named entities and suggests related content
          • A Firefox plugin that extends the functionality of popular blogging and online email platforms, plus an online interface and API
          • Broad coverage but partial recognition
    44. 44. Examples: information extraction from templated HTML
        • Intel MashMaker
          • Create mashups based on information extracted from the page you are browsing (Firefox plugin)
          • Wrapper induction trained using manual annotations
          • Export the XSLT to be used in Yahoo’s SearchMonkey
        • Dapper
          • Semantic Advertising company
          • Wrapper induction trained using manual annotations (online tool)
          • API to the dynamic extraction
        • Glue
          • Social interaction around objects on the Web (Firefox plugin)
          • API provides access to the information extracted from popular sites
    45. 45. Query Interpretation
        • Provide a higher level representation of queries in some conceptual space
          • Ideally, the same space in which documents are represented
        • Queries may be keywords, questions, semi-structured, structured etc.
        • Interpretation treated as a separate step from ranking
          • Required for federation, i.e. determine where to send the query
          • Choosing between interpretations could be left to the user
          • Due to performance requirements
            • You cannot execute the query to determine what it means and then query again
        • General world knowledge (e.g. DBpedia), domain ontologies or the schema of the actual data can be used
    46. 46. Example: Semantic Search Assist
        • Observation: the same type of objects often have the same query context
          • Users asking for the same aspect of the type
        • Could we make query suggestions based on the type of the entity?
          • Improvement for infrequent queries
        • Example queries
          • apple ipod nano review
          • sony plasma tv review
          • jerry yang biography
          • biography tim berners lee
          • tim berners lee blog
          • peter mika yahoo
          • britney spears shaves her head
    47. 47. Models
        • Desirable properties:
          • P1: Fix is frequent within type
          • P2: Fix has frequencies well-distributed across entities
          • P3: Fix is infrequent outside of the type
        • Models:
        [Figure: the query “apple ipod nano review” split into entity (apple ipod nano) and fix (review), with type: product]
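The slide notes name three entity-independent measures (M1: probability of fix given type; M2: the same normalized by the probability of the fix; M3: binary entropy function) without giving formulas. The sketch below estimates M1 and M2 by relative frequency over a made-up query log. For M3 the notes do not say what the entropy is taken over; applying it to the share of a type's entities that occur with the fix is purely an assumption made here, as are the log, the type labels and all function names.

```python
import math

# Hypothetical query log of (entity, fix, type) triples, not real data.
QUERY_LOG = [
    ("apple ipod nano", "review", "product"),
    ("sony plasma tv", "review", "product"),
    ("sony plasma tv", "price", "product"),
    ("jerry yang", "biography", "person"),
    ("tim berners lee", "biography", "person"),
    ("tim berners lee", "blog", "person"),
    ("britney spears", "review", "person"),
]


def m1(log, fix, etype):
    """M1: P(fix | type), estimated as a relative frequency."""
    fixes_in_type = [f for _, f, t in log if t == etype]
    return fixes_in_type.count(fix) / len(fixes_in_type) if fixes_in_type else 0.0


def m2(log, fix, etype):
    """M2: P(fix | type) / P(fix); rarer fixes score higher."""
    p_fix = sum(1 for _, f, _ in log if f == fix) / len(log)
    return m1(log, fix, etype) / p_fix if p_fix else 0.0


def binary_entropy(p):
    """H(p) = -p log2 p - (1 - p) log2 (1 - p), with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def m3(log, fix, etype):
    """ASSUMPTION: entropy of the share of the type's entities seen with the fix."""
    entities = {e for e, _, t in log if t == etype}
    with_fix = {e for e, f, t in log if t == etype and f == fix}
    return binary_entropy(len(with_fix) / len(entities)) if entities else 0.0


if __name__ == "__main__":
    for etype in ("product", "person"):
        print(etype, m1(QUERY_LOG, "review", etype), m2(QUERY_LOG, "review", etype))
```

In this toy log "review" scores higher for products than for persons under both M1 and M2, which is the behaviour properties P1 and P3 ask for.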
    48. 48. Demo
    49. 49. Ranking
        • Goal: match the query representation to content representation
          • Again, possibly with the use of background knowledge
        • Methods depend on
          • the actual representation
            • Bag of words, NL parse trees, SPARQL…
          • the qualities of the queries and content
          • the use case
            • e.g. time available for ranking may vary from milliseconds to days
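For the simplest representation mentioned here, bag of words, matching query to content can be sketched as cosine similarity between term-count vectors. This is a generic textbook illustration, not the ranking function of any particular engine; the documents and function names are invented for the example.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Represent a text as lower-cased term counts."""
    return Counter(text.lower().split())


def cosine(query_vec, doc_vec):
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(count * doc_vec[term] for term, count in query_vec.items())
    norm_q = math.sqrt(sum(c * c for c in query_vec.values()))
    norm_d = math.sqrt(sum(c * c for c in doc_vec.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0


def rank(query, documents):
    """Return the documents ordered by decreasing similarity to the query."""
    query_vec = bag_of_words(query)
    return sorted(documents, key=lambda d: cosine(query_vec, bag_of_words(d)), reverse=True)


if __name__ == "__main__":
    docs = ["hilton hotel in paris", "britney spears news"]
    print(rank("paris hilton hotel", docs))
```

A bag of words cannot tell the hotel from the person, which is exactly the ambiguity the "semantic gap" slide points at; richer representations change the vectors, not the matching idea.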
    50. 50. Search interface
        • Goal is to facilitate the interaction between the user and the system
          • helping the user to formulate queries
          • presenting the results in an intelligent manner
        • Semantic-based representations allow novel interfaces
          • Adaptive presentation
            • presentation adapts to the kind of query and results presented
          • Aggregated search
            • Grouping similar items, summarizing results in various ways
            • Possibilities for filtering, possibly across different dimensions
          • Task completion
            • Help the user to fulfill the task by placing the query in a task context
    51. 51. Putting it all together: Semantic Search Engines
        • Natural Language search engines
          • Hakia
          • Powerset (now built into Bing)
          • TrueKnowledge
        • Structured data search engines
          • Searching open web data
            • Sindice
            • Sigma
          • Searching closed world data
            • Wolfram Alpha (closed structured data + computation)
        • Web search engines
          • Yahoo’s SearchMonkey
            • BOSS and YQL
          • Google’s Rich Snippets
    52. 52. SearchMonkey
        • An open platform for using structured data to build more useful and relevant search results
        • Creating an ecosystem of publishers, developers and end-users
          • Helping publishers to implement semantic annotation and motivating them by allowing them to customize their search results
          • Providing tools and APIs for developers to create compelling applications
          • Improving search result presentation for our end-users
        • Semantic Web technology
          • Support for a number of microformats, RDFa
          • RDF based data representation
          • Industry standard vocabularies (or use any of your own)
    53. 53. Enhanced Result: image, deep links, name/value pairs or abstract
    54. 54. A difficult problem! <ul><li>What if one would try to do this automatically? </li></ul><ul><ul><li>Document summarization </li></ul></ul><ul><ul><li>Page structure detection </li></ul></ul><ul><ul><li>Information Extraction </li></ul></ul><ul><ul><li>Image recognition </li></ul></ul><ul><ul><li>Link classification and ranking </li></ul></ul><ul><li>But one of those cases where a little semantics can go a long way… </li></ul>
    55. 55. SearchMonkey [diagram] 1. Site owners/publishers share structured data with Yahoo!. 2. Site owners & third-party developers build SearchMonkey apps. 3. Consumers customize their search experience with Enhanced Results or Infobars. Diagram labels: Web Pages, RDF/Microformat Markup, Page Extraction, Index, DataRSS feed, Web Services, database.
    56. 56. Example apps <ul><li>LinkedIn </li></ul><ul><ul><li>hCard plus feed data </li></ul></ul><ul><li>Creative Commons by Ben Adida </li></ul><ul><ul><li>CC in RDFa </li></ul></ul>
    57. 57. Example apps. II. <ul><li>Other me by Dan Brickley </li></ul><ul><ul><li>Google Social Graph API wrapped using a Web Service </li></ul></ul>
    58. 58. Google’s Rich Snippets <ul><li>Shares a subset of the features of SearchMonkey </li></ul><ul><ul><li>Encourages publishers to embed certain microformats and RDFa into webpages </li></ul></ul><ul><ul><ul><li>Currently reviews, people, products, business & organizations </li></ul></ul></ul><ul><ul><li>These are used to generate richer search results </li></ul></ul><ul><li>SearchMonkey is customizable </li></ul><ul><ul><li>Developers can develop applications themselves </li></ul></ul><ul><li>SearchMonkey is open </li></ul><ul><ul><li>Wide support for standard vocabularies </li></ul></ul><ul><ul><li>API access </li></ul></ul>
    59. 59. BOSS: Build your Own Search Service <ul><li>Ability to re-order results and blend in additional content </li></ul><ul><li>No restrictions on presentation </li></ul><ul><li>No branding or attribution </li></ul><ul><li>Access to multiple verticals (web search, image, news) </li></ul><ul><li>40+ supported language and region pairs </li></ul><ul><li>Pricing (BOSS) </li></ul><ul><ul><li>Pay-by-usage </li></ul></ul><ul><ul><li>10,000 queries a day still free </li></ul></ul><ul><ul><li>Serve any ads you want </li></ul></ul><ul><li>For more info, </li></ul>
    60. 60. BOSS API to structured data <ul><li>Simple HTTP GET calls, no authentication </li></ul><ul><ul><li>You need an Application ID: register at </li></ul></ul><ul><li>{query}?appid={appid}&format=xml&view=searchmonkey_feed </li></ul><ul><li>Restrict your query using special words </li></ul><ul><ul><li>{format} </li></ul></ul><ul><ul><ul><li>{format} is one of hcard, hcalendar, tag, adr, hresume etc. </li></ul></ul></ul><ul><ul><li> </li></ul></ul>
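The request pattern on the slide above can be sketched in PHP as follows. This is a minimal sketch under stated assumptions, not Yahoo!'s client code: the function name `boss_structured_url` and the base URL passed in are hypothetical (the slide's endpoint is not reproduced here), and treating the query as a path segment is read off the `{query}?appid=...` pattern. The parameters `appid`, `format=xml` and `view=searchmonkey_feed` come from the slide.

```php
<?php
// Sketch only: builds a BOSS-style GET URL following the slide's pattern
//   {query}?appid={appid}&format=xml&view=searchmonkey_feed
// $baseUrl is a hypothetical placeholder supplied by the caller.
function boss_structured_url(string $baseUrl, string $query, string $appId): string
{
    // Parameters named on the slide.
    $params = [
        'appid'  => $appId,
        'format' => 'xml',
        'view'   => 'searchmonkey_feed',
    ];
    // The query sits in the path, so it is percent-encoded with rawurlencode().
    return rtrim($baseUrl, '/') . '/' . rawurlencode($query)
         . '?' . http_build_query($params, '', '&', PHP_QUERY_RFC3986);
}

// Example call with a placeholder endpoint and Application ID.
echo boss_structured_url('https://boss.example.invalid/web', 'resume barcelona', 'YOUR_APP_ID'), "\n";
```

Keeping the endpoint as an argument means the same helper works once the real base URL and a registered Application ID are filled in.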
    61. 61. Demo: resume search <ul><li>Search pages with resume data and given keywords </li></ul><ul><ul><li> {keyword} </li></ul></ul><ul><li>Parse the results as DataRSS (XML) </li></ul><ul><li>Extract information and display using YUI </li></ul>
    62. 62. Demo
    63. 63. Yahoo Query Language (YQL) <ul><li>Query web APIs as virtual tables </li></ul><ul><ul><li>Mash-up data by joining tables </li></ul></ul><ul><ul><li>Add an API by adding a table definition </li></ul></ul><ul><li>Example: select my friends and sort by nickname </li></ul>
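Since a YQL statement travels as a URL parameter, the only client-side work is encoding it. The sketch below shows that step in isolation. The function name `yql_url` and the endpoint are hypothetical; `$root` is assumed to end in `q=`, mirroring how the PHP slides concatenate `$root . urlencode($yql) . '&format=...'`.

```php
<?php
// Sketch only: $root is a hypothetical endpoint prefix assumed to end in "q=".
function yql_url(string $root, string $yql, string $format = 'xml'): string
{
    // urlencode() escapes spaces, quotes and '=' so the statement survives as one parameter.
    return $root . urlencode($yql) . '&format=' . urlencode($format);
}

// The html table with url and xpath filters is used elsewhere in these slides;
// the target page here is a placeholder.
$yql = 'select * from html where url="http://example.com/" and xpath="//h1" limit 3';
echo yql_url('https://yql.example.invalid/v1/yql?q=', $yql, 'json'), "\n";
```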
    64. 64. PHP example: select the last 100 photos from Flickr with the word Austin <ul><li><?php </li></ul><ul><li>$url = &quot;;; </li></ul><ul><li>$q = &quot;select * from where text='Austin'&quot;; </li></ul><ul><li>$fmt = &quot;xml&quot;; </li></ul><ul><li>$x = simplexml_load_file($url.urlencode($q).&quot;&format=$fmt&quot;); </li></ul><ul><li>foreach($x->attributes('') as $k=>$v) { </li></ul><ul><li>$$k=(string)$v; </li></ul><ul><li>} </li></ul><ul><li>echo <<<EOB </li></ul><ul><li>$count photos fetched from </li></ul><ul><li>{$x->diagnostics->url} in </li></ul><ul><li>{$x->diagnostics->url['execution-time']} seconds<br> </li></ul><ul><li>EOB; </li></ul><ul><li>$flickr = &quot;;; </li></ul><ul><li>foreach($x->results->photo as $p) { </li></ul><ul><li>echo &quot;<img src=&quot;$flickr{$p['server']}/{$p['id']}_{$p['secret']}_s.jpg&quot;/> &quot;; </li></ul><ul><li>} </li></ul><ul><li>?> </li></ul>
    65. 65. YQL example ( source )
    66. 66. That’s all there is to it! <ul><li><?php </li></ul><ul><li>$root = ''; </li></ul><ul><li>$city = 'Barcelona'; </li></ul><ul><li>$loc = 'Barcelona'; </li></ul><ul><li>$yql = 'select * from html where url = ''.$city.'' and xpath=&quot;//div[@id='bodyContent']/p&quot; limit 3'; </li></ul><ul><li>$url = $root . urlencode($yql) . '&format=xml'; </li></ul><ul><li>$info = getstuff($url); </li></ul><ul><li>$info = preg_replace(&quot;/.*<results>|</results>.*/&quot;,'',$info); </li></ul><ul><li>$info = preg_replace(&quot;/<?xml version=&quot;1.0&quot;&quot;. </li></ul><ul><li>&quot; encoding=&quot;UTF-8&quot;?>/&quot;,'',$info); </li></ul><ul><li>$info = preg_replace(&quot;//&quot;,'',$info); </li></ul><ul><li>$info = preg_replace(&quot;/&quot;/wiki/&quot;,'&quot;',$info); </li></ul><ul><li>$yql = 'select * from where woeid in '. </li></ul><ul><li>'(select woeid from geo.places where text=&quot;'.$loc.'&quot;)'. </li></ul><ul><li>' | unique(field=&quot;description&quot;)'; </li></ul><ul><li>$url = $root . urlencode($yql) . '&format=json'; </li></ul><ul><li>$events = getstuff($url); </li></ul><ul><li>$events = json_decode($events); </li></ul><ul><li>foreach($events->query->results->event as $e){ </li></ul><ul><li>$evHTML.='<li><h3><a href=&quot;'.$e->ticket_url.'&quot;>'.$e->name.'</a></h3><p>'. </li></ul><ul><li>substr($e->description,0,100).'&hellip;</p></li>'; </li></ul><ul><li>} </li></ul><ul><li>$yql = 'select * from where photo_id in '. </li></ul><ul><li>'(select id from where woe_id in '. </li></ul><ul><li>'(select woeid from geo.places where text=&quot;'.$loc.'&quot;)) limit 16'; </li></ul><ul><li>$url = $root . urlencode($yql) . '&format=json'; </li></ul><ul><li>$photos = getstuff($url); </li></ul><ul><li>$photos = json_decode($photos); </li></ul><ul><li>foreach($photos->query->results->photo as $s){ </li></ul><ul><li>$src = &quot;http://farm{$s->farm}{$s->server}/&quot;. 
</li></ul><ul><li>&quot;{$s->id}_{$s->secret}_s.jpg&quot;; </li></ul><ul><li>$phHTML.='<li><a href=&quot;'.$s->urls->url->content.'&quot;><img alt=&quot;'. </li></ul><ul><li>$s->title.'&quot; src=&quot;'.$src.'&quot;></a></li>'; </li></ul><ul><li>} </li></ul><ul><li>$yql='select description from rss where '. </li></ul><ul><li>' url=&quot;;'; </li></ul><ul><li>$url = $root . urlencode($yql) . '&format=json'; </li></ul><ul><li>$weather = getstuff($url); </li></ul><ul><li>$weather = json_decode($weather); </li></ul><ul><li>$weHTML = $weather->query->results->item->description; </li></ul><ul><li>function getstuff($url){ </li></ul><ul><li>$curl_handle = curl_init(); </li></ul><ul><li>curl_setopt($curl_handle, CURLOPT_URL, $url); </li></ul><ul><li>curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2); </li></ul><ul><li>curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1); </li></ul><ul><li>$buffer = curl_exec($curl_handle); </li></ul><ul><li>curl_close($curl_handle); </li></ul><ul><li>if (empty($buffer)){ </li></ul><ul><li>return 'Error retrieving data, please try later.'; </li></ul><ul><li>} else { </li></ul><ul><li>return $buffer; </li></ul><ul><li>} </li></ul><ul><li>}?> </li></ul>
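The slide code above passes whatever `getstuff()` returns straight to `json_decode()`, including its plain-text error message. A slightly more defensive version of the event-rendering step might look like the sketch below. The function name `events_to_html` and the sample payload are invented for illustration. The `query -> results -> event` shape and the `name`, `ticket_url` and `description` fields follow the slide, and wrapping a lone result in an array is a precaution rather than documented behavior.

```php
<?php
// Sketch only: renders events from a YQL-style JSON response as HTML list items.
function events_to_html(string $json, int $maxLen = 100): string
{
    $data = json_decode($json);
    // json_decode() returns null for non-JSON input such as an error string.
    if (!isset($data->query->results->event)) {
        return '';
    }
    $events = $data->query->results->event;
    // Precaution: treat a single event object like a one-element list.
    if (!is_array($events)) {
        $events = [$events];
    }
    $html = '';
    foreach ($events as $e) {
        // Escape remote values before placing them in markup.
        $html .= '<li><h3><a href="' . htmlspecialchars((string)($e->ticket_url ?? '')) . '">'
               . htmlspecialchars((string)($e->name ?? '')) . '</a></h3><p>'
               . htmlspecialchars(substr((string)($e->description ?? ''), 0, $maxLen))
               . '&hellip;</p></li>';
    }
    return $html;
}

// Invented sample payload in the shape the slide code expects.
$sample = '{"query":{"results":{"event":[{"name":"Jazz night",'
        . '"ticket_url":"http://example.com/t","description":"Live music downtown"}]}}}';
echo events_to_html($sample, 10), "\n";
```

Returning an empty string on bad input lets the page render without the events block instead of raising a notice on a missing property.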
    67. 67. Exercise <ul><li>Build a SearchMonkey application </li></ul><ul><li>Try a few YQL queries </li></ul><ul><li>Find data on the Web using search or by browsing Linked Data </li></ul><ul><li>Build your own search engine using BOSS </li></ul>
    68. 68. Challenges <ul><li>Future work in Semantic Web </li></ul><ul><ul><li>(Semi-)automated ways of metadata creation </li></ul></ul><ul><ul><ul><li>How do we go from 5% to 95%? </li></ul></ul></ul><ul><ul><li>Data quality </li></ul></ul><ul><ul><ul><li>Can we trust statements people make about each other’s data? </li></ul></ul></ul><ul><ul><li>Reasoning </li></ul></ul><ul><ul><ul><li>To what extent is reasoning useful? </li></ul></ul></ul><ul><ul><ul><li>For example, how much would entity resolution or taxonomic reasoning help? </li></ul></ul></ul><ul><ul><li>Scale </li></ul></ul><ul><ul><ul><li>How do we exploit cluster computing techniques? </li></ul></ul></ul><ul><ul><ul><li>What is between databases and IR engines? </li></ul></ul></ul><ul><ul><li>Fostering social agreements </li></ul></ul><ul><ul><ul><li>How do we get people to reuse vocabularies? </li></ul></ul></ul>
    69. 69. Challenges <ul><li>Future work in IR </li></ul><ul><ul><li>Query interpretation </li></ul></ul><ul><ul><li>Ranking with metadata </li></ul></ul><ul><ul><li>Evaluation of semantic search </li></ul></ul><ul><ul><li>Personalization </li></ul></ul><ul><ul><li>Semantic ads </li></ul></ul><ul><li>Constraints </li></ul><ul><ul><li>Users still want to see a document </li></ul></ul><ul><ul><ul><li>Keyword-based search cannot suffer </li></ul></ul></ul><ul><ul><ul><li>Whole page relevance, monetization can only increase </li></ul></ul></ul><ul><ul><li>Established expectations </li></ul></ul><ul><ul><ul><li>Query entry </li></ul></ul></ul><ul><ul><ul><li>Result presentation </li></ul></ul></ul>
    70. 70. Contact <ul><li>Peter Mika </li></ul><ul><ul><li>[email_address] </li></ul></ul><ul><ul><li>Come to Barcelona and stop by </li></ul></ul><ul><ul><li>Ask about our internship program </li></ul></ul><ul><li>SearchMonkey </li></ul><ul><ul><li> </li></ul></ul><ul><ul><li>mailing lists </li></ul></ul><ul><ul><ul><li>[email_address] </li></ul></ul></ul><ul><ul><ul><li>[email_address] </li></ul></ul></ul><ul><ul><li>forums </li></ul></ul><ul><ul><ul><li> </li></ul></ul></ul>
    71. 71. the monkey is out!