Yahoo Making The Web Searchable


Published in: Business, Technology

  • Designed for humans first and machines second, microformats are a set of simple, open data formats built upon existing and widely adopted standards. Instead of throwing away what works today, microformats intend to solve simpler problems first by adapting to current behaviors and usage patterns
  • As Vish discussed, SearchMonkey is all about building richer, more useful search results. Here are a few examples of Enhanced Results.
  • And it allows the user to add the movie directly to their online movie rental queue
  • [will be animated]
  • Results are good, but consider the ads: First ad says: Virgins. Looking for virgins? Find exactly what you want today. Second ad: Virgins. …Find cheap tickets for Virgins. Third ad: Adspam… these people buy Yahoo! traffic and sell it to Google.

    1. 1. Making the Web Searchable Peter Mika Researcher, Data Architect Yahoo! Research
    2. 2. Yahoo! Research (
    3. 3. Yahoo! Research Barcelona <ul><li>Established January, 2006 </li></ul><ul><li>Led by Ricardo Baeza-Yates </li></ul><ul><li>Research areas </li></ul><ul><ul><li>Web Mining </li></ul></ul><ul><ul><ul><li>content, structure, usage </li></ul></ul></ul><ul><ul><li>Distributed Web retrieval </li></ul></ul><ul><ul><li>Multimedia retrieval </li></ul></ul><ul><ul><li>NLP and Semantics </li></ul></ul>
    4. 4. Yahoo! by numbers (April, 2007) <ul><li>There are approximately 500 million users of Yahoo! branded services, meaning we reach 50 percent – or 1 out of every 2 users – online, the largest audience on the Internet (Yahoo! Internal Data). </li></ul><ul><li>Yahoo! is the most visited site online with nearly 4 billion visits and an average of 30 visits per user per month in the U.S. and leads all competitors in audience reach, frequency and engagement (comScore Media Metrix, US, Feb. 2007). </li></ul><ul><li>Yahoo! accounts for the largest share of time Americans spend on the Internet with 12 percent (comScore Media Metrix, US, Feb. 2007) and approximately 8 percent of the world’s online time (comScore WorldMetrix, Feb. 2007). </li></ul><ul><li>Yahoo! is the #1 home page with 85 million average daily visitors on Yahoo! homepages around the world, an increase of nearly 5 million visitors in a month (comScore WorldMetrix, Feb. 2007). </li></ul><ul><li>Yahoo!’s social media properties (Flickr, delicious, Answers, 360, Video, MyBlogLog, Jumpcut and Bix) have 115 million unique visitors worldwide (comScore WorldMetrix, Feb. 2007). </li></ul><ul><li>Yahoo! Answers is the largest collection of human knowledge on the Web with more than 90 million unique users and 250 million answers worldwide (Yahoo! Internal Data). </li></ul><ul><li>There are more than 450 million photos in Flickr in total and 1 million photos are uploaded daily. 80 percent of the photos are public (Yahoo! Internal Data). </li></ul><ul><li>Yahoo! Mail is the #1 Web mail provider in the world with 243 million users (comScore WorldMetrix, Feb. 2007) and nearly 80 million users in the U.S. (comScore Media Metrix, US, Feb. 2007) </li></ul><ul><li>Interoperability between Yahoo! Messenger and Windows Live Messenger has formed the largest IM community approaching 350 million user accounts (Yahoo! Internal Data). </li></ul><ul><li>Yahoo! 
Messenger is the most popular in time spent with an average of 50 minutes per user, per day (comScore WorldMetrix, Feb. 2007). </li></ul><ul><li>Nearly 1 in 10 Internet users is a member of a Yahoo! Group (Yahoo! Internal Data). </li></ul><ul><li>Yahoo! is one of only 26 companies to be on both the Fortune 500 list and Fortune’s “Best Place to Work” list (2006). </li></ul>
    5. 5. Agenda <ul><li>The Annotated Web </li></ul><ul><li>SearchMonkey </li></ul><ul><ul><li>Demo </li></ul></ul><ul><ul><li>Technology </li></ul></ul><ul><ul><ul><li>DataRSS format </li></ul></ul></ul><ul><ul><ul><li>Query language </li></ul></ul></ul><ul><li>Lessons learned </li></ul><ul><li>Toward Semantic Search </li></ul><ul><li>BOSS </li></ul><ul><ul><li>Build your Own Search Service </li></ul></ul><ul><li>Y!OS 1.0 </li></ul><ul><ul><li>Yahoo! Open Strategy </li></ul></ul>
    6. 6. The Annotated Web
    7. 7. Previously in search <ul><li>Horizontal search </li></ul><ul><ul><li>Yahoo… </li></ul></ul><ul><ul><li>Keyword-based indexing </li></ul></ul><ul><ul><li>Minimal natural language processing </li></ul></ul><ul><ul><li>Limited experiments with ontologies (query expansion) </li></ul></ul><ul><li>Vertical search </li></ul><ul><ul><li>e.g., Kelkoo </li></ul></ul><ul><ul><li>Faceted search, browsing </li></ul></ul><ul><ul><li>Fixed ontology </li></ul></ul><ul><li>Combinations </li></ul><ul><ul><li>Google Base, Google Co-op </li></ul></ul><ul><ul><ul><li>Web-scale, but fixed ontologies </li></ul></ul></ul><ul><ul><ul><li>Proprietary technology </li></ul></ul></ul><ul><li>Can we do better with the Semantic Web? </li></ul><ul><ul><li>Address the long tail of queries (88% of queries) </li></ul></ul><ul><ul><li>Use standard technology </li></ul></ul><ul><li>Not a new question. But the answer may be new. </li></ul>
    8. 8. Which Semantic Web? <ul><li>Two visions </li></ul><ul><ul><li>Data Web </li></ul></ul><ul><ul><ul><li>Bringing the content of databases to the Web ( </li></ul></ul></ul><ul><ul><ul><li>Rich data, heavyweight semantics </li></ul></ul></ul><ul><ul><ul><li>Deep Web </li></ul></ul></ul><ul><ul><li>Annotated Web </li></ul></ul><ul><ul><ul><li>Annotating the content of Web resources (documents, multimedia) </li></ul></ul></ul><ul><ul><ul><li>Simple data, lightweight semantics </li></ul></ul></ul><ul><ul><ul><li>Shallow Web </li></ul></ul></ul><ul><li>This presentation is about the Annotated Web. </li></ul>
    9. 9. Brief history of the Annotated Web <ul><li>1995: HTML meta tags </li></ul><ul><li>1996: Simple HTML Ontology Extensions (SHOE) </li></ul><ul><li>1998: RDF/XML </li></ul><ul><ul><li>RDF/XML in HTML </li></ul></ul><ul><ul><li>RDF linked from HTML </li></ul></ul><ul><li>2003: Web 2.0 </li></ul><ul><ul><li>Tagging </li></ul></ul><ul><ul><li>Microformats </li></ul></ul><ul><ul><li>Metadata in Wikipedia </li></ul></ul><ul><ul><li>Machine tags in Flickr </li></ul></ul><ul><li>2005: eRDF </li></ul><ul><li>2008: RDFa </li></ul>
    10. 10. HTML meta tags <ul><li><HTML> </li></ul><ul><li><HEAD profile=&quot;;> </li></ul><ul><li><META name=&quot; &quot; content=&quot; Peter Mika &quot;> </li></ul><ul><li><LINK rel=&quot;DC.rights copyright&quot; href=&quot; &quot; /> </li></ul><ul><li><LINK rel=&quot;meta&quot; type=&quot;application/rdf+xml&quot; title=&quot;FOAF&quot; </li></ul><ul><li> href= &quot; &quot;> </li></ul><ul><li></HEAD> </li></ul><ul><li>… </li></ul><ul><li></HTML> </li></ul>
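A minimal sketch of how a crawler might read this kind of head-level metadata, using only Python's standard `html.parser`. The sample page, the `DC.creator` name and the `example.org` URLs are placeholders (the real values were stripped from the slide), and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collect <meta name=... content=...> and <link rel=... href=...> pairs."""

    def __init__(self):
        super().__init__()
        self.meta = {}    # meta name -> content
        self.links = []   # (rel token, href) pairs

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"]] = (attrs.get("content") or "").strip()
        elif tag == "link":
            # rel may hold several space-separated tokens, e.g. "DC.rights copyright".
            for rel in (attrs.get("rel") or "").split():
                self.links.append((rel, attrs.get("href")))


def extract_head_metadata(html):
    parser = HeadMetadataParser()
    parser.feed(html)
    return parser.meta, parser.links


# Hypothetical page: names and URLs are placeholders, not the slide's originals.
SAMPLE = """
<html><head>
<meta name="DC.creator" content=" Peter Mika ">
<link rel="DC.rights copyright" href="http://example.org/rights" />
<link rel="meta" type="application/rdf+xml" title="FOAF" href="http://example.org/foaf.rdf">
</head><body></body></html>
"""

meta, links = extract_head_metadata(SAMPLE)
```

Splitting `rel` into tokens keeps multi-valued relations such as `DC.rights copyright` usable as two separate link types.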
    11. 11. SHOE example (Heflin & Hendler, 1996) <ul><li><ONTOLOGY &quot;our-ontology&quot; VERSION=&quot;1.0&quot;> </li></ul><ul><li><ONTOLOGY-EXTENDS &quot;organization-ontology&quot; VERSION=&quot;2.1&quot; PREFIX=&quot;org&quot; URL=&quot;;> </li></ul><ul><li><ONTDEF CATEGORY=&quot;Person&quot; ISA=&quot;org.Thing&quot;> </li></ul><ul><li><ONTDEF RELATION=&quot;lastName&quot; ARGS=&quot;Person STRING&quot;> </li></ul><ul><li><ONTDEF RELATION=&quot;firstName&quot; ARGS=&quot;Person STRING&quot;> </li></ul><ul><li><ONTDEF RELATION=&quot;marriedTo&quot; ARGS=&quot;Person Person&quot;> </li></ul><ul><li><ONTDEF RELATION=&quot;employee&quot; ARGS=&quot;org.Organization Person&quot;> </li></ul><ul><li></ONTOLOGY > </li></ul><HEAD> <META HTTP-EQUIV=&quot;Instance-Key&quot; CONTENT=&quot;;> <USE-ONTOLOGY &quot;our-ontology&quot; VERSION=&quot;1.0&quot; PREFIX=&quot;our&quot; URL=&quot;;> </HEAD> <BODY> <CATEGORY &quot;our.Person&quot;> <RELATION &quot;our.marriedTo&quot; TO=&quot;;> <RELATION &quot;our.employee&quot; FROM=&quot;;> My name is <ATTRIBUTE &quot;our.firstName&quot;> George </ATTRIBUTE> <ATTRIBUTE &quot;our.lastName&quot;> Cook </ATTRIBUTE> and I live at...
    12. 12. SHOE system
    13. 13. SHOE Text-based query interface
    14. 14. SHOE Graphical Query Interface
    15. 15. Example: Creative Commons <ul><li>Embedding CC license in HTML (now deprecated): </li></ul><HTML> <HEAD>… </HEAD> <BODY> … <!–- <rdf:RDF xmlns=&quot;; xmlns:dc=&quot;; xmlns:rdf=&quot;;> <Work rdf:about=&quot;;> <dc:title>The Law of Averages</dc:title> <dc:description>...because eventually i&apos;ll be right...</dc:description> <license rdf:resource=&quot;; /> </Work> <License rdf:about=&quot;;> <requires rdf:resource=&quot;; /> <permits rdf:resource=&quot;; /> <permits rdf:resource=&quot;; /> <prohibits rdf:resource=&quot;; /> </License> </rdf:RDF> -->
    16. 16. Example: Creative Commons <ul><li>Current: rel attribute (HTML4) </li></ul>This work is licensed under a <a rel=&quot;license&quot; href=&quot;;>Creative Commons Attribution 3.0 United States License</a>. <ul><li>Use of the “rel” attribute for semantic annotation is the birth of the microformat… </li></ul>
    17. 17. Example: microformats <cite class=&quot; vcard &quot;> <a class=&quot; fn url &quot; rel=&quot;friend colleague met&quot; href=&quot;;> Eric Meyer </a> </cite> wrote a post ( <cite> <a href=&quot;;> Tax Relief </a></cite> ) about an unintentionally humorous letter he received from the <span class=&quot; vcard &quot;> <a class=&quot; fn org url &quot; href=&quot;;> Internal Revenue Service </a> </span>. <div class=&quot; vcard &quot;> <a class=&quot; email fn &quot; href=&quot;;> Joe Friday </a> <div class=&quot; tel &quot;> +1-919-555-7878 </div> <div class=&quot; title &quot;> Area Administrator, Assistant </div> </div>
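The hCard markup above can be read back with very little code. The following is a rough sketch, not a conformant microformats parser: it assumes well-nested markup, handles only a handful of hCard properties, and uses a placeholder `mailto:` address because the slide's link targets were stripped. `HCardParser` and its helpers are hypothetical names.

```python
import re
from html.parser import HTMLParser

HCARD_PROPERTIES = {"fn", "org", "tel", "title", "email", "url"}
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # never get an end tag


class HCardParser(HTMLParser):
    """Build one dict per element whose class list contains 'vcard'."""

    def __init__(self):
        super().__init__()
        self.cards = []    # finished cards
        self._open = []    # stack of currently open elements
        self._card = None  # card being filled, or None outside a vcard

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        starts_card = self._card is None and "vcard" in classes
        if starts_card:
            self._card = {}
        # Property class names only count inside a vcard.
        props = classes & HCARD_PROPERTIES if self._card is not None else set()
        self._open.append({"starts_card": starts_card, "props": props,
                           "href": attrs.get("href"), "text": []})

    def handle_data(self, data):
        # Text belongs to every open element that carries a property.
        for element in self._open:
            if element["props"]:
                element["text"].append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._open:
            return
        element = self._open.pop()
        text = re.sub(r"\s+", " ", "".join(element["text"])).strip()
        for prop in element["props"]:
            # url and email take their value from href when one is present.
            uses_href = prop in {"url", "email"} and element["href"]
            self._card[prop] = element["href"] if uses_href else text
        if element["starts_card"]:
            self.cards.append(self._card)
            self._card = None


SAMPLE = """
<div class="vcard">
  <a class="email fn" href="mailto:joe@example.org">Joe Friday</a>
  <div class="tel">+1-919-555-7878</div>
  <div class="title">Area Administrator, Assistant</div>
</div>
"""

parser = HCardParser()
parser.feed(SAMPLE)
cards = parser.cards
```

Note that the parser needs built-in knowledge of the hCard property names, which illustrates the point that each microformat carries its own vocabulary-specific syntax.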
    18. 18. microformats <ul><li> </li></ul><ul><li>Originated by Tantek Çelik and others </li></ul><ul><li>Agreements on the way to encode certain kinds of metadata in HTML </li></ul><ul><ul><li>Reuse of semantic-bearing HTML elements </li></ul></ul><ul><ul><li>Based on existing standards </li></ul></ul><ul><ul><li>Community process </li></ul></ul><ul><ul><li>Persons, events, listings etc., but also syntactic metadata: licenses, tags </li></ul></ul><ul><li>Microformats have no shared syntax </li></ul><ul><ul><li>Each microformat has a separate syntax tailored to the vocabulary </li></ul></ul><ul><li>Microformats are not ontologies </li></ul><ul><ul><li>No formal descriptions of schema, only text </li></ul></ul><ul><ul><li>Limited reuse, extensibility of schemas </li></ul></ul><ul><ul><li>No datatypes </li></ul></ul><ul><li>No namespaces, unique identifiers (URIs) </li></ul><ul><ul><li>no interlinking </li></ul></ul><ul><ul><li>mapping between instances is required </li></ul></ul><ul><li>Relationship to page context is unclear </li></ul><ul><li>Widely used in millions of documents </li></ul><ul><ul><li>User-generated as well as automatically generated </li></ul></ul>
    19. 19. Example: tags and machine tags
    20. 20. Example: Tags and machine tags <ul><li>Tags </li></ul><ul><ul><li>User defined keywords </li></ul></ul><ul><ul><li>Minimal agreement </li></ul></ul><ul><ul><ul><li>Is ‘rock’ on Flickr the same as ‘rock’ on myspace? </li></ul></ul></ul><ul><ul><ul><li>Is ‘rock’ by me on Flickr the same as ‘rock’ by you on Flickr? </li></ul></ul></ul><ul><ul><ul><li>Is ‘rock’ by me on Flickr today the same as ‘rock’ by me on myspace tomorrow? </li></ul></ul></ul><ul><li>Machine tags </li></ul><ul><ul><li>User defined values for user defined properties </li></ul></ul><ul><ul><li>Possibility to define the namespace (but not enforced) </li></ul></ul><ul><ul><li>Limited use </li></ul></ul>
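To make the difference between plain tags and machine tags concrete, here is a small sketch that splits a Flickr-style machine tag of the form `namespace:predicate=value` into its parts and treats everything else as an ordinary tag. The regular expression is a simplification and `parse_tag` is a hypothetical helper, not part of any Flickr API.

```python
import re
from typing import NamedTuple, Optional

# Simplified pattern for namespace:predicate=value; an assumption, not Flickr's exact grammar.
MACHINE_TAG = re.compile(
    r"^(?P<namespace>[A-Za-z][A-Za-z0-9_]*)"
    r":(?P<predicate>[A-Za-z][A-Za-z0-9_]*)"
    r"=(?P<value>.*)$"
)


class MachineTag(NamedTuple):
    namespace: str
    predicate: str
    value: str


def parse_tag(tag: str) -> Optional[MachineTag]:
    """Return a MachineTag, or None when the tag is just a plain keyword."""
    match = MACHINE_TAG.match(tag.strip())
    if match is None:
        return None
    # Surrounding double quotes around the value are dropped (a design choice of this sketch).
    value = match.group("value").strip('"')
    return MachineTag(match.group("namespace"), match.group("predicate"), value)


plain = parse_tag("rock")
geo = parse_tag("geo:lat=41.39")
quoted = parse_tag('dc:title="Sagrada Familia"')
```

Nothing here checks that `geo` or `dc` refer to an agreed vocabulary, which mirrors the slide's point that the namespace can be defined but is not enforced.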
    21. 21. RDF-based annotation #1: eRDF <ul><li>eRDF </li></ul><ul><ul><li>Ian Davis (Talis) </li></ul></ul><ul><ul><li>Embedding RDF in HTML </li></ul></ul><ul><ul><ul><li>Straightforward mapping to RDF triples (XSLT available) </li></ul></ul></ul><ul><ul><ul><li>HTML4 compatible </li></ul></ul></ul><ul><ul><li>More complex than microformats </li></ul></ul><ul><ul><ul><li>Use any RDF/OWL vocabulary </li></ul></ul></ul><ul><ul><ul><li>Reuse of semantic-bearing HTML elements is limited </li></ul></ul></ul><ul><ul><li>More limited than RDF </li></ul></ul><ul><ul><ul><li>No blank nodes </li></ul></ul></ul><ul><ul><ul><li>No data types </li></ul></ul></ul><ul><ul><ul><li>No statements about subjects other than the current document </li></ul></ul></ul><ul><ul><li>Limited usage </li></ul></ul>
    22. 22. RDF-based annotation #2: RDFa <ul><li>RDFa </li></ul><ul><ul><li>World Wide Web Consortium (W3C) last call document </li></ul></ul><ul><ul><li>Similar intent to eRDF, but full RDF support </li></ul></ul><ul><ul><ul><li>Requires XHTML </li></ul></ul></ul><ul><ul><li>Big question: user complexity (→ data quality) </li></ul></ul><p typeof=&quot;contact:Info&quot; about=&quot;;> <span property=&quot;contact:fn&quot;> Jo Smith </span>. <span property=&quot;contact:title&quot;> Web hacker </span> at <a rel=&quot;contact:org&quot; href=&quot;;> </a>. You can contact me <a rel=&quot;contact:email&quot; href=&quot;;> via email </a>. </p> ...
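The following sketch shows how the `about`, `typeof`, `property` and `rel` attributes in the example map onto triples. It is deliberately naive and is not a conformant RDFa processor: it ignores prefix resolution, datatypes, chaining and void elements, and it assumes well-nested markup. The `example.org` URIs stand in for the ones stripped from the slide, and `SimpleRDFaParser` is a hypothetical name.

```python
from html.parser import HTMLParser


class SimpleRDFaParser(HTMLParser):
    """Turn about/typeof/property/rel attributes into (subject, predicate, object) tuples."""

    def __init__(self, base):
        super().__init__()
        self.triples = []
        self._subjects = [base]  # innermost subject is last; the page itself is the default
        self._open = []          # bookkeeping for each open element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        pushed = bool(attrs.get("about"))
        if pushed:
            self._subjects.append(attrs["about"])
        subject = self._subjects[-1]
        if attrs.get("typeof"):
            self.triples.append((subject, "rdf:type", attrs["typeof"]))
        if attrs.get("rel") and attrs.get("href"):
            # rel + href yields a resource-valued statement.
            self.triples.append((subject, attrs["rel"], attrs["href"]))
        self._open.append({"pushed": pushed, "subject": subject,
                           "property": attrs.get("property"), "text": []})

    def handle_data(self, data):
        for element in self._open:
            if element["property"]:
                element["text"].append(data)

    def handle_endtag(self, tag):
        if not self._open:
            return
        element = self._open.pop()
        if element["property"]:
            # property + element text yields a literal-valued statement.
            literal = " ".join("".join(element["text"]).split())
            self.triples.append((element["subject"], element["property"], literal))
        if element["pushed"]:
            self._subjects.pop()


SAMPLE = """
<p typeof="contact:Info" about="http://example.org/staff/jo">
  <span property="contact:fn"> Jo Smith </span>.
  <span property="contact:title"> Web hacker </span> at
  <a rel="contact:org" href="http://example.org">Example.org</a>.
  You can contact me <a rel="contact:email" href="mailto:jo@example.org">via email</a>.
</p>
"""

parser = SimpleRDFaParser(base="http://example.org/page")
parser.feed(SAMPLE)
triples = parser.triples
```

Unlike the hCard case, nothing in this reader is specific to the `contact` vocabulary; the same code would extract statements for any prefix.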
    23. 23. SearchMonkey
    24. 24. <ul><li>Creating an ecosystem of publishers, developers and end-users </li></ul><ul><ul><li>Motivating and helping publishers to implement semantic annotation </li></ul></ul><ul><ul><li>Providing tools for developers to create compelling applications </li></ul></ul><ul><ul><li>Focusing on end-user experience </li></ul></ul><ul><li>Rich abstracts as a first application </li></ul><ul><li>Addressing the long tail of query and content production </li></ul><ul><li>Standard Semantic Web technology </li></ul><ul><ul><li>dataRSS = Atom + RDFa </li></ul></ul><ul><ul><li>Industry standard vocabularies </li></ul></ul><ul><li> </li></ul>SearchMonkey
    25. 25. Before After an open platform for using structured data to build more useful and relevant search results What is SearchMonkey?
    26. 26. image deep links name/value pairs or abstract Enhanced Result
    27. 27. YAHOO! CONFIDENTIAL | Infobar
    28. 28. SearchMonkey’s database Index RDF/Microformat Markup site owners/publishers share structured data with Yahoo!. 1 consumers customize their search experience with Enhanced Results or Infobars 3 site owners & third-party developers build SearchMonkey apps. 2 DataRSS feed Web Services Page Extraction’s Web Pages
    29. 29. Developer tool
    30. 30. Developer tool
    31. 31. Developer tool
    32. 32. Developer tool
    33. 33. Developer tool
    34. 34. Gallery
    35. 35. Example apps <ul><li>LinkedIn </li></ul><ul><ul><li>hCard plus feed data </li></ul></ul><ul><li>Creative Commons by Ben Adida </li></ul><ul><ul><li>CC in RDFa </li></ul></ul>
    36. 36. Example apps. II. <ul><li>Other me by Dan Brickley </li></ul><ul><ul><li>Google Social Graph API wrapped using a Web Service </li></ul></ul>
    37. 37. DataRSS <ul><li><?profile ?> </li></ul><ul><li><feed xmlns:xsi=&quot;; </li></ul><ul><li>xsi:schemaLocation=&quot; ../latest/xsd/datarss.xsd“> </li></ul><ul><li><id></id> </li></ul><ul><li><author> </li></ul><ul><li><name>Peter Mika (</name> </li></ul><ul><li></author> </li></ul><ul><li><title>Example data feed for social</title> </li></ul><ul><li><updated>2007-11-14T04:05:06+07:00</updated> </li></ul><ul><li><entry> </li></ul><ul><li><!-- title field of entry is not used for anything --> </li></ul><ul><li><title>Peter Mika</title> </li></ul><ul><li><!--URL of the webpage extracted from --> </li></ul><ul><li><id></id> </li></ul><ul><li><updated>2007-11-14T04:05:06+07:00</updated> </li></ul><ul><li><content type=&quot;application/xml&quot;> </li></ul><ul><ul><ul><li><y:adjunct version=&quot;1.0&quot; name=&quot;social-simple&quot; xmlns:y=&quot;;> </li></ul></ul></ul><ul><ul><ul><li><y:item rel=&quot;dc:subject&quot;> </li></ul></ul></ul><ul><ul><ul><li><y:type typeof=&quot;foaf:Person&quot;> </li></ul></ul></ul><ul><ul><ul><li><y:meta property=&quot;foaf:name&quot;>John Doe</y:meta> </li></ul></ul></ul><ul><ul><ul><li><y:meta property=&quot;foaf:gender&quot;>male</y:meta> </li></ul></ul></ul><ul><ul><ul><li><y:item rel=&quot;foaf:homepage&quot; resource=&quot;;/> </li></ul></ul></ul><ul><ul><ul><li><y:item rel=&quot;foaf:mbox&quot; resource=&quot;;/> </li></ul></ul></ul><ul><ul><ul><li><y:item rel=&quot;foaf:weblog&quot; resource=&quot;;/> </li></ul></ul></ul><ul><ul><ul><li><y:item rel=&quot;foaf:knows&quot;> </li></ul></ul></ul><ul><ul><ul><li><y:type typeof=&quot;foaf:Person&quot;> </li></ul></ul></ul><ul><ul><ul><li><y:meta property=&quot;foaf:name&quot;>Jane Doe</y:meta> </li></ul></ul></ul><ul><ul><ul><li><y:meta property=&quot;foaf:gender&quot;>female</y:meta> </li></ul></ul></ul><ul><ul><ul><li><y:item rel=&quot;foaf:mbox&quot; resource=&quot;;/> </li></ul></ul></ul><ul><ul><ul><li></y:type> </li></ul></ul></ul><ul><ul><ul><li></y:item> 
</li></ul></ul></ul><ul><ul><ul><li></y:type> </li></ul></ul></ul><ul><ul><ul><li></y:item> </li></ul></ul></ul><ul><ul><ul><li></y:adjunct> </li></ul></ul></ul><ul><ul><li></entry> </li></ul></ul><ul><li></feed> </li></ul>Atom 1.0 XML + RDFa
    38. 38. The data part <ul><li><adjunct version=&quot;1.0&quot; id=“; xmlns=&quot;“ </li></ul><ul><li>updated=“2007-11-14T04:05:06+07:00”> </li></ul><ul><li><item rel=&quot;dc:subject&quot;> </li></ul><ul><li><type typeof =&quot;foaf:Person&quot;> </li></ul><ul><li><meta property =&quot;foaf:name&quot;>John Doe</meta> </li></ul><ul><li><meta property=&quot;foaf:gender&quot;>male</meta> </li></ul><ul><li><item rel =&quot;foaf:homepage&quot; resource =&quot;;/> </li></ul><ul><li><item rel=&quot;foaf:mbox&quot; resource=&quot;;/> </li></ul><ul><li><item rel=&quot;foaf:weblog&quot; resource=&quot;;/> </li></ul><ul><li><item rel=&quot;foaf:knows&quot;> </li></ul><ul><li><type typeof=&quot;foaf:Person&quot;> </li></ul><ul><li><meta property=&quot;foaf:name&quot;>Jane Doe</meta> </li></ul><ul><li><meta property=&quot;foaf:gender&quot;>female</meta> </li></ul><ul><li><item rel=&quot;foaf:mbox&quot; resource=&quot;;/> </li></ul><ul><li></type> </li></ul><ul><li></item> </li></ul><ul><li></type> </li></ul><ul><li></item> </li></ul><ul><li></adjunct> </li></ul>
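One way to read the nesting in this adjunct is as a small tree of statements: an `item` links the current node to a resource or to an anonymous node, a `type` gives that node's class, and `meta` elements give its literal properties. The sketch below walks such a tree with `xml.etree.ElementTree` and emits triples. This reading is an interpretation of the slide, not the DataRSS specification; the sample omits the namespace and uses placeholder URIs because the originals were stripped, and `adjunct_to_triples` is a hypothetical helper.

```python
import itertools
import xml.etree.ElementTree as ET


def local_name(element):
    """Drop any '{namespace}' prefix that ElementTree adds to tag names."""
    return element.tag.rsplit("}", 1)[-1]


def adjunct_to_triples(xml_text, page_uri):
    root = ET.fromstring(xml_text.strip())
    blank_ids = itertools.count(1)
    triples = []

    def describe(node, container):
        """Read the children of `container` as statements about `node`."""
        for child in container:
            name = local_name(child)
            if name == "type":
                triples.append((node, "rdf:type", child.get("typeof")))
                describe(node, child)
            elif name == "meta":
                triples.append((node, child.get("property"), (child.text or "").strip()))
            elif name == "item":
                target = child.get("resource")
                if target is None:
                    # No resource attribute: treat the object as an anonymous node.
                    target = f"_:b{next(blank_ids)}"
                triples.append((node, child.get("rel"), target))
                if len(child):
                    describe(target, child)

    describe(page_uri, root)
    return triples


SAMPLE = """
<adjunct version="1.0">
  <item rel="dc:subject">
    <type typeof="foaf:Person">
      <meta property="foaf:name">John Doe</meta>
      <item rel="foaf:homepage" resource="http://example.org/john"/>
      <item rel="foaf:knows">
        <type typeof="foaf:Person">
          <meta property="foaf:name">Jane Doe</meta>
        </type>
      </item>
    </type>
  </item>
</adjunct>
"""

triples = adjunct_to_triples(SAMPLE, "http://example.org/page")
```

The generated `_:b1`, `_:b2` labels are local to one run; a real pipeline would also record the feed's source and timestamp alongside each statement.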
    39. 39. DataRSS <ul><li>An Atom extension for structured data </li></ul><ul><li>Why a new format? </li></ul><ul><ul><li>A feed format is required by publishers </li></ul></ul><ul><ul><ul><li>Exclusive content (e.g. partnerships, paid inclusion) </li></ul></ul></ul><ul><ul><ul><li>No changes necessary to the web page </li></ul></ul></ul><ul><ul><ul><li>No standard named graph format for the Semantic Web </li></ul></ul></ul><ul><ul><ul><ul><li>Needed to capture meta-metadata such as source and timestamp of information </li></ul></ul></ul></ul><ul><ul><li>Not really a new format </li></ul></ul><ul><ul><ul><li>An Atom extension </li></ul></ul></ul><ul><ul><ul><li>Use any RDFa parser to get the triples out </li></ul></ul></ul><ul><ul><ul><li>cf. Google Base feeds </li></ul></ul></ul>
    40. 40. What happened since the launch? <ul><li>It’s starting to work! </li></ul><ul><ul><li>Click rates improve → Publishers are willing to invest → More structured data → More applications → More users → Click rates improve </li></ul></ul><ul><li>Increasing excitement all around </li></ul><ul><ul><li>Standardization of RDFa is bringing new energy </li></ul></ul><ul><ul><li>Good market for companies that help publishers to ‘semantify’ or support developers in extracting structured data from web pages </li></ul></ul><ul><ul><ul><li>OpenCalais, Dapper, AdaptiveBlue, Intel MashMaker, Zemanta… </li></ul></ul></ul><ul><li>There have been some lessons learned… </li></ul>
    41. 41. Data quality <ul><li>Publishers/developers want the quick and dirty answer, not the long and clean one </li></ul><ul><li>Resource or literal? </li></ul><ul><ul><li><meta property="vcard:url"></meta> </li></ul></ul><ul><ul><li><meta property="vcard:tel">0034691792522</meta> </li></ul></ul><ul><li>Webpage or resource? </li></ul><ul><ul><li>Should we allow a resource to have the same URI as an existing webpage? </li></ul></ul><ul><ul><li>This is the default in eRDF/RDFa! </li></ul></ul><ul><ul><ul><li><div class="foaf-name">Peter Mika</div> </li></ul></ul></ul><ul><li>Types vs. datatypes </li></ul><ul><ul><li><meta property="use:email"></meta> </li></ul></ul><ul><li>Extensibility </li></ul><ul><ul><li>rdfs:movies </li></ul></ul>Complexity of the formalism = Data quality down
    42. 42. Vocabularies <ul><li>Coverage is small </li></ul><ul><ul><li>Books, movies, stuff people care about… </li></ul></ul><ul><li>Competing proposals </li></ul><ul><ul><li>Versions floating around </li></ul></ul><ul><li>Not maintained </li></ul><ul><ul><li>I cannot maintain your vocabulary for you </li></ul></ul><ul><li>Vocabularies for microformats </li></ul><ul><ul><li>A must </li></ul></ul><ul><li>The role of the W3C </li></ul><ul><ul><li>Ontologies as member submission…. </li></ul></ul><ul><li>Vocabularies not designed for the annotated Web </li></ul>Distributed ontology development = Mess
    43. 43. eRDF <ul><li>Difficult for complex pages and dangerous in non-expert hands </li></ul><ul><ul><li>Serious limitations </li></ul></ul><ul><ul><ul><li>No datatypes </li></ul></ul></ul><ul><ul><ul><li>No subjects other than identifiers within the current page </li></ul></ul></ul><ul><ul><ul><li>Reuse of the id attribute </li></ul></ul></ul><ul><ul><ul><ul><li><div id=“blue_panel”> </li></ul></ul></ul></ul><ul><ul><ul><ul><ul><li><span class=“foaf-name”>Peter Mika</span> </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><ul><li><div id=“white_clickable_thingy_on_the right”> </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><ul><li><span class=“foaf-mbox”></span> </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>… . </li></ul></ul></ul></ul></ul>
    44. 44. RDFa <ul><li>A huge improvement </li></ul><ul><ul><li>E.g. no repurposing of HTML attributes </li></ul></ul><ul><li>Still, not everything is intuitive to the uninitiated: </li></ul><ul><ul><li><div about="#id"> </li></ul></ul><ul><ul><li><span property="foaf:name">Peter Mika</span> </li></ul></ul><ul><ul><li><span rel="foaf:img" typeof="foaf:Image"> </li></ul></ul><ul><ul><li><span property="dc:format">jpg</span> </li></ul></ul><ul><ul><li>… </li></ul></ul><ul><ul><li></span> </li></ul></ul><ul><ul><li></div> </li></ul></ul><ul><ul><li><div about="#id"> </li></ul></ul><ul><ul><li><span property="foaf:name">Peter Mika</span> </li></ul></ul><ul><ul><li><span rel="foaf:img" resource=""> </li></ul></ul><ul><ul><li><span typeof="foaf:Image"> </li></ul></ul><ul><ul><li> <span property="dc:format">jpg</span> </li></ul></ul><ul><ul><li></span> </li></ul></ul><ul><ul><li></span> </li></ul></ul><ul><ul><li></div> </li></ul></ul>
    45. 45. Semantic Search
    46. 46. Semantics and IR <ul><li>Hard searches that cannot be solved with a purely syntactic approach </li></ul><ul><ul><li>Ontologies in IR shown to work in limited domains </li></ul></ul><ul><ul><li>In Web IR most attempts (e.g. query expansion) have failed </li></ul></ul><ul><li>What is new? The scale and breadth. </li></ul><ul><ul><li>Growth in annotations, all domains (Web 2.0) </li></ul></ul><ul><ul><li>Data Web vs. Deep Web </li></ul></ul>
    47. 47. Hard searches <ul><li>Ambiguous searches </li></ul><ul><ul><li>Paris Hilton </li></ul></ul><ul><li>Multimedia search </li></ul><ul><ul><li>Images of Paris Hilton </li></ul></ul><ul><li>Imprecise or overly precise searches </li></ul><ul><ul><li>Publications by Jim Hendler </li></ul></ul><ul><ul><li>Find images of strong and adventurous people (Lenat) </li></ul></ul><ul><li>Searches for descriptions </li></ul><ul><ul><li>Search for yourself without using your name </li></ul></ul><ul><ul><li>Product search (ads!) </li></ul></ul><ul><li>Searches that require aggregation </li></ul><ul><ul><li>Size of the Eiffel Tower (Lenat) </li></ul></ul><ul><ul><li>Public opinion on Britney Spears </li></ul></ul><ul><li>Queries that require a deeper understanding of the query, the content and/or the world at large </li></ul><ul><ul><li>Note: some of these are so hard that users don’t even try them any more </li></ul></ul>
    48. 48. Example…
    49. 49. Application: query intent Paris Hilton is a person!
    50. 50. Application: query intent #2 Hugo is a person!
    51. 51. Time to experiment! <ul><li>Future work in IR </li></ul><ul><ul><li>Ranking documents </li></ul></ul><ul><ul><li>Ranking resources/triples </li></ul></ul><ul><ul><ul><li>Ranking resources in the presence of documents </li></ul></ul></ul><ul><ul><li>New interfaces </li></ul></ul><ul><ul><ul><li>Query interface </li></ul></ul></ul><ul><ul><ul><li>Result presentation </li></ul></ul></ul><ul><li>Metrics </li></ul><ul><ul><li>Precision/recall </li></ul></ul><ul><ul><li>CTR </li></ul></ul><ul><ul><li>ROI </li></ul></ul>
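For reference, the first two metrics on this slide have simple set-based definitions. The sketch below computes precision and recall for one query from a list of retrieved document IDs and a set of relevant ones, plus click-through rate as clicks divided by impressions. The function names and the sample IDs are hypothetical; ROI is left out because it depends on cost and revenue figures the slides do not define.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant, for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def click_through_rate(clicks, impressions):
    """CTR = clicks / impressions; defined as 0.0 when nothing was shown."""
    return clicks / impressions if impressions else 0.0


# Hypothetical judgments for a single query.
precision, recall = precision_recall(
    retrieved=["doc_a", "doc_b", "doc_c", "doc_d"],
    relevant=["doc_a", "doc_c", "doc_e"],
)
ctr = click_through_rate(clicks=30, impressions=1000)
```

Precision and recall need relevance judgments, whereas CTR can be measured directly from search logs, which is why it lends itself to comparing plain and enhanced results at scale.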
    52. 52. Time to experiment! #2 <ul><li>Future work in Semantic Web </li></ul><ul><ul><li>(Semi-)automated ways of metadata creation (NLP!) </li></ul></ul><ul><ul><li>Entity resolution </li></ul></ul><ul><ul><li>Scale </li></ul></ul><ul><ul><li>Data quality </li></ul></ul><ul><ul><ul><li>We allow providing metadata for other people’s sites! </li></ul></ul></ul><ul><ul><li>Reasoning </li></ul></ul><ul><ul><ul><li>To the extent that it’s useful </li></ul></ul></ul><ul><li>Constraints </li></ul><ul><ul><li>Keyword-based search cannot suffer </li></ul></ul><ul><ul><ul><li>Preference for very conservative solutions </li></ul></ul></ul><ul><ul><li>Metadata in the context of documents </li></ul></ul><ul><ul><ul><li>Assumption is still that the user will want to see/visit the source </li></ul></ul></ul><ul><ul><ul><li>(what to do with linked data?) </li></ul></ul></ul>
    53. 53. Going even further… <ul><li>NLP, Information Extraction </li></ul><ul><li>New types of application </li></ul><ul><ul><li>Aggregators </li></ul></ul><ul><ul><li>Stateful apps </li></ul></ul><ul><ul><li>Intent-driven apps </li></ul></ul><ul><ul><li>Mobile apps </li></ul></ul><ul><ul><li>… </li></ul></ul><ul><li>Beyond search </li></ul><ul><ul><li>Analysis, design, diagnosis etc. on top of aggregated data </li></ul></ul><ul><li>Personalization </li></ul><ul><ul><li>Building rich user profiles </li></ul></ul><ul><ul><li>Re-ranking results based on interests </li></ul></ul><ul><li>Monetization </li></ul><ul><ul><li>No more “buy virgins on eBay” </li></ul></ul>
    54. 54. BOSS: Build your Own Search Service <ul><li>Unlimited queries per day </li></ul><ul><li>Ability to re-order results and blend in additional content </li></ul><ul><li>No restrictions on presentation </li></ul><ul><li>No branding or attribution </li></ul><ul><li>Access to multiple verticals (web search, image, news) </li></ul><ul><li>Ability to monetize </li></ul><ul><li>40+ supported language and region pairs </li></ul><ul><li> </li></ul>
    55. 55. Y!OS 1.0 <ul><li>Yahoo! Open Strategy </li></ul><ul><ul><li>Yahoo! Social Platform </li></ul></ul><ul><ul><ul><li>Profiles, Connections, Updates, Contacts and Status </li></ul></ul></ul><ul><ul><li>Yahoo! Query Language (YQL) </li></ul></ul><ul><ul><ul><li>Access (other) web services using a SQL-like language </li></ul></ul></ul><ul><ul><li>Yahoo! Application Platform (YAP) </li></ul></ul><ul><ul><ul><li>Developer hosted execution of applications with access to Yahoo's Social APIs and YQL; </li></ul></ul></ul><ul><ul><ul><li>Support for OpenSocial's JavaScript API ; and </li></ul></ul></ul><ul><ul><ul><li>Support for server-side YML tags. </li></ul></ul></ul><ul><ul><ul><li>Future: run applications on Y! sites </li></ul></ul></ul><ul><li>OpenID, OAuth </li></ul>
    56. 56. Contact <ul><li>Peter Mika </li></ul><ul><ul><li>[email_address] </li></ul></ul><ul><ul><li>Come to Barcelona and stop by </li></ul></ul><ul><li>SearchMonkey </li></ul><ul><ul><li> </li></ul></ul><ul><ul><li>mailing lists </li></ul></ul><ul><ul><ul><li>[email_address] </li></ul></ul></ul><ul><ul><ul><li>[email_address] </li></ul></ul></ul><ul><ul><li>forums </li></ul></ul><ul><ul><ul><li> </li></ul></ul></ul><ul><ul><ul><li> </li></ul></ul></ul><ul><ul><li>Semantic Web FAQ </li></ul></ul><ul><ul><ul><li> </li></ul></ul></ul>
    57. 57. the monkey is out!