Yahoo: Making The Web Searchable (Presentation Transcript)
Making the Web Searchable
Peter Mika, Researcher, Data Architect, Yahoo! Research
Yahoo! Research (research.yahoo.com)
Yahoo! Research Barcelona
Established January, 2006
Led by Ricardo Baeza-Yates
content, structure, usage
Distributed Web retrieval
NLP and Semantics
Yahoo! by numbers (April, 2007)
There are approximately 500 million users of Yahoo! branded services, meaning we reach 50 percent – or 1 out of every 2 users – online, the largest audience on the Internet (Yahoo! Internal Data).
Yahoo! is the most visited site online with nearly 4 billion visits and an average of 30 visits per user per month in the U.S. and leads all competitors in audience reach, frequency and engagement (comScore Media Metrix, US, Feb. 2007).
Yahoo! accounts for the largest share of time Americans spend on the Internet with 12 percent (comScore Media Metrix, US, Feb. 2007) and approximately 8 percent of the world’s online time (comScore WorldMetrix, Feb. 2007).
Yahoo! is the #1 home page with 85 million average daily visitors on Yahoo! homepages around the world, an increase of nearly 5 million visitors in a month (comScore WorldMetrix, Feb. 2007).
Yahoo!’s social media properties (Flickr, delicious, Answers, 360, Video, MyBlogLog, Jumpcut and Bix) have 115 million unique visitors worldwide (comScore WorldMetrix, Feb. 2007).
Yahoo! Answers is the largest collection of human knowledge on the Web with more than 90 million unique users and 250 million answers worldwide (Yahoo! Internal Data).
There are more than 450 million photos in Flickr in total and 1 million photos are uploaded daily. 80 percent of the photos are public (Yahoo! Internal Data).
Yahoo! Mail is the #1 Web mail provider in the world with 243 million users (comScore WorldMetrix, Feb. 2007) and nearly 80 million users in the U.S. (comScore Media Metrix, US, Feb. 2007)
Interoperability between Yahoo! Messenger and Windows Live Messenger has formed the largest IM community approaching 350 million user accounts (Yahoo! Internal Data).
Yahoo! Messenger is the most popular in time spent with an average of 50 minutes per user, per day (comScore WorldMetrix, Feb. 2007).
Nearly 1 in 10 Internet users is a member of a Yahoo! Group (Yahoo! Internal Data).
Yahoo! is one of only 26 companies to be on both the Fortune 500 list and Fortune’s “Best Place to Work” list (2006).
The Annotated Web
Toward Semantic Search
Build your Own Search Service
Yahoo! Open Strategy
The Annotated Web
Previously in search
Minimal natural language processing
Limited experiments with ontologies (query expansion)
e.g. shopping.com, Kelkoo
Faceted search, browsing
Google Base, Google Co-op
Web-scale, but fixed ontologies
Can we do better with the Semantic Web?
Address the long tail of queries (88% of queries)
Use standard technology
Not a new question. But the answer may be new.
Which Semantic Web?
Bringing the content of databases to the Web (linkeddata.org)
Rich data, heavyweight semantics
Annotating the content of Web resources (documents, multimedia)
<HEAD>
  <META HTTP-EQUIV="Instance-Key" CONTENT="http://www.cs.umd.edu/~george">
  <USE-ONTOLOGY "our-ontology" VERSION="1.0" PREFIX="our" URL="http://ont.org/our-ont.html">
</HEAD>
<BODY>
  <CATEGORY "our.Person">
  <RELATION "our.marriedTo" TO="http://www.cs.umd.edu/~helena">
  <RELATION "our.employee" FROM="http://www.cs.umd.edu">
  My name is <ATTRIBUTE "our.firstName"> George </ATTRIBUTE> <ATTRIBUTE "our.lastName"> Cook </ATTRIBUTE> and I live at...
This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/3.0/us/">Creative Commons Attribution 3.0 United States License</a>.
Use of the “rel” attribute for semantic annotation is the birth of the microformat…
Example: microformats
<cite class="vcard">
  <a class="fn url" rel="friend colleague met" href="http://meyerweb.com/">Eric Meyer</a>
</cite>
wrote a post (
<cite>
  <a href="http://meyerweb.com/eric/thoughts/2005/12/16/tax-relief/">Tax Relief</a>
</cite>
) about an unintentionally humorous letter he received from the
<span class="vcard">
  <a class="fn org url" href="http://irs.gov/">Internal Revenue Service</a>
</span>.
<div class="vcard">
  <a class="email fn" href="mailto:firstname.lastname@example.org">Joe Friday</a>
  <div class="tel">+1-919-555-7878</div>
  <div class="title">Area Administrator, Assistant</div>
</div>
Originated by Tantek Celik and others
Agreements on the way to encode certain kinds of metadata in HTML
Reuse of semantic-bearing HTML elements
Based on existing standards
Persons, events, listings, etc., but also syntactic metadata: licenses, tags
Microformats have no shared syntax
Each microformat has a separate syntax tailored to the vocabulary
Microformats are not ontologies
No formal descriptions of schema, only text
Limited reuse, extensibility of schemas
No namespaces, unique identifiers (URIs)
mapping between instances is required
Relationship to page context is unclear
Widely used in millions of documents
User-generated as well as automatically generated
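Because hCard is only an agreement on class names, the vcard markup in the example can be read back out with a small parser. The following is a minimal sketch in Python using only the standard library. The names `HCardParser` and `extract_hcards`, and the handful of properties handled, are illustrative assumptions, not part of any microformats specification or library. The sketch also assumes vcards are not nested.

```python
from html.parser import HTMLParser

TEXT_PROPS = {"fn", "org", "title", "tel"}   # value = element text
LINK_PROPS = {"url", "email"}                # value = href attribute (assumption)
VOID = {"br", "img", "hr", "input", "meta", "link"}  # elements with no end tag


class HCardParser(HTMLParser):
    """Collects one dict per element carrying class="vcard"."""

    def __init__(self):
        super().__init__()
        self.cards = []
        self._stack = []       # per open element: (starts_card, text_props)
        self._open_cards = 0   # >0 while inside a vcard

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        starts_card = "vcard" in classes
        if starts_card:
            self.cards.append({})
            self._open_cards += 1
        props = set()
        if self._open_cards:
            card = self.cards[-1]
            props = classes & TEXT_PROPS
            for p in props:
                card.setdefault(p, "")
            href = attrs.get("href")
            if href:
                for p in classes & LINK_PROPS:
                    card[p] = href
        self._stack.append((starts_card, props))

    def handle_data(self, data):
        if not self._open_cards:
            return
        card = self.cards[-1]
        # Text belongs to every enclosing element that declared a text property.
        for _, props in self._stack:
            for p in props:
                card[p] += data

    def handle_endtag(self, tag):
        if tag in VOID or not self._stack:
            return
        starts_card, _ = self._stack.pop()
        if starts_card:
            self._open_cards -= 1


def extract_hcards(html):
    """Return a list of dicts, one per vcard, with whitespace normalized."""
    parser = HCardParser()
    parser.feed(html)
    parser.close()
    return [{k: " ".join(v.split()) for k, v in c.items()} for c in parser.cards]
```

Note what the sketch cannot do: nothing in the markup says which vocabulary `fn` or `tel` belongs to, so the property names have to be hard-coded in the parser. That is the "no shared syntax, no namespaces" limitation listed above.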
Example: tags and machine tags
User defined keywords
Is ‘rock’ on Flickr the same as ‘rock’ on MySpace?
Is ‘rock’ by me on Flickr the same as ‘rock’ by you on Flickr?
Is ‘rock’ by me on Flickr today the same as ‘rock’ by me on MySpace tomorrow?
User defined values for user defined properties
Possibility to define the namespace (but not enforced)
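To make the difference between plain tags and machine tags concrete: Flickr writes a machine tag as `namespace:predicate=value`, while a plain tag is just a keyword. The sketch below, in Python, splits the two apart. The regular expression, the `MachineTag` type, and the `parse_tag` helper are illustrative assumptions, not Flickr's actual validation rules.

```python
import re
from typing import NamedTuple, Optional


class MachineTag(NamedTuple):
    namespace: str   # user-chosen, not enforced by the platform
    predicate: str   # user-defined property
    value: str       # user-defined value


# Assumed shape: namespace and predicate start with a letter; the value is free text.
_MACHINE_TAG = re.compile(r"^([A-Za-z][A-Za-z0-9_]*):([A-Za-z][A-Za-z0-9_]*)=(.*)$")


def parse_tag(tag: str) -> Optional[MachineTag]:
    """Return a MachineTag for 'ns:pred=value', or None for a plain keyword tag."""
    match = _MACHINE_TAG.match(tag.strip())
    return MachineTag(*match.groups()) if match else None
```

A plain tag such as `rock` yields `None`, which is exactly the ambiguity raised above. A tag such as `music:genre=rock` carries its own property and namespace, though nothing guarantees that two users mean the same thing by `music`.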
RDF-based annotation #1: eRDF
Ian Davis (Talis)
Embedding RDF in HTML
Straightforward mapping to RDF triples (XSLT available)
More complex than microformats
Use any RDF/OWL vocabulary
Reuse of semantic-bearing HTML elements is limited
More limited than RDF
No blank nodes
No data types
No statements about subjects other than the current document
RDF-based annotation #2: RDFa
World Wide Web Consortium (W3C) last call document
Similar intent as eRDF, but full RDF support
Big question: user complexity (data quality)
<p typeof="contact:Info" about="http://example.org/staff/jo">
  <span property="contact:fn">Jo Smith</span>.
  <span property="contact:title">Web hacker</span> at
  <a rel="contact:org" href="http://example.org">Example.org</a>.
  You can contact me <a rel="contact:email" href="mailto:email@example.com">via email</a>.
</p> ...
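To show how such markup turns into triples, here is a deliberately simplified sketch in Python. It is not a conforming RDFa processor: it only handles `about`, `typeof`, `property` (literal taken from the element text), and `rel` together with `href`. It leaves prefixed names such as `contact:fn` unexpanded, since the example declares no prefix mappings. `SimpleRDFaParser`, `extract_triples`, and the `base` parameter are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser

VOID = {"br", "img", "hr", "input", "meta", "link"}  # elements with no end tag


class SimpleRDFaParser(HTMLParser):
    """Emits (subject, predicate, object) string tuples for a small RDFa subset."""

    def __init__(self, base):
        super().__init__()
        self.triples = []
        self._subjects = [base]   # innermost subject in scope is last
        self._stack = []          # per open element: [pushed_subject, subject, property, text]

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        a = dict(attrs)
        pushed = bool(a.get("about"))
        if pushed:
            self._subjects.append(a["about"])
        subject = self._subjects[-1]
        if a.get("typeof"):
            self.triples.append((subject, "rdf:type", a["typeof"]))
        if a.get("rel") and a.get("href"):
            for predicate in a["rel"].split():
                self.triples.append((subject, predicate, a["href"]))
        self._stack.append([pushed, subject, a.get("property"), []])

    def handle_data(self, data):
        # Text contributes to every open element that declared a property.
        for entry in self._stack:
            if entry[2]:
                entry[3].append(data)

    def handle_endtag(self, tag):
        if tag in VOID or not self._stack:
            return
        pushed, subject, prop, text = self._stack.pop()
        if prop:
            literal = " ".join("".join(text).split())
            self.triples.append((subject, prop, literal))
        if pushed:
            self._subjects.pop()


def extract_triples(html, base="http://example.org/"):
    parser = SimpleRDFaParser(base)
    parser.feed(html)
    parser.close()
    return parser.triples


SAMPLE = (
    '<p typeof="contact:Info" about="http://example.org/staff/jo">'
    '<span property="contact:fn">Jo Smith</span>. '
    '<span property="contact:title">Web hacker</span> at '
    '<a rel="contact:org" href="http://example.org">Example.org</a>. '
    'You can contact me <a rel="contact:email" href="mailto:email@example.com">via email</a>.'
    '</p>'
)
```

Calling `extract_triples(SAMPLE)` yields five statements about `http://example.org/staff/jo`: its type, two literal properties, and two links. Unlike the microformat case, the parser needs no knowledge of the vocabulary; the property names come from the markup itself.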
Creating an ecosystem of publishers, developers and end-users
Motivating and helping publishers to implement semantic annotation
Providing tools for developers to create compelling applications
Focusing on end-user experience
Rich abstracts as a first application
Addressing the long tail of query and content production
Standard Semantic Web technology
dataRSS = Atom + RDFa
Industry standard vocabularies
What is SearchMonkey?
An open platform for using structured data to build more useful and relevant search results
(Screenshots: search result before and after)
Enhanced Result: image, deep links, name/value pairs or abstract
Infobar
SearchMonkey
1. Site owners/publishers share structured data with Yahoo!.
2. Site owners & third-party developers build SearchMonkey apps.
3. Consumers customize their search experience with Enhanced Results or Infobars.
(Diagram labels: Acme.com’s Web Pages, RDF/Microformat Markup, Page Extraction, Index, Acme.com’s database, DataRSS feed, Web Services)
hCard plus feed data
Creative Commons by Ben Adida
CC in RDFa
Example apps. II.
Other me by Dan Brickley
Google Social Graph API wrapped using a Web Service