Designed for humans first and machines second, microformats are a set of simple, open data formats built upon existing and widely adopted standards. Instead of throwing away what works today, microformats intend to solve simpler problems first by adapting to current behaviors and usage patterns.
In fact, some of these searches are so hard that users don’t even try them anymore.
Results are good, but consider the ads: First ad says: Virgins. Looking for virgins? Find exactly what you want today. Ebay.com Second ad: Virgins. …Find cheap tickets for Virgins. Third ad: Adspam… these people buy Yahoo! traffic and sell it to Google.
SW: Representing and reasoning with structured data on the Web
Both a relational and a graph view on information
IR: Aggregating information at the document level based on ad-hoc information needs
DB: Representing and querying information in a relational model
NLP: From text to information
One reference to Semantic Search
Entity-independent measures:
M1: probability of fix given type
M2: probability of fix given type, normalized by probability of fix (the more uncommon the fix, the better)
M3: binary entropy function
Making the Web Searchable Peter Mika Researcher, Data Architect Yahoo! Research
Yahoo! Research (research.yahoo.com)
Yahoo! Research Barcelona
Established January, 2006
Led by Ricardo Baeza-Yates
Research areas
Web Mining
content, structure, usage
Distributed Web retrieval
Multimedia retrieval
NLP and Semantics
Yahoo! by numbers (April, 2007)
There are approximately 500 million users of Yahoo! branded services, meaning we reach 50 percent – or 1 out of every 2 users – online, the largest audience on the Internet (Yahoo! Internal Data).
Yahoo! is the most visited site online with nearly 4 billion visits and an average of 30 visits per user per month in the U.S. and leads all competitors in audience reach, frequency and engagement (comScore Media Metrix, US, Feb. 2007).
Yahoo! accounts for the largest share of time Americans spend on the Internet with 12 percent (comScore Media Metrix, US, Feb. 2007) and approximately 8 percent of the world’s online time (comScore WorldMetrix, Feb. 2007).
Yahoo! is the #1 home page with 85 million average daily visitors on Yahoo! homepages around the world, an increase of nearly 5 million visitors in a month (comScore WorldMetrix, Feb. 2007).
Yahoo!’s social media properties (Flickr, delicious, Answers, 360, Video, MyBlogLog, Jumpcut and Bix) have 115 million unique visitors worldwide (comScore WorldMetrix, Feb. 2007).
Yahoo! Answers is the largest collection of human knowledge on the Web with more than 90 million unique users and 250 million answers worldwide (Yahoo! Internal Data).
There are more than 450 million photos in Flickr in total and 1 million photos are uploaded daily. 80 percent of the photos are public (Yahoo! Internal Data).
Yahoo! Mail is the #1 Web mail provider in the world with 243 million users (comScore WorldMetrix, Feb. 2007) and nearly 80 million users in the U.S. (comScore Media Metrix, US, Feb. 2007)
Interoperability between Yahoo! Messenger and Windows Live Messenger has formed the largest IM community approaching 350 million user accounts (Yahoo! Internal Data).
Yahoo! Messenger is the most popular in time spent with an average of 50 minutes per user, per day (comScore WorldMetrix, Feb. 2007).
Nearly 1 in 10 Internet users is a member of a Yahoo! Groups (Yahoo! Internal Data).
Yahoo! is one of only 26 companies to be on both the Fortune 500 list and the Fortune’s “Best Place to Work” List (2006).
Agenda
Publishing metadata on the Semantic Web
A brief history of publishing metadata in HTML
Semantic Search research and applications
The many faces of the Semantic Web
Six ways of publishing RDF
Linked RDF files (linked data)
Metadata inside webpages
SPARQL endpoints
Feeds
XSLT/GRDDL
Automated tools
Non-exclusive, but in practice most publishers choose one
Option 1: Standalone RDF documents
RDF documents linked to other RDF documents
Use rdfs:seeAlso to point to a related document
It says: Go and look at that document if you want to know more
Advantages:
No change to the publishing of the HTML documents
Data can be published by third party
Tools
RDB-to-RDF mappers such as D2RQ or Triplify
Linked Data browsers
Examples: Most datasets in the Linked Data cloud
[Figure: three linked RDF documents, each showing a graph with the nodes #PeterM, #Bud and #Hun, the literals “Peter Mika”, “Budapest” and “2,000,000”, and the edges born, label, capital-of and population]
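The "go and look at that document" behavior of rdfs:seeAlso can be sketched as a small crawl. This is a minimal illustration, not how any particular Linked Data browser works: the document URLs and the in-memory DOCUMENTS table are hypothetical stand-ins for fetching and parsing real RDF files, and triples are plain tuples.

```python
from collections import deque

RDFS_SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"

# Hypothetical in-memory "web": document URL -> triples found in that document.
DOCUMENTS = {
    "http://example.org/peter.rdf": [
        ("#PeterM", "born", "#Bud"),
        ("#PeterM", RDFS_SEE_ALSO, "http://example.org/budapest.rdf"),
    ],
    "http://example.org/budapest.rdf": [
        ("#Bud", "capital-of", "#Hun"),
        ("#Bud", RDFS_SEE_ALSO, "http://example.org/peter.rdf"),
    ],
}


def crawl(start_url, fetch=DOCUMENTS.get):
    """Collect triples from start_url and every document reachable via rdfs:seeAlso."""
    seen = set()
    queue = deque([start_url])
    triples = []
    while queue:
        url = queue.popleft()
        if url in seen:
            continue  # documents may point back at each other
        seen.add(url)
        for s, p, o in fetch(url) or []:
            triples.append((s, p, o))
            if p == RDFS_SEE_ALSO:
                queue.append(o)  # the object is the related document to visit
    return triples


all_triples = crawl("http://example.org/peter.rdf")
```

The `seen` set matters because seeAlso links are frequently reciprocal; without it the crawl above would loop between the two documents.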
Option 1: cntd.
For discovery, the metadata is often linked from HTML pages
<link rel="meta" type="application/rdf+xml" title="FOAF" href="http://www.cs.vu.nl/~pmika/foaf.rdf" />
Additional advantages:
Discovery from the webpage
It’s clear that the metadata is a machine representation of the human-targeted content of the page
Examples: FOAF profiles, BestBuy
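[Figure: a web page with the text “Peter Mika was born in Budapest.” shown alongside an RDF graph with the nodes #PeterM, #Bud and #Hun, the literals “Peter Mika”, “Budapest” and “2,000,000”, and the edges born, label, capital-of and population]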
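Discovery from the webpage can be sketched with the Python standard library: scan the HTML for link elements with rel="meta" and an RDF media type, and collect their href values. This is a minimal sketch; the sample PAGE is made up around the link element from the slide, and a real crawler would also resolve relative URLs and accept other RDF media types.

```python
from html.parser import HTMLParser


class MetaLinkFinder(HTMLParser):
    """Collects href values of <link rel="meta" type="application/rdf+xml">."""

    def __init__(self):
        super().__init__()
        self.rdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        rels = (a.get("rel") or "").lower().split()
        if "meta" in rels and a.get("type") == "application/rdf+xml" and a.get("href"):
            self.rdf_links.append(a["href"])


def find_rdf_links(html):
    finder = MetaLinkFinder()
    finder.feed(html)
    return finder.rdf_links


# Made-up page wrapped around the link element shown on the slide.
PAGE = (
    '<html><head><link rel="meta" type="application/rdf+xml" title="FOAF" '
    'href="http://www.cs.vu.nl/~pmika/foaf.rdf" /></head>'
    "<body>Peter Mika was born in Budapest.</body></html>"
)

links = find_rdf_links(PAGE)
```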
Option 2: Metadata inside web pages
Using microformats, RDFa, MicroData (more later)
Advantages:
No content negotiation required
No separate database export required
Browser plug-in friendly
Search engine friendly
Copy-paste friendly
Tools:
XML editors (e.g. Oxygen)
Triplr
RDFa Distiller
RDFa bookmarklet
Ubiquity RDFa plugin
Optimus microformat parser
Examples: many, including SlideShare, YouTube, LinkedIn, Digg, Myspace, Facebook…
[Figure: two web pages, each with the text “Peter Mika was born in Budapest.”, each paired with an RDF graph with the nodes #PeterM, #Bud and #Hun, the literals “Peter Mika”, “Budapest” and “2,000,000”, and the edges born, label, capital-of and population]
Option 3: SPARQL endpoints
Query access to your RDF database
Similar to exposing your database on the Web and giving someone read-only SQL access
Advantages:
Most flexible and best performing access from a consumer perspective
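As a rough illustration of query access, the sketch below builds the GET request URL for a SPARQL query. The SPARQL protocol passes the query text in a `query` parameter; the endpoint address and the vocabulary URIs in the query are hypothetical and only mirror the Peter Mika / Budapest example used in these slides.

```python
from urllib.parse import urlencode

ENDPOINT = "http://example.org/sparql"  # hypothetical endpoint address

# Hypothetical data URIs; asks for everyone born in the resource standing for Budapest.
QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?person ?name WHERE {
  ?person <http://example.org/vocab#born> <http://example.org/id/Bud> ;
          rdfs:label ?name .
}
"""


def build_request_url(endpoint, query):
    """Return the URL for a SPARQL query sent as an HTTP GET request."""
    return endpoint + "?" + urlencode({"query": query.strip()})


request_url = build_request_url(ENDPOINT, QUERY)
```

The URL can then be fetched with any HTTP client; the result format depends on what the endpoint supports.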
<HEAD>
  <META HTTP-EQUIV="Instance-Key" CONTENT="http://www.cs.umd.edu/~george">
  <USE-ONTOLOGY "our-ontology" VERSION="1.0" PREFIX="our" URL="http://ont.org/our-ont.html">
</HEAD>
<BODY>
  <CATEGORY "our.Person">
  <RELATION "our.marriedTo" TO="http://www.cs.umd.edu/~helena">
  <RELATION "our.employee" FROM="http://www.cs.umd.edu">
  My name is <ATTRIBUTE "our.firstName">George</ATTRIBUTE> <ATTRIBUTE "our.lastName">Cook</ATTRIBUTE> and I live at...
This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/3.0/us/">Creative Commons Attribution 3.0 United States License</a>.
Use of the “rel” attribute for semantic annotation is the birth of the microformat…
Microformats (μf)
Community centered around microformats.org
Specifications and discussions are hosted there
Agreements on the way to encode certain kinds of metadata in HTML
Reuse of semantic-bearing HTML elements
Based on existing standards
Minimality
Microformats exist for a limited set of objects
hCard (persons and organizations)
hCalendar (events)
hResume
hProduct
hRecipe
Varying degrees of support and stability
hCard and rel-tag are widely supported
Example: microformats
<cite class="vcard"><a class="fn url" rel="friend colleague met" href="http://meyerweb.com/">Eric Meyer</a></cite> wrote a post (<cite><a href="http://meyerweb.com/eric/thoughts/2005/12/16/tax-relief/">Tax Relief</a></cite>) about an unintentionally humorous letter he received from the <span class="vcard"><a class="fn org url" href="http://irs.gov/">Internal Revenue Service</a></span>.
<div class="vcard">
  <a class="email fn" href="mailto:jfriday@host.com">Joe Friday</a>
  <div class="tel">+1-919-555-7878</div>
  <div class="title">Area Administrator, Assistant</div>
</div>
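To show why class-based markup is easy for machines to consume, here is a minimal sketch that pulls the formatted name (fn) and url out of each vcard in a fragment like the one above. It uses only the Python standard library and is not a conforming hCard parser: nested vcards, the other hCard properties and the format's special parsing rules are ignored.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class HCardParser(HTMLParser):
    """Collects {'fn': ..., 'url': ...} for every element with class 'vcard'."""

    def __init__(self):
        super().__init__()
        self.cards = []
        self._stack = []      # one (is_vcard_root, is_fn) entry per open element
        self._current = None  # the card being filled, if inside a vcard
        self._fn_depth = 0    # > 0 while inside an element with class 'fn'

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never get a matching end tag
        a = dict(attrs)
        classes = (a.get("class") or "").split()
        is_root = "vcard" in classes and self._current is None
        if is_root:
            self._current = {"fn": "", "url": None}
        is_fn = self._current is not None and "fn" in classes
        if is_fn:
            self._fn_depth += 1
        if self._current is not None and "url" in classes and a.get("href"):
            self._current["url"] = a["href"]
        self._stack.append((is_root, is_fn))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._stack:
            return
        is_root, is_fn = self._stack.pop()
        if is_fn:
            self._fn_depth -= 1
        if is_root:
            self._current["fn"] = " ".join(self._current["fn"].split())
            self.cards.append(self._current)
            self._current = None

    def handle_data(self, data):
        if self._current is not None and self._fn_depth > 0:
            self._current["fn"] += data


SAMPLE = (
    '<cite class="vcard"><a class="fn url" rel="friend colleague met" '
    'href="http://meyerweb.com/">Eric Meyer</a></cite> wrote a post. '
    '<div class="vcard"><a class="email fn" href="mailto:jfriday@host.com">Joe Friday</a>'
    '<div class="tel">+1-919-555-7878</div></div>'
)

parser = HCardParser()
parser.feed(SAMPLE)
cards = parser.cards
```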
Microformats: limitations
No shared syntax
Each microformat has a separate syntax tailored to the vocabulary
No formal schemas
Limited reuse, extensibility of schemas
Unclear which combinations are allowed
No datatypes
No namespaces, unique identifiers (URIs)
no interlinking
mapping between instances is required
Relationship to page context is often unclear
RDFa
RDF-in-attributes
World Wide Web Consortium (W3C) recommendation for encoding RDF triples in HTML
Full RDF support
Recommendation specifies the algorithm for parsing the triples out of HTML
Requires XHTML in principle
In practice, no one cares
RDFa in a slide
<p xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:foaf="http://xmlns.com/foaf/0.1/" typeof="foaf:Person" about="http://example.org/staff/jo">
  <span property="rdfs:label foaf:name">Jo Smith</span>.
  <span property="foaf:title">Web hacker</span> at
  <a rel="vcard:org" property="foaf:name" href="http://example.org">Acme Corp</a>.
  You can contact me <a rel="foaf:mbox" href="mailto:jo@example.org">via email</a>.
</p>
...
Assign the prefixes rdfs and foaf to the RDFS and FOAF namespaces (as in XML, RDF/XML etc.)
Create a new resource of type foaf:Person
Assign a value to a property
Give it a URI
Link to another resource and assign a name to it
Microformats vs. RDFa
Choose microformats when you find a microformat that fits your needs (and that is supported by your favorite tool or search engine)
Microformats are the first option because they are simple
We support all major microformats, see the documentation
It’s a common misconception that RDFa requires XHTML: it doesn’t
If you find none that perfectly fits your needs then you need RDFa
Microformats have a fixed schema: you cannot add your own attributes
Example: a social networking site with user profiles
VCard is a good candidate, but for example it doesn’t have a way to express the user’s social connections
You either live without this, or go with RDFa
Keep an eye on HTML5
Currently under standardization at the W3C
Last Call in Fall 2009
Introduces Microdata
Similar to microformats
Some predefined vocabularies with central registration
Some of the flexibility of RDFa
Introduce new terms using reverse domain names or full URIs
Semantic HTML elements such as <time>, <video>, <article>…
Microdata example
<div item="http://www.yahoo.com/resource/person">
  <p>My name is <span itemprop="name">Neil</span>.</p>
  <p>My band is called <span itemprop="band">Four Parts Water</span>. I was born on <time itemprop="birthday" datetime="2009-05-10">May 10th 2009</time>. <img itemprop="image" src="me.png" alt="me"></p>
</div>
The process of annotating with RDFa
Invest time in familiarizing yourself with the RDFa syntax by reading the RDFa Primer
It is also highly recommended that you read the RDF Primer. RDF is the data model used by RDFa.
Choose a vocabulary from the SearchMonkey documentation that fits your needs
A vocabulary describes a set of types and attributes within a given domain
If you don’t find a good candidate, extend an existing one or create a new one
Annotate your page.
Before you start, you might want to validate your page for (X)HTML conformance using the W3C’s (X)HTML Validator to reduce the chance of errors. Choose Document Type XHTML + RDFa.
No specific tool support. If you have an HTML or XML editor that supports DTDs, you will have syntax checking and highlighting.
Use the RDFa Distiller to validate which data can be extracted from your page.
If you like, use the RDF Validator to visualize the resulting RDF graph.
Put the annotated page online. The data will be extracted the next time your page is crawled
No need to explicitly submit anything
No notification when your site is crawled
See http://rdfa.info/rdfa-implementations for new tools and APIs
Beware of people who claim to have the vocabulary of everything
Preferably you want something small and targeted
There is never a 100% fit: you will need to introduce vocabulary terms (classes and properties)
Do not introduce new classes/properties in existing namespaces
Example: the namespace http://xmlns.com/foaf/0.1/ is used by the FOAF project. Try not to introduce a new term without contacting the owner, i.e. the membership of the FOAF mailing list.
Advanced topic: creating a vocabulary
Get advice on methodology
vocamp.org and semanticweb.org
Choose a namespace and a prefix
Give sensible names, e.g. name it after your site, but don’t call it searchmonkey
Namespace ends either with a slash or a hash
Create an RDF or OWL document describing your classes and properties
Use an ontology editor such as Protégé 4.0
Follow naming conventions
Publish your vocabulary
Make sure the URIs of your properties and classes are resolvable
E.g. myvocab:digicam should resolve to a document containing the definition of myvocab:digicam
Convince others to adopt your vocabulary
If you are in fishing, convince other fishing businesses
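The namespace and resolvability advice above can be made concrete with a short sketch. It expands a prefixed name such as myvocab:digicam into a full URI and shows which document a client would end up fetching. The myvocab namespace URI is hypothetical; the foaf and rdfs namespaces are the ones used elsewhere in these slides.

```python
PREFIXES = {
    "foaf": "http://xmlns.com/foaf/0.1/",             # namespace ending in a slash
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",  # namespace ending in a hash
    "myvocab": "http://example.org/myvocab#",         # hypothetical vocabulary
}


def expand(name, prefixes=PREFIXES):
    """Expand a prefixed name such as 'myvocab:digicam' into a full URI."""
    prefix, sep, local = name.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError("unknown prefix in %r" % name)
    namespace = prefixes[prefix]
    if not namespace.endswith(("/", "#")):
        raise ValueError("namespace should end with a slash or a hash: %r" % namespace)
    return namespace + local


def document_url(uri):
    """URL that is actually requested: the part after '#' is not sent to the server."""
    return uri.split("#", 1)[0]


digicam_uri = expand("myvocab:digicam")
```

With a hash namespace, every term resolves to the same vocabulary document; with a slash namespace, each term has its own URL, so the server must answer for each of them.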
Exercise (slideshare.net/pmika)
Explore data on the Web
Microformats
Search for pages on Yahoo using searchmonkey:com.yahoo.page.uf.hcard
Try Operator Firefox Plug-in
Try Optimus
RDFa
Create yourself or search for pages on Yahoo using searchmonkey:com.yahoo.page.rdf.rdfa
Try RDFa bookmarklet to highlight RDFa
Try RDFa Distiller to extract RDF from HTML
Try RDF Validator to visualize your RDF data
Mark up your webpage using RDFa
Use RDFa Distiller to test
Try automated annotation using Zemanta or OpenCalais
Semantic Search
Why semantic search? (P. Raghavan)
Old battles are won
Main driver of user perception of search quality used to be: precision of navigational queries
The technical prowess was about crawling and spam
The hard core was indexing and retrieval
Currently, the biggest bottlenecks in IR are not computational, but in modeling user cognition
If only we could find a computationally expensive way to solve the problem
then we should be able to make it go faster
Searches that show a ‘semantic gap’
Ambiguous searches
Paris Hilton
Multimedia search
Images of Paris Hilton
Imprecise or overly precise searches
Publications by Jim Hendler
Find images of strong and adventurous people (Lenat)
Searches for descriptions
Search for yourself without using your name
Product search (ads!)
Searches that require aggregation
Size of the Eiffel Tower (Lenat)
Public opinion on Britney Spears
World temperature by 2020
Queries that require a deeper understanding of the query, the content and/or the world at large
Not just search
Semantic Search
Def. matching the user’s query with the Web’s content at a conceptual level, often with the help of world knowledge
R. Guha, R. McCool: Semantic Search, WWW2003
Related disciplines
Semantic Web, IR, Databases, NLP, IE
As a field
ISWC/ESWC/ASWC, WWW, SIGIR
Exploring Semantic Annotations in Information Retrieval (ECIR08, WSDM09)
Semantic Search Workshop (ESWC08, WWW09)
Future of Web Search: Semantic Search (FoWS09)
Semantics at every step of the IR process
[Figure: the IR engine between the Web / the Semantic Web and the user, with the stages document processing, query processing, ranking θ(q,d) and search interface]
Document processing
Goal: provide a higher level representation of text in some conceptual space
Diverse methods
Document classification
Information Extraction
Named-entity recognition, word-sense disambiguation, semantic role labeling, wrapper induction, form filling, etc.
Previously: open source toolkits such as GATE, OpenNLP
Require expert user to operate and train
NEW: Online tools
For non-expert users
Embed metadata inside documents and/or link entities in documents to the cloud
Often using existing metadata as background knowledge
Examples: online NLP
OpenCalais (Thomson Reuters)
Online interface and API for named-entity recognition, (partial) disambiguation and relationship extraction
HTML annotator (microformats), Firefox plug-in, WordPress plug-in, Yahoo Pipes service, etc.
Works best on the news domain
Zemanta
Zemanta is a ‘personal writing assistant’ that recognizes and disambiguates named entities and suggests related content
A Firefox plug-in that extends the functionality of popular blogging and online email platforms; online interface and API
Broad coverage but partial recognition
Examples: information extraction from templated HTML
Intel MashMaker
Create mashups based on information extracted from the page you are browsing (Firefox plugin)
Wrapper induction trained using manual annotations
Export the XSLT to be used in Yahoo’s SearchMonkey
Dapper
Semantic Advertising company
Wrapper induction trained using manual annotations (online tool)
API to the dynamic extraction
Glue
Social interaction around objects on the Web (Firefox plugin)
API provides access to the information extracted from popular sites
Query Interpretation
Provide a higher level representation of queries in some conceptual space
Ideally, the same space in which documents are represented
Queries may be keywords, questions, semi-structured, structured etc.
Interpretation treated as a separate step from ranking
Required for federation, i.e. determine where to send the query
Choosing between interpretations could be left to the user
Due to performance requirements
You cannot execute the query to determine what it means and then query again
General world knowledge (e.g. DBpedia), domain ontologies or the schema of the actual data can be used
Example: Semantic Search Assist
Observation: the same type of objects often have the same query context
Users asking for the same aspect of the type
Could we make query suggestions based on the type of the entity?
Improvement for infrequent queries
Example queries:
apple ipod nano review
sony plasma tv review
jerry yang biography
biography tim berners lee
tim berners lee blog
peter mika yahoo
britney spears shaves her head
Models
Desirable properties:
P1: Fix is frequent within type
P2: Fix has frequencies well-distributed across entities
P3: Fix is infrequent outside of the type
Models:
Example: “apple ipod nano review”: entity = “apple ipod nano”, fix = “review”, type: product
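The entity-independent measures M1 and M2 listed in the notes of this deck can be sketched on a toy query log. Everything about the data is an assumption: the log reuses the example queries from the Semantic Search Assist slide, and the split into entity, fix and type was done by hand. The deck names M3 only as the "binary entropy function" without saying which probability it is applied to, so the sketch defines the function and leaves its argument open.

```python
import math
from collections import Counter

# Toy log of (entity, fix, type) triples; the types are hand-assigned assumptions.
LOG = [
    ("apple ipod nano", "review", "product"),
    ("sony plasma tv", "review", "product"),
    ("jerry yang", "biography", "person"),
    ("tim berners lee", "biography", "person"),
    ("tim berners lee", "blog", "person"),
    ("peter mika", "yahoo", "person"),
    ("britney spears", "shaves her head", "person"),
]

fix_type_counts = Counter((fix, typ) for _, fix, typ in LOG)
type_counts = Counter(typ for _, _, typ in LOG)
fix_counts = Counter(fix for _, fix, _ in LOG)


def m1(fix, typ):
    """M1: probability of the fix given the type."""
    return fix_type_counts[(fix, typ)] / type_counts[typ]


def m2(fix, typ):
    """M2: M1 normalized by the overall probability of the fix."""
    p_fix = fix_counts[fix] / len(LOG)
    return m1(fix, typ) / p_fix


def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


review_m1 = m1("review", "product")
review_m2 = m2("review", "product")
```

In this toy log "review" scores higher than "biography" on M2 because it is both frequent within its type and never seen outside it, which matches desirable properties P1 and P3.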
Demo
Ranking
Goal: match the query representation to content representation
Again, possibly with the use of background knowledge
Methods depend on
the actual representation
Bag of words, NL parse trees, SPARQL…
the qualities of the queries and content
the use case
e.g. time available for ranking may vary from milliseconds to days
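For the simplest representation mentioned above, bag of words, matching the query representation to the content representation can be sketched as cosine similarity between term-count vectors. This is one common choice, used here only for illustration; the deck does not prescribe a ranking function, and the three documents are made up.

```python
import math
from collections import Counter


def bag(text):
    """Bag-of-words representation: lowercase terms with their counts."""
    return Counter(text.lower().split())


def cosine(q, d):
    """Cosine similarity between two bags of words."""
    dot = sum(q[t] * d[t] for t in q)
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0


def rank(query, docs):
    """Return docs ordered by decreasing similarity to the query."""
    q = bag(query)
    return sorted(docs, key=lambda doc: cosine(q, bag(doc)), reverse=True)


DOCS = [
    "eiffel tower height in paris",
    "paris hilton photos",
    "britney spears news",
]

ranked = rank("paris hilton", DOCS)
```

The first document still receives a non-zero score for "paris", which is exactly the kind of ambiguity the semantic gap slides describe: a bag of words cannot tell the city from the person.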
Search interface
Goal is to facilitate the interaction between the user and the system
presentation adapts to the kind of query and results presented
Aggregated search
Grouping similar items, summarizing results in various ways
Possibilities for filtering, possibly across different dimensions
Task completion
Help the user to fulfill the task by placing the query in a task context
Putting it all together: Semantic Search Engines
Natural Language search engines
Hakia
Powerset (now built into Bing)
TrueKnowledge
Structured data search engines
Searching open web data
Sindice
Sigma
Searching closed world data
Wolfram Alpha (closed structured data + computation)
Web search engines
Yahoo’s SearchMonkey
BOSS and YQL
Google’s Rich Snippets
An open platform for using structured data to build more useful and relevant search results
Creating an ecosystem of publishers, developers and end-users
Helping publishers to implement semantic annotation and motivating them by allowing them to customize their search results
Providing tools and APIs for developers to create compelling applications
Improving search result presentation for our end-users
Semantic Web technology
Support for a number of microformats, RDFa
RDF based data representation
Industry standard vocabularies (or use any of your own)
SearchMonkey
[Figure: an Enhanced Result with its parts labeled: image, deep links, name/value pairs or abstract]
A difficult problem!
What if one would try to do this automatically?
Document summarization
Page structure detection
Information Extraction
Image recognition
Link classification and ranking
But one of those cases where a little semantics can go a long way…
SearchMonkey
1. Site owners/publishers share structured data with Yahoo!.
2. Site owners & third-party developers build SearchMonkey apps.
3. Consumers customize their search experience with Enhanced Results or Infobars.
[Figure: architecture diagram with the labels Acme.com’s Web Pages, Acme.com’s database, RDF/Microformat Markup, Page Extraction, DataRSS feed, Web Services, Index]
Example apps
LinkedIn
hCard plus feed data
Creative Commons by Ben Adida
CC in RDFa
Example apps. II.
Other me by Dan Brickley
Google Social Graph API wrapped using a Web Service
Google’s Rich Snippets
Shares a subset of the features of SearchMonkey
Encourages publishers to embed certain microformats and RDFa into webpages
Currently reviews, people, products, business & organizations
These are used to generate richer search results
SearchMonkey is customizable
Developers can develop applications themselves
SearchMonkey is open
Wide support for standard vocabularies
API access
BOSS: Build your Own Search Service
Ability to re-order results and blend in additional content
No restrictions on presentation
No branding or attribution
Access to multiple verticals (web search, image, news)
40+ supported language and region pairs
Pricing (BOSS)
Pay-by-usage
10,000 queries a day still free
Serve any ads you want
For more info, http://developer.yahoo.com/search/boss/
BOSS API to structured data
Simple HTTP GET calls, no authentication
You need an Application ID: register at developer.yahoo.com/search/boss/
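A "simple HTTP GET call" amounts to putting the query and the Application ID into a URL. The sketch below shows only that general shape. The base URL is a placeholder and the parameter names `q` and `appid` are assumptions made for illustration; the actual endpoint and parameters are in the BOSS documentation linked above. The query reuses the searchmonkey: operator from the exercise slide.

```python
from urllib.parse import urlencode

# Placeholder only: NOT the real BOSS endpoint. See the BOSS documentation.
BASE_URL = "http://boss.example.invalid/search"


def build_search_url(query, app_id, base_url=BASE_URL):
    """Build a GET URL carrying the query and the Application ID.

    The parameter names 'q' and 'appid' are illustrative assumptions.
    """
    return base_url + "?" + urlencode({"q": query, "appid": app_id})


# Restrict results to pages with hCard data, as in the exercise slide.
search_url = build_search_url("searchmonkey:com.yahoo.page.uf.hcard budapest", "YOUR_APP_ID")
```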