Year of the Monkey: Lessons from the first year of SearchMonkey - Presentation Transcript
Year of the Monkey
Peter Mika, Researcher and Data Architect, Yahoo!
SearchMonkey
1. Site owners/publishers share structured data with Yahoo!.
2. Site owners & third-party developers build SearchMonkey apps.
3. Consumers customize their search experience with Enhanced Results or Infobars.
[Diagram labels: Acme.com’s Web Pages, Page Extraction, RDF/Microformat Markup, Index, Acme.com’s database, DataRSS feed, Web Services]
March, 2008: The news is out!
“Yahoo’s support for semantic web standards like RDF and microformats is exactly the incentive websites need to adopt them. Instead of semantic silos scattered across the Web (think Twine), Yahoo will be pulling all the semantic information together when available, as a search engine should. Until now, there were few applications that demanded properly structured data from third parties. That changes today.” – TechCrunch
May, 2008: Time to party!
What happened since the launch?
It’s working in and out
We are contributing significantly to the growth of the Semantic Web
Users are delighted → Publishers are willing to invest → More structured data → More applications → More users are delighted
Increasing excitement all around
A year later, even Google gets excited
This presentation is about the things we learned along the way
How far are we?
Lessons about technology
Lessons about our communities
Moving ahead
How far are we?
The Web is getting more structured
But just how far are we: lots of data or very little?
[Figure: Percentage of URLs with embedded metadata in various formats, Sep 2008 vs. Mar 2009]
>400% increase in RDFa data
The Semantic Gap
The real question is whether this data serves a purpose
Our purpose: fulfilling the information needs of our users
It’s not about the size! Consider Wikipedia.
Demand = needs
Supply = information
Analysis through query logs
Research questions:
How much of this data would ever be encountered by a user through search?
What categories of queries can be answered?
What’s the role of large sites?
Method
Imitating the average search behavior of users through web search query log analysis
Reproducible experiments (given query log data)
BOSS web search API
Returns metadata for search result URLs in RDF/XML or DataRSS
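The counting step behind this method can be sketched roughly as follows. This is only an illustration, not the code used in the experiment: it assumes the metadata formats for each query's result URLs have already been fetched (for example through BOSS), and the function name, the input shape and the sample data are all invented for the example.

```python
from collections import Counter, defaultdict


def tally_metadata(results_by_query):
    """For each format, count how many queries had 1, 2, 3, ... results carrying it.

    results_by_query maps a query string to a list with one set of format
    names per result URL (an empty set means no metadata was found).
    """
    histogram = defaultdict(Counter)
    for results in results_by_query.values():
        per_format = Counter()
        for formats in results:
            for fmt in formats:
                per_format[fmt] += 1
            if formats:
                # "ANY" counts results with at least one kind of metadata.
                per_format["ANY"] += 1
        for fmt, n_results in per_format.items():
            histogram[fmt][n_results] += 1
    return histogram


# Invented sample data, for illustration only.
sample = {
    "pizza palo alto": [{"hcard", "adr"}, set(), {"hcard"}],
    "semantic web blog": [{"rel-tag"}, {"rel-tag", "hatom"}],
    "weather": [set(), set()],
}
hist = tally_metadata(sample)
for fmt in sorted(hist):
    print(fmt, dict(hist[fmt]))
```

Queries with no metadata in any result (like "weather" here) simply do not show up in the histogram.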
Data
Microformats, eRDF, RDFa data
Query log data
US query log
Random sample of 7k queries
Recent query log covering a period of over a month
Query classification data
US query log
1000 queries classified into various categories
Caveats
For us, and for the time being, search = document search
For this experiment, we assume current bag-of-words document retrieval is a reasonable approximation of semantic search
For us, search = web search
We are dealing with the average search user
There are many queries the users have learned not to ask
Volume is a rough approximation of value
There are rare information needs with high pay-offs, e.g. patent search, financial data, biomedical data…
Number of queries with a given number of results with particular formats (N=7081)

Format      1     2    3    4   5   6   7   8   9  10   Impressions   Average impressions per query
ANY      2127  1164  492  244  85  24  10   5   3   1          7623   1.08
hcard    1457   370   93   11   3   0   0   0   0   0          2535   0.36
rel-tag  1317   350   95   44  14   8   6   3   1   1          2681   0.38
adr       456    77   21    6   1   0   0   0   0   0           702   0.10
hatom     450    52    8    1   0   0   0   0   0   0           582   0.08
license   359    21    1    1   0   0   0   0   0   0           408   0.06
xfn       339    26    1    1   0   0   0   1   0   0           406   0.06

Notes:
Queries with 0 results with metadata not shown
You cannot add numbers in columns: a query may return documents with different formats
Assume queries return more than 10 results

On average, a query has at least one result with metadata.
Are tags as useful as hCard?
That’s only 1 in every 16 queries.
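The Impressions and average columns in the table above are consistent with a simple weighted sum: each query with k metadata-bearing results contributes k impressions, and the total is divided by N=7081. The slides do not state this definition outright, so treat the check below as a reading of the numbers rather than the author's stated formula.

```python
N_QUERIES = 7081

# Number of queries with 1..10 results carrying each format (from the table).
rows = {
    "ANY":     [2127, 1164, 492, 244, 85, 24, 10, 5, 3, 1],
    "hcard":   [1457, 370, 93, 11, 3, 0, 0, 0, 0, 0],
    "rel-tag": [1317, 350, 95, 44, 14, 8, 6, 3, 1, 1],
    "adr":     [456, 77, 21, 6, 1, 0, 0, 0, 0, 0],
    "hatom":   [450, 52, 8, 1, 0, 0, 0, 0, 0, 0],
    "license": [359, 21, 1, 1, 0, 0, 0, 0, 0, 0],
    "xfn":     [339, 26, 1, 1, 0, 0, 0, 1, 0, 0],
}


def impressions(counts):
    """Sum k * (number of queries with exactly k matching results)."""
    return sum(k * c for k, c in enumerate(counts, start=1))


for fmt, counts in rows.items():
    total = impressions(counts)
    print(f"{fmt:8s} {total:5d} {total / N_QUERIES:.2f}")
```

For the ANY row this gives 7623 impressions and 7623 / 7081 ≈ 1.08, matching the table.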
An ontology is a shared, formal representation of a domain
How do we build communities? www.vocamp.org
Learning about our communities
Publishers, developers, users and marketers
Most applications are developed by the site owner
Exposing only the data that is required for the application
How to encourage other types of applications and more data?
Typical developer is a front-end engineer
Mostly new to semantic technologies, but motivated
However, ramp-up is steep: developers have to learn RDF plus SearchMonkey.
How to simplify development?
Users have little interest in customizing their search experience
They attach less value to customization than we thought
How to give our users more value without the need for customization?
Helping our publishers
What if we could remove the need for programming?
SearchMonkey objects
Generate enhanced results based on markup in common formats
Copy-paste code
Validator
LATE BREAKING NEWS Five new objects: Product, Local, News, Event, Discussion
Opening up new ways of accessing structured data
BOSS (Build your Own Search Service) API
Full service web search API
Access metadata with search results
view=searchmonkey_rdf&format=xml
Use magic words to restrict search to results with certain kinds of metadata
e.g. searchmonkey:com.yahoo.page.uf.hcard
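A request combining the two options above might be assembled along these lines. Only the view=searchmonkey_rdf&format=xml parameters and the magic word come from the slides; the base URL, the appid parameter, and the idea of appending the magic word to the query terms are assumptions for illustration, and the helper name is made up. The sketch only builds the URL and does not call the service.

```python
from urllib.parse import quote, urlencode

# Assumed endpoint; not stated in the slides.
BOSS_BASE = "http://boss.yahooapis.com/ysearch/web/v1/"


def boss_url(query, appid, magic_word=None):
    """Build a BOSS web search URL that asks for SearchMonkey metadata."""
    # Assumption: the magic word is passed as an extra query term.
    terms = query if magic_word is None else f"{query} {magic_word}"
    params = urlencode({
        "appid": appid,               # assumed parameter name
        "view": "searchmonkey_rdf",   # from the slides
        "format": "xml",              # from the slides
    })
    return f"{BOSS_BASE}{quote(terms, safe='')}?{params}"


url = boss_url("pizza", "YOUR_APP_ID", "searchmonkey:com.yahoo.page.uf.hcard")
print(url)
```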
YQL
Query web services as virtual relational tables
Create mashups by joining tables
The microformats ‘table’ allows access similar to that offered by BOSS
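As a rough sketch of what querying the microformats table could look like: the table name comes from the slides, but the url column, the public endpoint and the format parameter are assumptions here, and the helper name is hypothetical. The code only constructs the request URL.

```python
from urllib.parse import urlencode

# Assumed public YQL endpoint; not stated in the slides.
YQL_BASE = "http://query.yahooapis.com/v1/public/yql"


def yql_url(statement):
    """Wrap a YQL statement in a request URL."""
    return f"{YQL_BASE}?{urlencode({'q': statement, 'format': 'xml'})}"


# Assumption: the microformats table is keyed by a 'url' column.
statement = "select * from microformats where url = 'http://www.example.com/'"
request = yql_url(statement)
print(request)
```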
New default-on applications for our users
Moving ahead
Summary
CELEBRATING 1 YEAR ANNIVERSARY OF SEARCHMONKEY
In 23 markets around the world
70 million enhanced results viewed daily (US)
>15% increase in click-through rates
200 people a day enter the dev tool to start creating an app
>15,000 developers registered to build apps
>400 applications in gallery
Amount of RDFa structured data increased by 413%
What we’ve done
Opened up SearchMonkey via BOSS and YQL
Significantly simplified the work of publishers and developers
Improved user experience through a number of new applications
Working with the community to establish standards and best practices