Semantic Searchmonkey
 


Semantic Search + SearchMonkey talk given at Yahoo! Hacku event.

http://developer.yahoo.com/hacku
http://developer.yahoo.com/searchmonkey

Usage Rights

© All Rights Reserved

  • A SearchMonkey Enhanced Result contains a great deal of structured data. It could have a picture, key/value pairs, deep links… This kind of information goes far beyond what normal search results give you: a title and an auto-extracted summary. Where does this information come from?
  • Likewise, an Infobar has a summary (what the user sees before the pane is expanded) and a “blob”, an area of free-form HTML.
  • XSLT custom data services are excellent when there is no good structured data available, either because you don’t own the site in question, or because you just want to get a prototype out quickly without having to change your site’s template markup. You can use these data services to mock up what is possible with SearchMonkey.
    As with the PHP, the XSLT is fairly simple. The “hard” part of writing the stylesheet is really just finding the right XPath expression for extracting the information you want. The other thing you need to do is pick a good vocabulary for describing the extracted data. For example, a description is a dc:description (Dublin Core description), and so on.
    If the page is not well-formed XHTML, have no fear: we tidy up the page ahead of time and run the XSLT on that. The tidying can fail, but only if the markup is really pathologically bad.
    As we mentioned before, XSLT custom data services are good for mocking up Enhanced Results, but they’re too slow in practice. For a production-quality app, you’ll need to use them in infobars.
    [Show demo]

Semantic Searchmonkey Presentation Transcript

  • Monkey with the Semantic Web
  • SearchMonkey Presentation by: Paul Tarjan, Chief Technical Monkey (ptarjan@yahoo-inc.com) Online at: http://www.slideshare.net/ptarjan/semantic-searchmonkey
  • The web was / is fragmented: Funny pictures, Super secret military site, Friend’s website, University, Cool event page, bookmarks
  • So we added search to find stuff: Google, Yahoo. Sites: Funny pictures, Super secret military site, Friend’s website, University, Cool event page, bookmarks
  • But there are many similar sites: Facebook Events, Evite Events, Upcoming Events; Youtube, Metacafe, Vimeo; Digg, Reddit, Technorati. Let’s treat these as “views” onto “objects”
  • Wouldn’t it be cool if you could do: •  object:video creator:"Paul Tarjan" length<=60s
  • Wouldn’t it be cool if you could do: •  object:video creator:http://paulisageek.com/ length<=60s
  • Wouldn’t it be cool if you could do: •  object:game name:"Desktop Tower Defense" version:1.5 publishdate:"May 2 2005"
  • Wouldn’t it be cool if you could do: •  object:video author:"The Escapist" game:"Left 4 Dead"
  • It gets even cooler
  • Aggregation: •  object:review type:camera make:canon model:D40
  • Aggregation: •  object:event date:"May 16, 2008" type:party price<$5
  • Aggregation: •  object:photo person:"Paul Tarjan"
  • Aggregation: •  object:photo person:http://paulisageek.com
  • The Semantic What? •  Web pages are views of data for people to read •  Search engines are a hack •  They treat pages as a bucket of words •  Let’s turn the web into a database •  APIs are good, but there is no “web” of APIs •  If you figure out a good way of doing that, let me know
  • Ok, I want to do it. Now what?
  • Recommendation: µF •  If there is a microformat for your data, use it –  hcard –  hreview –  hresume –  hcalendar –  rel-tag –  rel-license –  xfn –  hatom –  geo
  • µF in a nutshell •  Change your @class to something that is known •  <div> –  <span class="name">Paul Tarjan</span> –  <span class="email">spam@paulisageek.com</span> •  </div> •  BECOMES •  <div class="vcard"> –  <span class="fn">Paul Tarjan</span> –  <span class="email">spam@paulisageek.com</span> •  </div>
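To make the point of the slide above concrete, here is a minimal Python sketch of what a consumer of the markup gains once the class names are well known. It is not how any real microformat parser works: the `HCardParser` class, the `parse_hcards` helper and the two-field `HCARD_FIELDS` subset are all assumptions made for illustration, and the depth counting assumes every tag inside the card is explicitly closed.

```python
from html.parser import HTMLParser

# hCard class names this sketch recognizes (a small, assumed subset).
HCARD_FIELDS = {"fn", "email"}


class HCardParser(HTMLParser):
    """Collect text from elements whose class is a known hCard property,
    but only while inside an element marked class="vcard"."""

    def __init__(self):
        super().__init__()
        self.cards = []      # one dict per vcard found
        self._depth = 0      # nesting depth inside the current vcard
        self._field = None   # hCard property currently being read

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            self._depth += 1
            for name in classes:
                if name in HCARD_FIELDS:
                    self._field = name
        elif "vcard" in classes:
            self._depth = 1
            self.cards.append({})

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
        self._field = None

    def handle_data(self, data):
        if self._depth and self._field and data.strip():
            self.cards[-1][self._field] = data.strip()


def parse_hcards(html):
    """Return a list of {property: text} dicts, one per vcard in the markup."""
    parser = HCardParser()
    parser.feed(html)
    return parser.cards


BEFORE = ('<div><span class="name">Paul Tarjan</span> '
          '<span class="email">spam@paulisageek.com</span></div>')
AFTER = ('<div class="vcard"><span class="fn">Paul Tarjan</span> '
         '<span class="email">spam@paulisageek.com</span></div>')

print(parse_hcards(BEFORE))  # nothing recognizable without the known class names
print(parse_hcards(AFTER))
```

The ad hoc `name` class in the first snippet means nothing to a generic parser, while the `vcard` / `fn` pair in the second is enough for code that has never seen the site to pull out a contact.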
  • Recommendation: RDFa •  If you have data that doesn’t really fit in a µF •  Examples: –  Markup APIs (YUI, javadoc, etc) –  Media (Audios, Videos, Games, Presentations) –  Job Postings
  • RDFa in a nutshell •  Make a namespace •  Use @property, @rel and @resource •  For DATA: @property makes the node contents into the value •  For URLs: @rel makes the @resource into the value
  • Normal HTML •  <html> … <div class="private"> private static String <strong>_createCookieHash </strong> (hash) …
  • RDFa: example •  <html xmlns:yui="http://yuilibrary.com/rdf/1.0/yui.rdf#"> … <div class="private" rel="yui:method" resource="#method__createCookieHash"> private static String <strong property="yui:name"> _createCookieHash </strong> (hash) …
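The two rules from the "RDFa in a nutshell" slide (@property takes the node contents as the value, @rel takes the @resource as the value) can be sketched in a few lines of Python. This is a deliberately tiny illustration, not a conforming RDFa processor: the `RDFaSketch` class is hypothetical, prefixes are left unexpanded, attributes such as @about and @typeof are ignored, and it assumes that a @rel/@resource pair makes the resource the subject for everything nested inside it.

```python
from html.parser import HTMLParser


class RDFaSketch(HTMLParser):
    """Collect (subject, predicate, value) triples from @rel/@resource
    and @property attributes. Assumes every tag is explicitly closed."""

    def __init__(self, base="self"):
        super().__init__()
        self.triples = []
        self._subjects = [base]  # stack: one entry pushed per open element
        self._pending = None     # (subject, property) waiting for its text

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        subject = self._subjects[-1]
        if "rel" in a and "resource" in a:
            # For URLs: @rel makes the @resource into the value.
            self.triples.append((subject, a["rel"], a["resource"]))
            subject = a["resource"]  # nested statements describe the resource
        if "property" in a:
            # For data: @property makes the node contents into the value.
            self._pending = (subject, a["property"])
        self._subjects.append(subject)

    def handle_endtag(self, tag):
        if len(self._subjects) > 1:
            self._subjects.pop()
        self._pending = None

    def handle_data(self, data):
        if self._pending and data.strip():
            self.triples.append((*self._pending, data.strip()))
            self._pending = None


def extract_triples(html):
    parser = RDFaSketch()
    parser.feed(html)
    return parser.triples


SNIPPET = ('<div class="private" rel="yui:method" '
           'resource="#method__createCookieHash">private static String '
           '<strong property="yui:name">_createCookieHash</strong> (hash)</div>')

for triple in extract_triples(SNIPPET):
    print(triple)
```

Run on the slide's snippet, the sketch yields one triple linking the page to the method and one naming the method, which is the kind of data a crawler can pick up without site-specific screen scraping.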
  • That’s it! •  Automatically picked up by semantic parsers / crawlers •  Can build a SearchMonkey app on it •  Can make a mashup way easier than screen scraping •  Can get the data from Yahoo! BOSS
  • What is SearchMonkey? An open platform for using structured data to build more useful and relevant search results (screenshots: Before, After)
  • Enhanced Result: Zagat (Image, Links, Key/Value Pairs or Abstract)
  • Infobar: Wikipedia Preview (Summary, Blob)
  • Part of the puzzle: Semantic vocabularies, Semantic markup on web pages, SearchMonkey
  • Vocabularies •  Need to speak the same language •  I like to see girls of that... caliber. •  English, French, Spanish, Esperanto? •  URLs to the rescue –  Dublin Core (http://purl.org/dc/elements/1.1/) –  Friend of a Friend (http://xmlns.com/foaf/0.1/) –  X-Friend Network (http://gmpg.org/xfn/11/) –  … (many more)
  • Syntax •  Nouns, Verbs, and Adjectives, oh my! •  All phrases become lots of triples •  (Subject, Verb / Adj. / Prep. / etc, Object) •  Key / Value pairs ++ –  Everything is a URL or String –  Subject doesn’t have to be the document
  • Syntax 2 •  Key / Value pair –  Title = Awesome SearchMonkey Presentation –  Homepage = http://search.yahoo.com/searchmonkey •  Triples –  (self, http://purl.org/dc#title, “Awesome SearchMonkey Presentation”) –  (self, http://vcard#url, http://search.yahoo.com/searchmonkey)
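The step from key/value pairs to triples on the slide above is mechanical, as this small Python sketch shows. The `to_triples` helper and the `VOCABULARY` mapping are hypothetical names; the predicate URLs are copied as abbreviated on the slide, whereas real vocabularies use longer URLs.

```python
# Predicate URLs exactly as abbreviated on the slide (assumed shorthand).
VOCABULARY = {
    "Title": "http://purl.org/dc#title",
    "Homepage": "http://vcard#url",
}


def to_triples(subject, pairs, vocabulary=VOCABULARY):
    """Turn {key: value} pairs into (subject, predicate URL, value) triples."""
    return [(subject, vocabulary[key], value) for key, value in pairs.items()]


pairs = {
    "Title": "Awesome SearchMonkey Presentation",
    "Homepage": "http://search.yahoo.com/searchmonkey",
}

for triple in to_triples("self", pairs):
    print(triple)
```

The only things a triple adds over a key/value pair are an explicit subject and a globally unique name for the key, which is what lets data from different sites be merged.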
  • Decompose to triples •  My friend “Bob” is an idiot. –  (self, http://xmlns.com/foaf/0.1/knows, genid:Ui__152310312_366) –  (genid:Ui__152310312_366, http://www.w3.org/2001/vcard-rdf/3.0#fn, “Bob”) –  (genid:Ui__152310312_366, http://example.org/ptarjan/isInstanceOf, http://example.org/ptarjan/idiot) •  Unnamed nodes are O.K.
  • Writing URLs takes a lot of work! •  xmlns:foaf=http://xmlns.com/foaf/0.1/ •  xmlns:vcard=http://www.w3.org/2001/vcard-rdf/3.0# •  xmlns:junk=http://example.org/ptarjan/ •  My friend “Bob” is an idiot. –  (self, foaf:knows, genid:Ui__152310312_366) –  (genid:Ui__152310312_366, vcard:fn, “Bob”) –  (genid:Ui__152310312_366, junk:isInstanceOf, junk:idiot) •  Unnamed nodes are O.K.
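The prefix shorthand on the slide above is plain string substitution, which a short Python sketch can show. The `expand` and `expand_triple` helpers are hypothetical; the prefix table is the one declared on the slide, and anything without a declared prefix (literals, `self`, the `genid:` blank node) is passed through untouched.

```python
# Namespace prefixes as declared on the slide.
PREFIXES = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "vcard": "http://www.w3.org/2001/vcard-rdf/3.0#",
    "junk": "http://example.org/ptarjan/",
}


def expand(term, prefixes=PREFIXES):
    """Expand 'prefix:local' into a full URL when the prefix is declared."""
    prefix, sep, local = term.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    return term


def expand_triple(triple, prefixes=PREFIXES):
    """Expand every element of a (subject, predicate, object) triple."""
    return tuple(expand(part, prefixes) for part in triple)


short = [
    ("self", "foaf:knows", "genid:Ui__152310312_366"),
    ("genid:Ui__152310312_366", "vcard:fn", "Bob"),
    ("genid:Ui__152310312_366", "junk:isInstanceOf", "junk:idiot"),
]

for triple in short:
    print(expand_triple(triple))
```

Because expansion is pure concatenation, the namespace URL has to end in exactly the right character (`/` or `#`); a missing trailing slash silently produces a different predicate.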
  • RDFa •  <html xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:vcard="http://www.w3.org/2001/vcard-rdf/3.0#" xmlns:junk="http://example.org/ptarjan/"> <div rel="foaf:knows"> <span property="vcard:fn">Bob</span> <span rel="junk:isInstanceOf" resource="junk:idiot" /> </div> </html>
  • •  </SemanticWeb> •  Questions?
  • Innards of SearchMonkey •  You build a web-service inside our framework •  When a search page renders –  We check which SM apps are enabled –  We call them • 50ms for in-page • Long time for AJAX –  They return data in our template –  We render them (and cache)
  • Prototyping with XSLT •  What if I don’t have structured data? –  I don’t own the site –  I do own the site, but I want to prototype first •  Build an XSLT custom data service first –  Write some XSLT to extract the data and transform it into DataRSS –  Mostly about finding the right XPath (use Firebug or XPather) –  Quick to implement, but brittle –  Can’t do a good Enhanced Result
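The "find the right XPath, then name the result with a vocabulary term" workflow from the slide above can be tried out without the SearchMonkey tooling. The sketch below is a stand-in, not an XSLT data service: it uses Python's `xml.etree.ElementTree` (which supports only a subset of XPath), it returns a plain dictionary rather than DataRSS, and the `RULES` table, the `extract` helper and the sample page are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Vocabulary property -> XPath that locates it on the (hypothetical) page.
RULES = {
    "dc:title": ".//h1",
    "dc:description": ".//p[@class='summary']",
}


def extract(xhtml, rules=RULES):
    """Apply each XPath to a well-formed XHTML string and return
    {property: text}. Properties whose XPath matches nothing are omitted."""
    root = ET.fromstring(xhtml)
    data = {}
    for prop, path in rules.items():
        node = root.find(path)
        if node is not None:
            data[prop] = "".join(node.itertext()).strip()
    return data


PAGE = (
    "<html><body>"
    "<h1>Desktop Tower Defense</h1>"
    "<p class='summary'>A <b>tower defense</b> game.</p>"
    "</body></html>"
)

print(extract(PAGE))
```

The brittleness mentioned on the slide shows up immediately: if the site renames the `summary` class, the description silently disappears from the result, which is one reason this approach suits prototypes better than production apps.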
  • Do it for real •  Demo
  • Examples •  Rubik’s Cube •  VTA Bus •  API Monkey •  BugMeNot •  RetailMeNot •  Amazon
  • questions?