Applications of Open Search Tools:  WWW2010 Tutorial Rosie Jones and Ted Drake Yahoo!  Inc April 26 th , 2010 [email_addre...
Introductions
Schedule 2:00  – 2:15 Introductions and Overview Rosie & Ted 2:15 – 2:30 Motivation – state of the industry Ted Drake 2:30...
Caveat <ul><li>There is a lot of open search software out there! </li></ul><ul><li>This tutorial is breadth-oriented, and ...
Schedule 2:00  – 2:15 Introductions and Overview Rosie & Ted 2:15 – 2:30 Motivation – state of the industry Ted Drake 2:30...
Motivation
State of the Industry - Mashups <ul><li>Programmable Web: Resource for API and Mashup development </li></ul><ul><li>10 new...
State of the Industry - Healthy Market <ul><li>1,500 search related companies on TechCrunch </li></ul>
Open Source Technology Reduces Barriers <ul><li>Yahoo! Query Language  </li></ul><ul><ul><li>Select * from (insert your de...
Motivation II: Tools for Academic Papers
In Academia: Paper in WWW 2010 <ul><li>Highlighting Disputed Claims on the Web Rob Ennals, Beth Trushkowsky, John Mark Ago...
In Academia: Papers from SIGIR 2008 <ul><li>Towards breaking the quality curse: a web-querying approach to web people sear...
More Publications using Open Source Search Engines <ul><li>Affective feedback: an investigation into the role of emotions ...
Schedule 2:00  – 2:15 Introductions and Overview Rosie & Ted 2:15 – 2:30 Motivation – state of the industry Ted Drake 2:30...
Web Search Architecture Find documents Follow links Fetch freshest content Build graph of hyperlinks Process text and meta...
What is Open Search
Open Source Search and Open Search Open source code  lets you  build your own search engine Open search lets you leverage ...
Why Open Search? <ul><li>#!/usr/local/bin/perl -w </li></ul><ul><ul><li>$searchResultPage = GET($url); </li></ul></ul><ul>...
Scraping Modules http://search.cpan.org/~jfriedl/Yahoo-Search-1.10.13/lib/Yahoo/Search.pm
Do I Look Like A Piece of Bad Software?
Information Superhighway for Known Robots Search engine may stop accepting requests from your IP, or just slow down service
Scrape with Search Engine’s Blessing <ul><li>http://code.google.com/apis/ajaxsearch/ </li></ul><ul><li>http://msdn.microso...
Other Parts to the Search Process <ul><li>Indexing </li></ul><ul><ul><li>Indexing algorithms </li></ul></ul><ul><ul><li>Ac...
Indexing Your Own Content
Task of Indexing <ul><li>Store document contents in format that allows quick lookup </li></ul><ul><li>Invest time offline ...
Brute Force Document Scoring <ul><li>Check every document in collection to see if it contains any query terms </li></ul><u...
open drake search ted D1 D67 D3 D92 … query= open search ted drake D8 D9 D15 D32 D1 D9 D46 mit D3 D8 D9 D15 D32 D1 D6 D9 D...
High Level Comparison Platform License Lang. Docs Ranking Users Parallel Scale Lucene Apache Java Many Flexible Amazon Yes...
Previous Benchmarks  (Middleton+Baeza-Yates 07)
Open Search Benchmarking <ul><li>http://zooie.wordpress.com/2009/07/06/a-comparison-of-open-source-search-engines-and-inde...
Benchmarks <ul><li>Not enough comparative benchmarks out there </li></ul><ul><li>Hard to do;  we really need standards </l...
In action <ul><li>Lucene </li></ul><ul><li>Sphinx </li></ul><ul><li>Indri </li></ul><ul><li>All the code examples are here...
Lucene <ul><li>Lot of industrial support w/ proven scalability </li></ul><ul><ul><li>Amazon, Netflix, Wikipedia </li></ul>...
Lucene Indexing
Lucene Search javac -cp /lucenedir/lucene-2.4.1/lucene-core-2.4.1.jar:. Index.java java -Xmx512m -cp /lucenedir/lucene-cor...
Sphinx <ul><li>Runs Craigslist Search </li></ul><ul><li>MySQL integration focus </li></ul><ul><ul><li>But also supports a ...
Sphinx Indexing <ul><li>SQL text columns or XML input </li></ul><ul><li>sphinx.conf </li></ul><ul><li>indexer --quiet --co...
Sphinx Search Socket connection to searchd Sphinx service
Indri <ul><li>Lemur Project </li></ul><ul><ul><li>http://www.lemurproject.org/ </li></ul></ul><ul><li>Powerful Structured ...
Indri: Hello World <ul><li>Index & Search directory of txt files </li></ul><ul><li>IndriBuildIndex -index=/Users/viksi/sig...
Indexed Info in Search API
Index - Structured Meta Data <ul><li>SearchMonkey: </li></ul><ul><li>Yahoo! SearchMonkey captures the structured data from...
Index - Social <ul><li>Social:  </li></ul><ul><li>Delicious saves/tags </li></ul><ul><li>FOAF (Friend of a Friend), XFN </...
Index – Machine Tags <ul><li>Keyterms </li></ul><ul><li>Mis-Spelling </li></ul><ul><li>Content Enrichment </li></ul><ul><l...
Schedule 2:00  – 2:15 Introductions and Overview Rosie & Ted 2:15 – 2:30 Motivation – state of the industry Ted Drake 2:30...
Hello, World! Open Search Service APIs Photo by  Oskay
Roadmap of APIs <ul><li>Google </li></ul><ul><li>Bing </li></ul><ul><li>BOSS </li></ul><ul><li>Twitter </li></ul><ul><li>Y...
Google AJAX Search <ul><li>Javascript Widget or API </li></ul><ul><li>REST API: </li></ul><ul><li>http://ajax.googleapis.c...
Google Custom Search <ul><li>Turn-key product </li></ul><ul><li>Bulk load 1000s site restricts; On-demand 24 hour Web Inde...
Bing 2.0 API <ul><li>Multiple Sources; Batch support </li></ul><ul><ul><li>Web, Images, InstantAnswer, Phonebook, RelatedS...
Yahoo! BOSS <ul><li>BOSS = Build your Own Search Service </li></ul><ul><li>Open Yahoo’s core search features via w...
Unrestricted? <ul><li>Unlimited queries </li></ul><ul><li>Blend, re-order, discard </li></ul><ul><li>Full Presentation con...
BOSS API <ul><li>Usage </li></ul><ul><ul><li>http://boss.yahooapis.com/ysearch/{vert}/v1/{q}?appid={appid}&start=0&count=1...
Web = Cross Platform <ul><li>Google AJAX, Bing, BOSS </li></ul><ul><li>HTTP GET ,  URI => XML, JSON </li></ul><ul><li>Any ...
Platforms
Yahoo! YQL <ul><li>select * from internet API  (e.g. flickr, ebay, amazon) </li></ul><ul><ul><li>http://developer.yahoo.co...
Amazon Web Services (AWS) <ul><li>Amazon Cloud Support </li></ul><ul><li>Amazon SimpleDB, Relational Database Services </l...
Google App Engine <ul><li>Free application hosting (up to 5 million pv/month) </li></ul><ul><li>Java, Ruby, or Python </li...
Examples
BOSS Out in the Open <ul><li> http://www.xurch.com </li></ul><ul><li> http://search.techcrunch.com </li></ul><ul><li> http...
Google Custom Search Examples <ul><li>CopyScape –  Looks for sites copying your  text </li></ul><ul><li>Topicalizer –  Ext...
Bing Examples <ul><li>Site Search Engine by a Microsoft engineer http://nathanbuggia.com/blog/post/Custom-Site-Search-Engi...
Coolest Features Across the Board <ul><li>BOSS  se_link (graphs), delicious (bookmarks), keyterms (extracted entities), se...
Schedule 2:00  – 2:15 Introductions and Overview Rosie & Ted 2:15 – 2:30 Motivation – state of the industry Ted Drake 2:30...
Mashups
Let’s Build Something <ul><li>TweetNews </li></ul><ul><ul><li>http://tweetnews.appspot.com/search?q=twitter </li></ul></ul...
Digression: TF-IDF for Ranking <ul><li>TF = Term Frequency </li></ul><ul><ul><li>Documents containing the query terms ofte...
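A minimal sketch of TF-IDF scoring, added for illustration. It assumes whitespace tokenization and the plain `tf * log(N / df)` weighting, which is one common variant and not necessarily the one used in TweetNews; the function name and toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_scores(query, docs):
    """Score each document by summing tf * idf over the query terms."""
    n_docs = len(docs)
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if df[term]:  # terms absent from the collection contribute nothing
                score += tf[term] * math.log(n_docs / df[term])
        scores[doc_id] = score
    return scores


# Hypothetical toy collection.
docs = {
    "a": "twitter news twitter",
    "b": "news today",
    "c": "weather today",
}
scores = tf_idf_scores("twitter news", docs)
```

Document "a" scores highest here because it repeats the rarer term "twitter"; "c" contains neither query term and scores zero.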
TweetNews Model <ul><li>Goal:  Inject relevance in latest news search results </li></ul><ul><li>Approach: </li></ul><ul><u...
TweetNews Main Source
Non-Search: delicious Classifier <ul><li>Usage </li></ul><ul><ul><li>&view=delicious_toptags </li></ul></ul><ul><ul><li>&v...
Mashup: Related terms <ul><li>Delicious users can tag web sites they bookmark. </li></ul><ul><li>Get a ranked list of tags...
Mashup – Social Impact <ul><li>What  are your friends buzzing, digging, tagging… </li></ul><ul><li>YQL:  select * from soc...
Mashup – The Fire Hose
Mashup – Government Data <ul><li>Guardian’s World Government Data Collection  http://www.guardian.co.uk/world-government-d...
Coming Soon: Twitter Annotations <ul><li>Metadata for tweets </li></ul><ul><li>Step 1. create link for users to tweet your...
Mashup – Open Tables on YQL <ul><ul><li>Define new API definitions </li></ul></ul><ul><ul><li>Open Source in GitHub </li><...
Mashup – Open Tables on YQL <ul><li><?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?>  <table xmlns=&quot;http://...
Mashup – Using an Open Table
Blending Vertical + Service <ul><li>Comprehensiveness! </li></ul><ul><ul><li>Every Search Engine should be a One-Stop Shop...
Delicious Blending Idea <ul><li>Goal:   Blend  delicious + web results </li></ul><ul><li>Approach: </li></ul><ul><ul><li>1...
From Web
Hack Ideas <ul><li>Discovery  (BOSS Search App Store) </li></ul><ul><ul><li>Designing a fairer marketplace for app distrib...
Schedule 2:00  – 2:15 Introductions and Overview Rosie & Ted 2:15 – 2:30 Motivation – state of the industry Ted Drake 2:30...
Ranking
Retrieval and Ranking <ul><li>RETRIEVE  the documents matching simple conditions </li></ul><ul><ul><li>Boolean AND on quer...
Ranking with Open Source Tools <ul><li>Indri/Lemur </li></ul><ul><ul><li>Language modeling </li></ul></ul><ul><ul><li>BM25...
Evaluation with Click Logs
Evaluating with Clicks People click on the good results, right?
Not All Results Are Equally Likely to be Looked At (Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)
Clicks and Views Depend on Rank [Joachims et al, 2005]
Evaluation from Click Logs <ul><li>Show a screenshot and me doing a “skip first” </li></ul>Read From Top to Bottom Skip Sk...
Mining Clicks for Ranking <ul><li>Clicks can be used to predict </li></ul><ul><ul><li>Pairwise preference </li></ul></ul><...
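One way to turn clicks into pairwise preferences is the "click beats skipped-above" heuristic from the Joachims line of work cited in this section. The sketch below assumes a user reads from top to bottom, so a clicked result is preferred over every unclicked result ranked above it. The function name and data are hypothetical.

```python
def skip_above_preferences(ranking, clicked):
    """Return (preferred, less_preferred) pairs from one result page.

    ranking: result IDs in the order they were shown.
    clicked: IDs the user clicked on.
    """
    clicked = set(clicked)
    prefs = []
    for pos, doc in enumerate(ranking):
        if doc in clicked:
            # Every unclicked result shown above the click was skipped.
            for skipped in ranking[:pos]:
                if skipped not in clicked:
                    prefs.append((doc, skipped))
    return prefs


# Hypothetical result page: the user clicked only the third result.
prefs = skip_above_preferences(["a", "b", "c", "d"], ["c"])
```

Nothing is inferred about "d", which was shown below the click and may never have been looked at.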
Interleaving for Learning from Clicks – Pairwise Judgments <ul><li>[Joachims, KDD 2002] </li></ul><ul><li>[Radlinski and J...
Evaluation using Discounted Cumulative Gain <ul><li>Discounted Cumulative Gain (DCG) </li></ul><ul><li>[Järvelin and Kek...
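A short sketch of DCG, for illustration. The slide's exact formula is not visible here, so this assumes the widely used variant in which the gain at 1-based rank i is divided by log2(i + 1). The normalized form (nDCG) divides by the DCG of the ideal ordering. Function names are hypothetical.

```python
import math


def dcg(gains):
    """Sum of gain_i / log2(i + 1) over 1-based ranks i."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))


def ndcg(gains):
    """DCG divided by the DCG of the best possible ordering of the same gains."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Hypothetical graded relevance labels for the top three results.
good_order = ndcg([3, 2, 1])
bad_order = ndcg([1, 2, 3])
```

Putting the most relevant result last lowers the score because its gain is discounted more heavily.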
Directly Modeling Relevance From Clicks Which ranking of web pages is better for the query “NIPS 2007”? [Carterette and Jo...
Ingredients for Learning from Clicks <ul><li>Sufficient users  </li></ul><ul><li>Ability to record results shown </li></ul...
 
How to Get Search Engine Results to Modify? <ul><li>Radlinski and Joachims </li></ul><ul><li>citeseer/arXiv.org results an...
Query Logs <ul><li>Might be in /etc/httpd/logs/access_log* check httpd.conf </li></ul><ul><li>[IP] - - [Time] “[Method] [U...
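A rough sketch of pulling queries out of a web server access log. It assumes lines in the Common Log Format and a search URL that carries the query in a `q` parameter; both the parameter name and the sample line are hypothetical, so check them against your own httpd.conf and URLs.

```python
import re
from urllib.parse import urlparse, parse_qs

# Common Log Format: IP, identity, user, [time], "METHOD URL PROTOCOL", status, bytes
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+'
)


def extract_query(line, param="q"):
    """Return (ip, time, query) for a search request, or None otherwise."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    values = parse_qs(urlparse(match.group("url")).query).get(param)
    if not values:
        return None
    return match.group("ip"), match.group("time"), values[0]


# Hypothetical log line.
sample = ('192.0.2.1 - - [26/Apr/2010:14:05:11 -0400] '
          '"GET /search?q=open+search HTTP/1.1" 200 5120')
record = extract_query(sample)
```

Requests without the query parameter, such as image or stylesheet fetches, return None and can be skipped.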
Other Wishlist Items <ul><li>A good baseline </li></ul><ul><ul><li>Motivate your users to use your engine </li></ul></ul><...
Reasons to Build a Demo “Eat Your Own Dogfood” algorithm design and testing - allows you to improve without labeled data ...
More About Logs and Evaluation in Other Tutorials <ul><li>Web Search Engine Metrics (Direct Metrics to Measure User Satisf...
What Doesn’t Exist? <ul><li>Query log mining tools </li></ul><ul><ul><li>An opportunity for you! </li></ul></ul><ul><li>… ...
Other Open Source Tools
Lemur Query Log Toolbar <ul><li>Research community project for collecting query logs </li></ul><ul><ul><li>Sign up at http...
Book on Hadoop Scale Processing Coming Out <ul><li>Ivory: A Hadoop toolkit for Web-scale information retrieval   </li></ul...
Take Home Messages <ul><li>You can evaluate with clicks </li></ul><ul><li>You can collect clicks by building a useful / fu...
Pointers - Tools <ul><li>[1] Indri Homepage. http://www.lemurproject.org/indri/ </li></ul><ul><li>[2] Lemur Toolkit Home...
Mashup Resources <ul><li>Yahoo Developer Network: developer.yahoo.com </li></ul><ul><li>Y.Q.L. : developer.yahoo.com/yql <...
Acknowledgements <ul><li>Vik Singh co-wrote earlier version of this tutorial </li></ul><ul><li>A few slides from Ricardo B...
<ul><li>QA </li></ul>
Presentation by Ted DRAKE and Rosie JONES for the www2010 conference in North Carolina. This discusses the open source search software, APIs and trends.

  • http://developer.yahoo.com/everything.html - for logos
  • ROSIE – SHOW PSEUDOCODE FOR SIMPLIFIED VERSION – THEN CONVERT TO YQL(TED) OR PERL (ROSIE)?
    The user uses a search interface to rapidly gather many snippets that contain similar phrases, and then selects those that they would like to mark (Figure 6). The server uses Yahoo BOSS2 to search the web for snippets that resemble a paraphrase entered by the user.
    SIGIR 2008 proceedings
    http://portal.acm.org/toc.cfm?id=1390334&idx=SERIES278&type=proceeding&coll=ACM&dl=ACM&part=series&WantType=Proceedings&title=SIGIR&CFID=43145604&CFTOKEN=93348762
    Jung et al IP&M
    http://search.yahoo.com/search?p=Click+data+as+implicit+relevance+feedback+in+web+search&ei=UTF-8&fr=moz2
    Affective feedback
    http://eprints.gla.ac.uk/4825/1/4825.pdf
    http://portal.acm.org/citation.cfm?id=1390566&dl=GUIDE&coll=GUIDE&CFID=43143609&CFTOKEN=22951859
  • ROSIE WORK ON THIS TONIGHT
  • Eran / Ashim ; okay to include BOSS HERE?
  • WHAT DO WE SHOW FOR PRESENTATION?? – the SIGIR 2008 papers? Query-biased summaries
  • Mapping from words to documents containing them
  • TED check correctness
  • TED check correctness
  • LOGO NEEDED!
  • TED UPDATE
  • TED UPDATE?
  • ROSIE ADD A PICTURE
  • TALK MORE ABOUT THIS EXAMPLE
  • MAKE A NEW SCREENSHOT WITHOUT CLIPPED TEXT
    DESCRIBE arXiv.org more fully – who uses it what it does etc.
    Radlinski et al. implemented arXiv search on top of Lucene: http://search.arxiv.org/
    One could use e.g. Yahoo result ordering as one baseline:
    BOSS with restriction to arxiv.org
    What would this pseudocode look like?
  • TODO: Examples from SIGIR 2008 papers for each of those
  • RJ show unigram/ngram examples
    - add refs for Observer User Behavior
  • Open Source Search Tools for www2010 conference: Applications of Open Search Tools: WWW2010 Tutorial

    1. 1. Applications of Open Search Tools: WWW2010 Tutorial Rosie Jones and Ted Drake Yahoo! Inc April 26 th , 2010 [email_address] , [email_address]
    2. 2. Introductions
    3. 3. Schedule 2:00 – 2:15 Introductions and Overview Rosie & Ted 2:15 – 2:30 Motivation – state of the industry Ted Drake 2:30 – 3:00 Indexing and Search Rosie & Ted 3:00 – 3:30 Hello World! Using Search Service APIs & Examples Ted Drake 3:30 – 4:00 Coffee Break 4:00 – 4:30 Mashup Patterns Ted Drake 4:30 – 5:00 Ranking and Evaluation Rosie Jones 5:00 – 5:30 Discussion, Questions Ted & Rosie
    4. 4. Caveat <ul><li>There is a lot of open search software out there! </li></ul><ul><li>This tutorial is breadth-oriented, and example driven </li></ul><ul><ul><li>And therefore necessarily kind of shallow </li></ul></ul>For the slides: [email_address] [email_address] http://www.slideshare.net/7mary4
    5. 5. Schedule 2:00 – 2:15 Introductions and Overview Rosie & Ted 2:15 – 2:30 Motivation – state of the industry Ted Drake 2:30 – 3:00 Search and Indexing Rosie & Ted 3:00 – 3:30 Hello World! Using Search Service APIs & Examples Ted Drake 3:30 – 4:00 Coffee Break 4:00 – 4:30 Mashup Patterns Ted & Rosie 4:30 – 5:00 Ranking and Evaluation Rosie Jones 5:00 – 5:30 Discussion, Questions Ted & Rosie
    6. 6. Motivation
    7. 7. State of the Industry - Mashups <ul><li>Programmable Web: Resource for API and Mashup development </li></ul><ul><li>10 new search mashups every month (average) </li></ul><ul><li>62 search APIs (as of April 25, 2010) </li></ul>
    8. 8. State of the Industry - Healthy Market <ul><li>1,500 search related companies on TechCrunch </li></ul>
    9. 9. Open Source Technology Reduces Barriers <ul><li>Yahoo! Query Language </li></ul><ul><ul><li>Select * from (insert your desire) </li></ul></ul><ul><ul><li>Built in cache, threading, authentication </li></ul></ul><ul><ul><li>Easily extended with Open Tables </li></ul></ul><ul><li>Hadoop </li></ul><ul><ul><li>Yahoo Distribution of Hadoop includes patches and updates </li></ul></ul><ul><ul><li>Your Hadoop installation can perform at your current scale </li></ul></ul><ul><ul><ul><li>All the way up to Yahoo scale </li></ul></ul></ul><ul><li>Open Source Search Engines </li></ul><ul><ul><li>Lemur </li></ul></ul><ul><ul><li>Lucene </li></ul></ul>
    10. 10. Motivation II: Tools for Academic Papers
    11. 11. In Academia: Paper in WWW 2010 <ul><li>Highlighting Disputed Claims on the Web Rob Ennals, Beth Trushkowsky, John Mark Agosta, Tye Rattenbury, Tad Hirsch </li></ul>The server uses Yahoo BOSS2 to search the web for snippets that resemble a paraphrase entered by the user.
    12. 12. In Academia: Papers from SIGIR 2008 <ul><li>Towards breaking the quality curse: a web-querying approach to web people search   </li></ul><ul><li>[ Kalashnikov et al SIGIR 2008] </li></ul><ul><ul><li>Web as external corpus </li></ul></ul><ul><ul><li>Use Yahoo! API to retrieve </li></ul></ul><ul><li>Emulating query-biased summaries using document titles [Joho et al SIGIR 2008] </li></ul><ul><ul><li>Yahoo!, Google, Terrier (TREC) </li></ul></ul>
    13. 13. More Publications using Open Source Search Engines <ul><li>Affective feedback: an investigation into the role of emotions in the information seeking process [ Arapakis et al SIGIR 2008] </li></ul><ul><ul><li>Use Indri to parse and retrieve TREC newswire and web collections </li></ul></ul><ul><li>[Jung et al IP&M 2007] </li></ul><ul><ul><li>Last clicked document is predictor of relevance (used Nutch search engine on university website) </li></ul></ul><ul><li>Minimal test collections for retrieval evaluation [Carterette et al SIGIR 2006] </li></ul><ul><ul><li>Indri, Lemur, Lucene, Mg, SMART, Zettair </li></ul></ul>
    14. 14. Schedule 2:00 – 2:15 Introductions and Overview Rosie & Ted 2:15 – 2:30 Motivation – state of the industry Ted Drake 2:30 – 3:00 Search and Indexing Rosie & Ted 3:00 – 3:30 Hello World! Using Search Service APIs & Examples Ted Drake 3:30 – 4:00 Coffee Break 4:00 – 4:30 Mashup Patterns Ted & Rosie 4:30 – 5:00 Automatic Evaluation Rosie Jones 5:00 – 5:30 Discussion, Questions Ted & Rosie
    15. 15. Web Search Architecture Find documents Follow links Fetch freshest content Build graph of hyperlinks Process text and meta-data - compressed - for quick lookup Index Text and meta-data - compressed - for quick lookup Offline Find documents containing query words Runtime Crawlers Indexers Retrieval Ranking Interface
    16. 16. What is Open Search
    17. 17. Open Source Search and Open Search Open source code lets you build your own search engine Open search lets you leverage existing commercial search engines
    18. 18. Why Open Search? <ul><li>#!/usr/local/bin/perl -w </li></ul><ul><ul><li>$searchResultPage = GET($url); </li></ul></ul><ul><ul><li>process($searchResultPage) </li></ul></ul><ul><ul><li>… </li></ul></ul><ul><li>Curl (php) </li></ul><ul><li>Javascript… </li></ul>
    19. 19. Scraping Modules http://search.cpan.org/~jfriedl/Yahoo-Search-1.10.13/lib/Yahoo/Search.pm
    20. 20. Do I Look Like A Piece of Bad Software?
    21. 21. Information Superhighway for Known Robots Search engine may stop accepting requests from your IP, or just slow down service
    22. 22. Scrape with Search Engine’s Blessing <ul><li>http://code.google.com/apis/ajaxsearch/ </li></ul><ul><li>http://msdn.microsoft.com/en-us/library/dd251056.aspx </li></ul><ul><li>http://developer.yahoo.com/search/boss/ </li></ul>MUCH MORE DETAIL IN THE NEXT SECTION!
    23. 23. Other Parts to the Search Process <ul><li>Indexing </li></ul><ul><ul><li>Indexing algorithms </li></ul></ul><ul><ul><li>Access to the index – what is overall document frequency? What if I rank differently using the index? </li></ul></ul><ul><li>Presentation </li></ul><ul><ul><li>User interface effects </li></ul></ul><ul><li>Existing Open Search Platforms Can Get You Started </li></ul>
    24. 24. Indexing Your Own Content
    25. 25. Task of Indexing <ul><li>Store document contents in format that allows quick lookup </li></ul><ul><li>Invest time offline </li></ul><ul><ul><li>For fast runtime access </li></ul></ul><ul><li>Runtime task </li></ul><ul><ul><li>Given the current query </li></ul></ul><ul><ul><li>Which subset of documents should we spend time ranking </li></ul></ul>
    26. 26. Brute Force Document Scoring <ul><li>Check every document in collection to see if it contains any query terms </li></ul><ul><ul><li>Most documents don’t contain any of the query terms </li></ul></ul><ul><ul><li>Look at query terms to see which documents to consider </li></ul></ul>
    27. 27. Inverted Index [Figure: the query "open search ted drake" looked up in an inverted index. Each term (open, search, ted, drake) points to a posting list of document IDs such as D1, D3, D8, D9, D15, D32; each entry in a list is a posting.]
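The inverted index on this slide can be sketched in a few lines. This is an illustrative toy, assuming whitespace tokenization and an in-memory dictionary; the function names and sample documents are hypothetical and are not the documents shown on the slide.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a sorted posting list of document IDs."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def candidates(index, query):
    """Return IDs of documents containing at least one query term."""
    found = set()
    for term in query.lower().split():
        # Only the posting lists of the query terms are touched.
        found.update(index.get(term, []))
    return sorted(found)


# Hypothetical toy collection.
docs = {
    "D1": "open search tools",
    "D3": "ted drake on mashups",
    "D8": "open source search engines",
    "D9": "cooking with rosemary",
}
index = build_inverted_index(docs)
hits = candidates(index, "open search ted drake")
```

Unlike brute-force scoring, the lookup never reads D9, because none of the query terms point to it.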
    28. 28. High Level Comparison
        Platform | License  | Lang. | Docs            | Ranking       | Users      | Parallel | Scale
        Lucene   | Apache   | Java  | Many            | Flexible      | Amazon     | Yes      | TB
        zettair  | BSD like | C     | HTML, TREC, TXT | Flexible      | Research   | No       | TB
        Indri    | BSD like | C++   | Many            | Very Flexible | Research   | Yes      | TB
        Sphinx   | GPL      | C++   | Many            | Flexible      | craigslist | Yes      | TB
        RDBMS    | BSD, GPL | C     | SQL Text        | Limited       | -          | Maybe    | GB
        Xapian   | GPL      | C++   | Many            | Flexible      | gmane      | Yes      | TB
    29. 29. Previous Benchmarks (Middleton+Baeza-Yates 07)
    30. 30. Open Search Benchmarking <ul><li>http://zooie.wordpress.com/2009/07/06/a-comparison-of-open-source-search-engines-and-indexing-twitter/ </li></ul><ul><ul><li>An over the weekend experiment to make code examples </li></ul></ul>
    31. 31. Benchmarks <ul><li>Not enough comparative benchmarks out there </li></ul><ul><li>Hard to do; we really need standards </li></ul><ul><ul><li>Optimize each platform, per hardware and data set </li></ul></ul><ul><ul><li>Lot of platforms, with different APIs, options and numerical settings </li></ul></ul><ul><li>Need good diverse data sets, small & large </li></ul><ul><li>Hard to please </li></ul><ul><ul><li>Winners & losers in benchmarks; lot of biases </li></ul></ul><ul><ul><li>Always room for improvement </li></ul></ul><ul><li>Really evolutionary to nail benchmarks </li></ul><ul><ul><li>It’s an Open Source project </li></ul></ul><ul><ul><ul><li>http://github.com/zooie/opensearch/tree/master </li></ul></ul></ul>
    32. 32. In action <ul><li>Lucene </li></ul><ul><li>Sphinx </li></ul><ul><li>Indri </li></ul><ul><li>All the code examples are here: </li></ul><ul><li>http://github.com/zooie/opensearch/tree/master </li></ul>
    33. 33. Lucene <ul><li>Lot of industrial support w/ proven scalability </li></ul><ul><ul><li>Amazon, Netflix, Wikipedia </li></ul></ul><ul><li>An IR Library in Java </li></ul><ul><ul><li>There’s also pyLucene & CLucene </li></ul></ul><ul><li>Use Nutch, Solr or Hounder for the rest </li></ul><ul><ul><li>Crawlers, result abstracts… </li></ul></ul>
    34. 34. Lucene Indexing
    35. 35. Lucene Search javac -cp /lucenedir/lucene-2.4.1/lucene-core-2.4.1.jar:. Index.java java -Xmx512m -cp /lucenedir/lucene-core-2.4.1.jar:. Index
    36. 36. Sphinx <ul><li>Runs Craigslist Search </li></ul><ul><li>MySQL integration focus </li></ul><ul><ul><li>But also supports an XML input pipe </li></ul></ul><ul><li>Pretty fast indexer </li></ul><ul><li>searchd, indexer commands </li></ul><ul><li>Mostly declarative option setting (sphinx.conf) </li></ul><ul><li>Client API (python, Java, ruby, php) sockets to searchd </li></ul>
    37. 37. Sphinx Indexing <ul><li>SQL text columns or XML input </li></ul><ul><li>sphinx.conf </li></ul><ul><li>indexer --quiet --config sphinx.conf medindex </li></ul>
    38. 38. Sphinx Search Socket connection to searchd Sphinx service
    39. 39. Indri <ul><li>Lemur Project </li></ul><ul><ul><li>http://www.lemurproject.org/ </li></ul></ul><ul><li>Powerful Structured Query Language </li></ul><ul><li>Advanced Language Models </li></ul><ul><li>Native C++; swigged Java, php </li></ul><ul><li>Command line binaries </li></ul><ul><li>Developer resources </li></ul><ul><ul><li>http://lemur.wiki.sourceforge.net/ </li></ul></ul>
    40. 40. Indri: Hello World <ul><li>Index & Search directory of txt files </li></ul><ul><li>IndriBuildIndex -index=/Users/viksi/sigir/med_data/indri_index -corpus.path=/Users/viksi/sigir/med_data/indri_data -corpus.class=txt -memory=300m </li></ul><ul><ul><li>http://www.lemurproject.org/lemur/indexing.php#IndriBuildIndex </li></ul></ul><ul><li>IndriRunQuery -index=/Users/viksi/sigir/med_data/indri_index -count=100 -rule="method:dirichlet,mu:2500" -query="#weight(1.0 #uw2(chest pain) 2.0 #1(heart attack))" </li></ul><ul><ul><li>http://www.lemurproject.org/lemur/IndriQueryLanguage.php </li></ul></ul>
    41. 41. Indexed Info in Search API
    42. 42. Index - Structured Meta Data <ul><li>SearchMonkey: </li></ul><ul><li>Yahoo! SearchMonkey captures the structured data from web sites for the index. </li></ul><ul><li>RDF </li></ul><ul><li>Microformats </li></ul>
    43. 43. Index - Social <ul><li>Social: </li></ul><ul><li>Delicious saves/tags </li></ul><ul><li>FOAF (Friend of a Friend), XFN </li></ul><ul><li>Recent social activity: Twitter, Facebook, Buzz, Blogs… </li></ul>
    44. 44. Index – Machine Tags <ul><li>Keyterms </li></ul><ul><li>Mis-Spelling </li></ul><ul><li>Content Enrichment </li></ul><ul><li>Inbound Links </li></ul>
    45. 45. Schedule 2:00 – 2:15 Introductions and Overview Rosie & Ted 2:15 – 2:30 Motivation – state of the industry Ted Drake 2:30 – 3:00 Search and Indexing Rosie & Ted 3:00 – 3:30 Hello World! Using Search Service APIs & Examples Ted Drake 3:30 – 4:00 Coffee Break 4:00 – 4:30 Mashup Patterns Ted & Rosie 4:30 – 5:00 Automatic Evaluation Rosie Jones 5:00 – 5:30 Discussion, Questions Ted & Rosie
    46. 46. Hello, World! Open Search Service APIs Photo by Oskay
    47. 47. Roadmap of APIs <ul><li>Google </li></ul><ul><li>Bing </li></ul><ul><li>BOSS </li></ul><ul><li>Twitter </li></ul><ul><li>YQL </li></ul><ul><li>Live examples </li></ul>Photo by Scorpions and Centaurs
    48. 48. Google AJAX Search <ul><li>Javascript Widget or API </li></ul><ul><li>REST API: </li></ul><ul><li>http://ajax.googleapis.com/ajax/services/search/{vertical}?v=1.0&q={query} </li></ul><ul><li>Web, Local, Video, Blogs, News, Books, Images, Patents </li></ul><ul><li>Can’t modify results though </li></ul>
    49. 49. Google Custom Search <ul><li>Turn-key product </li></ul><ul><li>Bulk load 1000s site restricts; On-demand 24 hour Web Indexing </li></ul><ul><li>Iframe or Custom Search Element results for developers; XML for enterprise </li></ul>
    50. 50. Bing 2.0 API <ul><li>Multiple Sources; Batch support </li></ul><ul><ul><li>Web, Images, InstantAnswer, Phonebook, RelatedSearch, Spell </li></ul></ul><ul><li>Usage: http://api.search.live.net/json.aspx?AppId={appid}&Market=en-US&Query={query}&Sources=web+spell&Web.Count=1 </li></ul><ul><li>Can modify (w/ some restrictions, i.e. re-ranking, blending with non-Bing sources) </li></ul>
    51. 51. Yahoo! BOSS <ul><li>BOSS = Build your Own Search Service </li></ul><ul><li>Open Yahoo’s core search features via web services to let 3rd parties revolutionize Search </li></ul><ul><li>Unrestricted </li></ul>
    52. 52. Unrestricted? <ul><li>Unlimited queries </li></ul><ul><li>Blend, re-order, discard </li></ul><ul><li>Full Presentation control </li></ul><ul><li>Limited only by your imagination </li></ul>
    53. 53. BOSS API <ul><li>Usage </li></ul><ul><ul><li>http://boss.yahooapis.com/ysearch/{vert}/v1/{q}?appid={appid}&start=0&count=10&lang=en&format=xml&view=keyterms </li></ul></ul><ul><li>Verticals </li></ul><ul><ul><li>Web, News, Images, Spelling </li></ul></ul><ul><li>In query syntax </li></ul><ul><ul><li>inurl, url, intitle, site, AND/OR, “-”, “+” </li></ul></ul><ul><li>Notable web view fields </li></ul><ul><ul><li>Delicious bookmarks </li></ul></ul><ul><ul><li>SearchMonkey (microformats) </li></ul></ul><ul><ul><li>Larger abstracts </li></ul></ul><ul><ul><li>Extracted Entities (keyterms) </li></ul></ul><ul><li>Can modify </li></ul>
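As a minimal sketch of the usage pattern on this slide, the URL template can be filled in programmatically. The helper name `build_boss_url`, its defaults, and the `YOUR_APP_ID` placeholder are illustrative assumptions, not part of any official client; the snippet only builds the URL string and sends no request.

```python
from urllib.parse import quote, urlencode

# Host and path are copied from the usage line on the slide.
BOSS_BASE = "http://boss.yahooapis.com/ysearch"


def build_boss_url(query, appid, vertical="web", start=0, count=10,
                   lang="en", fmt="xml", view=None):
    """Fill in the slide's URL template (hypothetical helper, not an official API)."""
    params = {"appid": appid, "start": start, "count": count,
              "lang": lang, "format": fmt}
    if view:
        params["view"] = view
    # The query sits in the URL path, so it is percent-encoded as a path segment.
    encoded_query = quote(query, safe="")
    return "{}/{}/v1/{}?{}".format(BOSS_BASE, vertical, encoded_query, urlencode(params))


url = build_boss_url("hadoop site:apache.org", appid="YOUR_APP_ID", view="keyterms")
print(url)
```

The in-query operators from the slide (such as `site:`) travel inside the encoded query segment, while paging and view options travel as ordinary query-string parameters.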
    54. 54. Web = Cross Platform <ul><li>Google AJAX, Bing, BOSS </li></ul><ul><li>HTTP GET, URI => XML, JSON </li></ul><ul><li>Any programming lang. that supports HTTP </li></ul><ul><li>Many language specific libraries available </li></ul><ul><ul><li>Web Search “[platform] [language]” </li></ul></ul><ul><ul><ul><li>“yahoo boss python” </li></ul></ul></ul><ul><li>Mobile: HTML web apps work on all smart phones </li></ul>
    55. 55. Platforms
    56. 56. Yahoo! YQL <ul><li>select * from internet API (e.g. flickr, ebay, amazon) </li></ul><ul><ul><li>http://developer.yahoo.com/yql/ </li></ul></ul>many standard & “open tables” services »
    57. 57. Amazon Web Services (AWS) <ul><li>Amazon Cloud Support </li></ul><ul><li>Amazon SimpleDB, Relational Database Services </li></ul><ul><li>E-Commerce Fulfillment Services </li></ul><ul><li>Messaging </li></ul><ul><li>Monitoring </li></ul><ul><li>Networking </li></ul><ul><li>Payments & Billing </li></ul><ul><li>Storage </li></ul><ul><li>Workforce: Amazon Mechanical Turk </li></ul><ul><li>Large scale functionality at startup prices </li></ul>
    58. 58. Google App Engine <ul><li>Free application hosting (up to 5 million page views/month) </li></ul><ul><li>Java, Ruby, or Python </li></ul><ul><li>Extensive SDK support </li></ul><ul><li>Distributed Data Storage (up to 500 MB for free) </li></ul>
    59. 59. Examples
    60. 60. BOSS Out in the Open <ul><li> http://www.xurch.com </li></ul><ul><li> http://search.techcrunch.com </li></ul><ul><li> http://www.spysee.jp </li></ul><ul><li> http://www.123people.com </li></ul><ul><li> http://www.pipl.com </li></ul><ul><li> http://tweetnews.appspot.com </li></ul><ul><li> http://bossy.appspot.com </li></ul><ul><li> http://www.hakia.com </li></ul><ul><li> http://oneriot.com </li></ul><ul><li> http://www.daylife.com </li></ul><ul><li> http://www.inquisitorx.com/ </li></ul><ul><li> http://insiderfood.com/ </li></ul><ul><li> http://ask-boss.appspot.com/ </li></ul><ul><li> http://www.4hoursearch.com </li></ul><ul><li> http://www.devunity.com (Techcrunch 50) </li></ul><ul><li> http://copyrightspot.com/ (Mashable) </li></ul><ul><li> http://imusicmash.com (Mashable) </li></ul><ul><li> http://truevert.com (Mashable) </li></ul><ul><li> http://professeurs.esiea.fr/wassner/?2008/10/20/171-semantic-calculator </li></ul><ul><li> http://www.ysearchblog.com/archives/000613.html </li></ul><ul><li> http://www.ysearchblog.com/archives/000621.html </li></ul><ul><ul><li>DNS Mashup </li></ul></ul><ul><ul><li>BuildASearch </li></ul></ul><ul><ul><li>PlayerSearch </li></ul></ul><ul><ul><li>V3GGIE </li></ul></ul><ul><ul><li>Dipidity Newsline </li></ul></ul><ul><ul><li>Tianamo </li></ul></ul>
    61. 61. Google Custom Search Examples <ul><li>CopyScape – Looks for sites copying your text </li></ul><ul><li>Topicalizer – Extracts topics, finds related information from text </li></ul>
    62. 62. Bing Examples <ul><li>Site Search Engine by a Microsoft engineer http://nathanbuggia.com/blog/post/Custom-Site-Search-Engine-Using-the-Live-Search-API.aspx </li></ul>
    63. 63. Coolest Features Across the Board <ul><li>BOSS se_link (graphs), delicious (bookmarks), keyterms (extracted entities), searchmonkey (rdfa, microformats, structured abstracts) </li></ul><ul><li>Yahoo! YQL </li></ul><ul><li>Bing Video, Translation, Instant Answer, Batch </li></ul><ul><li>Google CSE large site restricts, refinements </li></ul><ul><li>Google AJAX Transliteration, Blogs, Books </li></ul>
    64. 64. Schedule 2:00 – 2:15 Introductions and Overview Rosie & Ted 2:15 – 2:30 Motivation – state of the industry Ted Drake 2:30 – 3:00 Search and Indexing Rosie & Ted 3:00 – 3:30 Hello World! Using Search Service APIs & Examples Ted Drake 3:30 – 4:00 Coffee Break 4:00 – 4:30 Mashups Ted Drake 4:30 – 5:00 Automatic Evaluation Rosie Jones 5:00 – 5:30 Discussion, Questions Ted & Rosie
    65. 65. Mashups
    66. 66. Let’s Build Something <ul><li>TweetNews </li></ul><ul><ul><li>http://tweetnews.appspot.com/search?q=twitter </li></ul></ul><ul><ul><li>“the best mashup we’ve ever seen” (Wired) </li></ul></ul><ul><li>Tools </li></ul><ul><ul><li>BOSS, BOSS Mashup Framework, Google App Engine, Python 2.5 </li></ul></ul><ul><li>Source </li></ul><ul><ul><li>http://vik.singh.googlepages.com/fresh.zip </li></ul></ul>
    67. 67. Digression: TF-IDF for Ranking <ul><li>TF = Term Frequency </li></ul><ul><ul><li>Documents in which the query terms occur often tend to be relevant </li></ul></ul><ul><li>IDF = Inverse Document Frequency </li></ul><ul><ul><li>Words that are in every document aren’t as important </li></ul></ul><ul><ul><ul><li>The, of, “click here”, “home page” </li></ul></ul></ul><ul><ul><li>Document frequency: number of documents containing this term </li></ul></ul><ul><ul><li>Divide by document frequency: Inverse Document Frequency </li></ul></ul><ul><li>Sort by TF * IDF to get a ranking over documents </li></ul>
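A small sketch of the TF * IDF ranking described on this slide. The whitespace tokenizer, the function name `tf_idf_rank`, and the `log(N / df)` form of IDF are assumptions made for illustration; the slide only says to divide by document frequency, and real engines use various smoothed variants.

```python
import math
from collections import Counter


def tf_idf_rank(query, docs):
    """Return document indices sorted by the sum of TF * IDF over the query terms."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for idx, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if df[term]:
                # Assumed IDF variant: log(N / df). A term in every document gets weight 0.
                score += tf[term] * math.log(n_docs / df[term])
        scores.append((score, idx))
    # Highest score first; ties keep the original document order.
    return [idx for score, idx in sorted(scores, key=lambda s: (-s[0], s[1]))]


docs = [
    "the home page of the hadoop project",
    "click here for the home page",
    "hadoop hadoop tutorial on the hadoop file system",
]
ranking = tf_idf_rank("the hadoop", docs)
print(ranking)
```

In this toy collection "the" appears in every document, so it contributes nothing, and the ordering is driven entirely by how often "hadoop" occurs.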
    68. 68. TweetNews Model <ul><li>Goal: Inject relevance in latest news search results </li></ul><ul><li>Approach: </li></ul><ul><ul><li>Fetch latest (order by date) news results for query </li></ul></ul><ul><ul><li>Also fetch latest tweets for query (search.twitter.com) </li></ul></ul><ul><ul><li>Vectorize each Twitter and News search result </li></ul></ul><ul><ul><li>Euclidean Normalized TFIDF document vector of term:freq pairs </li></ul></ul><ul><ul><li>Compute cosine sim between each twitter & news result vector </li></ul></ul><ul><ul><li>Assign tweet to news result if sim >= threshold </li></ul></ul><ul><ul><li>Sort news results by # of related tweets </li></ul></ul><ul><li>WWW2010 similar paper </li></ul><ul><ul><li>Time is of the Essence: Improving Recency Ranking Using Twitter Data Anlei Dong, Ruiqiang Zhang, Pranam Kolari, Bai Jing, Yi Chang, Fernando Diaz, Zhaohui Zheng, Hongyuan Zha </li></ul></ul>
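The approach above can be sketched in a few lines. This is not the TweetNews source: the function names, the smoothed IDF, the whitespace tokenizer, and the threshold value of 0.1 are all assumptions chosen for a self-contained illustration, and the hard-coded strings stand in for results that would be fetched from the news and Twitter search services.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return one Euclidean-normalized TF-IDF vector (term -> weight) per text."""
    tokenized = [t.lower().split() for t in texts]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumed smoothed IDF, so terms found in every text keep a small positive weight.
        vec = {term: freq * math.log(1 + n / df[term]) for term, freq in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({term: w / norm for term, w in vec.items()} if norm else {})
    return vectors


def cosine(a, b):
    """For unit-length sparse vectors the dot product is the cosine similarity."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(term, 0.0) for term, w in a.items())


def rank_news_by_tweets(news, tweets, threshold=0.1):
    """Sort news items by the number of tweets with similarity >= threshold."""
    vectors = tfidf_vectors(news + tweets)
    news_vecs, tweet_vecs = vectors[:len(news)], vectors[len(news):]
    counts = [sum(1 for tv in tweet_vecs if cosine(nv, tv) >= threshold)
              for nv in news_vecs]
    order = sorted(range(len(news)), key=lambda i: -counts[i])
    return [(news[i], counts[i]) for i in order]


news = ["city council approves new budget",
        "volcano eruption grounds flights across europe"]
tweets = ["flights grounded again because of the volcano",
          "stuck at the airport volcano eruption cancelled flights",
          "what a lovely morning"]
ranked = rank_news_by_tweets(news, tweets)
print(ranked)
```

Short texts share few terms, so similarities stay low and the threshold would need tuning on real data; the point of the sketch is only the vectorize, compare, count, sort pipeline.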
    69. 69. TweetNews Main Source
    70. 70. Non-Search: delicious Classifier <ul><li>Usage </li></ul><ul><ul><li>&view=delicious_toptags </li></ul></ul><ul><ul><li>&view=delicious_saves </li></ul></ul><ul><li>Idea: Liberal v. Conservative Classifier </li></ul><ul><li>Generate politics queries list </li></ul><ul><ul><li>Mine Reuters or editors </li></ul></ul><ul><li>BOSS search each; take top 1k results </li></ul><ul><li>Filter on tag ‘liberal’ or ‘conservative’; assemble binary training set </li></ul><ul><li>Features </li></ul><ul><ul><li>“&abstract=long”, “&view=keyterms, delicious_saves, searchmonkey_rss”, title, url, date, se_link # inbound links </li></ul></ul>
    71. 71. Mashup: Related terms <ul><li>Delicious users can tag web sites they bookmark. </li></ul><ul><li>Get a ranked list of tags for a general topic </li></ul><ul><li>select delicious_toptags,title from search.web where query=&quot;hadoop&quot; and view=&quot;delicious_toptags&quot; </li></ul>
    72. 72. Mashup – Social Impact <ul><li>What are your friends buzzing, digging, tagging… </li></ul><ul><li>YQL: select * from social.connections.updates where guid=me </li></ul><ul><li>Use data to find more recent and relevant information </li></ul><ul><li>Lijit creates a vertical search engine based on a user’s delicious, facebook, and other saved bookmarks </li></ul><ul><li>WWW2010 Related Paper : Liquid Query: Multi-domain Exploratory Search on the Web Marco Brambilla, Alessandro Bozzon, Stefano Ceri, Piero Fraternali </li></ul><ul><li>Now it’s time to turn on the FIRE HOSE </li></ul>
    73. 73. Mashup – The Fire Hose
    74. 74. Mashup – Government Data <ul><li>Guardian’s World Government Data Collection http://www.guardian.co.uk/world-government-data </li></ul><ul><ul><li>U.S. Unemployment Statistics </li></ul></ul><ul><ul><li>U.S. Aviation Accidents </li></ul></ul><ul><ul><li>Raw Data for U. S. Department of Energy (DOE) Categorical Exclusion(CX) Determinations Under the National Environmental Policy Act (NEPA) </li></ul></ul><ul><ul><li>Treasury Recovery Act Data </li></ul></ul><ul><ul><li>Migratory Bird Flyways - Continental United States </li></ul></ul>
    75. 75. Coming Soon: Twitter Annotations <ul><li>Metadata for tweets </li></ul><ul><li>Step 1. Create a link for users to tweet your page. </li></ul><ul><li>Step 2. Insert metadata into each tweet. </li></ul><ul><li>Step 3. Pull that information back and mash it with other data. </li></ul><ul><li>Example </li></ul><ul><li>Yahoo! Finance has a “tweet this stock” link. </li></ul><ul><li>Insert information (ticker:yhoo) into the tweet’s metadata. </li></ul><ul><li>Follow the distribution of this metadata and look for correlations in stock price activity. Perhaps a new line on Finance Charts. </li></ul>
    76. 76. Mashup – Open Tables on YQL <ul><ul><li>Define new API definitions </li></ul></ul><ul><ul><li>Open Source in GitHub </li></ul></ul><ul><ul><li>Server-side JavaScript allows Insert and more </li></ul></ul><ul><ul><li>Allows for private keys </li></ul></ul>
    77. 77. Mashup – Open Tables on YQL <ul><li>
<?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?>
<table xmlns=&quot;http://query.yahooapis.com/v1/schema/table.xsd&quot;>
  <meta>
    <author>Nagesh Susarla</author>
    <documentationURL>See search.web and search.images for more details</documentationURL>
  </meta>
  <bindings>
    <select itemPath=&quot;results.result&quot; produces=&quot;XML&quot;>
      <inputs>
        <key id=&quot;query&quot; type=&quot;xs:string&quot; paramType=&quot;query&quot; required=&quot;true&quot;/>
      </inputs>
      <execute><![CDATA[
        var qs = query;
        var search = y.query('select * from search.web(50) where query=@query', {query: qs}).results;
        var images = [];
        default xml namespace='http://www.inktomi.com/';
        for each (var result in search.result) {
          images.push(y.query('select * from search.images(1) where query=@query and url=@url', {url:result.url, query:qs}));
        }
        var i = 0;
        for each (var result in search.result) {
          var image = images[i].results.result;
          if (image) {
            result.image = <image>{image}</image>;
          }
          i++;
        }
        response.object = search;
      ]]>
      </execute>
    </select>
  </bindings>
</table>
</li></ul>
    78. 78. Mashup – Using an Open Table
    79. 79. Blending Vertical + Service <ul><li>Comprehensiveness! </li></ul><ul><ul><li>Every Search Engine should be a One-Stop Shop </li></ul></ul>
    80. 80. Delicious Blending Idea <ul><li>Goal: Blend delicious + web results </li></ul><ul><li>Approach: </li></ul><ul><ul><li>1000s BOSS Web Queries, Filter w/ delicious_saves </li></ul></ul><ul><ul><li>Training set: x: search features | y: delicious count </li></ul></ul><ul><ul><li>Machine learn the transfer function </li></ul></ul><ul><ul><ul><li>Infer the delicious count for any web result </li></ul></ul></ul><ul><ul><ul><li>Can now normalize the two search result sets </li></ul></ul></ul>
    81. 81. From Web
    82. 82. Hack Ideas <ul><li>Discovery (BOSS Search App Store) </li></ul><ul><ul><li>Designing a fairer marketplace for app distribution </li></ul></ul><ul><ul><li>Emerging problem for Facebook, iPhone App Store </li></ul></ul><ul><li>Desktop, Data Visualization (Cooliris, Inquisitor) </li></ul><ul><li>Mobile (iPhone, Android, BlackBerry) </li></ul><ul><li>Passive Location/Contextual Based Search </li></ul><ul><li>Social (Facebook, Twitter, OpenSocial, Friend Connect, OneConnect) </li></ul><ul><li>Semantic </li></ul><ul><li>BOSS keyterms, SearchMonkey </li></ul><ul><li>Bing Instant Answers </li></ul><ul><li>Google CSE Refiners </li></ul>
    83. 83. Schedule 2:00 – 2:15 Introductions and Overview Rosie & Ted 2:15 – 2:30 Motivation – state of the industry Ted Drake 2:30 – 3:00 Search and Indexing Rosie & Ted 3:00 – 3:30 Hello World! Using Search Service APIs & Examples Ted Drake 3:30 – 4:00 Coffee Break 4:00 – 4:30 Mashups Ted Drake 4:30 – 5:00 Ranking and Evaluation Rosie Jones 5:00 – 5:30 Discussion, Questions Ted & Rosie
    84. 84. Ranking
    85. 85. Retrieval and Ranking <ul><li>RETRIEVE the documents matching simple conditions </li></ul><ul><ul><li>Boolean AND on query terms </li></ul></ul><ul><ul><li>TF-IDF </li></ul></ul><ul><ul><li>… </li></ul></ul><ul><li>RANK using more sophisticated function </li></ul><ul><ul><li>Term proximity </li></ul></ul><ul><ul><li>Page authority </li></ul></ul><ul><ul><li>Author identity </li></ul></ul><ul><ul><li>… </li></ul></ul>
    86. 86. Ranking with Open Source Tools <ul><li>Indri/Lemur </li></ul><ul><ul><li>Language modeling </li></ul></ul><ul><ul><li>BM25, Okapi, Cosine similarity, inQuery </li></ul></ul><ul><li>Lucene </li></ul><ul><ul><li>TF-IDF, weighted by term occurrences </li></ul></ul><ul><ul><li>Fielded search </li></ul></ul><ul><li>Terrier </li></ul><ul><ul><li>Okapi BM25, language modeling and TF-IDF </li></ul></ul><ul><ul><li>Divergence from Randomness </li></ul></ul><ul><li>Your own re-ranking code using open search </li></ul>
    87. 87. Evaluation with Click Logs
    88. 88. Evaluating with Clicks People click on the good results, right?
    89. 89. Not All Results Are Equally Likely to be Looked At (Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)
    90. 90. Clicks and Views Depend on Rank [Joachims et al, 2005]
    91. 91. Evaluation from Click Logs <ul><li>Show a screenshot and me doing a “skip first” </li></ul>Read From Top to Bottom Skip Skip Skip Click! [Joachims et al SIGIR 2005]
    92. 92. Mining Clicks for Ranking <ul><li>Clicks can be used to predict </li></ul><ul><ul><li>Pairwise preference </li></ul></ul><ul><ul><ul><li>Query: Doc1, Doc2 [ Joachims 2002] </li></ul></ul></ul><ul><ul><li>Absolute relevance </li></ul></ul><ul><ul><ul><li>Taking clicks on other documents into account </li></ul></ul></ul><ul><ul><ul><li>[Carterette and Jones, NIPS 2007] </li></ul></ul></ul><ul><ul><ul><li>[Chapelle and Zhang, WWW 2009] </li></ul></ul></ul>
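One common way to turn clicks into pairwise preferences is the "skip above" heuristic: a clicked result is taken to be preferred over every unclicked result ranked above it, since the user presumably looked at those and passed them by. The sketch below is an illustration of that heuristic only; the function name and the document identifiers are made up.

```python
def skip_above_preferences(ranking, clicked):
    """Return (preferred, over) pairs: each clicked result beats every unclicked result above it."""
    clicked = set(clicked)
    prefs = []
    for pos, doc in enumerate(ranking):
        if doc in clicked:
            # Everything ranked above the click and not clicked counts as skipped.
            prefs.extend((doc, above) for above in ranking[:pos] if above not in clicked)
    return prefs


prefs = skip_above_preferences(["d1", "d2", "d3", "d4", "d5"], clicked=["d4"])
print(prefs)
```

Note that nothing is inferred about `d5`: results below the last click may never have been seen, so the heuristic stays silent on them.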
    93. 93. Interleaving for Learning from Clicks – Pairwise Judgments <ul><li>[Joachims, KDD 2002] </li></ul><ul><li>[Radlinski and Joachims, KDD 2007] </li></ul><ul><li>[Radlinski et al, CIKM 2008] </li></ul>Results from Method 1 Results from Method 2
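    94. 94. Evaluation using Discounted Cumulative Gain <ul><li>Discounted Cumulative Gain (DCG) </li></ul><ul><li>[Järvelin and Kekäläinen 2000] </li></ul><ul><li>Relevance grades </li></ul><ul><ul><li>Highly relevant: value = 3 </li></ul></ul><ul><ul><li>Somewhat relevant: value = 2 </li></ul></ul><ul><ul><li>Tangentially relevant: value = 1 </li></ul></ul><ul><ul><li>Irrelevant: value = 0 </li></ul></ul><ul><li>Rank discount </li></ul><ul><ul><li>Most important: value = 1 </li></ul></ul><ul><ul><li>Less important: value = 1/log(i) </li></ul></ul>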
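A short sketch of the DCG computation, using the graded values from this slide (3, 2, 1, 0). Because log(1) = 0, the 1/log(i) discount cannot be applied at the top rank; the code follows the usual convention of leaving ranks below the log base undiscounted and dividing by log_b(i) from there on. The base of 2 and the function name are assumptions.

```python
import math


def dcg(gains, base=2):
    """Discounted cumulative gain: no discount for ranks below `base`, 1/log_base(rank) after."""
    total = 0.0
    for rank, gain in enumerate(gains, start=1):
        discount = 1.0 if rank < base else 1.0 / math.log(rank, base)
        total += gain * discount
    return total


# Relevance grades of the results at ranks 1..4 for two hypothetical rankings.
ranking_1 = [3, 2, 0, 1]
ranking_2 = [0, 1, 2, 3]
print(dcg(ranking_1), dcg(ranking_2))
```

Both rankings contain the same documents, but the one that places the highly relevant result first scores higher, which is the behaviour the discount is meant to capture.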
    95. 95. Directly Modeling Relevance From Clicks Which ranking of web pages is better for the query “NIPS 2007”? [Carterette and Jones, NIPS 2007] [Figure: two rankings, each showing ranks 1 to 5 with click counts] Is DCG1 > DCG2? P(DCG1 > DCG2)
    96. 96. Ingredients for Learning from Clicks <ul><li>Sufficient users </li></ul><ul><li>Ability to record results shown </li></ul><ul><li>Ability to vary presentation order </li></ul><ul><li>Ability to vary results shown </li></ul><ul><li>Ability to log clicks </li></ul><ul><li>Ability to run experiments varying your secret sauce </li></ul>
    97. 98. How to Get Search Engine Results to Modify? <ul><li>Radlinski and Joachims </li></ul><ul><li>citeseer/arXiv.org results and permuted rankings, recorded clicks, skip above, skip next </li></ul><ul><li>See also their open source engine Osmot </li></ul><ul><ul><li>http://radlinski.org/osmot/ </li></ul></ul>Search Engine Services Allow You to Do This Kind of Thing Yourself
    98. 99. Query Logs <ul><li>Might be in /etc/httpd/logs/access_log*; check httpd.conf </li></ul><ul><li>[IP] - - [Time] &quot;[Method] [URI] [Version]&quot; [Code] [Bytes] &quot;[Referrer]&quot; &quot;[User-Agent]&quot; </li></ul><ul><ul><li>10.66.91.231 - - [08/Jun/2009:21:24:44 -0700] &quot;GET /search?q=awesome+presentation HTTP/1.1&quot; 200 2940 &quot;http://i_was_referred_from_here.com&quot; &quot;Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.10) Gecko/2009042315 Firefox/3.0.10 Ubiquity/0.1.4&quot; </li></ul></ul><ul><li>Tip: Instrument as much as possible in GET URI via CGI parameters </li></ul><ul><ul><li>search?q=yahoo&region=us&tab=local&device=mobile&advanced=1 </li></ul></ul><ul><ul><li>One log, avoid joins; URI must be < 2k bytes </li></ul></ul><ul><li>grep, cut, uniq, wc, sort, cat are your friends </li></ul><ul><ul><li>Ex. Count user query sessions (session key = IP+hour) </li></ul></ul><ul><ul><li>sudo grep '/search?q=' /etc/httpd/logs/access_log.1 | cut -d' ' -f1,4 | cut -d':' -f1,2 | sort -u | wc -l </li></ul></ul><ul><li>For advanced SQL processing on single machine: sqlite3 import script </li></ul><ul><ul><li>http://selinap.com/2008/04/python-parse-apache-log-to-sqlite-database/ </li></ul></ul><ul><li>Distributed: Hadoop & Pig </li></ul><ul><ul><li>http://www.cloudera.com/blog/2009/06/17/analyzing-apache-logs-with-pig/ </li></ul></ul>
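The session count from this slide (session key = IP + hour) can also be sketched in Python, which becomes convenient once the analysis outgrows a shell pipeline. The regular expression, its group names, and the sample lines are assumptions written for the log layout shown above; a real deployment should check the `LogFormat` directive in httpd.conf.

```python
import re

# Matches the start of an access-log line in the layout shown on the slide.
# The group names are our own labels, not Apache terminology.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<version>[^"]+)" '
    r'(?P<code>\d{3}) (?P<size>\S+)'
)


def count_sessions(lines):
    """Count distinct (IP, day+hour) pairs among /search?q= requests."""
    sessions = set()
    for line in lines:
        m = LOG_LINE.match(line)
        if m and m.group("uri").startswith("/search?q="):
            # "08/Jun/2009:21:24:44 -0700" -> "08/Jun/2009:21"
            hour = ":".join(m.group("time").split(":")[:2])
            sessions.add((m.group("ip"), hour))
    return len(sessions)


sample = [
    '10.66.91.231 - - [08/Jun/2009:21:24:44 -0700] "GET /search?q=awesome+presentation HTTP/1.1" 200 2940 "-" "Mozilla/5.0"',
    '10.66.91.232 - - [08/Jun/2009:21:25:01 -0700] "GET /search?q=yahoo HTTP/1.1" 200 1200 "-" "Mozilla/5.0"',
    '10.66.91.231 - - [08/Jun/2009:21:40:10 -0700] "GET /search?q=awesome+slides HTTP/1.1" 200 3100 "-" "Mozilla/5.0"',
    '10.66.91.231 - - [08/Jun/2009:22:02:13 -0700] "GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]
print(count_sessions(sample))
```

Collecting the keys in a set makes the count independent of line order, so requests from different users that interleave within the same hour are not counted twice.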
    99. 100. Other Wishlist Items <ul><li>A good baseline </li></ul><ul><ul><li>Motivate your users to use your engine </li></ul></ul><ul><ul><li>More fun than reading newspaper stories from 1997 </li></ul></ul><ul><li>Evaluate something that is different from ranking </li></ul><ul><ul><li>Summarization </li></ul></ul><ul><ul><li>Information extraction </li></ul></ul><ul><li>Or improve on existing ranking </li></ul><ul><li>NLP tasks “take top results and do X…” </li></ul><ul><ul><li>Data mining </li></ul></ul><ul><li>Pseudo-relevance feedback </li></ul>
    100. 101. Reasons to Build a Demo “ Eat Your Own Dogfood” algorithm design and testing - allows you to improve without labeled data - look closely at the results - convince your advisor/funders it works! Observe user behavior Cheap flight to boston Cheap flights to boston Cheap flights Travelocity Expedia American arlines.com American airlines.com Americanairlines.com Puppy Cute puppy More cute puppy picutres
    101. 102. More About Logs and Evaluation in Other Tutorials <ul><li>Web Search Engine Metrics (Direct Metrics to Measure User Satisfaction) – Tuesday, 2:00 PM–5:30 PM </li></ul><ul><li>Ali Dasdan , Yahoo! (USA) Kostas Tsioutsiouliklis , Yahoo! (USA) Emre Velipasaoglu , Yahoo! (USA) </li></ul><ul><li>Web Search/Browse Log Mining: Challenges, Methods, and Applications – Today, 9:00 AM–5:30 PM </li></ul><ul><li>Daxin Jiang , Microsoft (China), Jian Pei , Simon Fraser University (Canada) Hang Li , Microsoft ( </li></ul>
    102. 103. What Doesn’t Exist? <ul><li>Query log mining tools </li></ul><ul><ul><li>An opportunity for you! </li></ul></ul><ul><li>… </li></ul>
    103. 104. Other Open Source Tools
    104. 105. Lemur Query Log Toolbar <ul><li>Research community project for collecting query logs </li></ul><ul><ul><li>Sign up at http://lemurstudy.cs.umass.edu/ </li></ul></ul><ul><li>Built and maintained by LTI CMU and CIIR UMass Amherst </li></ul><ul><li>http://www.lemurproject.org/ </li></ul>
    105. 106. Book on Hadoop Scale Processing Coming Out <ul><li>Ivory: A Hadoop toolkit for Web-scale information retrieval </li></ul><ul><li>http://www.umiacs.umd.edu/~jimmylin/ivory/docs/index.html </li></ul><ul><li>Jimmy Lin </li></ul>
    106. 107. Take Home Messages <ul><li>You can evaluate with clicks </li></ul><ul><li>You can collect clicks by building a useful / fun search service </li></ul><ul><li>You can create a useful/fun search service using open search APIs </li></ul><ul><li>You obtain implementations of standard retrieval algorithms with open source search engines </li></ul><ul><li>Modify that code with your new techniques </li></ul>
    107. 108. Pointers - Tools <ul><li>[1] Indri Homepage. http://www.lemurproject.org/indri/. </li></ul><ul><li>[2] Lemur Toolkit Homepage. http://www.lemurproject.org/. </li></ul><ul><li>[3] Lucene Homepage. http://jakarta.apache.org/lucene/. </li></ul><ul><li>[4] Xapian Code Library Homepage. http://www.xapian.org/. </li></ul><ul><li>[5] Zettair Homepage. http://www.seg.rmit.edu.au/zettair/. </li></ul><ul><li>[6] Terrier Homepage. http://ir.dcs.gla.ac.uk/terrier/. </li></ul><ul><li>[7] Nutch Homepage. http://lucene.apache.org/nutch/. </li></ul><ul><li>[8] Sphinx Search Homepage. http://sphinxsearch.com/. </li></ul>
    108. 109. Mashup Resources <ul><li>Yahoo Developer Network: developer.yahoo.com </li></ul><ul><li>Y.Q.L. : developer.yahoo.com/yql </li></ul><ul><li>BOSS : developer.yahoo.com/boss </li></ul><ul><li>Bing : bing.com/developers </li></ul><ul><li>Google Search: code.google.com/apis/ajaxsearch/ </li></ul><ul><li>App Engine : code.google.com/appengine/ </li></ul><ul><li>A.W.S. : aws.amazon.com </li></ul><ul><li>Programmable Web : programmableWeb.com </li></ul><ul><li>Mashable : mashable.com </li></ul><ul><li>Tech Crunch : TechCrunch.com </li></ul>
    109. 110. Acknowledgements <ul><li>Vik Singh co-wrote earlier version of this tutorial </li></ul><ul><li>A few slides from Ricardo Baeza-Yates and Ben Carterette </li></ul><ul><li>Andrew Tomkins, Wei Vivian Zhang, Ahmed Hassan, Eran Palmon for helpful feedback </li></ul>
    110. 111. <ul><li>QA </li></ul>