Open Source Search Tools for www2010 conference: Applications of Open Search Tools: WWW2010 Tutorial

  1. Applications of Open Search Tools: WWW2010 Tutorial. Rosie Jones and Ted Drake, Yahoo! Inc. April 26th, 2010. [email_address], [email_address]
  2. Introductions
  3. Schedule
     2:00 – 2:15  Introductions and Overview (Rosie & Ted)
     2:15 – 2:30  Motivation – state of the industry (Ted Drake)
     2:30 – 3:00  Indexing and Search (Rosie & Ted)
     3:00 – 3:30  Hello World! Using Search Service APIs & Examples (Ted Drake)
     3:30 – 4:00  Coffee Break
     4:00 – 4:30  Mashup Patterns (Ted Drake)
     4:30 – 5:00  Ranking and Evaluation (Rosie Jones)
     5:00 – 5:30  Discussion, Questions (Ted & Rosie)
  5. Motivation
  6. Motivation II: Tools for Academic Papers
  8. Web Search Architecture
     Offline:
       Crawlers: find documents, follow links, fetch freshest content, build graph of hyperlinks
       Indexers: process text and meta-data; index text and meta-data (compressed, for quick lookup)
     Runtime:
       Retrieval: find documents containing query words
       Ranking
       Interface
  9. What is Open Search
  10. Open Source Search and Open Search
      Open source code lets you build your own search engine.
      Open search lets you leverage existing commercial search engines.
  11. Scraping Modules http://search.cpan.org/~jfriedl/Yahoo-Search-1.10.13/lib/Yahoo/Search.pm
  12. Do I Look Like A Piece of Bad Software?
  13. Information Superhighway for Known Robots: a search engine may stop accepting requests from your IP, or just slow down service.
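Slides 12 and 13 warn that an unidentified or impolite scraper can be throttled or blocked. One common courtesy is to consult a site's robots.txt before fetching. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt text, the `example.com` URLs, and the user-agent name are all hypothetical, and a real client would download the file from the target site instead of parsing an inline string.

```python
import urllib.robotparser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Crawl-delay: 5
"""

# Hypothetical user-agent string identifying our robot.
USER_AGENT = "example-research-bot"


def make_parser(robots_txt):
    """Build a parser from robots.txt text instead of fetching it."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = make_parser(ROBOTS_TXT)

# The results page is disallowed for all robots; the static page is not.
allowed_search = parser.can_fetch(USER_AGENT, "http://example.com/search?p=open+search")
allowed_about = parser.can_fetch(USER_AGENT, "http://example.com/about.html")

# Seconds to wait between requests, or None if the site does not say.
delay = parser.crawl_delay(USER_AGENT)

print(allowed_search, allowed_about, delay)
```

Respecting robots.txt does not guarantee that a service will accept automated queries; a documented search API is usually the safer route.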
  14. Indexing Your Own Content
  15. Inverted Index [Figure: each term (open, search, ted, drake, mit) points to a posting list of document IDs, for example D1 D3 D8 D9 D15 D32; example query = open search ted drake]
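The inverted index on slide 15 maps each word to the list of documents that contain it. A minimal sketch of that idea follows, assuming whitespace tokenization and AND semantics for multi-word queries (neither is stated on the slide). The four documents and their IDs are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a sorted posting list of document IDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Return documents containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return []
    postings = [set(index.get(term, [])) for term in terms]
    return sorted(set.intersection(*postings))


# Hypothetical toy collection.
docs = {
    "D1": "open search tools by ted drake",
    "D3": "ted drake on open source search",
    "D6": "mit open courseware",
    "D8": "search at mit",
}

index = build_inverted_index(docs)
hits = search(index, "open search ted drake")

print(index["mit"])
print(hits)
```

Production engines such as Lucene store postings in compressed form and add term positions and frequencies; this sketch keeps only document membership.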
  16. High Level Comparison

      Platform  License   Lang.  Docs             Ranking        Users       Parallel  Scale
      Lucene    Apache    Java   Many             Flexible       Amazon      Yes       TB
      zettair   BSD like  C      HTML, TREC, TXT  Flexible       Research    No        TB
      Indri     BSD like  C++    Many             Very Flexible  Research    Yes       TB
      Sphinx    GPL       C++    Many             Flexible       craigslist  Yes       TB
      RDBMS     BSD, GPL  C      SQL Text         Limited        -           Maybe     GB
      Xapian    GPL       C++    Many             Flexible       gmane       Yes       TB
  17. Previous Benchmarks (Middleton+Baeza-Yates 07)
  18. Lucene Indexing
  19. Lucene Search
      javac -cp /lucenedir/lucene-2.4.1/lucene-core-2.4.1.jar:. Index.java
      java -Xmx512m -cp /lucenedir/lucene-core-2.4.1.jar:. Index
  20. Sphinx Search: socket connection to the searchd Sphinx service
  21. Indexed Info in Search API
  23. Hello, World! Open Search Service APIs Photo by Oskay
  24. Platforms
  25. Examples
  27. Mashups
  28. TweetNews Main Source
  29. Mashup – The Fire Hose
  30. Mashup – Using an Open Table
  31. From Web
  33. Ranking
  34. Evaluation with Click Logs
  35. Evaluating with Clicks: people click on the good results, right?
  36. Not All Results Are Equally Likely to be Looked At (Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)
  37. Clicks and Views Depend on Rank [Joachims et al, 2005]
  38. Directly Modeling Relevance From Clicks [Carterette and Jones, NIPS 2007]
      Which ranking of web pages is better for the query "NIPS 2007"?
      [Figure: two rankings, each showing Rank 1 to Rank 5 with click counts. Is DCG1 > DCG2? P(DCG1 > DCG2)]
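Slide 38 asks whether one ranking has a higher DCG than another, and with what probability. The sketch below shows the arithmetic only. It assumes a common form of DCG (gain divided by log2 of rank + 1), binary relevance, and independent per-document relevance probabilities. Those probabilities are invented here; in Carterette and Jones they are estimated from click data, which this example does not attempt.

```python
import math
from itertools import product


def dcg(relevances):
    """Discounted cumulative gain: sum of rel_i / log2(i + 1), ranks from 1."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def prob_dcg1_greater(ranking1, ranking2, p_relevant):
    """P(DCG1 > DCG2), enumerating all binary relevance outcomes.

    Assumes documents are relevant independently with the given
    probabilities. Exponential in the number of documents, so only
    suitable for small examples.
    """
    docs = sorted(set(ranking1) | set(ranking2))
    total = 0.0
    for outcome in product([0, 1], repeat=len(docs)):
        rel = dict(zip(docs, outcome))
        weight = 1.0
        for doc in docs:
            weight *= p_relevant[doc] if rel[doc] else 1.0 - p_relevant[doc]
        dcg1 = dcg([rel[doc] for doc in ranking1])
        dcg2 = dcg([rel[doc] for doc in ranking2])
        if dcg1 > dcg2:
            total += weight
    return total


# Hypothetical relevance probabilities for two documents.
p_relevant = {"a": 0.9, "b": 0.2}
p = prob_dcg1_greater(["a", "b"], ["b", "a"], p_relevant)

print(dcg([1, 0, 1]))
print(p)
```

With these made-up numbers, ranking 1 wins only when "a" is relevant and "b" is not, so the probability is 0.9 × 0.8 = 0.72; ties (both or neither relevant) do not count as a win.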
  40. Reasons to Build a Demo
      "Eat Your Own Dogfood" algorithm design and testing
        - allows you to improve without labeled data
        - look closely at the results
        - convince your advisor/funders it works!
      Observe user behavior (example query sequences)
        - Cheap flight to boston; Cheap flights to boston; Cheap flights; Travelocity; Expedia
        - American arlines.com; American airlines.com; Americanairlines.com
        - Puppy; Cute puppy; More cute puppy picutres
  41. Other Open Source Tools

Editor's Notes

  1. http://developer.yahoo.com/everything.html - for logos
  2. ROSIE – SHOW PSEUDOCODE FOR SIMPLIFIED VERSION – THEN CONVERT TO YQL(TED) OR PERL (ROSIE)? The user uses a search interface to rapidly gather many snippets that contain similar phrases, and then selects those that they would like to mark (Figure 6). The server uses Yahoo BOSS2 to search the web for snippets that resemble a paraphrase entered by the user. SIGIR 2008 proceedings http://portal.acm.org/toc.cfm?id=1390334&idx=SERIES278&type=proceeding&coll=ACM&dl=ACM&part=series&WantType=Proceedings&title=SIGIR&CFID=43145604&CFTOKEN=93348762 Jung et al IP&M http://search.yahoo.com/search?p=Click+data+as+implicit+relevance+feedback+in+web+search&ei=UTF-8&fr=moz2 Affective feedback http://eprints.gla.ac.uk/4825/1/4825.pdf http://portal.acm.org/citation.cfm?id=1390566&dl=GUIDE&coll=GUIDE&CFID=43143609&CFTOKEN=22951859
  4. ROSIE WORK ON THIS TONIGHT
  5. Eran / Ashim: okay to include BOSS here?
  6. WHAT DO WE SHOW FOR PRESENTATION?? – the SIGIR 2008 papers? Query-biased summaries
  7. Mapping from words to documents containing them
  9. TED check correctness
  10. TED check correctness
  11. LOGO NEEDED!
  12. TED UPDATE
  13. TED UPDATE?
  21. ROSIE ADD A PICTURE
  22. TALK MORE ABOUT THIS EXAMPLE
  23. MAKE A NEW SCREENSHOT WITHOUT CLIPPED TEXT. DESCRIBE arXiv.org more fully: who uses it, what it does, etc. Radlinski et al. implemented arXiv search on top of Lucene: http://search.arxiv.org/ One could use e.g. Yahoo result ordering as one baseline: BOSS with restriction to arxiv.org. What would this pseudocode look like?
  24. TODO: Examples from SIGIR 2008 papers for each of those
  25. RJ show unigram/ngram examples - add refs for Observe User Behavior