Open Source Search Tools for www2010 conference
Applications of Open Search Tools: WWW2010 Tutorial
Apr. 26, 2010
Presentation by Ted Drake and Rosie Jones for the WWW2010 conference in North Carolina. It discusses open source search software, APIs, and trends.
Schedule
2:00 – 2:15  Introductions and Overview (Rosie & Ted)
2:15 – 2:30  Motivation – state of the industry (Ted Drake)
2:30 – 3:00  Search and Indexing (Rosie & Ted)
3:00 – 3:30  Hello World! Using Search Service APIs & Examples (Ted Drake)
3:30 – 4:00  Coffee Break
4:00 – 4:30  Mashup Patterns (Ted & Rosie)
4:30 – 5:00  Automatic Evaluation (Rosie Jones)
5:00 – 5:30  Discussion, Questions (Ted & Rosie)
Web Search Architecture
Offline
- Crawlers: find documents, follow links, fetch freshest content, build graph of hyperlinks
- Indexers: process and index text and meta-data, compressed, for quick lookup
Runtime
- Retrieval: find documents containing query words
- Ranking
- Interface
Open Source Search and Open Search
- Open source code lets you build your own search engine
- Open search lets you leverage existing commercial search engines
Inverted Index
[Figure: each term (open, drake, search, ted, mit) points to a posting list of document IDs, e.g. D1 D3 D8 D9 D15 D32. Example: query = open search ted drake]
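A minimal sketch of an inverted index may help make the idea concrete. It assumes whitespace tokenization, lowercasing, and AND semantics for multi-word queries; the toy documents and the function names are hypothetical and are not taken from the slides or from any of the engines discussed.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to a sorted posting list of document IDs."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search_all_terms(index, query):
    """Return IDs of documents that contain every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        # Intersect posting lists; a missing term empties the result.
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical toy collection, for illustration only.
DOCS = {
    1: "open source search tools",
    2: "ted drake on open search",
    3: "search at mit",
}
INDEX = build_inverted_index(DOCS)

if __name__ == "__main__":
    print(INDEX["open"])
    print(search_all_terms(INDEX, "open search ted drake"))
```

Production engines store compressed posting lists and merge them in sorted order rather than building Python sets; the sketch only shows the term-to-documents mapping and the intersection step.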
High Level Comparison
Platform  License   Lang.  Docs             Ranking        Users       Parallel  Scale
Lucene    Apache    Java   Many             Flexible       Amazon      Yes       TB
zettair   BSD like  C      HTML, TREC, TXT  Flexible       Research    No        TB
Indri     BSD like  C++    Many             Very Flexible  Research    Yes       TB
Sphinx    GPL       C++    Many             Flexible       craigslist  Yes       TB
RDBMS     BSD, GPL  C      SQL Text         Limited        -           Maybe     GB
Xapian    GPL       C++    Many             Flexible       gmane       Yes       TB
Directly Modeling Relevance From Clicks
Which ranking of web pages is better for the query “NIPS 2007”? [Carterette and Jones, NIPS 2007]
[Figure: two rankings, each showing Rank 1 to Rank 5 with click counts. Is DCG_1 > DCG_2? P(DCG_1 > DCG_2)]
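The comparison of two rankings by DCG can be sketched as follows. This is an illustration under stated assumptions, not the method of the cited paper: it uses one common DCG form (gain 2^rel - 1, discount log2(rank + 1)), treats relevance as binary, assumes each document's probability of relevance has already been estimated (for example from clicks), and samples the two rankings independently, which ignores documents shared between them. All function and parameter names are hypothetical.

```python
import math
import random


def dcg(relevances):
    """Discounted cumulative gain: sum of (2^rel - 1) / log2(rank + 1)."""
    return sum((2 ** rel - 1) / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def prob_dcg_greater(p_rel_1, p_rel_2, samples=10000, seed=0):
    """Monte Carlo estimate of P(DCG_1 > DCG_2).

    p_rel_k[i] is the assumed probability that the document at rank i + 1
    of ranking k is relevant.
    """
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    wins = 0
    for _ in range(samples):
        # Draw one binary relevance judgment per ranked document.
        rels_1 = [1 if rng.random() < p else 0 for p in p_rel_1]
        rels_2 = [1 if rng.random() < p else 0 for p in p_rel_2]
        if dcg(rels_1) > dcg(rels_2):
            wins += 1
    return wins / samples


if __name__ == "__main__":
    # Hypothetical relevance probabilities for two five-document rankings.
    ranking_1 = [0.9, 0.6, 0.3, 0.2, 0.1]
    ranking_2 = [0.2, 0.3, 0.6, 0.9, 0.1]
    print(prob_dcg_greater(ranking_1, ranking_2))
```

A value well above 0.5 suggests the first ranking is likely better under these assumptions; a value near 0.5 means the available evidence does not separate the two.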
Reasons to Build a Demo
- “Eat Your Own Dogfood”: algorithm design and testing
  - allows you to improve without labeled data
  - look closely at the results
  - convince your advisor/funders it works!
- Observe user behavior (example query sequences)
  - Cheap flight to boston, Cheap flights to boston, Cheap flights, Travelocity, Expedia
  - American arlines.com, American airlines.com, Americanairlines.com
  - Puppy, Cute puppy, More cute puppy picutres
http://developer.yahoo.com/everything.html - for logos
ROSIE – SHOW PSEUDOCODE FOR SIMPLIFIED VERSION – THEN CONVERT TO YQL(TED) OR PERL (ROSIE)?
The user uses a search interface to rapidly gather many snippets that contain similar phrases, and then selects those that they would like to mark (Figure 6). The server uses Yahoo BOSS to search the web for snippets that resemble a paraphrase entered by the user.
SIGIR 2008 proceedings
http://portal.acm.org/toc.cfm?id=1390334&idx=SERIES278&type=proceeding&coll=ACM&dl=ACM&part=series&WantType=Proceedings&title=SIGIR&CFID=43145604&CFTOKEN=93348762
Jung et al IP&M
http://search.yahoo.com/search?p=Click+data+as+implicit+relevance+feedback+in+web+search&ei=UTF-8&fr=moz2
Affective feedback
http://eprints.gla.ac.uk/4825/1/4825.pdf
http://portal.acm.org/citation.cfm?id=1390566&dl=GUIDE&coll=GUIDE&CFID=43143609&CFTOKEN=22951859
ROSIE WORK ON THIS TONIGHT
Eran / Ashim: okay to include BOSS HERE?
WHAT DO WE SHOW FOR PRESENTATION?? – the SIGIR 2008 papers? Query-biased summaries
Mapping from words to documents containing them
TED check correctness
LOGO NEEDED!
TED UPDATE
TED UPDATE?
ROSIE ADD A PICTURE
TALK MORE ABOUT THIS EXAMPLE
MAKE A NEW SCREENSHOT WITHOUT CLIPPED TEXT
DESCRIBE arXiv.org more fully – who uses it what it does etc.
Radlinski et al. implemented arXiv search on top of Lucene: http://search.arxiv.org/
One could use, e.g., Yahoo result ordering as one baseline:
BOSS with restriction to arxiv.org
What would this pseudocode look like?
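One possible shape for that pseudocode, written as runnable Python. It assumes the search backend accepts a `site:` operator inside the query string, and it takes the search call as a parameter rather than naming any real BOSS endpoint or signature; `restrict_to_site`, `baseline_ranking`, and `fake_search` are all hypothetical names introduced for illustration.

```python
def restrict_to_site(query, site="arxiv.org"):
    """Append a site restriction to a free-text query."""
    return f"{query.strip()} site:{site}"


def baseline_ranking(query, search_fn, site="arxiv.org", top_k=10):
    """Return the backend's own top_k ordering for the restricted query.

    search_fn is any callable that takes a query string and returns an
    ordered list of results; it stands in for a web search API client.
    """
    results = search_fn(restrict_to_site(query, site))
    return results[:top_k]


def fake_search(query):
    """Stand-in backend for the demo; returns placeholder result IDs."""
    return [f"doc-{i}" for i in range(20)]


if __name__ == "__main__":
    print(restrict_to_site("click models"))
    print(baseline_ranking("click models", fake_search, top_k=3))
```

The ordering returned by `baseline_ranking` can then be compared against the ordering produced by a custom ranker for the same query.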
TODO: Examples from SIGIR 2008 papers for each of those
RJ show unigram/ngram examples
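A small sketch of unigram and n-gram extraction, assuming simple whitespace tokenization and lowercasing. The function name is hypothetical, and the sample query is borrowed from the user-behavior examples earlier in the deck.

```python
def ngrams(text, n):
    """Return the list of word n-grams (as tuples) in text, in order."""
    tokens = text.lower().split()
    # A text shorter than n tokens yields an empty list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    query = "Cheap flights to boston"
    print(ngrams(query, 1))  # unigrams
    print(ngrams(query, 2))  # bigrams
```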
- add refs for Observe User Behavior