Open Source Search Tools for www2010 conference: Applications of Open Search Tools (WWW2010 Tutorial)


Presentation by Ted DRAKE and Rosie JONES for the www2010 conference in North Carolina. This discusses the open source search software, APIs and trends.

  • http://developer.yahoo.com/everything.html - for logos
  • ROSIE – SHOW PSEUDOCODE FOR SIMPLIFIED VERSION – THEN CONVERT TO YQL(TED) OR PERL (ROSIE)? The user uses a search interface to rapidly gather many snippets that contain similar phrases, and then selects those that they would like to mark (Figure 6). The server uses Yahoo BOSS2 to search the web for snippets that resemble a paraphrase entered by the user. SIGIR 2008 proceedings http://portal.acm.org/toc.cfm?id=1390334&idx=SERIES278&type=proceeding&coll=ACM&dl=ACM&part=series&WantType=Proceedings&title=SIGIR&CFID=43145604&CFTOKEN=93348762 Jung et al IP&M http://search.yahoo.com/search?p=Click+data+as+implicit+relevance+feedback+in+web+search&ei=UTF-8&fr=moz2 Affective feedback http://eprints.gla.ac.uk/4825/1/4825.pdf http://portal.acm.org/citation.cfm?id=1390566&dl=GUIDE&coll=GUIDE&CFID=43143609&CFTOKEN=22951859
  • WHAT DO WE SHOW FOR PRESENTATION?? – the SIGIR 2008 papers? Query-biased summaries
  • Mapping from words to documents containing them
  • SIGIR 2008 proceedings http://portal.acm.org/toc.cfm?id=1390334&idx=SERIES278&type=proceeding&coll=ACM&dl=ACM&part=series&WantType=Proceedings&title=SIGIR&CFID=43145604&CFTOKEN=93348762 Jung et al IP&M http://search.yahoo.com/search?p=Click+data+as+implicit+relevance+feedback+in+web+search&ei=UTF-8&fr=moz2 Affective feedback http://eprints.gla.ac.uk/4825/1/4825.pdf http://portal.acm.org/citation.cfm?id=1390566&dl=GUIDE&coll=GUIDE&CFID=43143609&CFTOKEN=22951859
  • MAKE A NEW SCREENSHOT WITHOUT CLIPPED TEXT. DESCRIBE arXiv.org more fully: who uses it, what it does, etc. Radlinski et al implemented arxiv search on top of lucene: http://search.arxiv.org/ One could use e.g. Yahoo result ordering as one baseline: BOSS with restriction to arxiv.org. What would this pseudocode look like?
  • TODO: Examples from SIGIR 2008 papers for each of those
  • RJ show unigram/ngram examples - add refs for Observer User Behavior

Presentation Transcript

  • Applications of Open Search Tools: WWW2010 Tutorial. Rosie Jones and Ted Drake, Yahoo! Inc. April 26th, 2010
  • Introductions
  • Schedule
    • 2:00 – 2:15 Introductions and Overview (Rosie & Ted)
    • 2:15 – 2:30 Motivation – state of the industry (Ted Drake)
    • 2:30 – 3:00 Indexing and Search (Rosie & Ted)
    • 3:00 – 3:30 Hello World! Using Search Service APIs & Examples (Ted Drake)
    • 3:30 – 4:00 Coffee Break
    • 4:00 – 4:30 Mashup Patterns (Ted Drake)
    • 4:30 – 5:00 Ranking and Evaluation (Rosie Jones)
    • 5:00 – 5:30 Discussion, Questions (Ted & Rosie)
  • Caveat
    • There is a lot of open search software out there!
    • This tutorial is breadth-oriented, and example driven
      • And therefore necessarily kind of shallow
    For the slides: http://www.slideshare.net/7mary4
  • Motivation
  • State of the Industry - Mashups
    • Programmable Web: Resource for API and Mashup development
    • 10 new search mashups every month (average)
    • 62 search APIs (as of April 25, 2010)
  • State of the Industry - Healthy Market
    • 1,500 search related companies on TechCrunch
  • Open Source Technology Reduces Barriers
    • Yahoo! Query Language
      • Select * from (insert your desire)
      • Built in cache, threading, authentication
      • Easily extended with Open Tables
    • Hadoop
      • Yahoo Distribution of Hadoop includes patches and updates
      • Your Hadoop installation can perform at your current scale
        • All the way up to Yahoo scale
    • Open Source Search Engines
      • Lemur
      • Lucene
  • Motivation II: Tools for Academic Papers
  • In Academia: Paper in WWW 2010
    • Highlighting Disputed Claims on the Web Rob Ennals, Beth Trushkowsky, John Mark Agosta, Tye Rattenbury, Tad Hirsch
    The server uses Yahoo BOSS2 to search the web for snippets that resemble a paraphrase entered by the user.
  • In Academia: Papers from SIGIR 2008
    • Towards breaking the quality curse: a web-querying approach to web people search [Kalashnikov et al SIGIR 2008]
      • Web as external corpus
      • Use Yahoo! API to retrieve
    • Emulating query-biased summaries using document titles [Joho et al SIGIR 2008]
      • Yahoo!, Google, Terrier (TREC)
  • More Publications using Open Source Search Engines
    • Affective feedback: an investigation into the role of emotions in the information seeking process [Arapakis et al SIGIR 2008]
      • Use Indri to parse and retrieve TREC newswire and web collections
    • Click data as implicit relevance feedback in web search [Jung et al IP&M 2007]
      • Last clicked document is predictor of relevance (used Nutch search engine on university website)
    • Minimal test collections for retrieval evaluation [Carterette et al SIGIR 2006]
      • Indri, Lemur, Lucene, Mg, SMART, Zettair
  • Web Search Architecture
    • Offline
      • Crawlers: find documents, follow links, fetch freshest content, build graph of hyperlinks
      • Indexers: process text and meta-data
      • Index: text and meta-data, compressed, for quick lookup
    • Runtime
      • Retrieval: find documents containing query words
      • Ranking
      • Interface
  • What is Open Search
  • Open Source Search and Open Search
    • Open source code lets you build your own search engine
    • Open search lets you leverage existing commercial search engines
  • Why Open Search?
    • #!/usr/local/bin/perl -w
      • $searchResultPage = GET($url);
      • process($searchResultPage)
    • Curl (php)
    • Javascript…
  • Scraping Modules http://search.cpan.org/~jfriedl/Yahoo-Search-1.10.13/lib/Yahoo/Search.pm
  • Do I Look Like A Piece of Bad Software?
  • Information Superhighway for Known Robots
    • Search engine may stop accepting requests from your IP, or just slow down service
  • Scrape with Search Engine’s Blessing
    • http://code.google.com/apis/ajaxsearch/
    • http://msdn.microsoft.com/en-us/library/dd251056.aspx
    • http://developer.yahoo.com/search/boss/
    MUCH MORE DETAIL IN THE NEXT SECTION!
  • Other Parts to the Search Process
    • Indexing
      • Indexing algorithms
      • Access to the index – what is overall document frequency? What if I rank differently using the index?
    • Presentation
      • User interface effects
    • Existing Open Search Platforms Can Get You Started
  • Indexing Your Own Content
  • Task of Indexing
    • Store document contents in format that allows quick lookup
    • Invest time offline
      • For fast runtime access
    • Runtime task
      • Given the current query
      • Which subset of documents should we spend time ranking
  • Brute Force Document Scoring
    • Check every document in collection to see if it contains any query terms
      • Most documents don’t contain any of the query terms
      • Look at query terms to see which documents to consider
  • Inverted Index
    • [Diagram: each term (open, search, ted, drake, mit) points to a posting list of the documents containing it, e.g. D1 D3 D8 D9 D15 D32; each entry in a list is a posting. Example query = open search ted drake]
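The inverted-index idea on the slides above can be sketched in a few lines of Python. This is a minimal illustration, not how any of the engines in this tutorial store their indexes: the document IDs and texts are made up, tokenization is plain whitespace splitting, and the function names `build_inverted_index` and `candidates` are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a sorted posting list of the document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def candidates(index, query):
    """Return the IDs of documents that contain at least one query term."""
    found = set()
    for term in query.lower().split():
        found.update(index.get(term, []))
    return sorted(found)


# Toy collection (invented for illustration).
docs = {
    "D1": "open search tools",
    "D3": "ted drake on search",
    "D6": "mit open courseware",
    "D8": "cooking with ted",
    "D9": "weather report",
}
index = build_inverted_index(docs)
print(index["search"])
print(candidates(index, "open search ted drake"))
```

Only the posting lists of the query terms are touched at query time, so D9, which contains none of the terms, is never examined. That is the saving over the brute-force scan.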
  • High Level Comparison
    Platform | License  | Lang. | Docs            | Ranking       | Users      | Parallel | Scale
    Lucene   | Apache   | Java  | Many            | Flexible      | Amazon     | Yes      | TB
    zettair  | BSD like | C     | HTML, TREC, TXT | Flexible      | Research   | No       | TB
    Indri    | BSD like | C++   | Many            | Very Flexible | Research   | Yes      | TB
    Sphinx   | GPL      | C++   | Many            | Flexible      | craigslist | Yes      | TB
    RDBMS    | BSD, GPL | C     | SQL Text        | Limited       | -          | Maybe    | GB
    Xapian   | GPL      | C++   | Many            | Flexible      | gmane      | Yes      | TB
  • Previous Benchmarks (Middleton+Baeza-Yates 07)
  • Open Search Benchmarking
    • http://zooie.wordpress.com/2009/07/06/a-comparison-of-open-source-search-engines-and-indexing-twitter/
      • An over-the-weekend experiment to make code examples
  • Benchmarks
    • Not enough comparative benchmarks out there
    • Hard to do; we really need standards
      • Optimize each platform, per hardware and data set
      • Lot of platforms, with different APIs, options and numerical settings
    • Need good diverse data sets, small & large
    • Hard to please
      • Winners & losers in benchmarks; lot of biases
      • Always room for improvement
    • Really evolutionary to nail benchmarks
      • It’s an Open Source project
        • http://github.com/zooie/opensearch/tree/master
  • In action
    • Lucene
    • Sphinx
    • Indri
    • All the code examples are here:
    • http://github.com/zooie/opensearch/tree/master
  • Lucene
    • Lot of industrial support w/ proven scalability
      • Amazon, Netflix, Wikipedia
    • An IR Library in Java
      • There’s also pyLucene & CLucene
    • Use Nutch, Solr or Hounder for the rest
      • Crawlers, result abstracts…
  • Lucene Indexing
  • Lucene Search
    • javac -cp /lucenedir/lucene-2.4.1/lucene-core-2.4.1.jar:. Index.java
    • java -Xmx512m -cp /lucenedir/lucene-core-2.4.1.jar:. Index
  • Sphinx
    • Runs Craigslist Search
    • MySQL integration focus
      • But also supports a XML input pipe
    • Pretty fast indexer
    • searchd, indexer commands
    • Mostly declarative option setting (sphinx.conf)
    • Client API (python, Java, ruby, php) sockets to searchd
  • Sphinx Indexing
    • SQL text columns or XML input
    • sphinx.conf
    • indexer --quiet --config sphinx.conf medindex
  • Sphinx Search
    • Socket connection to searchd Sphinx service
  • Indri
    • Lemur Project
      • http://www.lemurproject.org/
    • Powerful Structured Query Language
    • Advanced Language Models
    • Native C++; swigged Java, php
    • Command line binaries
    • Developer resources
      • http://lemur.wiki.sourceforge.net/
  • Indri: Hello World
    • Index & Search directory of txt files
    • IndriBuildIndex -index=/Users/viksi/sigir/med_data/indri_index -corpus.path=/Users/viksi/sigir/med_data/indri_data -corpus.class=txt -memory=300m
      • http://www.lemurproject.org/lemur/indexing.php#IndriBuildIndex
    • IndriRunQuery -index=/Users/viksi/sigir/med_data/indri_index -count=100 -rule="method:dirichlet,mu:2500" -query="#weight(1.0 #uw2(chest pain) 2.0 #1(heart attack))"
      • http://www.lemurproject.org/lemur/IndriQueryLanguage.php
  • Indexed Info in Search API
  • Index - Structured Meta Data
    • SearchMonkey:
    • Yahoo! SearchMonkey captures the structured data from web sites for the index.
    • RDF
    • Microformats
  • Index - Social
    • Social:
    • Delicious saves/tags
    • FOAF (Friend of a Friend), XFN
    • Recent social activity: Twitter, Facebook, Buzz, Blogs…
  • Index – Machine Tags
    • Keyterms
    • Mis-Spelling
    • Content Enrichment
    • Inbound Links
  • Hello, World! Open Search Service APIs Photo by Oskay
  • Roadmap of APIs
    • Google
    • Bing
    • BOSS
    • Twitter
    • YQL
    • Live examples
    Photo by Scorpions and Centaurs
  • Google AJAX Search
    • Javascript Widget or API
    • REST API:
    • http://ajax.googleapis.com/ajax/services/search/{vertical}?v=1.0&q={query}
    • Web, Local, Video, Blogs, News, Books, Images, Patents
    • Can’t modify results though
  • Google Custom Search
    • Turn-key product
    • Bulk load 1000s site restricts; On-demand 24 hour Web Indexing
    • Iframe or Custom Search Element results for developers; XML for enterprise
  • Bing 2.0 API
    • Multiple Sources; Batch support
      • Web, Images, InstantAnswer, Phonebook, RelatedSearch, Spell
    • Usage: http://api.search.live.net/json.aspx?AppId={appid}&Market=en-US&Query={query}&Sources=web+spell&Web.Count=1
    • Can modify (w/ some restrictions, i.e. re-ranking, blending with non-Bing sources)
  • Yahoo! BOSS
    • BOSS = Build your Own Search Service
    • Open Yahoo’s core search features via web services to let 3rd parties revolutionize Search
    • Unrestricted
  • Unrestricted?
    • Unlimited queries
    • Blend, re-order, discard
    • Full Presentation control
    • Limited only by your imagination
  • BOSS API
    • Usage
      • http://boss.yahooapis.com/ysearch/{vert}/v1/{q}?appid={appid}&start=0&count=10&lang=en&format=xml&view=keyterms
    • Verticals
      • Web, News, Images, Spelling
    • In query syntax
      • inurl, url, intitle, site, AND/OR, “-”, “+”
    • Notable web view fields
      • Delicious bookmarks
      • SearchMonkey (microformats)
      • Larger abstracts
      • Extracted Entities (keyterms)
    • Can modify
    SearchMonkey keyterms Bookmarks
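As a rough illustration of the usage template on this slide, the following Python sketch fills in the `{vert}`, `{q}` and `{appid}` slots. It only builds the URL string and sends no request, so it makes no claim about how the service responds. The helper name `boss_url`, the `YOUR_APPID` placeholder and the example query are assumptions for illustration.

```python
from urllib.parse import quote, urlencode

# Base path taken from the usage template on the slide.
BOSS_BASE = "http://boss.yahooapis.com/ysearch"


def boss_url(query, appid, vertical="web", start=0, count=10, view=None):
    """Fill in the slide's URL template; the query goes in the path, URL-encoded."""
    params = {
        "appid": appid,
        "start": start,
        "count": count,
        "lang": "en",
        "format": "xml",
    }
    if view:
        params["view"] = view
    return f"{BOSS_BASE}/{vertical}/v1/{quote(query, safe='')}?{urlencode(params)}"


# "site:" is one of the in-query operators listed on the slide.
url = boss_url("open search site:arxiv.org", appid="YOUR_APPID", view="keyterms")
print(url)
```

Because the query sits in the URL path rather than in a parameter, spaces and the colon in `site:` have to be percent-encoded, which is why `quote` is called with `safe=''`.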
  • Web = Cross Platform
    • Google AJAX, Bing, BOSS
    • HTTP GET, URI => XML, JSON
    • Any programming lang. that supports HTTP
    • Many language specific libraries available
      • Web Search “[platform] [language]”
        • “yahoo boss python”
    • Mobile: HTML web apps work on all smart phones
  • Platforms
  • Yahoo! YQL
    • select * from internet API (e.g. flickr, ebay, amazon)
      • http://developer.yahoo.com/yql/
    many standard & “open tables” services »
  • Amazon Web Services (AWS)
    • Amazon Cloud Support
    • Amazon SimpleDB, Relational Database Services
    • E-Commerce Fulfillment Services
    • Messaging
    • Monitoring
    • Networking
    • Payments & Billing
    • Storage
    • Workforce: Amazon Mechanical Turk
    • Large scale functionality at startup prices
  • Google App Engine
    • Free application hosting (up to 5 million pv/month)
    • Java, Ruby, or Python
    • Extensive SDK support
    • Distributed Data Storage (up to 500 mb for free)
  • Examples
  • BOSS Out in the Open
    • http://www.xurch.com
    • http://search.techcrunch.com
    • http://www.spysee.jp
    • http://www.123people.com
    • http://www.pipl.com
    • http://tweetnews.appspot.com
    • http://bossy.appspot.com
    • http://www.hakia.com
    • http://oneriot.com
    • http://www.daylife.com
    • http://www.inquisitorx.com/
    • http://insiderfood.com/
    • http://ask-boss.appspot.com/
    • http://www.4hoursearch.com
    • http://www.devunity.com (Techcrunch 50)
    • http://copyrightspot.com/ (Mashable)
    • http://imusicmash.com (Mashable)
    • http://truevert.com (Mashable)
    • http://professeurs.esiea.fr/wassner/?2008/10/20/171-semantic-calculator
    • http://www.ysearchblog.com/archives/000613.html
    • http://www.ysearchblog.com/archives/000621.html
      • DNS Mashup
      • BuildASearch
      • PlayerSearch
      • V3GGIE
      • Dipidity Newsline
      • Tianamo
  • Google Custom Search Examples
    • CopyScape – Looks for sites copying your text
    • Topicalizer – Extracts topics, finds related information from text
  • Bing Examples
    • Site Search Engine by a Microsoft engineer http://nathanbuggia.com/blog/post/Custom-Site-Search-Engine-Using-the-Live-Search-API.aspx
  • Coolest Features Across the Board
    • BOSS se_link (graphs), delicious (bookmarks), keyterms (extracted entities), searchmonkey (rdfa, microformats, structured abstracts)
    • Yahoo! YQL
    • Bing Video, Translation, Instant Answer, Batch
    • Google CSE large site restricts, refinements
    • Google AJAX Transliteration, Blogs, Books
  • Mashups
  • Let’s Build Something
    • TweetNews
      • http://tweetnews.appspot.com/search?q=twitter
      • “the best mashup we’ve ever seen” (Wired)
    • Tools
      • BOSS, BOSS Mashup Framework, Google App Engine, Python 2.5
    • Source
      • http://vik.singh.googlepages.com/fresh.zip
  • Digression: TF-IDF for Ranking
    • TF = Term Frequency
      • Documents containing the query terms often tend to be relevant
    • IDF – Inverse Document Frequency
      • Words that are in every document aren’t as important
        • The, of, “click here”, “home page”
      • Document frequency: number of documents containing this term
      • Divide by Document frequency: Inverse Document Frequency
    • Sort by TF * IDF to get a ranking over documents
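The TF * IDF recipe above can be sketched as follows. One assumption goes beyond the slide: IDF is computed as log(N / document frequency), a common variant, whereas the slide only says to divide by document frequency. The toy documents and the function name `tfidf_rank` are invented for illustration.

```python
import math
from collections import Counter


def tfidf_rank(docs, query):
    """Score each document by the sum of TF * IDF over the query terms."""
    n_docs = len(docs)
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}

    # Document frequency: number of documents containing each term.
    df = Counter()
    for terms in tokenized.values():
        df.update(set(terms))

    scores = {}
    for doc_id, terms in tokenized.items():
        tf = Counter(terms)
        score = 0.0
        for term in query.lower().split():
            if term in df:
                # Assumption: logarithmic IDF, log(N / df).
                score += tf[term] * math.log(n_docs / df[term])
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


docs = {
    "a": "the open search tutorial",
    "b": "the search the search engine",
    "c": "the home page",
}
ranking = tfidf_rank(docs, "the search")
print(ranking)
```

In this toy collection "the" appears in every document, so its IDF is log(1) = 0 and it contributes nothing; the ranking is decided entirely by how often "search" occurs.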
  • TweetNews Model
    • Goal: Inject relevance in latest news search results
    • Approach:
      • Fetch latest (order by date) news results for query
      • Also fetch latest tweets for query (search.twitter.com)
      • Vectorize each Twitter and News search result
      • Euclidean Normalized TFIDF document vector of term:freq pairs
      • Compute cosine sim between each twitter & news result vector
      • Assign tweet to news result if sim >= threshold
      • Sort news results by # of related tweets
    • WWW2010 similar paper
      • Time is of the Essence: Improving Recency Ranking Using Twitter Data Anlei Dong, Ruiqiang Zhang, Pranam Kolari, Bai Jing, Yi Chang, Fernando Diaz, Zhaohui Zheng, Hongyuan Zha
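The TweetNews approach listed above might look roughly like this in Python. It is a sketch under assumptions, not the TweetNews source: the texts are invented, no search APIs are called, IDF uses log(N / df), the threshold of 0.1 is arbitrary, and a tweet is counted for every news result it matches rather than only the best one.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return one Euclidean-normalized TF-IDF vector (term -> weight) per text."""
    tokenized = [text.lower().split() for text in texts]
    df = Counter()
    for terms in tokenized:
        df.update(set(terms))
    n = len(texts)
    vectors = []
    for terms in tokenized:
        tf = Counter(terms)
        vec = {term: count * math.log(n / df[term]) for term, count in tf.items()}
        norm = math.sqrt(sum(weight * weight for weight in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())


def rank_news_by_tweets(news, tweets, threshold=0.1):
    """Sort news results by the number of tweets whose similarity meets the threshold."""
    vectors = tfidf_vectors(news + tweets)
    news_vecs, tweet_vecs = vectors[:len(news)], vectors[len(news):]
    counts = [0] * len(news)
    for tweet_vec in tweet_vecs:
        for i, news_vec in enumerate(news_vecs):
            if cosine(tweet_vec, news_vec) >= threshold:
                counts[i] += 1
    order = sorted(range(len(news)), key=lambda i: counts[i], reverse=True)
    return [(news[i], counts[i]) for i in order]


news = ["search engine launches", "hadoop summit announced"]
tweets = ["hadoop summit tickets", "going to hadoop summit", "new search engine"]
ranked = rank_news_by_tweets(news, tweets)
print(ranked)
```

The hadoop story moves ahead of the search story because two tweets are assigned to it and only one to the other, which is the "sort by number of related tweets" step.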
  • TweetNews Main Source
  • Non-Search: delicious Classifier
    • Usage
      • &view=delicious_toptags
      • &view=delicious_saves
    • Idea: Liberal v. Conservative Classifier
    • Generate politics queries list
      • Mine Reuters or editors
    • BOSS search each; take top 1k results
    • Filter on tag ‘liberal’ or ‘conservative’; assemble binary training set
    • Features
      • “&abstract=long”, “&view=keyterms, delicious_saves, searchmonkey_rss”, title, url, date, se_link # inbound links
  • Mashup: Related terms
    • Delicious users can tag web sites they bookmark.
    • Get a ranked list of tags for a general topic
    • select delicious_toptags,title from search.web where query="hadoop" and view="delicious_toptags"
  • Mashup – Social Impact
    • What are your friends buzzing, digging, tagging…
    • YQL: select * from social.connections.updates where guid=me
    • Use data to find more recent and relevant information
    • Lijit creates a vertical search engine based on a user’s delicious, facebook, and other saved bookmarks
    • WWW2010 Related Paper : Liquid Query: Multi-domain Exploratory Search on the Web Marco Brambilla, Alessandro Bozzon, Stefano Ceri, Piero Fraternali
    • Now it’s time to turn on the FIRE HOSE
  • Mashup – The Fire Hose
  • Mashup – Government Data
    • Guardian’s World Government Data Collection http://www.guardian.co.uk/world-government-data
      • U.S. Unemployment Statistics
      • U.S. Aviation Accidents
      • Raw Data for U. S. Department of Energy (DOE) Categorical Exclusion(CX) Determinations Under the National Environmental Policy Act (NEPA)
      • Treasury Recovery Act Data
      • Migratory Bird Flyways - Continental United States
  • Coming Soon: Twitter Annotations
    • Metadata for tweets
    • Step 1. create link for users to tweet your page.
    • Step 2. Insert metadata into each tweet
    • Step 3. Pull that information back and mash with other data.
    • Example
    • Yahoo! Finance has a tweet this stock link.
    • Insert information (ticker:yhoo) into the tweet’s metadata.
    • Follow the distribution of this metadata and look for correlations in stock price activity. Perhaps a new line on Finance Charts.
  • Mashup – Open Tables on YQL
      • Define new API definitions
      • Open Source in GitHub
      • Server-side JavaScript allows Insert and more
      • Allows for private keys
  • Mashup – Open Tables on YQL
    • <?xml version="1.0" encoding="UTF-8"?>
      <table xmlns="http://query.yahooapis.com/v1/schema/table.xsd">
        <meta>
          <author>Nagesh Susarla</author>
          <documentationURL>See search.web and search.images for more details</documentationURL>
        </meta>
        <bindings>
          <select itemPath="results.result" produces="XML">
            <inputs>
              <key id="query" type="xs:string" paramType="query" required="true"/>
            </inputs>
            <execute><![CDATA[
              var qs = query;
              var search = y.query('select * from search.web(50) where query=@query', {query: qs}).results;
              var images = [];
              default xml namespace='http://www.inktomi.com/';
              for each (var result in search.result) {
                images.push(y.query('select * from search.images(1) where query=@query and url=@url', {url:result.url, query:qs}));
              }
              var i = 0;
              for each (var result in search.result) {
                var image = images[i].results.result;
                if (image) {
                  result.image = <image>{image}</image>;
                }
                i++;
              }
              response.object = search;
            ]]></execute>
          </select>
        </bindings>
      </table>
  • Mashup – Using an Open Table
  • Blending Vertical + Service
    • Comprehensiveness!
      • Every Search Engine should be a One-Stop Shop
  • Delicious Blending Idea
    • Goal: Blend delicious + web results
    • Approach:
      • 1000s BOSS Web Queries, Filter w/ delicious_saves
      • Training set: x: search features | y: delicious count
      • Machine learn the transfer function
        • Infer the delicious count for any web result
        • Can now normalize the two search result sets
  • From Web
  • Hack Ideas
    • Discovery (BOSS Search App Store)
      • Designing a fairer marketplace for app distribution
      • Emerging problem for Facebook, iPhone App Store
    • Desktop, Data Visualization (Cooliris, Inquisitor)
    • Mobile (iPhone, Android, BlackBerry)
    • Passive Location/Contextual Based Search
    • Social (Facebook, Twitter, OpenSocial, Friend Connect, OneConnect)
    • Semantic
    • BOSS keyterms, SearchMonkey
    • Bing Instant Answers
    • Google CSE Refiners
  • Ranking
  • Retrieval and Ranking
    • RETRIEVE the documents matching simple conditions
      • Boolean AND on query terms
      • TF-IDF
    • RANK using more sophisticated function
      • Term proximity
      • Page authority
      • Author identity
  • Ranking with Open Source Tools
    • Indri/Lemur
      • Language modeling
      • BM25, Okapi, Cosine similarity, inQuery
    • Lucene
      • TF-IDF, weighted by term occurrences
      • Fielded search
    • Terrier
      • Okapi BM25, language modeling and TF-IDF
      • Divergence from Randomness
    • Your own re-ranking code using open search
  • Evaluation with Click Logs
  • Evaluating with Clicks
    • People click on the good results, right?
  • Not All Results Are Equally Likely to be Looked At (Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)
  • Clicks and Views Depend on Rank [Joachims et al, 2005]
  • Evaluation from Click Logs
    • Show a screenshot and me doing a “skip first”
    Read From Top to Bottom Skip Skip Skip Click! [Joachims et al SIGIR 2005]
  • Mining Clicks for Ranking
    • Clicks can be used to predict
      • Pairwise preference
        • Query: Doc1, Doc2 [Joachims 2002]
      • Absolute relevance
        • Taking clicks on other documents into account
        • [Carterette and Jones, NIPS 2007]
        • [Chapelle and Zhang, WWW 2009]
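One way to turn the "read from top to bottom, skip, skip, skip, click" picture into pairwise preferences is the "click > skip above" heuristic: a clicked result is taken to be preferred over every unclicked result ranked above it. The sketch below is illustrative only; the function name and document IDs are invented, and it does not cover the absolute-relevance models cited above.

```python
def skip_above_preferences(ranking, clicked):
    """Return (preferred, less_preferred) pairs from one result page.

    Each clicked document is preferred over every unclicked document
    that was ranked above it (and so was presumably seen and skipped).
    """
    clicked = set(clicked)
    preferences = []
    for position, doc in enumerate(ranking):
        if doc in clicked:
            preferences.extend(
                (doc, skipped)
                for skipped in ranking[:position]
                if skipped not in clicked
            )
    return preferences


# The user skipped the first three results and clicked the fourth.
prefs = skip_above_preferences(["d1", "d2", "d3", "d4", "d5"], clicked=["d4"])
print(prefs)
```

Nothing is inferred about d5: it sits below the click, so there is no evidence the user ever looked at it.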
  • Interleaving for Learning from Clicks – Pairwise Judgments
    • [Joachims, KDD 2002]
    • [Radlinski and Joachims, KDD 2007]
    • [Radlinski et al, CIKM 2008]
    Results from Method 1 Results from Method 2
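The slide does not say which interleaving variant it shows. As one illustration, here is a sketch in the style of team-draft interleaving: the two methods take turns picking their best result not yet shown, and each click is credited to the method that contributed the clicked result. The function names, the example rankings and the fallback when one ranking runs out are assumptions of this sketch.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, rng=None):
    """Merge two rankings; return the merged list and each method's picks."""
    rng = rng or random.Random()
    interleaved, team_a, team_b = [], set(), set()
    all_docs = set(ranking_a) | set(ranking_b)
    while len(interleaved) < len(all_docs):
        # The method with fewer picks goes next; ties are broken by a coin flip.
        a_turn = len(team_a) < len(team_b) or (
            len(team_a) == len(team_b) and rng.random() < 0.5
        )
        source, team = (ranking_a, team_a) if a_turn else (ranking_b, team_b)
        pick = next((d for d in source if d not in interleaved), None)
        if pick is None:
            # Assumption: if this ranking is exhausted, the other one picks.
            source, team = (ranking_b, team_b) if a_turn else (ranking_a, team_a)
            pick = next(d for d in source if d not in interleaved)
        interleaved.append(pick)
        team.add(pick)
    return interleaved, team_a, team_b


def credit(clicks, team_a, team_b):
    """Count clicks on results contributed by method A and by method B."""
    return (
        sum(1 for d in clicks if d in team_a),
        sum(1 for d in clicks if d in team_b),
    )


merged, team_a, team_b = team_draft_interleave(
    ["d1", "d2", "d3"], ["d2", "d4", "d1"], rng=random.Random(0)
)
print(merged, credit(["d3"], team_a, team_b))
```

Users see a single merged list, so position bias affects both methods equally; the method whose picks attract more clicks over many queries wins the pairwise comparison.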
  • Evaluation using Discounted Cumulative Gain
    • Discounted Cumulative Gain (DCG)
    • [Järvelin and Kekäläinen 2000]
    • Gain by relevance label: Highly relevant = 3, Somewhat relevant = 2, Tangentially relevant = 1, Irrelevant = 0
    • Discount by rank: Most important (top rank) = 1, Less important (rank i) = 1/log(i)
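A small worked example of DCG using the gain values from this slide. The slide gives the discount only as 1/log(i); this sketch assumes a base-2 logarithm and a discount of 1 at rank 1, which is one common convention. The label strings and the function name `dcg` are invented for illustration.

```python
import math

# Gain values as given on the slide.
GAIN = {"highly": 3, "somewhat": 2, "tangential": 1, "irrelevant": 0}


def dcg(labels):
    """Discounted cumulative gain of a ranked list of relevance labels."""
    total = 0.0
    for rank, label in enumerate(labels, start=1):
        # Assumption: discount is 1 at rank 1 and 1/log2(rank) below it.
        discount = 1.0 if rank == 1 else 1.0 / math.log2(rank)
        total += GAIN[label] * discount
    return total


good_first = dcg(["highly", "irrelevant", "somewhat"])
good_last = dcg(["somewhat", "irrelevant", "highly"])
print(good_first, good_last)
```

Both lists contain the same three documents, but the one that puts the highly relevant document first scores higher, because gains lower in the list are discounted.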
  • Directly Modeling Relevance From Clicks [Carterette and Jones, NIPS 2007]
    • Which ranking of web pages is better for the query “NIPS 2007”?
    • [Figure: two rankings, each showing Rank 1 to Rank 5 with click counts]
    • Is DCG 1 > DCG 2? Estimate P(DCG 1 > DCG 2)
  • Ingredients for Learning from Clicks
    • Sufficient users
    • Ability to record results shown
    • Ability to vary presentation order
    • Ability to vary results shown
    • Ability to log clicks
    • Ability to run experiments
    • varying your secret sauce
  • How to Get Search Engine Results to Modify?
    • Radlinski and Joachims
    • citeseer/arXiv.org results and permuted rankings, recorded clicks, skip above, skip next
    • See also their open source engine Osmot
      • http://radlinski.org/osmot/
    Search Engine Services Allow You to Do This Kind of Thing Yourself
  • Query Logs
    • Might be in /etc/httpd/logs/access_log* check httpd.conf
    • [IP] - - [Time] "[Method] [URI] [Version]" [Code] [Bytes] "[Referrer]" "[User-Agent]"
      • 10.66.91.231 - - [08/Jun/2009:21:24:44 -0700] "GET /search?q=awesome+presentation HTTP/1.1" 200 2940 "http://i_was_referred_from_here.com" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.10) Gecko/2009042315 Firefox/3.0.10 Ubiquity/0.1.4"
    • Tip: Instrument as much as possible in GET URI via CGI parameters
      • search?q=yahoo&region=us&tab=local&device=mobile&advanced=1
      • One log, avoid joins; URI must be < 2k bytes
    • grep, cut, uniq, wc, sort, cat are your friends
      • Ex. Count user query sessions (session key = IP+hour)
      • sudo grep '/search?q=' /etc/httpd/logs/access_log.1 | cut -d' ' -f1,4 | cut -d':' -f1,2 | sort | uniq | wc -l
    • For advanced SQL processing on single machine: sqlite3 import script
      • http://selinap.com/2008/04/python-parse-apache-log-to-sqlite-database/
    • Distributed: Hadoop & Pig
      • http://www.cloudera.com/blog/2009/06/17/analyzing-apache-logs-with-pig/
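For logs too awkward for a shell pipeline, the session count from this slide (session key = IP + hour) can be sketched in Python. The regular expression assumes the access-log layout shown above; the first sample line is the one from the slide with the referrer and user agent shortened, and the other lines are invented for illustration. Unlike the `grep` version, this checks only the request URI, not the whole line.

```python
import re

# IP, two ignored fields, [timestamp], then "METHOD URI VERSION".
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*"')


def count_query_sessions(lines):
    """Count distinct (IP, day+hour) pairs among requests to /search?q=..."""
    sessions = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that do not look like access-log entries
        ip, timestamp, _method, uri = match.groups()
        if not uri.startswith("/search?q="):
            continue
        # "08/Jun/2009:21:24:44 -0700" -> "08/Jun/2009:21"
        day_and_hour = ":".join(timestamp.split(":")[:2])
        sessions.add((ip, day_and_hour))
    return len(sessions)


sample = [
    '10.66.91.231 - - [08/Jun/2009:21:24:44 -0700] "GET /search?q=awesome+presentation HTTP/1.1" 200 2940 "-" "Mozilla/5.0"',
    '10.66.91.231 - - [08/Jun/2009:21:40:02 -0700] "GET /search?q=open+search HTTP/1.1" 200 3011 "-" "Mozilla/5.0"',
    '10.66.91.231 - - [08/Jun/2009:22:05:10 -0700] "GET /search?q=lucene HTTP/1.1" 200 1800 "-" "Mozilla/5.0"',
    '10.0.0.7 - - [08/Jun/2009:21:30:00 -0700] "GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]
print(count_query_sessions(sample))
```

Using a set means repeated queries from the same IP within the same hour collapse into one session no matter where they appear in the file.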
  • Other Wishlist Items
    • A good baseline
      • Motivate your users to use your engine
      • More fun than reading newspaper stories from 1997
    • Evaluate something that is different from ranking
      • Summarization
      • Information extraction
    • Or improve on existing ranking
    • NLP tasks “take top results and do X…”
      • Data mining
    • Pseudo-relevance feedback
  • Reasons to Build a Demo
    • “Eat Your Own Dogfood”: algorithm design and testing
      • allows you to improve without labeled data
      • look closely at the results
      • convince your advisor/funders it works!
    • Observe user behavior
      • Cheap flight to boston, Cheap flights to boston, Cheap flights
      • Travelocity, Expedia
      • American arlines.com, American airlines.com, Americanairlines.com
      • Puppy, Cute puppy, More cute puppy picutres
  • More About Logs and Evaluation in Other Tutorials
    • Web Search Engine Metrics (Direct Metrics to Measure User Satisfaction) – Tuesday, 2:00 PM–5:30 PM
    • Ali Dasdan, Yahoo! (USA), Kostas Tsioutsiouliklis, Yahoo! (USA), Emre Velipasaoglu, Yahoo! (USA)
    • Web Search/Browse Log Mining: Challenges, Methods, and Applications – Today, 9:00 AM–5:30 PM
    • Daxin Jiang, Microsoft (China), Jian Pei, Simon Fraser University (Canada), Hang Li, Microsoft (
  • What Doesn’t Exist?
    • Query log mining tools
      • An opportunity for you!
  • Other Open Source Tools
  • Lemur Query Log Toolbar
    • Research community project for collecting query logs
      • Sign up at http://lemurstudy.cs.umass.edu/
    • Built and maintained by LTI CMU and CIIR UMass Amherst
    • http://www.lemurproject.org/
  • Book on Hadoop Scale Processing Coming Out
    • Ivory: A Hadoop toolkit for Web-scale information retrieval
    • http://www.umiacs.umd.edu/~jimmylin/ivory/docs/index.html
    • Jimmy Lin
  • Take Home Messages
    • You can evaluate with clicks
    • You can collect clicks by building a useful / fun search service
    • You can create a useful/fun search service using open search APIs
    • You obtain implementations of standard retrieval algorithms with open source search engines
    • Modify that code with your new techniques
  • Pointers - Tools
    • [1] Indri Homepage. http://www.lemurproject.org/indri/ .
    • [2] Lemur Toolkit Homepage. http://www.lemurproject.org/.
    • [3] Lucene Homepage. http://jakarta.apache.org/lucene/ .
    • [4] Xapian Code Library Homepage. http://www.xapian.org/.
    • [5] Zettair Homepage. http://www.seg.rmit.edu.au/zettair/ .
    • [6] Terrier Homepage. http://ir.dcs.gla.ac.uk/terrier/ .
    • [7] Nutch Homepage. http://lucene.apache.org/nutch/ .
    • [8] Sphinx search http://sphinxsearch.com/
  • Mashup Resources
    • Yahoo Developer Network: developer.yahoo.com
    • Y.Q.L. : developer.yahoo.com/yql
    • BOSS : developer.yahoo.com/boss
    • Bing : bing.com/developers
    • Google Search: code.google.com/apis/ajaxsearch/
    • App Engine : code.google.com/appengine/
    • A.W.S. : aws.amazon.com
    • Programmable Web : programmableWeb.com
    • Mashable : mashable.com
    • Tech Crunch : TechCrunch.com
  • Acknowledgements
    • Vik Singh co-wrote earlier version of this tutorial
    • A few slides from Ricardo Baeza-Yates and Ben Carterette
    • Andrew Tomkins, Wei Vivian Zhang, Ahmed Hassan, Eran Palmon for helpful feedback
    • QA