Faceted Search and Solr


An overview of Faceted Search by Daniel Tunkelang and an overview of Faceted Search and Solr by Otis Gospodnetić.

Published in: Technology

  1. Faceted Search
     New York CTO Club, December 9, 2009
     Daniel Tunkelang, Google
     Otis Gospodnetić, Sematext

     Agenda
       Daniel:
       - What is faceted search?
       - Why use faceted search?
       - Thoughts about design and user experience.
       Otis:
       - What are Lucene and Solr?
       - Why use an open-source search library?
       - Thoughts about implementation.

     “Regular” Search
       Interface:
       - User expresses information need as short query.
       - Search engine returns ranked, pageable result set.
       User happy when...
       - Top-ranked result satisfies information need.
       - At least some result on first page is relevant.
       User unhappy when...
       - No result on first page satisfies information need.
       - Results misleadingly appear relevant (bait and switch).

     Relevance Is Subjective
       "Relevance is defined as a measure of information conveyed by a document relative to a query. It is shown that the relationship between the document and the query, though necessary, is not sufficient to determine relevance."
       William Goffman, On relevance as a measure, 1964.
  2. Regular Search Experience

     Assumptions Are Dangerous
       (slide labels: tf-idf, PageRank)
       - self-awareness
       - self-expression
       - model knows best
       - answer is a document
       - one-shot query

     What is Faceted Search?
       - Best understood through examples.
         - See the following slides.
         - Or shop on almost any ecommerce site.
       - Facets = multiple ways to organize information.
         - Often based on available structured information.
         - But not always, e.g., facets obtained via text mining.
       - Typical interaction:
         - User starts with a full-text search.
         - Facets guide query refinement process.

     Faceted Search for News
  3. Faceted Search for People

     Faceted Search for Breakfast

     But Facets are Not a Silver Bullet...
       - Screen real estate is finite.
         - Choose facets wisely.
         - Choose facet values wisely for monster facets.
       - Multiple selection within a facet is powerful, but...
         - Has to be intuitive, especially AND vs. OR.
         - Even trickier for hierarchical facets.
       - Search relevance still matters!
         - Most faceted search applications rank results.
         - Irrelevant results => irrelevant facet refinements.
  4. Exploring Information Science

     Deliver Precision and Recall
       Easier said than done!

     Be Careful with Faceted Search!
       Cameras have artists?!

     Clarify, Then Refine
       Ranking of facet values is an open research topic.
  5. Take-Aways
       - Faceted search addresses the subjectivity of relevance and information overload.
       - But deploying faceted search effectively requires that you think about user experience.
       - Recommended reading:
         - My thin book entitled Faceted Search
         - Marti Hearst's book on Search User Interfaces
         - Peter Morville's upcoming book on Search Patterns

     Faceted Search with Lucene & Solr
       Otis Gospodnetić, Sematext

     What is / isn't Lucene
       - Free, ASL, Java IR library, Jar
       - Doug Cutting, ASF, 2001
       - Application agnostic: Indexing & Searching
       - High performance, scalable
       - No dependencies
       - Heavily ported
       - No: crawler, rich doc parser, turn-key solution
       - No: out of the box faceted search-capability... but...
  6. What is/isn't Solr
       - Indexing/Search server with HTTP API built on top of Lucene
       - Fast & scalable (distributed search, index replication)
       - XML, JSON, Ruby, Perl, PHP, javabin
       - No: crawler (but Nutch ==> Solr works)
       - Yes: rich text parser
       - Yes: Faceted Search out of the box!

     Solr and Faceted Search
       - 3 Types of facets: Field Values (text), Dates, Queries.
       - “Text”: return counts for all/top terms in a field for a result set, e.g. categories a la Amazon
       - Dates: return counts for docs in specified date ranges
       - Queries: return counts for docs that also match a given query; handy for number ranges (think prices!)

     Facet Field Requirements
       - Must be indexed
       - Often not tokenized
       - Often not altered (lowercase, punctuation)
       - Storing not required
       - Multivalued fields OK

     Turn It On
       - 0 facets:
         http://host:80/solr/select?q=foo
       - 1 facet:
         http://host:80/solr/select?q=foo&facet=true&facet.field=category
       - N facets:
         http://host:80/solr/select?q=foo&facet=true&facet.field=category&facet.field=inStock
       - facet=true or facet=on
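The three "Turn It On" URLs differ only in how many facet.field parameters they carry. A minimal sketch of assembling them in Python follows. It assumes a hypothetical helper named build_facet_url and reuses the placeholder host from the slide (http://host:80). It only builds strings and does not contact a server.

```python
from urllib.parse import urlencode


def build_facet_url(base_url, query, facet_fields=()):
    """Return a Solr select URL with one facet.field per requested field.

    build_facet_url is a hypothetical helper for illustration, not a Solr API.
    """
    params = [("q", query)]
    if facet_fields:
        # Faceting stays off unless facet=true is sent with the request.
        params.append(("facet", "true"))
        # Repeating facet.field asks for counts on several fields at once.
        params.extend(("facet.field", field) for field in facet_fields)
    return base_url + "?" + urlencode(params)


BASE = "http://host:80/solr/select"  # placeholder host taken from the slide

no_facets = build_facet_url(BASE, "foo")
one_facet = build_facet_url(BASE, "foo", ["category"])
n_facets = build_facet_url(BASE, "foo", ["category", "inStock"])

for url in (no_facets, one_facet, n_facets):
    print(url)
```

Passing the parameters as a list of pairs rather than a dict is what allows facet.field to appear more than once.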
  7. Text Facet Response
       <result numFound="4" start="0"/>
       <lst name="facet_counts">
         <lst name="facet_fields">
           <lst name="category">
             <int name="electronics">3</int>
             <int name="copier">0</int>
           </lst>
           <lst name="inStock">
             <int name="false">3</int>
             <int name="true">1</int>
           </lst>
         </lst>
       </lst>

       - facet.mincount=1 to avoid 0-count facet values
       - facet.limit=N to limit to top N facet values
       - facet.missing=true to catch uncategorized
       - lots of other options!

     Date Facets
       - http://.../solr/select/?q=*:*&rows=0&facet=true&facet.date=timestamp&facet.date.start=NOW/DAY-5DAYS&facet.date.end=NOW/DAY%2B1DAY&facet.date.gap=%2B1DAY
       - (%2B1 ==> +1)
       - Solr Date Math Parser syntax: /HOUR, +2YEARS, -1DAY, /DAY+6MONTHS+3DAYS, +6MONTHS+3DAYS/DAY

     Date Facet Response
       <result name="response" numFound="42" start="0"/>
       <lst name="facet_counts">
         <lst name="facet_dates">
           <lst name="timestamp">
             <int name="2007-08-11T00:00:00.000Z">1</int>
             <int name="2007-08-12T00:00:00.000Z">5</int>
             <int name="2007-08-13T00:00:00.000Z">3</int>
             <int name="2007-08-14T00:00:00.000Z">7</int>
             <int name="2007-08-15T00:00:00.000Z">2</int>
             <int name="2007-08-16T00:00:00.000Z">16</int>
             <str name="gap">+1DAY</str>
             <date name="end">2007-08-17T00:00:00Z</date>
           </lst>
         </lst>
       </lst>

     Query Facets
       - http://.../solr/select?q=shoes&rows=0&facet=true&facet.field=inStock&facet.query=price:[*+TO+500]&facet.query=price:[500+TO+*]
       - Avoids the bucket-at-index-time work-around
       - Keep queries disjoint
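One way to consume a text facet response like the sample on the slide is to walk the nested lst elements. The sketch below uses Python's standard xml.etree.ElementTree on a copy of that sample. It assumes the fragment is wrapped in a <response> root element, which the slide omits, so that it parses as a single document. The function name parse_facet_fields is hypothetical.

```python
import xml.etree.ElementTree as ET

# Copy of the slide's text facet sample, wrapped in a <response> root
# (an assumption; the slide shows only the inner elements).
SAMPLE = """
<response>
  <result numFound="4" start="0"/>
  <lst name="facet_counts">
    <lst name="facet_fields">
      <lst name="category">
        <int name="electronics">3</int>
        <int name="copier">0</int>
      </lst>
      <lst name="inStock">
        <int name="false">3</int>
        <int name="true">1</int>
      </lst>
    </lst>
  </lst>
</response>
""".strip()


def parse_facet_fields(xml_text):
    """Map each facet field name to a {facet value: count} dict."""
    root = ET.fromstring(xml_text)
    path = "./lst[@name='facet_counts']/lst[@name='facet_fields']/lst"
    counts = {}
    for field in root.findall(path):
        counts[field.get("name")] = {
            value.get("name"): int(value.text) for value in field.findall("int")
        }
    return counts


facets = parse_facet_fields(SAMPLE)

# Client-side equivalent of facet.mincount=1: drop zero-count values.
non_empty_categories = {v: n for v, n in facets["category"].items() if n >= 1}

print(facets)
print(non_empty_categories)
```

In practice, sending facet.mincount=1 with the request lets the server do that filtering, so the client-side step is shown only to make the parameter's effect concrete.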
  8. Query Facet Response
       <result numFound="3" start="0"/>
       <lst name="facet_counts">
         <lst name="facet_queries">
           <int name="price:[* TO 500]">3</int>
           <int name="price:[500 TO *]">1</int>
         </lst>
         <lst name="facet_fields">
           <lst name="inStock">
             <int name="false">3</int>
             <int name="true">1</int>
           </lst>
         </lst>
       </lst>

     UI Integration
       - Use Filter Queries via fq
       - http://.../solr/select?q=shoes&facet=true&facet.field=category&fq=price:[0 TO 300]
       - http://.../solr/select?q=shoes&facet=true&facet.field=category&fq=price:[0 TO 300]&fq=inStock:true
       - Important: single request does it all

     State of Lucene & Solr
       - Super healthy community, exploding development
       - Lucene 3.0 – 2009-11-25:
         - Performance, faster range queries, clean API, better Unicode support, more non-English support
       - Solr 1.4 – 2009-11-10:
         - Performance, new replication, DB indexing, rich-doc indexing, results clustering, faster response protocol, deduplication...

     Lucene, Solr, Enterprise
       - Free: Community
         - Lucene ~600 emails/month (dev: 2000/month)
         - Solr ~1300 emails/month (dev: 800/month)
       - Commercial: Support Subscriptions
         - Sematext
         - Lucid Imagination
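The "UI Integration" slide suggests that each facet value a user clicks becomes one more fq parameter on the same request. A minimal sketch of that mapping in Python follows. The helper name build_refined_url and the base URL are placeholders chosen for illustration, and the code only builds a URL string.

```python
from urllib.parse import parse_qsl, urlencode


def build_refined_url(base_url, query, facet_fields, selected_filters):
    """Return one URL carrying the query, the facets to count, and the filters.

    build_refined_url is a hypothetical helper; selected_filters holds the
    filter expressions for facet values the user has clicked so far.
    """
    params = [("q", query), ("facet", "true")]
    params.extend(("facet.field", field) for field in facet_fields)
    # Each selection becomes its own fq parameter.
    params.extend(("fq", expression) for expression in selected_filters)
    # urlencode escapes spaces, colons and brackets in the filter expressions.
    return base_url + "?" + urlencode(params)


BASE = "http://host:80/solr/select"  # placeholder base URL

refined = build_refined_url(
    BASE, "shoes", ["category"], ["price:[0 TO 300]", "inStock:true"]
)
# Decode the query string again to confirm the parameters survive escaping.
decoded = parse_qsl(refined.split("?", 1)[1])

print(refined)
print(decoded)
```

Because the facet counts and the filtered results come back from the same request, the UI can redraw both the result list and the remaining refinements after every click.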