Faceted Search Nycto Talk


These slides were used for a presentation by Daniel Tunkelang (Google) and Otis Gospodnetic (Sematext) at the New York CTO Club on December 9th, 2009.

Faceted Search

People come to your site to get the information they need, by exploring, discovering, and making comparisons. You want them to successfully sift through all of your content, quickly and effectively. The traditional approach of providing a search box and a ranked list of results can frustrate users, who need more guidance in order to find what they are looking for--or even know if the information is available.

Enter faceted search. Faceted search enables users to navigate a multi-dimensional information space by combining text search with a progressive narrowing of choices in each dimension. This technique has become ubiquitous in online retail, and is increasingly popular in other domains, both on the public internet and on intranets.

This talk will review the basic concepts of faceted search, and then dive into some of the subtler concerns. Specifically, we will elaborate on both the design and implementation concerns that determine whether a faceted search deployment will be successful.

Our own Daniel Tunkelang co-founded Endeca, a pioneer in faceted search, and worked there for 10 years before recently moving to Google. In addition to building the world's leading commercial technology for faceted search, he has played an active role in engaging the broader community of researchers and practitioners to advance understanding of this field. These efforts include organizing an annual workshop on human-computer information retrieval and publishing a textbook on faceted search.

Otis Gospodnetic is the co-founder of Sematext, a Lucene expert, co-author of Lucene in Action and the upcoming Solr in Action, and a long-time Lucene and Solr developer with over 10 years of experience in search and related technologies. Sematext implements open-source search, linguistic, and text analytics technology in the enterprise, with a focus on developing scalable, high-performance search solutions.

Published in: Technology

  1. Faceted Search
     New York CTO Club, December 9, 2009
     Daniel Tunkelang, Google
     Otis Gospodnetić, Sematext
  2. Agenda
     Daniel:
     • What is faceted search?
     • Why use faceted search?
     • Thoughts about design and user experience.
     Otis:
     • What are Lucene and Solr?
     • Why use an open-source search library?
     • Thoughts about implementation.
  3. “Regular” Search
     Interface:
     • User expresses information need as short query.
     • Search engine returns ranked, pageable result set.
     User happy when...
     • Top-ranked result satisfies information need.
     • At least some result on first page is relevant.
     User unhappy when...
     • No result on first page satisfies information need.
     • Results misleadingly appear relevant (bait and switch).
  4. Relevance Is Subjective
     "Relevance is defined as a measure of information conveyed by a document relative to a query. It is shown that the relationship between the document and the query, though necessary, is not sufficient to determine relevance."
     William Goffman, On relevance as a measure, 1964.
  5. Regular Search Experience
  6. Assumptions Are Dangerous
     • self-awareness
       tf-idf, PageRank
     • self-expression
     • model knows best
     • answer is a document
     • one-shot query
  7. What is Faceted Search?
     • Best understood through examples.
       • See the following slides.
       • Or shop on almost any ecommerce site.
     • Facets = multiple ways to organize information.
       • Often based on available structured information.
       • But not always, e.g., facets obtained via text mining.
     • Typical interaction:
       • User starts with a full-text search.
       • Facets guide query refinement process.
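
The typical interaction on slide 7 can be sketched in a few lines of Python. This is a toy in-memory illustration, not how any particular engine implements faceting; the catalog, the field names, and the helpers `search` and `facet_counts` are all hypothetical.

```python
from collections import Counter

# Hypothetical toy catalog; titles and facet fields are made up for illustration.
DOCS = [
    {"title": "red running shoes", "brand": "Acme", "color": "red"},
    {"title": "blue running shoes", "brand": "Acme", "color": "blue"},
    {"title": "red hiking boots", "brand": "Peak", "color": "red"},
    {"title": "blue rain jacket", "brand": "Peak", "color": "blue"},
]


def search(docs, text, filters=None):
    """Return docs whose title contains every query word and that match all facet filters."""
    filters = filters or {}
    words = text.lower().split()
    return [
        d for d in docs
        if all(w in d["title"].split() for w in words)
        and all(d.get(field) == value for field, value in filters.items())
    ]


def facet_counts(results, fields):
    """Count how many results carry each value of each facet field."""
    return {f: Counter(d[f] for d in results if f in d) for f in fields}


# Step 1: the user starts with a full-text search.
results = search(DOCS, "shoes")
# Step 2: facet counts over the result set suggest possible refinements.
counts = facet_counts(results, ["brand", "color"])
# Step 3: the user picks a facet value, which narrows the result set.
refined = search(DOCS, "shoes", {"color": "red"})
```

The counts are computed over the current result set rather than the whole catalog, which is what lets the facets guide refinement instead of just describing the collection.
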
  8. Faceted Search for News
  9. Faceted Search for People
  10. Faceted Search for Breakfast
  11. But Facets are Not a Silver Bullet...
      • Screen real estate is finite.
        • Choose facets wisely.
        • Choose facet values wisely for monster facets.
      • Multiple selection within a facet is powerful, but...
        • Has to be intuitive, especially AND vs. OR.
        • Even trickier for hierarchical facets.
      • Search relevance still matters!
        • Most faceted search applications rank results.
        • Irrelevant results ==> irrelevant facet refinements.
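
The AND vs. OR distinction for multiple selection within one facet is easy to state in code. The sketch below uses a hypothetical multivalued "tags" facet and two hypothetical helpers; it only illustrates the two semantics and says nothing about which one a given interface should choose.

```python
# Hypothetical documents, each carrying a set of values for one multivalued facet.
DOCS = {
    "doc1": {"jazz", "vinyl"},
    "doc2": {"jazz"},
    "doc3": {"rock", "vinyl"},
}


def select_or(docs, selected):
    """OR semantics: keep documents carrying at least one selected value."""
    return sorted(doc for doc, tags in docs.items() if tags & selected)


def select_and(docs, selected):
    """AND semantics: keep documents carrying every selected value."""
    return sorted(doc for doc, tags in docs.items() if selected <= tags)


chosen = {"jazz", "vinyl"}
or_hits = select_or(DOCS, chosen)    # each extra selection can only broaden the set
and_hits = select_and(DOCS, chosen)  # each extra selection can only narrow the set
```

With OR, ticking a second value grows the result set; with AND, it shrinks it. A user who expects one behavior and gets the other will find the interface confusing, which is the usability risk the slide points at.
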
  12. Exploring Information Science
  13. Deliver Precision and Recall
      Easier said than done! Ranking of facet values is an open research topic.
  14. Be Careful with Faceted Search!
      Cameras have artists?!
  15. Clarify, Then Refine
  16. Take-Aways
      • Faceted search addresses the subjectivity of relevance and information overload.
      • But deploying faceted search effectively requires that you think about user experience.
      • Recommended reading:
        • My thin book entitled Faceted Search
        • Marti Hearst's book on Search User Interfaces
        • Peter Morville's upcoming book on Search Patterns
  17. Faceted Search with Lucene & Solr
      Otis Gospodnetić, Sematext
  18. What is / isn't Lucene
      • Free, ASL, Java IR library, Jar
      • Doug Cutting, ASF, 2001
      • Application agnostic: Indexing & Searching
      • High performance, scalable
      • No dependencies
      • Heavily ported
      • No: crawler, rich doc parser, turn-key solution
      • No: out-of-the-box faceted search capability... but...
  19. What is/isn't Solr
      • Indexing/Search server with HTTP API built on top of Lucene
      • Fast & scalable (distributed search, index replication)
      • XML, JSON, Ruby, Perl, PHP, javabin
      • No: crawler (but Nutch ==> Solr works)
      • Yes: rich text parser
      • Yes: Faceted Search out of the box!
  20. Solr and Faceted Search
      • 3 Types of facets: Field Values (text), Dates, Queries.
      • “Text”: return counts for all/top terms in a field for a result set, e.g. categories a la Amazon
      • Dates: return counts for docs in specified date ranges
      • Queries: return counts for docs that also match a given query; handy for number ranges (think prices!)
  21. Facet Field Requirements
      • Must be indexed
      • Often not tokenized
      • Often not altered (lowercase, punctuation)
      • Storing not required
      • Multivalued fields OK
  22. Turn It On
      • 0 facets:
        http://host:80/solr/select?q=foo
      • 1 facet:
        http://host:80/solr/select?q=foo&facet=true&facet.field=category
      • N facets:
        http://host:80/solr/select?q=foo&facet=true&facet.field=category&facet.field=inStock
      • facet=true or facet=on
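
The three URLs on slide 22 differ only in their query parameters, so a client can build them from a list of facet fields. The following is a minimal sketch using only the Python standard library; the helper name `facet_url` is hypothetical, and the host and path are the placeholders from the slide.

```python
from urllib.parse import urlencode

BASE = "http://host:80/solr/select"  # placeholder host and path, as on the slide


def facet_url(query, facet_fields=()):
    """Build a select URL, turning faceting on only when fields are requested."""
    params = [("q", query)]
    if facet_fields:
        params.append(("facet", "true"))
        # facet.field is repeated once per field, so a list of pairs is used
        # instead of a dict (a dict could hold only one value per key).
        params.extend(("facet.field", field) for field in facet_fields)
    return BASE + "?" + urlencode(params)


no_facets = facet_url("foo")
one_facet = facet_url("foo", ["category"])
two_facets = facet_url("foo", ["category", "inStock"])
```

Passing a list of pairs to `urlencode` preserves parameter order and allows the repeated `facet.field` key that the N-facet case needs.
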
  23. Text Facet Response
      <result numFound="4" start="0"/>
      <lst name="facet_counts">
        <lst name="facet_fields">
          <lst name="category">
            <int name="electronics">3</int>
            <int name="copier">0</int>
          </lst>
          <lst name="inStock">
            <int name="false">3</int>
            <int name="true">1</int>
          </lst>
        </lst>
      </lst>
      • facet.mincount=1 to avoid 0-count facet values
      • facet.limit=N to limit to top N facet values
      • facet.missing=true to catch uncategorized
      • lots of other options!
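
A client has to turn a response like the one on slide 23 into something a UI can render. The sketch below parses the slide's fragment with the standard library, assuming it arrives wrapped in a `<response>` root element. The helper `field_facets` is hypothetical, and its `mincount` argument mimics `facet.mincount` on the client side purely for illustration; in practice the parameter would be sent to Solr instead.

```python
import xml.etree.ElementTree as ET

# The fragment from the slide, wrapped in an assumed <response> root element.
XML = """<response>
<result numFound="4" start="0"/>
<lst name="facet_counts">
  <lst name="facet_fields">
    <lst name="category">
      <int name="electronics">3</int>
      <int name="copier">0</int>
    </lst>
    <lst name="inStock">
      <int name="false">3</int>
      <int name="true">1</int>
    </lst>
  </lst>
</lst>
</response>"""


def field_facets(xml_text, mincount=0):
    """Map each facet field to {value: count}, dropping values below mincount."""
    root = ET.fromstring(xml_text)
    path = "./lst[@name='facet_counts']/lst[@name='facet_fields']/lst"
    facets = {}
    for field in root.findall(path):
        facets[field.get("name")] = {
            value.get("name"): int(value.text)
            for value in field.findall("int")
            if int(value.text) >= mincount
        }
    return facets


all_values = field_facets(XML)
non_empty = field_facets(XML, mincount=1)
```

With `mincount=1` the zero-count "copier" value disappears, which is the effect the first bullet on the slide describes.
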
  24. Date Facets
      • http://.../solr/select/?q=*:*&rows=0&facet=true&facet.date=timestamp&facet.date.start=NOW/DAY-5DAYS&facet.date.end=NOW/DAY%2B1DAY&facet.date.gap=%2B1DAY
      • (%2B1 ==> +1)
      • Solr Date Math Parser syntax: /HOUR, +2YEARS, -1DAY, /DAY+6MONTHS+3DAYS, +6MONTHS+3DAYS/DAY
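
The "%2B1 ==> +1" note on slide 24 exists because a literal "+" in a query string is decoded as a space. The sketch below shows this with the Python standard library; it builds the slide's parameters from their unencoded values and lets `urlencode` do the escaping. Note that `urlencode` also escapes `/`, `:` and `*`, which the slide leaves as-is; both forms decode to the same values.

```python
from urllib.parse import parse_qs, urlencode

# Unencoded parameter values, as the date math is meant to be read.
params = [
    ("q", "*:*"),
    ("rows", "0"),
    ("facet", "true"),
    ("facet.date", "timestamp"),
    ("facet.date.start", "NOW/DAY-5DAYS"),
    ("facet.date.end", "NOW/DAY+1DAY"),
    ("facet.date.gap", "+1DAY"),
]

# urlencode percent-encodes "+" as %2B so it survives decoding on the server.
query_string = urlencode(params)
decoded = parse_qs(query_string)

# A raw "+" pasted into the URL is decoded as a space and corrupts the date math.
broken = parse_qs("facet.date.gap=+1DAY")
```
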
  25. Date Facet Response
      <result name="response" numFound="42" start="0"/>
      <lst name="facet_counts">
        <lst name="facet_dates">
          <lst name="timestamp">
            <int name="2007-08-11T00:00:00.000Z">1</int>
            <int name="2007-08-12T00:00:00.000Z">5</int>
            <int name="2007-08-13T00:00:00.000Z">3</int>
            <int name="2007-08-14T00:00:00.000Z">7</int>
            <int name="2007-08-15T00:00:00.000Z">2</int>
            <int name="2007-08-16T00:00:00.000Z">16</int>
            <str name="gap">+1DAY</str>
            <date name="end">2007-08-17T00:00:00Z</date>
          </lst>
        </lst>
      </lst>
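
In the date facet response, the per-day counts sit next to metadata entries (`gap`, `end`) inside the same list, so a client needs to separate the two. A minimal sketch with the standard library follows; it assumes the fragment is wrapped in a `<response>` root element and closed with the matching `</lst>` tags, and the helper `date_buckets` is hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# The slide's fragment, wrapped in an assumed <response> root and fully closed.
XML = """<response>
<result name="response" numFound="42" start="0"/>
<lst name="facet_counts">
  <lst name="facet_dates">
    <lst name="timestamp">
      <int name="2007-08-11T00:00:00.000Z">1</int>
      <int name="2007-08-12T00:00:00.000Z">5</int>
      <int name="2007-08-13T00:00:00.000Z">3</int>
      <int name="2007-08-14T00:00:00.000Z">7</int>
      <int name="2007-08-15T00:00:00.000Z">2</int>
      <int name="2007-08-16T00:00:00.000Z">16</int>
      <str name="gap">+1DAY</str>
      <date name="end">2007-08-17T00:00:00Z</date>
    </lst>
  </lst>
</lst>
</response>"""


def date_buckets(xml_text, field):
    """Return ([(bucket start date, count), ...], gap) for one date facet field."""
    root = ET.fromstring(xml_text)
    node = root.find(
        f"./lst[@name='facet_counts']/lst[@name='facet_dates']/lst[@name='{field}']"
    )
    # Only the <int> children are buckets; <str> and <date> children are metadata.
    buckets = [
        (datetime.strptime(b.get("name"), "%Y-%m-%dT%H:%M:%S.%fZ").date(), int(b.text))
        for b in node.findall("int")
    ]
    gap = node.findtext("str[@name='gap']")
    return buckets, gap


buckets, gap = date_buckets(XML, "timestamp")
total_in_range = sum(count for _, count in buckets)
```

The bucket counts add up to fewer documents than `numFound`, since documents whose timestamp falls outside the requested start and end are matched by the query but not counted in any bucket.
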
  26. Query Facets
      • http://.../solr/select?q=shoes&rows=0&facet=true&facet.field=inStock&facet.query=price:[*+TO+500]&facet.query=price:[500+TO+*]
      • Avoids the bucket-at-index-time work-around
      • Keep queries disjoint
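
On "keep queries disjoint": square-bracket ranges are inclusive at both ends, so `[* TO 500]` and `[500 TO *]` both match a price of exactly 500. One way to avoid the overlap, assuming prices are indexed as whole numbers (an assumption not stated on the slide), is to end each range one below the next boundary. The helper `range_queries` is hypothetical.

```python
from urllib.parse import parse_qs, urlencode


def range_queries(field, boundaries):
    """Build inclusive, non-overlapping ranges for integer-valued fields:
    [* TO b1-1], [b1 TO b2-1], ..., [bn TO *]."""
    queries = []
    lower = "*"
    for boundary in sorted(boundaries):
        queries.append(f"{field}:[{lower} TO {boundary - 1}]")
        lower = boundary
    queries.append(f"{field}:[{lower} TO *]")
    return queries


queries = range_queries("price", [100, 500])

# One facet.query parameter per range, alongside the slide's other parameters.
params = [("q", "shoes"), ("rows", "0"), ("facet", "true"), ("facet.field", "inStock")]
params += [("facet.query", q) for q in queries]
query_string = urlencode(params)
```

With disjoint ranges the counts for the buckets can be added up without counting any document twice. For non-integer prices this trick does not apply, and a different boundary convention would be needed.
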
  27. Query Facet Response
      <result numFound="3" start="0"/>
      <lst name="facet_counts">
        <lst name="facet_queries">
          <int name="price:[* TO 500]">3</int>
          <int name="price:[500 TO *]">1</int>
        </lst>
        <lst name="facet_fields">
          <lst name="inStock">
            <int name="false">3</int>
            <int name="true">1</int>
          </lst>
        </lst>
      </lst>
  28. UI Integration
      • Use Filter Queries via fq
      • http://.../solr/select?q=shoes&facet=true&facet.field=category&fq=price:[0 TO 300]
      • http://.../solr/select?q=shoes&facet=true&facet.field=category&fq=price:[0 TO 300]&fq=inStock:true
      • Important: single request does it all
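
In a UI, each click on a facet value typically appends one more `fq` parameter to the request, as in the two URLs on slide 28. The sketch below models that with a small state dictionary; the state layout and the helpers `refine` and `to_query_string` are hypothetical, and the facet values are not escaped for query syntax, which a real client would need to handle.

```python
from urllib.parse import parse_qs, urlencode


def refine(state, field, value):
    """Return a new state with one more filter query; the input state is not modified."""
    fq = f"{field}:{value}"
    filters = state["fq"] if fq in state["fq"] else state["fq"] + [fq]
    return {**state, "fq": filters}


def to_query_string(state):
    """Serialize the state into one request that returns results and facet counts."""
    params = [("q", state["q"]), ("facet", "true")]
    params += [("facet.field", field) for field in state["facet_fields"]]
    params += [("fq", fq) for fq in state["fq"]]
    return urlencode(params)


start = {"q": "shoes", "facet_fields": ["category"], "fq": []}
after_price = refine(start, "price", "[0 TO 300]")
after_stock = refine(after_price, "inStock", "true")
query_string = to_query_string(after_stock)
```

Because the filters and the facet parameters travel in the same request, the response carries both the narrowed result list and the facet counts for the next refinement step, which is the "single request does it all" point on the slide.
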
  29. State of Lucene & Solr
      • Super healthy community, exploding development
      • Lucene 3.0 – 2009-11-25:
        • Performance, faster range queries, clean API, better Unicode support, more non-English support
      • Solr 1.4 – 2009-11-10:
        • Performance, new replication, DB indexing, rich-doc indexing, results clustering, faster response protocol, deduplication...
  30. Lucene, Solr, Enterprise
      • Free: Community
        • Lucene ~600 emails/month (dev: 2000/month)
        • Solr ~1300 emails/month (dev: 800/month)
      • Commercial: Support Subscriptions
        • Sematext
        • Lucid Imagination