Faceted Search with Lucene


Faceted search is a powerful technique that lets users navigate search results easily. It can also be used to build rich user interfaces that give an analyst quick insight into the document space. In this session I will introduce the Lucene Facets module: how to use it, under-the-hood details, optimizations and best practices. I will also describe the advanced faceted search capabilities of Lucene Facets.

Published in: Technology, Education

  1. 1. Faceted Search with Lucene
      Shai Erera, Researcher, IBM
  2. 2. Who Am I
      • Working at IBM, Information Retrieval Research
      • Lucene/Solr committer and PMC member
  3. 3. Lucene Facets 101
  4. 4. Faceted Search
      • Technique for accessing documents that were classified into a taxonomy of categories
          – Flat: Author/John Doe, Tags/Lucene, Popularity/High
          – Hierarchical: Computers/Software/Information Retrieval/Fulltext/Apache Lucene (ODP)
      • Quick overview of the breakdown of the search results
          – How many documents are in category Committed Paths/lucene/core vs. Committed Paths/lucene/facet
      • Simplifies interaction with the search application
          – Drill down to issues that were updated in the past 2 days by clicking a link
          – No knowledge required about search syntax and index schema
  5. 5. Lucene Facets
      • Contributed by IBM in 2011, released in 3.4.0
      • Major changes since 4.1.0+
          – NRT support
          – Nearly 400% search speedups
          – Complete API revamp
          – New features (SortedSet, range faceting, drill-sideways)
      • Two main indexing-time modes
          – Taxonomy-based: hierarchical facets, managed by a sidecar index, low NRT reopen cost
          – SortedSetDocValues: flat facets only, no sidecar index, higher NRT reopen cost
      • Runtime modes
          – Range facets (on NumericDocValues fields)
      • Other implementations: Solr, ElasticSearch, Bobo Browse
  6. 6. Lucene Facet Components
      • TaxonomyWriter/Reader
          – Manage the taxonomy information
      • FacetFields
          – Add facets information to documents (DocValues fields, drilldown terms)
      • FacetRequest
          – Defines which facets to aggregate and the FacetsAggregator (aggregation function)
      • FacetsCollector
          – Collects matching documents and computes the top-K categories for each facet request (invokes FacetsAccumulator)
      • DrillDownQuery / DrillSideways
          – Execute drilldown and drill-sideways requests
  7. 7. Sample Code – Indexing
          // Builds the taxonomy as documents are indexed; one instance is shared by all indexing threads
          TaxonomyWriter taxoWriter = new DirectoryTaxonomyWriter(taxoDir);

          // Adds facets information to a document; can be initialized once per thread
          FacetFields facetFields = new FacetFields(taxoWriter);

          // List of categories to add to the document
          List<CategoryPath> cats = new ArrayList<CategoryPath>();
          cats.add(new CategoryPath("Author", "Erik Hatcher"));
          cats.add(new CategoryPath("Author/Otis Gospodnetić", '/'));
          cats.add(new CategoryPath("Pub Date", "2004", "December", "1"));

          Document bookDoc = new Document();
          bookDoc.add(new TextField("title", "lucene in action", Store.YES));

          // Add the category fields (DocValues, postings)
          facetFields.addFields(bookDoc, cats);

          // Index the document
          indexWriter.addDocument(bookDoc);
  8. 8. Sample Code – Search
          // Open an NRT TaxonomyReader
          TaxonomyReader taxoReader = new DirectoryTaxonomyReader(taxoWriter);

          // Define the facets to aggregate (top-10 categories for each)
          FacetSearchParams fsp = new FacetSearchParams();
          fsp.addFacetRequest(new CountFacetRequest(new CategoryPath("Author"), 10));
          fsp.addFacetRequest(new CountFacetRequest(new CategoryPath("Pub Date"), 10));

          // Collect both top-K facets and top-N matching documents
          // (indexr is the IndexReader of the search index)
          TopDocsCollector tdc = TopScoredDocCollector.create(10, true);
          FacetsCollector fc = FacetsCollector.create(fsp, indexr, taxoReader);
          Query q = new TermQuery(new Term("title", "lucene"));
          searcher.search(q, MultiCollector.wrap(tdc, fc));

          // Traverse the top facets
          for (FacetResult fres : fc.getFacetResults()) {
            FacetResultNode root = fres.getFacetResultNode();
            System.out.println(String.format("%s (%d)", root.label, root.value));
            for (FacetResultNode cat : root.getSubResults()) {
              System.out.println("  " + cat.label.components[0] + " (" + cat.value + ")");
            }
          }
  9. 9. Drilldown and Drill-Sideways
      • Drilldown adds a filter to the search
          – Multiple categories can be OR'd

          // Drilldown: filter results to "Component/core/index"
          // All other "Component/*" and "Component/core/*" get count 0
          Query base = new MatchAllDocsQuery();
          DrillDownQuery ddq = new DrillDownQuery(facetIndexingParams, base);
          ddq.add(new CategoryPath("Component/core/index", '/'));

      • Drill-sideways allows drilldown, yet still aggregates "sideways" categories

          // Drill-sideways: drill down on "Component/core/index"
          // Other "Component/*" and "Component/core/*" are counted too
          DrillSideways ds = new DrillSideways(searcher, taxoReader);
          // The start of this call was lost in the transcript; ds.search with a null "after" argument is assumed
          DrillSidewaysResult sidewaysRes = ds.search(null, ddq, 10, fsp);
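The difference between the two counting modes can be illustrated without Lucene. The following is a minimal sketch, assuming flat, single-valued dimensions and a match-all base query; the class `DrillSidewaysSketch` and its methods are hypothetical and are not part of the Lucene API. With `sideways=false` every drilldown filters the counts; with `sideways=true` the drilldown on the dimension being counted is ignored, so its sibling values keep their counts.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not the Lucene DrillSideways implementation.
final class DrillSidewaysSketch {

    /** Builds a toy document from alternating dimension/value strings. */
    static Map<String, String> doc(String... keyValues) {
        Map<String, String> m = new HashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            m.put(keyValues[i], keyValues[i + 1]);
        }
        return m;
    }

    /**
     * Counts the values of {@code dim}. When {@code sideways} is true, the
     * drilldown on {@code dim} itself is skipped, so sibling values are counted.
     */
    static Map<String, Integer> count(List<Map<String, String>> docs,
                                      Map<String, String> drillDowns,
                                      String dim, boolean sideways) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, String> d : docs) {
            boolean passes = true;
            for (Map.Entry<String, String> dd : drillDowns.entrySet()) {
                boolean skip = sideways && dd.getKey().equals(dim);
                if (!skip && !dd.getValue().equals(d.get(dd.getKey()))) {
                    passes = false;
                    break;
                }
            }
            String value = d.get(dim);
            if (passes && value != null) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = Arrays.asList(
            doc("Component", "core/index", "Status", "open"),
            doc("Component", "core/search", "Status", "open"),
            doc("Component", "facet", "Status", "closed"));
        Map<String, String> dd = doc("Component", "core/index");
        System.out.println(count(docs, dd, "Component", false));
        System.out.println(count(docs, dd, "Component", true));
    }
}
```

In a user interface this is what keeps the other "Component" links clickable after one of them has been selected.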
  10. 10. Dynamic Facets
      • Range facets on NumericDocValues fields
          – Define the buckets of interest at query time
          – Supports any arbitrary ValueSource (Lucene 4.6.0)

          // Aggregate matching documents into buckets
          // (bounds are inclusive unless noted, so 21 and 100 fall into "21-100")
          RangeAccumulator a = new RangeAccumulator(new RangeFacetRequest<LongRange>("field",
              new LongRange("1-5", 1L, true, 5L, true),
              new LongRange("6-20", 6L, true, 20L, true),
              new LongRange("21-100", 21L, true, 100L, true),
              new LongRange("over 100", 100L, false, Long.MAX_VALUE, true)));
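The inclusive/exclusive flags decide which bucket a boundary value lands in, and a value that no range accepts is simply not counted. The sketch below shows that logic in plain Java; `RangeCountSketch` and its nested `Bucket` class are hypothetical stand-ins and only mirror the argument order of the ranges on the slide.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not the Lucene RangeAccumulator.
final class RangeCountSketch {

    /** A labeled numeric range with independently inclusive or exclusive bounds. */
    static final class Bucket {
        final String label;
        final long min;
        final boolean minInclusive;
        final long max;
        final boolean maxInclusive;

        Bucket(String label, long min, boolean minInclusive, long max, boolean maxInclusive) {
            this.label = label;
            this.min = min;
            this.minInclusive = minInclusive;
            this.max = max;
            this.maxInclusive = maxInclusive;
        }

        boolean accept(long v) {
            boolean aboveMin = minInclusive ? v >= min : v > min;
            boolean belowMax = maxInclusive ? v <= max : v < max;
            return aboveMin && belowMax;
        }
    }

    /** Counts how many values each bucket accepts; buckets may overlap. */
    static Map<String, Integer> count(long[] values, Bucket... buckets) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Bucket b : buckets) {
            counts.put(b.label, 0);
        }
        for (long v : values) {
            for (Bucket b : buckets) {
                if (b.accept(v)) {
                    counts.merge(b.label, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] values = {1, 5, 6, 21, 100, 101};
        System.out.println(count(values,
            new Bucket("1-5", 1L, true, 5L, true),
            new Bucket("6-20", 6L, true, 20L, true),
            new Bucket("21-100", 21L, true, 100L, true),
            new Bucket("over 100", 100L, false, Long.MAX_VALUE, true)));
    }
}
```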
  11. 11. Facet Associations
      • Not all facets are created equal
          – Categories added by an automatic categorization system, e.g. Category/Apache Lucene (0.74) (confidence level is 0.74)
          – Important metadata about the facet, e.g. Contracts/US ($5M) (total $$$ generated from contracts)
          – Complex structures, e.g. Users/Shai Erera (lastAccess=YYYY/MM/DD, numUpdates=8…)
      • Categories can have values associated with them per document
          – They are later aggregated by these values
          – NOTE: ≠ NumericDocValuesFields!
      • Facet associations are completely customizable
          – Encoded as a byte[] per document
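Aggregating by association value instead of by count can be sketched as a weighted sum per category. This is a minimal illustration assuming one numeric value per (document, category) pair held in parallel arrays; `AssociationSumSketch` is a hypothetical name, and the real module stores associations as encoded bytes rather than arrays.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not the Lucene associations API.
final class AssociationSumSketch {

    /**
     * categories[doc][i] is the i-th category of a document and
     * values[doc][i] is the value associated with it in that document.
     */
    static Map<String, Double> sum(String[][] categories, double[][] values) {
        Map<String, Double> totals = new TreeMap<>();
        for (int doc = 0; doc < categories.length; doc++) {
            for (int i = 0; i < categories[doc].length; i++) {
                totals.merge(categories[doc][i], values[doc][i], Double::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        String[][] cats = {
            {"Category/Apache Lucene"},
            {"Category/Apache Lucene", "Contracts/US"}
        };
        double[][] vals = {
            {0.5},
            {0.25, 5.0}
        };
        System.out.println(sum(cats, vals));
    }
}
```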
  12. 12. More Features
      • Complements
          – Holds the count of each category in-memory, per IndexReader
          – When the number of search results is >50% of the index, count the "complement set"
          – Useful for "overview" queries, e.g. MatchAllDocsQuery
      • Sampling
          – Aggregate a sampled set of the search results
          – Optionally re-count top-K facets for accurate values
      • Partitions
          – Partition the taxonomy space to control memory usage during faceted search
          – Useful for very big taxonomies (10s of millions of categories)
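The complements idea can be sketched as follows: when more than half of the documents match, it is cheaper to count the categories of the non-matching documents and subtract them from cached totals. This is a toy in-memory sketch; `ComplementCountSketch`, the `docOrdinals` layout and the exact threshold check are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical illustration only; not the Lucene complements implementation.
final class ComplementCountSketch {

    /** Counts category ordinals over the given documents. */
    static int[] countDirect(int[][] docOrdinals, BitSet docs, int numOrdinals) {
        int[] counts = new int[numOrdinals];
        for (int doc = docs.nextSetBit(0); doc >= 0; doc = docs.nextSetBit(doc + 1)) {
            for (int ord : docOrdinals[doc]) {
                counts[ord]++;
            }
        }
        return counts;
    }

    /**
     * Counts categories for the matching documents. If more than 50% of the
     * index matches, counts the complement set and subtracts it from the
     * cached per-index totals instead.
     */
    static int[] count(int[][] docOrdinals, BitSet matches, int[] totalCounts) {
        int numDocs = docOrdinals.length;
        if (matches.cardinality() * 2 <= numDocs) {
            return countDirect(docOrdinals, matches, totalCounts.length);
        }
        BitSet complement = (BitSet) matches.clone();
        complement.flip(0, numDocs);
        int[] missing = countDirect(docOrdinals, complement, totalCounts.length);
        int[] counts = new int[totalCounts.length];
        for (int i = 0; i < counts.length; i++) {
            counts[i] = totalCounts[i] - missing[i];
        }
        return counts;
    }

    public static void main(String[] args) {
        int[][] docOrdinals = {{0, 1}, {0, 2}, {0, 1}, {2}};
        BitSet all = new BitSet();
        all.set(0, docOrdinals.length);
        int[] totals = countDirect(docOrdinals, all, 3);
        BitSet matches = new BitSet();
        matches.set(0, 3); // documents 0..2 match, i.e. 75% of the index
        System.out.println(Arrays.toString(count(docOrdinals, matches, totals)));
    }
}
```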
  13. 13. Lucene Facets Under the Hood
  14. 14. The Taxonomy Index
      • The taxonomy maps categories to integer codes (referred to as ordinals)
          – Kind of like a Map<CategoryPath,Integer>, with hierarchy support
          – Provides taxonomy browsing services
          – DirectoryTaxonomyWriter is managed as a sidecar Lucene index
      • Categories are broken down to their path components, e.g. Date/2012/March/20 becomes:
          – Date, with ordinal=1
          – Date/2012, with ordinal=2
          – Date/2012/March, with ordinal=3
          – Date/2012/March/20, with ordinal=4
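The "map with hierarchy support" description can be made concrete with a small in-memory sketch. `TaxonomySketch` is a hypothetical class, not the DirectoryTaxonomyWriter; it assumes ordinal 0 is reserved for the root and that paths are keyed by a "/"-joined string. Every prefix of a path gets its own ordinal and remembers its parent.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the Lucene taxonomy index.
final class TaxonomySketch {
    private final Map<String, Integer> ordinals = new HashMap<>();
    private final List<Integer> parents = new ArrayList<>(); // parents.get(ordinal) = parent ordinal

    TaxonomySketch() {
        ordinals.put("", 0); // assumption: ordinal 0 is the root
        parents.add(-1);
    }

    /** Adds the path and any missing ancestors; returns the ordinal of the full path. */
    int addCategory(String... components) {
        int parent = 0;
        StringBuilder path = new StringBuilder();
        for (String component : components) {
            if (path.length() > 0) {
                path.append('/');
            }
            path.append(component);
            String key = path.toString();
            Integer ord = ordinals.get(key);
            if (ord == null) {
                ord = parents.size();
                ordinals.put(key, ord);
                parents.add(parent);
            }
            parent = ord;
        }
        return parent;
    }

    int getParent(int ordinal) {
        return parents.get(ordinal);
    }

    int size() {
        return parents.size();
    }

    public static void main(String[] args) {
        TaxonomySketch taxo = new TaxonomySketch();
        System.out.println(taxo.addCategory("Date", "2012", "March", "20"));
        System.out.println(taxo.addCategory("Date", "2012"));
    }
}
```

Adding a category whose ancestors already exist only allocates ordinals for the new components.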
  15. 15. The Search Index
      • Categories are added as drilldown terms, e.g. for Date/2012/March/20:
          – $facets:Date
          – $facets:Date/2012
          – …
      • All category ordinals associated with the document are added as a BinaryDocValuesField
          – The ordinals of all path components are added, not just those of the leaves
          – Encoded as VInt + gap for efficient compression and speed
              • Other compression methods were attempted, but were slower to decode (LUCENE-4609)
          – Used during faceted search to read all the associated ordinals and aggregate accordingly (e.g. count)
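The "VInt + gap" encoding can be sketched as: sort the ordinals, store the difference from the previous ordinal, and write each difference with a variable number of bytes. The byte layout below (7 bits per byte, low-order group first, high bit meaning "more bytes follow") is an assumption chosen for illustration and may differ from the layout of Lucene's own encoder; `DGapVIntSketch` is a hypothetical class.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical illustration only; byte layout is assumed, not taken from Lucene.
final class DGapVIntSketch {

    /** Sorts the ordinals and writes each gap to the previous one as a VInt. */
    static byte[] encode(int[] ordinals) {
        int[] sorted = ordinals.clone();
        Arrays.sort(sorted);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int ord : sorted) {
            int gap = ord - prev;
            prev = ord;
            while ((gap & ~0x7F) != 0) {
                out.write((gap & 0x7F) | 0x80); // 7 payload bits, continuation bit set
                gap >>>= 7;
            }
            out.write(gap); // last byte of this value, continuation bit clear
        }
        return out.toByteArray();
    }

    /** Reverses encode(): reads VInts and accumulates the gaps back into ordinals. */
    static int[] decode(byte[] bytes) {
        int[] result = new int[bytes.length]; // each value takes at least one byte
        int n = 0;
        int prev = 0;
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) != 0) {
                shift += 7;
            } else {
                prev += value;
                result[n++] = prev;
                value = 0;
                shift = 0;
            }
        }
        return Arrays.copyOf(result, n);
    }

    public static void main(String[] args) {
        byte[] encoded = encode(new int[] {4, 1, 2, 3});
        System.out.println(encoded.length + " bytes: " + Arrays.toString(decode(encoded)));
    }
}
```

Because a document carries the ordinals of all its path components, consecutive ordinals are often close together, which is why small gaps and short byte sequences are common.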
  16. 16. SortedSet Facets
      • SortedSetFacetFields add SortedSetDocValuesFields and drilldown terms to documents
      • Local-segment SortedSet ordinals are mapped to global ones through SortedSetDocValuesReaderState
      • Use SortedSetDocValuesAccumulator to accumulate SortedSet facets
      • Advantages:
          – Taxonomy representation requires less RAM (flat taxonomy)
          – No sidecar index
          – Tie-breaks by label-sort order
      • Disadvantages:
          – Not a full taxonomy
          – Overall uses more RAM (local-to-global ordinal mapping)
          – Adds NRT reopen cost
          – Slower than taxonomy-based facets
  17. 17. Global Ordinals
      • Per-segment integer codes (as used by the SortedSet approach) are less efficient
          – Different ordinals for the same categories across segments
          – Hold an in-memory codes map (e.g. local-to-global): more RAM and less scalable
          – Resolve top-K on the String representation of categories: more CPU
      • Global ordinals allow efficient per-segment faceting and aggregation
          – No translation maps required (no extra RAM, highly scalable)
          – Aggregation and top-K computation are done on integer codes
      • But they do not play well with IndexWriter.addIndexes(Directory…)
          – Must use IndexWriter.addIndexes(IndexReader…), so that the ordinals in the input search index are mapped to the destination's
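The local-to-global translation that per-segment ordinals require can be sketched as one int[] per segment. In this toy version each segment is a sorted array of labels whose index is the local ordinal; `OrdinalMapSketch` is a hypothetical class and the in-memory layout is an assumption. The maps it builds are exactly the extra RAM that a shared taxonomy with global ordinals avoids.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Hypothetical illustration only; not SortedSetDocValuesReaderState.
final class OrdinalMapSketch {

    /** Merges the per-segment label dictionaries into one sorted global dictionary. */
    static String[] globalLabels(String[][] segmentLabels) {
        TreeSet<String> all = new TreeSet<>();
        for (String[] segment : segmentLabels) {
            all.addAll(Arrays.asList(segment));
        }
        return all.toArray(new String[0]);
    }

    /** Builds maps where maps[segment][localOrdinal] is the global ordinal. */
    static int[][] buildMaps(String[][] segmentLabels, String[] global) {
        int[][] maps = new int[segmentLabels.length][];
        for (int s = 0; s < segmentLabels.length; s++) {
            maps[s] = new int[segmentLabels[s].length];
            for (int local = 0; local < segmentLabels[s].length; local++) {
                maps[s][local] = Arrays.binarySearch(global, segmentLabels[s][local]);
            }
        }
        return maps;
    }

    public static void main(String[] args) {
        String[][] segments = {
            {"Author/Erik", "Tags/Lucene"},
            {"Author/Otis", "Tags/Lucene"}
        };
        String[] global = globalLabels(segments);
        System.out.println(Arrays.toString(global));
        System.out.println(Arrays.deepToString(buildMaps(segments, global)));
    }
}
```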
  18. 18. Two-Phase Aggregation
      • FacetsCollector works in two steps:
          – Collects matching documents (and optionally their scores)
          – Invokes FacetsAccumulator to accumulate the top-K facets
      • Performance tests show that this improves faceted search (LUCENE-4600)
          – Locality of reference?
      • Useful for Sampling and Complements
          – Hard to do otherwise
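The two steps can be sketched as separate passes: the first only records which documents matched, and the second walks those documents to aggregate counts and pick the top-K ordinals. `TwoPhaseSketch`, the toy `docOrdinals` layout and the tie-break by lower ordinal are assumptions made for illustration, not the FacetsCollector implementation.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Comparator;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical illustration only; not the Lucene FacetsCollector.
final class TwoPhaseSketch {

    /** Phase 1: remember the matching documents, nothing else. */
    static BitSet collect(int numDocs, IntPredicate matches) {
        BitSet hits = new BitSet(numDocs);
        for (int doc = 0; doc < numDocs; doc++) {
            if (matches.test(doc)) {
                hits.set(doc);
            }
        }
        return hits;
    }

    /** Phase 2a: aggregate category counts over the collected documents. */
    static int[] accumulate(BitSet hits, int[][] docOrdinals, int numOrdinals) {
        int[] counts = new int[numOrdinals];
        for (int doc = hits.nextSetBit(0); doc >= 0; doc = hits.nextSetBit(doc + 1)) {
            for (int ord : docOrdinals[doc]) {
                counts[ord]++;
            }
        }
        return counts;
    }

    /** Phase 2b: top-K ordinals by count, ties broken by lower ordinal. */
    static List<Integer> topK(int[] counts, int k) {
        List<Integer> ords = new ArrayList<>();
        for (int ord = 0; ord < counts.length; ord++) {
            if (counts[ord] > 0) {
                ords.add(ord);
            }
        }
        ords.sort(Comparator.comparingInt((Integer o) -> -counts[o]).thenComparingInt(o -> o));
        return new ArrayList<>(ords.subList(0, Math.min(k, ords.size())));
    }

    public static void main(String[] args) {
        int[][] docOrdinals = {{0, 1}, {0, 2}, {0, 1}, {2}};
        BitSet hits = collect(docOrdinals.length, doc -> doc != 3);
        int[] counts = accumulate(hits, docOrdinals, 3);
        System.out.println(topK(counts, 2));
    }
}
```

Keeping the hits as a set between the phases is what makes sampling (aggregate a subset of the hits) and complements (aggregate the non-hits) straightforward.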
  19. 19. FacetIndexingParams
      • Determine how facets are encoded
          – Partition size
          – Facet delimiter character (for drilldown terms, default \u001F)
          – CategoryListParams
      • CategoryListParams holds parameters for a category list
          – Encoder/Decoder (default DGapVInt)
          – OrdinalPolicy (how path components are encoded): ALL_PARENTS, NO_PARENTS and ALL_BUT_DIMENSION (default)
      • CategoryListParams can be used to group facets together
          – Default: all facets are put in the same "category list" (i.e. one BinaryDocValues field)
          – Expert: separate categories by dimension into different category lists
              • Useful when sets of categories are always aggregated together, but not with other categories
      • FacetIndexingParams are currently not recorded per-segment, so be careful if you change them on an existing index!
  20. 20. Questions?