Faceted Search with Lucene
Shai Erera
Researcher, IBM
Who Am I
• Working at IBM – Information Retrieval Research
• Lucene/Solr committer and PMC member
• http://shaierera.blogspot.com
• shaie@apache.org
Lucene Facets 101
Faceted Search
• Technique for accessing documents that were classified into a taxonomy of categories
  – Flat: Author/John Doe, Tags/Lucene, Popularity/High
  – Hierarchical: Computers/Software/Information Retrieval/Fulltext/Apache Lucene (ODP)
• Quick overview of the breakdown of the search results
  – How many documents are in category Committed Paths/lucene/core vs. Committed Paths/lucene/facet
• Simplifies interaction with the search application
  – Drilldown to issues that were updated in Past 2 days by clicking a link
  – No knowledge required about search syntax and index schema
http://jirasearch.mikemccandless.com
Lucene Facets
• Contributed by IBM in 2011, released in 3.4.0
• Major changes since 4.1.0+
  – NRT support
  – Nearly 400% search speedups
  – Complete API revamp
  – New features (SortedSet, range faceting, drill-sideways)
• Two main indexing-time modes
  – Taxonomy-based: hierarchical facets, managed by a sidecar index, low NRT reopen cost
  – SortedSetDocValues: flat facets only, no sidecar index, higher NRT reopen cost
• Runtime modes
  – Range facets (on NumericDocValues fields)
• Other implementations: Solr, ElasticSearch, Bobo Browse
Lucene Facet Components
• TaxonomyWriter/Reader
  – Manage the taxonomy information
• FacetFields
  – Add facets information to documents (DocValues fields, drilldown terms)
• FacetRequest
  – Defines which facets to aggregate and the FacetsAggregator (aggregation function)
• FacetsCollector
  – Collects matching documents and computes the top-K categories for each facet request (invokes FacetsAccumulator)
• DrillDownQuery / DrillSideways
  – Execute drilldown and drill-sideways requests
Sample Code – Indexing
// Builds the taxonomy as documents are indexed, multi-threaded, single instance
TaxonomyWriter taxoWriter = new DirectoryTaxonomyWriter(taxoDir);
// Adds facets information to a document, can be initialized once per thread
FacetFields facetFields = new FacetFields(taxoWriter);
// List of categories to add to the document
List<CategoryPath> cats = new ArrayList<CategoryPath>();
cats.add(new CategoryPath("Author", "Erik Hatcher"));
cats.add(new CategoryPath("Author/Otis Gospodnetić", '/'));
cats.add(new CategoryPath("Pub Date", "2004", "December", "1"));
Document bookDoc = new Document();
bookDoc.add(new TextField("title", "lucene in action", Store.YES));
// add categories fields (DocValues, Postings)
facetFields.addFields(bookDoc, cats);
// index the document
indexWriter.addDocument(bookDoc);
Sample Code – Search
// Open an NRT TaxonomyReader
TaxonomyReader taxoReader = new DirectoryTaxonomyReader(taxoWriter);
// Define the facets to aggregate (top-10 categories for each)
FacetSearchParams fsp = new FacetSearchParams();
fsp.addFacetRequest(new CountFacetRequest(new CategoryPath("Author"), 10));
fsp.addFacetRequest(new CountFacetRequest(new CategoryPath("Pub Date"), 10));
// Collect both top-K facets and top-N matching documents
TopScoreDocCollector tdc = TopScoreDocCollector.create(10, true);
FacetsCollector fc = FacetsCollector.create(fsp, searcher.getIndexReader(), taxoReader);
Query q = new TermQuery(new Term("title", "lucene"));
searcher.search(q, MultiCollector.wrap(tdc, fc));
// Traverse the top facets
for (FacetResult fres : fc.getFacetResults()) {
    FacetResultNode root = fres.getFacetResultNode();
    System.out.println(root.label + " (" + root.value + ")");
    for (FacetResultNode cat : root.getSubResults()) {
        // components[0] is the dimension; components[1] is the category under it
        System.out.println("  " + cat.label.components[1] + " (" + cat.value + ")");
    }
}
Drilldown and Drill-Sideways
• Drilldown adds a filter to the search
  – Multiple categories can be OR’d

// Drilldown – filter results to "Component/core/index";
// All other "Component/*" and "Component/core/*" get count 0
Query base = new MatchAllDocsQuery();
DrillDownQuery ddq = new DrillDownQuery(facetIndexingParams, base);
ddq.add(new CategoryPath("Component/core/index", '/'));

• Drill-sideways allows drilldown, yet still aggregates “sideways” categories

// Drill-Sideways – drilldown on "Component/core/index";
// Other "Component/*" and "Component/core/*" are counted too
DrillSideways ds = new DrillSideways(searcher, taxoReader);
DrillSidewaysResult sidewaysRes = ds.search(null, ddq, 10, fsp);
http://blog.mikemccandless.com/2013/02/drill-sideways-faceting-with-lucene.html
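The difference between the two modes can be illustrated without Lucene at all. The following is a minimal sketch, assuming a toy model where each document carries one value per dimension and a drilldown is an exact match on that value; the class `DrillSidewaysSketch` and its methods are hypothetical names, not part of the Lucene API, and zero-count categories are simply omitted.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model (not Lucene): each document maps a dimension to a single category value. */
class DrillSidewaysSketch {

    /**
     * Counts the values of countDim over documents that pass the drilldown filters.
     * When sideways is true, the filter on countDim itself is skipped, so siblings
     * of the selected category still receive counts.
     */
    static Map<String, Integer> count(List<Map<String, String>> docs,
                                      Map<String, String> drilldowns,
                                      String countDim,
                                      boolean sideways) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, String> doc : docs) {
            boolean matches = true;
            for (Map.Entry<String, String> filter : drilldowns.entrySet()) {
                if (sideways && filter.getKey().equals(countDim)) {
                    continue; // ignore the drilldown on the dimension being counted
                }
                if (!filter.getValue().equals(doc.get(filter.getKey()))) {
                    matches = false;
                    break;
                }
            }
            String value = doc.get(countDim);
            if (matches && value != null) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Builds a document (or a filter set) from alternating dimension/value pairs. */
    static Map<String, String> doc(String... pairs) {
        Map<String, String> d = new HashMap<>();
        for (int i = 0; i < pairs.length; i += 2) {
            d.put(pairs[i], pairs[i + 1]);
        }
        return d;
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = new ArrayList<>();
        docs.add(doc("Component", "core/index", "Status", "Open"));
        docs.add(doc("Component", "core/search", "Status", "Open"));
        docs.add(doc("Component", "facet", "Status", "Closed"));
        docs.add(doc("Component", "core/index", "Status", "Closed"));

        Map<String, String> drilldowns = doc("Component", "core/index");
        // Plain drilldown: only the selected category is counted
        System.out.println(count(docs, drilldowns, "Component", false)); // {core/index=2}
        // Drill-sideways: sibling categories keep their counts
        System.out.println(count(docs, drilldowns, "Component", true));  // {core/index=2, core/search=1, facet=1}
    }
}
```

With several drilldowns active, each dimension is counted against the filters of all the other dimensions, which is what lets a UI keep showing alternatives next to the selected value.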
Dynamic Facets
• Range facets on NumericDocValues fields
  – Define the buckets of interest at query time
  – Supports any arbitrary ValueSource (Lucene 4.6.0)

// Aggregate matching documents into buckets
// (bounds are inclusive unless noted, so that each range matches its label)
RangeAccumulator a = new RangeAccumulator(new
    RangeFacetRequest<LongRange>("field",
        new LongRange("1-5", 1L, true, 5L, true),
        new LongRange("6-20", 6L, true, 20L, true),
        new LongRange("21-100", 21L, true, 100L, true),
        new LongRange("over 100", 100L, false, Long.MAX_VALUE, true)));
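What range faceting computes can be sketched in plain Java: every matching document's numeric value is tested against each bucket defined at query time. This is a minimal illustration only; `RangeCountSketch` and `Bucket` are hypothetical stand-ins and not Lucene's `RangeAccumulator` or `LongRange`.

```java
/** Toy range faceting (not Lucene): counts values into buckets defined at query time. */
class RangeCountSketch {

    /** Labeled numeric range with configurable inclusive/exclusive bounds. */
    static final class Bucket {
        final String label;
        final long min;
        final boolean minInclusive;
        final long max;
        final boolean maxInclusive;

        Bucket(String label, long min, boolean minInclusive, long max, boolean maxInclusive) {
            this.label = label;
            this.min = min;
            this.minInclusive = minInclusive;
            this.max = max;
            this.maxInclusive = maxInclusive;
        }

        boolean accept(long v) {
            boolean aboveMin = minInclusive ? v >= min : v > min;
            boolean belowMax = maxInclusive ? v <= max : v < max;
            return aboveMin && belowMax;
        }
    }

    /** Returns one count per bucket; a value may fall into several overlapping buckets. */
    static int[] count(long[] values, Bucket... buckets) {
        int[] counts = new int[buckets.length];
        for (long v : values) {
            for (int i = 0; i < buckets.length; i++) {
                if (buckets[i].accept(v)) {
                    counts[i]++;
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Bucket[] buckets = {
            new Bucket("1-5", 1L, true, 5L, true),
            new Bucket("6-20", 6L, true, 20L, true),
            new Bucket("21-100", 21L, true, 100L, true),
            new Bucket("over 100", 100L, false, Long.MAX_VALUE, true),
        };
        long[] values = {0, 1, 5, 6, 20, 21, 100, 101};
        int[] counts = count(values, buckets);
        for (int i = 0; i < buckets.length; i++) {
            System.out.println(buckets[i].label + " (" + counts[i] + ")");
        }
    }
}
```

Because the buckets are only an argument to the counting step, nothing about them has to be known at indexing time, which is what makes these facets "dynamic".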
Facet Associations
• Not all facets created equal
  – Categories added by an automatic categorization system, e.g. Category/Apache Lucene (0.74) (confidence level is 0.74)
  – Important metadata about the facet, e.g. Contracts/US ($5M) (total $$$ generated from contracts)
  – Complex structures, e.g. Users/Shai Erera (lastAccess=YYYY/MM/DD, numUpdates=8…)
• Categories can have values associated with them per document
  – They are later aggregated by these values
  – NOTE: ≠ NumericDocValuesFields!
• Facet associations are completely customizable – encoded as a byte[] per document
http://shaierera.blogspot.com/2013/01/facet-associations.html
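Aggregating by association values rather than by document count can be sketched as follows. This is a toy model under stated assumptions: associations are plain `double` values held in memory, and `AssociationSketch` is a hypothetical class; the byte[] encoding and the Lucene association classes are not modeled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model (not Lucene): sums per-document association values for each category. */
class AssociationSketch {

    /** One (category, value) association attached to a document. */
    static final class Association {
        final String category;
        final double value;

        Association(String category, double value) {
            this.category = category;
            this.value = value;
        }
    }

    /** Each inner list holds the associations of one matching document. */
    static Map<String, Double> sumAssociations(List<List<Association>> docs) {
        Map<String, Double> sums = new TreeMap<>();
        for (List<Association> doc : docs) {
            for (Association a : doc) {
                sums.merge(a.category, a.value, Double::sum);
            }
        }
        return sums;
    }

    public static void main(String[] args) {
        List<List<Association>> docs = new ArrayList<>();
        docs.add(Arrays.asList(new Association("Contracts/US", 5.0)));
        docs.add(Arrays.asList(new Association("Contracts/US", 2.5),
                               new Association("Contracts/UK", 1.0)));
        // Counting would report Contracts/US (2); summing reports the associated totals
        System.out.println(sumAssociations(docs)); // {Contracts/UK=1.0, Contracts/US=7.5}
    }
}
```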
More Features
• Complements
  – Holds the count of each category in-memory, per IndexReader
  – When number of search results is >50% of the index, count the “complement set”
  – Useful for “overview” queries, e.g. MatchAllDocsQuery
• Sampling
  – Aggregate a sampled set of the search results
  – Optionally re-count top-K facets for accurate values
• Partitions
  – Partition the taxonomy space to control memory usage during faceted search
  – Useful for very big taxonomies (10s of millions of categories)
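The complements idea can be sketched in a few lines: when most of the index matches, it is cheaper to visit the documents that did not match and subtract their categories from cached totals. The sketch below is a toy model, assuming documents are arrays of category ordinals held in memory; `ComplementSketch` is a hypothetical name and does not mirror Lucene's implementation.

```java
/** Toy complement counting (not Lucene): subtract non-matching docs from cached totals. */
class ComplementSketch {

    /** Counts every category over the whole index; computed once and cached per reader. */
    static int[] totalCounts(int[][] docOrdinals, int numOrdinals) {
        int[] totals = new int[numOrdinals];
        for (int[] ordinals : docOrdinals) {
            for (int ord : ordinals) {
                totals[ord]++;
            }
        }
        return totals;
    }

    /** Counts categories of matching docs, visiting whichever set of docs is smaller. */
    static int[] count(int[][] docOrdinals, boolean[] matches, int[] totals) {
        int numMatches = 0;
        for (boolean m : matches) {
            if (m) {
                numMatches++;
            }
        }
        boolean useComplement = numMatches * 2 > matches.length; // more than 50% of the index

        int[] counts = useComplement ? totals.clone() : new int[totals.length];
        for (int doc = 0; doc < matches.length; doc++) {
            if (useComplement && !matches[doc]) {
                for (int ord : docOrdinals[doc]) {
                    counts[ord]--; // remove the contribution of a non-matching doc
                }
            } else if (!useComplement && matches[doc]) {
                for (int ord : docOrdinals[doc]) {
                    counts[ord]++; // regular counting
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int[][] docOrdinals = {{0, 1}, {0, 2}, {1}, {0}};
        int[] totals = totalCounts(docOrdinals, 3);
        // 3 of 4 docs match, so only the single non-matching doc is visited
        int[] counts = count(docOrdinals, new boolean[] {true, true, true, false}, totals);
        System.out.println(java.util.Arrays.toString(counts)); // [2, 2, 1]
    }
}
```

For a MatchAllDocsQuery the complement set is empty, so the cached totals are returned without visiting any document.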
Lucene Facets Under the Hood
The Taxonomy Index
• The taxonomy maps categories to integer codes (referred to as ordinals)
  – Kind of like a Map<CategoryPath,Integer>, with hierarchy support
  – Provides taxonomy browsing services
  – DirectoryTaxonomyWriter is managed as a sidecar Lucene index
• Categories are broken down to their path components, e.g. Date/2012/March/20 becomes:
  – Date, with ordinal=1
  – Date/2012, with ordinal=2
  – Date/2012/March, with ordinal=3
  – Date/2012/March/20, with ordinal=4
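The "Map with hierarchy support" analogy can be made concrete with a small in-memory sketch. It assumes ordinal 0 is reserved for the root and that path components contain no '/' character; `ToyTaxonomy` is a hypothetical class and ignores everything the real sidecar index does (persistence, concurrency, NRT reopen).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy in-memory taxonomy (not Lucene): maps category paths to ordinals, parents first. */
class ToyTaxonomy {
    private final Map<String, Integer> ordinals = new HashMap<>();
    private final List<Integer> parents = new ArrayList<>(); // parents.get(ord) = parent ordinal

    ToyTaxonomy() {
        ordinals.put("", 0); // ordinal 0 is the root
        parents.add(-1);     // the root has no parent
    }

    /** Adds the category and all of its ancestors; returns the ordinal of the full path. */
    int addCategory(String... components) {
        int parent = 0;
        StringBuilder path = new StringBuilder();
        for (String component : components) {
            if (path.length() > 0) {
                path.append('/');
            }
            path.append(component);
            String key = path.toString();
            Integer ord = ordinals.get(key);
            if (ord == null) {
                ord = ordinals.size(); // next free ordinal
                ordinals.put(key, ord);
                parents.add(parent);
            }
            parent = ord;
        }
        return parent;
    }

    /** Returns the ordinal of a '/'-joined path, or -1 if the category is unknown. */
    int getOrdinal(String path) {
        Integer ord = ordinals.get(path);
        return ord == null ? -1 : ord;
    }

    int getParent(int ordinal) {
        return parents.get(ordinal);
    }

    public static void main(String[] args) {
        ToyTaxonomy taxo = new ToyTaxonomy();
        System.out.println(taxo.addCategory("Date", "2012", "March", "20")); // 4
        System.out.println(taxo.getOrdinal("Date/2012"));                    // 2
        System.out.println(taxo.getParent(4));                               // 3
    }
}
```

Adding Date/2012/April afterwards would reuse ordinals 1 and 2 for the shared ancestors and allocate only one new ordinal.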
The Search Index
• Categories are added as drilldown terms, e.g. for Date/2012/March/20:
  – $facets:Date
  – $facets:Date/2012
  – …
• All category ordinals associated with the document are added as a BinaryDocValuesField
  – All path components’ ordinals are added, not just the leaves’
  – Encoded as VInt + gap for efficient compression and speed
    • Other compression methods attempted, but were slower to decode (LUCENE-4609)
  – Used during faceted search to read all the associated ordinals and aggregate accordingly (e.g. count)
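The "VInt + gap" encoding can be sketched as follows: ordinals are sorted, each is replaced by its difference from the previous one, and every gap is written as a variable-length integer. This is a minimal illustration that assumes a low-order-bits-first VInt layout; the exact byte layout of Lucene's own encoder may differ, and `DGapVIntSketch` is a hypothetical name.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy d-gap + VInt codec (not Lucene's encoder) for a document's category ordinals. */
class DGapVIntSketch {

    /** Sorts the ordinals, then writes each gap as 7-bit groups with a continuation bit. */
    static byte[] encode(int[] ordinals) {
        int[] sorted = ordinals.clone();
        Arrays.sort(sorted);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int ord : sorted) {
            int gap = ord - prev; // small gaps need fewer bytes than raw ordinals
            prev = ord;
            while ((gap & ~0x7F) != 0) {
                out.write((gap & 0x7F) | 0x80); // high bit set: more bytes follow
                gap >>>= 7;
            }
            out.write(gap);
        }
        return out.toByteArray();
    }

    /** Reverses encode(): reads VInts and accumulates gaps back into ordinals. */
    static List<Integer> decode(byte[] bytes) {
        List<Integer> ordinals = new ArrayList<>();
        int prev = 0;
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) != 0) {
                shift += 7; // continuation bit: keep reading this gap
            } else {
                prev += value;
                ordinals.add(prev);
                value = 0;
                shift = 0;
            }
        }
        return ordinals;
    }

    public static void main(String[] args) {
        byte[] encoded = encode(new int[] {300, 4, 1, 3, 2});
        System.out.println(encoded.length);  // 6
        System.out.println(decode(encoded)); // [1, 2, 3, 4, 300]
    }
}
```

Five ordinals that would take 20 bytes as plain ints fit in 6 bytes here, and decoding is a single sequential pass, which matters because the ordinals are read for every matching document.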
SortedSet Facets
• SortedSetFacetFields add SortedSetDocValuesFields and drilldown terms to documents
• Local-segment SortedSet ordinals are mapped to global ones through SortedSetDocValuesReaderState
• Use SortedSetDocValuesAccumulator to accumulate SortedSet facets
• Advantages:
  – Taxonomy representation requires less RAM (flat taxonomy)
  – No sidecar index
  – Tie-breaks by label-sort order
• Disadvantages:
  – Not full taxonomy
  – Overall uses more RAM (local-to-global ordinal mapping)
  – Adds NRT reopen cost
  – Slower than taxonomy-based facets
Global Ordinals
• Per-segment integer codes (as used by the SortedSet approach) are less efficient
  – Different ordinals for same categories across segments
  – Hold in-memory codes map (e.g. local-to-global) – more RAM and less scalable
  – Resolve top-K on the String representation of categories – more CPU
• Global ordinals allow efficient per-segment faceting and aggregation
  – No translation maps required (no extra RAM, highly scalable)
  – Aggregation, top-K computation done on integer codes
• But, do not play well with IndexWriter.addIndexes(Directory…)
  – Must use IndexWriter.addIndexes(IndexReader…), so that the ordinals in the input search index are mapped to the destination’s
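The cost of per-segment codes can be seen in a small sketch of the local-to-global translation they require. It assumes each segment exposes its categories as a sorted label array whose index is the local ordinal; `OrdinalMapSketch` is a hypothetical class and does not reproduce SortedSetDocValuesReaderState.

```java
import java.util.Arrays;
import java.util.TreeSet;

/** Toy local-to-global ordinal mapping (not Lucene) for per-segment category codes. */
class OrdinalMapSketch {

    /** Merges all segment dictionaries into one sorted global dictionary. */
    static String[] globalLabels(String[][] segmentLabels) {
        TreeSet<String> all = new TreeSet<>();
        for (String[] labels : segmentLabels) {
            all.addAll(Arrays.asList(labels));
        }
        return all.toArray(new String[0]);
    }

    /** maps[segment][localOrd] = global ordinal; this is the extra RAM held per reader. */
    static int[][] localToGlobal(String[][] segmentLabels, String[] global) {
        int[][] maps = new int[segmentLabels.length][];
        for (int seg = 0; seg < segmentLabels.length; seg++) {
            maps[seg] = new int[segmentLabels[seg].length];
            for (int local = 0; local < segmentLabels[seg].length; local++) {
                maps[seg][local] = Arrays.binarySearch(global, segmentLabels[seg][local]);
            }
        }
        return maps;
    }

    /** Folds per-segment counts (indexed by local ordinal) into global counts. */
    static int[] aggregate(int[][] perSegmentCounts, int[][] maps, int numGlobal) {
        int[] counts = new int[numGlobal];
        for (int seg = 0; seg < perSegmentCounts.length; seg++) {
            for (int local = 0; local < perSegmentCounts[seg].length; local++) {
                counts[maps[seg][local]] += perSegmentCounts[seg][local];
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[][] segments = {
            {"Author/Bob", "Author/Erik"},   // segment 0: Bob=0, Erik=1
            {"Author/Alice", "Author/Erik"}, // segment 1: Alice=0, Erik=1
        };
        String[] global = globalLabels(segments);
        int[][] maps = localToGlobal(segments, global);
        System.out.println(Arrays.deepToString(maps)); // [[1, 2], [0, 2]]
        System.out.println(Arrays.toString(aggregate(new int[][] {{3, 1}, {2, 4}}, maps, global.length))); // [2, 3, 5]
    }
}
```

The maps must be rebuilt whenever the set of segments changes, which is one way to read the NRT reopen cost mentioned for the SortedSet approach; with taxonomy ordinals, the counts array is indexed directly and no such maps exist.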
Two-Phase Aggregation
• FacetsCollector works in two steps:
  – Collects matching documents (and optionally their scores)
  – Invokes FacetsAccumulator to accumulate the top-K facets
• Performance tests show that this improves faceted search (LUCENE-4600)
  – Locality of reference?
• Useful for Sampling and Complements
  – Hard to do otherwise
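The two steps can be sketched independently of Lucene: first record which documents match, then make a separate pass that reads their ordinals and picks the top-K. The model below is an assumption-laden toy (documents are in-memory ordinal arrays, the query is a predicate on doc IDs) and `TwoPhaseSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.function.IntPredicate;

/** Toy two-phase faceting (not Lucene): collect matching docs first, aggregate afterwards. */
class TwoPhaseSketch {

    /** Phase 1: only remember which documents matched. */
    static BitSet collect(int numDocs, IntPredicate query) {
        BitSet matching = new BitSet(numDocs);
        for (int doc = 0; doc < numDocs; doc++) {
            if (query.test(doc)) {
                matching.set(doc);
            }
        }
        return matching;
    }

    /** Phase 2: one sequential pass over the matching docs' ordinals. */
    static int[] aggregate(BitSet matching, int[][] docOrdinals, int numOrdinals) {
        int[] counts = new int[numOrdinals];
        for (int doc = matching.nextSetBit(0); doc >= 0; doc = matching.nextSetBit(doc + 1)) {
            for (int ord : docOrdinals[doc]) {
                counts[ord]++;
            }
        }
        return counts;
    }

    /** Top-K ordinals by count, ties broken by lower ordinal; zero counts are skipped. */
    static List<Integer> topK(int[] counts, int k) {
        List<Integer> ords = new ArrayList<>();
        for (int ord = 0; ord < counts.length; ord++) {
            if (counts[ord] > 0) {
                ords.add(ord);
            }
        }
        ords.sort((a, b) -> counts[a] != counts[b]
                ? Integer.compare(counts[b], counts[a])
                : Integer.compare(a, b));
        return new ArrayList<>(ords.subList(0, Math.min(k, ords.size())));
    }

    public static void main(String[] args) {
        int[][] docOrdinals = {{0, 1}, {1, 2}, {1}, {2}};
        BitSet matching = collect(docOrdinals.length, doc -> doc != 3);
        int[] counts = aggregate(matching, docOrdinals, 3);
        System.out.println(topK(counts, 2)); // [1, 0]
    }
}
```

Because the full set of matches is known before aggregation starts, the second phase can decide to sample it or to count its complement instead, which would be awkward if counting happened inside the collection loop.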
FacetIndexingParams
• Determine how facets are encoded
  – Partition size
  – Facet delimiter character (for drilldown terms, default \u001F)
  – CategoryListParams
• CategoryListParams holds parameters for a category list
  – Encoder/Decoder (default DGapVInt)
  – OrdinalPolicy (how path components are encoded): ALL_PARENTS, NO_PARENTS and ALL_BUT_DIMENSION (default)
• CategoryListParams can be used to group facets together
  – Default: all facets are put in the same “category list” (i.e. one BinaryDocValues field)
  – Expert: separate categories by dimension into different category lists
    • Useful when sets of categories are always aggregated together, but not with other categories
• FacetIndexingParams are currently not recorded per-segment and therefore you should be careful if you suddenly change them!
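The three OrdinalPolicy values differ only in which ordinals along a category's path get written for a document. The sketch below illustrates that choice; `OrdinalPolicySketch` and its `Policy` enum are hypothetical stand-ins, and the handling of a single-component path under ALL_BUT_DIMENSION is an assumption of this example.

```java
import java.util.Arrays;

/** Toy illustration (not Lucene) of which path ordinals each policy would encode. */
class OrdinalPolicySketch {

    enum Policy { ALL_PARENTS, NO_PARENTS, ALL_BUT_DIMENSION }

    /**
     * pathOrdinals lists the ordinals from the dimension down to the leaf,
     * e.g. {1, 2, 3, 4} for Date, Date/2012, Date/2012/March, Date/2012/March/20.
     */
    static int[] ordinalsToEncode(int[] pathOrdinals, Policy policy) {
        int n = pathOrdinals.length;
        if (n == 0) {
            throw new IllegalArgumentException("empty category path");
        }
        switch (policy) {
            case ALL_PARENTS:
                return pathOrdinals.clone();                     // every path component
            case NO_PARENTS:
                return Arrays.copyOfRange(pathOrdinals, n - 1, n); // leaf only
            case ALL_BUT_DIMENSION:
                // Assumption: a path that is just the dimension keeps its only ordinal
                return n > 1 ? Arrays.copyOfRange(pathOrdinals, 1, n) : pathOrdinals.clone();
            default:
                throw new IllegalArgumentException("unknown policy: " + policy);
        }
    }

    public static void main(String[] args) {
        int[] path = {1, 2, 3, 4};
        for (Policy p : Policy.values()) {
            System.out.println(p + " -> " + Arrays.toString(ordinalsToEncode(path, p)));
        }
    }
}
```

Fewer encoded ordinals mean less data to read per document, at the price of recovering parent counts some other way, which is also why documents indexed under one policy cannot simply be mixed with documents indexed under another.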
Questions?
