This talk will feature some of my recent research into alternative uses for Solr facets and facet metadata. I will develop the idea that facets can be used to discover similarities between items and attributes in a search index, and show some interesting applications of this idea. The common takeaway is that using facets and facet metadata in non-conventional ways enables the semantic context of a query to be automatically tuned. This has important implications for user-centric and semantically focused relevance.
2. Information Spaces
Use Cases: Search and Discovery
Knowledge Spaces
Asking the right question (knowing what questions to ask)
Navigation and Visualization
Alexa: Who’s on first?
What’s the first baseman’s name?
Relevance - Similarity - Precision - Classification
Vectors and Vector Spaces
Are information spaces like Euclidean or Cartesian spaces?
Knowledge Bases
Lamp Table
Side Table
Table Lamp
3. Facet Synonyms - Spatial Metaphors
Parameters
Dimensions
Navigators
Refiners
Supports the notion of some kind of n-dimensional information
“space”
I call it a meta-informational space
Traditional Uses - Navigation and Visualization
Verity K2
Endeca
Fast ESP
MS Fast
4. Contexts are Viewpoints / Perspectives in Information Space
“The circumstances that form the setting for an event, statement, or idea, and in
terms of which it can be fully understood and assessed.”
Personal Contexts
Who is searching?
What are their roles / interests?
What have they searched for in the past (including just now)?
What are they allowed to search for?
Semantic Contexts
Homonym / Polysemy Problem
“apple”
Tech company, Horticulture, Food, Music, New York City
What is the subject area or domain?
Contexts
5. Facet - Similarity Theorem
Lemma 1: Similar things tend to occur in similar contexts
Lemma 2: Facets are a tool for exploring meta-informational contexts
Theorem: Facets can be used to find similar things.
6. Facets
“A particular aspect or feature of something.”
Facets are Metadata
“Data about data” - attributes, aspects, descriptors, features, properties,
traits
Metadata Semantics: what, where, when, why
name, size, shape, color, material, texture
manufacturer, number of outlets, voltage, is pre-assembled …
address, phone number, birth date, user rating
Metadata Contexts: Some metadata fields depend on “what” the “thing” is,
e.g. People have different attributes than Toaster Ovens
Metadata provides Semantic Mappings
Consist of field name = field value pairings
Map Terms to Concepts
The term ‘red’ is known to be a ‘color’ because it is a
value in the ‘color’ field
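A minimal sketch of the term-to-concept mapping described above, assuming documents are plain dictionaries of field name = field value pairings. The function name `build_term_to_fields` and the sample records are hypothetical and used only for illustration:

```python
from collections import defaultdict


def build_term_to_fields(docs):
    """Map each metadata value (lowercased) to the set of fields it occurs in."""
    term_to_fields = defaultdict(set)
    for doc in docs:
        for field, value in doc.items():
            # Treat single values and multi-valued fields the same way.
            values = value if isinstance(value, (list, tuple, set)) else [value]
            for v in values:
                term_to_fields[str(v).lower()].add(field)
    return dict(term_to_fields)


# Hypothetical sample records.
docs = [
    {"name": "sofa", "color": "red", "material": "leather"},
    {"name": "lamp", "color": "white", "material": "brass"},
]
term_to_fields = build_term_to_fields(docs)
print(term_to_fields["red"])
```

A term that maps to more than one field (for example a value that is both a brand and a color) is where the ambiguity discussed under Semantic Contexts shows up.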
7. Metadata as a Knowledge Base
Faceted Navigation - Top-down or drill-in
Search - More direct or bottom-up
Query Autofiltering:
Uses facet metadata in search collection to determine semantic
meaning of search terms.
Semantic Knowledge is Power
Can use this built-in knowledge to “short-circuit” the “search
then drill in” paradigm
Metadata cardinality and “Boolean in the Vernacular”
Semantic Pattern Rules
$DRUG treats $SYMPTOM vs. $DRUG causes $SYMPTOM
$Accessory for $Product (e.g. “case for iPhone”)
Enables precise bottom-up search
Dependent on metadata quality and completeness
Improving metadata can improve search too
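The Query Autofiltering idea can be sketched roughly as follows. This is not the actual Solr component; it is a simplified illustration that assumes single-token terms and a precomputed term-to-field lookup. The field names are hypothetical; `q` and `fq` are the standard Solr query and filter-query parameters:

```python
def autofilter(query, term_to_field):
    """Split a free-text query into field filters and leftover free text."""
    filters, leftover = [], []
    for token in query.lower().split():
        field = term_to_field.get(token)
        if field:
            # The term is a known metadata value: turn it into a filter.
            filters.append(f'{field}:"{token}"')
        else:
            leftover.append(token)
    # Fall back to match-all when every token became a filter.
    return {"q": " ".join(leftover) or "*:*", "fq": filters}


# Hypothetical lookup, e.g. derived from facet values in the collection.
term_to_field = {"red": "color", "leather": "material", "sofa": "product_type"}
params = autofilter("red leather sofa cheap", term_to_field)
print(params)
```

Multi-term values and pattern rules such as "$Accessory for $Product" would need phrase detection and rule matching on top of this.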
8. Categorical vs. Numerical
Some metadata is non-numerical - i.e. categorical
Similarity in numerical hyper-spaces is modeled as Euclidean space
Numerical Similarity
Search Relevance / Clustering / Learning Algorithms
Use Term Probability Vectors (tf/idf) in unstructured text
Everything must be a number - categorical data is indexed (arbitrary)
Similarity is based on linear or angular closeness of vectors
Detects patterns - which may not be intuitive => black-box models
Facet-based Similarity
Similarity based on shared categorical and numerical ranges
Numerical data are ranged or “binned” to be compatible with category
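A rough sketch of facet-based similarity with binned numerical fields. The overlap measure used here (shared pairs divided by all pairs, i.e. Jaccard) is an assumption chosen for illustration; the slides do not specify a formula. Bin edges, field names, and sample items are hypothetical:

```python
def bin_value(value, edges):
    """Return a range label such as '100-200' for a numeric value."""
    for low, high in zip(edges, edges[1:]):
        if low <= value < high:
            return f"{low}-{high}"
    return "out_of_range"


def facet_pairs(item, numeric_edges):
    """Turn an item into a set of (field, value) pairs, binning numeric fields."""
    pairs = set()
    for field, value in item.items():
        if field in numeric_edges:
            value = bin_value(value, numeric_edges[field])
        pairs.add((field, value))
    return pairs


def facet_similarity(a, b, numeric_edges):
    """Jaccard overlap of the two items' (field, value) pairs (an assumption)."""
    pa, pb = facet_pairs(a, numeric_edges), facet_pairs(b, numeric_edges)
    return len(pa & pb) / len(pa | pb)


edges = {"price": [0, 100, 200, 500]}
a = {"color": "red", "material": "leather", "price": 150}
b = {"color": "red", "material": "fabric", "price": 180}
print(facet_similarity(a, b, edges))  # 0.5: color and price bin shared, material differs
```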
9. Navigating Categorical Spaces
Pivot Facets:
Paths or trajectories through categorical spaces
Multi-Dimensional Query Suggester
Pivot Patterns - Semantically “sensible” permutations of metadata fields
$First_Name $Last_Name $Occupation $City $State
Bob Jones Accountant Cincinnati OH
Multi-dimensional queries and precision
Users expect greater precision in results (i.e. fewer) when they add refining information
to the query
Traditional “bag-of-words” search algorithms often fail to deliver on this expectation
Typeahead solutions should show queries that will “work” - rapid visualization of
available content
Query Autofiltering solution is tailored to this since it can navigate the same
categorical space that the pivot patterns generate
Gotcha: Both solutions depend on accurate and complete metadata at sufficient level
of granularity.
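A small sketch of how a pivot pattern can generate multi-dimensional suggestions that are guaranteed to "work". In Solr the paths would typically come from pivot faceting (the `facet.pivot` parameter); here the walk is simulated over in-memory records so the example stays self-contained. Field names and records are hypothetical:

```python
def pivot_suggestions(records, pattern):
    """Emit every prefix of the pattern's field path that exists in the data."""
    suggestions = set()
    for record in records:
        parts = []
        for field in pattern:
            value = record.get(field)
            if value is None:
                break  # the path stops where metadata is missing
            parts.append(str(value))
            suggestions.add(" ".join(parts))
    return sorted(suggestions)


records = [
    {"first_name": "Bob", "last_name": "Jones", "occupation": "Accountant",
     "city": "Cincinnati", "state": "OH"},
    {"first_name": "Bob", "last_name": "Smith", "occupation": "Plumber"},
]
pattern = ["first_name", "last_name", "occupation", "city", "state"]
for suggestion in pivot_suggestions(records, pattern):
    print(suggestion)
```

The second record illustrates the "Gotcha": with incomplete metadata, the suggestion path is cut short.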
10. Adding Context to Suggester via Facets
Suggestions are validated against content collection
Facets are used to acquire contextual metadata from a content collection while
building a typeahead collection
Use Cases
Security Trimming of Suggestions
If a query only hits on secured documents, do not want to show that query to
users that cannot see any of the documents
Solution: Use facets to get the list of ACLs that are associated with a query
Dynamic boosting of suggestions based on previous searches
Contextual metadata added to typeahead collection boosts similar
suggestions
Solution: In typeahead application retain context metadata for selected queries
and re-send it as boost queries in subsequent typeahead requests
Facet-Similarity Theorem at work
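A sketch of the security-trimming use case, under stated assumptions: the ACL field name `acl_ss` and the sample suggestions are hypothetical. `rows`, `facet`, and `facet.field` are standard Solr request parameters; the trimming itself is shown in memory rather than as a live request:

```python
def acl_facet_params(query, acl_field="acl_ss"):
    """Params for a facet-only request that lists the ACLs a query hits."""
    return {"q": query, "rows": 0, "facet": "true", "facet.field": acl_field}


def acl_filter(user_groups, acl_field="acl_ss"):
    """Filter query keeping only suggestions the user is allowed to see."""
    quoted = " OR ".join(f'"{g}"' for g in sorted(user_groups))
    return f"{acl_field}:({quoted})"


def trim_suggestions(suggestions, user_groups, acl_field="acl_ss"):
    """In-memory equivalent of applying the ACL filter to the typeahead collection."""
    allowed = set(user_groups)
    return [s["text"] for s in suggestions if allowed & set(s.get(acl_field, []))]


# Hypothetical typeahead entries with ACLs gathered at build time via facets.
suggestions = [
    {"text": "merger plan", "acl_ss": ["executives"]},
    {"text": "holiday schedule", "acl_ss": ["everyone"]},
    {"text": "salary bands", "acl_ss": ["hr", "executives"]},
]
visible = trim_suggestions(suggestions, ["everyone", "hr"])
print(visible)
print(acl_filter(["everyone", "hr"]))
```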
11. Building a Suggester with Dynamic Context
Uses Facet Queries against a Content Collection to create additional metadata for the Suggester or Typeahead Collection.
This contextual metadata can then be used for:
• Security Trimming of Typeahead suggestions
• Dynamic boosting of similar suggestions within a user session
12. Building a Suggester with Dynamic Context
Bring back other fields in addition to displayed suggestion text
(i.e., the ones that were calculated using faceting)
If a query is used to search, temporarily store its associated metadata in a
circular cache on the browser.
When submitting the next typeahead query, add the cached information from
the queue as boost queries.
Type ‘j’ - get back
Jai Johnny Johanson Bands
Jai Johnny Johanson Groups
J.J. Johnson
Jai Johnny Johanson
Juke Joint Jezebel
Juke Joint Jimmy
Just searched for ‘Paul McCartney’ then type ‘j’
John Lennon
John Lennon Songs
John Lennon Songs Covered
James P Johnson Songs (?)
John Lennon Originals
Hey Jude
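The circular cache and boost-query step might look roughly like this. The slides describe a browser-side cache; this is a Python stand-in for the same logic. The field names (`suggest_text`, `band`, `genre`) are hypothetical, and `bq` is the boost-query parameter of Solr's dismax/edismax query parsers:

```python
from collections import deque


class ContextCache:
    """Fixed-size cache of metadata from the most recent searches."""

    def __init__(self, size=3):
        # deque with maxlen drops the oldest entry automatically.
        self._recent = deque(maxlen=size)

    def remember(self, metadata):
        self._recent.append(metadata)

    def boost_queries(self):
        """Flatten cached metadata into de-duplicated field:"value" clauses."""
        clauses = []
        for metadata in self._recent:
            for field, values in metadata.items():
                for value in values:
                    clause = f'{field}:"{value}"'
                    if clause not in clauses:
                        clauses.append(clause)
        return clauses


def typeahead_params(prefix, cache):
    """Typeahead request that re-sends cached context as boost queries."""
    return {
        "q": f"suggest_text:{prefix}*",
        "defType": "edismax",
        "bq": cache.boost_queries(),
    }


cache = ContextCache(size=2)
cache.remember({"band": ["The Beatles"], "genre": ["Rock"]})
params = typeahead_params("j", cache)
print(params["bq"])
```

Because the boosts only reorder results, suggestions unrelated to the previous search still appear, just lower in the list.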
13. Structured vs. Unstructured Data
Faceted navigation requires structured data
Search is designed to handle unstructured data
Query Autofiltering enables precise search of structured data without complex Query
Language - Builds structured query from the inherent semantics of a “free text” query
“Who’s In The Who”
Structured Data = has metadata
Real-World - Data is imperfect / incomplete
Generally speaking there is not enough structure
eCommerce - tends to focus on top-down due to ubiquity of faceted navigation
e.g. “semi-structured”
Enterprise Search - document rich
Available metadata is poor in describing “Aboutness”
14. Analyzing Text to Extract Metadata
Search Engine
Analyze text to create “inverted index” and to parse the query
Text Mining
Analyze text to extract entities, concepts, categories
Goal - Improved Metadata Through Text Mining
Case 1:
Extracting product type and product attributes metadata from short product descriptions in
eCommerce data - dealing with precision and recall
Use “directed” NLP techniques to extract precise metadata.
“Coffee Pods for Keurig Coffee Makers”
Case 2:
Large text documents. Want to extract keywords and assign categories to documents.
Add metadata concerning “aboutness”
15. Auto phrasing
Auto Phrasing
- Multi-term phrases that refer to a single entity.
- Uses knowledge from a curated phrase list to determine what is an auto phrase
- Works on tokenized text fields (implemented as a Lucene TokenFilter)
Importance of Noun Phrases
Want “things” to be treated as such
Pre-emptive solution for ambiguities and miss or cross matches down the road
Examples
“data scientist” - not “data” but a person
“garbage collection” - a JVM process - has nothing to do with a search “collection”
“query pipeline” vs “span query” - LW Fusion thing vs Lucene thing
“query” is a noise word in LW blogs corpus, “span query” and “phrase query” are
keywords
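A simplified illustration of auto phrasing on an already tokenized stream. The real implementation is a Lucene TokenFilter; this sketch only mimics the idea with a longest-match scan against a curated phrase list. Joining the phrase with an underscore is an assumption made for readability:

```python
def auto_phrase(tokens, phrases, joiner="_"):
    """Merge runs of tokens that match a curated phrase into a single token."""
    phrase_set = {tuple(p.lower().split()) for p in phrases}
    max_len = max((len(p) for p in phrase_set), default=1)
    out, i = [], 0
    while i < len(tokens):
        # Try the longest possible phrase first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in phrase_set:
                out.append(joiner.join(candidate))
                i += n
                break
        else:
            # No phrase starts here: keep the token as-is.
            out.append(tokens[i])
            i += 1
    return out


phrases = ["data scientist", "garbage collection", "span query"]
tokens = "the data scientist tuned garbage collection".split()
print(auto_phrase(tokens, phrases))
```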
16. Keyword Clustering using Facets
Information Theory
Keywords have high “Entropy” - meaning that their distribution is not uniform within a
collection of documents, but tends to be localized to documents about a related topic.
Keywords and Topics
Keywords are rare within a document corpus but common within a subset of
documents on the same subject area or topic
Keywords used in the same subject domain will be clustered or co-located
Application of Facet-Similarity Theorem
Use facets on unstructured text to find terms that are co-located by computing
simple facet ratios for positive and negative queries
Keyword clusters can then be used for topic mapping
17. Data Mining with Facets
Method:
• Tokenize text with auto phrasing, stop words and synonyms
- store tokens in a multi-valued field with DocValues
- (yes you can facet on a text field but it tends to hit a wall - 2M word limit
on facet values)
• Using the /terms handler, get each term in the text field.
• For each term, submit two queries
- one with text_field:[term] (positive Q)
- one with -text_field:[term] (negative Q)
• For each facet value (other terms) calculate the following ratio:
  ratio = (Facet Counts (Positive Q) / Total Counts (Positive Q)) / (Facet Counts (Negative Q) / Total Counts (Negative Q))
• Take the X log(X) of this ratio (for better discrimination)
- for each term, take the best related terms above some threshold
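A minimal sketch of the scoring step, assuming the ratio is the facet fraction for the positive query divided by the facet fraction for the negative query. Flooring the negative facet count at 1 to avoid division by zero is an assumption added here, not something stated in the method. The counts and terms in the example are made up:

```python
import math


def relatedness(pos_facet, pos_total, neg_facet, neg_total, min_neg_count=1):
    """X * log(X), where X = (pos_facet/pos_total) / (neg_facet/neg_total)."""
    if pos_total == 0 or neg_total == 0 or pos_facet == 0:
        return 0.0
    pos_rate = pos_facet / pos_total
    # Assumption: floor the negative count so rare terms do not divide by zero.
    neg_rate = max(neg_facet, min_neg_count) / neg_total
    x = pos_rate / neg_rate
    return x * math.log(x)


def related_terms(counts, pos_total, neg_total, threshold):
    """Keep facet terms whose score clears the threshold, best first."""
    scored = {
        term: relatedness(pos, pos_total, neg, neg_total)
        for term, (pos, neg) in counts.items()
    }
    keep = [(t, s) for t, s in scored.items() if s > threshold]
    return sorted(keep, key=lambda pair: pair[1], reverse=True)


# Hypothetical facet counts: term -> (count in positive Q, count in negative Q).
counts = {"phrase_query": (40, 10), "the": (5, 100)}
score = relatedness(40, 50, 10, 1000)
print(related_terms(counts, pos_total=50, neg_total=1000, threshold=1.0))
```

A term that is equally common inside and outside the positive set gets X = 1 and therefore a score of 0, which is why common words drop out.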
23. Topic Mapping Approach
Keyword coverage is more important than density
Many-to-Many mapping
Documents may cover more than one topic
A given keyword may occur in more than one topic area
“Democratic” process
Keywords are “evidence” for a Topic - “aboutness” is cumulative
Enables documents to be mapped to multiple topics - which gives information on topic
relatedness
Simple threshold to determine topic membership
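One way the topic mapping could be sketched, assuming "coverage" means the fraction of a topic's distinct keywords that appear in a document, regardless of how often each appears. The threshold value, topic names, and keyword lists are hypothetical:

```python
def map_topics(doc_tokens, topic_keywords, threshold=0.5):
    """Assign a document to every topic whose keyword coverage clears the threshold."""
    present = set(doc_tokens)  # repeats are ignored: coverage, not density
    scores = {}
    for topic, keywords in topic_keywords.items():
        if not keywords:
            continue
        coverage = len(present & set(keywords)) / len(set(keywords))
        if coverage >= threshold:
            scores[topic] = coverage
    return scores


# Hypothetical keyword clusters, e.g. produced by the facet-ratio method.
topic_keywords = {
    "search": ["span_query", "phrase_query", "inverted_index", "facet"],
    "jvm": ["garbage_collection", "heap", "thread"],
}
doc = ["span_query", "facet", "facet", "facet", "garbage_collection", "heap"]
scores = map_topics(doc, topic_keywords)
print(scores)
```

Documents that land in more than one topic give the co-occurrence signal for topic relatedness mentioned above.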