Haystacks slides

Facets and Similarity
Exploring the Meta-Informational Hyperspace
Ted Sullivan
Lucidworks, Inc.

Information Spaces
Use Cases: Search and Discovery
Knowledge Spaces
Asking the right question (knowing what questions to ask)
Navigation and Visualization
Alexa: Who’s on first?
What’s the first baseman’s name?
Relevance - Similarity - Precision - Classification
Vectors and Vector Spaces
Are information spaces like Euclidean or Cartesian spaces?
Knowledge Bases
Lamp Table
Side Table
Table Lamp

Facet Synonyms - Spatial Metaphors
Parameters
Dimensions
Navigators
Refiners
Supports the notion of some kind of n-dimensional information
“space”
I call it a meta-informational space
Traditional Uses - Navigation and Visualization
Verity K2
Endeca
Fast ESP
MS Fast

Contexts are Viewpoints / Perspectives in Information Space
“The circumstances that form the setting for an event, statement, or idea, and in
terms of which it can be fully understood and assessed.”
Personal Contexts
Who is searching?

What are their roles / interests?

What have they searched for in the past (including just now)?

What are they allowed to search for?
Semantic Contexts
Homonym / Polysemy Problem

“apple”

Tech company, Horticulture, Food, Music, New York City

What is the subject area or domain?
Contexts

Facet - Similarity Theorem
Lemma 1: Similar things tend to occur in similar contexts
Lemma 2: Facets are a tool for exploring meta-informational contexts
Theorem: Facets can be used to find similar things.

Facets
“A particular aspect or feature of something.”
Facets are Metadata
“Data about data” - attributes, aspects, descriptors, features, properties,
traits
Metadata Semantics: what, where, when, why
name, size, shape, color, material, texture
manufacturer, number of outlets, voltage, is pre-assembled …
address, phone number, birth date, user rating
Metadata Contexts: Some metadata fields depend on “what” the “thing” is,
e.g. People have different attributes than Toaster Ovens
Metadata provides Semantic Mappings
Consist of field name = field value pairings
Map Terms to Concepts
The term ‘red’ is known to be a ‘color’ because it is a
value in the ‘color’ field

Metadata as a Knowledge Base
Faceted Navigation - Top-down or drill-in
Search - More direct or bottom-up
Query Autofiltering:
Uses facet metadata in search collection to determine semantic
meaning of search terms.
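
A minimal sketch of the idea, assuming a small in-memory list of records stands in for the search collection. The field names, the sample values, and the greedy longest-phrase matching are illustrative assumptions, not the actual Query Autofiltering implementation.

```python
from collections import defaultdict

# Hypothetical product records; field names and values are illustrative only.
DOCS = [
    {"color": "red", "product_type": "sofa", "material": "leather"},
    {"color": "blue", "product_type": "lamp", "material": "glass"},
    {"color": "red", "product_type": "table lamp", "material": "brass"},
]

def build_value_index(docs):
    """Map each lowercased field value to the set of fields it occurs in."""
    index = defaultdict(set)
    for doc in docs:
        for field, value in doc.items():
            index[value.lower()].add(field)
    return index

def autofilter(query, index, max_phrase_len=3):
    """Greedily match the longest known values; return (filters, leftover terms)."""
    tokens = query.lower().split()
    filters, leftover = [], []
    i = 0
    while i < len(tokens):
        for n in range(min(max_phrase_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in index:
                # A value found in several fields yields one filter per field.
                for field in sorted(index[phrase]):
                    filters.append((field, phrase))
                i += n
                break
        else:
            # No known value starts here, so keep the token as free text.
            leftover.append(tokens[i])
            i += 1
    return filters, leftover

VALUE_INDEX = build_value_index(DOCS)
filters, leftover = autofilter("red table lamp cheap", VALUE_INDEX)
```

Here ‘red’ resolves to the ‘color’ field and ‘table lamp’ to the ‘product_type’ field because those are the fields in which the values occur; ‘cheap’ matches no metadata value and stays a free-text term.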

Semantic Knowledge is Power
Can use this built-in knowledge to “short-circuit” the “search
then drill in” paradigm

Metadata cardinality and “Boolean in the Vernacular”

Semantic Pattern Rules

$DRUG treats $SYMPTOM vs. $DRUG causes $SYMPTOM

$Accessory for $Product (e.g. “case for iPhone”)

Enables precise bottom-up search
Dependent on metadata quality and completeness
Improving metadata can improve search too

Categorical vs. Numerical
Some metadata is non-numerical - i.e. categorical
Similarity in numerical hyper-spaces is modeled as Euclidean space
Numerical Similarity
Search Relevance / Clustering / Learning Algorithms
Use Term Probability Vectors (tf/idf) in unstructured text
Everything must be a number - categorical data is indexed (arbitrary)
Similarity is based on linear or angular closeness of vectors
Detects patterns - which may not be intuitive => black-box models
Facet-based Similarity
Similarity based on shared categorical values and numerical ranges
Numerical data are ranged or “binned” to be compatible with categorical data
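
A rough sketch of facet-based similarity, assuming items are flat records and that overlap is scored with a Jaccard ratio over (field, value) pairs. The Jaccard choice, the price bins, and the sample items are assumptions made for illustration; the slides do not prescribe a scoring formula.

```python
def bin_price(price):
    """Assign a numeric price to a coarse, hypothetical range label."""
    if price < 50:
        return "under 50"
    if price < 200:
        return "50 to 199"
    return "200 and up"

def facet_pairs(item):
    """Turn an item into a set of (field, value) pairs, binning the numeric field."""
    pairs = set()
    for field, value in item.items():
        if field == "price":
            value = bin_price(value)
        pairs.add((field, value))
    return pairs

def facet_similarity(a, b):
    """Jaccard overlap of the two items' (field, value) pairs."""
    pairs_a, pairs_b = facet_pairs(a), facet_pairs(b)
    return len(pairs_a & pairs_b) / len(pairs_a | pairs_b)

lamp_a = {"type": "table lamp", "color": "red", "price": 45.0}
lamp_b = {"type": "table lamp", "color": "red", "price": 39.5}
sofa = {"type": "sofa", "color": "red", "price": 899.0}
```

The two lamps fall into the same price bin and share every pair, so they score as identical; the sofa shares only its color with either lamp. Unlike a vector distance, the score can be explained by listing the shared facet values.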

Navigating Categorical Spaces
Pivot Facets:
Paths or trajectories through categorical spaces
Multi-Dimensional Query Suggester
Pivot Patterns - Semantically “sensible” permutations of metadata fields

$First_Name $Last_Name $Occupation $City $State
Bob Jones Accountant Cincinnati OH
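
A small sketch of how pivot patterns could drive a multi-dimensional suggester, assuming a list of records stands in for the value combinations that pivot facets would return from the collection. The record fields mirror the pattern above; the second record and the extra patterns are hypothetical.

```python
RECORDS = [
    {"first_name": "Bob", "last_name": "Jones", "occupation": "Accountant",
     "city": "Cincinnati", "state": "OH"},
    {"first_name": "Ann", "last_name": "Lee", "occupation": "Engineer",
     "city": "Columbus", "state": "OH"},
]

# Each pattern is an ordered tuple of metadata fields that reads sensibly as a query.
PATTERNS = [
    ("first_name", "last_name"),
    ("occupation", "city", "state"),
    ("first_name", "last_name", "occupation", "city", "state"),
]

def pivot_suggestions(records, patterns):
    """Emit one suggestion per (record, pattern) when every field has a value."""
    out = set()
    for record in records:
        for pattern in patterns:
            if all(record.get(field) for field in pattern):
                out.add(" ".join(record[field] for field in pattern))
    return sorted(out)

SUGGESTED = pivot_suggestions(RECORDS, PATTERNS)
```

Because every suggestion is assembled from values that co-occur in a real record, each suggested query is one that will return results.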
Multi-dimensional queries and precision
Users expect greater precision in results (i.e. fewer) when they add refining information
to the query

Traditional “bag-of-words” search algorithms often fail to deliver on this expectation

Typeahead solutions should show queries that will “work” - rapid visualization of
available content

Query Autofiltering solution is tailored to this since it can navigate the same
categorical space that the pivot patterns generate

Gotcha: Both solutions depend on accurate and complete metadata at a sufficient level
of granularity.

Adding Context to Suggester via Facets
Suggestions are validated against content collection
Facets are used to acquire contextual metadata from a content collection while
building a typeahead collection
Use Cases
Security Trimming of Suggestions
If a query only hits secured documents, do not show that query to
users who cannot see any of those documents

Solution: Use facets to get the list of ACLs that are associated with a query
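
A sketch of the trimming step, assuming each suggestion already carries the set of ACLs that a facet query found on its matching documents. The suggestion texts, ACL names, and record layout are hypothetical.

```python
# Hypothetical typeahead entries; "acls" holds the ACL values that faceting
# found on the documents each suggested query would hit.
SUGGESTIONS = [
    {"text": "quarterly report", "acls": {"finance", "executives"}},
    {"text": "query pipeline", "acls": {"everyone"}},
    {"text": "quota policy", "acls": {"hr"}},
]

def trim_suggestions(suggestions, user_groups):
    """Keep a suggestion only if the user shares at least one ACL with its hits."""
    groups = set(user_groups)
    return [s["text"] for s in suggestions if s["acls"] & groups]

visible = trim_suggestions(SUGGESTIONS, ["everyone", "hr"])
```

A user in the ‘everyone’ and ‘hr’ groups never sees ‘quarterly report’, since none of the documents behind that query would be visible to them.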

Dynamic boosting of suggestions based on previous searches
Contextual metadata added to typeahead collection boosts similar
suggestions

Solution: In typeahead application retain context metadata for selected queries
and re-send it as boost queries in subsequent typeahead requests

Facet-Similarity Theorem at work

Building a Suggester with Dynamic Context
Uses Facet Queries against a
Content Collection to create
additional metadata for the
Suggester or Typeahead
Collection.
This contextual metadata can then
be used for:
• Security Trimming of
Typeahead suggestions
• Dynamic boosting of similar
suggestions within a user
session

Building a Suggester with Dynamic Context
Bring back other fields in addition to displayed suggestion text
(i.e., the ones that were calculated using faceting)
If a query is used to search, temporarily store its associated metadata in a
circular cache on the browser.
When submitting the next typeahead query, add the cached information from
the queue as boost queries.
Type ‘j’ - get back
Jai Johnny Johanson Bands
Jai Johnny Johanson Groups
J.J. Johnson
Jai Johnny Johanson
Juke Joint Jezebel
Juke Joint Jimmy
Just searched for ‘Paul McCartney’ then type ‘j’
John Lennon
John Lennon Songs
John Lennon Songs Covered
James P Johnson Songs (?)
John Lennon Originals
Hey Jude
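
A sketch of the circular cache and boost step described on this slide. The slide places the cache in the browser; Python is used here only to show the logic. The field names are hypothetical, and the ‘q’ and ‘bq’ parameter names are assumed to follow Solr's dismax conventions.

```python
from collections import deque

class ContextCache:
    """Fixed-size circular cache of metadata from recently executed queries."""

    def __init__(self, size=3):
        # deque with maxlen silently drops the oldest entry when full.
        self._items = deque(maxlen=size)

    def remember(self, metadata):
        """Store the facet-derived metadata of a suggestion that was searched."""
        self._items.append(metadata)

    def boost_queries(self):
        """Flatten cached metadata into de-duplicated field:"value" clauses."""
        seen, boosts = set(), []
        for metadata in self._items:
            for field, values in metadata.items():
                for value in values:
                    clause = f'{field}:"{value}"'
                    if clause not in seen:
                        seen.add(clause)
                        boosts.append(clause)
        return boosts

def typeahead_params(prefix, cache):
    """Build request parameters for the next typeahead lookup."""
    return {"q": prefix, "bq": cache.boost_queries()}

cache = ContextCache(size=2)
cache.remember({"band_ss": ["The Beatles"], "genre_ss": ["Rock"]})
params = typeahead_params("j", cache)
```

After a search whose metadata mentions The Beatles, the next request for ‘j’ carries boost clauses that favor suggestions sharing that context, which is the behavior shown in the ‘Paul McCartney’ example above.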

Structured vs. Unstructured Data
Faceted navigation requires structured data

Search is designed to handle unstructured data

Query Autofiltering enables precise search of structured data without complex Query
Language - Builds structured query from the inherent semantics of a “free text” query

“Who’s In The Who”
Structured Data = has metadata
Real-World - Data is imperfect / incomplete
Generally speaking there is not enough structure

eCommerce - tends to focus on top-down due to ubiquity of faceted navigation

e.g. “semi-structured”

Enterprise Search - document rich
Available metadata is poor in describing “Aboutness”

Analyzing Text to Extract Metadata
Search Engine
Analyze text to create “inverted index” and to parse the query

Text Mining
Analyze text to extract entities, concepts, categories
Goal - Improved Metadata Through Text Mining
Case 1:
Extracting product type and product attributes metadata from short product descriptions in
eCommerce data - dealing with precision and recall
Use “directed” NLP techniques to extract precise metadata.
“Coffee Pods for Keurig Coffee Makers”
Case 2:
Large text documents. Want to extract keywords and assign categories to documents.
Add metadata concerning “aboutness”

Auto Phrasing
- Multi-term phrases that refer to a single entity.
- Uses knowledge from a curated phrase list to determine what is an auto phrase
- Works on tokenized text fields (implemented as a Lucene TokenFilter)
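
A language-neutral sketch of the auto-phrasing step, written in Python rather than as a Lucene TokenFilter. It assumes a curated phrase list held as token tuples and joins matched phrases with an underscore; the phrase list and the joiner are illustrative assumptions.

```python
# Hypothetical curated phrase list, stored as tuples of tokens.
PHRASES = {
    ("data", "scientist"),
    ("garbage", "collection"),
    ("span", "query"),
}

def auto_phrase(tokens, phrases, joiner="_"):
    """Replace runs of tokens that form a curated phrase with a single token."""
    max_len = max((len(p) for p in phrases), default=1)
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first; phrases have at least two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in phrases:
                out.append(joiner.join(candidate))
                i += n
                break
        else:
            # No phrase starts at this position; pass the token through.
            out.append(tokens[i])
            i += 1
    return out

phrased = auto_phrase("the data scientist tuned garbage collection".split(), PHRASES)
```

Once ‘garbage collection’ is a single token it can no longer cross-match a query about a search ‘collection’.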
Importance of Noun Phrases
Want “things” to be treated as such

Pre-emptive solution for ambiguities and miss or cross matches down the road

Examples

“data scientist” - not “data” but a person

“garbage collection” - a JVM process - has nothing to do with a search “collection”

“query pipeline” vs “span query” - LW Fusion thing vs Lucene thing

“query” is a noise word in LW blogs corpus, “span query” and “phrase query” are
keywords

Keyword Clustering using Facets
Information Theory
Keywords have high “Entropy” - meaning that their distribution is not uniform within a
collection of documents, but tends to be localized to documents about a related topic.
Keywords and Topics
Keywords are rare within a document corpus but common within a subset of
documents on the same subject area or topic

Keywords used in the same subject domain will be clustered or co-located
Application of Facet-Similarity Theorem
Use facets on unstructured text to find terms that are co-located by computing
simple facet ratios for positive and negative queries

Keyword clusters can then be used for topic mapping

Data Mining with Facets
Method:
• Tokenize text with auto phrasing, stop words and synonyms
- store tokens in a multi-valued field with DocValues
- (yes you can facet on a text field but it tends to hit a wall - 2M word limit
on facet values)
• Using the /terms handler, get each term in the text field.
• For each term, submit two queries
- one with text_field:[term] (positive Q)
- one with -text_field:[term] (negative Q)
• For each facet value (other terms) calculate the following ratio:
  (Facet Counts (Positive Q) / Total Counts (Positive Q)) / (Facet Counts (Negative Q) / Total Counts (Negative Q))
• Take the X log(X) of this ratio (for better discrimination)
- for each term, take the best related terms above some threshold
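
A sketch of the scoring step, assuming the ratio is the facet fraction under the positive query divided by the facet fraction under the negative query, and that “log” means the natural logarithm. Skipping terms with a zero count is also an assumption; the slide does not say how such cases are handled. The counts are invented for illustration.

```python
import math

def facet_ratio_score(pos_facet, pos_total, neg_facet, neg_total):
    """Return X * log(X) for the positive/negative facet ratio, or None if undefined."""
    if 0 in (pos_facet, pos_total, neg_facet, neg_total):
        return None  # assumption: terms with a zero count are skipped
    ratio = (pos_facet / pos_total) / (neg_facet / neg_total)
    return ratio * math.log(ratio)

def related_terms(pos_counts, pos_total, neg_counts, neg_total, threshold):
    """Score every facet term and keep those at or above the threshold, best first."""
    scored = []
    for term, pos_facet in pos_counts.items():
        score = facet_ratio_score(pos_facet, pos_total,
                                  neg_counts.get(term, 0), neg_total)
        if score is not None and score >= threshold:
            scored.append((term, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Invented counts for a seed term: 10 documents match it, 1000 do not.
POS_COUNTS = {"heap": 8, "solr": 9}
NEG_COUNTS = {"heap": 20, "solr": 900}
cluster = related_terms(POS_COUNTS, 10, NEG_COUNTS, 1000, threshold=10.0)
```

In this made-up example ‘heap’ is 40 times more frequent alongside the seed term than elsewhere and scores well above the threshold, while ‘solr’ is equally common in both sets, has a ratio of 1, and scores 0.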

Facet Ratios => Keyword Clusters
Security
ldap 727.7777777777777
permission 540.6349206349206
authentication 499.04761904761904
secure 320.22222222222223
password 231.70068027210885
identity 207.93650793650795
user name 182.984126984127
ssl 152.48677248677248
login 124.76190476190476
port 93.57142857142857
protocol 90.1058201058201
remote 77.97619047619048
connector 74.9288451012589
installation 70.88744588744588
mechanism 69.31216931216932
jetty 57.76014109347443
native 57.582417582417584
directory 43.477633477633475
sharepoint 38.98809523809524
restrict 34.65608465608466
plugin 30.114942528735632
dashboard 28.791208791208792
communicate 23.764172335600907

Facet Ratios => Keyword Clusters
Garbage Collection
garbage 1048.4615384615383
pause 813.4615384615383
heap 581.0439560439561
xx 397.6923076923077
bottleneck 325.38461538461536
collector 278.9010989010989
jvm 253.07692307692307
collect 195.23076923076923
crash 144.6153846153846
thread 135.5769230769231
scheme 116.20879120879121
concern 100.11834319526628
low 97.9652605459057
memory 91.77514792899409
slower 90.38461538461539
timestamp 75.08875739644971
log file 72.3076923076923
disk 52.36074270557029
general 46.01398601398601
generation 44.88063660477454
delete 41.26254180602007
size 38.85189437428244
efficient 38.280542986425345
specify 31.990060501296455

Keyword Vector Document Clustering
Use the Keyword Vectors to compute distances between documents rather than raw TF/IDF
=> Higher Signal To Noise
Tokenizer => Compute Keyword Vector => K-Means Clustering
Cluster: 98
stump_the_chump: 15159.8533727
stump: 12931.0599949
prize: 12378.4630507
sight: 2943.0123456
tough: 2872.89050924
question: 2827.60450268
judge: 2353.93441007
submit: 2250.3503055
session: 2147.89226715
panel: 1888.9584879
hostetter: 1722.90005854
grant: 1600.7415686
chump: 1558.95135161
lucene_revolution: 1353.7746721
spot: 1211.58699335
award: 1048.0824900
mock: 1005.09316809
conference: 903.00251411
muir: 878.76730374
seat: 870.91541559
hot: 799.50707482
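
A sketch of the distance step only, assuming each document is reduced to a sparse vector over known keywords and compared with cosine distance. The keyword weights, the weighting scheme, and the choice of cosine distance are all assumptions; the slides do not specify how the keyword vectors or distances are computed, and the K-Means step itself is omitted.

```python
import math

# Hypothetical keyword weights, e.g. scores from a facet-ratio pass.
KEYWORD_WEIGHTS = {"heap": 3.0, "jvm": 2.5, "pause": 2.0, "ldap": 3.0, "password": 2.0}

def keyword_vector(tokens, weights):
    """Keep only known keywords, adding a keyword's weight for each occurrence."""
    vector = {}
    for token in tokens:
        if token in weights:
            vector[token] = vector.get(token, 0.0) + weights[token]
    return vector

def cosine_distance(a, b):
    """1 - cosine similarity of two sparse vectors; 1.0 if either is empty."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

doc_gc_1 = keyword_vector("the jvm heap grew until a long pause".split(), KEYWORD_WEIGHTS)
doc_gc_2 = keyword_vector("tune the heap to shorten each jvm pause".split(), KEYWORD_WEIGHTS)
doc_sec = keyword_vector("reset the ldap password".split(), KEYWORD_WEIGHTS)
```

Non-keyword tokens never enter the vectors, so the two garbage collection documents end up close together regardless of their filler words, while the security document shares no keywords with them and sits at the maximum distance.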

Topic Mapping
Semantic or Subject / Categorical Spaces
security
performance
garbage collection
authorization
saml
kerberos
username
login
permission
acl
qps
bottleneck
latency
generation
jvm
pause
stop the world
heap
optimization
concurrent
xx
speed
password

Topic Mapping
<doc>
<field name="label_s">Solr/Lucene Tech</field>
<field name="term_ss">solr</field>
<field name="term_ss">lucene</field>
<field name="term_ss">search handler</field>
<field name="term_ss">request handler</field>
<field name="term_ss">solrj</field>
<field name="term_ss">term query</field>
<field name="term_ss">boolean query</field>
<field name="term_ss">span query</field>
<field name="term_ss">phrase query</field>
<field name="term_ss">queryparser=>query parser</field>
<field name="term_ss">fq=>filter query</field>
<field name="term_ss">function query</field>
<field name="term_ss">bq=>boost query</field>
<field name="term_ss">solrconfig xml=>solrconfig.xml</field>
<field name="term_ss">edismax</field>
<field name="term_ss">dismax</field>
<field name="term_ss">analysis=>analyzer</field>
<field name="term_ss">positionincrementgap</field>
<field name="term_ss">highlighter</field>
<field name="term_ss">similarity</field>
<field name="term_ss">search index,lucene index=>inverted index</field>
<field name="term_ss">token</field>
<field name="term_ss">token filter,tokenfilter=>tokenizer</field>
<field name="term_ss">field type=>fieldtype</field>
<field name="term_ss">schema.xml,schema xml=>schema</field>
<field name="term_ss">facet</field>
<field name="term_ss">frange=>range</field>
<field name="term_ss">trie</field>
<field name="term_ss">pivot faceting,facet.pivot,facet pivot=>pivot facet</field>
<field name="term_ss">reference guide</field>
</doc>

Topic Mapping Approach
Keyword coverage is more important than density
Many-to-Many mapping
Documents may cover more than one topic

A given keyword may occur in more than one topic area

“Democratic” process
Keywords are “evidence” for a Topic - “aboutness” is cumulative

Enables documents to be mapped to multiple topics - which gives information on topic
relatedness

Simple threshold to determine topic membership
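
A sketch of threshold-based topic membership, assuming each topic is a set of keywords and that a document's score for a topic is the fraction of that topic's distinct keywords it contains (coverage rather than density). The topic keyword sets are drawn from the keyword clusters shown earlier; the coverage formula and the threshold value are assumptions.

```python
TOPICS = {
    "security": {"ldap", "permission", "authentication", "password", "ssl"},
    "garbage collection": {"garbage", "pause", "heap", "jvm", "collector"},
}

def map_topics(doc_keywords, topics, threshold=0.4):
    """Return {topic: coverage} for every topic whose coverage meets the threshold."""
    present = set(doc_keywords)  # repeats do not count: coverage, not density
    scores = {}
    for label, terms in topics.items():
        coverage = len(present & terms) / len(terms)
        if coverage >= threshold:
            scores[label] = coverage
    return scores

doc = ["heap", "jvm", "pause", "password", "ldap", "heap"]
memberships = map_topics(doc, TOPICS)
```

The sample document clears the threshold for both topics, so it maps to both, and the repeated ‘heap’ adds nothing to its garbage collection score. Documents that map to the same pair of topics are one source of the topic relatedness information mentioned above.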
