Facets and Similarity
Exploring the Meta-Informational Hyperspace
Ted Sullivan
Lucidworks, Inc.
Information Spaces
Use Cases: Search and Discovery
Knowledge Spaces
Asking the right question (knowing what questions to ask)
Navigation and Visualization
Alexa: Who’s on first?
What’s the first baseman’s name?
Relevance - Similarity - Precision - Classification
Vectors and Vector Spaces
Are information spaces like Euclidean or Cartesian spaces?
Knowledge Bases
Lamp Table
Side Table
Table Lamp
Facet Synonyms - Spatial Metaphors
Parameters
Dimensions
Navigators
Refiners
Supports the notion of some kind of n-dimensional information “space”
I call it a meta-informational space
Traditional Uses - Navigation and Visualization
Verity K2
Endeca
Fast ESP
MS Fast
Contexts are Viewpoints / Perspectives in Information Space
“The circumstances that form the setting for an event, statement, or idea, and in
terms of which it can be fully understood and assessed.”
Personal Contexts
Who is searching?

What are their roles / interests?

What have they searched for in the past (including just now)?

What are they allowed to search for?
Semantic Contexts
Homonym / Polysemy Problem

“apple”

Tech company, Horticulture, Food, Music, New York City

What is the subject area or domain?
Contexts
Facet - Similarity Theorem
Lemma 1: Similar things tend to occur in similar contexts
Lemma 2: Facets are a tool for exploring meta-informational contexts
Theorem: Facets can be used to find similar things.
Facets
“A particular aspect or feature of something.”
Facets are Metadata
“Data about data” - attributes, aspects, descriptors, features, properties, traits
Metadata Semantics: what, where, when, why
name, size, shape, color, material, texture
manufacturer, number of outlets, voltage, is pre-assembled …
address, phone number, birth date, user rating
Metadata Contexts: Some metadata fields depend on “what" the “thing” is,
e.g. People have different attributes than Toaster Ovens
Metadata provides Semantic Mappings
Consist of field name = field value pairings
Map Terms to Concepts
The term ‘red’ is known to be a ‘color’ because it is a
value in the ‘color’ field
Metadata as a Knowledge Base
Faceted Navigation - Top-down or drill-in
Search - More direct or bottom-up
Query Autofiltering:
Uses facet metadata in search collection to determine semantic
meaning of search terms.

Semantic Knowledge is Power
Can use this built-in knowledge to “short-circuit” the “search then drill in” paradigm

Metadata cardinality and “Boolean in the Vernacular”

Semantic Pattern Rules

$DRUG treats $SYMPTOM vs. $DRUG causes $SYMPTOM

$Accessory for $Product (e.g. “case for iPhone”)

Enables precise bottom-up search
Dependent on metadata quality and completeness
Improving metadata can improve search too
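A minimal sketch of the Query Autofiltering idea, assuming a tiny in-memory map of field values standing in for the facet metadata of a search collection. The field names, values and the `autofilter` function are hypothetical and only illustrate how a free-text query can be turned into field = value filters; this is not the actual implementation.

```python
# Hypothetical facet metadata: field name -> values known to the collection.
FACET_VALUES = {
    "color": {"red", "blue", "white"},
    "product_type": {"sofa", "lamp", "table lamp"},
    "material": {"leather", "wood"},
}


def build_value_index(facet_values):
    """Invert field -> values into value -> field.

    Assumption: each value belongs to exactly one field.
    """
    index = {}
    for field, values in facet_values.items():
        for value in values:
            index[value] = field
    return index


def autofilter(query, facet_values=FACET_VALUES, max_phrase_len=3):
    """Split a free-text query into (filters, leftover_terms).

    Longer phrases are tried first, so "table lamp" wins over "lamp".
    """
    index = build_value_index(facet_values)
    tokens = query.lower().split()
    filters, leftover = [], []
    i = 0
    while i < len(tokens):
        for n in range(min(max_phrase_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in index:
                filters.append((index[phrase], phrase))
                i += n
                break
        else:
            # No known field value starts at this token: keep it as free text.
            leftover.append(tokens[i])
            i += 1
    return filters, leftover
```

Calling `autofilter("red leather sofa")` yields three field filters and no leftover terms, which is the "short-circuit" of the search-then-drill-in paradigm: the query itself selects the facets.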
Categorical vs. Numerical
Some metadata is non-numerical - i.e. categorical
Similarity in numerical hyper-spaces is modeled as Euclidean space
Numerical Similarity
Search Relevance / Clustering / Learning Algorithms
Use Term Probability Vectors (tf/idf) in unstructured text
Everything must be a number - categorical data is mapped to arbitrary index numbers
Similarity is based on linear or angular closeness of vectors
Detects patterns - which may not be intuitive => black-box models
Facet-based Similarity
Similarity is based on shared categorical values and numerical ranges
Numerical data are ranged or “binned” to be compatible with categorical data
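A small sketch of facet-based similarity, assuming items are plain dictionaries and that similarity is measured as the Jaccard overlap of their facet sets. The overlap measure, the bin edges and the function names are assumptions for illustration; the slides do not specify them.

```python
def bin_value(value, edges):
    """Return a range label such as '100-200' for a number, given sorted bin edges."""
    for low, high in zip(edges, edges[1:]):
        if low <= value < high:
            return f"{low}-{high}"
    return f"{edges[-1]}+" if value >= edges[-1] else f"<{edges[0]}"


def to_facets(item, numeric_bins):
    """Turn an item into a set of (field, value) pairs, binning numeric fields."""
    facets = set()
    for field, value in item.items():
        if field in numeric_bins:
            facets.add((field, bin_value(value, numeric_bins[field])))
        else:
            facets.add((field, value))
    return facets


def facet_similarity(a, b, numeric_bins):
    """Jaccard overlap of two items' facet sets (an assumed measure)."""
    fa, fb = to_facets(a, numeric_bins), to_facets(b, numeric_bins)
    union = fa | fb
    return len(fa & fb) / len(union) if union else 0.0
```

Two lamps priced at 120 and 180 fall into the same hypothetical 100-200 bin, so price counts as a shared facet even though the raw numbers differ.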
Navigating Categorical Spaces
Pivot Facets:
Paths or trajectories through categorical spaces
Multi-Dimensional Query Suggester
Pivot Patterns - Semantically “sensible” permutations of metadata fields

$First_Name $Last_Name $Occupation $City $State
Bob Jones Accountant Cincinnati OH
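A sketch of how a pivot pattern could generate suggester entries, assuming records are available as dictionaries. In Solr the paths would more likely come from pivot faceting; the field names and the `suggestions_from_pattern` helper here are hypothetical.

```python
# Hypothetical pivot pattern: an ordered, semantically sensible list of fields.
PATTERN = ["first_name", "last_name", "occupation", "city", "state"]


def suggestions_from_pattern(records, pattern=PATTERN):
    """Return one suggestion string per distinct path through the pattern's fields."""
    seen = set()
    suggestions = []
    for record in records:
        # Skip records that cannot fill every slot of the pattern.
        if not all(record.get(field) for field in pattern):
            continue
        text = " ".join(str(record[field]) for field in pattern)
        if text not in seen:
            seen.add(text)
            suggestions.append(text)
    return suggestions
```

Because every suggestion is built from field values that co-occur in a real record, each suggested query is guaranteed to match at least one document.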
Multi-dimensional queries and precision
Users expect greater precision in results (i.e. fewer results) when they add refining information to the query

Traditional “bag-of-words” search algorithms often fail to deliver on this expectation

Typeahead solutions should show queries that will “work” - rapid visualization of available content

Query Autofiltering solution is tailored to this since it can navigate the same categorical space that the pivot patterns generate

Gotcha: Both solutions depend on accurate and complete metadata at a sufficient level of granularity.
Adding Context to Suggester via Facets
Suggestions are validated against content collection
Facets are used to acquire contextual metadata from a content collection while
building a typeahead collection
Use Cases
Security Trimming of Suggestions
If a query only hits on secured documents, do not want to show that query to
users that cannot see any of the documents

Solution: Use facets to get the list of ACLs that are associated with a query

Dynamic boosting of suggestions based on previous searches
Contextual metadata added to typeahead collection boosts similar
suggestions

Solution: In typeahead application retain context metadata for selected queries
and re-send it as boost queries in subsequent typeahead requests

Facet-Similarity Theorem at work
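A sketch of security trimming for suggestions, assuming each suggestion was stored with the list of ACLs gathered by faceting over the documents its query hits. The `acl` field name, the record shapes and both function names are hypothetical.

```python
def acls_for_query(matching_docs):
    """Union of ACL values over the documents a query hits.

    This is what a facet on a hypothetical 'acl' field would return at build time.
    """
    acls = set()
    for doc in matching_docs:
        acls.update(doc.get("acl", []))
    return sorted(acls)


def visible_suggestions(suggestions, user_groups):
    """Keep only suggestions with at least one ACL the user holds."""
    user_groups = set(user_groups)
    return [s["text"] for s in suggestions if user_groups & set(s["acls"])]
```

A user who holds none of the ACLs attached to a suggestion never sees it, so the typeahead cannot leak the existence of secured documents.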
Building a Suggester with Dynamic Context
Uses Facet Queries against a
Content Collection to create
additional metadata for the
Suggester or Typeahead
Collection.
This contextual metadata can then
be used for:
• Security Trimming of
Typeahead suggestions
• Dynamic boosting of similar
suggestions within a user
session
Building a Suggester with Dynamic Context
Bring back other fields in addition to displayed suggestion text
(i.e., the ones that were calculated using faceting)
If a query is used to search, temporarily store its associated metadata in a
circular cache on the browser.
When submitting the next typeahead query, add the cached information from
the queue as boost queries.
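A sketch of the circular cache, written in Python for consistency although the slides place it in the browser. It assumes the context metadata of a selected suggestion is a field -> value dictionary and that the boosts are sent as Solr-style boost query clauses; the field names, boost value and `SessionContext` class are hypothetical.

```python
from collections import deque


class SessionContext:
    """Fixed-size cache of context metadata for recently searched suggestions."""

    def __init__(self, size=3):
        # A deque with maxlen drops the oldest entry automatically.
        self.recent = deque(maxlen=size)

    def remember(self, metadata):
        """Store the field -> value metadata of a suggestion that was searched."""
        self.recent.append(dict(metadata))

    def boost_queries(self, boost=2.0):
        """Return de-duplicated boost clauses for the next typeahead request."""
        seen, clauses = set(), []
        for metadata in self.recent:
            for field, value in sorted(metadata.items()):
                clause = f'{field}:"{value}"^{boost}'
                if clause not in seen:
                    seen.add(clause)
                    clauses.append(clause)
        return clauses
```

After a search for a suggestion tagged with a hypothetical `band_s` value, the next typeahead request would carry that value as a boost, which is how typing ‘j’ can surface related names first.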
Type ‘j’ - get back
Jai Johnny Johanson Bands
Jai Johnny Johanson Groups
J.J. Johnson
Jai Johnny Johanson
Juke Joint Jezebel
Juke Joint Jimmy
Just searched for ‘Paul McCartney’ then type ‘j’
John Lennon
John Lennon Songs
John Lennon Songs Covered
James P Johnson Songs (?)
John Lennon Originals
Hey Jude
Structured vs. Unstructured Data
Faceted navigation requires structured data

Search is designed to handle unstructured data

Query Autofiltering enables precise search of structured data without complex Query
Language - Builds structured query from the inherent semantics of a “free text” query

“Who’s In The Who”
Structured Data = has metadata
Real-World - Data is imperfect / incomplete
Generally speaking there is not enough structure

eCommerce - tends to focus on top-down due to ubiquity of faceted navigation

e.g. “semi-structured”

Enterprise Search - document rich
Available metadata is poor in describing “Aboutness”
Analyzing Text to Extract Metadata
Search Engine
Analyze text to create “inverted index” and to parse the query

Text Mining
Analyze text to extract entities, concepts, categories
Goal - Improved Metadata Through Text Mining
Case 1:
Extracting product type and product attributes metadata from short product descriptions in
eCommerce data - dealing with precision and recall
Use “directed” NLP techniques to extract precise metadata.
“Coffee Pods for Keurig Coffee Makers”
Case 2:
Large text documents. Want to extract keywords and assign categories to documents.
Add metadata concerning “aboutness”
Auto phrasing
Auto Phrasing
- Multi-term phrases that refer to a single entity.
- Uses knowledge from a curated phrase list to determine what is an auto phrase
- Works on tokenized text fields (implemented as a Lucene TokenFilter)
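A sketch of the auto phrasing idea in Python, for illustration only; the slides describe a Lucene TokenFilter, which this does not reproduce. It assumes a curated phrase list and joins matched tokens with an underscore. The `auto_phrase` function and its parameters are hypothetical.

```python
def auto_phrase(tokens, phrases, joiner="_"):
    """Merge runs of tokens that match a curated phrase into a single token.

    Longest phrases are tried first; unmatched tokens pass through lowercased.
    """
    phrase_set = {tuple(p.lower().split()) for p in phrases}
    max_len = max((len(p) for p in phrase_set), default=1)
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            window = tuple(t.lower() for t in tokens[i:i + n])
            if window in phrase_set:
                out.append(joiner.join(window))
                i += n
                break
        else:
            # No multi-term phrase starts here: emit the single token.
            out.append(tokens[i].lower())
            i += 1
    return out
```

With the phrase list `["data scientist", "garbage collection"]`, the two-word entities become single tokens and can no longer match a query for “data” or “collection” alone.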
Importance of Noun Phrases
Want “things” to be treated as such

Pre-emptive solution for ambiguities and missed or crossed matches down the road

Examples

“data scientist” - not “data” but a person

“garbage collection” - a JVM process - has nothing to do with a search “collection”

“query pipeline” vs “span query” - LW Fusion thing vs Lucene thing

“query” is a noise word in LW blogs corpus, “span query” and “phrase query” are keywords
Keyword Clustering using Facets
Information Theory
Keywords carry high information content (low entropy across the collection) - meaning that their distribution is not uniform within a collection of documents, but tends to be localized to documents about a related topic.
Keywords and Topics
Keywords are rare within a document corpus but common within a subset of
documents on the same subject area or topic

Keywords used in the same subject domain will be clustered or co-located
Application of Facet-Similarity Theorem
Use facets on unstructured text to find terms that are co-located by computing
simple facet ratios for positive and negative queries

Keyword clusters can then be used for topic mapping
Data Mining with Facets
Method:
• Tokenize text with auto phrasing, stop words and synonyms
- store tokens in a multi-valued field with DocValues
- (yes you can facet on a text field but it tends to hit a wall - 2M word limit
on facet values)
• Using the /terms handler, get each term in the text field.
• For each term, submit two queries
- one with text_field:[term] (positive Q)
- one with -text_field:[term] (negative Q)
• For each facet value (other terms) calculate the following ratio:
  (Facet Counts (Positive Q) / Total Counts (Positive Q)) / (Facet Counts (Negative Q) / Total Counts (Negative Q))
• Take the X log(X) of this ratio (for better discrimination)
- for each term, take the best related terms above some threshold
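A sketch of the facet ratio computation on an in-memory corpus, assuming each document is a set of tokens instead of a Solr collection. The add-one smoothing on the negative side is an assumption added to avoid division by zero (the slides do not say how that case is handled), and `related_terms` and its threshold are hypothetical.

```python
import math


def related_terms(docs, term, threshold=1.0):
    """Score terms that co-occur with `term` using the X log(X) of the facet ratio.

    docs: list of sets of tokens.
    Positive set = documents containing `term`; negative set = the rest.
    """
    positive = [d for d in docs if term in d]
    negative = [d for d in docs if term not in d]
    if not positive or not negative:
        return {}
    scores = {}
    for other in set().union(*positive) - {term}:
        pos_frac = sum(other in d for d in positive) / len(positive)
        # Assumption: add-one smoothing so unseen terms do not divide by zero.
        neg_frac = (sum(other in d for d in negative) + 1) / (len(negative) + 1)
        ratio = pos_frac / neg_frac
        score = ratio * math.log(ratio)
        if score >= threshold:
            scores[other] = score
    # Best related terms first.
    return dict(sorted(scores.items(), key=lambda kv: -kv[1]))
```

Terms that are common everywhere get a ratio near 1 and a score near 0, so only terms concentrated in the positive set survive the threshold.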
Facet Ratios => Keyword Clusters
Security
ldap 727.7777777777777
permission 540.6349206349206
authentication 499.04761904761904
secure 320.22222222222223
password 231.70068027210885
identity 207.93650793650795
user name 182.984126984127
ssl 152.48677248677248
login 124.76190476190476
port 93.57142857142857
protocol 90.1058201058201
remote 77.97619047619048
connector 74.9288451012589
installation 70.88744588744588
mechanism 69.31216931216932
jetty 57.76014109347443
native 57.582417582417584
directory 43.477633477633475
sharepoint 38.98809523809524
restrict 34.65608465608466
plugin 30.114942528735632
dashboard 28.791208791208792
communicate 23.764172335600907
Facet Ratios => Keyword Clusters
Garbage Collection
garbage 1048.4615384615383
pause 813.4615384615383
heap 581.0439560439561
xx 397.6923076923077
bottleneck 325.38461538461536
collector 278.9010989010989
jvm 253.07692307692307
collect 195.23076923076923
crash 144.6153846153846
thread 135.5769230769231
scheme 116.20879120879121
concern 100.11834319526628
low 97.9652605459057
memory 91.77514792899409
slower 90.38461538461539
timestamp 75.08875739644971
log file 72.3076923076923
disk 52.36074270557029
general 46.01398601398601
generation 44.88063660477454
delete 41.26254180602007
size 38.85189437428244
efficient 38.280542986425345
specify 31.990060501296455
Keyword Vector Document Clustering
Use the Keyword Vectors to compute distances between documents rather than raw TF/IDF
=> Higher Signal To Noise
Tokenizer => Compute Keyword Vector => K-Means Clustering
Cluster: 98
stump_the_chump: 15159.8533727
stump: 12931.0599949
prize: 12378.4630507
sight: 2943.0123456
tough: 2872.89050924
question: 2827.60450268
judge: 2353.93441007
submit: 2250.3503055
session: 2147.89226715
panel: 1888.9584879
hostetter: 1722.90005854
grant: 1600.7415686
chump: 1558.95135161
lucene_revolution: 1353.7746721
spot: 1211.58699335
award: 1048.0824900
mock: 1005.09316809
conference: 903.00251411
muir: 878.76730374
seat: 870.91541559
hot: 799.50707482
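A sketch of the keyword vector step that would feed a clustering algorithm, assuming keyword weights are already available (for example from facet ratio scores). It shows only the vector construction and a cosine similarity, not K-Means itself; the weights and function names are hypothetical.

```python
import math
from collections import Counter


def keyword_vector(tokens, keyword_weights):
    """Build a sparse vector over keywords only, ignoring all other tokens."""
    counts = Counter(t for t in tokens if t in keyword_weights)
    return {t: c * keyword_weights[t] for t, c in counts.items()}


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Dropping non-keyword tokens before computing distances is what raises the signal-to-noise ratio: noise words never contribute to the distance between two documents.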
Topic Mapping
Semantic or Subject / Categorical Spaces
[Figure: keyword clusters for three topics (security, performance, garbage collection). Keywords shown: authorization, saml, kerberos, username, login, permission, acl, password, qps, bottleneck, latency, generation, jvm, pause, stop the world, heap, optimization, concurrent, xx, speed]
Topic Mapping
<doc>
<field name="label_s">Solr/Lucene Tech</field>
<field name="term_ss">solr</field>
<field name="term_ss">lucene</field>
<field name="term_ss">search handler</field>
<field name="term_ss">request handler</field>
<field name="term_ss">solrj</field>
<field name="term_ss">term query</field>
<field name="term_ss">boolean query</field>
<field name="term_ss">span query</field>
<field name="term_ss">phrase query</field>
<field name="term_ss">queryparser=>query parser</field>
<field name="term_ss">fq=>filter query</field>
<field name="term_ss">function query</field>
<field name="term_ss">bq=>boost query</field>
<field name="term_ss">solrconfig xml=>solrconfig.xml</field>
<field name="term_ss">edismax</field>
<field name="term_ss">dismax</field>
<field name="term_ss">analysis=>analyzer</field>
<field name="term_ss">positionincrementgap</field>
<field name="term_ss">highlighter</field>
<field name="term_ss">similarity</field>
<field name="term_ss">search index,lucene index=>inverted index</field>
<field name="term_ss">token</field>
<field name="term_ss">token filter,tokenfilter=>tokenizer</field>
<field name="term_ss">field type=>fieldtype</field>
<field name="term_ss">schema.xml,schema xml=>schema</field>
<field name="term_ss">facet</field>
<field name="term_ss">frange=>range</field>
<field name="term_ss">trie</field>
<field name="term_ss">pivot faceting,facet.pivot,facet pivot=>pivot facet</field>
<field name="term_ss">reference guide</field>
</doc>
Topic Mapping Approach
Keyword coverage is more important than density
Many-to-Many mapping
Documents may cover more than one topic

A given keyword may occur in more than one topic area

“Democratic” process
Keywords are “evidence” for a Topic - “aboutness” is cumulative

Enables documents to be mapped to multiple topics - which gives information on topic
relatedness

Simple threshold to determine topic membership
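A sketch of the topic mapping approach, assuming each topic is a set of keywords and a document is a list of tokens. Coverage (the number of distinct topic keywords present) is used as the evidence score, so repeating one keyword does not help. The `map_topics` function, the sample topics and the threshold value are hypothetical.

```python
def map_topics(doc_tokens, topics, min_keywords=2):
    """Return {topic_label: coverage} for every topic the document qualifies for.

    Coverage counts distinct topic keywords found in the document, not how
    often each one occurs, so coverage matters more than density.
    """
    present = set(doc_tokens)
    result = {}
    for label, terms in topics.items():
        coverage = len(present & set(terms))
        if coverage >= min_keywords:
            result[label] = coverage
    return result
```

A document can exceed the threshold for several topics at once, which gives the many-to-many mapping and a rough signal of which topics are related.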
Topic Mapping Results

Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityNeo4j
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptkotipi9215
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Velvetech LLC
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 

Recently uploaded (20)

办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdf
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.ppt
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 

Haystacks slides

  • 1. Facets and Similarity Exploring the Meta-Informational Hyperspace Ted Sullivan Lucidworks, Inc.
  • 2. Information Spaces Use Cases: Search and Discovery Knowledge Spaces Asking the right question (knowing what questions to ask) Navigation and Visualization Alexa: Who’s on first? What’s the first baseman’s name? Relevance - Similarity - Precision - Classification Vectors and Vector Spaces Are Information spaces like Euclidean or Cartesian spaces? Knowledge Bases Lamp Table Side Table Table Lamp
  • 3. Facet Synonyms - Spatial Metaphors Parameters Dimensions Navigators Refiners Supports the notion of some kind of n-dimensional information “space” I call it a meta-informational space Traditional Uses - Navigation and Visualization Verity K2 Endeca Fast ESP MS Fast
  • 4. Contexts are Viewpoints / Perspectives in Information Space “The circumstances that form the setting for an event, statement, or idea, and in terms of which it can be fully understood and assessed.” Personal Contexts Who is searching? What are their roles / interests? What have they searched for in the past (including just now)? What are they allowed to search for? Semantic Contexts Homonym / Polysemy Problem “apple” Tech company, Horticulture, Food, Music, New York City What is the subject area or domain? Contexts
  • 5. Facet - Similarity Theorem Lemma 1: Similar things tend to occur in similar contexts Lemma 2: Facets are a tool for exploring meta-informational contexts Theorem: Facets can be used to find similar things.
  • 6. Facets “A particular aspect or feature of something.” Facets are Metadata “Data about data” - attributes, aspects, descriptors, features, properties, traits Metadata Semantics: what, where, when, why name, size, shape, color, material, texture manufacturer, number of outlets, voltage, is pre-assembled … address, phone number, birth date, user rating Metadata Contexts: Some metadata fields depend on “what” the “thing” is, e.g. People have different attributes than Toaster Ovens Metadata provides Semantic Mappings Consist of field name = field value pairings Map Terms to Concepts The term ‘red’ is known to be a ‘color’ because it is a value in the ‘color’ field
  • 7. Metadata as a Knowledge Base Faceted Navigation - Top-down or drill-in Search - More direct or bottom-up Query Autofiltering: Uses facet metadata in search collection to determine semantic meaning of search terms. Semantic Knowledge is Power Can use this built-in knowledge to “short-circuit” the “search then drill in” paradigm Metadata cardinality and “Boolean in the Vernacular” Semantic Pattern Rules $DRUG treats $SYMPTOM vs. $DRUG causes $SYMPTOM $Accessory for $Product (e.g. “case for iPhone”) Enables precise bottom-up search Dependent on metadata quality and completeness Improving metadata can improve search too
  • 8. Categorical vs. Numerical Some metadata is non-numerical - i.e. categorical Similarity in numerical hyper-spaces is modeled as Euclidean space Numerical Similarity Search Relevance / Clustering / Learning Algorithms Use Term Probability Vectors (tf/idf) in unstructured text Everything must be a number - categorical data is indexed (arbitrary) Similarity is based on linear or angular closeness of vectors Detects patterns - which may not be intuitive => black-box models Facet-based Similarity Similarity based on shared categorical and numerical ranges Numerical data are ranged or “binned” to be compatible with category
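The facet-based similarity idea on slide 8 can be sketched in a few lines. This is only an illustration under stated assumptions: the bin edges, the field names, and the use of a Jaccard overlap as the similarity score are not from the slides, which say only that similarity is based on shared categorical values and numerical ranges.

```python
from typing import Dict, List, Set, Tuple

Bins = List[Tuple[float, float, str]]  # (low, high, label), half-open [low, high)


def bin_value(value: float, bins: Bins) -> str:
    """Map a number to the label of the range that contains it."""
    for low, high, label in bins:
        if low <= value < high:
            return label
    raise ValueError(f"value {value} falls outside all bins")


def to_facets(item: Dict[str, object], numeric_bins: Dict[str, Bins]) -> Set[Tuple[str, object]]:
    """Turn an item into a set of (field, value) facets, binning numeric fields."""
    facets = set()
    for field, value in item.items():
        if field in numeric_bins:
            facets.add((field, bin_value(float(value), numeric_bins[field])))
        else:
            facets.add((field, value))
    return facets


def facet_similarity(a: Dict[str, object], b: Dict[str, object],
                     numeric_bins: Dict[str, Bins]) -> float:
    """Assumed scoring: Jaccard overlap of the two facet sets."""
    fa, fb = to_facets(a, numeric_bins), to_facets(b, numeric_bins)
    union = fa | fb
    return len(fa & fb) / len(union) if union else 0.0


# Hypothetical price ranges and products, for illustration only.
BINS = {"price": [(0, 50, "under_50"), (50, 200, "50_to_200"), (200, float("inf"), "200_plus")]}
lamp_a = {"type": "table lamp", "color": "red", "price": 35.0}
lamp_b = {"type": "table lamp", "color": "blue", "price": 42.0}
similarity = facet_similarity(lamp_a, lamp_b, BINS)
```

Here the two lamps share the type and the price range but not the color, so two of the four distinct facets overlap.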
  • 9. Navigating Categorical Spaces Pivot Facets: Paths or trajectories through categorical spaces Multi-Dimensional Query Suggester Pivot Patterns - Semantically “sensible” permutations of metadata fields $First_Name $Last_Name $Occupation $City $State Bob Jones Accountant Cincinnati OH Multi-dimensional queries and precision Users expect greater precision in results (i.e. fewer) when they add refining information to the query Traditional “bag-of-words” search algorithms often fail to deliver on this expectation Typeahead solutions should show queries that will “work” - rapid visualization of available content Query Autofiltering solution is tailored to this since it can navigate the same categorical space that the pivot patterns generate Gotcha: Both solutions depend on accurate and complete metadata at sufficient level of granularity.
  • 10. Adding Context to Suggester via Facets Suggestions are validated against content collection Facets are used to acquire contextual metadata from a content collection while building a typeahead collection Use Cases Security Trimming of Suggestions If a query only hits on secured documents, do not want to show that query to users that cannot see any of the documents Solution: Use facets to get the list of ACLs that are associated with a query Dynamic boosting of suggestions based on previous searches Contextual metadata added to typeahead collection boosts similar suggestions Solution: In typeahead application retain context metadata for selected queries and re-send it as boost queries in subsequent typeahead requests Facet-Similarity Theorem at work
  • 11. Building a Suggester with Dynamic Context Uses Facet Queries against a Content Collection to create additional metadata for the Suggester or Typeahead Collection. This contextual metadata can then be used for: • Security Trimming of Typeahead suggestions • Dynamic boosting of similar suggestions within a user session
  • 12. Building a Suggester with Dynamic Context Bring back other fields in addition to displayed suggestion text (i.e., the ones that were calculated using faceting) If a query is used to search, temporarily store its associated metadata in a circular cache on the browser. When submitting the next typeahead query, add the cached information from the queue as boost queries. Type ‘j’ - get back Jai Johnny Johanson Bands Jai Johnny Johanson Groups J.J. Johnson Jai Johnny Johanson Juke Joint Jezebel Juke Joint Jimmy Just searched for ‘Paul McCartney’ then type ‘j’ John Lennon John Lennon Songs John Lennon Songs Covered James P Johnson Songs (?) John Lennon Originals Hey Jude
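The session-context mechanism on slide 12 could look roughly like the sketch below. The slides describe a circular cache held in the browser; Python is used here only to show the logic. The class name, the field name `band_ss`, and the request shape are hypothetical. `bq` is the boost-query parameter of Solr's dismax and edismax parsers, but whether the original application used it is an assumption.

```python
from collections import deque
from typing import Dict, List


class SuggestionContext:
    """Keeps metadata of the last few executed queries (a circular cache)."""

    def __init__(self, max_items: int = 5) -> None:
        # deque with maxlen silently drops the oldest entry when full
        self._recent: deque = deque(maxlen=max_items)

    def remember(self, metadata: Dict[str, List[str]]) -> None:
        """Store the contextual metadata of a suggestion the user searched with."""
        self._recent.append(metadata)

    def boost_queries(self) -> List[str]:
        """Turn cached metadata into de-duplicated field:"value" boost clauses."""
        clauses: List[str] = []
        for metadata in self._recent:
            for field, values in metadata.items():
                for value in values:
                    escaped = value.replace('"', '\\"')
                    clause = f'{field}:"{escaped}"'
                    if clause not in clauses:
                        clauses.append(clause)
        return clauses


def typeahead_params(prefix: str, context: SuggestionContext) -> Dict[str, object]:
    """Build request parameters for the next typeahead query (hypothetical shape)."""
    return {"q": prefix, "bq": context.boost_queries()}


ctx = SuggestionContext(max_items=2)
ctx.remember({"band_ss": ["Wings"]})
ctx.remember({"band_ss": ["The Beatles"]})
ctx.remember({"band_ss": ["The Beatles", "Plastic Ono Band"]})
params = typeahead_params("j", ctx)
```

With `max_items=2` the oldest entry has already been evicted, so only the two most recent searches influence the boosts sent with the prefix `j`.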
  • 13. Structured vs. Unstructured Data Faceted navigation requires structured data Search is designed to handle unstructured data Query Autofiltering enables precise search of structured data without complex Query Language - Builds structured query from the inherent semantics of a “free text” query “Who’s In The Who” Structured Data = has metadata Real-World - Data is imperfect / incomplete Generally speaking there is not enough structure eCommerce - tends to focus on top-down due to ubiquity of faceted navigation e.g. “semi-structured” Enterprise Search - document rich Available metadata is poor in describing “Aboutness”
  • 14. Analyzing Text to Extract Metadata Search Engine Analyze text to create “inverted index” and to parse the query Text Mining Analyze text to extract entities, concepts, categories Goal - Improved Metadata Through Text Mining Case 1: Extracting product type and product attributes metadata from short product descriptions in eCommerce data - dealing with precision and recall Use “directed” NLP techniques to extract precise metadata. “Coffee Pods for Keurig Coffee Makers” Case 2: Large text documents. Want to extract keywords and assign categories to documents. Add metadata concerning “aboutness”
  • 15. Auto phrasing Auto Phrasing - Multi-term phrases that refer to a single entity. - Uses knowledge from a curated phrase list to determine what is an auto phrase - Works on tokenized text fields (implemented as a Lucene TokenFilter) Importance of Noun Phrases Want “things” to be treated as such Pre-emptive solution for ambiguities and miss or cross matches down the road Examples “data scientist” - not “data” but a person “garbage collection” - a JVM process - has nothing to do with a search “collection” “query pipeline” vs “span query” - LW Fusion thing vs Lucene thing “query” is a noise word in LW blogs corpus, “span query” and “phrase query” are keywords
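A minimal sketch of the auto phrasing idea on slide 15, assuming a greedy longest-match over already tokenized text. The slides say the real implementation is a Lucene TokenFilter; this standalone function is not that filter. The underscore joiner is an assumption, suggested by tokens such as stump_the_chump on a later slide.

```python
from typing import Iterable, List


def auto_phrase(tokens: List[str], phrases: Iterable[str], joiner: str = "_") -> List[str]:
    """Replace runs of tokens that match a curated phrase with a single token."""
    phrase_set = {tuple(p.lower().split()) for p in phrases}
    max_len = max((len(p) for p in phrase_set), default=1)
    out: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest possible phrase first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in phrase_set:
                out.append(joiner.join(candidate))
                i += n
                break
        else:
            # No phrase starts here: keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


curated = ["garbage collection", "span query", "data scientist"]
result = auto_phrase("the garbage collection pause hurt the collection".split(), curated)
```

The standalone word "collection" at the end stays a separate token, which is the cross-match the slide wants to avoid.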
  • 16. Keyword Clustering using Facets Information Theory Keywords have high “Entropy” - meaning that their distribution is not uniform within a collection of documents, but tends to be localized to documents about a related topic. Keywords and Topics Keywords are rare within a document corpus but common within a subset of documents on the same subject area or topic Keywords used in the same subject domain will be clustered or co-located Application of Facet-Similarity Theorem Use facets on unstructured text to find terms that are co-located by computing simple facet ratios for positive and negative queries Keyword clusters can then be used for topic mapping
  • 17. Data Mining with Facets Method: • Tokenize text with auto phrasing, stop words and synonyms - store tokens in a multi-valued field with DocValues - (yes you can facet on a text field but it tends to hit a wall - 2M word limit on facet values) • Using the /terms handler, get each term in the text field. • For each term, submit two queries - one with text_field:[term] (positive Q) - one with -text_field:[term] (negative Q) • For each facet value (other terms) calculate the following ratio: [Facet Counts (Positive Q) / Total Counts (Positive Q)] / [Facet Counts (Negative Q) / Total Counts (Negative Q)] • Take the X log(X) of this ratio (for better discrimination) - for each term, take the best related terms above some threshold
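The method on slide 17 can be tried out without a search engine. The sketch below stands in for the positive and negative Solr queries with in-memory sets of tokens, one set per document. Two things are assumptions rather than part of the slides: a zero negative count is treated as one to avoid division by zero, and only ratios above 1 are kept.

```python
import math
from typing import Dict, List, Set


def facet_ratio_scores(docs: List[Set[str]], term: str,
                       min_score: float = 0.0) -> Dict[str, float]:
    """Score terms that co-occur with `term`, using X*log(X) of the facet ratio."""
    positive = [d for d in docs if term in d]      # docs matching text_field:term
    negative = [d for d in docs if term not in d]  # docs matching -text_field:term
    if not positive or not negative:
        return {}
    scores: Dict[str, float] = {}
    for other in set().union(*positive) - {term}:
        pos_fraction = sum(other in d for d in positive) / len(positive)
        neg_count = sum(other in d for d in negative)
        # Assumption: treat a zero count as one so the ratio stays finite.
        neg_fraction = max(neg_count, 1) / len(negative)
        ratio = pos_fraction / neg_fraction
        score = ratio * math.log(ratio)
        if ratio > 1 and score > min_score:
            scores[other] = score
    # Best related terms first.
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))


# Tiny hypothetical corpus; each set is the token set of one document.
corpus = [
    {"garbage collection", "heap", "jvm"},
    {"garbage collection", "heap", "pause"},
    {"ldap", "password", "jvm"},
    {"ldap", "permission"},
    {"query", "facet"},
    {"facet", "jvm"},
]
related = facet_ratio_scores(corpus, "garbage collection")
```

In this toy corpus "jvm" is as common outside the garbage collection documents as inside them, so its ratio is 1 and it drops out, while "heap" and "pause" remain.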
  • 18. Facet Ratios => Keyword Clusters Security ldap 727.7777777777777 permission 540.6349206349206 authentication 499.04761904761904 secure 320.22222222222223 password 231.70068027210885 identity 207.93650793650795 user name 182.984126984127 ssl 152.48677248677248 login 124.76190476190476 port 93.57142857142857 protocol 90.1058201058201 remote 77.97619047619048 connector 74.9288451012589 installation 70.88744588744588 mechanism 69.31216931216932 jetty 57.76014109347443 native 57.582417582417584 directory 43.477633477633475 sharepoint 38.98809523809524 restrict 34.65608465608466 plugin 30.114942528735632 dashboard 28.791208791208792 communicate 23.764172335600907
  • 19. Facet Ratios => Keyword Clusters garbage 1048.4615384615383 pause 813.4615384615383 heap 581.0439560439561 xx 397.6923076923077 bottleneck 325.38461538461536 collector 278.9010989010989 jvm 253.07692307692307 collect 195.23076923076923 crash 144.6153846153846 thread 135.5769230769231 scheme 116.20879120879121 concern 100.11834319526628 low 97.9652605459057 memory 91.77514792899409 slower 90.38461538461539 timestamp 75.08875739644971 log file 72.3076923076923 disk 52.36074270557029 general 46.01398601398601 generation 44.88063660477454 delete 41.26254180602007 size 38.85189437428244 efficient 38.280542986425345 specify 31.990060501296455 Garbage Collection
  • 20. Keyword Vector Document Clustering Use the Keyword Vectors to compute distances between documents rather than raw TF/IDF => Higher Signal To Noise Tokenizer Compute Keyword Vector K-Means Clustering Cluster: 98 stump_the_chump: 15159.8533727 stump: 12931.0599949 prize: 12378.4630507 sight: 2943.0123456 tough: 2872.89050924 question: 2827.60450268 judge: 2353.93441007 submit: 2250.3503055 session: 2147.89226715 panel: 1888.9584879 hostetter: 1722.90005854 grant: 1600.7415686 chump: 1558.95135161 lucene_revolution: 1353.7746721 spot: 1211.58699335 award: 1048.0824900 mock: 1005.09316809 conference: 903.00251411 muir: 878.76730374 seat: 870.91541559 hot: 799.50707482
  • 21. Topic Mapping Semantic or Subject / Categorical Spaces security performance garbage collection authorization saml kerberos username login permission acl qps bottleneck latency generation jvm pause stop the world heap optimization concurrent xx speed password
  • 22. Topic Mapping <doc> <field name="label_s">Solr/Lucene Tech</field> <field name="term_ss">solr</field> <field name="term_ss">lucene</field> <field name="term_ss">search handler</field> <field name="term_ss">request handler</field> <field name="term_ss">solrj</field> <field name="term_ss">term query</field> <field name="term_ss">boolean query</field> <field name="term_ss">span query</field> <field name="term_ss">phrase query</field> <field name="term_ss">queryparser=>query parser</field> <field name="term_ss">fq=>filter query</field> <field name="term_ss">function query</field> <field name="term_ss">bq=>boost query</field> <field name="term_ss">solrconfig xml=>solrconfig.xml</field> <field name="term_ss">edismax</field> <field name="term_ss">dismax</field> <field name="term_ss">analysis=>analyzer</field> <field name="term_ss">positionincrementgap</field> <field name="term_ss">highlighter</field> <field name="term_ss">similarity</field> <field name="term_ss">search index,lucene index=>inverted index</field> <field name="term_ss">token</field> <field name="term_ss">token filter,tokenfilter=>tokenizer</field> <field name="term_ss">field type=>fieldtype</field> <field name="term_ss">schema.xml,schema xml=>schema</field> <field name="term_ss">facet</field> <field name="term_ss">frange=>range</field> <field name="term_ss">trie</field> <field name="term_ss">pivot faceting,facet.pivot,facet pivot=>pivot facet</field> <field name="term_ss">reference guide</field> </doc>
  • 23. Topic Mapping Approach Keyword coverage is more important than density Many-to-Many mapping Documents may cover more than one topic A given keyword may occur in more than one topic area “Democratic” process Keywords are “evidence” for a Topic - “aboutness” is cumulative Enables documents to be mapped to multiple topics - which gives information on topic relatedness Simple threshold to determine topic membership
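The topic mapping approach on slide 23 can be sketched as a coverage count with a simple threshold. The topic keyword sets below are hypothetical, and counting distinct matching keywords with a minimum of two is one possible reading of "coverage is more important than density" and "simple threshold", not the author's confirmed rule.

```python
from typing import Dict, Iterable, List, Set


def map_topics(doc_tokens: Iterable[str], topics: Dict[str, Set[str]],
               min_coverage: int = 2) -> Dict[str, List[str]]:
    """Assign a document to every topic for which enough distinct keywords appear."""
    present = set(doc_tokens)  # repeats are ignored: coverage, not density
    assigned: Dict[str, List[str]] = {}
    for label, keywords in topics.items():
        evidence = present & keywords
        if len(evidence) >= min_coverage:
            assigned[label] = sorted(evidence)
    return assigned


# Hypothetical topic definitions; "password" deliberately belongs to one topic only,
# but nothing prevents a keyword from being listed under several topics.
TOPICS = {
    "security": {"ldap", "password", "permission", "acl"},
    "garbage collection": {"heap", "jvm", "pause", "garbage_collection"},
}
doc = ["heap", "heap", "heap", "jvm", "pause", "password"]
mapping = map_topics(doc, TOPICS)
```

The repeated "heap" counts once, and a single "password" is not enough evidence for the security topic. A document whose tokens cover several topics would be assigned to all of them, which is the many-to-many mapping the slide describes.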