The Well-Tempered Search
Application
Variations on a Theme:
Why does my search app suck, and
what can I do about it?
Ted Sullivan – (old Phuddy Duddy)
Senior (very much so I’m afraid) Solutions (I hope) Architect (and sometime plumber)
Lucidworks Technical Services
Our Basic Premises (Premisi?)
• Lemma 1: Search Applications use algorithms that make
finding chunks of text within large datasets possible in
HTT (human-tolerable time).
• Lemma 2: These algorithms work by breaking text into
primitive components and building up a search
“experience” from that.
• Lemma 3: Lemma 2 is not sufficient to achieve Lemma 1.
The Basic Disconnect
• Text can be analyzed at the level of tokens
(syntax) and at the level of meaning
(semantics).
• We think one way (semantics), search engines
think another (syntax – i.e. token order).
• How do we bridge the gap? … More clever
algorithms!
Art and Science
• We need to be intelligent curators of these
algorithms. Craftsmen (craftswomen?) that
think of these as tools with a specific purpose.
• Like any good craftsperson – we need a wide
array of tools to get the job done (well
almost).
When is my search app done?
• Quick answer: NEVER (ain’t consultin’ great?)
• Long answer – As long as it continues to
improve, like fine wine or bourbon, you are on
the path to enlightenment.
• How do you get there grasshopper? Add
semantic intelligence to the engine!
Search cannot be shrink-wrapped!!
What have we got for Donny behind Curtain #1 Jay?
Well Monty - Heeeeeeeeeeeerrrrrrrreeeeesssss the
Google … SEARCH Appliance!!!!*
Sorry Donny – It’s a ZONK!
* but Google Web Search has some Serious Mojo!
Prelude part 1– The basic problem
The inverted index and “bag-of-words” search:
The red fox jumped over the fence.
Time flies like an arrow. Fruit flies like a banana.
the 1,6
red 2
fox 3
jumped 4
over 5
fence 7
flies 2,7
like 3,8
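A minimal sketch of how postings like the ones above can be built. The class and method names are hypothetical, and the tokenizer is a deliberately naive split on non-alphanumeric characters, not what a Lucene analyzer does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of Lucene or Solr.
class InvertedIndexSketch {

    /** Maps each lower-cased token to the 1-based positions at which it occurs. */
    static Map<String, List<Integer>> index(String text) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        int position = 0;
        // Naive tokenization: split on anything that is not a letter or digit.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue;
            }
            position++;
            postings.computeIfAbsent(token, k -> new ArrayList<>()).add(position);
        }
        return postings;
    }

    public static void main(String[] args) {
        System.out.println(index("The red fox jumped over the fence."));
        System.out.println(index("Time flies like an arrow. Fruit flies like a banana."));
    }
}
```

Note that the index records where "flies" and "like" occur but nothing about whether "flies" is a verb or a noun. That is the bag-of-words problem in miniature.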
Prelude part B – The Tried and True
• Phrase and Proximity boosting and “Slop”
• Synonyms and stop words
• Stemming or Lemmatization
• Autocomplete
• Best Bets / Landing Pages – the
sledgehammer
• Spell check – spell suggest – aka the warm
fuzzies.
Fugue - Subject or Exposition
Search engines need more ‘semantic
awareness’ or at least the illusion of this.
There is a heavy duty solution called Artificial
Intelligence – which except in the fertile
imagination of Hollywood screenwriters, is not
there yet. So we need to fake it just a bit.
Theme and Variations I
autophrasing and the red sofa
Theme: When multiple words mean just one thing.
Fuzzy way: Boosting phrases (proximity and phrase slop)
- pushes false positives down – i.e. out of the limelight
- i.e. - shoves ‘em under the rug
This encounters a problem with faceted search
Like the eye of Sauron in LOTR or Santa Claus, the faceting engine SEES
ALL (sins)!
Brake Pads example: hit on things that have ‘brake’ (like children’s stroller
brakes) and ‘pads’ – like mattress pads.
Variation I: Autophrasing
AutoPhrasingTokenFilter tells Lucene not to
tokenize when a noun phrase represents a single
thing - by providing a flat list of phrases.
Creates one-to-one token mapping that Lucene
prefers because it avoids the “sausagization”
problem.
https://github.com/LucidWorks/auto-phrase-tokenfilter
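The idea can be sketched in plain Java, outside of Lucene. This is not the code of the filter linked above. It is a simplified greedy longest-match pass over a token list, and the class name, method name and `maxWords` parameter are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration class; the real filter works on a Lucene TokenStream.
class AutoPhraseSketch {

    /**
     * Replaces every known phrase with a single underscore-joined token.
     * At each position the longest phrase (up to maxWords words) wins.
     */
    static List<String> autophrase(List<String> tokens, Set<String> phrases, int maxWords) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            int matched = 0;
            for (int len = Math.min(maxWords, tokens.size() - i); len >= 2; len--) {
                String candidate = String.join(" ", tokens.subList(i, i + len));
                if (phrases.contains(candidate)) {
                    matched = len;
                    break;
                }
            }
            if (matched > 0) {
                out.add(String.join("_", tokens.subList(i, i + matched)));
                i += matched;
            } else {
                out.add(tokens.get(i));
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> phrases = new HashSet<>(Arrays.asList("brake pads"));
        System.out.println(autophrase(Arrays.asList("new", "brake", "pads", "for", "sale"), phrases, 3));
    }
}
```

Once "brake pads" is a single token `brake_pads`, a stroller brake or a mattress pad can no longer match it, so the facet counts stay clean as well.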
Autophrasing Example
income tax refund
income tax
tax refund
“income tax” is not income.
A “tax refund” is not a tax.
Solution: Autophrasing + synonym mapping
income tax => tax
tax refund => refund
Autophrasing Setup
autophrases.txt:
income tax
tax refund
tax rebate
sales tax
property tax
synonyms.txt
income_tax,property_tax,sales_tax,tax
tax_refund,refund,rebate,tax_rebate
<fieldType name="text_autophrase" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="com.lucidworks.analysis.AutoPhrasingTokenFilterFactory"
phrases="autophrases.txt"
includeTokens="true" replaceWhitespaceWith="_" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"
enablePositionIncrements="true" />
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
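To see why the autophrased tokens and the synonym file work together, here is a small sketch that reads lines in the Solr synonyms format (comma-separated equivalence groups, or `a => b` mappings) and expands a token. The class and method names are hypothetical, and this is not how Solr's synonym filter is implemented internally.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class; not Solr's synonym filter.
class SynonymSketch {

    /** Builds a term -> expansions map from synonym lines. */
    static Map<String, Set<String>> parse(List<String> lines) {
        Map<String, Set<String>> map = new HashMap<>();
        for (String line : lines) {
            String[] sides = line.split("=>");
            if (sides.length == 2) {
                // Explicit mapping: each left-hand term maps to the right-hand terms.
                Set<String> targets = terms(sides[1]);
                for (String source : terms(sides[0])) {
                    map.computeIfAbsent(source, k -> new LinkedHashSet<>()).addAll(targets);
                }
            } else {
                // Equivalence group: every term maps to the whole group.
                Set<String> group = terms(line);
                for (String term : group) {
                    map.computeIfAbsent(term, k -> new LinkedHashSet<>()).addAll(group);
                }
            }
        }
        return map;
    }

    static Set<String> terms(String csv) {
        Set<String> out = new LinkedHashSet<>();
        for (String t : csv.split(",")) {
            String s = t.trim();
            if (!s.isEmpty()) {
                out.add(s);
            }
        }
        return out;
    }

    /** Returns the expansions for a token, or just the token when it has no synonyms. */
    static Set<String> expand(String token, Map<String, Set<String>> synonyms) {
        return synonyms.getOrDefault(token, Collections.singleton(token));
    }
}
```

With the two synonym lines from the setup above, `income_tax` expands to include `tax` but not `refund`, and `tax_refund` expands to include `refund` but not `tax`. That is the behavior the example slide asks for.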
Multi-term synonym problem
• New York, New York – it’s a HELLOVA town!
Subject was inspired by an old JIRA ticket: LUCENE-1622
“if multi-word synonyms are indexed together with the
original token stream (at overlapping positions), then a query
for a partial synonym sequence (e.g., “big” in the synonym
“big apple” for “new york city”) causes the document to
match”
(or “apple” which will hit on my blog post if you crawl
lucidworks.com !)
This means certain phrase queries should match but don't (e.g.: "hotspot
is down"), and other phrase queries shouldn't match but do (e.g.: "fast
hotspot fi").
Other cases do work correctly (e.g.: "fast hotspot"). We refer to this
"lossy serialization" as sausagization, because the incoming graph is
unexpectedly turned from a correct word lattice into an incorrect
sausage.
This limitation is challenging to fix: it requires changing the index format
(and Codec APIs) to store an additional int position length per position,
and then fixing positional queries to respect this value.
Sausagization: from Mike McCandless blog Changing Bits:
Lucene's TokenStreams are actually graphs!
http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
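The three phrase queries quoted above can be reproduced with a toy positional index. The indexed sentence ("fast wi fi network is down", with "hotspot" injected as a synonym for "wi fi network") is an assumed reconstruction of the blog's example, and the class is hypothetical. The point is only that positions are stored while position lengths are not.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class; models an index that drops position length.
class SausageSketch {

    /** position -> tokens stored at that position. */
    static Map<Integer, Set<String>> index() {
        Map<Integer, Set<String>> positions = new HashMap<>();
        add(positions, 0, "fast");
        add(positions, 1, "wi");
        add(positions, 1, "hotspot"); // spans three positions in the graph, but that is not stored
        add(positions, 2, "fi");
        add(positions, 3, "network");
        add(positions, 4, "is");
        add(positions, 5, "down");
        return positions;
    }

    static void add(Map<Integer, Set<String>> positions, int pos, String token) {
        positions.computeIfAbsent(pos, k -> new HashSet<>()).add(token);
    }

    /** A phrase matches when its terms sit at consecutive positions. */
    static boolean phraseMatches(Map<Integer, Set<String>> positions, String... phrase) {
        for (int start : positions.keySet()) {
            boolean all = true;
            for (int i = 0; i < phrase.length; i++) {
                Set<String> at = positions.get(start + i);
                if (at == null || !at.contains(phrase[i])) {
                    all = false;
                    break;
                }
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<Integer, Set<String>> idx = index();
        System.out.println(phraseMatches(idx, "hotspot", "is", "down"));
        System.out.println(phraseMatches(idx, "fast", "hotspot", "fi"));
        System.out.println(phraseMatches(idx, "fast", "hotspot"));
    }
}
```

Because "hotspot" is squeezed into position 1, the index believes "fi" follows it, and "is" is three positions away instead of one. Collapsing the phrase to a single token before synonym expansion sidesteps the problem.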
Multi-term synonym demo
autophrases.txt
new york
new york state
empire state
new york city
new york new york
big apple
ny ny
city of new york
state of new york
ny state
synonyms.txt
new_york => new_york_state,new_york_city,big_apple,
new_york_new_york,ny_ny,nyc,empire_state,ny_state,
state_of_new_york
new_york_state,empire_state,ny_state,
state_of_new_york
new_york_city,big_apple,new_york_new_york,ny_ny,nyc,
city_of_new_york
This document is about new york state.
This document is about new york city.
There is a lot going on in NYC.
I heart the big apple.
The empire state is a great state.
New York, New York is a hellova town.
I am a native of the great state of New York.
new york new york city new york state
Multi-term synonym demo
/select /autophrase
empire state
(Even a blind squirrel finds a nut once in a while)
Variation II: The Red Sofa Problem
{
"response":{"numFound":3,"start":0,"docs":[
{
"color":"red",
"text":"This is the red sofa example. Please find with 'red sofa' query."},
{
"color":"red",
"text":"This is a red beach ball. It is red in color but is not something that you
should sit on because you would tend to roll off."},
{
"color":"blue",
"text":"This is a blue sofa, it should only hit on sofas that are blue in color."}]
}}
OOTB – q=red sofa is interpreted as text:red text:sofa (default OR)
http://localhost:8983/solr/collection1/select?q=red+sofa&wt=json
Closing the Loop:
Content Tagging and Intelligent Query Filtering
Using the search index itself as the knowledge source:
Solution for the Red Sofa problem
Query Autofiltering: Search Index driven query introspection /
query rewriting:
Lucene FieldCache Magic
Lucene FieldCache (to be renamed UninvertedIndex in Lucene 5.0)
Inverted Index:
Show all documents that have this term value in this field.
Uninverted or Forward Index:
Show all term values that have been indexed in this field.
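SolrIndexSearcher searcher = rb.req.getSearcher();
// Uninvert the category field: gives access to every term value indexed in it
SortedDocValues fieldValues = FieldCache.DEFAULT.getTermsIndex(
    searcher.getAtomicReader(), categoryField );
…
// Split the raw query on whitespace and punctuation (note the escaped double quote)
StringTokenizer strtok = new StringTokenizer( query, " .,:;\"'" );
while (strtok.hasMoreTokens()) {
    String tok = strtok.nextToken().toLowerCase();
    BytesRef key = new BytesRef( tok );  // UTF-8 bytes of the token
    // lookupTerm returns a non-negative ordinal when the token is a value of the field
    if (fieldValues.lookupTerm( key ) >= 0) {
        …
    }
}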
Query Autofiltering
{
"response":{"numFound":1,"start":0,"docs":[
{
"id":"1",
"color":"red",
"description":"This is the red sofa example. Please find with 'red sofa' query."]
}
http://localhost:8983/solr/collection1/infer?q=red+sofa&wt=json
Now a search for “red sofa” returns only … red sofas!
But – is this too “brute force”? The takeaway is that using
the search index AS a knowledge store can be very powerful!
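The before and after of the red sofa example can be sketched without Solr. Everything here is a stand-in: the class, the in-memory documents (ids "2" and "3" are made up, and the beach ball text is shortened) and the matching logic are assumptions that only mimic the default OR behavior and the autofiltering idea.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class; not the Solr search component.
class AutofilterSketch {

    static final class Doc {
        final String id;
        final String color;
        final String text;

        Doc(String id, String color, String text) {
            this.id = id;
            this.color = color;
            this.text = text;
        }
    }

    static List<Doc> sampleDocs() {
        return Arrays.asList(
            new Doc("1", "red", "This is the red sofa example. Please find with 'red sofa' query."),
            new Doc("2", "red", "This is a red beach ball. It is red in color."),
            new Doc("3", "blue", "This is a blue sofa, it should only hit on sofas that are blue in color."));
    }

    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        for (String t : s.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Default behavior: a document hits if its text contains ANY query token. */
    static List<String> searchOr(List<Doc> docs, String query) {
        List<String> queryTokens = tokenize(query);
        List<String> hits = new ArrayList<>();
        for (Doc d : docs) {
            List<String> docTokens = tokenize(d.text);
            for (String q : queryTokens) {
                if (docTokens.contains(q)) {
                    hits.add(d.id);
                    break;
                }
            }
        }
        return hits;
    }

    /** Query tokens that are known values of the color field become a filter. */
    static List<String> searchAutofiltered(List<Doc> docs, String query) {
        Set<String> knownColors = new HashSet<>();
        for (Doc d : docs) {
            knownColors.add(d.color); // stands in for the uninverted field values
        }
        Set<String> colorFilter = new HashSet<>();
        StringBuilder rest = new StringBuilder();
        for (String tok : tokenize(query)) {
            if (knownColors.contains(tok)) {
                colorFilter.add(tok);
            } else {
                rest.append(tok).append(' ');
            }
        }
        List<Doc> filtered = new ArrayList<>();
        for (Doc d : docs) {
            if (colorFilter.isEmpty() || colorFilter.contains(d.color)) {
                filtered.add(d);
            }
        }
        if (rest.length() == 0) {
            List<String> ids = new ArrayList<>();
            for (Doc d : filtered) {
                ids.add(d.id);
            }
            return ids;
        }
        return searchOr(filtered, rest.toString());
    }
}
```

Here `searchOr(sampleDocs(), "red sofa")` hits all three documents, while `searchAutofiltered(sampleDocs(), "red sofa")` treats "red" as a color filter and leaves only the red sofa.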
Architecture: it's all about Plumbing
• Pipelines for every occasion.
Indexing Pipelines – good ol' ETL
- Content enrichment, tagging
- Metadata cleanup
Query Pipelines
– identification, query preprocessing
- introspection
One is the “hand” the other, the “glove”
Index Pipelines
Lots of choices here:
• Internal to Solr – DIH, UpdateRequestProcessor
Pros and cons
• External – Morphlines, Open Pipeline, Flume,
Spark, Hadoop, Custom SolrJ
• Lucidworks Fusion
Entity and Fact Extraction
Entities:
Things, Locations, Dates, People, Organizations, Concepts
Entity Relationships
Company was acquired by Company
Drug cures Disease
Person likes Pizza
Annotation Pipelines (UIMA, Lucidworks Fusion):
Entity Extraction followed by Fact Extraction
Pattern method:
$Drug is used to treat $Condition
Parts of Speech (POS) analysis
Subject Predicate Object
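A rough sketch of the pattern method, assuming entity extraction has already produced lists of known drugs and conditions. The class name, the entity lists and the "treats" predicate label are made up for illustration, both lists are assumed to be non-empty, and a real annotation pipeline such as UIMA works quite differently under the hood.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not a UIMA or Fusion component.
class FactPatternSketch {

    /** Finds "$Drug is used to treat $Condition" and returns subject/predicate/object triples. */
    static List<String[]> extract(String text, Set<String> drugs, Set<String> conditions) {
        Pattern pattern = Pattern.compile(
            "\\b(" + alternation(drugs) + ") is used to treat (" + alternation(conditions) + ")\\b",
            Pattern.CASE_INSENSITIVE);
        Matcher m = pattern.matcher(text);
        List<String[]> facts = new ArrayList<>();
        while (m.find()) {
            facts.add(new String[] {m.group(1), "treats", m.group(2)});
        }
        return facts;
    }

    /** Joins the entity names into a regex alternation, quoting each one literally. */
    static String alternation(Set<String> terms) {
        List<String> quoted = new ArrayList<>();
        for (String t : terms) {
            quoted.add(Pattern.quote(t));
        }
        return String.join("|", quoted);
    }
}
```

Restricting both slots to already-extracted entities is what keeps "Pizza is used to treat hunger" from turning into a medical fact.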
Theme and Variations II
The Classification Wars
• Machine Learning or Taxonomy – is it a Floor
Wax or a Dessert Topping?
Answer: It’s a floor wax AND a dessert topping! It’s delicious and just look at that shine!
Machine Learning
Use mathematical vector-crunching algorithms like Latent
Dirichlet Allocation (LDA), Bayesian Inference, Maximum
Entropy, log likelihood, Support Vector Machines (SVM) etc., to
find patterns and to associate those patterns with concepts.
Can be supervised (i.e. given a training set) or unsupervised (the
algorithm just finds clusters). Supervised learners are called
semi-automatic classifiers.
Check out Taming Text by Ingersoll, Morton and
Farris (Manning)
Machine Learning in Lucidworks Fusion
Training Data -> NLP Trainer Stage -> NLP Model
Test Data -> NLP Classifier Stage -> Classified Documents
Taxonomy or Ontology
“Knowledge graphs” that relate things and concepts to each other
either hierarchically or associatively.
Pros:
Works without large amounts of content to analyze
Encapsulates the knowledge of human subject matter experts
Cons:
Often not well designed for search (mixes semantic relationship
types / organizational logic)
Requires curation by subject matter experts whose time is costly
Taxonomies Designed for Search
Category nodes and Evidence nodes
Category Node:
A ‘parent’ node
Can have child nodes that are:
Sub Categories
Evidence Nodes
Evidence Node:
Tends to be a leaf node (no children)
Contains keyterms (synonyms)
May contain “rules” e.g. (if contains term a and term b but not term c)
Evidence Nodes can have more than one category node parent
Hits on Evidence Nodes add to the cumulative score of a Category Node.
Scores can be diluted as they accumulate up the hierarchy – so that the
nearest category gets the strongest ‘vote’.
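The voting scheme might look something like this. The dilution factor of 0.5, the class name and the sample edges between evidence terms and categories are all assumptions for the sake of the sketch, and the category hierarchy is assumed to be a tree with no cycles.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class; edges and weights are invented.
class TaxonomyScoreSketch {

    static final double DILUTION = 0.5; // assumed: each level up gets half the vote

    /** evidence keyterm -> category nodes it is evidence for (more than one parent allowed). */
    static Map<String, List<String>> sampleEvidence() {
        Map<String, List<String>> evidence = new HashMap<>();
        evidence.put("ford", Arrays.asList("Automobiles", "US Corporations"));
        evidence.put("toyota", Arrays.asList("Automobiles", "Japanese"));
        evidence.put("boeing", Arrays.asList("Aircraft", "US Corporations"));
        return evidence;
    }

    /** category -> parent category. */
    static Map<String, String> sampleParents() {
        Map<String, String> parents = new HashMap<>();
        parents.put("Automobiles", "Manufacturing");
        parents.put("Aircraft", "Manufacturing");
        parents.put("Manufacturing", "Fortune 100 Companies");
        parents.put("Japanese", "Foreign Corporations");
        return parents;
    }

    /** Each evidence hit votes for its categories; the vote shrinks at every level it climbs. */
    static Map<String, Double> score(String text,
                                     Map<String, List<String>> evidence,
                                     Map<String, String> parentOf) {
        Set<String> tokens = new HashSet<>(
            Arrays.asList(text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")));
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, List<String>> e : evidence.entrySet()) {
            if (!tokens.contains(e.getKey())) {
                continue;
            }
            for (String category : e.getValue()) {
                double vote = 1.0;
                String node = category;
                while (node != null) {
                    scores.merge(node, vote, Double::sum);
                    vote *= DILUTION;
                    node = parentOf.get(node);
                }
            }
        }
        return scores;
    }
}
```

For a text mentioning Ford and Toyota, "Automobiles" collects two full votes, "Manufacturing" gets two half votes, and "Fortune 100 Companies" gets two quarter votes, so the nearest category wins.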
Fortune 100 Companies
  Energy
  Financial Services
    Investment Banks
    Commercial Banks
  Health Care
    Health Insurance
    HMO
    Medical Devices
    Pharmaceuticals
  Hospitality
  Manufacturing
    Aircraft
    Automobiles
    Electrical Equipment
Evidence nodes:
  Ford, GM, Chrysler, Toyota, BMW
  GE, Boeing
  Bank of America, Hyatt
US Corporations
Foreign Corporations
  British
  Chinese
  French
  German
  Japanese
  Russian
  etc.
Query Pipelines
The ‘Wh’ Words: Who, What, When, Where
Who are they (authentication)?
What can they see (security - authorization)?
When can they see it (entitlement)?
What are they interested in (personalization / recommendation)?
Where are they now (location)?
Query Pipelines
Inferential Search
Query introspection -> Query modification.
Query Autofiltering
Are you feeling lucky today?
Topic boosting / spotlighting
Use ML to detect the topic, then boost and/or spotlight
results tagged this way.
Use a specialized collection to store ‘facet knowledge’
The Art of the Fugue:
Inferential Search
• Infer what the user is looking for and give them
that
• Clever software infers meaning aka query
“intent”
• When we do this right, it appears to be magic!
Machine Learning Drives Query Introspection
Training Data -> NLP Trainer Stage -> NLP Model
Test Data -> NLP Classifier Stage -> Classified Documents
Machine Learning models can drive Query Introspection
NL Query -> NLP Query Stage (NLP Model) -> Tagged Query -> Landing Page / Boost Documents
Da Capo al Coda
• Killer search apps are crafted from fine
ingredients and like fine whiskey will get
better with age - if you are paying attention to
‘what’ your users are looking for.
• Putting the pieces together requires an
understanding of ‘what’ things they are looking for,
independent of what words they use to describe them.
Thanks for your attention!
Ted Sullivan
Lucidworks, Technical Services
ted.sullivan@lucidworks.com
Skype: ted.sullivan5
LinkedIn
Metuchen, New Jersey
(You gotta problem with that?)

The well tempered search application

  • 1. The Well-Tempered Search Application Variations on a Theme: Why does my search app suck, and what can I do about it? Ted Sullivan – (old Phuddy Duddy) Senior (very much so I’m afraid) Solutions (I hope) Architect (and sometime plumber) Lucidworks Technical Services
  • 2. Our Basic Premises (Premisi?) • Lemma 1: Search Applications use algorithms that make finding chunks of text within large datasets possible in HTT (human-tolerable time). • Lemma 2: These algorithms work by breaking text into primitive components and building up a search “experience” from that. • Lemma 3: Lemma 2 is not sufficient to achieve Lemma 1.
  • 3. The Basic Disconnect • Text can be analyzed at the level of tokens (syntax) and at the level of meaning (semantics). • We think one way (semantics), search engines think another (syntax – i.e. token order). • How do we bridge the gap? … More clever algorithms!
  • 4. Art and Science • We need to be intelligent curators of these algorithms. Craftsmen (craftswomen?) that think of these as tools with a specific purpose. • Like any good craftsperson – we need a wide array of tools to get the job done (well almost).
  • 5. When is my search app done? • Quick answer: NEVER (ain’t consultin’ great?) • Long answer – As long as it is continues to improve, like fine wine or bourbon, you are on the path to enlightenment. • How do you get there grasshopper? Add semantic intelligence to the engine!
  • 6. Search cannot be shrink-wrapped!! What have we got for Donny behind Curtain #1 Jay? Well Monty - Heeeeeeeeeeeerrrrrrrreeeeesssss the Google … SEARCH Appliance!!!!
  • 7. Search cannot be shrink-wrapped!! What have we got for Donny behind Curtain #1 Jay? Well Monty - Heeeeeeeeeeeerrrrrrrreeeeesssss the Google … SEARCH Appliance!!!!* Sorry Donny – It’s a ZONK! * but Google Web Search has some Serious Mojo!
  • 8. Prelude part 1– The basic problem The inverted index and “bag-of-words” search: The red fox jumped over the fence. Time flies like an arrow. Fruit flies like a banana. the 1,6 red 2 fox 3 jumped 4 over 5 fence 7 flies 2,7 like 3,8
  • 9. Prelude part B – The Tried and True • Phrase and Proximity boosting and “Slop” • Synonyms and stop words • Stemming or Lemmatization • Autocomplete • Best Bets / Landing Pages – the sledgehammer • Spell check – spell suggest – aka the warm fuzzies.
  • 10. Fugue - Subject or Exposition. Search engines need more ‘semantic awareness’, or at least the illusion of it. There is a heavy-duty solution called Artificial Intelligence, which, except in the fertile imagination of Hollywood screenwriters, is not there yet. So we need to fake it just a bit.
  • 11. Theme and Variations I: autophrasing and the red sofa. Theme: When multiple words mean just one thing. Fuzzy way: Boosting phrases (proximity and phrase slop) pushes false positives down, i.e. out of the limelight, i.e. shoves ‘em under the rug. This encounters a problem with faceted search: like the eye of Sauron in LOTR or Santa Claus, the faceting engine SEES ALL (sins)! Brake Pads example: the query hits on things that have ‘brake’ (like children’s stroller brakes) and ‘pads’ (like mattress pads).
  • 12. Variation I: Autophrasing AutophrasingTokenFilter tells Lucene not to tokenize when a noun phrase represents a single thing - by providing a flat list of phrases. Creates one-to-one token mapping that Lucene prefers because it avoids the “sausagization” problem. https://github.com/LucidWorks/auto-phrase-tokenfilter
  • 13. Autophrasing Example. The phrase “income tax refund” contains both “income tax” and “tax refund”. “income tax” is not income. A “tax refund” is not a tax. Solution: Autophrasing + synonym mapping: income tax => tax; tax refund => refund.
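To make the idea concrete, here is a rough sketch of phrase merging over a token list. It is not the AutoPhrasingTokenFilter code, which operates on Lucene TokenStreams; `AutophraseSketch`, `autophrase` and the `maxPhraseLength` parameter are hypothetical names, and the greedy longest-match, left-to-right handling of overlapping phrases is an assumption made for the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; the real filter works on Lucene TokenStreams.
class AutophraseSketch {

    // Greedy longest-match, left to right: known phrases become single tokens
    // with whitespace replaced by '_' (as in replaceWhitespaceWith="_").
    static List<String> autophrase(List<String> tokens, Set<String> phrases, int maxPhraseLength) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            int consumed = 0;
            for (int len = Math.min(maxPhraseLength, tokens.size() - i); len >= 2; len--) {
                String candidate = String.join(" ", tokens.subList(i, i + len));
                if (phrases.contains(candidate)) {
                    out.add(candidate.replace(' ', '_'));
                    consumed = len;
                    break;
                }
            }
            if (consumed == 0) {
                out.add(tokens.get(i)); // no phrase starts here: keep the single token
                consumed = 1;
            }
            i += consumed;
        }
        return out;
    }
}
```

Under these assumptions, `income tax refund` becomes `income_tax refund`, so a synonym rule on `income_tax` can map it to `tax` without the bare token `income` ever entering the picture.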
  • 14. Autophrasing Setup
      autophrases.txt:
        income tax
        tax refund
        tax rebate
        sales tax
        property tax
      synonyms.txt:
        income_tax,property_tax,sales_tax,tax
        tax_refund,refund,rebate,tax_rebate
      <fieldType name="text_autophrase" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="com.lucidworks.analysis.AutoPhrasingTokenFilterFactory" phrases="autophrases.txt" includeTokens="true" replaceWhitespaceWith="_" />
          <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
          <filter class="solr.PorterStemFilterFactory"/>
        </analyzer>
      </fieldType>
  • 15. Multi-term synonym problem • New York, New York – it’s a HELLOVA town! Subject was inspired by an old JIRA ticket, LUCENE-1622: “if multi-word synonyms are indexed together with the original token stream (at overlapping positions), then a query for a partial synonym sequence (e.g., “big” in the synonym “big apple” for “new york city”) causes the document to match” (or “apple”, which will hit on my blog post if you crawl lucidworks.com!)
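The partial-match problem from the ticket can be illustrated with a toy model. This sketch ignores positions entirely and only tracks which terms end up in the index for a document; `PartialSynonymSketch` and `indexedTerms` are hypothetical names, and the `autophrase` switch is a simplification of what autophrasing does, not a description of Lucene's SynonymFilter.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical toy model: tracks indexed terms only, no positions.
class PartialSynonymSketch {

    // Returns the set of terms indexed for 'text' when 'synonym' is added
    // wherever 'phrase' occurs (both given in lowercase, space-separated).
    static Set<String> indexedTerms(String text, String phrase, String synonym, boolean autophrase) {
        String lower = text.toLowerCase();
        Set<String> terms = new HashSet<>();
        if (lower.contains(phrase)) {
            if (autophrase) {
                // Phrase and synonym are each indexed as ONE token.
                terms.add(phrase.replace(' ', '_'));
                terms.add(synonym.replace(' ', '_'));
                lower = lower.replace(phrase, " ");
            } else {
                // Word-by-word expansion: every word of the synonym is indexed.
                terms.addAll(Arrays.asList(synonym.split(" ")));
            }
        }
        for (String token : lower.split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }
}
```

With word-by-word expansion, a document about "new york city" also contains the term `apple`, so a query for apples matches it. With single-token phrases, only `big_apple` and `new_york_city` are indexed and the stray match disappears.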
  • 16. This means certain phrase queries should match but don't (e.g.: "hotspot is down"), and other phrase queries shouldn't match but do (e.g.: "fast hotspot fi"). Other cases do work correctly (e.g.: "fast hotspot"). We refer to this "lossy serialization" as sausagization, because the incoming graph is unexpectedly turned from a correct word lattice into an incorrect sausage. This limitation is challenging to fix: it requires changing the index format (and Codec APIs) to store an additional int position length per position, and then fixing positional queries to respect this value. Sausagization: from Mike McCandless blog Changing Bits: Lucene's TokenStreams are actually graphs! http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
  • 17. Multi-term synonym demo
      autophrases.txt:
        new york
        new york state
        empire state
        new york city
        new york new york
        big apple
        ny ny
        city of new york
        state of new york
        ny state
      synonyms.txt:
        new_york => new_york_state,new_york_city,big_apple,new_york_new_york,ny_ny,nyc,empire_state,ny_state,state_of_new_york
        new_york_state,empire_state,ny_state,state_of_new_york
        new_york_city,big_apple,new_york_new_york,ny_ny,nyc,city_of_new_york
  • 18. This document is about new york state. This document is about new york city. There is a lot going on in NYC. I heart the big apple. The empire state is a great state. New York, New York is a hellova town. I am a native of the great state of New York. new york new york city new york state Multi-term synonym demo /select /autophrase
  • 19. This document is about new york state. This document is about new york city. There is a lot going on in NYC. I heart the big apple. The empire state is a great state. New York, New York is a hellova town. I am a native of the great state of New York. empire state Multi-term synonym demo /select /autophrase (Even a blind squirrel finds a nut once in a while)
  • 20. Variation II: The Red Sofa Problem
      { "response":{"numFound":3,"start":0,"docs":[
        { "color":"red", "text":"This is the red sofa example. Please find with 'red sofa' query."},
        { "color":"red", "text":"This is a red beach ball. It is red in color but is not something that you should not sit on because you would tend to roll off."},
        { "color":"blue", "text":"This is a blue sofa, it should only hit on sofas that are blue in color."}]
      }}
      OOTB – q=red sofa is interpreted as text:red text:sofa (default OR)
      http://localhost:8983/solr/collection1/select?q=red+sofa&wt=json
  • 21. Closing the Loop: Content Tagging and Intelligent Query Filtering Using the search index itself as the knowledge source:
  • 22. Solution for the Red Sofa problem Query Autofiltering: Search Index driven query introspection / query rewriting:
  • 23. Lucene FieldCache Magic. Lucene FieldCache (to be renamed UninvertedIndex in Lucene 5.0). Inverted Index: Show all documents that have this term value in this field. Uninverted or Forward Index: Show all term values that have been indexed in this field.
      SolrIndexSearcher searcher = rb.req.getSearcher();
      SortedDocValues fieldValues = FieldCache.DEFAULT.getTermsIndex(searcher.getAtomicReader(), categoryField);
      …
      // Split the query on whitespace and common punctuation.
      StringTokenizer strtok = new StringTokenizer(query, " .,:;\"'");
      while (strtok.hasMoreTokens()) {
          String tok = strtok.nextToken().toLowerCase();
          BytesRef key = new BytesRef(tok.getBytes(StandardCharsets.UTF_8));
          // lookupTerm returns a non-negative ordinal when tok is an indexed value of categoryField.
          if (fieldValues.lookupTerm(key) >= 0) {
              …
          }
      }
  • 24. Query Autofiltering
      { "response":{"numFound":1,"start":0,"docs":[
        { "id":"1", "color":"red", "description":"This is the red sofa example. Please find with 'red sofa' query."}]
      }}
      http://localhost:8983/solr/collection1/infer?q=red+sofa&wt=json
      Now a search for “red sofa” returns only … red sofas! But – is this too “brute force”? The takeaway is that using the search index AS a knowledge store can be very powerful!
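The rewriting step behind this result can be sketched without any Solr dependencies. This is an assumption-laden illustration, not the code behind the `/infer` handler: `AutofilterSketch` and `rewrite` are hypothetical names, the known field values are passed in as a plain map instead of being read from the FieldCache, and the output is an unencoded parameter string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; field values would come from the index in practice.
class AutofilterSketch {

    // Tokens that are known values of some field become filter queries (fq);
    // everything else stays in the free-text query (q).
    static String rewrite(String query, Map<String, Set<String>> fieldValues) {
        List<String> freeText = new ArrayList<>();
        List<String> filters = new ArrayList<>();
        for (String token : query.trim().toLowerCase().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            String matchedField = null;
            for (Map.Entry<String, Set<String>> entry : fieldValues.entrySet()) {
                if (entry.getValue().contains(token)) {
                    matchedField = entry.getKey();
                    break;
                }
            }
            if (matchedField != null) {
                filters.add("fq=" + matchedField + ":" + token);
            } else {
                freeText.add(token);
            }
        }
        // "*:*" is Solr's match-all query, used when only filters remain.
        String q = freeText.isEmpty() ? "*:*" : String.join(" ", freeText);
        List<String> params = new ArrayList<>();
        params.add("q=" + q);
        params.addAll(filters);
        return String.join("&", params);
    }
}
```

Given `color` values of `red` and `blue`, the query `red sofa` is rewritten to `q=sofa&fq=color:red`, which excludes both the red beach ball and the blue sofa. Whether a hard filter or a softer boost is the better choice is the "too brute force?" question the slide raises.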
  • 25. Architecture: it’s all about Plumbing • Pipelines for every occasion. Indexing Pipelines – good ol’ ETL: content enrichment, tagging, metadata cleanup. Query Pipelines – identification, query preprocessing, introspection. One is the “hand”, the other the “glove”.
  • 26. Index Pipelines Lots of choices here: • Internal to Solr – DIH, UpdateRequestProcessor Pros and cons • External – Morphlines, Open Pipeline, Flume, Spark, Hadoop, Custom SolrJ • Lucidworks Fusion
  • 27. Entity and Fact Extraction. Entities: Things, Locations, Dates, People, Organizations, Concepts. Entity Relationships: Company was acquired by Company; Drug cures Disease; Person likes Pizza. Annotation Pipelines (UIMA, Lucidworks Fusion): Entity Extraction followed by Fact Extraction. Pattern method: $Drug is used to treat $Condition. Parts of Speech (POS) analysis: Subject, Predicate, Object.
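The pattern method can be approximated with a regular expression. This is a deliberately crude sketch: `FactPatternSketch` and `extract` are hypothetical names, each slot is matched as a single word rather than as an entity found by an earlier extraction stage, and the sample sentences in the usage note are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of "$Drug is used to treat $Condition".
class FactPatternSketch {

    // Each slot is approximated by one word; a real pipeline would first tag
    // entities and then match the pattern against the tags.
    private static final Pattern TREATS = Pattern.compile("(\\w+) is used to treat (\\w+)");

    // Returns {subject, predicate, object} triples found in the text.
    static List<String[]> extract(String text) {
        List<String[]> facts = new ArrayList<>();
        Matcher matcher = TREATS.matcher(text);
        while (matcher.find()) {
            facts.add(new String[] { matcher.group(1), "treats", matcher.group(2) });
        }
        return facts;
    }
}
```

For the text "Aspirin is used to treat headaches. Insulin is used to treat diabetes." this yields two subject/predicate/object triples, one per sentence.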
  • 28. Theme and Variations II: The Classification Wars • Machine Learning or Taxonomy – is it a Floor Wax or a Dessert Topping? Answer: It’s a floor wax AND a dessert topping! It’s delicious and just look at that shine!
  • 29. Machine Learning. Use mathematical vector-crunching algorithms like Latent Dirichlet Allocation (LDA), Bayesian Inference, Maximum Entropy, log likelihood, Support Vector Machines (SVM), etc., to find patterns and to associate those patterns with concepts. Can be supervised (i.e. given a training set) or unsupervised (the algorithm just finds clusters). Supervised learners are called semi-automatic classifiers. Check out Taming Text by Ingersoll, Morton and Farris (Manning).
  • 30. Machine Learning in Lucidworks Fusion: Training Data, NLP Trainer Stage, NLP Model, Test Data, NLP Classifier Stage, Classified Documents
  • 31. Taxonomy or Ontology: “Knowledge graphs” that relate things and concepts to each other either hierarchically or associatively. Pros: Works without large amounts of content to analyze; encapsulates the knowledge of human subject matter experts. Cons: Often not well designed for search (mixes semantic relationship types / organizational logic); requires curation by subject matter experts whose time is costly.
  • 32. Taxonomies Designed for Search: Category nodes and Evidence nodes. Category Node: a ‘parent’ node; can have child nodes that are Sub Categories or Evidence Nodes. Evidence Node: tends to be a leaf node (no children); contains keyterms (synonyms); may contain “rules”, e.g. (if contains term a and term b but not term c). Evidence Nodes can have more than one category node parent. Hits on Evidence Nodes add to the cumulative score of a Category Node. Scores can be diluted as they accumulate up the hierarchy, so that the nearest category gets the strongest ‘vote’.
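The diluted scoring described on this slide might look roughly like the following. Everything here is an assumption made for illustration: the `TaxonomyScoreSketch`, `Node` and `vote` names are hypothetical, and the slide does not say how dilution is computed, so the sketch simply multiplies the weight by a fixed `decay` factor at each step up the hierarchy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of evidence hits voting for category nodes.
class TaxonomyScoreSketch {

    static class Node {
        final String name;
        final List<Node> parents = new ArrayList<>();
        double score = 0.0;

        // An evidence node may have more than one category parent.
        Node(String name, Node... parents) {
            this.name = name;
            this.parents.addAll(Arrays.asList(parents));
        }
    }

    // Records a hit on an evidence node: each direct parent category gets the
    // full weight, and the vote is multiplied by 'decay' per level above that.
    static void vote(Node evidence, double weight, double decay) {
        for (Node parent : evidence.parents) {
            propagate(parent, weight, decay);
        }
    }

    private static void propagate(Node category, double weight, double decay) {
        category.score += weight;
        for (Node parent : category.parents) {
            propagate(parent, weight * decay, decay);
        }
    }
}
```

With a decay of 0.5, a hit on an evidence node such as "Ford" gives its nearest category, Automobiles, a full vote and the grandparent, Manufacturing, half a vote, so the nearest category wins when scores are compared.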
  • 33. US Corporations Foreign Corporations British Chinese French German Japanese Russian etc. Fortune 100 Companies Energy Financial Services Investment Banks Commercial Banks Health Care Health Insurance HMO Medical Devices Pharmaceuticals Hospitality Manufacturing Aircraft Automobiles Electrical Equipment
  • 34. Ford, GM, Chrysler Fortune 100 Companies Energy Financial Services Investment Banks Commercial Banks Health Care Health Insurance HMO Medical Devices Pharmaceuticals Hospitality Manufacturing Aircraft Automobiles Electrical Equipment US Corporations Foreign Corporations British Chinese French German Japanese Russian etc.
  • 35. Ford, GM, Chrysler,Toyota,BMW Fortune 100 Companies Energy Financial Services Investment Banks Commercial Banks Health Care Health Insurance HMO Medical Devices Pharmaceuticals Hospitality Manufacturing Aircraft Automobiles Electrical Equipment US Corporations Foreign Corporations British Chinese French German Japanese Russian etc.
  • 36. Fortune 100 Companies Energy Financial Services Investment Banks Commercial Banks Health Care Health Insurance HMO Medical Devices Pharmaceuticals Hospitality Manufacturing Aircraft Automobiles Electrical Equipment Ford, GM, Chrysler,Toyota,BMW GE, Boeing US Corporations Foreign Corporations British Chinese French German Japanese Russian etc.
  • 37. Fortune 100 Companies Energy Financial Services Investment Banks Commercial Banks Health Care Health Insurance HMO Medical Devices Pharmaceuticals Hospitality Manufacturing Aircraft Automobiles Electrical Equipment Ford, GM, Chrysler,Toyota,BMW GE, Boeing Bank of America, Hyatt US Corporations Foreign Corporations British Chinese French German Japanese Russian etc.
  • 38. Query Pipelines The ‘Wh’ Words: Who, What, When, Where Who are they (authentication)? What can they see (security - authorization)? When can they see it (entitlement)? What are they interested in (personalization / recommendation)? Where are they now (location)?
  • 39. Query Pipelines: Inferential Search. Query introspection -> Query modification. Query Autofiltering: Are you feeling lucky today? Topic boosting / spotlighting: Use ML to detect the topic, then boost and/or spotlight results tagged this way. Use a specialized collection to store ‘facet knowledge’.
  • 40. The Art of the Fugue: Inferential Search • Infer what the user is looking for and give them that • Clever software infers meaning aka query “intent” • When we do this right, it appears to be magic!
  • 41. Machine Learning Drives Query Introspection Training Data NLP Trainer Stage NLP Model Test Data NLP Classifier Stage Classified Documents
  • 42. Machine Learning models can drive Query Introspection NL Query NLP Query Stage NLP Model Tagged Query Landing Page Boost Documents
  • 43. Da Capo al Coda • Killer search apps are crafted from fine ingredients and, like fine whiskey, will get better with age if you are paying attention to ‘what’ your users are looking for. • Putting the pieces together requires an understanding of ‘what’ things are, independent of the words users choose to describe them.
  • 44. Thanks for your attention! Ted Sullivan Lucidworks, Technical Services ted.sullivan@lucidworks.com Skype: ted.sullivan5 LinkedIn Metuchen, New Jersey (You gotta problem with that?)

Editor's Notes

  1. ( * but the Google web app has some serious Mojo!)
  2. ( * but the Google web app has some serious Mojo!)
  3. Suga “Plague” story here – on Spell Check
  4. Sales tax is a type of tax, not a type of sale.
  5. Cue – YouTube video here
  6. Who are they? – Security – what can they see? Personalization – what do they like?