Paul Houle - What ails Enterprise Search?
http://blog.databaseanimals.com/is-enterprise-search-boring[7/9/2014 3:33:18 PM]
What ails Enterprise Search?
You can't improve what you can't measure.

Paul Houle
– Creator of database animals and bayesian brains
July 03, 2014
I saw this article, asking "What is your assessment of today's enterprise
search industry?" I thought I'd chip in.
What's done right
Today's Enterprise Search products have effective answers for
content ingestion and query performance.
Any product that is successful at all has an answer for content
ingestion. It's a complex problem because you need to interact with
many kinds of system, but it's a solved problem: a vendor who
hasn't solved this problem would not be successful at all.
Query throughput is easy to handle with horizontal replication.
After that, there's a concern about latency, but the best answer to
that is to have the search engine "do more with less", optimizing
algorithms and data structures. Developers oriented towards
performance work can be found in the video game industry and
other pockets of the software industry -- so long as you make it a
priority, it's tractable in terms of business and technology.
Lucene 4
Enterprise search products are often built around Lucene. Lucene 3
had a lot of good traits, but also fundamental flaws.
Strings in the Java language, on which Lucene is based, are
encoded as UTF-16, which spends two bytes on every ASCII character.
ASCII characters, used heavily in most market areas, get doubled in
size compared to a one-byte encoding. When you're looking at
gigabytes of documents, this is a big deal. The Fedora Linux
distribution rejected Lucene for a desktop search tool ten years ago
because of this overhead.
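The size penalty is easy to check. The snippet below is only an illustration in Python, not Lucene code; the sample text is made up, and "utf-16-le" stands in for the two-byte layout of a Java string.

```python
# Compare the storage cost of ASCII-only text under UTF-16 (two bytes per
# ASCII character, as in a Java string) and under UTF-8 (one byte each).
text = "enterprise search " * 1000  # hypothetical ASCII-only sample

utf16_bytes = len(text.encode("utf-16-le"))  # little-endian, no byte-order mark
utf8_bytes = len(text.encode("utf-8"))

ratio = utf16_bytes / utf8_bytes
print(utf16_bytes, utf8_bytes, ratio)
```

For text that is all ASCII the ratio is exactly two; text with many non-ASCII characters narrows the gap.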
Lucene 4 represents text as UTF-8, speeds up general operations by
at least a factor of two, and speeds up many specific operations by
hundreds of times. The design has improved dramatically, making
it much easier to engineer substantial changes to the scoring
algorithms.
Many organizations have a code base in Lucene 3, but from my
viewpoint, it's malpractice to do maintenance work on a Lucene 3
system, because in the long term, it can't compete with a Lucene 4
system.
The science of relevance
There's a quote that circulates in the business literature, which goes
something like "You can't improve what you can't measure". It's
been misattributed to W. Edwards Deming and others, but I like the
way it is used in J.F. Lawton's 1997 book The Selling Bible -- he
talks to successful salespeople and finds that they know what
percentage of customers they can sell, then talks to the "losers in
the lounge" and draws a blank when he asks that question.
The best case study I can think of for relevance work is IBM Watson.
When some IBMers got the idea to compete at Jeopardy, they built
a demo system based on an existing search engine and got this
result:
The dark line is the performance of the demo, and the cluster of
dots higher up is the performance of winning Jeopardy players.
Most of the players are in grey, but the dark ones to the right are
from Ken Jennings, the record holder that Watson needed to beat.
The chart is intimidating: if you were up against this and chose to
give up, I wouldn't blame you.
After some years of work, IBM systematically improved the
performance of Watson until it hit the target.
Now, the strategy and the software framework behind Watson had
this capacity, but it couldn't have gotten close to the goal without a
systematic program of evaluation.
Evaluation has many virtues, the most fundamental of which is
comparing two versions and deciding which is better. You and I
can think of many things which seem like they'd improve the
relevance of a search engine, but if you try them, you might find
things stay the same or get worse.
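To make "comparing two versions" concrete, here is a minimal sketch in Python that scores two versions of an engine against the same relevance judgements using mean average precision, one common measure. The query ids, document ids, and both result lists are invented for illustration.

```python
from typing import Dict, List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Sum precision at each rank holding a relevant document, divided by
    the number of relevant documents for the query."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs: Dict[str, List[str]],
                           judgments: Dict[str, Set[str]]) -> float:
    """Mean of average precision over every judged query."""
    scores = [average_precision(runs.get(q, []), rel)
              for q, rel in judgments.items()]
    return sum(scores) / len(scores)


# Hypothetical judgements: query id -> documents marked relevant.
judgments = {"q1": {"d1", "d3"}, "q2": {"d7"}}

# Hypothetical ranked results from two versions of the same engine.
version_a = {"q1": ["d1", "d2", "d3"], "q2": ["d5", "d7"]}
version_b = {"q1": ["d2", "d1", "d3"], "q2": ["d7", "d5"]}

map_a = mean_average_precision(version_a, judgments)
map_b = mean_average_precision(version_b, judgments)
print(f"A: {map_a:.3f}  B: {map_b:.3f}")
```

In this toy data version B is worse on q1 but better overall, which is the kind of trade-off you only notice once you have numbers.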
Industry and academic researchers participate in the yearly TREC
(Text REtrieval Conference), which is organized around a group of
Kaggle-like competitions where participants try to get the best
results with a specific set of documents and queries.
It's an expensive process for a few reasons. First, you need to have
hundreds of queries, annotating thousands of possible search
results as valid or not. You'll need to load a substantial set of
documents (gigabytes if not terabytes) and then run all of the
queries. You might want to try this hundreds of times, trying out
different combinations of parameters, not to mention fixing the
bugs that will certainly turn up. If your culture doesn't put devops
first, you'll spend a huge amount of human time running those
tests.
At least if you use the artifacts that TREC creates, you get a
tolerable set of judgements. You'll certainly get better results if
you optimize for your own documents, but then you've got to
create your own judgements.
Escaping irrelevance
If you talk to Enterprise Search vendors you'll find that some of
them participate in TREC or some use it internally. You'll find the
overwhelming majority do not.
What they tell me, and I believe it, is that customers don't see
enough value in relevant search results to pay for evaluation work.
If it's good enough to make the sale, it's good enough. One
objection to the mainstream TREC work is that TREC rewards the
quality of the 500th search result, something that doesn't matter in
some fields, like web search, where users only look at the first 10
results.
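The difference between a deep measure and a top-10 measure can be shown with a small sketch. The Python below uses invented rankings: both put one relevant document at rank 1, and they differ only in whether the second relevant document sits at rank 11 or rank 500. Precision at 10 cannot tell them apart; average precision over the full list can.

```python
from typing import Dict, List, Set


def precision_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k results that are relevant."""
    return sum(1 for doc_id in ranking[:k] if doc_id in relevant) / k


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Sum precision at each rank holding a relevant document, divided by
    the number of relevant documents."""
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def build_ranking(positions: Dict[str, int], depth: int = 1000) -> List[str]:
    """Place each document at its 1-based rank; fill the rest with
    non-relevant placeholder ids (hypothetical data)."""
    ranking = [f"n{i}" for i in range(1, depth + 1)]
    for doc_id, rank in positions.items():
        ranking[rank - 1] = doc_id
    return ranking


relevant = {"r1", "r2"}
near = build_ranking({"r1": 1, "r2": 11})    # second hit just below the fold
far = build_ranking({"r1": 1, "r2": 500})    # second hit buried deep

p10_near = precision_at_k(near, relevant, 10)
p10_far = precision_at_k(far, relevant, 10)
ap_near = average_precision(near, relevant)
ap_far = average_precision(far, relevant)
print(p10_near, p10_far, round(ap_near, 3), round(ap_far, 3))
```

Which measure to optimize depends on how far down the list your users actually read.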
Although it's always been easy to tweak Lucene to prioritize certain
fields and do other ad-hoc tricks which ought to improve
relevance, it's been unusual to see Lucene-based competitors in
TREC because: (i) the Lucene 3 scoring engine is nowhere near
competitive on TREC, and (ii) changing the scoring engine to
something better was maddeningly difficult and often resulted in
terrible performance loss.
The good news is that Lucene 4 now has pluggable Similarity
engines. In particular, it contains implementations of the modern
Language Modelling approach
http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/similaritie
which is a dramatic improvement over the old tf*idf scoring in
itself, as well as being a rational foundation to build even better
systems.
So far as is publicly known, the LM similarity is little used because
getting good results on it depends on choosing a "smoothing"
function which addresses the poor sample size we get when we're
looking at rare words. Lucene 4 currently implements two
smoothing algorithms out of several that are in the literature. The
successful use of LM in Lucene is a matter of trying out algorithms
and their parameters to get the best result, a task that,
unfortunately, nobody is doing openly.
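For readers who haven't met smoothing before, here is a sketch of the two textbook smoothing methods usually meant in this context, Dirichlet and Jelinek-Mercer, written in Python. It is not Lucene's scoring code; the toy documents, the function names, and the values of mu and lam are assumptions chosen for illustration.

```python
import math
from collections import Counter
from typing import Callable, List


def dirichlet_prob(tf: int, doc_len: int, collection_prob: float,
                   mu: float = 2000.0) -> float:
    """Dirichlet smoothing: add mu pseudo-counts drawn from the collection model."""
    return (tf + mu * collection_prob) / (doc_len + mu)


def jelinek_mercer_prob(tf: int, doc_len: int, collection_prob: float,
                        lam: float = 0.7) -> float:
    """Jelinek-Mercer smoothing: mix the document model with the collection
    model, with weight lam on the collection."""
    doc_prob = tf / doc_len if doc_len else 0.0
    return (1.0 - lam) * doc_prob + lam * collection_prob


def query_log_likelihood(query: List[str], doc: List[str],
                         collection: List[str],
                         smooth: Callable[[int, int, float], float]) -> float:
    """Log probability of the query under the smoothed document model.
    Assumes every query term occurs somewhere in the collection."""
    doc_tf = Counter(doc)
    coll_tf = Counter(collection)
    score = 0.0
    for term in query:
        p_collection = coll_tf[term] / len(collection)
        score += math.log(smooth(doc_tf[term], len(doc), p_collection))
    return score


# Hypothetical three-document collection.
docs = [
    ["lucene", "search", "engine"],
    ["search", "relevance", "evaluation"],
    ["enterprise", "search"],
]
collection = [term for doc in docs for term in doc]
query = ["relevance", "search"]

for name, smooth in [("dirichlet", dirichlet_prob),
                     ("jelinek-mercer", jelinek_mercer_prob)]:
    scores = [query_log_likelihood(query, doc, collection, smooth) for doc in docs]
    print(name, [round(s, 3) for s in scores])
```

Without smoothing, the first document would get probability zero for "relevance" and the log would blow up; both methods fix that, but how much weight they give the collection (mu, lam) is exactly the tuning problem described above.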

© 2014 Paul Houle

More Related Content

Similar to Paul houle what ails enterprise search

Bearish SEO: Defining the User Experience for Google’s Panda Search Landscape
Bearish SEO: Defining the User Experience for Google’s Panda Search LandscapeBearish SEO: Defining the User Experience for Google’s Panda Search Landscape
Bearish SEO: Defining the User Experience for Google’s Panda Search LandscapeMarianne Sweeny
 
Reducing Time Spent On Requirements
Reducing Time Spent On RequirementsReducing Time Spent On Requirements
Reducing Time Spent On RequirementsByron Workman
 
Building multi billion ( dollars, users, documents ) search engines on open ...
Building multi billion ( dollars, users, documents ) search engines  on open ...Building multi billion ( dollars, users, documents ) search engines  on open ...
Building multi billion ( dollars, users, documents ) search engines on open ...Andrei Lopatenko
 
[DSC Europe 22] Avoid mistakes building AI products - Karol Przystalski
[DSC Europe 22] Avoid mistakes building AI products - Karol Przystalski[DSC Europe 22] Avoid mistakes building AI products - Karol Przystalski
[DSC Europe 22] Avoid mistakes building AI products - Karol PrzystalskiDataScienceConferenc1
 
Internet of Things Brings On Development Demands That DevOps Manages, Say Exp...
Internet of Things Brings On Development Demands That DevOps Manages, Say Exp...Internet of Things Brings On Development Demands That DevOps Manages, Say Exp...
Internet of Things Brings On Development Demands That DevOps Manages, Say Exp...Dana Gardner
 
Start Up Austin 2017: Manual vs Automation - When to Start Automating your Pr...
Start Up Austin 2017: Manual vs Automation - When to Start Automating your Pr...Start Up Austin 2017: Manual vs Automation - When to Start Automating your Pr...
Start Up Austin 2017: Manual vs Automation - When to Start Automating your Pr...Amazon Web Services
 
"Open" includes users - Leverage their input
"Open" includes users - Leverage their input"Open" includes users - Leverage their input
"Open" includes users - Leverage their inputRandy Earl
 
Top Three Data Modeling Tools Usability Comparsion
Top Three Data Modeling Tools Usability ComparsionTop Three Data Modeling Tools Usability Comparsion
Top Three Data Modeling Tools Usability ComparsionErin
 
Top Three Data Modeling Tools Usability Comparsion
Top Three Data Modeling Tools Usability ComparsionTop Three Data Modeling Tools Usability Comparsion
Top Three Data Modeling Tools Usability ComparsionErin
 
Andrew NG machine learning
Andrew NG machine learningAndrew NG machine learning
Andrew NG machine learningShareDocView.com
 
[@IndeedEng] Large scale interactive analytics with Imhotep
[@IndeedEng] Large scale interactive analytics with Imhotep[@IndeedEng] Large scale interactive analytics with Imhotep
[@IndeedEng] Large scale interactive analytics with Imhotepindeedeng
 
XYZ Fast Prototyping MGMT 3405 1 Definition – Fa.docx
XYZ Fast Prototyping MGMT 3405  1  Definition – Fa.docxXYZ Fast Prototyping MGMT 3405  1  Definition – Fa.docx
XYZ Fast Prototyping MGMT 3405 1 Definition – Fa.docxjeffevans62972
 
JSF (ADF) Case Studies Paper
JSF (ADF) Case Studies PaperJSF (ADF) Case Studies Paper
JSF (ADF) Case Studies PaperMichael Fons
 
Reactive Microservice Architecture with Groovy and Grails
Reactive Microservice Architecture with Groovy and GrailsReactive Microservice Architecture with Groovy and Grails
Reactive Microservice Architecture with Groovy and GrailsSteve Pember
 
Metric Abuse: Frequently Misused Metrics in Oracle
Metric Abuse: Frequently Misused Metrics in OracleMetric Abuse: Frequently Misused Metrics in Oracle
Metric Abuse: Frequently Misused Metrics in OracleSteve Karam
 
What can DesignOps do for you? by Carol Smith at TLMUX in Montreal
What can DesignOps do for you? by Carol Smith at TLMUX in MontrealWhat can DesignOps do for you? by Carol Smith at TLMUX in Montreal
What can DesignOps do for you? by Carol Smith at TLMUX in MontrealCarol Smith
 
2014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.0
2014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.02014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.0
2014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.0Joakim Lindbom
 
Citrix Labs Rapid Prototyping Workshop
Citrix Labs Rapid Prototyping WorkshopCitrix Labs Rapid Prototyping Workshop
Citrix Labs Rapid Prototyping WorkshopReuven Cohen
 

Similar to Paul houle what ails enterprise search (20)

Bearish SEO: Defining the User Experience for Google’s Panda Search Landscape
Bearish SEO: Defining the User Experience for Google’s Panda Search LandscapeBearish SEO: Defining the User Experience for Google’s Panda Search Landscape
Bearish SEO: Defining the User Experience for Google’s Panda Search Landscape
 
Reducing Time Spent On Requirements
Reducing Time Spent On RequirementsReducing Time Spent On Requirements
Reducing Time Spent On Requirements
 
Building multi billion ( dollars, users, documents ) search engines on open ...
Building multi billion ( dollars, users, documents ) search engines  on open ...Building multi billion ( dollars, users, documents ) search engines  on open ...
Building multi billion ( dollars, users, documents ) search engines on open ...
 
[DSC Europe 22] Avoid mistakes building AI products - Karol Przystalski
[DSC Europe 22] Avoid mistakes building AI products - Karol Przystalski[DSC Europe 22] Avoid mistakes building AI products - Karol Przystalski
[DSC Europe 22] Avoid mistakes building AI products - Karol Przystalski
 
Internet of Things Brings On Development Demands That DevOps Manages, Say Exp...
Internet of Things Brings On Development Demands That DevOps Manages, Say Exp...Internet of Things Brings On Development Demands That DevOps Manages, Say Exp...
Internet of Things Brings On Development Demands That DevOps Manages, Say Exp...
 
Final Project
Final ProjectFinal Project
Final Project
 
Search V Next Final
Search V Next FinalSearch V Next Final
Search V Next Final
 
Start Up Austin 2017: Manual vs Automation - When to Start Automating your Pr...
Start Up Austin 2017: Manual vs Automation - When to Start Automating your Pr...Start Up Austin 2017: Manual vs Automation - When to Start Automating your Pr...
Start Up Austin 2017: Manual vs Automation - When to Start Automating your Pr...
 
"Open" includes users - Leverage their input
"Open" includes users - Leverage their input"Open" includes users - Leverage their input
"Open" includes users - Leverage their input
 
Top Three Data Modeling Tools Usability Comparsion
Top Three Data Modeling Tools Usability ComparsionTop Three Data Modeling Tools Usability Comparsion
Top Three Data Modeling Tools Usability Comparsion
 
Top Three Data Modeling Tools Usability Comparsion
Top Three Data Modeling Tools Usability ComparsionTop Three Data Modeling Tools Usability Comparsion
Top Three Data Modeling Tools Usability Comparsion
 
Andrew NG machine learning
Andrew NG machine learningAndrew NG machine learning
Andrew NG machine learning
 
[@IndeedEng] Large scale interactive analytics with Imhotep
[@IndeedEng] Large scale interactive analytics with Imhotep[@IndeedEng] Large scale interactive analytics with Imhotep
[@IndeedEng] Large scale interactive analytics with Imhotep
 
XYZ Fast Prototyping MGMT 3405 1 Definition – Fa.docx
XYZ Fast Prototyping MGMT 3405  1  Definition – Fa.docxXYZ Fast Prototyping MGMT 3405  1  Definition – Fa.docx
XYZ Fast Prototyping MGMT 3405 1 Definition – Fa.docx
 
JSF (ADF) Case Studies Paper
JSF (ADF) Case Studies PaperJSF (ADF) Case Studies Paper
JSF (ADF) Case Studies Paper
 
Reactive Microservice Architecture with Groovy and Grails
Reactive Microservice Architecture with Groovy and GrailsReactive Microservice Architecture with Groovy and Grails
Reactive Microservice Architecture with Groovy and Grails
 
Metric Abuse: Frequently Misused Metrics in Oracle
Metric Abuse: Frequently Misused Metrics in OracleMetric Abuse: Frequently Misused Metrics in Oracle
Metric Abuse: Frequently Misused Metrics in Oracle
 
What can DesignOps do for you? by Carol Smith at TLMUX in Montreal
What can DesignOps do for you? by Carol Smith at TLMUX in MontrealWhat can DesignOps do for you? by Carol Smith at TLMUX in Montreal
What can DesignOps do for you? by Carol Smith at TLMUX in Montreal
 
2014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.0
2014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.02014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.0
2014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.0
 
Citrix Labs Rapid Prototyping Workshop
Citrix Labs Rapid Prototyping WorkshopCitrix Labs Rapid Prototyping Workshop
Citrix Labs Rapid Prototyping Workshop
 

More from Paul Houle

Chatbots in 2017 -- Ithaca Talk Dec 6
Chatbots in 2017 -- Ithaca Talk Dec 6Chatbots in 2017 -- Ithaca Talk Dec 6
Chatbots in 2017 -- Ithaca Talk Dec 6Paul Houle
 
Estimating the Software Product Value during the Development Process
Estimating the Software Product Value during the Development ProcessEstimating the Software Product Value during the Development Process
Estimating the Software Product Value during the Development ProcessPaul Houle
 
Universal Standards for LEI and other Corporate Reference Data: Enabling risk...
Universal Standards for LEI and other Corporate Reference Data: Enabling risk...Universal Standards for LEI and other Corporate Reference Data: Enabling risk...
Universal Standards for LEI and other Corporate Reference Data: Enabling risk...Paul Houle
 
Fixing a leaky bucket; Observations on the Global LEI System
Fixing a leaky bucket; Observations on the Global LEI SystemFixing a leaky bucket; Observations on the Global LEI System
Fixing a leaky bucket; Observations on the Global LEI SystemPaul Houle
 
Cisco Fog Strategy For Big and Smart Data
Cisco Fog Strategy For Big and Smart DataCisco Fog Strategy For Big and Smart Data
Cisco Fog Strategy For Big and Smart DataPaul Houle
 
Making the semantic web work
Making the semantic web workMaking the semantic web work
Making the semantic web workPaul Houle
 
Ontology2 platform
Ontology2 platformOntology2 platform
Ontology2 platformPaul Houle
 
Ontology2 Platform Evolution
Ontology2 Platform EvolutionOntology2 Platform Evolution
Ontology2 Platform EvolutionPaul Houle
 
Subjective Importance Smackdown
Subjective Importance SmackdownSubjective Importance Smackdown
Subjective Importance SmackdownPaul Houle
 
Dropping unique constraints in sql server
Dropping unique constraints in sql serverDropping unique constraints in sql server
Dropping unique constraints in sql serverPaul Houle
 
Prefix casting versus as-casting in c#
Prefix casting versus as-casting in c#Prefix casting versus as-casting in c#
Prefix casting versus as-casting in c#Paul Houle
 
Paul houle resume
Paul houle resumePaul houle resume
Paul houle resumePaul Houle
 
Keeping track of state in asynchronous callbacks
Keeping track of state in asynchronous callbacksKeeping track of state in asynchronous callbacks
Keeping track of state in asynchronous callbacksPaul Houle
 
Embrace dynamic PHP
Embrace dynamic PHPEmbrace dynamic PHP
Embrace dynamic PHPPaul Houle
 
Once asynchronous, always asynchronous
Once asynchronous, always asynchronousOnce asynchronous, always asynchronous
Once asynchronous, always asynchronousPaul Houle
 
What do you do when you’ve caught an exception?
What do you do when you’ve caught an exception?What do you do when you’ve caught an exception?
What do you do when you’ve caught an exception?Paul Houle
 
Extension methods, nulls, namespaces and precedence in c#
Extension methods, nulls, namespaces and precedence in c#Extension methods, nulls, namespaces and precedence in c#
Extension methods, nulls, namespaces and precedence in c#Paul Houle
 
Pro align snap 2
Pro align snap 2Pro align snap 2
Pro align snap 2Paul Houle
 
Proalign Snapshot 1
Proalign Snapshot 1Proalign Snapshot 1
Proalign Snapshot 1Paul Houle
 
Text wise technology textwise company, llc
Text wise technology   textwise company, llcText wise technology   textwise company, llc
Text wise technology textwise company, llcPaul Houle
 

More from Paul Houle (20)

Chatbots in 2017 -- Ithaca Talk Dec 6
Chatbots in 2017 -- Ithaca Talk Dec 6Chatbots in 2017 -- Ithaca Talk Dec 6
Chatbots in 2017 -- Ithaca Talk Dec 6
 
Estimating the Software Product Value during the Development Process
Estimating the Software Product Value during the Development ProcessEstimating the Software Product Value during the Development Process
Estimating the Software Product Value during the Development Process
 
Universal Standards for LEI and other Corporate Reference Data: Enabling risk...
Universal Standards for LEI and other Corporate Reference Data: Enabling risk...Universal Standards for LEI and other Corporate Reference Data: Enabling risk...
Universal Standards for LEI and other Corporate Reference Data: Enabling risk...
 
Fixing a leaky bucket; Observations on the Global LEI System
Fixing a leaky bucket; Observations on the Global LEI SystemFixing a leaky bucket; Observations on the Global LEI System
Fixing a leaky bucket; Observations on the Global LEI System
 
Cisco Fog Strategy For Big and Smart Data
Cisco Fog Strategy For Big and Smart DataCisco Fog Strategy For Big and Smart Data
Cisco Fog Strategy For Big and Smart Data
 
Making the semantic web work
Making the semantic web workMaking the semantic web work
Making the semantic web work
 
Ontology2 platform
Ontology2 platformOntology2 platform
Ontology2 platform
 
Ontology2 Platform Evolution
Ontology2 Platform EvolutionOntology2 Platform Evolution
Ontology2 Platform Evolution
 
Subjective Importance Smackdown
Subjective Importance SmackdownSubjective Importance Smackdown
Subjective Importance Smackdown
 
Dropping unique constraints in sql server
Dropping unique constraints in sql serverDropping unique constraints in sql server
Dropping unique constraints in sql server
 
Prefix casting versus as-casting in c#
Prefix casting versus as-casting in c#Prefix casting versus as-casting in c#
Prefix casting versus as-casting in c#
 
Paul houle resume
Paul houle resumePaul houle resume
Paul houle resume
 
Keeping track of state in asynchronous callbacks
Keeping track of state in asynchronous callbacksKeeping track of state in asynchronous callbacks
Keeping track of state in asynchronous callbacks
 
Embrace dynamic PHP
Embrace dynamic PHPEmbrace dynamic PHP
Embrace dynamic PHP
 
Once asynchronous, always asynchronous
Once asynchronous, always asynchronousOnce asynchronous, always asynchronous
Once asynchronous, always asynchronous
 
What do you do when you’ve caught an exception?
What do you do when you’ve caught an exception?What do you do when you’ve caught an exception?
What do you do when you’ve caught an exception?
 
Extension methods, nulls, namespaces and precedence in c#
Extension methods, nulls, namespaces and precedence in c#Extension methods, nulls, namespaces and precedence in c#
Extension methods, nulls, namespaces and precedence in c#
 
Pro align snap 2
Pro align snap 2Pro align snap 2
Pro align snap 2
 
Proalign Snapshot 1
Proalign Snapshot 1Proalign Snapshot 1
Proalign Snapshot 1
 
Text wise technology textwise company, llc
Text wise technology   textwise company, llcText wise technology   textwise company, llc
Text wise technology textwise company, llc
 

Recently uploaded

Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 

Recently uploaded (20)

Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Paul Houle - What ails Enterprise Search?

  • 1. Paul Houle - What ails Enterprise Search? http://blog.databaseanimals.com/is-enterprise-search-boring[7/9/2014 3:33:18 PM]
       What ails Enterprise Search?
       You can't improve what you can't measure.
       Paul Houle – Creator of database animals and bayesian brains
       July 03, 2014
       I saw this article, asking "What is your assessment of today's enterprise search industry?" I thought I'd chip in.
       What's done right
       Today's Enterprise Search products have effective answers for content ingestion and query performance. Any product that is successful at all has an answer for content
  • 2. ingestion. It's a complex problem because you need to interact with many kinds of system, but it's a solved problem: a vendor who hasn't solved it would not be successful at all.
       Query throughput is easy to handle with horizontal replication. After that, there's a concern about latency, but the best answer to that is to have the search engine "do more with less" by optimizing algorithms and data structures. Developers oriented towards performance work can be found in the video game industry and other pockets of the software industry; so long as you make it a priority, it's tractable in terms of both business and technology.
       Lucene 4
  • 3. [Image credit: Eddie Clio]
       Enterprise search products are often built around Lucene. Lucene 3 had a lot of good traits, but also fundamental flaws.
       Strings in the Java language, on which Lucene is based, are stored as 16-bit UTF-16 code units. ASCII characters, used heavily in most market areas, get doubled in size. When you're looking at gigabytes of documents, this is a big deal. The Fedora Linux distribution rejected Lucene for a desktop search tool ten years ago because of this overhead.
       Lucene 4 represents text as UTF-8, speeds up general operations by at least a factor of two, and speeds up many specific operations by hundreds of times. The design has improved dramatically, making it much easier to engineer substantial changes to the scoring algorithms.
       Many organizations have a code base in Lucene 3, but from my viewpoint, it's malpractice to do maintenance work on a Lucene 3 system, because in the long term, it can't compete with a Lucene 4 system.
  • 4. The science of relevance
       There's a quote that circulates in the business literature, which goes something like "You can't improve what you can't measure". It's been misattributed to W. Edwards Deming and others, but I like the way it is used in J.F. Lawton's 1997 book The Selling Bible: he talks to successful salespeople and finds that they know what percentage of customers they can sell, then talks to the "losers in the lounge" and draws a blank when he asks that question.
       The best case study I can think of for relevance work is IBM Watson. When some IBMers got the idea to compete at Jeopardy, they built a demo system based on an existing search engine and got this result:
  • 5. The dark line is the performance of the demo, and the cluster of dots higher up is the performance of winning Jeopardy players. Most of the players are in grey, but the dark ones to the right are from Ken Jennings, the record holder that Watson needed to beat. The chart is intimidating: if you were up against this and chose to give up, I wouldn't blame you.
       After some years of work, IBM systematically improved the performance of Watson until it hit the target.
       Now, the strategy and the software framework behind Watson had this capacity, but it couldn't have gotten close to the goal without a
  • 6. systematic program of evaluation.
       Evaluation has many virtues, the most fundamental of which is comparing two versions and deciding which is better. You and I can think of many things which seem like they'd improve the relevance of a search engine, but if you try them, you might find things stay the same or get worse.
       Industry and academic researchers participate in the yearly TREC (Text REtrieval Conference), which is organized around a group of Kaggle-like competitions where participants try to get the best results with a specific set of documents and queries.
       It's an expensive process for a few reasons. First, you need hundreds of queries, with thousands of possible search results annotated as valid or not. You'll need to load a substantial set of documents (gigabytes if not terabytes) and then run all of the queries. You might want to try this hundreds of times with different combinations of parameters, not to mention fixing the
  • 7. bugs that will certainly turn up. If your culture doesn't put devops first, you'll spend a huge amount of human time running those tests.
       At least if you use the artifacts that TREC creates, you get a tolerable set of judgements. You'll certainly get better results if you optimize for your own documents, but then you've got to create your own judgements.
       Escaping irrelevance
       [Image credit: OccupyReno MediaCommittee]
       If you talk to Enterprise Search vendors you'll find that some of them participate in TREC or use it internally. You'll find the overwhelming majority do not.
  • 8. What they tell me, and I believe it, is that customers don't see enough value in relevant search results to pay for evaluation work. If it's good enough to make the sale, it's good enough.
       One objection to the mainstream TREC work is that TREC rewards the quality of the 500th search result, something that doesn't matter in some fields, like web search, where users only look at the first 10 results.
       Although it's always been easy to tweak Lucene to prioritize certain fields and do other ad-hoc tricks which ought to improve relevance, it's been unusual to see Lucene-based competitors in TREC because: (i) the Lucene 3 scoring engine is nowhere near competitive on TREC, and (ii) changing the scoring engine to something better was maddeningly difficult and often resulted in terrible performance loss.
       [Image credit: Chris Carillo]
       The good news is that Lucene 4 now has pluggable Similarity engines. In particular, it contains implementations of the modern Language Modelling approach
  • 9. http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/similaritie
       which is a dramatic improvement over the old tf*idf scoring in itself, as well as being a rational foundation on which to build even better systems.
       So far as is publicly known, the LM similarity is little used, because getting good results from it depends on choosing a "smoothing" function which addresses the poor sample size we get when we're looking at rare words. Lucene 4 currently implements two smoothing algorithms out of several that are in the literature.
       The successful use of LM in Lucene is a matter of trying out algorithms and their parameters to get the best result, a task that, unfortunately, nobody is doing openly.
       © 2014 Paul Houle
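
To make the "smoothing" idea concrete, here is a minimal Python sketch of two smoothing functions from the language-modelling literature, Jelinek-Mercer and Dirichlet. Lucene 4 ships similarities named after both (LMJelinekMercerSimilarity and LMDirichletSimilarity). The code uses the textbook formulas and is not Lucene's implementation. The function names, the parameter values (lam, mu) and the two-document toy collection are assumptions made up for illustration.

```python
import math
from collections import Counter


def jelinek_mercer(tf, doc_len, p_coll, lam=0.1):
    """Mix the document estimate with the collection model using weight lam."""
    return (1.0 - lam) * (tf / doc_len) + lam * p_coll


def dirichlet(tf, doc_len, p_coll, mu=2000.0):
    """Add mu pseudo-counts drawn from the collection model to the document."""
    return (tf + mu * p_coll) / (doc_len + mu)


def rank(query, docs, smooth):
    """Order document ids by query log-likelihood, best first.

    Assumes every query term occurs somewhere in the collection; otherwise
    the collection probability is zero and the logarithm is undefined.
    """
    coll = Counter(t for tokens in docs.values() for t in tokens)
    coll_len = sum(coll.values())

    def score(tokens):
        counts = Counter(tokens)
        return sum(
            math.log(smooth(counts[t], len(tokens), coll[t] / coll_len))
            for t in query
        )

    return sorted(docs, key=lambda d: score(docs[d]), reverse=True)


# Toy collection, invented for illustration only.
DOCS = {
    "a": "lucene scoring engine".split(),
    "b": "video game engine engine".split(),
}

if __name__ == "__main__":
    for name, smooth in [("Jelinek-Mercer", jelinek_mercer), ("Dirichlet", dirichlet)]:
        print(name, rank(["lucene", "engine"], DOCS, smooth))
```

Without smoothing, document "b" would get probability zero for the query because it never mentions "lucene". Both functions instead fall back on how common the word is across the whole collection. How strongly they fall back is governed by lam or mu, and those are exactly the parameters that would have to be tuned against relevance judgements.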