1. Paul Houle - What ails Enterprise Search?
http://blog.databaseanimals.com/is-enterprise-search-boring[7/9/2014 3:33:18 PM]
What ails Enterprise Search?
You can't improve what you can't measure.
Paul Houle
– Creator of database animals and bayesian brains
July 03, 2014
I saw this article, asking "What is your assessment of today's enterprise
search industry?" I thought I'd chip in.
What's done right
Today's Enterprise Search products have effective answers for
content ingestion and query performance.
Any product that is successful at all has an answer for content
ingestion. It's a complex problem because you need to interact with
many kinds of systems, but it's a solved problem: a vendor who
hasn't solved this problem would not be successful at all.
Query throughput is easy to handle with horizontal replication.
After that, there's a concern about latency, but the best answer to
that is to have the search engine "do more with less", optimizing
algorithms and data structures. Developers oriented towards
performance work can be found in the video game industry and
other pockets of the software industry -- so long as you make it a
priority, it's tractable in terms of business and technology.
Lucene 4
Enterprise search products are often built around Lucene. Lucene 3
had a lot of good traits, but also fundamental flaws.
Strings in the Java language, on which Lucene is based, are
encoded as UTF-16, with two bytes per code unit. ASCII characters, used
heavily in most market areas, get doubled in size. When you're
looking at gigabytes of documents, this is a big deal. The Fedora
Linux distribution rejected Lucene for a desktop search tool ten
years ago because of this overhead.
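
The size difference is easy to check. The sketch below is in Python rather than Java, and it assumes that encoding to "utf-16-le" is a fair stand-in for how a Java string of that era held text in memory; the sample text is made up for illustration.

```python
# Compare the storage cost of ASCII-only text in UTF-16 and in UTF-8.
# Assumption: "utf-16-le" approximates the in-memory layout of a Java string.
text = "enterprise search " * 1000  # hypothetical ASCII-only sample

utf16_bytes = len(text.encode("utf-16-le"))  # two bytes per ASCII character
utf8_bytes = len(text.encode("utf-8"))       # one byte per ASCII character

print("UTF-16:", utf16_bytes, "bytes")
print("UTF-8: ", utf8_bytes, "bytes")
print("ratio: ", utf16_bytes / utf8_bytes)
```

For text outside the ASCII range the gap narrows, since UTF-8 then needs two or more bytes per character.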
Lucene 4 represents text as UTF-8, speeds up general operations by
at least a factor of two, and speeds up many specific operations by
hundreds of times. The design has improved dramatically, making
it much easier to engineer substantial changes to the scoring
algorithms.
Many organizations have a code base in Lucene 3, but from my
viewpoint, it's malpractice to do maintenance work on a Lucene 3
system, because in the long term, it can't compete with a Lucene 4
system.
The science of relevance
There's a quote that circulates in the business literature, which goes
something like "You can't improve what you can't measure". It's
been misattributed to W. Edwards Deming and others, but I like the
way it is used in J.F. Lawton's 1997 book The Selling Bible -- he
talks to successful salespeople and finds that they know what
percentage of customers they can sell, then talks to the "losers in
the lounge" and draws a blank when he asks that question.
The best case study I can think of for relevance work is IBM Watson.
When some IBMers got the idea to compete at Jeopardy, they built
a demo system based on an existing search engine and got this
result:
The dark line is the performance of the demo, and the cluster of
dots higher up is the performance of winning Jeopardy players.
Most of the players are in grey, but the dark ones to the right are
from Ken Jennings, the record holder that Watson needed to beat.
The chart is intimidating: if you were up against this and chose to
give up, I wouldn't blame you.
After some years of work, IBM systematically improved the
performance of Watson until it hit the target.
Now, the strategy and the software framework behind Watson had
this capacity, but it couldn't have gotten close to the goal without a
systematic program of evaluation.
Evaluation has many virtues, the most fundamental of which is
comparing two versions and deciding which is better. You and I
can think of many things which seem like they'd improve the
relevance of a search engine, but if you try them, you might find
things stay the same or get worse.
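
Here is a minimal sketch of what "comparing two versions" can look like, assuming you score each version by mean average precision over a set of judged queries. The query ids, document ids, and the two runs are invented for illustration; a real evaluation would use hundreds of queries.

```python
def average_precision(ranking, relevant):
    """Average precision of one ranked list against a set of relevant doc ids."""
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at the rank of each relevant hit
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(run, judgements):
    """Mean of the per-query average precision over all judged queries."""
    scores = [average_precision(run[q], judgements[q]) for q in judgements]
    return sum(scores) / len(scores)


# Hypothetical judgements: query id -> set of relevant document ids.
judgements = {"q1": {"d1", "d4"}, "q2": {"d2"}}

# Hypothetical ranked results from two versions of the same engine.
version_a = {"q1": ["d1", "d2", "d4"], "q2": ["d3", "d2", "d1"]}
version_b = {"q1": ["d2", "d1", "d4"], "q2": ["d2", "d3", "d1"]}

map_a = mean_average_precision(version_a, judgements)
map_b = mean_average_precision(version_b, judgements)
print("version A:", round(map_a, 4))
print("version B:", round(map_b, 4))
print("better:", "B" if map_b > map_a else "A")
```

Version B does worse on the first query and better on the second, which is exactly the kind of trade-off that is hard to judge by eye and easy to judge with a number.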
Industry and academic researchers participate in the yearly TREC (Text REtrieval Conference),
which is organized around a group of Kaggle-like competitions
where participants try to get the best results
with a specific set of documents and queries.
It's an expensive process for a few reasons. First, you need to have
hundreds of queries, annotating thousands of possible search
results as valid or not. You'll need to load a substantial set of
documents (gigabytes if not terabytes) and then run all of the
queries. You might want to try this hundreds of times trying out
different combinations of parameters, not to mention fixing the
bugs that will certainly turn up. If your culture doesn't put devops
first, you'll spend a huge amount of human time running those
tests.
At least if you use the artifacts that TREC creates, you get a
tolerable set of judgements. You'll certainly get better results if
you optimize for your own documents, but then you've got to
create your own judgements.
Escaping irrelevance
If you talk to Enterprise Search vendors you'll find that some of
them participate in TREC and some use it internally, but the
overwhelming majority do neither.
What they tell me, and I believe it, is that customers don't see
enough value in relevant search results to pay for evaluation work.
If it's good enough to make the sale, it's good enough. One
objection to the mainstream TREC work is that TREC rewards the
quality of the 500th search result, something that doesn't matter in
some fields, like web search, where users only look at the first 10
results.
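
To make that objection concrete, here is a small sketch that assumes the "deep" measure is average precision over the top 1000 results and the "shallow" one is precision in the first 10. The two rankings and their document ids are invented: they agree in the top 10 and differ only in whether the second relevant document shows up at rank 50 or rank 500.

```python
def precision_at_k(ranking, relevant, k=10):
    """Fraction of the first k results that are relevant."""
    return sum(1 for doc_id in ranking[:k] if doc_id in relevant) / k


def average_precision(ranking, relevant):
    """Average precision over the whole ranked list, however deep it goes."""
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def make_ranking(relevant_ranks, depth=1000):
    """Build a list of `depth` doc ids with relevant docs at the given 1-based ranks."""
    ranking = [f"n{i}" for i in range(1, depth + 1)]  # non-relevant filler
    for n, rank in enumerate(relevant_ranks, start=1):
        ranking[rank - 1] = f"r{n}"
    return ranking


relevant = {"r1", "r2"}
run_50 = make_ranking([1, 50])    # second relevant doc at rank 50
run_500 = make_ranking([1, 500])  # second relevant doc at rank 500

print("P@10:", precision_at_k(run_50, relevant), precision_at_k(run_500, relevant))
print("AP:  ", average_precision(run_50, relevant), average_precision(run_500, relevant))
```

A user who reads only the first page sees no difference between the two runs, yet the deep measure prefers one of them.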
Although it's always been easy to tweak Lucene to prioritize certain
fields and do other ad-hoc tricks which ought to improve
relevance, it's been unusual to see Lucene-based competitors in
TREC because: (i) the Lucene 3 scoring engine is nowhere near
competitive on TREC, and (ii) changing the scoring engine to
something better was maddeningly difficult and often resulted in
terrible performance loss.
The good news is that Lucene 4 now has pluggable Similarity
engines. In particular, it contains implementations of the modern
Language Modelling approach