Paul houle what ails enterprise search

Paul Houle - What ails Enterprise Search?
http://blog.databaseanimals.com/is-enterprise-search-boring[7/9/2014 3:33:18 PM]
What ails Enterprise Search?
You can't improve what you can't measure.

Paul Houle
– Creator of database animals and bayesian brains
July 03, 2014
I saw this article, asking "What is your assessment of today's enterprise
search industry?", and I thought I'd chip in.
What's done right
Today's Enterprise Search products have effective answers for
content ingestion and query performance.
Any product that is successful at all has an answer for content
ingestion. It's a complex problem because you need to interact with
many kinds of systems, but it's a solved problem: a vendor who
hasn't solved this problem would not be successful at all.
Query throughput is easy to handle with horizontal replication.
After that, there's a concern about latency, but the best answer to
that is to have the search engine "do more with less" by optimizing
algorithms and data structures. Developers oriented towards
performance work can be found in the video game industry and
other pockets of the software industry -- so long as you make it a
priority, it's tractable in terms of business and technology.
Lucene 4

Enterprise search products are often built around Lucene. Lucene 3
had a lot of good traits, but also fundamental flaws.
Strings in the Java language, on which Lucene is based, are
encoded as UTF-16, which uses two bytes per code unit. ASCII
characters, used heavily in most market areas, get doubled in size
compared to a one-byte encoding. When you're looking at gigabytes
of documents, this is a big deal. The Fedora Linux distribution
rejected Lucene for a desktop search tool ten years ago because of
this overhead.
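To make the size difference concrete, here is a minimal sketch that measures the same ASCII text in both encodings using only the Java standard library. The class name EncodingSize and the sample text are made up for illustration; this is not Lucene code.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Lucene: compares encoded sizes of a string.
final class EncodingSize {
    // Bytes needed to store the text as UTF-16 code units (no byte-order mark).
    static int utf16Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_16LE).length;
    }

    // Bytes needed to store the same text as UTF-8.
    static int utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String sample = "enterprise search"; // illustrative sample, pure ASCII
        System.out.println("UTF-16: " + utf16Bytes(sample) + " bytes");
        System.out.println("UTF-8:  " + utf8Bytes(sample) + " bytes");
    }
}
```

For pure ASCII input the UTF-16 count is exactly twice the UTF-8 count. Text with many non-ASCII characters narrows or reverses the gap, so the saving depends on the corpus.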
Lucene 4 represents text as UTF-8, speeds up general operations by
at least a factor of two, and speeds up many specific operations by
hundreds of times. The design has improved dramatically, making
it much easier to engineer substantial changes to the scoring
algorithms.
Many organizations have a code base in Lucene 3, but from my
viewpoint, it's malpractice to do maintenance work on a Lucene 3
system, because in the long term, it can't compete with a Lucene 4
system.

The science of relevance
There's a quote that circulates in the business literature, which goes
something like "You can't improve what you can't measure". It's
been misattributed to W. Edwards Deming and others, but I like the
way it is used in J.F. Lawton's 1997 book The Selling Bible -- he
talks to successful salespeople and finds that they know what
percentage of customers they can sell, then talks to the "losers in
the lounge" and draws a blank when he asks that question.
The best case study I can think of for relevance work is IBM Watson.
When some IBMers got the idea to compete at Jeopardy, they built
a demo system based on an existing search engine and got this
result:

The dark line is the performance of the demo, and the cluster of
dots higher up is the performance of winning Jeopardy players.
Most of the players are in grey, but the dark ones to the right are
from Ken Jennings, the record holder that Watson needed to beat.
The chart is intimidating: if you were up against this and chose to
give up, I wouldn't blame you.
After some years of work, IBM systematically improved the
performance of Watson until it hit the target.
Now, the strategy and the software framework behind Watson had
this capacity, but it couldn't have gotten close to the goal without a
systematic program of evaluation.
Evaluation has many virtues, the most fundamental of which is
comparing two versions and deciding which is better. You and I
can think of many things which seem like they'd improve the
relevance of a search engine, but if you try them, you might find
things stay the same or get worse.
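As a rough sketch of what "deciding which is better" can look like, the code below scores two ranked result lists for one query against a set of relevance judgements, using average precision. The class name, document IDs, and judgements are invented for illustration, and average precision is only one of several metrics you could pick.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical evaluation helper; assumes each ranked list has no duplicate IDs.
final class CompareRuns {
    // Average precision of one ranked list against the judged-relevant IDs.
    static double averagePrecision(List<String> ranked, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
                sum += (double) hits / (i + 1); // precision at this rank
            }
        }
        return sum / relevant.size();
    }

    public static void main(String[] args) {
        // Made-up judgements and result lists for a single query.
        Set<String> relevant = new HashSet<>(Arrays.asList("d1", "d4"));
        List<String> versionA = Arrays.asList("d1", "d2", "d3", "d4");
        List<String> versionB = Arrays.asList("d2", "d1", "d4", "d3");
        System.out.println("A: " + averagePrecision(versionA, relevant));
        System.out.println("B: " + averagePrecision(versionB, relevant));
    }
}
```

In a real evaluation you would average this score over hundreds of queries before concluding that one version beats the other. A single query proves very little.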
Industry and academic researchers participate in the yearly TREC,
which is organized around a group of Kaggle-like competitions
where participants try to get the best results with a specific set of
documents and queries.
It's an expensive process for a few reasons. First, you need
hundreds of queries, with thousands of possible search results
annotated as valid or not. You'll need to load a substantial set of
documents (gigabytes if not terabytes) and then run all of the
queries. You might want to try this hundreds of times, trying out
different combinations of parameters, not to mention fixing the
bugs that will certainly turn up. If your culture doesn't put devops
first, you'll spend a huge amount of human time running those
tests.
At least if you use the artifacts that TREC creates, you get a
tolerable set of judgements. You'll certainly get better results if
you optimize for your own documents, but then you've got to
create your own judgements.
Escaping irrelevance
If you talk to Enterprise Search vendors you'll find that some of
them participate in TREC and some use it internally. You'll find the
overwhelming majority do neither.

What they tell me, and I believe it, is that customers don't see
enough value in relevant search results to pay for evaluation work.
If it's good enough to make the sale, it's good enough. One
objection to the mainstream TREC work is that TREC rewards the
quality of the 500th search result, something that doesn't matter in
some fields, like web search, where users only look at the first 10
results.
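If only the first page matters, a cutoff metric such as precision at 10 is one way to reflect that. It ignores everything below the cutoff, so the 500th result cannot affect the score. The sketch below is illustrative only. The class name and the data are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: fraction of the top k results that are judged relevant.
final class PrecisionAtK {
    static double precisionAtK(List<String> ranked, Set<String> relevant, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        int limit = Math.min(k, ranked.size());
        int hits = 0;
        for (int i = 0; i < limit; i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
            }
        }
        return (double) hits / k; // missing results count as misses
    }

    public static void main(String[] args) {
        // Made-up ranking d1..d12; d12 is relevant but sits below the cutoff.
        List<String> ranked = new ArrayList<>();
        for (int i = 1; i <= 12; i++) {
            ranked.add("d" + i);
        }
        Set<String> relevant = new HashSet<>(Arrays.asList("d1", "d3", "d12"));
        System.out.println("P@10 = " + precisionAtK(ranked, relevant, 10));
    }
}
```

Two systems that differ only in what they return after rank 10 get the same score here, which is the behavior a web-search-style evaluation wants and a deep-recall evaluation does not.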
Although it's always been easy to tweak Lucene to prioritize certain
fields and do other ad-hoc tricks which ought to improve
relevance, it's been unusual to see Lucene-based competitors in
TREC because: (i) the Lucene 3 scoring engine is nowhere near
competitive on TREC, and (ii) changing the scoring engine to
something better was maddeningly difficult and often resulted in
terrible performance loss.
The good news is that Lucene 4 now has pluggable Similarity
engines. In particular, it contains implementations of the modern
Language Modelling approach

http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/similaritie
which is a dramatic improvement over the old tf*idf scoring in
itself, as well as being a rational foundation to build even better
systems.
So far as is publicly known, the LM similarity is little used because
getting good results on it depends on choosing a "smoothing"
function which addresses the poor sample size we get when we're
looking at rare words. Lucene 4 currently implements two
smoothing algorithms out of several that are in the literature. The
successful use of LM in Lucene is a matter of trying out algorithms
and their parameters to get the best result, a task that,
unfortunately, nobody is doing openly.
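To show what a smoothing function does, here is a small sketch of the two smoothing methods most often described in the language modelling literature, Jelinek-Mercer and Dirichlet. I believe these are the two that Lucene 4 ships (as LMJelinekMercerSimilarity and LMDirichletSimilarity), but the code below is a plain restatement of the textbook formulas, not Lucene's implementation. The class name, parameter names, and sample numbers are assumptions.

```java
// Hypothetical sketch of smoothed term probabilities p(term | document).
final class LmSmoothing {
    // Jelinek-Mercer: linear interpolation of document and collection models.
    // lambda in [0, 1] is the weight given to the collection model.
    static double jelinekMercer(long tf, long docLen, double collectionProb, double lambda) {
        double docProb = docLen == 0 ? 0.0 : (double) tf / docLen;
        return (1.0 - lambda) * docProb + lambda * collectionProb;
    }

    // Dirichlet: adds mu pseudo-counts, distributed according to the collection model.
    // Short documents are pulled towards the collection model more than long ones.
    static double dirichlet(long tf, long docLen, double collectionProb, double mu) {
        return (tf + mu * collectionProb) / (docLen + mu);
    }

    public static void main(String[] args) {
        // A rare word (collection probability 0.001) that is absent from a 100-term document.
        System.out.println(jelinekMercer(0, 100, 0.001, 0.1));
        System.out.println(dirichlet(0, 100, 0.001, 2000.0));
    }
}
```

Without smoothing, a query word missing from a document gives a probability of zero and wipes out the whole score. Both methods avoid that, and the tuning work described above amounts to choosing the method and its parameter (lambda or mu) by measuring results on judged queries.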




© 2014 Paul Houle
