Myths in Enterprise Search
Many people think that since enterprise search has been around for years, all its issues should have
been solved and its complexity dramatically reduced. In other words, it ought to be a commodity by
now. Although search technologies continue to advance, the origin of the complexity is the nature of
language itself – full of ambiguity, contradiction, multiple meanings, and many contexts. The digital
search for what you want to know involves a balance between precision (the measure of the usefulness of a result) and recall (the measure of the completeness of the result). Increasing precision without
sacrificing recall is a complicated balance to strike, and the commoditization of this balance is one of the
myths not currently in line with reality.
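The precision/recall trade-off can be made concrete with the standard definitions. The following sketch (plain Python, with invented document IDs) shows why broadening a query raises recall at the cost of precision:

```python
def precision(retrieved, relevant):
    """Fraction of retrieved results that are actually relevant (usefulness)."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    """Fraction of all relevant documents that were retrieved (completeness)."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)

relevant = {"d1", "d2", "d3", "d4"}        # what the user actually wants
broad = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}  # broad query
narrow = {"d1", "d2"}                       # narrow query

print(precision(broad, relevant), recall(broad, relevant))    # 0.5 1.0
print(precision(narrow, relevant), recall(narrow, relevant))  # 1.0 0.5
```

The broad result set finds everything relevant but half its results are noise; the narrow one returns only relevant results but misses half of them. Tuning a search engine is the art of moving both numbers up at once.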
COMMON SEARCH MYTHS
In this document we uncover and correct some of the other myths, assumptions, and misconceptions
about enterprise search that prevail in the market today.
Myth
Web search and Enterprise search are more or less the same.
Reality
This is the assumption behind the statement, “Why can’t I get a Google interface for my company?” The
quick answer is, “because you don’t actually want it.” The more appropriate question is, “Why can I find
what I’m looking for on Google more easily than I can on my own corporate website?” That is indeed a
problem, and it happens, ironically, when people try to solve it from a Web search perspective.
To Google’s credit, they have been successful in forcing the enterprise search market to realize that ease
of use for the end user (and ease of use in general) should be given the serious attention it deserves.
Agreed. But the demands of the enterprise user are different and the complexities of source content are
greater than a public search of the Web.
Users expect a better answer to their question because they are more intimate with their content. They
are not looking for the most popular answer – the general logic behind ranking for Web search – they
want the right answer. Consequently, the ranking model for the enterprise is much more complex than
for the Web. It must consider several parameters in its balance of demand between precision and recall
(term frequency, source freshness, spatial proximity, authority, etc.), a balance that changes from
application to application.
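A minimal sketch of what such a composite ranking model looks like, assuming a simple weighted-sum form. The signal names and weights here are invented for illustration, not any vendor's actual model:

```python
def score(doc, weights):
    """Composite relevance score: a weighted sum of normalized signals.
    Signals and weights are illustrative assumptions, not a real model."""
    return sum(weights[name] * doc.get(name, 0.0) for name in weights)

# Two hypothetical applications weight the same signals differently.
news_weights = {"term_frequency": 0.3, "freshness": 0.5, "proximity": 0.1, "authority": 0.1}
legal_weights = {"term_frequency": 0.4, "freshness": 0.1, "proximity": 0.2, "authority": 0.3}

# One document's normalized signal values (all assumed to lie in [0, 1]).
doc = {"term_frequency": 0.8, "freshness": 0.9, "proximity": 0.4, "authority": 0.2}

print(round(score(doc, news_weights), 2))   # the news application favors this fresh document
print(round(score(doc, legal_weights), 2))  # the legal application ranks it lower
```

The point is not the formula itself but that the weights change per application; the same document scores differently depending on what the deployment values.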
Now add further complexity. The reasons for searching are much more varied. Data types are
much more varied. Data freshness (how soon does a new document appear in my search?) is more
important. Security is an issue. A typical enterprise supports more content types, formats, and security
layers than the entire Web.
Finally, search in the enterprise is often in context of a specific solution, for example uncovering legal
risk, assessing product campaigns, or buying goods online. While it is popular to start a request for
information with a search query, unlike the Web, there is an expectation that further investigation will
involve refinement techniques that involve navigating through supplemental information such as facets
and concepts. This more exploratory model, visualized through navigators, tag clouds, heat maps, and the like, is commonplace in enterprise search but not on the Web.
Myth
The Web has more content than any enterprise, so Web search companies are the real experts on
search.
Reality
As of summer 2008, Google’s index was just under 21 billion web pages and growing. Yahoo’s was
actually higher, at around 55 billion. If you take an average page size of 200 bytes, then Google’s index
was then about 4.2TB and Yahoo’s about 11TB. While this is large, and perhaps larger than many
enterprise search implementations, there are many, many enterprise search implementations in the
tens and hundreds of terabytes, and a few now in the petabyte range. The assumption might be that the
corpus is larger because the average document size is also. True for some implementations, but others
contain just email.
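The arithmetic behind these estimates is easy to reproduce (the page counts and the 200-byte average are the figures quoted above, not independent measurements):

```python
PAGE_BYTES = 200          # the assumed average indexed size per page
TB = 10**12               # one terabyte, in bytes

google_pages = 21 * 10**9  # just under 21 billion pages, summer 2008
yahoo_pages = 55 * 10**9   # roughly 55 billion pages

print(google_pages * PAGE_BYTES / TB)  # 4.2 (TB)
print(yahoo_pages * PAGE_BYTES / TB)   # 11.0 (TB)
```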
In any case, when it comes to scale, it is intuitive who has the most incentive for efficiency. The Web
search companies purchase and host their own hardware. Enterprise search vendors must convince
their customers to buy the hardware themselves.
Myth
My relevancy model is better than your relevancy model.
Reality
Search technology, in some form or another, has been around since Lexis-Nexis first commercialized it in
the 1970s. In the 1990s, Google became the catalyst for carving out the web search business as a
separate entity, introducing ranking algorithms based on website popularity. It also legitimized the
importance of search as a strategic asset in the enterprise.
Until then there was little distinction between Web or enterprise search. All the models were more or
less the same, based on matching words typed into the search box against words in documents residing
in the index. But now that there was real money to be had, enterprise search vendors turned up the
volume on competitive differentiation by touting the superiority of their relevancy models.
In hindsight, this was a surprising tactic because the math behind relevancy is quite complex and hardly
the stuff of debate for the search customer: keyword search, conceptual search, semantic search, scope
search, TF/IDF, Bayesian probability, Boolean filtering, query expansion, NLP, text mining, and so on.
One vendor proclaims that enterprise search is not rocket science (or brain surgery – it’s
interchangeable); enterprise search is harder. Another vendor simply tells its customers it’s too
complex, so “leave it to the experts who know how to do this.”
Yet, today, as in Time Before Google, the majority of customers are still not satisfied with the results
they get from their search engines. In their minds the quality has not really changed. You still hear, “Why
can’t I find information in my company as easily as I can find it on Google?”
The suggestion that one model is better than another is not so much wrong, but a moot point. Every
vendor’s model is good for some content, just not for all content. What makes a good enterprise search
vendor is their ability to adapt to the context and character of the content and application requirements
by deploying an optimal combination of all these different approaches.
Finally, perhaps we’re all arguing about the wrong thing. Relevancy is important, but what of the user’s
experience? Is the search technology touching the complete information landscape or just part of it?
How are navigation and exploration accomplished? Does the technology act on the results, i.e., connect
into business operations and trigger action? There is more to information access than search, and there is
more to search than relevancy.
Myth
Manual facet management is easy and quick and offers a good user interface.
Reality
Facets, or dimensions, of results of a search can be used to help navigate to related information. The
conventional approach to facet management requires defining the facets before indexing as part of the
search platform’s configuration. Some vendors provide a well-designed user interface to make the
process as easy as possible. A typical example might be the organization of facets on an electronic retail
outlet’s ecommerce website. For productType = 'computer', I can declare my facets in this
order: price, make, CPUs, memory, storage, slots, monitor. For laptops, I would add a piece of logic that
says if portable='yes' then display the weight facet.
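A sketch of what such a manual facet declaration might look like, using an invented Python configuration rather than any particular platform’s syntax:

```python
# Hypothetical declarative facet configuration for an ecommerce catalog.
# Field names, facet names, and the conditional rule are illustrative only.
FACET_CONFIG = {
    "computer": ["price", "make", "CPUs", "memory", "storage", "slots", "monitor"],
}

def facets_for(product):
    """Return the ordered facet list for a product, applying conditional logic."""
    facets = list(FACET_CONFIG.get(product.get("productType"), []))
    # Conditional rule from the example above: portable computers also
    # expose a weight facet.
    if product.get("portable") == "yes":
        facets.append("weight")
    return facets

laptop = {"productType": "computer", "portable": "yes"}
print(facets_for(laptop))
```

Even in this toy form, the maintenance burden is visible: every new product type, facet, and conditional rule is another hand-written entry that someone must keep in sync with the catalog.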
The manual approach is not a problem if your objects have a fairly uniform structure (e.g. books). You
can have millions of them, but the key is they are all described the same way. But imagine a national
retail outlet whose product catalog contains several hundred thousand different products. And further,
imagine the catalog changes fairly constantly. The manual process is now a half-year project and the
change management a major ongoing commitment.
A system that recommends the facets to you automatically and on the fly for each query would remove
all this work. The logic behind the ranking is not unlike the ranking for search results, involving a number
of calculations to arrive at a composite score (e.g. sparse matrix analysis, clustering, facet distribution,
etc.). The algorithms should be smart enough to avoid situations where no facets appear because none
are relevant enough to display (e.g. sparse-matrix analysis alone). This can happen with content that has
minimal facet intersection.
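One way such automatic recommendation could work, sketched with an invented coverage-times-diversity score standing in for the composite calculations mentioned above:

```python
from collections import defaultdict

def recommend_facets(results, max_facets=3, min_coverage=0.3):
    """Rank candidate facet fields for one query's result set.
    The scoring (coverage x value diversity) and thresholds are a
    simplified illustration, not a production algorithm."""
    coverage = defaultdict(int)   # how many results carry each field
    values = defaultdict(set)     # distinct values seen per field
    for doc in results:
        for field, value in doc.items():
            coverage[field] += 1
            values[field].add(value)
    scored = []
    for field, count in coverage.items():
        cov = count / len(results)
        # Skip fields too sparse to display, or useless for refinement
        # because every result shares one value.
        if cov < min_coverage or len(values[field]) < 2:
            continue
        scored.append((cov * len(values[field]), field))
    return [field for _, field in sorted(scored, reverse=True)[:max_facets]]

results = [
    {"make": "A", "price": "low", "color": "red"},
    {"make": "B", "price": "low"},
    {"make": "A", "price": "high"},
    {"make": "C", "price": "mid"},
]
print(recommend_facets(results))  # 'color' is dropped: too sparse
```

Because the ranking runs per query, the facets adapt to whatever subset of the catalog the results happen to cover, with no configuration to maintain.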
Myth
A simple database query is all a search engine needs to extract content from a database. Only the
relational database can truly support ad hoc structured querying.
Reality
We challenge this assumption by comparing both relational and search engine technologies, with the
goal of proposing a hybrid solution that reflects the advantages of both. Let’s take a look at the
relational model first.
The relational model, you may recall, was originally designed for managing the transactional integrity of
inputting information and for its efficient retrieval through predefined, repetitive reporting. It works
because the database schema is designed specifically for the structure of the data and the shape of the
reports.
But the market began demanding a more ad hoc approach to querying their data for what-if analysis and
general exploration. In this situation, the query is no longer repetitive or known in advance, and
therefore cannot be planned for in the database engine.
The ad hoc query does not sit well with the basic relational model. Any relationship created a priori (all
relationships in a database schema) will bias for some queries and against others. Since you do not know
the query being asked, you do not know which side it will fall on. It is quite possible to create a “killer
query” that brings the database engine to a screeching halt.
Attempts to solve this problem have resulted in a continuous evolution of the relational model, twisting
it in various ways to provide better performance and greater flexibility (more ad hoc). Technologies
include data marts and star schemas, software and hardware data warehouses (e.g., Teradata, Netezza),
cubes, and vertical indexing technologies (e.g., Vertica, Sybase IQ).
The underlying problem is still there, however. All these technologies still view the problem from a
traditional table-column-relationship point of view, and this is inherently limiting. It does not mean we
abandon SQL or the need for the relational model to manage transactional data entry and fixed
reporting, but it does suggest we should rethink how the basic engine works for the optimization of
rapid, high volume, ad hoc information retrieval.
What might this new engine look like? Search indexes provide an interesting approach. They are
certainly designed for this type of problem. Google, for example, responds to millions of queries a day,
searching through billions of documents, each query taking less than a few seconds to respond. No
database technology comes close to this type of performance.
But then Google does not have to deal with cardinal relationships. It does not have to support the SQL
JOIN statement. The JOIN statement is the cornerstone of both reporting and ad hoc querying. For
example, we may have a hundred invoices for a customer. In a relational database, that amounts to 101
rows across two tables: one row for the customer and a hundred rows for the invoices. The customer row
is stored once but referenced a hundred times. If we want to return all the invoices for a particular customer, or all the
customers with invoices greater than a certain amount, the JOIN statement is used to exploit the
relationship between the customer and invoice tables.
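The customer/invoice example can be run directly against SQLite; the schema, names, and amounts are illustrative:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("CREATE TABLE invoice (id INTEGER PRIMARY KEY, "
           "customer_id INTEGER, amount REAL)")

# The customer row is stored once but referenced by every invoice.
db.execute("INSERT INTO customer VALUES (1, 'Acme Corp')")
db.executemany("INSERT INTO invoice VALUES (?, 1, ?)",
               [(n, 100.0 + n) for n in range(1, 101)])  # amounts 101..200

# JOIN exploits the relationship: all of this customer's invoices
# over a threshold amount.
rows = db.execute("""
    SELECT c.name, i.id, i.amount
    FROM customer c JOIN invoice i ON i.customer_id = c.id
    WHERE i.amount > 195
""").fetchall()
print(len(rows))  # 5 invoices qualify
```

The same JOIN serves both fixed reporting and ad hoc exploration, which is exactly the capability a flat search index lacks.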
The approach conventional search vendors use to extract content from a relational database is to
execute a SQL query against the database, returning a result set of uniform shape that is then indexed. If
different data or a different result set is requested, a new query is defined, the search index is
reconfigured, and the content is re-indexed. It works this way because search technologies simply do not
understand the relational concept.
There are many problems with this model. First, the data is “flattened”, meaning all cardinality is
removed by repeating content in each result set row. In our example, a flattened result set would
include the customer’s properties in every invoice. An updated customer record would require an
update to each one of its invoices in the search index.
Second, there are no real ad hoc capabilities here. You must know beforehand how your users will
explore the database content because you have to predetermine the shape of the results. But often you
don’t know what your next question will be until you see the answer to the first.
Finally, it is now impossible to JOIN content as a database engine does. A JOIN is not like a search; it is a
true Cartesian product of results between two sets of content that share a common property value.
This does not need to be so. The rapid, high volume, pure ad hoc querying capability of the search index
is still valid, but the architecture needs to be enhanced to retain the integrity of the cardinal relationship
from the database source. If the search engine were augmented to ingest each table’s rows individually
for all the tables in the database, then it would be possible (with some clever work on the vendor’s part)
to support a JOIN statement executed on the fly at query time.
By the way, because the index contains both structured and unstructured content, the JOIN could be
between a table, email, and a set of documents. Further, since this is a search environment, “fuzzy
JOINs” are possible that capitalize on standard capabilities such as spell correction and synonym
expansion.
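A toy sketch of the proposed architecture, with each table row indexed as its own document and the join computed at query time. The index layout and join logic here are assumptions for illustration, not any vendor’s implementation:

```python
# Every source row (or email, or document) becomes one index document,
# tagged with the table it came from. All names and values are invented.
index = [
    {"table": "customer", "id": 1, "name": "Acme Corp"},
    {"table": "invoice", "id": 10, "customer_id": 1, "amount": 250.0},
    {"table": "invoice", "id": 11, "customer_id": 1, "amount": 90.0},
    {"table": "email", "id": 99, "customer_id": 1, "subject": "Overdue invoice"},
]

def query(table, **criteria):
    """Structured query against the row-level index."""
    return [d for d in index if d["table"] == table
            and all(d.get(k) == v for k, v in criteria.items())]

def join(left, right, left_key, right_key):
    """Query-time join: pair documents sharing a common property value."""
    return [(l, r) for l in left for r in right
            if l.get(left_key) == r.get(right_key)]

# Ad hoc at query time: customers joined to their large invoices,
# with no pre-flattened result shape.
customers = query("customer")
big_invoices = [d for d in query("invoice") if d["amount"] > 100]
pairs = join(customers, big_invoices, "id", "customer_id")
print(len(pairs))  # one customer/invoice pair qualifies
```

Because emails and documents sit in the same index with a `customer_id` property, the identical `join` call could pair them with customers too; replacing the exact-match comparison with fuzzy matching (spell correction, synonyms) is what makes “fuzzy JOINs” conceivable.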
NEW DEVELOPMENTS IN SEARCH: INFORMATION ACCESS
While developments in enterprise search and web search continue, a new category called unified
information architecture (UIA) is beginning to gain market traction. UIA
extends enterprise search capabilities across all types of documents, data, and media. This expanded
scope replaces legacy enterprise search, offering all its functionality and combining simple access to
data and media. The advantages include being able to assemble all relevant information with one query;
connecting content and related data; and searching data with a simple search query instead of a
structured query language and formal reports. UIA can co-exist with search or replace it outright. For
more information about UIA, please visit www.attivo.com.