
Enterprise Search White Paper: Beyond the Enterprise Data Warehouse - The Emerging Enterprise Search Solution

This white paper elaborates the role of enterprise search technology as an intelligent retrieval platform for structured data, a role traditionally held by Relational Database Management Systems (RDBMS). Furthermore, it investigates the potential of enterprise search solutions to derive insights and patterns by also analyzing unstructured data, which is not possible with traditional data warehouse systems based on an RDBMS.

Published in: Business

  1. 1. Beyond the Enterprise Data Warehouse: The emerging Enterprise Search solutions
     Findwise Göteborg, 2009-02-25
     Helge Legernes, helge.legernes@findwise.se
     www.findwise.se | info@findwise.se | +46 31 288 400 | Drottningg. 5 | 411 14 Göteborg
  2. 2. Abstract

     This document elaborates the role of enterprise search technology as an intelligent retrieval platform for structured data, a role traditionally held by Relational Database Management Systems (RDBMS). Furthermore, it investigates the potential of enterprise search solutions to derive insights and patterns by also analyzing unstructured data, which is not possible with traditional data warehouse systems based on an RDBMS.

     © 2009 Findwise AB
  3. 3. Table of Contents

     Enterprise search as enterprise data warehouse
       An example
     Enterprise search vs. EDW solution
       Scalability and distributed architecture
       True Ad-hoc Querying and Relevancy
       User interaction
       More content – better analysis
       Real time content – not batch oriented content
       Low implementation and maintenance costs with great flexibility
       One common index used by all users in the enterprise
     Conclusion
     References
  4. 4. Introduction

     It has long been held as convention that intelligent exploration of corporate operational systems requires a commitment to Business Intelligence and Data Warehousing technologies. Recent demands, however, have compelled organizations to look for ways to enhance their technical portfolio with capabilities similar to those found at public Web search portals or in their current Content Management Systems. Generally, the demand arises from extreme requirements in performance and scalability, a desire to incorporate unstructured data, or the adoption of more effective analytics via concept- and context-based text mining.

     Every once in a while an industry develops a peculiar ability to sustain two seemingly identical technologies in fundamentally incompatible ways. Think of the media industry with the computer and the television. Both involve a source of electronic information, a vehicle for transmission, and a device for visual display. Yet how long did it take before we could share the same monitor, or record a show on the same machine as the family photos?

     Now consider OLAP (Online Analytical Processing) and Search. Both include a central data store; mechanisms for extracting data from multiple sources; logic for intelligent, ad-hoc retrieval; and visualization paradigms to organize the results. They even have parallel jargon: knowledge and data management, text and data mining, search and query, document and row, and so on. And they are entirely incompatible with each other, or were.

     We are all quite familiar with the RDBMS as the de facto standard, but the extraction and analysis process, the traditional purview of the OLAP environment (ETL, i.e. Extract, Transform and Load; data warehouse; data mart; and business intelligence), is beginning to strain the relational model. Why is it so hard to support natural language queries? How is meaning extracted from textual content? Why can’t the system be more real-time? How can it provide better analytics? Why is the whole process so expensive to manage?

     The RDBMS vendors are just now releasing new versions of their database engines embedded with the beginnings of search technology. Perhaps they “peered over the wall” at the search vendors to see how they have been solving these problems. If they did, they would have been surprised to see the search vendors staring right back at them, for the search vendors have been mining both structured and unstructured content for some time now, and doing so in ways quite different from the RDBMS.

     Conventional approaches to information discovery and access on structured data are running out of steam. Traditional proprietary applications are often too complex for people to use without expertise and specific training. Furthermore, software licenses are too expensive for wide-scale deployment to all the users who may need access.
  5. 5. Enterprise search as enterprise data warehouse

     When corporate users want to know something, they look inward, using search techniques to seek out clues that are scattered across the entire corporate IT resource: the data submerged in silos including ERP, ordering and financial systems, intranets, e-mail and Web servers, and users’ own workstations. This involves going through multiple sources and linking together the pieces of information to give a starting point for decisions. Over the past three to four years we have seen a real focus on enabling corporate users to do this more efficiently, because without effective companywide search techniques, users end up wasting time as they hunt across multiple systems for a piece of information. As a consequence, the companies suffer as well.

     Over time, the Enterprise Search Engine (ESE) system will certainly become the most shared system within the enterprise. A big difference until now has been that the EDW systems have mostly dealt with structured data, while the ESE systems have dealt with both structured and unstructured data. So, how can business users utilize the unstructured and structured information that resides both in their companies and out on the Internet to make better tactical and strategic decisions?

     Sidebar: Unstructured information
     Data held in Word and PDF documents on servers and users’ desktops that are not indexed, tagged or archived for easy location. European and U.S. analysts agree that about 80 percent of information within businesses is unstructured.

     This collection of information might include business intelligence reports as well as the data underlying these reports, blogs, news feeds, or other information that resides in the company or on the Internet.

     Figure 1 – Typical enterprise search architecture

     We need to set some definitions. First, Enterprise search refers to technology that indexes and retrieves information from structured and unstructured sources, in both internal data sources and web sites, for a company’s user base. Typically, this search is still goal-oriented: the users know what they are looking for. Enterprises use search for a variety of purposes, including simple site search, finding basic answers to questions such as “What is my authorization limit?” or “How do I order a new computer?”, and more advanced analysis that leverages text analytics.
  6. 6. Second, Text analytics is a set of techniques that can discover and extract unstructured text and transform it into structured information that can then be leveraged in various ways. Various extraction techniques exist, including entity extraction (e.g. people, place, financial amount), concept extraction (e.g. unhappy customer) and fact extraction (e.g. an event specification), to name a few. For example, concepts may be created using various linguistic- and statistics-based processes. Different vendors make use of different techniques and hold patents on them.

     Lastly, Convergence is bringing these two technologies together and then searching using extraction indexes created by the text analytics. In other words, search results are filtered by concepts, entities, facts, or other extraction types and utilized to answer questions.

     An example

     Consider the following example of what might be possible with this approach. A manufacturing company is interested in gathering intelligence on why it is losing market share. In the past, it would have looked at numerous reports, including sales reports, warranty reports, customer satisfaction surveys and analyst reports. It would also have had business analysts searching the Internet for any relevant information. Leveraging this new approach, the organization might create concepts that it uses to search through all of its internal documents and business intelligence reports as well as any external news feeds and articles. These concepts might include “unhappy customer”, and/or utilize various entities like “competitors making financial transactions.”

     The concept itself may come from a taxonomy or ontology that has already been created by the organization or a third party. Or it may have been created by a subject matter expert working with this information via a GUI that is part of a software package. The extracted information is sent to a search engine interface. The results might appear as they would with any search engine. The results might also be piped to a business intelligence product to produce plots of the percentage of unhappy, happy, and neutral (no comment) customers. These plots might have been derived from the text of customer care centre records or customer surveys, and would have drill-down capabilities so that line-of-business users can explore why these customers are unhappy and do further analysis. Leveraging this kind of approach, companies can derive insights that were not possible before.
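The pipeline in the example above (extract a concept from free text, then report the share of unhappy, happy, and neutral customers) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cue phrases, the function names `classify` and `concept_shares`, and the sample comments are all hypothetical, and simple substring matching stands in for the linguistic and statistical techniques that commercial text analytics products use.

```python
from collections import Counter

# Hypothetical cue phrases per concept. A real deployment would draw these from
# a taxonomy, an ontology, or a vendor's linguistic/statistical models.
CONCEPT_CUES = {
    "unhappy": ("disappointed", "broke", "refund", "never again"),
    "happy": ("love", "great", "recommend"),
}


def classify(comment: str) -> str:
    """Return the first concept whose cue phrase occurs in the comment, else 'neutral'."""
    text = comment.lower()
    for concept, cues in CONCEPT_CUES.items():
        if any(cue in text for cue in cues):
            return concept
    return "neutral"


def concept_shares(comments):
    """Percentage of comments per concept, rounded to one decimal place."""
    if not comments:
        return {}
    counts = Counter(classify(c) for c in comments)
    return {concept: round(100 * n / len(comments), 1) for concept, n in counts.items()}


# Made-up customer care comments, for illustration only.
comments = [
    "The handle broke after a week, I want a refund.",
    "Great product, would recommend.",
    "Delivered on Tuesday.",
    "Very disappointed with support.",
]
shares = concept_shares(comments)
print(shares)  # {'unhappy': 50.0, 'happy': 25.0, 'neutral': 25.0}
```

The resulting dictionary is the kind of structured output that could be handed to a business intelligence tool for plotting, while the classified comments themselves remain available for drill-down.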
  7. 7. Enterprise search vs. EDW solution

     There are several reasons for choosing enterprise search, some of which are discussed in the following sections.

     Scalability and distributed architecture

     Traditional enterprise architectures gain performance primarily through increased use of CPU and memory. This is scalability through more grid iron, and its costs increase exponentially with demand. RDBMS technologies adopted this model from the beginning, which is understandable since it was the prevailing solution when the industry was new and growing. They have recently begun a move towards more distributed models, but the journey must be difficult. To be truly multi-dimensionally distributed means a basic rewrite of the engine’s core.

     New enterprise architectures are distributed from the ground up, balancing demand for CPU, memory, and disk to reflect the realities of commodity hardware, which is any collection of inexpensive off-the-shelf computers, typically “Lintel/Wintel”, networked together in a grid. Since performance is gained by simply adding another computer to the grid, costs grow linearly. This is scalability through grid computing, and IT organizations are beginning to adopt this model for all their enterprise requirements.

     True Ad-hoc Querying and Relevancy

     The relational model is exactly that: a model. The fact that there is a schema at all means the RDBMS works better for some queries than for others. The issue is not so much who has the fastest query but who has the slowest. For example, the difference between a 20 and a 50 millisecond delay is not discernible to the user, but a 30 second response time is. Variance is often far more important than average query time.

     The problem lies in the fact that a schema is counter to the notion of a purely ad-hoc query environment. The industry devised new denormalized designs, such as star and snowflake schemas, and even introduced multi-dimensional database engines to alleviate the problem as much as possible. While these technologies were good for queries generally focused on sales, financial, and production analyses, they were no good for real-time processing, heavy analytics, historical comparisons, complex updates (e.g. data corrections), textually-centric content, or environments requiring a large number of dimensions. A truly unbiased model is one that has no schema at all (i.e. hyper-denormalized to one table). This is the index of an enterprise search engine.

     Another problem is the language itself. SQL is designed to query systems where the results support a binary inclusion model. In other words, data is either part of the answer or not. The order of the data in the result set has no relevant meaning other than a preference for sorting and summation. This model leaves out the whole world of relative inclusion, where data has a graduated score of relevance in the result list.

     The omission is understandable. Business intelligence focuses almost exclusively on financial and other numerical data, mainly to track monetary trends that measure an organization’s health or assess risk. However, a lot of information is left on the table when one elects to ignore the textual content or relegates it to simple identifiers for qualifying these analyses.

     User interaction

     The popularity of the Web search vendors (Google, Yahoo!, and Microsoft) has had an interesting effect on the expectations of users in the enterprise market. The simplicity of the search bar, the power of navigators, and the convenience of simple ranked results are now basic requirements. “Why can’t our stuff be as simple as Google?” is a common cry. It is not surprising this is the case, since what better audience to test the acceptability of a user interface than the world at large?

  8. 8. The net effect is a shift in user interaction design (UI is more about “interaction” these days than “interface”) as it relates to information retrieval, exploration, and analysis. The new model recognizes the need for an integrated approach to querying and exploring content. The interaction is a combination of techniques that all resolve to the same action: asking for information and getting results. How intelligently the system allows you to ask your question measures the quality of the interaction.

     The basic model is this: providing both a search box and several smart navigation approaches offers the best interface for query and exploration, and an intelligent balance of the two is the most efficient interaction model for intelligent extraction of information. A good second generation enterprise search engine supports all of these capabilities out of the box. The RDBMS vendors are just beginning to incorporate some of them in their latest versions.

     More content – better analysis

     As mentioned earlier, it is often claimed that 80 percent of an organization’s content resides in unstructured repositories (email, documents, etc.), while only 20 percent resides in databases. The content management vendors use this statistic often. While this may be true, it may also be the case that the 20 percent in the database has 80 percent of the importance. The argument is largely academic, because the real answer is to make the entire 100 percent available.

     You can’t find something that isn’t there, so why not make sure it is in the corpus of searchable content? This is simple insurance against the “not knowing what you don’t know” problem that inevitably inflicts the worst pain on an organization. Furthermore, the integration of structured and unstructured content allows for new investigative approaches. Imagine monitoring the efficacy of an advertising campaign by tracking product sales and market intelligence in a coordinated fashion.

     Real time content – not batch oriented content

     We often think of content as primarily historical, a snapshot of activity that occurs over a period of time (which may be as recent as the last 24 hours). This is such a standard assumption that the batch orientation of the data warehousing market requires it to be true. But it is not always true. What if I’m monitoring stock transactions and I wish to get answers before the trader executes the manager’s requests? What if my job is to look for patterns on the newswire? The data in these scenarios is streamed in real time, not digested in batches, and it should be supported as such. Note the importance of latency here. Streamed content should be searchable with minimal delay, often less than a second.

     Second generation enterprise search engines support alerts and sub-second latency. In an OLAP environment this would be considerably more difficult.

     Low implementation and maintenance costs with great flexibility

     In a typical OLAP environment you will find one or more data warehouses, several operational data stores and/or data marts, assorted ETL technologies, and a myriad of business intelligence repositories that define reports, queries and other assorted meta-data. Creating and maintaining these components requires a host of engineers, DBAs, and knowledge workers (who we pretend do not need to understand relational concepts but really do if they build reports or queries, even graphically). Now add to this environment meta-data management, keyword and acronym dictionaries, multiple languages, thesauri, etc., and you can see how the management costs can proliferate.
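To make the streaming scenario from the section “Real time content – not batch oriented content” concrete, here is a minimal sketch of a toy index in which each document is searchable, and checked against standing alerts, the moment it arrives. Everything in it is an assumption for illustration: the class `StreamingIndex`, its method names, the whitespace tokenization, and the sample newswire items do not come from any particular search product.

```python
from collections import defaultdict


class StreamingIndex:
    """Toy in-memory index: a document is searchable as soon as add() returns."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of documents containing it
        self.alerts = {}                  # alert name -> set of required terms

    def register_alert(self, name, query):
        """Store a standing query that is checked against every incoming document."""
        self.alerts[name] = set(query.lower().split())

    def add(self, doc_id, text):
        """Index one document and return the names of the alerts it triggers."""
        terms = set(text.lower().split())
        for term in terms:
            self.postings[term].add(doc_id)
        return sorted(name for name, required in self.alerts.items() if required <= terms)

    def search(self, query):
        """Ids of documents that contain every query term."""
        sets = [self.postings.get(term, set()) for term in query.lower().split()]
        return set.intersection(*sets) if sets else set()


index = StreamingIndex()
index.register_alert("acme-merger", "acme merger")

# Hypothetical newswire item arriving on the stream.
fired = index.add("n1", "Acme announces merger talks")
print(fired)                         # ['acme-merger']
print(index.search("merger talks"))  # {'n1'}
```

There is no load window between arrival and availability here, which is the contrast with a batch-oriented warehouse that the section draws. A production engine would add persistence, distribution, and proper tokenization, none of which this sketch attempts.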
  9. 9. Search engines, especially those with a history of servicing the Web, are designed for the widest audience: consumers who know absolutely nothing about technology. The knowledge worker does not exist in the consumer market, so the ad hoc report at its simplest is merely a search bar that returns ranked results and sometimes dynamically generated navigators to assist in exploration. The management process, as you might imagine, is also much easier, if for no other reason than that there is one single component to install, configure, and manage rather than several. Please note that a DBA is no longer required.

     Figure 2 – Search vs. EDW

     The “single component” comment requires some explanation. Recall that data marts and operational data stores were created to overcome the limitations of the relational model for handling ad-hoc querying (avoiding the “killer query”) and because the RDBMS was never able to efficiently support high volume data updates and query traffic at the same time. The search engine has no such limitations.

     One common index used by all users in the enterprise

     As mentioned earlier, an enterprise search solution could serve as the basis for one “single component” where the index can serve many different applications and different users and their needs.

     Many of the search solutions have a range of powerful presentation-layer functions. These enable overlaying the results of multiple related searches, allowing users to explore commonalities between groups of search results and to build an in-depth picture from seemingly unconnected data.

     The enterprise search solutions can also feed different BI tools, and over time the search vendors will enhance their products with integrated BI tools.
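The idea of one shared index that answers a free-text query with ranked results and dynamically generated navigators can be sketched as follows. The records, the field names (`id`, `source`, `text`), and the `search` function are hypothetical, and raw term counting is used only as a stand-in for the far more refined relevance models of real search engines. The point is the contrast with binary inclusion: every hit carries a graded score, and a database row and two documents sit side by side in the same schema-less collection.

```python
from collections import Counter

# Hypothetical records: one database row and two documents, flattened into the same shape.
RECORDS = [
    {"id": "order-17", "source": "ERP", "text": "pump order delayed pump backordered"},
    {"id": "mail-42", "source": "e-mail", "text": "customer asks about delayed pump order"},
    {"id": "wiki-3", "source": "intranet", "text": "how to order a new computer"},
]


def search(records, query, facet="source"):
    """Return (ranked hits, navigator counts) for a free-text query."""
    terms = query.lower().split()
    scored = []
    for rec in records:
        words = Counter(rec["text"].lower().split())
        score = sum(words[term] for term in terms)  # graded score: query-term occurrences
        if score > 0:
            scored.append((score, rec))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # most relevant first
    hits = [(rec["id"], score) for score, rec in scored]
    navigator = dict(Counter(rec[facet] for _, rec in scored))  # hit counts per facet value
    return hits, navigator


hits, navigator = search(RECORDS, "delayed pump")
print(hits)       # [('order-17', 3), ('mail-42', 2)]
print(navigator)  # {'ERP': 1, 'e-mail': 1}
```

Different applications could call the same function with a different `facet` field or feed `hits` to a reporting tool, which is the sense in which one index serves many users and needs.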
  10. 10. Figure 3 – Enterprise search: one common platform and shared data

      Conclusion

      Enterprise search technology may not have been around as long as its RDBMS counterpart, but it has been much more aggressive in its intelligent retrieval of information from both structured and unstructured content. Whatever the goal, whether to compete effectively, report accurately, defend aggressively, or almost any other activity, the key to an organization’s success is its ability to get the right information to the right people at the right time (and at the right cost), from any source, internal or external to the organization. The enterprise search solution should be considered the best candidate to deliver this effectively.
  11. 11. References

      Ardentia, “Zen and the art of enterprise search”, by Richard Lewis.
      FAST Search & Transfer, “Search in a structured data environment”, by Davor Peter Sutija.
      Hurwitz & Associates, “Enterprise search evolves”, by Fern Halper.
