Designing an Effective Enterprise Search Solution


For enterprise search requirements, we assess the Google Search Appliance along with SharePoint, social search, and AutoSuggest, in contrast to Apache Lucene and Solr.

Published in: Business

Cognizant 20-20 Insights | February 2012

Executive Summary

There are many diverse requirements for search capabilities that emerge within an enterprise. This white paper addresses the five most desired enterprise search requirements. The solution discussed here is based on implementation experience we have gained over years of using multiple search tools. The top five search requirements are:

• Diverse Content: Ability to crawl, index and search diverse content repositories.
  >> The Web, Microsoft SQL database and SharePoint content management systems.
• Secured Search: Ability to crawl secured content and make it accessible only to authorized people and/or groups.
  >> Single sign-on, forms-based authentication.
• User Interface: Ability to provide various user interface (UI) components to serve end users with precise results.
  >> Guided navigation, related search terms, related articles and best bets.
  >> AutoSuggest with terms combined from real-time search and custom (user configurable) terms in data stores.
• Desktop Search: Ability to integrate with content stored on the desktop.
• Social Search: Ability to find other people, ratings and expertise within the organization.

Defining the Enterprise Search Engine

There are quite a few proprietary and open source enterprise search tools available in the market. The Google Search Appliance is chosen here for its ease of use and its ability to handle most of the aforementioned requirements. (Refer to the Appendix for the architecture diagram that provides an alternate approach using Apache Solr 3.1 and Nutch 1.3.)

The Google Search Appliance provides quite a few traditionally requested enterprise search features out-of-the-box (OOTB). Even though the appliance fits the hardware plug-and-play model, it provides a flexible framework for integrating with external systems such as content management systems, document management systems, security systems, federated search and both Google and non-Google services on the cloud.

The following lists the features required to implement the top five requirements. For the sake of gauging the complexity of the implementation, our list differentiates custom components from Google out-of-the-box (GOOTB) capabilities. A sample approach is also provided for developing the required custom components to architect an effective enterprise search solution using the Google Appliance.

• Google Web crawler for crawling and indexing Web content (GOOTB).
• Google DB connector for crawling and indexing Microsoft SQL database (GOOTB).
• Google SharePoint connector for crawling and indexing SharePoint content (GOOTB).

1. Google forms authentication for index time authorization and serve time authentication (GOOTB).
2. Google front-end configuration for:
  >> Faceted search, aka guided navigation (limited OOTB).
  >> Related search terms (GOOTB).
  >> Related articles (GOOTB).
  >> Best bets (GOOTB).
  >> AutoSuggest (GOOTB and custom application).
3. Google desktop search component integration (external Google component).
4. Google results integration with internal rating system.
  >> Integration with Google people search (GOOTB).
  >> Integration with expertise system (custom).
  >> Integration with custom rating system (custom).

Component Description

The following components are critical to a Google-powered enterprise search architecture (see Figure 1 for a schematic view).

[Figure 1: Enterprise Search Component Diagram (using GSA). Labels in the diagram: People Search, LDAP Server, Web Crawling, Google Desktop Tool, Client Desktop, Database Connector, Search Configuration, TermsFederator, SharePoint Connector, XML Feeds, Forms Authentication, ExpertsConnect, Best Bets, InfoValuator, Related Terms, Intranet Web, Microsoft SQL, Microsoft SharePoint CMS. Legend: Content Repositories, Google OOTB Configurations, Custom Applications, Google Component.]

• Google Search Appliance: The Google Search Appliance (GSA) is at the center of this proposed search architecture. GSA, an enterprise search appliance, is a packaged hardware search engine. The plug-and-play model provides many necessary enterprise search features out of the box. The tool also provides a flexible (connector) framework with which developers/implementers can integrate the appliance with other content sources.

The Google Search Appliance is extremely adept at crawling, indexing and serving Web pages. Additionally, GSA provides connectors to index and search relational databases, content management systems (e.g., EMC Documentum, Microsoft SharePoint, Open Text Livelink, IBM FileNet) and local/network file systems or file shares. Connectors to other, non-supported content management system (CMS) tools can either be developed from scratch using Google's connector framework or be purchased as a product from Google's partner Websites.

• Google Web Crawler: The intranet Web can be indexed using Google's Web crawler. GSA allows implementers to configure the "start URL," "follow URL pattern" and "do not crawl pattern" details, with additional options that allow us to force a recrawl of specific patterns as and when required. The crawl status can be tracked using the "crawler diagnostics," which detail whether a URL was successfully indexed or not. While crawling, if the Googlebot encounters an error or if the pages are not crawler friendly, appropriate error messages are displayed in the diagnostics to aid troubleshooting.
  >> Disadvantage: As efficient as this sounds, one disadvantage of the Web crawler is Google's inability to reveal the exact page that is currently being processed.
  >> Alternative: The OS console monitor and/or tracking log files are some ways that could help track URL crawl status. At any point in time, a developer should be able to view the current URL being crawled and any issues faced with security. Almost all tools provide this feature, such as Solr, FAST, Endeca and Autonomy.

• Database Connector: DB crawling is again made trivial in Google: the implementer fills in only the bare minimum of details before triggering the database indexing. The key configurations needed are username, password, JDBC URL, database name, results URL hostname, JDBC driver class name, SQL query and primary key field. Stylesheets can be used to allow users to preview a document in a specific format in the search results page.
  >> Disadvantage: A few key disadvantages faced during the implementation are:
    »» Google's inability to allow end implementers to schedule the DB crawl.
    »» Google's way of removing content from the index is quite primitive and time-consuming. The only way is to include the URL in the do not crawl (DNC) list and wait for the appliance to realize it. It is even more complicated for connector/XML-fed content.
    »» Poor diagnostics for connector/XML-fed content.
  >> Alternative: Compared to GSA, we found Apache Solr to be a better option for indexing the database via the data import handler.
    »» Indexing 2,000 documents in Solr would take about 5-10 seconds.
    »» The Solr console provides a very good overview of documents crawled and the status of the crawl. Log files can be configured to capture the crawler status as well.
    »» A delta import in Solr can be achieved using the regular "deltaQuery" with a TIMESTAMP column. Additionally, XML import (using the /update handler) can be used to import new or modified documents in the form of an XML file.
    »» Solr provides an effective way to remove content from the index, either via the admin console or via XML import (/update with the delete option).

• SharePoint Connector: With its introduction of SP2010, Microsoft provides an easy Web-based medium for uploading, maintaining and sharing documents within the enterprise. The portal-collaboration-CMS-DMS (document management system) apparatus provides tight integration with Integrated Windows Authentication (IWA) for securing sites and/or pages and/or documents.

Google provides connectors to very few CMS systems out of the box. One such system is Microsoft's SharePoint, which makes sense since SharePoint is gaining momentum within the enterprise due to its ease of use and cost benefits. For seamless integration with SharePoint, GSA uses the content feed, which handles authorization along with the connector configuration. The connector supports late security binding with bulk authentication at query time.

Google Services (GS) is a component that must be installed on the SharePoint server to ensure a smooth handshake between the Google connector and SharePoint sites, including site/sub-site indexing and bulk authorization at index and query time. For more information, the step-by-step approach is easily available on the Google open source site.
  >> Disadvantage: Even if Google is executing a bulk late binding, performance issues at query time are inevitable when the document volume is high.
  >> Alternative: One alternative is to treat the site/page/document level security as additional metadata and develop an application that post-filters the results based on end-user security attributes. This is again a primitive method and has its own disadvantages in terms of query time latency.
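The post-filtering alternative described above can be sketched in a few lines. This is a minimal illustration, not part of GSA or the SharePoint connector: the `SearchResult` type, the `allowed_groups` field and the `post_filter` function are hypothetical names, and we assume the security groups were stored as metadata with each document at index time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchResult:
    """One hit returned by the search engine (hypothetical shape)."""
    uri: str
    title: str
    allowed_groups: frozenset  # security groups stored as metadata at index time


def post_filter(results, user_groups):
    """Keep only results whose allowed groups overlap the user's groups.

    Results with no allowed groups are dropped (deny by default).
    """
    user_groups = frozenset(user_groups)
    return [r for r in results if r.allowed_groups & user_groups]


if __name__ == "__main__":
    hits = [
        SearchResult("http://intranet/hr/policy", "HR policy", frozenset({"hr"})),
        SearchResult("http://intranet/fin/budget", "Budget", frozenset({"finance"})),
    ]
    for hit in post_filter(hits, {"hr", "it"}):
        print(hit.uri)
```

Because the filter runs after the engine has already returned its results, every query pays for the extra pass, which is the query time latency cost noted above.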
  >> Alternative: There are quite a few commercial off-the-shelf (COTS) products, such as Vivisimo, Microsoft FS4SP (FAST Search for SharePoint) and FAST ESP (now FAST Search for Internet Sites, FSIS), that are designed to better handle SharePoint SP2010 at a higher cost. We have found FAST ESP to be an effective tool that handles diverse CMS, including SharePoint, and offers an early binding security model for better performance. Now, the tightly integrated SP2010 search, FS4SP, seems to be preferred by many enterprise intranets. SP2010 has its own cost benefits if the enterprise's primary technology stack includes Windows and other Microsoft product suites.

• Forms Authentication: Google provides a simple way to define forms authentication at both index time and query time. Forms authentication at index time allows the GoogleBot to register the security cookie based on the domain of the secured website that requires indexing. This cookie is carried with the GoogleBot, allowing the bot to crawl and index secured content.

At query time, Google uses the query time configuration to make a HEAD request that allows the logged-in user (within a specific domain) to view only the content that he or she is authorized to view. This late binding security model has its disadvantages in terms of query time latency and performance issues at the hosting server end when too many HEAD requests are made to the Web server. Performance degradation is inevitable with higher queries per second (QPS) and/or a higher results count.
  >> Alternative: There are tools that support an early binding security model, which allows the search engine to cache the user security groups along with the content. The disadvantage of early binding is that real-time security changes are not immediately reflected in the system.
  >> Note: One disadvantage of Apache Solr is that it does not handle secured content. The only way to serve secured content is to store the security tags/groups as metadata and implement a field (or metadata) constrained search.

• AutoSuggest: GSA provides an open source component called "search-as-you-type," which allows end implementers to fetch real-time results from the appliance (see Figure 2). In order to integrate the results from Google with custom (suggestive) terms, developers need to build a special component.

TermFederator is a custom component that fetches user configurable (suggestive) terms stored in the database and/or any other CMS. (Note: TermFederator is only a consumer of the terms stored in the database; any governance for these terms is outside the scope of this architecture.) Adequate caching can be used in this component to improve the retrieval performance of the database/CMS query. The TermFederator is configured in GSA as a "Onebox" module. At the time of the end-user query, real-time results from Google and results from the TermFederator are merged into a single unified list of suggested terms. The key advantage is that GSA handles on-the-fly federation between real-time results and custom terms with minimal overhead at query time.

[Figure 2: How AutoSuggest Works. Labels in the diagram: TermsFederator, Cache, Database; connections marked HTTP, JSON and OBXML.]
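As a rough illustration of the caching and merging described for the TermFederator, the sketch below combines real-time suggestions with cached custom terms. It does not use any GSA or Onebox API; `CUSTOM_TERMS`, `custom_terms_for` and `merge_suggestions` are hypothetical names, and the in-memory tuple stands in for the database/CMS table of user configurable terms.

```python
from functools import lru_cache

# Hypothetical stand-in for the database/CMS table of user-configurable terms.
CUSTOM_TERMS = (
    "search appliance",
    "SharePoint connector",
    "search architecture",
    "social search",
)


@lru_cache(maxsize=1024)
def custom_terms_for(prefix):
    """Return custom terms starting with the prefix (case-insensitive).

    The result is cached per prefix so repeated keystrokes do not hit the store.
    """
    p = prefix.lower()
    return tuple(t for t in CUSTOM_TERMS if t.lower().startswith(p))


def merge_suggestions(realtime, custom, limit=10):
    """Merge two suggestion lists in sequence, dropping case-insensitive duplicates."""
    seen = set()
    merged = []
    for term in list(realtime) + list(custom):
        cleaned = term.strip()
        key = cleaned.lower()
        if key and key not in seen:
            seen.add(key)
            merged.append(cleaned)
    return merged[:limit]


if __name__ == "__main__":
    realtime = ["search engine", "Search Appliance"]  # e.g. from search-as-you-type
    print(merge_suggestions(realtime, custom_terms_for("sea")))
```

Real-time terms come first and custom terms are appended in sequence; no relevancy ranking is applied in this sketch.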
  >> Disadvantage: Onebox modules are designed to respond within one second. This could result in no results from the TermFederator if there is any delay at the database.
  >> Alternative: The "TermsComponent" in Apache Solr is an effective autosuggest tool. Terms stored in any local text file can be made available to Solr at startup. A separate component designed to merge alphabetically (or sequentially) the top N terms from Solr and from the local text file would address this requirement as well.

User Interface:

• Best Bets, aka KeyMatches, aka AdWords.
• Related search terms, the same as synonyms.
• Query expansion, the same as dictionary-based search, allows users to do a bidirectional search based on terms maintained as a part of this configuration.
• Faceted search, aka Guided Navigation: GSA does not support faceted search. But this feature can be achieved via metadata constrained search at query time, similar to how it is implemented in Solr. The structure of content within the appliance is flat, and there is no hierarchy and/or taxonomy maintained.
  >> Disadvantage: Facet count in GSA is not available OOTB.
  >> Alternative: As a key requirement in most enterprise search implementations, faceted search is currently GSA's primary drawback. But again, if we think of alternatives, here are a few options:
    »» Continue using GSA: We could develop an application that returns counts based on certain fields (metadata) that are available. If not carefully planned, this could be another query time overhead. Additionally, GSA partners commercially sell components supporting "parametric search" that can be evaluated for our requirement.
    »» Consider other COTS/open source tools: (Oracle) Endeca and (HP) Autonomy maintain content hierarchy for guided navigation. Indexing hierarchical content is unsupported in the future Microsoft FS4SP, but faceted search will continue to be supported. Faceted search is one of Apache Solr's strongest features and is implemented within many e-commerce Websites.
• Related Results: Collective results from different and/or the same domain that refer to the same content are again OOTB with Google.
• External Data Federation: The Onebox module in GSA allows us to federate search results from an external data source/store. GSA internally handles uncluttered merging of the Onebox results with organic results in sequence. (Note: no relevancy or algorithm is applied for merging.) This is one powerful feature that Google provides in comparison to other tools on the market.
• Assessing Social Search: Social search, or rather personalized social search, has emerged as a key requirement within today's enterprises. Within social search, expertise search and expert rating are common requirements. Enterprises are keen to find out who specializes or has expertise in specific fields. This information is used to assist internal research, consulting, resource allocation (across departments) and/or simple networking. Papers published by experts are the most sought after within an enterprise. Research papers rated by even a contact or an expert are valued more highly than documents that are not rated at all.

People search is an OOTB GSA feature. People search in GSA becomes easy with the provision for integration with Lightweight Directory Access Protocol (LDAP) servers. But for social search, we would need to develop additional components that allow us to tag people as experts and allow end users to rate content. The data that is captured and stored via any of these custom components is imported as external metadata into the GSA appliance. GSA links any imported external metadata to the content already available in the index based on the content uniform resource identifier (URI). We can thus link people information that is already indexed with expert and rating meta information.

ExpertsConnect is a custom component that keeps track of people and contacts within GSA. These contacts provide a link to "prospective" experts. This information is fed to GSA as external metadata and hence is linked to the people data that is indexed from a directory residing on an LDAP server.
[Figure 3: The Anatomy of Social Search. Labels in the diagram: PeopleSearch, LDAP server, ExpertsConnect, InfoValuator, Database; connections marked HTTP, JSON, OBXML and Stores.]

At query time, any people information associated with organic results is tagged. Additional details, pertaining to designation, department, etc., are fetched as a separate Onebox result based on the information that is indexed from LDAP. These details are linked with the externally fed contact details, and this augmented set of "prospective" experts is displayed as a single Onebox result along with the organic search engine results page (SERP).

The organic results, when integrated with a rating system, would allow people to rate and/or tag content. This can be achieved using the "InfoValuator" component.

• InfoValuator: The InfoValuator component captures the end-user rating and saves a combination of user identity, content URI and rating value in the backend data store. On a scheduled basis, this data is fed into the appliance as external metadata and is associated with the content and/or people information. The rating is fetched along with Google's organic search results. The UI can be designed to allow any logged-in user to rate the content on the fly, which the InfoValuator component would capture and store.
• Accessing Desktop Search: The Google component for desktop search was decommissioned as of September 2011. But it was one solid platform-independent tool that allowed implementers to index and search desktop content. The tool came with a provision to configure an internal Google appliance as one of the search sources. Any search made from the Google toolbar for desktops returned results from both the desktop and the appliance. Note: The Google desktop search tool maintained an index within the user's own file system.
  >> Alternatives include:
    »» COTS vendors like Autonomy and Microsoft FAST ESP provide options to implement specific independent components for desktop search. But these components are not cost-effective for a simple desktop search.
    »» Apache Lucene and/or Solr is a good option for desktop search. The tool, which is easy to install and configure, comes with a default Jetty server that can be used to index local file systems. Again, Lucene/Solr is limited to searching files within a desktop and cannot handle indexing e-mail servers out of the box.

Conclusions

There is no one search engine that fulfills all enterprise search requirements. HP Autonomy claims this lofty perch, but it comes with a huge cost overhead, with the base cost crossing half a million dollars. Open source search engines are widely used, but large enterprises still fear open source products for their lack of professional product support. Google Search Appliance has been the preferred tool for many medium to large enterprises for its ease of use, ready-to-go model and all-inclusive support package that comes with the purchase of the appliance. But again, even Google is not the right fit for many requirements that we have seen so far. Custom search application development is inevitable and, if well planned, we can use basically any tool in the market to implement enterprise search as a full-fledged application. Identifying the total cost of ownership (TCO) and return on investment (ROI) ahead of time would help enterprises strike the right balance between choosing the right search tool and investing time and money in extending the tool for required customizations.

[Figure 4: Architecture Alternate Using Apache Solr 3.1 & Nutch 1.3. Labels in the diagram include: Search Frontend Application (InfoValuator, People Content Builder, Frontend Renderer, Terms Federator Plug-in, Security Post filter, Experts Connect), Solr Search, People Search, XML Content Feeder, Database, LDAP, Desktop Files, the Web, SharePoint, Nutch (regex-urlfilter.txt, nutch-site.xml), Solr HTTP Servlet, DisMax Handler, /Update Handler, More like this, Spell handler, Data import, Synonyms, Caching, Query parser, Apache Lucene Indexer & Search Engine. Legend: Solr Components, Content Repositories, Custom Applications, Nutch Components.]

References

• Google Search Appliance Document Reference
• Google Search Appliance SharePoint Implementation
• Apache Solr 3.1

About the Author

Aruna Vaidyanathan is an Enterprise Search Architect within Cognizant's Portal Content Collaboration (PCC) Practice. With 10 years of IT experience, Aruna has spent seven-plus years integrating enterprise search projects within various industries such as manufacturing and logistics, consumer goods and life sciences. As a part of an 800-member practice with over 80 enterprise search professionals, Aruna's primary role is to evaluate top enterprise search products in the market and provide domain-specific consultation and search product implementation recommendations to Cognizant clients. Aruna holds a master's degree in computer applications and can be reached at
About the Practice

Cognizant's Portal, Content and Collaboration (PCC) practice is an 800-member focus group that consolidates expertise and service delivery offerings in the PCC space. Enterprise Search is a primary practice with proven capabilities to provide end-to-end search solutions for Cognizant customers in the PCC domain. With over 80 professionals ranging from technical specialists to senior architects, the practice focuses on enterprise search consulting, systems integration, solution migration and post-implementation support. The practice holds partnership agreements with various leading enterprise search product vendors and can be reached at

About Cognizant

Cognizant (NASDAQ: CTSH) is a leading provider of information technology, consulting, and business process outsourcing services, dedicated to helping the world's leading companies build stronger businesses. Headquartered in Teaneck, New Jersey (U.S.), Cognizant combines a passion for client satisfaction, technology innovation, deep industry and business process expertise, and a global, collaborative workforce that embodies the future of work. With over 50 delivery centers worldwide and approximately 137,700 employees as of December 31, 2011, Cognizant is a member of the NASDAQ-100, the S&P 500, the Forbes Global 2000, and the Fortune 500 and is ranked among the top performing and fastest growing companies in the world. Visit us online at or follow us on Twitter: Cognizant.

World Headquarters
500 Frank W. Burr Blvd.
Teaneck, NJ 07666 USA
Phone: +1 201 801 0233
Fax: +1 201 801 0243
Toll Free: +1 888 937 3277
Email:

European Headquarters
1 Kingdom Street
Paddington Central
London W2 6BD
Phone: +44 (0) 20 7297 7600
Fax: +44 (0) 20 7121 0102
Email:

India Operations Headquarters
#5/535, Old Mahabalipuram Road
Okkiyam Pettai, Thoraipakkam
Chennai, 600 096 India
Phone: +91 (0) 44 4209 6000
Fax: +91 (0) 44 4209 6060
Email:

© Copyright 2012, Cognizant. All rights reserved. No part of this document may be reproduced, stored in a retrieval system, transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the express written permission from Cognizant. The information contained herein is subject to change without notice. All other trademarks mentioned herein are the property of their respective owners.