Search Readiness Checklist: Moving to Solr/Lucene Open Source Search
A Lucid Imagination White Paper
Search was once considered a black-box application that ingested content and delivered results to users opaquely. However, driven by the opportunities and demands of the growing universe of content and by the versatility of Solr/Lucene open source search technology, search applications are evolving from a standalone facility to an enabling framework.
http://www.lucidimagination.com/developer/whitepapers/search-readiness-checklist

Transcript of "Moving to Solr/Lucene Open Source Search"

  1. 1. Search Readiness Checklist: Moving to Solr/Lucene Open Source Search A Lucid Imagination White Paper
  2. 2. Abstract

Search was once considered a black-box application that ingested content and delivered results to users opaquely. However, driven by the opportunities and demands of the growing universe of content and by the versatility of Solr/Lucene open source search technology, search applications are evolving from a standalone facility to an enabling framework.

Good search is hard. While the basics of search technology can be deceptively simple, the art and science of applying that technology to relevant business and content processing problems are daunting. By its very nature, search can span an almost infinite variety of content, formats, subject matter, relevancy criteria, and more.

This Open Source Search Readiness Checklist is organized into four broad categories:

  • Why do you need a search application?
  • What are the key technical characteristics of your search application?
  • What is your search application’s technology environment?
  • How can you ensure the best fit between Solr/Lucene and your ongoing business needs?

Each category details key issues to consider in moving to open source search. Whether you are undertaking a new search application or have a working search application running on a platform you are considering leaving behind, this checklist provides a working foundation to help you make the transition smoothly.

Working with Lucid Imagination, the commercial company for Solr/Lucene open source search technology, offers you packaged solutions that simplify and streamline search application development; lower the cost of growth through flexible, adaptable architecture; and deliver reliable backing of unmatched expertise in enterprise search and open source.

Lucene/Solr Open Source Search Readiness Checklist A Lucid Imagination White Paper • September 2010 Page i
  3. 3. Contents

Introduction
I. Why Do You Need a Search Application?
II. What Are the Key Technical Characteristics of Your Search Application?
III. What Is the Technology Environment in Which You Are Building Your Search Application?
IV. How can you ensure fit between Solr/Lucene and your ongoing business needs?
Summary of Questions
About Lucid Imagination
Recommended Reading
Appendix: Solr/Lucene Features and Benefits
  4. 4. Introduction

Whether you are undertaking a new search application or have a working search application running on a platform you are considering leaving behind, there are a lot of questions you’ll need to answer to be prepared for the effort.

Good search is hard. While the basics of search technology can be deceptively simple, the art and science of applying that technology to relevant business and content processing problems are daunting. By its very nature, search can span an almost infinite variety of content, formats, subject matter, relevancy criteria, and more. Add in the fact that there are almost as many ways to judge relevant results as there are individual end users, and you can see the challenge.

This Open Source Readiness Checklist is organized into four broad categories, each with a discussion of the issues and opportunities you’ll need to consider as you prepare for your search application. Where applicable, we’ll provide additional references for further study or research.

  • Why do you need a search application?
  • What are the key technical characteristics of your search application?
  • What is your search application’s technology environment?
  • How can you ensure the best fit between Solr/Lucene and your ongoing business needs?

This guide is not intended to replace a design strategy, architectural rigor, or a formal requirements document. By considering answers for the issues it sets forth, we believe you’ll be better prepared for getting your Solr/Lucene application up and running.

If you are replacing a legacy commercial platform, you may wonder: Can Solr/Lucene be a complete search platform if you can’t just “drop it in” and replace what you now have, function for function, feature for feature? Consider first that, owing to the great variation of search problems, search technology providers have historically taken different approaches to developing their own toolkits: An effort to imitate one with the other will not cut it.

We believe you will be best served by a fresh look at the problem search was meant to solve, unburdened by the details of prior implementations. More importantly, the flexibility and adaptive nature of Solr/Lucene open source will both enable immediate transition and lay the foundation for evolving your application to meet emerging needs.

The key measure of readiness for the transition is a solid grip on the value of the effort. Lucid Imagination’s customers report that Solr/Lucene technology delivers tremendous benefits in flexibility, result quality, performance and, most importantly, an ability to control their business and technology destiny with search. Those same customers use Lucid Imagination’s services and solutions to lock in those gains, and cement the competitive advantage achieved with Solr/Lucene. We believe an understanding of these advantages will lead you to apply Solr/Lucene most effectively, and identify where it is that Lucid Imagination can help you design, develop, and deploy your search application with confidence.
  5. 5. I. Why Do You Need a Search Application?

In understanding the motivation behind your search application, consider how best to align three factors: your users, your data, and your business objectives.

When you build a search application, you face end users with expectations driven by their experience with the large consumer search engines on the public Internet, such as Google, Bing, and Yahoo. Certainly, the billions of dollars spent on billions of end users searching trillions of documents have delivered broad-ranging innovations.

It’s a fundamentally different proposition to build your own search application. Internet searches may produce millions of results in milliseconds, but they rely on measures like website popularity or on URLs and domain names, which are not generally applicable to purpose-built applications for businesses. Relying on generalized relevancy for a global population of all Internet users, the big Internet search engines are not tied to your business rules, business process logic, or the opportunity cost of improved precision for your specific set of data or your search users. Their business interests are not yours.

Retrieval of unstructured, heterogeneous documents and data is where Lucene/Solr search technology excels. Much of that data has been stored in relational databases, which offer robust storage and stability, but whose query and retrieval model is ill-suited to the more varied, dynamic modern data landscape.

Solr/Lucene search technology offers extraordinarily broad applicability, flexibility, scalability, and adaptability. Open source provenance contributes directly to those benefits in many ways. It provides a broad community of professional developers, testing and perfecting the technology against tremendous variation in use cases, as well as changes and improvements that are strictly peer-reviewed, creating a broad foundation of innovation. Not to mention faceting, geo-search, numeric range queries, speed and scalability into the billions of documents, near-real-time indexing, and many more innovations that have broken barriers to building effective search applications.

Another great capability inherent in the Solr/Lucene platform is anticipating the future needs of the broad range of users. With adaptive and editorial boosting relevancy techniques, query corrections and suggestions, recommended results, and faceted search, search applications built with Solr/Lucene help your business control the quality of experience between your users and your data, and fit that experience to your business objectives.

RECOMMENDED READING:
  • Starting a Search Application. Marc Krellenstein, CTO and Founder, Lucid Imagination
  • The Case for Lucene/Solr: Real World Open Source Search Applications. A Lucid Imagination White Paper
  6. 6. 1. What business objectives are (or should be) achieved with your search application?

Free software, such as Lucene and Solr open source search, does not mean search is free of effort. If your search project is successful, consider how you will prove it: Which of these would you be able to point to?

(a) Save money? How much or how much more?
(b) Save time? How much or how much more?
(c) Increase revenue? How much or how much more?
(d) Increase end user satisfaction? Which ones?
(e) Create advantage over competitors?
(f) Decrease risk? How much or how much more?
(g) More than one of the above?

2. What objectives are (or are not) being met with your current search implementation?

Most organizations have a system for finding information, often a legacy commercial search system. Why is it unsatisfactory? If you were to replace or improve it, which of the results in the previous question would it affect? By how much?

3. Which improvements in search behavior contribute to improved business results?

Which of the following properties of your search application (one or more) would have the most impact on the business results you are looking for?

(a) Speed with which new content is available.
(b) Likelihood the user’s chosen result is in the top n results returned.
(c) Completeness of the full set of results the system delivers.
(d) Speed with which queries deliver result sets.
(e) Flexibility with which the system handles different types of queries.
(f) Ability of the system to never deliver “zero” results.
(g) Ranking of particular results for particular queries.
(h) Reduced effort required for users to find previously unknown content.
(i) Likelihood the user will return to use the search system again and again.
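Several of the properties in question 3 can be measured before and after a migration. As one illustration, property (b) could be tracked from search logs roughly as follows. This is a minimal sketch, not part of the white paper: the log shape (a ranked list of document IDs plus the ID the user finally chose) and the document IDs are hypothetical.

```python
def top_n_hit_rate(sessions, n):
    """Fraction of sessions whose chosen result appears in the first n results.

    `sessions` is a list of (ranked_ids, chosen_id) pairs, an assumed log
    format used only for this illustration.
    """
    if not sessions:
        return 0.0
    hits = sum(1 for ranked_ids, chosen_id in sessions if chosen_id in ranked_ids[:n])
    return hits / len(sessions)


# Hypothetical log entries: what the engine returned, and what the user picked.
sessions = [
    (["doc7", "doc2", "doc9"], "doc2"),
    (["doc4", "doc1", "doc3"], "doc3"),
    (["doc5", "doc8", "doc6"], "doc0"),  # the user never found the document
    (["doc1", "doc2", "doc3"], "doc1"),
]

for n in (1, 2, 3):
    print(n, top_n_hit_rate(sessions, n))
```

Comparing this figure between the legacy system and a prototype gives one concrete answer to "by how much?" in question 2.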
  7. 7. 4. How much control do you need over the results that end users see?

Within the realm of search behaviors, special attention needs to be paid to the control of search results. Often, the application of algorithms, business rules, and access rights ties directly to the economic benefits of search. Solr/Lucene offers great depth in this dimension. The previous question asked about general changes in search behavior; here, consider specifically how important direct control of results is to the success of the application.

(a) Do you need to adjust the likelihood that particular results or documents appear at a certain time, or in relationship to other results?
(b) Are there certain documents or data that should be delivered to certain users, but withheld from others?
(c) Are there algorithms that you need the system to account for programmatically, in automated fashion during the course of search, such as performing probability calculations?
(d) How important is it that you understand why the search returned a particular set of results, and be able to adjust the search behavior as a result?

5. How much do your end users know about the content they are searching for?

The behavior of your search application will be judged by its end users; how much do you know about those users and the queries they are likely to submit? Consider the following contrasts. Are your end users likely to:

(a) Express their queries in terms or phrases that will narrow in on results quickly, or submit broad, general words that retrieve broad results?
(b) Spell the terms they are searching for correctly?
(c) Search for known results in an unknown location (e.g., “Find the e-mail I sent to Carol on Tuesday, August 10”)? Or undertake a search without knowing which content they might find?
(d) Browse through interim sets of results in order to narrow or refine their search queries?
(e) Specify quantitative parameters, such as distances, prices, locations, or dates, as part of their search?
(f) Use logic-oriented language (e.g., Boolean queries or wildcard characters) or natural language?
  8. 8. II. What Are the Key Technical Characteristics of Your Search Application?

Given the flexibility and broad applicability of Solr/Lucene open source search technology, there is a rich set of design decisions to be made in setting up the application to meet your business objectives within the scope of your technology. In this section, we’ll explore some of the key inputs you’ll need to consider before you begin the exercise of architecture and design of your search application. In most, if not all, of the permutations of search needs implied by the questions below, the flexibility of Solr/Lucene search can address your needs.

It’s important to note that these questions are not intended to replace a formal design process or substitute for rigorous architectural assessment of how you can use Solr/Lucene to build a successful search application. Rather, they will help establish your intent with respect to key functional and system behaviors.

More than in the previous sections, you may find that the answers to the scoping questions below change over time. As you familiarize yourself with the capabilities and possibilities available with the Solr/Lucene search platform, you may well want to refine or revise your understanding of what constitutes desired behavior.

Often, organizations build a working prototype of their search application in order to validate the assumptions, as well as the design and implementation of the system intended to put those assumptions into action. While there are many nuances to formal development methodologies that exploit this discover-by-doing effect, they share a common pattern of implementation, iteration, learning, improvement, and change.

It is strongly recommended that you consider at least two sets of answers to the questions below: first for a prototype implementation, and then for one or more revisions of that implementation going forward, once you accumulate experience and discover the full range of possibilities.

RECOMMENDED READING:
  • Faceted Search with Solr. Yonik Seeley, creator of Apache Solr and co-founder of Lucid Imagination
  • Optimizing Findability in Lucene and Solr. Grant Ingersoll, Chair, Apache Lucene PMC and co-founder of Lucid Imagination
  9. 9. 1. In what formats are the documents and data you will search?

Much as documents and data can live in different repositories, they come packaged in different formats, based on where they originated and who created them. A good understanding of these formats enables successful content processing for search. Different format types require different levels of interpretation and composition to separate out searchable text content and metadata (information about the document or its content), which can inform a search, from visual presentation details such as colors, fonts, or software-specific content. For each of the formats, there are further considerations of version; to cite just one example, the formatting and file structure of Microsoft Word 97 *.doc documents differs from the Office 2007 *.docx version. Solr/Lucene can leverage a range of tools, both built-in and extensions, open source as well as commercial. Which of the following document format types will you be indexing and searching?

(a) XML documents
(b) Database records
(c) HTML documents
(d) Microsoft Office documents: *.doc or *.docx for Word; *.ppt or *.pptx for PowerPoint; *.xls or *.xlsx for Excel
(e) PDF documents
(f) CSV (comma separated values) or TSV (tab separated values) files
(g) Open Office documents
(h) Engineering drawings from CAD/CAM/CAE systems
(i) Others

2. Document collection composition: how big are documents?

Configuring your search system requires an understanding of your document sizes, as performance and throughput depend heavily on accounting for the size of documents to be indexed. What percentage or fraction of your documents are:

(a) Under 1 KB
(b) 1 KB to 100 KB
(c) 100 KB to 500 KB
(d) 500 KB to 1 MB
(e) 1 MB to 5 MB
(f) 5 MB to 10 MB
(g) 10 MB to 50 MB
(h) 50 MB to 100 MB
(i) 100 MB to 250 MB
(j) 250 MB and up
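For content that lives on a file system, the size breakdown in question 2 can be measured rather than estimated. The following is a minimal sketch under stated assumptions: documents are plain files under one directory tree, 1 KB is taken as 1024 bytes, and the function names are illustrative rather than part of any Solr or Lucene tooling.

```python
import bisect
import os
from collections import Counter

KB = 1024
MB = 1024 * KB

# Exclusive upper bounds for buckets (a) through (i); sizes at or above the
# last bound fall into bucket (j).
BOUNDS = [1 * KB, 100 * KB, 500 * KB, 1 * MB, 5 * MB,
          10 * MB, 50 * MB, 100 * MB, 250 * MB]
LABELS = ["under 1 KB", "1 KB to 100 KB", "100 KB to 500 KB", "500 KB to 1 MB",
          "1 MB to 5 MB", "5 MB to 10 MB", "10 MB to 50 MB", "50 MB to 100 MB",
          "100 MB to 250 MB", "250 MB and up"]


def bucket_for(size_bytes):
    """Return the checklist bucket label for a file size in bytes."""
    return LABELS[bisect.bisect_right(BOUNDS, size_bytes)]


def size_distribution(root):
    """Walk `root` and return {bucket label: fraction of files in that bucket}."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                size = os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                continue  # unreadable or vanished file: skip it
            counts[bucket_for(size)] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: counts[label] / total for label in LABELS}
```

Calling `size_distribution("/path/to/documents")` on a representative sample of the collection yields the fractions the question asks for; content held in databases or content management systems would need a different measurement.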
  10. 10. 3. How much new content do you presently add per unit time? How many documents are updated per unit time?

The quality of your search results can be affected by the interval between when a document is complete or ready, and when it appears in the index for searching.

(a) Millions of very small documents (tweets, comments, messages, log files, etc.) appear continuously as users or systems create these content snippets.
(b) Existing documents are revised, either by users or by machines (for example, reports and data output indexed by your search application).
(c) New documents are available less frequently, perhaps even on a regular schedule, which in turn drives user expectations of when they can be searched.
(d) Changes to content come in particular windows, busier at some times than others.

Consider the question of change to your collection in two ways: First, at what interval does the amount of content in your collection change? Second, what fraction of the total documents are you adding to the overall collection within each interval?

(a) From minute to minute
(b) About to four times per hour
(c) No more than two per hour
(d) No more than once every 4 hours
(e) Daily
(f) Weekly
(g) Monthly

4. What is the rate of queries you expect from your user population?

Consider the population of users who drive your search application. How many are there, and what number of queries might be submitted? Consider especially that queries in the search application do not always map one-to-one with a single string entered by a user in a search box. Use these questions to characterize how many queries your search application will need to handle per unit time, typically in queries per second.

(a) How often do they need access to the application?
(b) Will they submit queries one at a time on an occasional or ad-hoc basis, or will they rely on the search application for continuous, constant use?
(c) Do they have the expertise necessary to narrow quickly on search results, or will they require continuous iteration, using one set of results to inform a series of subsequent queries?
(d) Will they have the expertise to write queries that conform precisely to the search application’s expectation, or will you rely on the search application to analyze and decompose their terms and phrases to ensure efficient execution and relevant results?
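The answers to question 4 can be turned into a rough queries-per-second figure. The sketch below is a back-of-the-envelope calculation, not a sizing method from the white paper; every parameter name and every input value is a hypothetical placeholder to be replaced with your own estimates. The `requests_per_query` factor reflects the point above that one string typed in a search box may fan out into several queries (autosuggest, facets, spellcheck).

```python
def estimate_qps(users, sessions_per_user_per_day, queries_per_session,
                 requests_per_query=1.0, busy_hours_per_day=8.0, peak_factor=1.0):
    """Return (average_qps, peak_qps) from rough usage estimates.

    All arguments are assumptions supplied by the reader; `peak_factor` is the
    ratio of the busiest period's load to the average load.
    """
    daily_requests = (users * sessions_per_user_per_day
                      * queries_per_session * requests_per_query)
    average_qps = daily_requests / (busy_hours_per_day * 3600)
    return average_qps, average_qps * peak_factor


# Hypothetical inputs: 10,000 users, 2 sessions a day, 4 queries per session,
# 3 back-end requests per typed query, an 8-hour working day, 3x peak load.
avg, peak = estimate_qps(10_000, 2, 4, requests_per_query=3.0,
                         busy_hours_per_day=8.0, peak_factor=3.0)
print(f"{avg:.1f} {peak:.1f}")  # prints "8.3 25.0"
```

Here 240,000 requests spread over 28,800 seconds give about 8.3 queries per second on average, and 25 at the assumed peak.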
  11. 11. 5. Does your content require faceting or a taxonomy in order to support productive navigation and discovery by end-users?

Faceted search provides an effective way to allow users to refine search results, continually drilling down until the desired items are found. For example, on an e-commerce site, Solr/Lucene can present a list of different brands of a flat-screen television, or let the user navigate into results. Facets can span virtually any list of attributes, from sets of terms within a field to dates to numeric ranges and the like. In addition to document-driven faceting, some search applications add an external taxonomy platform to derive metadata, i.e., to extract what documents are about and append fields that support guided navigation through results.

(a) Do documents contain data or metadata that allow users to narrow results?
(b) Are there consistent rules of document analysis you can create and apply to derive attributes from documents?
(c) If documents lack native metadata, can you use a third party taxonomy platform to identify attributes for faceted navigation?

6. Which advanced search features do you expect to use in order to improve how users can submit queries and choose?

Solr/Lucene offers a broad set of powerful query and search tools that can help users quickly choose from available options, either before or after they submit a query. Which of the following features can help improve the speed and efficacy of the experience for your end users?

(a) Autosuggest/as-you-type: The search application prompts the user with possible alternate queries implied by a partial or complete search term, as they type in the search box.
(b) Spellchecking: The search application can interpret search terms that are not necessarily spelled correctly, either prompting the user with correctly spelled alternatives, and/or automatically retrieving results that match terms that most closely resemble the misspelled word in the query.
(c) Did you mean: Similar to spell checking, the search application can offer alternate matches to terms that resemble the user’s query, even when those terms were not typed in explicitly.
(d) More like this: The search application allows the user to drill down into a particular element of one result set to find additional results that resemble it.
(e) Hit highlighting: The search application can mark or emphasize specific terms from the query in snippets of the document result, showing the user which terms match the query.
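Faceting, hit highlighting, and spellchecking are requested from Solr through ordinary HTTP query parameters (`facet`, `facet.field`, `hl`, `hl.fl`, `spellcheck`). The sketch below only builds such a request URL and does not send it. The base URL, the field names `brand`, `screen_size`, and `description`, and the helper name are assumptions for illustration; spellchecking also assumes a request handler that has the spellcheck component configured.

```python
from urllib.parse import urlencode


def build_solr_query(base_url, text, facet_fields=(), highlight_fields=(),
                     spellcheck=False, rows=10):
    """Build a Solr select URL asking for facets, highlighting and spellcheck."""
    params = [("q", text), ("rows", str(rows)), ("wt", "json")]
    if facet_fields:
        params.append(("facet", "true"))
        params.append(("facet.mincount", "1"))  # hide facet values with no hits
        # facet.field may be repeated, once per field to facet on
        params.extend(("facet.field", field) for field in facet_fields)
    if highlight_fields:
        params.append(("hl", "true"))
        params.append(("hl.fl", ",".join(highlight_fields)))
    if spellcheck:
        params.append(("spellcheck", "true"))
    return base_url + "?" + urlencode(params)


# Hypothetical e-commerce query echoing the flat-screen television example.
url = build_solr_query(
    "http://localhost:8983/solr/select",  # assumed host and handler path
    "flat screen television",
    facet_fields=["brand", "screen_size"],
    highlight_fields=["description"],
    spellcheck=True,
)
print(url)
```

Keeping the feature switches as parameters makes it easy to try each feature from the list above separately in a prototype.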
  12. 12. III. What Is the Technology Environment in Which You Are Building Your Search Application?

Driven by the opportunities and demands of the growing universe of available content and by the versatility of Solr/Lucene open source search, search is evolving from a standalone facility to an enabling framework. Search was once considered a black-box application that ingested content and delivered results to users opaquely. No more. Today, developers are turning to Solr/Lucene to extend the data access and management power of their applications into the realm of unstructured text: documents, articles, product descriptions, case studies, informal notes, websites, forums, wikis, inventory data, patient records, e-mail messages, resumes, patents, legal decisions, tweets, log files, traditional relational data stores, and nontraditional data infrastructure. The examples are endless. Effective retrieval of timely, actionable content in the face of such diversity means treating search as an application development platform or an enabling framework, not an end-unto-itself application.

Like any application development effort, the exercise of creating search applications and enabling existing applications with search must be driven by business considerations. With an understanding of your business needs in hand from the previous section, we now turn to the constraints and capabilities of the technology context in which the search application is to be developed and deployed, exploring key attributes of your technology environment tied to search application development.

RECOMMENDED READING:
  • Full Text Search Engine vs. RDBMS. Marc Krellenstein, CTO and co-founder, Lucid Imagination
  • Scaling Lucene and Solr. Mark Miller, Lucid Imagination; Apache Lucene and Solr Committer

1. What Programming Skills Do Your Developers Bring to Your Search Application?

Solr and Lucene search applications are typically developed as web applications. High-level search functions that can be accessed programmatically include queries, indexing commands, relevance algorithms, performance, and the like, generally presented by Solr as services and configuration options. Solr offers a particularly broad base of client libraries, which means it can be accessed through a large variety of programming languages. In which of the following languages/environments supported by Solr is your application development team skilled and experienced?

(a) JSON
(b) Java
(c) Ruby
(d) PHP
(e) Ajax
(f) Python
(g) .Net
(h) C#
(i) Perl
(j) JavaScript
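Because Solr exposes search as a web service, a client in any of these languages mostly needs to issue HTTP requests and parse the response. As a sketch of the parsing half, the example below reads facet counts out of a JSON response. The sample response is hand-written for illustration, not output captured from a real server, and the field and value names are hypothetical; it assumes Solr's default JSON layout, in which each facet field is a flat list alternating value and count.

```python
import json

# Hand-written stand-in for a Solr JSON response (wt=json); illustrative only.
SAMPLE_RESPONSE = """
{
  "responseHeader": {"status": 0, "QTime": 3},
  "response": {"numFound": 19, "start": 0, "docs": []},
  "facet_counts": {
    "facet_queries": {},
    "facet_fields": {"brand": ["BrandA", 12, "BrandB", 7]}
  }
}
"""


def facet_counts(response_text, field):
    """Return {facet value: count} for `field`, or {} if the field is absent."""
    data = json.loads(response_text)
    flat = data.get("facet_counts", {}).get("facet_fields", {}).get(field, [])
    # The flat list alternates value, count, value, count, ...
    return dict(zip(flat[0::2], flat[1::2]))


print(facet_counts(SAMPLE_RESPONSE, "brand"))
```

The same few lines translate directly to the other languages in the list, which is one practical consequence of Solr presenting its functions as services.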
  13. 13. 2. Is your development team skilled and experienced in working with Open Source?

For most intents and purposes, open-source software has “crossed the chasm” into mainstream usage, with a broad range of government, nonprofit, and corporate sectors running well-established portions of their IT infrastructure on the LAMP stack (Linux, Apache, MySQL, and PHP/Perl/Python). A recent survey of 300 large corporations by the global consultancy firm Accenture shows the majority of respondents committing strategic technology initiatives to open source. To gauge the depth of open source utilization, which of the following major open source projects are broadly utilized in your organization?

(a) Linux for server operating systems
(b) MySQL or Postgres for RDBMS
(c) Eclipse for integrated software development
(d) PHP for web application integration
(e) Apache for HTTP services
(f) Tomcat for web application containers
(g) JBoss for application business logic

3. How and where are the data and documents stored, independent of format?

Most individuals are acquainted with searching for content in the context of their own personal computer environments, such as a file system or e-mail, or through one of the popular, advertising-driven, consumer-facing commercial Internet search services. In the context of enterprise or commercial search, the diversity of data storage methods spans a much broader range of technologies, not necessarily tied to formats for individual file objects. Which of the following data repositories will your search application access?

(a) Traditional directory-oriented file servers, fileshares, and filesystems
(b) Web servers
(c) Relational databases, including Oracle, MySQL, SQL Server, Informix, Postgres, DB2
(d) Nonrelational (AKA NoSQL) data stores, such as Hadoop, Cassandra, Memcached
(e) Proprietary collaboration stores, e.g., Lotus Notes, SharePoint
(f) Open Source content management systems, e.g., Drupal, Joomla, Alfresco
(g) Proprietary Enterprise content management systems, e.g., Documentum, Vignette, OpenText
(h) XML-oriented data stores, such as MarkLogic
4. On what operating system platform(s) or environments will your search application run?

IT organizations are able to achieve significant setup/deployment economies by standardizing hardware and software practices at the platform level, along with operating practices. Because Solr runs in a Java servlet container, with indexes portable across platforms, it can operate in any of the mix of mainstream operating system environments, virtualized environments, and cloud platforms available in today’s marketplace.

(a) Linux
(b) Solaris
(c) Windows/NT Server/.Net framework
(d) Mac OS
(e) Amazon EC2 (including the above OS environments)
(f) VMWare (including the above OS environments)

5. Should you use Lucene or Solr?

Solr and Lucene are complementary technologies that offer very similar underlying capabilities. Solr is the Lucene search server; Lucene is the set of Java libraries that run inside the Solr search server, also available independent of the server implementation. As the Lucene search server, Solr presents a web service layer built atop Lucene using the Lucene search library and extending it to provide application users with a ready-to-use search platform. Solr offers search speed, relevancy ranking, complete query capabilities, portability, scalability, low overhead indexes, and rapid incremental indexing, from its Lucene core. Its server encapsulation of Lucene adds operational and administrative capabilities like web services, faceting, configurable schema, caching, replication, and administrative tools for configuration, data loading, statistics, logging, cache management, and more. Lucene gives Solr its search power.

In all but a small number of exceptions, organizations building search applications should start with Solr rather than a direct implementation of the Lucene libraries. Applications that do otherwise often began their efforts prior to the availability of Solr.

Solr provides the starting point for most developers who are building a Lucene-based search application. Organizations that build with Solr find themselves better able to adapt their application to changing data structures, query needs, user behaviors, and infrastructure configuration. These benefits accrue in lower “costs of ownership,” improved flexibility, and a broader available pool of search application developers in the marketplace.
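To make the "web service layer" concrete, the following is a minimal sketch of how an application might compose a Solr search request as a plain HTTP URL. It only builds the URL and does not contact a server. The host, port, and `/solr/select` path are assumptions about a local installation, and the field names `title` and `category` are hypothetical; `q`, `rows`, `wt`, `facet`, and `facet.field` are standard Solr request parameters.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: builds the URL for a Solr HTTP search request without sending it.
final class SolrSelectUrl {

    // Assumed address of a locally running Solr server (hypothetical deployment).
    static final String DEFAULT_BASE = "http://localhost:8983/solr/select";

    // Appends URL-encoded parameters to the base URL in insertion order.
    static String build(String baseUrl, Map<String, String> params) {
        StringBuilder url = new StringBuilder(baseUrl);
        char separator = '?';
        for (Map.Entry<String, String> p : params.entrySet()) {
            url.append(separator)
               .append(encode(p.getKey()))
               .append('=')
               .append(encode(p.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // A query returning ten JSON results plus facet counts for one field.
    static String facetedSearch(String query, String facetField) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", query);
        params.put("rows", "10");
        params.put("wt", "json");
        params.put("facet", "true");
        params.put("facet.field", facetField);
        return build(DEFAULT_BASE, params);
    }

    public static void main(String[] args) {
        System.out.println(facetedSearch("title:solr", "category"));
    }
}
```

Because the request is just a URL, any HTTP-capable client or language can issue it, which is one reason a team without Lucene API experience can often start with Solr.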
6. Are application development practices in your organization structured to address time-to-market constraints or technical complexity?

Successful application development depends on the professional practice of software development. While there are many theories, approaches, and development models, there is a key set of development disciplines practiced by successful application development organizations. Does your application development team understand the tools, methods, and mechanisms involved in the following software development competencies?

(a) Requirements analysis
(b) Iterative design
(c) Documentation
(d) Test planning
(e) Change control
(f) Architectural description
(g) Formal design
(h) Build and release engineering

7. What service level availability does your search application need to deliver to end users? What is the cost or impact of outages or service unavailability?

Solr’s ability to run on a distributed infrastructure provides robust application availability and performance at scale, allowing you to expand to meet growth in both your document collection and your user workload. As with all infrastructure, it is important to understand in advance what impact a service outage would have on your end users. A system is only as strong as its weakest link, so that understanding should guide your choices about networking, servers, storage, and operating procedures.

What is the longest interval during which your end users can be productive without access to your search application? And how often can they tolerate such unscheduled outages?

Duration                    Frequency
(a) 1 minute                (a) Once per year
(b) 30 minutes              (b) Once per month
(c) 1 hour                  (c) Once per week
(d) 4 hours                 (d) Once per day
(e) 1 day                   (e) Once an hour or more
(f) Longer than 1 day
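One way to compare answers to the duration and frequency questions is to turn each pair into expected downtime per year and an availability percentage. The sketch below is illustrative arithmetic only; the class and method names are hypothetical, and it assumes a 365-day year and outages that do not overlap.

```java
import java.util.Locale;

// Sketch only: converts a tolerated outage duration and frequency
// into yearly downtime and an availability percentage.
final class OutageBudget {

    // Assumes a 365-day year.
    static final double MINUTES_PER_YEAR = 365.0 * 24 * 60;

    static double downtimeMinutesPerYear(double outageMinutes, double outagesPerYear) {
        return outageMinutes * outagesPerYear;
    }

    // Availability as a percentage, never below zero.
    static double availabilityPercent(double outageMinutes, double outagesPerYear) {
        double down = downtimeMinutesPerYear(outageMinutes, outagesPerYear);
        return Math.max(0.0, 100.0 * (1.0 - down / MINUTES_PER_YEAR));
    }

    public static void main(String[] args) {
        // Duration (b) 30 minutes with frequency (b) once per month: 12 outages a year.
        System.out.println(String.format(Locale.ROOT,
                "%.0f minutes down per year, %.3f%% available",
                downtimeMinutesPerYear(30, 12), availabilityPercent(30, 12)));
    }
}
```

A 30-minute outage once per month adds up to 360 minutes (6 hours) per year, or roughly 99.93% availability. Running the same calculation for each combination you are considering gives a common scale for weighing infrastructure cost against the impact of an outage.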
IV. How can you ensure fit between Solr/Lucene and your ongoing business needs?

The best test of technology in the enterprise is in its ability to deliver on business needs consistently. It must strike the optimal balance between features/functions and the continuous achievement of competitive advantage for the business paying for it. Search is the same, only more so: It must constantly do a better and better job of delivering results that derive competitive advantage from matching end users to valuable information in timely fashion.

Open source can be a two-edged sword: Its innovation is unmatched, but the timing of that innovation (as is often the case with innovation in any domain) is not always predictable. While the marketplace challenges a company faces are constant and dynamic, its technology infrastructure demands a strong degree of stability and predictability. The design, building, and maintenance of applications must handle change without adding instability to the problems they aim to solve.

At Lucid Imagination, we specialize in capturing the best that open source Solr/Lucene search offers, delivering it into business-critical application development efforts in a way that improves stability and provides predictability without sacrificing the power, scalability, or flexibility of open source. With time-driven support, deep expertise, and a broad solution platform of stable value-added software, we transform open source search into a stable foundation that lets you accelerate with confidence.

In this section, we’ll present considerations for you in taking advantage of the power of open source in the context of the enterprise.
Unlike previous sections that were shaped by various options, these questions are designed to help you consider the risks and dynamics of your development effort and its ability to bridge the gap between the open source innovation you need to compete and the enterprise foundation you rely on to reap the benefits of that innovation.

1. What is your “bench-depth” in designing and deploying search applications?

If there is one element all search applications share, it is their diversity: Each set of content, queries, and end user requirements is unique. One of the great strengths of open source search is as a robust, general purpose platform capturing inputs from a broad variety of search use cases. Even when you have top talent, your search application may be limited by their experience; others inside or outside the public open source discussion archives might have experience that could benefit their efforts.

For example, the foundations of ambition for your search application are built in early: Your development team must make critical architecture and design decisions, with significant downstream impact throughout subsequent releases of your application to customers. Breadth of experience will make a critical difference in whether those assumptions will lend themselves to necessary future changes, or introduce unnecessary constraints that hobble your application when the time comes to seize new opportunities.
2. How does your organization find and incorporate changes to code or source code fixes for your applications?

Open source code is the raw material of your application development effort. The less it costs to ensure inbound quality and stability, the more you reduce risks to the application you are building.

Open source software does not stand still. Even between major releases, the team of committers and programmers developing fixes and improvements is constantly adding new ideas and features to their project. Some of these changes are available as patches, others are built into trunk and available through nightly builds, and they may or may not meet your acceptance criteria. Solr/Lucene is no different: Driven by a consensus-leveraged meritocracy, its committers produce changes that may or may not be compatible with your implementation. Identifying which of those to incorporate into development and assessing their impact on other elements of the system is a critical success factor, and the right choice may or may not be obvious at the point in time the changes become available.

3. What is the cost-benefit tradeoff of timely fixes and availability of expertise?

In building prototypes, you may or may not be able to wait for the community of experts to work on your need or provide advice; once you reach a production, business-critical scenario, you’ll need things done on your timetable, not theirs. Or, you may not wish your particular effort to have any public exposure at all, in which case you’ll want a communications channel that meets the needs of your business in your marketplace.

Many problems can be solved given enough time and effort. If your design and deployment efforts conform to a schedule where speed has value, consider the relative cost-benefits of internal trial-and-error vs. predictably delivered expertise available on demand.

4. Does the cost-benefit tradeoff of fix timeliness change once your application moves into a production environment?

Once an application’s user base extends beyond the developers who built it, its owners must be ready to deliver consistent, predictable availability, performance, and scalability. Meeting the service needs of end users cannot always be done in real time by the person who wrote the software; developers move on to other projects or leave the company. The heterogeneity of your content collection, particularly as it changes and grows, can introduce new, unanticipated sources of anomalies in its performance. Similarly, it is difficult to anticipate the full range of user queries and demands on the system, which often leads to the application’s inability to meet new, previously unaccounted-for requirements. Ensuring timeliness of fixes to accommodate these organic changes may well be beyond the reach of your development team or your IT organization. Last, and not least, ensuring that the release process itself can meet its intended thresholds of performance, throughput, and other systemic qualities can benefit from lessons learned by experts experienced across a diverse range of deployments.
5. How will you ensure a consistent, authoritative base of knowledge and skills for your development team to work from?

Critical mass of expertise in development is directly correlated with the overall effectiveness and velocity of your development efforts. The Solr/Lucene open source community provides developers with a rich, diverse base of resources to use in bootstrapping their skills, including mailing list forums, examples, peer-to-peer resources, and much more. The enterprise developer can swim far and wide in the sea of information, learning by wandering among other implementations and other discussions.

At the same time, organizations driven by a development and business timetable need a more structured, organized, and directed approach to building a solid, consistent foundation based on authoritative sources. Working from a pedagogically oriented set of materials, developers can not only acquire a clearer sense of what the technology is and does, but also how best to apply search engine technologies to business requirements. Best practices distilled from years of experience of a broad base of experts can give your team a quicker start, reduce the setup and execution time, and improve how effectively they contend with problems as and when they emerge.
Summary of Questions

I. Why do you need a search application?
1. What business objectives are (or should be) achieved with your search application?
2. What objectives are (or are not) being met with your current search implementation?
3. Which improvements in search behavior contribute to improved business results?
4. How much control do you need over the results that end users see?
5. How much do your end users know about the content they are searching for?

II. What are the key technical characteristics of your search application?
1. In what formats are the documents and data you will search?
2. Document composition: how big are documents?
3. How much new content do you presently add per unit time? How many documents are updated per unit time?
4. What is the rate of queries you expect from your user population?
5. Does your content require faceting or a taxonomy in order to support productive navigation and discovery by end-users?
6. Which advanced search features do you expect to use in order to improve how users can submit queries and choose?

III. What is the technology environment in which you are building your search application?
1. What programming skills do your developers bring to your search application?
2. Is your development team skilled and experienced in working with Open Source?
3. How and where are the data and documents stored, independent of format?
4. On what operating system platform(s) or operating environments will your search application run?
5. Should you use Lucene or Solr?
6. Are application development practices in your organization structured to address time-to-market constraints or technical complexity?
7. What service level availability does your search application need to deliver to end users? What is the cost or impact of outages or service unavailability?

IV. How can you ensure continuous fit between Solr/Lucene and your business needs?
1. What is your “bench-depth” in designing and deploying search applications?
2. How does your organization find and incorporate changes to code or source code fixes for your applications?
3. What is the cost-benefit tradeoff of timely fixes and availability of expertise?
4. Does the cost-benefit tradeoff of fix timeliness change once your application moves into a production environment?
5. How will you ensure a consistent, authoritative base of knowledge and skills for your development team to work from?
About Lucid Imagination

Lucid Imagination can help you use Solr/Lucene to get the most from your search applications. Lucid Imagination has the world-class expertise, resources, support, and services needed to cost-effectively architect, implement, and optimize Solr/Lucene-based solutions. We provide commercial-grade support, training, and consulting and offer certified, tested versions of Lucene and Solr.

Lucid Imagination’s goal is to serve as a central resource for the entire Lucene community and marketplace, to make enterprise search application developers more productive. We also provide access to Solr/Lucene experts, well-organized information, and documentation. We’ve helped hundreds of companies get the most out of their search infrastructure. Customers include AT&T, Buy.com, Cisco, Ford, Macy’s, Sears, Shopzilla, The Motley Fool, Verizon, Edmunds.com, GSI Commerce, Zappos (Amazon), and many other household names.

Lucid Imagination is a privately held venture-funded company. The investors include Granite Ventures, Walden International, In-Q-Tel, and Shasta Ventures. To learn more, please visit http://www.lucidimagination.com or http://www.lucidimagination.com/solutions.

For more information on what Lucid Imagination can do to help your employees, customers, and partners get the most out of your e-commerce efforts, contact sales@lucidimagination.com or call +1.650.353.4057.
Recommended Reading

• Starting a Search Application by Marc Krellenstein
  http://www.lucidimagination.com/developers/whitepapers/starting-search-application
• The Case for Lucene/Solr Real World Open Source Search Applications
  http://www.lucidimagination.com/solutions/whitepapers/Managers-Guide-to-Real-World-Open-Source-Search-Applications
• Faceted Search with Solr by Yonik Seeley
  http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Faceted-Search-Solr
• Optimizing Findability in Lucene and Solr by Grant Ingersoll
  http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Optimizing-Findability-Lucene-and-Solr
• Full Text Search Engine vs. RDBMS by Marc Krellenstein
  http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Faceted-Search-Solr
• Scaling Lucene and Solr by Mark Miller
  http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr
Appendix: Solr/Lucene Features and Benefits

Lucene and Solr are complementary technologies that offer very similar underlying capabilities. In choosing a search solution that is best suited for your requirements, key factors to consider are application scope, development environment, and software development preferences.

Lucene is a Java technology-based search library that offers speed, relevancy ranking, complete query capabilities, portability, scalability, low overhead indexes, and rapid incremental indexing. Solr is the Lucene search server. It presents a web service layer built atop Lucene using the Lucene search library and extending it to provide application users with a ready-to-use search platform. Solr brings with it operational and administrative capabilities like web services, faceting, configurable schema, caching, replication, and administrative tools for configuration, data loading, statistics, logging, cache management, and more.

Lucene presents a collection of directly callable Java libraries and requires coding and solid information retrieval experience. Solr extends the capabilities of Lucene to provide an enterprise-ready search platform, eliminating the need for extensive programming.

Solr provides the starting point for most developers who are building a Lucene-based search application. It comes ready to run in a servlet container such as Tomcat or Jetty, making it ready to scale in a production Java environment. With convenient REST-like/web-service interfaces callable over HTTP, and transparent XML-based configuration files, Solr can greatly accelerate application development and maintenance.
In fact, Lucene programmers have often reported that they find Solr contains “the same features I was going to build myself as a framework for Lucene, but already very well implemented.” Using Solr, enterprises can customize the search application according to their requirements, without incurring the cost and risk of writing the code from scratch.

Lucene provides greater control of your source code and works best in development environments where resources need to be controlled exclusively by Java API calls. It works best when constructing and embedding a state-of-the-art search engine, allowing programmers to assemble and compile inside a native Java application. While working with Lucene, programmers can directly control the large set of sophisticated features with low-level access, data, or state manipulation. Enterprises that do not require strict control of low-level Java libraries generally prefer Solr, as it provides ease of use and scalable search power out of the box.

As functional siblings, Lucene and Solr have become popular alternatives for search applications; the two differ mainly in the style of application development used.

Key benefits of search with Solr/Lucene include:

• Search Quality: Speed, Relevance, and Precision
Solr/Lucene provides near-real-time search and strong relevance ranking to deliver contextually relevant and accurate results very quickly. Tailor-made coding for relevancy ranking and sophisticated search capabilities like faceted search help users in sorting, organizing, classifying, and structuring retrieved information to ensure that search delivers desired results. Search with Solr/Lucene also provides proximity operators, wildcards, fielded searching, term/field/document weights, find-similar functions, spell checking, multilingual search, and much more.

• Lower Cost and Greater Flexibility, Plug and Play Architecture
Solr/Lucene reduces recurring and nonrecurring costs, lowering your TCO. As open source software, it does not require purchase of a license and is freely available for use. The open source code can be used as is, modified, customized, and updated as appropriate to your needs. Solr is easily embedded in your enterprise’s existing infrastructure, reducing costs of installation, configuration, and management.

• Open Source Platform for Portability and Easy Deployment
Because Solr/Lucene is an open-source software solution, it is based on open standards and community-driven development processes. It is highly portable and can run on any platform that supports Java. For instance, you can build an index on Linux and copy it to a Microsoft Windows machine and search there. This unsurpassed portability enables you to keep your search application and your company’s evolving infrastructure in tandem. Lucene, in turn, has been implemented in other environments, including C#, C, Python, and PHP. At deployment time, Solr offers very flexible options; it can be easily deployed on a single server as well as on distributed, multiserver systems.

• Largest Installed Base of Applications, Increasing Customer Base
Solr/Lucene is the most widely used open source search system and is installed in around 4,000 organizations worldwide. Publicly visible search sites that use Solr/Lucene include CNET, LinkedIn, Monster, Digg, Zappos, MySpace, Netflix, and Wikipedia.
Solr/Lucene is also in use at Apple, HP, IBM, Iron Mountain, and Los Alamos National Laboratories.

• Large Developer Base and Adaptability
As community developed software, Solr/Lucene provides transparent development and easy access to updates and releases. Developers can work with open source code and customize the software according to business-specific needs and objectives. Its open source paradigm lets Solr/Lucene provide developers with the freedom and flexibility to evolve the software with changing requirements, liberating them from the constraints of commercial vendors.

• Commercial-Grade Support for Mission Critical Search Applications from Lucid Imagination
Lucid Imagination provides the expertise, resources, and services needed to help enterprises deploy and develop Lucene-based search solutions efficiently and cost-effectively. Lucid helps enterprises achieve optimal search performance and accuracy with its broad range of expertise, which includes indexing and metadata management, content analysis, business rule application, and natural language processing. Lucid Imagination also offers certified distributions of Lucene and Solr, commercial-grade SLA-based support, training, high-level consulting, and value-added software extensions to enable customers to create powerful and successful search applications.
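Several of the query capabilities named under Search Quality in this appendix (fielded searching, proximity operators, wildcards, and term weights) correspond to operators in Lucene's classic query syntax. The sketch below composes such query strings; it is illustrative only. The class, its helper methods, and the field names `title` and `body` are hypothetical, and the helpers do not escape special characters in user input.

```java
// Sketch only: composes query strings in Lucene's classic query syntax.
final class QuerySyntaxSketch {

    // Fielded search: restrict a term to one field, e.g. title:solr
    static String fielded(String field, String term) {
        return field + ":" + term;
    }

    // Proximity: the phrase's words must occur within `distance` positions of each other.
    static String proximity(String phrase, int distance) {
        return "\"" + phrase + "\"~" + distance;
    }

    // Wildcard: match any term that begins with the given stem.
    static String prefix(String stem) {
        return stem + "*";
    }

    // Term weight: boost a clause's contribution to the relevance score.
    static String boosted(String clause, int boost) {
        return clause + "^" + boost;
    }

    public static void main(String[] args) {
        String query = boosted(fielded("title", "solr"), 2)
                + " OR " + proximity("open source", 5)
                + " OR " + fielded("body", prefix("search"));
        System.out.println(query);
    }
}
```

A string built this way could be passed as the query to Lucene's query parser or as the `q` parameter of a Solr request.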
