Whitepaper: Real World Search

  • The Case for Lucene/Solr: A Manager’s Guide to Real World Open Source Search Applications By Lucid Imagination
  • Abstract In today’s information-driven environment, search is a critical solution to problems when it slashes the time and effort separating end users from the data they value. Search spans the range of business models and use cases—from driving direct customer sales, to analytics and business intelligence, employee productivity, and reduced administrative overhead. Making the best use of search requires two perspectives: both a look at the business requirements for a search application and a view to new business opportunities created by using search to leverage the organization’s content resources. Thousands of organizations across different sectors and business models have harnessed Apache Lucene/Solr to search their rapidly growing and diversifying content resources. Underlying this broad adoption is the extraordinary power, scalability, and versatility of open source search technologies. This paper provides an overview of both the requirements and the opportunities for search applications. It then explores how real world organizations are successfully using Lucene/Solr search applications to meet those opportunities, presenting how the technology is used for specific business models and use cases across industries. In addition, it offers a baseline for setting search requirements that managers and architects can use to adopt Lucene/Solr, and adapt this open source search technology to the unique needs of their business. © 2010, Lucid Imagination The Case for Lucene/Solr: Real World Search Applications A Lucid Imagination White Paper • January 2010 Page ii
  • Table of Contents
    Introduction
    Understanding Search Opportunities and Requirements
        What Data and Documents Are You Searching?
        Who Needs the Results and Why?
        Where Is Search Integrated with IT Infrastructure?
        How Is the Search Interface Presented to the User?
    The Real World: Applications and Case Studies
        Yellow Pages, Local Search, and Searching Classifieds
        Media
        E-commerce
        Job and Career Sites
        Libraries, Archives, and Museums (LAMs) Search
        Social Media Search
        Enterprise (Intranet) Search
    Business Use Case Matrix
    Appendix: Lucene/Solr Features and Benefits
  • Introduction
    As fast as companies, communities, and consumers produce data (about each other, products, opinions, research, and everything else imaginable), they need faster, more versatile search capabilities to find the information that creates opportunities for competitive advantage. In today’s information-driven environment, search addresses the critical problems created by the explosive growth of content by slashing the time and effort users expend in finding data they value. Search spans the range of business models and use cases: from driving direct customer sales, to analytics and business intelligence, employee productivity, and reduced administrative overhead.
    Apache Lucene/Solr¹ open source search technology has been implemented across the broadest range of applications and business models, and likely in ways that can fit the needs of your organization. In successful operation today at thousands of enterprises, Lucene/Solr technology scales from tens of thousands to hundreds of millions and even billions of documents; searches data that is structured, unstructured, and in combination; searches data inside and outside the firewall; and ranges in use from a simple website search box through sophisticated faceted navigation. It addresses equally diverse business processes and mission critical applications. Across the spectrum, Lucene/Solr helps users find, make sense of, and act upon information quickly and efficiently.
    In this white paper, we’ll review real-world case studies for Lucene/Solr functionality across business sectors to demonstrate its versatility and varied applicability. The diversity of examples provides strong evidence of Lucene/Solr’s flexibility and power as a search technology. The examples also attest to the innovation and transparency inherent to the open source development model.
    Our focus is on familiarizing the audience of business managers and application owners with existing Lucene/Solr applications; the substantial technical advantages to developers are covered elsewhere. We’ll first survey the key requirements and business use cases of search and then look at where they are built into search applications. Our objective is to provide business managers and application owners with a broad perspective on how Lucene/Solr search technology is used to build solutions to compelling business problems. In the Appendix, we provide an overview of Lucene/Solr’s key features and benefits, with a basic outline of the capabilities offered to meet the broadest range of business needs.
    ¹ Lucene and Solr are complementary technologies that offer very similar underlying capabilities; Solr is the Lucene Search Server. Since Lucene serves as the core of Solr’s search capabilities, this paper refers to the two as Lucene/Solr. For more information, see the Appendix.
  • Understanding Search Opportunities and Requirements
    Search technology has come a long way from its roots in matching keywords against documents and returning undifferentiated results. Search today empowers users by delivering actionable information quickly and efficiently, across multiple, diverse sources of data. The business use cases range from executing mission critical commercial transactions (e.g., e-commerce sites) to unlocking employee and end-user productivity in the search for a single relevant document (e.g., enterprise search).
    Given the breadth of the problem domain, it’s useful to look at search and ask two fundamental questions: “How can it solve my business problems?” and “What new business opportunities can search open up?”
    In considering how search technology solves business problems, it is useful to start by spelling out the requirements you’ll need to consider for your search application. At the same time, be sure to look more broadly at the capabilities that Lucene/Solr offers, as it can help open up new frontiers for incorporating search and leveraging more value from data repositories.
    Starting with some basic questions (what, who, where, and how), you can clarify the high-level requirements specific to your business needs, which in turn allow you to make the best decisions for your search application. The process of looking at the fundamentals also raises new questions about how and where the search technology offered by Lucene and Solr can create new business opportunities.
    Let’s look at four fundamental questions you should address in understanding search opportunities and requirements:
    • What data and documents are you searching?
    • Who needs the results and why?
    • Where is search integrated with IT infrastructure?
    • How is the search interface presented to the user?
    What Data and Documents Are You Searching?
    Business today is driven more than ever by the end users’ creation and consumption of real-time information. A key differentiating capability of search technology is ingesting a broad range of content types and processing large collections of diverse data in real time in order to deliver actionable information. Two aspects to consider:
  • • Types of Content: Content comes in multiple formats: HTML pages, XML files, PDFs, images, PowerPoint presentations, Excel spreadsheets, Word documents, log files, multimedia content, and more. Content resides in various repositories, including databases, file servers, content management systems, archiving systems, collaboration applications, and employee desktops and laptops. Search technology must be able to locate, organize, and aggregate data whatever its form or location.
    • Frequency of Updating Content: Organizations update content at varying intervals, driven by differing business processes and models. Social media or news applications have real-time content needs, whereas an e-commerce application might re-index in response to new inventory on a batch basis, and a research institution might add to its collection less often still. Search applications need to be adaptable to the differences in content change frequency.
    Who Needs the Results and Why?
    Business search puts a high priority on end user experience and on results in which the searched content is tuned to the unique needs of each user. After all, the human dimension (the usefulness of results and the efficacy of interaction) is the acid test of a search application.
    Internet search applications like Google, Yahoo, and Bing are now common and mature. They have raised user expectations about key qualities of the search experience, but they solve a very different problem. While Internet searches can produce millions of results in milliseconds, they rely on measures like website popularity or URLs and domain names, which are neither relevant nor generally applicable to purpose-built applications for businesses. What’s more, they rely on generalizing relevancy for a global population of all Internet users, without being tied to business rules, business process logic, or the opportunity cost of improved precision for a specific set of data or search users.
    Business search applications cannot rely on such coarse, brute-force approaches to tune their results. They need far more control and precision. They have to be able to deliver highly useful results while matching, if not exceeding, the levels of user experience that people have come to expect by virtue of their daily interactions with commercial search engines. Key points of consideration from a business perspective are:
  • • Relevance: Relevance is entirely a function of the goals of the search application’s users. The application must have the mechanisms to recognize the subjective needs of users and tune results accordingly. It must also provide easier ways to narrow search criteria without requiring users to come up with perfect query terms. Flexibility for drilling deeper will make results richer and more valuable. Mechanisms to apply filters, proximity values, and sorting parameters to narrow search scope can also lead to a richer set of more useful results, with less time and effort.
    • Cost of Relevance: As business goals are driven by revenue opportunities and cost savings, it is critical to tie relevance to the economics of the business. For example, a public-facing retail site should focus on matching merchandise to search, site stickiness, and customer loyalty. It requires search technology that streamlines and simplifies the shopping experience, with relevant results directly contributing to sales revenue. For knowledge workers, internal search applications should help make employees more productive by reducing the amount of time and effort needed to find the documents they need to do their jobs. Multiple studies show that information workers can spend 20–30% of their time searching for information.
    • Precision Ranking: Accurate results, sorted by attributes like relevance, date, field, or any other document property, make the search process better. End users generally abandon a search before tackling the fine points of Boolean logic or scrolling for a result buried too far down.
    • Query Response Speed: Today, 5–7 seconds is the typical threshold for end-user patience. Too much wait time for search results frustrates users and causes them to abandon pages. Fast, relevant results cannot be limited by search technology hamstrung by data influx or query overload. Query response time should also work hand-in-hand with the refinement of multiple search attributes, so that increasingly complex queries do not exact a performance penalty.
    Where Is Search Integrated with IT Infrastructure?
    Useful, valuable search technology rarely exists in isolation. Searched data is transformed into actionable information when it is integrated with the organization’s information infrastructure, from business processes to business intelligence to content management systems. A robust search technology must be customizable to integrate with the existing systems seamlessly.
  • • Application Integration: A key requirement for a search application is its extensibility for integration with existing infrastructure and applications like content management systems, databases, and the full range of business processes and applications. It should have interfaces that support ingestion of data as well as delivery of results in readily consumable formats, because in many cases results are consumed by other applications, not a human.
    • Scalability: We can assume that data will change and grow, so scalability is a key factor for a search application. Applications should grow to address future needs without penalties for the breadth of data or for the count of documents indexed. The search application should be able to grow with the requirements of the organization, without needing additional large investments in hardware to match the pace of growth. Proprietary search vendors often charge for search by the number of documents indexed. In a world where constantly expanding content is the norm, such costs can be a real and substantial drag on the cost of ownership for search applications, often resulting in a negative return.
    • Security: Every organization has its own security requirements and access controls. Search technologies need to comply with the security policies of the enterprise, controlling results that have restricted access. The search technology should also be able to make use of document-level security from other sources.
    How Is the Search Interface Presented to the User?
    The user interface is where search delivers on findability and presents actionable results. The search application is only as good as the convenience of submitting queries, reviewing and refining results, and finding information. Key aspects to consider:
    • Navigation: Users benefit from guidance that makes their queries more productive. Techniques such as faceted search with result clustering, advance hinting (“did you mean”), “more like this,” and drop down menus for setting search scope help users achieve desired results faster, making a search application both user- and information-friendly. It is also important to allow users to draw associative connections between results, using the technology to uncover relationships and discover more about what they were seeking than they knew at the outset.
  • The Netflix search application is powered by Solr; it adds a fuzzy dimension to search, with auto-completion of movie names, correction of misspelled actor names, and suggestions of titles closest to the query. As a result, 85% of users have found the movie they were looking for ranked at the #1 spot in the results.
    • Discovery: Search application functionality should extend beyond the generic presentation of a result list of documents that contain a keyword. Highlighting keywords in search results, expanding searches with synonyms and spell checking, and offering users ways to learn a bit more about documents in the results without having to load the document are great ways to significantly improve usability.
    • Intuitive Intelligence: Search applications must go beyond keyword search to help users retrieve accurate information even when they are not sure of the best keywords. Additionally, they should reduce misinterpretations where homonyms, spelling errors, and ambiguous keywords are involved (e.g., is “apple” a fruit or a computer company?).
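As a rough illustration of how the filters, sorting parameters, and faceted navigation described above might be expressed in a single request, the following Python sketch assembles a query URL from standard Solr request parameters (`q`, `fq`, `sort`, `rows`, `wt`, `facet`, `facet.field`). The host, port, path, and field names (`color`, `price`, `brand`, `size`) are assumptions made up for the example, not taken from any deployment discussed in this paper.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_select_url(base_url, text, filters=(), facet_fields=(), sort=None, rows=10):
    """Assemble a Solr-style select URL from query text, filters, and facet fields."""
    params = [("q", text), ("rows", str(rows)), ("wt", "json")]
    # Each filter query (fq) narrows the result set without changing the main query.
    for filter_query in filters:
        params.append(("fq", filter_query))
    # Asking for facet counts lets the interface show drill-down options.
    if facet_fields:
        params.append(("facet", "true"))
        for field in facet_fields:
            params.append(("facet.field", field))
    if sort:
        params.append(("sort", sort))
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and field names, for illustration only.
url = build_select_url(
    "http://localhost:8983/solr/select",
    "running shoes",
    filters=["color:red", "price:[20 TO 80]"],
    facet_fields=["brand", "size"],
    sort="price asc",
)
print(url)

# Decode the query string again to confirm what would be sent to the server.
parsed = parse_qs(urlsplit(url).query)
print(parsed["fq"])
```

The sketch only builds the request; sending it and rendering the returned facet counts as clickable filters is left to the application.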
  • The Real World: Applications and Case Studies
    With an understanding of the fundamentals of search business applications in hand, it is helpful to gain additional context on business usage through a survey of organizations that have successfully used Lucene/Solr for powerful search applications.
    All of these cases were built on the capability of Lucene/Solr to provide innovative, high-performance, cross-platform, feature-rich search technology suitable for nearly every application. By powering diverse search applications for thousands of organizations such as AT&T, Zappos, McClatchy, Smithsonian, MTV Networks, LinkedIn, MySpace, Comcast, Monster, Netflix, and many more, Lucene/Solr has provided mission critical capability that turns search into a robust competitive advantage. For these organizations, Lucene/Solr solutions regularly index and search hundreds of millions of documents with subsecond response time, unencumbered by costly licensing or vendor lock-in. Together they represent a compelling argument for the broad applicability of Lucene/Solr across the full range of business opportunities and search needs.
    Business use case studies we’ll review include:
    • Yellow Pages, Local Search, and Searching Classifieds
    • Media
    • E-commerce
    • Job and Career Sites
    • Libraries, Archives, and Museums (LAMs) Search
    • Social Media Search
    • Enterprise (Intranet) Search
  • Yellow Pages, Local Search, and Searching Classifieds
    In the business of online local search, geographic-based (location) relevance generates competitive advantage. Online directories need to provide a rich, interactive search experience to users to increase site views and stickiness, which in turn translates into increased advertising revenue. Simplified location-based search, intuitive faceted query response, and data mashups are a few features that define search functionality for an online directory.
    Lucene/Solr solutions offer accurate search results, factoring in location, users’ reviews, and ratings, alongside paid advertising. By taking advantage of Solr’s open source model, with search algorithms that are completely transparent, companies can invest in configuring their search solutions to match their business logic, rather than trying to infer or pay for exposure to proprietary back-end logic.
    “Internet Yellow pages and local online search is forecast to grow to $27.8 billion in 2011.” The Kelsey Report¹
    Requirements
    • Intelligent results going beyond keyword search
    • Deeper, faceted navigation
    • Seamless integration with latest Web 2.0 tools
    • Lower IT-related costs
    • Geocentric user experience
    • Search numeric values
    Solr Solution
    • Customizable search index which can be tuned transparently to account for key findability drivers
    • Drop down filters for narrowing or widening the scope of search
    • Seamless integration with existing technologies
    • Native numeric encoding and search capabilities
    • Reduced server footprint for lower TCO than most commercial vendors
    Success Stories
    • YP.com, a division of AT&T Interactive
    • Zvents.com, local event search service
    • Yelp.com, the community local search site
    ¹ The Kelsey Group’s Global Print Yellow Pages, Internet Yellow Pages and Local Search Five Year Outlook
  • Case Study 1: yp.com by AT&T Interactive
    AT&T Interactive is an online and mobile search and advertising company. Their leading-edge portal, yp.com, an online business listing and advertising site, was originally implemented with a commercial proprietary search application. It faced issues of scalability, vendor lock-in, and performance. With help from Lucid Imagination, AT&T successfully migrated to a Solr-based search solution that leveraged the flexibility of open source without compromising features and functionality. And they did so with a much smaller budget.
    Business Needs
    • Addressing the need to factor in location to support geographic search, and include relevant comments
    • Striking a balance between organic search and advertised content
    • Indexing highly unstructured content such as user comments
    • Increasing relevancy of results and boosting paid search results for preferential placement of advertisers
    • Linguistic support to enrich the search experience, such as spellchecking, synonyms, find-similar, etc.
    • Integrating with latest Web 2.0 tools
    • Reducing server footprint
    The Solr Solution
    • Context-specific relevancy, geographic proximity, ad placement, and user comments
    • Faceting, drop down filters to narrow/widen the scope of search
    • Functional support for creating new features
    • Spell-correction, and location-optimized search results to show users businesses nearest to them first
    • Seamless integration with many Web 2.0 tools to create innovative features and mashups
    • Lowers TCO by reducing the number of search servers from 120 to two dozen
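To make "businesses nearest to them first" concrete, here is a minimal Python sketch that orders listings by great-circle (haversine) distance from a user's position. It illustrates the general idea only and is not a description of how Solr or yp.com computes proximity; the listing names and coordinates are invented for the example.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def nearest_first(listings, user_lat, user_lon):
    """Return the listings sorted by distance from the user, closest first."""
    return sorted(
        listings,
        key=lambda b: haversine_km(user_lat, user_lon, b["lat"], b["lon"]),
    )


# Hypothetical listings, for illustration only.
listings = [
    {"name": "Pizzeria A", "lat": 37.8044, "lon": -122.2712},  # Oakland
    {"name": "Pizzeria B", "lat": 37.7749, "lon": -122.4194},  # San Francisco
    {"name": "Pizzeria C", "lat": 37.3382, "lon": -121.8863},  # San Jose
]

# A user located in central San Francisco.
ranked = nearest_first(listings, 37.7793, -122.4193)
for business in ranked:
    print(business["name"])
```

A production directory would typically blend distance with the other signals listed above, such as text relevance, ratings, and paid placement, rather than sorting on distance alone.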
  • Media
    Brand reinforcement, premium content, and easy accessibility are the main business motivators for online media and publishing companies. Relevant information improves time on the site and encourages users to explore related content, boosting subscription rates and site views. These translate into a virtuous cycle of additional revenue generation.
    Given that content is the business, the need for a robust search application ties directly to competitive advantage.
    Lucene/Solr provides a customized, function rich solution for the media and publishing industry. It addresses dynamic challenges of content diversity, content freshness, and content acquisition, and gives companies a platform on which to build a world-class innovative search experience to differentiate themselves in a highly competitive marketplace.
    “Solr has done wonders for us. It is easy to understand and deploy, and has reduced our costs drastically.” Doug Steigerwald, McClatchy Interactive
    Requirements
    • Real-time indexing of petabytes of structured and unstructured data
    • Deeper search capability
    • Improved query response time
    • Reduced infrastructure and customization costs
    Solr Solution
    • Reverse indexing
    • Intelligent, faceted search to enable contextual and linguistic relevance
    • Easy configuration for parsing structured and unstructured data
    • Easy and seamless installation for lower TCO
    • Customization with open source code
    Success Stories
    • McClatchy Newspapers
    • Netflix
    • Comcast Interactive
    • MTV Networks, a division of Viacom
    • The Motley Fool, fool.com
    • Fanfeedr.com, personalized sports aggregator
  • Case Study 2: McClatchy, Leading Newspaper Publisher
    The third largest newspaper publisher in the United States, McClatchy Company owns 30 daily newspapers in 29 markets across the country. To win online, McClatchy knew it had to have a robust search solution, to empower the McClatchy audience with the information they wanted and secure loyalty from readers and sponsorships from advertisers. Working with Lucid Imagination, McClatchy migrated from proprietary search software to open source and chose Solr for its high performance, comprehensive capabilities, and superior value.
    Requirements
    • Proliferating content and data sources (text, video, audio, images), with real-time streaming
    • Empowering end users with ease of use
    • Supporting peak traffic and popular search spikes with consistent performance
    • Providing scalability for a database growing by orders of magnitude annually
    • Providing flexibility to support customization
    • Controlling IT costs while exceeding performance benchmarks of competition
    The Lucene/Solr Solution
    • Deeper content by indexing both structured and unstructured data in real time, effortlessly
    • Indexes millions of documents, with search results delivered in milliseconds
    • User-friendly navigation with drop down filters, faceted navigation, linguistic corrections, etc.
    • Excellent performance, even in peak hours, by load-balancing search requests across servers
    • Scalability without impact on performance
    • High degree of customization, since it’s open source
    • Integration with existing IT infrastructure and elimination of associated license fees to cut costs
    • 8-fold reduction in server footprint
  • E-commerce
    E-commerce businesses must provide a compelling shopping experience in order to maintain brand equity and thrive in a highly competitive market landscape. By reducing the time and effort required to navigate available merchandise and find what they want, superior search contributes directly to a satisfying buying experience for customers. Search then translates directly into higher revenues and customer loyalty. Instant results, intuitively organized, advanced faceting for easy browsing, synchronizing results with images, and integration with user ratings are among the must-have features of an e-commerce search application.
    Lucene/Solr gives companies the ability to build their sites around the concept of “searchendizing” (putting the desired merchandise at the top of the results list), which can make the difference between sales made and sales lost. Faceting, database integration, real-time indexing, and query monitoring all enable users to find products they want, driving conversion rates and enabling a winning online experience.
    “Online retail sales in the B2C market are expected to reach $340 billion by 2013.”² Forrester Research
    Requirements
    • Multidimensional, dynamic search
    • Faster results
    • Real-time indexing of products
    • Faceting and browsing capabilities
    • Seamless integration with existing IT infrastructure
    Solr Solution
    • Faceted search for deeper drill down and browsing
    • Intuitive search capabilities for cross-channel shopping experience
    • System administration tools for data loading, index replication, monitoring, logging, and cache management
    • Query monitoring for better highlighting of popular products
    Success Stories
    • Buy.com
    • Sears.com
    • Macys.com
    • Zappos.com
    • Advanceautoparts.com
    • Dollardays.com
    ² “Consumers will spend more than $340 billion online by 2013, says Forrester,” Internet Retailer, 27 November 2009, http://www.internetretailer.com/dailyNews.asp?id=32630.
  • Case Study 3: Zappos
    Zappos is the premier destination for online shoe shopping. At Zappos, the mission is excellent online customer service: customers should be able to browse shoe styles, sizes, shapes, and colors more easily than at any other shoe store, on or offline. To achieve this, Zappos wanted a robust, flexible, multifunctional search solution. After evaluating many commercial search technologies, Zappos zeroed in on Solr, working with Lucid Imagination to ensure continued, successful deployment.
    Requirements
    • Simplified, attractive user experience that makes it easy to find and buy
    • Relevant results, fast
    • Navigation across attributes, such as size, color, and style, for broader and deeper results
    • Indexing products as they are entered in the catalogs
    • Cross-functional navigation to give customers a realistic shopping experience
    • Intuitive intelligence to provide alternate suggestions
    • Analytical capabilities to drive business strategy
    • Facilitating control over results
    • Integration with existing IT infrastructure
    The Solr Solution
    • Subsecond search results, across categories
    • Faceting, for easy browsing and discovery and a compelling user experience
    • Real-time indexing of products
    • Synchronization of visuals, specs, filters, and promotions to make the shopping experience true to life
    • Information on user activity to help build strategy on product promotions
    • Controls to rank popular or high-stock products in results where users are more likely to buy them
    • Facilitates integration with heterogeneous open source environment
  • Job and Career Sites
    Job portals are countercyclical to the economy. When the economy flourishes, posted jobs grow in number; when it sags, candidates flock in to post their résumés. Success for an online job portal is tied to the efficiency of its search capability (matching résumés to job listings and vice versa), so both employers and prospective employees can zero in on just the right opportunity.
    For example, an employer may want to navigate through filters to narrow the scope of a candidate search, such as education, previous employer, salary history, skillsets, etc.; a job seeker may want to expose these attributes, but keep a current employer’s name confidential. A job seeker may want to apply to jobs within a particular geographic area.
    Lucene/Solr not only provides such flexibility but also addresses other complexities of this industry by enabling linguistic intelligence (such as identical acronyms that correspond to different entities, variations in spelling, and imperfectly constructed search queries); indexing unstructured data (résumés); and managing ever-growing data.
    “I think the breakthrough was when we tried it, and we realized, wow, this thing could really scale.” Peter Keegan, Monster.com
    Requirements
    • Linguistic intelligence for more relevant results
    • Control search results to maintain privacy
    • Deeper search capability
    • Numeric search
    • Faster query response
    • Reduced infrastructure and customization costs
    Solr Solution
    • Intelligent, faceted search to enable contextual and linguistic relevance
    • Easy configuration for parsing structured and unstructured data
    • Easy and seamless installation for lower TCO
    • Business process integration and customization with open source code
    Success Stories
    • Monster
    • The Big Jobs
    • eBharatJobs
    • Careerjet
  • Case Study 4: Monster.com. Monster is the largest job search engine in the world, with over a million jobs posted at any one time. By 2008 it had 150 million résumés in its database and served over 63 million job seekers per month; it now runs 300 to 400 queries per second on average, with an average response time of 40 milliseconds. To provide the highest level of service and support to its customers, both employers and job seekers, Monster operates an unmatched marketplace for employment opportunities, with Lucene-based search at the heart of its business model. The Requirements • Managing high volumes of data, continually increasing by double-digit percentages annually • Maintaining constant inventory updates and providing faster results • Removing technological barriers that limit the scope of information • Enabling end users to refine search and drill deeper without any performance impact • Providing security controls to ensure end user privacy • Facilitating scalability and flexibility in tandem with the company’s vision and growth plans The Lucene Solution • Handling of high data volumes by clustering data to reduce the index size • Real-time indexing for fresher, faster query results • Intuitive search to enable in-depth cross-functional job and résumé browsing • Faceted search and “single click” filters for search refinement • Security controls to manage user information • Unlimited scalability and customization leveraging open source licensing
  • Libraries, Archives, and Museums (LAMs). The core asset of educational and research institutions is knowledge archived and accumulated over decades. In the world of academic search, the diversity of information for any query (text, illustration, audio/video media, or data in any other format) makes unstructured formats a key aspect of the searchable archive. Lucene/Solr gives academic and research institutions the power to turn information into knowledge by going beyond keyword-driven search to expose a rich variety of results and exploration. Based on the open source model, it not only integrates with the existing IT infrastructure but also leverages the existing classification hierarchies to give structure to terabytes of information spread across disparate collections, significantly reducing overhead and enabling flexible and scalable deployment. Search Requirements • Management of multiple formats of data and documents • Customization and scalability • Linguistic support in queries • Faster results Solr Solution • Optimized index infrastructure that limits size without compromising speed or flexibility • Easy customization for implementing taxonomy rules • Faceted search to narrow results to a specific source across diverse sets of data • Instant results • Seamless integration with IT infrastructure for lower TCO “With Solr, you can do so many things without writing a lick of code. I hadn't realized how easy it is to extend our custom request handler, response writer, and update handler. Just move it all to Solr and let it do the heavy lifting.” Sjoerd Siebinga, Europeana Success Stories • Smithsonian Institution • Europeana, the European Union online cultural archive • The US Library of Congress and World Digital Library • Stanford University Library • University of Michigan Graduate Library
  • Case Study 5: Smithsonian. The Smithsonian Institution is the flagship museum collection of the United States, supporting a research institute that provides “one-stop” searching for 2 million records, including nearly a quarter of a million media files (images, online journals, and other resources) distributed across dozens of archives, databases, museums, and libraries. To make this treasure of information easily accessible to people, the Smithsonian needed an efficient search solution that could overcome the following challenges: The Challenges • Managing a complicated taxonomy that could no longer accommodate a growing data index • Indexing disparate types of content, including documents, videos, and images • Making information available from a large database • Providing access controls to restrict information • Integrating with existing legacy tools The Smithsonian chose Lucene/Solr, and worked with Lucid Imagination to create an optimized, well-designed solution. The Solr Solution • Efficient index strategy to manage a mix of structured and unstructured data • Holistic search, with configuration optimized to reduce the number of servers and handle query requests better • Filtering of information through faceted search • Access controls to restrict information based on membership profiles • Integration with the existing IT infrastructure • Guidance and assistance on setting up a replicated search environment
  • Social Media. Search solutions must support differentiated business models matching Web 2.0 innovations, including user-generated content and mashups, without compromising scalability. That is a challenge, given the virtually limitless content on the Internet. Success and differentiation are measured by how well the site provides relevant results to grow its user base and keep users engaged. Increasingly, the technological factors driving Web 2.0 application paradigms are finding their way into the enterprise, unlocking collaboration and productivity in new ways that challenge conventional organizational bounds, and that rely in equal measure on search to create the connections between employees to enable discovery, cross-pollination, and more efficient collective effort. Lucene/Solr not only provides fast results but also facilitates flexible, intuitive navigation to help end users connect with others. It boosts the reach and performance of search, while cutting implementation costs and lowering barriers to innovation. Search Requirements • Deliver search results as soon as content is available • Deeper drill-down capabilities • Intuitive interface Lucene/Solr Solution • Near-instant results with segmentable indexing • Intuitive search • Data-driven spellchecking based on user search histories • Linguistic support through “Did you mean” functionality • Highlighting keywords • Deeper drill-down with faceting • Real-time content updating Success Stories • Digg • MySpace • LinkedIn • Reddit • Technorati • Scout Labs • Xmarks.com “With Solr, we really treat it as kind of a platform where we can build other kind of things on top of it… We have a very valuable set of data, and we really want to explore new ways of building new features from that data set.” Sammy Yu, Digg.com
  • Case Study 6: Digg.com. Digg displays the wisdom of the crowds. By leveraging the mass collaboration of readers distributed across the Internet (everything on Digg is submitted by the public community for the public community), it builds on the easy findability of information valued by the marketplace of readers and consumers. Digg realized early on that to succeed in the business of information, they needed to make information available to their audience as effortlessly as possible. They saw the following challenges as roadblocks for implementing a base search application: Requirements • Managing unstructured data (13 million documents and growing) in real time • Providing results faster • Facilitating smart navigation to provide information in digestible portions • Recognizing and eliminating duplicate content • Providing a semantically and linguistically smart application • Facilitating scalability while containing costs Digg selected Solr for its unmatched flexibility and functionality. The Solr Solution • Highly customizable and flexible • Subsecond results, with simple-to-use pull-downs to refine results • Fuzzy duplicate detection (through custom code) • Unlimited scalability and seamless integration with the heterogeneous environment
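The source does not say how Digg's fuzzy duplicate detection works. As a generic illustration of the idea, one common approach compares documents by their overlapping word sequences (shingles) and flags pairs whose Jaccard similarity exceeds a threshold. The shingle size and the 0.5 threshold below are arbitrary assumptions, and all function names are hypothetical.

```python
def shingles(text, size=3):
    """Return the set of overlapping word n-grams (shingles) in the text."""
    words = text.lower().split()
    if len(words) < size:
        # Very short texts are treated as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a, text_b, threshold=0.5):
    """Flag two texts as near duplicates when their shingle sets overlap enough."""
    return jaccard(shingles(text_a), shingles(text_b)) >= threshold


original = "Apache Solr adds faceted search to the Lucene library"
resubmitted = "Apache Solr adds faceted search to the Lucene search library"
unrelated = "Zappos sells shoes online"

print(is_near_duplicate(original, resubmitted))
print(is_near_duplicate(original, unrelated))
```

Pairwise comparison like this only scales to small collections; at the volume the case study mentions, the same similarity idea is usually applied through compact per-document signatures computed at index time.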
  • Case Study 7: LinkedIn. Connecting 50 million registered users from 200 countries across 170 industries and matching them to the right professional contacts is what LinkedIn is all about. LinkedIn’s business is premised on an intelligent search application that could overcome the following: The Challenges • Managing an ever-growing database, with one new member joining and creating a profile every second • Indexing unstructured data in real time • Giving instant query responses, even in peak traffic hours • Providing intuitive navigation and intelligent linguistic support • Integrating with other Web 2.0 tools to build user profiles that integrate data from multiple sources They chose Lucene to implement the search function at the core of their business model. The Lucene Solution • Used index segmentation for faster results and to limit index size • Provided faceted search and intelligent support features such as changing the view of search results and auto-completion of contacts • Calculated relative relevance, ranking results on the fly based on the relationship between the user’s profile and the other profiles being searched • Integrated with the latest web tools; for example, incorporating videos in search results • Provided a “scale as you grow” facility through the flexibility of the open source model
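The "relative relevance" item above describes ranking that depends on who is searching. The source gives no details of LinkedIn's formula, so the following is only a toy sketch of the general idea: the text score from the search engine is multiplied by a boost that depends on how closely the searcher is connected to each profile. The `Hit` class, the degree values, and the boost multipliers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    profile_id: str
    text_score: float  # relevance score returned by the search engine
    degree: int        # 1 = direct connection, 2 = connection of a connection


# Assumed multipliers; any profile further away than degree 2 gets no boost.
DEGREE_BOOST = {1: 2.0, 2: 1.5}


def rerank(hits):
    """Order hits by text score scaled by closeness to the searching user."""
    def final_score(hit):
        return hit.text_score * DEGREE_BOOST.get(hit.degree, 1.0)

    return sorted(hits, key=final_score, reverse=True)


hits = [
    Hit("distant-expert", 1.0, 3),
    Hit("friend-of-friend", 0.8, 2),
    Hit("direct-contact", 0.7, 1),
]
ordered = [hit.profile_id for hit in rerank(hits)]
print(ordered)
```

Because the boost depends on the searcher, two users issuing the same query can see the same profiles in a different order.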
  • Enterprise (Intranet) Search. Enterprises today have a global footprint, which leads to the creation of multiple content types and the use of disparate applications and content management systems across business centers. The result is often silos of unmanaged data spread across the intranet of an enterprise, a situation where information is omnipresent but cannot be used. To achieve a competitive advantage, enable intelligent decision making, eliminate duplication of work, and lower the cost of ownership, enterprises need a search application that gives structure to unstructured data and provides a single gateway to search across multiple enterprise repositories, with speed, flexibility, and intuitive intelligence. Lucene/Solr is a solid match for enterprise search. As a customizable and multifunctional search application, Lucene/Solr provides robust search features at minimal cost. The open source development model behind Lucene/Solr integrates seamlessly with legacy tools, and brings down the total cost of ownership significantly. Given the sensitive nature of enterprise content, Lucene/Solr facilitates document-level, role-based security. And with transparent search algorithms and configurability for relevancy, Lucene/Solr enables intranet search with the precise control enterprise content owners require, ensuring that results consistently deliver the right documents to the right people. Requirements • Single interface to access enterprise data • Faster results • Control over search results • Ready integration with existing content management software Solr Solution • Single gateway for all types of data • Dynamic boosting of content • Transparent search algorithms and relevancy tuning • Customization and easy integration with open source code “The search and discovery software market grew 19 percent in 2008 to $2.1 billion” Sue Feldman, IDC
  • Case Study 8: Food and Drug Administration. The Food and Drug Administration (FDA) is a U.S. government agency responsible for regulating and supervising the safety of foods, medications, veterinary products, tobacco, and cosmetics. The FDA has a large repository of information that dates back multiple decades, and exists in formats ranging from early optical character recognition to recent electronic formats. To mine this knowledge base, the FDA is developing a semantic mining framework using open source tools such as Apache Lucene and Solr. Requirements • Integrating petabytes of data highly distributed across the intranet of an enterprise • Managing multiple indices for documents stored in distributed repositories • Managing and maintaining archival data and evolving vocabularies • Indexing unstructured data in real time • Recognizing and eliminating duplicate content • Handling concurrent queries and delivering fast and relevant results • Restricting search results according to agency access control policies • Integrating with existing infrastructure without additional overhead The Lucene Solution • A single gateway to search across multiple enterprise repositories • Duplicate detection • Fast and relevant results with content analysis and query interpretation algorithms • Result filtering based on the access controls and security policies of an enterprise • Integration with existing enterprise infrastructure to reduce TCO
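Restricting results by access control policy, as listed above, is often done by storing the permitted roles on each document and adding a filter query to every search. The sketch below assumes a multi-valued field named `acl` and a catch-all `public` role; both are hypothetical and say nothing about how the FDA system is actually built. `q` and `fq` are standard Solr parameters.

```python
def acl_filter(roles, field="acl"):
    """Build a Solr filter query matching documents visible to any given role.

    The "public" role is always included, so a user with no roles still
    sees unrestricted documents (an assumption of this sketch).
    """
    allowed = sorted(set(roles) | {"public"})
    for role in allowed:
        # Reject anything that could alter the query syntax.
        if not role.replace("_", "").isalnum():
            raise ValueError(f"unexpected characters in role name: {role!r}")
    quoted = " OR ".join(f'"{role}"' for role in allowed)
    return f"{field}:({quoted})"


def secured_params(user_query, roles):
    """Combine the user's query with the access-control filter."""
    return {"q": user_query, "fq": acl_filter(roles)}


params = secured_params("drug labeling", ["staff", "reviewer"])
print(params["fq"])
```

Keeping the restriction in `fq` rather than in `q` leaves relevance scoring unaffected by the security clause, and the server applies it on every request regardless of what the user typed.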
  • Business Use Case Matrix. To simplify mapping your search needs to existing search applications in the real world, the matrix below compares business use cases against key search requirements. While not an exhaustive list, the matrix highlights the different business use cases across sectors and business models, reflecting the adaptability of Lucene/Solr across the various domains of search applications and use cases. Columns: Users (Customer Facing, Internal); Content (Original, Aggregated); Content Update Frequency (High, Medium, Low); Access Control. Verticals: • Enterprise (Intranet) √ √ √ √ • Education, Schools/Universities √ √ √ √ √ √ • Education, Libraries √ √ √ √ √ • Job Portals √ √ √ √ • Social Networks √ √ √ √ √ • Media, News √ √ √ √ • Media, Media √ √ √ √ • E-Commerce Sites √ √ √ √ √ √ • Financial Services √ √ √ √ √ • Yellow Pages √ √ √ • Horizontal Portals √ √ √ √
  • Appendix: Lucene/Solr Features and Benefits Lucene and Solr are complementary technologies that offer very similar underlying capabilities. In choosing a search solution that is best suited for your requirements, key factors to consider are application scope, development environment, and software development preferences. Lucene is a Java technology-based search library that offers speed, relevancy ranking, complete query capabilities, portability, scalability, and low overhead indexes and rapid incremental indexing. Solr is the Lucene Search Server. It presents a web service layer built atop Lucene using the Lucene search library and extending it to provide application users with a ready-to-use search platform. Solr brings with it operational and administrative capabilities like web services, faceting, configurable schema, caching, replication, and administrative tools for configuration, data loading, statistics, logging, cache management, and more. Lucene presents a collection of directly callable Java libraries and requires coding and solid information retrieval experience. Solr extends the capabilities of Lucene to provide an enterprise- ready search platform, eliminating the need for extensive programming. Solr provides the starting point for most developers who are building a Lucene-based search application. It comes ready to run in a servlet container such as Tomcat or Jetty, making it ready to scale in a production Java environment. With convenient ReST-like/web-service interfaces callable over HTTP, and transparent XML-based configuration files, Solr can greatly accelerate application development and maintenance. 
In fact, Lucene programmers have often reported that they find Solr contains “the same features I was going to build myself as a framework for Lucene, but already very well implemented.” Using Solr, enterprises can customize the search application according to their requirements, without incurring the cost and risk of writing the code from scratch. Lucene provides greater control of your source code and works best in development environments where resources need to be controlled exclusively by Java API calls. It works best when constructing and embedding a state-of-the-art search engine, allowing programmers to assemble and compile inside a native Java application. While working with Lucene, programmers can directly control the large set of sophisticated features through low-level access and direct manipulation of data or state. Enterprises that do not require strict control of low-level Java libraries generally prefer Solr, as it provides ease of use and scalable search power out of the box.
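To make the web-service interface described above concrete, the sketch below builds the XML message that Solr's update handler accepts for adding documents: an `<add>` element containing `<doc>` elements with named `<field>` children. The field names and values are hypothetical, and the sketch only constructs the message; sending it would be an HTTP POST to the server's update URL, which depends on the deployment.

```python
import xml.etree.ElementTree as ET


def build_add_message(documents):
    """Serialize dictionaries as a Solr XML <add> message.

    A list or tuple value produces one <field> element per item, which is
    how multi-valued fields are expressed in this format.
    """
    add = ET.Element("add")
    for document in documents:
        doc = ET.SubElement(add, "doc")
        for name, value in document.items():
            values = value if isinstance(value, (list, tuple)) else [value]
            for item in values:
                field = ET.SubElement(doc, "field", name=name)
                field.text = str(item)
    return ET.tostring(add, encoding="unicode")


# Hypothetical catalog record; the field names must exist in the Solr schema.
message = build_add_message(
    [{"id": "sku-1", "name": "Trail shoe", "color": ["red", "blue"]}]
)
print(message)
```

No Java code is involved on the client side, which is the practical difference from embedding Lucene directly: any language that can produce this XML and speak HTTP can feed the index.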
  • As functional siblings, Lucene and Solr have become popular alternatives for search applications; the two differ mainly in the style of application development used. Key benefits of search with Lucene/Solr include: • Search Quality: Speed, Relevance, and Precision Lucene/Solr provides near-real-time search and strong relevance ranking to deliver contextually relevant and accurate results very quickly. Tailor-made coding for relevancy ranking and sophisticated search capabilities like faceted search help users in sorting, organizing, classifying, and structuring retrieved information to ensure that search delivers desired results. Search with Lucene/Solr also provides proximity operators, wildcards, fielded searching, term/field/document weights, find-similar functions, spell checking, multilingual search, and much more. • Lower Cost and Greater Flexibility, Plug and Play Architecture Lucene/Solr reduces recurring and nonrecurring costs, lowering your TCO. As open source software, it does not require purchase of a license and is freely available for use. The open source code can be used as is, modified, customized, and updated as appropriate to your needs. Solr is easily embedded in your enterprise’s existing infrastructure, reducing costs of installation, configuration, and management. • Open Source Platform for Portability and Easy Deployment Because Lucene/Solr is an open-source software solution, it is based on open standards and community-driven development processes. It is highly portable and can run on any platform that supports Java. For instance, you can build an index on Linux and copy it to a Microsoft Windows machine and search there. This unsurpassed portability enables you to keep your search application and your company’s evolving infrastructure in tandem. Lucene, in turn, has been implemented in other environments, including C#, C, Python, and PHP. 
At deployment time, Solr offers very flexible options; it can be easily deployed on a single server as well as on distributed, multiserver systems. • Largest Installed Base of Applications, Increasing Customer Base Lucene/Solr is the most widely used open source search system and is installed in around 4,000 organizations worldwide. Publicly visible search sites that use Lucene/Solr include CNET, LinkedIn, Monster, Digg, Zappos, MySpace, Netflix, and Wikipedia. Lucene/Solr is also in use at Apple, HP, IBM, Iron Mountain, and Los Alamos National Laboratory.
  • • Large Developer Base and Adaptability As community-developed software, Lucene/Solr provides transparent development and easy access to updates and releases. Developers can work with open source code and customize the software according to business-specific needs and objectives. Its open source paradigm lets Lucene/Solr provide developers with the freedom and flexibility to evolve the software with changing requirements, liberating them from the constraints of commercial vendors. • Commercial-Grade Support for Mission-Critical Search Applications from Lucid Imagination Lucid Imagination provides the expertise, resources, and services that are needed to help enterprises deploy and develop Lucene-based search solutions efficiently and cost-effectively. Lucid helps enterprises achieve optimal search performance and accuracy with its broad range of expertise, which includes indexing and metadata management, content analysis, business rule application, and natural language processing. Lucid Imagination also offers certified distributions of Lucene and Solr, commercial-grade SLA-based support, training, high-level consulting, and value-added software extensions to enable customers to create powerful and successful search applications.
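Several of the query features named under Search Quality in this appendix (fielded searching, proximity operators, wildcards, and term weights) have direct counterparts in the classic Lucene query parser syntax. The helper functions below are hypothetical conveniences written for illustration, and the field names `title` and `body` are assumptions; the syntax they emit (`field:term`, `"phrase"~N`, `term*`, `clause^weight`) is the query parser's own.

```python
def fielded(field, term):
    """Restrict a term to one field, e.g. title:solr."""
    return f"{field}:{term}"


def phrase_within(phrase, distance):
    """Proximity search: the phrase's words within `distance` positions."""
    return f'"{phrase}"~{distance}'


def prefix(stem):
    """Wildcard search: any term starting with the stem."""
    return f"{stem}*"


def boosted(clause, weight):
    """Raise the weight of a clause in relevance scoring."""
    return f"{clause}^{weight}"


query = " AND ".join([
    boosted(fielded("title", "solr"), 2),
    phrase_within("open source", 3),
    fielded("body", prefix("index")),
])
print(query)
```

The resulting string can be passed as the `q` parameter of a Solr request or handed to Lucene's query parser in an embedded application.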