What's New in Apache Solr 1.4
 

Solr 1.4 is better than ever! Read this white paper and learn about these new features, including:

* enhanced data import capabilities
* rich document handling
* speedier numeric range queries
* duplicate detection
* java-based replication and deployment
* smarter handling of index changes
* faster faceting
* streamlined caching


Open Source Search

What's New in Apache Solr 1.4

A Lucid Imagination Technical White Paper
© 2009 by Lucid Imagination, Inc. under the terms of Creative Commons license, as detailed at http://www.lucidimagination.com/Copyrights-and-Disclaimers/. Version 1.02, published 26 October 2009. Solr, Lucene, Apachecon and their logos are trademarks of the Apache Software Foundation.
Abstract

Apache Solr is the definitive application development implementation for Lucene, and it is the leading open source search platform. Solr 1.3 set a high bar for functionality, extensibility, and performance. As time marches on, Solr committers and contributors have been hard at work engineering to make a good thing even better. This white paper describes the new features and improvements in the latest version, Apache Solr 1.4.

In the simplest terms, Solr is now faster and better than before. Central components of Solr have been improved to cut the time needed for processing queries and indexing documents. The goal: to provide a powerful, versatile search application server with ever better scalability, performance, and relevancy. New features include streamlined caching, smarter handling of index changes, faster faceting, enhanced data import capabilities, speedier numeric range queries, duplicate detection, and more.
Table of Contents

Introduction
Performance Improvements
  Streamlined Caching
  Scalable Concurrent File Access
  Smarter Handling of Index Changes
  Faster Faceting
  Streaming Updates for SolrJ
  What Else Is New for Solr 1.4 Performance
Feature Improvements
  Solr Becomes an Omnivore
  DataImportHandler Enhancements
  Smoother Replication
  More Choices for Logging
  Multiselect Faceting
  Speedier Range Queries
  Duplicate Detection
  New Request Handler Components
  What Else Is New with Solr 1.4 Features
Get Started & Resources
Next Steps
APPENDIX: Choosing Lucene or Solr
Introduction

Apache Solr is the definitive application development implementation for Apache Lucene, and it is the leading open source search platform. If you imagine Lucene as a high-performance race car engine, then Solr is all the things that make that engine usable, such as a chassis, gas pedal, steering wheel, seat, and much more. Solr makes it easy to develop sophisticated, fast search applications with advanced features such as faceting. Solr builds on another open source search technology, Lucene, which provides indexing and search technology, as well as spellchecking, hit highlighting, and advanced processing capabilities. Both Solr and Lucene are developed at the Apache Software Foundation.

Lucene currently ranks among the top 15 open source projects and is one of the top 5 Apache projects, with installations at over 4,000 companies. Lucene and Solr downloads have grown nearly tenfold over the past three years; Solr is the fastest-growing Lucene subproject. Lucene and Solr offer an attractive alternative to proprietary licensed search and discovery software vendors.[1]

Solr 1.3 set a high bar for functionality, extensibility, and performance. As time marches on, Solr engineers have been hard at work making a good thing even better. This white paper describes the new features and improvements in the latest version, Solr 1.4. In the simplest terms, Solr is now faster and better than before. Central components of Solr have been improved to cut the time needed for processing queries and indexing documents. Many new features have been added, all with the goal of providing users with the information they want as fast as possible.

[1] See the Appendix for a discussion of when to choose Lucene or Solr.

Performance Improvements

Solr 1.4 increases Solr's speed with numerous improvements in key areas. Some of these enhancements are high-performance replacements for standard off-the-shelf Java platform components. Much as a car hobbyist replaces stock parts of an engine, the architects and programmers working on Solr have replaced crucial components to make Solr 1.4 run faster than ever for many common operations.

Streamlined Caching

Solr caches data from its index as an optimization, because reading from memory is always faster than reading from the file system. Over the duration of a single faceting request, the cache might be accessed hundreds or even thousands of times. Previously, the cache implementation was a synchronized LinkedHashMap from the Java platform API. Solr 1.4 uses a new class, ConcurrentLRUCache, which is specifically designed to minimize the overhead of synchronization. Anecdotal evidence suggests that this implementation can double query throughput in some circumstances.

Scalable Concurrent File Access

In the past, Solr used the Java platform's RandomAccessFile to read data from index files. Reading a portion of a file involves calling seek() to find the right part of the file and read() to actually retrieve the data. Multithreaded access to the same file has meant that the seek() and read() pairs must be synchronized. If the data to be read isn't already in the operating system cache, things get worse: the synchronization causes all other reading threads to wait while the data is retrieved from disk.
The Java Nonblocking Input/Output (NIO) API offers a much better solution. NIO's FileChannel includes a read() method that, in essence, performs a seek() and a read() in a single operation:

    public int read(ByteBuffer dst, long position)

Solr 1.4 uses this NIO method (via Lucene's NIOFSDirectory) to read index files.[2]

[2] On Windows, the older RandomAccessFile implementation is used because of a bug in the Windows NIO implementation.

Smarter Handling of Index Changes

Solr generally keeps a big pile of documents in an existing index. New documents are periodically added, but usually the number of new documents is small compared with the size of the index. Solr (via Lucene) stores the index as a collection of segments; as new documents are added, most of the segments will remain unchanged. Solr 1.4 is very much aware that, for the most part, index segments don't change. Consequently, Solr is much smarter about reusing unchanged segments, which results in less memory churn, less disk access, and better performance.

[Figure: reopen() builds a new index view that reuses the unchanged index segments already on disk.]
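The underlying mechanism comes from Lucene 2.9's IndexReader.reopen(). The sketch below shows the general pattern at the Lucene level rather than anything Solr-specific; the index path is hypothetical:

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.FSDirectory;

    public class ReopenExample {
        // reopen() returns a reader that shares every segment that did not change.
        public static IndexReader refresh(IndexReader reader) throws Exception {
            IndexReader refreshed = reader.reopen();
            if (refreshed != reader) {
                reader.close();       // release the old view; shared segments stay loaded
                return refreshed;
            }
            return reader;            // nothing changed, keep using the same reader
        }

        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open(
                    FSDirectory.open(new File("/path/to/index")), true);  // read-only reader
            // ... documents are added elsewhere ...
            reader = refresh(reader);
        }
    }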
One example is reloading an index. Previously, the entire index was loaded again, which is expensive in time and resources. Now, Solr 1.4 is smart enough to reuse index segments that haven't changed, resulting in a much more efficient reload of a modified index. This means that adding new documents to an index and making them available comes at a lower resource cost. The figure above illustrates the mechanism.

Many other optimizations have been made with respect to index segments. The field cache, for example, is now split so there is one field cache per segment. Again, this results in much more efficient processing of index updates, because the field caches for unchanged segments do not need to be touched.

Faster Faceting

One of Solr's killer features is faceting, the ability to quickly narrow and drill down into search results by categories. Solr uses UnInvertedField to keep a mapping between documents and field values so it can provide faceting information in response to queries. For multivalued fields, Solr 1.4 includes a new implementation of UnInvertedField that can be 50 times faster and 5 times smaller than its predecessor. Single-valued fields still use either the enum or fieldcache method.

Streaming Updates for SolrJ

SolrJ is the API that Java client applications use to work with Solr. The Solr 1.4 version of SolrJ includes an optimized implementation, StreamingUpdateSolrServer, which is useful for indexing many documents at a time. In one simple test, the number of documents indexed per second jumped from 231 to 25,000 using the new implementation.
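A minimal SolrJ sketch of bulk indexing with the new class follows; the server URL, queue size, thread count, and field names are illustrative assumptions rather than required values:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BulkIndexer {
        public static void main(String[] args) throws Exception {
            // Buffer up to 20 documents and send them with 4 background threads.
            SolrServer server =
                    new StreamingUpdateSolrServer("http://localhost:8983/solr", 20, 4);

            for (int i = 0; i < 100000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i);
                doc.addField("name", "Example document " + i);
                server.add(doc);      // queued and streamed; does not block per document
            }
            server.commit();          // make the documents searchable
        }
    }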
For bulk updates, consider switching to the new implementation.

What Else Is New for Solr 1.4 Performance

In addition to these important performance enhancements in Solr 1.4, there are several more, including:

• A binary format for updates, much more compact than XML, now available for SolrJ (see the sketch after this list).
• omitTermFreqAndPositions can be applied to a field so that Solr does not compute the number of terms and the list of positions for that field, which saves time and space for nontext fields.
• Queries that don't sort by score can eliminate scoring, which speeds up queries.
• Filters now apply before the main query, which makes queries up to 300% faster in some cases.
• A new filter implementation for small result sets, which is smaller and faster.
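As an example of the first item, the binary (javabin) update format is enabled on the SolrJ client side by swapping in a different request writer. A minimal sketch, where the server URL and field value are illustrative:

    import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BinaryUpdateExample {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server =
                    new CommonsHttpSolrServer("http://localhost:8983/solr");
            // Send update requests in the compact binary format instead of XML.
            server.setRequestWriter(new BinaryRequestWriter());

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "binary-example-1");
            server.add(doc);
            server.commit();
        }
    }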
Feature Improvements

Aside from performance improvements, Solr 1.4 sports a variety of great new features. As an open source project, Solr 1.4 is largely created by the people who use it, so the new features are the ones the community cares about most passionately.

Solr Becomes an Omnivore

Solr can't give you good results unless you give it good data. Normally you feed Solr XML documents corresponding to the structure of your schema. This works fine, and if all your data consists of XML documents, they can be fed directly to Solr or easily transformed into the correct input. Of course, reality is always messy. Chances are that many documents you want to include in your Solr index are in other file formats, like PDF or Microsoft Word. Fortunately, Solr 1.4 knows how to deal with the mess.

Solr 1.4 can now ingest these other types of documents using a feature called Solr Cell.[3] Solr Cell uses another open source project, Tika, to read documents in a variety of formats and convert them to an XHTML stream. Solr parses the stream to produce a document, which is then indexed. Here are a few of the formats that Tika understands (a SolrJ sketch of posting a rich document follows the list):

• PDF
• OpenDocument (OpenOffice formats)
• Microsoft OLE 2 Compound Document (Word, PowerPoint, Excel, Visio, etc.)
• HTML
• RTF
• gzip
• ZIP
• Java Archive (JAR) files

[3] The name is based on the acronym Content Extraction Library (CEL). This feature is also known by its more technical name, ExtractingRequestHandler.
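Here is a minimal SolrJ sketch of posting a PDF through Solr Cell. The handler path /update/extract, the file name, and the literal.id value are assumptions that depend on how the extracting handler is configured in your solrconfig.xml and schema:

    import java.io.File;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

    public class SolrCellExample {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server =
                    new CommonsHttpSolrServer("http://localhost:8983/solr");

            // Post a rich document to the extracting request handler (Solr Cell).
            ContentStreamUpdateRequest req =
                    new ContentStreamUpdateRequest("/update/extract");
            req.addFile(new File("whitepaper.pdf"));      // Tika detects the format
            req.setParam("literal.id", "whitepaper-1");   // supply a unique key explicitly
            server.request(req);
            server.commit();
        }
    }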
DataImportHandler Enhancements

DataImportHandler knows how to index data pulled from relational databases or XML files. The details of what is indexed and how it happens are configured in solrconfig.xml. Solr 1.4 contains some extremely useful upgrades to DataImportHandler.

The first is the ability to push data into DataImportHandler. In Solr 1.3, DataImportHandler was pull-only, which meant that the only way to push data to Solr was to use the update XML or CSV format, so you couldn't take advantage of any of DataImportHandler's capabilities. In Solr 1.4, a new component called ContentStreamDataSource allows you to use DataImportHandler's features for indexing pushed content.

Another powerful enhancement in Solr 1.4 is the ability to listen for import events. All you need to do is provide an implementation of the EventListener interface and let Solr know about it in solrconfig.xml. When importing begins and ends, your listener will be notified (a minimal sketch appears at the end of this section).

Solr 1.4 also brings the ability to control error handling in DataImportHandler. For each entity, you can control what happens when an error occurs via solrconfig.xml. The choices for error handling are as follows:

• abort: the import is stopped and all changes are rolled back.
• skip: the current document is skipped.
• continue: the import continues as if the error did not occur.

DataImportHandler contains many more enhancements and optimizations in Solr 1.4, including new data sources, new entity processors, and new transformers.
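A minimal sketch of such a listener is shown below. It assumes the DataImportHandler EventListener interface with a single onEvent(Context) callback; the class name and log message are hypothetical, and exactly how the listener is registered depends on your DataImportHandler configuration:

    import org.apache.solr.handler.dataimport.Context;
    import org.apache.solr.handler.dataimport.EventListener;

    // Hypothetical listener: reports when an import begins or ends. The same
    // onEvent() callback is used for both, depending on where it is registered.
    public class ImportLogger implements EventListener {
        public void onEvent(Context ctx) {
            System.out.println("DataImportHandler import event fired");
        }
    }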
Smoother Replication

Replication is a fancy name for making a copy of a Solr index, which at its heart is just a matter of copying files. Making copies of an index is useful for two reasons. The first is simply to create a backup. The second is to place the same index on multiple Solr servers, which is necessary if you want to distribute incoming requests to improve performance.

Prior to Solr 1.4, replication was implemented with shell scripts and consequently worked effectively only on platforms with a shell, such as Linux. It relied on the Unix rsync file utility and on hard links provided by the operating system, which could require cumbersome scripting and excluded tiered deployments on Windows platforms.

In Solr 1.4, replication has been abstracted and implemented entirely at the Java platform layer, which means it works (and works the same) wherever the Java platform runs. This is great news for anyone using Solr: backups can be performed in the same way on any Solr instance, regardless of hardware or operating system, and configuring replication across multiple Solr instances is similarly uniform. Replication does not require a backup; the index is copied directly from one live index to another.

Replication and backups are configured in solrconfig.xml. Add a couple of lines if you just want to make a backup; you can choose to back up on Solr startup or after every commit or optimize. In addition, you can use an HTTP command to request a backup at any time.

If you need to replicate an index across multiple servers, the configuration is pretty simple. Set it up in the master server's solrconfig.xml like this:

    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <str name="replicateAfter">commit</str>
        <str name="confFiles">schema.xml,stopwords.txt</str>
      </lst>
    </requestHandler>

You can choose to replicate on startup, after commits, or after optimization. The confFiles element specifies configuration files you want to replicate to slaves. Once the master configuration is done, point the slaves at the master, something like this:

    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <str name="masterUrl">http://masterhostname:8983/solr/replication</str>
        <str name="pollInterval">00:00:60</str>
      </lst>
    </requestHandler>

The slaves periodically query the master to see if the index has changed. If so, they pull down the changes and apply them. That's all!

More Choices for Logging

Logging is a crucial capability in a server application. Administrators examine logs to monitor Solr instances and figure out how to make them run optimally. Until now, Solr used the logging facility included with the Java Development Kit (JDK). Solr 1.4 uses a more flexible logging framework, SLF4J. SLF4J can bind to several logging implementations, including log4j, Jakarta Commons Logging (JCL), and JDK logging. The binding can be changed at runtime simply by switching JAR files around.
This is the best possible kind of upgrade. The default configuration, binding SLF4J to JDK logging, provides the same functionality as previous releases of Solr. However, you now have the option of easily plugging in log4j or JCL if you prefer.

Multiselect Faceting

Faceting is the ability to group search results by certain fields. Solr 1.4 adds support for multiselect faceting, the ability to narrow search results by multiple facets. Solr's support is generic and includes the ability to tag filters and to exclude filters by tag when faceting. A sample query string might look like this:

    q=index replication&facet=true
      &fq={!tag=proj}project:(lucene OR solr)
      &facet.field={!ex=proj}project
      &facet.field={!ex=src}source

To see this in action, check out the search facility that Lucid Imagination provides for searching technical knowledge resources on Solr, Lucene, and all their subprojects: http://search.lucidimagination.com/.

Speedier Range Queries

Solr can process queries that include numeric ranges, which means it can answer questions like "Which hats are between size 56 and 64?" and "Which swimming pools are less than 10 meters long?" In Solr 1.4, standard range queries can now use a prefix tree, or trie. Numbers are placed into the tree based on their digits, which makes range queries faster than comparing each complete number. For example, 175 is indexed as hundreds:1, tens:17, ones:175. Trie-based range queries have been observed to run up to 40 times faster than standard range queries.

To take advantage of fast range queries, use the TrieField type in your schema. The implementation takes care of the details, and you will notice that range queries are significantly faster.

[Figure: example of a prefix tree. The leaves of the tree hold the actual term values, and all descendants of a node share a common prefix associated with that node; the highlighted nodes are the ones needed to retrieve the range 215 to 977.]
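Queries themselves are written the same way as before. Here is a minimal SolrJ sketch, in which the field name size and its trie-based type are hypothetical (see the schema declarations that follow):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class RangeQueryExample {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server =
                    new CommonsHttpSolrServer("http://localhost:8983/solr");

            SolrQuery query = new SolrQuery("hats");
            // "size" is assumed to be a trie-based numeric field (e.g. type tint).
            query.addFilterQuery("size:[56 TO 64]");
            QueryResponse rsp = server.query(query);
            System.out.println("Matching hats: " + rsp.getResults().getNumFound());
        }
    }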
Let's look at another example, this time in the schema. The type attribute in the schema's field type declaration tells Solr which numeric type you will represent with TrieField. Here are a few declarations that show how to use TrieField for various numeric types:

    <fieldType name="tint" class="solr.TrieField" type="integer"
               omitNorms="true" positionIncrementGap="0" indexed="true" stored="false" />
    <fieldType name="tlong" class="solr.TrieField" type="long"
               omitNorms="true" positionIncrementGap="0" indexed="true" stored="false" />
    <fieldType name="tdouble" class="solr.TrieField" type="double"
               omitNorms="true" positionIncrementGap="0" indexed="true" stored="false" />

Duplicate Detection

With large sets of documents to be indexed, it is important to detect documents that are identical or nearly identical so that the document only gets added to the index once. Solr 1.4 offers this capability, named document duplicate detection or deduplication. The more technical name is SignatureUpdateProcessor. SignatureUpdateProcessor creates a message digest, or hash value, from some or all of the fields of a document. The hash value acts like a fingerprint for the document and can be quickly compared to the hash values for other documents.
Several hashing algorithms are available: MD5Signature and Lookup3Signature are both useful for exact matching, while TextProfileSignature (from the Apache Nutch project) is a fuzzy hashing implementation that detects documents that are nearly equivalent.

New Request Handler Components

New request handler components are now available in Solr 1.4:

• ClusteringComponent uses Carrot2 to dynamically cluster the top N search results, something like dynamically discovered facets.
• TermsComponent returns indexed terms and their document frequencies in a field, which is useful for auto-suggest and similar features.
• TermVectorComponent returns term information per document (term frequency, positions).
• StatsComponent computes statistics on numeric fields: min, max, sum, sumOfSquares, count, missing, mean, stddev.

What Else Is New with Solr 1.4 Features

Solr 1.4 has many other new features. A few of them are listed here:

• Ranges over arbitrary functions: {!frange l=1 u=2}sqrt(sum(a,b))
• Nested queries, for function queries too
• solrjs: a JavaScript client library
• commitWithin: a document must be committed within x milliseconds
• A binary field type
• Merging one index into another
• A SolrJ client for load balancing and failover (see the sketch after this list)
• Field globbing for some params: hl.fl=*_text
• DoubleMetaphone, an Arabic stemmer, and more
• VelocityResponseWriter: template responses using Velocity
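As an example of the load-balancing SolrJ client mentioned above, queries can be spread across several Solr instances. A minimal sketch, where both server URLs are hypothetical:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.LBHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class LoadBalancedQueryExample {
        public static void main(String[] args) throws Exception {
            // Requests are distributed across the listed servers; a server that
            // stops responding is skipped until it becomes reachable again.
            LBHttpSolrServer server = new LBHttpSolrServer(
                    "http://solr1.example.com:8983/solr",
                    "http://solr2.example.com:8983/solr");

            QueryResponse rsp = server.query(new SolrQuery("faceting"));
            System.out.println("Hits: " + rsp.getResults().getNumFound());
        }
    }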
    • Get Started & Resources http://www.lucidimagination.com/blog/2009/02/05/looking-forward-to-new-features- in-solr-14/ http://wiki.apache.org/solr/SolrReplication http://wiki.apache.org/solr/ExtractingRequestHandler http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content- Extraction-Tika http://www.lucidimagination.com/blog/tag/range-queries/ http://www.slf4j.org/manual.html http://wiki.apache.org/solr/Deduplication http://shalinsays.blogspot.com/2009/09/whats-new-in-dataimporthandler-in-solr.html Next Steps For more information on how Lucid Imagination can help your employees, customers, and partners find the information they need more quickly, effectively, and at lower cost, please visit http://www.lucidimagination.com/ to access blog posts, articles, and reviews of dozens of successful implementations. Certified Distributions from Lucid Imagination are complete, supported bundles of software which include additional bug fixes, performance enhancements, along with our free 30-day Get Started program. Coupled with one of our support subscriptions, a Certified Distribution can provide a complete environment to develop, deploy, and maintain commercial-grade search applications. Certified Distributions are available at www.lucidimagination.com/Downloads. Please e-mail specific questions to: Support and Service: support@lucidimagination.com Sales and Commercial: sales@lucidimagination.com Consulting: consulting@lucidimagination.com Or call: 1.650.353.4057 What’s New in Solr 1.4 A Lucid Imagination Technical White Paper • October 2009 Page 12
APPENDIX: Choosing Lucene or Solr

The great improvements in the capabilities of Lucene and Solr open source search technology have created rapidly growing interest in using them as alternatives to other search applications. As is often the case with open source technology, online community documentation provides rich detail on features and variations, but does little to provide explicit direction on which technology would be the best choice. So when is Lucene preferable to Solr, and vice versa?

There is in fact no single answer, as Lucene and Solr bring very similar underlying technology to bear on somewhat distinct problems. Solr is versatile and powerful: a full-featured, production-ready search application server requiring little formal software programming. Lucene presents a collection of directly callable Java libraries, with fine-grained control of machine functions and independence from higher-level protocols. In choosing which might be best for your search solution, the key questions to consider are application scope, deployment environment, and software development preferences.

If you are new to developing search applications, you should start with Solr. Solr provides scalable search power out of the box, whereas Lucene requires solid information retrieval experience and some meaningful heavy lifting in Java to take advantage of its capabilities. In many instances, Solr doesn't even require any real programming. Solr is essentially the "serverization" of Lucene, and many of its abstract functions are highly similar, if not just the same. If you are building an application for the enterprise sector, for instance, you will find Solr an almost 100% match to your business requirements: it comes ready to run in a servlet container such as Tomcat or Jetty, and ready to scale in a production Java environment. Its RESTful interfaces and XML-based configuration files can greatly accelerate application development and maintenance. In fact, Lucene programmers have often reported that they find Solr to contain "the same features I was going to build myself as a framework for Lucene, but already very well implemented." Once you start with Solr, and you find yourself using many of the features Solr provides out of the box, you will likely be better off using Solr's well-organized extension mechanisms instead of starting from scratch using Apache Lucene.

If, on the other hand, you don't want to make any calls via HTTP and want to have all of your resources controlled exclusively by Java API calls that you write, Lucene may be a better choice. Lucene works best when constructing and embedding a state-of-the-art search engine, allowing programmers to assemble and compile inside a native Java application.
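To make that contrast concrete, here is a minimal sketch of embedding Lucene directly through its Java API (Lucene 2.9 era; the index path and field are hypothetical). Everything happens in-process, with no HTTP and no server:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class EmbeddedLuceneExample {
        public static void main(String[] args) throws Exception {
            // Open an index writer directly against a local directory.
            IndexWriter writer = new IndexWriter(
                    FSDirectory.open(new File("/path/to/index")),
                    new StandardAnalyzer(Version.LUCENE_29),
                    IndexWriter.MaxFieldLength.UNLIMITED);

            Document doc = new Document();
            doc.add(new Field("title", "Embedded search example",
                    Field.Store.YES, Field.Index.ANALYZED));
            writer.addDocument(doc);
            writer.close();
        }
    }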
Some programmers set aside the convenience of Solr in order to control its large set of sophisticated features more directly, with low-level access to data or state manipulation, and choose Lucene instead, for example for byte-level manipulation of segments or intervention in data I/O. Investment at this lower level enables development of extremely sophisticated, cutting-edge text search and retrieval capabilities.

As for features, the latest version of Solr generally encapsulates the latest version of Lucene. Because the two are in many ways functional siblings, spending time gaining a solid understanding of how Lucene works internally can help you understand Apache Solr and its extension of Lucene's workings.

No matter which you choose, the power of open source search is yours to harness. More information on both Lucene and Solr can be found at http://www.lucidimagination.com.