Programmer’s Guide to Open Source Search:
What’s New in Apache Lucene 3.0

A Lucid Imagination Technical White Paper
© 2010 by Lucid Imagination, Inc. under the terms of Creative Commons license, as detailed at
http://www.lucidimagination.com/Copyrights-and-Disclaimers/. Version 1.02, published 6 June 2010. Solr,
Lucene, Apachecon and their logos are trademarks of the Apache Software Foundation.




Abstract
Apache Lucene is a high-performance, cross-platform, full-featured Information Retrieval
library in open source, suitable for nearly every application that requires full-text search
features.

Since its introduction nearly 10 years ago, Apache Lucene has become a competitive player
for developing extensible, high-performance full-text search solutions. The experience
accumulated over time by the community of Lucene committers and contributors and the
innovations they have engineered have delivered significant ongoing advances in Lucene’s
capabilities.

This white paper describes the new features and improvements in the latest versions,
Apache Lucene 2.9 and 3.0. It is intended mainly for programmers familiar with the broad
base of Lucene’s capabilities, though those new to Lucene should also find it a useful
exploration of the newest features. Key topics such as how to upgrade from 2.9 to 3.0, as
well as considerations for migrating from Lucene to Solr, are also addressed.

In the simplest terms, Lucene is now faster and more flexible than before. Historic weak
points have been improved to open the way for innovative new features like near-real-time
search, flexible indexing, and high-performance numerical range queries. Many new
features have been added, new APIs introduced, and critical bugs have been fixed—all with
the same goal: improving Lucene’s state-of-the-art search capabilities.




Table of Contents
Introduction
Core Features and Improvements
   Numeric Capabilities and Numeric Range Queries
   New TokenStream API
   Per-Segment Search
   Near Realtime Search (NRS)
   MultiTermQuery-Related Improvements
   Payloads
Additions to Lucene Contrib
   New Contrib Analyzers
   Lucene Spatial (formerly known as LocalLucene)
   Lucene Remote and Java RMI
   New Flexible QueryParser
   Minor Changes and Improvements in Lucene 2.9
   Changes and Improvements in Lucene 3.0
   Lucene Version by Version Compatibility since 2.9
Strategies for Upgrading to Lucene 2.9 / 3.0
   Upgrade to 2.9—Recommended Actions
   Upgrade to 2.9—Optional Actions
Migrating from Lucene to Solr?
References
Next Steps
APPENDIX: Choosing Lucene or Solr




Introduction
Apache Lucene is a high-performance, cross-platform, full-featured Information Retrieval
library, in open source, suitable for nearly every application that requires full-text search
features. Lucene currently ranks among the top 15 open source projects and is one of the
top 5 Apache projects, with installations at over 4,000 companies. Downloads of Lucene,
and its server implementation Solr, have grown nearly tenfold over the past three years;
Solr is the fastest-growing Lucene subproject. Lucene and Solr offer an attractive
alternative to proprietary licensed search and discovery software vendors.1 With the
release of versions 2.9 and 3.0 (September and November 2009), the Apache Lucene
community delivered the latest upgrades of Lucene.

This white paper aims to address key issues for you if you have an Apache Lucene-based
application, and need to upgrade existing code to work well with these latest versions, so
that you may take advantage of the various improvements and prepare for future releases
and application maintainability. If you do not have a Lucene application, the paper should
also give you a good overview of the innovations in this release.

Unlike the previous 2.4.1 release (March 2009), Lucene 2.9 and 3.0 go well beyond simple
bug fixes. They introduce multiple performance improvements, new features, better
runtime behavior, API changes, and bug fixes at a variety of levels. Importantly, 2.9
deprecates a number of legacy interfaces, and 3.0 is in the main a reimplemented version of
2.9, but without those deprecated interfaces.
The 2.9 release improves Lucene in several key aspects, which make it an even more
compelling alternative to other solutions. Most notably:
   • Improvements to Near-Realtime Search capabilities make documents searchable
     almost instantaneously.
   • A new, straightforward API for handling numeric ranges both simplifies
     development and virtually wipes out performance overhead.
   • The analysis API has been replaced for more streamlined, flexible text handling.



1   See the Appendix for a discussion of when to choose Lucene or Solr.

And, behind the scenes, the groundwork has been laid for yet more indexing flexibility in
future releases.
Lucene Contrib also adds new utility packages, introduced with this release:
   • An extremely flexible query parser framework opens new possibilities for
     programmers to more easily create their own query parsing syntax.
   • LocalLucene and its geo-search capabilities, now donated to Apache, provide this
     near-mandatory functionality for state-of-the-art search.
   • Various contributions have markedly improved support for languages like Arabic,
     Persian, and Chinese.
Version 3.0 is again a cleanup release and considered feature-equivalent to its predecessor.
3.0 is the first Apache Lucene release requiring Java 5 at runtime, enabling Lucene to make
use of new language features such as generics, enums, and variable arguments, along with
Java 5’s concurrency utilities.


                                      2.9 release improves Lucene in several
                                      key aspects and 2.9 deprecates a number
                                      of legacy interfaces. 3.0 is in the main a
                                      reimplemented version of 2.9, but without
                                      those deprecated interfaces.

While the majority of programmers are already running on either version 1.5 or 1.6
platforms (1.6 is the recommended JVM), Java 1.4 reached its end of service life in October
2008. With the new major Lucene 3.0 release, all legacy issues marked as deprecated have
now been removed, enforcing their replacement.
Some important notes on compatibility: because previous minor releases also contained
performance improvements and bug fixes, programmers have been accustomed to
upgrading to a new Lucene version just by replacing the JAR file in their classpath. And, in
those past cases, Lucene-based apps could be upgraded flawlessly without recompiling the
software components accessing or extending Apache Lucene. However, this may not be so
with Lucene 2.9/3.0.

Lucene 2.9 introduces several back-compatibility-breaking changes that may well require
changes in your code that uses the library. A drop-in library replacement is not guaranteed
to be successful; at a minimum, it is not likely to be flawless. As a result, we recommend
that if you are upgrading from a previous Lucene release, you should at least recompile any
software components directly accessing or extending the library. In the latter case,
recompilation alone will most likely not be sufficient. More details on these dependencies
are discussed in the “Upgrading Lucene” section of the paper.
We’ve also noted any significant compatibility issues throughout the paper.
This document is not intended to be a comprehensive overview of all functions of Lucene
2.9/3.0, but rather of new key features and capabilities. Always check the Lucid
Imagination Certified distribution (www.lucidimagination.com/downloads) and the official
Lucene website (lucene.apache.org) for the most up-to-date release information.




Core Features and Improvements

Numeric Capabilities and Numeric Range Queries
One of Apache Lucene's basic properties is its representation of internal searchable values
(terms) as UTF-8 encoded characters. Every value passed to Lucene must be converted into
a string in order to be searchable. At the same time, Lucene is frequently applied to search
numeric values and ranges, such as prices, dates, or other numeric field attributes.
Historically, searching over numeric ranges has been a weak point of the library. However,
the 2.9 release comes with a tremendous improvement for searching numeric values,
especially for range queries.
Prior to Lucene 2.9, numeric values were encoded with leading zeros, essentially as a full-
precision value. Values stored with full precision ended up creating many unique terms in
the index. Thus, if you needed to retrieve all documents in a certain range (e.g., from $1.50
to $1500.00), Lucene had to iterate through a lot of terms whenever many documents with
unique values were indexed. Consequently, execution of queries with large ranges and lots
of unique terms could be extremely slow as a result of this overhead.
Many workaround techniques have evolved over the years to improve the performance of
ranges, such as encoding date ranges in multiple fields with separate fields for year, month,
and day. But at the end of the day, every programmer had to roll his or her own way of
searching ranges efficiently.
In Lucene 2.9, NumericUtils and its relatives (NumericRangeQuery /
NumericRangeFilter) introduce native numeric encoding and search capabilities.
Numeric Java primitives (long, int, float, and double) are transformed into prefix-
encoded representations with increasing precision. Internally each prefix precision is
generated by stripping off the least significant bits indicated by the precisionStep. Each
value is subsequently converted to a sequence of 7-bit ASCII characters (due to the UTF-8
term encoding in the index, 8 or more bits would split into two or more bytes) resulting in
a predictable number of prefix-terms that can be calculated ahead of time. The figure below
illustrates such a Prefix Tree.




Example of a Prefix Tree, where the leaves of the tree hold the actual term values and all the descendants of a
node have a common prefix associated with the node. Bold circles mark all relevant nodes to retrieve a range
from 215 to 977.



The generated terms are indexed just like any other string values passed to Lucene. Under
the hood, Lucene associates distinct terms with all documents containing the term, so that
all documents containing a numeric value with the same prefix are “grouped” together,
meaning the number of terms that need to be searched is reduced tremendously. This
stands in contrast to the less efficient encoding scheme in previous releases, where each
unique numeric value was indexed as a distinct term and query cost grew with the number
of unique terms in the index.




Directory directory = new RAMDirectory();
Analyzer analyzer = new WhitespaceAnalyzer();
IndexWriter writer = new IndexWriter(directory, analyzer,
        IndexWriter.MaxFieldLength.UNLIMITED);
for (int i = 0; i < 20000; i++) {
    Document doc = new Document();
    doc.add(new Field("id", String.valueOf(i), Field.Store.YES,
            Field.Index.NOT_ANALYZED_NO_NORMS));
    String num = Integer.toString(i);
    String paddedValue = "00000".substring(0, 5 - num.length()) + num;
    doc.add(new Field("oldNumeric", paddedValue, Field.Store.YES,
            Field.Index.NOT_ANALYZED_NO_NORMS));
    writer.addDocument(doc);
}
writer.close();
Indexing a zero-padded numeric value for use with an ordinary RangeQuery.



You can also use the native encoding of numeric values beyond range searches. Numeric
fields can be loaded in the internal FieldCache, where they are used for sorting. Zero-
padding of numeric primitives (see code example above) is no longer needed as the trie-
encoding guarantees the correct ordering without requiring execution overhead or extra
coding.
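As an illustration (our sketch, not from the original paper), sorting on the trie-encoded
"newNumeric" field built in the listing below needs nothing more than an ordinary
SortField; the FieldCache decodes the prefix-coded terms back into ints:

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;

// Sort all documents by the trie-encoded "newNumeric" field; no
// zero-padding is required because the FieldCache decodes the values.
IndexSearcher searcher = new IndexSearcher(directory, true);
Sort sort = new Sort(new SortField("newNumeric", SortField.INT));
TopDocs sorted = searcher.search(new MatchAllDocsQuery(), null, 10, sort);

Sorting on a NumericField without zero-padding (illustrative sketch; assumes the index built in the listing below)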
The code listing below instead uses the new NumericField to index a numeric Java
primitive with a precision step of 4 bits. Like the straightforward NumericField, querying
numeric ranges also comes with a type-safe API: NumericRangeQuery instances are
created using one of the provided static constructors for the corresponding Java primitive.




Directory directory = new RAMDirectory();
Analyzer analyzer = new WhitespaceAnalyzer();
IndexWriter writer = new IndexWriter(directory, analyzer,
        IndexWriter.MaxFieldLength.UNLIMITED);
for (int i = 0; i < 20000; i++) {
    Document doc = new Document();
    doc.add(new Field("id", String.valueOf(i), Field.Store.YES,
            Field.Index.NOT_ANALYZED_NO_NORMS));
    doc.add(new NumericField("newNumeric", 4,
            Field.Store.YES, true).setIntValue(i));
    writer.addDocument(doc);
}
writer.close();
Indexing numeric values with the new NumericField type

The example below shows a numeric range query using an int primitive with the same
precision used in the indexing example. If different precision values are used at index or
search time, numeric queries can yield unexpected behavior.




IndexSearcher searcher = new IndexSearcher(directory, true);
Query query = NumericRangeQuery.newIntRange("newNumeric", 4, 10,
        10000, true, false);
TopDocs docs = searcher.search(query, null, 10);
assertNotNull("Docs is null", docs);
assertEquals(9990, docs.totalHits);
for (int i = 0; i < docs.scoreDocs.length; i++) {
    ScoreDoc sd = docs.scoreDocs[i];
    // doc IDs match the indexed values here because the documents were
    // added in order with value == doc ID
    assertTrue(sd.doc >= 10 && sd.doc < 10000);
}
Searching numeric values with the new NumericRangeQuery



Improvements resulting from the new Lucene numeric capabilities are equally significant in
versatility and performance. Lucene can now cover almost every use-case related to
numeric values. Range searches and sorting on float or double values, as well as fast date
searches (with dates converted to time stamps), will execute in less than 100 milliseconds
in most cases. By comparison, the old approach using padded full-precision values could
take up to 30 seconds or more depending on the underlying index.


New TokenStream API
Almost every programmer who has extended Lucene has worked with its analysis function.
Text analysis is common to almost every use-case, and is among the best known Lucene
APIs.
Since its early days, Lucene has used a “Decorator Pattern” to provide a pluggable and
flexible analysis API, allowing a combination of existing and customized analysis
implementations. The central analysis class TokenStream enumerates a sequence of
tokens from either a document's fields or from a query. Commonly, multiple
TokenStream instances are chained, each applying a separate analysis step to text terms
represented by a Token class that encodes all relevant information about a term.
Prior to Lucene 2.9, TokenStream operated exclusively on Token instances transporting
term information through the analysis chain. With this release, the token-based API has
been marked as deprecated. It is completely replaced by an attribute-based API.

Here’s how it has changed. Rather than receiving a Token instance from one of the two
TokenStream.next() methods, the new API follows a stateful approach. To advance in
the stream, consumers call TokenStream.incrementToken(), which returns a Boolean
indicating whether the end of the stream has been reached. Information gathered during
the analysis process is encoded in attributes accessible via the new TokenStream base
class AttributeSource. In contrast to the older Token class, the attribute-based
approach separates specific term characteristics from others not necessarily related. Each
TokenStream adds the attributes it specifically targets at construction time (see code
listing below) and keeps references to them throughout its lifetime. This provides
type-safe access to all attributes relevant for a particular TokenStream instance.


   protected CharReplacementTokenStream(TokenStream input) {
       super(input);
       termAtt = (TermAttribute) addAttribute(TermAttribute.class);
   }
Adding a TermAttribute at construction time

Inside TokenStream.incrementToken(), a token stream only operates on attributes
that have been declared in the constructor. For instance, if you have Lucene replacing a
character like a German umlaut in a term, only the TermAttribute (declared at
construction time in the code listing above) is used. (Other attributes like
PositionIncrementAttribute or PayloadAttribute are ignored by this
TokenStream as they might not be needed in this particular use-case.)




public boolean incrementToken() throws IOException {
    if (input.incrementToken()) {
        final char[] termBuffer = termAtt.termBuffer();
        final int termLength = termAtt.termLength();
        if (replaceChar(termBuffer, termLength)) {
            termAtt.setTermBuffer(output, 0, outputPos);
        }
        return true;
    }
    return false;
}
Replacing characters using the new attribute-based API.

What the above example does not demonstrate is the full power of the new token API.
There, we replaced one or more characters in the token and discarded the original one. Yet,
in many use-cases, the original token should be preserved in addition to the modified one.
Using the old API required a fair bit of work and logic to handle such a common use-case.
In contrast, the new attribute-based approach allows capture and restoration of the state of
attributes, which makes such use-cases almost trivial. The example below shows a version
of the previous example improved for Lucene 2.9/3.0, in which the original term attribute
is restored once the stream is advanced.




public boolean incrementToken() throws IOException {
    if (state != null) {
        restoreState(state);
        state = null;
        return true;
    }
    if (input.incrementToken()) {
        final char[] termBuffer = termAtt.termBuffer();
        final int termLength = termAtt.termLength();
        if (replaceChar(termBuffer, termLength)) {
            state = captureState();
            termAtt.setTermBuffer(output, 0, outputPos);
        }
        return true;
    }
    return false;
}
Replacing characters and additionally emitting the original term text using the new attribute-based API (position
increments are omitted).

The separation of attributes makes it possible to add arbitrary properties to the analysis
chain without using a customized Token class. Attributes are then made type-safely
accessible by all subsequent TokenStream instances, and can eventually be used by the
consumer. This way, you get a generic way to add various kinds of custom information,
such as part-of-speech tags, payloads, or average document length, to the token stream.
Unfortunately, Lucene 2.9 and 3.0 don't yet provide functionality to persist a custom
Attribute implementation to the underlying index. This improvement, part of what is often
referred to as "flexible indexing," is under active development and is proposed for one of
the upcoming Lucene releases.
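To make this concrete, here is a sketch of a hypothetical part-of-speech attribute (the
names and implementation are ours for illustration and are not part of Lucene; in practice,
the interface and the implementation class would live in separate source files):

import org.apache.lucene.util.Attribute;
import org.apache.lucene.util.AttributeImpl;

// Hypothetical attribute carrying a part-of-speech tag through the chain.
public interface PartOfSpeechAttribute extends Attribute {
    void setPartOfSpeech(String pos);
    String getPartOfSpeech();
}

public class PartOfSpeechAttributeImpl extends AttributeImpl
        implements PartOfSpeechAttribute {
    private String pos;

    public void setPartOfSpeech(String pos) { this.pos = pos; }
    public String getPartOfSpeech() { return pos; }

    // called when the stream advances, before the next token is produced
    public void clear() { pos = null; }

    public void copyTo(AttributeImpl target) {
        ((PartOfSpeechAttribute) target).setPartOfSpeech(pos);
    }

    public boolean equals(Object other) {
        if (!(other instanceof PartOfSpeechAttributeImpl)) return false;
        String otherPos = ((PartOfSpeechAttributeImpl) other).pos;
        return pos == null ? otherPos == null : pos.equals(otherPos);
    }

    public int hashCode() { return pos == null ? 0 : pos.hashCode(); }
}

A hypothetical custom attribute (sketch only). A custom TokenFilter would register it with addAttribute(PartOfSpeechAttribute.class), exactly as the TermAttribute example above does.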
Beyond the generalizability of this API, one of its most significant improvements is its
effective reuse of Attribute instances across multiple iterations of analysis. Attribute
implementations are created during TokenStream instantiation and are reused each time
the stream advances to a successive increment. Even if a stream is used for another
analysis, the same Attribute instances may be used, provided the stream is reusable.
This greatly reduces the rate of object creation, streamlining execution and minimizing any
required garbage collection.
While Lucene 2.9 provides full back-compatibility for old-style
TokenStream implementations, it is strongly recommended to
update any existing custom TokenStream implementations to
exclusively use incrementToken() instead of one of the overhead-
heavy next() methods. Lucene 3.0 removes this compatibility layer
and enforces the new attribute-based API.
If you are trying to update your custom TokenStream or one of its subclasses
(TokenFilter and Tokenizer), it is recommended that you use the abstract
BaseTokenStreamTestCase class, which provides various utility functions for
testing against the new and old API. The test case is freely available in the
source distribution of Apache Lucene 2.9/3.0.


Per-Segment Search
Since the early days of Apache Lucene, documents have been stored at the lowest level in a
segment—a small but entirely independent index. At the highest abstraction level, Lucene
combines segments into one large index and executes searches across all visible segments.
As more and more documents are added to an index, Lucene buffers your documents in
RAM and flushes them to disk periodically. Depending on a variety of factors, Lucene either
incrementally adds documents to an existing segment or creates entirely new segments. To
reduce the negative impact of an increasing number of segments on search performance,
Lucene tries to combine/merge multiple segments into larger ones. For optimal search
performance, Lucene can optimize an index, which essentially merges all existing segments
into a single segment.
Prior to Lucene 2.9, search logic resided at the highest abstraction level, accessing a single
IndexReader no matter how many segments the index was composed of. Similarly, the
FieldCache was associated with the top-level IndexReader and had to be
invalidated each time an index was reopened. With Lucene 2.9, the search logic and the
FieldCache have moved to a per-segment level. While this has introduced a little more
internal complexity, the benefit of the tradeoff is a new per-segment index behavior that
yields a rich variety of performance improvements for unoptimized indexes.


In most applications, existing segments rarely change internally, and this property had not
been effectively utilized in previous versions of Lucene. IndexReader.reopen(), first
added in Lucene 2.4, now has the ability to add new or changed segments to an already
existing top-level IndexReader instead of reloading all existing segments. The
FieldCache also takes advantage of rarely changing segments. Cache instances for
unchanged segments can remain in memory, and only new or updated segments need their
caches rebuilt, instead of the FieldCache being invalidated entirely. Depending on the number of changed index
segments, this can heavily reduce I/O as well as garbage collection costs, compared to
reopening the entire index.
Previous versions of Lucene also suffered from long warming times for sorting and function
queries. Those use-cases have been improved, as the warm-up of reopened searchers is
now much faster.
It's worth mentioning that Per-Segment Search doesn't yield improvements in all situations.
If an IndexReader is opened on an optimized index, all pre-existing segments are merged
into a single one, which then loads in its entirety. In other situations, perhaps more
common, where some changes have been committed to the index and a new
IndexReader instance is obtained by calling IndexReader.reopen() on a previously
opened reader, the new per-segment capabilities can dramatically speed up reopening. But
in this case, opening a new IndexReader using one of the overloaded static
IndexReader.open() methods will create an entirely new reader instance and
therefore can't take advantage of any per-segment capabilities.




IndexReader reader = indexWriter.getReader();
…
IndexReader newReader = reader.reopen();
if (reader != newReader) {
    reader.close();
    reader = newReader;
}


Obtaining and reopening a Near-Realtime reader from an IndexWriter instance

The majority of Lucene users won’t touch the changes related to Per-
Segment Search during their day-to-day business unless they are
working on low-level code implementing Filter or custom
Collector classes. Both classes directly expose the per-segment
model, for example via Collector#setNextReader(), which is
called once for each segment during search. The Filter API, by
contrast, doesn’t immediately reveal its relation to per-segment
search and has caused lots of confusion in the past:
Filter#getDocIdSet(IndexReader) and its deprecated
relative Filter#bits(IndexReader) are also called once per
segment instead of once per index. The document IDs set by the
Filter must be relative to the current segment rather than absolute.
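As a sketch of this contract in practice (our example; the class name is illustrative), a
minimal Collector must translate segment-relative document IDs into absolute ones using
the docBase passed to setNextReader():

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Illustrative collector gathering absolute doc IDs across all segments.
public class AllDocIdCollector extends Collector {
    private final List<Integer> docIds = new ArrayList<Integer>();
    private int docBase; // offset of the current segment in the index

    public void setScorer(Scorer scorer) {
        // scores are not needed, so the scorer is simply ignored
    }

    public void collect(int doc) {
        // doc is relative to the current segment; adding docBase
        // yields the absolute, top-level document ID
        docIds.add(docBase + doc);
    }

    public void setNextReader(IndexReader reader, int docBase) {
        this.docBase = docBase; // called once per segment
    }

    public boolean acceptsDocsOutOfOrder() {
        return true; // collection order does not matter here
    }

    public List<Integer> getDocIds() { return docIds; }
}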


Near Realtime Search (NRS)
More and more, Lucene programmers are pursuing real-time or near-real-time
requirements with their search applications. Previous Lucene versions did a decent job
with the incremental changes characteristic of this scenario, capturing those changes and
making them available for searching. Lucene 2.9 adds significant new capabilities for
addressing the requirements of high-change document environments.
First of all, the IndexWriter—in general responsible for modifying the underlying index
and flushing documents to disk—now offers a way to obtain an IndexReader instance
directly from the writer. The newly obtained reader then not only reflects the documents
already flushed to disk, but also makes all uncommitted documents still residing in
memory almost instantly searchable.
The reader instance returned by IndexWriter.getReader() supports reopening as
long as the writer that released the reader has not been closed. Once it is closed,
reopening the reader will result in an AlreadyClosedException.
It is important to understand why this feature is referred to as “near real-time” rather than
“real-time.” When IndexWriter.getReader() is called for the very first time, Lucene
needs to consume a reasonable amount of additional resources (i.e., RAM, CPU cycles, and
file descriptors) to make uncommitted documents searchable. Due to this additional work,
uncommitted documents will not always be available instantaneously. Nonetheless, in most
cases, the performance gained with this feature will be better than just reopening the
index, or the traditional simpler approach of opening a brand-new reader instance.
To keep the latency as low as possible, the IndexWriter offers an optional “prewarmup”
functionality, by which newly merged segments can be prepared for real-time search. If you
are new to this feature, you should be aware that the pre-warmup API is still marked
experimental and might change in future releases.
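A minimal sketch of the warmup hook (our example; since the API is experimental, check
the 2.9/3.0 javadocs before relying on it):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;

// Prepare newly merged segments before they are exposed to searches,
// assuming an open IndexWriter named writer.
writer.setMergedSegmentWarmer(new IndexWriter.IndexReaderWarmer() {
    public void warm(IndexReader reader) throws IOException {
        // run a cheap query so the segment's data structures are loaded
        new IndexSearcher(reader).search(new MatchAllDocsQuery(), 10);
    }
});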


MultiTermQuery-Related Improvements
In Lucene 2.4, many standard queries, such as FuzzyQuery,
WildcardQuery, and PrefixQuery, were refactored and
subclassed under MultiTermQuery. Lucene 2.9 adds some
improvements under the hood, resulting in much better performance
for those queries.2
In Lucene 2.9/3.0, multiterm queries now use a constant score internally, based on the
assumption that most programmers don't care about the interim score of the queries
resulting from the term expansion that takes place during query rewriting.




2 This could be a back-compatibility issue if one of those classes has been subclassed.

Although constant-scoring is now the default behavior, the older scoring mode is still
available for multiterm queries in 2.9/3.0. Beyond that, you can choose one of the following
scoring modes:
   • Filtered constant score: rewrites the multiterm query into a
     ConstantScoreQuery in combination with a filter to match all relevant
     documents.
   • BooleanQuery constant score: rewrites the multiterm query into a
     ConstantScoreQuery based on a BooleanQuery by translating each term into
     an optional Boolean clause. This mode still has a limitation of maxClauseCount and
     might raise an exception if the query has too many Boolean clauses.
   • Conventional scoring (not recommended): rewrites the multiterm query into an
     ordinary BooleanQuery.
   • Automatic constant score (default): tries to choose the best constant-score mode
     (filter or BooleanQuery) based on term and document counts from the query.
     If the number of terms and documents is small enough, BooleanQuery is chosen;
     otherwise the query rewrites to a filter-backed ConstantScoreQuery.
You can change the scoring mode by passing an implementation of RewriteMethod to
MultiTermQuery.setRewriteMethod() as shown in the code example below.
PrefixQuery prefixQuery = new PrefixQuery(new Term("aField", "luc"));
prefixQuery.setRewriteMethod(
        MultiTermQuery.CONSTANT_SCORE_FILTER_REWRITE);
Explicitly setting a filtered constant-score RewriteMethod on a PrefixQuery



Payloads
The Payloads feature, though originally added in a previous version of Lucene, remains
pretty new to most programmers. A payload is essentially a byte array that is associated
with a particular term in the index. Payloads can be associated with a single term during
text analysis and subsequently committed directly to the index. On the search side, these
byte arrays are accessible to influence the scoring for a particular term, or even to filter
entire documents.



For instance, if your Lucene application is analyzing the phrase “Gangs of New York,”
payloads can encode information about the terms “New” and “York” together, so that they
are treated as a paired term for the name of a city, or can specify that “Gangs” is a noun
rather than a verb. Prior to 2.9, payloads were exposed via a query called
BoostingTermQuery, which has now been renamed to PayloadTermQuery. By using
this query type, you can query Lucene to find all occurrences where “New” is a part of a city
name like “New York” or “New Orleans”.
In comparison with previous versions, Lucene 2.9/3.0 also provides more control and
flexibility for payload scoring. You can pass a custom PayloadFunction to the
constructor of a payload-aware query. Each payload is fed back to the custom function,
which calculates the score based on the cumulative outcomes of payload occurrences.
This improvement becomes even more useful when payloads are used in combination with
span queries. Spans represent a range of term positions in a document; payloads, in turn,
can support scoring based on the distance between terms. For instance, using a
PayloadNearQuery, documents can be scored differently depending on whether terms
are in the same sentence or paragraph, if that information is encoded in the payload.
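For example (our sketch, assuming payloads were written for the "body" field at indexing
time), a PayloadTermQuery combines a term with one of the bundled PayloadFunction
implementations:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.payloads.AveragePayloadFunction;
import org.apache.lucene.search.payloads.PayloadTermQuery;

// Score each match for body:new by the average of the payload values
// stored with the term occurrences.
PayloadTermQuery query = new PayloadTermQuery(
        new Term("body", "new"),
        new AveragePayloadFunction());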
At a higher abstraction level, another payload-aware TokenFilter has been added.
DelimitedPayloadTokenFilter splits tokens at a predefined character delimiter,
where the first part of the token is the token itself and the second part, after the delimiter,
represents the payload. For example, it can parse an e-mail address such as
carol.smith@apache.org, making “carol.smith” the token and creating a payload to
represent the domain name, “apache.org”. A customizable payload encoder takes care of
encoding the values while everything else magically happens inside the filter (see the
sketch below). Besides being a convenient way to add payloads to existing search
functionality, this class also serves as a working example of how to use payloads during
the analysis process.3




3 See www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads for more information.

Additions to Lucene Contrib
So far, we’ve reviewed key new features and improvements introduced in the Apache
Lucene core API. This section outlines the major additions and improvements to Lucene
Contrib packages. Contrib packages are parts of Lucene that do not necessarily belong to
the API core but are often helpful in building Lucene applications.


New Contrib Analyzers
The Analysis package in Lucene Contrib has always been a valuable source for almost every
Lucene programmer. The latest release brings several noteworthy improvements,
especially in terms of language support.
   • Better support for Chinese: Chinese, like many Asian languages, does not use
     whitespace to delimit one word from another, nor is punctuation used at all.
     Smart-CN provides an analyzer with improved tokenization capabilities for splitting
     sequences of Chinese characters into words. While Smart-CN is part of the analyzers
     contrib module, it is distributed in its own JAR file because of the large (6MB) file
     resources it depends on.
   • “Light10”-based Arabic analysis: a new analyzer based on a high-performance
     stemming algorithm (Light10) applying lightweight prefix and suffix removal to
     Arabic text.
   • Persian analyzer: applies character normalization and Persian stopword removal
     to Persian-only or mixed-language text.
   • Reverse string filter, as in leading wildcards: to support a search feature like leading
     wildcards efficiently, one common trick is to index terms in reverse order. A leading
     wildcard effectively becomes a trailing wildcard if searched against a field with
     reversed tokens (see the sketch below).
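A sketch of the reversal trick (our example; the anonymous analyzer is illustrative):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.reverse.ReverseStringFilter;

// Index reversed tokens so that a leading wildcard such as *ship can
// be rewritten into the trailing wildcard pihs* against this field.
Analyzer reversingAnalyzer = new Analyzer() {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new ReverseStringFilter(new WhitespaceTokenizer(reader));
    }
};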


Lucene Spatial (formerly known as LocalLucene)
Geospatial search has become a very common use-case, especially with the advent of
mobile devices. Almost every new mobile platform supports a “nearby” search feature. End
users seeking data on something near their current location (restaurants, movie theatres,
plumbers, etc.) expect both that results are limited to within a certain range, and that
results can be ranked by distance from the end user’s location.
In early 2009, an open source project formerly known as LocalLucene was donated to
Apache Lucene and integrated as a contrib package. Lucene Spatial extends Lucene
capabilities with support for geographical and location-based search.
While Lucene Spatial doesn't have any distance scoring capabilities, it can effectively help
to filter and sort based on geographical information like longitude and latitude values.
Filtering is an especially common use-case when combined with a full-text query. In a
search for “French restaurant” within 5 miles of a specific location, the filter restricts the
search space to documents with location fields within 5 miles; the rest of the search
operation is implemented in core Lucene.
Lucene Spatial has a couple of different ways to encode geographic information:
   • GeoHash: a hierarchical spatial data structure that subdivides space into buckets
     in a grid shape. GeoHash takes the even bits from the longitude value, while the
     odd bits are taken from the latitude value. The result is an arbitrary-precision,
     base-32-encoded string with the property that gradually removing characters
     from the end of the string reduces the size and precision of the code. Nearby
     places are likely to have similar prefixes due to this property.
   • Cartesian Tiers: projects the world onto a flat surface. Overlays to this projection
     are created as grids (Cartesian Tiers), with each tier having an increasing
     number (always a power of two) of grid boxes dividing up the projection.
     Location data can be placed within one of the grid boxes with different precision
     depending on the number of grid boxes on the tier.
Both of the above allow efficient storage of geo-information in a Lucene index. In contrast
to plain latitude and longitude values indexed in separate fields, GeoHash and Cartesian
Tiers encode in a single field.
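For instance (our sketch), the contrib GeoHashUtils class can encode a latitude/longitude
pair into a single geohash string suitable for indexing in one field:

import org.apache.lucene.spatial.geohash.GeoHashUtils;

// Encode latitude/longitude into one prefix-friendly geohash string;
// nearby locations are likely to share a common prefix.
String geoHash = GeoHashUtils.encode(40.7128, -74.0060);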
Note that despite its previous releases under a different name (LocalLucene), the Lucene
Spatial API still isn't considered stable and might change in future releases.




Lucene Remote and Java RMI
The historic dependency on Java RMI has now been removed from the Lucene core: Lucene
Remote is now partitioned into an optional contrib package. While the package itself
doesn't add any functionality to Lucene, it introduces a critical back-compatibility issue
likely to be relevant for many programmers. In prior versions, the core interface
Searchable extended java.rmi.Remote to enable searches on remote indexes. If you
had taken advantage of this convenience, you will now have to add the new Lucene-remote
JAR file to the classpath and change your code to use the new remote base interface
RMIRemoteSearchable, as shown below.


final RMIRemoteSearchable remoteObject = ...;
final String remoteObjectName = ...;
Naming.rebind(remoteObjectName, remoteObject);
Searchable searchable = (Searchable) Naming.lookup(remoteObjectName);
Using RemoteSearchable with Lucene 2.9



New Flexible QueryParser
Lucene’s built-in query parser has been a burden on developers trying to extend the default
query syntax. While changing certain parts of it, such as query instantiation, could be
readily achieved by subclassing the parser, changing the actual syntax required deep
knowledge of the JavaCC parser generator.
The new contrib package QueryParser provides a complete query parser framework,
which is fully compliant with the core parser but enables flexible customization through a
modular architecture.
The basic idea of the new query parser is to separate the syntax from the semantics of a
query, which is internally represented as a tree. Ultimately, the parser splits into three
stages:
   1. Parsing stage: transforms the query text (syntax) into a QueryNode tree. This stage
      is exposed through a single interface (SyntaxParser), which custom
      implementations of this stage must implement.
   2. Query-Node processing stage: once the QueryNode tree is created, a chain of
      processors starts working on the tree. While walking down the tree, a processor can
      apply query optimizations, child reordering, or term tokenization even before the
      query is actually executed.
   3. Building stage: the final stage builds the actual Lucene Query object by mapping
      QueryNode types to associated builders. Each builder subsequently applies the
      actual conversion into a Lucene query.
The snippet below, taken from the new standard QueryParser implementation, shows how
the stages are exposed at the API's top level.


   QueryNode queryTree = this.syntaxParser.parse(query, getField());
   queryTree = this.processorPipeline.process(queryTree);
   return (Query) this.builder.build(queryTree);


To provide a smooth transition from the existing core parser to the new API, this contrib
package also contains an implementation fully compliant with the standard query syntax
(see the sketch below). This not only eases the switch to the new query parser, but also
serves as an example of how to use and extend the API. That said, the standard
implementation is based on the new query parser API and therefore can't simply replace
the core parser as-is. If you have been replacing Lucene's current query parser, you can use
QueryParserWrapper instead, which preserves the old query parser interface but calls
the new parser framework. One final caveat: the QueryParserWrapper is marked as
deprecated, as the new query parser will be moved to the core in an upcoming release and
eventually replace the old API.
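As a sketch of the standard implementation in action (our example; note that parse()
throws the checked QueryNodeException, which the surrounding method must handle or
declare):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.standard.StandardQueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

// Parse a query string with the standard syntax via the new framework.
StandardQueryParser parser = new StandardQueryParser(
        new StandardAnalyzer(Version.LUCENE_29));
Query query = parser.parse("title:(lucene AND 3.0)", "defaultField");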


Minor Changes and Improvements in Lucene 2.9
Besides the improvements and entirely new features, Lucene 2.9 contains several minor
improvements worth mentioning. The following points are a partial outline of minor
changes.




   • Term-vector-based highlighter: a new highlighter implementation based on term
     vectors (essentially a view of terms, offsets, and positions in a document’s fields). It
     supports features like n-gram fields and phrase-unit highlighting with slops, and
     yields good performance on large documents. The downside is that it requires a lot
     more disk space due to stored term vectors.
   • Collector replaces HitCollector: the low-level HitCollector was deprecated
     and replaced with a new Collector class. Collector offers a more efficient API
     to collect hits across sequential IndexReader instances. The most significant
     improvement here is that score calculation is now decoupled from collecting hits, or
     skipped entirely if not needed—a nice new efficiency.
   • Improved String “interning”: Lucene 2.9 internally uses a custom String intern
     cache instead of Java’s default String.intern(). The lockless implementation
     yields minor internal performance improvements.
   • New n-gram distance: a new n-gram-based distance measure was added to the
     contrib spellcheck package.
   • Weight is now an abstract class: the Weight interface was refactored to an
     abstract class, including minor method signature changes.
   • ExtendedFieldCache marked deprecated: all methods and parsers from the
     interface ExtendedFieldCache have been moved into FieldCache.
     ExtendedFieldCache is now deprecated and contains only a few declarations
     for binary backwards compatibility.
   • MergePolicy interface changed: MergePolicy now requires an IndexWriter
     instance to be passed upon instantiation. As a result, IndexWriter was removed
     as a method argument from all MergePolicy methods.
For a complete list of improvements, bug fixes, compatibility, and runtime behavior
changes, you should consult the CHANGES.txt file included in the Lucene distribution
(lucene.apache.org/java/2_9_0/changes/Changes.html).



Changes and Improvements in Lucene 3.0
Lucene 3.0 provides a clean transition to a new major version of the library. Since no new
features have been introduced, this section gives a short overview of important changes
regarding backwards compatibility, API changes, and removed features.
   • Removed compressed fields: compressed fields, already deprecated in Lucene 2.9,
     have been removed without a direct replacement. While Lucene 3.0 is still able to
     read indexes with compressed fields, index merges or optimizations will
     decompress and store such fields transparently. Given this behavior, indexes built
     with compressed fields might suddenly become larger during a segment merge or
     optimization.
   • Removed deprecated classes and methods: deprecated methods and classes have
     been removed in 3.0. A full list can be found at
     lucene.apache.org/java/3_0_0/changes/Changes.html#3.0.0.api_changes.
   • Generics & Java 5 features: Lucene 3.0 became the first release requiring Java 5 as
     an underlying execution environment. In addition to various replacements of
     classes with their improved equivalents, like StringBuffer with
     StringBuilder, many new language features were introduced. Public APIs now
     make heavy use of generic types and variable arguments. Under the hood, this
     move paved the way for improvements using the Java 5 concurrent utilities.
   • Scorer deprecations: 3.0 refactors several methods at Lucene’s lowest level.
     Scorer and its abstract superclass DocIdSetIterator have incompatible API
     changes, while equivalents for old APIs are provided. Custom Query, Scorer, or
     DocIdSetIterator implementations must be ported to the new API in order to
     be compatible with Lucene 3.0.
   • Made core TokenStreams final: to enforce Lucene’s Decorator-based analysis
     model, several core TokenStream implementations have been declared final
     without any replacement and can therefore no longer be subclassed. Users
     subclassing streams like KeywordTokenizer or StandardTokenizer are
     required to rebuild the functionality.
To gain a comprehensive understanding of what has changed in 3.0 in contrast to Lucene
2.9, programmers should consult the CHANGES.txt file and the corresponding issues on the
Lucene issue tracker.


Lucene Version by Version Compatibility since 2.9
In Lucene 2.9, a Version constant was first introduced to help preserve version-by-version
backwards compatibility. The initial purpose of Version was to enable the Lucene
contributors to eventually fix long-known bugs and limitations in Lucene without breaking
their own backwards compatibility policy.
Lucene’s StandardAnalyzer was the first class making use of
Version to change its runtime behavior based on the given version
number. Its constructor requires an instance of Version and changes
its internal behavior accordingly:
   • As of 2.4, tokens incorrectly identified as acronyms are corrected.
   • As of 2.9, StopFilter preserves position increments.
You might ask why this old, and in these cases incorrect, behavior is preserved at all, and
why it is the user’s responsibility to decide which is correct. Yet the answer isn’t as obvious
as expected. Since Lucene preserves backwards compatibility for indices created with
previous versions of the library, it also has to preserve compatibility with how those
indices have been built and how they are queried.
Changes to the runtime behavior of Analyzers, TokenFilters, and Tokenizers can
easily break backwards compatibility, trigger unexpected behavior, and cause a bad user
experience if queries return different documents than before.


The Version constant has been introduced to make the upgrade process easier for users
who cannot afford to rebuild their index or need to support “old” indices in production
environments. In such cases, it is recommended that you pass the Version constant you are
upgrading from to all constructors expecting a Version constant. Once the latest behavior is
desired, the version you are upgrading to should be used in favor of
Version#LUCENE_CURRENT, which has been deprecated in Lucene trunk due to the
dangers it could introduce in a subsequent upgrade.
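A brief sketch of how this looks in code (our example):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

// Keep the 2.4 analysis behavior while running the 2.9 library, so an
// existing index keeps matching the way it was built...
StandardAnalyzer legacy = new StandardAnalyzer(Version.LUCENE_24);
// ...and switch to the new behavior once the index has been rebuilt.
StandardAnalyzer current = new StandardAnalyzer(Version.LUCENE_29);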


Strategies for Upgrading to Lucene 2.9/3.0
In the main, a Lucene-based application will benefit from the improvements in 2.9, even as
its new features, such as numeric capabilities and the new TokenStream API, do require
code modifications and may require reindexing in order to take full advantage. That said,
compared to previous version changes, an upgrade to version 2.9 requires a more
sophisticated upgrade procedure.
True, there are many cases in which an upgrade won't require code changes, as changes
limited to “expert” APIs won't affect applications using only high-level functionality. All the
same, even if an application compiles against Lucene 2.9, it is likely that some of the changes
in runtime characteristics can introduce unexpected behaviors. In the sections below, we’ll
offer some brief suggestions for making the transition.


                                      An upgrade to version 3.0 requires a more
                                      sophisticated upgrade procedure. First,
                                      upgrade to 2.9; then remove the
                                      deprecation warnings. Only then should
                                      you upgrade to 3.0.x.

Should you move to 2.9 or 3.0? Whichever you do, first bear in mind that going to 3.0
requires a migration to 2.9 first; it is a prerequisite. Only once that 2.9 transition is
completed will you be ready to work through the deprecation warnings in order to move
on. Because 3.0 is a deprecation release, all code marked deprecated in Lucene 2.9 has been
removed. Some parts of the API might be modified in order to make use of Java generics,
but in general the upgrade from 2.9 to 3.0 should be as seamless as earlier upgrades have
been. Once you have replaced the usage of any deprecated API(s) in your code, you should
then be able to upgrade the next time simply by replacing the Lucene JAR file.
Upgrade to 2.9—Recommended Actions
At a minimum, if you plan an upgrade of your search application to Lucene 2.9, you should
recompile your application against the new version before the application is rolled out in a
production environment. The most critical issues will immediately surface as compile-time
errors once the new JAR is in the classpath.
For those of you using Lucene from a single location, for example, in the JRE's ext directory,
you should make sure that 2.9 is the only Lucene version accessible. In cases where an
application relies on extending Lucene in any particular way and the upgrade doesn't raise
a compile-time error, it is recommended that you add a test case that pins the extension's
behavior as executed against the older version of Lucene.
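
As an illustration, here is a minimal sketch of such a test; MyCustomFilter is a hypothetical
stand-in for your own extension, and the expected tokens would be recorded by running
the same input against the older Lucene version:

import java.io.StringReader;
import junit.framework.TestCase;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class MyCustomFilterTest extends TestCase {
  public void testBehaviorMatchesPreviousVersion() throws Exception {
    // MyCustomFilter is a hypothetical stand-in for your extension:
    TokenStream stream = new MyCustomFilter(
        new WhitespaceTokenizer(new StringReader("Some test Input")));
    TermAttribute term = stream.addAttribute(TermAttribute.class);
    // Token output recorded against the older Lucene version:
    String[] expected = { "some", "test", "input" };
    for (String token : expected) {
      assertTrue(stream.incrementToken());
      assertEquals(token, term.term());
    }
    assertFalse(stream.incrementToken());
  }
}

A regression test that pins an extension's token output across the upgrade.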
It is also extremely important that you back up and archive your index before opening it
with Lucene 2.9, as it makes changes to the index that may not be readable by previous
versions.
Again, we strongly recommend a careful reading of the CHANGES.txt file included in every
Lucene distribution, especially the sections on the back-compatibility policy and on changes
in runtime behavior. Careful study, followed by proper planning and testing, should prevent
surprises once the new Lucene 2.9-based application goes into production.
Upgrade to 2.9—Optional Actions
Lucene 2.9 includes many new features, none of which are required in order to use the new
release. Nevertheless, 2.9 marks numerous parts of the API as deprecated, since they are to
be removed in the next release. To prepare for that release and further improvements in
this direction, it is strongly recommended that you replace any deprecated API during the
upgrade process.
Applications using any kind of numeric search can improve their performance considerably
by replacing a custom solution with Lucene's numeric capabilities, described earlier in this
white paper.
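
For instance, the zero-padding workaround might be replaced along these lines (a minimal
sketch; the field name “price” and the values are illustrative):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.search.NumericRangeQuery;

Document doc = new Document();
// Indexed as trie-encoded precision prefixes rather than as a
// single zero-padded term:
doc.add(new NumericField("price").setDoubleValue(149.99));

// Type-safe range query replacing a padded-string RangeQuery:
NumericRangeQuery priceRange =
    NumericRangeQuery.newDoubleRange("price", 1.50, 1500.00, true, true);

Indexing and querying a numeric value with the numeric API described earlier.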
Last but not least, the new TokenStream API will replace the older API entirely in the next
release. Custom TokenStream, TokenFilter, and Tokenizer implementations
should be updated to the attribute-based API; the source distribution contains basic test
cases that can help you upgrade safely.
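
As an illustration of the attribute-based style, here is a minimal sketch of a filter written
against it, using a simple lowercasing filter purely as an example; logic that previously lived
in an overridden next(Token) method moves into incrementToken():

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public final class LowerCaseExampleFilter extends TokenFilter {
  private final TermAttribute termAtt;

  public LowerCaseExampleFilter(TokenStream input) {
    super(input);
    // Register for the attributes this filter manipulates:
    termAtt = addAttribute(TermAttribute.class);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false; // end of stream
    }
    final char[] buffer = termAtt.termBuffer();
    final int length = termAtt.termLength();
    for (int i = 0; i < length; i++) {
      buffer[i] = Character.toLowerCase(buffer[i]);
    }
    return true;
  }
}

A minimal TokenFilter on the attribute-based API; it mutates the shared TermAttribute in
place instead of creating new Token instances.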
Finally, to reiterate: you would do best to write newly added test cases against your current
Lucene version, and upgrade the tests and your code only once you have gained enough
confidence in the stability of the upgrade.
Migrating from Lucene to Solr?
Many Lucene implementations, of a variety of vintages, date back to a time when Solr
lacked core capabilities, so that all the underlying services needed around the Lucene
search library had to be built from scratch. Happily, Solr today offers a very complete
implementation of Lucene functionality (with some small but meaningful exceptions).
While a complete, robust approach to migrating from Lucene to Solr is beyond the scope of
this paper, here are some thoughts on the advantages of doing so. A slightly longer
comparison of Solr and Lucene is available in the Appendix to this document.
       Using Standards by Default
       As a server application running in Jetty, Tomcat, or any other servlet container, Solr
       installs, integrates, and runs easily in existing production environments. Solr’s
       RESTful APIs and its XML-driven implementation simplify configuration, operation,
       and search application development. With a rich array of client libraries, from the
       standard SolrJ for Java through JSON, Python, and many others, far fewer
       specialized programming skills are needed for the search application.
       Makes Lucene best practices ready to use
       From caching Filters, Queries, and Documents, through spell-checking, to warming
       Searchers in the background, Solr offers an enormous set of search features that
       would require a great deal of Lucene experience and expertise to implement on
       your own. Solr lets you benefit immediately from these low-level developments,
       simplifying the creation and development of your search environment.
       Reuse some of your Lucene libraries and indexes
       Because Lucene is at the core of Solr, your implementation can reuse many of the
       same libraries; just plug your handlers into Solr’s solrconfig.xml and your analyzers
       into its schema.xml, and start testing (see the sketch after this list). Likely as not,
       you’ll find you can move much of your code as-is from Lucene, or even use your
       existing indexes with Solr directly.
       Lower maintenance and revision costs
       As this transition from Lucene 2.4.1 to Lucene 2.9/3.0 demonstrates, many of the
       low-level, high-control advantages of implementing directly with Lucene are
       negated once anything changes in your environment. As a server, Solr insulates you
       from much of that, and helps remove the temptation to hard-code optimistically or
       to skip abstractions.
       Cloud-readiness
       As large data sets grow in scope and distribution, search services will necessarily
       rely on a much higher level of abstraction, covering not only data and I/O but also
       the more elastic distribution of search resources and operations, i.e., shards and
       inserts/updates. If your business is at a place where hardware might scale via a
       transition into some kind of cloud environment, you will benefit from the
       forthcoming Solr cloud-enabling capabilities, including distributed node
       management, relevancy calculations across multiple collections, and so on.
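
As a sketch of the reuse point above (com.example.MyAnalyzer is a hypothetical analyzer
class carried over from your Lucene application), wiring it into Solr’s schema.xml can be as
simple as:

<!-- Hypothetical schema.xml fragment reusing an existing Lucene analyzer -->
<fieldType name="text_custom" class="solr.TextField">
  <analyzer class="com.example.MyAnalyzer"/>
</fieldType>

Fields declared with this type are then analyzed by your existing code, with no changes to
the analyzer itself.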
It’s important to note that there are also good reasons not to migrate from Lucene to Solr,
whether they have to do with the cost of a new abstraction model in your overall
application implementation, or with having no real need to serve search over HTTP. With
the merger of the Lucene and Solr development projects, you won’t be shortchanged on
any of the underlying functionality. But at the same time, the stronger functional affinity
between the two means you’ll have to give careful thought to your long-term deployment
goals in order to pick the right one.
References
http://lucene.apache.org/java/2_9_0/index.html
http://lucene.apache.org/java/2_9_0/changes/Changes.html
http://lucene.apache.org/java/2_9_0/changes/Contrib-Changes.html
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Podcasts-and-Videos/Interview-Uwe-Schindler
http://wiki.apache.org/lucene-java/NearRealtimeSearch
http://wiki.apache.org/lucene-java/Payloads
http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/
http://wiki.apache.org/lucene-java/ConceptsAndDefinitions
http://wiki.apache.org/lucene-java/FlexibleIndexing
http://wiki.apache.org/lucene-java/Java_1.5_Migration
http://www.lucidimagination.com/How-We-Can-Help/webinar-Lucene-29
http://www.nsshutdown.com/projects/lucene/whitepaper/locallucene_v2.html
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Podcasts-and-Videos/Interview-Ryan-McKinley
http://ocw.kfupm.edu.sa/user062/ICS48201/NLLight%20Stemming%20for%20Arabic%20Information%20Retrieval.pdf
https://javacc.dev.java.net/
Next Steps
For more information on how Lucid Imagination can help search application developers,
employees, customers, and partners find the information they need, please visit
www.lucidimagination.com to access blog posts, articles, and reviews of dozens of
successful implementations.
Certified Distributions from Lucid Imagination are complete, supported bundles of
software that include additional bug fixes and performance enhancements, along with our
free 30-day Get Started program. Coupled with one of our support subscriptions, a
Certified Distribution can provide a complete environment to develop, deploy, and
maintain commercial-grade search applications. Certified Distributions are available at
www.lucidimagination.com/Downloads.
Please e-mail specific questions to:
Support and Service: support@lucidimagination.com
Sales and Commercial: sales@lucidimagination.com
Consulting: consulting@lucidimagination.com
Or call: 1.650.353.4057
APPENDIX: Choosing Lucene or Solr
The great improvements in the capabilities of Lucene and Solr open source search
technology have created rapidly growing interest in using them as alternatives to
proprietary search software. As is often the case with open source technology, online
community documentation provides rich details on features and variations, but does little
to provide explicit direction on which technology would be the best choice. So when is
Lucene preferable to Solr, and vice versa?
There is in fact no single answer, as Lucene and Solr bring very similar underlying
technology to bear on somewhat distinct problems. Solr is versatile and powerful, a full-
featured, production-ready search application server requiring little formal software
programming. Lucene presents a collection of directly callable Java libraries, with fine-
grained control of machine functions and independence from higher-level protocols.
In choosing which might be best for your search solution, the key questions to consider are
application scope, deployment environment, and software development preferences.
If you are new to developing search applications, you should start with Solr. Solr provides
scalable search power out of the box, whereas Lucene requires solid information retrieval
experience and some meaningful heavy lifting in Java to take advantage of its capabilities.
In many instances, Solr doesn’t even require any real programming.
Solr is essentially the “serverization” of Lucene, and many of its abstract functions are
highly similar, if not just the same. If you are building an app for the enterprise sector, for
instance, you will find Solr almost a 100% match to your business requirements: it comes
ready to run in a servlet container such as Tomcat or Jetty, and ready to scale in a
production Java environment. Its RESTful interfaces and XML-based configuration files can
greatly accelerate application development and maintenance. In fact, Lucene programmers
have often reported that they find Solr to contain “the same features I was going to build
myself as a framework for Lucene, but already very well implemented.” If you start with
Solr and find yourself using many of the features it provides out of the box, you will likely
be better off using Solr’s well-organized extension mechanisms rather than starting from
scratch with Apache Lucene.
If, on the other hand, you do not wish to make any calls via HTTP, and want all of your
resources controlled exclusively by Java API calls that you write, Lucene may be a better
choice. Lucene works best for constructing and embedding a state-of-the-art search engine,
allowing programmers to assemble and compile it inside a native Java application. Some
programmers set aside the convenience of Solr and choose Lucene in order to control its
large set of sophisticated features more directly through low-level access and data or state
manipulation, for example, byte-level manipulation of segments or intervention in data I/O.
Investment at this low level enables development of extremely sophisticated, cutting-edge
text search and retrieval capabilities.
As for features, the latest version of Solr generally encapsulates the latest version of
Lucene. As the two are in many ways functional siblings, spending time gaining a solid
understanding of how Lucene works internally can help you understand Apache Solr and
its extension of Lucene's workings.
No matter which you choose, the power of open source search is yours to harness. More
information on both Lucene and Solr can be found at www.lucidimagination.com.




Programmer’s Guide: What’s New in Lucene 2.9 / 3.0
A Lucid Imagination Technical White Paper • June 2010                              Page 32

More Related Content

Viewers also liked

Нестандартные методы интернет рекламы
Нестандартные методы интернет рекламыНестандартные методы интернет рекламы
Нестандартные методы интернет рекламыVladimir
 
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...Lucidworks (Archived)
 
基于成本代理模型的Ip长途网络成本仿真研究
基于成本代理模型的Ip长途网络成本仿真研究基于成本代理模型的Ip长途网络成本仿真研究
基于成本代理模型的Ip长途网络成本仿真研究sjm44
 
Center for Enterprise Innovation (CEI) Summary for HREDA, 9-25-14
Center for Enterprise Innovation (CEI) Summary for HREDA, 9-25-14Center for Enterprise Innovation (CEI) Summary for HREDA, 9-25-14
Center for Enterprise Innovation (CEI) Summary for HREDA, 9-25-14Marty Kaszubowski
 
The scene- I love you like a love song Selena Gomez
The scene- I love you like a love song Selena GomezThe scene- I love you like a love song Selena Gomez
The scene- I love you like a love song Selena Gomeztanica
 
情報科学演習 09
情報科学演習 09情報科学演習 09
情報科学演習 09libryukyu
 
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for BusinessSFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for BusinessLucidworks (Archived)
 
"A Study of I/O and Virtualization Performance with a Search Engine based on ...
"A Study of I/O and Virtualization Performance with a Search Engine based on ..."A Study of I/O and Virtualization Performance with a Search Engine based on ...
"A Study of I/O and Virtualization Performance with a Search Engine based on ...Lucidworks (Archived)
 
Integrating Advanced Text Analytics into Solr
Integrating Advanced Text Analytics into SolrIntegrating Advanced Text Analytics into Solr
Integrating Advanced Text Analytics into SolrLucidworks (Archived)
 
Network Forensics Puzzle Contest に挑戦 #2
Network Forensics Puzzle Contest に挑戦 #2Network Forensics Puzzle Contest に挑戦 #2
Network Forensics Puzzle Contest に挑戦 #2彰 村地
 
Understanding Lucene Search Performance
Understanding Lucene Search PerformanceUnderstanding Lucene Search Performance
Understanding Lucene Search PerformanceLucidworks (Archived)
 
Webテクノロジー@2012
Webテクノロジー@2012Webテクノロジー@2012
Webテクノロジー@2012彰 村地
 
Speed Up Web 2012
Speed Up Web 2012Speed Up Web 2012
Speed Up Web 2012彰 村地
 
Lucene rev preso bialecki solr crawlers-lr
Lucene rev preso bialecki solr crawlers-lrLucene rev preso bialecki solr crawlers-lr
Lucene rev preso bialecki solr crawlers-lrLucidworks (Archived)
 
Jazeed about Solr - People as A Search Problem
Jazeed about Solr - People as A Search ProblemJazeed about Solr - People as A Search Problem
Jazeed about Solr - People as A Search ProblemLucidworks (Archived)
 
Lady gaga
Lady gaga Lady gaga
Lady gaga tanica
 
Presentacion Ingles
Presentacion InglesPresentacion Ingles
Presentacion Inglestanica
 
Indexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrIndexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrLucidworks (Archived)
 

Viewers also liked (20)

Нестандартные методы интернет рекламы
Нестандартные методы интернет рекламыНестандартные методы интернет рекламы
Нестандартные методы интернет рекламы
 
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
 
基于成本代理模型的Ip长途网络成本仿真研究
基于成本代理模型的Ip长途网络成本仿真研究基于成本代理模型的Ip长途网络成本仿真研究
基于成本代理模型的Ip长途网络成本仿真研究
 
Center for Enterprise Innovation (CEI) Summary for HREDA, 9-25-14
Center for Enterprise Innovation (CEI) Summary for HREDA, 9-25-14Center for Enterprise Innovation (CEI) Summary for HREDA, 9-25-14
Center for Enterprise Innovation (CEI) Summary for HREDA, 9-25-14
 
The scene- I love you like a love song Selena Gomez
The scene- I love you like a love song Selena GomezThe scene- I love you like a love song Selena Gomez
The scene- I love you like a love song Selena Gomez
 
情報科学演習 09
情報科学演習 09情報科学演習 09
情報科学演習 09
 
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for BusinessSFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
 
"A Study of I/O and Virtualization Performance with a Search Engine based on ...
"A Study of I/O and Virtualization Performance with a Search Engine based on ..."A Study of I/O and Virtualization Performance with a Search Engine based on ...
"A Study of I/O and Virtualization Performance with a Search Engine based on ...
 
Integrating Advanced Text Analytics into Solr
Integrating Advanced Text Analytics into SolrIntegrating Advanced Text Analytics into Solr
Integrating Advanced Text Analytics into Solr
 
Network Forensics Puzzle Contest に挑戦 #2
Network Forensics Puzzle Contest に挑戦 #2Network Forensics Puzzle Contest に挑戦 #2
Network Forensics Puzzle Contest に挑戦 #2
 
Understanding Lucene Search Performance
Understanding Lucene Search PerformanceUnderstanding Lucene Search Performance
Understanding Lucene Search Performance
 
Webテクノロジー@2012
Webテクノロジー@2012Webテクノロジー@2012
Webテクノロジー@2012
 
Speed Up Web 2012
Speed Up Web 2012Speed Up Web 2012
Speed Up Web 2012
 
Lucene rev preso bialecki solr crawlers-lr
Lucene rev preso bialecki solr crawlers-lrLucene rev preso bialecki solr crawlers-lr
Lucene rev preso bialecki solr crawlers-lr
 
Jazeed about Solr - People as A Search Problem
Jazeed about Solr - People as A Search ProblemJazeed about Solr - People as A Search Problem
Jazeed about Solr - People as A Search Problem
 
Lady gaga
Lady gaga Lady gaga
Lady gaga
 
Presentacion Ingles
Presentacion InglesPresentacion Ingles
Presentacion Ingles
 
Web Design Course FETAC Level 5
Web Design Course FETAC Level 5 Web Design Course FETAC Level 5
Web Design Course FETAC Level 5
 
La Pensadora
La PensadoraLa Pensadora
La Pensadora
 
Indexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrIndexing Text and HTML Files with Solr
Indexing Text and HTML Files with Solr
 

Similar to What’s New in Apache Lucene 3.0

Guide to open suse 13.2 by mustafa rasheed abass & abdullah t. tua'ama (update)
Guide to open suse 13.2 by mustafa rasheed abass & abdullah t. tua'ama (update)Guide to open suse 13.2 by mustafa rasheed abass & abdullah t. tua'ama (update)
Guide to open suse 13.2 by mustafa rasheed abass & abdullah t. tua'ama (update)Mustafa AL-Timemmie
 
Cognos Analytics 11.2 New Features
Cognos Analytics 11.2 New FeaturesCognos Analytics 11.2 New Features
Cognos Analytics 11.2 New FeaturesSenturus
 
Top Alternatives To CentOS Linux Server Distributions For Programmers – 2022 ...
Top Alternatives To CentOS Linux Server Distributions For Programmers – 2022 ...Top Alternatives To CentOS Linux Server Distributions For Programmers – 2022 ...
Top Alternatives To CentOS Linux Server Distributions For Programmers – 2022 ...Real Estate
 
The latestopensourcesoftwareavailableandthelatestdevelopmentinict (1)
The latestopensourcesoftwareavailableandthelatestdevelopmentinict (1)The latestopensourcesoftwareavailableandthelatestdevelopmentinict (1)
The latestopensourcesoftwareavailableandthelatestdevelopmentinict (1)iffah_najwa46
 
Implementing the Auphonic Web Application Programming Interface
Implementing the Auphonic Web Application Programming InterfaceImplementing the Auphonic Web Application Programming Interface
Implementing the Auphonic Web Application Programming InterfaceEducational Technology
 
Enterprise connect and_office_editor_release_notes_10.3.1[1]
Enterprise connect and_office_editor_release_notes_10.3.1[1]Enterprise connect and_office_editor_release_notes_10.3.1[1]
Enterprise connect and_office_editor_release_notes_10.3.1[1]Manoharan Venkidusamy, ITIL-V3
 
Evolution of netflix conductor
Evolution of netflix conductorEvolution of netflix conductor
Evolution of netflix conductorvedu12
 
How to Gain Greater Business Intelligence from Lucene/Solr
How to Gain Greater Business Intelligence from Lucene/SolrHow to Gain Greater Business Intelligence from Lucene/Solr
How to Gain Greater Business Intelligence from Lucene/Solrlucenerevolution
 
Developing a database server: software engineer's view
Developing a database server: software engineer's viewDeveloping a database server: software engineer's view
Developing a database server: software engineer's viewLaurynas Biveinis
 
Software Project Management: Release Notes
Software Project Management: Release NotesSoftware Project Management: Release Notes
Software Project Management: Release NotesMinhas Kamal
 
C:\Documents And Settings\User\桌面\Installation Guide O Oo3
C:\Documents And Settings\User\桌面\Installation Guide O Oo3C:\Documents And Settings\User\桌面\Installation Guide O Oo3
C:\Documents And Settings\User\桌面\Installation Guide O Oo3Shilong Sang
 
CANONICAL UBUNTU MANAGEMENT TOOL GETS HEFTY UPGRADE, MICRON ASSOCIATES
CANONICAL UBUNTU MANAGEMENT TOOL GETS HEFTY UPGRADE, MICRON ASSOCIATESCANONICAL UBUNTU MANAGEMENT TOOL GETS HEFTY UPGRADE, MICRON ASSOCIATES
CANONICAL UBUNTU MANAGEMENT TOOL GETS HEFTY UPGRADE, MICRON ASSOCIATESamorlawrenz
 
B-Translator as a Software Engineering Project
B-Translator as a Software Engineering ProjectB-Translator as a Software Engineering Project
B-Translator as a Software Engineering ProjectDashamir Hoxha
 
Sakai 2.6 Overview
Sakai 2.6 OverviewSakai 2.6 Overview
Sakai 2.6 OverviewAuSakai
 

Similar to What’s New in Apache Lucene 3.0 (20)

What’s New in Apache Lucene 2.9
What’s New in Apache Lucene 2.9What’s New in Apache Lucene 2.9
What’s New in Apache Lucene 2.9
 
What’s New in Apache Lucene 2.9
What’s New in Apache Lucene 2.9What’s New in Apache Lucene 2.9
What’s New in Apache Lucene 2.9
 
What’s New in Solr 1.4
What’s New in Solr 1.4What’s New in Solr 1.4
What’s New in Solr 1.4
 
Overview of Searching in Solr 1.4
Overview of Searching in Solr 1.4Overview of Searching in Solr 1.4
Overview of Searching in Solr 1.4
 
Guide to open suse 13.2 by mustafa rasheed abass & abdullah t. tua'ama (update)
Guide to open suse 13.2 by mustafa rasheed abass & abdullah t. tua'ama (update)Guide to open suse 13.2 by mustafa rasheed abass & abdullah t. tua'ama (update)
Guide to open suse 13.2 by mustafa rasheed abass & abdullah t. tua'ama (update)
 
OpenOffice.org 2.x and Beyond
OpenOffice.org 2.x and BeyondOpenOffice.org 2.x and Beyond
OpenOffice.org 2.x and Beyond
 
Cognos Analytics 11.2 New Features
Cognos Analytics 11.2 New FeaturesCognos Analytics 11.2 New Features
Cognos Analytics 11.2 New Features
 
Top Alternatives To CentOS Linux Server Distributions For Programmers – 2022 ...
Top Alternatives To CentOS Linux Server Distributions For Programmers – 2022 ...Top Alternatives To CentOS Linux Server Distributions For Programmers – 2022 ...
Top Alternatives To CentOS Linux Server Distributions For Programmers – 2022 ...
 
java new technology
java new technologyjava new technology
java new technology
 
The latestopensourcesoftwareavailableandthelatestdevelopmentinict (1)
The latestopensourcesoftwareavailableandthelatestdevelopmentinict (1)The latestopensourcesoftwareavailableandthelatestdevelopmentinict (1)
The latestopensourcesoftwareavailableandthelatestdevelopmentinict (1)
 
Implementing the Auphonic Web Application Programming Interface
Implementing the Auphonic Web Application Programming InterfaceImplementing the Auphonic Web Application Programming Interface
Implementing the Auphonic Web Application Programming Interface
 
Enterprise connect and_office_editor_release_notes_10.3.1[1]
Enterprise connect and_office_editor_release_notes_10.3.1[1]Enterprise connect and_office_editor_release_notes_10.3.1[1]
Enterprise connect and_office_editor_release_notes_10.3.1[1]
 
Evolution of netflix conductor
Evolution of netflix conductorEvolution of netflix conductor
Evolution of netflix conductor
 
How to Gain Greater Business Intelligence from Lucene/Solr
How to Gain Greater Business Intelligence from Lucene/SolrHow to Gain Greater Business Intelligence from Lucene/Solr
How to Gain Greater Business Intelligence from Lucene/Solr
 
Developing a database server: software engineer's view
Developing a database server: software engineer's viewDeveloping a database server: software engineer's view
Developing a database server: software engineer's view
 
Software Project Management: Release Notes
Software Project Management: Release NotesSoftware Project Management: Release Notes
Software Project Management: Release Notes
 
C:\Documents And Settings\User\桌面\Installation Guide O Oo3
C:\Documents And Settings\User\桌面\Installation Guide O Oo3C:\Documents And Settings\User\桌面\Installation Guide O Oo3
C:\Documents And Settings\User\桌面\Installation Guide O Oo3
 
CANONICAL UBUNTU MANAGEMENT TOOL GETS HEFTY UPGRADE, MICRON ASSOCIATES
CANONICAL UBUNTU MANAGEMENT TOOL GETS HEFTY UPGRADE, MICRON ASSOCIATESCANONICAL UBUNTU MANAGEMENT TOOL GETS HEFTY UPGRADE, MICRON ASSOCIATES
CANONICAL UBUNTU MANAGEMENT TOOL GETS HEFTY UPGRADE, MICRON ASSOCIATES
 
B-Translator as a Software Engineering Project
B-Translator as a Software Engineering ProjectB-Translator as a Software Engineering Project
B-Translator as a Software Engineering Project
 
Sakai 2.6 Overview
Sakai 2.6 OverviewSakai 2.6 Overview
Sakai 2.6 Overview
 

More from Lucidworks (Archived)

SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
 SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and SolrLucidworks (Archived)
 
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceSFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceLucidworks (Archived)
 
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search EngineChicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search EngineLucidworks (Archived)
 
Chicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with SearchChicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with SearchLucidworks (Archived)
 
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache Solr
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache SolrMinneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache Solr
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache SolrLucidworks (Archived)
 
Minneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com SearchMinneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com SearchLucidworks (Archived)
 
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Lucidworks (Archived)
 
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...Lucidworks (Archived)
 
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...Lucidworks (Archived)
 
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DCBig Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DCLucidworks (Archived)
 
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCWhat's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCLucidworks (Archived)
 
Solr At AOL, Presented by Sean Timm at SolrExchage DC
Solr At AOL, Presented by Sean Timm at SolrExchage DCSolr At AOL, Presented by Sean Timm at SolrExchage DC
Solr At AOL, Presented by Sean Timm at SolrExchage DCLucidworks (Archived)
 
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DCIntro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DCLucidworks (Archived)
 
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DCTest Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DCLucidworks (Archived)
 
Building a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLKBuilding a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLKLucidworks (Archived)
 
Introducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinarIntroducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinarLucidworks (Archived)
 

More from Lucidworks (Archived) (20)

Integrating Hadoop & Solr
Integrating Hadoop & SolrIntegrating Hadoop & Solr
Integrating Hadoop & Solr
 
The Data-Driven Paradigm
The Data-Driven ParadigmThe Data-Driven Paradigm
The Data-Driven Paradigm
 
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
 SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
 
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceSFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
 
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search EngineChicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
 
Chicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with SearchChicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with Search
 
What's new in solr june 2014
What's new in solr june 2014What's new in solr june 2014
What's new in solr june 2014
 
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache Solr
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache SolrMinneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache Solr
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache Solr
 
Minneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com SearchMinneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com Search
 
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
 
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...
 
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
 
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DCBig Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
 
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCWhat's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
 
Solr At AOL, Presented by Sean Timm at SolrExchage DC
Solr At AOL, Presented by Sean Timm at SolrExchage DCSolr At AOL, Presented by Sean Timm at SolrExchage DC
Solr At AOL, Presented by Sean Timm at SolrExchage DC
 
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DCIntro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
 
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DCTest Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
 
Building a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLKBuilding a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLK
 
Introducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinarIntroducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinar
 
Solr4 nosql search_server_2013
Solr4 nosql search_server_2013Solr4 nosql search_server_2013
Solr4 nosql search_server_2013
 

Recently uploaded

Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 

Recently uploaded (20)

Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 

What’s New in Apache Lucene 3.0

  • 1. Programmer’s Programmer Guide to Open Source Search Search: What’s New in Apache Lucene 3.0 A Lucid Imagination Technical White Paper
  • 2. © 2010 by Lucid Imagination, Inc. under the terms of Creative Commons license, as detailed at http://www.lucidimagination.com/Copyrights-and-Disclaimers/. Version 1.02, published 6 June 2010. Solr, Lucene, Apachecon and their logos are trademarks of the Apache Software Foundation. Programmer’s Guide: What’s New in Lucene 2.9 / 3.0 A Lucid Imagination Technical White Paper • June 2010 Page i
  • 3. Abstract Apache Lucene is a high-performance, cross-platform, full-featured Information Retrieval library in open source, suitable for nearly every application that requires full-text search features. Since its introduction nearly 10 years ago, Apache Lucene has become a competitive player for developing extensible, high-performance full-text search solutions. The experience accumulated over time by the community of Lucene committers and contributors and the innovations they have engineered have delivered significant ongoing advances in Lucene’s capabilities. This white paper describes the new features and improvements in the latest versions, Apache Lucene 2.9 and 3.0. It is intended mainly for programmers familiar with the broad base of Lucene’s capabilities, though those new to Lucene should also find it a useful exploration of the newest features. Key topics such as how to upgrade from 2.9 to 3.0, as well as considerations for migrating from Lucene to Solr, are also addressed. In the simplest terms, Lucene is now faster and more flexible than before. Historic weak points have been improved to open the way for innovative new features like near-real-time search, flexible indexing, and high-performance numerical range queries. Many new features have been added, new APIs introduced, and critical bugs have been fixed—all with the same goal: improving Lucene’s state-of-the-art search capabilities. Programmer’s Guide: What’s New in Lucene 2.9 / 3.0 A Lucid Imagination Technical White Paper • June 2010 Page ii
  • 4. Table of Contents Introduction ............................................................................................................................................................ 1 Core Features and Improvements .................................................................................................................. 4 Numeric Capabilities and Numeric Range Queries .............................................................................. 4 New TokenStream API .................................................................................................................................... 8 Per-Segment Search ...................................................................................................................................... 12 Near Realtime Search (NRS) ...................................................................................................................... 14 MultiTermQuery-Related Improvements ............................................................................................. 15 Payloads ............................................................................................................................................................. 16 Additions to Lucene Contrib .......................................................................................................................... 18 New Contrib Analyzers ................................................................................................................................ 18 Lucene Spatial (formerly known as LocalLucene) ............................................................................ 18 Lucene Remote and Java RMI .................................................................................................................... 20 New Flexible QueryParser .......................................................................................................................... 20 Minor Changes and Improvements in Lucene 2.9 ............................................................................. 21 Changes and Improvements in Lucene 3.0 .......................................................................................... 23 Lucene Version by Version Compatibility since 2.9 ......................................................................... 24 Strategies for Upgrading to Lucene 2.9 / 3.0........................................................................................... 25 Upgrade to 2.9—Recommended Actions .............................................................................................. 26 Upgrade to 2.9—Optional Actions ........................................................................................................... 26 Migrating from Lucene to Solr? .................................................................................................................... 27 References ............................................................................................................................................................ 29 Next Steps ............................................................................................................................................................. 30 APPENDIX: Choosing Lucene or Solr .......................................................................................................... 31 Programmer’s Guide: What’s New in Lucene 2.9 / 3.0 A Lucid Imagination Technical White Paper • June 2010 Page iii
  • 5. Introduction Apache Lucene is a high-performance, cross-platform, full-featured Information Retrieval library, in open source, suitable for nearly every application that requires full-text search features. Lucene currently ranks among the top 15 open source projects and is one of the top 5 Apache projects, with installations at over 4,000 companies. Downloads of Lucene, and its server implementation Solr, have grown nearly tenfold over the past three years; Solr is the fastest-growing Lucene subproject. Lucene and Solr offer an attractive alternative to proprietary licensed search and discovery software vendors.1 With the release of versions 2.9 and 3.0 (September and November 2009), the Apache Lucene community delivered the latest upgrades of Lucene. This white paper aims to address key issues for you if you have an Apache Lucene-based application, and need to upgrade existing code to work well with these latest versions, so that you may take advantage of the various improvements and prepare for future releases and application maintainability. If you do not have a Lucene application, the paper should also give you a good overview of the innovations in this release. Unlike the previous 2.4.1 release (March 2009), Lucene 2.9 and 3.0 go well beyond just a bug-fix release. They introduce multiple performance improvements, new features, better runtime behavior, API changes, and bug-fixes at a variety of levels. Importantly, 2.9 deprecates a number of legacy interfaces, and 3.0 is in the main a reimplemented version of 2.9, but without those deprecated interfaces. The 2.9 release improves Lucene in several key aspects, which make it an even more compelling alternative to other solutions. Most notably: Improvements for Near-Realtime Search capabilities make documents searchable almost instantaneously. A new, straightforward API for handling Numeric Ranges both simplifies development and virtually wipes out performance overhead. Analysis API has been replaced for more streamlined, flexible text handling. 1 See the Appendix for a discussion of when to choose Lucene or Solr. Programmer’s Guide: What’s New in Lucene 2.9 / 3.0 A Lucid Imagination Technical White Paper • June 2010 Page 1
  • 6. And, behind the scenes, the groundwork has been laid for yet more indexing flexibility in future releases. Lucene Contrib also adds new utility packages, introduced with this release: An extremely flexible query parser framework opens new possibilities for programmers to more easily create their own query parsing syntax. Local-Lucene and its geo-search capabilities, now donated to Apache, provide this near-mandatory functionality for state-of-the-art search. Various contributions have markedly improved support for languages like Arabic, Persian, and Chinese. Version 3.0 is again a cleanup release and considered feature equivalent to its predecessor. 3.0 is the first Apache Lucene release requiring Java 5 at runtime, enabling Lucene to make use of new language features such as Generics Enumerations, Variable Arguments, along with Java 5’s concurrent utilities. 2.9 release improves Lucene in several key aspects and 2.9 deprecates a number of legacy interfaces. 3.0 is in the main a reimplemented version of 2.9, but without those deprecated interfaces. While the majority of programmers are already running on either version 1.5 or 1.6 platforms (1.6 is the recommended JVM), Java 1.4 reached its end of service life in October 2008. With the new major Lucene 3.0 release, all legacy issues marked as deprecated have now been removed, enforcing their replacement. Some important notes on compatibility: because previous minor releases also contained performance improvements and bug fixes, programmers have been accustomed to upgrading to a new Lucene version just by replacing the JAR file in their classpath. And, in those past cases, Lucene-based apps could be upgraded flawlessly without recompiling the software components accessing or extending Apache Lucene. However, this may not be so with Lucene 2.9/3.0. Programmer’s Guide: What’s New in Lucene 2.9 / 3.0 A Lucid Imagination Technical White Paper • June 2010 Page 2
  • 7. Lucene 2.9 introduces several back-compatibility-breaking changes that may well require back breaking changes in your code that uses the library. A drop-in library replacement is not guaranteed to be successful; at a minimum, it is not likely to be flawless. As a result, we recommend that if you are upgrading from a previous Lucene release, you should at least recompile any software components directly accessing or extending the library. In the latter case, ftware recompilation alone will most likely not be sufficient. More details on these dependencies are discussed in the “Upgrading Lucene” section of the paper. We’ve also noted any significant compatibility issues, labeling them ignificant with this flag: This document is not intended to be a comprehensive overview of all functions of Lucene 2.9/3.0, but rather of new key features and capabilities. Always check the Lucid Imagination Certified distribution (www.lucidimagination.com/downloads and the official www.lucidimagination.com/downloads) Lucene Website (lucene.apache.org for the most up-to-date release information. lucene.apache.org) date Programmer’s Guide: What’s New in Lucene 2.9 / 3.0 A Lucid Imagination Technical White Paper • June 2010 Page 3
Core Features and Improvements

Numeric Capabilities and Numeric Range Queries

One of Apache Lucene's basic properties is its representation of internal searchable values (terms) as UTF-8 encoded characters. Every value passed to Lucene must be converted into a string in order to be searchable. At the same time, Lucene is frequently applied to search numeric values and ranges, such as prices, dates, or other numeric field attributes. Historically, searching over numeric ranges has been a weak point of the library. However, the 2.9 release comes with a tremendous improvement for searching numeric values, especially for range queries.

Prior to Lucene 2.9, numeric values were encoded with leading zeros, essentially as a full-precision value. Values stored with full precision ended up creating many unique terms in the index. Thus, if you needed to retrieve all documents in a certain range (e.g., from $1.50 to $1500.00), Lucene had to iterate through a lot of terms whenever many documents with unique values were indexed. Consequently, execution of queries with large ranges and lots of unique terms could be extremely slow as a result of this overhead. Many workaround techniques have evolved over the years to improve the performance of ranges, such as encoding date ranges in multiple fields with separate fields for year, month, and day. But at the end of the day, every programmer had to roll his or her own way of searching ranges efficiently.

In Lucene 2.9, NumericUtils and its relatives (NumericRangeQuery / NumericRangeFilter) introduce native numeric encoding and search capabilities. Numeric Java primitives (long, int, float, and double) are transformed into prefix-encoded representations with increasing precision. Internally, each prefix precision is generated by stripping off the least significant bits indicated by the precisionStep. Each value is subsequently converted to a sequence of 7-bit ASCII characters (due to the UTF-8 term encoding in the index, 8 or more bits would split into two or more bytes), resulting in a predictable number of prefix terms that can be calculated ahead of time. The figure below illustrates such a Prefix Tree.
Example of a Prefix Tree, where the leaves of the tree hold the actual term values and all the descendants of a node have a common prefix associated with the node. Bold circles mark all relevant nodes to retrieve a range from 215 to 977.

The generated terms are indexed just like any other string values passed to Lucene. Under the hood, Lucene associates distinct terms with all documents containing the term, so that all documents containing a numeric value with the same prefix are "grouped" together, meaning the number of terms that need to be searched is reduced tremendously. This stands in contrast to the less efficient encoding scheme in previous releases, where each unique numeric value was indexed as a distinct full-precision term, so that query time grew with the number of unique terms in the index.
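To make the prefix encoding more concrete, the fragment below (a minimal sketch, not taken from the Lucene distribution) uses NumericUtils to print the prefix-coded terms generated for a single int value; the number of terms per value depends on the chosen precisionStep.

int value = 977;
int precisionStep = 4; // the same precision used in the listings below
// one full-precision term (shift = 0) plus one term per stripped 4-bit block
for (int shift = 0; shift < 32; shift += precisionStep) {
    // each shift strips `shift` least significant bits off the value
    String prefixCoded = NumericUtils.intToPrefixCoded(value, shift);
    System.out.println("shift=" + shift + " term=" + prefixCoded);
}

Printing the prefix-coded terms NumericUtils generates for a single value (sketch)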
Directory directory = new RAMDirectory();
Analyzer analyzer = new WhitespaceAnalyzer();
IndexWriter writer = new IndexWriter(directory, analyzer,
    IndexWriter.MaxFieldLength.UNLIMITED);
for (int i = 0; i < 20000; i++) {
    Document doc = new Document();
    doc.add(new Field("id", String.valueOf(i), Field.Store.YES,
        Field.Index.NOT_ANALYZED_NO_NORMS));
    String num = Integer.toString(i);
    String paddedValue = "00000".substring(0, 5 - num.length()) + num;
    doc.add(new Field("oldNumeric", paddedValue, Field.Store.YES,
        Field.Index.NOT_ANALYZED_NO_NORMS));
    writer.addDocument(doc);
}
writer.close();

Indexing a zero-padded numeric value for use with an ordinary RangeQuery

You can also use the native encoding of numeric values beyond range searches. Numeric fields can be loaded in the internal FieldCache, where they are used for sorting. Zero-padding of numeric primitives (see code example above) is no longer needed, as the trie encoding guarantees the correct ordering without requiring execution overhead or extra coding. The code listing below instead uses the new NumericField to index a numeric Java primitive using 4-bit precision. Like the straightforward NumericField, querying numeric ranges also provides a type-safe API. NumericRangeQuery instances are created using one of the provided static constructors for the corresponding Java primitive.
Directory directory = new RAMDirectory();
Analyzer analyzer = new WhitespaceAnalyzer();
IndexWriter writer = new IndexWriter(directory, analyzer,
    IndexWriter.MaxFieldLength.UNLIMITED);
for (int i = 0; i < 20000; i++) {
    Document doc = new Document();
    doc.add(new Field("id", String.valueOf(i), Field.Store.YES,
        Field.Index.NOT_ANALYZED_NO_NORMS));
    doc.add(new NumericField("newNumeric", 4, Field.Store.YES, true)
        .setIntValue(i));
    writer.addDocument(doc);
}
writer.close();

Indexing numeric values with the new NumericField type

The example below shows a numeric range query using an int primitive with the same precision used in the indexing example. If different precision values are used at index and search time, numeric queries can yield unexpected behavior.
IndexSearcher searcher = new IndexSearcher(directory, true);
Query query = NumericRangeQuery.newIntRange("newNumeric", 4,
    10, 10000, true, false);
TopDocs docs = searcher.search(query, null, 10);
assertNotNull("Docs is null", docs);
assertEquals(9990, docs.totalHits);
for (int i = 0; i < docs.scoreDocs.length; i++) {
    ScoreDoc sd = docs.scoreDocs[i];
    assertTrue(sd.doc >= 10 && sd.doc < 10000);
}

Searching numeric values with the new NumericRangeQuery

Improvements resulting from the new Lucene numeric capabilities are equally significant in versatility and performance. Now, Lucene can cover almost every use-case related to numeric values. Moreover, everything from range searches or sorting on float or double values up to fast date searches (dates converted to time stamps) will execute in less than 100 milliseconds in most cases. By comparison, the old approach using padded full-precision values could take up to 30 seconds or more depending on the underlying index.
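For example (a hedged sketch, not from the paper; the field name and the one-hour window are invented for illustration), a date can be indexed as a long timestamp with NumericField and then searched with the corresponding type-safe factory method:

Document doc = new Document();
// store the date as a long timestamp, using the same 4-bit precision as above
doc.add(new NumericField("created", 4, Field.Store.YES, true)
    .setLongValue(new Date().getTime()));
writer.addDocument(doc);
...
// later, match all documents created within the last hour (bounds inclusive)
long now = System.currentTimeMillis();
Query query = NumericRangeQuery.newLongRange("created", 4,
    now - 3600000L, now, true, true);

Indexing and querying a date as a trie-encoded long timestamp (sketch)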
New TokenStream API

Almost every programmer who has extended Lucene has worked with its analysis function. Text analysis is common to almost every use-case, and is among the best known Lucene APIs. Since its early days, Lucene has used a "Decorator Pattern" to provide a pluggable and flexible analysis API, allowing a combination of existing and customized analysis implementations. The central analysis class TokenStream enumerates a sequence of tokens from either a document's fields or from a query. Commonly, multiple TokenStream instances are chained, each applying a separate analysis step to text terms represented by a Token class that encodes all relevant information about a term.

Prior to Lucene 2.9, TokenStream operated exclusively on Token instances transporting term information through the analysis chain. With this release, the token-based API has been marked as deprecated. It is completely replaced by an attribute-based API.

Here's how it has changed. Rather than receiving a Token instance from one of the two TokenStream.next() methods, the new API follows a stateful approach instead. To advance in the stream, consumers call TokenStream.incrementToken(), which returns a Boolean result indicating whether the end of the stream has been reached. Information gathered during the analysis process is encoded in attributes accessible via the new TokenStream base class AttributeSource. In contrast to the older Token class, the Attribute-based approach separates specific term characteristics from others not necessarily related. Each TokenStream adds the attributes it is specifically targeting at construction time (see code listing below) and keeps references to them throughout its lifetime. This provides type-safe access to all attributes relevant for a particular TokenStream instance.

protected CharReplacementTokenStream(TokenStream input) {
    super(input);
    termAtt = (TermAttribute) addAttribute(TermAttribute.class);
}

Adding a TermAttribute at construction time

Inside TokenStream.incrementToken(), a token stream only operates on attributes that have been declared in the constructor. For instance, if you have Lucene replacing a character like a German umlaut in a term, only the TermAttribute (declared at construction time in the code listing above) is used. (Other attributes like PositionIncrementAttribute or PayloadAttribute are ignored by this TokenStream, as they might not be needed in this particular use-case.)
public boolean incrementToken() throws IOException {
    if (input.incrementToken()) {
        final char[] termBuffer = termAtt.termBuffer();
        final int termLength = termAtt.termLength();
        if (replaceChar(termBuffer, termLength)) {
            termAtt.setTermBuffer(output, 0, outputPos);
        }
        return true;
    }
    return false;
}

Replacing characters using the new attribute-based API

What the above example does not demonstrate is the full power of the new token API. There, we replaced one or more characters in the token and discarded the original one. Yet, in many use-cases, the original token should be preserved in addition to the modified one. Using the old API required a fair bit of work and logic to handle such a common use-case. In contrast, the new attribute-based approach allows capture and restoration of the state of attributes, which makes such use-cases almost trivial. The example below shows a version of the previous example improved for Lucene 2.9/3.0, in which the original term attribute is restored once the stream is advanced.
public boolean incrementToken() throws IOException {
    if (state != null) {
        restoreState(state);
        state = null;
        return true;
    }
    if (input.incrementToken()) {
        final char[] termBuffer = termAtt.termBuffer();
        final int termLength = termAtt.termLength();
        if (replaceChar(termBuffer, termLength)) {
            state = captureState();
            termAtt.setTermBuffer(output, 0, outputPos);
        }
        return true;
    }
    return false;
}

Replacing characters and additionally emitting the original term text using the new attribute-based API (position increments are omitted)

The separation of attributes makes it possible to add arbitrary properties to the analysis chain without using a customized Token class. Attributes are then made type-safely accessible by all subsequent TokenStream instances, and can eventually be used by the consumer. This way, you get a generic way to add various kinds of custom information, such as part-of-speech tags, payloads, or average document length, to the token stream; a sketch of such a custom attribute follows below. Unfortunately, Lucene 2.9 and 3.0 don't yet provide functionality to persist a custom Attribute implementation to the underlying index. This improvement, part of what is often referred to as "flexible indexing," is under active development and is proposed for one of the upcoming Lucene releases.
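As an illustration of that extension point, here is a hedged sketch (the names are invented, not part of the Lucene API) of a custom part-of-speech attribute. By convention, the default attribute factory locates the implementation class by appending "Impl" to the interface name.

import org.apache.lucene.util.Attribute;
import org.apache.lucene.util.AttributeImpl;

// the attribute contract, visible to every stream and consumer in the chain
public interface PartOfSpeechAttribute extends Attribute {
    void setPartOfSpeech(String pos);
    String getPartOfSpeech();
}

// the implementation class, instantiated by the default attribute factory
public class PartOfSpeechAttributeImpl extends AttributeImpl
        implements PartOfSpeechAttribute {
    private String pos;
    public void setPartOfSpeech(String pos) { this.pos = pos; }
    public String getPartOfSpeech() { return pos; }
    public void clear() { pos = null; }
    public void copyTo(AttributeImpl target) {
        ((PartOfSpeechAttribute) target).setPartOfSpeech(pos);
    }
    public boolean equals(Object other) {
        return other instanceof PartOfSpeechAttributeImpl
            && equalStrings(((PartOfSpeechAttributeImpl) other).pos, pos);
    }
    private static boolean equalStrings(String a, String b) {
        return a == null ? b == null : a.equals(b);
    }
    public int hashCode() { return pos == null ? 0 : pos.hashCode(); }
}

A hypothetical custom attribute carrying part-of-speech tags through the analysis chain (sketch)

A tagging TokenFilter would declare it exactly like the TermAttribute above, via (PartOfSpeechAttribute) addAttribute(PartOfSpeechAttribute.class), and set it for each token inside incrementToken().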
Beyond the generalizability of this API, one of its most significant improvements is its effective reuse of Attribute instances across multiple iterations of analysis. Attribute implementations are created during TokenStream instantiation and are reused each time the stream advances to a successive increment. Even if a stream is used for another analysis, the same Attribute instances may be used, provided the stream is reusable. This greatly reduces the rate of object creation, streamlining execution and minimizing any required garbage collection.

While Lucene 2.9 provides full back-compatibility for old-style TokenStream implementations, it is strongly recommended to update any existing custom TokenStream implementations to exclusively use incrementToken() instead of one of the overhead-heavy next() methods. Lucene 3.0 removed this compatibility layer and enforces the new attribute-based API. If you are trying to update your custom TokenStream or one of its subclass (TokenFilter and Tokenizer) implementations, it is recommended that you use the abstract BaseTokenStreamTestCase class, which provides various utility functions for testing against the new and old API. The test case is freely available for download in the source distribution of Apache Lucene 2.9/3.0.

Per-Segment Search

Since the early days of Apache Lucene, documents have been stored at the lowest level in a segment—a small but entirely independent index. On the highest abstraction level, Lucene combines segments into one large index and executes searches across all visible segments. As more and more documents are added to an index, Lucene buffers your documents in RAM and flushes them to disk periodically. Depending on a variety of factors, Lucene either incrementally adds documents to an existing segment, or creates entirely new segments. To reduce the negative impact of an increasing number of segments on search performance, Lucene tries to combine/merge multiple segments into larger ones. For optimal search performance, Lucene can optimize an index, which essentially merges all existing segments into a single segment.

Prior to Lucene 2.9, search logic resided at the highest abstraction level, accessing a single IndexReader no matter how many segments the index was composed of. Similarly, the FieldCache was associated with the top-level IndexReader, and then had to be invalidated each time an index was reopened. With Lucene 2.9, the search logic and the FieldCache have moved to a per-segment level. While this has introduced a little more internal complexity, the benefit of the tradeoff is a new per-segment index behavior that yields a rich variety of performance improvements for unoptimized indexes.
In most applications, existing segments rarely change internally, and this property had not been effectively utilized in previous versions of Lucene. IndexReader.reopen(), first added in Lucene 2.4, now has the ability to add new or changed segments to an already existing top-level IndexReader instead of reloading all existing segments. The FieldCache also takes advantage of rarely changing segments: cache entries for unchanged segments can remain in memory, while only entries for new or updated segments need to be rebuilt, instead of invalidating the FieldCache entirely. Depending on the number of changed index segments, this can heavily reduce I/O as well as garbage collection costs, compared to reopening the entire index. Previous versions of Lucene also suffered from long warming times for sorting and function queries. Those use-cases have been improved, as the warm-up of reopened searchers is now much faster.

It's worth mentioning that Per-Segment Search doesn't yield improvements in all situations. If an IndexReader is opened on an optimized index, all pre-existing segments are merged into a single one, which then loads in its entirety. In other situations, perhaps more common, where some changes have been committed to the index and a new IndexReader instance is obtained by calling IndexReader.reopen() on a previously opened reader, the new per-segment capabilities can dramatically speed up reopening. But in this case, opening a new IndexReader using one of the overloaded static IndexReader.open() methods will create an entirely new reader instance and therefore can't take advantage of any per-segment capabilities.
IndexReader reader = indexWriter.getReader();
...
IndexReader newReader = reader.reopen();
if (reader != newReader) {
    reader.close();
    reader = newReader;
}

Obtaining and reopening a Near-Realtime Reader from an IndexWriter instance

The majority of Lucene users won't touch the changes related to Per-Segment Search during their day-to-day business unless they are working on low-level code implementing Filters or custom Collector classes. Both classes directly expose the per-segment model, for example through Collector#setNextReader(), which is called once for each segment during search. The Filter API, by contrast, doesn't immediately reveal its relation to per-segment search and has caused lots of confusion in the past. Filter#getDocIdSet(IndexReader) and its deprecated relative Filter#bits(IndexReader) are also called once per segment instead of once per index. The document IDs set by the Filter must be relative to the current segment rather than absolute.
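To make that contract concrete, here is a hedged sketch (not from the paper) of a minimal Collector that gathers absolute document IDs; note how the docBase passed to setNextReader() re-bases the per-segment IDs received by collect().

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

public class DocIdCollector extends Collector {
    private final List<Integer> docIds = new ArrayList<Integer>();
    private int docBase;

    public void setScorer(Scorer scorer) {
        // scores are not needed here, so the scorer is ignored entirely
    }
    public void setNextReader(IndexReader reader, int docBase) {
        // called once per segment; remember the segment's document offset
        this.docBase = docBase;
    }
    public void collect(int doc) {
        // doc is relative to the current segment; re-base it to the index
        docIds.add(Integer.valueOf(docBase + doc));
    }
    public boolean acceptsDocsOutOfOrder() {
        return true; // this collector imposes no ordering requirements
    }
    public List<Integer> getDocIds() { return docIds; }
}

A minimal per-segment Collector (sketch); it would be passed to IndexSearcher#search(Query, Collector)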
Near Realtime Search (NRS)

More and more, Lucene programmers are pursuing real-time or near-real-time requirements with their search applications. Previous Lucene versions did a decent job with the incremental changes characteristic of this scenario, capturing those changes and making them available for searching. Lucene 2.9 adds significant new capabilities for addressing the requirements of high-change document environments.

First of all, the IndexWriter—in general responsible for modifying the underlying index and flushing documents to disk—now offers a way to obtain an IndexReader instance directly from the writer. The newly obtained reader then not only reflects the documents already flushed to disk, but also makes all uncommitted documents still residing in memory almost instantly searchable. The reader instance returned by IndexWriter.getReader() supports reopening the reader as long as the writer releasing the reader has not been committed. Once it is committed, reopening the reader will result in an AlreadyClosedException.

It is important to understand why this feature is referred to as "near real-time" rather than "real-time." When IndexWriter.getReader() is called for the very first time, Lucene needs to consume a reasonable amount of additional resources (i.e., RAM, CPU cycles, and file descriptors) to make uncommitted documents searchable. Due to this additional work, uncommitted documents will not always be available instantaneously. Nonetheless, in most cases, the performance gained with this feature will be better than just reopening the index, or the traditional simpler approach of opening a brand new reader instance.

To keep the latency as low as possible, the IndexWriter offers an optional "pre-warmup" functionality, by which newly merged segments can be prepared for real-time search, as sketched below. If you are new to this feature, you should be aware that the pre-warmup API is still marked experimental and might change in future releases.
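A hedged sketch of that hook (the warm-up body is invented for illustration): an IndexReaderWarmer registered on the writer touches each newly merged segment before it becomes visible to near-real-time readers.

indexWriter.setMergedSegmentWarmer(new IndexWriter.IndexReaderWarmer() {
    public void warm(IndexReader reader) throws IOException {
        // touch the freshly merged segment so the first NRT search
        // doesn't pay the full cost; a real application might preload
        // the FieldCache or run representative warm-up queries here
        reader.maxDoc();
    }
});

Registering an experimental merged-segment warmer on the IndexWriter (sketch)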
MultiTermQuery-Related Improvements

In Lucene 2.4, many standard queries, such as FuzzyQuery, WildcardQuery, and PrefixQuery, were refactored and subclassed under MultiTermQuery. Lucene 2.9 adds some improvements under the hood, resulting in much better performance for those queries.2 In Lucene 2.9/3.0, multiterm queries now use a constant score internally, based on the assumption that most programmers don't care about the interim score of the queries resulting from the term expansion that takes place during query rewriting.

2 This could be a back-compatibility issue if one of those classes has been subclassed.

Although constant-scoring is now the default behavior, the older scoring mode is still available for multiterm queries in 2.9/3.0. Beyond that, you can choose one of the following scoring modes:

• Filtered constant score: rewrites the multiterm query into a ConstantScoreQuery in combination with a filter to match all relevant documents.
• BooleanQuery constant score: rewrites the multiterm query into a ConstantScoreQuery based on a BooleanQuery by translating each term into an optional Boolean clause. This mode still has a limitation of maxClauseCount and might raise an exception if the query has too many Boolean clauses.
• Conventional scoring (not recommended): rewrites the multiterm query into an ordinary BooleanQuery.
• Automatic constant score (default): tries to choose the best constant-score mode (Filter or BooleanQuery) based on term and document counts from the query. If the number of terms and documents is small enough, BooleanQuery is chosen; otherwise the query rewrites to a filter-backed ConstantScoreQuery.

You can change the scoring mode by passing an implementation of RewriteMethod to MultiTermQuery.setRewriteMethod(), as shown in the code example below.

PrefixQuery prefixQuery = new PrefixQuery(new Term("aField", "luc"));
prefixQuery.setRewriteMethod(
    MultiTermQuery.CONSTANT_SCORE_FILTER_REWRITE);

Explicitly setting a filtered constant-score RewriteMethod on a PrefixQuery

Payloads

The Payloads feature, though originally added in a previous version of Lucene, remains pretty new to most programmers. A payload is essentially a byte array that is associated with a particular term in the index. Payloads can be associated with a single term during text analysis and subsequently committed directly to the index. On the search side, these byte arrays are accessible to influence the scoring for a particular term, or even to filter entire documents.
For instance, if your Lucene application is analyzing the phrase "Gangs of New York," payloads can encode information about the terms "New" and "York" together, so that they are treated as a paired term for the name of a city, or can specify that "Gangs" is a noun rather than a verb. Prior to 2.9, payloads were exposed via a query called BoostingTermQuery, which has now been renamed to PayloadTermQuery. By using this query type, you can query Lucene to find all occurrences where "New" is a part of a city name like "New York" or "New Orleans".

In comparison with previous versions, Lucene 2.9/3.0 also provides more control and flexibility for payload scoring. You can pass a custom PayloadFunction to the constructor of a payload-aware query. Each payload is fed back to the custom function, which calculates the score based on the cumulative outcomes of payload occurrences. This improvement becomes even more useful when payloads are used in combination with span queries. Spans represent a range of term positions in a document, while payloads, in turn, can help scoring based on the distance between terms. For instance, using a PayloadNearQuery, documents can be scored differently if terms are in the same sentence or paragraph, provided that information is encoded in the payload.

At a higher abstraction level, another payload-aware TokenFilter has been added. DelimitedPayloadTokenFilter splits tokens separated by a predefined character delimiter, where the first part of the token is the token itself and the second part after the delimiter represents the payload. For example, it can parse an e-mail address, for example carol.smith@apache.org, by making "carol.smith" the token, and creating a payload to represent the domain name, "apache.org". A customizable payload encoder takes care of encoding the values while everything else magically happens inside the filter. Besides being a convenient way to add payloads to existing search functionality, this class also serves as a working example of how to use payloads during the analysis process.3

3 See www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads for more information.
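As a hedged illustration of these pieces (field name, delimiter, and payload values are invented), the fragment below wires a DelimitedPayloadTokenFilter into an analysis chain and queries with a payload-aware PayloadTermQuery.

// tokens look like "new|2.0 york|2.0": the text before '|' is the term,
// the float after it is encoded into the term's payload
Analyzer analyzer = new Analyzer() {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new DelimitedPayloadTokenFilter(
            new WhitespaceTokenizer(reader), '|', new FloatEncoder());
    }
};

// at search time the payloads of matching terms feed into the score;
// AveragePayloadFunction averages all payload scores seen for the term,
// and Similarity#scorePayload decodes the stored bytes into those scores
PayloadTermQuery query = new PayloadTermQuery(
    new Term("body", "york"), new AveragePayloadFunction());

Indexing delimited payloads and scoring them with a PayloadFunction (sketch)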
Additions to Lucene Contrib

So far, we've reviewed key new features and improvements introduced in the Apache Lucene core API. This section outlines the major additions and improvements to Lucene Contrib packages. Contrib packages are parts of Lucene that do not necessarily belong to the API core but are often helpful in building Lucene applications.

New Contrib Analyzers

The Analysis package in Lucene Contrib has always been a valuable source for almost every Lucene programmer. The latest release brings several noteworthy improvements, especially in terms of language support.

• Better support for Chinese: Chinese, like many Asian languages, does not use white spaces to delimit one word from another, nor is punctuation used at all. Smart-CN provides an analyzer with improved tokenization and capabilities in splitting individual characters. While Smart-CN is part of the analyzers contrib module, it is distributed in its own JAR file because of the large (6MB) file resources it depends on.
• "Light10"-based Arabic analysis: a new Analyzer based on a high-performance stemming algorithm (Light10) applying lightweight prefix and suffix removal to Arabic text.
• Persian analyzer: applies character normalization and Persian stopword removal to Persian-only or mixed-language text.
• Reverse String filter, as in leading wildcards: to support a search feature like leading wildcards efficiently, one of the common tricks/approaches is to index terms in reverse order. A leading wildcard effectively becomes a trailing wildcard if searched against a field with reversed tokens, as the sketch below illustrates.
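A hedged sketch of that trick (field name and analysis chain are invented): index each token reversed with contrib's ReverseStringFilter, then rewrite a leading wildcard such as *soft into a trailing one against the reversed field.

// analyzer for the "reversed" field: every token is indexed reversed
Analyzer reversingAnalyzer = new Analyzer() {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new ReverseStringFilter(new WhitespaceTokenizer(reader));
    }
};

// the leading wildcard "*soft" becomes the trailing wildcard "tfos*"
WildcardQuery query = new WildcardQuery(new Term("reversed", "tfos*"));

Reversing tokens at index time to answer leading-wildcard queries (sketch)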
Lucene Spatial (formerly known as LocalLucene)

Geospatial search has become a very common use-case, especially with the advent of mobile devices. Almost every new mobile platform supports a "nearby" search feature. End users seeking data on something near their current location (restaurants, movie theatres, plumbers, etc.) expect both that results are limited to within a certain range, and that results can be ranked by distance from the end user's location.

In early 2009, an open source project formerly known as LocalLucene was donated to Apache Lucene and integrated as a contrib package. Lucene Spatial extends Lucene capabilities with support for geographical and location-based search. While Lucene Spatial doesn't have any distance scoring capabilities, it can effectively help to filter and sort based on geographical information like longitude and latitude values. Filtering is an especially common use-case when combined with a full-text query. In searching for "French restaurant" within 5 miles from a specific location, the filter restricts the search space to documents with location fields within 5 miles; the rest of the search operation is implemented in core Lucene.

Lucene Spatial has a couple of different ways to encode geographic information:

• GeoHash: a hierarchical spatial data structure that subdivides space into buckets in a grid shape. GeoHash takes the even bits from the longitude value, while the odd bits are taken from the latitude value. The result is an arbitrary-precision, base-32-encoded string that offers the property of gradually removing characters from the end of the string to reduce the size and precision of the code. Nearby places are likely to have similar prefixes due to this property.
• Cartesian Tiers: projects the world on a flat surface. Overlays to this projection are created as grids (Cartesian Tiers), with each tier having an increasing number (always by a power of two) of grid boxes on it dividing up the projection. Location data can be placed within one of the grid boxes with different precision depending on the number of grid boxes on the tier.

Both of the above allow efficient storage of geo-information in a Lucene index. In contrast to plain latitude and longitude values indexed in separate fields, GeoHash and Cartesian Tiers encode in a single field. Note that despite its previous releases under a different name (LocalLucene), the Lucene Spatial API still isn't considered stable and might change in future releases.
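For instance (a hedged sketch; as just noted, the spatial contrib API was not yet stable), the prefix property is easy to see by encoding two nearby points:

// two nearby points in San Francisco encode to hashes sharing a prefix
String hash1 = GeoHashUtils.encode(37.7749, -122.4194);
String hash2 = GeoHashUtils.encode(37.7750, -122.4195);
System.out.println(hash1); // e.g. a string beginning with "9q8yy"
System.out.println(hash2); // shares a long common prefix with hash1

Encoding nearby coordinates with contrib's GeoHashUtils (sketch)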
Lucene Remote and Java RMI

The historic dependency on Java RMI has now been removed from the Lucene core: Lucene Remote is now partitioned into an optional contrib package. While the package itself doesn't add any functionality to Lucene, it introduces a critical back-compatibility issue likely to be relevant for many programmers. In prior versions, the core interface Searchable extended java.rmi.Remote to enable searches on remote indexes. If you had taken advantage of this convenience, you will now have to add the new Lucene-remote JAR file to the classpath and change your code to use the new remote base interface RMIRemoteSearchable, as shown below.

final RMIRemoteSearchable remoteObject = ...;
final String remoteObjectName = ...;
Naming.rebind(remoteObjectName, remoteObject);
Searchable searchable =
    (Searchable) Naming.lookup(remoteObjectName);

Using RemoteSearchable with Lucene 2.9

New Flexible QueryParser

Lucene's built-in query parser has been a burden on developers trying to extend the default query syntax. While changing certain parts of it, such as query instantiation, could be readily achieved by subclassing the parser, changing the actual syntax required deep knowledge of the JavaCC parser-generator. The new contrib package QueryParser provides a complete query parser framework, which is fully compliant with the core parser but enables flexible customization by using a modular architecture. The basic idea of the new query parser is to separate the syntax from the semantics of a query, internally represented as a tree. Ultimately, the parser splits up into three stages:

1. Parsing stage: transforms the query text (syntax) into a QueryNode tree. This stage is exposed by a single interface (SyntaxParser), which implementations must provide to customize this stage.
2. Query-Node processing stage: once the QueryNode tree is created, a chain of processors starts working on the tree. While walking down the tree, a processor can apply query optimizations, child reordering, or term tokenization even before the query is actually executed.

3. Building stage: the final stage builds the actual Lucene Query object by mapping QueryNode types to associated builders. Each builder subsequently applies the actual conversion into a Lucene query.

The snippet below, taken from the new standard QueryParser implementation, shows how the stages are exposed at the API's top level.

QueryNode queryTree = this.syntaxParser.parse(query, getField());
queryTree = this.processorPipeline.process(queryTree);
return (Query) this.builder.build(queryTree);

To provide a smooth transition from the existing core parser to the new API, this contrib package also contains an implementation fully compliant with the standard query syntax. This not only helps the switch to the new query parser, but it also serves as an example of how to use and extend the API; a short usage sketch follows below. That said, the standard implementation is based on the new query parser API and therefore can't simply replace a core parser as is. If you have been replacing Lucene's current query parser, you can use QueryParserWrapper instead, which preserves the old query parser interface but calls the new parser framework. One final caveat: the QueryParserWrapper is marked as deprecated, as the new query parser will be moved to the core in the upcoming release and eventually replace the old API.
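A hedged usage sketch (field names and query text are invented): the contrib StandardQueryParser runs all three stages behind a familiar one-call API.

StandardQueryParser parser = new StandardQueryParser(
    new StandardAnalyzer(Version.LUCENE_29));
// parse(querySyntax, defaultField) runs the syntax parser, the
// processor pipeline, and the builder stages described above
Query query = parser.parse("title:(lucene AND solr)", "body");

Parsing a query with the new contrib StandardQueryParser (sketch)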
Minor Changes and Improvements in Lucene 2.9

Beside the improvements and entirely new features, Lucene 2.9 contains several minor improvements worth mentioning. The following points are a partial outline of minor changes.

• Term vector-based highlighter: a new highlighter implementation based on term vectors (essentially a view of terms, offsets, and positions in a document's field). It supports features like n-gram fields and phrase-unit highlighting with slops, and yields good performance on large documents. The downside is that it requires a lot more disk space due to stored term vectors.
• Collector replaces HitCollector: the low-level HitCollector was deprecated and replaced with a new Collector class. Collector offers a more efficient API to collect hits across sequential IndexReader instances. The most significant improvement here is that score calculation is now decoupled from collecting hits, or skipped entirely if not needed—a nice new efficiency.
• Improved String "interning": Lucene 2.9 internally uses a custom String intern cache instead of Java's default String.intern(). The lockless implementation yields minor internal performance improvements.
• New n-gram distance: a new n-gram-based distance measure was added to the contrib spellcheck package.
• Weight is now an abstract class: the Weight interface was refactored into an abstract class, including minor method signature changes.
• ExtendedFieldCache marked deprecated: all methods and parsers from the interface ExtendedFieldCache have been moved into FieldCache. ExtendedFieldCache is now deprecated and contains only a few declarations for binary backwards compatibility.
• MergePolicy interface changed: MergePolicy now requires an IndexWriter instance to be passed upon instantiation. As a result, IndexWriter was removed as a method argument from all MergePolicy methods.

For a complete list of improvements, bug fixes, compatibility, and runtime behavior changes, you should consult the CHANGES.txt file included in the Lucene distribution (lucene.apache.org/java/2_9_0/changes/Changes.html).
Changes and Improvements in Lucene 3.0

Lucene 3.0 provides a clean transition to a new major version of the library. Since no new features have been introduced, this section will give a short overview of important changes regarding backwards compatibility, API changes, and removed features.

• Removed Compressed Fields: compressed fields, already deprecated in Lucene 2.9, have been removed without a direct replacement. While Lucene 3.0 is still able to read indexes with compressed fields, index merges or optimizations will decompress and store such fields transparently. Given this behavior, indexes built with compressed fields might suddenly become larger during a segment merge or optimization.
• Removed deprecated Classes and Methods: deprecated methods and classes have been removed in 3.0. A full list can be found at lucene.apache.org/java/3_0_0/changes/Changes.html#3.0.0.api_changes.
• Generics & Java 5 Features: Lucene 3.0 became the first release requiring Java 5 as an underlying execution environment. In addition to various replacements of classes with their improved equivalents, like StringBuffer with StringBuilder, many new language features were introduced. Public APIs now make heavy use of Generic Types and Variable Arguments. Underneath the hood, this move paved the way to introduce improvements using Java 5 Concurrent Utilities.
• Scorer Deprecations: 3.0 refactors several methods on Lucene's lowest level. Scorer and its abstract super-class DocIdSetIterator have incompatible API changes, while equivalents for old APIs are provided. Custom Query, Scorer, or DocIdSetIterator implementations must be ported to the new API in order to be compatible with Lucene 3.0.
• Made core TokenStreams final: to enforce Lucene's Decorator-based analysis model, several core TokenStream implementations have been declared final without any replacement and can therefore not be subclassed anymore. Users subclassing streams like KeywordTokenizer or StandardTokenizer are required to rebuild the functionality.

To gain a comprehensive understanding of what has changed in 3.0 in contrast to Lucene 2.9, programmers should consult the CHANGES.txt file and the corresponding issues on the Lucene issue tracker.

Lucene Version by Version Compatibility since 2.9

In Lucene 2.9, a Version constant was first introduced to help in preserving version-by-version backwards compatibility. The initial purpose of Version was to enable the Lucene contributors to eventually fix long-time known bugs and limitations in Lucene without breaking their own backwards compatibility policy. Lucene's StandardAnalyzer was the first class making use of Version to change its runtime behavior based on the given version number. Its constructor requires an instance of Version, and it changes its internal behavior accordingly:

• As of 2.4, Tokens incorrectly identified as acronyms were corrected.
• As of 2.9, StopFilter preserves position increments.

You might ask why this old, and in these cases incorrect, behavior is preserved at all, and why it is the user's responsibility to decide which is correct. Yet the answer isn't as obvious as expected. Since Lucene preserves backwards compatibility for indices created with previous versions of the library, it also has to preserve compatibility with how those indices have been built and how they are queried. Changes like the runtime behavior of Analyzers, TokenFilters, and Tokenizers can easily break backwards compatibility, trigger unexpected behavior, and cause a bad user experience if queries return different documents than before. The sketch below shows the constant in use.
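A minimal sketch (the analyzer choice is just an example) of selecting behavior through the Version constant:

// keep the 2.4 analysis behavior while running on the newer JAR,
// so existing indexes keep matching the way they were built
StandardAnalyzer legacyAnalyzer = new StandardAnalyzer(Version.LUCENE_24);

// opt in to the current behavior (e.g., StopFilter position increments)
StandardAnalyzer currentAnalyzer = new StandardAnalyzer(Version.LUCENE_29);

Pinning StandardAnalyzer's runtime behavior with a Version constant (sketch)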
The Version constant has been introduced to make the upgrade process easier for users who cannot afford to rebuild their index or need to support "old" indices in production environments. In such cases, it is recommended that you pass the Version constant you are upgrading from to all constructors expecting a Version constant. Once the latest behavior is desired, the version you have upgraded to should be used in favor of Version#LUCENE_CURRENT, which has been deprecated in Lucene trunk due to the dangers it could introduce in a subsequent upgrade.

Strategies for Upgrading to Lucene 2.9/3.0

In the main, a Lucene-based application will benefit from the improvements in 2.9, even as its new features, such as numeric capabilities and the new TokenStream API, do require code modifications and may require reindexing in order to take full advantage. That said, compared to previous version changes, an upgrade to version 2.9 requires a more sophisticated upgrade procedure. True, there are many cases in which an upgrade won't require code changes, as changes limited to "expert" APIs won't affect applications only using high-level functionality. All the same, even if an application compiles with Lucene 2.9, it is likely that some of the changes in runtime characteristics can introduce unexpected behaviors. In the sections below, we'll offer some brief suggestions for making the transition.

An upgrade to version 3.0 requires a more sophisticated upgrade procedure: first, upgrade to 2.9; then remove the deprecation warnings. Only then should you upgrade to 3.0.x.

Should you move to 2.9 or 3.0? Whichever you do, first bear in mind that going to 3.0 will require a migration first to 2.9; it is a prerequisite. Only once that 2.9 transition is completed will you be ready to work through the deprecation warnings in order to move. Because 3.0 is a deprecation release, all deprecated-marked code in Lucene 2.9 will be removed. Some parts of the API might be modified in order to make use of Java Generics,
but in general the upgrade from 2.9 to 3.0 should be as seamless as earlier upgrades have been. Once you have replaced the usage of any deprecated API(s) in your code, you should then be able to upgrade the next time simply by replacing the Lucene JAR file.

Upgrade to 2.9—Recommended Actions

At a minimum, if you plan an upgrade of your search application to Lucene 2.9, you should recompile your application against the new version before the application is rolled out in a production environment. The most critical issues will immediately raise a compile-time error once the JAR is in the classpath. For those of you using Lucene from a single location, for example, in the JRE's ext directory, you should make sure that 2.9 is the only Lucene version accessible.

In cases where an application relies on extending Lucene in any particular way and the upgrade doesn't raise a compile-time error, it is recommended that you add a test case for the extension based on the behavior executed against the older version of Lucene. It is also extremely important that you back up and archive your index before opening it with Lucene 2.9, as it will make changes to the index that may not be readable by previous versions.

Again, we strongly recommend a careful reading of the CHANGES.txt file included in every Lucene distribution, especially the sections on back-compatibility policy and on changes in runtime behavior. Careful study followed by proper planning and testing should prevent you from running into any surprises once the new Lucene 2.9-based application goes into production.

Upgrade to 2.9—Optional Actions

Lucene 2.9 includes many new features that are not required for use of the new release. Nevertheless, 2.9 has numerous parts of the API marked as deprecated, since they are to be removed in the next release. To prepare for the next release and further improvements in this direction, it is strongly recommended that you replace any deprecated API during the upgrade process. Applications using any kind of numeric searches can improve their performance heavily by replacing a custom solution with Lucene's Numeric Capabilities described earlier in this white paper.
Last but not least, the new TokenStream API will replace the older API entirely in the next release. Custom TokenStream, TokenFilter, and Tokenizer implementations should be updated to the attribute-based API. Here, the source distribution contains basic test cases that can help you safely upgrade. Finally, to reiterate, you would do best to write newly added test cases against your current Lucene version, and upgrade the tests and your code once you have gained enough confidence in the stability of the upgrade.

Migrating from Lucene to Solr?

Many Lucene implementations, of a variety of vintages, date back to a time when Solr lacked core capabilities that were available only by building from scratch all underlying services needed by the Lucene search libraries. Happily, Solr today offers a very complete (with some small, but meaningful exceptions) implementation of Lucene functionality. While a complete, robust approach to migrating from Lucene to Solr is beyond the scope of this paper, here are some thoughts on the advantages of doing so. A slightly longer comparison of Solr and Lucene is available in the Appendix to this document.

Using Standards by Default
As a server application running in Jetty, Tomcat, or any other Servlet Container, Solr easily installs, integrates, and runs in existing production environments. Solr's RESTful APIs and its XML-driven implementation simplify configuration, operation, and search application development. With a rich array of client libraries, from standard SolrJ for Java through JSON, Python, and many others, the base of programming skills needed for the search application is much narrower.

Makes Lucene Best-Practices ready to use
From caching Filters, Queries, or Documents, via Spell-Checking, to warming Searchers in the background, Solr offers an enormous set of search features that would require lots of Lucene experience and expertise to implement without it. Solr lets you immediately benefit from these low-level developments, simplifying creation and development of your search environment.

Reuse some of your Lucene Libraries and indexes
Because Lucene is at the core of Solr, your implementation of Lucene can reuse many of the same libraries; just plug in details for your handlers or analyzers into
Solr's solrconfig.xml and start testing. Likely as not, you'll find you can move much of your code as is from Lucene, or even use your existing indexes with Solr directly.

Lower maintenance and revision costs
As this transition from Lucene 2.4.1 to Lucene 2.9/3.0 demonstrates, many of the low-level, high-control advantages of implementing directly with Lucene are negated once anything changes in your environment. As a server, Solr insulates you from much of that, and helps remove the temptation to hard-code optimistically or to skip abstractions.

Cloud-readiness
As large data sets grow in scope and distribution, search services will necessarily rely on a much higher level of abstraction, above not only data and I/O, but also more elastic distribution of search resources and operations, i.e., shards and inserts/updates. If your business is at a place where hardware might scale via a transition into some kind of cloud environment, you will benefit by taking advantage of the forthcoming Solr cloud-enabling capabilities, including distributed node management, relevancy calculations across multiple collections, etc.

It's important to note that there are many good reasons not to migrate to Solr from Lucene, whether they have to do with the cost of a new abstraction model in your general application implementation, or with no real need for serving search over HTTP. With the merger of the Lucene and Solr development projects, you won't be shortchanged on any of the underlying functionality. But at the same time, the stronger functional affinity between the two means you'll have to give careful thought to your long-term deployment goals in order to pick the right one.
References

http://lucene.apache.org/java/2_9_0/index.html
http://lucene.apache.org/java/2_9_0/changes/Changes.html
http://lucene.apache.org/java/2_9_0/changes/Contrib-Changes.html
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Podcasts-and-Videos/Interview-Uwe-Schindler
http://wiki.apache.org/lucene-java/NearRealtimeSearch
http://wiki.apache.org/lucene-java/Payloads
http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/
http://wiki.apache.org/lucene-java/ConceptsAndDefinitions
http://wiki.apache.org/lucene-java/FlexibleIndexing
http://wiki.apache.org/lucene-java/Java_1.5_Migration
http://www.lucidimagination.com/How-We-Can-Help/webinar-Lucene-29
http://www.nsshutdown.com/projects/lucene/whitepaper/locallucene_v2.html
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Podcasts-and-Videos/Interview-Ryan-McKinley
http://ocw.kfupm.edu.sa/user062/ICS48201/NLLight%20Stemming%20for%20Arabic%20Information%20Retrieval.pdf
https://javacc.dev.java.net/
Next Steps

For more information on how Lucid Imagination can help search application developers, employees, customers, and partners find the information they need, please visit www.lucidimagination.com to access blog posts, articles, and reviews of dozens of successful implementations.

Certified Distributions from Lucid Imagination are complete, supported bundles of software that include additional bug fixes and performance enhancements, along with our free 30-day Get Started program. Coupled with one of our support subscriptions, a Certified Distribution can provide a complete environment to develop, deploy, and maintain commercial-grade search applications. Certified Distributions are available at www.lucidimagination.com/Downloads.

Please e-mail specific questions to:
Support and Service: support@lucidimagination.com
Sales and Commercial: sales@lucidimagination.com
Consulting: consulting@lucidimagination.com
Or call: 1.650.353.4057
APPENDIX: Choosing Lucene or Solr

The great improvements in the capabilities of Lucene and Solr open source search technology have created rapidly growing interest in using them as alternatives for search applications. As is often the case with open source technology, online community documentation provides rich details on features and variations, but does little to provide explicit direction on which technologies would be the best choice. So when is Lucene preferable to Solr, and vice versa?

There is in fact no single answer, as Lucene and Solr bring very similar underlying technology to bear on somewhat distinct problems. Solr is versatile and powerful, a full-featured, production-ready search application server requiring little formal software programming. Lucene presents a collection of directly callable Java libraries, with fine-grained control of machine functions and independence from higher-level protocols. In choosing which might be best for your search solution, the key questions to consider are application scope, deployment environment, and software development preferences.

If you are new to developing search applications, you should start with Solr. Solr provides scalable search power out of the box, whereas Lucene requires solid information retrieval experience and some meaningful heavy lifting in Java to take advantage of its capabilities. In many instances, Solr doesn't even require any real programming.

Solr is essentially the "serverization" of Lucene, and many of its abstract functions are highly similar, if not just the same. If you are building an app for the enterprise sector, for instance, you will find Solr almost a 100% match to your business requirements: it comes ready to run in a servlet container such as Tomcat or Jetty, and ready to scale in a production Java environment. Its RESTful interfaces and XML-based configuration files can greatly accelerate application development and maintenance. In fact, Lucene programmers have often reported that they find Solr to contain "the same features I was going to build myself as a framework for Lucene, but already very well implemented." Once you start with Solr, and you find yourself using a lot of the features Solr provides out of the box, you will likely be better off using Solr's well-organized extension mechanisms instead of starting from scratch using Apache Lucene.
If, on the other hand, you do not wish to make any calls via HTTP, and wish to have all of your resources controlled exclusively by Java API calls that you write, Lucene may be a better choice. Lucene can work best when constructing and embedding a state-of-the-art search engine, by allowing programmers to assemble and compile inside a native Java application. Some programmers set aside the convenience of Solr in order to more directly control the large set of sophisticated features with low-level access, data, or state manipulation, and choose Lucene instead, for example, with byte-level manipulation of segments or intervention in data I/O. Investment at the low level enables development of extremely sophisticated, cutting-edge text search and retrieval capabilities.

As for features, the latest version of Solr generally encapsulates the latest version of Lucene. As the two are in many ways functional siblings, spending time gaining a solid understanding of how Lucene works internally can help you understand Apache Solr and its extension of Lucene's workings.

No matter which you choose, the power of open source search is yours to harness. More information on both Lucene and Solr can be found at www.lucidimagination.com.