Recent Additions to Lucene’s Arsenal
Shai Erera, Researcher, IBM
Adrien Grand, ElasticSearch

Who We Are
• Shai Erera
– Working at IBM – Information Retrieval Research
– Lucene/Solr committer and PMC member
– http://shaierera.blogspot.com
– shaie@apache.org
• Adrien Grand
– @jpountz
– Lucene/Solr committer and PMC member
– Software engineer at Elasticsearch

The Replicator
[Diagram: a Primary, the Replicator, and two Backups, each with its own Replication Client]
http://shaierera.blogspot.com/2013/05/the-replicator.html

Replication Components
• Replicator
– Mediates between the client and server
– Manages the published Revisions
– Implementation for replication over HTTP
• Revision
– Describes a list of files and metadata
– Responsible for ensuring the files remain available as long as clients replicate it
• ReplicationClient
– Performs the replication operation on the replica side
– Copies delta files and invokes ReplicationHandler upon successful copy
– Always replicates the latest revision
• ReplicationHandler
– Acts on the copied files
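
The division of labor above (a revision lists files, the client copies only the delta, a handler acts on the copied files) can be illustrated with a toy model in plain Java. This is only a sketch and not the Lucene replicator API: ToyRevision, ToyReplica and all their members are hypothetical names, and the sketch assumes that a file name identifies its content, so a file already present locally never needs to be copied again.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: hypothetical classes, not the Lucene replicator API.
class ToyRevision {
    final int version;
    final Map<String, String> files; // file name -> content

    ToyRevision(int version, Map<String, String> files) {
        this.version = version;
        this.files = new TreeMap<>(files); // sorted so the copy order is deterministic
    }
}

class ToyReplica {
    private int currentVersion = 0;
    private final Map<String, String> localFiles = new HashMap<>();

    /** Copies the files missing locally and returns their names, in copy order. */
    List<String> replicate(ToyRevision latest) {
        List<String> copied = new ArrayList<>();
        if (latest == null || latest.version <= currentVersion) {
            return copied; // already up to date, nothing to do
        }
        for (Map.Entry<String, String> e : latest.files.entrySet()) {
            if (!localFiles.containsKey(e.getKey())) {
                localFiles.put(e.getKey(), e.getValue()); // delta copy
                copied.add(e.getKey());
            }
        }
        // "handler" step: drop files the new revision no longer references
        localFiles.keySet().retainAll(latest.files.keySet());
        currentVersion = latest.version;
        return copied;
    }

    int version() { return currentVersion; }

    int fileCount() { return localFiles.size(); }
}
```

Replicating a second revision that shares a file with the first one copies only the new files; replicating the same revision twice copies nothing.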

Index Replication
• IndexRevision
– Obtains a snapshot on the last commit through SnapshotDeletionPolicy
– The snapshot is released when the revision is released by the Replicator
• IndexReplicationHandler
– Copies the files to the index directory and fsyncs them
– Aborts (rollback) on any error
– Upon successful completion, invokes a callback (e.g. SearcherManager.maybeRefresh())
• Similar extensions for faceted index replication
– IndexAndTaxonomyRevision: obtains snapshots on both the search and taxonomy indexes
– IndexAndTaxonomyReplicationHandler: copies the files to the respective directories, keeping both in sync

Sample Code
// Server-side: publish a new Revision
Replicator replicator = new LocalReplicator();
replicator.publish(new IndexRevision(indexWriter));

// Client-side: replicate a Revision
Replicator replicator; // either LocalReplicator or HttpReplicator
// refresh SearcherManager after index is updated
Callable<Boolean> callback = new Callable<Boolean>() {
  public Boolean call() throws Exception {
    // index was updated, refresh manager
    searcherManager.maybeRefresh();
    return true;
  }
};
ReplicationHandler handler = new IndexReplicationHandler(indexDir, callback);
SourceDirectoryFactory factory = new PerSessionDirectoryFactory(workDir);
ReplicationClient client = new ReplicationClient(replicator, handler, factory);
client.updateNow(); // invoke client manually
// -- OR --
// check for updates every 30 seconds; the second argument names the update thread
client.startUpdateThread(30000, "replication");

Future Work
• Resume
– Session level: don’t copy files that were already successfully copied
– File level: don’t copy file parts that were already successfully copied
• Parallel Replication
– Copy revision files in parallel
• Other replication strategies
– Peer-to-peer

Index Sorting
How to trade indexing speed for search speed

Anatomy of a Lucene index
Index = collection of immutable segments
Segments store documents sequentially on disk
Add data = create a new segment
Segments eventually get merged together
Order of segments / documents in segments doesn’t matter
– the following segments are equivalent

Id       1   3  10   4   7  20  42  11   9   8  15  18  30  31  99   5  12
Price    9   0   7   8   2   2   1   8  10   3   4   4   6  10   1   1  13

Id      12   9  31   1   4  11  10  30  15  18   8   7  20  42  99   5   3
Price   13  10  10   9   8   8   7   6   4   4   3   2   2   1   1   1   0

Anatomy of a Lucene index
ordinal of a doc in a segment = doc id
used in the inverted index to refer to docs

shoe → 1, 3, 5, 8, 11, 13, 15

doc id   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
Id       1   3  10   4   7  20  42  11   9   8  15  18  30  31  99   5  12
Price    9   0   7   8   2   2   1   8  10   3   4   4   6  10   1   1  13

Top hits
Get top N=2 results:
– Create a priority queue of size N
– Accumulate matching docs

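Id       1   3  10   4   7  20  42  11   9   8  15  18  30  31  99   5  12
Price    9   0   7   8   2   2   1   8  10   3   4   4   6  10   1   1  13

Priority queue contents (Ids) as matching docs are accumulated:
() → (3) → (3,4) → (4,20) → (4,9) → (4,9) → (9,31) → (9,31)
– start by creating an empty priority queue
– when the queue overflows, its least entry is removed automatically
– the final contents, (9,31), are the top hits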
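
The walkthrough above can be sketched in plain Java with java.util.PriorityQueue. This is an illustration rather than Lucene's own collector: the class name TopHits is hypothetical, and it assumes the matching docs are the posting list for shoe from the earlier slide (doc ids 1, 3, 5, 8, 11, 13, 15) ranked by price, which is consistent with the queue states shown.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper, not a Lucene class.
class TopHits {
    /** Returns the Ids of the n highest-priced matching docs; every match is visited. */
    static List<Integer> topN(int[] ids, int[] prices, int[] matchingDocs, int n) {
        // min-heap on price: the head is the weakest hit currently kept
        PriorityQueue<Integer> pq =
                new PriorityQueue<>((a, b) -> Integer.compare(prices[a], prices[b]));
        for (int doc : matchingDocs) {
            pq.offer(doc);
            if (pq.size() > n) {
                pq.poll(); // overflow: evict the lowest-priced doc
            }
        }
        List<Integer> result = new ArrayList<>();
        for (int doc : pq) {
            result.add(ids[doc]);
        }
        Collections.sort(result); // fixed order for display; heap iteration order is unspecified
        return result;
    }
}
```

With the Id and Price rows of the slide and N=2, topN returns [9, 31]. Which of two equally priced docs is evicted first is unspecified in this sketch.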

Early termination
Let’s do the same on a sorted index

Id      12   9  31   1   4  11  10  30  15  18   8   7  20  42  99   5   3
Price   13  10  10   9   8   8   7   6   4   4   3   2   2   1   1   1   0

Priority queue contents (Ids) as matching docs are accumulated:
() → (9) → (9,31) → (9,31) → (9,31) → (9,31) → (9,31) → (9,31)
– the priority queue never changes after the document with Id 31
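
On a fully sorted index the same search reduces to reading the first N postings, as the following sketch shows. It is a simplified model and not Lucene code: SortedIndexScan is a hypothetical name, and the postings array assumes the same seven matching documents as before, now at their positions in the sorted index.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a Lucene class.
class SortedIndexScan {
    /**
     * Assumes doc ids follow descending static rank, so the first n postings
     * are the top n hits. The remaining postings are never read.
     */
    static List<Integer> firstN(int[] idsInIndexOrder, int[] postings, int n) {
        List<Integer> hits = new ArrayList<>();
        for (int doc : postings) {
            if (hits.size() >= n) {
                break; // early termination
            }
            hits.add(idsInIndexOrder[doc]);
        }
        return hits;
    }
}
```

With the sorted Id row of the slide, postings 1, 2, 4, 9, 12, 15, 16 and N=2, firstN returns [9, 31] after reading two of the seven postings.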

Early termination
Pros
– makes finding the top hits much faster
– file-system cache-friendly
Cons
– only works for static ranks: not if the sort order depends on the query
– requires the index to be sorted
– doesn’t work for tasks that require visiting every doc, such as the total number of matches or faceting

Static ranks
Not uncommon!
Graph-based ranks
– Google’s PageRank
Facebook social search / Unicorn
– https://www.facebook.com/publications/219621248185635
Many more...

Doesn’t need to be the exact sort order
– heuristics when score is only a function of the static rank

Offline sorting
A live index can’t be kept sorted
– would require inserting docs between existing docs!
– segments are immutable
Offline sorting to the rescue:
– index as usual
– sort into a new index
– search!
Pros/cons
– super fast to search, the whole index is fully sorted
– but only works for static content

Offline Sorting
// open a reader on the unsorted index and create a sorted (but slow) view
DirectoryReader reader = DirectoryReader.open(in);
boolean ascending = false;
Sorter sorter = new NumericDocValuesSorter("price", ascending);
AtomicReader sortedReader = SortingAtomicReader.wrap(
    SlowCompositeReaderWrapper.wrap(reader), sorter);
// copy the content of the sorted reader to the new dir
IndexWriter writer = new IndexWriter(out, iwConf);
writer.addIndexes(sortedReader);
writer.close();
reader.close();
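
Conceptually, sorting an index means computing a permutation of doc ids and rewriting every per-document structure in that order. The sketch below shows that idea on plain arrays; it does not reproduce how Lucene's Sorter is implemented, and OfflineSort, newToOld and reorder are hypothetical names.

```java
import java.util.Arrays;

// Hypothetical helper, not a Lucene class.
class OfflineSort {
    /** Returns the permutation newToOld[newDoc] = oldDoc, ordered by descending price. */
    static int[] newToOld(int[] prices) {
        Integer[] docs = new Integer[prices.length];
        for (int i = 0; i < docs.length; i++) {
            docs[i] = i;
        }
        // Arrays.sort on objects is stable, so ties keep their original relative order
        Arrays.sort(docs, (a, b) -> Integer.compare(prices[b], prices[a]));
        int[] result = new int[docs.length];
        for (int i = 0; i < docs.length; i++) {
            result[i] = docs[i];
        }
        return result;
    }

    /** Rewrites one per-document column into the new doc order. */
    static int[] reorder(int[] column, int[] newToOld) {
        int[] out = new int[column.length];
        for (int newDoc = 0; newDoc < out.length; newDoc++) {
            out[newDoc] = column[newToOld[newDoc]];
        }
        return out;
    }
}
```

Applied to the unsorted Id and Price rows from the anatomy slide, reorder yields the sorted segment shown next to it, starting with Id 12 (price 13) and ending with Id 3 (price 0).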

Online sorting?
Sort segments independently
– wouldn’t require inserting data into existing segments
– collection could still be early-terminated on a per-segment basis
But segments are immutable
– a segment must be sorted before it is written

Online sorting?
2 sources of segments
– flush
– merge
flushed segments can’t be sorted
– Lucene writes stored fields to disk on the fly
– could be buffered but this would require a lot of memory
merged segments can be sorted
– create a sorted view over the segments to merge
– pass this view to SegmentMerger instead of the original segments
not a bad trade-off
– flushed segments are usually small & fast to collect

Online sorting?
Merged segments can easily take 99+% of the size of the index
Flushed segments come from
– NRT reopens
– RAM buffer size limit hit
[Figure: merged segments vs. flushed segments]
http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html

Online Sorting
IndexWriterConfig iwConf = new IndexWriterConfig(...);
// original MergePolicy finds the segments to merge
MergePolicy origMP = iwConf.getMergePolicy();
// SortingMergePolicy wraps the segments with a sorted view
boolean ascending = false;
Sorter sorter = new NumericDocValuesSorter("price", ascending);
MergePolicy sortingMP = new SortingMergePolicy(origMP, sorter);
// setup IndexWriter to use SortingMergePolicy
iwConf.setMergePolicy(sortingMP);
IndexWriter writer = new IndexWriter(dir, iwConf);
// index as usual

Early termination
Collect top N matches
Offline sorting
– index sorted globally
– early terminate after N matches have been collected
– no priority queue needed!
Online sorting
– no early termination on flushed segments
– early termination on merged segments, either once N matches have been collected in the segment or once the current match is no better than the top of the PQ

Early Termination
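class MyCollector extends Collector {
  // sorter, maxDocsToCollect, pq and curVal are assumed to be set up elsewhere in the collector
  private boolean readerIsSorted;
  private int collected;

  @Override
  public void setNextReader(AtomicReaderContext context) throws IOException {
    // only segments produced by SortingMergePolicy with the same sorter are sorted
    readerIsSorted = SortingMergePolicy.isSorted(context.reader(), sorter);
    collected = 0;
  }

  @Override
  public void collect(int doc) throws IOException {
    // '>' rather than '>=' so that the Nth match of the segment is still collected
    if (readerIsSorted &&
        (++collected > maxDocsToCollect || curVal <= pq.top())) {
      // Special exception that tells IndexSearcher to terminate
      // collection of the current segment
      throw new CollectionTerminatedException();
    } else {
      // collect hit
    }
  }

  // setScorer(...) and acceptsDocsOutOfOrder() omitted for brevity
}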
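
The per-segment logic can be tried out without Lucene using the following simplified model. It is only a sketch: SegmentCollector and Segment are hypothetical names, each segment is reduced to the prices of its matching docs in doc-id order, and the sortedDescending flag stands in for the SortingMergePolicy.isSorted check.

```java
import java.util.PriorityQueue;

// Hypothetical model, not a Lucene class.
class SegmentCollector {
    /** A segment reduced to the prices of its matching docs, in doc-id order. */
    static final class Segment {
        final int[] prices;
        final boolean sortedDescending; // true for merged segments, false for flushed ones

        Segment(int[] prices, boolean sortedDescending) {
            this.prices = prices;
            this.sortedDescending = sortedDescending;
        }
    }

    int visited = 0; // matches actually examined, kept for illustration

    /** Collects the n highest prices across all segments. */
    PriorityQueue<Integer> collect(Segment[] segments, int n) {
        PriorityQueue<Integer> pq = new PriorityQueue<>(); // min-heap: head = weakest kept hit
        for (Segment seg : segments) {
            int collectedInSegment = 0;
            for (int price : seg.prices) {
                if (seg.sortedDescending
                        && (collectedInSegment >= n
                            || (pq.size() == n && price <= pq.peek()))) {
                    break; // nothing later in this sorted segment can enter the queue
                }
                visited++;
                collectedInSegment++;
                pq.offer(price);
                if (pq.size() > n) {
                    pq.poll(); // evict the weakest hit
                }
            }
        }
        return pq;
    }
}
```

For one flushed segment with prices 2, 9, 4 followed by two sorted segments with prices 13, 10, 10, 8, 1 and 7, 6, 3, collecting the top 2 examines 5 of the 11 matches: the first sorted segment stops after two hits and the second is abandoned at its first document.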
