0
Recent Additions to Lucene’s Arsenal
Shai Erera, Researcher, IBM
Adrien Grand, ElasticSearch
Who We Are
•

Shai Erera
–
–
–
–

•

Working at IBM – Information Retrieval Research
Lucene/Solr committer and PMC member
...
The Replicator
Load Balancing

Load
Balancer
Failover
Index Backup
Replicator

Replication
Client

The Replicator

Backup

Replication
Client

Primary

Backup
http://shaierera.blogspot.com/...
Replication Components
•

Replicator
–
–
–

•

Revision
–
–

•

Describes a list of files and metadata
Responsible to ensu...
Index Replication
•

IndexRevision
–
–

•

IndexReplicationHandler
–
–
–

•

Obtains a snapshot on the last commit through...
Sample Code
// Server-side: publish a new Revision
Replicator replicator = new LocalReplicator();
replicator.publish(new I...
Future Work
•

Resume
–
–

•

Parallel Replication
–

•

Session level: don’t copy files that were already successfully co...
Index Sorting
How to trade index speed for search speed
Anatomy of a Lucene index
Index = collection of immutable segments
Segments store documents sequentially on disk
Add data ...
Anatomy of a Lucene index
ordinal of a doc in a segment = doc id
used in the inverted index to refer to docs

shoe

1, 3, ...
Top hits
Get top N=2 results:
– Create a priority queue of size N
– Accumulate matching docs

Id

1

3

10

4

7

20

42

...
Early termination
Let’s do the same on a sorted index

Id

12

9

31

1

4

11

10

30

15

18

8

7

20

42

99

5

3

Pr...
Early termination
Pros
– makes finding the top hits much faster
– file-system cache-friendly
Cons
– only works for static ...
Static ranks
Not uncommon!
Graph-based ranks
– Google’s PageRank
Facebook social search / Unicorn
– https://www.facebook.c...
Offline sorting
A live index can’t be kept sorted
– would require inserting docs between existing docs!
– segments are imm...
Offline Sorting
// open a reader on the unsorted index and create a sorted (but slow) view
DirectoryReader reader = Direct...
Online sorting?
Sort segments independently
– wouldn’t require inserting data into existing segments
– collection could st...
Online sorting?
2 sources of segments
– flush
– merge
flushed segments can’t be sorted
– Lucene writes stored fields to di...
Online sorting?

Merged segments can easily take 99+%
of the size of the index

Merged segments

Flushed segments
- NRT re...
Online Sorting
IndexWriterConfig iwConf = new IndexWriterConfig(...);
// original MergePolicy finds the segments to merge
...
Early termination
Collect top N matches
Offline sorting
– index sorted globally
– early terminate after N matches have bee...
Early Termination
class MyCollector extends Collector {
@Override
public void setNextReader(AtomicReaderContext context) t...
Questions?
Recent Additions to Lucene Arsenal
Upcoming SlideShare
Loading in...5
×

Recent Additions to Lucene Arsenal

853

Published on

Presented by Shai Erera, Researcher, IBM

Lucene's arsenal has recently expanded to include two new modules: Index Sorting and Replication. Index sorting lets you keep an index consistently sorted based on some criteria (e.g. modification date). This allows for efficient search early-termination as well as achieve better index compression. Index replication lets you replicate a search index to achieve high-availability, fault tolerance as well as take hot index backups. In this talk we will introduce these modules, discuss implementation and design details as well as best practices.

Published in: Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
853
On Slideshare
0
From Embeds
0
Number of Embeds
5
Actions
Shares
0
Downloads
13
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Transcript of "Recent Additions to Lucene Arsenal"

  1. 1. Recent Additions to Lucene’s Arsenal Shai Erera, Researcher, IBM Adrien Grand, ElasticSearch
  2. 2. Who We Are • Shai Erera – – – – • Working at IBM – Information Retrieval Research Lucene/Solr committer and PMC member http://shaierera.blogspot.com shaie@apache.org Adrien Grand – – – @jpountz Lucene/Solr committer and PMC member Software engineer at Elasticsearch
  3. 3. The Replicator
  4. 4. Load Balancing Load Balancer
  5. 5. Failover
  6. 6. Index Backup
  7. 7. Replicator Replication Client The Replicator Backup Replication Client Primary Backup http://shaierera.blogspot.com/2013/05/the-replicator.html
  8. 8. Replication Components • Replicator – – – • Revision – – • Describes a list of files and metadata Responsible to ensure the files are available as long as clients replicate it ReplicationClient – – – • Mediates between the client and server Manages the published Revisions Implementation for replication over HTTP Performs the replication operation on the replica side Copies delta files and invokes ReplicationHandler upon successful copy Always replicates latest revision ReplicationHandler – Acts on the copied files
  9. 9. Index Replication • IndexRevision – – • IndexReplicationHandler – – – • Obtains a snapshot on the last commit through SnapshotDeletionPolicy Released when revision is released by Replicator Copies the files to the index directory and fsync them Aborts (rollback) on any error Upon successful completion, invokes a callback (e.g. SearcherManager.maybeRefresh()) Similar extensions for faceted index replication – – IndexAndTaxonomyRevision: obtains snapshots on both the search and taxonomy indexes IndexAndTaxonomyReplicationHandler: copies the files to the respective directories, keeping both in sync
  10. 10. Sample Code // Server-side: publish a new Revision Replicator replicator = new LocalReplicator(); replicator.publish(new IndexRevision(indexWriter)); // Client-side: replicate a Revision Replicator replicator; // either LocalReplicator or HttpReplicator // refresh SearcherManager after index is updated Callable<Boolean> callback = new Callable<Boolean>() { public Boolean call() throws Exception { // index was updated, refresh manager searcherManager.maybeRefresh(); } } ReplicationHandler handler = new IndexReplicationHandler(indexDir, callback); SourceDirectoryFactory factory = new PerSessionDirectoryFactory(workDir); ReplicationClient client = new ReplicationClient(replicator, handler, factory); client.updateNow(); // invoke client manually // -- OR -client.startUpdateThread(30000); // check for updates every 30 seconds
  11. 11. Future Work • Resume – – • Parallel Replication – • Session level: don’t copy files that were already successfully copied File level: don’t copy file parts that were already successfully copied Copy revision files in parallel Other replication strategies – Peer-to-peer
  12. 12. Index Sorting How to trade index speed for search speed
  13. 13. Anatomy of a Lucene index Index = collection of immutable segments Segments store documents sequentially on disk Add data = create a new segment Segments get eventually merged together Order of segments / documents in segments doesn’t matter – the following segments are equivalent Id 1 3 10 4 7 20 42 11 9 8 15 18 30 31 99 5 12 Price 9 0 7 8 2 2 1 8 10 3 4 4 6 10 1 1 13 Id 12 9 31 1 4 11 10 30 15 18 8 7 20 42 99 5 3 Price 13 10 10 9 8 8 7 6 4 4 3 2 2 1 1 1 0
  14. 14. Anatomy of a Lucene index ordinal of a doc in a segment = doc id used in the inverted index to refer to docs shoe 1, 3, 5, 8, 11, 13, 15 doc id 0 1 2 3 4 5 7 8 9 10 11 12 13 14 15 16 Id 1 3 10 4 7 20 42 11 9 8 15 18 30 31 99 5 12 Price 9 0 7 8 2 2 10 3 4 1 13 6 1 8 4 6 10 1
  15. 15. Top hits Get top N=2 results: – Create a priority queue of size N – Accumulate matching docs Id 1 3 10 4 7 20 42 11 9 8 15 18 30 31 99 5 12 Price 9 0 7 8 2 2 1 8 10 3 4 4 6 10 1 1 13 () (3) Create an empty priority queue (3,4) (4,20) (4,9) Automatic overflow of the priority queue to remove the least one (4,9) (9,31) (9,31) Top hits
  16. 16. Early termination Let’s do the same on a sorted index Id 12 9 31 1 4 11 10 30 15 18 8 7 20 42 99 5 3 Price 13 10 10 9 8 8 7 6 4 4 3 2 2 1 1 1 0 () (9) (9,31) (9,31) Priority queue never changes after this document (9,31) (9,31) (9,31) (9,31)
  17. 17. Early termination Pros – makes finding the top hits much faster – file-system cache-friendly Cons – only works for static ranks – not if the sort order depends on the query – requires the index to be sorted – doesn’t work for tasks that require visiting every doc: – total number of matches – faceting
  18. 18. Static ranks Not uncommon! Graph-based ranks – Google’s PageRank Facebook social search / Unicorn – https://www.facebook.com/publications/219621248185635 Many more... Doesn’t need to be the exact sort order – heuristics when score is only a function of the static rank
  19. 19. Offline sorting A live index can’t be kept sorted – would require inserting docs between existing docs! – segments are immutable Offline sorting to the rescue: – index as usual – sort into a new index – search! Pros/cons – super fast to search, the whole index is fully sorted – but only works for static content
  20. 20. Offline Sorting // open a reader on the unsorted index and create a sorted (but slow) view DirectoryReader reader = DirectoryReader.open(in); boolean ascending = false; Sorter sorter = new NumericDocValuesSorter("price", ascending); AtomicReader sortedReader = SortingAtomicReader.wrap( SlowCompositeReaderWrapper.wrap(reader), sorter); // copy the content of the sorted reader to the new dir IndexWriter writer = new IndexWriter(out, iwConf); writer.addIndexes(sortedReader); writer.close(); reader.close();
  21. 21. Online sorting? Sort segments independently – wouldn’t require inserting data into existing segments – collection could still be early-terminated on a per-segment basis But segments are immutable – must be sorted before starting writing them
  22. 22. Online sorting? 2 sources of segments – flush – merge flushed segments can’t be sorted – Lucene writes stored fields to disk on the fly – could be buffered but this would require a lot of memory merged segments can be sorted – create a sorted view over the segments to merge – pass this view to SegmentMerger instead of the original segments not a bad trade-off – flushed segments are usually small & fast to collect
  23. 23. Online sorting? Merged segments can easily take 99+% of the size of the index Merged segments Flushed segments - NRT reopens - RAM buffer size limit hit http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
  24. 24. Online Sorting IndexWriterConfig iwConf = new IndexWriterConfig(...); // original MergePolicy finds the segments to merge MergePolicy origMP = iwConf.getMergePolicy(); // SortingMergePolicy wraps the segments with a sorted view boolean ascending = false; Sorter sorter = new NumericDocValuesSorter("price", ascending); MergePolicy sortingMP = new SortingMergePolicy(origMP, sorter); // setup IndexWriter to use SortingMergePolicy iwConf.setMergePolicy(sortingMP); IndexWriter writer = new IndexWriter(dir, iwConf); // index as usual
  25. 25. Early termination Collect top N matches Offline sorting – index sorted globally – early terminate after N matches have been collected – no priority queue needed! Online sorting – no early termination on flushed segments – early termination on merged segments – if N matches have been collected – or if current match is less than the top of the PQ
  26. 26. Early Termination class MyCollector extends Collector { @Override public void setNextReader(AtomicReaderContext context) throws IOException { readerIsSorted = SortingMergePolicy.isSorted(context.reader(), sorter); collected = 0; } @Override public void collect(int doc) throws IOException { if (readerIsSorted && (++collected >= maxDocsToCollect || curVal <= pq.top()) { // Special exception that tells IndexSearcher to terminate // collection of the current segment throw new CollectionTerminatedException(); } else { // collect hit } } }
  27. 27. Questions?
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×