SlideShare a Scribd company logo
1 of 28
Download to read offline
Recent Additions to Lucene’s Arsenal
Shai Erera, Researcher, IBM
Adrien Grand, ElasticSearch
Who We Are
•

Shai Erera
–
–
–
–

•

Working at IBM – Information Retrieval Research
Lucene/Solr committer and PMC member
http://shaierera.blogspot.com
shaie@apache.org

Adrien Grand
–
–
–

@jpountz
Lucene/Solr committer and PMC member
Software engineer at Elasticsearch
The Replicator
Load Balancing

Load
Balancer
Failover
Index Backup
Replicator

Replication
Client

The Replicator

Backup

Replication
Client

Primary

Backup
http://shaierera.blogspot.com/2013/05/the-replicator.html
Replication Components
•

Replicator
–
–
–

•

Revision
–
–

•

Describes a list of files and metadata
Responsible to ensure the files are available as long as clients replicate it

ReplicationClient
–
–
–

•

Mediates between the client and server
Manages the published Revisions
Implementation for replication over HTTP

Performs the replication operation on the replica side
Copies delta files and invokes ReplicationHandler upon successful copy
Always replicates latest revision

ReplicationHandler
–

Acts on the copied files
Index Replication
•

IndexRevision
–
–

•

IndexReplicationHandler
–
–
–

•

Obtains a snapshot on the last commit through SnapshotDeletionPolicy
Released when revision is released by Replicator
Copies the files to the index directory and fsync them
Aborts (rollback) on any error
Upon successful completion, invokes a callback (e.g.
SearcherManager.maybeRefresh())

Similar extensions for faceted index replication
–
–

IndexAndTaxonomyRevision: obtains snapshots on both the search and taxonomy
indexes
IndexAndTaxonomyReplicationHandler: copies the files to the respective
directories, keeping both in sync
Sample Code
// Server-side: publish a new Revision
Replicator replicator = new LocalReplicator();
replicator.publish(new IndexRevision(indexWriter));
// Client-side: replicate a Revision
Replicator replicator; // either LocalReplicator or HttpReplicator
// refresh SearcherManager after index is updated
Callable<Boolean> callback = new Callable<Boolean>() {
public Boolean call() throws Exception {
// index was updated, refresh manager
searcherManager.maybeRefresh();
}
}
ReplicationHandler handler = new IndexReplicationHandler(indexDir, callback);
SourceDirectoryFactory factory = new PerSessionDirectoryFactory(workDir);
ReplicationClient client = new ReplicationClient(replicator, handler, factory);
client.updateNow(); // invoke client manually
// -- OR -client.startUpdateThread(30000); // check for updates every 30 seconds
Future Work
•

Resume
–
–

•

Parallel Replication
–

•

Session level: don’t copy files that were already successfully copied
File level: don’t copy file parts that were already successfully copied
Copy revision files in parallel

Other replication strategies
–

Peer-to-peer
Index Sorting
How to trade index speed for search speed
Anatomy of a Lucene index
Index = collection of immutable segments
Segments store documents sequentially on disk
Add data = create a new segment
Segments get eventually merged together
Order of segments / documents in segments doesn’t matter
– the following segments are equivalent

Id

1

3

10

4

7

20

42

11

9

8

15

18

30

31

99

5

12

Price

9

0

7

8

2

2

1

8

10

3

4

4

6

10

1

1

13

Id

12

9

31

1

4

11

10

30

15

18

8

7

20

42

99

5

3

Price

13

10

10

9

8

8

7

6

4

4

3

2

2

1

1

1

0
Anatomy of a Lucene index
ordinal of a doc in a segment = doc id
used in the inverted index to refer to docs

shoe

1, 3, 5, 8, 11, 13, 15

doc id

0

1

2

3

4

5

7

8

9

10 11 12 13 14 15 16

Id

1

3

10

4

7

20 42 11

9

8

15 18 30 31 99

5

12

Price

9

0

7

8

2

2

10

3

4

1

13

6
1

8

4

6

10

1
Top hits
Get top N=2 results:
– Create a priority queue of size N
– Accumulate matching docs

Id

1

3

10

4

7

20

42

11

9

8

15

18

30

31

99

5

12

Price

9

0

7

8

2

2

1

8

10

3

4

4

6

10

1

1

13

()

(3)

Create an empty
priority queue

(3,4)

(4,20)

(4,9)

Automatic overflow of the
priority queue to remove the
least one

(4,9)

(9,31) (9,31)

Top hits
Early termination
Let’s do the same on a sorted index

Id

12

9

31

1

4

11

10

30

15

18

8

7

20

42

99

5

3

Price

13

10

10

9

8

8

7

6

4

4

3

2

2

1

1

1

0

()

(9)

(9,31) (9,31)

Priority queue never
changes after this
document

(9,31)

(9,31)

(9,31) (9,31)
Early termination
Pros
– makes finding the top hits much faster
– file-system cache-friendly
Cons
– only works for static ranks
– not if the sort order depends on the query
– requires the index to be sorted
– doesn’t work for tasks that require visiting every doc:
– total number of matches
– faceting
Static ranks
Not uncommon!
Graph-based ranks
– Google’s PageRank
Facebook social search / Unicorn
– https://www.facebook.com/publications/219621248185635
Many more...

Doesn’t need to be the exact sort order
– heuristics when score is only a function of the static rank
Offline sorting
A live index can’t be kept sorted
– would require inserting docs between existing docs!
– segments are immutable
Offline sorting to the rescue:
– index as usual
– sort into a new index
– search!
Pros/cons
– super fast to search, the whole index is fully sorted
– but only works for static content
Offline Sorting
// open a reader on the unsorted index and create a sorted (but slow) view
DirectoryReader reader = DirectoryReader.open(in);
boolean ascending = false;
Sorter sorter = new NumericDocValuesSorter("price", ascending);
AtomicReader sortedReader = SortingAtomicReader.wrap(
SlowCompositeReaderWrapper.wrap(reader), sorter);
// copy the content of the sorted reader to the new dir
IndexWriter writer = new IndexWriter(out, iwConf);
writer.addIndexes(sortedReader);
writer.close();
reader.close();
Online sorting?
Sort segments independently
– wouldn’t require inserting data into existing segments
– collection could still be early-terminated on a per-segment basis
But segments are immutable
– must be sorted before starting writing them
Online sorting?
2 sources of segments
– flush
– merge
flushed segments can’t be sorted
– Lucene writes stored fields to disk on the fly
– could be buffered but this would require a lot of memory
merged segments can be sorted
– create a sorted view over the segments to merge
– pass this view to SegmentMerger instead of the original segments
not a bad trade-off
– flushed segments are usually small & fast to collect
Online sorting?

Merged segments can easily take 99+%
of the size of the index

Merged segments

Flushed segments
- NRT reopens
- RAM buffer size limit hit

http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
Online Sorting
IndexWriterConfig iwConf = new IndexWriterConfig(...);
// original MergePolicy finds the segments to merge
MergePolicy origMP = iwConf.getMergePolicy();
// SortingMergePolicy wraps the segments with a sorted view
boolean ascending = false;
Sorter sorter = new NumericDocValuesSorter("price", ascending);
MergePolicy sortingMP = new SortingMergePolicy(origMP, sorter);
// setup IndexWriter to use SortingMergePolicy
iwConf.setMergePolicy(sortingMP);
IndexWriter writer = new IndexWriter(dir, iwConf);
// index as usual
Early termination
Collect top N matches
Offline sorting
– index sorted globally
– early terminate after N matches have been collected
– no priority queue needed!
Online sorting
– no early termination on flushed segments
– early termination on merged segments
– if N matches have been collected
– or if current match is less than the top of the PQ
Early Termination
class MyCollector extends Collector {
@Override
public void setNextReader(AtomicReaderContext context) throws IOException {
readerIsSorted = SortingMergePolicy.isSorted(context.reader(), sorter);
collected = 0;
}
@Override
public void collect(int doc) throws IOException {
if (readerIsSorted &&
(++collected >= maxDocsToCollect || curVal <= pq.top()) {
// Special exception that tells IndexSearcher to terminate
// collection of the current segment
throw new CollectionTerminatedException();
} else {
// collect hit
}
}
}
Questions?

More Related Content

What's hot

Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with SolrErik Hatcher
 
Lucene's Latest (for Libraries)
Lucene's Latest (for Libraries)Lucene's Latest (for Libraries)
Lucene's Latest (for Libraries)Erik Hatcher
 
Solr Recipes Workshop
Solr Recipes WorkshopSolr Recipes Workshop
Solr Recipes WorkshopErik Hatcher
 
Multi faceted responsive search, autocomplete, feeds engine & logging
Multi faceted responsive search, autocomplete, feeds engine & loggingMulti faceted responsive search, autocomplete, feeds engine & logging
Multi faceted responsive search, autocomplete, feeds engine & logginglucenerevolution
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introductionotisg
 
Hacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsHacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsOpenSource Connections
 
Solr Application Development Tutorial
Solr Application Development TutorialSolr Application Development Tutorial
Solr Application Development TutorialErik Hatcher
 
Tutorial on developing a Solr search component plugin
Tutorial on developing a Solr search component pluginTutorial on developing a Solr search component plugin
Tutorial on developing a Solr search component pluginsearchbox-com
 
Solr Query Parsing
Solr Query ParsingSolr Query Parsing
Solr Query ParsingErik Hatcher
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with SolrErik Hatcher
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with SolrErik Hatcher
 
Solr Indexing and Analysis Tricks
Solr Indexing and Analysis TricksSolr Indexing and Analysis Tricks
Solr Indexing and Analysis TricksErik Hatcher
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr DevelopersErik Hatcher
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)dnaber
 
code4lib 2011 preconference: What's New in Solr (since 1.4.1)
code4lib 2011 preconference: What's New in Solr (since 1.4.1)code4lib 2011 preconference: What's New in Solr (since 1.4.1)
code4lib 2011 preconference: What's New in Solr (since 1.4.1)Erik Hatcher
 
Building your own search engine with Apache Solr
Building your own search engine with Apache SolrBuilding your own search engine with Apache Solr
Building your own search engine with Apache SolrBiogeeks
 

What's hot (20)

Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Lucene's Latest (for Libraries)
Lucene's Latest (for Libraries)Lucene's Latest (for Libraries)
Lucene's Latest (for Libraries)
 
Solr Recipes Workshop
Solr Recipes WorkshopSolr Recipes Workshop
Solr Recipes Workshop
 
Multi faceted responsive search, autocomplete, feeds engine & logging
Multi faceted responsive search, autocomplete, feeds engine & loggingMulti faceted responsive search, autocomplete, feeds engine & logging
Multi faceted responsive search, autocomplete, feeds engine & logging
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introduction
 
Hacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsHacking Lucene for Custom Search Results
Hacking Lucene for Custom Search Results
 
Solr Application Development Tutorial
Solr Application Development TutorialSolr Application Development Tutorial
Solr Application Development Tutorial
 
Lucene and MySQL
Lucene and MySQLLucene and MySQL
Lucene and MySQL
 
Tutorial on developing a Solr search component plugin
Tutorial on developing a Solr search component pluginTutorial on developing a Solr search component plugin
Tutorial on developing a Solr search component plugin
 
Solr Query Parsing
Solr Query ParsingSolr Query Parsing
Solr Query Parsing
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Solr 4
Solr 4Solr 4
Solr 4
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Solr Indexing and Analysis Tricks
Solr Indexing and Analysis TricksSolr Indexing and Analysis Tricks
Solr Indexing and Analysis Tricks
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Apache Solr Workshop
Apache Solr WorkshopApache Solr Workshop
Apache Solr Workshop
 
Lucene basics
Lucene basicsLucene basics
Lucene basics
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
 
code4lib 2011 preconference: What's New in Solr (since 1.4.1)
code4lib 2011 preconference: What's New in Solr (since 1.4.1)code4lib 2011 preconference: What's New in Solr (since 1.4.1)
code4lib 2011 preconference: What's New in Solr (since 1.4.1)
 
Building your own search engine with Apache Solr
Building your own search engine with Apache SolrBuilding your own search engine with Apache Solr
Building your own search engine with Apache Solr
 

Similar to Recent Additions to Lucene Arsenal

Did you mean 'Galene'?
Did you mean 'Galene'?Did you mean 'Galene'?
Did you mean 'Galene'?Azeem Mohammad
 
Cdcr apachecon-talk
Cdcr apachecon-talkCdcr apachecon-talk
Cdcr apachecon-talkAmrit Sarkar
 
SplunkLive! Advanced Session
SplunkLive! Advanced SessionSplunkLive! Advanced Session
SplunkLive! Advanced SessionSplunk
 
Micro frontend: The microservices puzzle extended to frontend
Micro frontend: The microservices puzzle  extended to frontendMicro frontend: The microservices puzzle  extended to frontend
Micro frontend: The microservices puzzle extended to frontendAudrey Neveu
 
Hands on training on DbFit Part-II
Hands on training on DbFit Part-IIHands on training on DbFit Part-II
Hands on training on DbFit Part-IIBabul Mirdha
 
High Performance Mysql
High Performance MysqlHigh Performance Mysql
High Performance Mysqlliufabin 66688
 
Andrzej bialecki lr-2013-dublin
Andrzej bialecki lr-2013-dublinAndrzej bialecki lr-2013-dublin
Andrzej bialecki lr-2013-dublinlucenerevolution
 
Professionalizing the Front-end
Professionalizing the Front-endProfessionalizing the Front-end
Professionalizing the Front-endJordi Anguela
 
Redshift Chartio Event Presentation
Redshift Chartio Event PresentationRedshift Chartio Event Presentation
Redshift Chartio Event PresentationChartio
 
SVN Tool Information : Best Practices
SVN Tool Information  : Best PracticesSVN Tool Information  : Best Practices
SVN Tool Information : Best PracticesMaidul Islam
 
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 Taipei
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 TaipeiPostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 Taipei
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 TaipeiSatoshi Nagayasu
 
SFDC Deployments
SFDC DeploymentsSFDC Deployments
SFDC DeploymentsSujit Kumar
 
Oracle AHF Insights 23c: Deeper Diagnostic Insights for your Oracle Database ...
Oracle AHF Insights 23c: Deeper Diagnostic Insights for your Oracle Database ...Oracle AHF Insights 23c: Deeper Diagnostic Insights for your Oracle Database ...
Oracle AHF Insights 23c: Deeper Diagnostic Insights for your Oracle Database ...Sandesh Rao
 
Oracle AHF Insights 23c: Deeper Diagnostic Insights for your Oracle Database
Oracle AHF Insights 23c: Deeper Diagnostic Insights for your Oracle DatabaseOracle AHF Insights 23c: Deeper Diagnostic Insights for your Oracle Database
Oracle AHF Insights 23c: Deeper Diagnostic Insights for your Oracle DatabaseSandesh Rao
 
Day 7 - Make it Fast
Day 7 - Make it FastDay 7 - Make it Fast
Day 7 - Make it FastBarry Jones
 

Similar to Recent Additions to Lucene Arsenal (20)

Did you mean 'Galene'?
Did you mean 'Galene'?Did you mean 'Galene'?
Did you mean 'Galene'?
 
Cdcr apachecon-talk
Cdcr apachecon-talkCdcr apachecon-talk
Cdcr apachecon-talk
 
SplunkLive! Advanced Session
SplunkLive! Advanced SessionSplunkLive! Advanced Session
SplunkLive! Advanced Session
 
Micro frontend: The microservices puzzle extended to frontend
Micro frontend: The microservices puzzle  extended to frontendMicro frontend: The microservices puzzle  extended to frontend
Micro frontend: The microservices puzzle extended to frontend
 
Hands on training on DbFit Part-II
Hands on training on DbFit Part-IIHands on training on DbFit Part-II
Hands on training on DbFit Part-II
 
High Performance Mysql
High Performance MysqlHigh Performance Mysql
High Performance Mysql
 
Andrzej bialecki lr-2013-dublin
Andrzej bialecki lr-2013-dublinAndrzej bialecki lr-2013-dublin
Andrzej bialecki lr-2013-dublin
 
Professionalizing the Front-end
Professionalizing the Front-endProfessionalizing the Front-end
Professionalizing the Front-end
 
Redshift Chartio Event Presentation
Redshift Chartio Event PresentationRedshift Chartio Event Presentation
Redshift Chartio Event Presentation
 
SVN Tool Information : Best Practices
SVN Tool Information  : Best PracticesSVN Tool Information  : Best Practices
SVN Tool Information : Best Practices
 
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 Taipei
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 TaipeiPostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 Taipei
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 Taipei
 
SFDC Deployments
SFDC DeploymentsSFDC Deployments
SFDC Deployments
 
ZooKeeper (and other things)
ZooKeeper (and other things)ZooKeeper (and other things)
ZooKeeper (and other things)
 
Oracle AHF Insights 23c: Deeper Diagnostic Insights for your Oracle Database ...
Oracle AHF Insights 23c: Deeper Diagnostic Insights for your Oracle Database ...Oracle AHF Insights 23c: Deeper Diagnostic Insights for your Oracle Database ...
Oracle AHF Insights 23c: Deeper Diagnostic Insights for your Oracle Database ...
 
Oracle AHF Insights 23c: Deeper Diagnostic Insights for your Oracle Database
Oracle AHF Insights 23c: Deeper Diagnostic Insights for your Oracle DatabaseOracle AHF Insights 23c: Deeper Diagnostic Insights for your Oracle Database
Oracle AHF Insights 23c: Deeper Diagnostic Insights for your Oracle Database
 
01 oracle architecture
01 oracle architecture01 oracle architecture
01 oracle architecture
 
System analysis
System analysisSystem analysis
System analysis
 
Breaking data
Breaking dataBreaking data
Breaking data
 
Sql optimize
Sql optimizeSql optimize
Sql optimize
 
Day 7 - Make it Fast
Day 7 - Make it FastDay 7 - Make it Fast
Day 7 - Make it Fast
 

More from lucenerevolution

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucenelucenerevolution
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! lucenerevolution
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solrlucenerevolution
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationslucenerevolution
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloudlucenerevolution
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiledlucenerevolution
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs lucenerevolution
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchlucenerevolution
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?lucenerevolution
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APIlucenerevolution
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMlucenerevolution
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside downlucenerevolution
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...lucenerevolution
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - finallucenerevolution
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadooplucenerevolution
 
A Novel methodology for handling Document Level Security in Search Based Appl...
A Novel methodology for handling Document Level Security in Search Based Appl...A Novel methodology for handling Document Level Security in Search Based Appl...
A Novel methodology for handling Document Level Security in Search Based Appl...lucenerevolution
 
How Lucene Powers the LinkedIn Segmentation and Targeting Platform
How Lucene Powers the LinkedIn Segmentation and Targeting PlatformHow Lucene Powers the LinkedIn Segmentation and Targeting Platform
How Lucene Powers the LinkedIn Segmentation and Targeting Platformlucenerevolution
 
Query Latency Optimization with Lucene
Query Latency Optimization with LuceneQuery Latency Optimization with Lucene
Query Latency Optimization with Lucenelucenerevolution
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friendslucenerevolution
 

More from lucenerevolution (20)

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucene
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solr
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloud
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST API
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside down
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadoop
 
A Novel methodology for handling Document Level Security in Search Based Appl...
A Novel methodology for handling Document Level Security in Search Based Appl...A Novel methodology for handling Document Level Security in Search Based Appl...
A Novel methodology for handling Document Level Security in Search Based Appl...
 
How Lucene Powers the LinkedIn Segmentation and Targeting Platform
How Lucene Powers the LinkedIn Segmentation and Targeting PlatformHow Lucene Powers the LinkedIn Segmentation and Targeting Platform
How Lucene Powers the LinkedIn Segmentation and Targeting Platform
 
Query Latency Optimization with Lucene
Query Latency Optimization with LuceneQuery Latency Optimization with Lucene
Query Latency Optimization with Lucene
 
10 keys to Solr's Future
10 keys to Solr's Future10 keys to Solr's Future
10 keys to Solr's Future
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friends
 

Recently uploaded

TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....rightmanforbloodline
 
Microsoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireMicrosoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireExakis Nelite
 
Vector Search @ sw2con for slideshare.pptx
Vector Search @ sw2con for slideshare.pptxVector Search @ sw2con for slideshare.pptx
Vector Search @ sw2con for slideshare.pptxjbellis
 
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)Paige Cruz
 
Design Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptxDesign Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptxFIDO Alliance
 
How to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfHow to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfdanishmna97
 
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...panagenda
 
Event-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream ProcessingEvent-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream ProcessingScyllaDB
 
State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!Memoori
 
Top 10 CodeIgniter Development Companies
Top 10 CodeIgniter Development CompaniesTop 10 CodeIgniter Development Companies
Top 10 CodeIgniter Development CompaniesTopCSSGallery
 
UiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overviewUiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overviewDianaGray10
 
Portal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russePortal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russe中 央社
 
WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024Lorenzo Miniero
 
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...caitlingebhard1
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightSafe Software
 
Continuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on ThanabotsContinuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on ThanabotsLeah Henrickson
 
2024 May Patch Tuesday
2024 May Patch Tuesday2024 May Patch Tuesday
2024 May Patch TuesdayIvanti
 
ADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxFIDO Alliance
 
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...Skynet Technologies
 
Navigating the Large Language Model choices_Ravi Daparthi
Navigating the Large Language Model choices_Ravi DaparthiNavigating the Large Language Model choices_Ravi Daparthi
Navigating the Large Language Model choices_Ravi DaparthiRaviKumarDaparthi
 

Recently uploaded (20)

TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
 
Microsoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireMicrosoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - Questionnaire
 
Vector Search @ sw2con for slideshare.pptx
Vector Search @ sw2con for slideshare.pptxVector Search @ sw2con for slideshare.pptx
Vector Search @ sw2con for slideshare.pptx
 
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
 
Design Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptxDesign Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptx
 
How to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfHow to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cf
 
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
 
Event-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream ProcessingEvent-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream Processing
 
State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!
 
Top 10 CodeIgniter Development Companies
Top 10 CodeIgniter Development CompaniesTop 10 CodeIgniter Development Companies
Top 10 CodeIgniter Development Companies
 
UiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overviewUiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overview
 
Portal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russePortal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russe
 
WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024
 
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and Insight
 
Continuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on ThanabotsContinuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
 
2024 May Patch Tuesday
2024 May Patch Tuesday2024 May Patch Tuesday
2024 May Patch Tuesday
 
ADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptx
 
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
 
Navigating the Large Language Model choices_Ravi Daparthi
Navigating the Large Language Model choices_Ravi DaparthiNavigating the Large Language Model choices_Ravi Daparthi
Navigating the Large Language Model choices_Ravi Daparthi
 

Recent Additions to Lucene Arsenal

  • 1.
  • 2. Recent Additions to Lucene’s Arsenal Shai Erera, Researcher, IBM Adrien Grand, ElasticSearch
  • 3. Who We Are • Shai Erera – – – – • Working at IBM – Information Retrieval Research Lucene/Solr committer and PMC member http://shaierera.blogspot.com shaie@apache.org Adrien Grand – – – @jpountz Lucene/Solr committer and PMC member Software engineer at Elasticsearch
  • 9. Replication Components • Replicator – – – • Revision – – • Describes a list of files and metadata Responsible to ensure the files are available as long as clients replicate it ReplicationClient – – – • Mediates between the client and server Manages the published Revisions Implementation for replication over HTTP Performs the replication operation on the replica side Copies delta files and invokes ReplicationHandler upon successful copy Always replicates latest revision ReplicationHandler – Acts on the copied files
  • 10. Index Replication • IndexRevision – – • IndexReplicationHandler – – – • Obtains a snapshot on the last commit through SnapshotDeletionPolicy Released when revision is released by Replicator Copies the files to the index directory and fsync them Aborts (rollback) on any error Upon successful completion, invokes a callback (e.g. SearcherManager.maybeRefresh()) Similar extensions for faceted index replication – – IndexAndTaxonomyRevision: obtains snapshots on both the search and taxonomy indexes IndexAndTaxonomyReplicationHandler: copies the files to the respective directories, keeping both in sync
  • 11. Sample Code // Server-side: publish a new Revision Replicator replicator = new LocalReplicator(); replicator.publish(new IndexRevision(indexWriter)); // Client-side: replicate a Revision Replicator replicator; // either LocalReplicator or HttpReplicator // refresh SearcherManager after index is updated Callable<Boolean> callback = new Callable<Boolean>() { public Boolean call() throws Exception { // index was updated, refresh manager searcherManager.maybeRefresh(); } } ReplicationHandler handler = new IndexReplicationHandler(indexDir, callback); SourceDirectoryFactory factory = new PerSessionDirectoryFactory(workDir); ReplicationClient client = new ReplicationClient(replicator, handler, factory); client.updateNow(); // invoke client manually // -- OR -client.startUpdateThread(30000); // check for updates every 30 seconds
  • 12. Future Work • Resume – – • Parallel Replication – • Session level: don’t copy files that were already successfully copied File level: don’t copy file parts that were already successfully copied Copy revision files in parallel Other replication strategies – Peer-to-peer
  • 13. Index Sorting How to trade index speed for search speed
  • 14. Anatomy of a Lucene index Index = collection of immutable segments Segments store documents sequentially on disk Add data = create a new segment Segments get eventually merged together Order of segments / documents in segments doesn’t matter – the following segments are equivalent Id 1 3 10 4 7 20 42 11 9 8 15 18 30 31 99 5 12 Price 9 0 7 8 2 2 1 8 10 3 4 4 6 10 1 1 13 Id 12 9 31 1 4 11 10 30 15 18 8 7 20 42 99 5 3 Price 13 10 10 9 8 8 7 6 4 4 3 2 2 1 1 1 0
  • 15. Anatomy of a Lucene index ordinal of a doc in a segment = doc id used in the inverted index to refer to docs shoe 1, 3, 5, 8, 11, 13, 15 doc id 0 1 2 3 4 5 7 8 9 10 11 12 13 14 15 16 Id 1 3 10 4 7 20 42 11 9 8 15 18 30 31 99 5 12 Price 9 0 7 8 2 2 10 3 4 1 13 6 1 8 4 6 10 1
  • 16. Top hits Get top N=2 results: – Create a priority queue of size N – Accumulate matching docs Id 1 3 10 4 7 20 42 11 9 8 15 18 30 31 99 5 12 Price 9 0 7 8 2 2 1 8 10 3 4 4 6 10 1 1 13 () (3) Create an empty priority queue (3,4) (4,20) (4,9) Automatic overflow of the priority queue to remove the least one (4,9) (9,31) (9,31) Top hits
  • 17. Early termination Let’s do the same on a sorted index Id 12 9 31 1 4 11 10 30 15 18 8 7 20 42 99 5 3 Price 13 10 10 9 8 8 7 6 4 4 3 2 2 1 1 1 0 () (9) (9,31) (9,31) Priority queue never changes after this document (9,31) (9,31) (9,31) (9,31)
  • 18. Early termination Pros – makes finding the top hits much faster – file-system cache-friendly Cons – only works for static ranks – not if the sort order depends on the query – requires the index to be sorted – doesn’t work for tasks that require visiting every doc: – total number of matches – faceting
  • 19. Static ranks Not uncommon! Graph-based ranks – Google’s PageRank Facebook social search / Unicorn – https://www.facebook.com/publications/219621248185635 Many more... Doesn’t need to be the exact sort order – heuristics when score is only a function of the static rank
  • 20. Offline sorting A live index can’t be kept sorted – would require inserting docs between existing docs! – segments are immutable Offline sorting to the rescue: – index as usual – sort into a new index – search! Pros/cons – super fast to search, the whole index is fully sorted – but only works for static content
  • 21. Offline Sorting // open a reader on the unsorted index and create a sorted (but slow) view DirectoryReader reader = DirectoryReader.open(in); boolean ascending = false; Sorter sorter = new NumericDocValuesSorter("price", ascending); AtomicReader sortedReader = SortingAtomicReader.wrap( SlowCompositeReaderWrapper.wrap(reader), sorter); // copy the content of the sorted reader to the new dir IndexWriter writer = new IndexWriter(out, iwConf); writer.addIndexes(sortedReader); writer.close(); reader.close();
  • 22. Online sorting? Sort segments independently – wouldn’t require inserting data into existing segments – collection could still be early-terminated on a per-segment basis But segments are immutable – must be sorted before starting writing them
  • 23. Online sorting? 2 sources of segments – flush – merge flushed segments can’t be sorted – Lucene writes stored fields to disk on the fly – could be buffered but this would require a lot of memory merged segments can be sorted – create a sorted view over the segments to merge – pass this view to SegmentMerger instead of the original segments not a bad trade-off – flushed segments are usually small & fast to collect
  • 24. Online sorting? Merged segments can easily take 99+% of the size of the index Merged segments Flushed segments - NRT reopens - RAM buffer size limit hit http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
  • 25. Online Sorting IndexWriterConfig iwConf = new IndexWriterConfig(...); // original MergePolicy finds the segments to merge MergePolicy origMP = iwConf.getMergePolicy(); // SortingMergePolicy wraps the segments with a sorted view boolean ascending = false; Sorter sorter = new NumericDocValuesSorter("price", ascending); MergePolicy sortingMP = new SortingMergePolicy(origMP, sorter); // setup IndexWriter to use SortingMergePolicy iwConf.setMergePolicy(sortingMP); IndexWriter writer = new IndexWriter(dir, iwConf); // index as usual
  • 26. Early termination Collect top N matches Offline sorting – index sorted globally – early terminate after N matches have been collected – no priority queue needed! Online sorting – no early termination on flushed segments – early termination on merged segments – if N matches have been collected – or if current match is less than the top of the PQ
  • 27. Early Termination class MyCollector extends Collector { @Override public void setNextReader(AtomicReaderContext context) throws IOException { readerIsSorted = SortingMergePolicy.isSorted(context.reader(), sorter); collected = 0; } @Override public void collect(int doc) throws IOException { if (readerIsSorted && (++collected >= maxDocsToCollect || curVal <= pq.top()) { // Special exception that tells IndexSearcher to terminate // collection of the current segment throw new CollectionTerminatedException(); } else { // collect hit } } }