Willnauer today tomorrow_and_beyond_eurocon2011

  1. Lucene today, tomorrow and beyond • Simon Willnauer, Apache Lucene Core Committer & PMC Chair • simonw@apache.org / simon.willnauer@searchworkings.org • Thursday, October 20, 2011
  2. Who am I? • Lucene Core Committer • Project Management Committee (PMC) Chair • Apache Member • BerlinBuzzwords Co-Founder • Addicted to OpenSource • Apache Solr & Lucene User / Consultant / Promoter
  3. http://www.searchworkings.org • Community Portal targeting OpenSource Search
  4. What makes this talk different? • Most of the talks here present what Lucene can do or what people do with Lucene, right? • This talk will show what Lucene can’t do today (trunk) but might be doing in the future. • I won’t talk about what people are going to do in the future - maybe next time :)
  5. Let’s go back in time a bit • [Timeline figure, year axis starting at 2001. Milestones shown: Lucene joined the ASF; Lucene becomes Apache TLP; Lucene 1.2; Lucene 1.4; Lucene 2.0; Lucene 2.1 & 2.2; Lucene 2.3; Lucene 2.4; Lucene 2.9 & 3.0; Lucene & Solr Merge; Lucene 3.1 - 3.4; Lucene 4.0?] • Happy Birthday!
  6. And who did all the work? • Created from Lucene core CHANGES.TXT • Especially “via” is interesting since we use this for contributions from non-committers (FooBar via $committer_name)
  7. Let’s make this a fair game! • 28 committers from 8 different countries
  8. And the companies
  9. Where are we now - once 4.0 is out? • Lucene 4.0 contains a ton of smallish improvements • Lots of refined APIs • Large speed improvements • New modules • And lots of paths to explore for the future!
  10. Some random improvements • FuzzyQuery speedup by 20000% (yes 20k!) • Indexing throughput improvements of 200% to 280% • Document filtering speedup of up to 480% • Loading term dictionaries up to 30x faster using 10% of the memory compared to 3.x • 600000 key-value lookups/second • Tremendous reduction of GC needs at runtime • Your mileage may vary!
  11. Flexible Indexing & Codecs • Allows customizing the low-level index structure per field • Yields significant performance gains depending on the use-case • Highly optimized data-structures • Allows future improvements thanks to per-codec backwards compatibility • Lets you decide on memory consumption
  12. IndexDocValues • Value per field & document - similar to FieldCache • Type-safe and efficient on-disk & in-memory access • Soon updatable • More flexible than FieldCache • Fast loading times
  13. Flexible Scoring • New ranking models in addition to VSM • Adds key statistics to the Lucene index to support other scoring models • Decoupled matching from ranking • Powerful Similarity API (can use IndexDocValues)
  14. What else? • DocumentsWriterPerThread • High throughput incremental indexing • Preparation for RT-Search • AutomatonQuery (FuzzyQuery) • Query as a Deterministic Finite Automaton (DFA) • Levenshtein Automata for fast Fuzzy Queries (up to 20000% improvement over 3.x) • Flexible Automata concatenation
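The idea behind "Levenshtein Automata for fast Fuzzy Queries" can be sketched in a few lines. This is a minimal illustration, not Lucene's implementation: Lucene precomputes a real DFA, while the sketch below simulates the automaton lazily, using one row of the edit-distance table as its state. The names `LevenshteinAutomaton` and `fuzzy_terms` are made up for this example.

```python
class LevenshteinAutomaton:
    """Lazily simulated automaton accepting strings within max_edits of term.

    Hypothetical helper for illustration only, not a Lucene class.
    """

    def __init__(self, term, max_edits):
        self.term = term
        self.max_edits = max_edits

    def start(self):
        # State = one row of the edit-distance table: entry i is the cost of
        # turning the first i characters of term into the input read so far.
        return tuple(range(len(self.term) + 1))

    def step(self, state, ch):
        new_state = [state[0] + 1]
        for i, term_ch in enumerate(self.term):
            cost = 0 if term_ch == ch else 1
            new_state.append(min(new_state[i] + 1,    # drop a character of term
                                 state[i] + cost,     # match or substitution
                                 state[i + 1] + 1))   # extra input character
        return tuple(new_state)

    def is_match(self, state):
        return state[-1] <= self.max_edits

    def can_match(self, state):
        # The row minimum never decreases, so this is a safe "dead state" test.
        return min(state) <= self.max_edits


def fuzzy_terms(dictionary, term, max_edits):
    """Return all dictionary terms within max_edits of term."""
    automaton = LevenshteinAutomaton(term, max_edits)
    matches = []
    for candidate in dictionary:
        state = automaton.start()
        for ch in candidate:
            state = automaton.step(state, ch)
            if not automaton.can_match(state):
                break  # abandon this candidate early
        else:
            if automaton.is_match(state):
                matches.append(candidate)
    return matches
```

The speedup in practice comes from intersecting such an automaton with the sorted term dictionary so that whole ranges of terms are skipped; the early `break` above only hints at that.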
  15. This is what we get with Lucene 4.0 (roughly) • What is missing in this picture? • Where are we going? • What comes after 4.0? • What is not going to make it into 4.0? • All this boils down to: “What do WE & YOU want Lucene to become in the future?”
  16. Lucene - a Full Text Search Library • CORE SEARCH FEATURES! - LIMITATIONS?
  17. Positions - not a first class citizen • We have: • Spans (Near, First, MultiTerm...) • PhraseQuery (sloppy & strict) • The Problem: • Either use the “common” query hierarchy or Spans • Score ALL or NOTHING • Scoring lots of documents takes ages
  18. Positions - not a first class citizen • Solutions? • Multi-Phase searches • Collect documents without positions • Re-score top N based on position data • Query hierarchy can be complex • We need an API with the same granularity as Scorer • Span semantics should not be bound to a query • Divorce scoring & matching for positions
  19. Positions - not a first class citizen • What about highlighting? • The implementation is a mess • Tons of if (query instanceof FooQuery) • Hard to extend for custom queries • First steps are already taken! • http://svn.apache.org/repos/asf/lucene/dev/branches/positions/ • Scorer allows pulling positions for any query - Help Wanted!
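The multi-phase search idea from the positions slides (collect documents without positions, then re-score only the top N using position data) might look roughly like the sketch below. It assumes documents are plain token lists and uses a deliberately naive scoring; `first_pass`, `min_distance` and `rescore` are hypothetical names, not Lucene APIs.

```python
import heapq


def first_pass(docs, terms, top_n):
    """Phase 1: rank by term frequency only, no position data is consulted."""
    scored = []
    for doc_id, tokens in docs.items():
        if all(term in tokens for term in terms):
            score = sum(tokens.count(term) for term in terms)
            scored.append((score, doc_id))
    return [doc_id for _, doc_id in heapq.nlargest(top_n, scored)]


def min_distance(tokens, a, b):
    """Smallest gap between any occurrence of a and any occurrence of b."""
    pos_a = [i for i, token in enumerate(tokens) if token == a]
    pos_b = [i for i, token in enumerate(tokens) if token == b]
    return min(abs(i - j) for i in pos_a for j in pos_b)


def rescore(docs, candidates, a, b):
    """Phase 2: re-rank only the candidates, closest term pair first."""
    return sorted(candidates, key=lambda d: (min_distance(docs[d], a, b), d))
```

The expensive position work in `rescore` runs over at most `top_n` documents rather than every match, which is the point of splitting the search into phases.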
  20. Updates - Huh? Incremental you know! • Everybody wants it, right? • Updating a field without reindexing the entire doc? Yeah! • Watch out, this does not come for free! • You can’t simply update a field - it’s an inverted index! • Term -> [ (docID, freq) ] (how to update this?) • Lucene is write once - no in-place updates (which is good!) • We have to write per-field, per-segment deltas and merge them on IndexReader open?! - seems tricky? • Lots of paths need to be explored - maybe “appending fields”?
  21. Updates - Huh? Incremental you know!
      Documents:
        1  The old night keeper keeps the keep in the town
        2  In the big old house in the big old gown.
        3  The house in the town had the big old keep
        4  Where the old night keeper never did sleep.
        5  The night keeper keeps the keep in the night
        6  And keeps in the dark and sleeps in the light.
      Inverted index:
        term     freq  posting list
        and      1     6
        big      2     2 3
        dark     1     6
        did      1     4
        gown     1     2
        had      1     3
        house    2     2 3
        in       5     1 2 3 5 6
        keep     3     1 3 5
        keeper   3     1 4 5
        keeps    3     1 5 6
        light    1     6
        never    1     4
        night    3     1 4 5
        old      4     1 2 3 4
        sleep    1     4
        sleeps   1     6
        the      6     1 2 3 4 5 6
        town     2     1 3
        where    1     4
      Updated document:
        2  In the small old house in the big old gown.
      Annotations: update freq & postings • insert new term
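To make the update problem concrete, here is a small sketch that builds a Term -> (docID, freq) mapping for the six example documents and then replaces document 2. It is an in-memory toy under simplifying assumptions (lowercasing, whitespace tokenization, mutable dictionaries), whereas Lucene segments are write-once; all function names are made up for this example.

```python
from collections import Counter, defaultdict

DOCS = {
    1: "The old night keeper keeps the keep in the town",
    2: "In the big old house in the big old gown.",
    3: "The house in the town had the big old keep",
    4: "Where the old night keeper never did sleep.",
    5: "The night keeper keeps the keep in the night",
    6: "And keeps in the dark and sleeps in the light.",
}


def tokenize(text):
    # Naive analysis chain: lowercase, drop periods, split on whitespace.
    return text.lower().replace(".", "").split()


def build_index(docs):
    """Return term -> {doc_id: frequency of term within that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, freq in Counter(tokenize(text)).items():
            index[term][doc_id] = freq
    return index


def update_document(index, docs, doc_id, new_text):
    """Replace one document and return every term whose postings were touched."""
    touched = set()
    # Every term of the old version must be visited to remove the document.
    for term in set(tokenize(docs[doc_id])):
        del index[term][doc_id]
        if not index[term]:
            del index[term]
        touched.add(term)
    # Every term of the new version gets a fresh posting, new terms included.
    for term, freq in Counter(tokenize(new_text)).items():
        index[term][doc_id] = freq
        touched.add(term)
    docs[doc_id] = new_text
    return touched
```

Changing a single word in document 2 ("big" to "small") touches the posting list of every term in the document, lowers the frequency stored for "big", and inserts a posting list for the new term "small". With immutable segments none of these lists can be edited in place, which is why field updates are hard.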
  22. Updates - Huh? Incremental you know! • Much easier (and closer) for not-indexed values • IndexDocValues • Assumption: • Document Title OR Body changes are infrequent • PageRank OR User-Ratings change very frequently • Maybe available in 4.0 • Bottom Line: this is still far away but on the list!
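Why not-indexed values are the easier case can be seen from a column-style layout: one value per document, addressed by docID, so changing a rating is a single slot write with no posting lists involved. The sketch below is an assumption about the general shape only, not the IndexDocValues API; `RatingColumn` and `top_by_rating` are hypothetical names.

```python
from array import array


class RatingColumn:
    """One float per document, addressed directly by docID (illustration only)."""

    def __init__(self, max_doc, default=0.0):
        self._values = array("d", [default]) * max_doc

    def get(self, doc_id):
        return self._values[doc_id]

    def set(self, doc_id, value):
        # Updating a value touches exactly one slot and no postings.
        self._values[doc_id] = value


def top_by_rating(hits, column, n):
    """Order matching docIDs by their column value, highest first."""
    return sorted(hits, key=lambda doc_id: (-column.get(doc_id), doc_id))[:n]
```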
  23. The JVM - or is it the JIT? • Unpredictable Mr. JIT • Grouping benchmark changes Spans? WTF?
  24. The JVM - or is it the JIT? • The cost of a virtual method call • ConjunctionScorer • Code Specialization
  25. The JVM - or is it the JIT? • Lucene has a lot of HOT loops • Each TermScorer needs DocID & TermFreq for every possible hit • Calling DocsEnum#next() & #freq() adds up • Inlining seems unreliable • Solutions?
  26. Possible Solutions / Paths to explore • Native Code / Generation (that’s gonna be fun!) • Code Specialization • Can bring 50% to 100% performance improvements • ByteCode Generation & Query Compilation • Prototypes for FunctionQuery yield 300% speed improvements • Bulk Reading APIs - BulkPostings branch - watch out, it’s hairy • Reading more than one DocID / TermFreq at a time • More than one step backwards - API wise
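The difference between calling next() and freq() once per document and a bulk reading API can be illustrated as follows. This sketch only shows the shape of the two access patterns; it says nothing about JIT behaviour, and none of the names below come from Lucene or the BulkPostings branch.

```python
NO_MORE_DOCS = -1


class OneAtATimePostings:
    """Per-document access: two method calls for every hit (illustration only)."""

    def __init__(self, doc_ids, freqs):
        self._doc_ids = doc_ids
        self._freqs = freqs
        self._pos = -1

    def next(self):
        self._pos += 1
        if self._pos >= len(self._doc_ids):
            return NO_MORE_DOCS
        return self._doc_ids[self._pos]

    def freq(self):
        return self._freqs[self._pos]


def bulk_postings(doc_ids, freqs, block_size=64):
    """Bulk access: hand out whole blocks of docIDs and freqs per call."""
    for start in range(0, len(doc_ids), block_size):
        end = start + block_size
        yield doc_ids[start:end], freqs[start:end]


def total_freq_one_at_a_time(doc_ids, freqs):
    postings = OneAtATimePostings(doc_ids, freqs)
    total = 0
    while postings.next() != NO_MORE_DOCS:
        total += postings.freq()
    return total


def total_freq_bulk(doc_ids, freqs, block_size=64):
    return sum(sum(freq_block)
               for _, freq_block in bulk_postings(doc_ids, freqs, block_size))
```

Both functions compute the same total; the bulk variant makes one call per block instead of two calls per document, at the price of an API that exposes raw blocks to the caller.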
  27. ByteCode generation • Specializing Queries at Runtime? • Might bring nice speed improvements per use-case • Problems arise with testing and correctness? • Could help tremendously with bulk postings • Some people say the API is unusable (Uwe?) • Maybe you don’t need to use it at all? • Would be nice if you could specify your query on a very high level and Lucene generates optimal code for you?
  28. The Future beyond the core • Users have two options • Nothing - plain Lucene (well, it’s a lot already - a lot to code) • All - Solr / ElasticSearch etc. • I’d like something in between, you?
  29. <dream>Lucene 5.0</dream> • Actually, XML is backwards: { “dream” : “Lucene 5.0” } • Solr has grown, grown large and is showing its age! • 95% of the time I only want one or two “services” Solr provides • Still, I’ve got to use it - all or nothing! • I have to set up a (to me) heavyweight container (5 years ago Jetty / Tomcat was lightweight - times ’r changing) • I’ve got to figure out this documentation - fair enough!
  30. {“dream” : “Lucene 5.0”} • Can we get this more modular, lightweight & lean? • I’d rather do some coding than configure 2 lines of XML, you? • [Diagram labels: Suggestions, Replication, Faceting, Modules, CoreUtils, Grouping, Spellchecking, Durability / Recovery, Join; today, tomorrow]
  31. Isn’t this what Solr is? • Not quite! • Lucene tries to provide APIs where you can hardly take anything away • When I think of Solr, you can hardly add anything • Everybody should be able to build their own $Solr • How hard will it be to draw the line? • Who is going to benefit?
  32. Back to {“dream” : “Lucene 5.0”} • Can we go one step further? • [Diagram labels: Service - Module, HTTP - Module] • ElasticSearch did a great job making things dead simple! • We should follow this example, and less might be more eventually! • Taking it as far as ElasticSearch (all or nothing again) does not seem the right path for Lucene, but simple is good, no?
  33. Disclaimer • This was my personal vision, maybe not the one other people have. • Let’s see what the community wants / needs - It’s all about the users!
  34. Questions • Thank you!
