
Apache Lucene 4


Slides from my presentation o

Published in: Technology


  1. Apache Lucene 4
     Andrzej Białecki, Robert Muir, Grant Ingersoll, LucidWorks
  2. Topics
     • Lucene 4 Beta released this week
     • Key Features
     • Community
     • Evaluation
  3. Features
     • Quick Hit:
       – Language Analysis
         • Unicode compliant
         • 32+ languages
         • 100+ TokenStreams
       – Ancillary
         • Faceting, spelling, MLT, Joins, collapsing, highlighting, benchmarking, …
     • More to come:
       – FSTs
       – Indexing and Storage
       – Search
  4. FS(A|T)
     • Keys:
       – byte[]
       – Write-once
       – Linear-time build of minimal automata (n log n if the input is not sorted, which isn’t our case)
       – Compression
       – Reverse lookups
       – Weights (used for auto-suggest)
       – Pluggable Algebra
     • Uses:
       – Term Dictionary, TokenStreams, Japanese, synonyms, spelling, others
       – FuzzyQuery is 100x faster: http://bit.ly/hgO65c
     • More:
       – http://slidesha.re/vKtpVA
       – http://bit.ly/Pkjyu0
       – “Smaller Representation of Finite State Automata”
         • Proc. of the 16th Int. Conf. on Implementation and Application of Automata, CIAA 2011, vol. 6807, 2011, pp. 118-192.
  5. Indexing and Storage
     • Segmented, write-once approach with merging
     • Fast: http://bit.ly/l8qE0i
       – 23.2 GB Wikipedia in 5 minutes
       – 270 GB/hour of plain text
     • Near Real Time Indexing/Search
     • Codecs
       – Abstraction for: Dictionaries, Postings, Field Storage, Term Vectors and more
       – Lucene40 is default
       – Uses Block Tree
       – For fun: SimpleTextCodec
     • Directory
       – Abstraction for IO
  6. Search
     • Many query types, query parsers, filtering capabilities
     • DAAT (document-at-a-time) evaluation, mostly
     • Pluggable Similarity
       – Many implementations and room for more
         • BM25, DFR, etc.
  7. Community
     • Large, diverse community with many non-traditional search engine usages
       – Object stores, record linkage, mobile, …
     • Always Be Testing
       – Randomized system tests are all the rage
       – http://vimeo.com/32087114
     • “The Apache Way”
     • You never know where the next good idea is coming from
  8. Evaluation
     • Performance
       – http://people.apache.org/~mikemccand/lucenebench/
       – http://people.apache.org/~mikemccand/lucenebench/indexing.html
     • Relevance
       – Many people have done private evaluations
       – Empirical/Anecdotal: $ queries, random sample
       – More needed
  9. Resources
     • http://lucene.apache.org
     • grant@lucidworks.com
     • @gsingers
     • http://www.lucidworks.com
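
The Search slide lists BM25 among the pluggable similarities. As a rough illustration of what such a scoring function computes, here is a minimal sketch of the textbook Okapi BM25 formula in plain Python. It is not Lucene's implementation: the function name `bm25_scores`, the toy documents, and the parameter values `k1=1.2` and `b=0.75` (commonly quoted defaults) are assumptions made for this example.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms`.

    Hypothetical helper for illustration only; it follows the textbook
    Okapi BM25 formula with a non-negative idf variant.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue  # term absent from this document
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / (tf[term] + norm)
        scores.append(score)
    return scores


# Toy corpus (assumed data, already tokenized on whitespace).
docs = [
    "lucene is a search library".split(),
    "lucene lucene lucene".split(),
    "finite state transducers compress the term dictionary".split(),
]
scores = bm25_scores(["lucene"], docs)
```

In this toy corpus the second document repeats the query term and is shorter than average, so it scores above the first, while the third document does not contain the term and scores zero. Swapping in a different formula behind the same function signature is the kind of change a pluggable similarity is meant to allow.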

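The Search slide also mentions DAAT evaluation, where all query terms are matched against one candidate document before moving to the next. The sketch below shows the idea for a conjunctive (AND) query over sorted postings lists. It is a simplified illustration, not Lucene's scorer code: `daat_intersect` and the sample postings are hypothetical, and real implementations use skip data rather than linear advancing.

```python
def daat_intersect(postings_lists):
    """Return the doc IDs that appear in every sorted postings list.

    Hypothetical helper: one cursor per term, all cursors advanced
    together toward the same candidate document.
    """
    if not postings_lists or any(not p for p in postings_lists):
        return []
    positions = [0] * len(postings_lists)
    matches = []
    while True:
        # The candidate is the largest doc ID currently under any cursor.
        candidate = max(p[i] for p, i in zip(postings_lists, positions))
        all_match = True
        for k, postings in enumerate(postings_lists):
            i = positions[k]
            # Advance this cursor to the first doc ID >= candidate.
            while i < len(postings) and postings[i] < candidate:
                i += 1
            positions[k] = i
            if i == len(postings):
                return matches  # one list is exhausted: no more matches
            if postings[i] != candidate:
                all_match = False
        if all_match:
            matches.append(candidate)
            # Move every cursor past the matched document.
            positions = [i + 1 for i in positions]
            if any(i >= len(p) for p, i in zip(postings_lists, positions)):
                return matches


# Assumed sample postings (sorted doc IDs) for three query terms.
result = daat_intersect([[1, 3, 5, 7, 9], [3, 4, 5, 9], [2, 3, 9, 10]])
```

Because each document is fully evaluated before the next one is considered, a scorer built this way only needs to keep the current candidate and the top hits in memory, rather than partial scores for every document.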