Apache Lucene 4

Slides from my presentation o






Usage Rights

© All Rights Reserved

  • Quick on Language Analysis and Ancillary. Language Analysis: 32 different languages, ~100+ tokenizers, token filters, etc. Ancillary: highlighting, joins, “collapsing”, spell checking, etc.
  • Merge controls. All of this stuff is like pluggability for analyzers.

Apache Lucene 4 Presentation Transcript

  • Apache Lucene 4
      Andrzej Białecki, Robert Muir, Grant Ingersoll
      LucidWorks
  • Topics
      • Lucene 4 Beta released this week
      • Key Features
      • Community
      • Evaluation
  • Features
      • Quick Hit:
          – Language Analysis
              • Unicode compliant
              • 32+ languages
              • 100+ TokenStreams
          – Ancillary
              • Faceting, spelling, MLT, joins, collapsing, highlighting, benchmarking, …
      • More to come:
          – FSTs
          – Indexing and Storage
          – Search
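The "TokenStreams" item refers to analysis chains: a tokenizer splits text into tokens, and a series of filters then transforms or drops them. The following is a minimal sketch of that idea in plain Java. It does not use Lucene's analysis API; the class and method names (`AnalysisChainSketch`, `tokenize`, `lowercase`, `removeStopWords`, `analyze`) are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative only: a tokenizer followed by two token filters.
// This is NOT Lucene's Analyzer/TokenStream API.
public class AnalysisChainSketch {

    // Tokenizer stage: emit maximal runs of Unicode letters or digits.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i); // code points, so supplementary characters work
            if (Character.isLetterOrDigit(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Filter stage 1: lowercase every token (locale-independent).
    static List<String> lowercase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Filter stage 2: drop tokens that appear in the stop-word set.
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                out.add(token);
            }
        }
        return out;
    }

    // The chain: tokenizer -> lowercase filter -> stop filter.
    static List<String> analyze(String text, Set<String> stopWords) {
        return removeStopWords(lowercase(tokenize(text)), stopWords);
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("the", "on"));
        System.out.println(analyze("The Quick-Brown fox, on Lucene 4!", stop));
    }
}
```

Because each stage only consumes and produces a token list, stages can be swapped or reordered independently, which is the kind of pluggability the slide alludes to.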
  • FS(A|T)
      • Keys:
          – byte[]
          – write-once
          – Linear-time build of minimal automata (n log n if the input is not sorted, which isn’t our case)
          – Compression
          – Reverse lookups
          – Weights (used for auto-suggest)
          – Pluggable Algebra
      • Uses:
          – Term Dictionary, TokenStreams, Japanese, synonyms, spelling, others
          – FuzzyQuery is 100x faster: http://bit.ly/hgO65c
      • More:
          – http://slidesha.re/vKtpVA
          – http://bit.ly/Pkjyu0
          – “Smaller Representation of Finite State Automata”
              • Proc. of the 16th Inter. Conf. on Implementation and Application of Automata, CIAA2011, vol. 6807, 2011, pp. 118-192.
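The "linear-time build of minimal automata" from sorted input can be illustrated with the well-known incremental construction for minimal acyclic automata: as each sorted key arrives, the part of the previous key's path that can no longer change is merged with an equivalent, already-registered state. The sketch below is an assumption-laden illustration, not Lucene's code. It builds a plain acceptor (an FSA with no outputs or weights) over `char` rather than `byte[]`, and every name in it (`MinimalAutomatonSketch`, `State`, `replaceOrRegister`) is hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustrative only: incremental minimal acyclic automaton from sorted keys.
public class MinimalAutomatonSketch {

    static final class State {
        final int id;
        final TreeMap<Character, State> next = new TreeMap<>();
        boolean accept;

        State(int id) { this.id = id; }

        // Two states are equivalent if they agree on acceptance and on
        // (label, target) pairs; targets are already canonical when this is called.
        String signature() {
            StringBuilder sb = new StringBuilder(accept ? "1" : "0");
            for (Map.Entry<Character, State> e : next.entrySet()) {
                sb.append(e.getKey()).append("->").append(e.getValue().id).append(';');
            }
            return sb.toString();
        }
    }

    private int nextId;                                   // declared before root on purpose
    private final State root = new State(nextId++);
    private final Map<String, State> register = new HashMap<>();
    private String previous;

    // Keys must be added in strictly increasing order.
    void add(String word) {
        if (previous != null && word.compareTo(previous) <= 0) {
            throw new IllegalArgumentException("input must be strictly sorted");
        }
        State state = root;
        int i = 0;
        // Walk the prefix shared with the previous key.
        while (i < word.length() && state.next.containsKey(word.charAt(i))) {
            state = state.next.get(word.charAt(i));
            i++;
        }
        // The rest of the previous key's path is now frozen: minimize it.
        if (!state.next.isEmpty()) {
            replaceOrRegister(state);
        }
        // Append fresh states for the new suffix.
        for (; i < word.length(); i++) {
            State fresh = new State(nextId++);
            state.next.put(word.charAt(i), fresh);
            state = fresh;
        }
        state.accept = true;
        previous = word;
    }

    // Call once after the last key to minimize the final path.
    void finish() {
        if (!root.next.isEmpty()) {
            replaceOrRegister(root);
        }
    }

    private void replaceOrRegister(State state) {
        Map.Entry<Character, State> last = state.next.lastEntry();
        State child = last.getValue();
        if (!child.next.isEmpty()) {
            replaceOrRegister(child);
        }
        String sig = child.signature();
        State canonical = register.get(sig);
        if (canonical != null) {
            state.next.put(last.getKey(), canonical); // share the equivalent state
        } else {
            register.put(sig, child);
        }
    }

    boolean accepts(String word) {
        State state = root;
        for (int i = 0; i < word.length() && state != null; i++) {
            state = state.next.get(word.charAt(i));
        }
        return state != null && state.accept;
    }

    // Number of distinct states reachable from the root.
    int countStates() {
        Set<State> seen = new HashSet<>();
        Deque<State> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            State s = stack.pop();
            if (seen.add(s)) {
                stack.addAll(s.next.values());
            }
        }
        return seen.size();
    }
}
```

For the keys "car", "cars", "cat", "cats", a plain trie needs seven nodes, while this construction shares the common suffixes and ends up with five states, which hints at where the "Compression" bullet comes from.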
  • Indexing and Storage
      • Segmented, write-once approach with merging
      • Fast: http://bit.ly/l8qE0i
          – 23.2 GB Wikipedia in 5 minutes
          – 270 GB/hour of plain text
      • Near Real Time Indexing/Search
      • Codecs
          – Abstraction for: Dictionaries, Postings, Field Storage, Term Vectors and more
          – Lucene40 is default, uses Block Tree
          – For fun: SimpleTextCodec
      • Directory
          – Abstraction for IO
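The "segmented, write-once approach with merging" can be pictured as follows: documents accumulate in an in-memory buffer, a flush freezes the buffer into an immutable segment, searches consult every segment, and a merge rewrites several segments into one new segment without ever modifying an existing one. The sketch below is a toy in-memory model under those assumptions. It is not Lucene's `IndexWriter`, it ignores deletes, codecs and on-disk storage, and all names (`SegmentedIndexSketch`, `flush`, `mergeAll`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative only: write-once segments plus merging, entirely in memory.
public class SegmentedIndexSketch {

    // An immutable segment: sorted term dictionary -> ascending doc IDs.
    static final class Segment {
        final SortedMap<String, List<Integer>> postings;

        Segment(SortedMap<String, List<Integer>> postings) {
            this.postings = Collections.unmodifiableSortedMap(postings);
        }
    }

    private TreeMap<String, List<Integer>> buffer = new TreeMap<>();
    private final List<Segment> segments = new ArrayList<>();
    private int nextDocId;

    // Buffer a document; it is not searchable until flush() is called.
    int addDocument(String text) {
        int docId = nextDocId++;
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            List<Integer> docs = buffer.computeIfAbsent(term, t -> new ArrayList<>());
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId); // record each document once per term
            }
        }
        return docId;
    }

    // Freeze the buffer into a new segment; existing segments are untouched.
    void flush() {
        if (!buffer.isEmpty()) {
            segments.add(new Segment(buffer));
            buffer = new TreeMap<>();
        }
    }

    // Merge every segment into one new segment, then drop the old ones.
    void mergeAll() {
        TreeMap<String, List<Integer>> merged = new TreeMap<>();
        for (Segment segment : segments) {
            for (Map.Entry<String, List<Integer>> e : segment.postings.entrySet()) {
                // Segments are visited in flush order, so doc IDs stay ascending.
                merged.computeIfAbsent(e.getKey(), t -> new ArrayList<>()).addAll(e.getValue());
            }
        }
        segments.clear();
        if (!merged.isEmpty()) {
            segments.add(new Segment(merged));
        }
    }

    // Search consults every flushed segment and concatenates the hits.
    List<Integer> search(String term) {
        String key = term.toLowerCase(Locale.ROOT);
        List<Integer> hits = new ArrayList<>();
        for (Segment segment : segments) {
            hits.addAll(segment.postings.getOrDefault(key, Collections.emptyList()));
        }
        return hits;
    }

    int segmentCount() {
        return segments.size();
    }
}
```

Since segments never change after they are written, readers need no locking against writers in this model, and a merge only trades several small segments for one larger one while leaving search results the same.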
  • Search
      • Many query types, query parsers, filtering capabilities
      • DAAT (mostly) evaluation
      • Pluggable Similarity
          – Many implementations and room for more
              • BM25, DFR, etc.
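"Pluggable Similarity" means the scoring formula sits behind an interface, so the search code does not care which ranking model is in use. The sketch below shows that shape with two interchangeable scorers: a common formulation of BM25 and a simple TF-IDF variant. The interface and class names are hypothetical and are not Lucene's `Similarity` API; the parameter values `k1 = 1.2` and `b = 0.75` used in `main` are conventional defaults chosen here as an assumption.

```java
// Illustrative only: a scoring interface with two interchangeable implementations.
public class PluggableSimilaritySketch {

    /** Hypothetical scoring contract for a single term in a single document. */
    interface Similarity {
        double score(int termFreq, int docFreq, int docCount, double docLen, double avgDocLen);
    }

    /** A common BM25 formulation with term-frequency saturation and length normalization. */
    static final class Bm25 implements Similarity {
        private final double k1;
        private final double b;

        Bm25(double k1, double b) {
            this.k1 = k1;
            this.b = b;
        }

        @Override
        public double score(int termFreq, int docFreq, int docCount, double docLen, double avgDocLen) {
            double idf = Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
            double norm = k1 * (1.0 - b + b * docLen / avgDocLen);
            return idf * termFreq * (k1 + 1.0) / (termFreq + norm);
        }
    }

    /** A simple TF-IDF variant: sqrt(tf) * idf, divided by sqrt(document length). */
    static final class SimpleTfIdf implements Similarity {
        @Override
        public double score(int termFreq, int docFreq, int docCount, double docLen, double avgDocLen) {
            double tf = Math.sqrt(termFreq);
            double idf = 1.0 + Math.log((double) docCount / (docFreq + 1.0));
            return tf * idf / Math.sqrt(docLen);
        }
    }

    // The "search" side depends only on the interface: sum per-term scores.
    static double scoreDocument(Similarity similarity, int[] termFreqs, int[] docFreqs,
                                int docCount, double docLen, double avgDocLen) {
        double total = 0.0;
        for (int i = 0; i < termFreqs.length; i++) {
            total += similarity.score(termFreqs[i], docFreqs[i], docCount, docLen, avgDocLen);
        }
        return total;
    }

    public static void main(String[] args) {
        int[] termFreqs = {3, 1};   // made-up per-term frequencies in one document
        int[] docFreqs = {10, 200}; // made-up number of documents containing each term
        Similarity[] models = {new Bm25(1.2, 0.75), new SimpleTfIdf()};
        for (Similarity model : models) {
            double s = scoreDocument(model, termFreqs, docFreqs, 1000, 120.0, 100.0);
            System.out.println(model.getClass().getSimpleName() + ": " + s);
        }
    }
}
```

Swapping the ranking model is then a one-line change at the call site, which is the property the slide highlights.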
  • Community
      • Large, diverse community with many non-traditional search engine usages
          – Object stores, Record linkage, mobile,
      • Always Be Testing
          – Randomized system tests are all the rage
          – http://vimeo.com/32087114
      • “The Apache Way”
      • You never know where the next good idea is coming from
  • Evaluation
      • Performance
          – http://people.apache.org/~mikemccand/lucenebench/
          – http://people.apache.org/~mikemccand/lucenebench/indexing.html
      • Relevance
          – Many people have done private evaluations
          – Empirical/Anecdotal: $ queries, random sample
          – More needed
  • Resources
      • http://lucene.apache.org
      • grant@lucidworks.com
      • @gsingers
      • http://www.lucidworks.com