Quick hits on language analysis and ancillary features.
Language analysis: 32 different languages, 100+ tokenizers, token filters, etc.
Ancillary: highlighting, joins, “collapsing”, spell checking, etc.
Merge controls.
All of this is pluggable, much like analyzers.
Apache Lucene 4
Andrzej Białecki, Robert Muir, Grant Ingersoll
LucidWorks
Topics
• Lucene 4 Beta released this week
• Key Features
• Community
• Evaluation
Features
• Quick Hit:
  – Language Analysis
    • Unicode compliant
    • 32+ languages
    • 100+ TokenStreams
  – Ancillary
    • Faceting, spelling, MLT (MoreLikeThis), joins, collapsing, highlighting, benchmarking, …
• More to come:
  – FSTs
  – Indexing and Storage
  – Search
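To make the analysis bullet concrete, here is a minimal sketch of an analysis chain: a tokenizer followed by a sequence of pluggable token filters. It is plain Python, not the Lucene API; the names `letter_tokenizer`, `lowercase_filter`, `stop_filter`, and `analyze` are hypothetical and chosen only for illustration.

```python
import re


def letter_tokenizer(text):
    """Split text into runs of Unicode letters (digits and punctuation are dropped)."""
    return re.findall(r"[^\W\d_]+", text)


def lowercase_filter(tokens):
    """Token filter: lowercase every token."""
    for token in tokens:
        yield token.lower()


def stop_filter(stop_words):
    """Build a token filter that drops the given stop words."""
    def apply(tokens):
        for token in tokens:
            if token not in stop_words:
                yield token
    return apply


def analyze(text, tokenizer, filters):
    """Run the tokenizer, then each filter in order, and collect the tokens."""
    stream = tokenizer(text)
    for token_filter in filters:
        stream = token_filter(stream)
    return list(stream)


tokens = analyze(
    "The Apache Lucene 4 Beta",
    letter_tokenizer,
    [lowercase_filter, stop_filter({"the"})],
)
```

Swapping the tokenizer or reordering the filters changes the output without touching `analyze`, which is the kind of pluggability the slide refers to.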
FS(A|T)
• Keys:
  – byte
  – write-once
  – Linear-time build of minimal automata (n log n if the input is not sorted, which isn’t our case)
  – Compression
  – Reverse lookups
  – Weights (used for auto-suggest)
  – Pluggable algebra
• Uses:
  – Term dictionary, TokenStreams, Japanese, synonyms, spelling, others
  – FuzzyQuery is 100x faster: http://bit.ly/hgO65c
• More:
  – http://slidesha.re/vKtpVA
  – http://bit.ly/Pkjyu0
  – “Smaller Representation of Finite State Automata”
    • Proc. of the 16th Inter. Conf. on Implementation and Application of Automata, CIAA2011, vol. 6807, 2011, pp. 118-192.
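The "build of minimal automata" bullet can be illustrated with a small sketch of incremental construction over sorted input: as each new key arrives, the suffix of the previous key that can no longer change is merged with any equivalent state already registered. This is an acceptor only (no outputs or weights) and is not Lucene's FST code; `State`, `build_minimal_automaton`, `accepts`, and `count_states` are hypothetical names for this example.

```python
class State:
    def __init__(self):
        self.edges = {}     # char -> State
        self.final = False

    def signature(self):
        # Two states are equivalent if they agree on finality and on
        # (label, target) pairs; targets are already canonical when this is called.
        return (self.final,
                tuple((ch, id(target)) for ch, target in sorted(self.edges.items())))


def build_minimal_automaton(sorted_words):
    """Build a minimal acyclic automaton from strictly increasing words."""
    root = State()
    register = {}    # signature -> canonical state
    unchecked = []   # (parent, char, child) along the path of the previous word
    previous = None

    def minimize(down_to):
        # Replace states below depth `down_to` with registered equivalents.
        while len(unchecked) > down_to:
            parent, ch, child = unchecked.pop()
            parent.edges[ch] = register.setdefault(child.signature(), child)

    for word in sorted_words:
        if previous is not None and word <= previous:
            raise ValueError("input must be sorted and free of duplicates")
        prefix = 0
        if previous is not None:
            while (prefix < len(word) and prefix < len(previous)
                   and word[prefix] == previous[prefix]):
                prefix += 1
        minimize(prefix)
        node = unchecked[-1][2] if unchecked else root
        for ch in word[prefix:]:
            nxt = State()
            node.edges[ch] = nxt
            unchecked.append((node, ch, nxt))
            node = nxt
        node.final = True
        previous = word
    minimize(0)
    return root


def accepts(root, word):
    node = root
    for ch in word:
        node = node.edges.get(ch)
        if node is None:
            return False
    return node.final


def count_states(root):
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if id(node) not in seen:
            seen.add(id(node))
            stack.extend(node.edges.values())
    return len(seen)


automaton = build_minimal_automaton(["tap", "taps", "top", "tops"])
```

For these four keys a plain trie needs 8 states, while the automaton shares the common "p", "ps" suffixes and needs 5. That suffix sharing is one source of the compression mentioned on the slide.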
Indexing and Storage
• Segmented, write-once approach with merging
• Fast: http://bit.ly/l8qE0i
  – 23.2 GB Wikipedia in 5 minutes
  – 270 GB/hour of plain text
• Near Real Time Indexing/Search
• Codecs
  – Abstraction for: Dictionaries, Postings, Field Storage, Term Vectors and more
  – Lucene40 is default; uses Block Tree
  – For fun: SimpleTextCodec
• Directory
  – Abstraction for IO
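The segmented, write-once approach can be sketched in a few lines: documents are buffered, each flush produces a segment that is never modified afterwards, a search consults every segment, and a merge writes one new segment that replaces the old ones. This is a toy in-memory model under simplifying assumptions (no deletes, no on-disk format, whitespace tokenization); `TinySegmentedIndex` and its methods are hypothetical names, not Lucene classes.

```python
from collections import defaultdict


class TinySegmentedIndex:
    def __init__(self, flush_every=2):
        self.segments = []                  # each segment: dict term -> tuple of doc ids
        self.buffer = defaultdict(list)     # in-memory postings not yet flushed
        self.buffered_docs = 0
        self.next_doc_id = 0
        self.flush_every = flush_every

    def add(self, text):
        doc_id = self.next_doc_id
        self.next_doc_id += 1
        for term in set(text.lower().split()):
            self.buffer[term].append(doc_id)
        self.buffered_docs += 1
        if self.buffered_docs >= self.flush_every:
            self.flush()
        return doc_id

    def flush(self):
        """Freeze the buffer into a new segment; existing segments are untouched."""
        if self.buffered_docs == 0:
            return
        self.segments.append({t: tuple(ids) for t, ids in self.buffer.items()})
        self.buffer = defaultdict(list)
        self.buffered_docs = 0

    def merge_all(self):
        """Write one new segment from all existing ones, then swap it in."""
        if len(self.segments) <= 1:
            return
        merged = defaultdict(list)
        for segment in self.segments:
            for term, ids in segment.items():
                merged[term].extend(ids)
        self.segments = [{t: tuple(sorted(ids)) for t, ids in merged.items()}]

    def search(self, term):
        """Collect matching doc ids from every flushed segment."""
        hits = []
        for segment in self.segments:
            hits.extend(segment.get(term, ()))
        return sorted(hits)


index = TinySegmentedIndex(flush_every=2)
for text in ["apache lucene", "lucene search", "apache solr", "fast search"]:
    index.add(text)
```

Because segments are never modified in place, readers can keep using old segments while a merge builds the replacement. In this sketch, buffered documents only become searchable after a flush.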
Search
• Many query types, query parsers, filtering capabilities
• Document-at-a-time (DAAT) evaluation, mostly
• Pluggable Similarity
  – Many implementations and room for more
    • BM25, DFR, etc.
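As an illustration of document-at-a-time evaluation with a pluggable similarity, the sketch below merges the postings lists of the query terms in doc-id order, fully scores one document before moving to the next, and takes the scoring function as a parameter. The BM25 function uses a common formulation with the widely used defaults k1 = 1.2 and b = 0.75; the postings layout and the names `bm25`, `tf_only`, and `search_daat` are assumptions for this example, not Lucene's API.

```python
import heapq
import math
from itertools import groupby


def bm25(tf, df, doc_len, avg_doc_len, num_docs, k1=1.2, b=0.75):
    """Score one term occurrence list entry with a common BM25 formulation."""
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + length_norm)


def tf_only(tf, df, doc_len, avg_doc_len, num_docs):
    """A deliberately naive alternative similarity: raw term frequency."""
    return float(tf)


def search_daat(query_terms, postings, doc_lens, similarity=bm25):
    """postings: term -> list of (doc_id, tf) sorted by doc_id."""
    num_docs = len(doc_lens)
    avg_doc_len = sum(doc_lens.values()) / num_docs
    streams = []
    for term in query_terms:
        plist = postings.get(term, [])
        df = len(plist)
        streams.append([(doc_id, tf, df) for doc_id, tf in plist])
    results = []
    # heapq.merge yields entries in doc-id order, so all of a document's
    # matching terms arrive together and it is scored exactly once.
    for doc_id, group in groupby(heapq.merge(*streams), key=lambda p: p[0]):
        score = sum(
            similarity(tf, df, doc_lens[doc_id], avg_doc_len, num_docs)
            for _, tf, df in group
        )
        results.append((doc_id, score))
    return sorted(results, key=lambda r: (-r[1], r[0]))


postings = {
    "lucene": [(0, 2), (2, 1)],
    "search": [(0, 1), (1, 3)],
}
doc_lens = {0: 4, 1: 5, 2: 3}
ranked = search_daat(["lucene", "search"], postings, doc_lens)
ranked_tf = search_daat(["lucene", "search"], postings, doc_lens, similarity=tf_only)
```

Passing `similarity=tf_only` changes the ranking logic without touching the evaluation loop, which is the separation a pluggable similarity provides.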
Community
• Large, diverse community with many non-traditional search engine usages
  – Object stores, record linkage, mobile
• Always Be Testing
  – Randomized system tests are all the rage
  – http://vimeo.com/32087114
• “The Apache Way”
• You never know where the next good idea is coming from
Evaluation
• Performance
  – http://people.apache.org/~mikemccand/lucenebench/
  – http://people.apache.org/~mikemccand/lucenebench/indexing.html
• Relevance
  – Many people have done private evaluations
  – Empirical/Anecdotal: $ queries, random sample
  – More needed