Apache Lucene 4
Slides from my presentation o

  • Quick on Language Analysis and Ancillary. Language Analysis: 32 different languages, 100+ tokenizers, token filters, etc. Ancillary: highlighting, joins, “collapsing”, spell checking, etc.
  • Merge controls. All of this stuff is like pluggability for analyzers.

Transcript

  • 1. Apache Lucene 4
      Andrzej Białecki, Robert Muir, Grant Ingersoll
      LucidWorks
  • 2. Topics
      • Lucene 4 Beta released this week
      • Key Features
      • Community
      • Evaluation
  • 3. Features
      • Quick Hit:
          – Language Analysis
              • Unicode compliant
              • 32+ languages
              • 100+ TokenStreams
          – Ancillary
              • Faceting, spelling, MLT, joins, collapsing, highlighting, benchmarking, …
      • More to come:
          – FSTs
          – Indexing and Storage
          – Search
  • 4. FS(A|T)
      • Keys:
          – byte[]
          – write-once
          – Linear time build of min. automata (n log n if not sorted, which isn’t our case)
          – Compression
          – Reverse lookups
          – Weights (used for auto-suggest)
          – Pluggable Algebra
      • Uses:
          – Term Dictionary, TokenStreams, Japanese, synonyms, spelling, others
          – FuzzyQuery is 100x faster: http://bit.ly/hgO65c
      • More:
          – http://slidesha.re/vKtpVA
          – http://bit.ly/Pkjyu0
          – “Smaller Representation of Finite State Automata”
              • Proc. of the 16th Inter. Conf. on Implementation and Application of Automata, CIAA2011, vol. 6807, 2011, pp. 118-192.
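The "build of min. automata" from sorted keys mentioned on this slide can be illustrated with a small sketch. It is not Lucene's FST code: it builds a plain acceptor (no outputs or weights), uses `char` labels instead of `byte[]` for readability, and the class name `MinimalAutomaton` and all of its methods are hypothetical. It follows the general idea of incremental construction for sorted input: once a new key diverges from the previous one, the old branch can never change again, so its states are frozen and merged with any equivalent state seen before.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical sketch: minimal acyclic automaton built from keys added in sorted order. */
final class MinimalAutomaton {
    static final class State {
        final int id;
        boolean accept;
        final TreeMap<Character, State> edges = new TreeMap<>();
        State(int id) { this.id = id; }

        /** Two frozen states are equivalent when acceptance and all (label, target) pairs match. */
        String signature() {
            StringBuilder sb = new StringBuilder(accept ? "1" : "0");
            for (Map.Entry<Character, State> e : edges.entrySet()) {
                sb.append('|').append(e.getKey()).append('>').append(e.getValue().id);
            }
            return sb.toString();
        }
    }

    private final Map<String, State> register = new HashMap<>(); // frozen, canonical states
    private int nextId = 0;
    private final State root = new State(nextId++);
    private String previous = "";

    /** Adds a key; keys must arrive in sorted order. Do not call after finish(). */
    void add(String word) {
        if (word.compareTo(previous) < 0) {
            throw new IllegalArgumentException("keys must arrive in sorted order");
        }
        // Walk the prefix shared with the previous key; those states are still mutable.
        int p = 0;
        State state = root;
        while (p < word.length() && p < previous.length() && word.charAt(p) == previous.charAt(p)) {
            state = state.edges.get(word.charAt(p));
            p++;
        }
        // The old branch below the divergence point is final now: freeze and deduplicate it.
        if (!state.edges.isEmpty()) replaceOrRegister(state);
        // Append the remaining suffix as fresh states.
        for (; p < word.length(); p++) {
            State next = new State(nextId++);
            state.edges.put(word.charAt(p), next);
            state = next;
        }
        state.accept = true;
        previous = word;
    }

    /** Freezes the branch of the last added key. */
    void finish() {
        if (!root.edges.isEmpty()) replaceOrRegister(root);
    }

    private void replaceOrRegister(State state) {
        Map.Entry<Character, State> last = state.edges.lastEntry();
        State child = last.getValue();
        if (!child.edges.isEmpty()) replaceOrRegister(child);
        State canonical = register.putIfAbsent(child.signature(), child);
        if (canonical != null) state.edges.put(last.getKey(), canonical); // reuse equivalent state
    }

    boolean contains(String word) {
        State s = root;
        for (int i = 0; i < word.length() && s != null; i++) s = s.edges.get(word.charAt(i));
        return s != null && s.accept;
    }

    /** Number of states reachable from the root. */
    int stateCount() {
        Set<Integer> seen = new HashSet<>();
        Deque<State> todo = new ArrayDeque<>();
        todo.push(root);
        while (!todo.isEmpty()) {
            State s = todo.pop();
            if (seen.add(s.id)) todo.addAll(s.edges.values());
        }
        return seen.size();
    }
}
```

Adding "cat", "cats", "dog", "dogs" and then calling `finish()` leaves 7 reachable states, where a plain trie would need 9, because the shared suffix structure ("t"/"g" followed by an optional "s") is stored once.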
  • 5. Indexing and Storage
      • Segmented, write-once approach with merging
      • Fast: http://bit.ly/l8qE0i
          – 23.2 GB Wikipedia in 5 minutes
          – 270 GB/hour of plain text
      • Near Real Time Indexing/Search
      • Codecs
          – Abstraction for: Dictionaries, Postings, Field Storage, Term Vectors and more
          – Lucene40 is default; uses Block Tree
          – For fun: SimpleTextCodec
      • Directory
          – Abstraction for IO
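The "segmented, write-once approach with merging" can be pictured with a toy in-memory index. This is only a sketch under simplifying assumptions: `SegmentedIndex`, `Segment`, and `mergeFactor` are hypothetical names, segments live in memory rather than behind a Directory, and the merge rule (merge everything once a fixed number of segments exists) is far cruder than a real merge policy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch of a write-once, segmented inverted index with merging. */
final class SegmentedIndex {
    /** A segment is never modified after it is written: term -> ascending doc ids. */
    static final class Segment {
        final SortedMap<String, List<Integer>> postings;
        Segment(SortedMap<String, List<Integer>> postings) {
            this.postings = Collections.unmodifiableSortedMap(postings);
        }
    }

    private final List<Segment> segments = new ArrayList<>();
    private TreeMap<String, List<Integer>> buffer = new TreeMap<>();
    private final int mergeFactor;
    private int nextDocId = 0;

    SegmentedIndex(int mergeFactor) { this.mergeFactor = mergeFactor; }

    /** Buffers a document in memory; it becomes searchable only after flush(). */
    int addDocument(String text) {
        int docId = nextDocId++;
        for (String term : new TreeSet<>(Arrays.asList(text.toLowerCase().split("\\s+")))) {
            if (term.isEmpty()) continue;
            buffer.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
        }
        return docId;
    }

    /** Writes the buffer as a new immutable segment, then merges if there are too many. */
    void flush() {
        if (buffer.isEmpty()) return;
        segments.add(new Segment(buffer));
        buffer = new TreeMap<>();
        if (segments.size() >= mergeFactor) mergeAll();
    }

    /** Merging writes one new segment from the old ones; the old segments are then dropped. */
    private void mergeAll() {
        TreeMap<String, List<Integer>> merged = new TreeMap<>();
        for (Segment s : segments) { // segments are kept in doc id order
            for (Map.Entry<String, List<Integer>> e : s.postings.entrySet()) {
                merged.computeIfAbsent(e.getKey(), t -> new ArrayList<>()).addAll(e.getValue());
            }
        }
        segments.clear();
        segments.add(new Segment(merged));
    }

    /** A search visits every segment and concatenates the per-segment results. */
    List<Integer> search(String term) {
        List<Integer> hits = new ArrayList<>();
        for (Segment s : segments) {
            hits.addAll(s.postings.getOrDefault(term, Collections.emptyList()));
        }
        return hits;
    }

    int segmentCount() { return segments.size(); }
}
```

Because segments are never updated in place, readers need no locking against writers; the cost is that a search touches every segment, which is what merging keeps in check.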
  • 6. Search
      • Many query types, query parsers, filtering capabilities
      • DAAT (mostly) evaluation
      • Pluggable Similarity
          – Many implementations and room for more
              • BM25, DFR, etc.
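Document-at-a-time (DAAT) evaluation with a pluggable similarity can be sketched in a few lines. Everything named here (`DaatSearch`, `Postings`, the `Similarity` interface and its parameters) is hypothetical and far simpler than Lucene's own classes. The BM25 and TF-IDF formulas are the commonly cited textbook forms with assumed parameters `k1 = 1.2` and `b = 0.75`, and only an AND query is handled.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: DAAT conjunction over sorted postings with a swappable scorer. */
final class DaatSearch {
    /** Pluggable scoring function for one term in one document. */
    interface Similarity {
        double score(int termFreq, int docFreq, int docCount, int docLen, double avgDocLen);
    }

    /** Textbook BM25 with assumed parameters k1 = 1.2 and b = 0.75. */
    static final Similarity BM25 = (tf, df, n, dl, avgdl) -> {
        double k1 = 1.2, b = 0.75;
        double idf = Math.log(1 + (n - df + 0.5) / (df + 0.5));
        return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl));
    };

    /** A simple TF-IDF variant, to show that the scorer can be swapped. */
    static final Similarity TF_IDF = (tf, df, n, dl, avgdl) ->
            Math.sqrt(tf) * (1 + Math.log((double) n / (df + 1)));

    /** Postings for one term: ascending doc ids with parallel term frequencies. */
    static final class Postings {
        final int[] docs;
        final int[] freqs;
        int pos = 0;
        Postings(int[] docs, int[] freqs) { this.docs = docs; this.freqs = freqs; }
        int doc() { return pos < docs.length ? docs[pos] : Integer.MAX_VALUE; }
        /** Moves to the first doc id >= target and returns it (MAX_VALUE when exhausted). */
        int advance(int target) {
            while (pos < docs.length && docs[pos] < target) pos++;
            return doc();
        }
    }

    /** Scores every document that contains all terms, one document at a time. */
    static Map<Integer, Double> conjunction(List<Postings> terms, int[] docLens, Similarity sim) {
        int docCount = docLens.length;
        double avgDocLen = Arrays.stream(docLens).average().orElse(1.0);
        Map<Integer, Double> hits = new LinkedHashMap<>();
        int candidate = terms.get(0).doc();
        while (candidate != Integer.MAX_VALUE) {
            boolean allMatch = true;
            for (Postings p : terms) {
                int d = p.advance(candidate);
                if (d != candidate) { // this term skips past the candidate: restart from d
                    candidate = d;
                    allMatch = false;
                    break;
                }
            }
            if (allMatch) {
                double score = 0;
                for (Postings p : terms) {
                    score += sim.score(p.freqs[p.pos], p.docs.length, docCount,
                            docLens[candidate], avgDocLen);
                }
                hits.put(candidate, score);
                candidate = terms.get(0).advance(candidate + 1);
            }
        }
        return hits;
    }
}
```

The traversal never looks at the similarity and the similarity never looks at the postings, which is the separation that makes scoring pluggable: passing `TF_IDF` instead of `BM25` changes the scores but not the set of matching documents.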
  • 7. Community
      • Large, diverse community with many non-traditional search engine usages
          – Object stores, record linkage, mobile
      • Always Be Testing
          – Randomized system tests are all the rage
          – http://vimeo.com/32087114
      • “The Apache Way”
      • You never know where the next good idea is coming from
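The randomized testing idea can be shown in miniature: draw random inputs from a seeded generator, compare the code under test against a trivially correct reference, and report the seed on failure so the run can be replayed. This sketch uses only `java.util.Random`; it is not the test framework used by the Lucene project, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.TreeSet;

/** Hypothetical sketch of a seeded randomized test for sorted-list intersection. */
final class RandomizedIntersectionTest {
    /** Code under test: two-pointer intersection of ascending arrays. */
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else { out.add(a[i]); i++; j++; }
        }
        return out;
    }

    /** Slow but obviously correct reference implementation. */
    static List<Integer> reference(int[] a, int[] b) {
        TreeSet<Integer> left = new TreeSet<>();
        for (int x : a) left.add(x);
        TreeSet<Integer> right = new TreeSet<>();
        for (int x : b) right.add(x);
        left.retainAll(right);
        return new ArrayList<>(left);
    }

    /** Random ascending array of distinct doc ids, possibly empty. */
    static int[] randomSortedDocs(Random random) {
        TreeSet<Integer> docs = new TreeSet<>();
        int n = random.nextInt(50);
        for (int i = 0; i < n; i++) docs.add(random.nextInt(200));
        return docs.stream().mapToInt(Integer::intValue).toArray();
    }

    /** Runs the comparison; any failure message carries the seed needed to reproduce it. */
    static void run(long seed, int iterations) {
        Random random = new Random(seed);
        for (int i = 0; i < iterations; i++) {
            int[] a = randomSortedDocs(random);
            int[] b = randomSortedDocs(random);
            if (!intersect(a, b).equals(reference(a, b))) {
                throw new AssertionError("mismatch; reproduce with seed=" + seed + ", iteration=" + i);
            }
        }
    }

    public static void main(String[] args) {
        // A fresh seed per run explores new inputs; passing a seed replays an earlier run.
        long seed = args.length > 0 ? Long.parseLong(args[0]) : System.nanoTime();
        System.out.println("seed=" + seed);
        run(seed, 1000);
    }
}
```

Each run covers different inputs, so the suite explores more of the input space over time, while the printed seed keeps any failure reproducible.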
  • 8. Evaluation
      • Performance
          – http://people.apache.org/~mikemccand/lucenebench/
          – http://people.apache.org/~mikemccand/lucenebench/indexing.html
      • Relevance
          – Many people have done private evaluations
          – Empirical/Anecdotal: $ queries, random sample
          – More needed
  • 9. Resources
      • http://lucene.apache.org
      • grant@lucidworks.com
      • @gsingers
      • http://www.lucidworks.com