Your SlideShare is downloading. ×
0
Apache Lucene 4
Apache Lucene 4
Apache Lucene 4
Apache Lucene 4
Apache Lucene 4
Apache Lucene 4
Apache Lucene 4
Apache Lucene 4
Apache Lucene 4
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

Apache Lucene 4

2,533

Published on

Slides from my presentation o

Slides from my presentation o

Published in: Technology
0 Comments
3 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
2,533
On Slideshare
0
From Embeds
0
Number of Embeds
6
Actions
Shares
0
Downloads
24
Comments
0
Likes
3
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide
  • Quick on Language Analysis and AncillaryLanguage Analysis: 32 diff languages, ~100+ tokenizers, token filters, etc.Ancillary: highlighting, joins, “collapsing”, highlighting, spell checking, etc.
  • Merge controlsAll of this stuff is like pluggability for analyzers
  • Transcript

    • 1. Apache Lucene 4Andrzej Białecki, Robert Muir, Grant Ingersoll LucidWorks
    • 2. Topics• Lucene 4 Beta released this week• Key Features• Community• Evaluation
    • 3. Features• Quick Hit: – Language Analysis • UNICODE compliant • 32+ languages • 100+ TokenStreams – Ancillary • Faceting, spelling, MLT, Joins, collapsing, highlighting, benchmarking, …• More to come: – FSTs – Indexing and Storage – Search
    • 4. FS(A|T)• Keys: – byte[] – write-once – Linear time build of min. automata (nlogn if not sorted, which isn’t our case) – Compression – Reverse lookups – Weights (used for auto-suggest) – Pluggable Algebra• Uses: – Term Dictionary, TokenStreams, Japanese, synonyms, spelling, others – FuzzyQuery is 100x faster -- http://bit.ly/hgO65c• More: – http://slidesha.re/vKtpVA – http://bit.ly/Pkjyu0 – “Smaller Representation of Finite State Automata” • Proc. of the 16th Inter. Conf. on Implementation and Application of Automata, CIAA2011, vol. 6807, 2011, pp. 118—192.
    • 5. Indexing and Storage• Segmented, write-once approach with merging• Fast: http://bit.ly/l8qE0i – 23.2 GB Wikipedia in 5 minutes – 270 GB/hour of plain text• Near Real Time Indexing/Search• Codecs – Abstraction for: Dictionaries, Postings, Field Storage, Term Vectors and more – Lucene40 is default – uses Block Tree – For fun: SimpleTextCodec• Directory – Abstraction for IO
    • 6. Search• Many query types, query parsers, filtering capabilities• DAAT (mostly) evaluation• Pluggable Similarity – Many implementations and room for more • BM25, DFR, etc.
    • 7. Community• Large, diverse community with many non-traditional search engine usages – Object stores, Record linkage, mobile,• Always Be Testing – Randomized system tests are all the rage – http://vimeo.com/32087114• “The Apache Way”• You never know where the next good idea is coming from
    • 8. Evaluation• Performance • Relevance – http://people.apache.or – Many people have done g/~mikemccand/luceneb private evaluations ench/ – Empirical/Anecdotal: $ queries, random sample – More needed http://people.apache.org/~mikemccand/lucenebench/indexing.html
    • 9. Resources• http://lucene.apache.org• grant@lucidworks.com• @gsingers• http://www.lucidworks.com

    ×