Lucene intro

Lucene Intro presentation presented at Timisoara Java Users Group meeting

Published in: Technology, Education

  • Slide 17 (Tika) note: Office (Word, Excel, PowerPoint), OpenOffice, PDF, images (metadata), audio (ID3 for MP3 files), RTF, etc.
  • Slide 19 (Lucene Index Structure) note: Similar to NoSQL databases. Not all documents need to contain the same fields.
  • Transcript

    • 1. Lucene Intro
    • 2. About Me
      • Cristian Vat
      • Java Developer / Geek / Enthusiast
      • Contact
        • @deathy
        • ... or TM JUG mailing list
    • 3. About YOU
      • Heard about Lucene / Solr?
      • Used Lucene / Solr?
    • 4. Databases / Text Search
    • 5. Databases
      • Select/Search on (usually) exact values or ranges
      • Group/Summarize results
      • Sort results by value(s) of certain result column(s)
    • 6. Text Search
      • Search for individual words/tokens
      • Search long text documents
      • More language-aware
      • Results are “sorted” by relevance by default
    • 7. IR Quick Intro
    • 8. IR = Information Retrieval
    • 9. IR Quick Intro
      • Doc 1: “I did enact Julius Caesar: I was killed i’ the Capitol; Brutus killed me.”
      • Doc 2: “So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:”
    • 10. IR Quick Intro
      • Index
        • “I” -> Doc 1
        • “Caesar” -> Doc 1, Doc 2
        • “enact” -> Doc 1
        • “noble” -> Doc 2
    • 11. IR Quick Intro
      • Search
        • caesar
        • c?es*
        • caesar AND noble
        • “Julius Caesar”
        • Caesar NOT Brutus
    • 12. Lucene Ecosystem ...and many more
    • 13. Lucene
      • IR Library
      • Just an API for Indexing/Searching
      • No GUI
      • No parsers for different file formats
    • 14. Lucene
      • Fast
      • Thread-Safe/Multi-Threaded indexing and searching
      • No dependencies! (not even a logging framework)
    • 15. Solr
      • Search Server / Layer over Lucene
      • Provides REST-like HTTP (JSON/XML) API
      • Client libraries in Java, PHP, Python, Ruby, Perl, .NET, ...
    • 16. Solr
      • More structured indexes
      • Replication / Distribution, Master-Slave, etc.
      • Faceted Search / Filtering
      • Indexing of rich document types (via Tika)
    • 17. Tika
      • “Content Analysis Toolkit”
      • Text and Metadata extraction from various rich document types
      • Used by Solr for indexing rich document types
    • 18. Lucene (in more detail)
    • 19. Lucene Index Structure
      • Index = one or more Documents
      • Document = one or more Fields with values
      • NO Schema/Structure restrictions
    • 20. Adding documents
    • 21. Lucene Search
    • 22. Query Parser
      • AND, OR, NOT ( +/- )
        • “apache AND lucene NOT solr” ( “+apache +lucene -solr” )
      • Range Queries
        • year:[1994 TO 2011]
      • Wildcard/Fuzzy:
        • “ap?che”, “apac*”, “appche~0.8”
    • 23. Sorting of Results
      • Default sort by Relevance
      • Possible to use custom sort fields
    • 24. Relevance
      • Score is calculated for each document based on individual document/fields and the current search query
    • 25. For the nerds
      • http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/core/org/apache/lucene/search/Similarity.html
    • 26. Analysis
      • From long continuous text to individual tokens/words used for indexing
    • 27. Analysis
      • Text -> Tokenizer -> (TokenFilter)* -> Tokens
    • 28. Tokenizer
      • Splits main text into words, by whitespace, punctuation, other rules
      • Text: “So, it has come to this!”
      • Tokens: [ “So”, “it”, “has”, “come”, “to”, “this” ]
    • 29. Token Filters
      • Change existing tokens or add new ones
        • Case-Folding
        • Synonyms
        • Stemming
    • 30. Token Filters
      • Text: “The Pandorica was constructed to ensure the safety of the Alliance.”
      • Tokens: [ “The”, “Pandorica”, “was”, “constructed”, “to”, “ensure”, “the”, “safety”, “of”, “the”, “Alliance” ]
      • Filtered: [ “pandorica”, “was”, “construct”, “to”, “ensure”, “safe”, “of”, “alliance” ]
    • 31. Q&A
      • Questions?
    • 32. Thanks
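
The inverted index from the "IR Quick Intro" slides (9 to 11) can be sketched in a few lines. This is a minimal illustration in plain Python, not Lucene's API or its on-disk format: the function names (`build_index`, `search_and`, `search_not`, `search_wildcard`) are hypothetical, terms are lower-cased as an assumption so that `caesar` matches `Caesar`, and phrase queries such as “Julius Caesar” are left out because they would need term positions.

```python
import re
from collections import defaultdict
from fnmatch import fnmatchcase

# The two example documents from slide 9.
DOCS = {
    1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:",
}

def tokenize(text):
    """Split on anything that is not a letter and lower-case each token."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search_and(index, *terms):
    """Documents containing every term (e.g. caesar AND noble)."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

def search_not(index, include, exclude):
    """Documents containing `include` but not `exclude`."""
    return index.get(include.lower(), set()) - index.get(exclude.lower(), set())

def search_wildcard(index, pattern):
    """Documents containing any term matching a ?/* pattern (e.g. c?es*)."""
    hits = set()
    for term, doc_ids in index.items():
        if fnmatchcase(term, pattern.lower()):
            hits |= doc_ids
    return hits

INDEX = build_index(DOCS)
print(sorted(search_and(INDEX, "caesar", "noble")))
print(sorted(search_wildcard(INDEX, "c?es*")))
```

Looking up a term is a dictionary access followed by set operations on small posting sets, which is why term search stays fast even when the documents themselves are long.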
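
The analysis chain from slides 26 to 30 (Text -> Tokenizer -> (TokenFilter)* -> Tokens) can be sketched the same way. This is a rough sketch in plain Python under stated assumptions, not Lucene's `Analyzer` API: the stop-word set and the two-suffix "stemmer" are toy stand-ins chosen only to reproduce the filtered output shown on slide 30. A real stemmer such as Porter's would produce different stems.

```python
import re

# Assumption: a one-word stop list, just enough for the slide 30 example.
STOP_WORDS = {"the"}
# Assumption: toy suffix rules; not a real stemming algorithm.
SUFFIXES = ("ed", "ty")

def tokenize(text):
    """Tokenizer: split the text into words on non-letter characters."""
    return re.findall(r"[A-Za-z]+", text)

def lowercase_filter(tokens):
    """Case-folding: change each token to lower case."""
    for token in tokens:
        yield token.lower()

def stop_filter(tokens):
    """Drop tokens that appear in the stop-word set."""
    for token in tokens:
        if token not in STOP_WORDS:
            yield token

def toy_stem_filter(tokens):
    """Strip at most one known suffix from tokens that are long enough."""
    for token in tokens:
        for suffix in SUFFIXES:
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                token = token[: -len(suffix)]
                break
        yield token

def analyze(text, filters=(lowercase_filter, stop_filter, toy_stem_filter)):
    """Run the tokenizer, then each filter in order, and collect the tokens."""
    stream = tokenize(text)
    for token_filter in filters:
        stream = token_filter(stream)
    return list(stream)

print(tokenize("So, it has come to this!"))
print(analyze("The Pandorica was constructed to ensure the safety of the Alliance."))
```

Each filter consumes one token stream and yields another, so filters can be reordered or swapped without touching the tokenizer. The same chain has to be applied to query text at search time, otherwise a query for "Constructed" would never match the indexed term "construct".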