Lucene Intro
Lucene Intro: a presentation given at a Timisoara Java Users Group (TM JUG) meeting

Published in: Technology, Education
Note for slide 17 (Tika): supported rich document types include Office (Word, Excel, PowerPoint), OpenOffice, PDF, images (metadata), audio (ID3 tags for MP3 files), RTF, etc.
Note for slide 19 (Lucene Index Structure): similar to NoSQL databases. Not all documents need to contain the same fields.

    1. Lucene Intro
    2. About Me
        • Cristian Vat
        • Java Developer / Geek / Enthusiast
        • Contact
            • @deathy
            • ... or TM JUG mailing list
    3. About YOU
        • Heard about Lucene / Solr?
        • Used Lucene / Solr?
    4. Databases / Text Search
    5. Databases
        • Select/Search on (usually) exact values or ranges
        • Group/Summarize Results
        • Sort results by value(s) of certain result column(s)
    6. Text Search
        • Search for individual words/tokens
        • Search long text documents
        • More language-aware
        • “Sorting” by Relevance of results by default
    7. IR Quick Intro
    8. IR = Information Retrieval
    9. IR Quick Intro
        • Doc 1: “I did enact Julius Caesar: I was killed i’ the Capitol; Brutus killed me.”
        • Doc 2: “So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:”
    10. IR Quick Intro
        • Index
            • “I” -> Doc 1
            • “Caesar” -> Doc 1, Doc 2
            • “enact” -> Doc 1
            • “noble” -> Doc 2
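
The index on this slide is an inverted index: a map from each word to the documents that contain it. The following is a minimal sketch in plain Java, not Lucene's actual data structure. The class name `InvertedIndexSketch`, the lower-casing, and the splitting rule are assumptions made for illustration. Note that “noble” occurs only in Doc 2.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration class; not part of Lucene.
final class InvertedIndexSketch {

    // Maps each lower-cased word to the sorted ids of the documents containing it.
    static Map<String, SortedSet<Integer>> build(List<String> docs) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (int i = 0; i < docs.size(); i++) {
            int docId = i + 1; // Doc 1, Doc 2, ... as on the slide
            // Split on any run of characters that are neither letters nor digits.
            for (String token : docs.get(i).toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                if (!token.isEmpty()) {
                    index.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
            "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.",
            "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:");
        Map<String, SortedSet<Integer>> index = build(docs);
        System.out.println("caesar -> " + index.get("caesar")); // prints: caesar -> [1, 2]
        System.out.println("noble -> " + index.get("noble"));   // prints: noble -> [2]
    }
}
```
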
    11. IR Quick Intro
        • Search
            • caesar
            • c?es*
            • caesar AND noble
            • “Julius Caesar”
            • Caesar NOT Brutus
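
Given an inverted index, most of these searches reduce to set operations on posting lists. The sketch below assumes a hand-written index for the two Caesar documents and hypothetical helper names (`and`, `not`, `matchTerms`); it is not how Lucene implements queries. Phrase searches such as “Julius Caesar” also need word positions and are left out.

```java
import java.util.Collection;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of Lucene.
final class BooleanSearchSketch {

    // a AND b: documents present in both posting lists.
    static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    // a NOT b: documents in a that are absent from b.
    static Set<Integer> not(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.removeAll(b);
        return result;
    }

    // Expands a wildcard ('?' = one character, '*' = any run) to the matching index terms.
    static Set<String> matchTerms(String wildcard, Collection<String> terms) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                regex.append('.');
            } else if (c == '*') {
                regex.append(".*");
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        Pattern pattern = Pattern.compile(regex.toString());
        Set<String> matches = new TreeSet<>();
        for (String term : terms) {
            if (pattern.matcher(term).matches()) {
                matches.add(term);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hand-written postings for a few words of Doc 1 and Doc 2.
        Map<String, Set<Integer>> index = Map.of(
            "caesar", Set.of(1, 2),
            "brutus", Set.of(1, 2),
            "noble", Set.of(2),
            "capitol", Set.of(1));
        System.out.println(and(index.get("caesar"), index.get("noble")));  // prints: [2]
        System.out.println(not(index.get("caesar"), index.get("brutus"))); // prints: []
        System.out.println(matchTerms("c?es*", index.keySet()));           // prints: [caesar]
    }
}
```
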
    12. Lucene Ecosystem
        • ...and many more
    13. Lucene
        • IR Library
        • Just API for Indexing/Searching
        • No GUI
        • No parsers for different file formats
    14. Lucene
        • Fast
        • Thread-Safe/Multi-Threaded indexing and searching
        • No dependencies! (not even logging framework)
    15. Solr
        • Search Server / Layer over Lucene
        • Provides REST-like HTTP (JSON/XML) API
        • Client libraries in Java, PHP, Python, Ruby, Perl, .NET, ...
    16. Solr
        • More structured indexes
        • Replication / Distribution, Master-Slave, etc.
        • Faceted Search / Filtering
        • Indexing of rich document types (via Tika)
    17. Tika
        • “Content Analysis Toolkit”
        • Text and Metadata extraction from various rich document types
        • Used by Solr for indexing rich document types
    18. Lucene (in more detail)
    19. Lucene Index Structure
        • Index = One or more Documents
        • Document = one or more Fields with values
        • NO Schema/Structure restrictions
    20. Adding documents
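
To make the index structure concrete, here is a minimal stdlib-only sketch of a schema-less index: each document is a set of named fields, and each field gets its own term-to-documents map. This is not the Lucene API. The class `FieldIndexSketch`, its methods, and the sample field values are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration class; not part of Lucene.
final class FieldIndexSketch {
    // field name -> term -> ids of the documents containing that term in that field
    private final Map<String, Map<String, SortedSet<Integer>>> postings = new HashMap<>();
    // Original field values, kept so that hits can be displayed.
    private final List<Map<String, String>> stored = new ArrayList<>();

    // Adds a document with any set of fields (no schema) and returns its id.
    int addDocument(Map<String, String> fields) {
        int docId = stored.size();
        stored.add(Map.copyOf(fields));
        for (Map.Entry<String, String> field : fields.entrySet()) {
            Map<String, SortedSet<Integer>> terms =
                postings.computeIfAbsent(field.getKey(), f -> new HashMap<>());
            for (String term : field.getValue().toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!term.isEmpty()) {
                    terms.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
                }
            }
        }
        return docId;
    }

    // Returns the ids of documents whose given field contains the term.
    SortedSet<Integer> search(String field, String term) {
        Map<String, SortedSet<Integer>> terms = postings.get(field);
        if (terms == null) {
            return Collections.emptySortedSet();
        }
        SortedSet<Integer> hits = terms.get(term.toLowerCase(Locale.ROOT));
        if (hits == null) {
            return Collections.emptySortedSet();
        }
        return hits;
    }

    public static void main(String[] args) {
        FieldIndexSketch index = new FieldIndexSketch();
        // The two documents deliberately have different fields (sample data).
        index.addDocument(Map.of("title", "Lucene Intro", "year", "2005"));
        index.addDocument(Map.of("title", "Solr in practice"));
        System.out.println(index.search("title", "lucene")); // prints: [0]
        System.out.println(index.search("year", "2005"));    // prints: [0]
    }
}
```
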
    21. Lucene Search
    22. Query Parser
        • AND, OR, NOT ( +/- )
            • “apache AND lucene NOT solr” ( “+apache +lucene -solr” )
        • Range Queries
            • year:[1994 TO 2011]
        • Wildcard/Fuzzy:
            • “ap?che”, “apac*”, “appche”~0.8
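
The +/- form of a query can be read as three groups of terms: required, prohibited, and optional. The sketch below parses only that form and checks it against one document's terms. It is a simplification with hypothetical names (`PrefixQuerySketch`, `parse`, `matches`), not Lucene's query parser. Ranges, wildcards, fuzzy terms, and fields are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class; not part of Lucene.
final class PrefixQuerySketch {
    final List<String> required = new ArrayList<>();
    final List<String> prohibited = new ArrayList<>();
    final List<String> optional = new ArrayList<>();

    // Parses whitespace-separated terms, each optionally prefixed with '+' or '-'.
    static PrefixQuerySketch parse(String query) {
        PrefixQuerySketch parsed = new PrefixQuerySketch();
        for (String token : query.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.startsWith("+") && token.length() > 1) {
                parsed.required.add(token.substring(1));
            } else if (token.startsWith("-") && token.length() > 1) {
                parsed.prohibited.add(token.substring(1));
            } else if (!token.isEmpty() && !token.equals("+") && !token.equals("-")) {
                parsed.optional.add(token);
            }
        }
        return parsed;
    }

    // True if the document has every required term and no prohibited term.
    // Assumption: without required terms, at least one optional term must match.
    boolean matches(Set<String> docTerms) {
        if (!docTerms.containsAll(required)) {
            return false;
        }
        for (String term : prohibited) {
            if (docTerms.contains(term)) {
                return false;
            }
        }
        return !required.isEmpty() || optional.stream().anyMatch(docTerms::contains);
    }

    public static void main(String[] args) {
        PrefixQuerySketch query = parse("+apache +lucene -solr");
        System.out.println(query.matches(Set.of("apache", "lucene", "search"))); // prints: true
        System.out.println(query.matches(Set.of("apache", "lucene", "solr")));   // prints: false
    }
}
```
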
    23. Sorting of Results
        • Default sort by Relevance
        • Possible to use custom sort fields
    24. Relevance
        • Score is calculated for each document based on individual document/fields and the current search query
    25. For the nerds
        • http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/core/org/apache/lucene/search/Similarity.html
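
As a rough idea of how such a score can be computed, here is a simplified TF-IDF sketch. It assumes tf = sqrt(term frequency), idf = 1 + ln(numDocs / (docFreq + 1)), and a 1/sqrt(document length) normalization. It leaves out boosts and the query-level factors described on the Similarity page, so it does not reproduce Lucene's exact scores. The class and method names are hypothetical.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical illustration class; not part of Lucene.
final class TfIdfSketch {

    // Scores one tokenized document against the query terms, relative to a corpus.
    static double score(List<String> queryTerms, List<String> doc, List<List<String>> corpus) {
        if (doc.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (String term : queryTerms) {
            int freq = Collections.frequency(doc, term);
            if (freq == 0) {
                continue; // term absent from this document
            }
            long docFreq = corpus.stream().filter(d -> d.contains(term)).count();
            double tf = Math.sqrt(freq);
            double idf = 1.0 + Math.log((double) corpus.size() / (docFreq + 1));
            sum += tf * idf * idf;
        }
        // Length normalization: the same match counts for less in a longer document.
        return sum / Math.sqrt(doc.size());
    }

    public static void main(String[] args) {
        List<String> d1 = List.of("lucene", "lucene", "search");
        List<String> d2 = List.of("lucene", "solr", "search");
        List<String> d3 = List.of("tika", "parser", "text");
        List<List<String>> corpus = List.of(d1, d2, d3);
        List<String> query = List.of("lucene");
        // d1 mentions "lucene" twice, so it ranks above d2; d3 does not match at all.
        System.out.println(score(query, d1, corpus) > score(query, d2, corpus)); // prints: true
        System.out.println(score(query, d3, corpus));                            // prints: 0.0
    }
}
```
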
    26. Analysis
        • From long continuous text to individual tokens/words used for indexing
    27. Analysis
        • Text -> Tokenizer -> (TokenFilter)* -> Tokens
    28. Tokenizer
        • Splits main text into words, by whitespace, punctuation, other rules
        • Text: “So, it has come to this!”
        • Tokens: [ “So”, “it”, “has”, “come”, “to”, “this” ]
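
A tokenizer of this kind can be approximated in a few lines. This is a minimal sketch, assuming that a token is any run of letters or digits. Lucene's own tokenizers apply more rules, and `TokenizerSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Lucene.
final class TokenizerSketch {

    // Splits text on whitespace and punctuation, keeping the original case.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) { // a leading separator yields one empty string
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("So, it has come to this!"));
        // prints: [So, it, has, come, to, this]
    }
}
```
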
    29. Token Filters
        • Change existing tokens or add new ones
            • Case-Folding
            • Synonyms
            • Stemming
    30. Token Filters
        • Text: “The Pandorica was constructed to ensure the safety of the Alliance.”
        • Tokens: [“The”, “Pandorica”, “was”, “constructed”, “to”, “ensure”, “the”, “safety”, “of”, “the”, “Alliance” ]
        • Filtered: [ “pandorica”, “was”, “construct”, “to”, “ensure”, “safe”, “of”, “alliance” ]
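
The filtered list above can be reproduced by chaining three small filters: case-folding, dropping the stop word “the”, and stemming. The sketch below is an illustration only. The stop-word set and the two suffix rules are assumptions picked to match this one sentence, far cruder than a real stemmer, and `TokenFilterSketch` is a hypothetical name.

```java
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Hypothetical illustration class; not part of Lucene.
final class TokenFilterSketch {

    // Case-folding: lower-case every token.
    static final UnaryOperator<List<String>> LOWER_CASE = tokens ->
        tokens.stream().map(t -> t.toLowerCase(Locale.ROOT)).collect(Collectors.toList());

    // Stop-word removal: drops "the" (assumes tokens are already lower-cased).
    static final UnaryOperator<List<String>> STOP_WORDS = tokens ->
        tokens.stream().filter(t -> !t.equals("the")).collect(Collectors.toList());

    // Toy stemming: applies the suffix rule below to every token.
    static final UnaryOperator<List<String>> TOY_STEMMER = tokens ->
        tokens.stream().map(TokenFilterSketch::stem).collect(Collectors.toList());

    // Strips a trailing "ed" or "ty" from tokens longer than four characters.
    static String stem(String token) {
        if (token.length() > 4 && (token.endsWith("ed") || token.endsWith("ty"))) {
            return token.substring(0, token.length() - 2);
        }
        return token;
    }

    // Runs the tokens through each filter in order, as in Tokenizer -> (TokenFilter)*.
    static List<String> analyze(List<String> tokens, List<UnaryOperator<List<String>>> filters) {
        List<String> current = tokens;
        for (UnaryOperator<List<String>> filter : filters) {
            current = filter.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("The", "Pandorica", "was", "constructed", "to",
            "ensure", "the", "safety", "of", "the", "Alliance");
        System.out.println(analyze(tokens, List.of(LOWER_CASE, STOP_WORDS, TOY_STEMMER)));
        // prints: [pandorica, was, construct, to, ensure, safe, of, alliance]
    }
}
```

The order matters here: the stop-word filter only recognizes “the” after case-folding has run.
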
    31. Q&A
        • Questions?
    32. Thanks