Lucene And Solr Intro

A presentation I gave at the Montreal JUG on November 18th 2010
http://goo.gl/7WxuI

Published in: Technology
Slide notes
  • Does one thing well. Apache License. 10 years old. Version 3.0. It is fast!
  • Analyze documents: split each document into words. Get documents in. Lucene returns a list of documents as the search result.
  • Book example: without an index, you start again from the beginning every time you look up a word. It is much simpler to use an index. Inverted index: for each word, list the documents that contain it.
  • Analysis: transform the content into terms. A term can be more than one word: “New York”. Position is also stored. Binary search: O(log n), logarithmic. Boolean search. Wildcard search.
  • Lucene generates an id for each document. Stored = original content stored “as is” on disk; it can be returned to the user when the document is returned. When Lucene returns a document, it returns its id. You can retrieve the stored content with the id.
  • Document: email, article, user. Email fields: sender, recipient, title, content, attachment. Article fields: author, title, category, content, publication date. Database analogy: document = row, field = column. Documents with different fields can be stored in the same index.
  • Lucene can return results sorted by a field.
  • Terms are almost synonymous with words.
  • Basic Query instance: TermQuery. Use PerFieldAnalyzerWrapper to specify the analyzer for each field.
  • Terms are stored in alphabetical order, using String.compareTo. A range query returns all docs for each term in the range.
  • Supports AND, OR, NOT. Supports +, -.
  • CNET used it to help users find products more easily.

Lucene And Solr Introduction
By Pascal Dimassimo [email_address]

About me
  • Java developer with 10+ years of experience
  • Working for OpenText/Nstein on the Semantic Navigation application
  • http://semanticnavigation.opentext.com/

History
  • Lucene launches in 2000
  • Solr launches in 2006
  • Lucid Imagination in 2009
      • Hires the core developers of Lucene and Solr
      • Offers commercial support
  • Lucene Revolution in 2010

Buzz
  • According to IDC
      • “53% of companies using Open source use Lucene”
      • “Largely responsible for significant decline in commercial OEM revenue”
Source: http://lucenerevolution.com/sites/default/files/slides/Lucene%20Rev%20Preso%20IDC_MarketTrends_Reynolds.pdf

Lucene?
  • “Lucene is a powerful Java search library that lets you easily add search to any application” - Lucene In Action, 2nd Edition
  • NOT an application
  • Text indexing and searching
  • Open Source
  • Mature
  • Easy to learn API

Typical Search App
[Figure taken from Lucene In Action, 2nd Edition]

Search?
  • Naive approach: linear search (à la grep)
  • O(n) -> Slow...
  • You want to find a word in a book: how do you do it?
  • Inverted Index

Inverted Index
[Figure: original slide from Michael Busch, available at http://goo.gl/0MQvy]

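
To make the inverted index idea concrete, here is a minimal sketch in plain Java. It does not use Lucene: the class `InvertedIndexSketch`, its methods, and the second and third sample sentences are assumptions made for illustration, and the analysis step is reduced to lowercasing and splitting on non-letters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration class; it is not part of Lucene.
class InvertedIndexSketch {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    // Lowercases the text, splits it on non-letters, and records each term.
    int addDocument(String text) {
        int docId = nextDocId++;
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // A single lookup in the sorted term map replaces a scan over every document.
    List<Integer> search(String term) {
        TreeSet<Integer> docs = postings.get(term.toLowerCase(Locale.ROOT));
        List<Integer> result = new ArrayList<>();
        if (docs != null) {
            result.addAll(docs);
        }
        return result;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument("The old night keeper keeps the keep in the town");
        index.addDocument("In the big old house in the big old gown");
        index.addDocument("The house in the town had the big old keep");
        System.out.println(index.search("keep"));   // prints [0, 2]
        System.out.println(index.search("house"));  // prints [1, 2]
    }
}
```

Note that "keeper" and "keeps" are distinct terms here, so only an exact match on "keep" is found. Making them match is the job of the analysis step described later.
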
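Lucene Document
    // Imports for the Lucene 3.0 API used in this talk
    import java.io.File;
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriter.MaxFieldLength;
    import org.apache.lucene.store.FSDirectory;

    FSDirectory dir = FSDirectory.open(new File("./index"));
    SimpleAnalyzer analyzer = new SimpleAnalyzer();
    MaxFieldLength len = IndexWriter.MaxFieldLength.UNLIMITED;
    IndexWriter writer = new IndexWriter(dir, analyzer, true, len);  // true = create a new index
    String content = "The old night keeper keeps the keep in the town";
    Document doc = new Document();
    doc.add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(doc);
    writer.commit();
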
Lucene Document
  • Document: what is returned as search result
  • Organized in fields. A field must be specified at query time!
  • Schema-less
  • Plain text

Fields
  • Indexed: put the content in the inverted index.
  • Analyzed: split the content into terms to be added to the inverted index. Terms are normalized.
  • Stored: keep the original content on disk.
  • Multivalued: repeat the same field multiple times in the same document with different values.

Lucene Document
    String content = "The old night keeper keeps the keep in the town";
    String author = "Peter Smith";
    String category1 = "Fiction";
    String category2 = "Canadian";
    String isbn = "978-1-933988-17-7";
    String id = "ABY123";

    // writer is the IndexWriter created in the previous example
    Document doc = new Document();
    doc.add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("author", author, Field.Store.YES, Field.Index.ANALYZED));
    // Multivalued: the "category" field is added twice with different values
    doc.add(new Field("category", category1, Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("category", category2, Field.Store.YES, Field.Index.ANALYZED));
    // Indexed as a single term, without analysis
    doc.add(new Field("isbn", isbn, Field.Store.YES, Field.Index.NOT_ANALYZED));
    // Stored only: can be returned with the document but cannot be searched
    doc.add(new Field("id", id, Field.Store.YES, Field.Index.NO));
    writer.addDocument(doc);
    writer.commit();

Lucene Demo
  • Indexing unit tests written by David

Relevancy
  • How do you tell which document is more important?

Vectorial Model
  • N-dimensional vectors for documents and queries
  • Score represents how close the vectors are
  • Tf-idf (term frequency–inverse document frequency)
  • Documents with many of the search terms are scored higher
  • Smaller documents are scored higher

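
The two scoring rules above can be illustrated with a small, self-contained sketch. This is textbook tf-idf in plain Java, not Lucene's actual scoring formula: it sums tf-idf weights for the query terms rather than computing a distance between vectors. The class `TfIdfSketch` and the three sample documents are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; textbook tf-idf, not Lucene's scoring formula.
class TfIdfSketch {
    // Naive analysis: lowercase and split on whitespace.
    static List<String> terms(String text) {
        return Arrays.asList(text.trim().toLowerCase(Locale.ROOT).split("\\s+"));
    }

    // Term frequency, divided by document length so that shorter documents score higher.
    static double tf(String term, List<String> doc) {
        int count = 0;
        for (String t : doc) {
            if (t.equals(term)) {
                count++;
            }
        }
        return doc.isEmpty() ? 0.0 : (double) count / doc.size();
    }

    // Inverse document frequency: rare terms weigh more than common ones.
    static double idf(String term, List<List<String>> corpus) {
        int docFreq = 0;
        for (List<String> doc : corpus) {
            if (doc.contains(term)) {
                docFreq++;
            }
        }
        return docFreq == 0 ? 0.0 : Math.log((double) corpus.size() / docFreq);
    }

    // Sum of tf * idf over the query terms.
    static double score(String query, List<String> doc, List<List<String>> corpus) {
        double sum = 0.0;
        for (String term : terms(query)) {
            sum += tf(term, doc) * idf(term, corpus);
        }
        return sum;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = new ArrayList<>();
        corpus.add(terms("lucene search library"));
        corpus.add(terms("lucene is a search library written in java"));
        corpus.add(terms("solr is a server"));
        for (List<String> doc : corpus) {
            System.out.println(doc + " -> " + score("lucene", doc, corpus));
        }
    }
}
```

For the query "lucene", the three-word document ranks above the eight-word one because the single match makes up a larger share of it, and the document without the term scores zero.
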
Analyzer
[Figure taken from Lucene In Action, 2nd Edition]

Analyzer
  • Convert text into terms
  • Used when indexing and querying
  • Tokenizer + Filters
  • Custom analyzers

Analyzer
"The quick brown fox jumped over the lazy dog"
  • WhitespaceAnalyzer: [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
  • SimpleAnalyzer: [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
  • StopAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
  • StandardAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
Example from Lucene In Action, 2nd Edition

Analyzer
"XY&Z Corporation - xyz@example.com"
  • WhitespaceAnalyzer: [XY&Z] [Corporation] [-] [xyz@example.com]
  • SimpleAnalyzer: [xy] [z] [corporation] [xyz] [example] [com]
  • StopAnalyzer: [xy] [z] [corporation] [xyz] [example] [com]
  • StandardAnalyzer: [xy&z] [corporation] [xyz@example.com]
Example from Lucene In Action, 2nd Edition

Custom Analyzers
  • WhitespaceTokenizer: tokenizes at white spaces
  • KeywordTokenizer: tokenizes the input as a single token
  • StandardTokenizer: tokenizes at white spaces but keeps high-level entities as tokens (email, etc.)
  • LowerCaseFilter: lowercases token text
  • StopFilter: removes words that exist in a provided set of words
  • PorterStemFilter: stems each token using the Porter stemming algorithm. For example, country and countries both stem to countri.
Some descriptions from Lucene In Action, 2nd Edition

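
The "tokenizer + filters" structure can be sketched without Lucene as a chain of small functions. The class `AnalyzerChainSketch`, its method names, and its stop-word set are assumptions made for illustration; real Lucene analyzers are built from `Tokenizer` and `TokenFilter` classes and behave differently on punctuation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class; it mimics the tokenizer + filters idea without using Lucene.
class AnalyzerChainSketch {
    // Illustrative stop-word set, not Lucene's default list.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "in", "of"));

    // Tokenizer: split the text at white space.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Filter: lowercase each token.
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Filter: drop the tokens found in the stop-word set.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                out.add(token);
            }
        }
        return out;
    }

    // The analyzer is the tokenizer followed by the filters, applied in order.
    static List<String> analyze(String text) {
        return removeStopWords(lowerCase(tokenize(text)));
    }

    public static void main(String[] args) {
        System.out.println(analyze("The quick brown fox jumped over the lazy dog"));
        // prints [quick, brown, fox, jumped, over, lazy, dog]
    }
}
```

The order of the filters matters: lowercasing runs before stop-word removal so that "The" is matched against the lowercase entry "the".
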
Query
  • Asking Lucene “what documents contain this word?”
  • Lucene applies an Analyzer to each queried word
  • Queries can be built programmatically
  • Powerful Query Syntax

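Query code
    SimpleAnalyzer analyzer = new SimpleAnalyzer();
    // "content" is the default field, used when the query names no field
    QueryParser parser = new QueryParser(Version.LUCENE_30, "content", analyzer);
    Query query = parser.parse("big");
    // searcher is an IndexSearcher opened on the index (not shown on the slide); 10 = max hits
    TopDocs docs = searcher.search(query, 10);
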
Query Syntax: Basic
    title:montreal
  • field: title; text: montreal

Query Syntax: Range
    name:[a TO k]
  • field: name; range: [a TO k]

Query Syntax: Boolean
    title:(java AND programming)
  • field: title; operator: AND

Query Syntax: Boolean
    title:java OR name:pascal
  • fields: title, name; operator: OR

Query Syntax: Phrase
    title:"Lucene in Action"
  • field: title; phrase: "Lucene in Action"

Query Syntax: Wildcard
    title:program*
  • field: title; term prefix: program

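
Range and prefix queries both rely on terms being kept in sorted order. The sketch below shows that idea with a `TreeMap` in plain Java. It is an assumption-laden illustration, not how Lucene implements these queries: `TermRangeSketch`, its methods, and the sample terms are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration class; it is not part of Lucene.
class TermRangeSketch {
    // Sorted term dictionary: term -> ids of the documents containing it.
    private final TreeMap<String, TreeSet<Integer>> postings = new TreeMap<>();

    void add(String term, int docId) {
        postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
    }

    // Like name:[a TO k]: every term between lower and upper, both inclusive.
    List<Integer> range(String lower, String upper) {
        return merge(postings.subMap(lower, true, upper, true).values());
    }

    // Like title:program*: every term that starts with the prefix.
    List<Integer> prefix(String prefix) {
        String end = prefix + Character.MAX_VALUE;
        return merge(postings.subMap(prefix, true, end, false).values());
    }

    // Union of the document ids of all matching terms, in ascending order.
    private static List<Integer> merge(Collection<TreeSet<Integer>> lists) {
        TreeSet<Integer> merged = new TreeSet<>();
        for (TreeSet<Integer> docs : lists) {
            merged.addAll(docs);
        }
        return new ArrayList<>(merged);
    }

    public static void main(String[] args) {
        TermRangeSketch names = new TermRangeSketch();
        names.add("alice", 0);
        names.add("bob", 1);
        names.add("kate", 2);
        names.add("pascal", 3);
        System.out.println(names.range("a", "k"));  // prints [0, 1]

        TermRangeSketch titles = new TermRangeSketch();
        titles.add("program", 0);
        titles.add("programming", 1);
        titles.add("project", 2);
        System.out.println(titles.prefix("program"));  // prints [0, 1]
    }
}
```

Because terms are compared as strings, "kate" sorts after "k" and falls outside the range [a TO k] in this sketch.
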
Lucene Demo
  • Searching unit tests written by David

Lucene summary
  • Inverted index for fast document retrieval
  • Relevancy scoring
  • Powerful query syntax

Solr
  • Created by Yonik Seeley in 2004 and released as open source in 2006
  • HTTP application built around Lucene
  • Makes it easy to develop search solutions
  • Advanced features developed on top of Lucene
  • As of 2010, Lucene and Solr are merged

Solr Schema
  • Solr lets you administer one or more Lucene indexes
  • Each index has its own schema
  • The schema lists all fields allowed for an index
  • The schema defines the analyzers for each field

Solr Schema
    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="title" type="text" indexed="true" stored="true" />
    <field name="presenter" type="text_ws" indexed="true" stored="true" />
    <field name="date" type="date" indexed="true" stored="true" />
    <field name="abstract" type="text" indexed="true" stored="true" />

Solr Schema
    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.ISOLatin1AccentFilterFactory" />
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt" />
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.ISOLatin1AccentFilterFactory" />
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt" />
      </analyzer>
    </fieldType>

Solr Schema
    <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
      </analyzer>
    </fieldType>

Solr Indexing
  • HTTP POST
  • XML by default, but also CSV
  • Multithreaded
  • Advanced features: binary document extraction, DB plugin

Solr Indexing
    <add>
      <doc>
        <field name="id">002</field>
        <field name="title">Lucene And Solr Introduction</field>
        <field name="presenter">Pascal Dimassimo</field>
        <field name="date">2010-11-18T00:00:00Z</field>
        <field name="abstract">...</field>
      </doc>
      <doc>...</doc>
    </add>

    curl http://localhost:8983/solr/update -H "Content-Type: text/xml" --data-binary @add.xml

Solr Query
  • HTTP GET
  • Query Parameters
  • Response in XML by default, but other formats are supported (json, php, ruby)

Solr Query
    curl "http://localhost:8983/solr/select?q=title:Lucene"

    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">269</int>
        <lst name="params">
          <str name="q">title:Lucene</str>
        </lst>
      </lst>
      <result name="response" numFound="1" start="0">
        <doc>
          <str name="id">002</str>
          <str name="title">Lucene And Solr Introduction</str>
          <str name="presenter">Pascal Dimassimo</str>
          <date name="date">2010-11-18T00:00:00Z</date>
          <str name="abstract">...</str>
        </doc>
      </result>
    </response>

Solr Query Parameters
  • q: Lucene query
  • sort: field to sort on. Defaults to score
  • start: offset of the results page to display. Defaults to 0
  • rows: number of results to display per page. Defaults to 10
  • fq: filter query. Defaults to all documents
  • fl: list of fields to display per document. Defaults to all fields
  • wt: format of the response. Defaults to xml

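
Since a Solr query is just an HTTP GET with these parameters, the request URL can be assembled with the Java standard library. The sketch below is a minimal illustration: the class `SolrQueryUrlSketch` and the method `selectUrl` are hypothetical helpers, and the base URL is the local one used in the curl examples. It only builds the URL and does not send the request.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class; it only builds the request URL and does not call Solr.
class SolrQueryUrlSketch {
    // Appends /select and the URL-encoded parameters to the Solr base URL.
    static String selectUrl(String baseUrl, Map<String, String> params) {
        StringBuilder url = new StringBuilder(baseUrl).append("/select");
        char separator = '?';
        try {
            for (Map.Entry<String, String> param : params.entrySet()) {
                url.append(separator)
                   .append(URLEncoder.encode(param.getKey(), "UTF-8"))
                   .append('=')
                   .append(URLEncoder.encode(param.getValue(), "UTF-8"));
                separator = '&';
            }
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a Java platform.
            throw new IllegalStateException(e);
        }
        return url.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "title:Lucene");
        params.put("sort", "date desc");
        params.put("rows", "5");
        params.put("wt", "json");
        System.out.println(selectUrl("http://localhost:8983/solr", params));
        // prints http://localhost:8983/solr/select?q=title%3ALucene&sort=date+desc&rows=5&wt=json
    }
}
```

Encoding the values matters because query strings such as `title:"Lucene in Action"` contain spaces and quotes that are not valid in a raw URL.
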
Solr Facets
  • For a query's results, lists all distinct indexed values of a field with their frequencies
  • Useful for drilling down into a result set

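
What a facet computes can be shown in a few lines of plain Java: for the documents matched by a query, count how often each distinct value of a field occurs. This is a conceptual sketch only. The class `FacetSketch`, its document model (a map from field name to a list of values), and the sample categories are assumptions, and Solr computes facets from the index rather than by looping over results like this.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; it is not Solr's faceting implementation.
class FacetSketch {
    // Each result document is modelled as field name -> values (fields can be multivalued).
    static Map<String, Integer> facet(List<Map<String, List<String>>> results, String field) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, List<String>> doc : results) {
            List<String> values = doc.get(field);
            if (values == null) {
                continue;  // this document has no value for the field
            }
            for (String value : values) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Builds a sample document that only has a "category" field.
    static Map<String, List<String>> doc(String... categories) {
        Map<String, List<String>> fields = new HashMap<>();
        fields.put("category", Arrays.asList(categories));
        return fields;
    }

    public static void main(String[] args) {
        List<Map<String, List<String>>> results = Arrays.asList(
                doc("Fiction", "Canadian"),
                doc("Fiction"),
                doc("Science"));
        System.out.println(facet(results, "category"));
        // prints {Canadian=1, Fiction=2, Science=1}
    }
}
```

A user interface would typically show these counts next to each value, and selecting "Fiction" would rerun the query restricted to that value to drill down.
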
SolrJ
  • Library to connect and interact with Solr

    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    String url = "http://localhost:8983/solr";
    CommonsHttpSolrServer server = new CommonsHttpSolrServer(url);
    SolrInputDocument doc = new SolrInputDocument();
    // The third argument is the field boost
    doc.addField("id", "id1", 1.0f);
    doc.addField("name", "doc1", 1.0f);
    doc.addField("price", 10);
    server.add(doc);
    server.commit();

Solr Demo
  • Using Evernote Data
  • Indexed using solr-feeder
  • https://github.com/pascaldimassimo/solr-feeder

Solr Features
  • Text Highlighting
  • Spell Checking
  • Forced Placements
  • More Like This
  • Replication
  • Database connector
  • Geo-location

More Information
