Lucene And Solr Intro

A presentation I gave at the Montreal JUG on November 18th 2010
http://goo.gl/7WxuI

  • Do one thing well. Apache Licence. 10 years old. Version 3.0. It is fast!
  • Analyze documents: split each document into words. Get documents in. Lucene returns a list of documents as the search result.
  • Book example: without an index, you start from the beginning every time you look for a word. It is much simpler to use an index. Inverted index: for a word, list the documents that contain it.
  • Analysis: transform the content into terms. A term can be more than one word: “New York”. Position is also stored. Binary search: O(log n), logarithmic. Boolean search. Wildcard search.
  • Lucene generates an id for each document. Stored = original content stored “as is” on disk, so it can be returned to the user when the document is returned. When Lucene returns a document, it returns its id. You can retrieve the stored content with the id.
  • Document: email, article, user. Email fields: sender, recipient, title, content, attachment. Article fields: author, title, category, content, publication date. Database analogy: document = row, field = column. Documents with different fields can be stored in the same index.
  • Lucene can return results sorted by a field.
  • Terms are almost synonymous with words.
  • Basic Query instance: TermQuery. Use PerFieldAnalyzerWrapper to specify a specific analyzer for each field.
  • Terms are stored in alphabetical order, using String.compareTo. A range query returns all docs for each term in the range.
  • Supports AND, OR, NOT. Supports +, -.
  • CNET used it to help users find products more easily.

Transcript

  • 1. Lucene And Solr Introduction By Pascal Dimassimo [email_address]
  • 2. About me
    • Java developer with 10+ years of experience
    • 3. Working for OpenText/Nstein on Semantic Navigation application
    • 4. http://semanticnavigation.opentext.com/
  • 5. History
    • Lucene launches in 2000
    • 6. Solr launches in 2006
    • 7. Lucid Imagination in 2009
      • Hire the core developers of Lucene and Solr
      • 8. Offer commercial support
    • Lucene Revolution in 2010
  • 9. Buzz
    • According to IDC
      • “53% of companies using Open source use Lucene”
      • 10. “Largely responsible for significant decline in commercial OEM revenue”
    Source http://lucenerevolution.com/sites/default/files/slides/Lucene%20Rev%20Preso%20IDC_MarketTrends_Reynolds.pdf
  • 11. Lucene?
    • “Lucene is a powerful Java search library that lets you easily add search to any application” - Lucene In Action 2nd Edition
    • 12. NOT an application
    • 13. Text indexing and searching
    • 14. Open Source
    • 15. Mature
    • 16. Easy to learn API
  • 17. Typical Search App (taken from Lucene In Action 2nd Edition)
  • 18. Search?
    • Naive approach: linear-search (à la grep)
    • 19. O(n) -> Slow...
    • 20. You want to find a word in a book: how do you do it?
    • 21. Inverted Index
  • 22. Inverted Index (original slide from Michael Busch, available at http://goo.gl/0MQvy)
  • 23. Inverted Index (original slide from Michael Busch, available at http://goo.gl/0MQvy)
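
The inverted index idea from the slides above can be sketched in a few lines of plain Java. This is only an illustration of the data structure, not Lucene's implementation: the class name InvertedIndexSketch, the build method and the second sample sentence are made up for this example, and the "analysis" is just lowercasing and splitting at white space.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative sketch only: a toy inverted index, not Lucene's implementation.
final class InvertedIndexSketch {

    // Maps each lowercased word to the sorted set of ids of the documents containing it.
    // A document id is simply the position of the document in the input list.
    static Map<String, Set<Integer>> build(List<String> documents) {
        Map<String, Set<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < documents.size(); docId++) {
            for (String word : documents.get(docId).toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    index.computeIfAbsent(word, w -> new TreeSet<>()).add(docId);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
                "The old night keeper keeps the keep in the town",
                "The town keeps the old keep");
        Map<String, Set<Integer>> index = build(docs);
        // A lookup reads one map entry instead of scanning every document.
        System.out.println("keeper -> " + index.get("keeper"));
        System.out.println("town   -> " + index.get("town"));
    }
}
```

Searching for a word then costs one lookup in the sorted term map rather than a linear scan over all the text, which is the point made on the "Search?" slide.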
  • 24. Lucene Document
        FSDirectory dir = FSDirectory.open(new File("./index"));
        SimpleAnalyzer analyzer = new SimpleAnalyzer();
        MaxFieldLength len = IndexWriter.MaxFieldLength.UNLIMITED;
        IndexWriter writer = new IndexWriter(dir, analyzer, true, len);

        String content = "The old night keeper keeps the keep in the town";
        Document doc = new Document();
        doc.add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.commit();
  • 25. Lucene Document
    • Document: what is returned as search result
    • 26. Organized in fields. A field must be specified at query time!
    • 27. Schema-less
    • 28. Plain text
  • 29. Fields
    • Indexed: put the content in the inverted index.
    • 30. Analyzed: split the content into terms to be added to the inverted index. Terms are normalized.
    • 31. Stored: Keep the original content on disk
    • 32. Multivalued: Repeat the same field multiple times in the same document with different values
  • 33. Lucene Document
        String content = "The old night keeper keeps the keep in the town";
        String author = "Peter Smith";
        String category1 = "Fiction";
        String category2 = "Canadian";
        String isbn = "978-1-933988-17-7";
        String id = "ABY123";

        Document doc = new Document();
        doc.add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("author", author, Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("category", category1, Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("category", category2, Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("isbn", isbn, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("id", id, Field.Store.YES, Field.Index.NO));
        writer.addDocument(doc);
        writer.commit();
  • 34. Lucene Demo
    • Indexing unit tests written by David
  • 35. Relevancy
    • How do you tell which document is more important?
  • 36. Vectorial Model
    • N dimension vectors for documents and queries
    • 37. Score represents how close the vectors are
    • 38. Tf-idf (term frequency–inverse document frequency)
    • 39. Documents with many of the search terms are scored higher
    • 40. Smaller documents are scored higher
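
To make the tf-idf bullets concrete, here is a small plain-Java sketch of a simplified score. It is an assumption-laden illustration, not Lucene's actual scoring formula (which combines more factors): the class TfIdfSketch, its method names and the exact idf and length-normalization expressions are chosen for this example only.

```java
import java.util.List;

// Illustrative sketch only: a simplified tf-idf score, not Lucene's real formula.
final class TfIdfSketch {

    // Term frequency: how many times the term occurs in the tokenized document.
    static long termFrequency(String term, List<String> doc) {
        return doc.stream().filter(term::equals).count();
    }

    // Inverse document frequency: terms found in fewer documents weigh more.
    // The "1 +" and "+ 1" smoothing terms are an assumption of this sketch.
    static double idf(String term, List<List<String>> corpus) {
        long docsWithTerm = corpus.stream().filter(d -> d.contains(term)).count();
        return 1.0 + Math.log((double) corpus.size() / (docsWithTerm + 1));
    }

    // Sum of tf * idf over the query terms, divided by sqrt(document length)
    // so that shorter documents are favored.
    static double score(List<String> query, List<String> doc, List<List<String>> corpus) {
        double sum = 0.0;
        for (String term : query) {
            sum += termFrequency(term, doc) * idf(term, corpus);
        }
        return doc.isEmpty() ? 0.0 : sum / Math.sqrt(doc.size());
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
                List.of("lucene", "search"),
                List.of("lucene", "is", "a", "search", "library", "for", "java"),
                List.of("solr", "server"));
        List<String> query = List.of("lucene");
        for (List<String> doc : corpus) {
            System.out.println(doc + " -> " + score(query, doc, corpus));
        }
    }
}
```

With this sketch, the two-word document scores higher than the seven-word document for the query "lucene" because of the length normalization, and a document that contains none of the query terms scores zero.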
  • 41. Analyzer (taken from Lucene In Action 2nd Edition)
  • 42. Analyzer
    • Convert text into terms
    • 43. Used when indexing and querying
    • 44. Tokenizer + Filters
    • 45. Custom analyzers
  • 46. Analyzer
        "The quick brown fox jumped over the lazy dog"
        WhitespaceAnalyzer: [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
        SimpleAnalyzer:     [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
        StopAnalyzer:       [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
        StandardAnalyzer:   [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
        Example from Lucene In Action 2nd Edition
  • 47. Analyzer
        "XY&Z Corporation - xyz@example.com"
        WhitespaceAnalyzer: [XY&Z] [Corporation] [-] [xyz@example.com]
        SimpleAnalyzer:     [xy] [z] [corporation] [xyz] [example] [com]
        StopAnalyzer:       [xy] [z] [corporation] [xyz] [example] [com]
        StandardAnalyzer:   [xy&z] [corporation] [xyz@example.com]
        Example from Lucene In Action 2nd Edition
  • 48. Custom Analyzers
        WhitespaceTokenizer: Tokenizes at white spaces
        KeywordTokenizer:    Tokenizes the input as a single token
        StandardTokenizer:   Tokenizes at white spaces but keeps high-level entities (email addresses, etc.) as single tokens
        LowerCaseFilter:     Lowercases token text
        StopFilter:          Removes words that exist in a provided set of words
        PorterStemFilter:    Stems each token using the Porter stemming algorithm. For example, country and countries both stem to countri.
        Some descriptions from Lucene In Action 2nd Edition
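
The "Tokenizer + Filters" structure can be sketched without Lucene as a chain of small functions. This is a plain-Java illustration under stated assumptions: AnalyzerSketch and its methods are hypothetical names, and they only mimic what a white space tokenizer, a lowercase filter and a stop filter do; they are not the Lucene classes listed above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative sketch only: mimics a tokenizer followed by two filters.
final class AnalyzerSketch {

    // Tokenizer step: split the text at white space.
    static List<String> whitespaceTokenize(String text) {
        return Arrays.stream(text.trim().split("\\s+"))
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    // Filter step: lowercase every token.
    static List<String> lowerCase(List<String> tokens) {
        return tokens.stream()
                .map(token -> token.toLowerCase(Locale.ROOT))
                .collect(Collectors.toList());
    }

    // Filter step: drop tokens found in the provided stop word set.
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        return tokens.stream()
                .filter(token -> !stopWords.contains(token))
                .collect(Collectors.toList());
    }

    // The "analyzer": tokenizer first, then the filters in order.
    static List<String> analyze(String text, Set<String> stopWords) {
        return removeStopWords(lowerCase(whitespaceTokenize(text)), stopWords);
    }

    public static void main(String[] args) {
        // prints [quick, brown, fox, jumped, over, lazy, dog]
        System.out.println(analyze("The quick brown fox jumped over the lazy dog", Set.of("the")));
    }
}
```

The order of the filters matters in this sketch: lowercasing runs before the stop filter, so the capitalized "The" is removed as well.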
  • 49. Query
    • Asking Lucene “what documents contain this word?”
    • 50. Lucene applies an Analyzer to each queried word
    • 51. Queries can be built programmatically
    • 52. Powerful Query Syntax
  • 53. Query code
        SimpleAnalyzer analyzer = new SimpleAnalyzer();
        QueryParser parser = new QueryParser(Version.LUCENE_30, "content", analyzer);
        Query query = parser.parse("big");
        TopDocs docs = searcher.search(query, 10);
  • 54. Query Syntax: Basic
        title:montreal
        (field: title, text: montreal)
  • 55. Query Syntax: Range
        name:[a TO k]
        (field: name, range: [a TO k])
  • 56. Query Syntax: Boolean
        title:(java AND programming)
        (field: title, operator: AND)
  • 57. Query Syntax: Boolean
        title:java OR name:pascal
        (fields: title and name, operator: OR)
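
One way to picture how the boolean operators combine results: each term yields the set of ids of the documents that contain it, and AND, OR and NOT become set operations. The sketch below is plain Java with made-up document ids; BooleanQuerySketch is a hypothetical name and this is not how Lucene evaluates boolean queries internally.

```java
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only: boolean operators as set operations on document ids.
final class BooleanQuerySketch {

    // AND: documents present in both sets.
    static Set<Integer> and(Set<Integer> left, Set<Integer> right) {
        Set<Integer> result = new TreeSet<>(left);
        result.retainAll(right);
        return result;
    }

    // OR: documents present in either set.
    static Set<Integer> or(Set<Integer> left, Set<Integer> right) {
        Set<Integer> result = new TreeSet<>(left);
        result.addAll(right);
        return result;
    }

    // NOT: documents of the left set that are absent from the right set.
    static Set<Integer> not(Set<Integer> left, Set<Integer> right) {
        Set<Integer> result = new TreeSet<>(left);
        result.removeAll(right);
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> java = Set.of(1, 2, 3);        // hypothetical ids of docs with title:java
        Set<Integer> programming = Set.of(2, 3, 4); // hypothetical ids of docs with title:programming
        System.out.println("java AND programming -> " + and(java, programming));
        System.out.println("java OR programming  -> " + or(java, programming));
        System.out.println("java NOT programming -> " + not(java, programming));
    }
}
```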
  • 58. Query Syntax: Phrase
        title:"Lucene in Action"
        (field: title, phrase: "Lucene in Action")
  • 59. Query Syntax: Wildcard
        title:program*
        (field: title, term prefix: program)
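
Range and prefix queries benefit from terms being kept in sorted order, as the speaker notes mention. A rough plain-Java sketch of that idea uses a sorted map from term to document ids; SortedTermsSketch, the sample terms and the document ids are invented for this illustration, and Lucene's own term dictionary is organized differently.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative sketch only: range and prefix lookups over a sorted term dictionary.
final class SortedTermsSketch {

    // Range query such as [a TO k]: union of the documents of every term
    // between low and high, both inclusive, in String.compareTo order.
    static Set<Integer> range(NavigableMap<String, Set<Integer>> terms, String low, String high) {
        Set<Integer> docs = new TreeSet<>();
        for (Set<Integer> postings : terms.subMap(low, true, high, true).values()) {
            docs.addAll(postings);
        }
        return docs;
    }

    // Prefix query such as program*: jump to the first term >= prefix,
    // then walk forward while terms still start with the prefix.
    static Set<Integer> prefix(NavigableMap<String, Set<Integer>> terms, String prefix) {
        Set<Integer> docs = new TreeSet<>();
        for (Map.Entry<String, Set<Integer>> entry : terms.tailMap(prefix, true).entrySet()) {
            if (!entry.getKey().startsWith(prefix)) {
                break;
            }
            docs.addAll(entry.getValue());
        }
        return docs;
    }

    public static void main(String[] args) {
        NavigableMap<String, Set<Integer>> terms = new TreeMap<>();
        terms.put("alice", Set.of(0));
        terms.put("bob", Set.of(1));
        terms.put("kevin", Set.of(2));
        terms.put("program", Set.of(3));
        terms.put("programming", Set.of(4));
        terms.put("project", Set.of(5));
        System.out.println("[a TO k] -> " + range(terms, "a", "k"));
        System.out.println("program* -> " + prefix(terms, "program"));
    }
}
```

Note that in this sketch "kevin" falls outside [a TO k], because "kevin" sorts after "k" under String.compareTo.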
  • 60. Lucene Demo
    • Searching unit tests written by David
  • 61. Lucene summary
    • Inverted index for fast document retrieval
    • 62. Relevancy scoring
    • 63. Powerful query syntax
  • 64. Solr
    • Created by Yonik Seeley in 2004 and released as open source in 2006
    • 65. HTTP application built around Lucene
    • 66. Makes it easy to develop search solutions
    • 67. Advanced features developed on top of Lucene
    • 68. As of 2010, Lucene and Solr are merged
  • 69. Solr Schema
    • Solr lets you administer one or more Lucene indexes
    • 70. Each index has its own schema
    • 71. Lists all fields allowed for an index
    • 72. Defines the analyzers for each field
  • 73. Solr Schema
        <field name="id" type="string" indexed="true" stored="true" required="true" />
        <field name="title" type="text" indexed="true" stored="true" />
        <field name="presenter" type="text_ws" indexed="true" stored="true" />
        <field name="date" type="date" indexed="true" stored="true" />
        <field name="abstract" type="text" indexed="true" stored="true" />
  • 74. Solr Schema
        <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
          <analyzer type="index">
            <tokenizer class="solr.WhitespaceTokenizerFactory" />
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
            <filter class="solr.LowerCaseFilterFactory" />
            <filter class="solr.ISOLatin1AccentFilterFactory" />
            <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt" />
          </analyzer>
          <analyzer type="query">
            <tokenizer class="solr.WhitespaceTokenizerFactory" />
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
            <filter class="solr.LowerCaseFilterFactory" />
            <filter class="solr.ISOLatin1AccentFilterFactory" />
            <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt" />
          </analyzer>
        </fieldType>
  • 75. Solr Schema
        <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
          <analyzer type="index">
            <tokenizer class="solr.WhitespaceTokenizerFactory" />
            <filter class="solr.LowerCaseFilterFactory" />
          </analyzer>
          <analyzer type="query">
            <tokenizer class="solr.WhitespaceTokenizerFactory" />
            <filter class="solr.LowerCaseFilterFactory" />
          </analyzer>
        </fieldType>
  • 76. Solr Indexation
    • HTTP POST
    • 77. XML by default, but also CSV
    • 78. Multi threaded
    • 79. Advanced features: binary document extraction, DB plugin
  • 80. Solr Indexation
        <add>
          <doc>
            <field name="id">002</field>
            <field name="title">Lucene And Solr Introduction</field>
            <field name="presenter">Pascal Dimassimo</field>
            <field name="date">2010-11-18T00:00:00Z</field>
            <field name="abstract">...</field>
          </doc>
          <doc>...</doc>
        </add>

        curl http://localhost:8983/solr/update -H "Content-Type: text/xml" --data-binary @add.xml
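
When the add message is generated by a program rather than written by hand, field values need XML escaping. The following plain-Java sketch builds a one-document add message of the shape shown above; SolrAddXmlSketch and its methods are hypothetical helpers for this example, not part of Solr or SolrJ, and the sketch only builds the string without sending it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: builds the XML body of a one-document <add> message.
final class SolrAddXmlSketch {

    // Escapes the characters that are special in XML text and attribute values.
    static String escape(String value) {
        return value.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Turns field name/value pairs into <add><doc><field name="...">...</field></doc></add>.
    static String toAddXml(Map<String, String> fields) {
        StringBuilder xml = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            xml.append("<field name=\"").append(escape(field.getKey())).append("\">")
               .append(escape(field.getValue()))
               .append("</field>");
        }
        return xml.append("</doc></add>").toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", "002");
        fields.put("title", "Lucene And Solr Introduction");
        fields.put("presenter", "Pascal Dimassimo");
        // The resulting string could be saved as add.xml and posted with the curl command above.
        System.out.println(toAddXml(fields));
    }
}
```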
  • 81. Solr Query
    • HTTP GET
    • 82. Query Parameters
    • 83. Response in XML by default, but other formats are supported (json, php, ruby)
  • 84. Solr Query
        curl "http://localhost:8983/solr/select?q=title:Lucene"

        <response>
          <lst name="responseHeader">
            <int name="status">0</int>
            <int name="QTime">269</int>
            <lst name="params">
              <str name="q">title:Lucene</str>
            </lst>
          </lst>
          <result name="response" numFound="1" start="0">
            <doc>
              <str name="id">002</str>
              <str name="title">Lucene And Solr Introduction</str>
              <str name="presenter">Pascal Dimassimo</str>
              <date name="date">2010-11-18T00:00:00Z</date>
              <str name="abstract">...</str>
            </doc>
          </result>
        </response>
  • 85. Solr Query Parameters
        q      Lucene query
        sort   Field to sort on. Defaults to score
        start  Offset into the results for the page to display. Default 0
        rows   Number of results to display per page. Default 10
        fq     Filter query. Defaults to all documents
        fl     List of fields to display per document. Defaults to all fields
        wt     Format of the response. Defaults to xml
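
As a rough illustration of how these parameters end up in a request, the plain-Java sketch below assembles a select URL and computes the start offset for a given page. SolrQueryUrlSketch, selectUrl and startFor are hypothetical names for this example; the base URL is the one used on the slides, and values are URL-encoded, so the colon in title:Lucene becomes %3A.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Illustrative sketch only: builds a Solr select URL from query parameters.
final class SolrQueryUrlSketch {

    // Joins URL-encoded name=value pairs after "<baseUrl>/select?".
    static String selectUrl(String baseUrl, Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> param : params.entrySet()) {
            query.add(URLEncoder.encode(param.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(param.getValue(), StandardCharsets.UTF_8));
        }
        return baseUrl + "/select?" + query;
    }

    // Value of the start parameter for a 1-based page number and a page size (rows).
    static int startFor(int page, int rows) {
        return (page - 1) * rows;
    }

    public static void main(String[] args) {
        int rows = 10;
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "title:Lucene");
        params.put("start", String.valueOf(startFor(2, rows))); // second page of results
        params.put("rows", String.valueOf(rows));
        params.put("wt", "json");
        System.out.println(selectUrl("http://localhost:8983/solr", params));
    }
}
```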
  • 86. Solr Facets
    • For the results of a query, lists all distinct indexed values of a field with their frequencies
    • 87. Useful for drilling down into a result set
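
Conceptually, a facet is a count of field values over the documents that match a query. The plain-Java sketch below shows that counting on made-up data; FacetSketch is a hypothetical name, each matching document is represented only by the values of one multivalued field, and Solr computes facets from the index rather than this way.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: facet counts as value frequencies over matching documents.
final class FacetSketch {

    // Each inner list holds the values of the faceted field for one matching document.
    static Map<String, Integer> facetCounts(List<List<String>> fieldValuesPerDoc) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> values : fieldValuesPerDoc) {
            for (String value : values) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical "category" values of three documents matching some query.
        List<List<String>> categories = List.of(
                List.of("Fiction", "Canadian"),
                List.of("Fiction"),
                List.of("Canadian", "History"));
        System.out.println(facetCounts(categories));
    }
}
```

Drilling down then amounts to adding the chosen value as a filter (for example with the fq parameter) and running the query again.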
  • 88. SolrJ
    • Library to connect and interact with Solr
        String url = "http://localhost:8983/solr";
        CommonsHttpSolrServer server = new CommonsHttpSolrServer(url);

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "id1", 1.0f);
        doc.addField("name", "doc1", 1.0f);
        doc.addField("price", 10);
        server.add(doc);
        server.commit();
  • 89. Solr Demo
    • Using Evernote Data
    • 90. Indexed using solr-feeder
    • 91. https://github.com/pascaldimassimo/solr-feeder
  • 92. Solr Features
  • 99. More Information