Search Me: Using Lucene.Net

May 2012 JaxDUG presentation by Zachary Gramana on using the Lucene.NET library to add search functionality to .NET applications. Contains an overview of search/information retrieval concepts and highlights some common use-cases.

  1. SEARCH ME: Using Lucene.Net In Your Apps
  2. About Me
     • Zachary Johnson Gramana
     • Engineer at Potts Consulting Group
     • Proud new father of Rex
  3. Search is...
     • A vague term that encompasses multiple problems.
     • A better term is “information retrieval”, or an “IR system”.
     • Interdisciplinary, drawing from:
        • computer science (parsing, data structures)
        • psychology (query grammar, human-computer interaction)
        • linguistics (textual analysis)
        • information science (scoring/relevancy)
        • maths (document retrieval strategy)
  4. Problems Solved
     • Information Overload
     • Transparently handle all kinds of data:
        • structured (hierarchical)
        • semi-structured (markup)
        • unstructured data (plain text)
  5. Problems Solved
     • Information Overload
     • Find the information that users want, not just the information they asked for.
     • Transparently handle all kinds of data:
        • structured (hierarchical)
        • semi-structured (markup)
        • unstructured data (plain text)
     • Single portal to multiple data types and sources.
     • Do it fast!
  6. Basic IR System Capabilities
     • Collection (importing, crawling)
        • Anonymous web page crawling (Google)
        • User-uploaded photographs (Flickr)
        • Publisher upload of .mp3 files (iTunes)
     • Indexing
        • Analysis
        • Modify index data structure
     • Querying
        • Input parsing
        • Query generation & execution
        • Collecting the results
        • Filtering the results (optional)
  7. What is Lucene.Net?
     • A port of the Apache Software Foundation's Lucene libraries from Java to C#.
     • It's a search library.
     • Lucene was created by Doug Cutting.
        • Named after his wife (Lucene is her middle name).
        • First released in 2000 on SourceForge.
        • Migrated to the Apache Software Foundation in September 2001.
  8. Used By
     • StackOverflow
     • JIRA
     • IBM
     • Akamai
     • Apple
     • Autodesk
     • Orchard
     • RavenDB
     • CouchDB
  9. What Isn't Lucene.NET
     • Not a complete information retrieval system
        • Check out Google Search Appliance instead: http://www.google.com/enterprise/search/
     • Not a web crawler.
        • Check out Arachnode instead: http://arachnode.net
     • Not a query service.
        • Check out Solr instead: http://lucene.apache.org/solr
     • Not hard
        • Check out Windows Search SDK instead: http://bit.ly/ImRtMk
  10. Concept and Overview
  11. What's In an Index?
     • Stores a collection of Documents, each of which represents a source record.
     • Documents contain:
        • Metadata about the source record.
        • (optionally) actual data from the source record.
        • (optionally) derived analytical products.
     • Documents store a collection of token/frequency pairs (optionally with positions), plus a document identifier.
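The token/frequency pairs (with optional positions) described on this slide can be pictured with a small sketch. This is plain Python for illustration only, not the Lucene.Net API; the function name `term_postings` and the sample document are hypothetical.

```python
from collections import defaultdict

def term_postings(tokens):
    """Map each token to (frequency, positions) within a single document."""
    positions = defaultdict(list)
    for pos, tok in enumerate(tokens):
        positions[tok].append(pos)
    # Frequency is simply the number of recorded positions.
    return {tok: (len(p), p) for tok, p in positions.items()}

# A "document": an identifier plus its token/frequency/position data.
doc = {"id": 42, "terms": term_postings("to be or not to be".split())}
print(doc["terms"]["to"])
```

Keeping positions is what later makes phrase and proximity matching possible; dropping them saves space when only frequencies are needed.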
  12. Lucene's Index Structure
     • Documents store a collection of fields.
     • Fields are collections of terms, plus an identifier and optional term vectors.
     • Terms are key-value pairs of a field name and a string value.
     • Lucene provides special classes to deal with tricky data, like the NumericField class.
     • Term vectors are terms, along with their frequency counts and positions.
     • Fields can be indexed, stored, or both.
        • Storing allows a term value to be retrieved after indexing.
        • Indexing adds the term value to Lucene's inverted index.
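The difference between "stored" and "indexed" fields can be sketched as follows. This is a simplified model in plain Python, not Lucene.Net code: the `Field` class, `add_document`, and the whitespace tokenizer are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Field:
    name: str
    value: str
    stored: bool = False   # keep the original value for retrieval
    indexed: bool = True   # make the value searchable

def add_document(doc_id, fields, inverted, stored):
    """Add one document to a toy inverted index and a toy stored-value table."""
    for f in fields:
        if f.indexed:
            # A term is a (field name, token) pair, as on the slide.
            for token in f.value.lower().split():
                inverted.setdefault((f.name, token), set()).add(doc_id)
        if f.stored:
            stored.setdefault(doc_id, {})[f.name] = f.value

inverted, stored = {}, {}
add_document(
    1,
    [Field("title", "Search Me", stored=True),
     Field("body", "Using Lucene.Net in your apps")],
    inverted,
    stored,
)
```

Here the body is searchable but cannot be read back, while the title is both searchable and retrievable.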
  13. The Inverted Index (figure taken from Apple's excellent “Search Basics” article at http://bit.ly/JO59kH)
  14. Lucene's Index Structure
     • What is an “inverted index”?
        • “verted” (forward) index: a document points to a collection of terms
        • inverted index: a term points to a collection of documents
     • One or more segments
        • Each is a self-contained, independent partition of the entire index.
        • Stores: field names, field values, term dictionary, term frequencies, term proximities, normalization factor, term vectors, and (optional) deleted record lookup table.
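The forward-versus-inverted distinction above can be sketched in a few lines. This is a conceptual illustration in plain Python with made-up data, not how Lucene lays out its segments on disk.

```python
def invert(forward):
    """Turn a doc -> terms mapping into a term -> docs mapping."""
    inverted = {}
    for doc_id, terms in forward.items():
        for term in terms:
            inverted.setdefault(term, set()).add(doc_id)
    return inverted

# Forward index: each document points to its terms.
forward = {
    1: ["lucene", "search"],
    2: ["search", "index"],
    3: ["lucene", "index"],
}

# Inverted index: each term points to the documents containing it.
inverted = invert(forward)
print(sorted(inverted["search"]))
```

Answering "which documents contain this term?" becomes a single dictionary lookup instead of a scan over every document.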
  15. Analysis (figure taken from Apple's excellent “Search Basics” article at http://bit.ly/JO59kH)
  16. Tokenization (figure taken from Thomas Koch's presentation “Search Basics and Lucene” at http://bit.ly/JTOrnH)
  17. Tokenization
     • Normalization: “Gramåna” > “gramana”
     • Stemming: “preschooling” > “school”
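The two examples on this slide can be reproduced with a rough sketch. The normalization step uses Python's standard `unicodedata` module; the stemmer is a deliberately naive toy with hypothetical prefix and suffix lists, not the Snowball or Porter algorithm and not what a Lucene analyzer does internally.

```python
import unicodedata

def normalize(token):
    """Strip diacritics and lowercase, e.g. a-with-ring becomes plain 'a'."""
    decomposed = unicodedata.normalize("NFKD", token)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()

def toy_stem(token, prefixes=("pre",), suffixes=("ing", "ed", "s")):
    """Remove at most one known prefix and one known suffix (toy rules only)."""
    for p in prefixes:
        if token.startswith(p) and len(token) - len(p) >= 3:
            token = token[len(p):]
            break
    for s in suffixes:
        if token.endswith(s) and len(token) - len(s) >= 3:
            token = token[:-len(s)]
            break
    return token

print(normalize("Gram\u00e5na"))   # "Gram\u00e5na" is "Gramåna"
print(toy_stem("preschooling"))
```

Whatever rules are used, the same analysis has to be applied at index time and at query time, otherwise the terms in the query will not line up with the terms in the index.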
  18. Norms (figure taken from Thomas Koch's presentation “Search Basics and Lucene” at http://bit.ly/JTOrnH)
  19. Time to Look at Some Code
  20. Getting a Query
     • Two options:
        • Parse a search string using a QueryParser class.
        • Programmatically build a query.
     • QueryParser can build very complex queries very quickly, but requires the user to provide a query string.
     • Building a query programmatically requires less overhead for simple queries.
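The two routes to a query can be illustrated with a toy model. Everything here is hypothetical: the tuple-based query representation, the `parse` function, and its tiny `field:term AND field:term` grammar are stand-ins for illustration and do not reflect the syntax or behavior of Lucene's actual QueryParser.

```python
def term(field, text):
    """A single-term query node."""
    return ("term", field, text)

def boolean_and(*clauses):
    """A query node requiring every clause to match."""
    return ("and", list(clauses))

def parse(query_string, default_field="body"):
    """Parse 'field:term AND term ...' into query nodes (toy grammar)."""
    clauses = []
    for part in query_string.split(" AND "):
        field, sep, text = part.strip().partition(":")
        if not sep:
            # No explicit field given: fall back to the default field.
            field, text = default_field, field
        clauses.append(term(field, text.lower()))
    return clauses[0] if len(clauses) == 1 else boolean_and(*clauses)

# Option 1: parse a user-supplied string.
parsed = parse("title:search AND lucene")

# Option 2: build the same query directly in code.
built = boolean_and(term("title", "search"), term("body", "lucene"))
```

Both routes end in the same structure; parsing suits free-form user input, while direct construction avoids the parsing step when the application already knows the fields and terms.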
  21. General Query Types (figure taken from the Wikipedia entry “Information Retrieval” at http://bitly.com/T1Qbw)
  22. Some Lucene Query Types
     • TermQuery (general purpose)
     • BooleanQuery
     • MultiPhraseQuery
     • SpanQuery
     • WildcardQuery
     • FilteredQuery
     • MoreLikeThisQuery
     • BoostingQuery
     • FuzzyQuery
     • ConstantScoreRangeQuery
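To give a feel for what three of these query types do, here is a rough sketch of term, wildcard, and boolean matching against a toy inverted index. The index contents and function names are made up, scoring is ignored entirely, and the boolean semantics are a simplification; this is not how Lucene.Net implements these classes.

```python
from fnmatch import fnmatchcase

# Toy inverted index: term -> set of document ids (illustrative data).
INDEX = {"lucene": {1, 3}, "lucid": {2}, "search": {1, 2}, "index": {2, 3}}

def term_query(term):
    """Documents containing exactly this term."""
    return set(INDEX.get(term, set()))

def wildcard_query(pattern):
    """Documents containing any term matching a shell-style pattern."""
    hits = set()
    for term, docs in INDEX.items():
        if fnmatchcase(term, pattern):
            hits |= docs
    return hits

def boolean_query(must=(), should=(), must_not=()):
    """Combine result sets: all of must (else any of should), minus must_not."""
    if must:
        hits = set.intersection(*map(set, must))
    else:
        hits = set().union(*should)
    for excluded in must_not:
        hits -= excluded
    return hits

both = boolean_query(must=[term_query("lucene"), term_query("search")])
either_not = boolean_query(
    should=[term_query("lucid"), term_query("index")],
    must_not=[term_query("search")],
)
prefix = wildcard_query("luc*")
```

Note how the wildcard query has to walk the term dictionary, which is why such queries tend to cost more than a plain term lookup.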
  23. Time to Look at More Code
  24. Lucene.Net Contribs
     • Spatial (geo-spatial search)
     • Similarity
     • SimpleFacetedSearch
     • Highlighter
     • SpellChecker
     • WordNet (synonyms)
     • Snowball (stemming library)
     • RegEx
  25. That's All!
     • Thanks for your time and attention.
     • twitter: @zgramana
     • blog: http://www.excitabyte.com/
     • Email: zgramanaATgee mail dot com
