May 2012 JaxDUG presentation by Zachary Gramana on using the Lucene.NET library to add search functionality to .NET applications. Contains an overview of search/information retrieval concepts and highlights some common use-cases.
About Me Zachary Johnson Gramana. Engineer at Potts Consulting Group. Proud new father of Rex.
Search is... A vague term that encompasses multiple problems. A better term is “information retrieval”, or IR system. Interdisciplinary, drawing from: computer science (parsing, data structures), psychology (query grammar, human-computer interaction), linguistics (textual analysis), information science (scoring/relevancy), and maths (document retrieval strategy).
Problems Solved Information Overload. Find the information that users want, not just the information they asked for. Transparently handle all kinds of data: structured (hierarchical), semi-structured (markup), and unstructured (plain text). Single portal to multiple data types and sources. Do it fast!
Basic IR System Capabilities Collection (importing, crawling): anonymous web page crawling (Google), user-uploaded photographs (Flickr), publisher upload of .mp3 files (iTunes). Indexing: analysis, modifying the index data structure. Querying: input parsing, query generation and execution, collecting the results, filtering the results (optional).
What is Lucene.Net? A port of the Apache Software Foundation's Lucene libraries from Java to C#. It's a search library. Lucene was created by Doug Cutting, who named it after his wife. First released in 2000 on SourceForge. Migrated to the Apache Software Foundation in 9/2001.
What Isn't Lucene.NET? Not a complete information retrieval system: check out Google Search Appliance instead (http://www.google.com/enterprise/search/). Not a web crawler: check out Arachnode instead (http://arachnode.net). Not a query service: check out Solr instead (http://lucene.apache.org/solr). Not hard: check out Windows Search SDK instead (http://bit.ly/ImRtMk).
What's In an Index? Stores a collection of Documents, each of which represents a source record. Documents contain: metadata about the source record; (optionally) actual data from the source record; (optionally) derived analytical products. Documents store a collection of token/frequency pairs (optionally with positions), plus a document identifier.
Lucene's Index Structure Documents store a collection of fields. Fields are collections of terms, plus an identifier and optional term vectors. Terms are key-value pairs of a field name and a string value. Lucene provides special classes to deal with tricky data, like the NumericField class. Term vectors are terms, along with their frequency counts and positions. Fields can be indexed, stored, or both. Storing allows a term value to be retrieved after indexing. Indexing adds the term value to Lucene's inverted index.
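The indexed/stored distinction above can be sketched with a toy model. This is a minimal Python illustration, not Lucene.NET's API or on-disk format: the names `Field`, `Document`, and `TinyIndex`, the whitespace-based term splitting, and the sample field values are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Field:
    name: str
    value: str
    indexed: bool = True   # add the value's terms to the inverted index
    stored: bool = False   # keep the raw value so it can be returned with a hit


@dataclass
class Document:
    fields: list


class TinyIndex:
    """Toy index for illustration only (hypothetical, not Lucene's format)."""

    def __init__(self):
        self.postings = defaultdict(set)  # (field name, term text) -> doc ids
        self.stored = {}                  # doc id -> {field name: raw value}

    def add(self, doc):
        doc_id = len(self.stored)
        # Stored fields are kept verbatim so they can be retrieved later.
        self.stored[doc_id] = {f.name: f.value for f in doc.fields if f.stored}
        # Indexed fields contribute terms to the inverted index.
        for f in doc.fields:
            if f.indexed:
                for text in f.value.lower().split():
                    self.postings[(f.name, text)].add(doc_id)
        return doc_id

    def search(self, field_name, text):
        return sorted(self.postings.get((field_name, text.lower()), set()))


index = TinyIndex()
doc_id = index.add(Document([
    Field("title", "Lucene in Action", indexed=True, stored=True),
    Field("body", "Search library for text", indexed=True, stored=False),
    Field("path", "/books/lia.pdf", indexed=False, stored=True),
]))
hits = index.search("body", "library")
```

In this sketch the `body` field is searchable but its text cannot be read back, while `path` can be read back from a hit but never matches a search.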
The Inverted Index (taken from Apple's excellent “Search Basics” article at http://bit.ly/JO59kH )
Lucene's Index Structure What is an 'inverted index'? Forward (non-inverted) index: a document points to a collection of terms. Inverted index: a term points to a collection of documents. One or more segments: each is a self-contained, independent partition of the entire index. Stores: field names, field values, term dictionary, term frequencies, term proximities, normalization factors, term vectors, and an (optional) deleted record lookup table.
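To make the document-to-terms versus term-to-documents distinction concrete, here is a small Python sketch that inverts a forward index. The three sample documents and the `invert` helper are invented for illustration; real segments hold far more than this (term dictionary, positions, norms, and so on).

```python
from collections import defaultdict

# Forward index: each document id points to the terms it contains.
forward_index = {
    1: ["lucene", "is", "a", "search", "library"],
    2: ["solr", "is", "a", "search", "server"],
    3: ["lucene", "powers", "solr"],
}


def invert(forward):
    """Build term -> {doc id: term frequency} from doc id -> [terms]."""
    inverted = defaultdict(dict)
    for doc_id, terms in forward.items():
        for term in terms:
            inverted[term][doc_id] = inverted[term].get(doc_id, 0) + 1
    return dict(inverted)


inverted_index = invert(forward_index)
lucene_docs = sorted(inverted_index["lucene"])  # documents containing "lucene"
search_docs = sorted(inverted_index["search"])  # documents containing "search"
```

Answering "which documents contain this term?" becomes a single dictionary lookup, rather than a scan over every document.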
Analysis (taken from Apple's excellent “Search Basics” article at http://bit.ly/JO59kH )
Tokenization (taken from Thomas Koch's presentation “Search Basics and Lucene” at http://bit.ly/JTOrnH )
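The analysis and tokenization slides above describe turning raw text into index terms. A rough Python sketch of one possible chain follows: split on non-alphanumeric characters, lowercase, then drop stop words while keeping each token's original position. The stop-word list and the `analyze` function are assumptions for this example; Lucene's analyzers are configurable and considerably more capable.

```python
import re

STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}  # illustrative list


def analyze(text):
    """Return (term, position) pairs after tokenizing, lowercasing,
    and removing stop words. Positions count all tokens, including
    the stop words that were dropped."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    result = []
    for position, token in enumerate(tokens):
        term = token.lower()
        if term not in STOP_WORDS:
            result.append((term, position))
    return result


terms = analyze("The Quick brown fox is quick")
```

Keeping positions, even across removed stop words, is what lets an index support phrase and proximity matching later.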
Getting a Query Two options: parse a search string using a QueryParser class, or programmatically build a query. QueryParser can build very complex queries very quickly, but requires the user to provide a query string. Programmatic building of a query requires less overhead for simple queries.
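The two options can be compared with a toy query model in Python. This is only a sketch: `ToyTermQuery`, `ToyAndQuery`, the `parse` function, and its `field:word` grammar are hypothetical stand-ins, not Lucene's QueryParser syntax or query classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyTermQuery:
    field: str
    text: str


@dataclass(frozen=True)
class ToyAndQuery:
    clauses: tuple


def parse(query_string, default_field="body"):
    """Parse 'field:word word ...' into a ToyAndQuery (toy grammar)."""
    clauses = []
    for chunk in query_string.split():
        field, sep, text = chunk.partition(":")
        if not sep:
            # No explicit field given, so fall back to the default field.
            field, text = default_field, chunk
        clauses.append(ToyTermQuery(field, text.lower()))
    return ToyAndQuery(tuple(clauses))


# Option 1: parse a user-supplied search string.
parsed = parse("title:Lucene search")

# Option 2: build the same query directly in code.
built = ToyAndQuery((
    ToyTermQuery("title", "lucene"),
    ToyTermQuery("body", "search"),
))
```

Both routes end in the same query object here; the parser is convenient when a user types the string, while direct construction skips the parsing step when the application already knows the fields and terms.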
General Query Types (taken from the Wikipedia entry “Information Retrieval” at http://bitly.com/T1Qbw)