May 2012 JaxDUG presentation by Zachary Gramana on using the Lucene.NET library to add search functionality to .NET applications. Contains an overview of search/information retrieval concepts and highlights some common use-cases.
2. About Me
Zachary Johnson Gramana
Engineer at Potts Consulting Group
Proud new father of Rex
3. Search is...
A vague term that encompasses multiple problems.
A better term is an “information retrieval” (IR) system.
Interdisciplinary, drawing from:
computer science (parsing, data structures)
psychology (query grammar, human/computer interaction)
linguistics (textual analysis)
information science (scoring/relevancy)
maths (document retrieval strategy)
5. Problems Solved
Information Overload
Find the information that users want,
not just the information they asked for.
Transparently handle all kinds of data:
structured (hierarchical)
semi-structured (markup)
un-structured data (plain text)
Single portal to multiple data types and
sources.
Do it fast!
6. Basic IR System Capabilities
Collection (importing, crawling)
Anonymous web page crawling (Google)
User-uploaded photographs (Flickr)
Publisher upload of .mp3 files (iTunes)
Indexing
Analysis
Modify index data structure
Querying
Input parsing
Query generation & execution
Collecting the results
Filtering the results (optional)
7. What is Lucene.Net?
Port of the Apache Foundation’s Lucene libraries from Java to C#
It’s a search library.
Lucene was created by Doug Cutting
Named for his wife’s middle name.
First released in 2000 on SourceForge
Migrated to the Apache Software Foundation in 9/2001.
8. Used By
StackOverflow
JIRA
IBM
Akamai
Apple
Autodesk
Orchard
RavenDB
CouchDB
9. What Isn’t Lucene.NET
Not a complete information retrieval system
Check out Google Search Appliance instead:
http://www.google.com/enterprise/search/
Not a web-crawler.
Check out Arachnode instead
http://arachnode.net
Not a query service.
Check out Solr instead
http://lucene.apache.org/solr
Not hard
Check out Windows Search SDK instead
http://bit.ly/ImRtMk
11. What’s In an Index?
Stores a collection of Documents, each of which represents a source record.
Documents contain:
Metadata about the source record.
(optionally) actual data from the source record.
(optionally) derived analytical products.
Documents store a collection of token/frequency pairs (optionally with positions), plus a document identifier.
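As a rough illustration of the token/frequency/position idea on this slide, here is a minimal Python sketch. It is not Lucene.NET code: the function name `index_document` and the dictionary layout are assumptions made up for this example, and tokenization is just a lowercase whitespace split.

```python
from collections import defaultdict

def index_document(doc_id, text):
    """Build a toy index entry: a document id plus token -> (frequency, positions)."""
    positions = defaultdict(list)
    # Naive analysis (an assumption for this sketch): lowercase, split on whitespace.
    for pos, token in enumerate(text.lower().split()):
        positions[token].append(pos)
    # Frequency is simply the number of recorded positions for each token.
    tokens = {tok: (len(p), p) for tok, p in positions.items()}
    return {"id": doc_id, "tokens": tokens}

entry = index_document(42, "to be or not to be")
print(entry["tokens"]["to"])  # (2, [0, 4])
```

Keeping positions is what later makes phrase and proximity matching possible; dropping them saves space when only term presence and frequency matter.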
12. Lucene’s Index Structure
Documents store a collection of fields.
Fields are collections of terms, plus an identifier and optional term vectors.
Terms are key-value pairs of a field name and a string value.
Lucene provides special classes to deal with tricky data, like the NumericField class.
Term vectors are terms, along with their frequency counts and positions.
Fields can be indexed, stored, or both.
Storing allows a term value to be retrieved after indexing.
Indexing adds the term value to Lucene’s inverted index.
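The difference between "indexed" and "stored" can be sketched with a toy model. Everything below (the `Field` dataclass, `ToyIndex`, and the sample values) is hypothetical and only mimics the behaviour described on the slide; it is not the Lucene.NET API.

```python
from dataclasses import dataclass

@dataclass
class Field:
    name: str
    value: str
    stored: bool = False   # keep the original value so it can be retrieved later
    indexed: bool = False  # add the value's terms to the inverted index

class ToyIndex:
    def __init__(self):
        self.inverted = {}  # (field name, term) -> set of document ids
        self.stored = {}    # document id -> {field name: original value}

    def add(self, doc_id, fields):
        for f in fields:
            if f.indexed:
                for term in f.value.lower().split():
                    self.inverted.setdefault((f.name, term), set()).add(doc_id)
            if f.stored:
                self.stored.setdefault(doc_id, {})[f.name] = f.value

    def search(self, field_name, term):
        return sorted(self.inverted.get((field_name, term.lower()), set()))

    def retrieve(self, doc_id, field_name):
        return self.stored.get(doc_id, {}).get(field_name)

idx = ToyIndex()
idx.add(1, [
    Field("title", "Hello Search", stored=True, indexed=True),  # both
    Field("body", "search made simple", indexed=True),          # searchable only
    Field("path", "/docs/a.txt", stored=True),                  # retrievable only
])

print(idx.search("body", "simple"))  # [1]
print(idx.retrieve(1, "body"))       # None: indexed but not stored
print(idx.retrieve(1, "path"))       # /docs/a.txt
```

In this sketch an indexed-only field can be found by a query but its original text cannot be read back, while a stored-only field can be read back but never matches a query.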
13. The Inverted Index
(taken from Apple’s excellent “Search Basics” article at http://bit.ly/JO59kH )
14. Lucene’s Index Structure
What is an ‘inverted index’?
forward index: document points to a collection of terms
inverted index: term points to a collection of documents
The index is made up of one or more segments
Each is a self-contained, independent partition of the entire index.
Stores: field names, field values, term dictionary, term frequencies, term proximities, normalization factor, term vectors, and (optional) deleted record lookup table.
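The two index directions described on this slide can be shown in a few lines of Python. This is a concept sketch with made-up sample documents, not a picture of how Lucene lays out a segment on disk.

```python
from collections import defaultdict

# Document -> terms (the non-inverted direction).
forward = {
    1: ["lucene", "is", "a", "search", "library"],
    2: ["solr", "is", "a", "search", "server"],
}

def invert(forward_index):
    """Flip document -> terms into term -> sorted list of document ids."""
    inverted = defaultdict(set)
    for doc_id, terms in forward_index.items():
        for term in terms:
            inverted[term].add(doc_id)
    return {term: sorted(ids) for term, ids in inverted.items()}

inverted = invert(forward)
print(inverted["search"])  # [1, 2]
print(inverted["lucene"])  # [1]
```

Answering "which documents contain this term?" becomes a single dictionary lookup instead of a scan over every document.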
15. Analysis
(taken from Apple’s excellent “Search Basics” article at http://bit.ly/JO59kH )
16. Tokenization
(taken from Thomas Koch’s presentation “Search Basics and Lucene” at http://bit.ly/JTOrnH )
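Since the original diagrams are not reproduced here, a small sketch may help show what an analysis step typically does: split text into tokens, normalize case, and drop stop words. The `analyze` function and the stop-word list are assumptions for illustration only; Lucene's own analyzers are configurable and behave differently in the details.

```python
import re

# A tiny, assumed stop-word list; real analyzers ship their own.
STOP_WORDS = {"a", "an", "and", "the", "of", "to"}

def analyze(text):
    """Lowercase the text, split it into alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(analyze("The Quick, Brown Fox jumped over the lazy dog!"))
# ['quick', 'brown', 'fox', 'jumped', 'over', 'lazy', 'dog']
```

The same analysis has to be applied to query text as to indexed text, otherwise a query for "Quick" would never match the indexed token "quick".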
20. Getting a Query
Two options:
Parse a search string using a QueryParser class.
Programmatically build a query.
QueryParser can build very complex queries very quickly, but requires the user to provide a query string.
Programmatic building of a query requires less overhead for simple queries.
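The two routes to a query can be compared with a toy model. The `TermQuery` and `BooleanQuery` classes and the `parse_query` function below are hypothetical stand-ins written for this sketch; they borrow the `field:term`, `+` (required) and `-` (prohibited) conventions of Lucene's query syntax, but they are not the library's classes or its QueryParser.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class TermQuery:
    field_name: str
    text: str

@dataclass
class BooleanQuery:
    must: list = field(default_factory=list)      # clauses prefixed with '+'
    must_not: list = field(default_factory=list)  # clauses prefixed with '-'
    should: list = field(default_factory=list)    # unprefixed clauses

def parse_query(query_string, default_field="body"):
    """Toy parser: whitespace-separated clauses, optional +/- prefix and field: prefix."""
    query = BooleanQuery()
    for clause in query_string.split():
        bucket = query.should
        if clause.startswith("+"):
            bucket, clause = query.must, clause[1:]
        elif clause.startswith("-"):
            bucket, clause = query.must_not, clause[1:]
        name, sep, text = clause.partition(":")
        if not sep:  # no explicit field, fall back to the default
            name, text = default_field, clause
        bucket.append(TermQuery(name, text.lower()))
    return query

# Option 1: parse a user-supplied search string.
parsed = parse_query("title:lucene +search -java")

# Option 2: build the same query programmatically.
built = BooleanQuery(
    must=[TermQuery("body", "search")],
    must_not=[TermQuery("body", "java")],
    should=[TermQuery("title", "lucene")],
)

print(parsed == built)  # True
```

Parsing suits free-form user input, while building the object directly avoids the parsing step when the application already knows the fields and terms it wants.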
21. General Query Types
(taken from the Wikipedia entry “Information Retrieval” at http://bitly.com/T1Qbw)