Search Me: Using Lucene.Net

SEARCH ME
Using Lucene.Net In Your Apps

About Me
- Zachary Johnson Gramana
- Engineer at Potts Consulting Group
- Proud new father of Rex

Search is...
- A vague term that encompasses multiple problems.
- A better term is “information retrieval” (IR), or “IR system”.
- Interdisciplinary, drawing from:
  - computer science (parsing, data structures)
  - psychology (query grammar, human-computer interaction)
  - linguistics (textual analysis)
  - information science (scoring/relevancy)
  - mathematics (document retrieval strategy)

Problems Solved
- Information overload
- Find the information that users want, not just the information they asked for.
- Transparently handle all kinds of data:
  - structured (hierarchical)
  - semi-structured (markup)
  - unstructured (plain text)
- Single portal to multiple data types and sources.
- Do it fast!

Basic IR System Capabilities
- Collection (importing, crawling)
  - Anonymous web page crawling (Google)
  - User-uploaded photographs (Flickr)
  - Publisher upload of .mp3 files (iTunes)
- Indexing
  - Analysis
  - Modifying the index data structure
- Querying
  - Input parsing
  - Query generation & execution
  - Collecting the results
  - Filtering the results (optional)

What is Lucene.Net?
- A port of the Apache Software Foundation’s Lucene libraries from Java to C#.
- It’s a search library.
- Lucene was created by Doug Cutting.
  - Named after his wife (Lucene is her middle name).
  - First released in 2000 on SourceForge.
  - Migrated to the Apache Software Foundation in 9/2001.

Used By
- StackOverflow
- JIRA
- IBM
- Akamai
- Apple
- Autodesk
- Orchard
- RavenDB
- CouchDB

What Lucene.Net Isn’t
- Not a complete information retrieval system.
  - Check out Google Search Appliance instead: http://www.google.com/enterprise/search/
- Not a web crawler.
  - Check out Arachnode instead: http://arachnode.net
- Not a query service.
  - Check out Solr instead: http://lucene.apache.org/solr
- Not hard.
  - Check out Windows Search SDK instead: http://bit.ly/ImRtMk

What’s In an Index?
- Stores a collection of documents, each of which represents a source record.
- Documents contain:
  - Metadata about the source record.
  - (optionally) actual data from the source record.
  - (optionally) derived analytical products.
- Documents store a collection of token/frequency pairs (optionally with positions), plus a document identifier.

Lucene’s Index Structure
- Documents store a collection of fields.
- Fields are collections of terms, plus an identifier and optional term vectors.
- Terms are key-value pairs of a field name and a string value.
- Lucene provides special classes to deal with tricky data, like the NumericField class.
- Term vectors are terms, along with their frequency counts and positions.
- Fields can be indexed, stored, or both.
  - Storing allows a field’s value to be retrieved after indexing.
  - Indexing adds the field’s terms to Lucene’s inverted index.
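
To make the indexed/stored distinction concrete, here is a minimal Python sketch. It is not Lucene.Net code: the Field, Document, and ToyIndex classes are hypothetical stand-ins invented for illustration, and tokenizing by lowercasing and splitting on whitespace is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Field:
    """Hypothetical field: a name, a value, and two independent flags."""
    name: str
    value: str
    indexed: bool = True   # add the value's terms to the inverted index
    stored: bool = False   # keep the original value so it can be retrieved


@dataclass
class Document:
    doc_id: int
    fields: list


class ToyIndex:
    """Toy index, not Lucene's on-disk structure."""

    def __init__(self):
        self.postings = defaultdict(set)  # (field name, token) -> doc ids
        self.stored = defaultdict(dict)   # doc id -> {field name: value}

    def add(self, doc):
        for f in doc.fields:
            if f.indexed:
                # Assumed analysis step: lowercase, split on whitespace.
                for token in f.value.lower().split():
                    self.postings[(f.name, token)].add(doc.doc_id)
            if f.stored:
                self.stored[doc.doc_id][f.name] = f.value


index = ToyIndex()
index.add(Document(1, [
    Field("title", "Search Me", indexed=True, stored=True),
    Field("body", "Using Lucene.Net in your apps", indexed=True, stored=False),
]))
```

In this sketch the body field is searchable but its text cannot be read back, while the title is both searchable and retrievable.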

The Inverted Index

(taken from Apple’s excellent “Search Basics” article at http://bit.ly/JO59kH )

Lucene’s Index Structure
- What is an “inverted index”?
  - “Verted” (forward) index: a document points to a collection of terms.
  - Inverted index: a term points to a collection of documents.
- One or more segments
  - Each is a self-contained, independent partition of the entire index.
  - Stores: field names, field values, term dictionary, term frequencies, term proximities, normalization factors, term vectors, and an (optional) deleted-record lookup table.
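
The forward/inverted distinction can be sketched in a few lines of Python. This is an illustration only, not Lucene's segment format: the sample documents are made up, and splitting on whitespace stands in for real analysis.

```python
from collections import defaultdict

# Made-up sample records, keyed by document id.
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick brown dogs",
}

# Forward ("verted") index: document id -> list of terms.
forward = {doc_id: text.split() for doc_id, text in docs.items()}

# Inverted index: term -> list of document ids containing it.
inverted = defaultdict(list)
for doc_id in sorted(forward):
    for term in set(forward[doc_id]):   # each doc listed once per term
        inverted[term].append(doc_id)

print(inverted["quick"])  # documents containing "quick"
```

Answering "which documents contain this term?" is a single dictionary lookup in the inverted index, whereas the forward index would require scanning every document.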

Analysis

(taken from Apple’s excellent “Search Basics” article at http://bit.ly/JO59kH )

Tokenization

(taken from Thomas Koch’s presentation “Search Basics and Lucene” at http://bit.ly/JTOrnH )

Tokenization
- Normalization: “Gramåna” > “gramana”
- Stemming: “preschooling” > “school”
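
Both steps can be sketched in Python using only the standard library. This is not how Lucene's analyzers or the Snowball stemmers work: the normalize function assumes that stripping combining marks after Unicode decomposition is acceptable, and toy_stem is a deliberately naive affix stripper with a hard-coded affix list chosen only to reproduce the example above.

```python
import unicodedata


def normalize(token):
    """Lowercase a token and strip diacritics (assumed normalization rule)."""
    decomposed = unicodedata.normalize("NFKD", token)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def toy_stem(token, prefixes=("pre",), suffixes=("ing",)):
    """Naive affix stripping; a toy, not a real stemming algorithm."""
    for s in suffixes:
        if token.endswith(s) and len(token) > len(s) + 2:
            token = token[: -len(s)]
            break
    for p in prefixes:
        if token.startswith(p) and len(token) > len(p) + 2:
            token = token[len(p):]
            break
    return token


def analyze(text):
    """Whitespace tokenization, then normalization, then toy stemming."""
    return [toy_stem(normalize(t)) for t in text.split()]


print(analyze("Gramåna preschooling"))
```

A toy like this over-strips readily (it would turn "present" into "sent"), which is why real systems use a proper stemming library such as Snowball.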

Norms

(taken from Thomas Koch’s presentation “Search Basics and Lucene” at http://bit.ly/JTOrnH )

Getting a Query
- Two options:
  - Parse a search string using a QueryParser class.
  - Programmatically build a query.
- QueryParser can build very complex queries very quickly, but requires the user to provide a query string.
- Programmatically building a query requires less overhead for simple queries.
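
The two routes can be compared with a small Python sketch. The TermQuery and BooleanQuery classes here only borrow their names from Lucene; they are hypothetical toy classes, and the parse function accepts an invented mini-grammar ("field:term AND term"), not Lucene's query syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TermQuery:
    """Toy query: match one term in one field."""
    field: str
    text: str


@dataclass(frozen=True)
class BooleanQuery:
    """Toy query: every clause must match (AND only in this sketch)."""
    clauses: tuple


def parse(query_string, default_field="body"):
    """Parse 'field:term AND term ...' into toy query objects."""
    clauses = []
    for part in query_string.split(" AND "):
        field, sep, text = part.strip().partition(":")
        if not sep:  # no 'field:' prefix, so fall back to the default field
            field, text = default_field, field
        clauses.append(TermQuery(field, text.lower()))
    return clauses[0] if len(clauses) == 1 else BooleanQuery(tuple(clauses))


# Option 1: parse a user-supplied search string.
parsed = parse("title:Lucene AND search")

# Option 2: build the same query programmatically.
built = BooleanQuery((TermQuery("title", "lucene"), TermQuery("body", "search")))

print(parsed == built)
```

Both routes end in the same query object; the parser adds a grammar and a parsing step, which is the overhead the programmatic route avoids for simple queries.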

General Query Types

(taken from the Wikipedia entry “Information Retrieval” at http://bitly.com/T1Qbw )

Some Lucene Query Types
- TermQuery (general purpose)
- BooleanQuery
- MultiPhraseQuery
- SpanQuery
- WildcardQuery
- FilteredQuery
- MoreLikeThisQuery
- BoostingQuery
- FuzzyQuery
- ConstantScoreRangeQuery

Lucene.Net Contribs
- Spatial (geo-spatial search)
- Similarity
- SimpleFacetedSearch
- Highlighter
- SpellChecker
- WordNet (synonyms)
- Snowball (stemming library)
- Regex

That’s All!
Thanks for your time and attention.

twitter: @zgramana
blog: http://www.excitabyte.com/
Email: zgramanaATgee mail dot com