Search enabled applications with Lucene.NET


W.Meints


Agenda

• Introduction
• Technical bits
• Inspiration

#ISKALUCENE

Google has ruined search for
everyone

This is what you often build
as a developer. Because the
user wants it.

Three reasons why search sometimes
sucks
• Can I even search?
• The number one reason: sometimes search is not there, or it is
there but you cannot see that it is. Confusing stuff!

• The search form is too complicated
• I need to be an expert to find something I don’t know is there…. Good
thinking!

• The search engine is too slow
• They sometimes warn you about this (why?!)

Three reasons why search sometimes
sucks
• I am not going to address all of these issues today.

• The focus of this talk is on the technical stuff, which solves
• Having to use complex search forms to find something
• Having to wait a long time before you find something (hopefully).

• Usability of search engines is something I could talk about for a
very long time too… but not today.

This is what we expect to see today

Implementing proper search functionality

Simplicity is key

Gives the right
answers

Allows me to refine

What search is today
Search is hard on the
developer. It involves a lot of
things:

• Linguistics
• Psychology
• Information analysis
• Computer science
• Complex math

Lucene.NET as a possible solution
• Lucene.NET is derived from
its Java cousin Lucene.

• Compact search engine that
offers a solution to most of
your search problems.

• Best of all: it is free.

Getting started with Lucene.NET

Getting started

Overview of Lucene
• Lucene provides the core
things you need to build a
search system

• It does not:
• Contain a search results page.
• Parse HTML, Word, Excel, etc.

This is what is in the box
• Text analyzer
• Splits text into searchable terms
• Filters out stopwords (if you
want)

• QueryParser
• Common syntax without
needing to learn anything

• IndexSearcher
• The goods, THE thing to have.
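
To make the analyzer bullet concrete, here is a minimal sketch in Python of what "split text into terms and filter stopwords" boils down to. It is not the Lucene.NET Analyzer API. The `analyze` function and the `STOPWORDS` list are made up for illustration.

```python
import re

# Hypothetical stopword list for illustration; real analyzers ship their own.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "is", "over"}

def analyze(text, stopwords=STOPWORDS):
    """Lowercase the text, split it into word terms, and drop stopwords."""
    terms = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in terms if t not in stopwords]

print(analyze("The quick brown fox jumps over the lazy dog"))
```

Passing an empty stopword set keeps every term, which mirrors the "if you want" part of the bullet.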

This is what is in the box
• IndexReader
• Reads everything from the
index

• IndexWriter
• Stores documents and fields

• Directory
• The index itself, comes in many
sizes and shapes

A standard recipe for building search

1. Build an index with content you want to search through.
2. Build a query from the question the user asked.
3. Get results and present them to the user.

Step 1: Building an index
• The Lucene search index is nothing like your average database!

• Storage happens in key/value pairs

• Most of the time nothing is stored and you can still search for it
• The engine indexes the terms of the content, not the original text
• Only when you ask it to store something does it keep the original value
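
The difference between "indexed" and "stored" is easier to see in a toy example. The sketch below is plain Python, not Lucene.NET. The `TinyIndex` class and its `store` flag are assumptions that only mimic the idea: every field is searchable, but the original value is kept only when you ask for it.

```python
from collections import defaultdict

class TinyIndex:
    """Toy inverted index: (field, term) -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)   # searchable terms
        self.stored = {}                   # doc id -> {field: original value}

    def add(self, doc_id, field, text, store=False):
        # Every field is indexed term by term.
        for term in text.lower().split():
            self.postings[(field, term)].add(doc_id)
        # The original text is only kept when explicitly requested.
        if store:
            self.stored.setdefault(doc_id, {})[field] = text

    def search(self, field, term):
        return sorted(self.postings.get((field, term.lower()), set()))

index = TinyIndex()
index.add(1, "title", "Lucene.NET in action", store=True)
index.add(1, "body", "Lorem ipsum and more about Lucene")   # indexed, not stored
```

Searching the body for "lucene" finds document 1, yet only the title can be read back from `index.stored`.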

• Lucene indexing uses a tree-like structure of segments:
• Doc #1, Doc #2, Doc #3: each document gets its own segment initially
• Merged #1 + #2: segments get merged during optimization cycles
• Full index: finally everything is merged back into one big pile

• Reasons for going in this direction:
• Segments are small and update very fast.
• Searching many segments is slower than searching one bigger segment,
hence the merging.

• Overall, an index built from merging segments is more scalable and
easier to implement than the B-tree indexes used elsewhere.
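
A rough sketch of the merging idea, in Python rather than Lucene.NET. Segments are modeled here as plain dictionaries from term to document ids, which is an assumption for illustration and not the on-disk format.

```python
def merge_segments(a, b):
    """Merge two segments (dicts of term -> sorted doc-id list) into one."""
    merged = {}
    for term in sorted(set(a) | set(b)):
        merged[term] = sorted(set(a.get(term, [])) | set(b.get(term, [])))
    return merged

def search(segments, term):
    """Search every segment; more segments means more lookups per query."""
    hits = set()
    for segment in segments:
        hits.update(segment.get(term, []))
    return sorted(hits)

# Each document starts out as its own small segment.
seg1 = {"lucene": [1], "search": [1]}
seg2 = {"lucene": [2], "index": [2]}
seg3 = {"search": [3]}

segments = [merge_segments(seg1, seg2), seg3]           # merged #1 + #2, plus #3
full_index = merge_segments(segments[0], segments[1])   # one big pile
```

Adding a document only creates a new small segment, while a search over `full_index` needs a single lookup instead of one per segment.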


[Diagram: your parser produces a Document made up of Fields; the IndexWriter, using an Analyzer, writes it to the Directory.]

Step 2: Building queries
• Querying Lucene.NET is done through the IndexSearcher for
almost every scenario you can think of.

• There are a number of possible options for queries:
• Hand-build a query using BooleanQuery, TermQuery or another query
type
• Let Lucene decide which would best fit by parsing the query.


[Diagram: “Some text” → QueryParser (using an Analyzer) → Query → IndexSearcher]

• There’s a standard QueryParser, but you can also use the
MultiFieldQueryParser

• The MultiFieldQueryParser allows you to build a query across
multiple fields at once.

• Using the QueryParser and analyzer to get a good query for the
search engine is one way of going at it.
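
What "a query across multiple fields" means can be sketched as a small string transformation. This is Python for illustration only and not the MultiFieldQueryParser itself. The `multi_field_query` helper is hypothetical, and the output merely follows the familiar `field:term` query syntax.

```python
def multi_field_query(text, fields):
    """Expand every term into a group that may match in any of the fields."""
    groups = []
    for term in text.lower().split():
        clauses = " ".join(f"{field}:{term}" for field in fields)
        groups.append(f"({clauses})")
    return " ".join(groups)

print(multi_field_query("Lucene search", ["title", "body"]))
# prints: (title:lucene body:lucene) (title:search body:search)
```

The user types one simple phrase and the expansion to every field happens behind the scenes, which is the point of keeping the search form simple.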

• Other query types include:
• BooleanQuery – Terms must, should or must not appear in the
document
• TermQuery – Look for a single term
• SpanQuery – Find terms that are close together in the text

Please note: You can combine!
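
Combining must, should and must not clauses can be pictured as set operations on the index. The sketch below is plain Python over a made-up `postings` dictionary. It is not the BooleanQuery API, and it ignores scoring entirely.

```python
def boolean_query(postings, all_docs, must=(), should=(), must_not=()):
    """Return doc ids matching a must / should / must-not combination."""
    result = set(all_docs)
    for term in must:
        result &= postings.get(term, set())
    if should and not must:
        # Without required terms, at least one optional term has to match.
        # With required terms, optional terms would only influence ranking.
        result &= set().union(*(postings.get(t, set()) for t in should))
    for term in must_not:
        result -= postings.get(term, set())
    return sorted(result)

# Hypothetical index content: term -> ids of the documents containing it.
postings = {
    "lucene": {1, 2, 3},
    "search": {1, 3},
    "java": {2},
}
all_docs = {1, 2, 3, 4}

print(boolean_query(postings, all_docs, must=["lucene"], must_not=["java"]))
# prints: [1, 3]
```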

• SpanQuery is a little weird. It allows you to find terms close
together in a piece of text. For example:

“The lazy fox jumps over the quick brown dog”
“The quick brown fox jumps over the lazy dog”

The second sentence is the one you want. The first one is
sort of correct, but a little funky. Since when did the dog
become brown and quick??
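
Both sentences contain exactly the same terms, so only the word positions can tell them apart. Here is a minimal Python sketch of that proximity check. The `near` helper and its `max_distance` parameter are assumptions for illustration, and Lucene's own span queries measure distance in their own way.

```python
def positions(text, term):
    """Word positions (0-based) at which the term occurs in the text."""
    words = text.lower().split()
    return [i for i, w in enumerate(words) if w == term]

def near(text, first, second, max_distance):
    """True when `first` and `second` occur within max_distance words."""
    return any(
        abs(p - q) <= max_distance
        for p in positions(text, first)
        for q in positions(text, second)
    )

funky = "The lazy fox jumps over the quick brown dog"
wanted = "The quick brown fox jumps over the lazy dog"

print(near(wanted, "quick", "fox", 2))   # prints: True
print(near(funky, "quick", "fox", 2))    # prints: False
```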

Step 3: Getting results
• With indexed content and the right query, you can get the
answer to everything (which, by the way, might not be 42…)

• The IndexSearcher is used to find the answer to your query.


[Diagram: Query → IndexSearcher → IndexReader → Directory]

• Documents are matched against your query using complex
math.

• A TF-IDF algorithm is used to determine how well the document
matches the query.

• You have been warned! This is complex stuff.
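
For the curious, the core of TF-IDF fits in a few lines. The Python sketch below uses a textbook variant (term frequency times a logarithmic inverse document frequency). Lucene's actual scoring formula adds normalization and boosts, so treat this as an assumption-laden illustration only.

```python
import math
from collections import Counter

def tf_idf_scores(query_terms, docs):
    """Score each document as the sum of tf * idf over the query terms."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    scores = {}
    for doc_id, words in tokenized.items():
        counts = Counter(words)
        score = 0.0
        for term in query_terms:
            # Document frequency: in how many documents does the term occur?
            df = sum(1 for w in tokenized.values() if term in w)
            if df == 0 or counts[term] == 0:
                continue
            tf = counts[term] / len(words)
            idf = math.log(n_docs / df) + 1.0   # +1 keeps common terms above zero
            score += tf * idf
        scores[doc_id] = score
    return scores

# Hypothetical mini corpus.
docs = {
    1: "lucene search search engine",
    2: "lucene index",
    3: "cooking recipes",
}
scores = tf_idf_scores(["search"], docs)
```

A term that is frequent in one document but rare across the index pushes that document up, which is why document 1 is the only one with a score above zero here.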

• In the demo I showed you the basic form of finding documents.
• There’s more to the Search method than meets the eye!

• Depending on your needs, you may have to use a collector.
• A collector optimizes the way you retrieve documents from the index

• Need to find documents in ranked order?
• Use the default method or use a TopDocsCollector

• Need to sort the documents in a particular order?
• Use the TopFieldCollector instead.
• This collector is optimized for sorting on fields

• Don’t want documents that have nothing to do with what you
asked for in the first place?
• Use a PositiveScoresOnlyCollector
• Matches documents with score > 0

Use this only when you have a
smaller index.
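
What the different collectors do to a stream of hits can be sketched in Python. These helper functions are made up for illustration and only mimic the behaviour described above. They are not the Lucene.NET collector classes.

```python
import heapq

def top_docs(scored, n):
    """Keep only the n best (doc_id, score) pairs, highest score first."""
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])

def positive_scores_only(scored):
    """Drop every hit whose score is not above zero."""
    return [(doc_id, score) for doc_id, score in scored if score > 0]

def top_by_field(scored, field_values, n):
    """Ignore the score and return the first n hits sorted on a field value."""
    return heapq.nsmallest(n, scored, key=lambda pair: field_values[pair[0]])

# Hypothetical hits and a hypothetical sortable field.
scored = [(1, 0.9), (2, 0.0), (3, 0.4), (4, 1.3)]
titles = {1: "Delta", 2: "Alpha", 3: "Charlie", 4: "Bravo"}

print(top_docs(scored, 2))               # prints: [(4, 1.3), (1, 0.9)]
print(top_by_field(scored, titles, 2))   # prints: [(2, 0.0), (4, 1.3)]
```

Keeping only the best n hits in a small heap avoids sorting every match, which is the kind of optimization a collector is for.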

A standard recipe for building search

1. Build an index with content
• IndexWriter
• Document
• Think about Store / Index settings on your fields!

2. Build a query
• QueryParser
• MultiFieldQueryParser
• Choose the right query type!

3. Get results
• IndexSearcher
• Collector
• Choose the right collector for better performance!

Good to go
• Now that you know how Lucene.NET works I think it is time to
show you a few other things…

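Categorize content based on previous content
[Diagram: the Body of a new document is run through the IndexSearcher, and the labels of the matching documents are counted.]

Label          Occurrences
Search         180 (probably a good candidate!)
Requirements   40
Other label    12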
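
The counting step can be sketched as follows. This is Python with a deliberately naive notion of "matching" (sharing at least one word), where a real setup would run the body as a query through the IndexSearcher. The `suggest_label` helper and the sample documents are made up for illustration.

```python
from collections import Counter

def suggest_label(body, labeled_docs):
    """Count the labels of documents sharing at least one term with body."""
    terms = set(body.lower().split())
    counts = Counter()
    for text, labels in labeled_docs:
        if terms & set(text.lower().split()):
            counts.update(labels)
    return counts.most_common()

# Hypothetical previously labeled content.
labeled_docs = [
    ("building search with lucene", ["Search"]),
    ("search requirements for the intranet", ["Search", "Requirements"]),
    ("holiday schedule", ["Other label"]),
]

print(suggest_label("improving search relevance", labeled_docs))
# prints: [('Search', 2), ('Requirements', 1)]
```

The label with the highest count is probably a good candidate for the new document.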

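Detecting plagiarized content
[Diagram: a potentially problematic document is run through the IndexSearcher to check whether it closely matches existing documents such as "Lucene in action" or "Lucene in Orchard".]

Potentially problematic document:
Field   Value
Title   Lucene.NET in action
Body    Lorem ipsum stuff and more about that Lucene thingie.
Tags    Search, Lucene, .NET, C#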

Spell check content
• You can spell check a document based on what others wrote.
• Very similar to categorization, but instead of checking the highest hit for
a single field, check which word matches best for the term at hand.

• Uses an n-gram structure and the Levenshtein distance algorithm
(sounds good, doesn’t it?)

• Do NOT build this yourself, but download here:
https://nuget.org/packages/Lucene.Net.Contrib/3.0.3
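
To show what the n-gram structure and the Levenshtein distance each contribute, here is a small Python sketch. It is not the contrib package's implementation. The `suggest` helper and the tiny dictionary are assumptions for illustration.

```python
def ngrams(word, n=2):
    """All substrings of length n; used to find candidates cheaply."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                 # deletion
                current[j - 1] + 1,              # insertion
                previous[j - 1] + (ca != cb),    # substitution
            ))
        previous = current
    return previous[-1]

def suggest(word, dictionary):
    """Pick candidates sharing an n-gram, then rank them by edit distance."""
    grams = ngrams(word)
    candidates = [w for w in dictionary if grams & ngrams(w)]
    return min(candidates, key=lambda w: levenshtein(word, w), default=None)

# Hypothetical dictionary built from "what others wrote".
dictionary = ["lucene", "search", "index", "query"]

print(suggest("lucine", dictionary))   # prints: lucene
```

The n-grams narrow the dictionary down to a handful of candidates, and the edit distance only has to rank those. Still, do not build this yourself for production use.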

Play Jeopardy?
• The IBM Watson
supercomputer uses Lucene

By the way…
• Endeavour knowNow uses Lucene.NET

• And there are more devs using it.
• Twitter uses Lucene for realtime search
• StackOverflow uses Lucene for searching questions
• RavenDB uses Lucene.NET for its indexes

• Give it a try, you might be surprised!

http://www.fizzylogic.nl/

@wmeints
