Analyze This! Tips and Tricks on Getting the Lucene/Solr Analyzer to Index and Search Your Content Right

Analyze This!

Tom Hill
Lucid Imagination
Webinar 1/28/2010

Lucid Imagination, Inc.

Analyze This!

Analysis
Basics, Tips and Tools


© 2010 Lucid Imagination, Inc.

Overview
We’ll be covering:
What is analysis, and why do you care?
Some common problems with analysis
Tools for troubleshooting
Analysis Tool
Schema Browser
Luke
Existing Analyzers, Filters and Tokenizers


What is Analysis?

• Converting your text into terms
Solr does NOT search your text
Solr searches the set of terms created by analysis
Problems happen when the terms are not what you think they are
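
A minimal sketch of the idea in Python, not Solr's implementation. The helper names analyze and build_index and the sample documents are made up for illustration. It shows that a search matches the terms produced by analysis, not the original text.

```python
import re

def analyze(text):
    """Toy analyzer (hypothetical): drop apostrophes, lowercase, split on non-alphanumerics."""
    text = text.replace("'", "").replace("\u2019", "")
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index

# Sample documents are assumptions for this sketch.
docs = {1: "Don't panic", 2: "The quick brown fox jumps"}
index = build_index(docs)

# The raw text "Fox" is not a term in the index, so a raw lookup finds nothing.
raw_hits = index.get("Fox", set())
# Running the query through the same analyzer produces the term "fox".
analyzed_hits = index.get(analyze("Fox")[0], set())
```

Here raw_hits is empty while analyzed_hits contains document 2, which is the "why can't I find my data?" problem in miniature.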



Examples

Don’t => dont

iPhone => i phone or iphon
τα πρώτα δείγματα => πρωτα δειγματα
The quick brown fox jumps => The quick brown fox jumps



Different Effects of Analysis
There are many ways to analyze a run of text.
Break on whitespace, punctuation, caseChanges, numb3rs
Stemming (shoes -> shoe)
Removing/replacing unwanted words/symbols
Combining words
Adding new words (synonyms)
And many more
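
As a rough illustration of the first item, here is a Python sketch that breaks text on whitespace, punctuation, case changes and digits. The function name split_words and the regular expression are assumptions for this example. Solr's own filters handle many more cases.

```python
import re

def split_words(text):
    """Hypothetical helper: split on whitespace, punctuation, case changes and digits."""
    # Match an optionally capitalized lowercase run, a run of capitals, or a run of digits.
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+|[0-9]+", text)
    return [p.lower() for p in parts]

iphone_terms = split_words("iPhone")
slide_terms = split_words("caseChanges, numb3rs")
```

With this sketch "iPhone" yields the two terms i and phone, matching the earlier example slide.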



Copy Fields 1

It’s common to want to index data more than one way
You might index an analyzed version of a field for searching
And index an unanalyzed version for faceting or sorting
You might index a stemmed and a non-stemmed version of a field
To boost precise matches



Copy Fields 2

It’s also common to copy to a common destination field
For example: “alltext”
Note this copies from the SOURCE of the copied field
Not the analyzed version of the copied field
<copyField source="cat" dest="text"/>
<copyField source="name" dest="text"/>
<copyField source="manu" dest="text"/>
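
A small Python sketch of the copy behaviour described above. The function apply_copy_fields and the sample field values are hypothetical. The point is only that the destination receives the raw source text, which is then analyzed by the destination field's own analyzer.

```python
def apply_copy_fields(doc, copy_rules):
    """Return a new document with raw source values appended to each destination."""
    out = {name: list(values) for name, values in doc.items()}
    for source, dest in copy_rules:
        # Copy from the original document, so the destination gets the
        # untouched source text, never the source field's analyzed terms.
        out.setdefault(dest, []).extend(doc.get(source, []))
    return out

# Mirrors the three copyField rules on the slide.
copy_rules = [("cat", "text"), ("name", "text"), ("manu", "text")]
# Sample values are assumptions for this sketch.
doc = {"cat": ["Electronics"], "name": ["iPhone 3GS"], "manu": ["Apple Inc."]}
indexed_doc = apply_copy_fields(doc, copy_rules)
```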



What could go wrong?

• Lots of things
You can’t find things
You find too much
Poor query or indexing performance



Common Scenario #1

Someone sets up Solr for the first time
Adds some data
Then posts to the mailing list, and says “why can’t I find my data?”
The problem’s basic, but it’s useful to know how to identify it.



“When I Search For ‘fox’…”



“…I Find Nothing”



“But, If I look at the index”



“It’s right there”



Analysis Tool

Your first stop for figuring out analysis problems



Analysis Tool Demo



Stored vs. Indexed

Solr can store both analyzed and un-analyzed content
But you knew that …
“stored” vs. “indexed” in the field definition
How can you see what is actually indexed?
…that is, the terms you can search for



Schema Browser
Schema Browser lets you examine the fields and how they are configured.
It also allows you to examine the terms in the index



Schema Browser Demo



How Many of You Just Copied the Example Schema?

• Just because it works for one person’s data doesn’t mean it works for yours.
• Take the time to look at the output



Luke

Lucene Index Exploration Tool
Allows you to look at (and modify) the contents of an index



Luke Main Screen



Luke Document “Reconstruction”



Close-up from last slide

solr null_1 enterpris search server
null_100 apach softwar foundat null_100 softwar null_100 search
null_100 advanc
full fulltext|text search capabl use
lucen null_100 optim null_1 high …



Position Increment Gap

The null_xxx entries are how Luke represents the position increment gap between the values of a multi-valued field.
The example had
<field name="text">Solr, the Enterprise Search Server</field>
<field name="text">Apache Software Foundation</field>
Using a position increment gap prevents phrase queries from matching across different values of a field
Without the gap, “Server Apache” would be a valid phrase.
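
The effect of the gap can be sketched in Python. The helpers index_positions and phrase_matches are hypothetical, and the tokenization is deliberately simplified. Lucene's real position handling differs in detail.

```python
def index_positions(values, gap):
    """Assign a position to every token of a multi-valued field."""
    positions = {}
    next_pos = 0
    for i, value in enumerate(values):
        if i > 0:
            next_pos += gap  # leave a hole between consecutive values
        for token in value.lower().replace(",", "").split():
            positions.setdefault(token, []).append(next_pos)
            next_pos += 1
    return positions

def phrase_matches(positions, first, second):
    """True if `second` occurs at the position right after `first`."""
    return any(p + 1 in positions.get(second, []) for p in positions.get(first, []))

values = ["Solr, the Enterprise Search Server", "Apache Software Foundation"]
no_gap = index_positions(values, gap=0)
with_gap = index_positions(values, gap=100)

crosses_without_gap = phrase_matches(no_gap, "server", "apache")
crosses_with_gap = phrase_matches(with_gap, "server", "apache")
```

Without a gap, "server apache" matches across the two values. With a gap of 100 it does not, while a phrase inside one value such as "search server" still matches.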



Analysis Can Affect Performance

Analysis doesn’t just produce success/failure on a search
It can affect the query processing speed, too.



Slow Searches

One site indexes 500,000 books
Multiple languages in one field
So they can’t do stemming or stop words
Their worst case query was:
“The lives and literature of the beat generation”
It took 2 minutes to run.
The query requires checking every doc containing “the” & “and”
And the position info for each occurrence


Bi-grams

Bi-grams combine adjacent terms
“The lives and literature” becomes
“The lives” “lives and” “and literature”
Only have to check documents that contain the pair adjacent to each other.
Only have to look at position information for the pair
But can triple the size of the index
Word indexed by itself
Indexed both with preceding term, and following term
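
A short Python sketch of bi-gram generation. The function name bigrams is made up for this example. It also shows why the index grows: each word is indexed alone and as part of up to two pairs.

```python
def bigrams(tokens):
    """Join each token with the token that follows it."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

tokens = "the lives and literature".split()
pairs = bigrams(tokens)

# What gets indexed: every single word plus every adjacent pair.
indexed_terms = tokens + pairs
```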


Common Grams

Form bi-grams only for common terms
“The” occurs 2 billion times. “The lives” occurs 360k.
Used only the 32 most common terms
Average response went from 460 ms to 68 ms.
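
A sketch of the common-grams idea in Python, under stated assumptions: the function common_grams, the two-word common list and the underscore joiner are all chosen for illustration. Pairs are only emitted when at least one side is a common term.

```python
def common_grams(tokens, common):
    """Emit every token, plus a pair whenever either neighbour is a common term."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i + 1 < len(tokens) and (tok in common or tokens[i + 1] in common):
            out.append(f"{tok}_{tokens[i + 1]}")
    return out

common = {"the", "and"}  # stand-in for the list of most common terms
with_common = common_grams("the lives and literature".split(), common)
without_common = common_grams("beat generation".split(), common)
```

Rare words next to each other, such as "beat generation", produce no extra terms, so the index grows far less than with full bi-grams.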



Implied Phrase Queries

Another example involved a query with “L’art”
This turns into a phrase query, “L art” with the default config.
PhraseQuery(text:"l art")
Turning it into the single token ‘L art’ is much more efficient.
Occurs in far fewer documents than “L”
Is a term query, not a phrase query.



Multiple Languages

Generally, we suggest keeping different languages in their own fields
This lets you have an analyzer for each language
Stemming, stop words, etc.
If you don’t know the total number of languages, you can use dynamic fields.
That allows you to accept them, but not to dynamically stem, etc.



Analysis And Query Parsing

What happens when parsing a query in Solr?
You may have many fields, with different analyzers
Which Analyzer gets used?




QueryParser splits the query
Understands quotes, parens and whitespace
Gives the resulting pieces to the correct analyzer
Explicit or Default




To see what happens to your query
Use the “Full Interface” section of the admin interface
Check ‘debug: enable’
Or just add “&debugQuery=on” to the end of your query string
We’re using the Lucene Query Parser
Dismax does different things.
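
For scripted checks, the debug parameter can be added when building the request URL. In this Python sketch the base URL (host, port and path) is an assumption about a local install, and debug_url is a made-up helper name. Only the q and debugQuery parameters come from the slide.

```python
from urllib.parse import urlencode

# Assumed location of a local Solr select handler; adjust for your install.
BASE_URL = "http://localhost:8983/solr/select"

def debug_url(query):
    """Build a query URL with debugQuery=on appended."""
    return BASE_URL + "?" + urlencode({"q": query, "debugQuery": "on"})

url = debug_url("title:foo bar")
```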



Seeing the results of query parsing



Query Examples

title:foo bar
Becomes: +title:foo +text:bar
“foo” goes to the title field’s analyzer, “bar” to the default field’s analyzer
manu:"foo_bar baz"
Becomes: manu:"foo bar baz"
Note the _ got removed. The whole string goes to manu’s analyzer
Phrase query
title:(foo bar)

Becomes: title:foo title:bar
foo and bar passed separately to title’s analyzer
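
The routing in these three examples can be imitated with a toy splitter in Python. This is not the Lucene QueryParser. The regular expression, the function name route and the default field name "text" are assumptions for illustration. It only shows which field's analyzer would receive each piece.

```python
import re

# Optional "field:" prefix, then a quoted phrase, a parenthesized group, or a bare term.
CLAUSE = re.compile(r'(?:(\w+):)?(?:"([^"]*)"|\(([^)]*)\)|(\S+))')

def route(query, default_field="text"):
    """Return (field, text) pairs showing which analyzer gets which piece."""
    routed = []
    for field, phrase, group, term in CLAUSE.findall(query):
        field = field or default_field
        if phrase:
            routed.append((field, phrase))  # whole phrase goes to one analyzer
        elif group:
            routed.extend((field, t) for t in group.split())  # each term separately
        else:
            routed.append((field, term))
    return routed

example_1 = route("title:foo bar")
example_2 = route('manu:"foo_bar baz"')
example_3 = route("title:(foo bar)")
```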


Components of an Analyzer

CharFilters
Tokenizers
TokenFilters



CharFilters

Used to clean up/regularize characters before passing to the Tokenizer
Remove accents, etc.: MappingCharFilter
They can also do complex things; we’ll look at HTMLStripCharFilter later.



Tokenizers

Convert text to tokens (terms)
Only one per analyzer
Many Options
WhitespaceTokenizer
StandardTokenizer
PatternTokenizer
More…



TokenFilters

Process the tokens produced by the Tokenizer
Can be many of them per field



Some example TokenFilters that come with Solr/Lucene

There are way too many to list them all
We’re just going to go through a few of them



Reversing Filter

Why?
Leading wildcards require traversing the whole index
Reverse the order, and leading wildcards become trailing
*cats => stac*
Only have to check terms that start with stac, instead of the whole index.
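
A Python sketch of why reversing helps. The term list and the helper names build_reversed_terms and leading_wildcard are made up, and a sorted list stands in for the term dictionary. The query *cats becomes a prefix scan over reversed terms starting at stac.

```python
from bisect import bisect_left

def build_reversed_terms(terms):
    """Store every term reversed, in sorted order, like a term dictionary."""
    return sorted(t[::-1] for t in terms)

def leading_wildcard(reversed_terms, suffix):
    """Find terms ending in `suffix` (the query *suffix) via a prefix scan."""
    prefix = suffix[::-1]
    i = bisect_left(reversed_terms, prefix)  # jump straight to the first candidate
    matches = []
    while i < len(reversed_terms) and reversed_terms[i].startswith(prefix):
        matches.append(reversed_terms[i][::-1])
        i += 1
    return matches

# Sample terms are assumptions for this sketch.
reversed_terms = build_reversed_terms(["cats", "bobcats", "dogs", "cast", "wildcats"])
hits = leading_wildcard(reversed_terms, "cats")
```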



Phonetic Analysis

Creates a phonetic representation of the text, for “sounds like” matching
PhoneticFilterFactory uses one of:
Metaphone
Double Metaphone
Soundex
Refined Soundex
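
To give a feel for phonetic encoding, here is a textbook American Soundex sketch in Python. It is an independent illustration, not the encoder that PhoneticFilterFactory uses, and the function name soundex is chosen for this example.

```python
# Consonant groups that share a Soundex digit.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)

def soundex(word):
    """Encode a word as its first letter plus three digits."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so a repeat across a vowel is kept
    return (result + "000")[:4]

robert = soundex("Robert")
rupert = soundex("Rupert")
```

Words that sound alike end up with the same code, so a search for one can match the other.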



Synonyms

Synonym filter allows you to include alternate words that the user can use when searching
For example, theater, theatre
Useful for movie titles, where words are deliberately mis-spelled
Don’t over-use synonyms
It helps recall, but lowers precision
Produces tokens at the same token position
“local theater company”
       theatre
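
The "same token position" point can be sketched in Python. The function add_synonyms and the synonym map are hypothetical. Each synonym is emitted at the position of the word it stands in for, so phrase queries still line up.

```python
def add_synonyms(tokens, synonyms):
    """Return (position, term) pairs; synonyms share the original's position."""
    out = []
    for pos, tok in enumerate(tokens):
        out.append((pos, tok))
        for alt in synonyms.get(tok, []):
            out.append((pos, alt))  # stacked on the same position
    return out

# Assumed synonym map for this sketch.
synonyms = {"theater": ["theatre"], "theatre": ["theater"]}
stream = add_synonyms("local theater company".split(), synonyms)
```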


HTML text extraction

Removes HTML tags, attributes, comments
XML processing instructions
Removes <script> and <style> contents
Replaces entities
HTMLStripCharFilterFactory
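
The same kind of clean-up can be approximated with Python's standard html.parser module. This is a sketch of the behaviour listed above, not the Solr char filter. The class TextExtractor and the function strip_html are names made up for this example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text content, skipping anything inside <script> or <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities arrive already replaced
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def strip_html(markup):
    """Return the visible text with whitespace normalized."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join("".join(parser.parts).split())

text = strip_html('<p class="x">Fish &amp; chips<!-- note --></p><script>var a = 1;</script>')
```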



Spell Checking

Spell checker starts by analyzing the source terms into n-grams
From the Lucene Wiki:



Spell Checking

You don’t actually have to know that to use the spell checker
But I think it’s kind of cool
Use Luke to explore the index generated by the spell checker.
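
The n-gram idea can be sketched in a few lines of Python. This assumes plain character n-grams and a simple overlap count. The functions ngrams and shared_grams are made up for illustration, and the real spell checker indexes more than this.

```python
def ngrams(word, n):
    """All contiguous character sequences of length n."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def shared_grams(a, b, n=3):
    """Count distinct n-grams the two words have in common."""
    return len(set(ngrams(a, n)) & set(ngrams(b, n)))

grams = ngrams("lucene", 3)
overlap = shared_grams("lucene", "lucine")
```

A misspelled word still shares some n-grams with the correct one, which is how candidate suggestions can be found.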



And many more

Regular expression Tokenizer
Stemmers for many languages
Persian, Hindi, Chinese, Japanese, etc.
Third party/commercial stemmers available, too.
SnowballPorterFilter



Recap

If you can’t find it, and you are sure it’s there:
It’s likely an analysis problem
Three main tools for troubleshooting analysis
Analysis tool
Schema browser
Luke
Look at your index, documents and the output of your analyzers periodically.


Additional Resources

Lucid Imagination Solr Reference Guide
LucidImagination.com/downloads
Lucene in Action Second Edition
This isn’t published yet, but you can get the early access version from manning.com/hatcher3
http://www.hathitrust.org/blog
Solr wiki on Analysis
Wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
Luke - http://code.google.com/p/luke/


Questions

If we have time, we’ll take some questions



Thanks!
Tom Hill
LucidImagination.com

