HACKING LUCENE AND SOLR FOR FUN AND PROFIT

Grant Ingersoll
CTO, LucidWorks,
grant@lucidworks.com, @gsingers
Keyword Search is so yesterday
• Search is a system building block
– text is only a part of the story
• If the algorithms fit, use them!
• Embrace fuzziness!
• Scoring features are everywhere
Lucene and Solr can do…
• Classic: Fast, fuzzy text matching across a large document collection
• Data Quality and Analysis
– Faceting, slicing and dicing of numerical/enumerated data
– Spatial
– Spell checking, record linkage, highlighting
– Stats, Missing fields, etc.
• Top N problems
Topics

• Search Hacks

• “Trust me, I’m a mathematician”

• “I wish I had thought of that” Hack
Search Hacks
Learn IR
• SimpleTextCodec Example
// conf is the IndexWriterConfig; SimpleTextCodec writes the index as human-readable text
conf.setCodec(new SimpleTextCodec());
File simpleText = new File("simpletext");
Directory directory = new SimpleFSDirectory(simpleText);
IndexWriter writer = new IndexWriter(directory, conf);
index(writer);
• Similarity:
// swap the default scoring model for BM25
BM25Similarity bm25Similarity = new BM25Similarity();
conf.setSimilarity(bm25Similarity);
• http://www.ibm.com/developerworks/java/library/j-solr-lucene/index.html
http://localhost:8983/solr/answer?q=what+is+trimethylbenzene&defType=qa&qa=true&qa.qf=body
Simple QA Workflow
Analysis
• Split into sentences
– Buffer tokens – see com.tamingtext.texttamer.solr.SentenceTokenizer
• Identify Names using OpenNLP
• Add Entity marker tokens at the same position as original token
– Could also be done with Payloads
• Index
• https://github.com/tamingtext/book/tree/master/src/main/java/com/tamingtext/texttamer/solr
• https://github.com/tamingtext/book/blob/master/apache-solr/solrqa/conf/schema.xml
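The "marker token at the same position" step can be sketched in plain Java, without Lucene or OpenNLP on the classpath. Everything named here is an assumption made for illustration: the Token class, the "NE_" marker prefix, and the lookup map that stands in for the OpenNLP name finder. The only idea taken from the slide is that the marker gets a position increment of 0, so it shares a position with the word it annotates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class EntityMarkerSketch {
    /** A term plus its position increment (1 = next position, 0 = same position). */
    static final class Token {
        final String term;
        final int positionIncrement;
        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    /**
     * Emits every word, followed by a marker token at the same position
     * whenever the (hypothetical) entity lookup knows the word.
     */
    static List<Token> addMarkers(List<String> words, Map<String, String> entityTypes) {
        List<Token> out = new ArrayList<>();
        for (String word : words) {
            out.add(new Token(word, 1));
            String type = entityTypes.get(word);
            if (type != null) {
                out.add(new Token("NE_" + type, 0)); // increment 0: stacked on the word
            }
        }
        return out;
    }

    /** Turns position increments into absolute positions, starting at 0. */
    static int[] positions(List<Token> tokens) {
        int[] result = new int[tokens.size()];
        int current = -1;
        for (int i = 0; i < tokens.size(); i++) {
            current += tokens.get(i).positionIncrement;
            result[i] = current;
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> entities = new HashMap<>();
        entities.put("Grant", "person");     // stand-in for a name finder result
        entities.put("Boston", "location");
        List<Token> tokens = addMarkers(Arrays.asList("Grant", "works", "in", "Boston"), entities);
        int[] pos = positions(tokens);
        for (int i = 0; i < tokens.size(); i++) {
            System.out.println(pos[i] + "\t" + tokens.get(i).term);
        }
    }
}
```

Because the marker and the word share a position, a positional query for "a person near this keyword" can match on either token.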
Search Side

• Custom Query Parser takes in user’s natural language query,
classifies it to find the Answer Type and generates Solr query
• Retrieve candidate passages that match keywords and expected
answer type
• Unlike keyword search, we need to know exactly where matches
occur
• https://github.com/tamingtext/book/tree/master/src/main/java/com/tamingtext/qa
Answer Type Classification

• Answer Type examples:
– Person (P), Location (L), Organization (O), Time Point (T),
Duration (R), Money (M)
– See page 248 for more
• Train an OpenNLP classifier off of a set of previously annotated
questions, e.g.:
– P Which French monarch reinstated the divine right of the
monarchy to France and was known as `The Sun King'
because of the splendour of his reign?
“Trust me, I’m a mathematician”
Classification
kNN and TF/IDF Classification w/ Lucene

https://github.com/tamingtext/book/tree/master/src/main/java/com/tamingtext/classifier/mlt
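As a rough illustration of the kNN idea, here is a self-contained sketch in plain Java: score every training document against the input with a simplified TF/IDF weight, take the top k, and let them vote. It is not the code from the repository above and does not use Lucene; the class name, the tokenizer, and the idf formula log(1 + N/df) are all assumptions chosen to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class KnnTextClassifier {
    private final List<List<String>> docs = new ArrayList<>();    // tokenized training docs
    private final List<String> labels = new ArrayList<>();        // label per training doc
    private final Map<String, Integer> docFreq = new HashMap<>(); // term -> number of docs

    /** Lowercases and splits on non-word characters, dropping empty tokens. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    void add(String label, String text) {
        List<String> tokens = tokenize(text);
        docs.add(tokens);
        labels.add(label);
        for (String t : new HashSet<>(tokens)) {
            docFreq.merge(t, 1, Integer::sum);
        }
    }

    /** Returns the label with the highest summed score among the k best matches, or null. */
    String classify(String text, int k) {
        Set<String> query = new HashSet<>(tokenize(text));
        int n = docs.size();
        double[] scores = new double[n];
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
            for (String t : docs.get(i)) {
                if (query.contains(t)) {
                    // one idf contribution per occurrence, i.e. tf * idf
                    scores[i] += Math.log(1.0 + (double) n / docFreq.get(t));
                }
            }
        }
        Arrays.sort(order, (a, b) -> Double.compare(scores[b], scores[a]));
        Map<String, Double> votes = new HashMap<>();
        for (int r = 0; r < Math.min(k, n); r++) {
            int i = order[r];
            if (scores[i] > 0) votes.merge(labels.get(i), scores[i], Double::sum);
        }
        String best = null;
        double bestVote = 0;
        for (Map.Entry<String, Double> e : votes.entrySet()) {
            if (e.getValue() > bestVote) {
                best = e.getKey();
                bestVote = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        KnnTextClassifier c = new KnnTextClassifier();
        c.add("sports", "the team won the football match");
        c.add("sports", "football players score goals");
        c.add("tech", "the new phone has a fast processor");
        c.add("tech", "software update improves processor speed");
        System.out.println(c.classify("football match tonight", 3));
    }
}
```

With a search engine the scoring loop disappears: the training documents are the index, the input text becomes the query, and the top k hits are the neighbors.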
Lucene Classification Module
• Builds classifier off of index information
• See the org.apache.lucene.classification package
• Naïve Bayes Classifier
• kNN Classifier
• Perceptron Classifier
Recommenders
• Cross recommendation as search
– with search used to build cross recommendation!
• Recommend content to people who exhibit certain behaviors (clicks, query terms, other)
• (Ab)use of a search engine
– but not as a search engine for content
– more like a search engine for behavior
• See Ted Dunning’s talk from Berlin Buzzwords on Multi-modal Recommendation Algorithms
– http://berlinbuzzwords.com/sessions/multi-modal-recommendation-algorithms
• Go get Mahout/Myrrix or just do it in y(our) search engine
Recommendation Basics
• History:

User    Thing
1       3
2       4
3       4
2       3
3       2
1       1
2       1
Recommendation Basics
• History as matrix:

        t1   t2   t3   t4
u1      1    0    1    0
u2      1    0    1    1
u3      0    1    0    1

• t1+t3 cooccur 2 times; t1+t4, t3+t4 and t2+t4 once each
Recommendation Basics
• Cooccurrence

        t1   t2   t3   t4
t1      2    0    2    1
t2      0    1    0    1
t3      2    0    2    1
t4      1    1    1    2
[Figure: 2×2 contingency table of t1 / not t1 against t3 / not t3]

• More details at http://lucenerevolution.org/2013/Crowd-sourced-intelligence-builtinto-Search-over-Hadoop
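The three Recommendation Basics slides can be reproduced with a few lines of plain Java. This is a minimal sketch: it assumes the History slide reads as (user, thing) pairs row by row, and the class and method names are made up for the example. The cooccurrence matrix is the product of the transposed history matrix with itself, so entry [i][j] counts the users who touched both thing i and thing j, and the diagonal counts the users per thing.

```java
import java.util.Arrays;

class CooccurrenceSketch {
    /** Builds the 0/1 user-by-thing matrix from (user, thing) pairs with 1-based ids. */
    static int[][] historyMatrix(int[][] history, int users, int things) {
        int[][] matrix = new int[users][things];
        for (int[] pair : history) {
            matrix[pair[0] - 1][pair[1] - 1] = 1;
        }
        return matrix;
    }

    /** Computes A^T * A: how many users have both thing i and thing j in their history. */
    static int[][] cooccurrence(int[][] a) {
        int things = a[0].length;
        int[][] c = new int[things][things];
        for (int[] userRow : a) {
            for (int i = 0; i < things; i++) {
                for (int j = 0; j < things; j++) {
                    c[i][j] += userRow[i] * userRow[j];
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        // (user, thing) pairs, read row by row from the History slide
        int[][] history = {{1, 3}, {2, 4}, {3, 4}, {2, 3}, {3, 2}, {1, 1}, {2, 1}};
        int[][] a = historyMatrix(history, 3, 4);
        for (int[] row : cooccurrence(a)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

In a search engine the same counts fall out of the index: store each user's history as one document and the cooccurrence of two things is the number of documents that contain both terms.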
“I wish I had thought of that”
Time Space Continuum
• Leverage Solr’s new spatial capabilities to index non-spatial data, such as time ranges
– Useful for Open Hours, Shifts, etc.
• Key: multi-valued range data
• Query using rectangle intersections
– q = shift:"Intersects(0 19 23 365)"
• Credits to David Smiley and Hoss…
https://people.apache.org/~hossman/spatial-for-non-spatial-meetup-20130117/
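The geometry behind the rectangle query can be sketched in plain Java, with no Solr involved. The sketch rests on assumptions that the slide does not spell out: each range is indexed as the 2D point (x = start, y = end), the Intersects arguments are read as minX minY maxX maxY, and 0 and 365 are the bounds of the value domain. All class and method names are hypothetical. Under those assumptions, a range overlaps the query window [qStart, qEnd] exactly when start <= qEnd and end >= qStart, which is a rectangle in the (start, end) plane.

```java
import java.util.Arrays;
import java.util.List;

class TimeRangeAsPoint {
    /** One time range, treated as the 2D point (x = start, y = end). */
    static final class Range {
        final int start;
        final int end;
        Range(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Axis-aligned rectangle in the (start, end) plane. */
    static final class Rect {
        final int minX, minY, maxX, maxY;
        Rect(int minX, int minY, int maxX, int maxY) {
            this.minX = minX;
            this.minY = minY;
            this.maxX = maxX;
            this.maxY = maxY;
        }
        boolean contains(Range r) {
            return r.start >= minX && r.start <= maxX && r.end >= minY && r.end <= maxY;
        }
    }

    /** Rectangle matching every range that overlaps [qStart, qEnd]. */
    static Rect overlapQuery(int qStart, int qEnd, int domainMin, int domainMax) {
        // start in [domainMin, qEnd] and end in [qStart, domainMax]
        return new Rect(domainMin, qStart, qEnd, domainMax);
    }

    /** A multi-valued field matches if any one of its ranges falls inside the rectangle. */
    static boolean matches(List<Range> fieldValues, Rect query) {
        for (Range r : fieldValues) {
            if (query.contains(r)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Rect query = overlapQuery(19, 23, 0, 365); // same numbers as Intersects(0 19 23 365)
        List<Range> splitShift = Arrays.asList(new Range(8, 12), new Range(20, 22));
        List<Range> morningOnly = Arrays.asList(new Range(8, 12));
        System.out.println(matches(splitShift, query));
        System.out.println(matches(morningOnly, query));
    }
}
```

The multi-valued part matters: a document with several shifts is several points, and one point inside the rectangle is enough for a hit.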
Finance Example
[Figure: % change plotted against Time for AAPL, IBM and MSFT]
Resources
• http://www.manning.com/ingersoll
– http://github.com/tamingtext/book
• http://www.tamingtext.com
• Me:
– @gsingers
– grant@lucidworks.com
