HACKING LUCENE AND SOLR FOR FUN AND PROFIT

Grant Ingersoll
CTO, LucidWorks,
grant@lucidworks.com, @gsingers
Keyword Search is so yesterday
• Search is a system building block
– text is only a part of the story
• If the algorith...
Lucene and Solr can do…
• Classic: Fast, fuzzy text matching across a large document collection
• Data Quality and Anal...
Topics

• Search Hacks

• “Trust me, I’m a mathematician”

• “I wish I had thought of that” Hack
Search Hacks
Learn IR
• SimpleTextCodec Example
conf.setCodec(new SimpleTextCodec());
File simpleText = new File("simpletext");
direct...
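
The Learn IR slide also swaps in BM25Similarity (see slide 6 in the transcript). As a rough aid for reading scores while experimenting, here is a plain-Java sketch of the textbook BM25 term score. It does not call Lucene; the class and method names are made up for illustration, and Lucene's own implementation may differ in details such as length encoding and constant factors. The values k1 = 1.2 and b = 0.75 are the usual defaults.

```java
// Sketch only: textbook BM25 for a single query term, not Lucene's code.
class Bm25Sketch {

    // Inverse document frequency: rarer terms get a larger weight.
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Score of one term in one document.
    // k1 controls term-frequency saturation, b controls length normalization.
    static double score(double termFreq, double docLen, double avgDocLen,
                        long docCount, long docFreq, double k1, double b) {
        double norm = k1 * (1.0 - b + b * docLen / avgDocLen);
        return idf(docCount, docFreq) * termFreq * (k1 + 1.0) / (termFreq + norm);
    }

    public static void main(String[] args) {
        // Same term frequency, different document lengths.
        System.out.println(score(3, 50, 100, 1000, 10, 1.2, 0.75));
        System.out.println(score(3, 200, 100, 1000, 10, 1.2, 0.75));
    }
}
```

With these inputs the shorter document scores higher than the longer one, and raising the term frequency gives diminishing returns because of the saturation term.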
http://localhost:8983/solr/answer?q=what+is+trimethylbenzene&defType=qa&qa=true&qa.qf=body
Simple QA Workflow
Analysis
• Split into sentences
– Buffer tokens – see com.tamingtext.texttamer.solr.SentenceTokenizer
• Identify Names us...
Search Side

• Custom Query Parser takes in user’s natural language query,
classifies it to find the Answer Type and gener...
Answer Type Classification

• Answer Type examples:
– Person (P), Location (L), Organization (O), Time Point (T),
Duration...
“Trust me, I’m a mathematician”
Classification
kNN and TF/IDF Classification w/ Lucene

https://github.com/tamingtext/book/tree/master/src/main/java/com/tamingtext/class...
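
To make the kNN and TF/IDF idea concrete, here is a small plain-Java sketch: weight terms by TF-IDF, rank labeled documents by cosine similarity to the input, and let the top k vote. This is not the code behind the link above and it does not use Lucene; the class name, the sample documents, and the particular IDF smoothing are all assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

// Sketch only: kNN text classification over TF-IDF vectors, no Lucene involved.
class KnnTfIdfSketch {

    // Hypothetical, already-tokenized training documents and their labels.
    static final String[][] DOCS = {
        {"goal", "match", "team"},
        {"team", "score", "goal"},
        {"stock", "market", "price"},
        {"market", "shares", "price"}
    };
    static final String[] LABELS = {"sports", "sports", "finance", "finance"};

    // Term frequency times a smoothed inverse document frequency.
    static Map<String, Double> tfIdf(String[] tokens, Map<String, Integer> docFreq, int numDocs) {
        Map<String, Double> vec = new HashMap<>();
        for (String t : tokens) {
            vec.merge(t, 1.0, Double::sum);
        }
        for (Map.Entry<String, Double> e : vec.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 0);
            double idf = Math.log((numDocs + 1.0) / (df + 1.0)) + 1.0;
            e.setValue(e.getValue() * idf);
        }
        return vec;
    }

    // Cosine similarity between two sparse vectors.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            na += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            nb += v * v;
        }
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }

    // Rank training documents by similarity to the query; the top k vote on the label.
    static String classify(String[][] docs, String[] labels, String[] query, int k) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (String[] d : docs) {
            for (String t : new HashSet<String>(Arrays.asList(d))) {
                docFreq.merge(t, 1, Integer::sum);
            }
        }
        Map<String, Double> q = tfIdf(query, docFreq, docs.length);
        double[] sims = new double[docs.length];
        Integer[] order = new Integer[docs.length];
        for (int i = 0; i < docs.length; i++) {
            sims[i] = cosine(q, tfIdf(docs[i], docFreq, docs.length));
            order[i] = i;
        }
        Arrays.sort(order, (x, y) -> Double.compare(sims[y], sims[x]));
        Map<String, Integer> votes = new HashMap<>();
        for (int i = 0; i < Math.min(k, order.length); i++) {
            votes.merge(labels[order[i]], 1, Integer::sum);
        }
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        System.out.println(classify(DOCS, LABELS, new String[] {"goal", "team"}, 3));
    }
}
```

In a search engine the ranking step is simply a query against the index of labeled documents, which is why an inverted index is a natural fit for this kind of classifier.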
Lucene Classification Module
• Builds classifier off of index information
• See the org.apache.lucene.classification pa...
Recommenders
• Cross recommendation as search
– with search used to build cross recommendation!
• Recommend cont...
Recommendation Basics
• History:

User  Thing
1     3
2     4
3     4
2     3
3     2
1     1
2     1
Recommendation Basics
• History as matrix:

     t1  t2  t3  t4
u1   1   0   1   0
u2   1   0   1   1
u3   0   1   0   1

• t1+t3 coocc...
Recommendation Basics
• Cooccurrence

     t1  t2  t3  t4
t1   2   0   2   1
t2   0   1   0   1
t3   2   0   2   1
t4   1   1   1   2

• [2x2 contingency table: t1 / not t1 against t3 / not t3; cell values not recoverable from the extraction]
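
The step from history to cooccurrence counts can be sketched in plain Java, with no Lucene or Mahout dependency. The pairs are the seven (user, thing) rows from the History slide; the class and method names are illustrative and not from the talk.

```java
import java.util.Arrays;

// Sketch only: user-by-thing history matrix and thing-by-thing cooccurrence counts.
class CooccurrenceSketch {

    // History pairs from the slide: {user, thing}, both 1-based.
    static final int[][] HISTORY = {
        {1, 3}, {2, 4}, {3, 4}, {2, 3}, {3, 2}, {1, 1}, {2, 1}
    };

    // Rows are users, columns are things; a 1 means the user interacted with the thing.
    static int[][] historyMatrix(int[][] pairs, int users, int things) {
        int[][] a = new int[users][things];
        for (int[] p : pairs) {
            a[p[0] - 1][p[1] - 1] = 1;
        }
        return a;
    }

    // C = A^T * A: C[i][j] counts the users who interacted with both thing i and thing j.
    static int[][] cooccurrence(int[][] a) {
        int things = a[0].length;
        int[][] c = new int[things][things];
        for (int[] row : a) {
            for (int i = 0; i < things; i++) {
                for (int j = 0; j < things; j++) {
                    c[i][j] += row[i] * row[j];
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        int[][] a = historyMatrix(HISTORY, 3, 4);
        for (int[] row : cooccurrence(a)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Running main prints the rows [2, 0, 2, 1], [0, 1, 0, 1], [2, 0, 2, 1] and [1, 1, 1, 2]. The diagonal counts how many users touched each thing, and the off-diagonal cells are the pairwise cooccurrence counts.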
“I wish I had thought of that”
Time Space Continuum
• Leverage Solr’s new spatial capabilities to index non-spatial data, such as time ranges
– Useful f...
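
Slide 21 in the transcript queries with shift:"Intersects(0 19 23 365)". One way to read that query, sketched below in plain Java, is that each range [start, end] is stored as the 2D point (x = start, y = end), and "overlaps [19, 23]" becomes a rectangle over those points. The point mapping, the 0 to 365 bounds, and all names here are assumptions made for illustration; the slide itself does not spell out the encoding, and no Solr API is called.

```java
// Sketch only: range overlap expressed as a rectangle test over (start, end) points.
class TimeRangeRectSketch {

    // Assumed bounds of the coordinate space.
    static final int WORLD_MIN = 0;
    static final int WORLD_MAX = 365;

    // A range [start, end] becomes the point (x = start, y = end).
    static int[] toPoint(int start, int end) {
        return new int[] {start, end};
    }

    // Rectangle {minX, minY, maxX, maxY} that contains exactly the points whose
    // ranges overlap [qStart, qEnd]: start <= qEnd and end >= qStart.
    static int[] overlapQuery(int qStart, int qEnd) {
        return new int[] {WORLD_MIN, qStart, qEnd, WORLD_MAX};
    }

    // Inclusive point-in-rectangle test.
    static boolean contains(int[] rect, int[] point) {
        return point[0] >= rect[0] && point[0] <= rect[2]
            && point[1] >= rect[1] && point[1] <= rect[3];
    }

    public static void main(String[] args) {
        int[] q = overlapQuery(19, 23); // same numbers as the slide's query: 0 19 23 365
        System.out.println(contains(q, toPoint(10, 20))); // range 10..20 overlaps 19..23
        System.out.println(contains(q, toPoint(24, 30))); // range 24..30 starts too late
    }
}
```

Because a document can carry several such points in a multi-valued field, one document can hold many ranges (opening hours, shifts) and match if any of them overlaps the query.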
Finance Example
[Figure: % change over time for AAPL, IBM, and MSFT]
Resources
• http://www.manning.com/ingersoll
– http://github.com/tamingtext/book
• http://www.tamingtext.com
• Me:
– ...
Hacking Lucene and Solr for Fun and Profit

  1. 1. HACKING LUCENE AND SOLR FOR FUN AND PROFIT Grant Ingersoll CTO, LucidWorks, grant@lucidworks.com, @gsingers
  2. 2. Keyword Search is so yesterday • Search is a system building block – text is only a part of the story • If the algorithms fit, use them! • Embrace fuzziness! • Scoring features are everywhere
  3. 3. Lucene and Solr can do… • Classic: Fast, fuzzy text matching across a large document collection • Data Quality and Analysis – Faceting, slicing and dicing of numerical/enumerated data – Spatial – Spell checking, record linkage, highlighting – Stats, Missing fields, etc. • Top N problems
  4. 4. Topics • Search Hacks • “Trust me, I’m a mathematician” • “I wish I had thought of that” Hack
  5. 5. Search Hacks
  6. 6. Learn IR • SimpleTextCodec Example conf.setCodec(new SimpleTextCodec()); File simpleText = new File("simpletext"); directory = new SimpleFSDirectory(simpleText); writer = new IndexWriter(directory, conf); index(writer); • Similarity: BM25Similarity bm25Similarity = new BM25Similarity(); conf.setSimilarity(bm25Similarity); • http://www.ibm.com/developerworks/java/library/j-solr-lucene/index.html
  7. 7. http://localhost:8983/solr/answer?q=what+is+trimethylbenzene&defType=qa&qa=true&qa.qf=body
  8. 8. Simple QA Workflow
  9. 9. Analysis • Split into sentences – Buffer tokens – see com.tamingtext.texttamer.solr.SentenceTokenizer • Identify Names using OpenNLP • Add Entity marker tokens at the same position as original token – Could also be done with Payloads • Index • https://github.com/tamingtext/book/tree/master/src/main/java/com/tamingtext/texttamer/solr • https://github.com/tamingtext/book/blob/master/apache-solr/solrqa/conf/schema.xml
  10. 10. Search Side • Custom Query Parser takes in user’s natural language query, classifies it to find the Answer Type and generates Solr query • Retrieve candidate passages that match keywords and expected answer type • Unlike keyword search, we need to know exactly where matches occur • https://github.com/tamingtext/book/tree/master/src/main/java/com/tamingtext/qa
  11. 11. Answer Type Classification • Answer Type examples: – Person (P), Location (L), Organization (O), Time Point (T), Duration (R), Money (M) – See page 248 for more • Train an OpenNLP classifier off of a set of previously annotated questions, e.g.: – P Which French monarch reinstated the divine right of the monarchy to France and was known as `The Sun King' because of the splendour of his reign?
  12. 12. “Trust me, I’m a mathematician”
  13. 13. Classification
  14. 14. kNN and TF/IDF Classification w/ Lucene https://github.com/tamingtext/book/tree/master/src/main/java/com/tamingtext/classifier/mlt
  15. 15. Lucene Classification Module • Builds classifier off of index information • See the org.apache.lucene.classification package • Naïve Bayes Classifier • kNN Classifier • Perceptron Classifier
  16. 16. Recommenders • Cross recommendation as search – with search used to build cross recommendation! • Recommend content to people who exhibit certain behaviors (clicks, query terms, other) • (Ab)use of a search engine – but not as a search engine for content – more like a search engine for behavior • See Ted Dunning’s talk from Berlin Buzzwords on Multi-modal Recommendation Algorithms – http://berlinbuzzwords.com/sessions/multi-modal-recommendation-algorithms • Go get Mahout/Myrrix or just do it in y(our) search engine
  17. 17. Recommendation Basics • History as (user, thing) pairs: (1, 3), (2, 4), (3, 4), (2, 3), (3, 2), (1, 1), (2, 1)
  18. 18. Recommendation Basics • History as matrix (columns t1 t2 t3 t4): u1 = 1 0 1 0; u2 = 1 0 1 1; u3 = 0 1 0 1 • t1+t3 cooccur 2 times, t1+t4 once, t2+t4 once
  19. 19. Recommendation Basics • Cooccurrence (columns t1 t2 t3 t4): t1 = 2 0 2 1; t2 = 0 1 0 1; t3 = 2 0 2 1; t4 = 1 1 1 2 • [2x2 contingency table: t1 / not t1 against t3 / not t3; cell values not recoverable from the extraction] • More details at http://lucenerevolution.org/2013/Crowd-sourced-intelligence-builtinto-Search-over-Hadoop
  20. 20. “I wish I had thought of that”
  21. 21. Time Space Continuum • Leverage Solr’s new spatial capabilities to index non-spatial data, such as time ranges – Useful for Open Hours, Shifts, etc. • Key: multi-valued range data • Query using rectangle intersections – q = shift:"Intersects(0 19 23 365)" • Credits to David Smiley and Hoss… https://people.apache.org/~hossman/spatial-for-non-spatial-meetup-20130117/
  22. 22. Finance Example • [Figure: % change over time for AAPL, IBM, and MSFT]
  23. 23. Resources • http://www.manning.com/ingersoll – http://github.com/tamingtext/book • http://www.tamingtext.com • Me: – @gsingers – grant@lucidworks.com