Hacking Lucene and Solr for Fun and Profit
Published in: Technology

Transcript

  • 1. HACKING LUCENE AND SOLR FOR FUN AND PROFIT Grant Ingersoll CTO, LucidWorks, grant@lucidworks.com, @gsingers
  • 2. Keyword Search is so yesterday • Search is a system building block – text is only a part of the story • If the algorithms fit, use them! • Embrace fuzziness! • Scoring features are everywhere
  • 3. Lucene and Solr can do… • Classic: Fast, fuzzy text matching across a large document collection • Data Quality and Analysis – Faceting, slicing and dicing of numerical/enumerated data – Spatial – Spell checking, record linkage, highlighting – Stats, Missing fields, etc. • Top N problems
  • 4. Topics • Search Hacks • “Trust me, I’m a mathematician” • “I wish I had thought of that” Hack
  • 5. Search Hacks
  • 6. Learn IR
      • SimpleTextCodec example:
            // SimpleTextCodec writes the index as human-readable plain text
            conf.setCodec(new SimpleTextCodec());
            File simpleText = new File("simpletext");
            directory = new SimpleFSDirectory(simpleText);
            writer = new IndexWriter(directory, conf);
            // index(writer) is the slide's own helper; its body is not shown
            index(writer);
      • Similarity:
            BM25Similarity bm25Similarity = new BM25Similarity();
            conf.setSimilarity(bm25Similarity);
      • http://www.ibm.com/developerworks/java/library/j-solr-lucene/index.html
  • 7. http://localhost:8983/solr/answer?q=what+is+trimethylbenzene&defType=qa&qa=true&qa.qf=body
  • 8. Simple QA Workflow
  • 9. Analysis • Split into sentences – Buffer tokens – see com.tamingtext.texttamer.solr.SentenceTokenizer • Identify names using OpenNLP • Add entity marker tokens at the same position as the original token – Could also be done with Payloads • Index • https://github.com/tamingtext/book/tree/master/src/main/java/com/tamingtext/texttamer/solr • https://github.com/tamingtext/book/blob/master/apache-solr/solrqa/conf/schema.xml
  • 10. Search Side • Custom Query Parser takes in user’s natural language query, classifies it to find the Answer Type and generates Solr query • Retrieve candidate passages that match keywords and expected answer type • Unlike keyword search, we need to know exactly where matches occur • https://github.com/tamingtext/book/tree/master/src/main/java/com/tamingtext/qa
  • 11. Answer Type Classification • Answer Type examples: – Person (P), Location (L), Organization (O), Time Point (T), Duration (R), Money (M) – See page 248 for more • Train an OpenNLP classifier off of a set of previously annotated questions, e.g.: – P Which French monarch reinstated the divine right of the monarchy to France and was known as `The Sun King' because of the splendour of his reign?
  • 12. “Trust me, I’m a mathematician”
  • 13. Classification
  • 14. kNN and TF/IDF Classification w/ Lucene https://github.com/tamingtext/book/tree/master/src/main/java/com/tamingtext/classifier/mlt
  • 15. Lucene Classification Module • Builds classifier off of index information • See the org.apache.lucene.classification package • Naïve Bayes Classifier • kNN Classifier • Perceptron Classifier
  • 16. Recommenders • Cross recommendation as search – with search used to build cross recommendation! • Recommend content to people who exhibit certain behaviors (clicks, query terms, other) • (Ab)use of a search engine – but not as a search engine for content – more like a search engine for behavior • See Ted Dunning’s talk from Berlin Buzzwords on Multi-modal Recommendation Algorithms – http://berlinbuzzwords.com/sessions/multi-modal-recommendation-algorithms • Go get Mahout/Myrrix or just do it in y(our) search engine
  • 17. Recommendation Basics • History:
        User  Thing
        1     3
        2     4
        3     4
        2     3
        3     2
        1     1
        2     1
  • 18. Recommendation Basics • History as matrix:
             t1  t2  t3  t4
        u1    1   0   1   0
        u2    1   0   1   1
        u3    0   1   0   1
      • t1+t3 co-occur 2 times, t1+t4 once, t2+t4 once
  • 19. Recommendation Basics • Co-occurrence:
             t1  t2  t3  t4
        t1    2   0   2   1
        t2    0   1   0   1
        t3    2   0   2   1
        t4    1   1   1   2
      • [Contingency table: t1 / not t1 against t3 / not t3]
      • More details at http://lucenerevolution.org/2013/Crowd-sourced-intelligence-builtinto-Search-over-Hadoop
  • 20. “I wish I had thought of that”
  • 21. Time Space Continuum • Leverage Solr’s new spatial capabilities to index non-spatial data, such as time ranges – Useful for Open Hours, Shifts, etc. • Key: multi-valued range data • Query using rectangle intersections – q = shift:"Intersects(0 19 23 365)" • Credits to David Smiley and Hoss… https://people.apache.org/~hossman/spatial-for-non-spatial-meetup-20130117/
  • 22. Finance Example • [Chart: axes "% change" and "Time"; series labels AAPL, IBM, MSFT]
  • 23. Resources • http://www.manning.com/ingersoll – http://github.com/tamingtext/book • http://www.tamingtext.com • Me: – @gsingers – grant@lucidworks.com
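
Slide 9 mentions adding entity marker tokens at the same position as the original token. The following is a minimal plain-Java sketch of that idea, not the code from the tamingtext repository: the `Token` class, the `NE_` marker prefix, the sample sentence, and the label map are all hypothetical, and in the real pipeline the labels would come from a name finder such as OpenNLP inside a Lucene analysis chain.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class EntityMarkerSketch {
    /** A term plus its position in the token stream (hypothetical helper type). */
    static final class Token {
        final String term;
        final int position;

        Token(String term, int position) {
            this.term = term;
            this.position = position;
        }
    }

    /**
     * Emits every original token and, for each token index that has an
     * entity label, one extra marker token at the SAME position.
     * The "NE_" prefix is an assumed marker format for illustration only.
     */
    static List<Token> addMarkers(List<String> terms, Map<Integer, String> labels) {
        List<Token> out = new ArrayList<>();
        for (int i = 0; i < terms.size(); i++) {
            out.add(new Token(terms.get(i), i));
            String label = labels.get(i);
            if (label != null) {
                out.add(new Token("NE_" + label, i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sentence and labels, standing in for name-finder output.
        List<String> terms = List.of("grant", "works", "in", "boston");
        Map<Integer, String> labels = Map.of(0, "person", 3, "location");
        for (Token t : addMarkers(terms, labels)) {
            System.out.println(t.position + "\t" + t.term);
        }
    }
}
```

Because the marker shares a position with the word it annotates, a position-aware query can require "a person name here" and a keyword nearby in the same passage, which is what the search side on slide 10 relies on.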

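
The co-occurrence counts on slides 17 to 19 can be reproduced from the user/thing history with a few loops. This is a sketch under the assumption that the co-occurrence matrix is the history matrix multiplied by its own transpose (A^T A), so that the diagonal holds how many users touched each thing. The class and method names are made up for illustration, and nothing here uses Lucene, Solr, or Mahout.

```java
import java.util.Arrays;

class CooccurrenceSketch {
    /** Builds a users x things 0/1 matrix from (user, thing) pairs with 1-based ids. */
    static int[][] historyMatrix(int[][] pairs, int users, int things) {
        int[][] a = new int[users][things];
        for (int[] p : pairs) {
            a[p[0] - 1][p[1] - 1] = 1;
        }
        return a;
    }

    /** Computes A^T A: cell (i, j) counts the users who touched both thing i and thing j. */
    static int[][] cooccurrence(int[][] a) {
        int things = a[0].length;
        int[][] c = new int[things][things];
        for (int[] userRow : a) {
            for (int i = 0; i < things; i++) {
                for (int j = 0; j < things; j++) {
                    c[i][j] += userRow[i] * userRow[j];
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        // The (user, thing) history from slide 17.
        int[][] pairs = {{1, 3}, {2, 4}, {3, 4}, {2, 3}, {3, 2}, {1, 1}, {2, 1}};
        int[][] c = cooccurrence(historyMatrix(pairs, 3, 4));
        for (int[] row : c) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

For the slide's history, the t1 row comes out as 2, 0, 2, 1, which matches the statement that t1 and t3 co-occur twice and t1 and t4 once.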
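
The rectangle query on slide 21 can be read as an overlap test on time ranges. The sketch below assumes each range [start, end] is stored as the 2D point (x = start, y = end) and that the query rectangle is written in "minX minY maxX maxY" order. That reading, the method names, and the sample shifts are assumptions for illustration; the slide does not state the units, and this code does not call Solr.

```java
class TimeRangeSketch {
    /**
     * Treats a [start, end] range as the point (start, end) and checks
     * whether that point falls inside the query rectangle.
     */
    static boolean intersects(int start, int end, int minX, int minY, int maxX, int maxY) {
        return start >= minX && start <= maxX && end >= minY && end <= maxY;
    }

    /**
     * Builds the rectangle that matches every stored range overlapping
     * [qStart, qEnd]: the range must start no later than qEnd and end no
     * earlier than qStart. worldMin and worldMax bound the coordinate space.
     */
    static int[] overlapQuery(int qStart, int qEnd, int worldMin, int worldMax) {
        return new int[] {worldMin, qStart, qEnd, worldMax};
    }

    public static void main(String[] args) {
        // Overlap with [19, 23] in a 0..365 space gives the slide's numbers: 0 19 23 365.
        int[] q = overlapQuery(19, 23, 0, 365);
        int[][] shifts = {{17, 20}, {1, 5}, {24, 30}}; // made-up sample ranges
        for (int[] s : shifts) {
            boolean hit = intersects(s[0], s[1], q[0], q[1], q[2], q[3]);
            System.out.println(s[0] + "-" + s[1] + " overlaps: " + hit);
        }
    }
}
```

The design choice is that a one-dimensional overlap test becomes a two-dimensional point-in-rectangle test, which a spatial index can answer directly even when a document carries many ranges.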