Hacking Lucene and Solr for Fun and Profit

Published in: Technology

  1. HACKING LUCENE AND SOLR FOR FUN AND PROFIT
       Grant Ingersoll, CTO, LucidWorks, grant@lucidworks.com, @gsingers
  2. Keyword Search is so yesterday
       • Search is a system building block – text is only a part of the story
       • If the algorithms fit, use them!
       • Embrace fuzziness!
       • Scoring features are everywhere
  3. Lucene and Solr can do…
       • Classic: Fast, fuzzy text matching across a large document collection
       • Data Quality and Analysis
           – Faceting, slicing and dicing of numerical/enumerated data
           – Spatial
           – Spell checking, record linkage, highlighting
           – Stats, Missing fields, etc.
       • Top N problems
  4. Topics
       • Search Hacks
       • “Trust me, I’m a mathematician”
       • “I wish I had thought of that” Hack
  5. Search Hacks
  6. Learn IR
       • SimpleTextCodec Example
           // SimpleTextCodec writes the index in a human-readable plain-text format
           conf.setCodec(new SimpleTextCodec());
           File simpleText = new File("simpletext");
           directory = new SimpleFSDirectory(simpleText);
           writer = new IndexWriter(directory, conf);
           // index(...) is the talk's own helper (presumably the one that adds the documents)
           index(writer);
       • Similarity:
           BM25Similarity bm25Similarity = new BM25Similarity();
           conf.setSimilarity(bm25Similarity);
       • http://www.ibm.com/developerworks/java/library/j-solr-lucene/index.html
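To see what swapping in BM25 changes, it can help to compute a score by hand. The following is a minimal sketch of the textbook BM25 term score, assuming the commonly cited defaults k1 = 1.2 and b = 0.75; it is not Lucene's `BM25Similarity` source, whose implementation differs in details. The class name and the collection statistics in `main` are made up for illustration.

```java
/** Textbook BM25 term score; a sketch, not Lucene's BM25Similarity source. */
final class Bm25Sketch {
    // Assumed defaults: k1 controls term-frequency saturation, b controls length normalization.
    static final double K1 = 1.2;
    static final double B = 0.75;

    /** Inverse document frequency: rarer terms get a larger weight. */
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** Score contribution of one term in one document. */
    static double score(double termFreq, double docLen, double avgDocLen,
                        long docCount, long docFreq) {
        double lengthNorm = 1.0 - B + B * (docLen / avgDocLen);
        double tfPart = termFreq * (K1 + 1.0) / (termFreq + K1 * lengthNorm);
        return idf(docCount, docFreq) * tfPart;
    }

    public static void main(String[] args) {
        // Hypothetical collection: 1000 docs, the term occurs in 10 of them.
        double shortDoc = score(3, 50, 100, 1000, 10);
        double longDoc = score(3, 200, 100, 1000, 10);
        // Same term frequency, but the shorter document scores higher.
        System.out.println(shortDoc > longDoc);
    }
}
```

Two properties are worth checking when experimenting: repeated occurrences of a term add less and less to the score, and longer documents are penalized relative to the average length.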
  7. http://localhost:8983/solr/answer?q=what+is+trimethylbenzene&defType=qa&qa=true&qa.qf=body
  8. Simple QA Workflow
  9. Analysis
       • Split into sentences
           – Buffer tokens
           – see com.tamingtext.texttamer.solr.SentenceTokenizer
       • Identify Names using OpenNLP
       • Add Entity marker tokens at the same position as original token
           – Could also be done with Payloads
       • Index
       • https://github.com/tamingtext/book/tree/master/src/main/java/com/tamingtext/texttamer/solr
       • https://github.com/tamingtext/book/blob/master/apache-solr/solrqa/conf/schema.xml
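The "marker token at the same position" idea can be sketched without Lucene. In Lucene, a token with a position increment of 0 is stacked on the previous token's position, which is what lets a positional query match the name and its entity marker in the same place. The sketch below assumes name detection has already happened (a boolean array stands in for OpenNLP), and the class name and the marker string `NE_MARKER` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of stacking an entity marker on top of a name token (same position). */
final class EntityMarkerSketch {
    /** A token plus its position increment relative to the previous token. */
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Emits each word; for words flagged as names, also emits a marker with increment 0. */
    static List<Token> mark(String[] words, boolean[] isName, String marker) {
        List<Token> out = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            out.add(new Token(words[i], 1));
            if (isName[i]) {
                out.add(new Token(marker, 0)); // increment 0: same position as the word
            }
        }
        return out;
    }

    /** Resolves increments to absolute positions, starting at 0. */
    static int[] positions(List<Token> tokens) {
        int[] pos = new int[tokens.size()];
        int current = -1;
        for (int i = 0; i < pos.length; i++) {
            current += tokens.get(i).positionIncrement;
            pos[i] = current;
        }
        return pos;
    }

    public static void main(String[] args) {
        String[] words = {"Louis", "ruled", "France"};
        boolean[] isName = {true, false, true}; // stand-in for OpenNLP name finding
        List<Token> tokens = mark(words, isName, "NE_MARKER");
        int[] pos = positions(tokens);
        for (int i = 0; i < pos.length; i++) {
            System.out.println(pos[i] + " " + tokens.get(i).term);
        }
    }
}
```

A real analysis chain would likely use a distinct marker per entity type (person, location, and so on) so the query side can ask for the expected answer type.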
  10. Search Side
       • Custom Query Parser takes in user’s natural language query, classifies it to find the Answer Type and generates Solr query
       • Retrieve candidate passages that match keywords and expected answer type
       • Unlike keyword search, we need to know exactly where matches occur
       • https://github.com/tamingtext/book/tree/master/src/main/java/com/tamingtext/qa
  11. Answer Type Classification
       • Answer Type examples:
           – Person (P), Location (L), Organization (O), Time Point (T), Duration (R), Money (M)
           – See page 248 for more
       • Train an OpenNLP classifier off of a set of previously annotated questions, e.g.:
           – P Which French monarch reinstated the divine right of the monarchy to France and was known as `The Sun King' because of the splendour of his reign?
  12. “Trust me, I’m a mathematician”
  13. Classification
  14. kNN and TF/IDF Classification w/ Lucene
       • https://github.com/tamingtext/book/tree/master/src/main/java/com/tamingtext/classifier/mlt
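As a rough illustration of kNN classification over TF/IDF weights, here is a small in-memory sketch. It is not the book's code and does not use Lucene: where the linked classifier relies on the index to find similar documents, this version scores every training document with cosine similarity. The class name, the tokenization, and the idf formula are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory kNN text classifier over TF-IDF vectors; a sketch, not the book's code. */
final class KnnTfIdfSketch {
    private final List<Map<String, Integer>> docs = new ArrayList<>(); // term counts per doc
    private final List<String> labels = new ArrayList<>();
    private final Map<String, Integer> docFreq = new HashMap<>();

    private static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) counts.merge(term, 1, Integer::sum);
        }
        return counts;
    }

    void train(String label, String text) {
        Map<String, Integer> counts = termCounts(text);
        docs.add(counts);
        labels.add(label);
        for (String term : counts.keySet()) docFreq.merge(term, 1, Integer::sum);
    }

    /** TF-IDF weights; terms never seen in training are skipped. */
    private Map<String, Double> weigh(Map<String, Integer> counts) {
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 0);
            if (df == 0) continue;
            double idf = Math.log(1.0 + (double) docs.size() / df);
            weights.put(e.getKey(), e.getValue() * idf);
        }
        return weights;
    }

    private static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        }
        for (double v : b.values()) normB += v * v;
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    /** Returns the label with the largest summed similarity among the k nearest docs. */
    String classify(String text, int k) {
        Map<String, Double> query = weigh(termCounts(text));
        List<double[]> scored = new ArrayList<>(); // each entry: {similarity, doc index}
        for (int i = 0; i < docs.size(); i++) {
            scored.add(new double[] {cosine(query, weigh(docs.get(i))), i});
        }
        scored.sort(Comparator.comparingDouble((double[] s) -> -s[0]));
        Map<String, Double> votes = new HashMap<>();
        for (int i = 0; i < Math.min(k, scored.size()); i++) {
            votes.merge(labels.get((int) scored.get(i)[1]), scored.get(i)[0], Double::sum);
        }
        String best = null;
        for (Map.Entry<String, Double> e : votes.entrySet()) {
            if (best == null || e.getValue() > votes.get(best)) best = e.getKey();
        }
        return best;
    }
}
```

With an index in place, the linear scan in `classify` is the part a search engine replaces: the query becomes the document to classify and the top k hits supply the votes.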
  15. Lucene Classification Module
       • Builds classifier off of index information
       • See the org.apache.lucene.classification package
       • Naïve Bayes Classifier
       • kNN Classifier
       • Perceptron Classifier
  16. Recommenders
       • Cross recommendation as search
           – with search used to build cross recommendation!
       • Recommend content to people who exhibit certain behaviors (clicks, query terms, other)
       • (Ab)use of a search engine
           – but not as a search engine for content
           – more like a search engine for behavior
       • See Ted Dunning’s talk from Berlin Buzzwords on Multi-modal Recommendation Algorithms
           – http://berlinbuzzwords.com/sessions/multi-modal-recommendation-algorithms
       • Go get Mahout/Myrrix or just do it in y(our) search engine
  17. Recommendation Basics
       • History:
           User  Thing
           1     3
           2     4
           3     4
           2     3
           3     2
           1     1
           2     1
  18. Recommendation Basics
       • History as matrix:
                 t1  t2  t3  t4
           u1    1   0   1   0
           u2    1   0   1   1
           u3    0   1   0   1
       • t1+t3 cooccur 2 times, t1+t4 once, t2+t4 once
  19. Recommendation Basics
       • Cooccurrence
                 t1  t2  t3  t4
           t1    2   0   2   1
           t2    0   1   0   1
           t3    2   0   2   1
           t4    1   1   1   2
       • [2×2 table: t1 / not t1 against t3 / not t3; cell values not recoverable from the transcript]
       • More details at http://lucenerevolution.org/2013/Crowd-sourced-intelligence-builtinto-Search-over-Hadoop
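The three Recommendation Basics slides amount to one small computation: turn the (user, thing) history into a 0/1 matrix, then count for each pair of things how many users touched both. A minimal sketch using the seven history pairs from the slides follows; the class and method names are illustrative only.

```java
import java.util.Arrays;

/** Builds the user-by-thing history matrix and the thing-by-thing cooccurrence counts. */
final class CooccurrenceSketch {
    /** history rows are {user, thing} pairs, both 1-based as on the slide. */
    static int[][] historyMatrix(int[][] history, int users, int things) {
        int[][] m = new int[users][things];
        for (int[] pair : history) {
            m[pair[0] - 1][pair[1] - 1] = 1;
        }
        return m;
    }

    /** c[i][j] = number of users who interacted with both thing i and thing j. */
    static int[][] cooccurrence(int[][] m) {
        int things = m[0].length;
        int[][] c = new int[things][things];
        for (int[] userRow : m) {
            for (int i = 0; i < things; i++) {
                for (int j = 0; j < things; j++) {
                    c[i][j] += userRow[i] * userRow[j];
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        int[][] history = {{1, 3}, {2, 4}, {3, 4}, {2, 3}, {3, 2}, {1, 1}, {2, 1}};
        int[][] c = cooccurrence(historyMatrix(history, 3, 4));
        for (int[] row : c) {
            System.out.println(Arrays.toString(row));
        }
        // prints:
        // [2, 0, 2, 1]
        // [0, 1, 0, 1]
        // [2, 0, 2, 1]
        // [1, 1, 1, 2]
    }
}
```

The diagonal holds how many users touched each thing at all; the off-diagonal entries are the cooccurrence counts quoted on the slide (t1+t3 twice, t1+t4 once, t2+t4 once).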
  20. “I wish I had thought of that”
  21. Time Space Continuum
       • Leverage Solr’s new spatial capabilities to index non-spatial data, such as time ranges
           – Useful for Open Hours, Shifts, etc.
       • Key: multi-valued range data
       • Query using rectangle intersections
           – q = shift:"Intersects(0 19 23 365)"
       • Credits to David Smiley and Hoss… https://people.apache.org/~hossman/spatial-for-non-spatial-meetup-20130117/
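One way to read the rectangle query, assuming each time range is indexed as a 2D point with x = start and y = end (an interpretation of the slide, not something it spells out): a range overlaps the query window when its start is no later than the window's end and its end is no earlier than the window's start, and that is exactly a point-in-rectangle test. The sketch below is plain Java rather than Solr code, and the names and sample shifts are hypothetical.

```java
/** Time ranges as 2D points: x = start, y = end. A sketch of the trick, not Solr code. */
final class RangeAsPointSketch {
    /** Axis-aligned rectangle, arguments ordered as in Intersects(minX minY maxX maxY). */
    static final class Rect {
        final int minX, minY, maxX, maxY;

        Rect(int minX, int minY, int maxX, int maxY) {
            this.minX = minX;
            this.minY = minY;
            this.maxX = maxX;
            this.maxY = maxY;
        }

        boolean contains(int x, int y) {
            return x >= minX && x <= maxX && y >= minY && y <= maxY;
        }
    }

    /** Rectangle matching every range that overlaps [queryStart, queryEnd] within [min, max]. */
    static Rect overlapping(int queryStart, int queryEnd, int min, int max) {
        // start <= queryEnd (x side) and end >= queryStart (y side)
        return new Rect(min, queryStart, queryEnd, max);
    }

    /** True if any of the document's ranges, each stored as {start, end}, is inside the rectangle. */
    static boolean matches(int[][] ranges, Rect query) {
        for (int[] range : ranges) {
            if (query.contains(range[0], range[1])) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Same numbers as the slide's query: window [19, 23] on a 0..365 axis.
        Rect query = overlapping(19, 23, 0, 365);
        int[][] shifts = {{8, 16}, {20, 30}}; // multi-valued: two ranges on one document
        System.out.println(matches(shifts, query));
    }
}
```

Under this reading, `overlapping(19, 23, 0, 365)` produces the same four numbers as the slide's `Intersects(0 19 23 365)`. Because the field is multi-valued, a document matches as soon as one of its ranges falls inside the rectangle.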
  22. Finance Example
       [Figure: % change over Time for AAPL, IBM and MSFT]
  23. Resources
       • http://www.manning.com/ingersoll
           – http://github.com/tamingtext/book
       • http://www.tamingtext.com
       • Me:
           – @gsingers
           – grant@lucidworks.com
