Hacking Lucene and Solr for Fun and Profit
Published in: Technology
Transcript

  • 1. HACKING LUCENE AND SOLR FOR FUN AND PROFIT Grant Ingersoll CTO, LucidWorks, grant@lucidworks.com, @gsingers
  • 2. Keyword Search is so yesterday • Search is a system building block – text is only a part of the story • If the algorithms fit, use them! • Embrace fuzziness! • Scoring features are everywhere
  • 3. Lucene and Solr can do… • Classic: Fast, fuzzy text matching across a large document collection • Data Quality and Analysis – Faceting, slicing and dicing of numerical/enumerated data – Spatial – Spell checking, record linkage, highlighting – Stats, Missing fields, etc. • Top N problems
  • 4. Topics • Search Hacks • “Trust me, I’m a mathematician” • “I wish I had thought of that” Hack
  • 5. Search Hacks
  • 6. Learn IR • SimpleTextCodec Example:
        conf.setCodec(new SimpleTextCodec());
        File simpleText = new File("simpletext");
        directory = new SimpleFSDirectory(simpleText);
        writer = new IndexWriter(directory, conf);
        index(writer);
    • Similarity:
        BM25Similarity bm25Similarity = new BM25Similarity();
        conf.setSimilarity(bm25Similarity);
    • http://www.ibm.com/developerworks/java/library/j-solr-lucene/index.html
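To make the BM25Similarity switch on this slide less of a black box, here is a minimal plain-Java sketch of the textbook BM25 term score. It does not call Lucene; the class name `Bm25Sketch`, the method names, and the sample numbers are illustrative assumptions, and `k1 = 1.2`, `b = 0.75` are the commonly quoted defaults rather than values taken from the slides.

```java
class Bm25Sketch {
    // Commonly used defaults; assumed here, not taken from the slides.
    static final double K1 = 1.2;
    static final double B = 0.75;

    /** Inverse document frequency: terms found in fewer documents weigh more. */
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** BM25 contribution of a single query term to a single document. */
    static double score(double termFreq, double docLen, double avgDocLen,
                        long docCount, long docFreq) {
        // Documents longer than average are penalised, shorter ones boosted.
        double lengthNorm = 1.0 - B + B * (docLen / avgDocLen);
        return idf(docCount, docFreq)
                * (termFreq * (K1 + 1.0)) / (termFreq + K1 * lengthNorm);
    }

    public static void main(String[] args) {
        // Same term frequency; the second document is four times longer.
        System.out.println(score(3, 100, 100, 1000, 10));
        System.out.println(score(3, 400, 100, 1000, 10));
    }
}
```

Two properties worth noticing when comparing against classic TF/IDF: the score saturates as term frequency grows (it can never exceed `idf * (K1 + 1)`), and the length penalty is tunable through `B`.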
  • 7. http://localhost:8983/solr/answer?q=what+is+trimethylbenzene&defType=qa&qa=true&qa.qf=body
  • 8. Simple QA Workflow
  • 9. Analysis • Split into sentences – Buffer tokens – see com.tamingtext.texttamer.solr.SentenceTokenizer • Identify Names using OpenNLP • Add Entity marker tokens at the same position as original token – Could also be done with Payloads • Index • https://github.com/tamingtext/book/tree/master/src/main/java/com/tamingtext/texttamer/solr • https://github.com/tamingtext/book/blob/master/apache-solr/solrqa/conf/schema.xml
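The "marker token at the same position" idea can be sketched without any Lucene or OpenNLP classes. The example below is only an illustration: the `Token` class, the `NE_` marker prefix, and the hand-written entity map are hypothetical stand-ins for what a real token filter and name finder would produce.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class EntityMarkerSketch {
    /** A term together with its position in the sentence. */
    static final class Token {
        final String term;
        final int position;
        Token(String term, int position) {
            this.term = term;
            this.position = position;
        }
    }

    /**
     * Emits every original token and, for tokens flagged as part of a name,
     * an extra marker token at the SAME position.
     * entityTypes maps token position to an entity type such as "person".
     */
    static List<Token> addMarkers(List<String> terms, Map<Integer, String> entityTypes) {
        List<Token> out = new ArrayList<>();
        for (int pos = 0; pos < terms.size(); pos++) {
            out.add(new Token(terms.get(pos), pos));
            String type = entityTypes.get(pos);
            if (type != null) {
                // Hypothetical marker format; shares the original token's position.
                out.add(new Token("NE_" + type, pos));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = List.of("Louis", "XIV", "ruled", "France");
        // Stand-in for name-finder output: position -> entity type.
        Map<Integer, String> entities = Map.of(0, "person", 1, "person", 3, "location");
        for (Token t : addMarkers(terms, entities)) {
            System.out.println(t.position + "\t" + t.term);
        }
    }
}
```

Because the marker shares a position with the word it annotates, a positional query can ask for "a person name near these keywords" in the same way it would ask for a phrase.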
  • 10. Search Side • Custom Query Parser takes in user’s natural language query, classifies it to find the Answer Type and generates Solr query • Retrieve candidate passages that match keywords and expected answer type • Unlike keyword search, we need to know exactly where matches occur • https://github.com/tamingtext/book/tree/master/src/main/java/com/tamingtext/qa
  • 11. Answer Type Classification • Answer Type examples: – Person (P), Location (L), Organization (O), Time Point (T), Duration (R), Money (M) – See page 248 for more • Train an OpenNLP classifier off of a set of previously annotated questions, e.g.: – P Which French monarch reinstated the divine right of the monarchy to France and was known as `The Sun King' because of the splendour of his reign?
  • 12. “Trust me, I’m a mathematician”
  • 13. Classification
  • 14. kNN and TF/IDF Classification w/ Lucene https://github.com/tamingtext/book/tree/master/src/main/java/com/tamingtext/classifier/mlt
  • 15. Lucene Classification Module • Builds classifier off of index information • See the org.apache.lucene.classification package • Naïve Bayes Classifier • kNN Classifier • Perceptron Classifier
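As a rough illustration of kNN classification over TF/IDF weights, here is a self-contained plain-Java sketch. It is not the tamingtext MoreLikeThis code and not the org.apache.lucene.classification API; the class name `KnnTfIdfSketch`, the smoothed IDF formula, the similarity-weighted vote, and the toy training data are all assumptions made for the example. In a search engine, the index supplies the term statistics and the nearest neighbours are simply the top k hits.

```java
import java.util.*;

class KnnTfIdfSketch {
    private final List<Map<String, Integer>> docs = new ArrayList<>(); // term counts per training doc
    private final List<String> labels = new ArrayList<>();             // label per training doc
    private final Map<String, Integer> docFreq = new HashMap<>();      // docs containing each term

    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : text.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) counts.merge(t, 1, Integer::sum);
        }
        return counts;
    }

    void train(String label, String text) {
        Map<String, Integer> counts = termCounts(text);
        docs.add(counts);
        labels.add(label);
        for (String term : counts.keySet()) docFreq.merge(term, 1, Integer::sum);
    }

    /** Turns raw counts into TF*IDF weights (one simple smoothed IDF variant). */
    private Map<String, Double> weigh(Map<String, Integer> counts) {
        Map<String, Double> vec = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 0);
            double idf = Math.log(1.0 + (double) docs.size() / (1 + df));
            vec.put(e.getKey(), e.getValue() * idf);
        }
        return vec;
    }

    private static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            na += e.getValue() * e.getValue();
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        }
        for (double v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }

    /** Label with the largest summed similarity among the k nearest training docs. */
    String classify(String text, int k) {
        Map<String, Double> query = weigh(termCounts(text));
        Integer[] order = new Integer[docs.size()];
        double[] sims = new double[docs.size()];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
            sims[i] = cosine(query, weigh(docs.get(i)));
        }
        Arrays.sort(order, (x, y) -> Double.compare(sims[y], sims[x])); // most similar first
        Map<String, Double> votes = new HashMap<>();
        for (int i = 0; i < Math.min(k, order.length); i++) {
            votes.merge(labels.get(order[i]), sims[order[i]], Double::sum);
        }
        return votes.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null);
    }

    public static void main(String[] args) {
        KnnTfIdfSketch knn = new KnnTfIdfSketch();
        knn.train("sports", "the team won the football match");
        knn.train("sports", "the striker scored a goal in the match");
        knn.train("finance", "the stock market fell as shares dropped");
        knn.train("finance", "investors sold shares and bonds");
        System.out.println(knn.classify("shares and stock prices", 3));
    }
}
```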
  • 16. Recommenders • Cross recommendation as search – with search used to build cross recommendation! • Recommend content to people who exhibit certain behaviors (clicks, query terms, other) • (Ab)use of a search engine – but not as a search engine for content – more like a search engine for behavior • See Ted Dunning’s talk from Berlin Buzzwords on Multi-modal Recommendation Algorithms – http://berlinbuzzwords.com/sessions/multi-modal-recommendation-algorithms • Go get Mahout/Myrrix or just do it in y(our) search engine
  • 17. Recommendation Basics • History:
        User  Thing
        1     3
        2     4
        3     4
        2     3
        3     2
        1     1
        2     1
  • 18. Recommendation Basics • History as matrix:
             t1  t2  t3  t4
        u1    1   0   1   0
        u2    1   0   1   1
        u3    0   1   0   1
    • t1+t3 cooccur 2 times, t1+t4 once, t2+t4 once
  • 19. Recommendation Basics • Cooccurrence:
             t1  t2  t3  t4
        t1    2   0   2   1
        t2    0   1   0   1
        t3    2   0   2   1
        t4    1   1   1   2

                 t3  not t3
        t1        1       1
        not t1    1       1
    • More details at http://lucenerevolution.org/2013/Crowd-sourced-intelligence-builtinto-Search-over-Hadoop
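The step from the user/thing history to the cooccurrence counts can be sketched in a few lines of plain Java. The history pairs are the ones from slide 17; the class and method names are illustrative assumptions. The sketch builds the binary history matrix A and then computes AᵀA, whose entry (i, j) is the number of users who interacted with both thing i and thing j.

```java
import java.util.Arrays;

class CooccurrenceSketch {
    /**
     * history[i] = {user, thing}, both 1-based ids.
     * Returns a things-by-things matrix of cooccurrence counts.
     */
    static int[][] cooccurrence(int[][] history, int numUsers, int numThings) {
        // Binary user-by-thing history matrix A.
        int[][] a = new int[numUsers][numThings];
        for (int[] pair : history) {
            a[pair[0] - 1][pair[1] - 1] = 1;
        }
        // C = A^T * A: each user adds 1 to every pair of things they touched.
        int[][] c = new int[numThings][numThings];
        for (int[] row : a) {
            for (int i = 0; i < numThings; i++) {
                for (int j = 0; j < numThings; j++) {
                    c[i][j] += row[i] * row[j];
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        // (user, thing) pairs from the History slide.
        int[][] history = {{1, 3}, {2, 4}, {3, 4}, {2, 3}, {3, 2}, {1, 1}, {2, 1}};
        for (int[] row : cooccurrence(history, 3, 4)) {
            System.out.println(Arrays.toString(row));
        }
        // prints:
        // [2, 0, 2, 1]
        // [0, 1, 0, 1]
        // [2, 0, 2, 1]
        // [1, 1, 1, 2]
    }
}
```

The diagonal holds how many users touched each thing on its own; the off-diagonal entries are the raw counts that a recommender would then filter for significance before indexing them as "indicator" terms.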
  • 20. “I wish I had thought of that”
  • 21. Time Space Continuum • Leverage Solr’s new spatial capabilities to index non-spatial data, such as time ranges – Useful for Open Hours, Shifts, etc. • Key: multi-valued range data • Query using rectangle intersections – q = shift:"Intersects(0 19 23 365)" • Credits to David Smiley and Hoss… https://people.apache.org/~hossman/spatial-for-non-spatial-meetup-20130117/
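One way to read the rectangle query, sketched in plain Java rather than Solr: assume each range is indexed as a 2D point with x = start and y = end, and that the four numbers in `Intersects(0 19 23 365)` are minX, minY, maxX, maxY. Under those assumptions the rectangle matches every range that starts at or before 23 and ends at or after 19, which is exactly the set of ranges overlapping [19, 23]. The class name, method names, and sample shifts are hypothetical, and the units are whatever the field was indexed in.

```java
class ShiftRangeSketch {
    /** True if the point (x, y) lies inside the rectangle [minX, maxX] x [minY, maxY]. */
    static boolean inRectangle(double x, double y,
                               double minX, double minY, double maxX, double maxY) {
        return x >= minX && x <= maxX && y >= minY && y <= maxY;
    }

    /**
     * Does the stored range [start, end] overlap the query range [qStart, qEnd]?
     * The range is treated as the point (start, end); axisMax is the largest
     * value the axis can take (365 in the slide's query).
     */
    static boolean overlaps(double start, double end,
                            double qStart, double qEnd, double axisMax) {
        // Equivalent to: start <= qEnd && end >= qStart.
        return inRectangle(start, end, 0, qStart, qEnd, axisMax);
    }

    public static void main(String[] args) {
        // Query range [19, 23], mirroring Intersects(0 19 23 365).
        System.out.println(overlaps(8, 17, 19, 23, 365));  // ends before the query starts
        System.out.println(overlaps(16, 24, 19, 23, 365)); // spans the query range
        System.out.println(overlaps(24, 30, 19, 23, 365)); // starts after the query ends
    }
}
```

Because a document can carry several such points in one multi-valued field, a single rectangle query answers "is any of this document's ranges open during the requested window".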
  • 22. Finance Example [Chart: % change over Time for AAPL, IBM, MSFT]
  • 23. Resources • http://www.manning.com/ingersoll – http://github.com/tamingtext/book • http://www.tamingtext.com • Me: – @gsingers – grant@lucidworks.com
