Hacking Lucene and Solr for Fun and Profit
 

    Presentation Transcript

    • HACKING LUCENE AND SOLR FOR FUN AND PROFIT Grant Ingersoll CTO, LucidWorks, grant@lucidworks.com, @gsingers
    • Keyword Search is so yesterday
        • Search is a system building block
            – text is only a part of the story
        • If the algorithms fit, use them!
        • Embrace fuzziness!
        • Scoring features are everywhere
    • Lucene and Solr can do…
        • Classic: Fast, fuzzy text matching across a large document collection
        • Data Quality and Analysis
            – Faceting, slicing and dicing of numerical/enumerated data
            – Spatial
            – Spell checking, record linkage, highlighting
            – Stats, Missing fields, etc.
        • Top N problems
    • Topics
        • Search Hacks
        • “Trust me, I’m a mathematician”
        • “I wish I had thought of that” Hack
    • Search Hacks
    • Learn IR
        • SimpleTextCodec example:
            conf.setCodec(new SimpleTextCodec());
            File simpleText = new File("simpletext");
            directory = new SimpleFSDirectory(simpleText);
            writer = new IndexWriter(directory, conf);
            index(writer);
        • Similarity:
            BM25Similarity bm25Similarity = new BM25Similarity();
            conf.setSimilarity(bm25Similarity);
        • http://www.ibm.com/developerworks/java/library/j-solr-lucene/index.html
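The slide swaps in BM25Similarity without showing what it computes. As a learning aid, here is a minimal sketch of the textbook BM25 score for a single query term, written in plain Java with no Lucene dependency. The class and method names are hypothetical. The defaults k1 = 1.2 and b = 0.75 match those of Lucene's BM25Similarity, but Lucene's actual implementation differs in details (for example, it stores document lengths in a lossy encoded form), so treat this as an illustration of the formula rather than a reproduction of Lucene's scoring.

```java
/** Hypothetical helper: textbook BM25 for one query term in one document. */
final class Bm25Sketch {
    // Term-frequency saturation and length-normalization parameters.
    static final double K1 = 1.2;
    static final double B = 0.75;

    /** Inverse document frequency in the non-negative form ln(1 + (N - n + 0.5) / (n + 0.5)). */
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /**
     * Score contribution of one term.
     *
     * @param termFreq     occurrences of the term in the document
     * @param docLength    length of the document in tokens
     * @param avgDocLength average document length in the collection
     * @param docCount     number of documents in the collection (N)
     * @param docFreq      number of documents containing the term (n)
     */
    static double score(double termFreq, double docLength, double avgDocLength,
                        long docCount, long docFreq) {
        // Documents longer than average are penalized, shorter ones boosted.
        double lengthNorm = 1.0 - B + B * (docLength / avgDocLength);
        // Term frequency saturates: each extra occurrence adds less than the last.
        double tfPart = (termFreq * (K1 + 1.0)) / (termFreq + K1 * lengthNorm);
        return idf(docCount, docFreq) * tfPart;
    }
}
```

For a document of exactly average length that contains the term once, the term-frequency part is 1 and the score reduces to the idf, which makes the formula easy to sanity-check by hand.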
    • http://localhost:8983/solr/answer?q=what+is+trimethylbenzene&defType=qa&qa=true&qa.qf=body
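The request above can be assembled programmatically. The sketch below builds the same URL from its parts using only the Java standard library; the class and method names are hypothetical, and the parameter names (`q`, `defType`, `qa`, `qa.qf`) are taken directly from the slide's URL rather than from any documented Solr API.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper that rebuilds the QA request URL shown on the slide. */
final class QaRequestSketch {

    static String buildUrl(String baseUrl, String question, String queryField) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", question);       // the user's natural-language question
        params.put("defType", "qa");     // select the custom QA query parser
        params.put("qa", "true");
        params.put("qa.qf", queryField); // field to search for candidate passages

        String query = params.entrySet().stream()
                .map(e -> encode(e.getKey()) + "=" + encode(e.getValue()))
                .collect(Collectors.joining("&"));
        return baseUrl + "?" + query;
    }

    private static String encode(String value) {
        // Form encoding: spaces become '+', matching the URL on the slide.
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }
}
```

Calling `QaRequestSketch.buildUrl("http://localhost:8983/solr/answer", "what is trimethylbenzene", "body")` yields the URL from the slide.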
    • Simple QA Workflow
    • Analysis
        • Split into sentences
            – Buffer tokens – see com.tamingtext.texttamer.solr.SentenceTokenizer
        • Identify names using OpenNLP
        • Add entity marker tokens at the same position as the original token
            – Could also be done with Payloads
        • Index
        • https://github.com/tamingtext/book/tree/master/src/main/java/com/tamingtext/texttamer/solr
        • https://github.com/tamingtext/book/blob/master/apache-solr/solrqa/conf/schema.xml
    • Search Side
        • Custom Query Parser takes in the user’s natural language query, classifies it to find the Answer Type, and generates a Solr query
        • Retrieve candidate passages that match keywords and expected answer type
        • Unlike keyword search, we need to know exactly where matches occur
        • https://github.com/tamingtext/book/tree/master/src/main/java/com/tamingtext/qa
    • Answer Type Classification
        • Answer Type examples:
            – Person (P), Location (L), Organization (O), Time Point (T), Duration (R), Money (M)
            – See page 248 for more
        • Train an OpenNLP classifier on a set of previously annotated questions, e.g.:
            – P Which French monarch reinstated the divine right of the monarchy to France and was known as 'The Sun King' because of the splendour of his reign?
    • “Trust me, I’m a mathematician”
    • Classification
    • kNN and TF/IDF Classification w/ Lucene https://github.com/tamingtext/book/tree/master/src/main/java/com/tamingtext/classifier/mlt
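To make the kNN idea concrete, here is a small in-memory sketch: each labeled document becomes a tf-idf vector, the query is compared to every document by cosine similarity, and the k most similar documents vote for their category, weighted by similarity. This is an illustration only. It does not use Lucene; the linked package name ("mlt") suggests the book's version relies on Lucene's MoreLikeThis to find neighbors, which this sketch replaces with a brute-force scan. All class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Hypothetical brute-force kNN classifier over tf-idf vectors. */
final class KnnTfIdfSketch {

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** Raw term frequency times a smoothed idf, ln((N + 1) / (df + 1)) + 1. */
    static Map<String, Double> tfIdf(List<String> tokens, Map<String, Integer> docFreq, int docCount) {
        Map<String, Double> vec = new HashMap<>();
        for (String t : tokens) vec.merge(t, 1.0, Double::sum);
        for (Map.Entry<String, Double> e : vec.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 0);
            e.setValue(e.getValue() * (Math.log((docCount + 1.0) / (df + 1.0)) + 1.0));
        }
        return vec;
    }

    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        for (double v : b.values()) normB += v * v;
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    /** labels.get(i) is the category of texts.get(i); returns the winning category. */
    static String classify(List<String> labels, List<String> texts, String query, int k) {
        Map<String, Integer> docFreq = new HashMap<>();
        List<List<String>> tokenized = new ArrayList<>();
        for (String text : texts) {
            List<String> tokens = tokenize(text);
            tokenized.add(tokens);
            for (String t : new HashSet<>(tokens)) docFreq.merge(t, 1, Integer::sum);
        }
        Map<String, Double> queryVec = tfIdf(tokenize(query), docFreq, texts.size());
        double[] sims = new double[texts.size()];
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < texts.size(); i++) {
            sims[i] = cosine(queryVec, tfIdf(tokenized.get(i), docFreq, texts.size()));
            order.add(i);
        }
        order.sort((x, y) -> Double.compare(sims[y], sims[x])); // most similar first
        Map<String, Double> votes = new HashMap<>();
        for (int i = 0; i < Math.min(k, order.size()); i++) {
            int doc = order.get(i);
            votes.merge(labels.get(doc), sims[doc], Double::sum); // similarity-weighted vote
        }
        return votes.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null);
    }
}
```

In a real deployment the index does the heavy lifting: the search engine returns the top k similar documents directly, so only the voting step remains in application code.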
    • Lucene Classification Module
        • Builds classifier off of index information
        • See the org.apache.lucene.classification package
        • Naïve Bayes Classifier
        • kNN Classifier
        • Perceptron Classifier
    • Recommenders
        • Cross recommendation as search
            – with search used to build cross recommendation!
        • Recommend content to people who exhibit certain behaviors (clicks, query terms, other)
        • (Ab)use of a search engine
            – but not as a search engine for content
            – more like a search engine for behavior
        • See Ted Dunning’s talk from Berlin Buzzwords on Multi-modal Recommendation Algorithms
            – http://berlinbuzzwords.com/sessions/multi-modal-recommendation-algorithms
        • Go get Mahout/Myrrix or just do it in y(our) search engine
    • Recommendation Basics
        • History:
            User  Thing
            1     3
            2     4
            3     4
            2     3
            3     2
            1     1
            2     1
    • Recommendation Basics
        • History as matrix:
                  t1  t2  t3  t4
            u1    1   0   1   0
            u2    1   0   1   1
            u3    0   1   0   1
        • t1+t3 cooccur 2 times, t1+t4 once, t2+t4 once
    • Recommendation Basics
        • Cooccurrence:
                  t1  t2  t3  t4
            t1    2   0   2   1
            t2    0   1   0   1
            t3    2   0   2   1
            t4    1   1   1   2
        • [Table: 2×2 contingency counts for t1 / not t1 against t3 / not t3]
        • More details at http://lucenerevolution.org/2013/Crowd-sourced-intelligence-builtinto-Search-over-Hadoop
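The three Recommendation Basics slides amount to a small computation: turn the (user, thing) history into a binary user-by-thing matrix A, then count, for every pair of things, how many users touched both (that is, compute AᵀA). A minimal sketch in plain Java follows; the class and method names are hypothetical, and the input is the history from the slides.

```java
/** Hypothetical helper: item-item cooccurrence counts from a (user, thing) history. */
final class CooccurrenceSketch {

    /**
     * @param history    rows of {user, thing}, both 1-based as on the slides
     * @param userCount  number of distinct users
     * @param thingCount number of distinct things
     * @return counts[i][j] = number of users who interacted with both thing i+1 and thing j+1
     */
    static int[][] cooccurrence(int[][] history, int userCount, int thingCount) {
        // Binary user-by-thing matrix A ("History as matrix").
        boolean[][] seen = new boolean[userCount][thingCount];
        for (int[] pair : history) {
            seen[pair[0] - 1][pair[1] - 1] = true;
        }
        // A^T A: for each user, every pair of things they touched cooccurs once.
        int[][] counts = new int[thingCount][thingCount];
        for (boolean[] userRow : seen) {
            for (int i = 0; i < thingCount; i++) {
                if (!userRow[i]) continue;
                for (int j = 0; j < thingCount; j++) {
                    if (userRow[j]) counts[i][j]++;
                }
            }
        }
        return counts;
    }
}
```

With the history from the slides, `{1,3},{2,4},{3,4},{2,3},{3,2},{1,1},{2,1}`, three users and four things, the result has t1 and t3 cooccurring twice and t1 with t4 and t2 with t4 once each, as stated on the slide. The diagonal holds each thing's own user count.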
    • “I wish I had thought of that”
    • Time Space Continuum
        • Leverage Solr’s new spatial capabilities to index non-spatial data, such as time ranges
            – Useful for Open Hours, Shifts, etc.
        • Key: multi-valued range data
        • Query using rectangle intersections
            – q = shift:"Intersects(0 19 23 365)"
        • Credits to David Smiley and Hoss… https://people.apache.org/~hossman/spatial-for-non-spatial-meetup-20130117/
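One way to read the rectangle query, sketched here under stated assumptions: each range is indexed as a 2D point with x = start and y = end, and the rectangle arguments are ordered minX, minY, maxX, maxY. A range overlaps the query window [queryStart, queryEnd] exactly when it starts on or before queryEnd and ends on or after queryStart, which is the rectangle x in [min, queryEnd], y in [queryStart, max]. The plain-Java sketch below checks that logic without Solr. The class and method names are hypothetical, and treating the numbers as days of a year (0 to 365) is an assumption based on the slide's example values.

```java
/** Hypothetical helper: range overlap expressed as a point-in-rectangle test. */
final class ShiftIntersectSketch {

    /** A range [start, end] is the point (x = start, y = end); test it against a rectangle. */
    static boolean matches(int start, int end, int minX, int minY, int maxX, int maxY) {
        return start >= minX && start <= maxX && end >= minY && end <= maxY;
    }

    /**
     * Rectangle that selects every range overlapping [queryStart, queryEnd],
     * returned as {minX, minY, maxX, maxY}.
     */
    static int[] overlapRectangle(int queryStart, int queryEnd, int minValue, int maxValue) {
        // start may be anywhere up to queryEnd; end may be anywhere from queryStart on.
        return new int[] {minValue, queryStart, queryEnd, maxValue};
    }

    /** Convenience wrapper combining the two steps above. */
    static boolean overlaps(int start, int end, int queryStart, int queryEnd,
                            int minValue, int maxValue) {
        int[] r = overlapRectangle(queryStart, queryEnd, minValue, maxValue);
        return matches(start, end, r[0], r[1], r[2], r[3]);
    }
}
```

Under this reading, `overlapRectangle(19, 23, 0, 365)` produces the four numbers in the slide's query, which would select every shift that touches days 19 through 23. Because each range is a single point, a document can hold many ranges in one multi-valued field.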
    • Finance Example [Chart: % change over Time for AAPL, IBM, and MSFT]
    • Resources
        • http://www.manning.com/ingersoll
            – http://github.com/tamingtext/book
        • http://www.tamingtext.com
        • Me:
            – @gsingers
            – grant@lucidworks.com