Webinar: OpenNLP and Solr for Superior Relevance


Lucidworks Senior Software Engineer and Solr Committer Steve Rowe explains how to increase relevance using Solr with Apache OpenNLP.

Published in: Technology


  1. 2016 OCTOBER 11-14
  2. OpenNLP and Solr for Superior Relevance
     Steve Rowe, @steven_a_rowe
     Sr. Software Engineer, Lucidworks
  3. About me
     • Previously worked at the Center for Natural Language Processing (CNLP) at Syracuse University
     • Sr. Software Engineer at Lucidworks
     • Committer on Apache Lucene/Solr project
     • Committer on JFlex scanner generator project
  4. Agenda
     • OpenNLP capabilities
     • OpenNLP / Solr integration
     • OpenNLP models: training and licensing
     • Part-of-speech: what is it good for (absolutely/RB something/NN)
     • Lemmatization versus stemming
     • Solr configuration and demonstration of lemmatization and named entity extraction
     • Future Work
  5. Apache OpenNLP capabilities
     • Machine learning: maximum entropy and perceptron based
     • Sentence segmentation
     • Tokenization
     • Part-of-speech (POS) tagging
     • Lemmatization
     • Named entity recognition (NER)
     • Phrase chunking
     • Parsing
     • Co-reference resolution
     • Document classification
  6. OpenNLP Solr integration
     • OpenNLP isn't integrated with Solr in any release (yet)
     • TDD (talk driven development)
     • LUCENE-2899: WIP patch, builds, works (demo later)
     • Currently implemented:
       • Sentence segmentation and tokenization
       • Part-of-speech (POS) tagging
       • Phrase chunking
       • Dictionary-based lemmatization
       • Named entity recognition (NER)
  7. OpenNLP models: training & licensing
     • Most OpenNLP phases can be trained, but each phase depends on the previous ones.
     • Publicly available models are based on data with non-free licenses.
     • You can train your own models, and you very likely want to for production use.
     • Example workflow:
       • Use an existing model to tag your training data
       • Modify the tagged data according to your needs
         • One way to do that: the brat rapid annotation tool (OpenNLP understands its output format)
       • Run OpenNLP command-line training tools to create a model
       • Run OpenNLP command-line evaluation tools to test model performance
       • Repeat until you get the quality you want
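
The "modify the tagged data" step of this workflow eventually has to produce text in a format the OpenNLP training tools accept. For the name finder, that is one tokenized sentence per line, with each entity wrapped in `<START:type> ... <END>` markers. The sketch below shows one way such lines could be generated from token lists and entity spans. The function name, the span representation, and the sample sentence are illustrative assumptions, not part of the talk or of OpenNLP.

```python
def to_opennlp_ner_line(tokens, spans):
    """Render one sentence in OpenNLP name finder training format.

    tokens: list of token strings for a single sentence.
    spans:  list of (start, end, entity_type) tuples, with end exclusive.
            (Hypothetical representation chosen for this sketch.)
    """
    starts = {start: etype for start, _end, etype in spans}
    ends = {end for _start, end, _etype in spans}
    out = []
    for i, tok in enumerate(tokens):
        if i in ends:
            out.append("<END>")          # close a span that ended before token i
        if i in starts:
            out.append(f"<START:{starts[i]}>")
        out.append(tok)
    if len(tokens) in ends:
        out.append("<END>")              # close a span that runs to the sentence end
    return " ".join(out)


tokens = ["Steve", "Rowe", "works", "at", "Lucidworks", "."]
spans = [(0, 2, "person"), (4, 5, "organization")]
line = to_opennlp_ner_line(tokens, spans)
print(line)
```

A file of such lines is the kind of input the command-line name finder trainer consumes; the same tagged data can be held out for the evaluation step.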
  8. LUCENE-2899
     • Created: 30/Jan/11 10:44 (over 5 years old)
     • Lance Norskog wrote the bulk of the implementation
     • I modernized Lance's patch and added support for dictionary-based lemmatization
  9. Lemmatization vs. Stemming
     • Both can be used with search to increase recall
     • Lemmas are real words: infinitive verbs, singular nouns
       • e.g. Speaking/VBG, spoke/VBD -> speak; stigmata/NNS -> stigma
       • Can be produced by algorithm and/or known-item dictionary
       • OpenNLP 1.6.1 will include a machine-learned lemmatization implementation
       • Caveat: part-of-speech tagging quality is poor over short query text
     • Stems are not (necessarily) real words
       • e.g. Speaking -> speak, spoke -> spoke, stigmata -> stigmata (Porter stemmer)
       • Produced via algorithm
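
The contrast on this slide can be sketched in a few lines. The code below is a toy illustration under stated assumptions: the lemma dictionary is a hand-written stand-in for a real lemma dictionary file, and `toy_stem` is a deliberately crude suffix stripper, not the Porter stemmer. It is only meant to show that a dictionary lookup keyed on word plus part-of-speech can map irregular forms to real words, while purely algorithmic suffix stripping cannot.

```python
# Toy lemma dictionary keyed on (lowercased word, Penn Treebank POS tag).
# A real setup would load these entries from a lemma dictionary file.
LEMMA_DICT = {
    ("speaking", "VBG"): "speak",
    ("spoke", "VBD"): "speak",
    ("stigmata", "NNS"): "stigma",
    ("saw", "VBD"): "see",   # verb reading of "saw"
    ("saw", "NN"): "saw",    # noun reading of "saw"
}


def lemmatize(word, pos):
    """Dictionary-based lemmatization: look up (word, POS), else keep the word."""
    key = (word.lower(), pos)
    return LEMMA_DICT.get(key, word.lower())


def toy_stem(word):
    """Crude suffix stripping for illustration only (NOT the Porter algorithm)."""
    w = word.lower()
    for suffix in ("ing", "ed", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


for word, pos in [("Speaking", "VBG"), ("spoke", "VBD"), ("stigmata", "NNS")]:
    print(word, "lemma:", lemmatize(word, pos), "stem:", toy_stem(word))
```

The two `saw` entries show why the part-of-speech tag matters: the same surface form maps to different lemmas, which is also why poor tagging over short queries hurts lemmatization quality.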
  10. Penn Treebank part of speech tags
       CC    Coordinating conjunction
       CD    Cardinal number
       DT    Determiner
       EX    Existential there
       FW    Foreign word
       IN    Preposition/subordinating conjunction
       JJ    Adjective
       JJR   Adjective, comparative
       JJS   Adjective, superlative
       LS    List item marker
       MD    Modal
       NN    Noun, singular or mass
       NNS   Noun, plural
       NNP   Proper noun, singular
       NNPS  Proper noun, plural
       PDT   Predeterminer
       POS   Possessive ending
       PRP   Personal pronoun
       PRP$  Possessive pronoun
       RB    Adverb
       RBR   Adverb, comparative
       RBS   Adverb, superlative
       RP    Particle
       SYM   Symbol
       TO    to
       UH    Interjection
       VB    Verb, base form
       VBD   Verb, past tense
       VBG   Verb, gerund or present participle
       VBN   Verb, past participle
       VBP   Verb, non-3rd person singular present
       VBZ   Verb, 3rd person singular present
       WDT   Wh-determiner
       WP    Wh-pronoun
       WP$   Possessive wh-pronoun
       WRB   Wh-adverb
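
The slides write tagged text as `word/TAG` (for example `absolutely/RB something/NN`). As a small sketch of how that notation can be consumed, the code below splits such a string into (word, tag) pairs and keeps only the nouns. The helper names and the noun-only filter are assumptions made for illustration; the talk does not prescribe them.

```python
# Penn Treebank noun tags, taken from the table on this slide.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def parse_tagged(text):
    """Split 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for item in text.split():
        # rpartition splits on the LAST slash, so words containing '/' survive.
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs


def keep_nouns(pairs):
    """Return only the words whose tag is one of the noun tags."""
    return [word for word, tag in pairs if tag in NOUN_TAGS]


pairs = parse_tagged("absolutely/RB something/NN")
print(pairs)
print(keep_nouns(pairs))
```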
  11. Solr configuration
     • Put jars on classpath
     • Add required resources to configset:
       • models
       • lemmatization dictionary
     • Add field type(s) using OpenNLP-based analysis components, then fields using these field types
  12. Put jars on the classpath
     • Add to configset's solrconfig.xml:
       <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" regex=".*\.jar"/>
       <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib" regex="opennlp-.*\.jar"/>
  13. Add required resources to configset
     • Download models from
     • Download lemma dictionary from conf/
  14. Add field type and fields
       curl -X POST http://localhost:8983/solr/opennlp/schema \
         -H 'Content-type: application/json' --data-binary '{
         "stored":true }
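
The JSON body of the request above did not survive in this transcript, so the following is only a sketch of what a Schema API request of this shape might look like, not the payload used in the talk. It assumes the analysis factory names and parameters under which the OpenNLP integration later shipped in Solr (`solr.OpenNLPTokenizerFactory`, `solr.OpenNLPPOSFilterFactory`, `solr.OpenNLPLemmatizerFilterFactory`); the patch demonstrated in 2016 may have used different names. The field type name, field name, and all model and dictionary file names are hypothetical placeholders.

```python
import json
import urllib.request

# Same endpoint as the curl command on the slide.
SCHEMA_URL = "http://localhost:8983/solr/opennlp/schema"


def build_schema_commands():
    """Build Schema API commands: one field type, then one field that uses it."""
    field_type = {
        "name": "text_opennlp_lemma",  # hypothetical field type name
        "class": "solr.TextField",
        "analyzer": {
            "tokenizer": {
                # Assumed factory and parameter names (see the lead-in).
                "class": "solr.OpenNLPTokenizerFactory",
                "sentenceModel": "en-sent.bin",    # placeholder file name
                "tokenizerModel": "en-token.bin",  # placeholder file name
            },
            "filters": [
                {
                    "class": "solr.OpenNLPPOSFilterFactory",
                    "posTaggerModel": "en-pos-maxent.bin",  # placeholder
                },
                {
                    "class": "solr.OpenNLPLemmatizerFilterFactory",
                    "dictionary": "en-lemmas.dict",  # placeholder
                },
            ],
        },
    }
    field = {
        "name": "text_lemma",  # hypothetical field name
        "type": "text_opennlp_lemma",
        "indexed": True,
        "stored": True,
    }
    return {"add-field-type": field_type, "add-field": field}


def build_request(commands, url=SCHEMA_URL):
    """Wrap the commands in a POST request without sending it."""
    data = json.dumps(commands).encode("utf-8")
    headers = {"Content-type": "application/json"}
    return urllib.request.Request(url, data=data, headers=headers, method="POST")


commands = build_schema_commands()
request = build_request(commands)
print(json.dumps(commands, indent=2))
# To apply this against a running Solr: urllib.request.urlopen(request)
```

The POS filter is placed before the lemmatizer because dictionary-based lemmatization looks up each token together with its part-of-speech tag.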
  15. Demo
     • (Switch to http://localhost:8983/solr here)
  16. Future Work
     • Make Solr update request processors for named entity recognition, and maybe for the phrase chunker.
     • Optimize memory usage to process only one sentence at a time.
     • Commit/release LUCENE-2899!
  17. Resources
     • Solr:
     • OpenNLP:
     • LUCENE-2899:
     • OpenNLP pre-trained models:
     • brat rapid annotation tool:
     • LanguageTool lemma dictionaries:
     • Company:
     • Our blog:
     • Twitter: @steven_a_rowe