Natural language search in Solr
Tommaso Teofili, Sourcesense (t.teofili@sourcesense.com), October 19th 2011
Agenda An approach to natural language search in  Solr Main points  •   Solr-UIMA integration module  •   Custom Lucene ...
My Background Software engineer at Sourcesense  • Enterprise search consultant Member of the Apache Software Foundation ...
Google in ‘99
Google today
Google today
The Challenge Improved recall/precision  • ‘articles about science’ (concepts)  • ‘movies by K. Spacey’ vs ‘movies with K...
Hurdles understanding documents’ text/user queries extract domain-specific/wide entities and  concepts index/search per...
Use Case   search engine for an online movies magazine   Solr based   non technical users   time / cost    • Solr 3.x ...
Online movies magazine
General approach Natural language processing Processing documents at indexing time  • document text analysis  • write en...
NLP AI discipline   • Computers understanding and managing     information written in human language analyze text at var...
Technical detail NLP algorithms plugged via Apache UIMA Indexing time  • UpdateProcessor plugin (solr/contrib/uima)  • C...
Why Apache UIMA?   OASIS standard for UIM   TLP since March 2010   Deploy pipelines of Analysis Engines   AEs wrap NLP...
NLP and OSS Sentence Split   • OpenNLP, UIMA Addons, StanfordNLP PoS tagging   • OpenNLP, UIMA Addons, StanfordNLP Chun...
Solr NLS architecture
UIMA Update Processor
Lucene analysis & UIMA Type : denote lexical types for tokens Payload : a byte array stored at each term  position toke...
UIMA type-aware tokenizer
Solr NLS QParser analyze user query extract (and query on) concepts / entities use types/PoS in the query for  • boosti...
Scaling architecture
Performance basic (in memory)  • slower with NRT indexing  • search could be significantly impacted ReST (SimpleServer) ...
DisMax vs NLS
Wrap up   general purpose architecture   generally improved recall / precision   NLP algorithms accuracy make the diffe...
Sources Resources  • http://svn.apache.org/repos/asf/lucene/dev/trunk/    solr/contrib/uima/  • https://github.com/tteofi...
Thanks http://www.sourcesense.com t.teofili@sourcesense.com @tteofili

Natural Language Search in Solr - Tommaso Teofili


See conference video - http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011

This presentation shows how to build a search engine that understands queries written in a form much closer to spoken language than to keyword-based search, using Apache Lucene/Solr and Apache UIMA. A system that recognizes semantics in natural language can be very handy for non-expert users, e-learning systems, customer care systems, etc. With such a system, users can submit queries such as "hotels near Rome" or "people working at Google" without having to manually translate them into Lucene/Solr queries.
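
To make the "hotels near Rome" example concrete, here is a minimal sketch of the kind of rewriting such a system performs. It is not the presenter's implementation: the regular expression, the tiny `PLACES` gazetteer, and the `location` field name are assumptions made for illustration, and a real system would get place entities from an NLP annotator rather than a lookup table. The only Solr-specific piece is the `{!geofilt}` filter syntax with its `sfield`, `pt`, and `d` parameters.

```python
import re

# Hypothetical gazetteer (latitude, longitude); a real system would rely on
# a named entity recognizer instead of a hard-coded table.
PLACES = {"rome": (41.9028, 12.4964), "london": (51.5074, -0.1278)}

# Matches queries of the form "<what> near <place>".
NEAR = re.compile(r"^(?P<what>.+?)\s+near\s+(?P<place>.+)$", re.IGNORECASE)


def rewrite(user_query, distance_km=10):
    """Return Solr request parameters for a user query.

    "X near PLACE" becomes a keyword query on X plus a spatial filter
    query; anything else is passed through unchanged.
    """
    match = NEAR.match(user_query.strip())
    if match:
        place = match.group("place").strip().lower()
        if place in PLACES:
            lat, lon = PLACES[place]
            return {
                "q": match.group("what").strip(),
                # 'location' is an assumed spatial field name.
                "fq": "{!geofilt sfield=location pt=%s,%s d=%s}"
                % (lat, lon, distance_km),
            }
    # Unknown place or plain keyword query: leave it to the default parser.
    return {"q": user_query.strip()}


print(rewrite("hotels near Rome"))
```

Queries that do not match the pattern, or that name a place missing from the table, fall through as ordinary keyword queries, so the rewriting step can only add precision and never blocks a search.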

Published in: Technology


  1. Natural language search in Solr
     Tommaso Teofili, Sourcesense
     t.teofili@sourcesense.com, October 19th 2011
  2. Agenda
     An approach to natural language search in Solr
     Main points
       • Solr-UIMA integration module
       • Custom Lucene analyzers for UIMA
       • OSS NLP algorithms in Lucene/Solr
       • Orchestrating blocks to build a sample system able to understand natural language queries
     Results
  3. My Background
     Software engineer at Sourcesense
       • Enterprise search consultant
     Member of the Apache Software Foundation
       • UIMA
       • Clerezza
       • Stanbol
       • DirectMemory
       • ...
  4. Google in ‘99
  5. Google today
  6. Google today
  7. The Challenge
     Improved recall/precision
       • ‘articles about science’ (concepts)
       • ‘movies by K. Spacey’ vs ‘movies with K. Spacey’
     Easier experience for non-expert users
       • ‘people working at Google’ - ‘cities near London’
     Horizontal domains (e.g. Google)
     Vertical domains
  8. Hurdles
     understanding documents’ text/user queries
     extract domain-specific/wide entities and concepts
     index/search performance
  9. Use Case
     search engine for an online movies magazine
     Solr based
     non-technical users
     time / cost
       • Solr 3.x setup : 2 mins
       • NLS setup / tweak : 5 days
     expecting
       • improved recall / precision
       • more time (clicks) on site ($)
  10. Online movies magazine
  11. General approach
     Natural language processing
     Processing documents at indexing time
       • document text analysis
       • write enriched text in (dedicated) fields
       • add custom types / payloads to terms
     Processing queries at searching time
       • query analysis
       • higher boosts to entities/concepts
       • in-sentence search
       • ...
  12. 12. NLP AI discipline • Computers understanding and managing information written in human language analyze text at various levels incrementally enrich / give structure extract concepts and named entities
  13. 13. Technical detail NLP algorithms plugged via Apache UIMA Indexing time • UpdateProcessor plugin (solr/contrib/uima) • Custom tokenizers/filters Search time • Custom QParserPlugin
  14. 14. Why Apache UIMA? OASIS standard for UIM TLP since March 2010 Deploy pipelines of Analysis Engines AEs wrap NLP algorithms Scaling capabilities
  15. 15. NLP and OSS Sentence Split • OpenNLP, UIMA Addons, StanfordNLP PoS tagging • OpenNLP, UIMA Addons, StanfordNLP Chunking/Parsing • OpenNLP, StanfordNLP NER • OpenNLP, UIMA Addons, Stanbol, StanfordNLP Clustering/Classifying • Mahout, OpenNLP, StanfordNLP ...
  16. Solr NLS architecture
  17. UIMA Update Processor
  18. Lucene analysis & UIMA
     Type : denotes the lexical type of a token
     Payload : a byte array stored at each term position
     tokenize / filter tokens covered by a certain annotation type
     store UIMA annotations’ features in types / payloads
  19. UIMA type-aware tokenizer
  20. Solr NLS QParser
     analyze user query
     extract (and query on) concepts / entities
     use types/PoS in the query for
       • boosting terms
       • synonym expansion
     search within sentences
     faceting / clustering using entities
     identify ‘place queries’ and expand Solr spatial queries (for filtering / boosting)
  21. Scaling architecture
  22. Performance
     basic (in memory)
       • slower with NRT indexing
       • search could be significantly impacted
     ReST (SimpleServer)
       • faster
       • need to explicitly digest results
     UIMA-AS
       • fast also with NRT indexing
       • fast search
       • scales nicely with lots of data
  23. DisMax vs NLS
  24. Wrap up
     general purpose architecture
     generally improved recall / precision
     the accuracy of the NLP algorithms makes the difference
     lots of OSS alternatives
     performance can be kept good
  25. Sources
     Resources
       • http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/contrib/uima/
       • https://github.com/tteofili/le11-nls
     Links
       • http://wiki.apache.org/solr/SolrUIMA
       • http://googleblog.blogspot.com/2010/01/helping-computers-understand-language.html
  26. Thanks
     http://www.sourcesense.com
     t.teofili@sourcesense.com
     @tteofili
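
The indexing-time idea from slides 11 and 18 (attach a type to each token and keep only the tokens covered by a given annotation type) can be sketched in a few lines. This is a plain-Python illustration of the concept, not Lucene or UIMA code: the `Annotation` and `Token` classes, the whitespace tokenizer, and the type names `Movie` and `Person` are all made up for the example, and the annotations are written by hand where a real pipeline would get them from an analysis engine.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Annotation:
    begin: int  # character offset where the annotation starts
    end: int    # character offset just past the annotation
    type: str   # hypothetical type name, e.g. "Person"


@dataclass
class Token:
    text: str
    begin: int
    end: int
    type: Optional[str] = None  # filled in from a covering annotation


def tokenize(text: str) -> List[Token]:
    """Split on whitespace, keeping character offsets for each token."""
    tokens, pos = [], 0
    for word in text.split():
        begin = text.index(word, pos)
        end = begin + len(word)
        tokens.append(Token(word, begin, end))
        pos = end
    return tokens


def tag(tokens: List[Token], annotations: List[Annotation]) -> List[Token]:
    """Give each token the type of the first annotation that covers it."""
    for token in tokens:
        for ann in annotations:
            if ann.begin <= token.begin and token.end <= ann.end:
                token.type = ann.type
                break
    return tokens


def only_type(tokens: List[Token], wanted: str) -> List[Token]:
    """Keep only the tokens covered by the wanted annotation type."""
    return [t for t in tokens if t.type == wanted]


text = "American Beauty stars Kevin Spacey"
# Hand-written annotations standing in for an NLP annotator's output.
annotations = [Annotation(0, 15, "Movie"), Annotation(22, 34, "Person")]

tokens = tag(tokenize(text), annotations)
print([t.text for t in only_type(tokens, "Person")])
```

Keeping the type on the token, rather than only in a separate field, is what lets a query-time component boost or filter individual terms by entity type.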
