
The Future of Search in Plone


From the beginning, the Zope Catalog has provided Plone with out-of-the-box content search, an important feature not found in all open source content management systems. However, search engine technology has been racing ahead, and user expectations of what search should do have been changing. At the same time, search engines have gone from premium enterprise products to cheap commodities. The most important search engine worth considering these days is also open source: Lucene/Solr. Several add-on products exist that integrate Solr with Plone, and interest in this technology is growing.

In this talk, Sally Kleinfeldt provides an information retrieval tutorial and discusses the questions: What does Solr bring to Plone? Should Solr become part of Plone core?

These slides include conclusions from the conference discussion. A link to audio of the presentation is here:


  1. The Future of Search in Plone
      Sally Kleinfeldt and friends
      Plone Conference, San Francisco, November 3, 2011
      Tuesday, November 29, 2011
  2. Motivation
      • Raise awareness
      • Promote discussion
      • Forge consensus
  3. Agenda
      • Introduction to IR concepts
      • Description of Solr and ZCatalog
      • Discussion
  4. IR 101
  5. IR 101
      • Transformations
      • Terms
      • Models
      • Measures
  6. IR 101 Transformations
      • Turn binary, HTML, or other document formats into fields and strings
      • Parse the strings into a set of terms
      • Build indexes of the terms specific to the IR model used
      • Queries are parsed into query operators and strings, which are parsed into terms
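The "build indexes of the terms" step can be sketched as a tiny inverted index. This is only an illustration, not how ZCatalog or Solr store their indexes; the function name `build_inverted_index` and the sample documents are made up for the example, and the parsing is reduced to lowercasing and splitting on whitespace.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Deliberately crude parsing: lowercase, then split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


# Hypothetical documents, invented for this sketch.
docs = {
    "d1": "Plone ships with the Zope catalog",
    "d2": "Solr is built on Lucene",
    "d3": "Plone can index content in Solr",
}
index = build_inverted_index(docs)
print(sorted(index["plone"]))
print(sorted(index["solr"]))
```

A query is answered by looking up each query term in `index` instead of scanning every document.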
  7. IR 101 String => Terms
      • Tokenization - locate word boundaries
      • Normalization - remove capitals and diacritics
      • Stopping - remove stop words (a, of, on, the...)
      • Stemming - reduce to word stems (walks, walking => walk)
      • Recognizers - concepts, parts of speech, names, locations...
      • Must be identical for documents and queries
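The first four steps of this chain can be sketched in a few lines. The stop list is the one from the slide; `naive_stem` is a deliberately crude suffix stripper invented for this example and is no substitute for a real stemmer such as Porter's. Recognizers are left out.

```python
import re
import unicodedata

STOP_WORDS = {"a", "of", "on", "the"}


def normalize(token):
    """Lowercase the token and strip diacritics (e.g. 'Café' -> 'cafe')."""
    decomposed = unicodedata.normalize("NFKD", token.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def naive_stem(token):
    """Toy stemmer (an assumption of this sketch): strip 'ing' or 's'."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Tokenize, normalize, remove stop words, then stem."""
    tokens = re.findall(r"\w+", text)
    tokens = [normalize(t) for t in tokens]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]


print(analyze("Walking on the Caf\u00e9 terrace"))
```

Because the same `analyze` function would be applied to both documents and queries, a query for "walks" and a document containing "walking" end up with the same term.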
  8. IR 101 Terms
      • Application specific
      • Words or phrases
      • IR models assign weights to terms in documents
  9. IR 101 Term Weighting
      • Simplest: Yes/No Boolean value
      • Better: Term Frequency - # occurrences
      • More meaningful: tf-idf
          • Term Freq * Inverse Document Freq
          • How many documents contain the term?
          • Increase weight of rare terms and vice versa
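As a rough illustration of tf-idf, the sketch below uses the plain `log(N / df)` form of inverse document frequency; real engines use smoothed variants, so treat the exact formula as an assumption. The function name `tf_idf` and the sample term lists are invented for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return {doc_id: {term: weight}} with weight = tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for terms in docs.values():
        doc_freq.update(set(terms))
    weights = {}
    for doc_id, terms in docs.items():
        term_freq = Counter(terms)
        weights[doc_id] = {
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
    return weights


# Hypothetical, already-analyzed documents.
docs = {
    "d1": ["plone", "search", "search"],
    "d2": ["plone", "solr"],
    "d3": ["plone", "catalog"],
}
weights = tf_idf(docs)
print(weights["d1"])
```

Here "plone" occurs in every document, so its weight is zero everywhere, while "search" is both rare and repeated in `d1` and gets the highest weight.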
  10. IR 101 Boolean Model
      • First and most adopted
      • Based on Boolean logic + set theory
      • Does a document contain query terms - Y/N
      • Intuitive, easy to implement
      • No ranking, special query language, too many or too few results
      • Typical for library systems
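The "Boolean logic + set theory" idea maps directly onto set operations over postings. In this sketch the index contents and the helper names (`boolean_and`, `boolean_or`, `boolean_not`) are hypothetical.

```python
def boolean_and(index, *terms):
    """Documents containing every term."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def boolean_or(index, *terms):
    """Documents containing at least one term."""
    return set().union(*(index.get(term, set()) for term in terms))


def boolean_not(index, all_ids, term):
    """Documents that do not contain the term."""
    return set(all_ids) - index.get(term, set())


# Hypothetical term -> document ids index.
index = {"plone": {"d1", "d3"}, "solr": {"d2", "d3"}, "zope": {"d1"}}
all_ids = {"d1", "d2", "d3"}

print(boolean_and(index, "plone", "solr"))
print(boolean_or(index, "plone", "solr"))
print(boolean_and(index, "plone") & boolean_not(index, all_ids, "solr"))
```

Every result is an unordered set: a document either matches or it does not, which is why this model offers no ranking.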
  11. IR 101 Vector Space Models
      • Represent documents and queries as vectors of terms
      • Term values are weighted - by count or tf-idf
      • Use vector operations to compare documents with queries
      • Relevance score based on cosine of angle between doc/query vectors
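A minimal sketch of cosine-based ranking, assuming raw term counts as weights (tf-idf weights would plug in the same way). The names `cosine_similarity` and `rank` and the sample documents are made up for illustration.

```python
import math
from collections import Counter


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_terms, docs):
    """Return (score, doc_id) pairs, best match first."""
    query_vec = Counter(query_terms)
    scored = [
        (cosine_similarity(query_vec, Counter(terms)), doc_id)
        for doc_id, terms in docs.items()
    ]
    return sorted(scored, reverse=True)


# Hypothetical, already-analyzed documents.
docs = {
    "d1": ["plone", "search"],
    "d2": ["solr", "search", "search"],
    "d3": ["zope", "catalog"],
}
for score, doc_id in rank(["plone", "search"], docs):
    print(doc_id, round(score, 3))
```

Unlike the Boolean model, a document that matches only part of the query (`d2`) still gets a score and a place in the ranking.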
  12. IR 101 Probabilistic Models
      • Compute probability that a document is relevant to a query
      • Relevance ranking functions range from simple to complex
      • Sophisticated ranking functions include
          • Okapi BM25 (uses tf and idf)
          • Machine learning formulas (use training data)
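To show how Okapi BM25 combines tf and idf, here is a sketch of one common formulation. The idf variant (with `+ 1` inside the logarithm to keep it non-negative) and the defaults `k1=1.2`, `b=0.75` are assumptions of this example, not necessarily what ZCTextIndex or any particular engine uses.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every document against the query with a common BM25 variant."""
    n_docs = len(docs)
    avg_len = sum(len(terms) for terms in docs.values()) / n_docs
    doc_freq = Counter()
    for terms in docs.values():
        doc_freq.update(set(terms))
    scores = {}
    for doc_id, terms in docs.items():
        term_freq = Counter(terms)
        # Longer-than-average documents are penalized, shorter ones boosted.
        length_norm = 1 - b + b * len(terms) / avg_len
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores[doc_id] = score
    return scores


# Hypothetical, already-analyzed documents.
docs = {
    "d1": ["solr", "search", "solr"],
    "d2": ["plone", "search"],
    "d3": ["zope", "catalog", "index"],
}
print(bm25_scores(["solr"], docs))
print(bm25_scores(["search"], docs))
```

For the query "search", both `d1` and `d2` contain the term once, but the shorter `d2` scores higher because of the length normalization.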
  13. IR 101 Extending the Models
      • Many many refinements possible
      • Term interdependencies
      • Fuzzy sets
      • Semantic analysis, link analysis
      • Combining models (Extended Boolean)
      • The best search engines represent thousands of engineering hours
  14. IR 101 Measures
      • Search engine results are measured against:
          • Precision - Percent of returned results that are relevant
          • Recall - Percent of relevant documents that are returned
          • F-Score - Harmonic mean of precision and recall
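The three measures are easy to compute once you know which documents were returned and which are actually relevant. The function name and the sample ids below are invented for the example; the F-score shown is the balanced F1.

```python
def precision_recall_f1(returned, relevant):
    """Compute precision, recall and F1 for one query."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical judgment data: 4 results returned, 3 documents truly relevant.
precision, recall, f1 = precision_recall_f1(
    returned=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5"],
)
print(precision, recall, f1)
```

Two of the four returned results are relevant (precision 1/2) and two of the three relevant documents were found (recall 2/3), giving an F1 of 4/7.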
  15. ZCatalog and Solr
  16. ZCatalog
      • Zope/Plone search engine
      • Full text and field searching
      • Probabilistic model using Okapi BM25
      • OOTB ZCTextIndex very simple
      • TextIndexNG adds multilingual, better parsing components, binary transforms, synonyms
  17. Solr
      • Popular open source enterprise search platform
      • Eliminating smaller commercial search companies
      • Java, based on Lucene Java search library, sophisticated vector space ++ model
      • RESTful APIs
      • Large, active community
      • Powers Twitter, Wikipedia, Netflix...
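Because Solr is queried over HTTP, a search is just a URL. The sketch below only builds the URL for Solr's `select` handler with the standard `q`, `rows` and `wt` parameters and does not contact a server. The base URL assumes a local Solr on its default port, and the core name `plone` and the field `title` are hypothetical.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, core, query, rows=10):
    """Build a query URL for Solr's select handler, asking for JSON."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{base_url}/{core}/select?{urlencode(params)}"


# Assumed local Solr instance; "plone" is a made-up core name.
url = solr_select_url("http://localhost:8983/solr", "plone", "title:search")
print(url)
```

Any HTTP client, from any language, can fetch such a URL, which is what makes Solr straightforward to integrate with a Python application like Plone.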
  18. What Does Solr Have That ZCatalog Doesn’t?
      • Better relevance ranking
      • More search features: snippets, hit highlighting, spelling suggestions, synonyms, more like this, faceted search
      • More configurable: stop words, field boosting, parsing components
      • An army of engineers working on it
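Facets and highlighted snippets come back as extra sections of the Solr JSON response when they are requested (for example with `facet=true` and `hl=true`). The sketch below reads those sections from a hand-written sample that follows Solr's default JSON layout, where facet counts arrive as a flat `[value, count, value, count, ...]` list. The sample data, the document ids, and the use of `portal_type` and `SearchableText` as Solr field names are assumptions for illustration.

```python
import json

# Hand-written sample in the shape of a Solr JSON response (not real output).
SAMPLE_RESPONSE = json.loads("""
{
  "response": {"numFound": 2, "docs": [{"id": "doc-1"}, {"id": "doc-2"}]},
  "facet_counts": {
    "facet_fields": {"portal_type": ["Document", 2, "News Item", 0]}
  },
  "highlighting": {
    "doc-1": {"SearchableText": ["The future of <em>search</em> in Plone"]}
  }
}
""")


def facet_counts(response, field):
    """Turn the flat [value, count, ...] facet list into a dict."""
    flat = response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))


def snippets(response, doc_id, field):
    """Return highlighted snippets for one document, or an empty list."""
    return response.get("highlighting", {}).get(doc_id, {}).get(field, [])


print(facet_counts(SAMPLE_RESPONSE, "portal_type"))
print(snippets(SAMPLE_RESPONSE, "doc-1", "SearchableText"))
```

This richer response structure has no counterpart in a plain list of catalog results.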
  19. Plone + Solr Today
      • Two add-ons available
          • collective.solr - Intercepts catalog queries and dispatches them to Solr
          • alm.solrindex - adds a new index type to the catalog, SolrIndex
      • Plus a buildout recipe: collective.recipe.solrinstance
  20. Conclusions from Conference Discussion
  21. Why Does Plone Need Solr?
      • Certain types of projects need it, for features or because ZCatalog can’t scale to very large sites
      • We need it to keep up with the enterprise CMS pack
  22. Points of Agreement
      • It will be impossible to completely replace ZCatalog with Solr
      • Solr indexing will never be transactional
      • Removing ZCatalog from Zope would be very difficult
      • Tackle small, focused ZCatalog improvements when possible, such as improving the indexing interface
  23. Points of Agreement
      • Navigation and search should be handled separately
      • Navigation needs to be transactional, search does not
      • Split out a catalog used for navigation from the general catalog
      • Explore a non-catalog utility to support navigation, optimized for speed
  24. Points of Agreement
      • Treating Solr integration simply as a ZCatalog replacement does not take best advantage of Solr features
      • ZCatalog can’t represent the richness of Solr, so focus on the Solr API
      • Take advantage of spelling suggestions, facets, results snippets with hit highlighting, synonyms, more like this, etc.
      • Provide configuration choices for Solr indexing, field weighting, etc. in the control panel
  25. Points of Agreement
      • Neither of the current Solr add-ons provides the best foundation for the future
      • But they’ve taught us how to do things better
      • Non-Solr approaches to improved Plone search should be deprecated
      • Andreas Jung is not planning improvements to TextIndexNG!
  26. Points of Agreement
      • Stop investing in ZCatalog as a search engine, Solr is the future
  27. Plone + Solr Roadmap
      • Short term: Make Solr integration easy with an approved add-on (like LDAP)
      • Build on what we’ve learned and create a better add-on to replace collective.solr and alm.solrindex
      • Who wants to sponsor a sprint?
  28. Plone + Solr Roadmap
      • Long term: Ship Solr integration with Plone, but don’t require Solr
      • Solr has a lot of overhead and is not always needed
      • But using it should be as easy as answering yes to a “Build with Solr?” installation option