The Future of Search in Plone
                             Sally Kleinfeldt and friends
                             Plone Conference, San Francisco
                             November 3, 2011




Tuesday, November 29, 2011
Motivation


                             •   Raise awareness

                             •   Promote discussion

                             •   Forge consensus




Agenda


                             •   Introduction to IR concepts

                             •   Description of Solr and ZCatalog

                             •   Discussion




IR 101




IR 101

                             •   Transformations

                             •   Terms

                             •   Models

                             •   Measures




IR 101
                                     Transformations
                             •   Turn binary, HTML, or other document
                                 formats into fields and strings

                             •   Parse the strings into a set of terms

                             •   Build indexes of the terms specific to the IR
                                 model used

                             •   Queries are parsed into query operators and
                                 strings, which are parsed into terms




IR 101
                                     String => Terms
                             •   Tokenization - locate word boundaries

                             •   Normalization - lowercase and strip diacritics

                             •   Stopping - remove stop words (a, of, on,
                                 the...)

                             •   Stemming - reduce to word stems (walks,
                                 walking => walk)

                             •   Recognizers - concepts, parts of speech,
                                 names, locations...

                             •   Must be identical for documents and queries
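The string-to-terms steps above can be sketched as a small pipeline. This is a minimal illustration, not how ZCTextIndex or Solr implement their analyzers: the stop list is the four words from the slide, and `naive_stem` is a hypothetical toy suffix stripper standing in for a real stemmer such as Porter. Recognizers are left out.

```python
import re
import unicodedata

# Tiny illustrative stop list taken from the slide; real lists are longer.
STOP_WORDS = {"a", "of", "on", "the"}


def normalize(token):
    """Lowercase the token and drop diacritics."""
    decomposed = unicodedata.normalize("NFKD", token.lower())
    # After NFKD, accents are separate combining characters; remove them.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def naive_stem(token):
    """Toy suffix stripper (an assumption for illustration, not Porter)."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Tokenize, normalize, stop, and stem a string into a list of terms."""
    tokens = re.findall(r"\w+", text)                    # tokenization
    tokens = [normalize(t) for t in tokens]              # normalization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stopping
    return [naive_stem(t) for t in tokens]               # stemming


print(analyze("The walks of a walking tour"))  # ['walk', 'walk', 'tour']
```

Calling the same `analyze` function on both documents and queries is what keeps the two sides identical, as the last bullet requires.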


IR 101
                                              Terms

                             •   Application specific

                             •   Words or phrases

                             •   IR models assign weights to terms in
                                 documents




IR 101
                                       Term Weighting
                             •   Simplest: Yes/No Boolean value

                             •   Better: Term Frequency - # occurrences

                             •   More meaningful: tf-idf

                                 •   Term Freq * Inverse Document Freq

                                 •   How many documents contain the term?

                                 •   Raise the weight of rare terms, lower the
                                     weight of common ones
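As a rough sketch of the weighting described above, the function below computes tf-idf with raw term counts and idf = log(N / df). That particular idf formula is an assumption chosen for simplicity; real engines use smoothed variants.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs is a list of term lists; returns one {term: weight} dict per doc."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for terms in docs:
        df.update(set(terms))
    weights = []
    for terms in docs:
        tf = Counter(terms)  # term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


docs = [["plone", "search", "search"], ["plone", "solr"]]
weights = tf_idf(docs)
# "plone" occurs in every document, so its weight is 0; "search" and "solr"
# each occur in only one document and get a positive weight.
```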




IR 101
                                      Boolean Model
                             •   First and most widely adopted

                             •   Based on Boolean logic + set theory

                             •   Does a document contain query terms - Y/N

                             •   Intuitive, easy to implement

                             •   No ranking, special query language, too many
                                 or too few results

                             •   Typical for library systems
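A minimal sketch of the Boolean model, assuming documents are already reduced to term lists: an inverted index maps each term to the set of document ids containing it, and AND, OR, and NOT become set intersection, union, and difference. The function names are hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    """docs maps doc id -> list of terms; returns term -> set of doc ids."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index


def search_and(index, *terms):
    """Documents containing every term (set intersection)."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def search_or(index, *terms):
    """Documents containing at least one term (set union)."""
    return set().union(*(index.get(t, set()) for t in terms))


def search_not(index, all_ids, term):
    """Documents that do not contain the term (set difference)."""
    return set(all_ids) - index.get(term, set())


docs = {1: ["plone", "search"], 2: ["solr", "search"], 3: ["plone", "solr"]}
index = build_index(docs)
print(search_and(index, "plone", "search"))  # {1}
```

The result is an unordered set, which shows the limitation noted above: a document either matches or it does not, with no ranking.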




IR 101
                                 Vector Space Models
                             •   Represent documents and queries as vectors
                                 of terms

                             •   Term values are weighted - by count or tf-idf

                             •   Use vector operations to compare
                                 documents with queries

                             •   Relevance score based on cosine of angle
                                 between doc/query vectors
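The cosine score can be sketched as follows, assuming each document and query is a sparse vector stored as a {term: weight} dict (weights could be counts or tf-idf values).

```python
import math


def cosine(a, b):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as matching nothing
    return dot / (norm_a * norm_b)


query = {"plone": 1.0, "search": 1.0}
doc_a = {"plone": 2.0, "search": 2.0}  # same direction as the query
doc_b = {"solr": 3.0}                  # no terms in common
# cosine(query, doc_a) is close to 1.0 and cosine(query, doc_b) is 0.0,
# so doc_a ranks above doc_b.
```

Because the score depends on the angle rather than the vector length, a long document that repeats the same terms does not automatically outrank a short one.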




IR 101
                                 Probabilistic Models
                             •   Compute probability that a document is
                                 relevant to a query

                             •   Relevance ranking functions range from
                                 simple to complex

                             •   Sophisticated ranking functions include

                                 •   Okapi BM25 (uses tf and idf)

                                 •   Machine learning formulas (use training
                                     data)
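To show how Okapi BM25 combines tf and idf, here is a sketch of one common formulation. The idf variant (with the `1 +` inside the logarithm to keep it non-negative) and the defaults k1 = 1.2 and b = 0.75 are conventional choices assumed here; they are not taken from the ZCatalog or Solr sources.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each document (a non-empty list of term lists) against a query."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency of every term in the collection.
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            # Length normalization: long documents need more occurrences
            # of a term to reach the same score.
            norm = tf[t] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [["solr", "search", "solr"], ["plone", "catalog"], ["plone", "search"]]
scores = bm25_scores(["search"], docs)
# Documents 0 and 2 both contain "search" once; the shorter document 2
# scores higher, and document 1 scores 0.
```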




IR 101
                             Extending the Models
                             •   Many, many refinements possible

                                 •   Term interdependencies

                                 •   Fuzzy sets

                                 •   Semantic analysis, link analysis

                                 •   Combining models (Extended Boolean)

                             •   The best search engines represent thousands
                                 of engineering hours




IR 101
                                              Measures
                             •   Search engine results are measured by:

                                 •   Precision - Percent of results that are
                                     relevant

                                 •   Recall - Percent of relevant documents that
                                     are returned

                                 •   F-Score - Harmonic mean of precision and
                                     recall
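The three measures can be computed from two sets of document ids: what the engine returned and what a human judged relevant. A minimal sketch, with the F-score taken as the balanced F1 (equal weight on precision and recall):

```python
def precision_recall_f1(returned, relevant):
    """Return (precision, recall, F1) for one query's result set."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)  # relevant documents actually returned
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# 4 results returned, 3 documents are relevant, 2 of them were found:
p, r, f = precision_recall_f1({1, 2, 3, 4}, {2, 4, 5})
# p is 2/4, r is 2/3, and f is their harmonic mean, 4/7.
```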




ZCatalog and Solr




ZCatalog
                             •   Zope/Plone search engine

                             •   Full text and field searching

                             •   Probabilistic model using Okapi BM25

                             •   Out-of-the-box ZCTextIndex is very simple

                             •   TextIndexNG adds multilingual, better parsing
                                 components, binary transforms, synonyms




Solr
                             •   Popular open source enterprise search
                                 platform

                             •   Eliminating smaller commercial search
                                 companies

                             •   Written in Java, built on the Lucene search
                                 library; sophisticated, extended vector space model

                             •   RESTful APIs

                             •   Large, active community

                             •   Powers Twitter, Wikipedia, Netflix...



What Does Solr Have That ZCatalog Doesn’t?
                             •   Better relevance ranking

                             •   More search features: snippets, hit
                                 highlighting, spelling suggestions, synonyms,
                                 more like this, faceted search

                             •   More configurable: stop words, field
                                 boosting, parsing components

                             •   An army of engineers working on it
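Several of these features are switched on per request through query parameters on Solr's HTTP search handler. The sketch below only builds such a URL and sends nothing. The host, port, and `/solr/select` path are assumptions about a default local install, and the field names `SearchableText` and `portal_type` are borrowed from Plone's catalog as hypothetical Solr schema fields. Spelling suggestions also require the spellcheck component to be configured on the Solr side.

```python
from urllib.parse import urlencode

# Assumed location of a local Solr instance; adjust to the actual install.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_search_url(text, facet_field="portal_type",
                     highlight_field="SearchableText"):
    """Build a Solr query URL asking for facets, highlighting, and spelling."""
    params = [
        ("q", text),                  # the user's query string
        ("wt", "json"),               # response format
        ("hl", "true"),               # hit highlighting / snippets
        ("hl.fl", highlight_field),   # field to take snippets from
        ("facet", "true"),            # faceted search
        ("facet.field", facet_field), # field to count facet values on
        ("spellcheck", "true"),       # spelling suggestions
    ]
    return SOLR_SELECT_URL + "?" + urlencode(params)


print(build_search_url("plone search"))
```

ZCatalog has no equivalent of these parameters.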




Plone + Solr
                                              Today
                             •   Two add-ons available

                                 •   collective.solr - Intercepts catalog queries
                                     and dispatches them to Solr

                                 •   alm.solrindex - adds a new index type to
                                     the catalog, SolrIndex

                             •   Plus a buildout recipe:
                                 collective.recipe.solrinstance




Conclusions from Conference Discussion




Why Does Plone Need Solr?
                             •   Certain types of projects need it, for features
                                 or because ZCatalog can’t scale to very large
                                 sites

                             •   We need it to keep up with the enterprise
                                 CMS pack




Points of Agreement
                             •   It will be impossible to completely replace
                                 ZCatalog with Solr

                                 •   Solr indexing will never be transactional

                                 •   Removing ZCatalog from Zope would be
                                     very difficult

                                 •   Tackle small, focused ZCatalog
                                     improvements when possible - like
                                     improving indexing interface




Points of Agreement

                             •   Navigation and search should be handled
                                 separately

                                 •   Navigation needs to be transactional,
                                     search does not

                                 •   Split out a catalog used for navigation from
                                     the general catalog

                                 •   Explore a non-catalog utility to support
                                     navigation, optimize for speed




Points of Agreement
                             •   Treating Solr integration simply as ZCatalog
                                 replacement does not take best advantage of
                                 Solr features

                                 •   ZCatalog can’t represent the richness of
                                     Solr, focus on the Solr API

                                 •   Take advantage of spelling suggestions,
                                     facets, results snippets with hit highlighting,
                                     synonyms, more like this, etc.

                                 •   Provide configuration choices for Solr
                                     indexing, field weighting, etc. in the
                                     control panel



Points of Agreement
                             •   Neither of the current Solr add-ons provides
                                 the best foundation for the future

                                 •   But they’ve taught us how to do things
                                     better

                             •   Non-Solr approaches to improved Plone
                                 search should be deprecated

                                 •   Andreas Jung is not planning improvements
                                     to TextIndexNG!




Points of Agreement


                             •   Stop investing in ZCatalog as a search engine,
                                 Solr is the future




Plone + Solr
                                            Roadmap
                             •   Short term: Make Solr integration easy with
                                 an approved add-on (like LDAP)

                                 •   Build on what we’ve learned and create a
                                     better add-on to replace collective.solr and
                                     alm.solrindex

                                 •   Who wants to sponsor a sprint?




Plone + Solr
                                            Roadmap
                             •   Long term: Ship Solr integration with Plone,
                                 but don’t require Solr

                                 •   Solr has a lot of overhead and is not always
                                     needed

                                 •   But using it should be as easy as answering
                                     yes to a “Build with Solr?” installation
                                     option





