Taking eZ Find beyond full-text search

  1. Taking eZ Find beyond full-text search. Paul Borgermans, eZ Summer Conference, London, June 16-17, 2011. © 2011 Paul Borgermans
  2. About me
     ● 10 years in the eZ ecosystem
       – eZ Lucene → eZ Solr → eZ Find
       – 3.5 years with eZ Systems (2007-2010)
       – Independent consultant since 2011
     ● Fancying
       – Apache projects (mainly Solr, Hadoop, Stanbol, Zeta, ...)
       – NoSQL (Not only SQL) and scalable architectures
       – eZ Publish & CMS systems in general
       – Semantic web
  3. Large sites?
  4. Lots of traffic?
  5. Per-user navigation needs?
  6. Complex pages? Slow attribute filters?
  7. Need to integrate data from other sources? (ERP, DB)
  8. eZ Find is your friend! Although sometimes more like a rough diamond
  9. Prelude: meet the beast ...
  10. eZ Find, RESTful
  11. Solr in a nutshell
     ● State-of-the-art, advanced full-text search and information retrieval engine
     ● Fast, scalable, with native replication features
     ● Flexible configuration
     ● Extensible
     ● Document-oriented storage
     ● Geospatial search (Solr 3.1+)
     ● Native cloud features*
     * under active development, almost complete (Solr 4.0)
  12. [Figure: Solr architecture diagram. HTTP Request Servlet and Update Servlet; Admin Interface; Standard, Disjunction Max and Custom Request Handlers; XML/PHP/JSON/... Response Writer; XML Update Interface; Solr Core with Config, Schema, Caching, Update Handler, Analysis, Concurrency and Replication; all on top of Lucene. Figure credit: Yonik Seeley]
  13. Performance!
     ● The Solr backend employs intelligent caches
       – filters
       – queries
       – internal indexes
     ● Optimized for search/retrieval
       – Slower writing
     ● When updates are done, caches are reconstructed on the fly in the background
     ● Horizontal & vertical scaling
  14. Using eZ Find/Solr beyond search
  15. eZ Find alter egos
     ● eZ Find/Solr as a scalable IR engine/layer
       – Remove the burden on your DB
       – Significant speedups also for regular content
       – Clustering built-in
     ● eZ Find/Solr as a content and integration engine
       – Document-oriented storage system (hello NoSQL)
       – Archive use case
       – External content
  16. eZ Find alter egos (...)
     ● Alternate navigation interfaces
       – Facets, filtering, sorting
       – Function queries (!)
     ● Document clustering
       – More Like This
       – Tag based (and more semantic stuff coming up)
       – Carrot2 based
  17. Provisions in eZ Find
     ● Attribute storage (serialized content)
       – Fewer DB queries
     ● Multi-core setup
     ● Distributed search in fetch( 'ezfind', 'search' )
       – Query parameters
       – Filter parameters
       – Fields to return (for rendering)
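At the Solr level, the query, filter and field-list parameters listed on this slide correspond to the standard request parameters `q`, `fq` and `fl`, and distributed search is requested with `shards`. The following is a minimal Python sketch of how such a request URL could be assembled. It is not the eZ Find fetch function itself; the helper name `build_select_url`, the host and the example filter are illustrative assumptions (only the `tags_lk` field name is borrowed from the deck's own PHP example).

```python
from urllib.parse import urlencode

def build_select_url(base_url, query, filters=(), fields=(), shards=()):
    """Build a Solr /select URL (hypothetical helper, not part of eZ Find)."""
    params = [("q", query)]                      # main query
    params += [("fq", f) for f in filters]       # one fq per filter, each cached separately by Solr
    if fields:
        params.append(("fl", ",".join(fields)))  # fields to return
    if shards:
        params.append(("shards", ",".join(shards)))  # distributed search
    return base_url.rstrip("/") + "/select?" + urlencode(params)

# Assumed example values; adjust host, core and field names to your setup.
url = build_select_url(
    "http://localhost:8983/solr",
    "london conference",
    filters=["tags_lk:2011"],
    fields=["id", "score"],
)
print(url)
```

Keeping restrictions in `fq` rather than in `q` lets Solr reuse its filter cache across requests, which is one reason attribute filters tend to be cheaper there than in the database.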
  18. (no transcribed text)
  19. Getting external data into Solr
  20. Tools
     ● Solr Data Import Handler (DIH)
     ● Apache Manifold Connector framework
     ● Using APIs
       – eZ Find
       – Zeta Components Search
  21. Integrating external data: Solr DIH
     http://wiki.apache.org/solr/DataImportHandler
     ● Goals
       – Read data residing in relational databases and XML files
       – Build Solr documents according to configuration (joins, views, ...)
       – Update Solr with such documents
       – Provide the ability to do full imports ...
       – ... as well as delta imports
  22. Configuring DIH
     ● Need a more complete Solr: add the DIH jars
     ● solrconfig.xml:
         <requestHandler name="/dataimport"
                         class="org.apache.solr.handler.dataimport.DataImportHandler">
           <lst name="defaults">
             <str name="config">/home/username/data-config.xml</str>
           </lst>
         </requestHandler>
     ● Configure data sources (RDBMS, XML files)
       – data-config.xml with connection and schema information
  23. Using DIH
     ● Send commands to the DIH request handler
         http://<host>:<port>/solr/dataimport?command=<command>
       – full-import
       – delta-import
       – status
     ● You can use the eZ Find raw Solr request API
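The command URL pattern on this slide can be sketched in a few lines of Python. This is only an illustration of the URL scheme shown above: the helper name `dih_url` and the host and port values are assumptions, and the command list is limited to the three commands the slide names.

```python
from urllib.parse import urlencode

# Commands named on the slide; DIH accepts others, which are not covered here.
DIH_COMMANDS = {"full-import", "delta-import", "status"}

def dih_url(host, port, command):
    """Return the DataImportHandler URL for one command (hypothetical helper)."""
    if command not in DIH_COMMANDS:
        raise ValueError(f"unsupported DIH command: {command}")
    return f"http://{host}:{port}/solr/dataimport?" + urlencode({"command": command})

# Assumed host and port of a local Solr instance.
print(dih_url("localhost", 8983, "delta-import"))

# To actually trigger the import, the URL could be fetched with
# urllib.request.urlopen(dih_url(...)); that call is left out so the
# sketch runs without a Solr server.
```

A typical pattern is to run `full-import` once, schedule `delta-import` periodically, and poll `status` to see whether an import is still running.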
  24. Apache Manifold CF
     http://incubator.apache.org/connectors/
     ● ManifoldCF is a crawler framework
     ● Supports:
       – File System, Windows Shares
       – JDBC, RSS
       – Web, LiveLink (OpenText)
       – Documentum (EMC)
       – SharePoint (MSFT)
       – Meridio (Autonomy)
       – FileNet (IBM)
  25. With eZ Find API...
         <?php
         $solr = new eZSolrBase( 'http://localhost:8983/solr' );
         $documents = array(
             array(
                 'id' => 1135,
                 ...
                 'tags_lk' => array( 'London', 2011 )
             )
         );
         // Add each document, then commit once to make them searchable
         foreach ( $documents as $doc )
         {
             ezfSolrUtils::addDocument( $solr, $doc );
         }
         $solr->commit();
         ?>
  26. Or with Zeta Components Search
     ● http://incubator.apache.org/zetacomponents/
         <?php
         require_once 'tutorial_autoload.php';
         // on localhost with the default port
         $handler = new ezcSearchSolrHandler;
         // on another host with a different port
         $handler = new ezcSearchSolrHandler( '10.0.2.184', 9123 );
         ?>
  27. Indexing workflow
     ● Assemble documents in the correct XML format
     ● Send one or more documents at a time
     ● Commit => the documents become searchable
     ● Optional parameters
       – Boosting at the document level
       – Boosting at the field level
       – Auto-commit heartbeat interval (commitWithin, in milliseconds)
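As a rough illustration of "the correct XML format", the sketch below assembles a Solr XML `<add>` message in Python, with an optional document boost, optional per-field boosts and a `commitWithin` attribute. It assumes the Solr 3.x-era update format that was current at the time of this talk (index-time boosts were dropped in much later Solr versions). The function name `build_add_xml` and the dictionary layout of the input are hypothetical; the field values reuse the deck's own example document.

```python
import xml.etree.ElementTree as ET

def build_add_xml(docs, commit_within_ms=None):
    """Serialize documents into a Solr <add> update message (hypothetical helper)."""
    add = ET.Element("add")
    if commit_within_ms is not None:
        # Ask Solr to commit by itself within this many milliseconds.
        add.set("commitWithin", str(commit_within_ms))
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        if "boost" in doc:
            doc_el.set("boost", str(doc["boost"]))        # document-level boost
        field_boosts = doc.get("field_boosts", {})
        for name, value in doc["fields"].items():
            values = value if isinstance(value, (list, tuple)) else [value]
            for item in values:                           # multi-valued fields repeat the element
                field = ET.SubElement(doc_el, "field", name=name)
                if name in field_boosts:
                    field.set("boost", str(field_boosts[name]))  # field-level boost
                field.text = str(item)
    return ET.tostring(add, encoding="unicode")

xml_message = build_add_xml(
    [{"fields": {"id": 1135, "tags_lk": ["London", "2011"]}, "boost": 2.0}],
    commit_within_ms=10000,
)
print(xml_message)
```

The resulting string would be POSTed to the Solr update handler; with `commitWithin` set, no explicit commit is needed for the documents to become searchable within the given interval.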
  28. Indexing workflow: important properties (...)
     ● Update = Add with the same global id
     ● Deleting
       – An individual document (id)
       – A collection of documents (using a Solr query expression)
       – Needs a commit() to really disappear from search results
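The two delete variants map onto two forms of the Solr XML `<delete>` message, followed by a `<commit/>`. Here is a small Python sketch that builds these messages; the helper names are hypothetical and the id and query values are only examples taken from the deck's sample document.

```python
import xml.etree.ElementTree as ET

def delete_by_id(doc_id):
    """Delete message for one document, addressed by its unique id."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = str(doc_id)
    return ET.tostring(delete, encoding="unicode")

def delete_by_query(query):
    """Delete message for every document matching a Solr query expression."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")

# A delete only takes effect in search results after a commit.
COMMIT_MESSAGE = "<commit/>"

print(delete_by_id(1135))
print(delete_by_query("tags_lk:2011"))
print(COMMIT_MESSAGE)
```

Building the messages with an XML library rather than string concatenation keeps query expressions containing `<`, `>` or `&` correctly escaped.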
  29. Indexing: performance considerations
     ● Commits can become expensive
       – Use them wisely: in batches where you can
       – Delay options
         ● cron job
         ● commitWithin parameter
     ● From time to time, you also need an optimize() command
       – Deletes leave “holes”
       – File fragmentation with adding/updating
       – Daily, or weekly for very large indexes (multi GB)
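A minimal sketch of the "commit in batches" advice, in Python: documents are sent in fixed-size chunks and a single commit is issued at the end instead of one per document. The `send` and `commit` callables stand in for whatever client is used (for example the eZ Find API or a raw HTTP POST); they, the function names and the batch size are assumptions made for illustration.

```python
from itertools import islice

def batches(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk

def index_in_batches(docs, send, commit, batch_size=100):
    """Send documents chunk by chunk, then commit once (hypothetical helper)."""
    sent = 0
    for chunk in batches(docs, batch_size):
        send(chunk)          # one update request per chunk
        sent += len(chunk)
    if sent:
        commit()             # single, comparatively expensive commit
    return sent

# Stand-in callables that only record what would be sent to Solr.
sent_sizes = []
commits = []
total = index_in_batches(
    range(250),
    send=lambda chunk: sent_sizes.append(len(chunk)),
    commit=lambda: commits.append("commit"),
)
print(total, sent_sizes, len(commits))
```

For long-running imports, the final explicit commit could be replaced by the commitWithin parameter, so that Solr decides when to commit.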
  30. But you will also need to configure Solr
  31. Field definitions: schema.xml
     ● Field types
       – text
       – numerical
       – dates
       – location
       – ... (about 25 in total)
     ● Actual fields (name, definition, properties)
     ● Dynamic fields
     ● Copy fields (as aggregators)
  32. schema.xml: simple field type examples
         <fieldType name="string" class="solr.StrField"
                    sortMissingLast="true" omitNorms="true"/>
         <!-- boolean type: "true" or "false" -->
         <fieldType name="boolean" class="solr.BoolField"
                    sortMissingLast="true" omitNorms="true"/>
         <!-- A Trie based date field for faster date range queries and date faceting. -->
         <fieldType name="tdate" class="solr.TrieDateField" omitNorms="true"
                    precisionStep="6" positionIncrementGap="0"/>
         <!-- A text field that only splits on whitespace for exact matching of words -->
         <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
           <analyzer>
             <tokenizer class="solr.WhitespaceTokenizerFactory"/>
           </analyzer>
         </fieldType>
  33. Analysis
     ● Solr does not really search your text, but rather the terms that result from the analysis of the text
     ● Typically a chain of
       – Character filter(s)
       – Tokenisation
       – Filter A
       – Filter B
       – ...
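To make the idea of an analysis chain concrete, here is a toy Python sketch: a character filter, a tokenizer and two token filters applied in sequence, with the same chain used for indexed text and for the query. These functions are simplified stand-ins written for this illustration, loosely modelled on what Solr's HTML-stripping character filter, whitespace tokenizer, lowercase filter and stop-word filter do; they are not Solr code, and the stop-word list is an arbitrary assumption.

```python
import re

STOPWORDS = frozenset({"the", "a", "of"})   # assumed, tiny stop-word list

def strip_markup(text):
    """Character filter: replace anything that looks like a tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)

def whitespace_tokenize(text):
    """Tokenizer: split on runs of whitespace."""
    return text.split()

def lowercase(tokens):
    """Token filter A: normalise case."""
    return [token.lower() for token in tokens]

def remove_stopwords(tokens):
    """Token filter B: drop very common words."""
    return [token for token in tokens if token not in STOPWORDS]

def analyze(text):
    """Run the full chain: character filter, tokenizer, then token filters in order."""
    tokens = whitespace_tokenize(strip_markup(text))
    for token_filter in (lowercase, remove_stopwords):
        tokens = token_filter(tokens)
    return tokens

indexed_terms = analyze("<b>The</b> Summer Conference of London")
query_terms = analyze("LONDON")
print(indexed_terms)
print(all(term in indexed_terms for term in query_terms))
```

The query "LONDON" matches only because it went through the same lowercase filter as the indexed text; with mismatched chains on the index and query sides, the terms would not line up.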
  34. Solr comes with many tokenizers and filters
     ● Some are language specific
     ● Others are very specialised
     ● It is very important to get this right; otherwise you may not get what you expect!
     Best practice: do as eZ Find does and provide multiple incarnations of a field to suit facet, filter and search needs
  35. Semantic aspects
  36. Semantic aspects: using an annotation engine
     ● Main use cases for CMS systems
       – Suggest tags for editors to use
       – Enhance search engine relevancy
       – Enhance clustering (related content)
     ● Based on
       – Domain-specific ontologies
       – Publicly available databases and (RESTful) services
  37. Annotation engine: “open” databases
  38. eZ Publish / eZ Find integration
     ● Personal initiative
       – Joined an EC-funded project as “early adopter”
     ● Initial goals:
       – eZ Find relevancy optimisation
       – Annotation suggestions from public data
     ● More ambitious
       – eZ Publish based, domain-specific ontology definition
       – TBD, as Apache Stanbol evolves
  39. Something extra ...
  40. The eZ Publish content model
     ● One of the main strengths
     ● But
       – Do you need versioning in all cases?
       – Translations: quite tightly coupled
       – Difficult to have workflows independent of the published version
       – Variability in objects: sometimes too rigid
       – Want traveling objects (UUID)
       – ...
     ● And of course: scalability of the implementation is limited too
  41. So, a call for participation ...
  42. A new content repository project
     ● Provides a very powerful content model
       – adaptable to various scenarios and use cases
     ● Exposes a rich service layer, including an optional security model
       – Role / policy based
     ● Exposes its content in a variety of ways
       – Simple-to-use PHP API
       – REST-style
       – Later: various standards (PHPCR, CMIS)
  43. A new content repository ...
     ● Builds on top of an IR (information retrieval) layer
       – initially Solr based
     ● Pluggable persistence layer
       – Traditional RDBMS
       – Highly scalable NoSQL stores (HBase, MongoDB, CouchDB, ...)
  44. Connects to eZ Publish through ...
     ● eZ Find
     ● Dedicated modules
     and, after refactoring of the kernel:
     ● Use it as a content store for eZ Publish itself
  45. Thank you! Questions?
     http://joind.in/3443
     paul.borgermans@gmail.com
     @paulborgermans
