Taking eZ Find beyond full-text search


  1. Taking eZ Find beyond full-text search
     Paul Borgermans, eZ Summer Conference, London, June 16-17, 2011
     © 2011 Paul Borgermans
  2. About me
     ● 10 years in the eZ ecosystem
       – eZ Lucene → eZ Solr → eZ Find
       – 3.5 years with eZ Systems (2007-2010)
       – Independent consultant since 2011
     ● Interests
       – Apache projects (mainly Solr, Hadoop, Stanbol, Zeta, ...)
       – NoSQL (Not only SQL) and scalable architectures
       – eZ Publish and CMS systems in general
       – Semantic web
  3. Large sites?
  4. Lots of traffic?
  5. Per-user navigation needs?
  6. Complex pages? Slow attribute filters?
  7. Need to integrate data from other sources? (ERP, DB)
  8. eZ Find is your friend! Although sometimes more like a rough diamond
  9. Prelude: meet the beast …
  10. eZ Find RESTful
  11. Solr in a nutshell
      ● State of the art, advanced full-text search and information retrieval engine
      ● Fast, scalable, with native replication features
      ● Flexible configuration
      ● Extensible
      ● Document-oriented storage
      ● Geospatial search (Solr 3.1+)
      ● Native cloud features*
      * Under active development, almost complete (Solr 4.0)
  12. Solr architecture (diagram)
      ● Entry points: HTTP request servlet, update servlet, admin interface
      ● Request handlers: standard, disjunction max, custom
      ● Response writers: XML, PHP, JSON, ...
      ● XML update interface and update handler
      ● Solr core: config, schema, caching, analysis, concurrency, replication
      ● Lucene underneath
      Figure credit: Yonik Seeley
  13. Performance!
      ● The Solr backend employs intelligent caches
        – filters
        – queries
        – internal indexes
      ● Optimized for search/retrieval
        – Slower writing
      ● When updates are done, caches are rebuilt on the fly in the background
      ● Horizontal and vertical scaling
  14. Using eZ Find/Solr beyond search
  15. eZ Find alter egos
      ● eZ Find/Solr as a scalable IR engine/layer
        – Remove the burden on your DB
        – Significant speedups also for regular content
        – Clustering built in
      ● eZ Find/Solr as a content and integration engine
        – Document-oriented storage system (hello NoSQL)
        – Archive use case
        – External content
  16. eZ Find alter egos (...)
      ● Alternate navigation interfaces
        – Facets, filtering, sorting
        – Function queries (!)
      ● Document clustering
        – More Like This
        – Tag based (and more semantic stuff coming up)
        – Carrot2 based
  17. Provisions in eZ Find
      ● Attribute storage (serialized content)
        – Fewer DB queries
      ● Multi-core setup
      ● Distributed search in fetch(ezfind, search)
        – Query parameters
        – Filter parameters
        – Fields to return (for rendering)
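The query, filter and returned-field parameters listed above correspond to request parameters that Solr itself accepts (`q`, `fq`, `fl`, and `shards` for distributed search). As a minimal sketch, assuming a Solr instance at a hypothetical base URL and ignoring how eZ Find builds its requests internally, the following Python helper only assembles such a URL; `select_url` is an illustrative name, not part of eZ Find or Solr.

```python
from urllib.parse import urlencode


def select_url(base, query, filters=(), fields=(), shards=()):
    """Assemble a Solr /select URL from query, filter and field parameters."""
    params = [("q", query)]
    # Each filter becomes its own fq parameter, so Solr can cache them separately.
    params += [("fq", f) for f in filters]
    if fields:
        params.append(("fl", ",".join(fields)))
    if shards:
        # Distributed search: comma-separated list of shard addresses.
        params.append(("shards", ",".join(shards)))
    return f"{base}/select?{urlencode(params)}"


# Hypothetical base URL; the field name tags_lk is borrowed from the slides.
url = select_url(
    "http://localhost:8983/solr",
    "london",
    filters=["tags_lk:2011"],
    fields=["id", "score"],
)
print(url)
```

Restricting `fl` to the fields needed for rendering keeps responses small, which is the point of the "fields to return" provision.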
  18. (no text on this slide)
  19. Getting external data into Solr
  20. Tools
      ● Solr Data Import Handler (DIH)
      ● Apache ManifoldCF (connector framework)
      ● Using APIs
        – eZ Find
        – Zeta Components Search
  21. Integrating external data: Solr DIH
      http://wiki.apache.org/solr/DataImportHandler
      ● Goals
        – Read data residing in relational databases and XML files
        – Build Solr documents according to configuration (joins, views, ...)
        – Update Solr with such documents
        – Provide the ability to do full imports ...
        – ... as well as delta imports
  22. Configuring DIH
      ● Need a more complete Solr: add the DIH jars
      ● solrconfig.xml:
          <requestHandler name="/dataimport"
                          class="org.apache.solr.handler.dataimport.DataImportHandler">
            <lst name="defaults">
              <str name="config">/home/username/data-config.xml</str>
            </lst>
          </requestHandler>
      ● Configure data sources (RDBMS, XML files)
        – data-config.xml with connection and schema information
  23. Using DIH
      ● Send commands to the DIH request handler
        http://<host>:<port>/solr/dataimport?command=<command>
        – full-import
        – delta-import
        – status
      ● You can use the eZ Find raw Solr request API
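The command URL pattern above can be sketched in a few lines. This Python example only builds the URL for one of the three commands named on the slide; the function name `dih_url`, the default handler path and the host and port values are assumptions for illustration, and actually sending the request is left out.

```python
from urllib.parse import urlencode

# The three commands listed on the slide.
DIH_COMMANDS = {"full-import", "delta-import", "status"}


def dih_url(host, port, command, handler="/solr/dataimport"):
    """Build the URL that triggers a Data Import Handler command."""
    if command not in DIH_COMMANDS:
        raise ValueError(f"unsupported DIH command: {command}")
    return f"http://{host}:{port}{handler}?{urlencode({'command': command})}"


print(dih_url("localhost", 8983, "delta-import"))
```

A cron job could request the `delta-import` URL periodically and poll the `status` URL to see when the import has finished.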
  24. Apache ManifoldCF
      http://incubator.apache.org/connectors/
      ● ManifoldCF is a crawler framework
      ● Supports:
        – File system, Windows shares
        – JDBC, RSS
        – Web, LiveLink (OpenText)
        – Documentum (EMC)
        – SharePoint (MSFT)
        – Meridio (Autonomy)
        – FileNet (IBM)
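  25. With the eZ Find API ...
          <?php
          // Connect to the Solr instance used by eZ Find
          $solr = new eZSolrBase( 'http://localhost:8983/solr' );
          $documents = array(
              array(
                  'id' => 1135,
                  // ... (further fields elided on the slide)
                  'tags_lk' => array( 'London', '2011' )
              )
          );
          // Add each document, then commit to make them searchable
          foreach ( $documents as $doc )
          {
              ezfSolrUtils::addDocument( $solr, $doc );
          }
          $solr->commit();
          ?>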
  26. Or with Zeta Components Search
      ● http://incubator.apache.org/zetacomponents/
          <?php
          require_once 'tutorial_autoload.php';
          // on localhost with the default port
          $handler = new ezcSearchSolrHandler;
          // on another host with a different port
          $handler = new ezcSearchSolrHandler( '10.0.2.184', 9123 );
          ?>
  27. Indexing workflow
      ● Assemble documents in the correct XML format
      ● Send one or more documents at a time
      ● Commit => the documents become searchable
      ● Optional parameters
        – Boosting at the document level
        – Boosting at the field level
        – Auto-commit heartbeat interval (commitWithin, in milliseconds)
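To make the workflow above concrete, here is a minimal sketch of assembling an XML update message with a document-level boost, a field-level boost and a `commitWithin` value. It assumes the XML update format of the Solr 3.x era that the talk describes (later Solr versions may treat index-time boosts differently); the helper `build_add_message` and the sample values are hypothetical, and posting the message to Solr is not shown.

```python
import xml.etree.ElementTree as ET


def build_add_message(docs, commit_within_ms=None):
    """Build a Solr <add> message.

    docs is a list of (doc_boost, fields) pairs, where fields is a list of
    (name, value, field_boost) tuples; a boost of None means "no boost".
    """
    add = ET.Element("add")
    if commit_within_ms is not None:
        # Ask Solr to commit on its own within this many milliseconds.
        add.set("commitWithin", str(commit_within_ms))
    for doc_boost, fields in docs:
        doc = ET.SubElement(add, "doc")
        if doc_boost is not None:
            doc.set("boost", str(doc_boost))
        for name, value, field_boost in fields:
            field = ET.SubElement(doc, "field", name=name)
            if field_boost is not None:
                field.set("boost", str(field_boost))
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


# Sample document modelled on the slide's example (id 1135, tags_lk values).
message = build_add_message(
    [(2.0, [("id", 1135, None),
            ("tags_lk", "London", 1.5),
            ("tags_lk", "2011", None)])],
    commit_within_ms=5000,
)
print(message)
```

Multi-valued fields such as `tags_lk` are expressed by repeating the `<field>` element with the same name.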
  28. Indexing workflow: important properties (...)
      ● Update = add with the same global id
      ● Deleting
        – An individual document (id)
        – A collection of documents (using a Solr query expression)
        – Needs a commit() to really disappear from search results
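The two ways of deleting can be sketched as XML update messages, again assuming Solr's XML update format. The function names are illustrative only; the id and the field name `tags_lk` are borrowed from the slides' own example.

```python
import xml.etree.ElementTree as ET


def delete_by_id(doc_id):
    """Message that deletes a single document by its global id."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = str(doc_id)
    return ET.tostring(delete, encoding="unicode")


def delete_by_query(query):
    """Message that deletes every document matching a Solr query expression."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")


# A commit message must follow before the deletes show up in search results.
COMMIT_MESSAGE = "<commit/>"

print(delete_by_id(1135))
print(delete_by_query("tags_lk:London"))
print(COMMIT_MESSAGE)
```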
  29. Indexing: performance considerations
      ● Commits can become expensive
        – Use them wisely: in batches where you can
        – Delay options
          ● cron job
          ● commitWithin parameter
      ● From time to time, you also need an optimize() command
        – Deletes leave “holes”
        – File fragmentation with adding/updating
        – Daily, weekly for very large indexes (multi-GB)
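The advice to commit in batches can be illustrated with a small sketch. It assumes two caller-supplied callables, `send` and `commit`, that stand in for whatever client actually talks to Solr (both are hypothetical here); the point is only that many documents are sent per request and a single commit follows at the end.

```python
from itertools import islice


def batches(documents, size):
    """Yield successive lists of at most `size` documents."""
    iterator = iter(documents)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def index_in_batches(documents, size, send, commit):
    """Send documents in batches and commit once at the end."""
    sent = 0
    for chunk in batches(documents, size):
        send(chunk)
        sent += len(chunk)
    commit()  # one commit for the whole run instead of one per document
    return sent


# Stand-in callbacks that just record what would be sent to Solr.
log = []
total = index_in_batches(
    range(5), 2,
    send=lambda chunk: log.append(("add", chunk)),
    commit=lambda: log.append(("commit",)),
)
print(total, log)
```

With `commitWithin` set on the add messages, the explicit final commit could be dropped and Solr left to commit on its own schedule.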
  30. But you will also need to configure Solr
  31. Field definitions: schema.xml
      ● Field types
        – text
        – numerical
        – dates
        – location
        – … (about 25 in total)
      ● Actual fields (name, definition, properties)
      ● Dynamic fields
      ● Copy fields (as aggregators)
  32. schema.xml: simple field type examples
          <fieldType name="string" class="solr.StrField"
                     sortMissingLast="true" omitNorms="true"/>

          <!-- boolean type: "true" or "false" -->
          <fieldType name="boolean" class="solr.BoolField"
                     sortMissingLast="true" omitNorms="true"/>

          <!-- A Trie based date field for faster date range queries
               and date faceting. -->
          <fieldType name="tdate" class="solr.TrieDateField" omitNorms="true"
                     precisionStep="6" positionIncrementGap="0"/>

          <!-- A text field that only splits on whitespace for exact
               matching of words -->
          <fieldType name="text_ws" class="solr.TextField"
                     positionIncrementGap="100">
            <analyzer>
              <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            </analyzer>
          </fieldType>
  33. Analysis
      ● Solr does not really search your text, but rather the terms that result from the analysis of the text
      ● Typically a chain of
        – Character filter(s)
        – Tokenisation
        – Filter A
        – Filter B
        – …
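The idea of an analysis chain can be shown with a toy example. This is not Solr's implementation: it is a plain Python sketch in which a character filter, a whitespace tokenizer and two token filters (lowercasing and stopword removal, chosen here as stand-ins for "Filter A" and "Filter B") run in sequence, so that what ends up searchable is the resulting terms rather than the original text.

```python
import re

# Illustrative stopword list, not taken from any Solr configuration.
STOPWORDS = {"the", "a", "of"}


def char_filter(text):
    """Character filter: replace HTML-like tags with spaces before tokenising."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Tokeniser: split on whitespace."""
    return text.split()


def lowercase(tokens):
    """Token filter A: normalise case."""
    return [token.lower() for token in tokens]


def remove_stopwords(tokens):
    """Token filter B: drop very common words."""
    return [token for token in tokens if token not in STOPWORDS]


def analyze(text):
    """Run the whole chain and return the terms that would be indexed."""
    tokens = tokenize(char_filter(text))
    for token_filter in (lowercase, remove_stopwords):
        tokens = token_filter(tokens)
    return tokens


print(analyze("<b>The</b> Summer Conference of London"))
```

Because the same chain is normally applied to queries as well, a search for "LONDON" would match the indexed term "london".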
  34. Solr comes with many tokenizers and filters
      ● Some are language specific
      ● Others are very specialised
      ● It is very important to get this right; otherwise you may not get the results you expect!
      Best practice: do as eZ Find does and provide multiple incarnations of a field to suit facet, filter and search needs
  35. Semantic aspects
  36. Semantic aspects: using an annotation engine
      ● Main use cases for CMS systems
        – Suggest tags for editors to use
        – Enhance search engine relevancy
        – Enhance clustering (related content)
      ● Based on
        – Domain-specific ontologies
        – Publicly available databases and (RESTful) services
  37. Annotation engine: “open” databases
  38. eZ Publish / eZ Find integration
      ● Personal initiative
        – Joined an EC-funded project as “early adopter”
      ● Initial goals:
        – eZ Find relevancy optimisation
        – Annotation suggestions from public data
      ● More ambitious
        – eZ Publish based, domain-specific ontology definition
        – TBD, as Apache Stanbol evolves
  39. Something extra ...
  40. The eZ Publish content model
      ● One of the main strengths
      ● But
        – Do you need versioning in all cases?
        – Translations: quite tightly coupled
        – Difficult to have workflows independent of the published version
        – Variability in objects: sometimes too rigid
        – Want traveling objects (UUID)
        – ...
      ● And of course: scalability of the implementation is limited too
  41. So, a call for participation …
  42. A new content repository project
      ● Provides a very powerful content model
        – Adaptable to various scenarios and use cases
      ● Exposes a rich service layer, including an optional security model
        – Role / policy based
      ● Exposes its content in a variety of ways
        – Simple-to-use PHP API
        – REST-style
        – Later: various standards (PHPCR, CMIS)
  43. A new content repository ...
      ● Builds on top of an IR (information retrieval) layer
        – Initially Solr based
      ● Pluggable persistence layer
        – Traditional RDBMS
        – Highly scalable NoSQL stores (HBase, MongoDB, CouchDB, ...)
  44. Connects to eZ Publish through ...
      ● eZ Find
      ● Dedicated modules, and after refactoring of the kernel
      ● Use it as a content store for eZ Publish itself
  45. Thank you! Questions?
      http://joind.in/3443
      paul.borgermans@gmail.com
      @paulborgermans