Taking eZ Find beyond full-text search

Transcript

  • 1. Taking eZ Find beyond full-text search. Paul Borgermans, eZ Summer Conference, London, June 16-17, 2011. © 2011 Paul Borgermans
  • 2. About me
      ● 10 years in the eZ ecosystem
          – eZ Lucene → eZ Solr → eZ Find
          – 3.5 years with eZ Systems (2007-2010)
          – Independent consultant since 2011
      ● Fancying
          – Apache projects (mainly Solr, Hadoop, Stanbol, Zeta, ...)
          – NoSQL (Not only SQL) and scalable architectures
          – eZ Publish & CMS systems in general
          – Semantic web
  • 3. Large sites?
  • 4. Lots of traffic?
  • 5. Per-user navigation needs?
  • 6. Complex pages? Slow attribute filters?
  • 7. Need to integrate data from other sources? (diagram labels: ERP, DB)
  • 8. eZ Find is your friend! Although sometimes more like a rough diamond
  • 9. Prelude: meet the beast ...
  • 10. eZ Find RESTful
  • 11. Solr in a nutshell
      ● State of the art, advanced full text search and information retrieval engine
      ● Fast, scalable with native replication features
      ● Flexible configuration
      ● Extensible
      ● Document oriented storage
      ● Geospatial search (Solr 3.1+)
      ● Native cloud features*
      * under active development, almost complete (Solr 4.0)
  • 12. Solr [Architecture diagram: HTTP request servlet and update servlet; admin interface; standard, disjunction max and custom request handlers; XML/PHP/JSON/... response writers; XML update interface; config, schema, caching, update handler, analysis, concurrency and replication inside the Solr core, on top of Lucene.] Figure credit: Yonik Seeley
  • 13. Performance!
      ● The backend Solr employs intelligent caches
          – filters
          – queries
          – internal indexes
      ● Optimized for search/retrieval
          – Slower writing
      ● When updates are done, caches are reconstructed on the fly in the background
      ● Horizontal & vertical scaling
  • 14. Using eZ Find/Solr beyond search
  • 15. eZ Find alter egos
      ● eZ Find/Solr as a scalable IR engine/layer
          – Remove the burden on your DB
          – Significant speedups also for regular content
          – Clustering built-in
      ● eZ Find/Solr as a content and integration engine
          – Document oriented storage system (hello NoSQL)
          – Archive use-case
          – External content
  • 16. eZ Find alter egos (...)
      ● Alternate navigation interfaces
          – Facets, filtering, sorting
          – Function queries (!)
      ● Document clustering
          – More Like This
          – Tag based (and more semantic stuff coming up)
          – Carrot2 based
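
To make the "facets, filtering, sorting" item concrete, here is a minimal sketch of how such a navigation request could be expressed as plain Solr query parameters. It is an illustration, not the eZ Find fetch function: the helper name `build_select_url` is hypothetical, the base URL is the local default used elsewhere in this deck, and `tags_lk` is borrowed from the indexing example as an assumed facet field.

```python
from urllib.parse import urlencode


def build_select_url(base_url, query, filters=(), facet_fields=(), sort=None, rows=10):
    """Build a Solr /select URL with filter queries, facets and sorting.

    Hypothetical helper for illustration; it only assembles the URL.
    """
    params = [("q", query), ("rows", str(rows)), ("wt", "json")]
    # Filter queries (fq) restrict the result set without affecting scoring.
    params += [("fq", fq) for fq in filters]
    if facet_fields:
        params.append(("facet", "true"))
        params += [("facet.field", field) for field in facet_fields]
    if sort:
        params.append(("sort", sort))
    return base_url.rstrip("/") + "/select?" + urlencode(params)


url = build_select_url(
    "http://localhost:8983/solr",
    "conference",
    filters=["tags_lk:London"],   # assumed field name, taken from the indexing slide
    facet_fields=["tags_lk"],
    sort="score desc",
)
print(url)
```

Keeping the restriction in `fq` rather than in `q` is the usual choice for navigation interfaces, since filters are cached separately from the main query.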
  • 17. Provisions in eZ Find
      ● Attribute storage (serialized content)
          – Fewer DB queries
      ● Multi-core setup
      ● Distributed search in fetch(ezfind, search)
          – Query parameters
          – Filter parameters
          – Fields to return (for rendering)
  • 18.
  • 19. Getting external data into Solr
  • 20. Tools
      ● Solr Data Import Handler (DIH)
      ● Apache Manifold Connector framework
      ● Using APIs
          – eZ Find
          – Zeta Components Search
  • 21. Integrating external data: Solr DIH
      http://wiki.apache.org/solr/DataImportHandler
      ● Goals
          – Read data residing in relational databases and XML files
          – Build Solr documents according to configuration (joins, views, ...)
          – Update Solr with such documents
          – Provide ability to do full imports ...
          – ... as well as delta imports
  • 22. Configuring DIH
      ● Need a more complete Solr: add DIH jars
      ● solrconfig.xml:
          <requestHandler name="/dataimport"
                          class="org.apache.solr.handler.dataimport.DataImportHandler">
            <lst name="defaults">
              <str name="config">/home/username/data-config.xml</str>
            </lst>
          </requestHandler>
      ● Configure data sources (RDBMS, XML files)
          – data-config.xml with connection and schema information
  • 23. Using DIH
      ● Send commands to DIH request handler
          http://<host>:<port>/solr/dataimport?command=<command>
          – full-import
          – delta-import
          – status
      ● You can use eZ Find raw Solr request API
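
As a rough sketch of the command URLs listed above, the following builds a DIH request for a given host, port and command. The function name `dih_url` is hypothetical, and the handler path `/dataimport` is assumed to match the `requestHandler` name registered in solrconfig.xml; the snippet only constructs the URL and does not contact a server.

```python
from urllib.parse import urlencode

# The three commands named on the slide.
DIH_COMMANDS = {"full-import", "delta-import", "status"}


def dih_url(host, port, command, handler="/dataimport", **options):
    """Return the URL for a DataImportHandler command (illustrative helper)."""
    if command not in DIH_COMMANDS:
        raise ValueError(f"unsupported DIH command: {command}")
    # Extra keyword arguments become additional request parameters.
    params = {"command": command, **options}
    return f"http://{host}:{port}/solr{handler}?{urlencode(params)}"


print(dih_url("localhost", 8983, "full-import"))
print(dih_url("localhost", 8983, "status"))
```

The resulting URL can be fetched with any HTTP client, for instance `urllib.request.urlopen`, once a Solr instance with DIH configured is actually running.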
  • 24. Apache Manifold CF
      http://incubator.apache.org/connectors/
      ● ManifoldCF is a crawler framework
      ● Supports:
          – File System, Windows Shares
          – JDBC, RSS
          – Web, LiveLink (OpenText)
          – Documentum (EMC)
          – SharePoint (MSFT)
          – Meridio (Autonomy)
          – FileNet (IBM)
  • 25. With eZ Find API...
          <?php
          $solr = new eZSolrBase( 'http://localhost:8983/solr' );
          $documents = array(
              array(
                  'id' => 1135,
                  // ...
                  'tags_lk' => array( 'London', '2011' )
              )
          );
          // Add each document, then commit once so they become searchable
          foreach ( $documents as $doc )
          {
              ezfSolrUtils::addDocument( $solr, $doc );
          }
          $solr->commit();
          ?>
  • 26. Or with Zeta Components Search
      ● http://incubator.apache.org/zetacomponents/
          <?php
          require_once 'tutorial_autoload.php';
          // on localhost with the default port
          $handler = new ezcSearchSolrHandler;
          // on another host with a different port
          $handler = new ezcSearchSolrHandler( '10.0.2.184', 9123 );
          ?>
  • 27. Indexing workflow
      ● Assemble documents in the correct XML format
      ● Send one or more documents at a time
      ● Commit => it becomes searchable
      ● Optional parameters
          – Boosting at the document level
          – Boosting at the field level
          – Auto-commit heartbeat interval (commitWithin, millisecs)
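
The "correct XML format" can be sketched as follows, assuming the XML update message syntax of the Solr versions contemporary with this talk, where `boost` attributes on `<doc>` and `<field>` and `commitWithin` on `<add>` were accepted. The function `build_add_message` and its parameters are hypothetical names for illustration; the sample document reuses the values from the eZ Find API slide.

```python
import xml.etree.ElementTree as ET


def build_add_message(documents, commit_within_ms=None, doc_boost=None, field_boosts=None):
    """Serialize documents (dicts) into a Solr <add> update message.

    Illustrative helper: multi-valued fields are passed as lists and
    become repeated <field> elements.
    """
    add = ET.Element("add")
    if commit_within_ms is not None:
        # Ask Solr to commit on its own within this many milliseconds.
        add.set("commitWithin", str(commit_within_ms))
    for doc in documents:
        doc_el = ET.SubElement(add, "doc")
        if doc_boost is not None:
            doc_el.set("boost", str(doc_boost))      # document-level boost
        for name, value in doc.items():
            values = value if isinstance(value, (list, tuple)) else [value]
            for item in values:
                field = ET.SubElement(doc_el, "field")
                field.set("name", name)
                if field_boosts and name in field_boosts:
                    field.set("boost", str(field_boosts[name]))  # field-level boost
                field.text = str(item)
    return ET.tostring(add, encoding="unicode")


message = build_add_message(
    [{"id": 1135, "tags_lk": ["London", "2011"]}],
    commit_within_ms=10000,
)
print(message)
```

The string would be POSTed to the Solr update handler; without `commitWithin`, an explicit commit is still needed before the documents show up in search results.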
  • 28. Indexing workflow: important properties (...)
      ● Update = Add with same global id
      ● Deleting
          – An individual document (id)
          – A collection of documents (using a Solr query expression)
          – Needs a commit() to really disappear from search results
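
The two delete variants map onto two small XML update messages. The sketch below builds both; the helper names are hypothetical, and the query `tags_lk:London` is only an assumed example expression.

```python
import xml.etree.ElementTree as ET

# A commit message must follow before deletions disappear from results.
COMMIT_MESSAGE = "<commit/>"


def delete_by_id(doc_id):
    """Build the update message that deletes one document by its global id."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = str(doc_id)
    return ET.tostring(delete, encoding="unicode")


def delete_by_query(query):
    """Build the update message that deletes every document matching a query."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")


print(delete_by_id(1135))
print(delete_by_query("tags_lk:London"))
print(COMMIT_MESSAGE)
```

Updating needs no message of its own: re-sending an add with the same id replaces the existing document.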
  • 29. Indexing: performance considerations
      ● Commits can become expensive
          – Use them wisely: in batches where you can
          – Delay options
              ● cron job
              ● commitWithin parameter
      ● From time to time, also need an optimize() command
          – Deletes leave “holes”
          – File fragmentation with adding/updating
          – Daily, weekly for very large indexes (multi GB)
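
A minimal sketch of the "commit in batches" advice: documents are sent in chunks and a single commit follows at the end, instead of one commit per document. The `send` and `commit` callables are placeholders for whatever client is in use (assumptions for illustration, stubbed here so the example runs without a Solr server), and the batch size of 100 is arbitrary.

```python
def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def index_in_batches(documents, send, commit, batch_size=100):
    """Send documents batch by batch, then commit once.

    Returns the number of documents sent.
    """
    sent = 0
    for batch in batches(documents, batch_size):
        send(batch)
        sent += len(batch)
    if sent:
        commit()  # one commit for the whole run, not one per document
    return sent


# Stub client that records calls instead of talking to Solr.
calls = []
docs = [{"id": n} for n in range(250)]
total = index_in_batches(
    docs,
    send=lambda batch: calls.append(("send", len(batch))),
    commit=lambda: calls.append(("commit",)),
)
print(total, calls)
```

The same loop works with a cron-driven job; with `commitWithin` on each batch, the final explicit commit could be dropped altogether.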
  • 30. But you will also need to configure Solr
  • 31. Field definitions: schema.xml
      ● Field types
          – text
          – numerical
          – dates
          – location
          – ... (about 25 in total)
      ● Actual fields (name, definition, properties)
      ● Dynamic fields
      ● Copy fields (as aggregators)
  • 32. schema.xml: simple field type examples
          <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

          <!-- boolean type: "true" or "false" -->
          <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true"/>

          <!-- A Trie based date field for faster date range queries and date faceting. -->
          <fieldType name="tdate" class="solr.TrieDateField" omitNorms="true"
                     precisionStep="6" positionIncrementGap="0"/>

          <!-- A text field that only splits on whitespace for exact matching of words -->
          <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
            <analyzer>
              <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            </analyzer>
          </fieldType>
  • 33. Analysis
      ● Solr does not really search your text, but rather the terms that result from the analysis of text
      ● Typically a chain of
          – Character filter(s)
          – Tokenisation
          – Filter A
          – Filter B
          – ...
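
The chain above can be imitated in a few lines to show why searches hit terms rather than the original text. This is a toy model, not Solr's actual implementations: the tag-stripping regular expression, the whitespace tokenizer, and the tiny stopword list are all assumptions chosen for illustration.

```python
import re

STOPWORDS = frozenset({"the", "a", "of"})  # assumed, deliberately tiny list


def char_filter(text):
    """Character filter: replace markup tags with spaces before tokenising."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Tokeniser: split on whitespace."""
    return text.split()


def lowercase(tokens):
    """Filter A: normalise case."""
    return [token.lower() for token in tokens]


def remove_stopwords(tokens):
    """Filter B: drop very common words."""
    return [token for token in tokens if token not in STOPWORDS]


def analyze(text):
    """Run the full chain: character filter, tokeniser, then token filters."""
    tokens = tokenize(char_filter(text))
    for token_filter in (lowercase, remove_stopwords):
        tokens = token_filter(tokens)
    return tokens


terms = analyze("<b>The</b> Beast of London")
print(terms)  # ['beast', 'london']
```

Only `beast` and `london` end up as indexed terms here, so a query has to pass through a compatible chain to match them; this is why the same content is often indexed several times with different chains.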
  • 34. Solr comes with many tokenizers and filters
      ● Some are language specific
      ● Others are very specialised
      ● It is very important to get this right; otherwise, you may not get what you expect!
      Best practice: do like eZ Find, provide multiple incarnations to suit facet, filter and search needs
  • 35. Semantic aspects
  • 36. Semantic aspects: using an annotation engine
      ● Main use cases for CMS systems
          – Suggest tags to use for editors
          – Enhance search engine relevancy
          – Enhance clustering (related content)
      ● Based on
          – Domain specific ontologies
          – Publicly available databases and (RESTful) services
  • 37. Annotation engine: “open” databases
  • 38. eZ Publish / eZ Find integration
      ● Personal initiative
          – Joined an EC funded project as “early adopter”
      ● Initial goals:
          – eZ Find relevancy optimisation
          – Annotation suggestions from public data
      ● More ambitious
          – eZ Publish based, domain specific ontology definition
          – TBD, as Apache Stanbol evolves
  • 39. Something extra ...
  • 40. The eZ Publish content model
      ● One of the main strengths
      ● But
          – Do you need versioning in all cases?
          – Translations: quite tightly coupled
          – Difficult to have workflows independent of the published version
          – Variability in objects: sometimes too rigid
          – Want traveling objects (UUID)
          – ...
      ● And of course: scalability of the implementation is limited too
  • 41. So, a call for participation ...
  • 42. A new content repository project
      ● Provides a very powerful content model
          – adaptable to various scenarios and use-cases
      ● Exposes a rich service layer, including an optional security model
          – Role / policy based
      ● Exposes its content through a variety of ways
          – Simple-to-use PHP API
          – REST-style
          – Later: various standards (PHPCR, CMIS)
  • 43. A new content repository ...
      ● Builds on top of an IR (information retrieval) layer
          – initially Solr based
      ● Pluggable persistence layer
          – Traditional RDBMS
          – Highly scalable NoSQL stores (HBase, MongoDB, CouchDB, ...)
  • 44. Connects to eZ Publish through ...
      ● eZ Find
      ● Dedicated modules and after refactoring of the kernel
      ● Use it as a content store for eZ Publish itself
  • 45. Thank you! Questions?
      http://joind.in/3443
      paul.borgermans@gmail.com
      @paulborgermans