Taking eZ Find beyond full-text search


    • Taking eZ Find beyond full-text search Paul Borgermans eZ Summer Conference London, June 16-17, 2011 © 2011 Paul Borgermans
    • About me
      ● 10 years in the eZ ecosystem
        – eZ Lucene → eZ Solr → eZ Find
        – 3.5 years with eZ Systems (2007-2010)
        – Independent consultant since 2011
      ● Fancying
        – Apache projects (mainly Solr, Hadoop, Stanbol, Zeta, ...)
        – NoSQL (Not only SQL) and scalable architectures
        – eZ Publish & CMS systems in general
        – Semantic web
    • Large sites?
    • Lots of traffic?
    • Per-user navigation needs?
    • Complex pages? Slow attribute filters?
    • Need to integrate data from other sources? (ERP, DB)
    • eZ Find is your friend! Although sometimes more like a rough diamond
    • Prelude: Meet the beast…
    • eZ Find RESTful
    • Solr in a nutshell
      ● State-of-the-art, advanced full-text search and information retrieval engine
      ● Fast, scalable, with native replication features
      ● Flexible configuration
      ● Extensible
      ● Document-oriented storage
      ● Geospatial search (Solr 3.1+)
      ● Native cloud features*
      * Under active development, almost complete (Solr 4.0)
    • Solr [Figure: Solr architecture. HTTP request servlet and update servlet; admin interface; standard, disjunction max and custom request handlers; XML/PHP/JSON/... response writers; XML update interface; the Solr core with config, schema, caching, update handler, analysis and concurrency; replication; all built on Lucene. Figure credit: Yonik Seeley]
    • Performance!
      ● The backend Solr employs intelligent caches
        – filters
        – queries
        – internal indexes
      ● Optimized for search/retrieval
        – Slower writing
      ● When updates are done, caches are reconstructed on the fly in the background
      ● Horizontal & vertical scaling
    • Using eZ Find/Solr beyond search
    • eZ Find alter egos
      ● eZ Find/Solr as a scalable IR engine/layer
        – Remove the burden on your DB
        – Significant speedups also for regular content
        – Clustering built-in
      ● eZ Find/Solr as a content and integration engine
        – Document-oriented storage system (hello NoSQL)
        – Archive use-case
        – External content
    • eZ Find alter egos (...)
      ● Alternate navigation interfaces
        – Facets, filtering, sorting
        – Function queries (!)
      ● Document clustering
        – More Like This
        – Tag based (and more semantic stuff coming up)
        – Carrot2 based
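
As a rough illustration of the facet, filter and sort options listed above, the following Python sketch assembles a plain Solr select URL. `facet`, `facet.field`, `fq` and `sort` are standard Solr request parameters. The helper name `build_facet_query` and the use of `tags_lk` as a facet field are assumptions made for illustration only; eZ Find normally sets these options through its own fetch function rather than through raw URLs.

```python
from urllib.parse import urlencode


def build_facet_query(base_url, text, facet_fields=(), filters=(), sort=None):
    """Return a Solr select URL with faceting, filter queries and sorting.

    Hypothetical helper: it only builds the URL, it does not contact Solr.
    """
    params = [("q", text), ("wt", "json")]
    if facet_fields:
        params.append(("facet", "true"))
        # One facet.field parameter per field to facet on
        params.extend(("facet.field", name) for name in facet_fields)
    # Filter queries narrow the result set without affecting the score
    params.extend(("fq", fq) for fq in filters)
    if sort:
        params.append(("sort", sort))
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Assumed example values: local Solr, a facet and a filter on tags_lk
url = build_facet_query(
    "http://localhost:8983/solr",
    "london",
    facet_fields=["tags_lk"],
    filters=["tags_lk:2011"],
    sort="score desc",
)
print(url)
```

Keeping restrictions in `fq` rather than in `q` lets Solr reuse its filter cache across requests, which is one reason filter-driven navigation tends to be fast.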
    • Provisions in eZ Find
      ● Attribute storage (serialized content)
        – Fewer DB queries
      ● Multi-core setup
      ● Distributed search in fetch(ezfind, search)
        – Query parameters
        – Filter parameters
        – Fields to return (for rendering)
    • Getting external data into Solr
    • Tools
      ● Solr Data Import Handler (DIH)
      ● Apache Manifold Connector framework
      ● Using APIs
        – eZ Find
        – Zeta Components Search
    • Integrating external data: Solr DIH
      http://wiki.apache.org/solr/DataImportHandler
      ● Goals
        – Read data residing in relational databases and XML files
        – Build Solr documents according to configuration (joins, views, ...)
        – Update Solr with such documents
        – Provide the ability to do full imports ...
        – ... as well as delta imports
    • Configuring DIH
      ● Need a more complete Solr: add DIH jars
      ● solrconfig.xml:
          <requestHandler name="/dataimport"
                          class="org.apache.solr.handler.dataimport.DataImportHandler">
            <lst name="defaults">
              <str name="config">/home/username/data-config.xml</str>
            </lst>
          </requestHandler>
      ● Configure data sources (RDBMS, XML files)
        – data-config.xml with connection and schema information
    • Using DIH
      ● Send commands to the DIH request handler: http://<host>:<port>/solr/dataimport?command=<command>
        – full-import
        – delta-import
        – status
      ● You can use the eZ Find raw Solr request API
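
A minimal sketch of how the command URLs above could be assembled from Python. The function name `dih_url` and the default handler path argument are assumptions for illustration; the three commands and the URL pattern come from the slide. The sketch only builds the URL, so it runs without a Solr server.

```python
from urllib.parse import urlencode

# Commands named on the slide
DIH_COMMANDS = ("full-import", "delta-import", "status")


def dih_url(host, port, command, handler="/dataimport"):
    """Build the URL that triggers a Data Import Handler command."""
    if command not in DIH_COMMANDS:
        raise ValueError("unsupported DIH command: %r" % (command,))
    query = urlencode({"command": command})
    return "http://%s:%s/solr%s?%s" % (host, port, handler, query)


# Assumed example: a Solr instance on localhost:8983
full_import = dih_url("localhost", 8983, "full-import")
print(full_import)

# To actually send the command against a running Solr, something like
# urllib.request.urlopen(full_import) could be used.
```

Polling the `status` command after starting an import is a simple way to find out when a long-running import has finished.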
    • Apache Manifold CF
      http://incubator.apache.org/connectors/
      ● ManifoldCF is a crawler framework
      ● Supports:
        – File System, Windows Shares
        – JDBC, RSS
        – Web, LiveLink (OpenText)
        – Documentum (EMC)
        – SharePoint (MSFT)
        – Meridio (Autonomy)
        – FileNet (IBM)
    • With eZ Find API...
        <?php
        $solr = new eZSolrBase( 'http://localhost:8983/solr' );
        $documents = array(
            array(
                'id' => 1135,
                ...
                'tags_lk' => array( 'London', 2011 )
            )
        );
        // Add each document, then commit once so they all become searchable
        foreach ( $documents as $doc )
        {
            ezfSolrUtils::addDocument( $solr, $doc );
        }
        $solr->commit();
        ?>
    • Or with Zeta Components Search
      ● http://incubator.apache.org/zetacomponents/
        <?php
        require_once 'tutorial_autoload.php';
        // on localhost with the default port
        $handler = new ezcSearchSolrHandler;
        // on another host with a different port
        $handler = new ezcSearchSolrHandler( '<host>', 9123 );
        ?>
    • Indexing workflow
      ● Assemble documents in the correct XML format
      ● Send one or more documents at a time
      ● Commit => it becomes searchable
      ● Optional parameters
        – Boosting at the document level
        – Boosting at the field level
        – Auto-commit heartbeat interval (commitWithin, in milliseconds)
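
The XML format mentioned above can be sketched with Python's standard library. The `<add>`, `<doc>` and `<field>` elements and the `commitWithin` attribute belong to Solr's XML update format; the `boost` attributes reflect the index-time boosting available in Solr releases from the period of this talk. The function name `build_add_message` and the shape of the input dictionaries are assumptions made for this illustration, not part of eZ Find.

```python
import xml.etree.ElementTree as ET


def build_add_message(documents, commit_within_ms=None):
    """Serialize documents into a Solr <add> update message.

    Each document is a dict with a "fields" mapping and optional
    "boost" (document level) and "field_boosts" (per field name) entries.
    """
    add = ET.Element("add")
    if commit_within_ms is not None:
        # Ask Solr to commit by itself within this many milliseconds
        add.set("commitWithin", str(commit_within_ms))
    for document in documents:
        doc = ET.SubElement(add, "doc")
        if "boost" in document:
            doc.set("boost", str(document["boost"]))
        field_boosts = document.get("field_boosts", {})
        for name, value in document["fields"].items():
            # Multi-valued fields become repeated <field> elements
            values = value if isinstance(value, (list, tuple)) else [value]
            for item in values:
                field = ET.SubElement(doc, "field", name=name)
                if name in field_boosts:
                    field.set("boost", str(field_boosts[name]))
                field.text = str(item)
    return ET.tostring(add, encoding="unicode")


# Assumed example document, reusing the id and tags from the eZ Find slide
message = build_add_message(
    [{"fields": {"id": 1135, "tags_lk": ["London", "2011"]}, "boost": 2.0}],
    commit_within_ms=5000,
)
print(message)
```

The resulting string would be posted to Solr's update handler; with `commitWithin` set, no explicit commit is needed for the documents to become searchable.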
    • Indexing workflow: important properties (...)
      ● Update = Add with same global id
      ● Deleting
        – An individual document (id)
        – A collection of documents (using a Solr query expression)
        – Needs a commit() to really disappear from search results
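
Both delete variants correspond to small XML messages in Solr's update format. The sketch below builds them in Python; the function names are hypothetical, and the messages would still have to be posted to the update handler and followed by a commit before the documents disappear from search results.

```python
import xml.etree.ElementTree as ET


def build_delete_by_id(doc_id):
    """Delete message for one document, identified by its global id."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = str(doc_id)
    return ET.tostring(delete, encoding="unicode")


def build_delete_by_query(query):
    """Delete message for every document matching a Solr query expression."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")


# Deletes only take effect for searchers after a commit message is sent
COMMIT_MESSAGE = "<commit/>"

print(build_delete_by_id(1135))
print(build_delete_by_query("tags_lk:London"))
```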
    • Indexing: performance considerations
      ● Commits can become expensive
        – Use them wisely: in batches where you can
        – Delay options
          ● cron job
          ● commitWithin parameter
      ● From time to time, you also need an optimize() command
        – Deletes leave “holes”
        – File fragmentation with adding/updating
        – Daily, or weekly for very large indexes (multi GB)
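
One way to follow the batching advice is to send documents in groups and commit once at the end. In this sketch `send` and `commit` are hypothetical callables standing in for whatever client is used (for example wrappers around the eZ Find API), and the default batch size is an arbitrary assumption.

```python
from itertools import islice


def batches(documents, size):
    """Yield successive lists of at most `size` documents."""
    iterator = iter(documents)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def index_in_batches(documents, send, commit, batch_size=100):
    """Send documents batch by batch and commit a single time at the end.

    Returns the number of documents sent; nothing is committed when
    there was nothing to send.
    """
    sent = 0
    for batch in batches(documents, batch_size):
        send(batch)
        sent += len(batch)
    if sent:
        commit()
    return sent


# Stand-in callables that only record what would happen
log = []
total = index_in_batches(
    range(250),
    send=lambda batch: log.append(len(batch)),
    commit=lambda: log.append("commit"),
)
print(total, log)
```

A single commit after many adds avoids rebuilding Solr's caches once per document, which is where most of the commit cost comes from.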
    • But you will also need to configure Solr
    • Field definitions: schema.xml
      ● Field types
        – text
        – numerical
        – dates
        – location
        – … (about 25 in total)
      ● Actual fields (name, definition, properties)
      ● Dynamic fields
      ● Copy fields (as aggregators)
    • schema.xml: simple field type examples
        <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
        <!-- boolean type: "true" or "false" -->
        <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true"/>
        <!-- A Trie based date field for faster date range queries and date faceting. -->
        <fieldType name="tdate" class="solr.TrieDateField" omitNorms="true" precisionStep="6" positionIncrementGap="0"/>
        <!-- A text field that only splits on whitespace for exact matching of words -->
        <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
          <analyzer>
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          </analyzer>
        </fieldType>
    • Analysis
      ● Solr does not really search your text, but rather the terms that result from the analysis of text
      ● Typically a chain of
        – Character filter(s)
        – Tokenisation
        – Filter A
        – Filter B
        – …
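
The chain described above can be mimicked with a toy Python pipeline. This is not Solr's implementation: the character filter, tokenizer, filters and stop word list below are simplified stand-ins chosen for illustration. The point is only that matching happens on the terms produced by analysis, never on the raw text.

```python
import re

STOPWORDS = {"the", "a", "an", "of"}  # assumed, deliberately tiny list


def strip_markup(text):
    """Character filter: replace anything that looks like a tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Tokeniser: split on whitespace only."""
    return text.split()


def lowercase(tokens):
    """Filter A: fold every token to lower case."""
    return [token.lower() for token in tokens]


def remove_stopwords(tokens):
    """Filter B: drop very common words."""
    return [token for token in tokens if token not in STOPWORDS]


def analyze(text):
    """Run the whole chain and return the resulting terms."""
    return remove_stopwords(lowercase(tokenize(strip_markup(text))))


def matches(query, text):
    """A query matches when all of its terms occur among the text's terms."""
    return set(analyze(query)) <= set(analyze(text))


terms = analyze("<p>The eZ Find Beast</p>")
print(terms)
print(matches("FIND", "<p>The eZ Find Beast</p>"))
```

Because the same chain is applied to the query, "FIND" matches even though the indexed text contains "Find"; changing any step of the chain changes what can be found.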
    • Solr comes with many tokenizers and filters
      ● Some are language specific
      ● Others are very specialised
      ● It is very important to get this right; otherwise, you may not get what you expect!
      Best practice: do as eZ Find does and provide multiple incarnations to suit facet, filter and search needs
    • Semantic aspects
    • Semantic aspects: using an annotation engine
      ● Main use cases for CMS systems
        – Suggest tags for editors to use
        – Enhance search engine relevancy
        – Enhance clustering (related content)
      ● Based on
        – Domain-specific ontologies
        – Publicly available databases and (RESTful) services
    • Annotation engine: “open” databases
    • eZ Publish / eZ Find integration
      ● Personal initiative
        – Joined an EC funded project as “early adopter”
      ● Initial goals:
        – eZ Find relevancy optimisation
        – Annotation suggestions from public data
      ● More ambitious
        – eZ Publish based, domain-specific ontology definition
        – TBD, as Apache Stanbol evolves
    • Something extra ...
    • The eZ Publish content model
      ● One of the main strengths
      ● But
        – Do you need versioning in all cases?
        – Translations: quite tightly coupled
        – Difficult to have workflows independent of the published version
        – Variability in objects: sometimes too rigid
        – Want traveling objects (UUID)
        – ...
      ● And of course: scalability of the implementation is limited too
    • So, a call for participation …
    • A new content repository project
      ● Provides a very powerful content model
        – adaptable to various scenarios and use-cases
      ● Exposes a rich service layer, including an optional security model
        – Role / policy based
      ● Exposes its content in a variety of ways
        – Simple-to-use PHP API
        – REST-style
        – Later: various standards (PHPCR, CMIS)
    • A new content repository ...
      ● Builds on top of an IR (information retrieval) layer
        – initially Solr based
      ● Pluggable persistence layer
        – Traditional RDBMS
        – Highly scalable NoSQL stores (HBase, MongoDB, CouchDB, ...)
    • Connects to eZ Publish through ...
      ● eZ Find
      ● Dedicated modules and after refactoring of the kernel
      ● Use it as a content store for eZ Publish itself
    • Thank you! Questions?
      http://joind.in/3443
      paul.borgermans@gmail.com
      @paulborgermans