Repository performance tuning


Presentation at the .adaptTo(Berlin) meetup

Published in: Technology


Slide 1: Repository performance tuning
Jukka Zitting | Senior Developer

Slide 2: Agenda
- Performance tuning steps
- Repository internals
- Basic content access
- Batch processing
- Clustering
- Query performance
- Full text indexing
- Questions and answers

Slide 3: Performance tuning steps
- Step 1: Identify the symptom
  - Create a test case that consistently measures the current performance
  - Define a performance target if the current level is unacceptable
  - Make sure that the test case and the target performance are really relevant
- Step 2: Identify the cause
  - Main suspects: hardware, repository, application, client
  - Revise the test case until the problem no longer occurs; for example with Selenium, JMeter, JUnit, or Iometer
- Step 3: Identify and implement possible solutions
  - Change content, configuration, or code, or upgrade the hardware
- Step 4: Verify the results
  - If the target is not reached, iterate the process or revise the goal

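Step 1 above asks for a test case that measures current performance consistently. A minimal sketch of such a harness in plain Java (the class and method names are hypothetical, and the measured operation is a stand-in for a real repository call):

```java
// Hypothetical micro-benchmark harness for step 1: measure an operation
// repeatably so the same number can be compared before and after tuning.
public class PerfTest {

    // Runs the operation `iterations` times and returns the average
    // elapsed time in milliseconds per run (one warm-up run excluded).
    static double averageMillis(Runnable operation, int iterations) {
        operation.run(); // warm-up, excluded from the measurement
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            operation.run();
        }
        long elapsed = System.nanoTime() - start;
        return elapsed / 1_000_000.0 / iterations;
    }

    public static void main(String[] args) {
        // Stand-in workload; in a real test this would read or write content.
        double avg = averageMillis(() -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 1000; i++) sb.append(i);
        }, 100);
        double target = 50.0; // hypothetical performance target in ms
        System.out.println("average = " + avg + " ms, target = " + target + " ms");
    }
}
```

The same harness, run unchanged before and after each tuning change, is what makes step 4 (verify results) meaningful.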
Slide 4: Repository internals
(diagram of the repository's four storage components: Data Store, Persistence Manager, Query Index, Cluster Journal)

Slide 5: Data Store
- Content-addressed storage for large binary properties
  - Arbitrarily sized binary streams
  - Addressed by MD5 hash
  - String properties are not included; use UTF-8 to map them to binary if needed
- Fast delivery of binary content
  - Read directly from disk
  - Can also be read in ranges
- Improved write throughput
  - Multiple uploads can proceed concurrently (within hardware limits)
- Cheap copies
- Garbage collection used to reclaim disk space
- Logically shared by the entire cluster

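Content addressing means a binary's identifier is derived from its content: identical binaries always get the same key, which is what makes copies cheap. A toy illustration in plain Java (the class and method names are illustrative, not Jackrabbit API):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Toy illustration of content-addressed storage: the data store keys each
// binary by the MD5 hash of its content, so two identical binaries map to
// the same record and "copying" one is just reusing its identifier.
public class ContentAddress {

    static String identifier(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(content);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available
        }
    }

    public static void main(String[] args) {
        byte[] data = "hello".getBytes(StandardCharsets.UTF_8);
        // Same content, same identifier: this is why copies are cheap.
        System.out.println(identifier(data));
    }
}
```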
Slide 6: Cluster Journal
- Journal of all persisted changes in the repository
  - Content changes
  - Namespace and node type registrations, etc.
- Used to keep all cluster nodes in sync
  - Observation events on all cluster nodes (see JackrabbitEvent.isExternal)
  - Search index updates
  - Internal cache invalidation
- Old events need to be discarded eventually
  - No notable performance impact, just extra disk space
  - Keep events for the longest time a node can be offline without being completely recreated
- Logically shared by the entire cluster
- Writes are synchronized over the entire cluster

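The journal is configured per cluster node in repository.xml. A hedged sketch using the file-based journal; the id, syncDelay, and parameter values shown are illustrative, so check the Jackrabbit clustering documentation for the exact options your version supports:

```xml
<!-- Sketch of a cluster configuration in repository.xml (illustrative values) -->
<Cluster id="node1" syncDelay="5000">
  <Journal class="org.apache.jackrabbit.core.journal.FileJournal">
    <!-- Shared location where journal records are written -->
    <param name="directory" value="${rep.home}/journal"/>
  </Journal>
</Cluster>
```

The syncDelay controls how often a node pulls changes from the journal, which is the cluster sync interval referenced on the clustering and query slides below.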
Slide 7: Persistence Manager
- Identifier-addressed storage for nodes and properties
  - Each node has a UUID, even if it is not mix:referenceable
  - Essentially a key-value store, even when backed by an RDBMS
  - Also keeps track of node references
- Bundles as units of content
  - Bundle = UUID, type, properties, child node references, etc.
  - Only large binaries are stored elsewhere, in the data store
  - Designed for balanced content hierarchies; avoid too many child nodes
- Atomic updates
  - A save() call persists the entire transient space as a single atomic operation
- One PM per workspace (and one for the shared version store)
- Logically (often also physically) shared across a cluster

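The bundle idea can be sketched as a toy model: an identifier-addressed map where each node's record (type, properties, child references) lives under a single UUID key. This is plain Java for illustration only, not Jackrabbit code:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Toy model of a persistence manager: an identifier-addressed store where
// each "bundle" (one node's record) is keyed by UUID. Names are illustrative.
public class BundleStore {

    // A bundle groups a node's data under a single key.
    static class Bundle {
        final String primaryType;
        final Map<String, String> properties = new HashMap<>();
        Bundle(String primaryType) { this.primaryType = primaryType; }
    }

    private final Map<UUID, Bundle> store = new HashMap<>();

    // Persists a batch of changed bundles as one unit, mirroring how a JCR
    // save() commits the whole transient space atomically.
    void save(Map<UUID, Bundle> transientSpace) {
        store.putAll(transientSpace);
    }

    Bundle read(UUID id) {
        return store.get(id);
    }

    public static void main(String[] args) {
        BundleStore pm = new BundleStore();
        UUID id = UUID.randomUUID();
        Bundle bundle = new Bundle("nt:unstructured");
        bundle.properties.put("title", "Hello");
        Map<UUID, Bundle> changes = new HashMap<>();
        changes.put(id, bundle);
        pm.save(changes);
        System.out.println(pm.read(id).properties.get("title")); // prints "Hello"
    }
}
```

Because a whole bundle is read and written as one record, updating a node with very many child node references rewrites a large bundle each time, which is why balanced hierarchies matter.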
Slide 8: Query Index
- Inverted index based on Apache Lucene
  - Flexible mapping from terms to node identifiers
  - Special handling for the path structure
- Mostly synchronous index updates
  - Long full text extraction tasks are handled in the background
  - Other cluster nodes update their indexes at the next cluster sync
- Everything is indexed by default
  - Indexing configuration for tweaking functionality, performance, and disk usage
- One index per workspace (and one for the shared version store)
- Not shared across a cluster; indexes are local to each cluster node

Slide 9: Agenda
- Performance tuning steps
- Repository internals
- Basic content access
- Batch processing
- Clustering
- Query performance
- Indexing configuration
- Questions and answers

Slide 10: Basic content access
- Very fast access by path and ID
  - The underlying storage is addressed by ID, but path traversal is needed in any case for ACL checks
- Relevant caches:
  - Path-to-ID map (internal structure, not configurable)
  - Item state caches (automatically balanced, configurable for special cases)
  - Bundle cache (default is fairly low; increase it for large deployments)
  - Also some PM-specific options (TarPM index, etc.)
- Caches are optimized for a reasonably sized active working set
  - Typical web access pattern: a handful of key resources and a long tail of less frequently accessed content, few writes
- Performance hit especially when updating nodes with lots of child nodes
- FineGrainedISMLocking for concurrent, non-overlapping writes

Slide 11: Example: Bundle cache configuration

```xml
<!-- In …/repository/workspaces/${}/workspace.xml -->
<Workspace …>
  <PersistenceManager class="…">
    <param name="bundleCacheSize" value="8"/>
  </PersistenceManager>
</Workspace>
```

Slide 12: Batch processing
- Two issues: reading and writing
- Reading lots of content
  - Tree traversal is the best approach, but it will flood the caches
  - Schedule for off-peak times
  - Add an explicit delay (used by the garbage collectors)
  - Use a dedicated cluster node for batch processing
- Writing lots of content (including deleting large subtrees)
  - The entire transient space is kept in memory and committed atomically
  - Split the operation into smaller pieces
  - Save after every ~1k nodes
  - Leverage the data store if possible

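The "save after every ~1k nodes" advice can be sketched as a batching loop. To keep the sketch self-contained, the JCR session is replaced by a simple counter stand-in; the names and batch size are illustrative:

```java
// Sketch of batched writes: instead of building all nodes in one transient
// space and saving once (which keeps everything in memory), save after every
// BATCH_SIZE nodes. The "session" here is a stand-in counter, not a real
// JCR session.
public class BatchWriter {
    static final int BATCH_SIZE = 1000;

    // Returns the number of save() calls needed to persist `total` nodes.
    static int writeAll(int total) {
        int saves = 0;
        int pending = 0;
        for (int i = 0; i < total; i++) {
            pending++;                 // stand-in for adding one node
            if (pending >= BATCH_SIZE) {
                saves++;               // stand-in for session.save()
                pending = 0;
            }
        }
        if (pending > 0) {
            saves++;                   // final save for the remainder
        }
        return saves;
    }

    public static void main(String[] args) {
        System.out.println(BatchWriter.writeAll(2500)); // 2500 nodes → 3 saves
    }
}
```

The same pattern applies to deleting a large subtree: remove and save it in slices rather than in one atomic operation.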
Slide 13: Clustering
- Good for horizontally scaling reads
  - Practically zero overhead on read access
- Not so good for heavy concurrent writes
  - Exclusive lock over the whole cluster
  - Direct all writes to a single master node
  - Leverage the data store
- Note the cluster sync interval for query consistency, etc.
  - Session.refresh() can be used to force a cluster sync

Slide 14: Query performance
- What's really fast?
  - Constraints on properties, node types, and full text
  - Typically O(n) where n is the number of results, not the total number of nodes
- What's pretty fast?
  - Path constraints
- What needs some planning?
  - Constraints on the child axis
  - Sorting, limit/offset
  - Joins
- What's not yet available?
  - Aggregate queries (COUNT, SUM, DISTINCT, etc.)
  - Faceting

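For reference, here is what the fast and the planning-required categories above look like in JCR-SQL2 form; the paths and property names are made up for illustration:

```sql
/* Really fast: property and node type constraints */
SELECT * FROM [nt:unstructured] WHERE [jcr:title] = 'Hello'

/* Needs some planning: a child axis constraint combined with sorting */
SELECT * FROM [nt:unstructured] AS n
WHERE ISCHILDNODE(n, '/content/site')
ORDER BY n.[jcr:created] DESC
```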
Slide 15: Join engine

```sql
SELECT a.* FROM [nt:unstructured] AS a JOIN [nt:unstructured] AS b
```

Slide 16: Indexing configuration
- Default configuration
  - Index all non-binary properties
  - Index binary jcr:data properties (think nt:file/nt:resource)
  - Full text extraction support for all major document formats
  - Full text extraction from images, packages, etc. is explicitly disabled
  - CQ5 / WEM comes with default aggregate indexing rules for cq:Pages, etc.
- Why change the configuration?
  - Reduce the index size (by default almost as large as the PM)
  - Enable features like aggregate indexes
  - Assign boost values to selected properties to improve search result relevance

Slide 17: Indexing configuration
- How to change the configuration?
  - An indexing_configuration.xml file in the workspace directory
  - Referenced by the indexingConfiguration option in the workspace.xml file
- Example:

```xml
<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "">
<configuration xmlns:jcr="" xmlns:nt="">
  <aggregate primaryType="nt:file">
    <include>jcr:content</include>
  </aggregate>
</configuration>
```

Slide 18: Questions and answers