Repository performance tuning

Presentation at the .adaptTo(Berlin) meetup

Published in: Technology

  1. 1. Jukka Zitting | Senior Developer<br />Repository performance tuning<br />
  2. 2. Agenda<br />Performance tuning steps<br />Repository internals<br />Basic content access<br />Batch processing<br />Clustering<br />Query performance<br />Indexing configuration<br />Questions and answers<br />
  3. 3. Performance tuning steps<br />Step 1: Identify the symptom<br />Create a test case that consistently measures current performance<br />Define the performance target if the current level is unacceptable<br />Make sure that the test case and the target performance are really relevant<br />Step 2: Identify the cause<br />Main suspects: Hardware, Repository, Application, Client<br />Revise the test case until the problem no longer occurs, using for example Selenium, JMeter, JUnit or Iometer<br />Step 3: Identify/implement possible solutions<br />Change content, configuration, code or upgrade hardware<br />Step 4: Verify results<br />If the target is not reached, iterate the process or revise the goal<br />
  4. 4. Repository internals<br />[Diagram: Data Store, Persistence Manager, Query Index, Cluster Journal]<br />
  5. 5. Data Store<br />Content-addressed storage for large binary properties<br />Arbitrarily sized binary streams<br />Addressed by content hash (SHA-1)<br />String properties not included; encode them as UTF-8 to store them as binaries<br />Fast delivery of binary content<br />Read directly from disk<br />Can also be read in ranges<br />Improved write throughput<br />Multiple uploads can proceed concurrently (within hardware limits)<br />Cheap copies<br />Garbage collection used to reclaim disk space<br />Logically shared by the entire cluster<br />
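A data store like the one described on this slide is enabled in repository.xml. The fragment below is a minimal sketch based on Jackrabbit's FileDataStore; the directory and the minRecordLength value are illustrative assumptions, not settings taken from the presentation.

```xml
<!-- In repository.xml, as a child of the <Repository> element -->
<DataStore class="org.apache.jackrabbit.core.data.FileDataStore">
  <!-- Directory that holds the content-addressed files (assumed location) -->
  <param name="path" value="${rep.home}/repository/datastore"/>
  <!-- Binaries smaller than this many bytes stay in the persistence manager -->
  <param name="minRecordLength" value="100"/>
</DataStore>
```

For a cluster, each node would likely point this path at the same shared location, which is what "logically shared" suggests.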
  6. 6. Cluster Journal<br />Journal of all persisted changes in the repository<br />Content changes<br />Namespace, nodetype registrations, etc.<br />Used to keep all cluster nodes in sync<br />Observation events to all cluster nodes (see JackrabbitEvent.isExternal)<br />Search index updates<br />Internal cache invalidation<br />Old events need to be discarded eventually<br />No notable performance impact, just extra disk space<br />Keep events for the longest possible time a node can be offline without getting completely recreated<br />Logically shared by the entire cluster<br />Writes synchronized over the entire cluster<br />
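One way to discard old journal events, assuming a database-backed journal, is the clean-up ("janitor") thread of Jackrabbit's DatabaseJournal. This is a sketch only: the cluster id and the timing values are illustrative, and the database connection parameters are left out.

```xml
<!-- In repository.xml; database connection parameters omitted -->
<Cluster id="node1">
  <Journal class="org.apache.jackrabbit.core.journal.DatabaseJournal">
    <param name="revision" value="${rep.home}/revision.log"/>
    <!-- Periodically remove journal records that all registered cluster nodes have processed -->
    <param name="janitorEnabled" value="true"/>
    <!-- Seconds between clean-up runs (here: once a day) -->
    <param name="janitorSleep" value="86400"/>
    <!-- Hour of day at which the first clean-up run starts -->
    <param name="janitorFirstRunHourOfDay" value="3"/>
  </Journal>
</Cluster>
```

Because only records already seen by every registered node are removed, a node that stays offline for a long time can hold back the clean-up.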
  7. 7. Persistence Manager<br />Identifier-addressed storage for nodes and properties<br />Each node has a UUID, even if not mix:referenceable<br />Essentially a key-value store, even when backed by an RDBMS<br />Also keeps track of node references<br />Bundles as units of content<br />Bundle = UUID, type, properties, child node references, etc.<br />Only large binaries stored elsewhere in the data store<br />Designed for balanced content hierarchies, avoid too many child nodes<br />Atomic updates<br />A save() call persists the entire transient space as a single atomic operation<br />One PM per workspace (and one for the shared version store)<br />Logically (often also physically) shared across a cluster<br />
  8. 8. Query Index<br />Inverted index based on Apache Lucene<br />Flexible mapping from terms to node identifiers<br />Special handling for the path structure<br />Mostly synchronous index updates<br />Long full text extraction tasks handled in background<br />Other cluster nodes will update their indexes at next cluster sync<br />Everything indexed by default<br />Indexing configuration for tweaking functionality, performance and disk usage<br />One index per workspace (and one for the shared version store)<br />Not shared across a cluster, indexes are local to each cluster node<br />See http://wiki.apache.org/jackrabbit/Search#Search_Configuration<br />
  9. 9. Agenda<br />Performance tuning steps<br />Repository internals<br />Basic content access<br />Batch processing<br />Clustering<br />Query performance<br />Indexing configuration<br />Questions and answers<br />
  10. 10. Basic content access<br />Very fast access by path and ID<br />Underlying storage addressed by ID, but path traversal is in any case needed for ACL checks<br />Relevant caches:<br />Path to ID map (internal structure, not configurable)<br />Item state caches (automatically balanced, configurable for special cases)<br />Bundle cache (default fairly low, increase for large deployments)<br />Also some PM-specific options (TarPM index, etc.)<br />Caches optimized for a reasonably sized active working set<br />Typical web access pattern: handful of key resources and a long tail of less frequently accessed content, few writes<br />Performance hit especially when updating nodes with lots of child nodes<br />FineGrainedISMLocking for concurrent, non-overlapping writes<br />
  11. 11. Example: Bundle cache configuration<br /><!-- In …/repository/workspaces/${wsp.name}/workspace.xml --><br /><Workspace …><br /> <PersistenceManager class="…"><br /> <param name="bundleCacheSize" value="8"/><br /> </PersistenceManager><br /></Workspace><br />
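FineGrainedISMLocking, mentioned on the basic content access slide, is also switched on in workspace.xml. The following is a minimal sketch; the placement after the PersistenceManager element is an assumption about the element order Jackrabbit's configuration DTD expects, and the elided parts are kept as on the slide.

```xml
<!-- In …/repository/workspaces/${wsp.name}/workspace.xml -->
<Workspace …>
  <PersistenceManager class="…">
    <param name="bundleCacheSize" value="8"/>
  </PersistenceManager>
  <!-- Allow concurrent writes that touch non-overlapping items -->
  <ISMLocking class="org.apache.jackrabbit.core.state.FineGrainedISMLocking"/>
</Workspace>
```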
  12. 12. Batch processing<br />Two issues: read and write<br />Reading lots of content<br />Tree traversal is the best approach, but will flood caches<br />Schedule for off-peak times<br />Add explicit delay (used by the garbage collectors)<br />Use a dedicated cluster node for batch processing<br />Writing lots of content (including deleting large subtrees)<br />The entire transient space is kept in memory and committed atomically<br />Split the operation into smaller pieces<br />Save after every ~1k nodes<br />Leverage the data store if possible<br />
  13. 13. Clustering<br />Good for horizontally scaling reads<br />Practically zero overhead on read access<br />Not so good for heavy concurrent writes<br />Exclusive lock over the whole cluster<br />Direct all writes to a single master node<br />Leverage the data store<br />Note the cluster sync interval for query consistency, etc.<br />Session.refresh() can be used to force a cluster sync<br />
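The cluster sync interval noted above is set with the syncDelay attribute of the Cluster element. This is a minimal sketch with a file-based journal; the node id, the two-second delay and the shared directory are illustrative assumptions.

```xml
<!-- In repository.xml on each cluster node; the id must be unique per node -->
<Cluster id="node1" syncDelay="2000">
  <Journal class="org.apache.jackrabbit.core.journal.FileJournal">
    <!-- Local file that records how far this node has read the journal -->
    <param name="revision" value="${rep.home}/revision.log"/>
    <!-- Journal directory shared by all cluster nodes (assumed mount point) -->
    <param name="directory" value="/nfs/myserver/myjournal"/>
  </Journal>
</Cluster>
```

A shorter syncDelay narrows the window in which another node's queries can return stale results, at the cost of more frequent journal reads.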
  14. 14. Query performance<br />What’s really fast?<br />Constraints on properties, node types, full text<br />Typically O(n) where n is the number of results, rather than the total number of nodes<br />What’s pretty fast?<br />Path constraints<br />What needs some planning?<br />Constraints on the child axis<br />Sorting, limit/offset<br />Joins<br />What’s not yet available?<br />Aggregate queries (COUNT, SUM, DISTINCT, etc.)<br />Faceting<br />
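Part of the planning for sorting and large result sets can happen in the SearchIndex configuration. The sketch below assumes two Jackrabbit SearchIndex parameters, respectDocumentOrder and resultFetchSize; the values are illustrative and would need to be measured against a real test case.

```xml
<!-- In workspace.xml -->
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- Do not sort results into document order when the query has no ORDER BY -->
  <param name="respectDocumentOrder" value="false"/>
  <!-- Number of results fetched initially when a query is executed -->
  <param name="resultFetchSize" value="50"/>
</SearchIndex>
```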
  15. 15. Join engine<br />SELECT a.* FROM [nt:unstructured] AS a JOIN [nt:unstructured] AS b<br />
  16. 16. Indexing configuration<br />Default configuration<br />Index all non-binary properties<br />Index binary jcr:data properties (think nt:file/nt:resource)<br />Full text extraction support for all major document formats<br />Full text extraction from images, packages, etc. is explicitly disabled<br />CQ5 / WEM comes with default aggregate indexing rules for cq:Pages, etc.<br />Why change the configuration?<br />Reduce the index size (by default almost as large as the PM)<br />Enable features like aggregate indexes<br />Assign boost values for selected properties to improve search result relevance<br />
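Two of these reasons, a smaller index and boosted properties, can be illustrated with an index rule. This is a sketch only: the property names title and text are hypothetical, and the boost value is arbitrary.

```xml
<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://jackrabbit.apache.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <!-- For nt:unstructured nodes, index only the properties listed here -->
  <index-rule nodeType="nt:unstructured">
    <!-- Matches in "title" count more towards relevance (hypothetical property) -->
    <property boost="3.0">title</property>
    <property>text</property>
  </index-rule>
</configuration>
```

Leaving other properties out of the rule is what shrinks the index, so queries that constrain those properties would stop matching these nodes.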
  17. 17. Indexing configuration<br />How to change the configuration?<br />indexing_configuration.xml file in the workspace directory<br />Referenced by the indexingConfiguration option in the workspace.xml file<br />See http://wiki.apache.org/jackrabbit/IndexingConfiguration<br />Example:<br /><?xml version="1.0"?><br /><!DOCTYPE configuration SYSTEM "http://jackrabbit.apache.org/dtd/indexing-configuration-1.0.dtd"><br /><configuration xmlns:jcr="http://www.jcp.org/jcr/1.0" xmlns:nt="http://www.jcp.org/jcr/nt/1.0"><br /> <aggregate primaryType="nt:file"><br /> <include>jcr:content</include><br /> </aggregate><br /></configuration><br />
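The reference from workspace.xml to the indexing configuration file could look roughly like this. It is a minimal sketch, assuming the file sits in the workspace home directory as the slide describes; the index path is illustrative.

```xml
<!-- In workspace.xml -->
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- Points the index at the rules in indexing_configuration.xml -->
  <param name="indexingConfiguration" value="${wsp.home}/indexing_configuration.xml"/>
</SearchIndex>
```

Changed rules generally apply only to content indexed afterwards, so existing content typically needs to be re-indexed before the new configuration takes full effect.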
  18. 18. Questions and answers<br />
