SlideShare a Scribd company logo
1 of 19
Jukka Zitting  |  Senior Developer Repository performance tuning
Agenda Performance tuning steps Repository internals Basic content access Batch processing Clustering Query performance Full text indexing Questions and answers 2
Performance tuning steps Step 1: Identify the symptom Create a test case that consistently measures current performance Define the performance target if current level unacceptable Make sure that the test case and the target performance are really relevant Step 2: Identify the cause Main suspects: Hardware, Repository, Application, Client Revise the test case until the problem no longer occurs;for example: Selenium, JMeter, JUnit, Iometer Step 3: Identify/implement possible solutions Change content, configuration, code or upgrade hardware Step 4: Verify results If target not reached, iterate the process or revise the goal 3
Repository internals 4 Data Store Persistence Manager Query Index Cluster Journal
Data Store Content-addressed storage for large binary properties Arbitrarily sized binary streams Addressed by MD5 hash String properties not included, use UTF-8 to map to binary Fast delivery of binary content Read directly from disk Can also be read in ranges Improved write throughput Multiple uploads can proceed concurrently (within hardware limits) Cheap copies Garbage collection used to reclaim disk space Logically shared by the entire cluster 5 Data Store
Cluster Journal Journal of all persisted changes in the repository Content changes Namespace, nodetype registrations, etc. Used to keep all cluster nodes in sync Observation events to all cluster nodes (see JackrabbitEvent.isExternal) Search index updates Internal cache invalidation Old events need to be discarded eventually No notable performance impact, just extra disk space Keep events for the longest possible time a node can be offline without getting completely recreated Logically shared by the entire cluster Writes synchronized over the entire cluster 6 Cluster Journal
Persistence Manager Identifier-addressed storage for nodes and properties Each node has a UUID, even if not mix:referenceable Essentially a key-value store, even when backed by a RDBMS Also keeps track of node references Bundles as units of content Bundle = UUID, type, properties, child node references, etc. Only large binaries stored elsewhere in the data store Designed for balanced content hierarchies, avoid too many child nodes Atomic updates A save() call persists the entire transient space as a single atomic operation One PM per workspace (and one for the shared version store) Logically (often also physically) shared across a cluster 7 Persistence Manager
Query Index Inverse index based on Apache Lucene Flexible mapping from terms to node identifiers Special handling for the path structure Mostly synchronous index updates Long full text extraction tasks handled in background Other cluster nodes will update their indexes at next cluster sync  Everything indexed by default Indexing configuration for tweaking functionality, performance and disk usage One index per workspace (and one for the shared version store) Not shared across a cluster, indexes are local to each cluster node See http://wiki.apache.org/jackrabbit/Search#Search_Configuration 8 Query Index
Agenda Performance tuning steps Repository internals Basic content access Batch processing Clustering Query performance Indexing configuration Questions and answers 9
Basic content access Very fast access by path and ID Underlying storage addressed by ID, but path traversal is in any case needed for ACL checks Relevant caches: Path to ID map (internal structure, not configurable) Item state caches (automatically balanced, configurable for special cases) Bundle cache (default fairly low, increase for large deployments) Also some PM-specific options (TarPM index, etc.) Caches optimized for a reasonably sized active working set typical web access pattern: handful of key resources and a long tail of less frequently accessed content, few writes Performance hit especially when updating nodes with lots of child nodes FineGrainedISMLocking for concurrent, non-overlapping writes 10
Example: Bundle cache configuration 11 <!-- In …/repository/worspaces/${wsp.name}/workspace.xml --> <Workspace …>   <PersistenceManager class=“…">   <paramname="bundleCacheSize" value="8"/>   </PersistenceManager> </Workspace>
Batch processing Two issues: read and write Reading lots of content Tree traversal the best approach, but will flood caches Schedule for off-peak times Add explicit delay (used by the garbage collectors) Use a dedicated cluster node for batch processing Writing lots of content (including deleting large subtrees) The entire transient space is kept in memory and committed atomically Split the operation to smaller pieces Save after every ~1k nodes Leverage the data store if possible 12
Clustering Good for horizontally scaling reads Practically zero overhead on read access Not so good for heavy concurrent writes Exclusive lock over the whole cluster Direct all writes to a single master node Leverage the data store Note the cluster sync interval for query consistency, etc. Session.refresh() can be used to force a cluster sync 13
Query performance What’s really fast? Constraints on properties, node types, full text Typically O(n) where n is the number of results, vs. the total number of nodes  What’s pretty fast? Path constraints What needs some planning? Constraints on the child axis Sorting, limit/offset  Joins What’s not yet available? Aggregate queries (COUNT, SUM, DISTINCT, etc.) Faceting 14
Join engine 15 SELECT a.* FROM [nt:unstructured] AS a JOIN [nt:unstructured] AS b   <PersistenceManager class=“…">   <paramname="bundleCacheSize" value="8"/>   </PersistenceManager> </Workspace>
Indexing configuration Default configuration Index all non-binary properties Index binary jcr:data properties (think nt:file/nt:resource) Full text extraction support for all major document formats Full text extraction from images, packages, etc. is explicitly disabled CQ5 / WEM comes with default aggregate indexing rules for cq:Pages, etc. Why change the configuration? Reduce the index size (by default almost as large as the PM) Enable features like aggregate indexes Assign boost values for selected properties to improve search result relevance 16
Indexing configuration How to change the configuration? indexing_configuration.xml file in the workspace directory Referenced by the indexingConfiguration option in the workspace.xml file See http://wiki.apache.org/jackrabbit/IndexingConfiguration Example: 17 <?xml version="1.0"?><!DOCTYPE configuration SYSTEM "http://jackrabbit.apache.org/dtd/indexing-configuration-1.0.dtd"><configuration xmlns:jcr="http://www.jcp.org/jcr/1.0" xmlns:nt="http://www.jcp.org/jcr/nt/1.0">   <aggregateprimaryType="nt:file">    <include>jcr:content</include>  </aggregate> </configuration>
Question and Answers 18
Repository performance tuning

More Related Content

What's hot

RFID Technologies for Inventory Management
RFID Technologies for Inventory ManagementRFID Technologies for Inventory Management
RFID Technologies for Inventory Management
sabdulaz
 
Business Intelligence Resume - Sam Kamara
Business Intelligence Resume - Sam KamaraBusiness Intelligence Resume - Sam Kamara
Business Intelligence Resume - Sam Kamara
skamara1
 

What's hot (20)

RFID Technologies for Inventory Management
RFID Technologies for Inventory ManagementRFID Technologies for Inventory Management
RFID Technologies for Inventory Management
 
Easy Microservices with JHipster - Devoxx BE 2017
Easy Microservices with JHipster - Devoxx BE 2017Easy Microservices with JHipster - Devoxx BE 2017
Easy Microservices with JHipster - Devoxx BE 2017
 
ATG Advanced RQL
ATG Advanced RQLATG Advanced RQL
ATG Advanced RQL
 
Abdul ETL Resume
Abdul ETL ResumeAbdul ETL Resume
Abdul ETL Resume
 
Data Driven Framework in Selenium
Data Driven Framework in SeleniumData Driven Framework in Selenium
Data Driven Framework in Selenium
 
Oracle APEX Introduction (release 18.1)
Oracle APEX Introduction (release 18.1)Oracle APEX Introduction (release 18.1)
Oracle APEX Introduction (release 18.1)
 
Solace Singapore User Group: Dell Boomi Presentation
Solace Singapore User Group: Dell Boomi PresentationSolace Singapore User Group: Dell Boomi Presentation
Solace Singapore User Group: Dell Boomi Presentation
 
Oracle Endeca 101 Developer Introduction High Level Overview
Oracle Endeca 101 Developer Introduction High Level OverviewOracle Endeca 101 Developer Introduction High Level Overview
Oracle Endeca 101 Developer Introduction High Level Overview
 
Business Intelligence Resume - Sam Kamara
Business Intelligence Resume - Sam KamaraBusiness Intelligence Resume - Sam Kamara
Business Intelligence Resume - Sam Kamara
 
IndabaX Ghana Poster.pdf
IndabaX Ghana Poster.pdfIndabaX Ghana Poster.pdf
IndabaX Ghana Poster.pdf
 
Oracle Access Manager Overview
Oracle Access Manager OverviewOracle Access Manager Overview
Oracle Access Manager Overview
 
An introduction to QuerySurge webinar
An introduction to QuerySurge webinarAn introduction to QuerySurge webinar
An introduction to QuerySurge webinar
 
Telerik
TelerikTelerik
Telerik
 
Overview of atg framework
Overview of atg frameworkOverview of atg framework
Overview of atg framework
 
7 data warehouse & marts
7 data warehouse & marts7 data warehouse & marts
7 data warehouse & marts
 
Introduction to Android Development
Introduction to Android DevelopmentIntroduction to Android Development
Introduction to Android Development
 
Module 3 - QuickSight Overview
Module 3 - QuickSight OverviewModule 3 - QuickSight Overview
Module 3 - QuickSight Overview
 
Business Transactions with AppDynamics
Business Transactions with AppDynamicsBusiness Transactions with AppDynamics
Business Transactions with AppDynamics
 
Rise of the Data Cloud
Rise of the Data CloudRise of the Data Cloud
Rise of the Data Cloud
 
Seminar on mobile application development with android
Seminar on mobile application development with androidSeminar on mobile application development with android
Seminar on mobile application development with android
 

Viewers also liked

MicroKernel & NodeStore
MicroKernel & NodeStoreMicroKernel & NodeStore
MicroKernel & NodeStore
Jukka Zitting
 
The return of the hierarchical model
The return of the hierarchical modelThe return of the hierarchical model
The return of the hierarchical model
Jukka Zitting
 
Open source masterclass - Life in the Apache Incubator
Open source masterclass - Life in the Apache IncubatorOpen source masterclass - Life in the Apache Incubator
Open source masterclass - Life in the Apache Incubator
Jukka Zitting
 
Content extraction with apache tika
Content extraction with apache tikaContent extraction with apache tika
Content extraction with apache tika
Jukka Zitting
 
Content Management With Apache Jackrabbit
Content Management With Apache JackrabbitContent Management With Apache Jackrabbit
Content Management With Apache Jackrabbit
Jukka Zitting
 

Viewers also liked (20)

Apache Jackrabbit @ Swiss Open Source Awards 2011
Apache Jackrabbit @ Swiss Open Source Awards 2011Apache Jackrabbit @ Swiss Open Source Awards 2011
Apache Jackrabbit @ Swiss Open Source Awards 2011
 
OSGifying the repository
OSGifying the repositoryOSGifying the repository
OSGifying the repository
 
Oak, the architecture of Apache Jackrabbit 3
Oak, the architecture of Apache Jackrabbit 3Oak, the architecture of Apache Jackrabbit 3
Oak, the architecture of Apache Jackrabbit 3
 
MicroKernel & NodeStore
MicroKernel & NodeStoreMicroKernel & NodeStore
MicroKernel & NodeStore
 
The return of the hierarchical model
The return of the hierarchical modelThe return of the hierarchical model
The return of the hierarchical model
 
Open source masterclass - Life in the Apache Incubator
Open source masterclass - Life in the Apache IncubatorOpen source masterclass - Life in the Apache Incubator
Open source masterclass - Life in the Apache Incubator
 
/path/to/content - the Apache Jackrabbit content repository
/path/to/content - the Apache Jackrabbit content repository/path/to/content - the Apache Jackrabbit content repository
/path/to/content - the Apache Jackrabbit content repository
 
Apache development with GitHub and Travis CI
Apache development with GitHub and Travis CIApache development with GitHub and Travis CI
Apache development with GitHub and Travis CI
 
Content extraction with apache tika
Content extraction with apache tikaContent extraction with apache tika
Content extraction with apache tika
 
Content Management With Apache Jackrabbit
Content Management With Apache JackrabbitContent Management With Apache Jackrabbit
Content Management With Apache Jackrabbit
 
The new repository in AEM 6
The new repository in AEM 6The new repository in AEM 6
The new repository in AEM 6
 
Enterprise Manager: Write powerful scripts with EMCLI
Enterprise Manager: Write powerful scripts with EMCLIEnterprise Manager: Write powerful scripts with EMCLI
Enterprise Manager: Write powerful scripts with EMCLI
 
JCR, Sling or AEM? Which API should I use and when?
JCR, Sling or AEM? Which API should I use and when?JCR, Sling or AEM? Which API should I use and when?
JCR, Sling or AEM? Which API should I use and when?
 
Oracle Enterprise Manager Cloud Control 13c for DBAs
Oracle Enterprise Manager Cloud Control 13c for DBAsOracle Enterprise Manager Cloud Control 13c for DBAs
Oracle Enterprise Manager Cloud Control 13c for DBAs
 
新浪云平台的经验和教训
新浪云平台的经验和教训新浪云平台的经验和教训
新浪云平台的经验和教训
 
Good Luck
Good LuckGood Luck
Good Luck
 
Shakespeare revealed 02.ppt
Shakespeare revealed 02.pptShakespeare revealed 02.ppt
Shakespeare revealed 02.ppt
 
Marek
MarekMarek
Marek
 
Digital thinking
Digital thinkingDigital thinking
Digital thinking
 
Open Cultuur Data Masterclass #3 - Open State - Lex Slaghuis
Open Cultuur Data Masterclass #3 - Open State - Lex SlaghuisOpen Cultuur Data Masterclass #3 - Open State - Lex Slaghuis
Open Cultuur Data Masterclass #3 - Open State - Lex Slaghuis
 

Similar to Repository performance tuning

Optimizing your java applications for multi core hardware
Optimizing your java applications for multi core hardwareOptimizing your java applications for multi core hardware
Optimizing your java applications for multi core hardware
IndicThreads
 
Planning for-high-performance-web-application
Planning for-high-performance-web-applicationPlanning for-high-performance-web-application
Planning for-high-performance-web-application
Nguyễn Duy Nhân
 
Drupal Backend Performance and Scalability
Drupal Backend Performance and ScalabilityDrupal Backend Performance and Scalability
Drupal Backend Performance and Scalability
Ashok Modi
 
Ch9 OS
Ch9 OSCh9 OS
Ch9 OS
C.U
 

Similar to Repository performance tuning (20)

Overview of MongoDB and Other Non-Relational Databases
Overview of MongoDB and Other Non-Relational DatabasesOverview of MongoDB and Other Non-Relational Databases
Overview of MongoDB and Other Non-Relational Databases
 
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
 
Performance and predictability
Performance and predictabilityPerformance and predictability
Performance and predictability
 
Ch8
Ch8Ch8
Ch8
 
NoSQL Introduction, Theory, Implementations
NoSQL Introduction, Theory, ImplementationsNoSQL Introduction, Theory, Implementations
NoSQL Introduction, Theory, Implementations
 
IntelliJ IDEA Architecture and Performance
IntelliJ IDEA Architecture and PerformanceIntelliJ IDEA Architecture and Performance
IntelliJ IDEA Architecture and Performance
 
Optimizing your java applications for multi core hardware
Optimizing your java applications for multi core hardwareOptimizing your java applications for multi core hardware
Optimizing your java applications for multi core hardware
 
Planning for-high-performance-web-application
Planning for-high-performance-web-applicationPlanning for-high-performance-web-application
Planning for-high-performance-web-application
 
Apache ignite as in-memory computing platform
Apache ignite as in-memory computing platformApache ignite as in-memory computing platform
Apache ignite as in-memory computing platform
 
Unit-4 swapping.pptx
Unit-4 swapping.pptxUnit-4 swapping.pptx
Unit-4 swapping.pptx
 
Performance and predictability
Performance and predictabilityPerformance and predictability
Performance and predictability
 
Climbing the beanstalk
Climbing the beanstalkClimbing the beanstalk
Climbing the beanstalk
 
Drupal Backend Performance and Scalability
Drupal Backend Performance and ScalabilityDrupal Backend Performance and Scalability
Drupal Backend Performance and Scalability
 
tittle
tittletittle
tittle
 
OS_Ch9
OS_Ch9OS_Ch9
OS_Ch9
 
OSCh9
OSCh9OSCh9
OSCh9
 
Ch9 OS
Ch9 OSCh9 OS
Ch9 OS
 
Chapter 8 - Main Memory
Chapter 8 - Main MemoryChapter 8 - Main Memory
Chapter 8 - Main Memory
 
FOWA Scaling The Lamp Stack Workshop
FOWA Scaling The Lamp Stack WorkshopFOWA Scaling The Lamp Stack Workshop
FOWA Scaling The Lamp Stack Workshop
 
Main memory os - prashant odhavani- 160920107003
Main memory   os - prashant odhavani- 160920107003Main memory   os - prashant odhavani- 160920107003
Main memory os - prashant odhavani- 160920107003
 

More from Jukka Zitting

Text and metadata extraction with Apache Tika
Text and metadata extraction with Apache TikaText and metadata extraction with Apache Tika
Text and metadata extraction with Apache Tika
Jukka Zitting
 
Content Storage With Apache Jackrabbit
Content Storage With Apache JackrabbitContent Storage With Apache Jackrabbit
Content Storage With Apache Jackrabbit
Jukka Zitting
 
Introduction to JCR and Apache Jackrabbi
Introduction to JCR and Apache JackrabbiIntroduction to JCR and Apache Jackrabbi
Introduction to JCR and Apache Jackrabbi
Jukka Zitting
 

More from Jukka Zitting (9)

Text and metadata extraction with Apache Tika
Text and metadata extraction with Apache TikaText and metadata extraction with Apache Tika
Text and metadata extraction with Apache Tika
 
Mime Magic With Apache Tika
Mime Magic With Apache TikaMime Magic With Apache Tika
Mime Magic With Apache Tika
 
NoSQL Oakland
NoSQL OaklandNoSQL Oakland
NoSQL Oakland
 
Content Storage With Apache Jackrabbit
Content Storage With Apache JackrabbitContent Storage With Apache Jackrabbit
Content Storage With Apache Jackrabbit
 
Introduction to JCR and Apache Jackrabbi
Introduction to JCR and Apache JackrabbiIntroduction to JCR and Apache Jackrabbi
Introduction to JCR and Apache Jackrabbi
 
File System On Steroids
File System On SteroidsFile System On Steroids
File System On Steroids
 
Mime Magic With Apache Tika
Mime Magic With Apache TikaMime Magic With Apache Tika
Mime Magic With Apache Tika
 
Design and architecture of Jackrabbit
Design and architecture of JackrabbitDesign and architecture of Jackrabbit
Design and architecture of Jackrabbit
 
Apache Tika
Apache TikaApache Tika
Apache Tika
 

Recently uploaded

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Recently uploaded (20)

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 

Repository performance tuning

  • 1. Jukka Zitting | Senior Developer Repository performance tuning
  • 2. Agenda Performance tuning steps Repository internals Basic content access Batch processing Clustering Query performance Full text indexing Questions and answers 2
  • 3. Performance tuning steps Step 1: Identify the symptom Create a test case that consistently measures current performance Define the performance target if current level unacceptable Make sure that the test case and the target performance are really relevant Step 2: Identify the cause Main suspects: Hardware, Repository, Application, Client Revise the test case until the problem no longer occurs;for example: Selenium, JMeter, JUnit, Iometer Step 3: Identify/implement possible solutions Change content, configuration, code or upgrade hardware Step 4: Verify results If target not reached, iterate the process or revise the goal 3
  • 4. Repository internals 4 Data Store Persistence Manager Query Index Cluster Journal
  • 5. Data Store Content-addressed storage for large binary properties Arbitrarily sized binary streams Addressed by MD5 hash String properties not included, use UTF-8 to map to binary Fast delivery of binary content Read directly from disk Can also be read in ranges Improved write throughput Multiple uploads can proceed concurrently (within hardware limits) Cheap copies Garbage collection used to reclaim disk space Logically shared by the entire cluster 5 Data Store
  • 6. Cluster Journal Journal of all persisted changes in the repository Content changes Namespace, nodetype registrations, etc. Used to keep all cluster nodes in sync Observation events to all cluster nodes (see JackrabbitEvent.isExternal) Search index updates Internal cache invalidation Old events need to be discarded eventually No notable performance impact, just extra disk space Keep events for the longest possible time a node can be offline without getting completely recreated Logically shared by the entire cluster Writes synchronized over the entire cluster 6 Cluster Journal
  • 7. Persistence Manager Identifier-addressed storage for nodes and properties Each node has a UUID, even if not mix:referenceable Essentially a key-value store, even when backed by a RDBMS Also keeps track of node references Bundles as units of content Bundle = UUID, type, properties, child node references, etc. Only large binaries stored elsewhere in the data store Designed for balanced content hierarchies, avoid too many child nodes Atomic updates A save() call persists the entire transient space as a single atomic operation One PM per workspace (and one for the shared version store) Logically (often also physically) shared across a cluster 7 Persistence Manager
  • 8. Query Index Inverse index based on Apache Lucene Flexible mapping from terms to node identifiers Special handling for the path structure Mostly synchronous index updates Long full text extraction tasks handled in background Other cluster nodes will update their indexes at next cluster sync Everything indexed by default Indexing configuration for tweaking functionality, performance and disk usage One index per workspace (and one for the shared version store) Not shared across a cluster, indexes are local to each cluster node See http://wiki.apache.org/jackrabbit/Search#Search_Configuration 8 Query Index
  • 9. Agenda Performance tuning steps Repository internals Basic content access Batch processing Clustering Query performance Indexing configuration Questions and answers 9
  • 10. Basic content access Very fast access by path and ID Underlying storage addressed by ID, but path traversal is in any case needed for ACL checks Relevant caches: Path to ID map (internal structure, not configurable) Item state caches (automatically balanced, configurable for special cases) Bundle cache (default fairly low, increase for large deployments) Also some PM-specific options (TarPM index, etc.) Caches optimized for a reasonably sized active working set typical web access pattern: handful of key resources and a long tail of less frequently accessed content, few writes Performance hit especially when updating nodes with lots of child nodes FineGrainedISMLocking for concurrent, non-overlapping writes 10
  • 11. Example: Bundle cache configuration 11 <!-- In …/repository/worspaces/${wsp.name}/workspace.xml --> <Workspace …> <PersistenceManager class=“…"> <paramname="bundleCacheSize" value="8"/> </PersistenceManager> </Workspace>
  • 12. Batch processing Two issues: read and write Reading lots of content Tree traversal the best approach, but will flood caches Schedule for off-peak times Add explicit delay (used by the garbage collectors) Use a dedicated cluster node for batch processing Writing lots of content (including deleting large subtrees) The entire transient space is kept in memory and committed atomically Split the operation to smaller pieces Save after every ~1k nodes Leverage the data store if possible 12
  • 13. Clustering Good for horizontally scaling reads Practically zero overhead on read access Not so good for heavy concurrent writes Exclusive lock over the whole cluster Direct all writes to a single master node Leverage the data store Note the cluster sync interval for query consistency, etc. Session.refresh() can be used to force a cluster sync 13
  • 14. Query performance What’s really fast? Constraints on properties, node types, full text Typically O(n) where n is the number of results, vs. the total number of nodes What’s pretty fast? Path constraints What needs some planning? Constraints on the child axis Sorting, limit/offset Joins What’s not yet available? Aggregate queries (COUNT, SUM, DISTINCT, etc.) Faceting 14
  • 15. Join engine 15 SELECT a.* FROM [nt:unstructured] AS a JOIN [nt:unstructured] AS b <PersistenceManager class=“…"> <paramname="bundleCacheSize" value="8"/> </PersistenceManager> </Workspace>
  • 16. Indexing configuration Default configuration Index all non-binary properties Index binary jcr:data properties (think nt:file/nt:resource) Full text extraction support for all major document formats Full text extraction from images, packages, etc. is explicitly disabled CQ5 / WEM comes with default aggregate indexing rules for cq:Pages, etc. Why change the configuration? Reduce the index size (by default almost as large as the PM) Enable features like aggregate indexes Assign boost values for selected properties to improve search result relevance 16
  • 17. Indexing configuration How to change the configuration? indexing_configuration.xml file in the workspace directory Referenced by the indexingConfiguration option in the workspace.xml file See http://wiki.apache.org/jackrabbit/IndexingConfiguration Example: 17 <?xml version="1.0"?><!DOCTYPE configuration SYSTEM "http://jackrabbit.apache.org/dtd/indexing-configuration-1.0.dtd"><configuration xmlns:jcr="http://www.jcp.org/jcr/1.0" xmlns:nt="http://www.jcp.org/jcr/nt/1.0"> <aggregateprimaryType="nt:file"> <include>jcr:content</include> </aggregate> </configuration>