Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Sizing your Alfresco 
platform 
Luis Cabaceira 
Technical Consultant
Agenda 
• Who am I 
• Presentation Objectives 
• Sizing Key questions 
• Sample Architecture 
• Sizing the different appli...
About me 
• Luis Cabaceira, Technical Consultant at 
Alfresco 
• 13 years of field experience 
• Helped clients in various...
Objectives of the presentation 
• Explain sizing key questions that will drive 
sizing decisions 
• Some Architecture best...
Considerations 
• The information provided on this presentation 
is based on Alfresco internal benchmarks as 
well as our ...
Sizing Key Information
3 required sizing topics 
• The most important topics are 
• Use Case 
• Concurrent Users 
• Repository documents 
• In ma...
Identify the Use case
Concurrent Users
Repository documents
Extra sizing topics 
The more information you have about your 
scenario the more accurate will be your 
sizing predictions...
Architecture
Authority structure
Operations
Components,Protocols and Apis
Batch processes 
• Approximate numbers and and types of 
batch jobs 
• workflows 
• scheduled processes 
• transformations...
Customizations and Integrations
Response Times Requirements
There is no perfect formula
Reference Architecture Study
Database Sizing 
All operations in Alfresco require a database 
connection, the database performance plays a 
crucial role...
Database Size calculation 
D = Number of Documents + Avg. number of 
versions 
DA = Average number of Metadata fields per ...
Database Performance is Vital
User facing nodes sizing
User facing nodes sizing 
The repository user-facing cluster is dedicated on 
serving the usual user requests (reads, manu...
User facing nodes sizing factors 
Operations: Percentage of browse, download and 
write operations 
Components, Protocols ...
User facing nodes calculations 
Considering our field experience and our internal 
benchmark values and assuming we assign...
User facing nodes calculations 
By default Alfresco has only 1 thread configured for 
libreoffice, but the number of threa...
Dedicated tracking nodes sizing
Dedicated tracking nodes sizing 
The dedicated tracking nodes exist on each Solr 
server and they are the only reference t...
Dedicated tracking nodes sizing 
Normally these nodes do not have high requirements 
on memory and CPU but sizing will dep...
Solr nodes sizing
Solr nodes sizing factors 
Authority Structure 
Recent benchmarks comparisons that authority 
structure has a direct and i...
Solr nodes sizing factors 
Repository Size 
Important to notice that repository sizes should 
mostly only affect average r...
Solr nodes Calculations 
Use the online formula published by alfresco to 
calculate the necessary memory for your solr 
no...
Solr nodes Calculations 
To calculate the number of Solr nodes on your 
cluster you should consider the search response 
t...
Solr Indexes 
You can expect the size of your indexes (when 
using FTI) to be between 40-60% the estimated 
size of your c...
Content Ingestion nodes sizing
Bulk ingestion nodes sizing 
Sizing the nodes dedicated to content ingestion is 
very similar to sizing the tracking nodes...
Bulk ingestion nodes tuning 
Disable unneeded services, many standard services 
are not required when bulk loading content...
Content Store Sizing
Content Store Sizing 
To size your content store consider 
• Number of documents (ND) 
• Number of Versions (NV) 
• Averag...
Sample Real Life numbers 
SIZING AREA INFORMATION 
USE CASE Collaboration 
CONCURRENT USERS 600 
REPOSITORY 4M docs/ Avg d...
Sample Real Life numbers 
ALFRESCO NODES CPU JVM MEMORY 
2 2 Quad-Core 8 GB 
SOLR NODES CPU JVM MEMORY 
2 1 Quad-Core 20GB
2 Common use cases 
Collaboration Backend Repository
Collaboration vs Backend Repo 
Collaboration Scenario 
• Search is usually just a 
small portion of the 
operations percen...
Collaboration vs Backend Repo 
Collaboration Scenario 
• Repository Sizes are 
usually of small or 
intermediate size. 
• ...
Collaboration vs Backend Repo 
Collaboration Scenario 
• Concurrent users are 
normally quite high 
• Share interface is u...
Collaboration vs Backend Repo 
Collaboration Scenario 
• Batch operations should 
mostly be around human 
interaction work...
Database Thread Pool 
A default Alfresco instance is configured to use up to 
a maximum of forty (40) database connections...
Database Scaling 
Alfresco relies largely on a fast and highly 
transactional interaction with the RDBMS, so the 
health o...
Some Golden Rules
User facing nodes Golden Rules 
Browsing the repository using a client (share or 
explorer) performs Acl checks on each it...
User facing nodes Golden Rules 
User operations take application server threads. The 
more concurrency the more threads wi...
User facing nodes Golden Rules 
Customizations on the repository should be carefully 
analyzed for memory and cpu consumpt...
Golden Rules – Solr memory 
To minimize memory requirements on solr : 
• Reduce the cache sizes and check the cache hit 
r...
Capacity of Alfresco Repository 
How do can determine the throughput of my 
Alfresco repository server ? 
A ECM repository...
The C.A.R. 
“The maximum number of transactions that can be 
handled in a single second before degrading the 
expected per...
Capacity of Alfresco Repository 
3 more figures 
EC = The expected concurrency represented in 
number of users. 
TT = user...
The Expected Response Times 
OPERATION VALUE WEIGHT LAYER CPU MEM I/O 
Download 3 sec 20 Repo 
Write/Upload 5 sec 10 Repo/...
Capacity of Alfresco Repository 
With the inclusion of new variables we can tune 
our C.A.R. definition and redefine it as...
Predicting the future 
Re-evaluate your sizing introducing new factors. 
Take a dynamic / use case oriented approach to 
i...
Simple lab tests results 
We’ve executed some lab tests, one server running Alfresco and another running 
the database, si...
Size does matter !
Questions
Additional Info 
Alfresco Blog 
http://blogs.alfresco.com/wp/lcabaceira/ 
Github page 
https://github.com/lcabaceira 
Than...
Sizing your alfresco platform
Upcoming SlideShare
Loading in …5
×

Sizing your alfresco platform

10,062 views

Published on

Sizing an alfresco infrastructure has always been an interesting topic with lots of unrevealed questions. There is no perfect formula that can accurately define what is the perfect sizing for your architecture considering your use case. However, we can provide you with valuable guidance on how to size your Alfresco solution, by asking the right questions, collecting the right numbers, and taking the right assumptions on a very interesting sizing exercise.

How many alfresco servers will you need on your alfresco cluster? How many CPUs/cores do you need on those servers to handle your estimated user concurrency? How do you estimate the sizing and growth of your storage? How much memory do you need on your Solr servers? How many Solr servers do you need to get the response times you require? What are the golden rules that can drive and maintain the success of an Alfresco project?

Published in: Technology
  • For data visualization,data analyticsand data intelligence tools online training with job placements, register at http://www.todaycourses.com
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

Sizing your alfresco platform

  1. 1. Sizing your Alfresco platform Luis Cabaceira Technical Consultant
  2. 2. Agenda • Who am I • Presentation Objectives • Sizing Key questions • Sample Architecture • Sizing the different application layers
  3. 3. About me • Luis Cabaceira, Technical Consultant at Alfresco • 13 years of field experience • Helped clients in various fields on architecture and sizing decisions Government & Intelligence Banking & Insurance Manufacturing Media & Government & Banking & Manufacturing Media Publishing & Intelligence Insurance Publishing High Tech
  4. 4. Objectives of the presentation • Explain sizing key questions that will drive sizing decisions • Some Architecture best practices • Explain the impact of system usage on the different application layers
  5. 5. Considerations • The information provided on this presentation is based on Alfresco internal benchmarks as well as our field experience in various customer implementations. • Sizing predictions should be validated by executing dedicated benchmarks. • Monitoring your solution will also help on validating your sizing predictions allowing for incremental and continuous improvement
  6. 6. Sizing Key Information
  7. 7. 3 required sizing topics • The most important topics are • Use Case • Concurrent Users • Repository documents • In many cases organizations are not able to provide the answers to the other questions and you have to build assumptions.
  8. 8. Identify the Use case
  9. 9. Concurrent Users
  10. 10. Repository documents
  11. 11. Extra sizing topics The more information you have about your scenario the more accurate will be your sizing predictions. When available you should get information on the topics explained on the next slides to complete the information gathering stage.
  12. 12. Architecture
  13. 13. Authority structure
  14. 14. Operations
  15. 15. Components,Protocols and Apis
  16. 16. Batch processes • Approximate numbers and and types of batch jobs • workflows • scheduled processes • transformations and renditions
  17. 17. Customizations and Integrations
  18. 18. Response Times Requirements
  19. 19. There is no perfect formula
  20. 20. Reference Architecture Study
  21. 21. Database Sizing All operations in Alfresco require a database connection, the database performance plays a crucial role on your Alfresco environment. It’s vital to have the database properly sized and tuned for your specific use case. Content is not stored in the database Database size is un-affected by size or the document's content Database size is affected by the number/type of metadata fields
  22. 22. Database Size calculation D = Number of Documents + Avg. number of versions DA = Average number of Metadata fields per document F = Number of Folders FA = Average number of Metadata fields per Folder Dbsize = ((D * DA) + (F*FA)) * 5.5k Add 2k for each user and 5k for each group
  23. 23. Database Performance is Vital
  24. 24. User facing nodes sizing
  25. 25. User facing nodes sizing The repository user-facing cluster is dedicated on serving the usual user requests (reads, manual writes and updates). The most important factor that stresses these servers is the number of concurrent users hitting the cluster. Note that these nodes also do transformations to swf when a user accesses the document preview. User Uploads will also create thumbnail images.
  26. 26. User facing nodes sizing factors Operations: Percentage of browse, download and write operations Components, Protocols and API: Which components, protocols and APIs of the product are the users using to access these nodes. Response Times: The detail around response time requirements.
  27. 27. User facing nodes calculations Considering our field experience and our internal benchmark values and assuming we assign a quad-core cpu to the node with a user think-time of 30s 1 alfresco node for each 150 concurrent users. Number of Nodes = N Concurrent Users / 150 If we assign 2 quad-core cpus to each node. 1 alfresco node for each 300 concurrent users. Number of Nodes = N Concurrent Users / 300
  28. 28. User facing nodes calculations By default Alfresco has only 1 thread configured for libreoffice, but the number of threads can be increased using the port numbers on jodConverter. If have lots of uploads and you need faster (concurrent) transformations, consider raising the number of libreoffice threads. Have this in consideration on your memory calculations for these servers, it takes typically 1GB for each Libreoffice thread. Transformation server can be an option.
  29. 29. Dedicated tracking nodes sizing
  30. 30. Dedicated tracking nodes sizing The dedicated tracking nodes exist on each Solr server and they are the only reference to Alfresco that Solr knows about. Note that they normally have their own JVM and application server instance. The dedicated tracking Alfresco instances perform. • Text extraction of the document’s content to send to Solr. (Alfresco api call) • Allow for local indexing (document’s do not have to travel over the wire during tracking)
  31. 31. Dedicated tracking nodes sizing Normally these nodes do not have high requirements on memory and CPU but sizing will depend on the number of expecting uploads and transactions on the system. Indexing (Text extraction) impacts CPU and it’s API call that occurs exclusively on these nodes.
  32. 32. Solr nodes sizing
  33. 33. Solr nodes sizing factors Authority Structure Recent benchmarks comparisons that authority structure has a direct and important impact on performance of SOLR. When sizing Solr we should analyze the types of searches that will be executed and the authority structure of the corresponding use case
  34. 34. Solr nodes sizing factors Repository Size Important to notice that repository sizes should mostly only affect average response times for search operations, and more importantly for certain kinds of global searches very sensitive to the overall size of the repository. This is the layer most affected by the increase on repository size.
  35. 35. Solr nodes Calculations Use the online formula published by alfresco to calculate the necessary memory for your solr nodes. Note that the formula provided gives results for each core and it considers only one searcher. There can be up to 2 searchers per core, so the values you get from the formula you should multiply by 4. Total Memory = Result x 2 cores x 2 searchers
  36. 36. Solr nodes Calculations To calculate the number of Solr nodes on your cluster you should consider the search response times, the frequency of the search requests and the size of your repository. There are several possible architectures for solr and corresponding sizing scenarios depending on your use case, also tuning plays an important role. http://blogs.alfresco.com/wp/lcabaceira/2014/06/20/ solr-tuning-maximizing-your-solr-performance/
  37. 37. Solr Indexes You can expect the size of your indexes (when using FTI) to be between 40-60% the estimated size of your content store.
  38. 38. Content Ingestion nodes sizing
  39. 39. Bulk ingestion nodes sizing Sizing the nodes dedicated to content ingestion is very similar to sizing the tracking nodes, these nodes do not take any user requests and don’t receive search requests but they do generate document thumbnails. Start with 1 cpu (quad-core). They are mainly database and solr proxys. They read xml files and there is few memory consumption. The load on this server will mostly impact Solr indexing and the dedicated tracking instances (text extraction).
  40. 40. Bulk ingestion nodes tuning Disable unneeded services, many standard services are not required when bulk loading content Verify the existence of rules on target folder that can delay the speed of the ingestion and disable when applicable. Minimize network and file I/O operations Get source content as close to server storage as possible Tune JVM, Network, Threads, DB Connections.
  41. 41. Content Store Sizing
  42. 42. Content Store Sizing To size your content store consider • Number of documents (ND) • Number of Versions (NV) • Average document size (DS) • Annual Growth rate (AGR) Content Store Size (Y1) = (ND * NV * DS) If you have thumbnails and previews enabled also consider the pdf and swf versions of each document and the documents thumbnails
  43. 43. Sample Real Life numbers SIZING AREA INFORMATION USE CASE Collaboration CONCURRENT USERS 600 REPOSITORY 4M docs/ Avg doc size is 3mb/ Ingestion Rate is 75000 docs /Year ARCHITECTURE High availability required on Solr and Alfresco user facing nodes AUTHORITY STRUCTURE 8000 named users, 100 groups, user belongs to 4 or 5 groups, Group hierarchy max depth 4. OPERATIONS Search/Write/Browse split is 10/10/80 COMPONENTS PROTOCOLS Alfresco Share + DM + Sites. AND APIs SPP, Integration with AD. CUSTOMIZATIONS IMPACT None BATCH OPERATIONS Ad Synchronization, External loaded dictionary data RESPONSE TIMES Not Defined
  44. 44. Sample Real Life numbers ALFRESCO NODES CPU JVM MEMORY 2 2 Quad-Core 8 GB SOLR NODES CPU JVM MEMORY 2 1 Quad-Core 20GB
  45. 45. 2 Common use cases Collaboration Backend Repository
  46. 46. Collaboration vs Backend Repo Collaboration Scenario • Search is usually just a small portion of the operations percentage (around 10%) • User authority structure will be complex • Most uploads are manually driven Backend Scenario • In most of the cases especially for very large repositories there wont be full text indexing/search • Authority structures will be in general fairly simple. • Dedicated nodes for ingestion are normally used.
  47. 47. Collaboration vs Backend Repo Collaboration Scenario • Repository Sizes are usually of small or intermediate size. • Customizations in most cases will concentrate at the front end (Share). • Architecture options are in general the standard ones provided by Alfresco (cluster, dedicated index/transformation layers, etc). Backend Scenario • Repository sizes are usually quite big. • Custom solution code may live external to Alfresco by using CMIS, public APIs, etc. • Architecture options are normally more complex using proxies, cluster and unclustered layers, sharding of Alfresco repository, etc
  48. 48. Collaboration vs Backend Repo Collaboration Scenario • Concurrent users are normally quite high • Share interface is used, very common usage of SPP, CIFS, IMAP, WebDAV and other public interfaces (CMIS) for other interfaces (mobile). Backend Scenario • Concurrent users are in general fairly small but think times are much smaller than for collaboration. • Most of the load should concentrate around public API (CMIS) and custom developed REST API (Webscripts).
  49. 49. Collaboration vs Backend Repo Collaboration Scenario • Batch operations should mostly be around human interaction workflows and the standard Alfresco jobs. Backend Scenario • Batch operations will usually have a considerable importance, including content injection processes (bulk or not), custom workflows and scheduled jobs.
  50. 50. Database Thread Pool A default Alfresco instance is configured to use up to a maximum of forty (40) database connections. Because all operations in Alfresco require a database connection, this places a hard upper limit on the amount of concurrent requests a single Alfresco instance can service (i.e. 40), from all protocols. Alfresco recommends to increase the maximum size of the database connection pool to at least [number of application server worker threads] + 75.
  51. 51. Database Scaling Alfresco relies largely on a fast and highly transactional interaction with the RDBMS, so the health of the underlying system is vital. If your project will have lots of concurrent users and operations, consider an active-active database cluster with at least 2 machines.
  52. 52. Some Golden Rules
  53. 53. User facing nodes Golden Rules Browsing the repository using a client (share or explorer) performs Acl checks on each item present on the folder being browsed. Acl checks go against the database and occupy a thread in tomcat, causing cpu consumption. Consider having a max of 1000 documents on each folder to reduce this overhead. Add a transformations timeout and configure the transformation limits to prevent threads spending to much time doing transformations.
  54. 54. User facing nodes Golden Rules User operations take application server threads. The more concurrency the more threads will be occupied and the more cpu will be used. Size the maximum number of application server threads (and the number of cluster nodes) according to your expected concurrency. Include the Alfresco scheduled tasks on your cpu calculations, those also require server threads and consume cpu.
  55. 55. User facing nodes Golden Rules Customizations on the repository should be carefully analyzed for memory and cpu consumption. Check for unclosed resultsets, they are a common memory leak. The local repository caches (L2 caches) occupy server memory, size them wisely. Check my blog post on the alfresco repository caches.http://blogs.alfresco.com/wp/lcabaceira/2014/ 05/29/repository-caches/
  56. 56. Golden Rules – Solr memory To minimize memory requirements on solr : • Reduce the cache sizes and check the cache hit rate. • Disable ACL checks. • Disable archive indexing, if you are not using it. • Since everything scales to the number of documents in the index, add the Index control aspect to the documents you do not want in the index
  57. 57. Capacity of Alfresco Repository How do can determine the throughput of my Alfresco repository server ? A ECM repository is very similar to a database in regards to the type of events that they manage and execute, specially when we think about “transactions per second” where a “transaction” is considered to be a basic repository operation (create, browse, download, update , search and delete) .
  58. 58. The C.A.R. “The maximum number of transactions that can be handled in a single second before degrading the expected performance”
  59. 59. Capacity of Alfresco Repository 3 more figures EC = The expected concurrency represented in number of users. TT = user think time represented in seconds ERT = Expected/Accepted response times
  60. 60. The Expected Response Times OPERATION VALUE WEIGHT LAYER CPU MEM I/O Download 3 sec 20 Repo Write/Upload 5 sec 10 Repo/Solr/DB Delete 3 sec 5 Repo/DB Browse/Read 2 sec 50 Repo/Solr/DB Search Metadata 2 sec 10 DB/Solr Full Text Search 5 sec 5 Solr It’s an object representing the response times being considered including the weight of each type and the layer that it affects. It takes known repository actions as arguments.
  61. 61. Capacity of Alfresco Repository With the inclusion of new variables we can tune our C.A.R. definition and redefine it as. “Number of transactions that that the server can handle in one second under the expected concurrency(EC) with the agreed Think Time (TT) ensuring the expected response times(ERT) “. By configuring the Alfresco audit trail we can define our initial throughput (average transactions per seconds).
  62. 62. Predicting the future Re-evaluate your sizing introducing new factors. Take a dynamic / use case oriented approach to identify a formula, built on a system of attributes, values, weights and affected areas. Introduce one or more ERT objects representing use case specific response times and operations acting as influencers on the server throughput.
  63. 63. Simple lab tests results We’ve executed some lab tests, one server running Alfresco and another running the database, simple collaboration use case on a repository with 5000 documents. Alfresco Server Details • Processor: 64-bit Intel Xeon 3.3Ghz (Quad-Core) • Memory: 8GB RAM • JVM: 64-bit Sun Java 7 (JDK 1.7) • Operating System: 64-bit Red Hat Linux Test Details • ERT = The sample ERT values shown before on the presentation • Think Time = 30 seconds. • EC = 150 users The C.A.R. of the server was between 10-15 TPS during usage peaks. Through JVM tuning along with network and database optimizations, this number can rise over 25 TPS.
  64. 64. Size does matter !
  65. 65. Questions
  66. 66. Additional Info Alfresco Blog http://blogs.alfresco.com/wp/lcabaceira/ Github page https://github.com/lcabaceira Thank you

×