BP-3 Taking Your Bulk Content Ingestions to the Next Level

Learn about the Alfresco Bulk Filesystem Import Tool, a community developed extension to Alfresco that provides a high performance bulk import feature. Discover how different tuning parameters affect import performance, and learn how to determine the optimum configuration for your Alfresco environment.

  1. Peter Monks, Director of Technology, Strategic Alliances
  2. Agenda
     1.  Introduction to the Bulk Filesystem Import Tool
     2.  Demo
     3.  Performance analysis
         1.  Methodology
         2.  Results
         3.  Conclusions
     4.  Roadmap for the Bulk Filesystem Import Tool
     5.  Q&A
     6.  Appendices
  3. Introduction to the Bulk File System Import Tool (BFSIT for short)
  4. Introduction to the BFSIT
     BFSIT = Bulk File System Import Tool
     •  Primary use case: one-off content migration / ingestion
     •  Provides high-performance import of content
     •  A community-maintained extension to Alfresco
     •  Hosted on Google Code [1]
     •  LGPL licensed
     •  Widely adopted
  5. Introduction to the BFSIT
     Why not use…
     •  Web UIs?
     •  ACP Files?
     •  CIFS, FTP, NFS, WebDAV, IMAP?
     •  CMIS?
     All of the above suffer from one or more of:
     •  Content sent over network
     •  External (out of process) orchestration
     •  Content requires pre-/post-processing (e.g. ACP)
     •  Chatty (e.g. CIFS, NFS)
     •  Overly general (e.g. CMIS)
  6. Introduction to the BFSIT
     Solution
     •  Import content from the Alfresco server
     •  Load folders & content as they appear on disk
     •  Content is imported in batches
     •  The “unit of work” is the directory
        •  Each directory is imported in at least one batch
        •  More if lots of content
        •  Batches within a directory are processed serially
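
The batching rule on this slide can be illustrated with a minimal sketch. The `split_into_batches` helper, its `weight` callback, and the rule that a batch closes once its accumulated weight reaches the limit are all assumptions made for illustration; they are not taken from the BFSIT source code.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def split_into_batches(
    items: Iterable[T],
    max_weight: int,
    weight: Callable[[T], int] = lambda _item: 1,
) -> List[List[T]]:
    """Group one directory's items into consecutive batches.

    Hypothetical helper: a batch is closed as soon as its accumulated
    weight reaches max_weight, so a directory always yields at least one
    batch, and more when it holds a lot of content.
    """
    if max_weight < 1:
        raise ValueError("max_weight must be at least 1")

    batches: List[List[T]] = []
    current: List[T] = []
    current_weight = 0
    for item in items:
        current.append(item)
        current_weight += weight(item)
        if current_weight >= max_weight:
            batches.append(current)
            current, current_weight = [], 0
    # Flush the remainder; an empty directory still produces one (empty) batch.
    if current or not batches:
        batches.append(current)
    return batches
```

With the default weight of 1 per item, a directory of 250 files and a limit of 100 gives three batches, which would then be imported one after another.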
  7. Introduction to the BFSIT
     Usage:
     1.  Initiated via a simple repo Web Script (can also be initiated via wget, curl, et al)
     2.  Import runs in background
     3.  Detailed status is displayed while in progress
         •  We'll see that during the demo
  8. Introduction to the BFSIT
     Process details:
     •  Place source directory on job queue & immediately return
     •  Worker thread pulls a single directory off job queue, and:
        1.  Lists the contents of the directory
        2.  Groups entries into “importable items”
        3.  Filters importable items, based on admin-defined filtering rules
        4.  Subdivides list of importable items into batches
        5.  Imports batches, one at a time (serially)
        6.  Places all subdirectories onto the job queue
  9. Introduction to the BFSIT
     Process details (same steps, annotated on the slide with “I/O Bound Phases” and “CPU Bound Phases” labels):
     •  Place source directory on job queue & immediately return
     •  Worker thread pulls a single directory off job queue, and:
        1.  Lists the contents of the directory
        2.  Groups entries into “importable items”
        3.  Filters importable items, based on admin-defined filtering rules
        4.  Subdivides list of importable items into batches
        5.  Imports batches, one at a time (serially)
        6.  Places all subdirectories onto the job queue
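
The process details above can be sketched as a job queue drained by a small pool of worker threads. This is an illustrative sketch only: `import_tree`, `import_batch`, the fixed-size batching, and the hidden-file filter (standing in for admin-defined filtering rules) are hypothetical names and simplifications, and unlike the real tool this version blocks until the whole tree has been processed rather than returning immediately.

```python
import os
import queue
import threading
from typing import Callable, List


def import_tree(
    source_dir: str,
    import_batch: Callable[[List[str]], None],
    workers: int = 4,
    batch_size: int = 100,
) -> None:
    """Walk source_dir using a job queue where each job is one directory."""
    jobs = queue.Queue()

    def worker() -> None:
        while True:
            directory = jobs.get()
            if directory is None:  # sentinel: no more work
                jobs.task_done()
                return
            try:
                # 1. List the contents of the directory.
                with os.scandir(directory) as it:
                    entries = sorted(it, key=lambda e: e.name)
                # 2. Group entries into importable items (simplified: plain files).
                files = [e.path for e in entries if e.is_file()]
                # 3. Filter (assumed rule: skip hidden files).
                files = [f for f in files
                         if not os.path.basename(f).startswith(".")]
                # 4 + 5. Subdivide into batches and import them serially.
                for i in range(0, len(files), batch_size):
                    import_batch(files[i:i + batch_size])
                # 6. Place all subdirectories onto the job queue.
                for entry in entries:
                    if entry.is_dir(follow_symlinks=False):
                        jobs.put(entry.path)
            finally:
                # Subdirectories are queued before this call, so join()
                # cannot return while work is still outstanding.
                jobs.task_done()

    threads = [threading.Thread(target=worker, daemon=True)
               for _ in range(workers)]
    for thread in threads:
        thread.start()
    jobs.put(source_dir)
    jobs.join()              # wait until every queued directory is done
    for _ in threads:
        jobs.put(None)       # one sentinel per worker
    for thread in threads:
        thread.join()
```

Because batches within a directory are imported by a single worker in order, parallelism only arises across directories, which matches the "unit of work is the directory" design described earlier.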
  10. Demo
  11. Performance Analysis: Methodology
  12. Goals and Test Plan
      Goals:
      •  Benchmark total time taken for bulk imports, using a combination of:
         •  Machine environments
         •  Source content sets
         •  Alfresco repository configurations
         •  Bulk import tool configurations
      Test Plan:
      •  Parallel testing in 2 environments
      •  Two runs per test per machine:
         1.  Import into fresh (empty) repository
         2.  Delete target folder, then re-import (without restarting Alfresco)
      •  Record the average duration of the two runs
      •  Modify only one configuration parameter at a time, resetting earlier modifications in between
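
The test plan amounts to a one-factor-at-a-time benchmark. A minimal sketch of that procedure follows; the helper names `one_at_a_time` and `average_duration` are hypothetical, and how a configuration is actually applied to Alfresco (editing properties, restarting, clearing the repository) is deliberately left out.

```python
import time
from statistics import mean
from typing import Callable, Dict, Iterator, Tuple


def one_at_a_time(
    baseline: Dict[str, str],
    changes: Dict[str, str],
) -> Iterator[Tuple[str, Dict[str, str]]]:
    """Yield (label, config) pairs: the baseline, then one change per config."""
    yield "baseline", dict(baseline)
    for key, value in changes.items():
        config = dict(baseline)  # start from the baseline again each time
        config[key] = value      # apply exactly one modification
        yield f"{key}={value}", config


def average_duration(run: Callable[[], None], runs: int = 2) -> float:
    """Time `run` the given number of times and return the mean in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        run()
        durations.append(time.perf_counter() - start)
    return mean(durations)
```

Copying the baseline for every variant is what guarantees that an earlier modification never leaks into a later test.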
  13. Environments
      Environment 1
      •  2009 model MacBook Pro
      •  2.8 GHz dual-core CPU
      •  4GB RAM
      •  Solid State Drive (Toshiba OEM)
      •  64bit Mac OSX Lion 10.7.1
      •  MySQL 5.1
      •  Apple JDK 1.6.0_26
      Environment 2
      •  2006 model Thinkpad T60
      •  2.33 GHz dual-core CPU
      •  3GB RAM
      •  Dual hard drives (Seagate, Hitachi)
         •  First used for source directory
         •  Second used for Alfresco repository
      •  64bit Ubuntu Natty Narwhal 11.04
      •  MySQL 5.1
      •  OpenJDK 1.6.0_22
      NOTE 1: Neither of these environments is “production grade”
      NOTE 2: These environments are not directly comparable
  14. Content Sets
      Name                          # Folders   # Files   Total Size   Notes
      Typical                              38     4,640       1.44GB
      Extreme File Size                     1         9       4.41GB
      Extreme File Volume                   4    11,100      521.7KB
      Extreme Directory Structure       1,021         0           0B   100 levels of nesting
  15. Performance Analysis: Repository Tuning Results
  16. Baseline
      Notes:
      •  Repository tuned as per Day Zero Config Guide
      •  BFSIT has default configuration
      Observations:
      •  Environment 2 is significantly slower at creating cm:folder nodes
      •  Theory: creating cm:folder nodes is “seeky” (more on this later)
  17. Disable User Quotas
      Observations:
      •  Quota calculation performance proportional to number of cm:content nodes
      •  Quota calculation performance not affected by content size
  18. Disable In-txn Indexing
      Notes:
      •  This configuration is not compatible with Share 3.x!
      Observations:
      •  Transactional indexing slows Alfresco down a lot, particularly in environment 2
      •  Theory: indexing is highly “seeky”
  19. Disable Indexing Entirely
      Notes:
      •  This configuration is not compatible with Share 3.x!
      •  This configuration functionally cripples Alfresco!
      Observations:
      •  Some contention between ingestions & indexing (even async)
      •  Theory: SOLR integration in 4.x should provide similar performance
  20. Optimal Repository Configuration
      Optimal repository configuration, without functionally crippling Alfresco, is:
      •  Disable user quotas:
            system.usages.enabled=false
      •  Disable in-transaction indexing:
            index.tracking.disableInTransactionIndexing=true
            alfresco.cluster.name=dummyCluster
         •  Indexing still occurs, just not synchronously in-transaction
         •  Incompatible with Share 3.x, but in-transaction indexing can be disabled temporarily during import, then re-enabled post-import
  21. Optimal Repository Configuration – Results
      Notes:
      •  This configuration is not compatible with Share 3.x!
      Observations:
      •  Slower environment (2) benefits more than the faster environment (1)
      •  Configuration can't speed up import of large files
         •  Requires faster storage devices (e.g. RAID 10)
  22. Average speedup of ~40%?
  23. Performance Analysis: BFSIT Tuning Results
  24. Worker Thread Pool Sizes
      Notes:
      •  Baseline is optimal repository configuration
      •  Only the “Typical” content set was used for testing
      Observations:
      •  Multi-threading is mostly irrelevant
         •  Not surprising, given ingestion is I/O bound
      •  Steady improvement in environment 1
         •  Theory: concurrent I/O support in SSD
  25. Batch Weights
      Observations:
      •  Larger batches = better performance
      …HOWEVER…
      •  UI responsiveness got worse
      •  A classic trade-off
      •  Ultimately, performance similar to baseline (batch weight = 100)
  26. Optimal BFSIT Configuration
      Optimal BFSIT configuration:
      •  High thread count (mostly irrelevant):
            alfresco-bulk-filesystem-import.threadpool.size.core=48
            alfresco-bulk-filesystem-import.threadpool.size.max=48
      •  More importantly, high batch weight:
            alfresco-bulk-filesystem-import.batch.weight=1000
         •  Impacts UI responsiveness
         •  Could reduce if needed, at little cost
  27. Optimal BFSIT Configuration – Results
      Observations:
      •  Modest improvement over baseline
      •  Implies default BFSIT configuration is close to optimal
  28. Average speedup of ~6.5%?
  29. Rethinking the Problem
      What if the BFSIT didn’t have to stream content into the repository at all?
      What if the source content was already in the contentstore and only had to be “linked” into the repository?
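
The "link instead of stream" idea can be sketched as a simple path check. This is a hypothetical illustration, not the BFSIT implementation: the `in_place_reference` helper is invented here, and the `store://` prefix is assumed to be the form of reference used for files held in a filesystem-backed content store.

```python
from pathlib import Path
from typing import Optional


def in_place_reference(source: Path, contentstore_root: Path) -> Optional[str]:
    """Return a store-relative reference if source already sits in the store.

    Returns None when the file lives elsewhere, meaning its bytes would
    have to be streamed (copied) into the repository as usual.
    """
    try:
        relative = source.resolve().relative_to(contentstore_root.resolve())
    except ValueError:
        # source is not located underneath contentstore_root
        return None
    # Assumed reference format; only metadata would need to be written.
    return "store://" + relative.as_posix()
```

When a reference is returned, no file content has to be read or written at import time, which is why file size stops mattering for such imports.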
  30. In-place Import
      Notes:
      •  Baseline is optimal repository configuration
      •  Optimal repository & BFSIT configuration
      Observations:
      •  Improvement across the board
      •  Best improvement is extreme file size case – 375X faster!
  31. Average speedup of ~60%?
  32. Performance Analysis: Conclusions
  33. Conclusions
      Results:
      •  Minimum improvement of 6%
      •  Average improvement of 60%
      •  Maximum improvement of 99.7%
      •  In absolute terms, saw performance of up to:
         •  16GB / sec
         •  120 nodes / sec
      Recall this wasn’t on production hardware!
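
The percentage figures and the "375X faster" figure can be related with a little arithmetic, assuming (this is an interpretation, not something the slides state) that "improvement" means the reduction in elapsed import time. Under that assumption, a job that runs N times faster takes 1/N of the original time, so the improvement is 1 - 1/N.

```python
def improvement_percent(speedup: float) -> float:
    """Percentage of elapsed time saved when a job runs `speedup` times faster."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return (1.0 - 1.0 / speedup) * 100.0


def speedup_factor(improvement: float) -> float:
    """Inverse of improvement_percent: how many times faster a job runs."""
    if not 0 <= improvement < 100:
        raise ValueError("improvement must be in [0, 100)")
    return 1.0 / (1.0 - improvement / 100.0)
```

On this reading, a 375X speedup corresponds to an improvement of about 99.7%, and an average improvement of 60% corresponds to imports running about 2.5 times faster.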
  34. Conclusions
      Developers:
      •  Macro-optimization will always outperform micro-optimization!
      •  Multi-threading is not a magic bullet. It's only helpful if a given operation is CPU bound and can be parallelised.
      Administrators:
      •  Use the Day Zero Configuration Guide for every install you do!
      •  Don't assume superficially similar environments will perform similarly
      •  For bulk ingestions Alfresco is (mostly) I/O bound
  35. BFSIT Roadmap
  36. BFSIT Roadmap
      Official roadmap is on the Google Code project’s wiki [2].
      BFSIT v1.1 – Performance:
      •  Issue #91: Optimization of directory analysis phase [complete].
      •  Issue #8: Multi-threaded imports [complete].
      •  Issue #86: In-place imports [complete].
      •  Issue #77: Graphical display of throughput.
      •  Issue #17: Test various different dimensions to see how they affect performance [complete – this talk!]
      BFSIT v1.2 – Alfresco 4.0, Usability & Performance:
      •  Issue #92: Test on Alfresco 4.0
      •  Issue #26: Integrate into Share's administration console
      •  Issue #94: Investigate use of Alfresco's BatchProcessor framework for the multi-threaded importer
      •  Issue #96: Measure performance of alternative batching strategies
      •  Issue #79: Reimplement the bulk filesystem import as a subsystem
      •  Issue #62: Add support for cm:content properties
      BFSIT v1.3+:
      •  You tell me – I'm always keen to hear feedback!
      •  The issues list [3] and mailing list [4] are great ways to start getting involved in the project
  37. References
      [1] http://code.google.com/p/alfresco-bulk-filesystem-import/
      [2] http://code.google.com/p/alfresco-bulk-filesystem-import/wiki/Roadmap
      [3] http://code.google.com/p/alfresco-bulk-filesystem-import/issues/list
      [4] http://groups.google.com/group/alfresco-bulk-filesystem-import
      Also:
      •  http://blogs.alfresco.com/wp/pmonks/2009/10/22/bulk-import-from-a-filesystem/
      •  Sessions:
         •  BP-1 – Performance Tuning
         •  BP-6 – Repository Customization Best Practices
         •  BP-9 – Share Customization Best Practices
  38. Questions?
  39. Appendix A – “Typical” Content Set Distributions
