Performance Improvements

Performance improvement in OpenOffice.org

Published in: Technology

PERFORMANCE IMPROVEMENTS IN CALC
Niklas Nebel, Sun Microsystems
Agenda
<ul><li>Introduction and context</li>
<li>Local optimizations</li>
<li>Handling sheets separately</li>
<li>DataPilot performance</li>
<li>Load & save outlook</li></ul>
Introduction and Context
Performance work in all of OOo
<ul><li>Performance project</li>
<ul><li>Big improvements from 3.0 to 3.2</li></ul>
<li>Start-up: Cold start of Writer 20% faster</li>
<li>Writer load performance</li>
<ul><li>Comparable with MS Word 2007</li></ul>
<li>Impress load performance</li>
<ul><li>Comparable with MS PowerPoint 2007</li></ul>
<li>Calc performance</li>
<ul><li>Load and save: Up to twice as fast</li>
<li>Recalculation: Up to 20 times faster (extreme case)</li></ul></ul>
Local Optimizations
API Usage When Saving Text Cells
<ul><li>Filter uses getFormula API method</li>
<li>Single quote character added if text can be parsed as a number</li>
<li>Unnecessary parsing step</li>
<li>Can take up to 17% of CPU time</li></ul>
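To make the cost concrete, here is a minimal C++ sketch of the extra work a getFormula-style accessor does for a text cell. It is not the Calc implementation: `parsesAsNumber`, `formulaStringForTextCell` and `textForSaving` are hypothetical names, and number recognition is reduced to `std::strtod`, which is much simpler than a real number parser.

```cpp
#include <cstdlib>
#include <string>

// Simplified stand-in for number recognition (assumption: strtod only).
bool parsesAsNumber(const std::string& text) {
    if (text.empty()) return false;
    char* end = nullptr;
    std::strtod(text.c_str(), &end);
    // Counts as a number only if the whole string was consumed.
    return end == text.c_str() + text.size();
}

// What a getFormula-style accessor has to do for every text cell:
// parse the text, and prefix a single quote if it looks like a number.
std::string formulaStringForTextCell(const std::string& text) {
    return parsesAsNumber(text) ? "'" + text : text;
}

// What the save filter actually needs: the stored text, with no parsing.
const std::string& textForSaving(const std::string& text) {
    return text;
}
```

The parsing step runs for every text cell, although the file format stores the plain text and has no use for the quote.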
Querying the Document Null Date
<ul><li>Internal representation: Days since the null date</li>
<li>File format: XML Schema dates (≈ ISO 8601)</li>
<li>Utility method for conversion</li>
<ul><li>Queries the null date from the document</li>
<li>Several UNO calls</li></ul>
<li>Querying once is enough</li>
<li>10% of CPU time if only date cells are used</li></ul>
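"Querying once is enough" can be sketched as a converter that receives the null date once and reuses it for every cell. This is an illustration in plain C++, not the filter code: `DateConverter`, `CivilDate` and the helper functions are hypothetical names, and the day arithmetic is the well-known proleptic Gregorian conversion rather than anything taken from OOo.

```cpp
#include <cstdio>
#include <string>

struct CivilDate { long year; unsigned month; unsigned day; };

// Days since 1970-01-01 for a proleptic Gregorian date.
long daysFromCivil(long y, unsigned m, unsigned d) {
    y -= m <= 2;
    const long era = (y >= 0 ? y : y - 399) / 400;
    const unsigned yoe = static_cast<unsigned>(y - era * 400);
    const unsigned doy = (153 * (m > 2 ? m - 3 : m + 9) + 2) / 5 + d - 1;
    const unsigned doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    return era * 146097 + static_cast<long>(doe) - 719468;
}

// Inverse of daysFromCivil.
CivilDate civilFromDays(long z) {
    z += 719468;
    const long era = (z >= 0 ? z : z - 146096) / 146097;
    const unsigned doe = static_cast<unsigned>(z - era * 146097);
    const unsigned yoe = (doe - doe / 1460 + doe / 36524 - doe / 146096) / 365;
    const unsigned doy = doe - (365 * yoe + yoe / 4 - yoe / 100);
    const unsigned mp = (5 * doy + 2) / 153;
    const unsigned d = doy - (153 * mp + 2) / 5 + 1;
    const unsigned m = mp < 10 ? mp + 3 : mp - 9;
    const long y = static_cast<long>(yoe) + era * 400 + (m <= 2);
    return CivilDate{y, m, d};
}

// The null date is passed in once (in the real filter it would be queried
// from the document once) and reused for every date cell.
class DateConverter {
public:
    explicit DateConverter(const CivilDate& nullDate)
        : nullDateDays_(daysFromCivil(nullDate.year, nullDate.month, nullDate.day)) {}

    // Converts "days since the null date" to an ISO 8601 calendar date.
    std::string toIsoDate(long serial) const {
        const CivilDate d = civilFromDays(nullDateDays_ + serial);
        char buf[32];
        std::snprintf(buf, sizeof buf, "%04ld-%02u-%02u", d.year, d.month, d.day);
        return buf;
    }

private:
    long nullDateDays_;
};
```

With Calc's default null date of 1899-12-30, `DateConverter conv(CivilDate{1899, 12, 30});` followed by `conv.toIsoDate(2)` yields `1900-01-01`.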
Collecting Formatted Cell Ranges
<ul><li>Collect cell ranges with equal cell formats</li>
<ul><li>For generating automatic styles</li>
<li>Keep a list of ranges for each set of formats</li>
<li>Try to join adjacent ranges</li></ul>
<li>Formats are kept and iterated column-wise</li>
<ul><li>Can use this information when trying to join</li></ul>
<li>Prevents pathological cases</li></ul>
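The joining idea can be sketched as follows. Because ranges arrive column by column with ascending rows, the collector looks only at the most recently stored range instead of searching the whole list. This is a simplified illustration: `FormatRangeCollector`, `CellRange` and the integer `formatId` (standing in for "a set of formats") are hypothetical, and the join is opportunistic, so it does not always find the minimal set of ranges.

```cpp
#include <map>
#include <vector>

struct CellRange { int col1, row1, col2, row2; };

class FormatRangeCollector {
public:
    // Ranges are expected in column-wise order: one column at a time,
    // rows ascending within the column.
    void add(int formatId, const CellRange& r) {
        std::vector<CellRange>& list = ranges_[formatId];

        // Vertical join: same columns, directly below the last range.
        if (!list.empty() && list.back().col1 == r.col1 &&
            list.back().col2 == r.col2 && list.back().row2 + 1 == r.row1) {
            list.back().row2 = r.row2;
        } else {
            list.push_back(r);
        }

        // Horizontal join: the last range has the same rows as its
        // predecessor and starts in the next column.
        if (list.size() >= 2) {
            CellRange& prev = list[list.size() - 2];
            const CellRange& last = list.back();
            if (prev.row1 == last.row1 && prev.row2 == last.row2 &&
                prev.col2 + 1 == last.col1) {
                prev.col2 = last.col2;
                list.pop_back();
            }
        }
    }

    const std::vector<CellRange>& rangesFor(int formatId) {
        return ranges_[formatId];
    }

private:
    std::map<int, std::vector<CellRange>> ranges_;
};
```

Each call does a constant amount of work per range, which avoids the quadratic behavior of comparing every new range against all stored ones.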
Formula Optimizations
<ul><li>String handling when formulas are parsed</li>
<ul><li>Functions, references, names are case-insensitive</li>
<li>Operators, separators, parentheses are not</li>
<li>Reduce case conversion calls</li>
<ul><li>5% of CPU time saved</li></ul></ul>
<li>Sorting of values for MEDIAN etc.</li>
<ul><li>Not necessary to completely sort the array</li>
<li>Use std::nth_element STL method instead</li>
<li>Faster calculation after loading</li></ul></ul>
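As an illustration of the MEDIAN point, a median can be computed with `std::nth_element`, which only partitions the data around the middle position (linear time on average) instead of sorting all of it. The function below is a sketch, not Calc's interpreter code; returning 0.0 for an empty input is an arbitrary convention chosen for the example, where a spreadsheet would return an error value.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Takes the values by copy because nth_element reorders them.
double median(std::vector<double> values) {
    if (values.empty()) return 0.0;  // assumption for this sketch only

    const std::size_t mid = values.size() / 2;
    // After this call, values[mid] is the element that would be at
    // position mid in sorted order; smaller-or-equal elements precede it.
    std::nth_element(values.begin(), values.begin() + mid, values.end());
    const double upper = values[mid];
    if (values.size() % 2 == 1) return upper;

    // Even count: the lower middle is the largest element before mid.
    const double lower = *std::max_element(values.begin(), values.begin() + mid);
    return (lower + upper) / 2.0;
}
```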
Formula Recalculation (1)
<ul><li>Detection of duplicate notifications</li>
<ul><li>When a cell range is modified</li>
<li>Parameter range can contain several changed cells</li>
<li>Notify each range only once</li></ul>
<li>Also useful for single-cell change</li>
<ul><li>Parameter range can contain several changed results</li>
<li>Extreme case: Issue 95967 – 20x faster</li></ul></ul>
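A minimal sketch of "notify each range only once": while broadcasting a batch of changed cells, remember which listeners have already been notified. The types and the function `broadcastChanges` are hypothetical, and the linear scan over listeners is only for brevity; it says nothing about how Calc locates listeners.

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

struct Cell { int col, row; };

struct Range {
    int col1, row1, col2, row2;
    bool contains(const Cell& c) const {
        return c.col >= col1 && c.col <= col2 && c.row >= row1 && c.row <= row2;
    }
};

// A formula listening to a parameter range, e.g. the range in SUM(A1:A10).
struct RangeListener {
    Range range;
    int notifications;  // how often this listener was told to recalculate
};

// Notifies every listener whose range contains a changed cell, but only
// once per broadcast, however many of its cells changed.
void broadcastChanges(const std::vector<Cell>& changed,
                      std::vector<RangeListener>& listeners) {
    std::unordered_set<std::size_t> notified;
    for (const Cell& c : changed) {
        for (std::size_t i = 0; i < listeners.size(); ++i) {
            if (listeners[i].range.contains(c) && notified.insert(i).second) {
                ++listeners[i].notifications;
            }
        }
    }
}
```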
Handling Sheets Separately
Updating Row Heights
<ul><li>Optimal row height depends on local conditions</li>
<ul><li>Especially fonts</li></ul>
<li>Core structures need concrete height values</li>
<ul><li>Positioning of shapes: Whole file</li>
<ul><li>File format: relative to cell position</li>
<li>Internally: absolute positions</li></ul>
<li>Screen output: Only single sheet</li></ul>
<li>Update row heights</li>
<ul><li>After loading: Visible sheet and sheets with shapes</li>
<li>Others as needed (display, printing, …)</li></ul></ul>
Updating Row Heights: Comments
<ul><li>Cell comments (formerly: notes) are shapes</li>
<li>Often used in large sheets</li>
<ul><li>Usually not shown</li></ul>
<li>Create shape only when comment is shown</li>
<ul><li>Saves time if there are many hidden comments</li>
<li>Row heights can be updated later</li></ul></ul>
Updating Row Heights: Results
<ul><li>No effect for single sheet</li>
<li>Little improvement for text and numbers</li>
<li>30% CPU time with date cells on many sheets</li>
<li>Formula results don't have to be calculated</li></ul>
Partial Saving
<ul><li>Don't generate XML elements for whole file</li>
<li>Copy unchanged parts on stream level</li>
<li>Could copy from temporary storage</li>
<ul><li>Storage layer creates copy of the unpacked file</li></ul>
<li>Access the original file</li>
<ul><li>Uncompress on the fly</li></ul>
<li>Cost</li>
<ul><li>File access: Read the compressed file</li>
<li>CPU: Uncompress</li></ul></ul>
Experiment: Incremental Saving
<ul><li>Generate XML elements only for changed cells</li>
<ul><li>Proof of concept: Only single-cell changes</li></ul>
<li>No additional information kept after loading</li>
<li>Minimal parsing to find affected cells in stream</li>
<ul><li>Takes extra time</li>
<li>Less if affected cells near start of file</li></ul>
<li>Results (compared to 3.0):</li>
<ul><li>40 – 70% improvement in CPU time</li>
<li>30 – 50% improvement in total time</li></ul></ul>
Sheet-Wise Saving
<ul><li>Handle sheets instead of individual cells</li>
<li>Fewer sheets than cells</li>
<ul><li>Additional information can be kept in memory</li></ul>
<li>Easier to find modified sheets than modified cells</li>
<li>One obvious limitation:</li>
<ul><li>Only useful with several sheets</li></ul></ul>
Finding Modified Sheets
<ul><li>Few code changes for most types of changes</li>
<ul><li>Formula notification for cell contents</li>
<li>Formula calculation for changed results</li>
<li>Cell format changes</li>
<li>Column widths or row heights</li>
<li>Handled separately: Print ranges, etc.</li></ul>
<li>Currently no handling of drawing layer changes</li>
<ul><li>All sheets are considered modified</li></ul></ul>
Automatic Styles
<ul><li>Direct formats are collected in automatic styles</li>
<ul><li>Referenced by name</li>
<ul><li>Generated name (“ce1” etc.)</li></ul>
<li>One list for the whole document</li>
<li>Have to be created with the same names again</li></ul>
<li>Implemented for cell contents (incl. comments)</li>
<ul><li>Keep a mapping of names to cell/text positions</li>
<li>Collect styles for unchanged sheets first</li>
<li>Include in existing duplicate detection for other sheets</li></ul>
<li>Sheets with shapes always saved normally</li></ul>
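The naming constraint for automatic styles can be sketched like this: styles referenced from unchanged (copied) sheets are registered first under their original names, and styles for regenerated sheets either reuse a registered name or get a fresh one that does not collide. `AutoStylePool` and the string `formatKey` (a stand-in for a set of direct formats) are hypothetical and much simpler than the real export code.

```cpp
#include <map>
#include <set>
#include <string>

class AutoStylePool {
public:
    // Step 1: register styles used by unchanged sheets under the names
    // that the copied stream portions already reference.
    void keepExisting(const std::string& formatKey, const std::string& name) {
        byFormat_.emplace(formatKey, name);  // keeps the first name per format
        used_.insert(name);
    }

    // Step 2: for regenerated sheets, reuse a known style (duplicate
    // detection) or generate a new "ceN" name that is not taken yet.
    std::string nameFor(const std::string& formatKey) {
        const auto it = byFormat_.find(formatKey);
        if (it != byFormat_.end()) return it->second;

        std::string name;
        do {
            name = "ce" + std::to_string(++counter_);
        } while (used_.count(name) != 0);
        byFormat_.emplace(formatKey, name);
        used_.insert(name);
        return name;
    }

private:
    std::map<std::string, std::string> byFormat_;
    std::set<std::string> used_;
    int counter_ = 0;
};
```

Registering the kept names first is what lets the copied stream portions stay valid without rewriting their style references.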
Putting the Parts Together
<ul><li>When loading a file</li>
<ul><li>Compatibility checks: Namespaces, encoding</li>
<li>Keep stream positions and style information</li></ul>
<li>Steps to save a spreadsheet document</li>
<ul><li>meta.xml, styles.xml, embedded objects: as usual</li>
<li>content.xml</li>
<ul><li>Generate common content and modified sheets</li>
<li>For each sheet: Generate or copy stream portion</li></ul>
<li>For “Save” and “Save As” update stream positions</li></ul></ul>
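The per-sheet "generate or copy" step might look roughly like the sketch below. It works on in-memory strings instead of package streams, leaves out the common content around the sheets, and uses hypothetical names (`SheetEntry`, `assembleSheets`, the `generateSheet` callback), so it shows the control flow only.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Remembered at load time: where each sheet's XML sits in the original
// content stream, plus a flag set when the sheet is changed.
struct SheetEntry {
    std::size_t offset;
    std::size_t length;
    bool modified;
};

// Builds the sheet part of the new content stream. Modified sheets are
// regenerated via the callback; unchanged ones are copied byte for byte.
// Positions are updated so that the next save can copy from the new stream.
std::string assembleSheets(
        const std::string& original,
        std::vector<SheetEntry>& sheets,
        const std::function<std::string(std::size_t)>& generateSheet) {
    std::string out;
    for (std::size_t i = 0; i < sheets.size(); ++i) {
        const std::size_t start = out.size();
        if (sheets[i].modified) {
            out += generateSheet(i);
        } else {
            out.append(original, sheets[i].offset, sheets[i].length);
        }
        sheets[i].offset = start;
        sheets[i].length = out.size() - start;
        sheets[i].modified = false;
    }
    return out;
}
```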
Results
<ul><li>Influencing factors</li>
<ul><li>Unchanged sheets</li>
<li>Type of sheet content</li>
<li>CPU time / file access</li></ul>
<li>Example</li>
<ul><li>Text, numbers, dates</li>
<li>16 sheets</li></ul>
<li>Single sheet modified</li>
<ul><li>Twice as fast</li>
<li>On top of other changes</li></ul></ul>
Formula Recalculation (2)
<ul><li>Sheet area is divided into “slots”</li>
<ul><li>16 columns by 128 rows</li>
<li>Range dependency registered in all affected slots</li>
<li>Needs attention when row limit is changed</li></ul>
<li>Change: Use hash_set instead of set</li>
<ul><li>Faster modification of dependency structures</li>
<li>Loading time</li></ul>
<li>Change: Separate structures per sheet</li>
<ul><li>Faster recalculation if several sheets are used</li></ul></ul>
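The slot scheme can be illustrated with a small per-sheet structure that registers a range listener in every 16-column by 128-row slot the range touches. This is a sketch under assumptions: `SheetSlots` and the integer listener ids are hypothetical, slots are stored sparsely in a map for brevity, and `std::unordered_set` is used as the standard counterpart of the `hash_set` named on the slide.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <unordered_set>

const int kSlotCols = 16;   // slot width in columns (from the slide)
const int kSlotRows = 128;  // slot height in rows (from the slide)

// One instance per sheet, following "separate structures per sheet".
class SheetSlots {
public:
    // Registers a listener in every slot that the range overlaps.
    void registerRange(int listenerId, int col1, int row1, int col2, int row2) {
        for (int sr = row1 / kSlotRows; sr <= row2 / kSlotRows; ++sr) {
            for (int sc = col1 / kSlotCols; sc <= col2 / kSlotCols; ++sc) {
                slots_[key(sc, sr)].insert(listenerId);
            }
        }
    }

    // Candidate listeners for a changed cell: everything registered in the
    // cell's slot. A caller would still check the exact range.
    std::unordered_set<int> listenersNear(int col, int row) const {
        const auto it = slots_.find(key(col / kSlotCols, row / kSlotRows));
        return it == slots_.end() ? std::unordered_set<int>() : it->second;
    }

    std::size_t slotCount() const { return slots_.size(); }

private:
    // Packs slot column and slot row into one 64-bit key.
    static std::uint64_t key(int slotCol, int slotRow) {
        return (static_cast<std::uint64_t>(slotRow) << 32) |
               static_cast<std::uint32_t>(slotCol);
    }

    std::unordered_map<std::uint64_t, std::unordered_set<int>> slots_;
};
```

A change to one cell then only has to look at the listeners of a single slot on a single sheet, and inserting or removing a listener is a hash operation rather than a tree operation.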
DataPilot Performance
DataPilot Memory Usage
<ul><li>Issue 55266: Several fields with many items</li>
<li>Fix now under way from IBM Symphony team</li>
<ul><li>Don't allocate results for all child items</li>
<li>New cache table</li></ul>
<li>CWS datapilotperf</li>
<ul><li>Planned for OOo 3.3</li>
<li>Combination of large fields no longer a limitation</li></ul></ul>
Load & Save Outlook
DOM Usage
<ul><li>Prototype by Christian Lippka for Impress</li>
<ul><li>Use fast SAX to fill a compact DOM representation</li>
<li>Import from DOM, possibly parallel to parsing</li></ul>
<li>Results for Impress</li>
<ul><li>Only 2% improvement for typical presentation</li>
<li>Filling DOM tree uses 2% of CPU time</li>
<li>Not worth the effort</li></ul>
<li>Calc may be different</li>
<ul><li>Larger number of XML elements</li>
<li>But: Memory usage twice the XML stream size</li></ul></ul>
Further Separation of Sheets
<ul><li>Load only the visible sheet</li>
<ul><li>Load other sheets as needed, or in background</li>
<li>Parse XML fragment from stream, or use DOM</li>
<li>Formulas, charts may depend on changed cells</li>
<ul><li>Dependencies must be known before saving</li></ul></ul>
<li>Parse formulas only as needed</li>
<ul><li>Per sheet or individually</li>
<li>Already a separate step (but for all formulas)</li></ul>
<li>Handle several sheets in parallel</li>
<ul><li>More fine-grained locking needed</li></ul></ul>
Q & A
PERFORMANCE IMPROVEMENTS IN CALC Niklas Nebel [email_address]
