Paper: Measuring Maintainability of Spreadsheets in the Wild
Authors: José Pedro Correia and Miguel Alexandre Ferreira
Session: Early Research Achievements Track Session 2: Software Changes and Maintainability
1. Measuring Maintainability of Spreadsheets in the Wild
José Pedro Correia & Miguel Alexandre Ferreira
September 2011 T +31 20 314 0950
info@sig.eu
www.sig.eu
2. Introduction
Spreadsheets
• are widely used in all kinds of organizations
• contain important business logic
• are maintained by different people
Do all spreadsheets matter?
• throwaway calculations don’t
• some data intensive spreadsheets might
• “spreadsheet programs” matter the most
Measuring Maintainability of Spreadsheets in the Wild – José Pedro Correia & Miguel Alexandre Ferreira – September 2011
3. Pragmatic criteria for “spreadsheet programs”
have formulas
the formulas have references
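The two criteria above can be sketched as a simple predicate. This is a minimal illustration, not the authors' tooling: it assumes cell contents are available as strings, that formulas start with `=`, and that references use A1-style notation; the names `CELL_REF` and `is_spreadsheet_program` are hypothetical.

```python
import re

# Assumed A1-style reference: optional "$" anchors, 1-3 column letters, row digits.
# The lookarounds avoid matching function names such as LOG10( as references.
CELL_REF = re.compile(r"(?<![A-Za-z0-9_])\$?[A-Za-z]{1,3}\$?\d+(?![A-Za-z0-9_(])")


def is_formula(content: str) -> bool:
    """Assumption: a cell holds a formula when its content starts with '='."""
    return content.startswith("=")


def is_spreadsheet_program(cell_contents) -> bool:
    """True if at least one cell holds a formula that references another cell."""
    return any(
        is_formula(c) and CELL_REF.search(c) is not None
        for c in cell_contents
    )
```

The sketch ignores references inside string literals and cross-sheet or R1C1-style references, so it should be read as an approximation of the criteria rather than a faithful filter.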
4. Definitions
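• $\text{spreadsheet} = \{\, \text{sheet}_1, \text{sheet}_2, \dots, \text{sheet}_i \,\}$
• $\text{sheet} = \begin{pmatrix}
\text{cell}_{11} & \text{cell}_{12} & \cdots & \text{cell}_{1j} \\
\text{cell}_{21} & \text{cell}_{22} & \cdots & \text{cell}_{2j} \\
\vdots & \vdots & \ddots & \vdots \\
\text{cell}_{i1} & \text{cell}_{i2} & \cdots & \text{cell}_{ij}
\end{pmatrix}$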
5. Definitions
• cell types
  • blank: no content
  • data: non-blank and does not contain a formula
  • proxy: contains a formula that is a direct, single reference (e.g. =A1)
  • calculation: contains a formula and is not a proxy
• cell roles

                   blank       data          proxy       calculation
  not referenced   no role     label¹        data sink¹  calc. sink
  referenced       open input  data source¹  data move   calc. step

¹ [Hodnigg & Mittermeir – Metric-based spreadsheet visualization: Support for focused maintenance – EuSpRIG’08]
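The cell types and roles defined on this slide can be sketched as a lookup. This is an illustration under assumptions: contents are strings, formulas start with `=`, references are in A1 notation, and whether a cell is referenced is supplied by the caller. The names `PROXY`, `ROLES`, `cell_type`, and `cell_role` are hypothetical.

```python
import re

# Assumption: a proxy is "=" followed by exactly one A1-style reference, e.g. "=A1".
PROXY = re.compile(r"=\s*\$?[A-Za-z]{1,3}\$?\d+\s*")

# Role table from the slide: (cell type, is referenced) -> role.
ROLES = {
    ("blank", False): "no role",
    ("data", False): "label",
    ("proxy", False): "data sink",
    ("calculation", False): "calc. sink",
    ("blank", True): "open input",
    ("data", True): "data source",
    ("proxy", True): "data move",
    ("calculation", True): "calc. step",
}


def cell_type(content: str) -> str:
    """Classify a cell as blank, data, proxy, or calculation."""
    if content.strip() == "":
        return "blank"
    if not content.startswith("="):
        return "data"
    if PROXY.fullmatch(content):
        return "proxy"
    return "calculation"


def cell_role(content: str, referenced: bool) -> str:
    """Combine the cell type with whether any formula references the cell."""
    return ROLES[(cell_type(content), referenced)]
```

For example, `cell_role("=A1", True)` yields `"data move"`. Cross-sheet proxies such as `=Sheet2!A1` would be classified as calculations by this sketch, which is a known simplification.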
6. Definitions
• formula copy equivalents¹
  =R1 + 2
  =R2 + 2
  =R3 + 2
  =R4 + 2
• unique formula
  =X + 2
¹ [Mittermeir & Clermont – Finding high-level structures in spreadsheet programs – WCRE’02]
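The reduction from copy-equivalent formulas to a unique formula can be approximated by replacing every cell reference with a placeholder, as the `=X + 2` example suggests. The sketch below assumes A1-style references and is coarser than the copy-equivalence notion in the cited work, since it collapses all references to the same `X`. The names `REF`, `unique_formula`, and `count_unique_formulas` are hypothetical.

```python
import re

# Assumed A1-style reference; lookarounds skip function names such as LOG10(.
REF = re.compile(r"(?<![A-Za-z0-9_])\$?[A-Za-z]{1,3}\$?\d+(?![A-Za-z0-9_(])")


def unique_formula(formula: str) -> str:
    """Replace every cell reference with the placeholder X."""
    return REF.sub("X", formula)


def count_unique_formulas(formulas) -> int:
    """Number of distinct formulas after references are abstracted away."""
    return len({unique_formula(f) for f in formulas})


formulas = ["=R1 + 2", "=R2 + 2", "=R3 + 2", "=R4 + 2"]
print(unique_formula(formulas[0]))      # =X + 2
print(count_unique_formulas(formulas))  # 1
```

Comparing the number of formula cells with the number of unique formulas gives a rough indication of how much of a spreadsheet was produced by copying formulas.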
7. Research questions
1. Which metrics can we use to assess spreadsheet maintainability?
2. What are the typical values for the selected metrics?
8. Study outline
Metric selection
Measuring the EUSES Spreadsheet Corpus
Analysis of results
9. Goal Question Metric (mockup)
Main goal: Maintainability
Sub goals: Analyzability, Changeability, Stability, Testability
Questions:
• How large is the spreadsheet?
• Are there inconsistencies?
• How much coupling is there?
• How complex is the spreadsheet?
Sub questions: …
Metrics: …
10. Example metrics
Spreadsheet level
• # used rows / columns
• # formulas / unique formulas
Sheet level
• # data fan-in / fan-out
• # data move / sink cells
Formula level
• McCabe complexity
• # operators
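The slides do not say how the formula-level metrics are computed, so the counting rules below are assumptions made purely for illustration: McCabe complexity is taken as one plus the number of `IF(` calls, and operators are counted from a fixed set of arithmetic, comparison, and concatenation symbols. The names `IF_CALL`, `OPERATOR`, `mccabe_complexity`, and `operator_count` are hypothetical.

```python
import re

# Assumed decision points: calls to IF (case-insensitive). SUMIF, COUNTIF and
# IFERROR are not matched; other conditional functions are ignored here.
IF_CALL = re.compile(r"\bIF\s*\(", re.IGNORECASE)

# Assumed operator set: two-character comparisons first, then single characters.
OPERATOR = re.compile(r"<=|>=|<>|[+\-*/^&<>=]")


def mccabe_complexity(formula: str) -> int:
    """One linear path plus one per assumed decision point."""
    return 1 + len(IF_CALL.findall(formula))


def operator_count(formula: str) -> int:
    """Count operators, skipping the leading '=' that marks a formula."""
    body = formula[1:] if formula.startswith("=") else formula
    return len(OPERATOR.findall(body))
```

Under these rules `=IF(A1>0, IF(B1>0, 1, 2), 3)` has complexity 3 and `=A1 + B1 * 2` has 2 operators. Unary minus signs and operator characters inside string literals are counted as operators, and the range operator `:` is not, so the numbers are only indicative.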
11. The EUSES Spreadsheet Corpus¹
[Bar chart: number of spreadsheets (y-axis, 0 to 5000) per selection step: Total; Internet search; Contain formulas; Contain formulas with references; > 25 unique formulas. The only value label that survives is 1609.]
¹ [Fisher II & Rothermel – The EUSES spreadsheet corpus: a shared resource for supporting experimentation with spreadsheet dependability mechanisms – WEUSE’05]
12. Power-law like distributions
[Histogram: Frequency (y-axis, 0 to 1000) of NON_BLANK_CELLS (x-axis, 0 to 15000, values up to the 99th percentile)]
most metrics follow a power-law-like distribution
13. Extremely skewed distributions
Attribute          Object  Min  Q1  Med.  Q3  95%  99%   Max
# data move cells  Sheet   0    0   0     0   12   80    672
# data sink cells  Sheet   0    0   0     0   8    80    1188
# data fan-out     Sheet   0    0   0     0   52   1256  964366
# data fan-in      Sheet   0    0   0     0   76   1814  964366
at least 75% of sheets have no proxy cells
at least 75% of sheets are not referenced / have no references
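A summary row of this kind (Min, Q1, Med., Q3, 95%, 99%, Max) can be sketched with the Python standard library. The slides do not state which quantile estimator was used; the `inclusive` method below is an assumption, and `summarize` is a hypothetical helper name.

```python
from statistics import quantiles


def summarize(values):
    """Return the summary columns used in the tables for one metric."""
    # 99 cut points: cuts[k - 1] is the k-th percentile.
    cuts = quantiles(values, n=100, method="inclusive")
    return {
        "Min": min(values),
        "Q1": cuts[24],
        "Med.": cuts[49],
        "Q3": cuts[74],
        "95%": cuts[94],
        "99%": cuts[98],
        "Max": max(values),
    }
```

A row whose Q3 is 0, as in the table above, means that at least 75% of the measured objects have the value 0 for that metric, which is how the two observations on this slide follow from the data.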
14. Sparse distributions
Attribute              Object   Min  Q1  Med.  Q3  95%  99%  Max
McCabe complexity      Formula  1    1   1     1   1    5    34
# unidentified values  Formula  0    0   0     1   3    9    51
conditionals and magic values are uncommon
15. Documentation within the spreadsheet
Attribute      Object  Min  Q1  Med.  Q3   95%  99%  Max
% label cells  Sheet   0    35  64    100  100  100  100
at least 25% of sheets are purely for documentation purposes
16. Spreadsheet layout
Attribute       Object       Min  Q1  Med.  Q3   95%  99%   Max
# used columns  Spreadsheet  2    10  19    41   122  228   738
# used rows     Spreadsheet  2    47  99    210  686  1629  40518
most common layout seems to be vertical
17. “Dragging” formulas
Attribute          Object       Min  Q1  Med.  Q3   95%   99%   Max
# formula cells    Spreadsheet  1    20  75    239  1153  4052  24523
# unique formulas  Spreadsheet  1    3   8     26   103   252   961
seems to be a common practice
18. Answering the research questions
1. Which metrics can we use to assess spreadsheet maintainability?
• levels: sheets, rows/columns, cells, formulas
2. What are the typical values for the selected metrics?
• most distributions resemble software metrics distributions
• some distributions are extremely skewed/sparse
• expect label-only sheets, more rows than columns, and copy-equivalent formulas
19. Roadmap
1. Study how and what to measure in spreadsheets
2. Select a minimal set of metrics and build a quality model
3. Gather a representative set of measurements for calibration
4. Calibrate the thresholds in the model
5. Validate the model
20. The end…
Thank you for your attention!
Q&A
Complete data set and technical report at:
http://www.sig.eu/en/spreadsheet-quality
Miguel Ferreira
Software Improvement Group
m.ferreira@sig.eu