ERA - Measuring Maintainability of Spreadsheets in the Wild

Paper: Measuring Maintainability of Spreadsheets in the Wild

Authors: José Pedro Correia and Miguel Alexandre Ferreira

Session: Early Research Achievements Track Session 2: Software Changes and Maintainability

Published in: Technology, Education

1. Measuring Maintainability of Spreadsheets in the Wild
   José Pedro Correia & Miguel Alexandre Ferreira
   September 2011
   T +31 20 314 0950, info@sig.eu, www.sig.eu
2. Introduction
   Spreadsheets
   •  are widely used in all kinds of organizations
   •  contain important business logic
   •  are maintained by different people
   Do all spreadsheets matter?
   •  throwaway calculations don’t
   •  some data-intensive spreadsheets might
   •  “spreadsheet programs” matter the most
3. Pragmatic criteria for “spreadsheet programs”
   •  have formulas
   •  the formulas have references
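The two criteria above can be sketched as a simple filter. This is a minimal illustration, not the authors' tooling: it assumes cell contents are available as raw values, that formulas are strings starting with `=`, and that references use A1 notation. The names `CELL_REF` and `is_spreadsheet_program` are hypothetical.

```python
import re

# Assumption: A1-style references such as A1, $B$2 or AA10.
# The lookarounds avoid matching function names such as LOG10( .
CELL_REF = re.compile(
    r"(?<![A-Za-z0-9_])\$?[A-Za-z]{1,3}\$?[0-9]+(?![A-Za-z0-9_(])"
)


def is_spreadsheet_program(cells):
    """Return True if at least one formula contains a cell reference.

    cells: iterable of raw cell contents (any type); only strings
    starting with '=' are treated as formulas.
    """
    formulas = [c for c in cells if isinstance(c, str) and c.startswith("=")]
    return any(CELL_REF.search(f) for f in formulas)


print(is_spreadsheet_program(["Total", 10, "=A1+2"]))  # has a formula with a reference
print(is_spreadsheet_program(["Total", "=1+2"]))       # formula, but no reference
```

The regular expression does not skip string literals inside formulas, so a formula such as `="A1"` would be counted as containing a reference; a real implementation would need a formula parser.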
4. Definitions
   •  $\text{spreadsheet} = \{\text{sheet}_1, \text{sheet}_2, \ldots, \text{sheet}_i\}$
   •  $\text{sheet} = \begin{pmatrix} \text{cell}_{11} & \text{cell}_{12} & \cdots & \text{cell}_{1j} \\ \text{cell}_{21} & \text{cell}_{22} & \cdots & \text{cell}_{2j} \\ \vdots & \vdots & \ddots & \vdots \\ \text{cell}_{i1} & \text{cell}_{i2} & \cdots & \text{cell}_{ij} \end{pmatrix}$
5. Definitions
   •  cell types
      •  blank: no content
      •  data: non-blank and does not contain a formula
      •  proxy: contains a formula that is a direct, single reference (e.g. =A1)
      •  calculation: contains a formula and is not a proxy
   •  cell roles

                        blank        data           proxy        calculation
      not referenced    no role      label¹         data sink¹   calc. sink
      referenced        open input   data source¹   data move    calc. step

   ¹ Hodnigg & Mittermeir, Metric-based spreadsheet visualization: Support for focused maintenance, EuSpRIG’08
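The cell types and roles defined above map naturally onto a small classifier plus a lookup table. The sketch below is illustrative only: it assumes formulas are strings starting with `=` in A1 notation, and that whether a cell is referenced has already been determined elsewhere. `PROXY`, `cell_type`, `ROLES` and `cell_role` are hypothetical names.

```python
import re

# Assumption: a proxy formula is '=' followed by exactly one A1-style
# reference, optionally qualified with a sheet name (Sheet1!A1 or 'My Sheet'!A1).
PROXY = re.compile(r"=\s*(?:'[^']+'!|[A-Za-z0-9_]+!)?\$?[A-Za-z]{1,3}\$?[0-9]+\s*")


def cell_type(content):
    """Classify raw cell content as blank, data, proxy or calculation."""
    if content is None or str(content).strip() == "":
        return "blank"
    text = str(content).strip()
    if not text.startswith("="):
        return "data"
    if PROXY.fullmatch(text):
        return "proxy"
    return "calculation"


# Role table from the slide: (cell type, is referenced) -> role.
ROLES = {
    ("blank", False): "no role",
    ("data", False): "label",
    ("proxy", False): "data sink",
    ("calculation", False): "calc. sink",
    ("blank", True): "open input",
    ("data", True): "data source",
    ("proxy", True): "data move",
    ("calculation", True): "calc. step",
}


def cell_role(content, referenced):
    """Look up the role of a cell given its content and whether it is referenced."""
    return ROLES[(cell_type(content), referenced)]


print(cell_role("=A1", True))     # data move
print(cell_role("Total", False))  # label
```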
6. Definitions
   •  formula copy equivalents¹
      =R1 + 2
      =R2 + 2
      =R3 + 2
      =R4 + 2
   •  unique formula
      =X + 2

   ¹ Mittermeir & Clermont, Finding high-level structures in spreadsheet programs, WCRE’02
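One way to approximate copy equivalence is to rewrite every relative reference as an offset from the cell that holds the formula, so that formulas produced by copying or dragging become textually identical. The sketch below takes that approach; it is an assumption about how the grouping could be done, not the definition used by Mittermeir & Clermont or by the authors. `normalize` and `unique_formulas` are hypothetical names, and the example assumes the four formulas sit in column S, next to the referenced column R.

```python
import re

# A1-style reference with optional '$' markers for absolute column/row.
REF = re.compile(
    r"(?<![A-Za-z0-9_])(\$?)([A-Za-z]{1,3})(\$?)([0-9]+)(?![A-Za-z0-9_(])"
)


def col_to_num(letters):
    """Convert a column label to a 1-based index (A -> 1, Z -> 26, AA -> 27)."""
    n = 0
    for ch in letters.upper():
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n


def normalize(formula, row, col):
    """Rewrite relative references as offsets from the host cell (row, col)."""
    def repl(match):
        col_abs, letters, row_abs, digits = match.groups()
        c, r = col_to_num(letters), int(digits)
        r_part = f"R{r}" if row_abs else f"R[{r - row}]"
        c_part = f"C{c}" if col_abs else f"C[{c - col}]"
        return r_part + c_part
    # Simplification: all spaces are dropped, including those in string literals.
    return REF.sub(repl, formula.replace(" ", ""))


def unique_formulas(formula_cells):
    """formula_cells: dict mapping (row, col) -> formula text."""
    return {normalize(f, r, c) for (r, c), f in formula_cells.items()}


# The four formulas from the slide, assumed to be in column S (index 19).
cells = {(i, 19): f"=R{i} + 2" for i in range(1, 5)}
print(len(unique_formulas(cells)))  # 1
```

Under this normalization, the ratio of formula cells to unique formulas gives a rough measure of how much of a spreadsheet was produced by copying.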
7. Research questions
   1.  Which metrics can we use to assess spreadsheet maintainability?
   2.  What are the typical values for the selected metrics?
8. Study outline
   •  Metric selection
   •  Measuring the EUSES Spreadsheet Corpus
   •  Analysis of results
9. Goal Question Metric (mockup)
   •  Main goal: Maintainability
   •  Sub goals: Analyzability, Changeability, Stability, Testability
   •  Questions: How large is the spreadsheet? Are there inconsistencies? How much coupling is there? How complex is the spreadsheet?
   •  Sub questions: …
   •  Metrics: …
10. Example metrics
   Spreadsheet level
   •  # used rows / columns
   •  # formulas / unique formulas
   Sheet level
   •  # data fan-in / fan-out
   •  # data move / sink cells
   Formula level
   •  McCabe complexity
   •  # operators
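The formula-level metrics can be illustrated with a rough, text-based sketch. The slides do not say how McCabe complexity is computed for a formula, so the code below makes a loud assumption: complexity is 1 plus the number of calls to a configurable set of conditional functions (by default only `IF`). The operator count is likewise an approximation that ignores range (`:`) and argument (`,`) separators. `mccabe` and `operator_count` are hypothetical names.

```python
import re

STRING = re.compile(r'"[^"]*"')                 # string literals in a formula
OPERATOR = re.compile(r"<>|<=|>=|[-+*/^&=<>]")  # arithmetic, comparison, concatenation


def _body(formula):
    """Drop the leading '=' and blank out string literals."""
    return STRING.sub('""', formula.lstrip()[1:])


def operator_count(formula):
    """Approximate number of operators in a formula (unary minus included)."""
    return len(OPERATOR.findall(_body(formula)))


def mccabe(formula, conditionals=("IF",)):
    """Assumed definition: 1 + number of calls to conditional functions."""
    names = "|".join(re.escape(name) for name in conditionals)
    pattern = r"(?<![A-Za-z0-9_.])(?:%s)\s*\(" % names
    return 1 + len(re.findall(pattern, _body(formula), flags=re.IGNORECASE))


print(mccabe("=A1+2"))                                   # 1
print(mccabe('=IF(A1>0,"yes",IF(B1>0,"maybe","no"))'))  # 3
print(operator_count("=A1+B1*2"))                        # 2
```

A production implementation would tokenize formulas properly; the regular expressions here miscount cases such as scientific notation (`1E-5`) or hyphens in quoted sheet names.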
11. The EUSES Spreadsheet Corpus¹
   [Bar chart: number of spreadsheets (y-axis, 0 to 5000) per selection step: Total, Internet search, Contain formulas, Contain formulas with references, > 25 unique formulas; one bar is labeled 1609]

   ¹ Fisher II & Rothermel, The EUSES spreadsheet corpus: a shared resource for supporting experimentation with spreadsheet dependability mechanisms, WEUSE’05
12. Power-law like distributions
   [Histogram: frequency (y-axis, 0 to 1000) of NON_BLANK_CELLS (x-axis, 0 to 15000, up to the 99% quantile)]
   •  most metrics follow a power-law like distribution
13. Extremely skewed distributions

   Attribute            Object   Min   Q1   Med.   Q3   95%    99%      Max
   # data move cells    Sheet      0    0      0    0    12     80      672
   # data sink cells    Sheet      0    0      0    0     8     80     1188
   # data fan-out       Sheet      0    0      0    0    52   1256   964366
   # data fan-in        Sheet      0    0      0    0    76   1814   964366

   •  at least 75% of sheets have no proxy cells
   •  at least 75% of sheets are not referenced / have no references
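The summary columns used in this and the following tables (Min, Q1, Med., Q3, 95%, 99%, Max) can be computed with Python's standard library. This is a sketch under assumptions: the slides do not state which quantile method was used, so the `inclusive` method of `statistics.quantiles` is an arbitrary choice here, and `summarize` is a hypothetical helper. The sample data is made up to mimic a skewed metric.

```python
from statistics import quantiles


def summarize(values):
    """Return the seven summary statistics for a list of measurements.

    Needs at least two values on older Python versions.
    """
    data = sorted(values)
    # n=100 yields 99 cut points; cuts[k - 1] is the k-th percentile.
    cuts = quantiles(data, n=100, method="inclusive")
    return {
        "Min": data[0],
        "Q1": cuts[24],
        "Med.": cuts[49],
        "Q3": cuts[74],
        "95%": cuts[94],
        "99%": cuts[98],
        "Max": data[-1],
    }


# Made-up example: 80 sheets with no proxy cells, 20 sheets with 1 to 20.
proxy_cells_per_sheet = [0] * 80 + list(range(1, 21))
print(summarize(proxy_cells_per_sheet))
```

When Q3 is 0, as in this example, at least 75% of the observations are 0, which is how the statements under the table follow from the numbers.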
14. Sparse distributions

   Attribute               Object    Min   Q1   Med.   Q3   95%   99%   Max
   McCabe complexity       Formula     1    1      1    1     1     5    34
   # unidentified values   Formula     0    0      0    1     3     9    51

   •  conditionals and magic values are uncommon
15. Documentation within the spreadsheet

   Attribute       Object   Min   Q1   Med.    Q3   95%   99%   Max
   % label cells   Sheet      0   35     64   100   100   100   100

   •  at least 25% of sheets are purely for documentation purposes
16. Spreadsheet layout

   Attribute        Object        Min   Q1   Med.    Q3   95%    99%     Max
   # used columns   Spreadsheet     2   10     19    41   122    228     738
   # used rows      Spreadsheet     2   47     99   210   686   1629   40518

   •  most common layout seems to be vertical
17. “Dragging” formulas

   Attribute           Object        Min   Q1   Med.    Q3    95%    99%     Max
   # formula cells     Spreadsheet     1   20     75   239   1153   4052   24523
   # unique formulas   Spreadsheet     1    3      8    26    103    252     961

   •  “dragging” formulas seems to be a common practice
18. Answering the research questions
   1.  Which metrics can we use to assess spreadsheet maintainability?
       •  levels: sheets, rows/columns, cells, formulas
   2.  What are the typical values for the selected metrics?
       •  most distributions resemble software metrics distributions
       •  some distributions are extremely skewed/sparse
       •  expect label-only sheets, more rows than columns, and copy-equivalent formulas
19. Roadmap
   1.  Study how and what to measure in spreadsheets
   2.  Select a minimal set of metrics and build a quality model
   3.  Gather a representative set of measurements for calibration
   4.  Calibrate the thresholds in the model
   5.  Validate the model
20. The end…
   Thank you for your attention! Q&A
   Complete data set and technical report at: http://www.sig.eu/en/spreadsheet-quality
   Miguel Ferreira, Software Improvement Group, m.ferreira@sig.eu