Analyzing & visualizing spreadsheets
Felienne Hermans (@felienne)
In this slide deck I present an overview of my PhD research. I recently defended my dissertation, titled ‘Analyzing and Visualizing Spreadsheets’.
This one!
Bridging the gap
Funny story: I wasn’t hired to research spreadsheets at all. When I started my PhD project, I was supposed to research the gap between business users and programmers.
[Figure labels: Users, Programmers]
To research this gap, I started by studying business in practice.
What surprised me is that this gap wasn’t that big; it was more like a small creek than a huge cliff. Some programmers were heavily involved in business and, even more interesting, some business guys were doing serious programming. In Excel!
So I looked into some previous work on the impact of spreadsheets on business.
[Figure labels: Programmers, Users]
95% of all U.S. firms use spreadsheets for financial reporting
90% of all analysts in industry perform calculations in spreadsheets
50% of spreadsheets form the basis for decisions
Importance can grow over time
When studying the impact of spreadsheets, we found that they do not become important overnight. As processes change, spreadsheets can become key company assets over time. Nobody sets out to create a mission-critical spreadsheet; they “just happen”.
This is a simple spreadsheet for many users
Furthermore, spreadsheets can become surprisingly complex.
And spreadsheets exist ‘under the radar’
Another interesting property of spreadsheets is that they often live ‘under the radar’: there is no list of spreadsheets, no one keeps track of which sheets are needed for which report, and some spreadsheets do not have a clear owner.
Only 33% of spreadsheets have a manual
Finally, spreadsheets lack documentation. In only one third of the spreadsheets did we find ‘documentation’ (i.e. some sort of explanation of how to use the spreadsheet). Technical documentation, explaining why a spreadsheet was designed as it is, was hardly ever found.
Complex spreadsheets without documentation can lead to serious errors
You can imagine that the combination of all the above facts:
• Spreadsheets are important
• They are complex
• They lack documentation
is a potential recipe for disaster. And indeed, those errors happen.
The European Spreadsheet Risks Interest Group (eusprig.org) collects horror stories
Estimated loss: 10 billion dollars a year
We interviewed spreadsheet professionals
Once I had studied related spreadsheet work and the horror stories from Eusprig, I wanted to gain a deeper understanding of spreadsheet problems in practice. So I interviewed 27 spreadsheet professionals at the Dutch Robeco bank.
I asked only two questions (a semi-structured interview) to obtain an overall view of spreadsheet problems:
What annoys you?
And what makes you happy?
From the interviews, we learned the following facts:
Financial professionals spend 2 days a week working with Excel
Spreadsheets can have a long life: 5 years on average
The average sheet is used by 12 different people
There is a gap! Between importance and treatment.
Then I concluded that there is an interesting gap that needs bridging: the gap between how important spreadsheets are and how well they are treated. So how could this gap be bridged?
It looks like software in the 70s!
Let’s summarize the problems around spreadsheets again:
• They lack documentation
• They contain errors
• They stay alive for several years and are used by several people
• They are complex
Does this remind you of something? It reminded me of the problems in the early days of software.
Hence, we tried to bridge this gap with methods from software engineering.
Spreadsheet users lack great tool support
If you compare the tooling of spreadsheet developers with that of software developers, the difference is clear.
Modern IDEs (like Visual Studio) have all kinds of built-in tools to help you build software in a responsible way: debugging, testing, analyzing and visualizing are accessible at the click of a button.
Compare this to a spreadsheet environment, like Excel. There is lots of support for creating a spreadsheet, with fonts and colors and borders, but none of the helpful tools for building a maintainable spreadsheet.
We did not start coding immediately
However tempting, we did not start building a spreadsheet IDE immediately. Instead, we looked at the results of the interviews to find the most pressing information need that spreadsheet users had.
Most important problem: support for understanding spreadsheets was missing
To address this information need specifically, we developed our tool Breviz. This tool visualizes the dependencies among worksheets, depicted as rectangles with arrows drawn between them. The thicker the arrow, the more connections there are.
Example: worksheet ‘POAProject’ contains formulas that refer to cells in ‘ProjectTeam’.
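A minimal sketch of how such worksheet-level dependencies could be counted, assuming the formulas are already available as plain text. This is not Breviz's implementation: the regular expression covers only simple sheet-qualified references, and the function name, the example formulas and the ‘Summary’ sheet are hypothetical. The count per pair of worksheets plays the role of the arrow thickness.

```python
import re
from collections import Counter

# Simplified pattern for sheet-qualified references such as
# ProjectTeam!B2 or 'Project Team'!$C$7 (an assumption, not full Excel syntax).
SHEET_REF = re.compile(
    r"(?:'([^']+)'|([A-Za-z_][A-Za-z0-9_]*))!\$?[A-Z]{1,3}\$?[0-9]+"
)

def worksheet_dependencies(formulas):
    """Count references between worksheets.

    formulas maps (sheet, cell) -> formula text.
    Returns a Counter keyed by (referring sheet, referenced sheet).
    """
    edges = Counter()
    for (sheet, _cell), formula in formulas.items():
        for quoted, bare in SHEET_REF.findall(formula):
            target = quoted or bare
            if target != sheet:  # ignore references within the same sheet
                edges[(sheet, target)] += 1
    return edges

# Hypothetical formulas, made up for illustration.
formulas = {
    ("POAProject", "B2"): "=ProjectTeam!B2*1.1",
    ("POAProject", "B3"): "=ProjectTeam!B3+ProjectTeam!C3",
    ("POAProject", "B4"): "=SUM(B2:B3)",
    ("Summary", "A1"): "=POAProject!B4",
}

edges = worksheet_dependencies(formulas)
for (source, target), weight in edges.most_common():
    print(f"{source} refers to {target}: {weight} reference(s)")
```

A range such as ProjectTeam!B2:B5 counts as a single reference here; whether a range should weigh more than a single cell is a design choice this sketch leaves open.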
We went back to practice
With our tool, we went back to practice to see whether it really supported spreadsheet users.
It turned out that it did. Some of the responses of users:
“This diagram reminds me of what I had in mind when building”
This remark is interesting: apparently, this spreadsheet user did do some modeling before building a spreadsheet.
“This makes my job 10 times easier”
A clear sign that we were on the right track!
This work was published at ICSE 2011
However, unexpected things also happened. Not all spreadsheets looked as well structured as this one. Let’s look at some of them:
Here, pink blocks represent worksheets outside of the spreadsheet. So this spreadsheet gathers information from over 20 other worksheets and combines this information.
Users diagnosed with the diagrams
We found that, due to the diversity of the diagrams, users started to judge spreadsheets based on their dataflow diagrams. We therefore formalized this feeling users had into ‘smells’ at the design level. These spreadsheet smells turned out to be very similar to code smells as defined by Fowler.
Consider for instance the ‘feature envy’ smell. This occurs when a method from class B refers to many fields outside its own class, say in class A. This method envies all the cool fields that A has, hence the name. It is easy to see how this smell could be defined on spreadsheets, where a formula in worksheet B could be overly interested in cells on worksheet A.
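One way the spreadsheet version of feature envy could be sketched, assuming formulas are available as text. The metric (count the references a formula makes to each other worksheet and flag the worksheet when the count reaches a threshold) and the threshold value are assumptions made for illustration, not the definition used in the thesis; the function name and formulas are hypothetical.

```python
import re
from collections import Counter

# Simplified pattern for sheet-qualified references (an assumption,
# not full Excel syntax): Sheet!A1 or 'Sheet name'!$A$1.
SHEET_REF = re.compile(
    r"(?:'([^']+)'|([A-Za-z_][A-Za-z0-9_]*))!\$?[A-Z]{1,3}\$?[0-9]+"
)

def feature_envy(sheet, formula, threshold=3):
    """Return {other sheet: reference count} for every other worksheet
    that this formula refers to at least `threshold` times."""
    counts = Counter(
        quoted or bare for quoted, bare in SHEET_REF.findall(formula)
    )
    counts.pop(sheet, None)  # references to the formula's own sheet do not count
    return {name: n for name, n in counts.items() if n >= threshold}

# Hypothetical formulas on worksheet B.
envious = feature_envy("B", "=A!A1+A!A2+A!A3+C1")
modest = feature_envy("B", "=A!A1+B1")
print(envious)  # {'A': 3}
print(modest)   # {}
```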
We added support in Breviz for detecting and visualizing these inter-worksheet code smells.
We went back to practice
Next, of course, we went back to practice to see how users felt about the detected smells.
Results showed that users understood why certain constructions were qualified as smelly.
“That should be improved”
“This must be confusing for others”
Published at ICSE 2012
However, new problems were to be discovered. We found that, once the structure of the spreadsheets had been understood and validated, complex formulas still got in the way of understanding spreadsheets.
This led us to the idea of formula smells
Again, we took our inspiration from the smells that Fowler defines in his canonical book on refactoring.
Published at ICSM 2012
In a recent extension of the paper, we also suggest refactorings corresponding to smells. This formula, for instance, contains the same subformula twice. Extracting this subformula into a separate cell will improve readability.
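The extraction described above could be sketched as follows. The formula from the slide is not reproduced in the text, so the example formula and the helper cell B1 are hypothetical, and the sketch only looks at function calls with balanced parentheses (it ignores parentheses inside string literals).

```python
import re
from collections import Counter

def function_calls(formula):
    """Yield every function call NAME(...) in the formula, including nested ones."""
    for match in re.finditer(r"[A-Z][A-Z0-9.]*\(", formula):
        depth = 0
        # Walk from the opening parenthesis to its matching close.
        for i in range(match.end() - 1, len(formula)):
            if formula[i] == "(":
                depth += 1
            elif formula[i] == ")":
                depth -= 1
                if depth == 0:
                    yield formula[match.start(): i + 1]
                    break

def duplicated_subformulas(formula):
    """Return the function calls that occur more than once."""
    counts = Counter(function_calls(formula))
    return [sub for sub, n in counts.items() if n > 1]

def extract(formula, subformula, helper_cell):
    """Replace each occurrence of the subformula by a reference to helper_cell."""
    return formula.replace(subformula, helper_cell)

# Hypothetical formula that computes the same sum twice.
formula = "=IF(SUM(A1:A10)>100,SUM(A1:A10)*0.9,0)"
duplicates = duplicated_subformulas(formula)
refactored = extract(formula, duplicates[0], "B1")
print(duplicates)   # ['SUM(A1:A10)']
print(refactored)   # =IF(B1>100,B1*0.9,0)
```

In this sketch the helper cell B1 would then hold =SUM(A1:A10), so the sum is written in only one place.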
We went back to practice
And again... a look in practice.
We found that cloning (i.e. copy-pasting) in spreadsheets was a problem. If data is copy-pasted, updates will not be propagated to the copies and that might lead to errors. Based on existing work on clone detection in source code, we developed an algorithm to detect clones.
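To give a feel for what detecting copy-pasted data can look like, here is a deliberately simple sketch. It is not the algorithm from the paper, which is not described in these slides: it only finds exact runs of identical values that appear in more than one worksheet, and it does not handle near-miss clones. The data layout, window size, sheet names and numbers are all assumptions.

```python
from collections import defaultdict

def find_clones(sheets, window=3):
    """Find runs of `window` consecutive values that occur in several worksheets.

    sheets maps sheet name -> list of columns, each column a list of values.
    Returns {run of values: [(sheet, column index, start row), ...]}.
    """
    seen = defaultdict(list)
    for name, columns in sheets.items():
        for c, column in enumerate(columns):
            for r in range(len(column) - window + 1):
                seen[tuple(column[r:r + window])].append((name, c, r))
    # Keep only runs that show up in at least two different worksheets.
    return {
        values: places
        for values, places in seen.items()
        if len({place[0] for place in places}) > 1
    }

# Hypothetical data: three values were copied from Intake into Report.
sheets = {
    "Intake": [[120.5, 98.0, 143.2, 77.1]],
    "Report": [[5.0, 120.5, 98.0, 143.2]],
}

clones = find_clones(sheets)
for values, places in clones.items():
    print(values, "found at", places)
```

If 98.0 is later corrected on Intake but not on Report, the run stops matching exactly; catching that situation is what a near-miss detector would add on top of this.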
Clone visualization was added to our visualization, indicated with a dashed arrow. After all, when data is copy-pasted between worksheets, there is a dependency between those worksheets (albeit a different one than a formula link).
To validate our algorithm, we performed a case study at the distribution centre of the South Dutch food bank. There, they process 100,000 kilos of food per month, and keep track of that with spreadsheets. We were able to detect 61 near-miss clones, of which 25 were actual errors. Because of our analysis, this distribution centre is now running error-free spreadsheets!
To be published at ICSE 2013
And this paper concluded my PhD thesis. I will continue to work on spreadsheet analysis for at least five more years at Delft University of Technology, so in the remaining few slides, I’ll outline what I will be working on in the future.
Remember that spreadsheets stay in business for 5 years and are used by 12 people during their life span? This makes it interesting to consider ‘spreadsheet evolution’ and study how spreadsheets are created.
Visual Basic Analysis
In our current visualization and analysis technique, we only consider formulas. However, spreadsheets also allow for code to interact with data and formulas (VBA code in Excel). By analyzing this, we could make our analysis more complete and interesting.
Spreadsheet testing
Finally, we want to research how spreadsheet users test. One might think that spreadsheet users do not test, but this is not true.
In our previous studies, we often saw formulas like this one. Here, nothing is really calculated. Instead, some sort of validation is performed: if ‘find zone’!W3 is smaller than 0, we are not interested in the value. If we could extract these types of formulas, we could use them to test the spreadsheet.
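A rough sketch of what extracting such a validation could look like, assuming the check is written as a top-level IF. The slide's actual formula is not reproduced in the text, so the formula below is a hypothetical reconstruction from the description, and the function names are made up for illustration.

```python
def split_top_level(arguments):
    """Split an argument list on commas that are not nested in
    parentheses or inside a double-quoted string."""
    parts, current, depth, in_string = [], [], 0, False
    for ch in arguments:
        if ch == '"':
            in_string = not in_string
        if not in_string:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth < 0:  # the IF closed early, so it was not top-level
                    return []
            elif ch == "," and depth == 0:
                parts.append("".join(current))
                current = []
                continue
        current.append(ch)
    parts.append("".join(current))
    return parts

def extract_check(formula):
    """Return the condition of a formula of the form =IF(condition, ...), else None."""
    text = formula.strip()
    if not (text.upper().startswith("=IF(") and text.endswith(")")):
        return None
    parts = split_top_level(text[4:-1])
    return parts[0].strip() if len(parts) >= 2 else None

# Hypothetical formula matching the description on the slide.
check = extract_check("=IF('find zone'!W3<0,\"\",'find zone'!W3)")
print(check)  # 'find zone'!W3<0
```

The extracted condition describes when a value is not considered valid, so its negation could serve as an assertion to evaluate against the sheet.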
Analyzing and visualizing spreadsheets
Felienne Hermans
Thanks for reading about the research adventure I have been enjoying for the past 4 years! If you want to know more, have a look at my blog: www.felienne.com
If you are interested in collaborating, please send me an email (firstname.lastname@example.org) or a tweet (@felienne).