FUCK spreadsheets!
first steps to become a data-driven company
ABOUT ME
Steven Stadler
Data Scientist @ STUDITEMPS GmbH
RWTH Alumni (Computer Science)
@joermungandr steven.stadler@gmail.com
AGENDA
• WHY ARE SPREADSHEETS BAD?
• WHICH ALTERNATIVES DO WE HAVE?
• WHAT IS PLANNED?
“ ... Simply put, spreadsheets are good for
quick and dirty work, but they are not designed
for serious and reliable work. ... Spreadsheets
make code review difficult. The code is hidden
away in dozens if not hundreds of little cells. If
you are not reviewing your code carefully and if
you make it difficult for others to review it, how
do you expect it to be reliable?”
– Daniel Lemire
even scientists experienced with Excel make
mistakes
Piketty’s Capital (Karl Marx 2.0)
-> the yield on capital is higher than wage
growth
Piketty’s Excel code contained mistakes, fudging
and other problems
–Financial Times, May 23rd 2014
“… once the FT cleaned up and simplified the
data, the European numbers do not show any
tendency towards rising wealth inequality after
1970. An independent specialist in measuring
inequality shared the FT’s concerns.”
“I work in a large company and I can’t help but
notice the way the business team uses Excel for
everything. There are times where emergency
meetings are pulled because the numbers don’t
add up. Sometimes the issue is a single cell
among 60,000 containing a typo in the formula
(a dollar sign missing).”
–Anonymous
“Were the analyses from the AdWords reporting
document used any further? I noticed an error
in the formulas, which I am going to correct now.
It has been there since calendar week 19
(inclusive), when additional campaigns were
activated.
Sorry about that. Specifically, values beyond
row 55 were not included in the grand total at
the bottom.”
–Anonymous
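A minimal sketch of how the row-55 error above comes about and how code avoids it. The cost figures and function names are made up for illustration; the point is only that a sum over the whole column cannot silently drop rows added later.

```python
# Hypothetical campaign costs: 60 rows, i.e. more than the 55 rows
# the original spreadsheet formula covered. The values are made up.
campaign_costs = [12.5] * 60


def total_fixed_range(values, last_row=55):
    """Mimics a formula like SUM(B1:B55): rows past last_row are ignored."""
    return sum(values[:last_row])


def total_all_rows(values):
    """Sums every row, however many rows there are."""
    return sum(values)


print(total_fixed_range(campaign_costs))  # misses the last 5 rows
print(total_all_rows(campaign_costs))     # includes all 60 rows
```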
• Import of genetic data into databases
• Genes: mrtA
• Proteins: Mrta
• Excel’s autocapitalisation sucks -> several
thousand database entries were wrong
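A rough sketch of importing such names without going through a spreadsheet. Reading the file as plain text keeps the capitalisation untouched. The file layout and the naming rule in `follows_convention` are assumptions made for this example, not a description of the actual import.

```python
import csv
import io

# Hypothetical export; the identifiers follow the gene/protein example above.
raw = "kind,name\ngene,mrtA\nprotein,Mrta\n"

# csv reads every field as a plain string, so no capitalisation is changed.
rows = list(csv.DictReader(io.StringIO(raw)))


def follows_convention(row):
    """Assumed rule: gene names start lowercase, protein names start uppercase."""
    first = row["name"][0]
    return first.islower() if row["kind"] == "gene" else first.isupper()


# Collect rows that break the assumed rule before they reach the database.
bad_rows = [r for r in rows if not follows_convention(r)]
print(bad_rows)
```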
“Documents have been and still are used in
daily work that may originally have been
"owned" by employees who are no longer with
the company. As a result, we had to restore
one or two accounts.
However, this is only possible until Wednesday.
After that, data from deleted accounts cannot
be recovered!”
–Anonymous
CONCLUSION
• 88% of spreadsheets contain errors *
• a simple mistake like misplacing a decimal
point can result in huge errors
* Studies (MarketWatch, 2013)
CONCLUSION
• You will lose your data. It might be a hard disk
crash, or a computer virus, or maybe even a
rogue employee deliberately entering incorrect
information on a spreadsheet.
CONCLUSION
• Spreadsheets do not support testing. For
anything that matters, you should validate and
test your code automatically and systematically.
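A small sketch of what an automated test can look like once a formula lives in code instead of a cell. The function `conversion_rate` and its inputs are hypothetical; any calculation that matters could take its place.

```python
def conversion_rate(applications, clicks):
    """Share of clicks that led to an application; 0.0 when there were no clicks."""
    if clicks == 0:
        return 0.0
    return applications / clicks


def test_conversion_rate():
    # Normal case
    assert conversion_rate(5, 100) == 0.05
    # Edge cases that a spreadsheet cell would show as 0 or #DIV/0!
    assert conversion_rate(0, 100) == 0.0
    assert conversion_rate(3, 0) == 0.0


# Run the checks; an AssertionError would point at the broken case.
test_conversion_rate()
```

In practice such tests would run automatically on every change, for example with a test runner, rather than by hand.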
CONCLUSION
• Spreadsheets make code reviews impractical.
To inspect the code, you need to look at every
cell. In practice, this means that you cannot
reasonably ask someone to read over your
formulas to make sure that there is no mistake.
CONCLUSION
• Spreadsheets encourage redundancies.
Spreadsheets encourage copy-and-paste.
Though copying and pasting is sometimes the
right tool, it also creates redundancies. These
redundancies make it very difficult to update a
spreadsheet: are you absolutely sure that you
have changed the formula throughout?
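A sketch of the alternative to copy-and-paste: the formula is written once and applied to every row. The VAT rate and the prices are assumptions chosen only for illustration.

```python
# Assumed rate, stated once instead of being pasted into every cell.
VAT_RATE = 0.19


def gross(net):
    """Gross price for a net price, rounded to cents."""
    return round(net * (1 + VAT_RATE), 2)


# Hypothetical net prices; the same function handles every row.
net_prices = [100.0, 250.0, 19.99]
gross_prices = [gross(p) for p in net_prices]
print(gross_prices)
```

Changing the rate or the rounding rule now means changing one line, so there is no question of whether the formula was updated throughout.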
Solutions?
I. DATA MANAGEMENT
II. DATA EVALUATION
I. DATA MANAGEMENT
Store information in a
database
ADVANTAGES
• Version control
• you can revert versions
• you can jump to older versions
• changelog
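One possible way to get a changelog, sketched with SQLite from the Python standard library. The table and column names are hypothetical, and a hosted tool may offer its revision history differently; this only shows the idea of recording every change next to the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE campaign (id INTEGER PRIMARY KEY, cost REAL NOT NULL);

    -- Every change to a cost is recorded here with old and new value.
    CREATE TABLE campaign_log (
        campaign_id INTEGER,
        old_cost REAL,
        new_cost REAL,
        changed_at TEXT DEFAULT CURRENT_TIMESTAMP
    );

    CREATE TRIGGER log_cost_change AFTER UPDATE OF cost ON campaign
    BEGIN
        INSERT INTO campaign_log (campaign_id, old_cost, new_cost)
        VALUES (OLD.id, OLD.cost, NEW.cost);
    END;

    INSERT INTO campaign (id, cost) VALUES (1, 100.0);
    UPDATE campaign SET cost = 120.0 WHERE id = 1;
""")

# The log shows what the value was before the update.
history = conn.execute(
    "SELECT campaign_id, old_cost, new_cost FROM campaign_log"
).fetchall()
print(history)
```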
ADVANTAGES
• Access management
• restrict user access
• log user activity
• prevent data "ownership"
ADVANTAGES
• Quality Control
• data can be guarded
• force specific data format
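A minimal sketch of guarding data with constraints, again using SQLite as a stand-in for whatever database is chosen. The table layout and the allowed value ranges are assumptions for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE campaign (
        name TEXT NOT NULL,
        week INTEGER NOT NULL CHECK (week BETWEEN 1 AND 53),
        cost REAL NOT NULL CHECK (cost >= 0)
    )
""")

# A valid row is accepted.
conn.execute("INSERT INTO campaign VALUES (?, ?, ?)", ("adwords_a", 19, 120.0))

# A row with an impossible calendar week is rejected by the database itself.
try:
    conn.execute("INSERT INTO campaign VALUES (?, ?, ?)", ("adwords_b", 99, 80.0))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM campaign").fetchone()[0]
print(rejected, row_count)
```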
ADVANTAGES
• Linked data
• connect data cells with more context
• object-based thinking
• put comments in comment fields
ADVANTAGES
• Accessibility
• everyone can work on the same data
• prevent data islands
• no problems with sets of different Excel
sheets
AIRTABLE
SHORT FACTS
• previous company was sold to Salesforce
• funding 2015: ca. 11 million dollars
• pitched in Silicon Valley and supported by
Ashton Kutcher :-P
USE CASES
• Applicant management
• Management of campaign data
• Shift planning
• …
OTHER POSSIBILITIES
• Google Docs API (but rights should be
restricted)
• A self-developed app as an Excel replacement
(please no)
II. DATA EVALUATION
• Two possibilities
• Data Analyst / Data Scientist
• App/SaaS for data evaluation (BI)
THE BEST SOLUTION
R
BENEFITS OF R
• what is done in spreadsheets can be done in R
• faster than spreadsheets
• open-source
• code review
• version control
but … we should be realistic
a department would need a person willing to
learn and use R
OTHER OPTIONS
• every app visualises its own data
• e.g. Jobmensa with New Relic Insights
• a self-built app or service which visualises all
data gathered at our company
but why should we change?
DATA DRIVEN COMPANY
decisions based on data
if something bad happens, it's possible to inspect all
possible error sources
real time marketing
by christian “alberto” albert
