Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
How Do We Solve 

the World’s
Spreadsheet Problem?
Alex Rasmussen

@alexras
Hi, I’m Alex!
@alexras
freenome.com
bitsondisk.com
alexras.info
My Background
2009-2013: really fast sorting
2013-2016: data wrangling
2017-2018: cancer-fighting robots
I think/worry a lot 

about spreadsheets.
Today’s focus:
spreadsheet data
(for compute, felienne.com)
1. Spreadsheets are great
2. Spreadsheets are a problem
3. How we can fix it
This talk:
1. Spreadsheets are great
2. Spreadsheets are a problem
3. How we can fix it
This talk:
What’s so great
about spreadsheets?
Spreadsheets are 

Ubiquitous
1.2 billion Office users (~16% of humans)

1.2 billion Office users (~16% of humans)

60 million Office 365 customers

1.2 billion Office users (~16% of humans)

60 million Office 365 customers

>5 million businesses use Google Apps
Spreadsheets are 

Approachable
Spreadsheets are 

Flexible
Data grids
Data grids
Graphs
Data grids
Graphs
Anything tabular
Data grids
Graphs
Anything tabular
Full-scale “apps”
Tatsuo Horiuchi (b. 1940)
Kegon Falls, 2007
AutoShape on canvas
https://pasokonga.com/
So if spreadsheets are
ubiquitous, approachable,
and flexible, 

what’s the problem?
1. Spreadsheets are great
2. Spreadsheets are a problem
3. How we can fix it
This talk:
Problem #1:
Data Types
Automatic type
conversion can cause
serious problems.
DEC1
DEC1
12/1
RIKEN Identifier
2310009E13
RIKEN Identifier
2310009E13
2.31E+13
–Johnny Appleseed
“We confirmed gene name errors in 987
supplementary files from 704 published articles
(19.6% of all articl...
False Equivalence
000123

= 00123
= 123
True if they’re integers, 

but what if they’re strings?
Enumerated Types
Enumerated Types
“Prostate Cancer”
Enumerated Types
“Prostate Cancer”
“prostate cancer”
Enumerated Types
“Prostate Cancer”
“prostate cancer”
“prostatecancer”
Enumerated Types
“Prostate Cancer”
“prostate cancer”
“prostatecancer”
“PC”
Enumerated Types
“Prostate Cancer”
“prostate cancer”
“prostatecancer”
“PC”
“prostate”
Enumerated Types
“Prostate Cancer”
“prostate cancer”
“prostatecancer”
“PC”
“prostate”
“prostrate”
List Validations? Sheet Protection?
Easy to add
Easy to remove by accident
Hard to enforce
Data loss!
False equivalence!
Ontological chaos!
Mass hysteria!
Problem #2:
Queryability
Inside a
spreadsheet, things
are pretty good!
Formulas!
Pivot Tables!
Filters!
What about
querying across
spreadsheets?
Get and Transform
No Mac support.
Structure changes?
Type changes?
Column Renames?
Have fun re-loading.
And what
about joins?
There’s VLOOKUP
=VLOOKUP(“Product 1”,
Prices!$A$2:$B$9,2,FALSE)
… but, like, eww.
Data inside a
spreadsheet is hard to
connect to data outside
that spreadsheet.
Summary:

Spreadsheets are 

bad at types and

hard to query
1. Spreadsheets are great
2. Spreadsheets are a problem
3. How we can fix it
This talk:
What about databases?
Databases are great
in ways that
spreadsheets aren’t.
Databases are great at
data type definition
and enforcement.
Numeric
Monetary
Character
Binary
Date/Time
Boolean
Enumerated
Geometric
Network Address
Bit String
Text Search
UUID
XML
J...
Databases are
purpose-built for
queries and joins.
BUT
Databases 

aren’t as approachable 

as spreadsheets.
$ psql -d postgres
psql (10.4, server 9.6.9)
Type "help" for help.
postgres=#
Databases 

aren’t as flexible 

as spreadsheets.
Databases are good at
storing and querying data. 



But that’s it.
Spreadsheets and
databases have
complementary
skillsets.
So, what do we
do about it?
How to Solve
Your Spreadsheet
Problem
1. Identify the use case.
2. Stop the spread.
3. Backfill.
1. Identify the use case.
2. Stop the spread.
3. Backfill.
Every spreadsheet
solves a problem.
What is that problem?
What’s the business need?
How much data is there?
How fast does it change?
How frequent are additions?
1. Identify the use case.
2. Stop the spread.
3. Backfill.
Give new data a
structured home.
No custom apps.
At least at first.
Optimize for
Speed
1. Identify the use case.
2. Stop the spread.
3. Backfill.
It’s time for some
Data Wrangling.
(yee-haw )
Writing one-off
scripts is sometimes
the best option.
https://www.trifacta.com/start-wrangling/
https://www.trifacta.com/start-wrangling/
Infer wrangle
“recipe” from
high-level actions.
Another Option:
Programming 

By Example
FlashRelate
A B C D E . . . R
1 value year value year Comments
2 Albania 1,000 1950 930 1981 FRA 1
3 Austria 3,139 1951 3,...
Foofah
Provide input/output sample, synthesize 

layout and syntactic transformations.
https://github.com/umich-dbgroup/fo...
1. Identify the use case.
2. Stop the spread.
3. Backfill.
How to Solve Your
Spreadsheet Problem
What about 

the future?
Spreadsheets aren’t
going anywhere, 

for good reason.
Learn from the
spreadsheet.
Meet the users
where they are.
Thank you.
@alexras
Consulting Inquiries: contact@bitsondisk.com
How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018
How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018
How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018
How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018
Upcoming SlideShare
Loading in …5
×

of

How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 1 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 2 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 3 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 4 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 5 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 6 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 7 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 8 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 9 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 10 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 11 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 12 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 13 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 14 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 15 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 16 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 17 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 18 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 19 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 20 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 21 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 22 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 23 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 24 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 25 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 26 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 27 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 28 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 29 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 30 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 31 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 32 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 33 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 34 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 35 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 36 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 37 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 38 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 39 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 40 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 41 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 42 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 43 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 44 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 45 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 46 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 47 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 48 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 49 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 50 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 51 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 52 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 53 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 54 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 55 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 56 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 57 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 58 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 59 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 60 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 61 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 62 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 63 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 64 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 65 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 66 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 67 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 68 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 69 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 70 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 71 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 72 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 73 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 74 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 75 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 76 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 77 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 78 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 79 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 80 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 81 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 82 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 83 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 84 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 85 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 86 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 87 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 88 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 89 How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018 Slide 90
Upcoming SlideShare
What to Upload to SlideShare
Next
Download to read offline and view in fullscreen.

0 Likes

Share

Download to read offline

How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018

Download to read offline

In the past five years, I have spent a lot of time trying to get high-integrity data out of spreadsheets and into databases. In this talk, I explore common data integrity problems when dealing with spreadsheet data, investigate whether those integrity problems are inescapable, and share ongoing work to mitigate them.

  • Be the first to like this

How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018

  1. 1. How Do We Solve 
 the World’s Spreadsheet Problem? Alex Rasmussen
 @alexras
  2. 2. Hi, I’m Alex! @alexras freenome.com bitsondisk.com alexras.info
  3. 3. My Background 2009-2013: really fast sorting 2013-2016: data wrangling 2017-2018: cancer-fighting robots
  4. 4. I think/worry a lot 
 about spreadsheets.
  5. 5. Today’s focus: spreadsheet data (for compute, felienne.com)
  6. 6. 1. Spreadsheets are great 2. Spreadsheets are a problem 3. How we can fix it This talk:
  7. 7. 1. Spreadsheets are great 2. Spreadsheets are a problem 3. How we can fix it This talk:
  8. 8. What’s so great about spreadsheets?
  9. 9. Spreadsheets are 
 Ubiquitous
  10. 10. 1.2 billion Office users (~16% of humans)

  11. 11. 1.2 billion Office users (~16% of humans)
 60 million Office 365 customers

  12. 12. 1.2 billion Office users (~16% of humans)
 60 million Office 365 customers
 >5 million businesses use Google Apps
  13. 13. Spreadsheets are 
 Approachable
  14. 14. Spreadsheets are 
 Flexible
  15. 15. Data grids
  16. 16. Data grids Graphs
  17. 17. Data grids Graphs Anything tabular
  18. 18. Data grids Graphs Anything tabular Full-scale “apps”
  19. 19. Tatsuo Horiuchi (b. 1940) Kegon Falls, 2007 AutoShape on canvas
  20. 20. https://pasokonga.com/
  21. 21. So if spreadsheets are ubiquitous, approachable, and flexible, 
 what’s the problem?
  22. 22. 1. Spreadsheets are great 2. Spreadsheets are a problem 3. How we can fix it This talk:
  23. 23. Problem #1: Data Types
  24. 24. Automatic type conversion can cause serious problems.
  25. 25. DEC1
  26. 26. DEC1 12/1
  27. 27. RIKEN Identifier 2310009E13
  28. 28. RIKEN Identifier 2310009E13 2.31E+13
  29. 29. –Johnny Appleseed “We confirmed gene name errors in 987 supplementary files from 704 published articles (19.6% of all articles).” 
 https://genomebiology.biomedcentral.com/ articles/10.1186/s13059-016-1044-7
  30. 30. False Equivalence 000123
 = 00123 = 123 True if they’re integers, 
 but what if they’re strings?
  31. 31. Enumerated Types
  32. 32. Enumerated Types “Prostate Cancer”
  33. 33. Enumerated Types “Prostate Cancer” “prostate cancer”
  34. 34. Enumerated Types “Prostate Cancer” “prostate cancer” “prostatecancer”
  35. 35. Enumerated Types “Prostate Cancer” “prostate cancer” “prostatecancer” “PC”
  36. 36. Enumerated Types “Prostate Cancer” “prostate cancer” “prostatecancer” “PC” “prostate”
  37. 37. Enumerated Types “Prostate Cancer” “prostate cancer” “prostatecancer” “PC” “prostate” “prostrate”
  38. 38. List Validations? Sheet Protection? Easy to add Easy to remove by accident Hard to enforce
  39. 39. Data loss! False equivalence! Ontological chaos! Mass hysteria!
  40. 40. Problem #2: Queryability
  41. 41. Inside a spreadsheet, things are pretty good!
  42. 42. Formulas! Pivot Tables! Filters!
  43. 43. What about querying across spreadsheets?
  44. 44. Get and Transform
  45. 45. No Mac support.
  46. 46. Structure changes? Type changes? Column Renames? Have fun re-loading.
  47. 47. And what about joins?
  48. 48. There’s VLOOKUP =VLOOKUP(“Product 1”, Prices!$A$2:$B$9,2,FALSE) … but, like, eww.
  49. 49. Data inside a spreadsheet is hard to connect to data outside that spreadsheet.
  50. 50. Summary:
 Spreadsheets are 
 bad at types and
 hard to query
  51. 51. 1. Spreadsheets are great 2. Spreadsheets are a problem 3. How we can fix it This talk:
  52. 52. What about databases?
  53. 53. Databases are great in ways that spreadsheets aren’t.
  54. 54. Databases are great at data type definition and enforcement.
  55. 55. Numeric Monetary Character Binary Date/Time Boolean Enumerated Geometric Network Address Bit String Text Search UUID XML JSON Arrays Composite Range Pseudo-Types So Many Types of Types!
  56. 56. Databases are purpose-built for queries and joins.
  57. 57. BUT
  58. 58. Databases 
 aren’t as approachable 
 as spreadsheets.
  59. 59. $ psql -d postgres psql (10.4, server 9.6.9) Type "help" for help. postgres=#
  60. 60. Databases 
 aren’t as flexible 
 as spreadsheets.
  61. 61. Databases are good at storing and querying data. 
 
 But that’s it.
  62. 62. Spreadsheets and databases have complementary skillsets.
  63. 63. So, what do we do about it?
  64. 64. How to Solve Your Spreadsheet Problem
  65. 65. 1. Identify the use case. 2. Stop the spread. 3. Backfill.
  66. 66. 1. Identify the use case. 2. Stop the spread. 3. Backfill.
  67. 67. Every spreadsheet solves a problem. What is that problem?
  68. 68. What’s the business need? How much data is there? How fast does it change? How frequent are additions?
  69. 69. 1. Identify the use case. 2. Stop the spread. 3. Backfill.
  70. 70. Give new data a structured home.
  71. 71. No custom apps. At least at first.
  72. 72. Optimize for Speed
  73. 73. 1. Identify the use case. 2. Stop the spread. 3. Backfill.
  74. 74. It’s time for some Data Wrangling. (yee-haw )
  75. 75. Writing one-off scripts is sometimes the best option.
  76. 76. https://www.trifacta.com/start-wrangling/
  77. 77. https://www.trifacta.com/start-wrangling/ Infer wrangle “recipe” from high-level actions.
  78. 78. Another Option: Programming 
 By Example
  79. 79. FlashRelate A B C D E . . . R 1 value year value year Comments 2 Albania 1,000 1950 930 1981 FRA 1 3 Austria 3,139 1951 3,177 1955 FRA 3 4 Belgium 541 1947 601 1950 5 Bulgaria 2,964 1947 3,259 1958 FRA 1 6 Czech . . . 2,416 1950 2,503 1960 NC . . . (a) A B C D 1 Albania 1,000 1950 FRA 1 2 Albania 930 1981 FRA 1 . . . 5 Austria 3,139 1951 FRA 3 6 Austria 3,177 1955 FRA 3 . . . 9 Belgium 541 1947 10 Belgium 601 1950 . . . (b) Figure 1: (a) A semi-structured spreadsheet with two example tuples highlighted. The first tuple (red) represents the tim harvest (per 1000 hectares) for Albania in 1950. The second tuple (blue) represents the timber harvest for Austria in 1950. An extracted relational table with the same two tuples highlighted as in Fig. 1a tion. The spreadsheet author’s encoding, while perhaps eas- ier to read, makes this task a challenge. The first thought of such a programmer might be to convert the data into a form better suited to performing the computation, like the constraints. Data that matches these constraints are autom ically converted into relational tables. FLARE has two kinds of constraints. Constraints o the text within a spreadsheet cell are referred to as Provide example rows, 
 synthesize layout transformations. https://github.com/microsoft/prose
  80. 80. Foofah Provide input/output sample, synthesize 
 layout and syntactic transformations. https://github.com/umich-dbgroup/foofah
  81. 81. 1. Identify the use case. 2. Stop the spread. 3. Backfill. How to Solve Your Spreadsheet Problem
  82. 82. What about 
 the future?
  83. 83. Spreadsheets aren’t going anywhere, 
 for good reason.
  84. 84. Learn from the spreadsheet.
  85. 85. Meet the users where they are.
  86. 86. Thank you. @alexras Consulting Inquiries: contact@bitsondisk.com

In the past five years, I have spent a lot of time trying to get high-integrity data out of spreadsheets and into databases. In this talk, I explore common data integrity problems when dealing with spreadsheet data, investigate whether those integrity problems are inescapable, and share ongoing work to mitigate them.

Views

Total views

117

On Slideshare

0

From embeds

0

Number of embeds

0

Actions

Downloads

0

Shares

0

Comments

0

Likes

0

×