Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018

7 views

Published on

In the past five years, I have spent a lot of time trying to get high-integrity data out of spreadsheets and into databases. In this talk, I explore common data integrity problems when dealing with spreadsheet data, investigate whether those integrity problems are inescapable, and share ongoing work to mitigate them.

Published in: Data & Analytics
  • Be the first to comment

  • Be the first to like this

How Do We Solve The World's Spreadsheet Problem? - Velocity NY 2018

  1. 1. How Do We Solve 
 the World’s Spreadsheet Problem? Alex Rasmussen
 @alexras
  2. 2. Hi, I’m Alex! @alexras freenome.com bitsondisk.com alexras.info
  3. 3. My Background 2009-2013: really fast sorting 2013-2016: data wrangling 2017-2018: cancer-fighting robots
  4. 4. I think/worry a lot 
 about spreadsheets.
  5. 5. Today’s focus: spreadsheet data (for compute, felienne.com)
  6. 6. 1. Spreadsheets are great 2. Spreadsheets are a problem 3. How we can fix it This talk:
  7. 7. 1. Spreadsheets are great 2. Spreadsheets are a problem 3. How we can fix it This talk:
  8. 8. What’s so great about spreadsheets?
  9. 9. Spreadsheets are 
 Ubiquitous
  10. 10. 1.2 billion Office users (~16% of humans)

  11. 11. 1.2 billion Office users (~16% of humans)
 60 million Office 365 customers

  12. 12. 1.2 billion Office users (~16% of humans)
 60 million Office 365 customers
 >5 million businesses use Google Apps
  13. 13. Spreadsheets are 
 Approachable
  14. 14. Spreadsheets are 
 Flexible
  15. 15. Data grids
  16. 16. Data grids Graphs
  17. 17. Data grids Graphs Anything tabular
  18. 18. Data grids Graphs Anything tabular Full-scale “apps”
  19. 19. Tatsuo Horiuchi (b. 1940) Kegon Falls, 2007 AutoShape on canvas
  20. 20. https://pasokonga.com/
  21. 21. So if spreadsheets are ubiquitous, approachable, and flexible, 
 what’s the problem?
  22. 22. 1. Spreadsheets are great 2. Spreadsheets are a problem 3. How we can fix it This talk:
  23. 23. Problem #1: Data Types
  24. 24. Automatic type conversion can cause serious problems.
  25. 25. DEC1
  26. 26. DEC1 12/1
  27. 27. RIKEN Identifier 2310009E13
  28. 28. RIKEN Identifier 2310009E13 2.31E+13
  29. 29. –Johnny Appleseed “We confirmed gene name errors in 987 supplementary files from 704 published articles (19.6% of all articles).” 
 https://genomebiology.biomedcentral.com/ articles/10.1186/s13059-016-1044-7
  30. 30. False Equivalence 000123
 = 00123 = 123 True if they’re integers, 
 but what if they’re strings?
  31. 31. Enumerated Types
  32. 32. Enumerated Types “Prostate Cancer”
  33. 33. Enumerated Types “Prostate Cancer” “prostate cancer”
  34. 34. Enumerated Types “Prostate Cancer” “prostate cancer” “prostatecancer”
  35. 35. Enumerated Types “Prostate Cancer” “prostate cancer” “prostatecancer” “PC”
  36. 36. Enumerated Types “Prostate Cancer” “prostate cancer” “prostatecancer” “PC” “prostate”
  37. 37. Enumerated Types “Prostate Cancer” “prostate cancer” “prostatecancer” “PC” “prostate” “prostrate”
  38. 38. List Validations? Sheet Protection? Easy to add Easy to remove by accident Hard to enforce
  39. 39. Data loss! False equivalence! Ontological chaos! Mass hysteria!
  40. 40. Problem #2: Queryability
  41. 41. Inside a spreadsheet, things are pretty good!
  42. 42. Formulas! Pivot Tables! Filters!
  43. 43. What about querying across spreadsheets?
  44. 44. Get and Transform
  45. 45. No Mac support.
  46. 46. Structure changes? Type changes? Column Renames? Have fun re-loading.
  47. 47. And what about joins?
  48. 48. There’s VLOOKUP =VLOOKUP(“Product 1”, Prices!$A$2:$B$9,2,FALSE) … but, like, eww.
  49. 49. Data inside a spreadsheet is hard to connect to data outside that spreadsheet.
  50. 50. Summary:
 Spreadsheets are 
 bad at types and
 hard to query
  51. 51. 1. Spreadsheets are great 2. Spreadsheets are a problem 3. How we can fix it This talk:
  52. 52. What about databases?
  53. 53. Databases are great in ways that spreadsheets aren’t.
  54. 54. Databases are great at data type definition and enforcement.
  55. 55. Numeric Monetary Character Binary Date/Time Boolean Enumerated Geometric Network Address Bit String Text Search UUID XML JSON Arrays Composite Range Pseudo-Types So Many Types of Types!
  56. 56. Databases are purpose-built for queries and joins.
  57. 57. BUT
  58. 58. Databases 
 aren’t as approachable 
 as spreadsheets.
  59. 59. $ psql -d postgres psql (10.4, server 9.6.9) Type "help" for help. postgres=#
  60. 60. Databases 
 aren’t as flexible 
 as spreadsheets.
  61. 61. Databases are good at storing and querying data. 
 
 But that’s it.
  62. 62. Spreadsheets and databases have complementary skillsets.
  63. 63. So, what do we do about it?
  64. 64. How to Solve Your Spreadsheet Problem
  65. 65. 1. Identify the use case. 2. Stop the spread. 3. Backfill.
  66. 66. 1. Identify the use case. 2. Stop the spread. 3. Backfill.
  67. 67. Every spreadsheet solves a problem. What is that problem?
  68. 68. What’s the business need? How much data is there? How fast does it change? How frequent are additions?
  69. 69. 1. Identify the use case. 2. Stop the spread. 3. Backfill.
  70. 70. Give new data a structured home.
  71. 71. No custom apps. At least at first.
  72. 72. Optimize for Speed
  73. 73. 1. Identify the use case. 2. Stop the spread. 3. Backfill.
  74. 74. It’s time for some Data Wrangling. (yee-haw )
  75. 75. Writing one-off scripts is sometimes the best option.
  76. 76. https://www.trifacta.com/start-wrangling/
  77. 77. https://www.trifacta.com/start-wrangling/ Infer wrangle “recipe” from high-level actions.
  78. 78. Another Option: Programming 
 By Example
  79. 79. FlashRelate A B C D E . . . R 1 value year value year Comments 2 Albania 1,000 1950 930 1981 FRA 1 3 Austria 3,139 1951 3,177 1955 FRA 3 4 Belgium 541 1947 601 1950 5 Bulgaria 2,964 1947 3,259 1958 FRA 1 6 Czech . . . 2,416 1950 2,503 1960 NC . . . (a) A B C D 1 Albania 1,000 1950 FRA 1 2 Albania 930 1981 FRA 1 . . . 5 Austria 3,139 1951 FRA 3 6 Austria 3,177 1955 FRA 3 . . . 9 Belgium 541 1947 10 Belgium 601 1950 . . . (b) Figure 1: (a) A semi-structured spreadsheet with two example tuples highlighted. The first tuple (red) represents the tim harvest (per 1000 hectares) for Albania in 1950. The second tuple (blue) represents the timber harvest for Austria in 1950. An extracted relational table with the same two tuples highlighted as in Fig. 1a tion. The spreadsheet author’s encoding, while perhaps eas- ier to read, makes this task a challenge. The first thought of such a programmer might be to convert the data into a form better suited to performing the computation, like the constraints. Data that matches these constraints are autom ically converted into relational tables. FLARE has two kinds of constraints. Constraints o the text within a spreadsheet cell are referred to as Provide example rows, 
 synthesize layout transformations. https://github.com/microsoft/prose
  80. 80. Foofah Provide input/output sample, synthesize 
 layout and syntactic transformations. https://github.com/umich-dbgroup/foofah
  81. 81. 1. Identify the use case. 2. Stop the spread. 3. Backfill. How to Solve Your Spreadsheet Problem
  82. 82. What about 
 the future?
  83. 83. Spreadsheets aren’t going anywhere, 
 for good reason.
  84. 84. Learn from the spreadsheet.
  85. 85. Meet the users where they are.
  86. 86. Thank you. @alexras Consulting Inquiries: contact@bitsondisk.com

×