Garbage in, rainbows out


Published in: Technology, Business

  • I wanted to call this talk “dirty inputs.”
  • Gatekeeper: unexpected inputs fail and get pushed back to the user.
  • Fault-tolerant systems. Model validations are the most obvious form of cleansing. They are the gatekeepers.
  • How about bulk records? I’m using CSV uploads as my example, but it could be any source: consuming external JSON, sharing databases, anything outside of the black rectangle. What’s the downside to relying on validation when we get garbage? Best case, it fails, we ask the user to fix their stuff and try again. Do we allow half-fails? “Fix lines X, Y and Z?”
  • Super basic example. Unravels a CSV file, turning a potentially wide table into a long one.
  • Typical data grid, once again from any source.
  • And now we have a stream of data, which allows for more graceful failures. Since the entire input is in the system, we can prompt the user to fix the errors or devise filters to do it automatically. Is it possible we would get better filters in the future? Better methods of cleaning the data? I’m sure none of you have ever seen a database where the columns were shifted by 1 because of a boneheaded mistake that happened 2 months ago. Me neither.
  • The schemaless store is just the landing area for the data before it is moved into our database in batches. The stream could be MongoDB, SQLite, or cave drawings with a webcam where your OCR software processes it into something usable. It doesn’t matter.
  • What if it looked more like this? How many of you do fake deletes? Why? How is an update different from a delete? If we automate the input/filter process, why do it only once? Why throw out anything at all? How would that system be different? Here is as far as I am. Ish. That “All data” is a few hundred gigs in MySQL tables, and I have scripts that run when something updates. Add a ZIP and 56 minutes later it shows up in my Rails app.
  • Nathan Marz had this idea first.
  • How’s about this? Query is a function of all data. Capture is done in the rawest, most granular way possible, so speed wouldn’t be a consideration. Events rather than “stuff,” so it can be rewound to the beginning of time.
  • What is coffee? It’s filthy ass water, that’s what it is. Coffeeologists (board certified ones) measure the quality of coffee using the same dimensions as clean drinking water: pH, dissolved solids, rat feces. The usual.
  • Pre-ground grocery store beans have been sitting there for months and have lost their volatile flavor molecules. The drip machine sprays unfiltered water that is too hot into the center of the filter, over-extracting some grounds and leaving others under-extracted. The coffee hits the bottom of the hot glass carafe and is instantly burned. What about the coffee nerds here?
  • Pour-over fixes the water temp and center over-extraction. Press pot goes further and allows for extraction fine-tuning.
  • The issue is variables out of your control: bean age and water quality. Press pots can come close, but you’re brewing blind.
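
The unravel-then-filter flow from the CSV notes above can be sketched in a few lines. This is a minimal Python sketch, not the code from the talk: the names `unravel`, `zip_filter` and `run_filters`, the sample columns, and the ZIP padding rule are all assumptions made up for illustration. It turns a wide grid into a long stream of cells, then sorts the cells into clean and rejected piles instead of failing the whole upload.

```python
import csv
import io


def unravel(csv_text):
    """Turn a wide CSV grid into a long stream of (row, column, value) cells."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Data rows are numbered from 2 because line 1 of the file is the header.
    for row_number, row in enumerate(reader, start=2):
        for column, value in row.items():
            yield {"row": row_number, "column": column, "value": value}


def zip_filter(cell):
    """Hypothetical filter: a ZIP must be digits; pad short ones with zeros."""
    if cell["column"] != "zip":
        return cell
    value = (cell["value"] or "").strip()
    if value.isdigit() and len(value) <= 5:
        return {**cell, "value": value.zfill(5)}
    raise ValueError(f"bad zip {value!r}")


def run_filters(cells, filters):
    """Apply every filter to every cell; collect failures instead of aborting."""
    clean, rejected = [], []
    for cell in cells:
        try:
            for apply_filter in filters:
                cell = apply_filter(cell)
            clean.append(cell)
        except ValueError as error:
            rejected.append({**cell, "error": str(error)})
    return clean, rejected


# Made-up sample upload: one short ZIP, one garbage ZIP, one good ZIP.
SAMPLE = "name,zip\nAda,2139\nBob,not-a-zip\nCy,90210\n"
clean, rejected = run_filters(unravel(SAMPLE), [zip_filter])
```

Because rejected cells keep their row numbers, the “fix lines X, Y and Z” prompt falls out for free, and a better filter written later can be rerun over the same raw cells.
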
  • Garbage in, rainbows out

    1. Garbage In, Rainbows Out
    2. Zach Briggs / New Developer / 7 Years in Data Analytics / Mike Fidler / Systems Security Specialist / Ex Geologist / His Unix experience is old enough to drink / Amateur Inventor
    3. Validates
    4. This form contains one error.
    5. Screenshot of column view / Screenshot of serial view
    6. Application / Raw Data Database
    7. Continuous Application / Raw Data Process Database
    8. Coffee
    9. Why your coffee is shit.
    10. Anything but drip
    11. Thank You / Zach - of Record / Mike - Neck Beard / Available for hire
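
The “query is a function of all data” idea from the notes (slide 7) can be sketched as an append-only event log that gets replayed to build whatever view is needed. This is a minimal Python sketch under assumptions: the event shapes, the `replay` function, and the sample records are hypothetical, not the speaker's MySQL setup.

```python
def replay(events, until=None):
    """Rebuild state by folding over the event log, oldest first.

    `until` optionally stops the replay at a timestamp, which is what makes
    rewinding possible: nothing is ever overwritten or removed from the log.
    """
    state = {}
    for event in sorted(events, key=lambda e: e["at"]):
        if until is not None and event["at"] > until:
            break
        if event["type"] == "set":
            # An update is just a newer "set" for the same key.
            state[event["key"]] = event["value"]
        elif event["type"] == "delete":
            # A delete is also just an event; older events stay in the log.
            state.pop(event["key"], None)
    return state


# Made-up log: an insert, a typo, its correction, and a "fake delete".
LOG = [
    {"at": 1, "type": "set", "key": "zip:02139", "value": "Cambridge"},
    {"at": 2, "type": "set", "key": "zip:90210", "value": "Beverly Hils"},
    {"at": 3, "type": "set", "key": "zip:90210", "value": "Beverly Hills"},
    {"at": 4, "type": "delete", "key": "zip:02139"},
]

now = replay(LOG)                  # view as of the latest event
before_fix = replay(LOG, until=2)  # rewound to before the correction
```

Here an update and a delete differ only in the event type, and both leave the earlier history intact, so any past view can be recomputed from the log.
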