
Engineering data quality


Garbage in, garbage out: we have all heard about the importance of data quality. High-quality data is essential for all types of use cases, whether reporting, anomaly detection, or avoiding bias in machine learning applications. But where does high-quality data come from? How can one assess data quality, improve it if necessary, and prevent bad quality from slipping in? Obtaining good data quality involves several engineering challenges. In this presentation, we will go through tools and strategies that help us measure, monitor, and improve data quality. We will enumerate factors in data collection and data processing that can cause data quality issues, and we will show how to use engineering to detect and mitigate data quality problems.

Published in: Data & Analytics

  1. Engineering data quality. Øredev, 2019-11-08. Lars Albertsson (@lalleal), Scling.
  2. Data value requires data quality. "Hey, the CRM pipeline is down! We really need the data." "But the data is completely bogus, and we need to work with the provider to fix it." "…? But we use it to feed our analytics, and need data now!"
  3. Scope. A data engineering perspective on data quality: ● Context: big data environments ● Origins of good or bad data ● Quality assessment ● Quality assurance
  4. Big data: a collaboration paradigm. (Diagram: stream storage, data lake; data democratised.)
  5. Data pipelines. (Diagram: data lake.)
  6. More data, decreased friction. (Diagram: data lake, stream storage.)
  7. Scling: data-value-as-a-service. ● Extract value from your data ● Data platform + custom data pipelines ● Imitate data leaders: ○ Quick idea-to-production ○ Operational efficiency. Our marketing strategy: ● Promiscuously share knowledge ○ On slides devoid of glossy polish :-)
  8. Data platform overview. (Diagram labels: data lake, batch processing, cold store, dataset, pipeline, online services, offline data platform, job, workflow orchestration.)
  9. Data quality dimensions. ● Timeliness ○ E.g. the customer engagement report was produced at the expected time ● Correctness ○ The numbers in the reports were calculated correctly ● Completeness ○ The report includes information on all customers, using all information from the whole time period ● Consistency ○ The customer summaries are all based on the same time period
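Timeliness, completeness, and consistency can be checked mechanically. A minimal Python sketch, using a hypothetical per-customer report dataset and a made-up deadline (none of these names come from the slides):

```python
from datetime import datetime

# Hypothetical report records: one row per customer, with the time
# period covered and the time the row was produced.
records = [
    {"customer": "a", "period": "2019-10", "produced_at": datetime(2019, 11, 1, 6)},
    {"customer": "b", "period": "2019-10", "produced_at": datetime(2019, 11, 1, 7)},
]
expected_customers = {"a", "b", "c"}
deadline = datetime(2019, 11, 1, 8)  # report expected by 08:00

# Timeliness: every row was produced before the deadline.
timely = all(r["produced_at"] <= deadline for r in records)
# Completeness: every expected customer is present.
complete = expected_customers <= {r["customer"] for r in records}
# Consistency: all rows cover the same time period.
consistent = len({r["period"] for r in records}) == 1

print(timely, complete, consistent)  # True False True: customer "c" is missing
```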
  10. The truth is out there. "I love working with data, because data is true."
  11. Truth mutated. "We put the new model out for A/B testing, and it looks great!" "Great. What fraction of test users showed a KPI improvement?" "100%!" "Hmm… Wait, it seems ads were disabled for the test group…"
  12. Not the whole truth. "Our steel customers are affected by cracks, causing corrosion. Can you look at our defect reports, and help us predict issues?" "Sure, hang on. We have found a strong signal: the customer id…"
  13. Something but the truth. "Huh, why do we have a sharp increase in invalid_media_type?" "I'll have a look. It seems that we have a new media type, 'bullshit'…"
  14. Hearsay. "Manufacturing line disruptions are expensive. Can you look at our sensor data, and help us predict?" "Sure, hang on."
  15. Hearsay. "Manufacturing line disruptions are expensive. Can you look at our sensor data, and help us predict?" "Sure, hang on. This looks like an early indicator! Wait, is this interpolated?"
  16. Events vs current state. ● join(event, snapshot) → always a time mismatch ● Usually acceptable ○ In one direction. (Diagram: joining events against DB and a later snapshot DB'.)
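The time mismatch can be made concrete with a toy join, where the user dimension snapshot postdates the event (all data here is invented for illustration):

```python
# An order event from 10:00 joined with a user snapshot taken at 12:00.
# The snapshot may already reflect a change that happened after the
# event, so the joined record mixes two points in time.
event = {"user_id": 1, "item": "book", "event_time": "10:00"}
user_snapshot = {1: {"country": "DK"}}  # user moved from SE at 11:00

joined = {**event, **user_snapshot[event["user_id"]]}
# joined["country"] is "DK", although the order was placed from "SE"
print(joined)
```

Whether this is acceptable usually depends on the direction: enriching events with a later snapshot is often fine, while reconstructing past state from it is not.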
  17. Monitoring timeliness, examples. ● Datamon (Spotify internal) ● Twitter Ambrose (dead?) ● Airflow
  18. Ensuring timeliness. ● First rule of distributed systems: avoid distributed systems. ● Keep things simple. ● Master workflow orchestration. Other than that, a very large topic…
  19. Design for testability. ● Output = function(input, code) ● No dependency on external services ● Avoid non-deterministic factors
  20. Potential test scopes. ● Unit/component ● Single job ● Multiple jobs ● Pipeline, including service ● Full system, including client. Choose stable interfaces; each scope has a cost.
  21. Recommended scopes. ● Single job ● Multiple jobs ● Pipeline, including service
  22. Scopes to avoid. ● Unit/component ○ Few stable interfaces ○ Avoid mocks and dependency injection rituals ● Full system, including client ○ Client automation is fragile. "Focus on functional system tests, complement with smaller where you cannot get coverage." (Henrik Kniberg)
  23. Testing a single batch job, with a standard Scalatest harness: 1. Generate input (file://test_input/) 2. Run the job in local mode 3. Verify output (file://test_output/). Runs well in CI and from the IDE.
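The same three-step harness (generate input, run in local mode, verify output) can be sketched in Python with pytest conventions; `run_job` is a hypothetical stand-in for the batch job under test, reading and writing local files the way a Spark job in local mode would:

```python
import json
import pathlib
import tempfile

def run_job(input_path, output_path):
    """Hypothetical batch job: uppercase every user's name."""
    lines = pathlib.Path(input_path).read_text().splitlines()
    users = [json.loads(line) for line in lines]
    cleaned = [{**u, "name": u["name"].upper()} for u in users]
    pathlib.Path(output_path).write_text(
        "\n".join(json.dumps(u) for u in cleaned))

def test_clean_users():
    with tempfile.TemporaryDirectory() as d:
        inp, out = f"{d}/input.json", f"{d}/output.json"
        # 1. Generate input
        pathlib.Path(inp).write_text('{"name": "ada"}\n{"name": "bob"}')
        # 2. Run the job in local mode
        run_job(inp, out)
        # 3. Verify output
        lines = pathlib.Path(out).read_text().splitlines()
        results = [json.loads(line) for line in lines]
        assert [u["name"] for u in results] == ["ADA", "BOB"]
```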
  24. Invariants. ● Some things are true ○ For every record ○ For every job invocation ● Not necessarily in production ○ Reuse invariant predicates as quality probes

      class CleanUserTest extends FlatSpec {
        def validateInvariants(
            input: Seq[User],
            output: Seq[User],
            counters: Map[String, Int]) = {
          output.foreach(recordInvariant)
          // Dataset invariants
          assert(input.size === output.size)
          assert(input.size >= counters("upper-cased"))
        }

        def recordInvariant(u: User) =
          assert(??? === 2)  // record-level predicate lost in the slide export

        def runJob(input: Seq[User]): (Seq[User], Map[String, Int]) = {
          // Same as before
          ...
          validateInvariants(input, output, counters)
          (output, counters)
        }

        // Test case is the same
      }
  25. Measuring correctness: counters. ● User-defined ● Technical, from the framework ○ Execution time ○ Memory consumption ○ Data volumes ○ … SQL: nope.

      case class Order(item: ItemId, userId: UserId)
      case class User(id: UserId, country: String)

      val orders = read(orderPath)
      val users = read(userPath)
      val orderNoUserCounter = longAccumulator("order-no-user")

      val joined: C[(Order, Option[User])] = orders
        .groupBy(_.userId)
        .leftJoin(users.groupBy(
        .values
      val orderWithUser: C[(Order, User)] = joined
        .flatMap {
          case (order, Some(user)) => Some((order, user))
          case (order, None) =>
            orderNoUserCounter.add(1)
            None
        }
  26. Measuring correctness: counters. ● Processing tool (Spark/Hadoop) counters ○ Odd code path => bump counter ○ System metrics. (Diagram: Hadoop/Spark counters flow to a metrics DB, then to standard graphing tools and a standard alerting service.)
  27. Measuring correctness: pipelines. ● Processing tool (Spark/Hadoop) counters ○ Odd code path => bump counter ○ System metrics ● Dedicated quality assessment pipelines. (Diagram: a quality assessment job emits a tiny quality metadataset, feeding the metrics DB, standard graphing tools, and a standard alerting service.)
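A dedicated quality assessment job can be as small as a function that reduces a dataset to a handful of metrics and writes them to a tiny metadataset. A hedged sketch; the metric names and the `orders` sample are invented:

```python
def assess_quality(records):
    """Reduce a dataset to a tiny quality metadataset (a dict of metrics)."""
    total = len(records)
    null_country = sum(1 for r in records if not r.get("country"))
    return {
        "record_count": total,
        "null_country_fraction": null_country / total if total else 1.0,
    }

# Toy input dataset; in production this would be read from the data lake,
# and the resulting metrics written where graphing/alerting tools can see them.
orders = [{"id": 1, "country": "SE"}, {"id": 2, "country": None}]
metrics = assess_quality(orders)
print(metrics)  # {'record_count': 2, 'null_country_fraction': 0.5}
```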
  28. Conditional consumption. ● Express in workflow orchestration ○ Read the metrics DB or quality dataset ○ Producer can recommend, not decide ● Insufficient quality? ○ Wait for a bug fix ○ Use an older/newer input dataset. (Diagram: recommendation metrics gate a report.)
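The consumer side of conditional consumption boils down to a decision the orchestrated task makes before reading its input: check the producers' quality metadatasets and fall back to an older dataset if the newest one does not meet the bar. A sketch with invented field names:

```python
def choose_input(candidates, min_completeness=0.99):
    """Pick the newest candidate dataset whose quality metrics pass the bar.

    candidates: list of {"date": ..., "metrics": {...}} dicts, where the
    metrics come from the producer's quality metadataset. The producer
    recommends; this consumer decides.
    """
    for ds in sorted(candidates, key=lambda d: d["date"], reverse=True):
        if ds["metrics"]["completeness"] >= min_completeness:
            return ds["date"]
    return None  # no acceptable input yet: wait for a bug fix

candidates = [
    {"date": "2019-11-07", "metrics": {"completeness": 0.995}},
    {"date": "2019-11-08", "metrics": {"completeness": 0.91}},  # flagged low
]
print(choose_input(candidates))  # 2019-11-07: skips the newer, incomplete dataset
```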
  29. The unknown unknowns. ● Measure user behaviour ○ E.g. session length, engagement, funnel ● Revert to an old egress dataset if necessary. (Diagram: a job measures interactions from a stream into a metrics DB, wired to a standard alerting service.)
  30. Fuzzy products. ● Data-driven applications often have fuzzy logic ○ No clear right output ○ Quality == total experience for all users ● Testing, to be productive, must be binary ● The cause → effect connection must be strong ○ Cause = code change ○ Effect = quality degradation
  31. Converting fuzzy to binary. ● Break out binary behaviour: is the output for a simple input sane? Do invariants hold? Does a clear-cut scenario give a clear-cut result?
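A sketch of breaking binary behaviour out of a fuzzy product: a geocoder has no single right answer, but invariants and clear-cut scenarios can still be asserted. `geocode` here is a hypothetical stub returning canned data, not a real API:

```python
def geocode(query):
    """Hypothetical fuzzy geocoder, stubbed with canned output."""
    return {"name": "Stockholm, Sweden", "lat": 59.33, "lon": 18.07}

result = geocode("Stockholm")
assert result["name"]                  # output sane: non-empty
assert -90 <= result["lat"] <= 90      # invariant: valid latitude
assert -180 <= result["lon"] <= 180    # invariant: valid longitude
assert "Stockholm" in result["name"]   # clear-cut scenario, clear-cut result
```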
  32. Golden scenario suite. ● Scenarios that must never fail ● May include real-world data (e.g. "Stockholm" against geo data)
  33. Weighted quality sum. ● The sum of test case results should not regress ● Individual regressions are acceptable ● Example: map searches, done from a US IP address with an English-language browser setting. Sum = 5.4

      Input        | Output              | Verdict (0-1) | Weight | Weighted verdict
      Springfield  | Springfield, MA     | 1             | 2      | 2
      Hartfield    | Hartford, CT        | 0             | 1      | 0
      Philadelphia | Philadelphia, Egypt | 0.2           | 5      | 1
      Boston       | Boston, UK          | 0.4           | 4      | 1.6
      Betlehem     | Betlehem, Israel    | 0.8           | 1      | 0.8
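The weighted sum is easy to turn into a binary regression check: compute the suite score and require that it does not drop below the previous release's score (5.4 in the slide's example). A minimal sketch:

```python
def weighted_quality(results):
    """Sum of per-case verdicts (0-1) times their weights."""
    return sum(verdict * weight for verdict, weight in results)

# (verdict, weight) pairs from the example suite.
suite = [(1, 2), (0, 1), (0.2, 5), (0.4, 4), (0.8, 1)]
score = weighted_quality(suite)

previous_release_score = 5.4
assert score >= previous_release_score - 1e-9  # no suite-level regression
```

Individual cases may regress as long as the aggregate holds, which is exactly the property the slide asks for.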
  34. Testing with real-world / production data. ● Data is volatile ○ Separate code changes from test data changes ○ Take snapshots to use for tests ● Beware of privacy issues
  35. Data completeness. ● Static workflow DAGs ensure dataset completeness ● Dataset completeness != data completeness ● Collected events might be delayed ● The event creation-to-collection delay is unbounded ○ Consider offline phones
  36. Incompleteness recovery. (SQL: use a separate job for measuring window leakage.)

      val orderLateCounter = longAccumulator("order-event-late")
      val hourPaths = conf.order.split(",")
      val order = hourPaths
        .map(path => read(path))
        .reduce((a, b) => a.union(b))
      val orderThisHour = order
        .map { cl =>
          // Count the events that came after the delay window
          if (cl.eventTime.hour + config.delayHours < config.hour) {
            orderLateCounter.add(1)
          }
          cl
        }
        .filter(cl => cl.eventTime.hour == config.hour)

      class OrderShuffle(SparkSubmitTask):
          hour = DateHourParameter()
          delay_hours = IntParameter()
          jar = 'orderpipeline.jar'
          entry_class = ''

          def requires(self):
              # Note: this delays processing by delay_hours hours.
              return [Order(hour=hour) for hour in
                      [self.hour + timedelta(hours=h)
                       for h in range(self.delay_hours)]]

          def output(self):
              return HdfsTarget("/prod/red/order/v1/"
                                f"delay={self.delay_hours}/"
                                f"{self.hour:%Y/%m/%d/%H}/")

          def app_options(self):
              return ["--hour", self.hour,
                      "--delay-hours", self.delay_hours,
                      "--order", ",".join([i.path for i in self.input()]),
                      "--output", self.output().path]
  37. Fast data, complete data. (Delays: 0, 4, 12 hours.)

      class OrderShuffleAll(WrapperTask):
          hour = DateHourParameter()

          def requires(self):
              return [OrderShuffle(hour=self.hour, delay_hours=d)
                      for d in [0, 4, 12]]

      class OrderDashboard(mysql.CopyToTable):
          hour = DateHourParameter()

          def requires(self):
              return OrderShuffle(hour=self.hour, delay_hours=0)

      class FinancialReport(SparkSubmitTask):
          date = DateParameter()

          def requires(self):
              return [OrderShuffle(
                          hour=datetime.combine(, time(hour=h)),
                          delay_hours=12)
                      for h in range(24)]
  38. Things to plan for early: data quality.
  39. Things to plan for early: data quality, input validation, software supply chain, multi-cloud, performance, cloud native, UX, user feedback, web security, foobarility, scalability, testability, accessibility, bidi languages, i18n, machine learning bias, mobile browsers. Or just get the MVP out?
  40. Does anyone care about data quality? No. (Diagram: no graph → any graph, "meh" → valuable graph, "great!"; no model → any ML model → valuable model.)
  41. 1999: Does anyone care about code quality? No. (7 years)
  42. Code quality, 1999. "Behold our great code!" "We think some QA and test automation would be great." "Nah, boring. We don't have time. Just put it in production for us."
  43. Code quality, 2019. "We have invented DevOps and continuous delivery. Test automation is key!" "That sounds familiar…"
  44. Data quality, 2019. "Behold our great model!" "We think some data quality assessment and automation would be great." "Nah, why? We don't have time. Just put it in production for us."
  45. Data quality, 2029. "We have invented MLOps and continuous modelling. Quality feedback automation is key!" "That sounds familiar…"
  46. Changing culture bottom-up.
  47. Repeating the success?
  48. Resources, credits. Presentations, articles on related subjects: … Useful tools: … Thank you: ● Irene Gonzálvez, Spotify ● Anders Holst, RISE
  49. Tech has massive impact on society. Product? Supplier? Employer? Cloud? Make an active choice whether to have an impact!
  50. Laptop sticker. Vintage data visualisations, by Karin Lind: ● Charles Minard: Napoleon's Russian campaign of 1812. Drawn 1869. ● Matthew F. Maury: Wind and Current Chart of the North Atlantic. Drawn 1852. ● Florence Nightingale: Causes of Mortality in the Army of the East. Crimean war, 1854-1856. Drawn 1858. ○ Blue = disease, red = wounds, black = battle + other. ● Harold Craft: Radio Observations of the Pulse Profiles and Dispersion Measures of Twelve Pulsars, 1970. ○ Joy Division: Unknown Pleasures, 1979. ○ "Joy plot" → "ridge plot"