Mortal analytics - Covid-19 and the problem of data quality

Social media are full of Covid-19 graphs, each pointing to an "obvious" conclusion that fits the author's agenda. Unfortunately, even the official sources publish analytics that point at incorrect conclusions. Bad data quality has become a matter of life and death.

We look at the quality problems with official Covid-19 data presentations. The problems are common in all domains, and solutions are known, but not widespread. We describe tools and patterns that data-mature companies use to assess and improve data quality in similar situations. Mastering data quality and data operations is a prerequisite for building sustainable AI solutions, and we will explain how these patterns fit into machine learning product development.

  1. Mortal Analytics - Covid-19 & the problem of data quality. Lars Albertsson (@lalleal), Scling, www.scling.com
  2. Why this presentation?
     ● Non-goal: Argue for or against a particular strategy
       ○ We are already too polarised
     ● Goals:
       ○ What can go wrong with data quality?
       ○ What can we learn?
       ○ Data engineering as a solution
  3. Imperial College: We saved the world!
     https://www.bbc.com/news/health-52968523
  4. Imperial College model predictions for Sweden
     https://www.medrxiv.org/content/10.1101/2020.04.11.20062133v1.full.pdf
  5. Model and reality
     https://swprs.org/a-swiss-doctor-on-covid-19/
  6. Imperial College model code
     ● Screenshots are only part of functions...
     ● A couple of regression tests - no tests validating correct functionality
     ● My impression: No chance of producing a high-confidence result
     https://github.com/mrc-ide/covid-sim
  7. Imperial College: bugs are not a problem
     https://lockdownsceptics.org/code-review-of-fergusons-model/
  8. Example Imperial College bug handling
     https://github.com/mrc-ide/covid-sim/issues/330
     Label: Imperial College response
  9. Bad predictions are harmful
     ● Each action has a health cost
       ○ Economic misery → social misery → health misery
       ○ Mental health
       ○ Drug / alcohol use
       ○ Domestic violence
     ● During Ebola pandemic, 10x more people died from fear of hospitals than from Ebola
     https://medium.com/@robert.munro/the-tech-communitys-response-to-ebola-44d2c8dbb5be
  10. Ways to degrade data & analytics quality
      ● Deviating definitions
      ● Selection
      ● Deviating context
      ● Presentation
      ● Interpretation
      ● Data collection
      ● Data processing
      ● Lack of quality assessment
      ● Lack of quality improvement
      Add senior software engineers with production experience.
      Data engineering
  11. Define death
      Observed Covid-19 death definitions:
      ● Infection confirmed, last 30 days
      ● Infection confirmed, any time
      ● Infection assumed
      ● Assumed cause
      ● Hospitalised
      ● Other disease complicated by Covid-19
      ● Excess mortality
  12. Sweden on the rise?
      "New Covid-19 cases per day"
      https://youtu.be/4uTj96ZowCU
      https://www.bbc.com/news/world-europe-53175459
      https://sverigesradio.se/artikel/7503606
  13. No, context is missing
      Graph panels: Tests executed; Test positive rate; New cases
      https://youtu.be/4uTj96ZowCU
      https://twitter.com/JacobGudiol/status/1283308826842759168
      https://twitter.com/JacobGudiol/status/1283308817787293696
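
The missing context on slide 13 can be sketched as a small calculation: case counts only become comparable once they are related to the number of tests executed. The class name, field names, and weekly numbers below are invented for illustration and are not taken from the slides or from any official source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeeklyTesting:
    """One week of testing data. All values used below are made up."""
    week: int
    tests_executed: int
    new_cases: int

    @property
    def positive_rate(self) -> float:
        # Share of executed tests that came back positive.
        return self.new_cases / self.tests_executed


# Hypothetical numbers: testing quadruples while cases only double.
weeks = [
    WeeklyTesting(week=1, tests_executed=10_000, new_cases=1_000),
    WeeklyTesting(week=2, tests_executed=40_000, new_cases=2_000),
]

for w in weeks:
    print(f"week {w.week}: {w.new_cases} new cases, "
          f"positive rate {w.positive_rate:.1%}")
```

In this made-up example the "new cases" series rises while the positive rate falls, so a graph of new cases alone would suggest the opposite trend.
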
  14. Death numbers, different views
      https://twitter.com/HaraldofW/status/1270080232104624128
      https://www.folkhalsomyndigheten.se/globalassets/statistik-uppfoljning/smittsamma-sjukdomar/veckorapporter-covid-19/2020/covid-19-veckorapport-vecka-25-final.pdf
  15. Data will confess to anything
      ● Absolute numbers mislead
        ○ Days since case x → time shift by country size
      ● Relative numbers mislead
        ○ Diluted in large countries
        ○ Small regions stand out
      https://swprs.org/a-swiss-doctor-on-covid-19/
  16. Granularity matters
      ● Outbreaks in regions
      ● Country aggregation - information loss
        ○ But debate assumes homogeneous countries
      ● Peak of Swedish outbreak
        ○ Major outbreak in Stockholm + surroundings
        ○ Rest of Sweden on par with Nordics
      ● Nothing is "obvious"
      https://www.folkhalsomyndigheten.se/globalassets/statistik-uppfoljning/smittsamma-sjukdomar/veckorapporter-covid-19/2020/covid-19-veckorapport-vecka-25-final.pdf
      Callout: Swedish policy "obviously" terrible. Compare numbers with neighbours!
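
The information loss from country aggregation described on slides 15 and 16 can be illustrated with a minimal sketch. The region names, populations, and death counts are hypothetical and chosen only so that the arithmetic is easy to follow; they are not real figures for Sweden or any other country.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    name: str
    population: int
    deaths: int


def per_100k(deaths: int, population: int) -> float:
    """Deaths per 100,000 inhabitants."""
    return 100_000 * deaths / population


# Hypothetical country: one region with a major outbreak, two without.
regions: List[Region] = [
    Region("Capital", 2_000_000, 1_000),
    Region("North", 4_000_000, 200),
    Region("South", 4_000_000, 200),
]

# The country-level figure blends all regions into one number.
national_rate = per_100k(
    sum(r.deaths for r in regions),
    sum(r.population for r in regions),
)

# The regional figures show where the outbreak actually is.
regional_rates = {r.name: per_100k(r.deaths, r.population) for r in regions}

print(f"national: {national_rate:.1f} per 100k")
for name, rate in regional_rates.items():
    print(f"{name}: {rate:.1f} per 100k")
```

The national number sits between the regional ones and describes none of them well, which is why a comparison between whole countries can hide that most regions look alike.
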
  17. Data collection
      "The last week is not complete, so it is difficult to determine if the trend continues."
      https://youtu.be/4uTj96ZowCU
      https://www.folkhalsomyndigheten.se/globalassets/statistik-uppfoljning/smittsamma-sjukdomar/veckorapporter-covid-19/2020/covid-19-veckorapport-vecka-27-final.pdf
  18. What conclusion from this graph?
      Graph: COVID-19 fatalities / day in Sweden
      https://www.folkhalsomyndigheten.se/
  19. Comparing apples, oranges, bananas, ...
      Graph: COVID-19 fatalities / day in Sweden
      Graph annotations: Fatalities collected during 2 days; Fatalities collected during 4 days; Fatalities collected during 10 days
  20. Naive data collection
      ● Gather the events that we have
      ● Put them in a database
      ● "Let us look at the latest data"
      ● You never want the latest data! You want comparable data.
  21. Wrong conclusion, every day
      ● Fatalities data as of April 6, April 15, April 19
      Graph by Statistisk Opinion, @StatistiskO
  22. Wrong conclusion, every day
      ● Downward trend every day!
      https://www.bloomberg.com/amp/news/articles/2020-07-17/georgia-massaged-virus-data-to-reopen-then-voided-mask-orders
  23. Normalise data collection to compare
      Graph by Adam Altmejd, @adamaltmejd
  24. Normalise data collection to compare
      Graph by Adam Altmejd, @adamaltmejd
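
One possible reading of "comparable data" from slides 19 to 24 is to count, for every day, only the fatalities that were reported within the same fixed number of days, and to leave out days whose reporting window has not yet closed. The sketch below follows that reading; the `Report` type, the function names, and the sample records are assumptions made for illustration and do not reproduce the method behind the graphs on the slides.

```python
from collections import Counter
from datetime import date, timedelta
from typing import Dict, Iterable, NamedTuple


class Report(NamedTuple):
    death_date: date   # day the fatality occurred
    report_date: date  # day the fatality reached the database


def naive_counts(reports: Iterable[Report], as_of: date) -> Dict[date, int]:
    """'Latest data': everything known on as_of, whatever the reporting lag."""
    return dict(Counter(r.death_date for r in reports if r.report_date <= as_of))


def comparable_counts(
    reports: Iterable[Report], as_of: date, lag_days: int
) -> Dict[date, int]:
    """Count only reports that arrived within lag_days of the death date,
    and only for death dates whose lag_days window has closed by as_of."""
    lag = timedelta(days=lag_days)
    counts: Counter = Counter()
    for r in reports:
        window_closed = r.death_date + lag <= as_of
        reported_in_window = r.report_date <= r.death_date + lag
        if window_closed and reported_in_window:
            counts[r.death_date] += 1
    return dict(counts)


# Made-up records: April 1 has had ten days to collect reports, April 8 only two.
reports = [
    Report(date(2020, 4, 1), date(2020, 4, 2)),
    Report(date(2020, 4, 1), date(2020, 4, 3)),
    Report(date(2020, 4, 1), date(2020, 4, 10)),
    Report(date(2020, 4, 8), date(2020, 4, 9)),
]

today = date(2020, 4, 10)
print(naive_counts(reports, today))
print(comparable_counts(reports, today, lag_days=4))
```

The naive view mixes a ten-day collection period with a two-day one and therefore always seems to show a recent decline. The comparable view gives every day the same collection period, at the price of being a few days behind.
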
  25. Forecast for analytics with fresh data
      Graph by Adam Altmejd, @adamaltmejd
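
Slide 25 is about forecasting when analytics must use fresh, incomplete data. A simple way to do that, offered here only as a sketch and not as the method behind the graph on the slide, is to learn from fully reported days what share of fatalities is usually known after a given number of days, and to scale the fresh count by that share. The function names and the delay history are hypothetical.

```python
from typing import List, Sequence


def completion_fraction(delay_history: Sequence[Sequence[int]], lag_days: int) -> float:
    """Share of fatalities that had been reported within lag_days.

    delay_history holds, for each fully reported past day, the reporting
    delay in days of every fatality that occurred on that day.
    """
    total = sum(len(delays) for delays in delay_history)
    if total == 0:
        raise ValueError("no history to learn from")
    within = sum(1 for delays in delay_history for d in delays if d <= lag_days)
    return within / total


def nowcast(observed_so_far: int, fraction: float) -> float:
    """Estimate the final count from a partial count and a completion share."""
    if fraction <= 0:
        raise ValueError("completion fraction must be positive")
    return observed_so_far / fraction


# Made-up history: two past days with four fatalities each.
history: List[List[int]] = [[1, 2, 5, 9], [1, 3, 4, 8]]

share_after_2_days = completion_fraction(history, lag_days=2)
estimate = nowcast(observed_so_far=3, fraction=share_after_2_days)
print(f"share known after 2 days: {share_after_2_days:.3f}")
print(f"estimated final count: {estimate:.1f}")
```

The estimate is only as good as the assumption that reporting delays behave as they did in the past, which is one reason to keep assessing forecast success against the later, complete numbers.
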
  26. Why aren't authorities doing that?
      ● Cost of processing data
      ● Manual handcraft, not an industrial process
      https://github.com/FohmAnalys/SEIR-model-Stockholm
      We are not done processing the data yet. Since we do calculations quickly, some mistakes might happen.
  27. Manual, mechanised, industrialised
      Manual:
      ● Muscle-powered
      ● Few tools
      ● Human touch for every step
      Mechanised:
      ● Direct human control
      ● Machine tools
      ● Low investment, direct return
      Industrialised:
      ● Scaled processes
      ● Machine tools
      ● Challenges: scale, logistics, legal, organisation, faults, ...
  28. Muscle powered analytics & machine learning
      ● Use hand tools to
        ○ Collect data
        ○ Aggregate for analytics or
        ○ Train a model
      ● Typical tools:
        ○ Excel
        ○ Matlab
        ○ Interactive SQL
        ○ Interactive BI tools
        ○ Jupyter
        ○ R
        ○ One-off Python scripts
      "Dataset" - a data artifact of direct or indirect value
  29. Mechanised analytics & machine learning
      ● Use machine tools to semi-automatically
        ○ Collect data
        ○ Aggregate for analytics or
        ○ Train a model
      ● Typical tools: Muscle tools +
        ○ Databases
        ○ Data warehouses + ETL
        ○ Hadoop, Spark, Flink
        ○ Java, Scala, Python, SQL
        ○ Kafka
        ○ Similar cloud services
      Datasets, produced monthly / hourly / daily / ..
  30. From craft to process
  31. From craft to process
      Diagram labels: Multiple time windows
  32. From craft to process
      Diagram labels: Multiple time windows; Assess ingress data quality
  33. From craft to process
      Diagram labels: Multiple time windows; Assess ingress data quality; Assess outcome data quality
  34. From craft to process
      Diagram labels: Multiple time windows; Assess ingress data quality; Assess outcome data quality; Repair broken data; Intermediate datasets, reusable between pipelines
  35. From craft to process
      Diagram labels: Multiple time windows; Assess ingress data quality; Repair broken data from complementary source; Assess outcome data quality
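
The pipeline steps named on slides 32 to 35 (assess ingress data quality, repair broken data from a complementary source, assess outcome data quality) can be sketched as three small functions. The quality rule used here (a value is broken when it is missing or negative), the threshold, the function names, and the sample data are all assumptions for illustration. A production pipeline would carry many more checks.

```python
from typing import Dict, List, Optional

# ISO date -> daily count; None marks a value missing from the source.
DailyCounts = Dict[str, Optional[int]]


def assess_ingress(counts: DailyCounts) -> List[str]:
    """Return the dates whose value is missing or negative."""
    return sorted(d for d, v in counts.items() if v is None or v < 0)


def repair(primary: DailyCounts, complementary: DailyCounts) -> DailyCounts:
    """Replace broken values with valid values from a complementary source."""
    repaired = dict(primary)
    for d in assess_ingress(primary):
        candidate = complementary.get(d)
        if candidate is not None and candidate >= 0:
            repaired[d] = candidate
    return repaired


def assess_outcome(counts: DailyCounts, max_broken_share: float) -> bool:
    """Accept the dataset only if few enough values are still broken."""
    if not counts:
        return False
    return len(assess_ingress(counts)) / len(counts) <= max_broken_share


# Made-up input: one missing value and one impossible value.
primary: DailyCounts = {"2020-04-01": 5, "2020-04-02": None, "2020-04-03": -1}
complementary: DailyCounts = {"2020-04-02": 7}

print("broken on ingress:", assess_ingress(primary))
repaired = repair(primary, complementary)
print("still broken:", assess_ingress(repaired))
print("publishable:", assess_outcome(repaired, max_broken_share=0.5))
```

Keeping assessment and repair as separate steps means the measured quality can itself be stored as a dataset and monitored over time.
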
  36. From craft to process
      Diagram labels: Multiple time windows; Assess ingress data quality; Repair broken data from complementary source; Forecast based on history; Assess outcome data quality
  37. From craft to process
      Diagram labels: Multiple time windows; Assess ingress data quality; Repair broken data from complementary source; Forecast based on history, multiple parameter settings; Assess outcome data quality
  38. From craft to process
      Diagram labels: Multiple time windows; Assess ingress data quality; Repair broken data from complementary source; Forecast based on history, multiple parameter settings; Assess outcome data quality; Assess forecast success, adapt parameters
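
The last steps on slides 37 and 38 (forecast with multiple parameter settings, assess forecast success, adapt parameters) amount to backtesting: run each setting against history, measure the error, and keep the best one. The sketch below uses a moving average as a stand-in forecast and its window length as the only parameter. Both choices, and the series itself, are assumptions made to keep the example small.

```python
from typing import Sequence


def moving_average_forecast(series: Sequence[float], window: int) -> float:
    """Forecast the next value as the mean of the last `window` values."""
    recent = series[-window:]
    return sum(recent) / len(recent)


def backtest_error(series: Sequence[float], window: int) -> float:
    """Mean absolute error of one-step-ahead forecasts over the history."""
    errors = [
        abs(moving_average_forecast(series[:i], window) - series[i])
        for i in range(window, len(series))
    ]
    if not errors:
        raise ValueError("series too short for this window")
    return sum(errors) / len(errors)


def choose_window(series: Sequence[float], candidates: Sequence[int]) -> int:
    """Pick the parameter setting with the lowest backtest error."""
    return min(candidates, key=lambda w: backtest_error(series, w))


# Made-up history that alternates between two levels.
history = [10, 20, 10, 20, 10, 20, 10, 20]

for w in (1, 2, 3):
    print(f"window {w}: mean absolute error {backtest_error(history, w):.2f}")
print("chosen window:", choose_window(history, [1, 2, 3]))
```

Rerunning the selection as new complete data arrives is what turns a one-off parameter choice into an adaptive process.
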
  39. Towards sustainable production ML
      Diagram labels: Multiple models, parameters, features; Assess ingress data quality; Repair broken data from complementary source; Choose model and parameters based on performance and input data; Benchmark models; Try multiple models, measure, A/B test
  40. Industrialised analytics / machine learning
      ● Build resilient, automated processes that
        ○ Collect & process
        ○ Assess & improve quality
        ○ Create multiple artifacts, measure, adapt
      ● Typical tools: Mechanised tools +
        ○ Data lake
        ○ Workflow orchestration (Luigi, Airflow)
        ○ Quality assessment, monitoring
        ○ Testing, CI/CD
  41. Costs down - ROI from data
      ● Hand-built
      ● Analyst team, < 10 datasets / day
      ● Semi-automated
      ● "The data team", 10-100 datasets / day
      ● Resilient data factory
      ● Every dev team, 100-1000s datasets / day per team
      Spotify ~2014, 20K datasets/day
  42. Becoming data industrialised
      ● Knowledge limited to leading tech companies + startups
      ● Change in processes & culture
        ○ Cf. agile, DevOps
        ○ Journey of many years
      ● Challenge is not technical
        ○ Can't buy a system or tool
        ○ Consultants can't help
  43. Scling - data-value-as-a-service
      Data value through collaboration
      Diagram labels: Customer; Data factory; Data platform & lake; data domain expertise; Value from data!
      www.scling.com/reading-list
      www.scling.com/presentations
      www.scling.com/courses
