Dynamic Talks: "Implementing data quality automation with open source stack" ... - Grid Dynamics
The quality of business decisions, machine learning insights, and executive reports depends on the quality and integrity of the underlying data. There are many ways that data can get corrupted in an analytical data platform, from de-synchronization with the system of record to defects in data pipelines. We will show how to detect and prevent data corruption with automation, open source tools, and machine learning.
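The kind of automated corruption detection the abstract describes can be sketched in a few lines. This is a minimal, hypothetical example, not the talk's actual stack: the function name, fields, and thresholds are assumptions for illustration. It checks two common failure modes, row-count drift against the system of record and excessive null rates.

```python
# Hypothetical sketch of an automated data-quality check: compare an
# analytical table against its system of record and flag anomalies.
# Field names and the 5% null-rate threshold are illustrative assumptions.

def check_quality(records, source_count, max_null_rate=0.05):
    """Return a list of human-readable data-quality issues found in `records`."""
    issues = []
    # Completeness: row count should match the system of record.
    if len(records) != source_count:
        issues.append(f"row count {len(records)} != source {source_count}")
    # Validity: the null rate per field should stay below a threshold.
    if records:
        for field in records[0].keys():
            nulls = sum(1 for r in records if r.get(field) is None)
            rate = nulls / len(records)
            if rate > max_null_rate:
                issues.append(f"field '{field}' null rate {rate:.0%} exceeds {max_null_rate:.0%}")
    return issues

rows = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": None}, {"id": 3, "amount": 7.5}]
print(check_quality(rows, source_count=4))
```

In a real pipeline, checks like these would run on a schedule and alert on failure; the abstract's machine-learning angle would replace the fixed threshold with a learned baseline.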
Data Quality - The True Big Data Challenge - Stefan Kühn
Data quality is one of the most overlooked key aspects of any Big Data project or approach. This talk addresses the problem from various perspectives, discusses the main challenges, and identifies possible solutions.
Applying Data Quality Best Practices at Big Data Scale - Precisely
Global organizations are investing aggressively in data lake infrastructures in the pursuit of new, breakthrough business insights. At the same time, however, 2 out of 3 business executives are not highly confident in the accuracy and reliability of their own Big Data. Regaining that confidence requires utilizing proven data quality tools at Big Data scale.
In this on-demand webinar, discover how to ensure your data lake is a trusted source for advanced business insights that lead to new revenue, cost savings and competitiveness. You will have the opportunity to:
• Compare your organization’s data lake “readiness” against initial findings from our upcoming annual Big Data Trends survey
• Gain insight into where and how to leverage data quality best practices for Big Data use cases
• Explore how a ‘Develop Once, Deploy Anywhere’ approach, including to native Big Data infrastructures such as Hadoop and Spark, facilitates consistent data quality patterns
Enterprise Analytics: Serving Big Data Projects for Healthcare - DATA360US
Andrew Rosenberg's Presentation on "Enterprise Analytics: Serving Big Data Projects for Healthcare" at DATA 360 Healthcare Informatics Conference - March 5th, 2015
Data Quality Strategy: A Step-by-Step Approach - FindWhitePapers
Learn about the importance of having a data quality strategy and setting the overall goals. The six factors of data are also explained in detail, along with how to tie them together for implementation.
Gayatri Patel, eBay, presents at the Big Analytics 2012 Roadshow
The wonders of what data can do for an organization are measured in the productivity and competitiveness of its teams' decisions. Some believe more data is the key. Agreed... but good decisions require more than just deriving intelligence from big data. In this dynamic market, the need to socialize and evolve ideas with other teams, quickly correlate information across sources, and test ideas to fail fast early are strong enablers of competitive footing. eBay's analytic and technology advancements garner insights and approaches that continue to help our employees tell their "data stories" and make better decisions.
Big Data - it's the big buzz. But is it dead on arrival?
In this presentation Daragh O Brien looks at the history of information management, the challenges of data quality and governance, and the implications for big data...
Understanding big data and data analytics - Seta Wicaksana
Big Data helps companies to generate valuable insights. Companies use Big Data to refine their marketing campaigns and techniques. Companies use it in machine learning projects to train machines, predictive modeling, and other advanced analytics applications.
Enacting the data subjects' access rights for GDPR with data services and data... - Jean-Michel Franco
GDPR is more than another regulation to be handled by your back office. As stated by the European Commission, "The primary objective of this new set of rules is to give citizens back control over their personal data." And surveys show that European citizens are eager to exercise those new fundamental rights, such as access to information, data portability, and the right to be forgotten. Will you be ready to deliver, or will you be forced to tell your customers that unfortunately, you are not yet ready to respect their rights?
Enacting the GDPR’s Data Subject Access Rights (DSAR) requires practical actions. There’s a mandate for an integrated data governance strategy to establish your data inventory, operationalize controls, foster accountability across teams and ensure compliance, and finally unleash personal data to your customers, employees, visitors, and prospects. Only a strong data governance program on top of a modern, collaborative data hub ensures that you have the policies, standards, and controls in place to enforce compliance.
This presentation outlines the practical steps to deploy governed data services that:
Know your customers and employees with a data inventory
Track and trace data using audit trails and data lineage
Manage and propagate opt-in consent across customer-facing applications
Reconcile and protect your sensitive data in a data hub with automated controls, data stewardship, and data masking
Respect the rights of your data subjects with collaborative data management and portals
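One of the controls listed above, data masking, can be illustrated with a keyed hash that pseudonymizes a sensitive field while keeping records joinable. This is a minimal sketch under stated assumptions: the salt value, its rotation, and the field names are hypothetical, and this is not the API of any particular data-hub product.

```python
# Illustrative data-masking sketch: a keyed hash (HMAC-SHA256) replaces a
# sensitive value with a stable pseudonym, so downstream joins still work
# without exposing the raw value. The salt here is a placeholder; in practice
# it would be managed and rotated under the governance program.
import hashlib
import hmac

SECRET_SALT = b"rotate-me-regularly"  # hypothetical key managed by governance

def mask(value: str) -> str:
    """Deterministically pseudonymize a sensitive value with a keyed hash."""
    return hmac.new(SECRET_SALT, value.encode(), hashlib.sha256).hexdigest()[:16]

record = {"customer_id": "C-1042", "email": "alice@example.com"}
masked = {**record, "email": mask(record["email"])}
print(masked["email"] != record["email"])   # True: the raw email is not stored
print(mask("alice@example.com") == masked["email"])  # True: deterministic, joins survive
```

Determinism is the design choice to note: the same input always yields the same pseudonym, which preserves referential integrity across systems while the key stays inside the hub.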
Big data is like a two-edged sword: It can bring many new opportunities for business, but it can also harm individuals and businesses in unanticipated ways
Big Data Applications | Big Data Application Examples | Big Data Use Cases | ... - Simplilearn
In this Big Data presentation, we discuss Big Data growth over the last few years, followed by various big data applications. We look into the various sectors where big data is used, such as weather forecasting, healthcare, media and entertainment, logistics, travel & tourism, and finally the government & law enforcement sector.
We will discuss how the industries below are using Big Data:
1. Weather forecast
2. Media and entertainment
3. Healthcare
4. Logistics
5. Travel and tourism
6. Government and law enforcement
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem, such as Hadoop 2.7, YARN, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro schema, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, its architecture, sources, sinks, channels, and configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distributed datasets (RDDs) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, creating, transforming, and querying Data frames
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
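The "functional programming in Spark" objective above boils down to chaining map/filter/reduce-style transformations over a dataset. The plain-Python sketch below mimics the classic RDD word count to show the shape of that pattern; it is an illustration, not PySpark itself (in Spark the same pipeline would read roughly `rdd.flatMap(...).map(...).reduceByKey(...)`).

```python
# RDD-style word count, illustrated with plain Python. Each step mirrors a
# Spark transformation: flatMap -> map -> reduceByKey.
from functools import reduce

lines = ["big data with spark", "spark makes big data simple"]

# flatMap: split each line into words
words = [w for line in lines for w in line.split()]

# map: pair each word with a count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: fold the pairs into per-word totals
counts = reduce(lambda acc, kv: acc | {kv[0]: acc.get(kv[0], 0) + kv[1]}, pairs, {})

print(counts["big"], counts["spark"])  # 2 2
```

The point of the pattern is that each step is a pure transformation over the whole collection, which is what lets Spark distribute the same code across partitions.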
Change Management: The Secret to a Successful SAS® Implementation - ThotWave
Whether you are deploying a new capability with SAS® or modernizing the tool set that people already use in your organization, change management is a valuable practice. Sharing the news of a change with employees can be a daunting task and is often put off until the last possible second. Organizations frequently underestimate the impact of the change, and the results of that miscalculation can be disastrous. Too often, employees find out about a change just before mandatory training and are expected to embrace it. But change management is far more than training. It is early and frequent communication, an inclusive discussion, encouraging and enabling the development of an individual, and facilitating learning before, during, and long after the change.
This paper not only showcases the importance of change management but also identifies key objectives for a purposeful strategy. We outline our experiences with both successful and not-so-successful organizational changes. We present best practices for implementing change management strategies and highlight common gaps. For example, developing and engaging "Change Champions" from the beginning alleviates many headaches and avoids disruptions. Finally, we discuss how the overall company culture can either support or hinder the positive experience change management should be, and how to engender support for formal change management in your organization.
Big Data Impact on Purchasing and SCM - PASIA World Conference Discussion - Bill Kohnen
The volume, velocity, and variety of data available is almost unthinkable. 90% of the world's data is less than 2 years old, we are able to analyze less than 5% of it, and 80% of what people generally look at is less than 6 weeks old. Harnessing this data for effective decision making is a goal for organizations worldwide and has created a $50 billion industry of tools and consulting.
Even before “Big Data” Purchasing groups were swimming in data and struggled to put it to effective use. The success of Strategic Sourcing methodology had the effect of also identifying and standardizing the types and format of information that can be used to drive improvement.
This discussion will connect how big data sources and methodology can be used to develop specific and relevant spend analytics. Also presented will be an illustration of how you can use data and tools you already have - to get immediate results and make you better prepared to evaluate the need for more powerful analytic tools.
Finally, it will conclude with comments on how Big Data, along with other disruptive digital trends, will create new required skill sets for purchasing and supply chain professionals and transform how they already operate.
Which technologies does IBM expect to have the greatest impact going forward?
Get insight into how IBM's previous predictions have fared, and hear a view of what the future holds from IBM Research.
Anders Quitzau, Chief Technologist, IBM
Why should we care about integrating data? What should we be trying to achieve? Population Health. The Softer, Human Side of Being “Data Driven” not “Driven By Data." The New Era of Decision Support in Healthcare. Top 10 Challenges To Integrating External Data.
5 Things to Know About the Clinical Analytics Data Management Challenge - Ext... - Michael Dykstra
5 Things to Know About the Clinical Analytics Data Management Challenge - Extracting Real Benefit From Your EHR Data
The EHR revolution has created immense promise for improved patient outcomes and reduced costs but most healthcare organizations are struggling to experience significant benefits. The power of Applied Clinical Analytics lies in a simple but powerful concept: the importance of focusing on the accuracy and availability of the underlying data, first and foremost.
Microsoft: A Waking Giant in Healthcare Analytics and Big Data - Health Catalyst
In 2005, Northwestern Memorial Healthcare embarked upon a strategic Enterprise Data Warehousing (EDW) initiative with the Microsoft technology platform as the foundation. Dale Sanders was CIO at Northwestern and led the development of Northwestern's Microsoft-based EDW. At that time, Microsoft as an EDW platform was not in vogue, and there were many who doubted the success of the Northwestern project. While other organizations were spending millions of dollars and years developing EDWs and analytics on other platforms, Northwestern achieved great and rapid value at a fraction of the cost of the more typical technology platforms. Now, there are more healthcare data warehouses built around Microsoft products than any other vendor. The risky bet on Microsoft in 2005 paid off.
Ten years ago, critics didn’t believe that Microsoft could scale in the second generation of relational data warehouses, but they did. More recently, many of these same pundits have criticized Microsoft for missing the technology wave du jour in cloud offerings, mobile technology, and big data. But, once again, Microsoft has been quietly reengineering its culture and products, and as a result, they now offer the best value and most visionary platform for cloud services, big data, and analytics in healthcare.
In this context, Dale will talk about:
His up-and-down journey with Microsoft as an Air Force and healthcare CIO, and why he is now more bullish on Microsoft than ever before
A quick review of the Healthcare Analytics Adoption Model and Closed Loop Analytics in healthcare, and how Microsoft products relate to both
The rise of highly specialized, cloud-based analytic services and their value to healthcare organizations’ analytics strategies
Microsoft’s transformation from a closed-system, desktop PC company to an open-system consumer and business infrastructure company
The current transition period of enterprise data warehouses between the decline of relational databases and the rise of non-relational databases, and the new Microsoft products, notably Azure and the Analytic Platform System (APS), that bridge the transition of skills and technology while still integrating with core products like Office, Active Directory, and System Center
Microsoft’s strategy with its PowerX product line, and geospatial analysis and machine learning visualization tools
The third webcast in this series focuses on ways to meet your health system’s specific needs and achieve a 360-degree view of your patients, processes, physicians, and costs without purchasing multiple, disparate solutions, and creating information silos.
Our speakers discuss their collective experience working with organizations to create tailored platforms that provide convenient access to data collected by, and stored in, disparate clinical information systems, and enable that data to be securely used by users throughout the broader healthcare community. Actionable data, available to all users when they need it, serves as a foundation for analysis and decision-making aimed at improving how care is delivered.
You can find it online at http://www.informationbuilders.com/webevents/online/24637#sthash.RnwoH27x.dpuf
Microsoft: A Waking Giant in Healthcare Analytics and Big Data - Dale Sanders
Ten years ago, critics didn’t believe that Microsoft could scale in the second generation of relational data warehouses, but they did. More recently, many of these same pundits have criticized Microsoft for missing the technology wave du jour in cloud offerings, mobile technology, and big data. But, once again, Microsoft has been quietly reengineering its culture and products, and as a result, they now offer the best value and most visionary platform for cloud services, big data, and analytics in healthcare.
A hybrid approach to data management is emerging in healthcare as organizations recognize the value of an enterprise data warehouse in combination with a data lake.
In this SlideShare, we discuss data lakes in healthcare and we:
Provide an overview of a Hadoop-based data lake architecture and integration platform, and its application in machine learning, predictive modeling, and data discovery
Discuss several key use cases driving the adoption of data lakes for both providers and health plans
Discuss available data storage forms and the required tools for a data lake environment
Detail best practices for conducting data lake assessments and review key implementation considerations for healthcare
How much is that data in the window: Healthcare data valuation - Sean Manion, PhD
Presentation on healthcare data valuation, data confidence fabrics, layers of trust in healthcare, and health data marketplaces as part of the Health Data Valuation event, Session 10 of the IEEE Healthcare: Blockchain & AI Virtual Series on 25 August 2021
Trends changed from non-compliance to RR --> gap to RR --> Data Integrity --> DIB --> Smart Audit & Smart Data.
RR = Regulatory Requirements
DIB = Data Integrity Breach
Take data integrity seriously, whether you are a small or a big organization: your data is the heart of your business. Regulatory bodies are highly conscious of such issues. For beginners on this path, my small note can help you a lot.
How to Create a Big Data Culture in Pharma - Chris Waller
A talk presented at the Big Data and Analytics conference in Boston on January 28, 2014. Emphasis on data and information sharing cultures in companies.
Despite massive investment in both people and technology, health systems are still struggling to maximize the value of their greatest asset: their data. Delivering high-quality, valuable insight from data and pushing those insights to the frontline healthcare professionals remains challenging and expensive. According to a recent survey conducted by HealthLeaders Media, health systems are hiring more analytics staff than almost any other role in health care. We know there’s an alternative to the massive hiring of analytics staff, a better way to dramatically increase the efficiency of your existing resources and provide an ROI that grows over time. The better way is the Rapid Response Analytics Solution.
Rapid Response Analytics Solution (RRA Solution) consists of two elements: curated, modular data called DOS™ Marts and Population Builder, a powerful self-service tool that lets any type of user, from physician executives to frontline nurses and population health teams, explore their data and quickly build and share populations without needing to know how to write SQL or data science code. RRA Solution increases an analytics team's productivity by up to 10x and reduces its time to develop analytics by as much as 90 percent. Analysts can spend more time focusing on key strategic analysis and less time on repetitive tasks that can lead to inconsistent results and a backlog of requests.
Learning Objectives:
- Discover how RRA Solution allows you to take components and customize them to quickly tailor and deliver meaningful insights.
- Learn about DOS™ Marts and Population Builder and how they drive consistency and efficiency, without needing to know SQL and data science coding.
- Understand how to use RRA Solution to increase the value of your analytics team and get them operating at the top of their function.
View this webinar and learn how RRA Solution can help you achieve a 10x increase in productivity and reduce your time to develop new analytics reports by more than 90 percent.
The Data Operating System: Changing the Digital Trajectory of HealthcareDale Sanders
This is the next evolution in health information exchanges and data warehouses, specifically designed to support analytics, transaction processing, and third party application development, in one platform, the Data Operating System.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag... - Sameer Shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers present on related topics such as vector databases, LLMs, and managing data at scale. The intended audience includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup and is sponsored by Zilliz, maintainers of Milvus.
Adjusting primitives for graph : SHORT REPORT / NOTES - Subhajit Sahu
Graph algorithms, like PageRank, operate on a graph representation. Compressed Sparse Row (CSR) is an adjacency-list-based graph representation that stores each vertex's neighbors in contiguous arrays. The following experiments adjust the underlying primitives:
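The CSR layout the note refers to can be sketched in a few lines. This is an illustrative Python version, not code from the report; it builds the two standard arrays (per-vertex offsets and flattened edge targets) from a directed edge list with vertex ids 0..n-1.

```python
def build_csr(n, edges):
    """Return (offsets, targets): offsets[v]..offsets[v+1] indexes v's out-edges."""
    degree = [0] * n
    for u, _ in edges:
        degree[u] += 1
    # Prefix-sum the degrees to get each vertex's slice into the targets array.
    offsets = [0] * (n + 1)
    for v in range(n):
        offsets[v + 1] = offsets[v] + degree[v]
    # Scatter edge targets into place, tracking a write cursor per vertex.
    targets = [0] * len(edges)
    cursor = list(offsets[:-1])
    for u, v in edges:
        targets[cursor[u]] = v
        cursor[u] += 1
    return offsets, targets

# Example: edges 0→1, 0→2, 2→1
offsets, targets = build_csr(3, [(0, 1), (0, 2), (2, 1)])
# neighbors of vertex 0 are targets[offsets[0]:offsets[1]]
```

The contiguous layout is what makes CSR attractive for the report's experiments: neighbor scans are sequential reads, which suits both OpenMP threads and CUDA kernels.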
Multiply with different modes (map)
1. Performance of sequential vs. OpenMP-based vector multiply.
2. Comparing various launch configs for CUDA-based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs. bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential vs. OpenMP-based vector element sum.
2. Performance of memcpy-based vs. in-place CUDA vector element sum.
3. Comparing various launch configs for CUDA-based vector element sum (memcpy).
4. Comparing various launch configs for CUDA-based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA-based vector element sum (in-place).
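The float-vs-bfloat16 storage experiment hinges on bfloat16 being the top 16 bits of a float32 pattern (8 mantissa bits), trading precision for halved memory bandwidth. A pure-Python simulation of that truncation, illustrative only; the report's actual benchmarks are CUDA/OpenMP code:

```python
import struct

def to_bfloat16(x: float) -> float:
    """Simulate bfloat16 storage: keep the top 16 bits of the float32 bit pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# With only 8 mantissa bits, small offsets near 1.0 are truncated away:
assert to_bfloat16(1.0) == 1.0
assert to_bfloat16(1.001) == 1.0

# Summing many bfloat16-stored values accumulates the per-element storage
# error, which is what the storage-type benchmark trades against bandwidth.
exact = sum([0.1] * 1000)
approx = sum(to_bfloat16(0.1) for _ in range(1000))
```

In the benchmarked kernels the accumulator typically stays in full precision; only the stored elements are bfloat16, as simulated here.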
The Building Blocks of QuestDB, a Time Series Database - Javier Ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review some of the changes we have made over the past two years to deal with late and unordered data, non-blocking writes, read replicas, and faster batch ingestion.
Adjusting OpenMP PageRank : SHORT REPORT / NOTES - Subhajit Sahu
For massive graphs that fit in RAM but not in GPU memory, it is possible to take advantage of a shared-memory system with multiple CPUs, each with multiple cores, to accelerate PageRank computation. If the NUMA architecture of the system is properly taken into account with good vertex partitioning, the speedup can be significant. As a step in this direction, experiments are conducted to implement PageRank in OpenMP using two different approaches: uniform and hybrid. The uniform approach runs all primitives required for PageRank in OpenMP mode (with multiple threads). The hybrid approach, on the other hand, runs certain primitives (i.e., sumAt, multiply) in sequential mode.
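A sequential Python rendering of the computation being parallelized may help fix the vocabulary: each iteration scales ranks by out-degree (the multiply-style primitive), gathers contributions along CSR edges, and folds in teleport and dangling mass. Illustrative only; the report's implementations are OpenMP, and `sumAt`/`multiply` are names from its own primitive library.

```python
def pagerank(n, offsets, targets, damping=0.85, iters=50):
    """Power-iteration PageRank over a CSR graph (offsets, targets)."""
    out_deg = [offsets[v + 1] - offsets[v] for v in range(n)]
    rank = [1.0 / n] * n
    for _ in range(iters):
        # multiply-style primitive: each vertex's contribution, rank / out-degree
        contrib = [rank[v] / out_deg[v] if out_deg[v] else 0.0 for v in range(n)]
        # start each new rank at the teleport term, then gather along edges
        new = [(1.0 - damping) / n] * n
        for u in range(n):
            for i in range(offsets[u], offsets[u + 1]):
                new[targets[i]] += damping * contrib[u]
        # redistribute dangling-vertex mass uniformly so ranks keep summing to 1
        dangling = damping * sum(rank[v] for v in range(n) if out_deg[v] == 0)
        rank = [x + dangling / n for x in new]
    return rank

# Two-vertex cycle 0→1, 1→0: both ranks converge to 0.5
ranks = pagerank(2, [0, 1, 2], [1, 0])
```

In the uniform approach every one of these steps becomes an OpenMP parallel loop; in the hybrid approach the cheap per-vertex primitives stay sequential because thread launch overhead can outweigh their work.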
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake - Walaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world, where data privacy and compliance are a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) they are auto-generated from declarative data annotations; (2) they respect user-level consent and preferences; (3) they are context-aware, encoding a different set of transformations for different use cases; (4) they are portable: while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
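The catalog-routing idea can be sketched in miniature. Everything below (the `PolicyCatalog` class, the annotation shape, the view-naming scheme) is invented for illustration; the actual ViewShift engine generates real SQL views per engine dialect and plugs into engine catalogs.

```python
class PolicyCatalog:
    """Toy catalog that routes reads of annotated tables to compliance views."""

    def __init__(self):
        self.annotations = {}  # table -> {column: policy transformation}

    def annotate(self, table, column, policy):
        """Declarative data annotation: apply `policy` to `column` on read."""
        self.annotations.setdefault(table, {})[column] = policy

    def resolve(self, table, use_case):
        """Route a table resolution to its compliance view, if one is needed."""
        if table not in self.annotations:
            return table                          # no policy: read the raw table
        return f"{table}__compliant_{use_case}"   # context-aware view per use case

    def view_sql(self, table, use_case):
        """Auto-generate the enforcing view from the annotations."""
        cols = ", ".join(
            f"{policy}({col}) AS {col}"
            for col, policy in self.annotations[table].items()
        )
        return (f"CREATE VIEW {self.resolve(table, use_case)} AS "
                f"SELECT {cols} FROM {table}")

catalog = PolicyCatalog()
catalog.annotate("members", "email", "redact")
# catalog.resolve("members", "analytics") yields the view name, not the raw table
```

The point of the real design is that queries need no rewriting: the engine asks the catalog for "members" and transparently receives the enforcing view for its context.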
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... - John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf - Enterprise Wired
In this guide, we'll explore the key considerations and features to look for when choosing a trusted analytics platform that meets your organization's needs and delivers actionable intelligence you can trust.
2. Paul Balas
pbalas@303Computing.com
Over 25 Years Experience Leading Digital Transformations
Multiple MDM Implementations, Data Governance, and Data Warehouse Initiatives
● Digital Transformation Consultant
● IT Executive
● Enterprise Architect
● Developer
3. This is a presentation to uncover the systemic challenges in our Government’s response to the COVID Pandemic through the use of data...
And a proposal to fix it.
4. The Team!
Mingo Sanchez, Elizabeth Michel, Keith Worfolk, Katie Everett, Kamal Maheshwari, Alexandra-Cosmina Comaniciu, Pooja K Swamy, Evan Hu, Bryan Haagsman, PMP
5. Agenda
● Why is the Data Wrong?
● Why the New HHS System Won’t Work
● How To Fix It - a POC
7. ➢ Data Quality is suspect
➢ The CDC had an aging system
➢ The virus is spreading quickly - huge volumes of new data to process
➢ We aren’t capturing the data we need
8. Why Do We Need The Data?
➢ Managing our supply chain (PPE, Testing Supplies, Hospital ICU Beds)
➢ Ensuring we have enough doctors, nurses, and other healthcare professionals
➢ Issuing protective orders to stop the spread of the virus (Shelter-in-Place, Social Distancing, Shuttering Business)
9. We’ll be asking ourselves, “Could we have saved more lives?”
But without trusted data, no decision will be driven with conviction.
10. Dr. Deborah Birx said in a White House Coronavirus task force meeting: “There is nothing from the CDC that I can trust.”
15. The Scope of The Problem
➢ CDC has 1,200 Users and about 950 State CDC Partners
➢ There are over 6,000 Hospitals in the US
➢ About 2,000 Hospitals provide Covid data directly to HHS Protect
➢ TeleTracking provides data from about 1,100 Hospitals
(the contract with TeleTracking was a quick way to get more data, more quickly, into the system)
16. How The CDC is Trying to Solve The Problem
The main focus behind the improvement of the new system is speed, data format standardization, and validation, not on improving Data Governance and Collaboration.
18. Health and Human Services Goals for Data Governance
“Use of data across programs… remains a challenge.”
“Data are often housed in … data silos”
19. Standardized Data Entry Vs. Agile Data Management
Even with a standardized platform like TeleTracking, it’s just a data entry app that is driven by procedural rules (though it is much better than handing everyone an instruction book and Excel).
The main problem in achieving Data Quality is PEOPLE.
20. The Fallacy of Standardized Data Entry Solutions
And here we have the problem: definitions that may or may not be adhered to when data is entered into the system.
Medical forms are extremely complex and require a great deal of training for health practitioners to get right.
Instructions for TeleTracking
21. If you’re a manager who believes that a standardized data entry screen fixes your organization’s data quality problems...
I strongly encourage you to speak with your data scientists or data warehouse developers.
23. The New HHS System Will Fall Short
The new system doesn’t solve the hard problems of achieving Data Quality:
➢ People efficiently aligning to create standards
➢ Embedding standards into the system (Procedures vs. ML)
Let’s see how we envision a faster approach.
24. Imagine This is Your Problem to Solve
You are the CDO of the CDC, tasked to improve our Nation’s ability to better manage this and the next pandemic through the use of data.
Your first goal is to understand the key issues in the current system (“as-is”) and develop a roadmap to address them.
The stakes are high.
25. Architectural Principles
You outline your key architectural principles to keep the broad team focused on outcomes:
Goal 1 - Build better trust in the data
Goal 2 - Understand which issues to fix first (Prioritize)
Goal 3 - The system should be agile to change (Days and Weeks, not Months or Years, for new features)
Goal 4 - Efficient e-collaboration
26. Two Paths
You decide to split the problem in two:
Path 1 - Standardize data entry systems - the long path
Path 2 - Build a framework for efficient Data Governance, and do it quickly
27. Path 2 - The Reference Platform
● Master Data Management
● Taxonomy Management
● Data Quality
● Cloud Data Integration
● Data Transformation
● Orchestration
● Natural Language / Feature Extraction
● Data Lake
● Compute and Storage
28. POC - Understand Data Issues
You want to focus on the systemic issues in the underlying data flow across all stakeholders.
You choose to look at issues around ‘Testing’, as you believe you can get immediate benefits for public health if you can build confidence in the test data.
● What problems are states having in processing Test Data?
● Is Test Data being reported consistently and accurately?
29. JHU
Johns Hopkins University has become the de-facto authority on COVID-19 data, but did you know they are pulling it from other agencies? What types of problems do they see?
“The website relies upon publicly available data from multiple sources that are not always consistent in how and when they are released and updated. States may report components of testing data with different cadences, or they may even change how they report categories of data over time, all of which can affect calculations of the rate of positivity. For example, some states report testing positives separately from testing negatives, which may make it appear that 100% of their tests were positive or 100% negative on that day. Also, states have changed how they count positives and negative test results and may retroactively change the numbers reported.” - JHU COVID Website
If you could classify all the issues across stakeholders, you believe you could have a tool to get alignment with stakeholders by listening to them through their own words.
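The JHU quote boils down to an arithmetic hazard: positivity = positives / (positives + negatives), so a state that reports positives but not negatives on a given day appears to be 100% positive. A minimal guarded calculation, illustrative only:

```python
def positivity_rate(positives, negatives):
    """Return the share of tests that were positive, or None if not computable."""
    if positives is None or negatives is None:
        return None  # incomplete report: flag it, don't fabricate a 100% day
    total = positives + negatives
    return positives / total if total else None

# A complete report computes normally:
assert positivity_rate(50, 950) == 0.05
# A report missing negatives is refused rather than read as 100% positivity:
assert positivity_rate(50, None) is None
```

A data quality pipeline would route the `None` cases to a review queue instead of letting them distort the statewide rate.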
30. When is a Test Not a Test?
The CDC, Johns Hopkins, the COVID Tracking Project, and hundreds of other sites all deal with testing differently. It seems simple, but even the CDC made this mistake.
“The Centers for Disease Control and Prevention is conflating the results of two different types of coronavirus tests, distorting several important metrics and providing the country with an inaccurate picture of the state of the pandemic. We’ve learned that the CDC is making, at best, a debilitating mistake: combining test results that diagnose current coronavirus infections with test results that measure whether someone has ever had the virus. The upshot is that the government’s disease-fighting agency is overstating the country’s ability to test people who are sick with COVID-19.” - Alexis C. Madrigal, Robinson Meyer, May 21, 2020
This is something we want to correct for and monitor in our POC. Can our system compare test reports from various agencies and help explain why they differ?
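One shape such a POC monitor could take: compare each agency's reported total against the test types it actually discloses, flagging feeds where viral (diagnostic) and antibody (serology) results appear to be combined. The record layout and function names here are invented for illustration.

```python
def conflation_issues(reports):
    """Flag reports whose test-type breakdown doesn't support the stated total."""
    issues = []
    for r in reports:
        disclosed = (r.get("viral_tests") or 0) + (r.get("antibody_tests") or 0)
        if r.get("antibody_tests") is None:
            # No serology breakout: totals may silently mix both test types.
            issues.append((r["source"], "antibody count not broken out"))
        elif disclosed != r["total_tests"]:
            issues.append((r["source"], "total does not match disclosed test types"))
    return issues

reports = [
    {"source": "state_A", "total_tests": 1000, "viral_tests": 900, "antibody_tests": 100},
    {"source": "state_B", "total_tests": 1000, "viral_tests": 1000, "antibody_tests": None},
]
flagged = conflation_issues(reports)
# state_B is flagged: it may be folding antibody results into its totals
```

The value is less the check itself than the audit trail: each flag names the source and the reason, which is exactly the stakeholder-facing evidence the slides call for.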
32. It’s Going Well
The framework is proving valuable. You can now see the systemic data quality issues and, importantly, communicate with stakeholders effectively to get alignment.
You see an opportunity to do more, quickly: can we use the system to show how people in the public eye might influence people to get tested? You know that will be critical once our testing capacity increases.
You ask the team to classify news outlets by Public Influencers, Events, and Locations.
Using a traditional tool, this wouldn’t be possible, but you’ve seen how efficiently you can master and classify data through this platform.
34. DataOps and More
Your new DataOps system will provide more than just good data quality for COVID and other pandemics.
It will also allow you to conduct data science experiments to see if there is a correlation between public policy actions, infection rates, and ultimately deaths.
35. What Did You Achieve as CDC CDO?
➢ You delivered a DataOps framework that will expedite realization of data standards
➢ It puts the power of data governance and master data management into the hands of the experts at the CDC, HHS, Hospitals and Labs
➢ It works in complement with systems like TeleTracking
➢ It will scale beyond infectious disease data and can serve as a model for HHS to ensure and promote data quality for all citizens
36. How Was It Built
Built on Google Cloud Platform:
● Internet data sources: Twitter API, News API, State Health Department web pages, JHU GitHub
● Data orchestration: InfoWorks
● Data mastering: Tamr
● Visualization: Tableau
● Storage and compute: BigQuery, VM instances, Google Cloud Storage
● Services and tooling: Natural Language, Python
39. Tamr - Data Experts Spend More Time Analyzing/Strategizing
Before: experts spend too much time manually fixing data.
Today: ML can do 80% of the data mastering lift... enabling experts to put the final touches on the last 20%.
40. The Tamr Agile Approach to Data Mastering
OLD WAY (rules-based): Source data → Mastered data
● Months 1-4: identify developers, get business input, write rules
● Months 5-12+: review with business, modify rules, create exceptions, iterate
● Time: months to years; Quality: 60%-80% accuracy
NEW WAY (machine-driven): Source data → Unified data
● Weeks 1-12: iterate with human-guided machine learning
● Time: days to weeks; Quality: 90%+ accuracy
43. Taxonomies: Before vs. After Tamr
Tamr enabled us to create standardized taxonomies that can be managed by a networked group of hospitals, labs, and health officials.
These taxonomies are critical to having good-quality, conformed data across a widely distributed data network.
There is an efficient mechanism for building consensus across experts at the same time as fixing the data.
There is no solution like it in the market.
45. Mastering People: 530K to 9K in a Few Days
Using Tamr, I was able to take a corpus of over 500K entity records identified by Google Natural Language across 60,000 news articles, hundreds of web pages, and thousands of tweets, reducing it to about 9K Golden Master People Records with links back to each news article they were referenced in, regardless of spelling or abbreviation.
I estimate the system can be maintained in one to two hours a week at scale, decreasing to minutes a week as the model learns.
I don’t even have to monitor it. Tamr can notify me of my quality score and if I have any pairs that it’s unsure how to match.
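The 530K→9K reduction is, at its core, entity resolution: cluster name mentions that refer to the same person despite spelling differences, then emit one golden record per cluster with links back to the source documents. Tamr does this with human-guided ML; this pure-Python sketch uses a crude normalization key only to illustrate the shape of the output, and all names are invented.

```python
from collections import defaultdict

def normalize(name):
    """Crude matching key: lowercase and strip punctuation."""
    return " ".join(name.lower().replace(".", "").replace(",", "").split())

def master_people(mentions):
    """mentions: list of (name, document_id) -> golden records with backlinks."""
    clusters = defaultdict(set)
    golden_names = {}
    for name, doc in mentions:
        key = normalize(name)
        clusters[key].add(doc)
        golden_names.setdefault(key, name)  # first spelling seen becomes golden
    return [{"name": golden_names[k], "documents": sorted(docs)}
            for k, docs in clusters.items()]

records = master_people([
    ("Deborah Birx", "article-1"),
    ("deborah birx", "article-2"),
    ("Dr. Birx", "article-3"),  # a real ML matcher would also merge this one
])
```

The gap the example leaves open, merging "Dr. Birx" with "Deborah Birx", is exactly where the human-guided machine learning earns its keep over rule-based keys.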
47. Conclusions
● The COVID Pandemic data challenges are a macro-view of the same challenges we all face in our own companies as we use data as information to improve outcomes
● People need to work together more effectively so we can erase this Pandemic from our lives
● Trusted data can truly help us save more lives