2. Paul Balas
25+ Years Experience Leading Digital Transformations
Multiple MDM Implementations, Data Governance, and Data Warehouse Initiatives
● Digital Transformation Consultant
● IT Executive
● Enterprise Architect
● Developer
paul.balas@revisioninc.com
3. THE REVISION ANALYTICS DIFFERENCE
Founded in 1998, REVISION is a management consulting firm that marries domain and technology expertise
to achieve amazing results for our clients through digital transformations that deliver value.
● REVISION Senior Consultants Depth Of Experience – “We’ve Walked The Walk.”
● 20+ Years C-level Advisory Roles In Performance Excellence
● 20+ Years IT Operations & Security Experience
● Airport and Transportation Experience
● Culture & Operations State, Local & Federal
● DoD Secret & Top Secret Cleared Staff
Assessment & Implementation: Data Strategy, Data Ops, Data Governance, MDM, Advanced Analytics
4. REVISION – LEADING IN RESPONSE TO THE COVID PANDEMIC
We built a system that helps John F. Kennedy International (JFK),
Newark Liberty International (EWR), and LaGuardia Airport (LGA)
respond to the COVID pandemic
“… The Aviation Strategy Unit, created a custom
dashboard that estimates and predicts how many
passengers will arrive from states with travel
advisories. The new tool combines API data and
information within the Port Authority’s data
warehouse with updates from the COVID Tracking
Project…”
5. This is a presentation to uncover the systemic
data challenges in our Government’s response to
The COVID Pandemic
And a proposal to improve data quality and build trust
6. Agenda
What has happened since we started this journey?
Why is the Data Wrong?
Why the New HHS System Isn’t Working
How To Fix It - a POC
8. Dr. Deborah Birx said “there is nothing from the CDC that I can
trust” in a White House coronavirus task force meeting
May 10, 2020
9. TRUST
I’m going to show you four news articles that
speak to four different types of challenges
with achieving good data governance and
delivering trusted data.
Do you notice any governance challenges we
encounter in our own companies?
10.
Medical examiner offices
had struggled to keep up
with a recent spike in
COVID-19 deaths, with
more than 2,300 deaths
reported statewide in just
the past two weeks,
according to the Florida
Department of Health.
11.
In reviewing the analysis obtained by NPR, Panchadsaram says the local
and hospital-level data HHS is collecting would be very useful to
researchers and health leaders. "That stuff isn't easy to find at a national
level," he says. "There's no one place [publicly] you can go to get all that
data."
October 30, 2020
12.
“During a news conference Saturday morning, Perna
explained that he had not taken into consideration the time
it would take for completed vaccine to go through the full
Food and Drug Administration quality control process, which
can take 48 hours.”
Dec. 19, 2020
13.
“I saw the President presenting graphs
that I never made. I know that someone
out there or someone inside was creating
a parallel set of data and graphics that
were shown to the President.”
January 25, 2021
15. 1: What are your top Data Governance
Challenges?
a) Increasing Volumes
b) Data Silos
c) Unclear Definitions
d) Dueling Dashboards
e) Lack of Executive Alignment
16. Why is COVID-19
Data Wrong?
No published standards on how to collect
and present the information
The CDC had an aging system that
could not adapt quickly to change
The virus is spreading quickly –
huge volumes of new data to process
We weren’t capturing needed data
17. How Do We Need
to Use The Data?
PPE, Hospital Capacity, Testing
Supplies
Ensure we have enough doctors,
nurses, and other healthcare
professionals
Issuing protective orders to stop the
spread of the virus (Shelter-in-place,
Social Distancing, Shuttering
Business)
18. We’ve been asking ourselves, “Could we have saved more lives?”
Without trusted data,
no decision will be driven
with conviction
20. The Data is Inconsistent - States Are Reporting to Differing Standards
https://www.beckershospitalreview.com/data-analytics/states-are-reporting-inconsistent-incomplete-covid-19-data-analysis-finds.html#:~:text=The%20federal%20government%20has%20left,former%20CDC%20Director%20Tom%20Friedan.
21. The Data Isn’t Standardized - How Many Tests Have Been Given?
23. The HHS System isn’t
solving the problem, and
It’s a Big Problem
CDC has 1,200 Users and about 950
State CDC Partners
There are over 6,000 Hospitals in the
US
About 2,000 Hospitals provide COVID
data directly to HHS Protect
TeleTracking provides data from
about 1,100 additional Hospitals
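The scale of the gap can be checked with simple arithmetic. The sketch below uses only the approximate counts quoted on this slide, treats "over 6,000" as exactly 6,000, and assumes the two reporting groups do not overlap (the slide calls the TeleTracking hospitals "additional"). The variable and function names are illustrative.

```python
# Approximate figures quoted on the slide; "over 6,000" is treated as 6,000.
TOTAL_US_HOSPITALS = 6_000
HHS_PROTECT_DIRECT = 2_000   # hospitals reporting directly to HHS Protect
TELETRACKING = 1_100         # additional hospitals reporting via TeleTracking


def coverage_share(reporting: int, total: int) -> float:
    """Return the fraction of hospitals covered by the reporting channels."""
    return reporting / total


# Assumes the two groups do not overlap.
reporting = HHS_PROTECT_DIRECT + TELETRACKING
share = coverage_share(reporting, TOTAL_US_HOSPITALS)
print(f"{reporting} of {TOTAL_US_HOSPITALS} hospitals report: {share:.0%}")
```

Under these assumptions, roughly half of US hospitals report through the two channels named here, which leaves the other half unaccounted for on this slide.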
24. HHS Challenges
● “Use of data across programs… remains a challenge.”
● “Data are often housed in … data silos”
25.
NNDSS is focused on speed, data
format standardization, and
validation
They need to focus on
improving Data Governance
and Collaboration
26.
TeleTracking is just a data entry app
that is driven by procedural rules
The main problem in achieving Data
Quality is PEOPLE
28.
If you’re one of those managers who
believes that a standardized data entry
screen fixes your organization’s data
quality problem...
You haven’t done much data analysis
(go speak with your data team)
29. A Novel Way to Solve
The Data Governance Challenge
30. Imagine This is Your Problem to Solve
You are the CIO of the CDC, tasked with improving our Nation’s ability to
manage this and the next pandemic through the use of data.
Your first goal is to understand the key issues in the current system
(“as-is”), and develop a roadmap to address them.
The stakes are high.
31. Architectural Principles
Build trust in the data
Identify which issues to fix first
The system should be agile to change
Efficient collaboration
33. Identify and Address Systemic Governance Challenges
Master Data Management
Taxonomy Management
Data Quality
Cloud Data Integration
Data Transformation
Orchestration
Natural Language / Feature Extraction
Data Lake
Compute and Storage
34. Understand
Systemic Data
Challenges
You choose to look at issues around
‘Testing’ as you believe you can get
immediate benefits for public health if
you can build confidence in the test
data
What problems are states having in
processing Test Data?
Is Test Data being reported
consistently and accurately?
37. Three Weeks Later
Integrate and process the self-reported
data quality issues from your three test
sites
Create data quality taxonomies to
report on the issue types
Extract entities from the quality issues
and classify them using our
taxonomies
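The three steps above can be sketched in a few lines. This is a toy illustration only: the taxonomy, the keywords, and the sample issue reports are invented for the example, and plain keyword matching stands in for the entity extraction and classification tooling used in the POC.

```python
from collections import Counter

# Hypothetical data quality taxonomy: issue type -> keywords that signal it.
ISSUE_TAXONOMY = {
    "definition": ["definition", "antibody", "counted as"],
    "timeliness": ["late", "delay", "backlog"],
    "completeness": ["missing", "blank", "not reported"],
}


def classify_issue(text: str) -> list[str]:
    """Return every taxonomy label whose keywords appear in the issue text."""
    lowered = text.lower()
    matches = [
        label
        for label, keywords in ISSUE_TAXONOMY.items()
        if any(keyword in lowered for keyword in keywords)
    ]
    return matches or ["unclassified"]


# Made-up self-reported issues from three hypothetical test sites.
reports = [
    ("site_a", "Antibody tests counted as viral tests in daily totals"),
    ("site_b", "Lab results arrive late because of a processing backlog"),
    ("site_c", "Patient county is missing on many records"),
    ("site_c", "Duplicate rows after file resubmission"),
]

# Report on issue types across all sites.
counts = Counter(label for _, text in reports for label in classify_issue(text))
for label, n in counts.most_common():
    print(f"{label}: {n}")
```

Issues that land in "unclassified" are the useful signal here: they show where the taxonomy needs a new category or where subject matter experts need to weigh in.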
38.
39. Virtual
Standardization
Meetings
You use your findings to review issues
and publish recommendations to
conform to new standards for each of
the data providers.
You use the platform to manage
quality on an ongoing basis.
You provide the DataOps platform to
data providers to self-manage their
quality issues.
40. Can We Do More?
You can now see the systemic data quality issues and
communicate with stakeholders effectively to get alignment.
Next, classify news by Public Influencers, Events, and Locations
because you want to see if what our leaders say has an impact
on testing and infection rates.
45. Google Cloud Platform: Logical workflow
[Architecture diagram. Internet data sources: Twitter API, News API, State Health Department Web Pages, JHU GitHub. Platform components: InfoWorks, Data orchestration, Python, Google Cloud Storage, Natural Language, BigQuery, VM instances, TAMR, Tableau.]
48. Data Experts - Spend More Time Analyzing/Strategizing
Before: Experts spend too much
time manually fixing data
Today: ML can do 80% of the
data mastering lift,
enabling experts to put the final
touches on the last 20%.
49. The TAMR Agile Approach to Data Mastering
OLD WAY (Rules-based)
Diagram labels: Source data, Rules, Mastered data
Time: Months to years
Quality: 60%–80% Accuracy
Months 1–4: Identify developers, Get business input, Write rules, Review with business
Months 5–12+: Iterate (Modify rules, create exceptions)
NEW WAY (Machine-driven)
Diagram labels: Source data, Unified data, Mastered data
Time: Days to weeks
Quality: 90%+ accuracy
Weeks 1–12: Iterate with human-guided machine learning
50. Taxonomies: Before vs. After Tamr
TAMR enabled us to create standardized taxonomies that can be managed by a
networked group of hospitals, labs, and health officials.
These taxonomies are critical to having good quality and conformed data across a
widely distributed data network.
There is an efficient mechanism for building consensus across experts at the same
time as fixing the data.
There is no solution like it in the market.
53. Mastering People:
530K to 9K in a
Few Days
Using TAMR, I was able to take a
corpus of over 500k entity records,
identified by Google Natural Language
across 60,000 news articles, hundreds
of web pages, and thousands of tweets,
and reduce it in a few weeks to about 10k
Golden Master People Records
I estimate the system can be
maintained in one to two hours a week
at scale, decreasing to minutes a week
as the model learns
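To make the idea of collapsing many entity mentions into golden records concrete, here is a minimal sketch. It is not how TAMR works: the deck describes human-guided machine learning, while this example substitutes a simple rule-based match key (drop titles, keep first and last name). The title list, function names, and name variants are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical list of titles to ignore when comparing names.
TITLES = {"dr", "mr", "mrs", "ms", "gen", "general", "gov", "governor"}


def match_key(mention: str) -> str:
    """Build a crude match key: lowercase, drop titles, keep first and last token."""
    tokens = [t for t in re.findall(r"[a-z]+", mention.lower()) if t not in TITLES]
    if not tokens:
        return ""
    if len(tokens) == 1:
        return tokens[0]
    return f"{tokens[0]} {tokens[-1]}"


def master_people(mentions: list[str]) -> dict[str, list[str]]:
    """Group raw mentions under one golden record per match key."""
    clusters: dict[str, list[str]] = defaultdict(list)
    for mention in mentions:
        clusters[match_key(mention)].append(mention)
    return dict(clusters)


# Illustrative name variants, as they might appear across articles and tweets.
mentions = ["Dr. Deborah Birx", "Deborah Birx", "Deborah L. Birx", "Gen. Perna", "Perna"]
golden = master_people(mentions)
print(f"{len(mentions)} mentions -> {len(golden)} golden records")
for key, variants in golden.items():
    print(key, variants)
```

A fixed rule like this breaks down quickly on nicknames, misspellings, and shared surnames, which is the gap that a learned, expert-reviewed matching model is meant to close.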
57. Event
Classification
Events are hard to identify in text
I first created a taxonomy of events
tagged by Google Natural Language to
limit those of interest.
I loaded those into TAMR
And then classified them in a few hours
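The first step, using a taxonomy to limit tagged events to those of interest, might look like the sketch below. The record shape (a dict with "name" and "type"), the taxonomy categories, and the sample entities are assumptions for illustration; no Google Natural Language or TAMR calls are made here.

```python
# Hypothetical event taxonomy: category -> event names of interest.
EVENT_TAXONOMY = {
    "public_order": {"shelter-in-place", "social distancing"},
    "briefing": {"news conference", "task force meeting"},
}

# Invert the taxonomy so each event name maps straight to its category.
NAME_TO_CATEGORY = {
    name: category
    for category, names in EVENT_TAXONOMY.items()
    for name in names
}


def events_of_interest(entities: list[dict]) -> list[tuple[str, str]]:
    """Keep entities tagged as events whose name appears in the taxonomy."""
    kept = []
    for entity in entities:
        if entity["type"] != "EVENT":
            continue  # people, organizations, and so on are handled elsewhere
        category = NAME_TO_CATEGORY.get(entity["name"].lower())
        if category is not None:
            kept.append((entity["name"], category))
    return kept


# Made-up tagged entities, shaped like simplified entity-extraction output.
entities = [
    {"name": "News conference", "type": "EVENT"},
    {"name": "Shelter-in-place", "type": "EVENT"},
    {"name": "Port Authority", "type": "ORGANIZATION"},
    {"name": "Trade show", "type": "EVENT"},
]

for name, category in events_of_interest(entities):
    print(f"{name} -> {category}")
```

Narrowing the set this way keeps the later classification step focused on a short list of events rather than everything the tagger finds.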
58. 3: Which of these capabilities are you
most interested in learning more
about?
Agile Data Governance
Google Cloud Platform
Google Natural Language Processing API
TAMR
InfoWorks
DataOps
59. Conclusion
The COVID Pandemic data challenges
are a macro-view of the same
challenges we all face in our own
companies as we use data as
information to improve outcomes
The hardest problem to solve is
injecting Subject Matter Expertise into
a flexible data processing system that
can help us align SME perspectives,
and respond to changing needs with
agility