Building a Federated Data Directory Platform for Public Health
Mark Paul, Engineering Manager
Anshul Bajpai, Data Engineering Lead
Agenda
1. Problems with Centralised Data Directories
2. Solution: Federated Data Directory Platform
3. Design Patterns
4. Intelligent System of Record Ranking
5. Architecture Patterns
▪ Australian digital health infrastructure
▪ National directory of health services and the practitioners who provide them
▪ National, government-owned, not-for-profit organization
▪ Trusted health information and advice for all Australians
#1 Australian health information website
4.8m community connections each month
Problems with Centralised Data Directories
Healthcare Directories - Critical Healthcare Infrastructure
▪ Enables Care Coordination
▪ Single Point-of-Failure
▪ Bad Data Quality = Clinical Risk to Patients
Healthcare Directories - Problems
▪ Data updated via Content Management Systems and Call Centres
▪ Basic centrally managed databases behind the applications
▪ Data volatility (high frequency of change to data)
This model is reactive and inefficient!
Solution: Federated Data Directory Platform
Federating Data is a Powerful Concept
Federated Database:
▪ Maps multiple autonomous database systems into a single federated datastore
Federated Data Platform:
▪ Controlled aggregation to create “gold-standard data” by using multiple Autonomous Origin Data Sources
▪ Data Aggregation via Event Sourcing pipelines
Building the Federated Data “Puzzle”
Federal, State, Public/Private Hospitals, EMRs, and other Commercial Vendors participate as Systems of Record
Design Patterns for Federated Data Platforms
Source Classification
▪ System of Record (SoR): Identify your Authoritative SoRs
SoRs have Role/s:
▪ Source of Truth: Authoritative owner of a subset of data
▪ Source of Validation: Improve Data Quality
▪ Source of Notification: Increase “data currency”
Entity / Channel Setup
▪ Gold Entities: Your final entity models (e.g. Healthcare Service, Organisation, Practitioner)
▪ Raw Entities: Raw (Source) entities in the pre-mapping stage that will eventually be mapped to your Gold Entities
▪ Source Channels: Pipeline channels that transition Raw Entities into new versions of Gold Entities
Attribute Sourcing
Id: "561f10e4-0109-b99f-a2df-c059f9dc4a9b"
name: "Cottesloe Medical Centre"
bookingProviders: [
  { Id: hotdoc, providerIdentifier: cottesloe-medical-centre },
  { Id: healthengine, providerIdentifier: ctl-m-re }
]
practitionerRelations: [
  { pracId: c618860e-a69a, type: providerNumber, value: 2djfkdn3k34 },
  { pracId: hsjfk3e-53vd, type: providerNumber, value: dsfh4kslfls }
]
Calendar: { openRules: […], closedRules: […] }
Contacts: { Email: sss@gmail.com, Website: www.tt.com, Phone: 3242343 }
[Diagram: each attribute group on the Healthcare Service entity is federated from a different source (Vendor Software as SoT, Medicare as SoV, Healthscope as SoN, Internal systems); the Practitioner Relation entity holds details about practitioners who work at a service.]
Data Federation
Pre-Processing → Raw (Bronze) → Stage (Silver) → Gold → Publishing
Pre-Processing Layer
Automated Pre-Processing via Notebooks (origin API, or offline data extracts via SFTP / S3 pickup folders)
▪ Generate the “Source Data” Event Object:
{
  DataPayload: <type> Raw Entity Model
  Provenance: <type> Provenance
}
▪ DataPayload holds the “Raw Entity” (Source-Specific Model)
▪ Provenance is used for source / origin identification
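The event envelope above can be sketched as a small wrapper type. This is a minimal illustration, not the platform's actual classes; the names (`SourceDataEvent`, `Provenance`) and the shortened file name are assumptions for the example.

```python
from dataclasses import dataclass
from typing import Any, Dict

@dataclass
class Provenance:
    # Illustrative origin-identification fields (see the Data Provenance Object later)
    source_file_name: str
    owner_agency: str
    arrival_timestamp: str

@dataclass
class SourceDataEvent:
    # Envelope emitted by the pre-processing layer
    data_payload: Dict[str, Any]  # the source-specific "Raw Entity"
    provenance: Provenance

event = SourceDataEvent(
    data_payload={"name": "Cottesloe Medical Centre"},
    provenance=Provenance(
        source_file_name="ext_provider_location_service_2019-09-17T22:50:47Z.json",
        owner_agency="HDAInternal",
        arrival_timestamp="2019-09-17T22:50:46Z",
    ),
)
```

Keeping payload and provenance in one immutable envelope means every downstream layer can always answer "where did this record come from?".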
Raw Processing Layer (Bronze)
Picks up the Source Data Event from the “Pre-Processing Output”
▪ Performs routine, high-level parsing / cleansing
▪ Generates the “Core Data” Event Object, which is carried through each downstream layer and captures transition/operational changes at each layer
▪ Generates the Event Trace ID - the end-to-end traceability identifier
▪ The Data Lineage Object “begins” here, capturing operational outcomes against events
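A minimal sketch of what Bronze does with an incoming payload, assuming a hypothetical `CoreDataEvent` envelope (the deck does not show the real schema):

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class CoreDataEvent:
    # Hypothetical envelope created at Bronze and carried through every downstream layer
    event_trace_id: str
    data_payload: Dict[str, Any]
    lineage: List[dict] = field(default_factory=list)

def to_core_event(raw_payload: Dict[str, Any]) -> CoreDataEvent:
    # The Event Trace ID is minted once, here, giving end-to-end traceability
    event = CoreDataEvent(event_trace_id=str(uuid.uuid4()), data_payload=raw_payload)
    # The Data Lineage Object "begins" in this layer: record the first operational outcome
    event.lineage.append({"application_state": "RAW_PARSING", "operation_result": "SUCCESS"})
    return event

core = to_core_event({"name": "Cottesloe Medical Centre"})
```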
Stage Processing Layer (Silver)
Picks up the Core Data Event from the “Raw Output”
▪ Mapping Operation - Convert from the Raw (Source) Entity model to the Gold Entity Model
▪ Referencing Operation - Enrichment using Reference Data lookups
▪ Merging Operation - New Gold Version created
▪ Validation Operation - Final validation against Gold Model validation rules
Stage Processing Layer
▪ Merging Operation
▪ Matching by “primary key”
▪ Merging (based on last version) / Delta Determination
▪ Version Incrementation
▪ Metadata Attribution generated and appended
▪ Logs Every Change to Every Attribute on Every Event
▪ Individual Data Lineage Objects store all operational outcomes on the event (attribute exceptions, violations, status changes etc.)
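The merging steps above can be sketched in a few lines. This is an illustrative implementation under simple assumptions (match on an `id` primary key, dict-shaped entities), not the production algorithm:

```python
from typing import Optional

def merge_gold(previous: Optional[dict], incoming: dict, source_id: str) -> dict:
    # Delta Determination: only attributes whose values actually changed
    prev_attrs = (previous or {}).get("attributes", {})
    delta = {k: v for k, v in incoming["attributes"].items() if prev_attrs.get(k) != v}
    # Version Incrementation against the last Gold version
    version = (previous or {}).get("version", 0) + 1
    # Metadata Attribution: log every change to every attribute on every event
    attribution = (previous or {}).get("attribution", []) + [
        {"attribute": k, "source": source_id, "version": version} for k in sorted(delta)
    ]
    return {"id": incoming["id"], "attributes": {**prev_attrs, **delta},
            "version": version, "attribution": attribution, "delta": delta}

v1 = merge_gold(None, {"id": "svc-1", "attributes": {"phone": "3242343"}}, "SoR A")
v2 = merge_gold(v1, {"id": "svc-1", "attributes": {"phone": "5550001"}}, "CSoR B")
```

Because attribution accumulates across versions, the Gold record carries its own per-attribute change history.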
Gold Processing Layer
Picks up the Core Data Event from the “Stage Output”
▪ Entity Relationship Validation - Ensures entity relationships are “intact” - Prevents Orphans
▪ Re-Processing & Replay - Replay latest versions (so new reference data or business / validation logic can apply)
▪ Data Science Layer - Data Quality Benchmarks
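Entity relationship validation ("prevent orphans") reduces to a reference check before publishing. A sketch, assuming the entity shapes from the Attribute Sourcing slide:

```python
def find_orphan_relations(services, practitioners):
    # Every practitionerRelation on a service must point at a known practitioner;
    # anything else is an "orphan" relation and should block Gold publishing.
    known = {p["id"] for p in practitioners}
    orphans = []
    for svc in services:
        for rel in svc.get("practitionerRelations", []):
            if rel["pracId"] not in known:
                orphans.append((svc["id"], rel["pracId"]))
    return orphans

orphans = find_orphan_relations(
    services=[{"id": "svc-1", "practitionerRelations": [
        {"pracId": "c618860e-a69a"}, {"pracId": "missing-prac"}]}],
    practitioners=[{"id": "c618860e-a69a"}],
)
```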
Data Provenance Object
event_trace_id: 79d77056-c773-4496-ac0d-5223c49e06f0
file_name: ext_provider_location_service_bf14feb8-538a-4f40-85eb-93b77d2c1704_2019-09-17T22:50:47Z.json
source_file_name: ext_provider_location_service_bf14feb8-538a-4f40-85eb-93b77d2c1704_2019-09-17T22:50:47Z.json
flow_name: nifi_flow_ext_provider_location_service_withstate_v1
owner_agency: HDAInternal
arrival_timestamp: 2019-09-17T22:50:46Z
primary_key: [“pLocSvcId”]
primary_key_temporal: TRUE
data_in_load_strategy: DELTA
unique_data_code: ext_provider_location_service
version: v0.0.1
source_identifier: TAL-2324
Trace an Event back to its Exact Origin
▪ Identify Upstream Source Identity & Raw Source File
▪ Inject Source (External) Identifier (e.g. Jira Ticket #)
Source Intention
▪ Target Entity (what this event intends to update)
Data Lineage Object
event_trace_id: 79d77056-c773-4496-ac0d-5223c49e06f0
application_id: STAGE-01
application_name: STAGE
application_description: Versioned Entity Data
application_version: 1.0.0
application_state: STAGE_REFERENCING
dms_event_id: 4000ae0b-6b08-4dce-a432-fff8e608e7ec
source_dms_event_id: 4de1802c-70e6-4552-b2b0-4349bfc3a073
operation: [{
  operation_name: ENTITY_REFERENCING,
  operation_rule_name: plsParsing,
  operation_result: SUCCESS,
  failure_severity: “”,
  attributes: [“”],
  created_time: 2019-09-30T22:39:47Z
}],
created_time: 2019-09-30T22:39:47Z
Encapsulates the Operation Outcomes that occur to Entity Events
▪ Captures deviations in Data Quality
▪ Exceptions / Warnings
▪ Exception - Fix Data
▪ Warning - Improve Data (Quality)
▪ Visibility of End-to-End Data Flow (via Operational Outcomes Summary)
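Appending operation outcomes to a lineage object, as shown above, can be sketched with a small helper (`record_operation` is an illustrative name, not the platform's API):

```python
def record_operation(lineage_event: dict, name: str, result: str,
                     severity: str = "", attributes=None) -> dict:
    # Append one operation outcome, mirroring the Data Lineage Object's
    # "operation" array: EXCEPTION means fix the data, WARNING means improve it.
    lineage_event.setdefault("operation", []).append({
        "operation_name": name,
        "operation_result": result,
        "failure_severity": severity,    # "" on success
        "attributes": attributes or [],  # which attributes the outcome applies to
    })
    return lineage_event

ev = {"event_trace_id": "79d77056-c773-4496-ac0d-5223c49e06f0"}
record_operation(ev, "ENTITY_REFERENCING", "SUCCESS")
record_operation(ev, "ENTITY_VALIDATION", "FAILURE", "WARNING", ["Contacts.Phone"])
```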
Intelligent System of Record Ranking
Problem
“Dedicated System of Record (SoR)”
Has full update authority over your data attributes
1. Data Quality Regressions flow into your System
2. Low Frequency of Change (Low Data Currency)
Solution
“Candidate Systems of Record (CSoR)”
Alternate “SoRs” that compete to update the same data
[Diagram: a Healthcare Service entity ({Opening Hours, Contact Details}) fed by SoR A alongside candidates CSoR B and CSoR C]
Manual Ranking - Healthcare Service Entity Attributes

Source   Opening Hours   Contact Details
SoR A    Priority 1      Priority 1
CSoR B   -               Priority 2
CSoR C   -               Priority 3

▪ Ranking assigned based on “business priority”
▪ In the same MicroBatch, SoR A wins over CSoR B and CSoR C
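Per-attribute priority resolution within one micro-batch can be sketched like this (the priority table is from the slide; the `resolve` function is illustrative):

```python
# Per-attribute priority table: lower number = higher priority.
PRIORITY = {
    "opening_hours":   {"SoR A": 1},
    "contact_details": {"SoR A": 1, "CSoR B": 2, "CSoR C": 3},
}

def resolve(attribute: str, updates: dict) -> str:
    # Within one micro-batch, the highest-priority source that supplied
    # a value for this attribute wins.
    ranked = PRIORITY[attribute]
    candidates = [s for s in updates if s in ranked]
    winner = min(candidates, key=lambda s: ranked[s])
    return updates[winner]

# All three sources update contact details in the same micro-batch; SoR A wins:
value = resolve("contact_details",
                {"CSoR C": "c@x.com", "CSoR B": "b@x.com", "SoR A": "a@x.com"})
```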
Automatic Ranking - Healthcare Service Entity Lineage Events

Source   Total Updates   Contact Details - Lineage Warnings   Contact Details - Lineage Errors
SoR A    10              4                                    2
CSoR B   8               1                                    0
CSoR C   2               1                                    1

▪ Data Lineage outcomes aggregated over the last 30 days
▪ “Priority boosted” based on “Recent Performance” of Sources
▪ Original Priority (Contact Details): SoR A = 1, CSoR B = 2, CSoR C = 3; Updated Priority: SoR A = 2, CSoR B = 1, CSoR C = 3
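One plausible boosting rule, re-ranking sources for an attribute by the rate of lineage problems in their recent updates. The deck does not give the real scoring function, so this is an assumption chosen to reproduce the slide's re-ranking:

```python
def boost_priorities(stats: dict) -> dict:
    # Rank sources by (warnings + errors) per update over the last 30 days;
    # the cleanest source gets Priority 1.
    problem_rate = lambda s: (stats[s]["warnings"] + stats[s]["errors"]) / stats[s]["updates"]
    ordered = sorted(stats, key=problem_rate)
    return {source: rank for rank, source in enumerate(ordered, start=1)}

# 30-day lineage aggregates for Contact Details, from the slide:
new_rank = boost_priorities({
    "SoR A":  {"updates": 10, "warnings": 4, "errors": 2},
    "CSoR B": {"updates": 8,  "warnings": 1, "errors": 0},
    "CSoR C": {"updates": 2,  "warnings": 1, "errors": 1},
})
```

CSoR B (0.125 problems per update) is boosted above SoR A (0.6), while CSoR C (1.0) stays last.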
Intelligent Ranking - Future State - Healthcare Service Entity Features

Source         Total Updates  Lineage Warnings  Lineage Errors  Public Complaints Count  Completeness Score  Consistency Score  Accuracy Score  Conformity Score  Integrity Score  … Nth
SoR A          10             4                 2               2                        60                  20                 99              56                21               …
CSoR B         8              1                 0               1                        45                  34                 80              54                22
CSoR C         2              1                 1               6                        78                  45                 34              56                45
… Nth Source   …              …                 …               …                        …                   …                  …               …                 …

▪ Sources and Features are growing
▪ “Seasonal” Data Regression
▪ Source Data Quality Model: a “Confidence Score” based on “Past Performance”, applied in “Real Time”
Architecture Patterns for Federated Data Platforms
Architecture Overview
Logical Data Zones using Databricks DELTA
▪ Data Control Plane: LANDING, RAW, STAGE, GOLD (i.e. Bronze, Silver, Gold)
▪ Uses the DELTA Cache for performance optimisation (stream and batch workloads)
▪ Runs on AWS Accounts under our Security Policy and Regulatory Compliance
▪ Operational Control Plane: Cluster Administration and management functions like Access Control, Jobs and Schedules
Data Plane & Processing Pipeline
Continuous Streaming Applications
▪ Enable “True Event Sourcing” via Streaming Input, Kinesis, S3, and DELTA
▪ Running micro-batches leads to smaller, more manageable data volumes
▪ Recoverability through Checkpoints and Reliability through Streaming Sinks to DELTA tables
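The checkpoint/recovery idea can be shown without Spark. This plain-Python simulation stands in for what Structured Streaming checkpointing does for real: commit the processed offset after each micro-batch so a restarted job resumes instead of reprocessing:

```python
import json
import os
import tempfile

def process_stream(records, checkpoint_path, sink):
    # Resume from the last committed offset if a checkpoint exists
    offset = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            offset = json.load(f)["offset"]
    for i in range(offset, len(records)):
        sink.append(records[i])               # reliable sink (a DELTA table in the deck)
        with open(checkpoint_path, "w") as f:
            json.dump({"offset": i + 1}, f)   # commit the checkpoint after the batch

sink = []
ckpt = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
process_stream(["e1", "e2"], ckpt, sink)        # first run processes e1, e2
process_stream(["e1", "e2", "e3"], ckpt, sink)  # "restart": resumes at offset 2, no duplicates
```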
Data Issue Problem Statement:
A downstream Health Integrator is complaining that unanticipated special unicode characters in the service description are breaking their integration.
Restore & Recover Data Versions Seamlessly
▪ During data quality issues, we can rewind to previous versions
▪ Using Metadata Attribution, Provenance and Data Lineage features, we can trace the root cause to a specific origin source with up to millisecond precision
▪ Complete audit trail, and the ability to provide SoR Data Quality Reporting
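Delta Lake provides this rewind natively (time travel / `RESTORE TABLE … TO VERSION AS OF`). The toy class below only mimics the idea, so the unicode scenario above can be walked through in a few lines:

```python
class VersionedTable:
    """Toy stand-in for Delta's version history: every write appends a
    snapshot, so "rewind" is just reading an earlier version."""
    def __init__(self):
        self.versions = []

    def write(self, snapshot):
        self.versions.append(snapshot)
        return len(self.versions) - 1  # the version number just written

    def read(self, version=None):
        # Latest version by default, or any historical version on demand
        return self.versions[-1 if version is None else version]

table = VersionedTable()
good = table.write({"description": "GP clinic"})            # version 0: clean data
table.write({"description": "GP clinic \u0000\ufffd"})      # version 1: bad unicode arrives
restored = table.read(version=good)                         # rewind past the bad data
```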
Questions & Feedback
Mark Paul - @ThisIsMarkPaul
Your feedback is important to us.
Don’t forget to rate and
review the sessions.
