Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
WIFI SSID:SparkAISummit | Password: UnifiedAnalytics
Peter James, Healthdirect Australia
How Australia’s National Health Services
Directory Improved Data Quality, Reliability,...
About Us
Healthdirect Australia designs and delivers
innovative services for governments to provide
every Australian with ...
Our Impact
4#UnifiedAnalytics #SparkAISummit
National Health Services Directory
The National Health Services Directory (NHSD) holds core
information about healthcare o...
Supporting Health Planning
NHSD is used widely for analysis & planning
6#UnifiedAnalytics #SparkAISummit
healthmap.com.au ...
Landscape Changes
A Strategic Assessment defined a long-range vision of
success
• Australian Health Sector promotes adopti...
Challenges
Data Quality and Governance
• Availability and access to Systems Of Record
• Often no Single Source Of Truth
• ...
Challenges
Data Silos
• Data stored in multiple subsystems
– File Systems, ETL and workflow transient storage systems
– Mu...
Data Silos to Data Aggregation
Single point of access to the component pieces
10#UnifiedAnalytics #SparkAISummit
AHPRA
(Re...
Challenges
Data Scale
• Processes need to scale to ~1 Billion Data Points
• New demands include Bookings, Appointments, Re...
Architecture Overview
12#UnifiedAnalytics #SparkAISummit
Architecture Improvements
Databricks DELTA to create Logical Data Zones
• LANDING, RAW, STAGE, GOLD (i.e. Bronze, Silver, ...
Architecture Improvements
SPARK Structured Streaming for Continuous Applications
• Create a Streaming versions of existing...
Databricks Cluster
Continuous Applications –
Continued
• Keeps 'in-memory’ state of object sets
including comparable versi...
Data Plane & Pipeline
16#UnifiedAnalytics #SparkAISummit
Architecture Improvements
Unified Analytics and Operations
• Data cleaning & matching algorithms reliable, predictable and...
Improved Algorithm Development &
Usage
18#UnifiedAnalytics #SparkAISummit
Data Sharing Agreements Data Integration Algos &...
Algorithm Evaluation
19#UnifiedAnalytics #SparkAISummit
FPR TPR threshold
0.014433 0.800855 1.00
0.014433 0.801709 0.99
0....
Improved Data Processing
• 95% fuzzy name matches vs. <80% and dependent
on manual verification
• ~30,000 Automated Update...
Improved Data Security
• Databricks Platform provided foundational Security Accreditations
• Able to cross-walk these secu...
Securing it ALL
22#UnifiedAnalytics #SparkAISummit
Other Benefits
• Don’t have to be big to see benefits, strategic value and a new business model
• Strategic momentum throu...
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT
Upcoming SlideShare
Loading in …5
×

How Australia’s National Health Services Directory Improved Data Quality, Reliability, and Integrity with Databricks Delta and Structured Streaming

334 views

Published on

"Healthdirect Australia delivers telehealth and digital health services in partnership with the Australian federal, state and territory governments. Healthdirect helps address key priorities and challenges within the health, aging and social service sectors. In Australia, we run the National Health Services Directory (NHSD) that stores every Healthcare Organisation, Health Practitioner, their Physical and Virtual Locations, Health Care Services and eHealth Secure Messaging Endpoints. This is forecast to need to scale to handle over 10 terabytes of data covering time-driven activity-based healthcare transactions.

Globally we share similar problems with Healthcare data such as incomplete integration across the Health Sector participants, data supply chain inefficiency and duplication. At Healthdirect we use Apache Spark and Databricks Delta’s fine-grained table features and data versioning to solve duplication and eliminate data redundancy. This has enabled us to develop and provide high-quality data through federation and interoperability services whilst providing the analytics to improve Health Services demand forecasting and clinical outcomes in service lines, such as Aged Care and Preventative Health."

Published in: Data & Analytics
  • Be the first to comment

  • Be the first to like this

How Australia’s National Health Services Directory Improved Data Quality, Reliability, and Integrity with Databricks Delta and Structured Streaming

  1. 1. WIFI SSID:SparkAISummit | Password: UnifiedAnalytics
  2. 2. Peter James, Healthdirect Australia How Australia’s National Health Services Directory Improved Data Quality, Reliability, and Integrity with Databricks Delta and Structured Streaming #UnifiedAnalytics #SparkAISummit
  3. 3. About Us Healthdirect Australia designs and delivers innovative services for governments to provide every Australian with 24/7 access to the trusted information and advice they need to manage their own health and health-related issues. 3#UnifiedAnalytics #SparkAISummit
  4. 4. Our Impact 4#UnifiedAnalytics #SparkAISummit
  5. 5. National Health Services Directory The National Health Services Directory (NHSD) holds core information about healthcare organisations, services and practitioners. • Hospital, EMR, Registry Data sharing Arrangements • Clinical Pathway Applications • Telehealth and Contact Centre Services • Support eHealth Secure Messaging Delivery • API Integration using Fast Healthcare Interoperability Resources (FHIR) • Consumer API’s and Web Widgets • Sector Analytics and Reporting 5#UnifiedAnalytics #SparkAISummit
  6. 6. Supporting Health Planning NHSD is used widely for analysis & planning 6#UnifiedAnalytics #SparkAISummit healthmap.com.au Royal Flying Doctors Service’s, SPOT platform
  7. 7. Landscape Changes A Strategic Assessment defined a long-range vision of success • Australian Health Sector promotes adoption of HL7/FHIR, Provider Directory Federation and Open Data standards • Business drivers pushing for Cost reduction and operational efficiencies • Architecture needed to scale to meet the objectives defined for future success • Need to develop new Analytics capabilities in support of Sector Health Planning • Improve Information Security to support with broader usage scenarios and regulatory changes 7#UnifiedAnalytics #SparkAISummit
  8. 8. Challenges Data Quality and Governance • Availability and access to Systems Of Record • Often no Single Source Of Truth • High variety of data sources from Federal, State, Public/Private Hospitals, EMR, and other Commercial Vendors • Health Domains are complex - Ontologies, Taxonomies, Thesauri, Code sets (e.g. MeSH, SNOMED-CT, ANZSIC) • Record Linking Issues, identity harmonisation, quasi-identifiers • Disjoint Data Governance (Network of Data Governance participants) 8#UnifiedAnalytics #SparkAISummit
  9. 9. Challenges Data Silos • Data stored in multiple subsystems – File Systems, ETL and workflow transient storage systems – Multiple RDBMS, Multiple Schemas – Search Engine Indexes • Read Inconsistency – Data out-of-synch between Search, Databases and Analytics storage – Large batch updates of low quality data leading to high error rates and process inefficiencies – Transactional boundaries per Data Silo no holistic end-to-end consistency • Data Access – High operational overheads processes – Incompatible with Security boundaries 9#UnifiedAnalytics #SparkAISummit
  10. 10. Data Silos to Data Aggregation Single point of access to the component pieces 10#UnifiedAnalytics #SparkAISummit AHPRA (Registries) Medicare Jurisdictions Aged Care Secure Message Vendors Desktop Vendors Hospital Networks Others NHSD
  11. 11. Challenges Data Scale • Processes need to scale to ~1 Billion Data Points • New demands include Bookings, Appointments, Referrals, Pricing and eHealth Transaction Activity – est. 1TB p.a. • Support Data federation and Interoperability requirements with requests growing > 58% p.a. • Existing systems were already under pressure with batch overruns and significant administration overheads 11#UnifiedAnalytics #SparkAISummit
  12. 12. Architecture Overview 12#UnifiedAnalytics #SparkAISummit
  13. 13. Architecture Improvements Databricks DELTA to create Logical Data Zones • LANDING, RAW, STAGE, GOLD (i.e. Bronze, Silver, Gold) • Store data ‘as-is’ either structured and/or unstructured data in DELTA Tables and Physical Partitions • Read Consistency, Data integrity - ACID transactions • Used DELTA Cache for frequent queries with transparent, automated cache control • Databricks Operational Control Plane for Cluster Administration, Management functions like, Access Control, Jobs and Schedules • Data Control Plane runs within our AWS Accounts under our Security Policy and regulatory compliance 13#UnifiedAnalytics #SparkAISummit
  14. 14. Architecture Improvements SPARK Structured Streaming for Continuous Applications • Create a Streaming versions of existing batch ETL jobs. • Move to Event based Data flows via Streaming Input, Kinesis, S3, and DELTA • Running more frequently lead to smaller and more manageable data volumes • Automation of release processes through Continuous Delivery and use of Databricks REST API • Recoverability through Checkpoints and Reliability through Streaming Sinks to DELTA tables 14#UnifiedAnalytics #SparkAISummit
  15. 15. Databricks Cluster Continuous Applications – Continued • Keeps 'in-memory’ state of object sets including comparable versions for object changes • Stateful transformations (flatMapGroupsWithState) • Aggregations for Data Quality measurements • Built User-defined libraries for Data Cleansing, Validation and complex logic 15#UnifiedAnalytics #SparkAISummit def appendAttributeStats( self, sourceDF, attribute_name, metrics = ['mean', 'min', 'max', 'median', 'num_missing', 'num_unique']):
  16. 16. Data Plane & Pipeline 16#UnifiedAnalytics #SparkAISummit
  17. 17. Architecture Improvements Unified Analytics and Operations • Data cleaning & matching algorithms reliable, predictable and measurable • Data Quality Analytics calculated at runtime, attached to object attributes, populate analysts dashboards • Track Data Lineage and Lifecycles providing Operational visibility of where our data is at any point • Complex processes easily broken down into simpler steps with ‘checks and balances’ prior to execution • Significant cost reduction through decommissioning complex Administration UI in favour of notebook environment 17#UnifiedAnalytics #SparkAISummit
  18. 18. Improved Algorithm Development & Usage 18#UnifiedAnalytics #SparkAISummit Data Sharing Agreements Data Integration Algos & Processes Data Quality Auditing Algorithms • sound design, goal-focused • verified against trusted data • version-controlled, documented • archived with training & validation data Data • permanent, versioned • use provenance to enhance algos, revise confidence estimate Data Quality Reporting
  19. 19. Algorithm Evaluation 19#UnifiedAnalytics #SparkAISummit FPR TPR threshold 0.014433 0.800855 1.00 0.014433 0.801709 0.99 0.014433 0.831054 0.98 0.014433 0.855840 0.97 0.016495 0.895726 0.95 0.026804 0.950997 0.90 • interpretation – the algorithm is appropriate to the domain – the data is of good quality – un-matchable names are rare exceptions – a matching threshold of 0.95 gives TPR = 95.1% FPR = 2.7% ROC curve for the set-based, fuzzy name-matching algorithm
  20. 20. Improved Data Processing • 95% fuzzy name matches vs. <80% and dependent on manual verification • ~30,000 Automated Updates / mth vs. ~30,000 bi- annual & Manual • 20 mins full load vs. > 24 hrs • 20 million records @ ~1 million/min 20#UnifiedAnalytics #SparkAISummit
  21. 21. Improved Data Security • Databricks Platform provided foundational Security Accreditations • Able to cross-walk these security compliance frameworks and localize to Australian Regulatory frameworks • This provided significant cost/effort reduction of GRC • Continuous Data Assurance - Security Guard-Rails monitor changes to access privileges, e.g. role elevation/conditional access, data security metadata data and auditing against data leakage and spills. 21#UnifiedAnalytics #SparkAISummit
  22. 22. Securing it ALL 22#UnifiedAnalytics #SparkAISummit
  23. 23. Other Benefits • Don’t have to be big to see benefits, strategic value and a new business model • Strategic momentum through realizing business vision with technology transformation • Dataset Transparency and Explainable Models with well-documented lineage and quality • Broader participation - no one holds a single parts of knowledge anymore, we extract more value in our data when everyone was able to access it • Governance transitioned focus on operational improvements • Inked Agreement in August 2018 completed migration to Databricks December– Live March 2019 23#UnifiedAnalytics #SparkAISummit
  24. 24. DON’T FORGET TO RATE AND REVIEW THE SESSIONS SEARCH SPARK + AI SUMMIT

×