Apache Spark Data Governance Best Practices—Lessons Learned from Centers for Medicare and Medicaid Services


Donghwa John Kim, NewWave
#UnifiedAnalytics #SparkAISummit

Customers

About NewWave
• Mid-Size Business
• 300+ Employees
• 11 Prime Contracts
• Support 7 CMS Centers
• CMMI Level 4 for Services & Development
• ISO 9001:2015
• Databricks Gold Level Partner
• Microsoft Gold Cloud Platform
• AWS Advanced Consulting Partner
Prime Contract Vehicles
• CMS SPARC – 8(a) & Small
• GSA 8(a) STARS II
• GSA Schedule 70 & Health IT SIN

Technology Vendor Partners

Centers for Medicare & Medicaid Services (CMS)
CMS is the largest healthcare payer in the country, with a budget of $793.7B. NewWave is its trusted partner and leading innovator.

A unique customer that sets the standard for the industry and defines the market in healthcare.

Data Challenge
• 2 billion data points* annually to store, analyze, and disseminate
• Privacy requirements (PHI, PII) without compromising agility
• Central view of available data on multiple systems
* Just on Medicare data

The Objectives
The vision is to provide a simple and reliable technology and data experience for all CMS IT portfolio stakeholders.
• Center-wide shared data services
• Robust data governance
• Single cloud-native architecture

"The definition of genius is taking the complex and making it simple."
– Albert Einstein

Solution from a Bird’s Eye View

Data Agility
Improved Data Quality
Cost Effectiveness
Data as a Service

Agility - Dremio Virtual Datasets
• Built on top of the immutable physical datasets found in sources
• A layered stack of data transformations performed on top of one or more physical datasets
• Each virtual dataset is ultimately described by a SQL query
• Virtual datasets can be chained: one virtual dataset can be built on top of another
• Data Lineage - a history of all the applied transformations is available
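
The layering described above can be sketched with plain SQL views. This is a minimal illustration only: it uses SQLite views from Python as a stand-in for Dremio virtual datasets, and the table, view, and column names (claims, valid_claims, claims_by_state) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Physical dataset: loaded once and never modified by the layers above it.
conn.execute("CREATE TABLE claims (claim_id INTEGER, state TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO claims VALUES (?, ?, ?)",
    [(1, "MD", 120.0), (2, "MD", 80.0), (3, "VA", 50.0), (4, "VA", -10.0)],
)

# Layer 1: a "virtual dataset" that is nothing more than a SQL query.
conn.execute("CREATE VIEW valid_claims AS SELECT * FROM claims WHERE amount > 0")

# Layer 2: chained on top of layer 1 rather than on the physical table.
conn.execute(
    "CREATE VIEW claims_by_state AS "
    "SELECT state, SUM(amount) AS total FROM valid_claims GROUP BY state"
)

totals = conn.execute(
    "SELECT state, total FROM claims_by_state ORDER BY state"
).fetchall()

# A crude form of lineage: the stored definition of each layer.
lineage = dict(
    conn.execute("SELECT name, sql FROM sqlite_master WHERE type = 'view'").fetchall()
)

print(totals)
print(lineage["claims_by_state"])
```

Because every layer is only a query, the physical table is never copied, and the stored definitions show which dataset each layer was derived from.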

Agility - Dremio Virtual Dataset Example

Simplicity - SQL for [almost] EVERYTHING
• Ability to join data from multiple data sources, including JSON, CSV, Parquet, relational databases, and NoSQL
• Unified interface for the data
And suddenly ... SQL is sexy again!
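
As a rough illustration of joining differently formatted sources with a single SQL statement, the sketch below loads a CSV snippet and a JSON snippet into an in-memory SQLite database and joins them. SQLite is only a stand-in here (Dremio queries the sources in place), and the sample data and names are made up.

```python
import csv
import io
import json
import sqlite3

# Hypothetical sample data in two different formats.
CSV_TEXT = "provider_id,name\n1,Clinic A\n2,Clinic B\n"
JSON_TEXT = (
    '[{"provider_id": 1, "amount": 120.0},'
    ' {"provider_id": 1, "amount": 30.0},'
    ' {"provider_id": 2, "amount": 75.5}]'
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE providers (provider_id INTEGER, name TEXT)")
conn.execute("CREATE TABLE payments (provider_id INTEGER, amount REAL)")

# Load the CSV source.
conn.executemany(
    "INSERT INTO providers VALUES (?, ?)",
    [(int(r["provider_id"]), r["name"]) for r in csv.DictReader(io.StringIO(CSV_TEXT))],
)

# Load the JSON source.
conn.executemany(
    "INSERT INTO payments VALUES (?, ?)",
    [(p["provider_id"], p["amount"]) for p in json.loads(JSON_TEXT)],
)

# One SQL statement spans both sources.
joined = conn.execute(
    """
    SELECT pr.name, SUM(pa.amount) AS total_paid
    FROM providers AS pr
    JOIN payments AS pa ON pa.provider_id = pr.provider_id
    GROUP BY pr.name
    ORDER BY pr.name
    """
).fetchall()

print(joined)
```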

Simplicity - SQL for [almost] EVERYTHING

Privacy - Row Level Masking
Use query_user() and is_member() to filter rows selectively for different users or groups without creating multiple datasets.
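
A minimal sketch of the idea, assuming a single shared query whose WHERE clause depends on who is asking. SQLite has no query_user() or is_member(), so the sketch registers Python stand-ins under those names; the table, users, and group names are hypothetical, and real Dremio behavior may differ in detail.

```python
import sqlite3

# Hypothetical group membership; in Dremio this comes from the identity provider.
GROUPS = {"carol": {"auditors"}}

# One dataset definition serves every user: each row is visible to its
# reviewer, and members of the 'auditors' group see all rows.
ROW_FILTER_SQL = """
SELECT claim_id, reviewer, amount
FROM claims
WHERE reviewer = query_user() OR is_member('auditors')
ORDER BY claim_id
"""


def rows_visible_to(user):
    """Run the shared query as `user` and return the claim ids that user may see."""
    conn = sqlite3.connect(":memory:")
    # Stand-ins for the functions of the same name named on the slide.
    conn.create_function("query_user", 0, lambda: user)
    conn.create_function(
        "is_member", 1, lambda group: int(group in GROUPS.get(user, set()))
    )
    conn.execute("CREATE TABLE claims (claim_id INTEGER, reviewer TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO claims VALUES (?, ?, ?)",
        [(1, "alice", 120.0), (2, "bob", 80.0), (3, "alice", 50.0)],
    )
    return [row[0] for row in conn.execute(ROW_FILTER_SQL)]


for name in ("alice", "bob", "carol", "dave"):
    print(name, rows_visible_to(name))
```

The same SQL text is used for every caller; only the values returned by the two functions change.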

Privacy - Column Level Masking

Privacy - Column Level Masking - VDS
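
The slide's screenshot is not reproduced in this text. As a hedged sketch of what column-level masking in a virtual dataset can look like, the query below reveals a sensitive column only to members of a privileged group and returns a masked value to everyone else. is_member() is again a Python stand-in registered in SQLite, and the table, group name, and sample values are hypothetical.

```python
import sqlite3

# Hypothetical group membership.
GROUPS = {"carol": {"phi_readers"}}

# The sensitive column is wrapped in a CASE expression, so unprivileged
# users only ever receive the last four characters.
MASKED_SQL = """
SELECT beneficiary_id,
       CASE WHEN is_member('phi_readers') THEN ssn
            ELSE 'XXX-XX-' || substr(ssn, -4)
       END AS ssn
FROM beneficiaries
ORDER BY beneficiary_id
"""


def query_as(user):
    """Run the masked query with is_member() evaluated for `user`."""
    conn = sqlite3.connect(":memory:")
    conn.create_function(
        "is_member", 1, lambda group: int(group in GROUPS.get(user, set()))
    )
    conn.execute("CREATE TABLE beneficiaries (beneficiary_id INTEGER, ssn TEXT)")
    conn.executemany(
        "INSERT INTO beneficiaries VALUES (?, ?)",
        [(1, "000-00-1234"), (2, "000-00-5678")],  # fake values
    )
    return conn.execute(MASKED_SQL).fetchall()


print(query_as("carol"))
print(query_as("dave"))
```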

Centralized View - Data Catalog
• Ability to search for the data
• Collaboration experience using Wiki and content tagging

Data Lineage

Looker’s LookML = “SQL Evolved”
LookML is a language for describing dimensions, aggregates, calculations, and data relationships in a SQL database.

Looker’s Explorer

LookML => SQL

Data Modeling with Looker
SQL models generated by Looker from LookML can be exported into Dremio to create virtual datasets.
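
To make the "model in, SQL out" idea concrete, here is a toy generator. It is not LookML syntax and not how Looker is implemented; the dictionary layout, build_sql(), and all names are assumptions made for illustration. The generated SELECT statement is the kind of text that could then be saved as a virtual dataset.

```python
import sqlite3

# Hypothetical declarative model: named dimensions and measures over one table.
MODEL = {
    "table": "claims",
    "dimensions": {"state": "state"},
    "measures": {"total_amount": "SUM(amount)", "claim_count": "COUNT(*)"},
}


def build_sql(model, dimensions, measures):
    """Turn a selection of dimensions and measures into a SELECT statement."""
    select = [f"{model['dimensions'][d]} AS {d}" for d in dimensions]
    select += [f"{model['measures'][m]} AS {m}" for m in measures]
    sql = f"SELECT {', '.join(select)} FROM {model['table']}"
    if dimensions:
        sql += " GROUP BY " + ", ".join(model["dimensions"][d] for d in dimensions)
    return sql


sql = build_sql(MODEL, ["state"], ["total_amount", "claim_count"])
print(sql)

# Check the generated SQL against a small in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (claim_id INTEGER, state TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO claims VALUES (?, ?, ?)",
    [(1, "MD", 120.0), (2, "MD", 80.0), (3, "VA", 50.0)],
)
rows = conn.execute(sql + " ORDER BY state").fetchall()
print(rows)
```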

Accessing Dremio from Databricks
• Adding Dremio JDBC Driver jar in Databricks

Accessing Dremio from Databricks
[Code screenshot with callouts: driver, virtual dataset, parallelism level, and Databricks secrets ("Use it!" *)]
* https://docs.databricks.com/user-guide/secrets/example-secret-workflow.html
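
The screenshot itself is not reproduced in this text, so the following is only a sketch of how such a read might be set up with Spark's generic JDBC data source. The helper names, host, dataset path, and partition bounds are hypothetical; the URL form and driver class are those of the Dremio JDBC driver, and 31010 is assumed to be the client port in use. On Databricks the credentials would normally come from dbutils.secrets.get(scope=..., key=...) rather than literals.

```python
def dremio_jdbc_options(host, table, user, password, port=31010,
                        partition_column=None, lower_bound=None,
                        upper_bound=None, num_partitions=None):
    """Build the option map for Spark's JDBC reader (hypothetical helper)."""
    options = {
        "url": f"jdbc:dremio:direct={host}:{port}",
        "driver": "com.dremio.jdbc.Driver",
        "dbtable": table,      # the virtual dataset to read
        "user": user,
        "password": password,
    }
    partitioning = (partition_column, lower_bound, upper_bound, num_partitions)
    if any(value is not None for value in partitioning):
        # Spark requires these four options to be supplied together.
        if any(value is None for value in partitioning):
            raise ValueError("partitioning needs column, bounds, and num_partitions")
        options.update({
            "partitionColumn": partition_column,
            "lowerBound": str(lower_bound),
            "upperBound": str(upper_bound),
            "numPartitions": str(num_partitions),  # the parallelism level
        })
    return options


def read_virtual_dataset(spark, options):
    """Load the dataset as a DataFrame; needs a live SparkSession and the driver jar."""
    return spark.read.format("jdbc").options(**options).load()


# Placeholder values; nothing here connects to a server.
opts = dremio_jdbc_options(
    "dremio.example.com", '"analytics"."claims_by_state"', "svc_user", "not-a-real-password",
    partition_column="claim_id", lower_bound=1, upper_bound=1000, num_partitions=8,
)
print({k: v for k, v in opts.items() if k != "password"})
```

With numPartitions set, Spark splits the read into that many range queries on the partition column, which is what controls the parallelism of the load.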

Demo

Incremental Approach Meets Client Needs
Focusing on structured and semi-structured data
• Unstructured data (e.g., medical records) will be supported in the near future
Human-centered design (HCD) approach
• Architecture and process will evolve as we interact more with end users and collect feedback
Tools evolving to meet our needs
• Dremio-Snowflake integration
• Native Apache Spark (Arrow Flight RPC) and SAS connectors
• Availability of other services in Microsoft Azure Government

Questions?
We are hiring!
Also check out CMS AI Challenge:
https://www.cms.gov/newsroom/fact-sheets/cms-artificial-intelligence-health-outcomes-challenge

