In today's digital age of data exploration, Apache Spark has become the de facto platform for processing large volumes of data from a variety of sources in diverse formats, serving equally disparate destinations for Business Intelligence and Advanced Analytics. The Centers for Medicare and Medicaid Services (CMS) is a federal health agency under the Department of Health and Human Services (HHS). It is the single largest payer for health care in the United States, serving nearly 90 million Americans who rely on health care benefits through Medicare, Medicaid, and the State Children's Health Insurance Program (CHIP). CMS recently adopted Apache Spark as its big data processing platform to ingest and analyze clinical and claims data from various sources and produce healthcare models designed to improve patients' health while reducing costs. The data come from multiple sources and contain Personally Identifiable Information (PII) and Protected Health Information (PHI), so data governance with robust security controls is a must. At the same time, the platform must serve multiple business units, each with several roles that require different levels of access to the data. This presentation covers data governance best practices, including data security, data stewardship, and data quality management, using both open source and commercial tools, based on lessons learned from the Apache Spark implementation at CMS.
Speaker: Donghwa Kim
2. Donghwa John Kim, NewWave
Apache Spark Data Governance Best Practices: Lessons Learned from Centers for Medicare and Medicaid Services
#UnifiedAnalytics #SparkAISummit
4. About NewWave
Mid-Size Business
300+ Employees
11 Prime Contracts
Support 7 CMS Centers
CMMI Level 4 for Services & Development
ISO 9001:2015
Databricks Gold Level Partner
Microsoft Gold Cloud Platform
AWS Advanced Consulting Partner
Prime Contract Vehicles
CMS SPARC – 8(a) & Small
GSA 8(a) STARS II
GSA Schedule 70 & Health IT SIN
6. Centers for Medicare & Medicaid Services (CMS)
CMS is the largest healthcare payer in the country, with a budget of $793.7B.
NewWave is its trusted partner and leading innovator.
8. Data Challenge
• 2 billion data points* annually to store, analyze and disseminate
• Privacy requirements (PHI, PII) without compromising agility
• Central view of available data on multiple systems
* Medicare data alone
9. The Objectives
The vision is to provide a simple and reliable technology and data experience for all CMS IT Portfolio stakeholders.
Center-wide shared data services
Robust data governance
Single cloud-native architecture
13. Agility - Dremio Virtual Datasets
• Built on top of the immutable physical datasets found in sources
• A layered stack of data transformations performed on top of one or more physical datasets
• Each virtual dataset is ultimately described by a SQL query
• Datasets can be chained
• Data lineage: a history of all applied transformations is available
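The layering and chaining described above can be sketched with plain SQL views. This is a minimal illustration only: it uses SQLite views from the Python standard library as a stand-in for Dremio virtual datasets, and the `claims` table, its columns, and the view names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Physical dataset: stands in for an immutable source table (hypothetical schema).
conn.execute(
    "CREATE TABLE claims (claim_id INTEGER, state TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO claims VALUES (?, ?, ?, ?)",
    [
        (1, "MD", 2018, 120.0),
        (2, "MD", 2018, 80.0),
        (3, "VA", 2018, 50.0),
        (4, "VA", 2017, 999.0),
    ],
)

# Virtual dataset 1: a SQL query layered on the physical dataset.
conn.execute(
    "CREATE VIEW claims_2018 AS "
    "SELECT claim_id, state, amount FROM claims WHERE year = 2018"
)

# Virtual dataset 2: chained on top of virtual dataset 1.
conn.execute(
    "CREATE VIEW claims_2018_by_state AS "
    "SELECT state, SUM(amount) AS total_amount FROM claims_2018 GROUP BY state"
)

totals = conn.execute(
    "SELECT state, total_amount FROM claims_2018_by_state ORDER BY state"
).fetchall()

# Lineage: each layer is described by its stored SQL definition.
lineage = dict(
    conn.execute("SELECT name, sql FROM sqlite_master WHERE type = 'view'").fetchall()
)
```

No data is copied at any layer; `lineage["claims_2018_by_state"]` holds the SQL text that references `claims_2018`, which is the sense in which the history of transformations stays inspectable.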
14. Agility - Dremio Virtual Dataset Example
15. Simplicity - SQL for [almost] EVERYTHING
• Ability to join data from multiple data sources including JSON, CSV, Parquet, relational databases and NoSQL
• Unified interface for the data
And suddenly ... SQL is sexy again!
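As a rough illustration of joining differently formatted sources through one SQL interface, the sketch below loads a small CSV extract and a JSON document into an in-memory SQLite database and joins them with a single query. This is only a stand-in: a federated engine such as Dremio would query the sources in place rather than copying them, and the sample data, table names, and columns here are hypothetical.

```python
import csv
import io
import json
import sqlite3

# Hypothetical sample sources: a CSV extract and a JSON document.
PROVIDERS_CSV = "provider_id,name\n10,Clinic A\n20,Clinic B\n"
CLAIMS_JSON = """[
  {"claim_id": 1, "provider_id": 10, "amount": 120.0},
  {"claim_id": 2, "provider_id": 20, "amount": 80.0},
  {"claim_id": 3, "provider_id": 10, "amount": 30.0}
]"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE providers (provider_id INTEGER, name TEXT)")
conn.execute("CREATE TABLE claims (claim_id INTEGER, provider_id INTEGER, amount REAL)")

# Each CSV row arrives as a dict of strings; the INTEGER column converts "10" to 10.
conn.executemany(
    "INSERT INTO providers VALUES (:provider_id, :name)",
    csv.DictReader(io.StringIO(PROVIDERS_CSV)),
)

# Each JSON record is already a dict with typed values.
conn.executemany(
    "INSERT INTO claims VALUES (:claim_id, :provider_id, :amount)",
    json.loads(CLAIMS_JSON),
)

# One SQL statement spans both sources.
rows = conn.execute(
    "SELECT p.name, SUM(c.amount) AS total "
    "FROM claims c JOIN providers p ON p.provider_id = c.provider_id "
    "GROUP BY p.name ORDER BY p.name"
).fetchall()
```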
16. Simplicity - SQL for [almost] EVERYTHING
17. Privacy - Row Level Masking
Use query_user() and is_member() for selective filtering of rows for different users or groups without having to create multiple datasets.
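A sketch of how one shared query can return different rows per user: the SQL below uses `query_user()` and `is_member()` in its WHERE clause, as named on the slide. To keep the example runnable without a Dremio server, both functions are simulated as SQLite user-defined functions; the `claims` schema, the group names, and the `visible_rows` helper are hypothetical.

```python
import sqlite3

# The SELECT that would back a single shared dataset (hypothetical schema and groups).
ROW_FILTER_SQL = """
SELECT claim_id, state
FROM claims
WHERE is_member('auditors')
   OR (is_member('md_analysts') AND state = 'MD')
   OR reviewer = query_user()
ORDER BY claim_id
"""


def visible_rows(user, groups):
    """Return the rows the given user would see through the shared query."""
    conn = sqlite3.connect(":memory:")
    # Local stand-ins for the engine-provided functions named on the slide.
    conn.create_function("query_user", 0, lambda: user)
    conn.create_function("is_member", 1, lambda group: int(group in groups))
    conn.execute("CREATE TABLE claims (claim_id INTEGER, state TEXT, reviewer TEXT)")
    conn.executemany(
        "INSERT INTO claims VALUES (?, ?, ?)",
        [(1, "MD", "bob"), (2, "VA", "carol"), (3, "MD", "carol")],
    )
    rows = conn.execute(ROW_FILTER_SQL).fetchall()
    conn.close()
    return rows
```

For example, `visible_rows("dave", {"md_analysts"})` returns only the Maryland claims, while a member of `auditors` sees every row, all from the same query text.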
20. Centralized View - Data Catalog
• Ability to search for the data
• Collaboration experience using Wiki and content tagging
22. Looker’s LookML = “SQL Evolved”
LookML is a language for describing dimensions, aggregates, calculations and data relationships in a SQL database.
25. Data Modeling with Looker
SQL models generated by Looker from LookML can be exported into Dremio to create virtual datasets.
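The general idea of generating SQL from a declarative model of dimensions and measures can be sketched as follows. This is not LookML syntax and not Looker's generator; the `MODEL` dictionary, its field names, and `generate_sql` are hypothetical and only illustrate the kind of SELECT text that could then define a virtual dataset.

```python
# Hypothetical, simplified model: dimensions and measures mapped to SQL expressions.
MODEL = {
    "table": "claims",
    "dimensions": {"state": "state", "service_year": "year"},
    "measures": {"total_amount": "SUM(amount)", "claim_count": "COUNT(*)"},
}


def generate_sql(model, dimensions, measures):
    """Build a SELECT statement for the requested dimensions and measures."""
    select = [f"{model['dimensions'][d]} AS {d}" for d in dimensions]
    select += [f"{model['measures'][m]} AS {m}" for m in measures]
    sql = f"SELECT {', '.join(select)} FROM {model['table']}"
    # Aggregates need a GROUP BY over every selected dimension.
    if dimensions and measures:
        group_by = ", ".join(model["dimensions"][d] for d in dimensions)
        sql += f" GROUP BY {group_by}"
    return sql


sql = generate_sql(MODEL, ["state"], ["total_amount"])
```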
26. Accessing Dremio from Databricks
• Add the Dremio JDBC driver JAR to Databricks
27. Accessing Dremio from Databricks
[Screenshot callouts: Parallelism level; Driver; Virtual Dataset; "Use it!" *]
* https://docs.databricks.com/user-guide/secrets/example-secret-workflow.html
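A minimal sketch of the read the callouts above point at, assuming Spark's generic JDBC data source with the Dremio driver. The helper below only assembles the option dictionary so it can run anywhere; the host, dataset path, partition column, bounds, and secret scope and key names are all hypothetical, and the Spark and `dbutils` calls are shown as comments because they exist only inside a Databricks notebook.

```python
def dremio_jdbc_options(host, table, user, password,
                        partition_column, lower_bound, upper_bound,
                        num_partitions, port=31010):
    """Assemble Spark JDBC data source options for reading a Dremio dataset."""
    return {
        "url": f"jdbc:dremio:direct={host}:{port}",
        "driver": "com.dremio.jdbc.Driver",       # class provided by the Dremio JDBC JAR
        "dbtable": table,                         # the virtual dataset to read
        "user": user,
        "password": password,
        # Parallelism level: Spark splits the read into numPartitions range queries
        # over partitionColumn between lowerBound and upperBound.
        "partitionColumn": partition_column,
        "lowerBound": str(lower_bound),
        "upperBound": str(upper_bound),
        "numPartitions": str(num_partitions),
    }


opts = dremio_jdbc_options(
    host="dremio.example.internal",               # hypothetical host
    table="analytics.claims_2018",                # hypothetical virtual dataset
    user="svc_spark",                             # hypothetical service account
    password="<read from a secret scope>",        # never hard-code real credentials
    partition_column="claim_id",
    lower_bound=1,
    upper_bound=1_000_000,
    num_partitions=8,
)

# In a Databricks notebook, where `spark` and `dbutils` are predefined:
#   password = dbutils.secrets.get(scope="dremio", key="password")  # hypothetical names
#   df = spark.read.format("jdbc").options(**opts).load()
```

Keeping the password in a secret scope, rather than in notebook source, is the practice the footnoted secrets workflow describes.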
30. Incremental Approach Meets Client Needs
Focusing on structured and semi-structured data
• Unstructured data (e.g. medical records) will be supported in the near future
Human-centered design (HCD) approach
• Architecture and process will evolve as we interact more with end users and collect feedback
Tools evolving to meet our needs
• Dremio-Snowflake integration
• Native Apache Spark (Arrow Flight RPC) and SAS Connectors
• Availability of other services in Microsoft Azure Government