A single-slide image of the data lake and data estate proposed to serve the data storage and access requirements of the Washington State Department of Health
Pfizer's HR division runs a massive data warehouse that averages over 1 million queries per day from over 3,000 users. They needed to cut data costs and mitigate regulatory and audit risks.
See how they met their goals, and reduced extract, transform and load (ETL) times, using Appfluent!
Seven Ways DOS™ Simplifies the Complexities of Healthcare IT - Health Catalyst
Health Catalyst Data Operating System (DOS™) is a revolutionary architecture that addresses the digital and data problems confronting healthcare now and in the future. It is an analytics galaxy that encompasses data platforms, machine learning, analytics applications, and the fabric to stitch all these components together.
DOS addresses these seven critical areas of healthcare IT:
Healthcare data management and acquisition
Integrating data in mergers and acquisitions
Enabling a personal health record
Scaling existing, homegrown data warehouses
Ingesting the human health data ecosystem
Providers becoming payers
Extending the life and current value of EHR investments
This white paper illustrates these healthcare system needs in detail and explains the attributes of DOS. Read how DOS is the right technology for tackling healthcare’s big issues, including big data, physician burnout, rising healthcare expenses, and the productivity backfire created by other healthcare technologies.
It is indeed boom time for Big Data in healthcare. According to CB Insights, Big Data startups garnered USD 400M in investor funding in the first half of 2014, compared with USD 133M in the whole of 2013.
Baptist Health: Solving Healthcare Problems with Big Data - MapR Technologies
Editor’s Note: Download the complimentary MapR Guide to Big Data in Healthcare for more information: https://mapr.com/mapr-guide-big-data-healthcare/
There is no better example of the important role that data plays in our lives than in matters of our health and our healthcare. There’s a growing wealth of health-related data out there, and it’s playing an increasing role in improving patient care, population health, and healthcare economics.
Join this webinar to hear how Baptist Health is using big data and advanced analytics to address a myriad of healthcare challenges, from patient to payer, through their consumer-centric approach.
MapR Technologies will cover broader big data healthcare trends and production use cases that demonstrate how to converge data and compute power to deliver data-driven healthcare applications.
These slides use concepts from my (Jeff Funk) course, Analyzing Hi-Tech Opportunities, to analyze how Big Data is becoming economically feasible for health care. They describe how the costs of sensors, data processing, data storage, and data analysis are falling, how new and better forms of storage and algorithms are being implemented, and what this means for sustainable health care. These changes are enabling a move toward personalized health care.
This presentation is a reflection of my vision of how Big Data impacts healthcare and of the efforts that Oracle and VX Healthcare Analytics put into making Big Data work in the patient profiling space.
BIG Data & Hadoop Applications in Healthcare - Skillspeed
Explore the applications of BIG Data & Hadoop in Healthcare via Skillspeed.
BIG Data & Hadoop in Healthcare is a key differentiator, especially in terms of providing superior patient care. They are used for optimizing clinical trials, detecting disease, and boosting healthcare profitability.
To get more details regarding BIG Data & Hadoop, please visit - www.SkillSpeed.com
A brief tutorial on Big Data and its applications to healthcare. The discussion is centered around technical aspects related to this method of computing rather than concrete examples of its use in medical practice.
This presentation looks at the role of Big Data in healthcare. Healthcare is a big spending area for both the private and public sectors, so it is important to look at ways to improve the delivery of care to patients.
Big Data in Healthcare Made Simple: Where It Stands Today and Where It’s Going - Health Catalyst
Health system leaders have questions about big data: When will I need it? How should I prepare? What’s the best way to use it? It’s important to separate the hype of big data from the reality. Where big data stands in healthcare today is a far cry from where it will be in the future. Right now, the best use cases are in academic- or research-focused healthcare institutions. Most healthcare organizations are still tackling issues with their transactional databases and learning how to use those databases effectively. But soon—once the issues of expertise and security have been addressed—big data will play a huge role in care management, predictive analytics, prescriptive analytics, and genomics for everyday patients. The transition to big data will be easier if health systems adopt a late-binding approach to the data now.
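To make the late-binding idea concrete, here is a minimal, hypothetical Python sketch (the data and rule set are invented): raw source values are stored unchanged, and business rules are bound only when the data is used, so a rule change never forces a reload.

```python
# Hypothetical illustration of "late binding": raw values are stored as-is,
# and business-rule vocabularies are applied only when data is consumed.

RAW_ENCOUNTERS = [
    {"patient_id": 1, "source_code": "E11.9"},   # stored exactly as received
    {"patient_id": 2, "source_code": "I10"},
]

# The binding (rule set) lives outside the stored data and can change freely.
DIABETES_CODES = {"E11.9", "E10.9"}

def bind_cohort(rows, code_set):
    """Apply a business rule at read time instead of load time."""
    return [r["patient_id"] for r in rows if r["source_code"] in code_set]

print(bind_cohort(RAW_ENCOUNTERS, DIABETES_CODES))  # -> [1]
```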
A must-see for every executive who has a Big Data Hadoop cluster, and for their staff: getting your big data house in order.
Misalignment and clutter waste much of the precious time needed for critical decisions.
Carl Kesselman and I (along with our colleagues Stephan Erberich, Jonathan Silverstein, and Steve Tuecke) participated in an interesting workshop at the Institute of Medicine on July 14, 2009. Along with Patrick Soon-Shiong, we presented our views on how grid technologies can help address the challenges inherent in healthcare data integration.
Standards metadata management - version control and its governance - Kevin Lee
Over the past decade, CDISC standards have been widely accepted and implemented in clinical research. The FDA’s final “Guidance for Industry on electronic submission” mandates that submission data conform to CDISC standards, including SDTM, ADaM, and SEND. Life sciences organizations therefore need to ensure that submission data comply with regulatory standards (e.g., CDISC and eCTD). One of the biggest challenges organizations face, however, is the evolution of standards, which leads to multiple versions of each standard. The presentation will discuss how organizations manage the different versions of industry standards and company standards, and will introduce governance for metadata management.
Standards governance simply means “do the right things” in standards implementation and management. The presentation will discuss how life sciences organizations can better fulfill their goals for standards implementation and management using governance. It will also discuss the main aspects of data governance from the CDISC standards perspective, addressing the roles of people (e.g., requestor, developer, and approver), processes (e.g., the workflow of requesting, developing, and approving), and technology (e.g., spreadsheets, SharePoint, and an MDR).
Metadata becomes alive via a web service between MDR and SAS - Kevin Lee
Life science organizations use a Metadata Repository (MDR) to manage and govern metadata, which can be used to generate artifacts such as SDTM, ADaM, and TFLs. To develop those artifacts, the metadata in the MDR needs to be accessed by analytic systems such as SAS. The presentation will show how an MDR and SAS can exchange metadata over the internet. It will introduce the basic concepts of a web service and the Simple Object Access Protocol (SOAP), show how SAS can receive metadata from the MDR using SOAP and convert the SOAP XML response file to SAS datasets, and introduce the SOAP XML request/response files, the SOAPWEB function, and XMLMAP.
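The presentation itself works in SAS; as a language-neutral sketch of the same exchange, the following Python snippet posts a SOAP request and walks the XML response. The endpoint, namespace, and element names are invented placeholders, not the real MDR interface.

```python
# Hypothetical sketch of a SOAP metadata request to an MDR endpoint.
# URL, namespace, and element names are placeholders, not a real MDR API.
import requests
import xml.etree.ElementTree as ET

SOAP_BODY = """<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <GetMetadata xmlns="http://example.org/mdr">
      <Standard>SDTM</Standard>
      <Version>1.4</Version>
    </GetMetadata>
  </soap:Body>
</soap:Envelope>"""

resp = requests.post(
    "https://mdr.example.org/service",            # placeholder endpoint
    data=SOAP_BODY,
    headers={"Content-Type": "text/xml; charset=utf-8"},
    timeout=30,
)
resp.raise_for_status()

# Walk the response envelope and pull out metadata elements, the same step
# SAS performs with an XMLMAP when reading the SOAP XML response.
root = ET.fromstring(resp.text)
for var in root.iter("{http://example.org/mdr}Variable"):
    print(var.attrib)
```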
Achieving Medical Imaging Interoperability with PACS and RIS Integrations - Chetu
Medical images can be difficult to access; however, the adoption of technology solutions, industry standards, and APIs is improving imaging interoperability.
https://www.slideshare.net/chetuInc/why-dicom-matters-for-your-ehr-and-medical-imaging-apps
Data centric SDLC for automated clinical data development - Kevin Lee
Many life science organizations have been building systems to automate clinical data development (e.g., SDTM and ADaM). Such systems are considered IT products and go through a typical system development life cycle (SDLC): requirements, analysis, design, programming, testing, and implementation. However, the SDLC was initially developed for systems that automate business processes, not data development. So the question naturally arises: if life science organizations develop systems to automate data development, should those systems still be developed with a process-centric SDLC, and will the current process-centric SDLC satisfy the business need? The presentation will introduce a data-centric SDLC. First, it will discuss how some steps of the typical process-centric SDLC should be modified and adjusted in a data-centric SDLC; for example, testing the system requires quality assurance of the target data, and, due to the unpredictability of source data, maintenance and system updates will be required after implementation. Second, it will introduce additional steps and approaches for a data-centric system development process, such as data profiling and compliance.
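As a rough illustration of the data-profiling step such a data-centric SDLC adds, here is a small self-contained pandas sketch; the dataset and checks are invented, with USUBJID standing in for a CDISC-style subject identifier.

```python
# Minimal data-profiling pass of the kind a data-centric SDLC runs
# before and after each load. The dataset here is invented.
import pandas as pd

df = pd.DataFrame({
    "USUBJID": ["001-001", "001-002", "001-003"],
    "AGE": [34, None, 57],
    "ARM": ["Placebo", "Active", "Active"],
})

# Profile every column: type, missingness, and cardinality.
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "nulls": df.isna().sum(),
    "null_pct": (df.isna().mean() * 100).round(2),
    "distinct": df.nunique(),
})
print(profile)

# Simple compliance-style assertions on the target data.
assert df["USUBJID"].notna().all(), "Missing subject identifiers"
assert df["USUBJID"].is_unique, "Duplicate subject identifiers"
```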
Whitepaper: The Bridge From PACS to VNA: Scale-Out Storage - EMC
This whitepaper discusses how a vendor-neutral archive (VNA) for image archive and management requires a phased storage approach due to the capital and operational expenditures involved. The EMC Isilon scale-out approach provides a simple, predictable, and manageable path from PACS (Picture Archiving and Communications System) to VNA.
Kaizentric is a data analytics firm based in Chennai, India. Statistical analysis is performed on a well-built, client-specific data warehouse, supported by data mining.
Automatic Data Reconciliation, Data Quality, and Data Observability.pdf - 4dalert
Automating Data Reconciliation, Data Observability, and Data Quality Check After Each Data Load, read more: https://medium.com/@nihar.rout_analytics/automatic-data-reconciliation-data-quality-and-data-observability-3eeca4650cd
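In the spirit of the linked article, a minimal post-load reconciliation check can compare row counts and a numeric checksum between source and target. This self-contained sketch uses SQLite in place of real warehouse connections, and the table names are invented.

```python
# Sketch of an automatic post-load reconciliation check: row count and a
# numeric checksum must match between source and target tables.
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real source/target connections
conn.executescript("""
    CREATE TABLE staging_orders (id INTEGER, amount REAL);
    CREATE TABLE dw_orders      (id INTEGER, amount REAL);
    INSERT INTO staging_orders VALUES (1, 10.0), (2, 32.5);
    INSERT INTO dw_orders      VALUES (1, 10.0), (2, 32.5);
""")

def recon(conn, source_table, target_table, amount_col):
    """Compare row count and a SUM() checksum between two tables."""
    checks = {}
    for label, table in (("source", source_table), ("target", target_table)):
        row = conn.execute(
            f"SELECT COUNT(*), COALESCE(SUM({amount_col}), 0) FROM {table}"
        ).fetchone()
        checks[label] = row
    ok = checks["source"] == checks["target"]
    print(f"source={checks['source']} target={checks['target']} match={ok}")
    return ok

if not recon(conn, "staging_orders", "dw_orders", "amount"):
    raise SystemExit("Reconciliation failed: halt downstream jobs")
```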
This is part 3 of the series on Data Mesh, looking at the intersection of microservices architecture concepts, data integration/replication technologies, and log-based stream integration techniques. This webinar was mostly a demonstration, but several slides used to set up the demo are included here as a PDF for viewers.
BDaaS (Big Data as a Service) by Sherya Pal from Saama. The presentation was given at the #doppa17 DevOps++ Global Summit 2017. All copyrights are reserved with the author.
BDW Chicago 2016 - Ramu Kalvakuntla, Sr. Principal - Technical - Big Data Pra... - Big Data Week
We are all aware of the challenges enterprises are having with growing data and siloed data stores. The business cannot make reliable decisions with untrusted data, and on top of that, it does not have access to all the data within and outside the enterprise needed to stay ahead of the competition and make key business decisions.
This session will take a deep dive into the challenges businesses have today and how to build a modern data architecture using emerging technologies such as Hadoop, Spark, NoSQL data stores, MPP data stores, and scalable, cost-effective cloud solutions such as AWS, Azure, and Bigstep.
Self Service Analytics and a Modern Data Architecture with Data Virtualizatio... - Denodo
Watch full webinar here: https://bit.ly/32TT2Uu
Data virtualization is not just for self-service; it’s also a first-class citizen when it comes to modern data platform architectures. Technology has forced many businesses to rethink their delivery models. Startups like Amazon and Lyft emerged, leveraging the internet and mobile technology to better meet customer needs, disrupting entire categories of business and growing to dominate their categories.
Schedule a complimentary Data Virtualization Discovery Session with g2o.
Traditional companies are still struggling to meet rising customer expectations. During this webinar with the experts from g2o and Denodo, we covered the following:
- How modern data platforms enable businesses to address these new customer expectations
- How you can drive value from your investment in a data platform now
- How you can use data virtualization to enable multi-cloud strategies
Leveraging the strategy insights of g2o and the power of the Denodo platform, companies do not need to undergo the costly removal and replacement of legacy systems to modernize their systems. g2o and Denodo can provide a strategy to create a modern data architecture within a company’s existing infrastructure.
Data and Analytics at Holland & Barrett: Building a '3-Michelin-star' Data Pl... - Dobo Radichkov
This presentation, delivered at the AWS London Summit 2023, provides an in-depth look at how Holland & Barrett built a robust, high-performing data platform on AWS to drive insights at the speed of thought. Dobo Radichkov, Chief Data Officer, shares key aspects of the data strategy, outlining how the company utilised AWS Redshift, Metabase, and Retool to create an efficient data lake, data warehouse, and analytics layer. The presentation also discusses the transformative impact of this data infrastructure on various business areas, including Finance, Commercial, Supply Chain, Customer, Digital, and Wellness. Through this data-driven journey, Holland & Barrett aims to become the beating heart of the organization, unlocking success for colleagues, customers, and partners alike.
In the presentation, Dobo Radichkov lays out Holland & Barrett's vision to make their Data & Analytics team the heartbeat of the organization, a vision that has guided their strategy and tool selection. He explains how this vision is brought to life through their organizational structure, comprising six specialized teams: Data Engineering, Data Warehouse, Business Intelligence, Data Science, Web & App Analytics, and Digital Analytics.
Dobo takes the audience through the company's strategic roadmap, a three-phase plan guiding the growth and development of their data capabilities. This roadmap isn’t just a technological plan but signifies a transformational journey for the team, aiming to embed data-driven decision-making in the DNA of Holland & Barrett.
Lastly, he showcases the '3-Michelin-star' data platform's architecture, painting a clear picture of how data moves from raw systems to the operational master data and, finally, to the analytics layer. The presentation concludes by highlighting how the newly formed data platform drives core business value and innovation across various business domains, reinforcing Holland & Barrett's commitment to becoming a data-led organization.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will present related topics such as vector databases, LLMs, and managing data at scale. The intended audience includes machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup and is sponsored by Zilliz, maintainers of Milvus.
The Building Blocks of QuestDB, a Time Series Database - javier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps as just another data type. However, when performing real-time analytics, timestamps should be first-class citizens, and we need rich time semantics to get the most out of our data. We also need to deal with ever-growing datasets while staying performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open-source time-series database designed for speed. We will also review some of the changes we have made over the past two years to deal with late and unordered data, non-blocking writes, read replicas, and faster batch ingestion.
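To make the talk's themes concrete, here is a small hedged sketch using QuestDB's official Python client and its REST /exec endpoint. It assumes a local QuestDB on the default ports, and the trades table and its columns are invented for illustration.

```python
# Sketch: write possibly late/unordered rows to QuestDB, then run a
# time-bucketed SAMPLE BY query. Assumes a local QuestDB on default ports.
import requests
from questdb.ingress import Sender, TimestampNanos  # pip install questdb

with Sender.from_conf("http::addr=localhost:9000;") as sender:
    # QuestDB reorders late-arriving rows on the designated timestamp column.
    sender.row("trades",
               symbols={"pair": "BTC-USD"},
               columns={"price": 67123.5},
               at=TimestampNanos.now())
    sender.flush()

# Aggregate over the designated timestamp via the REST API.
r = requests.get("http://localhost:9000/exec",
                 params={"query": "SELECT timestamp, avg(price) "
                                  "FROM trades SAMPLE BY 1m"})
print(r.json())
```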
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag... - sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
Analysis insight about a Flyball dog competition team's performance - roli9797
Insights from my analysis of a Flyball dog competition team's performance last year. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
Learn SQL from basic queries to advanced queries - manishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
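To follow along with a guide like the one above without any database setup, Python's built-in SQLite module is enough to practice retrieval, filtering, and aggregation; the tiny sales table below is invented for illustration.

```python
# Practice the guide's basics (retrieval, filtering, aggregation) using only
# the standard library; the small sales table is invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, product TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('North', 'Widget', 120.0),
        ('North', 'Gadget',  80.0),
        ('South', 'Widget', 250.0);
""")

# Aggregation: revenue by region, largest first.
for row in conn.execute("""
        SELECT region, SUM(amount) AS revenue
        FROM sales
        GROUP BY region
        ORDER BY revenue DESC"""):
    print(row)   # ('South', 250.0) then ('North', 200.0)
```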
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf - Enterprise Wired
In this guide, we'll explore the key considerations and features to look for when choosing a trusted analytics platform that meets your organization's needs and delivers actionable intelligence you can trust.
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf - GetInData
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built a robust Data Copilot on these three concepts, one that can help democratize access to company data assets and boost the performance of everyone working with data platforms. A minimal sketch of the retrieval step appears after the outline below.
Why do we need yet another (open-source) Copilot?
How can we build one?
Architecture and evaluation
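As a bare-bones sketch of the retrieval step in such a RAG setup (assuming the sentence-transformers library; the schema snippets and model choice are illustrative only):

```python
# Minimal retrieval step of a RAG data copilot: embed schema docs, then pick
# the most relevant ones to put into the LLM prompt. All content is made up.
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers
import numpy as np

docs = [
    "Table orders(order_id, customer_id, amount, created_at)",
    "Table customers(customer_id, name, segment)",
    "Table shipments(order_id, carrier, shipped_at)",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = model.encode(docs, normalize_embeddings=True)

question = "What is total order value per customer segment?"
q_vec = model.encode([question], normalize_embeddings=True)[0]

# Cosine similarity (vectors are normalized, so a dot product suffices).
scores = doc_vecs @ q_vec
top = np.argsort(scores)[::-1][:2]
context = "\n".join(docs[i] for i in top)
print(context)  # schema context to prepend to the SQL-generation prompt
```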
Global Situational Awareness of A.I. and where it's headed - vikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, AI, big data, real-time, robots, and Milvus.
A lively discussion with NJ Gen AI Meetup Lead Prasad and Procure.FYI's Co-Founder.
2. CEDAR Data Estate
[Single-slide diagram: source systems, each marked RAW (CDC, CHARS, WAIIS/IMMS, CREST, REDCap), feed the CEDAR data lake over an Azure pipeline. Lake folders: 01_RAW (RAW), 02_USABLE_PREP (COOKED), 03_USABLE, 04_USEABLE_OUTPUT, 05_OUTPUT, and 06_SANDBOX, plus PARQUET_SRC (COOKED) and CEDAR (SERVED). Azure Share feeds the consumer units (Data Sciences Support Unit, Data Analysis Unit, Data Sciences Unit, Data Audit Unit) and their environments, labeled LAUREL, MADRONA, DEV, TEST, STAGE, PROD, RAW (RAW), AUDIT, and RPT_OUT. Legend: RAW = identical to source; COOKED = cleaned and conformed in common parquet files; SERVED = business-rule driven under governance teams.]
3. CEDAR Data Estate: Description and Notes
Data sources on the left side of the diagram are typically drawn from delimited text or relational databases and are accessed via API or direct network connection using Azure Data Factory pipelines.
The CEDAR data lake is focused on extract and load activities, with only enough transformation to fulfill the Kimball standards for cleaning and conforming. Consumers can receive data as raw, cooked, or served.
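To illustrate that extract-and-load posture, here is a minimal, hypothetical sketch of a raw-to-cooked hop in pandas; the folder names echo the zones in the diagram, but the source file and columns are invented, not the actual CEDAR schema.

```python
# Hypothetical RAW -> COOKED step: read a delimited extract as-is, apply only
# light cleaning/conforming, and land it as a common parquet file.
import pandas as pd

raw = pd.read_csv("01_RAW/chars/extract.txt", sep="|", dtype=str)

cooked = (
    raw.rename(columns=str.lower)                       # conform column names
       .assign(event_date=lambda d: pd.to_datetime(d["event_date"],
                                                   errors="coerce"))
       .drop_duplicates()
)

# Common parquet layout shared by all consumer units (requires pyarrow).
cooked.to_parquet("02_USABLE_PREP/chars/extract.parquet", index=False)
```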
CEDAR is a “hub” data lake and as such is optimized for reads from a hierarchical Azure Data Lake Gen 2 data store.
Each of the client data consumer units receives a read-only “spoke” from CEDAR that is implemented using Azure Share.
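As a hedged illustration of how a consumer unit might read from such a spoke, here is a sketch using the Azure Data Lake Storage Gen 2 Python SDK; the account URL, file system, and file path are placeholders, not the actual CEDAR layout.

```python
# Sketch of a consumer unit reading one served file from its read-only spoke;
# the account, file system, and path below are placeholders.
from azure.identity import DefaultAzureCredential              # pip install azure-identity
from azure.storage.filedatalake import DataLakeServiceClient   # pip install azure-storage-file-datalake

service = DataLakeServiceClient(
    account_url="https://exampleaccount.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
fs = service.get_file_system_client("cedar-share")
file = fs.get_file_client("served/vaccinations/part-0000.parquet")

with open("part-0000.parquet", "wb") as out:
    out.write(file.download_file().readall())
```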
The Data Sciences Support Unit acts as CEDAR’s “first best customer” and serves as a center for standards and best practices. DSSU maintains a code repository implemented in GitHub for the benefit of all the CEDAR data consumer units.
Client data consumers vary in requirements and implementation. Each is envisioned (though not required) to be built as a compute-optimized data lake that adds value by using both local and shared data to create data products that are composed of data science experiments and machine learning models, traditional data analytics, healthcare-driven insights, and various public and private dashboards.
Third-party public data consumers like local healthcare authorities, hospitals, clinics, and autonomous indigenous healthcare organizations would access the products our data consumers produce via secure API and Azure Identity Governance-derived accounts.
Content in the CEDAR data lake’s “served” folders is anticipated to carry the following additional attributes:
- Data is organized by business fact subject groups, like vaccinations, investigations, hospitalizations, and such, rather than “siloed” by individual budgetary units.
- Data is cleaned and conformed using Kimball standards and best practices for “Data Mart” units.
- Governance is crucial to the “served” folders and is anticipated to be chaired at the level of the office of technology innovation, with stakeholders from the budgetary units who contributed data and who thereby assisted in breaking down the budgetary “silos”.