Data governance teams typically apply manual controls at various points to ensure data consistency and quality. By thinking of our machine learning data pipelines as compilers that convert data into executable functions, and by leveraging data version control, data governance and engineering teams can engineer the data together: filing bugs against data versions, applying quality-control checks to the data compilers, and more. This talk illustrates how these innovations are poised to drive process and cultural changes in data governance, leading to order-of-magnitude improvements.
4. Data Governance Guideposts haven't changed
Data Quality, Data Accessibility, Data Security, Compliance, Availability
5. Traditional Data Governance is heavily dependent on human intervention, creating business decision bottlenecks
[Org chart, shown unchanged as "Traditional Approach in 2014" and "Traditional Approach in 2020": a Chief Data Officer and Data Governance Council direct Lead Data Stewards and Key Business Unit Leads, who in turn drive Data Project Groups, Data Custodians, and Data Stewards, with issues, guidance, and initiatives flowing between the layers.]
Recreated: http://datagovernanceaus.com.au/data-governance-what-is-it/
[Chart: Data Governance Headcount, Meetings & Quality Over Time, including a Data Quality series.]
11. All modern software engineering builds on these fundamental constructs:
1. Higher-level languages (Scala, Python)
2. Automated Unit & Integration Tests
3. Static Analysis
4. Continuous Integration
5. Refactoring
12. DevOps: Bring Compilers and Source Control to Infrastructure
• The ability to compile input code to executable outputs
• Version control systems to keep track of the input code
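As a minimal sketch (not from the talk) of what this looks like in practice, infrastructure can be expressed as ordinary code that "compiles" to a CloudFormation template, and the input code lives in version control like any other source file:

    import json

    def s3_bucket(logical_id: str, bucket_name: str) -> dict:
        # One resource in CloudFormation's standard Type/Properties shape.
        return {logical_id: {"Type": "AWS::S3::Bucket",
                             "Properties": {"BucketName": bucket_name}}}

    def compile_template(resources: dict) -> str:
        """'Compile' the declared infrastructure into a deployable template."""
        return json.dumps(
            {"AWSTemplateFormatVersion": "2010-09-09", "Resources": resources},
            indent=2,
        )

    # The input (this file) goes in source control; the output is what gets deployed.
    print(compile_template(s3_bucket("DataLakeBucket", "example-data-lake")))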
13. Let's Look at DevOps
The same cycle of incremental improvement has repeated itself in DevOps, with cloud-native approaches and continuous deployment becoming the norm.
15. DataOps: Bring Compilers and Source Control to the World of Data
• The ability to compile input code to executable outputs
• Version control systems to keep track of the input code
If the data is the source code, and the resulting operation is the executable output, then the pipeline is the compiler.
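A minimal sketch of the analogy in Python (the toy threshold rule is illustrative, standing in for real model training): the pipeline "compiles" versioned data into an executable function.

    from typing import Callable, Iterable, Tuple

    def compile_pipeline(examples: Iterable[Tuple[float, bool]]) -> Callable[[float], bool]:
        # The "source code" is the data; this pipeline is the "compiler";
        # the returned callable is the "executable output".
        examples = list(examples)
        positives = [x for x, label in examples if label]
        negatives = [x for x, label in examples if not label]
        # Learn a decision threshold halfway between the class means.
        threshold = (sum(positives) / len(positives) +
                     sum(negatives) / len(negatives)) / 2
        return lambda x: x >= threshold

    # Recompiling with a new data version yields a new executable output,
    # so data versions can be governed just like source code versions.
    predict = compile_pipeline([(1.0, False), (2.0, False), (8.0, True), (9.0, True)])
    assert predict(7.5) and not predict(1.5)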
16. We still don't really understand how data writes code
• This is why we have data scientists experiment to figure out the logic
• Data engineers come in later to build the optimizers
18. Define Everything as Code to reduce risk, increase quality, and build trust
Access & Privacy: Defines the requirements to access the data outputs produced by each pipeline stage.
Dependencies: Defines all libraries that this component depends on to execute and test, without actually including the libraries in SCM.
Pipeline Code: The functional code for this component. It should be separated so that pure business logic lives in a library and platform-specific code calls the lib.
Cloud Environments: CloudFormation templates define the infrastructure that will be created to deploy this component. Data cloning or test data management provides the datasets that enable testing.
Logic Tests: Test code to ensure proper function of the business logic. These capture edge cases that may not be in the real data.
Deploy Pipeline: Jenkinsfile definitions include the pipeline of build steps required to successfully get this component into production.
Data Tests: Test code that ensures input data is correct and outputs are properly configured.
Everything as Code: by ensuring that every aspect of developing analytics solutions is captured and tracked as code, it becomes much clearer which change introduced a failure.
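As a rough illustration of the Logic Tests and Data Tests components (a pytest-style sketch; enrich_transactions, the column names, and the contract are hypothetical):

    import pandas as pd
    import pytest

    # Stand-in for the pure business logic that would live in its own
    # library; the real function and its behavior are assumptions here.
    def enrich_transactions(frame: pd.DataFrame) -> pd.DataFrame:
        out = frame.copy()
        out["is_credit"] = out["amount"] > 0
        return out

    @pytest.fixture
    def raw_transactions() -> pd.DataFrame:
        # Data cloning / test data management would supply this in practice.
        return pd.DataFrame({
            "account_id": ["a1", "a2"],
            "amount": [10.0, -3.5],
            "timestamp": pd.to_datetime(["2020-01-01", "2020-01-02"]),
        })

    def test_logic_edge_case_zero_amount():
        """Logic Test: an edge case that may not appear in the real data."""
        result = enrich_transactions(pd.DataFrame({"account_id": ["a1"], "amount": [0.0]}))
        assert not result["is_credit"].iloc[0]

    def test_input_data_contract(raw_transactions):
        """Data Test: the input data matches the expected contract."""
        assert {"account_id", "amount", "timestamp"} <= set(raw_transactions.columns)
        assert raw_transactions["account_id"].notna().all()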
21. [Architecture diagram: a Production Data Platform and a mirrored Development Data Platform, each built as a Data Lake plus a Data Pipeline running Ingest → Diff → Model → Enhance → Transform across Raw, Modeled, Enhanced, Products, and Data Mart zones. Production ingests banking data and Bloomberg / Dow Jones feeds; development runs the same pipeline against test banking data, test Bloomberg / Dow Jones data, and test data for NYSE, with a test gate at every stage and a prototype ingest step. Both platforms feed BI, real-time, AI, and dashboard consumers. Guidance on the slide: design right-to-left, define access policies, and test end-to-end.]
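A hedged sketch of "define access policies" as code (the stage names follow the diagram's data zones; the roles and policy shape are assumptions for illustration):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class AccessPolicy:
        stage: str            # pipeline stage whose outputs this policy governs
        allowed_roles: tuple  # roles permitted to read the stage's outputs
        pii_allowed: bool     # whether personally identifiable data may appear

    # Declarative, version-controlled policies for each data zone.
    POLICIES = [
        AccessPolicy("raw", ("data_engineer",), pii_allowed=True),
        AccessPolicy("modeled", ("data_engineer", "data_scientist"), pii_allowed=True),
        AccessPolicy("enhanced", ("data_scientist", "analyst"), pii_allowed=False),
        AccessPolicy("products", ("analyst", "business_user"), pii_allowed=False),
    ]

    def can_read(role: str, stage: str) -> bool:
        """Evaluate a policy exactly like any other tested, reviewed code path."""
        return any(p.stage == stage and role in p.allowed_roles for p in POLICIES)

    assert can_read("analyst", "products")
    assert not can_read("business_user", "raw")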
26. As you mature, you will be able to take on more complexity
[Maturity diagram spanning Roles/People, Organization, and Technology and Infrastructure: layers labeled Foundation, Modern Tooling, Platform, Processes, Modern Data Management, DataOps, and the Data Forge Framework, with higher levels of data maturity yielding differentiated business value driven from your data.]
28. A similar process will play out with innovations built on the foundation of DataOps:
1. Data environment management
2. Access & Privacy as Code
3. Test management
4. Continuous Deployment & Compliance
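As one hypothetical example of Continuous Deployment & Compliance, a CI step could fail the build whenever a pipeline stage ships without a declared access policy (stage and policy sources are illustrative; in practice both would be parsed from the repository):

    # Hypothetical CI compliance gate, run on every commit.
    PIPELINE_STAGES = {"ingest", "model", "enhance", "transform"}
    DECLARED_POLICIES = {"ingest", "model", "enhance"}  # e.g. parsed from policy files

    def check_compliance() -> None:
        missing = PIPELINE_STAGES - DECLARED_POLICIES
        if missing:
            raise SystemExit(f"Compliance check failed: no access policy for {sorted(missing)}")

    if __name__ == "__main__":
        check_compliance()  # a CI runner exits nonzero, blocking the deploy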