RWDG Slides: How to Govern Data Lakes

Copyright © 2019 Robert S. Seiner – KIK Consulting & Educational Services / TDAN.com
Non-Invasive Data Governance™ is a trademark of Robert S. Seiner & KIK Consulting
#RWDG @RSeiner
How to Govern Data Lakes
with Special Guest Evan Terry
Monthly Webinar Series Hosted by DATAVERSITY
Robert S. Seiner – KIK Consulting / TDAN.com
July 18, 2019 – 11:00 a.m. PT / 2:00 p.m. ET
Real-World Data Governance

Unified Data Orchestration
Madan Kumar | Solutions Engineer | Alluxio
madan@alluxio.com

4 big trends driving the need for a new architecture
• Separation of compute & storage
• Hybrid – multi-cloud environments
• Self-service data across the enterprise
• Rise of the object store

[Figure: two panels, "Data Ecosystem - Beta" and "Data Ecosystem 1.0", each showing COMPUTE and STORAGE layers]

Data Orchestration Framework
Interfaces: Java File API, HDFS Interface, S3 Interface, REST API, FUSE Interface
Drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver

Alluxio’s Approach to Big Data Federation
• Unified Access - Acts as a “virtual data lake.” Files are accessed in Alluxio’s global namespace as if they resided in a single system
• Performant - Provides fast local access to important and frequently used data, without maintaining a permanent copy of all data
• Modern, flexible architecture - Promotes separation of compute from storage
• Storage Cost Optimization - Transparently reads and writes data directly from the source system, and so does not need to create a permanent copy of the data
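
The "virtual data lake" idea above can be illustrated with a small sketch of a mount table that maps one namespace onto several storage systems. This is not Alluxio's API: the `MOUNTS` table, the storage URIs, and the `resolve` function are all hypothetical, and the sketch assumes a root mount ("/") is always present.

```python
from typing import Dict, Tuple

# Hypothetical mount table: namespace prefix -> backing storage URI.
# These names are illustrative only and are not Alluxio's API.
MOUNTS: Dict[str, str] = {
    "/": "hdfs://onprem-cluster/lake",
    "/sales": "s3://example-sales-bucket/data",
    "/logs/archive": "swift://example-container/logs",
}


def resolve(path: str, mounts: Dict[str, str] = MOUNTS) -> Tuple[str, str]:
    """Map a path in the unified namespace to (mount point, backing URI).

    The longest matching mount prefix wins, so nested mounts override
    their parents. Assumes `mounts` contains a root entry "/".
    """
    if not path.startswith("/"):
        raise ValueError("namespace paths must be absolute")
    best = "/"
    for prefix in mounts:
        # A prefix matches only on whole path components.
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            if len(prefix) > len(best):
                best = prefix
    relative = path[len(best):].lstrip("/")
    base = mounts[best].rstrip("/")
    return best, f"{base}/{relative}" if relative else base
```

A caller sees only paths such as `/sales/2019/q2.parquet`; which system actually holds the bytes is decided by the mount table, so storage can change without changing the paths that users and jobs rely on.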

Key Innovations of the Data Orchestration Layer
• Data Locality with Intelligent Multi-tiering - Accelerate big data workloads with transparent tiered local data
• Data Accessibility for popular APIs & API translation - Run Spark, Hive, Presto, ML workloads on your data located anywhere
• Data Elasticity with a unified namespace - Abstract data silos & storage systems to independently scale data with compute

Use Cases Data Orchestration Enables
• Run big data workloads in hybrid cloud environments
• Accelerate big data frameworks on the public cloud
• Enable big data on object stores across single or multiple clouds
[Diagram labels: Hive, Spark, Presto, Alluxio; On premise; Any Cloud / Multi Cloud; Same instance / container; Same data center / region; Standalone]

Incredible Open Source Momentum with growing community
• 900+ contributors & growing
• 3760+ Git Stars
• Apache 2.0 Licensed
• Hundreds of thousands of downloads
• Join the conversation on Slack: alluxio.org/slack

• Real-World Data Governance – Monthly Webinar Series
– August 15, 2019 – Data Governance versus Information Governance
– Third Thursday each Month @ 2pm EST – Register at TDAN.com, KIKconsulting.com, DATAVERSITY.net
• Non-Invasive Data Governance Book
– ISBN 9781935504856 / Technics Publishing / Amazon.com
• Speaking @ Dataversity Events
– Data Architecture Summit, Chicago – October 14-17
– Data Governance Vision, Washington, DC – December 9-12
• Non-Invasive Data Governance Online Learning Plan
• Non-Invasive Metadata Governance Online Learning Plan
– DATAVERSITY Training Center
– https://training.dataversity.net
• The Data Administration Newsletter (TDAN.com)
– Twice Monthly – Data Articles, Columns, Blogs and Features
– Produced by DATAVERSITY – Subscribe for emails
– New Non-Invasive Data Governance Framework now being published
• KIK Consulting & Educational Services
– KIKConsulting.com
– Home of Non-Invasive Data Governance™
– Home of Non-Invasive Metadata Governance
Introduction

Chief Analytics Officer, Velocity Mortgage Capital
Evan brings over 20 years of consulting experience in IT environments, including leading
software development projects, designing and implementing IT and data strategies, and working
on long-term, cross-departmental projects in such diverse industries as automotive, retail, state
government, and e-commerce payments.
Evan’s areas of expertise include designing practical analytics solutions, aligning business and IT
strategies, and implementing data management and governance programs.
He co-authored the data modeling book Beginning Relational Data Modeling and has spoken
about data and process quality and systems design. Evan has a BA in Economics from McGill
University and an MBA from Columbia Business School.
Special Guest Evan Terry

• In this webinar, Bob and Evan will discuss:
– The relationship between Data Lakes and Data Governance
– Preventing your Data Lake from becoming a Data Swamp
– Governing the Metadata associated with your Data Lake
– Leveraging governed data to provide trustworthy Analytics
– Measuring the value of a governed Data Lake
Abstract

• What is Data Governance?
– The execution and enforcement of authority over the
definition, production and usage of data and data-related assets.
Robert S. Seiner
– The management and organization of data.
Evan Terry
– The orchestration of people and process and data.
– The harmonization of people and process and data.
– The formalization of accountability for data.
– The implementation of decision-rights for data.
The Relationship between Data Lakes and Data Governance

• What is a Data Lake?
– A data lake is a system or repository of data stored in its natural/raw format,
usually object blobs or files.
– A data lake is usually a single store of all enterprise data including raw copies
of source system data and transformed data used for tasks such as reporting,
visualization, advanced analytics and machine learning.
SAS Article, 2016
• When does a Data Lake become a Data Swamp?
– A data swamp is a deteriorated and unmanaged data lake that is either
inaccessible to its intended users or is providing little value.
Olavsrud, Thor. CIO 2017
– When the data in the lake is ungoverned.

• A connection between governance (how to manage and organize) and data lakes
for accurate and useful data management
• Catalogs are critical to help you govern data, especially in data lakes
– Find things
– Define things
– Curate content
• Need to include policy-driven processes that classify and identify the information
in the lake, why it’s in there, what it means, who owns it, and who is using it
• A data lake without data governance will ultimately end up being a collection of
disconnected data pools or information silos—just all in one place.
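
The questions a catalog should answer about each data set (why it is in the lake, what it means, who owns it, who is using it) can be sketched as a simple record plus a gap check. This is a minimal illustration under assumptions, not any catalog product's schema: `CatalogEntry`, its field names, the `classification` labels, and `governance_gaps` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CatalogEntry:
    """One data set in the lake, with the questions the slide asks of it."""
    name: str                 # what it is
    purpose: str = ""         # why it is in the lake
    definition: str = ""      # what it means
    owner: str = ""           # who owns it
    consumers: List[str] = field(default_factory=list)  # who is using it
    classification: str = "unclassified"  # hypothetical label, e.g. "pii"


def governance_gaps(entry: CatalogEntry) -> List[str]:
    """Return the names of governance fields that are still empty."""
    gaps = []
    for attr in ("purpose", "definition", "owner"):
        if not getattr(entry, attr).strip():
            gaps.append(attr)
    if entry.classification == "unclassified":
        gaps.append("classification")
    return gaps
```

Running `governance_gaps` over every entry gives a simple worklist of data sets that are in the lake but not yet described well enough to be found, understood, or trusted.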

• What can be done to prevent the swamping of your data lake?
– Implement data governance for the lake.
– Implement metadata management for the lake.
– Implement sound principles of:
• Data Definition
• Data Production
• Data Usage
• What is the appropriate level
of data governance for your
data lake?
Preventing your Data Lake from becoming a Data Swamp

• A “data lake” becomes a data swamp without organization
– No organization, no curation of content, little metadata
• Data warehouse principles are relevant:
– Stewardship/Curation
– Design, documentation, maintenance of the lake
– Metadata capture
– Governance
• Technique - Create zones in your data lake:
– Transition data sets from “raw data” to “clean data”
– Apply different curation/governance principles to each zone
Preventing your Data Lake from becoming a Data Swamp
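
The zoning technique on this slide can be sketched as a promotion rule: a data set moves from the raw zone to the clean zone only when it meets the stricter requirements of the target zone. The zone names follow the slide; the required metadata keys, the `DataSet` type, and the `promote` function are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical zones, ordered from least to most curated.
ZONES: List[str] = ["raw", "clean"]

# Hypothetical per-zone rules: metadata keys a data set must carry
# before it may live in that zone.
REQUIRED_METADATA: Dict[str, List[str]] = {
    "raw": ["source"],
    "clean": ["source", "owner", "definition"],
}


@dataclass
class DataSet:
    name: str
    zone: str
    metadata: Dict[str, str]


def missing_for(zone: str, ds: DataSet) -> List[str]:
    """List the metadata keys that `zone` requires but `ds` lacks."""
    return [k for k in REQUIRED_METADATA[zone] if not ds.metadata.get(k)]


def promote(ds: DataSet) -> DataSet:
    """Move a data set to the next zone if it meets that zone's rules."""
    idx = ZONES.index(ds.zone)
    if idx == len(ZONES) - 1:
        raise ValueError(f"{ds.name} is already in the final zone")
    target = ZONES[idx + 1]
    missing = missing_for(target, ds)
    if missing:
        raise ValueError(f"cannot promote {ds.name}: missing {missing}")
    return DataSet(ds.name, target, dict(ds.metadata))
```

Keeping the rules in a per-zone table is one way to apply different governance to each zone: the raw zone stays cheap to load into, while the clean zone carries the stewardship requirements.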

• Governing metadata associated with:
– Data Definition
– Data Production
– Data Usage
• (Where) Is there metadata associated with your data lake?
• Who is responsible for the metadata associated with your data lake?
• “The metadata will not govern itself!”
Governing the Metadata Associated with your Data Lake

• Cataloging is key, but is tricky:
– Don’t under- or over-catalog
– Don’t be too loose or too rigid in your governance rules
• “Goldilocks” mentality – everything in moderation
• Tune governance to priorities and context
– One person’s data lake is another’s data swamp
– Don’t turn the data lake into a data warehouse – the clearest data lake
– Cannot be all things to all people – playground, incubator, or operational
data store?
Governing the Metadata Associated with your Data Lake

• Sample DG purpose statement – Use strategic data with confidence.
• Make certain the water is clean or it may be unhealthy.
• “Boil water alert” – Is data governance the boiling of the water?
• “Freshwater” versus “Saltwater” determines the species that will live in your lake.
Leveraging Governed Data to Provide Trustworthy Analytics

• Data catalogs solve the problems of finding, interpreting and using data
• A data lake is a tool, and context is key – differences in required data quality
• “Trustworthy” depends on context and accuracy needs – data lakes are defined
as “less” controlled and structured

• Provides much the same value as for a data warehouse – analytics requires:
– Knowing who owns the data and can answer questions about it
– Finding the right data elements that meet your needs
– Cleaning the data to an appropriate level of quality
– Having the right security on the data being used
– Monitoring the data for adherence to standards
• Lightweight governance on adding, naming, and organizing data protects the shared
resource from the “tragedy of the commons”
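
As one possible form of lightweight governance on naming, a simple check can flag data set paths that do not follow an agreed convention. The convention used here (`<zone>/<domain>/<dataset>` in lowercase snake_case, with zones "raw" and "clean") is an assumption made for the example, as are the names `NAME_PATTERN` and `nonconforming`.

```python
import re
from typing import List

# Hypothetical convention: <zone>/<domain>/<dataset>, lowercase
# snake_case segments, zone limited to "raw" or "clean".
NAME_PATTERN = re.compile(r"^(raw|clean)/[a-z][a-z0-9_]*/[a-z][a-z0-9_]*$")


def nonconforming(paths: List[str]) -> List[str]:
    """Return the paths that break the naming convention, in input order."""
    return [p for p in paths if not NAME_PATTERN.fullmatch(p)]
```

A check like this can run whenever data is added, which keeps the rule lightweight: contributors get immediate feedback, and the shared lake stays organized without a manual review step.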

• Metrics are one of the 6 core components of Data Governance:
data, people, process, communications, metrics and tools.
• Measuring people’s ____________ the data in the lake.
– confidence in
– understanding of
– usage of
– decisions made using
– knowledge of what data resides in
– … all will depend on the effective management
of metadata associated with your data lake.
Measuring the Value of a Governed Data Lake
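
Two of the measures above, knowledge of what data resides in the lake and usage of it, can be approximated from metadata and access records. The sketch below assumes a catalog held as a dictionary of metadata per data set and an access log of (user, data set) pairs; both structures and the function names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def metadata_coverage(catalog: Dict[str, Dict[str, str]],
                      required: Iterable[str] = ("owner", "definition")) -> float:
    """Fraction of data sets whose required metadata fields are all filled."""
    required = tuple(required)
    if not catalog:
        return 0.0
    complete = sum(
        1 for meta in catalog.values()
        if all(meta.get(key) for key in required)
    )
    return complete / len(catalog)


def distinct_users(access_log: List[Tuple[str, str]]) -> Dict[str, int]:
    """Count distinct users per data set from (user, data set) events."""
    users: Dict[str, set] = {}
    for user, dataset in access_log:
        users.setdefault(dataset, set()).add(user)
    return {dataset: len(names) for dataset, names in users.items()}
```

Both numbers depend entirely on metadata being captured in the first place, which is the point of the slide: without managed metadata there is nothing to measure.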

• Considerations for providing metrics
– Benchmark current status
– Select metrics that mean something to someone
– Select metrics associated with the data lake rather than data governance
– Consider that it is not easy to measure Return on Investment on DG
– Go jump in the lake!

• Unlocking the value depends on the data lake being broadly usable
• What is the value of R&D? What is the value of avoiding a disaster?
• The context of the data lake is key
– What is the purpose of the data lake?
– What problem will the data lake, as a tool, help you solve?
– How much value does governance (lightweight or not) provide?
• Value is measured in combination with the final use
– AI/Machine Learning
– Agility/Time to Market
– Variety of end users served/capabilities enabled

• In this webinar, Bob and Evan discussed:
– The relationship between Data Lakes and Data Governance
– Preventing your Data Lake from becoming a Data Swamp
– Governing the Metadata associated with your Data Lake
– Leveraging governed data to provide trustworthy Analytics
– Measuring the value of a governed Data Lake
Abstract

• Questions and Answers
Contact Information
Join us in the Dataversity Community to continue the conversation.
https://community.dataversity.net/

• Robert S. Seiner
KIK Consulting & Educational Services – KIKconsulting.com
The Data Administration Newsletter – TDAN.com
Post Office Box 112571, Upper St. Clair, Pennsylvania 15241
412.220.9643, 412.220.9644 (Fax)
rseiner@kikconsulting.com
rseiner@tdan.com
@RSeiner @TDAN_com
#RWDG
Contact Information
