This document discusses how data catalogs can be used to extract value from both structured and unstructured data. A catalog provides context about distributed data assets so that roles such as data scientists and analysts can find and understand relevant datasets. The document recommends implementing an augmented data catalog that uses machine learning to automatically curate, verify and classify data, improving data quality and insights over time. It also gives an overview of how to implement a phased data governance approach using a data catalog.
2. TOPICS
• Improve insights by extracting value from unstructured data utilizing a machine
learning augmented data catalog
• Learn how to implement an effective data catalog
• Avoid metadata silos using intelligent tools
• Let the insights come to you with AI-augmentation
• Explore the vendor market to maximize ROI
5. Definition
A data catalog creates and maintains an inventory of data assets through the
discovery, description and organization of distributed datasets. The data catalog
provides context to enable data stewards, data/business analysts, data engineers, data
scientists and other line of business (LOB) data consumers to find and understand
relevant datasets for the purpose of extracting business value.
In a nutshell, a data catalog is a place that shows what data assets you have and where they are
located. You might be asking, what is a data asset? It is any entity (e.g. reports, databases,
websites) that contains data.
Data Catalogs Are the New Black in Data Management and Analytics
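To make the definition concrete, here is a minimal sketch of a catalog as an inventory of assets that can be located and searched. Everything in it is hypothetical and for illustration only: the `DataAsset` and `DataCatalog` names, the fields, and the example locations are assumptions, not taken from any catalog product.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DataAsset:
    """One catalog entry: what the asset is and where it lives."""
    name: str
    kind: str                 # e.g. "report", "database", "website"
    location: str             # hypothetical URI or path
    description: str = ""
    tags: set[str] = field(default_factory=set)


class DataCatalog:
    """In-memory inventory of data assets, keyed by asset name."""

    def __init__(self) -> None:
        self._assets: dict[str, DataAsset] = {}

    def register(self, asset: DataAsset) -> None:
        """Add an asset to the inventory (or overwrite one with the same name)."""
        self._assets[asset.name] = asset

    def locate(self, name: str) -> str:
        """Answer 'where is it?' for a known asset."""
        return self._assets[name].location

    def search(self, term: str) -> list[str]:
        """Return sorted names of assets whose name, description, or tags mention the term."""
        term = term.lower()
        hits = []
        for asset in self._assets.values():
            haystack = " ".join([asset.name, asset.description, *asset.tags]).lower()
            if term in haystack:
                hits.append(asset.name)
        return sorted(hits)


# Illustrative entries; the locations are made up.
catalog = DataCatalog()
catalog.register(DataAsset("sales_db", "database", "postgres://warehouse/sales",
                           "Order transactions", {"sales", "finance"}))
catalog.register(DataAsset("churn_report", "report", "s3://reports/churn.pdf",
                           "Monthly customer churn", {"customer"}))
```

With these two entries, `catalog.search("customer")` finds the churn report and `catalog.locate("sales_db")` returns where the database lives. A real catalog would discover assets automatically and persist this metadata rather than hold it in memory.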
7. • Leverage an ML-augmented data catalog as a first step in metadata management
• Deploy data catalogs with the capability to scale beyond narrow (or tactical) use-case
requirements (such as cataloging data only within a Hadoop distribution).
8. Phase 1: Catalog and Lineage
• Infrastructure and installation of catalog tool
• Data architects to initiate the collection of data assets, catalog them and identify lineage
Phase 2: Data Stewardship, Business Glossary
• Appoint part-time Governance Lead role (cross-functional, business facing)
• Supporting Analyst
• Manage governance activities
Phase 3: Operationalize Governance Activities
• Accountability, ownership of data
• Operationalize data governance activities
• Report metrics
• Iterate activities for all information / data projects
HOW TO IMPLEMENT: Data cataloging is a journey (phased approach)
Plan, Organize, Define, Deploy
Establish Data Governance, Manage Data Lifecycle, Sustain Data Governance, Improve / Enhance Data Governance
• Communicate
• Manage Return On Investment
• Maintain Organization & Sponsorship
• Review/Update Processes
• Review/Update Scope (Quarterly Workshop)
• Business Change Management
• Review & Approve New Projects
• Maintain Data Definitions
• Maintain Metrics
• Identify Data Stewards
• Conflict Resolution, Escalation
Core Foundation: Augmented Data Catalog*
* Machine learning powered process for curating, verifying, and classifying data that enhances speed and usability
9. AI POWERED PROCESS FOR CURATING,
VERIFYING, AND CLASSIFYING DATA THAT
ENHANCES SPEED AND USABILITY
What is it?
Use algorithms (advanced statistics and deep learning) to learn from large-scale data to identify:
• Patterns
• Business rules
• Quality issues and anomalies across massive, complex and often streaming data sets
Applicable to large, complex and often streaming data sets: third-party data, sensor data, customer
data, transactions
How does it work?
• Algorithmic sampling of data to identify key patterns and business rules
• Continuous monitoring to alert Data Stewards of exceptions for timely resolution
• Correlation of data concepts across domains and data sources to track usage and establish
lineage
• Ability to ingest and apply quality rules to third-party and unstructured data sources
• Establishes a feedback loop that refines the machine learning models to improve data quality
over time
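The sampling and exception-alerting steps can be illustrated with a small sketch. It is a deliberate simplification: it swaps the advanced statistics and deep learning the slide mentions for hand-written regular expressions, and it does not model the feedback loop. The `PATTERNS` table, the function names, and the 0.8 threshold are all assumptions made for illustration.

```python
import random
import re

# Hypothetical rules; an augmented catalog would learn these instead of hard-coding them.
PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[a-zA-Z]+"),
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "integer": re.compile(r"-?\d+"),
}


def classify_column(values, sample_size=100, threshold=0.8, seed=0):
    """Sample a column and return (label, match_rate) for the best-matching pattern.

    Returns ("unknown", rate) when no pattern reaches the threshold.
    """
    if not values:
        return "unknown", 0.0
    rng = random.Random(seed)
    sample = values if len(values) <= sample_size else rng.sample(values, sample_size)
    best_label, best_rate = "unknown", 0.0
    for label, pattern in PATTERNS.items():
        rate = sum(bool(pattern.fullmatch(v)) for v in sample) / len(sample)
        if rate > best_rate:
            best_label, best_rate = label, rate
    if best_rate < threshold:
        return "unknown", best_rate
    return best_label, best_rate


def find_exceptions(values, label):
    """Return the values that break the inferred rule, for a data steward to review."""
    pattern = PATTERNS[label]
    return [v for v in values if not pattern.fullmatch(v)]


# Illustrative column: four well-formed values and one exception.
emails = ["a@x.com", "b@y.org", "c@z.net", "d@w.io", "not-an-email"]
label, rate = classify_column(emails)
exceptions = find_exceptions(emails, label)
```

Here the column is classified as `email` because four of five sampled values match, and the one non-matching value is surfaced as an exception. In a production setting, the steward's decision on each exception would feed back into the model. That part is left out of this sketch.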