Inside open metadata—the deep dive

Mandy Chessell CBE FREng CEng FBCS
Distinguished Engineer, Master Inventor
Analytics Chief Data Office
 mandy_chessell@uk.ibm.com
18th April 2018
Good analytics needs good data and
that needs good metadata

Apache Atlas as an open innovation platform for metadata management and governance3
Agenda
 Why is metadata so important today?
 What is the challenge?
 Building an open ecosystem
 Apache Atlas and the specifics
 ODPI Data Governance PMC
 Progress report and call to action

Open Data
Site
The perils of reusing data …
Data Lake
Employee
Directory
Callie Quartile uses (1) open data
from the local government registrar
and (2) data from the employee
directory to (3) create a birthday
card service for the company.
Callie Quartile
Data Scientist
1
3
2

Open Data
Site
The perils of reusing data …
Data Lake
Employee
Directory
Callie Quartile
Data Scientist
1
3
2
Happy
Birthday
But its not my
birthday
Unfortunately the obvious date in the
registrar record was the registration of
birth date not the date of birth. Date
of birth was not published in the open
data.
Callie needed better information about
the open data to realise she had the
wrong data.

Metadata
should bring
as much
information
about the
data sets to
Callie’s data
science as is
known
collectively
by the
organization.
Employee Directory
NameBand Job Title
X
Data Set Name: Employee
Directory
X
Description:
Core attributes describing all
employees of OCO
pharmaceuticals created from a
daily extract from Kenexa.
Owner: Penny Payer
Status:
Last accessed: 6th May 2016
Records: 3488
Last Update: 1st May 2016
Contents:
Structure …
Contents …
Lineage …
XColumn:
Band
Classification Ranges:
Confidentiality: Public, Confidential,
Sensitive
Confidence: Authoritative
Retention: Indefinitely
Characteristi
cs
LineageDescription
Position reference number for non-
exempt employees. The value ranges
from 01 to 06 where 01 is the most senior
and 06 is the most junior.
Type: String
Classification: Public

Different personas need different services
Callie Quartile
Data Scientist
Jules Keeper
Chief Data Officer
Find data
Understand data
Manage analytics models
Build data strategy
Define governance program
Monitor progress

Faith Broker
HR and Privacy Officer
Gary Geeke
IT
Locate personal data
Ensure protection of personal data
Understand employee needs
Maintain “safe” IT Infrastructure
Build and deploy “good” APIs and services
Locate and resolve issues fast

Tanya Tidie
Clinical Trials Administrator
Ivor Padlock
Chief Security Officer
Maintain accurate patient records
Catalog clinical trials data
Demonstrate good data management practices
Understand risks to organization
Set up protection
Monitor for suspicious activity

Scope of metadata for a data driven organization
Glossary Collaboration
Governance
Models and
Reference Data
Metadata
Discovery
Lineage Data Assets
Base Types, Systems
and Infrastructure

Curation
00 3809890 6 7 Lemmie Stage 818928 3082 4 New York 4 27 DataStage Expert 1 45324 300 27 Code St Harlem NY 1 3
00 3809890 3 7 Callie Quartile 328080 7432 5 New York 4 27 Data Scientist 1 56944 045 27 Code St Harlem NY 1 3
00 3809890 1 7 Tanya Tidie 209482 4051 2 New York 4 27 Data Steward 1 43800 215 27 Code St Harlem NY 1 3
I know
I wonder
what this
means

Scared to share
Faith Broker
Business Team
00 3809890 3 7 Callie Quartile 328080 7432 5 New York 4 27 Data Scientist 1 56944 045 27 Code St Harlem NY 1 3
00 3809890 1 7 Tanya Tidie 209482 4051 2 New York 4 27 Data Steward 1 43800 215 27 Code St Harlem NY 1 3
Faith Broker has been doing some simple analysis
on the HR data of the company. She wants to share
this data with Callie Quartile to do some detailed
work. However, she does not want Callie to see the
sensitive personal information in the record.
00 3809890 6 7 Lemmie Stage 818928 3082 4 New York 4 27 DataStage Expert 1 XXXXX XXX 27 Code St Harlem NY 1 3
00 3809890 3 7 Callie Quartile 328080 7432 5 New York 4 27 Data Scientist 1 XXXXX XXX 27 Code St Harlem NY 1 3
00 3809890 1 7 Tanya Tidie 209482 4051 2 New York 4 27 Data Steward 1 XXXXX XXX 27 Code St Harlem NY 1 3
Callie Quartile
Data Scientist

Business
metadata
Structural
metadata for
a data store
Using glossary function for semantic processing
EMPNAME EMPNO JOBCODE SALARY
EMPLOYEE
RECORD
Employee
Work Location
Annual Salary
Job Title
Employee Id
Employee Name
Hourly Pay Rate
Manager Compensation Plan
HAS-A
HAS-A
HAS-A
HAS-A
HAS-A
HAS-A
IS-A IS-A
Sensitive
IS-A
Data

Why do we need metadata?
 Metadata enables data to be used outside of the application that created it.
• Analytics and decision making
• New business applications
• Reporting and compliance
 Metadata describes the format and content of data allowing people to judge which data set
to use for a new project
• Structure
• Meaning
• Origin
• Valid values and quality
• Usage and ownership
• Regulations and classifications that apply
• <more>
 Metadata describes the business context and classification of data allowing automated
governance processes to operate.

Today’s reality
 Many data platforms do not have metadata support
 Proprietary tools support a range of data sources and governance actions
• No-one supports everything you need and assumes all tools come from their suite
• Each tool starts “empty” requiring effort to populate metadata
• Each tool operates as if it is the only tool
• No integration/interoperability of metadata repositories from different vendors
 Expensive efforts to create an enterprise data catalogue

Today’s reality

Manual metadata capture

Automatic metadata capture
18

What needs to change?
Open and
Unified Metadata

A new manifesto for metadata and governance
 Metadata management must be automated
 Metadata management must become ubiquitous
 Metadata must become open and remotely accessible
 Metadata should be used to drive the governance of data
The discovery, maintenance and use of metadata has to be an integral part
of all tools that access, change and move information.
20

Open metadata management ecosystem
 Peer-to-peer network of repositories
 Metadata stored and managed close
to its source
 Each repository/tool brings unique
value.
 Open, extensible metadata structures
for metadata exchange and federation
– extending coverage of the types of
resources that need to be described.
 Open source infrastructure sharing
cost of development and maintenance
between vendors
 Support for open standards where
available
Collaboration
Space Metadata
Analytics Platform
Metadata
Application
Metadata
Cloud SaaS platform
Metadata
Hadoop Platform
Metadata

Apache Atlas
http://atlas.apache.org/
 Apache Atlas has just graduated to become a top-level project.
 It began as an incubator open source project on 5th May 2015 to deliver an
open source governance capability focused primarily on the Hadoop platform.
 Apache Atlas is designed to localize operational governance to the operating
data platform such as Hadoop.
 At its heart is a type-agnostic metadata store that can be access through restful
interfaces.
We see Apache Atlas as the reference implementation for open metadata and
governance, for vendors to pick up and use; or test their integration against.
Being open source allows all vendors to enrich/enhance standard.

Apache Atlas today

Updates to Apache Atlas  Automation
• Capture of metadata from data platforms,
data movement engines and data
protection engines.
• Exception management and stewardship
 Business Value
• Specialized services for key data roles
such as CDO, Data Scientist, Developer,
DevOps Operator, Asset Owner,
Applications
 Connectivity
• Metadata Highway offering open
metadata exchange, linking and
federation between heterogeneous
metadata repositories.

Taking guidance from existing metadata standards
 Well-defined
 Complementary
 Integrating
 Decoupled
https://www.w3.org/TR/vocab-dcat/

Instance representations in the graph

Open metadata meta-types, types and instances
«relationship»
DataContentForDataSet
*
*
dataContent
supportedDataSets
«entity»
DataSet
createTime : date
modiﬁedTime : date
«entity»
DataStore
«entity»
Asset
«entity»
GlossaryTerm
«entity»
Referenceable
description : string
expression : string
status : TermAssignmentStatus
conﬁdence : int
steward : string
source : string
«relationship»
SemanticAssignment
*
*
assignedElements
meaning

Open metadata type model summary
Glossary Collaboration
Governance
Models and
Reference Data
Metadata
Discovery
Lineage Data Assets
4
3
1
5
2
6
7
Base Types, Systems
and Infrastructure
0

Open metadata type model summary
Policy Metadata (Principles,
Regulations, Standards,
Approaches, Rule Specifications,
Roles and Metrics)
Governance
Actions and
Processes
Augmentation
MappingImplementation
Business Objects and
Relationships, Taxonomies
and Ontologies
Business Attributes
Organization
Teaming Metadata
(people profiles,
communities, projects,
notebooks, …)
Models and Schemas
4
3
1
5
Physical Asset Descriptions
(Data stores, APIs,
models and components)
Asset Collections
(Sets, Typed Sets, Type
Organized Sets)
Information Views
Rights
Management
Reference Data
Feedback Metadata
(tags, comments, ratings, …)
ClassificationSchemes
Classification
Strategy Subject Area Definition
Campaigns and Projects
Rollout
2
Discovery
Metadata (profile data,
technical classification, data
classification,
data quality assessment, …)
Augmentation
Instrument
Association
Information Process
Instrumentation (design lineage)
6
7
O-DEF
O-BDL
ConnectorsBasic Types, Infrastructure and Systems
Access
0

More detail here …
https://cwiki.apache.org/confluence/display/ATLAS/Building+out+the+Open+Metadata+Typesystem

Metadata and governance digital platform
Open Metadata
and Governance
Reporting
Platform
ETL Platform
Analytics
Platform
Virtualization
Platform
Governance
Platform
Data
Platform

Types of tools that may integrate with an open metadata
repository
 BI and visualization tools
• locating data assets and related information about them; defining
reports and publishing their metadata; viewing lineage
 Data Science tool
• wanting to find out about data assets available and manage user
lineage of transformations and analytics models – may also manage
metadata for analytics models
 API developer tool
• wanting to understand proper data structures and data meaning to
use for APIs – plus additional governance requirements that need to
be implemented by API because of the data it exchanges.
 Counter-fraud tools
• ad hoc analysis of logs and error reports, setting up rules
 Curator/owner tool
• for managing the curation of assets, providing access, verifying use of
assets, reviewing discovery results and exceptions, approving change
requests.
 Glossary tool
• for subject matter experts and information architects to share
expertise about a particular subject area – may also define structures
and related reference data
 Enterprise architect tools
• defining the data landscape and related systems.
 DevOps tools
• conformance to polices and standards in development
• metadata capture at deployment
• validation of deployment platform requirements
 Data integration engine
• locating appropriate data and component assets, log design lineage,
log operational lineage
 Information Virtualisation tools
• locate appropriate data assets, build views and publish them, add
design lineage, log operational lineage
 Governance tools
• setting up and monitoring governance program, data quality, …
 Stewardship tools
• reviewing assigned exceptions, making data changes and requesting
approval
 Information security tools
• setting up data access policies and enforcement
 Auditor tools
• view compliance reports and validate policies and policy
implementations

Open Metadata Access Services
Project Management
Community ProfileAsset Catalog
Stewardship Action
Information View
Governance Program
Information Process
Subject Area
Connected Asset Discovery
Governance Engine
Information Protection
Developer
Data Platform
Asset Owner
Information Landscape
Data Science
DevOps
Asset Consumer
Information
Infrastructure

OMAS service instance
Both call API and notifications

Inside the server
Open Metadata and Governance (OMAG) Server
Open Metadata Access Services (OMAS)
OMRS Topic
Connector
OMRS Cohort
Registry Store
Connector
OMRS Archive
Connector
OMRS
AuditLog
Connector
OMRS Event
Mapper
Connector
OMRS
Repository
Connector
Server
Conﬁguration
OMAS REST APIs
and Topics
OMAG
Administration
REST APIs
OMRS
Repository
REST APIs
Open Metadata Repository Services (OMRS)

Inside the server
Open Metadata and Governance (OMAG) Server
Open Metadata Access Services (OMAS)
OMRS Topic
Connector
OMRS Cohort
Registry Store
Connector
OMRS Archive
Connector
OMRS
AuditLog
Connector
OMRS Event
Mapper
Connector
OMRS
Repository
Connector
Server
Conﬁguration
OMAS REST APIs
and Topics
OMAG
Administration
REST APIs
OMRS
Repository
REST APIs
Administration
Enterprise Repository Services
Local Repository
Services
Cohort
Services

Integration patterns
https://cwiki.apache.org/confluence/display/ATLAS/Integrating+into+the+Open+Metadata+and+Governance+Ecosystem
IBM Information
Governance Catalog
Apache
Atlas

Caller Pattern
 A metadata tool can access the
consumer-specific APIs to work
with metadata.
 The Access Layer handles the
calls to metadata repositories
connected to the metadata
highway

Native Pattern
 Native
implementation of
the open
metadata
governance APIs
 Apache Atlas is a
native
implementation of
the open
metadata and
governance APIs.

Adapter Pattern
 Simple
components plug
into a repository
proxy to connect
in an existing
metadata
repository.

Plug-in Pattern
 Open Connector Framework (OCF)
• Connectors to data, analytics etc
 Open Discovery Framework (ODF)
• Metadata discovery services
 Governance action Framework (GAF)
• Stewardship services for triage and
remediation of exceptions

IBM Unified Governance

Simple cohort
Cohort A
Chief Data Office
Data Lake
Systems of Record

Multiple Cohorts
Cohort BCohort A
Chief Data Office
Data Lake
Systems of Record
Mobile
Apps
Data
Lake
Systems of
Record
Marketing

First server

Establishing contact

Federated queries

Caching metadata for availability and performance

ODPI - co-creation with practitioners
• Compliance assistance and certification
for vendors
• Subject matter experts sharing best
practices and co-creating content packs
https://github.com/odpi/data-governance

• Your governance program is based on
established practices and definitions
• Allows a broader range of tools in your
organization
• Automated governance processes
protect and manage your data
Your metadata offerings will deliver value
faster as they tap into metadata collected by
other vendor’s tools.
ODPi packages extend your metadata
system’s and tools’ capabilities
Conformance tests minimize your effort in
being compliant with key standards and
regulations.
Customers have increased confidence in your
tools and services due to ODPi certification.
Data Governance Professionals
Vendors
How ODPi Helps

Summary
 Big data is creating new opportunities and requirements that needs new types
of systems. Data Lakes are just one part of this story.
 Metadata is critical to make the best use of this data for the widest range of
scenarios.
 Most organizations use tools and platforms from many vendors.
 Open standards have had limited take-up
 Can we use open source to create a digital platform that allows vendors to take
advantage of metadata from a broader ecosystem?
• Open Metadata and Governance defines the standards
• Apache Atlas provides the reference implementation
• ODPi helps to build the ecosystem

Call to action – how can you help?
 Direct contribution to the Apache Atlas and/or ODPi Data Governance projects.
• There are many features that still need to be developed.
 Encouraging your vendors/partners and projects internal to your organization
to embrace the Open Metadata and Governance standards to grow the
ecosystem of data and processing that is assured by metadata and governance
capability.
52

https://cwiki.apache.org/confluence/display/ATLAS/Atlas+Projects

zzzz
z
z
z
Questions?

Inside open metadata—the deep dive

More Related Content

What's hot

Similar to Inside open metadata—the deep dive

More from DataWorks Summit

Recently uploaded

Inside open metadata—the deep dive

Editor's Notes