This document discusses Apache Atlas, an open source metadata management and governance framework for Hadoop ecosystems. It provides an overview of Atlas' features for modeling and classifying metadata, integrating with components like Hive and Ranger, and its architecture using a graph database and Kafka messaging. The document also outlines use cases for lineage tracking, compliance, and data governance as well as the roadmap for additional component integration and metadata export/import capabilities.
Governance using
Apache Atlas: Why and How
Vimal Sharma, Apache Atlas PMC & Committer
Software Engineer, Hortonworks
Apache ID: svimal2106@apache.org
Apache Atlas: Project Details
Ø Entered the Apache Incubator in May 2015
Ø Organizations: IBM, Hortonworks, Aetna, Merck, Target
Ø 3 releases in the last year
Ø Graduated to a Top Level Project in June 2017
Release timeline: 0.7 (July 2016) → 0.7.1 (Jan 2017) → 0.8 (Mar 2017) → TLP (June 2017)
Apache Atlas: Introduction
Ø Governance and metadata framework for Hadoop
Ø Model a component and capture metadata
Ø Data Assets - Hive Table, HBase column family
Ø Process - Storm Topology, Sqoop Import
Ø Classification - Tag metadata entities
Ø Built in support for popular components
Ø Extensible Architecture
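As a rough illustration of the ideas above, a data asset entity and a classification (tag) can be pictured as simple JSON-style structures. The field names below loosely mirror the shape of Atlas metadata payloads but are a simplified sketch, not the actual API.

```python
# Hypothetical, simplified sketch of an Atlas-style metadata entity
# and a classification (tag). Field names are illustrative only.

def make_entity(type_name, qualified_name, attributes=None):
    """Build a minimal metadata entity for a data asset or process."""
    entity = {
        "typeName": type_name,                      # e.g. "hive_table"
        "attributes": {"qualifiedName": qualified_name},
        "classifications": [],                      # tags attached to the entity
    }
    if attributes:
        entity["attributes"].update(attributes)
    return entity

def tag_entity(entity, tag_name):
    """Attach a classification (tag) such as PII to an entity."""
    entity["classifications"].append({"typeName": tag_name})
    return entity

# A Hive table asset tagged as PII:
customers = make_entity("hive_table", "sales.customers@cluster1",
                        {"owner": "etl_user"})
tag_entity(customers, "PII")
```

The same two building blocks (typed entities plus tags) cover both data assets and processes, which is what makes the architecture extensible to new components.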
Governance Problem (Use Cases)
Ø ETL Pipeline Failure Scenarios
• Upstream failure analysis
• Alerts to downstream processes
• Visual lineage of ETL pipelines
Ø Redundant Processing
• Does a derived dataset contain the required information?
• Can metadata classification be used to determine this?
• Avoid expensive processing
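One way to picture the redundant-processing check: before launching an expensive job, look at the classifications already attached to the derived dataset. The helper names and tag names below are illustrative assumptions, not part of the Atlas API.

```python
# Hedged sketch: using classifications (tags) to decide whether an
# expensive derivation can be skipped. Names are illustrative.

def has_tags(entity, required):
    """True if the entity already carries every required classification."""
    present = {c["typeName"] for c in entity.get("classifications", [])}
    return set(required) <= present

def should_recompute(derived_entity, required_tags):
    """Only rerun the expensive job when the derived dataset is missing
    a classification the consumer needs."""
    return not has_tags(derived_entity, required_tags)

derived = {"typeName": "hive_table",
           "classifications": [{"typeName": "Aggregated"},
                               {"typeName": "Anonymized"}]}
```

If the derived table is already tagged `Anonymized`, a consumer that needs anonymized data can reuse it instead of recomputing from the raw source.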
Use Cases
Ø Compliance and Security
• Impose security constraints on sensitive data
• Data can span multiple Hadoop components
• One policy to govern them all
Ø Cluster Admin
• Periodic cleanup of datasets
• Which are the unused/dormant datasets?
• How to define the relevance of a dataset?
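One possible relevance rule for the cleanup use case is recency of access: a dataset is dormant if nobody has read it within a threshold window. The `lastAccessTime` audit field below is an assumption for this sketch, not a guaranteed Atlas attribute.

```python
from datetime import datetime, timedelta

# Illustrative "relevance" rule for periodic cleanup: a dataset is
# dormant if it has not been accessed within `threshold_days`.
# The lastAccessTime field is an assumption for this example.

def dormant_datasets(datasets, now, threshold_days=90):
    """Return names of datasets not accessed within `threshold_days`."""
    cutoff = now - timedelta(days=threshold_days)
    return [d["name"] for d in datasets if d["lastAccessTime"] < cutoff]

now = datetime(2017, 6, 1)
datasets = [
    {"name": "sales.daily_clicks", "lastAccessTime": datetime(2017, 5, 20)},
    {"name": "tmp.old_export",     "lastAccessTime": datetime(2016, 11, 3)},
]
```

Real deployments would likely combine access recency with lineage (is anything downstream still consuming this dataset?) before deleting anything.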
Cross Component Lineage
• Lineage: Upstream and downstream Data Assets
• Individual Components: Own metadata store
• Cross component events
• Atlas: Flexibility to model arbitrary components
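The cross-component idea can be sketched as a small graph: processes (say, a Sqoop import followed by a Hive CTAS) connect assets that live in different components, and lineage is a transitive walk over those edges. The process and asset names here are illustrative.

```python
# Hedged sketch of cross-component lineage: process nodes link data
# assets across components (MySQL -> Hive). Names are illustrative.

processes = [
    {"name": "sqoop_import", "inputs": ["mysql.orders"],
     "outputs": ["hive.orders_raw"]},
    {"name": "hive_ctas",    "inputs": ["hive.orders_raw"],
     "outputs": ["hive.orders_clean"]},
]

def upstream_assets(asset, processes):
    """Transitively collect every asset upstream of `asset`."""
    result, frontier = set(), {asset}
    while frontier:
        current = frontier.pop()
        for p in processes:
            if current in p["outputs"]:
                for src in p["inputs"]:
                    if src not in result:
                        result.add(src)
                        frontier.add(src)
    return result
```

Walking the edges in the other direction (inputs to outputs) gives downstream impact analysis, which is what the ETL failure-alerting use case relies on.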
Ranger Integration
• Ranger: Listener on tag addition/deletion
• Attribute based policies rather than asset based policies
• Example tag: PII
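The attribute-based idea can be illustrated with a toy policy check: the policy references a classification such as PII rather than individual assets, so one rule covers every tagged entity regardless of which component it lives in. The policy shape and group names below are assumptions for this sketch, not Ranger's actual policy model.

```python
# Illustrative attribute (tag) based access check, in the spirit of the
# Atlas/Ranger integration. Policy shape and names are assumptions.

policy = {"tag": "PII", "allowed_groups": {"compliance", "admins"}}

def is_access_allowed(entity_tags, user_groups, policy):
    """Allow access unless the entity carries the restricted tag and the
    user is outside the policy's allowed groups."""
    if policy["tag"] not in entity_tags:
        return True                     # policy does not apply
    return bool(set(user_groups) & policy["allowed_groups"])
```

Because the check keys on the tag, tagging a new Hive table or HBase column family as PII immediately brings it under the same policy: one policy to govern them all.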
Type System
• Model of metadata to be stored
• Every type has
Ø Unique Name
Ø Attributes
Ø SuperTypes
• Attributes
Ø Mandatory/Optional
Ø Unique
Ø Composite
Ø Reverse Reference
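A simplified sketch of these ideas: a type definition bundles a unique name, supertypes, and attribute definitions carrying mandatory/optional and unique flags, and an entity can be validated against it. The dict layout is illustrative, not the actual Atlas typedef format.

```python
# Simplified sketch of a type definition in the spirit of the Atlas
# type system. The dict layout is illustrative only.

def make_typedef(name, super_types=(), attribute_defs=()):
    """A type: unique name, supertypes, and attribute definitions."""
    return {"name": name,
            "superTypes": list(super_types),
            "attributeDefs": list(attribute_defs)}

def missing_mandatory(typedef, attributes):
    """Names of mandatory attributes absent from an entity's attributes."""
    return [a["name"] for a in typedef["attributeDefs"]
            if not a.get("isOptional", True) and a["name"] not in attributes]

hive_table = make_typedef(
    "hive_table",
    super_types=["DataSet"],
    attribute_defs=[
        {"name": "qualifiedName", "isOptional": False, "isUnique": True},
        {"name": "comment",       "isOptional": True},
    ])
```

Supertypes let common attributes (ownership, qualified name) be declared once on a base type such as a dataset type and inherited by every component-specific type.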
Hook Design
Ø Hive Hook
• Multiple clients, e.g. Pig, Hive, Beeline
• Always full update to avoid inconsistency
Ø Synchronous vs Asynchronous communication
• Earlier : Hook communicated with server directly
• Now : Metadata entities pushed to Kafka
Ø Un-partitioned Kafka topic
• Avoid out of order messages
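The hook design above can be sketched end to end, with an in-memory deque standing in for the un-partitioned Kafka topic (a single partition preserves message order) and full-entity snapshots instead of deltas. This is a toy stand-in, not the actual Atlas hook code.

```python
from collections import deque

# Toy sketch of the asynchronous hook design: a deque stands in for the
# single-partition (un-partitioned) Kafka topic, so FIFO order holds.

topic = deque()  # one "partition": messages are consumed in publish order

def hook_publish(entity):
    """Hook side: push the full entity snapshot (never a delta), so a
    lost or reordered earlier message cannot leave the store inconsistent."""
    topic.append(dict(entity))

def server_consume(store):
    """Server side: apply messages in arrival order; a later full
    snapshot simply overwrites the earlier one."""
    while topic:
        entity = topic.popleft()
        store[entity["qualifiedName"]] = entity
    return store

hook_publish({"qualifiedName": "db.t@c", "owner": "alice"})
hook_publish({"qualifiedName": "db.t@c", "owner": "bob"})   # full update
store = server_consume({})
```

Decoupling through the queue means a slow or briefly unavailable metadata server no longer blocks the Hive client that fired the hook.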
Roadmap
Ø Hooks for Spark, HBase and NiFi
Ø Column level lineage for Hive
• create table dest as select (a + b) x, (c * d) y from source
Ø Export/Import of metadata
[Diagram: source columns a and b feed an Addition node producing output column x]
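Column-level lineage for the CTAS above could be represented roughly as follows: `dest.x` derives from `source.a` and `source.b` via addition, and `dest.y` from `source.c` and `source.d` via multiplication. The dict shape is an assumption for illustration, not an Atlas payload.

```python
# Illustrative column-level lineage for:
#   create table dest as select (a + b) x, (c * d) y from source

column_lineage = {
    "dest.x": {"expression": "a + b", "inputs": ["source.a", "source.b"]},
    "dest.y": {"expression": "c * d", "inputs": ["source.c", "source.d"]},
}

def input_columns(output_column):
    """Upstream source columns contributing to one output column."""
    return column_lineage[output_column]["inputs"]
```

With per-column edges, impact analysis can answer questions like "which outputs break if `source.b` changes type?" rather than only table-level questions.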
Contribute
Ø Project Website - http://atlas.apache.org/
Ø Dev Mailing List - dev@atlas.incubator.apache.org
Ø User Mailing List - user@atlas.incubator.apache.org
Ø JIRA link - https://issues.apache.org/jira/browse/ATLAS