Governance using
Apache Atlas: Why and How
Vimal Sharma, Apache Atlas PMC & Committer
Software Engineer, Hortonworks
Apache ID: svimal2106@apache.org
Apache Atlas : Project Details
Ø Entered the Apache Incubator in May 2015
Ø Organizations involved: IBM, Hortonworks, Aetna, Merck, Target
Ø Three releases in the last year
Ø Graduated to a Top Level Project in June 2017
Release timeline: 0.7 (July 2016) → 0.7.1 (Jan 2017) → 0.8 (Mar 2017) → TLP (June 2017)
Apache Atlas : Introduction
Ø Governance and metadata framework for Hadoop
Ø Model a component and capture its metadata
Ø Data Assets: Hive table, HBase column family
Ø Processes: Storm topology, Sqoop import
Ø Classification: tag metadata entities
Ø Built-in support for popular components
Ø Extensible Architecture
Apache Atlas: Architecture
Ø Core
• Type System
• Graph Abstraction/Engine (Titan)
• Metadata Store (HBase), Index Store (Solr)
• Ingest/Export, Search
Ø Integration
• API (HTTP/REST)
• Messaging (Kafka)
Ø Metadata Sources
• Hive, Sqoop, Storm, Custom
Ø Apps
• UI
• Business Glossary (Roadmap)
• Ranger Tag-Based Policies
Governance Problem (Use Cases)
Ø ETL Pipeline Failure Scenarios
• Upstream failure analysis
• Alerts to downstream processes
• Visual lineage of ETL pipelines
Ø Redundant Processing
• Does a derived dataset contain the required information?
• Can metadata classification be used to determine this?
• Avoid expensive processing
Use Cases
Ø Compliance and Security
• Impose security constraints on sensitive data
• Data can span multiple Hadoop components
• One policy to govern them all
Ø Cluster Admin
• Periodic cleanup of datasets
• Which datasets are unused or dormant?
• How should the relevance of a dataset be defined?
Cross-Component Lineage
• Lineage: the upstream and downstream data assets of a process
• Individual components keep their own metadata stores
• Cross-component events span these separate stores
• Atlas: the flexibility to model arbitrary components (lineage can then be read back over REST, as sketched below)
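A minimal sketch of reading lineage back through the Atlas 0.8 v2 REST API; the host, credentials, and GUID are placeholders, not values from the talk:

import requests

ATLAS = "http://localhost:21000/api/atlas/v2"   # assumed host/port
AUTH = ("admin", "admin")                       # placeholder credentials

guid = "..."  # GUID of a DataSet entity, e.g. obtained from a search
resp = requests.get(f"{ATLAS}/lineage/{guid}",
                    params={"direction": "BOTH", "depth": 3},
                    auth=AUTH)
lineage = resp.json()
# "relations" lists the lineage edges; "guidEntityMap" describes the nodes
for edge in lineage.get("relations", []):
    print(edge["fromEntityId"], "->", edge["toEntityId"])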
Ranger Integration
• Ranger listens for tag addition/deletion events from Atlas
• Attribute-based policies rather than asset-based policies
• Example: one policy covering every asset tagged PII (see the sketch below)
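A minimal sketch of the tagging side, assuming a PII classification type has already been defined and the same placeholder server and credentials as above; Ranger's tag-sync then enforces a single policy for every asset carrying the tag:

import requests

ATLAS = "http://localhost:21000/api/atlas/v2"
AUTH = ("admin", "admin")

guid = "..."  # GUID of the entity to tag, e.g. a hive_table
# Attach the PII classification to the entity
requests.post(f"{ATLAS}/entity/guid/{guid}/classifications",
              json=[{"typeName": "PII"}], auth=AUTH)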
Type System
• Model of metadata to be stored
• Every type has
Ø Unique Name
Ø Attributes
Ø SuperTypes
• Attribute properties (see the sketch below)
Ø Mandatory/Optional
Ø Unique
Ø Composite
Ø ReverseReference
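As a sketch of how these properties surface in a v2 attribute definition (field names per the Atlas 0.8 REST API; the attribute itself is illustrative, not from the talk):

attribute_def = {
    "name": "columns",
    "typeName": "array<dataframe_column>",
    "isOptional": True,                    # mandatory vs. optional
    "isUnique": False,                     # unique
    "cardinality": "LIST",
    # composite ownership and reverse references are modeled as constraints
    "constraints": [{"type": "ownedRef"}],
}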
Atlas Base Types
Ø Referenceable: qualifiedName
Ø Asset (extends Referenceable): name, owner, description
Ø DataSet (extends Asset)
Ø Process (extends Asset): inputs, outputs
Spark Introduction
• RDD: the basic unit of execution
• DataFrame: an RDD with a relational schema
• Let's model a DataFrame type!
DataFrame Type
Ø spark_dataframe (extends DataSet)
• Attributes: source, destination, columns (composite array of dataframe_column)
Ø dataframe_column
• Attributes: type, comment, dataframe (reverse reference to spark_dataframe)
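A minimal sketch of registering both types through the v2 typedefs endpoint; the server address, credentials, and exact attribute shapes are assumptions, not the talk's demo code:

import requests

ATLAS = "http://localhost:21000/api/atlas/v2"
AUTH = ("admin", "admin")

typedefs = {
    "entityDefs": [
        {
            "name": "spark_dataframe",
            "superTypes": ["DataSet"],
            "attributeDefs": [
                {"name": "source", "typeName": "string", "isOptional": False},
                {"name": "destination", "typeName": "string", "isOptional": False},
                {"name": "columns",
                 "typeName": "array<dataframe_column>",
                 "isOptional": True,
                 "cardinality": "LIST",
                 # composite: the dataframe owns its columns
                 "constraints": [{"type": "ownedRef"}]},
            ],
        },
        {
            "name": "dataframe_column",
            "superTypes": ["Referenceable"],
            "attributeDefs": [
                {"name": "type", "typeName": "string", "isOptional": False},
                {"name": "comment", "typeName": "string", "isOptional": True},
                {"name": "dataframe",
                 "typeName": "spark_dataframe",
                 "isOptional": True,
                 # reverse reference back to the owning dataframe
                 "constraints": [{"type": "inverseRef",
                                  "params": {"attribute": "columns"}}]},
            ],
        },
    ],
}
requests.post(f"{ATLAS}/types/typedefs", json=typedefs, auth=AUTH)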
Graph Snapshot
Ø 1: Dataframe Type
Ø 2: Column Type
Ø 3: Dataframe Entity (employeeInfo@Hortonworks, source /hdfs/source, destination /hdfs/destination)
Ø 4, 5: Column Entities (name, id)
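Creating the dataframe entity from the snapshot could then look like this minimal sketch (column entities elided for brevity; same placeholder server and credentials as before):

import requests

ATLAS = "http://localhost:21000/api/atlas/v2"
AUTH = ("admin", "admin")

entity = {
    "entity": {
        "typeName": "spark_dataframe",
        "attributes": {
            "qualifiedName": "employeeInfo@Hortonworks",
            "name": "employeeInfo",
            "source": "/hdfs/source",
            "destination": "/hdfs/destination",
            # the name/id column entities are elided here
        },
    }
}
resp = requests.post(f"{ATLAS}/entity", json=entity, auth=AUTH)
print(resp.json())  # response carries the GUIDs assigned to created entities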
Demo Example
Ø PayrollDetails (HDFS path)
Ø VariableComponent (HDFS path)
Ø SalaryProcessor (DataFrame)
Ø EmployeeSalary (Kafka topic)
Hook Design
Ø Hive Hook
• Multiple clients, e.g. Pig, Hive CLI, Beeline
• Always send a full update to avoid inconsistency
Ø Synchronous vs asynchronous communication
• Earlier: the hook communicated with the server directly
• Now: metadata entities are pushed to Kafka (see the sketch below)
Ø Single-partition Kafka topic
• Avoids out-of-order messages
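A rough sketch of the asynchronous path using the kafka-python client (an assumption; the real hooks are Java, and the message envelope below is simplified, not the exact Atlas wire format):

from kafka import KafkaProducer  # pip install kafka-python
import json

# Hooks publish entity notifications to the ATLAS_HOOK topic; keeping it
# to a single partition preserves strict message ordering.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed broker address
    value_serializer=lambda m: json.dumps(m).encode("utf-8"),
)
notification = {
    "type": "ENTITY_CREATE",  # simplified envelope, for illustration only
    "entities": [{"typeName": "spark_dataframe",
                  "attributes": {"qualifiedName": "employeeInfo@Hortonworks"}}],
}
producer.send("ATLAS_HOOK", notification)
producer.flush()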
Roadmap
Ø Hooks for Spark, HBase and NiFi
Ø Column-level lineage for Hive
• create table dest as select (a + b) x, (c * d) y from source
• Columns a and b feed an Addition expression that produces x
Ø Export/Import of metadata
Contribute
Ø Project Website - http://atlas.apache.org/
Ø Dev Mailing List - dev@atlas.apache.org
Ø User Mailing List - user@atlas.apache.org
Ø JIRA link - https://issues.apache.org/jira/browse/ATLAS
Questions
