Governance using Apache Atlas: Why and How
Vimal Sharma, Apache Atlas Committer
Software Engineer, Hortonworks
Apache ID: svimal2106 (svimal2106@apache.org)
Apache Atlas: Project Details
Ø Apache Incubator project since May 2015
Ø Organizations: IBM, Hortonworks, Aetna, Merck, Target
Ø 3 releases in the last 9 months: 0.7 (July 2016), 0.7.1 (Jan 2017), 0.8 (Mar 2017)
Apache Atlas: Introduction
Ø Governance and metadata framework for Hadoop
Ø Model a component and capture its metadata
Ø Data Assets - Hive table, HBase column family
Ø Processes - Storm topology, Sqoop import
Ø Classification - tag metadata entities
Ø Built-in support for popular components
Ø Extensible architecture
Apache Atlas: Architecture
(Architecture diagram, summarized by layer)
Ø Core: Type System, Graph Abstraction/Engine (Titan), Metadata Store (HBase), Index Store (Solr)
Ø Integration: API (HTTP/REST), Messaging (Kafka)
Ø Apps: UI, Ranger Tag-Based Policies, Business Glossary (roadmap)
Ø Metadata Sources: Hive, Sqoop, Storm, Custom - metadata is ingested/exported via Kafka; apps query via search and the REST API
Governance Problem (Use Cases)
Ø Impact Analysis: assess the impact of modifying a table schema
Ø ETL Redundancy: avoid redundant processing
Ø Data Classification: impose security constraints on sensitive data
Ø Admin: identify candidate tables for archival
Cross Component Lineage
• Lineage: the upstream and downstream data assets of an entity
• Individual components each maintain their own metadata store
• Cross-component events must be stitched into a single view
• Atlas: flexibility to model arbitrary components (lineage query sketch below)
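A minimal sketch of a cross-component lineage query over Atlas's REST API. The server URL, credentials, and GUID are placeholders, and the v2 lineage endpoint assumes Atlas 0.8 or later:

import requests

ATLAS = "http://localhost:21000"   # placeholder Atlas server
AUTH = ("admin", "admin")          # placeholder credentials

def get_lineage(guid, direction="BOTH", depth=3):
    """Fetch upstream and downstream lineage for an entity by GUID."""
    resp = requests.get(
        f"{ATLAS}/api/atlas/v2/lineage/{guid}",
        params={"direction": direction, "depth": depth},
        auth=AUTH,
    )
    resp.raise_for_status()
    return resp.json()  # guidEntityMap plus relations (the lineage edges)

# Walk the edges around one entity, e.g. for impact analysis
for rel in get_lineage("<entity-guid>")["relations"]:
    print(rel["fromEntityId"], "->", rel["toEntityId"])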
Ranger Integration
• Ranger registers a listener for tag addition/deletion events from Atlas
• Attribute-based (tag) policies rather than asset-based policies
• Example tag: PII (tagging sketch below)
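A hedged sketch of tagging an entity so that Ranger's tag-based policies can apply, using the v2 classification endpoint; server details and the GUID are placeholders:

import requests

ATLAS = "http://localhost:21000"   # placeholder
AUTH = ("admin", "admin")          # placeholder

def tag_entity(guid, tag_name):
    """Attach an existing classification (tag) to an entity by GUID."""
    resp = requests.post(
        f"{ATLAS}/api/atlas/v2/entity/guid/{guid}/classifications",
        json=[{"typeName": tag_name}],
        auth=AUTH,
    )
    resp.raise_for_status()

# Ranger's tag-sync picks up the new classification and enforces the
# tag-based policy defined for "PII" (the tag type must already exist).
tag_entity("<entity-guid>", "PII")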
TypeSystem
• Model of the metadata to be stored
• Every type has:
Ø A unique name
Ø Attributes
Ø SuperTypes
• Attribute properties (illustrated below):
Ø Mandatory/Optional
Ø Unique
Ø Composite
Ø ReverseReference
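A sketch of how these properties surface in a v2 attribute definition. The field names follow the v2 typedef JSON shape; the constraint forms noted in the comments are stated as assumptions:

# One attribute definition annotated with the properties listed above
attribute_def = {
    "name": "qualifiedName",
    "typeName": "string",
    "isOptional": False,    # mandatory vs. optional
    "isUnique": True,       # unique across entities of this type
    "isIndexable": True,    # searchable via the index store
    "cardinality": "SINGLE",
    # Composite ownership and reverse references are declared as
    # constraints, e.g. {"type": "ownedRef"} or
    # {"type": "inverseRef", "params": {"attribute": "columns"}}
    "constraints": [],
}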
Atlas Base Types
Ø Referenceable: qualifiedName
Ø Asset (extends Referenceable): name, owner, description
Ø DataSet (extends Asset)
Ø Process (extends Asset): inputs, outputs
Spark Introduction
• RDD: the basic unit of execution
• DataFrame: a relational view over an RDD
• Let's model the DataFrame type! (PySpark sketch below)
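For context, a minimal PySpark sketch; the HDFS paths are the illustrative ones from the graph snapshot later in the deck:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("atlas-demo").getOrCreate()

# A DataFrame is a relational view over an RDD: named columns, a schema,
# and SQL-style operations on top of the underlying partitions.
df = spark.read.csv("/hdfs/source", header=True)
df.select("name", "id").write.csv("/hdfs/destination")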
DataFrame Type
Ø spark_dataframe (extends DataSet): source, destination, columns
Ø dataframe_column: type, comment, dataframe (reverse reference to columns)
(Registration sketch below.)
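A sketch of registering these two types through the v2 typedefs endpoint; the JSON shape follows the v2 API, while the server URL and credentials are placeholders:

import requests

ATLAS = "http://localhost:21000"   # placeholder
AUTH = ("admin", "admin")          # placeholder

typedefs = {
    "entityDefs": [
        {
            "name": "spark_dataframe",
            "superTypes": ["DataSet"],
            "attributeDefs": [
                {"name": "source", "typeName": "string", "isOptional": False},
                {"name": "destination", "typeName": "string", "isOptional": True},
                # Composite: the dataframe owns its columns
                {"name": "columns", "typeName": "array<dataframe_column>",
                 "isOptional": True, "cardinality": "SET",
                 "constraints": [{"type": "ownedRef"}]},
            ],
        },
        {
            "name": "dataframe_column",
            "superTypes": ["Referenceable"],
            "attributeDefs": [
                {"name": "type", "typeName": "string", "isOptional": False},
                {"name": "comment", "typeName": "string", "isOptional": True},
                # Reverse reference back to the owning dataframe
                {"name": "dataframe", "typeName": "spark_dataframe",
                 "isOptional": True,
                 "constraints": [{"type": "inverseRef",
                                  "params": {"attribute": "columns"}}]},
            ],
        },
    ]
}

resp = requests.post(f"{ATLAS}/api/atlas/v2/types/typedefs",
                     json=typedefs, auth=AUTH)
resp.raise_for_status()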
Graph Snapshot
(Graph diagram: two type vertices and three entity vertices; bulk-create sketch below)
Ø 1: Dataframe type vertex
Ø 2: Column type vertex
Ø 3: Dataframe entity - employeeInfo@Hortonworks, with source /hdfs/source and destination /hdfs/destination
Ø 4, 5: Column entities - name and id
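A sketch of creating the snapshot's three entity vertices in one bulk call; negative GUIDs are local placeholders that the server resolves on create, and the column qualified names are invented for illustration:

import requests

ATLAS = "http://localhost:21000"   # placeholder
AUTH = ("admin", "admin")          # placeholder

entities = {
    "entities": [
        {"guid": "-1", "typeName": "spark_dataframe",
         "attributes": {
             "qualifiedName": "employeeInfo@Hortonworks",
             "name": "employeeInfo",
             "source": "/hdfs/source",
             "destination": "/hdfs/destination",
             "columns": [{"guid": "-2", "typeName": "dataframe_column"},
                         {"guid": "-3", "typeName": "dataframe_column"}]}},
        {"guid": "-2", "typeName": "dataframe_column",
         "attributes": {"qualifiedName": "employeeInfo.name@Hortonworks",
                        "type": "string"}},
        {"guid": "-3", "typeName": "dataframe_column",
         "attributes": {"qualifiedName": "employeeInfo.id@Hortonworks",
                        "type": "int"}},
    ]
}

resp = requests.post(f"{ATLAS}/api/atlas/v2/entity/bulk",
                     json=entities, auth=AUTH)
resp.raise_for_status()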
Demo Example
Ø PayrollDetails (HDFS path)
Ø SalaryProcessor (DataFrame)
Ø EmployeeSalary (Kafka topic)
Ø VariableComponent (HDFS path)
(One possible lineage wiring is sketched below.)
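One way the demo's lineage could be expressed: a Process-derived entity lists the two HDFS paths as inputs and the Kafka topic as output, and Atlas derives the lineage graph from those attributes. The type names (etl_process, hdfs_path, kafka_topic) and qualified names here are hypothetical and depend on which models are installed:

import requests

# Hypothetical process entity tying the demo's datasets together; Atlas
# derives lineage edges from the inputs/outputs attributes inherited
# from the Process base type.
process_entity = {
    "entity": {
        "typeName": "etl_process",   # hypothetical Process subtype
        "attributes": {
            "qualifiedName": "SalaryProcessor@Hortonworks",
            "name": "SalaryProcessor",
            "inputs": [
                {"typeName": "hdfs_path",
                 "uniqueAttributes": {"qualifiedName": "PayrollDetails"}},
                {"typeName": "hdfs_path",
                 "uniqueAttributes": {"qualifiedName": "VariableComponent"}},
            ],
            "outputs": [
                {"typeName": "kafka_topic",
                 "uniqueAttributes": {"qualifiedName": "EmployeeSalary"}},
            ],
        },
    }
}

resp = requests.post("http://localhost:21000/api/atlas/v2/entity",
                     json=process_entity, auth=("admin", "admin"))
resp.raise_for_status()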
Hook Design
Ø Hive hook
Ø Multiple clients, e.g. Pig, Hive, Beeline
Ø Always send a full update to avoid inconsistency
Ø Synchronous vs. asynchronous communication
Ø Earlier: the hook communicated with the Atlas server directly
Ø Now: metadata entities are pushed to Kafka (producer sketch below)
Ø Unpartitioned Kafka topic
Ø Avoids out-of-order messages
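A sketch of the asynchronous hook pattern in Python (the real hooks are Java). ATLAS_HOOK is the topic Atlas consumes notifications from; the payload below is illustrative only, since the actual notification schema is version-specific:

import json
from kafka import KafkaProducer   # kafka-python client

# Instead of calling the Atlas server directly, a hook serializes entity
# metadata and produces it to the ATLAS_HOOK topic; keeping the topic
# unpartitioned (a single partition) preserves message order.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",   # placeholder broker
    value_serializer=lambda m: json.dumps(m).encode("utf-8"),
)

message = {   # illustrative payload, not the exact notification schema
    "type": "ENTITY_CREATE",
    "user": "demo",
    "entities": [{"typeName": "spark_dataframe",
                  "attributes": {"qualifiedName": "employeeInfo@Hortonworks"}}],
}

producer.send("ATLAS_HOOK", message)
producer.flush()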
Roadmap
Ø Hooks for Spark, HBase and NiFi
Ø Column-level lineage for Hive, e.g. create table dest as select (a + b), (c * d) from source
Ø Export/Import of metadata
Contribute
Ø Project Website - http://atlas.incubator.apache.org/
Ø Dev Mailing List - dev@atlas.incubator.apache.org
Ø User Mailing List - user@atlas.incubator.apache.org
Ø JIRA link - https://issues.apache.org/jira/browse/ATLAS
Questions
