Governance using
Apache Atlas: Why and How
Vimal Sharma, Apache Atlas PMC & Committer
Software Engineer, Hortonworks
Apache ID: svimal2106@apache.org
Apache Atlas : Project Details
Ø Entered the Apache Incubator in May 2015
Ø Organizations involved: IBM, Hortonworks, Aetna, Merck, Target
Ø Three releases in the last year
Ø Graduated to a Top Level Project in June 2017
Release timeline: 0.7 (July 2016) → 0.7.1 (Jan 2017) → 0.8 (Mar 2017) → TLP (June 2017)
Apache Atlas : Introduction
Ø Governance and metadata framework for Hadoop
Ø Model a component and capture its metadata
Ø Data Assets: Hive table, HBase column family
Ø Processes: Storm topology, Sqoop import
Ø Classification: tag metadata entities
Ø Built-in support for popular components
Ø Extensible Architecture
Apache Atlas: Architecture
Ø Core
• Type System
• Graph Abstraction/Engine (Titan)
• Metadata Store (HBase), Index Store (Solr)
• Ingest/Export, Search
Ø Integration
• API (HTTP/REST)
• Messaging (Kafka)
Ø Metadata Sources
• Hive, Sqoop, Storm, Custom
Ø Apps
• UI
• Business Glossary (Roadmap)
• Ranger Tag-Based Policies
Governance Problem (Use Cases)
Ø ETL Pipeline Failure Scenarios
• Upstream failure analysis
• Alerts to downstream processes
• Visual lineage of ETL pipelines
Ø Redundant Processing
• Does a derived dataset contain the required information?
• Can metadata classification be used to determine this?
• Avoid expensive processing
Use Cases
Ø Compliance and Security
• Impose security constraints on sensitive data
• Data can span multiple Hadoop components
• One policy to govern them all
Ø Cluster Admin
• Periodic cleanup of datasets
• Which datasets are unused or dormant?
• How should the relevance of a dataset be defined?
Cross-Component Lineage
• Lineage: the upstream and downstream data assets of a process
• Individual components keep their own metadata stores
• Cross-component events span these separate stores
• Atlas: the flexibility to model arbitrary components (lineage can then be read back over REST, as sketched below)
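A minimal sketch of reading lineage back through the Atlas 0.8 v2 REST API; the host, credentials, and GUID are placeholders, not values from the talk:

import requests

ATLAS = "http://localhost:21000/api/atlas/v2"   # assumed host/port
AUTH = ("admin", "admin")                       # placeholder credentials

guid = "..."  # GUID of a DataSet entity, e.g. obtained from a search
resp = requests.get(f"{ATLAS}/lineage/{guid}",
                    params={"direction": "BOTH", "depth": 3},
                    auth=AUTH)
lineage = resp.json()
# "relations" lists the lineage edges; "guidEntityMap" describes the nodes
for edge in lineage.get("relations", []):
    print(edge["fromEntityId"], "->", edge["toEntityId"])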
Ranger Integration
• Ranger listens for tag addition/deletion events from Atlas
• Attribute-based policies rather than asset-based policies
• Example: one policy covering every asset tagged PII (see the sketch below)
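A minimal sketch of the tagging side, assuming a PII classification type has already been defined and the same placeholder server and credentials as above; Ranger's tag-sync then enforces a single policy for every asset carrying the tag:

import requests

ATLAS = "http://localhost:21000/api/atlas/v2"
AUTH = ("admin", "admin")

guid = "..."  # GUID of the entity to tag, e.g. a hive_table
# Attach the PII classification to the entity
requests.post(f"{ATLAS}/entity/guid/{guid}/classifications",
              json=[{"typeName": "PII"}], auth=AUTH)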
Type System
• Model of metadata to be stored
• Every type has
Ø Unique Name
Ø Attributes
Ø SuperTypes
• Attribute properties (see the sketch below)
Ø Mandatory/Optional
Ø Unique
Ø Composite
Ø ReverseReference
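As a sketch of how these properties surface in a v2 attribute definition (field names per the Atlas 0.8 REST API; the attribute itself is illustrative, not from the talk):

attribute_def = {
    "name": "columns",
    "typeName": "array<dataframe_column>",
    "isOptional": True,                    # mandatory vs. optional
    "isUnique": False,                     # unique
    "cardinality": "LIST",
    # composite ownership and reverse references are modeled as constraints
    "constraints": [{"type": "ownedRef"}],
}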
Atlas Base Types
Ø Referenceable: qualifiedName
Ø Asset (extends Referenceable): name, owner, description
Ø DataSet (extends Asset)
Ø Process (extends Asset): inputs, outputs
Spark Introduction
• RDD: the basic unit of execution
• DataFrame: an RDD with a relational schema
• Let's model a DataFrame type!
DataFrame Type
Ø spark_dataframe (extends DataSet)
• Attributes: source, destination, columns (composite array of dataframe_column)
Ø dataframe_column
• Attributes: type, comment, dataframe (reverse reference to spark_dataframe)
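A minimal sketch of registering both types through the v2 typedefs endpoint; the server address, credentials, and exact attribute shapes are assumptions, not the talk's demo code:

import requests

ATLAS = "http://localhost:21000/api/atlas/v2"
AUTH = ("admin", "admin")

typedefs = {
    "entityDefs": [
        {
            "name": "spark_dataframe",
            "superTypes": ["DataSet"],
            "attributeDefs": [
                {"name": "source", "typeName": "string", "isOptional": False},
                {"name": "destination", "typeName": "string", "isOptional": False},
                {"name": "columns",
                 "typeName": "array<dataframe_column>",
                 "isOptional": True,
                 "cardinality": "LIST",
                 # composite: the dataframe owns its columns
                 "constraints": [{"type": "ownedRef"}]},
            ],
        },
        {
            "name": "dataframe_column",
            "superTypes": ["Referenceable"],
            "attributeDefs": [
                {"name": "type", "typeName": "string", "isOptional": False},
                {"name": "comment", "typeName": "string", "isOptional": True},
                {"name": "dataframe",
                 "typeName": "spark_dataframe",
                 "isOptional": True,
                 # reverse reference back to the owning dataframe
                 "constraints": [{"type": "inverseRef",
                                  "params": {"attribute": "columns"}}]},
            ],
        },
    ],
}
requests.post(f"{ATLAS}/types/typedefs", json=typedefs, auth=AUTH)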
Graph Snapshot
Ø 1: Dataframe Type
Ø 2: Column Type
Ø 3: Dataframe Entity (employeeInfo@Hortonworks, source /hdfs/source, destination /hdfs/destination)
Ø 4, 5: Column Entities (name, id)
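Creating the dataframe entity from the snapshot could then look like this minimal sketch (column entities elided for brevity; same placeholder server and credentials as before):

import requests

ATLAS = "http://localhost:21000/api/atlas/v2"
AUTH = ("admin", "admin")

entity = {
    "entity": {
        "typeName": "spark_dataframe",
        "attributes": {
            "qualifiedName": "employeeInfo@Hortonworks",
            "name": "employeeInfo",
            "source": "/hdfs/source",
            "destination": "/hdfs/destination",
            # the name/id column entities are elided here
        },
    }
}
resp = requests.post(f"{ATLAS}/entity", json=entity, auth=AUTH)
print(resp.json())  # response carries the GUIDs assigned to created entities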
Demo Example
Ø PayrollDetails (HDFS path)
Ø VariableComponent (HDFS path)
Ø SalaryProcessor (DataFrame)
Ø EmployeeSalary (Kafka topic)
Hook Design
Ø Hive Hook
• Multiple clients, e.g. Pig, Hive CLI, Beeline
• Always send a full update to avoid inconsistency
Ø Synchronous vs asynchronous communication
• Earlier: the hook communicated with the server directly
• Now: metadata entities are pushed to Kafka (see the sketch below)
Ø Single-partition Kafka topic
• Avoids out-of-order messages
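A rough sketch of the asynchronous path using the kafka-python client (an assumption; the real hooks are Java, and the message envelope below is simplified, not the exact Atlas wire format):

from kafka import KafkaProducer  # pip install kafka-python
import json

# Hooks publish entity notifications to the ATLAS_HOOK topic; keeping it
# to a single partition preserves strict message ordering.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed broker address
    value_serializer=lambda m: json.dumps(m).encode("utf-8"),
)
notification = {
    "type": "ENTITY_CREATE",  # simplified envelope, for illustration only
    "entities": [{"typeName": "spark_dataframe",
                  "attributes": {"qualifiedName": "employeeInfo@Hortonworks"}}],
}
producer.send("ATLAS_HOOK", notification)
producer.flush()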
Roadmap
Ø Hooks for Spark, HBase and NiFi
Ø Column-level lineage for Hive
• create table dest as select (a + b) x, (c * d) y from source
• Columns a and b feed an Addition expression that produces x
Ø Export/Import of metadata
Contribute
Ø Project Website - http://atlas.apache.org/
Ø Dev Mailing List - dev@atlas.apache.org
Ø User Mailing List - user@atlas.apache.org
Ø JIRA link - https://issues.apache.org/jira/browse/ATLAS
Questions
