Governance using Apache Atlas: Why and How
Vimal Sharma, Apache Atlas Committer
Software Engineer, Hortonworks
Apache ID: svimal2106 (svimal2106@apache.org)
Apache Atlas: Project Details
Ø Apache Incubator project since May 2015
Ø Organizations: IBM, Hortonworks, Aetna, Merck, Target
Ø 3 releases in the last 9 months: 0.7 (July 2016), 0.7.1 (Jan 2017), 0.8 (Mar 2017)
Apache Atlas: Introduction
Ø Governance and metadata framework for Hadoop
Ø Model a component and capture its metadata
Ø Data Assets - Hive table, HBase column family
Ø Processes - Storm topology, Sqoop import
Ø Classification - tag metadata entities
Ø Built-in support for popular components
Ø Extensible architecture
Apache Atlas: Architecture
(Architecture diagram, summarized by layer)
Ø Core: Type System, Graph Abstraction/Engine (Titan), Metadata Store (HBase), Index Store (Solr)
Ø Integration: API (HTTP/REST), Messaging (Kafka)
Ø Apps: UI, Ranger Tag-Based Policies, Business Glossary (roadmap)
Ø Metadata Sources: Hive, Sqoop, Storm, Custom - metadata is ingested/exported via Kafka; apps query via search and the REST API
Governance Problem (Use Cases)
Ø Impact Analysis: assess the impact of modifying a table schema
Ø ETL Redundancy: avoid redundant processing
Ø Data Classification: impose security constraints on sensitive data
Ø Admin: identify candidate tables for archival
Cross Component Lineage
• Lineage: the upstream and downstream data assets of an entity
• Individual components each maintain their own metadata store
• Cross-component events must be stitched into a single view
• Atlas: flexibility to model arbitrary components (lineage query sketch below)
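A minimal sketch of a cross-component lineage query over Atlas's REST API. The server URL, credentials, and GUID are placeholders, and the v2 lineage endpoint assumes Atlas 0.8 or later:

import requests

ATLAS = "http://localhost:21000"   # placeholder Atlas server
AUTH = ("admin", "admin")          # placeholder credentials

def get_lineage(guid, direction="BOTH", depth=3):
    """Fetch upstream and downstream lineage for an entity by GUID."""
    resp = requests.get(
        f"{ATLAS}/api/atlas/v2/lineage/{guid}",
        params={"direction": direction, "depth": depth},
        auth=AUTH,
    )
    resp.raise_for_status()
    return resp.json()  # guidEntityMap plus relations (the lineage edges)

# Walk the edges around one entity, e.g. for impact analysis
for rel in get_lineage("<entity-guid>")["relations"]:
    print(rel["fromEntityId"], "->", rel["toEntityId"])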
Ranger Integration
• Ranger registers a listener for tag addition/deletion events from Atlas
• Attribute-based (tag) policies rather than asset-based policies
• Example tag: PII (tagging sketch below)
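A hedged sketch of tagging an entity so that Ranger's tag-based policies can apply, using the v2 classification endpoint; server details and the GUID are placeholders:

import requests

ATLAS = "http://localhost:21000"   # placeholder
AUTH = ("admin", "admin")          # placeholder

def tag_entity(guid, tag_name):
    """Attach an existing classification (tag) to an entity by GUID."""
    resp = requests.post(
        f"{ATLAS}/api/atlas/v2/entity/guid/{guid}/classifications",
        json=[{"typeName": tag_name}],
        auth=AUTH,
    )
    resp.raise_for_status()

# Ranger's tag-sync picks up the new classification and enforces the
# tag-based policy defined for "PII" (the tag type must already exist).
tag_entity("<entity-guid>", "PII")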
TypeSystem
• Model of the metadata to be stored
• Every type has:
Ø A unique name
Ø Attributes
Ø SuperTypes
• Attribute properties (illustrated below):
Ø Mandatory/Optional
Ø Unique
Ø Composite
Ø ReverseReference
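A sketch of how these properties surface in a v2 attribute definition. The field names follow the v2 typedef JSON shape; the constraint forms noted in the comments are stated as assumptions:

# One attribute definition annotated with the properties listed above
attribute_def = {
    "name": "qualifiedName",
    "typeName": "string",
    "isOptional": False,    # mandatory vs. optional
    "isUnique": True,       # unique across entities of this type
    "isIndexable": True,    # searchable via the index store
    "cardinality": "SINGLE",
    # Composite ownership and reverse references are declared as
    # constraints, e.g. {"type": "ownedRef"} or
    # {"type": "inverseRef", "params": {"attribute": "columns"}}
    "constraints": [],
}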
Atlas Base Types
Ø Referenceable: qualifiedName
Ø Asset (extends Referenceable): name, owner, description
Ø DataSet (extends Asset)
Ø Process (extends Asset): inputs, outputs
Spark Introduction
• RDD: the basic unit of execution
• DataFrame: a relational view over an RDD
• Let's model the DataFrame type! (PySpark sketch below)
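For context, a minimal PySpark sketch; the HDFS paths are the illustrative ones from the graph snapshot later in the deck:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("atlas-demo").getOrCreate()

# A DataFrame is a relational view over an RDD: named columns, a schema,
# and SQL-style operations on top of the underlying partitions.
df = spark.read.csv("/hdfs/source", header=True)
df.select("name", "id").write.csv("/hdfs/destination")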
DataFrame Type
Ø spark_dataframe (extends DataSet): source, destination, columns
Ø dataframe_column: type, comment, dataframe (reverse reference to columns)
(Registration sketch below.)
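A sketch of registering these two types through the v2 typedefs endpoint; the JSON shape follows the v2 API, while the server URL and credentials are placeholders:

import requests

ATLAS = "http://localhost:21000"   # placeholder
AUTH = ("admin", "admin")          # placeholder

typedefs = {
    "entityDefs": [
        {
            "name": "spark_dataframe",
            "superTypes": ["DataSet"],
            "attributeDefs": [
                {"name": "source", "typeName": "string", "isOptional": False},
                {"name": "destination", "typeName": "string", "isOptional": True},
                # Composite: the dataframe owns its columns
                {"name": "columns", "typeName": "array<dataframe_column>",
                 "isOptional": True, "cardinality": "SET",
                 "constraints": [{"type": "ownedRef"}]},
            ],
        },
        {
            "name": "dataframe_column",
            "superTypes": ["Referenceable"],
            "attributeDefs": [
                {"name": "type", "typeName": "string", "isOptional": False},
                {"name": "comment", "typeName": "string", "isOptional": True},
                # Reverse reference back to the owning dataframe
                {"name": "dataframe", "typeName": "spark_dataframe",
                 "isOptional": True,
                 "constraints": [{"type": "inverseRef",
                                  "params": {"attribute": "columns"}}]},
            ],
        },
    ]
}

resp = requests.post(f"{ATLAS}/api/atlas/v2/types/typedefs",
                     json=typedefs, auth=AUTH)
resp.raise_for_status()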
Graph Snapshot
(Graph diagram: two type vertices and three entity vertices; bulk-create sketch below)
Ø 1: Dataframe type vertex
Ø 2: Column type vertex
Ø 3: Dataframe entity - employeeInfo@Hortonworks, with source /hdfs/source and destination /hdfs/destination
Ø 4, 5: Column entities - name and id
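A sketch of creating the snapshot's three entity vertices in one bulk call; negative GUIDs are local placeholders that the server resolves on create, and the column qualified names are invented for illustration:

import requests

ATLAS = "http://localhost:21000"   # placeholder
AUTH = ("admin", "admin")          # placeholder

entities = {
    "entities": [
        {"guid": "-1", "typeName": "spark_dataframe",
         "attributes": {
             "qualifiedName": "employeeInfo@Hortonworks",
             "name": "employeeInfo",
             "source": "/hdfs/source",
             "destination": "/hdfs/destination",
             "columns": [{"guid": "-2", "typeName": "dataframe_column"},
                         {"guid": "-3", "typeName": "dataframe_column"}]}},
        {"guid": "-2", "typeName": "dataframe_column",
         "attributes": {"qualifiedName": "employeeInfo.name@Hortonworks",
                        "type": "string"}},
        {"guid": "-3", "typeName": "dataframe_column",
         "attributes": {"qualifiedName": "employeeInfo.id@Hortonworks",
                        "type": "int"}},
    ]
}

resp = requests.post(f"{ATLAS}/api/atlas/v2/entity/bulk",
                     json=entities, auth=AUTH)
resp.raise_for_status()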
Demo Example
Ø PayrollDetails (HDFS path)
Ø SalaryProcessor (DataFrame)
Ø EmployeeSalary (Kafka topic)
Ø VariableComponent (HDFS path)
(One possible lineage wiring is sketched below.)
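One way the demo's lineage could be expressed: a Process-derived entity lists the two HDFS paths as inputs and the Kafka topic as output, and Atlas derives the lineage graph from those attributes. The type names (etl_process, hdfs_path, kafka_topic) and qualified names here are hypothetical and depend on which models are installed:

import requests

# Hypothetical process entity tying the demo's datasets together; Atlas
# derives lineage edges from the inputs/outputs attributes inherited
# from the Process base type.
process_entity = {
    "entity": {
        "typeName": "etl_process",   # hypothetical Process subtype
        "attributes": {
            "qualifiedName": "SalaryProcessor@Hortonworks",
            "name": "SalaryProcessor",
            "inputs": [
                {"typeName": "hdfs_path",
                 "uniqueAttributes": {"qualifiedName": "PayrollDetails"}},
                {"typeName": "hdfs_path",
                 "uniqueAttributes": {"qualifiedName": "VariableComponent"}},
            ],
            "outputs": [
                {"typeName": "kafka_topic",
                 "uniqueAttributes": {"qualifiedName": "EmployeeSalary"}},
            ],
        },
    }
}

resp = requests.post("http://localhost:21000/api/atlas/v2/entity",
                     json=process_entity, auth=("admin", "admin"))
resp.raise_for_status()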
Hook Design
Ø Hive hook
Ø Multiple clients, e.g. Pig, Hive, Beeline
Ø Always send a full update to avoid inconsistency
Ø Synchronous vs. asynchronous communication
Ø Earlier: the hook communicated with the Atlas server directly
Ø Now: metadata entities are pushed to Kafka (producer sketch below)
Ø Unpartitioned Kafka topic
Ø Avoids out-of-order messages
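A sketch of the asynchronous hook pattern in Python (the real hooks are Java). ATLAS_HOOK is the topic Atlas consumes notifications from; the payload below is illustrative only, since the actual notification schema is version-specific:

import json
from kafka import KafkaProducer   # kafka-python client

# Instead of calling the Atlas server directly, a hook serializes entity
# metadata and produces it to the ATLAS_HOOK topic; keeping the topic
# unpartitioned (a single partition) preserves message order.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",   # placeholder broker
    value_serializer=lambda m: json.dumps(m).encode("utf-8"),
)

message = {   # illustrative payload, not the exact notification schema
    "type": "ENTITY_CREATE",
    "user": "demo",
    "entities": [{"typeName": "spark_dataframe",
                  "attributes": {"qualifiedName": "employeeInfo@Hortonworks"}}],
}

producer.send("ATLAS_HOOK", message)
producer.flush()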
Roadmap
Ø Hooks for Spark, HBase and NiFi
Ø Column-level lineage for Hive, e.g. create table dest as select (a + b), (c * d) from source
Ø Export/Import of metadata
Contribute
Ø Project Website - http://atlas.incubator.apache.org/
Ø Dev Mailing List - dev@atlas.incubator.apache.org
Ø User Mailing List - user@atlas.incubator.apache.org
Ø JIRA link - https://issues.apache.org/jira/browse/ATLAS
Questions
