Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise

© Hortonworks Inc. 2011 – 2016. All Rights Reserved
June 28, 2016
Apache Atlas

Disclaimer
This document may contain product features and technology directions that are under development, may be
under development in the future or may ultimately not be developed.
Project capabilities are based on information that is publicly available within the Apache Software Foundation
project websites ("Apache"). Progress of the project capabilities can be tracked from inception to release
through Apache; however, technical feasibility, market demand, user feedback and the overarching Apache
Software Foundation community development process can all affect timing and final delivery.
This document’s description of these features and technology directions does not represent a contractual
commitment, promise or obligation from Hortonworks to deliver these features in any generally available
product.
Product features and technology directions are subject to change, and must not be included in contracts,
purchase orders, or sales agreements of any kind.
Since this document contains an outline of general product development plans, customers should not rely upon it
when making purchasing decisions.

Atlas Data Governance
Organizations need data governance to understand their information and to answer
questions such as:
• What do we know about our information?
• Where did this data come from and who can use it?
• Does this data adhere to company policies and rules?

Vision - Enterprise Data Governance Across Platforms
[Diagram: structured and unstructured data across traditional RDBMS, MPP appliances, and the data lake (Projects 1, 3, 4, 5, and 6), with metadata shown alongside the platforms]
Atlas: Metadata Truth in Hadoop
• Data Management along the entire data lifecycle with integrated provenance and lineage capability
• Modeling with Metadata enables comprehensive data lineage through a hybrid approach with enhanced tagging and attribute capabilities
• Interoperable Solutions across the Hadoop ecosystem, through a common metadata store

Apache Atlas Overview

Atlas Data Governance
Data governance practices provide a holistic approach to managing,
improving and leveraging information to help you gain insight and build
confidence in business decisions and operations.
Atlas helps customers discover information about data objects, their
meaning, location, characteristics, and usage.

Atlas timeline: from DGI to present
• Dec 2014: DGI group kickoff
• May 2015: Apache Atlas incubation
• July 2015: HDP 2.3, Foundation GA release (first kickoff to GA in 7 months)
• Jan 2016: HDP 2.4, with Kafka/Storm, Sqoop, Falcon, and tag-based security
• Summer 2016: HDP 2.5, with Business Catalog, AD integration, and versioning
Global Financial Company
* DGI: Data Governance Initiative
Key Benefits:
• Co-Dev = Built for real customer use cases
• Faster & Safer = Customers know business + HWX knows Hadoop

Big Data Management Through Metadata
Management Scalability
Many traditional tools and patterns do not scale when applied to multi-tenant data lakes. Many enterprises have siloed data and metadata stores that collide in the data lake. This is compounded by the ability to retain data over very large windows (years). Can traditional EDW tools manage 100 million entities effectively with room to grow?
Metadata Tools
Scalable, decoupled, decentralized management driven through metadata is the only viable solution. This allows quick integration with automation and other metamodels.
Tags for Management, Discovery and Security
Proper metadata is the foundation for business taxonomy, stewardship, attribute-based security and self-service.
Key Benefits:
Modern data lakes need new ways to govern because:
• Cost – Traditional staff ratio to data size not possible
• Diversity – Only way to manage velocity of new datasets
• Agility – Quick change based on tags / taxonomy

High Level Architecture: 4 Key points
[Diagram labels: Type System, Repository, Search DSL, Bridge; REST API; Graph DB, Search; Messaging Framework (Kafka); Connectors: Hive, Storm, Falcon, Sqoop, Custom]
1 Data Lineage: Only product that captures lineage across Hadoop components at platform level.
2 Agile Data Modeling: Type system allows custom metadata structures in a hierarchy taxonomy.
3 REST API: Modern, flexible access to Atlas services, HDP components, UI & external tools.
4 Exchange: Leverage existing metadata / models by importing them from current tools. Export metadata to downstream systems.
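
The Agile Data Modeling point, custom metadata structures arranged in a hierarchy, can be illustrated with a toy model. The sketch below is not the Atlas type system or its API; `TypeDef`, `TypeRegistry`, and the type and attribute names are hypothetical, chosen only to show how a custom type could inherit attributes from the types above it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TypeDef:
    """A toy metadata type: a name, its own attributes, and optional supertypes."""
    name: str
    attributes: tuple = ()
    supertypes: tuple = ()


class TypeRegistry:
    """Holds type definitions and resolves inherited attributes (illustrative only)."""

    def __init__(self):
        self._types = {}

    def register(self, typedef):
        # A type may only extend types that are already registered.
        for parent in typedef.supertypes:
            if parent not in self._types:
                raise KeyError(f"unknown supertype: {parent}")
        self._types[typedef.name] = typedef

    def all_attributes(self, name):
        """Return inherited attributes first, then the type's own, without duplicates."""
        typedef = self._types[name]
        resolved = []
        for parent in typedef.supertypes:
            for attr in self.all_attributes(parent):
                if attr not in resolved:
                    resolved.append(attr)
        for attr in typedef.attributes:
            if attr not in resolved:
                resolved.append(attr)
        return resolved


registry = TypeRegistry()
registry.register(TypeDef("Asset", attributes=("name", "owner")))
registry.register(TypeDef("DataSet", attributes=("description",), supertypes=("Asset",)))
# A hypothetical custom type added by a data team without touching the base types.
registry.register(
    TypeDef("sensor_feed", attributes=("device_id", "retention_days"), supertypes=("DataSet",))
)

print(registry.all_attributes("sensor_feed"))
```

The custom `sensor_feed` type picks up `name`, `owner`, and `description` from its ancestors, which is the property that lets new dataset kinds be modeled quickly.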

Governance Ready Certification Program
[Diagram: partner capability areas: Discovery, Tagging, Prep / Cleanse, ETL, Governance, BPM, Self Service, Visualization]
• Choice: Customers choose the features that they want to deploy, a la carte versus vendor lock-in
• Curated & Fast: Selected group of vendor partners to provide rich, complementary and complete features ready to deploy
• Agile: Low switching costs, faster deployment and innovation
• Centralized: Common SLA & common open metadata store
• Flexibility: Interoperability of products through Atlas metadata
• Safe: HDP at core to provide stability and interoperability

Governance Ready Certification Program
Completed:
• Waterline
• Dataguise
• Attivo
Next:
• SAP ILM, VORA
• IBM IGC
Work in progress:
• Collibra
• Alation
• Meta Integration (Miti)
• Paxata
• Syncsort
• Trifacta

Near Term Roadmap:
Summer 2016

Summer 2016 Release Summary
• Dynamic Access Policies
• Cross component lineage
• Enterprise Readiness
• Business Catalog
(Three of the items above are marked "Differentiator" on the slide.)

Dynamic Access Policy
Apache Ranger + Atlas Integration

Summary of Dynamic Access Policies
• Basic tag policy – PII example. Permission is mapped to a reusable tag, not to a resource.
• Geo-based policy – Policy based on IP address mappings. Rule enforcement is dynamically geo-aware.
• Time-based policy – Timer for data access, for resource management and compliance reporting.
• Prohibitions – Prevention of toxic combinations of Hive tables or columns that may pose a risk together.
Key Benefits:
• New scalable metadata-based security paradigm
• Dynamic, real-time policy
• Automatically updates to changes in metadata
• Centralized and simple to manage policy
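
The four policy kinds above can be sketched as conditions attached to a tag rather than to a resource. This is a minimal illustration, not Ranger's policy model or API: `TagPolicy`, `is_allowed`, and `violates_prohibition` are hypothetical names, and CIDR blocks stand in for the IP address mappings that a real geo-based policy would use.

```python
from dataclasses import dataclass
from datetime import datetime
from ipaddress import ip_address, ip_network
from typing import Optional


@dataclass(frozen=True)
class TagPolicy:
    tag: str                                 # basic tag policy: permission hangs off the tag
    allowed_groups: frozenset
    allowed_networks: tuple = ()             # geo-based stand-in: CIDR blocks, empty means any
    valid_until: Optional[datetime] = None   # time-based: access expires after this instant


def is_allowed(policy, asset_tags, user_groups, client_ip, now):
    """Return True only if the policy applies to the asset and every condition holds."""
    if policy.tag not in asset_tags:
        return False  # the policy says nothing about this asset
    if not (policy.allowed_groups & set(user_groups)):
        return False
    if policy.allowed_networks and not any(
        ip_address(client_ip) in ip_network(net) for net in policy.allowed_networks
    ):
        return False
    if policy.valid_until is not None and now > policy.valid_until:
        return False
    return True


def violates_prohibition(requested_tags, prohibited_combinations):
    """Prohibition: True if the request combines tags that must not be read together."""
    return any(combo <= requested_tags for combo in prohibited_combinations)


policy = TagPolicy(
    tag="PII",
    allowed_groups=frozenset({"hr_analysts"}),
    allowed_networks=("10.20.0.0/16",),
    valid_until=datetime(2016, 12, 31),
)

in_office = is_allowed(policy, {"PII"}, ["hr_analysts"], "10.20.5.7", datetime(2016, 7, 1))
elsewhere = is_allowed(policy, {"PII"}, ["hr_analysts"], "192.168.1.9", datetime(2016, 7, 1))
expired = is_allowed(policy, {"PII"}, ["hr_analysts"], "10.20.5.7", datetime(2017, 1, 1))
toxic = violates_prohibition({"PII", "FINANCIAL"}, [frozenset({"PII", "FINANCIAL"})])
```

Because every condition is evaluated at request time against the asset's current tags, a change in metadata changes the outcome without anyone editing the policy.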

How does Atlas work with Ranger at scale?
Atlas provides: Metadata
• Business classification (taxonomy): Company > HR > Driver
• Hierarchy with inheritance of attributes to child objects: a sensitive “PII” tag on department HR will be inherited by group HR > Driver
• Atlas will notify Ranger of changes via a Kafka topic
[Diagram: Apache Atlas connected to Hive, Ranger, Falcon, Kafka, and Storm. Atlas provides the metadata tag to create policies.]
Ranger provides: Access & Entitlements
• Ranger will cache tags and asset mappings for performance
• Ranger will have policies based on tags instead of roles
• Example: PII = <group>. This can work for many assets.
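
The inheritance behaviour described above, where a tag on HR also applies to HR > Driver, can be sketched with a small tree. This is an illustrative model only, not Atlas code; `TaxonomyNode` and its methods are hypothetical names.

```python
class TaxonomyNode:
    """One term in a business taxonomy; tags set here also apply to all descendants."""

    def __init__(self, name, parent=None, tags=()):
        self.name = name
        self.parent = parent
        self.own_tags = set(tags)

    def path(self):
        """Render the term the way the slide does, for example 'Company > HR > Driver'."""
        parts = []
        node = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return " > ".join(reversed(parts))

    def effective_tags(self):
        """Union of this node's tags and every ancestor's tags."""
        tags = set()
        node = self
        while node is not None:
            tags |= node.own_tags
            node = node.parent
        return tags


company = TaxonomyNode("Company")
hr = TaxonomyNode("HR", parent=company, tags={"PII"})
driver = TaxonomyNode("Driver", parent=hr)
finance = TaxonomyNode("Finance", parent=company)

print(driver.path(), sorted(driver.effective_tags()))
```

Tagging `HR` once is enough for `Driver` to be treated as PII, while the sibling `Finance` branch is unaffected.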

Scalable Access Control – Reusable Tag Policy
[Diagram: a single admin group writes one policy that maps a user group (AD, Linux) to an Atlas tag (PII) rather than to individual resources (files, tables, topologies). Many stewards assign the tag to assets.]
• ANY asset tagged PII is covered: files, tables, topologies
• Single point of enforcement and audit
• All future tagging is covered by existing policy
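
The claim that future tagging is covered by an existing policy can be shown in a few lines. The sketch below is a toy under stated assumptions: `TagStore`, `TAG_POLICIES`, `can_read`, and the asset names are hypothetical, and untagged assets are treated as readable only to keep the example short (a real deployment would also apply resource-based policies).

```python
class TagStore:
    """Toy stand-in for the tag-to-asset mapping that stewards maintain."""

    def __init__(self):
        self._tags_by_asset = {}

    def tag(self, asset, tag):
        self._tags_by_asset.setdefault(asset, set()).add(tag)

    def tags_of(self, asset):
        return self._tags_by_asset.get(asset, set())


# One policy per tag, written once by the admin group: tag -> groups allowed to read.
TAG_POLICIES = {"PII": {"privacy_office"}}


def can_read(store, asset, user_groups):
    """Deny if any tag on the asset carries a policy the user's groups do not satisfy."""
    for tag in store.tags_of(asset):
        allowed = TAG_POLICIES.get(tag)
        if allowed is not None and not (allowed & set(user_groups)):
            return False
    return True


store = TagStore()
store.tag("hdfs:/data/hr/payroll.csv", "PII")   # a file
store.tag("hive:hr.employees", "PII")           # a table

# A topology that nobody has tagged yet is not restricted by the PII policy.
before = can_read(store, "storm:clickstream_topology", ["marketing"])

# A steward tags it later; the policy itself is never edited.
store.tag("storm:clickstream_topology", "PII")
after = can_read(store, "storm:clickstream_topology", ["marketing"])
```

The policy table never changes between the two checks; only the tag assignment does, which is why one admin group can keep up with many stewards.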

Automatic update of policies – active protection
[Diagram: the Atlas metastore (tags, assets, entities) publishes metadata updates through its notification framework to Kafka topics. On the Ranger side, an Atlas client subscribes to the topic, gets the metadata updates, and feeds the PDP resource cache.]
• Notification: metadata updates
• Message durability
• Optimized for speed
• Event-driven updates
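
The event-driven update path can be sketched without a broker. In the example below an in-memory `queue.Queue` stands in for the Kafka topic, and the event fields (`op`, `asset`, `tag`) are a hypothetical message shape, not the actual Atlas notification format. `TagCache` and `drain` are illustrative names for the subscribing side.

```python
import json
import queue


class TagCache:
    """Toy stand-in for the enforcement side's local cache of asset -> tags."""

    def __init__(self):
        self._tags = {}

    def apply(self, event):
        asset, tag = event["asset"], event["tag"]
        if event["op"] == "TAG_ADDED":
            self._tags.setdefault(asset, set()).add(tag)
        elif event["op"] == "TAG_REMOVED":
            self._tags.get(asset, set()).discard(tag)

    def tags_of(self, asset):
        return set(self._tags.get(asset, ()))


def drain(topic, cache):
    """Apply every pending event in arrival order; return how many were processed."""
    processed = 0
    while True:
        try:
            message = topic.get_nowait()
        except queue.Empty:
            return processed
        cache.apply(json.loads(message))
        processed += 1


# The metadata side publishes changes as they happen.
topic = queue.Queue()
topic.put(json.dumps({"op": "TAG_ADDED", "asset": "hive:hr.employees", "tag": "PII"}))
topic.put(json.dumps({"op": "TAG_ADDED", "asset": "hive:hr.employees", "tag": "SENSITIVE"}))
topic.put(json.dumps({"op": "TAG_REMOVED", "asset": "hive:hr.employees", "tag": "SENSITIVE"}))

# The enforcement side consumes them and answers lookups from its own cache.
cache = TagCache()
count = drain(topic, cache)
```

Access decisions then read from the local cache rather than calling the metadata store on every request, which is the speed benefit the slide points to.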

Hadoop Cross Component
Data Lineage

Apache Atlas Component Integration
• Cross-component dataset lineage. Centralized location for all metadata inside HDP
• Single interface point for metadata exchange with platforms outside of HDP
[Diagram: Apache Atlas integrations with Hive, Ranger, Falcon, Sqoop, Storm, Kafka, Spark, NiFi, and HBase, grouped by release: HDP 2.3, HDP 2.5, and beyond HDP 2.5]

Summary of Data Lineage
Users in the upcoming release of HDP 2.5 will be able to track lineage across the following components using Atlas:
• Sqoop - Import from and export to relational databases, and any additional package that leverages Sqoop. ATLAS-184, SQOOP-2609
• Hive - Dataset lineage with entity versioning (including schema changes). ATLAS-75, ATLAS-183, ATLAS-492
• Kafka / Storm - IoT event-level processing, such as syslogs or sensor data. ATLAS-181, ATLAS-183, STORM-1381
• Falcon - Data lifecycle at Feed and Process entity level for replication and repeating workflows. Tracks periodicity, throttling, and eviction. ATLAS-69, FALCON-1570
Key Benefits:
• Enterprises need open solutions, not a single app vendor
• More native connectors than anyone else, with more coming
• Hardened metadata infrastructure
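
Cross-component lineage amounts to a graph in which processes connect input datasets to output datasets, whatever component ran them. The sketch below is a minimal illustration of walking such a graph upstream; it is not how Atlas stores or queries lineage, and the edge list, dataset names, and `upstream` function are hypothetical.

```python
from collections import deque

# Each edge reads: process, the dataset it consumed, the dataset it produced.
LINEAGE_EDGES = [
    ("sqoop_import", "rdbms:orders", "hive:staging.orders"),
    ("hive_query", "hive:staging.orders", "hive:mart.daily_orders"),
    ("storm_topology", "kafka:sensor_events", "hive:staging.sensor"),
    ("hive_query", "hive:staging.sensor", "hive:mart.daily_orders"),
]


def upstream(dataset, edges):
    """Breadth-first walk from a dataset back to everything it was derived from.

    Returns (source_dataset, producing_process) pairs, nearest sources first.
    """
    producers = {}
    for process, source, target in edges:
        producers.setdefault(target, []).append((process, source))

    seen = {dataset}
    order = []
    pending = deque([dataset])
    while pending:
        current = pending.popleft()
        for process, source in producers.get(current, []):
            if source not in seen:
                seen.add(source)
                order.append((source, process))
                pending.append(source)
    return order


origins = upstream("hive:mart.daily_orders", LINEAGE_EDGES)
for source, process in origins:
    print(f"{source} (via {process})")
```

Starting from one Hive table, the walk crosses Hive, Sqoop, Storm, and Kafka boundaries, which is the kind of trace that helps when an error has to be followed back to its source.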

Expanded Native Connector: Dataset Lineage
[Diagram labels: RDBMS, Sqoop, Teradata Connector, Apache Kafka, Custom Activity Reporter, Metadata Repository]
• Any process using Sqoop is covered
• No other tool tracks IoT out of the box


Enterprise Readiness

Security/Enterprise Readiness
• Highly reliable and scalable components
• Authorization with AD via Ranger
• Rolling upgrade support in HDP 2.5+
• Business continuity (BC) and disaster recovery (DR) capabilities
• 5x performance improvement over the previous version

Enterprise Readiness: Scalable and Highly Reliable Components
[Diagram labels: the Atlas architecture (Type System, Repository, Search DSL, Bridge, REST API, Graph DB, Search, Messaging Framework, and connectors for Hive, Storm, Falcon, Sqoop, and custom sources) shown with Solr Cloud, Kafka Quorum, and HBase]


Business Taxonomy (Catalog)

Key Concepts
Business Taxonomy (Catalog)
The practice and science of classification of things or concepts, including the principles that underlie such classification. The business organization model is hierarchical, making it authoritative with no duplication.
Data Lineage (Provenance)
Data lineage is defined as a data life cycle that includes the data's origins and where it moves over time. It describes what happens to data as it goes through diverse processes. It helps provide visibility into the analytics pipeline and simplifies tracing errors back to their sources.
Tags: Traits vs. Labels vs. Business Taxonomy
Atlas has tags that are authoritative and prevent duplication. A tag can span different parts of the business taxonomy. A PII tag can be used in HR as well as in Finance or Sales.
Benefits:
• A view of data assets organized by business language
• Impact analysis, compliance, acceptable use
• Common tags through Hadoop components

Taxonomies Benefits:
• Search / Discovery – Business catalog of
conceptual, logical and physical assets
• Security – Dynamic metadata-based access control

Research: Focused on Hadoop
We conduct open-ended user interviews so that we can learn more about who our users are and what their needs are. This helps us validate whether or not we’re solving the right problem.

Usability Testing
We test our prototype in InVision, a click-through prototyping tool that allows users to interact with static mockups.

Principal Roles & Activities
• Data Steward – Curator, responsible for catalog veracity
• Data Scientist – Analyst, primary consumer of Business Catalog
• Administrator – Role management only
• Data Engineer – Data ingress and egress, semantic data quality
Slide callouts:
• 50% - 80%+ of time spent looking for data
• Profit center
• Primary user of Atlas
• Enables scientist
Goal: < 25% of time spent on finding data = empowering scientists to spend their time uncovering insights faster

Atlas Value
• Designed for Hadoop at platform, not application level
• High Confidence data in Hadoop for regulated verticals
• Compliance and business objectives aligned to data organization
• Faster discovery for analysts – reduce time to value
• Agile and adaptable – ensures information is current through native connectors
• Dynamic protection with Ranger in simple audited policies

Additional Atlas Sessions
• Extend Governance in Hadoop with the Atlas Ecosystem:
integrations with partners Waterline, Trifacta and Attivo:
Thursday 4:10PM @ Room 210A
• BOF: Apache Knox and Apache Ranger provide Hadoop security
while Atlas provides a Hadoop metadata store and enterprise
compliance. Come learn and discuss security & governance
innovations and future directions.
Thursday 5-7 PM @ Room 210A

Learn More:
• Hortonworks links: http://hortonworks.com/solutions/security-and-governance/
• Tutorials: https://github.com/hortonworks/tutorials/tree/atlas-ranger-tp/tutorials/hortonworks/atlas-ranger-preview
