A presentation on some work I'm engaged in with Apache Atlas and Apache Ranger. Delivered April 6 2017 at the DataWorks Summit in Munich. https://dataworkssummit.com/munich-2017/sessions/unleashing-the-power-of-apache-atlas-with-apache-ranger/
A video can also be found at https://www.youtube.com/watch?v=pMCRuD4d9-U&index=57&list=PLQ-KRsI-e9bAjjx9fPHUTZKw28hS8KOn3
Unleashing the power of Apache Atlas with Apache Ranger - Virtual Data Connector
1. Unleashing the power of Apache Atlas
with Apache Ranger
Virtual Data Connector Project
NIGEL JONES
JONESN@UK.IBM.COM
DATAWORKS, MUNICH, APRIL 2017
Apache®, Apache Atlas, Apache Ranger & other Apache project names referenced are either registered trademarks or trademarks of
the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation
is implied by the use of these marks.
2. About Me – Nigel Jones
https://www.linkedin.com/in/nigelljones/
jonesn@uk.ibm.com (Anyone still use email?)
@planetf1 – noisy, f1, electric vehicles, food & drink …. A split of work/life
accounts didn’t work for me!
And of course the Apache Atlas & Ranger mailing lists & JIRA!
Science fan at school and university. It was cloud chambers back then… now just the
cloud
IBM Hursley, UK since 1990
Last 3 years focus on Data Lake, Information Governance, Open Metadata
4. Data?
What data do I have?
What does it mean?
Where is it?
Who has access to it?
Who owns it?
What quality is it?
How does it relate to other data?
How do I control, audit & understand access?
5. Regulatory needs
Adhere to regulations like BCBS-239 and GDPR
Need to know meaning, value of the data
Demonstrate processes in place to govern access
Audit
Significant fines if rules breached
Whilst ensuring easy, ready access to appropriate data for data professionals to
support an agile business
7. Metadata..
Metadata enables data to be used outside of the application that created it.
Analytics and decision making
New business applications
Reporting and compliance
Metadata describes the format and content of data allowing people to judge which
dataset to use for a new project
Structure
Meaning
Origin
Valid values and quality
Usage and ownership
Regulations and classifications that apply
Metadata describes the business context and classification of data allowing automated
governance processes to operate.
8. Which can support…
An enterprise data catalogue that lists all data including where it is, what it
is, who owns it, its meaning, quality, where it came from, and can fully
describe its business context & how the data should be governed….
Subject Matter experts searching, collaborating, feeding back about their
data needs and use
Automated governance actions to protect and manage including auditing,
monitoring, quality control, rights management
9. But easily…
Open frameworks & APIs
Automatic collection & discovery of metadata in a dynamic heterogeneous
environment
Using predefined standards for glossaries, schemas, rules, regulations to
reduce cost
Cheap to integrate new tools
No proprietary lock-in & assumptions that all tools are from one suite or
vendor
Avoiding silos
Distributed and Open
12. Data virtualization project
Collaboration – IBM, several banks & open community
A Data Lake environment
Not just Hadoop, but other sources too
Business Terms, Classifications, Metadata rich
Offer virtualized views. Expose relational data with business terms
Manage Access to resources – permit, deny, log, filter/mask …. THROUGH METADATA
Open, pluggable
Working through use cases, design, initial MVP (this year)
Critique and feedback are welcome. We’re looking for guidance and support from the Atlas
& Ranger communities, as well as to contribute our ideas
Proposed changes all go through mailing list and JIRA for feedback
13. Apache Atlas
“Atlas is a scalable and extensible set of core foundational governance
services – enabling enterprises to effectively and efficiently meet their
compliance requirements within Hadoop and allows integration with the
whole enterprise data ecosystem.” …. http://www.apache.org
Open Community -- Apache Incubator since May 2015
Type agnostic metadata store
REST API & UI
Supports many Hadoop components including HBase, Hive, Sqoop, Storm
& others
14. Apache Ranger
Centralized security administration to manage all security related tasks in a
central UI or using REST APIs.
Fine grained authorization to do a specific action and/or operation with
Hadoop component/tool and managed through a central administration
tool
Standardize authorization method across all Hadoop components.
Enhanced support for different authorization methods - Role based access
control, attribute based access control etc.
Centralize auditing of user access and administrative actions (security
related) within all the components of Hadoop.
… from http://ranger.apache.org
15. Project Interactions
[Diagram: interactions between Search/Report tools, GaianDB, Apache Atlas, Apache Ranger, Apache Solr and the underlying RDBMS and Hadoop sources. Labels:]
• Search for list of assets by metadata
• Search for data
• Reporting tool obtains data to draw report
• Underlying data: SQL, Hive, HDFS, Oracle, Netezza etc.
• Manages logical views
• Deploys rules, pushes classifications, source for user roles (not users)
• + Ranger plugin to permit/deny, mask etc.
• Pulls rules, classifications
16. Why Atlas and Ranger?
Open Source essential to forming an active ecosystem
Vision, active community & evolving – ability to contribute & work with
others to provide the best solution
Already have good core capabilities
Atlas type system is very flexible
Ranger offers a range of policy types and provides a pluggable framework
Already cross project integration
Use of tag based policies in Ranger sourced from Atlas
Can be used independently of full Hadoop stack
17. Refined virtual connector scope
[Diagram: refined Virtual Data Connector scope. Labels:]
• Data Lake Virtualization: GaianDB with Ranger plugin; Virtualizer (Pre, Post, Create View; manage logical tables; extract physical metadata); Mapper
• Atlas: OMAS, OMRS, Titan (graph DB, metadata repository)
• Ranger Server: Ranger config (e.g. policies, audit log location), poll policies, tag-sync, rule-sync, LDAP, audit log
• Other metadata sources: IGC, Navigator, Datameer
• Data Lake Repositories: Oracle, Netezza, Hive tables
• Flows: search for data/reporting; retrieve metadata; push metadata; push and query metadata
18. GaianDB & Virtualizer
GaianDB
Open Source
Federated, self learning, dynamic configuration
Based on Apache Derby
Already had “policy” support – we’re plugging in
Ranger for this project
Virtualizer
Listens to event notifications on assets etc
Creates view definitions in GaianDB, and uses new Atlas APIs
to store metadata. Could use a different virtual engine.
Designed to be open to other virtualization
technologies.
[Diagram: logical tables LT1 and LT2 over data sources DS1, DS2 and DS3; GaianDB with PolicyPlugin (Ranger); Virtualizer; Atlas]
GaianDB supports federation (not used for MVP)
19. Atlas – glossary enhancements
Get Atlas closer to parity with commercial offerings
Business Terms – categories, category hierarchies
Has-a, is-a, type-of, synonym, antonym, arbitrary relationships
Assets mapped to Business Terms
Classifications
Hierarchy
Navigable mappings to retain ability to flatten tags to ranger
Instead of hive column EMP_SALARY -> SPI, now can be EMP_SALARY -> SALARY ->
SPI …
Used to drive governance
ATLAS-1410
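The flattening idea on this slide (EMP_SALARY -> SALARY -> SPI still yielding a plain SPI tag for Ranger) can be sketched as a simple graph walk. This is a minimal illustration only: the `LINKS` dictionary, the `CLASSIFICATIONS` set and the `flatten_tags` function are hypothetical names, not the Atlas API proposed in ATLAS-1410.

```python
from collections import deque

# Hypothetical in-memory glossary: each node lists what it is mapped to.
# The names mirror the example on the slide.
LINKS = {
    "EMP_SALARY": ["SALARY"],  # Hive column -> business term
    "SALARY": ["SPI"],         # business term -> classification
    "SPI": [],
}
CLASSIFICATIONS = {"SPI"}


def flatten_tags(asset, links=LINKS, classifications=CLASSIFICATIONS):
    """Return every classification reachable from an asset as a flat, sorted list."""
    seen = {asset}
    found = set()
    queue = deque([asset])
    while queue:
        node = queue.popleft()
        for nxt in links.get(node, []):
            if nxt in seen:
                continue  # guards against cycles in the glossary
            seen.add(nxt)
            if nxt in classifications:
                found.add(nxt)
            queue.append(nxt)
    return sorted(found)


print(flatten_tags("EMP_SALARY"))
```

The enforcement side only ever sees the flat list, so a richer glossary structure need not change how tags are consumed.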
20. Atlas – other enhancements
Consumer Centric APIs
Open Metadata Access Services (OMAS)
REST & more Kafka notifications
Asset, Catalog, Connector, Glossary, Governance Action, Governance Definitions,
Information View, Roles and Access
Repository level APIs
Open Metadata Repository Services (OMRS)
REST & more Kafka notifications
Pluggability through an Open Connector Framework to other metadata repositories
– distributed and Open
Standard data model/core
Enhancement to core model – versioning, external linkage etc
More standard types, e.g. for all relational databases, to ease sharing
21. Ranger areas being looked at
Building a plugin for GaianDB
Access control, simple masking. More later
User synchronization (large #users, role of Atlas)
Changes to tag sync process for New glossary proposal
As more metadata goes into Atlas, it becomes source for generation of
some kinds of policies. Where is the master?
Generating ranger rules from governance definitions
How about control of access to Atlas itself?
Aside: Interfaces used by enforcement engines (such as to get classification
data) need to be efficient – these should work for projects like Apache
Sentry as well as Atlas
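To make "access control, simple masking" driven by classifications concrete, here is a small sketch of a tag based decision. It is not Ranger's policy engine: the policy tuples, the `decide` and `enforce` functions, the default-deny behaviour and the precedence order (deny over mask over permit) are all assumptions made for illustration.

```python
# Assumed precedence when several policies match: the most restrictive wins.
PRECEDENCE = {"deny": 0, "mask": 1, "permit": 2}

# Hypothetical policies: (classification tag, role, action).
POLICIES = [
    ("SPI", "hr_manager", "permit"),
    ("SPI", "analyst", "mask"),
]


def decide(user_roles, column_tags, policies=POLICIES):
    """Pick an action for a column from its tags and the caller's roles."""
    matches = [
        action
        for tag, role, action in policies
        if tag in column_tags and role in user_roles
    ]
    if not matches:
        return "deny"  # assumption: no matching policy means no access
    return min(matches, key=lambda action: PRECEDENCE[action])


def enforce(value, decision):
    """Apply the decision to a single value."""
    if decision == "permit":
        return value
    if decision == "mask":
        return "*" * len(str(value))
    raise PermissionError("access denied")


print(enforce(52000, decide({"analyst"}, {"SPI"})))
```

An enforcement point only needs the column's tags and the caller's roles, which is why an efficient interface for fetching classification data matters.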
22. Beyond the MVP
Open Discovery Framework
Consider other security enforcement engines – such as Apache Sentry &
driving more capability around rules & governance actions from Atlas
metadata
Work on standard models to support different domains
Lineage
From high level design lineage through to operational detail. Logs vs graph….
API metadata
Infrastructure – JanusGraph…
Abstraction added by IBM in last few months for titan 1
23. The vision
An enterprise data catalog that lists all of your data, where it is located, its origin (lineage), owner, structure, meaning, classification and quality
Spanning systems both on premise and cloud providers
Hosted locally to your data platforms but integrated to provide the enterprise view
New data tools (from any vendor) connect to your data catalog out of the box
No vendor lock-in; nor expensive population of yet another proprietary siloed metadata repository
Metadata is added automatically to the catalog as new data is created
Extensible discovery processes characterise and classify the data
Interested parties and processes are notified
Subject matter experts collaborating around the data
Locate the data they need, quickly and efficiently
Feed back their knowledge about the data and the uses they have made of it to help others and support economic evaluation of data
Automated governance processes protect and manage your data
Metadata-driven access control
Auditing, metering and monitoring
Quality control and exception management
Rights management
Predefined standards for glossaries, data schemas, rules and regulations that reduce the cost of doing business
Open frameworks and APIs for collaborating with universities, traditional vendors and new innovators around data and advanced analytics
24. Summary
Atlas can help us have an industry wide common metadata platform around
which a vibrant ecosystem can evolve
Not only in Hadoop but more broadly
Metadata driven governance can be scalable & enable us to manage our data
better, and be compliant with regulations
The ideas presented here resonate with many people we’ve spoken to
Get involved! I’d love to hear the feedback on this approach!
Comment on the JIRAS, ask questions, contribute, disagree… ;-)
Look at JIRA Tag “VirtualDataConnector” or start at ATLAS-1689
Atlas wiki
“Innovation happens best not in isolation but in collaboration” (keynote)
THANKS!
27. ATLAS Virtualizer Architecture
[Diagram. Labels:]
• Atlas: graph DB, OMRS, Connector Framework
• OMAS interfaces: GAF OMAS, Virtual Asset OMAS, Catalog OMAS
• GAF Pre, GAF Post
• Search, Search/Explore UI
• “gaiandb” with P-JDBC connections to Oracle data, HDFS data and Netezza data
• IGC via the IGC REST API
• Legend: Atlas boundaries; developed in POC; may not be in POC initially; * may be hardcoded at first
28. Metadata areas and types
[Diagram: metadata areas, numbered 1 to 7, with the roles that use them. Labels:]
• Policy Metadata (principles, regulations, standards, approaches, rule specifications, roles and metrics)
• Governance Actions and Processes
• Business Objects and Relationships, Taxonomies and Ontologies
• Business Attributes
• Teaming Metadata (people profiles, communities, projects, notebooks, …)
• Models and Schemas
• Physical Asset Descriptions (data stores, APIs, models and components)
• Asset Collections (sets, typed sets, type organized sets)
• Information Views
• Rights Management
• Reference Data
• Feedback Metadata (tags, comments, ratings, …)
• Classification Schemes, Classification, Strategy, Subject Area Definition
• Campaigns and Projects, Infrastructure and Systems, Rollout
• Discovery Metadata (profile data, technical classification, data classification, data quality assessment, …)
• Information Process Instrumentation (design lineage)
• Connector Directories
• Other labels: Augmentation, Mapping, Implementation, Access, Organization, Instrument, Association
• Roles: Information Auditor, Integration Developer, Business Analyst, Data Scientist, Information Worker, Information Owner, Information Governor, Information Steward, Data Quality Analyst, Information Curator
29. User & Group/Role synchronization
UserSync2
• Corporate LDAP has a huge number of users/groups
• Ranger currently needs to sync all of them
• In future perhaps we establish group/role membership during authentication
• Capability for an alternative source could be merged into base UserSync
[Diagram: UserSync2 between LDAP, Apache Atlas and Apache Ranger. Labels:]
• LDAP holds role membership (LDAP groups); could also be Active Directory
• Atlas manages the definitive list of roles <that are used for Atlas-managed sources>
• LDAP lookup -> group:member
• Governance Action OMAS: getRoles
30. Atlas Glossary v2: Tag Sync to Ranger
TagSync2
• Atlas Glossary v2 proposed in ATLAS-1410 (David Radley)
• Sync builds on the existing tagsync approach
• New API in Atlas will flatten the classification structure
• No changes to Ranger, but exposing richer classification could be an area of future work
[Diagram: TagSync2 between Apache Atlas and Apache Ranger, via the Governance Action OMAS. Labels:]
• The Atlas glossary manages a sophisticated enterprise glossary structure
• Atlas side: Hive Column emp_renum, Business Term Salary, Business Term Confidential
• Ranger side: Hive Column emp_renum, Tag Confidential
31. Policy (Rule) synchronization
RuleSync
• Generate policies in Ranger based on entities in Atlas
• Currently designing how this works
• Scoped by policy service so the existing Ranger UI approach still works
[Diagram: RuleSync between Apache Atlas and Apache Ranger, via the Governance Action OMAS (getRules). Labels: Role, Classifications, Asset, Ranger Rule, Action]
32. VirtualDataConnector JIRAS 20170402
RANGER-1488: Create Ranger plugin for gaiandb
RANGER-1487: generate rules from Governance definitions in Atlas
RANGER-1486: New usersync alternative for Atlas (vdc)
RANGER-1485: Ranger support for Virtual Data Connector Project (ATLAS)
RANGER-1464: Support Atlas v2 glossary in Atlas plugin (for access control to terms etc)
RANGER-1454: Support of Atlas v2 glossary API proposal for tag source
RANGER-1234: Post-evaluation phase user extensions
RANGER-1186: Ranger Source: eclipse
RANGER-1168: Add data masking for tag based policies
ATLAS-1696: Governance Action Framework OMAS
ATLAS-1694: Sample assets to support Virtual Connector Project
ATLAS-1691: OMAS Interfaces for Atlas
ATLAS-1158: Build ATLAS using Docker
ATLAS-520: Temporal / Versioning support for types, traits, entities ....
ATLAS-519: metrics
ATLAS-455: Timeouts in tests should be configurable from system property
ATLAS-197: Add build instructions in top level dir
33. References
Apache Atlas - http://atlas.apache.org/
Top level JIRA for this activity - https://issues.apache.org/jira/browse/ATLAS-1689
Apache Ranger - http://ranger.apache.org/
GaianDB
https://github.com/gaiandb/gaiandb
https://developer.ibm.com/open/openprojects/gaian-database/
The case for open metadata – A.M.Chessell
http://www.ibmbigdatahub.com/blog/case-open-metadata
Editor's Notes
This is the nirvana. Many tools from different teams – open or proprietary – all able to exchange metadata easily.
A new tool can easily understand existing metadata, can integrate with minimal effort
GaianDB is an open source project from IBM that is based on Apache Derby and supports a highly distributed model with self learning/healing capabilities. It virtualizes access to underlying data sources: for example, a virtual table may be surfaced via JDBC that is actually based on a combination of a CSV file and another relational database. We are using it in the Virtual Data Connector project to provide a single point of control via a Ranger plugin, as well as to do some data source mappings such as hiding technical columns from view or renaming columns with more business-like terms gleaned from the glossary (Atlas).
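The column mapping described in this note (hide technical columns, rename the rest using glossary terms) can be sketched as a function that builds a view definition. This is a generic SQL illustration under stated assumptions: `build_view_sql`, the table and column names, and the idea that the glossary lookup arrives as a plain dictionary are all hypothetical, and the output is not GaianDB's own syntax for logical tables.

```python
def build_view_sql(view, table, columns, business_names, hidden):
    """Build a CREATE VIEW statement that hides and renames columns.

    columns        -- physical column names, in order
    business_names -- physical name -> business term (assumed to come from the glossary)
    hidden         -- technical columns that should not be exposed
    """
    selected = []
    for col in columns:
        if col in hidden:
            continue
        alias = business_names.get(col, col)
        selected.append(col if alias == col else f'{col} AS "{alias}"')
    if not selected:
        raise ValueError("view would expose no columns")
    return f"CREATE VIEW {view} AS SELECT {', '.join(selected)} FROM {table}"


sql = build_view_sql(
    view="EMPLOYEE_V",
    table="EMPLOYEE",
    columns=["EMP_ID", "EMP_RENUM", "ETL_BATCH_ID"],
    business_names={"EMP_RENUM": "Salary"},
    hidden={"ETL_BATCH_ID"},
)
print(sql)
```

The identifiers here are trusted sample values; a real implementation would need to validate or quote them before building SQL.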
This is broadly the scope of an MVP definition we’re using to focus our initial work this year. We have use cases we can share with anyone interested, and will be capturing that info in the Atlas/Ranger JIRAs and potentially the wiki. The list of metadata repositories is an example. Our MVP sources some metadata from IGC since that is being used by some participants, but the focus is on open interfaces and Atlas. The other repositories are potential ideas only. Similarly
It’s important to architect this in an open way. The rules used to decide when to virtualize a resource need to be pluggable: for example, perhaps all data arriving in a particular Data Lake zone will be a virtualization candidate. Further, the actual technology needs to be changeable, whether proprietary or another open project; perhaps Presto is a candidate, for example. Ideas welcome. Proposals will be shared in JIRA.
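One way to picture pluggable virtualization rules is as interchangeable predicates over asset metadata. The sketch below is illustrative only: the asset dictionaries, the `zone` attribute, and the `in_zone` and `virtualization_candidates` helpers are hypothetical and do not describe the Virtualizer's actual design.

```python
from typing import Callable, Dict, List

Asset = Dict[str, str]
Rule = Callable[[Asset], bool]


def in_zone(zone: str) -> Rule:
    """Rule factory: an asset qualifies if it sits in the given Data Lake zone."""
    return lambda asset: asset.get("zone") == zone


def virtualization_candidates(assets: List[Asset], rules: List[Rule]) -> List[str]:
    """Return the names of assets accepted by at least one plugged-in rule."""
    return [a["name"] for a in assets if any(rule(a) for rule in rules)]


assets = [
    {"name": "hr.employee", "zone": "curated"},
    {"name": "tmp.scratch", "zone": "landing"},
]
print(virtualization_candidates(assets, [in_zone("curated")]))
```

Swapping the rule list changes which assets are virtualized without touching the rest of the flow.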
OMAS = Open Metadata Access Services. These are consumer-centric interfaces, so they would pass objects suited to a particular consumer: for example Ranger in the case of the Governance Action OMAS, or perhaps a catalog UI for the Catalog services. Each consumer has different needs in terms of object structure, or whether it deals with individual objects or sets, and this can differ from the model used in the underlying repository. For some interfaces this mapping is simple, for others more complex.
OMRS = Open Metadata Repository Services. This refers to the core repository, i.e. the Atlas type system. We see other metadata repositories adopting the same UI, and are proposing a mechanism that will allow these to be plugged in. Metadata can change rapidly, and the only scalable approach is to ensure it’s open & distributed. Contributions from other metadata server authors welcome!
Note that these are our working names. Fundamentally they are Atlas, and so the Atlas community will together need to agree on the actual names moving forward.
A standard data model, in addition to a common mechanism/server, is necessary to make it much easier to *understand* the metadata we store. Whilst there will always be a need for extensions, having a good base object definition will make application integration easier. For example, we might all wish to describe an RDBMS in a very similar way. We can then go on from this to have more standards oriented around industry models.
GaianDB Ranger plugin: GaianDB already has the capability to have a policy plugin which governs access to its virtual tables. To integrate with Ranger and Atlas we will have a Ranger-style plugin. Whilst this will function like any other Ranger plugin, in addition policies will be generated from Atlas itself.
User synchronization: challenges are described later. In summary, in an enterprise environment there may be many users (100k+) in LDAP, but only a small number have access to the virtualization infrastructure. We’re going to key the user sync off the list of user roles found in Atlas itself, and then obtain the role membership from LDAP. This will then be uploaded to Ranger as per the existing usersync.
Tag Sync: the glossary enhancements provide additional structure in how Business Terms, Classifications & assets are linked. A new Atlas API will flatten this structure and thus preserve Ranger’s ability to use Atlas tags as today. In future there may be a re-evaluation to see if this more sophisticated approach should be pushed to Ranger too.
Policies: since Atlas now has richer metadata, including information about asset ownership, high level governance policies and data classification, rules may be generated in some cases from Atlas, or from a new rule-sync process. This is still being worked through and we’ll share our ideas with the community.
Openness: some users I’ve spoken to are interested in Atlas but may currently use other technologies for enforcement, including Sentry in Hadoop. The intent is to ensure all the interfaces defined are open to all, and useful, so that should someone wish, they could just as well integrate with Sentry as with Ranger. This loosely coupled approach helps support an innovative, exciting ecosystem.
Atlas holds metadata around user roles, which are used to define governance rules. In keeping with Ranger’s process to synchronize users & groups we will source these slightly differently, though this is mostly a scoping exercise to avoid pulling everything from LDAP. One consideration for the future is whether Ranger needs to sync users/groups at all: whilst the sync can help with typeahead when manually defining policies, it’s of relatively little use at runtime if plugins could instead pull the current user’s role membership from LDAP or elsewhere after connection. Possibly one for another JIRA.
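The scoping idea in this note (take the role list from Atlas, then look up only those groups in LDAP) can be sketched with plain dictionaries standing in for both systems. Everything here is a stand-in: `scoped_user_sync`, the role names and the sample LDAP data are hypothetical, and no real Atlas, LDAP or Ranger call is shown.

```python
def scoped_user_sync(atlas_roles, ldap_groups):
    """Return membership only for roles that Atlas knows about.

    atlas_roles -- role names (standing in for a getRoles-style lookup)
    ldap_groups -- group name -> list of members (standing in for LDAP)
    """
    return {
        role: sorted(ldap_groups.get(role, []))
        for role in sorted(atlas_roles)
    }


atlas_roles = {"data_scientist", "auditor"}
ldap_groups = {
    "data_scientist": ["bob", "alice"],
    "auditor": ["carol"],
    "facilities": ["dave", "erin"],  # never synced: not a role known to Atlas
}
membership = scoped_user_sync(atlas_roles, ldap_groups)
print(membership)
```

Only the two roles that Atlas manages are looked up, so the size of the corporate directory no longer drives the size of the sync.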
Ranger already does tag synchronization with Atlas, but changes will be needed to support the new glossary capabilities
A new tagsync process will likely be implemented so that either old or new can be used to avoid any breakage for existing users
We are currently working through how this may work, but fundamentally we can define the governance rules in Atlas, and likely generate executable rules in Ranger. Refer to the JIRAs for ongoing design in this area.
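Since the design was still open at the time, the following is only one possible reading of "define governance rules in Atlas, generate executable rules in Ranger". The `GovernanceRule` class, the `generate_policies` function and the output dictionaries are hypothetical; they are not Ranger's policy format or any agreed interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GovernanceRule:
    """A high level rule: what a role may do with data of a given classification."""
    role: str
    classification: str
    action: str  # e.g. "permit", "deny" or "mask"


def generate_policies(rules, asset_tags):
    """Expand classification-level rules into one policy per matching asset."""
    policies = []
    for asset, tags in sorted(asset_tags.items()):
        for rule in rules:
            if rule.classification in tags:
                policies.append(
                    {"resource": asset, "role": rule.role, "action": rule.action}
                )
    return policies


rules = [GovernanceRule("analyst", "Confidential", "mask")]
asset_tags = {"hive.emp.emp_renum": {"Confidential"}, "hive.emp.emp_id": set()}
policies = generate_policies(rules, asset_tags)
print(policies)
```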
In no particular order and an example only: a query for all JIRAs against Atlas and Ranger with label = ‘VirtualDataConnector’ as of 2 April 2017. This is a list of issues we’re particularly interested in. The root JIRA for our current design work is ATLAS-1689, which it appears we forgot to tag. There are others too, so please rerun the query!
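For anyone rerunning the query, it can be expressed in JQL and turned into a search URL. This sketch only builds the URL and makes no network call. The `jira_search_url` helper is hypothetical, and it assumes the Apache JIRA instance exposes the standard Jira REST search endpoint under the same base URL used in the references.

```python
from urllib.parse import urlencode


def jira_search_url(label, projects=("ATLAS", "RANGER"),
                    base="https://issues.apache.org/jira"):
    """Build a search URL for issues in the given projects carrying a label."""
    jql = f"project in ({', '.join(projects)}) AND labels = {label}"
    # Assumption: the standard /rest/api/2/search endpoint is available.
    return f"{base}/rest/api/2/search?" + urlencode({"jql": jql})


url = jira_search_url("VirtualDataConnector")
print(url)
```

The same JQL text can also be pasted into the JIRA issue search box.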