Unleashing the power of Apache Atlas
with Apache Ranger
Virtual Data Connector Project
NIGEL JONES
JONESN@UK.IBM.COM
DATAWORKS, MUNICH, APRIL 2017
Apache®, Apache Atlas, Apache Ranger & other Apache project names referenced are either registered trademarks or trademarks of
the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation
is implied by the use of these marks.
About Me – Nigel Jones
 https://www.linkedin.com/in/nigelljones/
 jonesn@uk.ibm.com (Anyone still use email?)
 @planetf1 – noisy, f1, electric vehicles, food & drink …. A split of work/life
accounts didn’t work for me!
 And of course the Apache Atlas & Ranger mailing lists & JIRA!
 Science fan at school & uni. It was cloud chambers back then… now just the
cloud
 IBM Hursley, UK since 1990
 Last 3 years focus on Data Lake, Information Governance, Open Metadata
The Problem…..
WHY ARE WE HERE…..
Data?
 What data do I have?
 What does it mean?
 Where is it?
 Who has access to it?
 Who owns it?
 What quality is it?
 How does it relate to other data?
 How do I control, audit & understand access?
Regulatory needs
 Adhere to regulations like BCBS-239 and GDPR
 Need to know meaning, value of the data
 Demonstrate processes in place to govern access
 Audit
 Significant fines if rules breached
 Whilst ensuring easy, ready access to appropriate data for data professionals to
support an agile business
So what do we need to address this?
Metadata..
 Metadata enables data to be used outside of the application that created it.
 Analytics and decision making
 New business applications
 Reporting and compliance
 Metadata describes the format and content of data allowing people to judge which
dataset to use for a new project
 Structure
 Meaning
 Origin
 Valid values and quality
 Usage and ownership
 Regulations and classifications that apply
 Metadata describes the business context and classification of data allowing automated
governance processes to operate.
Which can support…
 An enterprise data catalogue that lists all data including where it is, what it
is, who owns it, its meaning, quality, where it came from, and can fully
describe its business context & how the data should be governed….
 Subject Matter experts searching, collaborating, feeding back about their
data needs and use
 Automated governance actions to protect and manage including auditing,
monitoring, quality control, rights management
But easily…
 Open frameworks & APIs
 Automatic collection & discovery of metadata in a dynamic heterogeneous
environment
 Using predefined standards for glossaries, schemas, rules, regulations to
reduce cost
 Cheap to integrate new tools
 No proprietary lock-in & assumptions that all tools are from one suite or
vendor
 Avoiding silos
 Distributed and Open
The vision
Open and
Unified Metadata
Virtual Data Connector project
Data virtualization project
 Collaboration – IBM, several banks & open community
 A Data Lake environment
 Not just Hadoop, but other sources too
 Business Terms, Classifications, Metadata rich
 Offer virtualized views. Expose relational data with business terms
 Manage Access to resources – permit, deny, log, filter/mask …. THROUGH METADATA
 Open, pluggable
 Working through use cases, design, initial MVP (this year)
 Critique and feedback are welcome. We’re looking for guidance and support from the Atlas
& Ranger communities, as well as to contribute our ideas
 Proposed changes all go through mailing list and JIRA for feedback
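A minimal sketch of what "manage access through metadata" could look like in code. It is not the project's implementation: the rule table, role names and the default-deny choice for unclassified columns are all assumptions made for illustration. The idea shown is that the decision (permit, mask or deny) depends only on the classification tags attached to a column and the roles held by the user, and that every decision can be logged.

```python
from dataclasses import dataclass

# Hypothetical rule table: classification tag -> roles allowed to see clear values.
# A user without one of these roles sees a masked value instead.
CLEAR_ACCESS = {
    "SPI": {"hr_manager"},
    "PUBLIC": {"hr_manager", "analyst"},
}


@dataclass(frozen=True)
class Decision:
    action: str  # "permit", "mask" or "deny"
    reason: str


def decide(column_tags, user_roles):
    """Return the most restrictive decision across all tags on a column."""
    if not column_tags:
        # Assumption: unclassified data is denied by default.
        return Decision("deny", "column has no classification")
    decision = Decision("permit", "all tags allow clear access")
    for tag in sorted(column_tags):
        allowed = CLEAR_ACCESS.get(tag)
        if allowed is None:
            return Decision("deny", f"no rule for tag {tag}")
        if not (allowed & set(user_roles)):
            decision = Decision("mask", f"roles lack clear access to {tag}")
    return decision


def audit_line(user, column, decision):
    """Format one audit record for the decision that was taken."""
    return (
        f"user={user} column={column} "
        f"action={decision.action} reason={decision.reason}"
    )
```

For example, `decide({"SPI"}, {"analyst"})` yields a mask decision, while the same column read by a user holding `hr_manager` is permitted.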
Apache Atlas
 “Atlas is a scalable and extensible set of core foundational governance
services – enabling enterprises to effectively and efficiently meet their
compliance requirements within Hadoop and allows integration with the
whole enterprise data ecosystem.” …. http://www.apache.org
 Open Community -- Apache Incubator since May 2015
 Type agnostic metadata store
 REST API & UI
 Supports many Hadoop components including HBase, Hive, Sqoop, Storm
& others
Apache Ranger
 Centralized security administration to manage all security related tasks in a
central UI or using REST APIs.
 Fine grained authorization to do a specific action and/or operation with
Hadoop component/tool and managed through a central administration
tool
 Standardize authorization method across all Hadoop components.
 Enhanced support for different authorization methods - Role based access
control, attribute based access control etc.
 Centralize auditing of user access and administrative actions (security
related) within all the components of Hadoop.
 … from http://ranger.apache.org
Project Interactions
[Diagram; components: Search/Report, GaianDB (+ Ranger plugin to permit/deny, mask etc), Apache Atlas, Apache Ranger, Apache Solr, RDBMS, Hadoop]
• Search for list of assets by metadata
• Search for data
• Reporting tool obtains data to draw report
• Underlying data: SQL, Hive, HDFS, Oracle, Netezza etc
• Manages logical views
• Deploys rules, pushes classifications, source for user roles (not users)
• Pulls rules, classifications
Why Atlas and Ranger?
 Open Source essential to forming an active ecosystem
 Vision, active community & evolving – ability to contribute & work with
others to provide the best solution
 Already have good core capabilities
 Atlas type system is very flexible
 Ranger offers a range of policy types and provides a pluggable framework
 Already cross project integration
 Use of tag based policies in Ranger sourced from Atlas
 Can be used independently of full Hadoop stack
Refined virtual connector scope
[Architecture diagram; regions: Data Lake Virtualization, Data Lake Repositories]
• Components: GaianDB, Ranger Plugin, Virtualizer, Mapper, Atlas (OMAS, OMRS), Titan (graph DB, metadata repository), IGC, Ranger Server, Ranger Config (eg policies, audit log location), tag-sync, rule-sync, LDAP, Audit Log, Oracle, Netezza, Hive tables, Navigator, Datameer
• Flow labels: Poll policies; Pre, Post, Create View; Extract physical metadata; Manage logical tables; Retrieve metadata; Push metadata; Push and query metadata; Search for data/reporting
GaianDB & Virtualizer
 GaianDB
 Open Source
 Federated, self learning, dynamic configuration
 Based on Apache Derby
 Already had “policy” support – we’re plugging in
Ranger for this project
 Virtualizer
 Listens to event notifications on assets etc
 Creates view definitions in GaianDB, and uses new Atlas APIs
to store metadata. Could use a different virtualization engine.
 Designed to be open to other virtualization
technologies.
[Diagram; labels: LT1, LT2; DS1, DS2, DS3; PolicyPlugin (Ranger); Virtualizer; Atlas]
GaianDB supports federation – not used for MVP
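As an illustration of exposing relational data under business terms, the sketch below builds a generic SQL view definition that renames physical columns to the terms mapped to them. It is an assumption-laden example: the function name, the input mapping and the use of plain `CREATE VIEW` are hypothetical and do not represent GaianDB's own mechanism for defining logical tables or the Virtualizer's actual code.

```python
def quote_ident(name):
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_view_sql(view_name, physical_table, column_terms):
    """Build a CREATE VIEW statement that exposes columns under business terms.

    column_terms maps physical column name -> business term. In a real system
    this mapping would come from the metadata repository; here it is passed in.
    """
    if not column_terms:
        raise ValueError("at least one column mapping is required")
    select_list = ", ".join(
        f"{quote_ident(col)} AS {quote_ident(term)}"
        for col, term in column_terms.items()
    )
    return (
        f"CREATE VIEW {quote_ident(view_name)} AS "
        f"SELECT {select_list} FROM {quote_ident(physical_table)}"
    )
```

Calling `build_view_sql("EMPLOYEE_VIEW", "EMPLOYEE", {"EMP_SALARY": "Salary"})` produces a statement that selects `"EMP_SALARY" AS "Salary"`, so a reporting tool sees the business term rather than the physical column name.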
Atlas – glossary enhancements
 Get Atlas closer to parity with commercial offerings
 Business Terms – categories, category hierarchies
 Has-a, is-a, type-of, synonym, antonym, arbitrary relationships
 Assets mapped to Business Terms
 Classifications
 Hierarchy
 Navigable mappings to retain ability to flatten tags to Ranger
 Instead of hive column EMP_SALARY -> SPI, now can be EMP_SALARY -> SALARY ->
SPI …
 Used to drive governance
 ATLAS-1410
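The flattening idea above (EMP_SALARY -> SALARY -> SPI) can be sketched as a graph walk: start at the asset, follow its links to business terms, and collect every classification found along the way. The data structures and function name here are hypothetical and are not the API proposed in ATLAS-1410; they only show why a navigable mapping still lets an enforcement engine receive a flat list of tags.

```python
def flatten_tags(asset, term_links, term_classifications):
    """Collect every classification reachable from an asset via business terms.

    term_links: node -> set of linked terms (asset -> term, or term -> term)
    term_classifications: node -> set of classification names
    """
    tags, seen, stack = set(), set(), [asset]
    while stack:
        node = stack.pop()
        if node in seen:  # guard against cycles in the glossary
            continue
        seen.add(node)
        tags |= term_classifications.get(node, set())
        stack.extend(term_links.get(node, ()))
    return tags


# The example from the slide: a Hive column mapped to a term that carries SPI.
links = {"EMP_SALARY": {"SALARY"}}
classifications = {"SALARY": {"SPI"}}
flat = flatten_tags("EMP_SALARY", links, classifications)
```

Here `flat` is `{"SPI"}`, which is the same tag the column would have carried under the older direct EMP_SALARY -> SPI mapping.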
Atlas – other enhancements
 Consumer Centric APIs
 Open Metadata Access Services (OMAS)
 REST & more Kafka notifications
 Asset, Catalog, Connector, Glossary, Governance Action, Governance Definitions,
Information View, Roles and Access
 Repository level APIs
 Open Metadata Repository Services (OMRS)
 REST & more Kafka notifications
 Pluggability through an Open Connector Framework to other metadata repositories
– distributed and Open
 Standard data model/core
 Enhancement to core model – versioning, external linkage etc
 More standard types, e.g. for all relational databases, to ease sharing
Ranger areas being looked at
 Building a plugin for GaianDB
 Access control, simple masking. More later
 User synchronization (large #users, role of Atlas)
 Changes to tag sync process for new glossary proposal
 As more metadata goes into Atlas, it becomes source for generation of
some kinds of policies. Where is the master?
 Generating Ranger rules from governance definitions
 How about control of access to Atlas itself?
 Aside: Interfaces used by enforcement engines (such as to get classification
data) need to be efficient – these should work for projects like Apache
Sentry as well as Atlas
Beyond the MVP
 Open Discovery Framework
 Consider other security enforcement engines – such as Apache Sentry &
driving more capability around rules & governance actions from Atlas
metadata
 Work on standard models to support different domains
 Lineage
 From high level design lineage through to operational detail. Logs vs graph….
 API metadata
 Infrastructure – JanusGraph…
 Abstraction added by IBM in the last few months for Titan 1
The vision
 An enterprise data catalog that lists all of your data, where it is located, its origin (lineage), owner, structure, meaning, classification and quality
 Spanning systems both on premise and cloud providers
 Hosted locally to your data platforms but integrated to provide the enterprise view
 New data tools (from any vendor) connect to your data catalog out of the box
 No vendor lock-in; nor expensive population of yet another proprietary siloed metadata repository
 Metadata is added automatically to the catalog as new data is created
 Extensible discovery processes characterise and classify the data
 Interested parties and processes are notified
 Subject matter experts collaborating around the data
 Locate the data they need, quickly and efficiently
 Feed back their knowledge about the data and the uses they have made of it to help others and support economic evaluation of data
 Automated governance processes protect and manage your data
 Metadata-driven access control
 Auditing, metering and monitoring
 Quality control and exception management
 Rights management
 Predefined standards for glossaries, data schemas, rules and regulations that reduce the cost of doing business
 Open frameworks and APIs for collaborating with universities, traditional vendors and new innovators around data and advanced analytics
Summary
 Atlas can help us have an industry wide common metadata platform around
which a vibrant ecosystem can evolve
 Not only in Hadoop but more broadly
 Metadata driven governance can be scalable & enable us to manage our data
better, and be compliant with regulations
 The ideas presented here resonate with many people we’ve spoken to
 Get involved! I’d love to hear the feedback on this approach!
 Comment on the JIRAs, ask questions, contribute, disagree… ;-)
 Look at JIRA Tag “VirtualDataConnector” or start at ATLAS-1689
 Atlas wiki
 “Innovation happens best not in isolation but in collaboration” (keynote)
 THANKS!
Questions
After this talk
jonesn@uk.ibm.com
17:50 Room 4 – Security & Governance BOF
Questions?
Backup charts
Atlas Virtualizer Architecture
[Diagram; components: Search/Explore UI, Search, Catalog OMAS, Virtual Asset OMAS, GAF OMAS, GAF Pre, GAF Post, OMRS, Connector Framework, Atlas graph DB, “gaiandb”, IGC (IGC REST API), P-JDBC connections to Oracle data, HDFS data and Netezza data]
• Legend: Atlas boundaries; Developed in POC; May not be in POC initially
• * May be hardcoded at first
Metadata areas and types
[Diagram of metadata areas, numbered 1 to 7 in the original figure, surrounded by the roles that use them]
• Policy Metadata (Principles, Regulations, Standards, Approaches, Rule Specifications, Roles and Metrics)
• Governance Actions and Processes
• Business Objects and Relationships, Taxonomies and Ontologies
• Business Attributes
• Organization
• Teaming Metadata (people profiles, communities, projects, notebooks, …)
• Models and Schemas
• Physical Asset Descriptions (Data stores, APIs, models and components)
• Asset Collections (Sets, Typed Sets, Type Organized Sets)
• Information Views
• Rights Management
• Reference Data
• Feedback Metadata (tags, comments, ratings, …)
• Classification Schemes
• Classification
• Strategy
• Subject Area Definition
• Campaigns and Projects
• Infrastructure and systems
• Rollout
• Discovery Metadata (profile data, technical classification, data classification, data quality assessment, …)
• Information Process Instrumentation (design lineage)
• Connector Directories
• Other labels: Augmentation, Mapping, Implementation, Access, Instrument, Association
• Roles: Information Auditor, Integration Developer, Business Analyst, Data Scientist, Information Worker, Information Owner, Information Governor, Information Steward, Data Quality Analyst, Information Curator
User & Group/Role synchronization
UserSync2
• LDAP holds role membership (LDAP groups); could also be Active Directory
• Atlas manages the definitive list of roles (those used for Atlas-managed sources)
• Corporate LDAP has a huge number of users/groups
• Ranger currently needs to sync all
• In future perhaps we establish group/role membership during authentication
• Capability for alternative source could be merged into base UserSync
[Diagram; components: Apache Ranger, LDAP, Apache Atlas. Labels: Governance Action OMAS – getRoles; LDAP lookup -> group:member]
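A rough sketch of the filtering idea behind this proposal: instead of syncing every corporate LDAP group, take the list of roles that Atlas knows about and look up only those groups. The function names and data shapes are hypothetical, and the sketch assumes a role name is identical to its LDAP group name; a real sync would query an OMAS interface and an LDAP server rather than in-memory collections.

```python
def groups_to_sync(atlas_roles, ldap_groups):
    """Keep only the LDAP groups whose name matches a role known to Atlas.

    atlas_roles: iterable of role names (what a getRoles-style call might return)
    ldap_groups: dict of group name -> iterable of member user ids
    """
    wanted = set(atlas_roles)
    return {
        group: sorted(set(members))
        for group, members in ldap_groups.items()
        if group in wanted
    }


def users_to_sync(synced_groups):
    """Return the distinct users that belong to at least one synced group."""
    return sorted({user for members in synced_groups.values() for user in members})


# Illustrative data only.
roles = ["hr_manager", "analyst"]
directory = {
    "hr_manager": ["alice"],
    "analyst": ["bob", "alice"],
    "facilities": ["carol"],
}
synced = groups_to_sync(roles, directory)
users = users_to_sync(synced)
```

With this data the `facilities` group and its member are never synchronized, which is the saving the slide is after when the corporate directory is very large.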
Atlas Glossary v2: Tag Sync to Ranger
TagSync2
• Atlas glossary manages a sophisticated enterprise glossary structure
• Atlas Glossary v2 proposed in ATLAS-1410 (David Radley)
• Sync builds on existing tagsync approach
• New API in Atlas will flatten classification structure
• No changes to Ranger – but exposing richer classification could be area of future work
[Diagram; Apache Atlas side: Hive column emp_renum, business term Salary, business term Confidential; Apache Ranger side: Hive column emp_renum, tag Confidential; label: Governance Action OMAS]
Policy (Rule) synchronization
RuleSync
• Generate policies in Ranger based off entities in Atlas
• Currently designing how this works
• Scoped by policy service so existing Ranger UI approach still works
[Diagram; components: Apache Atlas, Apache Ranger. Labels: Governance Action OMAS – getRules; Role, Classifications, Asset, Action, Ranger Rule]
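Since the slide says the design is still in progress, the following is only a guess at the shape of the translation: one governance rule (role, classifications, action) becomes one policy scoped to a named policy service. The rule fields are hypothetical, and the output dictionary is loosely modelled on a tag-based policy; its field names are illustrative and should be checked against the Ranger REST API documentation before use.

```python
def rule_to_policy(rule, service_name):
    """Translate one governance rule into a tag-based policy dictionary.

    rule: dict with "name", "classifications", "roles" and "action"
          (a hypothetical shape for what a getRules-style call might return)
    service_name: the policy service that scopes the generated policy
    """
    required = {"name", "classifications", "roles", "action"}
    missing = required - set(rule)
    if missing:
        raise ValueError(f"rule is missing fields: {sorted(missing)}")
    return {
        "service": service_name,
        "name": f"atlas-rule-{rule['name']}",
        "resources": {"tag": {"values": sorted(rule["classifications"])}},
        "policyItems": [
            {
                "groups": sorted(rule["roles"]),
                "accesses": [{"type": rule["action"], "isAllowed": True}],
            }
        ],
    }


# Illustrative rule only.
example_rule = {
    "name": "salary-read",
    "classifications": {"SPI"},
    "roles": {"hr_manager"},
    "action": "select",
}
policy = rule_to_policy(example_rule, "vdc_tags")
```

Keeping the service name as an explicit parameter reflects the point above that generated policies are scoped by policy service, so policies created by hand in the Ranger UI can live alongside them.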
VirtualDataConnector JIRAs 20170402
 RANGER-1488 – Create Ranger plugin for gaiandb
 RANGER-1487 – generate rules from Governance definitions in Atlas
 RANGER-1486 – New usersync alternative for Atlas (vdc)
 RANGER-1485 – Ranger support for Virtual Data Connector Project (ATLAS)
 RANGER-1464 – Support Atlas v2 glossary in Atlas plugin (for access control to terms etc)
 RANGER-1454 – Support of Atlas v2 glossary API proposal for tag source
 RANGER-1234 – Post-evaluation phase user extensions
 RANGER-1186 – Ranger Source: eclipse
 RANGER-1168 – Add data masking for tag based policies
 ATLAS-1696 – Governance Action Framework OMAS
 ATLAS-1694 – Sample assets to support Virtual Connector Project
 ATLAS-1691 – OMAS Interfaces for Atlas
 ATLAS-1158 – Build ATLAS using Docker
 ATLAS-520 – Temporal / Versioning support for types, traits, entities ....
 ATLAS-519 – metrics
 ATLAS-455 – Timeouts in tests should be configurable from system property
 ATLAS-197 – Add build instructions in top level dir
References
 Apache Atlas - http://atlas.apache.org/
 Top level JIRA for this activity https://issues.apache.org/jira/browse/ATLAS-1689
 Apache Ranger - http://ranger.apache.org/
 GaianDB
 https://github.com/gaiandb/gaiandb
 https://developer.ibm.com/open/openprojects/gaian-database/
 The case for open metadata – A.M.Chessell
 http://www.ibmbigdatahub.com/blog/case-open-metadata

More Related Content

What's hot

Apache Atlas: Tracking dataset lineage across Hadoop components
Apache Atlas: Tracking dataset lineage across Hadoop componentsApache Atlas: Tracking dataset lineage across Hadoop components
Apache Atlas: Tracking dataset lineage across Hadoop componentsDataWorks Summit/Hadoop Summit
 
Data Governance - Atlas 7.12.2015
Data Governance - Atlas 7.12.2015Data Governance - Atlas 7.12.2015
Data Governance - Atlas 7.12.2015Hortonworks
 
Partner Ecosystem Showcase for Apache Ranger and Apache Atlas
Partner Ecosystem Showcase for Apache Ranger and Apache AtlasPartner Ecosystem Showcase for Apache Ranger and Apache Atlas
Partner Ecosystem Showcase for Apache Ranger and Apache AtlasDataWorks Summit
 
An architecture for federated data discovery and lineage over on-prem datasou...
An architecture for federated data discovery and lineage over on-prem datasou...An architecture for federated data discovery and lineage over on-prem datasou...
An architecture for federated data discovery and lineage over on-prem datasou...DataWorks Summit
 
Open Metadata and Governance with Apache Atlas
Open Metadata and Governance with Apache AtlasOpen Metadata and Governance with Apache Atlas
Open Metadata and Governance with Apache AtlasDataWorks Summit
 
GDPR-focused partner community showcase for Apache Ranger and Apache Atlas
GDPR-focused partner community showcase for Apache Ranger and Apache AtlasGDPR-focused partner community showcase for Apache Ranger and Apache Atlas
GDPR-focused partner community showcase for Apache Ranger and Apache AtlasDataWorks Summit
 
Best Practices for Enterprise User Management in Hadoop Environment
Best Practices for Enterprise User Management in Hadoop EnvironmentBest Practices for Enterprise User Management in Hadoop Environment
Best Practices for Enterprise User Management in Hadoop EnvironmentDataWorks Summit/Hadoop Summit
 
Classification based security in Hadoop
Classification based security in HadoopClassification based security in Hadoop
Classification based security in HadoopMadhan Neethiraj
 
Overview of new features in Apache Ranger
Overview of new features in Apache RangerOverview of new features in Apache Ranger
Overview of new features in Apache RangerDataWorks Summit
 
History of Privacera
History of PrivaceraHistory of Privacera
History of PrivaceraPrivacera
 
Data Governance in Apache Falcon - Hadoop Summit Brussels 2015
Data Governance in Apache Falcon - Hadoop Summit Brussels 2015 Data Governance in Apache Falcon - Hadoop Summit Brussels 2015
Data Governance in Apache Falcon - Hadoop Summit Brussels 2015 Seetharam Venkatesh
 
Driving Enterprise Data Governance for Big Data Systems through Apache Falcon
Driving Enterprise Data Governance for Big Data Systems through Apache FalconDriving Enterprise Data Governance for Big Data Systems through Apache Falcon
Driving Enterprise Data Governance for Big Data Systems through Apache FalconDataWorks Summit
 
Data Governance Initiative
Data Governance InitiativeData Governance Initiative
Data Governance InitiativeDataWorks Summit
 
Data Discovery & Lineage in Enterprise Hadoop
Data Discovery & Lineage in Enterprise HadoopData Discovery & Lineage in Enterprise Hadoop
Data Discovery & Lineage in Enterprise HadoopDataWorks Summit
 
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies DataWorks Summit/Hadoop Summit
 
Security and Governance on Hadoop with Apache Atlas and Apache Ranger by Srik...
Security and Governance on Hadoop with Apache Atlas and Apache Ranger by Srik...Security and Governance on Hadoop with Apache Atlas and Apache Ranger by Srik...
Security and Governance on Hadoop with Apache Atlas and Apache Ranger by Srik...Artem Ervits
 
Extend Governance in Hadoop with Atlas Ecosystem: Waterline, Attivo & Trifacta
Extend Governance in Hadoop with Atlas Ecosystem: Waterline, Attivo & TrifactaExtend Governance in Hadoop with Atlas Ecosystem: Waterline, Attivo & Trifacta
Extend Governance in Hadoop with Atlas Ecosystem: Waterline, Attivo & TrifactaDataWorks Summit/Hadoop Summit
 
Data governance in Hadoop (My Personal Notes)
Data governance in Hadoop (My Personal Notes)Data governance in Hadoop (My Personal Notes)
Data governance in Hadoop (My Personal Notes)Komes Chandavimol
 

What's hot (20)

Apache Atlas: Tracking dataset lineage across Hadoop components
Apache Atlas: Tracking dataset lineage across Hadoop componentsApache Atlas: Tracking dataset lineage across Hadoop components
Apache Atlas: Tracking dataset lineage across Hadoop components
 
Apache Ranger
Apache RangerApache Ranger
Apache Ranger
 
Data Governance - Atlas 7.12.2015
Data Governance - Atlas 7.12.2015Data Governance - Atlas 7.12.2015
Data Governance - Atlas 7.12.2015
 
Partner Ecosystem Showcase for Apache Ranger and Apache Atlas
Partner Ecosystem Showcase for Apache Ranger and Apache AtlasPartner Ecosystem Showcase for Apache Ranger and Apache Atlas
Partner Ecosystem Showcase for Apache Ranger and Apache Atlas
 
An architecture for federated data discovery and lineage over on-prem datasou...
An architecture for federated data discovery and lineage over on-prem datasou...An architecture for federated data discovery and lineage over on-prem datasou...
An architecture for federated data discovery and lineage over on-prem datasou...
 
Open Metadata and Governance with Apache Atlas
Open Metadata and Governance with Apache AtlasOpen Metadata and Governance with Apache Atlas
Open Metadata and Governance with Apache Atlas
 
GDPR-focused partner community showcase for Apache Ranger and Apache Atlas
GDPR-focused partner community showcase for Apache Ranger and Apache AtlasGDPR-focused partner community showcase for Apache Ranger and Apache Atlas
GDPR-focused partner community showcase for Apache Ranger and Apache Atlas
 
Best Practices for Enterprise User Management in Hadoop Environment
Best Practices for Enterprise User Management in Hadoop EnvironmentBest Practices for Enterprise User Management in Hadoop Environment
Best Practices for Enterprise User Management in Hadoop Environment
 
Classification based security in Hadoop
Classification based security in HadoopClassification based security in Hadoop
Classification based security in Hadoop
 
Overview of new features in Apache Ranger
Overview of new features in Apache RangerOverview of new features in Apache Ranger
Overview of new features in Apache Ranger
 
History of Privacera
History of PrivaceraHistory of Privacera
History of Privacera
 
Data Governance in Apache Falcon - Hadoop Summit Brussels 2015
Data Governance in Apache Falcon - Hadoop Summit Brussels 2015 Data Governance in Apache Falcon - Hadoop Summit Brussels 2015
Data Governance in Apache Falcon - Hadoop Summit Brussels 2015
 
Driving Enterprise Data Governance for Big Data Systems through Apache Falcon
Driving Enterprise Data Governance for Big Data Systems through Apache FalconDriving Enterprise Data Governance for Big Data Systems through Apache Falcon
Driving Enterprise Data Governance for Big Data Systems through Apache Falcon
 
Data Governance Initiative
Data Governance InitiativeData Governance Initiative
Data Governance Initiative
 
Data Discovery & Lineage in Enterprise Hadoop
Data Discovery & Lineage in Enterprise HadoopData Discovery & Lineage in Enterprise Hadoop
Data Discovery & Lineage in Enterprise Hadoop
 
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
 
Security and Governance on Hadoop with Apache Atlas and Apache Ranger by Srik...
Security and Governance on Hadoop with Apache Atlas and Apache Ranger by Srik...Security and Governance on Hadoop with Apache Atlas and Apache Ranger by Srik...
Security and Governance on Hadoop with Apache Atlas and Apache Ranger by Srik...
 
IOT, Streaming Analytics and Machine Learning
IOT, Streaming Analytics and Machine Learning IOT, Streaming Analytics and Machine Learning
IOT, Streaming Analytics and Machine Learning
 
Extend Governance in Hadoop with Atlas Ecosystem: Waterline, Attivo & Trifacta
Extend Governance in Hadoop with Atlas Ecosystem: Waterline, Attivo & TrifactaExtend Governance in Hadoop with Atlas Ecosystem: Waterline, Attivo & Trifacta
Extend Governance in Hadoop with Atlas Ecosystem: Waterline, Attivo & Trifacta
 
Data governance in Hadoop (My Personal Notes)
Data governance in Hadoop (My Personal Notes)Data governance in Hadoop (My Personal Notes)
Data governance in Hadoop (My Personal Notes)
 

Similar to Unleashing the power of apache atlas with apache - virtual dataconnector

The rise of big data governance: insight on this emerging trend from active o...
The rise of big data governance: insight on this emerging trend from active o...The rise of big data governance: insight on this emerging trend from active o...
The rise of big data governance: insight on this emerging trend from active o...DataWorks Summit
 
Master Meta Data
Master Meta DataMaster Meta Data
Master Meta DataDigikrit
 
Teradata - Presentation at Hortonworks Booth - Strata 2014
Teradata - Presentation at Hortonworks Booth - Strata 2014Teradata - Presentation at Hortonworks Booth - Strata 2014
Teradata - Presentation at Hortonworks Booth - Strata 2014Hortonworks
 
A Look into the Apache OODT Ecosystem
A Look into the Apache OODT EcosystemA Look into the Apache OODT Ecosystem
A Look into the Apache OODT EcosystemChris Mattmann
 
The Rise of Big Data Governance: Insight on this Emerging Trend from Active O...
The Rise of Big Data Governance: Insight on this Emerging Trend from Active O...The Rise of Big Data Governance: Insight on this Emerging Trend from Active O...
The Rise of Big Data Governance: Insight on this Emerging Trend from Active O...DataWorks Summit
 
LinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbenchLinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbenchSheetal Pratik
 
Governance Software Systems_ Managing and Governing Your Data Assets.pptx
Governance Software Systems_ Managing and Governing Your Data Assets.pptxGovernance Software Systems_ Managing and Governing Your Data Assets.pptx
Governance Software Systems_ Managing and Governing Your Data Assets.pptxMounika662749
 
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS Amazon Web Services LATAM
 
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...DataWorks Summit/Hadoop Summit
 
Azure_Purview.pdf
Azure_Purview.pdfAzure_Purview.pdf
Azure_Purview.pdfhija7
 
Balancing data democratization with comprehensive information governance: bui...
Balancing data democratization with comprehensive information governance: bui...Balancing data democratization with comprehensive information governance: bui...
Balancing data democratization with comprehensive information governance: bui...DataWorks Summit
 
Clinical Trials & Big Data-Final
Clinical Trials & Big Data-FinalClinical Trials & Big Data-Final
Clinical Trials & Big Data-FinalManoj Vig
 
Introducing new AIOps innovations in Oracle 19c - San Jose AICUG
Introducing new AIOps innovations in Oracle 19c - San Jose AICUGIntroducing new AIOps innovations in Oracle 19c - San Jose AICUG
Introducing new AIOps innovations in Oracle 19c - San Jose AICUGSandesh Rao
 
What's New with Amazon Redshift ft. McDonald's (ANT350-R1) - AWS re:Invent 2018
What's New with Amazon Redshift ft. McDonald's (ANT350-R1) - AWS re:Invent 2018What's New with Amazon Redshift ft. McDonald's (ANT350-R1) - AWS re:Invent 2018
What's New with Amazon Redshift ft. McDonald's (ANT350-R1) - AWS re:Invent 2018Amazon Web Services
 
How to govern and secure a Data Mesh?
How to govern and secure a Data Mesh?How to govern and secure a Data Mesh?
How to govern and secure a Data Mesh?confluent
 
Data governance datalakes_multitenancy
Data governance datalakes_multitenancyData governance datalakes_multitenancy
Data governance datalakes_multitenancySathish K S
 
Google Data Engineering.pdf
Google Data Engineering.pdfGoogle Data Engineering.pdf
Google Data Engineering.pdfavenkatram
 

Similar to Unleashing the power of apache atlas with apache - virtual dataconnector (20)

The rise of big data governance: insight on this emerging trend from active o...
The rise of big data governance: insight on this emerging trend from active o...The rise of big data governance: insight on this emerging trend from active o...
The rise of big data governance: insight on this emerging trend from active o...
 
Master Meta Data
Master Meta DataMaster Meta Data
Master Meta Data
 
Teradata - Presentation at Hortonworks Booth - Strata 2014
Teradata - Presentation at Hortonworks Booth - Strata 2014Teradata - Presentation at Hortonworks Booth - Strata 2014
Teradata - Presentation at Hortonworks Booth - Strata 2014
 
A Look into the Apache OODT Ecosystem
A Look into the Apache OODT EcosystemA Look into the Apache OODT Ecosystem
A Look into the Apache OODT Ecosystem
 
The Rise of Big Data Governance: Insight on this Emerging Trend from Active O...
The Rise of Big Data Governance: Insight on this Emerging Trend from Active O...The Rise of Big Data Governance: Insight on this Emerging Trend from Active O...
The Rise of Big Data Governance: Insight on this Emerging Trend from Active O...
 
LinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbenchLinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbench
 
Governance Software Systems_ Managing and Governing Your Data Assets.pptx
Governance Software Systems_ Managing and Governing Your Data Assets.pptxGovernance Software Systems_ Managing and Governing Your Data Assets.pptx
Governance Software Systems_ Managing and Governing Your Data Assets.pptx
 
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS


Unleashing the power of apache atlas with apache - virtual dataconnector

  • 1. Unleashing the power of Apache Atlas with Apache Ranger Virtual Data Connector Project NIGEL JONES JONESN@UK.IBM.COM DATAWORKS, MUNICH, APRIL 2017 Apache®, Apache Atlas, Apache Ranger & other Apache project names referenced are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.
  • 2. About Me – Nigel Jones  https://www.linkedin.com/in/nigelljones/  jonesn@uk.ibm.com (Anyone still use email?)  @planetf1 – noisy, f1, electric vehicles, food & drink …. A split of work/life accounts didn’t work for me!  And of course the Apache Atlas & Ranger mailing lists & JIRA!  Science fan at school and uni. It was cloud chambers back then… now just the cloud :-)  IBM Hursley, UK since 1990  Last 3 years focus on Data Lake, Information Governance, Open Metadata
  • 4. Data?  What data do I have?  What does it mean?  Where is it?  Who has access to it?  Who owns it?  What quality is it?  How does it relate to other data?  How do I control, audit & understand access?
  • 5. Regulatory needs  Adhere to regulations like BCBS-239 and GDPR  Need to know meaning, value of the data  Demonstrate processes in place to govern access  Audit  Significant fines if rules breached  Whilst ensuring easy, ready access to appropriate data for data professionals to support an agile business
  • 6. So what do we need to address this?
  • 7. Metadata..  Metadata enables data to be used outside of the application that created it.  Analytics and decision making  New business applications  Reporting and compliance  Metadata describes the format and content of data allowing people to judge which dataset to use for a new project  Structure  Meaning  Origin  Valid values and quality  Usage and ownership  Regulations and classifications that apply  Metadata describes the business context and classification of data allowing automated governance processes to operate.
  • 8. Which can support…  An enterprise data catalogue that lists all data including where it is, what it is, who owns it, its meaning, quality, where it came from, and can fully describe its business context & how the data should be governed….  Subject Matter experts searching, collaborating, feeding back about their data needs and use  Automated governance actions to protect and manage including auditing, monitoring, quality control, rights management
  • 9. But easily…  Open frameworks & APIs  Automatic collection & discovery of metadata in a dynamic heterogeneous environment  Using predefined standards for glossaries, schemas, rules, regulations to reduce cost  Cheap to integrate new tools  No proprietary lock-in & assumptions that all tools are from one suite or vendor  Avoiding silos  Distributed and Open
  • 12. Data virtualization project  Collaboration – IBM, several banks & open community  A Data Lake environment  Not just Hadoop, but other sources too  Business Terms, Classifications, Metadata rich  Offer virtualized views. Expose relational data with business terms  Manage Access to resources – permit, deny, log, filter/mask …. THROUGH METADATA  Open, pluggable  Working through use cases, design, initial MVP (this year)  Critique, feedback is welcomed. We’re looking for guidance and support from the Atlas & Ranger communities as well as contribute our ideas  Proposed changes all go through mailing list and JIRA for feedback
  • 13. Apache Atlas  “Atlas is a scalable and extensible set of core foundational governance services – enabling enterprises to effectively and efficiently meet their compliance requirements within Hadoop and allows integration with the whole enterprise data ecosystem.” …. http://www.apache.org  Open Community -- Apache Incubator since May 2015  Type agnostic metadata store  REST API & UI  Supports many Hadoop components including HBase, Hive, Sqoop, Storm & others
  • 14. Apache Ranger  Centralized security administration to manage all security related tasks in a central UI or using REST APIs.  Fine grained authorization to do a specific action and/or operation with Hadoop component/tool and managed through a central administration tool  Standardize authorization method across all Hadoop components.  Enhanced support for different authorization methods - Role based access control, attribute based access control etc.  Centralize auditing of user access and administrative actions (security related) within all the components of Hadoop.  … from http://ranger.apache.org
  • 15. Project Interactions [Diagram. Components: Search/Report, GaianDB, Apache Atlas, Apache Ranger, Apache Solr, RDBMS, Hadoop. Labels: search for list of assets by metadata; search for data; reporting tool obtains data to draw report; underlying data (SQL, Hive, HDFS, Oracle, Netezza etc.); manages logical views; deploys rules, pushes classifications, source for user roles (not users); plus Ranger plugin to permit/deny, mask etc.; pulls rules, classifications.]
  • 16. Why Atlas and Ranger?  Open Source essential to forming an active ecosystem  Vision, active community & evolving – ability to contribute & work with others to provide the best solution  Already have good core capabilities  Atlas type system is very flexible  Ranger offers a range of policy types and provides a pluggable framework  Already cross project integration  Use of tag based policies in Ranger sourced from Atlas  Can be used independently of full Hadoop stack
  • 17. Refined virtual connector scope [Architecture diagram. Components: GaianDB, Ranger Plugin, Virtualizer, Mapper, Atlas, OMAS, OMRS, Titan (GraphDB, Metadata Repository), Ranger Server, Ranger Config (e.g. policies, audit log location), tag-sync, rule-sync, LDAP, Audit Log, IGC, Meta Data Navigator, Datameer, Data Lake Repositories (Oracle, Netezza, Hive Tables), Data Lake Virtualization. Flow labels: create view metadata; extract physical metadata; manage logical tables; retrieve metadata; push metadata; push and query metadata; poll policies; search for data/reporting; pre; post.]
  • 18. GaianDB & Virtualizer  GaianDB  Open Source  Federated, self learning, dynamic configuration  Based on Apache Derby  Already had “policy” support – we’re plugging in Ranger for this project  Virtualizer  Listens to event notifications on assets etc  Creates view definitions in GaianDB, and new Atlas APIs to store metadata. Could use different virtual engine..  Designed to be open to other virtualization technologies. [Diagram labels: LT1, LT2; DS1, DS2, DS3; PolicyPlugin (Ranger); Virtualizer; Atlas. Note: GaianDB supports federation – not used for MVP.]
  • 19. Atlas – glossary enhancements  Get Atlas closer to parity with commercial offerings  Business Terms – categories, category hierarchies  Has-a, is-a, type-of, synonym, antonym, arbitrary relationships  Assets mapped to Business Terms  Classifications  Hierarchy  Navigable mappings to retain ability to flatten tags to ranger  Instead of hive column EMP_SALARY -> SPI, now can be EMP_SALARY -> SALARY -> SPI …  Used to drive governance  ATLAS-1410
  • 20. Atlas – other enhancements  Consumer Centric APIs  Open Metadata Access Services (OMAS)  REST & more Kafka notifications  Asset, Catalog, Connector, Glossary, Governance Action, Governance Definitions, Information View, Roles and Access  Repository level APIs  Open Metadata Repository Services (OMRS)  REST & more Kafka notifications  Pluggability through an Open Connector Framework to other metadata repositories – distributed and Open  Standard data model/core  Enhancement to core model – versioning, external linkage etc  More standard types ie for all relational databases to ease sharing
  • 21. Ranger areas being looked at  Building a plugin for GaianDB  Access control, simple masking. More later  User synchronization (large #users, role of Atlas)  Changes to tag sync process for New glossary proposal  As more metadata goes into Atlas, it becomes source for generation of some kinds of policies. Where is the master?  Generating ranger rules from governance definitions  How about control of access to Atlas itself?  Aside: Interfaces used by enforcement engines (such as to get classification data) need to be efficient – these should work for projects like Apache Sentry as well as Atlas
  • 22. Beyond the MVP  Open Discovery Framework  Consider other security enforcement engines – such as Apache Sentry & driving more capability around rules & governance actions from Atlas metadata  Work on standard models to support different domains  Lineage  From high level design lineage through to operational detail. Logs vs graph….  API metadata  Infrastructure – JanusGraph…  Abstraction added by IBM in last few months for titan 1
  • 23. The vision  An enterprise data catalog that lists all of your data, where it is located, its origin (lineage), owner, structure, meaning, classification and quality  Spanning systems both on premise and cloud providers  Hosted locally to your data platforms but integrated to provide the enterprise view  New data tools (from any vendor) connect to your data catalog out of the box  No vendor lock-in; nor expensive population of yet another proprietary siloed metadata repository  Metadata is added automatically to the catalog as new data is created  Extensible discovery processes characterise and classify the data  Interested parties and processes are notified  Subject matter experts collaborating around the data  Locate the data they need, quickly and efficiently  Feed back their knowledge about the data and the uses they have made about it to help others and support economic evaluation of data  Automated governance processes protect and manage your data  Metadata-driven access control  Auditing, metering and monitoring  Quality control and exception management  Rights management  Predefined standards for glossaries, data schemas, rules and regulations that reduce the cost of doing business  Open frameworks and APIs for collaborating with universities, traditional vendors and new innovators around data and advanced analytics
  • 24. Summary  Atlas can help us have an industry wide common metadata platform around which a vibrant ecosystem can evolve  Not only in Hadoop but more broadly  Metadata driven governance can be scalable & enable us to manage our data better, and be compliant with regulations  The ideas presented here resonate with many people we’ve spoken to  Get involved! I’d love to hear the feedback on this approach!  Comment on the JIRAS, ask questions, contribute, disagree… ;-)  Look at JIRA Tag “VirtualDataConnector” or start at ATLAS-1689  Atlas wiki  “Innovation happens best not in isolation but in collaboration” (keynote)  THANKS!
  • 25. Questions?  After this talk  jonesn@uk.ibm.com  17:50 Room 4 – Security & Governance BOF
  • 27. ATLAS Virtualizer Architecture [Architecture diagram. Components: Atlas graphDB; “gaiandb”; IGC and IGC REST API; Oracle Data, HDFS Data, Netezza Data; P-JDBC (x3); GAF OMAS; Virtual Asset OMAS; Search; Search/Explore UI; Catalog OMAS; OMRS; GAF Pre; GAF Post; Connector Framework. Legend: Atlas boundaries; developed in POC; may not be in POC initially; * may be hardcoded at first.]
  • 28. Metadata areas and types [Diagram of metadata areas numbered 1 to 7. Labels: Policy Metadata (Principles, Regulations, Standards, Approaches, Rule Specifications, Roles and Metrics); Governance Actions and Processes; Augmentation; Mapping; Implementation; Connector Directories; Access; Business Objects and Relationships, Taxonomies and Ontologies; Business Attributes; Organization; Teaming Metadata (people profiles, communities, projects, notebooks, …); Models and Schemas; Physical Asset Descriptions (Data stores, APIs, models and components); Asset Collections (Sets, Typed Sets, Type Organized Sets); Information Views; Rights Management; Reference Data; Feedback Metadata (tags, comments, ratings, …); Classification Schemes; Classification Strategy; Subject Area Definition; Campaigns and Projects; Infrastructure and systems; Rollout; Discovery Metadata (profile data, technical classification, data classification, data quality assessment, …); Instrument; Association; Information Process Instrumentation (design lineage). Roles shown: Information Auditor, Integration Developer, Business Analyst, Data Scientist, Information Worker, Information Owner, Information Governor, Information Steward, Data Quality Analyst, Information Curator.]
  • 29. User & Group/Role synchronization (UserSync2)  LDAP holds role membership (LDAP groups) – could also be Active Directory  Atlas manages the definitive list of roles that are used for Atlas-managed sources  Corporate LDAP has a huge number of users/groups  Ranger currently needs to sync all  In future perhaps we establish group/role membership during authentication  Capability for alternative source could be merged into base UserSync [Diagram labels: LDAP lookup -> group:member; Governance Action OMAS - getRoles; Apache Ranger; LDAP; Apache Atlas.]
  • 30. Atlas Glossary v2: Tag Sync to Ranger (TagSync2)  Atlas glossary manages a sophisticated enterprise glossary structure  Atlas Glossary v2 proposed in ATLAS-1410 (David Radley)  Sync builds on existing tagsync approach  New API in Atlas will flatten classification structure  No changes to Ranger – but exposing richer classification could be an area of future work [Diagram labels: Confidential, Salary, emp_renum; Business Term, Hive Column, Business Term; Confidential, emp_renum; Hive Column, Tag; Governance Action OMAS; Apache Ranger; Apache Atlas.]
  • 31. Policy (Rule) synchronization (RuleSync)  Generate policies in Ranger based on entities in Atlas  Currently designing how this works  Scoped by policy service so existing Ranger UI approach still works [Diagram labels: Governance Action OMAS - getRules; Role; Classifications; Asset; Ranger Rule; Action; Apache Ranger; Apache Atlas.]
  • 32. VirtualDataConnector JIRAs (2017-04-02)  RANGER-1488: Create Ranger plugin for gaiandb  RANGER-1487: generate rules from Governance definitions in Atlas  RANGER-1486: New usersync alternative for Atlas (vdc)  RANGER-1485: Ranger support for Virtual Data Connector Project (ATLAS)  RANGER-1464: Support Atlas v2 glossary in Atlas plugin (for access control to terms etc)  RANGER-1454: Support of Atlas v2 glossary API proposal for tag source  RANGER-1234: Post-evaluation phase user extensions  RANGER-1186: Ranger Source: eclipse  RANGER-1168: Add data masking for tag based policies  ATLAS-1696: Governance Action Framework OMAS  ATLAS-1694: Sample assets to support Virtual Connector Project  ATLAS-1691: OMAS Interfaces for Atlas  ATLAS-1158: Build ATLAS using Docker  ATLAS-520: Temporal / Versioning support for types, traits, entities ....  ATLAS-519: metrics  ATLAS-455: Timeouts in tests should be configurable from system property  ATLAS-197: Add build instructions in top level dir
  • 33. References  Apache Atlas - http://atlas.apache.org/  Top level JIRA for this activity - https://issues.apache.org/jira/browse/ATLAS-1689  Apache Ranger - http://ranger.apache.org/  GaianDB  https://github.com/gaiandb/gaiandb  https://developer.ibm.com/open/openprojects/gaian-database/  The case for open metadata – A.M.Chessell  http://www.ibmbigdatahub.com/blog/case-open-metadata
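
Slides 19 and 30 describe a new Atlas API that flattens the glossary structure (for example EMP_SALARY -> SALARY -> SPI) so that Ranger can keep consuming plain tags. The sketch below illustrates only that flattening idea. It is not the Atlas API: the `LINKS` dictionary, the `hive_column:` / `term:` / `classification:` prefixes and the `flatten_tags` function are all hypothetical names invented for this illustration.

```python
from typing import Dict, List, Set

# Hypothetical glossary links (an assumption, not an Atlas data structure).
# Each key maps to the business terms or classifications it is linked to,
# mirroring the slide example EMP_SALARY -> SALARY -> SPI.
LINKS: Dict[str, List[str]] = {
    "hive_column:EMP_SALARY": ["term:SALARY"],
    "term:SALARY": ["classification:SPI"],
}


def flatten_tags(asset: str, links: Dict[str, List[str]]) -> Set[str]:
    """Follow asset -> term -> classification links and return the
    classification names as a flat set of tags."""
    tags: Set[str] = set()
    seen: Set[str] = set()   # guards against cycles such as synonym loops
    stack = [asset]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for target in links.get(node, []):
            if target.startswith("classification:"):
                tags.add(target.split(":", 1)[1])
            stack.append(target)
    return tags


print(flatten_tags("hive_column:EMP_SALARY", LINKS))
```

With a flat set like this, an enforcement engine only ever sees the column and its tags, while the richer term hierarchy stays in the metadata repository.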

Editor's Notes

  1. This is the nirvana. Many tools from different teams – open or proprietary – all able to exchange metadata easily. A new tool can easily understand existing metadata, can integrate with minimal effort
  2. GaianDB is an open source project from IBM that is based on Apache Derby and supports a highly distributed model with self learning/healing capabilities. It virtualizes access to underlying data sources – for example a virtual table may be surfaced via JDBC that is actually based on a combination of a CSV file and another relational database. We are using it in the Virtual Data Connector project to provide a single point of control via a Ranger plugin, as well as to do some data source mappings such as hiding technical columns from view or renaming columns with more business-like terms gleaned from the glossary (Atlas).
  3. This is broadly the scope of an MVP definition we’re using to focus our initial work this year. We have use cases we can share with anyone interested, and will be capturing that info in the Atlas/Ranger JIRAs and potentially wiki. The list of metadata repositories is an example. Our MVP sources some metadata from IGC since that is being used by some participants, but the focus is on open interfaces and Atlas. The other repositories are potential ideas only. Similarly
  4. It’s important to architect this in an open way. The rules used to decide when to virtualize a resource need to be pluggable – perhaps for example all data arriving in a particular DataLake zone will be a virtualization candidate. Further, the actual technology needs to be changeable – proprietary or other open projects – for example perhaps Presto is a candidate. Ideas welcome – proposals will be shared in JIRA.
  5. OMAS = Open Metadata Access Services – These are consumer-centric interfaces, so they pass objects suited to a particular consumer – for example Ranger in the case of the Governance Action OMAS, or perhaps a catalog UI for the Catalog services. Each consumer has different needs in terms of object structure, or whether it deals with individual objects or sets, and this can differ from the model used in the underlying repository. For some interfaces this mapping is simple, for others more complex. OMRS = Open Metadata Repository Services. This refers to the core repository, i.e. the Atlas type system. We see other metadata repositories adopting the same UI, and are proposing a mechanism that will allow these to be plugged in. Metadata can change rapidly, and the only scalable approach is to ensure it’s open & distributed. Contributions from other metadata server authors welcome! Note that these are our working names. Fundamentally they are Atlas, and so the Atlas community will together need to agree on the actual names moving forward. A standard data model – in addition to a common mechanism/server – is necessary to make it much easier to *understand* the metadata we store. Whilst there will always be a need for extensions, having a good base object definition will make application integration easier. For example we might all wish to describe an RDBMS in a very similar way. We can then go on from this to have more standards oriented around industry models.
  6. GaianDB Ranger plugin – GaianDB already has the capability to have a policy plugin which governs access to its virtual tables. To integrate with Ranger and Atlas we will have a Ranger-style plugin. Whilst this will function like any other Ranger plugin, in addition policies will be generated from Atlas itself. User synchronization challenges are described later. In summary, in an enterprise environment there may be many users (100k+) in LDAP, but only a small number have access to the virtualization infrastructure. We’re going to key the user sync off the list of user roles found in Atlas itself, and then obtain the role membership from LDAP. This will then be uploaded to Ranger as per the existing usersync. Tag Sync – the glossary enhancements provide additional structure in how Business Terms, Classifications & assets are linked. A new Atlas API will flatten this structure and thus preserve Ranger’s ability to use Atlas tags as today. In future there may be a re-evaluation to see if this more sophisticated approach should be pushed to Ranger too. Policies – Since Atlas now has richer metadata including information about asset ownership, high level governance policies and data classification, rules may be generated in some cases from Atlas, or from a new rule-sync process. This is still being worked through and we’ll share our ideas with the community. Openness – Some users I’ve spoken to are interested in Atlas but may currently use other technologies for enforcement, including Sentry in Hadoop. The intent is to ensure all the interfaces defined are open to all, and useful… so that should someone wish, they could just as well integrate with Sentry as Ranger. This loosely coupled approach helps support an innovative, exciting ecosystem.
  7. Atlas holds metadata around user roles – they are used to define governance rules… In keeping with Ranger’s process to synchronize users & groups we will source these slightly differently, though this is mostly a scoping exercise to avoid pulling everything from LDAP. One consideration for the future is whether Ranger needs to sync users/groups at all – whilst the sync can help with typeahead when manually defining policies, it’s of relatively little use at runtime if instead plugins could pull the current user role membership from LDAP or elsewhere after connection. Possibly for another JIRA.
  8. Ranger already does tag synchronization with Atlas, but changes will be needed to support the new glossary capabilities. A new tagsync process will likely be implemented so that either old or new can be used, to avoid any breakage for existing users.
  9. Currently working through how this may work, but fundamentally we can define the governance rules in Atlas, and likely generate executable rules in Ranger. Refer to the JIRAs for ongoing design in this area.
  10. In no particular order and an example only – a query for all JIRAs against Atlas and Ranger with label = ‘VirtualDataConnector’ as of 2 April 2017. This is a list of issues we’re particularly interested in. The root JIRA for our current design work is ATLAS-1689, which it appears we forgot to tag :-). There are others too, so please rerun the query!
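
Notes 6 and 7 describe scoping the user sync: take the list of roles from Atlas, look up only those roles' membership in LDAP, and upload the result to Ranger, rather than pulling every user and group. The sketch below shows that scoping step under stated assumptions. The plain Python collections stand in for Atlas and LDAP, and `scoped_role_membership` is a hypothetical helper, not part of Ranger's usersync or any Atlas interface.

```python
from typing import Dict, Iterable, List


def scoped_role_membership(
    atlas_roles: Iterable[str],
    ldap_groups: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    """Return membership only for roles that Atlas knows about.

    atlas_roles: role names as they might be listed by Atlas (assumed input).
    ldap_groups: group name -> member user ids, standing in for LDAP lookups.
    Roles with no matching LDAP group are kept with an empty member list.
    """
    return {
        role: sorted(set(ldap_groups.get(role, [])))
        for role in sorted(set(atlas_roles))
    }


# Illustrative data only: a large corporate directory would hold far more groups.
ldap = {
    "data_scientist": ["bob", "alice", "bob"],
    "all_employees": ["alice", "bob", "carol", "dave"],
}
membership = scoped_role_membership(["data_scientist", "data_steward"], ldap)
print(membership)
```

Only the roles named by the metadata side are looked up, so the `all_employees` group never reaches the enforcement side. That is the scoping effect the notes describe for directories with 100k+ users.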