Security is at the core of every bank's activities. ING set itself an ambitious goal: gain insight into the bank's overall network data activity. The purpose is to quickly recognize and neutralize unwelcome guests such as malware and viruses, to prevent data leakage, and to track down misconfigured software components.
Since the inception of the CoreIntel project we knew we would face the challenges of capturing, storing, and processing vast amounts of data of various types from all over the world. In our session we would like to share our experience in building a scalable, distributed system architecture based on Kafka, Spark Streaming, Hadoop, and Elasticsearch to help us achieve these goals.
Why does choosing a good data format matter? How do you manage Kafka offsets? Why is dealing with Elasticsearch a love-hate relationship for us? And how did we manage to put it all together, with wire encryption everywhere and a Kerberized Hadoop cluster?
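The offset-management question deserves a concrete illustration. The talk itself covers Spark Streaming, but the underlying discipline is the same in any consumer: turn off auto-commit and advance offsets only after records are safely persisted. Below is a minimal sketch with kafka-python; the broker address, topic, and group id are assumptions for illustration.

```python
# Minimal sketch of manual Kafka offset management (assumed broker/topic names).
# Idea: disable auto-commit and commit only after records are durably processed,
# so a crash replays unacknowledged records instead of silently dropping them.
from kafka import KafkaConsumer  # pip install kafka-python


def process(payload: bytes) -> None:
    """Stand-in for persisting the record to HDFS/Elasticsearch."""
    print(payload[:80])


consumer = KafkaConsumer(
    "network-logs",                   # hypothetical topic
    bootstrap_servers="broker:9092",  # hypothetical broker
    group_id="coreintel-ingest",
    enable_auto_commit=False,         # we decide when offsets advance
    auto_offset_reset="earliest",
)

for record in consumer:
    process(record.value)
    consumer.commit()  # in practice, commit in batches rather than per record
```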
Hadoop and Spark are big data frameworks whose uses span a variety of scenarios: ingestion, data preparation, data management, processing, analysis, and visualization. Each step requires specialized toolsets to be productive. In this talk I will share solution examples in the Big Data ecosystem, such as Cask, StreamSets, Datameer, AtScale, and Dataiku on Microsoft's Azure HDInsight, that simplify your Big Data solutions. Azure HDInsight is a cloud Spark and Hadoop service for the enterprise, and these tools take advantage of all the benefits of HDInsight, giving you the best of both worlds. Join this session for practical information that will enable faster time to insights for you and your business.
Enterprise large-scale graph analytics and computing based on distributed graph... (DataWorks Summit)
Graph approaches to structuring and analyzing data have been a significant area of interest. Graphs are well suited to expressing complex interconnections and clusters of highly related entities.
Large-scale graph analytics research has grown fast in recent years, and leveraging the Hadoop 2 ecosystem for graphs is a good approach: enterprise graph computing requires storing large graphs and computing against them quickly. On one side sit the OLTP database systems, which allow the user to query the graph in real time. HBase, as a distributed NoSQL database, can be the backend storage that persists a large graph, with the property graph storing its vertices and edges as key-value pairs in HBase; it also provides high reliability, scalability, and fault tolerance for the data, while Solr, as the distributed index, makes queries more efficient. Titan itself handles caching and transactions. On the other side sit the OLAP analytics systems, which use TinkerPop's Hadoop-Gremlin SparkGraphComputer to process a large graph so that every vertex and edge is analyzed; a cluster-computing platform helps with processing large, distributed, in-memory graph datasets.
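As a hedged, minimal illustration of the OLTP side described above, the sketch below issues a real-time traversal through TinkerPop's Python client against a Gremlin Server. A Titan-era deployment would typically drive this from the Java API over HBase/Solr, so the server URL and sample data here are assumptions.

```python
# Hedged sketch: a real-time OLTP-style graph query via TinkerPop's Python
# client (pip install gremlinpython). Server URL and graph contents are
# assumptions; Titan deployments would usually use the Java API instead.
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection("ws://gremlin-server:8182/gremlin", "g")
g = traversal().withRemote(conn)

# Who are the friends-of-friends of "alice"? Expressed as a Gremlin traversal.
contacts = (
    g.V().has("person", "name", "alice")
     .out("knows").out("knows")
     .dedup().values("name")
     .toList()
)
print(contacts)
conn.close()
```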
A graph database based on HBase/Solr, together with graph computing and analysis based on Spark, is powerful for discovering valuable information about relationships in complex and large data, representing a significant business opportunity in the enterprise. It will help graph data analytics in a wide range of domains such as social networking, recommendation engines, advertisement optimization, knowledge representation, health care, education, and security.
Security, ETL, BI & Analytics, and Software Integration (DataWorks Summit)
Liberty Mutual Enterprise Data Lake Use Case Study
By building a data lake, Liberty Mutual Insurance Group's Enterprise Analytics department has created a platform to implement various big data analytics projects. We will share our journey and how we leveraged the Hortonworks Hadoop distribution and other open source technologies to meet our project needs. This session will cover data lake architecture, security, and use cases.
Apache Apex brings you the power to quickly build and run big data batch and stream processing applications. But what about visualizing your data in real time as it flows through your Apache Apex applications? Together, we will review Apache Apex and how it integrates with Apache Hadoop and Apache Kafka to process your big data with streaming computation. Then we will explore the options available to visualize Apex application metrics and data, including open source options like the REST and PubSub mechanisms in STRAM, as well as features available in the RTS Console like real-time dashboards and widgets. We will also look into ways of packaging dashboards inside your Apache Apex applications.
Prior to 2014, Walgreens had traditional enterprise data warehouse systems that had reached their capacity limits. Over the last three years we have evolved, learned lessons, and experienced successes and failures. Our initial adoption of Hadoop came from the need to run complex analytics that simply did not scale on an MPP RDBMS. Our business data demands were rapidly increasing, and the concomitant 8-to-12-week extract, transform, and load turnaround cycles were not an acceptable delivery timeframe in the retail space. A self-service model, where data lands on a distributed platform, schema is applied where necessary, and processing happens at scale, was a necessary paradigm for enabling business value. Our journey started with a single use case and has now evolved into an enterprise data hub. We will discuss the following points:
• Evolution of our infrastructure profile, streamlining the hardware provisioning cycle, and our hybrid deployment model (on premise and cloud).
• Operations: how SmartSense has helped us proactively tune our cluster, and which operational tests we use for benchmarking it.
• Monitoring: how we monitor, and the tools required for enterprise-grade monitoring.
• Security and governance: how we progressed from non-compliance to enterprise grade using Ranger, Knox, Kerberos, HP Voltage, encryption at rest, and many other services.
• Third-party integration with HDP: what we learned and how we overcame the challenges.
• Lastly, our disaster recovery strategy: what is driving the need for DR, and the key capabilities required.
In 2015/16 Worldpay deployed its Enterprise Data Platform – a highly secure cluster used for analysis of over 65 billion card transactions and the subject of last year's Hadoop Summit keynote in Dublin. A year on, we are now rapidly expanding our platform with true multi-tenancy. For our first tenant we have built and deployed the analytics and reporting for our central platforms. Our second tenant deploys 'decision engines' into our core business systems. These allow Worldpay to make decisions, derived from machine learning, on how we authorise and route payments traffic and how these decisions affect the consumer, merchant, and other business partners. We are also developing further tenants for systems management and security. This talk will look at what it means to truly have a single enterprise data lake with multiple tenants sharing that data, and will look forward to how we will extend the platform in 2017 with Hadoop 3.
The world’s largest enterprises run their infrastructure on Oracle, DB2, and SQL, and their critical business operations on SAP applications. Organisations need this data to be available in real time to conduct the necessary analytics. However, delivering this heterogeneous data at the speed required can be a huge challenge because of the complex underlying data models and structures and legacy manual processes that are prone to errors and delays.
Unlock these silos of data and enable new advanced analytics platforms by attending this session.
Find out how to:
• Overcome common challenges faced by enterprises trying to access their SAP data
• Integrate SAP data in real time with change data capture (CDC) technology
• Stream SAP data into Kafka with Attunity Replicate for SAP, as other organisations are already doing
The newly enacted GDPR, which becomes effective in 2018, requires comprehensive protection of the personal information of EU subjects. In this paper, we outline a solution that discovers and classifies personal data subject to GDPR in the Hadoop ecosystem and uses such precise classification to automatically create a robust set of authorization policies. The solution uses Dataguise's DgSecure sensitive data detection to automatically classify sensitive data assets in Apache Atlas and to author comprehensive and robust authorization policies via Apache Ranger. DgSecure detects sensitive data in Hive databases and continuously updates the classification in Apache Atlas via tags. Apache Atlas tags are used to create Apache Ranger policies that protect access to sensitive HDFS files, Hive tables, and Hive columns. We demonstrate a workflow where the components of the solution are automated, requiring little or no manual intervention, to protect such sensitive data in Hadoop clusters.
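As a hedged sketch of one step in such a workflow, the snippet below attaches a classification to an entity through Atlas's v2 REST API, which a Ranger tag-based policy can then enforce. The URL, credentials, entity GUID, and tag name are illustrative assumptions, and DgSecure's own integration may well differ.

```python
# Hedged sketch: tag a Hive column entity in Apache Atlas so that a Ranger
# tag-based policy (e.g. one restricting "PII_GDPR") applies to it. The
# endpoint path is Atlas's v2 API; GUID, tag, and credentials are assumptions.
import requests

ATLAS = "https://atlas.example.com:21000"
guid = "entity-guid-goes-here"  # GUID of the hive_column entity (assumed)

resp = requests.post(
    f"{ATLAS}/api/atlas/v2/entity/guid/{guid}/classifications",
    json=[{"typeName": "PII_GDPR"}],  # classification type must already exist
    auth=("admin", "admin"),          # placeholder credentials
)
resp.raise_for_status()
print("classification applied:", resp.status_code)
```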
Insights into Real World Data Management Challenges (DataWorks Summit)
Data is your most valuable business asset, and it is also your biggest challenge. This challenge and opportunity means we continually face significant roadblocks on the way to becoming a data-driven organisation. From the management of data and the bubbling landscape of open source frameworks to limited industry skills and mounting time and cost pressures, our challenge in data is big.
We all want and need a “fit for purpose” approach to the management of data, especially Big Data, and overcoming the ongoing challenges around the ‘3Vs’ means we get to focus on the most important V: ‘Value’. Come along and join the discussion on how Oracle Big Data Cloud provides value in the management of data and supports your move toward becoming a data-driven organisation.
Speaker
Noble Raveendran, Principal Consultant, Oracle
As Hadoop applications move into cloud deployments, object stores increasingly become the source and destination of data. But object stores are not filesystems: sometimes they are slower, and security works differently.
What are the secret settings for getting maximum performance from queries against data living in cloud object stores? They exist at the filesystem client, file format, and query engine layers. It even matters how you lay out the files: the directory structure and the names you give them.
We know these things from our work in all these layers, from the benchmarking we've done, and from the support calls we get when people have problems. And now we'll show you.
This talk will start from the ground-up question "why isn't an object store a filesystem?", showing how that breaks fundamental assumptions in code and so causes performance issues you don't get when working with HDFS. We'll look at ways to get Apache Hive and Spark to work better, covering the optimizations that have been done to enable this and the work that is ongoing. Finally, we'll consider what your own code needs to do in order to adapt to cloud execution.
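To make the "secret settings" idea concrete, here is a hedged sketch of the kind of s3a client tuning involved, set through a PySpark session. The property names are genuine Hadoop s3a options, but the values are illustrative starting points rather than recommendations, and the right choices depend on your workload and Hadoop version.

```python
# Hedged sketch of s3a client tuning for Spark queries over S3. Values are
# illustrative; fs.s3a.fast.upload is redundant on newer Hadoop versions
# where incremental upload is already the default behaviour.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("s3a-tuning-sketch")
    # random IO suits columnar formats (ORC/Parquet); "normal" suits full scans
    .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
    # allow more parallel connections for metadata-heavy query planning
    .config("spark.hadoop.fs.s3a.connection.maximum", "100")
    # stream multipart uploads instead of buffering whole output files
    .config("spark.hadoop.fs.s3a.fast.upload", "true")
    .getOrCreate()
)

df = spark.read.parquet("s3a://example-bucket/warehouse/events/")  # hypothetical path
df.limit(5).show()
```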
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse (DataWorks Summit)
Yahoo Mail has 200+ million users a month and generates hundreds of terabytes of data per day, which continues to grow steadily. The nature of email messages has also evolved: for example, today the majority of them are generated by machines, consisting of newsletters, social media notifications, purchase invoices, travel bookings, and the like, which drove innovations in product development to help users organize their inboxes.
Since 2014, the Yahoo Mail Data Engineering team has been revamping the Mail data warehouse and analytics infrastructure in order to drive the continued growth and evolution of Yahoo Mail. Along the way we have built a 50 PB Hadoop warehouse and surrounding analytics and machine learning programs that have transformed the role data plays in Yahoo Mail.
In this session we will share our experience from this three-year journey: the system architecture, the analytics systems built, and the lessons learned from development and from driving adoption.
The challenge of computing big data for evolving digital business processes demands a variety of computation techniques and engines (SQL, OLAP, time series, graph, document store) working in a unified framework. A simple architecture for data transformations that ensures security, governance, and operational administration is a critical component of enterprise production environments supporting day-to-day business processes. In this session, you will learn about best practices and the critical components needed to ensure business value, drawn from the latest production deployments. Hear how existing customers are using SAP Vora and the value they have achieved so far with this in-memory engine for distributed data processing. The session gives you a clear understanding of how SAP Vora and open source components like Apache Hadoop and Apache Spark offer an architecture that supports a wide variety of use cases and industries. You will also receive useful pointers to development resources, test-drive demos, and general documentation.
Journey to the Data Lake: How Progressive Paved a Faster, Smoother Path to In... (DataWorks Summit)
Progressive Insurance is well known for its innovative use of data to better serve its customers, and for the important role that Hortonworks Data Platform has played in that transformation. However, as with most things worth doing, the path to the data lake was not without its challenges. In this session, I’ll share our top use cases for Hadoop – including telematics and display ads – how a skills shortage turned supporting these applications into a nightmare, and how – and why – we now use Syncsort DMX-h to accelerate enterprise adoption by making it quick and easy (or at least faster and easier) to populate the data lake, and keep it up to date, with data from across the enterprise. I’ll discuss the different approaches we tried, the benefits of using a tool vs. open source, and how we created our Hadoop Ingestor app using Syncsort DMX-h.
Hortonworks Data Platform and IBM Systems – A Complete Solution for Cognitive Business
SynerScope has been helping European organizations across industries unlock competitive business value from data for almost a decade. Now, by leveraging state-of-the-art access control and audit mechanisms from Hortonworks combined with the latest generation high-performance computing and storage solutions from IBM, SynerScope can connect and correlate enterprise data at a scale not previously possible. SynerScope will demonstrate end-to-end analytics workflows including deep-learning based automation using new integrated solutions from Hortonworks and IBM.
Effective data governance is imperative to the success of data lake initiatives. Without governance policies and processes, information discovery and analysis is severely impaired. In this session we will provide an in-depth look at the Data Governance Initiative launched collaboratively between Hortonworks and partners from across industries. We will cover the objectives of the initiative and demonstrate key governance capabilities of the Hortonworks Data Platform.
Data science holds tremendous potential for organizations to uncover new insights and drivers of revenue and profitability. Big Data has brought the promise of doing data science at scale to enterprises; however, this promise also comes with challenges for data scientists to continuously learn and collaborate. Data scientists have many tools at their disposal: notebooks like Jupyter and Apache Zeppelin, IDEs such as RStudio, languages like R, Python, and Scala, and frameworks like Apache Spark. Given all these choices, how do you best collaborate to build your model and then work through the development lifecycle to deploy it from test into production?
In this session, learn the attributes of a modern data science platform that empowers data scientists to build models using all the data in their data lake and fosters continuous learning and collaboration. We will show a demo of DSX with HDP, focusing on integration, security, and model deployment and management.
Speakers:
Sriram Srinivasan, Senior Technical Staff Member, Analytics Platform Architect, IBM
Vikram Murali, Program Director, Data Science and Machine Learning, IBM
Big SQL: Powerful SQL Optimization – Re-Imagined for Open Source (DataWorks Summit)
Let's be honest: there are some pretty amazing capabilities locked in proprietary SQL engines which have had decades of R&D baked into them. In this session, learn how IBM, working with the Apache community, has unlocked the value of its SQL optimizer for Hive, HBase, ObjectStore, and Spark, helping customers avoid lock-in while providing the best performance, concurrency, and scalability for complex, analytical SQL workloads. You'll also learn how the SQL engine was extended and integrated with Ambari, Ranger, YARN/Slider, and HBase. We share the results of this project, which has enabled running all 99 TPC-DS queries at a world-record-breaking 100 TB scale factor.
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (... (DataWorks Summit)
Apache Metron (Incubating) is a streaming cybersecurity application built on Apache Storm and Hadoop. One of its core missions is to bring advanced analytics through machine learning and data science to its users. Because of the relative immaturity of data science platform infrastructure that is integrated into Hadoop and oriented to streaming analytics applications, we have been forced to create the requisite platform components out of necessity, utilizing many pieces of the Hadoop ecosystem.
In this talk, we will speak about the Metron analytics architecture and how it utilizes a custom data science model deployment and autodiscovery service that is tightly integrated with Hadoop via YARN and ZooKeeper. We will discuss how we interact with the models deployed there via a custom domain-specific language that can query models as data streams past. We will also discuss more generally the full-stack data science tooling that has been created to enable data science at scale in an advanced streaming analytics application.
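The pattern is easier to see in miniature. Below is a hedged sketch, not Metron's actual code: a model wrapped as a small REST service that registers an ephemeral endpoint in ZooKeeper, so a streaming topology can discover live model instances. Hostnames, ports, paths, and the toy scoring rule are all assumptions.

```python
# Hedged sketch of "model as a service": a REST-wrapped model that advertises
# itself in ZooKeeper for discovery. Not Metron's implementation; the paths,
# ports, and scoring rule are assumptions (pip install flask kazoo).
from flask import Flask, jsonify, request
from kazoo.client import KazooClient

app = Flask(__name__)


@app.route("/apply", methods=["POST"])
def apply_model():
    features = request.get_json()
    # stand-in for a trained model: flag hosts with extreme request rates
    score = 1.0 if features.get("requests_per_min", 0) > 1000 else 0.0
    return jsonify({"is_malicious": score})


if __name__ == "__main__":
    zk = KazooClient(hosts="zookeeper:2181")
    zk.start()
    # ephemeral node: the registration vanishes if the service dies
    zk.create("/models/dga-detector/host1:9999",
              b"http://host1:9999/apply", ephemeral=True, makepath=True)
    app.run(host="0.0.0.0", port=9999)
```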
How Apache Spark and Apache Hadoop are being used to keep banking regulators ... (DataWorks Summit)
The global financial crisis showed that banks' traditional IT systems were ill-equipped to monitor and manage a daily-changing risk landscape. The sheer amount of data that needed to be crunched meant that many banks were days behind in calculating, understanding, and reporting their risk positions. Post-crisis, a regulatory review led to new legislation, BCBS 239: Principles for effective risk data aggregation and risk reporting, which requires banks to meet more stringent timeliness requirements in their ability to aggregate and report on their quickly changing risk positions, or risk fines running to millions of dollars. To meet these new requirements, banks have been forced to rethink their traditional IT architectures, which are unable to cope with the sheer volume of risk data, and are instead turning to Apache Hadoop and Apache Spark to build out the next generation of risk systems. In this talk you will discover how some of the leading banks in the world are leveraging Apache Hadoop and Apache Spark to meet the BCBS 239 regulation.
Speaker
Kunal Taneja
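The aggregation requirement at the heart of BCBS 239 maps naturally onto Spark. As a hedged, minimal sketch over invented data, the snippet below nets exposures per counterparty across desks; a real risk system would add sensitivities, hierarchies, and timeliness controls on top.

```python
# Hedged sketch of risk aggregation in Spark: net exposure per counterparty.
# The schema and figures are invented for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("risk-agg-sketch").getOrCreate()

positions = spark.createDataFrame(
    [("cpty-1", "rates", 12_500_000.0),
     ("cpty-1", "fx",     3_200_000.0),
     ("cpty-2", "rates", -7_800_000.0)],
    ["counterparty", "desk", "exposure_usd"],
)

report = (
    positions.groupBy("counterparty")
    .agg(F.sum("exposure_usd").alias("net_exposure_usd"),
         F.countDistinct("desk").alias("desks"))
)
report.show()
```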
Open Metadata and Apache Atlas
- Presented at the DataWorks Summit in Sydney, Australia on 20 September 2017 by Ferd Scheepers (ING) and Nigel Jones (IBM)
Learn how Apache Atlas is being enhanced to provide a universal open metadata and governance platform for all data processing across the enterprise. With open metadata, multiple metadata repositories, potentially from different vendors, can operate collaboratively to create an enterprise catalog of data that can be located, understood, used, and governed. In this talk we will provide a detailed description of the extensions to the type system, the new APIs, the connector framework, the metadata discovery framework, the governance action framework, and the interoperability that we are adding to Apache Atlas. We will show examples of these features in operation: (1) how metadata is discovered and gathered into Apache Atlas, (2) how applications and tools access metadata, (3) how enforcement engines such as Apache Ranger keep synchronized with the latest governance requirements, and (4) how to build an adapter that allows other vendors' metadata repositories to exchange metadata with Apache Atlas repositories. We will also explain how these features can be deployed together to support the Hadoop platform, and the enterprise beyond. This session will be presented by Nigel Jones (IBM) and Ferd Scheepers (ING Chief Information Architect).
Speaker:
Nigel Jones, Software Architect, IBM Analytics Group, IBM
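One synchronization mechanism mentioned above can be sketched directly: Atlas publishes entity change notifications to a Kafka topic (ATLAS_ENTITIES by default), and a consumer such as Ranger's tag-sync keeps itself current by reading it. The sketch below is a hedged illustration; the broker address is an assumption, and the exact message field names vary across Atlas versions.

```python
# Hedged sketch: follow Atlas entity-change notifications from Kafka, the
# channel enforcement engines use to stay synchronized. Broker address is
# an assumption; the message layout differs between Atlas versions.
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "ATLAS_ENTITIES",                    # Atlas's default notification topic
    bootstrap_servers="broker:9092",     # hypothetical broker
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for msg in consumer:
    body = msg.value.get("message", {})
    # each notification describes an entity create/update/delete or tag change
    print(body.get("operationType"), "->",
          body.get("entity", {}).get("typeName"))
```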
The rise of big data governance: insight on this emerging trend from active o... (DataWorks Summit)
Each of today’s most forward-thinking enterprises has been forced to face similar data challenges: the reliance on real-time data to better serve their customers and, subsequently, the requirement of complying with regulations to protect that data – one example being the General Data Protection Regulation (GDPR).
The solution to this emerging challenge is a tricky one – for companies like ING, this data governance challenge has been met with metadata, a consistent view across a large heterogeneous ecosystem and collaboration with an active open source community.
In this joint presentation, John Mertic, Director of ODPi, and Ferd Scheepers, Global Chief Information Architect of ING, will address the benefits of a vendor-neutral approach to data governance and the need for an open metadata standard, along with insight into how companies like ING, IBM, Hortonworks, and more are delivering solutions to this challenge as an open source initiative.
Audience Takeaways include:
Understand the role of metadata;
Understand the need for a cross technology view on metadata;
Understand the role of Apache Atlas as a reference implementation; and
Understand the role of ODPi in offering value-added services including certification.
Speaker
John Mertic, Director of Program Management for ODPi, R Consortium, and Open Mainframe Project, The Linux Foundation
Organize and manage master and meta data centrally, built upon Kong, Cassandra, Neo4j, and Elasticsearch. Managing master and meta data is a very common problem with no good open source alternative as far as I know, hence this project – MasterMetaData.
Teradata – Presentation at Hortonworks Booth – Strata 2014 (Hortonworks)
Hortonworks and Teradata have partnered to provide a clear path to Big Analytics via stable and reliable Hadoop for the enterprise. The Teradata® Portfolio for Hadoop is a flexible offering of products and services for customers to integrate Hadoop into their data architecture while taking advantage of the world-class service and support Teradata provides.
The Rise of Big Data Governance: Insight on this Emerging Trend from Active O... (DataWorks Summit)
Speakers
John Mertic, Director of Program Management for ODPi, R Consortium, and Open Mainframe Project, The Linux Foundation
Maryna Strelchuk, Information Architect, ING
Governance Software Systems: Managing and Governing Your Data Assets (Mounika662749)
Software governance provides organizations with a structure for aligning their development strategy with the overall business strategy, using a formal framework that enables them to track and measure performance against specific strategic goals.
Balancing data democratization with comprehensive information governance: bui... (DataWorks Summit)
If information is the new oil, then governance is its “safety data sheet.” As demand for data as the raw material for competitive differentiation continues to rise, enterprises face growing challenges in identifying and valuing data and ensuring its appropriate use to extract the right information. For organizations to make effective business decisions, they need to trust their data so that they can impute the right value and use it for the right purposes while satisfying any organizational or regulatory mandates. A number of analytics and data science initiatives fail to reach their potential due to the lack of an information governance framework. Robust information governance capabilities can help organizations develop trust in their data and empower them to make decisions confidently.
In this session Sanjeev Mohan, Research Analyst at Gartner, and Srikanth Venkat, Sr. Director of Product Management at Hortonworks, will walk you through an end-to-end architectural blueprint for information governance and best practices for helping organizations understand, secure, and govern diverse types of data in enterprise data lakes.
Speakers
Sanjeev Mohan, Research Analyst, Gartner
Srikanth Venkat, Senior Director, Product Management, Hortonworks
Archonnex is a new software architecture developed by ICPSR for digital asset management systems. It is built on a modern technology stack to meet the current and emerging needs of social science research.
What's New with Amazon Redshift ft. McDonald's (ANT350-R1) – AWS re:Invent 2018 (Amazon Web Services)
Learn about the latest and hottest features of Amazon Redshift. We’ll deep-dive into the architecture and inner workings of Amazon Redshift and discuss how the recent availability, performance, and manageability improvements we’ve made can significantly enhance your user experience. We’ll also share a glimpse of what we are working on and our plans for the future. McDonald's will join us to share how they leverage a data lake powered by Redshift, Redshift Spectrum, and Athena to get quick insights.
Many organizations currently process various types of data in different formats, and most often this data is free-form. As the consumers of this data grow, it is imperative that this free-flowing data adhere to a schema. A schema lets data consumers form an expectation about the type of data they are getting, and it shields them from immediate impact if the upstream source changes its format. Having a uniform schema representation also gives the data pipeline an easy way to integrate and support various systems that use different data formats.
Schema Registry is a central repository for storing and evolving schemas. It provides an API and tooling to help developers and users register a schema and consume that schema without being impacted when the schema changes. Users can tag different schemas and versions, register for notifications of schema changes by version, and so on.
In this talk, we will go through the need for a schema registry and schema evolution, and showcase the integration with Apache NiFi, Apache Kafka, and Apache Storm.
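The evolution guarantee a registry enforces is easiest to see with Avro itself. Here is a hedged, self-contained sketch: data written with schema v1 remains readable under schema v2 because the added field carries a default; a registry automates exactly this compatibility check at registration time. The record shape is invented.

```python
# Hedged sketch of Avro schema evolution (pip install fastavro): v1 data
# read under v2, with the new field filled in from its default.
import io
from fastavro import writer, reader

v1 = {"type": "record", "name": "LogEvent",
      "fields": [{"name": "host", "type": "string"}]}

v2 = {"type": "record", "name": "LogEvent",
      "fields": [{"name": "host", "type": "string"},
                 {"name": "severity", "type": "string", "default": "INFO"}]}

buf = io.BytesIO()
writer(buf, v1, [{"host": "web-01"}])       # producer still writes v1
buf.seek(0)

for rec in reader(buf, reader_schema=v2):   # consumer already expects v2
    print(rec)                              # {'host': 'web-01', 'severity': 'INFO'}
```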
There is an increasing need for large-scale recommendation systems. Typical solutions rely on periodically retrained batch algorithms, but for massive amounts of data, training a new model can take hours. This is a problem when the model needs to be more up to date: for example, when recommending TV programs while they are being transmitted, the model should take into consideration users who are watching a program at that time.
The promise of online recommendation systems is fast adaptation to changes, but online machine learning from streams is commonly believed to be more restricted, and hence less accurate, than batch-trained models. Combining batch and online learning could lead to a quickly adapting recommendation system with increased accuracy. However, designing a scalable data system that unites batch and online recommendation algorithms is a challenging task. In this talk we present our experiences in creating such a recommendation engine with Apache Flink and Apache Spark.
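To make the batch-plus-online idea concrete, here is a hedged sketch of the online half of such a hybrid recommender: latent factors come from a batch job, and each streamed rating then nudges them with one SGD step. The dimensions, hyperparameters, and events are all invented for illustration.

```python
# Hedged sketch: online SGD updates applied to batch-trained
# matrix-factorization factors. All sizes and events are invented.
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 100, 50, 8
lr, reg = 0.05, 0.02
U = rng.normal(scale=0.1, size=(n_users, k))  # stand-in for batch-trained factors
V = rng.normal(scale=0.1, size=(n_items, k))


def online_update(u: int, i: int, rating: float) -> float:
    """One SGD step for a freshly observed (user, item, rating) event."""
    err = rating - U[u] @ V[i]
    u_old = U[u].copy()                        # gradients use pre-update factors
    U[u] += lr * (err * V[i] - reg * U[u])
    V[i] += lr * (err * u_old - reg * V[i])
    return err


for user, item, r in [(3, 7, 5.0), (3, 9, 1.0), (4, 7, 4.0)]:  # toy event stream
    print(f"error before update: {online_update(user, item, r):+.3f}")
```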
Deep learning is not just hype: it outperforms state-of-the-art ML algorithms, one by one. In this talk we will show how deep learning can be used for detecting anomalies on IoT sensor data streams at high speed, using DeepLearning4J on top of different Big Data engines like Apache Spark and Apache Flink. Key to this talk is the absence of any large training corpus, since we are using unsupervised machine learning, a domain that current deep learning research treats step-motherly. As we can see in this demo, LSTM networks can learn very complex system behavior; in this case, data coming from a physical model simulating bearing vibration data. One drawback of deep learning is that normally a very large labeled training data set is required. It is therefore particularly interesting that we can show how unsupervised machine learning can be used in conjunction with deep learning: no labeled data set is necessary. We are able to detect anomalies and predict breaking bearings with 10-fold confidence. All examples and all code will be made publicly available and open source; only open source components are used.
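The talk itself uses DeepLearning4J; purely as a hedged sketch of the same idea in Keras, the snippet below trains an LSTM to predict the next sample of a healthy vibration signal and derives an anomaly threshold from the residuals. The synthetic signal, window size, and threshold rule are assumptions.

```python
# Hedged Keras sketch (the talk uses DeepLearning4J): learn normal vibration
# behaviour, then flag windows whose prediction error is extreme.
import numpy as np
from tensorflow import keras

t = np.linspace(0, 100, 5000)
signal = np.sin(2 * np.pi * 0.2 * t) + 0.05 * np.random.randn(t.size)  # "healthy" data

win = 32
X = np.stack([signal[i:i + win] for i in range(len(signal) - win)])[..., None]
y = signal[win:][:, None]                       # next-sample prediction target

model = keras.Sequential([
    keras.layers.LSTM(32, input_shape=(win, 1)),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=2, batch_size=64, verbose=0)

errors = np.abs(model.predict(X, verbose=0).ravel() - y.ravel())
threshold = errors.mean() + 4 * errors.std()    # anomaly = far beyond healthy error
print(f"alert threshold: {threshold:.4f}")
```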
QE automation for large systems is a great step forward in increasing system reliability. In the big data world, multiple components have to come together to provide end users with business outcomes. This means that QE automation scenarios need to be detailed around actual use cases, cutting across components. The system tests potentially generate large amounts of data on a recurring basis, and verifying it is a tedious job. Given the multiple levels of indirection, the rate of false positives relative to actual defects is higher, and chasing them is generally wasteful.
At Hortonworks, we’ve designed and implemented Mool, an automated log analysis system built with statistical data science and ML. The current work in progress has a batch data pipeline followed by an ensemble ML pipeline that feeds into a recommendation engine. The system identifies the root cause of test failures by correlating the failing test cases with current and historical error records, across multiple components. It works in unsupervised mode, with no perfect model, stable build, or source-code version to refer to. In addition, the system provides limited recommendations to file new tickets or reopen past ones, and compares run profiles with past runs.
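One building block of such a system can be sketched simply: match a fresh failure log against historical error records by TF-IDF cosine similarity, so the most likely known root cause surfaces first. The corpus below is invented, and Mool's actual ensemble is considerably richer.

```python
# Hedged sketch: rank historical tickets by textual similarity to a new
# failure log (pip install scikit-learn). The corpus is invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

history = [
    "TICKET-101 NameNode connection refused during test setup",
    "TICKET-102 Hive query failed: OutOfMemoryError in reducer",
    "TICKET-103 Kerberos ticket expired, authentication failure",
]
new_failure = "teardown failed: GSS initiate failed, Kerberos authentication error"

vec = TfidfVectorizer(stop_words="english")
matrix = vec.fit_transform(history + [new_failure])
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()

for ticket, score in sorted(zip(history, scores), key=lambda p: -p[1]):
    print(f"{score:.2f}  {ticket}")
```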
Improving business performance is never easy! The Natixis Pack is like rugby: working together is key to scrum success. Our data journey would undoubtedly have been much more difficult if we had not made the move together.
This session is the story of how ‘The Natixis Pack’ has driven change in its current IT architecture so that legacy systems can leverage some of the many components in Hortonworks Data Platform in order to improve the performance of business applications. During this session, you will hear:
• How and why the business and IT requirements originated
• How we leverage the platform to fulfill security and production requirements
• How we organize a community to:
o Guard all the players, no one gets left on the ground!
o Use the platform appropriately (not every problem is eligible for Big Data, and standard databases are not dead)
• What are the most usable, the most interesting and the most promising technologies in the Apache Hadoop community
We will finish the story of a successful rugby team with insight into the special skills needed from each player to win the match!
DETAILS
This session is part business, part technical. We will talk about infrastructure, security and project management as well as the industrial usage of Hive, HBase, Kafka, and Spark within an industrial Corporate and Investment Bank environment, framed by regulatory constraints.
HBase has established itself as the backend for many operational and interactive use cases, powering well-known services that support millions of users and thousands of concurrent requests. In terms of features, HBase has come a long way, offering advanced options such as multi-level caching on- and off-heap, pluggable request handling, fast recovery options such as region replicas, table snapshots for data governance, tunable write-ahead logging, and so on. This talk is based on the research for the upcoming second edition of the speaker's HBase book, correlated with practical experience from medium to large HBase projects around the world. You will learn how to plan for HBase, from selecting the matching use cases and determining the number of servers needed, through to performance tuning options. There is no reason to be afraid of using HBase, but knowing its basic premises and technical choices will make using it much more successful. You will also learn about many of the new features of HBase up to version 1.3, and where they are applicable.
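Planning of the kind described often starts as arithmetic. Here is a hedged back-of-the-envelope sketch; every constant is an assumption to replace with your own measurements.

```python
# Hedged sizing sketch for an HBase cluster; all constants are assumptions.
raw_tb = 40                 # logical dataset size in TB
replication = 3             # HDFS replication factor
region_gb = 10              # target region size
regions_per_server = 100    # a common comfort ceiling per RegionServer
disk_per_node_tb = 12       # usable disk per node

storage_tb = raw_tb * replication
regions = raw_tb * 1024 / region_gb
by_regions = regions / regions_per_server
by_disk = storage_tb / disk_per_node_tb

print(f"{storage_tb} TB on disk across ~{regions:.0f} regions")
print(f"servers needed: {max(by_regions, by_disk):.0f} "
      "(the larger of the region-count and disk-capacity bounds)")
```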
There has been an explosion of data digitising our physical world – from cameras, environmental sensors, and embedded devices, right down to the phones in our pockets. This means that companies now have new ways to transform their businesses, both operationally and through their products and services, by leveraging this data and applying fresh analytical techniques to make sense of it. But are they ready? In most cases, the answer is "no".
In this session, we’ll be discussing the challenges facing companies trying to embrace the Analytics of Things, and how Teradata has helped customers work through and turn those challenges to their advantage.
In this talk, we will present a new distribution of Hadoop, Hops, that can scale the Hadoop Filesystem (HDFS) by 16x, from 70K ops/s to 1.2 million ops/s on Spotify's industrial Hadoop workload. Hops is an open-source distribution of Apache Hadoop that supports distributed metadata for HDFS (HopsFS) and for the ResourceManager in Apache YARN. HopsFS is the first production-grade distributed hierarchical filesystem to store its metadata, normalized, in an in-memory, shared-nothing database. For YARN, we will discuss optimizations that enable a 2x throughput increase for the Capacity Scheduler, enabling scalability to clusters with more than 20K nodes. We will discuss the journey of how we reached this milestone, including some of the challenges involved in efficiently and safely mapping hierarchical filesystem metadata state and operations onto a shared-nothing, in-memory database. We will also discuss the key database features needed for extreme scaling, such as multi-partition transactions, partition-pruned index scans, distribution-aware transactions, and the streaming changelog API. Hops (www.hops.io) is Apache-licensed open source and supports a pluggable database backend for distributed metadata, although it currently only supports MySQL Cluster as a backend. Hops opens up new directions for Hadoop once metadata is available for tinkering in a mature relational database.
In high-risk manufacturing industries, regulatory bodies stipulate continuous monitoring and documentation of critical product attributes and process parameters. At the same time, sensor data coming from production processes can be used to gain deeper insights into optimization potential. By establishing a central production data lake based on Hadoop and using Talend Data Fabric as the basis for a unified architecture, the German pharmaceutical company HERMES Arzneimittel was able to cater to compliance requirements as well as unlock new business opportunities, enabling use cases like predictive maintenance, predictive quality assurance, and open-world analytics. Learn how Talend Data Fabric enabled HERMES Arzneimittel to become data-driven and transform Big Data projects from challenging, hard-to-maintain hand-coded jobs into repeatable, future-proof integration designs.
Talend Data Fabric combines Talend products into a common set of powerful, easy-to-use tools for any integration style: real-time or batch, big data or master data management, on-premises or in the cloud.
While you might be tempted to assume data is already safe in a single Hadoop cluster, in practice you have to plan for more. Questions like "What happens if the entire datacenter fails?" or "How do I recover into a consistent state of data, so that applications can continue to run?" are not at all trivial to answer for Hadoop. Did you know that HDFS snapshots do not treat open files as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross-region updates (which includes tables)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic, as most customers are still in the "single use-case" or PoC phase, where data governance as far as backup and disaster recovery (BDR) is concerned is not (yet) important. This talk first introduces the overarching issue and difficulties of backup and data safety, looks at each of the many components in Hadoop, including HDFS, HBase, YARN, Oozie, the management components and so on, and finally shows a viable approach using built-in tools. You will also learn not to take this topic lightly and what is needed to implement and guarantee continuous operation of Hadoop cluster based solutions.
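As a taste of the built-in tooling such a talk surveys, here is a minimal sketch that scripts the two snapshot primitives mentioned above. The paths, table names, and naming scheme are illustrative assumptions, not prescriptions from the talk.

```python
# Sketch: one nightly building block of a backup strategy -- HDFS and HBase
# snapshots. Paths and table names below are hypothetical.
import subprocess
from datetime import datetime

stamp = datetime.now().strftime("%Y%m%d")

def run(cmd):
    """Run a shell command and fail loudly, so a broken snapshot aborts the job."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# HDFS: the directory must be made snapshottable once by an admin.
run(["hdfs", "dfsadmin", "-allowSnapshot", "/data/warehouse"])
run(["hdfs", "dfs", "-createSnapshot", "/data/warehouse", f"snap-{stamp}"])

# HBase: snapshots are taken via the shell; as the talk warns, they are
# asynchronous across region servers, so not atomic for cross-region updates.
hbase_cmd = f"snapshot 'orders', 'orders-snap-{stamp}'"
subprocess.run(["hbase", "shell", "-n"], input=hbase_cmd.encode(), check=True)
```

Keep the caveats above in mind: HDFS snapshots do not freeze open files and HBase snapshots are not atomic, so a script like this is only one building block of a real BDR strategy.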
Epistemic Interaction - tuning interfaces to provide information for AI support – Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
PHP Frameworks: I want to break free (IPC Berlin 2024) – Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
The Metaverse and AI: how can decision-makers harness the Metaverse for their... – Jen Stirrup
The Metaverse is popularized in science fiction, and now it is becoming closer to being a part of our daily lives through the use of social media and shopping companies. How can businesses survive in a world where Artificial Intelligence is becoming the present as well as the future of technology, and how does the Metaverse fit into business strategy when futurist ideas are developing into reality at accelerated rates? How do we do this when our data isn't up to scratch? How can we move towards success with our data so we are set up for the Metaverse when it arrives?
How can you help your company evolve, adapt, and succeed using Artificial Intelligence and the Metaverse to stay ahead of the competition? What are the potential issues, complications, and benefits that these technologies could bring to us and our organizations? In this session, Jen Stirrup will explain how to start thinking about these technologies as an organisation.
GraphRAG is All You need? LLM & Knowledge Graph – Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs – Alex Pruden
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor... – SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
The new frontiers of AI in RPA with UiPath Autopilot™ – UiPathCommunity
In this free online event, organized by the Italian UiPath Community, you can explore the new features of Autopilot, the tool that integrates Artificial Intelligence into the development and use of Automations.
📕 Together we will look at some examples of using Autopilot in different tools of the UiPath Suite:
Autopilot per Studio Web
Autopilot per Studio
Autopilot per Apps
Clipboard AI
GenAI applied to Document Understanding
👨🏫👨💻 Speakers:
Stefano Negro, UiPath MVPx3, RPA Tech Lead @ BSP Consultant
Flavio Martinelli, UiPath MVP 2023, Technical Account Manager @UiPath
Andrei Tasca, RPA Solutions Team Lead @NTT Data
Removing Uninteresting Bytes in Software Fuzzing – Aftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speed up fuzzing campaigns by pinpointing and eliminating uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean, optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
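For intuition, here is a toy greedy seed trimmer in the spirit of what DIAR automates. It is not DIAR's actual algorithm (which the paper defines), and the `coverage` oracle (for example, a wrapper around afl-showmap) is an assumed callable.

```python
# Naive greedy seed trimming: drop chunks whose removal leaves the coverage
# fingerprint unchanged. Purely illustrative -- NOT DIAR's algorithm.

def trim_seed(seed: bytes, coverage, chunk: int = 16) -> bytes:
    """coverage(data) must return a hashable fingerprint of the paths hit."""
    baseline = coverage(seed)
    trimmed = bytearray(seed)
    offset = 0
    while offset < len(trimmed):
        candidate = trimmed[:offset] + trimmed[offset + chunk:]
        if coverage(bytes(candidate)) == baseline:
            trimmed = bytearray(candidate)   # chunk was uninteresting: drop it
        else:
            offset += chunk                  # chunk matters: keep it, move on
    return bytes(trimmed)
```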
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Accelerate your Kubernetes clusters with Varnish Caching – Thijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Welcome to the first live UiPath Community Day Dubai! Join us for this unique occasion to meet our local and global UiPath Community and leaders. You will get a full view of the MEA region's automation landscape and the AI Powered automation technology capabilities of UiPath. Also, hosted by our local partners Marc Ellis, you will enjoy a half-day packed with industry insights and automation peers networking.
📕 Curious on our agenda? Wait no more!
10:00 Welcome note - UiPath Community in Dubai
Lovely Sinha, UiPath Community Chapter Leader, UiPath MVPx3, Hyper-automation Consultant, First Abu Dhabi Bank
10:20 A UiPath cross-region MEA overview
Ashraf El Zarka, VP and Managing Director MEA, UiPath
10:35 Customer Success Journey
Deepthi Deepak, Head of Intelligent Automation CoE, First Abu Dhabi Bank
11:15 The UiPath approach to GenAI with our three principles: improve accuracy, supercharge productivity, and automate more
Boris Krumrey, Global VP, Automation Innovation, UiPath
12:15 Discover how Marc Ellis leverages tech-driven solutions in recruitment and managed services.
Brendan Lingam, Director of Sales and Business Development, Marc Ellis
Smart TV Buyer Insights Survey 2024 – 91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Securing your Kubernetes cluster: a step-by-step guide to success! – KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
SAP Sapphire 2024 - ASUG301 Building better apps with SAP Fiori – Peter Spielvogel
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
Enhancing Performance with Globus and the Science DMZ – Globus
ESnet has led the way in helping national facilities—and many other institutions in the research community—configure Science DMZs and troubleshoot network issues to maximize data transfer performance. In this talk we will present a summary of approaches and tips for getting the most out of your network infrastructure using Globus Connect Server.
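For a flavour of what "getting the most out of your network" looks like in practice, here is a minimal globus-sdk sketch that submits a recursive transfer between two endpoints. The endpoint UUIDs, paths, and token handling are placeholders, not values from the talk.

```python
# Minimal Globus transfer sketch. The token and endpoint UUIDs below are
# hypothetical; obtain real ones via Globus Auth and your Globus Connect
# Server deployments.
import globus_sdk

TOKEN = "..."                                   # a transfer-scoped access token
SRC = "ddb59aef-6d04-11e5-ba46-22000b92c6ec"    # hypothetical source endpoint
DST = "ddb59af0-6d04-11e5-ba46-22000b92c6ec"    # hypothetical destination

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(TOKEN))

# checksum sync level re-verifies data, trading speed for integrity
tdata = globus_sdk.TransferData(tc, SRC, DST,
                                label="science-dmz-demo",
                                sync_level="checksum")
tdata.add_item("/project/raw/", "/ingest/raw/", recursive=True)

task = tc.submit_transfer(tdata)
print("task id:", task["task_id"])
```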
Free Complete Python - A step towards Data Science
ING- CoreIntel- Collect and Process Network Logs Across Data Centers in Real Time
1. Unleashing the power of Apache Atlas
with Apache Ranger
Virtual Data Connector Project
NIGEL JONES
JONESN@UK.IBM.COM
DATAWORKS, MUNICH, APRIL 2017
Apache®, Apache Atlas, Apache Ranger & other Apache project names referenced are either registered trademarks or trademarks of
the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation
is implied by the use of these marks.
2. About Me – Nigel Jones
https://www.linkedin.com/in/nigelljones/
jonesn@uk.ibm.com (Anyone still use email?)
@planetf1 – noisy, f1, electric vehicles, food & drink …. A split of work/life
accounts didn’t work for me!
And of course the Apache Atlas & Ranger mailing lists & JIRA!
Science fan at school & uni. It was cloud chambers back then… now just the cloud
IBM Hursley, UK since 1990
Last 3 years focus on Data Lake, Information Governance, Open Metadata
4. Data?
What data do I have?
What does it mean?
Where is it?
Who has access to it?
Who owns it?
What quality is it?
How does it relate to other data?
How do I control, audit & understand access?
5. Regulatory needs
Adhere to regulations like BCBS-239 and GDPR
Need to know meaning, value of the data
Demonstrate processes in place to govern access
Audit
Significant fines if rules breached
Whilst ensuring easy, ready access to appropriate data for data professionals to
support an agile business
7. Metadata..
Metadata enables data to be used outside of the application that created it.
Analytics and decision making
New business applications
Reporting and compliance
Metadata describes the format and content of data allowing people to judge which
dataset to use for a new project
Structure
Meaning
Origin
Valid values and quality
Usage and ownership
Regulations and classifications that apply
Metadata describes the business context and classification of data allowing automated
governance processes to operate.
8. Which can support…
An enterprise data catalogue that lists all data including where it is, what it
is, who owns it, its meaning, quality, where it came from, and can fully
describe its business context & how the data should be governed….
Subject Matter experts searching, collaborating, feeding back about their
data needs and use
Automated governance actions to protect and manage including auditing,
monitoring, quality control, rights management
9. But easily…
Open frameworks & APIs
Automatic collection & discovery of metadata in a dynamic heterogeneous
environment
Using predefined standards for glossaries, schemas, rules, regulations to
reduce cost
Cheap to integrate new tools
No proprietary lock-in & assumptions that all tools are from one suite or
vendor
Avoiding silos
Distributed and Open
12. Data virtualization project
Collaboration – IBM, several banks & open community
A Data Lake environment
Not just Hadoop, but other sources too
Business Terms, Classifications, Metadata rich
Offer virtualized views. Expose relational data with business terms
Manage Access to resources – permit, deny, log, filter/mask …. THROUGH METADATA
Open, pluggable
Working through use cases, design, initial MVP (this year)
Critique and feedback are welcomed. We're looking for guidance and support from the Atlas
& Ranger communities, as well as to contribute our ideas
Proposed changes all go through mailing list and JIRA for feedback
13. Apache Atlas
“Atlas is a scalable and extensible set of core foundational governance
services – enabling enterprises to effectively and efficiently meet their
compliance requirements within Hadoop and allows integration with the
whole enterprise data ecosystem.” …. http://www.apache.org
Open Community -- Apache Incubator since May 2015
Type agnostic metadata store
REST API & UI
Supports many Hadoop components including HBase, Hive, Sqoop, Storm
& others
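To make the REST API above concrete, here is a hedged Python sketch using Atlas's v2 basic search to list Hive tables that carry a given classification. The host, credentials, and the "PII" tag are assumptions for illustration.

```python
# Sketch: Atlas v2 basic search for hive_table entities tagged PII.
# Host and basic-auth credentials are placeholders; kerberized clusters
# would use SPNEGO instead.
import requests

ATLAS = "http://atlas-host:21000"
AUTH = ("admin", "admin")

resp = requests.post(
    f"{ATLAS}/api/atlas/v2/search/basic",
    auth=AUTH,
    json={"typeName": "hive_table", "classification": "PII", "limit": 25},
)
resp.raise_for_status()
for entity in resp.json().get("entities", []):
    print(entity["guid"], entity.get("attributes", {}).get("qualifiedName"))
```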
14. Apache Ranger
Centralized security administration to manage all security related tasks in a
central UI or using REST APIs.
Fine grained authorization to do a specific action and/or operation with
Hadoop component/tool and managed through a central administration
tool
Standardize authorization method across all Hadoop components.
Enhanced support for different authorization methods - Role based access
control, attribute based access control etc.
Centralize auditing of user access and administrative actions (security
related) within all the components of Hadoop.
… from http://ranger.apache.org
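The central administration REST APIs look roughly like this in practice: a sketch that creates a resource-based Hive policy through Ranger's public v2 API. The host, credentials, service name, and group are assumptions.

```python
# Sketch: create a Ranger policy granting a group select access on one
# Hive table via the public v2 REST API. All names are hypothetical.
import requests

RANGER = "http://ranger-host:6080"
AUTH = ("admin", "admin")

policy = {
    "service": "cl1_hive",                     # hypothetical Hive repo name
    "name": "hr_readonly_salaries",
    "resources": {
        "database": {"values": ["hr"]},
        "table":    {"values": ["salaries"]},
        "column":   {"values": ["*"]},
    },
    "policyItems": [{
        "accesses": [{"type": "select", "isAllowed": True}],
        "groups": ["hr_analysts"],
    }],
}

resp = requests.post(f"{RANGER}/service/public/v2/api/policy",
                     auth=AUTH, json=policy)
resp.raise_for_status()
print("created policy id:", resp.json()["id"])
```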
15. Project Interactions
[Diagram: a search/reporting tool interacts with GaianDB to search for a list of assets by metadata, search for data, and obtain data to draw reports. GaianDB manages logical views over the underlying data (SQL, Hive, HDFS, Oracle, Netezza, etc., across Hadoop and RDBMS sources) and carries a Ranger plugin to permit/deny and mask access. Apache Atlas deploys rules, pushes classifications, and acts as the source for user roles (not users); Apache Ranger pulls rules and classifications; Apache Solr provides indexing for search.]
16. Why Atlas and Ranger?
Open Source essential to forming an active ecosystem
Vision, active community & evolving – ability to contribute & work with
others to provide the best solution
Already have good core capabilities
Atlas type system is very flexible
Ranger offers a range of policy types and provides a pluggable framework
Already cross project integration
Use of tag-based policies in Ranger sourced from Atlas
Can be used independently of full Hadoop stack
17. Refined virtual connector scope
[Diagram of the MVP scope: the Virtualizer extracts physical metadata from the data lake repositories (Oracle, Netezza, Hive tables), manages logical tables in GaianDB, and pushes and queries metadata in Atlas, with pre/post hooks around view creation. GaianDB carries a Ranger plugin that polls policies from the Ranger server; Ranger config (e.g. policies, audit log location) is fed by tag-sync and rule-sync, users come from LDAP, and access is recorded in an audit log. Atlas comprises a Titan graph metadata repository fronted by OMRS and OMAS layers, with a mapper to IGC; search and reporting tools push and query metadata, and external catalogs such as Navigator and Datameer hold their own metadata.]
18. GaianDB & Virtualizer
GaianDB
Open Source
Federated, self learning, dynamic configuration
Based on Apache Derby
Already had “policy” support – we’re plugging in
Ranger for this project
Virtualizer
Listens to event notifications on assets etc
Creates view definitions in GaianDB and uses new Atlas APIs
to store metadata. Could use a different virtualization engine.
Designed to be open to other virtualization
technologies.
[Diagram: logical tables LT1 and LT2 in GaianDB map onto data sources DS1, DS2 and DS3, with a policy plugin (Ranger) governing access and the Virtualizer wired to Atlas. GaianDB supports federation – not used for the MVP.]
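Since GaianDB surfaces its logical tables over Derby's JDBC protocol, a consumer can query them like any other database. Below is a sketch using the JayDeBeApi bridge; port 6414 and the gaiandb database name follow GaianDB's defaults, while the credentials, jar path, and LT1 table are assumptions.

```python
# Sketch: query a GaianDB logical table from Python over JDBC.
# Host, credentials, jar location, and the LT1 table are hypothetical.
import jaydebeapi

conn = jaydebeapi.connect(
    "org.apache.derby.jdbc.ClientDriver",
    "jdbc:derby://gaian-host:6414/gaiandb",
    ["gaiandb", "passw0rd"],                 # assumed default credentials
    "/opt/gaiandb/lib/derbyclient.jar",
)
try:
    cur = conn.cursor()
    # A logical table federates whichever physical sources were mapped to it;
    # the Ranger plugin decides per-user whether rows come back filtered/masked.
    cur.execute("SELECT * FROM LT1 FETCH FIRST 5 ROWS ONLY")
    for row in cur.fetchall():
        print(row)
finally:
    conn.close()
```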
19. Atlas – glossary enhancements
Get Atlas closer to parity with commercial offerings
Business Terms – categories, category hierarchies
Has-a, is-a, type-of, synonym, antonym, arbitrary relationships
Assets mapped to Business Terms
Classifications
Hierarchy
Navigable mappings to retain ability to flatten tags to ranger
Instead of hive column EMP_SALARY -> SPI, now can be EMP_SALARY -> SALARY ->
SPI …
Used to drive governance
ATLAS-1410
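A toy illustration of the flattening described above: walk from the asset's term up the hierarchy and emit every node as a flat tag, so Ranger's existing tag handling keeps working unchanged. The dicts stand in for the Atlas type system; the names come from the slide's example.

```python
# Toy model of flattening a glossary hierarchy to Ranger-visible tags.
# EMP_SALARY is a hive column mapped to the SALARY term, which is
# classified SPI -- as in the slide's example.
COLUMN_TERM = {"EMP_SALARY": "SALARY"}        # asset -> business term
TERM_PARENT = {"SALARY": "SPI", "SPI": None}  # term/classification hierarchy

def flat_tags(column: str) -> set:
    """Collect the column's term and all of its ancestors as flat tags."""
    tags, node = set(), COLUMN_TERM.get(column)
    while node is not None:
        tags.add(node)
        node = TERM_PARENT.get(node)
    return tags

print(flat_tags("EMP_SALARY"))   # {'SALARY', 'SPI'} -- what tagsync would see
```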
20. Atlas – other enhancements
Consumer Centric APIs
Open Metadata Access Services (OMAS)
REST & more Kafka notifications
Asset, Catalog, Connector, Glossary, Governance Action, Governance Definitions,
Information View, Roles and Access
Repository level APIs
Open Metadata Repository Services (OMRS)
REST & more Kafka notifications
Pluggability through an Open Connector Framework to other metadata repositories
– distributed and Open
Standard data model/core
Enhancement to core model – versioning, external linkage etc
More standard types, i.e. for all relational databases, to ease sharing
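The Kafka notification side can be consumed like any topic. Here is a sketch with kafka-python reading Atlas's ATLAS_ENTITIES topic; the broker address and consumer group are assumptions, and the exact envelope fields vary by Atlas version.

```python
# Sketch: listen for Atlas entity change notifications, e.g. so a
# Virtualizer can react to newly created assets. Broker and group id
# are hypothetical; message field names vary by Atlas version.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "ATLAS_ENTITIES",
    bootstrap_servers="kafka-host:9092",
    group_id="vdc-virtualizer",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for msg in consumer:
    event = msg.value.get("message", msg.value)
    print(event.get("operationType"),
          event.get("entity", {}).get("typeName"))
```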
21. Ranger areas being looked at
Building a plugin for GaianDB
Access control, simple masking. More later
User synchronization (large #users, role of Atlas)
Changes to tag sync process for New glossary proposal
As more metadata goes into Atlas, it becomes source for generation of
some kinds of policies. Where is the master?
Generating ranger rules from governance definitions
How about control of access to Atlas itself?
Aside: Interfaces used by enforcement engines (such as to get classification
data) need to be efficient – these should work for projects like Apache
Sentry as well as Atlas
22. Beyond the MVP
Open Discovery Framework
Consider other security enforcement engines – such as Apache Sentry &
driving more capability around rules & governance actions from Atlas
metadata
Work on standard models to support different domains
Lineage
From high level design lineage through to operational detail. Logs vs graph….
API metadata
Infrastructure – JanusGraph…
Abstraction added by IBM in the last few months for Titan 1
23. The vision
An enterprise data catalog that lists all of your data, where it is located, its origin (lineage), owner, structure, meaning, classification and quality
Spanning systems both on premise and cloud providers
Hosted locally to your data platforms but integrated to provide the enterprise view
New data tools (from any vendor) connect to your data catalog out of the box
No vendor lock-in; nor expensive population of yet another proprietary siloed metadata repository
Metadata is added automatically to the catalog as new data is created
Extensible discovery processes characterise and classify the data
Interested parties and processes are notified
Subject matter experts collaborating around the data
Locate the data they need, quickly and efficiently
Feed back their knowledge about the data and the uses they have made of it to help others and support economic evaluation of data
Automated governance processes protect and manage your data
Metadata-driven access control
Auditing, metering and monitoring
Quality control and exception management
Rights management
Predefined standards for glossaries, data schemas, rules and regulations that reduce the cost of doing business
Open frameworks and APIs for collaborating with universities, traditional vendors and new innovators around data and advanced analytics
24. Summary
Atlas can help us have an industry wide common metadata platform around
which a vibrant ecosystem can evolve
Not only in Hadoop but more broadly
Metadata driven governance can be scalable & enable us to manage our data
better, and be compliant with regulations
The ideas presented here resonate with many people we’ve spoken to
Get involved! I’d love to hear the feedback on this approach!
Comment on the JIRAs, ask questions, contribute, disagree… ;-)
Look at JIRA Tag “VirtualDataConnector” or start at ATLAS-1689
Atlas wiki
“Innovation happens best not in isolation but in collaboration” (keynote)
THANKS!
27. Atlas Virtualizer Architecture
[Diagram: a Search/Explore UI drives Search, Catalog, Virtual Asset and GAF OMAS interfaces, which sit via OMRS on the Atlas graphDB; a Connector Framework (possibly hardcoded at first) links out to IGC over the IGC REST API and to "gaiandb", which reaches Oracle, HDFS and Netezza data through P-JDBC connectors; GAF pre and post steps surround the flow. The legend marks the Atlas boundaries and distinguishes components developed in the POC from those that may not be in the POC initially.]
28. Metadata areas and types
[Diagram of metadata areas and types, numbered 1–7: infrastructure and systems rollout; physical asset descriptions (data stores, APIs, models and components), asset collections (sets, typed sets, type-organized sets), information views, reference data and connector directories; business objects and relationships, taxonomies and ontologies, business attributes, organization, classification schemes, classification strategy and subject area definitions; policy metadata (principles, regulations, standards, approaches, rule specifications, roles and metrics) driving governance actions and processes, rights management and access; models and schemas, with mapping, implementation and augmentation links; discovery metadata (profile data, technical classification, data classification, data quality assessment, …); and information process instrumentation (design lineage), with instrument association. Teaming metadata (people profiles, communities, projects, notebooks, …) and feedback metadata (tags, comments, ratings, …) connect roles such as information auditor, integration developer, business analyst, data scientist, information worker, information owner, information governor, information steward, data quality analyst and information curator, alongside campaigns and projects.]
29. User & Group/Role synchronization
UserSync2
LDAP holds role-membership (LDAP groups) – could also be Active Directory
ATLAS manages the definitive list of roles (those that are used for Atlas-managed sources)
• Corporate LDAP has a huge number of users/groups
• Ranger currently needs to sync all of them
• In future perhaps we establish group/role membership during authentication
• Capability for an alternative source could be merged into the base UserSync
[Diagram: the sync process asks Apache Atlas for the role list via the Governance Action OMAS (getRoles), resolves membership with an LDAP lookup (group:member), and uploads the result to Apache Ranger.]
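A minimal sketch of that role-scoped lookup, using the ldap3 library: resolve only the roles Atlas reports, rather than syncing the whole corporate directory. The DN layout, attribute names, and role list are assumptions.

```python
# Sketch: resolve membership for Atlas-sourced roles only. Server, DNs,
# credentials, and the objectClass are hypothetical.
from ldap3 import Server, Connection

atlas_roles = ["hr_analysts", "data_stewards"]   # e.g. via Governance Action OMAS getRoles

conn = Connection(Server("ldaps://ldap-host"),
                  user="cn=svc-ranger,ou=services,dc=example,dc=com",
                  password="secret", auto_bind=True)

role_members = {}
for role in atlas_roles:
    conn.search("ou=groups,dc=example,dc=com",
                f"(&(objectClass=groupOfNames)(cn={role}))",
                attributes=["member"])
    if conn.entries:
        role_members[role] = [str(dn) for dn in conn.entries[0].member]

print(role_members)   # role -> member DNs, ready to upload to Ranger
```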
30. Atlas Glossary v2: Tag Sync to Ranger
TagSync2
The ATLAS glossary manages a sophisticated enterprise glossary structure
• Atlas Glossary v2 proposed in ATLAS-1410 (David Radley); sync builds on the existing tagsync approach
• A new API in Atlas will flatten the classification structure
• No changes to Ranger – but exposing richer classification could be an area of future work
[Diagram: in Apache Atlas, the Hive column emp_renum maps to the business term Salary, which carries the Confidential classification; after flattening via the Governance Action OMAS, Apache Ranger simply sees the tag Confidential on the emp_renum Hive column.]
31. Policy (Rule) synchronization
RuleSync
• Generate policies in Ranger based off entities in Atlas
• Currently designing how this works
• Scoped by policy service so the existing Ranger UI approach still works
[Diagram: the Governance Action OMAS (getRules) exposes role, classifications, asset and action from Apache Atlas, from which a Ranger rule is generated in Apache Ranger.]
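One plausible shape for rule-sync, sketched below: map an Atlas-side governance definition (role, classification, action) onto a tag-based policy pushed through Ranger's public REST API. The tag service name, host, credentials, and access type are assumptions; the slide says this design is still being worked out, so this is an illustration of the idea, not the project's implementation.

```python
# Sketch: generate a tag-based Ranger policy from an Atlas governance
# definition. Service and host names are hypothetical.
import requests

RANGER = "http://ranger-host:6080"
AUTH = ("admin", "admin")

def atlas_rule_to_ranger_policy(role: str, classification: str) -> dict:
    """'role may select assets classified X' -> tag-based policy dict."""
    return {
        "service": "cl1_tag",                 # hypothetical tag-service name
        "name": f"{classification}_{role}_select",
        "resources": {"tag": {"values": [classification]}},
        "policyItems": [{
            "accesses": [{"type": "hive:select", "isAllowed": True}],
            "groups": [role],
        }],
    }

policy = atlas_rule_to_ranger_policy("hr_analysts", "SPI")
resp = requests.post(f"{RANGER}/service/public/v2/api/policy",
                     auth=AUTH, json=policy)
resp.raise_for_status()
```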
32. VirtualDataConnector JIRAs 20170402
RANGER-1488 – Create Ranger plugin for gaiandb
RANGER-1487 – Generate rules from governance definitions in Atlas
RANGER-1486 – New usersync alternative for Atlas (vdc)
RANGER-1485 – Ranger support for Virtual Data Connector Project (ATLAS)
RANGER-1464 – Support Atlas v2 glossary in Atlas plugin (for access control to terms etc)
RANGER-1454 – Support of Atlas v2 glossary API proposal for tag source
RANGER-1234 – Post-evaluation phase user extensions
RANGER-1186 – Ranger Source: eclipse
RANGER-1168 – Add data masking for tag based policies
ATLAS-1696 – Governance Action Framework OMAS
ATLAS-1694 – Sample assets to support Virtual Connector Project
ATLAS-1691 – OMAS Interfaces for Atlas
ATLAS-1158 – Build ATLAS using Docker
ATLAS-520 – Temporal / versioning support for types, traits, entities ...
ATLAS-519 – Metrics
ATLAS-455 – Timeouts in tests should be configurable from system property
ATLAS-197 – Add build instructions in top level dir
33. References
Apache Atlas - http://atlas.apache.org/
Top level JIRA for this activity: https://issues.apache.org/jira/browse/ATLAS-1689
Apache Ranger - http://ranger.apache.org/
GaianDB
https://github.com/gaiandb/gaiandb
https://developer.ibm.com/open/openprojects/gaian-database/
The case for open metadata – A.M.Chessell
http://www.ibmbigdatahub.com/blog/case-open-metadata
Editor's Notes
This is the nirvana. Many tools from different teams – open or proprietary – all able to exchange metadata easily.
A new tool can easily understand existing metadata, can integrate with minimal effort
GaianDB is an open source project from IBM that is based on Apache Derby and supports a highly distributed model with self-learning/healing capabilities. It virtualizes access to underlying data sources – for example, a virtual table may be surfaced via JDBC that is actually based on a combination of a CSV file and another relational database. We are using it in the Virtual Data Connector project to provide a single point of control via a Ranger plugin, as well as to do some data source mappings such as hiding technical columns from view or renaming columns with more business-like terms gleaned from the glossary (Atlas)
This is broadly the scope of an MVP definition we're using to focus our initial work this year. We have use cases we can share with anyone interested, and will be capturing that info in the Atlas/Ranger JIRAs and potentially the wiki. The list of metadata repositories is an example. Our MVP sources some metadata from IGC since that is being used by some participants, but the focus is on open interfaces and Atlas. The other repositories are potential ideas only. Similarly…
It’s important to architect this in an open way. The rules used to decide when to virtualize a resource need to be pluggable – perhaps, for example, all data arriving in a particular DataLake zone will be a virtualization candidate. Further, the actual technology needs to be changeable – proprietary or other open projects – for example perhaps Presto is a candidate. Ideas welcome – proposals will be shared in Jira
OMAS = Open Metadata Access Services – these are consumer-centric interfaces, so they pass objects suited to a particular consumer – for example Ranger in the case of the Governance Action OMAS, or perhaps a catalog UI for the Catalog services. Each consumer has different needs in terms of object structure, or whether it deals with individual objects or sets, and this can differ from the model used in the underlying repository. For some interfaces this mapping is simple, for others more complex. OMRS = Open Metadata Repository Services. This refers to the core repository, i.e. the Atlas type system. We see other metadata repositories adopting the same UI, and are proposing a mechanism that will allow these to be plugged in. Metadata can change rapidly, and the only scalable approach is to ensure it’s open & distributed. Contributions from other metadata server authors welcome! Note that these are our working names. Fundamentally they are Atlas, and so the Atlas community will together need to agree on the actual names moving forward.
A standard data model – in addition to a common mechanism/server – is necessary to make it much easier to *understand* the metadata we store. Whilst there will always be a need for extensions, having a good base object definition will make application integration easier. For example we might all wish to describe a RDBMS in a very similar way. We can then go on from this to have more standards oriented around industry models
GaianDB ranger plugin – GaianDB already has the capability to have a policy plugin which governs access to its virtual tables. To integrate with Ranger and Atlas we will have a Ranger-style plugin. Whilst this will function like any other Ranger plugin, in addition policies will be generated from Atlas itself. User synchronization challenges are described later. In summary, in an enterprise environment there may be many users (100k+) in LDAP, but only a small number have access to the virtualization infrastructure. We’re going to key the user sync off the list of user roles found in Atlas itself, and then obtain the role membership from LDAP. This will then be uploaded to Ranger as per the existing usersync. Tag Sync – the glossary enhancements provide additional structure in how Business Terms, Classifications & assets are linked. A new Atlas API will flatten this structure and thus preserve Ranger’s ability to use Atlas tags as today. In future there may be a re-evaluation to see if this more sophisticated approach should be pushed to Ranger too. Policies – since Atlas now has richer metadata including information about asset ownership, high-level governance policies and data classification, rules may be generated in some cases from Atlas, or from a new rule-sync process. This is still being worked through and we’ll share our ideas with the community. Openness – some users I’ve spoken to are interested in Atlas but may currently use other technologies for enforcement, including Sentry in Hadoop. The intent is to ensure all the interfaces defined are open to all, and useful… so that should someone wish, they could just as well integrate with Sentry as Ranger. This loosely coupled approach helps support an innovative, exciting ecosystem
Atlas holds metadata around user roles – they are used to define governance rules… In keeping with Ranger’s process to synchronize users & groups we will source these slightly differently, though this is mostly simply a scoping exercise to avoid pulling everything from LDAP. One consideration for the future is whether Ranger needs to sync users/groups at all – whilst the sync can help with typeahead when manually defining policies, it’s of relatively little use at runtime if instead plugins could pull the current user role membership from LDAP or elsewhere after connection. Possibly for another JIRA
Ranger already does tag synchronization with Atlas, but changes will be needed to support the new glossary capabilities
A new tagsync process will likely be implemented so that either old or new can be used to avoid any breakage for existing users
Currently working through how this may work, but fundamentally we can define the governance rules in Atlas, and likely generate executable rules in ranger. Refer to the JIRAs for ongoing design on this area
In no particular order and an example only – a query for all JIRAs against Atlas and Ranger with label = ‘VirtualDataConnector’ as of 2 April 2017. This is a list of issues we’re interested in, in particular. The root JIRA for our current design work is ATLAS-1689, which it appears we forgot to tag. There are others too, so please rerun the query!