This document discusses authentication, authorization, and application integration considerations for a large company implementing a self-service and secure multitenant data lake on Hadoop. It describes three approaches to integrating the data lake with the company's existing identity and access management system and evaluates the tradeoffs of each. It also examines options for authorization controls in Hadoop, methods for applications to authenticate to the data lake, and how applications can access data on behalf of users while honoring each user's permissions. The goal is to provide analytics capabilities to users while maintaining security, compliance, and governance.
How to Achieve a Self-Service and Secure Multitenant Data Lake in a Large Company with Strict IT Security and Diverse Analytics Use Cases
1. Copyright 2016-2018 Northrop Grumman Systems Corp.
How to Achieve a
Self-Service and Secure
Multitenant Data Lake in a
Large Company
June 2018
Leon Li, PhD
Platform Architect
2.
About Northrop Grumman
Leading global security company
Approximately 70,000 employees in all 50 states and 25+ countries
Technology Heritage
• 1927 - Spirit of St. Louis, which Charles Lindbergh flew across the Atlantic
• 1946 - First flight of the XB-35 flying wing
• 1953 - Contract to oversee the U.S. Air Force ICBM program
• 1958 - Pioneer 1 becomes the first spacecraft built by an industrial contractor
• 1969 - Apollo Lunar Module carries man to the surface of the moon
• 1983 - Pioneer 10 becomes the first manmade object to leave the solar system
• 1989 - First flight of the B-2 stealth bomber, a descendant of the XB-35 flying wing design
• 1998 - First flight of the RQ-4A Global Hawk, a high-altitude, long-endurance unmanned aerial
reconnaissance system
• NASA’s James Webb Space Telescope: a space telescope with unprecedented resolution and sensitivity, built to
observe the most distant events and objects in the universe
3.
Northrop Grumman’s Enterprise Data Analytics Platform
• Platform provides analytics capabilities quickly and simply,
allowing users to focus on their needs instead of
infrastructure
• Provides capabilities from basic data handling and
reporting through big data and state-of-the-art machine
learning
• Single, large shared Hadoop cluster with multitenant
security as Data Lake
6/20/2018
4.
Northrop Grumman Does Not Endorse Any of the Products Mentioned in this Presentation
Northrop Grumman’s participation in this user conference and mention of various products is not an endorsement of any product. This presentation is protected by copyright protections and may not be commercially used. The only permitted uses of this presentation are for personal, non-commercial uses of the user community.
5.
Architecture Decisions and Tradeoffs
1. Authentication and Accounts Management
2. Authorization and Access Control
3. Interfacing Analytics Applications with a Data Lake
6.
Authentication and Accounts Management
Hadoop
• Hadoop requires a large number of Kerberos accounts
– Each user uses a Kerberos account
– Each Hadoop service uses one Kerberos account per machine
• Secure creation of Hadoop service accounts and distribution of keytabs is crucial for security and operations
Enterprise Identity and Access Management (IAM)
• Enterprise IAM systems
o Provisioning and deprovisioning of accounts
o Management of user accounts
o Compliance with enterprise security policies
• Uses Kerberos as underlying technology
• Enterprises have existing IAM systems in place
• New systems for enterprise use need to meet existing IAM policies for security, compliance, and governance
[Diagram: User, Key Distribution Center (IAM), and Hadoop, connected by two "Authenticates" arrows]
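To make the account growth concrete, here is a minimal sketch that enumerates service principals under the one-account-per-service-per-machine rule from this slide. The `service/host@REALM` naming follows the usual Kerberos convention; the service names, host names, and realm are made-up placeholders, and the sketch assumes every service runs on every machine, which gives an upper bound for a real cluster.

```python
from itertools import product


def service_principals(services, hosts, realm):
    """Return one Kerberos principal name per (service, host) pair."""
    return [f"{svc}/{host}@{realm}" for svc, host in product(services, hosts)]


# Hypothetical cluster: 4 services on 3 machines.
services = ["hdfs", "yarn", "hive", "hbase"]
hosts = [f"node{i}.example.com" for i in range(1, 4)]
principals = service_principals(services, hosts, "HADOOP.EXAMPLE.COM")

print(len(principals))  # 12
print(principals[0])    # hdfs/node1.example.com@HADOOP.EXAMPLE.COM
```

Each of these principals needs a keytab delivered to its machine, which is why secure, automated creation and distribution matters as the cluster grows.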
7.
Approach 1 – Completely isolated accounts management
[Diagram labels: Enterprise Systems, Hadoop IAM, Enterprise IAM, Data Operations, User & Hadoop Service Accounts]
Benefits:
• No dependency on enterprise accounts management teams
• Faster standup of platform
• Data Lake team has full control over all accounts
• Your Hadoop management tools can easily generate and
distribute Hadoop service accounts
Disadvantages:
• Users have to remember separate usernames and passwords
• Must manage user accounts separately (e.g. provisioning, deprovisioning, passwords, account lockouts)
• Establishing separate user accounts for separate systems goes against security best practices
8.
Approach 2 – Store all accounts directly into
existing Enterprise IAM
[Diagram labels: Enterprise Systems, Enterprise IAM, Data Operations, Hadoop Service Accounts]
Benefits:
• Unified account management within enterprise IT organization
• Improved user experience with SSO
• Improved central auditing and compliance
• No need to manage a separate Hadoop IAM
– Reduces potential points of system failure
Disadvantages:
• Your Hadoop management tools may lack administrative access to Enterprise IAM systems, or may be unable to generate Hadoop service accounts that comply with Enterprise IAM team rules
• Increased dependency on the Enterprise IAM administrator team
• Recommendation: Consider OU (organizational unit) delegation to isolate service account generation and management for Hadoop
• Greater performance load on the Enterprise Directory Service as
Hadoop grows
9.
Approach 3 – Hybrid Approach: Separate service
accounts from user accounts using a domain trust
[Diagram labels: Enterprise Systems, Hadoop IAM, Enterprise IAM, Data Operations, Hadoop Service Accounts, Kerberos Realm Trust]
Benefits:
• Unified user account management within enterprise IT
organization
• Improved user experience with SSO
• Improved central auditing and compliance
• Easier administration and maintenance, as your Hadoop cluster management tools maintain control of Hadoop internal service accounts
Disadvantages:
• Kerberos trust setup can be complex and requires special skills
• If Enterprise IAM and Hadoop IAM run in mixed operating system environments, incompatibilities can occur
• Applications deployed to the Hadoop IAM realm may run into
“Kerberos double-hop” delegation issues when authenticating users
to Hadoop services
10.
Summary: Approaches to Hadoop – Enterprise IAM Integration
1) Completely isolated accounts management
2) Store all accounts directly into existing Enterprise IAM
3) Hybrid Approach: Separate service accounts from user accounts using a domain trust
11.
Architecture Decisions and Tradeoffs
1. Authentication and Accounts Management
2. Authorization and Access Control
3. Interfacing Analytics Applications with a Data Lake
12.
Hadoop Authorization Plugin Systems Architecture
Examples: Ranger, Sentry
[Diagram: an Administration Portal and Administration API above eleven components, each with its own plugin: HDFS, Hive, HBase, YARN, NiFi, Storm, Kafka, Knox, Solr, Impala, Atlas]
• Hadoop authorization systems add plugins to many Hadoop components to control authorizations
• An Administration Portal and Administration API provide the ability to control authorizations
* Not every authorization plugin system supports plugins for every Hadoop component
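The plugin pattern described on this slide can be sketched as a central policy store consulted by a small check embedded in each component. This is an illustrative model only, not the API of Ranger or Sentry; the class names, policy fields, and the `sales.orders` resource are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Policy:
    component: str      # e.g. "hive" or "hdfs"
    resource: str       # e.g. a table name or a path
    user: str
    actions: frozenset  # actions the user may perform


@dataclass
class PolicyStore:
    """Stands in for the central store behind the administration portal and API."""
    policies: list = field(default_factory=list)

    def add(self, policy):
        self.policies.append(policy)


class Plugin:
    """Stands in for the plugin embedded in one Hadoop component."""

    def __init__(self, component, store):
        self.component = component
        self.store = store

    def is_allowed(self, user, resource, action):
        # Deny by default; allow only if a matching policy exists.
        return any(
            p.component == self.component
            and p.resource == resource
            and p.user == user
            and action in p.actions
            for p in self.store.policies
        )


store = PolicyStore()
store.add(Policy("hive", "sales.orders", "alice", frozenset({"select"})))

hive_plugin = Plugin("hive", store)
hdfs_plugin = Plugin("hdfs", store)
print(hive_plugin.is_allowed("alice", "sales.orders", "select"))  # True
print(hive_plugin.is_allowed("alice", "sales.orders", "drop"))    # False
print(hdfs_plugin.is_allowed("alice", "sales.orders", "select"))  # False
```

The last call shows why plugin coverage matters: a policy defined for one component grants nothing in another, so a component without a plugin falls outside the central controls.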
14.
Managing Permissions for Hive and Spark
HDFS permissions for Hive and Spark
• Only use HDFS permissions for both Spark and Hive
• Benefits
o Simple security model, easier to implement
o Consistent user access for Hive and Spark
• Disadvantages
o No fine-grained controls like column-based security in Hive
HDFS permissions for Spark only
• Use Hive column-based security for Hive, and HDFS security for Spark
• Benefits
o Fine-grained controls like column-based security in Hive
• Disadvantages
o Spark access is granted separately
o More administrative complexity
o Must address discrepancy in access control between Hive and Spark (e.g. how to make sure fine-grained control is enforced for Spark users)
*Note: LLAP may improve this situation when it becomes fully enterprise ready
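The discrepancy named in the second option can be illustrated with a toy model: a column hidden from a user in Hive is still readable if the same user can read the underlying files from Spark. The table, column names, and grants below are invented for illustration and do not reflect any real policy engine.

```python
# Hypothetical policy data for one table stored as files in HDFS.
TABLE_COLUMNS = ["order_id", "amount", "customer_ssn"]
HIVE_COLUMN_GRANTS = {"alice": {"order_id", "amount"}}  # column-based security in Hive
HDFS_READERS = {"alice"}                                 # file-level read access used by Spark


def hive_visible_columns(user):
    """Columns the user may select through Hive column-based security."""
    granted = HIVE_COLUMN_GRANTS.get(user, set())
    return [c for c in TABLE_COLUMNS if c in granted]


def spark_visible_columns(user):
    """With only HDFS permissions, read access to the files exposes every column."""
    return list(TABLE_COLUMNS) if user in HDFS_READERS else []


def exposed_only_via_spark(user):
    """Columns hidden in Hive but reachable by reading the files from Spark."""
    hive = set(hive_visible_columns(user))
    return [c for c in spark_visible_columns(user) if c not in hive]


print(exposed_only_via_spark("alice"))  # ['customer_ssn']
```

A check like this, run against real grants, is one way to audit whether Spark file access undermines a column restriction set in Hive.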
15.
Architecture Decisions and Tradeoffs
1. Authentication and Accounts Management
2. Authorization and Access Control
3. Interfacing Analytics Applications with a Data Lake
16.
Hadoop Analytics Tools
Graphical Analytics Tools
…More accessible to general users
Examples
• Data Science Notebooks
• Business Intelligence and Visualization Tools
Command Line Hadoop Tools
…Powerful but Not Intuitive
17.
Security Considerations for Integrating Analytics Tools
• Users have access only to their own files, databases, and analytics processes in the Data Lake
• Users run analytics (Hive queries, Pig jobs, Spark jobs, etc.) as themselves in the Data Lake
18.
Hadoop Secure Impersonation
[Diagram: Alice sends her credentials to the Web Application; the Web Application sends its own credentials with doAs: Alice to Hadoop; Hadoop applies authorization controls for Alice]
• A Hadoop superuser can submit jobs or access data on behalf of another user
Applications:
• Web based applications
Cautions:
• Limit superusers to trusted applications only
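The caution about limiting superusers can be sketched as a simplified model of a proxy-user check. The dictionary keys mimic the `hadoop.proxyuser.<superuser>.hosts` and `hadoop.proxyuser.<superuser>.groups` properties in core-site.xml; the matching logic is a deliberate simplification of what Hadoop does, and the `webapp` account, group, and host names are hypothetical.

```python
def may_impersonate(conf, superuser, target_groups, client_host):
    """Simplified model of a proxy-user check.

    conf mimics the core-site.xml properties
    hadoop.proxyuser.<superuser>.hosts and hadoop.proxyuser.<superuser>.groups,
    each holding a comma-separated list or "*".
    """
    def allowed(prop, candidates):
        value = conf.get(f"hadoop.proxyuser.{superuser}.{prop}")
        if value is None:
            return False  # not configured as a proxy user: deny
        if value.strip() == "*":
            return True
        permitted = {v.strip() for v in value.split(",")}
        return any(c in permitted for c in candidates)

    return allowed("hosts", [client_host]) and allowed("groups", target_groups)


conf = {
    "hadoop.proxyuser.webapp.hosts": "app1.example.com",
    "hadoop.proxyuser.webapp.groups": "analysts",
}
print(may_impersonate(conf, "webapp", ["analysts"], "app1.example.com"))   # True
print(may_impersonate(conf, "webapp", ["admins"], "app1.example.com"))     # False
print(may_impersonate(conf, "webapp", ["analysts"], "rogue.example.com"))  # False
print(may_impersonate(conf, "other", ["analysts"], "app1.example.com"))    # False
```

Restricting both the hosts a superuser may connect from and the groups it may act for keeps a compromised application from impersonating arbitrary users.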
19.
Direct Kerberos Authentication
Applications:
• Hadoop command-line tools (hdfs, beeline, Spark shell)
• Some data science notebook tools
[Diagram: from her workstation, Alice obtains a TGT and service tickets from the KDC Server, then accesses Hadoop, which applies authorization controls for Alice]
Cautions:
• Users may need to be trained to be Kerberos aware
• Can be difficult with some choices of IAM system in the hybrid identity management approach
20.
Direct Kerberos Authentication + Saved Credentials
Applications:
– Some commercial data science platforms
Benefits:
– Kerberos authentication becomes transparent to the
user, improving user experience
– Linux Container security simplifies isolation of many
analytics applications20
Alice’s session
KDC Server
TGT Service
Tickets
Container instance with
Alice’s Kerberos tickets
Authorization
controls for
Alice
Alice Application
Server
(Saved password
or keytab)
Alice’s
password
or keytab
Start Linux
Container
Cautions:
• Users’ Kerberos credentials on saved on servers long
term
• Understand which servers and persistence stores save
these passwords and take security precautions to
minimize risks
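As a rough sketch of the saved-keytab flow, the code below builds (but does not run) a `kinit` invocation that would obtain tickets for one user's container session, and checks that a stored keytab is readable only by its owner. `kinit -k -t` and the `KRB5CCNAME` variable are standard MIT Kerberos features; the principal, paths, and function names are placeholders, and a real platform would add its own storage and rotation controls.

```python
import os
import stat


def keytab_is_private(path):
    """True if the keytab file grants no permissions to group or others."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


def kinit_invocation(principal, keytab_path, cache_path):
    """Build (but do not run) a kinit command and environment for one user's session."""
    command = ["kinit", "-k", "-t", keytab_path, principal]
    env = {"KRB5CCNAME": f"FILE:{cache_path}"}  # separate ticket cache per user
    return command, env


# Hypothetical principal and paths.
command, env = kinit_invocation(
    "alice@EXAMPLE.COM", "/secure/keytabs/alice.keytab", "/tmp/krb5cc_alice"
)
print(command)
print(env)
```

Keeping each user's ticket cache separate and refusing keytabs with loose file permissions are two small precautions in the spirit of the cautions above.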
21.
Edge Proxy Gateway
• Example: Knox
Applications:
- Custom scripts calling Hadoop functionality
- ODBC/JDBC Data Sources
- Self-Service Business Intelligence Tools
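Pluggable Auth Providers:
• LDAP
• PAM
• HadoopAuth
• SSO Cookie
• JWT Provider
• Claims (CAS/Auth/SAML/OpenID)
[Diagram: Alice authenticates to the Edge Proxy Gateway, which forwards requests to Hadoop under authorization controls for Alice]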
Cautions:
• Not every application supports this method
• Performance challenges
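For the custom-script case, a call through the gateway might look like the sketch below, which builds (but does not send) a WebHDFS listing request. It assumes a Knox-style URL layout of `/gateway/<topology>/webhdfs/v1/<path>` and HTTP Basic authentication backed by LDAP; the host, port, topology name, and credentials are placeholders.

```python
import base64
from urllib.parse import quote, urlencode
from urllib.request import Request


def knox_webhdfs_request(gateway, topology, hdfs_path, user, password, op="LISTSTATUS"):
    """Build (but do not send) a WebHDFS request routed through a Knox-style gateway."""
    url = (
        f"https://{gateway}/gateway/{topology}/webhdfs/v1"
        f"{quote(hdfs_path)}?{urlencode({'op': op})}"
    )
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return Request(url, headers={"Authorization": f"Basic {token}"})


# Hypothetical gateway address, topology, and credentials.
req = knox_webhdfs_request("knox.example.com:8443", "default", "/user/alice", "alice", "secret")
print(req.full_url)
# https://knox.example.com:8443/gateway/default/webhdfs/v1/user/alice?op=LISTSTATUS
```

The script never handles Kerberos tickets itself: the gateway authenticates the caller and talks to the cluster, which is what makes this method convenient for tools that are not Kerberos aware.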
22.
Methods of application authentication and
impersonation in Hadoop
• Hadoop User Impersonation
• Direct Kerberos Authentication
• Direct Kerberos Authentication + Saved Credentials
• Edge Proxy Gateway
23.
Architecture Decisions and Tradeoffs
1. Authentication and Accounts Management
2. Authorization and Access Control
3. Interfacing Analytics Applications with a Data Lake