- Hellmar Becker presented on securing Hadoop in an enterprise context at a summit in Dublin on April 14, 2016.
- ING uses Hadoop for data storage, advanced analytics, and real-time processing to better understand customers and offer tailored products and services. Out of the box, however, Hadoop runs without security, which exposes it to risks such as data loss, privacy breaches, and system intrusion.
- ING built a security architecture that combines perimeter security (firewalls and stepping stone servers), integration of Hadoop user authentication and authorization with the corporate Active Directory, and Apache Ranger for unified access management, so that data can be analyzed securely at enterprise scale.
3. Securing Hadoop in an Enterprise Context
• The Challenge
• Hadoop Usage Patterns
• Aspects of Security
• Building Blocks for a Security Architecture
• Questions
5. Data Lake and Advanced Analytics within ING
External and internal reporting for own or regulatory purposes
Integrate all data sources within the bank into one processing platform
• Batch data streams
• Live transactions
• Model building for customer interaction
Better understand customer needs in an increasingly digital world
Data can help us offer tailored products and services
Empower data scientists and analysts to get the best results with advanced analytics tools and predictive models
Open source software where possible: Hadoop as a core component
6. Risks
• Data loss
• Privacy breach
• System intrusion
Possible consequences
• Legal consequences
• Loss of reputation
• Financial loss
7. Hadoop "out of the box" runs without security by default
Hadoop user model:
• A user name is just an alphanumeric string
• So is a group name
• They do not have to match entities in the OS
• Via the REST API, anybody could read or modify data
So, the security design has to be actively built!
And this is what we did.
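To illustrate the point about the REST API: on a cluster that runs with simple authentication (no Kerberos), WebHDFS takes the caller's identity from the `user.name` query parameter, so any string is accepted as a user name. The following is a minimal sketch that only builds such a URL. The host name and path are placeholders, not values from the talk.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, user):
    """Build a WebHDFS URL for a cluster using simple authentication.

    The identity is whatever string is passed as user.name; the cluster
    does not verify it unless Kerberos is enabled.
    """
    query = urlencode({"op": op, "user.name": user})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Hypothetical host and path. 50070 is the default NameNode HTTP port in Hadoop 2.
url = webhdfs_url("namenode.example.com", 50070, "/sales-data", "LISTSTATUS", "hdfs")
print(url)
```

Nothing in the request proves that the caller really is `hdfs`, which is why the perimeter and Kerberos measures on the later slides are needed.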
11. Aspects of Security
Technical: Rings of Defense
• Perimeter Level Security
• Application Level Authentication and Authorization
• OS Security
• Data Protection
See also: http://www.slideshare.net/vinnies12/hadoop-security-today-tomorrow-apache-knox
Conceptual: Five Pillars of Security
• Administration
• Authentication
• Authorization
• Auditing
• Data Protection
See also: http://hortonworks.com/hdp/security/
13. Building Blocks: Perimeter Security
• Firewall around the entire cluster
• “Stepping stone” servers
• Citrix/Terminal server for interactive access
• Ingestion server with defined transfer paths
User model
• Personal users defined locally or in the corporate directory
• Service/Technical users defined locally
Software updates and software development
• Through manually maintained mirror
Used in exploratory environments (pattern 3)
14. Building Blocks: Internal Security
Administration
• General goal: Zero Touch deployment
• Automatic synchronization with enterprise directory
• UI access is only used for incidents
Authentication
• Kerberos
• Future: Share a KDC HA cluster among Hadoop instances
• Connecting to enterprise directory using trusts and synchronization (next chapter)
• Keep the Kerberos principals (Hadoop users) completely separate from OS users
15. Authorization
Unified rights management with Ranger
• Service principals will be directly made known to Ranger; PA's rights are assigned only based on groups
• Groups and users are synced with Active Directory
• Ranger 0.4 cannot take away privileges that were granted on a lower level
• HDFS permissions and ACLs override Ranger
• Make sure these access paths are locked down
HDFS ACLs (No!)
• No easy-to-use GUI
• Difficult to maintain overview
• Only for HDFS, does not handle other components
> hdfs dfs -setfacl -m group:execs:r-- /sales-data
> hdfs dfs -getfacl /sales-data
# file: /sales-data
# owner: bruce
# group: sales
user::rw-
group::r--
group:execs:r--
mask::r--
other::---
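The statement that HDFS permissions and ACLs override Ranger can be read as an either/or check: access is granted if a Ranger policy allows it, or if the native HDFS permissions and ACLs allow it. The sketch below is a simplified model of that behaviour, not Ranger's actual code. The function name and both grant tables are made up for illustration and reuse the `/sales-data` example.

```python
# Hypothetical grants: path -> groups with read access.
RANGER_POLICIES = {"/sales-data": {"sales"}}
# Native HDFS side, e.g. after "hdfs dfs -setfacl -m group:execs:r-- /sales-data".
HDFS_ACLS = {"/sales-data": {"sales", "execs"}}


def can_read(path, user_groups):
    """Either layer can grant access; neither can revoke what the other grants."""
    ranger_ok = bool(RANGER_POLICIES.get(path, set()) & user_groups)
    hdfs_ok = bool(HDFS_ACLS.get(path, set()) & user_groups)
    return ranger_ok or hdfs_ok


# No Ranger policy mentions "execs", yet the HDFS ACL still lets them in.
print(can_read("/sales-data", {"execs"}))    # True
print(can_read("/sales-data", {"interns"}))  # False
```

In this model the only way to make Ranger the single source of truth is to keep the native permissions and ACLs restrictive, which is what locking down these access paths amounts to.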
16. What We Have Done: Corporate Integration
• Personal users in corporate Active Directory, NPAs in cluster KDC
• One KDC pair per cluster
• One-way realm trust
• Custom script to synchronize Ranger
Challenges
• Learning to work in interdisciplinary teams
• Organizational boundaries
• UNIX – Windows
• Infra – Platform DevOps
Example: Ambari service connects to UNIX LDAP rather than AD
OS security and Hadoop security are not integrated
• YARN container users
• Hadoop ACLs, group mapping
• Multitenancy? (Not solved in this picture)
17. Working Around Ranger’s Limitations
• Ranger's uxugsync process queries Active Directory through the LDAP protocol
• Ranger 0.4: Reads all users, then determines their group affiliation
• More than 50,000 employees in ING Group
• Need to limit the load on the LDAP server!
• Ranger 0.5: Group driven query - still not optimal because it uses attribute filters
• Most efficient LDAP query is either by a single DN (Distinguished Name), or by container (query base DN)
• But we cannot use containers because of enterprise policy
• Solution: custom Python script that queries LDAP hierarchically
• One “supergroup” is picked by DN
• The members of the “supergroup” are all LDAP groups that have Hadoop related privileges
• Query all these groups, again by DN
• Examine the members of each group (personal users)
• Make the user-group relationships known to Ranger via REST call
Ranger User-Group API is neither documented nor supported
Database schema: creates duplicate records, inconsistent deletion behavior
OS integration should be better
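The hierarchical query described on this slide can be sketched as follows. This is not the script used at ING: the directory is replaced by an in-memory dictionary so the example runs on its own, and all DNs, group names and function names are invented. In a real script, `lookup_members` would be a base-scope LDAP search on a single DN, and the resulting mapping would be sent to Ranger. That last step is left out here because the slide notes that the user-group API is undocumented.

```python
# In-memory stand-in for the directory: DN -> DNs listed in its member attribute.
SUPERGROUP_DN = "cn=hadoop-all,ou=groups,dc=example,dc=com"
DIRECTORY = {
    SUPERGROUP_DN: [
        "cn=hadoop-analysts,ou=groups,dc=example,dc=com",
        "cn=hadoop-admins,ou=groups,dc=example,dc=com",
    ],
    "cn=hadoop-analysts,ou=groups,dc=example,dc=com": [
        "uid=alice,ou=people,dc=example,dc=com",
        "uid=bob,ou=people,dc=example,dc=com",
    ],
    "cn=hadoop-admins,ou=groups,dc=example,dc=com": [
        "uid=alice,ou=people,dc=example,dc=com",
    ],
}


def lookup_members(dn):
    """Placeholder for an LDAP lookup of one entry by its DN."""
    return DIRECTORY.get(dn, [])


def short_name(dn):
    """Value of the first RDN, e.g. 'alice' for 'uid=alice,ou=people,...'.
    Simplified: does not handle escaped commas inside a DN."""
    return dn.split(",", 1)[0].split("=", 1)[1]


def collect_user_groups(supergroup_dn, lookup=lookup_members):
    """Walk supergroup -> groups -> users, one lookup per DN."""
    user_groups = {}
    for group_dn in lookup(supergroup_dn):      # groups with Hadoop privileges
        group = short_name(group_dn)
        for user_dn in lookup(group_dn):        # personal users in that group
            user_groups.setdefault(short_name(user_dn), set()).add(group)
    return user_groups


mapping = collect_user_groups(SUPERGROUP_DN)
for user, groups in sorted(mapping.items()):
    print(user, sorted(groups))
```

The number of lookups grows with the number of Hadoop-related groups rather than with the size of the whole directory, which is the point of the approach when the directory holds more than 50,000 employees.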
18. A Better Approach: Corporate Directory Integration
• IPA and sssd provide user/group mapping on Hadoop and OS level
• Role based access for personal users, managed through a central tool
• One user database for Hadoop services, Ambari, Ranger
• YARN, HDFS user models fall nicely into place
• Requires ING patches (HDP 2.4, Ranger 0.6)
• RANGER-827 use getent instead of files
• RANGER-842 use pam for Ranger auth
• HADOOP-12751, HIVE-4413 support ‘@’ in user name
• AMBARI-6432 support IPA KDC
Timelines! We need this prioritized by our vendor
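The idea behind "use getent instead of files" is to resolve users and groups through the operating system's name service, which sssd plugs into, rather than by reading `/etc/passwd` and `/etc/group` directly. The sketch below shows that kind of lookup in Python on a Unix system; it is an illustration of the principle, not the RANGER-827 patch, and `resolve_user` is a made-up name. Python's `pwd` and `grp` modules call the C library lookup functions, so they see the same users as `getent`.

```python
import grp
import os
import pwd


def resolve_user(name):
    """Return (uid, sorted group names) via the OS name service, or None."""
    try:
        entry = pwd.getpwnam(name)
    except KeyError:
        return None  # unknown to every configured source (files, sssd, ...)
    groups = []
    for gid in os.getgrouplist(name, entry.pw_gid):
        try:
            groups.append(grp.getgrgid(gid).gr_name)
        except KeyError:
            groups.append(str(gid))  # gid without a resolvable name
    return entry.pw_uid, sorted(groups)


print(resolve_user("root"))
print(resolve_user("no-such-user-for-this-sketch"))
```

With such a lookup, a user defined only in the corporate directory is visible to the OS and to the Hadoop services alike.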
20. Attributions
• Hellmar in Nîmes / With Python in Mindanao, by the author
• Domtoren in het oranje licht by helena_is_here is licensed under CC BY 2.0
• Data Pipeline, ING OIB Image Bank
• Storm surge by David Baird is licensed under CC BY-SA 2.0; cropped by me
• Scared Girl by Victor Bezrukov - Port-42 is licensed under CC BY 2.0
• System Lock by Yuri Samoilov is licensed under CC BY 2.0; cropped by me
• Safe by Rob Pongsajapan is licensed under CC BY 2.0; cropped by me
• Hercules and Cerberus by The Los Angeles County Museum of Art is Public Domain