A comprehensive overview of the security concepts in the open-source Hadoop stack as of mid-2015, with a look back at the "old days" and an outlook on future developments.
Page 9
Security in Hadoop 2015

Authentication: Who am I / prove it?
• Kerberos in native Apache Hadoop
• HTTP/REST API secured with Apache Knox Gateway

Authorization: Restrict access to explicit data
• Fine-grained access control
• HDFS, YARN, MapReduce, Hive & HBase
• Storm & Knox

Audit: Understand who did what
• Centralized audit reporting
• Policy and access history

Data Protection: Encrypt data at rest & in motion
• Wire encryption in Hadoop
• File encryption, built in since Hadoop 2.6
• Partner tools

All four pillars rest on Centralized Security Administration.
Page 10
Typical Flow - Hive Access with Beeline CLI

(Diagram: a Beeline client connects to HiveServer2, which accesses the data blocks A, B, C in HDFS.)
Page 11
Typical Flow - Authentication through Kerberos

(Diagram: Beeline client, HiveServer2 and HDFS, now with a KDC.)
1. Client gets a Service Ticket for Hive from the KDC
2. Client uses Hive, submits the query
3. Hive gets a NameNode (NN) Service Ticket
4. Hive creates the MapReduce/Tez job using the NN
Page 12
Typical Flow - Authorization through Ranger

(Diagram: the same flow as on page 11, with Ranger enforcing authorization policies in HiveServer2.)
1. Client gets a Service Ticket for Hive from the KDC
2. Client uses Hive, submits the query
3. Hive gets a NameNode (NN) Service Ticket
4. Hive creates the MapReduce/Tez job using the NN
Page 13
Typical Flow - Perimeter Security through Knox

(Diagram: Beeline client → Knox → HiveServer2 → HDFS, with KDC and Ranger.)
1. Original request with user id/password goes to Knox
2. Knox gets a Service Ticket for Hive from the KDC
3. Knox runs as proxy user using Hive
4. Hive gets a NameNode (NN) Service Ticket
5. Hive creates the MapReduce/Tez job using the NN
6. Client gets the query result
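When the client goes through Knox, Beeline uses Hive's HTTP transport instead of a direct binary connection. A minimal sketch of the JDBC URL, assuming a default Knox topology; host name, topology name and credentials are illustrative, not from the slide:

```shell
# Beeline via the Knox gateway: HTTP transport, SSL, gateway/<topology>/hive path
beeline -u "jdbc:hive2://knox-host:8443/;ssl=true;transportMode=http;httpPath=gateway/default/hive" \
        -n seiler -p '<password>'
```

Knox terminates SSL, authenticates the user (e.g. against LDAP) and proxies the request to HiveServer2 on the cluster's internal network.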
Page 14
Typical Flow - Wire & File Encryption

(Diagram: the same flow as on page 13, now with encrypted channels: SSL on the client, Knox and HiveServer2/REST connections, and SASL protection on the HDFS Data Transfer Protocol.)
1. Original request with user id/password goes to Knox (SSL)
2. Knox gets a Service Ticket for Hive from the KDC
3. Knox runs as proxy user using Hive (SSL)
4. Hive gets a NameNode (NN) Service Ticket
5. Hive creates the MapReduce/Tez job using the NN (SASL)
6. Client gets the query result
Page 16
Kerberos Synopsis
• The client never sends a password
  – It sends a username + token instead
• Authentication is centralized
  – Key Distribution Center (KDC)
  – The client receives a Ticket-Granting-Ticket (TGT)
  – The TGT allows an authenticated client to request access to secured services
• Clients establish a timed session
• Clients establish trust with services by sending KDC-stamped tickets to the service
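On a kerberized cluster this flow is visible from the command line. A sketch, reusing the principal seiler@EXAMPLE.COM from the next slide (the realm and principal are the slide's examples):

```shell
# Obtain a Ticket-Granting-Ticket from the KDC.
# The password is checked locally; it is never sent over the wire.
kinit seiler@EXAMPLE.COM

# List the ticket cache: the TGT appears first; service tickets
# (e.g. for hive or hdfs) are added as secured services are used.
klist
```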
Page 17
Kerberos + Active Directory/LDAP

(Diagram: the client authenticates against AD/LDAP, which is connected to the Hadoop cluster's KDC via a cross-realm trust.)
• AD/LDAP as user store
  – Users: seiler@EXAMPLE.COM
  – Use existing directory tools to manage users
• Cluster KDC
  – Hosts: host1@HADOOP.EXAMPLE.COM
  – Services: hdfs/host1@HADOOP.EXAMPLE.COM
  – Use Kerberos tools to manage host + service principals
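A minimal sketch of how this cross-realm setup can look in krb5.conf on the cluster hosts. The realm names come from the slide; the KDC host names are assumptions, and a one-way trust additionally requires a shared krbtgt/HADOOP.EXAMPLE.COM@EXAMPLE.COM principal on the AD side:

```ini
[realms]
  HADOOP.EXAMPLE.COM = { kdc = kdc.hadoop.example.com }
  EXAMPLE.COM        = { kdc = ad.example.com }

[domain_realm]
  .hadoop.example.com = HADOOP.EXAMPLE.COM

[capaths]
  # Users from the AD realm reach cluster services directly (no intermediate realm)
  EXAMPLE.COM = { HADOOP.EXAMPLE.COM = . }
```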
Page 18
Ambari & Kerberos
• Install & configure Kerberos
  – Server on a single node
  – Client on the rest of the nodes
• Define principals & keytabs
  – A keytab (key table) is a file containing a key for a principal
  – Since there are a few dozen principals, Ambari can generate the keytab data for your entire cluster as a downloadable CSV file
• Configure user permissions
Page 20
Knox: Core Concept

(Diagram: business users reach the Hadoop cluster's application layer (HDFS, Hive, App X, App C) over REST/HTTP and JDBC/ODBC, passing through a load balancer and Knox on an edge node; Hadoop admins use SSH and RPC calls directly; admins/data operators run data ingest and ETL through Falcon, Oozie, Sqoop and Flume.)
Page 21
Knox: Hadoop REST API

Service   Direct URL                            Knox URL
WebHDFS   http://namenode-host:50070/webhdfs    https://knox-host:8443/webhdfs
WebHCat   http://webhcat-host:50111/templeton   https://knox-host:8443/templeton
Oozie     http://oozie-host:11000/oozie         https://knox-host:8443/oozie
HBase     http://hbase-host:60080               https://knox-host:8443/hbase
Hive      http://hive-host:10001/cliservice     https://knox-host:8443/hive
YARN      http://yarn-host:yarn-port/ws         https://knox-host:8443/resourcemanager

• Direct URLs: masters could be on many different hosts
• Knox URLs: one host, one port; consistent paths; SSL config at one host
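In a default Knox install the gateway URLs additionally carry a gateway/<topology> prefix, which the slide's simplified URLs omit. A sketch of listing an HDFS directory through Knox, assuming the default topology and sandbox-style credentials (host, user and password are illustrative):

```shell
# WebHDFS LISTSTATUS through the Knox gateway
# (-k skips certificate verification; only for testing with self-signed certs)
curl -k -u guest:guest-password \
  'https://knox-host:8443/gateway/default/webhdfs/v1/tmp?op=LISTSTATUS'
```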
Page 22
Knox: Features
Simplified Access
• Kerberos Encapsulation
• Single Access Point
• Multi-cluster support
• Single SSL certificate
Centralized Control
• Central REST API auditing
• Service-level authorization
• Alternative to SSH “edge node”
Enterprise Integration
• LDAP / AD integration
• SSO integration
• Apache Shiro extensibility
• Custom extensibility
Enhanced Security
• Protect network details
• SSL for non-SSL services
• WebApp vulnerability filter
Page 24
Knox: What’s New in Version 0.6
• Knox support for HDFS HA
• Support for YARN REST API
• Support for SSL to Hadoop Cluster Services (WebHDFS,
HBase, Hive & Oozie)
• Knox Management REST API
• Integration with Ranger for Knox Service Level
Authorization
• Use Ambari for install/start/stop/configuration
Page 29
Authorization: HDFS ACLs
New requirements:
– Maya, Diana and Clark are allowed to make modifications
– A new group execs should be able to read the sales data
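These requirements map directly onto HDFS ACL commands. A sketch, assuming the sales data lives under a path such as /sales-data and the account names match the users' first names (path and account names are assumptions):

```shell
# Maya, Diana and Clark may modify the sales data
hdfs dfs -setfacl -m user:maya:rwx,user:diana:rwx,user:clark:rwx /sales-data

# The new execs group gets read access (r-x so the directory can be listed)
hdfs dfs -setfacl -m group:execs:r-x /sales-data

# Verify the resulting ACL
hdfs dfs -getfacl /sales-data
```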
Page 31
Authorization: HDFS Best Practices
• Start with traditional HDFS file permissions to implement most permission requirements
• Define a small number of ACLs to handle exceptional cases
• A file/folder with an ACL incurs an additional memory cost in the NameNode compared to a file/folder with traditional permissions
Page 33
Authorization: Hive
• Hive has traditionally offered full-table access control via HDFS access control
• Solution for column-based control:
  – Let HiveServer2 check and submit the query execution
  – Make the tables accessible only to a special (technical) user
  – Provide an authorization plugin to restrict UDFs and file formats
• Use standard SQL permission constructs: GRANT / REVOKE
• Store the ACLs in the Hive Metastore
Page 35
Authorization: Hive

CREATE ROLE sales_role;
GRANT ALL ON DATABASE `sales-data` TO ROLE sales_role;
GRANT SELECT ON DATABASE `marketing-data` TO ROLE sales_role;

CREATE ROLE sales_column_role;
GRANT SELECT(c1, c2, c3) ON `secret_table` TO ROLE sales_column_role;
Page 36
Authorization: Pig
• There is no Pig (or MapReduce) Server to submit and
check column-based access
• Pig (and MapReduce) is restricted to full data access via
HDFS access control
Page 37
Authorization: HBase
• The HBase permission model traditionally supports ACLs defined at the namespace, table, column family and column level
  – This is sufficient to meet most requirements
• Cell-based security was introduced with HBase 0.98
  – On par with the security model of Accumulo
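In the HBase shell these ACLs are managed with grant/revoke. A sketch; the table, column family, user and group names are illustrative:

```ruby
# hbase shell
# Read/write on a whole table for one user
grant 'maya', 'RW', 'sales_table'

# Read-only on a single column family for a group (groups are prefixed with @)
grant '@execs', 'R', 'sales_table', 'cf1'

# Inspect the resulting permissions
user_permission 'sales_table'
```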
Page 43
Ranger: What's New in Version 0.4?
• New components coverage
  – Storm authorization & auditing
  – Knox authorization & auditing
• Deeper integration with HDP
  – Windows support
  – Integration with the Hive auth API, support for grant/revoke commands
  – Support for grant/revoke commands in HBase
• Enterprise readiness
  – REST APIs for the policy manager
  – Store audit logs locally in HDFS
  – Support for Oracle DB
  – Ambari support, as part of the Ambari 2.0 release
Page 45
Encryption: Data in Motion
• Hadoop client to DataNode via the Data Transfer Protocol
  – Client reads/writes to HDFS over an encrypted channel
  – Configurable encryption strength
• ODBC/JDBC client to HiveServer2
  – Encryption via SASL Quality of Protection
• Mapper to Reducer during the Shuffle/Sort phase
  – Shuffle is over HTTP(S)
  – Supports mutual authentication via SSL
  – Host name verification enabled
• REST protocols
  – SSL support
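A sketch of the HDFS side of this: the standard Hadoop properties that turn on Data Transfer Protocol encryption, in hdfs-site.xml (the property names are Hadoop's; the algorithm value is an example of the configurable strength):

```xml
<!-- hdfs-site.xml: encrypt the HDFS Data Transfer Protocol -->
<property>
  <name>dfs.encrypt.data.transfer</name>
  <value>true</value>
</property>
<property>
  <name>dfs.encrypt.data.transfer.algorithm</name>
  <!-- configurable encryption strength, e.g. 3des or rc4 -->
  <value>3des</value>
</property>
```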
Page 46
Encryption: Data at Rest
HDFS Transparent Data Encryption
• Install and run the KMS on top of HDP 2.2
• Change the corresponding HDFS parameters (via Ambari)
• Create an encryption key
    hadoop key create key1 -size 256
    hadoop key list -metadata
• Create an encryption zone using the key
    hdfs dfs -mkdir /zone1
    hdfs crypto -createZone -keyName key1 -path /zone1
    hdfs crypto -listZones
• Details:
  – http://hortonworks.com/kb/hdfs-transparent-data-encryption/
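Once the zone exists, clients use it like any other HDFS directory; encryption and decryption happen transparently. A short sketch (the local file name is illustrative):

```shell
# Write into the zone: the file is encrypted on disk automatically
hdfs dfs -put localfile.txt /zone1/

# An authorized read returns the plaintext; raw on-disk blocks stay encrypted
hdfs dfs -cat /zone1/localfile.txt
```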
Page 48
Apache Atlas: Data Classification
Currently in Incubation
– https://wiki.apache.org/incubator/AtlasProposal
Page 49
Apache Atlas: Tag-based Policies

(Diagram: source data is ingested into HDFS via Falcon, Oozie, Sqoop and Flume; the metadata server records data classifications such as Table1 | "marketing"; an IT admin creates a tag-based policy in Ranger; HiveServer2 enforces that policy for Beeline clients and writes audit logs.)
Page 50
Future: More Goodies
Dynamic, Attribute-Based Access Control (ABAC)
• Extend Ranger to support data or user attributes in policy decisions
• Example: use the geo-location of users
Enhanced Auditing
• Ranger can stream audit data through Kafka & Storm into multiple stores
• Use Storm for correlation of data
Encryption as a First-Class Citizen
• Build native encryption support into HDFS, Hive & HBase
• Ranger-based key management to support encryption