This document provides an overview of Apache Hadoop security: its history, what is available today, and what is planned for the future. It discusses why Hadoop security is distinctive, given benefits like combining previously siloed data and tools. The four areas of enterprise security - perimeter, access, visibility, and data protection - are reviewed. Specific security capabilities such as Kerberos authentication, Apache Sentry role-based access control, Cloudera Navigator auditing and encryption, and HDFS encryption are summarized. Planned future enhancements, such as attribute-based access control and improved encryption capabilities, are also mentioned.
* We offer the most complete set of processing, analysis, and serving frameworks for Hadoop.
* This includes comprehensive support for YARN (for example, Impala runs on YARN), though YARN itself is not a differentiator.
===
What’s really significant about this architecture is how it unifies diverse access to common data.
In traditional approaches, you’d have separate systems to collect, store, process, explore, model, and serve data. Different teams would use different systems for each workload, and users whose roles span multiple systems would have to use several of them to achieve their objectives.
With Cloudera’s enterprise data hub:
You can perform end-to-end data workflows in a single system, dramatically lowering time to value.
Each workload can access unlimited data, thanks to the underlying data platform, enhancing the value of each workload.
Power users can now access their data in new ways: SQL, search, machine learning, programming, etc.
At the same time, new users are enabled by these diverse workloads to interact with data.
Cloudera Enterprise provides comprehensive support for batch, interactive, and real-time workloads:
Batch
Data integration with Apache Sqoop
Data processing with MapReduce, Apache Hive, Apache Pig
Memory-centric processing with Apache Spark
Interactive
Analytic SQL with Impala
Search with Apache Solr
Machine Learning with Apache Spark
Real-Time
Data integration with Apache Kafka, Apache Flume
Stream processing with Apache Spark
Data serving with Apache HBase
Shared resource management ensures that each workload is handled appropriately and abides by IT policy.
What’s more, third-party tools, such as SAS or Informatica, can run as native workloads inside Cloudera’s enterprise data hub.
Business Manager
Run high value workloads in cluster
Quickly adopt new innovations
InfoSec
Follow established policies and procedures
Maintain compliance
IT Ops
Integrate with existing IT investments
Minimize end-user support
Automate configuration
There are many aspects to security - and it's all too easy for other vendors to claim their platforms are "secure" because they cover one or more of these pillars. To achieve comprehensive security, we offer all four pillars of security: Perimeter, Access, Visibility, and Data. Cloudera Enterprise achieves all of these and is compliance-ready out-of-the-box to ensure you’re protected.
Directory services and Kerberos
Username/password LDAP/AD authentication is an option for Hue, the Hive Metastore, Impala connectors, and Cloudera Manager admin logins
SAML for SSO – Hue, Cloudera Manager
(the last bullet applies to both of the bullets above it)
Kerberos-based – use industry standard Kerberos
Provably strong authentication between all Hadoop services, and to clients or client proxies
Cloudera Manager hides complexity
Coming soon (5.1): plug directly into AD for Kerberos
Eliminates MIT Kerberos infrastructure requirement
Username/password – against LDAP/AD
SAML for SSO
Kerberos clients no longer required on most user end-points
Kerberos Principals
A user in Kerberos is called a principal, which is made up of three distinct components: the primary, instance, and realm. A Kerberos principal is used in a Kerberos-secured system to represent a unique identity. The first component of the principal is called the primary, or sometimes the user component. The primary component is an arbitrary string and may be the operating system username of the user or the name of a service. The primary component is followed by an optional section called the instance, which is used to create principals for users in special roles or to define the host on which a service runs, for example. An instance, if it exists, is separated from the primary by a slash, and its content is used to disambiguate multiple principals for a single user or service.
The final component of the principal is the realm. The realm is similar to a domain in DNS in that it logically defines a related group of objects, although rather than hostnames as in DNS, the Kerberos realm defines a group of principals. Each realm can have its own settings, including the location of the KDC on the network and supported encryption algorithms. Large organizations commonly create distinct realms to delegate administration of a realm to a group within the enterprise. Realms, by convention, are written in uppercase characters.
Kerberos assigns tickets to Kerberos principals to enable them to access Kerberos-secured Hadoop services. For the Hadoop daemon principals, the principal names should be of the format username/fully.qualified.domain.name@YOUR-REALM.COM. In this guide, username in the username/fully.qualified.domain.name@YOUR-REALM.COM principal refers to the username of an existing Unix account that is used by Hadoop daemons, such as hdfs or mapred. Human users who want to access the Hadoop cluster also need to have Kerberos principals; in this case, username refers to the username of the user's Unix account, such as joe or jane. Single-component principal names (such as joe@YOUR-REALM.COM) are acceptable for client user accounts. Hadoop does not support more than two-component principal names.
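The primary/instance/realm structure described above can be sketched with a small parser. This is purely illustrative - it is not part of any Hadoop or MIT Kerberos API, and the function name is invented:

```python
# Minimal sketch: split a Kerberos principal into its three components.
# A principal looks like primary[/instance]@REALM,
# e.g. hdfs/node1.example.com@EXAMPLE.COM

def parse_principal(principal):
    """Return (primary, instance, realm); instance is None if absent."""
    name, sep, realm = principal.rpartition("@")
    if not sep:
        raise ValueError("principal has no realm: %r" % principal)
    primary, sep, instance = name.partition("/")
    return primary, (instance if sep else None), realm

# A service principal carries the host as its instance component:
print(parse_principal("hdfs/node1.example.com@EXAMPLE.COM"))
# A human user's principal typically has only a primary:
print(parse_principal("joe@EXAMPLE.COM"))
```

Note how the two-component daemon principal and the single-component user principal from the paragraph above both fit the same pattern.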
Kerberos Keytabs
A keytab is a file containing pairs of Kerberos principals and an encrypted copy of that principal's key. A keytab file for a Hadoop daemon is unique to each host since the principal names include the hostname. This file is used to authenticate a principal on a host to Kerberos without human interaction or storing a password in a plain text file. Because having access to the keytab file for a principal allows one to act as that principal, access to the keytab files should be tightly secured. They should be readable by a minimal set of users, should be stored on local disk, and should not be included in host backups, unless access to those backups is as secure as access to the local host.
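Because a keytab grants the ability to act as its principal, one practical safeguard is verifying that keytab files are readable only by their owner. A hedged sketch of such a check (the filename and the exact policy are examples, not Cloudera defaults):

```python
import os
import stat

def keytab_is_locked_down(path):
    """Return True if the file is readable/writable only by its owner
    (e.g. mode 0400 or 0600), with no group or other permission bits set."""
    mode = os.stat(path).st_mode
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0

# Example: create a placeholder file, tighten it, and check.
with open("hdfs.keytab", "wb") as f:
    f.write(b"\x05\x02")  # keytab files begin with a version header
os.chmod("hdfs.keytab", 0o400)
print(keytab_is_locked_down("hdfs.keytab"))  # True
```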
“We currently manage all user authentication and service access through a combination of Active Directory and Kerberos. We have ‘audited’ procedures based around these technologies. Help me understand how your cluster will fit into these paradigms. Also, my cousin said I will have to stand up an additional KDC and put Kerberos clients on every desktop. I really hope that’s not the case, Kerberos configuration is a pain in the a**”
To solve the data access problem in Impala, we need to introduce a very important concept: role-based access control (RBAC).
This is very similar to the idea behind Active Directory. With role-based access control, a user belongs to a group, that group is assigned to a role, and that role has a set of privileges defining what data can be accessed and what actions can be performed. This user-group-role-privilege relationship defines a user's access and privileges.
Recall: AD Group membership in conjunction with Kerberos is used to control access to SERVICES e.g. Impala
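The user → group → role → privileges chain can be sketched as plain data structures. This is a conceptual illustration of RBAC, not the Sentry API; all names and objects here are invented:

```python
# Hypothetical RBAC model: users belong to groups, groups map to roles,
# and roles carry privileges on objects (here, table names with actions).

groups = {"analysts": {"jane", "joe"}}
group_roles = {"analysts": {"sales_reader"}}
role_privileges = {"sales_reader": {("sales.transactions", "SELECT")}}

def user_can(user, obj, action):
    """True if any of the user's groups grants a role with this privilege."""
    for group, members in groups.items():
        if user in members:
            for role in group_roles.get(group, ()):
                if (obj, action) in role_privileges.get(role, ()):
                    return True
    return False

print(user_can("jane", "sales.transactions", "SELECT"))  # True
print(user_can("jane", "sales.transactions", "INSERT"))  # False
```

The point of the indirection is administrative: privileges are granted to roles once, and membership changes happen in the directory, not in every application.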
Full audit and access history for HDFS, Impala, Hive, HBase, and Sentry
Automatic collection and easy visualization of upstream and downstream data lineage
Easily discover, classify, and locate data to comply with business governance and compliance rules
Why you need Navigator:
Lots of Data Landing in Cloudera Enterprise
Huge quantities
Many different sources – structured and unstructured
Varying levels of sensitivity
Many Users Working with the Data
Administrators and compliance officers
Analysts and data scientists
Business users
Need to Effectively Control and Consume Data
Get visibility and control over the environment
Discover and explore data
Data in motion – network encryption
Network RPC encryption using SASL
HDFS data transfer protocol
MR shuffle
SSL for web-based user and administration tools
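The in-motion protections above map to a handful of standard Hadoop configuration properties. A sketch of the relevant settings (the property names are standard Hadoop; the values shown are illustrative, and cluster-specific SSL keystore settings are omitted):

```xml
<!-- core-site.xml: encrypt Hadoop RPC via SASL
     ("privacy" = authentication + integrity + encryption) -->
<property>
  <name>hadoop.rpc.protection</name>
  <value>privacy</value>
</property>

<!-- hdfs-site.xml: encrypt the HDFS data transfer protocol -->
<property>
  <name>dfs.encrypt.data.transfer</name>
  <value>true</value>
</property>

<!-- mapred-site.xml: run the MapReduce shuffle over HTTPS -->
<property>
  <name>mapreduce.shuffle.ssl.enabled</name>
  <value>true</value>
</property>
```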
Data at rest
Certified Partner Solutions:
Field-level encryption, data masking or tokenization
OS-level file-system encryption
Coming soon: HDFS file encryption
Selectively encrypt folders – apply only where needed; separate tenants using separate keys per folder
https://github.com/intel-hadoop/project-rhino/
Navigator Encrypt provides massively scalable, high-performance, at-rest data encryption for all critical Hadoop data, in and out of HDFS
Navigator Encrypt uses process-based access controls to mitigate data-custodian issues and prevent unauthorized access to data in cleartext
Navigator Key Trustee provides secure, policy-driven key management for Navigator Encrypt. Key Trustee can also be used to secure and manage any security-related Hadoop assets
e.g. SSL certificates and SSH keys
Navigator Encrypt provides massively scalable, high-performance, at-rest data encryption for all critical Hadoop data, in and out of HDFS, transparently encrypting Hadoop data as it’s written to disk.
We enable compliance initiatives (HIPAA, PCI-DSS, SOX, FERPA, EU data protection) that require at-rest encryption and key management
Fast, easy deployment and configuration with enterprise scalability
We provide a transparent layer between the application and file system that dramatically reduces performance impact of encryption
Fully integrated into Navigator.
Features
Navigator Encrypt uses process-based access controls to mitigate data-custodian issues and prevent unauthorized access to data in cleartext
We can ensure sensitive data and encryption keys are never stored in plain text nor exposed publicly
We can make sure only applications that need access to plaintext data will have it
Navigator encrypt can prevent admins and super users from accessing encrypted data
You can establish a variety of key retrieval policies that dictate who or what can access the secure artifact
Keys are protected by Navigator Key Trustee
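The process-based access control described above can be illustrated with a toy key-release check: key material is returned only when the requesting user and program match the policy registered for that key. This is a conceptual sketch of the idea, not Navigator Key Trustee's actual interface; every name here is invented:

```python
# Toy key-release policy: a key is handed out only to an approved
# (user, program) pair, so a root shell reading raw files gets nothing.

key_policies = {
    "hdfs-data-key": {"allowed": {("hdfs", "/usr/bin/hdfs-datanode")}},
}
key_store = {"hdfs-data-key": b"\x00" * 32}  # placeholder key material

def retrieve_key(key_id, user, program):
    """Return key material only if (user, program) is allowed by policy."""
    policy = key_policies.get(key_id)
    if policy and (user, program) in policy["allowed"]:
        return key_store[key_id]
    return None  # deny: unapproved processes never see the plaintext key

print(retrieve_key("hdfs-data-key", "hdfs", "/usr/bin/hdfs-datanode") is not None)  # True
print(retrieve_key("hdfs-data-key", "root", "/bin/bash"))                           # None
```

This is why superuser access to the filesystem alone is not enough to read the data: without a policy match, the key is never released.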
Navigator Key Trustee is Cloudera’s key manager; its primary use case is storing keys for Navigator Encrypt
Key Trustee is a software-based key manager with packaged integrations with HSMs such as SafeNet Luna, Thales nShield, and RSA DPM, ensuring consistency with infosec policies that require these appliances to serve as the root of trust inside a corporate environment
Key Trustee runs on a dedicated server and ensures keys are stored separately from the data, which is a requirement of regulations like PCI
In addition to key management, you can think of Key Trustee as a virtual safe-deposit box that can be used to secure any type of sensitive asset for the cluster. SSL certificates, SSH keys, passwords, keytab files, truststore files, and more can all be secured with Key Trustee