The fundamentals and best practices of securing your Hadoop cluster are top of mind today. In this session, we will examine and explain the components, tools, and frameworks used in Hadoop for authentication, authorization, audit, and encryption of data and processes. See how the latest innovations can let you securely connect more data to more users within your organization.
2. Hadoop Security and Compliance Challenges
• History
• Security was not a priority for early Hadoop adopters such as Yahoo! and Facebook; it is now
• Data concentration
• Quantity and diversity of data creates compliance challenges
• Flexibility of the Hadoop architecture
• Many paths for data in, out, processing
• Access data at different granularities, from fields to files
• ELT: sensitive data “discovery” occurs after data arrives
3. Cloudera has led in investments in security
Authentication
• First Hadoop distribution to offer strong authentication throughout
Encryption
• First Hadoop distribution to support encryption on wire
Audit
• Only Hadoop distribution to support audit histories for all data objects & access paths
• Single point for log capture, audit
Authorization
• Founded the Apache Sentry project along with Oracle and Lab41 to manage fine-grained permissions
Automation
• Cloudera Manager automates security configurations & LDAP/AD integration
4. Case Study: Finance and Banking
Fraud and Purchasing Behavior Analysis
• Identify patterns in financially sensitive, PCI, and PII data
• Before: Unable to build applications on Hadoop; forced to use other systems, to greatly limit Hadoop access, or to forgo analysis due to privacy concerns
• Now: Provide broad analysis capabilities with Impala to a large user population, secured by Sentry
5. Enterprise Security in Hadoop overview
Four Functional Areas
[Diagram: a Hadoop cluster used by users, applications, and operators, with four functional areas: Perimeter, Data, Access, Visibility]
6. Defining the Functional Areas
Perimeter: Guarding access to the cluster itself
• Technical concepts: Authentication, Network isolation
Data: Protecting data in the cluster from unauthorized visibility
• Technical concepts: Encryption, Tokenization, Data masking
Access: Defining what users and applications can do with data
• Technical concepts: Permissions, Authorization
Visibility: Reporting on where data came from and how it’s being used
• Technical concepts: Auditing, Lineage
7. Enabling Enterprise Security
Each functional area maps to specific technologies:
• Perimeter: Kerberos | AD/LDAP
• Data: Native | Certified Partners
• Access: Sentry
• Visibility: Cloudera Navigator
9. Perimeter: Authentication in Hadoop
Kerberos
• Provably strong authentication between all Hadoop services and (optionally) to end-points
• Cloudera Manager hides complexity
LDAP/AD
• Username / password
• Option for Hue, Hive Metastore, Impala connectors, Cloudera Manager admin logins
SAML
• For Single Sign-On (SSO) for listed options
• Kerberos clients no longer required on most user end-points
10. Authentication Options and Coverage
[Diagram: clients, applications (Pig, Hive, Hue, etc.), and a 3rd-party gateway connect to cluster services: HDFS (DN, NN), YARN (RM, AM), Impala (ID, SS), MapReduce (JT, TT), and other services (Oozie, Search, etc.). Two coverage options are marked: “End-to-End” Kerberos, or “Core” Kerberos with “Edge” AD/LDAP/SAML.]
11. IT Integration: Kerberos
• Users don’t want Yet Another Credential
• Corp IT doesn’t want to provision and maintain thousands of service principals and keytabs
• Solution: local KDC + one-way trust
  • Run MIT Kerberos KDC in the cluster
  • Put all service principals here
  • Set up one-way trust of central corporate realm by local KDC
  • Normal user credentials can be used to access Hadoop
• Recommended: Use Cloudera Manager
  • To properly tune inter-related configuration knobs
  • To manage principal/keytab creation and distribution
  • To preserve service monitoring with Kerberos security enabled
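The local-KDC-plus-trust arrangement can be pictured with a small Python sketch. It is a toy model of the idea only, not Hadoop's principal-mapping code: the realm names follow the example principals used in this deck, and `parse_principal`, `short_name`, `LOCAL_REALM`, and `TRUSTED_REALMS` are hypothetical names introduced for illustration.

```python
from typing import NamedTuple, Optional


class Principal(NamedTuple):
    primary: str             # user or service name, e.g. "alice" or "hdfs"
    instance: Optional[str]  # host part of a service principal, else None
    realm: str


# Assumptions for illustration: the in-cluster MIT KDC serves LOCAL_REALM and
# has a one-way trust of the corporate realm listed in TRUSTED_REALMS.
LOCAL_REALM = "HADOOP.EXAMPLE.COM"
TRUSTED_REALMS = {"EXAMPLE.COM"}


def parse_principal(text: str) -> Principal:
    """Split 'primary[/instance]@REALM' into its parts."""
    name, sep, realm = text.rpartition("@")
    if not sep or not name or not realm:
        raise ValueError(f"not a full principal: {text!r}")
    primary, _, instance = name.partition("/")
    return Principal(primary, instance or None, realm)


def short_name(text: str) -> str:
    """Return the local user name if the principal's realm is accepted."""
    principal = parse_principal(text)
    if principal.realm == LOCAL_REALM or principal.realm in TRUSTED_REALMS:
        return principal.primary
    raise PermissionError(
        f"realm {principal.realm} is not trusted by {LOCAL_REALM}"
    )
```

In this model a service principal such as `hdfs/host1@HADOOP.EXAMPLE.COM` and a corporate user such as `alice@EXAMPLE.COM` (a made-up user) are both accepted, while a principal from any other realm is rejected.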
12. IT Integration: Kerberos + LDAP
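[Diagram: the Hadoop cluster runs a local KDC (MIT Kerberos) that holds the service principals hdfs/host1@HADOOP.EXAMPLE.COM, yarn/host2@HADOOP.EXAMPLE.COM, …; the central Active Directory holds user principals such as user@EXAMPLE.COM, …; a cross-realm trust links the two. The NN and JT use LDAP group mapping.]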
13. Network Access Management
• Use Hue to front-end both Hadoop and Oozie to control access through a web browser
• HTTP proxy servers:
• Oozie : MR jobs, Pig jobs, Hive jobs
• HttpFS: hadoop fs is front-ended over HTTP
• HBase REST server: HBase reads
Secure configuration: the Oozie, Hue, and HttpFS front-ends are co-located to act as a network bridge.
Hue supports AD/LDAP-based authentication instead of Kerberos for client simplicity.
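As a rough illustration of what "front-ended over HTTP" means for HttpFS, the sketch below builds the kind of WebHDFS-style REST URL a client would request from the gateway. It only constructs the URL and sends nothing. The host name is hypothetical, port 14000 is assumed as the HttpFS default, and on a Kerberos-secured cluster the real request would also need to authenticate.

```python
from typing import Dict, Optional
from urllib.parse import quote, urlencode


def httpfs_url(
    host: str,
    path: str,
    op: str,
    port: int = 14000,  # assumed HttpFS default port
    params: Optional[Dict[str, str]] = None,
) -> str:
    """Build a WebHDFS-style REST URL for an HttpFS gateway."""
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    query = urlencode({"op": op, **(params or {})})
    # quote() leaves "/" untouched, so the HDFS path keeps its structure.
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Hypothetical gateway host and HDFS path, for illustration only.
listing_url = httpfs_url("gateway.example.com", "/user/alice/data", "LISTSTATUS")
```

Because clients only ever talk to the gateway host, the cluster nodes themselves can stay on an isolated network.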
15. Data: Protection in Hadoop
Data in Motion: “Network Encryption”
• SASL: Network RPC
• SSL: MapReduce shuffle
• SSL: Web-based user and administration tools
• SSL: JDBC
• HDFS data transfer protocol
Data at Rest: “Data Encryption”
• Certified partner solutions
• Field-level encryption
• Data masking or tokenization
• OS-level file system encryption
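To make the difference between masking and tokenization concrete, here is a minimal Python sketch. It is not how any particular partner product works: `mask_card_number` and `tokenize` are hypothetical helpers, and the keyed-hash approach shown is just one simple, one-way way to produce stable tokens (vault-based tokenization, which can be reversed by an authorized service, is a different design).

```python
import hashlib
import hmac


def mask_card_number(pan: str, visible: int = 4) -> str:
    """Masking: hide all but the last `visible` digits of a card number."""
    digits = [c for c in pan if c.isdigit()]
    keep = digits[-visible:] if visible > 0 else []
    return "*" * (len(digits) - len(keep)) + "".join(keep)


def tokenize(value: str, key: bytes) -> str:
    """Tokenization (keyed-hash variant): replace a value with a stable token.

    The same value and key always give the same token, so joins and counts
    still work on tokenized columns, but the original value cannot be read
    back from the token.
    """
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
```

Masking keeps part of the value readable for humans; tokenization keeps the value usable for analysis without exposing it.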
17. Prior State of Authorization
Two Sub-Optimal Choices for SQL on Hadoop
• Insecure Advisory Authorization
• Users could grant themselves permissions
• Intended to prevent accidental deletion of data
• Problem: Did not guard against malicious users
• Problem: Only worked with Hive
• HDFS Impersonation
• Data was only protected at the file level by HDFS permissions
• Problem: File-level not granular enough
• Problem: Lacked flexibility; not role-based
18. Sentry: Key Capabilities
Fine-Grained Authorization
• Specify security for SERVERS, DATABASES, TABLES, VIEWS, and search indices
Role-Based Authorization
• SELECT privilege on views & tables
• INSERT privilege on tables
• TRANSFORM privilege on servers
• ALL privilege on the server, databases, tables & views
• ALL privilege is needed to create/modify schema
Multitenant Administration
• Separate policies for each database/schema
• Can be maintained by separate admins
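The capabilities above can be sketched as a toy role-based model in Python. This is not Sentry's API or policy format; `Privilege`, `Policy`, and `covers` are hypothetical names, the role, group, and object names are invented, and the sketch assumes that roles are assigned to groups, that a privilege on a server or database also covers the objects inside it, and that ALL implies SELECT and INSERT.

```python
from typing import Dict, Iterable, NamedTuple, Optional, Set


class Privilege(NamedTuple):
    action: str                      # "SELECT", "INSERT" or "ALL"
    server: str
    database: Optional[str] = None   # None: every database on the server
    table: Optional[str] = None      # None: every table or view in scope


def covers(granted: Privilege, action: str, server: str,
           database: str, table: str) -> bool:
    """True if `granted` permits `action` on server.database.table."""
    if granted.action not in ("ALL", action):
        return False
    if granted.server != server:
        return False
    if granted.database is not None and granted.database != database:
        return False
    if granted.table is not None and granted.table != table:
        return False
    return True


class Policy:
    def __init__(self) -> None:
        self.role_privileges: Dict[str, Set[Privilege]] = {}
        self.group_roles: Dict[str, Set[str]] = {}

    def grant(self, role: str, privilege: Privilege) -> None:
        self.role_privileges.setdefault(role, set()).add(privilege)

    def assign(self, group: str, role: str) -> None:
        self.group_roles.setdefault(group, set()).add(role)

    def is_allowed(self, groups: Iterable[str], action: str, server: str,
                   database: str, table: str) -> bool:
        for group in groups:
            for role in self.group_roles.get(group, ()):
                for privilege in self.role_privileges.get(role, ()):
                    if covers(privilege, action, server, database, table):
                        return True
        return False


# Invented example: analysts may read one table, sales admins own the database.
policy = Policy()
policy.grant("analyst", Privilege("SELECT", "server1", "sales", "orders"))
policy.grant("sales_admin", Privilege("ALL", "server1", "sales"))
policy.assign("analysts", "analyst")
policy.assign("sales_ops", "sales_admin")
```

Because each database has its own grants, a separate admin can maintain the `sales` policy without touching any other database, which is the multitenant point made on the slide.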
22. Apache Ecosystem and Sentry
Inline support in Cloudera Impala
Extensibility plug-in for Apache HiveServer2
Inline support in Cloudera Search
Complementary security with HDFS ACLs
23. Access: Authorization in Hadoop
File ACL
• Permission at file-level granularity
• HDFS POSIX-style permissions: u/g/o
App and Workflow
• Access Control Lists (ACL)
• HBase, Oozie, MapReduce
Data RBAC
• Permissions on tables, views, indices
• Sentry for HiveServer2, Impala, Search
Admin RBAC
• Cloudera Manager, Hue
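The POSIX-style u/g/o check mentioned for HDFS can be sketched in a few lines of Python. This is a simplified model, not HDFS code: it ignores the superuser and extended ACL entries, and `is_permitted` and the user and group names are hypothetical.

```python
from typing import Iterable

READ, WRITE, EXECUTE = 4, 2, 1


def is_permitted(mode: int, owner: str, group: str,
                 user: str, user_groups: Iterable[str], want: int) -> bool:
    """Check `want` (a READ/WRITE/EXECUTE bit mask) against an octal mode.

    Exactly one class applies: owner if the user owns the file, else group
    if the user belongs to the file's group, else other.
    """
    if user == owner:
        bits = (mode >> 6) & 7
    elif group in set(user_groups):
        bits = (mode >> 3) & 7
    else:
        bits = mode & 7
    return (bits & want) == want
```

With mode `0o640`, the owner can read and write, members of the file's group can only read, and everyone else is denied. That is all the granularity this layer offers, which is why table-level and view-level permissions need a separate mechanism.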
25. Visibility: Cloudera Navigator
Audit & Access Control
• Maintain full audit history
• Ensure appropriate permissions and report on data access for compliance
Discovery & Exploration
• Find out what data is available and what it looks like
Lineage
• Trace data back to its original source
Lifecycle Management
• Migrate data based on policies
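Lineage, in the sense used above, amounts to walking a "derived from" graph back to the datasets that have no upstream inputs. The Python sketch below illustrates that idea only; it is not Cloudera Navigator's API, and `trace_sources` and the dataset names are hypothetical.

```python
from typing import Dict, List, Set


def trace_sources(dataset: str, derived_from: Dict[str, List[str]]) -> Set[str]:
    """Return the original sources that `dataset` was ultimately derived from.

    `derived_from` maps each dataset to the datasets it was built from;
    a dataset with no entry (or an empty list) counts as an original source.
    """
    sources: Set[str] = set()
    seen: Set[str] = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited; also guards against cycles
        seen.add(current)
        parents = derived_from.get(current, [])
        if parents:
            stack.extend(parents)
        else:
            sources.add(current)
    return sources


# Invented lineage graph for illustration.
lineage = {
    "report": ["joined"],
    "joined": ["orders_clean", "customers"],
    "orders_clean": ["orders_raw"],
}
```

Here the sources of `report` are `orders_raw` and `customers`, so an auditor can see which raw inputs a sensitive result depends on.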
[Diagram: Cloudera’s Enterprise Data Hub. Unified, elastic, resilient, secure storage for any type of data (filesystem: HDFS; online NoSQL: HBase) with workload management (YARN), supporting batch processing (MapReduce), analytic SQL (Impala), search engine (Solr), machine learning (Spark), stream processing (Spark Streaming), and 3rd-party apps, alongside data management (Cloudera Navigator), system management (Cloudera Manager), and Sentry.]
26. Why Navigator?
(1) Lots of Data Landing in Cloudera Enterprise
• Huge quantities
• Many different sources – structured and unstructured
• Varying levels of sensitivity
(2) Many Users Working with the Data
• Administrators and compliance officers
• Analysts and data scientists
• Business users
(3) Need to Effectively Control and Consume Data
• Get visibility and control over the environment
• Discover and explore data
30. Leading Investment to Address the Challenges
Authentication: First Hadoop distribution to offer strong authentication throughout
Encryption: First Hadoop distribution to support encryption on wire
Audit: Only Hadoop distribution to support audit histories for all data objects and access paths; single point for log capture, audit
Authorization: Founded the Apache Sentry project along with Oracle and Lab41 to manage fine-grained permissions
Automation: Cloudera Manager automates security configurations & LDAP/AD integration
31. Cloudera 5: Enabling the Enterprise Data Hub
[Comparison graphic: check and cross marks against four rows: Open Source, Scalable, Flexible, Cost-Effective; Managed; Open Architecture; Secure and Governed]
[Diagram: Cloudera’s Enterprise Data Hub, with storage (HDFS, HBase), workload management (YARN), processing engines (MapReduce, Impala, Solr, Spark, Spark Streaming), 3rd-party apps, Cloudera Navigator, Cloudera Manager, and Sentry]