3. Hadoop File Permissions
• Added in HADOOP-1298
• Hadoop 0.16
• Early 2008
• Authorization without authentication
• POSIX-like RWX bits
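A minimal sketch of how a POSIX-like RWX check can work, assuming a simplified owner/group/other model. The names `FileStatus` and `is_allowed` are hypothetical and are not Hadoop APIs. The user name and group list are taken on trust, which is what "authorization without authentication" means in practice; superuser handling is left out.

```python
from dataclasses import dataclass

# Permission bits within one rwx triple.
READ, WRITE, EXECUTE = 4, 2, 1


@dataclass(frozen=True)
class FileStatus:
    """Hypothetical stand-in for a file's ownership and mode."""
    owner: str
    group: str
    mode: int  # octal mode, e.g. 0o640


def is_allowed(status, user, groups, wanted):
    """Return True if `user` holds every bit in `wanted` on this file.

    `user` and `groups` are whatever the client claims (assumption:
    nothing here verifies the claimed identity).
    """
    if user == status.owner:
        bits = (status.mode >> 6) & 0o7   # owner triple
    elif status.group in groups:
        bits = (status.mode >> 3) & 0o7   # group triple
    else:
        bits = status.mode & 0o7          # other triple
    return (bits & wanted) == wanted
```

For a file owned by `alice:analysts` with mode `0o640`, a member of `analysts` can read but not write, and everyone else is denied.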
4. MapReduce ACLs
• Added in HADOOP-3698
• Hadoop 0.19
• Late 2008
• ACLs per job queue
• Set a list of allowed users or groups per operation (job submission, job administration)
• No authentication
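The per-queue ACL idea can be sketched as a lookup from queue and operation to a list of allowed users and groups. This is an illustration only: `QueueAcl`, `QUEUE_ACLS`, `check_queue_access`, and the queue name are hypothetical and do not reflect Hadoop's actual configuration format.

```python
from dataclasses import dataclass, field


@dataclass
class QueueAcl:
    """Allowed users and groups for one operation on one queue."""
    users: set = field(default_factory=set)
    groups: set = field(default_factory=set)

    def permits(self, user, user_groups):
        # Allowed if the user is listed or belongs to a listed group.
        return user in self.users or bool(self.groups & set(user_groups))


# Hypothetical configuration: one queue, two operations.
QUEUE_ACLS = {
    "research": {
        "submit": QueueAcl(users={"alice"}, groups={"analysts"}),
        "administer": QueueAcl(users={"alice"}),
    },
}


def check_queue_access(queue, operation, user, user_groups):
    """Return True if the claimed user may perform `operation` on `queue`.

    Unknown queues and operations are denied (an assumption of this sketch).
    """
    acl = QUEUE_ACLS.get(queue, {}).get(operation)
    return acl is not None and acl.permits(user, user_groups)
```

As with file permissions at this stage, the user name is only a claim, so the check guards against mistakes rather than against a determined attacker.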
5. Securing a Cluster Through a Gateway
• Hadoop cluster runs on a private network
• Gateway server is dual-homed (Hadoop network and public network)
• Users SSH onto gateway
• Optionally, create an SSH proxy so jobs can be submitted from the client machine
• Provides a minimum level of protection
7. Prevent Accidental Access
• Don’t let users shoot themselves in the foot
• Main driver for early features
• Not security per se, but a critical first step
• Doesn’t require strong authentication
8. Stop Malicious Users
• Early features were necessary, but not sufficient
• Security has to get real
• Hadoop runs arbitrary code
• Implicit trust doesn’t prevent the insider threat
9. Co-mingle All Your Data
• Often overlooked
• Big data means getting rid of stovepipes
• Scalability and flexibility are only 50% of the problem
• Trust your data in a multi-tenant environment
• Most critical driver
11. Authorization
• Files
• MapReduce/YARN job queues
• Service-level authorization
• Whitelists and blacklists of hosts and users
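A rough sketch of a whitelist/blacklist check for hosts and users. The semantics are assumptions made for illustration: a blacklist entry always wins, and an empty whitelist means "no restriction". The function names and the policy dictionary layout are hypothetical.

```python
def is_admitted(name, whitelist, blacklist):
    """Apply one whitelist/blacklist pair to a host name or user name."""
    if name in blacklist:
        return False                      # assumption: blacklist takes precedence
    return not whitelist or name in whitelist  # assumption: empty whitelist allows all


def authorize_connection(host, user, policy):
    """Admit a connection only if both the host and the user pass."""
    return (
        is_admitted(host, policy["allowed_hosts"], policy["denied_hosts"])
        and is_admitted(user, policy["allowed_users"], policy["denied_users"])
    )
```

A policy that allows two hosts, denies one user, and leaves the user whitelist empty would admit any other user connecting from either host.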
12. Authentication
• HADOOP-4487
• Hadoop 0.22 and 0.20.205
• Late 2010
• Based on Kerberos and internal delegation tokens
• Provides strong user authentication
• Also used for service-to-service authentication
[Figure: document excerpt, section 2.2 "High Level Use Cases", with Figure 1: HDFS High-level Dataflow]
13. Encryption
• Over-the-wire encryption for some socket connections
• RPC encryption added soon after Kerberos
• Shuffle encryption (HTTPS) added in Hadoop 2.0.2-alpha, backported to CDH4 MR1
• HDFS block streamer encryption added in Hadoop 2.0.2-alpha
• Volume-level encryption for data at rest
15. Apache Accumulo
• Robust, scalable, high-performance data storage and retrieval system
• Built by NSA, now an Apache project
• Based on Google’s Bigtable
• Built on top of HDFS, ZooKeeper and Thrift
• Iterators for server-side extensions
• Cell labels for flexible security models
16. Data Model
• Multi-dimensional, persistent, sorted map
• Key/Value store with a twist
• A single primary key (Row ID)
• Secondary key (Column) internal to a row, made up of a Family and a Qualifier
• Per-cell timestamp
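The data model above can be pictured as one sorted map keyed by (row ID, column family, column qualifier, timestamp). The sketch below is an in-memory illustration, not Accumulo's implementation; `SortedCellMap` is a hypothetical name, and sorting newer timestamps first is an assumption of the sketch.

```python
import bisect


class SortedCellMap:
    """Toy sorted map of cells keyed by (row, family, qualifier, timestamp)."""

    def __init__(self):
        # Keys are kept sorted; the timestamp is negated so that, within
        # one column, newer versions sort first (assumption of this sketch).
        self._keys = []
        self._values = {}

    def put(self, row, family, qualifier, timestamp, value):
        key = (row, family, qualifier, -timestamp)
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan_row(self, row):
        """Yield ((row, family, qualifier, timestamp), value) in sorted order."""
        # (row,) sorts before every 4-tuple that starts with `row`.
        start = bisect.bisect_left(self._keys, (row,))
        for key in self._keys[start:]:
            if key[0] != row:
                break
            r, family, qualifier, neg_ts = key
            yield (r, family, qualifier, -neg_ts), self._values[key]
```

Because the keys are sorted, all cells for one row are contiguous, so a row lookup is a binary search followed by a short sequential scan.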
17. Cell-Level Security
• Labels stored per cell
• Labels consist of Boolean expressions (AND, OR, nesting)
• Labels associated with each user
• Cell labels checked against user’s labels with a built-in iterator
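To make the label check concrete, here is a small sketch of evaluating a Boolean cell label against a user's set of labels. It is not Accumulo's code. The syntax is an assumption: `&` for AND, `|` for OR, parentheses for nesting, and a rule that mixing `&` and `|` at one level requires parentheses. Treating an empty label as visible to everyone is also an assumption, and `is_visible` is a hypothetical name.

```python
import re

_TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|[&|()])")


def tokenize(label):
    """Split a label expression into terms, operators, and parentheses."""
    label = label.strip()
    tokens, pos = [], 0
    while pos < len(label):
        match = _TOKEN.match(label, pos)
        if not match:
            raise ValueError(f"bad character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def is_visible(label, authorizations):
    """Return True if `authorizations` satisfies the cell's label expression."""
    tokens = tokenize(label)
    if not tokens:
        return True  # assumption: an unlabeled cell is visible to everyone
    pos = 0

    def parse_expr():
        nonlocal pos
        result = parse_term()
        op = None
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            if op is None:
                op = tokens[pos]
            elif tokens[pos] != op:
                raise ValueError("mixing & and | requires parentheses")
            pos += 1
            rhs = parse_term()
            result = (result and rhs) if op == "&" else (result or rhs)
        return result

    def parse_term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of label")
        token = tokens[pos]
        pos += 1
        if token == "(":
            value = parse_expr()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return value
        if token in ("&", "|", ")"):
            raise ValueError(f"unexpected token {token!r}")
        return token in authorizations  # a bare term is a set-membership test

    value = parse_expr()
    if pos != len(tokens):
        raise ValueError("trailing tokens in label")
    return value
```

A scan-time filter would call something like `is_visible(cell_label, user_labels)` for each cell and drop the cells for which it returns False.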
18. Pluggable Authentication
• Currently supports username/password authentication backed by ZooKeeper
• ACCUMULO-259
• Targeted for Accumulo 1.5.0
• Authentication info replaced with generic tokens
• Supports multiple implementations (e.g. Kerberos)
19. Application Level
• Accumulo is often paired with application-level authentication/authorization
• Accumulo users are created per application
• Each application is granted the access level of its most-privileged user
• The application authenticates users, looks up their authorizations, and passes the user’s labels with each request
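The application-level pattern can be sketched as follows: the application holds the broadest set of labels, authenticates each end user itself, and attaches only that user's labels to a request. Everything here is hypothetical (`USERS`, `APP_AUTHORIZATIONS`, `labels_for_request`), and the unsalted SHA-256 password check is a placeholder for a real credential store.

```python
import hashlib
import hmac


def _digest(password):
    # Placeholder only: a real system would use a salted, slow password hash.
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


# Labels granted to the application's own account (hypothetical).
APP_AUTHORIZATIONS = frozenset({"public", "internal", "secret"})

# Hypothetical application-side user directory.
USERS = {
    "alice": {"password_sha256": _digest("wonderland"),
              "labels": {"public", "internal"}},
    "bob": {"password_sha256": _digest("builder"),
            "labels": {"public", "topsecret"}},
}


def authenticate(username, password):
    """Check the end user's credentials inside the application."""
    record = USERS.get(username)
    return record is not None and hmac.compare_digest(
        record["password_sha256"], _digest(password)
    )


def labels_for_request(username, password):
    """Return the labels to send with this user's request.

    The result is capped at what the application itself was granted,
    so a user can never exceed the application's access level.
    """
    if not authenticate(username, password):
        raise PermissionError("authentication failed")
    return frozenset(USERS[username]["labels"]) & APP_AUTHORIZATIONS
```

In this sketch `bob` is listed with a label the application does not hold, so that label is dropped before the request is made.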
20. Apache HBase
• Also based on Google’s Bigtable
• Started as a Hadoop contrib project
• Supports column-level ACLs
• Kerberos for authentication
• Discussion and early prototypes of cell-level security ongoing
22. Encryption for Data at Rest
• Need multiple levels of granularity
• Encryption keys tied to authorization labels (like Accumulo labels or HBase ACLs)
• APIs for file-level, block-level, or record-level encryption