A Thorough Comparison of Delta Lake, Iceberg and Hudi - Databricks
Recently, a set of modern table formats such as Delta Lake, Hudi, and Iceberg has emerged. Alongside the Hive Metastore, these table formats aim to solve long-standing problems of traditional data lakes through declared features such as ACID transactions, schema evolution, upsert, time travel, and incremental consumption.
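To make those features concrete, here is a minimal PySpark sketch, assuming a SparkSession started with the delta-spark package; the table path and column names are illustrative. It shows an ACID upsert via MERGE and a time-travel read with Delta Lake; Hudi and Iceberg expose analogous operations through their own options.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

# Assumes Spark is launched with the delta-spark package and the Delta SQL extensions.
spark = (SparkSession.builder
         .appName("table-format-demo")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Initial write creates version 0 of the table.
spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"]).write \
    .format("delta").mode("overwrite").save("/tmp/events")

# Upsert: merge new records into the existing table as a single ACID commit.
updates = spark.createDataFrame([(2, "b2"), (3, "c")], ["id", "val"])
(DeltaTable.forPath(spark, "/tmp/events").alias("t")
    .merge(updates.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Time travel: read the table as of the first commit.
spark.read.format("delta").option("versionAsOf", 0).load("/tmp/events").show()
```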
Apache Hive is a rapidly evolving project which continues to enjoy great adoption in the big data ecosystem. As Hive continues to grow its support for analytics, reporting, and interactive query, the community is hard at work improving it along many different dimensions and use cases. This talk will provide an overview of the latest and greatest features and optimizations which have landed in the project over the last year. Materialized views, the extension of ACID semantics to non-ORC data, and workload management are some noteworthy new features.
We will discuss optimizations which provide major performance gains as well as integration with other big data technologies such as Apache Spark, Druid, and Kafka. The talk will also provide a glimpse of what is expected to come in the near future.
Presto: Optimizing Performance of SQL-on-Anything Engine - DataWorks Summit
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, Presto has experienced unprecedented growth in popularity in the last few years, in both on-premises and cloud deployments over object stores, HDFS, NoSQL, and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, the recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail and discuss the best use cases for Presto across several industries. In addition, we will present recent Presto advancements such as geospatial analytics at scale and the project roadmap going forward.
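As a small illustration of the SQL-on-anything model, the sketch below uses the presto-python-client package to run a single query that joins tables from two different catalogs; the host, catalogs, and table names are placeholder assumptions, not details from the talk.

```python
import prestodb  # presto-python-client

# Placeholder connection details; adjust host/port/catalog/schema for your cluster.
conn = prestodb.dbapi.connect(
    host="presto.example.com",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()

# A single query can join tables exposed by different connectors (here: hive and mysql catalogs).
cur.execute("""
    SELECT o.region, count(*) AS orders
    FROM hive.sales.orders o
    JOIN mysql.crm.customers c ON o.customer_id = c.id
    GROUP BY o.region
""")
for row in cur.fetchall():
    print(row)
```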
Apache Spark on Kubernetes - Anirudh Ramanathan and Tim Chen, Databricks
Kubernetes is a fast-growing open-source platform which provides container-centric infrastructure. Conceived by Google in 2014, and leveraging over a decade of experience running containers at scale internally, it is one of the fastest moving projects on GitHub with 1000+ contributors and 40,000+ commits. Kubernetes has first-class support on Google Cloud Platform, Amazon Web Services, and Microsoft Azure.
Unlike YARN, Kubernetes started as a general purpose orchestration framework with a focus on serving jobs. Support for long-running, data intensive batch workloads required some careful design decisions. Engineers across several organizations have been working on Kubernetes support as a cluster scheduler backend within Spark. During this process, we encountered several challenges in translating Spark considerations into idiomatic Kubernetes constructs. In this talk, we describe the challenges and the ways in which we solved them. This talk will be technical and is aimed at people who are looking to run Spark effectively on their clusters. The talk assumes basic familiarity with cluster orchestration and containers.
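For orientation, here is a hedged sketch of pointing a PySpark application at a Kubernetes cluster; the API server URL, namespace, container image, and service account are placeholders, and in practice these properties are usually passed to spark-submit rather than set in application code.

```python
from pyspark.sql import SparkSession

# Placeholder cluster endpoint and image; executor pods are created via the Kubernetes API.
spark = (SparkSession.builder
         .appName("spark-on-k8s-demo")
         .master("k8s://https://kubernetes.example.com:6443")
         .config("spark.kubernetes.namespace", "spark-jobs")
         .config("spark.kubernetes.container.image", "example/spark:3.5.0")
         .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
         .config("spark.executor.instances", "4")
         .getOrCreate())

# Trivial job to exercise the executors.
spark.range(1_000_000).selectExpr("sum(id)").show()
```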
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ... - Databricks
Uber has a real need to provide faster, fresher data to data consumers and products, which run hundreds of thousands of analytical queries every day. Uber engineers will share the design, architecture, and use cases of the second generation of Hudi, a self-contained Apache Spark library for building large-scale analytical datasets designed to serve such needs and beyond. Hudi (formerly Hoodie) was created to effectively manage petabytes of analytical data on distributed storage while supporting fast ingestion and queries. In this talk, we will discuss how we leveraged Spark as a general-purpose distributed execution engine to build Hudi, detailing tradeoffs and operational experience. We will also show how to ingest data into Hudi using the Spark Datasource/Streaming APIs and how to build notebooks and dashboards on top using Spark SQL.
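As a hedged sketch of the Datasource path just mentioned, the snippet below writes an upsert into a Hudi table from PySpark and reads it back with Spark SQL; the table name, key fields, and paths are illustrative placeholders, and it assumes Spark was launched with the Hudi Spark bundle on the classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-ingest").getOrCreate()

df = spark.createDataFrame(
    [("r1", "2019-01-01", 10.0), ("r2", "2019-01-01", 12.5)],
    ["uuid", "ds", "fare"],
)

hudi_options = {
    "hoodie.table.name": "trips",                       # placeholder table name
    "hoodie.datasource.write.recordkey.field": "uuid",  # record key used for upserts
    "hoodie.datasource.write.partitionpath.field": "ds",
    "hoodie.datasource.write.precombine.field": "fare",
    "hoodie.datasource.write.operation": "upsert",
}

(df.write.format("hudi")
   .options(**hudi_options)
   .mode("append")
   .save("/tmp/hudi/trips"))

# Query back with Spark SQL, as one would from a notebook or dashboard.
spark.read.format("hudi").load("/tmp/hudi/trips").createOrReplaceTempView("trips")
spark.sql("SELECT ds, count(*) AS n FROM trips GROUP BY ds").show()
```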
Data in Hadoop is getting bigger every day, the number of data consumers is growing, and organizations are now looking to make their Hadoop clusters compliant with federal regulations and commercial demands. Apache Ranger simplifies the management of security policies across all components in Hadoop and provides granular access controls to data.
The deck describes the security tools available in Hadoop and their purpose, then moves on to discuss Apache Ranger in detail.
From: DataWorks Summit 2017 - Munich - 20170406
HBase has established itself as the backend for many operational and interactive use cases, powering well-known services that support millions of users and thousands of concurrent requests. In terms of features, HBase has come a long way, offering advanced options such as multi-level caching on- and off-heap, pluggable request handling, fast recovery options such as region replicas, table snapshots for data governance, tunable write-ahead logging, and so on. This talk is based on the research for the upcoming second edition of the speaker's HBase book, correlated with practical experience in medium to large HBase projects around the world. You will learn how to plan for HBase, starting with the selection of matching use cases, to determining the number of servers needed, leading into performance tuning options. There is no reason to be afraid of using HBase, but knowing its basic premises and technical choices will make using it much more successful. You will also learn about many of the new features of HBase up to version 1.3, and where they are applicable.
Cutting-edge Hadoop clusters are bound to need custom (add-on) services that are not available in the Hadoop distribution of their choice. Agility is crucial for companies to integrate any service into existing large-scale Hadoop clusters with ease.
Apache Ambari manages the Hadoop cluster and solves this problem by extending the stack with add-on services, which can be a new Apache project, a different Hadoop file system, or an internal tool. This talk covers how to create a service definition in Ambari to manage lifecycle commands and configs, plus advanced topics like packaging, installing from multiple repositories, recommending and validating configs using Service Advisor, running custom commands, defining dependencies on configs and other services, and more. We will also cover how to create custom metrics and dashboards using the Ambari Metrics System and Grafana, generating alerts, and enabling security by authenticating with Kerberos.
Further, we will discuss the future of service definitions and how Ambari 3.0 will support custom services through Management Packs to enable Hadoop vendors to release software faster.
Speaker
Jayush Luniya, Principal Software Engineer, Hortonworks
Building robust CDC pipeline with Apache Hudi and Debezium - Tathastu.ai
We cover the need for CDC and the benefits of building a CDC pipeline, and compare various CDC streaming and reconciliation frameworks. We also cover the architecture and the challenges we faced while running this system in production. Finally, we conclude the talk by covering Apache Hudi, Schema Registry, and Debezium in detail, along with our contributions to the open-source community.
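A hedged sketch of one way such a pipeline can be wired together, assuming Debezium publishes row-level change events to a Kafka topic as JSON and Spark Structured Streaming upserts the "after" image into a Hudi table; the topic, schema, and paths are placeholders rather than details from the talk.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.appName("cdc-to-hudi").getOrCreate()

# Simplified schema for the "after" image of a Debezium change event (placeholder fields).
after_schema = StructType([
    StructField("id", LongType()),
    StructField("name", StringType()),
    StructField("updated_at", StringType()),
])
envelope = StructType([StructField("after", after_schema)])

# Read the raw change events from Kafka and unwrap the row image.
changes = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "kafka:9092")   # placeholder brokers
           .option("subscribe", "dbserver1.inventory.users")  # placeholder topic
           .load()
           .select(from_json(col("value").cast("string"), envelope).alias("e"))
           .select("e.after.*"))

# Continuously upsert the changes into a Hudi table keyed by id.
query = (changes.writeStream.format("hudi")
         .option("hoodie.table.name", "users_cdc")
         .option("hoodie.datasource.write.recordkey.field", "id")
         .option("hoodie.datasource.write.precombine.field", "updated_at")
         .option("checkpointLocation", "/tmp/checkpoints/users_cdc")
         .outputMode("append")
         .start("/tmp/hudi/users_cdc"))
query.awaitTermination()
```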
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud - Noritaka Sekiyama
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud (Hadoop / Spark Conference Japan 2019)
# English version #
http://hadoop.apache.jp/hcj2019-program/
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013 - mumrah
Apache Kafka is a new breed of messaging system built for the "big data" world. Coming out of LinkedIn (and donated to Apache), it is a distributed pub/sub system built in Scala. It has been an Apache TLP now for several months with the first Apache release imminent. Built for speed, scalability, and robustness, Kafka should definitely be one of the data tools you consider when designing distributed data-oriented applications.
The talk will cover a general overview of the project and technology, with some use cases, and a demo.
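For a feel of the pub/sub model described above, here is a minimal sketch using the kafka-python client; the broker address and topic name are placeholders.

```python
from kafka import KafkaConsumer, KafkaProducer

# Placeholder broker and topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("page-views", key=b"user-42", value=b'{"page": "/home"}')
producer.flush()

# Consumers in the same group share partitions; separate groups each see every message.
consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers="localhost:9092",
    group_id="analytics",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.partition, message.offset, message.value)
    break  # just demonstrate one record
```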
ORC files were originally introduced in Hive, but have now migrated to an independent Apache project. This has sped up the development of ORC and simplified integrating ORC into other projects, such as Hadoop, Spark, Presto, and Nifi. There are also many new tools that are built on top of ORC, such as Hive’s ACID transactions and LLAP, which provides incredibly fast reads for your hot data. LLAP also provides strong security guarantees that allow each user to only see the rows and columns that they have permission for.
This talk will discuss the details of the ORC and Parquet formats and the relevant tradeoffs. In particular, it will discuss how to format your data and which options to use to maximize your read performance, including when and how to use ORC's schema evolution, bloom filters, and predicate push down. It will also show how to use the tools to translate ORC files into human-readable formats such as JSON, and to display the rich metadata from the file, including the types in the file and the min, max, and count for each column.
How to build a streaming Lakehouse with Flink, Kafka, and Hudi - Flink Forward
Flink Forward San Francisco 2022.
With a real-time processing engine like Flink and a transactional storage layer like Hudi, it has never been easier to build end-to-end low-latency data platforms connecting sources like Kafka to data lake storage. Come learn how to blend Lakehouse architectural patterns with real-time processing pipelines with Flink and Hudi. We will dive deep on how Flink can leverage the newest features of Hudi like multi-modal indexing that dramatically improves query and write performance, data skipping that reduces the query latency by 10x for large datasets, and many more innovations unique to Flink and Hudi.
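A rough PyFlink SQL sketch of the Kafka-to-Hudi flow described above; the topics, schemas, and connector properties are illustrative assumptions (the exact Hudi options depend on the bundle version), and the Kafka and Hudi connector jars are assumed to be on the classpath.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming Table API environment.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: a Kafka topic of order events (placeholder topic and brokers).
t_env.execute_sql("""
    CREATE TABLE orders_kafka (
        order_id STRING,
        amount DOUBLE,
        ts TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'kafka:9092',
        'format' = 'json',
        'scan.startup.mode' = 'earliest-offset'
    )
""")

# Sink: a Hudi merge-on-read table keyed by order_id (placeholder path).
t_env.execute_sql("""
    CREATE TABLE orders_hudi (
        order_id STRING,
        amount DOUBLE,
        ts TIMESTAMP(3),
        PRIMARY KEY (order_id) NOT ENFORCED
    ) WITH (
        'connector' = 'hudi',
        'path' = 'file:///tmp/hudi/orders',
        'table.type' = 'MERGE_ON_READ'
    )
""")

# Continuously upsert the Kafka stream into the Hudi table.
t_env.execute_sql("INSERT INTO orders_hudi SELECT order_id, amount, ts FROM orders_kafka")
```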
by
Ethan Guo & Kyle Weller
Apache Hive is a rapidly evolving project which continues to enjoy great adoption in the big data ecosystem. As Hive continues to grow its support for analytics, reporting, and interactive query, the community is hard at work improving it along many different dimensions and use cases. This talk will provide an overview of the latest and greatest features and optimizations which have landed in the project over the last year. Materialized views, the extension of ACID semantics to non-ORC data, and workload management are some noteworthy new features.
We will discuss optimizations which provide major performance gains, including significantly improved performance for ACID tables. The talk will also provide a glimpse of what is expected to come in the near future.
Hadoop 3.0 has been years in the making, and now it's finally arriving. Andrew Wang and Daniel Templeton offer an overview of new features, including HDFS erasure coding, YARN Timeline Service v2, YARN federation, and much more, and discuss current release management status and community testing efforts dedicated to making Hadoop 3.0 the best Hadoop major release yet.
This is the presentation I made at JavaDay Kiev 2015 regarding the architecture of Apache Spark. It covers the memory model, the shuffle implementations, data frames, and some other high-level stuff, and can be used as an introduction to Apache Spark.
Iceberg: A modern table format for big data (Strata NY 2018) - Ryan Blue
Hive tables are an integral part of the big data ecosystem, but the simple directory-based design that made them ubiquitous is increasingly problematic. Netflix uses tables backed by S3 that, like other object stores, don’t fit this directory-based model: listings are much slower, renames are not atomic, and results are eventually consistent. Even tables in HDFS are problematic at scale, and reliable query behavior requires readers to acquire locks and wait.
Owen O’Malley and Ryan Blue offer an overview of Iceberg, a new open source project that defines a table layout which addresses the challenges of current Hive tables, with properties specifically designed for cloud object stores such as S3. Iceberg is an Apache-licensed open source project. It specifies the portable table format and standardizes many important features, listed below (a minimal usage sketch follows the list):
* All reads use snapshot isolation without locking.
* No directory listings are required for query planning.
* Files can be added, removed, or replaced atomically.
* Full schema evolution supports changes in the table over time.
* Partitioning evolution enables changes to the physical layout without breaking existing queries.
* Data files are stored as Avro, ORC, or Parquet.
* Support for Spark, Pig, and Presto.
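The sketch below is a minimal PySpark illustration of these properties, assuming Spark is launched with the iceberg-spark-runtime package; the catalog name, warehouse path, and table are placeholders.

```python
from pyspark.sql import SparkSession

# A Hadoop-type Iceberg catalog named "demo" pointed at a local warehouse directory.
spark = (SparkSession.builder
         .appName("iceberg-demo")
         .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
         .config("spark.sql.catalog.demo.type", "hadoop")
         .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
         .getOrCreate())

spark.sql("CREATE TABLE demo.db.events (id BIGINT, ts TIMESTAMP, payload STRING) USING iceberg")
spark.sql("INSERT INTO demo.db.events VALUES (1, current_timestamp(), 'hello')")

# Every commit produces a snapshot; readers get snapshot isolation without directory listings.
spark.sql("SELECT snapshot_id, committed_at FROM demo.db.events.snapshots").show()
```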
It introduces and illustrates use cases, benefits, and problems for Kerberos deployment on Hadoop, and how Token support and TokenPreauth can help solve those problems. It also briefly introduces the Haox project, a Java client library for Kerberos.
A comprehensive overview of the security concepts in the open source Hadoop stack in mid 2015 with a look back into the "old days" and an outlook into future developments.
Apache Knox Gateway "Single Sign On" expands the reach of the Enterprise Users - DataWorks Summit
Apache Knox Gateway is a proxy for interacting with Apache Hadoop clusters in a secure way, providing authentication, service-level authorization, and many other extensions to secure any HTTP interactions in your cluster. One main feature of Apache Knox Gateway is the ability to extend the reach of your REST APIs to the internet while still securing your cluster and working with Kerberos. Recent contributions to the Apache Knox community have added support for Single Sign On (SSO) based on Pac4j 1.8.9, a very powerful security engine which provides SSO support through SAML2, OAuth, OpenID, and CAS. In addition, through recent community contributions, Apache Ambari and Apache Ranger can now also provide SSO authentication through Knox. This paper will discuss the architecture of Knox SSO, explain how enterprise users can benefit from this feature, and present enterprise use cases for Knox SSO as well as integration with open source Shibboleth, ADFS Windows Server IdP support, and the Okta cloud IdP.
Big Data and Security - Where are we now? (2015) - Peter Wood
Peter Wood started looking at Big Data as a solution for Advanced Threat Protection in 2013. This presentation examines how Big Data is being used for security in 2015, how this market is developing and how realistic vendor offerings are.
Abstract:
As organizations start to roll out or migrate data-driven applications to Apache Hadoop, there are times when they have conflicting needs: to leverage their full co-mingled data sets in Hadoop while providing isolation of sections of such co-mingled data to a specific customer. Serving multiple customers in this manner is a typical multi-tenant use case, and one that can be challenging in Apache Hadoop.
This presentation walks through a number of patterns that can be leveraged for providing isolation of tenants based on the composability of Apache Knox for:
* Authentication/Federation Providers
* KnoxSSO
* Identity Assertion
* Tenant specific topologies
With these patterns, Knox can provide an infrastructure for robust tenant isolation and access control for application UIs and REST APIs for your data landscape, when suitably coupled with a cluster that has carefully considered infrastructure including:
* Kerberos
* Tenant specific user accounts, OUs and Groups within LDAP
* Authorization policy that is aware of the tenant-specific groups
Summary:
We will walk through some of the patterns that have been used to enable such a multi-tenant environment as well as the specific considerations for topology, access control and user accounts involved with creating such an environment.
Today enterprises desire to move more and more of their data lakes to the cloud to help them execute faster, increase productivity, drive innovation while leveraging the scale and flexibility of the cloud. However, such gains come with risks and challenges in the areas of data security, privacy, and governance. In this talk we cover how enterprises can overcome governance and security obstacles to leverage these new advances that the cloud can provide to ease the management of their data lakes in the cloud. We will also show how the enterprise can have consistent governance and security controls in the cloud for their ephemeral analytic workloads in a multi-cluster cloud environment without sacrificing any of the data security and privacy/compliance needs that their business context demands. Additionally, we will outline some use cases and patterns as well as best practices to rationally manage such a multi-cluster data lake infrastructure in the cloud.
Speaker:
Jeff Sposetti, Product Management, Hortonworks
Securing Hadoop's REST APIs with Apache Knox Gateway Hadoop Summit June 6th, ... - Kevin Minder
Securing Hadoop's REST APIs with Apache Knox Gateway
Presented at Hadoop Summit on June 6th, 2014
Describes the overall roles the Apache Knox Gateway plays in Hadoop security and briefly covers its primary features.
As Hadoop becomes a critical part of Enterprise data infrastructure, securing Hadoop has become critically important. Enterprises want assurance that all their data is protected and that only authorized users have access to the relevant bits of information. In this session we will cover all aspects of Hadoop security including authentication, authorization, audit and data protection. We will also provide demonstration and detailed instructions for implementing comprehensive Hadoop security.
Troubleshooting Kerberos in Hadoop: Taming the Beast - DataWorks Summit
Kerberos is the ubiquitous authentication mechanism when it comes to securing any Hadoop service. With recent updates in Hadoop core and various Apache Hadoop components, inherent Kerberos support has matured and has come a long way.
Understanding and configuring Kerberos is still a challenge, but even more painful and frustrating is troubleshooting a Kerberos issue. There are a lot of things (small and big) that can go wrong (and will go wrong!). This talk covers the Kerberos debugging part in detail and discusses the tools and tricks that can be used to narrow down any Kerberos issue.
Rather than discussing individual issues and their resolutions, we will focus on how to approach a Kerberos problem and the do's and don'ts of the Kerberos scene. This talk will provide a step-by-step guide that will equip the audience for troubleshooting future Kerberos problems.
Agenda is to discuss:
- Systematic approach to Kerberos troubleshooting
- Kerberos Tools available in Hadoop arsenal
- Tips & Tricks to narrow down Kerberos issues quickly
- Some nasty Kerberos issues from Support trenches
Some prior knowledge of Kerberos basics will be appreciated but is not a prerequisite.
Speaker:
Vipin Rathor, Sr. Product Specialist (HDP Security), Hortonworks
Treat your enterprise data lake indigestion: Enterprise ready security and go... - DataWorks Summit
Most enterprises with large data lakes today are flying blind when it comes to understanding how the data in their data lakes is organized, accessed, and utilized to create real business value. Coupled with the need to democratize data, enterprises often realize they have created a data swamp loaded with all kinds of data assets, without any curation and without appropriate security controls, hoping that developers and analysts can responsibly collaborate to generate insights. In this talk we will provide a broad overview of how organizations can use open source frameworks such as Apache Ranger and Apache Knox to secure their data lakes, and Apache Atlas to effectively provide open metadata and governance services for the Hadoop ecosystem. We will provide an overview of the new features that have been added to each of these Apache projects recently and how enterprises can leverage these new features to build a robust security and governance model for their data lakes.
Speaker
Owen O'Malley, Co-Founder & Technical Fellow, Hortonworks
The fundamentals and best practices of securing your Hadoop cluster are top of mind today. In this session, we will examine and explain the components, tools, and frameworks used in Hadoop for authentication, authorization, audit, and encryption of data and processes. See how the latest innovations can let you securely connect more data to more users within your organization.
http://www.justin.tv/hackertv/49975/Tech_Talk_1_Leah_Culver_on_OAuth
Tech talk about OAuth, an open standard for API authentication. Originally broadcast on Justin.tv.
Apache Hadoop: design and implementation. Lecture in the Big data computing course (http://twiki.di.uniroma1.it/twiki/view/BDC/WebHome), Department of Computer Science, Sapienza University of Rome.
Nowadays a typical Hadoop deployment consists of core Hadoop components – HDFS and MapReduce – several other components such as HBase, HttpFS, Oozie, Pig, Hive, Sqoop, Flume, plus programmatic integration from external systems and applications. This effectively creates a complex and heterogeneous distributed environment that runs across several machines and uses different protocols to communicate with each other; all of which is used concurrently by several users and applications. When a Hadoop deployment and its ecosystem is used to process sensitive data (such as financial records, payment transactions, healthcare records), several security requirements arise. These security requirements may be dictated by internal policies and/or government regulations. They may require strong authentication, selective authorization to access data/resources, and data confidentiality. This session covers in detail how different components in the Hadoop ecosystem and external applications can interact with each other in a secure manner providing authentication, authorization, and confidentiality when accessing services and transferring data to/from/between services. The session will cover topics like Kerberos authentication, Web UI authentication, File System permissions, delegation tokens, Access Control Lists, ProxyUser impersonation and network encryption.
Historically, security hasn't been a high priority in regards to Hadoop (a reflection of the type of data and organizations using Hadoop), but now Hadoop is being used by more traditional firms with heightened security requirements. MapR's Senior Principal Technologist, Keys Botzum, gives a talk on how you can build a secure cluster.
Hadoop World 2011: Hadoop Gateway - Konstantin Schvako, eBay - Cloudera, Inc.
Access to Hadoop clusters through dedicated portal nodes (typically located behind firewalls and performing user authentication and authorization) can have several drawbacks -- as shared multitenant resources they can create contention among users and increase the maintenance overhead for cluster administrators. This session will discuss the Gateway system, a cluster virtualization framework that provides multiple benefits: seamless access from users’ workplace computers through corporate firewalls; the ability to failover to active clusters for scheduled or unscheduled downtime, as well as the ability to redirect traffic to other clusters during upgrades; and user access to clusters running different versions of Hadoop.
Big Data Warehousing Meetup: Securing the Hadoop Ecosystem by Cloudera - Caserta
In our recent Big Data Warehousing Meetup, we discussed Data Governance, Compliance and Security in Hadoop.
As the Big Data paradigm becomes more commonplace, we must apply enterprise-grade governance capabilities for critical data that is highly regulated and adhere to stringent compliance requirements. Caserta and Cloudera shared techniques and tools that enables data governance, compliance and security on Big Data.
For more information, visit www.casertaconcepts.com
Securing Hadoop by MapR's Senior Principal Technologist Keys Botzum - MapR Technologies
Historically, security hasn't been a high priority in regards to Hadoop (a reflection of the type of data and organizations using Hadoop), but now Hadoop is being used by more traditional firms with heightened security requirements. MapR's Senior Principal Technologist, Keys Botzum, gives a talk on how you can build a more secure cluster.
An overview of securing Hadoop. Content primarily by Balaji Ganesan, one of the leaders of the Apache Argus project. Presented on Sept 4, 2014 at the Toronto Hadoop User Group by Adam Muise.
This talk discusses the current status of Hadoop security and some exciting new security features that are coming in the next release. First, we provide an overview of current Hadoop security features across the stack, covering authentication, authorization, and auditing. Hadoop takes a "defense in depth" approach, so we discuss security at multiple layers: RPC, file system, and data processing. We provide a deep dive into the use of tokens in the security implementation. The second and larger portion of the talk covers the new security features. We discuss the motivation, use cases, and design for authorization improvements in HDFS, Hive, and HBase. For HDFS, we describe two styles of ACLs (access control lists) and the reasons for the choice we made. In the case of Hive, we compare and contrast two approaches to Hive authorization. Further, we show how our approach lends itself to a particular initial implementation choice that has the limitation that the Hive Server owns the data, but where an alternate, more general implementation is also possible down the road. In the case of HBase, cell-level authorization is explained. The talk will be fairly detailed, targeting a technical audience, including Hadoop contributors.
Securing Hadoop by Sr. Principal Technologist Keys Botzum - MapR Technologies
Presentation given at the Hadoop DC meetup on July of 2014. Keys presented on why Hadoop should be secured, what it takes, and how to get security beyond the core.
Keys Botzum - Senior Principal Technologist at MapR
Keys Botzum has over 15 years of experience in large scale distributed system design. At MapR his primary responsibility is working with customers as a consultant, but he also teaches classes, contributes to documentation, and works with MapR engineering. Previously he was a Senior Technical Staff Member with IBM and a respected author of many articles on WebSphere Application Server as well as a book. He holds a Masters degree in Computer Science from Stanford University and a B.S. in Applied Mathematics/Computer Science from Carnegie Mellon University.
Protect your private data with ORC column encryption - Owen O'Malley
Fine-grained data protection at a column level in data lake environments has become a mandatory requirement to demonstrate compliance with multiple local and international regulations across many industries today. ORC is a self-describing type-aware columnar file format designed for Hadoop workloads that provides optimized streaming reads but with integrated support for finding required rows quickly.
Owen O’Malley dives into the progress the Apache community made for adding fine-grained column-level encryption natively into ORC format, which also provides capabilities to mask or redact data on write while protecting sensitive column metadata such as statistics to avoid information leakage. The column encryption capabilities will be fully compatible with Hadoop Key Management Server (KMS) and use the KMS to manage master keys, providing the additional flexibility to use and manage keys per column centrally.
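As a rough illustration of the column-level configuration involved, the sketch below issues a Hive-style DDL through Spark SQL using the table-property names from the Apache ORC column-encryption documentation; the table, columns, and the "pii" master key are assumptions for the example, and in practice the master key is managed by the Hadoop KMS configured for the cluster.

```python
from pyspark.sql import SparkSession

# Illustrative only: assumes a Hive metastore and a KMS that holds the 'pii' master key.
spark = (SparkSession.builder
         .appName("orc-column-encryption-demo")
         .enableHiveSupport()
         .getOrCreate())

# Encrypt the ssn and email columns with the 'pii' master key; readers without access to
# that key see a masked (here: nullified) view of ssn instead of the real values.
spark.sql("""
    CREATE TABLE customers (
        id BIGINT,
        name STRING,
        ssn STRING,
        email STRING
    )
    STORED AS ORC
    TBLPROPERTIES (
        'orc.encrypt' = 'pii:ssn,email',
        'orc.mask' = 'nullify:ssn'
    )
""")
```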
Fine Grain Access Control for Big Data: ORC Column Encryption - Owen O'Malley
Fine-grained data protection at a column level in data lake environments has become a mandatory requirement to demonstrate compliance with multiple local and international regulations across many industries today. ORC is a self-describing type-aware columnar file format designed for Hadoop workloads that provides optimized streaming reads, but with integrated support for finding required rows quickly. In this talk, we will outline the progress made in Apache community for adding fine-grained column level encryption natively into ORC format that will also provide capabilities to mask or redact data on write while protecting sensitive column metadata such as statistics to avoid information leakage. The column encryption capabilities will be fully compatible with Hadoop Key Management Server (KMS) and use the KMS to manage master keys providing the additional flexibility to use and manage keys per column centrally.
Fast Access to Your Data - Avro, JSON, ORC, and Parquet - Owen O'Malley
The landscape for storing your big data is quite complex, with several competing formats and different implementations of each format. Understanding your use of the data is critical for picking the format. Depending on your use case, the different formats perform very differently. Although you can use a hammer to drive a screw, it isn’t fast or easy to do so.
The use cases that we’ve examined are:
* reading all of the columns
* reading a few of the columns
* filtering using a filter predicate
While previous work has compared the size and speed from Hive, this presentation will present benchmarks from Spark including the new work that radically improves the performance of Spark on ORC. This presentation will also include tips and suggestions to optimize the performance of your application while reading and writing the data.
Finally, the value of having open source benchmarks that are available to all interested parties is hugely important and all of the code is available from Apache.
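A hedged PySpark sketch of the three read patterns examined, written so that ORC and Parquet can be swapped by changing only the format string; the file paths and column names are placeholders, not the benchmark's actual dataset.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("format-read-patterns").getOrCreate()

for fmt, path in [("orc", "/data/events_orc"), ("parquet", "/data/events_parquet")]:
    df = spark.read.format(fmt).load(path)

    # 1. Read all of the columns.
    df.count()

    # 2. Read a few of the columns; columnar formats skip the rest on disk.
    df.select("event_type", "ts").count()

    # 3. Filter with a predicate that can be pushed down to the reader, letting
    #    ORC/Parquet skip stripes or row groups using their min/max statistics.
    df.filter(col("event_type") == "purchase").select("ts").count()
```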
Hive tables are an integral part of the big data ecosystem, but the simple directory-based design that made them ubiquitous is increasingly problematic. Netflix uses tables backed by S3 that, like other object stores, don’t fit this directory-based model: listings are much slower, renames are not atomic, and results are eventually consistent. Even tables in HDFS are problematic at scale, and reliable query behavior requires readers to acquire locks and wait.
Owen O’Malley and Ryan Blue offer an overview of Iceberg, a new open source project that defines a table layout which addresses the challenges of current Hive tables, with properties specifically designed for cloud object stores, such as S3. Iceberg is an Apache-licensed open source project. It specifies the portable table format and standardizes many important features, including:
* All reads use snapshot isolation without locking.
* No directory listings are required for query planning.
* Files can be added, removed, or replaced atomically.
* Full schema evolution supports changes in the table over time.
* Partitioning evolution enables changes to the physical layout without breaking existing queries.
* Data files are stored as Avro, ORC, or Parquet.
* Support for Spark, Hive, and Presto.
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet - Owen O'Malley
The landscape for storing your big data is quite complex, with several competing formats and different implementations of each format. Understanding your use of the data is critical for picking the format. Depending on your use case, the different formats perform very differently. Although you can use a hammer to drive a screw, it isn’t fast or easy to do so.
The use cases that we’ve examined are:
* reading all of the columns
* reading a few of the columns
* filtering using a filter predicate
While previous work has compared the size and speed from Hive, this presentation will present benchmarks from Spark including the new work that radically improves the performance of Spark on ORC. This presentation will also include tips and suggestions to optimize the performance of your application while reading and writing the data.
To provide better security, ORC files are adding column encryption. Column encryption provides the ability to grant access to different columns within the same file. All of the encryption is handled transparently to the user.
File Format Benchmarks - Avro, JSON, ORC, & Parquet - Owen O'Malley
Hadoop Summit June 2016
The landscape for storing your big data is quite complex, with several competing formats and different implementations of each format. Understanding your use of the data is critical for picking the format. Depending on your use case, the different formats perform very differently. Although you can use a hammer to drive a screw, it isn’t fast or easy to do so. The use cases that we’ve examined are:
* reading all of the columns
* reading a few of the columns
* filtering using a filter predicate
* writing the data
Furthermore, it is important to benchmark on real data rather than synthetic data. We used the Github logs data available freely from http://githubarchive.org. We will make all of the benchmark code open source so that our experiments can be replicated.
Protecting Enterprise Data in Apache Hadoop - Owen O'Malley
From Hadoop Summit 2015, San Jose
From Apache BigData 2016, Vancouver
Hadoop has long had strong authentication via integration with Kerberos, authorization via User/Group/Other HDFS permissions, and auditing via the audit log. Recent developments in Hadoop have added HDFS file access control lists, pluggable encryption key provider APIs, HDFS snapshots, and HDFS encryption zones. These features combine to give important new data protection features that every company should be using to protect their data. This talk will cover what the new features are and when and how to use them in enterprise production environments. Upcoming features including columnar encryption in the ORC columnar format will also be covered.
Hadoop has long had strong authentication via integration with Kerberos, authorization via user/group/other HDFS permissions, and auditing via the audit log. Recent developments in Hadoop have added HDFS file access control lists, pluggable encryption key provider APIs, HDFS snapshots, and HDFS encryption zones. These features combine to give important new data protection features that every company should be using to protect their data. This talk will cover what the new features are and when and how to use them in enterprise production environments. Upcoming features including columnar encryption in the ORC file format will also be covered.
Structor - Automated Building of Virtual Hadoop Clusters - Owen O'Malley
Discusses vagrant scripts to setup and deploy a working Hadoop multiple node cluster with or without security. All source code is available on https://github.com/hortonworks/structor .
ORC File and Vectorization - Hadoop Summit 2013 - Owen O'Malley
Eric Hanson and I gave this presentation at Hadoop Summit 2013:
Hive’s RCFile has been the standard format for storing Hive data for the last 3 years. However, RCFile has limitations because it treats each column as a binary blob without semantics. Hive 0.11 added a new file format named Optimized Row Columnar (ORC) file that uses and retains the type information from the table definition. ORC uses type specific readers and writers that provide light weight compression techniques such as dictionary encoding, bit packing, delta encoding, and run length encoding — resulting in dramatically smaller files. Additionally, ORC can apply generic compression using zlib, LZO, or Snappy on top of the lightweight compression for even smaller files. However, storage savings are only part of the gain. ORC supports projection, which selects subsets of the columns for reading, so that queries reading only one column read only the required bytes. Furthermore, ORC files include light weight indexes that include the minimum and maximum values for each column in each set of 10,000 rows and the entire file. Using pushdown filters from Hive, the file reader can skip entire sets of rows that aren’t important for this query.
Columnar storage formats like ORC reduce I/O and storage use, but it’s just as important to reduce CPU usage. A technical breakthrough called vectorized query execution works nicely with column store formats to do this. Vectorized query execution has proven to give dramatic performance speedups, on the order of 10X to 100X, for structured data processing. We describe how we’re adding vectorized query execution to Hive, coupling it with ORC with a vectorized iterator.
Hive’s RCFile has been the standard format for storing Hive data for the last 3 years. However, RCFile has limitations because it treats each column as a binary blob without semantics. The upcoming Hive 0.11 will add a new file format named Optimized Row Columnar (ORC) file that uses and retains the type information from the table definition. ORC uses type specific readers and writers that provide light weight compression techniques such as dictionary encoding, bit packing, delta encoding, and run length encoding -- resulting in dramatically smaller files. Additionally, ORC can apply generic compression using zlib, LZO, or Snappy on top of the lightweight compression for even smaller files. However, storage savings are only part of the gain. ORC supports projection, which selects subsets of the columns for reading, so that queries reading only one column read only the required bytes. Furthermore, ORC files include light weight indexes that include the minimum and maximum values for each column in each set of 10,000 rows and the entire file. Using pushdown filters from Hive, the file reader can skip entire sets of rows that aren’t important for this query. Finally, ORC works together with the upcoming query vectorization work providing a high bandwidth reader/writer interface.
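A small PySpark sketch of the writer- and reader-side choices described above: pick a generic codec on top of ORC's lightweight encodings when writing, then rely on projection and predicate pushdown when reading back; the paths and column layout are illustrative only.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("orc-write-read").getOrCreate()

df = spark.range(100_000).selectExpr("id", "id % 7 AS bucket", "cast(id AS string) AS payload")

# Write ORC with a generic codec (none/snappy/zlib/lzo) layered over ORC's
# dictionary, run-length, and delta encodings.
df.write.mode("overwrite").option("compression", "zlib").orc("/tmp/orc_demo")

# Reading a single column touches only that column's streams, and the equality predicate
# can be pushed down so stripes whose min/max statistics exclude the value are skipped.
spark.read.orc("/tmp/orc_demo").filter(col("bucket") == 3).select("id").count()
```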
Andrew Ryan describes how Facebook operates Hadoop to provide access as a shared resource between groups.
More information and video at:
http://developer.yahoo.com/blogs/hadoop/posts/2011/02/hug-feb-2011-recap/
The next generation of Hadoop MapReduce
Arun C. Murthy presented the plans for the next generation of Apache Hadoop MapReduce. The MapReduce framework has hit a scalability limit around 4,000 machines. We are developing the next generation of MapReduce that factors the framework into a generic resource scheduler and a per-job, user-defined component that manages the application execution. Since downtime is more expensive at scale, high availability is built in from the beginning, as are security and multi-tenancy to support many users on the larger clusters. The new architecture will also increase innovation, agility, and hardware utilization.
More information and video available at:
http://developer.yahoo.com/blogs/hadoop/posts/2011/02/hug-feb-2011-recap/
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024 - Tobias Schneck
As AI technology pushes into IT, I was wondering, as an "infrastructure container Kubernetes guy", how this fancy AI technology gets managed from an infrastructure and operations point of view. Is it possible to apply our lovely cloud native principles as well? What benefits could both technologies bring to each other?
Let me take these questions and provide you with a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud or on-premise strategy we may need to apply it to our own infrastructure and get it to work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and what could be beneficial or limiting for your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I have already gotten working for real.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf - 91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
GraphRAG is All You need? LLM & Knowledge Graph - Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... - DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
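To illustrate the Python binding mentioned above, here is a minimal pypowsybl sketch that loads a bundled IEEE 14-bus test network and runs an AC load flow; it assumes the pypowsybl package is installed and uses only example data shipped with the library.

```python
import pypowsybl as pp

# Load one of the bundled example networks (IEEE 14-bus) rather than a real grid model.
network = pp.network.create_ieee14()

# Run an AC load flow and inspect the per-component convergence results.
results = pp.loadflow.run_ac(network)
for result in results:
    print(result.status)

# Network data is exposed as pandas DataFrames, e.g. bus voltages after the solve.
print(network.get_buses()[["v_mag", "v_angle"]].head())
```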
Epistemic Interaction - tuning interfaces to provide information for AI support - Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Elevating Tactical DDD Patterns Through Object Calisthenics - Dorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to market, combined with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their application supply chains and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. Fostering a culture of innovation, however, takes real work: vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don't know what they don't know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients' needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips and strategies for successful relationship building that leads to closing the deal.
DevOps and Testing slides at DASA ConnectKari Kakkonen
Slides by me and Rik Marselis from the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps looks like. We closed with a lively workshop in which participants explored different ways to think about quality and testing in different parts of the DevOps infinity loop.
3. Problem Statement
• The fundamental goal of adding Hadoop security is that Yahoo's data stored in HDFS must be secure from unauthorized access. Furthermore, it must do so without adding significant effort to operating or using the Grid. Based on that goal, there are a few implications.
  – All HDFS clients must be authenticated to ensure that the user is who they claim to be. That implies that all map/reduce users, including services such as Oozie, must also be authenticated, and that tasks must run with the privileges and identity of the submitting user.
  – Since DataNodes and TaskTrackers are entrusted with user data and credentials, they must authenticate themselves to ensure they are running as part of the Grid and are not trojan horses.
  – Kerberos will be the underlying authentication service so that users can be authenticated using their system credentials (a minimal login sketch follows this list).
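To make the Kerberos requirement concrete, here is a minimal, hedged sketch of how a Hadoop client process could log in from a keytab before talking to HDFS. The principal name and keytab path are placeholders, not values from the slides, and the configuration key shown assumes a Kerberos-enabled cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLoginSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Tell Hadoop's security layer to use Kerberos.
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        // Placeholder principal and keytab path; in practice these come from
        // the site's Kerberos setup.
        UserGroupInformation.loginUserFromKeytab(
                "alice@EXAMPLE.COM", "/etc/security/keytabs/alice.keytab");

        // Once logged in, ordinary HDFS calls carry the Kerberos identity.
        FileSystem fs = FileSystem.get(conf);
        System.out.println("exists(/user/alice): " + fs.exists(new Path("/user/alice")));
    }
}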
5. Security Threats in Hadoop
• User to Service Authentication
  – No User Authentication on NameNode or JobTracker
    • Client code supplies user and group names
  – No User Authorization on DataNode – Fixed in 0.21
    • Users can read/write any block
  – No User Authorization on JobTracker
    • Users can modify or kill other users' jobs
    • Users can modify the persistent state of the JobTracker
• Service to Service Authentication
  – No Authentication of DataNodes and TaskTrackers
    • Users can start fake DataNodes and TaskTrackers
• No Encryption on Wire or Disk
6. Solutions to Threats
• Add Kerberos-based authentication to the NameNode and JobTracker.
• Add delegation tokens to HDFS and support for them in MapReduce.
• Determine the user's group membership on the NameNode and JobTracker.
• Protect the MapReduce system directory from users.
• Add an authorization model to MapReduce so that only the submitting user can modify or kill a job.
• Add Backyard authentication to the Web UIs.
7. Out of Scope for 0.20.100
• Protecting against root on slave nodes:
  – Encryption of RPC messages
  – Encryption of the block transfer protocol
  – Encryption of MapReduce transient files
  – Encryption of HDFS block files
• Passing Kerberos tickets to MapReduce tasks for third-party Kerberized services.
8. HDFS Security
• Users authenticate via Kerberos.
• MapReduce jobs can obtain delegation tokens for later use (see the sketch after this list).
• When clients are reading or writing an HDFS file, the NameNode generates a block access token that will be verified by the DataNode.
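As a hedged illustration of the delegation-token bullet, the sketch below shows how a client that has already logged in via Kerberos could ask the NameNode for a delegation token. The renewer principal is a placeholder, and FileSystem#getDelegationToken is one form of the method the deck later describes as "HDFS adds a method to get delegation tokens"; the exact API differs across Hadoop releases.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.token.Token;

public class DelegationTokenFetchSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Ask the NameNode for a delegation token, naming the JobTracker
        // principal as the party allowed to renew it (placeholder value).
        Token<?> token = fs.getDelegationToken("mapred/jt.example.com@EXAMPLE.COM");

        // The job client would place this token in the job's credentials so that
        // tasks can talk to HDFS without holding Kerberos tickets.
        System.out.println("token kind=" + token.getKind() + " service=" + token.getService());
    }
}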
10. What does this *really* look like?
• Need a Kerberos ticket to work
  – kinit -l 7d oom@DS.CORP.YAHOO.COM
  – hadoop fs -ls
  – hadoop jar my.jar in-dir out-dir
• Works using the ticket cache!
  – Can display the ticket cache with klist.
12. Delegation Token
• Advantages over using Kerberos directly:
  – Don't trust the JobTracker with credentials
  – Avoid an authentication flood from MapReduce tasks
  – Renewable by a third party (i.e. the JobTracker)
  – Revocable when the job finishes
• tokenId = {owner prin, renewer prin, issueDate, maxDate}
• tokenAuthenticator = HMAC(masterKey, tokenId)
• Token = {tokenId, tokenAuthenticator}
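A minimal sketch of the authenticator construction above, assuming HMAC-SHA1 (the hash named later on the interfaces slide). The tokenId serialization and key bytes here are placeholders; real Hadoop tokens use a compact Writable encoding.

import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;

public class DelegationTokenAuthenticatorSketch {
    // tokenAuthenticator = HMAC-SHA1(masterKey, tokenId)
    static byte[] authenticator(byte[] masterKey, byte[] tokenId) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA1");
        mac.init(new SecretKeySpec(masterKey, "HmacSHA1"));
        return mac.doFinal(tokenId);
    }

    public static void main(String[] args) throws Exception {
        // Placeholder encoding of {owner prin, renewer prin, issueDate, maxDate}.
        byte[] tokenId = "alice@EXAMPLE.COM|mapred/jt@EXAMPLE.COM|1275000000|1275604800"
                .getBytes(StandardCharsets.UTF_8);
        byte[] masterKey = "namenode-master-key".getBytes(StandardCharsets.UTF_8);

        byte[] auth = authenticator(masterKey, tokenId);
        System.out.println("authenticator bytes: " + auth.length); // 20 for SHA-1
    }
}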
13. Block Access Token
• Only the NameNode knows the set of users allowed to access a specific block, so the NameNode gives authorized clients a block access token.
• Capabilities include read, write, copy, or replace.
• The NameNode and DataNodes share a dynamically rolled secret key to secure the tokens.
• tokenId = {expiration, keygen, owner, block, access}
• tokenAuthenticator = HMAC(blockKey, tokenId)
• token = {tokenId, tokenAuthenticator}
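Because the DataNode shares the rolled block key with the NameNode, it can verify a presented token simply by recomputing the HMAC. A hedged sketch of that check follows; the method and parameter names are illustrative, not Hadoop's actual classes.

import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.security.MessageDigest;

public class BlockTokenVerifySketch {
    // DataNode-side check: recompute HMAC-SHA1(blockKey, tokenId) and compare it
    // with the authenticator the client presented.
    static boolean verify(byte[] blockKey, byte[] tokenId, byte[] presentedAuthenticator)
            throws Exception {
        Mac mac = Mac.getInstance("HmacSHA1");
        mac.init(new SecretKeySpec(blockKey, "HmacSHA1"));
        byte[] expected = mac.doFinal(tokenId);
        // Constant-time comparison avoids leaking how many leading bytes matched.
        return MessageDigest.isEqual(expected, presentedAuthenticator);
    }
}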
14. MapReduce Security
• Require Kerberos authentication from the client.
• Secure the information about pending and running jobs
  – Store the job configuration and input splits in HDFS under ~user/.staging/$jobid
  – Store the job's location and secrets in a private directory
• The JobTracker creates a random job token. It is used for:
  – Connecting to the TaskTracker's RPC
  – Authorizing HTTP GETs for the shuffle
• HMAC(job token, URL) is sent from reduce tasks to the TaskTracker (see the sketch after this list)
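A hedged sketch of the shuffle authorization step: the reduce task proves knowledge of the job token by sending an HMAC of the shuffle URL alongside its HTTP GET, and the TaskTracker recomputes it with the same token. The URL, encoding, and header handling here are illustrative placeholders, not the exact wire format.

import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class ShuffleUrlHashSketch {
    // hash = Base64(HMAC-SHA1(jobToken, shuffleUrl)); sent with the shuffle request.
    static String urlHash(byte[] jobToken, String shuffleUrl) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA1");
        mac.init(new SecretKeySpec(jobToken, "HmacSHA1"));
        return Base64.getEncoder()
                .encodeToString(mac.doFinal(shuffleUrl.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        byte[] jobToken = "random-per-job-secret".getBytes(StandardCharsets.UTF_8);
        String url = "http://tt1.example.com:50060/mapOutput?job=job_0001&map=m_000003&reduce=2";
        System.out.println("UrlHash: " + urlHash(jobToken, url));
    }
}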
16. MapReduce Task Security
• Users have separate task directories with permissions set to 700.
• Distributed cache is now divided based on the source's visibility
  – Global – shared with other users
  – Private – protected from other users
17. Web UI
• MapReduce makes heavy use of the Web UI for displaying the state of the cluster and running jobs.
• HDFS also has a web browsing interface.
• Use Backyard to authenticate Web UI users.
• Only allow the submitting user of a job to view the stdout and stderr of the job's tasks.
• The HDFS web browser checks the user's authorization.
18. Oozie
• Client authenticates to Oozie
  – Custom auth for Yahoo!
• Oozie authenticates to HDFS and MapReduce as the "oozie" principal.
• "oozie" is configured as a super-user for HDFS and MapReduce and may act as other users (see the proxy-user sketch after this list).
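The "act as other users" behaviour corresponds to what the deck later calls the RPC doAs method. As a hedged sketch, a service holding the oozie credentials could impersonate an end user through Hadoop's UserGroupInformation proxy-user API roughly as follows; user names and paths are placeholders, and the cluster must also be configured to trust the proxying principal.

import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class OozieProxyUserSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // The service ("oozie") has already authenticated, e.g. via a keytab login.
        UserGroupInformation oozieUser = UserGroupInformation.getLoginUser();

        // Build a proxy identity for the end user and run the HDFS call as them,
        // so files are created and authorized under the real user's identity.
        UserGroupInformation proxied =
                UserGroupInformation.createProxyUser("alice", oozieUser);

        proxied.doAs((PrivilegedExceptionAction<Void>) () -> {
            FileSystem fs = FileSystem.get(conf);
            fs.mkdirs(new Path("/user/alice/oozie-output"));
            return null;
        });
    }
}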
19. Proxy Services Trust Model
• Requires trust that the service (e.g. Oozie) principal is secure.
• Explored and rejected:
  – Having user headless principals stored on the Oozie machine ("x/oozie" for user "x")
  – Passing the user's headless principal keytab to Oozie
  – Generalizing delegation tokens to have token-granting tokens
20. Protocols
• RPC
  – Change RPC to use SASL and either:
    • Kerberos authentication (GSSAPI)
    • Tokens (DIGEST-MD5) – see the sketch after this list
  – User's Kerberos tickets obtained at login are used automatically.
  – Changes the RPC format.
  – Can easily add encryption later.
• Block transfer protocol
  – Block access tokens in the data stream
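A hedged sketch of the token path through the standard Java SASL API: the client proves possession of a token by using the encoded token identifier as the SASL user name and the token authenticator as the password, which is the general idea behind "Tokens (DIGEST-MD5)" above. The protocol and server names, the encoding, and the callback wiring are illustrative, not Hadoop's exact implementation.

import java.util.Base64;
import javax.security.auth.callback.Callback;
import javax.security.auth.callback.CallbackHandler;
import javax.security.auth.callback.NameCallback;
import javax.security.auth.callback.PasswordCallback;
import javax.security.sasl.RealmCallback;
import javax.security.sasl.Sasl;
import javax.security.sasl.SaslClient;

public class TokenSaslClientSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder token material: identifier and HMAC authenticator bytes.
        String user = Base64.getEncoder().encodeToString("tokenId-bytes".getBytes());
        char[] password = Base64.getEncoder()
                .encodeToString("tokenAuthenticator-bytes".getBytes()).toCharArray();

        CallbackHandler handler = (Callback[] callbacks) -> {
            for (Callback cb : callbacks) {
                if (cb instanceof NameCallback) {
                    ((NameCallback) cb).setName(user);
                } else if (cb instanceof PasswordCallback) {
                    ((PasswordCallback) cb).setPassword(password);
                } else if (cb instanceof RealmCallback) {
                    ((RealmCallback) cb).setText(((RealmCallback) cb).getDefaultText());
                }
            }
        };

        // DIGEST-MD5 challenge/response; in the real protocol the challenge
        // arrives over the RPC connection from the NameNode or JobTracker.
        SaslClient client = Sasl.createSaslClient(
                new String[] {"DIGEST-MD5"}, null, "hdfs", "default", null, handler);
        System.out.println("mechanism = " + client.getMechanismName()
                + ", hasInitialResponse = " + client.hasInitialResponse());
    }
}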
21. Protocols
• HTTP
– User/Browser facing
• Yahoo – Custom Authentication
• External – SPNEGO or Kerberos login module
– Web Services
• HFTP – Hadoop File Transfer Protocol
• Others later
• SPNEGO or Delegation Token via RPC
– Shuffle
• Use HMAC of URL hashed with Job Token
22. Summary
• RPC
– Kerberos
• Application to NameNode, JobTracker
• DataNode to NameNode
• TaskTracker to JobTracker
– Digest-MD5
• MapReduce task to NameNode, TaskTracker
• Block Access Token
• Backyard
– User to Web UI
23. [Diagram: authentication mechanism per connection. Labels include: the NameNode is reached via RPC (Kerberos for the user's initial access, the Secondary NameNode and the Balancer; DIGEST-MD5 delegation tokens for tasks accessing HDFS as the user) and via HTTP (Backyard for the browser, Secondary NameNode and fsck; HFTP with HTTP-DIGEST delegation tokens for distcp tasks accessing as the user); DataNodes are reached via HTTP/HFTP (user, DataNode, Balancer, task) and via the data-transfer socket with block access tokens; the browser forwards delegation tokens.]
26. Interfaces (and their scope and stability)
• Imported Interfaces
  – SASL – Standard for supporting token and Kerberos authentication
  – GSSAPI – Kerberos part of SASL authentication
  – HMAC-SHA1 – Shared secret authentication
  – JAAS – Java API for supporting authentication
  – SPNEGO – Use Kerberos tickets over HTTP
• Exported Interfaces – Both Limited Private
  – HDFS adds a method to get delegation tokens
  – RPC adds a doAs method
• Major Internal (Inter-system) Interfaces
  – MapReduce Shuffle uses HMAC-SHA1
  – RPC uses Kerberos and DIGEST-MD5
27. Pluggability
• Pluggability in Hadoop supports different environments
• HTTP browser user authentication
– Yahoo – Backyard
– External – SPNEGO or Kerberos login module
• RPC transport
– SASL supports DIGEST-MD5, Kerberos, and others
• Acquiring credentials
– JAAS supports Kerberos, and others
28. Performance
• The authentication should not introduce substantial performance penalties.
• The delegation token design avoids an authentication flood by MapReduce tasks.
• The overhead is required to be less than 3% on GridMix.
29. Reliability and Availability
• The Kerberos KDC cannot be a single point of failure.
  – Kerberos clients automatically fail over to secondary KDCs.
  – Secondary KDCs can be synced automatically from the primary, since the data rarely changes.
• The cluster must remain stable when Kerberos fails.
  – The slaves (TaskTrackers and DataNodes) lose their ability to reconnect to the master only when their RPC socket closes, their service ticket has expired, and both the primary and secondary KDCs have failed.
  – Decided not to use special tokens to handle this case.
• Once a MapReduce job is submitted, the KDC is not required for the job to continue running.
30. Operations and Monitoring
• The number of Kerberos authorizations will be logged on the NameNode and JobTracker.
• Authorization failures will be logged.
• Authentication failures will be logged.
• The authorization logs will use a separate log4j logger, so they can be directed to a separate file.