Private Property: No Trespassing
   Hadoop Security Explained




        Aaron T. Myers
      atm@cloudera.com
            @atm
Who am I?




• Aaron T. Myers – Software Engineer, Cloudera
• Committer on Hadoop HDFS and Common
• Master's thesis on security sandboxing in the Linux kernel
• Primarily works on the Core Platform Team
Outline

• Hadoop Security Overview
 • Hadoop Security pre CDH3
 • Hadoop Security with CDH3
• Details of Deploying Secure Hadoop
• Summary
Hadoop Security: Overview
Why do we care about security?
• SecureCommerceWebSite, Inc. has a product with both
   paid ads and search

• “Payment Fraud” team needs logs of all credit card
   payments

• “Search Quality” team needs all search logs and click
   history

• “Ads Fraud” team needs to access both search logs and
   payment info
  •   So we can't segregate these datasets onto different clusters

• If they can share a cluster, we also get better utilization!
Security pre CDH3: User Authentication



• Authentication is by vigorous assertion
• Trivial to impersonate other user:
 • Just set property “hadoop.job.ugi” when
    running job or command
• Group resolution is done client side
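
As an illustration, impersonation under the old scheme needed nothing more than a config override on the command line (a sketch; exact flag placement follows the Hadoop 0.20 generic options parser):

```shell
# Pre-CDH3: identity comes from an unauthenticated client-side property.
# Anyone could claim to be the HDFS superuser just by setting it:
hadoop fs -D hadoop.job.ugi=hdfs,supergroup -ls /user/alice
```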
Security pre CDH3: Server Authentication




                None
Security pre CDH3: HDFS


• Unix-like file permissions were introduced in
  Hadoop 0.16.1
• Provides standard user/group/other r/w/x
• Protects well-meaning users from accidents
• Does nothing to prevent malicious users from
  causing harm (weak authentication)
Security pre CDH3: Job Control



• ACLs per job queue for job submission / killing
• No ACLs for viewing counters / logs
• Does nothing to prevent malicious users from
  causing harm (weak authentication)
Security pre CDH3: Tasks


• Individual tasks all run as the same user
 • Whoever the TT is running as (usually 'hadoop')
• Tasks not isolated from each other
 • Tasks which read/write from local storage can
    interfere with each other
 • Malicious tasks can kill each other
• Hadoop is designed to execute arbitrary code
Security pre CDH3: Web interfaces




             None
Security with CDH3: User Authentication

• Authentication is secured by Kerberos v5
 • RPC connections secured with SASL “GSSAPI”
    mechanism
  • Provides proven, strong authentication and
    single-sign-on
• Hadoop servers can ensure that users are who
  they say they are
• Group resolution is done on the server side
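
The day-to-day workflow looks like any other Kerberized service (a sketch; the realm name is an example):

```shell
# With CDH3, a user first obtains a Kerberos ticket, then runs
# Hadoop commands as usual; RPCs authenticate via SASL/GSSAPI.
kinit alice@CLUSTER.FOOCORP.COM   # prompts for alice's password
klist                             # shows the ticket-granting ticket
hadoop fs -ls /user/alice         # now authenticated as alice
```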
Security with CDH3: Server Authentication




• Kerberos authentication is bi-directional
• Users can be sure that they are communicating
  with the Hadoop server they think they are
Security with CDH3: HDFS




• Same general permissions model
 • Added sticky bit for directories (e.g. /tmp)
• But, a user can no longer trivially impersonate
  other users (strong authentication)
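
Setting the sticky bit on a shared directory works just as it does for a local /tmp (a sketch):

```shell
# With the sticky bit set, users can only delete or rename
# files they own, even in a world-writable directory:
hadoop fs -chmod 1777 /tmp
```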
Security with CDH3: Job Control



• A job now has its own ACLs, including a view ACL
• Job can now specify who can view logs, counters,
  configuration, and who can modify (kill) it
• JT enforces these ACLs (strong authentication)
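
A sketch of the per-job ACL properties (names as in the 0.20 security branch that CDH3 is based on; the ACL format is "users groups", and the user/group names here are examples):

```xml
<property>
  <name>mapreduce.job.acl-view-job</name>
  <value>alice,bob adsfraud</value>  <!-- who may view logs/counters -->
</property>
<property>
  <name>mapreduce.job.acl-modify-job</name>
  <value>alice</value>               <!-- who may kill/modify the job -->
</property>
```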
Security with CDH3: Tasks

• Tasks now run as the user who launched the job
 • Probably the most complex part of Hadoop's
    security implementation
• Ensures isolation of tasks which run on the same TT
 • Local file permissions enforced
 • Local system permissions enforced (e.g. signals)
• Can take advantage of per-user system limits
 • e.g. Linux ulimits
Security with CDH3: Web Interfaces



• Out-of-the-box Kerberized SSL support
• Pluggable servlet filters (more on this later)
Security with CDH3: Threat Model


• The Hadoop security system assumes that:
 • Users do not have root access to cluster
    machines
 • Users do not have root access to shared user
    machines (e.g. bastion box)
 • Users cannot read or inject packets on the
    network
Thanks, Yahoo!




Yahoo! did the vast majority of the
   core Hadoop security work
Hadoop Security:
Deployment Details
Requirements: Kerberos Infrastructure

• Kerberos realm served by a KDC
 • e.g. MIT Kerberos 5 (as in RHEL), or MS Active Directory
• Kerberos principals (SPNs) for every daemon
 • hdfs/hostname@REALM for DN, NN, 2NN
 • mapred/hostname@REALM for TT and JT
 • host/hostname@REALM for web UIs
• Keytabs for service principals distributed to
  correct hosts
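
With MIT Kerberos, creating a service principal and its keytab looks roughly like this (hostnames and realm are examples):

```shell
# Create a DataNode principal with a random key, then export it
# to a keytab file for distribution to that host:
kadmin.local -q "addprinc -randkey hdfs/dn01.cluster.foocorp.com@CLUSTER.FOOCORP.COM"
kadmin.local -q "xst -k hdfs.keytab hdfs/dn01.cluster.foocorp.com@CLUSTER.FOOCORP.COM"
# Copy hdfs.keytab to dn01 and restrict it to the daemon user:
#   chown hdfs:hadoop hdfs.keytab && chmod 400 hdfs.keytab
```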
Configuring daemons for security

• Most daemons have two configs:
 • Keytab location (e.g. dfs.datanode.keytab.file)
 • Kerberos principal (e.g. dfs.datanode.kerberos.principal)
• The principal can use the special token '_HOST', which is
  substituted with the daemon's hostname (e.g. 'hdfs/_HOST@MYREALM')

• Several other configs to enable security in the first place
 • See example-confs/conf.secure in CDH3
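
A minimal hdfs-site.xml sketch for a secure DataNode (the keytab path is an example; see example-confs/conf.secure for the complete set of properties):

```xml
<property>
  <name>dfs.datanode.keytab.file</name>
  <value>/etc/hadoop/conf/hdfs.keytab</value>
</property>
<property>
  <name>dfs.datanode.kerberos.principal</name>
  <!-- _HOST expands to the local hostname at startup -->
  <value>hdfs/_HOST@MYREALM</value>
</property>
```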
Setting up users
• Each user must have a Kerberos principal
• May want some shared accounts:
 • sharedaccount/alice and sharedaccount/bob
    principals both act as sharedaccount on HDFS - you
    can use this!

  • hdfs/alice is also useful for alice to act as a superuser
• Users running MR jobs must also have Unix accounts on
  each of the slaves

• A centralized user database (e.g. LDAP) is a practical
  necessity
Installing Secure Hadoop

• MapReduce and HDFS services should run as
  separate users (e.g. 'hdfs' and 'mapred')
• New task-controller setuid executable allows
  tasks to run as the submitting user
• New JNI code in libhadoop.so to plug subtle
  security holes
• Install CDH3 with the hadoop-0.20-sbin and
  hadoop-0.20-native packages to get this all set up
Securing higher-level services
• Many “middle tier” applications need to act on
  behalf of their clients when interacting with
  Hadoop
  • e.g: Oozie, Hive Server, Hue/Beeswax
• “Proxy User” feature provides secure
  impersonation (think sudo).
  • hadoop.proxyuser.oozie.hosts - IPs where
    “oozie” may act as an impersonator
  • hadoop.proxyuser.oozie.groups - groups whose
    users “oozie” may impersonate
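
A sketch of the corresponding core-site.xml entries (host IPs and group name are examples):

```xml
<!-- Allow the "oozie" principal to impersonate members of the
     "staff" group, but only from these two trusted hosts: -->
<property>
  <name>hadoop.proxyuser.oozie.hosts</name>
  <value>10.0.0.5,10.0.0.6</value>
</property>
<property>
  <name>hadoop.proxyuser.oozie.groups</name>
  <value>staff</value>
</property>
```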
Customizing Security

• Current plug-in points:
 • hadoop.http.filter.initializers - may configure a
    custom ServletFilter to integrate with existing
    enterprise web SSO
  • hadoop.security.group.mapping - map a
    kerberos principal (alice@FOOCORP.COM) to a
    set of groups
    (users,engstaff,searchquality,adsdata)
  • hadoop.security.auth_to_local - regex
    mappings of Kerberos principals to usernames
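
For example, a rule that strips the realm from principals in a trusted realm might look like this (realm name is an example; verify the RULE syntax against your Hadoop docs):

```xml
<property>
  <name>hadoop.security.auth_to_local</name>
  <value>
    RULE:[1:$1@$0](.*@FOOCORP\.COM)s/@.*//
    DEFAULT
  </value>
  <!-- alice@FOOCORP.COM -> local username "alice" -->
</property>
```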
Deployment Gotchas

• MIT Kerberos 1.8.1 (in Ubuntu, RHEL 5.6+)
  incompatible with Java Krb5 implementation
  • Run “kinit -R” after kinit to work around
• Enable allow_weak_crypto in /etc/krb5.conf -
  necessary for kerberized SSL
• Must deploy the JCE “unlimited strength” policy JARs in
  JAVA_HOME/jre/lib/security
• Lifesaver: HADOOP_OPTS="-Dsun.security.krb5.debug=true" hadoop ...
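
The Kerberos gotchas above are addressed in /etc/krb5.conf; a minimal sketch (realm name is an example):

```
[libdefaults]
    default_realm = CLUSTER.FOOCORP.COM
    # Required for kerberized SSL with the JDK's default crypto:
    allow_weak_crypto = true
```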
Best Practices for AD Integration

• MIT Kerberos realm inside cluster:
 • CLUSTER.FOOCORP.COM
• Existing Active Directory domain:
 • FOOCORP.COM or maybe AD.FOOCORP.COM
• Set up one-way cross-realm trust
 • Cluster realm must trust corporate AD realm
 • See “Step by Step Guide to Kerberos 5
    Interoperability” in Windows Server docs
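
On the MIT side, the trust is established by creating a cross-realm ticket-granting principal (a sketch; the same principal, with an identical password and encryption type, must also be created on the AD side per the Microsoft guide):

```shell
# One-way trust: the cluster realm trusts the corporate AD realm,
# so AD users can reach cluster services (but not vice versa).
kadmin.local -q "addprinc krbtgt/CLUSTER.FOOCORP.COM@FOOCORP.COM"
```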
Hadoop Security:
   Summary
What Hadoop Security Is


• Strong authentication
 • Trivial impersonation of other users now impossible
• Better authorization
 • More control over who can view/control jobs
• Ensure isolation between running tasks
• An ongoing development priority
What Hadoop Security Is Not



• Encryption on the wire
• Encryption on disk
• Protection against DoS attacks
• Enabled by default
Security Beyond Core Hadoop

• Comprehensive documentation and best
  practices
  •   https://ccp.cloudera.com/display/CDHDOC/CDH3+Security+Guide

• All components of CDH3 are capable of
  interacting with a secure Hadoop cluster
• Hive 0.7 (included in CDH3) added a rich set of
  access controls
• Much easier deployment if you use Cloudera
  Enterprise
Security Roadmap


• Pluggable “edge authentication” (e.g. PKI, SAML)
• More authorization features across CDH
  components
 • e.g. HBase access controls
• Data encryption support
Questions?



  Aaron T. Myers
atm@cloudera.com
      @atm
