Hadoop Security: Overview


Cloudera Software Engineer, Aaron Myers, presented an overview of Apache Hadoop security at the Los Angeles Hadoop User Group.


  1. Private Property: No Trespassing
     Hadoop Security Explained
     Aaron T. Myers | atm@cloudera.com | @atm
  2. Who am I?
     • Aaron T. Myers – Software Engineer, Cloudera
     • Hadoop HDFS and Common committer
     • Master's thesis on security sandboxing in the Linux kernel
     • Primarily works on the Core Platform Team
  3. Outline
     • Hadoop Security Overview
       • Hadoop Security pre-CDH3
       • Hadoop Security with CDH3
     • Details of Deploying Secure Hadoop
     • Summary
  4. Hadoop Security: Overview
  5. Why do we care about security?
     • SecureCommerceWebSite, Inc. has a product with both paid ads and search
     • The "Payment Fraud" team needs logs of all credit card payments
     • The "Search Quality" team needs all search logs and click history
     • The "Ads Fraud" team needs access to both search logs and payment info
       • So we can't segregate these datasets onto different clusters
     • If the teams can share a cluster, we also get better utilization!
  6. Security pre-CDH3: User Authentication
     • Authentication is by vigorous assertion
     • Trivial to impersonate another user:
       • Just set the property "hadoop.job.ugi" when running a job or command
     • Group resolution is done client-side
  7. Security pre-CDH3: Server Authentication
     • None
  8. Security pre-CDH3: HDFS
     • Unix-like file permissions were introduced in Hadoop 0.16.1
     • Provides standard user/group/other r/w/x bits
     • Protects well-meaning users from accidents
     • Does nothing to prevent malicious users from causing harm (weak authentication)
  9. Security pre-CDH3: Job Control
     • ACLs per job queue for job submission and killing
     • No ACLs for viewing counters or logs
     • Does nothing to prevent malicious users from causing harm (weak authentication)
  10. Security pre-CDH3: Tasks
      • Individual tasks all run as the same user
        • Whoever the TaskTracker (TT) runs as (usually "hadoop")
      • Tasks are not isolated from each other
        • Tasks that read/write local storage can interfere with one another
        • Malicious tasks can kill each other
      • Hadoop is designed to execute arbitrary code
  11. Security pre-CDH3: Web Interfaces
      • None
  12. Security with CDH3: User Authentication
      • Authentication is secured by Kerberos v5
        • RPC connections secured with the SASL "GSSAPI" mechanism
        • Provides proven, strong authentication and single sign-on
      • Hadoop servers can ensure that users are who they say they are
      • Group resolution is done on the server side
  13. Security with CDH3: Server Authentication
      • Kerberos authentication is bidirectional
      • Users can be sure they are communicating with the Hadoop server they think they are
  14. Security with CDH3: HDFS
      • Same general permissions model
        • Added a sticky bit for directories (e.g. /tmp)
      • But a user can no longer trivially impersonate other users (strong authentication)
  15. Security with CDH3: Job Control
      • A job now has its own ACLs, including a view ACL
      • A job can now specify who can view its logs, counters, and configuration, and who can modify (kill) it
      • The JobTracker (JT) enforces these ACLs (strong authentication)
  16. Security with CDH3: Tasks
      • Tasks now run as the user who launched the job
        • Probably the most complex part of Hadoop's security implementation
      • Ensures isolation of tasks that run on the same TaskTracker
        • Local file permissions enforced
        • Local system permissions enforced (e.g. signals)
      • Can take advantage of per-user system limits
        • e.g. Linux ulimits
  17. Security with CDH3: Web Interfaces
      • Out-of-the-box Kerberized SSL support
      • Pluggable servlet filters (more on this later)
  18. Security with CDH3: Threat Model
      • The Hadoop security system assumes that:
        • Users do not have root access to cluster machines
        • Users do not have root access to shared user machines (e.g. a bastion box)
        • Users cannot read or inject packets on the network
  19. Thanks, Yahoo!
      Yahoo! did the vast majority of the core Hadoop security work
  20. Hadoop Security: Deployment Details
  21. Requirements: Kerberos Infrastructure
      • A Kerberos realm with a KDC
        • e.g. MIT Kerberos 5 in RHEL, or MS Active Directory
      • Kerberos principals (SPNs) for every daemon
        • hdfs/hostname@REALM for the DN, NN, and 2NN
        • mapred/hostname@REALM for the TT and JT
        • host/hostname@REALM for the web UIs
      • Keytabs for the service principals distributed to the correct hosts
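With an MIT KDC, the per-host principals and keytabs above can be created with kadmin. A minimal sketch for one DataNode host; the hostname and realm are hypothetical, and the keytab must end up on that host readable only by the daemon user:

```shell
# Create per-host service principals with random keys (hypothetical host/realm):
kadmin.local -q "addprinc -randkey hdfs/dn01.cluster.foocorp.com@CLUSTER.FOOCORP.COM"
kadmin.local -q "addprinc -randkey mapred/dn01.cluster.foocorp.com@CLUSTER.FOOCORP.COM"
kadmin.local -q "addprinc -randkey host/dn01.cluster.foocorp.com@CLUSTER.FOOCORP.COM"

# Export the HDFS daemon's keys to a keytab, then copy it to dn01:
kadmin.local -q "xst -k hdfs.keytab hdfs/dn01.cluster.foocorp.com host/dn01.cluster.foocorp.com"
```

This must be repeated (or scripted) for every host in the cluster, which is why keytab distribution is called out as its own requirement.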
  22. Configuring Daemons for Security
      • Most daemons have two configs:
        • Keytab location (e.g. dfs.datanode.keytab.file)
        • Kerberos principal (e.g. dfs.datanode.kerberos.principal)
      • The principal can use the special token _HOST, which is substituted with the daemon's hostname (e.g. hdfs/_HOST@MYREALM)
      • Several other configs are needed to enable security in the first place
        • See example-confs/conf.secure in CDH3
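For the DataNode, the two configs above might look like this in hdfs-site.xml. The property names are the ones named on the slide; the keytab path and realm are hypothetical, and _HOST lets one file be pushed to every host unchanged:

```xml
<!-- hdfs-site.xml fragment (illustrative values) -->
<property>
  <name>dfs.datanode.keytab.file</name>
  <value>/etc/hadoop/conf/hdfs.keytab</value>
</property>
<property>
  <name>dfs.datanode.kerberos.principal</name>
  <value>hdfs/_HOST@CLUSTER.FOOCORP.COM</value>
</property>
```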
  23. Setting Up Users
      • Each user must have a Kerberos principal
      • You may want some shared accounts:
        • The sharedaccount/alice and sharedaccount/bob principals both act as sharedaccount on HDFS - you can use this!
        • hdfs/alice is also useful for letting alice act as a superuser
      • Users running MR jobs must also have unix accounts on each of the slaves
      • A centralized user database (e.g. LDAP) is a practical necessity
  24. Installing Secure Hadoop
      • The MapReduce and HDFS services should run as separate users (e.g. hdfs and mapred)
      • A new setuid task-controller executable allows tasks to run as the submitting user
      • New JNI code in libhadoop.so plugs subtle security holes
      • Install CDH3 with the hadoop-0.20-sbin and hadoop-0.20-native packages to get all of this set up
  25. Securing Higher-Level Services
      • Many "middle tier" applications need to act on behalf of their clients when interacting with Hadoop
        • e.g. Oozie, Hive Server, Hue/Beeswax
      • The "proxy user" feature provides secure impersonation (think sudo):
        • hadoop.proxyuser.oozie.hosts: the IPs from which "oozie" may act as an impersonator
        • hadoop.proxyuser.oozie.groups: the groups whose users "oozie" may impersonate
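A sketch of the two proxy-user properties in core-site.xml, using the "oozie" example from the slide. The IP and group name are hypothetical; the intent is to bound impersonation to one host and one group:

```xml
<!-- core-site.xml fragment: let the "oozie" service user impersonate
     members of the (hypothetical) "oozie-users" group, only from the
     (hypothetical) Oozie server at 10.0.0.5 -->
<property>
  <name>hadoop.proxyuser.oozie.hosts</name>
  <value>10.0.0.5</value>
</property>
<property>
  <name>hadoop.proxyuser.oozie.groups</name>
  <value>oozie-users</value>
</property>
```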
  26. Customizing Security
      • Current plug-in points:
        • hadoop.http.filter.initializers: configure a custom ServletFilter to integrate with an existing enterprise web SSO
        • hadoop.security.group.mapping: map a Kerberos principal (alice@FOOCORP.COM) to a set of groups (users, engstaff, searchquality, adsdata)
        • hadoop.security.auth_to_local: regex mappings from Kerberos principals to local usernames
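As a sketch of auth_to_local, the following fragment strips the realm so a principal like alice@FOOCORP.COM maps to the local user "alice". The realm names are the hypothetical ones used elsewhere in this deck; the RULE syntax here follows Hadoop's auth_to_local format, where [1:$1@$0] matches one-component principals:

```xml
<!-- core-site.xml fragment: map principals from both realms to short
     local usernames by stripping the @REALM suffix -->
<property>
  <name>hadoop.security.auth_to_local</name>
  <value>
    RULE:[1:$1@$0](.*@FOOCORP\.COM)s/@.*//
    RULE:[1:$1@$0](.*@CLUSTER\.FOOCORP\.COM)s/@.*//
    DEFAULT
  </value>
</property>
```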
  27. Deployment Gotchas
      • MIT Kerberos 1.8.1 (in Ubuntu, RHEL 5.6+) is incompatible with the Java Krb5 implementation
        • Run "kinit -R" after kinit to work around this
      • Enable allow_weak_crypto in /etc/krb5.conf; this is necessary for Kerberized SSL
      • You must deploy the "unlimited strength security policy" JARs in JAVA_HOME/jre/lib/security
      • Lifesaver: HADOOP_OPTS="-Dsun.security.krb5.debug=true" hadoop ...
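The first and last gotchas above look like this in practice (principal and command are illustrative):

```shell
# MIT krb5 1.8.1 workaround: renew the fresh TGT so Java's Krb5
# implementation can read it from the ticket cache
kinit alice@CLUSTER.FOOCORP.COM
kinit -R

# When authentication fails mysteriously, run the client with
# verbose Kerberos tracing enabled:
HADOOP_OPTS="-Dsun.security.krb5.debug=true" hadoop fs -ls /
```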
  28. Best Practices for AD Integration
      • Run an MIT Kerberos realm inside the cluster:
        • CLUSTER.FOOCORP.COM
      • Keep the existing Active Directory domain:
        • FOOCORP.COM, or maybe AD.FOOCORP.COM
      • Set up a one-way cross-realm trust
        • The cluster realm must trust the corporate AD realm
        • See "Step by Step Guide to Kerberos 5 Interoperability" in the Windows Server docs
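On the MIT side, a one-way trust of this shape is typically established by creating a cross-realm krbtgt principal; the same principal, with a matching password and encryption types, must also exist in AD. A sketch with the hypothetical realm names from the slide:

```shell
# Cluster realm (CLUSTER.FOOCORP.COM) trusts the AD realm (FOOCORP.COM):
# AD users present cross-realm tickets issued under this shared principal.
kadmin.local -q "addprinc krbtgt/CLUSTER.FOOCORP.COM@FOOCORP.COM"
```

The AD-side half of the trust is what the referenced Windows Server interoperability guide walks through.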
  29. Hadoop Security: Summary
  30. What Hadoop Security Is
      • Strong authentication
        • Malicious impersonation is now impossible
      • Better authorization
        • More control over who can view and control jobs
      • Isolation between running tasks
      • An ongoing development priority
  31. What Hadoop Security Is Not
      • Encryption on the wire
      • Encryption on disk
      • Protection against DoS attacks
      • Enabled by default
  32. Security Beyond Core Hadoop
      • Comprehensive documentation and best practices
        • https://ccp.cloudera.com/display/CDHDOC/CDH3+Security+Guide
      • All components of CDH3 can interact with a secure Hadoop cluster
      • Hive 0.7 (included in CDH3) added a rich set of access controls
      • Deployment is much easier if you use Cloudera Enterprise
  33. Security Roadmap
      • Pluggable "edge authentication" (e.g. PKI, SAML)
      • More authorization features across CDH components
        • e.g. HBase access controls
      • Data encryption support
  34. Questions?
      Aaron T. Myers
      atm@cloudera.com
      @atm