Hadoop Security
Hadoop Summit 2010
Owen O’Malley [email_address]
Yahoo’s Hadoop Team

Problem
- Yahoo! has more yahoos than clusters.
  - Hundreds of yahoos use Hadoop each month.
  - 38,000 computers in ~20 Hadoop clusters.
- Sharing requires isolation or trust.
  - Different users need different data.
  - Not all yahoos should have access to sensitive data, such as financial data and PII.
- In Hadoop 0.20, it is easy to impersonate another user.
  - The workaround was to segregate different data on separate clusters.

Solution
- Prevent unauthorized HDFS access.
  - All HDFS clients must be authenticated, including tasks running as part of MapReduce jobs and jobs submitted through Oozie.
- Users must also authenticate servers.
  - Otherwise fraudulent servers could steal credentials.
- Integrate Hadoop with Kerberos.
  - Provides a well-tested, open source distributed authentication system.

Requirements
- Security must be optional; not all clusters are shared between users.
- Hadoop must not prompt for passwords; prompting makes it easy to build trojan-horse versions.
- Must have single sign-on.
- Must support backwards compatibility: HFTP must be secure, but still allow reading from insecure clusters.

Kerberos and Single Sign-on
- Kerberos allows a user to sign in once and obtain a Ticket Granting Ticket (TGT).
  - kinit - get a new Kerberos ticket
  - klist - list your Kerberos tickets
  - kdestroy - destroy your Kerberos tickets
- TGTs last for 10 hours, renewable for 7 days, by default.
- Once you have a TGT, Hadoop commands just work:
  - hadoop fs -ls /
  - hadoop jar wordcount.jar in-dir out-dir

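The default lifetimes above can be sketched as a quick calculation. This is a hypothetical illustration, not Hadoop code; the 10-hour and 7-day figures are the defaults cited on the slide, and the helper name is invented:

```python
from datetime import datetime, timedelta

# Default Kerberos ticket policy mentioned on the slide (assumed values):
TICKET_LIFETIME = timedelta(hours=10)   # each TGT is valid for 10 hours
RENEWABLE_LIFE = timedelta(days=7)      # it may be renewed until 7 days after issue

def ticket_windows(issued_at):
    """Return (expiry, renew_until) for a TGT issued at `issued_at`."""
    return issued_at + TICKET_LIFETIME, issued_at + RENEWABLE_LIFE

issued = datetime(2010, 6, 29, 9, 0)
expiry, renew_until = ticket_windows(issued)
# expiry is 19:00 the same day; renew_until is 9:00 on July 6.
# A long-running client must re-kinit (or use a keytab) once renew_until passes.
```

This is why the slide says commands "just work": as long as the TGT in the ticket cache is within its renewable window, no password prompt is ever needed.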
Kerberos Dataflow (diagram)

Definitions
- Authentication - determining who the user is.
  - Hadoop 0.20 completely trusted the user: the user simply stated their username and groups over the wire.
  - We need authentication on both RPC and the Web UI.
- Authorization - what can that user do?
  - HDFS has had owners, groups, and permissions since 0.16.
  - Map/Reduce had nothing in 0.20.

Authentication
- Changes the low-level transport.
- RPC authentication using SASL:
  - Kerberos
  - Token
  - Simple
- Browser HTTP secured via plugin.
- Tool HTTP (e.g., fsck) via SSL/Kerberos.

Primary Communication Paths (diagram)

Authorization
- HDFS
  - Command line unchanged.
  - Web UI enforces authentication.
- MapReduce added Access Control Lists:
  - Lists of users and groups that have access.
  - mapreduce.job.acl-view-job - view job
  - mapreduce.job.acl-modify-job - kill or modify job

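The ACL checks above can be sketched as follows. This is a simplified illustration, not Hadoop's actual implementation; it assumes the Hadoop ACL convention of a space-separated "users groups" string, comma-separated names within each half, "*" meaning everyone, and the job owner always being allowed:

```python
def acl_allows(acl, user, user_groups, job_owner):
    """Simplified check in the spirit of mapreduce.job.acl-view-job /
    mapreduce.job.acl-modify-job. `acl` looks like "user1,user2 group1,group2"."""
    if user == job_owner:      # the submitter can always view/modify their own job
        return True
    if acl.strip() == "*":     # wildcard: everyone is allowed
        return True
    parts = acl.split(" ", 1)
    users = [u for u in parts[0].split(",") if u]
    groups = [g for g in parts[1].split(",") if g] if len(parts) > 1 else []
    return user in users or any(g in groups for g in user_groups)

# Example: user bob and anyone in group ops may view alice's job.
view_acl = "bob ops"
print(acl_allows(view_acl, "carol", ["ops"], job_owner="alice"))  # True (via group)
print(acl_allows(view_acl, "dave", ["eng"], job_owner="alice"))   # False
```

Two separate ACLs let an operator grant read-only visibility (view) broadly while keeping destructive operations (kill/modify) restricted.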
API Changes
- Very minimal API changes.
  - UserGroupInformation *completely* changed.
- MapReduce added secret credentials.
  - Available from JobConf and JobContext.
  - Never displayed via the Web UI.
- Jobs automatically get delegation tokens for HDFS:
  - Primary HDFS, File{In,Out}putFormat, and DistCp.
  - Can set mapreduce.job.hdfs-servers for additional filesystems.

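The token-gathering step can be sketched roughly as below. This is a hypothetical illustration: the helper names and the plain-dict conf are invented; only the mapreduce.job.hdfs-servers key comes from the slide. At submission time the client works out which namenodes the job will touch and requests a delegation token from each:

```python
def hdfs_servers_for_job(conf, input_paths, output_path):
    """Collect the distinct HDFS authorities a job needs delegation tokens for:
    its input/output filesystems plus any listed in mapreduce.job.hdfs-servers."""
    def authority(path):            # "hdfs://nn1:8020/data" -> "hdfs://nn1:8020"
        scheme, rest = path.split("://", 1)
        return scheme + "://" + rest.split("/", 1)[0]
    servers = {authority(p) for p in input_paths}
    servers.add(authority(output_path))
    extra = conf.get("mapreduce.job.hdfs-servers", "")
    servers.update(s for s in extra.split(",") if s)
    return sorted(servers)

# A job reading nn1, writing nn2, and explicitly declaring nn3:
conf = {"mapreduce.job.hdfs-servers": "hdfs://nn3:8020"}
print(hdfs_servers_for_job(conf, ["hdfs://nn1:8020/in"], "hdfs://nn2:8020/out"))
```

The explicit config key matters because tasks cannot kinit themselves: any filesystem a task will touch that the framework cannot infer from the input/output paths must be declared up front so its token is fetched before the job starts.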
MapReduce Security Changes
- The MapReduce system directory is now 700.
- Tasks run as the submitting user instead of as the TaskTracker user.
  - A setuid program launches the tasks.
- Task directories are now 700.
- The Distributed Cache is now secure:
  - Shared (original is world-readable): shared by everyone’s jobs.
  - Private (original is not world-readable): shared only by the user’s own jobs.

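The shared-vs-private cache rule can be sketched as a permission-bit check. This is a simplified illustration, not the actual TaskTracker code (which also considers ancestor directories); it assumes the decision is driven by the world-readable bit on the original file:

```python
import os
import stat
import tempfile

def cache_scope(path):
    """Classify a distributed-cache file: 'shared' if the original is
    world-readable (everyone's jobs may reuse the localized copy),
    otherwise 'private' (reused only by the owning user's jobs)."""
    mode = os.stat(path).st_mode
    return "shared" if mode & stat.S_IROTH else "private"

# Demo: flip a temp file between 0644 (world-readable) and 0600.
with tempfile.NamedTemporaryFile(delete=False) as f:
    name = f.name
os.chmod(name, 0o644)
print(cache_scope(name))   # shared
os.chmod(name, 0o600)
print(cache_scope(name))   # private
os.unlink(name)
```

The point of the split is cache efficiency without leaks: a world-readable file was never secret, so one localized copy can serve everyone, while a restricted file must be localized per user.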
Web UIs
- Hadoop relies on the Web UIs; these need to be authenticated also…
- Web UI authentication is pluggable.
  - Yahoo uses an internal package.
  - We have written a very simple static auth plug-in: Dr. Who returns again (the third doctor?).
  - We really need a SPNEGO plug-in…
- All servlets enforce permissions.

Proxy-Users
- Some services access HDFS and MapReduce on behalf of other users.
- Configure such services with proxy-user settings:
  - Whom the proxy service can impersonate:
    hadoop.proxyuser.superguy.groups=goodguys
  - Which hosts it can impersonate from:
    hadoop.proxyuser.superguy.hosts=secretbase
- New admin commands refresh this configuration, so you don’t need to bounce the cluster.

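The two settings above can be sketched as the following check. This is a simplified illustration; the function and dict-based conf are invented, but the hadoop.proxyuser.&lt;name&gt;.groups/.hosts keys are from the slide:

```python
def may_impersonate(conf, proxy_user, target_groups, request_host):
    """Allow `proxy_user` to impersonate a target user only if one of the
    target user's groups is whitelisted AND the request comes from an
    allowed host. Both conditions must hold."""
    allowed_groups = set(
        conf.get("hadoop.proxyuser.%s.groups" % proxy_user, "").split(","))
    allowed_hosts = set(
        conf.get("hadoop.proxyuser.%s.hosts" % proxy_user, "").split(","))
    return (bool(allowed_groups & set(target_groups))
            and request_host in allowed_hosts)

conf = {
    "hadoop.proxyuser.superguy.groups": "goodguys",
    "hadoop.proxyuser.superguy.hosts": "secretbase",
}
print(may_impersonate(conf, "superguy", ["goodguys"], "secretbase"))  # True
print(may_impersonate(conf, "superguy", ["goodguys"], "evil-lair"))   # False
```

Pairing a group whitelist with a host whitelist limits the blast radius if the service's credentials are ever stolen: the thief could still only impersonate the listed groups, and only from the listed machines.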
Out of Scope
- Encryption
  - RPC transport - easy
  - Block transport protocol - difficult
  - On disk - difficult
- File Access Control Lists
  - Files still use Unix-style owner, group, other permissions.
- Non-Kerberos authentication
  - Much easier now that the framework is available.

Schedule
- The security team worked hard to get security added to Hadoop on schedule.
- Security development team: Devaraj Das, Ravi Gummadi, Jakob Homan, Owen O’Malley, Jitendra Pandey, Boris Shkolnik, Vinod Vavilapalli, Kan Zhang.
- Currently running on science (beta) clusters.
- Deploying to production clusters in August.

Questions?
- Questions should be sent to: common/hdfs/mapreduce-user@hadoop.apache.org
- Security holes should be sent to: [email_address]
- The release is available from http://developer.yahoo.com/hadoop/distribution/
  - Also available: a VM with a secure Hadoop cluster.
- Thanks!
