
  • Hadoop Security: Hadoop Summit 2010, Owen O’Malley ([email_address]), Yahoo’s Hadoop Team
  • Problem
    • Yahoo! has more yahoos than clusters.
      • Hundreds of yahoos using Hadoop each month
      • 38,000 computers in ~20 Hadoop clusters.
      • Sharing requires isolation or trust.
    • Different users need different data.
      • Not all yahoos should have access to sensitive data
        • financial data and PII
    • In Hadoop 0.20, it is easy to impersonate another user.
      • So different data is segregated on separate clusters
  • Solution
    • Prevent unauthorized HDFS access
      • All HDFS clients must be authenticated.
      • Including tasks running as part of MapReduce jobs
      • And jobs submitted through Oozie.
    • Users must also authenticate servers
      • Otherwise fraudulent servers could steal credentials
    • Integrate Hadoop with Kerberos
      • Provides a well-tested, open source distributed authentication system.
  • Requirements
    • Security must be optional.
      • Not all clusters are shared between users.
    • Hadoop must not prompt for passwords
      • Prompting would make it easy to build trojan horse versions.
      • Must have single sign on.
    • Must support backwards compatibility
      • HFTP must be secure, but allow reading from insecure clusters
  • Kerberos and Single Sign-on
    • Kerberos allows a user to sign in once
      • Obtains Ticket Granting Ticket (TGT)
        • kinit – get a new Kerberos ticket
        • klist – list your Kerberos tickets
        • kdestroy – destroy your Kerberos ticket
        • TGTs last for 10 hours and are renewable for 7 days by default
      • Once you have a TGT, Hadoop commands just work
        • hadoop fs -ls /
        • hadoop jar wordcount.jar in-dir out-dir
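To make the default lifetimes above concrete, here is a minimal Python sketch of the arithmetic. The names `ticket_windows` and `is_usable` and the example timestamp are illustrative assumptions; real lifetimes come from the site's Kerberos configuration, not from Hadoop.

```python
from datetime import datetime, timedelta

# Defaults quoted on the slide; a real site's Kerberos setup may differ.
TICKET_LIFETIME = timedelta(hours=10)
RENEWABLE_LIFETIME = timedelta(days=7)

def ticket_windows(issued_at):
    """Return (expires_at, renew_until) for a TGT obtained at issued_at."""
    return issued_at + TICKET_LIFETIME, issued_at + RENEWABLE_LIFETIME

def is_usable(issued_at, now):
    """True while a never-renewed ticket is still inside its 10-hour lifetime."""
    expires_at, _ = ticket_windows(issued_at)
    return issued_at <= now < expires_at

issued = datetime(2010, 6, 29, 9, 0)  # arbitrary example time
expires_at, renew_until = ticket_windows(issued)
print(expires_at)   # 2010-06-29 19:00:00
print(renew_until)  # 2010-07-06 09:00:00
```

A ticket obtained in the morning therefore covers a working day, and after the renewal window closes the user has to run kinit again.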
  • Kerberos Dataflow
  • Definitions
    • Authentication – Determining the user
      • Hadoop 0.20 completely trusted the user
        • User states their username and groups over the wire
      • We need it on both RPC and Web UI.
    • Authorization – What can that user do?
      • HDFS has had owners, groups and permissions since 0.16.
      • Map/Reduce had nothing in 0.20.
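The split between the two definitions can be illustrated with a small Python sketch of a Unix-style owner/group/other check. The function `can_access` and the sample users are hypothetical, and superuser handling is left out; the point is that the check is only as trustworthy as the `user` and `groups` values passed into it.

```python
def can_access(user, groups, owner, group, mode, want):
    """Unix-style authorization check.

    user, groups: identity of the caller (the output of authentication)
    owner, group: ownership recorded on the file
    mode:         permission bits as an int, e.g. 0o640
    want:         'r', 'w' or 'x'
    """
    bit = {"r": 4, "w": 2, "x": 1}[want]
    if user == owner:
        triad = (mode >> 6) & 7   # owner bits
    elif group in groups:
        triad = (mode >> 3) & 7   # group bits
    else:
        triad = mode & 7          # other bits
    return bool(triad & bit)

# A file owned by alice:finance with mode 640
print(can_access("alice", ["staff"], "alice", "finance", 0o640, "w"))  # True
print(can_access("eve", ["staff"], "alice", "finance", 0o640, "r"))    # False
# If the client may simply claim to be alice, the check above is bypassed:
print(can_access("alice", [], "alice", "finance", 0o640, "r"))         # True
```

With client-asserted identities, the last call succeeds for anyone, which is why authentication has to be fixed before authorization means anything.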
  • Authentication
    • Changes low-level transport
    • RPC authentication using SASL
      • Kerberos
      • Token
      • Simple
    • Browser HTTP secured via plugin
    • Tool HTTP (e.g. fsck) via SSL/Kerberos
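The "Token" mechanism lets a process authenticate with a shared secret instead of a Kerberos ticket. The Python sketch below shows the general idea only: the server derives a password from a token identifier with a key that only it holds, so it can verify a token later without storing it. The names, the identifier format and the choice of HMAC-SHA1 are assumptions for illustration, and the sketch skips the SASL challenge-response, so it is not Hadoop's wire protocol.

```python
import hashlib
import hmac
import os

# Hypothetical server-side secret; it is never sent to clients.
MASTER_KEY = os.urandom(20)

def issue_token(identifier, key=MASTER_KEY):
    """Return (identifier, password); the password is derived, not stored."""
    password = hmac.new(key, identifier, hashlib.sha1).digest()
    return identifier, password

def verify_token(identifier, password, key=MASTER_KEY):
    """Recompute the password from the identifier and compare in constant time."""
    expected = hmac.new(key, identifier, hashlib.sha1).digest()
    return hmac.compare_digest(expected, password)

ident, pw = issue_token(b"owner=alice")
print(verify_token(ident, pw))         # True
print(verify_token(b"owner=bob", pw))  # False: password does not match identifier
```

Because the password is bound to the identifier, a holder of alice's token cannot reuse it under another name.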
  • Primary Communication Paths
  • Authorization
    • HDFS
      • Command line unchanged
      • Web UI enforces authentication
    • MapReduce added Access Control Lists
      • Lists of users and groups that have access.
      • mapreduce.job.acl-view-job – view job
      • mapreduce.job.acl-modify-job – kill or modify job
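As a rough illustration of how such a list could be evaluated, the Python sketch below assumes the ACL value is written as comma-separated users, a space, then comma-separated groups, with `*` meaning everyone. The helper names are hypothetical, and letting the job owner through unconditionally is an assumption of this sketch.

```python
def parse_acl(value):
    """Parse 'user1,user2 group1,group2'. Returns None when everyone is allowed."""
    if value.strip() == "*":
        return None
    # Do not strip first: a leading space means "no users, only groups".
    users_part, _, groups_part = value.partition(" ")
    users = {u.strip() for u in users_part.split(",") if u.strip()}
    groups = {g.strip() for g in groups_part.split(",") if g.strip()}
    return users, groups

def is_allowed(acl_value, user, user_groups, job_owner):
    """True if user may perform the operation guarded by acl_value."""
    if user == job_owner:
        return True
    acl = parse_acl(acl_value)
    if acl is None:
        return True
    users, groups = acl
    return user in users or bool(groups & set(user_groups))

view_acl = "alice,bob ops"  # e.g. a value for mapreduce.job.acl-view-job
print(is_allowed(view_acl, "carol", ["ops"], "dave"))  # True, via group ops
print(is_allowed(view_acl, "carol", ["dev"], "dave"))  # False
```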
  • API Changes
    • Very Minimal API Changes
      • UserGroupInformation *completely* changed.
    • MapReduce added secret credentials
      • Available from JobConf and JobContext
      • Never displayed via Web UI
    • Automatically get tokens for HDFS
      • Primary HDFS, File{In,Out}putFormat, and DistCp
      • Can set mapreduce.job.hdfs-servers
  • MapReduce Security Changes
    • MapReduce System directory now 700.
    • Tasks run as user instead of TaskTracker.
      • Setuid program that runs tasks.
    • Task directories are now 700.
    • Distributed Cache is now secure
      • Shared (original is world readable) is shared by everyone’s jobs.
      • Private (original is not world readable) is shared by user’s jobs.
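The permission rules on this slide reduce to two checks on mode bits, sketched here in Python. Both helper names are hypothetical, and the sketch looks only at the file's own mode; whether parent directories are also checked is not covered by the slide and is ignored here.

```python
import stat

def cache_visibility(mode):
    """Classify a cached file by the 'other' read bit of the original's mode."""
    return "shared" if mode & stat.S_IROTH else "private"

def owner_only(mode):
    """True when the permission bits are exactly 700 (rwx------)."""
    return stat.S_IMODE(mode) == 0o700

print(cache_visibility(0o644))  # shared
print(cache_visibility(0o640))  # private
print(owner_only(0o700))        # True
print(owner_only(0o755))        # False
```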
  • Web UIs
    • Hadoop relies on the Web UIs.
      • These need to be authenticated also…
    • Web UI authentication is pluggable.
      • Yahoo uses an internal package
      • We have written a very simple static auth plug-in
        • Dr. Who returns again (the third doctor?)
    • We really need a SPNEGO plug-in…
    • All servlets enforce permissions.
  • Proxy-Users
    • Some services access HDFS and MapReduce as other users.
    • Configure services with the proxy user:
      • Who the proxy service can impersonate
        • hadoop.proxyuser.superguy.groups=goodguys
      • Which hosts they can impersonate from
        • hadoop.proxyuser.superguy.hosts=secretbase
    • New admin commands to refresh
      • Don’t need to bounce cluster
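The two settings above combine into a single decision: the target user must belong to an allowed group and the request must come from an allowed host. A minimal Python sketch of that decision follows; `may_impersonate` is a hypothetical helper, the values are assumed to be comma-separated lists, and the sample user `goodhost2` style names beyond those on the slide are invented for the example.

```python
# Settings from the slide, held in a plain dict for illustration.
CONF = {
    "hadoop.proxyuser.superguy.groups": "goodguys",
    "hadoop.proxyuser.superguy.hosts": "secretbase",
}

def _csv(value):
    """Split a comma-separated setting into a set of trimmed entries."""
    return {item.strip() for item in value.split(",") if item.strip()}

def may_impersonate(conf, proxy_user, target_groups, from_host):
    """True if proxy_user may act for a user in target_groups from from_host."""
    prefix = "hadoop.proxyuser.%s." % proxy_user
    allowed_groups = _csv(conf.get(prefix + "groups", ""))
    allowed_hosts = _csv(conf.get(prefix + "hosts", ""))
    group_ok = bool(allowed_groups & set(target_groups))
    host_ok = from_host in allowed_hosts
    return group_ok and host_ok

print(may_impersonate(CONF, "superguy", ["goodguys"], "secretbase"))  # True
print(may_impersonate(CONF, "superguy", ["goodguys"], "elsewhere"))   # False
print(may_impersonate(CONF, "otherguy", ["goodguys"], "secretbase"))  # False
```

A service with no matching entries is denied by default, so impersonation has to be granted explicitly per service.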
  • Out of Scope
    • Encryption
      • RPC transport – easy
      • Block transport protocol – difficult
      • On disk – difficult
    • File Access Control Lists
      • Still use Unix-style owner, group, other permissions
    • Non-Kerberos Authentication
      • Much easier now that framework is available
  • Schedule
    • The security team worked hard to get security added to Hadoop on schedule.
    • Security Development team:
      • Devaraj Das, Ravi Gummadi, Jakob Homan, Owen O’Malley, Jitendra Pandey, Boris Shkolnik, Vinod Vavilapalli, Kan Zhang
    • Currently on science (beta) clusters
    • Deploy to production clusters in August
  • Questions?
    • Questions should be sent to:
      • common/hdfs/mapreduce-user@hadoop.apache.org
    • Security holes should be sent to:
      • [email_address]
    • Available from
      • http://developer.yahoo.com/hadoop/distribution/
      • Also available as a VM with a security-enabled Hadoop cluster
    • Thanks!