CIS13: Big Data Platform Vendor’s Perspective: Insights from the Bleeding Edge


Aaron T. Myers (ATM), Software Engineer, Cloudera, Inc.
The era of “Big Data for the masses” is upon us. Despite the mindshare Big Data has received, driven by the development and distribution of Apache Hadoop, the first commercial release came only in December 2011, from Cloudera, Inc. Cloudera remains the leading Hadoop platform provider in the market today. With a diverse list of early-adopter customers in enterprise and government, Cloudera offers a bird’s-eye view of the leading authentication issues that emerge as these organizations move out of the sandbox and into full production.
Speaker Aaron T. Myers (ATM) was one of Cloudera’s earliest engineers and works on the Apache Hadoop core, specifically HDFS and Hadoop’s security features. ATM is an Apache Hadoop PMC Member and Committer.

Published in: Technology

  1. Securing the Hadoop Ecosystem
     Aaron T. Myers (ATM) @ Cloudera
     Cloud Identity Summit, July 2013
  2. Who am I?
     • Software Engineer at Cloudera
     • Hadoop Committer and PMC Member at Apache Software Foundation
     • Primarily work on Hadoop Security and HDFS
     • Masters thesis focused on systems security
  3. Agenda
     • What is Hadoop?
     • Hadoop Ecosystem Interactions
     • Hadoop Authentication
     • Hadoop Authorization
     • IT Infrastructure Integration
     • The Future: Where Hadoop Security is Headed
  4. Hadoop Is…
     • A distributed system
         • Designed for massive scaling of storage and compute across many (10s-1000s) nodes
     • An ecosystem
         • Hadoop is the kernel, apps on top are user-level programs
         • e.g. Impala, Hive, Oozie, HBase, etc.
     • A security pain
         • Designed to run arbitrary code submitted by users
         • Another place where many users interact with the system
         • Many orgs provide “Hadoop as a service”
  5. Hadoop Is…
     • Not secure by default
         • No authentication whatsoever
     • Usually behind a corporate firewall
     • Often accessed by common BI tools
         • Tableau, SAS, Microstrategy, etc.
     • Expected to be integrated into corporate IT infra
         • SSO, etc.
  6. Hadoop on its Own
     [Diagram. Hadoop components: NN, SNN, JT, DN/TT nodes, Map and Reduce tasks, HttpFS. Clients: MR client, HDFS client, WebHdfs client. Accounts: hdfs, httpfs & mapred users; end users. Protocols: RPC/data transfer/HTTP.]
  7. The Hadoop Ecosystem
     • Storage
         • HBase
         • HDFS
     • Processing
         • Map/Reduce
         • YARN
     • Querying
         • Hive, Impala (SQL)
         • Pig (DSL)
     • Cron, workflows
         • Oozie
     • Data ingest
         • Flume (streaming)
         • Sqoop (batch)
     • Live data serving
         • HBase
     • Pipelines
         • Crunch, Cascading
     • GUI
         • Hue
     • Management
         • Cloudera Manager
  8. Hadoop and Friends
     [Diagram. Services around Hadoop: Hive Metastore, HBase, Oozie, Hue, Impala, Zookeeper, Flume. Clients: MapRed, Pig, Crunch, Cascading, Sqoop, Hive, HBase, Oozie, Impala, Flume, WebHdfs, Zookeeper, browser. Accounts: service users; end users. Protocols: RPCs/data/HTTP/Thrift/Avro-RPC.]
  9. Authentication Details
     • Hadoop Authentication based on Kerberos
         • Usually MIT, also Active Directory
     • End Users to services, as a user
         • CLI & libraries: Kerberos (kinit or keytab)
         • Web UIs: Kerberos SPNEGO & pluggable HTTP auth
     • Services to Services, as a service
         • Credentials: Kerberos (keytab)
     • Services to Services, on behalf of a user
         • Proxy-user (after Kerberos for service)
     • Job tasks to Services, on behalf of a user
         • Job delegation token
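The "services to services, on behalf of a user" case on the Authentication Details slide can be illustrated with a small model. Hadoop configures proxy users with `hadoop.proxyuser.<service>.hosts` and `hadoop.proxyuser.<service>.groups`; the Python below is only a simplified sketch of that kind of check, not Hadoop's implementation. The `ProxyRule` class, the `may_impersonate` function, and the host and group names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProxyRule:
    """Impersonation limits for one service user (hypothetical model)."""
    hosts: frozenset = frozenset()   # hosts the service may connect from
    groups: frozenset = frozenset()  # groups whose members may be impersonated


def may_impersonate(rules, service_user, remote_host, end_user_groups):
    """Decide whether a service that has ALREADY authenticated with Kerberos
    may act on behalf of an end user belonging to end_user_groups."""
    rule = rules.get(service_user)
    if rule is None:
        return False  # no rule configured: the service cannot impersonate
    host_ok = "*" in rule.hosts or remote_host in rule.hosts
    group_ok = "*" in rule.groups or bool(rule.groups & set(end_user_groups))
    return host_ok and group_ok


# Assumed example: the oozie service may impersonate analysts, from one host.
rules = {
    "oozie": ProxyRule(
        hosts=frozenset({"oozie1.example.com"}),
        groups=frozenset({"analysts"}),
    )
}
```

With these rules, `may_impersonate(rules, "oozie", "oozie1.example.com", ["analysts"])` is allowed, while the same request from any other host, or from a service with no rule, is refused.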
  10. Authorization Details
     • HDFS Data
         • File System permissions (Unix like user/group permissions)
     • HBase Data
         • Read/Write Access Control Lists (ACLs) at table level
     • Hive Metastore (Hive, Impala)
         • Leverages/proxies HDFS permissions for tables & partitions
     • Hive Server (Hive, Impala) (coming)
         • More advanced GRANT/REVOKE with ACLs for tables
     • Jobs (Hadoop, Oozie)
         • Job ACLs for Hadoop Scheduler Queues, manage & view jobs
     • Zookeeper
         • ACLs at znodes, authenticated & read/write
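The Unix-like permission model named for HDFS data on the Authorization Details slide can be sketched in a few lines. This is a minimal illustration of owner/group/other mode bits, assuming the usual rule that the first matching class decides. It is not HDFS code, it omits the superuser bypass, and the function name and example users are hypothetical.

```python
READ, WRITE, EXECUTE = 4, 2, 1


def is_allowed(mode, owner, group, user, user_groups, wanted):
    """Check `wanted` permission bits against a Unix-style mode such as 0o640.

    The first matching class (owner, then group, then other) decides;
    later classes are not consulted even if they would grant more.
    """
    if user == owner:
        bits = (mode >> 6) & 7
    elif group in user_groups:
        bits = (mode >> 3) & 7
    else:
        bits = mode & 7
    return bits & wanted == wanted
```

For a file with mode `0o640`, owner `hdfs` and group `analysts`, a member of `analysts` can read but not write, and a user outside the group gets no access at all.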
  11. IT Integration: Kerberos
     • Users don’t want Yet Another Credential
     • Corp IT doesn’t want to provision thousands of service principals
     • Solution: local KDC + one-way trust
         • Run a KDC (usually MIT Kerberos) in the cluster
         • Put all service principals here
         • Set up one-way trust of central corporate realm by local KDC
         • Normal user credentials can be used to access Hadoop
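The one-way trust described on the Kerberos integration slide can be modelled as follows. In Kerberos, a client in realm A can reach services in realm B when a cross-realm principal `krbtgt/B@A` exists in both KDCs. The sketch below encodes only that rule: it ignores transitive trust paths, the function names are hypothetical, and the realm names `EXAMPLE.COM` and `HADOOP.EXAMPLE.COM` are illustrative.

```python
def cross_realm_principal(client_realm, service_realm):
    """Name of the principal that lets client_realm users reach service_realm."""
    return f"krbtgt/{service_realm}@{client_realm}"


def can_authenticate(client_realm, service_realm, trust_principals):
    """True if a user from client_realm can obtain tickets for service_realm.

    trust_principals is the set of cross-realm principals that have been
    created; direct trust only, no multi-hop paths.
    """
    if client_realm == service_realm:
        return True
    return cross_realm_principal(client_realm, service_realm) in trust_principals


# One-way trust: the cluster realm trusts the corporate realm, not vice versa.
trust = {cross_realm_principal("EXAMPLE.COM", "HADOOP.EXAMPLE.COM")}
```

Corporate users can then authenticate to cluster services, but cluster service principals gain no access to corporate services, which is the point of keeping the trust one-way.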
  12. IT Integration: Groups
     • Much of Hadoop authorization uses “groups”
         • User ‘atm’ might belong to groups ‘analysts’, ‘eng’, etc.
     • Users’ groups are not stored in Hadoop anywhere
         • Refers to external system to determine group membership
         • NN/JT/Oozie/Hive servers all must perform group mapping
     • Default plugins for user/group mapping:
         • ShellBasedUnixGroupsMapping – forks/runs `/bin/id`
         • JniBasedUnixGroupsMapping – makes a system call
         • LdapGroupsMapping – talks directly to an LDAP server
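The pluggable group mapping on the Groups slide can be sketched as interchangeable lookup functions. The Python below is an illustration only: it assumes the shell-based strategy amounts to running `id -Gn <user>` and splitting the output, and every function name here is hypothetical rather than part of Hadoop.

```python
import subprocess
from typing import Callable, Dict, List

GroupLookup = Callable[[str], List[str]]


def parse_id_output(output: str) -> List[str]:
    """Split the whitespace-separated group names printed by `id -Gn`."""
    return output.split()


def shell_groups(user: str) -> List[str]:
    """Shell-based strategy: fork `id -Gn <user>` and parse what it prints."""
    result = subprocess.run(
        ["id", "-Gn", user], capture_output=True, text=True, check=True
    )
    return parse_id_output(result.stdout)


def static_groups(table: Dict[str, List[str]]) -> GroupLookup:
    """In-memory strategy, standing in for an LDAP lookup in tests."""
    return lambda user: list(table.get(user, []))


def user_in_any_group(user: str, allowed: List[str], lookup: GroupLookup) -> bool:
    """Resolve groups through whichever lookup is plugged in, then compare."""
    return bool(set(lookup(user)) & set(allowed))
```

Because the caller only sees a `GroupLookup`, swapping `shell_groups` for a directory-backed lookup changes where membership comes from without touching the authorization check, which mirrors why each server must be configured with a mapping plugin.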
  13. IT Integration: Kerberos + LDAP
     [Diagram. Hadoop cluster (NN, JT) with a local KDC holding service principals: hdfs/host1@HADOOP.EXAMPLE.COM, yarn/host2@HADOOP.EXAMPLE.COM, … Central Active Directory holding user principals: tucu@EXAMPLE.COM, atm@EXAMPLE.COM, … Labels: cross-realm trust; LDAP group mapping.]
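The principals in this diagram follow the Kerberos `primary/instance@REALM` form. The sketch below parses that form and maps a principal to a short user name if its realm is trusted. Hadoop drives its own mapping with `hadoop.security.auth_to_local` rules; this example swaps in a plain realm allow-list instead, and the `Principal` type and both function names are hypothetical.

```python
from typing import NamedTuple, Optional


class Principal(NamedTuple):
    primary: str             # user or service name, e.g. "hdfs" or "atm"
    instance: Optional[str]  # usually a host name for services, else None
    realm: str


def parse_principal(name: str) -> Principal:
    """Parse 'primary[/instance]@REALM' into its parts."""
    rest, at, realm = name.rpartition("@")
    if not at or not rest or not realm:
        raise ValueError(f"not a full principal name: {name!r}")
    primary, _, instance = rest.partition("/")
    return Principal(primary, instance or None, realm)


def short_name(principal: Principal, trusted_realms) -> str:
    """Map a principal to a local user name if its realm is trusted."""
    if principal.realm not in trusted_realms:
        raise PermissionError(f"untrusted realm: {principal.realm}")
    return principal.primary
```

With both realms trusted, `hdfs/host1@HADOOP.EXAMPLE.COM` maps to `hdfs` and `atm@EXAMPLE.COM` maps to `atm`, so the name handed to group mapping is the same whichever KDC issued the ticket.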
  14. IT Integration: Web Interfaces
     • Most web interfaces authenticate using SPNEGO
         • Standard HTTP authentication protocol
         • Used internally by services which communicate over HTTP
         • Most browsers support Kerberos SPNEGO authentication
     • Hadoop components which use servlets for web interfaces can plug in custom filter
         • Integrate with intranet SSO HTTP solution
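The SPNEGO exchange named on the Web Interfaces slide uses the HTTP `Negotiate` scheme: the server answers an unauthenticated request with `401` and `WWW-Authenticate: Negotiate`, and the client retries with `Authorization: Negotiate <base64 token>`. The sketch below models only that header handling. The `validate_token` callback is a hypothetical stand-in for the real GSS-API/Kerberos check, and multi-round negotiation and the server's reply token are omitted.

```python
import base64


def spnego_filter(headers, validate_token):
    """Return (status, response_headers, user) for one HTTP request.

    headers: dict of request headers.
    validate_token: callable taking the raw token bytes and returning the
    authenticated user name, or None if the token is rejected (hypothetical).
    """
    challenge = (401, {"WWW-Authenticate": "Negotiate"}, None)
    scheme, _, payload = headers.get("Authorization", "").partition(" ")
    if scheme.lower() != "negotiate" or not payload.strip():
        return challenge  # no credentials yet: ask the client to negotiate
    try:
        token = base64.b64decode(payload.strip(), validate=True)
    except ValueError:  # binascii.Error is a subclass of ValueError
        return challenge
    user = validate_token(token)
    if user is None:
        return challenge
    return (200, {}, user)
```

A custom filter for an intranet SSO product would sit in the same position, replacing the token validation while leaving the rest of the web interface unchanged.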
  16. Issues with Hadoop Security
     • SSO is poorly and not universally supported
         • Only supported for the web interfaces, little used, etc.
     • Kerberos the only option
         • Not all orgs comfortable administering net new Kerberos realm
         • Not well-suited for cloud deployments
         • Need properly working reverse DNS
         • Pain to provision KDC, distribute keytabs
     • Kerberos tough for management tools
         • No Kerberos administrative API/protocol
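The reverse DNS requirement listed above can be checked before enabling Kerberos on a node. The sketch below is a minimal pre-flight check, assuming that "properly working" means a host name resolves to an address that resolves back to the same name. The function name is hypothetical, and the resolver parameters exist only so the check can be exercised without a network.

```python
import socket


def dns_round_trip_ok(hostname,
                      forward=socket.gethostbyname,
                      reverse=socket.gethostbyaddr):
    """True if hostname -> address -> name returns the original hostname."""
    try:
        address = forward(hostname)
        reverse_name = reverse(address)[0]  # (name, aliases, addresses)
    except OSError:  # socket.gaierror and socket.herror both derive from OSError
        return False
    return reverse_name.rstrip(".").lower() == hostname.rstrip(".").lower()
```

Running such a check on every node is cheap compared with debugging a failed service ticket request later, and it makes clear why cloud deployments with unstable or provider-assigned host names are awkward for Kerberos.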
  17. Issues with Hadoop Security (cont.)
     • Isolation of user tasks currently requires separate local Unix accounts on all boxes
         • Need to integrate with LDAP using PAM or something like it
     • HDFS authorization only supports Unix-style permissions
         • Not expressive enough for some applications, e.g. Hive
  18. Future Development
     • Full SSO support
         • OAUTH the most commonly requested, first goal
     • Decouple Hadoop RPC implementation from Kerberos
         • Make authentication system fully pluggable for custom implementations
         • Any service which can provide bidirectional authentication
     • Improve management tools
         • Cloudera Manager can manage more of the security infrastructure
  19. Future Development (cont.)
     • Use better isolation methods for user tasks
         • Linux containers
         • Solaris “zones”
         • Etc.
     • Better authorization capabilities
         • Talk of adding ACL support to HDFS
         • Hive Server 2 will provide rich authorization capabilities
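To illustrate what ACL support could add beyond a single owner/group/other triple, the sketch below evaluates a list of POSIX-style entries with named users and named groups. This is an assumption about the general shape of such a feature, not a description of any HDFS design; the entry format, the function name, and the example users are hypothetical, and the POSIX mask entry is left out.

```python
READ, WRITE = 4, 2


def acl_allows(entries, user, user_groups, wanted):
    """Evaluate entries of the form (kind, name, bits).

    kind is "user", "group" or "other". A matching user entry decides first;
    otherwise any matching group entry may grant access; otherwise "other".
    """
    for kind, name, bits in entries:
        if kind == "user" and name == user:
            return bits & wanted == wanted
    group_bits = [bits for kind, name, bits in entries
                  if kind == "group" and name in user_groups]
    if group_bits:
        return any(bits & wanted == wanted for bits in group_bits)
    other_bits = [bits for kind, _, bits in entries if kind == "other"]
    return bool(other_bits) and other_bits[0] & wanted == wanted


# Two groups with different rights on one path, which mode bits cannot express.
entries = [
    ("user", "hive", READ | WRITE),
    ("group", "analysts", READ),
    ("group", "eng", READ | WRITE),
    ("other", "", 0),
]
```

Here `analysts` can read while `eng` can also write, a combination that plain Unix-style permissions cannot represent on a single file or directory.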
  20. Q&A
  21. Thanks
     Aaron T. Myers (ATM) @ Cloudera
     Cloud Identity Summit, July 2013