Securing the Hadoop Ecosystem


Nowadays a typical Hadoop deployment consists of the core Hadoop components – HDFS and MapReduce – plus several other components such as HBase, HttpFS, Oozie, Pig, Hive, Sqoop, and Flume, along with programmatic integration from external systems and applications. This effectively creates a complex and heterogeneous distributed environment that runs across several machines and uses different protocols for communication, all of it used concurrently by many users and applications. When a Hadoop deployment and its ecosystem are used to process sensitive data (such as financial records, payment transactions, or healthcare records), several security requirements arise. These requirements may be dictated by internal policies and/or government regulations, and may call for strong authentication, selective authorization to access data and resources, and data confidentiality. This session covers in detail how the different components in the Hadoop ecosystem and external applications can interact with each other in a secure manner, providing authentication, authorization, and confidentiality when accessing services and transferring data to, from, and between services. The session covers topics such as Kerberos authentication, web UI authentication, file system permissions, delegation tokens, Access Control Lists, proxy-user impersonation, and network encryption.



  1. Securing the Hadoop Ecosystem. ATM (Cloudera) & Tucu (Cloudera). Hadoop Summit, June 2013
  2. Why is Security Important? [Photos of ATM and Tucu]
  3. Agenda
     • Hadoop Ecosystem Interactions
     • Security Concepts
     • Authentication
     • Authorization
     • Confidentiality
     • Auditing
     • IT Infrastructure Integration
     • Deployment Recommendations
  4. Hadoop on its Own. [Architecture diagram: NN, SNN and DNs (HDFS) plus JT and TTs running Map/Reduce tasks (MapReduce); HDFS, WebHdfs, HttpFS and MR clients; service users hdfs, httpfs & mapred vs. end users; protocols: RPC, data transfer, HTTP]
  5. Hadoop and Friends. [Architecture diagram: Hadoop surrounded by the Hive Metastore, HBase, Oozie, Hue, Impala, Zookeeper and Flume services, with MapReduce clients (Pig, Crunch, Cascading, Sqoop), service clients, and a browser for Hue; service users vs. end users; protocols: RPC, data transfer, HTTP, Thrift, Avro RPC, WebHdfs]
  6. Authentication / Authorization
     • Authentication:
       • End users to services, as a user: user credentials
       • Services to services, as a service: service credentials
       • Services to services, on behalf of a user: service credentials + trusted service
       • Job tasks to services, on behalf of a user: job delegation token
     • Authorization:
       • Data: HDFS, HBase, Hive Metastore, Zookeeper
       • Jobs: who can submit, view or manage jobs (MR, Pig, Oozie, Hue, …)
       • Queries: who can run queries (Impala)
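The job delegation token mentioned above lets tasks act on a user's behalf without carrying Kerberos credentials. A toy Python model of the idea follows; the master secret, field layout, and function names are invented and do not match Hadoop's actual token format:

```python
import hmac
import hashlib

# Toy sketch of a delegation token: the NameNode holds a master secret
# and hands out a token whose password is HMAC(secret, identifier).
# A task presents identifier + password; the service recomputes the
# HMAC to verify, so no per-task Kerberos round-trip is needed.
MASTER_SECRET = b"namenode-master-secret"  # hypothetical secret

def issue_token(owner: str, renewer: str, max_date: int):
    identifier = f"{owner}:{renewer}:{max_date}".encode()
    password = hmac.new(MASTER_SECRET, identifier, hashlib.sha256).digest()
    return identifier, password

def verify_token(identifier: bytes, password: bytes) -> bool:
    expected = hmac.new(MASTER_SECRET, identifier, hashlib.sha256).digest()
    return hmac.compare_digest(expected, password)

ident, pw = issue_token("atm", "jt", 1690000000)
print(verify_token(ident, pw))         # genuine token verifies
print(verify_token(ident, b"forged"))  # tampered token is rejected
```

The point of the construction is that verification only needs the master secret, so any service holding it can check tokens offline.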
  7. Confidentiality / Auditing
     • Confidentiality:
       • Data at rest (on disk)
       • Data in transit (on the network)
     • Auditing:
       • Who accessed (read/write) data
       • Who submitted, managed or viewed a job or a query
  8. Authentication Details
     • End users to services, as a user
       • CLI & libraries: Kerberos (kinit or keytab)
       • Web UIs: Kerberos SPNEGO & pluggable HTTP auth
     • Services to services, as a service
       • Credentials: Kerberos (keytab)
     • Services to services, on behalf of a user
       • Proxy-user (after Kerberos for the service)
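The proxy-user mechanism is configured on the cluster side in core-site.xml. A hedged sketch of the entries that would let the oozie service user act on behalf of end users (the host and group values here are placeholders for your environment):

```xml
<!-- Allow the oozie service principal to impersonate end users,
     but only from the Oozie server host and only for these groups. -->
<property>
  <name>hadoop.proxyuser.oozie.hosts</name>
  <value>oozieserver.example.com</value>
</property>
<property>
  <name>hadoop.proxyuser.oozie.groups</name>
  <value>analysts,eng</value>
</property>
```

Restricting both the originating hosts and the impersonatable groups keeps a compromised service credential from impersonating arbitrary users cluster-wide.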
  9. Authorization Details
     • HDFS data
       • File system permissions (Unix-like user/group permissions)
     • HBase data
       • Read/write Access Control Lists (ACLs) at table level
     • Hive Metastore (Hive, Impala)
       • Leverages/proxies HDFS permissions for tables & partitions
     • Hive Server (Hive, Impala) (coming)
       • More advanced GRANT/REVOKE with ACLs for tables
     • Jobs (Hadoop, Oozie)
       • Job ACLs for Hadoop scheduler queues, manage & view jobs
     • Zookeeper
       • ACLs at znodes, authenticated & read/write
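The Unix-like permission model used for HDFS data can be sketched in a few lines of Python. The users, groups, and paths below are invented; this mirrors the owner/group/other rwx check, not HDFS's actual implementation:

```python
# Octal permission bits, as in 'hdfs dfs -chmod 750 /user/atm/data'
READ, WRITE, EXECUTE = 4, 2, 1

def is_allowed(user, user_groups, owner, group, mode, wanted):
    """Check a single rwx permission: pick the owner, group, or
    'other' bit triplet, then test the requested bit."""
    if user == owner:
        bits = (mode >> 6) & 7   # owner triplet
    elif group in user_groups:
        bits = (mode >> 3) & 7   # group triplet
    else:
        bits = mode & 7          # other triplet
    return bits & wanted == wanted

# A file owned by atm:analysts with mode 750:
print(is_allowed("atm", {"analysts"}, "atm", "analysts", 0o750, WRITE))  # True
print(is_allowed("tucu", {"analysts"}, "atm", "analysts", 0o750, READ))  # True
print(is_allowed("bob", {"eng"}, "atm", "analysts", 0o750, READ))        # False
```

Note how the group check is why the group-mapping integration discussed later matters: the NN must know a user's groups before it can evaluate the middle triplet.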
  10. Confidentiality Details
      • Data in transit
        • RPC: using SASL
        • HDFS data transfer: using SASL
        • HTTP: using SSL (web UIs, shuffle); requires SSL certificates
        • Thrift: not available (Hive Metastore, Impala)
        • Avro RPC: not available (Flume)
      • Data at rest
        • Nothing out of the box
        • Doable via a custom 'compression' codec or local file system encryption
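The "custom 'compression' codec" trick for data at rest can be sketched as follows: compress on the way in, encrypt the compressed bytes, and reverse the steps on the way out. This is a toy illustration only; the SHA-256 counter-mode keystream stands in for a real cipher (a production codec would use AES), and the key is invented:

```python
import hashlib
import zlib

KEY = b"cluster-data-key"  # hypothetical key; real deployments need key management

def keystream(key: bytes, n: int) -> bytes:
    """Derive n pseudo-random bytes by hashing key + counter (toy stand-in
    for a real stream cipher)."""
    out = bytearray()
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:n])

def encode(plain: bytes) -> bytes:
    """'Compress' side of the codec: deflate, then XOR with the keystream."""
    comp = zlib.compress(plain)
    return bytes(a ^ b for a, b in zip(comp, keystream(KEY, len(comp))))

def decode(stored: bytes) -> bytes:
    """'Decompress' side: XOR the keystream back off, then inflate."""
    comp = bytes(a ^ b for a, b in zip(stored, keystream(KEY, len(stored))))
    return zlib.decompress(comp)

data = b"payment-record:42;" * 100
assert decode(encode(data)) == data  # round-trips losslessly
```

Because Hadoop already plugs codecs into the HDFS/MapReduce I/O path, wrapping encryption in a codec needs no changes to jobs, only configuration.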
  11. Auditing Details
      • Who accessed (read/write) FS data
        • NN audit log contains all file opens and creates
        • NN audit log contains all metadata ops, e.g. rename, listdir
      • Who submitted, managed, or viewed a job or a query
        • JT, RM, and Job History Server logs contain the history of all jobs run on a cluster
      • Who submitted, managed, or viewed a workflow
        • Oozie audit logs contain the history of all user requests
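Answering "who accessed what" from the NN audit log is mostly a matter of splitting out its key=value fields. A hedged sketch; the sample line below is illustrative and the exact field layout varies across Hadoop versions:

```python
import re

# Illustrative NameNode audit line (format varies by version):
LINE = ("2013-06-26 10:00:00,000 INFO FSNamesystem.audit: "
        "allowed=true ugi=atm (auth:KERBEROS) ip=/10.1.2.3 "
        "cmd=open src=/data/tx/part-00000 dst=null perm=null")

def parse_audit(line: str) -> dict:
    """Extract key=value pairs; a value runs until the next ' key='
    token or the end of the line (values like the ugi may contain spaces)."""
    payload = line.split("audit: ", 1)[1]
    pairs = re.findall(r"(\w+)=(.*?)(?=\s+\w+=|$)", payload)
    return dict(pairs)

fields = parse_audit(LINE)
print(fields["ugi"], fields["cmd"], fields["src"])
```

Grouping parsed records by `ugi` and `src` gives a simple who-read-which-file report, which is often all an auditor asks for.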
  12. Auditing Gaps
      • Not all projects have explicit audit logs
        • Audit-like information can be extracted by processing logs
        • E.g. Impala query logs are distributed across all nodes
      • It is difficult to correlate jobs & data access
        • E.g. MapReduce jobs launched by a Pig job
        • E.g. HDFS data accessed by a MapReduce job
  13. IT Integration: Kerberos
      • Users don't want Yet Another Credential
      • Corp IT doesn't want to provision thousands of service principals
      • Solution: local KDC + one-way trust
        • Run a KDC (usually MIT Kerberos) in the cluster
        • Put all service principals here
        • Set up one-way trust of the central corporate realm by the local KDC
        • Normal user credentials can be used to access Hadoop
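The local-KDC-plus-trust layout can be sketched in krb5.conf terms. Realm and host names below are placeholders matching the diagram on the next slide; note that the trust itself is established in the KDCs by creating a shared krbtgt/HADOOP.EXAMPLE.COM@EXAMPLE.COM principal, not in this file:

```ini
# Hedged sketch of the cluster-side krb5.conf for a local KDC with a
# one-way trust of the central corporate realm.
[libdefaults]
  default_realm = HADOOP.EXAMPLE.COM

[realms]
  HADOOP.EXAMPLE.COM = { kdc = kdc.hadoop.example.com }
  EXAMPLE.COM        = { kdc = ad.example.com }

[domain_realm]
  .hadoop.example.com = HADOOP.EXAMPLE.COM
  .example.com        = EXAMPLE.COM
```

With this in place, a corporate user's existing ticket-granting ticket from EXAMPLE.COM can be exchanged for service tickets in HADOOP.EXAMPLE.COM, so no second credential is provisioned.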
  14. IT Integration: Groups
      • Much of Hadoop authorization uses "groups"
        • User 'atm' might belong to groups 'analysts', 'eng', etc.
      • Users' groups are not stored anywhere in Hadoop
        • Hadoop refers to an external system to determine group membership
        • NN/JT/Oozie/Hive servers all must perform group mapping
      • Default plugins for user/group mapping:
        • ShellBasedUnixGroupsMapping – forks/runs `/bin/id`
        • JniBasedUnixGroupsMapping – makes a system call
        • LdapGroupsMapping – talks directly to an LDAP server
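Conceptually, ShellBasedUnixGroupsMapping just forks `id` and parses its output. Hadoop's actual plugin is Java; this Python sketch only mirrors the mechanism:

```python
import subprocess

def get_groups(user=None):
    """Return the Unix group names for a user (current user if None)
    by forking `id -Gn`, as the shell-based mapping plugin does."""
    cmd = ["id", "-Gn"] + ([user] if user else [])
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return out.stdout.split()

print(get_groups())  # groups of the current OS user
```

The operational consequence is the one the slide implies: every server that authorizes requests (NN, JT, Oozie, Hive) must see the same group database, or permission checks will disagree between services.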
  15. IT Integration: Kerberos + LDAP. [Diagram: Hadoop cluster with a local KDC holding service principals (hdfs/host1@HADOOP.EXAMPLE.COM, yarn/host2@HADOOP.EXAMPLE.COM, …); a central Active Directory holding user principals (tucu@EXAMPLE.COM, atm@EXAMPLE.COM, …); cross-realm trust between the two; NN and JT performing LDAP group mapping]
  16. IT Integration: Web Interfaces
      • Most web interfaces authenticate using SPNEGO
        • Standard HTTP authentication protocol
        • Used internally by services which communicate over HTTP
        • Most browsers support Kerberos SPNEGO authentication
      • Hadoop components which use servlets for web interfaces can plug in a custom filter
        • Integrate with an intranet SSO HTTP solution
  17. Deployment Recommendations
      • Security configuration is a PITA
      • Do only what you really need
        • Enable cluster security (Kerberos) only if un-trusted groups of users are sharing the cluster
        • Otherwise use edge security to keep outsiders out
        • Only enable wire encryption if required
        • Only enable web interface authentication if required
  18. Deployment Recommendations
      • Secure Hadoop bring-up order:
        1. HDFS RPC (including SNN check-pointing)
        2. JobTracker RPC
        3. TaskTracker RPC & LinuxTaskController
        4. Hadoop web UIs
        5. Configure monitoring to work with security
        6. Other services (HBase, Oozie, Hive Metastore, etc.)
        7. Continue with authorization and network encryption if needed
      • Recommended: use an admin/management tool
        • Several inter-related configuration knobs
        • To manage principal/keytab creation and distribution
        • Automatically configures monitoring for security
  19. Q&A
  20. Thanks. ATM (Cloudera) & Tucu (Cloudera). Hadoop Summit, June 2013
  21. Security Capabilities

      | Component | Client Protocol | Authentication | Proxy User | Authorization | Confidentiality | Auditing |
      |---|---|---|---|---|---|---|
      | Hadoop HDFS | RPC | Kerberos | Yes | FS permissions | SASL | Yes |
      | Hadoop HDFS | Data Transfer | SASL | No | FS permissions | SASL | No |
      | Hadoop WebHDFS | HTTP | Kerberos SPNEGO plus pluggable | Yes | FS permissions | N/A | Yes |
      | Hadoop MapReduce (Pig, Hive, Sqoop, Crunch, Cascading) | RPC | Kerberos | Yes (requires job config work) | Job & Queue ACLs | SASL | No |
      | Hive Metastore | Thrift | Kerberos | Yes | FS permissions | N/A | Yes |
      | Oozie | HTTP | Kerberos SPNEGO plus pluggable | Yes | Job & Queue ACLs and FS permissions | SSL (HTTPS) | Yes |
      | HBase | RPC/Thrift/HTTP | Kerberos | Yes | Table ACLs | SASL | No |
      | Zookeeper | RPC | Kerberos | No | znode ACLs | N/A | No |
      | Impala | Thrift | Kerberos | No | Hive policy file | N/A | No |
      | Hue | HTTP | Pluggable | No | Job & Queue ACLs and FS permissions | HTTPS | No |
      | Flume | Avro RPC | N/A | No | N/A | N/A | No |