Big Data Warehousing Meetup: Securing the Hadoop Ecosystem by Cloudera


In our recent Big Data Warehousing Meetup, we discussed Data Governance, Compliance and Security in Hadoop.

As the Big Data paradigm becomes more commonplace, we must apply enterprise-grade governance to critical data that is highly regulated and subject to stringent compliance requirements. Caserta and Cloudera shared techniques and tools that enable data governance, compliance and security on Big Data.

For more information, visit

  • Proxy-user setup: the relying party is configured to recognize super-users who are allowed to impersonate end users

    1. Securing the Hadoop Ecosystem
       Patrick Angeles, Big Data Warehouse Meetup, Feb 10, 2014
    2. Why is Security Important?
    3. About Me
       • Hadooping for 5+ years
       • Responsible for several secure Hadoop deployments
       • Did e-commerce and consumer analytics (PCI, PII, etc.)
       • Crypto and PKI in a previous life
    4. Why Secure Hadoop?
       • Multi-tenancy: you want your cluster to store data and run workloads from multiple users and groups
       • Compliance: you have policies on which personnel can view what data
    5. Agenda
       • Hadoop Ecosystem Interactions
       • Security Concepts
       • Security in Practice
         • IT Infrastructure Integration
         • Deployment Recommendations
    6. Hadoop on its Own
       (Diagram: end users reach the cluster through WebHdfs, HttpFS, HDFS and MapReduce clients; the cluster runs the NN, SNN and JT, with DNs and TTs hosting Map and Reduce tasks; service users are hdfs, httpfs and mapred; protocols are RPC, data transfer and HTTP.)
    7. Hadoop and Friends
       (Diagram: end users and service users interact through clients such as Pig, Hive, Crunch, Cascading, MapReduce, Sqoop, Flume and the browser-based Hue; services include Hadoop, HBase, Zookeeper, Oozie, WebHdfs, Hive Metastore, Flume and Impala; protocols are RPC, data transfer, HTTP, Thrift and Avro-RPC.)
    8. Security Concepts
       • Authentication
       • Authorization
       • Confidentiality
         • Encryption
       • Auditing
         • Traceability
    9. Authentication
       • End users to services, as a user
         • CLI & libraries: Kerberos (kinit or keytab)
         • Web UIs: Kerberos SPNEGO & pluggable HTTP auth
         • MR tasks use delegation tokens
       • Services to services, as a service
         • Credentials: Kerberos (keytab)
         • Client SSL certificates (for shuffle encryption)
       • Services to services, on behalf of a user
         • Proxy-user (after Kerberos for service)
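The Kerberos and proxy-user settings above live in Hadoop's core-site.xml. A minimal sketch; the `oozie` proxy-user entries, host name, and group names are illustrative assumptions, not values from the talk:

```xml
<!-- core-site.xml -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value> <!-- default is "simple" (no authentication) -->
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value> <!-- enable service-level authorization checks -->
</property>
<!-- Proxy-user: allow the "oozie" service principal to impersonate end users
     from the listed hosts/groups (placeholder values). -->
<property>
  <name>hadoop.proxyuser.oozie.hosts</name>
  <value>oozie-host.example.com</value>
</property>
<property>
  <name>hadoop.proxyuser.oozie.groups</name>
  <value>analysts,eng</value>
</property>
```

Constraining both the hosts and the groups a super-user may impersonate limits the blast radius if the service credential is compromised.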
    10. Authorization
       • HDFS data
         • File system permissions (Unix-like user/group permissions)
       • HBase data
         • Read/write Access Control Lists (ACLs) at table level
       • Hive Server 2 and Impala
         • Fine-grained authorization through Apache Sentry (Incubating)
       • Jobs (Hadoop, Oozie)
         • Job ACLs for Hadoop scheduler queues, manage & view jobs
       • Zookeeper
         • ACLs at znodes, authenticated & read/write
    11. Confidentiality
       • Data in transit
         • RPC: using SASL
         • HDFS data: using SASL
         • HTTP: using SSL (web UIs, shuffle); requires SSL certs
       • Data at rest
         • Nothing out of the box
         • Doable by: custom ‘compression’ codec or local file system encryption
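The in-transit options above map to a handful of configuration properties. A sketch, assuming MRv2-era property names:

```xml
<!-- core-site.xml -->
<property>
  <name>hadoop.rpc.protection</name>
  <value>privacy</value> <!-- SASL QOP: authentication | integrity | privacy -->
</property>
<!-- hdfs-site.xml -->
<property>
  <name>dfs.encrypt.data.transfer</name>
  <value>true</value> <!-- encrypt the DataNode block-transfer protocol -->
</property>
<!-- mapred-site.xml -->
<property>
  <name>mapreduce.shuffle.ssl.enabled</name>
  <value>true</value> <!-- needs keystores/truststores in ssl-server.xml / ssl-client.xml -->
</property>
```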
    12. Auditing
       • Who accessed (read/write) FS data
         • NN audit log contains all file opens and creates
         • NN audit log contains all metadata ops, e.g. rename, listdir
       • Who submitted, managed, or viewed a job or a query
         • JT, RM, and Job History Server logs contain the history of all jobs run on a cluster
       • Who submitted, managed, or viewed a workflow
         • Oozie audit logs contain the history of all user requests
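As a rough illustration of working with the NN audit log, here is a small Python sketch that splits one audit entry into key=value fields. The sample line is modeled on typical audit output; the exact format varies by Hadoop version, so treat it as an assumption:

```python
import re

# Minimal parser for HDFS NameNode audit-log lines. The line layout
# (tab-separated key=value pairs after "FSNamesystem.audit:") is an
# assumption modeled on common Hadoop 1.x/2.x output.
AUDIT_RE = re.compile(r"FSNamesystem\.audit: (.*)$")

def parse_audit_line(line):
    """Return a dict of key=value fields from one NN audit entry, or None."""
    m = AUDIT_RE.search(line)
    if not m:
        return None
    fields = {}
    for pair in m.group(1).split("\t"):
        key, _, value = pair.partition("=")
        fields[key.strip()] = value.strip()
    return fields

sample = ("2014-02-10 10:15:00,123 INFO FSNamesystem.audit: "
          "allowed=true\tugi=patrick (auth:KERBEROS)\tip=/10.1.2.3\t"
          "cmd=open\tsrc=/data/events.log\tdst=null\tperm=null")
entry = parse_audit_line(sample)
print(entry["cmd"], entry["src"])
```

A script like this, run over the aggregated NN logs, answers the "who read what" question directly from the audit trail.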
    13. Auditing Gaps
       • Not all projects have explicit audit logs
         • Audit-like information can be extracted by processing logs
         • E.g. Impala query logs are distributed across all nodes
       • It is difficult to correlate jobs & data access
         • E.g. Map-Reduce jobs launched by a Pig job
         • E.g. HDFS data accessed by a Map-Reduce job
         • Tools written on top of Hadoop can do this well
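Correlating a Pig script with the MapReduce jobs it launched often starts with simply collecting job IDs out of the client log. A hedged Python sketch; the log lines below are invented for illustration:

```python
import re

# MapReduce job IDs have the shape job_<cluster-timestamp>_<sequence>.
JOB_ID_RE = re.compile(r"\bjob_\d+_\d{4,}\b")

def extract_job_ids(log_text):
    """Collect distinct MapReduce job IDs from a log, in first-seen order."""
    seen = []
    for job_id in JOB_ID_RE.findall(log_text):
        if job_id not in seen:
            seen.append(job_id)
    return seen

# Illustrative (not real) Pig client output:
pig_log = """\
Pig script launched job_201402100800_0042
... job_201402100800_0042 completed
Pig script launched job_201402100800_0043
"""
print(extract_job_ids(pig_log))
```

The extracted IDs can then be looked up in the JT/RM or Job History Server logs to tie the higher-level script to its cluster-level data access.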
    14. Security in Practice
    15. Integration: Kerberos
       • Users don’t want Yet Another Credential
       • Corp IT doesn’t want to provision thousands of service principals
       • Solution: local KDC + one-way trust
         • Run a KDC (usually MIT Kerberos) in the cluster
         • Put all service principals here
         • Set up one-way trust of the central corporate realm by the local KDC
       • Normal user credentials can be used to access Hadoop
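The local-KDC-plus-trust layout might look like the following krb5.conf sketch on cluster hosts. Realm and host names are placeholders; the shared cross-realm key lives in a krbtgt/HADOOP.EXAMPLE.COM@EXAMPLE.COM principal created with the same password in both KDCs:

```ini
# krb5.conf (illustrative realm and host names)
[realms]
  HADOOP.EXAMPLE.COM = {
    kdc = kdc.hadoop.example.com
    admin_server = kdc.hadoop.example.com
  }
  EXAMPLE.COM = {
    kdc = ad.example.com
  }

[domain_realm]
  .hadoop.example.com = HADOOP.EXAMPLE.COM
  .example.com = EXAMPLE.COM
```

Because the trust is one-way, corporate users can obtain tickets for cluster services, but cluster service principals never leave the local realm.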
    16. Integration: Groups
       • Much of Hadoop authorization uses “groups”
         • User ‘patrick’ might belong to groups ‘analysts’, ‘eng’, etc.
       • Users’ groups are not stored in Hadoop anywhere
         • Hadoop refers to an external system to determine group membership
         • NN/JT/Oozie/Hive servers all must perform group mapping
       • Default plugins for user/group mapping:
         • ShellBasedUnixGroupsMapping – forks/runs `/bin/id`
         • JniBasedUnixGroupsMapping – makes a system call
         • LdapGroupsMapping – talks directly to an LDAP server
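Selecting the LDAP plugin is a core-site.xml change. A sketch; the server URL, bind DN and search base are placeholders, not values from the talk:

```xml
<!-- core-site.xml -->
<property>
  <name>hadoop.security.group.mapping</name>
  <value>org.apache.hadoop.security.LdapGroupsMapping</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.url</name>
  <value>ldap://ad.example.com:389</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.bind.user</name>
  <value>cn=hadoop-svc,ou=services,dc=example,dc=com</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.base</name>
  <value>dc=example,dc=com</value>
</property>
```

This setting must be consistent on every server that performs group mapping (NN, JT, Oozie, Hive), or the same user can resolve to different groups on different services.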
    17. Integration: Kerberos + LDAP
       (Diagram: a central Active Directory holds user principals such as me@EXAMPLE.COM and provides LDAP group mapping; the Hadoop cluster’s NN and JT authenticate against a local KDC holding service principals such as hdfs/host1@HADOOP.EXAMPLE.COM and yarn/host2@HADOOP.EXAMPLE.COM; a cross-realm trust links the two.)
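With principals arriving from two realms, Hadoop usually needs hadoop.security.auth_to_local rules to map them to local short names. The rules below are an illustrative sketch for the realm names in the diagram:

```xml
<!-- core-site.xml: map Kerberos principals to local short names.
     Rules are assumptions written for the EXAMPLE.COM realms above. -->
<property>
  <name>hadoop.security.auth_to_local</name>
  <value>
    RULE:[1:$1@$0](.*@EXAMPLE\.COM)s/@.*//
    RULE:[2:$1@$0](.*@HADOOP\.EXAMPLE\.COM)s/@.*//
    DEFAULT
  </value>
</property>
```

With these rules, both me@EXAMPLE.COM and hdfs/host1@HADOOP.EXAMPLE.COM resolve to plain local user names, which is what the group-mapping and permission checks operate on.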
    18. Integration: Web Interfaces
       • Most web interfaces authenticate using SPNEGO
         • Standard HTTP authentication protocol
         • Used internally by services which communicate over HTTP
         • Most browsers support Kerberos SPNEGO authentication
       • Hadoop components which use servlets for web interfaces can plug in a custom filter
         • Integrate with an intranet SSO HTTP solution
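Enabling SPNEGO on the Hadoop web UIs is configured through the hadoop.http.authentication.* properties. A sketch; the keytab path and realm are placeholders:

```xml
<!-- core-site.xml -->
<property>
  <name>hadoop.http.authentication.type</name>
  <value>kerberos</value> <!-- or the class name of a custom handler/filter -->
</property>
<property>
  <name>hadoop.http.authentication.kerberos.principal</name>
  <value>HTTP/_HOST@HADOOP.EXAMPLE.COM</value> <!-- _HOST expands per node -->
</property>
<property>
  <name>hadoop.http.authentication.kerberos.keytab</name>
  <value>/etc/hadoop/conf/http.keytab</value>
</property>
```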
    19. Recommendations
       • Security configuration is a PITA
         • Do only what you really need
       • Enable cluster security (Kerberos) only if un-trusted groups of users are sharing the cluster
         • Otherwise use edge-security to keep outsiders out
       • Only enable wire encryption if required
       • Only enable web interface authentication if required
    20. Security Enablement
       • Secure Hadoop enablement order:
         1. HDFS RPC (including SNN check-pointing)
         2. JobTracker RPC
         3. TaskTrackers RPC & LinuxTaskController
         4. Hadoop web UI
         5. Configure monitoring to work with security
         6. Other services (HBase, Oozie, Hive Metastore, etc.)
         7. Continue with authorization and network encryption if needed
    21. Administration
       • Use an admin/management tool
         • Several inter-related configuration knobs
         • To manage principals/keytabs creation and distribution
         • Automatically configures monitoring for security
    22. Q&A