Securing the Hadoop Ecosystem

Nowadays a typical Hadoop deployment consists of the core Hadoop components – HDFS and MapReduce – plus several other components such as HBase, HttpFS, Oozie, Pig, Hive, Sqoop, and Flume, as well as programmatic integration from external systems and applications. The result is a complex, heterogeneous distributed environment that runs across several machines, whose components use different protocols to communicate with one another, and which is used concurrently by many users and applications. When a Hadoop deployment and its ecosystem are used to process sensitive data (such as financial records, payment transactions, or healthcare records), several security requirements arise. These requirements may be dictated by internal policies and/or government regulations, and may call for strong authentication, selective authorization to access data and resources, and data confidentiality. This session covers in detail how the different components in the Hadoop ecosystem and external applications can interact with each other securely, providing authentication, authorization, and confidentiality when accessing services and when transferring data to, from, and between services. The session covers topics such as Kerberos authentication, web UI authentication, file system permissions, delegation tokens, Access Control Lists, proxy-user impersonation, and network encryption.


Transcript

  • 1. Securing the Hadoop Ecosystem ATM (Cloudera) & Tucu (Cloudera) Hadoop Summit, June 2013
  • 2. Why is Security Important? [speaker photos: ATM, Tucu]
  • 3. Agenda • Hadoop Ecosystem Interactions • Security Concepts • Authentication • Authorization • Confidentiality • Auditing • IT Infrastructure Integration • Deployment Recommendations
  • 4. Hadoop on its Own [architecture diagram: NameNode, Secondary NameNode, DataNodes, JobTracker, and TaskTrackers running Map and Reduce tasks; HttpFS; HDFS, WebHDFS, and MapReduce clients; service users hdfs, httpfs & mapred vs. end users; protocols: RPC, data transfer, HTTP]
  • 5. Hadoop and Friends [ecosystem diagram: Hadoop (HDFS, MapReduce) alongside the Hive Metastore, HBase, Oozie, Hue, Impala, Zookeeper, and Flume services; clients include MapReduce, Pig, Crunch, Cascading, Sqoop, Hive, HBase, Oozie, Impala, Flume, and the browser; service users vs. end users; protocols: RPC, data transfer, HTTP, Thrift, Avro RPC, WebHDFS]
  • 6. Authentication / Authorization • Authentication: • End users to services, as a user: user credentials • Services to services, as a service: service credentials • Services to services, on behalf of a user: service credentials + trusted service • Job tasks to services, on behalf of a user: job delegation token • Authorization: • Data: HDFS, HBase, Hive Metastore, Zookeeper • Jobs: who can submit, view, or manage jobs (MR, Pig, Oozie, Hue, …) • Queries: who can run queries (Impala)
  • 7. Confidentiality / Auditing • Confidentiality • Data at rest (on disk) • Data in transit (on the network) • Auditing • Who accessed (read/write) data • Who submitted, managed or viewed a Job or a Query
  • 8. Authentication Details • End users to services, as a user • CLI & libraries: Kerberos (kinit or keytab) • Web UIs: Kerberos SPNEGO & pluggable HTTP auth • Services to services, as a service • Credentials: Kerberos (keytab) • Services to services, on behalf of a user • Proxy-user (after Kerberos authentication as the service)
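
A minimal client-side sketch (not from the deck) of the keytab-login and proxy-user patterns described on this slide, using Hadoop's UserGroupInformation API. The principal, keytab path, and user names are illustrative assumptions:

```java
import java.security.PrivilegedExceptionAction;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class KeytabLoginSketch {
  public static void main(String[] args) throws Exception {
    final Configuration conf = new Configuration();
    conf.set("hadoop.security.authentication", "kerberos");
    UserGroupInformation.setConfiguration(conf);

    // The service authenticates as itself from its keytab (no interactive kinit needed).
    UserGroupInformation.loginUserFromKeytab(
        "myservice/host1.example.com@HADOOP.EXAMPLE.COM",
        "/etc/myservice/myservice.keytab");

    // Proxy-user: act on behalf of an end user. The NameNode only honors this if the
    // hadoop.proxyuser.myservice.hosts/groups properties mark the service as trusted.
    UserGroupInformation proxy =
        UserGroupInformation.createProxyUser("alice", UserGroupInformation.getLoginUser());

    proxy.doAs(new PrivilegedExceptionAction<Void>() {
      @Override
      public Void run() throws Exception {
        FileSystem fs = FileSystem.get(conf);          // HDFS sees the caller as 'alice'
        for (FileStatus s : fs.listStatus(new Path("/user/alice"))) {
          System.out.println(s.getPath());
        }
        return null;
      }
    });
  }
}
```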
  • 9. • HDFS Data • File System permissions (Unix like user/group permissions) • HBase Data • Read/Write Access Control Lists (ACLs) at table level • Hive Metastore (Hive, Impala) • Leverages/proxies HDFS permissions for tables & partitions • Hive Server (Hive, Impala) (coming) • More advanced GRANT/REVOKE with ACLs for tables • Jobs (Hadoop, Oozie) • Job ACLs for Hadoop Scheduler Queues, manage & view jobs • Zookeeper • ACLs at znodes, authenticated & read/write Authorization Details
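
A short sketch (paths, owner, and group are assumptions) of the HDFS file-system permissions mentioned above, managed programmatically through the FileSystem API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class HdfsPermissionsSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path dir = new Path("/data/finance");                    // illustrative path

    fs.mkdirs(dir);
    fs.setOwner(dir, "etl", "analysts");                     // owner user and group
    fs.setPermission(dir, new FsPermission((short) 0750));   // rwxr-x---

    FileStatus st = fs.getFileStatus(dir);
    System.out.println(st.getOwner() + ":" + st.getGroup() + " " + st.getPermission());
  }
}
```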
  • 10. Confidentiality Details • Data in transit • RPC: using SASL • HDFS data: using SASL • HTTP: using SSL (web UIs, shuffle); requires SSL certificates • Thrift: not available (Hive Metastore, Impala) • Avro RPC: not available (Flume) • Data at rest • Nothing out of the box • Doable via a custom ‘compression’ codec or local file system encryption
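
A sketch of the two standard wire-encryption knobs behind the RPC and HDFS data bullets, set here on a client Configuration for illustration; in practice they live in core-site.xml / hdfs-site.xml, and HTTPS for web UIs and shuffle additionally needs keystores (ssl-server.xml / ssl-client.xml):

```java
import org.apache.hadoop.conf.Configuration;

public class WireEncryptionConfig {
  public static Configuration secureWireConfig() {
    Configuration conf = new Configuration();
    // SASL quality of protection for Hadoop RPC: authentication | integrity | privacy
    conf.set("hadoop.rpc.protection", "privacy");
    // Encrypt the HDFS data-transfer protocol between clients and DataNodes
    conf.setBoolean("dfs.encrypt.data.transfer", true);
    return conf;
  }
}
```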
  • 11. Auditing Details • Who accessed (read/write) FS data • NN audit log contains all file opens and creates • NN audit log contains all metadata ops, e.g. rename, listdir • Who submitted, managed, or viewed a Job or a Query • JT, RM, and Job History Server logs contain the history of all jobs run on a cluster • Who submitted, managed, or viewed a workflow • Oozie audit logs contain the history of all user requests
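
A small helper sketch, assuming the default key=value layout of NameNode audit log lines (allowed=… ugi=… ip=… cmd=… src=… dst=… perm=…), that pulls out who ran which file-system command on which path; the sample line is fabricated for illustration:

```java
import java.util.HashMap;
import java.util.Map;

public class NnAuditLineParser {
  // Split a whitespace-separated audit line into its key=value fields;
  // tokens without '=' (e.g. the "(auth:KERBEROS)" suffix) are ignored.
  public static Map<String, String> parse(String line) {
    Map<String, String> fields = new HashMap<>();
    for (String token : line.split("\\s+")) {
      int eq = token.indexOf('=');
      if (eq > 0) {
        fields.put(token.substring(0, eq), token.substring(eq + 1));
      }
    }
    return fields;
  }

  public static void main(String[] args) {
    String sample = "allowed=true ugi=alice@EXAMPLE.COM (auth:KERBEROS) "
        + "ip=/10.1.2.3 cmd=open src=/data/finance/q1.csv dst=null perm=null";
    Map<String, String> f = parse(sample);
    System.out.println(f.get("ugi") + " -> " + f.get("cmd") + " " + f.get("src"));
  }
}
```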
  • 12. Auditing Gaps • Not all projects have explicit audit logs • Audit-like information can be extracted by processing logs • E.g. Impala query logs are distributed across all nodes • It is difficult to correlate jobs & data access • E.g. Map-Reduce jobs launched by a Pig job • E.g. HDFS data accessed by a Map-Reduce job
  • 13. IT Integration: Kerberos • Users don’t want Yet Another Credential • Corp IT doesn’t want to provision thousands of service principals • Solution: local KDC + one-way trust • Run a KDC (usually MIT Kerberos) in the cluster • Put all service principals here • Set up one-way trust of central corporate realm by local KDC • Normal user credentials can be used to access Hadoop
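
The one-way trust also needs principal-to-short-name mapping on the Hadoop side, so that users from the corporate realm get their plain local names. A minimal sketch using the standard hadoop.security.auth_to_local rule syntax; the realm names are the illustrative ones from the next slide:

```java
import org.apache.hadoop.conf.Configuration;

public class AuthToLocalConfig {
  public static Configuration withRealmMapping() {
    Configuration conf = new Configuration();
    conf.set("hadoop.security.authentication", "kerberos");
    // Strip the corporate realm so atm@EXAMPLE.COM maps to plain 'atm',
    // then fall back to the default rules for the local HADOOP.EXAMPLE.COM realm.
    conf.set("hadoop.security.auth_to_local",
        "RULE:[1:$1@$0](.*@EXAMPLE\\.COM)s/@EXAMPLE\\.COM//\n"
        + "DEFAULT");
    return conf;
  }
}
```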
  • 14. IT Integration: Groups • Much of Hadoop authorization uses “groups” • User ‘atm’ might belong to groups ‘analysts’, ‘eng’, etc. • Users’ groups are not stored anywhere in Hadoop • Hadoop refers to an external system to determine group membership • NN/JT/Oozie/Hive servers must all perform group mapping • Default plugins for user/group mapping: • ShellBasedUnixGroupsMapping – forks/runs `/bin/id` • JniBasedUnixGroupsMapping – makes a system call • LdapGroupsMapping – talks directly to an LDAP server
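
A skeletal custom group-mapping plugin, written against Hadoop's GroupMappingServiceProvider interface. It is an assumption-only toy (the bundled LdapGroupsMapping already covers the LDAP case); the hard-coded users and groups are illustrative:

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.security.GroupMappingServiceProvider;

public class StaticGroupsMapping implements GroupMappingServiceProvider {

  @Override
  public List<String> getGroups(String user) throws IOException {
    // A real implementation would query LDAP/AD or another directory here.
    if ("atm".equals(user)) {
      return Arrays.asList("analysts", "eng");
    }
    return Arrays.asList("users");
  }

  @Override
  public void cacheGroupsRefresh() throws IOException {
    // no cache in this toy example
  }

  @Override
  public void cacheGroupsAdd(List<String> groups) throws IOException {
    // no cache in this toy example
  }
}
```

The plugin is selected with the hadoop.security.group.mapping property in core-site.xml, on every daemon that performs the mapping (NN, JT, Oozie, Hive servers), per the slide above.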
  • 15. IT Integration: Kerberos + LDAP [diagram: Hadoop cluster with a local KDC holding service principals (hdfs/host1@HADOOP.EXAMPLE.COM, yarn/host2@HADOOP.EXAMPLE.COM, …); central Active Directory holding user principals (tucu@EXAMPLE.COM, atm@EXAMPLE.COM, …); cross-realm trust between the two; NN/JT perform LDAP group mapping]
  • 16. IT Integration: Web Interfaces • Most web interfaces authenticate using SPNEGO • Standard HTTP authentication protocol • Used internally by services which communicate over HTTP • Most browsers support Kerberos SPNEGO authentication • Hadoop components which use servlets for their web interfaces can plug in a custom filter • Integrate with the intranet SSO HTTP solution
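
A bare-bones example of the kind of custom servlet Filter this slide alludes to, trusting an identity header injected by an intranet SSO front end. The header name is an assumption, and the wiring (typically through a FilterInitializer registered in hadoop.http.filter.initializers) is deployment-specific:

```java
import java.io.IOException;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class SsoHeaderAuthFilter implements Filter {

  @Override
  public void init(FilterConfig filterConfig) throws ServletException {
    // nothing to initialize in this sketch
  }

  @Override
  public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
      throws IOException, ServletException {
    HttpServletRequest httpReq = (HttpServletRequest) req;
    // Trust the identity header set by the SSO proxy; reject requests without it.
    String user = httpReq.getHeader("X-SSO-User");
    if (user == null || user.isEmpty()) {
      ((HttpServletResponse) resp).sendError(HttpServletResponse.SC_UNAUTHORIZED);
      return;
    }
    chain.doFilter(req, resp);
  }

  @Override
  public void destroy() {
  }
}
```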
  • 17. Deployment Recommendations • Security configuration is a PITA • Do only what you really need • Enable cluster security (Kerberos) only if un-trusted groups of users share the cluster • Otherwise use edge security to keep outsiders out • Only enable wire encryption if required • Only enable web interface authentication if required
  • 18. Deployment Recommendations • Secure Hadoop bring-up order: 1. HDFS RPC (including SNN check-pointing) 2. JobTracker RPC 3. TaskTracker RPC & LinuxTaskController 4. Hadoop web UIs 5. Configure monitoring to work with security 6. Other services (HBase, Oozie, Hive Metastore, etc.) 7. Continue with authorization and network encryption if needed • Recommended: use an admin/management tool • There are several inter-related configuration knobs • To manage principal/keytab creation and distribution • Automatically configures monitoring for security
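
A tiny smoke-test sketch (not part of the deck) to run after step 1 of the bring-up order: confirm the client sees Kerberos security enabled and can reach secure HDFS with its ticket (obtain one first with kinit or a keytab):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class SecureHdfsSmokeTest {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    UserGroupInformation.setConfiguration(conf);

    System.out.println("security enabled: " + UserGroupInformation.isSecurityEnabled());
    System.out.println("current user:     " + UserGroupInformation.getCurrentUser());

    FileSystem fs = FileSystem.get(conf);       // fails fast if RPC auth is misconfigured
    System.out.println("/ listing ok: " + (fs.listStatus(new Path("/")).length >= 0));
  }
}
```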
  • 19. Q&A
  • 20. Thanks ATM (Cloudera) & Tucu (Cloudera) Hadoop Summit, June 2013
  • 21. Security Capabilities

    Client                                                  | Protocol        | Authentication                 | Proxy User                     | Authorization                        | Confidentiality | Auditing
    --------------------------------------------------------|-----------------|--------------------------------|--------------------------------|--------------------------------------|-----------------|---------
    Hadoop HDFS                                             | RPC             | Kerberos                       | Yes                            | FS permissions                       | SASL            | Yes
    Hadoop HDFS                                             | Data Transfer   | SASL                           | No                             | FS permissions                       | SASL            | No
    Hadoop WebHDFS                                          | HTTP            | Kerberos SPNEGO plus pluggable | Yes                            | FS permissions                       | N/A             | Yes
    Hadoop MapReduce (Pig, Hive, Sqoop, Crunch, Cascading)  | RPC             | Kerberos                       | Yes (requires job config work) | Job & Queue ACLs                     | SASL            | No
    Hive Metastore                                          | Thrift          | Kerberos                       | Yes                            | FS permissions                       | N/A             | Yes
    Oozie                                                   | HTTP            | Kerberos SPNEGO plus pluggable | Yes                            | Job & Queue ACLs and FS permissions  | SSL (HTTPS)     | Yes
    HBase                                                   | RPC/Thrift/HTTP | Kerberos                       | Yes                            | Table ACLs                           | SASL            | No
    Zookeeper                                               | RPC             | Kerberos                       | No                             | znode ACLs                           | N/A             | No
    Impala                                                  | Thrift          | Kerberos                       | No                             | Hive policy file                     | N/A             | No
    Hue                                                     | HTTP            | Pluggable                      | No                             | Job & Queue ACLs and FS permissions  | HTTPS           | No
    Flume                                                   | Avro RPC        | N/A                            | No                             | N/A                                  | N/A             | No
