
Improvements in Hadoop Security



  1. Improvements in Hadoop Security. Sanjay Radia (@srr), Chris Nauroth (@cnauroth). © Hortonworks Inc. 2011
  2. Hello
     • Sanjay Radia
       – Founder, Hortonworks
       – Part of the Hadoop team at Yahoo! since 2007: Chief Architect of Hadoop Core at Yahoo!; long-time Apache Hadoop PMC member and committer; designed and developed several key Hadoop features
       – Prior: data center automation, virtualization, Java, HA, OSs, file systems (startup, Sun Microsystems, ...); Ph.D., University of Waterloo
     • Chris Nauroth
       – Member of Technical Staff, Hortonworks; Apache Hadoop committer; major contributor to HDFS ACLs
       – Hadoop user since 2010; prior employment experience deploying, maintaining and using Hadoop clusters
     (Architecting the Future of Big Data)
  3. Overview
     • Models of deployment: secure and insecure
     • Hadoop authentication: the how and why; Knox for perimeter security
     • Authorization, existing and new: HDFS; tables and Hive; HBase and Accumulo
     • Data protection and encryption: on the wire and at rest
  4. Two Reasons for Security in Hadoop
     • Hadoop contains sensitive data
       – As Hadoop adoption grows, so too have the types of data organizations look to store. Often the data is proprietary or personal, and it must be protected.
       – In this context, Hadoop is governed by the same security requirements as any data center platform.
     • Hadoop is subject to compliance requirements
       – Organizations are often required to comply with regulations such as HIPAA, PCI DSS, and FISMA that mandate protection of personal information.
       – Adherence to other corporate security policies is also required.
  5. Three Models of Hadoop Deployment
     • Insecure cluster
       – You have protected it via the perimeter, and you trust the code that runs in the system
       – Note: in a Hadoop cluster, user-submitted code runs inside the cluster (not true in typical client-server applications)
       – The client-side libraries pass the client's login credential; there is no end-to-end authentication here
       – Authorization is done against this credential
     • Secure cluster
       – Full authentication; can run arbitrary code in jobs
     • Perimeter security using Knox
       – Internal cluster can be secure or insecure depending on your needs
  6. Pillars of Hadoop Security
     • Authentication (who am I / prove it; control access to the cluster): AD/Kerberos in native Apache Hadoop; perimeter security with Apache Knox Gateway
     • Authorization: restrict access to explicit data
     • Audit (understand who did what): every service has audit logs; Knox and XASecure provide central audit logs
     • Data protection: encrypt data at rest and in motion
  7. Hadoop Authentication Overview
     • Kerberos/Active Directory based security
       – SSO: users do not have to re-login to Hadoop, and Hadoop accounts do not have to be created
       – Caveat: MR task isolation currently requires a Unix account for each user, but this is going away with Linux containers
     • Hadoop tokens supplement the Kerberos authentication
       – Delegation tokens deal with delayed job execution
       – Block tokens are capabilities that deal with the distributed nature of HDFS
     • Trusted proxies: support for third-party services to act as a proxy, e.g. Oozie and gateways (HDFS proxy, Knox, etc.)
     • Knox: perimeter security and REST gateway
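Enabling Kerberos in native Hadoop is configuration-driven; a minimal core-site.xml sketch (the property names are the standard Hadoop keys, but a real setup also needs per-service principals and keytabs):

```xml
<!-- core-site.xml: switch authentication from "simple" to Kerberos -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<!-- also enforce service-level authorization checks -->
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
```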
  8. Why the Tokens?
     • Why does Hadoop have its own tokens?
       – The standard client-server security model is not sufficient for Hadoop: it works when a logged-in client is directly accessing a Hadoop service, but a job's execution happens much later, when the submitter has long since logged off
       – Hence we needed to add delegation tokens
     • HDFS is a distributed service, so capability-like block tokens were added for DataNode authentication (the permissions/ACLs live in the NameNode)
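To make the delegation-token idea concrete, here is a minimal sketch of the pattern: the issuer (the NameNode, for HDFS) hands out an identifier plus an HMAC over it computed with a secret only the issuer holds, and can later verify a presented token without any Kerberos round trip. The names and token layout below are hypothetical simplifications, not Hadoop's actual wire format:

```python
import hashlib
import hmac
import time

MASTER_SECRET = b"namenode-rolling-secret"  # hypothetical secret held by the token issuer

def issue_token(owner: str, renewer: str, ttl_s: int = 3600) -> dict:
    """Issue a token: an identifier plus an HMAC signature over it."""
    ident = f"{owner}|{renewer}|{int(time.time()) + ttl_s}"
    sig = hmac.new(MASTER_SECRET, ident.encode(), hashlib.sha256).hexdigest()
    return {"ident": ident, "sig": sig}

def verify_token(token: dict) -> bool:
    """Verify by recomputing the HMAC and checking expiry; no fresh login needed."""
    expected = hmac.new(MASTER_SECRET, token["ident"].encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, token["sig"]):
        return False  # tampered or forged
    expiry = int(token["ident"].rsplit("|", 1)[1])
    return expiry > time.time()
```

A submitted job carries such a token so its tasks, running long after the user logged off, can still authenticate; block tokens apply the same signed-capability idea to DataNode access.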
  9. Apache Knox: Perimeter Security with Hadoop REST APIs
  10. The Gateway or Edge Node
     • Hadoop APIs can be used from any desktop after SSO login
       – FileSystem and MapReduce Java APIs
       – Pig, Hive and Oozie clients (which wrap the Java APIs)
     • However, it is typical to use an "edge node" or "gateway node" that is "inside" the cluster
       – The libraries for the APIs are generally only installed on the gateway
       – Users SSH to the edge node and execute API commands from a shell
  11. Perimeter Security with Apache Knox
     Incubated and led by Hortonworks, Apache Knox provides a simple and open framework for Hadoop perimeter security.
     • Single, simple point of access for a cluster: single Hadoop access point; REST API hierarchy; consolidated API calls; multi-cluster support
     • Central controls ensure consistency across one or more clusters: eliminates the SSH "edge node"; central API management; central audit control; simple service-level authorization
     • Integrated with existing systems to simplify identity maintenance: SSO integration (SiteMinder, API Key*, OAuth* and SAML*); LDAP and AD integration
  12. Hadoop REST APIs
     • Useful for connecting to Hadoop from outside the cluster
     • Useful when more client-language flexibility is required, i.e. when a Java binding is not an option
     • Challenges (which Knox addresses): the client must have knowledge of the cluster topology, and ports must be opened outside the cluster (in some cases, on every host)
     • Services and their APIs:
       – WebHDFS: HDFS user operations including reading files, writing to files, making directories, changing permissions and renaming
       – WebHCat: job control for MapReduce, Pig and Hive jobs, and HCatalog DDL commands
       – Hive: Hive REST API operations
       – HBase: HBase REST API operations
       – Oozie: job submission and management, and Oozie administration
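What makes non-Java clients easy is that these APIs are plain HTTP; WebHDFS calls, for instance, are all URLs of the form http://&lt;namenode&gt;:&lt;port&gt;/webhdfs/v1/&lt;path&gt;?op=... A small sketch of building such URLs (host and port are placeholders):

```python
from urllib.parse import urlencode

def webhdfs_url(host: str, port: int, path: str, op: str, params: dict = None) -> str:
    """Build a WebHDFS REST URL; /webhdfs/v1 is the API's fixed prefix."""
    query = urlencode({"op": op, **(params or {})})
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"

# e.g. read a file as user maya (on an insecure cluster, user.name names the caller)
url = webhdfs_url("namenode.example.com", 50070, "/sales-data", "OPEN", {"user.name": "maya"})
```

Note the challenge the slide mentions: the client had to know the NameNode's host and port, which is exactly the topology knowledge Knox hides behind a single gateway endpoint.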
  13. What can be done today?
     • Authentication: who am I / prove it; control access to the cluster
     • Authorization: restrict access to explicit data
       – Previously: service-level ACLs for all services; HDFS permissions; YARN queue ACLs; table-level control for Hive/Pig tables via HDFS; cell-level control in Apache Accumulo; namespace, table, column-family and column-level ACLs in HBase
       – Hadoop 2.x: HDFS ACLs; Hive column-level ACLs; HBase cell-level ACLs; Knox REST service-level authorization
     • Audit (understand who did what): access audit with Knox
     • Data protection: encrypt data at rest and in motion
  14. HDFS ACLs
     • The existing HDFS POSIX permissions are good, but not flexible enough: permission requirements may differ from the natural organizational hierarchy of users and groups
     • HDFS ACLs augment the existing HDFS POSIX permissions model by implementing the POSIX ACL model: an ACL (Access Control List) provides a way to set different permissions for specific named users or named groups, not only the file's owner and group
  15. HDFS File Permissions Example
     • Authorization requirements:
       – In a sales department, a single user, Maya (the department manager), should control all modifications to sales data
       – Other members of the sales department need to view the data but must not modify it
       – Everyone else in the company must not be allowed to view the data
     • This can be implemented with traditional permissions on the file with sales data: read/write for user maya, read for group sales, no access for others
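These requirements correspond to octal mode 640 (rw-r-----) with owner maya and group sales, set for example with `hdfs dfs -chown maya:sales` and `hdfs dfs -chmod 640`. A tiny sketch of how the octal mode maps to the symbolic string a listing displays:

```python
def symbolic(mode: int) -> str:
    """Render a 9-bit permission mode as the rwx string shown by ls / hdfs dfs -ls."""
    template = "rwxrwxrwx"
    return "".join(ch if mode & (1 << (8 - i)) else "-" for i, ch in enumerate(template))

# owner maya: read+write; group sales: read; others: nothing
print(symbolic(0o640))  # rw-r-----
```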
  16. HDFS ACLs
     • Problem: it is no longer feasible for Maya to control all modifications to the file
       – New requirement: Maya, Diane and Clark are allowed to make modifications
       – New requirement: a new group called executives should be able to read the sales data
       – The current permissions model only allows permissions for one user and one group
     • Solution: HDFS ACLs can now assign different permissions to different users and groups
  17. HDFS ACLs: New Tools for ACL Management (setfacl, getfacl)
     – hdfs dfs -setfacl -m group:execs:r-- /sales-data
     – hdfs dfs -getfacl /sales-data
       # file: /sales-data
       # owner: maya
       # group: sales
       user::rw-
       group::r--
       group:execs:r--
       mask::r--
       other::---
     – How do you know if a directory has ACLs set? The "+" appended to the permissions string in a listing indicates an ACL:
     – hdfs dfs -ls /sales-data
       Found 1 items
       -rw-r-----+ 3 maya sales 0 2014-03-04 16:31 /sales-data
  18. HDFS ACLs: Default ACLs
     – hdfs dfs -setfacl -m default:group:execs:r-x /monthly-sales-data
     – hdfs dfs -mkdir /monthly-sales-data/JAN
     – hdfs dfs -getfacl /monthly-sales-data/JAN
       # file: /monthly-sales-data/JAN
       # owner: maya
       # group: sales
       user::rwx
       group::r-x
       group:execs:r-x
       mask::r-x
       other::---
       default:user::rwx
       default:group::r-x
       default:group:execs:r-x
       default:mask::r-x
       default:other::---
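Conceptually, a POSIX-style ACL is evaluated in a fixed order: the owner entry, then named-user entries, then the group class (the owning group plus named groups, each filtered through the mask), then other. A toy Python sketch of that evaluation using the /sales-data ACL shown earlier (the dict layout is invented for illustration; the real logic lives in the NameNode):

```python
def apply_mask(perm: str, mask: str) -> str:
    """A permission bit survives only if both the entry and the mask grant it."""
    return "".join(p if p != "-" and m != "-" else "-" for p, m in zip(perm, mask))

def acl_check(user: str, groups: set, want: str, acl: dict) -> bool:
    """Evaluate in POSIX ACL order: owner, named users, group class (masked), other."""
    if user == acl["owner"]:
        return want in acl["owner_perm"]
    if user in acl["named_users"]:
        return want in apply_mask(acl["named_users"][user], acl["mask"])
    matched = False
    for g in groups:
        if g == acl["group"]:
            perm = acl["group_perm"]
        elif g in acl["named_groups"]:
            perm = acl["named_groups"][g]
        else:
            continue
        matched = True
        if want in apply_mask(perm, acl["mask"]):
            return True
    if matched:
        return False  # a group entry matched but did not grant the permission
    return want in acl["other"]

# the /sales-data ACL from the earlier getfacl output
sales_acl = {
    "owner": "maya", "owner_perm": "rw-",
    "named_users": {},
    "group": "sales", "group_perm": "r--",
    "named_groups": {"execs": "r--"},
    "mask": "r--",
    "other": "---",
}
```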
  19. HDFS ACLs Best Practices
     • Start with traditional HDFS permissions to implement most permission requirements
     • Define a smaller number of ACLs to handle exceptional cases
     • A file with an ACL incurs an additional memory cost in the NameNode compared to a file that has only traditional permissions
  20. Tables and Hive
  21. Table ACLs: The Challenge and Solution
     • Hive and Pig have traditionally offered full-table access control via HDFS access control
     • The challenge in column-level access control: Hive and Pig queries are executed as Tez-based tasks that access the HDFS files directly, and HDFS has no knowledge of columns (there are several file/table formats)
     • Solution for column-level ACLs:
       – Let the Hive server check and submit the query execution, and make the table accessible only by a special user ("HiveServer")
       – But one then has to restrict the UDFs and file formats; the good news is that Hive provides an authorization plugin to do this cleanly
     • Use standard SQL permission constructs: GRANT/REVOKE
     • Store the ACLs in the Hive Metastore instead of an external DB
     • But what about Pig? There is no Pig server ...
  22. Hive ATZ-NG Architecture
     [Diagram: clients (Beeline, O/JDBC, Hive CLI, Pig, HCat, Oozie, Hue, Ambari) reach HiveServer2 and the Metastore/HDFS; HiveServer2 performs authentication, then ATZ-NG authorization]
     • ATZ-NG is called for O/JDBC and the Beeline CLI (protected at column level); the Hive CLI, Pig and HCat go through the storage-based authorization provider (protected at table level)
     • Standard SQL GRANT/REVOKE for management
     • Privilege to register UDFs is restricted to the admin user
     • Policy is integrated with the table/view life cycle
     • Restrict direct access to the Metastore; protect HDFS with Kerberos and HDFS ACLs
  23. What about MR/Pig?
     • Note there is no Pig/MR server to submit and check column ACLs
     • Hence, in the same cluster running Hive, you cannot give Pig similar column-level access control
     • If Pig/MR is important, use coarse-grained table-level access, or run Pig/MR as privileged users with full table-level access
  24. Hive ATZ-NG Example
  25. Scenario
     • Objective: share the product management roadmap securely
     • Actors:
       – Admin role: specified in hive-site; controls role memberships
       – Product Management role: should be able to create and read all roadmap details; members: Vinay Shukla, Tim Hall
       – Engineering role: should be able to read (see) all roadmap details; members: Kevin Minder, Larry McCay
  26. Step 1: Admin Role Creates Roles, Adds Users
     CREATE ROLE PM;
     CREATE ROLE ENG;
     GRANT ROLE PM to user timhall with admin option;
     GRANT ROLE PM to user vinayshukla;
     GRANT ROLE ENG to user kevinminder with admin option;
     GRANT ROLE ENG to user larrymccay;
  27. Step 2: Super-user Creates Tables/Views
     create table hdp_hadoop_plans (
       id int,
       hadoop_roadmap string,
       hdp_roadmap string
     );
  28. Step 3: Users or Roles Assigned To Tables
     GRANT ALL ON hdp_hadoop_plans TO ROLE PM;
     GRANT SELECT ON hdp_hadoop_plans TO ROLE ENG;
  29. HBase Cell Level Authorization
     • The HBase permissions model already supports ACLs defined at the namespace, table, column-family and column level
       – This is sufficient to meet many requirements
       – It can be insufficient if a data model requires protection of individual rows or cells
       – Example: in medical data where each row represents a patient, who can see an individual patient's data may need customizing, and the social security number in each row may need further restriction
  30. HBase Cell Level Authorization
     • Cell-level authorization augments the permissions model by allowing ACLs specified on individual cells
       – Individual operations may choose the order of evaluation: cell-level ACLs may be evaluated first or last
       – Evaluating last is useful if the common case is access granted through table or column-family ACLs, and cell-level ACLs define exceptions for denial
       – Evaluating first is useful if many users are granted access through cell-level ACLs
  31. HBase Cell Level Authorization
     • Visibility labels
       – Visibility expressions can be stored as metadata in a cell's tag
       – A visibility expression consists of labels combined with boolean operators, e.g. (financial | strategy | research) & !newhire
       – This means that a user must be labeled financial, strategy or research, and not be a newhire, in order to see the column
       – The mapping of users to their labels is pluggable; by default, a user's labels are specified as authorizations in the individual operation
       – HBase visibility labels were inspired by similar features in Apache Accumulo, and the model will look very familiar to Accumulo users
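Since a visibility expression is just a boolean formula over label names, a toy evaluator is easy to sketch: substitute True for every label the user holds, False otherwise, and evaluate. This is illustrative only; HBase and Accumulo use real parsers, not eval:

```python
import re

KEYWORDS = {"and", "or", "not", "True", "False"}

def visible(expr: str, labels: set) -> bool:
    """Evaluate an Accumulo/HBase-style visibility expression against a user's labels."""
    py = expr.replace("&", " and ").replace("|", " or ").replace("!", " not ")
    py = re.sub(r"\b\w+\b",
                lambda m: m.group(0) if m.group(0) in KEYWORDS else str(m.group(0) in labels),
                py)
    return eval(py)  # fine for a demo; never eval untrusted input in real code
```

With the slide's example, a user labeled only financial sees the cell, while a user labeled financial and newhire does not.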
  32. What can be done today?
     • Data protection: encrypt data at rest and in motion
       – Wire encryption: in native Hadoop; with Knox; SSL for REST (2.x)
       – File encryption: via MR file format; 3rd-party encryption tools for column-level encryption; native HDFS support coming
  33. Wire Encryption, for Data in Motion
     • Hadoop client to DataNode is via the Data Transfer Protocol: the HDFS client reads/writes to the HDFS service over an encrypted channel, with configurable encryption strength
     • ODBC/JDBC client to HiveServer2: encryption is via SASL Quality of Protection
     • Map to Reduce via shuffle: shuffle is over HTTP(S); supports mutual authentication via SSL; host-name verification enabled
     • REST protocols: SSL support
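Encrypting the Data Transfer Protocol between clients and DataNodes is configuration-driven; a minimal hdfs-site.xml sketch (property names are the standard HDFS keys; values are illustrative):

```xml
<!-- hdfs-site.xml: encrypt block data in transit between client and DataNode -->
<property>
  <name>dfs.encrypt.data.transfer</name>
  <value>true</value>
</property>
<!-- configurable strength: e.g. 3des or rc4 -->
<property>
  <name>dfs.encrypt.data.transfer.algorithm</name>
  <value>3des</value>
</property>
```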
  34. Data at Rest
     • Coming: an HDFS encrypted file system is currently under development in Apache
  35. XA Secure: A Major Step Forward in Hadoop Security
     See Shaun Connolly's keynote on Wednesday, June 4
  36. Security in Hadoop with HDP + XA Secure
     • HDP 2.1 (as-is, works with current authentication methods):
       – Authentication: Kerberos in native Apache Hadoop; HTTP/REST API secured with Apache Knox Gateway
       – Authorization: MapReduce access control lists; HDFS permissions and HDFS ACLs; Hive ATZ-NG; cell-level access control in Apache Accumulo
       – Audit: audit logs in HDFS and MR
       – Data protection: wire encryption in Hadoop; orchestrated encryption with 3rd-party tools
     • XASecure (centralized security administration):
       – Authorization: fine-grained access control and RBAC for HDFS, Hive and HBase
       – Audit: centralized audit reporting; policy and access history
       – Data protection: future roadmap; strategy to be finalized
  37. Open Source?
     • Yes, the XASecure technology will be open sourced
       – Not just under an Apache license where you are forced to get the latest from Hortonworks
       – But as a full-fledged Apache project that is truly open to the community of developers and users
     • See Shaun Connolly's keynote on Wednesday for details
  38. Summary
     • Very strong authentication via Kerberos and Active Directory
       – Uses your organization's user DB and integrates with its group and role membership
       – Supplemented by Hadoop tokens; note these are necessary due to delayed job execution after the user logs off
     • Strong fine-grained authorization with some recent improvements
       – HDFS ACLs
       – Hive: integrated via the SQL model and the Hive Metastore, not a hacked-on side addon
       – HBase cell-level authorization
     • Strong encryption support, on the wire and for data at rest, with some improvements coming soon
     • Every product has audit logs
     • XASecure adds a major step forward; yes, it will be open sourced as an Apache project
  39. Thank you, Q&A
     Resources: Hortonworks Security Labs; Apache Knox project page; HDFS ACLs blog post; encrypted file system development; HBase cell-level authorization