Securing Hadoop - MapR Technologies


Historically, security hasn't been a high priority for Hadoop (a reflection of the type of data and organizations using it), but Hadoop is now being used by more traditional firms with heightened security requirements. MapR's Senior Principal Technologist, Keys Botzum, gives a talk on how you can build a secure cluster.

Published in: Technology
  • MapR provides a complete distribution for Apache Hadoop. MapR has integrated, tested and hardened a broad array of packages as part of this distribution: Hive, Pig, Oozie, Sqoop, plus additional packages such as Cascading. We have spent more than two years in a well-funded effort to provide deep architectural improvements and create the next-generation distribution for Hadoop. MapR has combined significant updates with a dozen open source packages. All of the innovations MapR has delivered maintain 100% compatibility with the Apache Hadoop APIs. This is in stark contrast with the alternative distributions from Cloudera, Hortonworks, and Apache, which are all equivalent.
  • MapR has been selected by two of the companies most experienced with MapReduce technology, which is a testament to the technology advantages of MapR’s distribution. Amazon, through its Elastic MapReduce service (EMR), hosted over 2 million clusters in the past year. Amazon selected MapR to complement EMR as the only commercial Hadoop distribution being offered, sold and supported as a service by Amazon to its customers. Google, the pioneer of MapReduce and the company whose white paper on MapReduce inspired the creation of Hadoop, has also selected MapR to make our distribution available on Google Compute Engine. Hadoop in the cloud makes a great deal of sense: the elastic resource allocation that cloud computing is premised on works well for cluster-based data processing infrastructure used on varying analyses and data sets of indeterminate size. MapR has unique features such as mirroring between sites and multi-tenancy support that further enhance cloud deployments.
  • In the initial release, the server key and CLDB key never change. The server ticket is also shared by all servers and does not expire.
  • Note: this does create a “race condition” in the install process, since every node but the first has to be set up after the first. This might be an issue with certain parallel install processes. You can work around this by generating the needed keys once somewhere (specifying the domain for the SSL certs as needed) and then copying them to all nodes at once.

    1. Securing Hadoop
       Keys Botzum, MapR Technologies, Jan 2014
       ©MapR Technologies - Confidential
    2. Why Secure Hadoop
       • Historically security wasn’t a high priority
         – Reflection of the type of data and the type of organizations using Hadoop
       • Hadoop is now being used by more traditional firms as well as organizations with high security requirements
         – Highly regulated
         – Sensitive data sets
         – People with experience with security in existing enterprise technologies (e.g., databases) are asking for the same in Hadoop
    3. Why Secure Hadoop
       • Client operating system is trusted to identify user (weak authentication)
         – If I can compromise client, I can run jobs or access HDFS as anyone
         – Think about virtual machines with root access
       • Hadoop servers trust anyone that can reach them on the network
         – Could I falsify a data node, job tracker, etc.?
       • Hive Server runs as ‘system’ user
         – All Hive Server submitted jobs run as that ‘system’ user
       • Intruders can see and modify all network traffic
    4. Apache Hadoop Security
       • Core goals
         – Authenticate network traffic
           • Users authenticate
           • Servers authenticate to each other
         – Encrypt network traffic
       • Note: Hadoop also has a lot of authorization functionality which I’m not discussing here
    5. Apache Hadoop Security
       • Kerberos as core authentication technology
         – Kerberos to access HDFS, JT, Oozie, etc.
         – Kerberos for server to server traffic
       • But Kerberos doesn’t fit perfectly with Hadoop model
         – Introduce delegation tokens for carrying identity in many scenarios
       • Kerberos is complicated
         – Need Kerberos identity for every server in the cluster
           • Lots to manage!
         – Every user needs a Kerberos identity to access cluster, Web UIs, etc.
         – Lots of steps
    6. Ecosystem Kerberos
       • Ecosystem components also generally rely on Kerberos
         – Need to create appropriate Kerberos SPNEGO identities for many services (Web UI access)
         – Need to create service Kerberos identity for cluster access for many services, often for each node
         – Lots to manage
       • HBase, Oozie, Hive Server 2, Hive Meta Server, Flume, etc.
    7. Apache Hadoop Security – Additional Items
       • Kerberos only part of the puzzle
       • More steps (some examples)
         – Configure Web UI HTTPS
         – Configure Encrypted Shuffle
         – Configure Hive Server 2
           • Authentication using LDAP or Kerberos
           • Impersonation
             – Authenticate to HS2 (userid/password or Kerberos)
             – HS2 executes job using secure impersonation on cluster
             – Now job runs as submitting user and can see/modify only what user can
           • Encryption
             – SSL can be used to protect userid & password authentication to HS2
    8. MapR Distribution for Apache Hadoop
       • Complete Hadoop distribution
       • Comprehensive management suite
       • Industry-standard interfaces
       • Enterprise-grade dependability
       • Higher performance
       • Ease of Use
    9. The Cloud Leaders Pick MapR
       • Google chose MapR to provide Hadoop on Google Compute Engine
       • Amazon EMR is the largest Hadoop provider in revenue and # of clusters
    10. MapR Security
        • Build on the work of the Apache community, but with improvements
        • Goals
          – Authenticate network traffic
            • Users authenticate
            • Servers authenticate to each other
          – Encrypt network traffic
          – Low performance overhead
          – Simple and easy to administer
    11. MapR Native Security
        • Hadoop security without Kerberos
          – But borrow heavily from Kerberos design
        • Kerberos integration if desired
    12. Architecture
        • Shared secrets like Kerberos
          – Managed at cluster level
        • Identity represented using a ticket which is issued by MapR CLDB servers (Container Location DataBase)
    13. Tickets
        • A ticket represents a valid authenticated identity
        • Contains
          – An expiration time, renewal lifetime, and creation time
          – A randomly generated secret key
          – Information about the identity: userid, group ids
        • A client authenticates to servers using the ticket
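As a rough illustration of the ticket contents listed on this slide, the Python sketch below models a ticket as a plain record. The `Ticket` class, its field names, the `issue_ticket` helper, and the 32-byte key length are all hypothetical; the slides do not describe MapR's actual ticket format.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Ticket:
    """Hypothetical model of the fields named on the slide (not MapR's real format)."""
    userid: int
    group_ids: Tuple[int, ...]
    creation_time: float      # seconds since the epoch
    expiration_time: float    # seconds since the epoch
    renewal_lifetime: float   # seconds; exact semantics are an assumption
    secret_key: bytes         # randomly generated per ticket

    def is_valid(self, now: float) -> bool:
        # A ticket is usable from its creation until (not including) its expiry.
        return self.creation_time <= now < self.expiration_time


def issue_ticket(userid: int, group_ids: Iterable[int], lifetime_s: float,
                 renewal_s: float, now: Optional[float] = None) -> Ticket:
    """Build a ticket with a fresh random key (32 bytes is an assumed size)."""
    created = time.time() if now is None else now
    return Ticket(
        userid=userid,
        group_ids=tuple(group_ids),
        creation_time=created,
        expiration_time=created + lifetime_s,
        renewal_lifetime=renewal_s,
        secret_key=secrets.token_bytes(32),
    )
```

Passing `now` explicitly keeps the expiry check deterministic, which is convenient when experimenting with the sketch.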
    14. User Experience
        • User invokes maprlogin
          – maprlogin connects to CLDB (over https)
            • Provide userid & password (or Kerberos ticket) for validation by CLDB
          – Ticket is returned and saved in a file in /tmp that is accessible only by the owning user; the file name is /tmp/maprticket_<uid>
        • MapR PAM module
          – Optional MapR provided PAM module creates MapR tickets automatically during Unix login
        • All processes automatically pick up ticket (nothing to do)
          – Java and C/C++ clients implicitly look for valid ticket and use it
          – Clients optionally use existing Kerberos identity to get MapR ticket
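The ticket file convention on this slide (a file named /tmp/maprticket_<uid>, readable only by its owner) can be sketched in Python as follows. The helper names `ticket_path`, `save_ticket`, and `is_owner_only` are hypothetical, and the 0600 mode is an assumption about what "accessible only by owning user" means in practice; this is not how maprlogin itself is implemented.

```python
import os
import stat


def ticket_path(uid: int, tmp_dir: str = "/tmp") -> str:
    """Return the per-user ticket file name described on the slide."""
    return os.path.join(tmp_dir, f"maprticket_{uid}")


def save_ticket(path: str, blob: bytes) -> None:
    """Write the ticket bytes so that only the owning user can read or write them."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        # Enforce the mode even if the file already existed with looser bits.
        os.fchmod(fd, 0o600)
        os.write(fd, blob)
    finally:
        os.close(fd)


def is_owner_only(path: str) -> bool:
    """True when neither group nor other has any permission bit set."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0
```

A client could call `is_owner_only(ticket_path(os.getuid()))` before trusting a ticket file, refusing to use one that other users could have read.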
    15. Client First Contact
        • Client sends the ticket and data encrypted using secret key
        • Receiving server
          – Validates ticket, including expiration
          – Extracts identity information from ticket and uses that for authorization
          – Returns encrypted response to client
        • Notice that MapR user identity is independent of host or operating system identity
    16. Server First Contact
        • When a trusted server starts it uses a local server ticket to authenticate to the CLDB
          – CLDB verifies the ticket’s authenticity using secret key
          – CLDB returns a server key that is used to create and validate user tickets
          – The server is now a trusted member of the cluster
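The checks a receiving server performs on a user ticket (verify authenticity with the server key, check expiration, extract the identity) can be sketched in Python as below. This is a simplified stand-in, not MapR's scheme: the slides describe AES-256 in GCM mode, whereas this sketch uses HMAC-SHA256 from the standard library and only authenticates the ticket without encrypting it. The JSON layout and the function names `seal_ticket` and `validate_ticket` are hypothetical.

```python
import hashlib
import hmac
import json
from typing import Dict, List

TAG_LEN = 32  # length of an HMAC-SHA256 tag in bytes


def seal_ticket(server_key: bytes, userid: int, group_ids: List[int],
                expires_at: float) -> bytes:
    """Create a ticket blob: authentication tag followed by a JSON payload."""
    payload = json.dumps(
        {"userid": userid, "group_ids": group_ids, "expires_at": expires_at},
        sort_keys=True,
    ).encode("utf-8")
    tag = hmac.new(server_key, payload, hashlib.sha256).digest()
    return tag + payload


def validate_ticket(server_key: bytes, blob: bytes, now: float) -> Dict[str, object]:
    """Return the identity carried by the ticket, or raise ValueError."""
    if len(blob) <= TAG_LEN:
        raise ValueError("ticket too short")
    tag, payload = blob[:TAG_LEN], blob[TAG_LEN:]
    expected = hmac.new(server_key, payload, hashlib.sha256).digest()
    # Constant-time comparison avoids leaking how many tag bytes matched.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("ticket failed authenticity check")
    fields = json.loads(payload)
    if now >= fields["expires_at"]:
        raise ValueError("ticket expired")
    # Only the identity is handed on for authorization decisions.
    return {"userid": fields["userid"], "group_ids": fields["group_ids"]}
```

Because the identity comes from the validated ticket rather than from the connecting host, the sketch mirrors the point that user identity is independent of host or operating system identity.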
    17. Maprlogin
        • Primary user visible security tool
        • Actions are
          – password - authenticate to a MapR cluster using a valid password
          – kerberos - authenticate to a MapR cluster using Kerberos
          – print - print information on your existing credentials
          – authtest - test authentication as a generic client
          – end / logout - logout of cluster
          – renew - renew existing ticket
        • For example:
            % maprlogin password
            [Password for user 'fred' at cluster '': ]
            MapR credentials of user 'fred' for cluster '' are written to '/tmp/maprticket_1001'
    18. Maprlogin – Under the Covers
        [Diagram components: maprlogin; MapR CLDB; LDAP/Kerberos/NIS; hadoop fs -ls /; FileServer/CLDB]
        1. username/passwd sent on https
        2. uses PAM to authenticate
        3. ticket + user key returned
        4. ticket + key saved in file in /tmp
        5. cmd picks up ticket + key from file
        6. client sends RPC encrypted with user-key + ticket
        7. server decrypts ticket to authenticate user and checks permissions on ACL
    19. Cryptography
        • Encrypted using current NIST standards
          – AES-256 in GCM mode for encryption and signing
            • NIST standard
          – Leverage Intel hardware encryption where available, software otherwise
        • Use the open source crypto++ library for our C++ cryptography
        • Random number generation
          – Use secure random number generation as documented here: …l#_details
    20. MapR Security – More by Default
        • By default, out of the box
          – HS2 supports password authentication
            • Can configure Kerberos and SSL function, same as from Apache, including secure impersonation
          – Oozie supports MapR ticket authentication
            • Can configure Kerberos and SSL function, same as from Apache, including secure impersonation
          – MapR Tables (HBase APIs) use native MapR security, no configuration needed
          – Most Web UIs enhanced to support userid & password authentication and HTTPS
            • Can configure Kerberos SPNEGO, same as from Apache
    21. Encrypted Shuffle (?)
        • No need to special case encrypting shuffle
        • MapR-FS is store for Map output
          – Shuffle inherits the same encryption, authentication, and authorization functionality of the rest of MapR-FS
    22. Let’s Build a Secure Cluster!
        • Node 1
            apt-get install mapr….
            … -C … -Z … -secure -genkeys
          – Generates all needed keys for MapR-RPC as well as for HTTPS
        • Node N
            apt-get install mapr….
            scp rootORmapr@node1:/opt/mapr/conf/{cldb.key,maprserverticket,ssl_keystore,ssl_truststore} /opt/mapr/conf
            … -C … -Z … -secure
        • Clients
            apt-get install mapr…
            scp anyuser@nodeN:/opt/mapr/conf/ssl_truststore /opt/mapr/conf
            … -secure
    23. MapR Advantage
        • Vastly simpler
          – Core secured by default in one step
          – No requirement for Kerberos in core and associated complexity
        • Easier integration
          – Leverage existing Linux authentication (PAM and NSSwitch)
        • Faster
          – Leverage Intel AES hardware cryptography
    24. Further Reading
        • MapR
        • MapR Native Security
        • Adding Security to Apache Hadoop
        • The Evolution of Hadoop’s Security Model
    25. Thank You
    26. Appendix
    27. Key Design Elements
        • User authentication and authorization information obtained using standard operating system information
          – PAM and nsswitch
        • MapR specific shared secret keys
          – Easier to manage
          – No dependencies on complex external security systems
          – Better performance
        • MapR servers (running as ‘mapr’) have access to maprserverticket and are therefore privileged processes
        • MapR-RPC altered to encrypt and authenticate traffic
        • Maprsasl created for Apache Java code to leverage similar security
          – Leverages same keys, authentication model, etc.
          – Reuses the C/C++ code via JNI
    28. Persistent Keys and Tickets
        [Diagram: CLDB/ZK 1 … CLDB/ZK N, each marked K; Node 1, Node 2 … Node N]
    29. Example: Job Tracker Integration
        [Diagram: JobClient submits job (maprsasl) to JobTracker; JobTracker schedules job (maprsasl) on TaskTracker; File system]
        1. JC copies job conf securely to FS
        2. JT creates user ticket
        3. TT fetches ticket
        4. TT launches job using ticket identity
        • JT can create user tickets. TT copies ticket to private job directory on local disk. taskcontroller copies it to user private local disk dir and tasks set MAPR_TICKET_LOCATION to that place.
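How a task might locate its ticket can be sketched in Python as below. The slides name the `MAPR_TICKET_LOCATION` variable and the default `/tmp/maprticket_<uid>` file, but the lookup order shown here (environment variable first, default file second) is an assumption, and `resolve_ticket_location` is a hypothetical helper rather than part of any MapR client API.

```python
import os
from typing import Mapping, Optional


def resolve_ticket_location(uid: int,
                            env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the ticket file a process should use.

    Assumed order: an explicit MAPR_TICKET_LOCATION wins, otherwise fall
    back to the per-user default under /tmp.
    """
    env = os.environ if env is None else env
    override = env.get("MAPR_TICKET_LOCATION")
    if override:
        return override
    return f"/tmp/maprticket_{uid}"
```

Taking the environment as a parameter makes the fallback easy to exercise without modifying the real process environment.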
    30. Creating a Secure Cluster
        • On first node run … -genkeys, it creates some keys
          – CLDB key (cldb.key)
          – Ticket for nodes (maprserverticket)
          – SSL certificates (ssl_keystore & ssl_truststore)
        • Additional nodes
          – Copy all to other CLDB and ZK nodes
          – Copy all but the CLDB key to remaining nodes
          – Run …
        • On a client
          – Copy SSL truststore from any server node
          – Run …
        • No requirement for Kerberos configuration
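The copy rules on this slide amount to a small mapping from node role to the files that node needs. The Python sketch below encodes that mapping; the role labels (`"cldb"`, `"zk"`, `"server"`, `"client"`) and the `files_for` helper are hypothetical names chosen for illustration, while the file names come from the slides.

```python
from typing import List

# Files created on the first node by the key generation step.
ALL_FILES = ["cldb.key", "maprserverticket", "ssl_keystore", "ssl_truststore"]


def files_for(role: str) -> List[str]:
    """Return the files to copy to a node with the given (hypothetical) role."""
    if role in ("cldb", "zk"):
        # Other CLDB and ZK nodes receive everything.
        return list(ALL_FILES)
    if role == "server":
        # Remaining cluster nodes receive everything except the CLDB key.
        return [name for name in ALL_FILES if name != "cldb.key"]
    if role == "client":
        # Clients only need to trust the cluster's SSL certificates.
        return ["ssl_truststore"]
    raise ValueError(f"unknown role: {role!r}")
```

A deployment script could loop over its inventory and copy `files_for(role)` from the first node's /opt/mapr/conf directory to each host, which keeps the CLDB key off nodes that do not need it.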