Hadoop and Big Data Security

Hadoop and Big Data Security - Kevin T. Smith

  1. Hadoop and Big Data Security
     Kevin T. Smith, 11/14/2013
     Ksmith <AT> Novetta.COM
  2. Big Data Security – Why Should We Care?
     • New challenges related to data management, security, and privacy: as data growth is explosive, so is the complexity of our IT environments.
     • Many organizations are required to enforce access control and privacy restrictions on data sets (HIPAA, privacy laws) – or face steep penalties and fines.
     • Organizations are increasingly required to restrict their data scientists' access based on need-to-know, user authorization levels, and what data they are allowed to see – especially in healthcare, finance, and government.
     • Organizations struggle to understand what data they can release.
     • Mismanagement of data sets is costly:
       • AOL Research "Data Valdez" incident – CNNMoney's "101 Dumbest Moments in Business"; $5 million settlement, plus $100 to each AOL member between 3/2006 and 5/2006 and $50 to each member who believed their data was in the released data set; fired employees and a CTO resignation.
       • The Netflix contest anonymized data set incident – class-action lawsuit, $9 million settlement.
       • The Massachusetts hospital record incident.
     • Cyber security attacks are on the rise:
       • Ponemon Institute – the average cost of a data breach in the U.S. is $5.4 million.*
       • PlayStation (2011) – experts predict costs between $2.2 and $2.4 billion.
     * (Breach Study: Global Analysis, May 2013)
  3. A (Brief) History of Hadoop Security
     • Hadoop was developed without security in mind:
       • Originally no security model – no authentication of users or services; anyone could submit arbitrary code to be executed.
       • Authorization was added later, but any user could impersonate other users with a command-line switch.
     • In 2009, Yahoo! focused on Hadoop authentication and did a Hadoop redesign, but the resulting security model is complex:
       • Security configuration is complex and easy to get wrong.
       • No data-at-rest encryption.
       • Kerberos-centric.
       • Limited authorization capabilities.
     • Things are changing, but slowly:
       • It is important to understand how Hadoop security is currently implemented and configured.
       • It is important to understand how to meet your organization's security requirements.
  4. Hadoop Security Data Flow
     • Distributed security is a challenge.
     • Since the 0.20.20x distributions of Hadoop, much of the model is Kerberos-centric, as you see to the right.
     • The model is quite complex, as you will see on the next slide.
  5. Token Delegation & Hadoop Security Flow
     • Kerberos TGT – Kerberos initial authentication to the KDC.
     • Kerberos service ticket – Kerberos authentication between users, client processes, and services.
     • Delegation token – issued by the NameNode to the client; used by the client, or any services working on the client's behalf, to authenticate them to the NameNode.
     • Block Access token – issued by the NameNode after validating authorization to a particular block of data, based on a secret shared with the DataNode. Clients (and services working on the client's behalf) use the Block Access token to request blocks from the DataNode.
     • Job token – issued by the JobTracker to TaskTrackers. Tasks communicating with TaskTrackers for a particular job use this token to prove they are associated with the job.
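The Block Access token mechanism above – a secret shared between the NameNode and the DataNodes – is essentially a keyed MAC over the authorization decision. The following is a minimal illustrative sketch of that idea in Python, not Hadoop's actual token format; the field names and the shared secret are hypothetical:

```python
import hashlib
import hmac

# Hypothetical shared secret; in a real cluster this key is distributed by
# the NameNode to the DataNodes and rolled periodically.
SHARED_SECRET = b"namenode-datanode-shared-secret"

def issue_block_token(user: str, block_id: str, mode: str) -> dict:
    """NameNode side: after authorizing the user, sign the token fields."""
    payload = f"{user}:{block_id}:{mode}".encode()
    mac = hmac.new(SHARED_SECRET, payload, hashlib.sha256).hexdigest()
    return {"user": user, "block_id": block_id, "mode": mode, "mac": mac}

def verify_block_token(token: dict) -> bool:
    """DataNode side: recompute the MAC with the shared secret; any change
    to the user, block, or access mode invalidates the token."""
    payload = f"{token['user']}:{token['block_id']}:{token['mode']}".encode()
    expected = hmac.new(SHARED_SECRET, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, token["mac"])

token = issue_block_token("alice", "blk_1073741825", "READ")
print(verify_block_token(token))      # True: untampered token
token["block_id"] = "blk_1073741826"  # client tries to reuse it for another block
print(verify_block_token(token))      # False: MAC no longer matches
```

Because the DataNode only needs the shared secret to verify the MAC, it can serve block requests without contacting the NameNode on every read.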
  6. Some Vendor Activity in Hadoop Security
     There seems to be a new one every week!
     • Cloudera Sentry – fine-grained access control for Apache Hive & Cloudera Impala.
     • IBM InfoSphere Optim Data Masking – Optim Data Masking provides "de-identification" of data by obfuscating corporate secrets; Guardium provides monitoring & auditing.
     • Intel's secure Hadoop distribution – encryption in transit & at rest, granular access control with HBase.
     • DataStax Enterprise – encryption in transit & at rest (using Cassandra for storage).
     • DataGuise for Hadoop – detects & protects sensitive data: setting access permissions, masking or encrypting data, authorization-based access.
     • Knox Gateway (Hortonworks) – perimeter security, integration with IDAM environments, managing security across multiple clusters – now an Apache project.
     • Protegrity – Big Data Protector provides encryption & tokenization; Enterprise Security Administrator provides central policy, key management, auditing, and reporting.
     • Sqrrl – builds on Apache Accumulo's security capabilities for Hadoop.
     • Zettaset Secure Orchestrator – a security wrapper around Hadoop.
  7. Apache Accumulo
     • Cell-level access control via visibility labels.
     • By default, uses its own database for users & credentials.
     • Can be extended in code to use other identity & access management infrastructure.
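Accumulo's cell-level model attaches a boolean visibility expression over authorization labels to each cell; a scan returns the cell only if the user's authorizations satisfy the expression. Below is a toy evaluator sketching the idea for a simplified grammar of `&`, `|`, and parentheses – Accumulo's real `ColumnVisibility` parser is stricter (for example, it requires parentheses when mixing operators):

```python
import re

def is_visible(expression: str, authorizations: set) -> bool:
    """Evaluate a simplified visibility expression such as
    "admin&finance" or "(audit|admin)&phi" against a user's labels."""
    tokens = re.findall(r"[A-Za-z0-9_]+|[&|()]", expression)
    pos = 0

    def parse_or():  # lowest precedence: a|b
        nonlocal pos
        value = parse_and()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            value = parse_and() or value
        return value

    def parse_and():  # higher precedence: a&b
        nonlocal pos
        value = parse_term()
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            value = parse_term() and value
        return value

    def parse_term():  # a label, or a parenthesized sub-expression
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            value = parse_or()
            pos += 1  # consume ")"
            return value
        return tok in authorizations

    return parse_or()

print(is_visible("admin&finance", {"admin", "finance"}))  # True
print(is_visible("(audit|admin)&phi", {"admin"}))         # False: lacks phi
```

The key design point is that the check happens server-side during the scan, so data a user is not cleared for never leaves the tablet server.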
  8. Project Rhino
     Intel launched this open source effort to improve the security capabilities of Hadoop and contributed code to Apache in early 2013.
     • Encrypted data at rest – JIRA tasks HADOOP-9331 (Hadoop Crypto Codec Framework and Crypto Codec Implementation) and MAPREDUCE-5025 (Key Distribution and Management for Supporting Crypto Codec in MapReduce). ZOOKEEPER-1688 will provide transparent encryption of snapshots and commit logs on disk, protecting against the leakage of sensitive information from files at rest.
     • Token-based authentication & unified authorization framework – JIRA tasks HADOOP-9392 (Token-Based Authentication and Single Sign-On) and HADOOP-9466 (Unified Authorization Framework).
     • Improved security in HBase – JIRA task HBASE-6222 (Add Per-KeyValue Security) adds cell-level authorization to HBase – something Apache Accumulo has but HBase does not. HBASE-7544 builds on the encryption framework being developed, extending it to HBase to provide transparent table encryption.
  9. What’s the Best Guidance Now?
     • Identify and understand the sensitivity levels of your data – are there access control policies associated with your data?
     • Understand the impact of the release of your data – Netflix example: could someone couple your data with open source data to gain new (and unintended) insight?
     • Develop policies & procedures relating to the security & privacy of your data sets:
       • Data ingest
       • Access control within your organization
       • Cleansing / sanitization / destruction
       • Auditing
       • Monitoring procedures
       • Incident response
     • Develop a technical security approach that complements Hadoop security.
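One piece of the policies-and-procedures bullet – cleansing/sanitization at data ingest – can be sketched as a policy-driven field masker, in the spirit of the masking tools mentioned on the vendor slide. This is a minimal illustration under assumed conventions; the policy table, field names, and actions are hypothetical, and a real deployment would drive them from a managed policy store:

```python
import hashlib

# Hypothetical per-field policy: what to do with each field at ingest time.
MASKING_POLICY = {
    "ssn":   "redact",    # drop the value entirely
    "email": "tokenize",  # replace with a consistent, irreversible token
    "name":  "keep",      # pass through unchanged
}

def tokenize(value: str) -> str:
    # Same input -> same token, so joins across records still work,
    # but the original value is not exposed.
    return "tok_" + hashlib.sha256(value.encode()).hexdigest()[:12]

def sanitize_record(record: dict) -> dict:
    """Apply the masking policy to one record; unknown fields default to
    redaction (default-deny)."""
    out = {}
    for field, value in record.items():
        action = MASKING_POLICY.get(field, "redact")
        if action == "keep":
            out[field] = value
        elif action == "tokenize":
            out[field] = tokenize(value)
        # "redact": omit the field entirely
    return out

raw = {"name": "Pat Doe", "ssn": "123-45-6789", "email": "pat@example.com"}
print(sanitize_record(raw))  # ssn dropped, email tokenized, name kept
```

Defaulting unknown fields to redaction keeps newly added upstream fields from leaking before someone has classified them.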
  10. Questions?
      Ksmith <AT> Novetta.COM