Hadoop and Big Data Security
Kevin T. Smith, 11/14/2013
Ksmith <AT> Novetta . COM
Big Data Security – Why Should We Care?
New Challenges related to Data Management, Security, and Privacy
As data growth is explosive, so is the complexity of our IT environments
Many organizations are required to enforce access control & privacy restrictions on
data sets (HIPAA, privacy laws) or face steep penalties & fines
Organizations are increasingly required to enforce access control for their data
scientists based on Need-to-Know, User Authorization levels, and what data they
are allowed to see, especially in Healthcare, Finance, and Government
Organizations struggle to understand what data they can release

Mismanagement of Data Sets Is Costly
AOL Research “Data Valdez” Incident
• CNNMoney - “101 Dumbest Moments in Business”
• $5 Million Settlement, plus $100 to each member of AOL between 3/2006-5/2006,
plus $50 to each member who believed their data was in the released data; Fired
employees, CTO Resignation

The Netflix Contest Anonymized Data Set Incident
• Class-Action Lawsuit, $9 Million Settlement

Massachusetts Hospital Record Incident

Cyber Security Attacks are on the Rise
Ponemon Institute – the Average Cost of a Data Breach in the U.S. is $5.4 Million*
PlayStation (2011) – Experts predict costs between 2.2 and 2.4 Billion
* (Breach Study: Global Analysis, May 2013)
A (Brief) History of Hadoop Security
Hadoop developed without Security in Mind
Originally No Security model
No authentication of users or services
Anyone could submit arbitrary code to be executed
Authorization was added later, but any user could
impersonate other users with a command-line switch

In 2009, Yahoo! focused on Hadoop Authentication and did a Hadoop Redesign, But…
Resulting Security Model is Complex
Security Configuration is complex & Easy to Mess Up
No Data at Rest Encryption
Kerberos-Centric
Limited Authorization Capabilities
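
Because the configuration is easy to get wrong, a small automated check can help. The following is a minimal sketch, not an official Hadoop tool: it parses a core-site.xml fragment and reports settings that differ from what a Kerberos-secured cluster is assumed to use. The property names `hadoop.security.authentication` and `hadoop.security.authorization` are real Hadoop settings; the sample file content, the `EXPECTED` table, and the function names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical core-site.xml fragment used only for this example.
CORE_SITE = """<configuration>
  <property>
    <name>hadoop.security.authentication</name>
    <value>simple</value>
  </property>
  <property>
    <name>hadoop.security.authorization</name>
    <value>true</value>
  </property>
</configuration>"""

# Values this sketch assumes a secured cluster should carry.
EXPECTED = {
    "hadoop.security.authentication": "kerberos",
    "hadoop.security.authorization": "true",
}


def load_properties(xml_text):
    """Return a dict of property name -> value from a Hadoop-style XML config."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def find_misconfigurations(xml_text):
    """List (name, actual, expected) for every setting that deviates from EXPECTED."""
    props = load_properties(xml_text)
    problems = []
    for name, wanted in EXPECTED.items():
        actual = props.get(name)  # None when the property is missing entirely
        if actual != wanted:
            problems.append((name, actual, wanted))
    return problems


if __name__ == "__main__":
    for name, actual, wanted in find_misconfigurations(CORE_SITE):
        print(f"{name}: found {actual!r}, expected {wanted!r}")
```

A real deployment has many more security-related settings spread across several files; the point of the sketch is only that the checks can be scripted rather than done by eye.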

Things are Changing, But Slowly..

It is important to understand how Hadoop Security is Currently Implemented & Configured
It is important to understand how to meet your organization’s security requirements
Hadoop Security Data Flow
Distributed Security is a Challenge
Since the 0.20.20x distributions of Hadoop, much of the model is Kerberos-centric,
as you see to the right
Model is quite complex, as you will see on the next slide
Token Delegation & Hadoop Security Flow
Token – Used For
Kerberos TGT – Kerberos initial authentication to KDC.
Kerberos service ticket – Kerberos initial authentication between users, client
processes, and services.
Delegation token – Token issued by the NameNode to the client, used by the client
or any services working on the client’s behalf to authenticate them to the NameNode.
Block Access token – Token issued by the NameNode after validating authorization to
a particular block of data, based on a shared secret with the DataNode. Clients (and
services working on the client’s behalf) use the Block Access token to request
blocks from the DataNode.
Job token – Issued by the JobTracker to TaskTrackers. Tasks communicating with
TaskTrackers for a particular job use this token to prove they are associated with
the job.
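
The Block Access token idea described above (the NameNode vouches for an access decision, and the DataNode trusts it because both hold the same secret) can be sketched with a keyed hash. This is an illustration of the principle only, not Hadoop's actual token format or code: the JSON payload, the choice of HMAC-SHA256, the secret value, and the function names are all assumptions made for this example.

```python
import hashlib
import hmac
import json
import time

# Secret assumed to be known to both the NameNode and the DataNodes (illustrative).
SHARED_SECRET = b"example-shared-secret"


def issue_block_token(user, block_id, modes, lifetime_s=600, now=None):
    """NameNode side: sign a grant after the authorization check has passed."""
    now = time.time() if now is None else now
    payload = json.dumps(
        {
            "user": user,
            "block": block_id,
            "modes": sorted(modes),
            "expires": now + lifetime_s,
        },
        sort_keys=True,
    ).encode("utf-8")
    mac = hmac.new(SHARED_SECRET, payload, hashlib.sha256).hexdigest()
    return payload, mac


def verify_block_token(payload, mac, block_id, mode, now=None):
    """DataNode side: recompute the MAC, then check block, mode, and expiry."""
    now = time.time() if now is None else now
    expected = hmac.new(SHARED_SECRET, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, mac):
        return False  # forged or tampered token
    grant = json.loads(payload)
    return (
        grant["block"] == block_id
        and mode in grant["modes"]
        and now < grant["expires"]
    )


if __name__ == "__main__":
    payload, mac = issue_block_token("alice", "blk_1001", {"READ"})
    print(verify_block_token(payload, mac, "blk_1001", "READ"))
    print(verify_block_token(payload, mac, "blk_1002", "READ"))
```

The design point is that the DataNode never contacts the NameNode to validate a request: possession of the shared secret is enough to check that the grant is genuine, unexpired, and scoped to the requested block.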
Some Vendor Activity in Hadoop Security
Seems to be a New One Every Week!

Cloudera Sentry – Fine Grained Access Control for Apache Hive & Cloudera Impala
IBM InfoSphere Optim Data Masking – provides “de-identification” of data by
obfuscating corporate secrets; Guardium provides monitoring & auditing
Intel’s Secure Hadoop Distribution – Encryption in transit & at rest, Granular access
control with HBase
DataStax Enterprise – Encryption in Transit & at Rest (using Cassandra for storage)
DataGuise for Hadoop – Detects & protects sensitive data, setting access
permission, masking or encrypting data, authorization based access
Knox Gateway (Hortonworks) – Perimeter security, integration with IDAM
environments, manage security across multiple clusters – now an Apache Project
Protegrity – Big Data Protector provides Encryption & tokenization, Enterprise
Security Administrator provides central policy, key mgmt, auditing, reporting
Sqrrl – Builds on Apache Accumulo’s security capabilities for Hadoop
Zettaset Secure Orchestrator – security wrapper around Hadoop
Apache Accumulo
• Cell-Level Access Control via visibility
• By default, uses its own db for users & credentials
• Can be extended in code to use other Identity & Access Management Infrastructure
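
To make cell-level visibility concrete: each cell carries a label such as `admin&audit` or `(admin&audit)|analyst`, and a reader sees the cell only if their authorizations satisfy that expression. The following is a simplified, self-contained evaluator written for illustration; it is not Accumulo's implementation or API. The function names are hypothetical, and the sketch assumes labels built from alphanumeric terms, `&`, `|`, and parentheses, with parentheses required whenever `&` and `|` are mixed.

```python
import re

# One token: an alphanumeric/underscore term or one of the symbols & | ( )
TOKEN_RE = re.compile(r"\s*([A-Za-z0-9_]+|[&|()])")


def tokenize(expression):
    """Split a visibility expression into terms and operator symbols."""
    expression = expression.strip()
    tokens, pos = [], 0
    while pos < len(expression):
        match = TOKEN_RE.match(expression, pos)
        if not match:
            raise ValueError(f"bad character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def is_visible(expression, authorizations):
    """Return True if the set of authorizations satisfies the expression."""
    tokens = tokenize(expression)
    if not tokens:
        return True  # an empty label is treated as visible to everyone
    pos = 0

    def parse_term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == "(":
            value = parse_expr()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return value
        if token in ("&", "|", ")"):
            raise ValueError(f"unexpected token {token!r}")
        return token in authorizations

    def parse_expr():
        nonlocal pos
        value = parse_term()
        operator = None
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            if operator is None:
                operator = tokens[pos]
            elif tokens[pos] != operator:
                raise ValueError("mixing & and | requires parentheses")
            pos += 1
            right = parse_term()
            value = (value and right) if operator == "&" else (value or right)
        return value

    result = parse_expr()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result


if __name__ == "__main__":
    print(is_visible("(admin&audit)|analyst", {"analyst"}))
    print(is_visible("admin&audit", {"admin"}))
```

In this model the label travels with the data, so the same table can serve users with different clearances without maintaining separate copies.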
Project Rhino
Intel launched this open source effort to improve security capabilities of Hadoop &
contributed code to Apache in early 2013.
Encrypted Data at Rest - JIRA Tasks HADOOP-9331 (Hadoop Crypto Codec
Framework and Crypto Codec Implementation) and MAPREDUCE-5025 (Key
Distribution and Management for Supporting Crypto Codec in MapReduce).
ZOOKEEPER-1688 will provide the ability for transparent encryption of snapshots
and commit logs on disk, protecting against the leakage of sensitive information
from files at rest.
Token-Based Authentication & Unified Authorization Framework - JIRA Tasks
HADOOP-9392 (Token-Based Authentication and Single Sign-On) and HADOOP-9466
(Unified Authorization Framework)
Improved Security in HBase - The JIRA Task HBASE-6222 (Add Per-KeyValue
Security) adds cell-level authorization to HBase – something that Apache Accumulo
has but HBase does not. HBASE-7544 builds on the encryption framework being
developed, extending it to HBase, providing transparent table encryption.
What’s the Best Guidance Now?
Identify and Understand the Sensitivity Levels of Your Data
Are there access control policies associated with your data?

Understand the Impact of the Release of Your Data
Netflix example – Could someone couple your data with open source data to
gain new (and unintended) insight?
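
The risk in the question above can be shown with a small linkage sketch: a released data set with names removed is joined to a public data set on shared attributes (quasi-identifiers). All records below are made up for illustration, and the field names and the `link_records` function are assumptions of this example rather than anything taken from the incidents mentioned.

```python
from collections import defaultdict

# Attributes that appear in both data sets (illustrative choice).
QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")

# Made-up "anonymized" release: no names, but quasi-identifiers remain.
RELEASED = [
    {"zip": "00001", "birth_date": "1970-01-01", "sex": "F", "diagnosis": "asthma"},
    {"zip": "00002", "birth_date": "1980-02-02", "sex": "M", "diagnosis": "diabetes"},
    {"zip": "00003", "birth_date": "1990-03-03", "sex": "F", "diagnosis": "migraine"},
]

# Made-up public list (for example a directory) that includes names.
PUBLIC = [
    {"name": "Person A", "zip": "00001", "birth_date": "1970-01-01", "sex": "F"},
    {"name": "Person B", "zip": "00002", "birth_date": "1980-02-02", "sex": "M"},
    {"name": "Person C", "zip": "00002", "birth_date": "1980-02-02", "sex": "M"},
]


def link_records(released, public, keys=QUASI_IDENTIFIERS):
    """Pair each released record with a name when exactly one public record matches."""
    index = defaultdict(list)
    for person in public:
        index[tuple(person[k] for k in keys)].append(person)
    matches = []
    for record in released:
        candidates = index.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match re-identifies the record
            matches.append((candidates[0]["name"], record))
    return matches


if __name__ == "__main__":
    for name, record in link_records(RELEASED, PUBLIC):
        print(name, "->", record["diagnosis"])
```

Running the same kind of join against your own data before release shows how many records are unique on their quasi-identifiers, which is a practical first measure of re-identification risk.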

Develop Policies & Procedures relating to Security & Privacy of Your Data Sets
Data Ingest
Access Control within Your Organization
Cleansing/Sanitization/Destruction
Auditing
Monitoring Procedures
Incident Response

Develop a Technical Security Approach that Complements Hadoop Security
Questions?
Ksmith <AT> Novetta.COM
