Hadoop and Big Data Security
Kevin T. Smith, 11/14/2013
ksmith <AT> novetta.com
Big Data Security – Why Should We Care?
New Challenges related to Data Management, Security, and Privacy
As data growth explodes, so does the complexity of our IT environments
Many organizations are required to enforce access control & privacy restrictions on
data sets (HIPAA, Privacy Laws) – or face steep penalties & fines
Organizations are increasingly required to restrict their data scientists' access
based on Need-to-Know, User Authorization levels, and what data they
are allowed to see – especially in Healthcare, Finance, and Government
Organizations are struggling to understand what data they can release
Mismanagement of Data Sets Is Costly
AOL Research “Data Valdez” Incident
• CNNMoney - “101 Dumbest Moments in Business”
• $5 Million Settlement, plus $100 to each member of AOL between 3/2006–5/2006,
+ $50 to each member who believed their data was in the released data; fired
employees, CTO resignation
The Netflix Contest Anonymized Data Set Incident
• Class-Action Lawsuit, $9 Million Settlement
Massachusetts Hospital Record Incident
Cyber Security Attacks are on the Rise
Ponemon Institute* – the average cost of a data breach in the U.S. is $5.4 Million
PlayStation (2011) – experts predicted costs between $2.2 and $2.4 Billion
* (Cost of Data Breach Study: Global Analysis, May 2013)
A (Brief) History of Hadoop Security
Hadoop developed without Security in Mind
Originally No Security model
No authentication of users or services
Anyone could submit arbitrary code to be executed
Later, authorization was added, but any user could
impersonate other users with a command-line switch
In 2009, Yahoo! focused on Hadoop
authentication and redesigned Hadoop security, but…
Resulting Security Model is Complex
Security Configuration is complex & Easy to Mess Up
No Data at Rest Encryption
Limited Authorization Capabilities
Things are Changing, But Slowly..
Hadoop Security is a Challenge
It is important to understand the Hadoop Security Data Flow
Since the 0.20.20x releases of Hadoop, much of the model is Kerberos-centric
The model is quite complex, as you will see on the next slide
Token Delegation & Hadoop Security Flow
Kerberos TGT – initial Kerberos authentication to the KDC.
Kerberos Service Tickets – Kerberos authentication between users,
client processes, and services.
Delegation Token – issued by the NameNode to the client,
used by the client or any services working on the client's behalf
to authenticate them to the NameNode.
Block Access Token – issued by the NameNode after validating
authorization to a particular block of data, based on a shared
secret with the DataNode. Clients (and services working on the
client's behalf) use the Block Access Token to request blocks
from the DataNode.
Job Token – issued by the JobTracker to TaskTrackers. Tasks
communicating with TaskTrackers for a particular job use this
token to prove they are associated with the job.
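The shared-secret scheme behind the Block Access Token can be sketched in a few lines. This is a toy illustration of the idea only – the NameNode signs a claim that DataNodes can verify without a callback – with invented names and formats, not Hadoop's actual token code or wire protocol.

```python
import hmac
import hashlib

# Illustrative shared secret; in Hadoop this is distributed to DataNodes
# by the NameNode and rolled periodically.
SHARED_SECRET = b"namenode-datanode-shared-secret"

def issue_block_token(user: str, block_id: str, access: str):
    """'NameNode' side: after checking HDFS permissions, sign the claim."""
    claim = f"{user}:{block_id}:{access}".encode()
    sig = hmac.new(SHARED_SECRET, claim, hashlib.sha256).hexdigest()
    return claim, sig

def verify_block_token(claim: bytes, sig: str) -> bool:
    """'DataNode' side: recompute the HMAC and compare in constant time."""
    expected = hmac.new(SHARED_SECRET, claim, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)

claim, sig = issue_block_token("alice", "blk_1001", "READ")
print(verify_block_token(claim, sig))                      # True
print(verify_block_token(b"mallory:blk_1001:READ", sig))   # False
```

A tampered claim fails verification because the DataNode's recomputed HMAC no longer matches the NameNode's signature.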
Some Vendor Activity in Hadoop Security
Seems to be a New One Every Week!
Cloudera Sentry – Fine Grained Access Control for Apache Hive & Cloudera Impala
IBM InfoSphere Optim Data Masking – Optim Data Masking provides "de-identification"
of data by obfuscating corporate secrets; Guardium provides monitoring & auditing
Intel’s Secure Hadoop Distribution – Encryption in transit & at rest, Granular access
control with HBase
DataStax Enterprise – Encryption in Transit & at Rest (using Cassandra for storage)
DataGuise for Hadoop – Detects & protects sensitive data, setting access
permission, masking or encrypting data, authorization based access
Knox Gateway (Hortonworks) – Perimeter security, integration with IDAM
environments, manage security across multiple clusters – now an Apache Project
Protegrity – Big Data Protector provides Encryption & tokenization, Enterprise
Security Administrator provides central policy, key mgmt, auditing, reporting
Sqrrl – Builds on Apache Accumulo's security capabilities for Hadoop
• Cell-Level Access Control via visibility labels
• By default, Accumulo uses its own db for users & credentials
• Can be extended in code to use other Identity & Access Management
Zettaset Secure Orchestrator – security wrapper around Hadoop
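The cell-level visibility idea can be sketched as a small label evaluator: each cell carries a boolean expression over security labels, and a user sees the cell only if their authorizations satisfy it. This simplified version handles only OR-of-AND expressions; Accumulo's real ColumnVisibility also supports parentheses and quoting.

```python
def visible(expression: str, authorizations: set) -> bool:
    """Toy evaluator for Accumulo-style visibility labels.

    Supports an OR ('|') of AND ('&') terms, e.g. "admin|finance&audit".
    A cell with an empty label is visible to everyone.
    """
    if not expression:
        return True
    for term in expression.split("|"):
        # The term matches only if the user holds every required label.
        if all(tok in authorizations for tok in term.split("&")):
            return True
    return False

user_auths = {"finance", "audit"}
print(visible("admin|finance&audit", user_auths))  # True
print(visible("admin", user_auths))                # False
```

At scan time the server filters out cells whose expression evaluates false, so row-level results differ per user without any change to the query.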
Project Rhino
Intel launched this open source effort to improve the security capabilities of Hadoop
and contributed code to Apache in early 2013.
Encrypted Data at Rest – JIRA Tasks HADOOP-9331 (Hadoop Crypto Codec
Framework and Crypto Codec Implementation) and MAPREDUCE-5025 (Key
Distribution and Management for Supporting Crypto Codec in MapReduce).
ZOOKEEPER-1688 will provide the ability for transparent encryption of snapshots
and commit logs on disk, protecting against the leakage of sensitive information
from files at rest.
Token-Based Authentication & Unified Authorization Framework – JIRA
Tasks HADOOP-9392 (Token-Based Authentication and Single Sign-On) and
HADOOP-9466 (Unified Authorization Framework)
Improved Security in HBase - The JIRA Task HBASE-6222 (Add Per-KeyValue
Security) adds cell-level authorization to HBase – something that Apache Accumulo
has but HBase does not. HBASE-7544 builds on the encryption framework being
developed, extending it to HBase, providing transparent table encryption.
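One way to picture the key-management problem these tasks address is per-file key derivation: derive a distinct data-encryption key for each file from a single master key, so only the master key needs central protection. The sketch below is a generic HMAC-based illustration of that idea, not the actual HADOOP-9331/MAPREDUCE-5025 design.

```python
import hmac
import hashlib

# Illustrative master key; in practice this would live in a key
# management service, never in source code.
MASTER_KEY = b"cluster-master-key-from-a-real-KMS"

def file_key(path: str) -> bytes:
    """Derive a stable 256-bit data-encryption key for one file path.

    Deterministic: the same path always yields the same key, so any
    authorized reader can re-derive it; different paths yield
    independent keys, limiting the blast radius of a leaked key.
    """
    return hmac.new(MASTER_KEY, path.encode(), hashlib.sha256).digest()

k1 = file_key("/data/patients.csv")
k2 = file_key("/data/claims.csv")
print(len(k1), k1 != k2)  # 32 True
```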
What’s the Best Guidance Now?
Identify and Understand the Sensitivity Levels of Your Data
Are there access control policies associated with your data?
Understand the Impact of the Release of Your Data
Netflix example – Could someone couple your data with open source data to
gain new (and unintended) insight?
Develop Policies & Procedures relating to Security & Privacy of Your Data
Access Control within Your Organization
Develop a Technical Security Approach that Complements Hadoop Security
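The Netflix point above – coupling your released data with openly available data – can be demonstrated with a toy linkage join: a rare combination of attributes (a quasi-identifier) re-identifies an "anonymized" record. All records here are invented for illustration.

```python
# "Anonymized" release: user IDs instead of names, looks safe in isolation.
anonymized = [
    {"user": 1887, "movie": "Brazil", "date": "2005-07-04", "rating": 5},
    {"user": 2041, "movie": "Heat",   "date": "2005-08-19", "rating": 3},
]

# Public data, e.g. reviews posted elsewhere under real names.
public_reviews = [
    {"name": "J. Doe", "movie": "Brazil", "date": "2005-07-04"},
]

# Join on (movie, date): a rare pair acts as a quasi-identifier and
# links the public identity to the "anonymous" user ID and rating.
reidentified = [
    (p["name"], a["user"], a["rating"])
    for a in anonymized
    for p in public_reviews
    if (a["movie"], a["date"]) == (p["movie"], p["date"])
]
print(reidentified)  # [('J. Doe', 1887, 5)]
```

The defense is to assess, before release, which attribute combinations are rare enough to link against outside data sets.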