2. DataApps Corporation 2
About me
●
Focus on securing Hadoop clusters
●
Apache Hadoop Committer
–
Contributed features to security and HDFS
●
Founder of DataApps - http://dataApps.io
3. DataApps Corporation 3
Problem
●
Users store many kinds of data in HDFS
●
Sensitive data may be stored
without sufficient protection
●
Results in non-compliance
4. DataApps Corporation 4
Solution
●
Scan the data against patterns that match
items such as email addresses and credit card numbers
●
Analyze the results for accuracy
●
If sensitive data is identified, take action
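
The slides do not say how results are analyzed for accuracy. One common way to cut false positives for credit card matches is a Luhn checksum applied after the regex match. The sketch below is illustrative only: `CARD_CANDIDATE`, `luhn_valid`, and `find_card_numbers` are hypothetical names, not part of Chlorine.

```python
import re

# Hypothetical candidate pattern: 13 to 16 digits, optionally separated
# by single spaces or hyphens. Not taken from Chlorine.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str) -> list:
    """Keep only regex matches that also pass the checksum."""
    return [m.group(0) for m in CARD_CANDIDATE.finditer(text)
            if luhn_valid(m.group(0))]
```

A 16-digit order number that happens to match the regex is dropped unless it also passes the checksum, so fewer matches need manual review.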
5. DataApps Corporation 5
Protecting Sensitive Data
–
Restrict access to the data using permissions
and extended ACLs
–
Encrypt the data
●
HDFS Encryption for HDFS files
●
Column-level encryption for Hive columns
–
Mask the values
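
Permissions, ACLs, and encryption are provided by HDFS and Hive themselves. Masking is easier to show in a few lines. The sketch below assumes one possible masking policy: keep the first character of an email's local part and the last four digits of an SSN. The patterns and function names are hypothetical, not Chlorine's.

```python
import re

# Hypothetical patterns for illustration only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN = re.compile(r"\b\d{3}-\d{2}-(\d{4})\b")


def mask_email(match):
    """Keep the first character of the local part and the full domain."""
    local, _, domain = match.group(0).partition("@")
    return local[0] + "*" * (len(local) - 1) + "@" + domain


def mask_text(text):
    """Mask emails and SSNs in a line of text."""
    text = EMAIL.sub(mask_email, text)
    # Keep only the last four digits of the SSN.
    text = SSN.sub(r"***-**-\1", text)
    return text
```

For example, `mask_text("alice@example.com 123-45-6789")` returns `"a****@example.com ***-**-6789"`.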
6. DataApps Corporation 6
Introducing Chlorine
●
Scans HDFS directories against patterns
to detect sensitive data elements.
●
Reports sensitive elements detected.
7. DataApps Corporation 7
Data Patterns
●
Patterns are expressed as regular
expressions
●
Supports common patterns such as email,
credit card, and SSN
●
Users can add new patterns
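
A minimal sketch of a regex-based pattern registry like the one described above. The regular expressions and the names `PATTERNS`, `add_pattern`, and `scan_line` are assumptions for illustration. Chlorine's actual patterns and API are not shown in the slides.

```python
import re

# Hypothetical built-in patterns, keyed by element type.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
}


def add_pattern(name, regex):
    """Register a user-defined pattern under the given name."""
    PATTERNS[name] = re.compile(regex)


def scan_line(line):
    """Return (type, matched text) pairs for every pattern hit in a line."""
    return [(name, m.group(0))
            for name, pattern in PATTERNS.items()
            for m in pattern.finditer(line)]
```

A user-defined pattern is added the same way as a built-in one, for example `add_pattern("ipv4", r"\b(?:\d{1,3}\.){3}\d{1,3}\b")`.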
8. DataApps Corporation 8
File Formats
●
Text Files
●
All file formats supported via Hive,
including SequenceFile, Avro, ORC, and Parquet
●
Uses the Hive schema to parse input (if
available)
10. DataApps Corporation 10
Results
●
Collects all sensitive elements
●
Provides a preview of detected elements
by type.
●
Can download the full list with location
pointers.
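
The result handling above can be sketched as follows. The code assumes a location pointer is a file path plus line and column, and that the downloadable list is CSV. Neither detail is confirmed by the slides, and every name here is hypothetical.

```python
import csv
import io
import re
from collections import defaultdict

FIELDS = ["type", "value", "path", "line", "column"]


def collect(files, patterns):
    """files maps a path to its lines; returns one finding per match."""
    findings = []
    for path, lines in files.items():
        for line_no, line in enumerate(lines, start=1):
            for kind, pattern in patterns.items():
                for m in pattern.finditer(line):
                    findings.append({
                        "type": kind,
                        "value": m.group(0),
                        "path": path,            # location pointer: file
                        "line": line_no,         # 1-based line number
                        "column": m.start() + 1  # 1-based column
                    })
    return findings


def preview(findings, limit=3):
    """Return up to `limit` sample values for each element type."""
    by_type = defaultdict(list)
    for f in findings:
        if len(by_type[f["type"]]) < limit:
            by_type[f["type"]].append(f["value"])
    return dict(by_type)


def to_csv(findings):
    """Serialize the full list of findings, with locations, as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(findings)
    return buf.getvalue()
```

The preview keeps only a few values per type, while the CSV carries every finding along with where it was found.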
13. DataApps Corporation 13
Protection of Scans and Results
●
Impersonation can be enabled so that
files are scanned as the requesting user
●
Scan results stored on HDFS are
readable by the Chlorine user only
●
Users can view their scans and control
access to the scans and results
15. DataApps Corporation 15
Online Demo
●
Visit http://dataApps.io to scan a Hadoop
Cluster using Chlorine
●
Set up scans against a Hadoop cluster
preloaded with data, and view the results
●
Explore Chlorine administrative features