This is the talk I gave at the Brussels Data Science meetup on 17/06/2015, focusing on information security & governance technical controls available to implementors within Hadoop-core.
2. Rob Gibbon
■ Architect @ Big Industries, Belgium
■ Focus on designing, deploying & integrating web-scale solutions with Hadoop
■ Deliveries for clients in telco, financial services & media
3. Hadoop was built to survive data tsunamis
■ a response to challenges that enterprise vendors were unable to address
■ focused on data volumes and cost reduction
■ initially, the solution had some serious holes
5. the early days
■ Multiple SPoFs (single points of failure)
■ No authentication
■ Easily spoofed authorisation
■ No encryption of data at rest or in transit
■ No accounting (no audit trail)
6. enter the hadoop vendors
■ Vendors like Cloudera focus on making Apache Hadoop “enterprise ready”
■ Includes building robust infosec controls into Hadoop core
■ Multilayer security is now available for Hadoop
7. running a cluster in non-secure mode
■ malicious|mistaken user:
■ recursively delete all the data please
■ by the way, I’m the system superuser
■ hadoop:
■ oh ok then
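In simple (non-secure) authentication mode, Hadoop trusts whatever identity the client claims, which can be set with a single environment variable. A sketch of the spoofing above, against a hypothetical cluster (path illustrative):

```shell
# With security off, Hadoop takes the user identity from the client side.
# Any shell user can claim to be the HDFS superuser:
export HADOOP_USER_NAME=hdfs

# ...and then, for example, recursively delete data they don't own:
hadoop fs -rm -r /user/someone-else/important-data
```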
10. running a secure cluster
■ Kerberos is one of the primary security controls you can use
■ Btw, what’s wrong with this Kerberos principal?
■ hdfs@BIGINDUSTRIES.BE
11. kerberos continued
■ Kerberos uses a three-part principal: primary/instance@REALM
■ hdfs/node1.cluster1.bigindustries.be@BIGINDUSTRIES.BE
■ hdfs/node1.cluster2.bigindustries.be@BIGINDUSTRIES.BE
■ Best to use explicit mappings from Kerberos principals to local users
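Explicit principal-to-user mappings are configured via `hadoop.security.auth_to_local` in core-site.xml. A sketch, with the realm from the slides and rules that are illustrative only:

```xml
<property>
  <name>hadoop.security.auth_to_local</name>
  <value>
    RULE:[2:$1@$0](hdfs@BIGINDUSTRIES.BE)s/.*/hdfs/
    RULE:[2:$1@$0](yarn@BIGINDUSTRIES.BE)s/.*/yarn/
    DEFAULT
  </value>
</property>
```

Each `RULE:[2:$1@$0]` matches two-component principals (e.g. `hdfs/node1.cluster1.bigindustries.be@BIGINDUSTRIES.BE`), rewrites them to `primary@REALM`, and the `s/.*/user/` substitution maps that to a fixed local user.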
12. hive / impala
■ HiveServer doesn’t support Kerberos => use HiveServer2
■ Best to use Sentry to enforce role-based access controls from SQL
■ Users can upload and execute arbitrary [possibly hostile] UDFs => enable Sentry
■ Older versions of Metastore don’t enforce permissions on grant_* and revoke_* APIs => stay up to date
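Sentry’s role-based grants are issued as SQL through HiveServer2. A sketch, with role, database and group names made up for illustration:

```sql
-- Create a role and scope its privileges (names are illustrative)
CREATE ROLE analyst;
GRANT SELECT ON DATABASE sales TO ROLE analyst;

-- Map the role to an OS/LDAP group; group members inherit the privileges
GRANT ROLE analyst TO GROUP analysts;
```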
14. disaster recovery
■ HDFS and HBase offer point-in-time snapshots
■ => consistency!
■ Vendor-tethered solutions for site-to-site replication are available
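HDFS snapshots are taken per directory, once an administrator has marked it snapshottable. A sketch (paths and snapshot name illustrative):

```shell
# Admin marks a directory as snapshottable
hdfs dfsadmin -allowSnapshot /data/warehouse

# An authorised user takes a named point-in-time snapshot...
hdfs dfs -createSnapshot /data/warehouse before-migration

# ...which is readable under the hidden .snapshot directory
hdfs dfs -ls /data/warehouse/.snapshot/before-migration
```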
15. encryption at rest
■ HDFS encryption zones
■ transparent to existing applications
■ minimal performance overhead on Intel architecture
■ key management is externalised
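A sketch of setting up an encryption zone, assuming a KMS is already configured as the key provider (key and path names illustrative):

```shell
# Create a key in the external KMS
hadoop key create warehouse-key

# The zone directory must exist and be empty before conversion
hdfs dfs -mkdir -p /secure/warehouse
hdfs crypto -createZone -keyName warehouse-key -path /secure/warehouse

# Verify; files under the zone are now encrypted/decrypted transparently
hdfs crypto -listZones
```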
16. wire encryption
■ SSL encryption is now available for most Hadoop services
■ Note that AES-256 for SSL and for Kerberos preauth requires extra JCE policy files on the cluster
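On Oracle JDK 7/8, the unlimited-strength JCE policy is installed by replacing two jars in the JRE on every cluster node. A sketch, assuming the standard JDK layout (the jars themselves are downloaded separately from Oracle):

```shell
# Replace the default restricted policy jars with the unlimited-strength
# versions on every node of the cluster
cp local_policy.jar US_export_policy.jar "$JAVA_HOME/jre/lib/security/"
```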
18. tokenization
■ The process of substituting a sensitive data element with a non-sensitive equivalent (e.g. replacing a real credit card number with a randomly generated surrogate of the same format)
■ 3rd party vendor solutions are available that integrate well with Hadoop
19. some places where there’s still some work to do
■ Setting up hadoop security controls is complex and time consuming
■ Not much support for SELinux around here
■ No general, coherent, policy-based framework for controlling resource access demands
■ Apache Knox is a starting point
■ => network and host resource access?
20. Integration
■ Integrating hadoop into an organisation’s services environment needs careful planning
■ Hadoop can conflict with established governance policies
■ system accounts & privileges
■ remote access
■ firewall flows
■ domains and trust
■ etc.
21. layered security in hadoop-core
■ Authentication: Kerberos
■ Authorisation: Local unix group or LDAP mappings
■ Authorisation: Sentry RBAC for hive/impala
■ Encryption: HDFS encryption zones
■ Encryption: SSL encryption for most services
■ Availability: Active/Passive failover for HDFS, YARN, HBase
■ Integrity: HDFS block replication & CRC checksums