Once a proof of concept has succeeded in terms of performance and scalability, many teams start asking how Hadoop can become part of the corporate ecosystem. It is also quite common for Hadoop to store vast amounts of sensitive data, becoming a central repository (data lake) shared by multiple tenants. The challenge is to secure not a single platform but the whole framework.
In this session I would like to show how Ranger, Kerberos and the built-in Hadoop security mechanisms can help you meet some of these objectives, and to share our experience in this area.
4. Building standard for new generation digital bank
[Word cloud] Social, Harmonisation, Digitalisation, Customer Call Centres, Webservices, In the Cloud, Virtual Bank, Software as a Service, Infrastructure as a Service, Seamless, Concept of ONE, No geographical boundaries, Exception Handling, APIs, My identity, Straight through processing, Customer experience, Personalisation, Automation, Standardisation, Agile, Self Service, Mobile First, Real Time, Security, 24/7, ‘Outside in and Inside out’, Omnichannel, Zero Touch, Customer journeys, Analytics, Big Data, Digitalised branches, Cloud Platform as a service, Data Centre
13. Hortonworks Ring of Defense Architecture (hortonworks.com)
[Diagram] Components: Client, Apache Knox, Active Directory, KDC, Ranger, HiveServer 2, HDFS (A, B, C)
• Original request with user id and password
• Knox gets service ticket (ST) for Hive
• Knox runs as proxy user using Hive ST
• Use Hive ST, submit query
• Hive gets Namenode (NN) service ticket
• Hive creates map reduce using NN ST
• Client gets query result
17. IPA for central UAM
• This works great for OS
• Can this be used by Hadoop?
• Can this be used by Ranger?
26. Ranger audit
• It is recommended that you store audits in Solr and HDFS, and disable Audit to DB.
• Otherwise you can expect performance issues with Audit to DB:
  • Audit is stored in a single table
  • No partitions
  • No data retention
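The recommendation above can be checked mechanically. The following is a minimal sketch, assuming the audit destinations are controlled by the `xasecure.audit.destination.*` plugin properties; verify the exact property names against your Ranger version. The function name `check_audit_destinations` is hypothetical.

```python
def check_audit_destinations(props):
    """Return warnings for a dict of Ranger plugin audit properties.

    Assumption: destinations are toggled by xasecure.audit.destination.<name>
    with the string values "true" or "false".
    """
    def enabled(key):
        return str(props.get(key, "false")).strip().lower() == "true"

    findings = []
    if enabled("xasecure.audit.destination.db"):
        findings.append("Audit to DB is enabled; expect performance issues")
    for dest in ("solr", "hdfs"):
        if not enabled(f"xasecure.audit.destination.{dest}"):
            findings.append(f"Audit to {dest} is disabled")
    return findings


if __name__ == "__main__":
    recommended = {
        "xasecure.audit.destination.db": "false",
        "xasecure.audit.destination.solr": "true",
        "xasecure.audit.destination.hdfs": "true",
    }
    for warning in check_audit_destinations(recommended):
        print(warning)
```

With the recommended settings the function returns an empty list, so nothing is printed.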
28. IPA as a central UAM
• This works great for OS
• Can this be used by Hadoop? Works great for PA in IPA
• Can this be used by Ranger? Not yet. You still need to bind to LDAP.
29. Ranger KMS
One big advantage of encryption in HDFS is that even privileged users, such as the “hdfs” superuser, can be blocked from viewing encrypted data.
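As an illustration of how an encrypted area is typically set up, here is a minimal sketch that only builds the command lines for creating a key and an HDFS encryption zone. The key name `finance_key` and the path `/data/finance` are hypothetical, and the sketch does not execute anything; who is allowed to run each command depends on your KMS and HDFS policies.

```python
import shlex


def encryption_zone_commands(key_name, zone_path):
    """Build (but do not run) the commands for one HDFS encryption zone."""
    return [
        # Create the zone key in the key provider (Ranger KMS in this setup).
        ["hadoop", "key", "create", key_name],
        # The zone directory must exist and be empty before it becomes a zone.
        ["hdfs", "dfs", "-mkdir", "-p", zone_path],
        # Turn the directory into an encryption zone backed by that key.
        ["hdfs", "crypto", "-createZone", "-keyName", key_name, "-path", zone_path],
    ]


if __name__ == "__main__":
    for cmd in encryption_zone_commands("finance_key", "/data/finance"):
        print(shlex.join(cmd))
```

Whether the “hdfs” superuser can read the data then depends on the key access policy in the KMS, not on HDFS file permissions.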
30. Caveats
• Ranger (the same goes for Sentry) feels like slapped-on security
• User synchronization can be very slow with many users due to architecture issues
• Doesn’t manage HDFS ACLs and requires Hive user access, defeating end-to-end security
• Vulnerability scans just kill Ranger ;)
32. mysql> select count(*) from x_user;
+----------+
| count(*) |
+----------+
|       99 |
+----------+
1 row in set (0.00 sec)
33. mysql> select count(*) from x_group;
+----------+
| count(*) |
+----------+
|       45 |
+----------+
1 row in set (0.00 sec)
34. mysql> select count(*) from x_group_users;
+----------+
| count(*) |
+----------+
|   645697 |
+----------+
1 row in set (0.13 sec)
35. mysql> select sum(user_id) from (select count(distinct user_id) user_id
from x_group_users group by p_group_id) temp;
+--------------+
| sum(user_id) |
+--------------+
|          603 |
+--------------+
1 row in set (1.21 sec)
36. mysql>
-- Keep the row with the lowest id for each (p_group_id, user_id) pair and delete the duplicates.
delete from x_group_users where id not in
(
select minid from
(select min(id) as minid from x_group_users group by p_group_id, user_id) as temp
);
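The cleanup above can be tried safely on a toy table first. The following is a minimal sketch that uses an in-memory SQLite database instead of the Ranger MySQL schema; only the table and column names (`x_group_users`, `id`, `p_group_id`, `user_id`) come from the queries above, and the sample rows and the helper name `dedupe_group_users` are made up for illustration.

```python
import sqlite3


def dedupe_group_users(conn):
    """Delete duplicate memberships, keeping the lowest id per pair.

    Returns the row counts before and after the cleanup.
    """
    before = conn.execute("SELECT COUNT(*) FROM x_group_users").fetchone()[0]
    conn.execute(
        "DELETE FROM x_group_users WHERE id NOT IN ("
        " SELECT MIN(id) FROM x_group_users GROUP BY p_group_id, user_id)"
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM x_group_users").fetchone()[0]
    return before, after


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE x_group_users ("
    " id INTEGER PRIMARY KEY, p_group_id INTEGER, user_id INTEGER)"
)
# Six rows, but only three distinct (group, user) memberships.
rows = [(1, 10), (1, 10), (1, 10), (1, 11), (2, 10), (2, 10)]
conn.executemany(
    "INSERT INTO x_group_users (p_group_id, user_id) VALUES (?, ?)", rows
)

before, after = dedupe_group_users(conn)
print(before, after)  # 6 3
```

On the real table, take a backup and stop usersync before running the delete, since the statement rewrites membership data in place.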
37. Make it better
• https://issues.apache.org/jira/browse/RANGER-827
usersync SSSD integration (sync explicitly specified group)
• https://issues.apache.org/jira/browse/HADOOP-12751
allow users with domain suffix (avoid naming collision)
• https://issues.apache.org/jira/browse/HIVE-12981
the same for Hive
• https://issues.apache.org/jira/browse/RANGER-842
PAM integrated authentication for Ranger
38. Ambari integration with IPA
• https://github.com/HariSekhon/tools/blob/master/ambari_freeipa_kerberos_setup.pl
39. Other upcoming features (0.6)
• Tag-based policies
• Geolocation-based policies
• Deny and exclude policies
• Hive Metastore plugin
41. Some takeaway tips
• Install updates on a regular basis
• Isolate your cluster from the rest of the network
• Kerberize your cluster
• Secure the user interfaces
• Enable HDFS ACLs (dfs.namenode.acls.enabled)
• Set a restrictive HDFS umask (fs.permissions.umask-mode)
• Watch for superusers (hadoop.proxyuser settings)
• Change the OS default umask (watch out for upgrades and config file permissions)
• Make sure the Hive warehouse HDFS path is protected
• Implement Ranger
• Just don’t sync your whole AD with it ;)
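Several of the tips above map directly to Hadoop configuration properties, so they can be reviewed with a small script. This is a minimal sketch, assuming the settings have already been loaded into a plain dict. The function name `review_hadoop_settings` and the rule that "others" should get no access from the umask are my assumptions, not a policy stated in the talk.

```python
def review_hadoop_settings(conf):
    """Return findings for a dict of Hadoop configuration properties."""
    findings = []

    # HDFS ACLs must be switched on before any ACL can be set.
    if str(conf.get("dfs.namenode.acls.enabled", "false")).lower() != "true":
        findings.append("HDFS ACLs are disabled (dfs.namenode.acls.enabled)")

    # Assumption: a restrictive umask masks all bits for "others" (e.g. 027, 077).
    umask = conf.get("fs.permissions.umask-mode", "022")
    try:
        if int(umask, 8) & 0o007 != 0o007:
            findings.append(f"umask {umask} leaves access for others")
    except ValueError:
        findings.append(f"umask {umask!r} is not octal; review it manually")

    # Wildcard proxyuser settings let a service impersonate anyone from anywhere.
    for key, value in sorted(conf.items()):
        if (key.startswith("hadoop.proxyuser.")
                and key.endswith((".hosts", ".groups"))
                and value.strip() == "*"):
            findings.append(f"{key} is a wildcard")

    return findings


if __name__ == "__main__":
    sample = {
        "dfs.namenode.acls.enabled": "true",
        "fs.permissions.umask-mode": "027",
        "hadoop.proxyuser.hive.hosts": "*",
        "hadoop.proxyuser.knox.groups": "users",
    }
    for finding in review_hadoop_settings(sample):
        print(finding)
```

The sample dict is illustrative only; in practice the values would come from core-site.xml and hdfs-site.xml.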