Setting up a secure Hadoop cluster involves a magic combination of Kerberos, Sentry, Ranger, Knox, Atlas, LDAP and possibly PAM. Add encryption on the wire and at rest to the mix and you have, at the very least, an interesting configuration and installation task.
Nonetheless, the fact that there are a lot of knobs to turn doesn't excuse you from the responsibility of taking proper care of your customers' data. In this talk, we'll detail how the different security components in Hadoop interact and how easy it actually can be to set things up correctly once you understand the concepts and tools. We'll outline a successful secure Hadoop setup with an example.
2. This is not going to be a perfect talk
• It will be incomplete (squeezed for time)
• Probably without humor (I am just really bad at telling jokes)
• I have a disclaimer (I work for a bank)
• A lot of text in Orange (ING is Oranje)
3. Agenda
• Security today in Hadoop
• Kerberos (In depth)
• Policy based access
• Lineage (A bit)
• Encryption
4. Information security principles
Confidentiality
• Information is not made available or disclosed to unauthorized individuals, entities or processes
Integrity
• Maintaining and assuring the accuracy and completeness of data over its entire lifecycle
Availability
• Data must be available when it is needed
Unfortunately, most of the attention in Hadoop goes to confidentiality
5. Security today in Hadoop
• Authentication (Who am I?): Kerberos, Apache Knox
• Authorization (What can I do?): Apache Ranger, Apache Sentry
• Audit (What did I do?): Apache Ranger, Cloudera Navigator
• Data Protection (Can someone read my data?): SSL, SASL, KMS
• Data Governance (Where did my data come from and where is it going?): Apache Atlas, Cloudera Navigator
Identity Management
9. Kerberos has great advantages…
• Requires each client to prove its identity on each request
• Does not require a user to enter a password every time a service is requested
• Works across operating systems
• Kerberos assumes that network connections, rather than servers and workstations, are the weak link in network security
• Did you know that Active Directory is just Kerberos+LDAP?
10. …but its perceived complexity has stopped implementation
• AS, KDC, TGS, SS, TGT, KINIT, KEYTAB, KADMIN: so many abbreviations…
• But you just need to remember a few: kinit, keytab, kdc
• Synchronization of host clocks required
• Wait, what? You didn't do that yet? Your local cloud provider already does this for you.
• Separate user databases if combined with LDAP or PAM
• Well there is Active Directory and there is FreeIPA
• Tool Xxx is not kerberized and I really need it
• Insecure: don't use it, or add patches yourself. Yeah, open source!
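Why the clock synchronization matters can be sketched in a few lines. This is only an illustration, not Kerberos code: the function name is made up, and the 300 second tolerance is the usual MIT Kerberos default (the "clockskew" setting), which your KDC may override.

```python
from datetime import datetime, timedelta, timezone

# Assumption: 300 seconds is the common default tolerance (MIT Kerberos
# "clockskew"); a KDC can be configured with a different value.
MAX_CLOCK_SKEW = timedelta(seconds=300)


def within_clock_skew(client_time, kdc_time, max_skew=MAX_CLOCK_SKEW):
    """Return True if the two clocks differ by no more than max_skew."""
    return abs(client_time - kdc_time) <= max_skew


# Arbitrary example timestamp, used only to make the sketch deterministic.
kdc_now = datetime(2020, 1, 1, 12, 0, 0, tzinfo=timezone.utc)

# A host that is 2 minutes ahead is still accepted.
print(within_clock_skew(kdc_now + timedelta(minutes=2), kdc_now))

# A host that is 10 minutes behind would have its requests rejected.
print(within_clock_skew(kdc_now - timedelta(minutes=10), kdc_now))
```

In practice you do not write this check yourself; you run NTP or chrony on every host so the check inside Kerberos never fails.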
12. Integration in an Enterprise environment
• Fully integrated with the operating system and Hadoop
• User IDs are the same, shared and immediate
• Can use PAM
• YARN and HDFS ACLs start working out of the box, as local users just exist. That is the big stuff!
14. Support in Hadoop distributions is slightly lagging
Quite easy actually: gen_credentials.sh just needs to be adjusted:
http://blog.godatadriven.com/samba-configuration.html (for IPA it needs to be adjusted)
https://github.com/HariSekhon/tools/blob/master/ambari_freeipa_kerberos_setup.pl
Written by an ex-Cloudera guy ;-)
15. Caveats
• Trusted domains deliver users with “username@REALM”, Hadoop and Hive filter on ‘@’
• See: https://issues.apache.org/jira/browse/HADOOP-12751
• See: https://issues.apache.org/jira/browse/HIVE-12981
• Workaround: convert @ to _ by means of sssd
• full_name_format = %1$s_%2$s
• re_expression = (((?P<Name>[^@]+)_(?P<Domain>.+$))|((?P<Domain>[^\\]+)\\(?P<Name>.+$))|((?P<Name>[^@]+)@(?P<Domain>.+$))|(^(?P<Name>[^@]+)$))
• Or just wait for the patches to land
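The effect of the workaround can be sketched in Python. This is a hypothetical helper, not SSSD's own code: Python's re module does not accept the duplicate group names used in the sssd expression, so each input form gets its own pattern here, and the function name and example realm are made up.

```python
import re

# Hypothetical patterns mirroring two of the input forms the sssd
# expression accepts. Each form is a separate pattern because Python's
# re module rejects duplicate group names within one expression.
_PATTERNS = [
    re.compile(r"^(?P<name>[^@]+)@(?P<domain>.+)$"),    # name@DOMAIN
    re.compile(r"^(?P<domain>[^\\]+)\\(?P<name>.+)$"),  # DOMAIN\name
]


def to_hadoop_safe_name(login):
    """Render a login as name_DOMAIN, like full_name_format = %1$s_%2$s."""
    for pattern in _PATTERNS:
        match = pattern.match(login)
        if match:
            return "{}_{}".format(match.group("name"), match.group("domain"))
    # A bare name without a domain part is left untouched.
    return login


print(to_hadoop_safe_name("alice@EXAMPLE.COM"))
print(to_hadoop_safe_name("EXAMPLE\\alice"))
print(to_hadoop_safe_name("alice"))
```

The point is only that no '@' reaches Hadoop or Hive anymore, so their filtering on '@' no longer strips the realm.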
20. Caveats
• Ranger (but also Sentry) feels like slapped-on security: just usable, but barely
• User synchronization can be very slow with many users due to architecture issues
• Unix synchronization and authentication use /etc/passwd and /etc/group instead of NSS and PAM
• https://issues.apache.org/jira/browse/RANGER-842
• https://issues.apache.org/jira/browse/RANGER-827
• If these patches land, syncing will be much faster for IPA/SSSD enabled systems
• No real Spark roadmap, just spark-sql. This also goes for Sentry
• Doesn't manage HDFS ACLs and requires Hive user access… defeating end-to-end security
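The /etc/passwd versus NSS caveat can be illustrated with a small sketch. It assumes a typical glibc Linux host where Python's pwd module resolves users through the C library and therefore through whatever nsswitch.conf configures (files, sss, ldap). The function names and sample entries are made up, and this is not how Ranger's usersync is implemented.

```python
import pwd


def local_passwd_names(passwd_text):
    """User names found by parsing passwd-format text, as a file-based sync would."""
    names = []
    for line in passwd_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            names.append(line.split(":", 1)[0])
    return names


def nss_lookup(name):
    """Resolve a user through the C library (and thus NSS); None if unknown."""
    try:
        return pwd.getpwnam(name)
    except KeyError:
        return None


# Made-up sample in /etc/passwd format: only local accounts ever show up here.
sample = (
    "root:x:0:0:root:/root:/bin/bash\n"
    "hdfs:x:1001:1001::/home/hdfs:/bin/bash\n"
)
print(local_passwd_names(sample))
```

On an IPA/SSSD enabled host, a directory user would be found by nss_lookup but would never appear in the parsed file, which is why a file-based sync misses exactly the users you care about.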
21. Data Governance
• Why?
• We need to be able to pinpoint what data resides where, why, and what happened to it.
• Why?
• Because you might want us to remove your data
• … and the regulator says so
22. Encryption
• Data at rest
• Used if you don’t trust your physical infrastructure. Cloud!
• Only our highest confidentiality levels require it; we are not at that level, so we don't use it
• Data in transit
• Data across untrusted networks. Cloud?
• Perimeter security solves a lot of these issues. You take a significant performance hit of around 20% if you enable it within your cluster
• For ETL or data ingestion it becomes more reasonable
• For us it is enabled for access TO the cluster, NOT WITHIN
• Data democratization
• Use case: allow some data scientists to see the original data and others only the masked/anonymized data
• We are tinkering with this
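The data democratization use case could look roughly like the sketch below. Everything in it is an assumption made for illustration: the role name, the field names, the salt and the choice of a salted SHA-256 hash. The talk does not describe an actual implementation.

```python
import hashlib

# Hypothetical set of fields that count as sensitive.
SENSITIVE_FIELDS = {"name", "iban"}


def pseudonymize(value, salt):
    """Replace a value with a short, stable token derived from a salted hash."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]


def view_for(record, role, salt="example-salt"):
    """Return the original record for a trusted role, a masked copy otherwise."""
    if role == "trusted_analyst":  # hypothetical role name
        return dict(record)
    return {
        key: pseudonymize(str(value), salt) if key in SENSITIVE_FIELDS else value
        for key, value in record.items()
    }


# Made-up record, not real data.
record = {"name": "Jane Doe", "iban": "NL00EXAMPLE0000000000", "balance": 100}
print(view_for(record, "trusted_analyst"))
print(view_for(record, "data_scientist"))
```

Note that a salted hash is pseudonymization rather than true anonymization: the same input always maps to the same token, which keeps joins working but also keeps records linkable.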
24. We are hiring! Bolke.de.Bruin@ing.nl
Frank Derks, John Muller, Pooja Rao, Hylke Hendriksen, Giovanni Lanzini, Fabian Jansen, Hanneke van Veldhuizen, Johan Witman, Wendell Kuling, Jonas Ahrendt, Bolke de Bruin, Ivo Everts, Doron Reuter, Zhe Sun