
End-to-End Security and Auditing in a Big Data as a Service Deployment


  1. End-to-End Security and Auditing in a Big-Data-as-a-Service (BDaaS) Deployment. Nanda Vijaydev (BlueData), Abhiraj Butala (BlueData)
  2. Big-Data-as-a-Service (BDaaS): “A mechanism for the delivery of statistical analysis tools and information that helps organizations understand and use insights gained from large information sets in order to gain a competitive advantage.” On-demand, self-service, elastic big data infrastructure, applications, and analytics. Source: www.semantikoz.com/blog/big-data-as-a-service-definition-classification
  3. Multi-Tenant Big-Data-as-a-Service (diagram): multiple compute services (Hadoop, BI, Spark) per tenant – Marketing (360 Customer View; Prod 2.2, Dev/Test 2.4), R&D (Log Analysis; POC 2.3, Prod 2.3), Manufacturing (Predictive Maintenance; Dev/Test 2.4) – all sharing a Data Lake and Staging area on a shared HDFS
  4. Why BDaaS? – Compute Side Of The Story • The set of applications that interact with Hadoop keeps growing • Various versions of the same app/distro run in parallel • Enterprises need to scale compute up and down based on usage • The model is similar to Amazon AWS, with S3 as storage and applications on EC2
  5. Why BDaaS? – Data Side Of The Story • Production cluster access takes time and is generally restricted • Staging clusters may not have all the data • Data often lives on other storage systems; NFS (e.g., Isilon) is common • Users also want to upload arbitrary files for analysis
  6. Hadoop – A Collection Of Services: Hadoop is a collection of storage and compute services such as HDFS, HBase, Hive, YARN, Solr, and Kafka
  7. Security In Hadoop • Authenticate users into the Hadoop ecosystem – each service has its own integration with LDAP/AD for authentication • Authorize and limit their actions to selected services; authorization is granted separately for each service. Examples: – folder “/user/customer” in HDFS has ‘r-x’ for user ‘alice’ and ‘-wx’ for user ‘bob’ – column-level access to a Hive table: “Customer.Name” and “Customer.PhoneNumber” are accessible only by selected users and groups
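The per-user grants in the HDFS example above go beyond plain owner/group/other permission bits, so they map to HDFS ACLs. A sketch of the corresponding commands, assuming a running cluster with `dfs.namenode.acls.enabled=true` and using the slide's example path and user names:

```shell
# Grant alice read + traverse, and bob write + traverse (no listing),
# on the same directory via HDFS extended ACLs:
hdfs dfs -setfacl -m user:alice:r-x /user/customer
hdfs dfs -setfacl -m user:bob:-wx /user/customer

# Verify the resulting ACL entries:
hdfs dfs -getfacl /user/customer
```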
  8. Ranger – A Pluggable Security Framework • Ranger works with a common user DB (LDAP/AD) for authentication • Provides a plug-in for individual Hadoop services to enable authorization • Lets users define policies in a central location, using the Web UI or APIs • Users can define their own plug-in for a custom service and manage it centrally via Ranger Admin
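Defining a policy through Ranger's APIs, as mentioned above, can be sketched as a call to Ranger Admin's public v2 REST endpoint. The host, credentials, and service name ("hadoopdev") below are placeholders, and the policy mirrors the slide 7 example:

```shell
# Create a central HDFS path policy via the Ranger Admin REST API
# (placeholders: host, admin credentials, service name "hadoopdev"):
curl -u admin:password -H 'Content-Type: application/json' \
  -X POST http://ranger-admin.example.com:6080/service/public/v2/api/policy \
  -d '{
    "service": "hadoopdev",
    "name": "marketing-customer-data",
    "resources": { "path": { "values": ["/user/customer"], "isRecursive": true } },
    "policyItems": [
      { "users": ["alice"], "accesses": [ { "type": "read",    "isAllowed": true },
                                          { "type": "execute", "isAllowed": true } ] },
      { "users": ["bob"],   "accesses": [ { "type": "write",   "isAllowed": true },
                                          { "type": "execute", "isAllowed": true } ] }
    ]
  }'
```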
  9. Defining HDFS Ranger Policies (screenshots: HDFS policy list; Marketing policy drill-down)
  10. Security Considerations in BDaaS (diagram): 1. User identity within the Data Lake 2. User identity in the application layer 3. User identity propagation to the data layer – preventing data duplication while maintaining user integrity across layers
  11. 1. Securing The Data Lake (diagram): authentication and authorization at the Data Lake, backed by LDAP and a KDC
  12. 2. Securing The App Layer (diagram): app containers are integrated with LDAP and the KDC; users Alice, Bob, and Tom authenticate at the application level
  13. 3. Identity Propagation To The Data Layer (diagram): user identities (Alice, Bob, Tom) propagate from the app layer through the KDC down to the data layer
  14. User Identity Propagation, Two Ways – Users connect directly to HDFS • Simple authentication • Kerberos authentication – Users connect to HDFS via a super-user (impersonation)
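The two direct-connection modes above can be sketched from a client shell. This assumes cluster access; the realm and path are illustrative:

```shell
# Simple authentication: the client-asserted user name is trusted as-is,
# so this is suitable only for non-kerberized dev/test clusters.
HADOOP_USER_NAME=alice hdfs dfs -ls /user/customer

# Kerberos authentication: obtain a ticket first; HDFS then derives the
# effective user from the Kerberos principal.
kinit alice@EXAMPLE.COM
hdfs dfs -ls /user/customer
```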
  15. HDFS Direct Connections (diagram): Alice, Bob, and Tom connect from their tenant clusters directly to the kerberized HDFS Data Lake
  16. HDFS Direct Connections.. – hdfs-audit.log – Ranger policies are enforced for alice and bob, as they are the effective users
  17. HDFS Direct Connections.. • Single Hadoop setup – ideal • Multi-tenant, multi-application setup – kerberized HDFS needs kerberized compute and services – you may not want to kerberize Dev/QA setups – Hadoop versions must be compatible across the board – data duplication
  18. HDFS Super-user Connections • Super-users perform actions on behalf of other users (impersonation/proxying) • Adding a new super-user is easy – via core-site.xml
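The core-site.xml change above uses Hadoop's proxy-user properties. A sketch, in which the super-user name "datatap" and the host/group values are assumptions for illustration:

```xml
<!-- Allow the super-user "datatap" (hypothetical name) to impersonate -->
<!-- members of selected groups when connecting from selected hosts. -->
<property>
  <name>hadoop.proxyuser.datatap.hosts</name>
  <value>host1.example.com,host2.example.com</value>
</property>
<property>
  <name>hadoop.proxyuser.datatap.groups</name>
  <value>marketing,rnd,manufacturing</value>
</property>
```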
  19. HDFS Super-user Connections.. (diagram): Alice, Bob, and Tom reach the HDFS Data Lake through the DataTap caching service, which connects via a super-user
  20. HDFS Super-user Connections.. – hdfs-audit.log – Ranger authorization policies are still enforced, as alice and bob are the effective users
  21. HDFS Super-user Connections.. Multi-tenant, multi-application setup – Works for applications that don’t support Kerberos (yet) – Dev/Test setups need not be kerberized – The DataTap service can abstract version incompatibilities – Can help avoid data duplication – Needs tight LDAP/AD integration, though!
  22. Ranger in Action – Hue Example
  23. HDFS Permissions on the Data Lake • Set HDFS file access for ‘/user/secret’ to strict mode • Set the umask to ‘077’
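The two hardening steps above can be sketched as commands. The HDFS side needs a cluster, but the umask arithmetic (0777 & ~0077 = 0700) can be checked on any local filesystem:

```shell
#!/bin/sh
# On the cluster (sketch): owner-only access for the sensitive directory,
#   hdfs dfs -chmod 700 /user/secret
# plus fs.permissions.umask-mode=077 in the HDFS configuration.

# The same umask semantics, demonstrated locally:
umask 077
d=$(mktemp -d)
mkdir "$d/secret"
stat -c '%a' "$d/secret"   # prints 700 (0777 masked by 077)
rm -rf "$d"
```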
  24. HDFS Ranger Policies
  25. DataTap Caching Service
  26. Create Table via Hue
  27. Query Table via Hue – Success
  28. Query Table via Hue – Failure
  29. Ranger Audit Logs
  30. Key Takeaways • BDaaS is more than Hadoop-as-a-Service – it includes BI/ETL/analytics and data science tools • Security is an important consideration in BDaaS • Data duplication is not an option • Global user authentication against a centralized DB such as LDAP/AD is a must • Apache Ranger helps enforce global policies, provided user identities are propagated correctly
  31. Q & A – www.bluedata.com – Nanda Vijaydev @nandavijaydev – Abhiraj Butala @abhirajbutala
