
Securing Spark Applications


  1. Securing Spark Applications. Hadoop Summit 2016 - Dublin. Marcelo Vanzin. © Cloudera, Inc. All rights reserved.
  2. What is Security? • Security has many facets • This talk will focus on three areas: encryption, authentication, and authorization
  3. Why do I need security? • User identification • Application isolation • Access control enforcement • Compliance with government regulations
  4. Before we go further... • Set up Kerberos • Use HDFS (or another secure filesystem) • Use YARN • Configure them for security (enable authentication and encryption). Kerberos, HDFS, and YARN provide the security backbone for Spark.
  5. Encryption • In a secure cluster, data should not be visible in the clear, whether on the wire or at rest • Very important to financial and government institutions, or to anyone who works with sensitive data
  6. What a Spark app looks like. [Diagram: a Driver (with UI), two Executors, a Shuffle Service, and Disks. Labeled channels: Control RPC, File Download, Shuffle / Cached Blocks, Shuffle Blocks, Shuffle Blocks / Metadata.]
  7. Prior to Spark 1.6: different channel, different method • Control plane: SSL • File distribution: SSL • Shuffle blocks: SASL • User UI / REST API: nothing • Spilled / shuffle blocks: use ecryptfs (or equivalent)
  8. What is wrong with SSL?
  9. Why not SSL? • SSL can be hard to set up • Certificates need to be readable on every node • Sharing certificates is not as secure • Hard to have per-user certificates
  10. Spark 1.6 standardizes around a common transport library • Replaces Akka RPC (SPARK-6028) • Replaces HTTP file service (SPARK-11140) • Uses SASL encryption. But... • Web UI still has no encryption • Shuffle / spilled blocks still require FS-level encryption • SASL in the JVM is restricted to 3DES encryption, which is not very strong
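The SASL-based setup on this slide can be sketched as a small helper that renders `spark-defaults.conf` lines. The property names `spark.authenticate` and `spark.authenticate.enableSaslEncryption` are taken from Spark's security documentation for this release line, not from the slides, and the helper `render_spark_conf` is hypothetical. Treat this as a minimal sketch and check the documentation for the Spark version in use.

```python
def render_spark_conf(props):
    """Render a dict of Spark properties as spark-defaults.conf lines."""
    lines = []
    for key in sorted(props):
        value = props[key]
        # Spark expects lowercase "true"/"false" for boolean settings.
        if isinstance(value, bool):
            value = "true" if value else "false"
        lines.append(f"{key} {value}")
    return "\n".join(lines)


# Assumed property names (from Spark's security docs, not from the slides).
sasl_props = {
    "spark.authenticate": True,
    "spark.authenticate.enableSaslEncryption": True,
}

conf_text = render_spark_conf(sasl_props)
print(conf_text)
```

Keeping the settings in a dict makes it easy to diff a secured configuration against an unsecured one before writing the file.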
  11. Spark 2.0 • REPL class distribution using the transport library (SPARK-11563) • HTTPS support for Web UI and History Server (SPARK-2750) • Encrypting shuffle blocks is almost in (SPARK-5682) • Depends on the third-party Chimera library for encryption • Work is being done to add Chimera to Apache Commons. Future: • Use Chimera to encrypt over-the-wire data
  12. Authentication: Who is reading my data? • Spark relies on Kerberos, the necessary evil • Ubiquitous in Hadoop: YARN, HDFS, Hive...
  13. Who is reading my data? Kerberos provides secure authentication. [Diagram: User, KDC, Application.] User: “Hi, I’m Bob.” KDC: “Hello Bob. Here’s your TGT.” User: “Here’s my TGT. I want to talk to HDFS.” KDC: “Here’s your HDFS ticket.”
  14. Now with a distributed app... [Diagram: eight Executors, each telling the KDC “Hi, I’m Bob.”] Something is wrong.
  15. Kerberos in Hadoop / Spark: Hadoop services use delegation tokens to avoid KDC limitations. [Diagram: Driver, NameNode, Executor, DataNode.]
  16. Delegation Tokens: Like Kerberos tickets, they have a TTL. • OK for most batch applications • Not OK for long-running applications such as streaming or the Spark SQL Thrift Server. Since 1.4, Spark can manage delegation tokens, but support is very limited. • Full support only for HDFS • Limited support for Hive, HBase
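For long-running applications on YARN, `spark-submit` accepts `--principal` and `--keytab` options so that Spark can log in again and obtain fresh delegation tokens before the old ones expire. Those option names come from Spark's YARN documentation rather than from the slides. The sketch below only assembles the command line; the helper `build_submit_command`, the principal, the keytab path, and the application jar are all hypothetical, and the right `--master` value depends on the Spark version.

```python
def build_submit_command(principal, keytab, app_jar, master="yarn"):
    """Return spark-submit arguments that pass a Kerberos principal and keytab."""
    return [
        "spark-submit",
        "--master", master,
        "--principal", principal,
        "--keytab", keytab,
        app_jar,
    ]


command = build_submit_command(
    principal="bob@EXAMPLE.COM",                 # hypothetical principal
    keytab="/etc/security/keytabs/bob.keytab",   # hypothetical keytab path
    app_jar="streaming-app.jar",                 # hypothetical application
)
print(" ".join(command))
```

Building the arguments as a list rather than a single string avoids shell-quoting problems if the list is later handed to a process launcher.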
  17. How about Secure Kafka?
  18. Spark Streaming with Kafka • Kafka 0.9 supports some security features • Requires the use of a new consumer API (SPARK-12177) • Kafka 0.9 does not support delegation tokens! (KAFKA-1696)
  19. Authorization: How can I share my data? Simplest form of authorization: file permissions. • Unix-style user/group/other or ACLs • Simple, but high maintenance: set the umask, or manually change the permissions of new files • Trusted entity (OS kernel) enforces access control
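The maintenance burden of the umask can be illustrated with a little arithmetic: the mode a new file actually receives is the requested mode with the umask bits cleared. This is a minimal sketch of standard Unix semantics, not code from the talk; the helper names `mode_after_umask` and `describe` are hypothetical.

```python
import stat


def mode_after_umask(requested_mode, umask):
    """Return the permission bits a new file gets under the given umask."""
    return requested_mode & ~umask


def describe(mode):
    """Format permission bits of a regular file in ls-style notation."""
    return stat.filemode(stat.S_IFREG | mode)


# A common default umask leaves new files readable by everyone.
default_mode = mode_after_umask(0o666, 0o022)
# A restrictive umask keeps new files private to the owner.
private_mode = mode_after_umask(0o666, 0o077)

print(describe(default_mode))  # -rw-r--r--
print(describe(private_mode))  # -rw-------
```

Under the first umask every new file is world-readable until someone changes it by hand, which is the "manually change new files" chore the slide refers to.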
  20. More than just FS semantics: Not all applications operate on files... • Tables, columns, partitions instead of files and directories • Trusted service needs to understand the app’s semantics
  21. Trusted Service Example: Hive. [Diagram: Client, HiveServer2, HMS, two DataNodes.]
  22. Untrusted App Example: Spark. [Diagram: User Code, HMS, two DataNodes.]
  23. Apache Sentry • Role-based access control to resources • Integrates with HMS / HS2 to control access to data • Fine-grained (up to column level) controls. HDFS plugin synchronizes file permissions. • Permission to read table = permission to read table’s files • Permission to create table = permission to write to database’s directory
  24. Still restricted to FS view of the world! • Files, directories, etc. • Cannot provide column-level and row-level access control • Whole table or nothing. Still, it goes a long way in allowing Spark applications to work well with Hive data in a shared, secure environment. But...
  25. A Simple Example: Assume we had a table “accounts”:
      column_name | column_type
      name        | string
      country     | string
      balance     | int
  26. Untrusted App Example: Spark. [Diagram: User Code, HMS, HDFS.] (1) Where’s table “accounts”? (2) In path “/accounts” (3) Give me the files in “/accounts” (4) Here’s the file, with all columns: name, country, balance
  27. Future: RecordService. A distributed, scalable data access service for unified authorization in Hadoop. • Drop-in replacement for Hive InputFormats • Integration with Spark SQL Data Sources API • Predicate pushdown, projection
  28. RecordService: Users can enforce row- and column-level permissions using views.
      name  | country | balance
      Alice | US      | 1000
      Bob   | BR      | 1500
      Eve   | US      | 2000
      > create view customers as select name, country from accounts
      > create view balances_us as select name, balance from accounts where country = 'US'
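What the two views expose can be tried out locally with Python's built-in `sqlite3` module. This is only an illustration of the view semantics, using the column names of the “accounts” table (name, country, balance). SQLite stands in for Hive and RecordService here and performs no access control at all, so the sketch shows which rows and columns a view would return, not how permissions are enforced.

```python
import sqlite3

# In-memory database standing in for the "accounts" table on the slide.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT, country TEXT, balance INTEGER)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?)",
    [("Alice", "US", 1000), ("Bob", "BR", 1500), ("Eve", "US", 2000)],
)

# Column-level restriction: the balance column is hidden.
conn.execute("CREATE VIEW customers AS SELECT name, country FROM accounts")
# Row-level restriction: only US accounts are visible.
conn.execute(
    "CREATE VIEW balances_us AS "
    "SELECT name, balance FROM accounts WHERE country = 'US'"
)

customers = conn.execute("SELECT * FROM customers ORDER BY name").fetchall()
balances_us = conn.execute("SELECT * FROM balances_us ORDER BY name").fetchall()

print(customers)    # [('Alice', 'US'), ('Bob', 'BR'), ('Eve', 'US')]
print(balances_us)  # [('Alice', 1000), ('Eve', 2000)]
```

A user granted access to `customers` but not to `accounts` would see names and countries only, which is the column-level control that file permissions alone cannot express.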
  29. Untrusted App Example: Spark. [Diagram: User Code, RS Planner, RS Worker.] (1) Where’s table “accounts”? (2) Sorry, you can’t read it. (3) Where’s table “customers”? (4) In Worker “X” (5) Give me table “customers” (6) Here’s a list of (name, country). The underlying table holds (name, country, balance); the application receives only (name, country).
  30. Takeaways • Spark can be made secure today! • Builds on top of security features in Hadoop • Still work to be done: stronger encryption, easier-to-use SSL, better integration with Sentry / RecordService
  31. References • Encryption: SPARK-6017, SPARK-5682 • Delegation tokens: SPARK-5342 • Sentry: http://sentry.apache.org/ • HDFS synchronization: SENTRY-432 • RecordService: http://cloudera.github.io/RecordServiceClient/
  32. Thanks! Questions?
