Securing Your Apache Spark Applications

Kostas Sakellis and Marcelo Vanzin at Spark Summit EU 2015

Published in: Software

  1. Securing Spark Applications
     Kostas Sakellis, Marcelo Vanzin
  2. What is Security?
     • Security has many facets
     • This talk will focus on three areas:
       – Encryption
       – Authentication
       – Authorization
  3. Why do I need security?
     • Multi-tenancy
     • Application isolation
     • User identification
     • Access control enforcement
     • Compliance with government regulations
  4. Before we go further...
     • Set up Kerberos
     • Use HDFS (or another secure filesystem)
     • Use YARN!
     • Configure them for security (enable auth, encryption).
     Kerberos, HDFS, and YARN provide the security backbone for Spark.
  5. Encryption
     • In a secure cluster, data should not be visible in the clear
     • Very important to financial / government institutions
  6. What a Spark app looks like
     [Diagram. Components: RM, NM (x2), AM / Driver, Executor (x2), SparkSubmit, Shuffle Service (x2), UI. Connections: Control RPC, File Download, Shuffle / Cached Blocks, Shuffle Blocks, Shuffle Blocks / Metadata.]
  7. Data Flow in Spark
     Every connection in the previous slide can transmit sensitive data!
     • Input data transmitted via broadcast variables
     • Computed data during shuffles
     • Data in serialized tasks, files uploaded with the job
     How to prevent other users from seeing this data?
  8. Encryption in Spark
     • Almost all channels support encryption.
       – Exception 1: UI (SPARK-2750)
       – Exception 2: local shuffle / cache files (SPARK-5682)
     For local files, set up YARN local dirs to point at local encrypted disk(s) if desired. (SPARK-5682)
  9. Encryption: Current State
     Different channel, different method.
     • Shuffle protocol uses SASL
     • RPC / File download use SSL
     SSL can be hard to set up.
     • Need certificates readable on every node
     • Sharing certificates not as secure
     • Hard to have per-user certificate
  10. Encryption: The Goal
     SASL everywhere for wire encryption (except UI).
     • Minimum configuration (one boolean config)
     • Uses built-in JVM libraries
     • SPARK-6017
     For UI:
     • Support for SSL
     • Or audit UI to remove sensitive info (e.g. information on environment page).
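To make the "different channel, different method" state from slides 9 and 10 concrete, here is a minimal sketch that collects the two families of settings and renders them as `spark-submit` arguments. The property names are, to the best of our knowledge, the ones documented for Spark 1.x and should be checked against the version in use. The object name, helper names, and keystore path are hypothetical and not part of the talk.

```scala
// Sketch only: collects wire-encryption settings and renders them as
// spark-submit arguments. Helper names and the keystore path are hypothetical.
object EncryptionConf {
  // SASL-based settings (authentication plus encrypted block transfer).
  val sasl: Seq[(String, String)] = Seq(
    "spark.authenticate" -> "true",
    "spark.authenticate.enableSaslEncryption" -> "true"
  )

  // SSL-based settings; the keystore path is a placeholder supplied by the caller.
  def ssl(keyStore: String): Seq[(String, String)] = Seq(
    "spark.ssl.enabled" -> "true",
    "spark.ssl.keyStore" -> keyStore
  )

  // Render each pair as "--conf key=value".
  def toSubmitArgs(settings: Seq[(String, String)]): Seq[String] =
    settings.flatMap { case (k, v) => Seq("--conf", s"$k=$v") }
}
```

For example, `EncryptionConf.toSubmitArgs(EncryptionConf.sasl ++ EncryptionConf.ssl("/path/to/keystore.jks"))` yields the flags for both mechanisms. The SSL side needs a keystore on every node, which is the setup burden slide 9 describes.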
  11. Authentication
     Who is reading my data?
     • Spark uses Kerberos
       – the necessary evil
     • Ubiquitous among other services
       – YARN, HDFS, Hive, HBase etc.
  12. Who’s reading my data?
     Kerberos provides secure authentication.
     [Diagram: User, Application, and KDC exchanging the messages below.]
     • “Hi I’m Bob.”
     • “Hello Bob. Here’s your TGT.”
     • “Here’s my TGT. I want to talk to HDFS.”
     • “Here’s your HDFS ticket.”
  13. Now with a distributed app...
     [Diagram: eight Executors each telling the KDC “Hi I’m Bob.”]
     Something is wrong.
  14. Kerberos in Hadoop / Spark
     KDCs do not allow multiple concurrent logins at the scale distributed applications need. Hadoop services use delegation tokens instead.
     [Diagram: Driver, NameNode, Executor, DataNode.]
  15. Delegation Tokens
     Like Kerberos tickets, they have a TTL.
     • OK for most batch applications.
     • Not OK for long running applications
       – Streaming
       – Spark SQL Thrift Server
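The TTL point on slide 15 can be shown with a small model. This is a sketch under stated assumptions, not how Hadoop represents tokens: the `Token` case class, its fields, and the durations used in the usage note are all hypothetical and chosen only for illustration.

```scala
// Sketch only: a toy model of a delegation token with a time-to-live.
object TokenLifetime {
  // Hypothetical token: issue time and TTL, both in milliseconds.
  final case class Token(issuedAtMs: Long, ttlMs: Long) {
    def expiresAtMs: Long = issuedAtMs + ttlMs
    def isValidAt(nowMs: Long): Boolean = nowMs < expiresAtMs
  }

  // True when a job started at issue time would still be running at expiry,
  // i.e. the job needs a fresh token at some point.
  def outlivesToken(token: Token, expectedRuntimeMs: Long): Boolean =
    expectedRuntimeMs >= token.ttlMs
}
```

With an illustrative 24-hour TTL, a two-hour batch job finishes well inside the token's lifetime. A streaming job expected to run for weeks does not, which is why long-running applications need a way to obtain new tokens.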
  16. Delegation Tokens
     Since 1.4, Spark can manage delegation tokens!
     • Restricted to HDFS currently
     • Requires user’s keytab to be deployed with application
     • Still some remaining issues in client deploy mode
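In practice, deploying the keytab with the application means passing it at submit time. The sketch below assumes the `--principal` and `--keytab` options that `spark-submit` accepts on YARN. The object name, the principal, and the keytab path are hypothetical placeholders.

```scala
// Sketch only: builds the keytab-related spark-submit arguments.
object KeytabSubmit {
  // Both values are supplied by the caller; nothing here contacts a KDC.
  def args(principal: String, keytabPath: String): Seq[String] = {
    require(principal.nonEmpty, "principal must not be empty")
    require(keytabPath.nonEmpty, "keytab path must not be empty")
    Seq("--principal", principal, "--keytab", keytabPath)
  }
}
```

A call such as `KeytabSubmit.args("bob@EXAMPLE.COM", "/path/to/bob.keytab")` produces the extra flags to add to an otherwise ordinary submit command.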
  17. Authorization
     How can I share my data?
     Simplest form of authorization: file permissions.
     • Use Unix-style permissions or ACLs to let others read from and / or write to files and directories
     • Simple, but high maintenance. Set permissions / ownership for new files, mess with umask, etc.
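The "mess with umask" remark on slide 17 refers to how the permissions of a new file are derived: the requested mode has the umask bits cleared. Here is a minimal sketch of that arithmetic in plain Scala. The object and method names are hypothetical.

```scala
// Sketch only: Unix-style umask arithmetic on permission bits.
object Umask {
  // Mode usually requested for new regular files: rw-rw-rw- (octal 666).
  val FileBase: Int = Integer.parseInt("666", 8)

  // Effective mode = requested mode with the umask bits cleared.
  def effectiveMode(requested: Int, umask: Int): Int = requested & ~umask

  // Render the low nine bits as an "rwxr-x---" style string.
  def render(mode: Int): String =
    (8 to 0 by -1).map { bit =>
      if ((mode & (1 << bit)) != 0) "rwx".charAt((8 - bit) % 3) else '-'
    }.mkString
}
```

With a umask of octal 022, new files come out as `rw-r--r--`, readable by everyone. With 027 they come out as `rw-r-----`, which hides them from users outside the owning group. Keeping these defaults right on every new file is the maintenance cost the slide mentions.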
  18. More than just FS semantics...
     Authorization becomes more complicated as abstractions are created.
     • Tables, columns, partitions instead of files and directories
     • Semantic gap
     • Need a trusted entity to enforce access control
  19. Trusted Service: Hive
     Hive has a trusted service (“HiveServer2”) for enforcing authorization.
     • HS2 parses queries and makes sure users have access to the data they’re requesting / modifying.
     HS2 runs as a trusted user with access to the whole warehouse. Users don’t run code directly in HS2*, so there’s no danger of code escaping access checks.
  20. Untrusted Apps: Spark
     Each Spark app runs as the requesting user, and needs access to the underlying files.
     • Spark itself cannot enforce access control, since it’s running as the user and is thus untrusted.
     • Restricted to file system permission semantics.
     How to bridge the two worlds?
  21. Apache Sentry
     • Role-based access control to resources
     • Integrates with Hive / HS2 to control access to data
     • Fine-grained (up to column level) controls
     Hive data and HDFS data have different semantics. How to bridge that?
  22. The Sentry HDFS Plugin
     Synchronize HDFS file permissions with higher-level abstractions.
     • Permission to read table = permission to read table’s files
     • Permission to create table = permission to write to database’s directory
     Uses HDFS ACLs for fine-grained user permissions.
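The synchronization idea on slide 22 can be pictured as a translation from a table-level privilege to an ACL entry on the table's files. The sketch below is an illustrative model only: the privilege names, the exact permission bits chosen for each, and the helper names are assumptions, not Sentry's implementation. The `group:<name>:<perms>` string follows the usual HDFS ACL entry notation.

```scala
// Sketch only: toy translation of table privileges into HDFS-style ACL entries.
object SentryAclSketch {
  sealed trait Privilege
  case object Select extends Privilege
  case object Insert extends Privilege
  case object All extends Privilege

  // Assumed mapping from a table privilege to permission bits on its files.
  def permissionBits(p: Privilege): String = p match {
    case Select => "r-x"
    case Insert => "-wx"
    case All    => "rwx"
  }

  // Format an ACL entry for a group: "group:<name>:<perms>".
  def aclEntry(group: String, p: Privilege): String =
    s"group:$group:${permissionBits(p)}"
}
```

Under this model, granting `Select` to `restrictedgroup` becomes `group:restrictedgroup:r-x` on the table's directory. A Spark application running as a member of that group can then read the files directly.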
  23. Still restricted to FS view of the world!
     • Files, directories, etc…
     • Cannot provide column-level and row-level access control.
     • Whole table or nothing.
     Still, it goes a long way in allowing Spark applications to work well with Hive data in a shared, secure environment. But...
  24. Future: RecordService
     A distributed, scalable, data access service for unified authorization in Hadoop.
  25. RecordService
  26. RecordService
     • Drop in replacement for InputFormats
     • SparkSQL: Integration with Data Sources API
       – Predicate pushdown, projection
  27. RecordService
     • Assume we had a table tpch.nation
       column_name    column_type
       n_nationkey    smallint
       n_name         string
       n_regionkey    smallint
       n_comment      string
  28. RecordService
       import com.cloudera.recordservice.spark._
       val context = new org.apache.spark.sql.SQLContext(sc)
       val df = context.load("tpch.nation", "com.cloudera.recordservice.spark")
       val results = df.groupBy("n_regionkey")
         .count()
         .collect()
  29. RecordService
     • Users can enforce Sentry permissions using views
     • Allows column and row level security
       > CREATE ROLE restrictedrole;
       > GRANT ROLE restrictedrole to GROUP restrictedgroup;
       > USE tpch;
       > CREATE VIEW nation_names AS SELECT n_nationkey, n_name FROM tpch.nation;
       > GRANT SELECT ON TABLE tpch.nation_names TO ROLE restrictedrole;
  30. RecordService
       ...
       val df = context.load("tpch.nation", "com.cloudera.recordservice.spark")
       val results = df.collect()
       >> TRecordServiceException(code:INVALID_REQUEST, message:Could not plan request., detail:AuthorizationException: User 'kostas' does not have privileges to execute 'SELECT' on: tpch.nation)
  31. RecordService
       ...
       val df = context.load("tpch.nation_names", "com.cloudera.recordservice.spark")
       val results = df.collect()
  32. RecordService
     • Documentation: http://cloudera.github.io/RecordServiceClient/
     • Beta Download: http://www.cloudera.com/content/cloudera/en/downloads/betas/recordservice/0-1-0.html
  33. Takeaways
     • Spark can be made secure today!
     • Benefits from a lot of existing Hadoop platform work
     • Still work to be done
       – Ease of use
       – Better integration with Sentry / RecordService
  34. References
     • Encryption: SPARK-6017, SPARK-5682
     • Delegation tokens: SPARK-5342
     • Sentry: http://sentry.apache.org/
       – HDFS synchronization: SENTRY-432
     • RecordService: http://cloudera.github.io/RecordServiceClient/
  35. Thanks! Questions?
