
SGX-PySpark: Secure Distributed Data Analytics


SGX-PySpark: Secure Distributed Data Analytics addresses how public cloud users can protect sensitive data while preserving the utility of data analytics.

Published in: Science

Do Le Quoc, Franz Gregor, Jatinder Singh and Christof Fetzer

Motivation
  • Data analytics has become an important component of modern cloud-based, data-driven services
  • Large-scale datasets processed by a service may contain customers' sensitive information
  • Customers need to trust both service providers and cloud providers
  • How can sensitive data be protected while preserving the same utility of data analytics?

Key idea
  • Ensure confidentiality and integrity for both code and data using trusted hardware, i.e., Intel Software Guard Extensions (SGX)
  • Execute only the sensitive parts of data analytics inside enclaves
  • Encrypt input data; decrypt and securely process it inside enclaves

Implementation
  • PySpark: widely used in industry for big data analytics
  • SCONE: enables unmodified applications to run inside Intel SGX enclaves
  • Execute the Spark Driver and the Python processes of PySpark inside enclaves using SCONE

SGX-PySpark
  • Objectives:
      • Support complex operations for big data analytics
      • Provide strong security guarantees
      • Minimize performance overhead
      • Support Python
  • Architecture:

Evaluation
  • Dataset: TPC-H Benchmark
  • ~22% overhead compared to native execution

Demo
  • GitHub repository:
  • Demo video:

Figure: latency in seconds (axis range 0 to 100) per TPC-H query (Q1, Q3, Q4, Q5, Q6, Q7, Q10, Q12, Q13, Q14, Q16, Q18, Q19), SGX-PySpark vs. native PySpark.
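
As an illustration of how an overhead figure such as the ~22% above could be derived from paired latency measurements, here is a minimal Python sketch. The poster does not say how its figure was aggregated, so averaging the per-query overheads is an assumption. The function names and the latency values are hypothetical and are not the poster's measurements.

```python
from statistics import mean


def relative_overhead(secure_s: float, native_s: float) -> float:
    """Return the relative slowdown of the secure run over the native run."""
    if native_s <= 0:
        raise ValueError("native latency must be positive")
    return (secure_s - native_s) / native_s


def mean_overhead(latencies: dict[str, tuple[float, float]]) -> float:
    """Average the per-query overheads; each value is (secure_s, native_s)."""
    return mean(relative_overhead(s, n) for s, n in latencies.values())


# Hypothetical latencies in seconds (secure, native). These are placeholder
# values for illustration only, NOT the measurements shown on the poster.
example = {"Q1": (61.0, 50.0), "Q6": (24.0, 20.0), "Q14": (36.0, 30.0)}

if __name__ == "__main__":
    for query, (secure_s, native_s) in example.items():
        print(f"{query}: {relative_overhead(secure_s, native_s):.1%}")
    print(f"mean: {mean_overhead(example):.1%}")
```

An alternative aggregation divides the summed secure latency by the summed native latency. That weights long-running queries more heavily and can give a different percentage from the same measurements.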