Combining Asynchronous Task Parallelism
and Intel SGX for Secure Deep Learning
19th European Dependable Computing Conference
Leuven, Belgium
10 April, 2024
Xavier Martorell
Universitat Politècnica de Catalunya
Valerio Schiavoni
University of Neuchâtel
Isabelly Rocha
University of Neuchâtel
Pascal Felber
University of Neuchâtel
Marcelo Pasin
University of Neuchâtel
Osman Unsal
Barcelona Supercomputing Center
[Practical Experience Report]
Secure Deep Learning
2
•Several deep-learning applications require private data
•Face recognition, speech recognition, self-driving cars, genetic sequence modeling, NLP, etc.
•High accuracy, but very high training costs
• 11.5 hours on commodity hardware for NLP models
•Performance vs. security trade-off
•Privacy, integrity
•We want to exploit HW heterogeneity: how?
➡ Intel SGX: no security guarantees on GPUs yet ("see you in 2 years")
Performance: Task-level Parallelism
3
•Addresses the performance requirements
•1 task ➡ 1 program statement
•Arbitrary granularity
•Tasks with no dependencies on other tasks run in parallel
•Native support for HW heterogeneity
•Several frameworks exist:
•OpenMP, Charm++, OmpSs
•https://pm.bsc.es/ompss
(a minimal task example follows below)
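•Illustration (not from the paper): a minimal independent-task loop in OmpSs (Mercurium) syntax. The names process_block, data, N and BLOCK are made up for the sketch; each iteration becomes an asynchronous task, tasks whose out() regions do not overlap run in parallel, and taskwait is the barrier.

#include <stdlib.h>

#define N      64      /* number of blocks (illustrative) */
#define BLOCK  1024    /* doubles per block (illustrative) */

/* each task fills one block independently */
static void process_block(double *block, int len)
{
    for (int i = 0; i < len; i++)
        block[i] = (double) i;
}

int main(void)
{
    double *data = malloc(N * BLOCK * sizeof(double));

    for (int b = 0; b < N; b++) {
        /* one asynchronous task per block; the out() region tells the
           runtime what the task writes, so independent blocks run in parallel */
        #pragma omp task out(data[b * BLOCK; BLOCK])
        process_block(&data[b * BLOCK], BLOCK);
    }

    /* barrier: wait for all pending tasks before using the results */
    #pragma omp taskwait

    free(data);
    return 0;
}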
Security: TEEs
4
•Trusted Execution Environments
•Address the security requirements
•A hardware-protected area that withstands powerful attacks
•Its content is an enclave, shielded from: a compromised OS, compromised system libraries, attackers with physical access to the machine
•Several implementations exist nowadays:
•Intel: TDX, SGX ← in this talk
•ARM: TrustZone, CCA
•AMD: SEV, SEV-SNP
•RISC-V: Keystone, MultiZone
•Google: Trusty, …
Intel Software Guard Extensions
5
[Figure: Intel SGX execution model — the untrusted part of the application creates the enclave and calls a trusted function through a call gate; the trusted function executes inside the enclave and returns to the untrusted side, all on top of the operating system]
•Available since 2015 (Skylake)
•Hardware-protected area on the die
•Split the program into two parts:
•Untrusted vs. trusted (enclaves)
•Transparent encryption/decryption
•Code integrity
•Intel Attestation Service
•Memory limits: EPC of up to 64 GB in recent server-grade CPUs, only ~100 MB on older generations
•Intel SDK (C/C++), Rust SDK, containers (SCONE, SGX-LKL, …) (a minimal SDK usage sketch follows below)
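•Illustration of the create-enclave / ecall flow from the figure above, on the untrusted side with the Intel SGX SDK. The enclave file name and ecall_process_secret() are hypothetical; the ecall proxy and the enclave_u.h header would be generated by sgx_edger8r from the enclave's EDL interface.

#include <stdio.h>
#include "sgx_urts.h"
#include "enclave_u.h"   /* untrusted proxies generated from the EDL (name assumed) */

int main(void)
{
    sgx_enclave_id_t eid = 0;
    sgx_launch_token_t token = {0};
    int updated = 0;
    int result = 0;

    /* 1. Create the enclave from its signed image (hypothetical file name) */
    sgx_status_t ret = sgx_create_enclave("enclave.signed.so", 1 /* debug */,
                                          &token, &updated, &eid, NULL);
    if (ret != SGX_SUCCESS) { fprintf(stderr, "create_enclave: %#x\n", ret); return 1; }

    /* 2. Cross the call gate: run a trusted function inside the enclave */
    ret = ecall_process_secret(eid, &result);   /* hypothetical ecall */
    if (ret != SGX_SUCCESS) { fprintf(stderr, "ecall: %#x\n", ret); return 1; }

    /* 3. Back on the untrusted side */
    printf("enclave returned %d\n", result);
    sgx_destroy_enclave(eid);
    return 0;
}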
Do we need a new system?
6
•State-of-the-art systems exist for secure computation with SGX
•At the time of this work, none would fit the bill
SGX-OmpSs: example
7
int SGX_CDECL main(int argc, char *argv[])
{
  ...
  // allocate and initialize the three DIM x DIM matrices
  double (*A)[DIM] = malloc(DIM * DIM * sizeof(double));
  double (*B)[DIM] = malloc(DIM * DIM * sizeof(double));
  double (*C)[DIM] = malloc(DIM * DIM * sizeof(double));
  fill_random(A); fill_random(B); fill_random(C);
  for (i = 0; i < DIM; i++)
    for (j = 0; j < DIM; j++)
      for (k = 0; k < DIM; k++) {
        // OmpSs pragma: each block product becomes an asynchronous task
        #pragma omp task in(A[i][k], B[k][j]) inout(C[i][j]) no_copy_deps
        // SGX ecall: the block multiplication runs inside the enclave
        ecall_matmul(global_eid, &A[i][k], &B[k][j], &C[i][j], BSIZE);
      }
  // OmpSs pragma: barrier, wait for all pending tasks
  #pragma omp taskwait
  ...
}
•Matrix multiplication: 2 OmpSs pragmas, 1 SGX ecall (a sketch of the corresponding EDL declaration follows below)
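•The ecall_matmul() proxy used above is not defined on the slide; with the Intel SGX SDK it would come from an EDL interface processed by sgx_edger8r. A plausible declaration (an assumption, not the paper's actual interface) could mark the block pointers [user_check] so the enclave accesses the caller-supplied buffers directly:

enclave {
    trusted {
        /* hypothetical trusted entry point for one BSIZE x BSIZE block product */
        public void ecall_matmul([user_check] double *a,
                                 [user_check] double *b,
                                 [user_check] double *c,
                                 int bsize);
    };
};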
SGX-OmpSs: workflow
8
[Figure: SGX-OmpSs workflow — (1) the programmer annotates SGX tasks in the OmpSs application; (2) the Mercurium compiler translates the source code + annotations into calls to Nanos++, and GCC produces OmpSs.elf; (3) an SGX compiler builds the enclave kernels, next to kernels for other devices (DFiant HDL, CUDA, MaxJ); (4) the Nanos++ enclave support drives the untrusted/trusted interaction at runtime: create enclave, call Trusted(), process secrets, return]
•Main contributions of this work, called SGX-OmpSs:
1. Integration of Intel SGX with the task-based framework OmpSs
2. Application to deep-neural-network workloads (a sketch of an enclave kernel follows below)
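•To illustrate what one of the "enclave kernels" in the workflow could contain, here is a sketch of a trusted-side ecall_matmul(): a naive BSIZE x BSIZE block multiply-accumulate running inside the enclave. This is an assumed implementation, not code from the paper, and it simplifies the leading-dimension handling by treating each argument as a contiguous bsize x bsize block, consistent with the hypothetical EDL declaration above.

/* Trusted side (compiled into the enclave): one block product C += A * B.
   Assumed sketch; a, b, c point to contiguous bsize x bsize blocks of doubles. */
void ecall_matmul(double *a, double *b, double *c, int bsize)
{
    for (int i = 0; i < bsize; i++)
        for (int j = 0; j < bsize; j++) {
            double acc = c[i * bsize + j];
            for (int k = 0; k < bsize; k++)
                acc += a[i * bsize + k] * b[k * bsize + j];
            c[i * bsize + j] = acc;   /* result written back to the caller's buffer */
        }
}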
Evaluation
9
●Intel E3-1275 (SGX 1.0), 4 cores × 2 threads, 92 MiB EPC
●See more results in the paper:
● micro-benchmarks
● energy considerations
● 5 lessons learned
●In the rest of this talk:
● one micro-benchmark
● secure task-based DL
● YOLO-Pascal, LENET-MNIST
Microbenchmarks
10
Lesson 1: large overheads for the "secure" versions
Runtime 🏃
11
[Figure: runtime in seconds (left axis) and difference vs. baseline in % (right axis) for YOLO-Pascal and LENET-MNIST, across the "sgx" baseline (no parallelism) and 2, 4, 8 worker threads; lower is better]
•YOLO-Pascal: real-time object detection on the Pascal VOC 2012 dataset
•LENET-MNIST: hand-written digits, lightweight CNN
Energy 🔋🪫
12
[Figure: energy in kJ (left axis) and difference vs. baseline in % (right axis) for YOLO-Pascal and LENET-MNIST, across the "sgx" baseline and 2, 4, 8 worker threads]
Lesson 5: predicting performance is not easy; it must be done case by case (see the paper for Lessons 2-4)
Conclusion
13
•SGX-OmpSs can accelerate the execution of secure applications
•Easy to use in any application domain
•It exploits the asynchronous task-parallelism paradigm
•For SGX-based applications, the effort to port to SGX-OmpSs is minimal
•Taskified deep-learning workloads improve runtime (by up to 94%) and reduce energy requirements (by up to 92%)
•In a (far) future: extend to FPGAs and secure GPUs