Combining Asynchronous Task Parallelism
and Intel SGX for Secure Deep Learning
19th European Dependable Computing Conference
Leuven, Belgium
10 April, 2024
Xavier Martorell
Universitat Politècnica de Catalunya
Valerio Schiavoni
University of Neuchâtel
Isabelly Rocha
University of Neuchâtel
Pascal Felber
University of Neuchâtel
Marcelo Pasin
University of Neuchâtel
Osman Unsal
Barcelona Supercomputing Center
[Practical Experience Report]
Secure Deep Learning
2
•Several deep-learning applications require private data
•Face recognition, speech recognition, self-driving cars, genetic sequence modeling, NLP, etc.
•High accuracy, but very high training costs
• 11.5 hours on commodity hardware for NLP models
•Performance vs. security trade-off
•Privacy, integrity
•We want to exploit HW heterogeneity: how?
➡ Intel SGX: no security guarantees on GPUs yet ("see you in 2 years")
Performance: Task-level Parallelism
3
•Addresses the performance requirements
•1 task ➡ 1 program statement
•Arbitrary granularity
•Tasks with no dependencies on other tasks run in parallel
•Native support for HW heterogeneity
•Several frameworks exist:
•OpenMP, Charm++, OmpSs
•https://pm.bsc.es/ompss
(a minimal task example follows below)
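•Illustration (not from the paper): a minimal independent-task loop in OmpSs (Mercurium) syntax. The names process_block, data, N and BLOCK are made up for the sketch; each iteration becomes an asynchronous task, tasks whose out() regions do not overlap run in parallel, and taskwait is the barrier.

#include <stdlib.h>

#define N      64      /* number of blocks (illustrative) */
#define BLOCK  1024    /* doubles per block (illustrative) */

/* each task fills one block independently */
static void process_block(double *block, int len)
{
    for (int i = 0; i < len; i++)
        block[i] = (double) i;
}

int main(void)
{
    double *data = malloc(N * BLOCK * sizeof(double));

    for (int b = 0; b < N; b++) {
        /* one asynchronous task per block; the out() region tells the
           runtime what the task writes, so independent blocks run in parallel */
        #pragma omp task out(data[b * BLOCK; BLOCK])
        process_block(&data[b * BLOCK], BLOCK);
    }

    /* barrier: wait for all pending tasks before using the results */
    #pragma omp taskwait

    free(data);
    return 0;
}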
Security: TEEs
4
•Trusted Execution Environments
•Address the security requirements
•A hardware-protected area that withstands powerful attacks
•Its content is an enclave, shielded from: a compromised OS, compromised system libraries, attackers with physical access to the machine
•Several implementations exist nowadays:
•Intel: TDX, SGX ← in this talk
•ARM: TrustZone, CCA
•AMD: SEV, SEV-SNP
•RISC-V: Keystone, MultiZone
•Google: Trusty, …
Intel Software Guard Extensions
5
[Figure: Intel SGX execution model — the untrusted part of the application creates the enclave and calls a trusted function through a call gate; the trusted function executes inside the enclave and returns to the untrusted side, all on top of the operating system]
•Available since 2015 (Skylake)
•Hardware-protected area on the die
•Split the program into two parts:
•Untrusted vs. trusted (enclaves)
•Transparent encryption/decryption
•Code integrity
•Intel Attestation Service
•Memory limits: EPC of up to 64 GB in recent server-grade CPUs, only ~100 MB on older generations
•Intel SDK (C/C++), Rust SDK, containers (SCONE, SGX-LKL, …) (a minimal SDK usage sketch follows below)
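•Illustration of the create-enclave / ecall flow from the figure above, on the untrusted side with the Intel SGX SDK. The enclave file name and ecall_process_secret() are hypothetical; the ecall proxy and the enclave_u.h header would be generated by sgx_edger8r from the enclave's EDL interface.

#include <stdio.h>
#include "sgx_urts.h"
#include "enclave_u.h"   /* untrusted proxies generated from the EDL (name assumed) */

int main(void)
{
    sgx_enclave_id_t eid = 0;
    sgx_launch_token_t token = {0};
    int updated = 0;
    int result = 0;

    /* 1. Create the enclave from its signed image (hypothetical file name) */
    sgx_status_t ret = sgx_create_enclave("enclave.signed.so", 1 /* debug */,
                                          &token, &updated, &eid, NULL);
    if (ret != SGX_SUCCESS) { fprintf(stderr, "create_enclave: %#x\n", ret); return 1; }

    /* 2. Cross the call gate: run a trusted function inside the enclave */
    ret = ecall_process_secret(eid, &result);   /* hypothetical ecall */
    if (ret != SGX_SUCCESS) { fprintf(stderr, "ecall: %#x\n", ret); return 1; }

    /* 3. Back on the untrusted side */
    printf("enclave returned %d\n", result);
    sgx_destroy_enclave(eid);
    return 0;
}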
Do we need a new system?
6
•State-of-the-art systems exist for secure computation with SGX
•At the time of this work, none would fit the bill
SGX-OmpSs: example
7
int SGX_CDECL main(int argc, char *argv[])
{
  ...
  // allocate and initialize the three DIM x DIM matrices
  double (*A)[DIM] = malloc(DIM * DIM * sizeof(double));
  double (*B)[DIM] = malloc(DIM * DIM * sizeof(double));
  double (*C)[DIM] = malloc(DIM * DIM * sizeof(double));
  fill_random(A); fill_random(B); fill_random(C);
  for (i = 0; i < DIM; i++)
    for (j = 0; j < DIM; j++)
      for (k = 0; k < DIM; k++) {
        // OmpSs pragma: each block product becomes an asynchronous task
        #pragma omp task in(A[i][k], B[k][j]) inout(C[i][j]) no_copy_deps
        // SGX ecall: the block multiplication runs inside the enclave
        ecall_matmul(global_eid, &A[i][k], &B[k][j], &C[i][j], BSIZE);
      }
  // OmpSs pragma: barrier, wait for all pending tasks
  #pragma omp taskwait
  ...
}
•Matrix multiplication: 2 OmpSs pragmas, 1 SGX ecall (a sketch of the corresponding EDL declaration follows below)
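•The ecall_matmul() proxy used above is not defined on the slide; with the Intel SGX SDK it would come from an EDL interface processed by sgx_edger8r. A plausible declaration (an assumption, not the paper's actual interface) could mark the block pointers [user_check] so the enclave accesses the caller-supplied buffers directly:

enclave {
    trusted {
        /* hypothetical trusted entry point for one BSIZE x BSIZE block product */
        public void ecall_matmul([user_check] double *a,
                                 [user_check] double *b,
                                 [user_check] double *c,
                                 int bsize);
    };
};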
SGX-OmpSs: workflow
8
[Figure: SGX-OmpSs workflow — (1) the programmer annotates SGX tasks in the OmpSs application; (2) the Mercurium compiler translates the source code + annotations into calls to Nanos++, and GCC produces OmpSs.elf; (3) an SGX compiler builds the enclave kernels, next to kernels for other devices (DFiant HDL, CUDA, MaxJ); (4) the Nanos++ enclave support drives the untrusted/trusted interaction at runtime: create enclave, call Trusted(), process secrets, return]
•Main contributions of this work, called SGX-OmpSs:
1. Integration of Intel SGX with the task-based framework OmpSs
2. Application to deep-neural-network workloads (a sketch of an enclave kernel follows below)
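•To illustrate what one of the "enclave kernels" in the workflow could contain, here is a sketch of a trusted-side ecall_matmul(): a naive BSIZE x BSIZE block multiply-accumulate running inside the enclave. This is an assumed implementation, not code from the paper, and it simplifies the leading-dimension handling by treating each argument as a contiguous bsize x bsize block, consistent with the hypothetical EDL declaration above.

/* Trusted side (compiled into the enclave): one block product C += A * B.
   Assumed sketch; a, b, c point to contiguous bsize x bsize blocks of doubles. */
void ecall_matmul(double *a, double *b, double *c, int bsize)
{
    for (int i = 0; i < bsize; i++)
        for (int j = 0; j < bsize; j++) {
            double acc = c[i * bsize + j];
            for (int k = 0; k < bsize; k++)
                acc += a[i * bsize + k] * b[k * bsize + j];
            c[i * bsize + j] = acc;   /* result written back to the caller's buffer */
        }
}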
Evaluation
9
●Intel E3-1275 (SGX 1.0), 4 cores × 2 threads, 92 MiB EPC
●See more results in the paper:
● micro-benchmarks
● energy considerations
● 5 lessons learned
●In the rest of this talk:
● one micro-benchmark
● secure task-based DL
● YOLO-Pascal, LENET-MNIST
Microbenchmarks
10
Lesson 1: large overheads for the "secure" versions
Runtime 🏃
11
[Figure: runtime in seconds (left axis) and difference vs. baseline in % (right axis) for YOLO-Pascal and LENET-MNIST, across the "sgx" baseline (no parallelism) and 2, 4, 8 worker threads; lower is better]
•YOLO-Pascal: real-time object detection on the Pascal VOC 2012 dataset
•LENET-MNIST: hand-written digits, lightweight CNN
Energy 🔋🪫
12
[Figure: energy in kJ (left axis) and difference vs. baseline in % (right axis) for YOLO-Pascal and LENET-MNIST, across the "sgx" baseline and 2, 4, 8 worker threads]
Lesson 5: predicting performance is not easy; it must be done case by case (see the paper for Lessons 2-4)
Conclusion
13
•SGX-OmpSs can accelerate the execution of secure applications
•Easy to use in any application domain
•It exploits the asynchronous task-parallelism paradigm
•For SGX-based applications, the effort to port to SGX-OmpSs is minimal
•Taskified deep-learning workloads improve runtime (by up to 94%) and reduce energy requirements (by up to 92%)
•In a (far) future: extend to FPGAs and secure GPUs