FPGAs have recently gained attention throughout the industry because of their performance-per-watt efficiency, re-programmable flexibility, and wide range of applicability. In anticipation of this trend, Intel has been planning a new product line that offers a Xeon processor with an integrated FPGA, enabling datacenters to deploy high-performance accelerators easily and with a relatively low cost of ownership. The new Xeon+FPGA platform is supported by a software ecosystem that removes the obstacles traditional FPGA devices faced, such as datacenter-wide accelerator deployment.
In this session, Intel will present their design and implementation of FPGA as a supplement to vcores in Spark's YARN mode to accelerate SparkML applications on the Intel Xeon+FPGA platform. In particular, they have added new options to Spark core that provide an interface for users to describe the accelerator dependencies of their applications. The FPGA information in the Spark context is used by the new APIs and the DRF (Dominant Resource Fairness) policy implemented on YARN to schedule Spark executors onto hosts with Xeon+FPGA installed. Experimental results using ALS scoring applications that accelerate GEneral Matrix-Matrix Multiplication (GEMM) operations demonstrate that Xeon+FPGA improves FLOPS throughput by 1.5× compared to a CPU-only cluster.
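The DRF scheduling decision mentioned above can be illustrated with a minimal sketch. This is a generic Dominant Resource Fairness selection, not YARN's actual scheduler code, and the resource names are assumptions:

```python
# Minimal Dominant Resource Fairness (DRF) sketch: grant the next
# container to the application with the smallest dominant share.
# Generic illustration only, not YARN's implementation.
def drf_next(capacity, allocations, pending):
    """capacity: resource -> total; allocations: app -> resource -> used;
    pending: set of apps that still have unmet demand."""
    def dominant_share(app):
        # dominant share = the app's largest per-resource usage fraction
        return max(allocations[app][r] / capacity[r] for r in capacity)
    return min(pending, key=dominant_share)
```

For example, with capacity {"vcore": 10, "fpga": 4}, an app holding 2 vcores (dominant share 0.2) is served before an app holding 2 FPGAs (dominant share 0.5).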
3. Outline
• What is an FPGA
• Intel® Accelerator Portfolio
• Intel® FPGA Hardware Platform
• Intel® FPGA Software
• Intel® Accelerator Abstraction Layer Enabling
– Spark
– YARN
– Spark MLlib
• Spark-perf Performance Test Results
– Netlib vs. Intel® DAAL
• Intel® FPGA Acceleration Demo
4. What is an FPGA
• 1000s of hard DSPs (floating-point units)
• 1000s of hard "M20K" SRAMs (2.5 KB per SRAM)
• Sea of programmable logic and routing
• Extreme degree of customization:
– Arbitrary SRAM compositions (scratchpad, cache, FIFO, ...)
– Arbitrary bit widths, mixed bit widths, etc.
[Figure: array of DSP (×, +) and M20K SRAM blocks; figures courtesy of Gordon Chiu]
FPGAs are well positioned to deliver high performance while providing flexibility.
5. Intel® Portfolio of Accelerator Options
[Figure: acceleration spectrum from software flexibility to fixed hardware acceleration]
• IA processors: standard workloads; software flexibility
• Integrated FPGA: CPU-socket-compatible access to FPGA capabilities
• Discrete FPGA: scalable range of FPGA options (I/O, TDP, price, memory, features)
• Fixed-function acceleration: built-in standard platform acceleration, highly optimized
Flexible workloads can combine these options (and/or).
7. Intel® Accelerator Abstraction Layer (AAL)
[Figure: CPU/FPGA stack]
• CPU side: the user application (user software interface) sits on top of the FPGA runtime software (Intel® provided infrastructure).
• FPGA side: infrastructure IP (UPI, PCIe*, HSSI, FPGA management) exposes the core cache interface over UPI/PCIe and HSSI; user-developed FPGA IP implements the application-specific functions.
• The new blocks simplify code development.
8. Single Node Software Stack
Running an FPGA-accelerated app:
1. Prepare the host with the required software and libraries.
2. Write/modify the application to call the accelerator API through the shim layer.
3. Use the FPGA Runtime CLI to flash the FPGA device with the FPGA IP corresponding to the accelerator API.
4. Run the application.
Stack (top to bottom): user application → SHIMs to key industry frameworks (analytics, compression, genomics, networking workloads) → accelerator API → accelerator support software → Intel software infrastructure → Intel hardware infrastructure (CPU) → FPGA IP (FPGA).
9. Vision of FPGAs in the Data Center
[Figure: software-defined infrastructure]
Public and private cloud users launch workloads into a software-defined infrastructure. YARN (FPGA enabled) (1) places each workload on the compute resource pool of Intel® Xeon® processors with FPGAs, backed by storage and network, and (2) programs the FPGA (statically or dynamically) with workload accelerator IP held in shared storage: Intel®-developed, 3rd-party-developed, or end-user-developed IP.
10. New Options to Spark User Interface
spark-submit
…
--conf spark.executor.fpga.type=MCP | DCP
--conf spark.executor.fpga.ip=AFU1_UUID:AFU1_CNT
--conf spark.executor.fpga.ip=AFU2_UUID:AFU2_CNT
…
--num-executors 10
• spark.executor.fpga.type: string used to filter devices that meet the resource requirements
• spark.executor.fpga.ip: pair of IP UUID and IP count with a colon delimiter; repeatable
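A minimal sketch of how these repeatable options could be folded into a per-UUID count map on the client side. This is an assumption about the handling, not Spark's actual code:

```python
# Sketch (assumption): fold repeatable --conf spark.executor.fpga.ip
# settings into one map of IP UUID -> requested count.
def parse_fpga_confs(conf_pairs):
    """conf_pairs: list of (key, value) tuples from --conf options."""
    fpga_type = None
    ip_counts = {}
    for key, value in conf_pairs:
        if key == "spark.executor.fpga.type":
            fpga_type = value
        elif key == "spark.executor.fpga.ip":
            # "AFU1_UUID:2" -> ("AFU1_UUID", 2); repeats accumulate
            uuid, count = value.split(":")
            ip_counts[uuid] = ip_counts.get(uuid, 0) + int(count)
    return fpga_type, ip_counts
```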
11. Spark AM Modifications
The Spark ApplicationMaster sets FPGA constraints based on YARN's new resource model (YARN-3926).
Example:
spark-submit
…
--conf spark.executor.fpga.type=MCP
--conf spark.executor.fpga.ip=ANALYTICS:1
--conf spark.executor.fpga.ip=GENOMICS:1
…
--num-executors 10
maps to:
resourceCapability.setResourceValue("MCP", 2)
fpga_type = MCP
fpga_ips = ANALYTICS:1,GENOMICS:1
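The value 2 passed to setResourceValue is the total device count across the requested IPs. A one-line sketch of that aggregation (an assumption about the AM's bookkeeping, not its actual code):

```python
# Sketch (assumption): total FPGA devices requested per executor,
# summed from the comma-separated UUID:count list.
def fpga_resource_value(fpga_ips):
    # "ANALYTICS:1,GENOMICS:1" -> 1 + 1 = 2 devices
    return sum(int(pair.split(":")[1]) for pair in fpga_ips.split(","))
```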
12. Intel FPGA Software Enabling on YARN – YARN-5983
[Figure: Resource Manager and Node Manager components]
• Resource Manager: ApplicationMasterService, ResourceTrackerService, ResourceScheduler.
• Node Manager: ContainerManager, NodeStatusUpdater, HWUtils, and a YARN-configured IntelFPGAPlugin loaded through the FPGAPluginManager.
• The LocalResourceAllocator tracks devices in a table (Dev_ID, Platform_ID, IP_ID, Status); containers reach the device through the FPGA Runtime Software.
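The allocator's role can be sketched against that four-column table. This is a hypothetical illustration of the bookkeeping, not the plugin's actual code:

```python
# Sketch (assumption): pick a free device whose IP matches the request
# from a (Dev_ID, Platform_ID, IP_ID, Status) table, marking it Busy.
def allocate(table, wanted_ip):
    for row in table:
        if row["IP_ID"] == wanted_ip and row["Status"] == "Free":
            row["Status"] = "Busy"   # reserve for the container
            return row["Dev_ID"]
    return None                      # no matching free device
```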
13. Launching and Placing Workloads
[Figure: eight-step flow across Application Master, Resource Manager, and Node Manager]
The Application Master requests FPGA-capable containers from the Resource Manager through AMRMClientAsync (ApplicationMasterService, ResourceTrackerService, ResourceScheduler) and launches them on the Node Manager through NMClientAsync (ContainerManager, NodeStatusUpdater). On the node, HWUtils, the YARN-configured IntelFPGAPlugin, the FPGAPluginManager, and the LocalResourceAllocator (Dev_ID, Platform_ID, IP_ID, Status) bind each container to a device via the FPGA Runtime Software.
14. Example: Software Stack for GEMM offloading
Stack (top to bottom):
• User application on the Spark & Hadoop framework
• Intel® DAAL (pre/post processing)
• Shim layer exposing the GEMM Accelerator API (accelerator support SW)
• FPGA Runtime and driver (Intel® SW infrastructure, CPU)
• Infrastructure IP plus GEMM IP #1, the accelerator implementation in RTL (Intel® HW infrastructure, FPGA)
15. Xeon®+FPGA with Intel® DAAL integration
[Figure: the Spark MLlib block matrix multiplication workload runs on the Intel® DAAL framework with Intel® Threading Building Blocks and MKL; an SGEMM adapter routes work either to FPGA GEMM (via the FPGA Runtime) or to CPU SGEMM (via MKL)]
• Users/developers are completely agnostic to the presence of an FPGA on the server.
• The Intel® DAAL framework is aware of the presence of the FPGA on the platform.
• The SGEMM adapter schedules jobs on the FPGA if one is available; otherwise it calls MKL GEMM on a separate thread.
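The adapter's dispatch logic described above can be sketched as follows; fpga_gemm and cpu_gemm are hypothetical callables standing in for the FPGA Runtime and MKL paths:

```python
# Sketch (assumption): FPGA-or-CPU dispatch. Offload when an FPGA is
# available, else run the CPU GEMM on a separate thread as on the slide.
import threading

def sgemm_adapter(a, b, fpga_available, fpga_gemm, cpu_gemm):
    if fpga_available:
        return fpga_gemm(a, b)            # offload to the FPGA IP
    result = {}
    t = threading.Thread(target=lambda: result.update(out=cpu_gemm(a, b)))
    t.start()
    t.join()                              # caller blocks until GEMM is done
    return result["out"]
```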
16. Spark MLlib Modifications to offload GEMM
Accelerating linalg.BLAS.gemm():
1. Create the DAAL context.
2. Instantiate the batch processing class for GEMM.
3. Set the input matrices.
4. Execute GEMM through DAAL.
5. Retrieve and set the GEMM result.
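For reference, the operation being offloaded is the standard BLAS GEMM, C := alpha·A·B + beta·C. A plain-Python sketch of what the DAAL/FPGA path computes:

```python
# Reference GEMM: C := alpha * A @ B + beta * C (dense, row-major lists).
# Illustrates the computation only; the real path uses DAAL/MKL/FPGA.
def gemm(alpha, a, b, beta, c):
    for i in range(len(a)):
        for j in range(len(b[0])):
            acc = sum(a[i][p] * b[p][j] for p in range(len(b)))
            c[i][j] = alpha * acc + beta * c[i][j]
    return c
```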
17. Performance test results
[Chart: GEMM micro performance benchmark, GEMM time (ms) vs. block size (128–4096), Intel® DAAL vs. OpenBLAS; lower is better. Normalized execution time: Intel® DAAL ≈ 1, OpenBLAS ≈ 4; lower is better.]
Hardware environment
• CPU model: Genuine Intel® CPU @ 2.40GHz
• Memory: 250 GB
• FPGA model: BDX+MCP
• Network: 10 Gigabit
Software environment
• Spark base version: v1.6.3
• Observation: Spark + Intel® DAAL (CPU only)
• Baseline: Spark + OpenBLAS
• Workload: Spark-perf
• Executor total memory: 160 GB
• Executor total vcores: 22
• Executor number: 1, 2, 4, 8, 16
• Partition number: executor number × 2
• Matrix sizes: 128×512×512, 256×1024×1024, 512×4096×4096, 1024×10240×10240, 2048×20480×20480, 4096×20480×20480
• Block sizes: 128, 256, 512, 1024, 2048, 4096
• Intel® DAAL version: v1.2
• Intel® AAL version: v5.0.3
18. Performance test results
(Hardware and software environment identical to the previous slide.)
[Chart: Spark-perf block-matrix-mult benchmark, time (s) vs. matrix size (128×512×512 through 4096×20480×20480), OpenBLAS vs. Intel® DAAL; lower is better. Normalized execution time: Intel® DAAL ≈ 1, OpenBLAS ≈ 2; lower is better.]
20. Conclusion
• Intel is leading the compute evolution with innovative products.
• Intel also provides software stack solutions optimized for its hardware.
• Using SparkML with Intel® DAAL/MKL shows better performance compared to OpenBLAS.
• Additional performance improvement can be achieved on the same software stack by using the Intel® Xeon®+FPGA platform.