David Ojika
University of Florida
Speeding up Spark with Data Compression on Xeon+FPGA
Intel Collaborators: Piotr Majcher, Wojciech Neubauer, Suchit Subhaschandra, Ramesh Illikkal, Bhaskar Gowda, and PK Gupta
Motivation
• Big data
– Growth in volume of data
– Distributed processing: data shuffling across machines
• Data compression
– Reduce data volume, optimize application performance
– Forbes*: 60% of organizations use data compression
– A CPU-intensive operation
• Programmable accelerators (FPGAs)
– Core scaling: CPUs may be reaching their performance limits
– Rising demand for performance efficiency and lower cost in the datacenter
*https://www.altera.com/en_US/pdfs/literature/third-party/forbes_The_Coming_Data_Avalanche.pdf
Complement CPU cores with FPGAs for improved Spark performance
About Me
• 4th-year PhD student
• Interests in distributed systems and FPGA acceleration of data-intensive computing
• Research with CERN
• Past internship at Intel
• Current internship at Microsoft Research
What / Why FPGAs
• Field-programmable gate array (FPGA)
– Custom circuit
– Can accelerate specific tasks
• FPGAs offer:
– Reconfigurable architecture
– Low power and energy efficiency
• FPGA attachment technology
– Loosely-coupled
• PCIe-attached FPGA
– Tightly-coupled
• Xeon+FPGA
Intro Challenges Solutions Results Conclusion
Xeon+FPGA
• Xeon CPU and FPGA in a single processor socket
– Cache-coherent interface
• Supports “in-line” data via direct I/O
• Accelerator Function Unit (AFU)
– Reconfigurable region (user logic)
Image: Courtesy of Intel
Challenges of integrating FPGAs into big data systems
Challenging Programming Model
• Requires hardware-specific knowledge
• Long synthesis (compile) times
• Limited platform portability
Complicated Software Interface
• JVM-to-FPGA interface
• Data transfer overheads
FPGA Sharing
• Coexistence of FPGA and CPU threads is non-trivial
• How to keep the FPGA accelerator fully utilized
Reconfiguration
• FPGA reconfiguration can take from milliseconds to a few seconds
• Certain workloads cannot tolerate this downtime
…a gap between FPGA accelerator developers and big data application developers
What did we do?
a. FPGA accelerator abstraction
1. Java API for CPU offload to FPGA
2. Manage JVM-to-FPGA data transfers
3. Coordinate FPGA and CPU thread co-existence
b. FPGA-based compression plugin for Spark
• No changes to existing application required
• Compatible with existing Spark/Hadoop installations
Swif – “simplified workload-intuitive framework”
• A flexible accelerator system with ‘FPGA-accelerable’ workloads as first-class citizens
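Item a.3 (coordinating FPGA and CPU thread coexistence) can be pictured as a single submission queue in front of the device: many CPU threads offload work, but the accelerator sees one request at a time. The Java sketch below is an assumption, not Swif's API; `SharedAcceleratorSketch` and `offload` are hypothetical names, and the `device` function is a stand-in for a native FPGA call.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.UnaryOperator;

/**
 * Hypothetical sketch: many CPU threads (e.g. Spark executor tasks) share one
 * accelerator. A single-threaded executor acts as the device queue, so only one
 * request is in flight on the device while callers block on their own Future.
 */
final class SharedAcceleratorSketch {
    // Stand-in for the FPGA call; a real plugin would invoke native code here.
    private final UnaryOperator<byte[]> device;
    private final ExecutorService deviceQueue = Executors.newSingleThreadExecutor();

    SharedAcceleratorSketch(UnaryOperator<byte[]> device) {
        this.device = device;
    }

    /** Safe to call from any CPU thread; blocks until the device has processed the buffer. */
    byte[] offload(byte[] input) {
        Future<byte[]> result = deviceQueue.submit(() -> device.apply(input));
        try {
            return result.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the accelerator", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("accelerator call failed", e.getCause());
        }
    }

    /** Stops the device-queue thread once no more offloads are expected. */
    void shutdown() {
        deviceQueue.shutdown();
    }
}
```

Serializing through one queue keeps the device fully fed while hiding it from callers; a real system might instead keep several requests in flight if the hardware supports it.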
Swif Overview
- HSA: Heterogeneous Software Architecture
- AAL: Accelerator Abstraction Layer
Swif API
Compression use-case
[Figure: an App on Spark uses the Swif API for compression and decompression, which run on the FPGA or the CPU]
Design Goals:
• Plugin model
• Failure resilience
• Heterogeneity
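The three design goals can be sketched together, assuming a hypothetical `CompressionBackend` interface: each backend is a plug-in (plugin model), FPGA and CPU backends share one contract (heterogeneity), and a backend that throws is skipped in favor of the next (failure resilience). The CPU backend uses the JDK's `java.util.zip.Deflater`; none of these names come from Swif.

```java
import java.io.ByteArrayOutputStream;
import java.util.List;
import java.util.zip.Deflater;

final class ResilientCompression {
    /** Plugin model: every backend (FPGA, CPU, ...) implements the same contract. */
    interface CompressionBackend {
        String name();

        /** May throw a RuntimeException if the device is busy, absent, or failing. */
        byte[] compress(byte[] input);
    }

    /** CPU backend built on the JDK's zlib binding. */
    static final class CpuZlibBackend implements CompressionBackend {
        @Override
        public String name() {
            return "cpu-zlib";
        }

        @Override
        public byte[] compress(byte[] input) {
            Deflater deflater = new Deflater();
            try {
                deflater.setInput(input);
                deflater.finish();
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                while (!deflater.finished()) {
                    int n = deflater.deflate(chunk);
                    out.write(chunk, 0, n);
                }
                return out.toByteArray();
            } finally {
                deflater.end(); // release zlib's native memory
            }
        }
    }

    /** Failure resilience across heterogeneous backends: try them in priority order. */
    static byte[] compress(List<CompressionBackend> backends, byte[] input) {
        RuntimeException lastFailure = null;
        for (CompressionBackend backend : backends) {
            try {
                return backend.compress(input);
            } catch (RuntimeException e) {
                lastFailure = e; // e.g. FPGA unavailable or being reconfigured
            }
        }
        throw new IllegalStateException("no compression backend succeeded", lastFailure);
    }
}
```

A caller would pass something like `List.of(fpgaBackend, new CpuZlibBackend())`, where `fpgaBackend` is a hypothetical native-backed implementation; the CPU entry keeps jobs running while the FPGA is unavailable.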
Swif in Spark: How to use
1. Export
– LD_LIBRARY_PATH to include the directory containing FPGAnatives.so
2. Set
– CLASSPATH = FPGA.JAR
3. Configure
– spark-defaults.conf → compression.codec = FPGACompressorCodec
4. Run
– spark-submit --class myApp
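Step 1 matters because the JVM resolves JNI libraries by name on `java.library.path`, which HotSpot on Linux derives from `LD_LIBRARY_PATH` by default. A hedged sketch: `System.loadLibrary("FPGAnatives")` would look for `libFPGAnatives.so` on Linux (the JDK adds the `lib` prefix via `System.mapLibraryName`). Whether Swif loads its library this way is an assumption, and `NativeLoaderSketch` is a hypothetical name. Catching `UnsatisfiedLinkError` lets a plugin fall back to CPU compression instead of failing the job.

```java
/**
 * Hypothetical sketch of why step 1 matters: the JVM resolves JNI libraries by
 * name on java.library.path. If the FPGA library is missing, report it so the
 * caller can fall back to a CPU codec instead of failing the job.
 */
final class NativeLoaderSketch {
    /** Returns true if the platform library for libName (e.g. libFPGAnatives.so) loaded. */
    static boolean tryLoad(String libName) {
        try {
            System.loadLibrary(libName); // searches java.library.path
            return true;
        } catch (UnsatisfiedLinkError e) {
            return false; // not found (or not loadable) on this node
        }
    }

    /** The file name the JDK will look for, e.g. "libFPGAnatives.so" on Linux. */
    static String expectedFileName(String libName) {
        return System.mapLibraryName(libName);
    }
}
```

If the shared object really is named `FPGAnatives.so` without the `lib` prefix, `System.load` with an absolute path would be needed instead; the slide does not say which form Swif uses.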
Implementation Details
Compression in Spark
[Figure: data flow for compression in Spark on the CPU. The upstream Spark stream class writes a user buffer (uncompressed data) into the Compressor, which compresses it on the CPU; the compressed data is then read into the OutputStream. This cycle runs hundreds to thousands of times per job.]
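The CPU path in the figure follows the usual zlib streaming pattern: hand the compressor a buffer, deflate until that input is consumed, and copy the compressed bytes out. A minimal sketch using the JDK's `Deflater` (a zlib binding); `CpuCompressPath` is a hypothetical name, and the loop mirrors what `java.util.zip.DeflaterOutputStream` does internally rather than Spark's own code.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

/**
 * Sketch of the CPU path: each user buffer is written into the compressor,
 * deflated on the CPU, and the compressed bytes are read out into the output.
 */
final class CpuCompressPath {
    static byte[] compressBuffers(Iterable<byte[]> userBuffers) {
        Deflater deflater = new Deflater();
        ByteArrayOutputStream out = new ByteArrayOutputStream(); // compressed data
        byte[] chunk = new byte[64 * 1024];
        try {
            for (byte[] buf : userBuffers) {
                deflater.setInput(buf);                 // write: uncompressed data in
                while (!deflater.needsInput()) {        // compress on the CPU
                    int n = deflater.deflate(chunk);
                    out.write(chunk, 0, n);             // read: compressed bytes out
                }
            }
            deflater.finish();                          // flush the final block
            while (!deflater.finished()) {
                int n = deflater.deflate(chunk);
                out.write(chunk, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // release zlib's native memory
        }
    }
}
```

Because this loop runs for every buffer, many times per job, per-call cost matters as much as raw throughput when the compress step moves to an accelerator.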
Compression in Spark – with FPGA?
[Figure: the same data flow with the compression step handled by the FPGA as a “black box”. The upstream Spark stream class still writes user buffers (uncompressed data) and reads compressed data into the OutputStream, hundreds to thousands of times per job.]
FPGA-to-JVM Interface
• Expose FPGA accelerator functions
• Manage buffer allocation/movement
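Buffer movement is where a JVM-to-FPGA interface typically loses time: Java heap arrays can be moved by the garbage collector, so native code usually either copies them or works on direct (off-heap) `ByteBuffer`s, whose address JNI exposes through `GetDirectBufferAddress`. A sketch under that assumption; `DirectBufferStaging` is hypothetical, and the JDK `Deflater` (using its Java 11+ `ByteBuffer` overloads) stands in for the FPGA.

```java
import java.nio.ByteBuffer;
import java.util.zip.Deflater;

/**
 * Sketch: stage data in off-heap (direct) buffers that native code could address,
 * reusing them across calls. Deflater stands in for a hypothetical FPGA call.
 */
final class DirectBufferStaging {
    private final ByteBuffer in;   // host -> accelerator staging buffer
    private final ByteBuffer out;  // accelerator -> host result buffer

    DirectBufferStaging(int capacity) {
        in = ByteBuffer.allocateDirect(capacity);
        // zlib output can slightly exceed the input for incompressible data.
        out = ByteBuffer.allocateDirect(capacity + capacity / 1000 + 64);
    }

    byte[] compress(byte[] data) {
        in.clear();
        in.put(data).flip();            // move data from the JVM heap into the direct buffer
        out.clear();
        Deflater device = new Deflater();
        try {
            device.setInput(in);        // reads from the direct buffer (Java 11+)
            device.finish();
            while (!device.finished()) {
                if (device.deflate(out) == 0 && !out.hasRemaining()) {
                    throw new IllegalStateException("output buffer too small");
                }
            }
        } finally {
            device.end();
        }
        out.flip();
        byte[] result = new byte[out.remaining()];
        out.get(result);                // move the result back onto the JVM heap
        return result;
    }
}
```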
Interface with Spark
• FPGACompressor, FPGACompressorCodec
– Extendable classes
– Implement Spark's compression interfaces
FPGACompressor: base class
ZlibFPGACompressor: Compressor class for Spark
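The slide names an extendable base class (`FPGACompressor`) and a zlib subclass (`ZlibFPGACompressor`) but not their methods. The sketch below is a hypothetical shape for such a hierarchy, with stand-in names: the base class buffers stream writes into fixed-size blocks and delegates each block to an abstract `compressBlock`, and the subclass implements it with the JDK `Deflater` where a real one would call the FPGA.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.Deflater;

final class CompressorHierarchySketch {
    /** Hypothetical base class: buffers writes into blocks, delegates each block. */
    abstract static class FpgaCompressorBase extends FilterOutputStream {
        private final byte[] block;
        private int filled;

        FpgaCompressorBase(OutputStream downstream, int blockSize) {
            super(downstream);
            this.block = new byte[blockSize];
        }

        /** Compress data[0, len) and return the compressed bytes. */
        protected abstract byte[] compressBlock(byte[] data, int len);

        @Override
        public void write(int b) throws IOException {
            block[filled++] = (byte) b;
            if (filled == block.length) flushBlock();
        }

        @Override
        public void write(byte[] b, int off, int len) throws IOException {
            while (len > 0) {
                int n = Math.min(len, block.length - filled);
                System.arraycopy(b, off, block, filled, n);
                filled += n;
                off += n;
                len -= n;
                if (filled == block.length) flushBlock();
            }
        }

        private void flushBlock() throws IOException {
            if (filled == 0) return;
            out.write(compressBlock(block, filled));
            filled = 0;
        }

        @Override
        public void close() throws IOException {
            flushBlock(); // compress any partial final block
            super.close();
        }
    }

    /** CPU stand-in for a zlib-on-FPGA subclass; a real one would call native code. */
    static final class ZlibCompressorSketch extends FpgaCompressorBase {
        ZlibCompressorSketch(OutputStream downstream, int blockSize) {
            super(downstream, blockSize);
        }

        @Override
        protected byte[] compressBlock(byte[] data, int len) {
            Deflater d = new Deflater();
            try {
                d.setInput(data, 0, len);
                d.finish();
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                while (!d.finished()) buf.write(chunk, 0, d.deflate(chunk));
                return buf.toByteArray();
            } finally {
                d.end();
            }
        }
    }
}
```

Each block becomes an independent zlib stream, so a matching decompressor would need framing (for example, a length prefix per block) to find block boundaries; that is omitted here.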
Putting it all together → Swif Stack
[Figure: the Swif stack, with layers labeled Spark Apps, Spark Framework, Config., Codec, Compressor, FPGA Plugin, Shared Java Library, AAL Runtime System, Driver, and FPGA]
• Compressor: FPGA Compressor (ZLIB)
• Standard interfaces for Spark: compressor, streams, etc.
• Config: Enable/disable codec in Spark configuration settings
• Library & Commons: Generic access to FPGA from Java
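The Config layer implies that the codec is chosen by class name at run time, which is why enabling it needs only a configuration entry plus the JAR on the CLASSPATH. A hedged sketch of that mechanism using `java.util.Properties` and reflection; `CodecSelectionSketch`, `Codec`, and the fall-back-to-CPU policy are assumptions, not Spark's or Swif's code.

```java
import java.util.Properties;

final class CodecSelectionSketch {
    interface Codec {
        String name();
    }

    /** Default codec used when the FPGA codec is disabled or cannot be loaded. */
    static final class CpuCodec implements Codec {
        @Override
        public String name() {
            return "cpu";
        }
    }

    /** Instantiates the codec class named under `key`, falling back to the CPU codec. */
    static Codec fromConfig(Properties conf, String key) {
        String className = conf.getProperty(key);
        if (className == null) {
            return new CpuCodec(); // codec not configured: default path
        }
        try {
            Class<?> cls = Class.forName(className);
            return (Codec) cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            return new CpuCodec(); // class missing from CLASSPATH or of the wrong type
        }
    }
}
```

Falling back silently keeps jobs running on nodes without the plugin; a stricter design would fail fast so misconfiguration is noticed.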
Optimizations
System Optimizations
[Figure: HDFS and pinned buffer (NIO)]
• HDFS block size
– Apache Hadoop 1.x: 64 MB (128 MB from Hadoop 2.x), Cloudera: 128 MB
• NIO buffer
– Buffer size = block size
• Accelerator sharing among threads
– Granularity of task parallelism is effectively controlled by the block size
– Buffer reuse
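The NIO-buffer and buffer-reuse points can be sketched as a small pool of direct buffers, each one block in size, allocated once and recycled across tasks; the pool size also caps how many tasks can stage data for the accelerator at the same time. `BlockBufferPool` is a hypothetical name, not part of Swif.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/**
 * Sketch of "buffer size = block size" plus buffer reuse: a fixed set of direct
 * (off-heap) buffers is allocated once and recycled instead of per call.
 */
final class BlockBufferPool {
    private final BlockingQueue<ByteBuffer> free;
    private final int blockSize;

    BlockBufferPool(int buffers, int blockSize) {
        this.blockSize = blockSize;
        this.free = new ArrayBlockingQueue<>(buffers);
        for (int i = 0; i < buffers; i++) {
            free.add(ByteBuffer.allocateDirect(blockSize));
        }
    }

    /** Blocks until a buffer is free; the pool size bounds concurrent offloads. */
    ByteBuffer acquire() {
        try {
            ByteBuffer b = free.take();
            b.clear(); // reset position/limit left over from the previous task
            return b;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted waiting for a buffer", e);
        }
    }

    /** Returns a buffer; rejects buffers of the wrong size or a release into a full pool. */
    void release(ByteBuffer b) {
        if (b.capacity() != blockSize || !free.offer(b)) {
            throw new IllegalArgumentException("buffer does not belong to this pool");
        }
    }

    int available() {
        return free.size();
    }
}
```

With a 128 MB block size, each pooled buffer holds 128 MB of off-heap memory, so the pool size has to fit within the JVM's direct-memory limit (`-XX:MaxDirectMemorySize`).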
• RDD caching
– Faster FPGA access to data
Results
Raw Compression Performance
~8X speedup over CPU
Compression ratio: equal (Native)
Application Profile
"Swif: A Simplified Workload-centric Framework for FPGA-Based Computing," D. Ojika et al., FCCM 2017
[Figure: Spark worker on a Xeon+FPGA server, with a Xeon core and the FPGA AFU]
• Single-node Spark Cluster
• Focus on RDD output compression on FPGA
• Multi-executor Spark Job
– TeraSort
System Performance
[Figure: job execution time; RDD memory footprint (MB)]
3.2X speedup (job execution time)
4X memory saving (RDD memory footprint)
System Performance (Multicore)
Offload of multiple CPU threads (Spark executors) to the FPGA
- Increased FPGA utilization
- Still 2X faster than the CPU-only run
System Performance (with Data caching)
40% improvement
Conclusion
• JVM-based frameworks can efficiently leverage FPGA accelerators
– The key to efficiency is software-to-FPGA interfacing
– Treat workloads as first-class citizens
• Case study on compression offload in Spark:
– 3.2X job speedup, 4X reduction in RDD footprint
– Potential for larger savings in a multi-node cluster environment
• Storage, network bandwidth, etc.
• Swif is an ongoing effort
– More work still to be done
Swif: The Big Picture
[Figure: Swif's layers, each shown for the big data system and, in parentheses, for a generic system: Xeon + FPGA (Heterogeneous Hardware); Accelerator Abstraction Layer, AAL (Native Libraries); Shared Java Library (Scheduling); Compressor/Decompressor (Plugin); Spark (Framework); TeraSort, PageRank, … (Workloads); Big Data User (Any User)]
More Details
• “Towards FPGA as a Microservice”
– Invited talk: 12th Workshop on Virtualization in High Performance Cloud Computing (VHPC) at ISC ’17
Acknowledgments
• Intel for the internship opportunity
• University of Florida / Intel collaboration (HARP)
• Intel for the PhD fellowship
Thank You.
David Ojika, davido@ufl.edu
