Accelerating Data Analysis of Brain Tissue Simulations with Apache Spark
Judit Planas, Blue Brain Project - EPFL
#Py5SAIS
Outline
• Introduction
• Motivation
• Analysis of simulations with Spark
• Evaluation
• Conclusions
Introduction
Blue Brain Project
• Blue Brain Project (BBP): Swiss initiative targeting the digital reconstruction and simulation of the brain, hosted in Geneva (Switzerland)
• BBP is a multi-disciplinary team of people coming from different backgrounds, such as neuroscience, computer engineering, physics, mathematics and chemistry
Computational Neuroscience
[Figure: map of neuroscience study areas, including neuropharmacology, neurophysiology, neuroanatomy, neuropsychology, neurology, neurophilosophy, molecular neuroscience, systems neuroscience, computational neuroscience, and many others]
BBP's Target Contributions to Neuroscience
• Help scientists understand how the brain works internally
• Recently, BBP has been able to reproduce the electrical behavior of a neocortex fragment by means of a computer reconstruction [1]
  – Brain volume: 1/3 mm³
  – 30'000 neurons
  – 40 million synapses (connections between neurons)
  – This model has revealed novel insights into the functioning of the neocortex
• Supercomputer-based simulation of the brain provides a new tool to study the interactions between different brain regions
• Understanding the brain will not only help the diagnosis and treatment of brain diseases, but also contribute to the development of neurorobotics, neuromorphic computing and AI
Simulation Neuroscience at Different Levels

| Model                | Point neurons            | Morphologically detailed neurons | Subcellular (molecular)  |
|----------------------|--------------------------|----------------------------------|--------------------------|
| Simulator            | NEST                     | NEURON                           | STEPS                    |
| Focus                | Spiking neural networks: dynamics, size and structure | Cells with complex anatomical and biophysical properties | Detailed models of neuronal signaling pathways at the molecular level |
| HW platform          | Full-scale K supercomputer | 20 racks BG/Q                  | Full scale on Sango      |
| Human brain %        | 2.3 %                    | 0.1 %                            | 1.2 × 10⁻⁹ %             |
| # Neurons            | 1.86 × 10⁹               | 82 × 10⁶                         | 1                        |
| Biological time      | 1 s                      | 10 ms                            | 30 ms                    |
| Simulation time      | 3275 s                   | 220 s                            | 26.8 s                   |
| Sim. time normalized (1 neuron, 1 thread, 1 s bio time) | 0.003 ms | 0.28 ms | 1.3 × 10⁶ s |

All these values are approximate and should NOT be used for technical purposes [2, 3, 4]
Motivation
Brain Tissue Simulation Process
• The following steps are usually involved in brain tissue simulations:
[Pipeline: Configuration → Simulation → Analysis / Filtering → Visualization and Scripting Analysis]
Our Challenge
• Simulations can produce GBs to TBs of data very quickly
  – E.g., plastic neocortex simulation:
    • Recording 1 variable
    • 30 s of biological time → ~50 GB of output data [× N]
    • 31'000 neurons
    ⭐ But… more than 1 variable is recorded
    ⭐ And… biological time is much longer
• Most scientists use sequential scripts to analyze their data
  – The preferred language is Python
  – They do not have the time or knowledge to improve their scripts
  – Existing analysis tools exploit thread-level parallelism on a single node
Our Requirements
• Python-friendly
• Scalable
  – GBs to TBs of data will be analyzed
• Scientist workflow should not be impacted or significantly modified
• Compatible with existing tools and scripts
Analysis of Simulations with Spark
Spark Could Be Our Solution
• Spark meets most of our requirements
• How can we leverage Spark?
  – Once simulation data is generated, it can be read by a Spark cluster
  – Then, scientists' scripts and available tools can execute queries on these data, as sketched below
[Architecture: simulation output on GPFS is read through a binary file reader into the Spark cluster]
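As a hedged illustration of this flow, the sketch below loads simulation output that has already been converted to Parquet and runs a simple query on it. The path and the column names (gid, time, v) are assumptions for illustration, not the actual BBP schema.

```python
# Minimal sketch, assuming the report was converted to Parquet with columns
# "gid" (neuron ID), "time" and "v" (recorded variable); names are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bbp-analysis-sketch").getOrCreate()

report = spark.read.parquet("/gpfs/sim/report.parquet")  # hypothetical path

# Example query: mean recorded value per neuron over the whole simulation
report.groupBy("gid").agg(F.mean("v").alias("mean_v")).show(5)

# The same data can also be queried through SQL
report.createOrReplaceTempView("report")
spark.sql("SELECT time, MAX(v) AS max_v FROM report GROUP BY time").show(5)
```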
Simulation Data Analysis with Spark

Pros:
• Python support
• Scalable across the cluster
• Can be hidden from the final user
• Fits our type of queries
• SQL support can be useful
• Compatible with on-site systems

Cons:
• NumPy support is critical for us
  – RDDs: OK
  – DFs: need data conversion
• Missing native reader for our custom binary format
• UDFs in Python add overhead
• On-site system's GPFS cannot be migrated to HDFS
But still worth investigating! 🙂
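One way around the missing native reader, if the format uses fixed-length records, is PySpark's built-in sc.binaryRecords combined with NumPy decoding. The record layout below (one float32 value per neuron, one record per time step) is an assumption, not the real BBP format:

```python
# Sketch of a custom binary reader built on sc.binaryRecords; the fixed
# record layout is an assumption used only for illustration.
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="bbp-binary-reader-sketch")

N_GIDS = 31000               # values per record: one per neuron (assumed)
RECORD_LEN = N_GIDS * 4      # record size in bytes for float32 values

frames = (sc.binaryRecords("/gpfs/sim/report.bbp", RECORD_LEN)  # hypothetical path
            .map(lambda rec: np.frombuffer(rec, dtype=np.float32)))

print(frames.first()[:5])    # first five values of the first time step
```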
Evaluation
What Do We Evaluate?
We want to compare the differences between:
• Parallel structure: RDD vs. DF
• Data partitioning: manual vs. automatic
• Data container: Python list, NumPy array, or binary (byte array)
• Data source: Parquet vs. custom binary format (BBP)
From these dimensions, we selected the combinations shown next.
Selected Combinations

|                    | RDD       | RDDkey               | DFbin               | DFpylist    |
|--------------------|-----------|----------------------|---------------------|-------------|
| Parallel structure | RDD       | RDD                  | DF                  | DF          |
| Data partitioning  | Automatic | Manual, based on GID | Automatic           | Automatic   |
| Data container     | NP array  | NP array             | Binary (byte array) | Python list |

• Data source is specified in the label: BBP (custom layout, binary) or Parquet
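A minimal sketch of the RDDkey idea follows: rows are keyed by GID and explicitly repartitioned so that all records of a neuron land in the same partition. The toy data and partition count are assumptions.

```python
# Sketch of manual, GID-based partitioning (the RDDkey variant); toy data
# and the number of partitions are illustrative assumptions.
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="rddkey-sketch")

N_PARTITIONS = 320  # e.g. 40 nodes x 8 partitions each (assumed)

# Toy rows: (gid, values recorded for that neuron)
rows = sc.parallelize([(gid, np.random.rand(10).astype(np.float32))
                       for gid in range(1, 1001)])

# All records of a given GID end up in the same partition, so later
# per-GID reductions can avoid a full shuffle
rddkey = rows.partitionBy(N_PARTITIONS, lambda gid: gid % N_PARTITIONS)
```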
How Do We Evaluate?
• The original file format is structured like a matrix
• We extracted real, representative data analyses from the scientists (sketched below):
  – Reduce by GID (Global ID = neuron ID)
  – Reduce by TS (time step)
  – Compute by GID
  – Compute by TS
[Matrix layout: one row per time step (0 to N), one column per neuron (GID 1 to GID M); "by GID" operates column-wise, "by TS" row-wise]
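The two reduction patterns can be sketched on a toy RDD of (time step, frame) pairs; the frame shape and values are illustrative assumptions following the matrix description above.

```python
# Sketch of "reduce by TS" (row-wise) and "reduce by GID" (column-wise) on a
# toy dataset; shapes and values are illustrative assumptions.
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="reduce-sketch")

# One record per time step: (ts, array with one float32 value per GID)
frames = sc.parallelize([(ts, np.random.rand(100).astype(np.float32))
                         for ts in range(50)])

# Reduce by TS: collapse each time-step row into a single value
by_ts = frames.mapValues(lambda v: float(v.sum())).collect()

# Reduce by GID: element-wise reduction across rows yields one value per GID
by_gid = frames.values().reduce(lambda a, b: a + b)
```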
Where Do We Evaluate?

Hardware Platform
• On-site cluster
• 40+ compute nodes
• InfiniBand EDR (100 Gb/s)
• GPFS file system
• Each node:
  – 2 × Intel Xeon 6140
  – 72 threads (2 × 18 cores with HT)
  – 384 GB DRAM
  – 2 × SSD P4500, 1 TB each

Software
• Red Hat Enterprise Linux 7.3
• Java OpenJDK RE 1.8
• Apache Spark 2.2.1

Runtime Configuration
• Exclusive access to allocated nodes
• Spark slaves use all cores
• Spark master runs on a separate node
• Dataset size: 2 TB
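For illustration, a configuration along these lines could be expressed as below; the concrete host name and values are assumptions, not the settings actually used in the study.

```python
# Hypothetical Spark setup matching the description above: standalone master
# on a separate node, one executor spanning all hardware threads of a node.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("spark://spark-master:7077")      # master on its own node
         .config("spark.executor.cores", "72")     # use all threads per node
         .config("spark.executor.memory", "300g")  # headroom within 384 GB
         .appName("bbp-eval-sketch")
         .getOrCreate())
```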
Memory Footprint
[Chart: node memory footprint in GB (0–300) for each variant (DFpylist, RDD, RDDkey, DFbin) with BBP and Parquet data sources]
• Original dataset size (files): 47 GB
• RDD – BBP is the only version that stays close to the original file size
• Any other data source / data structure increases memory consumption by 3–6×
• DFpylist crashes due to lack of memory while performing the computation by GID (unable to continue with the memory test)
Data Loading Performance
• Loading data from the original files (BBP) does not scale: the file system is overloaded by concurrent accesses to the same files
• Loading data from Parquet scales because data is split into thousands of files, but it also hits GPFS limitations
• RDDkey takes more time than the others because of the extra repartitioning of data
• DFbin takes more time than RDD due to data conversion
[Chart: data loading execution time in seconds (0–8000) vs. number of nodes (8–40) for all six variants]
Existing tools (thread-level parallelism, 1 node): 13'820 s → up to 100× speed-up!
Reduction by GID/TS Performance
[Charts: execution time in seconds vs. number of nodes (8–40) for all six variants; Reduce by GID (0–900 s) and Reduce by TS (0–60 s)]
• RDDkey is the fastest thanks to the manual partitioning: better data distribution
• DFbin is the slowest due to data conversions
• Reduce by TS was run right after Reduce by GID: we believe that partial results were cached and, thus, it is significantly faster
Existing tools (TLP, 1 node): 5 s → worse or same performance → needs investigation
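If caching is indeed what makes the second query faster, it can be requested explicitly rather than left to Spark; a minimal sketch, on toy data as in the earlier examples:

```python
# Explicitly persist the dataset so the second analysis reuses it in memory.
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="cache-sketch")
frames = sc.parallelize([(ts, np.random.rand(100).astype(np.float32))
                         for ts in range(50)])

frames.cache()                                       # pin in executor memory
by_gid = frames.values().reduce(lambda a, b: a + b)  # first action fills cache
by_ts = frames.mapValues(lambda v: float(v.sum())).collect()  # reuses cache
```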
Computation by GID/TS Performance
[Charts: execution time in seconds vs. number of nodes (8–40) for all six variants; Comp by GID (0–400 s) and Comp by TS (0–200 s)]
• RDDkey is again the fastest
• RDD Parquet needs the inverse data conversion (from binary to NP)
• Comp by TS was run right after Comp by GID
• Significant speed-up from 8 to 16 nodes (partially due to external load on the system)
Existing tools (TLP, 1 node): 2347 s → up to 130× speed-up!
Conclusions
Conclusions
• Spark improves the performance of our data analysis (up to 130×)
• There are a few corner cases to investigate on our side
• Design decisions (data structure, data source, …) can have a huge impact on memory consumption and performance
• Spark data partitioning does a good job, but additional knowledge of the data contents can enable better partitioning
• But there are a few points that could be further improved…
Discussion Points
• NumPy support would make things easier for us
  – Pandas DataFrames cannot be applied to our use case
• Ability to control the number of executors per node dynamically at runtime (see the sketch below):
  – Use few executors per node when loading data from GPFS
  – Use the full node when doing data analysis and computations
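Lacking such runtime control, the closest approximation today is one application per phase with a different static executor layout. The sketch below shows how that could look; core counts and paths are illustrative assumptions, not the deployment actually used.

```python
# Workaround sketch: separate applications with phase-specific executor
# layouts; core counts and paths are hypothetical.
from pyspark.sql import SparkSession

# Phase 1: load from GPFS with few cores per executor to limit FS pressure
load = (SparkSession.builder.appName("load-phase")
        .config("spark.executor.cores", "4")
        .getOrCreate())
load.read.parquet("/gpfs/sim/report.parquet") \
    .write.parquet("/scratch/report.parquet")   # stage onto local SSDs
load.stop()

# Phase 2: analyze with full nodes
analysis = (SparkSession.builder.appName("analysis-phase")
            .config("spark.executor.cores", "72")
            .getOrCreate())
df = analysis.read.parquet("/scratch/report.parquet")
```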
Acknowledgments & References
• BBP In Silico Experiments and HPC teams for the support and feedback provided
• An award of computer time was provided by the ALCF Data Science Program (ADSP). This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357

[1] Henry Markram et al. Reconstruction and Simulation of Neocortical Microcircuitry. Cell, Vol. 163, Issue 2, pp. 456–492
[2] Kunkel, Susanne et al. Spiking Network Simulation Code for Petascale Computers. Frontiers in Neuroinformatics, Vol. 8, p. 78
[3] Ovcharenko, A. et al. Simulating Morphologically Detailed Neuronal Networks at Extreme Scale. In PARCO, pp. 787–796
[4] Chen, Weiliang et al. Parallel STEPS: Large Scale Stochastic Spatial Reaction-Diffusion Simulation with High Performance Computers. Frontiers in Neuroinformatics, Vol. 11, p. 13
Thank you! 🙂
Questions…?
