Thomas Heinis is a post-doctoral researcher in the database group at EPFL. His research focuses on scalable data management algorithms for large-scale scientific applications. Thomas is a part of the "Human Brain Project" and currently works with neuroscientists to develop the data management infrastructure necessary for scaling up brain simulations. Prior to joining EPFL, Thomas completed his Ph.D. in the Systems Group at ETH Zurich, where he pursued research in workflow execution systems as well as data provenance.
4. Data in the Simulation Sciences
Dimensions are Multiplicative!
[Figure: COVERAGE vs. RESOLUTION. Increasing the level of detail increases model size by an order of magnitude.]
5. What is the Human Brain Project?
A 10-year European initiative to understand the human brain, enabling advances in neuroscience, medicine and future computing.
A consortium of 250+ scientists in 135 research groups, from over 80 institutions and more than 20 countries in Europe and beyond.
6. Human Brain Project - Vision
Future Medicine
Symptom-based to biology-based classification
Unique signatures of diseases
Early diagnosis
Future Neuroscience
Multi-level view of brain
Causal chain of events from genes to cognition
Future Computing
Supercomputing as scientific method
Human-like intelligence
13. Scalability Challenge
[Figure: query execution time in seconds vs. dataset density in millions of elements per unit volume, for the Hilbert R-Tree, STR R-Tree and PR-Tree]
Dataset: 100K neurons, 450 million 3D cylinders, 27 GB on disk.
Range Queries: 500 uniform random queries per experiment.
Spatial density increases with dataset size; the state of the art does not scale with density.
14. FLAT: A Two-Phase Spatial Index
Key Idea: two phases, each independent of overlap:
1) SEEDING: find any one object
2) CRAWLING: traverse the neighborhood
Crawling requires reachability: use connectivity to avoid overlap.
17. FLAT: Seeding Phase
Range Query: find ALL elements inside the query region.
Seed Query: find ANY ONE element inside the query region.
An R-Tree is used for seeding, but will it scale with density?
The seeding phase avoids the R-Tree's overlap overhead: on overlap, the seed query simply picks one child arbitrarily.
Seeding is fast: page reads are roughly the height of the tree.
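Seeding can be sketched as an R-Tree descent that follows just one intersecting child per level, backtracking only if the chosen subtree holds no object in the range. A minimal 2D sketch; the node layout and names are illustrative, not FLAT's actual structures:

```python
# Hypothetical R-Tree node: 'mbr' is an axis-aligned box ((xlo, ylo), (xhi, yhi));
# 'children' holds child nodes (inner node) or (mbr, object) pairs (leaf).
class Node:
    def __init__(self, mbr, children, leaf=False):
        self.mbr, self.children, self.leaf = mbr, children, leaf

def intersects(a, b):
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    return all(a_lo[i] <= b_hi[i] and b_lo[i] <= a_hi[i] for i in range(2))

def seed_query(node, query):
    """Find ANY ONE object inside the query: descend one child per level."""
    if not intersects(node.mbr, query):
        return None
    if node.leaf:
        for obj_mbr, obj in node.children:
            if intersects(obj_mbr, query):
                return obj
        return None
    for child in node.children:  # pick one intersecting child arbitrarily
        result = seed_query(child, query)
        if result is not None:   # backtrack only on a dead end
            return result
    return None
```

Because any object in the range will do, the descent touches about one page per tree level, matching the "page reads ≈ tree height" claim above.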
18. FLAT: Crawling Phase
Starting from the seed page, neighbor links are used for a recursive graph traversal of the partition, answering the range query.
Linear complexity in the number of graph edges.
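The crawling phase can be sketched as a breadth-first traversal over neighbor links, starting from the seed page. The page representation below is a simplification for illustration:

```python
from collections import deque

def intersects(a, b):
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    return all(a_lo[i] <= b_hi[i] and b_lo[i] <= a_hi[i] for i in range(2))

def crawl(pages, seed, query):
    """BFS from the seed page over neighbor links, collecting every object
    whose MBR intersects the query. 'pages' maps a page id to a dict with
    'mbr', 'objects' (a list of (mbr, obj) pairs) and 'neighbors'."""
    results, visited, frontier = [], {seed}, deque([seed])
    while frontier:
        pid = frontier.popleft()
        page = pages[pid]
        results += [o for mbr, o in page["objects"] if intersects(mbr, query)]
        for n in page["neighbors"]:
            # follow a link only if the neighbor page can contribute results
            if n not in visited and intersects(pages[n]["mbr"], query):
                visited.add(n)
                frontier.append(n)
    return sorted(results)
```

Each page and each neighbor link is examined at most once, which is where the linear complexity in the number of graph edges comes from.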
19. FLAT: Performance Evaluation
[Figure: query execution time in seconds vs. dataset density in millions of elements per unit volume, for FLAT, the Hilbert R-Tree, STR R-Tree and PR-Tree; FLAT is up to 7.8x faster]
Dataset: 100K neurons, 450 million 3D cylinders, 27 GB on disk.
Range Queries: 500 uniform random queries per experiment.
Spatial density increases with dataset size; FLAT decouples execution time from density.
20. FLAT: Scalability
[Figure: time per result object in milliseconds vs. dataset density in millions of elements per unit volume, for FLAT, the Hilbert R-Tree, STR R-Tree and PR-Tree]
Seeding cost amortizes as result cardinality increases.
The trend is "FLAT": it scales with density.
Dataset: 100K neurons, 450 million 3D cylinders, 27 GB on disk.
Range Queries: 500 uniform random queries per experiment.
23. Interactive Exploration
Guided analysis is ubiquitous in scientific applications: the bronchial tree of the lung, the arterial tree of the heart, neural networks.
Sequences of spatial range queries follow a guiding path.
24. Interactive Query Execution
Guiding paths are not known in advance; a query sequence is executed interactively, and the path is decided only after processing each query's results.
[Timeline: for the 1st, 2nd and 3rd query, the disk retrieves query results while the CPU processes them; the disk idle time between queries is a prefetching opportunity]
Predictive prefetching hides the data retrieval cost:
Predict the next query location in the sequence.
Prefetch the data of the next query into a prefetch cache.
25. Predictive Prefetching
Existing techniques extrapolate past query locations:
Exponential Weighted Moving Average (EWMA)
Straight Line
Hilbert Prefetching
[Figure: cache hit rate in % vs. query volume in µm3 (10K to 220K), neuroscience dataset, 25 queries per sequence, for small- and large-volume queries]
Not efficient with arbitrary query volumes!
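As an illustration of these extrapolation baselines, EWMA prediction of the next query center from past centers might look as follows (the smoothing weight alpha is an illustrative tuning knob, not a value from the evaluation):

```python
def ewma_predict(centers, alpha=0.5):
    """Predict the next query center by exponentially weighting the
    displacement vectors between consecutive past query centers."""
    assert len(centers) >= 2
    # initialize the trend with the first displacement
    trend = [b - a for a, b in zip(centers[0], centers[1])]
    for prev, cur in zip(centers[1:], centers[2:]):
        delta = [c - p for p, c in zip(prev, cur)]
        # recent displacements weigh more than older ones
        trend = [alpha * d + (1 - alpha) * t for d, t in zip(delta, trend)]
    return [c + t for c, t in zip(centers[-1], trend)]
```

Such location-only extrapolation ignores what the queries returned, which is exactly the weakness SCOUT addresses next.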
26. SCOUT: Content-Aware Prefetching
Key Insight: use the content of previous query results!
Approach:
1. Inspect query results
2. Identify the guiding path
3. Predict the next query using the guiding path
The challenge is to identify the guiding path.
27. SCOUT: How Paths Are Defined
Query results consist of many primitive spatial objects.
Idea: a graph framework G(V,E) with spatial objects as vertices and edges between nearby objects, independent of the data representation.
An exact graph requires N² comparisons; a grid-hash-based construction instead yields an approximate graph representation.
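A grid-hash construction of the approximate neighbor graph can be sketched as follows: each point is hashed into a grid cell, and edges are added only between points in the same or adjacent cells, avoiding the N² all-pairs comparison. The cell size is an assumed tuning parameter, and connecting all pairs within adjacent cells (rather than exact distance checks) is what makes the graph approximate:

```python
from collections import defaultdict

def build_graph(points, cell):
    """Approximate neighbor graph over 3D points via grid hashing:
    only points in the same or adjacent cells become edge candidates."""
    grid = defaultdict(list)
    for i, (x, y, z) in enumerate(points):
        grid[(int(x // cell), int(y // cell), int(z // cell))].append(i)
    edges = set()
    for (cx, cy, cz), members in grid.items():
        # gather indices from this cell and its 26 neighbors
        near = []
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    near += grid.get((cx + dx, cy + dy, cz + dz), [])
        for i in members:
            for j in near:
                if i < j:  # store each undirected edge once
                    edges.add((i, j))
    return edges
```

With a cell size close to the typical object spacing, each point is compared against a small constant neighborhood instead of all N points.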
28. SCOUT: Guiding Path Identification
Key Insight: the guiding path goes through all queries!
Starting from a candidate set of paths, candidates are pruned iteratively against queries n, n+1, n+2, n+3, ... until the guiding path remains and the next query can be predicted.
A longer sequence yields a better prediction.
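The iterative pruning can be sketched as follows; candidate paths are represented here simply as point sequences, which is a simplification of SCOUT's actual path model:

```python
def in_box(box, p):
    """Axis-aligned containment test for a point."""
    lo, hi = box
    return all(lo[i] <= p[i] <= hi[i] for i in range(len(p)))

def prune_candidates(candidates, query_regions, contains=in_box):
    """Keep only candidate paths that pass through every query region:
    the guiding path must intersect all queries in the sequence."""
    survivors = list(candidates)
    for region in query_regions:
        survivors = [path for path in survivors
                     if any(contains(region, point) for point in path)]
    return survivors
```

Each new query in the sequence removes more candidates, which is why a longer sequence narrows the set down to the actual guiding path and improves prediction.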
29. SCOUT: Where to Prefetch
Neither the prefetch duration nor the query dimensions are known in advance.
Idea: Incremental Prefetching. Starting from the nth query in the sequence, repeatedly prefetch growing regions p1, p2, ..., pn by extrapolating the guiding path.
Policy: safest region first.
This makes prefetching independent of the query size.
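Incremental prefetching along an extrapolated guiding path might be sketched like this; the base radius and growth increment are illustrative knobs, and the path is extrapolated linearly from its last two points:

```python
def prefetch_regions(path, base, growth, n):
    """Extrapolate the guiding path beyond its last point and emit n
    prefetch regions (center, radius) of growing size along it,
    safest (nearest, smallest) region first."""
    (x0, y0), (x1, y1) = path[-2], path[-1]
    dx, dy = x1 - x0, y1 - y0            # last displacement as direction
    regions = []
    for k in range(1, n + 1):
        center = (x1 + k * dx, y1 + k * dy)
        radius = base + (k - 1) * growth  # later regions grow larger
        regions.append((center, radius))
    return regions
```

Because the regions grow as long as there is time before the next query, the scheme adapts to whatever prefetch window and query size turn out to be, matching the "safest region first" policy.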
30. SCOUT: Prediction Accuracy
Cache Hit Rate [%] = (amount of data retrieved from cache / total amount of data retrieved) x 100
[Figure: cache hit rate for EWMA, Straight Line, Hilbert and SCOUT on two visualization sequences of length 32, with query volumes of 80K µm3 (Sequence 1) and 20K µm3 (Sequence 2)]
Dataset: 100K neurons, 450 million 3D cylinders, 27 GB on disk.
SCOUT achieves 72% to 91% prediction accuracy and speeds up sequences by 2x to 14.7x.
31. SCOUT: Scalability
[Figure: cache hit rate in % vs. dataset size from 50M to 450M spatial objects; SCOUT scales with increasing dataset size]
SCOUT Overhead: prediction and prefetching run on the CPU while the disk retrieves the results of the 2nd and 3rd queries, overlapping with result processing.
[Figure: prediction time vs. query-result retrieval time in seconds for 50M to 450M spatial objects; prediction overhead is 15% to 16%]
33. Dynamic Exploration
Mesh: a collection of connected 3D polyhedra. Polyhedra share 3D vertices and faces, forming a volumetric mesh model.
Meshes enable high-precision 3D models.
Challenge: monitoring memory-resident spatial mesh models.
34. Monitoring Mesh Simulations
Problem: efficiently execute range queries over a running simulation. Each simulation time step (time step 1, 2, 3, ...) produces massive updates, followed by a monitoring phase that issues queries.
35. Data Challenge
Mesh detail increases with dataset size (now vs. future).
Highly dynamic: mesh movement is unpredictable, and updates between time steps affect the entire dataset.
Need: a solution that scales.
36. State of the Art
From coarse-grained to fine-grained:
Linear Scan
Static Spatial Indexes: R-Tree, LUR-Tree, QU-Trade
Moving Object Indexes: TPR-Tree, STRIPES (but mesh movement is inherently unpredictable)
Neither scales with size nor with detail!
37. Performance Evaluation
SETUP: neural mesh dataset of 1.32 billion tetrahedra (33 GB); 15 queries per 60 simulation time steps.
Massive updates but few queries: there are not enough queries to justify investing in index maintenance.
[Figure: total query response time in seconds for the statistical analysis and microbenchmark workloads, comparing Linear Scan, OCTREE, LUR-Tree and QU-Trade; index maintenance accounts for 72% to 99.5% of the time]
Linear scan outperforms the indexed approaches.
38. Can We Do Better?
Best of both worlds:
Reduce the search space, as an index does.
No maintenance, as a linear scan has none.
OCTOPUS: Idea
Key Insight: use mesh connectivity to retrieve query results!
Do not rely on an external data structure: directly use the in-memory mesh data, whose vertices and edges form the mesh graph.
Mesh graph traversal retrieves results in spatial proximity.
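A mesh-graph range traversal in the spirit of this idea can be sketched as follows, assuming the seed vertex lies inside the query box and the in-range region is connected (the vertex names and adjacency layout are illustrative, not OCTOPUS's actual representation):

```python
from collections import deque

def mesh_range_query(coords, adjacency, seed, box):
    """Traverse the in-memory mesh graph from a seed vertex, collecting
    all connected vertices inside the query box. No external index is
    touched, so there is no maintenance cost as the mesh moves."""
    lo, hi = box
    inside = lambda v: all(lo[i] <= coords[v][i] <= hi[i] for i in range(3))
    found, visited, frontier = [], {seed}, deque([seed])
    while frontier:
        v = frontier.popleft()
        if inside(v):
            found.append(v)
            for n in adjacency[v]:  # expand only from vertices in range
                if n not in visited:
                    visited.add(n)
                    frontier.append(n)
    return sorted(found)
```

The traversal reads vertex coordinates directly at query time, so even though every coordinate changed in the last simulation step, no structure had to be updated first.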
46. Impact
Human Brain Project: part of the toolset used every day.
February 2013: first 10-million-neuron model built, still 4 orders of magnitude smaller than the human brain.
[Figure: model size in GB vs. simulation size in number of neurons (1K to 10M) for 2006, 2008, 2010 and 2013; the 2013 model is 2.5 TB]
General applicability: material sciences, astronomy, geographical information systems.
47. Future Challenges
Enable scientific breakthroughs via scalable data analysis!
Address scientific data trends:
→ Progressively complex datasets
→ Increasingly complex scientific queries
→ Modern hardware
Approximate queries on big data:
→ Use mechanisms of learning and forgetting to manage data synopses
48. HBP Data Management Challenges
Data privacy/anonymization
Scalable querying of petascale data
Cloud analytics
Quick and efficient access to raw data
Distributed workflow execution
Provenance/reproducibility
Data personalization
49. Conclusions
Enabling data exploration is key to scientific discovery.
Prior spatial access methods do not scale with data growth.
Use spatial connectivity to achieve scalability:
→ Explicitly added (FLAT & TOUCH)
→ Implicitly present in the dataset (OCTOPUS & SCOUT)
Many exciting big data management challenges remain.