Braintalk cuso nm

Thomas Heinis is a post-doctoral researcher in the database group at EPFL. His research focuses on scalable data management algorithms for large-scale scientific applications. Thomas is a part of the "Human Brain Project" and currently works with neuroscientists to develop the data management infrastructure necessary for scaling up brain simulations. Prior to joining EPFL, Thomas completed his Ph.D. in the Systems Group at ETH Zurich, where he pursued research in workflow execution systems as well as data provenance.

Published in: Science, Technology, Education

  1. 1. Analyzing and Querying Big Scientific Data Thomas Heinis
  2. 2. Data-Driven Scientific Discovery: Scientists Are Overwhelmed with Big Data. Large Hadron Collider (LHC ATLAS): 12 Petabytes / experiment. Sloan Digital Sky Survey (SDSS): 4 Petabytes / year. Human Brain Project: ~100 Gigabytes / sec.
  3. 3. Scientific Data Growth: Scientific Data Grows Exponentially! [Figure: cumulative size of datasets in petabytes (0 to 10) by year, 2004 to 2014, for Astronomy [NRAO], Physics [LHC], Simulation [ICESS] and Gene Sequencing [EBI].]
  4. 4. Data in the Simulation Sciences: Dimensions are Multiplicative! [Figure: coverage versus resolution; increasing level of detail; increasing model size by an order of magnitude.]
  5. 5. What is the Human Brain Project? A 10-year European initiative to understand the human brain, enabling advances in neuroscience, medicine and future computing. A consortium of 250+ Scientists, 135 Research Groups, from over 80 institutions, and more than 20 countries in Europe and beyond.
  6. 6. Human Brain Project - Vision. Future Medicine: symptom-based to biology-based classification; unique signatures of diseases; early diagnosis. Future Neuroscience: multi-level view of brain; causal chain of events from genes to cognition. Future Computing: supercomputing as scientific method; human-like intelligence.
  7. 7. Brain Simulation – Wet Lab: Neuron structure & electrophysiological properties.
  8. 8. Simulating the Brain
  9. 9. Simulation Science Data Challenges: Need Scalable Spatial Access Methods. [Diagram: Observational Data, Simulation, Post Simulation Data; Spatial Modeling, Spatial Analysis, Static 3D Exploration, Interactive 3D Exploration, Dynamic 3D Exploration.]
  10. 10. Simulation Science Data Challenges: Need Scalable Spatial Access Methods. [Diagram: Observational Data, Simulation, Post Simulation Data; Spatial Modeling, Spatial Analysis, Static 3D Exploration, Interactive 3D Exploration, Dynamic 3D Exploration.]
  11. 11. Static Exploration: Efficient Spatial Index is Crucial. [Figure: Neural Tissue Model, Single Neuron 3D Model, 3D Spatial Range Query.]
  12. 12. State-of-the-Art Spatial Indexes. R-Tree: Hierarchy of Minimum Bounding Rectangles (MBR). R-Tree Variants: Hilbert packed R-Tree, STR R-Tree, PR-Tree. Structural Overlap Degrades Performance. [Figure: Range Query, Overlap.]
  13. 13. Scalability Challenge: State of the Art Does Not Scale with Density. Spatial Density Increases with Dataset Size. Dataset: 100K neurons, 450 Million 3D cylinders, 27 GB on disk. Range Queries: 500 uniform random queries for each experiment. [Figure: time in seconds (0 to 300) versus dataset density in millions of elements per unit volume (50 to 450) for Hilbert R-Tree, STR R-Tree and PR-Tree.]
  14. 14. FLAT: A Two Phase Spatial Index. Use Connectivity To Avoid Overlap. Key Idea: Two phases, each independent of overlap: 1) SEEDING: Find any one object. 2) CRAWLING: Traverse neighborhood. Requires Reachability.
  15. 15. FLAT: Reachability Problem. REQUIREMENTS: Connectivity, for accessing neighboring objects in data; Convex Dataset Geometry, to never crawl outside the query bound. Earthquake simulation datasets: No Problem! Not every dataset satisfies this requirement! [Figure: No Connectivity; No path inside query.]
  16. 16. FLAT: Reachability. Index Building: 1) Partitioning: Group spatially close elements. 2) Linking: Connect neighboring partitions. Add Connectivity → Enable Recursive Crawling.
  17. 17. FLAT: Seeding Phase. Range Query: Find ALL elements inside query. Seed Query: Find ANY ONE element inside query. R-Tree for seeding, but will it scale with density? Seed query picks one child arbitrarily. Seeding phase avoids overlap overhead in R-Tree. Seeding is fast: page reads ≈ height of tree. [Figure: Seed R-Tree, Overlap, Seed Query.]
  18. 18. FLAT: Crawling Phase. Starting from the seed page, the neighbor links are used for recursive graph traversal. Linear complexity in terms of graph edges. [Figure: Range Query, Seed Partition.]
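The two phases described on slides 14 to 18 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the FLAT implementation: partitions are held in a plain dictionary, the seed is found by a linear scan rather than by the seed R-Tree, and the names `Partition`, `seed`, `crawl` and `range_query` are made up for this example.

```python
from collections import deque
from dataclasses import dataclass, field

# Axis-aligned box: (xmin, ymin, zmin, xmax, ymax, zmax)
Box = tuple[float, float, float, float, float, float]


def intersects(a: Box, b: Box) -> bool:
    """True if two axis-aligned boxes overlap (touching counts)."""
    return all(a[i] <= b[i + 3] and b[i] <= a[i + 3] for i in range(3))


@dataclass
class Partition:
    pid: int
    mbr: Box                      # bounding box of all objects in the partition
    objects: list[Box]            # spatially close elements grouped together
    neighbors: list[int] = field(default_factory=list)  # links to adjacent partitions


def seed(partitions, query):
    """SEEDING: return the id of any one partition with an object inside the query.

    Assumption: a linear scan stands in for the seed R-Tree of the slides.
    """
    for p in partitions.values():
        if intersects(p.mbr, query) and any(intersects(o, query) for o in p.objects):
            return p.pid
    return None


def crawl(partitions, start, query):
    """CRAWLING: breadth-first traversal over neighbor links inside the query box."""
    results, seen, todo = [], {start}, deque([start])
    while todo:
        p = partitions[todo.popleft()]
        results.extend(o for o in p.objects if intersects(o, query))
        for n in p.neighbors:
            # Never crawl into a partition that lies outside the query bound.
            if n not in seen and intersects(partitions[n].mbr, query):
                seen.add(n)
                todo.append(n)
    return results


def range_query(partitions, query):
    start = seed(partitions, query)
    return [] if start is None else crawl(partitions, start, query)


# Three partitions in a row along x, linked to their direct neighbors.
partitions = {
    0: Partition(0, (0, 0, 0, 1, 1, 1), [(0.2, 0.2, 0.2, 0.4, 0.4, 0.4)], [1]),
    1: Partition(1, (1, 0, 0, 2, 1, 1), [(1.2, 0.2, 0.2, 1.4, 0.4, 0.4)], [0, 2]),
    2: Partition(2, (2, 0, 0, 3, 1, 1), [(2.5, 0.5, 0.5, 2.7, 0.7, 0.7)], [1]),
}
hits = range_query(partitions, (0.3, 0.0, 0.0, 1.3, 1.0, 1.0))
```

The crawl only finds everything if all partitions that intersect the query are reachable from the seed through links that stay inside the query, which is the reachability requirement stated on slide 15.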
  19. 19. FLAT: Performance Evaluation. Decouples Execution Time from Density. Spatial Density Increases with Dataset Size. Dataset: 100K neurons, 450 Million 3D cylinders, 27 GB on disk. Range Queries: 500 uniform random queries for each experiment. [Figure: time in seconds (0 to 300) versus dataset density in millions of elements per unit volume (50 to 450) for Hilbert R-Tree, STR R-Tree, PR-Tree and FLAT; annotation: 7.8x.]
  20. 20. FLAT: Scalability. Trend is “FLAT”, Scales With Density. Seeding cost amortizes with increase in result cardinality. Dataset: 100K neurons, 450 Million 3D cylinders, 27 GB on disk. Range Queries: 500 uniform random queries for each experiment. [Figure: time per result object in ms (0 to 5) versus dataset density in millions of elements per unit volume (50 to 450) for Hilbert R-Tree, STR R-Tree, PR-Tree and FLAT.]
  21. 21. FLAT: iPad Implementation: http://www.youtube.com/watch?v=zaUEARq-IY0
  22. 22. Simulation Science Data Challenges. [Diagram: Observational Data, Simulation, Post Simulation Data; Spatial Modeling, Static 3D Exploration, Interactive 3D Exploration, Dynamic 3D Exploration.]
  23. 23. Interactive Exploration. Guided Analysis Ubiquitous in Scientific Applications: Bronchial Tree of the Lung, Arterial Tree of the Heart, Neural Network. [Figure: Spatial Range Query Sequences, Guiding Path.]
  24. 24. Interactive Query Execution. Guiding paths are not known in advance. Interactive execution of query sequence: path decided after processing results. Prefetching Opportunity: predict next query location in the sequence; prefetch data of next query into prefetch cache. Predictive Prefetching Hides Data Retrieval Cost. [Figure: DISK and CPU timeline over the 1st, 2nd and 3rd queries: Retrieve Query Results, Process Results, Prediction, Prefetch Data.]
  25. 25. Predictive Prefetching. Existing techniques extrapolate past query locations: Exponential Weighted Moving Average (EWMA), Straight Line, Hilbert Prefetching. Not Efficient With Arbitrary Query Volume! Neuroscience data set, 25 queries in sequence. [Figure: cache hit rate in % (0 to 50) versus volume of query in µm³ (10k to 220k), from small volume queries to large volume queries.]
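As an illustration of what "extrapolate past query locations" can mean for the EWMA baseline, the sketch below smooths the displacement between consecutive query centers and adds it to the last center. The slide does not give the exact formulation used in the evaluation, so the function name `ewma_next_center`, the use of query centers and the smoothing factor `alpha` are all assumptions.

```python
def ewma_next_center(centers, alpha=0.5):
    """Predict the next query center from past centers, given as (x, y, z) tuples.

    alpha is an assumed smoothing factor: higher values trust recent moves more.
    """
    if len(centers) < 2:
        raise ValueError("need at least two past query centers")
    # Displacement vectors between consecutive queries.
    steps = [tuple(b - a for a, b in zip(p, q)) for p, q in zip(centers, centers[1:])]
    avg = steps[0]
    for s in steps[1:]:
        # Exponentially weighted average: older steps decay by (1 - alpha) each round.
        avg = tuple(alpha * si + (1 - alpha) * ai for si, ai in zip(s, avg))
    # Extrapolate from the most recent query location.
    return tuple(c + d for c, d in zip(centers[-1], avg))
```

A predictor of this kind looks only at where earlier queries were, not at what they returned, which is the limitation the content-aware approach on the following slides addresses.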
  26. 26. SCOUT: Content Aware Prefetching. Key Insight: Use previous query content! Approach: 1. Inspect query results. 2. Identify guiding path. 3. Predict next query using guiding path. Need to Identify Guiding Path?
  27. 27. SCOUT: How paths are defined. Query results = many primitive spatial objects. Idea: Graph Framework G(V,E) such that Vertices = spatial objects, Edges between nearby objects. Independence from data representation. Exact graph: N² comparisons! Approximate Graph Representation: Grid Hash based construction. [Figure: Range Query.]
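The idea of replacing the N² pairwise comparisons with a grid hash can be sketched as follows. This is a simplified stand-in, not SCOUT's construction: objects are reduced to 3D points, "nearby" means within a distance `eps`, and the grid cell size is set to `eps`, all of which are assumptions of this example. With that cell size, two points within `eps` of each other always fall into the same or adjacent cells, so only the 27 surrounding cells need to be checked.

```python
from collections import defaultdict
from itertools import product
from math import dist


def grid_graph(points, eps):
    """Return edges (i, j), i < j, between points at most eps apart."""
    # Hash every point into the grid cell that contains it.
    cells = defaultdict(list)
    for i, p in enumerate(points):
        cells[tuple(int(c // eps) for c in p)].append(i)

    edges = set()
    for cell, members in cells.items():
        # Compare only against the cell itself and its 26 neighbors.
        for offset in product((-1, 0, 1), repeat=3):
            other = tuple(c + o for c, o in zip(cell, offset))
            for i in members:
                for j in cells.get(other, ()):
                    if i < j and dist(points[i], points[j]) <= eps:
                        edges.add((i, j))
    return edges
```

For example, `grid_graph([(0, 0, 0), (0.5, 0, 0), (1.4, 0, 0), (5, 5, 5)], 1.0)` links the first three points into a chain and leaves the distant fourth point isolated.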
  28. 28. SCOUT: Guiding Path Identification. Iterative Candidate Pruning. Key Insight: Guiding path goes through all queries! Longer Sequence → Better Prediction. [Figure: queries n, n+1, n+2, n+3; Paths, Candidate set, Guiding path, Predicted Query.]
  29. 29. SCOUT: Where to Prefetch. Prefetch duration not known in advance. Query dimension not known in advance. Idea: Incremental Prefetching: repeatedly prefetch growing regions by extrapolating guiding path. Policy = safest region first. Independence from query size. [Figure: nth query in sequence, Guiding Path Exit, regions p1, p2, …, pn.]
  30. 30. SCOUT: Prediction Accuracy. Cache Hit Rate = (Amount of data retrieved from cache / Total amount of data retrieved) x 100. 72% - 91% Prediction Accuracy. SCOUT speeds up sequences up to 14.7x. Dataset: 100K neurons, 450 Million 3D cylinders, 27 GB on disk. [Figure: cache hit rate in % (0 to 100) for EWMA, Straight Line, Hilbert and SCOUT on Sequence 1 and Sequence 2 (Visualization); query volumes 80K µm³ and 20K µm³, sequence length 32 each; speedups 2x and 14.7x.]
  31. 31. SCOUT: Scalability. SCOUT scales with increase in data set size. [Figure: SCOUT cache hit rate in % (0 to 100) versus data set size in number of spatial objects (50M to 450M).] SCOUT Overhead. [Figure: time in seconds (0 to 200) versus data set size in number of spatial objects (50M to 450M), split into Prediction and Retrieve Query Results; annotation: 15-16%. CPU and DISK timeline for the 2nd and 3rd queries: Retrieve Query Results, Processing Results, Prediction, Prefetching.]
  32. 32. Simulation Science Data Challenges. [Diagram: Observational Data, Simulation, Post Simulation Data; Spatial Modeling, Static 3D Exploration, Interactive 3D Exploration, Dynamic 3D Exploration.]
  33. 33. Dynamic Exploration. Mesh: Collection of 3D Connected Polyhedra. Mesh → Enable High Precision 3D Models. Challenge: Monitoring Memory Resident Spatial Mesh Models. [Figure: Polyhedra, Connected Polyhedra, Volumetric Mesh Model, 3D Vertices, Shared Faces.]
  34. 34. Monitoring Mesh Simulations. Problem: Efficiently Execute Range Queries. [Figure: timeline over Time step 1, Time step 2 and Time step 3; labels: Simulation Time step, Monitor, Updates, Queries.]
  35. 35. Data Challenge. Highly Dynamic: Unpredictable Mesh Movement; Updates Affect Entire Dataset. Mesh Detail: Mesh Detail Increases With Dataset Size. Need: Solution That Scales. [Figure: Timestep 1, Timestep 2; Now, Future.]
  36. 36. State of the Art. Moving Object Indexes: TPR-Tree, STRIPES. Mesh Movement is Inherently Unpredictable. Static Spatial Indexes: R-Tree, LUR-Tree, QU-Trade. Linear Scan. Neither Scales with Size nor Detail! [Figure: Coarse Grained, Fine Grained.]
  37. 37. Performance Evaluation. Linear Scan Outperforms Indexed Approaches. Not Enough Queries to Invest on Index Maintenance: Few Queries, Massive Updates. SETUP: Neural Mesh Dataset: mesh of 1.32 billion tetrahedra (33 GB); 15 queries per time step over 60 simulation time steps. [Figure: total query response time in seconds (0 to 8000) on the Statistical Analysis Microbenchmark for LinearScan, OCTREE, LUR-Tree and QU-Trade; maintenance annotations: 99.5%, 80%, 72%.]
  38. 38. Can We Do Better? Best of Both Worlds: Reduce Search Space → Index Approach; No Maintenance → Linear Scan. OCTOPUS: Idea. Key Insight: Use Mesh Connectivity to Retrieve Query Results! Mesh Connectivity → Query Execution. Do Not Rely on External Data Structure: → Directly use in-memory Mesh Data. Mesh Graph Traversal: → Retrieve Results in Spatial Proximity. [Figure: Mesh Graph with Vertices and Edges.]
  39. 39. OCTOPUS: Update Oblivious Query Execution. What About Non-Convex Meshes? [Figure: Range Query at Time step 1, Time step 2 and Time step 3.]
  40. 40. OCTOPUS: Non-Convex Meshes. No Reachability! Surface Scan: Using Mesh Surface Guarantees Accuracy.
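The combination of a surface scan with mesh graph traversal from slides 38 to 40 can be sketched as below. This is a toy illustration under assumptions, not the OCTOPUS implementation: the mesh is reduced to vertex positions plus an adjacency map, the query is an axis-aligned box, and the function names are invented. The fallback for a query that contains no surface vertex is also an assumption of this sketch; the slides do not say how that case is handled.

```python
def inside(p, box):
    """True if point p = (x, y, z) lies in box = (xmin, ymin, zmin, xmax, ymax, zmax)."""
    return all(box[i] <= p[i] <= box[i + 3] for i in range(3))


def octopus_query(positions, adjacency, surface, box):
    """Return the set of vertex ids inside box.

    positions: vertex id -> current coordinates, read as-is at query time
    adjacency: vertex id -> neighbor ids (the mesh edges)
    surface:   ids of the vertices on the mesh surface
    """
    # Surface scan: every surface vertex inside the query becomes a start point,
    # so each disconnected piece of a non-convex mesh gets its own seed.
    frontier = [v for v in surface if inside(positions[v], box)]
    if not frontier:
        # Assumed fallback: no surface vertex in the query, take any one vertex inside.
        frontier = [v for v in positions if inside(positions[v], box)][:1]
    found = set(frontier)
    # Graph traversal: follow mesh edges, but only through vertices inside the query.
    while frontier:
        v = frontier.pop()
        for n in adjacency[v]:
            if n not in found and inside(positions[n], box):
                found.add(n)
                frontier.append(n)
    return found


# U-shaped toy mesh: two arms joined at the bottom, vertices 0, 3 and 6 marked as surface.
positions = {0: (0, 2, 0), 1: (0, 1, 0), 2: (0, 0, 0), 3: (1, 0, 0),
             4: (2, 0, 0), 5: (2, 1, 0), 6: (2, 2, 0)}
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 6], 6: [5]}
surface = [0, 3, 6]
# The query covers the tops of both arms but not the bottom that connects them.
result = octopus_query(positions, adjacency, surface, (-0.5, 0.5, -0.5, 2.5, 2.5, 0.5))
```

In this example a crawl from a single seed would stop at the edge of one arm, because there is no path inside the query to the other arm; seeding from the surface reaches both. The sketch reads `positions` directly at query time, so moving vertices between time steps requires no index maintenance.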
  41. 41. OCTOPUS: Mesh Deformation. Deformation: Zero Cost of surface maintenance. Scales With Massive Updates. [Figure: Time step 1, Time step 2, Time step 3; Graph changes.]
  42. 42. OCTOPUS: Mesh Detail. Scales with Mesh Resolution. Surface Points: Quadratic Increase. Non-Surface Points: Cubic Increase. Scalability: Surface grows slower than volume (and therefore dataset size)!
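The quadratic versus cubic growth can be checked on an idealized case. Assume, purely for illustration and not as a description of the talk's meshes, a cubic grid with n points along each side:

```latex
\begin{align*}
N_{\text{total}}    &= n^3 \\
N_{\text{interior}} &= (n-2)^3 \\
N_{\text{surface}}  &= n^3 - (n-2)^3 = 6n^2 - 12n + 8 \\
\frac{N_{\text{surface}}}{N_{\text{total}}} &= \frac{6n^2 - 12n + 8}{n^3} \approx \frac{6}{n} \quad \text{for large } n
\end{align*}
```

Under this assumption, doubling the resolution along each side multiplies the total number of points by about eight but the number of surface points only by about four, so the share of the dataset that a surface scan must touch shrinks as the mesh gets more detailed.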
  43. 43. OCTOPUS: Performance. 7.3-8X Speedup. [Figure: total query execution time in seconds (0 to 9000) on the Visualization Microbenchmark and the Statistical Analysis Microbenchmark for OCTOPUS, LinearScan, OCTREE, LUR-Tree and QU-Trade; annotations: 8X, 7.3X.]
  44. 44. OCTOPUS: Scalability. Scales with Mesh Detail. SETUP: Queries: 15 uniform random queries per time step, 60 time steps. [Figure: OCTOPUS Breakdown: total query execution time in seconds (0 to 140) versus mesh detail in billions of tetrahedra (0.13, 0.17, 0.26, 0.52, 1.32), split into Graph Traversal and Surface Scan; annotations: 64%, 41%.] [Figure: total query execution time in seconds (0 to 1400) versus mesh detail in billions of tetrahedra (0.13 to 1.32) for LinearScan and OCTOPUS; annotations: 8X, 10X.]
  45. 45. Algorithm Overview. FLAT: ICDE’12. SCOUT: VLDB’12. TOUCH: SIGMOD’13. GIPSY: SSDBM’13. OCTOPUS: ICDE’14. [Diagram: Observational Data, Simulation, Post Simulation Data; Spatial Modeling, Spatial Analysis, Model Validation.]
  46. 46. Impact. Human Brain Project: Part of the toolset used every day. February 2013: first 10 million neuron model built. Still 4 orders of magnitude smaller than human brain. General Applicability: Material Sciences, Astronomy, Geographical Information Systems. [Figure: model size in GB (0 to 30) versus simulation size in number of neurons (1K, 10K, 100K, 10M); labels: 2006, 2008, 2010, 2013 (2.5 TB).]
  47. 47. Future Challenges. Enable Scientific Breakthroughs via Scalable Data Analysis! Address Scientific Data Trends: → Progressively Complex Datasets → Increasingly Complex Scientific Queries → Modern Hardware. Approximate Queries on Big Data: → Use Mechanism of Learning & Forgetting to manage Data Synopses.
  48. 48. HBP Data Management Challenges: Data Privacy/Anonymization; Scalable Querying of Petascale Data; Cloud Analytics; Quick & efficient access to raw data; Distributed Workflow Execution; Provenance/Reproducibility; Data Personalization.
  49. 49. Conclusions. Enabling data exploration is key to scientific discovery. Prior spatial access methods do not scale with data growth. Use Spatial Connectivity to achieve scalability: → Explicitly Added (FLAT & TOUCH) → Implicitly Present in the Dataset (OCTOPUS & SCOUT). Many exciting big data management
  50. 50. Thank You! Collaborators: Farhan Tauheed, Anastasia Ailamaki, Felix Schürmann, Henry Markram, Sadegh Nobari, Panagiotis Karras, Laurynas Biveinis, Mirjana Pavlovic
