Braintalk cuso nm

Thomas Heinis is a post-doctoral researcher in the database group at EPFL. His research focuses on scalable data management algorithms for large-scale scientific applications. Thomas is a part of the "Human Brain Project" and currently works with neuroscientists to develop the data management infrastructure necessary for scaling up brain simulations. Prior to joining EPFL, Thomas completed his Ph.D. in the Systems Group at ETH Zurich, where he pursued research in workflow execution systems as well as data provenance.

Published in: Science, Technology, Education

Transcript

  • 1. Analyzing and Querying Big Scientific Data Thomas Heinis
  • 2. Data-Driven Scientific Discovery. Scientists are overwhelmed with big data: Large Hadron Collider (LHC ATLAS), 12 petabytes per experiment; Sloan Digital Sky Survey (SDSS), 4 petabytes per year; Human Brain Project, ~100 gigabytes per second.
  • 3. Scientific Data Growth. [Chart: cumulative size of datasets in petabytes (0 to 10) by year (2004 to 2014) for Astronomy [NRAO], Physics [LHC], Simulation [ICESS], and Gene Sequencing [EBI].] Scientific data grows exponentially!
  • 4. Data in the Simulation Sciences. [Figure labels: Coverage, Resolution, increasing level of detail, increasing model size by an order of magnitude.] Dimensions are multiplicative!
  • 5. What is the Human Brain Project? A 10-year European initiative to understand the human brain, enabling advances in neuroscience, medicine and future computing. A consortium of 250+ Scientists, 135 Research Groups, from over 80 institutions, and more than 20 countries in Europe and beyond.
  • 6. Human Brain Project - Vision. Future Medicine: symptom-based to biology-based classification; unique signatures of diseases; early diagnosis. Future Neuroscience: multi-level view of the brain; causal chain of events from genes to cognition. Future Computing: supercomputing as a scientific method; human-like intelligence.
  • 7. Brain Simulation – Wet Lab. Neuron structure & electrophysiological properties.
  • 8. Simulating the Brain
  • 9. Simulation Science Data Challenges. [Diagram labels: Observational Data, Spatial Modeling, Simulation, Post Simulation Data, Spatial Analysis; Static 3D Exploration, Interactive 3D Exploration, Dynamic 3D Exploration.] Need scalable spatial access methods.
  • 10. Simulation Science Data Challenges. [Diagram labels: Observational Data, Spatial Modeling, Simulation, Post Simulation Data, Spatial Analysis; Static 3D Exploration, Interactive 3D Exploration, Dynamic 3D Exploration.] Need scalable spatial access methods.
  • 11. Static Exploration. [Figure labels: Neural Tissue Model, Single Neuron 3D Model, 3D Spatial Range Query.] Efficient spatial index is crucial.
  • 12. State-of-the-Art Spatial Indexes. R-Tree: hierarchy of minimum bounding rectangles (MBR). R-Tree variants: Hilbert packed R-Tree, STR R-Tree, PR-Tree. [Figure labels: Overlap, Range Query.] Structural overlap degrades performance.
  • 13. Scalability Challenge. [Chart: time in seconds (0 to 300) versus dataset density in millions of elements per unit volume (50 to 450) for Hilbert R-Tree, STR R-Tree, and PR-Tree.] Dataset: 100K neurons, 450 million 3D cylinders, 27 GB on disk. Range queries: 500 uniform random queries for each experiment. Spatial density increases with dataset size. State of the art does not scale with density.
  • 14. FLAT: A Two Phase Spatial Index. Use connectivity to avoid overlap. Key idea: two phases, each independent of overlap: 1) SEEDING: find any one object; 2) CRAWLING: traverse the neighborhood (requires reachability).
  • 15. FLAT: Reachability Problem. REQUIREMENTS: Connectivity, for accessing neighboring objects in data; Convex dataset geometry, to never crawl outside the query bound. Earthquake simulation datasets: no problem! Not every dataset satisfies this requirement! [Figure labels: No path inside query, No Connectivity.]
  • 16. FLAT: Reachability. Index building: 1) Partitioning: group spatially close elements; 2) Linking: connect neighboring partitions. Add connectivity → enable recursive crawling.
  • 17. FLAT: Seeding Phase. Range query: find ALL elements inside the query. Seed query: find ANY ONE element inside the query. R-Tree for seeding, but will it scale with density? The seed query picks one child arbitrarily. The seeding phase avoids the overlap overhead in the R-Tree. Seeding is fast: page reads = ~height of tree. [Figure labels: Seed R-Tree, Overlap, Seed Query.]
  • 18. FLAT: Crawling Phase. Starting from the seed page, the neighbor links are used for recursive graph traversal. Linear complexity in terms of graph edges. [Figure labels: Seed Partition, Range Query.]
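The two phases on slides 14 to 18 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `intersects`, `seed`, and `crawl`, the dictionary layout of partitions, and the toy data are all hypothetical, and a linear scan stands in for the seed R-Tree mentioned on slide 17.

```python
from collections import deque

def intersects(a, b):
    """True if two axis-aligned boxes, each (min_corner, max_corner), overlap."""
    (amin, amax), (bmin, bmax) = a, b
    return all(lo1 <= hi2 and lo2 <= hi1
               for lo1, hi1, lo2, hi2 in zip(amin, amax, bmin, bmax))

def seed(partitions, query):
    """Seeding: return the id of ANY ONE partition that intersects the query.
    Assumption: a linear scan replaces the seed R-Tree of the slides."""
    for pid, part in partitions.items():
        if intersects(part["box"], query):
            return pid
    return None

def crawl(partitions, links, query):
    """Crawling: breadth-first traversal over neighbor links, never expanding
    a partition that lies outside the query."""
    start = seed(partitions, query)
    if start is None:
        return []
    seen, queue, result = {start}, deque([start]), []
    while queue:
        pid = queue.popleft()
        result.extend(obj for obj in partitions[pid]["objects"]
                      if intersects(obj, query))
        for nxt in links[pid]:
            if nxt not in seen and intersects(partitions[nxt]["box"], query):
                seen.add(nxt)
                queue.append(nxt)
    return result

# Toy data: three unit-cube partitions in a row, linked to their neighbors.
partitions = {
    0: {"box": ((0, 0, 0), (1, 1, 1)),
        "objects": [((0.1, 0.1, 0.1), (0.2, 0.2, 0.2)),
                    ((0.6, 0.1, 0.1), (0.9, 0.2, 0.2))]},
    1: {"box": ((1, 0, 0), (2, 1, 1)),
        "objects": [((1.1, 0.1, 0.1), (1.2, 0.2, 0.2))]},
    2: {"box": ((2, 0, 0), (3, 1, 1)),
        "objects": [((2.5, 0.1, 0.1), (2.6, 0.2, 0.2))]},
}
links = {0: [1], 1: [0, 2], 2: [1]}
found = crawl(partitions, links, ((0.5, 0, 0), (1.5, 1, 1)))
```

The crawl only reaches partitions connected to the seed through partitions that themselves intersect the query, which is why the slides list reachability as a requirement.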
  • 19. FLAT: Performance Evaluation. [Chart: time in seconds (0 to 300) versus dataset density in millions of elements per unit volume (50 to 450) for Hilbert R-Tree, STR R-Tree, PR-Tree, and FLAT; annotation: 7.8x.] Dataset: 100K neurons, 450 million 3D cylinders, 27 GB on disk. Range queries: 500 uniform random queries for each experiment. Spatial density increases with dataset size. FLAT decouples execution time from density.
  • 20. FLAT: Scalability. [Chart: time per result object in ms (0 to 5) versus dataset density in millions of elements per unit volume (50 to 450) for Hilbert R-Tree, STR R-Tree, PR-Tree, and FLAT.] Seeding cost amortizes with increase in result cardinality. Trend is “FLAT”: scales with density. Dataset: 100K neurons, 450 million 3D cylinders, 27 GB on disk. Range queries: 500 uniform random queries for each experiment.
  • 21. FLAT: iPad Implementation. http://www.youtube.com/watch?v=zaUEARq-IY0
  • 22. Simulation Science Data Challenges. [Diagram labels: Observational Data, Spatial Modeling, Simulation, Post Simulation Data; Static 3D Exploration, Interactive 3D Exploration, Dynamic 3D Exploration.]
  • 23. Interactive Exploration. Guided analysis is ubiquitous in scientific applications. [Figure labels: Bronchial Tree of the Lung, Arterial Tree of the Heart, Neural Network; Guiding Path, Spatial Range Query Sequences.]
  • 24. Interactive Query Execution. Guiding paths are not known in advance. Interactive execution of query sequence: the path is decided after processing results. [Timeline: DISK (retrieve query results) and CPU (process results) over time for the 1st, 2nd, and 3rd query, with a marked prefetching opportunity.] Prediction: predict the next query location in the sequence. Prefetch data: prefetch the data of the next query into the prefetch cache. Predictive prefetching hides data retrieval cost.
  • 25. Predictive Prefetching. Existing techniques extrapolate past query locations: Exponential Weighted Moving Average (EWMA), Straight Line, Hilbert Prefetching. [Chart: cache hit rate in % (0 to 50) versus volume of query in µm³ (10k to 220k), from small volume queries to large volume queries; neuroscience dataset, 25 queries in sequence.] Not efficient with arbitrary query volume!
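To make the EWMA baseline on slide 25 concrete, here is a minimal sketch. The slide only names the technique; the choice to smooth the displacement between consecutive query centers, the function name `ewma_predict`, and the smoothing factor `alpha` are assumptions made for illustration.

```python
def ewma_predict(centers, alpha=0.5):
    """Predict the next query center by adding an exponentially weighted
    moving average of past displacement vectors to the last center.
    `alpha` (hypothetical parameter) weights the most recent displacement."""
    if len(centers) < 2:
        return tuple(centers[-1])
    avg = None
    for prev, cur in zip(centers, centers[1:]):
        step = [c - p for p, c in zip(prev, cur)]
        if avg is None:
            avg = step  # the first displacement initializes the average
        else:
            avg = [alpha * s + (1 - alpha) * a for s, a in zip(step, avg)]
    return tuple(c + a for c, a in zip(centers[-1], avg))

# Three queries moving one unit along x per step.
prediction = ewma_predict([(0, 0, 0), (1, 0, 0), (2, 0, 0)])
```

Such a predictor looks only at where past queries were, not at what they returned, which is the limitation the next slide contrasts with.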
  • 26. SCOUT: Content Aware Prefetching. Key insight: use previous query content! Approach: 1. Inspect query results; 2. Identify guiding path; 3. Predict next query using guiding path. Need to identify the guiding path.
  • 27. SCOUT: How Paths Are Defined. Query results = many primitive spatial objects. Idea: graph framework G(V,E) such that vertices = spatial objects, with edges between nearby objects. Independence from data representation. Exact graph: N² comparisons! Approximate graph representation: grid hash based construction. [Figure label: Range Query.]
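A grid-hash construction of the approximate proximity graph on slide 27 might look like the sketch below. It is not taken from the SCOUT paper: it assumes each spatial object is reduced to a single representative point, and the function name `build_graph` and the `cell` size parameter are hypothetical.

```python
from collections import defaultdict
from itertools import product

def build_graph(points, cell):
    """Approximate proximity graph: connect objects that hash to the same
    or an adjacent grid cell, avoiding the N^2 pairwise comparisons of the
    exact graph. Returns a set of edges (i, j) with i < j."""
    grid = defaultdict(list)
    for i, p in enumerate(points):
        grid[tuple(int(c // cell) for c in p)].append(i)
    edges = set()
    for key, members in grid.items():
        for offset in product((-1, 0, 1), repeat=len(key)):
            other = tuple(k + o for k, o in zip(key, offset))
            # grid.get avoids inserting empty cells while iterating
            for i in members:
                for j in grid.get(other, ()):
                    if i < j:
                        edges.add((i, j))
    return edges

points = [(0.1, 0.1, 0.1), (0.4, 0.2, 0.1), (1.2, 0.3, 0.1), (5.0, 5.0, 5.0)]
edges = build_graph(points, cell=1.0)
```

The approximation is deliberate: two objects in adjacent cells can be almost two cell widths apart, so the cell size trades graph precision against construction cost.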
  • 28. SCOUT: Guiding Path Identification. Iterative candidate pruning. Key insight: the guiding path goes through all queries! [Figure labels: queries n, n+1, n+2, n+3; Paths, Candidate set, Guiding path, Predicted Query.] Longer sequence → better prediction.
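The pruning idea on slide 28 can be illustrated with plain set intersection. This is a simplified sketch, assuming each query result has already been reduced to a set of path identifiers; the identifiers and the function name `prune_candidates` are hypothetical.

```python
def prune_candidates(candidates, paths_in_query):
    """Keep only the candidate paths that also pass through the newest query.
    On the first query (candidates is None) every observed path is a candidate."""
    if candidates is None:
        return set(paths_in_query)
    return candidates & set(paths_in_query)

# Paths observed in three consecutive queries of a sequence.
candidates = None
for observed in [{"a", "b", "c"}, {"b", "c", "d"}, {"c", "e"}]:
    candidates = prune_candidates(candidates, observed)
```

Because the guiding path must appear in every query, each additional query can only shrink the candidate set, which matches the slide's point that longer sequences give better predictions.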
  • 29. SCOUT: Where to Prefetch. Prefetch duration is not known in advance. Query dimension is not known in advance. Idea: incremental prefetching, which repeatedly prefetches growing regions by extrapolating the guiding path. Policy = safest region first. Independence from query size. [Figure labels: nth query in sequence, Guiding Path Exit, regions p1, p2, … pn.]
  • 30. SCOUT: Prediction Accuracy. [Chart: cache hit rate in % (0 to 100) for EWMA, Straight Line, Hilbert, and SCOUT on Sequence 1, Sequence 2, and Visualization; annotated speedups: 2x and 14.7x.] Cache hit rate = (amount of data retrieved from cache / total amount of data retrieved) × 100. Query volume: 80K µm³ and 20K µm³; sequence length: 32 for both. Dataset: 100K neurons, 450 million 3D cylinders, 27 GB on disk. 72% - 91% prediction accuracy. SCOUT speeds up sequences by up to 14.7x.
  • 31. SCOUT: Scalability. [Chart: SCOUT cache hit rate in % (0 to 100) versus data set size in number of spatial objects (50M to 450M).] SCOUT scales with increase in data set size. SCOUT overhead: [Timeline: DISK (retrieve query results, prefetching) and CPU (processing results, prediction) over time for the 2nd and 3rd query.] [Chart: time in seconds (0 to 200) versus data set size in number of spatial objects (50M to 450M), split into Prediction and Retrieve Query Results; annotation: 15-16%.]
  • 32. Simulation Science Data Challenges. [Diagram labels: Observational Data, Spatial Modeling, Simulation, Post Simulation Data; Static 3D Exploration, Interactive 3D Exploration, Dynamic 3D Exploration.]
  • 33. Dynamic Exploration. Mesh: collection of 3D connected polyhedra. Mesh → enables high precision 3D models. [Figure labels: 3D Vertices, Polyhedra, Shared Faces, Connected Polyhedra, Volumetric Mesh Model.] Challenge: monitoring memory-resident spatial mesh models.
  • 34. Monitoring Mesh Simulations. [Timeline labels: Time step 1, Time step 2, Time step 3; Simulation Time step, Updates, Monitor, Queries.] Problem: efficiently execute range queries.
  • 35. Data Challenge. Highly dynamic: unpredictable mesh movement; updates affect the entire dataset. [Figure labels: Timestep 1, Timestep 2.] Mesh detail: mesh detail increases with dataset size. [Figure labels: Now, Future.] Need: a solution that scales.
  • 36. State of the Art. Moving object indexes: TPR-Tree, STRIPES. Mesh movement is inherently unpredictable. Static spatial indexes: R-Tree, LUR-Tree, QU-Trade. Linear scan. [Figure labels: Coarse Grained, Fine Grained.] Neither scales with size nor detail!
  • 37. Performance Evaluation. [Timeline labels: Simulation Time step, Monitor; Massive Updates, Few Queries.] SETUP: neural mesh dataset: 1.32 billion tetrahedra mesh (33 GB); 15 queries per 60 simulation time step. [Chart: total query response time in seconds (0 to 8000) on the Statistical Analysis microbenchmark for LinearScan, OCTREE, LUR-Tree, and QU-Trade; maintenance annotations: 99.5%, 80%, 72%.] Linear scan outperforms indexed approaches. Not enough queries to invest in index maintenance.
  • 38. Can We Do Better? Best of both worlds: reduce search space → index approach; no maintenance → linear scan. OCTOPUS idea: do not rely on an external data structure → directly use in-memory mesh data; mesh graph traversal → retrieve results in spatial proximity. Mesh connectivity → query execution. [Figure labels: Mesh Graph, Vertices, Edges.] Key insight: use mesh connectivity to retrieve query results!
  • 39. OCTOPUS. Update-oblivious query execution. [Figure labels: Range Query; Time step 1, Time step 2, Time step 3.] What about non-convex meshes?
  • 40. OCTOPUS: Non-Convex Meshes. No reachability! Surface scan. Using the mesh surface guarantees accuracy.
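Slides 38 to 40 describe answering a range query by scanning the mesh surface and then traversing mesh connectivity. The sketch below illustrates that combination under stated assumptions and is not the published OCTOPUS algorithm: the names `inside` and `octopus_query`, the dictionary-based mesh layout, and the toy chain mesh are hypothetical, and a query that contains no surface vertex at all is not handled.

```python
from collections import deque

def inside(p, box):
    """True if point p lies inside the axis-aligned box (lo, hi)."""
    lo, hi = box
    return all(l <= c <= h for c, l, h in zip(p, lo, hi))

def octopus_query(positions, adjacency, surface, box):
    """Surface scan: find every surface vertex inside the query box.
    Graph traversal: follow mesh edges from those seeds, staying inside
    the box. Positions are read at query time, so no separate index has
    to be maintained when the mesh deforms."""
    seeds = [v for v in surface if inside(positions[v], box)]
    seen, queue = set(seeds), deque(seeds)
    while queue:
        v = queue.popleft()
        for w in adjacency[v]:
            if w not in seen and inside(positions[w], box):
                seen.add(w)
                queue.append(w)
    return seen

# Toy mesh: five vertices on a line; the two end vertices form the surface.
positions = {i: (float(i), 0.0, 0.0) for i in range(5)}
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
surface = {0, 4}
hits = octopus_query(positions, adjacency, surface,
                     ((-0.5, -0.5, -0.5), (2.5, 0.5, 0.5)))
```

Seeding from every surface vertex in the box is what lets the traversal reach each piece of a non-convex mesh that enters the query region separately.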
  • 41. OCTOPUS: Mesh Deformation. Deformation: zero cost of surface maintenance. [Figure labels: Time step 1, Time step 2, Time step 3; Graph changes.] Scales with massive updates.
  • 42. OCTOPUS: Mesh Detail. Scalability: the surface grows slower than the volume (and therefore dataset size)! Surface points: quadratic increase. Non-surface points: cubic increase. Scales with mesh resolution.
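A back-of-the-envelope check of the quadratic-versus-cubic claim on slide 42, assuming (purely for illustration) a cube-shaped mesh with n ≥ 2 vertices along each edge. Real neural meshes are irregular, so only the growth rates carry over, not the constants:

```latex
\begin{align*}
N_{\text{total}} &= n^3, \\
N_{\text{surface}} &= n^3 - (n-2)^3 = 6n^2 - 12n + 8, \\
\frac{N_{\text{surface}}}{N_{\text{total}}} &= \frac{6n^2 - 12n + 8}{n^3}
  \approx \frac{6}{n} \quad (n \gg 1), \\
N_{\text{surface}} &\in \Theta\!\left(N_{\text{total}}^{2/3}\right).
\end{align*}
```

Under this assumption, doubling the resolution n multiplies the dataset size by eight but the surface by only about four, so the share of the data a surface scan has to touch shrinks as the mesh gets more detailed.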
  • 43. OCTOPUS: Performance. [Chart: total query execution time in seconds (0 to 9000) on the Visualization microbenchmark and the Statistical Analysis microbenchmark for OCTOPUS, LinearScan, OCTREE, LUR-Tree, and QU-Trade; annotated speedups: 8X and 7.3X.] 7.3-8X speedup.
  • 44. OCTOPUS: Scalability. [Chart: OCTOPUS breakdown, total query execution time in seconds (0 to 140) versus mesh detail in billions of tetrahedra (0.13, 0.17, 0.26, 0.52, 1.32), split into Graph Traversal and Surface Scan; annotations: 64%, 41%.] [Chart: total query execution time in seconds (0 to 1400) versus mesh detail in billions of tetrahedra (0.13, 0.17, 0.26, 0.52, 1.32) for LinearScan and OCTOPUS; annotations: 8X, 10X.] Scales with mesh detail. SETUP: queries: 15 uniform random queries per time step, 60 time steps.
  • 45. Algorithm Overview. [Diagram labels: Observational Data, Spatial Modeling, Simulation, Post Simulation Data, Spatial Analysis, Model Validation.] OCTOPUS: ICDE’14; FLAT: ICDE’12; SCOUT: VLDB’12; TOUCH: SIGMOD’13; GIPSY: SSDBM’13.
  • 46. Impact. Human Brain Project: part of the toolset used every day. February 2013: first 10 million neuron model built; still 4 orders of magnitude smaller than the human brain. General applicability: material sciences, astronomy, geographical information systems. [Chart: model size in GB (0 to 30) versus simulation size in number of neurons (1K, 10K, 100K, 10M); year labels: 2006, 2008, 2010, 2013 (2.5 TB).]
  • 47. Future Challenges. Enable scientific breakthroughs via scalable data analysis! Address scientific data trends: → progressively complex datasets; → increasingly complex scientific queries; → modern hardware. Approximate queries on big data: → use a mechanism of learning & forgetting to manage data synopses.
  • 48. HBP Data Management Challenges. Data privacy/anonymization; scalable querying of petascale data; cloud analytics; quick & efficient access to raw data; distributed workflow execution; provenance/reproducibility; data personalization.
  • 49. Conclusions. Enabling data exploration is key to scientific discovery. Prior spatial access methods do not scale with data growth. Use spatial connectivity to achieve scalability: → explicitly added (FLAT & TOUCH); → implicitly present in the dataset (OCTOPUS & SCOUT). Many exciting big data management
  • 50. Thank You! Collaborators: Farhan Tauheed, Anastasia Ailamaki, Felix Schürmann, Henry Markram, Sadegh Nobari, Panagiotis Karras, Laurynas Biveinis, Mirjana Pavlovic