RAMSES: Robust Analytic Models for Science at Extreme Scales


RAMSES: A new project in data-driven analytical modeling of distributed systems

RAMSES is a new DOE-funded project on the end-to-end analytical performance modeling of science workflows in extreme-scale science environments. It aims to link multiple threads of inquiry that have not, until now, been adequately connected: namely, first-principles performance modeling within individual sub-disciplines (e.g., networks, storage systems, applications), and data-driven methods for evaluating, calibrating, and synthesizing models of complex phenomena. What makes this fusion necessary is the drive to explain, predict, and optimize not just individual system components but complex end-to-end workflows. In this talk, I will introduce the goals of the project and some aspects of our technical approach.

Published in: Science

  1. 1. Gagan Agrawal (1, *), Prasanna Balaprakash (2), Ian Foster (2, *), Raj Kettimuthu (2), Sven Leyffer (2), Vitali Morozov (2), Todd Munson (2), Nagi Rao (3, *), Saday Sadayappan (1), Brad Settlemyer (3), Brian Tierney (4, *), Don Towsley (5, *), Venkat Vishwanath (2), Yao Zhang (2). Affiliations: (1) Ohio State University, (2) Argonne National Laboratory, (3) Oak Ridge National Laboratory, (4) ESnet, (5) UMass Amherst; (*) Co-PIs. Advanced Scientific Computing Research. Program manager: Rich Carlson
  2. 2. Prediction, explanation, & optimization are challenging for even “simple” E2E workflows [Diagram: source data store, wide area network, destination data store] For example, file transfer, for which we want to: • Predict achievable throughput for a specific configuration • Explain factors influencing performance • Optimize parameter values to achieve high speeds
  3. 3. Prediction, explanation, & optimization are challenging for even “simple” E2E workflows [Diagram: source data transfer node (application, OS, FS stack, HBA/HCA, TCP, IP, NIC), LAN switch, and router; wide area network; router, LAN switch, and destination data transfer node (same stack); storage array; Lustre file system (OSS, MDS, OST, MDT)] + diverse environments + diverse workloads + contention
  4. 4. 85 Gbps sustained disk-to-disk over 100 Gbps network, Ottawa to New Orleans. Raj Kettimuthu and team, Argonne
  5. 5. High-speed transfers to/from AWS cloud, via Globus transfer service • UChicago to AWS S3 (US region): Sustained 2 Gbps – 2 GridFTP servers, GPFS file system at UChicago – Multi-part upload via 16 concurrent HTTP connections • AWS to AWS (same region): Sustained 5 Gbps • go#s3
  6. 6. One Advanced Photon Source data node: 125 destinations
  7. 7. Same node (1 Gbps link)
  9. 9. How to create more accurate, useful, and portable models of such systems? Simple analytical model: T = α + β·l [startup cost + sustained bandwidth]; experiment + regression to estimate α, β • First-principles modeling to better capture details of system & application components • Data-driven modeling to learn unknown details of system & application components • Model composition • Model, data comparison
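The "experiment + regression" step for the simple model T = α + β·l can be sketched with ordinary least squares. This is a minimal illustration only: it assumes T is transfer time in seconds and l is transfer size in bytes (so 1/β would be the sustained bandwidth under that reading), and the function names and sample data are hypothetical, not taken from the RAMSES project.

```python
def fit_alpha_beta(sizes, times):
    """Ordinary least-squares fit of T = alpha + beta * l.

    sizes: transfer sizes l (assumed bytes); times: measured T (assumed seconds).
    Returns (alpha, beta): startup cost and per-unit cost.
    """
    n = len(sizes)
    if n != len(times) or n < 2:
        raise ValueError("need at least two (size, time) pairs")
    mean_l = sum(sizes) / n
    mean_t = sum(times) / n
    sxx = sum((l - mean_l) ** 2 for l in sizes)
    if sxx == 0:
        raise ValueError("sizes must not all be equal")
    sxy = sum((l - mean_l) * (t - mean_t) for l, t in zip(sizes, times))
    beta = sxy / sxx
    alpha = mean_t - beta * mean_l
    return alpha, beta


def predict_time(alpha, beta, size):
    """Predicted T for a transfer of the given size."""
    return alpha + beta * size


if __name__ == "__main__":
    # Synthetic measurements (hypothetical): 0.5 s startup, 1 ns per byte.
    sizes = [1e6, 1e7, 1e8, 1e9]
    times = [0.5 + 1e-9 * l for l in sizes]
    alpha, beta = fit_alpha_beta(sizes, times)
    print("alpha:", alpha, "beta:", beta)
    print("predicted time for 5e8 bytes:", predict_time(alpha, beta, 5e8))
```

With real measurements the residuals of this fit indicate how much behavior the two-parameter model leaves unexplained, which is the gap the first-principles and data-driven approaches on this slide aim to close.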
  10. 10. The RAMSES vision: To develop a new science of end-to-end analytical performance modeling that will transform understanding of the behavior of science workflows in extreme-scale science environments. Based on integration of first-principles and data-driven modeling, and a structured approach to model evaluation & composition
  11. 11. The RAMSES research agenda & platform • Modeling: Develop, evaluate, and refine component and end-to-end models • Tools: Develop easy-to-use tools to provide end-users with actionable advice • Estimation: Develop and apply data-driven estimation methods: differential regression, surrogate models, etc. • Experiments: Extensive, automated experiments to test models & build database [Platform diagram: Database, Evaluators, Advisor, Estimators, Tester]
  12. 12. We are informed by five challenge workflows • Transfer: High-performance, end-to-end file transfer • Scattering: Capture and analysis of diffuse scattering experimental data • MapReduce: Data-intensive, distributed data analytics • Exascale: Performance of exascale application kernels on memory hierarchies • In-situ: Configuration and placement of in-situ analysis computations
  13. 13. Transfer: End-to-end file movement • Predict: Throughput for configuration • Explain: Factors influencing performance • Optimize: Parameters for high speeds [Diagram: source data transfer node (application, OS, FS stack, HBA/HCA, TCP, IP, NIC), LAN switch, and router; wide area network; router, LAN switch, and destination data transfer node (same stack); storage array; Lustre file system (OSS, MDS, OST, MDT)]
  14. 14. Scattering: Linking simulation and experiment to study disordered structures Diffuse scattering images from Ray Osborn et al., Argonne Experimental Sample scattering Material composition Simulated structure Simulated scattering La 60% Sr 40% Detect errors (secs—mins) Knowledge base Past experiments; simulations; literature; expert knowledge Select experiments (mins—hours) Contribute to knowledge base Simulations driven by experiments (mins—days) Knowledge-driven decision making Evolutionary optimization
  15. 15. Immediate assessment of alignment quality in near-field high-energy diffraction microscopy 1 Blue Gene/Q Orthros (All data in NFS) 3: Generate Parameters FOP.c 50 tasks 25s/task ¼ CPU hours Uses Swift/K Dataset 360 files 4 GB total 1: Median calc 75s (90% I/O) MedianImage.c Uses Swift/K 2: Peak Search 15s per file ImageProcessing.c Uses Swift/K Reduced Dataset 360 files 5 MB total feedback to experiment Detector 4: Analysis Pass FitOrientation.c 60s/task (PC) 1667 CPU hours 60s/task (BG/Q) 1667 CPU hours Uses Swift/T GO Transfer Up to 2.2 M CPU hours per week! ssh Globus Catalog Scientific Metadata Workflow Workflow Progress Control Script Bash Manual This is a single workflow 3: Convert bin L to N 2 min for all files, convert files to Network Endian format Before After Hemant Sharma, Justin Wozniak, Mike Wilde, Jon Almer
  16. 16. MapReduce: Distributing data and computation for data analytics Job Assignment ... ... Data Slaves Master Local Cluster Local Reduction ... ... Data Slaves Master Cloud Environment Job Assignment Local Reduction Index 17 Remote data analysis Job assignment Global reduction
  17. 17. Exascale simulation (images courtesy Joseph Insley, Argonne) • HACC Cosmology: – Compute-intensive phase with regular stride-one access – Tree walk phase: irregular memory access with high branching and integer ops – 3D FFT communication-intensive phase – I/O phase • Nek5000 CFD: – Matrix-vector product phase – Conjugate gradient iteration – Communication phase involving nearest-neighbor exchange and vector reductions
  18. 18. In situ analysis on the DOE Leadership Compute Resource (Multi Petaflop, High Radix Interconnect Dragonfly, 5D Torus) Computing Infrastructure I/O Nodes Switch Complex Analysis Nodes/Cluster (IB) File Server Nodes Storage System 1536 GB/s DTN Nodes We need to perform the right computation at the right place and time, taking into account details of the simulation, resources, and analysis 1 2 3 4
  19. 19. A diverse set of components [Table: components (Server, Parallel computer, Router, Storage system, LAN, WAN, TCP, UDT, GridFTP, File systems, GridFTP server, NECbone, HACCbone, Checksum, Encryption, MapReduce, Other apps) versus workflows; number of components marked Y per workflow: Transfer 11, Scattering 8, Exascale 6, Distributed MapReduce 9, In-Situ 8]
  20. 20. Develop, evaluate, and refine component and end-to-end models • Models from the literature • Fluid models for network flows • SKOPE modeling system. Develop and apply data-driven estimation methods • Differential regression • Surrogate models • Other methods from literature. Develop easy-to-use tools to provide end-users with actionable advice • Runtime advisor, integrated with Globus transfer system. Automated experiments to test models and build database • Experiment design • Testbeds
  21. 21. Overview Input Output Workload input Code skeletons Parser Per-function intermediate repr. (Block Skeleton Trees) Behavior modeling engine Execution-based intermediate repr. (Bayesian execution tree) Transformation engine Performance projection Characterization engine Transformed Bayesian execution tree Hardware model system specifications Performance projection Schema for suggested transformations Synthesized characteristics Source code User Effort (semi-automated with a source-to-source translator) Automatic SKOPE language Back end Front end Bottleneck analysis SKOPE performance modeling framework
  22. 22. Differential regression for combining data from different sources. Example of use: Predict performance on connection length L not realizable on physical infrastructure, e.g., IB-RDMA or HTCP throughput on a 900-mile connection. 1) Make multiple measurements of performance on path lengths $d$: – $M_S(d)$: OPNET simulation – $M_E(d)$: ANUE-emulated path – $M_U(d_i)$: Real network (USN) 2) Compute measurement regressions on $d$: $\dot{M}_A(\cdot)$, $A \in \{S, E, U\}$ 3) Compute differential regressions: $\Delta\dot{M}_{A,B}(\cdot) = \dot{M}_A(\cdot) - \dot{M}_B(\cdot)$, $A, B \in \{S, E, U\}$ 4) Apply differential regression to obtain estimates, $C \in \{S, E\}$: $\mathcal{M}_U(d) = M_C(d) - \Delta\dot{M}_{C,U}(d)$ [Plot labels: simulated/emulated measurements, point regression estimate]
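The four steps on this slide can be sketched as follows. This is a minimal illustration, assuming straight-line regressions on path length d (the slide does not state the regression form) and using synthetic numbers; all function names are hypothetical. Note that it extrapolates the real-network regression beyond the measured d_i, which is exactly where caution is needed in practice.

```python
def fit_line(xs, ys):
    """Least-squares straight line; returns (intercept, slope)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - slope * mean_x, slope


def evaluate(fit, d):
    """Value of a fitted line at path length d."""
    intercept, slope = fit
    return intercept + slope * d


def differential_regression(fit_a, fit_b):
    """Step 3: function d -> regression_A(d) - regression_B(d)."""
    return lambda d: evaluate(fit_a, d) - evaluate(fit_b, d)


def estimate_real(measured_c, d, fit_c, fit_u):
    """Step 4: M_C(d) minus the differential regression between C and U at d."""
    return measured_c - differential_regression(fit_c, fit_u)(d)


if __name__ == "__main__":
    # Step 1 (synthetic): simulation covers long paths, real network only short ones.
    sim_d = [100, 300, 500, 700, 900]
    sim_m = [10.0 - 0.002 * d for d in sim_d]    # stands in for M_S(d)
    real_d = [100, 300, 500]
    real_m = [9.0 - 0.003 * d for d in real_d]   # stands in for M_U(d_i)
    # Step 2: measurement regressions.
    fit_s = fit_line(sim_d, sim_m)
    fit_u = fit_line(real_d, real_m)
    # Steps 3 and 4: estimate real-network performance at d = 900.
    print(estimate_real(sim_m[-1], 900, fit_s, fit_u))
```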
  23. 23. We will extend the differential regression method in several areas • To compare different component models – E.g., different models of network elements, storage systems, protocol implementations • To compare different composite models – E.g., different methods for combining memory and CPU models • To compare model outputs with measurements
  24. 24. Component model [Diagram: component i with system parameters $p_i$ and task size parameters $s_i$; cost terms; performance quality model; experiment design (active learning); analytical and empirical models] $\hat{Q}_i(p_i, s_i)$ is a regression estimate of the performance quality model of component i
  25. 25. End-to-end profile composition Source LAN profile WAN profile Destination LAN profile Configuration for host and edge devices Configuration for WAN devices Configuration for host and edge devices composition operations
  26. 26. End-to-end model composition & analysis • End-to-end model using composition – It is an approximation: due to component interactions not modelled by the composition operator • Actual end-to-end performance model – Component models are “corrected” to account for un-modelled effects: this form is assumed to exist
  27. 27. Using end-to-end measurements and differential regression to correct regression estimates • Regression estimate of composed model: $\hat{Q}(p,s)$ – “Estimated”, since component models are “incomplete” as derived from first principles and/or measurements • Error due to regression estimate: $[Q(p,s) - \hat{Q}(p,s)]^2$ • Error can be mitigated using measurements. Corrected estimate of $Q_{p,s}$: $\hat{Q}(p,s) + \hat{\Delta}(p,s)$, where $\hat{Q}(p,s)$ is the analytical model and $\hat{\Delta}(p,s)$ is the correction from differential regression using measurements
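The correction idea on this slide can be sketched in a few lines: fit a regression to the residuals between end-to-end measurements and the analytical estimate, then add it back. This is an illustrative sketch only. It assumes a single scalar parameter p, a straight-line correction, a made-up analytical model, and synthetic measurements; none of these choices come from the slides.

```python
def fit_line(xs, ys):
    """Least-squares straight line; returns (intercept, slope)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - slope * mean_x, slope


def analytical_model(p):
    """Hypothetical composed model Q_hat(p): throughput saturating in p."""
    return 10.0 * p / (1.0 + p)


def fit_correction(ps, measured, model):
    """Fit Delta_hat(p) to the residuals: measurement minus analytical estimate."""
    residuals = [q - model(p) for p, q in zip(ps, measured)]
    return fit_line(ps, residuals)


def corrected_estimate(p, model, correction):
    """Q_hat(p) + Delta_hat(p)."""
    intercept, slope = correction
    return model(p) + intercept + slope * p


def mean_squared_error(ps, measured, estimator):
    """Empirical squared error of an estimator against measurements."""
    return sum((q - estimator(p)) ** 2 for p, q in zip(ps, measured)) / len(ps)


if __name__ == "__main__":
    ps = [1, 2, 4, 8]
    # Synthetic end-to-end measurements: the model misses an effect growing with p.
    measured = [analytical_model(p) - 0.5 - 0.1 * p for p in ps]
    correction = fit_correction(ps, measured, analytical_model)
    print("uncorrected MSE:", mean_squared_error(ps, measured, analytical_model))
    print("corrected MSE:", mean_squared_error(
        ps, measured, lambda p: corrected_estimate(p, analytical_model, correction)))
```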
  28. 28. Performance guarantees • Vapnik-Chervonenkis theory: under finite VC-dim($F$), $P\{ I(\hat{\Delta}, \hat{Q}, p) - I(\Delta^*, \hat{Q}, p) > \epsilon \} < \delta(F, l, \epsilon)$, where $\hat{\Delta}$ is the estimated and $\Delta^*$ the optimal correction – Guarantees that error of regression estimate is close to optimal with a certain probability – Distribution-free: does not require detailed knowledge of error distributions – Uses end-to-end measurements • Error of the corrected estimate: $I(\Delta, \hat{Q}, p) = \int [Q_{p,s} - \hat{Q}(p,s) - \Delta(p,s)]^2 \, dP_{Q_{p,s}}$
  29. 29. Surrogate modeling framework to inform choice of experiments 30 Machine learning & optimization Performance metrics Informative configurations First-principles models Evaluation
  30. 30. Fluid models of network flows • GridFTP flow i: parallelism $k_i$, RTT $R_i$, throughput $T_i$ • Bottleneck router: capacity $C$, loss rate $p$, queue $Q$ • [Equations: differential equations for $dT_i/dt$ and $dQ/dt$; not legible] • Solve for throughputs and transfer delays • Special case: known $p$
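To give a feel for what "solving for throughput" means in the known-p special case, here is a minimal sketch. It does not reproduce the equations on this slide. Instead it assumes a widely used AIMD fluid approximation for k parallel TCP-like streams, dT/dt = k/R² − T²·p/(2k), with T in packets per second, whose steady state is T = (k/R)·sqrt(2/p). Function names and parameter values are hypothetical.

```python
import math


def steady_state_throughput(k, rtt, p):
    """Fixed point of the assumed fluid equation: (k / R) * sqrt(2 / p)."""
    return (k / rtt) * math.sqrt(2.0 / p)


def simulate_throughput(k, rtt, p, t_end=60.0, dt=0.001, initial=1.0):
    """Forward-Euler integration of dT/dt = k/R^2 - T^2 * p / (2k).

    k: number of parallel streams, rtt: round-trip time R in seconds,
    p: known, constant loss probability. Returns T at time t_end.
    """
    throughput = initial
    for _ in range(int(round(t_end / dt))):
        growth = k / rtt ** 2                        # additive increase
        backoff = throughput ** 2 * p / (2.0 * k)    # multiplicative decrease
        throughput += dt * (growth - backoff)
    return throughput


if __name__ == "__main__":
    k, rtt, p = 4, 0.05, 1e-4
    print("simulated:", simulate_throughput(k, rtt, p))
    print("closed form:", steady_state_throughput(k, rtt, p))
```

Under this assumed model, throughput scales linearly with parallelism and inversely with RTT, which is the kind of relationship a fluid model lets one check against measured transfers.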
  31. 31. Our multi-modal approach [Diagram labels: Model composition; Analytical models; Regression models; Performance projections; Experiments; Historical logs; Emulators; Simulators; Benchmarks; Code skeletons; SKOPE language; Workload parameters; Source code; SKOPE; System models (current or future); Application behavior models]
  32. 32. Application to file transfer [Diagram labels: File transfer performance projections; System models; Application behavior; Model composition; Analytical models; Regression models; Experiments; Historical logs; Code skeletons; SKOPE language; Workload parameters; Source code; SKOPE models; Storage, TCP, WAN; iperf; GridFTP; Emulators; XDD]
  33. 33. Application to exascale simulation [Diagram labels: Exascale simulation perf. projections; System models; Application behavior; Compute, memory, interconnect models; Model composition; Analytical models; Regression models; Experiments; Historical logs; Code skeletons; SKOPE language; Workload parameters; Source code; SKOPE; MPI benchmarks; Stream; DGEMM; IOR]

      Listing 1: MatMul’s CPU (title cropped)

          float A[N][K], B[K][M];
          float C[N][M];
          int i, j, k;
          for (i = 0; i < N; ++i) {
            for (j = 0; j < M; ++j) {
              float sum = 0;
              for (k = 0; k < K; ++k) {
                sum += A[i][k] * B[k][j];
              }
              C[i][j] = sum;
            }
          }

      Listing 2: MatMul’s code skeleton

          float A[N][K]
          float B[K][M]
          float C[N][M]
          /* the loop space */
          parallel_for (N, M) : i, j
          {
            /* computation w/t
             * instruction count */
            comp 1
            /* streaming loop */
            stream k = 0:K {
              /* load */
              ld A[i][k]
              ld B[k][j]
              comp 3
            }
            comp 5
            /* store */
            st C[i][j]
          }

      [Cropped text beside the listings; legible fragments: “corresponding CPU ... of a code skeleton is introduced in ... the comment ... is not discussed in further ...”; “The following informat... a computational kernel. Data parallelism ... homogeneous tasks repeated ... express data parallelism ... the innermost parallel ... A task corresponds ... for loop. It is expressed ... computation. Data accesses are ... operations. The accessed indices, array sizes, and ... be expressed as well; ... are random unless users ... and Listing 6).”]
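One way to see what a code skeleton buys a performance model is to tally the operations that the MatMul skeleton in Listing 2 declares. The sketch below is an illustration only, not SKOPE itself: it assumes `comp c` stands for c instructions (as the skeleton's "instruction count" comment suggests), that the streaming loop body runs K times per task, and that the parallel loop space has N × M tasks. The function name is hypothetical.

```python
def matmul_skeleton_counts(n, m, k):
    """Operation totals implied by the MatMul code skeleton.

    Per (i, j) task: comp 1, then K iterations of {ld A, ld B, comp 3},
    then comp 5 and one store to C[i][j].
    """
    tasks = n * m                       # parallel_for (N, M)
    comp_per_task = 1 + 3 * k + 5       # comp 1 + K * comp 3 + comp 5
    loads_per_task = 2 * k              # ld A[i][k] and ld B[k][j] per iteration
    stores_per_task = 1                 # st C[i][j]
    return {
        "tasks": tasks,
        "comp": tasks * comp_per_task,
        "loads": tasks * loads_per_task,
        "stores": tasks * stores_per_task,
    }


if __name__ == "__main__":
    counts = matmul_skeleton_counts(1024, 1024, 1024)
    for name, value in counts.items():
        print(name, value)
```

Totals like these, combined with a hardware model (for example, instruction throughput and memory bandwidth), are the sort of input from which a performance projection could be derived.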
  34. 34. A performance database • We aim to collect instrumentation data in a central database to simplify model validation • We plan to use the perfSONAR measurement archive tool as a starting point – REST API on top of Cassandra and Postgres – Optimized for time series data – Will extend as needed
  35. 35. Application to transfer optimization 36 Performance predictor Parameter database Performance analyst Model refiner User feedback agent Globus (1) Transfer service description (3) Transfer performance (4) User feedback (2) Prediction Prediction Analysis Analysis Parameter update
  36. 36. Summary • We focus on the science of modeling: integration of first-principles and data-driven models; model composition and evaluation • Our challenge applications span a broad spectrum of DOE resources and disciplines • We see big opportunities for cooperation: e.g., on development and evaluation of component models
  37. 37. Thanks, and for more information • Thanks to our sponsors: Advanced Scientific Computing Research. Program manager: Rich Carlson • Thanks to my RAMSES project co-participants • For more information, please see and @ianfoster