David Loureiro - Presentation at HP's HPC & OSL TES

David Loureiro, SysFera CEO, talks about "Managing large-scale, heterogeneous infrastructures: from DIET to SysFera-DS" at HP's High Performance Computing and Open Source & Linux Technical Excellence Symposium, which took place on 19-23 March 2012 in Grenoble, France.

  1. 1. Distributed Interactive Engineering Toolbox (DIET)
     David Loureiro (SysFera) and Eddy Caron (Ecole Normale Supérieure de Lyon, GRAAL/AVALON Research Team)
  2. 2. Outline
     • Context
     • From DIET…
     • … to SysFera-DS
     • Conclusion
  3. 3. Why large-scale systems?
     • First need: supercomputing at a national or international scale
     • Large problems (grand challenges) need collaboration between several codes and supercomputing centers
     • There is always a need for more computing power, memory capacity, and disk storage
     • The power of any single resource is always small compared to the aggregation of several resources
     • Network connectivity increased quickly!
     • Many available resources
       – Many clusters
       – Supercomputers
       – Millions of PCs and workstations connected
       – Sharing or renting resources
     • Increasing complexity of applications
       – Multi-scale
       – Multi-disciplinary
       – Huge data sets produced
       – Heterogeneity
     From DIET to SysFera-DS
  4. 4. Centralized or Decentralized?
     • 1946 ENIAC
       – 18,000 tubes, 30 tons, 170 m²
       – 2,000 tubes replaced every month by 6 technicians
     • 1997 Google Cluster
     • 2001 TeraGrid / 2003 Grid’5000
       – Grid Computing (Clusters of Clusters)
     • 2002 Earth Simulator
       – First computer to reach the Teraflops (40 TF)
       – Homogeneous, Centralized, Expensive
     • 2008 IBM Roadrunner
       – First computer to reach the Petaflops
     • Cloud Computing
       – Amazon, Google, Microsoft, …
     • Sky Computing
     • Labels along the timeline: Centralized! / (De)Centralized! / Decentralized! / Centralized! / Decentralized!
  5. 5. Research driven by applications
     • Data-centric applications
       – Very large data management (in, out, temporary)
       – Example shown: >30 TB of data per night
     • Computer-centric applications
       – GigaFlops
       – Example shown: Predicting Impacts of Massive Earthquakes (SDSC)
     • Community-centric applications
       – Data sharing (acquisition, results, ...)
       – Resources
       – Example shown: Large Hadron Collider (LHC)
     • "I just need my simulation result." Without an optimal scheduling? Without minimizing resource consumption? Without any optimisation? …
     • Grid user point of view
       – Single sign-on
       – Single compute space
       – Single data space
       – Single development environment
  6. 6. Which framework?
     • Holy Grail: transparency and simplicity (maybe even before performance)!
     • Scheduling tunability
     • Many incarnations of the Grid
       – Grid computing
       – Cluster computing
       – Peer-to-peer systems
       – Global computing
       – Web Services
       – Clouds, …
     • Many programming models
       – Shared-state models
       – Message-passing models
       – Hybrid models
       – RPC and RMI models
       – Peer-to-peer models
       – Web Services models
       – Coordination models, …
     • Do not forget good old research on scheduling and distributed systems!
       – Most scheduling problems are very difficult to solve even in their simplest form …
       – … but simple solutions often lead to better performance in real life
  7. 7. Outline
     • Context
     • From DIET…
     • … to SysFera-DS
     • Conclusion
  8. 8. DIET’s Goals
     http://graal.ens-lyon.fr/DIET/
     • Our goals
       – To develop a toolbox for the deployment of environments using the Application Service Provider/Software as a Service (ASP/SaaS) paradigm with different applications
       – To use public domain and standard software as much as possible
       – To obtain a high-performance and scalable environment
       – To implement and validate our more theoretical results: scheduling for heterogeneous platforms, data (re)distribution and replication, performance evaluation, algorithmics for heterogeneous and distributed platforms, …
     • Based on CORBA and our own software developments
       – FAST for performance evaluation
       – LogService for monitoring
       – VizDIET for visualization
       – GoDIET for deployment
       – Dagda for data management
     • Several applications in different fields (simulation, bioinformatics, …)
     • Release 2.8 available on the web since November
     • ACI Grid ASP, RNTL GASP, ANR LEGO CIGC-05-11, ANR Gwendia, Celtic-plus Project SEED4C
  9. 9. RPC and Grid-Computing: Grid-RPC
     • One simple idea
       – Implementing the RPC programming model over the grid
       – Using resources accessible through the network
       – Mixed parallelism model (data-parallel model at the server level and task parallelism between the servers)
     • Features needed
       – Load-balancing (resource localization and performance evaluation, scheduling)
       – IDL
       – Data and replica management
       – Security
       – Fault-tolerance
       – Interoperability with other systems
       – …
     • Design of a standard interface
       – Within the OGF (Grid-RPC and SAGA WG)
       – Existing implementations: NetSolve/GridSolve, Ninf, DIET, OmniRPC
  10. 10. RPC and Grid Computing: Grid-RPC
     [Diagram: the Client sends a Request to the AGENT(s); the answer is "S2 !"; the Client then calls Op(C, A, B) on server S2, one of the servers S1, S2, S3, S4.]
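
The request flow on this slide can be sketched in a few lines of Python. This is an illustrative model only, not the DIET or GridRPC API: the `Agent`, `Server`, and `grid_rpc_call` names, the `load` field, and the "least loaded server wins" rule are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Server:
    """Hypothetical compute server offering named services."""
    name: str
    load: float  # assumed load metric: lower means less busy
    services: Dict[str, Callable] = field(default_factory=dict)


class Agent:
    """Hypothetical agent: knows the servers and picks one per request."""

    def __init__(self, servers: List[Server]):
        self.servers = servers

    def find_server(self, service: str) -> Server:
        candidates = [s for s in self.servers if service in s.services]
        if not candidates:
            raise LookupError(f"no server offers {service!r}")
        # Assumed policy: choose the least loaded candidate.
        return min(candidates, key=lambda s: s.load)


def grid_rpc_call(agent: Agent, service: str, *args):
    """Step 1: ask the agent for a server. Step 2: call that server directly."""
    server = agent.find_server(service)
    return server.name, server.services[service](*args)


def matmul(a, b):
    """Plain-Python matrix product standing in for Op(C, A, B)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


servers = [
    Server("S1", load=0.9, services={"matmul": matmul}),
    Server("S2", load=0.1, services={"matmul": matmul}),
    Server("S3", load=0.5),  # does not offer the service
    Server("S4", load=0.7, services={"matmul": matmul}),
]
chosen, c = grid_rpc_call(Agent(servers), "matmul", [[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(chosen, c)
```

The point of the split is that the agent only brokers the request; the data and the computation go straight from the client to the selected server.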
  11. 11. Client and server interface
     • Client side
       – So easy …
       – Multi-interface (C, C++, Fortran, Java, Python, Scilab, Web Services, etc.)
       – Grid-RPC compliant
     • Server side
       – Install and submit new server to agent (LA)
       – Problem and parameter description
       – Client IDL transfer from server
       – Dynamic services: new service, new version, security update, outdated service, etc.
  12. 12. Architecture overview
     [Diagram: hierarchy of agents and servers]
     • MA: Master Agent
     • LA: Local Agent
     • SeD: Server Daemon
  13. 13. Workflow Management
     • Workflow representation
       – Directed Acyclic Graph (DAG)
       – Each vertex is a task
       – Each directed edge represents communication between tasks
     • Functional workflows
       – Loops, if statements, automatic parallelism, fault-tolerance
     • Goals
       – Build and execute workflows
       – Use different heuristics to solve scheduling problems
       – Extensibility to address multi-workflow submission and large grid platforms
       – Manage heterogeneity and variability of the environment
     • ANR Gwendia
       – Language definition (MOTEUR & MADAG)
       – Comparison on Grid’5000 vs EGI
     • [Figure: idle time, data transfer time, and execution time on EGI (gLite) and on Grid’5000 (DIET). Values on the slide, in extraction order: 132.143 s, 32.857 s, 274.643 s, 0.214 s, 540.614 s, 3.371 s.]
     Contribution to the management of large scale platforms: the DIET experience
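
As a minimal sketch of the DAG representation described on this slide, the following Python snippet groups tasks into "waves" that could run in parallel. It uses the standard-library `graphlib` module (Python 3.9+), not the DIET workflow engine; the task names and the `execution_waves` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on,
# i.e. each entry encodes the incoming edges of one vertex of the DAG.
workflow = {
    "preprocess": set(),
    "simulate_a": {"preprocess"},
    "simulate_b": {"preprocess"},
    "merge": {"simulate_a", "simulate_b"},
}


def execution_waves(dag):
    """Group tasks into waves; tasks in the same wave do not depend on each other."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only to make the output stable
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(workflow)
print(waves)
```

Tasks inside one wave are the ones a scheduler is free to place on different servers at the same time, which is where the scheduling heuristics mentioned above come into play.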
  14. 14. DIET Scheduling: Plug-in Schedulers
     • SeD level
       – Performance estimation function
       – Estimation Metric Vector: dynamic collection of performance estimation values
     • Performance measures available through DIET
       – FAST-NWS performance metrics
       – Time elapsed since the last execution
       – CoRI (Collector of Resource Information)
       – Developer-defined values
     • Aggregation methods
       – Define the mechanism used to sort SeD responses: associated with the service and defined at SeD level
       – Tunable comparison/aggregation routines for scheduling
     • Priority Scheduler
       – Performs pairwise server estimation comparisons, returning a sorted list of server responses
       – Can minimize or maximize based on SeD estimations, taking into consideration the order in which the requests for those performance estimations were specified at SeD level
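
The priority scheduler described above can be illustrated with a small Python sketch. It is not the DIET plug-in API: the metric names, the `estimates` dictionary, and the `priorities` list are invented for the example. What it does show is the technique named on the slide, a pairwise comparison of estimation vectors that minimizes or maximizes each metric in a declared order.

```python
from functools import cmp_to_key

# Hypothetical estimation vectors returned by three SeDs.
estimates = {
    "sed-1": {"free_cpu": 0.50, "time_since_last_run": 12.0},
    "sed-2": {"free_cpu": 0.80, "time_since_last_run": 3.0},
    "sed-3": {"free_cpu": 0.80, "time_since_last_run": 40.0},
}

# Metrics are compared in this order; "max" prefers larger values, "min" smaller.
priorities = [("free_cpu", "max"), ("time_since_last_run", "max")]


def compare(a, b):
    """Pairwise comparison of two server names; a negative result ranks a first."""
    for metric, direction in priorities:
        va, vb = estimates[a][metric], estimates[b][metric]
        if va != vb:
            a_is_better = va > vb if direction == "max" else va < vb
            return -1 if a_is_better else 1
    return 0  # identical on every metric


ranked = sorted(estimates, key=cmp_to_key(compare))
print(ranked)
```

Here `sed-2` and `sed-3` tie on the first metric, so the second metric decides between them; changing the order of `priorities` changes the ranking without touching the comparison code.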
  15. 15. DIET Scheduling: Performance estimation
     • Collector of Resource Information (CoRI)
       – Interface to gather performance information
       – Currently 2 modules available: CoRI Easy and FAST (Martin Quinson’s PhD)
       – Sigar, GPU, etc. to come…
     • [Diagram: the CoRI Manager sits above the CoRI-Easy Collector, the FAST Collector, and other collectors such as Ganglia]
     • Extension for parallel programs
       – Code analysis / FAST calls combination
       – Allows the estimation of parallel regular routines (ScaLAPACK-like)
     • [Figure: measured vs estimated values. Max. error: 14.7 %; avg. error: 3.8 %]
  16. 16. Data Management
     • Three approaches for DIET
       – DTM (LIFC, Besançon): hierarchical and distributed data manager; redistribution between servers
       – JuxMem (Paris, Rennes): P2P data cache
       – DAGDA (IN2P3, Clermont-Ferrand and LIP): joining task scheduling and data management
     • Standardized through the GridRPC OGF WG
     • DAGDA: Data Arrangement for Grid and Distributed Applications
       – Explicit data replication, using the API
       – Implicit data replication
       – Data replacement algorithms: LRU, LFU, and FIFO
       – Transfer optimization by selecting the most convenient source
       – Storage resource usage management
       – Data status backup/restoration
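
To make the data replacement bullet concrete, here is a minimal sketch of LRU (least recently used) replacement in Python. It is not DAGDA code: the `LRUDataStore` class, its method names, and the byte sizes are assumptions chosen for the illustration.

```python
from collections import OrderedDict


class LRUDataStore:
    """Hypothetical size-bounded store that evicts the least recently used item."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # data id -> size, least recently used first

    def access(self, data_id):
        """Mark an item as recently used; return True if it is stored."""
        if data_id not in self.items:
            return False
        self.items.move_to_end(data_id)
        return True

    def store(self, data_id, size):
        """Add an item, evicting least recently used items until it fits."""
        if size > self.capacity:
            raise ValueError("item larger than the whole store")
        evicted = []
        self.items.pop(data_id, None)  # re-storing an item replaces it
        while sum(self.items.values()) + size > self.capacity:
            victim, _ = self.items.popitem(last=False)
            evicted.append(victim)
        self.items[data_id] = size
        return evicted


store = LRUDataStore(capacity=10)
store.store("A", 4)
store.store("B", 4)
store.access("A")             # "A" is now more recent than "B"
evicted = store.store("C", 4)  # does not fit: the oldest item must go
print(evicted, list(store.items))
```

LFU and FIFO differ only in how the victim is chosen: LFU would track an access count per item, and FIFO would skip the `move_to_end` call so that insertion order alone decides.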
  17. 17. Parallel and batch submissions (slide dated 6/03/12)
     • Parallel & sequential jobs
       – Transparent for the user
       – System-dependent submission
     • SeDBatch
       – Many batch systems
       – Batch schedulers behaviour
       – Internal scheduling process
     • Monitoring & performance prediction
       – Simulation (Simbatch)
     • [Diagram labels: MA, LA, SeD//, SeD, SeDBatch, NFS; batch systems: OAR, SLURM, PBS, LSF, OGE, LoadLeveler]
  18. 18. DIET Cloud
     • Inside the Cloud
       – The DIET platform is virtualized inside the cloud (as a Xen image, for example)
       – Very flexible and scalable, as DIET nodes can be launched
       – Scheduling is more complex
     • DIET as a Cloud manager
       – Eucalyptus interface
       – Eucalyptus is treated as a new batch system
       – Provide a new implementation for the BatchSystem abstract class
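
The idea of treating a cloud as one more batch system can be sketched with an abstract base class. The slide names a `BatchSystem` abstract class; everything else below (the Python rendering, the `submit` method, the in-memory `FakeOAR` and `FakeEucalyptus` backends, and the identifier formats) is a hypothetical stand-in and does not reflect DIET's actual interface.

```python
from abc import ABC, abstractmethod


class BatchSystem(ABC):
    """Hypothetical Python counterpart of the abstract class named on the slide."""

    @abstractmethod
    def submit(self, job_name: str) -> str:
        """Submit a job and return an identifier for it."""


class FakeOAR(BatchSystem):
    """In-memory stand-in for a classic batch scheduler."""

    def __init__(self):
        self.queue = []

    def submit(self, job_name: str) -> str:
        self.queue.append(job_name)
        return f"oar-{len(self.queue)}"


class FakeEucalyptus(BatchSystem):
    """In-memory stand-in for a cloud: 'start a VM and run the job in it'."""

    def __init__(self):
        self.instances = []

    def submit(self, job_name: str) -> str:
        self.instances.append(f"vm-for-{job_name}")
        return f"euca-{len(self.instances)}"


def run_service(backend: BatchSystem, job_name: str) -> str:
    """Caller code depends only on the abstract interface."""
    return backend.submit(job_name)


ids = [run_service(backend, "blast") for backend in (FakeOAR(), FakeEucalyptus())]
print(ids)
```

Because `run_service` only sees the abstract interface, adding a cloud backend requires no change on the calling side, which is the design choice the slide describes.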
  19. 19. Grid’5000
     • Building a nationwide experimental platform for Grid & P2P research (like a particle accelerator for computer scientists)
       – 9 geographically distributed sites hosting clusters (256 CPUs to 1K CPUs)
       – All sites are connected by RENATER (French Research and Education Network)
     • Design and develop a system/middleware environment to safely test and repeat experiments
     • Use the platform for Grid experiments in real-life conditions
     • 4 main features
       – High security for Grid’5000 and the Internet, despite the deep reconfiguration feature
       – Single sign-on
       – High-performance LRMS: OAR
       – A user toolkit to reconfigure the nodes and monitor experiments: Kadeploy
     • DIET deployment over a maximum of processors
       – 1 MA, 8 LA, 540 SeDs
       – 1120 clients on 140 machines
       – DGEMM requests (2000x2000 matrices)
       – Simple round-robin scheduling
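
Round-robin scheduling, as used in this experiment, simply hands requests to servers in strict rotation. The sketch below applies it to the figures on the slide (540 SeDs, 1120 clients) under one loudly stated assumption that the slide does not confirm: each client sends exactly one request. The `round_robin` helper and the `SeD-i` names are hypothetical.

```python
from collections import Counter
from itertools import cycle, islice


def round_robin(servers, n_requests):
    """Assign requests to servers in strict rotation; return the assignment list."""
    return list(islice(cycle(servers), n_requests))


seds = [f"SeD-{i}" for i in range(540)]
assignment = round_robin(seds, 1120)  # assumption: one request per client

per_server = Counter(assignment)       # requests received by each SeD
spread = Counter(per_server.values())  # how many SeDs got 2 requests, how many got 3
print(dict(spread))
```

Since 1120 = 2 × 540 + 40, the first 40 SeDs receive three requests and the remaining 500 receive two. The policy ignores server load entirely, which is what makes it a simple baseline.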
  20. 20. Applications: 4 of them
     • Cosmology application
       – Dark Matter halos
       – Large-scale experiment on Grid’5K
     • Climatology application
       – Forecasting of the world’s environment and climate on regional to global scales
     • Plug-in scheduler
     • Robotic application
       – [Diagram labels: Parameters, DIET API, External application call, DIET middleware, Results]
     • Bioinformatics application
       – [Diagram labels: Request, Metrics vector, BLAST service, Plugin-scheduler declaration]
       – BLAST
       – 40000 requests over 5 databases of different sizes (from 1 to 5 GB)
       – Experiment between Italy and France
       – Data management optimized
  21. 21. Conclusions
     • Grid-RPC
       – Interesting approach for several applications
       – Simple, flexible, and efficient
       – Many interesting research issues (scheduling, data management, resource discovery and reservation, deployment, fault-tolerance, …)
     • DIET
       – Scalable, open-source, and multi-application platform
       – Concentration on several issues like resource discovery, scheduling (distributed scheduling and plugin schedulers), deployment (GoDIET and GRUDU), performance evaluation (CoRI), monitoring (LogService and VizDIET), data management and replication (DTM, JuxMem, and DAGDA)
       – Large-scale validation on the Grid’5000 platform (http://www.grid5000.org/)
       – A middleware designed and tunable for different applications
  22. 22. Results
     • A complete middleware for heterogeneous infrastructures
       – DIET is light to use and non-intrusive
       – Dedicated to many applications
       – Designed for Grid and Cloud
       – Efficient even in comparison to commercial tools
       – DIET is a highly tunable middleware
     • Used in production
     • The DIET Team
     • SysFera Company (14 people today)
       – http://www.sysfera.com
  23. 23. Future Prospects
     • Do we need application-specific schedulers?
     • Scheduling based on an economic model for Cloud platforms
     • DIET Green (collaboration with RESO)
     • Increase the DIET capacity to deal with heterogeneous resources
       – Single System Image cluster OS
       – Box cluster
       – Virtual machines
       – GPU architecture
       – Multi-core
       – Large-scale architecture
       – …
     • [Diagram: an MA and several LAs above three kinds of SeD. SED Kerrighed: Kerrighed script generator, deploy the image, new services are registered (SMP virtual). SED Batch: batch script generator, submission to batch scheduler (PBS, OAR, LoadLeveler, ...). SED Cloud: cloud script generator, deploy the image, new services are registered (Eucalyptus, EC2, ...).]
  24. 24. Outline
     • Context
     • From DIET…
     • … to SysFera-DS
     • Conclusion
  25. 25. Who are we?
     • 2001: Research project from the Graal team (Inria/ENS)
       – DIET: grid middleware
     • 2007: SysFera-DS used within the Décrypthon project
       – Used in production
       – Selected by IBM to replace Univa-UD
     • 2010: Creation of SysFera, INRIA spin-off
     • 2012: A team of 14 (R&D: 4 engineers and 5 PhDs)
       – Supported by two experts from INRIA and ENS
       – SysFera-DS
  26. 26. Décrypthon: HPC management & mutualization
     • Before SysFera-DS
       – Local usage of resources
       – No unique submission interface
       – 5 sites, 2 different batch schedulers
     • [Diagram: Bordeaux (LoadLeveler), Lille (LoadLeveler), Jussieu (LoadLeveler), Orsay (LoadLeveler), Lyon (OAR + storage)]
  27. 27. Décrypthon: HPC management & mutualization
     • With SysFera-DS
       – Resource mutualization
       – Web interface for submission
       – Application-specific scheduling
       – Data management
       – Hardware failures hidden from the users (automatic re-submission)
     • [Diagram: a submission website in front of Bordeaux (LoadLeveler), Lille (LoadLeveler), Jussieu (LoadLeveler), Orsay (LoadLeveler), Lyon (OAR + storage)]
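
Hiding hardware failures through automatic re-submission can be sketched as a retry loop over the available sites. This is an illustration of the general technique, not SysFera-DS code: the `SiteFailure` exception, the `submit_with_resubmission` function, the site names, and the rotation policy are all assumptions.

```python
class SiteFailure(Exception):
    """Hypothetical error raised when a hardware failure kills a job on a site."""


def submit_with_resubmission(job, sites, max_attempts=3):
    """Run the job on the sites in rotation until one attempt succeeds."""
    attempts = []
    for attempt in range(max_attempts):
        name, run = sites[attempt % len(sites)]
        attempts.append(name)
        try:
            return run(job), attempts
        except SiteFailure:
            continue  # not reported to the user: the job is simply resubmitted
    raise RuntimeError(f"job failed after {max_attempts} attempts: {attempts}")


def broken_site(job):
    raise SiteFailure("node crashed")


def healthy_site(job):
    return f"result of {job}"


sites = [("site-a", broken_site), ("site-b", healthy_site)]
result, attempts = submit_with_resubmission("docking-42", sites)
print(result, attempts)
```

From the user's point of view only `result` matters; the list of attempts is kept here to show that a failure happened and was absorbed by the middleware layer.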
  28. 28. Helping cure muscular dystrophy
     « The Décrypthon Steering Committee chose SysFera-DS starting in June 2007 for its qualities of robustness and modularity. It has been progressively implemented on the Décrypthon grid resources while ensuring a completely transparent and smooth transition for the users. »
     Thierry Toursel, Research Project Manager, AFM
  29. 29. EDF - Distributed platforms are complex
  30. 30. EDF - The solution
  31. 31. Working with a leading international company
     « Thanks to SysFera-DS, we can now provide our R&D engineers a stable, reliable and performant solution to access our supercomputers and computing clusters. »
     David Bateman, ICCOS Group Manager, EDF
  32. 32. SysFera-DS does it all
     • Simple access to complex infrastructures
     • Advanced administration features
       – User management and access control
       – Monitoring and reporting
     • Consistent platform for application development
     • Integration with existing environments
     • Compatibility with many different resources
     • Non-intrusive, non-exclusive
     • Flexible, stable, reliable, performant
  33. 33. Key benefits
     • Heterogeneous applications management
     • Efficient Big Data management
     • Workflow & dataflow management & design
     • Collaborative Webboard
     • Hybrid Cloud
  34. 34. Offers
     • Software to optimize your computations
     • A licence to plug inside your software
     • Migration of your applications
     • A webboard to manage your applications & infrastructures
     • Expertise to support these tools
     • Expertise to develop dedicated plugins
     • [Diagram labels: Your applications; Our Software; Your infrastructure; Pool resources; CIMENT; CLOUD; …]
  35. 35. Offers
     • Your Applications
     • Webboard: « To manage your applications »
     • Vishnu: « A set of dedicated plugins – infrastructure management »
     • DIET: « To optimize your computations & integrate your infrastructures »
  36. 36. Features overview
     • Meta-scheduling (load balancing), workflow management, job management, data management
     • Resources and communications management
     • Launch and monitoring of jobs, file transfers, hardware and software infrastructure through a scientific portal
     • User management with single sign-on
     • Cross network domain
     • Advanced and fine-grained data management
     • Automatic management of dynamic resources
     • Maintenance management
     • Easy deployment
     • Usable in user space: no need to be root
     • Cloud management
  37. 37. The WebBoard (Before SysFera)
     • User and admin interface
     • One app - one page
     • User rights management
     • Statistics
  38. 38. SysFera-DS WebBoard
  39. 39. Outline
     • Context
     • From DIET…
     • … to SysFera-DS
     • Conclusion
  40. 40. An open source solution (slide footer: 05.04.12, ANR-SOP)
     The core of SysFera-DS is open-source software... which means anyone can use it, share it, and contribute to it.
  41. 41. [Diagram labels: LIP, SysFera, MIS, CNRS, ENSI, ENSHEEIT, LIFC, IRISA, …; DIET Open Source; SysFera-DS]
  42. 42. Conclusion
     • An open source solution with two different kinds of collaborative support
     • DIET (LIP, Avalon Team)
       – Proof of concept
       – Simulations
       – New features
       – Grid’5000 experiments
       – Scientific expertise
       – etc.
     • SysFera-DS (SysFera)
       – Application support with industrial quality
       – Platform development
       – New features
       – Personalized features
       – Research Grid to Production Grid
       – Hotline
  43. 43. Acknowledgment
     Abdelkader Amar, Adrian Muresan, Alan Su, Amine Bsila, Andréea Chis, Antoine Vernois, Barbara Walter, Benjamin Depardon, Benjamin Isnard, Bert Van Heukelom, Bruno DelFabro, Christophe Pera, Cyril Pontvieux, Cédric Tedeschi, Damien Reimert-Vasconcellos, Daouda Traore, David Loureiro, Eric Boix, Eugene Pamba Capochichi, Emmanuel Quémener, Florent Rochette, Frédéric Desprez, Frédéric Lombard, Frédéric Suter, Gaël Le Mahec, Georg Hoesch, Ghislain Charrier, Haïkel Guemar, Ibrahima Cissé, Jean-Marc Nicod, Jonathan Rouzaud-Cornabas, Kevin Coulomb, Laurent Philippe, Ludovic Bertsch, Luis Rodero-Merino, Marc Boury, Martin Quinson, Mathias Colin, Mathieu Jan, Maurice Djibril Faye, Nicolas Bard, Ousmane Thiare, Peter Frauenkron, Philippe Combes, Philippe Martinez, Philippe Vicens, Phuspinder Kaur Chouhan, Raphaël Bolze, Romain Lacroix, Stéphane Vialle, Sylvain Dahan, Vincent Pichon, Yves Caniou
  44. 44. Links and contact
     • http://graal.ens-lyon.fr/DIET
     • http://www.sysfera.com
     • http://blog.sysfera.com
     • David Loureiro (SysFera CEO): david.loureiro@sysfera.com, @DavidLoureiroFr, www.sysfera.com
