MONARC Simulation Framework

Transcript

  • 1. MONARC Simulation Framework. Corina Stratan, Ciprian Dobre (UPB); Iosif Legrand, Harvey Newman (CALTECH)
  • 2. The GOALS of the Simulation Framework  The aim of this work is to continue and improve the development of the MONARC simulation framework  To perform realistic simulation and modelling of large-scale distributed computing systems, customised for specific HEP applications  To offer a dynamic and flexible simulation environment to be used as a design tool for large distributed systems  To provide a design framework to evaluate the performance of a range of possible computer systems, as measured by their ability to provide the physicists with the requested data in the required time, and to optimise the cost. December 2003 Legrand I.C.
  • 3. A Global View for Modelling. (Diagram: layered architecture: Computing Models on top of Specific Components, Basic Components and the Simulation Engine, covering LAN/WAN, DB, CPU, Job Scheduler, MetaData Catalog, Distributed Scheduler and Analysis Jobs; the models are validated by MONITORING of REAL systems and testbeds.)
  • 4. Design Considerations. This simulation framework is not intended to be a detailed simulator for basic components such as operating systems, database servers or routers. Instead, based on realistic mathematical models and parameters measured on testbed systems for all the basic components, it aims to correctly describe the performance and limitations of large distributed systems with complex interactions.
  • 5. Simulation Engine. (Section divider: repeats the layered-architecture diagram from slide 3.)
  • 6. Design Considerations of the Simulation Engine  A process-oriented approach to discrete event simulation is well suited to describe concurrently running programs.  “Active objects” (having an execution thread, a program counter, stack...) provide an easy way to map the structure of a set of distributed running programs into the simulation environment.  The simulation engine supports an “interrupt” scheme; this allows effective and correct simulation of concurrent processes with very different time scales, by using a DES approach with a continuous process flow between events.
  • 7. The Simulation Engine: Tasks and Events. Task: simulates an entity with time-dependent behavior (active object, server, ...). Five possible states for a task: CREATED, READY, RUNNING, FINISHED, WAITING. A created task is assigned to a worker thread; each task maintains an internal semaphore (semaphore.p() / semaphore.v()) necessary for switching between states, and a WAITING task becomes READY again when an event happens or its sleeping period is over. Event: used for communication and synchronization between tasks: when a task must notify another task about something that happened or will happen in the future, it creates an event addressed to that task. The events are queued and sent to the destination tasks by the engine’s scheduler.
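The task life cycle above can be sketched in Java. This is a hypothetical illustration, not the real MONARC engine API: `SimTask` and `deliverEvent` are invented names, and `java.util.concurrent.Semaphore` stands in for the per-task internal semaphore described on the slide.

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch of the engine's task life cycle. A task blocks on its
// internal semaphore (semaphore.p()) while WAITING, and the scheduler wakes
// it (semaphore.v()) when an event for it arrives.
public class SimTask implements Runnable {
    enum State { CREATED, READY, RUNNING, WAITING, FINISHED }

    private volatile State state = State.CREATED;
    private final Semaphore sem = new Semaphore(0); // internal semaphore

    State getState() { return state; }

    /** Scheduler side: an event for this task arrived (semaphore.v()). */
    void deliverEvent() {
        state = State.READY;
        sem.release();
    }

    /** Task side: block until an event arrives (semaphore.p()). */
    private void waitForEvent() throws InterruptedException {
        state = State.WAITING;
        sem.acquire();
        state = State.RUNNING;
    }

    @Override
    public void run() {
        state = State.RUNNING;
        try {
            waitForEvent(); // simulate one wait/wake cycle
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        state = State.FINISHED;
    }

    public static void main(String[] args) throws Exception {
        SimTask task = new SimTask();
        Thread worker = new Thread(task); // CREATED -> assigned to worker thread
        worker.start();
        Thread.sleep(100);                // task is now WAITING
        task.deliverEvent();              // event happens: READY -> RUNNING
        worker.join();
        System.out.println(task.getState()); // FINISHED
    }
}
```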
  • 8. Tests of the Engine. Processing a TOTAL of 100,000 simple jobs on 1, 10, 100, 1,000, 2,000, 4,000 and 10,000 CPUs, using the same number of parallel threads. (Plot: execution time [s] vs. number of threads, on 2x2.4 GHz Linux, 2x450 MHz Solaris and 2x3 GHz Windows machines.) More tests: http://monarc.cacr.caltech.edu/
  • 9. Basic Components. (Section divider: repeats the layered-architecture diagram from slide 3.)
  • 10. Basic Components. These basic components are capable of simulating the core functionality of general distributed computing systems. They are built on the simulation engine and make efficient use of the interrupt functionality implemented for the active objects. These components should be considered the base classes from which specific components can be derived and constructed.
  • 11. Basic Components: Computing Nodes; Network Links and Routers, I/O protocols; Data Containers; Servers ( Database Servers,  File Servers (FTP, NFS, ...)); Jobs ( Processing Jobs,  FTP jobs); Scripts & Graph execution schemes; Basic Scheduler; Activities (a time sequence of jobs).
  • 12. Multitasking Processing Model. Concurrent running tasks share resources (CPU, memory, I/O). “Interrupt”-driven scheme: for each new task, or when one task is finished, an interrupt is generated and all “processing times” are recomputed. It provides: an efficient mechanism to simulate multitask processing; handling of concurrent jobs with different priorities; an easy way to apply different load balancing schemes.
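The interrupt-driven recomputation can be sketched as follows. This is an illustrative, equal-share simplification with invented names (`CpuUnit`, `Job`), not the MONARC classes: when a job arrives, every job's remaining time is re-estimated from its new share of the CPU power.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the interrupt-driven multitasking model: whenever
// a job arrives, the completion time of every job on the CPU is recomputed
// from its new (equal) share of the CPU power.
public class CpuUnit {
    static class Job {
        double workLeft;                    // remaining work, in CPU-seconds
        Job(double work) { this.workLeft = work; }
    }

    private final double power = 1.0;       // CPU power, work units per second
    private final List<Job> running = new ArrayList<>();

    /** Advance the simulation clock: all jobs progress at equal shares. */
    void advance(double seconds) {
        double share = power / running.size();
        for (Job j : running) j.workLeft -= share * seconds;
    }

    /** "Interrupt": a new job arrives; completion times must be re-estimated. */
    void addJob(Job j) { running.add(j); }

    /** Estimated completion time of a job under the current share. */
    double estimatedTime(Job j) {
        double share = power / running.size();
        return j.workLeft / share;
    }

    public static void main(String[] args) {
        CpuUnit cpu = new CpuUnit();
        Job a = new Job(10);
        cpu.addJob(a);
        System.out.println(cpu.estimatedTime(a)); // 10.0 (alone on the CPU)
        cpu.advance(4);                           // a has 6 units left
        Job b = new Job(6);
        cpu.addJob(b);                            // interrupt: share drops to 1/2
        System.out.println(cpu.estimatedTime(a)); // 12.0 (6 left at half speed)
    }
}
```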
  • 13. LAN/WAN Simulation Model. (Diagram: nodes connected by links to LANs and routers, with Internet connections between LANs.) “Interrupt”-driven simulation: for each new message an interrupt is created, and for all the active transfers the speed and the estimated time to complete the transfer are recalculated. Continuous flow between events! An efficient and realistic way to simulate concurrent transfers having different sizes / protocols.
  • 14. Network model  Data traffic is simulated for both local and wide area networks.  A simulation at the packet level is practically impossible.  We adopted a larger-scale approach, based on an “interrupt” mechanism. Network Entity: LAN, WAN, LinkPort; main attribute: bandwidth; keeps track of the messages that traverse it. (Diagram: components of the network model.)
  • 15. Simulating the network transfers  Interrupt mechanism similar to the one used for job execution simulation.  The initial speed of a message is determined by evaluating the bandwidth that each entity on the route can offer.  Different network protocols can be modelled. (Diagram: a new message from a CPU on the CERN LAN, through the CERN and Caltech routers and WANs, to a CPU on the Caltech LAN, interrupting Message1, Message2 and Message3 along the route.) 1. The route and the available bandwidth for the new message are determined. 2. The messages on the route are interrupted and their speeds are recalculated.
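A minimal sketch of this transfer model, under the assumption that each network entity shares its bandwidth equally among the transfers crossing it; the class names (`NetworkModel`, `Link`, `Transfer`) are illustrative, not the MONARC network classes. A transfer's speed is the tightest share along its route, and starting a new message "interrupts" the others, whose speeds are then re-read.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of interrupt-driven transfers: each link splits its bandwidth
// equally among active transfers; a transfer runs at the minimum share
// along its route, recomputed whenever a new message starts.
public class NetworkModel {
    static class Link {
        final double bandwidth;                 // Mbps
        final List<Transfer> active = new ArrayList<>();
        Link(double bw) { this.bandwidth = bw; }
        double share() { return bandwidth / active.size(); }
    }

    static class Transfer {
        final List<Link> route;
        Transfer(List<Link> route) { this.route = route; }
        /** Speed is limited by the tightest share along the route. */
        double speed() {
            double s = Double.MAX_VALUE;
            for (Link l : route) s = Math.min(s, l.share());
            return s;
        }
    }

    /** New message: register it on every link of its route (the "interrupt"). */
    static Transfer start(List<Link> route) {
        Transfer t = new Transfer(route);
        for (Link l : route) l.active.add(t);
        return t;                               // callers re-read speed() of all
    }

    public static void main(String[] args) {
        Link lan = new Link(1000);              // 1 Gbps LAN
        Link wan = new Link(100);               // 100 Mbps WAN
        Transfer t1 = start(List.of(lan, wan));
        System.out.println(t1.speed());         // 100.0: the WAN is the bottleneck
        Transfer t2 = start(List.of(lan, wan)); // interrupt: WAN now shared
        System.out.println(t1.speed());         // 50.0
        System.out.println(t2.speed());         // 50.0
    }
}
```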
  • 16. Job Scheduling and Execution.
    class Activity1 extends Activity {
      ...
      public void pushJobs() {
        ...
        Job newJob = new Job(...);
        addJob(newJob);
        ...
      }
      ...
    }
    class Activity2 extends Activity {
      ...
    }
(Diagram: jobs share CPU units, e.g. Job 1 at 30% and Job 2 at 70% of CPU 1; on CPU 3, Job 6 and Job 7 drop from 50% to 33% each when newJob arrives with a 33% share.) 1. The activity class creates a job and submits it to the farm. 2. The job scheduler sends the new job to a CPU unit; all the jobs executing on that CPU are interrupted. 3. CPU power is reallocated on the unit where the new job was scheduled; the interrupted jobs re-estimate their completion time.
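A runnable fill-in of the Activity/pushJobs fragment on the slide; the `Activity`, `Job` and `Farm` classes here are simplified stand-ins for the MONARC ones, with only enough behavior to show the submission flow from activity to farm.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Simplified stand-ins for the MONARC Activity/Job/Farm classes, showing
// how an Activity subclass creates jobs and submits them to the farm.
public class SchedulingSketch {
    static class Job {
        final String name;
        Job(String name) { this.name = name; }
    }

    static class Farm {
        final Queue<Job> queue = new ArrayDeque<>(); // stands in for the scheduler
        void addJob(Job j) { queue.add(j); }         // step 1: submitted to farm
    }

    static abstract class Activity {
        final Farm farm;
        Activity(Farm farm) { this.farm = farm; }
        abstract void pushJobs();
        void addJob(Job j) { farm.addJob(j); }
    }

    static class Activity1 extends Activity {
        Activity1(Farm farm) { super(farm); }
        @Override
        void pushJobs() {
            Job newJob = new Job("analysis-1"); // hypothetical job name
            addJob(newJob);
        }
    }

    public static void main(String[] args) {
        Farm farm = new Farm();
        new Activity1(farm).pushJobs();
        System.out.println(farm.queue.size()); // 1 job queued for a CPU unit
    }
}
```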
  • 17. Output of the simulation. (Diagram: nodes, databases and routers feed the Simulation Engine; Output Listeners with Filters deliver results to graphics, log files, Excel and user clients.) Any component in the system can generate generic results objects. Any client can subscribe with a filter and will receive the results it is interested in. This is a structure very similar to MonALISA’s; we will soon integrate the output of the simulation framework into MonALISA.
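The subscribe-with-a-filter pattern can be sketched as below. All names (`OutputBus`, `Result`, `Listener`) are illustrative, not the actual MONARC or MonALISA output API: components publish generic result objects, and each client's filter decides which results it receives.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Sketch of the results/listener pattern: components publish generic
// result objects; subscribers attach a filter and receive only matches.
public class OutputBus {
    public record Result(String source, String name, double value) {}

    static class Listener {
        final Predicate<Result> filter;
        final List<Result> received = new ArrayList<>();
        Listener(Predicate<Result> filter) { this.filter = filter; }
    }

    private final List<Listener> listeners = new ArrayList<>();

    Listener subscribe(Predicate<Result> filter) {
        Listener l = new Listener(filter);
        listeners.add(l);
        return l;
    }

    /** Any component can publish; only matching subscribers receive it. */
    void publish(Result r) {
        for (Listener l : listeners)
            if (l.filter.test(r)) l.received.add(r);
    }

    public static void main(String[] args) {
        OutputBus bus = new OutputBus();
        Listener cpuOnly = bus.subscribe(r -> r.name().equals("cpuLoad"));
        bus.publish(new Result("node1", "cpuLoad", 0.75));
        bus.publish(new Result("router1", "traffic", 120.0));
        System.out.println(cpuOnly.received.size()); // 1: the filter dropped "traffic"
    }
}
```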
  • 18. Specific Components. (Section divider: repeats the layered-architecture diagram from slide 3.)
  • 19. Specific Components. These components should be derived from the basic components and must implement their specific characteristics and the way they operate. Major parts: Data Model; Data Flow Diagrams for Production and especially for Analysis Jobs; Scheduling / pre-allocation policies; Data Replication Strategies.
  • 20. Data Model. (Diagram: a generic Data Container, described by Size, Event Type, Event Range and Access Count, is instantiated as a Database, a File or a Network File; it is accessed through DB, FTP, NFS and custom data servers, with a META DATA Catalog and a Replication Catalog supporting export / import between nodes.)
  • 21. Data Model (2). (Diagram: a data processing JOB issues a Data Request to the META DATA Catalog and the Replication Catalog, selects from the matching Data Containers, and obtains a list of I/O transactions to execute.)
  • 22. Database Functionality. Client-server model; three kinds of requests for the database server: write, read, and get (read the data and erase it from the server). Automatic storage management is possible, with data being sent to mass storage units. Automatic storage management example: 1. A job wants to write a container into the database DB1, but the server is out of storage space. 2. The least frequently used container is moved to a mass storage unit, and the new container is written to the database.
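The storage-management example can be sketched as follows, with simplified, hypothetical classes (`DatabaseServerSketch` is not the real `DatabaseServer`): when the server runs out of space, the least frequently used container is evicted to mass storage before the new one is written.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of automatic storage management: on overflow, move the least
// frequently used container to mass storage to make room for the new one.
public class DatabaseServerSketch {
    static class DContainer {
        final String name;
        final long size;
        int accessCount;                    // "Access Count" from the data model
        DContainer(String name, long size, int accessCount) {
            this.name = name; this.size = size; this.accessCount = accessCount;
        }
    }

    final long capacity;
    long used = 0;
    final List<DContainer> db = new ArrayList<>();
    final List<DContainer> massStorage = new ArrayList<>();

    DatabaseServerSketch(long capacity) { this.capacity = capacity; }

    void writeData(DContainer c) {
        while (used + c.size > capacity && !db.isEmpty()) {
            // evict the least frequently used container to mass storage
            DContainer lfu = db.stream()
                .min(Comparator.comparingInt(x -> x.accessCount)).get();
            db.remove(lfu);
            massStorage.add(lfu);
            used -= lfu.size;
        }
        db.add(c);
        used += c.size;
    }

    public static void main(String[] args) {
        DatabaseServerSketch server = new DatabaseServerSketch(100);
        server.writeData(new DContainer("DContainer 1", 60, 5));
        server.writeData(new DContainer("DContainer 2", 30, 1));
        server.writeData(new DContainer("DContainer 3", 40, 3)); // out of space
        System.out.println(server.massStorage.get(0).name); // DContainer 2 (LFU)
    }
}
```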
  • 23. Data Flow Diagrams for JOBS. Input and output is a collection of data, described by type and range; a process is described by name. (Diagram: Processing 1, 2, 3 and 4 connected through inputs and outputs, with one output taken 10x.) A fine-granularity decomposition of processes which can be executed independently, and of the way they communicate, can be very useful for optimization and parallel execution!
  • 24. Job Scheduling: Centralized Scheme. (Diagram: the CPU farms at Site A and Site B each have a local JobScheduler, a dynamically loadable module, coordinated by a GLOBAL Job Scheduler.)
  • 25. Job Scheduling: Distributed Scheme (market model). (Diagram: the JobScheduler of the CPU farm at Site A sends a Request to the schedulers at other sites, receives a COST from each, and makes a DECISION on where to run the job.)
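The request/cost/decision loop of the market model can be sketched as below. The cost function here (queue length over free CPUs) is an assumption for illustration only, not the actual MONARC cost model, and the class names are invented.

```java
import java.util.Comparator;
import java.util.List;

// Sketch of the market-model distributed scheme: ask every site for a
// cost estimate, then decide in favor of the cheapest one.
public class MarketScheduler {
    static class Site {
        final String name;
        final int freeCpus;
        final int queuedJobs;
        Site(String name, int freeCpus, int queuedJobs) {
            this.name = name; this.freeCpus = freeCpus; this.queuedJobs = queuedJobs;
        }
        /** Each site answers a Request with its COST of taking the job. */
        double cost() {
            return freeCpus == 0 ? Double.MAX_VALUE
                                 : (double) (queuedJobs + 1) / freeCpus;
        }
    }

    /** DECISION: pick the site with the lowest advertised cost. */
    static Site decide(List<Site> sites) {
        return sites.stream().min(Comparator.comparingDouble(Site::cost)).get();
    }

    public static void main(String[] args) {
        List<Site> sites = List.of(
            new Site("CERN", 4, 10),    // cost (10+1)/4 = 2.75
            new Site("Caltech", 8, 10), // cost (10+1)/8 = 1.375
            new Site("FNAL", 0, 2));    // no free CPUs
        System.out.println(decide(sites).name); // Caltech
    }
}
```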
  • 26. Computing Models. (Section divider: repeats the layered-architecture diagram from slide 3.)
  • 27. Activities: Arrival Patterns. A flexible mechanism to define the stochastic process of how users perform data processing tasks: dynamic loading of “Activity” tasks, which are threaded objects controlled by the simulation scheduling mechanism. Physics activities inject “Jobs” into a Regional Centre Farm:
    for (int k = 0; k < jobs_per_group; k++) {
      Job job = new Job(this, Job.ANALYSIS, "TAG", 1, events_to_process);
      farm.addJob(job); // submit the job
      sim_hold(1000);   // wait 1000 s
    }
Each “Activity” thread generates data processing jobs; these dynamic objects are used to model the users’ behavior.
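One way to make the arrival pattern stochastic, as the slide suggests, is to replace the fixed `sim_hold(1000)` with exponentially distributed inter-arrival times (a Poisson process); this is an illustrative assumption, not something the slide prescribes, and `ArrivalPattern` is an invented name.

```java
import java.util.Random;

// Sketch of a stochastic arrival pattern: jobs are injected with
// exponentially distributed inter-arrival times instead of a fixed hold.
public class ArrivalPattern {
    /** Draw an exponential inter-arrival time with the given mean (seconds). */
    static double nextInterArrival(Random rng, double meanSeconds) {
        return -meanSeconds * Math.log(1.0 - rng.nextDouble());
    }

    public static void main(String[] args) {
        Random rng = new Random(42);      // fixed seed for reproducibility
        double clock = 0;
        int jobsPerGroup = 5;
        for (int k = 0; k < jobsPerGroup; k++) {
            clock += nextInterArrival(rng, 1000.0); // instead of sim_hold(1000)
            System.out.printf("job %d submitted at t = %.1f s%n", k, clock);
        }
    }
}
```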
  • 28. Regional Centre Model: a complex composite object. (Diagram: a REGIONAL CENTER contains a CPU FARM running AJobs, DB Servers with a DB Index, Link Ports to the LAN and WAN, a Job Scheduler and Job Activities; a simplified topology connects centers A through E.)
  • 29. MONARC Main Classes. (Class diagram: engine classes: Task, WorkerThread, Pool, Scheduler, EventQueue, Event; job and scheduling classes: Activity, MetaJob, Job, AJob, JobDatabase, JobFTP, JobProcessData, JobScheduler, QScheduler, DistribScheduler, CPUCluster, CPUUnit, AbstractCPUUnit, Farm, RegionalCenter; data classes: DContainer, Database, DatabaseServer, DatabaseEntity, DatabaseIndex, MassStorage; network classes: NetworkEntity, LinkPort, LAN, WAN, Protocol, TCPProtocol, UDPProtocol, Message, TCPMessage, UDPMessage.)
  • 30. Monitoring. (Section divider: repeats the layered-architecture diagram from slide 3.)
  • 31. Real Need for Flexible Monitoring Systems. It is important to measure and monitor the key applications in a well-defined test environment and to extract the parameters we need for modeling. Monitor the farms used today, and try to understand how they work, in order to simulate such systems. This requires a flexible monitoring system, able to dynamically add new parameters and provide access to historical data, and interfacing of monitoring tools to get the parameters we need in simulations in a nearly automatic way. MonALISA was designed and developed based on the experience with these simulation problems.
  • 32. EXAMPLES
  • 33. FTP and NFS clusters. (Diagram: clients 1..n request events from an FTP (NFS) server.) This example evaluates the performance of a local area network with a server and several worker stations; the server stores events used by the processing nodes.  NFS example: the server concurrently delivers the events, one by one, to the clients.  FTP example: the server sends a whole file of events in a single transfer.
  • 34. FTP Cluster. (Plot: 50 CPU units x 2 jobs per unit, 100 events per job, event size 1 MB; LAN bandwidth 1 Gbps, server’s effective bandwidth 60 Mbps.)
  • 35. NFS Cluster. (Plot.)
  • 36. Distributed Scheduling. Job Migration: when a regional center is assigned too many jobs, it sends a part of them to other centers with more free resources. A new job scheduler was implemented, which supports job migration, applying load balancing criteria. (Diagram: a Regional Center exports jobs to Caltech, CERN, FNAL and KEK.) We tested different configurations, with 1, 2 and 4 regional centers, and with different numbers of CPUs per regional center; the number of jobs submitted is kept constant, with the job arrival rate varying during the day.
  • 37. Distributed Scheduling (2). Test case: 4 regional centers, 20 CPUs per center; average job processing time 3 h; approx. 500 jobs per day submitted in a center. (Plots: average processing time and CPU usage for 1, 2, 4 and 6 centers.)
  • 38. Distributed Scheduling (3)  Similar to the previous example, but the jobs are more complex, involving network transfers.  Centers connected in a chain configuration: CERN, FNAL, Caltech, KEK (chain WAN connection). Every job submitted to a regional center needs an amount of data located in that center. If the job is exported to another center, would the benefits be great enough to compensate for the cost of the data transfer?
  • 39. Distributed Scheduling (4). The average processing time increases significantly when the bandwidth and the number of CPUs are reduced. The network transfers are more intense in the centers in the middle of the chain (like Caltech).
  • 40. Distributed Scheduling (5). (Plots.)
  • 41. Local Data Replication  Evaluates the performance improvements that can be obtained by replicating data.  We simulated a regional center which has a number of database servers, and another four centers which host jobs that process the data on those database servers.  Better performance can be obtained if the data from the servers is replicated into the other regional centers.
  • 42. Local Data Replication (2). (Plots.)
  • 43. WAN Data Replication • Similar to the previous example, but now with two central servers, each holding an equal amount of replicated data, and eight satellite regional centers hosting worker jobs. • A worker job gets a number of events from one of the central regional centers (one event at a time) and processes them locally. (Diagram: the satellite centers’ jobs reach the replica servers over common links.) Compared policies: workers choose the “best” server to get the data from, using a replication load balancing service (which knows the load of the network and of the servers), versus the server being chosen randomly.
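The two server-selection policies being compared can be sketched as below. The load metric (number of active requests) is an illustrative assumption, and the class names are invented, not the MONARC replication service API.

```java
import java.util.List;
import java.util.Random;

// Sketch of the two replica-selection policies: load-balanced choice of
// the least loaded server versus a purely random choice.
public class ReplicaSelection {
    static class Server {
        final String name;
        int activeRequests;                 // simple proxy for server load
        Server(String name) { this.name = name; }
    }

    /** Load-balancing policy: the "best" (least loaded) server. */
    static Server pickBest(List<Server> servers) {
        Server best = servers.get(0);
        for (Server s : servers)
            if (s.activeRequests < best.activeRequests) best = s;
        return best;
    }

    /** Baseline policy: pick a replica at random. */
    static Server pickRandom(List<Server> servers, Random rng) {
        return servers.get(rng.nextInt(servers.size()));
    }

    public static void main(String[] args) {
        Server a = new Server("Replica A");
        Server b = new Server("Replica B");
        a.activeRequests = 7;
        b.activeRequests = 2;
        System.out.println(pickBest(List.of(a, b)).name); // Replica B
    }
}
```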
  • 44. WAN Data Replication. (Plots: first test: both servers have the same bandwidth and support the same maximum load; second test: one server has half of the other’s bandwidth and supports half of its maximum load.) The average response time is better and the total execution time is smaller when taking decisions based on load balancing.
  • 45. Summary  Modelling and understanding current systems, their performance and limitations, is essential for the design of large-scale distributed processing systems. This will require continuous iteration between modelling and monitoring.  Simulation and modelling tools must provide the functionality to help in designing complex systems and to evaluate different strategies and algorithms for the decision-making units and the data flow management. http://monarc.cacr.caltech.edu/