What Is High Throughput Distributed Computing


The lectures will survey technologies that could be used for storing and managing the many PetaBytes of data that will be collected and processed at LHC. This will cover current mainline hardware technologies, their capacity, performance and reliability characteristics, the likely evolution in the period from now to LHC, and their fundamental limits. It will also cover promising new technologies, including both products which are emerging from the home and office computing environment (such as DVDs) and more exotic techniques. The importance of market acceptance and production volume as cost factors will be mentioned. Robotic handling systems for mass storage media will also be discussed. After summarising the mass storage requirements of LHC, some suggestions will be made of how these requirements may be met with the technology which will be available.

    1. What is High Throughput Distributed Computing?
       CERN Computing Summer School 2001 – Santander
       Les Robertson, CERN - IT Division, [email_address]
    2. Outline
       • High Performance Computing (HPC) and High Throughput Computing (HTC)
       • Parallel processing
         – so difficult with HPC applications
         – so easy with HTC
       • Some models of distributed computing
       • HEP applications
       • Offline computing for LHC
       • Extending HTC to the Grid
    3. “Speeding Up” the Calculation?
       • Use the fastest processor available – but this gives only a small factor over modest (PC) processors
       • Use many processors, performing bits of the problem in parallel – and since quite fast processors are inexpensive we can think of using very many processors in parallel
    4. High Performance – or – High Throughput?
       • The key questions are granularity & degree of parallelism
       • Have you got one big problem or a bunch of little ones? To what extent can the “problem” be decomposed into sort-of-independent parts (grains) that can all be processed in parallel?
       • Granularity
         – fine-grained parallelism – the independent bits are small, need to exchange information, synchronise often
         – coarse-grained – the problem can be decomposed into large chunks that can be processed independently
       • Practical limits on the degree of parallelism –
         – how many grains can be processed in parallel?
         – degree of parallelism v. grain size
         – grain size is limited by the efficiency of the system at synchronising grains
    5. High Performance – v. – High Throughput?
       • Fine-grained problems need a high performance system
         – that enables rapid synchronisation between the bits that can be processed in parallel
         – and runs the bits that are difficult to parallelise as fast as possible
       • Coarse-grained problems can use a high throughput system
         – that maximises the number of parts processed per minute
       • High Throughput Systems use a large number of inexpensive processors, inexpensively interconnected, while High Performance Systems use a smaller number of more expensive processors, expensively interconnected
    6. High Performance – v. – High Throughput?
       • There is nothing fundamental here – it is just a question of financial trade-offs like:
         – how much more expensive is a “fast” computer than a bunch of slower ones?
         – how much is it worth to get the answer more quickly?
         – how much investment is necessary to improve the degree of parallelisation of the algorithm?
       • But the target is moving –
         – since the cost chasm first opened between fast and slower computers 12-15 years ago, an enormous effort has gone into finding parallelism in “big” problems
         – inexorably decreasing computer costs and de-regulation of the wide area network infrastructure have opened the door to ever larger computing facilities – clusters → fabrics → (inter)national grids – demanding ever-greater degrees of parallelism
    7. High Performance Computing
    8. A quick look at HPC problems
       • Classical high-performance applications
         – numerical simulations of complex systems such as weather, climate, combustion, mechanical devices and structures, crash simulation, electronic circuits, manufacturing processes, chemical reactions
         – image processing applications like medical scans, military sensors, earth observation and satellite reconnaissance, seismic prospecting
    9. Approaches to parallelism
       • Domain decomposition
       • Functional decomposition
       (graphics from Designing and Building Parallel Programs (Online), by Ian Foster – http://www-unix.mcs.anl.gov/dbpp/)
    10. Of course – it’s not that simple
       (graphic from Designing and Building Parallel Programs (Online), by Ian Foster – http://www-unix.mcs.anl.gov/dbpp/)
    11. The design process
       • Data or functional decomposition → building an abstract task model
       • Building a model for communication between tasks → interaction patterns
       • Agglomeration – to fit the abstract model to the constraints of the target hardware
         – interconnection topology
         – speed, latency, overhead of communications
       • Mapping the tasks to the processors
         – load balancing
         – task scheduling
       (graphic from Designing and Building Parallel Programs (Online), by Ian Foster – http://www-unix.mcs.anl.gov/dbpp/)
    12. Large scale parallelism – the need for standards
       • The “supercomputer” market is in trouble; diminishing number of suppliers; questionable future
       • Increasingly risky to design for specific tightly coupled architectures like SGI (Cray, Origin), NEC, Hitachi
       • Require a standard for communication between partitions/tasks that works also on loosely coupled systems (“massively parallel processors” – MPP – IBM SP, Compaq)
       • Paradigm is message passing rather than shared memory – tasks rather than threads
         – Parallel Virtual Machine – PVM
         – MPI – Message Passing Interface
    13. MPI – Message Passing Interface
       • industry standard – http://www.mpi-forum.org
       • source code portability
       • widely available; efficient implementations
       • SPMD (Single Program Multiple Data) model
         – point-to-point communication (send/receive/wait; blocking/non-blocking)
         – collective operations (broadcast; scatter/gather; reduce)
         – process groups, topologies
       • comprehensive and rich functionality
    14. MPI – Collective operations
       • Defining high-level data functions allows highly efficient implementations, e.g. minimising data copies
       • IBM Redbook – http://www.redbooks.ibm.com/redbooks/SG245380.html
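The semantics of the collective operations on the slide above can be illustrated without a real MPI installation. The following plain-Python sketch (illustrative function names; no actual message passing – mpi4py or C MPI would distribute these steps across processes) shows what scatter, gather and a sum-reduce compute:

```python
# Plain-Python illustration of MPI collective-operation data movement.

def scatter(data, n):
    """Root splits its data into n equal chunks, one per rank."""
    size = len(data) // n
    return [data[i * size:(i + 1) * size] for i in range(n)]

def gather(chunks):
    """Inverse of scatter: the ranks' chunks are concatenated at the root."""
    return [x for chunk in chunks for x in chunk]

def reduce_sum(values):
    """Like MPI_Reduce with op=MPI_SUM: one contribution per rank, combined at the root."""
    return sum(values)

chunks = scatter(list(range(8)), 4)
print(chunks)                 # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(gather(chunks))         # [0, 1, 2, 3, 4, 5, 6, 7]
print(reduce_sum(range(8)))   # 28
```

An efficient MPI implementation performs the same logical data movement, but with tree-structured communication and minimal copying.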
    15. The limits of parallelism – Amdahl’s Law
       • s – time spent on a serial processor in the sequential parts of the code
       • p – time spent on a serial processor in the parts that could be executed in parallel
       • If we have N processors:
             Speedup = (s + p) / (s + p/N)
       • Taking s as the fraction of the time spent in the sequential part of the program (s + p = 1):
             Speedup = 1 / (s + (1-s)/N)  →  1/s as N → ∞
       Amdahl, G.M., Validity of the single processor approach to achieving large scale computing capabilities, Proc. AFIPS Conf., Reston, VA, 1967, pp. 483-485
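The saturation at 1/s is easy to check numerically; a minimal sketch:

```python
def amdahl_speedup(s, n):
    """Amdahl's Law: Speedup = 1 / (s + (1 - s)/N), with s the sequential fraction."""
    return 1.0 / (s + (1.0 - s) / n)

# With 5% sequential code the speedup saturates near 1/s = 20,
# no matter how many processors are added.
for n in (10, 100, 1000, 10**6):
    print(n, round(amdahl_speedup(0.05, n), 2))
```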
    16. Amdahl’s Law – maximum speedup
    17. Load Balancing – real life is (much) worse
       • Often have to use barrier synchronisation between each step, and different cells require different amounts of computation
       • Real-time sequential part: s = Σ_i s_i
       • Real-time parallelisable part on a sequential processor: p = Σ_k Σ_j p_kj
       • Real-time parallelised: T = s + Σ_k max_j(p_kj) >> s + p/N
       (diagram: serial sections s_1 … s_N alternate with parallel steps; in step k the grains p_k1 … p_kM run concurrently and the barrier waits for the slowest one)
    18. Gustafson’s Interpretation
       • The problem size scales with the number of processors
       • With a lot more processors (computing capacity) available you can and will do much more work in less time
       • The complexity of the application rises to fill the capacity available
       • But the sequential part remains approximately constant
       Gustafson, J.L., Reevaluating Amdahl’s Law, CACM 31(5), 1988, pp. 532-533
    19. Amdahl’s Law – maximum speedup with Gustafson’s appetite
       • potential 1,000× speedup with 0.1% sequential code
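The 1,000× figure follows directly from Gustafson's scaled-speedup formula, Speedup(N) = s + (1 - s)·N, where the parallel work grows with N while the sequential part s stays fixed; a sketch:

```python
def gustafson_speedup(s, n):
    """Gustafson's scaled speedup: Speedup(N) = s + (1 - s) * N."""
    return s + (1.0 - s) * n

# 0.1% sequential code on 1000 processors gives nearly the full 1000x:
print(gustafson_speedup(0.001, 1000))  # ≈ 999
```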
    20. The importance of the network
       • Communication overhead adds to the inherent sequential part of the program to limit the Amdahl speedup
       • Latency – the round-trip time (RTT) to communicate between two processors
       • Communications overhead: c = latency + data_transfer_time
             Speedup = (s + p) / (s + c + p/N)
       • For fine-grained parallel programs the problem is latency, not bandwidth
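Plugging the overhead c into the speedup formula shows how it behaves like extra sequential time (a sketch with illustrative numbers):

```python
def speedup_with_comms(s, c, n):
    """Speedup = (s + p) / (s + c + p/N) with s + p = 1; the
    communication overhead c acts like extra sequential time."""
    return 1.0 / (s + c + (1.0 - s) / n)

# A program with 1% sequential code would top out near 100x;
# add 2% communication overhead and the limit drops to about 1/0.03 = 33x.
print(round(speedup_with_comms(0.01, 0.02, 10**6), 1))
```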
    21. Latency
       • Comparison – efficient MPI implementation on a Linux cluster (source: Real World Computing Partnership, Tsukuba Research Center)

       Network                          Bandwidth (MByte/sec)   RTT Latency (microsec)
       Myrinet                          146                     20
       Gigabit Ethernet (SysKonnect)    73                      61
       Fast Ethernet (EEPRO100)         11                      100
    22. High Throughput Computing
    23. High Throughput Computing – HTC
       • Roughly speaking –
         – HPC deals with one large problem
         – HTC is appropriate when the problem can be decomposed into many (very many) smaller problems that are essentially independent
       • Examples:
         – build a profile of all MasterCard customers who purchased an airline ticket and rented a car in August
         – analyse the purchase patterns of Walmart customers in the LA area last month
         – generate 10^6 CMS events
         – Web surfing, Web searching
         – database queries
       • HPC – problems that are hard to parallelise – single processor performance is important
       • HTC – problems that are easy to parallelise – can be adapted to very large numbers of processors
    24. HTC – HPC
       High Throughput
       • Granularity can be selected to fit the environment
       • Load balancing easy
       • Mixing workloads is easy
       • Sustained throughput is the key goal
         – the order in which the individual tasks execute is (usually) not important
         – if some equipment goes down the work can be re-run later
         – easy to re-schedule the workload dynamically to different configurations
       High Performance
       • Granularity largely defined by the algorithm and limitations in the hardware
       • Load balancing difficult
       • Hard to schedule different workloads
       • Reliability is all-important
         – if one part fails the calculation stops (maybe even aborts!)
         – check-pointing essential – all the processes must be restarted from the same synchronisation point
         – hard to re-configure dynamically for a smaller number of processors
    25. Distributed Computing
    26. Distributed Computing
       • Local distributed systems
         – clusters
         – parallel computers (IBM SP)
       • Geographically distributed systems
         – computational Grids
       • HPC – as we have seen – needs low latency AND good communication bandwidth
       • HTC distributed systems
         – the bandwidth is important, the latency is less significant
         – if latency is poor, more processes can be run in parallel to cover the waiting time
    27. Shared Data
       • If the granularity is coarse enough, the different parts of the problem can be synchronised simply by sharing data
       • Example – event reconstruction
         – all of the events to be reconstructed are stored in a large data store
         – processes (jobs) read successive raw events, generating processed event records, until there are no raw events left
         – the result is the concatenation of the processed events (and the folding together of some histogram data)
         – synchronisation overhead can be minimised by partitioning the input and output data
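The event-reconstruction pattern above can be sketched in a few lines (the `reconstruct` function and its fields are purely illustrative stand-ins, and the "jobs" run sequentially here rather than on a farm):

```python
from collections import Counter

def reconstruct(raw_event):
    # stand-in for real event reconstruction (illustrative only)
    return {"id": raw_event, "bin": raw_event % 4}

def run_job(raw_events):
    """One job: read successive raw events, emit processed records
    and a partial histogram."""
    processed, hist = [], Counter()
    for ev in raw_events:
        rec = reconstruct(ev)
        processed.append(rec)
        hist[rec["bin"]] += 1
    return processed, hist

# Partition the input between two independent jobs, then merge:
# concatenating the outputs and folding the histograms together is
# the only synchronisation needed.
results = [run_job(range(0, 50)), run_job(range(50, 100))]
all_events = [rec for processed, _ in results for rec in processed]
total_hist = Counter()
for _, hist in results:
    total_hist.update(hist)
print(len(all_events), sum(total_hist.values()))  # 100 100
```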
    28. Data Sharing – Files
       • Global file namespace
         – maps a universal name to a network node and local name
       • Remote data access
       • Caching strategies
         – local or intermediate caching
         – replication
         – migration
       • Access control, authentication issues
       • Locking issues
       • Examples: NFS, AFS, Web folders
       • Highly scalable for read-only data
    29. Data Sharing – Databases, Objects
       • File sharing is probably the simplest paradigm for building distributed systems
       • Database and object sharing look the same
       • But –
         – files are universal, fundamental systems concepts – standard interfaces, functionality
         – databases are not yet fundamental, built-in concepts – there are only a few standards
         – objects even less so – still at the application level – so it is harder to implement efficient and universal caching, remote access, etc.
    30. Client-server
       • Examples: Web browsing, online banking, order entry, …
       • The functionality is divided between the two parts – for example
         – exploit locality of data (e.g. perform searches, transformations on the node where the data resides)
         – exploit different hardware capabilities (e.g. central supercomputer, graphics workstation)
         – security concerns – restrict sensitive data to defined geographical locations (e.g. account queries)
         – reliability concerns (e.g. perform database updates on highly reliable servers)
       • Usually the server implements pre-defined, standardised functions
       (diagram: client sends a request to the server, the server returns a response)
    31. 3-Tier client-server
       • Data extracts are replicated on intermediate servers
       • Changes are batched for asynchronous treatment by the database server
       • Enables –
         – scaling up client query capacity
         – isolation of the main database
       (diagram: groups of clients connect to intermediate servers, which connect to the central database server)
    32. Peer-to-Peer – P2P
       • Peer-to-Peer → decentralisation of function and control
       • Taking advantage of the computational resources at the edge of the network
       • The functions are shared between the distributed parts – without central control
       • Programs cooperate without being designed as a single application
       • So P2P is just a democratic form of parallel programming –
         – SETI
         – the parallel HPC problems we have looked at, using MPI
       • All the buzz of P2P is because new interfaces promise to bring this to the commercial world, allowing different communities and businesses to collaborate through the internet
         – XML
         – SOAP
         – .NET
         – JXTA
    33. Simple Object Access Protocol – SOAP
       • SOAP – a simple, lightweight mechanism for exchanging objects between peers in a distributed environment using XML carried over HTTP
       • SOAP consists of three parts:
         – the SOAP envelope – what is in a message, who should deal with it, and whether it is optional or mandatory
         – the SOAP encoding rules – serialisation definition for exchanging instances of application-defined datatypes
         – the SOAP Remote Procedure Call representation
    34. Microsoft’s .NET
       • .NET is a framework, or environment, for building, deploying and running Web services and other internet applications
         – Common Language Runtime – C++, C#, Visual Basic and JScript
         – Framework classes
       • Aiming at a standard, but Windows only
    35. JXTA
       • Interoperability
         – locating JXTA peers
         – communication
       • Platform, language and network independence
       • Implementable on anything – phone, VCR, PDA, PC
       • A set of protocols
       • Security model
       • Peer discovery
       • Peer groups
       • XML encoding
       http://www.jxta.org/project/www/docs/TechOverview.pdf
    36. End of Part 1
       Tomorrow: HEP applications; Offline computing for LHC; Extending HTC to the Grid
    37. HEP Applications
    38. Data Handling and Computation for Physics Analysis
       (diagram: raw data from the detector passes through the event filter (selection & reconstruction) to produce event summary data; event reprocessing and event simulation feed the same store; batch physics analysis extracts analysis objects by physics topic, which feed interactive physics analysis)
       [email_address] CERN
    39. HEP Computing Characteristics
       • Large numbers of independent events – trivial parallelism – “job” granularity
       • Modest floating point requirement – SPECint performance
       • Large data sets – smallish records, mostly read-only
       • Modest I/O rates – few MB/sec per fast processor
       • Simulation
         – cpu-intensive
         – mostly static input data
         – very low output data rate
       • Reconstruction
         – very modest I/O
         – easy to partition input data
         – easy to collect output data
    40. Analysis
       • ESD analysis
         – modest I/O rates
         – read-only ESD
         – BUT a very large input database
         – chaotic workload – unpredictable, no limit to the requirements
       • AOD analysis
         – potentially very high I/O rates
         – but a modest database
    41. HEP Computing Characteristics
       • Large numbers of independent events – trivial parallelism – “job” granularity
       • Large data sets – smallish records, mostly read-only
       • Modest I/O rates – few MB/sec per fast processor
       • Modest floating point requirement – SPECint performance
       • Chaotic workload –
         – research environment → unpredictable, no limit to the requirements
       • Very large aggregate requirements – computation, data
         – scaling up is not just big – it is also complex
         – … and once you exceed the capabilities of a single geographical installation ………?
    42. Task Farming
    43. Task Farming
       • Decompose the data into large independent chunks
       • Assign one task (or job) to each chunk
       • Put all the tasks in a queue for a scheduler which manages a large “farm” of processors, each of which has access to all of the data
       • The scheduler runs one or more jobs on each processor
       • When a job finishes, the next job in the queue is started
       • … until all the jobs have been run
       • Collect the output files
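The steps above can be sketched as a minimal task-farm scheduler (a toy sketch: real farms dispatch jobs to separate machines, while this uses threads in one process, and the `process` function stands in for a whole reconstruction job):

```python
from queue import Queue, Empty
from threading import Thread, Lock

def task_farm(chunks, process, n_workers=4):
    """Minimal task-farm scheduler: independent chunks go into a shared
    queue; each worker pulls the next chunk as soon as it is free."""
    todo = Queue()
    for chunk in chunks:
        todo.put(chunk)
    results, lock = [], Lock()

    def worker():
        while True:
            try:
                chunk = todo.get_nowait()
            except Empty:
                return                  # queue drained: worker retires
            out = process(chunk)
            with lock:                  # collect the output files
                results.append(out)

    workers = [Thread(target=worker) for _ in range(n_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results

# "process" ten chunks; the completion order is irrelevant, as the
# HTC discussion above notes.
print(sorted(task_farm(range(10), lambda c: c * c)))
```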
    44. Task Farming
       • Task farming is good for a very large problem which has
         – selectable granularity
         – largely independent tasks
         – loosely shared data
       • HEP – simulation, reconstruction, and much of the analysis
    45. The SHIFT Software Model (1990)
       • From the application’s viewpoint this is simply file sharing – all data available to all processes
       • Standard APIs – disk I/O; mass storage; job scheduler; can be implemented over an IP network
       • Mass storage model – tape data cached on disk (stager)
       • Physical implementation transparent to the application/user
       • Scalable, heterogeneous
       • Flexible evolution – scalable capacity; multiple platforms; seamless integration of new technologies
       (diagram: disk servers, application servers, stage (migration) servers, tape servers and queue servers interconnected by an IP network)
       [email_address]
    46. Current Implementation of SHIFT
       • Application servers – racks of dual-cpu Linux PCs
       • Data cache – Linux PC controllers, IDE disks
       • Mass storage – Linux PC controllers; robots – STK Powderhorn; drives – STK 9840, STK 9940, IBM 3590
       • Network – Ethernet 100BaseT, Gigabit; WAN
    47. Fermilab Reconstruction Farms
       • 1991 – farms of RISC workstations introduced for reconstruction
       • Replaced special purpose processors (emulators, ACP)
       • Ethernet network
       • Integrated with tape systems
       • cps – job scheduler, event manager
    48. Condor – a hunter of unused cycles
       • The hunter of idle workstations (1986)
       • ClassAd Matchmaking
         – users advertise their requirements
         – systems advertise their capabilities & constraints
       • Directed Acyclic Graph Manager – DAGMan
         – define dependencies between jobs
       • Checkpoint – reschedule – restart
         – if the owner of the workstation returns
         – or if there is some failure
       • Share data through files
         – global shared files
         – Condor file system calls
       • Flocking
         – interconnecting pools of Condor workstations
       http://www.cs.wisc.edu/condor/
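The matchmaking idea can be sketched in a few lines. This is not the real ClassAd language (ClassAds are richer, with expressions and two-way constraints); the attribute names and values here are purely illustrative:

```python
# Toy ClassAd-style matchmaking: jobs advertise requirements, machines
# advertise capabilities, and a matchmaker pairs a job with any machine
# that satisfies every requirement.
def matches(requirements, machine_ad):
    return all(machine_ad.get(attr, 0) >= needed
               for attr, needed in requirements.items())

machines = [
    {"name": "pc01", "memory_mb": 256,  "idle_minutes": 30},
    {"name": "pc02", "memory_mb": 1024, "idle_minutes": 5},
]
job = {"memory_mb": 512, "idle_minutes": 5}
print([m["name"] for m in machines if matches(job, m)])  # ['pc02']
```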
    49. Layout of the Condor Pool
       (diagram: the Central Manager runs the master, collector, negotiator, schedd and startd processes; each Desktop runs a master, schedd and startd; each Cluster Node runs a master and startd; ClassAd communication pathways connect them all)
       http://www.cs.wisc.edu/condor
    50. How Flocking Works
       • Add a line to your condor_config:
             FLOCK_HOSTS = Pool-Foo, Pool-Bar
       (diagram: the schedd on the submit machine talks to the collector and negotiator of its own Central Manager (CONDOR_HOST) and also to those of the Pool-Foo and Pool-Bar Central Managers)
       http://www.cs.wisc.edu/condor
    51. (diagram: a Home Condor Pool overflows 600 Condor jobs to a Friendly Condor Pool)
       http://www.cs.wisc.edu/condor
    52. Finer grained HTC
    53. The food chain in reverse –
       • The PC has consumed the market for larger computers, destroying the species
       • There is no choice but to harness the PCs
    54. Berkeley – Networks of Workstations (1994)
       • Single system view
       • Shared resources
       • Virtual machine
       • Single address space
       • Global Layer Unix – GLUnix
       • Serverless Network File Service – xFS
       • Research project
       A Case for Networks of Workstations: NOW, IEEE Micro, Feb. 1995, Thomas E. Anderson, David E. Culler, David A. Patterson – http://now.cs.berkeley.edu
    55. Beowulf
       • NASA Goddard (Thomas Sterling, Donald Becker) – 1994
         – 16 Intel PCs – Ethernet – Linux
       • Caltech/JPL, Los Alamos
         – parallel applications from the supercomputing community
       • Oak Ridge – 1996 – The Stone SouperComputer
         – problem – generate an eco-region map of the US on a 1 km grid
         – a 64-way PC cluster proposal was rejected
         – so they re-cycled rejected desktop systems
       • The experience, the emphasis on do-it-yourself, the packaging of some of the tools – and probably the name – stimulated the wide-spread adoption of clusters in the super-computing world
    56. Parallel ROOT Facility – PROOF
       • ROOT – object oriented analysis tool
       • Queries are performed in parallel on an arbitrary number of processors
       • Load balancing:
         – slaves receive work from the master process in “packets”
         – the packet size is adapted to the current load, the number of slaves, etc.
    57. LHC Computing
    58. CERN’s Users in the World
       • Europe: 267 institutes, 4603 users
       • Elsewhere: 208 institutes, 1632 users
    59. The Large Hadron Collider Project
       • 4 detectors – CMS, ATLAS, LHCb, …
       • Storage – raw recording rate 0.1 – 1 GBytes/sec; accumulating at 5-8 PetaBytes/year; 10 PetaBytes of disk
       • Processing – 200,000 of today’s fastest PCs
    60.
       • Worldwide distributed computing system
       • Small fraction of the analysis at CERN
       • ESD analysis – using 12-20 large regional centres
         – how to use the resources efficiently
         – establishing and maintaining a uniform physics environment
       • Data exchange – with tens of smaller regional centres, universities, labs
    61. Planned capacity evolution at CERN
       (chart: mass storage, disk and CPU capacity for LHC and the other experiments, compared with Moore’s law)
    62. Are Grids a solution?
       • The Grid – Ian Foster, Carl Kesselman – The Globus Project
       • “Dependable, consistent, pervasive access to [high-end] resources”
         – Dependable: provides performance and functionality guarantees
         – Consistent: uniform interfaces to a wide variety of resources
         – Pervasive: ability to “plug in” from anywhere
    63. The Grid
       • The GRID: ubiquitous access to computation, in the sense that the WEB provides ubiquitous access to information
    64. Globus Architecture – www.globus.org
       • Applications
       • High-level Services and Tools: DUROC, globusrun, MPI, Nimrod/G, MPI-IO, CC++, GlobusView, Testbed Status
       • Core Services (middleware – a uniform application program interface to grid resources): Metacomputing Directory Service, GRAM, Globus Security Interface, Heartbeat Monitor, Nexus, Gloperf, GASS
       • Local Services (grid infrastructure primitives, mapped to local implementations, architectures, policies): LSF, Condor, MPI, NQE, Easy, TCP, UDP, Solaris, Irix, AIX
    65.
       • The nodes of the Grid
         – are managed by different people
         – so have different access and usage policies
         – and may have different architectures
       • The geographical distribution
         – means that there cannot be a central status
         – status information and resource availability is “published” (remember Condor Classified Ads)
         – Grid schedulers can only have an approximate view of resources
       • The Grid middleware tries to present this as a coherent virtual computing centre
    66. Core Services
       • Security
       • Information Service
       • Resource Management – Grid scheduler, standard resource allocation
       • Remote Data Access – global namespace, caching, replication
       • Performance and Status Monitoring
       • Fault detection
       • Error Recovery Management
    67. The Promise of Grid Technology
       • What does the Grid do for you?
       • You submit your work, and the Grid
         – finds convenient places for it to be run
         – optimises use of the widely dispersed resources
         – organises efficient access to your data (caching, migration, replication)
         – deals with authentication to the different sites that you will be using
         – interfaces to local site resource allocation mechanisms, policies
         – runs your jobs
         – monitors progress
         – recovers from problems
         – .. and .. tells you when your work is complete
    68. LHC Computing Model 2001 – evolving
       The LHC Computing Centre – the opportunity of Grid technology
       (diagram: the experiments CMS, ATLAS and LHCb feed the CERN Tier 0 centre; Tier 1 regional centres in Germany, USA, UK, France, Italy, CERN, …; Tier 2 centres serving labs and universities (Lab a, Uni a, Lab c, Uni n, …); Tier 3 physics department resources; desktops for physics groups and regional groups)
       [email_address]