The lectures will survey technologies that could be used for storing and managing the many PetaBytes of data that will be collected and processed at LHC. This will cover current mainline hardware technologies, their capacity, performance and reliability characteristics, the likely evolution in the period from now to LHC, and their fundamental limits. It will also cover promising new technologies including both products which are emerging from the home and office computing environment (such as DVDs) and more exotic techniques. The importance of market acceptance and production volume as cost factors will be mentioned. Robotic handling systems for mass storage media will also be discussed. After summarising the mass storage requirements of LHC, some suggestions will be made of how these requirements may be met with the technology which will be available.
What Is High Throughput Distributed Computing - Presentation Transcript
What is High Throughput Distributed Computing? CERN Computing Summer School 2001 Santander Les Robertson CERN - IT Division [email_address]
Outline
High Performance Computing (HPC) and High Throughput Computing (HTC)
Parallel processing
so difficult with HPC applications
so easy with HTC
Some models of distributed computing
HEP applications
Offline computing for LHC
Extending HTC to the Grid
“Speeding Up” the Calculation?
Use the fastest processor available
-- but this gives only a small factor over modest (PC) processors
Use many processors, performing bits of the problem in parallel
-- and since quite fast processors are inexpensive we can think
of using very many processors in parallel
High Performance – or – High Throughput?
The key questions are – granularity & degree of parallelism
Have you got one big problem or a bunch of little ones? To what extent can the “problem” be decomposed into sort-of-independent parts (grains) that can all be processed in parallel?
Granularity
fine-grained parallelism – the independent bits are small, need to exchange information, synchronise often
coarse-grained – the problem can be decomposed into large chunks that can be processed independently
Practical limits on the degree of parallelism –
how many grains can be processed in parallel?
degree of parallelism v. grain size
grain size limited by the efficiency of the system at synchronising grains
High Performance – v. – High Throughput?
fine-grained problems need a high performance system
that enables rapid synchronisation between the bits that can be processed in parallel
and runs the bits that are difficult to parallelise as fast as possible
coarse-grained problems can use a high throughput system
that maximises the number of parts processed per minute
High Throughput Systems use a large number of inexpensive processors, inexpensively interconnected
while High Performance Systems use a smaller number of more expensive processors expensively interconnected
High Performance – v. – High Throughput?
There is nothing fundamental here – it is just a question of financial trade-offs like:
how much more expensive is a “fast” computer than a bunch of slower ones?
how much is it worth to get the answer more quickly?
how much investment is necessary to improve the degree of parallelisation of the algorithm?
But the target is moving -
Since the cost chasm first opened between fast and slower computers 12-15 years ago an enormous effort has gone into finding parallelism in “big” problems
Inexorably decreasing computer costs and de-regulation of the wide area network infrastructure have opened the door to ever larger computing facilities – clusters, fabrics, (inter)national grids – demanding ever-greater degrees of parallelism
High Performance Computing
A quick look at HPC problems
Classical high-performance applications
numerical simulations of complex systems such as
weather
climate
combustion
mechanical devices and structures
crash simulation
electronic circuits
manufacturing processes
chemical reactions
image processing applications like
medical scans
military sensors
earth observation, satellite reconnaissance
seismic prospecting
Approaches to parallelism
Domain decomposition
Functional decomposition
graphics from Designing and Building Parallel Programs (Online), by Ian Foster - http://www-unix.mcs.anl.gov/dbpp/
Of course – it’s not that simple
graphic from Designing and Building Parallel Programs (Online), by Ian Foster - http://www-unix.mcs.anl.gov/dbpp/
The design process
Data or functional decomposition
building an abstract task model
Building a model for communication between tasks
interaction patterns
Agglomeration – to fit the abstract model to the constraints of the target hardware
interconnection topology
speed, latency, overhead of communications
Mapping the tasks to the processors
load balancing
task scheduling
graphic from Designing and Building Parallel Programs (Online), by Ian Foster - http://www-unix.mcs.anl.gov/dbpp/
Large scale parallelism – the need for standards
“Supercomputer” market is in trouble; diminishing number of suppliers; questionable future
Increasingly risky to design for specific tightly coupled architectures like - SGI (Cray, Origin), NEC, Hitachi
Require a standard for communication between partitions/tasks that works also on loosely coupled systems (“massively parallel processors” – MPP – IBM SP, Compaq)
Paradigm is message passing rather than shared memory – tasks rather than threads
Parallel Virtual Machine - PVM
MPI – Message Passing Interface
MPI – Message Passing Interface
industrial standard – http://www.mpi-forum.org
source code portability
widely available; efficient implementations
SPMD (Single Program Multiple Data) model
Point-to-point communication (send/receive/wait; blocking/non-blocking)
IBM Redbook - http://www.redbooks.ibm.com/redbooks/SG245380.html
Defining high level data functions allows highly efficient implementations, e.g. minimising data copies
The limits of parallelism - Amdahl’s Law
If we have N processors:
$$\text{Speedup} = \frac{s + p}{s + p/N}$$
s – time spent in a serial processor on serial parts of the code
p – time spent in a serial processor on parts that could be executed in parallel
taking s as the fraction of the time spent in the sequential part of the program ( s + p = 1 )
$$\text{Speedup} = \frac{1}{s + (1-s)/N} \le \frac{1}{s}$$
Amdahl, G.M., Validity of single-processor approach to achieving large scale computing capability, Proc. AFIPS Conf., Reston, VA, 1967, pp. 483-485
Amdahl’s Law - maximum speedup
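As a rough illustration of the formula above, the following Python sketch evaluates the speedup for a given sequential fraction and processor count. The function names `amdahl_speedup` and `max_speedup` are chosen here for illustration and do not come from the lecture.

```python
def amdahl_speedup(s, n):
    """Speedup on n processors when a fraction s of the run time is sequential."""
    if not 0.0 <= s <= 1.0:
        raise ValueError("s must be a fraction between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / (s + (1.0 - s) / n)


def max_speedup(s):
    """Limit of the speedup as the number of processors grows without bound."""
    return float("inf") if s == 0 else 1.0 / s
```

Calling `amdahl_speedup(0.01, n)` for increasing `n` shows the curve flattening towards `max_speedup(0.01)`: however many processors are added, the sequential 1% caps the gain.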
Load Balancing - real life is (much) worse
Often have to use barrier synchronisation between each step, and different cells require different amounts of computation
Real time sequential part: $s = \sum_i s_i$
Real time parallelisable part on a sequential processor: $p = \sum_k \sum_j p_{kj}$
Real time parallelised: $T = s + \sum_j \max_k ( p_{kj} )$
$T = s + \sum_j \max_k ( p_{kj} ) \gg s + p/N$
[Diagram: time axis t showing the sequential parts s_1 … s_N and the parallel parts p_11 … p_KM]
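The effect of barrier synchronisation can be sketched in a few lines of Python. This is only an illustration under simple assumptions: every step ends with a barrier, so a step lasts as long as its slowest processor, and the names `barrier_time` and `ideal_time` are invented for this sketch.

```python
def barrier_time(seq_parts, par_parts):
    """Elapsed time when a barrier follows every parallel step.

    seq_parts: list of sequential times
    par_parts: one list per step, holding the time each processor needs in that step
    """
    s = sum(seq_parts)
    # With a barrier, each step takes as long as its slowest processor.
    return s + sum(max(step) for step in par_parts)


def ideal_time(seq_parts, par_parts):
    """Elapsed time if the parallel work were spread perfectly evenly."""
    s = sum(seq_parts)
    p = sum(sum(step) for step in par_parts)
    n = len(par_parts[0])  # number of processors
    return s + p / n
```

With `seq_parts = [1, 1]` and `par_parts = [[2, 4, 6], [5, 1, 3]]` the barrier version takes 13 time units against 9 for the perfectly balanced case; the gap grows with the spread in work per cell.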
Gustafson’s Interpretation
The problem size scales with the number of processors
With a lot more processors (computing capacity) available you can and will do much more work in less time
The complexity of the application rises to fill the capacity available
But the sequential part remains approximately constant
Gustafson, J.L., Re-evaluating Amdahl’s Law, CACM 31(5), 1988, pp. 532-533
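A small sketch may help contrast the two views. It uses the scaled-speedup expression from Gustafson's paper, N + (1 − N)·s, where s is the sequential fraction measured on the parallel system; the function names are chosen here for illustration.

```python
def amdahl_speedup(s, n):
    """Fixed problem size: sequential fraction s measured on one processor."""
    return 1.0 / (s + (1.0 - s) / n)


def gustafson_speedup(s, n):
    """Scaled problem size: sequential fraction s measured on the n-processor run."""
    return n + (1 - n) * s
```

For s = 0.01 and n = 1000, the fixed-size speedup is about 91 while the scaled speedup is about 990, which is why growing the problem with the machine looks so much more favourable.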
Amdahl’s Law - maximum speedup with Gustafson’s appetite
potential 1,000 X speedup with 0.1% sequential code
The importance of the network
Communication Overhead adds to the inherent sequential part of the program to limit the Amdahl speedup
Latency – the round-trip time (RTT) to communicate between two processors
communications overhead: $c = \text{latency} + \text{data\_transfer\_time}$
$$\text{Speedup} = \frac{s + p}{s + c + p/N}$$
For fine grained parallel programs the problem is latency, not bandwidth
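To see why latency dominates for small messages, the overhead term can be worked through numerically. The sketch below assumes all times are in seconds and that the data transfer time is simply message size divided by bandwidth; the function names and the sample figures in the note are illustrative assumptions, not measurements.

```python
def comm_overhead(latency_s, message_bytes, bandwidth_bytes_per_s):
    """c = latency + data transfer time, all in seconds."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


def speedup_with_overhead(s, p, c, n):
    """Amdahl speedup with communication overhead c added to the sequential part."""
    return (s + p) / (s + c + p / n)
```

For example, with 100 microseconds of latency, a 1000-byte message and 100 MB/s of bandwidth, `comm_overhead(100e-6, 1000, 100e6)` is about 110 microseconds, of which the latency accounts for roughly 90%. Doubling the bandwidth would barely change the result.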
HTC – is appropriate when the problem can be decomposed into many (very many) smaller problems that are essentially independent
Build a profile of all MasterCard customers who purchased an airline ticket and rented a car in August
Analyse the purchase patterns of Walmart customers in the LA area last month
Generate 10^6 CMS events
Web surfing, Web searching
Database queries
HPC – problems that are hard to parallelise – single processor performance is important
HTC – problems that are easy to parallelise – can be adapted to very large numbers of processors
HTC - HPC
High Throughput
Granularity can be selected to fit the environment
Load balancing easy
Mixing workloads is easy
Sustained throughput is the key goal
the order in which the individual tasks execute is (usually) not important
if some equipment goes down the work can be re-run later
easy to re-schedule dynamically the workload to different configurations
High Performance
Granularity largely defined by the algorithm, limitations in the hardware
Load balancing difficult
Hard to schedule different workloads
Reliability is all important
if one part fails the calculation stops (maybe even aborts!)
check-pointing essential – all the processes must be restarted from the same synchronisation point
hard to dynamically reconfigure for smaller number of processors
Distributed Computing
Distributed Computing
Local distributed systems
Clusters
Parallel computers (IBM SP)
Geographically distributed systems
Computational Grids
HPC – as we have seen
Needs low latency AND good communication bandwidth
HTC distributed systems
The bandwidth is important, the latency is less significant
If latency is poor more processes can be run in parallel to cover the waiting time
Shared Data
If the granularity is coarse enough – the different parts of the problem can be synchronised simply by sharing data
Example – event reconstruction
all of the events to be reconstructed are stored in a large data store
processes (jobs) read successive raw events, generating processed event records, until there are no raw events left
the result is the concatenation of the processed events (and folding together some histogram data)
synchronisation overhead can be minimised by partitioning the input and output data
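The event reconstruction example can be sketched as follows. This is a toy illustration, not the actual reconstruction software: the event layout (`id`, `hits`), the `reconstruct` stand-in and the histogram binning are all invented here, and threads stand in for what would be separate jobs on separate machines.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(events, n_jobs):
    """Split the event list into n_jobs contiguous chunks of near-equal size."""
    size, extra = divmod(len(events), n_jobs)
    chunks, start = [], 0
    for i in range(n_jobs):
        end = start + size + (1 if i < extra else 0)
        chunks.append(events[start:end])
        start = end
    return chunks


def reconstruct(raw_event):
    """Stand-in for real reconstruction: returns a processed event record."""
    return {"id": raw_event["id"], "energy": sum(raw_event["hits"])}


def run_job(chunk):
    """One job: process its own input partition and fill its own histogram."""
    records = [reconstruct(event) for event in chunk]
    histogram = {}
    for record in records:
        bin_index = record["energy"] // 10
        histogram[bin_index] = histogram.get(bin_index, 0) + 1
    return records, histogram


def run_all(events, n_jobs=4):
    """Run the jobs, then concatenate the records and fold the histograms together."""
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        results = list(pool.map(run_job, partition(events, n_jobs)))
    all_records, total_histogram = [], {}
    for records, histogram in results:
        all_records.extend(records)
        for bin_index, count in histogram.items():
            total_histogram[bin_index] = total_histogram.get(bin_index, 0) + count
    return all_records, total_histogram
```

Because each job reads only its own partition and writes only its own output, the jobs never need to talk to each other; the only shared step is the final merge.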
Data Sharing - Files
Global file namespace
maps universal name to network node, local name
Remote data access
Caching strategies
Local or intermediate caching
Replication
Migration
Access control, authentication issues
Locking issues
NFS
AFS
Web folders
Highly scalable for read-only data
Data Sharing – Databases, Objects
File sharing is probably the simplest paradigm for building distributed systems
Database and object sharing look the same
But –
Files are universal, fundamental systems concepts – standard interfaces, functionality
Databases are not yet fundamental, built-in concepts, but there are only a few standards
Objects even less so – still at the application level – so harder to implement efficient and universal caching, remote access, etc.
Client-server
Examples
Web browsing
Online banking
Order entry
…… ..
The functionality is divided between the two parts – for example
exploit locality of data (e.g. perform searches, transformations on node where data resides)
exploit different hardware capabilities (e.g. central supercomputer, graphics workstation)
security concerns – restrict sensitive data to defined geographical locations (e.g. account queries)
reliability concerns (e.g. perform database updates on highly reliable servers)
Usually the server implements pre-defined, standardised functions
[Diagram: client sends request to server; server returns response]
3-Tier client-server
[Diagram: groups of clients connected to intermediate servers, which connect to a database server]
data extracts replicated on intermediate servers
changes batched for asynchronous treatment by database server
Enables -
scaling up client query capacity
isolation of main database
Peer-to-Peer - P2P
Peer-to-Peer decentralisation of function and control
Taking advantage of the computational resources at the edge of the network
The functions are shared between the distributed parts – without central control
Programs cooperate without being designed as a single application
So P2P is just a democratic form of parallel programming -
SETI
The parallel HPC problems we have looked at, using MPI
All the buzz of P2P is because new interfaces promise to bring this to the commercial world; allow different communities, businesses to collaborate through the internet
XML
SOAP
.NET
JXTA
Simple Object Access Protocol - SOAP
SOAP – simple, lightweight mechanism for exchanging objects between peers in a distributed environment using XML carried over HTTP
SOAP consists of three parts:
The SOAP envelope - what is in a message; who should deal with it, and whether it is optional or mandatory
The SOAP encoding rules - serialisation definition for exchanging instances of application-defined datatypes.
The SOAP Remote Procedure Call representation
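To make the envelope and RPC representation more concrete, the sketch below builds a minimal SOAP 1.1 request with Python's standard XML library. The envelope namespace is the SOAP 1.1 one; the application namespace `urn:example:events`, the method name and the parameter are hypothetical, and sending the message over HTTP is left out.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
APP_NS = "urn:example:events"  # hypothetical application namespace


def build_request(method, params):
    """Return a SOAP envelope (as text) carrying one remote procedure call."""
    ET.register_namespace("soap", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    # The call is an element named after the method; parameters are child elements.
    call = ET.SubElement(body, f"{{{APP_NS}}}{method}")
    for name, value in params.items():
        ET.SubElement(call, name).text = str(value)
    return ET.tostring(envelope, encoding="unicode")
```

`build_request("GetRunSummary", {"run": 1234})` yields an `Envelope` containing a `Body` with a single `GetRunSummary` element. A real service would also define the encoding of each parameter type, which this sketch reduces to plain strings.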
Microsoft’s .NET
.NET is a framework, or environment for building, deploying and running Web services and other internet applications
Common Language Runtime - C++, C#, Visual Basic and JScript
End of Part 1
Tomorrow:
HEP applications
Offline computing for LHC
Extending HTC to the Grid
HEP Applications
Data Handling and Computation for Physics Analysis
[Diagram labels: detector; event filter (selection & reconstruction); raw data; event reprocessing; event simulation; processed data; event summary data; batch physics analysis; analysis objects (extracted by physics topic); interactive physics analysis]
HEP Computing Characteristics
Large numbers of independent events - trivial parallelism – “job” granularity
Modest floating point requirement - SPECint performance
Large data sets - smallish records, mostly read-only
Modest I/O rates - few MB/sec per fast processor
Simulation
cpu-intensive
mostly static input data
very low output data rate
Reconstruction
very modest I/O
easy to partition input data
easy to collect output data
Analysis
ESD analysis
modest I/O rates
read only ESD
BUT
Very large input database
Chaotic workload –
unpredictable, no limit to the requirements
AOD analysis
potentially very high I/O rates
but modest database
HEP Computing Characteristics
Large numbers of independent events - trivial parallelism – “job” granularity
Large data sets - smallish records, mostly read-only
Modest I/O rates - few MB/sec per fast processor
Modest floating point requirement - SPECint performance
Chaotic workload –
research environment unpredictable, no limit to the requirements
Very large aggregate requirements – computation, data
Scaling up is not just big – it is also complex
… and once you exceed the capabilities of a single geographical installation ………?
Task Farming
Task Farming
Decompose the data into large independent chunks
Assign one task (or job) to each chunk
Put all the tasks in a queue for a scheduler which manages a large “farm” of processors, each of which has access to all of the data
The scheduler runs one or more jobs on each processor
When a job finishes the next job in the queue is started
Until all the jobs have been run
Collect the output files
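The steps above can be sketched with a shared queue and a handful of workers. This is a minimal illustration, with threads standing in for the processors of the farm; `run_farm` and its arguments are names invented for the sketch, and tasks are assumed to be hashable so they can key the result table.

```python
import queue
import threading


def run_farm(tasks, n_workers, work):
    """Run every task through `work` using n_workers workers fed from one queue."""
    todo = queue.Queue()
    for task in tasks:
        todo.put(task)  # all tasks are queued before any worker starts

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                task = todo.get_nowait()  # when a job finishes, take the next one
            except queue.Empty:
                return  # queue drained: this worker is finished
            output = work(task)
            with lock:
                results[task] = output  # collect the output

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Load balancing falls out of the design: a worker that draws short tasks simply comes back to the queue more often, so no processor sits idle while work remains.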
Task Farming
Task farming is good for
a very large problem
Which has
selectable granularity
largely independent tasks
loosely shared data
HEP –
-- Simulation
-- Reconstruction
-- and much of the Analysis
The SHIFT Software Model (1990)
From the application’s viewpoint – this is simply file sharing – all data available to all processes
standard APIs – disk I/O; mass storage; job scheduler
can be implemented over an IP network
mass storage model – tape data cached on disk (stager)
physical implementation - transparent to the application/user
scalable, heterogeneous
flexible evolution – scalable capacity; multiple platforms; seamless integration of new technologies
[Diagram labels: disk servers; application servers; stage (migration) servers; tape servers; queue servers; IP network]
Current Implementation of SHIFT
[Diagram labels: racks of dual-cpu Linux PCs; Linux PC controllers; IDE disks; Linux PC controllers; Robots – STK Powderhorn; Drives - STK 9840, STK 9940, IBM 3590; Ethernet 100BaseT, Gigabit; mass storage; application servers; WAN; data cache]
Fermilab Reconstruction Farms
1991 – farms of RISC workstations introduced for reconstruction
replaced special purpose processors (emulators, ACP)
Ethernet network
Integrated with tape systems
cps – job scheduler, event manager
Condor – a hunter of unused cycles
The hunter of idle workstations (1986)
ClassAd Matchmaking
users advertise their requirements
systems advertise their capabilities & constraints
Directed Acyclic Graph Manager – DAGman
define dependencies between jobs
Checkpoint – reschedule – restart
if the owner of the workstation returns
or if there is some failure
Share data through files
global shared files
Condor file system calls
Flocking
interconnecting pools of Condor-content workstations
http://www.cs.wisc.edu/condor/
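The matchmaking idea can be illustrated with a toy model in which both sides carry attributes and a requirements test. This is not the ClassAd language and not Condor's algorithm (which also ranks candidate matches); the attribute names and the greedy pairing are assumptions made for the sketch.

```python
def matches(job, machine):
    """True when each side satisfies the other's requirements."""
    return job["requirements"](machine) and machine["requirements"](job)


def matchmake(jobs, machines):
    """Greedy pairing: each job takes the first free machine that matches."""
    free = list(machines)
    assignments = {}
    for job in jobs:
        for machine in free:
            if matches(job, machine):
                assignments[job["name"]] = machine["name"]
                free.remove(machine)
                break
    return assignments


# Hypothetical advertisements from two machines and two jobs.
machines = [
    {"name": "desk1", "memory_mb": 256, "os": "LINUX",
     "requirements": lambda job: job["owner"] != "guest"},
    {"name": "node7", "memory_mb": 1024, "os": "LINUX",
     "requirements": lambda job: True},
]
jobs = [
    {"name": "sim-1", "owner": "alice",
     "requirements": lambda m: m["os"] == "LINUX" and m["memory_mb"] >= 512},
    {"name": "reco-1", "owner": "guest",
     "requirements": lambda m: m["memory_mb"] >= 128},
]
```

Here `matchmake(jobs, machines)` pairs `sim-1` with `node7`, while `reco-1` stays unmatched because the only machine left refuses jobs from that owner. The point is that constraints are checked in both directions.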
Layout of the Condor Pool
[Diagram: Central Manager (master, collector, negotiator, schedd, startd); Desktops (schedd, startd, master); Cluster Nodes (master, startd). Legend: ClassAd communication pathway; process spawned. Source: http://www.cs.wisc.edu/condor]
How Flocking Works
Add a line to your condor_config:
FLOCK_HOSTS = Pool-Foo, Pool-Bar
[Diagram: Submit Machine (Schedd); Central Manager (CONDOR_HOST) with Collector and Negotiator; Pool-Foo Central Manager and Pool-Bar Central Manager, each with Collector and Negotiator. Source: http://www.cs.wisc.edu/condor]
[Diagram: 600 Condor jobs; Home Condor Pool; Friendly Condor Pool. Source: http://www.cs.wisc.edu/condor]
Finer grained HTC
The food chain in reverse –
-- The PC has consumed the market for larger computers, destroying the species
-- There is no choice but to harness the PCs
Berkeley - Networks of Workstations (1994)
Single system view
Shared resources
Virtual machine
Single address space
Global Layer Unix – GLUnix
Serverless Network File Service – xFS
Research project
A Case for Networks of Workstations: NOW, IEEE Micro, Feb. 1995, Thomas E. Anderson, David E. Culler, David A. Patterson - http://now.cs.berkeley.edu
Beowulf
NASA Goddard (Thomas Sterling, Donald Becker) - 1994
16 Intel PCs – Ethernet - Linux
Caltech/JPL, Los Alamos
Parallel applications from the Supercomputing community
Oak Ridge – 1996 – The Stone SouperComputer
problem – generate an eco-region map of the US, 1 km grid
64-way PC cluster proposal rejected
re-cycle rejected desktop systems
The experience, emphasis on do-it-yourself, packaging of some of the tools, and probably the name – stimulated wide-spread adoption of clusters in the super-computing world
Parallel ROOT Facility - Proof
ROOT object oriented analysis tool
Queries are performed in parallel on an arbitrary number of processors
Load balancing:
Slaves receive work from Master process in “packets”
Packet size is adapted to current load, number of slaves, etc.
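One way such packet-based load balancing could work is sketched below. It is a simulation under stated assumptions, not PROOF's actual algorithm: each slave is assumed to have a known processing rate in events per second, and the master sizes each packet so that it takes about `target_seconds` on the slave that asks for it.

```python
def next_packet_size(events_left, slave_rate, target_seconds=1.0, min_size=1):
    """Events to hand to a slave so the packet takes about target_seconds."""
    size = max(min_size, int(slave_rate * target_seconds))
    return min(size, events_left)


def distribute(total_events, slave_rates):
    """Simulate the master handing out packets until no events remain."""
    remaining = total_events
    busy_until = {name: 0.0 for name in slave_rates}
    processed = {name: 0 for name in slave_rates}
    while remaining > 0:
        # The slave that becomes idle first asks for the next packet.
        name = min(busy_until, key=busy_until.get)
        size = next_packet_size(remaining, slave_rates[name])
        busy_until[name] += size / slave_rates[name]
        processed[name] += size
        remaining -= size
    return processed, max(busy_until.values())
```

With 300 events and two slaves running at 100 and 50 events per second, the fast slave ends up with 200 events and the slow one with 100, and both finish after 2 seconds. The work follows the capacity without any up-front partitioning.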
LHC Computing
CERN's Users in the World
Europe: 267 institutes, 4603 users
Elsewhere: 208 institutes, 1632 users
The Large Hadron Collider Project
4 detectors (diagram labels: CMS, ATLAS, LHCb)
Storage –
Raw recording rate 0.1 – 1 GBytes/sec
Accumulating at 5-8 PetaBytes/year
10 PetaBytes of disk
Processing –
200,000 of today’s fastest PCs
Worldwide distributed computing system
Small fraction of the analysis at CERN
ESD analysis – using 12-20 large regional centres
how to use the resources efficiently
establishing and maintaining a uniform physics environment
Data exchange – with tens of smaller regional centres, universities, labs
Planned capacity evolution at CERN
[Charts: Mass Storage; Disk; CPU. Series: LHC; Other experiments. Reference: Moore’s law]
Are Grids a solution?
The Grid – Ian Foster, Carl Kesselman – The Globus Project
“Dependable, consistent, pervasive access to [high-end] resources”
Dependable:
provides performance and functionality guarantees
Consistent:
uniform interfaces to a wide variety of resources
Pervasive:
ability to “plug in” from anywhere
The Grid
The GRID: ubiquitous access to computation, in the sense that the WEB provides ubiquitous access to information
Globus Architecture - www.globus.org
Applications
High-level Services and Tools: DUROC, globusrun, MPI, Nimrod/G, MPI-IO, CC++, GlobusView, Testbed Status
Core Services: Metacomputing Directory Service, GRAM, Globus Security Interface, Heartbeat Monitor, Nexus, Gloperf, GASS
Local Services: LSF, Condor, MPI, NQE, Easy, TCP, Solaris, Irix, AIX, UDP
middleware:
Uniform application program interface to grid resources
Grid infrastructure primitives
Mapped to local implementations, architectures, policies
The nodes of the Grid
are managed by different people
so have different access and usage policies
and may have different architectures
The geographical distribution
means that there cannot be a central status
status information and resource availability is “published” (remember Condor Classified Ads)
Grid schedulers can only have an approximate view of resources
The Grid Middleware tries to present this as a coherent virtual computing centre
Core Services
Security
Information Service
Resource Management – Grid scheduler, standard resource allocation
Remote Data Access – global namespace, caching, replication
Performance and Status Monitoring
Fault detection
Error Recovery Management
The Promise of Grid Technology
What does the Grid do for you?
you submit your work
and the Grid
Finds convenient places for it to be run
Optimises use of the widely dispersed resources
Organises efficient access to your data
Caching, migration, replication
Deals with authentication to the different sites that you will be using
Interfaces to local site resource allocation mechanisms, policies
Runs your jobs
Monitors progress
Recovers from problems
.. and .. Tells you when your work is complete
LHC Computing Model 2001 - evolving
The LHC Computing Centre
The opportunity of Grid technology
[Diagram labels: CMS, ATLAS, LHCb; CERN Tier 0 centre at CERN; Tier 1 – CERN, Germany, USA, UK, France, Italy, …; Tier2 – regional group, Lab a, Uni a, Lab c, Uni n, Lab m, Lab b, Uni b, Uni y, Uni x; Tier3 – physics department, physics group; Desktop]