Performance and scalability prediction in HPC systems

Emilio Luque
Alvaro Wong, Dolores Rexachs, Javier Panadero
Computer Architecture and Operating Systems (CAOS) Department
University Autonoma of Barcelona (UAB)

PhD Students Collaborations
 Diego Montezanti (UNLP)
 Silvana Lis Gallo (UNLP)
 Diego Encinas (UNLP)
Postdoc Researchers / External Collaborations
Dr. Francisco Borges
Dr. Eduardo C. Cabrera
Dr. Marcela Castro
Dr. Joe Carrión
Dr. Leonardo Fialho
Dr. Adriana Gaudiani (UNGS)
Dr. Joao Gramacho
Dr. Cecilia Jaramillo
Dr. Zhengchun Liu
Dr. Sandra Méndez
Dr. Hugo Meyer
Dr. Ronal Muresano
Dr. Javier Panadero
Dr. Cristian Tissera (UNSL)
Dr. Javier Balladini
Staff Members (UAB)
Dr. Emilio Luque (Full Professor)
Dr. Dolores Rexáchs (Associate Professor)
Dr. Remo Suppi (Associate Professor)
Dr. Daniel Franco (Associate Professor)
Dr. Elisa Heymann (Associate Professor)
Dr. Francisco Epelde (MD-Tauli Hospital)
High Performance Computing for Efficient Applications and Simulation
Postdoc Researchers (UAB)
Dr. Álvaro Wong
Dr. Manel Taboada
Dr. Eva Bruballa
PhD Students (UAB)
Laura Espínola
 Mohammed Ghazzawi
Pilar Gómez
Carlos Rangel
 Elham Shojaei
Ghazal Tashakor
 Jorge Villamayor
 Betzabeth León

Focus
[Diagram: scope of the work. Scientific applications: sequential and parallel (shared memory, message passing, hybrid). HPC platforms: supercomputers and clusters. Performance evaluation.]

Performance evaluation
Evaluating the performance of a parallel application is increasingly complex.
• Selecting resources
• Sizing the system

Models for performance evaluation
• Measurement models: techniques that involve monitoring the system while it is subjected to a particular workload.
– The application itself
– Benchmarks (NAS Parallel Benchmarks, Linpack, SPEC2006)
Workload      Pros                           Cons
Application   Accurate execution time        Time-consuming; complex
Benchmarks    Bounded time                   Selecting the most suitable one
• Statistical mathematical models: based on mathematical representations of computer systems (SimPoint).
• Simulation models: a model of the system behavior is built and reproduced with an appropriate abstraction of the workload (Mambo, Dimemas, COTSon).

Characteristics of scientific applications
• Repetitive behavior
• Static
• Dynamic (computation and communication, i.e. messages)

Behavior of the scientific application
[Figure: an application execution of 3000 s is made of repetitions of Phase A and Phase B. Executing each phase once takes 1% to 5% of the application execution time; the 3000 s are rebuilt from Time A + Time B.]

Methodology
Characterize the behavior of message-passing parallel scientific applications by extracting their signature (intrinsic / program-independent).
Objective: the application signature
[Diagram: application execution time compared with signature execution time; executing the signature gives the prediction of the application execution time.]

Building the signature
[Figure: three timelines (time →) showing phases A and B and the checkpoints.]

Performance Prediction: PAS2P
PAS2P Methodology
[Diagram: the parallel application (executable code) is instrumented and monitored on Cluster A and analyzed in four steps:]
1) Data collection.
2) Parallel application model.
3) Pattern identification.
4) Extraction of phases and weights.
[The result is the Parallel Application Signature (binary phases + coordinated checkpoints) together with the weights of the phases. The signature is then executed on Clusters B, C and D; on each cluster, the time of each phase multiplied by its weight gives the prediction for that cluster.]

Data Collection
To collect data, we instrument the application to produce a log trace, from which we characterize its communication and computation behavior.
Starting from the concept of the Basic Block (BB), a sequence of code with one entry and one exit, we extend this concept to parallel applications and define the following terms:
• Event: The sending or receiving of a message during the life of a process.
• Extended Basic Block (EBB): A segment of a process whose beginning and end are defined by occurrences of messages. Equivalently, it is a "computational time" segment bounded by communication actions.
Event structure
• Id: Event identifier.
• Type: +K for a send or -K for a receive, where K is the number of the other process involved.
• Size: The communication volume of the message being transmitted (bytes).
• Msg_id: Links the send and receive events of the same message.
• Number of event: The sequence number of the event within its process.
• Logical Time: Time determined by the order of precedence in communications.
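A minimal sketch of how one record of the log trace could be represented, using the six fields listed above. The class name `TraceEvent`, the field names and the sample values are illustrative assumptions, not part of PAS2P. The sign convention (+K send, -K receive) follows the slide and assumes processes are numbered from 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceEvent:
    event_id: int      # Id: event identifier
    comm_type: int     # Type: +K send to process K, -K receive from process K
    size_bytes: int    # Size: communication volume in bytes
    msg_id: int        # Msg_id: shared by the send and the receive of one message
    event_number: int  # Number of event: position of the event within its process
    logical_time: int  # Logical Time: derived from communication precedence

    @property
    def is_send(self) -> bool:
        return self.comm_type > 0

    @property
    def peer(self) -> int:
        # Process at the other end (assumes process numbers start at 1).
        return abs(self.comm_type)


# Hypothetical pair of records for one 4000-byte message from process 1 to 2.
send = TraceEvent(event_id=1, comm_type=+2, size_bytes=4000,
                  msg_id=10, event_number=1, logical_time=0)
recv = TraceEvent(event_id=7, comm_type=-1, size_bytes=4000,
                  msg_id=10, event_number=1, logical_time=1)
```

The shared `msg_id` is what allows the send and the receive of the same message to be paired later, when logical times are assigned.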

Some new concepts
• Event: the sending or receiving of a message.
• Extended Basic Block (EBB): a segment of a process whose beginning and end are delimited by occurrences of messages.
• Parallel Basic Block (PBB): a set of Extended Basic Blocks delimited by two ticks.
– The first tick is defined as the entry point, where at least one event occurs.
– The second tick is defined as the exit point, where at least one event occurs.
Process   Entry point                 Computation   Exit point
          SEND/RECV   Volume (KB)     (ms)          SEND/RECV   Volume (KB)
P1        K           M               C             K           M
P2        K           M               C             K           M
P3        K           M               C             K           M
P4        K           M               C             K           M
Event structure
• Communication type: +/-K or 0
• Communication volume: M
• Computation time: C

[Figure: timelines of processes P1 to P4 over time; the segments between communication events are the Extended Basic Blocks (computational time).]
Parallel applications require synchronization between computing nodes, which sequential applications do not. To account for it, we move from physical to logical traces.

Parallel Application Model
Algorithm to assign Logical Time (using Lamport's algorithm)
We created a logical clock based on the order of precedence in communications between processes, as defined by Lamport. Lamport treated the sending or receiving of a message as an event in a process and defined the "happened before" relation, denoted "a → b". The relation "→" on the set of events of a system is the smallest relation satisfying the following conditions:
1. If "a" and "b" are events in the same process, and "a" comes before "b", then a → b.
2. If "a" is the sending of a message by one process and "b" is the receipt of the same message by another process, then a → b.
3. If a → b and b → c, then a → c.
Now we introduce two new concepts:
• Tick: Logical time unit.
• Parallel Basic Block (PBB): The set of Extended Basic Blocks that start at the same tick and occur between two consecutive events of the same process.
A phase is defined as a sub-chain of grouped PBBs that repeats along the execution.
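The precedence rules can be turned into logical timestamps with a standard Lamport clock. The sketch below is an illustration, not the PAS2P implementation: the dictionary keys (`proc`, `kind`, `msg_id`, `ptime`) and the sample trace are hypothetical, and it assumes the events can be visited in an order in which every send comes before its receive (here, sorted by a comparable physical timestamp).

```python
def lamport_times(events):
    """Attach a Lamport logical time ('lt') to every event of a trace."""
    clock = {}    # last logical time seen by each process
    send_lt = {}  # logical time of the send of each message
    for e in sorted(events, key=lambda ev: ev["ptime"]):
        lt = clock.get(e["proc"], 0) + 1          # rule 1: order inside a process
        if e["kind"] == "recv":
            lt = max(lt, send_lt[e["msg_id"]] + 1)  # rule 2: send before receive
        else:
            send_lt[e["msg_id"]] = lt
        clock[e["proc"]] = lt
        e["lt"] = lt
    return events


# Hypothetical trace: P1 sends two messages, P2 receives them out of order.
trace = [
    {"proc": 1, "kind": "send", "msg_id": 1, "ptime": 1.0},
    {"proc": 1, "kind": "send", "msg_id": 2, "ptime": 2.0},
    {"proc": 2, "kind": "recv", "msg_id": 2, "ptime": 3.0},
    {"proc": 2, "kind": "recv", "msg_id": 1, "ptime": 4.0},
]
logical = [e["lt"] for e in lamport_times(trace)]
```

Note that the receive of message 1 ends up several logical times after its send; this freedom in where a reception lands is what the deck later identifies as a source of extra phases.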

Logical ordering using Lamport's algorithm
To create an ordering of all events, we use the concept of logical clock defined by Lamport, in which a → b when:
1. a and b are events of the same process, and a occurs before b according to the physical clock of the process; or
2. a is the sending of a message and b is the reception of the same message.
[Figure: physical trace of processes P1 to P4 over time t, with Send (S) and Recv (R) events and the logical times assigned to them.]

From physical trace to logical trace
We now introduce a new concept:
• Tick: logical time unit.
[Figure: the physical trace of processes P1 to P4 (Send and Recv events over time t) and the corresponding logical trace, in which the events are placed on ticks 0 to 6.]

Building the signature and prediction
Phases: sub-chains of Parallel Basic Blocks that repeat along the execution.
[Figure: processes P1 to P4 over the blocks PBB1 to PBB12; one block is relabeled PBB1 because it repeats the first one, so the phase sequence reads 1 2 3 4 5 6 7 1 8 9 10 11 and the distinct phases are 1 to 11.]
[Diagram: the parallel application signature (phases + coordinated checkpoints, weights) is executed on cluster B; the execution time of each phase multiplied by its weight gives the predicted execution time.]

Experimental PAS2P results using the Lamport implementation
Program   Processes   Predicted Execution Time (s)   Application Execution Time (s)   Prediction Execution Time Error (%)
CG        8           205.221                        208.146                          1.4
BT        9           707.979                        710.207                          0.32
SP        9           1579.497                       1580.105                         0.04
Sweep3D   8           254.872                        256.536                          0.65
POP       8           948.454                        992.565                          4.45
CG        64          752.320                        1199.390                         36.2
BT        64          963.292                        1066.000                         9.64
SP        64          162.949                        400.558                          59.32
Sweep3D   64          586.580                        724.470                          19.04
POP       32          1310.540                       1322.629                         0.92
When we increase the number of processes, the quality of the prediction falls: processes become more independent, and non-deterministic receptions may arrive at any logical time, which generates a greater number of phases.

Application phases
When the number of processes was increased, we found that the prediction error grew.
The problem is not that we cannot predict, but that we failed to create a signature, whose purpose is to reduce the execution time with respect to that of the application.
[Figure labels: 55%, 100%]
Something was going wrong... but what was happening?

Machine-independent application model
To solve this, we have to move from multiple physical, local clocks to a
single logical, global clock.
[Figure: Lamport algorithm implementation for processes P1 and P2, with Send (S) and Recv (R) events placed at logical times LT=0, LT=1 and LT=2.]

To solve the problem of non-deterministic events (receptions), we introduced some modifications to Lamport's algorithm, defining a new logical ordering in which, if a process sends a message at logical time LT, its reception is forced to arrive at LT + 1 and never later.
[Flowchart of the modified algorithm. Boxes:]
• The queue starts with the first event of each process.
• Empty queue? Yes: End. No: continue.
• The first event (E) is taken off the queue.
• E is Recv? (Yes / No)
• E', the event with the highest LT and a smaller physical time than E, is taken from the same process as E.
• E' is Send? Yes: E_LT = E'_LT + 1. No: E_LT = E'_LT.
• The corresponding reception (E_recv) is searched for and assigned E_recv_LT = E_LT + 1.
• The next event is inserted in the queue, only if it is from the same process.
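One possible reading of the modified ordering, as a hedged sketch rather than the authors' implementation. It keeps the two rules that the flowchart states explicitly: a send takes its logical time from the earlier event E' of the same process with the highest LT (LT + 1 if E' is a send, the same LT otherwise), and the matching receive is forced to send LT + 1. The `Event` class, its field names and the visiting order (sends sorted by a globally comparable physical time, instead of the per-process queue) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    proc: int      # process that owns the event
    kind: str      # "send" or "recv"
    msg_id: int    # pairs a send with its receive
    ptime: float   # physical timestamp (assumed globally comparable)
    lt: Optional[int] = None


def assign_logical_time(events):
    """Place every receive exactly one logical time after its send."""
    recv_of = {e.msg_id: e for e in events if e.kind == "recv"}
    for e in sorted(events, key=lambda ev: ev.ptime):
        if e.kind != "send":
            continue  # receives get their LT from the matching send
        # Earlier events of the same process that already have an LT.
        earlier = [p for p in events
                   if p.proc == e.proc and p.ptime < e.ptime and p.lt is not None]
        if not earlier:
            e.lt = 0
        else:
            prev = max(earlier, key=lambda p: (p.lt, p.ptime))  # E'
            e.lt = prev.lt + 1 if prev.kind == "send" else prev.lt
        recv_of[e.msg_id].lt = e.lt + 1  # reception forced to LT + 1
    return events


# Hypothetical trace with three messages among three processes.
trace = assign_logical_time([
    Event(1, "send", 1, 1.0), Event(1, "send", 2, 2.0),
    Event(2, "recv", 1, 3.0), Event(2, "send", 3, 4.0),
    Event(1, "recv", 3, 5.0), Event(3, "recv", 2, 6.0),
])
```

With this rule a receive and the send that follows it in the same process can share a logical time, which is consistent with the "0s 1r 1s 2r 2s 3r" labels that appear on the logical trace later in the deck.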

Modified algorithm to assign Logical Time
[Figure: physical trace of processes P1 to P4 over time, with Send and Recv events numbered by event ID (1 to 23) and annotated with the logical time (0 to 3) assigned to each event.]
Evolution of the queue in the example:
Queue: 1, 7, 12, 18
Queue: 7, 12, 18, 2
Queue: 12, 18, 2, 8
Queue: 18, 2, 8, 13
Queue: 2, 8, 13, 19
Queue: 8, 13, 19, 3
Queue: 13, 19, 3, 9
Queue: 19, 3, 9, 14
Queue: 3, 9, 14, 20
Queue: 9, 14, 20
Queue: 14, 20
Queue: 20
Queue: empty

[Figure: the physical trace of processes P1 to P4 (events labeled by event ID over physical time) and the resulting logical trace, in which the same events are placed at the logical times 0s, 1r, 1s, 2r, 2s and 3r.]

Parallel Application model
Tick: A logical time unit.

Parallel Basic Blocks
Once each event has been placed, we subdivide the logical trace into more logical times so that there is at most one event per process in each logical time.
[Figure: logical trace of processes P1 to P4 over LT 0 to 21, divided into PBB1 to PBB21.]

Pattern identification (similarity)
Two strategies: (1) similarity between Parallel Basic Blocks; (2) similarity between phases.
• The first strategy searches for similarity between Parallel Basic Blocks:
– Similar Parallel Basic Blocks are located and renamed with the identifier of their first occurrence.
– The phases are identified from the resulting sequence of distinct Parallel Basic Blocks.
• The second strategy searches for similarity between phases:
– Phases are created directly from the logical trace.
– Similarity is then searched for between the phases.

Pattern identification
To find the repetitive behavior of an application, we compare the behavior of each PBB and look for similarities. Similarity between two Parallel Basic Blocks is based on the three main components of their structure:
1. Communication pattern: each assigned value of the entry points and of the exit points must be the same. The tool compares the communication patterns, which are the values stored in the event.
2. Communication volume: each value of the entry and exit points must be similar; a difference of up to 5% is accepted.
3. Computational time: each computational time may differ by up to 5%.
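The three criteria can be sketched as a comparison function. This is an illustration under stated assumptions, not the PAS2P tool: the names `ProcessBlock`, `within` and `similar_pbb` are hypothetical, a PBB is modeled as one record per process, and the 5% difference is measured relative to the larger of the two values (the slides do not say which value is the reference).

```python
from typing import NamedTuple, Sequence


class ProcessBlock(NamedTuple):
    entry_type: int      # +K send, -K receive, 0 none, at the entry point
    entry_volume: float  # communication volume at the entry point
    compute_time: float  # computational time between entry and exit
    exit_type: int       # same encoding, at the exit point
    exit_volume: float   # communication volume at the exit point


def within(a: float, b: float, tol: float = 0.05) -> bool:
    """True when a and b differ by at most tol relative to the larger one."""
    largest = max(abs(a), abs(b))
    return largest == 0 or abs(a - b) <= tol * largest


def similar_pbb(x: Sequence[ProcessBlock], y: Sequence[ProcessBlock]) -> bool:
    """Compare two PBBs process by process on the three criteria."""
    if len(x) != len(y):
        return False
    return all(
        p.entry_type == q.entry_type and p.exit_type == q.exit_type  # pattern
        and within(p.entry_volume, q.entry_volume)                    # volume
        and within(p.exit_volume, q.exit_volume)
        and within(p.compute_time, q.compute_time)                    # time
        for p, q in zip(x, y)
    )


# Hypothetical blocks for two processes.
a = [ProcessBlock(2, 100, 2500, -2, 100), ProcessBlock(1, 100, 2504, -1, 100)]
b = [ProcessBlock(2, 98, 2550, -2, 103), ProcessBlock(1, 101, 2480, -1, 99)]
c = [ProcessBlock(2, 100, 3000, -2, 100), ProcessBlock(1, 100, 2504, -1, 100)]
```

Here `a` and `b` would be treated as the same block, while `c` is rejected because the computation time of its first process differs by more than 5%.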

1 - Similarity between Parallel Basic Blocks
To identify the repetitive behavior of an application, we need to compare the behavior of each PBB.
We search for similarity between two Parallel Basic Blocks based on the three components of their structure:
1. Communication pattern
2. Communication volume
3. Computation time
PBB1
Entry point                 Computation   Exit point
SEND/RECV   Volume (KB)     (ms)          SEND/RECV   Volume (KB)
2           19              2500          -2          20
1           20              2504          -1          19
1           21              2612          0           0
1           19              2600          0           0
PBB2 (relabeled PBB1)
Entry point                 Computation   Exit point
SEND/RECV   Volume (KB)     (ms)          SEND/RECV   Volume (KB)
2           18              2512          -2          21
1           21              2501          -1          18
1           21              2602          0           0
1           18              2610          0           0

Similarity between Parallel Basic Blocks
[Figure: logical trace (LT 0 to 21) of processes P1 to P4, annotated with the communication type of each event. The structure of PBB1 is highlighted with its three components: communication type, communication volume and computation time.]

To search for phases, that is, sub-chains that repeat along the execution, we now look for PBBs whose behavior is similar to that of "PBB1". We find that "PBB12" behaves like "PBB1", so we rename it "PBB1", as it is the same Parallel Basic Block.
[Figure: logical trace of processes P1 to P4 over LT 0 to 21.]

Extract phases
We now identify and create the phases. A phase is defined as a sub-chain of grouped PBBs that repeats along the execution. In this simple example we can identify two phases:
[Figure: logical trace of processes P1 to P4 over LT 0 to 21, with two occurrences of Phase 1 and one occurrence of Phase 2.]

2 - Similarity between phases
This method tries to create phases that are as long as possible:
• A phase is extended until an event with the same communication type occurs again. Each time a phase grows by one tick, similarity criteria are used to check whether the phase already exists.
• Only the ticks of the logical trace where Send events occur are analyzed. Recv ticks are discarded, since their behavior is always determined by the ticks of the Sends.

Searching for the phases
[Figure: communication type per tick (ticks 0 to 10) for processes 1 to 4, with a start point (initialization point) followed by phase a and phase b.]
• The following are compared:
– The sizes of the phases (number of ticks), which must be equal.
– The events: two events are similar if they have the same communication type and their communication volumes differ by at most 5%.
• A phase is similar if the number of similar events is greater than or equal to 80% of the total number of events that make up the phase.
If it is similar, the weight of the phase increases.
If it is not similar, it is stored as a new phase.
Similarity
[Example, communication type per tick, with the ticks grouped into Phase 1, Phase 2 (three occurrences) and Phase 3:]
P1   0 2 3 4 2 3 4 2 3 4 2
P2   1 0 0 0 0 0 0 0 0 0 0
P3   1 0 0 0 0 0 0 0 0 0 0
P4   1 0 0 0 0 0 0 0 0 0 0
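The comparison rules for phases (equal size, events similar within 5%, at least 80% similar events) can be sketched as follows. The function names, the representation of a phase as a list of ticks holding `(communication type, volume)` pairs, and the initial weight of 1 are assumptions made for illustration. The sketch also skips the filtering of Recv ticks and the 5% difference is taken relative to the larger volume.

```python
def similar_events(e1, e2, tol=0.05):
    """Events are (comm_type, volume) pairs: same type, volumes within tol."""
    (type1, vol1), (type2, vol2) = e1, e2
    largest = max(abs(vol1), abs(vol2))
    return type1 == type2 and (largest == 0 or abs(vol1 - vol2) <= tol * largest)


def similar_phases(p1, p2, threshold=0.80):
    """Phases are lists of ticks; each tick lists one event per process."""
    if len(p1) != len(p2):  # the phases must span the same number of ticks
        return False
    pairs = [(a, b) for t1, t2 in zip(p1, p2) for a, b in zip(t1, t2)]
    if not pairs:
        return False
    matches = sum(similar_events(a, b) for a, b in pairs)
    return matches >= threshold * len(pairs)


def register_phase(phases, weights, candidate):
    """Increase the weight of a matching phase, or store the candidate as new."""
    for i, known in enumerate(phases):
        if similar_phases(known, candidate):
            weights[i] += 1
            return i
    phases.append(candidate)
    weights.append(1)  # assumption: a new phase starts with weight 1
    return len(phases) - 1


# Hypothetical phases of three ticks and two processes each.
a = [[(2, 100), (0, 0)], [(3, 100), (0, 0)], [(4, 100), (0, 0)]]
b = [[(2, 102), (0, 0)], [(3, 100), (0, 0)], [(4, 500), (0, 0)]]  # 5 of 6 match
c = [[(1, 100), (0, 0)], [(3, 100), (0, 0)], [(4, 500), (0, 0)]]  # 4 of 6 match
```

Registering `a`, then `b`, then `c` would leave two stored phases: the first with weight 2, because `b` is absorbed by `a`, and `c` stored as a new phase.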

The phases
[Figure: logical trace (LT 0 to 21) of processes P1 to P4 with Send and Recv events, divided into phases.]

Building the application signature
To build the signature, the application must be executed in order to create the checkpoints for each of the phases of the application signature.
[Diagram: executable → instrumentation and execution → creation of coordinated checkpoints → signature.]

Create a signature and predict execution time
To build the Parallel Application Signature, the last step is to re-run the application and take a coordinated checkpoint before each relevant phase begins.
Running the Parallel Application Signature means executing its constituent phases. For each phase, we restart from the state saved in its coordinated checkpoint and measure from the point where the phase begins until it ends. We repeat this for all constituent phases.
[Diagram: the Parallel Application Signature (binary phases + coordinated checkpoints) is executed on cluster B; the time of each phase multiplied by its weight gives the predicted execution time.]

The checkpoints
• The checkpoint must be created a number of instructions before the start of the phase, in order to "warm up" the components of the machine (cache, TLBs, etc.).
• We repeat this method for all the phases that make up the signature.
[Figure: physical-time trace of processes 1 to 4, labeled by event number. Event numbers 64, 65, 120 and 125 mark the start and end of phases X and Y; the checkpoint is placed before the start of the phase, followed by a warm-up interval.]

Prediction
Executing the application signature means executing its phases. This is achieved by restoring each of the checkpoints of the signature.
The execution time of each phase is measured from the point where the phase begins to the point where it ends, taking the time of the process that took longest.
[Figure: physical-time trace of processes P1 to P4, labeled by event number. Each of the phases X and Y is preceded by a warm-up interval (starting at events 60 and 116); the phase execution times T1 and T2 are measured between the start and end events of each phase (64 to 65 and 120 to 125).]

Predicted Execution Time (PET)
Once the execution times of the phases have been obtained, the Predicted Execution Time (PET) of the application is computed by multiplying the execution time of each phase (PhaseET) by its weight (W) and adding the results.
[Diagram: the application signature (phases + checkpoints + weights) is executed on cluster B; the execution time of each phase multiplied by its weight gives the predicted execution time.]
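As a worked example of the prediction step, the sketch below adds up phase time multiplied by weight. The function name is an illustrative assumption; the numbers are the three phases reported for CG Class B in the workload results of this deck (phase times in seconds and their weights), whose total there is 69.255 s.

```python
def predicted_execution_time(phases):
    """phases: iterable of (phase_execution_time_s, weight) pairs."""
    return sum(phase_time * weight for phase_time, weight in phases)


# (PhaseET in seconds, weight) for the three phases of CG, Class B.
cg_class_b = [(0.0008864, 3952), (0.0329661, 1976), (0.0003090, 1976)]
pet = predicted_execution_time(cg_class_b)
```

Because the signature only executes each phase once, the cost of obtaining `pet` is the sum of the phase times (about 0.034 s here) plus the checkpoint restarts, not the full application run.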

Experimental results
Scientific applications
• NAS Parallel Benchmarks
• SMG2000
• Sweep3D
• Parallel Ocean Program (POP)
Cluster A (used to construct the signature and run it to predict)
Characteristics: Dual-Core Intel(R) Xeon(R) CPU 5150 2.66GHz, 4MB L2 (2x2), 8 GB Fully Buffered DIMM 667 MHz, Gigabit Ethernet network, 128 cores.
Software: Linux 2.6x, OpenMPI 1.4.2, MPE 2-1.0.6p1, BLCR 0.8.2-1
Cluster B (used to run the signature and predict)
Characteristics: 2 x Quad-Core Intel(R) Xeon(R) E5430 2.66GHz processors, 2x6MB L2 cache, 16 GB RAM Fully Buffered DIMMs (FBD) 667MHz, Gigabit Ethernet network, 64 cores.
Software: Linux 2.6x, OpenMPI 1.4.2, MPE 2-1.0.6p1, BLCR 0.8.2-1
Prediction on Cluster A
SET: Signature Execution Time. AET: Application Execution Time. PET: Predicted Execution Time. PETE: Prediction Execution Time Error. The OLPAS2P-PBB columns use similarity between Parallel Basic Blocks; the OLPAS2P-Phases columns use similarity between phases.
Program   Cores   SET (s)   SET vs. AET (%)   PBB PET (s)   PBB PETE (%)   Phases PET (s)   Phases PETE (%)   AET (s)
CG        32      5.39      0.22              2316.370      4.00           2413.01          0.01              2412.70
CG        64      3.18      0.34              1137.710      5.15           1165.24          2.85              1199.39
BT        32      6.24      0.58              963.292       9.64           1055.04          1.02              1066.00
BT        64      5.02      0.83              567.123       6.43           597.96           1.08              604.47
SP        32      7.76      0.76              938.997       6.92           1004.12          0.45              1008.74
SP        64      3.50      0.78              424.363       5.20           441.42           1.38              447.61
SMG2000   32      11.58     1.94              579.400       2.72           581.29           2.40              595.55
SMG2000   64      6.15      3.23              186.519       1.90           187.20           1.54              190.12
Sweep3D   16      2.44      0.10              2232.658      1.45           2235.15          1.34              2265.34
Sweep3D   32      1.94      0.15              1252.520      0.62           1257.03          0.27              1260.32
POP       32      19.92     1.48              1319.583      1.79           1324.32          1.44              1343.55
POP       64      15.48     1.95              748.454       5.52           758.31           4.33              792.56
[Callouts: short time; quality.]
[Chart: number of phases obtained with the method of similarity between Parallel Basic Blocks and with the method of similarity between phases.]

Experimental results on Cluster A
Columns: Program; Processes/Cores; Signature Execution Time, SET (s); 100 × (SET/AET) (%); Predicted Execution Time, PET (s); Application Execution Time, AET (s); Prediction Execution Time Error, PETE (%)
CG 64/32 5.39 0.22 2316.37 2412.70 4.00
CG 64/64 3.18 0.34 1137.71 1199.39 5.15
BT 64/32 6.24 0.58 963.29 1066.00 9.64
BT 64/64 5.02 0.83 567.12 604.47 6.43
SP 64/32 7.76 0.76 938.99 1008.74 6.92
SP 64/64 3.50 0.78 424.36 447.61 5.20
SMG2000 64/32 11.58 1.94 579.40 595.55 2.72
SMG2000 64/64 6.15 3.23 186.51 190.12 1.90
Sweep3D 32/16 2.44 0.10 2232.65 2265.34 1.45
Sweep3D 32/32 1.94 0.15 1252.52 1260.32 0.62
POP 64/32 19.92 1.48 1319.58 1343.55 1.79
POP 64/64 15.48 1.95 748.45 792.56 5.52
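The error column can be reproduced from the PET and AET columns. The slides do not spell out the formula, so the sketch below assumes the relative error with respect to the measured application time, which agrees with the rows of the table to within rounding; the function name is hypothetical.

```python
def prediction_error_pct(pet, aet):
    """PETE in percent, assuming PETE = |AET - PET| / AET * 100."""
    return abs(aet - pet) / aet * 100.0


# BT 64/32 row on Cluster A: PET = 963.29 s, AET = 1066.00 s, reported PETE = 9.64
bt_error = prediction_error_pct(963.29, 1066.00)
```

For this row the result is about 9.64%, the value shown in the table.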

Prediction on Cluster B
SET: Signature Execution Time. AET: Application Execution Time. PET: Predicted Execution Time. PETE: Prediction Execution Time Error. The OLPAS2P-PBB columns use similarity between Parallel Basic Blocks; the OLPAS2P-Phases columns use similarity between phases.
Program   Cores   SET (s)   SET vs. AET (%)   PBB PET (s)   PBB PETE (%)   Phases PET (s)   Phases PETE (%)   AET (s)
CG        32      8.42      0.29              2721.32       4.43           2793.42          1.90              2847.42
CG        64      4.87      0.32              1495.22       1.11           1504.66          0.48              1511.91
BT        32      13.47     0.80              1621.32       2.78           1652.65          0.90              1667.64
BT        64      10.19     0.77              1234.00       5.79           1302.76          0.55              1309.91
SP        32      2.04      0.24              818.41        0.09           808.76           1.28              819.17
SP        64      2.08      0.51              372.94        6.90           388.367          3.05              400.55
SMG2000   32      16.75     2.63              624.47        1.76           633.23           0.38              635.61
SMG2000   64      8.37      5.01              157.66        5.45           162.87           2.32              166.74
Sweep3D   16      4.32      0.17              2437.82       2.21           2494.36          0.06              2492.74
Sweep3D   32      3.01      0.22              1310.54       0.92           1328.04          0.40              1322.62
POP       32      22.79     1.41              1608.33       0.21           1608.85          0.17              1611.59
POP       64      18.36     1.79              1014.42       0.77           1016.01          0.61              1022.28
[Callouts: short time; quality.]

Experimental results on Cluster B
Columns: Program; Processes/Cores; Signature Execution Time, SET (s); 100 × (SET/AET) (%); Predicted Execution Time, PET (s); Application Execution Time, AET (s); Prediction Execution Time Error, PETE (%)
CG 64/32 8.42 0.29 2721.32 2847.42 4.43
CG 64/64 4.87 0.32 1495.22 1511.91 1.11
BT 64/32 13.47 0.80 1621.32 1667.64 2.78
BT 64/64 10.19 0.77 1234.00 1309.91 5.79
SP 64/32 2.04 0.24 818.41 819.17 0.09
SP 64/64 2.08 0.51 372.94 400.55 6.90
SMG2000 64/32 16.75 2.63 624.47 635.61 1.76
SMG2000 64/64 8.37 5.01 157.66 166.74 5.45
Sweep3D 32/16 4.32 0.17 2437.82 2492.74 2.21
Sweep3D 32/32 3.01 0.22 1310.54 1322.62 0.92
POP 64/32 22.79 1.41 1608.33 1611.59 0.21
POP 64/64 18.36 1.79 1014.42 1022.28 0.77

Application signature with different mapping policies
The application signature can be executed under different mapping policies, so that a user who has the signature can implement efficient policies for managing the computing resources.
• BT
• POP
Cluster A
[Diagram: the signature is executed under each mapping policy to obtain the Predicted Execution Time (PET).]

Mapping on Cluster A
BT with 25 processes
SET: Signature Execution Time. PET: Predicted Execution Time. AET: Application Execution Time. PETE: Prediction Execution Time Error.
Mapping pattern (per node)     Cores                SET (s)   PET (s)     AET (s)     PETE (%)
1 process                      25                   0.707     474.997     498.114     4.65
1 process; 2 processes         15; 5 (total 20)     1.481     522.265     535.310     2.44
3 processes; 1 process         8; 1 (total 9)       1.519     575.158     600.809     4.27
20 processes; 5 processes      1; 1 (total 2)       3.385     1357.700    1448.010    6.24
[Callouts: short time; quality.]

Mapping on Cluster A
POP with 16 processes
SET: Signature Execution Time. PET: Predicted Execution Time. AET: Application Execution Time. PETE: Prediction Execution Time Error.
Mapping pattern (per node)                              Cores                      SET (s)   PET (s)    AET (s)    PETE (%)
1 process                                               16                         10.085    203.107    224.093    9.37
3 processes; 4 processes                                4; 1 (total 5)             10.843    200.099    216.630    7.64
8 processes; 4 processes; 2 processes; 2 processes      1; 1; 1; 1 (total 4)       20.229    386.623    421.265    8.23
[Callouts: short time; quality.]

Preliminary results with different workloads
• CG
• Sweep3D
Cluster A
[Diagram: one signature per workload is executed to obtain the Predicted Execution Time (PET).]

Signatures of CG with different workloads on Cluster A
SET: execution time of the phase in the signature. PET: Predicted Execution Time. AET: Application Execution Time. PETE: Prediction Execution Time Error.
Program   Workload   Phase ID   SET (s)        Weight   PET (s)    AET (s)    PETE (%)
CG        Class A    1          0.000189       832      0.157
CG        Class A    2          0.004276       416      1.778
CG        Class A    3          4.58497e-05    416      0.019
CG        Class A    Total      0.004511       -        1.955      2.325      15.92
CG        Class B    1          0.0008864      3952     3.503
CG        Class B    2          0.0329661      1976     65.141
CG        Class B    3          0.0003090      1976     0.610
CG        Class B    Total      0.0341617      -        69.255     69.710     0.66
CG        Class C    1          0.002976       3952     11.762
CG        Class C    2          0.099214       1976     196.047
CG        Class C    3          0.000917       1976     1.813
CG        Class C    Total      0.103108       -        209.623    210.856    0.59
[Callouts: short time; quality.]

Signatures of Sweep3D with different workloads on Cluster A
SET: execution time of the phase in the signature. PET: Predicted Execution Time. AET: Application Execution Time. PETE: Prediction Execution Time Error.
Program   Workload   Phase ID   SET (s)     Weight   PET (s)    AET (s)    PETE (%)
Sweep3D   150        1          0.002870    21564    61.902
Sweep3D   150        2          0.002759    21564    59.502
Sweep3D   150        3          0.002770    21564    59.734
Sweep3D   150        4          0.002868    21564    61.857
Sweep3D   150        Total      0.011268    -        242.998    257.023    5.46
Sweep3D   200        1          0.004188    28764    120.464
Sweep3D   200        2          0.004162    28764    119.724
Sweep3D   200        3          0.004181    28764    120.290
Sweep3D   200        4          0.004135    28764    118.940
Sweep3D   200        Total      0.016673    -        479.418    503.099    4.97
[Callouts: short time; quality.]

Does the application scale as the resources (cores) increase?
What efficiency will the application have with a large number of cores?
[Chart: speedup of the application versus number of cores (0 to 2500).]

Strong scaling: the problem size is kept fixed while more workers or processors are added.
Goal: minimize the time to solution for a given problem.
Weak scaling: the work per worker is kept fixed while more workers or processors are added (the overall problem size increases).
Goal: solve larger problems.
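For strong scaling, the usual metrics are speedup and efficiency. The sketch below uses the standard definitions (speedup = T1 / Tp, efficiency = speedup / p); the function names and the timings are illustrative and are not measurements from this work.

```python
def speedup(t_serial, t_parallel):
    """Strong-scaling speedup: time on one process over time on p processes."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, processes):
    """Fraction of the ideal speedup (equal to p) that is actually achieved."""
    return speedup(t_serial, t_parallel) / processes


# Illustrative timings: 1000 s on 1 process, 125 s on 16 processes.
s16 = speedup(1000.0, 125.0)         # 8.0
e16 = efficiency(1000.0, 125.0, 16)  # 0.5, half of the ideal speedup
```

An efficiency well below 1 at large core counts is the inefficiency that the following speedup chart illustrates with the gap between the ideal and real curves.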

[Chart: speedup of the parallel application versus number of cores (0 to 3000), comparing the ideal speedup (Sp Ideal) with the real speedup (Sp Real). Labels: 97% and 35%. The gap between the two curves is the inefficiency.]

General objective
A methodology to predict the scalability of parallel applications on a given system
Context
• Strong scalability
• Message-passing parallel applications (MPI)
• Using a limited number of resources
• High Performance Computing (clusters) / Cloud

Prediction on the system using all the resources
[Diagram: PAS2P extracts from the parallel application its most representative segments (phases F1 to F4, with weights W1 to W4), which form the application signature. Executing the signature S1024 for 1024 processes gives the predicted time for 1024 processes, in bounded time and with high precision.]
[Chart: time in seconds (logarithmic axis, 1 to 10000) of the parallel application compared with the prediction time of the application signature.]
Can we predict the performance of the application on the system without using all the resources?

Our objective
[Diagram: the parallel application (workload) is executed with a small number of processes (32, 64, 128); PAS2P instruments and analyzes these runs to produce the output.]

Our objective
Prediction on the system using a limited number of resources
[Charts: predicted speedup of the application versus number of cores (0 to 2500). The signatures S64, S128 and S256 are obtained from runs with 64, 128 and 256 processes, and the speedup curve is extended to 1024 and 2048 cores.]

What do we need?
[Diagram: from the application phases we need the communication pattern, the computation pattern and a weight model (phase weight versus number of processes, measured for 32, 64 and 128 processes and extended to 512) to obtain a performance prediction for N processes.]

[Diagram: the parallel application is run with input I on Pj processes; PAS2P produces the signatures Sx, Sy and Sv (x < y < v) and the phases of the application, which are then analyzed.]
f(Pj, I) = predicted time, ∀ j ≤ z (strong scalability)
The phases of the application remain functionally constant as the number of processes and the workload (input) increase.
[Figure: functional similarity between the phases F1 to F5 obtained with 4 processes and the phases F1 to F5 obtained with 8 processes.]

[Diagram: PAS2P workflow (application with input I on Pj processes, signatures Sx, Sy and Sv with x < y < v, analysis of the phases), comparing the phases F1 to F5 obtained with 4 and with 8 processes.]
Parallel applications are written using specific communication and computation patterns, which define the rules by which the application scales.

[Diagram: PAS2P workflow (application with input I on Pj processes, signatures Sx, Sy and Sv with x < y < v, analysis of the phases), extended with a step that models the scaled phases for Pz.]
Modeling of the scaled phases for Pz
For each relevant phase of { Pz, I }:
• Process creation
• Communication pattern
• Communication rule (source-destination)
• Communication volume (bytes)
• Computation pattern (number of instructions)
• Weight of the phase
Result: SLT, the scalable logical trace of the application.

[Diagram: PAS2P workflow; the modeling of the scaled phases for Pz produces the SLT. Open question: computation and communication time?]
Scaled logical trace of the application
Phase 1, weight: 2400
Process   Phase ID   Primitive    Source-Dest.   Comm. volume (bytes)   Number of instructions
0         1          MPI_Irecv    0-1            4000                   756
0         1          MPI_Send     0-1            4000                   456
0         1          MPI_Wait     0-1            4000                   456746733
0         1          MPI_Irecv    0-2            2000                   975
0         1          MPI_Send     0-2            2000                   875
0         1          MPI_Wait     0-2            2000                   357876543

[Diagram: PAS2P workflow; from the SLT, the computation time for Pz is predicted (ST4NP).]
Phase 1, weight: 2,800
Process   Phase   Primitive    Source-Dest.   Comm. volume (bytes)   Computation instructions   Computation time (ns)
0         1       MPI_Irecv    0-1            4,000                  756                        4,000
0         1       MPI_Send     0-1            4,000                  456                        2,345
0         1       MPI_Wait     0-1            4,000                  456,746,733                83,593,535
0         1       MPI_Irecv    0-2            2,000                  975                        7,533
0         1       MPI_Send     0-2            2,000                  875                        5,366
0         1       MPI_Wait     0-2            2,000                  357,876,543                45,326,854

[Diagram: PAS2P workflow; after the prediction of the computation time for Pz (ST4NP) from the SLT, the communication time is predicted with the Synthetic Signature (SS) tool.]

Phase 1, weight: 2,800
Process   Phase   Primitive    Source-Dest.   Comm. volume (bytes)   Computation instructions   Computation time (ns)   Communication time (ns)
0         1       MPI_Irecv    0-1            4,000                  756                        4,000                   234
0         1       MPI_Send     0-1            4,000                  456                        2,345                   1275
0         1       MPI_Wait     0-1            4,000                  456,746,733                83,593,535              4674
0         1       MPI_Irecv    0-2            2,000                  975                        7,533                   428
0         1       MPI_Send     0-2            2,000                  875                        5,366                   1632
0         1       MPI_Wait     0-2            2,000                  357,876,543                45,326,854              4872

[Chart: execution time (seconds, 0 to 3500) versus number of processes (64, 128, 256, 512, 1024, 2048).]
[Diagram: complete workflow. PAS2P signatures Sx, Sy and Sv (x < y < v) → analysis of the phases → modeling of the scaled phases for Pz (SLT) → prediction of the computation time (ST4NP) and of the communication time (Synthetic Signature tool, SS) → execution time for Pz.]

Computation time
Prediction for 256 processes? Mathematical regression models: f(n) = y
[Chart: computation time versus number of processes (0 to 300), showing the executed points and the predicted point.]
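A minimal sketch of fitting a regression function f(n) = y to the executed points and evaluating it at an unexecuted number of processes. The slides do not state which regression family is used, so the power law t = a · n^b fitted by least squares on log-log axes is an assumption, and the measured points are synthetic values chosen for illustration.

```python
import math


def fit_power_law(points):
    """Least-squares fit of t = a * n**b on log-log axes; points are (n, t)."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(t) for _, t in points]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(my - b * mx)
    return a, b


def predict(model, n):
    a, b = model
    return a * n ** b


# Synthetic executed points: (processes, computation time).
measured = [(32, 1000.0), (64, 500.0), (128, 250.0)]
model = fit_power_law(measured)
t256 = predict(model, 256)  # extrapolated point for 256 processes
```

With these perfectly regular points the fit recovers b = -1 and extrapolates 125.0 for 256 processes. With real measurements the extrapolation error grows with the distance from the executed points, which is the motivation for adding a far point to adjust the model.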

[Chart: time in seconds (logarithmic axis, 1 to 10000) versus number of processes (0 to 5000), showing the measured points, the real execution, the predicted points and the prediction error.]
As we move away from the executed points, regression methods introduce a larger prediction error.

Obtain a distant point to adjust the regression model without having to execute the application for that number of processes.

[Diagram: the initial computation regression model (CRMi) gives a regression function; the adjusted computation regression model (CRMc) gives the adjusted regression function.]

• Applications used:
– NPB (NAS Parallel Benchmarks): BT, CG, SP
– Sweep3D
– N-Body
• Clusters:
Cluster   Architecture
CAPITA    Processor: 64 AMD Opteron(tm) Processor 6262 HE 1.60 GHz. Memory: 48 GB RAM (SDRAM). Network: Mellanox ConnectX IB card. (512 nodes).
BEM       Processor: 24 Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz. Memory: 50 GB RAM. Network: Mellanox ConnectX IB card. (2096 nodes).
• Libraries:
– OpenMPI 1.6.5
– PAS2P 1.5
– PAPI 5.3.2
– GNU 4.4
– Intel Compiler 10

Predictions for Sweep3D with 512 processes using the SS and the PAS2P Signature
Synthetic Signature (SS): 160 cores. PAS2P Signature: 512 cores. AET: 4574.32 s.
Phase ID   Weight (W)   SS PhaseET (s)   SS PhaseET × W (s)   PAS2P PhaseET (s)   PAS2P PhaseET × W (s)
0          100          10.9437          1094.37              10.0477             1004.77
1          100          0.1209           12.09                0.1158              11.58
2          99           14.9401          1479.13              14.8701             1472.13
3          100          10.9942          1099.42              10.1899             1018.99
4          100          10.799           1079.90              10.729              1072.90
PET (s)                                  4764.91                                  4580.37
PETE (%)                                 4.16                                     0.13
PhaseET: Phase Execution Time
PETE: Predicted Execution Time Error
PET: Predicted Execution Time
AET: Application Execution Time

Prediction using the SS and the PAS2P Signature
Program   SS SYET (s)   SS PET (s)   SS PETE (%)   SS system cores   PAS2P SET (s)   PAS2P PET (s)   PAS2P PETE (%)   PAS2P system cores   AET (s)
SP        133.98        6927.00      0.98          219               134.98          6901.51         1.34             484                  6995.66
BT        221.84        11642.86     2.46          219               229.24          11992.94        0.46             484                  11937.58
N-Body    184.92        5090.77      1.77          130               192.23          5009.26         3.34             512                  5182.65
Sweep3D   308.02        4764.91      4.16          160               307.12          4580.37         0.13             512                  4574.32
CG        578.34        11108.52     8.59          160               514.47          10648.84        4.10             512                  10229.20
SYET: Synthetic Execution Time. SET: Signature Execution Time. PET: Predicted Execution Time. PETE: Predicted Execution Time Error. AET: Application Execution Time.

[Chart: Speedup CG Class C. Speedup (1 to 10) versus processes (0 to 600), showing the speedup of the application, the predicted speedup and the measured points.]

[Chart: Sweep3D, input.80.11, 11 iterations. Speedup (1 to 17) versus processes (0 to 600), showing the speedup of the application, the predicted speedup and the measured points.]

Prediction of CG Class E using the PAS2P Signature:
[Figure not recovered.]

Prediction of CG Class E using the P3S SS:
[Figure not recovered.]

Resources used to predict the CG Class E performance
[Chart: number of cores used (0 to 2200) for predictions at 128, 256, 512, 1024 and 2048 processes. PAS2P resources: 128, 256, 512, 1,024 and 2,048 cores. P3S resources: 96 cores in every case.]

Prediction of BT Class E using the PAS2P Signature:
[Figure not recovered.]

Prediction of BT Class E using the P3S SS:
[Figure not recovered.]

Resources used to predict the BT Class E performance
[Chart: number of cores used (0 to 2200) for predictions at 256 (9.5 GB), 484 (20 GB), 1024 (43 GB) and 2025 (92 GB) processes. PAS2P resources: 256, 484, 1,024 and 2,025 cores. P3S resources: 96 cores in every case.]

Emilio Luque, Alvaro Wong, Dolores Rexachs, Carlos Rangel

Mapping policies affect the application execution time
[Chart: execution time (seconds, logarithmic axis) for the mapping policies A, B and C. Signature execution times: 355.88, 357.09 and 406.47. Predicted execution times: 20,000, 27,000 and 36,000.]
[Callout: time required to find out which mapping is better.]

Applying Clustering to the Signature
To improve the application communications
Attraction by communication
relation
Cluster N

Communication Clustering (Attraction)
[Diagram: a socket with eight cores, each with L1d and L1i caches, L2 caches and a shared L3 cache; processes P0, P1, ..., P7 are placed on the cores.]

Computation
[Diagram: a socket with eight cores (L1d and L1i caches), L2 caches and a shared L3 cache, with processes P0, P1, ..., P7; process P0 is highlighted.]

Computation Clustering
[Diagram: a socket with eight cores (L1d and L1i caches), L2 caches and a shared L3 cache, with processes P0, P1, ..., P7; processes P0 and P1 are highlighted.]

Computation Clustering (Repulsion)
[Diagram: a socket with eight cores (L1d and L1i caches), L2 caches and a shared L3 cache; P0 and P1 run on cores that share an L2 cache.]
Repulsion effect: L2 cache misses increase if (P0 data size + P1 data size) > L2 cache size.

Computation Clustering (Repulsion)
[Diagram: a socket with eight cores (L1d and L1i caches), L2 caches and a shared L3 cache; P0 to P7 share the L3 cache.]
Repulsion effect: L3 cache misses increase if (P0 data size + ... + P7 data size) > L3 cache size.
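The repulsion condition stated on these slides is a simple capacity check: processes repel each other when their combined data no longer fits in the cache they share. The function name and the sizes below are hypothetical values chosen for illustration.

```python
def repulsion(data_sizes_bytes, shared_cache_bytes):
    """True when the combined data of the co-located processes exceeds the shared cache."""
    return sum(data_sizes_bytes) > shared_cache_bytes


MIB = 1024 * 1024
# P0 and P1 sharing an assumed 6 MiB L2 cache, 4 MiB of data each.
l2_conflict = repulsion([4 * MIB, 4 * MIB], 6 * MIB)
# P0 to P7 sharing an assumed 20 MiB L3 cache, 2 MiB of data each.
l3_conflict = repulsion([2 * MIB] * 8, 20 * MIB)
```

In this example the two processes would be better placed on cores that do not share an L2 cache, while all eight still fit together in the L3 cache.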

Thank you very much for your attention
