PREDICTING THE TIME OF OBLIVIOUS PROGRAMS. Euromicro 2001

The BSP model can be extended with a zero-cost synchronization mechanism, which can be used when the number of messages due to be received is known. This mechanism, usually known as "oblivious synchronization", implies that different processors can be in different supersteps at the same time. An unwanted consequence of these software improvements is a loss of accuracy in prediction. This paper proposes an extension of the BSP complexity model to deal with oblivious barriers and shows its accuracy.

  • Oblivious BSPModel 21/08/11 EuroPar-2000 Good afternoon, ladies and gentlemen. In this paper, we propose a parallel computing model that extends the well-known Bulk Synchronous Parallel model to work with algorithms that don't require global barrier synchronisation, and that deals with new programming features such as processor-partition operations and oblivious synchronisation. This last feature gives the model its name: the Oblivious BSP.
  • The presentation starts with a brief introduction to the BSP model concepts, and then I will present the Oblivious BSP model. A methodology for predicting the execution time is shown using a trivial example. After that, I will show the preliminary results obtained using the OBSP model to predict the execution time of two algorithms: FFT, which is an example of data parallelism, and RAP, which is solved by a communication-intensive pipeline algorithm. To conclude the presentation, I will mention current and future work in this line.
  • The Bulk Synchronous Parallel model was proposed by Prof. Valiant in 1990. It considers a parallel machine made of a set of p processors with private memory, interconnected through a global communication network, and a mechanism for synchronising the processors. The BSP model is characterised by the following parameters: the communication gap g, defined as the unary packet transmission time, which reflects the per-processor bandwidth, and the latency L, which corresponds to the time needed to synchronise all processors. These values depend on the number of processors p. A BSP computation is organised into supersteps, each of which consists of local computation, inter-process communication, and a global synchronisation. The execution time of a superstep s is the sum of three terms: the largest amount of work performed by any processor during the superstep, $w_s$; g times the largest number of packets sent or received by any processor during the superstep, $h_s$; and the time L required by the global synchronisation. That is, $T_s = w_s + g\,h_s + L$.
  • The OBSP model extends the BSP model to deal with oblivious synchronisation and processor-partition operations. When the number of messages due to be received by a processor in a superstep is known, a zero-cost synchronisation mechanism can be used to reduce the synchronisation overhead. An oblivious synchronisation blocks a processor until the expected number of messages has been received. A partition operation splits the current set of processors into several subsets, each of which acts as an autonomous BSP machine with its own processor numbering and synchronisation points. The communication capabilities of an OBSP machine are characterised by the following parameters: the gap $g$, the synchronisation latency $L$, the oblivious latency $L_b$, and the special values for small packet sizes, $g_0$ and $L_{b0}$.
  • The Paderborn University BSP library (PUB) is a parallel C library based on the BSP model. In addition to the most common BSP features, PUB provides routines for oblivious synchronisation, partition operations, and collective communications.
  • In an OBSP prediction analysis, we make three assumptions. First, supersteps are numbered starting at 1. Second, all processors perform the same number of supersteps, R. Third, because processors can be in different supersteps at the same time, a processor in its superstep s can send a message to another processor that is still in a previous superstep. The system ensures that this communication does not take effect until the receiving processor finishes its superstep s. Instead of using a global barrier, the OBSP model defines $\Omega_{s,i}$, the set of incoming partners of processor i in superstep s. It contains the processors that send a message to processor i, together with i itself. $h_{s,i}$ denotes the maximum number of packets communicated by a processor in $\Omega_{s,i}$. $\Phi_{s,i}$ denotes the time at which processor i finishes superstep s and is given by a recursive formula over the supersteps. When a partition operation is performed, this scheme is applied recursively inside each submachine.
  • In this slide I compare both execution models using a trivial example. In the first superstep, one processor performs local computation and sends a message to the other processor, which has to do twice as much work. Then they synchronise, and the second superstep is symmetrical. Using the BSP model, the maximum amount of local computation in each superstep is 2w, so the total time is $T_{BSP} = 4w + 2(g h + L)$. Using the OBSP model, the first processor can start the second superstep while the second processor remains in the first superstep. The system buffers the message until the receiving processor is ready to receive it. This overlap reduces the total execution time to $3w + 2(g h + L_b)$.
  • This figure represents the FFT execution under the OBSP model. Coloured blocks correspond to local computation, and black blocks denote inter-processor communication. Blue lines on the right denote the supersteps performed by a machine $X^{(j)}$, while black lines mark the computation and communication parts of every superstep. In the original set of processors, each processor performs some local computation that includes a partition into two subsets to solve the odd and even component transforms. This partition process continues until only one processor remains in each submachine. Each of these innermost submachines performs only one superstep, computing a sequential transform, and then rejoins the outer machine. Local computation in the first superstep includes the work performed by the inner submachine. The superstep finishes with a data exchange, and the second superstep consists of combining the transformed odd and even signals.
  • Preliminary results have been obtained on a CRAY T3E. The first table shows the model parameter values for this machine; note that the values for small packet sizes are not available. The second table shows the measured time and the OBSP predicted time for the FFT algorithm with an input vector of 2 million elements. The prediction accuracy is quite good: percentage errors are below 3% for the overall algorithm. After this paper was accepted, further experiments were carried out with a fine-grain, communication-intensive pipeline algorithm that solves the RAP. Percentage errors are larger than in the previous example. We point out that this algorithm uses small message sizes, while the model parameters used are g and $L_b$ (the large-message values, since $g_0$ and $L_{b0}$ are not available).
  • As conclusions: we have proposed a new parallel computing model that extends the BSP model to work with oblivious synchronisation and partition operations. Preliminary results show that its prediction accuracy is as good as that of the BSP model. In future work, we want to obtain the parameter values for small message sizes and to extend the analysis to other algorithms and parallel platforms.
  • In the first superstep, processor 1 has to do twice as much work as processor 0. Processor 1 receives a message from processor 0, so its Ω set includes both processors. If h is the amount of communicated data, the Φ values for each processor are ... Processor 0 starts its second superstep while processor 1 is still in the previous one. The system buffers the message to ensure it will be delivered when the receiving processor requests it. Processor 1 has less work to do in the second superstep, so it sends the message back and finishes.
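The speaker notes describe the BSP superstep cost. Each superstep costs $T_s = w_s + g\,h_s + L$, and the total time is the sum over all supersteps. The C sketch below illustrates this. The function names and the superstep-major array layout are assumptions made here. They are not part of PUB or of the paper.

```c
#include <assert.h>
#include <stddef.h>

/* Cost of one BSP superstep: max work + g * max h-relation + L.
 * w[i] is the local work of processor i and h[i] the number of packets
 * it sends or receives in this superstep (p entries each). */
double bsp_superstep_time(const double *w, const double *h, size_t p,
                          double g, double L)
{
    assert(p > 0);
    double w_max = w[0];
    double h_max = h[0];
    for (size_t i = 1; i < p; i++) {
        if (w[i] > w_max) w_max = w[i];
        if (h[i] > h_max) h_max = h[i];
    }
    return w_max + g * h_max + L;
}

/* Total BSP time: sum of the superstep costs. w and h hold R rows of
 * p entries each, one row per superstep. */
double bsp_total_time(const double *w, const double *h, size_t R, size_t p,
                      double g, double L)
{
    double total = 0.0;
    for (size_t s = 0; s < R; s++)
        total += bsp_superstep_time(w + s * p, h + s * p, p, g, L);
    return total;
}
```

Take the two-processor example from the notes. The work is w and 2w in the first superstep and 2w and w in the second, with an h-relation of h in both. For this input, `bsp_total_time` returns $4w + 2(g h + L)$.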
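The recursive formulas mentioned in the notes do not survive in this transcript. Below is one possible reconstruction, written in terms of $\Omega_{s,i}$, $h_{s,i}$ and $\Phi_{s,i}$ as defined in the notes. It is consistent with the two-processor result $\Phi_{2,i} = 3w + 2(g h + L_b)$, but it is an assumption, not the paper's exact statement. Here $w_{s,j}$ is the local work of processor j in superstep s. For $h_{s,i} < h_{PS}$, $g$ and $L_b$ would be replaced by $g_0$ and $L_{b0}$.

```latex
\Phi_{0,i} = 0, \qquad
\Phi_{s,i} = \max_{j \in \Omega_{s,i}} \left( \Phi_{s-1,j} + w_{s,j} \right)
             + g\,h_{s,i} + L_b, \qquad s = 1,\dots,R

% Two-processor check: w_{1,0} = w, w_{1,1} = 2w, w_{2,0} = 2w, w_{2,1} = w,
% \Omega_{1,0} = \{0\}, \Omega_{1,1} = \{0,1\}, \Omega_{2,0} = \{0,1\}, \Omega_{2,1} = \{1\}
\Phi_{1,0} = w + g h + L_b, \qquad \Phi_{1,1} = 2w + g h + L_b
\Phi_{2,0} = \max(\Phi_{1,0} + 2w,\ \Phi_{1,1} + w) + g h + L_b = 3w + 2(g h + L_b)
\Phi_{2,1} = \Phi_{1,1} + w + g h + L_b = 3w + 2(g h + L_b)
```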
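The FFT notes describe a recursive partition. The full processor set is split into two halves, and the splitting repeats until each submachine holds a single processor, which runs the sequential transform. The C sketch below illustrates this by enumerating the submachines; it makes no PUB calls. Splitting a contiguous processor range in half and the name `fft_partition` are assumptions for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Recursively halves the processor range [lo, hi), as in the FFT example
 * where X(0) = {0,...,p-1} is split until each submachine holds one
 * processor. If verbose is nonzero, each submachine is printed with its
 * nesting depth. Returns the number of submachines, including the root. */
int fft_partition(int lo, int hi, int depth, int verbose)
{
    assert(hi > lo);
    if (verbose) {
        printf("depth %d: {", depth);
        for (int k = lo; k < hi; k++)
            printf("%s%d", k > lo ? "," : "", k);
        printf("}\n");
    }
    if (hi - lo == 1) /* single processor: runs the sequential FFT */
        return 1;
    int mid = lo + (hi - lo) / 2; /* split into two halves */
    return 1 + fft_partition(lo, mid, depth + 1, verbose)
             + fft_partition(mid, hi, depth + 1, verbose);
}
```

Calling `fft_partition(0, 4, 0, 1)` visits seven submachines in depth-first order: {0,1,2,3}, {0,1}, {0}, {1}, {2,3}, {2}, {3}. These are the same sets that label the FFT figure.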
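The two-processor comparison in the notes can be checked numerically with the C sketch below. It evaluates a reconstruction of the OBSP recursion, $\Phi_{s,i} = \max_{j \in \Omega_{s,i}}(\Phi_{s-1,j} + w_{s,j}) + g\,h_{s,i} + L_b$. Three things here are assumptions: the recursion itself, the encoding of $\Omega_{s,i}$ as a `sends` matrix, and the function name `obsp_phi`. Only the large-message parameters $g$ and $L_b$ are modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Evaluates the OBSP recursion for R supersteps on p processors.
 * Arrays are superstep-major; superstep s (0-based here) is row s.
 *   w[s*p + j]              local work of processor j
 *   pk[s*p + j]             packets sent or received by processor j
 *   sends[(s*p + i)*p + j]  nonzero if j sends a message to i
 * Omega_{s,i} is { j : j sends to i } together with i itself.
 * On return, phi[s*p + i] is the time at which processor i
 * finishes superstep s. Only g and Lb (large messages) are used. */
void obsp_phi(const double *w, const double *pk, const int *sends,
              size_t R, size_t p, double g, double Lb, double *phi)
{
    assert(p > 0);
    for (size_t s = 0; s < R; s++) {
        for (size_t i = 0; i < p; i++) {
            double start = 0.0; /* latest (previous finish + work) in Omega */
            double h = 0.0;     /* largest h-relation within Omega */
            for (size_t j = 0; j < p; j++) {
                int in_omega = (j == i) || sends[(s * p + i) * p + j];
                if (!in_omega)
                    continue;
                double prev = (s == 0) ? 0.0 : phi[(s - 1) * p + j];
                double t = prev + w[s * p + j];
                if (t > start) start = t;
                if (pk[s * p + j] > h) h = pk[s * p + j];
            }
            phi[s * p + i] = start + g * h + Lb;
        }
    }
}
```

Take w = 1, g = 0.5, h = 2 and $L_b$ = 1. Both processors finish at 7, which equals $3w + 2(g h + L_b)$. The BSP estimate $4w + 2(g h + L)$ is larger because it charges the full 2w in each superstep.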

Presentation Transcript

  • Predicting the Time of Oblivious BSP
        • 9th Euromicro Workshop on Parallel and Distributed Processing PDP 2001 Mantova, Italy, February 7 - 9, 2001
    González J.A.¹, León C.¹, Piccoli F.², Printista M.², Roda J.L.¹, Rodríguez C.¹, Sande F.¹
    ¹ Dpto. de Estadística, Investigación Operativa y Computación, Universidad de La Laguna, Tenerife, Canary Islands, Spain
    ² Universidad Nacional de San Luis, Ejército de los Andes 950, San Luis, Argentina
  • Outline
    • Introduction
    • Bulk Synchronous Parallel model (BSP)
    • Oblivious BSP model (OBSP)
      • Parameters
      • Paderborn University BSP Library
      • Prediction Analysis Methodology
    • Preliminary Results
      • Fast Fourier Transform Algorithm (FFT)
      • Resource Allocation Problem (RAP)
    • Conclusions & Future Work
  • Bulk Synchronous Parallel Model (BSP)
    • Proposed by Prof. Valiant in 1990
      • parallel machine made of a set of p processors,
      • a global communication network, and
      • a mechanism for synchronisation
    • Parameters
      • p : number of processors
      • L : synchronisation time
      • g : unary packet transmission time
    • Supersteps
    [Figure: BSP machine with p nodes, each consisting of a microprocessor, cache memory, DRAM memory and network interface, connected by an interconnection network]
  • Oblivious BSP Model (OBSP)
    • Parameters
      • p : number of processors
      • g : Gap
      • L : Synchronisation Latency
      • $L_b$ : Oblivious Latency
      • $g_0$ : Gap for small packet size
      • $L_{b0}$ : Latency for small packet size
    • New features
      • Oblivious Synchronisation
      • Processor Partition Operations
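    – $h_{PS}$ : OBSP packet size threshold
      $T(h) = g\,h + L_b$ if $h \ge h_{PS}$; $T(h) = g_0\,h + L_{b0}$ if $h < h_{PS}$
      [Figure: communication time T(h) versus h, labelled with $g_0$ and $L_{b0}$ below $h_{PS}$ and $g$ and $L_b$ above it]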
  • Paderborn University BSP Library
    • Oblivious Synchronisation
      • bsp_oblsync(t_bsp* bsp, int nr_msgs)
    • Partition Operations
      • bsp_partition(t_bsp* bsp, t_bsp* sub, int nr, int* partition)
      • bsp_done(t_bsp* bsp)
        • http://www.uni-paderborn.de/~pub
        • Current release: 7.0
    Olaf Bonorden, Ben Juurlink, Ingo von Otte, Ingo Rieping. The Paderborn University BSP (PUB) Library: Design, Implementation and Performance. 13th International Parallel Processing Symposium & 10th Symposium on Parallel and Distributed Processing (IPPS/SPDP), San Juan, Puerto Rico, April 12-16, 1999.
  • OBSP Cost Analysis
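  • BSP Model vs OBSP Model
    OBSP: $\Phi_{2,i} = 3w + 2(g h + L_b)$, for $h \ge h_{PS}$
    BSP: $T_{BSP} = 4w + 2(g h + L)$, for $h \ge h_{PS}$
    [Figure: timelines of P0 and P1 under the BSP and OBSP models, with work blocks w and 2w, communication g*h, and latency L (BSP) or $L_b$ (OBSP)]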
  • FFT Analysis using the OBSP Model
    [Figure: FFT execution on P0 to P3 under the OBSP model, showing the Division, bsp_partition, seq_fft, bsp_done and Combination phases; submachines $X^{(0)} = \{0,1,2,3\}$, $X_0^{(1)} = \{0,1\}$, $X_1^{(1)} = \{2,3\}$, $X_k^{(2)} = \{k\}$, $k = 0,\dots,3$; superstep times labelled $\Phi_{s,i}(T_k^{(j)}, X_k^{(j)}, \Omega_i^{(j)})$, with costs $w_{s,i}$ and $g h_{s,i} + L_b$]
  • OBSP Prediction Accuracy
    [Tables: OBSP parameter values on the CRAY T3E (g is in bytes per second), p=16; real and OBSP predicted time for the FFT algorithm on the CRAY T3E; real and OBSP predicted time for the RAP algorithm on the CRAY T3E. Problem sizes shown: N=1000, M=1000 and N=2048]
  • PBS 209152 Items. CRAY T3E
  • Conclusions & Future Work
    • Oblivious BSP model:
      • Extends BSP model to deal with oblivious synchronisation and processor partition operations,
      • Prediction error on the CRAY T3E is no larger than 30%.
      • The variability observed in communication time is attributed to the simplicity of the model, which uses only a few constants to characterise the architecture.
    • Future work:
      • A model based on the h-relation concept, and therefore assuming linear behaviour in the h-relation size, but with different communication constants for each step?
  • OBSP Cost Analysis Example
    [Figure: OBSP timeline for P0 and P1, with work blocks w and 2w, communication g*h, and oblivious latency $L_b$]
  • BSP Cost Analysis Example
    [Figure: BSP timeline for P0 and P1, with work blocks w and 2w, communication g*h, and synchronisation latency L]
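The OBSP parameters slide gives a piecewise communication cost: $T(h) = g\,h + L_b$ for $h \ge h_{PS}$ and $T(h) = g_0\,h + L_{b0}$ for $h < h_{PS}$. The C helper below sketches this formula. The struct and function names are hypothetical. The parameter values passed to it would have to come from measurements on the target machine; the numbers used to exercise it are arbitrary.

```c
#include <assert.h>
#include <stddef.h>

/* OBSP communication parameters (machine-specific, measured). */
typedef struct {
    double g, Lb;   /* gap and oblivious latency for h >= h_ps */
    double g0, Lb0; /* gap and latency for small packets, h < h_ps */
    double h_ps;    /* packet-size threshold */
} obsp_params;

/* Communication time T(h) of an h-relation under OBSP. */
double obsp_comm_time(const obsp_params *m, double h)
{
    assert(m != NULL && h >= 0.0);
    if (h < m->h_ps)
        return m->g0 * h + m->Lb0;
    return m->g * h + m->Lb;
}
```

This helper could replace the fixed $g\,h + L_b$ term in a superstep-cost calculation. That would matter for fine-grain codes such as the RAP pipeline, where most messages fall below $h_{PS}$.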