Mapping SMAC Algorithm onto GPU
Student: Zhengjie Lu
Supervisor: Dr. Ir. Bart Mesman
Ir. Yifan He
Prof. Dr. Ir. Richard Kleihorst
Contents
1. Background
1.1 SMAC algorithm
1.2 GPU programming
2. Implementation
2.1 General Structure
2.2 SMAC on CPU
2.3 SMAC on GPU
3. Experiment
3.1 Experiment Environment
3.2 Experiment Setup
3.3 Experiment Result
3.3.1 GPU improvement
3.3.2 Linear execution-time model
4. Roofline Model Analysis
4.1 Roofline Model
4.2 Application
5. Conclusion and Future Work
Acknowledgement
Appendix
1. Background
1.1 SMAC algorithm
SMAC stands for the "Simplified Method for Atmospheric Correction", a method used to compute the atmospheric correction of satellite measurements in the solar spectrum. It is popular in remote sensing applications because it is several hundred times faster than more detailed radiative transfer models such as 5S [3]. Figure 1.1 shows SMAC as a black box: it takes 9 input parameters and produces a single output.
Fig1.1 SMAC black box model. Inputs (all float): sza, sva, vza, vva, taup550, uh2o, uo3, airPressure, r_toa. Output (float): r_surfRecycle.
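As a concrete illustration, a minimal C++ sketch of this black box is given below. The struct and the free function are assumptions made for illustration only; the report itself implements the algorithm in the SmacAlgorithm class, and the function body is omitted here.

    // Nine float inputs produce one float output (r_surfRecycle).
    struct SmacInput {
        float sza, sva, vza, vva;    // sun and view angles, as in figure 1.1
        float taup550;               // aerosol optical depth at 550 nm
        float uh2o, uo3;             // water vapour and ozone contents
        float airPressure;           // surface pressure
        float r_toa;                 // top-of-atmosphere reflectance
    };

    // Declaration only; the actual correction is implemented by the SmacAlgorithm class.
    float smac(const SmacInput& in, const float* coefficients);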
SMAC is computationally fast compared with its peers, but it still takes a considerable amount of time to process the large data sets that are common in remote sensing applications. Figure 1.2 shows the profile of SMAC processing 231240 bytes of data on the CPU. The file I/O is dominant in this case (about 75%), but the CPU computation time is also significant (about 25%). Since the file I/O performance can be improved by using faster hard disks, the CPU computation will eventually become the bottleneck. This motivates us to map SMAC onto a commercial GPU (i.e. an NVIDIA graphics card) and see how much computational performance improvement we can achieve.
Fig1.2 Profiles of the original SMAC program
1.2 GPU programming
GPU programming was introduced into the field of scientific computing after its success in accelerating computer graphics. The hardware advantage of many parallel processing cores per GPU (normally at least 32) makes it well suited to processing massive data sets. The disadvantages of GPU programming are that (1) programmers need knowledge of the hardware (especially the memory access patterns) to use the GPU efficiently, and (2) the pipeline penalty is severe when branches diverge or are mispredicted. Normally, an efficient program is organized so that the GPU is responsible for the massive mathematical computation while the CPU takes charge of logic and control.
NVIDIA develops a GPU programming technology named "Compute Unified Device Architecture" (CUDA) for its own graphics card products, and it is the most popular technology in state-of-the-art GPU programming. The CUDA-capable GPUs are listed on NVIDIA's official website [1] and cover NVIDIA's latest products with more than 100 cores per GPU. In this report, we use "GPU programming" to mean CUDA programming for NVIDIA graphics cards.
It is essential to explain the NVIDIA GPU hardware architecture before discussing CUDA programming, because CUDA programming is essentially a set of rules for operating the hardware in the most efficient way. Figure 1.3 gives an overview of the NVIDIA GeForce 8800GT GPU. Every 8 stream processors (SPs) are grouped into a stream multiprocessor (SM), and 14 SMs form the main body of the GPU. Inside each SM there are 8192 registers and a shared memory of 16384 bytes. The shared memory is used for local communication among the 8 SPs. A global memory is connected to all SMs for global communication. It should be pointed out that access to the global memory is rather slow while access to the shared memory is fast. This indicates that we should use the shared memory much more than the global memory to achieve better performance.
Fig1.3 NVIDIA 8800GT architecture
The basic concept in CUDA programming is single-instruction-multiple-threads (SIMT), which means that all active threads perform an identical instruction at each execution step [2]. Each active thread is assigned to a unique SP so that physical parallel threading is achieved. Each thread also has its own registers to keep its state, and threads can communicate with each other through the shared memory.
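A minimal CUDA sketch of the SIMT idea is given below; the kernel name and its scaling operation are illustrative only and are not the report's SMAC kernel.

    #include <cuda_runtime.h>

    // Every active thread executes the same instruction stream, each on the element
    // selected by its block and thread indices.
    __global__ void scale(const float* in, float* out, int n, float k)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
        if (i < n)
            out[i] = k * in[i];                          // same instruction, different data
    }

    // Example launch: 192 threads per block, enough blocks to cover n elements.
    // scale<<<(n + 191) / 192, 192>>>(d_in, d_out, n, 2.0f);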
The second concept in CUDA programming is the thread and the block. Each block consists of multiple threads, as shown in figure 1.4, and the number of threads per block is limited by the physically available number of registers and the shared-memory size. Each block is assigned to an SM, inside which 8 SPs are integrated. A single block can only be assigned to a single SM, while a single SM can hold many blocks.
Fig1.4 Block and threads Fig1.5 Stream Execution
The third concept in CUDA programming is the warp, the basic thread-scheduling unit. 32 threads in a block are organized as a warp and then assigned simultaneously to the block's corresponding SM by the scheduler. If the number of threads in a block is not a multiple of 32, dummy threads are appended to the block to round the thread count up to a multiple of 32.
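For example, a block of 100 threads would be padded with 28 dummy threads and scheduled as 4 warps of 32 threads each.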
The fourth concept in CUDA programming is concurrent copy and execution, the so-called "stream" execution. The input data is broken down into several segments of the same length. These segments (the data streams) are then transferred from the CPU memory to the GPU memory one by one. A data stream can be processed by the GPU kernel as soon as it has been transferred completely, without waiting for the other data stream transfers to finish. Stream execution is illustrated in figure 1.5.
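A hedged sketch of stream execution is shown below, assuming four streams, page-locked host buffers and the illustrative scale() kernel from the SIMT example above; handling of sizes not divisible by the stream count is omitted.

    void runWithStreams(const float* h_in, float* h_out,
                        float* d_in, float* d_out, int n)
    {
        const int kStreams = 4;
        cudaStream_t streams[kStreams];
        for (int s = 0; s < kStreams; ++s)
            cudaStreamCreate(&streams[s]);

        const int chunk = n / kStreams;                  // equal-length segments
        for (int s = 0; s < kStreams; ++s) {
            const int off = s * chunk;
            cudaMemcpyAsync(d_in + off, h_in + off, chunk * sizeof(float),
                            cudaMemcpyHostToDevice, streams[s]);
            // The kernel for segment s starts as soon as its own copy has finished,
            // without waiting for the other segments.
            scale<<<(chunk + 191) / 192, 192, 0, streams[s]>>>(d_in + off, d_out + off,
                                                               chunk, 2.0f);
            cudaMemcpyAsync(h_out + off, d_out + off, chunk * sizeof(float),
                            cudaMemcpyDeviceToHost, streams[s]);
        }
        cudaThreadSynchronize();   // cudaDeviceSynchronize() on newer toolkits
        for (int s = 0; s < kStreams; ++s)
            cudaStreamDestroy(streams[s]);
    }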
The fifth concept in CUDA programming is memory access coalescing. The accesses to the global memory should be grouped into aligned segments of 16 or 32 words, and the addressing pattern must be aligned to those 16- or 32-word boundaries.
In short, the program has to be mapped onto the NVIDIA GPU hardware according to these CUDA rules.
2. Implementation
2.1 General Structure
The SMAC algorithm consists of 14 steps, as shown in figure 2.1. The first step, a "data filter", is a conditional branch through which only valid data is passed on to the later computations. Each computation depends only on the outputs of the preceding ones, as shown in the data dependency graph in figure 2.2. All computations in SMAC are arithmetic, including trigonometric and exponential functions. Several computations also contain if-else conditions, and these branches are replaced with equivalent logical expressions when they are mapped onto the GPU. The implementation of SMAC on the CPU is programmed in ANSI C++, while the GPU version is programmed in C++ and CUDA C.
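The sketch below illustrates, with an assumed condition and assumed names, how an if-else step can be replaced with an equivalent arithmetic expression; the real conditions in the SMAC steps differ.

    // Equivalent to: if (x > 0.0f) return a; else return b;
    __device__ float selectBranchless(float x, float a, float b)
    {
        float m = (x > 0.0f) ? 1.0f : 0.0f;    // the predicate as a 0/1 value
        return m * a + (1.0f - m) * b;         // no divergent branch in the data path
    }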
Fig.2.1 Overview of the SMAC kernel: parameters setup, a "valid vector?" check, and calculation steps 1-12.
Fig.2.2 Data dependency graph of SMAC: parameters setup followed by calculation steps 1-12.
2.2 SMAC on CPU
A single thread is employed as the execution model of SMAC on CPU, in which the SMAC kernel has to
read through all the input and then generate out the final results. One input vector consisting of 9 float
point number can just be used to produce one output data consisting of 1 float point number. No any
data dependencies exist between different input vectors and neither do the outputs. The completely
execution model is shown in figure 2.3.
The data flow in this case is quite simple, which is shown in figure 2.4. Both the input data and
coefficients are read from the files on the hard disk to the CPU memory. Then CPU takes them into its
registers and throws out the final results into the CPU memory.
Fig.2.3 Execution model of SMAC on CPU: the input vectors vector[0..n-1], each holding the nine floats of figure 1.1, are processed one by one by the SMAC kernel, producing the outputs r_surfRecycle[0..n-1].
Fig.2.4 Data flow of SMAC on CPU
The original SMAC program employs 3 classes: (1) the SmacAlgorithm class, (2) the Coefficients class and (3) the CoefficientsFile class. The SmacAlgorithm class implements the kernel, in which the SMAC algorithm is fully contained, while the other two manage access to the coefficient file. An additional SimData class is now included to handle the input and output data. The relations among the classes are shown in figure 2.5.
Because the validity of the input data can be determined as soon as it is read in, the "data filter" of the SMAC kernel can be performed in the SimData class instead. This both saves memory and reduces the processing time in the kernel. It also lets the GPU avoid this conditional branch when SMAC is mapped onto it.
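A minimal sketch of this read-time filter is shown below; it reuses the SmacInput struct sketched in section 1.1, and both the helper isValid() and its test are assumptions rather than the report's code.

    #include <cstddef>
    #include <vector>

    static bool isValid(const SmacInput& v)
    {
        return v.r_toa >= 0.0f;               // placeholder test; the real criterion differs
    }

    std::vector<SmacInput> filterValidVectors(const std::vector<SmacInput>& raw)
    {
        std::vector<SmacInput> valid;
        valid.reserve(raw.size());
        for (std::size_t i = 0; i < raw.size(); ++i)
            if (isValid(raw[i]))              // only valid vectors ever reach the kernel
                valid.push_back(raw[i]);
        return valid;
    }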
The flow chart in figure 2.6 explains the implementation. The coefficients and the input data are read first and then passed to the SmacAlgorithm instance for computation. The results are collected by the SimData instance, which is the output parameter of the SmacAlgorithm instance.
Fig.2.5 Program structure of SMAC on CPU: the SimData, SmacAlgorithm, Coefficients and CoefficientsFile classes and their interfaces to the satellite data file, the coefficients file and the earth surface reflectance file.
Fig.2.6 Flow chart of SMAC on CPU: ENTRY -> Coefficients::setCoefficients() -> SimData::readData() -> SmacAlgthm::SmacAlgthm() -> SmacAlgthm::run() -> EXIT
2.3 SMAC on GPU
The execution model of SMAC on the GPU benefits from multiple threads. Each GPU thread is an instance of the SMAC kernel, so multiple input vectors can be processed simultaneously, as shown in figure 2.7. This is the main benefit of employing the GPU.
Fig.2.7 Execution model of SMAC on GPU: the input vectors vector[0..n-1] are distributed over multiple SMAC kernel instances (one per GPU thread), producing the outputs r_surfRecycle[0..n-1].
The data flow of SMAC on the GPU is quite different from the CPU one, as shown in figure 2.8. The input data must be transferred from the CPU memory to the GPU memory, and the results are copied back in the opposite direction. The GPU constant memory is employed as a cache so that all SMs can access the frequently used coefficients.
Fig.2.8 Data flow of SMAC on GPU: coefficients[] and input_data[] are read from the hard disk into the CPU memory; the input data is copied to the GPU global memory and the coefficients to the GPU constant memory; the kernel computes in registers and output_data[] is copied back to the CPU memory.
The program structure of SMAC on the GPU is based on the CPU one, with the SMAC kernel mapped directly onto the GPU. Two additional GPU-related modules are introduced, as shown in figure 2.9. The module "GPU_kernel.cu" implements the SMAC kernel that is executed on the GPU, while "GPU.cu" controls the GPU memory operations and the kernel execution.
Fig.2.9 Program structure of SMAC on GPU
The flow chart of SMAC on the GPU is likewise similar to the CPU one, except for the calls to the GPU-related functions, as shown in figure 2.10. An obvious change is that the input data is transferred from the CPU memory to the GPU memory before the SMAC kernel is executed, and the output data is copied back from the GPU memory to the CPU memory afterwards. In addition, the input data has to be reorganized into a layout that allows coalesced access to the GPU memory.
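A hedged sketch of such a reorganization is shown below: the array-of-structures input (one SmacInput per vector, as sketched in section 1.1) is rewritten as a structure-of-arrays so that consecutive threads read consecutive addresses. The layout and the name reorganizeInput only mirror figure 2.10; the report's actual code may differ.

    #include <cstddef>
    #include <vector>

    void reorganizeInput(const std::vector<SmacInput>& aos, std::vector<float>& soa)
    {
        const std::size_t n = aos.size();
        soa.resize(9 * n);                    // 9 contiguous arrays of n floats each
        for (std::size_t i = 0; i < n; ++i) {
            soa[0 * n + i] = aos[i].sza;
            soa[1 * n + i] = aos[i].sva;
            soa[2 * n + i] = aos[i].vza;
            soa[3 * n + i] = aos[i].vva;
            soa[4 * n + i] = aos[i].taup550;
            soa[5 * n + i] = aos[i].uh2o;
            soa[6 * n + i] = aos[i].uo3;
            soa[7 * n + i] = aos[i].airPressure;
            soa[8 * n + i] = aos[i].r_toa;
        }
    }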
Fig.2.10 Flow chart of SMAC on GPU: the CPU flow of figure 2.6 extended with reorganizeInput(), cudaMemcpyAsync() transfers in both directions, the GPU_kernel<<<...>>>() launch and reorganizeOutput().
As introduced in section 1.2, a few more GPU programming parameters have to be considered: threads, blocks and streams. The profiling tool CUDAPROF reports that 59 registers are needed per SMAC kernel thread. Given this register requirement, the CUDA occupancy calculator indicates that at most 192 threads per block can be used.
Once a block of 192 threads is executing on an SM, no other block can be assigned to that SM until the running block has finished. Since there are 4 SMs inside the GPU used in our experiment, 4 blocks are enough to fully utilize the GPU. Employing more blocks would probably introduce extra overhead from block switching, while employing fewer blocks would simply waste hardware resources.
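The resulting launch configuration can be sketched as follows; the kernel name GPU_kernel matches figure 2.10, but its argument list is an assumption.

    dim3 block(192);   // set by the 59-registers-per-thread requirement
    dim3 grid(4);      // one block per SM of the Quadro FX570M
    GPU_kernel<<<grid, block>>>(d_input, d_output, n);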
The number of streams will be explored later in our experiment.
3. Experiment
3.1 Experiment Environment
We run the SMAC program on a laptop workstation equipped with both an Intel dual-core CPU and an NVIDIA GPU. The GPU is mounted on the laptop motherboard through the PCI express interface. The operating system on this machine is 32-bit Windows Vista Enterprise, with CUDA 2.2 support. The load on both the CPU and the GPU is low before our experiment. The details of the experiment environment are listed below:
HARDWARE
CPU Intel(R) Core(TM)2 Duo CPU T9300, 2.5 GHz x 2
GPU nVidia Quadro FX570M, 0.95 GHz x 32
Main Memory 4GB
Motherboard interface PCI express 1.0 x 16
SOFTWARE
Operating system Windows Vista Enterprise 32-bit
CUDA version CUDA 2.2
GPU maximum registers per thread 60
GPU thread number 192 x 4 (#thread per block x #block)
CPU thread number 1
Table 3.1 Experiment environment
Fig.3.1 Time profiling method: the flow chart of figure 2.6 with a start timer and a stop timer attached around the SmacAlgthm instance.
To profile the execution time of SMAC on the CPU or on the GPU, timers are attached at the two ends of the algorithm instance, as shown in figure 3.1. In our experiment we are concerned with the SMAC kernel performance, not the application performance, because the latter is dominated by the hard disk I/O speed (as already shown in figure 1.2) and can be improved by employing a faster hard disk.
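The report does not state which timer API was used; the sketch below is one possible implementation of figure 3.1 using CUDA events, with runSmacOnGpu() standing in for the measured algorithm instance.

    void runSmacOnGpu();                       // placeholder for the measured SmacAlgthm::run()

    float timeSmacKernelMs()
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);             // "start timer"
        runSmacOnGpu();
        cudaEventRecord(stop, 0);              // "stop timer"
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;                             // memory transfer + kernel time in ms
    }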
3.2 Experiment Setup
As indicated in figure 3.1, the execution time of the SMAC kernel is defined as the difference between the "stop timer" and the "start timer":
CPU time = CPU stop timer - CPU start timer
GPU time = GPU stop timer - GPU start timer
Now we define the performance improvement as:
Improvement = CPU time / GPU time
in which the linear execution-time model is employed:
CPU time = CPU overhead + Bytes x CPU speed
GPU time = GPU memory time + GPU run time
         = (GPU memory overhead + Bytes x GPU memory speed) + (GPU kernel overhead + Bytes x GPU kernel speed)
         = (GPU memory overhead + GPU kernel overhead) + Bytes x (GPU memory speed + GPU kernel speed)
         = GPU overhead + Bytes x GPU speed
The performance improvement can therefore be expressed as:
Improvement = (CPU overhead + Bytes x CPU speed) / (GPU overhead + Bytes x GPU speed)
            ≈ (Bytes x CPU speed) / (Bytes x GPU speed)
            = CPU speed / GPU speed
It should be pointed out that the final approximation only holds when the data size is very large. It shows that the ultimate improvement depends only on the CPU and GPU speeds and not on the data size. We will use this linear execution-time model to predict the GPU performance later.
3.3 Experiment Result
3.3.1 GPU improvement
Table 3.2 and figure 3.2 record the performance improvement measured in our experiment. For small data sizes, the improvement curves in figure 3.2 fall as the number of streams increases, because the data size is too small to cover the extra overhead. As the data size grows, the slopes of the curves become larger, since the increasing data size amortizes the overhead. Eventually all the curves behave similarly or even overlap once the data size exceeds a certain threshold; beyond that point, increasing the data size or the number of streams does not help much.
Table 3.2 Performance improvement: CPU time/GPU time
Fig.3.2 GPU performance improvement: CPU time/GPU time
3.3.2 Linear execution-time model
In our earlier tests, the overhead and processing speed have been obtained and they are recorded in
table 3.3. The GPU overhead is relatively large, while the CPU overhead is too tiny to be measured. The
reason for the significant GPU overhead is that the data needs be re-organized and then transferred
both before and after the GPU kernel execution. It is also obvious that the GPU speed is at least 10 times
faster than the CPU one.
CPU time (ms) = CPU overhead (ms) + data size (byte) x CPU speed (ms/byte)
    CPU overhead               0
    CPU speed                  5.39 x 10^-5
GPU time (ms) = GPU overhead (ms) + data size (byte) x GPU speed (ms/byte)
    GPU overhead (1-stream)    1.67
    GPU speed (1-stream)       2.41 x 10^-6
    GPU overhead (8-stream)    4.45
    GPU speed (8-stream)       2.01 x 10^-6
Table 3.3 Parameters of linear execution-time model
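As a quick check of the model with the smallest data size and 1 stream: CPU time = 116640 x 5.39 x 10^-5 ≈ 6.29 ms and GPU time = 1.67 + 116640 x 2.41 x 10^-6 ≈ 1.95 ms, giving a predicted improvement of about 3.22, which matches the first predicted value listed below. In the limit of very large data the model predicts an improvement of 5.39 x 10^-5 / 2.41 x 10^-6 ≈ 22 for 1 stream and 5.39 x 10^-5 / 2.01 x 10^-6 ≈ 27 for 8 streams.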
Both the predicted and the experimental improvement of the 1-stream GPU configuration are plotted in figure 3.3. A noticeable difference between the two curves occurs when the data size is small, while the curves are almost identical when the data size is large. Eventually both curves approach a "ceiling" regardless of the data size.
Fig.3.3 Single-stream GPU performance improvement
The comparison for the 8-stream GPU configuration is shown in figure 3.4 and tells a similar story. The performance improvement ultimately settles at a constant; no further improvement is achieved by increasing the data size.
Fig.3.4. 8-stream GPU performance improvement
1-stream GPU
Data size (byte)    Experimental improvement    Predicted improvement
116640              5.71                        3.22
233280              8.60                        5.63
466560              12.83                       8.99
933120              14.79                       12.82
1866240             19.28                       16.30
3732480             20.54                       18.85

8-stream GPU
Data size (byte)    Experimental improvement    Predicted improvement
116640              1.99                        1.34
233280              3.49                        2.56
466560              6.18                        4.67
933120              9.95                        7.96
1866240             14.75                       12.28
3732480             19.24                       16.86
4. Roofline Model Analysis
4.1 Roofline Model
The Roofline model gives an approximate insight into the performance bottleneck [4]. When the achievable performance at a given instruction density (measured in Flops/Byte) lies on the sloped bandwidth line below the peak performance, as shown in figure 4.1, the bottleneck is the data transfer and we should deliver data faster. Otherwise the bottleneck is the computation itself and we should reconsider the computation approach.
Fig.4.1 Example of the Roofline model
4.2 Application
To apply the Roofline model with our case, the GPU hardware specifications and the profiles of SMAC
kernel (notes: no the SMAC application) have to be obtained before the analysis. They are all listed in
table 4.1.
Hardware: NVIDIA Quadro FX570M
PCI express bandwidth: 4 GB/sec
Peak performance: 91.2 GFlops/sec
Peak performance without FMAU: 30.4 GFlops/sec
Software: SMAC kernel on GPU
Data size: 59719680 Bytes
Issued instruction number: 4189335552 Flops
Execution time: 79.2 ms
Instruction density: 70.15 Flops/Byte
Instruction Throughput: 52.8 GFlops/sec
Table 4.1 Parameters of Roofline model
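These figures can be reproduced directly from the table: the instruction density is 4189335552 Flops / 59719680 Bytes ≈ 70.15 Flops/Byte, and the throughput is 4189335552 Flops / 79.2 ms ≈ 52.9 GFlops/sec, in line with the 52.8 GFlops/sec listed above. Taking the listed PCI express bandwidth of 4 GB/sec as the bandwidth roof, the ridge point lies at 91.2 / 4 ≈ 22.8 Flops/Byte; since 70.15 Flops/Byte is well to the right of that point, the kernel operates under the computation roof rather than the bandwidth roof.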
The performance of SMAC on the GPU is now placed on the Roofline model as the blue marker in figure 4.2. It shows that the bottleneck is the computation if we only consider the kernel execution on the GPU. More precisely, it results from the imbalance between floating-point multiplications and additions. This is partly because SMAC was mapped onto the GPU directly, so the data dependencies inside SMAC still exist; these dependencies limit the full utilization of the FMAUs (floating-point multiply-add units) in the GPU.
The other reason is that only 192 threads per block are employed in order to stay within the register budget. Using more threads would force temporary operands to spill into the local memory, which is slow to access. Keeping the thread count at 192 avoids using the local memory, but it also leaves some functional units idle.
The situation is worse when the I/O operations are included. The navy-blue sloping line represents the hard disk I/O bandwidth limit, which lies far below the GPU memory bandwidth line. The instruction throughput in this case is below 1 GFlops/sec, so it is not plotted in figure 4.2. In summary, the SMAC application as a whole is limited by the hard disk I/O.
Fig.4.2 Roofline model of SMAC on GPU
5. Conclusion and Future Work
SMAC, a popular algorithm in the remote sensing field, has been successfully mapped onto a commercial programmable GPU with the help of CUDA. Since SMAC processes large data streams, it can exploit stream execution to improve performance. Our experiment shows that a performance speedup of about 25 times can be achieved by the GPU compared with the CPU. The linear execution-time model also proved useful for analyzing the GPU's stream execution.
In addition, the Roofline model was used to identify the bottleneck of SMAC on the GPU. The SMAC kernel on the GPU is compute-bound, while the complete application is limited by the hard disk I/O bandwidth; only the former is of interest in this report. Two main reasons cause the computational bottleneck of the SMAC kernel on the GPU: (1) the imbalance between floating-point multiplications and additions due to the data dependencies, and (2) the register pressure caused by the per-thread register requirement. The first can possibly be relieved by decoupling the data dependencies inside the SMAC kernel, while the second can be addressed by moving to finer-grained threads that require fewer registers.
Fig.5.1 Diagram of GPU power measurement    Fig.5.2 Physical layout of GPU power measurement
The power consumption of SMAC on the GPU is also of interest for future work. Since no commercial PCI-E power measurement cards are available on the market, a customized approach has to be used. Figure 5.1 shows the measurement principle: the 5 V and 12 V supply lines of the 4-pin PCI-E power connector are measured separately. A 0.03 Ω resistor rated at 20 W is inserted in the 5 V supply line, and the current through it is calculated as the voltage across the resistor divided by its resistance. The power delivered through the 5 V line is then 5 V times this current. The same measurement is made on the 12 V supply line, and the two contributions are summed to obtain the GPU power consumption. Figure 5.2 shows the physical setup of this planned experiment.
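As an illustrative calculation (the numbers are made up, not measured): if the 0.03 Ω resistor in the 5 V line drops 30 mV, the line current is 0.03 V / 0.03 Ω = 1 A and the 5 V line delivers 5 V x 1 A = 5 W; the 12 V line is evaluated in the same way, and the two contributions are summed to give the total GPU power.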
Acknowledgement
This assignment was my 3-month traineeship at the Technische Universiteit Eindhoven (TU/e), with a topic provided by VITO-TAP, the Flemish Institute for Technological Research NV. I received a lot of support from the people around me. Dr. Ir. Bart Mesman, my supervisor at TU/e, spent quite a lot of time on my weekly reports and on verifying my ideas. Ir. Yifan He, a PhD candidate at TU/e and also my supervisor during this traineeship, guided me in the research methodology. Prof. Dr. Ir. Richard Kleihorst, my supervisor at VITO-TAP, kindly arranged the daily issues and the working environment at VITO-TAP. Prof. Dr. Ir. Henk Corporaal, my academic mentor at TU/e, gave me strong support during these 3 months. I should also thank Ir. Zhengyu Ye and Ir. Gert Jam, who gave me valuable advice on GPU programming and the Roofline model.
Appendix
[1] NVIDIA CUDA official website, http://www.nvidia.com/object/cuda_home.html, retrieved on July 20, 2009.
[2] NVIDIA CUDA documentation, Chapter 4, "NVIDIA CUDA Programming Guide 2.2", February 4, 2009.
[3] H. Rahman, G. Dedieu, "SMAC: A Simplified Method for the Atmospheric Correction of Satellite Measurements in the Solar Spectrum", December 5, 1993.
[4] S. Williams, A. Waterman, D. Patterson, "Roofline: An Insightful Visual Performance Model for Multicore Architectures", April 2009.
[5] S. Williams, D. Patterson, "The Roofline Model: A Pedagogical Tool for Program Analysis and Optimization", retrieved on September 10, 2009.
[6] Zhengyu Ye, "Design Space Exploration for GPU-Based Architecture", August 2009.
[7] Shane Ryoo, Christopher I. Rodrigues, Sam S. Stone, Sara S. Baghsorkhi, Sain-Zee Ueng, John A. Stratton, and Wen-mei W. Hwu, "Program Optimization Space Pruning for a Multithreaded GPU", ACM, 2008.
[8] NVIDIA CUDA documentation, "NVIDIA_CUDA_BestPracticesGuide_2.3", retrieved on September 12, 2009.
[9] Rob Farber, "CUDA, Supercomputing for the Masses", September 19, 2008.
[10] X. Ma, M. Dong, L. Zhong, Z. Deng, "Statistical Power Consumption Analysis and Modeling for GPU-based Computing", retrieved on September 2009.
[11] S. Collange, D. Defour, A. Tisserand, "Power Consumption of GPUs from a Software Perspective", Proceedings of the 9th International Conference on Computational Science, 2009.
[12] Analog Devices documentation, "Measuring Temperatures on Computer Chips with Speed and Accuracy", April 1999.
[13] Green Grid, "The Green Data Center: Energy-Efficient Computing in the 21st Century", retrieved on July 16, 2009.
[14] Google, "The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines", retrieved on July 16, 2009.
[15] Google, "The Case for Energy-Proportional Computing", retrieved on July 16, 2009.
[16] http://en.wikipedia.org/wiki/PowerNow!, retrieved on July 16, 2009.
[17] http://en.wikipedia.org/wiki/SpeedStep, retrieved on July 16, 2009.
[18] SUN, "Sun's Throughput Servers: Paradigm Shift in Processor Design Drives Improved Business Value", retrieved on July 16, 2009.
[19] IBM, "Storage Modeling for Power Estimation", retrieved on July 16, 2009.
[20] http://en.wikipedia.org/wiki/Green_computing, retrieved on July 16, 2009.
[21] Seagate, "2.5-Inch Enterprise Disc Drives: Key to Cutting Data Center Costs", retrieved on July 16, 2009.
[22] Google, "Power-Aware Micro-architecture: Design and Modeling Challenges for Next-Generation Microprocessors", retrieved on July 16, 2009.
[23] Google, "MapReduce: Simplified Data Processing on Large Clusters", retrieved on July 16, 2009.
[24] Google, "Power Provisioning for a Warehouse-sized Computer", retrieved on July 16, 2009.
[25] IBM, "IBM BladeCenter HS22 Technical Introduction", retrieved on July 16, 2009.
[26] IBM, "IBM BladeCenter Products and Technology", retrieved on July 16, 2009.
[27] http://www-03.ibm.com/systems/virtualization/, retrieved on July 16, 2009.
[28] Green Grid, "Five Ways to Reduce Data Center Server Power Consumption", retrieved on July 16, 2009.
More Related Content

What's hot

A Review on Image Compression in Parallel using CUDA
A Review on Image Compression in Parallel using CUDAA Review on Image Compression in Parallel using CUDA
A Review on Image Compression in Parallel using CUDAIJERD Editor
 
Cache Optimization Techniques for General Purpose Graphic Processing Units
Cache Optimization Techniques for General Purpose Graphic Processing UnitsCache Optimization Techniques for General Purpose Graphic Processing Units
Cache Optimization Techniques for General Purpose Graphic Processing UnitsVajira Thambawita
 
Phasor data concentrator or i pdc
Phasor data concentrator or i pdcPhasor data concentrator or i pdc
Phasor data concentrator or i pdcNitesh Pandit
 
Mainmemoryfinalprefinal 160927115742
Mainmemoryfinalprefinal 160927115742Mainmemoryfinalprefinal 160927115742
Mainmemoryfinalprefinal 160927115742marangburu42
 
Monte Carlo G P U Jan2010
Monte  Carlo  G P U  Jan2010Monte  Carlo  G P U  Jan2010
Monte Carlo G P U Jan2010John Holden
 
Volume 2-issue-6-2040-2045
Volume 2-issue-6-2040-2045Volume 2-issue-6-2040-2045
Volume 2-issue-6-2040-2045Editor IJARCET
 
Spie2006 Paperpdf
Spie2006 PaperpdfSpie2006 Paperpdf
Spie2006 PaperpdfFalascoj
 
Porting MPEG-2 files on CerberO, a framework for FPGA based MPSoc
Porting MPEG-2 files on CerberO, a framework for FPGA based MPSocPorting MPEG-2 files on CerberO, a framework for FPGA based MPSoc
Porting MPEG-2 files on CerberO, a framework for FPGA based MPSocadnanfaisal
 
Ali.Kamali-MSc.Thesis-SFU
Ali.Kamali-MSc.Thesis-SFUAli.Kamali-MSc.Thesis-SFU
Ali.Kamali-MSc.Thesis-SFUAli Kamali
 
VTU 5TH SEM CSE OPERATING SYSTEMS SOLVED PAPERS
VTU 5TH SEM CSE OPERATING SYSTEMS SOLVED PAPERSVTU 5TH SEM CSE OPERATING SYSTEMS SOLVED PAPERS
VTU 5TH SEM CSE OPERATING SYSTEMS SOLVED PAPERSvtunotesbysree
 

What's hot (18)

A Review on Image Compression in Parallel using CUDA
A Review on Image Compression in Parallel using CUDAA Review on Image Compression in Parallel using CUDA
A Review on Image Compression in Parallel using CUDA
 
Cache Optimization Techniques for General Purpose Graphic Processing Units
Cache Optimization Techniques for General Purpose Graphic Processing UnitsCache Optimization Techniques for General Purpose Graphic Processing Units
Cache Optimization Techniques for General Purpose Graphic Processing Units
 
Phasor data concentrator or i pdc
Phasor data concentrator or i pdcPhasor data concentrator or i pdc
Phasor data concentrator or i pdc
 
Mainmemoryfinalprefinal 160927115742
Mainmemoryfinalprefinal 160927115742Mainmemoryfinalprefinal 160927115742
Mainmemoryfinalprefinal 160927115742
 
20120140505010
2012014050501020120140505010
20120140505010
 
Monte Carlo G P U Jan2010
Monte  Carlo  G P U  Jan2010Monte  Carlo  G P U  Jan2010
Monte Carlo G P U Jan2010
 
GPGPU_report_v3
GPGPU_report_v3GPGPU_report_v3
GPGPU_report_v3
 
Lec04 gpu architecture
Lec04 gpu architectureLec04 gpu architecture
Lec04 gpu architecture
 
Volume 2-issue-6-2040-2045
Volume 2-issue-6-2040-2045Volume 2-issue-6-2040-2045
Volume 2-issue-6-2040-2045
 
How a cpu works1
How a cpu works1How a cpu works1
How a cpu works1
 
Spie2006 Paperpdf
Spie2006 PaperpdfSpie2006 Paperpdf
Spie2006 Paperpdf
 
Memperf
MemperfMemperf
Memperf
 
Porting MPEG-2 files on CerberO, a framework for FPGA based MPSoc
Porting MPEG-2 files on CerberO, a framework for FPGA based MPSocPorting MPEG-2 files on CerberO, a framework for FPGA based MPSoc
Porting MPEG-2 files on CerberO, a framework for FPGA based MPSoc
 
Ali.Kamali-MSc.Thesis-SFU
Ali.Kamali-MSc.Thesis-SFUAli.Kamali-MSc.Thesis-SFU
Ali.Kamali-MSc.Thesis-SFU
 
Memory management
Memory managementMemory management
Memory management
 
VTU 5TH SEM CSE OPERATING SYSTEMS SOLVED PAPERS
VTU 5TH SEM CSE OPERATING SYSTEMS SOLVED PAPERSVTU 5TH SEM CSE OPERATING SYSTEMS SOLVED PAPERS
VTU 5TH SEM CSE OPERATING SYSTEMS SOLVED PAPERS
 
GPU Programming
GPU ProgrammingGPU Programming
GPU Programming
 
Modern processors
Modern processorsModern processors
Modern processors
 

Viewers also liked

Viewers also liked (11)

ORM
ORMORM
ORM
 
skimming and scanning
skimming and scanningskimming and scanning
skimming and scanning
 
insights-servfile (5)
insights-servfile (5)insights-servfile (5)
insights-servfile (5)
 
Packing tape sale
Packing tape salePacking tape sale
Packing tape sale
 
L 10 dialogue
L 10 dialogueL 10 dialogue
L 10 dialogue
 
Mobile
MobileMobile
Mobile
 
What is diabetes
What is diabetesWhat is diabetes
What is diabetes
 
Plastic roads
Plastic roadsPlastic roads
Plastic roads
 
Presentazione ITALSIMPATIA ENGLISH
Presentazione ITALSIMPATIA ENGLISHPresentazione ITALSIMPATIA ENGLISH
Presentazione ITALSIMPATIA ENGLISH
 
Créer application mobile
Créer application mobileCréer application mobile
Créer application mobile
 
Lesson 9 dialogue
Lesson 9 dialogueLesson 9 dialogue
Lesson 9 dialogue
 

Similar to Map SMAC Algorithm onto GPU

Optimizing Apple Lossless Audio Codec Algorithm using NVIDIA CUDA Architecture
Optimizing Apple Lossless Audio Codec Algorithm using NVIDIA CUDA Architecture Optimizing Apple Lossless Audio Codec Algorithm using NVIDIA CUDA Architecture
Optimizing Apple Lossless Audio Codec Algorithm using NVIDIA CUDA Architecture IJECEIAES
 
Architecture exploration of recent GPUs to analyze the efficiency of hardware...
Architecture exploration of recent GPUs to analyze the efficiency of hardware...Architecture exploration of recent GPUs to analyze the efficiency of hardware...
Architecture exploration of recent GPUs to analyze the efficiency of hardware...journalBEEI
 
Revisiting Co-Processing for Hash Joins on the Coupled Cpu-GPU Architecture
Revisiting Co-Processing for Hash Joins on the CoupledCpu-GPU ArchitectureRevisiting Co-Processing for Hash Joins on the CoupledCpu-GPU Architecture
Revisiting Co-Processing for Hash Joins on the Coupled Cpu-GPU Architecturemohamedragabslideshare
 
Nvidia (History, GPU Architecture and New Pascal Architecture)
Nvidia (History, GPU Architecture and New Pascal Architecture)Nvidia (History, GPU Architecture and New Pascal Architecture)
Nvidia (History, GPU Architecture and New Pascal Architecture)Saksham Tanwar
 
Efficient algorithm for rsa text encryption using cuda c
Efficient algorithm for rsa text encryption using cuda cEfficient algorithm for rsa text encryption using cuda c
Efficient algorithm for rsa text encryption using cuda ccsandit
 
Efficient algorithm for rsa text encryption using cuda c
Efficient algorithm for rsa text encryption using cuda cEfficient algorithm for rsa text encryption using cuda c
Efficient algorithm for rsa text encryption using cuda ccsandit
 
gpuprogram_lecture,architecture_designsn
gpuprogram_lecture,architecture_designsngpuprogram_lecture,architecture_designsn
gpuprogram_lecture,architecture_designsnARUNACHALAM468781
 
GPGPU programming with CUDA
GPGPU programming with CUDAGPGPU programming with CUDA
GPGPU programming with CUDASavith Satheesh
 
Conference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environmentConference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environmentEricsson
 
GPU Architecture NVIDIA (GTX GeForce 480)
GPU Architecture NVIDIA (GTX GeForce 480)GPU Architecture NVIDIA (GTX GeForce 480)
GPU Architecture NVIDIA (GTX GeForce 480)Fatima Qayyum
 
Volume 2-issue-6-2040-2045
Volume 2-issue-6-2040-2045Volume 2-issue-6-2040-2045
Volume 2-issue-6-2040-2045Editor IJARCET
 
Nt1310 Unit 3 Computer Components
Nt1310 Unit 3 Computer ComponentsNt1310 Unit 3 Computer Components
Nt1310 Unit 3 Computer ComponentsKristi Anderson
 
GPU Rigid Body Simulation GDC 2013
GPU Rigid Body Simulation GDC 2013GPU Rigid Body Simulation GDC 2013
GPU Rigid Body Simulation GDC 2013ecoumans
 
Accelerating Real Time Applications on Heterogeneous Platforms
Accelerating Real Time Applications on Heterogeneous PlatformsAccelerating Real Time Applications on Heterogeneous Platforms
Accelerating Real Time Applications on Heterogeneous PlatformsIJMER
 
Professional Project - C++ OpenCL - Platform agnostic hardware acceleration ...
Professional Project - C++ OpenCL - Platform agnostic hardware acceleration  ...Professional Project - C++ OpenCL - Platform agnostic hardware acceleration  ...
Professional Project - C++ OpenCL - Platform agnostic hardware acceleration ...Callum McMahon
 
AVOIDING DUPLICATED COMPUTATION TO IMPROVE THE PERFORMANCE OF PFSP ON CUDA GPUS
AVOIDING DUPLICATED COMPUTATION TO IMPROVE THE PERFORMANCE OF PFSP ON CUDA GPUSAVOIDING DUPLICATED COMPUTATION TO IMPROVE THE PERFORMANCE OF PFSP ON CUDA GPUS
AVOIDING DUPLICATED COMPUTATION TO IMPROVE THE PERFORMANCE OF PFSP ON CUDA GPUScscpconf
 
Achieving Improved Performance In Multi-threaded Programming With GPU Computing
Achieving Improved Performance In Multi-threaded Programming With GPU ComputingAchieving Improved Performance In Multi-threaded Programming With GPU Computing
Achieving Improved Performance In Multi-threaded Programming With GPU ComputingMesbah Uddin Khan
 
Graphics processing unit ppt
Graphics processing unit pptGraphics processing unit ppt
Graphics processing unit pptSandeep Singh
 
Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Brendan Gregg
 
Topics - , Addressing modes, GPU, .pdf
Topics - , Addressing modes, GPU,  .pdfTopics - , Addressing modes, GPU,  .pdf
Topics - , Addressing modes, GPU, .pdfShubhamSinghRajput46
 

Similar to Map SMAC Algorithm onto GPU (20)

Optimizing Apple Lossless Audio Codec Algorithm using NVIDIA CUDA Architecture
Optimizing Apple Lossless Audio Codec Algorithm using NVIDIA CUDA Architecture Optimizing Apple Lossless Audio Codec Algorithm using NVIDIA CUDA Architecture
Optimizing Apple Lossless Audio Codec Algorithm using NVIDIA CUDA Architecture
 
Architecture exploration of recent GPUs to analyze the efficiency of hardware...
Architecture exploration of recent GPUs to analyze the efficiency of hardware...Architecture exploration of recent GPUs to analyze the efficiency of hardware...
Architecture exploration of recent GPUs to analyze the efficiency of hardware...
 
Revisiting Co-Processing for Hash Joins on the Coupled Cpu-GPU Architecture
Revisiting Co-Processing for Hash Joins on the CoupledCpu-GPU ArchitectureRevisiting Co-Processing for Hash Joins on the CoupledCpu-GPU Architecture
Revisiting Co-Processing for Hash Joins on the Coupled Cpu-GPU Architecture
 
Nvidia (History, GPU Architecture and New Pascal Architecture)
Nvidia (History, GPU Architecture and New Pascal Architecture)Nvidia (History, GPU Architecture and New Pascal Architecture)
Nvidia (History, GPU Architecture and New Pascal Architecture)
 
Efficient algorithm for rsa text encryption using cuda c
Efficient algorithm for rsa text encryption using cuda cEfficient algorithm for rsa text encryption using cuda c
Efficient algorithm for rsa text encryption using cuda c
 
Efficient algorithm for rsa text encryption using cuda c
Efficient algorithm for rsa text encryption using cuda cEfficient algorithm for rsa text encryption using cuda c
Efficient algorithm for rsa text encryption using cuda c
 
gpuprogram_lecture,architecture_designsn
gpuprogram_lecture,architecture_designsngpuprogram_lecture,architecture_designsn
gpuprogram_lecture,architecture_designsn
 
GPGPU programming with CUDA
GPGPU programming with CUDAGPGPU programming with CUDA
GPGPU programming with CUDA
 
Conference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environmentConference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environment
 
GPU Architecture NVIDIA (GTX GeForce 480)
GPU Architecture NVIDIA (GTX GeForce 480)GPU Architecture NVIDIA (GTX GeForce 480)
GPU Architecture NVIDIA (GTX GeForce 480)
 
Volume 2-issue-6-2040-2045
Volume 2-issue-6-2040-2045Volume 2-issue-6-2040-2045
Volume 2-issue-6-2040-2045
 
Nt1310 Unit 3 Computer Components
Nt1310 Unit 3 Computer ComponentsNt1310 Unit 3 Computer Components
Nt1310 Unit 3 Computer Components
 
GPU Rigid Body Simulation GDC 2013
GPU Rigid Body Simulation GDC 2013GPU Rigid Body Simulation GDC 2013
GPU Rigid Body Simulation GDC 2013
 
Accelerating Real Time Applications on Heterogeneous Platforms
Accelerating Real Time Applications on Heterogeneous PlatformsAccelerating Real Time Applications on Heterogeneous Platforms
Accelerating Real Time Applications on Heterogeneous Platforms
 
Professional Project - C++ OpenCL - Platform agnostic hardware acceleration ...
Professional Project - C++ OpenCL - Platform agnostic hardware acceleration  ...Professional Project - C++ OpenCL - Platform agnostic hardware acceleration  ...
Professional Project - C++ OpenCL - Platform agnostic hardware acceleration ...
 
AVOIDING DUPLICATED COMPUTATION TO IMPROVE THE PERFORMANCE OF PFSP ON CUDA GPUS
AVOIDING DUPLICATED COMPUTATION TO IMPROVE THE PERFORMANCE OF PFSP ON CUDA GPUSAVOIDING DUPLICATED COMPUTATION TO IMPROVE THE PERFORMANCE OF PFSP ON CUDA GPUS
AVOIDING DUPLICATED COMPUTATION TO IMPROVE THE PERFORMANCE OF PFSP ON CUDA GPUS
 
Achieving Improved Performance In Multi-threaded Programming With GPU Computing
Achieving Improved Performance In Multi-threaded Programming With GPU ComputingAchieving Improved Performance In Multi-threaded Programming With GPU Computing
Achieving Improved Performance In Multi-threaded Programming With GPU Computing
 
Graphics processing unit ppt
Graphics processing unit pptGraphics processing unit ppt
Graphics processing unit ppt
 
Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)
 
Topics - , Addressing modes, GPU, .pdf
Topics - , Addressing modes, GPU,  .pdfTopics - , Addressing modes, GPU,  .pdf
Topics - , Addressing modes, GPU, .pdf
 

Map SMAC Algorithm onto GPU

  • 1. Mapping SMAC Algorithm onto GPU Student: Zhengjie Lu Supervisor: Dr. Ir. Bart Mesman Ir. Yifan He Prof. Dr. Ir. Richard Kleihorst
  • 2. Contents 1. Background ...............................................................................................................................................3 1.1 GPU programming ..............................................................................................................................3 1.2 SMAC algorithm ..................................................................................................................................3 2. Implementation ........................................................................................................................................6 2.1 General Structure................................................................................................................................6 2.2 SMAC on CPU......................................................................................................................................7 2.3 SMAC on GPU......................................................................................................................................9 3. Experiment..............................................................................................................................................12 3.1 Experiment Environment..................................................................................................................12 3.2 Experiment Setup..............................................................................................................................13 3.3 Experiment Result.............................................................................................................................14 3.3.1 GPU improvement .....................................................................................................................14 3.3.2 Linear execution-time model.....................................................................................................15 4. Roofline Model Analysis..........................................................................................................................17 4.1 Roofline Model..................................................................................................................................17 4.2 Application........................................................................................................................................17 5. Conclusion...............................................................................................................................................19 Acknowledgement......................................................................................................................................20 Appendix .....................................................................................................................................................21
  • 3. 1. Background 1.1 SMAC algorithm SMAC is short for the “Simplified Method for Atmospheric Correction”, which is specially used in computing the atmospheric correction of satellite measurements in the solar spectrum. It is popular in the remote sensing application because it is several hundred times faster than more detailed radiative transfer models like 5S [3]. Figure 1.1 explains a black-box model of SMAC. The input of 9 parameters will be taken into SMAC and then a single output will be generated. SMAC Algorithm Float sza; Float sva; Float vza; Float vva; Float taup550; Float uh2o; Float uo3; Float airPressure; Float r_toa. Float r_surfRecycle Output: Input: Fig1.1 SMAC black box model SMAC is computational fast in its peer, but it still takes considerable amounts of time in processing large-size data that is common in the remote sensing applications. Figure 1.2gives the profiles into SMAC when it is processing a data size of 231240 bytes on CPU. The file IO operation is dominant in this case (about 75%), while the CPU computation time is also significant (about 25%). Since the file IO performance can be improved by introducing some faster hard disks, the CPU computation will become the bottleneck later. This motivates us to map SMAC onto a commercial GPU (i.e. NVIDIA graphic card) and see how much computation performance improvement we can achieve. Fig1.2 Profiles of the original SMAC program
  • 4. 1.2 GPU programming The GPU programming is introduced into the field of Scientific Computation after its success in accelerating the computer graphic processing. The hardware advantage of many parallel processing cores per GPU (normally the number is at least 32 processing cores per GPU) makes it capable to deal with massive data. The disadvantages of GPU programming are that: (1) the programmers are required to have the knowledge about the hardware (especially the memory accessing pattern) so that they can manipulate GPU efficiently, and (2) the pipeline penalty is dramatically huge if the branch predictions miss. Normally, an efficient program is organized in such a way that GPU is responsible for the massive mathematical computation while CPU takes charge of the logical operations and controlling. NVIDIA develops a GPU programming technology named “Compute Unified Device Architecture” (CUDA) for its own graphic card products, and it is the most popular one in the state-of-the-art GPU programming. The CUDA supported GPUs can be found in NVIDIA’s official website *1+ and they have covered the latest products of NVIDIA with more than 100 cores per GPU. In our paragraph, we will refer CUDA programming as GPU programming for the NVIDIA graphic cards. It is essential to explain the NVIDIA GPU hardware architecture before we discuss about CUDA programming, because CUDA programming is actually a collection of regulations for operating the hardware in the most efficient way. Figure 1.3 shows the overview of NVIDIA GeForce 8800GT GPU. Every 8 stream processors (SP) is organized together in a stream multi-processor (SM). 14 SMs consists of the main body of a GPU. Inside each SM, there are 8192 registers and also a shared memory with the size 16384 bytes. The shared memory is used for the local communication within the 8 SPs. A global memory is connected with each SM to make the global communication. It should be pointed out that the access to the global memory is rather slow and that to the shared memory is rather fast. This indicates that we should play with the shared memory much more than the global memory, to achieve a better performance. Fig1.3 NVIDIA 8800GT architecure The basic concept in CUDA programming is the single-instruction-multiple-threads (SIMT), which means that all the active threads will perform an identical instruction in each execution time [2]. Each active thread is assigned to a unique SP so that the physically parallel-threading is achieved. Also each thread
  • 5. will has its own registers to keep its status and they can communicate with each other through the shared memory. The second concept in CUDA programming is the thread and block. Each block is consisting of multiple threads shown in figure 1.4, and the number of threads per block is limited by the physical available registers number and shared memory size. Each block will be assigned to a SM inside which 8 SPs are integrated. A single block can be only assigned to a single SM, while a single SM can hold many blocks. Fig1.4 Block and threads Fig1.5 Stream Execution The third concept in CUDA programming is the warp, which is the basic thread scheduling unit. 32 threads in a block will be organized as a warp and then simultaneously assigned to this block’s responding SM by the scheduler. If the thread number in a block is not the times of 32, then some dummy threads will be appended to this block to make the thread number as the times of 32. The forth concept in CUDA programming is the concurrent-copy-and-execution, or so-called "stream" execution. The input data is broken down into several segmentations with the same length. Then the data segmentations (so-called data streams) are transferred from the CPU memory to the GPU memory one by one. A data stream can be processed by the GPU kernel can process once it has been transferred completely, without waiting for the completion of other data stream transmissions. An explanation on the stream execution is given in figure 1.5. The fifth concept in CUDA programming is the memory access coalescence. This means that the access pattern to the global memory should be viewed in terms of aligned segments of 16 and 32 words. The addressing pattern must also be aligned to 16 or 32 words. As a short conclusion, the program has to be mapped to the NVIDIA GPU hardware with the CUDA regulations.
  • 6. 2. Implementation 2.1 General Structure SMAC algorithm is consisting of 14 steps, as shown in figure 2.1. The first step as a "data filter" is a conditional branching, through which only the valid data can be passed to the computations later. Each computation would just rely on the input of its previous ones, as shown in the data dependency graph in figure 2.2. All computations in SMAC are arithmetic calculations, including trigonometric functions, exponential functions and etc. Also several computations contain the if-else conditions and these branching will be replaced with the equivalent logical expression when they are mapped onto GPU. The implementation of SMAC on CPU is programmed in ASIC C++, while the one on GPU programmed in C++ and C. Calculation: step 1 Valid vector? Calculation: step 2 Parameters setup Calculation: step 11 Calculation: step 3 Calculation: step 4 Calculation: step 5 Calculation: step 6 Calculation: step 7 Calculation: step 8 Calculation: step 9 Calculation: step 10 Yes No Start Calculation: step 12 Fig.2.1 Overview of SMAC kernel
  • 7. 1 2 3 4 5 6 7 8 9 10 11 12 p p 1 Parameters Setup Calculation Step 1 Fig.2.2 Data dependency Graph of SMAC 2.2 SMAC on CPU A single thread is employed as the execution model of SMAC on CPU, in which the SMAC kernel has to read through all the input and then generate out the final results. One input vector consisting of 9 float point number can just be used to produce one output data consisting of 1 float point number. No any data dependencies exist between different input vectors and neither do the outputs. The completely execution model is shown in figure 2.3. The data flow in this case is quite simple, which is shown in figure 2.4. Both the input data and coefficients are read from the files on the hard disk to the CPU memory. Then CPU takes them into its registers and throws out the final results into the CPU memory.
  • 8. vector[0] vector[1] vector[2] … vector[n-1] SMAC kernel Float sza; Float sva; Float vza; Float vva; Float taup550; Float uh2o; Float uo3; Float airPressure; Float r_toa. Float r_surfRecycle[0]; Float r_surfRecycle[1]; Float r_surfRecycle[2]; Float r_surfRecycle[3]; … Float r_surfRecycle[n-1]. Input Output Fig.2.3 Execution model of SMAC on CPU Fig.2.4 Data flow of SMAC on CPU The original SMAC program employs 3 classes: (1) SmacAlgorithm Class, (2) Coefficients Class and (3) CoefficientsFile Class. SmacAlgorithm Class functions the kernel in which the SMAC algorithm is fully implemented, while the other two manage the access to the coefficient file. An additional SimData Class is now included to manipulate the input and output data. The relations among all the classes are listed in figure 2.5. Because the validness of input data can be identified as soon as they are read in, the “data filter” in the SMAC kernel can be immediately achieved in SimData Class instead. It can both save the memory usage and reduce the processing time in the kernel. Also the GPU can get rid of the conditional branching when mapping SMAC on it. The flow chart in figure 2.6 explains the implementation procedures. The coefficients and the input data are read firstly, and then passed to the SmacAlgorithm instance for computation. The computational results will be collected by the SimData instance which is the output parameter of the SmacAlgorithm instance.
  • 9. SmacAlgorithm class Coefficients class CoefficientsFile class File Operation Algorithm Execution I/O Interface SimData class Satellite Sensors Coefficients File Satellite Data File Earth Surface Reflectance File Coefficients::setC oefficients() SimData::readData() SmacAlgthm::SmacAlgthm() SmacAlgthm: run() ENTRY EXIT Fig.2.5 Program structure of SMAC on CPU Fig.2.6 Flow chart of SMAC on CPU 2.3 SMAC on GPU The execution model of SMAC on GPU is benefiting from the multiple threads. Each GPU thread is an instance of the SMAC kernel so that multiple input vectors can be processed simultaneously, as shown in figure 2.7. This is the main benefits of employing GPU. Vector[0] Vector[1] Vector[2] … Vector[n-1] SMAC kernel 0 Float sza; Float sva; Float vza; Float vva; Float taup550; Float uh2o; Float uo3; Float airPressure; Float r_toa. Float r_surfRecycle[0]; Float r_surfRecycle[1]; Float r_surfRecycle[2]; Float r_surfRecycle[3]; … Float r_surfRecycle[n-1]. Input Output SMAC kernel 1 SMAC kernel 2 ... GPU Fig.2.7 Execution model of SMAC on GPU The data flow of SMAC on GPU is quite different from the one on CPU as shown in figure 2.8. The input data need to be transferred from the CPU memory to the GPU memory and then the results copied back in reverse. The constant memory in GPU is employed as the cache, for accessing the frequently used coefficients in all SMs.
• 10. Fig.2.8 Data flow of SMAC on GPU (hard disk, CPU memory, GPU global memory, GPU constant memory for the coefficient copy, and GPU registers) The program structure of SMAC on GPU is based on the CPU version, with the SMAC kernel mapped directly onto the GPU. Two GPU-related modules are added to the program structure, as shown in figure 2.9: "GPU_kernel.cu" implements the SMAC kernel that executes on the GPU, while "GPU.cu" controls the GPU memory operations and the kernel execution. (Fig.2.9 Program structure of SMAC on GPU) The flow chart of SMAC on GPU is likewise similar to the CPU one, except that the GPU-related functions are invoked, as shown in figure 2.10. One obvious change is that the input data are transferred from the CPU memory to the GPU memory before the SMAC kernel runs, and the output data are copied back from the GPU memory to the CPU memory afterwards. In addition, the input data have to be reorganized into a layout that allows coalesced access to the GPU memory.
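In practice such a reorganization usually means transposing the host-side array-of-structures into a structure-of-arrays buffer. The sketch below shows one way to do that, matching the indexing convention assumed in the kernel sketch above; the real reorganizeInput() in GPU.cu may differ in detail.

```cuda
// Hedged sketch of input reorganization for coalesced access: after the
// transpose, threads i, i+1, ... load consecutive addresses when they read
// the same parameter.
void reorganizeInputSoA(const SmacInput* aos, float* soa, int n)
{
    for (int i = 0; i < n; ++i) {
        soa[0 * n + i] = aos[i].sza;
        soa[1 * n + i] = aos[i].sva;
        soa[2 * n + i] = aos[i].vza;
        soa[3 * n + i] = aos[i].vva;
        soa[4 * n + i] = aos[i].taup550;
        soa[5 * n + i] = aos[i].uh2o;
        soa[6 * n + i] = aos[i].uo3;
        soa[7 * n + i] = aos[i].airPressure;
        soa[8 * n + i] = aos[i].r_toa;
    }
}
```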
• 11. Fig.2.10 Flow chart of SMAC on GPU (reorganizeInput(), cudaMemcpyAsync(), GPU_kernel<<<...>>>(), cudaMemcpyAsync(), reorganizeOutput(), embedded in the CPU flow of Coefficients::setCoefficients(), SimData::readData(), SmacAlgthm::SmacAlgthm() and SmacAlgthm::run()) As introduced in section 1.1, three further GPU programming concerns have to be addressed: threads, blocks and streams. The profiling tool CUDAPROF reports that 59 registers are needed per SMAC kernel thread. Given this register requirement, the occupancy tool CUDACALCULATOR indicates that at most 192 threads per block can be used. Once a block of 192 threads is executing on an SM, no other block can be assigned to that SM until the running block has finished. Since the GPU in our experiment has 4 SMs, 4 blocks are enough to fully utilize the GPU. Using more blocks would probably introduce extra overhead from block switching, while using fewer blocks would simply waste hardware resources. The number of streams is explored later in our experiment.
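The host-side control corresponding to figure 2.10 might look roughly like the sketch below, which launches the smacKernelGpu sketch from above with 192 threads per block and splits the work over several streams so that copies and kernels of different chunks can overlap. Buffer and function names are hypothetical; h_in/h_out are assumed to be pinned (cudaHostAlloc), and each chunk is assumed to have been reorganized into its own SoA block of 9 * chunk floats.

```cuda
// Hedged sketch of the host-side control in GPU.cu (not the actual code).
#include <cuda_runtime.h>
#include <vector>

void launchSmac(const float* h_in, float* h_out,
                float* d_in, float* d_out,
                int n, int nStreams)
{
    const int threadsPerBlock = 192;          // fits the 59-register budget
    const int chunk = n / nStreams;           // assume n divisible by nStreams

    std::vector<cudaStream_t> streams(nStreams);
    for (auto& s : streams) cudaStreamCreate(&s);

    for (int s = 0; s < nStreams; ++s) {
        const int offset = s * chunk;
        const int blocks = (chunk + threadsPerBlock - 1) / threadsPerBlock;

        cudaMemcpyAsync(d_in + 9 * offset, h_in + 9 * offset,
                        9 * chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        smacKernelGpu<<<blocks, threadsPerBlock, 0, streams[s]>>>(
            d_in + 9 * offset, d_out + offset, chunk);
        cudaMemcpyAsync(h_out + offset, d_out + offset,
                        chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();                  // wait for all streams
    for (auto& s : streams) cudaStreamDestroy(s);
}
```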
• 12. 3. Experiment 3.1 Experiment Environment We run the SMAC program on a laptop workstation equipped with both an Intel dual-core CPU and an NVIDIA GPU. The GPU is mounted on the laptop motherboard through the PCI Express interface. The operating system is 32-bit Windows Vista Enterprise with CUDA 2.2 support. The load on both the CPU and the GPU is low before our experiment. The detailed experiment environment is listed below.
Table 3.1 Experiment environment
HARDWARE: CPU: Intel(R) Core(TM)2 Duo T9300, 2.5 GHz x 2; GPU: NVIDIA Quadro FX570M, 0.95 GHz x 32; Main memory: 4 GB; Motherboard interface: PCI Express 1.0 x16.
SOFTWARE: Operating system: Windows Vista Enterprise 32-bit; CUDA version: CUDA 2.2; GPU maximum registers per thread: 60; GPU thread number: 192 x 4 (#threads per block x #blocks); CPU thread number: 1.
Fig.3.1 Time profiling method (start/stop timers attached around the SmacAlgthm instance, after Coefficients::setCoefficients() and SimData::readData())
• 13. To profile the execution time of SMAC on CPU or on GPU, timers are attached at the two ends of the algorithm instance, as shown in figure 3.1. In our experiment we are concerned with the SMAC kernel performance rather than the application performance, because the latter is dominated by the hard disk I/O speed, as already given in table 1.1, and can be overcome by using a hard disk with higher I/O speed. 3.2 Experiment Setup As indicated in figure 3.1, the timing profile of the SMAC kernel is defined as the difference between the "start timer" and the "stop timer": CPU time = CPU stop timer - CPU start timer, and GPU time = GPU stop timer - GPU start timer. We define the performance improvement as Improvement = CPU time / GPU time, in which the linear execution-time model is used: CPU time = CPU overhead + Bytes x CPU speed; GPU time = GPU memory time + GPU run time = (GPU memory overhead + Bytes x GPU memory speed) + (GPU kernel overhead + Bytes x GPU kernel speed) = (GPU memory overhead + GPU kernel overhead) + Bytes x (GPU memory speed + GPU kernel speed) = GPU overhead + Bytes x GPU speed. The performance improvement can therefore be expressed as Improvement = (CPU overhead + Bytes x CPU speed) / (GPU overhead + Bytes x GPU speed) ≈ (Bytes x CPU speed) / (Bytes x GPU speed) = CPU speed / GPU speed. It should be pointed out that this approximation only holds when the data size is very large. It follows that the ultimate improvement depends only on the CPU and GPU speeds and has nothing to do with the data size. We will apply the linear execution-time model to predict the GPU performance later.
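Conceptually, the profiling of figure 3.1 is a wall-clock timer wrapped around the algorithm instance. Below is a minimal sketch of such a timer, using the hypothetical helper functions from the earlier sketches; the actual report used its own timer placement inside the program.

```cuda
// Minimal wall-clock timing sketch. The GPU measurement deliberately wraps
// reorganization, the host<->device copies and the kernel, matching the
// "GPU memory time + GPU run time" decomposition above.
#include <chrono>

template <typename Work>
double timeMs(Work&& work)
{
    auto start = std::chrono::steady_clock::now();   // "start timer"
    work();
    auto stop = std::chrono::steady_clock::now();    // "stop timer"
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Usage (sketch):
//   double cpuMs = timeMs([&] { runSmacOnCpu(in, out, n, coeff); });
//   double gpuMs = timeMs([&] { reorganizeInputSoA(in, soa, n);
//                               launchSmac(soa, out, d_in, d_out, n, nStreams); });
//   double improvement = cpuMs / gpuMs;
```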
• 14. 3.3 Experiment Result 3.3.1 GPU improvement Table 3.2 and figure 3.2 record the performance improvement measured in our experiment. For small data sizes, the improvement curves in figure 3.2 drop as the number of streams increases, because the data size is too small to cover the additional overhead. As the data size grows, the slopes of the curves become steeper, since the growing data size hides the overhead. Eventually, once the data size exceeds a certain threshold, all curves behave similarly or even overlap: increasing the data size or the number of streams further does not help much. Table 3.2 Performance improvement: CPU time/GPU time
• 15. Fig.3.2 GPU performance improvement: CPU time/GPU time 3.3.2 Linear execution-time model In our earlier tests, the overhead and processing speed were obtained and are recorded in table 3.3. The GPU overhead is relatively large, while the CPU overhead is too small to measure. The significant GPU overhead comes from the data having to be reorganized and then transferred both before and after the GPU kernel execution. It is also clear that the GPU speed is at least 10 times higher than the CPU speed.
Table 3.3 Parameters of the linear execution-time model
CPU time (ms) = CPU overhead (ms) + data size (byte) x CPU speed (ms/byte), with CPU overhead 0 and CPU speed 5.39 x 10^-5.
GPU time (ms) = GPU overhead (ms) + data size (byte) x GPU speed (ms/byte), with GPU overhead 1.67 and GPU speed 2.41 x 10^-6 for 1 stream, and GPU overhead 4.45 and GPU speed 2.01 x 10^-6 for 8 streams.
Both the predicted and the experimental improvement of the 1-stream GPU performance are plotted in figure 3.3. Some noticeable deviation occurs between the two curves when the data size is small, but the curves are almost identical when the data size is large. Finally, the curves approach a "ceiling" regardless of the data size.
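As a quick check, plugging the largest data size of the experiment into the 1-stream model reproduces the predicted improvement listed with figures 3.3 and 3.4. This is only a worked example; the constants are taken directly from table 3.3.

```cuda
// Worked check of the linear execution-time model (1-stream parameters).
// For 3,732,480 bytes this yields about 18.9x, in line with the 18.85
// predicted for that data size.
#include <cstdio>

int main()
{
    const double cpuOverheadMs = 0.0,  cpuSpeedMsPerByte = 5.39e-5;
    const double gpuOverheadMs = 1.67, gpuSpeedMsPerByte = 2.41e-6;
    const double bytes = 3732480.0;

    const double cpuMs = cpuOverheadMs + bytes * cpuSpeedMsPerByte;  // ~201.2 ms
    const double gpuMs = gpuOverheadMs + bytes * gpuSpeedMsPerByte;  // ~10.7 ms
    std::printf("predicted improvement = %.2f\n", cpuMs / gpuMs);    // ~18.86
    return 0;
}
```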
• 16. Fig.3.3 Single-stream GPU performance improvement A comparison for the 8-stream GPU performance improvement is shown in figure 3.4 and tells a similar story: the performance improvement ultimately settles at a constant, and no further gain is achieved even when the data size is increased. Fig.3.4 8-stream GPU performance improvement
1-stream GPU: Data size (byte) | Experimental improvement | Predicted improvement
116640 | 5.71 | 3.21806093
233280 | 8.6 | 5.626082853
466560 | 12.83 | 8.989388397
933120 | 14.79 | 12.82188881
1866240 | 19.28 | 16.29558697
3732480 | 20.54 | 18.84884576
8-stream GPU: Data size (byte) | Experimental improvement | Predicted improvement
116640 | 1.99 | 1.342894063
233280 | 3.49 | 2.557938816
466560 | 6.18 | 4.671162775
933120 | 9.95 | 7.958659158
1866240 | 14.75 | 12.27984849
3732480 | 19.24 | 16.85581952
• 17. 4. Roofline Model Analysis 4.1 Roofline Model The Roofline model gives an approximate but useful insight into the performance bottleneck [4]. When the instruction density (measured in Flops/Byte) is low enough that the attainable performance lies below the computational peak, as shown in figure 4.1, the bottleneck is the data transfer and more data should be brought in per computation; otherwise the bottleneck is likely the computation itself, and the computational approach should be reconsidered. Fig.4.1 Example of a Roofline model 4.2 Application To apply the Roofline model to our case, the GPU hardware specifications and the profile of the SMAC kernel (note: not the SMAC application) have to be obtained before the analysis. They are listed in table 4.1.
Table 4.1 Parameters of the Roofline model
Hardware (NVIDIA Quadro FX570M): PCI Express bandwidth 4 GB/s; peak performance 91.2 GFlops/s; peak performance without FMAU 30.4 GFlops/s.
Software (SMAC kernel on GPU): data size 59719680 Bytes; issued instruction number 4189335552 Flops; execution time 79.2 ms; instruction density 70.15 Flops/Byte; instruction throughput 52.8 GFlops/s.
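The derived quantities in table 4.1 follow directly from the measured counts. The short check below recomputes them using only the numbers in the table; it is a worked example, not part of the original tooling.

```cuda
// Recomputing the derived Roofline quantities of table 4.1:
// instruction density = Flops / Bytes, throughput = Flops / execution time.
#include <cstdio>

int main()
{
    const double flops = 4189335552.0;    // issued instructions (Flops)
    const double bytes = 59719680.0;      // data size (Bytes)
    const double timeS = 79.2e-3;         // execution time (s)

    std::printf("instruction density    = %.2f Flops/Byte\n", flops / bytes);       // ~70.15
    std::printf("instruction throughput = %.1f GFlops/s\n", flops / timeS / 1e9);   // ~52.9, consistent with table 4.1
    return 0;
}
```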
• 18. The performance of SMAC on GPU is now placed on the Roofline model as the blue marker in figure 4.2. If we consider only the kernel execution on the GPU, the bottleneck can be identified as the computation. More precisely, it results from the imbalance between floating-point multiplications and additions. This is partly because SMAC is mapped onto the GPU directly and the data dependencies inside SMAC remain; these dependencies limit the full utilization of the GPU's FMAUs (floating-point multiply-add units). The other reason is that only 192 threads per block are used in order to satisfy the register budget. Introducing more threads would force temporary operands to be stored in local memory, which is quite slow to access. Keeping the thread count at 192 avoids the use of local memory, but it also leaves some functional units permanently idle. The situation becomes worse once the I/O operations are included: the navy-blue sloping line represents the hard disk I/O bandwidth bottleneck, and it lies far beyond the GPU memory bandwidth limitation curve. The instruction throughput in that case is below 1 GFlops/s and is therefore not plotted in figure 4.2. In summary, the SMAC application as a whole is limited by the hard disk I/O. Fig.4.2 Roofline model of SMAC on GPU
• 19. 5. Conclusion and Future Work SMAC, a popular algorithm in the remote sensing field, has been successfully mapped onto a commercial programmable GPU with the help of CUDA technology. Since SMAC processes large data streams, it can exploit stream execution to improve performance. Our experiment results show that a speed-up of 25 times over the CPU can be achieved by the GPU. The linear execution-time model also proves useful in analyzing the GPU's stream execution. In addition, the Roofline model is employed to identify the bottleneck of SMAC on GPU: the SMAC kernel on the GPU is dominated by a computational bottleneck, while the complete application is limited by the hard disk I/O bandwidth. Only the former is of interest in this report. Two main causes produce the computational bottleneck of the SMAC kernel on the GPU: (1) the imbalance between floating-point multiplications and additions due to the data dependencies, and (2) the register pressure caused by the per-thread register requirement. The first could be relieved by decoupling the data dependencies inside the SMAC algorithm kernel, while the second could be solved by moving to finer-grained threads that require fewer registers. Fig.5.1 Diagram of GPU power measurement Fig.5.2 Physical layout of GPU power measurement The power consumption of SMAC on GPU also interests us as future work. Since no commercial PCI-E power measurement cards are available on the market, a customized approach has to be carried out. Figure 5.1 shows the measurement principle, in which the 5 V and 12 V power supply lines in the 4-pin PCI-E interface are measured separately. A 0.03 Ω resistor rated for 20 W is inserted into the 5 V supply line, and the current through it is calculated as the voltage drop across it divided by its resistance. The power delivered through the 5 V line is then obtained by multiplying 5 V by that current, and the 12 V line is measured in the same way. Finally, the two contributions are summed to obtain the GPU power consumption. Figure 5.2 shows the physical connections of the planned experiment.
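The planned power calculation amounts to two shunt measurements and a sum. The sketch below spells out the arithmetic; the voltage drops used are made-up placeholders purely for illustration, since no measurements exist yet.

```cuda
// Sketch of the planned shunt-based power calculation: the current through
// each 0.03-ohm shunt is V_drop / R, and a rail's power is its supply
// voltage times that current; the two rails are summed.
#include <cstdio>

int main()
{
    const double shuntOhms = 0.03;
    const double vDrop5V   = 0.045;   // hypothetical drop across the 5 V shunt (V)
    const double vDrop12V  = 0.060;   // hypothetical drop across the 12 V shunt (V)

    const double p5  = 5.0  * (vDrop5V  / shuntOhms);   // 5 V rail power (W)
    const double p12 = 12.0 * (vDrop12V / shuntOhms);   // 12 V rail power (W)
    std::printf("GPU power = %.1f W\n", p5 + p12);      // sum of both rails
    return 0;
}
```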
• 20. Acknowledgement This assignment was my 3-month traineeship at Technische Universiteit Eindhoven (TU/e), with the topic provided by VITO-TAP, the Flemish Institute for Technological Research NV. I have received a great deal of support from the people around me. Dr. Ir. Bart Mesman, my supervisor at TU/e, spent quite a lot of time on my weekly reports and on verifying my ideas. Ir. Yifan He, a PhD candidate at TU/e and also my supervisor during this traineeship, guided me in the research methodology. Prof. Dr. Ir. Richard Kleihorst, my supervisor at VITO-TAP, kindly arranged the daily matters and the working environment at VITO-TAP. Prof. Dr. Ir. Henk Corporaal, my academic mentor at TU/e, gave me strong support throughout the three months. I should also thank Ir. Zhengyu Ye and Ir. Gert Jam, who gave me valuable advice on GPU programming and the Roofline model.
• 21. Appendix
[1] NVIDIA CUDA official website, http://www.nvidia.com/object/cuda_home.html, retrieved on July 20, 2009
[2] NVIDIA CUDA documentation, Chapter 4, "NVIDIA CUDA Programming Guide 2.2", February 4, 2009
[3] H. Rahman, G. Dedieu, "SMAC: A Simplified Method for the Atmospheric Correction of Satellite", December 5, 1993
[4] S. Williams, A. Waterman, D. Patterson, "Roofline: An Insightful Visual Performance Model for Multi-core Architectures", April 2009
[5] S. Williams, D. Patterson, "The Roofline Model: A pedagogical tool for program analysis and optimization", retrieved on September 10, 2009
[6] Zhengyu Ye, "Design Space Exploration for GPU-Based Architecture", August 2009
[7] Shane Ryoo, Christopher I. Rodrigues, Sam S. Stone, Sara S. Baghsorkhi, Sain-Zee Ueng, John A. Stratton, and Wen-mei W. Hwu, "Program Optimization Space Pruning for a Multithreaded GPU", ACM 2008
[8] NVIDIA CUDA documentation, "NVIDIA_CUDA_BestPracticesGuide_2.3", retrieved on September 12, 2009
[9] Rob Farber, "CUDA, Supercomputing for the Masses", September 19, 2008
[10] X. Ma, M. Dong, L. Zhong, Z. Deng, "Statistical Power Consumption Analysis and Modeling for GPU-based Computing", retrieved on September, 2009
[11] S. Collange, D. Defour, A. Tisserand, "Power Consumption of GPUs from a Software Perspective", Proceedings of the 9th International Conference on Computational Science, 2009
[12] Analog Device documentation, "Measuring temperatures on computer chips with speed and accuracy", April, 1999
[13] Green Grid, "The Green Data Center: Energy-Efficient Computing in the 21st Century", retrieved on July 16, 2009
[14] Google, "The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines", retrieved on July 16, 2009
[15] Google, "The Case for Energy-Proportional Computing", retrieved on July 16, 2009
[16] http://en.wikipedia.org/wiki/PowerNow!, retrieved on July 16, 2009
[17] http://en.wikipedia.org/wiki/SpeedStep, retrieved on July 16, 2009
• 22. [18] SUN, "Sun's Throughput Servers: Paradigm Shift in Processor Design Drives Improved Business Value", retrieved on July 16, 2009
[19] IBM, "Storage Modeling for Power Estimation", retrieved on July 16, 2009
[20] http://en.wikipedia.org/wiki/Green_computing, retrieved on July 16, 2009
[21] Seagate, "2.5-Inch Enterprise Disc Drives: Key to Cutting Data Center Costs", retrieved on July 16, 2009
[22] Google, "Power-Aware Micro-architecture: Design and Modeling Challenges for Next-Generation Microprocessors", retrieved on July 16, 2009
[23] Google, "MapReduce: Simplified Data Processing on Large Clusters", retrieved on July 16, 2009
[24] Google, "Power Provisioning for a Warehouse-sized Computer", retrieved on July 16, 2009
[25] IBM, "IBM BladeCenter HS22 Technical Introduction", retrieved on July 16, 2009
[26] IBM, "IBM BladeCenter Products and Technology", retrieved on July 16, 2009
[27] http://www-03.ibm.com/systems/virtualization/, retrieved on July 16, 2009
[28] Green Grid, "Five Ways to Reduce Data Center Server Power Consumption", retrieved on July 16, 2009