Implementation and Optimization of FDTD Kernels by Using Cache-Aware Time-Skewing Algorithms
1. IMPLEMENTATION AND OPTIMIZATION OF
FDTD KERNELS BY USING CACHE-AWARE
TIME-SKEWING ALGORITHMS
THESIS PRESENTATION
SERHAN OZBEY
WARSAW UNIVERSITY OF TECHNOLOGY
INSTITUTE OF TELECOMMUNICATIONS 16/03/2017
2. ABSTRACT
The main goal of this thesis was to implement and optimize cache-aware time-skewing algorithms for
FDTD kernels in order to reduce cache misses and processor idle time.
Electromagnetic simulations require large-scale discretization of space and a large number of computations.
Using and optimizing an efficient memory access pattern is therefore important.
A naive implementation of the FDTD method is a kernel of nested loops that reads data from memory
and writes it back to calculate the EM fields.
By exploiting the data dependencies and locality of the FDTD kernel to make better use of the memory
hierarchy, processor idle time can be reduced.
FDTD execution can take a long time if the nested loops are not iterated in an order that uses the data
dependencies efficiently.
This idle time can be reduced by skewing and blocking the time and space domains, which forces the
loop iterations to follow the data dependencies and gives an access pattern that makes better use of the
fast CPU cache memories.
4. INTRODUCTION
For sustainable and reliable telecommunication networks, efficient and durable network
components are in high demand. This is achieved by modelling and producing devices that
cope well with the electromagnetic disturbances that affect the performance of such components.
Factors such as electromagnetic radiation and scattering should be considered through
electromagnetic modelling of devices, which simulates how the devices interact with natural conditions and
with the materials present in the environment.
5. INTRODUCTION
Computational electromagnetics (electromagnetic modeling) is the process of modeling the interaction
of electromagnetic fields with physical objects and the environment. Maxwell's equations are
solved to evaluate the electric and magnetic fields under the given boundary conditions and constitutive
relations.
Using computationally efficient approximations to Maxwell's equations, it is used to calculate:
antenna performance
electromagnetic compatibility
radar cross section
electromagnetic wave propagation when not in free space.
6. INTRODUCTION
Computational electromagnetics has been the answer to electromagnetic simulation using the latest
available technology. By now, there are many methods in the domain, such as integral-form Maxwell's
equation solvers like MoM and differential-form Maxwell's equation solvers such as FEM and FDTD.
To achieve high detail and accuracy, these solvers need a very fine discretization of space and
time.
This means that memory must be used efficiently, exchanging spatial and temporal data
quickly to calculate the field values from Maxwell's equations until the end of the given simulation time.
7. INTRODUCTION
• FDTD, a numerical analysis technique widely used in computational electromagnetics,
belongs to the general class of grid-based differential numerical modeling methods. The
time-dependent Maxwell's equations (in partial differential form) are discretized using
central-difference approximations to the space and time partial derivatives.
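As a generic illustration of the central-difference approximation mentioned above (the symbols f, x_0 and \Delta x are placeholders chosen here, not notation from the thesis), a first derivative sampled half a cell on either side of x_0 can be written as:

```latex
\left.\frac{\partial f}{\partial x}\right|_{x_0}
  \approx \frac{f\!\left(x_0 + \tfrac{\Delta x}{2}\right) - f\!\left(x_0 - \tfrac{\Delta x}{2}\right)}{\Delta x},
\qquad \text{truncation error } O(\Delta x^{2})
```

The same form applies to the time derivative with \Delta t in place of \Delta x, which is why the E and H fields end up staggered by half a step in both space and time.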
8. FDTD METHOD
Solving Maxwell's equations in the time domain.
Saving each frame (one time iteration of our
code), as in a movie.
An electric field changing at a particular point
induces a curling (circulating) magnetic field.
Likewise, a changing magnetic field induces a
curling electric field.
This leaves us with a leapfrog scheme of
calculations, as shown in the figure on the
right-hand side.
9. FDTD METHOD
for t in 0 to NT-1
    for i in 1 to N-1
        E[i] = k1*E[i] + k2*(H[i] - H[i-1])
    end for
    for i in 1 to N-1
        H[i] += E[i] - E[i+1]
    end for
end for
A naive 1D FDTD algorithm.
It updates all N field values at each of the NT
time steps.
10. INTRODUCTION
• FDTD remains a challenging task for
the computers and devices running it, due to
its high demands on computational power
and memory bandwidth.
• Programs cannot fully benefit from the
processor improvements that follow
Moore's Law, as processors spend
more than 80% of their time waiting for
data to process or to be received from
main memory.
11. INTRODUCTION
• Stencil codes such as FDTD kernels contain
nested loops that force the processor to perform
many memory reads and writes. This is because
problem sizes are generally too big to fit inside
even the largest cache of the processor.
• The defining feature of stencil codes is that
each data element is related to its neighbours.
• In FDTD kernels, this relation holds
between the E-fields and H-fields. As a result of
Maxwell's equations, each element depends on
elements close to it in space and time.
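To make the "too big to fit in cache" point concrete, here is a minimal sketch in C. The function names and the 32 KiB cache size used in the note below are assumptions for illustration only; the thesis's actual machine parameters are not given on this slide.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes touched by one sweep of a 1D kernel that keeps two float
 * arrays (E and H) of n_cells elements each. Illustrative helper. */
size_t fdtd1d_working_set_bytes(size_t n_cells)
{
    return 2 * n_cells * sizeof(float);
}

/* Returns 1 if both arrays fit in a cache of cache_bytes, else 0.
 * cache_bytes is whatever cache level is being considered. */
int fits_in_cache(size_t n_cells, size_t cache_bytes)
{
    assert(cache_bytes > 0);
    return fdtd1d_working_set_bytes(n_cells) <= cache_bytes;
}
```

With 4-byte floats, a grid of 1000 cells needs 8000 bytes and would fit in an assumed 32 KiB cache, while a grid of one million cells needs about 8 MB and would not, so every time step streams the arrays from a slower level again.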
12. A data dependency graph showing how elements at different points in space and time are related to
each other's computations, as given by the FDTD formula.
13. Values that can be computed from a tile after some values have been loaded initially.
14. Because programs cannot fully benefit from the processor improvements that follow
Moore's Law, one factor that is becoming more and more important is how well an algorithm takes
advantage of the memory hierarchy, that is, its memory performance.
Memory access speed is very important in modern microprocessors. For this reason, we
focus our work on cache memory hierarchies, making the most of effective cache replacement methods
to:
reduce cache miss rates
improve locality of data
make fast data access possible between processor and memory via effective cache usage.
15. Cache-aware time-skewing algorithms take advantage of explicitly specified details of the processor they
run on. Because the algorithm stores related data together in the same block, as mentioned earlier,
the processor's memory page size and cache line size must be built into the algorithm.
This is vital, as the algorithm exploits the processor's cache behaviour: its main objective
is to minimize the movement of memory pages through the processor's cache.
The objectives focus on loop tiling, time skewing and reducing CPU stalls through data locality
optimizations. A significant rise in performance is expected as a result of these optimization
steps.
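The combination of loop tiling and time skewing can be sketched for the naive 1D kernel as follows. This is an illustrative sketch under stated assumptions, not the implementation from the thesis: the function names, the time-block size tb and the space-block size bs are hypothetical, and E and H are assumed to hold n + 1 floats so that E[i + 1] is valid at i = n - 1. Each space tile is advanced through several time steps before the next tile is touched, and the tile is shifted left by one cell per time step so that every value it reads has already been computed.

```c
#include <assert.h>

int min_int(int a, int b) { return a < b ? a : b; }
int max_int(int a, int b) { return a > b ? a : b; }

/* Reference: same loop structure as the naive 1D pseudocode.
 * Assumes E and H each hold n + 1 floats (indices 0..n). */
void fdtd1d_naive(float *E, float *H, int n, int nt, float k1, float k2)
{
    for (int t = 0; t < nt; ++t) {
        for (int i = 1; i < n; ++i)
            E[i] = k1 * E[i] + k2 * (H[i] - H[i - 1]);
        for (int i = 1; i < n; ++i)
            H[i] += E[i] - E[i + 1];
    }
}

/* Time-skewed, tiled variant (hypothetical sketch).
 * tb = time steps per time block, bs = cells per space tile. */
void fdtd1d_skewed(float *E, float *H, int n, int nt,
                   float k1, float k2, int tb, int bs)
{
    assert(n >= 2 && nt >= 0 && tb >= 1 && bs >= 1);
    for (int tt = 0; tt < nt; tt += tb) {
        int steps = min_int(tb, nt - tt);      /* last block may be short */
        /* Tiles run left to right; extra tiles past n cover the cells
         * that the leftward skew would otherwise leave untouched. */
        for (int ib = 1; ib < n + steps; ib += bs) {
            for (int s = 0; s < steps; ++s) {
                /* E part of the tile, shifted left by s cells. */
                int e_lo = max_int(1, ib - s);
                int e_hi = min_int(n - 1, ib + bs - 1 - s);
                for (int i = e_lo; i <= e_hi; ++i)
                    E[i] = k1 * E[i] + k2 * (H[i] - H[i - 1]);
                /* H part lags one more cell, so E[i + 1] is ready. */
                int h_lo = max_int(1, ib - s - 1);
                int h_hi = min_int(n - 1, ib + bs - 2 - s);
                for (int i = h_lo; i <= h_hi; ++i)
                    H[i] += E[i] - E[i + 1];
            }
        }
    }
}
```

Both functions perform the same updates per cell and time step; only the order changes. Choosing tb and bs so that one tile stays in a fast cache for all tb steps is the tuning question, and suitable values depend on the target machine.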
16. INTRODUCTION
FDTD solvers demand expensive hardware with parallelism features to run smoothly and accurately.
Our objective was to extend previous research that proposed ideas to counter these demands.
The main objective of this thesis is to achieve better reliability, cache usage
and execution times for FDTD codes, so that given problems can be run smoothly and accurately,
while also taking into account the physics and engineering aspects of the problem, which
have been lacking in previous research.
An extension of previously known work on code optimizations such as loop blocking, cache-aware
algorithms and time-skewing techniques is presented in detail as a contribution, instead
of including only implicit information.
17. LITERATURE REVIEW
FDTD method
References for understanding the problem and implementation of theory to code
Changes and proposals for new FDTD techniques
Solving FDTD problems for extreme conditions and specific problems
Photonics, biomedicine
Solving the Schrödinger equation with a generalized FDTD approach
Different software implementations, such as V2D.
18. LITERATURE REVIEW
Memory hierarchy and the "memory wall"
Referring to important concepts of memory management and optimizations such as
Memory hierarchy
‘Memory wall’ term
Von Neumann bottleneck
Roofline model
Memory mountain
19. LITERATURE REVIEW
Stencil codes and data dependencies
Definition and types of stencils
Approximating problem into stencil code
Methodology of determination of data dependencies
Other terms such as: Parallelism, GPU
Locality optimizations
Understanding the ‘Principle of locality’
Important terms related to locality features of codes ( machine balance, computer balance, scalable locality)
Different code optimization algorithms studies
20. METHODOLOGY
Research design
Code generation and validation
Dependence and loop iteration analysis
Finding optimal tiling and skewing
Methodological assumptions
33. Summarizing, for both 1D FDTD and 2D FDTD:
Cache profiling
Execution time
Data types and Programming Languages
Compiler optimizations
Future works
RESULTS AND DISCUSSIONS
34. CONCLUSIONS
Computational electromagnetics has gained much more importance with the improvements in and demands of
related technologies, such as antenna design, biomedicine and wireless communications.
A good software implementation is a must for a highly memory- and computation-intensive code kernel
such as FDTD.
In this thesis, previous work in the literature was extended, and the improvements obtained with
software optimizations such as loop blocking, cache-aware algorithms and time-skewing were demonstrated for 1D and 2D
FDTD kernels.
35. CONCLUSIONS
The differences between the naive FDTD codes and the applied algorithms were shown in the results for the 1D
and 2D cases.
The results achieved indicate that applying time-skewing algorithms, in the way done
in this thesis, comes with an increased total number of data references but with a much better cache hit rate
than the other codes.
The benefit of time-skewing is much more visible in the 2D code in terms of cache misses.
Run-time graphs and improved L1 and L3 cache miss rates for the 1D and 2D cases were obtained and
demonstrated in the results.
Line-by-line cache misses are explained throughout the thesis.
Editor's Notes
Hello dear Professors and valued members of our Institute of Telecommunications, I'm Serhan Ozbey. I am a graduate of Electrical & Electronics Engineering from Yasar University in Turkey, and today I will be presenting my Master's thesis in partial fulfillment of the requirements for the degree of Master of Science in Telecommunications.
FDTD meaning
Large-scale discretization of space and computations: with the FDTD technique, we handle a time-domain problem by gridding both space and time. At each time step, we compute every field value on the grid.
Loop optimizations: the process of increasing execution speed and reducing the overheads associated with loops. Most of the execution time of a scientific program is spent in loops.
Cache-aware algorithm:
Time skewing:
In my thesis, I decided to structure the topics in this way, and I will follow the same structure in today's presentation to clarify this complex problem.
1) I give a brief introduction by summarizing the problem, its diagnosis and the proposed solutions, and provide background information on the terms used frequently throughout the thesis.
2) Then I cover the previous literature that was used to gain a deeper understanding of the problem and to continue the thesis with the proposed techniques.
3) Methodology, where I summarize the steps taken to obtain the results.
4) Results and Discussion, which presents the results obtained by following the proposed methodology steps and discusses possible future work in light of the previous literature.
And the conclusion.
Electromagnetic interference, also called radio-frequency interference (RFI) when in the radio frequency spectrum, is a disturbance generated by an external source that affects an electrical circuit by electromagnetic induction, electrostatic coupling, or conduction.
Gauss's law - The electric flux leaving a volume is proportional to the charge inside.
Gauss's law for magnetism - There are no magnetic monopoles; the total magnetic flux through a closed surface is zero.
Maxwell–Faraday equation (Faraday's law of induction) - The voltage induced in a closed circuit is proportional to the rate of change of the magnetic flux it encloses.
Ampère's law (with Maxwell's extension) - The magnetic field integrated around a closed loop is proportional to the electric current plus displacement current (rate of change of electric field) it encloses.
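For reference, the four statements above correspond to Maxwell's equations in integral form. The version below is written in SI units, which is an assumption made here for illustration (the notes elsewhere mention normalized Gaussian units); V is a volume with boundary surface \partial V, and S is a surface with boundary curve \partial S.

```latex
\oint_{\partial V} \mathbf{E}\cdot d\mathbf{A} = \frac{Q_{\text{enc}}}{\varepsilon_0}
\qquad
\oint_{\partial V} \mathbf{B}\cdot d\mathbf{A} = 0

\oint_{\partial S} \mathbf{E}\cdot d\boldsymbol{\ell} = -\frac{d}{dt}\int_{S} \mathbf{B}\cdot d\mathbf{A}
\qquad
\oint_{\partial S} \mathbf{B}\cdot d\boldsymbol{\ell} = \mu_0\left(I_{\text{enc}} + \varepsilon_0\,\frac{d}{dt}\int_{S} \mathbf{E}\cdot d\mathbf{A}\right)
```

The two curl-type equations on the second line are the ones FDTD advances in time; the two divergence-type equations on the first line act as constraints on the fields.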
What are the options? What can be better?
MoM
FEM
FDTD
Although in principle these technologies could be used to solve the same problems, there are often good practical reasons why one particular simulator is better suited to solving a particular problem type.
Using finite-difference approximations to solve Maxwell's equations in the time domain is an effective way to deal with the complex geometries of real-life problems.
Modelling in the time domain is well suited to observing the transient phenomena of such problems.
A basic example is the detection of a moving plane by radar, which emits electromagnetic radiation for detection, as in the figure.
The approach can be thought of as making movies of the electric and magnetic fields flowing through a medium or device, as it is a time-domain method. Each iteration of Maxwell's equations for the fields is one frame of the movie. Once the electric and magnetic fields are known, many measurable quantities can be calculated from them together with the coefficients of the medium, device or environment under study.
Yee's grid: Yee's grid was chosen because its structure places different field components
at different grid locations, so field values never coincide at the same point.
The grid is built by dividing space into discrete cells; since there is still infinite information inside each cell,
information is stored at a single point in each cell.
Modes of FDTD
Time-dependent curl equations: these are used because they follow from Maxwell's differential equations.
A basic kernel of an FDTD algorithm. It is a heavily simplified version: in the first basic implementations we do not consider parameters such as grid widths and heights or wavelengths. We excite the field with a basic pulse, such as a Gaussian, to see the response of the field. The implementation starts with pulse propagation in free space.
What are the challenges of electromagnetic modelling?
Moore's Law: the transistor count of integrated circuits doubles approximately every two years. On most modern microprocessors, the majority of transistors are contained in caches. This comes with improved power efficiency, higher core counts and bigger last-level caches.
Memory hierarchy: Many studies have shown that solutions can be found by optimizing the memory accesses of programs to make the best use of the running
system's memory architecture.
This structure of FDTD allows the naive code to be replaced with an optimized one that leverages spatial and temporal locality. In this thesis, the time-skewing principle was the focus and was evaluated through experiments.
This graph is really important in our case, as it shows that we only need the values of the last time step, while the other time steps are computed along the way. These intermediate data can therefore be stored temporarily.
Temporal: Recently referenced data or instructions are likely to be referenced again in the near future.
Spatial: Data or instructions with nearby addresses tend to be referenced together.
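Spatial locality can be illustrated with two traversals of the same array. This is a minimal sketch in C; the function names and the row-major layout are assumptions for illustration and do not come from the thesis code. The first loop order walks consecutive addresses, the second jumps a whole row between consecutive accesses.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major traversal: the inner loop visits consecutive addresses,
 * so each loaded cache line is fully used before moving on. */
double sum_row_major(const float *a, size_t rows, size_t cols)
{
    assert(a != NULL);
    double sum = 0.0;
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c)
            sum += a[r * cols + c];
    return sum;
}

/* Column-major traversal of the same row-major array: consecutive
 * accesses are cols * sizeof(float) bytes apart. */
double sum_col_major(const float *a, size_t rows, size_t cols)
{
    assert(a != NULL);
    double sum = 0.0;
    for (size_t c = 0; c < cols; ++c)
        for (size_t r = 0; r < rows; ++r)
            sum += a[r * cols + c];
    return sum;
}
```

Both functions read every element exactly once and differ only in access order. How large the resulting difference in cache misses is depends on the array size and the machine, which is what a cache profiler is used to measure.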
Objective
In addition, the effects of memory-bound and computationally intensive codes on the memory architecture will be investigated by running modified benchmark tools and further theoretical calculations. Validation of the FDTD codes generated using locality optimization algorithms will also be investigated.
Reducing problems to 1D or 2D
Free-space formulation of FDTD
Transverse magnetic (TM) modes: no magnetic field in the direction of propagation.
Detection of breast cancer using the 2D FDTD method to identify malignant
tumors.
An example of changing the FDTD solution theory is [37], where the
authors introduced a hybrid FEM-FDTD method. Another study implemented non-uniform
mesh grids for FDTD to decrease the resolution in the parts of the problem space
that are of less interest.
V2D was created to solve specific axisymmetric devices with a unique 2D solution approach. By simulating both circular TE and
TM waveguide modes instead of one, it was shown that modelling is faster than with 3D FDTD.
Schrodinger eq: mathematical equation that describes the evolution over time of a physical system in which quantum effects, such as wave–particle duality, are significant. The equation is a mathematical formulation for studying quantum mechanical systems
Memory hierarchy: Instead of one very expensive memory component, several memory components such as registers, caches, GPU caches and main memory are used.
Memory wall: processor speeds improve at a faster rate than main memory speeds.
The von Neumann bottleneck is the idea that computer system throughput is limited by the relative ability of processors compared to top rates of data transfer.
Roofline model: this graph is a function of machine peak performance, machine peak bandwidth
and computational intensity. Plotting it gives a basic idea of the
performance to expect for the computational intensity that the program demands.
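The roofline bound described above can be sketched as a single function: attainable performance is the smaller of the compute peak and the bandwidth peak multiplied by the computational intensity. The function name, parameter names and the numbers in the note below are hypothetical and are not measurements from the thesis.

```c
#include <assert.h>

/* Roofline estimate of attainable performance in GFLOP/s.
 * peak_gflops:     machine peak compute rate (GFLOP/s)
 * peak_bw_gbytes:  machine peak memory bandwidth (GB/s)
 * flops_per_byte:  computational intensity of the kernel */
double roofline_gflops(double peak_gflops, double peak_bw_gbytes,
                       double flops_per_byte)
{
    assert(peak_gflops > 0.0 && peak_bw_gbytes > 0.0 && flops_per_byte > 0.0);
    double memory_bound = peak_bw_gbytes * flops_per_byte;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}
```

For an assumed machine with a 100 GFLOP/s peak and 20 GB/s of bandwidth, a kernel with an intensity of 0.25 FLOP per byte is bounded at 5 GFLOP/s by memory, while a kernel with 10 FLOP per byte is bounded at the 100 GFLOP/s compute peak. Low-intensity stencil kernels sit on the memory-bound side of the graph, which is the motivation for improving data reuse in cache.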
Research design
-Generation of FDTD code converted from FDTD theory and sources
-Analysis of FDTD theory and code to obtain data dependencies and iteration space
-Realization of data dependency graphs
-Optimal tiling and skewing implementation designs
-Generation of optimized source code
-Comparisons and tests with the output codes
Methodological assumptions:
-free-space (vacuum). This means that an external array to hold field coefficients was neglected.
-Materials simulated are assumed non-magnetic:
-Boundary conditions were not set
-Normalization of Gaussian units
-Impulse response hard source as an excitation
-Factor of grid resolution was neglected in this thesis
-Courant’s stability condition is defined as
-Calculation of cache rates, cache associativity factors
-The slight offset in time and distance observed between the Ex and Hy sources is normal
Why it was chosen?
Hardware
Dell XPS. I selected it because I own it and believe it is a below-average PC that should be able to run these experiments at acceptable rates. Component details are listed in the thesis.
Software
Ubuntu Linux, Compiler Explorer, valgrind
Computer Benchmark
Some reliable computer memory benchmarks were run throughout the thesis to determine the performance of the memory hierarchy of the computer used.
Data Processing and Analysis
The most important metrics of this thesis were the memory events happening at the processor and the execution time, as the two are directly related.
After generation of the optimized codes, the data was analysed in the following order:
We store elements as float. One float is 4 bytes (single precision). Since one cache line is 64 bytes, a single cache line transfer actually brings in 16 float elements. To eliminate this, we increment the loop by 64.
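The cache line arithmetic above can be checked with a small sketch in C. The helper names are hypothetical, a 4-byte float is assumed, and the array is assumed to start on a cache line boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Number of float elements delivered by one cache line transfer. */
size_t floats_per_line(size_t line_bytes)
{
    assert(line_bytes >= sizeof(float));
    return line_bytes / sizeof(float);
}

/* Number of cache lines covered by n_floats consecutive floats,
 * assuming the first element is aligned to a line boundary. */
size_t lines_touched(size_t n_floats, size_t line_bytes)
{
    assert(line_bytes > 0);
    size_t bytes = n_floats * sizeof(float);
    return (bytes + line_bytes - 1) / line_bytes;   /* ceiling division */
}
```

With 64-byte lines, floats_per_line(64) gives 16, and an array of 1000 floats spans 63 lines (4000 bytes divided by 64, rounded up). A sequential sweep therefore incurs at most one miss per 16 elements, while a strided sweep can incur one per element.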
As in our research design, the codes were written in C and C++11, using code optimizations
and the dependency relations of FDTD stencil codes. Then a variety of benchmark tools,
modified to work on our specific machine, was used to learn our computer's capabilities. These experiments were run to set the metrics and to know the room for improvement.
Then calculations were made for cache hit/miss scenarios, and expectations about execution times were listed. These calculations and expectations were then compared with the actual results from the tools used to test the codes, such as valgrind, cachegrind, perf and our own execution time test functions.
Finally, all findings were compared and discussed against other related work in the
field.
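An execution time test function of the kind mentioned above could look like the following. This is a hypothetical sketch built on the standard C clock() call, not the timing code used in the thesis; kernel_fn, time_kernel and count_calls are names invented here.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

typedef void (*kernel_fn)(void *ctx);

/* Returns the average CPU time in seconds of one call to kernel(ctx),
 * measured over `repeats` back-to-back calls. */
double time_kernel(kernel_fn kernel, void *ctx, int repeats)
{
    assert(kernel != NULL && repeats > 0);
    clock_t start = clock();
    for (int r = 0; r < repeats; ++r)
        kernel(ctx);
    clock_t stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC / repeats;
}

/* Stand-in for an FDTD kernel: only counts how often it was called. */
void count_calls(void *ctx)
{
    ++*(int *)ctx;
}
```

Averaging over several repeats smooths out timer granularity. Note that clock() measures processor time rather than wall-clock time, so a wall-clock timer may be preferable when waiting on memory is the quantity of interest.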
FUTURE WORKS: prefetching, parallel processing, and adding GPU and other hardware specifications to the problem;
outer tiling to make the most of the L2 and L3 caches.
A simple approach was taken by keeping this measure at the L1 cache only in our experiment, as the valgrind tool does not provide L2 memory access
event information.