UNIVERSITÀ DEGLI STUDI DI ROMA
“TOR VERGATA”
DOTTORATO DI RICERCA IN INGEGNERIA DELLE
TELECOMUNICAZIONI E MICROELETTRONICA
CICLO XXVII
GPU ACCELERATION OF ATOMISTIC SIMULATION OF
NANOSTRUCTURED DEVICES
Ph.D. Candidate: Walter Jesuslee Savio Rodrigues
Anno di Esame: 2015
Dipartimento di Ingegneria Elettronica
Ph.D. Tutor: Prof. Dr. Aldo Di Carlo
Ph.D. Coordinator: Prof. Dr. Aldo Di Carlo
UNIVERSITÀ DEGLI STUDI DI ROMA
“TOR VERGATA”
DOCTOR OF PHILOSOPHY IN TELECOMMUNICATION AND
MICROELECTRONICS ENGINEERING
CYCLE XXVII
GPU ACCELERATION OF ATOMISTIC SIMULATION OF
NANOSTRUCTURED DEVICES
Ph.D. Candidate: Walter Jesuslee Savio Rodrigues
Year of Ph.D. Dissertation Defense: 2015
Department of Electronics Engineering
Ph.D. Advisor: Prof. Dr. Aldo Di Carlo
Ph.D. Coordinator: Prof. Dr. Aldo Di Carlo
OLABs: Optoelectronics & Nanoelectronics Laboratory
GPU Acceleration of Atomistic Simulation of Nanostructured Devices
Walter Jesuslee Savio Rodrigues
May, 2015
Ph.D. in Telecommunication and Microelectronics Engineering Program - XXVII Cycle
Optoelectronics & Nanoelectronics Laboratory
Simulation & Theoretical Research Group
Department of Electronics Engineering
Engineering Faculty
University of Rome Tor Vergata
Via del Politecnico 1, 00133, Rome, Italy
Phone + 39 (0)6 7259 7939
www.optolab.uniroma2.it
Acknowledgment
I would like to express my sincere gratitude to my advisor Prof. Aldo Di Carlo for the
continuous support during my Ph.D. studies. His motivation and enthusiasm have helped
me to keep going to this point.
I would like to thank Dr. Alessandro Pecchia, Dr. Matthias Auf der Maur and Dr.
Daniele Barettin for patiently sharing their immense knowledge with me and guiding me
throughout my research.
I thank all my fellow colleagues Giacomo, Francesco, Claudio, Antonio, Marco, Amir,
Corrado, Babak, Matteo P., Andrea R., Thomas B., Francesca B., Matteo G., Lucio,
Monica, Elisa, Giorgia, Fabio S., and Desi for welcoming me into the group and for all
their love and support that I have received over the last three years.
Last but not least, I thank all my friends who made my stay in Rome a memorable one,
and my wife, Jasmine, for her love, support and patience throughout my Ph.D. studies.
Abstract
Numerical simulation of materials and devices at the atomistic level plays an important
role in advancing science and guiding device fabrication. It also plays an increasing role
in explaining experimental findings and in studying micro- and macro-scale systems at a
level of detail that may otherwise not be physically accessible. Nowadays, many
sophisticated high-end computational tools are available to scientists that can accelerate
innovation and lead to low-cost advancements and device optimization. This also enables
domain experts to focus on their areas of expertise and to help solve key issues that,
once resolved, can lead to major scientific breakthroughs.
The progress in the field of numerical simulations began with the enormous
advancements in computing technology that revolutionized the world three decades ago.
Today, larger and faster computing systems are widely accessible. Supercomputers and
expensive, computationally powerful high-end systems are being used to speed up
numerical calculations. However, these improvements in technology have often not
translated into equivalent productivity. To date, many computational scientists still
employ outdated tools and algorithmic implementations, thereby spending unnecessary
time waiting for results. The advent of the graphics processing unit (GPU), with its
huge number of computing engines, has captured the attention of the scientific
computing community. The work reported here is specifically aimed at helping
computational scientists and nanoelectronics domain experts develop tools that take
advantage of modern improvements in computing technology.
Atomistic simulation of nanostructured devices often requires the simulation of
systems with an irreducibly large number of atoms. However, large-scale atomistic
calculations, such as those based on the empirical tight binding (ETB) approach reported
here, must face the computational obstacle of diagonalizing the Hamiltonian matrix,
needed for the calculation of eigenvalues and eigenvectors. This bottleneck can be
overcome by parallel computing techniques or by the introduction of faster algorithms.
Recent advancements have enabled the construction of massively parallel codes and
O(N) computational schemes. Nevertheless, such codes require large high performance
computing (HPC) facilities to run, thereby reducing their accessibility to a wider range of
users. This work has been motivated by the lack of specialized eigensolvers for
large-scale computations on GPUs.
Developing algorithms that scale well over GPUs is an important component in
transforming the hardware's features into actual beneficial speedups. In recent times,
extensive effort has been put into translating algorithms initially designed for sequential
processors. However, many aspects need to be considered to obtain speedups when
dealing with GPUs or other parallel computing technologies. Hence, this
sequential-to-parallel transition is often not straightforward and requires a deeper
understanding of both the system architecture and the algorithms themselves.
In this work, emphasis is also placed on addressing some basic problems that hinder
the development of efficient eigensolvers on GPUs, first among them the choice of the
algorithm itself. I demonstrate how to overcome the compute-versus-communication gap
that exists in GPUs and establish ways to resolve the computational and memory-related
bottlenecks. Multi-GPU implementations that scale with the number of GPUs are also
presented, resulting in eigensolvers that efficiently accelerate large-scale tight binding
calculations.
There are several methods that can be used to calculate the needed energy eigenstates.
Given the variety of possible methods, it is still unclear which one is best suited and how
their performance compares in a given scenario. Hence, I concentrate on the GPU
implementation of three different methods that are commonly used in the computational
electronics community. An analysis of timing, memory occupancy and convergence on a
multi-GPU system is performed. Finally, realistic applications of GPU-accelerated
atomistic simulations are presented. ETB calculations of quantum heterostructures
derived from experimental results are performed on GPUs, showing that the performance
of the solvers employed for the atomistic simulation of nanostructured devices can be
considerably enhanced.
Preface
The work outlined in this dissertation was carried out in the Department of Electronics
Engineering, University of Rome Tor Vergata, over the period from January 2012 to April
2015. This dissertation is the result of my work and includes a small part which is the
outcome of the work done in collaboration. The material included in this dissertation
has not been submitted for a degree or diploma or any other qualification at any other
university.
This work has been divided into seven parts. The first chapter introduces the tight
binding model and outlines the motivation for this research. The second chapter
briefly describes the hardware architecture and the CUDA programming model for GPUs.
A review and survey of eigensolver methods is presented in chapter three. Chapters four
and five detail the design and benchmarking of GPU-based eigensolvers for atomistic
simulation. The sixth chapter presents real applications of the research work carried out,
and the last chapter presents the conclusions.
Contents

Acknowledgment
Abstract
Preface
Contents

1 Introduction to tight binding model and its computational challenges
  1.1 Empirical tight binding model
  1.2 Mathematical formulation for empirical tight binding model
  1.3 Schrödinger equation and the eigenvalue problem
  1.4 Computational challenges of empirical tight binding method
  1.5 Summary

2 Introduction to GPU and general purpose GPU computing
  2.1 Towards a unified graphics computing architecture
  2.2 Architectural overview of the Tesla Kepler GPU
    2.2.1 Next-generation streaming multiprocessor
    2.2.2 Instruction scheduler
    2.2.3 Memory model
    2.2.4 Advanced features
  2.3 CUDA programming model
  2.4 General-purpose computing on graphics processing units
  2.5 Summary

3 Introduction to Eigensolvers
  3.1 Direct methods
    3.1.1 QR algorithm
    3.1.2 Divide-and-conquer method
    3.1.3 Bisection method and inverse iteration
    3.1.4 Jacobi method
  3.2 Iterative methods
    3.2.1 Power iteration method
    3.2.2 Rayleigh quotient iteration method (RQI)
    3.2.3 Arnoldi method
    3.2.4 Lanczos method
    3.2.5 Locally optimal block preconditioned conjugate gradient method (LOBPCG)
    3.2.6 Davidson method
    3.2.7 Jacobi-Davidson method
    3.2.8 Contour integral spectral slicing
    3.2.9 FEAST method
  3.3 Survey of available software packages for eigenproblems
  3.4 Summary

4 Design of GPU based eigensolver for atomistic simulation
  4.1 Lanczos method
  4.2 Implementation and optimization strategies for parallel eigensolvers
    4.2.1 MPI-OpenMP
    4.2.2 MPI-CUDA
    4.2.3 Performance enhancement via communication cost reduction
    4.2.4 Memory optimization by the splitting approach
    4.2.5 Mixed real-complex CUDA kernels
    4.2.6 Performance enhancement using the overlap technique
    4.2.7 CUDA-aware MPI
  4.3 Benchmarking the Lanczos method
  4.4 Summary

5 GPU focused comprehensive study of popular eigenvalue methods
  5.1 GPU based implementations of popular eigenvalue methods
    5.1.1 Jacobi-Davidson method
    5.1.2 FEAST method
  5.2 Benchmarking results, comparison and discussion
    5.2.1 Eigensolver evaluation on a multi-GPU workstation
    5.2.2 Eigensolver evaluation on an HPC cluster
    5.2.3 Performance comparison between GPU and HPC cluster
  5.3 Summary

6 Application of GPU accelerated atomistic simulations
  6.1 Atomistic simulation of complex quantum dot/ring nanostructure
  6.2 Atomistic simulation of InGaN quantum dot with indium fluctuation
  6.3 Summary

7 Conclusion

Publications and Conferences
Bibliography
Abbreviations
List of Figures
List of Tables
Chapter 1
Introduction to tight binding model
and its computational challenges
The use of computer simulations was born only a few decades ago, but their impact on
modern science has closely mirrored the exponential growth in the power of computers.
In recent times, almost all fields of science have seen an explosion in the use of computer
simulations, to the point where computational methods now stand alongside theoretical
and experimental methods in value [1]. In turn, the growing power of computers has
spurred the development of methods and scientific software packages, widening the
potential of simulations to tackle a wide range of scientific issues and placing
sophisticated tools in the hands of a wider group of scientists.
Atomistic simulations are playing an increasingly important role in realistic scientific
and industrial applications in many areas, including advanced materials design,
nanotechnology, modern chemistry and semiconductor research. Atomistic simulation is
the theoretical and computational modeling of what happens at the atomic scale in
solids, liquids, molecules and plasmas. Often, this means numerically solving the
classical or quantum-mechanical microscopic equations for the motion of interacting
atoms or, at a deeper level, of electrons and nuclei. Atomistic simulation is used to
interpret existing experimental data and predict new phenomena, to reach
computationally where simple theory alone cannot, and to provide a way forward where
experiments are not yet possible. The predictive capability of these simulation
approaches hinges on the accuracy of the model used to describe atomic interactions.
Modern models are optimized to reproduce experimental values and electronic structure
estimates for the forces and energies of representative atomic configurations deemed
important for the problem of interest.
Most solid-state applications now make heavy use of density functional theory (DFT),
which has proved extremely successful in studying structural properties and electronic
states of materials, from which formation energies, phase stability and thermodynamic
properties can be understood or even predicted. Many-particle corrections can be
introduced as a perturbation, also allowing the exploration of optical properties.
Localized-basis approaches like Gaussian orbitals, wavelets or augmented-plane-wave
methods are used for calculating the electronic band structure of solids, allowing the
prediction of many important properties [2]. All these methods involve the development
of quite complicated computer codes. Limited computational resources, however, impose
restrictions on both the system size and the level of theory that can be used to calculate
the interaction between electrons and ions. In order to overcome these limitations, more
approximate methods have been developed, and advanced optimization tactics, either
theoretical or practical, are widely welcomed.
1.1 Empirical tight binding model
The model name “tight binding” suggests that it describes the properties of tightly
bound electrons in solids. The electrons in this model are considered to be tightly bound
to the atom to which they belong and they have limited interaction with states and
potentials of surrounding atoms. As a result, the wave function of the electron is rather
similar to the atomic orbital of the free atom to which it belongs. The energy of the
electron is close to the ionization energy of the electron in the free atom or ion because
the interaction with the potentials and states of neighboring atoms is limited. The tight
binding (TB) approach to electronic structure is one of the most used methods in solid
state systems [3]. The empirical tight binding (ETB) method, which dates back to the
work of Slater and Koster [4], mostly assumes a two-center approximation, and the
matrix elements of the Hamiltonian between orthogonal, atom-centered orbitals [5] are
treated as parameters fitted to experiment or to first-principles calculations. ETB is
widely employed for the description of the electronic structure of complex systems [6]
like interfaces and defects in crystals, amorphous materials, nanoclusters, and quantum
dots, because it is computationally efficient and provides physically transparent results.
Indeed, this technique requires a relatively small number of parameters, which are fitted
to accurately reproduce a given set of experimental data.
As stated, ETB considers a system where electrons are bound to atoms, with the
perturbation described through a linear combination of atomic orbitals (LCAO) [4, 16]
(e.g. sp³, sp³d⁵, etc.). ETB employs an implicit basis composed of localized atomic-like
orbitals in order to describe the band structure, but does not involve the direct
computation of inter-atomic overlaps. Consequently, many authors define ETB as a
formal expansion over Wannier functions. The Hamiltonian matrix elements are typically
obtained empirically from fits to more accurate calculations or experiments, or derived
from first-principles expressions [7, 8]. The ETB method used for calculations of the
particle states of atomistic systems [9, 10] is generally less accurate and less transferable
than methods based on DFT, where the Hamiltonian is computed from explicit wave
functions, but it does provide a good alternative for simulating systems of larger
size [11] and over longer time scales than are currently tractable using first-principles
methods. In fact, ETB is the model of choice for the atomistic description of the
electronic properties of nanostructured devices [12–15].
According to the macroscopic device description and crystallographic orientation, the
atomistic structure needed for ETB calculations is generated internally in TiberCAD,
a multiscale CAD tool for the simulation of modern nanoelectronic and optoelectronic
devices [17]. The atomistic structure is deformed based on the strain calculations obtained
from a continuous-media elasticity model by projecting the deformation field onto the
atomic positions [18]. In order to couple the atomistic calculation of electronic states
with the continuous-media model for particle transport, the macroscopic electrostatic
potential calculated with the Poisson/drift-diffusion model is projected onto the
atomic positions in a multiscale fashion [19]. The solution of the eigenvalue problem
resulting from the ETB provides the quantum energy eigenstates and consequently the
charge density. An ETB model based on an sp³d⁵s* + spin-orbit parametrization has
been applied in this work [7].
1.2 Mathematical formulation for empirical tight
binding model
ETB describes the system Hamiltonian H by taking a linear combination of localized
orbitals centered on each atom position [20]. The function

|Ψ⟩ = Σ_{α,R} C_α(R) |α, R⟩    (1.1)

represents standing waves or atomic orbitals, from which it is necessary to find an
approximation of the eigenenergies and a set of expansion coefficients C_α [21].

In the quantum atomistic approach, the energy levels ε of the stationary states can
be seen as the eigenvalues of the matrix H,

H|Ψ⟩ = ε|Ψ⟩    (1.2)

which is the time-independent Schrödinger equation. ETB, widely explained elsewhere,
determines the energy levels of H by solving the secular equation

det|H − εI| = 0    (1.3)

where I is the overlap matrix, which reduces to the unit matrix when neglecting
inter-atomic overlaps [20], and ε are the energy levels (eigenvalues).
The matrix H in equation 1.2 for the sp³d⁵s* parametrization used here [7] includes the
spin-orbit interactions, forming a 20×20 block matrix for each atom. In later chapters
we shall see at length methods to solve similar equations efficiently. The solution of the
eigenvalue problem defined in equation 1.2 provides the quantum energy eigenstates,
which give the charge density and allow the prediction of many other important
properties of the system.
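As a minimal illustration of this block structure, the sketch below assembles a toy two-atom Hamiltonian from hypothetical two-orbital blocks (not a real sp³d⁵s* parametrization; all parameter values are invented for illustration) and, since the orbitals are taken as orthogonal, solves the secular equation as a standard Hermitian eigenproblem:

```python
import numpy as np

# Toy two-"atom", two-orbital Hamiltonian (hypothetical parameters,
# not a real sp3d5s* parametrization): on-site blocks on the diagonal,
# one hopping block coupling the two atoms.
E_onsite = np.diag([-1.0, 0.5])          # on-site orbital energies
V_hop = np.array([[0.2, 0.1],
                  [0.1, 0.3]])           # inter-atomic hopping block

H = np.block([[E_onsite, V_hop],
              [V_hop.conj().T, E_onsite]])

# With orthogonal orbitals the overlap matrix reduces to the identity,
# so det|H - eps*I| = 0 becomes a standard Hermitian eigenproblem.
eps, psi = np.linalg.eigh(H)             # eigenvalues and eigenvectors

# Each column of psi satisfies H|psi_k> = eps_k |psi_k>
for k in range(len(eps)):
    assert np.allclose(H @ psi[:, k], eps[k] * psi[:, k])
```

For a realistic device the same structure repeats over millions of atoms with 20×20 blocks, which is precisely why dense diagonalization becomes intractable.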
1.3 Schrödinger equation and the eigenvalue problem
The wavefunction of a given physical system contains the measurable information
about the system. To obtain specific values for physical parameters, for example energy
eigenstates, one operates on the wavefunction with the quantum mechanical operator
associated with that parameter. The operator associated with energy is the Hamiltonian,
and the operation on the wavefunction is the Schrödinger equation given in equation
1.2. Thus, in linear algebra terminology, the time-independent Schrödinger equation is
an eigenvalue equation for the Hamiltonian operator [23], which is explained in more
detail in Chapter 3.

Solutions to the time-independent Schrödinger equation exist only for certain values
of energy, called the "eigenvalues" of energy. The band energy states form a discrete
spectrum of values, physically interpreted as quantization. Corresponding to each
eigenvalue is an "eigenfunction"; more specifically, the energy eigenstates form a basis.
The solution of the Schrödinger equation for a given energy ε_i also involves finding the
specific function |Ψ_i⟩ which describes that energy state. Any wavefunction may be
written as a sum over the discrete energy states, an integral over continuous energy
states, or more generally as an integral over a measure.
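The decomposition of an arbitrary state over the energy eigenbasis can be sketched numerically; the Hamiltonian below is just a random Hermitian matrix used for illustration, not a physical model:

```python
import numpy as np

rng = np.random.default_rng(0)
# Random Hermitian "Hamiltonian" (purely illustrative)
A = rng.standard_normal((6, 6)) + 1j * rng.standard_normal((6, 6))
H = (A + A.conj().T) / 2

eps, V = np.linalg.eigh(H)       # columns of V are the energy eigenstates

# Any state decomposes as |phi> = sum_i c_i |psi_i>, with c_i = <psi_i|phi>
phi = rng.standard_normal(6) + 1j * rng.standard_normal(6)
c = V.conj().T @ phi
assert np.allclose(V @ c, phi)   # the expansion reconstructs the state

# Acting with H scales each coefficient by its eigenvalue
assert np.allclose(H @ phi, V @ (eps * c))
```

This is exactly the property exploited by the iterative eigensolvers of later chapters: only the few eigenstates relevant to the charge density need to be resolved.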
1.4 Computational challenges of empirical tight
binding method
The pursuit of ever higher levels of detail and realism in nanoelectronics simulations
presents formidable modeling and computational challenges. Over the last two decades,
available computer power has grown, and with it the size of the systems that can be
treated with the TB method. As the nanostructured systems become larger, however,
the issue of scaling becomes crucial. The number of computational operations required
to diagonalize a matrix is proportional to the cube of the number of basis functions, and
thus to the number of atoms. This behavior is referred to as O(N³) scaling. As a result,
a thousand-fold increase in computer power only buys a ten-fold increase in system size.
The O(N³) scaling of the H matrix diagonalization limits the number of atoms in the
system to a few hundred thousand.
Realistic nanostructures fabricated in the lab are around 30 nm in size, comprising
≈ 1 million atoms. In III-V semiconductors every atom has 4 neighbors; since the
sp³d⁵s* + spin-orbit parametrization used here is based on 20 orbitals per atom, this
translates to an H matrix whose dimension is 20 times the number of atoms, with an
average of 40 non-zero values per row. The spin-orbit coupling adds an imaginary
component to the H matrix, doubling the problem size. The ETB method is
implemented using double precision arithmetic to ensure highly accurate solutions and
faster convergence. Since H is a Hermitian matrix, each non-zero value takes 16 bytes of
memory (double-complex data type), and the total memory needed just for the H matrix
generated from a realistic nanostructure is more than what is available on most
workstations. Consequently, such codes require large high performance computing
(HPC) facilities to run, reducing their accessibility to a wider range of users.
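The storage argument can be made concrete with a back-of-the-envelope estimate based on the figures quoted above; the sparse (CSR-style) index overhead is an assumption added here for illustration, not a description of any specific code:

```python
# Memory estimate for the ETB Hamiltonian of a ~1-million-atom nanostructure.
n_atoms = 1_000_000
orbitals_per_atom = 20            # sp3d5s* + spin-orbit basis
nnz_per_row = 40                  # average non-zeros per row
bytes_per_value = 16              # double-complex entries of the Hermitian H

dim = n_atoms * orbitals_per_atom           # matrix dimension: 20 million
nnz = dim * nnz_per_row                     # total stored non-zeros
values_gb = nnz * bytes_per_value / 1e9

# A CSR layout (an assumption here) also needs one 4-byte column index
# per value plus an 8-byte row pointer per row.
indices_gb = (nnz * 4 + (dim + 1) * 8) / 1e9

print(f"dimension: {dim}, values: {values_gb:.1f} GB, indices: {indices_gb:.2f} GB")
```

The values alone come to 12.8 GB, well beyond typical workstation memory of the time, which motivates both the distributed storage and the GPU memory optimizations developed in chapter 4.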
Thus, limited computational resources impose restrictions on the system size or force
one to introduce further approximations in the level of theory. Efforts are constantly
made to reduce computational cost in terms of run time and memory. The significant
challenges posed by large-scale ETB-based calculations have been addressed in this work
by the development of new HPC strategies for numerical algorithms and their
implementation on parallel architectures. A specialized implementation that spares
memory and minimizes machine-to-machine data transfers has been developed.
Furthermore, in order to study bigger, realistic nanostructured systems, a parallel
distributed approach using the standard message passing interface (MPI) is employed.
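The row-wise work distribution underlying such an MPI approach can be sketched as follows; the "ranks" are simulated in a plain loop, whereas a real implementation would use mpi4py (or MPI in C) and gather the vector between iterations:

```python
import numpy as np

# Sketch of row-wise distribution for a parallel matrix-vector product:
# each rank owns a contiguous block of rows of H and computes its slice
# of y = H @ x. The matvec is the kernel that iterative eigensolvers
# such as Lanczos repeat at every step.
rng = np.random.default_rng(1)
n, n_ranks = 12, 3
H = rng.standard_normal((n, n))
H = (H + H.T) / 2                 # symmetric toy matrix standing in for the ETB H
x = rng.standard_normal(n)

rows_per_rank = n // n_ranks
y = np.empty(n)
for rank in range(n_ranks):       # simulated ranks; really separate MPI processes
    lo, hi = rank * rows_per_rank, (rank + 1) * rows_per_rank
    y[lo:hi] = H[lo:hi, :] @ x    # each rank's local matvec on its row block

assert np.allclose(y, H @ x)      # matches the serial result
```

In the distributed setting each rank stores only its row block of H, so the 12.8 GB matrix never has to fit on one machine; the price is the communication needed to make x globally available, a cost addressed in chapter 4.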
1.5 Summary
The ETB model presented here is in fact the model of choice for the atomistic description
of the electronic properties of nanostructured devices, despite being less accurate and less
transferable than methods based on DFT. The ETB parametrization given by Jancu for
nearest-neighbor bond lengths has been used; despite the enormous storage cost that the
H matrix representation can incur, the ETB model is indeed the best approximation of
the energy functions for III-V semiconductors. However, large-scale atomistic calculations
involving the ETB approach must face the computational obstacle of diagonalizing
the TB Hamiltonian matrix. This bottleneck can be overcome by parallel computing
techniques or by the introduction of faster algorithms, which are reported in this work.
Chapter 2
Introduction to GPU and general
purpose GPU computing
In 1965, Gordon E. Moore made the interesting observation that the number of
transistors in a dense integrated circuit would double approximately every two
years [24, 25]. His prediction has proven accurate and has come to be known as
"Moore's law." The exponential increase in the number of transistors on a chip has
dramatically enhanced the impact of digital electronics on nearly every segment of life.
In the last few decades, microprocessor performance has drastically increased as a result
of many related advances, such as increased transistor density, increased transistor
performance, wider data paths, pipelining, faster processor speeds, superscalar
execution, speculative execution, caching, and chip- and system-level integration. As of
2012, every square millimeter of chip area holds up to 9 million transistors.
Microprocessors are easy to program because compilers have evolved right along with
the hardware they run on [26]. Users can ignore most of the complexity in a modern
central processing unit (CPU), since its microarchitecture is almost invisible.
Multi-core chips have the same software architecture as older multiprocessor systems:
a simple coherent memory model and a few identical computing engines [27, 28].
However, CPU cores continue to be optimized for single-threaded performance at the
expense of parallel execution. This fact is most apparent when one considers that the
integer and floating-point execution units occupy only a tiny fraction of the die area in a
modern CPU. With such a small part of the chip devoted to performing direct
calculations, it is no surprise that CPUs are relatively inefficient for HPC applications.
The need for CPU designers to maximize single-threaded performance is also behind
the use of aggressive process technology to achieve the highest possible clock rates.
However, this comes with significant costs. Faster transistors run hotter, cost more to
manufacture and leak more power even when they are not switching. Manufacturers of
high-end CPUs spend staggering amounts of money on process technology just to
improve single-threaded performance. The market demands general-purpose processors
that deliver high single-threaded performance as well as multi-core throughput for a
wide variety of workloads. This pressure has given us almost three decades of progress
toward higher complexity and higher clock rates. Each new generation of process
technology requires ever more heroic measures to improve transistor characteristics.
These challenges became more apparent in the late 20th century.
By 2005, the primary focus of processor manufacturers had shifted to increasing the
core count on chips. This approach, however, has reached a point of diminishing returns.
Dual-core CPUs provide noticeable benefits for most users, but are rarely fully utilized
except when working with multimedia content or multiple performance-hungry
applications. Most of the time, quad-core CPUs are only a slight improvement. As CPU
core design continues to progress, there will continue to be further improvements in
process technology, faster memory interfaces, and wider superscalar cores. However,
about a decade ago, processor architects realized that CPUs were no longer the preferred
solution for certain problems and started with a clean slate for a better solution.
A graphics processing unit (GPU) is a specialized electronic circuit designed to rapidly
manipulate data and alter memory [29, 30]. In a GPU, 80% of the transistors on the die
are devoted to data processing rather than to data caching and flow control as in a CPU,
because GPUs are designed to execute the same function on each element of data with
high arithmetic intensity. A simple way to understand a GPU is to look at the difference
between a CPU and a GPU and to compare how each processes tasks. Architecturally,
the CPU is composed of only a few cores with lots of cache memory, optimized for
sequential serial processing, that can handle a few software tasks at a time. In contrast,
a GPU has a massively parallel architecture consisting of thousands of smaller, more
efficient cores designed for handling thousands of tasks simultaneously. The ability of a
GPU with thousands of cores to process thousands of tasks can accelerate some software
by 100x over a CPU alone. Moreover, the GPU achieves this acceleration while being
more power- and cost-efficient than a CPU.
Figure 2.1: Schematic comparison of CPU and GPU structure (Source: NVIDIA)
In recent times, GPU computing has grown into a mainstream movement, supported
by the latest operating systems as well. The reason for this wide acceptance is that the
GPU is a computational powerhouse whose capabilities go far beyond basic graphics
controller functions and are growing faster than those of the CPU. GPU architectures
are becoming increasingly programmable, offering the potential for dramatic speedups
for a variety of general purpose applications compared to CPUs. GPU computing is not
meant to replace CPU computing; each approach has advantages for certain kinds of
software. As explained earlier, CPUs are optimized for applications where most of the
work is done by a limited number of threads, especially where the threads exhibit high
data locality, a mix of different operations, and a high percentage of conditional
branches. GPU design aims at the other end of the spectrum: applications with many
threads that are dominated by long sequences of computational instructions. In recent
times, GPUs have become much better at thread handling, data caching, virtual memory
management, flow control and other CPU-like features. However, the distinction between
computationally intensive and control-flow intensive procedures is fundamental. In a
GPU, since most of the circuitry within each core is dedicated to computation rather
than to speculative features meant to enhance single-threaded performance, most of the
die area and power consumed goes into the application's actual algorithmic work.
2.1 Towards a unified graphics computing architecture
The GPU is a processor with ample computational resources. The modern GPU has
evolved from a fixed-function graphics pipeline into a programmable parallel processor
with computing power exceeding that of multicore CPUs. Traditional GPUs organize
their graphics computation in a common structure called the graphics pipeline. This
pipeline is designed to allow hardware implementations to maintain high computation
rates through parallel execution. The pipeline is divided into several stages, and all
geometric primitives pass through every stage. In hardware, each stage is implemented
as a separate piece of hardware on the GPU, in what is termed a task-parallel machine
organization [31–34].
The input to the pipeline is a list of geometry, expressed as vertices in object
coordinates. The output is an image in a frame buffer. The first stage of the pipeline,
the geometry stage, transforms each vertex from object space into screen space then
assembles the vertices into triangles and traditionally performs lighting calculations on
each vertex. The output of the geometry stage is triangles in screen space. The next
stage, rasterization, determines the screen positions covered by each triangle and
interpolates per-vertex parameters across the triangle. The result of the rasterization
stage is a fragment for each pixel location covered by a triangle. The third stage, the
fragment stage, computes the color for each fragment using the interpolated values from
the geometry stage. In the final stage, composition, fragments are assembled into an
image of pixels usually by choosing the closest fragment to the camera at each pixel
location [33,34].
Over the years, graphics vendors have transformed the fixed-function pipeline into a
more flexible programmable pipeline [31–34]. This effort has been primarily
concentrated on two stages of the graphics pipeline: vertex processors operate on the
vertices of primitives such as points, lines, and triangles. Typical operations include
transforming coordinates into screen space which are then fed to the setup unit and the
rasterizer, and setting up lighting and texture parameters to be used by the
pixel-fragment processors. Pixel-fragment processors operate on rasterizer output which
fills the interior of primitives along with the interpolated parameters.
Vertex and pixel-fragment processors have evolved at different rates. Vertex
processors were designed for low-latency, high-precision math operations, whereas
pixel-fragment processors were optimized for high-latency, lower-precision texture
filtering. Vertex processors have traditionally supported more complex processing, so
they became programmable first. Each new generation of GPUs has increased the
functionality and generality of these two programmable stages. The two processor types
were functionally converging as the result of a need for greater programming generality.
However, the increased generality also increased the design complexity and cost of
developing two separate processors. Since GPUs typically must process more pixels than
vertices, pixel-fragment processors traditionally outnumber vertex processors by about
three to one. However, typical workloads were not well balanced leading to inefficiency.
These factors influenced the decision to design a unified architecture.
A primary design objective was to execute vertex and pixel-fragment shader
programs on the same unified processor architecture. Unification would enable dynamic
load balancing of varying vertex, pixel-processing workloads and permit the introduction
of new graphics shader stages such as geometry shaders. It also would allow the sharing
of expensive hardware such as the texture units. The generality required of a unified
processor opened the door to a completely new GPU parallel-computing capability.
In November 2006, NVIDIA introduced the Tesla architecture [34, 35] which unifies
the vertex and pixel processors and extends them, enabling high performance parallel
computing applications written in the C language using the Compute Unified Device
Architecture (CUDA) [36–40]. The Tesla architecture is based on a scalable processor
array. Due to its unified-processor design, the physical Tesla architecture does not resemble
the logical order of graphic pipeline stages. The following section gives a brief overview
of the recent GPU microarchitecture based on the new Tesla unified graphics computing
architecture which is utilized here to benchmark this work.
2.2 Architectural overview of the Tesla Kepler GPU
In 2012, NVIDIA introduced the Kepler GPU microarchitecture, the successor to the
Fermi microarchitecture. It comprises 7.1 billion transistors, making it at the time of its
release one of the fastest and most complex microprocessors ever built.
The Kepler microarchitecture uses a similar design to Fermi [41, 42], but with a couple
Figure 2.2: Full chip block diagram of Kepler microarchitecture based GPU (Source:
NVIDIA)
of key differences [43]. The Kepler architecture focuses on efficiency, programmability
and performance. The Kepler architecture employs a new streaming multiprocessor
architecture called the next-generation streaming multiprocessor (SMX). Each SMX
contains 192 cores, which suggests potential for considerably greater performance, even
though those cores run at a lower clock speed than Fermi's cores did. The redesigned
PolyMorph engines deliver twice the per-clock performance of Fermi's. The GPU as a
whole uses less power even as it delivers more performance. The reason for Kepler's
power efficiency is that the whole GPU uses a single Core clock rather than the
double-pump Shader clock [44]. The Kepler implementations include 15 SMX units and
six 64-bit memory controllers. Different GK110/210 products use different
configurations.
2.2.1 Next-generation streaming multiprocessor
Each SMX unit consists of 192 single-precision cores, 64 double-precision units, 32
special function units, 32 load/store units, 64 KB of shared memory, and 48 KB of
read-only data cache. The shared memory and the data cache are accessible to all
Figure 2.3: Architectural overview of next-generation streaming multiprocessor (SMX)
within Kepler microarchitecture (Source: NVIDIA)
threads executing on the same streaming multiprocessor. Each core within SMX has
fully pipelined floating-point and integer arithmetic logic units. Floating-point
operations follow the IEEE 754-2008 floating-point standard. Each core can perform one
single-precision fused multiply-add (FMA) operation in each clock period and one
double-precision FMA in two clock periods. FMA support also increases the accuracy
and performance of other mathematical operations such as division and square root and
more complex functions such as extended-precision arithmetic, interval arithmetic and
linear algebra. The integer ALU supports the usual mathematical and logical operations
including multiplication on both 32-bit and 64-bit values. Memory operations are
handled by the load-store units. The load/store instructions can now refer to memory in
terms of two-dimensional arrays providing addresses in terms of x and y values. Kepler
is designed to significantly increase the GPU's double-precision performance. The 32
special function units (SFUs) are also available to handle transcendental and other
special operations such as sin, cos, exp (exponential) and rcp (reciprocal) [43,45–47].
2.2.2 Instruction scheduler
The SMX schedules threads in groups of 32 parallel threads called warps. Each SMX
features four warp schedulers and eight instruction dispatch units allowing four warps to
be issued and executed concurrently. Kepler's quad warp scheduler selects four warps, and
two independent instructions per warp can be dispatched each cycle. Kepler allows double
precision instructions to be paired with other instructions [45,48].
Figure 2.4: Warp scheduler within next-generation streaming multiprocessors (Source:
NVIDIA)
2.2.3 Memory model
The number of registers that can be accessed by a thread has been quadrupled in Kepler
allowing each thread access to up to 255 registers. Codes that exhibit high register pressure
or spilling behavior in previous microarchitectures may see substantial speedups as a result
of the increased available per-thread register count. Kepler also implements a new shuffle
instruction which allows threads within a warp to share data. Previously, sharing data
between threads within a warp required separate store and load operations to pass the
data through shared memory. With the shuffle instruction, threads within a warp can
read values from other threads in the warp in just about any imaginable permutation.
Figure 2.5: Kepler GPU memory hierarchy (Source: NVIDIA)
The Kepler microarchitecture provides for local memory in each streaming
multiprocessor. The Kepler architecture supports a unified memory request path for
loads and stores with an L1 cache per SMX multiprocessor. In the Kepler GK110
architecture, each SMX has 64 KB of on-chip memory that can be configured as 48 KB
of shared memory with 16 KB of L1 cache or as 16 KB of shared memory with 48 KB of
L1 cache. Kepler also allows for additional flexibility in configuring the allocation of
shared memory and L1 cache by permitting a 32 KB/32 KB split between shared
memory and L1 cache. The decision to allocate 16 KB, 48 KB or 32 KB of the local
memory as cache usually depends on two factors: how much shared memory is needed
and how predictable the kernel’s accesses to global memory are likely to be. A larger
shared-memory requirement argues for less cache; more frequent or unpredictable
accesses to larger regions of DRAM argue for more cache. For the GK210 architecture,
the total amount of configurable memory is doubled to 128 KB allowing a maximum of
112 KB shared memory and 16 KB of L1 cache. Other possible memory configurations
are 32 KB L1 cache with 96 KB shared memory or 48 KB L1 cache with 80 KB of
shared memory.
In addition to the L1 cache, Kepler introduces a 48 KB cache for data that is known
to be read-only for the duration of the function. Use of the read-only path is beneficial
because it takes both load and working set footprint off the shared/L1 cache path. The
Kepler GK110/210 GPUs feature 1536 KB of dedicated L2 cache memory. The L2 cache
is the primary point of data unification between the SMX units servicing all load, store
and texture requests and providing efficient, high speed data sharing across the GPU.
The L2 cache subsystem also implements another feature not found on CPUs: a set of
memory read-modify-write operations that are atomic and thus ideal for managing access
to data that must be shared across thread blocks or even kernels. L1 and L2 caches help in
improving the random memory access performance while the texture cache enables faster
texture filtering. The programs also have access to a dedicated shared memory which is
a small software-managed data cache attached to each multiprocessor shared among the
cores. This is a low-latency, high-bandwidth, indexable memory which runs essentially at
register speeds. Kepler’s register files, shared memories, L1 cache, L2 cache and DRAM
memory are protected by a single-error correct double-error detect ECC code.
2.2.4 Advanced features
In Kepler, Hyper-Q enables multiple CPU cores to launch work on a single GPU
simultaneously; thereby, expanding Kepler GPU hardware work queues from 1 to
32 [45, 46]. The significance of this is that with a single work queue, previous GPUs
could be under-occupied at times if there wasn't enough work in that queue to fill every
streaming multiprocessor. By having 32 work queues, Kepler can in
many scenarios achieve higher utilization by being able to put different task streams on
what would otherwise be an idle SMX.
When working with a large amount of data, increasing the data throughput and
reducing latency is vital to increasing compute performance. Kepler GK110/210
supports the RDMA feature in NVIDIA GPUDirect which is designed to improve
performance by allowing direct access to GPU memory by third-party devices [45, 46].
GPUDirect provides direct memory access (DMA) between NIC and GPU without the
need for CPU side data buffering. GPUDirect enables much higher aggregate bandwidth
for GPU-to-GPU communication within a server and across servers with the
Peer-to-Peer and RDMA features.
Kepler has a possibility of dynamic parallelism which allows the GPU to generate
new work for itself, synchronize on results and control the scheduling of that work via
dedicated, accelerated hardware paths all without involving the CPU [45,46]. In previous
GPUs, all work was launched from the host CPU, run to completion, and a result returned
back to the CPU. The result would then be used as part of the final solution or would
be analyzed by the CPU which would then send additional requests back to the GPU for
additional processing. In Kepler, any kernel can launch another kernel and can create the
Figure 2.6: Direct Peer-to-Peer data transfer between two GPUs using GPUDirect
(Source: NVIDIA)
necessary streams, events and manage the dependencies needed to process additional work
without the need for host CPU interaction. This architectural innovation makes it easier
for developers to create and optimize recursive and data-dependent execution patterns
and allows more of a program to be run directly on the GPU.
2.3 CUDA programming model
In November 2006, NVIDIA introduced CUDA, a general purpose parallel computing
architecture with a new parallel programming model and instruction set architecture.
CUDA comes with a software environment that allows developers to use C as a high-
level programming language [37, 49]. At its core are three key abstractions: a hierarchy
of thread groups, shared memories, and barrier synchronization, which are exposed
to the programmer as a minimal set of language extensions. These abstractions provide
fine-grained data parallelism and thread parallelism nested within coarse-grained data
parallelism and task parallelism. They guide the programmer to partition the problem
into coarse sub-problems that can be solved independently in parallel by blocks of threads
and each sub-problem into finer pieces that can be solved cooperatively in parallel by all
threads within the block [38–40].
CUDA extends C by allowing the programmer to define C functions called
kernels [50]. A kernel is the parallel portion of the application that will execute on the
GPU. Kernels are executed N times in parallel by N different CUDA threads as opposed
to only once like regular C functions. Each thread that executes the kernel is given a
unique thread ID that is accessible within the kernel through the built-in threadIdx
variable. threadIdx is a three-component vector, so that threads can be identified using
a one-dimensional, two-dimensional or three-dimensional thread index, forming a
one-dimensional, two-dimensional or three-dimensional thread block.
Figure 2.7: (Left) Grid of thread blocks (Source: NVIDIA). (Right) CUDA execution
model
There is a limit to the number of threads per block, since all threads of a block
are expected to reside on the same processor core and must share the limited memory
resources of that core. A kernel can be executed by multiple equally-shaped thread blocks
so that the total number of threads is equal to the number of threads per block times the
number of blocks. Blocks are organized into a one-dimensional or two-dimensional grid of
thread blocks. The number of thread blocks in a grid is usually dictated by the size of the
data being processed or the number of processors in the system. Each block within the
grid can be identified by a one-dimensional or two-dimensional index accessible within
the kernel through the built-in blockIdx variable. The dimension of the thread block is
accessible within the kernel through the built-in blockDim variable.
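The index arithmetic above can be mimicked in plain Python. The sketch below is an illustration only, not CUDA code, and the grid and block sizes are arbitrary example values; it shows how each thread of a one-dimensional launch derives a unique global index from blockIdx, blockDim, and threadIdx:

```python
# Illustrative pure-Python mimic of 1-D CUDA thread indexing.
# gridDim_x and blockDim_x are arbitrary example choices.
gridDim_x = 4      # number of thread blocks in the grid
blockDim_x = 8     # number of threads per block

def global_indices(gridDim_x, blockDim_x):
    """Enumerate the global index each thread would compute as
    blockIdx.x * blockDim.x + threadIdx.x."""
    ids = []
    for blockIdx_x in range(gridDim_x):        # blocks may run in any order
        for threadIdx_x in range(blockDim_x):  # threads within one block
            ids.append(blockIdx_x * blockDim_x + threadIdx_x)
    return ids

# Every data element 0..31 is covered exactly once.
ids = global_indices(gridDim_x, blockDim_x)
print(ids == list(range(gridDim_x * blockDim_x)))  # True
```

In a real kernel every thread evaluates this expression concurrently; the nested loops here only enumerate the same index space serially.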
Thread blocks are required to execute independently: it must be possible to execute them in any order, in parallel or in
series. This independence requirement allows thread blocks to be scheduled in any order
across any number of cores. Threads within a block can cooperate by sharing data through
some shared memory and by synchronizing their execution to coordinate memory accesses.
More precisely, one can specify synchronization points in the kernel by calling a barrier
at which all threads in the block must wait before any is allowed to proceed.
CUDA threads may access data from multiple memory spaces during their execution.
Each thread has private local memory. Each thread block has shared memory visible to all
threads of the block and with the same lifetime as the block. All threads have access to the
same global memory. There are also two additional read-only memory spaces accessible
by all threads: the constant and texture memory spaces. The global, constant and texture
memory spaces are persistent across kernel launches by the same application.
2.4 General-purpose computing on graphics
processing units
Traditionally, powerful GPUs have been useful mostly to gamers looking for realistic
experiences along with engineers and creatives needing 3D modeling functionality.
General-purpose computing on GPUs only became practical and popular after 2001 with
the advent of both programmable shaders and floating point support on graphics
processors. In particular, problems involving matrices and/or vectors, especially two-,
three-, or four-dimensional vectors, were easy to translate to a GPU, which acts with
native speed and support on those types. The scientific computing community's
experiments with the new hardware started with a matrix multiplication routine. These
early efforts to use GPUs as general-purpose processors required reformulating
computational problems in terms of graphics primitives as supported by the two major
APIs for graphics processors, OpenGL and DirectX [33]. This cumbersome translation
was obviated by the advent of general-purpose programming languages and APIs such
as Sh/RapidMind, Brook and Accelerator [31,51,52].
These were followed by NVIDIA’s CUDA, which allowed programmers to ignore the
underlying graphical concepts in favor of more common high-performance computing
concepts [32, 53]. Newer, hardware vendor-independent offerings include Microsoft’s
DirectCompute and Apple/Khronos Group’s OpenCL [53]. This means modern GPGPU
pipelines can act on any big data operation and leverage the speed of a GPU without
requiring full and explicit conversion of the data to a graphical form [50].
GPU flexibility has increased over the last decade thanks to massive multi-core
parallelization delivering high-throughput capabilities even on double-precision
arithmetic, to increased on-board memory, and to the efforts made by vendors in
facilitating programmability. GPU accelerated computing has revolutionized the HPC
industry. Researchers have quickly realized that many real world problems map very
well to the pipelined single instruction multiple data (SIMD) hardware in the GPU’s
streaming processors. There are many computational applications across a wide range of
fields already optimized for GPUs. Some examples are: Molecular dynamics [54–57],
Quantum chemistry [58–62], Materials science [63, 64], Bioinformatics [65–69],
Physics [70–74], Numerical analytics [75–77], Fluid dynamics [78–80], Medical
imaging [81–83], Finance [84,85].
While the GPU has many benefits, such as more computing power, larger memory
bandwidth, and lower power consumption, there are some constraints to fully utilizing its
processing power. Developing code for the GPU takes more time and needs more
sophisticated work; gaining relevant speedup requires that algorithms are coded to
reflect the GPU architecture, and programming for the GPU differs significantly from
programming traditional CPUs. In particular, incorporating GPU acceleration into
pre-existing codes is more difficult than just moving from one CPU family to another. A
GPU-savvy programmer needs to dive into the code and make significant changes to
critical components. Also, GPU code runs in parallel, so data-partitioning and
synchronization techniques are needed, which also enforce access levels for different
categories of memory. The low-bandwidth PCI-E bus that physically connects the GPU
to the rest of the system is one of the main performance-limiting factors: transferring
anything over PCI-E lowers the speed roughly twentyfold compared to on-board
memory, so the performance of the GPU can drop by an order of magnitude. These
constraints make performance optimization more difficult. Also, the GPU's debugging
environment is not as powerful as that of a general CPU.
2.5 Summary
The GPU is the most powerful computing engine available to computational scientists and
is being utilized in a wide range of scientific computing applications. What makes the GPU
so powerful is its thousands of identical cores that run at a lower clock rate than a CPU's
but are optimized for repetitive SIMD-type operations on big data sets, along with its high
memory bandwidth and ease of programmability using a high-level language. However,
certain types of applications are better suited to GPU computing than others. Most
applications need to be re-coded extensively for the GPU, and one needs a deep
understanding of the GPU architecture and memory model to obtain optimal speedups.
The ongoing remarkable effort put in by GPU vendors has resulted in a generation of more
sophisticated, easily programmable, compute-optimized GPU architectures.
Chapter 3
Introduction to Eigensolvers
The theory and computation of eigenvalue problems are among the most successful and
widely used tools of applied mathematics and scientific computing. Eigenvalue problems
find application in a variety of scientific and engineering areas, including
acoustics, control theory, earthquake engineering, graph theory, Markov chains, pattern
recognition, quantum mechanics, stability analysis, quantum physics, material sciences
and many other areas. The increasing number of applications and the ever-growing scale
of problems have motivated fundamental progress in the numerical solution of eigenvalue
problems.
Eigenvalues are often introduced in the context of linear algebra or matrix theory.
However, historically, they arose in the study of quadratic forms and differential
equations. In the 18th century, Euler studied the rotational motion of a rigid body and
discovered the importance of the principal axes. Lagrange realized that the principal
axes are the eigenvectors of the inertia matrix [86]. In the early 19th century, Cauchy
saw how their work could be used to classify the quadric surfaces and generalized it to
arbitrary dimensions. At the start of the 20th century, Hilbert studied the eigenvalues of
integral operators by viewing the operators as infinite matrices [87]. He was the first to
use the word “eigen.” The first numerical algorithm for computing eigenvalues and
eigenvectors appeared in 1929 when von Mises published the power method [88].
An eigenvector of an N×N square matrix A is a non-zero vector v that, when multiplied
with A, yields a scalar (λ) multiple of itself.
Av = λv (3.1)
This equation is referred to as the standard eigenvalue problem. Here, λ is an eigenvalue
of A, v is the corresponding right eigenvector and (λ, v) is called an eigenpair. The set of
all eigenvectors of a matrix, each paired with its corresponding eigenvalue is called the
eigensystem of that matrix [89]. The full set of eigenvalues of A is called the spectrum and
is denoted by λ(A) = {λ₁, λ₂, ..., λₙ}. Any multiple of an eigenvector is also an eigenvector
with the same eigenvalue. An eigenspace of a matrix A is the set of all eigenvectors with
the same eigenvalue together with the zero vector. An eigenbasis for A is any basis of
the whole vector space that consists of linearly independent eigenvectors of A.
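As a concrete check of Eq. (3.1), the following pure-Python sketch (the 2×2 symmetric matrix is an arbitrary example) verifies an eigenpair and the fact that any multiple of an eigenvector is again an eigenvector with the same eigenvalue:

```python
# Verify A v = lambda v (Eq. 3.1) for a small example matrix.
A = [[2.0, 1.0],
     [1.0, 2.0]]           # symmetric example with eigenvalues 1 and 3

def matvec(A, v):
    """Dense matrix-vector product."""
    return [sum(A[i][j] * v[j] for j in range(len(v))) for i in range(len(A))]

v = [1.0, 1.0]              # candidate eigenvector
Av = matvec(A, v)           # [3.0, 3.0] = 3 * v, so (3, v) is an eigenpair
lam = Av[0] / v[0]
print(lam)                  # 3.0

# Any multiple of an eigenvector is an eigenvector with the same eigenvalue.
w = [5.0 * x for x in v]
print(matvec(A, w) == [lam * x for x in w])  # True
```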
In solving an eigenvalue problem, there are a number of properties that need to be
considered, like the type of matrix (real or complex), the structure of the matrix (band,
sparse, structured sparseness, Toeplitz), special properties of the matrix (symmetric,
Hermitian, skew-symmetric, unitary), and the type of eigenvalues required (largest, smallest,
inner, sums of intermediate eigenvalues). These greatly affect the choice of algorithm.
There are a variety of more complicated eigenproblems, for instance the generalized
eigenproblem Ax = λBx, quadratic problems such as Ax + λBx + λ²Cx = 0, higher-order
polynomial problems, and nonlinear eigenproblems. All these problems are considerably
more complicated than the standard eigenproblem, depending on the operators involved.
In numerical mathematics, several different techniques needed to calculate the
eigenpairs have been developed. These techniques can be divided into two main groups:
“direct methods” and “iterative methods.” The first group comprises algorithms for
medium-sized problems that calculate from one up to all eigenvalues. The second
comprises methods for huge eigenvalue equations that calculate only a few eigenpairs by
projecting the huge problem onto a much smaller search space, which is built up within
the algorithm. The projected system is small enough to be solved by techniques of the
former group.
3.1 Direct methods
In this section, we briefly discuss various direct methods for the computation of
eigenvalues of matrices that are small enough to be stored in computer memory as full
matrices. These direct methods are sometimes called transformation methods and are
built up around similarity transformations. They transform the matrix to a simpler
form and find all the eigenvalues and eigenvectors.
3.1.1 QR algorithm
This algorithm finds all the eigenvalues and optionally all the eigenvectors. The basic
idea is to perform QR decomposition [90–92]. The QR algorithm consists of two separate
stages. First, by means of a similarity transformation, the original matrix is transformed
in a finite number of steps to Hessenberg form or in the Hermitian/symmetric case to real
tridiagonal form. This first stage of the algorithm prepares it for the second stage which is
the actual QR iterations that are applied to the Hessenberg or tridiagonal matrix [93]. It
takes O(n²) floating point operations to find all the eigenvalues of a tridiagonal matrix.
Since reducing a dense matrix to tridiagonal form costs (4/3)n³ floating point operations,
the O(n²) part is negligible for large enough n. For finding all the eigenvectors as well,
QR iteration takes a little over 6n³ floating point operations on average.
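The iteration at the heart of the second stage can be sketched in a few lines. The fragment below is pure Python and uses unshifted iterations on an arbitrary 2×2 symmetric example (production codes use shifts and deflation); it repeatedly factors A = QR and re-forms A = RQ, which drives the matrix toward diagonal form with the eigenvalues on the diagonal:

```python
import math

def qr_2x2(A):
    """Gram-Schmidt QR factorization of a 2x2 matrix."""
    a, b = A[0]
    c, d = A[1]
    n1 = math.hypot(a, c)
    q1 = (a / n1, c / n1)                   # first orthonormal column
    r12 = q1[0] * b + q1[1] * d             # projection of column 2 onto q1
    u = (b - r12 * q1[0], d - r12 * q1[1])
    n2 = math.hypot(u[0], u[1])
    q2 = (u[0] / n2, u[1] / n2)
    Q = [[q1[0], q2[0]], [q1[1], q2[1]]]
    R = [[n1, r12], [0.0, n2]]
    return Q, R

A = [[2.0, 1.0], [1.0, 2.0]]                # eigenvalues are 3 and 1
for _ in range(40):                         # unshifted QR iterations: A <- R Q
    Q, R = qr_2x2(A)
    A = [[sum(R[i][k] * Q[k][j] for k in range(2)) for j in range(2)]
         for i in range(2)]

print(round(A[0][0], 6), round(A[1][1], 6))  # 3.0 1.0 (off-diagonal -> 0)
```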
3.1.2 Divide-and-conquer method
An eigenvalue problem is divided into two problems of roughly half the size, each of
these are solved recursively and the eigenvalues of the original problem are computed
from the results of these smaller problems. This algorithm was originally proposed by
Cuppen [94]. However, it took ten more years until a stable variant was found by Gu
and Eisenstat [95,96]. The advantage of divide-and-conquer comes when eigenvectors are
needed as well. If this is the case, reduction to tridiagonal form takes (8/3)n³ operations,
but the second part of the algorithm takes O(n³) as well. For the QR algorithm with a
reasonable target precision, this is ≈ 6n³, whereas for divide-and-conquer it is ≈ (4/3)n³.
The reason for this improvement is that in divide-and-conquer the O(n³) part of the
algorithm is separate from the iteration, whereas in QR, this must occur in every iterative
step. Adding the (8/3)n³ flops for the reduction, the total improvement is from ≈ 9n³ to
≈ 4n³ flops. The divide-and-conquer approach is now the fastest algorithm for computing
all eigenvalues and eigenvectors of a symmetric matrix of order larger than 25; this also
holds true for non-parallel computers. If the subblocks are of order greater than 25, then
they are further reduced; otherwise, the QR algorithm is used for computing the eigenvalues
and eigenvectors of the subblock [97].
3.1.3 Bisection method and inverse iteration
Bisection may be used to find just a subset of the eigenvalues, like those in an interval [a, b].
It needs only O(nk) floating point operations, where k is the number of eigenvalues desired.
Thus the bisection method can be much faster than the QR method when k ≪ n. It
can be highly accurate, but may be adjusted to run faster if lower accuracy is acceptable
[98, 99]. Inverse iteration can then be used to find the corresponding eigenvectors. In the
best case, when the eigenvalues are well separated, inverse iteration also costs only O(nk)
floating point operations. This is much less than either QR or divide-and-conquer, even
when all eigenvalues and eigenvectors are desired (k = n). On the other hand, when many
eigenvalues are clustered close together, Gram-Schmidt orthogonalization will be needed
to make sure that one does not get several identical eigenvectors. This will add O(nk²)
floating point operations to the operation count in the worst case.
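A key ingredient of bisection for a symmetric tridiagonal matrix is a Sturm-sequence count: a short pivot recurrence that returns how many eigenvalues lie below a trial shift, which bisection then uses to narrow an interval. A minimal pure-Python sketch follows; the tridiagonal example values are arbitrary, and a robust code must additionally guard against exact zero pivots:

```python
def count_below(diag, off, x):
    """Number of eigenvalues of the symmetric tridiagonal matrix
    (diagonal `diag`, off-diagonal `off`) that are smaller than x,
    via the Sturm sequence / LDL^T pivot recurrence."""
    count = 0
    d = 1.0
    for i in range(len(diag)):
        b2 = off[i - 1] ** 2 if i > 0 else 0.0
        d = diag[i] - x - b2 / d      # i-th pivot of LDL^T of (T - x I)
        if d < 0.0:
            count += 1                # one negative pivot = one eigenvalue < x
    return count

# Example: T = [[2, 1], [1, 2]] has eigenvalues 1 and 3.
diag, off = [2.0, 2.0], [1.0]
print(count_below(diag, off, 0.0))   # 0
print(count_below(diag, off, 2.5))   # 1
print(count_below(diag, off, 4.0))   # 2
```

Repeatedly halving an interval [a, b] and comparing these counts isolates each desired eigenvalue to any requested accuracy.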
3.1.4 Jacobi method
The Jacobi method is mostly used for solving Hermitian eigenvalue problems. This method
constructs an orthogonal transformation to diagonal form, A = XΛX*, by applying a
sequence of elementary orthogonal rotations, each time reducing the sum of squares of
the nondiagonal elements of the matrix, until it is of diagonal form to working
accuracy [100]. The Jacobi algorithm has been very popular since its implementation is
very simple and gives eigenvectors that are orthogonal to working accuracy. However, it
cannot compete with the QR method in terms of operation counts: Jacobi needs 2sn³
multiplications for s sweeps, which is more than the (4/3)n³ needed for tridiagonal
reduction. There is one important advantage to the Jacobi algorithm: it can deliver
eigenvalue approximations with a small error in the relative sense, in contrast to
algorithms based on tridiagonalization, which only guarantee that the error is bounded
relative to the norm of the matrix [101, 102].
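Each elementary rotation can be written down explicitly for a 2×2 subproblem. The pure-Python sketch below uses an arbitrary example matrix (a full Jacobi code sweeps such rotations over all off-diagonal pairs of a larger matrix) and picks the angle that annihilates the off-diagonal entry:

```python
import math

def jacobi_rotate_2x2(a, b, d):
    """One Jacobi rotation applied to the symmetric matrix [[a, b], [b, d]]:
    returns the rotated diagonal entries and the (zeroed) off-diagonal."""
    if b == 0.0:
        return a, d, 0.0
    theta = 0.5 * math.atan2(2.0 * b, a - d)      # angle that annihilates b
    c, s = math.cos(theta), math.sin(theta)
    a_new = c * c * a + 2.0 * s * c * b + s * s * d
    d_new = s * s * a - 2.0 * s * c * b + c * c * d
    b_new = (d - a) * s * c + b * (c * c - s * s)  # -> 0 up to roundoff
    return a_new, d_new, b_new

# Example: [[2, 1], [1, 2]] is diagonalized by a 45-degree rotation.
a_new, d_new, b_new = jacobi_rotate_2x2(2.0, 1.0, 2.0)
print(round(a_new, 6), round(d_new, 6))   # 3.0 1.0
print(abs(b_new) < 1e-12)                 # True
```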
3.2 Iterative methods
Theoretically, the numerical algorithms mentioned above are applicable for arbitrary
dimensions but practically they are limited by memory restrictions and computational
time. The effort of the QR algorithm is O(n³) and cannot be handled for large n on
current computers. In this section, numerical methods are introduced that calculate a
few eigenvalues with less computational cost. The well-known iterative methods for
solving eigenvalue problems are the power method (the inverse iteration), the Krylov
subspace methods, the Jacobi-Davidson algorithm, and the FEAST method. Traditionally, if
the extreme eigenvalues are not well separated or the eigenvalues sought are in the
interior of the spectrum, a shift-and-invert transformation has to be used in combination
with these eigenvalue problem solvers.
3.2.1 Power iteration method
The power iteration is a very simple algorithm. It does not compute a matrix
decomposition; the basic idea is to multiply the matrix A repeatedly by a well-chosen
starting vector, so that the component of that vector in the direction of the eigenvector
with the largest eigenvalue in absolute value is magnified relative to the other
components [88]. The speed of convergence of the power iteration depends on the ratio
of the second largest eigenvalue to the largest eigenvalue.
It is interesting that the most effective variant is the inverse power method with shift,
which can find interior as well as exterior eigenvalues [103]. The idea of this method is to
apply the power method on A⁻¹ or on the inverse of the shifted matrix, (A − µ₀I)⁻¹. The
eigenvalues of A⁻¹ are the inverses of the eigenvalues of A. Thus, the inverse power method
finds the eigenvalue closest to zero. The eigenvalue of the shifted matrix (A − µ₀I) smallest
in absolute value corresponds to the eigenvalue of A closest to µ₀. Therefore, this method
can find any simple eigenvalue when an appropriate guess µ₀ is available.
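A minimal pure-Python sketch of the basic power iteration follows; the 2×2 matrix and starting vector are arbitrary example choices:

```python
import math

def power_iteration(A, v, iters=60):
    """Repeatedly apply A and normalize; returns the dominant
    eigenvalue estimate (Rayleigh quotient) and eigenvector."""
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(len(v))) for i in range(len(A))]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Av = [sum(A[i][j] * v[j] for j in range(len(v))) for i in range(len(A))]
    lam = sum(v[i] * Av[i] for i in range(len(v)))   # Rayleigh quotient
    return lam, v

A = [[2.0, 1.0], [1.0, 2.0]]        # eigenvalues 1 and 3
lam, v = power_iteration(A, [1.0, 0.0])
print(round(lam, 6))                # 3.0 (the dominant eigenvalue)
```

To target instead the eigenvalue closest to a shift µ₀, the same loop would apply (A − µ₀I)⁻¹, i.e. solve a linear system with the shifted matrix at each step.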
3.2.2 Rayleigh quotient iteration method (RQI)
RQI is an eigenvalue algorithm which extends the idea of inverse iteration by using the Rayleigh quotient to obtain increasingly accurate eigenvalue estimates [104]. Starting with a normalized putative eigenvector, a sequence of normalized approximate eigenvectors is generated together with their associated Rayleigh quotients. The RQI algorithm converges cubically for Hermitian or symmetric matrices, given an initial vector sufficiently close to an eigenvector of the matrix being analyzed. If the matrix is non-Hermitian, it is still possible to obtain cubic convergence by using a two-sided version of the algorithm. The drawbacks of the RQI method are that it may converge to an eigenvalue which is not the closest to the desired one, and that the algorithm has a high computational cost, since it requires a factorization at every iteration.
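The iteration described above can be sketched in a few lines of Python; this is a minimal serial illustration on a small invented symmetric matrix, with a dense Gaussian-elimination solve standing in for the factorization performed at every step.

```python
import math

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def solve(A, b):
    """Gaussian elimination with partial pivoting (small dense systems only)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= f * M[k][c]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        x[k] = (M[k][n] - sum(M[k][c] * x[c] for c in range(k + 1, n))) / M[k][k]
    return x

def rayleigh_quotient_iteration(A, v, maxit=20, tol=1e-9):
    """RQI: the shift is refreshed with the Rayleigh quotient at every step."""
    n = len(A)
    v = normalize(v)
    for _ in range(maxit):
        mu = sum(x * y for x, y in zip(v, matvec(A, v)))   # Rayleigh quotient
        r = [y - mu * x for x, y in zip(v, matvec(A, v))]  # residual A v - mu v
        if math.sqrt(sum(x * x for x in r)) < tol:
            break
        shifted = [[A[i][j] - (mu if i == j else 0.0) for j in range(n)] for i in range(n)]
        v = normalize(solve(shifted, v))  # one inverse-iteration step with the new shift
    return mu, v

A = [[2.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 4.0]]
```

Note that, as stated above, the iteration is not guaranteed to converge to the eigenvalue closest to the initial Rayleigh quotient, only to some eigenpair near the starting vector.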
3.2.3 Arnoldi method
The Arnoldi method was first introduced as a direct algorithm for reducing a general
matrix into upper Hessenberg form [105]. It was later discovered that this algorithm
leads to a good iterative technique for approximating eigenvalues of large sparse matrices.
The Arnoldi method belongs to a class of linear algebra algorithms based on the idea of Krylov subspaces that give a partial result after a relatively small number of iterations. It is an orthogonal projection method onto a Krylov subspace. The procedure can essentially be viewed as a modified Gram-Schmidt process for building an orthogonal basis of the Krylov subspace K_m(A, v). The cost of orthogonalization increases as the method proceeds. A convergence analysis of eigenvector approximation using the Arnoldi method can be found in [106, 107].
As CPU time and memory needed to manage the Krylov subspace increase with its
dimension, a subspace restarting strategy is necessary. Roughly speaking, the restarting
strategy builds a new subspace of smaller dimension by extracting the desired
approximate eigenvectors from the current subspace of a larger dimension. An elegant
implicit restarting strategy based on the shifted-QR algorithm was proposed by
Sorensen [108]. This method generates a new Krylov subspace of smaller dimension
without using matrix-vector products involving A. The resulting algorithm is called the
implicitly restarted Arnoldi (IRA) method.
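The basic (unrestarted) Arnoldi process can be sketched as follows: a modified Gram-Schmidt sweep builds an orthonormal basis Q of K_m(A, v) together with the projected (m+1)×m Hessenberg matrix. The small test matrix is invented for the example.

```python
import math

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def arnoldi(A, v1, m):
    """Return Q (m+1 orthonormal vectors) and H with A*Q[j] = sum_i H[i][j]*Q[i]."""
    nrm = math.sqrt(sum(x * x for x in v1))
    Q = [[x / nrm for x in v1]]
    H = [[0.0] * m for _ in range(m + 1)]
    for j in range(m):
        w = matvec(A, Q[j])
        for i in range(j + 1):                    # modified Gram-Schmidt sweep
            H[i][j] = sum(a * b for a, b in zip(Q[i], w))
            w = [a - H[i][j] * b for a, b in zip(w, Q[i])]
        H[j + 1][j] = math.sqrt(sum(x * x for x in w))
        if H[j + 1][j] < 1e-12:                   # breakdown: invariant subspace found
            break
        Q.append([x / H[j + 1][j] for x in w])
    return Q, H

A = [[4.0, 1.0, 0.0, 0.0],
     [1.0, 3.0, 1.0, 0.0],
     [0.0, 1.0, 2.0, 1.0],
     [0.0, 0.0, 1.0, 1.0]]
Q, H = arnoldi(A, [1.0, 0.0, 0.0, 0.0], 3)
```

For a Hermitian A, as here, the projected matrix H comes out tridiagonal, which is precisely the simplification exploited by the Lanczos method of the next subsection.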
3.2.4 Lanczos method
The Lanczos algorithm can be viewed as a simplification of Arnoldi's algorithm for the case of Hermitian matrices. It is an effective iterative method to find eigenvalues and eigenvectors of large sparse matrices by first building an orthonormal basis and then forming approximate solutions using Rayleigh projection. It reduces a large, complicated eigenvalue problem to a simpler one [109, 110], explicitly taking advantage of the symmetry of the matrix. However, the Lanczos method diverges when implemented on a finite precision architecture, since the Lanczos vectors inevitably lose
their mutual orthogonality [110, 111]. Hence, it needs a full reorthogonalization of each
newly computed vector against all preceding Lanczos vectors. This not only greatly
increases the number of computations required, but also requires that all the vectors be
stored. For large problems, it will be very expensive to take more than a few steps using
full reorthogonalization. Nevertheless, linear independence will surely be lost without
some sort of corrective procedure.
Selective orthogonalization interpolates between full reorthogonalization and simple
Lanczos to obtain the best of both worlds. Robust linear independence is maintained
among the vectors at a cost which is close to that of simple Lanczos [112,113]. Another
way to maintain orthogonality is to limit the size of the basis set and use a restarting
scheme by replacing the starting vector with an improved starting vector and computing
a new Lanczos factorization with the new vector.
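The three-term recurrence underlying the method can be sketched as follows; only the current and previous basis vectors are kept, the scalars alpha_i, beta_i define the tridiagonal matrix T, and, as in the simple Lanczos discussed above, no reorthogonalization is performed. The small symmetric matrix is invented for the example.

```python
import math

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def lanczos(A, v, m):
    """Return the diagonal (alphas) and off-diagonal (betas) of T."""
    n = len(v)
    nrm = math.sqrt(sum(x * x for x in v))
    q = [x / nrm for x in v]
    q_prev = [0.0] * n
    beta = 0.0
    alphas, betas = [], []
    for _ in range(m):
        w = matvec(A, q)
        alpha = sum(a * b for a, b in zip(q, w))
        # three-term recurrence: w = A q - alpha q - beta q_prev
        w = [wi - alpha * qi - beta * pi for wi, qi, pi in zip(w, q, q_prev)]
        beta = math.sqrt(sum(x * x for x in w))
        alphas.append(alpha)
        betas.append(beta)
        if beta < 1e-12:        # invariant subspace reached
            break
        q_prev, q = q, [x / beta for x in w]
    return alphas, betas

A = [[2.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 4.0]]
alphas, betas = lanczos(A, [1.0, 1.0, 1.0], 3)
```

When m reaches the matrix dimension, T is similar to A in exact arithmetic, so for example the trace of T matches the trace of A and the final beta vanishes.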
3.2.5 Locally optimal block preconditioned conjugate gradient
method (LOBPCG)
LOBPCG is based on a local optimization of a three-term recurrence. It is designed to find the smallest or the largest eigenvalues and corresponding eigenvectors of symmetric positive definite eigenvalue problems [114]. Similar to other conjugate gradient based methods, this is accomplished by the iterative minimization of the Rayleigh quotient, taking the gradient as the search direction in every iteration step, which results in finding the smallest eigenstates of the original problem. In the LOBPCG method the minimization at each step is done locally, in the subspace spanned by the current approximation, the previous approximation, and the preconditioned residual. The subspace minimization is done by the Rayleigh-Ritz method. Iterating several approximate eigenvectors simultaneously, in a block, in a similar locally optimal fashion results in the full block version of LOBPCG.
3.2.6 Davidson method
Davidson came up with the idea of expanding the subspace in such a way that certain eigenpairs would be favored, bearing in mind that if a true eigenvector lies in the subspace of the current iteration, the eigenproblem projected onto that subspace gives the exact corresponding eigenpair. Thus, to achieve fast convergence, a better way to expand the subspace is to choose the new expansion vector to be the component of the error vector which is orthogonal to the subspace [115, 116]. If this orthogonal component could be solved for exactly and added to the subspace, then convergence would be achieved in the next iteration in exact arithmetic. It has been reported that this method can be quite successful in finding dominant eigenvalues of (strongly) diagonally dominant matrices. Davidson [117] suggests that his algorithm (more precisely, the Davidson-Liu variant) may be interpreted as a Newton-Raphson scheme, and this has been used as an argument to explain its fast convergence.
3.2.7 Jacobi-Davidson method
The Jacobi-Davidson method is a popular technique to compute a few eigenpairs of large
sparse matrices. It is motivated by the fact that standard eigensolvers often require an
expensive factorization of the matrix to compute interior eigenvalues. Such a factorization
is unfeasible for large matrices in large-scale simulations. In the Jacobi-Davidson method,
one still needs to solve inner linear systems, but a factorization is avoided because the
method is designed so as to favor the efficient use of iterative solution techniques based on
preconditioning [118]. The Jacobi-Davidson method belongs to the class of subspace methods, which means that approximate eigenvectors are sought in a subspace. Each iteration of this method has two important phases: the subspace extraction, in which an approximate eigenpair is sought with the approximate eigenvector in the search space, and the subspace expansion, in which the search space is enlarged by adding a new basis vector chosen so as to lead to better approximate eigenpairs in the next extraction phase [119, 120].
3.2.8 Contour integral spectral slicing
The contour integral spectral slicing method is based on the contour integral method
proposed by Sakurai-Sugiura [121] for finding certain eigenvalues of a generalized
eigenvalue problem that lie in a given domain of the complex plane. The method
projects the matrix pencil onto a subspace associated with the eigenvalues that are
located in the domain. The approach is based on the root finding method for an analytic
function. This method finds all of the zeros that lie in a circle using numerical
integration. The algorithm requires a region that includes several eigenvalues and an
estimate of the number of eigenvalues or clusters in the region. The major advantage of this method is that no iterative process for constructing the subspace is required. At each contour point, the projected matrix pencil containing the eigenvalues of interest is derived by the solution of linear systems. A Rayleigh-Ritz type variant of the method has also been developed to improve numerical stability [122].
3.2.9 FEAST method
Lately, the FEAST algorithm, which takes its inspiration from the density-matrix representation and contour integration techniques in quantum mechanics, has been developed [123]. Unlike the Lanczos and Jacobi-Davidson methods, the aim of the FEAST algorithm is to actually compute the eigenvectors instead of approximating them. The algorithm deviates fundamentally from the traditional Krylov subspace iteration based techniques. It is free from any orthogonalization procedures, and its main computational task consists of solving independent inner linear systems with multiple right-hand sides. The FEAST algorithm finds all the eigenpairs in a given search interval. It requires that one provide an estimate of the number of eigenpairs within the search interval, which often is not possible to obtain beforehand [124].
3.3 Survey of available software packages for
eigenproblems
The history of reliable high quality software for numerical linear algebra started in 1971
with the book titled the “Handbook for Automatic Computation” [125]. This book
described state-of-the-art algorithms for the solution of linear systems and
eigenproblems. During the same decade, research groups started the development of two influential software packages: LINPACK, which covered the numerical solution of linear systems, and EISPACK, which concentrated on eigenvalue problems. These packages can also be viewed as prototypes for the eigenvalue routines in the bigger software packages NAG and IMSL and in the widely available software package MATLAB. EISPACK was replaced in 1995 by LAPACK.
In Table 3.1, we can notice that there are numerous commercial and free open-source packages available that support single and double precision, real or complex arithmetic eigensolvers, and even distributed computing via MPI or other technologies. Yet, there
Table 3.1: Detailed list of available software packages for large-scale eigenproblems

Package  | Numerical method employed                     | Real | Complex | Shared memory | GPU           | Distributed | Multi-GPU     | Sparse | Interior
Anasazi  | Block Krylov-Schur, Block Davidson, LOBPCG    | Yes  | Yes     | Yes           | No            | Yes         | No            | Yes    | Yes
ARPACK   | Arnoldi/Lanczos (implicit restart)            | Yes  | Yes     | Yes           | No            | Yes         | No            | Yes    | Yes
BLOPEX   | LOBPCG                                        | Yes  | Yes     | Yes           | No            | Yes         | No            | Yes    | No
FEAST    | FEAST                                         | Yes  | Yes     | Yes           | No            | Yes         | No            | Yes    | Yes
FILTLAN  | Polynomial filtered Lanczos                   | Yes  | Yes     | Yes           | No            | No          | No            | Yes    | Yes
IETL     | Power, RQI, Lanczos                           | Yes  | Yes     | Yes           | No            | No          | No            | Yes    | Yes
LASO     | Lanczos                                       | Yes  | No      | Yes           | No            | No          | No            | Yes    | No
MAGMA    | LOBPCG                                        | Yes  | Yes     | Yes           | Yes           | Yes         | Yes (limited) | Yes    | Yes
PRIMME   | Block Davidson, JDQMR, JDQR, LOBPCG           | Yes  | Yes     | Yes           | No            | Yes         | No            | Yes    | Yes
PROPACK  | SVD via Lanczos                               | Yes  | Yes     | Yes           | No            | No          | No            | Yes    | Yes
PySPARSE | Jacobi-Davidson                               | Yes  | No      | No            | No            | No          | No            | Yes    | Yes
SLEPc    | Krylov-Schur, Arnoldi, Lanczos, RQI, Subspace | Yes  | Yes     | Yes           | Yes (limited) | Yes         | Yes (limited) | Yes    | Yes
TRLAN    | Lanczos (dynamic thick-restart)               | Yes  | No      | Yes           | No            | Yes         | No            | Yes    | No
are several disadvantages in employing one of them. To list a few: first, users have to assume that these are optimal implementations and trade control and flexibility for ease of use. Second, packages are often developed around one hardware/software feature and may not exploit all the optimization prospects that advanced platforms have to offer. Also, most commercial packages are driven by the requirements of their clients and may fail to serve the broader scientific community. Most packages are inadequate to meet the needs of large groups of computational experts from different domains: some are dedicated to real systems, whereas others are meant to solve complex systems; some are developed for both real and complex arithmetic, but experience has shown that they may not have solvers for other specific eigenvalue systems. As seen, there are a few independent projects currently in progress to implement eigensolvers that execute in multi-GPU and CPU-GPU hybrid scenarios. However, their capability is limited by various factors, and a lot of work still needs to be done before they can be widely employed for general purpose numerical computation.
3.4 Summary
Eigenvalue problems arise in a wide range of scientific domains. To date, an enormous numerical effort has been put into developing methods that can solve these systems. The eigenproblem variations that are most widely encountered are the standard eigenvalue problem and the generalized eigenvalue problem. There are a number of methods that can be employed to solve eigenproblems, but the choice of method depends on a number of factors. In a broad sense, algorithms can be divided into two groups: direct methods, which are employed for small systems, and iterative methods, which are used when dealing with large-scale eigenproblems. A number of implementations of a wide variety of algorithms are available in the form of portable software packages. However, there is limited work focused on developing robust, optimal eigensolver packages for recent HPC and GPU based systems.
Chapter 4
Design of GPU based eigensolver for
atomistic simulation
There are two important aspects that must be considered while employing a numerical
method. The first one is the correct implementation of the physical governing equations
and the accuracy of the mathematical algorithms. The second one is directly related to
the nature of the hardware needed to execute the model. Each kind of platform used to
perform numerical simulations presents its own advantages and limitations. Parallelization
methods and optimization techniques are essential to perform simulations at a reasonable
execution time.
Iterative methods based on Krylov subspaces, which were introduced in Chapter 3, are usually employed to compute a few eigenstates of large sparse matrices. Among these methods are the original Lanczos algorithm, Arnoldi [110], Krylov-Schur and Jacobi-Davidson [126]. As already seen, some of the main standard libraries that include iterative eigensolver routines are ARPACK (ARnoldi PACKage) [127], PRIMME (PReconditioned Iterative MultiMethod Eigensolver), a library based on the Jacobi-Davidson algorithm [128], IETL (Iterative Eigensolver Template Library), providing a generic template interface to performance solvers [116], and SLEPc, a scalable library based on the linear algebra package PETSc [129]. All these libraries support single and double precision, real or complex arithmetic, and even distributed computation via MPI.
Most eigenvalue solvers have concentrated on computational techniques that accelerate separate components, in particular the matrix-vector multiplication [130] or new efficient sparse matrix storage formats [131]. However, only a limited amount of work has been done on taking advantage of modern processor architectural improvements for high performance computing in atomistic simulation, which is facilitated by their enhanced programmability and motivated by their attractive price to performance ratio and incredible growth in speed [116, 127, 128].
This work has been motivated by the lack of specialized eigensolvers for large-scale computations on GPUs. I concentrate on addressing some basic problems that hinder the development of efficient eigensolvers on GPUs: first, the choice of the algorithm itself; then, how to overcome the compute versus communication gap that exists in GPUs, together with ways to resolve the computational and memory related bottlenecks; finally, a multi-GPU implementation that scales with the number of GPUs is presented, resulting in an eigensolver that efficiently accelerates large-scale TB calculations. In the following sections, I start with a custom implementation of the Lanczos algorithm with a simple restart that is optimized for GPUs, as it has been identified as the method best suited for computing a few eigenpairs on a GPU framework while coping with the memory limitations of current GPUs and the slow GPU-CPU communication. I also discuss the enhancements and strategies developed for optimal eigensolver implementations utilizing GPU and other HPC based distributed technologies, and present benchmark calculations performed on a GaN/AlGaN wurtzite quantum dot similar to the one shown in Figure 4.1. I further the discussion in Chapter 5 by comparing our fine-tuned Lanczos implementation with GPU based Jacobi-Davidson and FEAST method implementations.
Figure 4.1: Conical wurtzite GaN/AlGaN quantum dot with 30% Al. Atomistic description: aluminium in yellow, gallium in red.
4.1 Lanczos method
We are interested in finding interior eigenvalues of the energy spectrum, near the energy gap of the large GaN/AlGaN quantum dot nanostructure shown in Figure 4.1. Such systems have important applications in modern nitride-based light emitting diodes (LEDs) [9, 19]. However, the Lanczos algorithm converges fast only to the extreme eigenvalues. As stated in Chapter 3, different spectral transformations are used for this purpose, like spectrum folding or shift-and-invert [110]. In this implementation, spectrum folding is applied in order to avoid the computation of the matrix inverse, which might pose additional convergence problems. So, in general, the lowest eigenpairs of the operator A = (H − sI)² are computed, where s is the chosen spectrum shift [132]. The implemented algorithm is a variant of that described in reference [133].
Algorithm. The Lanczos method
Assume H is a Hermitian matrix and q1 is a random vector with ||q1|| = 1
q0 = 0, β1 = 0
for i = 1 to m:
    ui = (H − sI)qi
    αi = ui · ui
    ri = (H − sI)ui − αiqi − βiqi−1
    βi+1 = ||ri||2
    qi+1 = ri / βi+1
After each iteration, we get αi and βi, the coefficients used to construct the tridiagonal matrix

T = \begin{pmatrix}
\alpha_1 & \beta_2  &          &              &              &          \\
\beta_2  & \alpha_2 & \beta_3  &              &              &          \\
         & \beta_3  & \alpha_3 & \ddots       &              &          \\
         &          & \ddots   & \ddots       & \beta_{m-1}  &          \\
         &          &          & \beta_{m-1}  & \alpha_{m-1} & \beta_m  \\
         &          &          &              & \beta_m      & \alpha_m
\end{pmatrix}
Due to finite precision arithmetic, the new q vectors slowly lose orthogonality with respect to the initial vectors [106]. Reorthogonalizing the current q vector against all previous qi takes a lot of resources and is not done in our implementation. Other versions of the Lanczos algorithm perform a partial reorthogonalization, keeping the subspace rather small. Experience shows that the convergence rate increases when the subspace is considerably enlarged at the expense of accurate orthogonality. In this implementation, the Lanczos iterations are performed until orthogonality with respect to the initial vector, q1, is preserved to an error of 10^-5. In this way, the typical size of the tridiagonal matrix, T, becomes of the order of 1000, which can be diagonalized using standard LAPACK routines, obtaining the eigenvalues λ_i^(m) and corresponding eigenvectors w_i^(m).
It can be proved that the eigenvalues of T are approximate eigenvalues of A. Here, only the eigenvalues λ_i with the lowest |λ_i| are considered, corresponding to the eigenvalues E_i = s ± √λ_i of H closest to s. The projected eigenvector, v_i, can be calculated as v_i = Q_m w_i^(m), where Q_m is the transformation matrix whose column vectors are q1, q2, ..., qm. The q_i vectors are recomputed on the fly by running the Lanczos iteration a second time. This might seem a waste of time at first, but reducing the subspace size in order to store the q_i vectors in memory does not improve overall speed. Once the approximate eigenvector, v_i, has been computed, the algorithm is tested for convergence by considering the residual norm |⟨v_i|H|v_i⟩ / ⟨v_i|v_i⟩ − E_i| < tol.
One can notice from the algorithm that each iteration requires two sparse matrix-
vector (spMV) multiplications and four vector operations, which implies that, if Rmax is
the maximum number of non-zero elements in any one row of the sparse matrix H, then
the complexity of the spMV product operation is O(Rmax · N) [134]. The complexity per
iteration of the Lanczos algorithm is O(2(Rmax · N) + N) where the dominant operation
is given by the matrix-vector multiplication. Observe that the matrix remains unchanged
along this loop.
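The folded-spectrum Lanczos pass and the on-the-fly recomputation of the q_i vectors described above can be illustrated with a short serial Python sketch (not the thesis's Fortran/CUDA code); the matrix H, the shift s and the sizes are invented for the example, and the Q list is kept here only to verify that a second pass regenerates the same vectors.

```python
import math
import random

def matvec(H, v):
    return [sum(a * x for a, x in zip(row, v)) for row in H]

def shifted_matvec(H, s, v):
    """(H - sI) v, the building block of the folded operator A = (H - sI)^2."""
    return [wi - s * vi for wi, vi in zip(matvec(H, v), v)]

def folded_lanczos(H, s, m, seed=0):
    """One Lanczos pass on A = (H - sI)^2; returns the T coefficients and,
    for checking only, the q_i vectors that a real run would not store."""
    rng = random.Random(seed)
    n = len(H)
    v = [rng.random() for _ in range(n)]
    nrm = math.sqrt(sum(x * x for x in v))
    q = [x / nrm for x in v]
    q_prev = [0.0] * n
    beta = 0.0
    alphas, betas, Q = [], [], []
    for _ in range(m):
        Q.append(q)
        u = shifted_matvec(H, s, q)
        alpha = sum(x * x for x in u)        # q.(H-sI)^2 q = |u|^2
        r = shifted_matvec(H, s, u)
        r = [ri - alpha * qi - beta * pi for ri, qi, pi in zip(r, q, q_prev)]
        beta = math.sqrt(sum(x * x for x in r))
        alphas.append(alpha)
        betas.append(beta)
        if beta < 1e-12:
            break
        q_prev, q = q, [x / beta for x in r]
    return alphas, betas, Q

H = [[2.0, 1.0, 0.0, 0.0],
     [1.0, 3.0, 1.0, 0.0],
     [0.0, 1.0, 2.0, 1.0],
     [0.0, 0.0, 1.0, 1.0]]
alphas, betas, Q1 = folded_lanczos(H, s=2.5, m=4)
_, _, Q2 = folded_lanczos(H, s=2.5, m=4)   # second pass regenerates the q_i
```

Because the recurrence is deterministic given the start vector, the second pass reproduces the q_i exactly, which is what allows an implementation to discard them after the first pass and rebuild Q_m only when projecting the eigenvectors.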
4.2 Implementation and optimization strategies for
parallel eigensolvers
Two different hardware technologies have been employed: CPUs and GPUs. Current
CPUs have multiple processing cores, making possible the distribution of the workload among the different cores using their multi-core shared-memory architecture. In addition,
CPUs also provide SIMD units, which allow performing one operation on multiple data elements simultaneously. Open Multi-Processing (OpenMP) may be used for explicit, directive-based multithreaded, shared-memory parallelism, thus providing a portable, scalable model for developers of shared-memory parallel applications. OpenMP programs accomplish parallelism exclusively through the use of threads [135].
As detailed in Chapter 2, the GPU architecture allows for the execution of threads on
a larger number of processing elements. Although these processing elements are typically
much slower than those of a CPU, having a large number of threads may make it possible
to surpass the performance of current multi-core CPUs [136]. Another characteristic of
parallel programming with GPUs is the ability to start a large number of threads with
little overhead [39]. This is unlike traditional CPU threads, where each individual thread
is treated as an entity independent of the others, requiring separate resources such as stack memory, and whose creation and management are not cheap [39]. GPU threads, on the other hand, are cheaper to create and manage: since batches of GPU threads are treated identically, it is possible to create a large number of them and run each for a shorter duration.
The parallelization task on multiple computing systems can be performed by using
MPI for communicating via messages between distributed processes that are running in
parallel over the network. We combine MPI with OpenMP and CUDA to enable solving tight-binding problems with an H matrix that is too large to fit on a single node or that would require an unreasonably long compute time on a single node. We also take advantage of the latest developments in hardware technologies, such as NVIDIA GPUDirect, so as to achieve additional improvements in performance.
4.2.1 MPI-OpenMP
In OpenMP, the goal is usually to parallelize loops: a serial program can be parallelized one loop at a time. When compiler directives are used, OpenMP will
automatically make loop index variables private within team threads (Master thread +
Worker threads) and global variables shared. Below is the pseudocode for spMV with
OpenMP.
Do i = 1 to Number_of_Rows
    Start = row_index(i)
    Stop  = row_index(i+1) - 1
    Sum = 0
    Do k = Start to Stop
        Sum = Sum + H(k) * q(col_index(k))
    End Do
    V(i) = Sum
End Do
All non-zero coefficients of matrix H are stored at contiguous memory locations in array
H(:), row by row, and the starting offsets of all rows are contained in a separate array
row index(:). Array col index(:) contains the original column index for each non-zero
matrix coefficient. A matrix-vector multiplication with vector q(:) can then be written
as shown in the pseudocode. While array H(:) is traversed contiguously, access to q(:)
is indexed. The rows of matrix H and the solution vector V (:) are partitioned between
threads. The OpenMP compiler directives take care of generating the code for distributing the work and synchronizing across the threads.
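For illustration, the same CSR product can be written in a few lines of Python mirroring the three arrays of the pseudocode (values for H(:), row_index, col_index, here 0-based); the small matrix is invented for the example.

```python
def csr_matvec(values, row_index, col_index, q):
    """y = H*q with H stored row by row in CSR form (0-based indices)."""
    n = len(row_index) - 1
    y = [0.0] * n
    for i in range(n):                       # this outer loop is the one parallelized
        s = 0.0
        for k in range(row_index[i], row_index[i + 1]):
            s += values[k] * q[col_index[k]]
        y[i] = s
    return y

# H = [[2, 1, 0],
#      [1, 3, 1],
#      [0, 1, 4]]  stored in CSR:
values    = [2.0, 1.0, 1.0, 3.0, 1.0, 1.0, 4.0]
col_index = [0, 1, 0, 1, 2, 1, 2]
row_index = [0, 2, 5, 7]
```

As in the pseudocode, the values array is traversed contiguously while the access to q is indexed, and each row of the result can be computed independently.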
The MPI-OpenMP hybrid paradigm works well for multi-core CPU nodes connected over a network, since MPI is designed to handle distributed-memory systems. We use MPI across nodes and OpenMP within each node, thus avoiding the extra communication overhead of MPI within the same node. We have divided the problem into a two-level parallelism: MPI is used for coarse-grained parallelism among nodes, while OpenMP is used for fine-grained parallelism between the CPU cores of the same node.
4.2.2 MPI-CUDA
There are many reasons for wanting to combine the two parallel programming approaches
of MPI and CUDA. A common reason is to enable solving problems with a data size
too large to fit into the memory of a single GPU or that would require an unreasonably
long compute time on a single node. Another reason is to accelerate an existing MPI
application with GPUs or to enable an existing single-node multi-GPU application to
scale across multiple nodes.
The MPI-CUDA hybrid paradigm is utilized to enable solving large TB calculations on multiple GPUs. The workstation has multiple GPUs connected to the same host. As with MPI-OpenMP, the problem has been divided into a two-level parallelism: MPI is used for coarse-grained parallelism among GPUs, while CUDA kernels are used for fine-grained parallelism within a single GPU. To further improve the performance of the MPI-CUDA implementation, several techniques have been utilized: the splitting technique, the mixed real-complex arithmetic kernel, the overlap transfer technique and CUDA-aware MPI, which are explained in detail in the following subsections.
4.2.3 Performance enhancement via communication cost
reduction
In order to reduce memory usage and traffic at the cost of extra flops, the eigenvalues and
the eigenvectors are calculated using minimal information without saving any subspace
vectors as described in section 4.1. This might initially seem a waste of time, but as
previously stated, reducing the subspace size in order to store the qi vectors in memory
does not improve overall speed. Furthermore, a considerable time needed to transfer the
vectors from GPU to machine RAM has to be spent. Since the peak bandwidth between
the device memory and the GPU is much higher than the peak bandwidth between host
memory and device memory, it is important to minimize data transfer between the host
and the device. Therefore, it is necessary to keep the entire matrix and the intermediate
vectors on the GPU. The advantage of the described algorithm is that it requires very little memory at the expense of computing more matrix-vector products. This is ideal for graphics cards that are limited in memory but fast in performing vector operations. Another
fundamental advantage of this implementation is the absence of expensive data transfer
of the vector qi from the device to the host. Only the scalars αi, βi are transferred at each
iteration since T is diagonalized on the host.
4.2.4 Memory optimization by Splitting approach
Memory optimizations are the most important area for performance enhancement. The
goal is to maximize the possible atomistic size that can be simulated on the GPU. The TB
Hamiltonian is a sparse matrix with approximately 40 non-zero coefficients per row with
a standard deviation ranging from 3.0 to 4.0. Therefore, the Hamiltonian is stored in a
compressed sparse row (CSR) format which stores only the non-zero elements. To enable
multithread parallelism, we store both the upper and lower triangular blocks. Performance
improvements may be possible using alternative sparse matrix representations such as ELLPACK, although it has been shown that CSR becomes very efficient when the number of matrix rows exceeds four million [137].
Spin-orbit couplings add imaginary components to the Hamiltonian matrix doubling
the problem size and adding the burden of complex algebra operations. In conventional
TB approaches, based on the local atomic spin-orbit interaction, the size of the imaginary
part of the Hamiltonian is much smaller than the real part. Therefore, memory can be
saved by exploiting the sparsity if we split the complex TB Hamiltonian matrix into its real and imaginary parts and then perform the eigenvalue calculation. The complex spMV is substituted by two real-matrix multiplications,
V = Mul(Hreal, q) + i · Mul(Himg, q) (4.1)
This has been achieved by designing a new CUDA kernel accepting mixed complex/real
arithmetic as explained in the following subsection 4.2.5.
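Equation (4.1) can be illustrated with a minimal Python sketch; small dense matrices are used here instead of CSR for clarity, and the matrix entries and vector are invented for the example.

```python
def split_matvec(H_real, H_imag, q):
    """V = Mul(H_real, q) + i * Mul(H_imag, q), as in Eq. (4.1):
    two real-matrix products applied to a complex vector."""
    n = len(H_real)
    V = []
    for i in range(n):
        vr = sum(H_real[i][j] * q[j] for j in range(n))  # real-part product
        vi = sum(H_imag[i][j] * q[j] for j in range(n))  # imaginary-part product
        V.append(vr + 1j * vi)
    return V

H_real = [[2.0, 1.0],
          [1.0, 3.0]]
H_imag = [[0.0, 0.5],
          [-0.5, 0.0]]
q = [1 + 1j, 2 - 1j]
V = split_matvec(H_real, H_imag, q)
```

The result coincides with multiplying by the full complex matrix H = H_real + i·H_imag, but each real matrix can be stored and tuned separately, which is what the mixed-arithmetic kernel of the next subsection exploits.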
4.2.5 Mix real-complex CUDA kernel
Sparse matrix-vector multiplication is an integral part of most numerical methods and it
is a bandwidth-limited operation on current hardware. On cache-based architectures, like GPUs, the main factors that influence performance are spatial locality in accessing the matrix and temporal locality in re-using the elements of the vector. The new mixed real-complex CUDA kernel is based on the implementation discussed by Reguly and Giles [138], who show that it can outperform the cuSPARSE library. The main idea of the kernel is to let several threads cooperate on each row during spMV products, thereby increasing data locality and decreasing cache misses.
/* Mixed real/complex CSR spMV: `coop` threads cooperate on each matrix row
   and partial sums are reduced in shared memory. The kernel signature is
   reconstructed here for completeness from the usage described in the text. */
__global__ void spmv_mixed(const double *values, const int *rowPtrs,
                           const int *colIdxs, const cuDoubleComplex *x,
                           cuDoubleComplex *y, int coop, int repeat)
{
    int tid = threadIdx.x;                    /* thread index within the block */
    int coopIdx = threadIdx.x % coop;         /* index within the cooperating group */
    int i = (repeat * blockIdx.x * blockDim.x + tid) / coop;  /* row for this group */
    __shared__ cuDoubleComplex sdata[BLOCK_SIZE];
    for (int r = 0; r < repeat; r++)
    {
        cuDoubleComplex localSum = make_cuDoubleComplex(0.0, 0.0);
        int rowPtr = rowPtrs[i];
        int stop = rowPtrs[i + 1] - rowPtr;
        /* each thread accumulates a strided portion of the row:
           real matrix coefficients times a complex vector */
        for (int j = coopIdx; j < stop; j += coop)
        {
            localSum.x += values[rowPtr + j] * x[colIdxs[rowPtr + j]].x;
            localSum.y += values[rowPtr + j] * x[colIdxs[rowPtr + j]].y;
        }
        sdata[tid] = localSum;
        __syncthreads();
        /* tree reduction of the partial sums within the cooperating group */
        for (unsigned int s = coop / 2; s > 0; s >>= 1)
        {
            if (coopIdx < s) {
                sdata[tid].x += sdata[tid + s].x;
                sdata[tid].y += sdata[tid + s].y;
            }
            __syncthreads();
        }
        if (coopIdx == 0) y[i] = sdata[tid];
        i += blockDim.x / coop;
    }
}
[Figure: spMV time in seconds versus number of atoms (up to 400,000) for the Complex/Complex, Real/Real and Complex/Real kernels]
Figure 4.2: Performance of spMV operation on GPU employing different data types
Two different CUDA streams are used to carry out the matrix-vector multiplication
because the operations are independent of each other and can be executed in parallel if
enough GPU resources are available. For III-V semiconductors, every atom has 4
neighbors, Rmax ≈ 40. In contrast, the imaginary part has Rmax = 2. For this reason,
different tuning strategies are necessary for the two spMV operations. For the spMV
operations involving the real part, numerical experiments give the best performance
using coop = 8 and repeat = 2 in the notation of Ref. [138] and in the kernel reported
above. The spMV involving the imaginary parts is performed with coop = 1 and
repeat = 1. As seen in Figure 4.2, this hybrid complex/real kernel performs much better than the original implementation based on four real/real spMV operations, which suffered almost a 2× performance degradation. This is because the real matrix needs to be fetched only once, decreasing the bandwidth utilization.
4.2.6 Performance enhancement using the Overlap technique
Figure 4.3: (Left) Typical sparsity pattern of a TB Hamiltonian and partitioning over
four nodes. (Right) Data exchanged between adjacent nodes
To facilitate the calculation of big nanostructures, MPI is utilized because it is one of
the dominant technologies used in HPC today. The distributed parallel computing nature of MPI, along with its portability, efficiency and flexibility, is ideal for scientific computing bound by memory and speed limitations. However, the challenge with the TB application is that different parts of the TB Hamiltonian matrix are distributed to different nodes and the algorithm is executed independently on each node. Therefore, after each matrix-vector multiplication, the part of the resultant vector that is needed to carry out the next matrix-vector multiplication correctly has to be transferred. This part acts as an overlap between nodes that needs to be exchanged. The size of the overlap transferred is dictated by the bandwidth of the H matrix. Figure 4.3 shows the typical
sparsity pattern of the TB Hamiltonian matrix and the right panel shows the overlap data
exchange between adjacent nodes.
This exchange is a critical parameter in the performance scaling of the parallel
implementation. The atomistic structure is reordered using the reverse Cuthill-McKee
algorithm before building the Hamiltonian. Since the bandwidth of the reordered matrix
is reduced, the overlap that needs to be transferred between nodes is almost halved compared to the case when no reordering is performed. By using this technique, we avoid having to gather the entire resultant vector on each node.
4.2.7 CUDA-aware MPI
Using CUDA-aware MPI makes the algorithm run more efficiently since all operations
that are required to carry out the message transfer can be pipelined and acceleration
technologies, like GPUDirect, can be utilized. The goal of this technology is to reduce
dependency on the CPU to manage the transfer: with a regular MPI implementation, only pointers to host memory can be passed to the MPI APIs, so one needs to stage GPU buffers through host memory. Further, pinned memory is used to speed up host-to-device and device-to-host transfers in general, since it prevents memory pages from being swapped out. On the GPU test system utilized here for benchmarking, all the GPUs are connected over the same PCI-E bus, and GPUDirect Peer-to-Peer is utilized to achieve high-bandwidth, low-latency communication between the GPUs.
4.3 Benchmarking the Lanczos method
A GaN/AlGaN wurtzite quantum dot, like the one shown in Figure 4.1, is used to perform
the benchmark calculations on nanostructures with up to 600,000 atoms, corresponding to
an H matrix size of around 12,000,000 and approximately 480,000,000 non-zero elements.
Numerical benchmark comparisons have been performed on systems having the following
architectures.
• Test system 1 (CUDA, MPI-CUDA): Intel Xeon Processor E5-2620 (6 cores, 2 GHz,
Cache 15 MB), 64 GB DDR3 SDRAM (Speed 1333 MHz) and 2 Nvidia Tesla K20c
(Chip Kepler GK110 GPU, Processor Clock 706 MHz, Memory Clock 2.6 GHz,
Memory Size 5 GB) connected on the same PCI-E with an operating system based
on Linux kernel 3.0.85.
• Test system 2 (MPI-OpenMP): Intel Xeon Processor X5560 (4 cores, 2.8 GHz, Cache
8 MB), 48 GB DDR3 SDRAM (Speed 1333 MHz) connected through a 20 Gbps
InfiniBand (4x DDR) with an operating system based on Linux kernel 2.6.30.
• Test system 3 (Sequential, OpenMP): Intel Xeon Processor W3530 (4 cores, 2.8
GHz, Cache 8 MB), 6 GB DDR3 SDRAM (Speed 1333 MHz) with an operating
system based on Linux kernel 2.6.30.
The algorithm has been written in Fortran 95 and compiled with Intel Fortran 11.1,
whereas the GPU parts are written in C and compiled with CUDA Toolkit 5.5. Here, I
concentrate particularly on the sparse matrix-vector multiplication timings and
discuss the performance of the GPUs in finding one conduction band energy eigenstate.
Table 4.1 reports the timings to find the first eigenvalue on a single K20c GPU for
increasing problem sizes in terms of the number of atoms. The corresponding
Hamiltonian size is given by multiplying the number of atoms by 20, the
basis size, whereas the number of non-zero elements is approximately given by
multiplying the Hamiltonian size by 40 (the average number of non-zeros per row). Table
4.1 also reports the total number of iterations needed to reach convergence. This
number varies depending on the starting guess(es), the quantum dot shape, composition
and size. The absolute error on the eigenvalue also varies, since the convergence
tolerance is tested once orthogonality is lost and the matrix T has been diagonalized.
Therefore, in order to compare performance on different machines, it is more instructive
to compute the time per iteration, which is directly related to the time per spMV
multiplication. It has been observed that the timings for the memory-optimized algorithm
are slightly worse than those of the original complex/complex algorithm, despite the
reduction of the overall number of floating point operations. This is attributed to the
fact that two distinct matrices now need to be accessed, as given by equation 4.1.
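The mixed real-complex product just referred to can be sketched as follows. This is an illustrative Python version, not the CUDA kernel; it assumes H is split into separate real and imaginary sparse matrices, so the complex product is assembled from two real-arithmetic spMV operations:

```python
def spmv_real(rows, v):
    # real sparse matrix (rows of (column, value) pairs) times a complex vector:
    # every coefficient product is real x complex, i.e. mixed arithmetic
    return [sum(a * v[j] for j, a in row) for row in rows]

# H = Hr + i*Hi, stored as two separate real sparse matrices, so that
# H*q = Hr*q + i*(Hi*q) costs two real-arithmetic spMV products
Hr = [[(0, 2.0), (1, 1.0)], [(0, 1.0), (1, 3.0)]]
Hi = [[(1, 0.5)], [(0, -0.5)]]
q = [1 + 2j, 3 - 1j]

y_split = [a + 1j * b for a, b in zip(spmv_real(Hr, q), spmv_real(Hi, q))]

# reference: full complex matrix-vector product
H = [[2 + 0j, 1 + 0.5j], [1 - 0.5j, 3 + 0j]]
y_full = [sum(Hrow[j] * q[j] for j in range(2)) for Hrow in H]
assert all(abs(a - b) < 1e-12 for a, b in zip(y_split, y_full))
```

The split keeps only real values in the two matrices, which is where the memory saving reported in Figure 4.4 comes from, at the price of touching two matrices per product.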
The Kepler K20c GPUs used for this work have 5 GB of memory, which is sufficient for a
nanostructure with up to ≈ 260,000 atoms. Splitting the H matrix saves 35-40% of memory,
as shown in Figure 4.4, which enables the simulation of structures of up to 350,000 atoms
on a single GPU. Extra time is spent on splitting and further memory optimization of the
Table 4.1: Results for energy eigenstate calculation using CUDA on Nvidia Kepler
K20c GPU (Test system 1)

                     CUDA implementation                     Memory-optimized CUDA implementation
Number of    Error    Runtime   Lanczos     Time/iter    Error    Runtime   Lanczos     Time/iter
atoms        ×10−6    (sec)     iterations  (msec)       ×10−6    (sec)     iterations  (msec)
8,039        7.9      3.04      950         3.2          5.3      3.5       960         3.7
24,650       1.4      8.6       1330        6.5          1.1      13.0      1840        7.1
79,495       2.7      76.9      4180        18.4         8.7      62.5      3160        19.8
151,472      9.7      70.4      1940        36.3         3.6      76.4      1960        39.0
203,376      8.0      88.6      1580        56.1         8.2      95.1      1580        60.2
263,379      3.9      141.6     1940        73.0         1.6      149.7     1960        76.4
351,600      -        -         -           -            1.3      186.8     1940        96.3
H matrix. This overhead seems acceptable in the memory-versus-time trade-off.
Figure 4.4: Memory utilization (MB) by the TB Hamiltonian matrix on GPU as a function
of the number of atoms (up to 600,000), comparing the tight binding Hamiltonian with
the tight binding Hamiltonian optimized for GPU
Table 4.2 shows timings for the distributed calculations on two Kepler K20c GPUs.
First, it is observed that the parallel implementation can be slower than the single GPU
implementation when the data transfer is performed via host memory (first column).
This happens because data transfer is the speed limiting factor. Performance can be
substantially improved using CUDA-aware MPI implementations exploiting the PCI-E
bus data transfer supported by the K20c and available from CUDA Toolkit 5.5 onwards. Such
an architecture can perform a peer-to-peer transfer between the GPU memories directly at
a rate of 6.3 GB/s, boosting the computation speed by a factor of 2.7× (second column
Table 4.2: Results for energy eigenstate calculation using MPI-CUDA implementation
running on two Nvidia Kepler K20c GPUs (Test system 1)

                 MPI-CUDA via Host memory              MPI-CUDA via PCI                      Memory-optimized MPI-CUDA via PCI
Number of    Error   Runtime  Lanczos  Time/iter   Error   Runtime  Lanczos  Time/iter   Error   Runtime  Lanczos  Time/iter
atoms        ×10−6   (sec)    iters    (msec)      ×10−6   (sec)    iters    (msec)      ×10−6   (sec)    iters    (msec)
79,495       7.6     175.29   5440     32.2        6.4     63.1     5540     11.4        7.1     67.3     5520     12.2
203,376      7.5     305.31   3520     86.7        7.5     117.2    3520     33.3        7.6     119.6    3520     34.0
351,600      2.2     379.78   2580     147.2       4.5     94.5     1780     53.1        0.16    195.1    3400     57.4
601,766      -       -        -        -           -       -        -        -           2.7     268.2    2640     101.6
of Table 4.2). In this second case, the average performance of the parallel implementation
is about a factor of 1.7× faster than the single GPU runs shown in Table 4.1. The largest
structure that can fit on two GPUs is made of a little more than 600,000 atoms, which
requires 6.0 GB of storage in total. As already stated, this requires a splitting
strategy to be employed.
In order to put the GPU performance in the right perspective, a benchmark comparison
of the same algorithm running in parallel on multi-core CPU nodes connected through
a high-speed, high-bandwidth (20 Gbit/s) InfiniBand network, as available
on most HPC facilities, is performed (Test system 2). The best approach is a hybrid MPI-OpenMP
implementation in which the matrix is distributed over quad-core nodes on which every
matrix-vector multiplication operation has been parallelized using four OpenMP threads.
Table 4.3 summarizes the results of these runs. Timings for 2, 4, 8 and 16 MPI
processes, for a total of 8, 16, 32 and 64 cores respectively, are reported. The relevant
performance is also shown graphically in Figure 4.5. We observe that, using MPI-OpenMP
over InfiniBand, the scaling is almost linear when moving from 2 nodes to 8 nodes, but
it degrades for bigger systems: the overlap, which is dictated by the bandwidth of the
Hamiltonian matrix, is larger for large-scale systems and needs to be transferred after
each matrix-vector multiplication. Hence MPI message transfers and synchronizations
among the processes after each matrix-vector multiplication take a substantial amount
of time and degrade the speed.
Figures 4.6 and 4.7 show the time in seconds and the performance in Gflops per Lanczos
iteration for: a single CPU core; a quad-core CPU using OpenMP; the standard GPU
implementation on a single Kepler K20c GPU; the memory-optimized implementation
(MOI) on a Kepler K20c GPU; the MPI-OpenMP implementation on 2, 4, 8 and 16
quad-core CPUs; the MPI-CUDA implementation on two Kepler K20c GPUs with MPI
communication via host memory and via PCI respectively; and the memory-optimized
(MOI) MPI-CUDA implementation on two Kepler K20c GPUs with exchange via the
PCI-E bus using GPUDirect.

Figure 4.5: Time comparison of Lanczos iteration using MPI-OpenMP on a HPC
cluster connected via InfiniBand

Figure 4.6: Time taken per Lanczos iteration for different implementations and
technologies
The performance reported in Gflops in Figure 4.7 is given by ((number of non-zero
elements in H × number of multiply-add operations in algorithm)/time per Lanczos
Table 4.3: Results for energy eigenstate calculations using MPI-OpenMP (Test system 2)

Number of atoms   Error        MPI nodes   Runtime (sec)   Lanczos iterations   Time/iteration (msec)
79,495            6.8 ×10−6    2           428.58          4670                 91.8
                  7.0 ×10−6    4           342.21          4660                 64.2
                  7.2 ×10−6    8           212.09          4630                 45.9
                  6.2 ×10−6    16          147.9           4710                 31.4
203,376           3.8 ×10−6    2           918.89          3280                 280.2
                  7.1 ×10−6    4           755.47          3210                 235.3
                  4.2 ×10−6    8           510.65          3260                 156.7
                  1.1 ×10−6    16          357.82          3330                 107.2
351,600           1.2 ×10−7    2           1800.93         3470                 519.0
                  1.9 ×10−7    4           1460.34         3420                 427.0
                  0.7 ×10−7    8           864.12          3490                 247.5
                  1.9 ×10−7    16          671.5           3420                 196.3
601,766           3.3 ×10−6    2           2562.07         3310                 773.8
                  5.3 ×10−6    4           2039.16         3200                 637.2
                  8.6 ×10−6    8           1124.02         3050                 368.6
                  5.1 ×10−6    16          1049.46         3220                 325.9
Figure 4.7: Performance comparison for the Lanczos iteration between different
implementations and technologies
iteration) obtained on the GPU, compared to a single quad-core Xeon CPU, as
described above (Test system 3), using OpenMP multithreading on the M × v
operations, and to the MPI-OpenMP implementation on 2, 4, 8 and 16 quad-core CPUs
respectively (Test system 2). A performance gain of a factor of more than 40× can be
achieved on the GPU as compared to a single CPU core, and a factor of 10× compared
to the OpenMP implementation on a quad-core. The point corresponding to 351,600
atoms is only possible with memory optimization. Besides some oscillations, we observe
quite opposite trends of numerical efficiency between the GPU and the CPU: the first
steeply rises and then saturates with problem size, while the second steadily degrades.
This is attributed to the large memory bandwidth (208 GB/sec) of the Kepler K20c,
which is the ultimate speed limiting factor for the large matrices handled here. It is
also observed that on small systems there is no appreciable GPU speedup. This is
because memory allocation and transfer of data to the GPU take a considerable
amount of time.
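The Gflops figure described above can be reproduced with a small helper. The basis size (20) and average non-zeros per row (40) are the values quoted earlier; the number of floating point operations per stored element is an assumption made here for illustration only:

```python
def lanczos_gflops(num_atoms, time_per_iter_ms, basis_size=20, nnz_per_row=40,
                   flops_per_nonzero=8):
    # Hamiltonian size = atoms x basis size; nnz = size x average non-zeros
    # per row (values quoted in the text). flops_per_nonzero is an assumed
    # multiply-add count per stored element, chosen only for illustration.
    h_size = num_atoms * basis_size
    nnz = h_size * nnz_per_row
    return nnz * flops_per_nonzero / (time_per_iter_ms * 1e-3) / 1e9

# e.g. the 263,379-atom dot at 73.0 ms per iteration (Table 4.1)
rate = lanczos_gflops(263379, 73.0)
assert 20 < rate < 30   # roughly 23 Gflops under these assumptions
```

The same helper applied to the CPU timings of Table 4.3 reproduces the relative gaps plotted in Figure 4.7, since only the time per iteration changes.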
Figure 4.8: Speed comparison for spMV between implementations on each of the
technologies
Comparing the MPI-OpenMP performance with MPI-CUDA, for the smallest
structure of 79,495 atoms we obtain a time per iteration of 45.9 msec on 8 nodes,
whereas on the same structure the best MPI-CUDA performance is 11.4 msec,
corresponding to an acceleration of 4.0×. Even compared to the slowest MPI-CUDA,
via host memory, the GPU algorithm has a speedup factor of 1.4×. For the larger
structures, the gain on the GPU increases further. For the case of 351,600 atoms
the speedup factors range from 4.6× for MPI-CUDA via PCI to 1.68× for
MPI-CUDA via host. The largest structure of 601,766 atoms can be compared only
in the case of the memory-optimized strategy, and for this the acceleration factor is 3.6×.
These comparisons can be appreciated in Figure 4.6, where it is possible to see that the
GPU implementation on two cards outperforms the parallel implementation of the same
algorithm running on CPUs. Figure 4.8 shows the speed comparison for spMV
operations; as seen, the GPUs outperform every other implementation. Even a single GPU
is faster than 16 quad-core nodes connected by an InfiniBand network.
Clearly, the drawback of this GPU implementation is that it faces memory limitations
that prevent scaling the system size above a certain limit. Nevertheless, the amount of
memory hosted by GPUs is likely to increase in the future; the latest NVIDIA Kepler K80
already has 24 GB of device memory. As demonstrated by these benchmarks, fast direct
GPU inter-communication is needed for high performance. Currently, multiple GPU cards
can be interconnected via PCI switches to a single I/O hub, although a system with 4
GPUs gives optimal parallel performance.
4.4 Summary
The Lanczos method has been fine-tuned for memory-limited GPUs. Advanced
optimization strategies and techniques that take into account the characteristics of the
sp3d5s* + spin-orbit parametrization Hamiltonian matrix have been developed and
utilized to obtain optimal performance. The whole algorithm has been developed using
CUDA and runs entirely on the GPU. Furthermore, parallel distribution over several
GPUs has been attained using MPI, and the implementation is fully vectorized and
scales with the number of GPUs. Benchmark calculations performed on a GaN/AlGaN
wurtzite quantum dot with up to 601,766 atoms are presented. The GPU results are
also compared to other available computing technologies.
Chapter 5
GPU focused comprehensive study
of popular eigenvalue methods
As already outlined in Chapter 3, there are several methods that can be used to
calculate the needed eigenstates of the H matrix. Given the variety of possible methods,
it is still unclear which one is best suited and how their performance compares in a
given scenario. However, a few methods are more widely used given their
implementation feasibility, convergence characteristics, accuracy and reliability. Methods
such as Lanczos, Jacobi-Davidson and conjugate gradient are popular and widely
utilized in tight binding calculations [139–141]. Recently, a new method called FEAST
has been gaining popularity [142, 143]. Hence, studying, optimizing and benchmarking
them on recent HPC and GPU architectures is important for the given application domain.
Today, larger and faster computing systems are widely accessible. Supercomputers
and high-end computing systems are being utilized to accelerate computation
in parallel distributed, cluster or grid computing settings. The advent of GPUs has
grasped the attention of much of the scientific computing community. Developing
algorithms that can scale ideally over such systems is an important component for
translating hardware features into actual beneficial speedups. In recent times,
extensive effort has been put into translating algorithms initially designed for
sequential processors to modern HPC systems, which normally deal with either SIMD or
multiple-instruction-multiple-data (MIMD) scenarios. However, many aspects need to
be considered to obtain speedups in parallel computing. Hence, this sequential-to-parallel
transition is often not straightforward and requires a deeper understanding of the
system architecture and of the eigenvalue method itself.
There are many challenging questions to be considered in terms of the choice of method
employed. Some of these questions include: which method takes the least total computation
time and is well suited for GPUs, given their limited available resources? Which approach
is robust in convergence when used with nanostructures having a dense energy spectrum?
Also, in a multi-GPU scenario where data has to be shared among GPUs, it is important
to identify the implementation that deals well with hardware limitations. Characteristics
of the method, like its ratio of compute-intensive to memory-intensive operations, which
are needed for a good speedup in hybrid implementations, also need to be considered.
Finally, it is important to find the method that scales best in a multi-GPU distributed
setup.
Having identified the aspects that need to be taken into account and proposed a design
for a parallel computing eigensolver in Chapter 4, here I test and compare some of the
popular eigenvalue algorithms for memory utilization, execution time, implementation
complexity (feasibility) and convergence. I also benchmark a robust implementation
of each algorithm on a multi-GPU system as well as on an HPC cluster.
5.1 GPU based implementations of popular
eigenvalue methods
As we know, GPUs have limited memory, and the peak bandwidth between the device
memory and the GPU is much higher than that between host memory and
device memory. Therefore, as already shown in Chapter 4, it is crucial to minimize
the data transfer between the host and the GPU by keeping the Hamiltonian matrix
and the search subspace in device memory. For this reason, the TB Hamiltonian
matrix is converted to single precision format prior to transfer to the GPU's global
memory. The algorithms are implemented using mixed single/double precision arithmetic
to ensure highly accurate solutions. Since the Lanczos method is detailed in Chapter 4,
its parallel design and implementation details are not repeated in the subsequent
subsections.
5.1.1 Jacobi-Davidson method
The Jacobi-Davidson method is an iterative subspace method for computing one or more
eigenpairs of large sparse matrices. In this method, each iteration has two phases: the
subspace extraction and the subspace expansion.
For the subspace expansion phase, given an approximate eigenpair (θi, ui) close to
(λi, vi), with ui ∈ U, where U is the subspace, θi = (ui*Hui)/(ui*ui) is the Rayleigh
quotient of ui, taken as the approximate eigenvalue because it minimizes the two-norm
of the residual r = Hui − θiui. To expand U in an appropriate direction, we look for an
orthogonal correction t ⊥ ui such that ui + t satisfies the eigenvalue equation:

H(ui + t) = λi(ui + t) (5.1)

We seek the eigenvalues closest to some given target τ; initially, this is taken to be
the same as the chosen Lanczos shift, τ = s. Rewriting the above equation,

(H − τI)t = −r + (λi − θi)ui + (λi − τ)t (5.2)
Since both t and |λi − τ| are small, the last term can be neglected. Multiplying both
sides of equation 5.2 by the orthogonal projection I − uiui*, we obtain

(I − uiui*)(H − τI)(I − uiui*)t = −r (5.3)

where t ⊥ ui. Equation 5.3 is solved only approximately using the generalized minimal
residual method (GMRES), and the approximate solution is used for the expansion of
the subspace [144].
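Since GMRES only needs the action of the operator in equation 5.3 on a vector, the projected operator can be applied without ever forming it explicitly. A minimal sketch with hypothetical names, on a toy Hermitian matrix:

```python
# Illustrative sketch of applying the projected correction operator of
# equation 5.3: t -> (I - u u*)(H - tau I)(I - u u*) t, for a small
# Hermitian H and a normalized approximate eigenvector u.
def apply_projected(H, u, tau, t):
    n = len(u)

    def proj(x):                     # x -> x - u (u* x)
        c = sum(ui.conjugate() * xi for ui, xi in zip(u, x))
        return [xi - c * ui for xi, ui in zip(x, u)]

    def shifted(x):                  # x -> (H - tau I) x
        return [sum(H[i][j] * x[j] for j in range(n)) - tau * x[i]
                for i in range(n)]

    return proj(shifted(proj(t)))

H = [[2 + 0j, 1j], [-1j, 3 + 0j]]    # Hermitian test matrix
u = [1 + 0j, 0j]                     # normalized approximate eigenvector
y = apply_projected(H, u, 1.5, [0j, 1 + 0j])

# the result stays orthogonal to u, as required for the correction t
assert abs(sum(ui.conjugate() * yi for ui, yi in zip(u, y))) < 1e-12
```

Passing this operator (instead of H itself) to the linear solver is what keeps the correction orthogonal to the current approximate eigenvector.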
To save GPU memory, the process is enhanced by restarting the Jacobi-Davidson
method with a few of the most recently found ui; in this way, the dimension of the
search subspace is restricted [145]. To prevent the found eigenvalues from reentering
the computational process, the new search vectors are explicitly orthogonalized against
the computed eigenvectors.
As stated above, the interior eigenvalues are of interest. The Ritz vectors represent
poor candidates for restart since they converge monotonically towards exterior eigenvalues.
One solution to this problem is to use the harmonic Ritz vectors. The harmonic Ritz values
are the inverses of the Ritz values of H⁻¹. Since the H matrix is Hermitian, the harmonic
Ritz values for the shifted matrix (H − τI) converge monotonically towards the eigenvalues
closest to the target value τ. The search subspaces for the shifted and the unshifted
matrix coincide, and hence it is possible to compute harmonic Ritz pairs for any
shift. The harmonic Ritz vector for the shifted matrix can be interpreted as maximizing a
Rayleigh quotient for (H − τI)⁻¹. It represents the best information that is available for
the wanted eigenvalue; therefore, it is also the best candidate as a starting vector after
the restart [146].
The GMRES method is designed to solve nonsymmetric linear systems. The most popular
form of GMRES is based on the modified Gram-Schmidt procedure and uses restarts. If
no restarts are used, GMRES converges in at most N steps. This is of no practical
value here since N is very large. Moreover, the storage and computational requirements
in the absence of restarts are prohibitive. However, there exist cases for which the method
stagnates and convergence takes place only at the Nth step. For such systems, any choice
of restart less than N fails to converge.
Algorithm. The GMRES method
Start: Choose x0 and compute r0 = f − Ax0 and v1 = r0/||r0||.
Iterate: For j = 1, 2, . . . , m do:
    hi,j = (Avj, vi), i = 1, 2, . . . , j,
    v̂j+1 = Avj − Σ_{i=1}^{j} hi,j vi,
    hj+1,j = ||v̂j+1||, and
    vj+1 = v̂j+1/hj+1,j.
Form the approximate solution: xm = x0 + Vmym, where ym minimizes ||βe1 − H̄my||,
y ∈ R^m.
Restart: Compute rm = f − Axm; if satisfied, then stop,
else set x0 = xm, v1 = rm/||rm|| and iterate once again.
The least squares problem min ||βe1 − H̄my|| is solved by factorizing H̄m into QmRm
using plane rotations. The difficulty is in choosing an appropriate value for the restart.
If it is too small, GMRES may be slow to converge or fail to converge entirely. A restart
value larger than necessary involves excessive work and uses more storage. There are no
definite rules governing the choice of restart; it is a matter of experience. More details
on the practical implementation of the GMRES method can be found in reference [148].
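For reference, the algorithm above can be transcribed into a compact restarted GMRES in plain Python. This is an illustrative sketch, not the production solver; in particular, the small least squares problem is solved here via the normal equations rather than the plane rotations used in practice:

```python
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def norm(v):
    return sum(x * x for x in v) ** 0.5

def solve_dense(G, rhs):
    # small dense solve by Gaussian elimination with partial pivoting
    n = len(G)
    M = [row[:] + [r] for row, r in zip(G, rhs)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    y = [0.0] * n
    for i in range(n - 1, -1, -1):
        y[i] = (M[i][n] - sum(M[i][j] * y[j] for j in range(i + 1, n))) / M[i][i]
    return y

def gmres(A, f, x0, m=10, tol=1e-10, max_restarts=50):
    x = x0[:]
    for _ in range(max_restarts):
        r = [fi - yi for fi, yi in zip(f, matvec(A, x))]
        beta = norm(r)
        if beta < tol:
            break
        V = [[ri / beta for ri in r]]
        H = [[0.0] * m for _ in range(m + 1)]
        k = m
        for j in range(m):
            w = matvec(A, V[j])
            for i in range(j + 1):          # modified Gram-Schmidt
                H[i][j] = sum(wi * vi for wi, vi in zip(w, V[i]))
                w = [wi - H[i][j] * vi for wi, vi in zip(w, V[i])]
            H[j + 1][j] = norm(w)
            if H[j + 1][j] < 1e-14:         # happy breakdown
                k = j + 1
                break
            V.append([wi / H[j + 1][j] for wi in w])
        # least squares min ||beta e1 - Hbar y|| via the normal equations
        G = [[sum(H[l][i] * H[l][j] for l in range(k + 1)) for j in range(k)]
             for i in range(k)]
        rhs = [beta * H[0][i] for i in range(k)]
        y = solve_dense(G, rhs)
        x = [xi + sum(y[j] * V[j][i] for j in range(k)) for i, xi in enumerate(x)]
    return x

A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
f = [1.0, 2.0, 3.0]
x = gmres(A, f, [0.0, 0.0, 0.0], m=3)
assert norm([fi - yi for fi, yi in zip(f, matvec(A, x))]) < 1e-8
```

For the correction equation, A would be the projected operator of equation 5.3 and the tolerance would be relaxed to the 10⁻¹ accuracy discussed next.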
The correction equation is solved to an accuracy of just 10⁻¹. This is sufficient
to keep the number of outer iterations between 4 and 10, with the internal restart set
to 10. GMRES, although more expensive than other linear solvers, is chosen because it is
found to be more stable in solving the correction equation for the TB Hamiltonian
[147, 148]. The computation could be further improved by treating the H matrix with a
preconditioner. However, the preconditioner would occupy a similar amount of memory
as the actual matrix and would also add further time-consuming matrix-vector
multiplications per iteration. Hence, it may not be a wise choice for a GPU-accelerated
solver where 10⁻¹ accuracy is sufficient.
5.1.2 FEAST method
The aim of the FEAST algorithm is to actually compute the eigenvectors instead of
approximating them, unlike the Lanczos and Jacobi-Davidson method. It yields all the
eigenvalues and eigenvectors within a given search interval [λmin, λmax]. FEAST relies on
the Rayleigh-Ritz method [123,124] for finding the eigenvector space V in some enveloping
space U ⊇ V . Let Γ be a simply closed differentiable curve in the complex plane that
encloses exactly the eigenvalues λ1, ..., λm and z be the contour point. Using the Cauchy
integral theorem, it can easily be shown that
V V ∗
=
1
2πi Γ
(zI − H)−1
dz = Q (5.4)
Next, choose a random matrix Y ∈ C^(n×m0), where m0 is the size of the working
subspace, slightly larger than m, the number of eigenvalues within the search
interval. The expression in 5.4 leads to a new set of m0 independent vectors
Q_(n×m0) = [q1, q2, ..., qm0], obtained by solving linear systems along the contour,
and forms U = QY. It follows that U = span(U) ⊇ V is a candidate for the space used
in the Rayleigh-Ritz method. The matrix U can thus be computed; for our TB Hamiltonian
matrix, 3 to 8 integration points are sufficient. Then, for each integration point z, a
block linear system (zI − H)Ui = Yi needs to be solved, each with m0 right-hand sides.
Notice that the matrix changes with z throughout the run.
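The contour integration behind equation 5.4 can be illustrated on a toy diagonal matrix, for which each linear solve along the contour is trivial. This is an illustrative sketch only (here a trapezoidal rule on a circular contour, with hypothetical names); its point is that the trace of the approximate projector counts the eigenvalues enclosed by the contour:

```python
import cmath

# Quadrature approximation of (1/2*pi*i) \oint (zI - H)^{-1} dz on a circle.
# For a diagonal test H, each linear solve (zI - H)u = y is elementwise, so
# the trace of the approximate projector is a simple sum over eigenvalues.
def projector_trace(eigs, center, radius, points=32):
    trace = 0j
    for k in range(points):
        theta = 2 * cmath.pi * k / points
        w = radius * cmath.exp(1j * theta)    # z - center; dz = i*w*dtheta
        z = center + w
        # contribution of each (trivial) solve along the contour
        trace += sum(w / (z - lam) for lam in eigs) / points
    return trace

# eigenvalues {1, 2, 3} lie inside the circle |z - 2| = 2.5; 10 lies outside
t = projector_trace([1.0, 2.0, 3.0, 10.0], center=2.0, radius=2.5)
assert abs(t.real - 3) < 1e-6 and abs(t.imag) < 1e-6
```

For the TB Hamiltonian each quadrature point instead requires a full block linear solve (zI − H)Ui = Yi, which is why that solve dominates the runtime and is the part ported to the GPU.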
The FEAST algorithm can be parallelized in several ways. First, the interval
[λmin, λmax] can be split and each part treated separately. Also, for each contour
point, the block linear system can be solved independently of the others, and each
linear system can in principle be solved in parallel [149]. Here, FEAST has not been
parallelized using any of the mentioned strategies. Instead, the solver that finds the
solution of each linear system is parallelized using our multi-GPU enhanced techniques,
since the solution of the block linear system is the most expensive part of the method.
The conjugate gradient squared method (CGS) is employed to solve the block of inner
independent linear systems, since the cost per iteration of CGS is cheaper than that of
GMRES in terms of computation and memory [144, 150]. The inner independent linear
systems need to be solved to a high accuracy of at least 10⁻⁶. For a non-converged
linear system, the solver can be stopped after a few hundred iterations. The CGS
method is outlined below.
Algorithm. The CGS method
Choose an initial guess x0 and r̃0
r0 = b − Ax0
u−1 = w−1 = 0, α−1 = σ−1 = 1
for k = 0, 1, 2, . . . do
    ρk = (rk, r̃0)
    βk = (−1/αk−1)(ρk/σk−1)
    vk = rk − βk uk−1
    wk = vk − βk(uk−1 − βk wk−1)
    c = Awk
    σk = (c, r̃0)
    αk = ρk/σk
    uk = vk − αk c
    xk+1 = xk + αk(vk + uk)
    if xk+1 is accurate enough, then stop
    if not, rk+1 = rk − αk A(vk + uk) and iterate
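The recurrences above can be transcribed directly into plain Python. This is an illustrative sketch on a small real test system, not the mixed-precision GPU solver:

```python
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cgs(A, b, x0, tol=1e-10, maxit=100):
    # plain transcription of the CGS recurrences listed above
    n = len(b)
    x = x0[:]
    r = [bi - yi for bi, yi in zip(b, matvec(A, x))]
    rt = r[:]                       # shadow residual r~0
    u = [0.0] * n                   # u_{-1}
    w = [0.0] * n                   # w_{-1}
    alpha, sigma = 1.0, 1.0         # alpha_{-1}, sigma_{-1}
    for _ in range(maxit):
        rho = dot(r, rt)
        beta = (-1.0 / alpha) * (rho / sigma)
        v = [ri - beta * ui for ri, ui in zip(r, u)]
        w = [vi - beta * (ui - beta * wi) for vi, ui, wi in zip(v, u, w)]
        c = matvec(A, w)
        sigma = dot(c, rt)
        alpha = rho / sigma
        u = [vi - alpha * ci for vi, ci in zip(v, c)]
        s = [vi + ui for vi, ui in zip(v, u)]
        x = [xi + alpha * si for xi, si in zip(x, s)]
        As = matvec(A, s)
        r = [ri - alpha * ai for ri, ai in zip(r, As)]
        if sum(ri * ri for ri in r) ** 0.5 < tol:
            break
    return x

A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
b = [1.0, 2.0, 3.0]
x = cgs(A, b, [0.0, 0.0, 0.0])
res = [bi - yi for bi, yi in zip(b, matvec(A, x))]
assert sum(ri * ri for ri in res) ** 0.5 < 1e-8
```

Note that a single CGS iteration costs two matrix-vector products (on wk and on vk + uk) but needs less storage than GMRES, which is the trade-off mentioned above.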
Often, convergence is improved by using an incomplete factorization method based on
Gaussian elimination, such as incomplete LU (ILU), as a preconditioner [151]. However,
for the TB Hamiltonian matrix under consideration, the ILU factorization with zero
fill-in is not sufficient for convergence: if utilized, it takes more iterations to converge
than the unpreconditioned case, and hence higher levels of fill-in need to be performed.
As the fill-in level of an ILU decomposition increases, the quality of the ILU
preconditioner improves, but this also changes the sparsity of the preconditioner
matrix. Thus, more accurate ILU preconditioners require more memory, to such an extent
that eventually the running time of the algorithm increases even though the total number
of iterations in the linear solver decreases. Also, the parallelization of ILU involves many
data transfers between the nodes, since almost the entire TB Hamiltonian matrix is
needed on each node, and it takes a noticeable amount of compute time because a fresh
ILU factorization has to be computed for each contour point as the matrix keeps
changing. Therefore, a FEAST implementation that relies on an incomplete-factorization
preconditioner has not been implemented. To obtain a higher speedup and a low memory
footprint, parallel preconditioners better suited to GPU parallelism must be developed.
5.2 Benchmarking results, comparison and
discussion
All benchmarks are performed by running the algorithms to find the lowest 8 conduction
band energy eigenstates of atomistic quantum dots similar to the one shown in Figure 5.1.
Here, the Lanczos, Jacobi-Davidson and FEAST methods are compared, and I especially
focus on their ability to compute multiple eigenpairs.
Figure 5.1: (Left) Cubical wurtzite GaN/AlGaN quantum dot showing the core with
30% Aluminum. (Right) A central slice of the cube. Atomistic description: Aluminum
in yellow, Gallium in red
The GPU implementation of the algorithms and linear solver is done utilizing the
TB Hamiltonian splitting approach, the mixed real-complex arithmetic matrix-vector
multiplication CUDA kernel and all of the parallel GPU implementation techniques and
optimization strategies discussed in Chapter 4. However, in the case of the FEAST
method, the matrix keeps changing with the contour points, as zI − H (or z*I − H).
Therefore, it is not optimal to use the splitting approach, since tests have shown that a
significant amount of time is spent building the split matrix and dropping the zeros.
The Lanczos algorithm has been fully ported to the GPU and vectorized to scale
with MPI parallelization on multi-GPU workstations, as shown in Chapter 4. Similarly,
the Jacobi-Davidson algorithm has been implemented on the GPU, along with the GMRES
method, which is utilized as the linear solver for the Jacobi-Davidson correction equation.
In order to spare GPU memory, the subspace vectors are saved in host memory. This
strategy enables the treatment of larger systems at the expense of more device-host
communication. A comparison between the Jacobi-Davidson algorithm with and without
the subspace in device memory is shown in the following subsections. Concerning FEAST,
only the linear solver (CGS) has been ported to the GPU, given that this is the most
time-consuming part of the algorithm. In this respect, Lanczos and Jacobi-Davidson can
be considered pure GPU implementations and FEAST a hybrid CPU-GPU one, even
though 98% of the total time is spent on the GPU solving the block linear system. The
relevant details of the test hardware are given below.
• Test system 5 (Multi-GPU workstation): Intel Xeon Processor E5-2620 (6 cores, 2
GHz, Cache 15 MB), 64 GB DDR3 SDRAM (Speed 1333 MHz) and 2 Nvidia Tesla
K40 (Chip Kepler GK110B GPU, Processor Clock 745 MHz, CUDA cores 2880,
Memory Clock 3.0 GHz, Memory Size 12 GB, Peak performance 1.43 Tflops) +
2 Nvidia Tesla K20 (Chip Kepler GK110 GPU, Processor Clock 706 MHz, CUDA
cores 2496, Memory Clock 2.6 GHz, Memory Size 5 GB, Peak performance 1.17
Tflops) connected on the same PCI-E with an operating system based on Linux
kernel 3.0.85.
• Test system 6 (HPC cluster): 2208 compute nodes, each node has 2 Intel Xeon
X5570 (4 cores, 2.93 GHz, Cache 8 MB), 24 GB DDR3 SDRAM (Speed 1066 MHz).
Nodes are connected through an InfiniBand QDR network with a non-blocking Fat
Tree topology, with a total peak performance of 207 Tflops, and have an operating
system based on Linux kernel 2.6.32.
5.2.1 Eigensolver evaluation on a Multi-GPU workstation
Figure 5.2: Time comparison between methods on 1 Kepler GPU for the calculation of
8 energy eigenstates
Figure 5.3: Time comparison between methods on 4 Kepler GPUs for the calculation
of 8 energy eigenstates
On a single GPU, Jacobi-Davidson with the subspace in host memory performs almost 2×
faster than Lanczos and 13× faster than FEAST, as seen from Figure 5.2. However,
when we move from one GPU to a multi-GPU scenario, as shown in Figure 5.3, Jacobi-
Davidson with the subspace in host memory performs only 1.4× faster than Lanczos
when the first few eigenstates are searched. The decrease in speedup compared to the
single GPU implementation is attributed to the fact that the sparse mixed real-complex
matrix-vector operations become less significant, as seen in Table 5.1. Also, since the
subspace is saved in host memory, it imposes more host-GPU data movement than
Lanczos, as seen from Figure 5.7; this is the main speed limiting factor for any parallel
implementation. To attain ideal scaling, there should not be any data dependency or
synchronization between GPUs. Also, there should be enough data to utilize all the
GPU cores efficiently. As noticed from Figures 5.2, 5.3, 5.8 and 5.9 with regard to the
Jacobi-Davidson implementation with the subspace stored in device memory, it is only
possible to fit up to a 151,472 atom quantum dot on GPUs having a memory limit of
5 GB. Therefore, as already stated, it is crucial to employ the implementation that
spares memory by moving the subspace to host memory. The rest of the discussion in
the following subsections corresponds to the Jacobi-Davidson method with the subspace
stored in host memory.
Figures 5.4, 5.5 and 5.6 show the scaling of each method over multiple GPUs. We
observe that the Lanczos and FEAST methods exhibit strong scaling for a large
quantum dot. The ample data movement in the Jacobi-Davidson implementation, due to
the subspace being stored in host memory, impedes its scaling performance.
Figure 5.4: Scaling of Lanczos method on 1 to 4 GPUs
Figure 5.5: Scaling of Jacobi-Davidson (subspace in host memory) method on 1 to 4
GPUs
Figure 5.6: Scaling of FEAST method on 1 to 4 GPUs
The profiling results from a data movement perspective for the 151,472 atom quantum dot
are shown in Figure 5.7. Notice that Lanczos is a compute intensive algorithm, as almost
99% of the time is used for computation, with minimal data transfer, which happens only
at launch when the matrix is loaded onto the GPU memory. In the case of the
Jacobi-Davidson method, the host-to-device and device-to-host data transfers account
for 15-20% of the total effective time, since the subspace is stored in host memory.
The CGS method, used to solve the block linear system within FEAST, imposes an ample
amount of device-to-device data transfer, accounting for 10-25% of the total computation
time. We attain a peak bandwidth of 7.45 GB/sec between the host and the device.
Figure 5.7: Percentage of time taken for memory and compute operations on (Left) 1
GPU and (Right) 4 GPUs respectively
The profiling tests have also revealed that, given the sequential nature of the iterative
algorithms and the pure GPU implementation with minimal data transfer, it is not
possible to obtain any significant memory copy/compute overlap. Only in the case of the
Jacobi-Davidson method is a 3% compute/memory copy overlap obtained, since the
subspace vectors are stored in host memory. It is expected that this number will
increase as the size of the quantum dot increases.
Tables 5.1, 5.2 and 5.3 show the profiling results for the compute operations of the
algorithms for the 151,472-atom quantum dot. In all three methods, the sparse
matrix-vector multiplication is the most important computational task. However, when we
move from the single-GPU to the multi-GPU implementation of the Jacobi-Davidson method,
the dense subspace-vector multiplication gains significance over the sparse Hamiltonian
matrix-vector multiplication. Notice in Table 5.1 that the GPU occupancy for this
operation is very low; hence, it would be best to offload it onto the CPU.
Increasing the warp efficiency maximizes GPU compute resource utilization; a low
value indicates divergent branches.
As the size of the nanostructure increases, usually more energy states are needed, and
these states tend to be closely spaced. This poses a challenge for realistic nanostructure
simulations since the eigenvalues become less distinct. The investigation has shown that
Jacobi-Davidson is the most robust method in terms of convergence. Even
for closely spaced energy states, the algorithm performs fairly well compared to the other
Table 5.1: Profiler output for the 151,472-atom quantum dot, listing the most significant
compute operations within the Jacobi-Davidson method with the subspace stored in host
memory

Operation | Time (1 GPU) | Occupancy (1 GPU) | Warp eff. (1 GPU) | Time (multi-GPU) | Occupancy (multi-GPU) | Warp eff. (multi-GPU) | Shared memory | Registers
Mixed complex-real SpMxV product, Mul(Hreal, qcomplex) | 45.30% | 0.991 | 90.89% | 32.20% | 0.972 | 94.05% | 4096 | 28
Vector operations, y = y + αx | 15.50% | 0.976 | 100.00% | 12.20% | 0.942 | 100.00% | 0 | 20
Dense MxV operation | 14.70% | 0.197 | 89.35% | 37.40% | 0.201 | 89.33% | 10240 | 60
Dot product | 13.80% | 0.497 | 100.00% | 8.30% | 0.482 | 100.00% | 1024 | 28
Shift matrix | 3.00% | 0.998 | 69.94% | 1.70% | 0.997 | 73.14% | 0 | 8
Table 5.2: Profiler output for the 151,472-atom quantum dot, listing the most significant
compute operations within the Lanczos method

Operation | Time (1 GPU) | Occupancy (1 GPU) | Warp eff. (1 GPU) | Time (multi-GPU) | Occupancy (multi-GPU) | Warp eff. (multi-GPU) | Shared memory | Registers
Mixed complex-real SpMxV product, Mul(Hreal, qcomplex) | 84.20% | 0.941 | 91.14% | 82.80% | 0.924 | 94.13% | 4096 | 29
Mixed complex-real SpMxV product, Mul(Himag, qcomplex) | 3.20% | 0.876 | 42.02% | 3.50% | 0.829 | 51.87% | 0 | 32
Vector operations, y = y + αx | 7.80% | 0.781 | 100.00% | 8.30% | 0.748 | 100.00% | 0 | 14
Table 5.3: Profiler output for the 151,472-atom quantum dot, listing the most significant
compute operations within the CGS method (linear solver for FEAST)

Operation | Time (1 GPU) | Occupancy (1 GPU) | Warp eff. (1 GPU) | Time (multi-GPU) | Occupancy (multi-GPU) | Warp eff. (multi-GPU) | Shared memory | Registers
Complex SpMxV product, Mul(Hcomplex, qcomplex) | 85.50% | 0.993 | 89.83% | 83.80% | 0.976 | 93.07% | 4096 | 31
Vector operations, y = y + αx | 11.70% | 0.973 | 100.00% | 13.60% | 0.923 | 100.00% | 0 | 16-21
Dot product | 2.70% | 0.497 | 100.00% | 2.40% | 0.491 | 100.00% | 1024 | 28
methods; typically 300-600 iterations are sufficient to find the first few energy states.
Experience shows that, for fast convergence in Jacobi-Davidson, the minimum dimension
of the subspace can safely be restricted to 4 more than the number of wanted energy states,
and the maximum dimension needs to be at least 10 more than the number of wanted
energy states, i.e. in this case minimum = 8+4 and maximum = 8+10, since 8 energy states
are sought. In the case of the Lanczos method, convergence slows drastically for
a dense eigenvalue spectrum, and the convergence rate falls as the size of the quantum dot
increases. Usually, for big systems, around 10,000-20,000 Lanczos iterations are needed to
find each energy state. Similarly, in the case of FEAST, more contour points and a bigger
search space are needed to improve convergence, which also translates into more work and
more memory utilization per FEAST iteration. Typically, 10-25 FEAST iterations are
sufficient for good accuracy. Comparing the accuracy of the methods with a direct
diagonalization carried out on a small nanostructure, it was found that FEAST
delivered results to an absolute accuracy of 10⁻¹¹, while the Lanczos and Jacobi-Davidson
methods delivered an absolute accuracy of 10⁻⁶. The convergence stopping criterion in all
three methods was set to 10⁻⁵ eV.
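The subspace sizing rule quoted above can be sketched as simple bookkeeping. The following pure-Python fragment (function names are illustrative, not taken from the thesis code) encodes the minimum/maximum restart dimensions and traces how the search space grows by one vector per outer iteration and collapses back to the minimum at each restart:

```python
def jd_restart_dims(n_wanted, min_pad=4, max_pad=10):
    """Subspace size limits for a restarted Jacobi-Davidson run.

    Empirical rule from the benchmarks: restart the search space at
    n_wanted + min_pad vectors and let it grow to n_wanted + max_pad
    before the next restart. Other problems may need different margins.
    """
    return n_wanted + min_pad, n_wanted + max_pad

def subspace_trajectory(n_wanted, n_expansions):
    """Trace the subspace dimension: it grows by one basis vector per
    outer iteration and collapses back to the minimum whenever the
    maximum is reached."""
    m_min, m_max = jd_restart_dims(n_wanted)
    dims, m = [], m_min
    for _ in range(n_expansions):
        dims.append(m)
        m = m_min if m == m_max else m + 1
    return dims
```

For the 8 wanted states of the benchmark, `jd_restart_dims(8)` gives the (12, 18) pair quoted in the text.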
Figure 5.8: Memory consumption between methods on 1 GPU
Regarding memory occupancy, as shown in Figures 5.8 and 5.9, the single-GPU Lanczos
implementation occupies the least amount of memory since subspace vectors are not
stored. The slightly higher memory occupancy of the CGS solver used in the FEAST
Figure 5.9: Memory consumption between methods on 4 GPUs
method can be attributed to the original complex TB Hamiltonian matrix, since the
splitting technique was not used there. For the Jacobi-Davidson method, a subspace of 8+10
vectors is needed for the basis, and an additional 8+10 vectors are needed for
the projection of the H matrix onto this subspace. If the subspace is stored on the GPU,
the feasible simulation size of the quantum dot is halved. In a multi-GPU
system, the TB Hamiltonian is divided equally among the GPUs. As the Hamiltonian size on
each node shrinks, the subspace and temporary vectors required by the
implementation gain importance and overtake the Hamiltonian as the
chief memory consumers.
One advantage of the Lanczos method over the other methods is that, since each eigenstate
is calculated one at a time, it is possible to calculate a degenerate energy state with
just one matrix-vector multiplication; once found, this eigenpair is projected out and the
remaining unique energy states are calculated. However, Jacobi-Davidson is also found to be
robust in this case, since with harmonic extraction it finds the degenerate state within a
few iterations in most cases.
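The projection step mentioned above — removing an already converged eigenpair so that the next Lanczos run converges to a new, orthogonal state — can be sketched as follows. This is an illustrative pure-Python fragment on real vectors (the GPU code performs the same operation on complex data with dot/axpy kernels); the function name is hypothetical:

```python
def project_out(w, found):
    """Deflate previously converged eigenvectors from a work vector.

    Removing the components of w along each converged (unit-norm)
    eigenvector forces a subsequent Lanczos run to find a new state,
    possibly degenerate in energy but orthogonal to those found.
    """
    for v in found:
        coeff = sum(vi * wi for vi, wi in zip(v, w))   # <v, w>
        w = [wi - coeff * vi for vi, wi in zip(v, w)]  # w -= <v, w> v
    return w
```

In practice the deflation is reapplied periodically, since finite-precision arithmetic lets the removed components grow back during the iteration.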
5.2.2 Eigensolver evaluation on a HPC cluster
As described in Test system 6, each node has a dual quad-core CPU with 24 GB of main
memory. A hybrid MPI-OpenMP (multi-process/multi-thread) implementation has been
Figure 5.10: Time performance comparison between Lanczos, Jacobi-Davidson and
FEAST method on 4, 8, 16 and 32 nodes of the HPC cluster for the calculation of 8
energy eigenstates
employed for each of these methods. The benchmark calculation has been performed for
4, 8, 16 and 32 MPI processes with a constant 8 OpenMP threads on each node,
corresponding to 32, 64, 128 and 256 CPU cores in use. Figure 5.10 shows the weak
scaling, while Figures 5.11, 5.12 and 5.13 show the strong-scaling results for the
benchmark calculation performed on the HPC cluster.
Memory analysis shows that there is no significant difference in memory consumption
whether the Hamiltonian is split over 4 nodes or 32 nodes. This is because the size of the
subspace and temporary vectors outweighs that of the TB Hamiltonian matrix, which has
been highly memory-optimized using single-precision storage and the splitting
technique. Of the three methods considered, Lanczos is the most memory efficient, given
that no subspace vectors are saved because of the choice of more flops over bytes. It is
followed by the FEAST method using CGS as the linear solver, which requires 3.2×
more memory than Lanczos, mainly because a search space bigger than the number of
eigenpairs in the given interval is needed. The Jacobi-Davidson method is found to be the
most memory expensive, given its requirement to save an adequate subspace and solve
the complex-algebra correction equation. Jacobi-Davidson requires 5×
more memory than Lanczos, and hence we can fit only up to a 699,399-atom quantum dot
on the test hardware.
Figure 5.11: Scaling of Lanczos method on 4, 8, 16 and 32 nodes of the HPC cluster
Figure 5.12: Scaling of Jacobi-Davidson (subspace in host memory) method on 4, 8, 16
and 32 nodes of the HPC cluster
Figure 5.13: Scaling of FEAST method on 4, 8, 16 and 32 nodes of the HPC cluster
To summarize the findings: for small systems, Jacobi-Davidson performs on average
10.2× faster than Lanczos, a factor that increases to 17.2× with system size, given the
slow convergence of Lanczos for closely spaced energy states in large quantum dots. The
FEAST method, by contrast, executes on average 1.6× slower than Lanczos for small
systems, which increases to 9.3× for large systems since more contour points are needed
for convergence. One trend common to all three methods is the speedup obtained when the
number of nodes is doubled: 1.5× from 4 to 8 nodes, 1.3× from 8 to 16 and 1.15× from 16
to 32 nodes. The decrease in speedup as nodes are added is mainly due to process
synchronization and limitations in inter-node communications.
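Chaining the measured per-doubling factors gives the cumulative gain from 4 to 32 nodes, which can be checked with a few lines of arithmetic (values taken from the text; the 8× reference assumes ideal linear scaling over the three doublings):

```python
# Cumulative speedup implied by the measured per-doubling factors,
# compared with the ideal linear-scaling factor of 8x for 4 -> 32 nodes.
doubling_speedups = [1.5, 1.3, 1.15]   # 4->8, 8->16, 16->32 nodes

cumulative = 1.0
for s in doubling_speedups:
    cumulative *= s            # 1.5 * 1.3 * 1.15 = 2.2425

parallel_efficiency = cumulative / 8.0  # ideal would be 2x per doubling
```

So an 8-fold increase in nodes buys only about a 2.24× speedup, i.e. roughly 28% of ideal scaling, which quantifies the synchronization and communication overhead noted above.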
5.2.3 Performance comparison between GPU and HPC cluster
To examine the advantage of GPUs over an expensive HPC cluster for TB calculations,
let us compare the performance of 1 and 4 Tesla Kepler GPUs with 256 CPU cores, and
also inspect the gain of multiple GPUs over a single GPU. Comparing the performance of the
different methods on the different hardware, we can infer that for the Lanczos and
FEAST methods a 3.0× and 2.6× speedup is achieved when going from 1 GPU
to 4 GPUs for a large quantum dot. For Jacobi-Davidson, the speedup
is limited to a factor of 1.6×, demonstrating that the transfer of the subspace from the
host to the device and vice versa is the limiting factor as already stated.
When the performance of 256 CPU cores on the HPC cluster is compared with a single
Tesla Kepler GPU, the Jacobi-Davidson method on the HPC cluster is found to
outperform the GPU by a factor of 1.2×. On the contrary, the implementations of the
Lanczos and FEAST methods on 1 GPU beat the performance of the 256 CPU cores by
factors of 5.8× and 4.1× respectively. Comparing the multi-GPU implementation on 4
GPUs against the 256 CPU cores of the HPC cluster for the Jacobi-Davidson, Lanczos and
FEAST methods, the multi-GPU system outperforms the HPC cluster by factors of
1.5×, 13.7× and 10.8× respectively.
5.3 Summary
Three different eigenvalue algorithms commonly employed for electronic band
calculations have been implemented and optimized for a multi-GPU workstation. An
analysis of timing, memory occupancy and convergence on a multi-GPU workstation
and an HPC cluster has been performed. Through this work, the feasibility and advantage
of each method as an eigensolver, specifically for large-scale TB calculations, have been
examined. The tests have shown that Jacobi-Davidson is the most robust method in
terms of convergence and is fast in terms of execution time, but suffers from a high
memory requirement. Lanczos, on the contrary, is the most memory-efficient method.
Chapter 6
Application of GPU accelerated
atomistic simulations
Numerical simulations of quantum heterostructures derived from experimental results will
be performed using the GPU-based ETB implementation discussed in the previous chapters.
As already shown, GPUs facilitate the simulation of realistic nanostructures within a
reasonable time frame compared to HPC clusters. Here, two different applications of
GPU-accelerated atomistic simulations are presented. First, a GaAs/Al0.3Ga0.7As complex
dot/ring nanostructure is studied [152]. The fabricated nanostructure is too large for an
ETB calculation to be performed directly; hence, a study of an ideally scaled complex
quantum dot/ring nanostructure is presented. Second, a real sample containing large
InGaN islands with non-uniform Indium content is analyzed [153]. The three-dimensional
models for the quantum dot have been directly extrapolated from experimental results by
a numerical algorithm.
6.1 Atomistic simulation of complex quantum
dot/ring nanostructure
Complex three-dimensional quantum nanostructures are being fabricated in labs given
the potential to adjust their electronic properties via fine tuning of size and shape [152].
These physical parameters set the confinement potential for the electrical charge carriers,
thus determining the electronic and optical properties of the quantum nanostructured
system.
In this work, a complex GaAs quantum nanostructure over an Al0.3Ga0.7As buffer layer
has been considered to compute the electron states. A multiphysics quantum/classical
simulation coupling drift-diffusion with the ETB method has been performed. The
multiscale software tool TiberCAD, into which the GPU implementation of the
eigensolvers discussed in the previous chapters has been incorporated, has been used to
calculate the energy gap as well as the spatial probability density (SPD) of a scaled
quantum dot/ring nanostructure similar to the one shown in Figure 6.1.
Figure 6.1: Atomic force microscope images of GaAs/Al0.3Ga0.7As complex quantum
dot/ring nanostructure (Source: Sanguinetti (2011))
The nanostructure studied consists of a central cylindrical quantum dot and a
surrounding ring of GaAs, surrounded by AlGaAs. The dot has a diameter of 16 nm,
and the ring a width of 5 nm. The spatial separation between the dot and the ring is 5
nm. The dot is 7 nm high while the outer ring is 5 nm high. The structure is grown on
Al0.3Ga0.7As on the (001) plane and covered with 1.4 nm and 3.4 nm thick Al0.3Ga0.7As,
respectively (see Figures 6.2 and 6.3). 2 nm of the substrate and a 0.8 nm outer AlGaAs
shell have been included in the simulations.
Calculations are performed on the structure described above for varying quantum
dot size. The size of the quantum dot is varied by varying its radius; similarly, the
height of the quantum dot could also be varied. Twenty electron states per structure,
including the spin states, are sought using the ETB method. The resulting density
is projected onto the finite element mesh used for the classical models. The solutions also
provide the SPD for the electrons. In order to couple the atomistic calculation with the
continuous-media model, the macroscopic electrostatic potential is calculated by solving
the Poisson equation and is projected onto the atomic positions by interpolation. Due to
GPU memory limitations for structures having more than 500,000 atoms, we are restricted
to finding fewer than twenty states with the ETB method, which is sufficient for
Figure 6.2: (Below) Lateral view, (Above) Top view: Geometry of dot/ring complex
nanostructure
this work. The sp³d⁵s* parametrization is considered for the calculation of the electron
energy states.
Here, it is of interest to find nanostructure sizes for which electron states localized
in the dot and in the ring have the same energies and therefore delocalize over both dot
and ring. Taking into account the unavoidable hole-state localization that takes place in
these nanoscale heterostructures due to the higher effective mass, this would make it
possible to produce closely spaced in energy, tunable (by controlling the actual
nanostructure sizes) lambda-type absorption resonances in topologically complex
nanostructures. The lambda resonances exhibited by the investigated dot/ring
nanostructures have many potential applications in photon storage for quantum
computing (low group velocity media [154]), metamaterials [155, 156] and terahertz
generation [157]. The atomistic calculations are performed for varying dot size so as
to predict the dot and ring dimensions needed to delocalize the electron states and
lead to the formation of lambda states.
Figure 6.3: Partly sliced GaAs/Al0.3Ga0.7As complex quantum dot/ring nanostructure
with 30% Al, 70% Ga. Atomistic description: in Pink Aluminum, in Blue Gallium
Figure 6.4: Electron states using ETB methods for varying radius of the quantum dot
while the rest of the geometry of the complex nanostructure is kept fixed
In Figure 6.4, we see the eigenenergies of the electron states found using the ETB method.
Here, the energy frame is defined such that the Fermi energy is 0 eV. The plots look less
dense for some structures since it was only possible to calculate sixteen electron states
due to limitations in GPU memory. Figure 6.5 shows the probability densities of the first
few electron states for a structure with 8 nm dot radius. In this case, all states are
localized either in the dot or in the ring.
Figure 6.6 shows the eigenenergies of the states with symmetry as shown in Figure 6.5
for different dot radii. The lines connect the energies of states that have been identified to
have the same symmetry by visually inspecting the wave functions. The graph suggests
Figure 6.5: SPD for first 8 electrons states using ETB method for the quantum dot
with radius = 8 nm
Figure 6.6: Evolution of eigenenergies with quantum dot radius. The lines connect
states which have been identified to have the same wave function symmetry.
Figure 6.7: Probability density for lambda states in quantum dot with radius = 6.2
nm, overlapping between states B, C and H
Figure 6.8: Probability density for lambda states in quantum dot with radius = 6.5
nm, overlapping between (Left) states B and F and (Right) states C and E
that the first excited dot states B and C become resonant with the H and the E/F states
for radii of roughly 6.2 nm and 6.5 nm, respectively. Notice that in Figure 6.6 state A
is not reported, as it is well separated from the B and C states and would form lambda
states only at unrealistically small quantum dot radii.
Figures 6.7 and 6.8 confirm this picture, showing strong mixing between the dot and
ring states for the dot radii where resonance is expected. For symmetry reasons, there is
a clear coupling between states of type B and F, and of type C and E, as seen in
Figure 6.8. Note that, also due to symmetry, the B/C dot states do not couple with
the ring state D.
6.2 Atomistic simulation of InGaN quantum dot
with Indium fluctuation
Recent scientific work has clearly pointed out how taking into account realistic elements
directly derived from experimental results can strengthen the effectiveness of models used
for simulations. A promising field of application for a comprehensive, realistic
multiscale approach [17] is the analysis of Indium Gallium Nitride systems,
because of their increasing role in the fabrication of LEDs. Here, an ETB calculation is
performed on a real sample containing large InGaN islands with sizes of tens of nanometers
and non-uniform Indium content.
Figure 6.9: InGaN quantum dot with varying content of Indium derived from
experimental high-resolution transmission electron microscopy
A complex algorithm has been developed in order to build a three-dimensional
geometry and a structure from the experimental image of the out-of-plane strain
obtained by geometric phase analysis (GPA) of the high-resolution transmission electron
microscopy image of a real sample. The latter contains several InGaN/GaN superlattices
and large InGaN quantum dot islands having sizes of tens of nanometers, with
Figure 6.10: A central slice of InGaN quantum dot with 19% Indium randomly
distributed. Atomistic description: in Red Indium, in White Gallium
Figure 6.11: InGaN quantum dot with uniform content of Indium. Description: in Red
19% Indium, in Blue 0% Indium
non-uniform Indium distribution similar to the one shown in Figure 6.9. Using the
Gwyddion software [158], we sampled the quantum dot and extrapolated a
three-dimensional structure. The details of the extrapolation method and the numerical
models are described in reference [159]. This extrapolated structure has been used to
create a finite element model to discretize the electronic ETB model.
ETB calculations of the quantum dot with a random Indium distribution have been
performed, and the results are compared to InGaN alloys within the Virtual Crystal
Approximation (VCA) (see Figure 6.11) [160,161]. The VCA treats an alloy ABC
as a fictitious material whose properties are a weighted average of the properties of its
alloy components.
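As a minimal illustration of the averaging the VCA performs, the sketch below linearly interpolates a material parameter between the two binary endpoints. The lattice constants used in the example are approximate literature values, and no bowing correction is included:

```python
def vca(x, p_inn, p_gan):
    """Virtual Crystal Approximation: treat In(x)Ga(1-x)N as a fictitious
    crystal whose material parameter is the composition-weighted average
    of the binary endpoints (no bowing term in this sketch)."""
    return x * p_inn + (1.0 - x) * p_gan

# Approximate wurtzite a-lattice constants (literature values, Angstrom):
A_INN, A_GAN = 3.545, 3.189
a_alloy = vca(0.19, A_INN, A_GAN)   # parameter for the 19% Indium dot
```

The same one-line average is applied to every material parameter entering the VCA reference calculation, which is why the VCA dot in Figure 6.11 is perfectly uniform.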
The ETB results shown in Figure 6.12 indicate that the confined states strongly depend
on the local distribution of Indium. This dependence is mainly due to the large energy-gap
difference between InN and GaN, with a valence band difference of just 0.45-0.5 eV
compared to 2.7-2.75 eV for the conduction band. The ground states are more likely to be
Figure 6.12: Electronic ground states obtained from ETB calculation of InGaN
quantum dot with random Indium content
Figure 6.13: Electronic ground states obtained from ETB calculation of InGaN
quantum dot with uniform Indium content
present in regions with higher Indium content which would dictate certain electronic and
optical properties of InGaN LEDs depending on whether the states overlap or not. In the
case of the quantum dot generated using VCA, the ground states are very symmetric and
ideally overlap each other as seen in Figure 6.13.
6.3 Summary
Numerical atomistic simulations of realistic quantum nanostructures have been carried
out using GPUs, showing that GPUs can accelerate ETB calculations tenfold compared to
state-of-the-art HPC clusters. In the first case, ETB calculations on a number of
idealized, scaled GaAs/Al0.3Ga0.7As complex quantum dot/ring nanostructures were
performed; GPUs helped cut the time needed to simulate these multiple samples from a
few weeks to a few days. Similarly, in the second case, GPUs were used to calculate the
ground states of a realistic InGaN quantum dot of around 750,000 atoms.
Chapter 7
Conclusion
In this work, it has been shown that large-scale atomistic simulation of nanostructured
devices, which plays a significant role in guiding and explaining experimental findings in
modern material science and semiconductor research but faces the computational
obstacle of diagonalizing the Hamiltonian matrix, can be accelerated using
parallel computing techniques and enhanced algorithms. Both of these
aspects have been addressed in this work by developing optimized algorithms to execute
on state-of-the-art computing hardware. It is widely known that implementing
algorithms that scale well over parallel computing architectures is essential
for transferring hardware advancements into beneficial speedups. This
also requires a deep knowledge of the method and of the underlying hardware
architecture being utilized.
Today’s GPUs are developed to help computational scientists push the frontiers.
They have certainly grasped the attention of many researchers, as is lately evident
from the extensive effort being put into translating algorithms initially designed for
other computing machines to GPUs. Here, it has been shown that GPUs can be used to
accelerate atomistic simulation of nanostructured devices by employing them for
the calculation of the energy eigenstates of a quantum nanostructured system. Benchmark
calculations are performed for an atomistic model of a wurtzite GaN/AlGaN quantum dot
parametrized using an ETB scheme, demonstrating that GPUs can be used very effectively
for iterative numerical optimization problems such as finding the extreme eigenvalues of
large sparse matrices.
Figure 7.1: Performance of Lanczos implementation benchmarked on different
technologies
In Chapter 4, a fine-tuned GPU-based parallel implementation of the Lanczos
algorithm with a simple restart is reported, as it has been identified as the algorithm
best fitted for computing a few eigenpairs on a GPU framework that can cope
with the memory limitations of current GPUs and slow GPU-CPU communication. Here, a
technique has been developed that exploits the structure of the TB Hamiltonian
matrices: the memory occupation is optimized by splitting the TB Hamiltonian into its
real and imaginary parts. This further required the development of
a new mixed real/complex arithmetic CUDA kernel. Performing the multiplication in this
split fashion resulted in a 35-40% memory saving without significant loss of
performance, thus allowing an increase in the maximum system size that can be handled
on a GPU. Likewise, it has been shown how the performance of the eigenvalue solver can
be further enhanced by mitigating the slow communication between GPUs, exploiting
the matrix sparsity pattern and, moreover, taking advantage of the GPU-GPU
communication offered by the new GPUDirect technology. The implementation designed
and tested is fully vectorized and scales with GPUs.
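The split multiplication can be illustrated with a small dense sketch. The pure-Python fragment below mirrors what the mixed real/complex kernel computes — y = (H_re + i·H_im)·q with the real and imaginary parts of the Hamiltonian stored separately — without any of the sparsity or CUDA machinery of the actual implementation:

```python
def split_spmv(h_re, h_im, q):
    """y = (H_re + i*H_im) @ q with separate real/imaginary storage.

    In the thesis implementation H_im holds far fewer non-zeros than
    H_re, so keeping the two parts as separate real sparse matrices
    (instead of one complex matrix) is what yields the 35-40% memory
    saving. Dense pure-Python sketch of the mixed-arithmetic kernel.
    """
    n = len(q)
    y = [0j] * n
    for i in range(n):
        acc = 0j
        for j in range(n):
            # Combine the split parts on the fly; the CUDA kernel does
            # the equivalent with real loads feeding complex arithmetic.
            acc += complex(h_re[i][j], h_im[i][j]) * q[j]
        y[i] = acc
    return y
```

In the real code each of the two parts is a sparse matrix in its own right, so the inner loop runs only over stored non-zeros of H_re and H_im separately.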
As evident from Figure 7.1, the fine-tuned Lanczos implementation benchmarked on a
Kepler K20c (Test system 1) performed on average 10× faster than the same OpenMP
implementation running on a Xeon quad-core CPU (Test system 3). Also shown are the
benchmark calculations in a multi-GPU
scenario, parallelized using MPI. In this context, the importance of fast data
transfer via direct PCI-E interconnects is shown. The performance of a dual-GPU system
versus an HPC cluster of up to 16 nodes connected via InfiniBand is shown: the
dual-GPU system is on average faster by a factor of 4.1× for a system comprising around
350,000 atoms and by more than a factor of 3.2× for systems comprising 600,000
atoms. Assuming an ideal parallel scaling on the InfiniBand HPC cluster, which might be
reached with faster interconnects, a large number of nodes would still be needed.
Currently, a 32-core IBM HPC system costs ≈ $90,000 and has a peak power consumption
of ≈ 791 Watts. On the other hand, a quad-core workstation with a single Kepler GPU
costs less than ≈ $10,000 and consumes ≈ 486 Watts of power, making GPUs more
cost-effective in terms of energy, infrastructure cost and maintenance. The drawback of
this fine-tuned GPU implementation is that memory limitations prevent
scaling the system size above a certain limit. Nonetheless, the amount of memory
hosted by GPUs is likely to increase in the future.
In the search for faster algorithms, it was noticed that a few methods are
more widely used for atomistic simulations given their implementation feasibility,
convergence characteristics, accuracy and reliability. Thus, a comprehensive study of
the Jacobi-Davidson, Lanczos and FEAST methods for energy eigenstate calculation in
nanostructures was conducted in Chapter 5, because it was still unclear which one is
better suited for GPUs and how they perform in a given setup. By creating, testing and
profiling performance-enhanced GPU implementations of these methods,
their feasibility and advantage as eigensolvers specifically for tight-binding
calculations were examined.
The study revealed that Jacobi-Davidson is the most robust method in terms of
convergence and the fastest in terms of execution time. However, it has a high memory
consumption and is therefore less suited to calculating the energy eigenstates of large
nanostructures. This shortcoming can be overcome, as shown, by moving the subspace
vectors to the host memory, thus enabling the calculation of the energy states of larger
systems. Nevertheless, this type of GPU implementation of Jacobi-Davidson does
not scale as well as Lanczos and FEAST. Lanczos, on the contrary, is the
most memory-efficient method, but its poor convergence for higher energy eigenstates
in large nanostructures is a primary bottleneck, which makes it not the first method of
Figure 7.2: Performance of Lanczos, Jacobi-Davidson (JD) and FEAST
implementation benchmarked on different technologies
choice. However, on a multi-GPU system it shows a superior scaling trend. The FEAST
method performs worst, since a preconditioner matrix was not utilized while solving
the block linear system: the construction of a typical preconditioner based on
incomplete factorization is expensive in terms of both memory and time and is not ideal
for a GPU-based implementation. This led to the important inference that
Jacobi-Davidson can be considered the best method, given its good convergence even
without a preconditioner matrix, and should be the method of choice on
computing systems where memory is not a constraint. On GPUs, it can be employed to
calculate the energy eigenstates of nanostructures of a few hundred thousand atoms.
Lanczos, on the other hand, is the method of choice when memory usage is the limiting
factor. Even though Lanczos is slow to converge, it can easily be scaled using a
multi-GPU implementation to perform on par with Jacobi-Davidson, as seen in Figure 7.2.
Two different applications of GPU-accelerated atomistic simulations were also
presented. First, numerical simulations of an idealized GaAs/Al0.3Ga0.7As complex
quantum dot/ring nanostructure were performed. GPUs were employed to carry out the
ETB calculations within a reasonable time frame for systems with varying quantum dot
size. The goal of the analysis was to fine-tune the electronic properties of the complex
nanostructure via size tuning, in order to find lambda states (coupled states) that are
localized in both the quantum dot and the quantum ring. This type of lambda-state
characteristic exhibited by complex nanostructures has many potential applications, from
quantum computing to metamaterials. Second, numerical simulations of quantum dot
structures derived from experimental high-resolution transmission electron microscopy
results were performed. A real sample containing large InGaN islands with sizes of tens
of nanometers and non-uniform Indium content was analyzed. The three-dimensional
models for the quantum dots were directly extrapolated from the experimental results
by a numerical algorithm. The ground energy eigenstates of these quantum dots, of more
than 750,000 atoms, were calculated using the GPU-based implementation for varying
Indium content within a few hours, compared with the few days that would be needed on
other hardware platforms.
Finally, was the principal objective of the proposed work realized?
This can be established by means of a test case. Let us consider the atomistic simulation
of a ≈200,000-atom quantum dot, which can be considered an average-sized
nanostructure often encountered in the computational electronics domain. Calculating 8
electron energy eigenstates using the ETB method with a Lanczos-type
eigensolver would take ≈24 hours with a sequential implementation on Test
system 3. On the same test system, an OpenMP implementation
would require ≈8 hours, while 16 nodes of an HPC cluster connected via
InfiniBand (Test system 2) using MPI-OpenMP technology would need ≈1.45
hours. Employing a Kepler GPU with the CUDA implementation of the fine-tuned
Lanczos-based eigensolver took ≈50 minutes, which was further lowered to ≈20
minutes using the MPI-CUDA implementation on 4 Kepler GPUs (Test system 5). With
the MPI-CUDA implementation of the Jacobi-Davidson method, the time was
further reduced to ≈10 minutes. Thus, one can say that the objective of accelerating
atomistic simulations was accomplished using enhanced algorithms, GPUs and other
parallel computing techniques. A multi-GPU system with a high-speed data interconnect
can be considered one of the best, most cost-effective and energy-efficient computing
architectures currently available to accelerate the atomistic simulation of nanostructured
devices.
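The speedup factors implied by these timings can be verified directly from the quoted numbers (all times converted to minutes; the dictionary labels are informal shorthand for the test configurations above):

```python
# Speedups of the test case (8 eigenstates of a ~200,000-atom dot)
# relative to the sequential CPU baseline, from the timings quoted above.
timings_min = {
    "sequential CPU":          24 * 60,
    "OpenMP (quad-core)":       8 * 60,
    "HPC, 16 nodes":            1.45 * 60,
    "1 Kepler GPU":                50,
    "4 Kepler GPUs":               20,
    "4 GPUs, Jacobi-Davidson":     10,
}
baseline = timings_min["sequential CPU"]
speedups = {k: baseline / v for k, v in timings_min.items()}
```

The resulting factors — roughly 3× for OpenMP, 16.6× for the 16-node cluster, 28.8× for one GPU, 72× for four GPUs and 144× for the Jacobi-Davidson variant — summarize the cumulative gain of the work in a single table.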
Publications and Conferences
• Walter Rodrigues, A. Pecchia, M. Lopez, A. Auf der Maur, A. Di Carlo (2014),
“Accelerating atomistic calculations of quantum energy eigenstates on graphic
cards”, Computer Physics Communications Journal, Vol. 185, Issue 10, Pages
2510-2518. DOI:10.1016/j.cpc.2014.05.028
• W. Rodrigues, A. Pecchia, M. Auf der Maur, A. Di Carlo, “A multi-GPU based
approach for atomistic calculations of quantum energy eigenstates”, Poster
presentation, 17th International Workshop on Computational Electronics, June
3-6, 2014, Paris, France, Pages 145-146. ISBN:978-2-9547858-0-6
• Walter Rodrigues, M. Lopez, A. Pecchia, M. Auf der Maur, A. Di Carlo (2014),
“GPU based approach for the atomistic calculation of quantum energy eigenstates
in nanostructured system”, Proceedings of the 6th International Conference from
Scientific Computing to Computational Engineering (6th IC-SCCE), 9-12 July 2014,
Athens, Greece. ISSN:2241-8865, ISBN:978-618-80527-5-8
• W. Rodrigues, A. Pecchia, M. Auf der Maur, A. Di Carlo (2015), “A
comprehensive study of popular eigenvalue methods employed for quantum
calculation of energy eigenstates in nanostructures using GPUs”, Journal of
Computational Electronics, In Press, Published online on April 9, 2015.
DOI:10.1007/s10825-015-0695-z
• W. Rodrigues, A. Pecchia, M. Auf der Maur, D. Barettin, S. Sanguinetti, A. Di
Carlo, “Atomistic simulation of GaAs/AlGaAs quantum dot/ring nanostructures”,
Accepted to the 15th International Conference on Nanotechnology (IEEE NANO
2015), July 27-30, 2015, Rome, Italy.
• D. Barettin, M. Auf der Maur, A. Pecchia, W. Rodrigues, A. Tsatsulnikov, A.
V. Sakharov, W. V. Lundin, A. E. Nikolaev, N. Cherkashin, M. J. Hytch, S. Yu.
Karpov, A. Di Carlo, “Realistic model of LED structure with InGaN quantum-dots
active region”, Accepted to the 15th
International Conference on Nanotechnology
(IEEE NANO 2015), July 27-30, 2015, Rome, Italy.
Bibliography
[1] Martin T. Dove, An introduction to atomistic simulation methods, Seminarios de
la SEM, vol. 4, pp. 7-37.
[2] Neil W. Ashcroft and N. David Mermin (1976), Solid State Physics, Cengage
Learning, ISBN:0030839939.
[3] P. E. Turchi, A. Gonis, and L. Colombo (1998), Tight-Binding Approach to
Computational Materials Science, Materials Research Society, Warrendale, PA, Vol.
491.
[4] J. C. Slater and G. F. Koster (1954), Simplified LCAO Method for the Periodic
Potential Problem, Phys. Rev. 94, 1498.
[5] Per-Olov Löwdin (1950), On the Non-Orthogonality Problem Connected with the
Use of Atomic Wave Functions in the Theory of Molecules and Crystals, J. Chem.
Phys. 18, 365.
[6] C. Delerue, M. Lannoo, G. Allan (2001), Tight binding for complex semiconductor
systems, Physica Status Solidi (B), vol. 227 , issue 1 , pp. 115-149.
[7] J. M. Jancu, F. Bassani, F. Della Sala, and R. Scholz (2002), Transferable tight-
binding parametrization for the group-III nitrides. Appl. Phys. Lett. 81, 4838.
doi:10.1063/1.1529312.
[8] Yaohua P. Tan, Michael Povolotskyi, Tillmann Kubis, Timothy B. Boykin and
Gerhard Klimeck (2012), Generation of Empirical Tight Binding Parameters from
ab-initio simulations. Abstracts of IWCE 2012.
[9] M. Lopez, F. Sacconi, M. Auf der Maur, A. Pecchia, and A. Di Carlo (2012),
Atomistic simulation of InGaN/GaN quantum disk LEDs. Optical and Quantum
Electronics, vol. 44, issue 3, pp. 89-94. doi: 10.1007/s11082-012-9554-3.
[10] M. Lopez, M. Auf der Maur, A. Pecchia, F. Sacconi, G. Penazzi and A. Di Carlo
(2013), Simulation of Random Alloy Effects in InGaN/GaN LEDs, Numerical
Simulation of Optoelectronic Devices (NUSOD). doi:10.1109/NUSOD.2013.6633150
[11] Fabiano Oyafuso, Gerhard Klimeck, R. Chris Bowen, and Timothy B. Boykin (2002),
Atomistic electronic structure calculations of unstrained alloyed systems consisting
of a million atoms. Journal of Computational Electronics, vol. 1, issue 3, pp. 317-321.
ISSN:1569-8025. doi:10.1023/A:1020774819509.
[12] Aldo Di Carlo (2002), Tight-binding methods for transport and optical properties
in realistic nanostructures, Physica B 314, pp. 211-219.
[13] C. M. Goringe, D. R. Bowler and E. Hernández (1997), Tight-binding modelling
of materials. Rep. Prog. Phys., 60:1447-1512. doi:10.1088/0034-4885/60/12/001.
[14] Aldo Di Carlo, Paolo Lugli and Andrea Reale (1997), Modelling of semiconductor
nanostructured devices within the tight-binding approach. J. Phys.: Condens.
Matter, 11. doi:10.1088/0953-8984/11/31/311.
[15] Aldo Di Carlo (1997), Self-consistent tight-binding methods applied to
semiconductor nanostructures. volume 491, issue 1, doi:10.1557/PROC-491-389.
[16] A. Di Carlo (2003), Microscopic theory of nanostructured semiconductor
devices: beyond the envelope-function approximation. Semiconductor Science and
Technology, vol. 18 issue 1. doi: 10.1088/0268-1242/18/1/201.
[17] M. Auf der Maur, Gabriele Penazzi, Giuseppe Romano, Fabio Sacconi, A. Pecchia,
Aldo Di Carlo (2011), The Multiscale Paradigm in Electronic Device Simulation,
IEEE Transactions on Electron Devices vol. 58, issue 5, pp. 1425-1432.
[18] Suman De, Arunasish Layek, Sukanya Bhattacharya, Dibyendu Kumar Das,
Abdul Kadir, Arnab Bhattacharya, Subhabrata Dhar, and Arindam Chowdhury
(2012). Quantum-confined stark effect in localized luminescent centers within
InGaN/GaN quantum-well based light emitting diodes. Appl. Phys. Lett,
101:121919. doi:10.1063/1.4754079.
[19] G. Penazzi, A. Pecchia, F. Sacconi and A. Di Carlo (2010), Calculation of optical
properties of a quantum dot embedded in a GaN/AlGaN nanocolumn. Superlattices
and Microstructures, vol. 47, Issue 1, pp. 123-128
[20] C. Delerue and M. Lannoo (2004), Nanostructures - Theory and Modeling, Springer.
ISBN:9783662089033
[21] Matthias Auf der Maur (2008), A Multiscale Simulation Environment for Electronic
and Optoelectronic Devices., Ph.D. thesis, University of Rome Tor Vergata, Rome,
Italy.
[22] L. C. Lew Yan Voon and L. R. Ram-Mohan (1993), The tight binding representation
of the optical matrix elements: theory and applications, Physical Review B,
47:15500-15508. doi:10.1103/PhysRevB.47.15500.
[23] R. Shankar (1994), Principles of Quantum Mechanics (2nd ed.), Kluwer
Academic/Plenum Publishers. ISBN:9780306447907.
[24] Gordon E. Moore (1965), Cramming More Components onto Integrated Circuits,
Electronics, vol. 38, issue 8, pp. 114-117.
[25] Brock, C. David (2006), Understanding Moore’s law: four decades of innovation,
Philadelphia, Chemical Heritage Press. ISBN:0941901416.
[26] A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman (2007), Compilers: Principles,
Techniques, and Tools, 2nd Ed., Addison-Wesley. ISBN:9780321486813.
[27] A. Vajda (2011), Programming Many-Core Chips, Chapter 2, pp. 9-43, Springer,
ISBN:9781441997388
[28] Geoffrey Blake, Ronald G. Dreslinski, and Trevor Mudge (2009), A
Survey of Multicore Processors, IEEE Signal Processing Magazine, vol 26.
doi:10.1109/MSP.2009.934110.
[29] T.S Crow (2004), Evolution of the Graphical Processing Unit. Master’s thesis, Univ.
of Nevada, Reno.
[30] Sha’Kia Boggan and Daniel M. Pressel (2007), GPUs: An Emerging Platform for
General-Purpose Computation, Technical report, U.S. Army Research Laboratory,
Aberdeen Proving Ground, MD, USA.
[31] Kayvon Fatahalian and Mike Houston (2008), A closer look at GPUs,
Communications ACM, vol. 51 issue 10, pp. 50-57, ACM New York, NY, USA,
doi:10.1145/1400181.1400197
[32] John D. Owens, Mike Houston, David Luebke, Simon Green, John E. Stone, and
James C. Phillips (2008), GPU Computing, Proceedings of the IEEE, vol. 96, issue
5, pp. 879-899.
[33] John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Krüger, Aaron
E. Lefohn, and Tim Purcell (2007), A Survey of General-Purpose Computation on
Graphics Hardware, Computer Graphics Forum, vol. 26, issue 1, pp. 80-113.
[34] E. Lindholm, J. Nickolls, S. Oberman, J. Montrym (2008), NVIDIA Tesla: A Unified
Graphics and Computing Architecture, Micro, IEEE, vol. 28 , issue 2, pp. 39-55.
doi:10.1109/MM.2008.31
[35] Nvidia corporation (2006), NVIDIA GeForce 8800 Architecture Technical Brief,
November 2006.
[36] J. Nickolls, I. Buck, K. Skadron, and M. Garland (2008), Scalable Parallel
Programming with CUDA, ACM Queue, vol. 6, issue 2, pp. 40-53.
[37] NVIDIA corporation (2014), CUDA C PROGRAMMING GUIDE, version 6.5.
[38] NVIDIA corporation (2014), CUDA C BEST PRACTICES GUIDE, version 6.5.
[39] Kirk David B. and Hwu Wen-mei W. (2010), Programming Massively Parallel
Processors: A Hands-on Approach, Morgan Kaufmann Publishers Inc., San
Francisco, CA, USA, ISBN:0123814723, 9780123814722
[40] Sanders Jason and Kandrot Edward (2010), CUDA by Example: An Introduction
to General-Purpose GPU Programming, Addison-Wesley Professional, ISBN:
0131387685, 9780131387683
[41] Peter N. Glaskowsky (2009), NVIDIA’s Fermi: The First Complete GPU Computing
Architecture, White paper September 2009.
[42] NVIDIA corporation (2009), NVIDIA’s Next Generation CUDA Compute
Architecture: Fermi, technical report, NVIDIA 2009
[43] Matthew Murray (2012), Nvidia’s Kepler architecture: 6 things you should know,
PC, March 23, 2012.
[44] Ryan Smith (2012), NVIDIA GeForce GTX 680 Review: Retaking The Performance
Crown, AnandTech, March 22, 2012
[45] NVIDIA corporation (2012), NVIDIA’s Next Generation CUDA Compute
Architecture: Kepler GK110/210. White paper.
[46] NVIDIA corporation (2012), NVIDIA Kepler Compute Architecture Datasheet,
May 2012.
[47] Ryan Smith (2012), NVIDIA Launches Tesla K20 and K20X: GK110 Arrives At
Last, AnandTech, November 12, 2012
[48] NVIDIA corporation (2012), NVIDIA’s Next Generation CUDA Compute
Architecture: Kepler GK110, White paper
[49] Rob Farber (2008), CUDA, Supercomputing for the Masses: Part 1 , Dr. Dobb’s,
April 15, 2008.
[50] Qihang Huang, Zhiyi Huang, P. Werstein, M. Purvis (2008), GPU as a
General Purpose Computing Resource, International conference on Parallel and
Distributed Computing, Applications and Technologies, Otago, pp. 151-158.
doi:10.1109/PDCAT.2008.38
[51] David Tarditi, Sidd Puri, Jose Oglesby (2006), Accelerator: using data parallelism to
program GPUs for general-purpose uses, ACM SIGARCH Computer Architecture
News, vol. 34, issue 5.
[52] Shuai Che, Michael Boyer, Jiayuan Meng, D. Tarjan, Jeremy W. Sheaffer, Kevin
Skadron (2008), A performance study of general-purpose applications on graphics
processors using CUDA. Journal of Parallel and Distributed Computing, vol 68,
issue 10, pp. 1370-1380. doi:10.1016/j.jpdc.2008.05.014
[53] Peng Du, Rick Weber, Piotr Luszczek, Stanimire Tomov, Gregory Peterson,
Jack Dongarra (2012), From CUDA to OpenCL: Towards a performance-portable
solution for multi-platform GPU programming, Parallel Computing, vol. 38, issue
8, pp. 391-407. doi:10.1016/j.parco.2011.10.002
[54] John E. Stone, James C. Phillips, Peter L. Freddolino, David J. Hardy, Leonardo
G. Trabuco, Klaus Schulten (2007), Accelerating molecular modeling applications
with graphics processors, Journal of Computational Chemistry, vol. 28, issue 16, pp.
2618-2640. doi:10.1002/jcc.20829
[55] Joshua A. Anderson, Chris D. Lorenz, A. Travesset (2008), General Purpose
Molecular Dynamics Simulations Fully Implemented on Graphics Processing
Units, Journal of Computational Physics, vol. 227, issue 10, pp. 5342-5359.
doi:10.1016/j.jcp.2008.01.047
[56] John Paul Walters, Vidyananth Balu, Vipin Chaudhary, David Kofke, and Andrew
Schultz (2008), Accelerating molecular dynamics simulations with GPUs, In
ISCA 21st International Conference on Parallel and Distributed Computing and
Communication Systems (ISCA PDCCS), pp. 44-49, New Orleans, USA.
[57] S.B. Kylasa, H.M. Aktulga, A.Y. Grama (2014), PuReMD-GPU: A reactive
molecular dynamics simulation package for GPUs, Journal of Computational
Physics, vol. 272, pp. 343-359.
[58] Ivan S. Ufimtsev and Todd J. Martinez (2008), Graphical Processing Units
for Quantum Chemistry, Comp. Sci. Eng., vol. 10, issue 6, pp. 26-34.
doi:10.1109/MCSE.2008.148
[59] Ivan S. Ufimtsev and Todd J. Martinez (2008), Quantum Chemistry on Graphical
Processing Units. 1. Strategies for Two-Electron Integral Evaluation, J. Chem. Theo.
Comp., vol. 4, issue 2, pp. 222-231. doi:10.1021/ct700268q
[60] Mark Watson, Roberto Olivares-Amaya, Richard G. Edgar, and Alan Aspuru-Guzik
(2010), Accelerating correlated quantum chemistry calculations using graphical
processing units, Computing in Science and Engineering, vol 12, issue 4, pp. 40-
50. doi:10.1109/MCSE.2010.29
[61] Andreas W. Götz, Thorsten Wölfle, and Ross C. Walker (2010), Quantum
Chemistry on Graphics Processing Units, In Annual Reports in Computational
Chemistry, vol. 6, Elsevier B.V 2010. doi:10.1016/S1574-1400(10)06002-0
[62] M. J. Harvey, Gianni De Fabritiis (2012), A survey of computational
molecular science using graphics processing units, Wiley Interdisciplinary
Reviews: Computational Molecular Science, vol. 2, issue 5, pp. 734-742, 2012,
doi:10.1002/wcms.1101
[63] A. Dal Corso (1996), A pseudopotential plane waves program (pwscf) and some
case studies, Lecture Notes in Chemistry, vol. 67, C. Pisani editor, Springer Verlag,
Berlin, 1996.
[64] K. P. Esler, Jeongnim Kim, L. Shulenburger, D.M. Ceperley (2012), Computing in
Science and Engineering, vol.14, issue 1, pp. 40-51. doi:10.1109/MCSE.2010.122
[65] Andrea Manconi, Alessandro Orro, Emanuele Manca, Giuliano Armano, Luciano
Milanesi (2014), A tool for mapping Single Nucleotide Polymorphisms using
Graphics Processing Units, BMC Bioinformatics, vol 15, issue 1, pp. 1-13.
doi:10.1186/1471-2105-15-S1-S10
[66] Ling Sing Yung, Can Yang, Xiang Wan, Weichuan Yu (2011), GBOOST: a GPU-
based tool for detecting gene-gene interactions in genome-wide case control studies,
Bioinformatics, vol. 27, issue 9, pp. 1309-1310. doi:10.1093/bioinformatics/btr114
[67] Alhadi Bustamam, Kevin Burrage, Nicholas A. Hamilton (2012), Fast Parallel
Markov Clustering in Bioinformatics using Massively Parallel Computing on
GPU with CUDA and ELLPACK-R Sparse Format, IEEE/ACM Transactions
on Computational Biology and Bioinformatics, vol. 9, issue 3, pp. 679-692.
doi:10.1109/TCBB.2011.68
[68] Panagiotis D. Vouzis, Nikolaos V. Sahinidis (2011), GPU-BLAST: using graphics
processors to accelerate protein sequence alignment, Bioinformatics vol. 27, issue 2,
pp. 182-188. doi:10.1093/bioinformatics/btq644
[69] Guillaume Rizk, Dominique Lavenier (2009), GPU Accelerated RNA Folding
Algorithm, In Computational Science - ICCS 2009. vol. 5544 Pp. 1004-1013. Springer
Berlin/Heidelberg. doi:10.1007/978-3-642-01970-8_101
[70] Peter Huthwaite (2014), Accelerated finite element elastodynamic simulations using
the GPU, Journal of Computational Physics, vol. 257, part A, pp. 687-707
[71] R. Spurzem, P. Berczik, G. Marcus, A. Kugel, G. Lienhart, I. Berentzen, R. Männer,
R. Klessen, R. Banerjee (2009), Accelerating astrophysical particle simulations with
programmable hardware (FPGA and GPU), Computer Science - Research and
Development, vol. 23, issue 3-4, pp. 231-239. doi:10.1007/s00450-009-0081-9
[72] Spurzem Rainer, Berczik Peter, Berentzen Ingo, Ge Wei, Wang Xiaowei, Schive Hsi-
yu, Nitadori Keigo, Hamada Tsuyoshi, Fiestas Jose (2012), Accelerated Many-Core
GPU Computing for Physics and Astrophysics on Three Continents, Chapter 3,
Large-Scale Computing, John Wiley and Sons, Inc,. ISBN:9780470592441
[73] Dossay Oryspayev, Hugh Potter, Pieter Maris, Masha Sosonkina, James P. Vary,
Sven Binder, Angelo Calci, Joachim Langhammer, Robert Roth (2013), Leveraging
GPUs in Ab Initio Nuclear Physics Calculations, Parallel and Distributed Processing
Symposium Workshops and PhD Forum (IPDPSW), 2013 IEEE 27th International,
20-24 May 2013, Cambridge, MA, pp. 1365-1372. doi:10.1109/IPDPSW.2013.253
[74] Ari Harju, Topi Siro, Filippo Federici Canova, Samuli Hakala, Teemu Rantalaiho
(2013), Computational Physics on Graphics Processing Units, Applied Parallel
and Scientific Computing, Lecture Notes in Computer Science, vol. 7782, pp 3-26.
doi:10.1007/978-3-642-36803-5_1
[75] J. Kruger and R. Westermann (2003), Linear algebra operators for GPU
implementation of numerical algorithms, ACM Trans. Graph. vol. 22, issue 3, pp.
908-916.
[76] Markus Geveler, Dirk Ribbrock, Dominik Göddeke, Peter Zajac and Stefan Turek
(2013), Towards a complete FEM-based simulation toolkit on GPUs: Unstructured
grid finite element geometric multigrid solvers with strong smoothers based
on sparse approximate inverses, Computers and Fluids, vol. 80, pp. 327-332.
doi:10.1016/j.compfluid.2012.01.025
[77] Volodymyr Kindratenko (2014), Numerical Computations with GPUs, Springer
International Publishing, Switzerland, ISBN:9783319065472
[78] W. Li, Z. Fan, X. Wei, and A. Kaufman (2003), GPU-Based Flow Simulation with
Complex Boundaries, Technical Report 031105, Computer Science Department,
Suny at Stony Brook. Nov 2003.
[79] T Nagatake and T Kunugi (2010), Application of GPU to computational multiphase
fluid dynamics, IOP Conf. Series: Materials Science and Engineering, vol. 10, 012024,
doi:10.1088/1757-899X/10/1/012024
[80] Mark J. Harris (2004), Fast Fluid Dynamics Simulation on the GPU, GPU Gems,
Chapter 38.
[81] Anders Eklund, Paul Dufort, Daniel Forsberg, Stephen M. LaConte (2013), Medical
image processing on the GPU - Past, present and future, Medical Image Analysis,
vol. 17, issue 8, pp. 1073-1094. doi:10.1016/j.media.2013.05.008
[82] Pavel Karas (2010), GPU Acceleration of Image Processing Algorithms, dissertation
thesis, Centre for Biomedical Image Analysis, Faculty of Informatics, Masaryk
University.
[83] Brijmohan Daga, Avinash Bhute, Ashok Ghatol (2011), Implementation of Parallel
Image Processing Using NVIDIA GPU Framework, Advances in Computing,
Communication and Control Communications in Computer and Information
Science, vol. 125, pp. 457-464. doi:10.1007/978-3-642-18440-6_58
[84] T. Preis (2011), GPU-computing in econophysics and statistical physics,
European Physical Journal Special Topics, vol. 194, issue 1, pp. 87-119.
doi:10.1140/epjst/e2011-01398-x
[85] Scott Grauer-Gray, William Killian, Robert Searles, John Cavazos (2013),
Accelerating financial applications on the GPU, Proceedings of the 6th Workshop
on General Purpose Processor Using Graphics Processing Units, pp. 127-136, ACM
New York, USA. doi:10.1145/2458523.2458536
[86] Hawkins, T. (1975), Cauchy and the spectral theory of matrices, Historia
Mathematica, vol 2, issue 1, pp. 1-29. doi:10.1016/0315-0860(75)90032-4
[87] Morris Kline (1972), Mathematical thought from ancient to modern times, Oxford
University Press, ISBN:0195014960
[88] Richard von Mises and H. Pollaczek-Geiringer (1929), Praktische Verfahren
der Gleichungsauflösung, ZAMM - Zeitschrift für Angewandte Mathematik und
Mechanik, vol. 9, pp. 152-164.
[89] William H. Press, Saul A. Teukolsky, William T. Vetterling, Brian P.
Flannery (2007), Numerical Recipes: The Art of Scientific Computing, Chapter
11: Eigensystems, pp. 563-597. Third edition, Cambridge University Press.
ISBN:9780521880688
[90] J.G.F. Francis (1961), The QR Transformation - part 1, The Computer Journal,
vol. 4, issue 3, pp. 265-271, doi:10.1093/comjnl/4.3.265
[91] J.G.F. Francis (1962), The QR Transformation - part 2, The Computer Journal,
vol. 4, issue 4, pp. 332-345.
[92] Vera N. Kublanovskaya, On some algorithms for the solution of the complete
eigenvalue problem, USSR Computational Mathematics and Mathematical Physics,
vol. 1, issue 3, pp 637-657.
[93] G. H. Golub and C. F. Van Loan (1996), Matrix Computations, 3rd
ed., Johns
Hopkins University Press, Baltimore. ISBN:0801854148.
[94] J. J. M. Cuppen (1981), A divide and conquer method for the symmetric tridiagonal
eigenproblem, Numer. Math., vol. 36, pp. 177-195.
[95] M. Gu and S. C. Eisenstat (1994), A stable and efficient algorithm for the rank-one
modification of the symmetric eigenproblem, SIAM J. Matrix Anal. Appl., vol. 15,
pp. 1266-1276.
[96] M. Gu and S. C. Eisenstat (1995), A Divide-and-Conquer Algorithm for the
Symmetric Tridiagonal Eigenproblem, SIAM J. Matrix Anal. Appl., vol. 16, pp.
172-191, doi:10.1137/S0895479892241287
[97] G. H. Golub and H. A. van der Vorst (2000), Eigenvalue computation in the 20th
century, Journal of Computational and Applied Mathematics, vol. 123, issue 1-2,
pp. 35-65.
[98] J.W. Givens (1953), A method of computing eigenvalues and eigenvectors suggested
by classical results on symmetric matrices, U.S. Nat. Bur. Standards App. Math.,
vol. 29, pp. 117-122.
[99] J.W. Givens (1954), Numerical computation of the characteristic values of a real
symmetric matrix. Oak Ridge National Laboratory, Report: ORNL-1574.
[100] C. G. J. Jacobi (1846), Über ein leichtes Verfahren die in der Theorie der
Säcularstörungen vorkommenden Gleichungen numerisch aufzulösen. Journal für
die reine und angewandte Mathematik, vol. 30, issue 30, pp. 51-94.
[101] J. H. Wilkinson (1988), The Algebraic Eigenvalue Problem, Oxford University Press,
Inc., New York, USA. ISBN:0198534183
[102] J. W. Demmel and K. Veselic (1992), Jacobi’s method is more accurate than QR,
SIAM J. Matrix Anal. Appl., vol. 13, pp. 1204-1246.
[103] John H. Mathews and Kurtis D. Fink (2004), Numerical Methods: Using Matlab,
Fourth Edition, Prentice-Hall Pub. Inc., NJ, USA. ISBN:0130652482
[104] B.N. Parlett (1980), The Symmetric Eigenvalue Problem, Prentice-Hall Series
in Computational Mathematics, Prentice Hall, Englewood Cliffs, N.J, USA.
ISBN:0138800472
[105] W. E. Arnoldi (1951), The principle of minimized iterations in the solution of the
matrix eigenvalue problem, Quarterly of Applied Mathematics, vol. 9, pp. 17-29.
[106] Y. Saad (1992), Numerical Methods for Large Eigenvalue Problems, Halsted Press,
Div. of John Wiley and Sons, Inc., New York, USA.
[107] Y. Saad (1980), Variations of Arnoldi’s method for computing eigenelements of large
unsymmetric matrices, Linear Algebra and Its Applications, vol. 34, pp. 269-295.
[108] D. C. Sorensen (1992), Implicit application of polynomial filters in a k-step Arnoldi
method, SIAM Journal on Matrix Analysis and Applications, vol. 13, issue 1, pp.
357-385.
[109] C. Lanczos (1950), An iteration method for the solution of the eigenvalue problem
of linear differential and integral operators, J. Res. Nat’l Bur. Std. 45, pp. 225-282.
[110] G.W. Stewart (2001), Matrix Algorithms, Volume II: Eigensystems, SIAM, Chapter
5, pp. 306-367. ISBN:0470218207
[111] Jane K. Cullum and Ralph A. Willoughby (2002), Lanczos Algorithms for
Large Symmetric Eigenvalue Computations, vol. 1, SIAM, Philadelphia, USA.
ISBN:0817630589
[112] B. N. Parlett and D. S. Scott (1979), The Lanczos algorithm with selective
orthogonalization, Mathematics of Computation, vol. 33, issue 145, pp. 217-238.
[113] Chang San-Cheng (1986), Lanczos algorithm with selective reorthogonalization for
eigenvalue extraction in structural dynamic and stability analysis, Computers and
Structures vol. 23, issue 2, pp. 121-128. doi:10.1016/0045-7949(86)90206-3
[114] Andrew V., Knyazev (2001), Toward the Optimal Preconditioned Eigensolver:
Locally Optimal Block Preconditioned Conjugate Gradient Method, SIAM Journal
on Scientific Computing, vol. 23, issue 2, 517-541. doi:10.1137/S1064827500366124
[115] E. R. Davidson (1975), The Iterative Calculation of a Few of the Lowest Eigenvalues
and Corresponding Eigenvectors of Large Real Symmetric Matrices, J. Comput.
Phys., vol. 17, pp. 87-94.
[116] Z. Bai, J. Demmel, J. Dongarra, A. Ruhe, and H. van der Vorst (2000), Templates
for the Solution of Algebraic Eigenvalue Problems: A Practical Guide, SIAM,
Philadelphia, USA.
[117] E. R. Davidson (1993), Monster matrices: Their eigenvalues and eigenvectors,
Comput. Phys., vol. 7, pp. 519-522.
[118] G. L. G. Sleijpen and H. A. van der Vorst (1996), A Jacobi-Davidson iteration
method for linear eigenvalue problems, SIAM J. Matrix Anal. Appl., vol. 17, pp.
401-425.
[119] M.E. Hochstenbach, Y. Notay (2006), The Jacobi-Davidson method, GAMM
Mitteilungen, vol. 29, issue 2, pp. 368-382. ISSN:09367195
[120] P. Arbenz and M. E. Hochstenbach (2004), A Jacobi-Davidson method for solving
complex symmetric eigenvalue problems, SIAM J. Sci. Comput., vol. 25, pp. 1655-
1673. doi:10.1137/S1064827502410992
[121] T. Sakurai and H. Sugiura (2003), A projection method for generalized eigenvalue
problems, Journal of Computational and Applied Mathematics, vol. 159, issue 1,
pp. 119-128. doi:10.1016/S0377-0427(03)00565-X
[122] T. Sakurai and H. Tadano (2007), CIRR: a Rayleigh-Ritz type method with contour
integral for generalized eigenvalue problems, Hokkaido Mathematical Journal, vol.
36, pp. 745-757.
[123] E. Polizzi (2009), Density-Matrix-Based Algorithms for Solving Eigenvalue
Problems, Phys. Rev. B., vol. 79, 115112.
[124] Martin Galgon, Lukas Kramer, and Bruno Lang (2011), The FEAST algorithm for
large eigenvalue problems, PAMM. Proc. Appl. Math. Mech., vol. 11, pp. 747-748.
doi:10.1002/pamm.201110363
[125] J. H. Wilkinson, C. Reinsch (1971), Handbook for Automatic Computation, Vol.
2: Linear Algebra, Grundlehren Der Mathematischen Wissenschaften, vol. 186,
Springer-Verlag. ISBN: 978-0387054148
[126] G.L.G. Sleijpen, H.A. Van der Vorst (2000), A Jacobi-Davidson iteration method
for linear eigenvalue problems, SIAM Rev., vol. 42, pp. 267-293.
[127] R.B. Lehoucq, D.C. Sorensen, C. Yang (1998), ARPACK Users Guide: Solution
of Large-Scale Eigenvalue Problems with Implicitly Restarted Arnoldi Methods,
SIAM, Philadelphia, USA.
[128] A. Stathopoulos, J.R. McCombs (2010), PRIMME: preconditioned iterative
multimethod eigensolver methods and software description, ACM Trans. Math.
Softw. (TOMS), vol. 37, issue 2, pp. 1-30.
[129] V. Hernandez, J.E. Roman, V. Vidal (2005), SLEPc: A scalable and flexible toolkit
for the solution of eigenvalue problems, ACM Trans. Math. Softw. (TOMS), vol.
31, issue 3, pp. 351-362. Special issue on the Advanced Computational Software
(ACTS) Collection.
[130] A. Dziekonski, A. Lamecki, M. Mrozowski (2011), A memory efficient and fast sparse
matrix vector product on a GPU, Prog. Electromagn. Res., vol. 116, pp. 49-63.
[131] F. Smailbegovic, G.N. Gaydadjiev, S. Vassiliadis (2005), Sparse Matrix Storage
Format. 16th
Annual Workshop on Circuits, Systems and Signal Processing,
ProRISC 2005, Veldhoven, 17-18 November, 2005.
[132] S. Pescetelli, A. Di Carlo, P. Lugli (1997), Conduction Band Mixing in T- and
V-shaped quantum wires, Phys. Rev. B 56, 1668.
[133] G. Grosso, L. Martinelli, G. Pastori Parravicini (1995), Lanczos-type algorithm for
excited states of very-large-scale quantum systems, Phys. Rev. B 51, 13033-13038.
[134] Kapadia Nirav Harish (1994). A SIMD Sparse Matrix-Vector Multiplication
Algorithm For Computational Electromagnetics And Scattering Matrix Models.
ECE Technical Reports. http://docs.lib.purdue.edu/ecetr/200/
[135] Shameem Akhter and Jason Roberts (2006), Multi-Core Programming: Increasing
Performance through Software Multithreading, Intel Press. ISBN:0976483246,
9780976483243
[136] Kamran Karimi, Neil G. Dickson, Firas Hamze, High Performance Physics
Simulations Using Multi-Core CPUs and GPGPUs in a Volunteer Computing
Context, D-Wave Systems Inc. British Columbia Canada. http://arxiv.org/pdf/
1004.0023
[137] Nathan Bell, Michael Garland (2009), Implementing sparse matrix-vector
multiplication on throughput-oriented processors, Proceedings of the Conference on
High Performance Computing Networking, Storage and Analysis, Oregon, Portland,
14-20 November 2009.
[138] I. Reguly, M. Giles (2012), Efficient sparse matrix-vector multiplication on cache-
based GPUs, Innov. Parallel Comput. IEEE, pp. 1-12.
[139] Luciano Colombo, William Sawyer and Djordje Maric (1995), A Parallel
Implementation of Tight-Binding Molecular Dynamics Based on Reordering of
Atoms and the Lanczos Eigen-Solver, MRS Proceedings, vol. 408, pp. 107.
doi:10.1557/PROC-408-107.
[140] Luca Bergamaschi, Giorgio Pini, Flavio Sartoretto (2003), Computational
experience with sequential and parallel, preconditioned Jacobi-Davidson for large,
sparse symmetric matrices, Journal of Computational Physics, vol. 188, issue 1, pp.
318-331. doi:10.1016/S0021-9991(03)00190-6
[141] M. Camara, A. Mauger, and I. Devos (2002), Electronic structure of the layer
compounds GaSe and InSe in a tight-binding approach, Phys. Rev. B 65, 125206.
[142] Steven E. Laux (2012), Solving complex band structure problems with the FEAST
eigenvalue algorithm. Phys. Rev. B 86, 075103.
[143] Alan R. Levin, Deyin Zhang, Eric Polizzi (2012), FEAST fundamental framework
for electronic structure calculations: Reformulation and solution of the muffin-tin
problem, Computer Physics Communications, vol. 183, issue 11, pp. 2370-2375.
doi:10.1016/j.cpc.2012.06.004
[144] R. Barret, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout,
R. Pozo, C. Romine, and H. van der Vorst (1994), Templates for the Solution of
Linear Systems, Building Blocks for Iterative Methods, SIAM, Philadelphia, PA.
[145] G.L.G. Sleijpen, J.G.L. Booten, D.R. Fokkema, and H.A. Van der Vorst (1996),
Jacobi-Davidson type methods for generalized eigenproblems and polynomial
eigenproblems, BIT 36, pp. 595-633.
[146] M.E. Hochstenbach, G.L.G. Sleijpen (2008), Harmonic and refined Rayleigh-Ritz for
the polynomial eigenvalue problem, Numerical Linear Algebra with Applications,
vol. 15, issue 1, pp. 35-54.
[147] Y. Saad (2003), Iterative Methods for Sparse Linear Systems, 2nd edition, Society
for Industrial and Applied Mathematics. ISBN:9780898715347
[148] Y. Saad and M.H. Schultz (1986), GMRES: A generalized minimal residual
algorithm for solving nonsymmetric linear systems, SIAM J. Sci. Stat. Comput.,
7, pp. 856-869. doi:10.1137/0907058
[149] E. Polizzi (2012), A High-Performance Numerical Library for Solving Eigenvalue
Problems, FEAST solver User’s guide. arxiv.org/abs/1203.4031
[150] D. R. Fokkema, G.L.G. Sleijpen, H. A. Van der Vorst (1996), Generalized conjugate
gradient squared, Journal of Computational and Applied Mathematics, vol. 71, pp.
125-146.
[151] Michele Benzi (2002), Preconditioning techniques for large linear systems, A Survey,
Journal of Computational Physics, vol. 182, pp. 418-477.
[152] Stefano Sanguinetti, Claudio Somaschini, Sergio Bietti and Nobuyuki Koguchi
(2011), Complex Nanostructures by Pulsed Droplet Epitaxy, Nanomaterials and
Nanotechnology, vol. 1, issue 1, pp. 14-17.
[153] Daniele Barettin, Matthias Auf der Maur, Alessandro Pecchia, Walter Rodrigues
et al. (2015), Realistic model of LED structure with InGaN quantum-dots active
region, abstract submitted to International IEEE Conference on Nanotechnology
(IEEE NANO 2015), Rome, Italy.
[154] R. M. Camacho, M. V. Pack, J. C. Howell, A. Schweinsberg, and R. W. Boyd (2007),
Wide-Bandwidth, Tunable, Multiple-Pulse-Width Optical Delays Using Slow Light
in Cesium Vapor, Phys. Rev. Lett., 98 (15), pp. 153601.
[155] Wen-Hsuan Kuan, Chi-Shung Tang and Cheng-Hung Chang (2007), Spectral
properties and magneto-optical excitations in semiconductor double rings under
Rashba spin-orbit, Phys. Rev. B, vol. 75, issue 15, pp. 155326.
[156] Luis G. G. V. Dias da Silva, José M. Villas-Bôas and Sergio E. Ulloa (2007),
Tunneling and optical control in quantum ring molecules, Phys. Rev. B, vol. 76,
issue 15, pp. 155306.
[157] F. Carreño, M. A. Antón, Sonia Melle, Oscar G. Calderón, E. Cabrera-Granado,
Joel Cox, Mahi R. Singh and A. Egatz-Gómez (2014), Plasmon-enhanced terahertz
emission in self-assembled quantum dots by femtosecond pulses, J. Appl. Phys., vol.
115, issue 6, pp. 064304.
[158] Gwyddion - Free SPM (AFM, SNOM/NSOM, STM, MFM) data analysis software,
http://gwyddion.net/
[159] D. Barettin, R. De Angelis, P. Prosposito, M. Auf der Maur, M. Casalboni,
A. Pecchia (2014), Model of a realistic InP surface quantum dot extrapolated
from atomic force microscopy results. Nanotechnology, vol. 25, issue 19, 195201.
doi:10.1088/0957-4484/25/19/195201
[160] F. Sacconi, M. Auf der Maur, A. Di Carlo (2012), Optoelectronic Properties of
Nanocolumn InGaN/GaN LEDs. Electron Devices, IEEE Transac, vol. 59, issue 11,
pp. 2979-2987. doi:10.1109/TED.2012.2210897.
[161] C. Bocklin, R. G. Veprek, S. Steiger and B. Witzigmann (2010), Computational
study of an InGaN/GaN nanocolumn light-emitting diode. Phys. Rev. B, 81, 155306.
doi:10.1103/PhysRevB.81.155306.
Abbreviations
AlGaN . . . . . . . . . . . . . Aluminium Gallium Nitride
AlGaAs . . . . . . . . . . . . Aluminium Gallium Arsenide
CPU . . . . . . . . . . . . . . . Central Processing Unit
CUDA . . . . . . . . . . . . . Compute Unified Device Architecture
CAD . . . . . . . . . . . . . . . Computer-Aided Design
CB . . . . . . . . . . . . . . . . . Conduction Band
CSR . . . . . . . . . . . . . . . Compressed Sparse Row
CGS . . . . . . . . . . . . . . . Conjugate Gradient Squared Method
DMA . . . . . . . . . . . . . . Direct Memory Access
DFT . . . . . . . . . . . . . . . Density Functional Theory
ETB . . . . . . . . . . . . . . . Empirical Tight Binding
Eg . . . . . . . . . . . . . . . . . . Energy gap
FMA . . . . . . . . . . . . . . . Fused Multiply Add
GaN . . . . . . . . . . . . . . . Gallium Nitride
GaAs . . . . . . . . . . . . . . Gallium Arsenide
GPU . . . . . . . . . . . . . . . Graphics Processing Unit
GMRES . . . . . . . . . . . Generalized Minimal Residual Method
H . . . . . . . . . . . . . . . . . . . Hamiltonian matrix
HPC . . . . . . . . . . . . . . . High Performance Computing
InGaN . . . . . . . . . . . . . Indium Gallium Nitride
InN . . . . . . . . . . . . . . . . Indium Nitride
ILU . . . . . . . . . . . . . . . . Incomplete LU
JD . . . . . . . . . . . . . . . . . Jacobi-Davidson
LED . . . . . . . . . . . . . . . Light Emitting Diode
LCAO . . . . . . . . . . . . . Linear Combination of Atomic Orbitals
MP . . . . . . . . . . . . . . . . Multi-Processing
MPI . . . . . . . . . . . . . . . . Message Passing Interface
MIMD . . . . . . . . . . . . . Multiple Instruction Multiple Data
MOI . . . . . . . . . . . . . . . Memory Optimized Implementation
OpenMP . . . . . . . . . . Open Multi-Processing
SMX . . . . . . . . . . . . . . . Next-generation Streaming Multiprocessor
SM . . . . . . . . . . . . . . . . . Streaming Multiprocessor
SPD . . . . . . . . . . . . . . . . Spatial Probability Density
SFU . . . . . . . . . . . . . . . . Special Function Unit
SIMD . . . . . . . . . . . . . . Single Instruction Multiple Data
spMV . . . . . . . . . . . . . . Sparse Matrix-Vector Multiplication
TB . . . . . . . . . . . . . . . . . Tight-Binding
VCA . . . . . . . . . . . . . . . Virtual Crystal Approximation
VB . . . . . . . . . . . . . . . . . Valence Band
List of Figures
2.1 Schematic comparison of CPU and GPU structure (Source: NVIDIA) . . . 16
2.2 Full chip block diagram of Kepler microarchitecture based GPU (Source:
NVIDIA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 Architectural overview of next-generation streaming multiprocessor (SMX)
within Kepler microarchitecture (Source: NVIDIA) . . . . . . . . . . . . . 20
2.4 Warp scheduler within next-generation streaming multiprocessors (Source:
NVIDIA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5 Kepler GPU memory hierarchy (Source: NVIDIA) . . . . . . . . . . . . . . 22
2.6 Direct Peer-to-Peer data transfer between two GPUs using GPUDirect
(Source: NVIDIA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.7 (Left) Grid of thread blocks (Source: NVIDIA). (Right) CUDA execution
model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.1 Conical wurtzite GaN/AlGaN quantum dot with 30% Al. Atomistic
description: In yellow Aluminium, in red Gallium. . . . . . . . . . . . . . . 41
4.2 Performance of spMV operation on GPU employing different data types . . 48
4.3 (Left) Typical sparsity pattern of a TB Hamiltonian and partitioning over
four nodes. (Right) Data exchanged between adjacent nodes . . . . . . . . 49
4.4 Memory utilization by TB Hamiltonian matrix on GPU . . . . . . . . . . . 52
4.5 Time comparison of Lanczos iteration using MPI-OpenMP on an HPC
cluster connected via InfiniBand . . . . . . . . . . . . . . . . . . . . . . . . 54
4.6 Time taken per Lanczos iteration for different implementations and
technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.7 Performance comparison for the Lanczos iteration between different
implementations and technologies . . . . . . . . . . . . . . . . . . . . . . . 55
4.8 Speed comparison for spMV between implementations on each of the
technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.1 (Left) Cubical wurtzite GaN/AlGaN quantum dot showing the core with
30% Aluminum. (Right) A central slice of the cube. Atomistic description:
in yellow Aluminum, in red Gallium . . . . . . . . . . . . . . . . . . . . . . 64
5.2 Time comparison between methods on 1 Kepler GPU for the calculation
of 8 energy eigenstates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.3 Time comparison between methods on 4 Kepler GPUs for the calculation
of 8 energy eigenstates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.4 Scaling of Lanczos method on 1 to 4 GPUs . . . . . . . . . . . . . . . . . . 68
5.5 Scaling of Jacobi-Davidson (subspace in host memory) method on 1 to 4
GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.6 Scaling of FEAST method on 1 to 4 GPUs . . . . . . . . . . . . . . . . . . 69
5.7 Percentage of time taken for memory and compute operations on (Left) 1
GPU and (Right) 4 GPUs respectively . . . . . . . . . . . . . . . . . . . . 70
5.8 Memory consumption between methods on 1 GPU . . . . . . . . . . . . . . 72
5.9 Memory consumption between methods on 4 GPUs . . . . . . . . . . . . . 73
5.10 Time performance comparison between Lanczos, Jacobi-Davidson and
FEAST methods on 4, 8, 16 and 32 nodes of the HPC cluster for the
calculation of 8 energy eigenstates . . . . . . . . . . . . . . . . . . . . . . . 74
5.11 Scaling of Lanczos method on 4, 8, 16 and 32 nodes of the HPC cluster . . 75
5.12 Scaling of Jacobi-Davidson (subspace in host memory) method on 4, 8, 16
and 32 nodes of the HPC cluster . . . . . . . . . . . . . . . . . . . . . . . . 75
5.13 Scaling of FEAST method on 4, 8, 16 and 32 nodes of the HPC cluster . . 76
6.1 Atomic force microscope images of GaAs/Al0.3Ga0.7As complex quantum
dot/ring nanostructure (Source: Sanguinetti (2011)) . . . . . . . . . . . . . 79
6.2 (Below) Lateral view, (Above) Top view: Geometry of dot/ring complex
nanostructure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.3 Partly sliced GaAs/Al0.3Ga0.7As complex quantum dot/ring nanostructure
with 30% Al, 70% Ga. Atomistic description: in Pink Aluminum, in Blue
Gallium . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.4 Electron states using ETB methods for varying radius of the quantum dot
while the rest of the geometry of the complex nanostructure is kept fixed . 81
6.5 SPD for first 8 electron states using ETB method for the quantum dot
with radius = 8 nm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.6 Evolution of eigenenergies with quantum dot radius. The lines connect
states which have been identified to have the same wave function symmetry. 82
6.7 Probability density for lambda states in quantum dot with radius = 6.2
nm, overlapping between states B, C and H . . . . . . . . . . . . . . . . . 83
6.8 Probability density for lambda states in quantum dot with radius = 6.5
nm, overlapping between (Left) states B and F and (Right) states C and E 83
6.9 InGaN quantum dot with varying content of Indium derived from
experimental high-resolution transmission electron microscopy . . . . . . . 84
6.10 A central slice of InGaN quantum dot with 19% Indium randomly
distributed. Atomistic description: in Red Indium, in White Gallium . . . . 85
6.11 InGaN quantum dot with uniform content of Indium. Description: in Red
19% Indium, in Blue 0% Indium . . . . . . . . . . . . . . . . . . . . . . . . 85
6.12 Electronic ground states obtained from ETB calculation of InGaN quantum
dot with random Indium content . . . . . . . . . . . . . . . . . . . . . . . 85
6.13 Electronic ground states obtained from ETB calculation of InGaN quantum
dot with uniform Indium content . . . . . . . . . . . . . . . . . . . . . . . 86
7.1 Performance of Lanczos implementation benchmarked on different
technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
7.2 Performance of Lanczos, Jacobi-Davidson (JD) and FEAST
implementation benchmarked on different technologies . . . . . . . . . . . 90
List of Tables
3.1 Detailed list of available software packages for large-scale eigenproblems . . 38
4.1 Results for energy eigenstate calculation using CUDA on Nvidia Kepler
K20c GPU (Test system 1) . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2 Results for energy eigenstate calculation using MPI-CUDA implementation
running on two Nvidia Kepler K20c GPUs (Test system 1) . . . . . . . . . 53
4.3 Results for energy eigenstate calculations using MPI-OpenMP (Test system
2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.1 Profiler output for 151,472 atom quantum dot, listing the most significant
compute operations within Jacobi-Davidson method with subspace stored
in host memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.2 Profiler output for 151,472 atom quantum dot, listing the most significant
compute operations within Lanczos method . . . . . . . . . . . . . . . . . 71
5.3 Profiler output for 151,472 atom quantum dot, listing the most significant
compute operations within the CGS method (linear solver for FEAST) . . 71
OLABs: Optoelectronics & Nanoelectronics Laboratory
Printed in Rome, Italy
May 2015

Thesis_Walter_PhD_final_updated

  • 1.
    UNIVERSIT`A DEGLI STUDIDI ROMA “TOR VERGATA” DOTTORATO DI RICERCA IN INGEGNERIA DELLE TELECOMUNICAZIONI E MICROELETTRONICA CICLO XXVII GPU ACCELERATION OF ATOMISTIC SIMULATION OF NANOSTRUCTURED DEVICES Ph.D. Candidate: Walter Jesuslee Savio Rodrigues Anno di Esame: 2015 Dipartimento di Ingegneria Elettronica Ph.D. Tutor: Prof. Dr. Aldo Di Carlo Ph.D. Coordinator: Prof. Dr. Aldo Di Carlo
  • 2.
    UNIVERSIT`A DEGLI STUDIDI ROMA “TOR VERGATA” DOCTOR OF PHILOSOPHY IN TELECOMMUNICATION AND MICROELECTRONICS ENGINEERING CYCLE XXVII GPU ACCELERATION OF ATOMISTIC SIMULATION OF NANOSTRUCTURED DEVICES Ph.D. Candidate: Walter Jesuslee Savio Rodrigues Year of Ph.D. Dissertation Defense: 2015 Department of Electronics Engineering Ph.D. Advisor: Prof. Dr. Aldo Di Carlo Ph.D. Coordinator: Prof. Dr. Aldo Di Carlo
  • 3.
    OLABs: Optoelectronics &Nanoelectronics Laboratory GPU Acceleration of Atomistic Simulation of Nanostructured Devices Walter Jesuslee Savio Rodrigues May, 2015 Ph.D. in Telecommunication and Microelectronics Engineering Program - XXVII Cycle Optoelectronics & Nanoelectronics Laboratory Simulation & Theoretical Research Group Department of Electronics Engineering Engineering Faculty University of Rome Tor Vergata Via del Politecnico 1, 00133, Rome, Italy Phone + 39 (0)6 7259 7939 www.optolab.uniroma2.it
  • 4.
    Acknowledgment I would liketo express my sincere gratitude to my advisor Prof. Aldo Di Carlo for the continuous support during my Ph.D. studies. His motivation and enthusiasm has helped me to keep going till this point. I would like to thank Dr. Alessandro Pecchia, Dr. Matthias Auf der Maur and Dr. Daniele Barettin for patiently sharing their immense knowledge with me and guiding me throughout my research. I thank all my fellow colleagues Giacomo, Francesco, Claudio, Antonio, Marco, Amir, Corrado, Babak, Matteo P., Andrea R., Thomas B., Francesca B., Matteo G., Lucio, Monica, Elisa, Giorgia, Fabio S., and Desi for welcoming me into the group and for all their love and support that I have received over the last three years. Last but not the least, thanks to all my friends that have made my stay in Rome a memorable one and my wife, Jasmine, for her love, support and patience throughout my Ph.D. studies. 2
  • 5.
    Abstract Numerical simulation ofmaterials and devices at the atomistic level plays an important role in advancing science and guiding device fabrications. Also, it plays an increasing role in explaining experimental findings and studying micro and macro systems at a level that may otherwise not be physically possible. Nowadays, many high-ended sophisticated computational tools are available to scientists that can accelerate innovation and lead to low cost advancements and device optimizations. This also enables the domain experts to move their focus to areas of expertise and help solve key issues that, once resolved, lead to major scientific breakthroughs. The progress in the field of numerical simulations began with the enormous advancements in computing technology that revolutionized the world three decades ago. Today, larger and faster computing systems are widely accessible. Supercomputers and high-ended, expensive, computationally powerful computing systems are being utilized to speedup numerical calculations. However, many times these improvements in technology have not translated into equivalent productivity. Till date, many computational scientists still employ outdated tools and algorithmic implementations; thereby, spending unnecessary time waiting for results. The advent of graphics processing unit (GPU) has grasped the attention of the scientific computation community with its huge number of computing engines. The work reported here is specifically to help computational scientists and nanoelectronic’s domain experts to develop tools that take advantage of modern improvements in computing technology. Atomistic simulation of nanostructured devices often requires the simulation of systems with an irreducibly-large number of atoms. However, large-scale atomistic calculations such as those based on empirical tight binding (ETB) approach reported 3
  • 6.
    here, must facethe computational obstacle for the diagonalization of the Hamiltonian matrix needed for the calculation of eigenvalues and eigenvectors. This bottleneck can be overcome by parallel computing techniques or the introduction of faster algorithms. Recent advancements have enabled the construction of massively parallel codes and O(N) computational schemes. Nevertheless, such codes require large high performance computing (HPC) facilities to run; thereby, reducing the accessibility to a wider range of users. This work has been motivated by the lack of specialized eigensolvers for large-scale computations on GPUs. Developing algorithms that can ideally scale over GPUs is an important component for transferring the hardware feature into actual beneficial speedups. In recent times, there has been an extensive effort being put in translating algorithms initially designed for sequential processors. However, many aspects need to be considered to result in speedups while dealing with GPU or other parallel computing technologies. Hence, often this sequential to parallel transition is not straight forward and requires deeper understanding of the system’s architecture and algorithms itself. In this work, significance is also placed on addressing some basic problems that hinder the development of efficient eigensolvers on GPU: first, the choice of the algorithm itself. I demonstrate how to overcome the problem of compute versus communication gap that exists in GPUs and have also established ways to resolve the computational and memory related bottlenecks. Also, multi-GPU implementations that scales with GPUs are presented, resulting in eigensolvers that accelerates efficiently large-scale tight binding calculations. However, there are several methods that can be used to calculate the needed energy eigenstates. Given the variety of possible methods it is still unclear which one is more suited and how their performance compares in a given scenario. 
Hence, I concentrate on the GPU implementation of three different methods that are common among peers in the electronic computational domain. An analysis for timing, memory occupancy and convergence on a multi-GPU system is performed. Finally, realistic applications of GPU accelerated atomistic simulations will be presented. ETB calculation of quantum heterostructures derived from experimental results will be performed using GPU showing that the performance of the solvers employed for the atomistic simulation of nanostructured devices can be considerably enhanced using GPUs. 4
  • 7.
    Preface The work outlinedin this dissertation was carried out in the Department of Electronics Engineering, University of Rome Tor Vergata, over the period from January 2012 to April 2015. This dissertation is the result of my work and includes a small part which is the outcome of the work done in collaboration. The material included in this dissertation has not been submitted for a degree or diploma or any other qualification at any other university. This work has been divided into seven parts. The first chapter introduces the Tight binding model and outlines the motivation for this research work. The second chapter briefly describes the hardware architecture and the CUDA programming model for GPU. A review and survey of eigensolver methods are presented in chapter three. Chapter four and five details the design and benchmarking of GPU based eigensolvers for atomistic simulation. The sixth chapter presents real applications of the research work carried out and the last chapter is the conclusions. 5
  • 8.
    Contents Acknowledgment 2 Abstract 3 Preface5 Contents 6 1 Introduction to tight binding model and its computational challenges 7 1.1 Empirical tight binding model . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.2 Mathematical formulation for empirical tight binding model . . . . . . . . 10 1.3 Schr¨odinger equation and the eigenvalue problem . . . . . . . . . . . . . . 11 1.4 Computational challenges of empirical tight binding method . . . . . . . . 11 1.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2 Introduction to GPU and general purpose GPU computing 14 2.1 Towards an unified graphics computing architecture . . . . . . . . . . . . . 17 2.2 Architectural overview of the Tesla Kepler GPU . . . . . . . . . . . . . . . 18 2.2.1 Next-generation streaming multiprocessor . . . . . . . . . . . . . . 19 2.2.2 Instruction scheduler . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.2.3 Memory model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.2.4 Advance features . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.3 CUDA programming model . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.4 General-purpose computing on graphics processing units . . . . . . . . . . 26 2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 6
  • 9.
    3 Introduction toEigensolvers 29 3.1 Direct methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.1.1 QR algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.1.2 Divide-and-conquer method . . . . . . . . . . . . . . . . . . . . . . 31 3.1.3 Bisection method and inverse iteration . . . . . . . . . . . . . . . . 32 3.1.4 Jacobi method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.2 Iterative methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.2.1 Power iteration method . . . . . . . . . . . . . . . . . . . . . . . . 33 3.2.2 Rayleigh quotient iteration method (RQI) . . . . . . . . . . . . . . 33 3.2.3 Arnoldi method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.2.4 Lanczos method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.2.5 Locally optimal block preconditioned conjugate gradient method (LOBPCG) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.2.6 Davidson method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.2.7 Jacobi-Davidson method . . . . . . . . . . . . . . . . . . . . . . . . 36 3.2.8 Contour integral spectral slicing . . . . . . . . . . . . . . . . . . . . 36 3.2.9 FEAST method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.3 Survey of available software packages for eigenproblems . . . . . . . . . . . 37 3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 4 Design of GPU based eigensolver for atomistic simulation 40 4.1 Lanczos method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 4.2 Implementation and optimization strategies for parallel eigensolvers . . . . 43 4.2.1 MPI-OpenMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.2.2 MPI-CUDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.2.3 Performance enhancement via communication cost reduction . . . . 
46 4.2.4 Memory optimization by Splitting approach . . . . . . . . . . . . . 46 4.2.5 Mix real-complex CUDA kernel . . . . . . . . . . . . . . . . . . . . 47 4.2.6 Performance enhancement using the Overlap technique . . . . . . . 49 4.2.7 CUDA-aware MPI . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.3 Benchmarking the Lanczos method . . . . . . . . . . . . . . . . . . . . . . 50 4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 7
  • 10.
    5 GPU focusedcomprehensive study of popular eigenvalue methods 58 5.1 GPU based implementations of popular eigenvalue methods . . . . . . . . 59 5.1.1 Jacobi-Davidson method . . . . . . . . . . . . . . . . . . . . . . . . 60 5.1.2 FEAST method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 5.2 Benchmarking results, comparison and discussion . . . . . . . . . . . . . . 64 5.2.1 Eigensolver evaluation on a Multi-GPU workstation . . . . . . . . . 66 5.2.2 Eigensolver evaluation on a HPC cluster . . . . . . . . . . . . . . . 73 5.2.3 Performance comparison between GPU and HPC cluster . . . . . . 76 5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 6 Application of GPU accelerated atomistic simulations 78 6.1 Atomistic simulation of complex quantum dot/ring nanostructure . . . . . 78 6.2 Atomistic simulation of InGaN quantum dot with Indium fluctuation . . . 84 6.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 7 Conclusion 87 Publications and Conferences 92 Bibliography 94 Abbreviations 111 List of Figures 113 List of Tables 114 8
  • 11.
    Chapter 1 Introduction totight binding model and its computational challenges The birth of the use of computer simulations occurred around couple of decades ago, but their impact in modern science has exactly mirrored the exponential growth in the power of computers. In recent times, almost all fields of sciences have seen an explosion of the use of computer simulations to the point where computational methods now stand alongside with theoretical and experimental methods in value [1]. In turn, the growing power of computers have spurred the development of methods and scientific software packages, widening the potential of simulations to tackle a wide range of scientific issues and placing sophisticated tools in the hands of a wider group of scientists. Atomistic simulations are playing an increasingly important role in realistic, scientific and industry applications in many areas including advance material design, nanotechnology, modern chemistry and semiconductor research. Atomistic simulation is the theoretical and computational modeling of what happens at the atomic scale in solids, liquids, molecules and plasmas. Often, this means solving numerically the classical or quantum-mechanical microscopic equations for the motion of interacting atoms, or even deeper electrons and nuclei. Atomistic simulation is used to interpret existing experimental data and predict new phenomena, to reach computationally where simple theory alone cannot and to provide a way forward where experiments are not yet possible. The predictive capability of these simulation approaches hinges on the accuracy of the model used to describe atomic interaction. Modern models are optimized 9
  • 12.
    to reproduce experimentalvalues and electronic structure estimates for the forces and energies of representative atomic configuration deemed important for the problem of interest. Most solid-state applications are now making heavy use of density functional theory (DFT) which has proved to be extremely successful in studying structural properties and electronic states of materials from which formation energies, phase stability and thermodynamic properties can be understood or even predicted. Many particle corrections can be introduced as a perturbation, allowing also the exploration of optical properties. Localized basis approaches like the Gaussian orbitals, wavelets or the augmented-plane wave methods are used for calculating the electronic band structure of solids allowing the prediction of many important properties [2]. All these methods involve the development of quite complicated computer codes. Limited computational resources, however, impose restrictions on both the system size and the level of theory that can be used to calculate interaction between electrons and ions. In order to overcome these limitations, more approximate methods have been developed and advance optimization tactics either theoretical or practical are widely welcomed. 1.1 Empirical tight binding model The model name “tight binding” suggests that it describes the properties of tightly bound electrons in solids. The electrons in this model are considered to be tightly bound to the atom to which they belong and they have limited interaction with states and potentials of surrounding atoms. As a result, the wave function of the electron is rather similar to the atomic orbital of the free atom to which it belongs. The energy of the electron is close to the ionization energy of the electron in the free atom or ion because the interaction with the potentials and states of neighboring atoms is limited. 
The tight binding (TB) approach to electronic structure is one of the most used methods in solid state systems [3]. The empirical tight binding (ETB) method, which dates back to the work of Slater and Koster [4] assumes mostly two-center approximation and the matrix elements of the Hamiltonian between orthogonal and atom-centered orbitals [5] are treated as parameters fitted to experiment or first-principles calculations. ETB is widely employed for the description of electronic structure of complex systems [6] like interfaces 10
  • 13.
    and defects incrystals, amorphous materials, nanoclusters, and quantum dots because it is computationally efficient and provides physically transparent results. Indeed this technique requires a relatively small number of parameters which are fitted to accurately reproduce a given set of experimental data. As stated, ETB considers a system where electrons are bound to atoms and the perturbation produced from the linear combination of atomic orbitals (LCAO) [4, 16] (e.g. sp3 , sp3 d5 , etc). ETB employs an implicit basis composed of the localized atomic-like orbitals in order to describe the band structure, but do not involve the direct computation of inter-atomic overlaps. Consequently, many authors define ETB as a formal expression over Wannier function. The Hamiltonian matrix elements are typically obtained empirically from fits to more accurate calculations, experiments or derived from first-principles expressions [7,8]. The ETB method used for calculations of particles state of atomistic systems [9, 10] is generally less accurate and less transferable than methods based on DFT, where the Hamiltonian is computed from explicit wave functions, but it does provide a good alternative for simulating systems of larger size [11] and over longer time scales than are currently tractable using first-principles methods. In fact, the ETB is the model of choice for atomistic description of the electronic properties of nanostructured devices [12–15]. According to the macroscopic device description and crystallographic orientation, the atomistic structure needed for ETB calculations is generated internally in TiberCAD, a multiscale CAD tool for the simulation of modern nanoelectronics and optoelectronics devices [17]. The atomistic structure is deformed based on the strain calculations obtained from a continuous media elasticity model by projecting the deformation field onto the atomic positions [18]. 
In order to couple the atomistic calculation of electronic states with the continuous media model for particle transport, the macroscopic electrostatic potential calculated with the Poisson/drift-diffusion model has been projected onto the atomic positions in a multiscale fashion [19]. The solution of the eigenvalue problem resulting from the ETB provides the quantum energy eigenstates and consequently the charge density. An ETB model based on a sp3 d5 s∗ + spin-orbital parametrization has been applied in this work [7]. 11
  • 14.
    1.2 Mathematical formulationfor empirical tight binding model ETB describes the system Hamiltonian (H) taking the linear combination of localized orbitals centered on each atom position [20]. The function |Ψ = α,R Cα(R)|α, R (1.1) represents standing waves or atomic orbitals. Which is necessary to find an approximation of the eigenenergies and a set of expansion coefficients Cα [21]. In the quantum atomistic approach, the energy levels, , of the stationary states can be seen as the eigenvalues of the matrix H, H|Ψ = |Ψ (1.2) which is the time-independent Schr¨odinger equation. ETB, widely explained elsewhere, determines the energy of H in terms of energy levels by solving the secular equation det|H − I| = 0 (1.3) where I is the overlap matrix elements which reduces to unit matrix when neglecting inter-atomic overlaps [20] and are the energy levels (eigenvalues). The matrix H in equation 1.2 for the sp3 d5 s∗ parametrization used here [7] includes the spin-orbit interactions forming a block matrix of 20×20 for each atom. In later chapters we shall see at length methods to solve similar equations efficiently. The solution of the eigenvalue problem defined in equation 1.2 provides the quantum energy eigenstates which gives the charge density and allows the prediction of many other important properties of the system. 12
  • 15.
    1.3 Schr¨odinger equationand the eigenvalue problem The wavefunction for a given physical system contains the measurable information about the system. To obtain specific values for physical parameters, for example energy eigenstates, one operates on the wavefunction with the quantum mechanical operator associated with that parameter. The operator associated with energy is the Hamiltonian and the operation on the wavefunction is the Schr¨odinger equation as given in equation 1.2. Thus, the time-independent Schr¨odinger equation in a linear algebra terminology is an eigenvalue equation for the Hamiltonian operator [23] which is explained in more detail in Chapter 3. Solutions exist for the time-independent Schr¨odinger equation only for certain values of energy and these values are called “eigenvalues” of energy. The band energy states form a discrete spectrum of values, physically interpreted as quantization. Corresponding to each eigenvalue is an “eigenfunction”. More specifically, the energy eigenstates form a basis. The solution to the Schr¨odinger equation for a given energy i involves also finding the specific function |Ψi which describes that energy state. Any wavefunction may be written as a sum over the discrete energy states or an integral over continuous energy states, or more generally as an integral over a measure. 1.4 Computational challenges of empirical tight binding method The pursuit for ever higher levels of detail and realism in nanoelectronics simulations presents formidable modeling and computational challenges. Over the last two decades, available computer power has grown as well as the size of system that can be considered employing the TB method has also grown. As the nanostructure systems become larger, however, the issue of scaling becomes crucial. The number of computational operations required to diagonalize a matrix is proportional to the cube of the number of basis functions, and thus to the number of atoms. 
This behavior is referred to as O(N³) scaling. As a result, a thousand-fold increase in computer power only buys a ten-fold
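The cubic-root relation between compute power and tractable system size can be made explicit with one line of arithmetic:

```python
# Dense diagonalization costs O(N^3) operations in the number of basis
# functions N, so a p-fold increase in compute power buys only a
# p**(1/3)-fold increase in tractable system size.
def size_gain(power_gain, exponent=3):
    return power_gain ** (1.0 / exponent)

print(round(size_gain(1000)))  # 1000x more compute -> only ~10x more atoms
```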
increase in system size. The O(N³) scaling of the H matrix diagonalization limits the number of atoms in the system to a few hundred thousand, whereas realistic nanostructures fabricated in the lab are around 30 nm in size, comprising ≈ 1 million atoms. In III-V semiconductors every atom has 4 nearest neighbors, and since the sp3d5s∗ + spin-orbit parametrization used here is based on 20 orbitals per atom, the H matrix dimension is 20 times the number of atoms, with an average of 40 non-zero values per row. The spin-orbit coupling adds an imaginary component to the H matrix, doubling the problem size. The ETB method is implemented in double precision arithmetic to ensure highly accurate solutions and faster convergence. Since H is a Hermitian matrix, each non-zero value takes 16 bytes of memory (double-complex data type), so the total memory needed just for the H matrix generated from a realistic nanostructure exceeds what is available on most workstations. Consequently, such codes require large high performance computing (HPC) facilities to run, reducing their accessibility to a wider range of users. Limited computational resources thus impose restrictions on the system size, or force one to introduce further approximations in the level of theory. Efforts are constantly made to reduce the computational cost in terms of both run-time and memory.

The significant challenges posed by large-scale ETB calculations have been addressed in this work through the development of new HPC strategies for numerical algorithms and their implementation on parallel architectures. A specialized implementation that spares memory and minimizes machine-to-machine data transfers has been developed. Furthermore, in order to study bigger, realistic nanostructured systems, a parallel distributed approach using the standard message passing interface (MPI) is employed.
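A back-of-the-envelope estimate, using only the figures quoted above (20 orbitals per atom, ≈ 40 non-zeros per row, 16 bytes per double-complex value, ≈ 1 million atoms), shows why the sparse H matrix alone outgrows a typical workstation:

```python
# Rough memory footprint of the sparse ETB Hamiltonian, using the
# figures quoted in the text (assumptions, not measured values).
atoms = 1_000_000          # ~30 nm III-V nanostructure
orbitals_per_atom = 20
nnz_per_row = 40           # average non-zeros per row
bytes_per_value = 16       # double-complex

rows = atoms * orbitals_per_atom
h_bytes = rows * nnz_per_row * bytes_per_value
print(h_bytes / 2**30)     # ~11.9 GiB for the matrix values alone
```

Index arrays and eigenvector workspace come on top of this, which is consistent with the need for HPC facilities stated above.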
1.5 Summary

The ETB model presented here is the model of choice for an atomistic description of the electronic properties of nanostructured devices, despite being less accurate and less transferable than methods based on DFT. The nearest-neighbor ETB parametrization given by Jancu has been used: despite the enormous storage cost of the resulting H matrix representation, the ETB model is indeed the best approximation of the energy functions of III-V semiconductors. However, large-scale atomistic calculations
involving the ETB approach must face the computational obstacle of diagonalizing the TB Hamiltonian matrix. This bottleneck can be overcome by the parallel computing techniques and faster algorithms reported in this work.
Chapter 2 Introduction to GPU and general purpose GPU computing

In 1965, Gordon E. Moore made the interesting observation that the number of transistors in a dense integrated circuit would double approximately every two years [24, 25]. His prediction has proven accurate and is termed "Moore's law." The exponential increase in the number of transistors on a chip has dramatically enhanced the impact of digital electronics in nearly every segment of life. In the last few decades, microprocessor performance has drastically increased as a result of many related advances, such as increased transistor density, increased transistor performance, wider data paths, pipelining, faster processor speeds, superscalar execution, speculative execution, caching, and chip- and system-level integration. As of 2012, every square millimeter of chip area holds up to 9 million transistors.

Microprocessors are easy to program because compilers evolved right along with the hardware they run on [26]. Users can ignore most of the complexity in a modern central processing unit (CPU) since its microarchitecture is almost invisible. Multi-core chips have the same software architecture as older multiprocessor systems: a simple coherent memory model and a few identical computing engines [27, 28]. However, CPU cores continue to be optimized for single-threaded performance at the expense of parallel execution. This fact is most apparent when one considers that integer and floating-point execution units occupy only a tiny fraction of the die area in a modern CPU. With such a small part of the chip devoted to performing direct calculations, it is no surprise
that CPUs are relatively inefficient for HPC applications. The need for CPU designers to maximize single-threaded performance is also behind the use of aggressive process technology to achieve the highest possible clock rates. This comes with significant costs: faster transistors run hotter, cost more to manufacture, and leak more power even when they are not switching. Manufacturers of high-end CPUs spend staggering amounts of money on process technology just to improve single-threaded performance.

The market demands general-purpose processors that deliver high single-threaded performance as well as multi-core throughput for a wide variety of workloads. This pressure has given us almost three decades of progress toward higher complexity and higher clock rates, with each new generation of process technology requiring ever more heroic measures to improve transistor characteristics. These challenges became apparent in the late 20th century, and by 2005 the primary focus of processor manufacturers was to continue increasing the core count on chips. This approach, however, has reached a point of diminishing returns. Dual-core CPUs provide noticeable benefits for most users, but are rarely fully utilized except when working with multimedia content or multiple performance-hungry applications, and quad-core CPUs are often only a slight improvement. As CPU core design continues to progress there will be further improvements in process technology, faster memory interfaces, and wider superscalar cores. However, about a decade ago, processor architects realized that CPUs were no longer the preferred solution for certain problems and started with a clean slate toward a better solution.

The graphics processing unit (GPU) is a specialized electronic circuit designed to rapidly manipulate data and alter memory [29, 30].
In a GPU, about 80% of the transistors on the die are devoted to data processing rather than to data caching and flow control as in a CPU, because a GPU is designed to execute the same function on each element of data with high arithmetic intensity. A simple way to understand a GPU is to compare how a CPU and a GPU process tasks. Architecturally, the CPU is composed of only a few cores with lots of cache memory, optimized for sequential serial processing, and can handle a few software tasks at a time. In contrast, a GPU has a massively parallel architecture consisting of thousands of smaller, more efficient cores designed for handling thousands of tasks simultaneously. This ability to process thousands of tasks in parallel can accelerate some software by 100x
over a CPU alone. Moreover, the GPU achieves this acceleration while being more power- and cost-efficient than a CPU.

Figure 2.1: Schematic comparison of CPU and GPU structure (Source: NVIDIA)

In recent times, GPU computing has grown into a mainstream movement supported by the latest operating systems as well. The reason for this wide acceptance is that the GPU is a computational powerhouse: its capabilities go far beyond basic graphics controller functions and are growing faster than those of the CPU. GPU architectures are becoming increasingly programmable, offering the potential for dramatic speedups in a variety of general purpose applications compared to CPUs.

GPU computing is not meant to replace CPU computing; each approach has advantages for certain kinds of software. As explained earlier, CPUs are optimized for applications where most of the work is done by a limited number of threads, especially where the threads exhibit high data locality, a mix of different operations, and a high percentage of conditional branches. GPU design aims at the other end of the spectrum: applications with many threads dominated by long sequences of computational instructions. In recent times, GPUs have become much better at thread handling, data caching, virtual memory management, flow control and other CPU-like features. However, the distinction between computationally intensive and control-flow intensive procedures is fundamental. Since most of the circuitry within each GPU core is dedicated to computation, rather than to speculative features meant to enhance single-threaded performance, most of the die area and power consumed by a GPU goes into the application's actual algorithmic work.
2.1 Towards a unified graphics computing architecture

The GPU is a processor with ample computational resources. The modern GPU has evolved from a fixed-function graphics pipeline to a programmable parallel processor with computing power exceeding that of multicore CPUs. Traditional GPUs organize their graphics computation in a staged arrangement called the graphics pipeline, designed to allow hardware implementations to maintain high computation rates through parallel execution. The pipeline is divided into several stages, and all geometric primitives pass through every stage. In hardware, each stage is implemented as a separate piece of hardware on the GPU, in what is termed a task-parallel machine organization [31–34].

The input to the pipeline is a list of geometry, expressed as vertices in object coordinates; the output is an image in a frame buffer. The first stage, the geometry stage, transforms each vertex from object space into screen space, assembles the vertices into triangles, and traditionally performs lighting calculations on each vertex; its output is triangles in screen space. The next stage, rasterization, determines the screen positions covered by each triangle and interpolates per-vertex parameters across the triangle; its result is a fragment for each pixel location covered by a triangle. The third stage, the fragment stage, computes the color for each fragment using the interpolated values from the geometry stage. In the final stage, composition, fragments are assembled into an image of pixels, usually by choosing the fragment closest to the camera at each pixel location [33, 34].

Over the years, graphics vendors have transformed the fixed-function pipeline into a more flexible programmable pipeline [31–34].
This effort has primarily concentrated on two stages of the graphics pipeline. Vertex processors operate on the vertices of primitives such as points, lines, and triangles; typical operations include transforming coordinates into screen space, which are then fed to the setup unit and the rasterizer, and setting up lighting and texture parameters to be used by the pixel-fragment processors. Pixel-fragment processors operate on the rasterizer output, which fills the interior of primitives along with the interpolated parameters. Vertex and pixel-fragment processors have evolved at different rates. Vertex
processors were designed for low-latency, high-precision math operations, whereas pixel-fragment processors were optimized for high-latency, lower-precision texture filtering. Vertex processors have traditionally supported more complex processing, so they became programmable first. Each new generation of GPUs has increased the functionality and generality of these two programmable stages, and the two processor types were functionally converging as a result of the need for greater programming generality. However, this increased generality also increased the design complexity and cost of developing two separate processors. Since GPUs typically must process more pixels than vertices, pixel-fragment processors traditionally outnumber vertex processors by about three to one; typical workloads, however, were not well balanced, leading to inefficiency. These factors influenced the decision to design a unified architecture.

A primary design objective was to execute vertex and pixel-fragment shader programs on the same unified processor architecture. Unification would enable dynamic load balancing of varying vertex- and pixel-processing workloads and permit the introduction of new graphics shader stages such as geometry shaders. It would also allow the sharing of expensive hardware such as the texture units. The generality required of a unified processor opened the door to a completely new GPU parallel-computing capability. In November 2006, NVIDIA introduced the Tesla architecture [34, 35], which unifies the vertex and pixel processors and extends them, enabling high performance parallel computing applications written in the C language using the Compute Unified Device Architecture (CUDA) [36–40]. The Tesla architecture is based on a scalable processor array; due to its unified-processor design, the physical Tesla architecture does not resemble the logical order of the graphics pipeline stages.
The following section gives a brief overview of the recent GPU microarchitecture, based on the unified Tesla graphics computing architecture, which is used to benchmark this work.

2.2 Architectural overview of the Tesla Kepler GPU

In 2012, the GPU microarchitecture codenamed Kepler was introduced as the successor to the Fermi microarchitecture. Developed by NVIDIA, it comprises 7.1 billion transistors, making it the fastest and most complex microprocessor ever built. The Kepler microarchitecture uses a design similar to Fermi [41, 42], but with a couple
Figure 2.2: Full-chip block diagram of a Kepler microarchitecture based GPU (Source: NVIDIA)

of key differences [43]. The Kepler architecture focuses on efficiency, programmability and performance. It employs a new streaming multiprocessor design called the next-generation streaming multiprocessor (SMX). Each SMX contains 192 cores, which suggests potential for considerably greater performance; the polymorph engines have been redesigned to deliver twice the performance, because all those cores run at a lower clock speed than the previous Fermi cores did. The GPU as a whole uses less power even as it delivers more performance; the reason for Kepler's power efficiency is that the whole GPU uses a single core clock rather than the double-pumped shader clock [44]. The full Kepler implementation includes 15 SMX units and six 64-bit memory controllers; different products based on GK110/GK210 use different configurations.

2.2.1 Next-generation streaming multiprocessor

Each SMX unit consists of 192 single-precision cores, 64 double-precision units, 32 special function units, 32 load/store units, 64 KB of shared memory, and 48 KB of read-only data cache. The shared memory and the data cache are accessible to all
Figure 2.3: Architectural overview of the next-generation streaming multiprocessor (SMX) within the Kepler microarchitecture (Source: NVIDIA)

threads executing on the same streaming multiprocessor. Each core within an SMX has fully pipelined floating-point and integer arithmetic logic units; floating-point operations follow the IEEE 754-2008 standard. Each core can perform one single-precision fused multiply-add (FMA) operation per clock period and one double-precision FMA in two clock periods. FMA support also increases the accuracy and performance of other mathematical operations such as division and square root, and of more complex functions such as extended-precision arithmetic, interval arithmetic and linear algebra. The integer ALU supports the usual mathematical and logical operations, including multiplication, on both 32-bit and 64-bit values. Memory operations are handled by the load/store units; load/store instructions can now refer to memory in terms of two-dimensional arrays, providing addresses in terms of x and y values. Kepler is designed to significantly increase the GPU's double precision performance. The 32 special function units (SFUs) are available to handle transcendental and other
special operations such as sin, cos, exp (exponential) and rcp (reciprocal) [43, 45–47].

2.2.2 Instruction scheduler

The SMX schedules threads in groups of 32 parallel threads called warps. Each SMX features four warp schedulers and eight instruction dispatch units, allowing four warps to be issued and executed concurrently. Kepler's quad warp scheduler selects four warps, and two independent instructions per warp can be dispatched each cycle. Kepler also allows double precision instructions to be paired with other instructions [45, 48].

Figure 2.4: Warp scheduler within the next-generation streaming multiprocessor (Source: NVIDIA)

2.2.3 Memory model

The number of registers that can be accessed by a thread has been quadrupled in Kepler, allowing each thread access to up to 255 registers. Codes that exhibited high register pressure or spilling behavior on the previous microarchitecture may see substantial speedups as a result of the increased per-thread register count. Kepler also implements a new shuffle instruction which allows threads within a warp to share data. Previously, sharing data between threads within a warp required separate store and load operations to pass the data through shared memory; with the shuffle instruction, threads within a warp can read values from other threads in the warp in just about any imaginable permutation.
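The SMX figures quoted above (192 single-precision cores, 32-thread warps, four schedulers with two dispatch units each) can be sanity-checked with simple arithmetic:

```python
# Rough per-SMX arithmetic from the figures quoted in the text
# (illustrative only; real occupancy depends on registers, shared
# memory and resident-block limits).
warp_size = 32
cores_per_smx = 192
schedulers, dispatch_per_scheduler = 4, 2

print(cores_per_smx // warp_size)            # 6 warps' worth of SP cores
print(schedulers * dispatch_per_scheduler)   # up to 8 instructions issued/cycle
```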
Figure 2.5: Kepler GPU memory hierarchy (Source: NVIDIA)

The Kepler microarchitecture provides local memory in each streaming multiprocessor and supports a unified memory request path for loads and stores, with an L1 cache per SMX. In the Kepler GK110 architecture, each SMX has 64 KB of on-chip memory that can be configured as 48 KB of shared memory with 16 KB of L1 cache, or as 16 KB of shared memory with 48 KB of L1 cache. Kepler also allows additional flexibility by permitting a 32 KB/32 KB split between shared memory and L1 cache. The decision to allocate 16 KB, 32 KB or 48 KB of the local memory as cache usually depends on two factors: how much shared memory is needed, and how predictable the kernel's accesses to global memory are likely to be. A larger shared-memory requirement argues for less cache; more frequent or unpredictable accesses to larger regions of DRAM argue for more cache. For the GK210 architecture, the total amount of configurable memory is doubled to 128 KB, allowing a maximum of 112 KB of shared memory with 16 KB of L1 cache; other possible configurations are 32 KB of L1 cache with 96 KB of shared memory, or 48 KB of L1 cache with 80 KB of shared memory.

In addition to the L1 cache, Kepler introduces a 48 KB cache for data that is known to be read-only for the duration of the function. Use of this read-only path is beneficial because it takes both load traffic and working set footprint off the shared/L1 cache path. The Kepler GK110/210 GPUs feature 1536 KB of dedicated L2 cache memory. The L2 cache is the primary point of data unification between the SMX units, servicing all load, store and texture requests and providing efficient, high speed data sharing across the GPU. The L2 cache subsystem also implements a feature not found on CPUs: a set of
memory read-modify-write operations that are atomic, and thus ideal for managing access to data that must be shared across thread blocks or even kernels. The L1 and L2 caches help improve random memory access performance, while the texture cache enables faster texture filtering. Programs also have access to a dedicated shared memory, a small software-managed data cache attached to each multiprocessor and shared among its cores; this is a low-latency, high-bandwidth, indexable memory which runs essentially at register speeds. Kepler's register files, shared memories, L1 cache, L2 cache and DRAM memory are protected by a single-error-correct, double-error-detect ECC code.

2.2.4 Advanced features

In Kepler, Hyper-Q enables multiple CPU cores to launch work on a single GPU simultaneously, expanding the GPU hardware work queues from 1 to 32 [45, 46]. This is significant because with a single work queue the previous GPU generation could be under-occupied at times, when there was not enough work in that queue to fill every streaming multiprocessor. With 32 work queues, Kepler can in many scenarios achieve higher utilization by placing different task streams on what would otherwise be an idle SMX.

When working with a large amount of data, increasing data throughput and reducing latency are vital to compute performance. Kepler GK110/210 supports the RDMA feature in NVIDIA GPUDirect, which is designed to improve performance by allowing direct access to GPU memory by third-party devices [45, 46]. GPUDirect provides direct memory access (DMA) between the NIC and the GPU without the need for CPU-side data buffering, and enables much higher aggregate bandwidth for GPU-to-GPU communication within a server and across servers through the Peer-to-Peer and RDMA features.
Kepler also introduces dynamic parallelism, which allows the GPU to generate new work for itself, synchronize on results, and control the scheduling of that work via dedicated, accelerated hardware paths, all without involving the CPU [45, 46]. In previous GPUs, all work was launched from the host CPU, ran to completion, and returned a result back to the CPU; the result would then be used as part of the final solution or would be analyzed by the CPU, which would then send additional requests back to the GPU for further processing. In Kepler, any kernel can launch another kernel and can create the
Figure 2.6: Direct Peer-to-Peer data transfer between two GPUs using GPUDirect (Source: NVIDIA)

necessary streams and events, and manage the dependencies needed to process additional work without the need for host CPU interaction. This architectural innovation makes it easier for developers to create and optimize recursive and data-dependent execution patterns, and allows more of a program to be run directly on the GPU.

2.3 CUDA programming model

In November 2006, NVIDIA introduced CUDA, a general purpose parallel computing architecture with a new parallel programming model and instruction set architecture. CUDA comes with a software environment that allows developers to use C as a high-level programming language [37, 49]. At its core are three key abstractions: a hierarchy of thread groups, shared memories, and barrier synchronization, exposed to the programmer as a minimal set of language extensions. These abstractions provide fine-grained data parallelism and thread parallelism nested within coarse-grained data parallelism and task parallelism. They guide the programmer to partition the problem into coarse sub-problems that can be solved independently in parallel by blocks of threads, and each sub-problem into finer pieces that can be solved cooperatively in parallel by all threads within the block [38–40].

CUDA extends C by allowing the programmer to define C functions called kernels [50]. A kernel is the parallel portion of the application that executes on the GPU. Kernels are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C functions. Each thread that executes the kernel is given a unique thread ID that is accessible within the kernel through the built-in threadIdx
variable. threadIdx is a 3-component vector, so that threads can be identified using a one-, two- or three-dimensional thread index, forming a one-, two- or three-dimensional thread block.

Figure 2.7: (Left) Grid of thread blocks (Source: NVIDIA). (Right) CUDA execution model

There is a limit to the number of threads per block, since all threads of a block are expected to reside on the same processor core and must share the limited memory resources of that core. A kernel can be executed by multiple equally-shaped thread blocks, so that the total number of threads is equal to the number of threads per block times the number of blocks. Blocks are organized into a one-dimensional or two-dimensional grid of thread blocks. The number of thread blocks in a grid is usually dictated by the size of the data being processed or the number of processors in the system. Each block within the grid can be identified by a one-dimensional or two-dimensional index, accessible within the kernel through the built-in blockIdx variable; the dimension of the thread block is accessible through the built-in blockDim variable.

Thread blocks are required to execute independently, in any order, in parallel or in series. This independence requirement allows thread blocks to be scheduled in any order across any number of cores. Threads within a block can cooperate by sharing data through shared memory and by synchronizing their execution to coordinate memory accesses. More precisely, one can specify synchronization points in the kernel by calling a barrier at which all threads in the block must wait before any is allowed to proceed.
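The index arithmetic built from threadIdx, blockIdx and blockDim can be sketched on the CPU. The Python sketch below simulates the standard 1-D CUDA indexing pattern for a hypothetical element-wise kernel (ordinary Python functions stand in for device code; the nested loops stand in for the hardware launching one thread per (block, thread) pair):

```python
# CPU-side simulation of 1-D CUDA thread indexing: every simulated
# thread computes its global index i = blockIdx.x*blockDim.x + threadIdx.x
# and guards against the ragged tail when n is not a multiple of blockDim.
def saxpy_kernel(block_idx, block_dim, thread_idx, n, a, x, y):
    i = block_idx * block_dim + thread_idx  # global thread index
    if i < n:                               # tail guard
        y[i] += a * x[i]

n, block_dim = 5, 4
x, y = [1.0] * n, [2.0] * n
grid_dim = (n + block_dim - 1) // block_dim  # enough blocks to cover n
for b in range(grid_dim):                    # hardware does this in parallel
    for t in range(block_dim):
        saxpy_kernel(b, block_dim, t, n, 3.0, x, y)
print(y)  # [5.0, 5.0, 5.0, 5.0, 5.0]
```

The rounded-up grid size and the `i < n` guard are the idiomatic CUDA pattern for mapping an arbitrary problem size onto fixed-size blocks.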
CUDA threads may access data from multiple memory spaces during their execution. Each thread has private local memory. Each thread block has shared memory, visible to all threads of the block and with the same lifetime as the block. All threads have access to the same global memory. There are also two additional read-only memory spaces accessible by all threads: the constant and texture memory spaces. The global, constant and texture memory spaces are persistent across kernel launches by the same application.

2.4 General-purpose computing on graphics processing units

Traditionally, powerful GPUs have been useful mostly to gamers looking for realistic experiences, along with engineers and creatives needing 3D modeling functionality. General-purpose computing on GPUs only became practical and popular after 2001, with the advent of both programmable shaders and floating point support on graphics processors. In particular, problems involving matrices and/or vectors, especially two-, three- or four-dimensional vectors, were easy to translate to a GPU, which acts with native speed and support on those types. The scientific computing community's experiments with the new hardware started with a matrix multiplication routine. These early efforts to use GPUs as general-purpose processors required reformulating computational problems in terms of graphics primitives, as supported by the two major APIs for graphics processors, OpenGL and DirectX [33]. This cumbersome translation was obviated by the advent of general-purpose programming languages and APIs such as Sh/RapidMind, Brook and Accelerator [31, 51, 52]. These were followed by NVIDIA's CUDA, which allowed programmers to ignore the underlying graphical concepts in favor of more common high-performance computing concepts [32, 53]. Newer, hardware vendor-independent offerings include Microsoft's DirectCompute and the Apple/Khronos Group's OpenCL [53].
This means modern GPGPU pipelines can act on any big-data operation and leverage the speed of a GPU without requiring full and explicit conversion of the data to a graphical form [50]. GPU flexibility has increased over the last decade thanks to massive multi-core parallelization delivering high throughput even in double-precision arithmetic, to increased on-board memory, and to the efforts made by vendors in
facilitating programmability. GPU-accelerated computing has revolutionized the HPC industry. Researchers quickly realized that many real world problems map very well to the pipelined single instruction multiple data (SIMD) hardware in the GPU's streaming processors. Many computational applications across a wide range of fields are already optimized for GPUs, for example molecular dynamics [54–57], quantum chemistry [58–62], materials science [63, 64], bioinformatics [65–69], physics [70–74], numerical analytics [75–77], fluid dynamics [78–80], medical imaging [81–83] and finance [84, 85].

While the GPU has many benefits, such as more computing power, larger memory bandwidth and low power consumption, there are some constraints to fully utilizing its processing power. Developing code for the GPU takes more time and more sophisticated work: gaining a relevant speedup requires that algorithms be coded to reflect the GPU architecture, and programming for the GPU differs significantly from programming traditional CPUs. In particular, incorporating GPU acceleration into pre-existing codes is more difficult than just moving from one CPU family to another; a GPU-savvy programmer needs to dive into the code and make significant changes to critical components. Also, GPU code runs in parallel, so data partitioning and synchronization techniques are needed, which in turn enforce access levels for the different categories of memory. The low-bandwidth PCI-E bus that physically connects the GPU to the rest of the system is one of the main performance-limiting factors: transferring anything over PCI-E lowers the speed roughly twentyfold compared to the on-board memory, so GPU performance can drop by an order of magnitude. These constraints make performance optimization more difficult. Also, the GPU's debugging environment is not as powerful as that of a general CPU.
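The PCI-E penalty is easy to quantify. The sketch below uses rough, assumed bandwidth figures (PCI-E 2.0 x16 at ~8 GB/s, Kepler-era on-board GDDR5 at ~200 GB/s; neither number is from the text) and lands in the same ballpark as the twentyfold slowdown mentioned above:

```python
# Illustrative bus-vs-memory comparison with assumed bandwidths:
# moving a buffer over PCI-E vs. streaming it from on-board GDDR5.
def transfer_seconds(nbytes, gb_per_s):
    return nbytes / (gb_per_s * 1e9)

buf = 1e9                               # a 1 GB buffer
pcie = transfer_seconds(buf, 8.0)       # ~0.125 s over PCI-E
gddr = transfer_seconds(buf, 200.0)     # ~0.005 s from on-board memory
print(round(pcie / gddr))               # on-board memory ~25x faster here
```

This is why minimizing host-device transfers, as done in the implementation described in Chapter 1, is a first-order optimization for GPU codes.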
2.5 Summary

The GPU is the most powerful computing engine available to computational scientists and is being utilized in a wide range of scientific computing applications. What makes the GPU so powerful is its thousands of identical cores, which run at a lower clock rate than a CPU but are optimized for repetitive SIMD-type operations on big data sets, along with its high memory bandwidth and ease of programmability using a high level language. However, there are
certain types of application that are better suited to GPU computing than others. Most applications need to be extensively re-coded for the GPU, and one needs a deep understanding of the GPU architecture and memory model to obtain optimal speedups. The remarkable ongoing effort by GPU vendors has resulted in a generation of more sophisticated, easily programmable, compute-optimized GPU architectures.
Chapter 3 Introduction to Eigensolvers

The theory and computation of eigenvalue problems are among the most successful and widely used tools of applied mathematics and scientific computing. Eigenvalue problems find application in a variety of scientific and engineering fields, including acoustics, control theory, earthquake engineering, graph theory, Markov chains, pattern recognition, quantum mechanics, stability analysis, quantum physics, materials science and many other areas. The increasing number of applications and the ever-growing scale of the problems have motivated fundamental progress in the numerical solution of eigenvalue problems.

Eigenvalues are often introduced in the context of linear algebra or matrix theory; historically, however, they arose in the study of quadratic forms and differential equations. In the 18th century, Euler studied the rotational motion of a rigid body and discovered the importance of the principal axes, and Lagrange realized that the principal axes are the eigenvectors of the inertia matrix [86]. In the early 19th century, Cauchy saw how their work could be used to classify the quadric surfaces and generalized it to arbitrary dimensions. At the start of the 20th century, Hilbert studied the eigenvalues of integral operators by viewing the operators as infinite matrices [87]; he was the first to use the word "eigen." The first numerical algorithm for computing eigenvalues and eigenvectors appeared in 1929, when Von Mises published the power method [88].

An eigenvector of an N×N square matrix A is a non-zero vector v that, when multiplied by A, yields a scalar (λ) multiple of itself:
Av = λv    (3.1)

This equation is referred to as the standard eigenvalue problem. Here, λ is an eigenvalue of A, v is the corresponding right eigenvector, and (λ, v) is called an eigenpair. The set of all eigenvectors of a matrix, each paired with its corresponding eigenvalue, is called the eigensystem of that matrix [89]. The full set of eigenvalues of A is called the spectrum and is denoted by λ(A) = {λ1, λ2, ..., λn}. Any multiple of an eigenvector is also an eigenvector with the same eigenvalue. An eigenspace of a matrix A is the set of all eigenvectors with the same eigenvalue, together with the zero vector. An eigenbasis for A is any basis consisting of linearly independent eigenvectors of A. In solving an eigenvalue problem, there are a number of properties that need to be considered, like the type of matrix (real or complex), the structure of the matrix (banded, sparse, structured sparseness, Toeplitz), special properties of the matrix (symmetric, Hermitian, skew-symmetric, unitary) and the type of eigenvalues required (largest, smallest, interior, sums of intermediate eigenvalues). These greatly affect the choice of algorithm. There are also a variety of more complicated eigenproblems, for instance the generalized eigenproblem Ax = λBx, quadratic problems like Ax + λBx + λ²Cx = 0, higher order polynomial problems, and nonlinear eigenproblems. All these problems are considerably more complicated than the standard eigenproblem, depending on the operators involved. In numerical mathematics, several different techniques for calculating the eigenpairs have been developed. These techniques can be divided into two main groups: “direct methods” and “iterative methods.” The first are algorithms for medium-sized problems that calculate from one up to all eigenvalues. The second are methods for huge eigenvalue equations that calculate only a few eigenpairs by projecting the huge problem onto a much smaller search space which is built up within the algorithm.
The projected system is small enough to be solved by the techniques of the former group.

3.1 Direct methods

In this section, let us briefly discuss various direct methods for the computation of eigenvalues of matrices that are small enough to be stored in computer memory as full matrices. These direct methods are sometimes called transformation methods and are
built up around similarity transformations. They transform the matrix to a simpler form from which all the eigenvalues and eigenvectors can be found.

3.1.1 QR algorithm

This algorithm finds all the eigenvalues and, optionally, all the eigenvectors. The basic idea is to perform repeated QR decompositions [90–92]. The QR algorithm consists of two separate stages. First, by means of a similarity transformation, the original matrix is transformed in a finite number of steps to Hessenberg form, or in the Hermitian/symmetric case to real tridiagonal form. This first stage prepares the matrix for the second stage, the actual QR iterations, which are applied to the Hessenberg or tridiagonal matrix [93]. It takes O(n²) floating point operations to find all the eigenvalues of a tridiagonal matrix. Since reducing a dense matrix to tridiagonal form costs (4/3)n³ floating point operations, O(n²) is negligible for large enough n. For finding all the eigenvectors as well, QR iteration takes a little over 6n³ floating point operations on average.

3.1.2 Divide-and-conquer method

An eigenvalue problem is divided into two problems of roughly half the size, each of these is solved recursively, and the eigenvalues of the original problem are computed from the results of these smaller problems. This algorithm was originally proposed by Cuppen [94]. However, it took ten more years until a stable variant was found by Gu and Eisenstat [95,96]. The advantage of divide-and-conquer comes when eigenvectors are needed as well. In this case, reduction to tridiagonal form takes (8/3)n³ floating point operations, and the second part of the algorithm takes O(n³) as well. For the QR algorithm with a reasonable target precision, this second part is ≈6n³, whereas for divide-and-conquer it is ≈(4/3)n³. The reason for this improvement is that in divide-and-conquer the O(n³) part of the algorithm is separate from the iteration, whereas in QR it must occur in every iterative step.
Adding the (8/3)n³ flops for the reduction, the total improvement is from ≈9n³ to ≈4n³ flops. The divide-and-conquer approach is now the fastest algorithm for computing all eigenvalues and eigenvectors of a symmetric matrix of order larger than about 25; this also holds true for non-parallel computers. If the subblocks are of order greater than 25, they are further divided; otherwise, the QR algorithm is used for computing the eigenvalues and eigenvectors of the subblock [97].
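The QR iteration at the core of the method in Section 3.1.1 can be sketched in a few lines. The following pure-Python toy is only an illustration under simplifying assumptions: a hypothetical 2×2 symmetric matrix, classical Gram-Schmidt QR, and no Hessenberg/tridiagonal reduction or shifts, which any practical implementation would use to reach the operation counts quoted above.

```python
def qr_decompose(A):
    """QR factorization of a small square matrix via classical Gram-Schmidt."""
    n = len(A)
    cols = [[A[i][j] for i in range(n)] for j in range(n)]   # columns of A
    q_rows, R = [], [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j][:]
        for i, qi in enumerate(q_rows):
            R[i][j] = sum(qi[k] * cols[j][k] for k in range(n))
            v = [v[k] - R[i][j] * qi[k] for k in range(n)]
        R[j][j] = sum(x * x for x in v) ** 0.5
        q_rows.append([x / R[j][j] for x in v])
    Q = [[q_rows[j][i] for j in range(n)] for i in range(n)]  # columns -> matrix
    return Q, R

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def qr_iteration(A, iters=60):
    """Unshifted QR iteration: A <- R*Q preserves the eigenvalues and
    drives the iterates toward (block) triangular form."""
    for _ in range(iters):
        Q, R = qr_decompose(A)
        A = matmul(R, Q)
    return A

T = qr_iteration([[2.0, 1.0], [1.0, 2.0]])   # eigenvalues are 3 and 1
print(sorted(T[i][i] for i in range(2)))     # diagonal converges to the eigenvalues
```

The off-diagonal entry decays like |λ2/λ1| per step; shifts, which this sketch omits, are what make convergence fast in practice.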
3.1.3 Bisection method and inverse iteration

Bisection may be used to find just a subset of the eigenvalues, like those in an interval [a, b]. It needs only O(nk) floating point operations, where k is the number of eigenvalues desired. Thus the bisection method can be much faster than the QR method when k ≪ n. It can be highly accurate, but may be adjusted to run faster if lower accuracy is acceptable [98,99]. Inverse iteration can then be used to find the corresponding eigenvectors. In the best case, when the eigenvalues are well separated, inverse iteration also costs only O(nk) floating point operations. This is much less than either QR or divide-and-conquer, even when all eigenvalues and eigenvectors are desired (k = n). On the other hand, when many eigenvalues are clustered close together, Gram-Schmidt orthogonalization will be needed to make sure that one does not get several identical eigenvectors. This adds O(nk²) floating point operations to the operation count in the worst case.

3.1.4 Jacobi method

The Jacobi method is mostly used for solving Hermitian eigenvalue problems. This method constructs an orthogonal transformation to diagonal form, A = XΛX*, by applying a sequence of elementary orthogonal rotations, each time reducing the sum of squares of the off-diagonal elements of the matrix, until it is of diagonal form to working accuracy [100]. The Jacobi algorithm has been very popular since its implementation is very simple and it gives eigenvectors that are orthogonal to working accuracy. However, it cannot compete with the QR method in terms of operation counts: Jacobi needs 2sn³ multiplications for s sweeps, which is more than the (4/3)n³ needed for tridiagonal reduction. There is one important advantage to the Jacobi algorithm.
It can deliver eigenvalue approximations with a small error in the relative sense, in contrast to algorithms based on tridiagonalization, which only guarantee that the error is bounded relative to the norm of the matrix [101,102].

3.2 Iterative methods

Theoretically, the numerical algorithms mentioned above are applicable to arbitrary dimensions, but in practice they are limited by memory restrictions and computational
time. The cost of the QR algorithm is O(n³), which cannot be handled for large n on current computers. In this section, numerical methods are introduced that calculate a few eigenvalues at a lower computational cost. The well-known iterative methods for solving eigenvalue problems are the power method (and the inverse iteration), the Krylov subspace methods, the Jacobi-Davidson algorithm and the FEAST method. Traditionally, if the extreme eigenvalues are not well separated or the eigenvalues sought are in the interior of the spectrum, a shift-and-invert transformation has to be used in combination with these eigenvalue problem solvers.

3.2.1 Power iteration method

The power iteration is a very simple algorithm. It does not compute a matrix decomposition; the basic idea is to multiply the matrix A repeatedly by a well-chosen starting vector, so that the component of that vector in the direction of the eigenvector with the largest eigenvalue in absolute value is magnified relative to the other components [88]. The speed of convergence of the power iteration depends on the ratio of the second largest eigenvalue to the largest eigenvalue. Interestingly, the most effective variant is the inverse power method with shift, which can find interior as well as exterior eigenvalues [103]. The idea of this method is to apply the power method to A⁻¹ or to the inverse of the shifted matrix, (A − µ0I)⁻¹. The eigenvalues of A⁻¹ are the inverses of the eigenvalues of A. Thus, the inverse power method finds the eigenvalue closest to zero. The smallest eigenvalue of the shifted matrix (A − µ0I) is the eigenvalue of A closest to µ0. Therefore, this method can find any simple eigenvalue when an appropriate guess µ0 is available.

3.2.2 Rayleigh quotient iteration method (RQI)

RQI is an eigenvalue algorithm which extends the idea of the inverse iteration by using the Rayleigh quotient to obtain increasingly accurate eigenvalue estimates [104].
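The power iteration and its shifted inverse variant fit in a few lines of pure Python. In the sketch below the 2×2 matrix, the shift µ0 = 2 and the iteration count are hypothetical, and the shifted matrix is inverted explicitly only because it is tiny; in practice one factors (A − µ0 I) once and solves a linear system per step.

```python
def mat_vec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def normalize(v):
    n = sum(x * x for x in v) ** 0.5
    return [x / n for x in v]

def power_iteration(A, v, iters=100):
    """Repeated multiplication magnifies the dominant eigen-direction."""
    for _ in range(iters):
        v = normalize(mat_vec(A, v))
    Av = mat_vec(A, v)
    # Rayleigh quotient gives the eigenvalue estimate
    return sum(x * y for x, y in zip(v, Av)), v

def inverse_2x2(A):
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

A = [[4.0, 1.0], [1.0, 3.0]]             # eigenvalues (7 ± sqrt(5))/2
lam, _ = power_iteration(A, [1.0, 0.0])
print(lam)                               # largest eigenvalue in magnitude

# shifted inverse iteration: converges to the eigenvalue closest to mu0
mu0 = 2.0
B = inverse_2x2([[A[0][0] - mu0, A[0][1]], [A[1][0], A[1][1] - mu0]])
nu, _ = power_iteration(B, [1.0, 0.0])
print(mu0 + 1.0 / nu)                    # eigenvalue of A nearest mu0
```

The recovery step mu0 + 1/nu undoes the shift-and-invert mapping λ → 1/(λ − µ0).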
Starting with a normalized putative eigenvector, a sequence of normalized approximate eigenvectors is generated together with their associated Rayleigh quotients. The RQI algorithm converges cubically for Hermitian or symmetric matrices, given an initial vector that is sufficiently close to an eigenvector of the matrix being analyzed. If the matrix is non-Hermitian, it is still possible to get cubic convergence by using a two-sided version of the algorithm. The
drawbacks of the RQI method are that it may converge to an eigenvalue which is not the closest to the desired one, and that the algorithm has a high computational cost since it requires a factorization at every iteration.

3.2.3 Arnoldi method

The Arnoldi method was first introduced as a direct algorithm for reducing a general matrix to upper Hessenberg form [105]. It was later discovered that this algorithm leads to a good iterative technique for approximating eigenvalues of large sparse matrices. The Arnoldi method belongs to a class of linear algebra algorithms based on the idea of Krylov subspaces that give a partial result after a relatively small number of iterations. It is an orthogonal projection method onto a Krylov subspace. The procedure can essentially be viewed as a modified Gram-Schmidt process for building an orthogonal basis of the Krylov subspace Km(A, v). The cost of orthogonalization increases as the method proceeds. A convergence analysis of eigenvector approximation using the Arnoldi method can be found in [106,107]. As the CPU time and memory needed to manage the Krylov subspace increase with its dimension, a subspace restarting strategy is necessary. Roughly speaking, the restarting strategy builds a new subspace of smaller dimension by extracting the desired approximate eigenvectors from the current subspace of larger dimension. An elegant implicit restarting strategy based on the shifted-QR algorithm was proposed by Sorensen [108]. This method generates a new Krylov subspace of smaller dimension without using matrix-vector products involving A. The resulting algorithm is called the implicitly restarted Arnoldi (IRA) method.

3.2.4 Lanczos method

The Lanczos algorithm can be viewed as a simplification of Arnoldi’s algorithm for the case of Hermitian matrices.
It is an effective iterative method to find eigenvalues and eigenvectors of large sparse matrices by first building an orthonormal basis and then forming approximate solutions using Rayleigh projection. It reduces a large, complicated eigenvalue problem into a simpler one [109,110], explicitly taking advantage of the symmetry of the matrix. However, the Lanczos method diverges when implemented in finite precision arithmetic, since the Lanczos vectors inevitably lose
their mutual orthogonality [110,111]. Hence, it needs a full reorthogonalization of each newly computed vector against all preceding Lanczos vectors. This not only greatly increases the number of computations required, but also requires that all the vectors be stored. For large problems, it will be very expensive to take more than a few steps using full reorthogonalization. Nevertheless, linear independence will surely be lost without some sort of corrective procedure. Selective orthogonalization interpolates between full reorthogonalization and simple Lanczos to obtain the best of both worlds: robust linear independence is maintained among the vectors at a cost which is close to that of simple Lanczos [112,113]. Another way to maintain orthogonality is to limit the size of the basis set and use a restarting scheme, replacing the starting vector with an improved one and computing a new Lanczos factorization with the new vector.

3.2.5 Locally optimal block preconditioned conjugate gradient method (LOBPCG)

LOBPCG is based on a local optimization of a three-term recurrence. It is designed to find the smallest or the largest eigenvalues and corresponding eigenvectors of symmetric positive definite eigenvalue problems [114]. Similar to other conjugate gradient based methods, this is accomplished by the iterative minimization of the Rayleigh quotient, taking the gradient as the search direction in every iteration step, which results in finding the smallest eigenstates of the original problem. In the LOBPCG method the minimization at each step is done locally, in the subspace spanned by the current approximation, the previous approximation, and the preconditioned residual. The subspace minimization is done by the Rayleigh-Ritz method. Iterating several approximate eigenvectors simultaneously, in a block, in a similar locally optimal fashion results in the full block version of LOBPCG.
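The three-term Lanczos recurrence of Section 3.2.4 is compact enough to sketch directly. The following pure-Python toy omits reorthogonalization and uses a hypothetical 4×4 diagonal test matrix with known spectrum {1, 2, 3, 4}; run to completion (m = n), the projected tridiagonal matrix T is orthogonally similar to A and therefore, for example, preserves its trace.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mat_vec(A, v):
    return [dot(row, v) for row in A]

def lanczos(A, q1, m):
    """Three-term Lanczos recurrence: returns the diagonal (alphas) and
    off-diagonal (betas) entries of the projected tridiagonal matrix T."""
    n = len(A)
    nrm = dot(q1, q1) ** 0.5
    q, q_prev = [x / nrm for x in q1], [0.0] * n
    beta, alphas, betas = 0.0, [], []
    for _ in range(m):
        w = mat_vec(A, q)
        alpha = dot(q, w)
        alphas.append(alpha)
        w = [w[i] - alpha * q[i] - beta * q_prev[i] for i in range(n)]
        beta = dot(w, w) ** 0.5
        if beta < 1e-10:          # invariant subspace reached: breakdown
            break
        betas.append(beta)
        q_prev, q = q, [x / beta for x in w]
    return alphas, betas

A = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 2.0, 0.0, 0.0],
     [0.0, 0.0, 3.0, 0.0],
     [0.0, 0.0, 0.0, 4.0]]
alphas, betas = lanczos(A, [1.0, 1.0, 1.0, 1.0], m=4)
print(sum(alphas))    # full run: trace(T) = trace(A) = 10
```

In practice the eigenvalues of T are then obtained with a dense tridiagonal eigensolver, and only a few steps m ≪ n are taken.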
3.2.6 Davidson method

Davidson came up with the idea of expanding the subspace in such a way that certain eigenpairs would be favored, bearing in mind that if a true eigenvector lies in the subspace of the current iteration, the eigenproblem in the subspace gives the exact corresponding eigenpair. Thus, to achieve fast convergence, a better way to
expand the subspace is to choose the new expansion vector to be the component of the error vector which is orthogonal to the subspace [115,116]. If this orthogonal component could be solved for exactly and added to the subspace, then convergence would be achieved in the next iteration in exact arithmetic. It has been reported that this method can be quite successful in finding dominant eigenvalues of (strongly) diagonally dominant matrices. Davidson [117] suggests that his algorithm (more precisely, the Davidson-Liu variant) may be interpreted as a Newton-Raphson scheme, and this has been used as an argument to explain its fast convergence.

3.2.7 Jacobi-Davidson method

The Jacobi-Davidson method is a popular technique to compute a few eigenpairs of large sparse matrices. It is motivated by the fact that standard eigensolvers often require an expensive factorization of the matrix to compute interior eigenvalues. Such a factorization is infeasible for large matrices in large-scale simulations. In the Jacobi-Davidson method, one still needs to solve inner linear systems, but a factorization is avoided because the method is designed so as to favor the efficient use of iterative solution techniques based on preconditioning [118]. The Jacobi-Davidson method belongs to the class of subspace methods, which means that approximate eigenvectors are sought in a subspace. Each iteration of this method has two important phases: the subspace extraction, in which an approximate eigenpair is sought with the approximate vector in the search space, and the subspace expansion, in which the search space is enlarged by adding a new basis vector to it, trying to lead to a better approximate eigenpair in the next extraction phase [119,120].
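The subspace extraction phase shared by Davidson-type methods is a Rayleigh-Ritz projection: compress the operator onto the current search space V, solve the small projected eigenproblem, and lift the Ritz vector back to full dimension. The following pure-Python toy sketch uses a hypothetical 3×3 symmetric matrix and a fixed 2-dimensional search space, so the projected problem can be solved in closed form; in a real solver V grows each iteration and the projected problem is handled by a dense eigensolver.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mat_vec(A, x):
    return [dot(row, x) for row in A]

def rayleigh_ritz(A, V):
    """Project A onto span(V) (V: two orthonormal basis vectors), solve the
    2x2 projected eigenproblem in closed form, and lift the Ritz vector."""
    AV = [mat_vec(A, v) for v in V]
    a, b, d = dot(V[0], AV[0]), dot(V[0], AV[1]), dot(V[1], AV[1])
    mean = (a + d) / 2.0
    disc = (((a - d) / 2.0) ** 2 + b * b) ** 0.5
    theta = mean - disc                       # smallest Ritz value
    y = [b, theta - a] if abs(b) > 1e-14 else [1.0, 0.0]
    ny = (y[0] ** 2 + y[1] ** 2) ** 0.5
    y = [c / ny for c in y]                   # eigenvector of the 2x2 block
    x = [y[0] * V[0][i] + y[1] * V[1][i] for i in range(len(V[0]))]
    return theta, x                           # Ritz pair for A

A = [[2.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 4.0]]
V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]        # search space: first two axes
theta, x = rayleigh_ritz(A, V)
print(theta)    # smallest eigenvalue of the projected block [[2,1],[1,3]]
```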
3.2.8 Contour integral spectral slicing

The contour integral spectral slicing method is based on the contour integral method proposed by Sakurai and Sugiura [121] for finding the eigenvalues of a generalized eigenvalue problem that lie in a given domain of the complex plane. The method projects the matrix pencil onto a subspace associated with the eigenvalues that are located in the domain. The approach is based on a root finding method for an analytic function; it finds all of the zeros that lie in a circle using numerical integration. The algorithm requires a region that includes several eigenvalues and an estimate of the number of eigenvalues or clusters in the region. The major advantage of
this method is that an iterative process for constructing the subspace is not required. At each contour point, the projected matrix pencil with the eigenvalues of interest is derived by the solution of linear systems. A Rayleigh-Ritz type variant of the method has also been developed to improve numerical stability [122].

3.2.9 FEAST method

Lately, the FEAST algorithm, which takes its inspiration from the density-matrix representation and contour integration techniques in quantum mechanics, has been developed [123]. Unlike the Lanczos and Jacobi-Davidson methods, the aim of the FEAST algorithm is to actually compute the eigenvectors instead of approximating them. The algorithm deviates fundamentally from the traditional Krylov subspace iteration based techniques. It is free from any orthogonalization procedures, and its main computational tasks consist of solving inner independent linear systems with multiple right-hand sides. The FEAST algorithm finds all the eigenpairs in a given search interval. It requires that one provide an estimate of the number of eigenpairs within the search interval, which often is not possible to obtain beforehand [124].

3.3 Survey of available software packages for eigenproblems

The history of reliable high quality software for numerical linear algebra started in 1971 with the book titled the “Handbook for Automatic Computation” [125]. This book described state-of-the-art algorithms for the solution of linear systems and eigenproblems. During the same decade, a few research groups started the development of two influential software packages: LINPACK and EISPACK. LINPACK covered the numerical solution of linear systems; EISPACK concentrated on eigenvalue problems. These packages can also be viewed as prototypes for the eigenvalue routines in the bigger software packages NAG and IMSL and in the widely available software package MATLAB. EISPACK was replaced in 1995 by LAPACK.
From Table 3.1, one can notice that there are numerous commercial and open source packages available that support single and double precision, real or complex arithmetic eigensolvers, and even distributed computing via MPI or other technologies. Yet, there
are several disadvantages in employing one of them. To list a few: first, users have to assume that these are optimal implementations, and they trade control and flexibility for ease of use. Second, packages are developed around one hardware/software feature and may not exploit all the optimization opportunities that advanced platforms have to offer. Also, most commercial packages are driven by the requirements of their clients and may fail to serve the broader scientific community. Most packages are inadequate to meet the needs of large groups of computational experts from different domains. Some are dedicated to real systems whereas others are meant to solve complex systems; some are developed for both real and complex arithmetic, but experience has shown that they may not have solvers for other specific eigenvalue systems. As can be seen, there are a few independent projects currently in progress to implement eigensolvers that execute in multi-GPU and CPU-GPU hybrid scenarios.

Table 3.1: Detailed list of available software packages for large-scale eigenproblems

Package   Numerical method employed                      Real  Complex  ShMem  GPU      Distrib.  mGPU     Sparse  Interior
Anasazi   Block Krylov-Schur, Block Davidson, LOBPCG     Yes   Yes      Yes    No       Yes       No       Yes     Yes
ARPACK    Arnoldi/Lanczos (implicit restart)             Yes   Yes      Yes    No       Yes       No       Yes     Yes
BLOPEX    LOBPCG                                         Yes   Yes      Yes    No       Yes       No       Yes     No
FEAST     FEAST                                          Yes   Yes      Yes    No       Yes       No       Yes     Yes
FILTLAN   Polynomial filtered Lanczos                    Yes   Yes      Yes    No       No        No       Yes     Yes
IETL      Power, RQI, Lanczos                            Yes   Yes      Yes    No       No        No       Yes     Yes
LASO      Lanczos                                        Yes   No       Yes    No       No        No       Yes     No
MAGMA     LOBPCG                                         Yes   Yes      Yes    Yes      Yes       Limited  Yes     Yes
PRIMME    Block Davidson, JDQMR, JDQR, LOBPCG            Yes   Yes      Yes    No       Yes       No       Yes     Yes
PROPACK   SVD via Lanczos                                Yes   Yes      Yes    No       No        No       Yes     Yes
PySPARSE  Jacobi-Davidson                                Yes   No       No     No       No        No       Yes     Yes
SLEPc     Krylov-Schur, Arnoldi, Lanczos, RQI, Subspace  Yes   Yes      Yes    Limited  Yes       Limited  Yes     Yes
TRLAN     Lanczos (dynamic thick-restart)                Yes   No       Yes    No       Yes       No       Yes     No

(ShMem = shared memory; Distrib. = distributed; mGPU = multi-GPU; “Limited” = limited support.)
However, their capability is limited by various factors and a lot of work still needs to be done before they can be widely employed for general-purpose numerical computation.
3.4 Summary

Eigenvalue problems arise in a wide range of scientific domains. To date, enormous numerical effort has been devoted to developing methods that can solve these systems. The eigenproblem variations that are most widely encountered are the standard eigenvalue problem and the generalized eigenvalue problem. There are a number of methods that can be employed to solve eigenproblems, but the choice of method depends on a number of factors. In a broad sense, algorithms can be divided into two groups: direct methods, which are employed for small systems, and iterative methods, which are used when dealing with large-scale eigenproblems. A number of implementations of a wide variety of algorithms are available in the form of portable software packages. However, there is limited work focused on developing robust, optimal eigensolver packages for recent HPC and GPU based systems.
Chapter 4

Design of GPU based eigensolver for atomistic simulation

There are two important aspects that must be considered when employing a numerical method. The first is the correct implementation of the physical governing equations and the accuracy of the mathematical algorithms. The second is directly related to the nature of the hardware needed to execute the model. Each kind of platform used to perform numerical simulations presents its own advantages and limitations. Parallelization methods and optimization techniques are essential to perform simulations in a reasonable execution time. Iterative methods based on Krylov subspaces, which were introduced in Chapter 3, are usually employed to compute a few eigenstates of large sparse matrices. Among these methods are the original Lanczos algorithm, Arnoldi [110], Krylov-Schur, and Jacobi-Davidson [126]. As already seen, some of the main standard libraries that include iterative eigensolver routines are ARPACK (ARnoldi PACKage) [127], PRIMME (PReconditioned Iterative MultiMethod Eigensolver), a library based on the Jacobi-Davidson algorithm [128], IETL (Iterative Eigensolver Template Library), providing a generic template interface to solvers [116], and SLEPc, a scalable library based on the linear algebra package PETSc [129]. All these libraries support single and double precision, real or complex arithmetic, and even distributed computation via MPI. Most eigenvalue solvers have concentrated on computational techniques that
accelerate separate components, in particular the matrix-vector multiplication [130] or new efficient sparse matrix storage formats [131]. However, only a limited amount of work has been realized to take advantage of modern processor architectural improvements for high performance computing in atomistic simulation, which is facilitated by their enhanced programmability and motivated by their attractive price-to-performance ratio and incredible growth in speed [116,127,128]. This work has been motivated by the lack of specialized eigensolvers for large-scale computations on GPUs. I concentrate on addressing some basic problems that hinder the development of an efficient eigensolver on GPUs: first, the choice of the algorithm itself. Then, I demonstrate how to overcome the compute-versus-communication gap that exists in GPUs, and establish ways to resolve the computational and memory related bottlenecks. Finally, a multi-GPU implementation that scales with the number of GPUs is presented. The result is an eigensolver that efficiently accelerates large-scale tight-binding (TB) calculations. In the following sections, I start with the custom implementation of the Lanczos algorithm with a simple restart that is optimized for GPUs, as it has been identified as a well-suited method for computing a few eigenpairs on a GPU framework that can cope with the memory limitations of current GPUs and slow GPU-CPU communication. I also discuss the enhancements and strategies developed for optimal eigensolver implementations utilizing GPU and other HPC based distributed technologies, and present benchmark calculations performed on a GaN/AlGaN wurtzite quantum dot similar to the one shown in Figure 4.1. I further the discussion in Chapter 5 by comparing our fine-tuned Lanczos implementation with GPU based Jacobi-Davidson and FEAST method implementations.

Figure 4.1: Conical wurtzite GaN/AlGaN quantum dot with 30% Al. Atomistic description: in yellow Aluminium, in red Gallium.
4.1 Lanczos method

We are interested in finding inner eigenvalues of the energy spectrum, near the energy gap, of the large GaN/AlGaN quantum dot nanostructure shown in Figure 4.1. Such systems have important applications in modern nitride-based light emitting diodes (LEDs) [9,19]. However, the Lanczos algorithm converges fast only to the extreme eigenvalues. As stated in Chapter 3, different spectral transformations are used for this purpose, like spectrum folding or shift-and-invert [110]. In this implementation, spectrum folding is applied in order to avoid the computation of the matrix inverse, which might pose additional convergence problems. So, in general, the lowest eigenpairs of the operator A = (H − sI)² are computed, where s is the chosen spectrum shift [132]. The implemented algorithm is a variant of that described in reference [133].

Algorithm. The Lanczos method
Assume H is a Hermitian matrix and q1 is a random vector with ||q1|| = 1; set q0 = 0, β1 = 0.
for i = 1 to m:
    ui = (H − sI) qi
    αi = ui · ui
    qi+1 = (H − sI) ui − αi qi − βi qi−1
    βi+1 = ||qi+1||2
    qi+1 = qi+1 / βi+1

After each iteration, we get αi and βi, the coefficients used to construct the tridiagonal matrix

        | α1  β2                      |
        | β2  α2  β3                  |
        |     β3  α3   ·              |
T =     |          ·    ·     βm−1    |
        |              βm−1  αm−1  βm |
        |                     βm   αm |

Due to finite precision arithmetic, the new q vectors slowly become less orthogonal to the initial vectors [106]. Reorthogonalizing the current q vector against all previous qi takes
a lot of resources and is not done in our implementation. Other versions of the Lanczos algorithm perform a partial reorthogonalization, keeping the subspace rather small. Experience shows that the convergence rate increases when the subspace is considerably enlarged at the expense of accurate orthogonality. In this implementation, the Lanczos iterations are performed until orthogonality with respect to the initial vector, q1, is preserved to an error of 10⁻⁵. In this way, the typical size of the tridiagonal matrix T becomes of the order of 1000, which can be diagonalized using standard LAPACK routines, obtaining the eigenvalues λi^(m) and corresponding eigenvectors wi^(m). It can be proved that the eigenvalues of T are approximate eigenvalues of A. Here, only the eigenvalues with lowest |λi| are considered, corresponding to the eigenvalues λi^H = s ± √λi of H closest to s. The projected eigenvector, vi, can be calculated as vi = Qm wi^(m), where Qm is the transformation matrix whose column vectors are q1, q2, ..., qm. The qi vectors are recomputed on the fly by running the Lanczos iteration a second time. This might seem a waste of time at first, but reducing the subspace size in order to store the qi vectors in memory does not improve overall speed. Once the approximate eigenvector vi has been computed, the algorithm is tested for convergence by considering the residual |⟨vi|H|vi⟩/⟨vi|vi⟩ − λi| < tol. One can notice from the algorithm that each iteration requires two sparse matrix-vector (spMV) multiplications and four vector operations, which implies that, if Rmax is the maximum number of non-zero elements in any one row of the sparse matrix H, then the complexity of the spMV product operation is O(Rmax · N) [134]. The complexity per iteration of the Lanczos algorithm is O(2(Rmax · N) + N), where the dominant operation is the matrix-vector multiplication. Observe that the matrix remains unchanged along this loop.
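The effect of the spectrum-folding transformation A = (H − sI)² is easy to see on a toy spectrum: folding maps the eigenvalue of H closest to the shift s onto the smallest eigenvalue of A, which a lowest-state solver such as the Lanczos scheme above can then target. A pure-Python sketch with a hypothetical, already-diagonalized spectrum (the energy values and the shift are invented for illustration):

```python
# hypothetical energy spectrum of a diagonal H (arbitrary units)
H_eigs = [-1.2, -0.4, 0.3, 1.1, 2.5]
s = 0.1                                  # shift placed inside the gap

# eigenvalues of the folded operator A = (H - sI)^2
folded = [(e - s) ** 2 for e in H_eigs]

# the smallest folded eigenvalue identifies the eigenvalue of H nearest s
i_min = min(range(len(folded)), key=folded.__getitem__)
print(H_eigs[i_min])                     # -> 0.3, the level closest to s = 0.1

# recovering the H eigenvalue from the folded one needs the sign:
# lambda_H = s ± sqrt(lambda_A)
print(s + folded[i_min] ** 0.5)          # -> 0.3 (up to roundoff)
```

Note that folding squares the condition of the spectrum near s, which is the price paid for avoiding the matrix inverse of shift-and-invert.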
4.2 Implementation and optimization strategies for parallel eigensolvers

Two different hardware technologies have been employed: CPUs and GPUs. Current CPUs have multiple processing cores, making it possible to distribute the workload among the different cores using their multi-core shared-memory architecture. In addition,
CPUs also provide SIMD instructions, which allow performing an operation on multiple data simultaneously. Open Multi-Processing (OpenMP) may be used for explicit multithreaded, shared-memory parallelism, thus providing a portable, scalable model for developers of shared memory parallel applications. OpenMP programs accomplish parallelism exclusively through the use of threads [135]. As detailed in Chapter 2, the GPU architecture allows for the execution of threads on a larger number of processing elements. Although these processing elements are typically much slower than those of a CPU, having a large number of threads may make it possible to surpass the performance of current multi-core CPUs [136]. Another characteristic of parallel programming with GPUs is the ability to start a large number of threads with little overhead [39]. This is unlike traditional CPU threads, where each individual thread is treated as an entity independent of the others, requiring separate resources such as stack memory, and whose creation and management are not cheap [39]. GPU threads, on the other hand, are cheaper to create and manage; since batches of GPU threads are treated alike, it is possible to create a large number of them and run them for a shorter duration. The parallelization task on multiple computing systems can be performed by using MPI for communicating via messages between distributed processes that run in parallel over the network. We combine MPI with OpenMP and CUDA to enable solving tight binding problems with an H matrix that is too large to fit on a single node or that would require an unreasonably long compute time on a single node. We also take advantage of the latest developments in hardware technologies, such as NVIDIA GPUDirect, so as to achieve additional improvements in performance.

4.2.1 MPI-OpenMP

In OpenMP, the goal is usually to parallelize loops. A serial program can be parallelized one loop at a time.
When compiler directives are used, OpenMP will automatically make loop index variables private within the team of threads (master thread + worker threads) and global variables shared. Below is the pseudocode for spMV with OpenMP.

Do i = 1 to Number_of_Rows
    Start = row_index(i)
    Stop = row_index(i+1) - 1
    Sum = 0
    Do k = Start to Stop
        Sum += H(k) * q(col_index(k))
    End Do
    V(i) = Sum
End Do

All non-zero coefficients of matrix H are stored at contiguous memory locations in array H(:), row by row, and the starting offsets of all rows are contained in a separate array row_index(:). Array col_index(:) contains the original column index of each non-zero matrix coefficient. A matrix-vector multiplication with vector q(:) can then be written as shown in the pseudocode. While array H(:) is traversed contiguously, access to q(:) is indexed. The rows of matrix H and the solution vector V(:) are partitioned between threads. The OpenMP compiler directives take care of generating the code for distributing the work and synchronizing across the threads. The MPI-OpenMP hybrid paradigm works well for multi-core CPU nodes connected over a network, since MPI is designed to handle distributed-memory systems. We use MPI across nodes and OpenMP within each node, thus avoiding the extra communication overhead of MPI within the same node. We have divided the problem into a two-level parallelism: MPI is used for coarse-grained parallelism among nodes, while OpenMP is used for fine-grained parallelism between different CPU cores on the same node.

4.2.2 MPI-CUDA

There are many reasons for wanting to combine the two parallel programming approaches of MPI and CUDA. A common reason is to enable solving problems with a data size too large to fit into the memory of a single GPU, or that would require an unreasonably long compute time on a single node. Another reason is to accelerate an existing MPI application with GPUs, or to enable an existing single-node multi-GPU application to scale across multiple nodes. The MPI-CUDA hybrid paradigm is utilized to enable solving large TB calculations on multiple GPUs. The workstation has multiple GPUs that are connected to the same host. Similar to MPI-OpenMP, the problem has been divided into a two-level parallelism.
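The CSR traversal in the spMV pseudocode above can be transcribed directly into a serial form. The following pure-Python sketch uses a hypothetical 3×3 sparse matrix with 0-based indexing, where row_index holds the starting offset of each row plus a final sentinel; each row's dot product is independent, which is exactly what the OpenMP directive distributes across threads.

```python
# 3x3 matrix [[10, 0, 2], [0, 5, 0], [3, 0, 7]] in CSR form
H = [10.0, 2.0, 5.0, 3.0, 7.0]       # non-zeros, stored row by row
col_index = [0, 2, 1, 0, 2]          # original column of each non-zero
row_index = [0, 2, 3, 5]             # start offset of each row + sentinel

def spmv_csr(H, col_index, row_index, q):
    """V = H*q in CSR form: H is traversed contiguously, q is indexed."""
    V = []
    for i in range(len(row_index) - 1):
        start, stop = row_index[i], row_index[i + 1]
        V.append(sum(H[k] * q[col_index[k]] for k in range(start, stop)))
    return V

print(spmv_csr(H, col_index, row_index, [1.0, 1.0, 1.0]))   # [12.0, 5.0, 10.0]
```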
MPI is used for coarse-grained parallelism among GPUs, while CUDA kernels are used for fine-grained parallelism within a single GPU. To further improve the performance of the MPI-CUDA implementation, several techniques have been utilized: the splitting technique, the mixed real-complex arithmetic kernel, the overlap transfer technique and CUDA-aware MPI, which are explained in detail in the following subsections.

4.2.3 Performance enhancement via communication cost reduction

In order to reduce memory usage and traffic at the cost of extra flops, the eigenvalues and the eigenvectors are calculated using minimal information, without saving any subspace vectors, as described in section 4.1. This might initially seem a waste of time but, as previously stated, reducing the subspace size in order to store the qi vectors in memory does not improve overall speed. Furthermore, considerable time would have to be spent transferring the vectors from the GPU to machine RAM. Since the peak bandwidth between the device memory and the GPU is much higher than the peak bandwidth between host memory and device memory, it is important to minimize data transfer between the host and the device. Therefore, it is necessary to keep the entire matrix and the intermediate vectors on the GPU. The advantage of the described algorithm is that it fits in very little memory at the expense of computing more matrix-vector products. This is ideal for graphics cards limited in memory but fast in performing vector operations. Another fundamental advantage of this implementation is the absence of expensive data transfers of the vector qi from the device to the host. Only the scalars αi, βi are transferred at each iteration, since T is diagonalized on the host.

4.2.4 Memory optimization by splitting approach

Memory optimizations are the most important area for performance enhancement. The goal is to maximize the possible atomistic size that can be simulated on the GPU.
The TB Hamiltonian is a sparse matrix with approximately 40 non-zero coefficients per row, with a standard deviation ranging from 3.0 to 4.0. Therefore, the Hamiltonian is stored in compressed sparse row (CSR) format, which stores only the non-zero elements. To enable multithreaded parallelism, we store both the upper and lower triangular blocks. Performance improvements may be possible using alternative sparse matrix representations such as ELLPACK, although it has been shown that CSR becomes very efficient when matrix rows exceed four million [137].

Spin-orbit couplings add imaginary components to the Hamiltonian matrix, doubling the problem size and adding the burden of complex algebra operations. In conventional TB approaches, based on the local atomic spin-orbit interaction, the imaginary part of the Hamiltonian is much smaller than the real part. Therefore, memory can be saved by exploiting the sparsity if we split the complex TB Hamiltonian matrix into its real and imaginary parts and then perform the eigenvalue calculation. The complex spMV is substituted by two multiplications,

V = Mul(Hreal, q) + i Mul(Himg, q)    (4.1)

This has been achieved by designing a new CUDA kernel accepting mixed complex/real arithmetic, as explained in the following subsection 4.2.5.

4.2.5 Mixed real-complex CUDA kernel

Sparse matrix-vector multiplication is an integral part of most numerical methods, and it is a bandwidth-limited operation on current hardware. On cache-based architectures like the GPU, the main factors that influence performance are spatial locality in accessing the matrix and temporal locality in re-using the elements of the vector. The new mixed real-complex CUDA kernel is based on the implementation discussed by Reguly and Giles [138], who show that it can outperform the CUSPARSE library. The main idea of the kernel is to let many threads cooperate on each row during spMV products, thereby increasing data locality and decreasing cache misses.

    int tid = threadIdx.x;
    int coopIdx = threadIdx.x % coop;
    int i = (repeat * blockIdx.x * blockDim.x + tid) / coop;
    __shared__ cuDoubleComplex sdata[BLOCK_SIZE];
    for (int r = 0; r < repeat; r++) {
        cuDoubleComplex localSum = make_cuDoubleComplex(0.0, 0.0);
        int rowPtr = rowPtrs[i];
        int stop = rowPtrs[i+1] - rowPtr;
        /* coop threads cooperate on row i; the real matrix multiplies both
           the real and imaginary parts of the complex vector x */
        for (int j = coopIdx; j < stop; j += coop) {
            localSum.x += values[rowPtr+j] * x[colIdxs[rowPtr+j]].x;
            localSum.y += values[rowPtr+j] * x[colIdxs[rowPtr+j]].y;
        }
        sdata[tid] = localSum;
        /* tree reduction over the coop partial sums of the row */
        for (unsigned int s = coop/2; s > 0; s >>= 1)
            if (coopIdx < s) {
                sdata[tid].x += sdata[tid + s].x;
                sdata[tid].y += sdata[tid + s].y;
            }
        if (coopIdx == 0)
            y[i] = sdata[tid];
        i += blockDim.x / coop;
    }

Figure 4.2: Performance of the spMV operation on GPU employing different data types (Complex/Complex, Real/Real and Complex/Real); time in seconds versus number of atoms.

Two different CUDA streams are used to carry out the two matrix-vector multiplications, because the operations are independent of each other and can be executed in parallel if enough GPU resources are available. For III-V semiconductors every atom has 4 neighbors, so Rmax ≈ 40. In contrast, the imaginary part has Rmax = 2. For this reason,
different tuning strategies are necessary for the two spMV operations. For the spMV operation involving the real part, numerical experiments give the best performance using coop = 8 and repeat = 2, in the notation of Ref. [138] and of the kernel reported above. The spMV involving the imaginary part is performed with coop = 1 and repeat = 1. As seen in Figure 4.2, this hybrid complex/real kernel performs much better than the original implementation based on four real/real spMV operations, which suffered almost a factor of 2× performance degradation. The improvement stems from the fact that the real matrix needs to be fetched only once, decreasing the bandwidth utilization.

4.2.6 Performance enhancement using the overlap technique

Figure 4.3: (Left) Typical sparsity pattern of a TB Hamiltonian and partitioning over four nodes. (Right) Data exchanged between adjacent nodes.

To facilitate the calculation of big nanostructures, MPI is utilized because it is one of the dominant technologies used in HPC today. The distributed parallel computing nature of MPI, along with being portable, efficient and flexible, is ideal for scientific computing bound by memory and speed limitations. However, the challenge with the TB application is that different parts of the TB Hamiltonian matrix are distributed to different nodes and the algorithm is executed independently on each node. Therefore, after each matrix-vector multiplication, the part of the resultant vector that is needed to carry out the next matrix-vector multiplication correctly has to be transferred. This part acts as an overlap between nodes that must be exchanged. The size of the overlap transferred is dictated by the bandwidth of the H matrix. Figure 4.3 shows the typical
sparsity pattern of the TB Hamiltonian matrix, and the right panel shows the overlap data exchange between adjacent nodes. This exchange is a critical parameter in the performance scaling of the parallel implementation. The atomistic structure is reordered using the reverse Cuthill-McKee algorithm before building the Hamiltonian. Since the bandwidth of the reordered matrix is reduced, the overlap that needs to be transferred between nodes is almost halved compared to the original overlap when reordering is not performed. By using this technique, we avoid having to gather the entire resultant vector on each node.

4.2.7 CUDA-aware MPI

Using CUDA-aware MPI makes the algorithm run more efficiently, since all operations required to carry out the message transfer can be pipelined, and acceleration technologies like GPUDirect can be utilized. The goal of this technology is to reduce the dependency on the CPU to manage transfers: with a regular MPI implementation, only pointers to host memory can be passed to the MPI APIs, and one needs to stage GPU buffers through host memory. Further, pinned memory is used to speed up host-to-device and device-to-host transfers in general, since it prevents memory pages from being swapped out. On the GPU test system utilized here for benchmarking, all the GPUs are connected over the same PCI-E bus, and GPUDirect Peer-to-Peer is utilized to achieve high-bandwidth, low-latency communication between the GPUs.

4.3 Benchmarking the Lanczos method

A GaN/AlGaN wurtzite quantum dot, like the one shown in Figure 4.1, is used to perform the benchmark calculations on nanostructures with up to 600,000 atoms, corresponding to an H matrix size of around 12,000,000 and approximately 480,000,000 non-zero elements. Numerical benchmark comparisons have been performed on systems having the following architectures.
• Test system 1 (CUDA, MPI-CUDA): Intel Xeon Processor E5-2620 (6 cores, 2 GHz, Cache 15 MB), 64 GB DDR3 SDRAM (Speed 1333 MHz) and 2 Nvidia Tesla K20c (Chip Kepler GK110 GPU, Processor Clock 706 MHz, Memory Clock 2.6 GHz, Memory Size 5 GB) connected on the same PCI-E, with an operating system based on Linux kernel 3.0.85.

• Test system 2 (MPI-OpenMP): Intel Xeon Processor X5560 (4 cores, 2.8 GHz, Cache 8 MB), 48 GB DDR3 SDRAM (Speed 1333 MHz) connected through a 20 Gbps InfiniBand (4x DDR), with an operating system based on Linux kernel 2.6.30.

• Test system 3 (Sequential, OpenMP): Intel Xeon Processor W3530 (4 cores, 2.8 GHz, Cache 8 MB), 6 GB DDR3 SDRAM (Speed 1333 MHz), with an operating system based on Linux kernel 2.6.30.

The algorithm has been written in Fortran 95 and compiled with Intel Fortran 11.1, whereas the GPU parts are written in C and compiled with CUDA Toolkit 5.5. Here, I concentrate particularly on the sparse matrix-vector multiplication timings and discuss the performance of GPUs in finding one conduction band energy eigenstate. Table 4.1 reports the timings to find the first eigenvalue on a single K20c GPU for increasing size of the problem in terms of the number of atoms. The corresponding Hamiltonian size is given by multiplying the number of atoms by 20, which is the basis size, whereas the number of non-zero elements is approximately given by multiplying the Hamiltonian size by 40 (the average number of non-zeros per row). Table 4.1 also reports the total number of iterations needed to reach convergence. This number varies depending on the starting guess, the quantum dot shape, composition and size. The absolute error on the eigenvalue also varies, since the convergence tolerance is tested once orthogonality is lost and the matrix T has been diagonalized. Therefore, in order to compare performance on different machines it is more instructive to compute the time per iteration, which is directly related to the time per spMV multiplication. It has been observed that the timings for the memory-optimized algorithm are slightly worse than the original complex/complex algorithm, despite the reduction of the overall number of floating point operations.
This is attributed to the fact that two distinct matrices now need to be accessed, as given by equation 4.1. The Kepler K20c GPU used for this work has 5 GB of memory, which is sufficient for a nanostructure with up to ≈ 260,000 atoms. Splitting the H matrix saves 35-40% of memory, as shown in Figure 4.4, which enables the simulation of structures of up to 350,000 atoms on a single GPU. Extra time is spent on splitting and further memory optimization of the
H matrix. This overhead seems acceptable in the memory versus time trade-off.

Table 4.1: Results for energy eigenstate calculation using CUDA on an Nvidia Kepler K20c GPU (Test system 1)

                        CUDA implementation                  Memory-optimized CUDA implementation
Number of    Error    Runtime  Lanczos     Time/iter    Error    Runtime  Lanczos     Time/iter
atoms        ×10−6    (sec)    iterations  (msec)       ×10−6    (sec)    iterations  (msec)
8,039        7.9      3.04     950         3.2          5.3      3.5      960         3.7
24,650       1.4      8.6      1330        6.5          1.1      13.0     1840        7.1
79,495       2.7      76.9     4180        18.4         8.7      62.5     3160        19.8
151,472      9.7      70.4     1940        36.3         3.6      76.4     1960        39.0
203,376      8.0      88.6     1580        56.1         8.2      95.1     1580        60.2
263,379      3.9      141.6    1940        73.0         1.6      149.7    1960        76.4
351,600      -        -        -           -            1.3      186.8    1940        96.3

Figure 4.4: Memory utilization by the TB Hamiltonian matrix on GPU (tight binding Hamiltonian vs. tight binding Hamiltonian optimized for GPU; memory in MB versus number of atoms).

Table 4.2 shows timings for the distributed calculations on two Kepler K20c GPUs. First, it is observed that the parallel implementation can be slower than the single GPU implementation when the data transfer is performed via host memory (first column). This happens because data transfer is the speed-limiting factor. Performance can be substantially improved using CUDA-aware MPI implementations exploiting the PCI-E bus data transfer supported by the K20c and available in CUDA Toolkit 5.5 onwards. Such an architecture can perform a peer-to-peer transfer between the GPU memories directly at a rate of 6.3 GB/s, boosting the computation speed by a factor of 2.7× (second column
Table 4.2: Results for energy eigenstate calculation using the MPI-CUDA implementation running on two Nvidia Kepler K20c GPUs (Test system 1). Error in ×10−6, runtime in sec, time per iteration in msec.

Number of atoms   Implementation                  Error   Runtime   Lanczos iter.   Time/iter
79,495            via host memory                 7.6     175.29    5440            32.2
                  via PCI                         6.4     63.1      5540            11.4
                  memory-optimized via PCI        7.1     67.3      5520            12.2
203,376           via host memory                 7.5     305.31    3520            86.7
                  via PCI                         7.5     117.2     3520            33.3
                  memory-optimized via PCI        7.6     119.6     3520            34.0
351,600           via host memory                 2.2     379.78    2580            147.2
                  via PCI                         4.5     94.5      1780            53.1
                  memory-optimized via PCI        0.16    195.1     3400            57.4
601,766           memory-optimized via PCI        2.7     268.2     2640            101.6

of Table 4.2). In this second case, the average performance of the parallel implementation is about a factor 1.7× faster than the single-GPU runs shown in Table 4.1. The largest structure that can fit on two GPUs consists of a little more than 600,000 atoms, which requires 6.0 GB of storage in total. As already stated, this requires a splitting strategy to be employed. In order to put the GPU performance in the right perspective, a benchmark comparison of the same algorithm running in parallel on multi-core CPU nodes connected through a high-speed, high-bandwidth (20 Gbit/s) InfiniBand network, as available on most HPC facilities, has been performed (Test system 2). The best approach is a hybrid MPI-OpenMP implementation in which the matrix is distributed over quad-core nodes and every matrix-vector multiplication is parallelized using four OpenMP threads. Table 4.3 summarizes the results of these runs. Timings for 2, 4, 8 and 16 MPI processes, for a total of 8, 16, 32 and 64 cores respectively, are reported. The relevant performance is also graphically shown in Figure 4.5.
We can observe that, using MPI-OpenMP over InfiniBand, scaling is almost linear when moving from 2 nodes to 8 nodes, but it degrades for bigger systems: the overlap, which is dictated by the bandwidth of the Hamiltonian matrix, is larger for large-scale systems and needs to be transferred after each matrix-vector multiplication. Hence the MPI message transfers and the synchronizations among the processes after each matrix-vector multiplication take a substantial amount of time and act as a speed-degrading factor.

Figures 4.6 and 4.7 show the time in seconds and the performance in Gflops per Lanczos iteration for: a single CPU core; a quad-core CPU using OpenMP; the standard GPU implementation on a single Kepler K20c GPU; the memory-optimized implementation (MOI) on a Kepler K20c GPU; the MPI-OpenMP implementation on 2, 4, 8 and 16 quad-core CPUs; the MPI-CUDA implementation on two Kepler K20c GPUs
where MPI communication is done via host memory and via PCI, respectively; and the memory-optimized (MOI) MPI-CUDA implementation on two Kepler K20c GPUs with exchange via the PCI-E bus using GPUDirect.

Figure 4.5: Time comparison of the Lanczos iteration using MPI-OpenMP on an HPC cluster connected via InfiniBand.

Figure 4.6: Time taken per Lanczos iteration for different implementations and technologies.

The performance reported in Gflops in Figure 4.7 is given by ((number of non-zero elements in H × number of multiply-add operations in the algorithm)/time per Lanczos
iteration) obtained on the GPU, compared to a single quad-core Xeon CPU as described above (Test system 3), using OpenMP multithreading on the M × v operations, and to the MPI-OpenMP implementation on 2, 4, 8 and 16 quad-core CPUs (Test system 2).

Table 4.3: Results for energy eigenstate calculations using MPI-OpenMP (Test system 2)

Number of atoms   Error        MPI nodes   Runtime (sec)   Lanczos iter.   Time/iter (msec)
79,495            6.8 ×10−6    2           428.58          4670            91.8
                  7.0 ×10−6    4           342.21          4660            64.2
                  7.2 ×10−6    8           212.09          4630            45.9
                  6.2 ×10−6    16          147.9           4710            31.4
203,376           3.8 ×10−6    2           918.89          3280            280.2
                  7.1 ×10−6    4           755.47          3210            235.3
                  4.2 ×10−6    8           510.65          3260            156.7
                  1.1 ×10−6    16          357.82          3330            107.2
351,600           1.2 ×10−7    2           1800.93         3470            519.0
                  1.9 ×10−7    4           1460.34         3420            427.0
                  0.7 ×10−7    8           864.12          3490            247.5
                  1.9 ×10−7    16          671.5           3420            196.3
601,766           3.3 ×10−6    2           2562.07         3310            773.8
                  5.3 ×10−6    4           2039.16         3200            637.2
                  8.6 ×10−6    8           1124.02         3050            368.6
                  5.1 ×10−6    16          1049.46         3220            325.9

Figure 4.7: Performance comparison for the Lanczos iteration between different implementations and technologies.

A performance gain of a factor of more than 40× can be
achieved on the GPU as compared to a single CPU core, and a factor of 10× compared to the OpenMP implementation on a quad-core CPU. The point corresponding to 351,600 atoms is only possible with memory optimization. Apart from some oscillations, we observe quite opposite trends of numerical efficiency between the GPU and the CPU: the first steeply rises and then saturates with problem size, while the second steadily degrades. This is attributed to the large memory bandwidth (208 GB/sec) of the Kepler K20c, which is the ultimate speed-limiting factor for the large matrices handled here. It is also observed that on small systems there is no appreciable GPU speedup. This is because memory allocation and transfer of data to the GPU take a considerable amount of time.

Figure 4.8: Speed comparison for spMV between implementations on each of the technologies.

Comparing the MPI-OpenMP performance with MPI-CUDA, for the smallest structure of 79,495 atoms we obtain a time per iteration of 45.9 msec on 8 nodes, whereas on the same structure the best MPI-CUDA performance is 11.4 msec, corresponding to an acceleration of 4.0×. Even compared to the slowest MPI-CUDA via host memory, the GPU algorithm has a speedup factor of 1.4×. For the larger structures, the gain on the GPU is even further increased. For the case of 351,600 atoms, the speedup factors range from 4.6× for MPI-CUDA via PCI to 1.68× for MPI-CUDA via host memory. The largest structure of 601,766 atoms can be compared only
in the case of the memory-optimized strategy, and for this the acceleration factor is 3.6×. These comparisons can be appreciated in Figure 4.6, where it is possible to see that the GPU implementation on two cards outperforms the parallel implementation of the same algorithm running on CPUs. Figure 4.8 shows the speed comparison for the spMV operations; as seen, the GPUs outperform every other implementation. Even a single GPU is faster than 16 quad-core nodes connected by an InfiniBand network. Clearly, the drawback of this GPU implementation is that it faces memory limitations that prevent scaling the system size above a certain limit. Nevertheless, the amount of memory hosted by GPUs is likely to increase in the future, as the latest NVIDIA Kepler K80 already has 24 GB of device memory. As demonstrated by these benchmarks, fast direct GPU inter-communication is needed for high performance. Currently, multiple GPU cards can be interconnected via PCI switches to a single I/O hub, although a system with 4 GPUs gives optimal parallel performance.

4.4 Summary

The Lanczos method has been fine-tuned for memory-limited GPUs. Advanced optimization strategies and techniques that take into account the characteristics of the sp3d5s* + spin-orbit parametrization of the Hamiltonian matrix have been developed and utilized to obtain optimal performance. The whole algorithm has been developed using CUDA and runs entirely on the GPU. Furthermore, parallel distribution over several GPUs has been attained using MPI; the implementation is fully vectorized and scales with the number of GPUs. Benchmark calculations performed on a GaN/AlGaN wurtzite quantum dot with up to 601,766 atoms have been presented, and the GPU results have been compared to other available computing technologies.
Chapter 5

GPU-focused comprehensive study of popular eigenvalue methods

As already outlined in Chapter 3, there are several methods that can be used to calculate the needed eigenstates of the H matrix. Given the variety of possible methods, it is still unclear which one is best suited and how their performance compares in a given scenario. However, a few methods are more widely used given their implementation feasibility, convergence characteristics, accuracy and reliability. Methods such as Lanczos, Jacobi-Davidson and conjugate gradient are popular and widely utilized in tight binding calculations [139-141]. Recently, a new method called FEAST has been gaining popularity [142, 143]. Hence, studying, optimizing and benchmarking them on recent HPC and GPU architectures is of importance for the given application domain.

Today, larger and faster computing systems are widely accessible. Supercomputers and high-end computing systems are utilized to accelerate computation in parallel distributed, cluster or grid computing settings, and the advent of GPUs has grasped the attention of most of the scientific computing community. Developing algorithms that can scale over such systems is an important component for translating the hardware features into actual beneficial speedups. In recent times, an extensive effort has been put into translating algorithms initially designed for sequential processors to modern HPC systems, which normally deal with either SIMD or multiple-instruction-multiple-data (MIMD) scenarios. However, many aspects need to be considered to obtain speedups in parallel computing. Hence, often
this sequential-to-parallel transition is not straightforward and requires a deeper understanding of the system architecture and of the eigenvalue method itself.

There are many challenging questions to be considered in the choice of method. Some of these questions include: which method takes the least total computation time and is well suited for GPUs, given their limited resources? Which approach is robust in convergence when used with nanostructures having a dense energy spectrum? Also, in a multi-GPU scenario where data has to be shared among GPUs, it is important to identify the implementation that deals well with the hardware limitations. Characteristics of the method, like its ratio of compute-intensive to memory-intensive operations, which determine the achievable speedup in hybrid implementations, also need to be considered. Finally, it is important to find the method that scales best in a multi-GPU distributed setup. Having identified the aspects that need to be taken into account and having proposed a design for a parallel computing eigensolver in Chapter 4, here we test and compare some of the popular eigenvalue algorithms for memory utilization, execution time, implementation complexity (feasibility) and convergence. We also benchmark a robust implementation of each algorithm on a multi-GPU system as well as on an HPC cluster.

5.1 GPU based implementations of popular eigenvalue methods

As we know, GPUs have limited memory, and the peak bandwidth between the device memory and the GPU is much higher than the peak bandwidth between host memory and device memory. Therefore, as already shown in Chapter 4, it is crucial to minimize the data transfer between the host and the GPU by keeping the Hamiltonian matrix and the search subspace in device memory. For this reason, the TB Hamiltonian matrix is converted to single precision format prior to transfer to the GPU's global memory.
The algorithms are implemented using mixed single/double precision arithmetic to ensure highly accurate solutions. Since the Lanczos method was detailed in Chapter 4, its parallel design and implementation details are not repeated in the subsequent subsections.
5.1.1 Jacobi-Davidson method

The Jacobi-Davidson method is an iterative subspace method for computing one or more eigenpairs of large sparse matrices. In this method, each iteration has two phases: the subspace extraction and the subspace expansion. For the subspace expansion phase, consider an approximate eigenpair (θi, ui) close to (λi, vi), with ui ∈ U, where U is the subspace, and θi = (ui* H ui)/(ui* ui) is the Rayleigh quotient of ui, taken as the approximate eigenvalue because it minimizes the two-norm of the residual r = H ui − θi ui. To expand U in an appropriate direction, we look for an orthogonal correction t ⊥ ui such that ui + t satisfies the eigenvalue equation:

H(ui + t) = λi(ui + t)    (5.1)

We try to find the eigenvalues closest to some given target τ; initially, we take this to be the same as the chosen Lanczos shift, τ = s. The above equation can be rewritten as

(H − τI)t = −r + (λi − θi)ui + (λi − τ)t    (5.2)

Since t and |λi − τ| are small, the last term can be neglected. Multiplying both sides of equation 5.2 by the orthogonal projection I − ui ui*, we obtain

(I − ui ui*)(H − τI)(I − ui ui*)t = −r    (5.3)

where t ⊥ ui. We solve equation 5.3 only approximately using the generalized minimal residual method (GMRES), and its approximate solution is used for the expansion of the subspace [144]. To save GPU memory, the process restarts the Jacobi-Davidson method with a few of the most recently found ui; in this way, the dimension of the search subspace is restricted [145]. In order to prevent the found eigenvalues from re-entering the computational process, the new search vectors are explicitly orthogonalized against the computed eigenvectors.

As stated above, the interior eigenvalues are of interest. The Ritz vectors are poor candidates for restart, since they converge monotonically towards exterior eigenvalues. One solution to this problem is to use the harmonic Ritz vectors. The harmonic Ritz values
are the inverses of the Ritz values of H−1. Since the H matrix is Hermitian, the harmonic Ritz values for the shifted matrix (H − τI) converge monotonically towards the eigenvalues closest to the target value τ. The search subspaces for the shifted and the unshifted matrix coincide, hence harmonic Ritz pairs can be computed for any shift. The harmonic Ritz vector for the shifted matrix can be interpreted as maximizing a Rayleigh quotient for (H − τI)−1. It represents the best information available for the wanted eigenvalue; therefore, it is also the best candidate as a starting vector after the restart [146].

The GMRES method is designed to solve nonsymmetric linear systems. The most popular form of GMRES is based on the modified Gram-Schmidt procedure and uses restarts. If no restarts are used, GMRES converges in no more than N steps. This is of no practical value here, since N is very large; moreover, the storage and computational requirements in the absence of restarts are prohibitive. However, there exist cases for which the method stagnates and convergence takes place only at the Nth step; for such systems, any choice of restart less than N fails to converge.

Algorithm: The GMRES method

    Start: choose x0 and compute r0 = f − A x0 and v1 = r0/||r0||
    Iterate: for j = 1, 2, ..., m do:
        h(i,j) = (A vj, vi),  i = 1, 2, ..., j
        v̂(j+1) = A vj − Σ_{i=1..j} h(i,j) vi
        h(j+1,j) = ||v̂(j+1)||,  v(j+1) = v̂(j+1)/h(j+1,j)
    Form the approximate solution:
        xm = x0 + Vm ym, where ym minimizes ||β e1 − H̄m y||, y ∈ R^m
    Restart:
        compute rm = f − A xm; if satisfied then stop,
        else set x0 = xm, v1 = rm/||rm|| and iterate again.

The least squares problem min ||β e1 − H̄m y|| is solved by factorizing H̄m into Qm Rm using plane rotations. The difficulty lies in choosing an appropriate value for the restart parameter: if it is too small, GMRES may be slow to converge or fail to converge entirely.
A restart value larger than necessary involves excessive work and uses more storage. There are no definite rules governing the choice of restart; it is a matter of experience. More details on the practical implementation of the GMRES method can be found in reference [148].
The correction equation is solved to an accuracy of just 10−1; this is sufficient to keep the number of outer iterations between 4 and 10, with the internal restart set to 10. GMRES, although more expensive than other linear solvers, is chosen because it is found to be more stable in solving the correction equation for the TB Hamiltonian [147, 148]. The computation could be further improved by treating the H matrix with a preconditioner. However, the preconditioner would occupy a similar amount of memory as the actual matrix and also increase the number of crucial, time-consuming matrix-vector multiplications per iteration. Hence, it may not be a wise choice for a GPU-accelerated solver where 10−1 accuracy is sufficient.

5.1.2 FEAST method

The aim of the FEAST algorithm is to actually compute the eigenvectors instead of approximating them, unlike the Lanczos and Jacobi-Davidson methods. It yields all the eigenvalues and eigenvectors within a given search interval [λmin, λmax]. FEAST relies on the Rayleigh-Ritz method [123, 124] for finding the eigenvector space V in some enveloping space U ⊇ V. Let Γ be a simple closed differentiable curve in the complex plane that encloses exactly the eigenvalues λ1, ..., λm, and let z be the contour point. Using the Cauchy integral theorem, it can easily be shown that

V V* = (1/2πi) ∮Γ (zI − H)−1 dz = Q    (5.4)

Next, choose a random matrix Y ∈ C^(n×m0), where m0 is the size of the working subspace, slightly larger than the number m of eigenvalues within the search interval. The expression in 5.4 leads to a new set of m0 independent vectors Q(n×m0) = [q1, q2, ..., qm0], obtained by solving linear systems along the contour, which form U = QY. It follows that U = span(U) ⊇ V is a candidate for the space used in the Rayleigh-Ritz method. The matrix U can be computed numerically; for our TB Hamiltonian matrix, 3 to 8 integration points are sufficient. Then, for each integration point z, a block linear system (zI − H)Ui = Yi needs to be solved, each with m0 right hand sides.
Notice that the matrix changes with z throughout the run.

The FEAST algorithm can be parallelized in several ways. First, the interval [λmin, λmax] can be split and each part treated separately. Also, for each contour point, the block linear system can be solved independently of the others. Finally, each
linear system can in principle be solved in parallel [149]. Here, FEAST has not been parallelized using any of these strategies. Instead, the solver that finds the solution of each linear system has been parallelized using our multi-GPU enhanced techniques, since the solution of the block linear system is the most expensive part of the method. The conjugate gradient squared (CGS) method is employed to solve the inner independent block linear systems, since the cost per iteration of CGS is lower than that of GMRES in terms of both computation and memory [144, 150]. The inner independent linear systems need to be solved to a high accuracy of at least 10−6. For non-converged linear systems, the solver can be stopped after a few hundred iterations. The CGS method is outlined below.

Algorithm: The CGS method

    Choose an initial guess x0 and r̃0
    r0 = b − A x0
    u(−1) = w(−1) = 0,  α(−1) = σ(−1) = 1
    for k = 0, 1, 2, ... do
        ρk = (rk, r̃0)
        βk = (−1/α(k−1)) (ρk/σ(k−1))
        vk = rk − βk u(k−1)
        wk = vk − βk (u(k−1) − βk w(k−1))
        c = A wk
        σk = (c, r̃0)
        αk = ρk/σk
        uk = vk − αk c
        x(k+1) = xk + αk (vk + uk)
        if x(k+1) is accurate enough, then stop
        otherwise r(k+1) = rk − αk A(vk + uk) and iterate

Often, convergence is improved by using an incomplete factorization method based on Gaussian elimination, like incomplete LU (ILU), as a preconditioner [151]. However, for the TB Hamiltonian matrix under consideration, ILU factorization with fill-in level 0 is not sufficient for convergence: if utilized, it takes more iterations to converge compared
  • 68.
    to the casewhere a preconditioner is not employed and hence, we need to perform higher level of factorizations. As the fill-in level in an ILU decomposition increases, the quality of the ILU preconditioner improves. This also changes the sparsity of the preconditioner matrix. Thus more accurate ILU preconditioners require more memory to such an extent that eventually the running time of the algorithm increases, even though the total number of iterations in the linear solver decreases. Also, the parallelization of ILU involves a lot of data transfers between the nodes since almost the entire TB Hamiltonian matrix is needed on each node and it takes a noticeable amount of compute time because a fresh ILU factorization is needed to be computed for each contour point as the matrix keeps on changing. Therefore, a FEAST implementation that utilizes an incomplete factorization based method to generate a preconditioner matrix is not implemented. To obtain a higher speedup and low memory foot print, parallel preconditioners that are better suited for GPU parallelism must be developed. 5.2 Benchmarking results, comparison and discussion All benchmarks are performed by analyzing the algorithms to find the lowest 8 conduction energy eigenstates of atomistic quantum dots similar to the one show in Figure 5.1. Here, the Lanczos, Jacobi-Davidson and FEAST methods are compared and I especially focus on their ability to compute multiple eigenpairs. Figure 5.1: (Left) Cubical wurtzite GaN/AlGaN quantum dot showing the core with 30% Aluminum. (Right) a central slice of the cube. Atomistic description: in yellow Aluminum, in red Gallium 66
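For reference, the CGS recurrence outlined in Section 5.1.2 can be sketched in plain Python, with dense NumPy arrays standing in for the sparse GPU kernels. This is a minimal illustrative implementation, not the production solver; the test matrix is a hypothetical well-conditioned one, since plain CGS can behave erratically on hard systems.

```python
import numpy as np

def cgs(A, b, tol=1e-10, maxiter=200):
    """Conjugate Gradient Squared, following the recurrence in the text."""
    x = np.zeros_like(b)
    r = b - A @ x
    r_tld = r.copy()                     # shadow residual r~0
    u = np.zeros_like(b)
    w = np.zeros_like(b)
    alpha = sigma = 1.0
    for _ in range(maxiter):
        rho = r @ r_tld
        beta = (-1.0 / alpha) * (rho / sigma)
        v = r - beta * u
        w = v - beta * (u - beta * w)
        c = A @ w                        # matrix-vector product (CGS needs two per iteration)
        sigma = c @ r_tld
        alpha = rho / sigma
        u = v - alpha * c
        x = x + alpha * (v + u)
        r = r - alpha * (A @ (v + u))
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
    return x

# Hypothetical diagonally dominant test system
rng = np.random.default_rng(0)
n = 30
A = 4.0 * np.eye(n) + 0.05 * rng.standard_normal((n, n))
b = np.ones(n)
x = cgs(A, b)
```

The initialization u = w = 0 with α = σ = 1 makes the first step independent of the (otherwise arbitrary) value of β0, exactly as in the outline above.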
The GPU implementation of the algorithms and the linear solver makes use of the TB Hamiltonian splitting approach, the mixed real-complex arithmetic matrix-vector multiplication CUDA kernel, and all of the parallel GPU implementation techniques and optimization strategies discussed in Chapter 4. In the case of the FEAST method, however, the matrix keeps changing with the contour point as zI − H (or z*I − H). It is therefore not optimal to use the splitting approach, since tests have shown that a significant amount of time is spent building the split matrix and dropping the zeros. The Lanczos algorithm has been fully ported to the GPU and vectorized to scale with MPI parallelization on multi-GPU workstations, as shown in Chapter 4. Similarly, the Jacobi-Davidson algorithm has been implemented on the GPU, along with the GMRES method used as the linear solver for the Jacobi-Davidson correction equation. In order to spare GPU memory, the subspace vectors have been saved in host memory. This strategy makes it possible to treat larger systems at the expense of more device-host communication; a comparison between the Jacobi-Davidson algorithm with and without the subspace in device memory is shown in the following subsections. Concerning FEAST, only the linear solver (CGS) has been ported to the GPU, given that this is the most time-consuming part of the algorithm. In this respect, Lanczos and Jacobi-Davidson can be considered pure GPU implementations and FEAST a hybrid CPU-GPU one, even though 98% of the total time is spent on the GPU solving the block linear systems. The relevant details of the test hardware are given below.
• Test system 5 (multi-GPU workstation): Intel Xeon E5-2620 (6 cores, 2 GHz, 15 MB cache), 64 GB DDR3 SDRAM (1333 MHz), and 2 Nvidia Tesla K40 (Kepler GK110B GPU, 745 MHz processor clock, 2880 CUDA cores, 3.0 GHz memory clock, 12 GB memory, 1.43 Tflops peak performance) + 2 Nvidia Tesla K20 (Kepler GK110 GPU, 706 MHz processor clock, 2496 CUDA cores, 2.6 GHz memory clock, 5 GB memory, 1.17 Tflops peak performance) connected on the same PCI-E, with an operating system based on Linux kernel 3.0.85.

• Test system 6 (HPC cluster): 2208 compute nodes, each with 2 Intel Xeon X5570 (4 cores, 2.93 GHz, 8 MB cache) and 24 GB DDR3 SDRAM (1066 MHz). Nodes are connected through an InfiniBand QDR network with a non-blocking fat-tree topology, with a total peak performance of 207 Tflops and an operating system based on Linux kernel 2.6.32.

5.2.1 Eigensolver evaluation on a multi-GPU workstation

Figure 5.2: Time comparison between methods on 1 Kepler GPU for the calculation of 8 energy eigenstates

Figure 5.3: Time comparison between methods on 4 Kepler GPUs for the calculation of 8 energy eigenstates
On a single GPU, Jacobi-Davidson with the subspace in host memory performs almost 2× faster than Lanczos and 13× faster than FEAST, as seen in Figure 5.2. However, when we move from one GPU to a multi-GPU scenario, as shown in Figure 5.3, Jacobi-Davidson with the subspace in host memory performs only 1.4× faster than Lanczos when the first few eigenstates are sought. The decrease in speedup compared to the single-GPU implementation is attributed to the sparse mixed real-complex matrix-vector operations becoming less significant, as seen in Table 5.1. Moreover, since the subspace is stored in host memory, the method incurs more host-GPU data movement than Lanczos, as seen in Figure 5.7; this is the main speed-limiting factor for any parallel implementation. To attain ideal scaling there should be no data dependencies or synchronizations between GPUs, and there should be enough data to utilize all the GPU cores efficiently. As noticed from Figures 5.2, 5.3, 5.8 and 5.9, with the Jacobi-Davidson implementation storing the subspace in device memory it is only possible to fit up to a 151,472-atom quantum dot on GPUs with a 5 GB memory limit. Therefore, as already stated, it is crucial to employ the implementation that spares memory by moving the subspace to host memory. The rest of the discussion in the following subsections refers to the Jacobi-Davidson method with the subspace stored in host memory. Figures 5.4, 5.5 and 5.6 show the scaling of each method over multiple GPUs. We observe that the Lanczos and FEAST methods exhibit strong scaling for a large quantum dot, while the ample data movement in the Jacobi-Davidson implementation, due to the subspace being stored in host memory, impedes its scaling performance.
Figure 5.4: Scaling of the Lanczos method on 1 to 4 GPUs

Figure 5.5: Scaling of the Jacobi-Davidson (subspace in host memory) method on 1 to 4 GPUs
Figure 5.6: Scaling of the FEAST method on 1 to 4 GPUs

The profiling results from a data-movement perspective for the 151,472-atom quantum dot are shown in Figure 5.7. Notice that Lanczos is a compute-intensive algorithm: almost 99% of the time is spent on computation, with minimal data transfer, which occurs only at launch as the matrix is loaded onto GPU memory. In the Jacobi-Davidson method, by contrast, host-to-device and device-to-host transfers account for 15-20% of the total effective time, since the subspace is stored in host memory. The CGS method used to solve the block linear systems within FEAST imposes an ample amount of device-to-device data transfer, accounting for 10-25% of the total computation time. We attain a peak bandwidth of 7.45 Gbit/s between the host and the device.
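The mixed real/complex SpMV kernel that dominates the compute profiles in Tables 5.1-5.3 implements the Hamiltonian-splitting approach of Chapter 4. Numerically it amounts to the following sketch, where dense NumPy arrays stand in for the single-precision CSR blocks used on the GPU (the helper name `split_spmv` and the toy matrices are illustrative assumptions):

```python
import numpy as np

def split_spmv(H_re, H_im, q):
    """y = H q with H = H_re + i*H_im stored as two real matrices.

    The complex product is assembled from real-arithmetic products; memory
    is saved because the imaginary part of the TB Hamiltonian is far
    sparser than the real part and each part can be stored separately."""
    return (H_re @ q.real - H_im @ q.imag) + 1j * (H_re @ q.imag + H_im @ q.real)

rng = np.random.default_rng(1)
n = 64
H_re = rng.standard_normal((n, n))
H_re = (H_re + H_re.T) / 2                       # real symmetric part
H_im = np.triu(rng.standard_normal((n, n)), 1)
H_im = H_im - H_im.T                             # real antisymmetric part -> H Hermitian
q = rng.standard_normal(n) + 1j * rng.standard_normal(n)
y = split_spmv(H_re, H_im, q)                    # equals (H_re + 1j*H_im) @ q
```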
Figure 5.7: Percentage of time taken by memory and compute operations on (Left) 1 GPU and (Right) 4 GPUs

The profiling tests have also revealed that, given the sequential nature of the iterative algorithms and the pure GPU implementation with minimal data transfer, it is not possible to obtain any significant overlap between memory copies and computation. Only in the Jacobi-Davidson method is a 3% compute/memory-copy overlap obtained, since the subspace vectors are stored in host memory; this number is expected to increase with the size of the quantum dot. Tables 5.1, 5.2 and 5.3 show the profiling results for the compute operations of the algorithms for the 151,472-atom quantum dot. In all three methods, sparse matrix-vector multiplication is the most important computational task. However, when going from the single-GPU to the multi-GPU implementation of the Jacobi-Davidson method, the dense subspace-vector multiplication gains significance over the sparse Hamiltonian matrix-vector multiplication. Notice in Table 5.1 that the GPU occupancy for this operation is very low; it would therefore be best to offload it onto the CPU. Increasing the warp efficiency maximizes the utilization of GPU compute resources; a low value indicates divergent branches. As the size of the nanostructure increases, usually more energy states are needed, and these states happen to be closely spaced. This poses a challenge for realistic nanostructure simulations, since the eigenvalues become less distinct. Investigation has shown that Jacobi-Davidson is the most robust method in terms of convergence. Even for closely spaced energy states, the algorithm performs fairly well compared to the other
Table 5.1: Profiler output for the 151,472-atom quantum dot, listing the most significant compute operations within the Jacobi-Davidson method with the subspace stored in host memory

| Operation | Single GPU: time / occupancy / warp eff. | Multi-GPU: time / occupancy / warp eff. | Shared memory | Registers |
| Mixed complex-real SpMxV, Mul(Hreal, qcomplex) | 45.30% / 0.991 / 90.89% | 32.20% / 0.972 / 94.05% | 4096 | 28 |
| Vector operations y = y + αx | 15.50% / 0.976 / 100.00% | 12.20% / 0.942 / 100.00% | 0 | 20 |
| Dense MxV operation | 14.70% / 0.197 / 89.35% | 37.40% / 0.201 / 89.33% | 10240 | 60 |
| Dot product | 13.80% / 0.497 / 100.00% | 8.30% / 0.482 / 100.00% | 1024 | 28 |
| Shift matrix | 3.00% / 0.998 / 69.94% | 1.70% / 0.997 / 73.14% | 0 | 8 |

Table 5.2: Profiler output for the 151,472-atom quantum dot, listing the most significant compute operations within the Lanczos method

| Operation | Single GPU: time / occupancy / warp eff. | Multi-GPU: time / occupancy / warp eff. | Shared memory | Registers |
| Mixed complex-real SpMxV, Mul(Hreal, qcomplex) | 84.20% / 0.941 / 91.14% | 82.80% / 0.924 / 94.13% | 4096 | 29 |
| Mixed complex-real SpMxV, Mul(Himag, qcomplex) | 3.20% / 0.876 / 42.02% | 3.50% / 0.829 / 51.87% | 0 | 32 |
| Vector operations y = y + αx | 7.80% / 0.781 / 100.00% | 8.30% / 0.748 / 100.00% | 0 | 14 |

Table 5.3: Profiler output for the 151,472-atom quantum dot, listing the most significant compute operations within the CGS method (linear solver for FEAST)

| Operation | Single GPU: time / occupancy / warp eff. | Multi-GPU: time / occupancy / warp eff. | Shared memory | Registers |
| Complex SpMxV, Mul(Hcomplex, qcomplex) | 85.50% / 0.993 / 89.83% | 83.80% / 0.976 / 93.07% | 4096 | 31 |
| Vector operations y = y + αx | 11.70% / 0.973 / 100.00% | 13.60% / 0.923 / 100.00% | 0 | 16-21 |
| Dot product | 2.70% / 0.497 / 100.00% | 2.40% / 0.491 / 100.00% | 1024 | 28 |
methods; typically 300-600 iterations are sufficient to find the first few energy states. Experience shows that for fast convergence of Jacobi-Davidson, the minimum dimension of the subspace can safely be restricted to 4 more than the number of wanted energy states, and the maximum dimension needs to be at least 10 more, i.e. in this case minimum = 8+4 and maximum = 8+10, since 8 energy states are sought. In the Lanczos method, convergence degrades drastically for a dense eigenvalue spectrum, and the convergence rate falls as the size of the quantum dot increases; for big systems, around 10,000-20,000 Lanczos iterations are usually needed to find each energy state. Similarly, FEAST needs more contour points and a bigger search space to improve convergence, which also translates into more work and more memory per FEAST iteration. Typically, 10-25 FEAST iterations are sufficient for good accuracy. Comparing the accuracy of the methods against direct diagonalization carried out on a small nanostructure, FEAST delivered results to an absolute accuracy of 10⁻¹¹, while the Lanczos and Jacobi-Davidson methods delivered an absolute accuracy of 10⁻⁶. The convergence stopping criterion in all three methods was set to 10⁻⁵ eV.

Figure 5.8: Memory consumption of the methods on 1 GPU

Regarding memory occupancy, as shown in Figures 5.8 and 5.9, the single-GPU Lanczos implementation occupies the least memory since subspace vectors are not stored, whereas the slightly higher memory occupancy of CGS used in the FEAST
Figure 5.9: Memory consumption of the methods on 4 GPUs

method can be attributed to the original complex TB Hamiltonian matrix, since the splitting technique was not used. For the Jacobi-Davidson method, a subspace of 8+10 vectors is needed for the basis and another 8+10 vectors for the projection of the H matrix onto this subspace. If the subspace is stored on the GPU, the feasible simulation size of the quantum dot is halved. In a multi-GPU system, the TB Hamiltonian is divided equally among the GPUs. As the Hamiltonian size per node is reduced, the subspace and temporary vectors required by the implementation scheme gain importance and take over from the Hamiltonian as the chief memory consumer. One advantage of the Lanczos method over the other methods is that, since each eigenstate is calculated one at a time, a degenerate energy state can be calculated with just one matrix-vector multiplication; once found, the eigenpair is projected out and the other distinct energy states are calculated. However, Jacobi-Davidson is also found to be robust in this case, since in most cases it finds the degenerate state within a few iterations using harmonic extraction.

5.2.2 Eigensolver evaluation on an HPC cluster

As described for Test system 6, each node has a dual quad-core CPU with 24 GB of main memory. A hybrid MPI-OpenMP (multi-process/multi-thread) implementation has been
Figure 5.10: Time performance comparison between the Lanczos, Jacobi-Davidson and FEAST methods on 4, 8, 16 and 32 nodes of the HPC cluster for the calculation of 8 energy eigenstates

employed for each of these methods. The benchmark calculation has been performed with 4, 8, 16 and 32 MPI processes and a constant 8 OpenMP threads on each node, corresponding to 32, 64, 128 and 256 CPU cores in use. Figure 5.10 shows the weak scaling, while Figures 5.11, 5.12 and 5.13 show the strong-scaling results for the benchmark calculation performed on the HPC cluster. Memory analysis shows no significant difference in memory consumption between splitting the Hamiltonian over 4 nodes or over 32 nodes, because the subspace and temporary vectors outweigh the TB Hamiltonian matrix, which has been highly memory-optimized using single-precision storage and the splitting technique. Of the three methods considered, Lanczos is the most memory efficient, given that no subspace vectors are saved because of the choice of more flops over bytes. It is followed by the FEAST method using CGS as the linear solver, which requires 3.2× more memory than Lanczos, mainly because a search space bigger than the number of eigenpairs in the given interval is needed. The Jacobi-Davidson method is found to be the most memory expensive, given its requirement to save an adequate subspace and to solve the complex-algebra correction equation. Jacobi-Davidson requires 5× more memory than Lanczos; hence, only up to a 699,399-atom quantum dot fits on the test hardware.
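The effect of the single-precision storage and splitting technique mentioned above can be estimated with a back-of-the-envelope CSR footprint model. All figures below (matrix size, nonzero count, fraction of entries with a nonzero imaginary part) are hypothetical placeholders, not measurements from this work; the actual 35-40% saving reported in Chapter 4 depends on the real sparsity pattern.

```python
def csr_bytes(n_rows, nnz, value_bytes, index_bytes=4):
    """Approximate CSR footprint: values + column indices + row pointers."""
    return nnz * (value_bytes + index_bytes) + (n_rows + 1) * index_bytes

# Hypothetical TB Hamiltonian: 1e6 rows, 4e7 nonzeros, 15% of which carry
# a nonzero imaginary part (placeholder numbers for illustration only).
n_rows, nnz = 1_000_000, 40_000_000
imag_fraction = 0.15

complex_single = csr_bytes(n_rows, nnz, 8)            # one complex64 matrix
split_single = (csr_bytes(n_rows, nnz, 4)             # float32 real part
                + csr_bytes(n_rows, int(nnz * imag_fraction), 4))  # float32 imag part
saving = 1.0 - split_single / complex_single          # ~22% with these inputs
```

The estimate grows or shrinks with the assumed imaginary-part sparsity and index width, which is why the measured saving on the real Hamiltonian differs.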
Figure 5.11: Scaling of the Lanczos method on 4, 8, 16 and 32 nodes of the HPC cluster

Figure 5.12: Scaling of the Jacobi-Davidson (subspace in host memory) method on 4, 8, 16 and 32 nodes of the HPC cluster
Figure 5.13: Scaling of the FEAST method on 4, 8, 16 and 32 nodes of the HPC cluster

To summarize the findings: for small systems Jacobi-Davidson performs on average 10.2× faster than Lanczos, rising to 17.2× as the system size increases, given the slow convergence of Lanczos for closely spaced energy states in large quantum dots. The FEAST method, on the other hand, executes on average 1.6× slower than Lanczos for small systems, worsening to 9.3× for large systems, since more contour points are needed for convergence. Common to all three methods is the trend of speedups when the number of nodes is doubled: 1.5× from 4 to 8 nodes, 1.3× from 8 to 16, and 1.15× from 16 to 32 nodes. The decrease in speedup with increasing node count is mainly due to process synchronization and limitations in inter-node communication.

5.2.3 Performance comparison between GPU and HPC cluster

To examine the advantage of GPUs over an expensive HPC cluster for TB calculations, let us compare the performance of 1 and 4 Tesla Kepler GPUs with 256 CPU cores, and also inspect the gain of multiple GPUs over a single GPU. Comparing the performance of the different methods on the different hardware, for the Lanczos and FEAST methods a 3.0× and 2.6× speedup is achieved when going from 1 GPU to 4 GPUs for a large quantum dot. For Jacobi-Davidson, the speedup is limited to a factor of 1.6×, demonstrating that the transfer of the subspace from the
host to the device and vice versa is the limiting factor, as already stated. Comparing the performance of 256 CPU cores on the HPC cluster with a single Tesla Kepler GPU, the Jacobi-Davidson method on the HPC cluster is found to outperform the GPU by a factor of 1.2×. On the contrary, the Lanczos and FEAST implementations on 1 GPU beat the performance of 256 CPU cores by factors of 5.8× and 4.1×, respectively. Comparing the multi-GPU implementation on 4 GPUs against 256 CPU cores of the HPC cluster, for the Jacobi-Davidson, Lanczos and FEAST methods the multi-GPU system outperforms the HPC cluster by factors of 1.5×, 13.7× and 10.8×, respectively.

5.3 Summary

Three different eigenvalue algorithms that are commonly employed for electronic band calculations have been implemented and optimized for a multi-GPU workstation. An analysis of timing, memory occupancy and convergence on a multi-GPU workstation and an HPC cluster has been performed. Through this work, the feasibility and advantage of each method as an eigensolver specifically for large-scale TB calculations have been examined. The tests have shown that Jacobi-Davidson is the most robust method in terms of convergence and is fast in terms of execution time, but suffers from a high memory requirement. Lanczos, on the contrary, is the most memory efficient method.
Chapter 6

Application of GPU accelerated atomistic simulations

Numerical simulations of quantum heterostructures derived from experimental results are performed using the GPU-based ETB implementation discussed in the previous chapters. As already shown, GPUs facilitate the simulation of realistic nanostructures within a reasonable time frame compared to HPC clusters. Here, two different applications of GPU-accelerated atomistic simulations are presented. First, a GaAs/Al0.3Ga0.7As complex dot/ring nanostructure is studied [152]. The fabricated nanostructure is too large for an ETB calculation to be performed directly; hence, a study of an ideally scaled complex quantum dot/ring nanostructure is presented. Second, a real sample containing large InGaN islands with non-uniform Indium content is analyzed [153]. The three-dimensional model of the quantum dot has been directly extrapolated from experimental results by a numerical algorithm.

6.1 Atomistic simulation of a complex quantum dot/ring nanostructure

Complex three-dimensional quantum nanostructures are being fabricated in labs given the potential to adjust their electronic properties by fine-tuning size and shape [152]. These physical parameters set the confinement potential for the electrical charge carriers, thus determining the electronic and optical properties of the quantum nanostructured system.
In this work, a complex GaAs quantum nanostructure on an Al0.3Ga0.7As buffer layer has been considered in order to compute the electron states. A multiphysics quantum/classical simulation coupling drift-diffusion with the ETB method has been performed. The multiscale software tool TiberCAD, into which the GPU implementation of the eigensolvers discussed in the previous chapters has been incorporated, has been used to calculate the energy gap as well as the spatial probability density (SPD) of a scaled quantum dot/ring nanostructure similar to the one shown in Figure 6.1.

Figure 6.1: Atomic force microscope images of the GaAs/Al0.3Ga0.7As complex quantum dot/ring nanostructure (Source: Sanguinetti (2011))

The nanostructure studied consists of a central cylindrical quantum dot and a surrounding ring of GaAs, embedded in AlGaAs. The dot has a diameter of 16 nm and the ring a width of 5 nm; the spatial separation between dot and ring is 5 nm. The dot is 7 nm high, while the outer ring is 5 nm high. The structure is grown on Al0.3Ga0.7As on the (001) plane and covered with 1.4 nm and 3.4 nm of Al0.3Ga0.7As, respectively (see Figures 6.2 and 6.3). 2 nm of the substrate and a 0.8 nm outer AlGaAs shell have been included in the simulations. Calculations are performed on the structure described above for varying quantum dot size, which is varied through the radius; the height of the quantum dot could be varied in the same way. Twenty electron states per structure, including the spin states, are sought using the ETB method. The resulting density is projected onto the finite element mesh used for the classical models. The solutions also provide the SPD of the electrons. In order to couple the atomistic calculation with the continuous-media model, the macroscopic electrostatic potential is calculated by solving the Poisson equation and is projected onto the atomic positions by interpolation.
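The projection step just described can be illustrated with a small sketch: a macroscopic potential, solved on a regular grid (standing in here for the finite element mesh), is interpolated at the atomic positions. Grid, potential and atom coordinates are hypothetical placeholders; SciPy's RegularGridInterpolator plays the role of the interpolation routine.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

axis = np.linspace(0.0, 10.0, 21)                    # nm; one axis of the grid
X, Y, Z = np.meshgrid(axis, axis, axis, indexing="ij")
phi = 0.05 * X - 0.02 * Y + 0.01 * Z                 # placeholder potential (V)

interp = RegularGridInterpolator((axis, axis, axis), phi)
atoms = np.array([[1.25, 3.40, 7.90],                # hypothetical atom sites (nm)
                  [5.05, 0.60, 2.20]])
phi_at_atoms = interp(atoms)                         # potential at each atom
```

Trilinear interpolation reproduces a linear potential exactly, which makes this kind of sketch easy to verify.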
For structures with more than 500,000 atoms, GPU memory limitations restrict us to fewer than twenty states with the ETB method, which is sufficient for
this work. The sp3d5s* parametrization is considered for the calculation of the electron energy states.

Figure 6.2: (Below) Lateral view, (Above) Top view: Geometry of the dot/ring complex nanostructure

Here, it is of interest to find nanostructure sizes for which electron states localized in the dot and in the ring have the same energies and therefore delocalize over both dot and ring. Taking into account the unavoidable hole-state localization that takes place in these nanoscale heterostructures due to the higher effective mass, this would permit the production of closely energy-spaced and tunable (by controlling the actual nanostructure sizes) lambda-type absorption resonances in topologically complex nanostructures. The lambda resonances exhibited by the investigated dot/ring nanostructures have many potential applications in photon storage for quantum computing (low group velocity media [154]), metamaterials [155, 156] and terahertz generation [157]. The atomistic calculations are performed for varying dot size so as to predict the dot and ring dimensions needed to delocalize the electron states and lead to the formation of lambda states.
Figure 6.3: Partly sliced GaAs/Al0.3Ga0.7As complex quantum dot/ring nanostructure with 30% Al, 70% Ga. Atomistic description: in pink Aluminum, in blue Gallium

Figure 6.4: Electron states from the ETB method for varying radius of the quantum dot, while the rest of the geometry of the complex nanostructure is kept fixed

Figure 6.4 shows the eigenenergies of the electron states found using the ETB method. Here, the energy frame is defined such that the Fermi energy is at 0 eV. The plots look less dense for some structures, since only sixteen electron states could be calculated due to limitations in GPU memory. Figure 6.5 shows the probability densities of the first few electron states for a structure with 8 nm dot radius; in this case, all states are localized in either the dot or the ring. Figure 6.6 shows the eigenenergies of the states with the symmetries shown in Figure 6.5 for different dot radii. The lines connect the energies of states that have been identified, by visual inspection of the wave functions, to have the same symmetry. The graph suggests
Figure 6.5: SPD of the first 8 electron states from the ETB method for the quantum dot with radius = 8 nm

Figure 6.6: Evolution of the eigenenergies with quantum dot radius. The lines connect states which have been identified to have the same wave-function symmetry.
Figure 6.7: Probability density of the lambda states in the quantum dot with radius = 6.2 nm, showing overlap between states B, C and H

Figure 6.8: Probability density of the lambda states in the quantum dot with radius = 6.5 nm, showing overlap between (Left) states B and F and (Right) states C and E
that the first excited dot states B and C become resonant with the H and the E/F states at radii of roughly 6.2 nm and 6.5 nm, respectively. Note that state A is not reported in Figure 6.6, as it is well separated from the B and C states and would form lambda states only at unrealistically small quantum dot radii. Figures 6.7 and 6.8 confirm this picture, showing strong mixing between the dot and ring states at the dot radii where resonance is expected. For symmetry reasons, there is a clear coupling between states of types B and F, and C and E, as seen in Figure 6.8, while the B/C dot states do not couple with the ring D state, again due to symmetry.

6.2 Atomistic simulation of an InGaN quantum dot with Indium fluctuation

Recent scientific work has clearly pointed out how taking into account realistic elements directly derived from experimental results can strengthen the effectiveness of the models used for simulations. A new possible field of application of a comprehensive, realistic multiscale approach [17] is the analysis of Indium Gallium Nitride systems, given their increasing role in the fabrication of LEDs. Here, an ETB calculation is performed on a real sample containing large InGaN islands, tens of nanometers in size, with non-uniform Indium content.

Figure 6.9: InGaN quantum dot with varying Indium content, derived from experimental high-resolution transmission electron microscopy

A complex algorithm has been developed in order to build a three-dimensional geometry and structure from the experimental image of the out-of-plane strain obtained by geometric phase analysis (GPA) of the high-resolution transmission electron microscopy image of a real sample. The latter contains several InGaN/GaN superlattices and large InGaN quantum dot islands with sizes of tens of nanometers, with
Figure 6.10: A central slice of the InGaN quantum dot with 19% Indium randomly distributed. Atomistic description: in red Indium, in white Gallium

Figure 6.11: InGaN quantum dot with uniform Indium content. Description: in red 19% Indium, in blue 0% Indium

non-uniform Indium distribution similar to the one shown in Figure 6.9. Using the Gwyddion software [158], we sampled the quantum dot and extrapolated a three-dimensional structure; the details of the extrapolation method and the numerical models are described in reference [159]. This extrapolated structure has been used to create a finite element model on which the electronic ETB model is discretized. ETB calculations of the quantum dot with random Indium distribution have been performed and the results compared to InGaN alloys treated with the Virtual Crystal Approximation (VCA) (see Figure 6.11) [160, 161], in which an alloy ABC is considered as a fictitious material whose properties are a weighted average of the properties of its alloy components. The ETB results shown in Figure 6.12 demonstrate that the confined states strongly depend on the local Indium distribution. This dependence is mainly due to the large energy gap difference between InN and GaN, with a valence band difference of just 0.45-0.5 eV compared to 2.7-2.75 eV for the conduction band.

Figure 6.12: Electronic ground states obtained from the ETB calculation of the InGaN quantum dot with random Indium content

The ground states are more likely to be
present in regions with higher Indium content, which would dictate certain electronic and optical properties of InGaN LEDs depending on whether the states overlap or not. In the case of the quantum dot generated using the VCA, the ground states are very symmetric and ideally overlap each other, as seen in Figure 6.13.

Figure 6.13: Electronic ground states obtained from the ETB calculation of the InGaN quantum dot with uniform Indium content

6.3 Summary

Numerical atomistic simulations of realistic quantum nanostructures have been carried out using GPUs, showing that GPUs can accelerate ETB calculations tenfold compared to state-of-the-art HPC clusters. In the first case, ETB calculations on a number of ideally scaled GaAs/Al0.3Ga0.7As complex quantum dot/ring nanostructures were performed; GPUs cut the time needed to simulate these multiple samples from a few weeks to a few days. In the second case, GPUs were used to calculate the ground states of a realistic InGaN quantum dot containing around 750,000 atoms.
Chapter 7

Conclusion

In this work, it has been shown that large-scale atomistic simulation of nanostructured devices, which plays a significant role in guiding and explaining experimental findings in modern materials science and semiconductor research but faces the computational obstacle of diagonalizing the Hamiltonian matrix, can be accelerated through parallel computing techniques and the introduction of enhanced algorithms. Both of these aspects have been addressed in this work by developing optimized algorithms for state-of-the-art computing hardware. It is widely known that implementing algorithms that scale ideally over parallel computing architectures is essential for turning hardware advancements into beneficial speedups; this also requires a deeper knowledge of the method and of the underlying hardware architecture. Today's GPUs are developed to help computational scientists push the frontiers, and they have certainly captured the attention of most researchers, as is evident from the extensive effort being put into translating algorithms initially designed for other computing machines to the GPU. Here, it has been shown that GPUs can be used to accelerate atomistic simulation of nanostructured devices by employing them for the calculation of energy eigenstates in a quantum nanostructured system. Benchmark calculations have been performed for an atomistic model of a wurtzite GaN/AlGaN quantum dot parametrized with an ETB scheme, demonstrating that GPUs can be used very effectively for iterative numerical optimization problems such as finding the extreme eigenvalues of large sparse matrices.
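For reference, the basic Lanczos recurrence at the heart of the benchmarked solver can be sketched as follows (restarting, reorthogonalization and the GPU kernels are omitted; dense NumPy stands in for the sparse mixed-arithmetic SpMV, and the test matrix is an illustrative one with well-separated extreme eigenvalues):

```python
import numpy as np

def lanczos_extremes(H, n_iter=40, seed=0):
    """Plain Lanczos: builds a tridiagonal T whose extreme Ritz values
    approximate the extreme eigenvalues of the symmetric matrix H."""
    n = H.shape[0]
    q = np.random.default_rng(seed).standard_normal(n)
    q /= np.linalg.norm(q)
    q_prev = np.zeros(n)
    alphas, betas = [], []
    beta = 0.0
    for _ in range(n_iter):
        w = H @ q - beta * q_prev        # one SpMV per iteration
        alpha = q @ w
        w -= alpha * q
        beta = np.linalg.norm(w)
        alphas.append(alpha)
        betas.append(beta)
        if beta < 1e-12:                 # invariant subspace found
            break
        q_prev, q = q, w / beta
    T = np.diag(alphas) + np.diag(betas[:-1], 1) + np.diag(betas[:-1], -1)
    return np.linalg.eigvalsh(T)         # Ritz values of the Krylov space

# Illustrative spectrum: a cluster in [0, 1] plus well-separated extremes
d = np.concatenate(([-1.0], np.linspace(0.0, 1.0, 98), [2.0]))
ritz = lanczos_extremes(np.diag(d))
```

With well-separated extremes, only a few dozen iterations are needed; closely spaced eigenvalues are exactly the regime in which, as reported above, plain Lanczos slows down.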
Figure 7.1: Performance of the Lanczos implementation benchmarked on different technologies

In Chapter 4, a fine-tuned GPU-based parallel implementation of the Lanczos algorithm with a simple restart was reported, as it was identified as the algorithm best fitted for computing a few eigenpairs on a GPU framework while coping with the memory limitations of current GPUs and slow GPU-CPU communication. A technique has been developed that exploits the structure of the TB Hamiltonian matrices, optimizing memory occupation by splitting the TB Hamiltonian into its real and imaginary parts. This in turn required the development of a new mixed real/complex arithmetic CUDA kernel. Performing the multiplication in this split fashion resulted in a 35-40% memory saving without significant loss of performance, thus increasing the maximum system size that can be handled on a GPU. Likewise, it has been shown how the performance of the eigenvalue solver can be further enhanced by subduing the slow communication between GPUs, exploiting the matrix sparsity pattern and taking advantage of the GPU-GPU communication offered by the new GPUDirect technology. The implementation designed and tested is fully vectorized and scales with the number of GPUs. As evident from Figure 7.1, the fine-tuned Lanczos implementation running on a Kepler K20c (Test system 1) performed on average 10× faster than the same OpenMP implementation running on a Xeon quad-core CPU (Test system 3). Also shown are the benchmark calculations in a multi-GPU
  • 93.
    scenario, parallelized usingMPI. In this context, the importance of using fast data transfer via direct PCI-E interconnects is shown. The performance of a dual-GPU versus a HPC cluster upto 16 nodes connected via InfiniBand is shown. This demonstrates that the dual-GPU on average is faster by a factor of 4.1× for a system comprising of around 350,000 atoms and by more than a factor of 3.2× for systems comprising of 600,000 atoms. Assuming an ideal parallel scaling on the InfiniBand HPC cluster that might be reached with faster interconnects, a large number of nodes will be needed. Currently, a 32 core IBM HPC system costs ≈ $90,000 and has a peak power consumption of ≈ 791 Watts. On the other hand a single quad-core workstation with a single Kepler GPU will cost less than ≈ $10,000 and will consume ≈ 486 Watts of power making GPU more cost-effective in terms of energy, infrastructure cost and maintenance. The drawback of this fine-tuned GPU implementation is that it faces memory limitations that prevents scaling up the system size above a certain limit. Nonetheless, the amount of memory hosted by GPUs is likely to increase in the future. In the search for faster algorithms, it was noticed that there are a few methods which are more widely used for atomistic simulations given their implementation feasibility, convergence characteristic, accuracy and reliability. Thus, a comprehensive study of Jacobi-Davidson, Lanczos and FEAST methods for energy eigenstate calculation in nanostructures was conducted in Chapter 5 because it was still unclear which one is more suited for GPU and how they perform in a given setup. By creating, testing and profiling a GPU based performance enhanced implementation of the listed methods their feasibility and advantage as an eigensolver specifically for the tight binding calculations was examined. The study revealed that Jacobi-Davidson is the most robust method in terms of convergence and is fastest in terms of execution time. 
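The split real/imaginary product at the heart of the Chapter 4 kernel rests on a simple identity, sketched here in NumPy/SciPy rather than CUDA (matrix sizes and densities are illustrative, not those of the thesis systems):

```python
# Sketch of the split-storage trick: a complex Hamiltonian H = Hr + i*Hi
# applied to v = vr + i*vi needs only real-valued SpMV kernels, since
#   H v = (Hr vr - Hi vi) + i (Hr vi + Hi vr).
# Storing Hr and Hi as two separate real sparse matrices is what yielded the
# reported 35-40% memory saving in the thesis implementation.
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(42)
n = 1000
Hr = sp.random(n, n, density=0.01, random_state=rng, format='csr')
Hi = sp.random(n, n, density=0.002, random_state=rng, format='csr')
v = rng.standard_normal(n) + 1j * rng.standard_normal(n)

# Reference: ordinary complex sparse matrix-vector product.
H = (Hr + 1j * Hi).tocsr()
ref = H @ v

# Split product using only real SpMV operations.
vr, vi = v.real, v.imag
split = (Hr @ vr - Hi @ vi) + 1j * (Hr @ vi + Hi @ vr)

print(np.allclose(ref, split))  # → True
```

On the GPU the same algebra lets the imaginary part, which is typically much sparser than the real part, carry its own compact storage instead of padding a full complex matrix.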
Jacobi-Davidson, however, has a high memory consumption and is therefore less suited for calculating the energy eigenstates of large nanostructures. This shortcoming can be overcome by moving the subspace vectors to the host memory, as shown, thus enabling the calculation of the energy states of larger systems. Nevertheless, this type of GPU implementation of Jacobi-Davidson does not scale as well as Lanczos and FEAST. Lanczos, on the contrary, is the most memory-efficient method, but its poor convergence for the higher energy eigenstates of large nanostructures is a primary bottleneck, which makes it not the method of first choice. However, on a multi-GPU system it shows a superior scaling trend. The FEAST method performs the worst, since no preconditioner matrix was utilized while solving the block linear system: the construction of a typical preconditioner based on incomplete factorization is expensive in terms of both memory and time and is not ideal for a GPU-based implementation. This led to the important inference that Jacobi-Davidson, given its good convergence even without a preconditioner matrix, should be considered the method of choice on computing systems where memory is not a constraint. On GPUs, it can be employed to calculate the energy eigenstates of nanostructures of a few hundred thousand atoms. Lanczos, on the other hand, is the method of choice when memory usage is the limiting factor: even though it converges slowly, it can easily be scaled using a multi-GPU implementation to perform on par with Jacobi-Davidson, as seen in Figure 7.2.

Figure 7.2: Performance of the Lanczos, Jacobi-Davidson (JD) and FEAST implementations benchmarked on different technologies

Two different applications of GPU-accelerated atomistic simulations were also presented. First, numerical simulations of an idealized GaAs/Al0.3Ga0.7As complex quantum dot/ring nanostructure were performed. GPUs were employed to carry out the ETB calculations within a reasonable time frame for systems with varying quantum dot size. The goal of the analysis was to fine-tune the electronic properties of the complex nanostructure via size tuning, in order to find lambda states (coupled states) that are localized in both the quantum dot and the quantum ring. This type of lambda-state characteristic exhibited by complex nanostructures has many potential applications, from quantum computing to metamaterials. Second, numerical simulations of quantum dot structures derived from experimental high-resolution transmission electron microscopy results were performed. A real sample containing large InGaN islands, tens of nanometers in size and with non-uniform indium content, was analyzed. Three-dimensional models of the quantum dots were extrapolated directly from the experimental results by a numerical algorithm. The ground energy eigenstates of these quantum dots, comprising more than 750,000 atoms, were calculated using the GPU-based implementation for varying indium content within a few hours, compared to the few days that would be needed on other hardware platforms.

Finally, was the principal objective of the proposed work realized? This can be established by means of a test case. Consider the atomistic simulation of a ≈ 200,000 atom quantum dot, an average-sized nanostructure often encountered in the computational electronics domain. Calculating 8 electron energy eigenstates using the ETB method with a Lanczos-type eigensolver would take ≈ 24 hours with a sequential implementation on Test system 3. On the same test system, an OpenMP implementation would require ≈ 8 hours, whereas 16 nodes of an HPC cluster connected via InfiniBand (Test system 2) using MPI-OpenMP would need ≈ 1.45 hours. Employing a Kepler GPU with the CUDA implementation of the fine-tuned Lanczos-based eigensolver took ≈ 50 minutes, which was further lowered to ≈ 20 minutes using the MPI-CUDA implementation on 4 Kepler GPUs (Test system 5). With the MPI-CUDA implementation of the Jacobi-Davidson method, the time taken was reduced to ≈ 10 minutes.
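Reduced to speedups over the sequential baseline, the timings quoted above work out as follows (values transcribed from the text; 1.45 hours is rounded to 87 minutes):

```python
# Speedups of each configuration over the sequential run, using the
# approximate timings quoted in the text for the ~200,000-atom test case.
timings_min = {
    "sequential (Test system 3)":        24 * 60,
    "OpenMP (Test system 3)":             8 * 60,
    "MPI-OpenMP, 16 nodes (system 2)":    87,     # ~1.45 hours
    "CUDA Lanczos, 1 Kepler GPU":         50,
    "MPI-CUDA Lanczos, 4 GPUs":           20,
    "MPI-CUDA Jacobi-Davidson":           10,
}
base = timings_min["sequential (Test system 3)"]
for name, t in timings_min.items():
    print(f"{name:34s} {base / t:6.1f}x")
```

The progression, roughly 1× → 3× → 17× → 29× → 72× → 144×, is what the final paragraph summarizes.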
Thus, one can say that the objective of accelerating atomistic simulations was accomplished using enhanced algorithms, GPUs and other parallel computing techniques. A multi-GPU system with a high-speed data interconnect can be considered one of the most cost-effective and energy-efficient computing architectures currently available for accelerating the atomistic simulation of nanostructured devices.
Publications and Conferences

• Walter Rodrigues, A. Pecchia, M. Lopez, M. Auf der Maur, A. Di Carlo (2014), "Accelerating atomistic calculations of quantum energy eigenstates on graphic cards", Computer Physics Communications, vol. 185, issue 10, pp. 2510-2518. DOI:10.1016/j.cpc.2014.05.028

• W. Rodrigues, A. Pecchia, M. Auf der Maur, A. Di Carlo, "A multi-GPU based approach for atomistic calculations of quantum energy eigenstates", poster presentation, 17th International Workshop on Computational Electronics, June 3-6, 2014, Paris, France, pp. 145-146. ISBN:978-2-9547858-0-6

• Walter Rodrigues, M. Lopez, A. Pecchia, M. Auf der Maur, A. Di Carlo (2014), "GPU based approach for the atomistic calculation of quantum energy eigenstates in nanostructured system", Proceedings of the 6th International Conference from Scientific Computing to Computational Engineering (6th IC-SCCE), 9-12 July 2014, Athens, Greece. ISSN:2241-8865, ISBN:978-618-80527-5-8

• W. Rodrigues, A. Pecchia, M. Auf der Maur, A. Di Carlo (2015), "A comprehensive study of popular eigenvalue methods employed for quantum calculation of energy eigenstates in nanostructures using GPUs", Journal of Computational Electronics, in press, published online April 9, 2015. DOI:10.1007/s10825-015-0695-z

• W. Rodrigues, A. Pecchia, M. Auf der Maur, D. Barettin, S. Sanguinetti, A. Di Carlo, "Atomistic simulation of GaAs/AlGaAs quantum dot/ring nanostructures", accepted to the 15th International Conference on Nanotechnology (IEEE NANO 2015), July 27-30, 2015, Rome, Italy.

• D. Barettin, M. Auf der Maur, A. Pecchia, W. Rodrigues, A. Tsatsulnikov, A. V. Sakharov, W. V. Lundin, A. E. Nikolaev, N. Cherkashin, M. J. Hytch, S. Yu. Karpov, A. Di Carlo, "Realistic model of LED structure with InGaN quantum-dots active region", accepted to the 15th International Conference on Nanotechnology (IEEE NANO 2015), July 27-30, 2015, Rome, Italy.
    Bibliography [1] Martin T.Dove, An introduction to atomistic simulation methods, Seminarios de la SEM, vol. 4, pp. 7-37. [2] Neil W. Ashcroft and N. David Mermin (1976), Solid State Physics, Cengage Learning, ISBN:0030839939. [3] P. E. Turchi, A. Gonis, and L. Colombo (1998), Tight-Binding Approach to Computational Materials Science, Materials Research Society, Warrendale, PA, Vol. 491. [4] J. C. Slater and G. F. Koster (1954), Simplified LCAO Method for the Periodic Potential Problem, Phys. Rev. 94, 1498. [5] Per-Olov L¨owdin (1950), On the Non-Orthogonality Problem Connected with the Use of Atomic Wave Functions in the Theory of Molecules and Crystals, J. Chem. Phys. 18, 365. [6] C. Delerue, M. Lannoo, G. Allan (2001), Tight binding for complex semiconductor systems, Physica Status Solidi (B), vol. 227 , issue 1 , pp. 115-149. [7] J. M Jancu, F. Bassani, F. Della Sala, and R Scholz (2002), Transferable tight- binding parametrization for the group-III nitrides. Appl. Phys. Lett. 81, 4838. doi:10.1063/1.1529312. [8] Yaohua P. Tan, Michael Povolotskyi, Tillmann Kubis, Timothy B. Boykin and Gerhard Klimeck (2012), Generation of Empirical Tight Binding Parameters from ab-initio simulations. Abstracts of IWCE 2012. 96
  • 99.
    [9] M. Lopez,F. Sacconi, M. Auf der Maur, A. Pecchia, and A. Di Carlo (2012), Atomistic simulation of InGaN/GaN quantum disk LEDs. Optical and Quantum Electronics, vol. 44, issue 3, pp. 89-94. doi: 10.1007/s11082-012-9554-3. [10] M. Lopez, M. Auf der Maur, A. Pecchia, F. Sacconi, G. Penazzi and A. Di Carlo (2013), Simulation of Random Alloy Effects in InGaN/GaN LEDs, Numerical Simulation of Optoelectronic Devices (NUSOD). doi:10.1109/NUSOD.2013.6633150 [11] Fabiano Oyafuso, Gerhard Klimeck, R.Chris Bowen, and Timothy B. Boykin (2002), Atomistic electronic structure calculations of unstrained alloyed systems consisting of a million atoms. Journal of Computational Electronics, vol. 1, issue 3, pp. 317-321. ISSN:1569-8025. doi:10.1023/A:1020774819509. [12] Aldo Di Carlo (2002), Tight-binding methods for transport and optical properties in realistic nanostructures, Physica B 314, pp. 211-219. [13] C. M. Goringey, D. R. Bowleryk and E. Hern`andez (1997), Tight-binding modelling of materials. Rep. Prog. Phys., 60:14471512. doi:10.1143/JJAP.44.L173. [14] Aldo Di Carlo, Paolo Lugli and Andrea Reale (1997), Modelling of semiconductor nanostructured devices within the tight-binding approach. J. Phys.: Condens. Matter, 11. doi:10.1088/0953-8984/11/31/311. [15] Aldo Di Carlo (1997), Self-consistent tight-binding methods applied to semiconductor nanostructures. volume 491, issue 1, doi:10.1557/PROC-491-389. [16] A. Di Carlo (2003), Microscopic theory of nanostructured semiconductor devices: beyond the envelope-function approximation. Semiconductor Science and Technology, vol. 18 issue 1. doi: 10.1088/0268-1242/18/1/201. [17] M. Auf der Maur, Gabriele Penazzi, Giuseppe Romano, Fabio Sacconi, A. Pecchia, Aldo Di Carlo (2011), The Multiscale Paradigm in Electronic Device Simulation, IEEE Transactions on Electron Devices vol. 58, issue 5, pp. 1425-1432. 
[18] Suman De, Arunasish Layek, Sukanya Bhattacharya, Dibyendu Kumar Das, Abdul Kadir, Arnab Bhattacharya, Subhabrata Dhar, and Arindam Chowdhury (2012). Quantum-confined stark effect in localized luminescent centers within 97
  • 100.
    InGaN/GaN quantum-well basedlight emitting diodes. Appl. Phys. Lett, 101:121919. doi:10.1063/1.4754079. [19] G. Penazzi, A. Pecchia, F. Sacconi and A. Di Carlo (2010), Calculation of optical properties of a quantum dot embedded in a GaN/AlGaN nanocolumn. Superlattices and Microstructures, vol. 47, Issue 1, pp. 123-128 [20] C. Delerue and M. Lannoo (2004), Nanostructures - Theory and Modeling, Springer. ISBN:9783662089033 [21] Matthias Auf der Maur (2008), A Multiscale Simulation Environment for Electronic and Optoelectronic Devices., Ph.D. thesis, University of Rome Tor Vergata, Rome, Italy. [22] L. C. Lew Yan Voon and L. R. Ram-Mohan (1993), The tight binding representation of the optical matrix elements: theory and applications, Physical Review B, 47:15500-15508. doi:10.1103/PhysRevB.47.15500. [23] R. Shankar (1994), Principles of Quantum Mechanics (2nd ed.), Kluwer Academic/Plenum Publishers. ISBN:9780306447907. [24] Gordon E. Moore (1965), Cramming More Components onto Integrated Circuits, Electronics, vol. 38, issue 8, pp. 114-117. [25] Brock, C. David (2006), Understanding Moore’s law: four decades of innovation, Philadelphia, Chemical Heritage Press. ISBN:0941901416. [26] A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman (2007), Compilers: Principles, Techniques, and Tools, 2n d Ed., Addison-Wesley. ISBN:9780321486813. [27] A. Vajda (2011), Programming Many-Core Chips, Chapter 2, pp. 9-43, springer, ISBN:9781441997388 [28] Geoffrey Blake, Ronald G. Dreslinski, and Trevor Mudge (2009), A Survey of Multicore Processors, IEEE Signal Processing Magazine, vol 26. doi:10.1109/MSP.2009.934110. [29] T.S Crow (2004), Evolution of the Graphical Processing Unit. Master’s thesis, Univ. of Nevada, Reno. 98
  • 101.
    [30] Sha’Kia Bogganand Daniel M. Pressel (2007), GPUs: An Emerging Platform for General-Purpose Computation, Technical report, U.S. Army Research Laboratory, Aberdeen Proving Ground, MD, USA. [31] Kayvon Fatahalian and Mike Houston (2008), A closer look at GPUs, Communications ACM, vol. 51 issue 10, pp. 50-57, ACM New York, NY, USA, doi:10.1145/1400181.1400197 [32] John D. Owens, Mike Houston, David Luebke, Simon Green, John E. Stone, and James C. Phillips (2008), GPU Computing, Proceedings of the IEEE, vol. 96, issue 5, pp. 879-899. [33] John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Kr¨uger, Aaron E. Lefohn, and Tim Purcell (2007), A Survey of General-Purpose Computation on Graphics Hardware, Computer Graphics Forum, vol. 26, issue 1, pp. 80-113. [34] E. Lindholm, J. Nickolls, S. Oberman, J. Montrym (2008), NVIDIA Tesla: A Unified Graphics and Computing Architecture, Micro, IEEE, vol. 28 , issue 2, pp. 39-55. doi:10.1109/MM.2008.31 [35] Nvidia corporation (2006), NVIDIA GeForce 8800 Architecture Technical Brief, November 2006. [36] J. Nickolls, I. Buck, K. Skadron, and M. Garland (2008), Scalable Parallel Programming with CUDA, ACM Queue, vol. 6, issue 2, pp. 40-53. [37] NVIDIA corporation (2014), CUDA C PROGRAMMING GUIDE, version 6.5. [38] NVIDIA corporation (2014), CUDA C BEST PRACTICES GUIDE, version 6.5. [39] Kirk David B. and Hwu Wen-mei W. (2010), Programming Massively Parallel Processors: A Hands-on Approach, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, ISBN:0123814723, 9780123814722 [40] Sanders Jason and Kandrot Edward (2010), CUDA by Example: An Introduction to General-Purpose GPU Programming, Addison-Wesley Professional, ISBN: 0131387685, 9780131387683 99
  • 102.
    [41] Peter N.Glaskowsky (2009), NVIDIA’s Fermi: The First Complete GPU Computing Architecture, White paper September 2009. [42] NVIDIA corporation (2009), NVIDIA’s Next Generation CUDA Compute Architecture: Fermi, technical report, NVIDIA 2009 [43] Matthew Murray (2012), Nvidia’s Kepler architecture: 6 things you should know, PC, March 23, 2012. [44] Ryan Smith (2012), NVIDIA GeForce GTX 680 Review: Retaking The Performance Crown, AnandTech, March 22, 2012 [45] NVIDIA corporation (2012), NVIDIA’s Next Generation CUDA Compute Architecture: Kepler GK110/210. White paper. [46] NVIDIA corporation (2012), NVIDIA Kepler Compute Architecture Datasheet, May 2012. [47] Ryan Smith (2012), NVIDIA Launches Tesla K20 and K20X: GK110 Arrives At Last, AnandTech, November 12, 2012 [48] NVIDIA corporation (2012), NVIDIA’s Next Generation CUDA Compute Architecture: Kepler GK110, White paper [49] Rob Farber (2008), CUDA, Supercomputing for the Masses: Part 1 , Dr. Dobb’s, April 15, 2008. [50] Qihang Huang, Zhiyi Huang, P. Werstein, M. Purvis (2008), GPU as a General Purpose Computing Resource, International conference on Parallel and Distributed Computing, Applications and Technologies, Otago, pp. 151-158. doi:10.1109/PDCAT.2008.38 [51] David Tarditi, Sidd Puri, Jose Oglesby (2006), Accelerator: using data parallelism to program GPUs for general-purpose uses, ACM SIGARCH Computer Architecture News, vol. 34, issue 5. [52] Shuai Che, Michael Boyer, Jiayuan Meng, D. Tarjan, Jeremy W. Sheaffer, Kevin Skadron (2008), A performance study of general-purpose applications on graphics 100
  • 103.
    processors using CUDA.Journal of Parallel and Distributed Computing, vol 68, issue 10, pp. 1370-1380. doi:10.1016/j.jpdc.2008.05.014 [53] Peng Du, Rick Weber, Piotr Luszczek, Stanimire Tomov, Gregory Peterson, Jack Dongarra (2012), From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming, Parallel Computing, vol. 38, issue 8, pp. 391-407. doi:10.1016/j.parco.2011.10.002 [54] John E. Stone, James C. Phillips, Peter L. Freddolino, David J. Hardy 1, Leonardo G. Trabuco, Klaus Schulten (2007), Accelerating molecular modeling applications with graphics processors, Journal of Computational Chemistry, vol. 28, issue 16, pp. 2618-2640. doi:10.1002/jcc.20829 [55] Joshua A. Anderson, Chris D. Lorenz, A. Travesset (2008), General Purpose Molecular Dynamics Simulations Fully Implemented on Graphics Processing Units, Journal of Computational Physics, vol. 227, issue 10, pp. 5342-5359. doi:10.1016/j.jcp.2008.01.047 [56] John Paul Walters, Vidyananth Balu, Vipin Chaudhary, David Kofke, and Andrew Schultz (2008), Accelerating molecular dynamics simulations with GPUs, In ISCA 21st International Conference on Parallel and Distributed Computing and Communication Systems (ISCA PDCCS), pp. 44-49, New Orleans, USA. [57] S.B. Kylasa, H.M. Aktulga, A.Y. Grama (2014), PuReMD-GPU: A reactive molecular dynamics simulation package for GPUs, Journal of Computational Physics, vol. 272, pp. 343-359. [58] Ivan S. Ufimtsev and Todd J. Martinez (2008), Graphical Processing Units for Quantum Chemistry, Comp. Sci. Eng., vol. 10, issue 6, pp. 26-34. doi:10.1109/MCSE.2008.148 [59] Ivan S. Ufimtsev and Todd J. Martinez (2008), Quantum Chemistry on Graphical Processing Units. 1. Strategies for Two-Electron Integral Evaluation, J. Chem. Theo. Comp., vol. 4, issue 2, pp. 222-231. doi:10.1021/ct700268q [60] Mark Watson, Roberto Olivares-Amaya, Richard G. Edgar, and Alan Aspuru-Guzik (2010), Accelerating correlated quantum chemistry calculations using graphical 101
  • 104.
    processing units, Computingin Science and Engineering, vol 12, issue 4, pp. 40- 50. doi:10.1109/MCSE.2010.29 [61] Andreas W. G¨otz, Thorsten W¨olfle1, and Ross C. Walker (2010), Quantum Chemistry on Graphics Processing Units, In Annual Reports in Computational Chemistry, vol. 6, Elsevier B.V 2010. doi:10.1016/S1574-1400(10)06002-0 [62] M. J. Harvey, Gianni De Fabritiis (2012), A survey of computational molecular science using graphics processing units, Wiley Interdisciplinary Reviews: Computational Molecular Science, vol. 2, issue 5, pp. 734-742, 2012, doi:10.1002/wcms.1101 [63] A. Dal Corso (1996), A pseudopotential plane waves program (pwscf) and some case studies, Lecture Notes in Chemistry, vol. 67, C. Pisani editor, Springer Verlag, Berlin, 1996. [64] K. P. Esler, Jeongnim Kim, L. Shulenburger, D.M. Ceperley (2012), Computing in Science and Engineering, vol.14, issue 1, pp. 40-51. doi:10.1109/MCSE.2010.122 [65] Andrea Manconi, Alessandro Orro, Emanuele Manca, Giuliano Armano, Luciano Milanesi (2014), A tool for mapping Single Nucleotide Polymorphisms using Graphics Processing Units, BMC Bioinformatics, vol 15, issue 1, pp. 1-13. doi:10.1186/1471-2105-15-S1-S10 [66] Ling Sing Yung, Can Yang, Xiang Wan, Weichuan Yu (2011), GBOOST: a GPU- based tool for detecting gene-gene interactions in genome-wide case control studies, Bioinformatics, vol. 27, issue 9, pp. 1309-1310. doi:10.1093/bioinformatics/btr114 [67] Alhadi Bustamam, Kevin Burrage, Nicholas A. Hamilton (2012), Fast Parallel Markov Clustering in Bioinformatics using Massively Parallel Computing on GPU with CUDA and ELLPACK-R Sparse Format, IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, issue 3, pp. 679-692. doi:10.1109/TCBB.2011.68 [68] Panagiotis D. Vouzis, Nikolaos V. Sahinidis (2011), GPU-BLAST: using graphics processors to accelerate protein sequence alignment, Bioinformatics vol. 27, issue 2, pp. 182-188. doi:10.1093/bioinformatics/btq644 102
  • 105.
    [69] Guillaume Rizk,Dominique Lavenier (2009), GPU Accelerated RNA Folding Algorithm, In Computational Science - ICCS 2009. vol. 5544 Pp. 1004-1013. Springer Berlin/Heidelberg. doi:10.1007/978-3-642-01970-8 101 [70] Peter Huthwaite (2014), Accelerated finite element elastodynamic simulations using the GPU, Journal of Computational Physics, vol. 257, part A, pp. 687-707 [71] R. Spurzem, P. Berczik, G. Marcus, A. Kugel, G. Lienhart, I. Berentzen, R. M¨anner, R. Klessen, R. Banerjee (2009), Accelerating astrophysical particle simulations with programmable hardware (FPGA and GPU), Computer Science - Research and Development, vol. 23, issue 3-4, pp. 231-239. doi:10.1007/s00450-009-0081-9 [72] Spurzem Rainer, Berczik Peter, Berentzen Ingo, Ge Wei, Wang Xiaowei, Schive Hsi- yu, Nitadori Keigo, Hamada Tsuyoshi, Fiestas Jose (2012), Accelerated Many-Core GPU Computing for Physics and Astrophysics on Three Continents, Chapter 3, Large-Scale Computing, John Wiley and Sons, Inc,. ISBN:9780470592441 [73] Dossay Oryspayev, Hugh Potter, Pieter Maris, Masha Sosonkina, James P. Vary, Sven Binder, Angelo Calci, Joachim Langhammer, Robert Roth (2013), Leveraging GPUs in Ab Initio Nuclear Physics Calculations, Parallel and Distributed Processing Symposium Workshops and PhD Forum (IPDPSW), 2013 IEEE 27th International, 20-24 May 2013, Cambridge, MA, pp. 1365-1372. doi:10.1109/IPDPSW.2013.253 [74] Ari Harju, Topi Siro, Filippo Federici Canova, Samuli Hakala, Teemu Rantalaiho (2013), Computational Physics on Graphics Processing Units, Applied Parallel and Scientific Computing, Lecture Notes in Computer Science, vol. 7782, pp 3-26. doi:10.1007/978-3-642-36803-5 1 [75] J. Kruger and R. Westermann (2003), Linear algebra operators for GPU implementation of numerical algorithms, ACM Trans. Graph. vol. 22, issue 3, pp. 908-916. 
[76] Markus Geveler, Dirk Ribbrock, Dominik G¨oddeke, Peter Zajac and Stefan Turek (2013), Towards a complete FEM-based simulation toolkit on GPUs: Unstructured grid finite element geometric multigrid solvers with strong smoothers based on sparse approximate inverses, Computers and Fluids, vol. 80, pp. 327-332. doi:10.1016/j.compfluid.2012.01.025 103
  • 106.
    [77] Volodymyr Kindratenko(2014), Numerical Computations with GPUs, Springer International Publishing, Switzerland, ISBN:9783319065472 [78] W. Li, Z. Fan, X. Wei, and A. Kaufman (2003), GPU-Based Flow Simulation with Complex Boundaries, Technical Report 031105, Computer Science Department, Suny at Stony Brook. Nov 2003. [79] T Nagatake and T Kunugi (2010), Application of GPU to computational multiphase fluid dynamics, IOP Conf. Series: Materials Science and Engineering, vol. 10, 012024, doi:10.1088/1757-899X/10/1/012024 [80] Mark J. Harris (2004), Fast Fluid Dynamics Simulation on the GPU, GPU Gems, Chapter 38. [81] Anders Eklund, Paul Dufort, Daniel Forsberg, Stephen M. LaConte (2013), Medical image processing on the GPU - Past, present and future, Medical Image Analysis, vol. 17, issue 8, pp. 1073-1094. doi:10.1016/j.media.2013.05.008 [82] Pavel Karas (2010), GPU Acceleration of Image Processing Algorithms, dissertation thesis, Centre for Biomedical Image Analysis, Faculty of Informatics, Masaryk University. [83] Brijmohan Daga, Avinash Bhute, Ashok Ghatol (2011), Implementation of Parallel Image Processing Using NVIDIA GPU Framework, Advances in Computing, Communication and Control Communications in Computer and Information Science, vol. 125, pp. 457-464. doi: 10.1007/978-3-642-18440-6 58 [84] T. Preis (2011), GPU-computing in econophysics and statistical physics, European Physical Journal Special Topics, vol. 194, issue 1, pp. 87-119. doi:10.1140/epjst/e2011-01398-x [85] Scott Grauer-Gray, William Killian, Robert Searles, John Cavazos (2013), Accelerating financial applications on the GPU, Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units, pp. 127-136, ACM New York, USA. doi:10.1145/2458523.2458536 [86] Hawkins, T. (1975), Cauchy and the spectral theory of matrices, Historia Mathematica, vol 2, issue 1, pp. 1-29. doi:10.1016/0315-0860(75)90032-4 104
  • 107.
    [87] Morris Kline(1972), Mathematical thought from ancient to modern times, Oxford University Press, ISBN:0195014960 [88] Richard von Mises and H. Pollaczek-Geiringer (1929), Praktische Verfahren der Gleichungsaufl¨osung, ZAMM - Zeitschrift f¨ur Angewandte Mathematik und Mechanik, vol. 9, pp. 152-164. [89] William H. Press, Saul A. Teukolsky, William T. Vetterling, Brian P. Flannery (2007), Numerical Recipes: The Art of Scientific Computing, Chapter 11: Eigensystems, pp. 563-597. Third edition, Cambridge University Press. ISBN:9780521880688 [90] J.G.F. Francis (1961), The QR Transformation - part 1, The Computer Journal, vol. 4, issue 3, pp. 265-271, doi:10.1093/comjnl/4.3.265 [91] J.G.F. Francis (1962), The QR Transformation - part 2, The Computer Journal, vol. 4, issue 4, pp. 332-345. [92] Vera N. Kublanovskaya, On some algorithms for the solution of the complete eigenvalue problem, USSR Computational Mathematics and Mathematical Physics, vol. 1, issue 3, pp 637-657. [93] G. H. Golub and C. F. Van Loan (1996), Matrix Computations, 3rd ed., Johns Hopkins University Press, Baltimore. ISBN:0801854148. [94] J. J. M. Cuppen (1981), A divide and conquer method for the symmetric tridiagonal eigenproblem, Numer. Math., vol. 36, pp. 177-195. [95] M. Gu and S. C. Eisenstat (1994), A stable and efficient algorithm for the rank-one modification of the symmetric eigenproblem, SIAM J. Matrix Anal. Appl., vol. 15, pp. 1266-1276. [96] M. Gu and S. C. Eisenstat (1995), A Divide-and-Conquer Algorithm for the Symmetric Tridiagonal Eigenproblem, SIAM J. Matrix Anal. Appl., vol. 16, pp. 172-191, doi:10.1137/S0895479892241287 [97] G. H. Golub and H. A. van der Vorst (2000), Eigenvalue computation in the 20th century, Journal of Computational and Applied Mathematics, vol. 123, issue 1-2, pp. 35-65. 105
  • 108.
    [98] J.W. Givens(1953), A method of computing eigenvalues and eigenvectors suggested by classical results on symmetric matrices, U.S. Nat. Bur. Standards App. Math., vol. 29, pp. 117-122. [99] J.W. Givens (1954), Numerical computation of the characteristic values of a real symmetric matrix. Oak Ridge National Laboratory, Report: ORNL-1574. [100] C. G. J. Jacobi (1846), ¨Uber ein leichtes Verfahren die in der Theorie der S¨acularst¨orungen vorkommenden Gleichungen numerisch aufzul¨osen. Journal f¨ur die reine und angewandte Mathematik, vol. 30, issue 30, pp. 51-94. [101] J. H. Wilkinson (1988), The Algebraic Eigenvalue Problem, Oxford University Press, Inc., New York, USA. ISBN:0198534183 [102] J. W. Demmel and K. Veselic (1992), Jacobi’s method is more accurate than QR, SIAM J. Matrix Anal. Appl., vol. 13, pp. 1204-1246. [103] John H. Mathews and Kurtis D. Fink (2004), Numerical Methods: Using Matlab, Fourth Edition, Prentice-Hall Pub. Inc., NJ, USA. ISBN:0130652482 [104] B.N. Parlett (1980), The Symmetric Eigenvalue Problem, Prentice-Hall Series in Computational Mathematics, Prentice Hall, Englewood Cliffs, N.J, USA. ISBN:0138800472 [105] W. E. Arnoldi (1951), The principle of minimized iterations in the solution of the matrix eigenvalue problem, Quarterly of Applied Mathematics, vol. 9, pp. 17-29. [106] Y. Saad (1992), Numerical Methods for Large Eigenvalue Problems, Halsted Press, Div. of John Wiley and Sons, Inc., New York, USA. [107] Y. Saad (1980), Variations of Arnoldi’s method for computing eigenelements of large unsymmetric matrices, Linear Algebra and Its Applications, vol. 34, pp. 269-295. [108] D. C. Sorensen (1992), Implicit application of polynomial filters in a k-step Arnoldi method, SIAM Journal on Matrix Analysis and Applications, vol. 13, issue 1, pp. 357-385. [109] C. Lanczos (1950), An iteration method for the solution of the eigenvalue problem of linear differential and integral operators, J. Res. Nat’l Bur. Std. 45, pp. 225-282. 106
  • 109.
    [110] G.W. Stewart(2001), Matrix Algorithms, Volume II: Eigensystems, SIAM, Chapter 5, pp. 306-367. ISBN:0470218207 [111] Jane K. Cullum and Ralph A. Willoughby (2002), Lanczos Algorithms for Large Symmetric Eigenvalue Computations, vol. 1, SIAM, Philadelphia, USA. ISBN:0817630589 [112] B. N. Parlett and D. S. Scott (1979), The Lanczos algorithm with selective orthogonalization, Mathematics of Computation, vol. 33, issue 145, pp. 217-238. [113] Chang San-Cheng (1986), Lanczos algorithm with selective reorthogonalization for eigenvalue extraction in structural dynamic and stability analysis, Computers and Structures vol. 23, issue 2, pp. 121-128. doi:10.1016/0045-7949(86)90206-3 [114] Andrew V., Knyazev (2001), Toward the Optimal Preconditioned Eigensolver: Locally Optimal Block Preconditioned Conjugate Gradient Method, SIAM Journal on Scientific Computing, vol. 23, issue 2, 517-541. doi:10.1137/S1064827500366124 [115] E. R. Davidson (1975), The Iterative Calculation of a Few of the Lowest Eigenvalues and Corresponding Eigenvectors of Large Real Symmetric Matrices, J. Comput. Phys., vol. 17, pp. 87-94. [116] Z. Bai, J. Demmel, J. Dongarra, A. Ruhe, and H. van der Vorst (2000), Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide, SIAM, Philadelphia, USA. [117] E. R. Davidson (1993), Monster matrices: Their eigenvalues and eigenvectors, Comput. Phys., vol. 7, pp. 519-522. [118] G. L. G. Sleijpen and H. A. van der Vorst (1996), A Jacobi-Davidson iteration method for linear eigenvalue problems, SIAM J. Matrix Anal. Appl., vol. 17, pp. 401-425. [119] M.E. Hochstenbach, Y. Notay (2006), The Jacobi-Davidson method, GAMM Mitteilungen, vol. 29, issue 2, pp. 368-382. ISSN:09367195 [120] P. Arbenz and M. E. Hochstenbach (2004), A Jacobi-Davidson method for solving complex symmetric eigenvalue problems SIAM J. Sci. Comput., vol. 25, pp. 1655- 1673. doi:10.1137/S1064827502410992 107
  • 110.
    [121] T. Sakuraiand H. Sugiura (2003), A projection method for generalized eigenvalue problems, Journal of Computational and Applied Mathematics, vol. 159, issue 1, pp. 119-128. doi:10.1016/S0377-0427(03)00565-X [122] T. Sakurai and H. Tadano (2007), CIRR: a Rayleigh-Ritz type method with contour integral for generalized eigenvalue problems, Hokkaido Mathematical Journal, vol. 36, pp. 745-757. [123] E. Polizzi (2009), Density-Matrix-Based Algorithms for Solving Eigenvalue Problems, Phys. Rev. B., vol. 79, 115112. [124] Martin Galgon, Lukas Kramer, and Bruno Lang (2011), The FEAST algorithm for large eigenvalue problems, PAMM. Proc. Appl. Math. Mech., vol. 11, pp. 747-748. doi:10.1002/pamm.201110363 [125] J. H. Wilkinson, C. Reinsch (1971), Handbook for Automatic Computation, Vol. 2: Linear Algebra, Grundlehren Der Mathematischen Wissenschaften, vol. 186, Springer-Verlag. ISBN: 978-0387054148 [126] G.L.G. Sleijpen, H.A. Van der Vorst (2000), A Jacobi-Davidson iteration methods for linear eigenvalue problems, SIAM Rev., vol. 42, pp. 267-293. [127] R.B. Lehoucq, D.C. Sorensen, C. Yang (1998), ARPACK Users Guide: Solution of Large-Scale Eigenvalue Problems with Implicitly Restarted Arnoldi Methods, SIAM, Philadelphia, USA. [128] A. Stathopoulos, J.R. McCombs (2010), PRIMME: preconditioned iterative multimethod eigensolver methods and software description, ACM Trans. Math. Softw. (TOMS), vol. 37, issue 2, pp. 1-30. [129] V. Hernandez, J.E. Roman, V. Vidal (2005), SLEPc: A scalable and flexible toolkit for the solution of eigenvalue problems, ACM Trans. Math. Softw. (TOMS), vol. 31, issue 3, pp. 351-362. Special issue on the Advanced Computational Software (ACTS) Collection. [130] A. Dziekonski, A. Lamecki, M. Mrozowski (2011), A memory efficient and fast sparse matrix vector product on a GPU, Prog. Electromagn. Res., vol. 116, pp. 49-63. 108
[131] F. Smailbegovic, G. N. Gaydadjiev, S. Vassiliadis (2005), Sparse Matrix Storage Format, 16th Annual Workshop on Circuits, Systems and Signal Processing, ProRISC 2005, Veldhoven, 17-18 November 2005.
[132] S. Pescetelli, A. Di Carlo, P. Lugli (1997), Conduction Band Mixing in T- and V-shaped quantum wires, Phys. Rev. B, vol. 56, 1668.
[133] G. Grosso, L. Martinelli, G. Pastori Parravicini (1995), Lanczos-type algorithm for excited states of very-large-scale quantum systems, Phys. Rev. B, vol. 51, pp. 13033-13038.
[134] Nirav Harish Kapadia (1994), A SIMD Sparse Matrix-Vector Multiplication Algorithm For Computational Electromagnetics And Scattering Matrix Models, ECE Technical Reports. http://docs.lib.purdue.edu/ecetr/200/
[135] Shameem Akhter and Jason Roberts (2006), Multi-Core Programming: Increasing Performance through Software Multithreading, Intel Press. ISBN:0976483246, 9780976483243
[136] Kamran Karimi, Neil G. Dickson, Firas Hamze, High Performance Physics Simulations Using Multi-Core CPUs and GPGPUs in a Volunteer Computing Context, D-Wave Systems Inc., British Columbia, Canada. http://arxiv.org/pdf/1004.0023
[137] Nathan Bell, Michael Garland (2009), Implementing sparse matrix-vector multiplication on throughput-oriented processors, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, Portland, Oregon, 14-20 November 2009.
[138] I. Reguly, M. Giles (2012), Efficient sparse matrix-vector multiplication on cache-based GPUs, Innov. Parallel Comput. IEEE, pp. 1-12.
[139] Luciano Colombo, William Sawyer and Djordje Maric (1995), A Parallel Implementation of Tight-Binding Molecular Dynamics Based on Reordering of Atoms and the Lanczos Eigen-Solver, MRS Proceedings, vol. 408, p. 107. doi:10.1557/PROC-408-107
[140] Luca Bergamaschi, Giorgio Pini, Flavio Sartoretto (2003), Computational experience with sequential and parallel, preconditioned Jacobi-Davidson for large, sparse symmetric matrices, Journal of Computational Physics, vol. 188, issue 1, pp. 318-331. doi:10.1016/S0021-9991(03)00190-6
[141] M. Camara, A. Mauger, and I. Devos (2002), Electronic structure of the layer compounds GaSe and InSe in a tight-binding approach, Phys. Rev. B, vol. 65, 125206.
[142] Steven E. Laux (2012), Solving complex band structure problems with the FEAST eigenvalue algorithm, Phys. Rev. B, vol. 86, 075103.
[143] Alan R. Levin, Deyin Zhang, Eric Polizzi (2012), FEAST fundamental framework for electronic structure calculations: Reformulation and solution of the muffin-tin problem, Computer Physics Communications, vol. 183, issue 11, pp. 2370-2375. doi:10.1016/j.cpc.2012.06.004
[144] R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. van der Vorst (1994), Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, SIAM, Philadelphia, PA.
[145] G. L. G. Sleijpen, J. G. L. Booten, D. R. Fokkema, and H. A. van der Vorst (1996), Jacobi-Davidson type methods for generalized eigenproblems and polynomial eigenproblems, BIT, vol. 36, pp. 595-633.
[146] M. E. Hochstenbach, G. L. G. Sleijpen (2008), Harmonic and refined Rayleigh-Ritz for the polynomial eigenvalue problem, Numerical Linear Algebra with Applications, vol. 15, issue 1, pp. 35-54.
[147] Y. Saad (2003), Iterative Methods for Sparse Linear Systems, 2nd edition, Society for Industrial and Applied Mathematics. ISBN:9780898715347
[148] Y. Saad and M. H. Schultz (1986), GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems, SIAM J. Sci. Stat. Comput., vol. 7, pp. 856-869. doi:10.1137/0907058
[149] E. Polizzi (2012), A High-Performance Numerical Library for Solving Eigenvalue Problems, FEAST solver User's guide. arxiv.org/abs/1203.4031
[150] D. R. Fokkema, G. L. G. Sleijpen, H. A. van der Vorst (1996), Generalized conjugate gradient squared, Journal of Computational and Applied Mathematics, vol. 71, pp. 125-146.
[151] Michele Benzi (2002), Preconditioning techniques for large linear systems: a survey, Journal of Computational Physics, vol. 182, pp. 418-477.
[152] Stefano Sanguinetti, Claudio Somaschini, Sergio Bietti and Nobuyuki Koguchi (2011), Complex Nanostructures by Pulsed Droplet Epitaxy, Nanomaterials and Nanotechnology, vol. 1, issue 1, pp. 14-17.
[153] Daniele Barettin, Matthias Auf der Maur, Alessandro Pecchia, Walter Rodrigues et al. (2015), Realistic model of LED structure with InGaN quantum-dots active region, abstract submitted to the International IEEE Conference on Nanotechnology (IEEE NANO 2015), Rome, Italy.
[154] R. M. Camacho, M. V. Pack, J. C. Howell, A. Schweinsberg, and R. W. Boyd (2007), Wide-Bandwidth, Tunable, Multiple-Pulse-Width Optical Delays Using Slow Light in Cesium Vapor, Phys. Rev. Lett., vol. 98, issue 15, 153601.
[155] Wen-Hsuan Kuan, Chi-Shung Tang and Cheng-Hung Chang (2007), Spectral properties and magneto-optical excitations in semiconductor double rings under Rashba spin-orbit interaction, Phys. Rev. B, vol. 75, issue 15, 155326.
[156] Luis G. G. V. Dias da Silva, José M. Villas-Bôas and Sergio E. Ulloa (2007), Tunneling and optical control in quantum ring molecules, Phys. Rev. B, vol. 76, issue 15, 155306.
[157] F. Carreño, M. A. Antón, Sonia Melle, Oscar G. Calderón, E. Cabrera-Granado, Joel Cox, Mahi R. Singh and A. Egatz-Gómez (2014), Plasmon-enhanced terahertz emission in self-assembled quantum dots by femtosecond pulses, J. Appl. Phys., vol. 115, issue 6, 064304.
[158] Gwyddion - Free SPM (AFM, SNOM/NSOM, STM, MFM) data analysis software, http://gwyddion.net/
[159] D. Barettin, R. De Angelis, P. Prosposito, M. Auf der Maur, M. Casalboni, A. Pecchia (2014), Model of a realistic InP surface quantum dot extrapolated from atomic force microscopy results, Nanotechnology, vol. 25, issue 19, 195201. doi:10.1088/0957-4484/25/19/195201
[160] F. Sacconi, M. Auf der Maur, A. Di Carlo (2012), Optoelectronic Properties of Nanocolumn InGaN/GaN LEDs, IEEE Transactions on Electron Devices, vol. 59, issue 11, pp. 2979-2987. doi:10.1109/TED.2012.2210897
[161] C. Bocklin, R. G. Veprek, S. Steiger and B. Witzigmann (2010), Computational study of an InGaN/GaN nanocolumn light-emitting diode, Phys. Rev. B, vol. 81, 155306. doi:10.1103/PhysRevB.81.155306
Abbreviations

AlGaN: Aluminium Gallium Nitride
AlGaAs: Aluminium Gallium Arsenide
CPU: Central Processing Unit
CUDA: Compute Unified Device Architecture
CAD: Computer-Aided Design
CB: Conduction Band
CSR: Compressed Sparse Row
CGS: Conjugate Gradient Squared Method
DMA: Direct Memory Access
DFT: Density Functional Theory
ETB: Empirical Tight Binding
Eg: Energy gap
FMA: Fused Multiply-Add
GaN: Gallium Nitride
GaAs: Gallium Arsenide
GPU: Graphics Processing Unit
GMRES: Generalized Minimal Residual Method
H: Hamiltonian matrix
HPC: High Performance Computing
InGaN: Indium Gallium Nitride
InN: Indium Nitride
ILU: Incomplete LU
JD: Jacobi-Davidson
LED: Light Emitting Diode
LCAO: Linear Combination of Atomic Orbitals
MP: Multi-Processing
MPI: Message Passing Interface
MIMD: Multiple Instruction Multiple Data
MOI: Memory Optimized Implementation
OpenMP: Open Multi-Processing
SMX: Next-generation Streaming Multiprocessor
SM: Streaming Multiprocessor
SPD: Spatial Probability Density
SFU: Special Function Unit
SIMD: Single Instruction Multiple Data
spMV: Sparse Matrix-Vector Multiplication
TB: Tight-Binding
VCA: Virtual Crystal Approximation
VB: Valence Band
List of Figures

2.1 Schematic comparison of CPU and GPU structure (Source: NVIDIA) . . . 16
2.2 Full chip block diagram of a Kepler microarchitecture based GPU (Source: NVIDIA) . . . 19
2.3 Architectural overview of the next-generation streaming multiprocessor (SMX) within the Kepler microarchitecture (Source: NVIDIA) . . . 20
2.4 Warp scheduler within next-generation streaming multiprocessors (Source: NVIDIA) . . . 21
2.5 Kepler GPU memory hierarchy (Source: NVIDIA) . . . 22
2.6 Direct peer-to-peer data transfer between two GPUs using GPUDirect (Source: NVIDIA) . . . 24
2.7 (Left) Grid of thread blocks (Source: NVIDIA). (Right) CUDA execution model . . . 25
4.1 Conical wurtzite GaN/AlGaN quantum dot with 30% Al. Atomistic description: in yellow Aluminium, in red Gallium . . . 41
4.2 Performance of the spMV operation on GPU employing different data types . . . 48
4.3 (Left) Typical sparsity pattern of a TB Hamiltonian and partitioning over four nodes. (Right) Data exchanged between adjacent nodes . . . 49
4.4 Memory utilization by the TB Hamiltonian matrix on GPU . . . 52
4.5 Time comparison of the Lanczos iteration using MPI-OpenMP on an HPC cluster connected via InfiniBand . . . 54
4.6 Time taken per Lanczos iteration for different implementations and technologies . . . 54
4.7 Performance comparison for the Lanczos iteration between different implementations and technologies . . . 55
4.8 Speed comparison for spMV between implementations on each of the technologies . . . 56
5.1 (Left) Cubical wurtzite GaN/AlGaN quantum dot showing the core with 30% Aluminum. (Right) A central slice of the cube. Atomistic description: in yellow Aluminum, in red Gallium . . . 64
5.2 Time comparison between methods on 1 Kepler GPU for the calculation of 8 energy eigenstates . . . 66
5.3 Time comparison between methods on 4 Kepler GPUs for the calculation of 8 energy eigenstates . . . 66
5.4 Scaling of the Lanczos method on 1 to 4 GPUs . . . 68
5.5 Scaling of the Jacobi-Davidson (subspace in host memory) method on 1 to 4 GPUs . . . 68
5.6 Scaling of the FEAST method on 1 to 4 GPUs . . . 69
5.7 Percentage of time taken for memory and compute operations on (Left) 1 GPU and (Right) 4 GPUs respectively . . . 70
5.8 Memory consumption between methods on 1 GPU . . . 72
5.9 Memory consumption between methods on 4 GPUs . . . 73
5.10 Time performance comparison between the Lanczos, Jacobi-Davidson and FEAST methods on 4, 8, 16 and 32 nodes of the HPC cluster for the calculation of 8 energy eigenstates . . . 74
5.11 Scaling of the Lanczos method on 4, 8, 16 and 32 nodes of the HPC cluster . . . 75
5.12 Scaling of the Jacobi-Davidson (subspace in host memory) method on 4, 8, 16 and 32 nodes of the HPC cluster . . . 75
5.13 Scaling of the FEAST method on 4, 8, 16 and 32 nodes of the HPC cluster . . . 76
6.1 Atomic force microscope images of the GaAs/Al0.3Ga0.7As complex quantum dot/ring nanostructure (Source: Sanguinetti (2011)) . . . 79
6.2 (Below) Lateral view, (Above) Top view: geometry of the dot/ring complex nanostructure . . . 80
6.3 Partly sliced GaAs/Al0.3Ga0.7As complex quantum dot/ring nanostructure with 30% Al, 70% Ga. Atomistic description: in pink Aluminum, in blue Gallium . . . 81
6.4 Electron states using the ETB method for varying radius of the quantum dot while the rest of the geometry of the complex nanostructure is kept fixed . . . 81
6.5 SPD for the first 8 electron states using the ETB method for the quantum dot with radius = 8 nm . . . 82
6.6 Evolution of eigenenergies with quantum dot radius. The lines connect states identified as having the same wave function symmetry . . . 82
6.7 Probability density for lambda states in the quantum dot with radius = 6.2 nm, overlapping between states B, C and H . . . 83
6.8 Probability density for lambda states in the quantum dot with radius = 6.5 nm, overlapping between (Left) states B and F and (Right) states C and E . . . 83
6.9 InGaN quantum dot with varying Indium content derived from experimental high-resolution transmission electron microscopy . . . 84
6.10 A central slice of the InGaN quantum dot with 19% Indium randomly distributed. Atomistic description: in red Indium, in white Gallium . . . 85
6.11 InGaN quantum dot with uniform Indium content. Description: in red 19% Indium, in blue 0% Indium . . . 85
6.12 Electronic ground states obtained from the ETB calculation of the InGaN quantum dot with random Indium content . . . 85
6.13 Electronic ground states obtained from the ETB calculation of the InGaN quantum dot with uniform Indium content . . . 86
7.1 Performance of the Lanczos implementation benchmarked on different technologies . . . 88
7.2 Performance of the Lanczos, Jacobi-Davidson (JD) and FEAST implementations benchmarked on different technologies . . . 90
List of Tables

3.1 Detailed list of available software packages for large-scale eigenproblems . . . 38
4.1 Results for energy eigenstate calculation using CUDA on an Nvidia Kepler K20c GPU (Test system 1) . . . 52
4.2 Results for energy eigenstate calculation using the MPI-CUDA implementation running on two Nvidia Kepler K20c GPUs (Test system 1) . . . 53
4.3 Results for energy eigenstate calculations using MPI-OpenMP (Test system 2) . . . 55
5.1 Profiler output for the 151,472-atom quantum dot, listing the most significant compute operations within the Jacobi-Davidson method with subspace stored in host memory . . . 71
5.2 Profiler output for the 151,472-atom quantum dot, listing the most significant compute operations within the Lanczos method . . . 71
5.3 Profiler output for the 151,472-atom quantum dot, listing the most significant compute operations within the CGS method (linear solver for FEAST) . . . 71
OLABs: Optoelectronics & Nanoelectronics Laboratory
Printed in Rome, Italy, May 2015