1. National Technology Programme A2 sub-programme; TeraTomo
National Technology Programme
Year of proposal submission: 2008
Name of sub-programme / dedicated call: Competitive Industry (A2)
Project acronym: TeraTomo
Title of project: Development of a teraflop capacity image reconstruction system for various medical tomography devices used for diagnosis
Name of co-ordinator enterprise: MEDISO Medical Equipment Developing and Services Ltd., Budapest
Name of project leader (person representing the consortium): Illés Müller
Table of Contents
1. Detailed description of the project
1.1. Work plan
1.1.1. Objectives of the project
1.1.2. Innovative nature of the objectives in Hungarian and international context
1.1.3. Summary of the activities preceding the project
1.1.4. R&D activities and achieved results of consortium members
1.1.5. Description of the activities of the proposed project
1.1.6. Dissemination plan
1.1.7. Exploitation plan
1.1.8. References
1.2. Activity periods of the project
1.2.1. Work packages accomplished by Mediso Ltd. by reporting dates 1.-3.
1.2.2. Work packages accomplished by BUTE by reporting dates 1.-3.
1.2.3. Work packages accomplished by Semmelweis Univ. Fac. of Med. Department of Diagnostic Radiology and Oncotherapy by reporting dates 1.-3.
1.3. Project monitoring indicators
1.4. Description of the professional activities of applicants
1.4.1. Description of the professional activities of applicant organizations
1.4.2. Description of the professional activities of persons having a key role in the project
1.5. Description of project management
1.6. Description of projects and project proposals of similar topics
1.6.1. Mediso Ltd.
1.6.2. BUTE
1.6.3. SU DDRO
1.7. Link to the projects of the European Community
1.8. Financial plan
1.1. Work plan
1.1.1. Objectives of the project
Twenty-first century medicine would be inconceivable without tomographic diagnostic imaging
equipment. For a patient with a carcinoma, it can be literally vital that the doctor obtains the
most accurate information possible about the changes in his or her body. Locating the tumor is only
part of the problem: it is at least as important to determine its nature, that is, whether it is benign
or malignant, whether it consists of living or dead cells, and whether metastases exist and, if so,
where. Combining different devices based on radiation detection provides simultaneous information on
the location of anatomical changes (CT¹) and on molecular- and cell-level function and metabolic
processes (SPECT², PET³).
The first multimodal devices (PET/CT, SPECT/CT) appeared just ten years ago, but they have had a
revolutionary effect on diagnostic methods. Multimodal imaging helps to identify tumor-related
diseases such as breast cancer, colon and small intestine tumors, head and neck cancers, and brain
tumors earlier and more accurately. This information provides invaluable help in the early
recognition of several tumor, cardiac, and cerebrospinal diseases and in determining and monitoring
the applied therapies.
One of the most important benchmarks of technical progress is image quality and, with it, the
information content of the diagnosis. The essential measures of imaging quality are sensitivity and,
in particular, spatial, temporal, and contrast resolution. These values are influenced by every part
of the system: the detectors, the signal processing electronics, the corrections applied during data
collection, and the image processing. During image reconstruction we try to recover the original
spatial distribution from the acquired projection data.
The extraordinarily rapid technical development of imaging systems has resulted in a data boom:
today we can acquire far more data during a single examination than we can process within realistic
limits of time and resources. The speed of the examination is also an important parameter of the
imaging process, and not only for financial reasons (patient throughput). The main trend is to
minimize the time required for the examination and the data processing.
Imaging methods are known in the literature that provide better image quality than the diagnostic
methods currently used in clinical practice. The problem is that these methods are very
time-consuming: they need several hours (or much more) even on the latest processing workstations,
so their use in clinical practice is very limited.
The aim of the project is to create a high-performance computing system and software library, using
parallel computing methods, that substantially accelerate the time-consuming 3D reconstruction
processes so that they become applicable as routine clinical procedures. The essence of the
development is that image quality can be improved notably with the same imaging equipment, by
software processing alone. Accordingly, the system can be used not only in new equipment but can
also be built into previously installed tomography systems of the consortium leader.
¹ Computed Tomography
² Single Photon Emission Computed Tomography
³ Positron Emission Tomography
The quality of the new systems will be rigorously analyzed, since medical validation is necessary
before clinical use. Improved spatial resolution, contrast, and signal-to-noise ratio have several
advantages:
• more accurate diagnosis
• the radiation dose can be decreased thanks to the better signal-to-noise ratio
• the duration of the examination can be decreased, so patient throughput increases
The members of the consortium are MEDISO Medical Equipment Developing and Services Ltd.,
Budapest University of Technology and Economics (BUTE), and Semmelweis University (SU).
At SU, the Faculty of Medicine Department of Diagnostic Radiology and Oncotherapy (SU FOM DDRO)
is involved; at BUTE, the Institute of Nuclear Techniques (INT) and the Department of Control
Engineering and Information Technology (DCEIT) participate in the proposed development.
1.1.2. Innovative nature of the objectives in Hungarian and
international context
Currently, only CPU clusters are used in the industry to execute high-performance reconstruction
algorithms for medical imaging systems. The major drawback of such systems is their enormous size
and power consumption, combined with a high price.
High-performance solutions with a compact design that can be shipped with the gantry can be built
on GPUs, FPGAs, and Cell BE processors. GPU and Cell clusters are the subject of the latest
scientific and industrial studies.
Our goal is to design and build a reconstruction system with an exceptional ratio of performance to
price, power consumption, and size. This task involves establishing the hardware and software
platform, adapting reconstruction algorithms to parallel and distributed architectures, and
implementing an industrial software product. The following subsections introduce the
high-performance technologies mentioned above.
High performance scientific computing systems
Today's high performance scientific computing systems involve multiprocessor supercomputers, PC
clusters performing symmetric multiprocessing (SMP) or batched computing, graphics hardware
boards with GPUs (Graphics Processing Unit), the Cell processor of the IBM, Sony, and Toshiba
consortium with MIMD architecture (Multiple Instruction Multiple Data), and finally FPGA
semiconductor devices (Field-Programmable Gate Array) consisting of programmable logic
components and programmable interconnects.
The Cell processor has so far been used in the new Sony PlayStation 3, and Mercury, Inc. has also
announced products that use the Cell Broadband Engine in various accelerator configurations. The
Cell processor runs at twice the clock speed of NVIDIA's G80 GPU. However, its peak performance is
256 GFLOPS, which is half that of the G80, possibly because it has fewer pipelines. While it is
tempting to compare GPUs and the Cell just by the number of pipelines, clock speeds, and GFLOP
performance, only the particular application and its implementation can really tell which is more
favorable to use [Xu, 2007][Scherl, 2007][Mueller, 2007].
Finally, another way to obtain a high-performance architecture is to configure an FPGA. FPGAs
are generally slower than application-specific integrated circuits (ASICs) and can only implement
designs of moderate complexity. Nevertheless, they allow a specific algorithm to be hard-coded
without incurring the overheads of instruction-style programming.
GPUs for general-purpose computing
The demand for visual and physical realism in computer games, backed by heavy consumer spending,
has given rise to an incredible performance growth in PC-scale computing platforms. Peak
performances of 500 GFlops (1 GFlops = 10⁹ floating point operations per second) and more are now
possible, which equals the peak performance of a forty-processor Cray X1 supercomputer with a
$500 000 price tag. In contrast, the most popular of these PC-scale computing platforms, boards
hosting GPUs, fit into the PCI Express slot of any standard desktop computer and are available for
$500 or less at any computer outlet.
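The price comparison above can be made explicit with a little arithmetic. The following is a minimal sketch that uses only the peak figures quoted in the text; the function name `gflops_per_dollar` is introduced here for illustration, and peak values say nothing about sustained performance on a real reconstruction task.

```python
def gflops_per_dollar(peak_gflops: float, price_usd: float) -> float:
    """Peak floating point throughput bought per dollar."""
    return peak_gflops / price_usd


# Peak figures as quoted in the text (not sustained performance).
cray_x1 = gflops_per_dollar(500.0, 500_000.0)  # forty-processor Cray X1
gpu_board = gflops_per_dollar(500.0, 500.0)    # consumer GPU board

# How many times more peak performance per dollar the GPU board offers.
advantage = gpu_board / cray_x1

print(f"Cray X1:   {cray_x1:.3f} GFlops per dollar")
print(f"GPU board: {gpu_board:.3f} GFlops per dollar")
print(f"Price-performance advantage: {advantage:.0f}x")
```

At equal peak performance the advantage is simply the price ratio, which is why price per GFlops is a convenient first-order figure of merit when comparing platforms.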
In fact, due to their increasing levels of programmability and flexibility, these computing
platforms have also found use in a much wider range of general computing applications, including
those formerly only achievable with supercomputers. GPU clusters targeted at large-scale scientific
computing have emerged in recent years. The use of GPUs for general-purpose computing has become
popularly known as General Purpose GPU (GPGPU), and an extensive website, http://www.gpgpu.org/,
logs many of these works. For example, GPGPU has enabled the acceleration of computations in
domains such as signal processing, database processing, computer vision, image processing, and also
medical imaging.
GPUs are, in some ways, similar to the vector processing units of supercomputers, which, in
contrast to sequential instruction-driven CPUs with von Neumann architecture, require the decoding
of only a single instruction for a long vector of data. But unlike traditional vector processing
hardware, GPUs can perform multiple operations per data item, which is more efficient in terms of
memory bandwidth, since memory operations are the major bottleneck of recent computer
architectures. In this regard, GPUs share more features with the later streaming architectures.
The low performance, small memory, and low-level programmability of GPUs before 2000 limited
their applicability in general-purpose applications. Recent GPUs have none of these
shortcomings. They are generally programmable, have full IEEE single-precision floating point
arithmetic (double precision is planned for the end of 2008), their memory capacity has reached
0.5 to 1.5 gigabytes, and thanks to the PCI Express bus they also have fast data transfer rates from
main memory to the GPU, where the data transfer may overlap with ongoing computations. Finally,
they can be run in multi-board configurations in a single PC.
GPU applications for stream-based computing
The overall target of GPUs has not changed since the early Silicon Graphics (SGI) architectures:
to feed a rectangular array of screen pixels with a massive amount of data, consisting of textures
and geometry. Such a process can be modeled as a continuous data stream that is consumed by a
pipeline of processing kernels with a minimum of data dependencies. Since all pixels are
composed via the same underlying forward-streaming computation, an inherent SIMD (Single
Instruction, Multiple Data) processing model can be employed. This model keeps processor design
simple and raises the percentage of chip area dedicated to arithmetic, in contrast to CPUs, where
over 50 percent of the chip area goes into caches and cache management. Compared with the projected
growth of general-purpose CPUs, which is predicted by Moore’s law and amounts to a doubling of
benchmark performance every 18 months, GPU performance has grown at a rate 3 times better than
that of Moore’s law.
However, GPUs are not general-purpose processors. They can only live up to their full potential when
presented with a computational task that fits the underlying streaming SIMD programming model. A
few general rules of thumb are presented here:
1. It is better to focus each pipeline onto a single output data element than onto many output data
elements. The pipeline cannot be looped back, so iterative calculations must be computed in
several passes. Fortunately, one can often reformulate the computation to obtain a GPU-suitable
program flow.
2. The same goes for programs that have a large number of conditional control statements
(conditions and loops): they should be reformulated where possible. The graphics hardware
operates at the highest performance when all threads execute the same instructions.
3. It is also advisable to have a sufficient number of data items available to fill the pipelines and
get a data stream going. The memory is optimized not for latency but for bandwidth, which
favors the streaming model. Sometimes it is better to compute an intermediate result than to
query a lookup table. As in any current computational platform, memory is still a bottleneck,
but GPUs excel (over CPUs) at computational speed.
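The first rule above can be illustrated by writing the same computation in two ways. The sketch below is plain Python rather than GPU code, and the function names and the small smoothing kernel are assumptions chosen only for illustration. In the scatter form every input element writes to several outputs, which on parallel hardware would cause conflicting writes; in the gather form every output element is produced independently by reading the inputs it needs, which is the pattern that maps well onto one pipeline per output element.

```python
def scatter_blur(signal, weights):
    """Scatter form: each INPUT element adds its contribution to several outputs."""
    n, r = len(signal), len(weights) // 2
    out = [0.0] * n
    for i, value in enumerate(signal):        # one "thread" per input element
        for k, w in enumerate(weights):
            j = i + k - r
            if 0 <= j < n:
                out[j] += w * value           # several inputs write to the same output
    return out


def gather_blur(signal, weights):
    """Gather form: each OUTPUT element reads the inputs that contribute to it."""
    n, r = len(signal), len(weights) // 2
    out = []
    for j in range(n):                        # one "thread" per output element
        acc = 0.0
        for k, w in enumerate(weights):
            i = j - (k - r)
            if 0 <= i < n:
                acc += w * signal[i]          # reads only, a single write per output
        out.append(acc)
    return out


signal = [0.0, 1.0, 4.0, 2.0, 0.0]
weights = [0.25, 0.5, 0.25]
scattered = scatter_blur(signal, weights)
gathered = gather_blur(signal, weights)
print(scattered)
print(gathered)
```

Both functions compute the same result; only the loop that would be parallelized differs. Backprojection in tomographic reconstruction can be organized in the same gather style, with each voxel reading the projection samples it needs.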
There are two interfaces available for programming GPUs. The first is the Shader Model, which is
designed specifically for graphics applications and is supported by both main manufacturers,
NVIDIA and AMD (ATI). Alternatively, the graphics hardware can be viewed as a general stream
processor. For this purpose, NVIDIA introduced CUDA (Compute Unified Device Architecture)
and AMD designed CTM (Close To Metal). These interfaces allow users to write high-performance
programs for any compute-intensive task in the standard C language. Previously, users needed to use
NVIDIA Cg (C for graphics) in conjunction with OpenGL or DirectX, which yielded shader assembly
code. In this context, one should also mention Lib Sh, a GPGPU programming library
implemented in C++, and BrookGPU, a streaming programming language from Stanford University
that extends the C programming language with streams and produces optimized shader code when
compiled.
Prior and related work in high-performance tomography reconstruction
The lack of programmability impeded the use of SGI hardware for serious GPGPU applications.
Nevertheless, medical imaging was one of the pioneering non-graphics applications to be run on
that platform. Cabral et al. [Cabral, 94] and Mueller and Yagel [Mueller, 2000] exploited high-end
SGI workstations for accelerated CT. The low 12-bit precision of the hardware was particularly
limiting in the iterative scenario. To cope with this, Mueller and Yagel devised a dual-channel
scheme that used the red and blue color channels to provide a pseudo 16-bit precision and that
enabled some of the accumulations to be done directly in the hardware.
The first paper discussing the use of PC-native GPU boards for CT was by Chidlow and
Möller [Möller, 2003], who implemented emission tomography on an NVIDIA GeForce 4 GPU card.
Although good speed-ups and good-quality reconstructions could be achieved, the potential of these
solutions remained limited by the 16-bit precision and restricted accumulation capabilities.
As a result, many operations still had to be performed on the slower CPU, which incurred expensive
data transfers.
Soon after, GPUs with full programmability and 32-bit floating point precision enabled a complete
GPU-resident CT reconstruction, with both analytical and iterative methods [Xu, 2005] and with large
data [Mueller, 2006], at a fidelity comparable to CPU-based methods. These more fundamental works
were followed by a number of papers targeting specific CT applications, all with impressive
speedup factors. Kole and Beekman [Goddard, 2002] accelerated the so-called OSC (ordered subset
convex) reconstruction algorithm, Xue et al. [Xue, 2006] accelerated fluoro-based CT for mobile
C-arm units, and Schiwietz et al. [Schiwietz, 2006] accelerated the backprojection and FFT
operations employed for MR k-space transforms. Finally, Xu and Mueller [Xu, 2006] used their
GPU-accelerated reconstruction framework to enable interactive volume visualization directly from
a full set of projection data.
FPGA-based solutions include the cone-beam CT application by Goddard and Trepanier
[Goddard, 2002], the 9-bit precision parallel-beam application by Leeser et al. [Leeser, 2002], and
the 16-bit precision cone-beam reconstruction by Li et al. [Li, 2005]. The reconstruction times are
similar when normalized to a common problem size: Goddard and Trepanier reconstruct a 512³ volume
from 300 cone-beam projections in 38.7 s using Feldkamp’s algorithm, while Li et al. solve the same
problem in 33.5 s. Extrapolating the parallel-beam results of Leeser et al. to this problem also
yields a reconstruction time of 37 s. Recently, the Cell processor was also used for cone-beam
reconstruction with Feldkamp’s algorithm [Cabral, 2005]. Using the Cell board available from
Mercury, Inc., a 512³ volume could be reconstructed from 512 projections in 13.6 s
[Kachelrieß, 2006]. In fact, the Mercury board, called a blade, hosts two Cell processors, and thus
a reconstruction could be achieved in half the time, that is, 6.8 s.
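The timings quoted above refer to different numbers of projections, so a direct comparison needs a common measure. One possible normalization, an assumption made here rather than a figure taken from the cited papers, is the number of voxel updates per second, counting one update per voxel per projection. The function name is illustrative.

```python
def voxel_updates_per_second(volume_side, projections, seconds):
    """Backprojection throughput, counting one update per voxel per projection."""
    return volume_side ** 3 * projections / seconds


# Problem sizes and timings as quoted in the text.
results = {
    "FPGA, Goddard and Trepanier": voxel_updates_per_second(512, 300, 38.7),
    "FPGA, Li et al.": voxel_updates_per_second(512, 300, 33.5),
    "Cell, one processor": voxel_updates_per_second(512, 512, 13.6),
    "Cell, two processors": voxel_updates_per_second(512, 512, 6.8),
}

for name, rate in results.items():
    print(f"{name:30s} {rate / 1e9:6.2f} G voxel updates/s")
```

This measure ignores differences in interpolation quality, numerical precision, and data transfer, so it only indicates the order of magnitude of the gap between the platforms.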
Role of Monte Carlo simulation methods in image reconstruction
Monte Carlo simulation is one of the most important tools in the theoretical development of modern
nuclear diagnostic equipment. The Monte Carlo method is a widely used numerical method in
different fields, e.g. in physics, meteorology, chemistry, and economics. It is based on random
number generation. If the system under consideration can be characterized by known probability
distributions, then the physical effects can be described statistically by Monte Carlo simulation.
In a Monte Carlo simulation we track the real physical effects by playing out every individual
elementary event, each characterized by its probability distribution, one by one.
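As a minimal illustration of playing out individual events from a known probability distribution, the sketch below estimates the fraction of photons that cross a uniform slab without interacting. The free path length is sampled from an exponential distribution, and the estimate is compared with the analytic Beer-Lambert value. The attenuation coefficient, slab thickness, and function name are illustrative assumptions, not values from the project.

```python
import math
import random


def simulate_transmission(mu, thickness, n_photons, seed=0):
    """Fraction of photons that cross a uniform slab without interacting.

    mu is the linear attenuation coefficient (1/cm), thickness is in cm.
    """
    rng = random.Random(seed)
    transmitted = 0
    for _ in range(n_photons):
        # Inverse-transform sampling of the free path: s = -ln(1 - u) / mu,
        # with u uniform in [0, 1), so the logarithm argument is never zero.
        path = -math.log(1.0 - rng.random()) / mu
        if path > thickness:
            transmitted += 1
    return transmitted / n_photons


mu, thickness = 0.15, 10.0            # illustrative values only
estimate = simulate_transmission(mu, thickness, n_photons=200_000)
exact = math.exp(-mu * thickness)     # Beer-Lambert law

print(f"Monte Carlo estimate: {estimate:.4f}")
print(f"Analytic value:       {exact:.4f}")
```

A realistic simulation would additionally track scattering, energy loss, and detector response for every photon, which is what makes the method both accurate and expensive.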
Monte Carlo methods have already been applied in medicine, e.g. in the development of medical
instruments, in therapy, and in the computation of dose maps. Their application in image
reconstruction, however, has a history of only about a decade. Incorporating a Monte Carlo model of
the real physical effects into the iterative reconstruction algorithms used for emission
tomography (SPECT, PET) results in a more precise reconstruction. In an iterative reconstruction
algorithm, the geometrical parameters and the detection probabilities are built into the system
matrix, which is traditionally created by analytic computations. Creating the system matrix by
Monte Carlo simulation can produce a more precise system model, resulting in better image
contrast, spatial resolution, and signal-to-noise ratio. These image parameters scale with the
radioactive dose injected into the patient; therefore, improving image quality by
mathematical-physical methods can lead to a reduction of the radioactive dose.
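To show where the system matrix enters an iterative reconstruction, the following sketch implements the standard MLEM (maximum-likelihood expectation-maximization) update for emission tomography on a toy problem. MLEM is chosen here as a common example; the text does not state which iterative algorithm the project uses. The 3 x 2 system matrix and the activity values are made up for illustration, and in practice the matrix entries would come from analytic or Monte Carlo modelling.

```python
def mlem(system_matrix, measured, iterations=50):
    """MLEM reconstruction for emission tomography.

    system_matrix[i][j] is the probability that an emission in voxel j
    is detected in measurement bin i.
    """
    n_bins, n_voxels = len(system_matrix), len(system_matrix[0])
    # Sensitivity of each voxel: total probability of being detected anywhere.
    sensitivity = [sum(system_matrix[i][j] for i in range(n_bins))
                   for j in range(n_voxels)]
    x = [1.0] * n_voxels                      # uniform, strictly positive start
    for _ in range(iterations):
        # Forward projection of the current estimate.
        projected = [sum(a * xj for a, xj in zip(row, x)) for row in system_matrix]
        # Ratio of measured to estimated counts in each bin.
        ratio = [y / p if p > 0 else 0.0 for y, p in zip(measured, projected)]
        # Backprojection of the ratios and multiplicative update.
        x = [x[j] / sensitivity[j]
             * sum(system_matrix[i][j] * ratio[i] for i in range(n_bins))
             for j in range(n_voxels)]
    return x


# Toy system: 3 measurement bins, 2 voxels (hypothetical numbers).
A = [[0.8, 0.1],
     [0.1, 0.8],
     [0.1, 0.1]]
true_activity = [10.0, 30.0]
y = [sum(a * t for a, t in zip(row, true_activity)) for row in A]  # noise-free data

reconstructed = mlem(A, y, iterations=50)
print(reconstructed)
```

With noise-free, consistent data the estimate approaches the true activity. The forward projection and backprojection inside the loop dominate the cost, and they are the steps a parallel implementation would accelerate.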
Nowadays Monte Carlo methods are used primarily for creating the system matrix of an iterative
reconstruction algorithm as part of the calibration process. Monte Carlo simulation has already
been used in an experimental study to describe the geometry and components of a PET detector
system without considering scattering and attenuating media within the phantom [Refecas 2004]. In
another study, the system matrix of a SPECT system was described by modelling scattering and
photon attenuation as well, assuming a homogeneous attenuating medium [Lazaro 2004] [Lazaro
2005]. In both cases the application of Monte Carlo simulation resulted in better spatial
resolution.
Theoretically, the most precise reconstruction algorithm would be an iterative process whose 3D
projector applies Monte Carlo simulation to model scattering and photon attenuation, using the
attenuation map derived from a CT scan.
The application of Monte Carlo methods has a bottleneck, however: they require extremely long
computation times to achieve a statistically sound result. It is therefore impractical to apply an
iterative reconstruction algorithm using the Monte Carlo method on conventional computers:
reconstructing a volume with such an algorithm would take several months.
However, the rapid evolution of HPC (High Performance Computing) platforms makes the
application of Monte Carlo methods in clinical equipment viable in the near future.
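The long computation times follow from the statistics of the method: the standard error of a Monte Carlo estimate decreases only with the square root of the number of simulated histories. The sketch below assumes simple binomial statistics for estimating a detection probability p; the value of p and the function names are hypothetical.

```python
import math


def relative_error(p, n_samples):
    """Relative standard error of a Monte Carlo estimate of a probability p
    obtained from n_samples independent histories (binomial statistics)."""
    return math.sqrt((1.0 - p) / (p * n_samples))


def samples_needed(p, target_relative_error):
    """Number of histories required to reach a given relative standard error."""
    return math.ceil((1.0 - p) / (p * target_relative_error ** 2))


p = 1e-4                                  # hypothetical detection probability
n_one_percent = samples_needed(p, 0.01)
n_half_percent = samples_needed(p, 0.005)

print(f"Histories for 1.0 % error: {n_one_percent:.3e}")
print(f"Histories for 0.5 % error: {n_half_percent:.3e}")
```

Halving the error requires four times as many histories, and this count applies to a single matrix element, which is why simulating a full system matrix calls for HPC platforms.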
Our goal in this project is to investigate how HPC methods can be used to reduce the execution time
of Monte Carlo simulations, in order to create a more precise system matrix for SPECT and PET
systems or to perform a more precise scatter and attenuation correction by using a CT scan.
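As a minimal sketch of the attenuation part of such a correction, the code below computes the probability that a photon survives a straight path through a row of voxels whose linear attenuation coefficients are taken from an attenuation map. The coefficient values, the step length, and the count value are hypothetical. Converting CT numbers to attenuation coefficients at the emission energy, and modelling scatter, are omitted.

```python
import math


def attenuation_factor(mu_values, step_cm):
    """Survival probability of a photon travelling through a row of voxels.

    mu_values: linear attenuation coefficients (1/cm) sampled along the path
    step_cm:   path length inside each sample (cm)
    """
    line_integral = sum(mu * step_cm for mu in mu_values)
    return math.exp(-line_integral)


# Hypothetical samples along one path: air, soft tissue, denser tissue, air.
path = [0.0, 0.096, 0.096, 0.17, 0.096, 0.0]
factor = attenuation_factor(path, step_cm=0.5)

measured_counts = 1200.0                      # hypothetical measurement
corrected_counts = measured_counts / factor   # attenuation-corrected value

print(f"Attenuation factor: {factor:.3f}")
print(f"Corrected counts:   {corrected_counts:.1f}")
```

In a Monte Carlo projector the same line integral determines how likely each simulated photon is to reach the detector, so it has to be evaluated for a very large number of paths.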