3D DIGITAL IMAGE PROCESSING FOR BIOFILM
QUANTIFICATION FROM CONFOCAL LASER
SCANNING MICROSCOPY
MULTIDIMENSIONAL STATISTICAL ANALYSIS OF BIOFILM
MODELING
A Thesis Submitted
to the Graduate School
University of Arkansas at Little Rock
in partial fulfillment of requirements
for the degree of
DOCTOR OF PHILOSOPHY
in Applied Science
in the Department of Applied Science
Engineering Science and Systems
August 2012
Jerzy S. Zielinski
M.S. from Warsaw University of Technology, Warsaw Poland, 2006
B.S. from Warsaw University of Technology, Warsaw Poland, 2002
This dissertation, “3D Digital Image Processing for Biofilm Quantification from
Confocal Laser Scanning Microscopy”, by Jerzy S. Zielinski, is approved by:
Dissertation Advisor: ____________________________________
Nidhal Bouaynaya
Assistant Professor of Systems Engineering
Dissertation Committee: ____________________________________
Seshadri Mohan
Professor of Systems Engineering
____________________________________
Yu-Po Chan
Professor of Systems Engineering
____________________________________
Hussain M. Al-Rizzo
Associate Professor of Systems Engineering
____________________________________
Craig Cooney
Investigator, Veteran's Affairs Medical Center
Program Coordinator: ____________________________________
Tansel Karabacak
Assistant Professor of Applied Science
Graduate Dean: ____________________________________
Patrick J. Pellicane
Professor of Construction Management
Fair Use
This thesis is protected by the Copyright Laws of the United States (Public Law
94-553, revised in 1976). Consistent with fair use as defined in the Copyright
Laws, brief quotations from this material are allowed with proper
acknowledgment. Use of this material for financial gain without the author’s
express written permission is not allowed.
Duplication
I authorize the Head of Interlibrary Loan or the Head of Archives at the
Ottenheimer Library at the University of Arkansas at Little Rock to arrange for
duplication of this thesis for educational or scholarly purposes when so requested
by a library user. The duplication will be at the user’s expense.
Signature _____________________________________________________
3D DIGITAL IMAGE PROCESSING FOR BIOFILM QUANTIFICATION FROM
CONFOCAL LASER SCANNING MICROSCOPY, MULTIDIMENSIONAL
STATISTICAL ANALYSIS OF BIOFILM MODELING, by Jerzy S. Zielinski,
August 2012
Abstract
The dramatic increase in the number and volume of digital images produced
in medical diagnostics, together with the escalating demand for rapid access to,
interpretation of, and retrieval of these data, has become a matter of paramount
importance to a modern healthcare system. Consequently, there is an ever-growing
need to process, interpret, and store images of various types. Because
human-dependent image analysis is costly and unreliable, it is necessary to
develop automated methods for feature extraction based on sophisticated
mathematical algorithms and reasoning.
This work focuses on digital signal and image processing of biological and
biomedical data in one-, two-, and three-dimensional space. The methods and
algorithms presented here were used to extract information from genomic
sequences, breast cancer images, and biofilm images. One-dimensional analysis
was applied to DNA sequences, which were treated as non-stationary sequences
and modeled by a time-dependent autoregressive moving average (TD-ARMA)
model. Two-dimensional analysis used a 2D-ARMA model to detect breast cancer
in X-ray mammograms and ultrasound images. Three-dimensional detection and
classification techniques were applied to biofilm images acquired using
confocal laser scanning microscopy.
Modern medical images are geometrically arranged arrays of data. The
broadening scope of imaging as a way to organize our observations of the
biophysical world has led to a dramatic increase in our ability to apply new
processing techniques and to combine multiple channels of data into
sophisticated and complex mathematical models of physiological function and
dysfunction. With the explosion in the amount of data produced in the field of
biomedicine, it is crucial to be able to construct accurate mathematical models of
the data at hand. The two main purposes of signal modeling are data size
reduction and parameter extraction. Specifically, in biomedical imaging there
are four key problems that were addressed in this work: (i) registration, i.e.
automated methods of data acquisition and the ability to align multiple data sets
with each other; (ii) visualization and reconstruction, i.e. the environment in which
registered data sets can be displayed on a plane or in multidimensional space;
(iii) segmentation, i.e. automated and semi-automated methods to create models
of relevant anatomy from images; (iv) simulation and prediction, i.e. techniques
that can be used to simulate the growth and evolution of the phenomenon under study.
Mathematical models can be used not only to verify experimental findings but
also to make qualitative and quantitative predictions that might serve as
guidelines for the future development of technology and/or treatment.
Acknowledgements
I would like to thank my advisor and mentor, Dr. Nidhal Bouaynaya for her
guidance, illuminating discussions related to this work and beyond,
encouragement and moral and financial support in this research. I also would like
to extend my gratitude to Dr. Seshadri Mohan, Dr. Yu-Po Chan, Dr. Hussain M.
Al-Rizzo and Dr. Craig Cooney for being part of my committee and for their
insights and interest in my work.
I am extremely grateful to my family who has been my greatest support.
This accomplishment is not mine alone. Thank you for sharing my struggles and
my victories. Thank you to my friends and colleagues for sharing my pauses and
supporting me during my ups and downs.
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
Problem Statement
Research Objectives
Motivation
Research Contributions
Organization
Chapter 2 Biology Background
Genomics
Breast Cancer
Biofilm
Chapter 3 Time-Dependent ARMA Modeling of Genomic Sequences
Abstract
Background
Methods
Mean-squares estimation
Least-squares estimation
Index of randomness
Results
Chapter 4 Two-Dimensional ARMA Modeling for Breast Cancer Detection and Classification
Abstract
Introduction
2D-ARMA Modeling
Yule-Walker Least-Squares Parameter Estimation
Tumor detection and classification
Simulations
Chapter 5 Statistical Sequential Analysis for Detection of Microcalcifications in Digital Mammograms
Abstract
Introduction
2D-ARMA representation
Change detection algorithm
Results
A. 2D ARMA Model
B. Change Detection Algorithm
Chapter 6 Automated Biofilm Region Recognition And Morphology Quantification from Confocal Laser Scanning Microscopy Imaging
Abstract
Introduction
Quantification of biofilm structure
Morphology quantification parameters
Image processing tool
Preprocessing and used methodology
Growth and CLSM of static biofilm
Results
Chapter 7 Three Dimensional Morphology Quantification of Biofilm Structures from Confocal Laser Scanning Microscopy Images
Abstract
Introduction
Average Run Length
Aspect Ratio
Average and Maximum Diffusion Distance
Biomass
Average Thickness
Application to CLSM Biofilm Images
Biofilm culture preparation and image acquisition
Segmentation and parameter quantification results
Conclusion and Recommendation
References
List of Tables
Table 1 Index of randomness of the coding and non-coding segments of various gene sequences
Table 2 Classification accuracy of cancerous and benign tumors
Table 3 Performance of the change detection algorithm in 524 normal and cancerous mammograms
Table 4 UAMS-1 sarA- results of biomass and average thickness calculations using the watershed algorithm and COMSTAT software
Table 5 UAMS-1 results of biomass and average thickness calculations using the watershed algorithm and COMSTAT software
Table 6 Average error, relative to manual calculations, across all layers in confocal imaging for the watershed algorithm and COMSTAT software
Table 7 Results of biofilm parameter quantification for stack 1 for 3D and 2D segmentations in comparison with the ground truth
Table 8 Results of biofilm parameter quantification for stack 2 for 3D and 2D segmentations in comparison with the ground truth
Table 9 Results of biofilm parameter quantification for stack 3 for 3D and 2D segmentations in comparison with the ground truth
List of Figures
Figure 1 Three hypotheses explaining the formation of intrinsic levels of antibiotic
resistance. Adapted from Stewart et al., 2001.
Figure 2 Gene structure. Gene structure of the Human gene 276 located in
chromosome I: The boxes correspond to the exons (coding regions), and the lines
between them represent the introns (non-coding regions). The total length of
the gene is N=8208 bases, including 1536 coding and 6672 non-coding
bases.
Figure 3 DNA walk. DNA walk of the Human gene 276.
Figure 4 TD-ARMA modeling. TD-ARMA modeling of the Human gene 276: The blue
signal is the DNA walk, and the red signal is the TD-ARMA(1,1) fitted signal.
The TD-ARMA(1,1) model accurately fits the genomic signal.
Figure 5 TD-ARMA coefficients estimation. Estimation of the TD-ARMA(1,1)
coefficients of the Human gene 276. The TD-ARMA(1,1) model is given by
. The blue and black (resp. red
and green) curves depict the time series (resp ) for the coding and
non-coding regions of the gene, respectively.
Figure 6 Curve of randomness. The curves of randomness of the coding and
non-coding regions of the Human gene 276 are shown in blue and red,
respectively. The index of randomness of the coding sequence is equal to
1.0603, whereas its corresponding value for the non-coding sequence is
equal to 1.0627.
Figure 7 ARMA modeling and segmentation of ultrasound breast images: (a)
cancerous ultrasound image; (b) 2D-ARMA[2,2,2,2] representation of (a); (c)
segmentation of (b) using an appropriate threshold; (d) 1D-ARMA[2,2]
representation of (a); (e) benign tumor ultrasound image; (f)
2D-ARMA[2,2,2,2] representation of (e); (g) segmentation of (f) using an
appropriate threshold; (h) 1D-ARMA[2,2] representation of (e).
Figure 8 2D ARMA modeling: (a) original (healthy) mammogram; (b) 2D
ARMA[2,2,2,2] model of (a); (c) 2D ARMA[3,3,3,3] model of (a); (d) 2D
ARMA[4,4,4,4] model of (a); (e) 2D ARMA[6,6,6,6] model of (a).
Figure 9 Change detection algorithm: (a) a normal (healthy) mammogram; (b) 2D
ARMA[2,2,2,2] model of (a); (c) plot of the average gray level of the 16x16
sub-images in (a); (d) plot of the decision function for the image in (a); (e)
original cancerous image; (f) 2D ARMA[2,2,2,2] model of (e); (g) plot of the
average gray level of the 16x16 sub-images in (e); (h) plot of the decision
function for the image in (e).
Figure 10 The decision function for four mammograms: cancerous in red/magenta
and normal in blue/green. The value of the threshold is determined as the
mean of the highest normal peak and the highest cancerous peak.
Figure 11 Change detection algorithm: (a) radiologist-marked area of interest; (b) plot of
the decision function of the mammogram, where arrows indicate the peaks above
the threshold; (c) marked 16x16 clusters that correspond to the detected
peaks.
Figure 12 Confocal images and their segmentations: A-D: images from the UAMS-1
sarA- mutant (section 1), their respective segmentations with the watershed
algorithm (section 2), and COMSTAT analysis (section 3).
Figure 13 Confocal images and their segmentations: E-H: images from the UAMS-1
(section 1), their respective segmentations with the watershed algorithm (section
2), and COMSTAT analysis (section 3).
Figure 14 Mean square error of 2D gradient-based segmentation (black), 2D
watershed segmentation (green), and 3D gradient-based segmentation (red)
applied to a CLSM z-stack of images.
Figure 15 Combined mean square error of 2D gradient-based (black), 2D watershed
(green), and 3D gradient-based segmentations.
Figure 16 2D and 3D gradient-based segmentation: Column 1: the original CLSM
images; Column 2: 3D gradient-based segmentation; Column 3: 2D
gradient-based segmentation; Column 4: 2D watershed segmentation.
Chapter 1 Introduction
Biological and biomedical signals are acquired by a range of techniques
across all biological scales, going far beyond the visible-light photographs and
microscope images of the early 20th century. Techniques in use today include
confocal scanning microscopy, X-ray microscopy, and electron microscopy, all of
which rely extensively on Digital Signal Processing (DSP) techniques and
reconstruction algorithms in two, three, and higher dimensions.
Modern medical images are geometrically arranged arrays of data samples. The
broadening scope of imaging as a way to organize our observations of the
biophysical world has led to a dramatic increase in our ability to apply new
processing techniques and to combine multiple channels of data into
sophisticated and complex mathematical models of physiological function and
dysfunction. With the explosion in the amount of data produced in the field of
biomedicine, it is crucial to be able to construct accurate mathematical models of
the data at hand. The two main purposes of signal modeling are data size
reduction and parameter extraction.
Problem Statement
Over the past century, the fields of microbiology and biomedicine have
undergone a revolution. We have moved from the microscope to computerized,
highly sophisticated acquisition methods that can scan the surrounding
environment with accuracy and precision. The amount of data produced in the
scanning process is enormous and has become very difficult for a human being
to analyze efficiently.
The main problem facing scientists who work with a large data pool is the
ability to translate digital information into meaningful data that physicians
and biologists can use in their studies. Among the many tasks involved, the
most important are:
Feature extraction, which is a special form of dimensionality reduction. It
is used for images that are either too large to process or redundant in
nature. In both cases the input data can be transformed into a reduced
representation consisting of a set of features (also called a feature
vector).
Segmentation, which is the process of simplifying and/or changing the
representation of an image and then dividing it into more meaningful
subsections.
Development of accurate and robust Computer Aided Diagnostic (CAD)
systems for biomedical applications, which can deliver results in
biomedical imaging faster and more accurately, with minimal or no
human involvement.
Research Objectives
The goal of this work is to investigate Digital Signal Processing methods
for the analysis of biological and biomedical signals, and specifically to
develop methods for parameter extraction that biologists and physicians can
use in the development of new patient treatment techniques.
The ultimate goal is to develop accurate and robust models for Computer Aided
Diagnostic systems in two areas: detection of microcalcifications and cancerous
tissue in the breast from X-ray mammograms and ultrasound images, and
segmentation of biofilm of Staphylococcus aureus colonies from Confocal Laser
Scanning Microscopy (CLSM) images.
The goal of this research is realized through the following objectives:
1. Development of a non-stationary modeling technique for DNA data
2. Development of a 2D-ARMA technique for image signal modeling
3. Use of the extracted features of the ARMA model for biological and clinical
classification of microcalcifications in X-ray breast mammography and
ultrasound imaging
4. Formulation of the problem of microcalcification detection as a change
detection hypothesis testing problem
5. Development of an accurate and robust segmentation technique for Confocal
Laser Scanning Microscopy that leads to more accurate quantification of
biofilm
6. Development of three-dimensional (3D) segmentation techniques that
take into account the spatial relationships between pixels
7. Development of a fully automated method of quantifying staphylococcal
biofilm images acquired using confocal laser scanning microscopy (CLSM), with
standardized parameters that are independent of user input
Motivation
Breast cancer continues to be a significant public health problem in the
United States: it is the second leading cause of female mortality and,
disturbingly, one out of eight women in the United States will be diagnosed with
breast cancer in her lifetime. Until the cause of this disease is fully understood,
early detection remains the only hope to improve breast cancer prognosis and
treatment. Breast cancer screening modalities are mainly based on clinical
examination, mammography, ultrasound imaging, magnetic resonance imaging
(MRI), and core biopsy. Mammography (breast X-ray imaging) is by far the
fastest and cheapest screening test for breast cancer. Unfortunately, mammograms
are also among the most difficult radiological images to interpret: they are of
low contrast, and features indicative of breast disease are often very small.
Staphylococcus aureus is an opportunistic human pathogen responsible
for a wide range of diseases that vary in clinical presentation and severity. S.
aureus can cause infections ranging from minor skin lesions to life-threatening
conditions such as pneumonia, osteomyelitis, and toxic shock syndrome. Recent
significant changes in health care delivery and antimicrobial resistance patterns
have caused a shift in the epidemiology of S. aureus. A recent report estimated
the number of invasive infections caused by methicillin-resistant S. aureus
(MRSA) in 2005 in the United States alone at 94,360, with 18,650 of these
resulting in a fatal outcome. This means that S. aureus infection has now passed
AIDS as a cause of death in the United States. Based on such considerations,
S. aureus is arguably a greater clinical concern now than at any time since the
pre-antibiotic era.
Research Contributions
This work contributes to the field of computational bioinformatics and
biology through the application of information theory and communication theory
to the study and analysis of genetic sequences. Further, this work presents novel
techniques for the detection and recognition of breast cancer in its early stages of
development, as well as for the classification of developed tumors from ultrasound
and X-ray mammography images. This work also describes novel methods
of parameter extraction from Confocal Laser Scanning Microscopy (CLSM)
images of biofilm formation in bacterial development. All of the described
techniques contribute semi- and fully automated tools that scientists and
physicians can use to develop treatments for diseases with high mortality in
developed countries.
Specific contributions of this work to the field of biomedical science include:
1. Derivation of a non-stationary TD-ARMA model, including novel
estimation algorithms for the TD-ARMA parameters, which resulted in a more
robust and precise model of non-stationary signals
2. Formulation of TD-ARMA modeling as a non-stationary model of
genomic sequences, which resulted in better understanding and better
recognition of coding and non-coding regions of DNA sequences of highly
evolved organisms
3. Use of the Wold decomposition theorem, applied to ultrasound breast
images treated as random fields, for two-dimensional ARMA modeling
4. Derivation of a 2D-ARMA model that takes into account the
distributed correlation between points in space and/or time, which resulted in
better understanding of higher-dimensional modeling techniques
5. Formulation of the problem of microcalcification detection in digital
breast images as a statistical change detection problem in the local
properties of the image. The solution not only detected the
presence of microcalcifications but also gave an accurate estimate of their
location within the breast
6. Derivation of a Computer Aided Diagnostic system for the automatic
quantification of biologically relevant parameters of the biofilm created during
microbial development of Staphylococcus aureus colonies; these parameters
were further used in microbiology studies of virulence and treatment resistance of
certain strains of bacteria
Organization
This thesis is organized as follows:
In Chapter 2, we provide a cursory overview of the biological background,
which includes a brief introduction to genomics with emphasis on the structure of
DNA. Next, we describe the breast cancer types that are most common among the
female population in highly developed countries. Finally, we describe biofilm
genesis and the different areas where mathematical biofilm modeling can be used.
Grasping the essence of the biological inspirations of this work is crucial to
understanding the motivation, assumptions, and theoretical results of this work.
In Chapter 3, we model non-stationary genomic sequences by a
time-dependent autoregressive moving average (TD-ARMA) model. By expressing the
time-dependent coefficients as linear combinations of parametric basis functions,
we transform a linear non-stationary problem into a linear time-invariant
problem. Subsequently, we propose three methods to estimate the
time-dependent coefficients: mean-square, least-squares, and recursive
least-squares algorithms. Based on the estimated TD-ARMA coefficients, we define
an index of randomness that quantifies the degree of randomness of both coding
and non-coding sequences, and we report the corresponding results.
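The basis-function expansion can be illustrated with a minimal sketch. It is an illustration under stated assumptions, not the estimator derived in Chapter 3: it keeps only a first-order autoregressive term, x[n] = a[n] x[n-1] + e[n], and expands a[n] on a polynomial basis in normalized time. The function names `simulate_td_ar1` and `fit_td_ar1` and the choice of basis are hypothetical.

```python
import numpy as np


def simulate_td_ar1(c, length, seed=0):
    """Simulate x[n] = a[n] * x[n-1] + e[n] with a[n] = sum_k c[k] * (n/length)**k.

    Assumption: polynomial basis in normalized time, unit-variance white noise.
    """
    rng = np.random.default_rng(seed)
    x = np.zeros(length)
    for n in range(1, length):
        t = n / length
        a_n = sum(ck * t**k for k, ck in enumerate(c))
        x[n] = a_n * x[n - 1] + rng.standard_normal()
    return x


def fit_td_ar1(x, num_basis=2):
    """Least-squares estimate of the constant basis weights c[k].

    Substituting a[n] = sum_k c[k] * f_k(n) gives
    x[n] = sum_k c[k] * (f_k(n) * x[n-1]) + e[n],
    which is linear in the time-invariant unknowns c[k].
    """
    x = np.asarray(x, dtype=float)
    n = np.arange(1, len(x))
    t = n / len(x)
    basis = np.column_stack([t**k for k in range(num_basis)])  # f_k(n)
    regressors = basis * x[:-1, None]                          # f_k(n) * x[n-1]
    c_hat, *_ = np.linalg.lstsq(regressors, x[1:], rcond=None)
    a_hat = basis @ c_hat                                      # estimated a[n]
    return c_hat, a_hat


if __name__ == "__main__":
    x = simulate_td_ar1([0.5, 0.3], 20000, seed=1)
    c_hat, a_hat = fit_td_ar1(x)
    print("estimated basis weights:", c_hat)
```

Because the unknown weights are constant, the time-varying fit reduces to a single ordinary least-squares solve, which is the sense in which the non-stationary problem becomes time-invariant.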
In Chapter 4, we propose to exploit the high spatial correlation inherent in
neighboring pixels to improve tumor detection and classification in ultrasound
breast images. We achieve this goal by using a two-dimensional autoregressive
moving average (ARMA) field model of the image. Current techniques often rely
on one-dimensional representations of the image in terms of its scan lines in
order to process it as a one-dimensional time series [5], [6]. Such
one-dimensional projections are advocated solely on the basis of the simplicity of
their mathematical formulations. The analysis of two-dimensional fields is more
involved mathematically and computationally than the study of one-dimensional
time series. In this work, we derive an efficient two-stage algorithm to estimate
the parameters of the two-dimensional ARMA field model of the breast image.
The estimated ARMA parameters are excellent discriminative features, which are
used as the basis for statistical detection and classification of tumors in the
breast image.
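To make the idea of a two-dimensional field model concrete, the following sketch fits a small causal 2D autoregressive model by least squares. It is a simplification under stated assumptions: only the autoregressive part is modeled, with three causal neighbors (left, upper, upper-left), and the function names `simulate_2d_ar` and `fit_2d_ar` are hypothetical. It is not the two-stage ARMA estimator derived in Chapter 4.

```python
import numpy as np


def simulate_2d_ar(coeffs, shape, seed=0):
    """Simulate x[i,j] = a_left*x[i,j-1] + a_up*x[i-1,j] + a_diag*x[i-1,j-1] + e[i,j].

    Assumption: zero boundary values and unit-variance white noise.
    """
    a_left, a_up, a_diag = coeffs
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(shape)
    field = np.zeros(shape)
    for i in range(shape[0]):
        for j in range(shape[1]):
            left = field[i, j - 1] if j > 0 else 0.0
            up = field[i - 1, j] if i > 0 else 0.0
            diag = field[i - 1, j - 1] if i > 0 and j > 0 else 0.0
            field[i, j] = a_left * left + a_up * up + a_diag * diag + noise[i, j]
    return field


def fit_2d_ar(image):
    """Least-squares estimate of (a_left, a_up, a_diag) from one image."""
    img = np.asarray(image, dtype=float)
    target = img[1:, 1:].ravel()
    regressors = np.column_stack([
        img[1:, :-1].ravel(),    # left neighbor  x[i, j-1]
        img[:-1, 1:].ravel(),    # upper neighbor x[i-1, j]
        img[:-1, :-1].ravel(),   # diagonal       x[i-1, j-1]
    ])
    coeffs, *_ = np.linalg.lstsq(regressors, target, rcond=None)
    return coeffs


if __name__ == "__main__":
    field = simulate_2d_ar((0.4, 0.3, -0.1), (150, 150), seed=2)
    print("estimated coefficients:", fit_2d_ar(field))
```

The three fitted coefficients form a very short feature vector that summarizes the spatial correlation of the whole image, which is the role the ARMA parameters play as discriminative features.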
In Chapter 5, we introduce a new approach to the problem of malignancy
detection in digital mammograms using statistical sequential analysis theory. The
statistical approach inherently takes into account the noise in the image (from the
imaging device and the digitization) and the great variety of healthy and
cancerous mammograms by considering an underlying probability distribution of
the image characteristics. For increased efficiency, the dimensionality of the
original images is reduced using 2D-ARMA modeling, which is shown to
accurately represent mammograms. The change detection algorithm is applied to
the low-dimensional 2D-ARMA feature vectors rather than to the pixels of the raw
image.
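As a rough illustration of sequential change detection, the sketch below implements a generic one-sided CUSUM test on a sequence of scalar features. CUSUM is one standard sequential detector and is used here only as an assumed stand-in; this overview does not specify the decision function actually used in Chapter 5. The function name `cusum_first_alarm` and the parameters `target_mean`, `drift`, and `threshold` are hypothetical.

```python
def cusum_first_alarm(values, target_mean, drift, threshold):
    """Return the index of the first CUSUM alarm, or None if no alarm occurs.

    The decision statistic accumulates positive deviations from target_mean
    (minus a drift allowance) and is reset to zero whenever it goes negative.
    """
    statistic = 0.0
    for index, value in enumerate(values):
        statistic = max(0.0, statistic + (value - target_mean - drift))
        if statistic > threshold:
            return index
    return None


if __name__ == "__main__":
    # Hypothetical feature sequence: the mean shifts from 0 to 1 at index 50.
    features = [0.0] * 50 + [1.0] * 50
    print(cusum_first_alarm(features, target_mean=0.0, drift=0.25, threshold=3.0))
```

In an image setting, each value could be a feature computed from one sub-image, so that an alarm index points to the region where the local statistics change.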
In Chapter 6, we propose an algorithm that efficiently segments and quantifies
images without relying on a manually set threshold. The average error of the
results obtained with the watershed-based algorithm, calculated relative to
manual analysis, was comparable to that obtained with the COMSTAT software.
In Chapter 7, we show the importance of 3D analysis of biofilm structures,
which yields more accurate morphological parameter quantification for clinical
and biological assessment of the biofilm than sophisticated 2D-based analysis
such as watershed segmentation. Two-dimensional analysis of the biofilm
morphology treats the CLSM images independently of each other, whereas
the 3D analysis takes into account the temporal and spatial correlation between
stacked images.
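The difference between slice-by-slice and volumetric analysis can be seen in a small sketch that compares 2D and 3D gradient magnitudes on an image stack. It assumes a stack stored as a NumPy array indexed as (z, y, x) with unit voxel spacing; the function names are hypothetical, and the sketch illustrates only the gradient computation, not the segmentation method of Chapter 7.

```python
import numpy as np


def gradient_magnitude_2d(stack):
    """Per-slice gradient magnitude: derivatives along y and x only."""
    gy, gx = np.gradient(stack.astype(float), axis=(1, 2))
    return np.sqrt(gx**2 + gy**2)


def gradient_magnitude_3d(stack):
    """Volumetric gradient magnitude: derivatives along z, y, and x."""
    gz, gy, gx = np.gradient(stack.astype(float))
    return np.sqrt(gx**2 + gy**2 + gz**2)


if __name__ == "__main__":
    # Hypothetical stack whose intensity changes only between slices.
    stack = np.zeros((4, 5, 5))
    stack[2:] = 1.0
    print("max 2D gradient:", gradient_magnitude_2d(stack).max())
    print("max 3D gradient:", gradient_magnitude_3d(stack).max())
```

For a structure whose boundary lies between slices, the per-slice gradient is zero everywhere, while the volumetric gradient responds at the boundary, which is why information across the stack matters.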
Chapter 2 Biology Background
Biomedical signals are observations of physiological activities that can be
obtained from a biological system. This diverse group of signals ranges from
observations on a molecular level, such as gene and protein sequences, to
macroscopic images of organs. The processing of these biomedical signals aims
at extracting only the significant information from an often overwhelming amount of
data. What constitutes the information of interest depends on the specific
application. Therefore, the purpose of signal processing is to selectively eliminate
irrelevant information from the signal so as to make the information of interest more
easily accessible to a human observer or a computer system.
In the past, the primary application of signal processing was to filter
signals and remove noise. Background noise, arising from either the imprecision
of instruments or the biological systems themselves, was eliminated primarily by
two methods. In the first, noise cancellation was achieved by analyzing
the signal spectrum and suppressing the undesired frequency components. In the
second, the data were treated as random signals and statistical
characterizations of the signal were used to extract the desired components, e.g.,
Wiener filtering or Kalman filtering.
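The first of these two approaches, suppressing undesired frequency components, can be illustrated with a minimal sketch. It assumes NumPy, a toy sinusoid with synthetic interference, and a hypothetical helper name (suppress_high_frequencies); it is an illustration of the general idea, not a method used in this work.

```python
import numpy as np

def suppress_high_frequencies(signal, sample_rate_hz, cutoff_hz):
    """Zero every frequency component above cutoff_hz and return the filtered signal."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate_hz)
    spectrum[freqs > cutoff_hz] = 0.0  # discard the undesired band
    return np.fft.irfft(spectrum, n=len(signal))

# Toy data (assumption): a 5 Hz component of interest plus 120 Hz interference.
sample_rate_hz = 1000.0
t = np.arange(1000) / sample_rate_hz
clean = np.sin(2 * np.pi * 5 * t)
noisy = clean + 0.5 * np.sin(2 * np.pi * 120 * t)
filtered = suppress_high_frequencies(noisy, sample_rate_hz, cutoff_hz=20.0)
```

A hard spectral cut like this works only when signal and noise occupy separate bands; the statistical filters mentioned above (Wiener, Kalman) are meant for the case where they overlap.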
Since the introduction of new technologies and instruments, the
applications for biomedical signal processing have expanded well beyond just the
removal of background noise. Segmentation, the process of partitioning a digital
image to locate objects and boundaries, is extensively used in the analysis of the
medical images, including organ structure quantification and detection of local
abnormalities such as tumors. Another signal processing technique, motion
tracking, is widely used in molecular biology for visualizing dynamics in living
cells. The same method can be used to track the distribution and growth of live
cells tagged with fluorescent probes in biofilms or other biological formations.
Sequence analysis is yet another processing method that was born with the
invention of automatic DNA sequencing. It allowed scientists to create genetic
maps based on the short DNA fragments analyzed by DNA sequencers. Finally,
one of the most important applications of signal processing is pattern
classification, often dependent on segmentation. Extensively used by clinicians
and biologists, classification helps to automatically distinguish pathological
formations from the normal background. Although experts very often outperform
any automated method in pattern classification, classification of
biomedical signals by humans faces several difficulties. It requires a lot of
knowledge combined with experience, it is labor intensive and time consuming, it
may be challenging when the signal characteristics are not very
prominent, and it is sensitive to human error and bias. Therefore, automated
methods for classification and segmentation in signal processing may overcome
these limitations and assist in screening large databases, advancing the
technology.
Genomics
There are two types of nucleic acid that are of key importance in cells:
deoxyribonucleic acid (DNA) and ribonucleic acid (RNA). DNA is found as a
double strand. The backbone of the molecule is composed of deoxyribose sugars
linked by phosphate groups in a repeating polymer chain. Each sugar is linked to
a molecule known as a base. In DNA, there are four types of base, called
adenine, thymine, guanine and cytosine, usually referred to simply as A, T, G and
C. The two distinct ends of a DNA sequence are known as the 5'
end and the 3' end. The fundamental building block of a nucleic acid is called a
nucleotide: this is the unit of one base plus one sugar plus one phosphate. We
usually think of the length of a nucleic acid sequence as the number of
nucleotides in the chain. The two strands of the DNA molecule are held together
by hydrogen bonding between A and T and between C and G. The two strands
run in opposite directions and are exactly complementary in sequence, so that
where one has A, the other has T and where one has C the other has G.
Therefore, naming the bases on the conventionally chosen side of the strand is
enough to describe the entire double-strand sequence. The two strands are
coiled around one another in the famous double helical structure elucidated by
Watson and Crick in 1953.
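Because the two strands are antiparallel and complementary, one strand determines the other. The following minimal sketch shows this; the function name complementary_strand is hypothetical and the input is assumed to contain only the letters A, T, C and G.

```python
# Watson-Crick pairing: A with T, C with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complementary_strand(strand_5_to_3):
    """Return the partner strand, also written in the 5' to 3' direction."""
    # The strands run in opposite directions, so the sequence is reversed
    # before each base is replaced by its pairing partner.
    return "".join(COMPLEMENT[base] for base in reversed(strand_5_to_3))

partner = complementary_strand("ATGC")
```

Applying the function twice returns the original strand, which is why naming the bases on one strand is enough to describe the whole double-stranded sequence.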
An RNA strand that is transcribed from a protein-coding region of DNA is
called a messenger RNA (mRNA). The mRNA is used as a template for protein
synthesis in the translation process discussed below. Eukaryotic gene
sequences are composed of alternating sections called exons and introns. Exons
are the pieces of the sequence that contain the information for protein coding.
These pieces will be translated. Introns do not contain protein-coding information.
The discovery of introns led to the Nobel Prize in Physiology or Medicine in 1993
for Phillip Allen Sharp and Richard J. Roberts. The introns are cut out of the pre-
mRNA and are not present in the mRNA after processing. When an intron is
removed the ends of the exons on either side of it are linked together to form a
continuous strand.
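The removal of introns described above can be sketched as a simple string operation. The exon coordinates and the toy sequence below are hypothetical, and the half-open (start, end) interval convention is an assumption made for illustration.

```python
def splice(pre_mrna, exons):
    """Join the exon segments of pre_mrna; exons are half-open (start, end) intervals."""
    return "".join(pre_mrna[start:end] for start, end in sorted(exons))

def coding_and_noncoding_lengths(total_length, exons):
    """Return (number of exon bases, number of intron bases)."""
    coding = sum(end - start for start, end in exons)
    return coding, total_length - coding

# Toy transcript (assumption): exons in upper case, introns in lower case.
pre_mrna = "AAAgggCCCtttGG"
exons = [(0, 3), (6, 9), (12, 14)]
mrna = splice(pre_mrna, exons)
lengths = coding_and_noncoding_lengths(len(pre_mrna), exons)
```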
Splicing is carried out by the spliceosome, a complex of several types of RNA
and proteins bound together and acting as a molecular machine. The
spliceosome is able to recognize signals in the pre-mRNA sequence that tell it
where the intron-exon boundaries are and hence which bits of the sequence to
remove. As with promoter sequences for transcription, the signals for the splice
sites are fairly short and variable, so that reliable identification of the intron-exon
structure of a gene is a difficult problem in bioinformatics. Nevertheless, the
spliceosome manages to do it. Introns that are spliced out by the spliceosome
are called spliceosomal introns. This is the majority of introns in most organisms.
In addition, there are some interesting, but fairly rare introns that are capable of
catalyzing their own splicing out of the primary RNA transcript without the action
of the spliceosome. There are surprisingly large numbers of introns in many
eukaryotic genes: 10 or 20 in one gene is not uncommon. In this work we focus
on research of RNA mathematical modeling and automated recognition of coding
and non-coding regions.
Breast Cancer
Breast cancer is the most common type of cancer among women and it is
the second leading cause of cancer mortality. Excluding skin cancer, breast cancer
is the most commonly diagnosed cancer among American women [1]. According
to a statistical report by the National Cancer Institute of the United States, an
estimated 230,480 women in the USA were diagnosed with breast cancer in 2011,
and 39,520 women were expected to die of the disease that year [1]. Screening
mammography is the most widely used technique for the detection of breast cancer.
Routine mammographic screening is considered a viable option for
detecting the earliest signs of cancerous growth [2]. The mortality rates of women
under the age of 50 have been steadily decreasing since 1990. This decrease is
thought to be the result of advances in treatment and earlier detection
through screening. Thus, early detection and the adoption of modern methods of
treatment for breast cancer can significantly improve the survival rate of patients.
Currently, X-ray mammography is widely regarded as an efficient imaging
modality for the early detection of abnormalities. The earliest sign of breast cancer is
microcalcification, which is nodular in structure with high intensity, and is either
localized or broadly diffused across the breast. Microcalcifications are tiny
calcium deposits in the breast tissue; they appear as clusters or in
patterns associated with extra cell activity in the breast region. The detection of
microcalcifications at an early stage is a challenging task for radiologists, and some
clusters go undetected because of their small size [3].
The sensitivity of radiologists in detecting microcalcifications is 70–90%
and depends on their experience [4]. Therefore, a Computer Aided
Diagnosis (CAD) system for breast cancer detection on mammograms has been
developed to improve the diagnostic rate. By incorporating the expert knowledge
of radiologists, the CAD system can be made to improve the detection accuracy.
Biofilm
Staphylococcus aureus is an opportunistic human pathogen responsible
for a wide range of diseases that vary in clinical presentation and severity. S.
aureus can cause infections ranging from minor skin lesions to life-threatening
conditions such as pneumonia, osteomyelitis and toxic shock syndrome. Recent
significant changes in health care delivery and antimicrobial resistance patterns
have caused a shift in the epidemiology of S. aureus. A recent report estimated
the number of invasive infections caused by methicillin-resistant S. aureus
(MRSA) in 2005 in the United States alone at 94,360, with 18,650 of these
resulting in a fatal outcome. This means that S. aureus infection has now passed
AIDS as a cause of death in the United States. Based on such considerations, S.
aureus is arguably a greater clinical concern now than at any time since the pre-
antibiotic era.
S. aureus infections exhibit general characteristics common in many
different types of infection. Strains that are resistant to methicillin already account
for the majority of the S. aureus nosocomial infection cases [5,6]. A more
alarming concern is the emergence of MRSA as a cause of infection in the
general community. In other words, patients who have never been hospitalized
and who have no other known risk factors for MRSA infection are becoming ill.
The treatment of such patients becomes more challenging because S. aureus is
capable of forming a biofilm, which shows intrinsic levels of antibiotic resistance.
The familiar mechanisms of antibiotic resistance, such as efflux pumps,
modifying enzymes, and target mutations [7], do not seem to be responsible for
the protection of bacteria in a biofilm. In fact, the metabolism of cells within a
biofilm is profoundly slower than that of dispersed cells, and it is more likely that this
slowing reduces the effects of antibiotics that block metabolic processes, such as
protein synthesis. Additionally, antibiotic-sensitive bacteria with no known genetic
basis for resistance can have profoundly reduced susceptibility when growing in
a biofilm. Such strains, when grown in a biofilm, have to be treated with much
higher doses of antibiotics than those needed to eradicate free-floating bacteria [7].
Three hypotheses explaining the formation of intrinsic levels of antibiotic
resistance are shown in Figure 1, adapted from [7].
Figure 1 Three hypotheses explaining the formation of intrinsic levels of antibiotic resistance
Adapted from Stewart et al., 2001
For the reasons mentioned above, it is necessary to consider biofilm
formation as an important weapon in S. aureus pathogenesis. The role of biofilm
formation in virulence requires extensive studies of biofilm regulation and
advanced methods of biofilm characterization and quantification.
Chapter 3 Time-Dependent ARMA Modeling of Genomic
Sequences
Abstract
Over the past decade, many investigators have used sophisticated time
series tools for the analysis of genomic sequences. Specifically, the correlation of
the nucleotide chain has been studied by examining the properties of the power
spectrum. The main limitation of the power spectrum is that it is restricted to
stationary time series. However, it has been observed over the past decade that
genomic sequences exhibit non-stationary statistical behavior. Standard
statistical tests have been used to verify that the genomic sequences are indeed
not stationary. More recent analysis of genomic data has relied on time-varying
power spectral methods to capture the statistical characteristics of genomic
sequences. Techniques such as the evolutionary spectrum and evolutionary
periodogram have been successful in extracting the time-varying correlation
structure. The main difficulty in using time-varying spectral methods is that they
are extremely unstable. Large deviations in the correlation structure result from
very minor perturbations in the genomic data and experimental procedure. A
fundamentally new approach is needed in order to provide a stable platform for the
non-stationary statistical analysis of genomic sequences. Results: In this chapter,
we propose to model non-stationary genomic sequences by a time-dependent
autoregressive moving average (TD-ARMA) process. The model is based on a
classical ARMA process whose coefficients are allowed to vary with time. A
series expansion of the time-varying coefficients is used to form a generalized
Yule-Walker-type system of equations. A recursive least-squares algorithm is
subsequently used to estimate the time-dependent coefficients of the model. The
non-stationary parameters estimated are used as a basis for statistical inference
and biophysical interpretation of genomic data. In particular, we rely on the TD-
ARMA model of genomic sequences to investigate the statistical properties and
differentiate between coding and non-coding regions in the nucleotide chain.
Specifically, we define a quantitative measure of randomness to assess how far
a process deviates from white noise. Our simulation results on various gene
sequences show that both the coding and non-coding regions are non-random.
However, coding sequences are "whiter" than non-coding sequences as
attested by a lower index of randomness. Conclusion: We demonstrate that the
proposed TD-ARMA model can be used to provide a stable time series tool for
the analysis of non-stationary genomic sequences. The estimated time-varying
coefficients are used to define an index of randomness, in order to assess the
statistical correlations in coding and non-coding DNA sequences. It turns out that
the statistical differences between coding and non-coding sequences are more
subtle than previously thought using stationary analysis tools: Both coding and
non-coding sequences exhibit statistical correlations, with the coding regions
being "whiter" than the non-coding regions. These results corroborate the
evolutionary periodogram analysis of genomic sequences and refute the conclusion
of stationary analyses that coding DNA behaves like random sequences.
Background
Understanding the statistical properties of genomic sequences helps
recreate the dynamical processes that led to the current DNA structure, and
determine gene-related diseases like cancer and Alzheimer's disease. For
instance, based on the view that non-coding DNA exhibits long-range
correlations [1-6], Li [7] proposed an expansion-modification model of gene
evolution. The model incorporates the two basic features of DNA evolution: (i)
sequence elongation due to gene duplication and (ii) mutations. It can be shown
that the limiting sequence created by this dynamical process exhibits a long-range
correlation structure, as attested by a $1/f^{\beta}$ spectrum, where the exponent $\beta$
is a function of the probability of mutation. To understand the relationship
between the DNA correlation structure and possible gene aberrations, Dodin et
al. [8] designed a simple correlation function intended to visualize the regular
patterns encountered in DNA sequences. This function is used to revisit the
intriguing question of triplet repeats with the aim of providing a visual estimate of
the propensity of genes to be highly expressed and/or to lead to possible
aberrant structures formed upon strand slippage.
Statistical analysis of genomic sequences has, however, relied, for a long
time, on signal processing techniques for stationary signals (correlation and
power spectrum) [2,4,9,10], and statistical tools for slowly-varying trends within
stationary signals (Detrended Fluctuation Analysis or DFA) [1,5,6]. Stationarity can
be argued to be a valid assumption for time series of short duration. However, such
an assumption rapidly loses its credibility in the enormous databases maintained
by various genome projects. Standard statistical tests (e.g., Priestley's test for
stationarity) have been used to verify that genomic sequences are not stationary
and that the nature of their non-stationarity varies and is often more complex than a
simple trend [11,12]. Subsequently, more recent analysis of genomic data [1] has
relied on time-varying power spectral methods (the evolutionary spectrum and
periodogram) to capture the statistical characteristics of genomic sequences
[11,12]. The main difficulty in using time-varying spectral methods is that they are
extremely unstable and very noisy. Typically, the power spectrum and the
evolutionary spectrum are averaged over time in order to obtain smooth and less
jittery curves. Moreover, as was pointed out in [13], the evolutionary spectrum is
restricted to the class of oscillatory processes. A stochastic process, $x[n]$, is
oscillatory if it has a representation of the form
Equation 1
$$x[n] = \int_{-\pi}^{\pi} A(n,\omega)\, e^{i\omega n}\, dZ(\omega)$$
where $Z(\omega)$ is an orthogonal increment process, and the evolutionary power
spectrum of the process $x[n]$ is defined by $|A(n,\omega)|^{2}$. This definition of the
evolutionary power spectrum has the following disadvantages [13]:
i. It is not uniquely defined for a given non-stationary process.
ii. The estimation procedure for the evolutionary spectrum depends greatly
on the nature of the amplitude function $A(n,\omega)$, which is not known a priori.
iii. An increase in the number of observations does not provide added
information on the local behavior of the evolutionary spectrum, and thus
does not improve estimation accuracy.
We propose to model non-stationary genomic sequences by a time-dependent
autoregressive moving average (TD-ARMA) process. Cramer [14]
showed that a non-stationary process still possesses a Wold decomposition in
terms of its innovation and its generating system. However, the linear system
generating the non-stationary signal $x[n]$, when driven by the innovation, $e[n]$, is
no longer shift-invariant; the parameters of the impulse response, $h[n,k]$, of this
system are time-dependent so that
Equation 2
$$x[n] = \sum_{k=-\infty}^{n} h[n,k]\, e[k]$$
The existence of a time-varying ARMA representation of this process is
ensured by two theorems due, independently, to Grenier [15] and Huang and
Aggarwal [16]. The uniqueness of the TD-ARMA representation is obtained by
constraining the ARMA model to be invertible, but this leads to conditions on the
time-varying impulse response and its inverse (namely to be absolutely
summable at any time t), which cannot be easily expressed in terms of the time-
dependent coefficients of the ARMA model. In this chapter, we estimate the time-
dependent coefficients of the general TD-ARMA model using mean squares,
least-squares and recursive least-squares algorithms. The mean-squares
estimation leads to generalized Yule-Walker type equations [15]. Once the non-
stationary parameters are estimated (as time series), we use them to provide a
basis for statistical inference by defining an index of randomness, which
quantitatively assesses how close the non-stationary signal is to white noise. Our
simulation results on various gene sequences show that (i) both the coding and
non-coding segments of a gene are not random, and (ii) the coding segments are
"closer" to random sequences than non-coding segments. Our results support
the view that both coding and non-coding sequences are not random
[11,12,9,17-20], and refute the stationary studies maintaining that non-coding
DNA sustains long-range correlations whereas coding DNA behaves like random
sequences [1-3,5,6,10].
Methods
Numerical representation of genomic sequences
Converting the DNA sequence into a digital signal offers the opportunity to apply
powerful signal processing methods for the handling and analysis of genomic
information. This is, however, not an easy task, as the analysis results might
depend on the chosen map. Various numerical mappings have been adopted in the
literature. To cite a few, Peng et al. [1] construct a one-dimensional map of
nucleotide sequences onto a walk, $u(i)$, which they termed "DNA walk". The DNA
walk is defined by the rule that the walker steps up [$u(i) = +1$] if a pyrimidine
resides at position i, and steps down [$u(i) = -1$] otherwise. Voss [9] represents
a DNA sequence by four binary indicator sequences, which indicate the locations of
the four nucleotides A, T, C and G. Berthelsen et al. [21] proposed a
two-dimensional representation of DNA sequences, characterized by a Hausdorff
dimension (also called fractal dimension) that is invariant under
(i) complementarity, (ii) reflection symmetry, (iii) compatibility and
(iv) substitution symmetry of AT and CG. The corresponding embedding assignment is
given in [21]. In this chapter, since we are interested in time-dependent ARMA
modeling of one-dimensional non-stationary genomic sequences, we adopt the widely
used "DNA walk" map proposed by Peng et al. [1]. The DNA walk provides a
convenient graphical representation for each gene. For
instance, Figure 2 shows the structure of the Human gene 276 located in
chromosome 1, and its DNA walk is displayed in Figure 3.
Time-dependent ARMA model
Grenier [22] showed that a discrete non-stationary signal, $x[n]$, can be
represented by finite-order time-varying ARMA processes of the form
Equation 3
$$x[n] = -\sum_{i=1}^{p} a_i[n-i]\, x[n-i] + \sum_{j=1}^{q} b_j[n-j]\, e[n-j] + e[n], \qquad n = 1, \dots, N$$
where $N$ is the length of the sequence, $\{a_i[n]\}$ and $\{b_j[n]\}$ are the
time-dependent model parameters, p and q are the model orders and $e[n]$ is a white
noise process. Observe that the coefficients $a_i$ and $b_j$ appear with an
argument depending on the index ($n-i$ and $n-j$). This is purely arbitrary since
any time origin can be chosen, without restraining the generality of the model.
We assume that the time-dependent coefficients $a_i[n]$ and $b_j[n]$ can be
expressed as linear combinations of some basis functions $\{f_k[n]\}_{k=0}^{m}$,
Equation 4
$$a_i[n] = \sum_{k=0}^{m} a_{ik}\, f_k[n]$$
$$b_j[n] = \sum_{k=0}^{m} b_{jk}\, f_k[n]$$
Figure 2 Gene Structure. Gene structure of the Human gene 276 located in chromosome I: The
boxes correspond to the exons (coding regions), and the lines between them represent the
introns (non-coding regions). The total length of the gene is N=8208 bases, including
1536 coding and 6672 non-coding bases
Figure 3 DNA Walk. DNA walk of the Human gene 276
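The purine-pyrimidine DNA walk adopted in this chapter can be sketched in a few lines. This is a minimal illustration assuming NumPy and the usual convention that the walk is the running sum of the +1/-1 steps; the function name dna_walk and the short test string are hypothetical.

```python
import numpy as np

PYRIMIDINES = {"C", "T"}  # the purines are A and G

def dna_walk(sequence):
    """Cumulative sum of steps: +1 for a pyrimidine, -1 otherwise."""
    # Any symbol other than C or T (including ambiguity codes) steps down here.
    steps = np.array([1 if base in PYRIMIDINES else -1 for base in sequence.upper()])
    return np.cumsum(steps)

walk = dna_walk("ATCGGCTA")
```

The resulting integer-valued sequence is the one-dimensional signal to which a time-series model can then be fitted.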
The advantage of the basis parameterization is clear from the fact that the
identification of the time-dependent coefficients $a_i[n]$ and $b_j[n]$ reduces to the
identification of the constant coefficients $\{a_{ik}\}$ and $\{b_{jk}\}$, and therefore
the linear non-stationary problem reduces to a linear time-invariant problem. The
basis functions $f_k[n]$ do not have to be limited to the standard choices of
Legendre, Fourier, or the prolate spheroidal basis but can also take advantage of
any prior information, such as the presence of a jump in the coefficients at a
known instant [22].
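The basis parameterization can be illustrated as follows: a whole time-varying coefficient is generated from a handful of constant weights. The sketch assumes NumPy, a Legendre basis sampled on a grid mapped to [-1, 1], and hypothetical names (legendre_basis, expand_coefficient) and weight values.

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_basis(num_samples, num_functions):
    """Rows are the basis functions f_k[n], k = 0..num_functions-1."""
    grid = np.linspace(-1.0, 1.0, num_samples)  # time axis mapped to [-1, 1]
    return legendre.legvander(grid, num_functions - 1).T

def expand_coefficient(weights, basis):
    """Time-varying coefficient a[n] = sum_k weights[k] * f_k[n]."""
    return np.asarray(weights, dtype=float) @ basis

basis = legendre_basis(num_samples=101, num_functions=2)
# Two constant weights (hypothetical) describe a coefficient that drifts linearly.
a_of_n = expand_coefficient([0.5, 0.2], basis)
```

Only the two weights would need to be estimated, regardless of the sequence length, which is the reduction to a time-invariant problem described above.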
Estimation of the time-dependent ARMA coefficients
From Equation 4, the unknown parameters of the TD-ARMA model are the weights of
the linear combinations defining each time-varying parameter. This linearity is the
key to the algorithms proposed in this chapter. We will derive mean-squares,
least-squares and recursive least-squares solutions to estimate the time-dependent
coefficients $a_i[n]$ and $b_j[n]$.
Mean-squares estimation
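Define the process
Equation 5
$$z[n] = x[n] + \sum_{i=1}^{p} \sum_{k=0}^{m} a_{ik}\, f_k[n-i]\, x[n-i]$$
and define the vector
Equation 6
$$X[n] = \big[\, f_0[n]\,x[n],\; f_1[n]\,x[n],\; \dots,\; f_m[n]\,x[n] \,\big]^{T}$$
It is possible to derive for this process orthogonality conditions similar to the
stationary ARMA model conditions [23]. Observe that the process $z[n]$, defined in
Equation 5, is orthogonal to $e[n-q-1], e[n-q-2], \dots$; hence, it is
orthogonal to $x[n-i]$ for all $i > q$, and orthogonal to $X[n-i]$ for all
$i > q$ [22]. This orthogonality condition leads to a generalized Yule-Walker
equation [22]
Equation 7
$$E\left( \begin{bmatrix} X[n-q-1] \\ \vdots \\ X[n-q-p] \end{bmatrix} \Big[\, X[n-1]^{T} \;\big|\; \cdots \;\big|\; X[n-p]^{T} \,\Big] \right) \theta_a = -E\left( \begin{bmatrix} X[n-q-1] \\ \vdots \\ X[n-q-p] \end{bmatrix} x[n] \right)$$
where $\theta_a = [a_{10}, \dots, a_{1m}, \dots, a_{p0}, \dots, a_{pm}]^{T}$ collects the
autoregressive weights. Although the process $x[n]$ is non-stationary, the
stationarity and ergodicity of the process $e[n]$, together with the linearity of
the model, allow us to replace in Equation 7 the expectation by a summation.
However, although consistent, the above estimator is often considered a poor
one [22].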
Least-squares estimation
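The two expansions in Equation 4 can be written in vector format as follows. Let
$\mathbf{f}[n] = [f_0[n], \dots, f_m[n]]^{T}$ denote the vector of basis functions, and
let $\mathbf{a}_i = [a_{i0}, \dots, a_{im}]^{T}$ and $\mathbf{b}_j = [b_{j0}, \dots, b_{jm}]^{T}$
denote the weight vectors, so that $a_i[n] = \mathbf{a}_i^{T}\mathbf{f}[n]$ and
$b_j[n] = \mathbf{b}_j^{T}\mathbf{f}[n]$.
Using this vector notation, Equation 3 can be written as
Equation 11
$$x[n] = -\sum_{i=1}^{p} \mathbf{a}_i^{T}\mathbf{f}[n-i]\, x[n-i] + \sum_{j=1}^{q} \mathbf{b}_j^{T}\mathbf{f}[n-j]\, e[n-j] + e[n]$$
or equivalently
Equation 12
$$x[n] = \boldsymbol{\varphi}[n]\,\boldsymbol{\theta} + e[n]$$
where $\boldsymbol{\varphi}[n]$ is a row vector
Equation 13
$$\boldsymbol{\varphi}[n] = \big[ -x[n-1]\,\mathbf{f}[n-1]^{T}, \dots, -x[n-p]\,\mathbf{f}[n-p]^{T},\; e[n-1]\,\mathbf{f}[n-1]^{T}, \dots, e[n-q]\,\mathbf{f}[n-q]^{T} \big]$$
and
Equation 14
$$\boldsymbol{\theta} = \big[ \mathbf{a}_1^{T}, \dots, \mathbf{a}_p^{T}, \mathbf{b}_1^{T}, \dots, \mathbf{b}_q^{T} \big]^{T}$$
Observe that the vector $\boldsymbol{\theta}$ contains all the unknowns of the TD-ARMA model.
Writing Equation 12 for $n = 1, \dots, N$ leads to
Equation 15
$$\mathbf{x} = \boldsymbol{\Phi}\,\boldsymbol{\theta} + \mathbf{e}$$
where
$$\mathbf{x} = \big[ x[1], \dots, x[N] \big]^{T}$$
$$\boldsymbol{\Phi} = \big[ \boldsymbol{\varphi}[1]^{T}, \dots, \boldsymbol{\varphi}[N]^{T} \big]^{T}$$
$$\mathbf{e} = \big[ e[1], \dots, e[N] \big]^{T}$$
The least-squares solution of Equation 15 is given by
Equation 16
$$\hat{\boldsymbol{\theta}} = \big( \boldsymbol{\Phi}^{T}\boldsymbol{\Phi} \big)^{-1} \boldsymbol{\Phi}^{T}\mathbf{x}$$
To overcome the computational complexity associated with the least-squares
estimation (involving in particular the inversion of a square matrix of size
$(p+q)(m+1)$), we opted for recursive least-squares estimation
as follows.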
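Recursive least-squares estimation
The recursive least squares algorithm is summarized as [24]
Equation 17
$$\hat{\boldsymbol{\theta}}[n] = \hat{\boldsymbol{\theta}}[n-1] + K[n]\,\big\{\, x[n] - \boldsymbol{\varphi}[n]\,\hat{\boldsymbol{\theta}}[n-1] \,\big\}$$
where $\hat{\boldsymbol{\theta}}[n]$ is the estimate at time n of the vector of unknown model
parameters, $\boldsymbol{\varphi}[n]$ is the corresponding row vector of regressors, and $K[n]$
is the gain vector.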
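The recursive update can be sketched generically as below. This is the textbook recursive least-squares recursion with an optional forgetting factor, written with NumPy and applied to a toy noise-free regression; it is not the MATLAB routine used in this chapter, and the names (recursive_least_squares, forgetting, delta) and data are assumptions.

```python
import numpy as np

def recursive_least_squares(regressors, targets, forgetting=1.0, delta=1e3):
    """Track theta in targets[n] ~ regressors[n] @ theta, one sample at a time."""
    num_samples, dim = regressors.shape
    theta = np.zeros(dim)          # initial parameter estimate
    P = delta * np.eye(dim)        # large initial value: weak prior on theta
    history = np.empty((num_samples, dim))
    for n in range(num_samples):
        phi = regressors[n]
        gain = P @ phi / (forgetting + phi @ P @ phi)
        theta = theta + gain * (targets[n] - phi @ theta)  # correct by the prediction error
        P = (P - np.outer(gain, phi @ P)) / forgetting
        history[n] = theta
    return history

# Toy data (assumption): two regressors and known true parameters.
rng = np.random.default_rng(0)
regressors = rng.standard_normal((200, 2))
true_theta = np.array([0.8, -0.3])
targets = regressors @ true_theta
estimates = recursive_least_squares(regressors, targets)
```

Each step costs a few matrix-vector products, so no large matrix is inverted. In the TD-ARMA setting each regressor row would hold the basis-weighted past samples and residuals rather than random numbers.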
Index of randomness
Over the past decade, there has been a stream of conflicting papers about
the long-range power-law correlations detected in eukaryotic DNA
[1-3,5,6,9-12,17-20]. The controversy is generated by conflicting views that either
advocate that non-coding DNA sustains long-range correlations whereas coding DNA
behaves like random sequences [1,10,2,3,5,6], or maintain that both coding and
non-coding DNA exhibit long-range power-law correlations [11,12,9,17-20].
Based on the analysis of the time-dependent power spectrum of genomic
sequences, Bouaynaya and Schonfeld [11,12] showed that the statistical
differences between coding and non-coding sequences are more subtle than
previously concluded using stationary analysis tools. In fact they found that both
coding and non-coding sequences are non-random. However, coding sequences
are "whiter" than non-coding sequences. We propose to quantitatively assess the
degree of randomness of both coding and non-coding sequences using the
time-dependent ARMA coefficients $a_i[n]$ and $b_j[n]$. Consider the system function,
$H(z)$, of a stationary ARMA model (whose coefficients $a_i$ and $b_j$ are constant,
i.e., independent of time). We know that
Equation 18
$$H(z) = \frac{1 + \sum_{j=1}^{q} b_j z^{-j}}{1 + \sum_{i=1}^{p} a_i z^{-i}} = \frac{\prod_{j=1}^{q} \left(1 - z_j z^{-1}\right)}{\prod_{i=1}^{p} \left(1 - p_i z^{-1}\right)}$$
where $\{z_j\}$ and $\{p_i\}$ are the zeros and poles of the system function,
respectively. From the fact that a stationary white noise process has a flat
spectrum, we observe that the closer (in L2 distance) the zeros and poles are,
the flatter is the spectrum of the process. Following the same reasoning locally
for non-stationary processes, we define the curve of randomness, CR[n], of the
non-stationary process $x[n]$ by
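Equation 19
$$CR[n] = \begin{cases} \min \Big( \sum_{(i,j)} \big| z_j[n] - p_i[n] \big| + \sum_{k} \big| z_k[n] \big| \Big), & q > p \\ \min \Big( \sum_{(i,j)} \big| z_j[n] - p_i[n] \big| + \sum_{k} \big| p_k[n] \big| \Big), & p > q \\ \min \Big( \sum_{(i,j)} \big| z_j[n] - p_i[n] \big| \Big), & p = q \end{cases}$$
where $z_j[n]$ and $p_i[n]$ are the zeros and poles of the system function at time n,
the minimum is taken over all pairs $(i, j)$, and k runs over the zeros (resp.
poles) left unpaired. Observe that the case $p > q$ is obtained from the case
$q > p$ by interchanging the roles of $z$ and $p$, and the indices $i$ and $j$. The
curve of randomness defined in Equation 19 is a
measure of how close the zeros and the poles of the system function are, and
therefore, is a measure of how flat the system function is, and how close is the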
process to white noise. The index of randomness, $IR$, of a TD-ARMA(p,q) is then
defined as the average of the curve of randomness, i.e.,
Equation 20
$$IR = \frac{1}{N} \sum_{n=1}^{N} CR[n]$$
In particular, the index of randomness of a TD-ARMA(1,1) process
$x[n] = -a[n-1]\,x[n-1] + b[n-1]\,e[n-1] + e[n]$ is given by
Equation 21
$$IR = \frac{1}{N} \sum_{n=1}^{N} \big| a[n] - b[n] \big|$$
Observe that the index of randomness of a white noise process is equal to
zero. We say that the sequence $x_1$ with index of randomness $IR_1$ is more
random than the sequence $x_2$ with index of randomness $IR_2$ if the index of
randomness of the former is lower than the index of randomness of the latter,
i.e., $IR_1 < IR_2$.
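For the TD-ARMA(1,1) case the index of randomness reduces to a one-line computation once the coefficient time series are available. The sketch below assumes NumPy and the sign convention x[n] = -a[n-1]x[n-1] + b[n-1]e[n-1] + e[n], under which the zero is -b[n] and the pole is -a[n]; the function names and the coefficient values are hypothetical.

```python
import numpy as np

def curve_of_randomness_11(a, b):
    """Distance between the zero (-b[n]) and the pole (-a[n]) at each time n."""
    zeros = -np.asarray(b, dtype=float)
    poles = -np.asarray(a, dtype=float)
    return np.abs(zeros - poles)

def index_of_randomness_11(a, b):
    """Average of the curve of randomness over the whole sequence."""
    return float(np.mean(curve_of_randomness_11(a, b)))

# When zero and pole coincide at every instant they cancel: the process is white.
ir_white = index_of_randomness_11([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
# Hypothetical coefficient series of a correlated (non-white) sequence.
ir_correlated = index_of_randomness_11([0.9, 0.8], [0.1, 0.2])
```

A lower value means the sequence is closer to white noise, so two sequences can be ranked by comparing their indices directly.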
Results
All genome sequences considered in this chapter have been extracted
from the NIH website http://www.ncbi.nlm.nih.gov. The algorithms were
implemented in MATLAB. The DNA sequences were mapped to numerical
sequences using the purine-pyrimidine DNA walk proposed in [1]. In our
simulations, the recursive least squares algorithm was found to best estimate the
time-dependent coefficients of the TD-ARMA model. We used the MATLAB
function rarmax, which implements the recursive least-squares algorithm for
TD-ARMA models. The orders p and q of the model were determined
experimentally as follows: for each genomic sequence, we computed 100
TD-ARMA models corresponding to the orders (1, 1) up to (10, 10). The best model
was chosen to be the one that minimizes the average squared error between the
actual and the fitted sequences. Our extensive simulations on various DNA
sequences from different organisms show that most of the sequences are best
fitted with low-order TD-ARMA models, e.g., TD-ARMA(1,1), TD-ARMA(1,2) and
TD-ARMA(2,1). Figure 4 shows the DNA walk of the Human gene 276 and its
TD-ARMA(1,1) fitted sequence. Observe that the TD-ARMA(1,1) model accurately
fits this gene sequence. The estimated time-varying coefficients a[n] and b[n]
are displayed in Figure 5 for both the coding and non-coding regions of this gene.
Their statistical differences are not clear from the plot of the time-series
coefficients. The curves of randomness of the coding and non-coding regions are
displayed in Figure 6.
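The order-selection procedure described above amounts to an exhaustive search over the 100 order pairs. A minimal sketch follows, assuming NumPy; select_orders and fit_predict are hypothetical names, and the moving-average predictor is only a stand-in for the actual TD-ARMA fit, which is not reproduced here.

```python
import itertools
import numpy as np

def select_orders(signal, fit_predict, max_order=10):
    """Return (mse, p, q) minimizing the mean squared error over p, q = 1..max_order."""
    best = None
    for p, q in itertools.product(range(1, max_order + 1), repeat=2):
        fitted = fit_predict(signal, p, q)
        mse = float(np.mean((np.asarray(signal) - np.asarray(fitted)) ** 2))
        if best is None or mse < best[0]:
            best = (mse, p, q)
    return best

def moving_average_predict(signal, p, q):
    """Stand-in model (assumption): predict x[n] as the mean of the previous p samples."""
    x = np.asarray(signal, dtype=float)
    fitted = x.copy()              # the first p samples are left unpredicted
    for n in range(p, len(x)):
        fitted[n] = x[n - p:n].mean()
    return fitted                  # q is unused by this stand-in

ramp = np.arange(200, dtype=float)
best_mse, best_p, best_q = select_orders(ramp, moving_average_predict)
```

Ties are resolved in favor of the first pair visited, so the lowest orders win when several models fit equally well.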
Table 1 shows the index of randomness of various gene sequences. For
concise representation, the column titles have been abbreviated as follows: "C.
length" (resp."N.C. length") denotes the length (in base pairs) of the coding (resp.
non-coding) segment of the gene. The total length of the gene is the sum of the
lengths of its coding and non-coding regions. "C. (p, q)" (resp. "N.C. (p, q)")
denotes the optimal TD-ARMA parameters (p, q) for the coding (resp. non-coding)
region of the gene. Finally, "C. IR" (resp. "N.C. IR") is the index of randomness of
the coding (resp. non-coding) segment of the gene. Observe that, in all
considered genes, the indices of randomness of both the coding and non-coding
segments are strictly positive, and the index of randomness of the coding region
is consistently lower than the index of randomness of the non-coding region
(recall that the index of randomness of a white noise is zero). These observations
lead to two important conclusions: (i) Both the coding and non-coding
sequences are not random, as attested by an index of randomness greater than
zero. (ii) The coding sequences are "whiter" than the non-coding sequences.
This conclusion contradicts previous work on statistical correlation of DNA
sequences, which, based on stationary time-series analysis, presumed that
coding DNA behaves like random sequences [1-3,5,6,10]; and supports the
opposing view that both coding and non-coding sequences are not random
[11,12,9,17-20]. In particular, our conclusion is in accordance with the
evolutionary periodogram analysis conducted in [11,12].
Figure 4 TD-ARMA modeling. TD-ARMA modeling of the Human gene 276: The blue signal is the
DNA walk, and the red signal is the TD-ARMA(1,1) fitted signal. The TD-ARMA(1,1) model
accurately fits the genomic signal
Figure 5 TD-ARMA coefficients estimation. Estimation of the TD-ARMA(1,1) coefficients of the
Human gene 276. The TD-ARMA(1,1) model is given by
$x[n] = -a[n-1]\,x[n-1] + b[n-1]\,e[n-1] + e[n]$. The blue and black (resp. red and green)
curves depict the time series $a[n]$ (resp. $b[n]$) for the coding and non-coding
regions of the gene, respectively.
Figure 6 Curve of randomness. The curves of randomness of the coding and non-coding regions
of the Human gene 276 are shown in blue and red, respectively. The index of
randomness of the coding sequence is equal to 1.0603, whereas its corresponding value
for the non-coding sequence is equal to 1.0627
Table 1 Index of randomness of the Coding and Non-Coding Segments of Various Gene Sequences
Chapter 4 Two-Dimensional ARMA Modeling for Breast
Cancer Detection and Classification
Abstract
Computer aided diagnosis (CAD) paradigms have gained currency for
discriminating malignant from benign lesions in ultrasound breast images. But
even the most sophisticated investigators often rely on one-dimensional
representations of the image in terms of its scan lines. Such vector
representations are convenient because of the mathematical tractability of one-
dimensional time-series. However, they fail to take into account the spatial
correlations between the pixels, which is crucial in tumor detection and
classification in breast images. In this chapter, we propose a CAD system for
tumor detection and classification (cancerous vs. benign) in ultrasound breast
images based on a two-dimensional Auto-Regressive-Moving-Average (ARMA)
model of the breast image. First, we show, using the Wold decomposition
theorem, that ultrasound breast images can be accurately modeled by two-
dimensional ARMA random fields. As in the 1D case, the 2D ARMA parameter
estimation problem is much more difficult than its 2D AR counterpart, due to the
nonlinearity in estimating the 2D moving average (MA) parameters. We propose
to estimate the 2D ARMA parameters using a two-stage Yule-Walker Least-
Squares algorithm. The estimated parameters are then used as the basis for
statistical inference and biophysical interpretation of the breast image. We
evaluate the performance of the 2D ARMA vector features in real ultrasound
images using a k-means classifier. Our results suggest that the proposed CAD
system based on a two-dimensional ARMA model leads to parameters that can
accurately segment the ultrasound breast image into three regions: healthy
tissue, benign tumor, and cancerous tumor. Moreover, the specificity and
sensitivity of the proposed two-dimensional CAD system are superior to those of
its one-dimensional counterpart.
Introduction
Breast cancer continues to be a significant public health problem in the
United States: It is the second leading cause of female mortality, and,
disturbingly, one out of eight women in the United States will be diagnosed with
breast cancer in her lifetime. Until the cause of this disease is fully understood,
early detection remains the only hope to improve breast cancer prognosis and
treatment. Breast cancer screening modalities are mainly based on clinical
examination, mammography, ultrasound imaging, magnetic resonance imaging
(MRI), and core biopsy. Mammography (breast x-ray imaging) is by far the
fastest and cheapest screening test for breast cancer. Unfortunately, it is also
among the most difficult of radiological images to interpret: mammograms are of
low contrast, and features indicative of breast disease are often very small. Many
studies have shown that ultrasound and MRI imaging techniques can help
supplement mammography by detecting small breast cancers that may not be
visible with mammography. However, these techniques often fail to determine if a
detected tumor is cancerous or benign, and a biopsy may be recommended.
Consequently, many unnecessary biopsies are often undertaken due to the high
false positive rate. Computer aided diagnosis (CAD) paradigms have recently
received great attention for lesion detection and discrimination in X-ray
mammograms and ultrasound breast images [25]–[28]. The large number of negative
biopsies encountered in clinical practice could be reduced if a computer system
were available to help radiologists screen breast images. Broadly, the CAD
systems proposed in the literature can be grouped into four major categories:
geometrical [24], artificial intelligence [25], pyramidal (or multi-resolution) [27],
and model-based techniques [28], [29]. Geometrical methods employ
morphological and other segmentation techniques to extract small specks of
calcium known as microcalcifications from breast images [25]. However, this
procedure usually requires a priori knowledge of the tumor pattern
characteristics. Moreover, these techniques also tend to rely on many stages of
heuristics attempting to eliminate false positives. Artificial intelligence techniques
include neural networks and fuzzy logic methods. The performance of these
systems is tied to the architecture of the network and the amount of training data.
Breast cancer is a heterogeneous disease which includes several subtypes with
distinct prognoses. In particular, the variability associated with the appearance of
breast cancer, ranging from relative uniformity to complex patterns of bright
streaks and blobs [26], means that an artificial neural network (ANN) requires a
large training data set to ensure a certain level of reliability. Pyramidal or
multi-resolution techniques refer mainly
to the wavelet transform [27], which can be seen from a signal decomposition
viewpoint. Specifically, a signal is decomposed onto a set of basis wavelet
functions. A very appealing feature of wavelet analysis is that it provides a
uniform resolution for all scales. However, because the resolution is limited by
the size of the basic wavelet function, the downside of uniform resolution is
uniformly poor resolution. Model-based methods include linear, non-linear and
finite-element methods to build an accurate model of the breast [28], [29]. The
model is subsequently used for image matching, detection, and classification [29].
The accuracy of the results is tied to the accuracy of the considered model. In this
work, we propose a new model-based CAD system for tumor detection and
classification. We show that (x-ray, ultrasound, and MRI) breast images can be
accurately modeled by two-dimensional autoregressive moving average (ARMA)
random fields. The model parameters, being the fingerprints of the image, serve
as the basis for statistical inference and biophysical interpretation of the breast
image. ARMA models are parametric representations of wide-sense stationary
(WSS) processes with rational spectra. The Wold decomposition theorem states
that any WSS process can be decomposed as the sum of a regular process,
whose spectrum is continuous, and a predictable process, whose spectrum
consists of impulses. Since rational spectra form a dense set in the class of
continuous spectra, the ARMA model renders accurately the regular part of the
WSS process. It is, therefore, surprising that very few researchers have
attempted to derive a general ARMA representation of the breast image, and use
it for tumor detection and classification. In [29], the authors use a one-
dimensional fractional differencing autoregressive moving average (FARMA)
process to model the ultrasound RF echo reflected from the breast tissue.
However, by considering separate scan lines, they do not take into account the
two-dimensional spatial correlation between the pixels in the image. In [30], an
autoregressive (AR) model is considered for improving the contrast of breast
cancer lesions in ultrasound images. ARMA models, however, provide a more
accurate model of a homogeneous random field than an AR model. As in the 1D
case, the 2D ARMA parameter estimation problem is much more difficult than its
2D AR counterpart, due to the non-linearity in estimating the 2D moving average
(MA) parameters.
2D-ARMA Modeling
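We represent the breast image as a 2D random field $\{x[n,m],\ (n,m) \in \mathbb{Z}^2\}$.
We define a partial order on the discrete lattice as follows
Equation 22
\[ (i,j) \le (s,t) \iff i \le s \ \text{and} \ j \le t. \]
The 2D-ARMA(p1,p2,q1,q2) model is defined for the $N_1 \times N_2$ image
$\{x[n,m],\ 0 \le n \le N_1 - 1,\ 0 \le m \le N_2 - 1\}$
by the following difference equation
Equation 23
\[ \sum_{i=0}^{p_1} \sum_{j=0}^{p_2} a_{ij}\, x[n-i, m-j] = \sum_{i=0}^{q_1} \sum_{j=0}^{q_2} b_{ij}\, w[n-i, m-j], \]
where $\{w[n,m]\}$ is a stationary white noise field with variance $\sigma^2$, and the
coefficients $\{a_{ij}\}$, $\{b_{ij}\}$ are the parameters of the model. From Equation 23, the
image can be viewed as the output of the linear time-invariant causal
system $H(z_1, z_2)$ excited by a white noise input, where
Equation 24
\[ H(z_1, z_2) = \frac{B(z_1, z_2)}{A(z_1, z_2)} = \frac{\sum_{i=0}^{q_1} \sum_{j=0}^{q_2} b_{ij}\, z_1^{-i} z_2^{-j}}{\sum_{i=0}^{p_1} \sum_{j=0}^{p_2} a_{ij}\, z_1^{-i} z_2^{-j}}, \]
with $a_{00} = b_{00} = 1$.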
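To make the difference equation concrete, the following minimal Python sketch generates a field by driving the recursion with Gaussian white noise. It assumes the normalization a[0, 0] = b[0, 0] = 1 and treats samples outside the image as zero; the function name `simulate_arma2d` and its arguments are illustrative and are not taken from the original work.

```python
import numpy as np

def simulate_arma2d(a, b, shape, sigma=1.0, seed=0):
    """Generate x from sum_ij a[i,j] x[n-i,m-j] = sum_ij b[i,j] w[n-i,m-j].

    a, b : 2D coefficient arrays with a[0,0] == b[0,0] == 1 (assumed).
    Samples outside the image are treated as zero (an assumption of this sketch).
    Returns the simulated field x and the driving white noise w.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    rng = np.random.default_rng(seed)
    w = sigma * rng.standard_normal(shape)
    x = np.zeros(shape)
    for n in range(shape[0]):
        for m in range(shape[1]):
            acc = 0.0
            # moving-average part: weighted sum of current and past noise samples
            for i in range(min(b.shape[0], n + 1)):
                for j in range(min(b.shape[1], m + 1)):
                    acc += b[i, j] * w[n - i, m - j]
            # autoregressive part: move past outputs to the right-hand side
            for i in range(min(a.shape[0], n + 1)):
                for j in range(min(a.shape[1], m + 1)):
                    if (i, j) != (0, 0):
                        acc -= a[i, j] * x[n - i, m - j]
            x[n, m] = acc
    return x, w
```

For example, `simulate_arma2d([[1.0, -0.5]], [[1.0]], (64, 64))` produces a field in which each pixel equals the noise sample plus half of its left neighbour.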
Yule-Walker Least-Squares Parameter Estimation
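Assume first that the noise sequence $\{w[n,m]\}$ were known. Then the
problem of estimating the parameters in the ARMA model in Equation 23 would be
a simple input-output system parameter estimation problem, which could be
solved by several methods, the simplest of which is the least-squares (LS)
method. In the LS method, we express Equation 23 as
Equation 25
\[ x[n,m] + \varphi^T[n,m]\, \theta = w[n,m], \]
where
Equation 26
\[ \varphi^T[n,m] = \big[\, x[n,m-1], \ldots, x[n-p_1,m-p_2],\ -w[n,m-1], \ldots, -w[n-q_1,m-q_2] \,\big] \]
and
Equation 27
\[ \theta = [\, a_{01}, \ldots, a_{p_1 p_2},\ b_{01}, \ldots, b_{q_1 q_2} \,]^T. \]
Writing Equation 25 in matrix form for $n = L_1, \ldots, N_1 - 1$ and
$m = L_2, \ldots, N_2 - 1$ (for an $N_1 \times N_2$ image indexed from zero),
for some $L_1 \ge \max(p_1, q_1)$ and $L_2 \ge \max(p_2, q_2)$, gives
Equation 28
\[ \mathbf{x} + Z \theta = \mathbf{w}, \]
where
Equation 29
\[ \mathbf{x} = [\, x[L_1, L_2], \ldots, x[N_1 - 1, N_2 - 1] \,]^T, \qquad \mathbf{w} = [\, w[L_1, L_2], \ldots, w[N_1 - 1, N_2 - 1] \,]^T, \]
and $Z$ is the matrix whose rows are the regressors $\varphi^T[n,m]$, stacked in the
same order. Assume we know $Z$; then we can obtain a least-squares estimate of the
parameter vector $\theta$ in Equation 28 as
Equation 30
\[ \hat{\theta} = -(Z^T Z)^{-1} Z^T \mathbf{x}. \]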
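Observe that the model input noise $\{w[n,m]\}$ in $Z$ is unknown.
Nevertheless, it can be estimated by considering the noise process $\{w[n,m]\}$ as
the output of the linear filter $A(z_1,z_2)/B(z_1,z_2)$, the inverse of the system in
Equation 24, with input $\{x[n,m]\}$. From Nirenberg’s
proof of the division theorem in multi-dimensional spaces [32], we can write the
inverse ARMA filter as the infinite order AR filter
Equation 31
\[ \frac{A(z_1, z_2)}{B(z_1, z_2)} = \sum_{i=0}^{\infty} \sum_{j=0}^{\infty} \alpha_{ij}\, z_1^{-i} z_2^{-j}. \]
In the time domain we obtain
Equation 32
\[ w[n,m] = \sum_{i=0}^{\infty} \sum_{j=0}^{\infty} \alpha_{ij}\, x[n-i, m-j]. \]
Therefore, we can estimate $\{w[n,m]\}$ by first estimating the AR parameters
$\{\alpha_{ij}\}$ and next obtaining $\{w[n,m]\}$ by filtering $\{x[n,m]\}$ as in Equation 32. Since we
cannot estimate an infinite number of (independent) parameters from a finite
number of samples, we approximate the infinite AR model by one of finite order, say
Equation 33
\[ (K_1, K_2). \]
The parameters in the truncated AR model can be estimated by using a 2D extension
of the Yule-Walker equations as follows
Equation 34
\[ \sum_{i=0}^{K_1} \sum_{j=0}^{K_2} \alpha_{ij}\, r[k-i, l-j] = \sigma^2\, \delta[k,l], \qquad 0 \le k \le K_1,\ 0 \le l \le K_2, \]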
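where $r[k,l]$ are the autocorrelation values of the field $\{x[n,m]\}$, computed as
follows:
Equation 35
\[ r[k,l] = \frac{1}{N_1 N_2} \sum_{n} \sum_{m} x[n,m]\, x[n-k, m-l], \]
with the sums taken over all $(n,m)$ for which both samples lie inside the image,
and $\delta[k,l]$ is the 2D Kronecker delta function. Equation 34 is a system of linear
equations that can be written in matrix form and solved for the coefficients $\{\alpha_{ij}\}$.
Finally, the Yule-Walker Least-Squares algorithm is summarized below
1. Estimate the parameters $\{\alpha_{ij}\}$ in an AR($K_1$, $K_2$) model of $\{x[n,m]\}$ by the
Yule-Walker method in Equation 34. Obtain an estimate of the noise field
$\{w[n,m]\}$ as
Equation 36
\[ \hat{w}[n,m] = \sum_{i=0}^{K_1} \sum_{j=0}^{K_2} \hat{\alpha}_{ij}\, x[n-i, m-j] \]
for $n = K_1, \ldots, N_1 - 1$ and $m = K_2, \ldots, N_2 - 1$.
2. Replace the $w[n,m]$ in $Z$ by $\hat{w}[n,m]$ computed in Step 1. Obtain $\hat{\theta}$ in Equation
30 with $L_1 = K_1 + q_1$ and $L_2 = K_2 + q_2$.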
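The two steps above can be sketched in Python as follows. This is an illustrative reading of the algorithm under stated assumptions, not the original implementation: the image is indexed from zero, samples outside the image are treated as zero when filtering, the autocorrelation uses the biased estimate, and the normalization a[0,0] = b[0,0] = alpha[0,0] = 1 is imposed. All function names are hypothetical.

```python
import numpy as np

def autocorr2d(x, k, l):
    """Biased sample autocorrelation r[k, l] = mean of x[n, m] * x[n-k, m-l]."""
    n1, n2 = x.shape
    u = x[max(k, 0):n1 + min(k, 0), max(l, 0):n2 + min(l, 0)]
    v = x[max(-k, 0):n1 + min(-k, 0), max(-l, 0):n2 + min(-l, 0)]
    return float(np.sum(u * v)) / (n1 * n2)

def fit_ar_yule_walker(x, K1, K2):
    """Step 1a: solve the 2D Yule-Walker equations for alpha (alpha[0, 0] = 1)."""
    lags = [(i, j) for i in range(K1 + 1) for j in range(K2 + 1) if (i, j) != (0, 0)]
    R = np.array([[autocorr2d(x, k - i, l - j) for (i, j) in lags] for (k, l) in lags])
    rhs = -np.array([autocorr2d(x, k, l) for (k, l) in lags])
    alpha = np.ones((K1 + 1, K2 + 1))
    for (i, j), value in zip(lags, np.linalg.solve(R, rhs)):
        alpha[i, j] = value
    return alpha

def filter_causal(c, x):
    """y[n, m] = sum_ij c[i, j] x[n-i, m-j], with zeros outside the image."""
    y = np.zeros(x.shape)
    for i in range(c.shape[0]):
        for j in range(c.shape[1]):
            y[i:, j:] += c[i, j] * x[:x.shape[0] - i, :x.shape[1] - j]
    return y

def fit_arma_two_stage(x, p, q, K):
    """Two-stage estimate of the ARMA coefficients; returns dicts keyed by lag (i, j)."""
    (p1, p2), (q1, q2), (K1, K2) = p, q, K
    x = np.asarray(x, dtype=float)
    alpha = fit_ar_yule_walker(x, K1, K2)
    w_hat = filter_causal(alpha, x)            # Step 1b: estimated noise field
    L1, L2 = K1 + q1, K2 + q2                  # first sample with valid regressors
    assert L1 >= p1 and L2 >= p2, "increase K so that all regressors exist"

    def lagged(f, i, j):
        return f[L1 - i:f.shape[0] - i, L2 - j:f.shape[1] - j].ravel()

    a_lags = [(i, j) for i in range(p1 + 1) for j in range(p2 + 1) if (i, j) != (0, 0)]
    b_lags = [(i, j) for i in range(q1 + 1) for j in range(q2 + 1) if (i, j) != (0, 0)]
    Z = np.column_stack([lagged(x, i, j) for i, j in a_lags] +
                        [-lagged(w_hat, i, j) for i, j in b_lags])
    # Step 2: least-squares solution of x + Z theta = w, i.e. theta = -(Z'Z)^-1 Z'x
    theta, *_ = np.linalg.lstsq(Z, -lagged(x, 0, 0), rcond=None)
    a = dict(zip(a_lags, theta[:len(a_lags)]))
    b = dict(zip(b_lags, theta[len(a_lags):]))
    return a, b
```

The truncation order K trades bias against variance: a larger K approximates the inverse filter better but requires more AR coefficients to be estimated from the same data.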
Tumor detection and classification
The estimated ARMA parameters, $\{a_{ij}\}$, $\{b_{ij}\}$, are used as a basis for
inference about the presence of a tumor and its nature: benign or cancerous. We
use the k-means algorithm to segment the breast image into 3 classes: healthy
tissue, benign tumor and cancerous tumor. Our method consists of representing
each pixel in the image by an ARMA model whose parameters are estimated by
using an appropriate neighborhood for the pixel. We make the assumption that
all pixels in the considered neighborhood belong to the same class, and hence,
for computational efficiency, we replace the entire neighborhood by the vector
value of the estimated ARMA parameters. This procedure is repeated for the
entire image, creating a new block by block vector-valued image, which will be
the input to the k-means classifier.
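The block-by-block procedure can be sketched as follows. This is a minimal illustration, not the original implementation: `feature_fn` is a hypothetical placeholder for the per-block ARMA parameter estimator, non-overlapping square blocks are assumed, and the k-means routine is a plain Lloyd iteration with a deterministic farthest-point initialization chosen here for reproducibility.

```python
import numpy as np

def block_features(image, block, feature_fn):
    """Replace each non-overlapping block x block window by its feature vector."""
    rows, cols = image.shape[0] // block, image.shape[1] // block
    feats = [feature_fn(image[r * block:(r + 1) * block, c * block:(c + 1) * block])
             for r in range(rows) for c in range(cols)]
    return np.asarray(feats, dtype=float), (rows, cols)

def kmeans(points, k, iters=50):
    """Plain Lloyd's algorithm; returns one label per point and the cluster centers."""
    # farthest-point initialization: start from the first point, then repeatedly
    # add the point farthest from all centers chosen so far
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[int(np.argmax(d))])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new = np.array([points[labels == c].mean(axis=0) if np.any(labels == c)
                        else centers[c] for c in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers
```

With the three classes used in this chapter one would call `kmeans(features, 3)` and reshape the labels to the block grid to obtain the segmentation map.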
Simulations
Although the proposed algorithm is independent of the imaging modality of
the breast, we perform our simulations on ultrasound images, collected from the
Radiology department, College of Medicine at the University of Illinois at
Chicago. Our database of cancerous images shows intraductal carcinoma, which
is the most common type of breast cancer in women. Intraductal carcinoma is
usually discovered through a mammogram or an ultrasound as
microcalcifications. Our benign tumor images show fibroadenoma of the
breast, which is a benign fibroepithelial tumor characterized by proliferation of
both glandular and stromal elements. Our extensive simulations indicate that
ARMA[2,2,2,2] is a sufficient model order, in terms of mean square error, to
accurately represent ultrasound breast images. Figure 7 shows two ultrasound
images, one with a cancerous tumor and one with a benign tumor, and their
respective 2D-ARMA[2,2,2,2] and 1D-ARMA[2,2] models. It is visually clear that
the 2D-ARMA model accurately represents both ultrasound images, whereas the
1D model fails to capture any image feature.
We estimate the 2D-ARMA parameters using a window of size .
The choice of the window size presents an inherent trade-off between the
accuracy of the representation and the accuracy of the classification. A large
window size would lead to a better representation of the 2D-ARMA model, but
might include pixels from different classes. We found that for 256 × 256 images, a
window size leads to a good segmentation performance. Each image is
therefore represented by a number of 2D-ARMA feature vectors, which
contain the 8 parameters for each sub-
block image. Without loss of generality, we chose $a_{00} = b_{00} = 1$. Therefore, the
size of the feature vectors reduces to 6 instead of 8. We decide that an image
has a cancerous (resp., benign) tumor if at least one of the sub-block images is
classified as a cancerous (resp., benign) tumor. Otherwise, we conclude that the
image is healthy and contains no tumors.
We conducted our simulations using 573 ARMA feature vectors of healthy,
benign and cancerous ultrasound breast images. The ARMA feature vectors
were used as the input to a k-means classifier. Figure 7(c) and Figure 7(g) show
the segmentation outputs of the cancerous and benign tumor images,
respectively. We can observe clear delineations of the tumors from the healthy
tissues in both cases. The accuracy, sensitivity and specificity of the 2D-ARMA
and 1D-ARMA k-means classifiers are shown in Table 2. The 2D-ARMA feature
vectors are clearly more discriminative than their one-dimensional counterparts.
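Table 2 reports three standard figures of merit. The text does not spell out how they were computed; assuming the usual definitions in terms of true and false positives (TP, FP) and true and false negatives (TN, FN), they read:

```latex
\[
\text{sensitivity} = \frac{TP}{TP + FN}, \qquad
\text{specificity} = \frac{TN}{TN + FP}, \qquad
\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.
\]
```

Under these definitions, a low sensitivity means that many tumor blocks are missed, while a low specificity means that many healthy blocks are flagged.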
Figure 7 ARMA modeling and segmentation of ultrasound breast images: (a) cancerous
ultrasound image; (b) 2D-ARMA[2,2,2,2] representation of (a); (c) segmentation of (b)
using an appropriate threshold; (d) 1D-ARMA[2,2] representation of (a); (e) benign tumor
ultrasound image; (f) 2D-ARMA[2,2,2,2] representation of (e); (g) segmentation of (f)
using an appropriate threshold; (h) 1D-ARMA[2,2] representation of (e).
Table 2 Classification accuracy of cancerous and benign tumors
Model      Accuracy   Sensitivity   Specificity
2D-ARMA    92.87%     92.03%        94.14%
1D-ARMA    78.51%     59.54%        79.76%
Chapter 5 Statistical Sequential Analysis for Detection of
Microcalcifications in Digital Mammograms
Abstract
We formulate the problem of microcalcification detection in digital
mammograms as a statistical change detection problem in the local properties of
the image. First, we represent mammograms by two-dimensional autoregressive
moving-average (2D ARMA) fields; thus uniquely characterizing the images by
their reduced dimensionality 2D ARMA feature vectors. Texture changes in
mammograms are then modeled as an additive change in the mean parameter of
the PDF associated with the 2D ARMA feature vector sequence that describes
the image. A generalized likelihood ratio test is used to detect these changes and
estimate the exact time (or space) where they occur. Our simulation results on
the Digital Database for Screening Mammography hosted by the University of
South Florida show that the decision functions of cancerous images present high
peaks at the microcalcification locations, whereas they exhibit a uniform behavior
for healthy mammograms. The proposed algorithm achieves a sensitivity and
specificity of 96.9% and 97.8%, respectively.
Introduction
With the rapid expansion in the number and volume of digital mammograms and
the increasing demand for fast access to relevant medical data, the
interpretation and retrieval of medical information have become of paramount
importance [33]. Mammography is the current standard for breast cancer
diagnosis. Women 40 years of age or older are advised to undergo a
screening mammogram to check for breast malignancies every 6 months.
Screening mammograms usually involve two x-rays of each breast. This process
generates a huge amount of data that needs to be processed, interpreted and
saved.
The presence of microcalcifications (tiny deposits of calcium) in the breast
is an important sign for the detection of early breast carcinoma. Accurate and
uniform evaluation of the enormous number of mammograms generated in
widespread screening is a difficult task: between 10% and 30% of breast carcinomas
are missed by trained radiologists [34]. Mammograms are low-contrast images, and the
breast malignancies present a great diversity in shape, size and location, and low
distinguishability from the surrounding healthy tissue.
In the last two decades, various computer-aided diagnosis (CAD) systems have
been proposed to help bring suspicious areas on the mammogram to the
radiologist’s attention. Many approaches were considered including denoising
[35], segmentation [36], filtering [37], machine learning [38], [39] and artificial
intelligence [37], mathematical morphology [40], time-frequency analysis and
multiresolution techniques, and neural networks [34]. Despite their technical
differences, these approaches share a common outline: they are all deterministic.
They usually assume a small region of interest as a subject of recognition.
Hence, their performance is contingent upon the natural variability of healthy and
cancerous mammography images.
In contrast to deterministic methods, statistical methods take into account
the noise in the digitized mammogram and the heterogeneity of its characteristics
by considering an underlying probability distribution of the image features. It is,
therefore, surprising that very few researchers have pursued this direction.
Statistical analysis of mammograms was mainly considered in the context of
textural information [41], [42]. In [41], the third and fourth order statistical
moments, skewness and kurtosis, were estimated from the bandpass filtered
mammogram. A region with high positive skewness and kurtosis is marked as a
region of interest. In [42], a statistical model of the mammographic image, termed
the “log-likelihood image”, is generated from the original mammogram image.
However, the method does not include any decision making, and the
log-likelihood image has the same resolution as the original mammogram.
The challenge in breast carcinoma localization is that the detection
algorithm must be able to handle all types of microcalcifications. Therefore, it is
necessary to formulate the detection problem beyond the use of empirical
observations about the type, shape, size or location of microcalcifications, which
may or may not hold in all cases. In order to address these challenges, we pose
the microcalcification detection problem in the context of statistical sequential
representation and analysis of mammograms. A mammogram image is
considered to be a realization of a stochastic process. We use statistical
analysis to detect parameter changes of the stochastic process, which will
indicate the presence of suspicious areas in the breast. In our approach, we
achieve a decision-making CAD system through use of dimensionality reduction
and sufficient statistics. We first show that mammograms can be accurately
modeled as 2D autoregressive moving-average (ARMA) fields, and thus each
image can be solely represented by its 2D ARMA coefficients.
In this chapter, we consider a change detection framework based on
additive modeling. Specifically, we detect changes of the mean parameter of the
PDF associated with the 2D ARMA feature vector sequence. The sufficient
statistic used is based on the generalized likelihood ratio. Thus, the main steps
used for detecting microcalcifications in mammograms are the 2D ARMA
dimensionality reduction of the original image followed by change detection on
the resulting feature vectors. In particular, no a priori assumptions are made
about the specific nature of the microcalcifications (e.g., circular, smooth, etc.).
2D-ARMA representation
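We represent the breast image as a 2D random field
$\{x[n,m],\ (n,m) \in \mathbb{Z}^2\}$ [43]. We define a partial order on the discrete
lattice as follows: $(i,j) \le (s,t) \iff i \le s$ and $j \le t$ [11]. The 2D
ARMA($p_1, p_2, q_1, q_2$) model is defined for the $N_1 \times N_2$ image
$\{x[n,m],\ 0 \le n \le N_1 - 1,\ 0 \le m \le N_2 - 1\}$
by the following difference equation
Equation 37
\[ \sum_{i=0}^{p_1} \sum_{j=0}^{p_2} a_{ij}\, x[n-i, m-j] = \sum_{i=0}^{q_1} \sum_{j=0}^{q_2} b_{ij}\, w[n-i, m-j], \]
where $\{w[n,m]\}$ is a stationary white noise field with variance $\sigma^2$, and the
coefficients $\{a_{ij}, b_{ij}\}$ are the parameters of the model.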
A Two-stage Yule-Walker Least Squares parameter estimation method
was proposed in [43]. First, the noise sequence $\{w[n,m]\}$ is assumed to be
known. The ARMA parameter estimation problem is then reduced to a simple
input-output system identification problem, which is solved by a least-squares
(LS) method. The final estimate is then obtained by estimating the noise, using a
truncated autoregressive (AR) model, and plugging it back in the Least Squares
solution [43].
In practice, the ARMA parameters are estimated block by block using a window of
fixed size. The choice of the window size presents an inherent trade-off between
the accuracy of the ARMA representation and the reliability of the classification.
An image is therefore represented by one ARMA feature vector per window. Let
$Y_k = [\{a_{ij}\}, \{b_{ij}\}]^T$
be the ARMA feature vector of the k-th block. The mammogram image is
then represented by the sequence of feature vectors $(Y_k)$ rather than by the raw
pixels of the unprocessed image. The 2D ARMA
model presents a compressed representation of the image, which will lead to an
efficient implementation of the CAD system. For instance, for the 16 × 16 window
and the 2D ARMA[2,2,2,2] model used in our experiments, the 2D ARMA model
represents a dimensionality reduction of more than 97% compared to the original image.
Figures 9(b) and 9(f) show the 2D ARMA[2,2,2,2] models of a healthy and a
cancerous mammogram, respectively. Subsection A of the Results section discusses in
detail the choice of the model degree parameters $(p_1, p_2, q_1, q_2)$.
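As a quick check of the reduction figure, assume (this is an assumption carried over from the ultrasound experiments of Chapter 4 and from the 16x16 sub-images named in the figure captions) that each 16 × 16 block is summarized by 6 free ARMA coefficients:

```latex
\[
1 - \frac{6}{16 \times 16} = 1 - \frac{6}{256} \approx 0.977 ,
\]
```

that is, a reduction of about 97.7%, which is consistent with the "more than 97%" stated in the text.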
The problem of tumor detection becomes one of detecting changes in the
parameters of the probability density function (PDF) associated with the ARMA
vector random process.
Change detection algorithm
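The 2D ARMA feature vectors are assumed to form an i.i.d. (independent
and identically distributed) sequence of r-dimensional random vectors $(Y_k)_{k \ge 1}$,
with Gaussian distribution $\mathcal{N}(\theta, \Sigma)$
having PDF:
Equation 38
\[ p_\theta(Y_k) = \frac{1}{\sqrt{(2\pi)^r \det \Sigma}} \exp\!\left( -\frac{1}{2} (Y_k - \theta)^T \Sigma^{-1} (Y_k - \theta) \right). \]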
Observe that the ARMA feature vectors are assumed to be independent.
However, the components of each ARMA feature vector are correlated, with
covariance matrix $\Sigma$. The independence of the ARMA feature vectors reflects an
independence assumption between pixels in different sub-blocks of an
image. The tumor detection is modeled as a change in the vector parameter $\theta$
of the PDF characterizing the feature vector random process. Let the
parameter $\theta_0$ be the value before the change, and $\theta_1$ the value after the
change. In general, we have minimal or no information about the parameter $\theta_1$
after change. Let us begin by discussing the scenario where there is a known
upper bound for $\theta_0$ and a known lower bound for $\theta_1$. In this case, the change
detection problem is equivalent to the following:
Equation 39
\[ \theta(k) = \theta \ \text{with} \ \begin{cases} \|\theta - \theta_0\|_\Sigma^2 \le a^2 & \text{when } k < t_0 \\ \|\theta - \theta_0\|_\Sigma^2 \ge b^2 & \text{when } k \ge t_0 \end{cases} \]
where
$\|\theta - \theta_0\|_\Sigma^2 = (\theta - \theta_0)^T \Sigma^{-1} (\theta - \theta_0)$, $t_0$ is the true change time and $a < b$. The case
of interest, where $\theta_0$ is assumed to be known and $\theta_1$ unknown, can be obtained
as a limit case of the solution to the above problem.
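The solution to the detection problem formulated in Equation 39 can be
obtained by deriving the generalized likelihood ratio (GLR) test [44], where the
unknown parameters are replaced by their maximum likelihood estimates. Thus,
the generalized likelihood ratio for the sequence $Y_j, \ldots, Y_k$ is:
Equation 40
\[ \Lambda_j^k = \frac{\sup_{\|\theta - \theta_0\|_\Sigma \ge b} \prod_{i=j}^{k} p_\theta(Y_i)}{\sup_{\|\theta - \theta_0\|_\Sigma \le a} \prod_{i=j}^{k} p_\theta(Y_i)}, \]
where $p_\theta$ is the corresponding parameterized probability density function. The
sequential GLR algorithm is then given by
Equation 41
\[ t_a = \min\{ k : g_k \ge h \}, \qquad g_k = \max_{1 \le j \le k} \ln \Lambda_j^k, \]
where $k$ is the discrete time index, $t_a$ is the alarm (detection) time, $g_k$ is the
test statistic, and $h$ is a threshold.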
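Given the i.i.d. Gaussian assumption, $\ln \Lambda_j^k$ can be written as
Equation 42
\[ \ln \Lambda_j^k = \ln \frac{\sup_{\|\theta - \theta_0\|_\Sigma \ge b} \exp\!\left( -\frac{1}{2} \sum_{i=j}^{k} (Y_i - \theta)^T \Sigma^{-1} (Y_i - \theta) \right)}{\sup_{\|\theta - \theta_0\|_\Sigma \le a} \exp\!\left( -\frac{1}{2} \sum_{i=j}^{k} (Y_i - \theta)^T \Sigma^{-1} (Y_i - \theta) \right)}. \]
It can be shown that $\ln \Lambda_j^k$ can be rewritten as [44]
Equation 43
\[ \frac{2}{k - j + 1} \ln \Lambda_j^k = \begin{cases} -(\chi_j^k - b)^2 & \text{if } \chi_j^k < a \\ -(\chi_j^k - b)^2 + (\chi_j^k - a)^2 & \text{if } a \le \chi_j^k \le b \\ +(\chi_j^k - a)^2 & \text{if } \chi_j^k > b \end{cases} \]
where
Equation 44
\[ \chi_j^k = \left[ (\bar{Y}_j^k - \theta_0)^T \Sigma^{-1} (\bar{Y}_j^k - \theta_0) \right]^{1/2} \]
Equation 45
\[ \bar{Y}_j^k = \frac{1}{k - j + 1} \sum_{i=j}^{k} Y_i. \]
Observe that, for the current problem formulation, the data that are needed in
Equation 44 are the feature vectors $Y_k$, the covariance $\Sigma$, and the mean before the
change $\theta_0$.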
In the more realistic case where the parameter $\theta_0$ before the change is
assumed to be known but the parameter $\theta_1$ after the change is assumed to be
completely unknown, the change detection problem statement is as follows
Equation 46
\[ \theta(k) = \begin{cases} \theta_0 & \text{when } k < t_0 \\ \theta \ne \theta_0 & \text{when } k \ge t_0 \end{cases} \]
Hence, the case where nothing is known about $\theta_1$ can be considered the
limit of the previous case when $a = b = 0$.
Therefore, the GLR algorithm in Equation 41 becomes:
Equation 47
\[ t_a = \min\{ k : g_k \ge h \}, \qquad g_k = \max_{1 \le j \le k} \frac{k - j + 1}{2} \left( \chi_j^k \right)^2, \]
where $\chi_j^k$ is defined in Equation 44.
In the above study, $\theta_0$ is assumed to be known. In practice, $\theta_0$ can be
estimated using a number of feature vectors at the beginning of each
mammogram. The covariance $\Sigma$ is estimated using the same feature vectors.
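The sequential GLR detector for a known mean before the change and an unknown mean after it can be sketched as below. The sketch assumes the standard form of the decision function for this case, g_k = max over 1 <= j <= k of (k - j + 1)/2 times the squared Mahalanobis distance between the window mean of Y_j..Y_k and theta0; the function names and the direct O(n^2) search over j are illustrative choices, not the original implementation.

```python
import numpy as np

def glr_decision_function(Y, theta0, Sigma):
    """Return g_k for k = 1..n, where Y is an (n, r) array of feature vectors."""
    Y = np.asarray(Y, dtype=float)
    theta0 = np.asarray(theta0, dtype=float)
    Sinv = np.linalg.inv(np.asarray(Sigma, dtype=float))
    # cumulative sums of the centered vectors give every window mean in O(1)
    csum = np.vstack([np.zeros(Y.shape[1]), np.cumsum(Y - theta0, axis=0)])
    g = np.zeros(len(Y))
    for k in range(1, len(Y) + 1):
        best = 0.0
        for j in range(1, k + 1):
            n = k - j + 1
            d = (csum[k] - csum[j - 1]) / n          # window mean minus theta0
            best = max(best, 0.5 * n * float(d @ Sinv @ d))
        g[k - 1] = best
    return g

def first_alarm(g, h):
    """Alarm time t_a: first (1-based) index k with g_k >= h, or None if no alarm."""
    idx = np.nonzero(np.asarray(g) >= h)[0]
    return int(idx[0]) + 1 if idx.size else None
```

Because the feature vectors are ordered block by block, the index returned by `first_alarm` maps directly back to a block position in the image.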
Results
A. 2D ARMA Model
We test the proposed algorithm using the University of South Florida
digital mammography library available online at: http://marathon.csee.usf.edu.
The Digital Database for Screening Mammography (DDSM) is a resource for use
by the mammographic image analysis research community. Each mammogram
image is 256 × 256 pixels. The ARMA parameters were estimated using a
window of size 16 × 16. Hence, each image is represented by 256 ARMA
feature vectors $Y_k$. We find the optimal ARMA degree model $(p_1, p_2, q_1, q_2)$ as
the degree that minimizes the average square error between the original image
and the predicted ARMA model. An exhaustive off-line search over the candidate
degrees reveals that $p_1 = p_2 = q_1 = q_2 = 2$ leads to the
smallest average square error for most mammogram images in the DDSM
library. Figure 8 shows 2D-ARMA models of an original healthy mammogram.
B. Change Detection Algorithm
We can estimate the value of $\theta_0$ (parameter before the change) as the
sample mean of the first 10 feature vectors $Y_k$. Another approach is to estimate
the value of $\theta_0$ from the entire mammogram image. This method yields an
estimation error not higher than the relative size of the microcalcifications in the
image, i.e. about 1%. For both methods, estimation of the parameter $\theta_0$ yielded
similar values. The detection algorithm depends on the value of the threshold $h$,
which was chosen experimentally. Figure 10 displays the decision function $g_k$ of
four sample mammograms, two cancerous and two normal. The
cancerous images exhibit peaks that are, on average, twice as high as those of
healthy images. We found that a value of $h$ equal to the mean of the highest
cancerous peak and the highest normal peak achieves an optimal balance
between false and missed alarms. Figure 9 shows a plot of the average grey level
of the 16 × 16 sub-images of healthy and cancerous mammograms. It is seen
that a simple plot of the grey level values of the mammograms does not
discriminate between healthy and cancerous mammograms. However, the
proposed change detection algorithm leads to a decision function that is uniform
for healthy mammograms and spiky for cancerous mammograms, where the
spikes indicate the position of microcalcifications.
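The threshold rule described above (the mean of the highest cancerous peak and the highest normal peak) amounts to the following small computation. The function names and the representation of each decision function as a plain sequence of values are assumptions made for illustration.

```python
def choose_threshold(cancerous_curves, normal_curves):
    """Midpoint between the highest cancerous peak and the highest normal peak.

    Each argument is a list of decision functions, one sequence of g_k values
    per training mammogram.
    """
    top_cancerous = max(max(curve) for curve in cancerous_curves)
    top_normal = max(max(curve) for curve in normal_curves)
    return 0.5 * (top_cancerous + top_normal)

def is_alarm(curve, h):
    """True if the decision function reaches the threshold anywhere."""
    return any(value >= h for value in curve)
```

A threshold chosen this way depends on the training mammograms used, so it would need to be re-derived for a different image set or window size.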
By lexicographical ordering of the image and its feature vectors, we are
able to not only discriminate between normal and cancerous mammograms but
also pinpoint the exact location of microcalcifications in the cancerous image.
The peaks of the decision function can be easily traced back to the suspicious
areas. Figure 11 shows a radiologist’s marked area of suspicion, which is
successfully detected as cancerous by our algorithm. Table 3 displays the
performance of our algorithm based on 524 normal and cancerous digital
mammograms from the DDSM library. Based on this
analysis, the sensitivity and specificity of the proposed algorithm
are 96.9% and 97.8%, respectively.
Table 3 Performance of the change detection algorithm in 524 normal and
cancerous mammograms
           True   False
Positive   96%    4%
Negative   97%    3%
Figure 8 2D ARMA Modeling (a) Original (healthy) mammogram; (b) 2D ARMA[2,2,2,2]
model of (a); (c) 2D ARMA[3,3,3,3] model of (a); (d) 2D ARMA[4,4,4,4] model of
(a); (e) 2D ARMA[6,6,6,6] model of (a).
Figure 9 Change detection algorithm (a) A normal (healthy) mammogram; (b) 2D ARMA[2,2,2,2]
model of (a); (c) Plot of the average gray level of the 16x16 sub images in (a); (d) Plot of
the decision function for the image in (a); (e) original cancerous image; (f) 2D
ARMA[2,2,2,2] model of (e); (g) Plot of the average gray level of the 16x16 sub images in
(e); (h) Plot of the decision function for the image in (e).
Figure 10 The decision function for four mammograms; cancerous in red/magenta and normal
in blue/green. The value of the threshold is determined as the mean of the highest
normal peak and the highest cancerous peak.
Figure 11 Change detection algorithm: (a) radiologist-marked area of interest; (b) plot of the
decision function of the mammogram, arrows indicate the peaks above the threshold;
(c) marked 16x16 clusters that correspond to the detected peaks
Chapter 6 Automated Biofilm Region Recognition And
Morphology Quantification from Confocal Laser
Scanning Microscopy Imaging
Abstract
Staphylococcus aureus is an opportunistic human pathogen and a primary
cause of nosocomial infections. Its biofilm-forming capability, an adaptation
strategy used by many species of bacteria to overcome stressful environmental
conditions, provides both resistance to antimicrobial treatments and
protection from the host immune system. This chapter addresses a growing
demand for an objective, fully automated method of biofilm structure description
with standardized parameters that are independent of user input. In this study,
we used watershed segmentation to analyze and compare confocal laser
scanning microscopy (CLSM) images of two S. aureus strains with different
biofilm-forming capabilities. Results are compared with manual calculations as
well as the commonly used COMSTAT software.
Introduction
Staphylococcus aureus is an opportunistic human pathogen responsible
for a wide range of diseases that vary in clinical presentation and severity. S.
aureus can cause diseases ranging from minor skin infections to life-threatening
conditions such as pneumonia, osteomyelitis and toxic shock syndrome. Recent
significant changes in health care delivery and antimicrobial resistance patterns
have caused a shift in the epidemiology of S. aureus. This shift is
evidenced by a dramatic increase in methicillin-resistant S. aureus (MRSA)
infection rates, which, at least in the United States, has led MRSA mortality rates
to exceed those of HIV [45]. The public health concern caused by S.
aureus-related infections has led to extensive efforts put into improving the
efficacy of available therapies as well as introducing new pharmaceuticals. Both
strategies are challenged by the fact that S. aureus infections are associated with
formation of a biofilm, which limits the efficacy of therapy by creating a resistance
to antimicrobials and by protecting the bacteria from the host immune system. In
order to conduct studies on targeting biofilms therapeutically, it is necessary to
be able to quantitatively measure biofilm morphological characteristics like area,
biomass and thickness. In this chapter, we consider a clinical isolate (UAMS-1),
which forms robust, dense and uniformly distributed biofilm as well as its isogenic
variant carrying a mutation in the sarA gene, which limits its ability to form a biofilm.
For the assessment of biofilm structure, CLSM has been described as an
ideal technique [46]. By using several fluorescent stains or conjugated
antibodies in combination with multichannel 3D CLSM, the location of different
biofilm constituents can be recorded. Using these data sets, the
three-dimensional architecture of the biofilms can be reconstructed and quantified
with digital image analysis. There is a wide range of available
software capable of analyzing biofilm morphology, including COMSTAT, ImageJ,
ISA3D and Volocity. However, they all rely on thresholding to segment the
biofilm. Specifically, the automated segmentation procedure is implemented in
two steps: (i) thresholding using user-dependent parameters [47] [48], followed
by (ii) connected volume filtration [49]. The purpose of this work is to create a
fully automated method of biofilm segmentation and quantification that does not
rely on user input or thresholding.
Quantification of biofilm structure
Quantitative parameters describing the biofilm physical structure have
been extracted from three-dimensional confocal laser scanning microscopy
images and used to compare different biofilm structures. Quantitative descriptive
parameters of biofilm chosen for this study are: (i) area occupied by biomass in
each cross section, (ii) biomass in biovolume, (iii) thickness distribution and (iv)
average thickness.
Morphology quantification parameters
The following parameters are used to describe the biofilm structure:
• area occupied by biomass in each cross-section, $A_z$: measured as the
total sum of all the unit areas (pixels) of each CLSM cross-section
categorized as occupied area.
Equation 48
\[ A_z = \oint_{C_z} dA = \sum_{i} c_{z,i} \]
where:
o $A_z$: occupied area in cross-section $z$,
o $C_z$: closed contour of occupied area,
o $c_{z,i}$: cell of a cross-section recognized as occupied area
• biomass in biovolume, V: measured from numeric integration of the area
of microbial colonization profiles, following a method previously described
in [50]
Equation 49
\[ V = \int_{0}^{N \Delta z} \Big[ \sum_{i} c_{z,i} \Big] \, dz \]
where:
o $N$: number of horizontal cross-sections,
o $\Delta z$: z-step in the image stack.
• thickness distribution: the number of occupied clusters in each cross-
section over the total number of clusters in a cross-section of the CLSM
3D image.
• average thickness: calculated as the average height, measured in the z
direction from the solid substratum across the cross-sections, of all clusters
of the biofilm.
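The four parameters above can be sketched on an already segmented stack as follows. This is an illustration under stated assumptions, not the original Matlab code: the stack is a boolean array ordered (z, y, x) with z = 0 at the substratum, the volume integral is approximated by a rectangle rule, each (y, x) column is treated as one "cluster", and the average thickness is taken over occupied columns only. The function name and arguments are hypothetical.

```python
import numpy as np

def biofilm_metrics(mask, pixel_area=1.0, dz=1.0):
    """Basic morphology parameters of a segmented CLSM stack.

    mask : boolean array (z, y, x), True where a voxel is classified as biomass.
    Returns (area per cross-section, biovolume, occupied fraction per
    cross-section, average thickness).
    """
    mask = np.asarray(mask, dtype=bool)
    nz = mask.shape[0]
    # occupied area in each cross-section: count of occupied pixels times unit area
    area = mask.sum(axis=(1, 2)) * pixel_area
    # biovolume: rectangle-rule integration of the area profile along z
    biovolume = float(area.sum() * dz)
    # thickness distribution: occupied fraction of each cross-section
    coverage = mask.mean(axis=(1, 2))
    # height of each column: index of its highest occupied slice, counted from 1
    top = np.where(mask.any(axis=0), nz - np.argmax(mask[::-1], axis=0), 0)
    heights = top * dz
    occupied = heights > 0
    mean_thickness = float(heights[occupied].mean()) if occupied.any() else 0.0
    return area, biovolume, coverage, mean_thickness
```

Passing the physical pixel area and z-step of the microscope settings converts the results from pixel units to physical units.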
Based on the four aforementioned “basic” parameters, other
characteristics of the biofilm can be calculated. For example, after the biomass is
segmented from the background, a number of features including roughness of
the film, porosity, thickness, etc. can be obtained. Those parameters can be used
together to uniquely describe the biofilm structure and to eventually differentiate
between biofilms formed by different strains.
Image processing tool
The suite of image processing operations was implemented
in the Matlab programming environment (Matlab 2010a, The MathWorks,
Inc.). Matlab was chosen for its convenient support for matrix computation. To
evaluate our results, we used manually calculated data as a baseline
and compared them against COMSTAT, a widely used Matlab program for biofilm
analysis. Our approach uses the watershed segmentation method based on Fernand
Meyer’s algorithm [51].
Preprocessing and methodology
Segmentation is one of the most difficult image processing operations.
The biofilm segmentation task is even harder because the biofilm is a
disconnected structure. This difficulty may explain the use of simple thresholding
in widely adopted biofilm analysis systems such as COMSTAT. Nonetheless,
after trying several segmentation algorithms, it became apparent that the
watershed transformation provides the most accurate segmentation of the biofilm
structure. The watershed transformation finds “catchment basins” and
“watershed ridge lines” in an image by treating it as a surface where light pixels
are high (area of interest) and dark pixels are low (background). Segmentation
using the watershed transformation works best if one can identify, or “mark,”
foreground objects and background locations. This marking process is done
automatically with reference to the black background on the CLSM image.
Marker-controlled watershed segmentation follows this basic procedure:
1. Compute a segmentation function. This is an image whose dark regions
are the objects to be segmented.
2. Compute foreground markers. These are connected groups of pixels
within each of the objects.
3. Compute background markers, using the gradient magnitude as the
segmentation function. These are pixels that are not part of any object.
4. Modify the segmentation function so that it only has minima at the
foreground and background marker locations.
5. Compute the watershed transform of the modified segmentation function.
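The final flooding step of this procedure can be sketched in Python as
follows. This is a minimal illustration of priority-queue flooding in the
style of Meyer's algorithm, not the Matlab implementation used in this work.
It assumes that the segmentation function (called relief here) and the
integer marker image from steps 1 to 3 are already available as 2D nested
lists, and it floods directly from the markers instead of modifying the
minima as in step 4. The function name marker_watershed and the
4-connectivity are assumptions.

```python
import heapq
from typing import List


def marker_watershed(relief: List[List[float]],
                     markers: List[List[int]]) -> List[List[int]]:
    """Flood `relief` from labelled markers; 0 in the result marks ridge lines."""
    rows, cols = len(relief), len(relief[0])
    labels = [row[:] for row in markers]
    queued = [[m != 0 for m in row] for row in markers]
    heap = []
    counter = 0  # tie-breaker: equal grey levels are flooded first-in, first-out

    def neighbours(r, c):
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:
                yield rr, cc

    def push_unqueued_neighbours(r, c):
        nonlocal counter
        for rr, cc in neighbours(r, c):
            if not queued[rr][cc]:
                queued[rr][cc] = True
                heapq.heappush(heap, (relief[rr][cc], counter, rr, cc))
                counter += 1

    # Seed the queue with the pixels that border a marker.
    for r in range(rows):
        for c in range(cols):
            if markers[r][c] != 0:
                push_unqueued_neighbours(r, c)

    # Always flood the lowest pending pixel next.
    while heap:
        _, _, r, c = heapq.heappop(heap)
        found = {labels[rr][cc] for rr, cc in neighbours(r, c) if labels[rr][cc] > 0}
        if len(found) == 1:
            labels[r][c] = found.pop()
            push_unqueued_neighbours(r, c)
        # Otherwise the pixel touches two basins and stays 0 (a ridge line).
    return labels


# One-row toy relief with a peak in the middle and one marker at each end.
relief = [[0, 1, 2, 3, 2, 1, 0]]
markers = [[1, 0, 0, 0, 0, 0, 2]]
print(marker_watershed(relief, markers))  # -> [[1, 1, 1, 0, 2, 2, 2]]
```

In the toy example the two basins grow from their markers until they meet at
the peak, which is left unlabelled as the watershed ridge line. In the setting
described above, one label would be the background marker and the other labels
the foreground objects.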
Growth and CLSM of static biofilm
Wells of Costar 3596 plates (Corning Life Sciences, Acton, MA) were coated
overnight at 4 °C with 20% human plasma (Sigma) in bicarbonate buffer.