3D DIGITAL IMAGE PROCESSING FOR BIOFILM
QUANTIFICATION FROM CONFOCAL LASER
SCANNING MICROSCOPY
MULTIDIMENSIONAL STATISTICAL ANALYSIS OF BIOFILM
MODELING
A Thesis Submitted
to the Graduate School
University of Arkansas at Little Rock
in partial fulfillment of requirements
for the degree of
DOCTOR OF PHILOSOPHY
In Applied Science
in the Department of Applied Science
Engineering Science and Systems
August 2012
Jerzy S. Zielinski
M.S. from Warsaw University of Technology, Warsaw Poland, 2006
B.S. from Warsaw University of Technology, Warsaw Poland, 2002
© Copyright by
Jerzy S. Zielinski
2012
This dissertation, “3D Digital Image Processing for Biofilm Quantification from
Confocal Laser Scanning Microscopy”, by Jerzy S. Zielinski, is approved by:
Dissertation Advisor: ____________________________________
Nidhal Bouaynaya
Assistant Professor of Systems Engineering
Dissertation Committee: ____________________________________
Seshadri Mohan
Professor of Systems Engineering
____________________________________
Yu-Po Chan
Professor of Systems Engineering
____________________________________
Hussain M. Al-Rizzo
Associate Professor of Systems Engineering
____________________________________
Craig Cooney
Investigator, Veterans Affairs Medical Center
Program Coordinator: ____________________________________
Tansel Karabacak
Assistant Professor of Applied Science
Graduate Dean: ____________________________________
Patrick J. Pellicane
Professor of Construction Management
Fair Use
This thesis is protected by the Copyright Laws of the United States (Public Law
94-553, revised in 1976). Consistent with fair use as defined in the Copyright
Laws, brief quotations from this material are allowed with proper
acknowledgment. Use of this material for financial gain without the author’s
express written permission is not allowed.
Duplication
I authorize the Head of Interlibrary Loan or the Head of Archives at the
Ottenheimer Library at the University of Arkansas at Little Rock to arrange for
duplication of this thesis for educational or scholarly purposes when so requested
by a library user. The duplication will be at the user’s expense.
Signature _____________________________________________________
3D DIGITAL IMAGE PROCESSING FOR BIOFILM QUANTIFICATION FROM
CONFOCAL LASER SCANNING MICROSCOPY, MULTIDIMENSIONAL
STATISTICAL ANALYSIS OF BIOFILM MODELING, by Jerzy S. Zielinski,
August 2012
Abstract
The dramatic increase in the number and volume of digital images produced
in medical diagnostics, the escalating demand for rapid access to these
medical data, and the need for their interpretation and retrieval have
become of paramount importance to a modern healthcare system. Therefore,
there is an ever-growing need for processed, interpreted, and saved images of
various types. Due to the high cost and unreliability of human-dependent image
analysis, it is necessary to develop automated methods for feature extraction
using sophisticated mathematical algorithms and reasoning.
This work focuses on digital signal processing of biological and
biomedical data in one-, two-, and three-dimensional space. The methods and
algorithms presented in this work were used to analyze data from genomic
sequences, breast cancer images, and biofilm images. One-dimensional analysis was
applied to DNA sequences, which were treated as non-stationary sequences
and modeled by a time-dependent autoregressive moving average (TD-ARMA)
model. Two-dimensional analysis used a 2D-ARMA model to detect
breast cancer in x-ray mammograms and ultrasound images. Three-dimensional
detection and classification techniques were applied to biofilm images acquired
using confocal laser scanning microscopy.
Modern medical images are geometrically arranged arrays of data. The
broadening scope of imaging as a way to organize our observations of the
biophysical world has led to a dramatic increase in our ability to apply new
processing techniques and to combine multiple channels of data into
sophisticated and complex mathematical models of physiological function and
dysfunction. With the explosion in the amount of data produced in the field of
biomedicine, it is crucial to be able to construct accurate mathematical models of
the data at hand. The two main purposes of signal modeling are data size
conservation and parameter extraction. Specifically, this work addresses
four key problems in biomedical imaging: (i) registration, i.e.
automated methods of data acquisition and the ability to align multiple data sets
with each other; (ii) visualization and reconstruction, i.e. the environment in which
registered data sets can be displayed on a plane or in multidimensional space;
(iii) segmentation, i.e. automated and semi-automated methods to create models
of relevant anatomy from images; (iv) simulation and prediction, i.e. techniques
that can be used to simulate the growth and evolution of the phenomenon under study.
Mathematical models can be used not only to verify experimental findings but
also to make qualitative and quantitative predictions that might serve as
guidelines for the future development of technology and/or treatment.
To
My Wife and Parents
Acknowledgements
I would like to thank my advisor and mentor, Dr. Nidhal Bouaynaya, for her
guidance, illuminating discussions related to this work and beyond,
encouragement, and moral and financial support in this research. I would also like
to extend my gratitude to Dr. Seshadri Mohan, Dr. Yu-Po Chan, Dr. Hussain M.
Al-Rizzo, and Dr. Craig Cooney for being part of my committee and for their
insights and interest in my work.
I am extremely grateful to my family, who have been my greatest support.
This accomplishment is not mine alone. Thank you for sharing my struggles and
my victories. Thank you to my friends and colleagues for sharing my pauses and
supporting me during my ups and downs.
Acknowledgements PL
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
    Problem Statement
    Research Objectives
    Motivation
    Research Contributions
    Organization
Chapter 2 Biology Background
    Genomics
    Breast Cancer
    Biofilm
Chapter 3 Time-Dependent ARMA Modeling of Genomic Sequences
    Abstract
    Background
    Methods
    Mean-squares estimation
    Least-squares estimation
    Index of randomness
    Results
Chapter 4 Two-Dimensional ARMA Modeling for Breast Cancer Detection and Classification
    Abstract
    Introduction
    2D-ARMA Modeling
    Yule-Walker Least-Squares Parameter Estimation
    Tumor detection and classification
    Simulations
Chapter 5 Statistical Sequential Analysis for Detection of Microcalcifications in Digital Mammograms
    Abstract
    Introduction
    2D-ARMA representation
    Change detection algorithm
    Results
    A. 2D ARMA Model
    B. Change Detection Algorithm
Chapter 6 Automated Biofilm Region Recognition And Morphology Quantification from Confocal Laser Scanning Microscopy Imaging
    Abstract
    Introduction
    Quantification of biofilm structure
    Morphology quantification parameters
    Image processing tool
    Preprocessing and used methodology
    Growth and CLSM of static biofilm
    Results
Chapter 7 Three Dimensional Morphology Quantification of Biofilm Structures from Confocal Laser Scanning Microscopy Images
    Abstract
    Introduction
    Average Run Length
    Aspect Ratio
    Average and Maximum Diffusion Distance
    Biomass
    Average Thickness
    Application to CLSM Biofilm Images
    Biofilm culture preparation and image acquisition
    Segmentation and parameter quantification results
Conclusion and Recommendation
References
List of Tables
Table 1 Index of randomness of the Coding and Non-Coding segments of Various
Gene Sequences
Table 2 Classification accuracy of cancerous and benign tumors
Table 3 Performance of the change detection algorithm in 524 normal and
cancerous mammograms
Table 4 UAMS-1 sarA- results of biomass and average thickness calculations using
watershed algorithm and COMSTAT software
Table 5 UAMS-1 results of biomass and average thickness calculations using
watershed algorithm and COMSTAT software
Table 6 Average error calculated from manual calculations across all layers in
confocal imaging with use of Watershed algorithm and COMSTAT software
Table 7 Results of biofilm parameter quantification for stack 1 for 3D and 2D
segmentations in comparison with the ground truth
Table 8 Results of biofilm parameter quantification for stack 2 for 3D and 2D
segmentations in comparison with the ground truth
Table 9 Results of biofilm parameter quantification for stack 3 for 3D and 2D
segmentations in comparison with the ground truth
List of Figures
Figure 1 Three hypotheses explaining the formation of intrinsic levels of antibiotic
resistance. Adapted from Stewart et al., 2001
Figure 2 Gene Structure. Gene structure of the Human gene 276 located in
chromosome I: the boxes correspond to the exons (coding regions), and the lines
between them represent the introns (non-coding regions). The total length of
the gene is N=8208 bases, including 1536 coding and 6672 non-coding
bases.
Figure 3 DNA Walk. DNA walk of the Human gene 276
Figure 4 TD-ARMA modeling. TD-ARMA modeling of the Human gene 276: The blue
signal is the DNA walk, and the red signal is the TD-ARMA(1,1) fitted signal.
The TD-ARMA(1,1) model accurately fits the genomic signal.
Figure 5 TD-ARMA coefficients estimation. Estimation of the TD-ARMA(1,1)
coefficients of the Human gene 276. The TD-ARMA(1,1) model is given by
. The blue and black (resp. red
and green) curves depict the time series (resp ) for the coding and
non-coding regions of the gene, respectively.
Figure 6 Curve of randomness. The curves of randomness of the coding and
non-coding regions of the Human gene 276 are shown in blue and red,
respectively. The index of randomness of the coding sequence is equal to
1.0603, whereas its corresponding value for the non-coding sequence is
equal to 1.0627.
Figure 7 ARMA modeling and segmentation of ultrasound breast images: (a)
cancerous ultrasound image; (b) 2D-ARMA[2,2,2,2] representation of (a); (c)
segmentation of (b) using an appropriate threshold; (d) 1D-ARMA[2,2]
representation of (a); (e) benign tumor ultrasound image; (f)
2D-ARMA[2,2,2,2] representation of (e); (g) segmentation of (f) using an
appropriate threshold; (h) 1D-ARMA[2,2] representation of (e).
Figure 8 2D ARMA Modeling: (a) Original (healthy) mammogram; (b) 2D
ARMA[2,2,2,2] model of (a); (c) 2D ARMA[3,3,3,3] model of (a); (d) 2D
ARMA[4,4,4,4] model of (a); (e) 2D ARMA[6,6,6,6] model of (a).
Figure 9 Change detection algorithm: (a) A normal (healthy) mammogram; (b) 2D
ARMA[2,2,2,2] model of (a); (c) Plot of the average gray level of the 16x16
sub-images in (a); (d) Plot of the decision function for the image in (a); (e)
original cancerous image; (f) 2D ARMA[2,2,2,2] model of (e); (g) Plot of the
average gray level of the 16x16 sub-images in (e); (h) Plot of the decision
function for the image in (e).
Figure 10 The decision function for four mammograms; cancerous in red/magenta
and normal in blue/green. The value of the threshold is determined as the
mean of the highest normal peak and the highest cancerous peak.
Figure 11 Change detection algorithm: (a) radiologist-marked area of interest; (b) plot of
the decision function of the mammogram, arrows indicate the peaks above
the threshold; (c) marked 16x16 clusters that correspond to the detected
peaks.
Figure 12 Confocal images and their segmentations: A-D: images from the UAMS-1
sarA- mutant (section 1), their respective segmentations with watershed
(section 2), COMSTAT analysis (section 3)
Figure 13 Confocal images and their segmentations: E-H: images from the UAMS-1
(section 1), their respective segmentations with watershed algorithm (section
2), COMSTAT analysis (section 3)
Figure 14 Mean Square Error of 2D gradient-based segmentation (black), 2D
watershed segmentation (green) and 3D gradient-based segmentation (red)
applied to CLSM z-stack of images
Figure 15 Combined Mean Square Error of 2D gradient-based (black), 2D watershed
(green) and 3D gradient-based segmentations
Figure 16 2D and 3D gradient-based segmentation: Column 1: the original CLSM
images; Column 2: 3D gradient-based segmentation; Column 3: 2D
gradient-based segmentation; Column 4: 2D watershed segmentation
Chapter 1 Introduction
Biological and biomedical signals are acquired by a range of techniques
across all biological scales, which go far beyond the visible-light photographs and
microscope images of the early 20th century. Techniques in use today include
confocal scanning microscopy, x-ray microscopy, and electron microscopy, with
extensive use of Digital Signal Processing (DSP) techniques and reconstruction
algorithms in two, three, and higher dimensions.
Modern medical images are geometrically arranged arrays of data samples. The
broadening scope of imaging as a way to organize our observations of the
biophysical world has led to a dramatic increase in our ability to apply new
processing techniques and to combine multiple channels of data into
sophisticated and complex mathematical models of physiological function and
dysfunction. With the explosion in the amount of data produced in the field of
biomedicine, it is crucial to be able to construct accurate mathematical models of
the data at hand. The two main purposes of signal modeling are data size
conservation and parameter extraction.
Problem Statement
Over the past century, the fields of microbiology and biomedicine have
undergone a revolution. We have moved from the microscope to computerized, highly
sophisticated acquisition methods that can scan the surrounding
environment with accuracy and precision. The amount of data
produced in the scanning process is enormous and has become very difficult
for a human being to analyze efficiently.
The main problem facing scientists who work with large data pools
is translating digital information into meaningful data that physicians and
biologists can use in their studies. Among the many tasks involved, the most
important are:
• Feature extraction, which is a spatial form of dimensionality reduction. It
is used for images that are either too large to process or
redundant in nature. In both cases the input data can be transformed
into a reduced representation given by a set of features (also called a feature
vector).
• Segmentation, which simplifies and/or changes the
representation of an image and then divides it into more meaningful
subsections.
• Development of accurate and robust Computer Aided Diagnostic (CAD)
systems for biomedical applications, which can deliver faster and more
accurate results in biomedical imaging with minimal or no
human involvement.
Research Objectives
The goal of this work is to investigate Digital Signal Processing
approaches to the analysis of biological and biomedical signals, and specifically to
develop methods for parameter extraction that biologists and physicians
can use in the development of new patient treatment techniques.
The ultimate goal is to develop accurate and robust models for Computer Aided
Diagnostic systems in two areas: detection of microcalcifications and cancerous
tissue in the breast from X-ray mammograms and ultrasound images, and
segmentation of biofilms of Staphylococcus aureus colonies from Confocal Laser
Scanning Microscopy (CLSM) images.
The goal of this research is realized through the following objectives:
1. Development of a non-stationary modeling technique for DNA data
2. Development of a 2D-ARMA technique for image signal modeling
3. Use of the extracted features of the ARMA model for biological and clinical
classification of microcalcifications in X-ray breast mammography and
ultrasound imaging
4. Formulation of the problem of microcalcification detection as a change
detection hypothesis testing problem
5. Development of an accurate and robust segmentation technique for Confocal
Laser Scanning Microscopy, leading to more accurate quantification of
biofilm
6. Development of three-dimensional (3D) segmentation techniques that
take into account the spatial relationships between pixels
7. Development of a fully automated method of quantifying staphylococcal
biofilm images acquired using confocal laser scanning microscopy (CLSM), with
standardized parameters that are independent of user input
Motivation
Breast cancer continues to be a significant public health problem in the
United States: it is the second leading cause of female mortality, and,
disturbingly, one out of eight women in the United States will be diagnosed with
breast cancer in her lifetime. Until the cause of this disease is fully understood,
early detection remains the only hope to improve breast cancer prognosis and
treatment. Breast cancer screening modalities are mainly based on clinical
examination, mammography, ultrasound imaging, magnetic resonance imaging
(MRI), and core biopsy. Mammography (breast x-ray imaging) is by far the
fastest and cheapest screening test for breast cancer. Unfortunately, it is also
among the most difficult of radiological images to interpret: mammograms are of
low contrast, and features indicative of breast disease are often very small.
Staphylococcus aureus is an opportunistic human pathogen responsible
for a wide range of diseases that vary in clinical presentation and severity. S.
aureus can cause infections ranging from minor skin lesions to life-threatening
conditions such as pneumonia, osteomyelitis and toxic shock syndrome. Recent
significant changes in health care delivery and antimicrobial resistance patterns
have caused a shift in the epidemiology of S. aureus. A recent report estimated
the number of invasive infections caused by methicillin-resistant S. aureus
(MRSA) in 2005 in the United States alone at 94,360, with 18,650 of these
resulting in a fatal outcome. This means that S. aureus infection has now surpassed
AIDS as a cause of death in the United States. Based on such considerations, S.
aureus is arguably a greater clinical concern now than at any time since the
pre-antibiotic era.
Research Contributions
This work contributes to the field of computational bioinformatics and
biology through the application of information theory and communication theory
to the study and analysis of genetic sequences. Further, this work presents novel
techniques for the detection and recognition of breast cancer in its early stages of
development, as well as for the classification of developed tumors from ultrasound and
x-ray mammography images. This work also describes novel methods
of parameter extraction from Confocal Laser Scanning Microscopy (CLSM)
images of biofilm formation in bacterial development. All of the described techniques
contribute semi-automated and fully automated tools that scientists and
physicians can use to develop treatments for diseases with high mortality in
developed countries.
Specific contributions of this work to the field of biomedical science include:
1. Derivation of a non-stationary TD-ARMA model, for which novel
estimation algorithms for the TD-ARMA parameters were introduced,
resulting in a more robust and precise model of
non-stationary signals
2. Formulation of TD-ARMA modeling as a non-stationary model of
genomic sequences, which resulted in better understanding and better
recognition of coding and non-coding regions of DNA sequences of highly
evolved organisms
3. Use of the Wold decomposition theorem, applied to ultrasound breast
images treated as random fields, for two-dimensional ARMA modeling
4. Derivation of a 2D-ARMA model that takes into account the
distributed correlation between points in space and/or time, which resulted in
better understanding of higher-dimensional modeling techniques
5. Formulation of the problem of microcalcification detection in digital
breast images as a statistical change detection problem in the local
properties of the image. The solution not only detected the
presence of microcalcifications but also gave an accurate estimate of their
location within the breast
6. Derivation of a Computer Aided Diagnostic system for the automatic
quantification of biologically relevant parameters of biofilms formed by
Staphylococcus aureus colonies, which were subsequently
used in microbiology studies of the virulence and treatment resistance of
certain strains of bacteria
Organization
This thesis is organized as follows:
In Chapter 2, we provide a cursory overview of the biology background,
which includes a brief introduction to genomics with emphasis on the structure of DNA.
Next, we describe the breast cancer types that are most common among the female
population in developed countries. Last, we describe biofilm
genesis and the different areas where mathematical biofilm modeling can be used.
Grasping the essence of the biological inspirations of this work is crucial to
understanding the motivation, assumptions and theoretical results of this work.
In Chapter 3 we model non-stationary genomic sequences by a
time-dependent autoregressive moving average (TD-ARMA) model. By expressing the
time-dependent coefficients as linear combinations of parametric basis functions,
we transform a linear non-stationary problem into a linear
time-invariant problem. We then propose three methods to estimate the
time-dependent coefficients: mean-square, least-squares, and recursive
least-squares algorithms. Based on the estimated TD-ARMA coefficients, we define
an index of randomness that quantifies the degree of randomness of both coding
and non-coding sequences.
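As an illustration of the basis-expansion idea, the following minimal Python sketch fits a time-dependent AR(1) model, which is a deliberate simplification of the TD-ARMA model of Chapter 3. The polynomial basis f_k(n) = (n/N)^k, the function names, and the synthetic data are assumptions made for this example only. Because the unknown constants c_k enter the model linearly, the time-varying coefficient a[n] = sum_k c_k f_k(n) can be estimated with ordinary least squares.

```python
import random

def basis(n, N, K):
    """Polynomial basis f_k(n) = (n/N)**k, k = 0..K-1 (an assumed choice)."""
    t = n / N
    return [t ** k for k in range(K)]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, m):
            f = M[r][col] / M[col][col]
            for c in range(col, m + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * m
    for r in range(m - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, m))
        x[r] = (M[r][m] - s) / M[r][r]
    return x

def fit_td_ar1(x, K):
    """Least-squares estimate of the constants c_k in
    x[n] = (sum_k c_k f_k(n)) * x[n-1] + e[n]."""
    N = len(x)
    G = [[0.0] * K for _ in range(K)]   # normal-equation matrix
    h = [0.0] * K                       # normal-equation right-hand side
    for n in range(1, N):
        # Time-invariant regressors: basis functions times the past sample.
        phi = [f * x[n - 1] for f in basis(n, N, K)]
        for i in range(K):
            h[i] += phi[i] * x[n]
            for j in range(K):
                G[i][j] += phi[i] * phi[j]
    return solve(G, h)

# Synthetic non-stationary signal with a[n] = 0.2 + 0.6 * (n / N).
random.seed(0)
N = 4000
true_c = [0.2, 0.6]
x = [1.0]
for n in range(1, N):
    a_n = sum(c * f for c, f in zip(true_c, basis(n, N, 2)))
    x.append(a_n * x[-1] + random.gauss(0.0, 1.0))

est_c = fit_td_ar1(x, 2)
print("true:", true_c, "estimated:", est_c)
```

The estimated constants should lie near the true ones, with an error that shrinks as the sequence gets longer. A moving-average part and the recursive least-squares variant are left out to keep the sketch short.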
In Chapter 4 we propose to exploit the high spatial correlation inherent in
neighboring pixels to improve tumor detection and classification in ultrasound
breast images. We achieve this goal by using a two-dimensional autoregressive
moving average (ARMA) field model of the image. Current techniques often rely
on one-dimensional representations of the image in terms of its scan lines in
order to process it as a one-dimensional time series [5], [6]. Such one-
dimensional projections are advocated solely on the basis of the simplicity of
their mathematical formulations. The analysis of two-dimensional fields is more
involved mathematically and computationally than the study of one-dimensional
time-series. In this work, we derive an efficient two-stage algorithm to estimate
the parameters of the two-dimensional ARMA field model of the breast image.
The estimated ARMA parameters are excellent discriminative features, which are
used as the basis for statistical detection and classification of tumors in the
breast image.
In Chapter 5, we introduce a new approach to the problem of malignancy
detection in digital mammograms using statistical sequential analysis theory. The
statistical approach inherently takes into account the noise in the image (from the
imaging device and the digitization) and the great variety of healthy and
cancerous mammograms by considering an underlying probability distribution of
the image characteristics. For increased efficiency, the dimensionality of the
original images is reduced using 2D-ARMA modeling, which is shown to
accurately represent mammograms. The change detection algorithm is applied to
the low-dimensional 2D-ARMA feature vectors rather than to the pixels of the raw
image.
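To make the notion of a sequential decision function concrete, the sketch below implements a one-sided CUSUM test on a scalar feature sequence. CUSUM is a standard sequential change detection rule, used here only as an assumed stand-in: the Gaussian mean-shift model, the parameter values, and the feature values are hypothetical and are not taken from Chapter 5.

```python
def cusum(samples, mu0, mu1, sigma, threshold):
    """One-sided CUSUM for a Gaussian mean shift from mu0 to mu1.

    Returns the index of the first alarm (or None) and the decision
    function value after every sample.
    """
    g = 0.0
    decision = []
    alarm = None
    for k, y in enumerate(samples):
        # Log-likelihood ratio of "after change" versus "before change".
        s = (mu1 - mu0) / sigma ** 2 * (y - (mu0 + mu1) / 2.0)
        # Accumulate evidence, resetting to zero when it turns negative.
        g = max(0.0, g + s)
        decision.append(g)
        if alarm is None and g > threshold:
            alarm = k
    return alarm, decision

# Hypothetical block features: the mean shifts from about 0 to about 1
# starting at index 5.
features = [0.1, -0.2, 0.0, 0.15, -0.1, 1.1, 0.9, 1.2, 1.0, 0.95]
alarm, g = cusum(features, mu0=0.0, mu1=1.0, sigma=0.5, threshold=5.0)
print(alarm)  # 7
```

The decision function stays at zero while the features follow the pre-change model and rises once the shifted samples arrive, so peaks above the threshold localize the change.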
In Chapter 6 we propose an algorithm that efficiently segments and quantifies
images without relying on manual threshold selection. The average error of the
watershed-based algorithm, measured against manual
analysis, was comparable to that obtained with the COMSTAT software.
In Chapter 7 we show the importance of 3D analysis of biofilm structures,
which yields more accurate morphological parameter quantification for clinical
and biological assessment of the biofilm than sophisticated 2D-based analysis
like the watershed segmentation. Two-dimensional analysis of the biofilm
morphology treats the CLSM images independently from each other, whereas
the 3D analysis takes into account the temporal and spatial correlation between
stacked images.
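The difference between slice-by-slice and volumetric analysis can be sketched with a toy gradient computation. The code below is an assumption-laden illustration, not the segmentation method of Chapter 7: it computes a central-difference gradient magnitude on a small synthetic stack, once using only in-plane neighbors (2D) and once also using neighbors in adjacent slices (3D).

```python
import math

def central_diff(vol, z, y, x, axis):
    """Central difference along one axis; indices are clamped at the borders."""
    size = (len(vol), len(vol[0]), len(vol[0][0]))
    lo, hi = [z, y, x], [z, y, x]
    lo[axis] = max(lo[axis] - 1, 0)
    hi[axis] = min(hi[axis] + 1, size[axis] - 1)
    return (vol[hi[0]][hi[1]][hi[2]] - vol[lo[0]][lo[1]][lo[2]]) / 2.0

def gradient_magnitude(vol, use_z):
    """Gradient magnitude per voxel; use_z=False ignores neighboring slices."""
    axes = (0, 1, 2) if use_z else (1, 2)
    return [[[math.sqrt(sum(central_diff(vol, z, y, x, a) ** 2 for a in axes))
              for x in range(len(vol[0][0]))]
             for y in range(len(vol[0]))]
            for z in range(len(vol))]

# Synthetic z-stack: each slice is uniform, but intensity jumps between
# the second and third slices.
stack = [[[0.0] * 3 for _ in range(3)],
         [[0.0] * 3 for _ in range(3)],
         [[10.0] * 3 for _ in range(3)]]

g2 = gradient_magnitude(stack, use_z=False)
g3 = gradient_magnitude(stack, use_z=True)
print(g2[1][1][1], g3[1][1][1])  # 0.0 5.0
```

In this example the boundary lies entirely between slices, so the 2D gradient is zero everywhere while the 3D gradient responds to it.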
Chapter 2 Biology Background
Biomedical signals are observations of physiological activities that can be
obtained from a biological system. This diverse group of signals may range from
observations on a molecular level such as gene and protein sequences to
macroscopic images of organs. The processing of those biomedical signals aims
at extracting only significant information from an often overwhelming amount of
data. What constitutes the information of interest depends on the specific
application. Therefore, the purpose of signal processing is to selectively eliminate
irrelevant information from the signal so as to make the information of interest more
easily accessible to a human observer or a computer system.
In the past, the primary application of signal processing was to filter
signals and remove noise. Background noise, arising from either the imprecision of
instruments or the biological systems themselves, was eliminated using primarily two
methods. In the first, noise cancellation was achieved by analyzing
the signal spectra and suppressing the undesired frequency components. In the
second, the data were treated as random signals, and statistical
characterizations of the signal were used to extract the desired components, e.g.
Wiener filtering or Kalman filtering.
Since the introduction of new technologies and instruments, the
applications for biomedical signal processing have expanded well beyond just the
removal of background noise. Segmentation, the process of partitioning a digital
image to locate objects and boundaries, is extensively used in the analysis of
medical images, including organ structure quantification and detection of local
abnormalities such as tumors. Another signal processing technique, motion
tracking, is widely used in molecular biology for visualizing dynamics in living
cells. The same method can be used to track the distribution and growth of live
cells tagged with fluorescent probes in biofilms or other biological formations.
Sequence analysis is yet another processing method that was born with the
invention of automatic DNA sequencing. It allowed scientists to create genetic
maps based on the short DNA fragments analyzed by DNA sequencers. Finally,
one of the most important applications of signal processing is pattern
classification, often dependent on segmentation. Extensively used by clinicians
and biologists, classification helps to automatically distinguish pathological
formations from the normal background. Although experts often outperform
automated methods in pattern classification, classification of
biomedical signals by humans faces several difficulties. It requires extensive
knowledge combined with experience, it is labor intensive and time consuming, it
is challenging when the signal characteristics are not very
prominent, and it is sensitive to human error and bias. Automated
methods for classification and segmentation in signal processing may therefore overcome
these limitations and assist in screening large databases, advancing the
technology.
Genomics
There are two types of nucleic acid that are of key importance in cells:
deoxyribonucleic acid (DNA) and ribonucleic acid (RNA). DNA is found as a
double strand. The backbone of the molecule is composed of deoxyribose sugars
linked by phosphate groups in a repeating polymer chain. Each sugar is linked to
a molecule known as a base. In DNA, there are four types of base, called
adenine, thymine, guanine and cytosine, usually referred to simply as A, T, G and
C. The two distinct ends of a DNA sequence are known as the 5'
end and the 3' end. The fundamental building block of a nucleic acid is called a
nucleotide: this is the unit of one base plus one sugar plus one phosphate. We
usually think of the length of a nucleic acid sequence as the number of
nucleotides in the chain. The two strands of the DNA molecule are held together
by hydrogen bonding between A and T and between C and G. The two strands
run in opposite directions and are exactly complementary in sequence, so that
where one has A, the other has T and where one has C the other has G.
Therefore, naming the bases on the conventionally chosen side of the strand is
enough to describe the entire double-strand sequence. The two strands are
coiled around one another in the famous double helical structure elucidated by
Watson and Crick in 1953.
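The complementarity rule described above means that one strand fully determines the other. A minimal Python sketch of this idea follows; the function name `reverse_complement` and the example sequence are illustrative assumptions, not part of the original work.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(strand: str) -> str:
    """Return the opposite strand, written in its own 5'->3' direction.

    The two strands run in opposite directions, so the complemented
    sequence is reversed before it is returned.
    """
    return strand.upper().translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("ATGCGT"))
```

Applying the function twice returns the original strand, which is another way of stating that naming the bases on one strand is enough to describe the double-stranded sequence.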
An RNA strand that is transcribed from a protein-coding region of DNA is
called a messenger RNA (mRNA). The mRNA is used as a template for protein
synthesis in the translation process discussed below. Eukaryotic gene
sequences are composed of alternating sections called exons and introns. Exons
are the pieces of the sequence that contain the information for protein coding.
These pieces will be translated. Introns do not contain protein-coding information.
The discovery of introns led to the Nobel Prize in Physiology or Medicine in 1993
for Phillip Allen Sharp and Richard J. Roberts. The introns are cut out of the pre-
mRNA and are not present in the mRNA after processing. When an intron is
removed the ends of the exons on either side of it are linked together to form a
continuous strand.
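The effect of intron removal can be sketched in a few lines of Python. This is only a toy illustration: it assumes the exon boundaries are already known (which is precisely the hard part in practice), and the sequence, the coordinates, and the function name `splice` are hypothetical.

```python
from typing import List, Tuple


def splice(pre_mrna: str, exons: List[Tuple[int, int]]) -> str:
    """Join exon segments, given as half-open (start, end) index pairs.

    Everything outside the listed exons is treated as intron and dropped;
    the remaining exon ends are linked into one continuous strand.
    """
    return "".join(pre_mrna[start:end] for start, end in sorted(exons))


# Toy pre-mRNA: exon 1, one intron, exon 2 (illustrative sequences only).
pre_mrna = "AUGGCC" + "GUAAGUUUCAG" + "UUUGAA"
exons = [(0, 6), (17, 23)]
mature = splice(pre_mrna, exons)

if __name__ == "__main__":
    print(mature)
```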
Splicing is carried out by the spliceosome, a complex of several types of RNA
and proteins bound together and acting as a molecular machine. The
spliceosome is able to recognize signals in the pre-mRNA sequence that tell it
where the intron-exon boundaries are and hence which bits of the sequence to
remove. As with promoter sequences for transcription, the signals for the splice
sites are fairly short and variable, so that reliable identification of the intron-exon
structure of a gene is a difficult problem in bioinformatics. Nevertheless, the
spliceosome manages to do it. Introns that are spliced out by the spliceosome
are called spliceosomal introns. This is the majority of introns in most organisms.
In addition, there are some interesting, but fairly rare introns that are capable of
catalyzing their own splicing out of the primary RNA transcript without the action
of the spliceosome. There are surprisingly large numbers of introns in many
eukaryotic genes: 10 or 20 in one gene is not uncommon. In this work we focus
on research of RNA mathematical modeling and automated recognition of coding
and non-coding regions.
Breast Cancer
Breast cancer is the most common type of cancer among women and it is
the second leading cause of cancer mortality. Besides skin cancer, breast cancer
is the most commonly diagnosed cancer among American women [1]. According
to a statistical report by the National Cancer Institute of the United States, an
estimated 230,480 women in the USA were expected to be diagnosed with breast
cancer in 2011, and 39,520 women were expected to die of it [1]. Screening
mammography is the most widely used technique for detection of breast cancer.
Routine mammographic screening is regarded as a viable option to
detect the earliest signs of cancerous growth [2]. The mortality rates of women
under the age of 50 have been steadily decreasing since 1990. This decrease is
surmised to be the result of the advances in treatment and earlier detection
through screening. Thus, early detection and adoption of modern methods of
treatment for breast cancer can significantly improve the survival rate of patients.
Currently, X-ray mammography is widely regarded as an efficient imaging
modality for early detection of abnormalities. The earliest sign of breast cancer is
microcalcification, which is nodular in structure with high intensity, localized or
broadly diffused along the breast areas. Microcalcifications are tiny bits of
calcium deposits present in the breast tissue and they appear as clusters or in
patterns associated with extra cell activity in breast region. The detection of
microcalcification at an early stage is a challenging task to radiologists and a few
of the clusters could not be detected by them due to their impalpable size [3].
The detection sensitivity of radiologists in microcalcification detection is 70–90 %
and sensitivity depends on their experience [4]. Therefore, a Computer Aided
Diagnosis (CAD) system for breast cancer detection on mammogram has been
developed to improve the diagnostic rate. By incorporating the expert knowledge
of radiologists, the CAD system can be made to improve the detection accuracy.
Biofilm
Staphylococcus aureus is an opportunistic human pathogen responsible
for a wide range of diseases that vary in clinical presentation and severity. S.
aureus can cause infections ranging from minor skin lesions to life-threatening
conditions such as pneumonia, osteomyelitis and toxic shock syndrome. Recent
significant changes in health care delivery and antimicrobial resistance patterns
have caused a shift in the epidemiology of S. aureus. A recent report estimated
the number of invasive infections caused by methicillin-resistant S. aureus
(MRSA) in 2005 in the United States alone at 94,360, with 18,650 of these
resulting in a fatal outcome. This means that S. aureus infection has now passed
AIDS as a cause of death in the United States. Based on such considerations, S.
aureus is arguably a greater clinical concern now than at any time since the pre-
antibiotic era.
S. aureus infections exhibit general characteristics common in many
different types of infection. Strains that are resistant to methicillin already account
for the majority of the S. aureus nosocomial infection cases [5,6]. A more
alarming concern is the emergence of MRSA as a cause of infection in the
general community. In other words, patients who have never been hospitalized
and who have no other known risk factors for MRSA infection are becoming ill.
The treatment of such patients becomes more challenging because S. aureus is
capable of forming a biofilm, which shows intrinsic levels of antibiotic resistance.
The familiar mechanisms of antibiotic resistance, such as efflux pumps,
modifying enzymes, and target mutations [7], do not seem to be responsible for
the protection of bacteria in a biofilm. In fact, the metabolism of cells within a
biofilm is profoundly slower than that of dispersed cells, and it is more likely that this
slowing reduces the effects of antibiotics that block metabolic processes, such as
protein synthesis. Additionally, antibiotic-sensitive bacteria with no known genetic
basis for resistance can have profoundly reduced susceptibility when growing in
a biofilm. Such strains, when grown in a biofilm, have to be treated with much
higher doses of antibiotics than those needed to eradicate free-floating bacteria [7].
Three hypotheses explaining the formation of intrinsic levels of antibiotic
resistance are shown in Figure 1, adapted from [7].
Figure 1 Three hypotheses explaining the formation of intrinsic levels of antibiotic resistance
Adapted from Stewart et al., 2001
For the reasons mentioned above, it is necessary to consider biofilm
formation as an important weapon in S. aureus pathogenesis. The role of biofilm
formation in virulence requires extensive studies of biofilm regulation and
advanced methods of biofilm characterization and quantification.
Chapter 3 Time-Dependent ARMA Modeling of Genomic
Sequences
Abstract
Over the past decade, many investigators have used sophisticated time
series tools for the analysis of genomic sequences. Specifically, the correlation of
the nucleotide chain has been studied by examining the properties of the power
spectrum. The main limitation of the power spectrum is that it is restricted to
stationary time series. However, it has been observed over the past decade that
genomic sequences exhibit non-stationary statistical behavior. Standard
statistical tests have been used to verify that the genomic sequences are indeed
not stationary. More recent analysis of genomic data has relied on time-varying
power spectral methods to capture the statistical characteristics of genomic
sequences. Techniques such as the evolutionary spectrum and evolutionary
periodogram have been successful in extracting the time-varying correlation
structure. The main difficulty in using time-varying spectral methods is that they
are extremely unstable. Large deviations in the correlation structure result from
very minor perturbations in the genomic data and experimental procedure. A
fundamental new approach is needed in order to provide a stable platform for the
non-stationary statistical analysis of genomic sequences. Results: In this chapter,
we propose to model non-stationary genomic sequences by a time-dependent
autoregressive moving average (TD-ARMA) process. The model is based on a
classical ARMA process whose coefficients are allowed to vary with time. A
series expansion of the time-varying coefficients is used to form a generalized
Yule-Walker-type system of equations. A recursive least-squares algorithm is
subsequently used to estimate the time-dependent coefficients of the model. The
non-stationary parameters estimated are used as a basis for statistical inference
and biophysical interpretation of genomic data. In particular, we rely on the TD-
ARMA model of genomic sequences to investigate the statistical properties and
differentiate between coding and non-coding regions in the nucleotide chain.
Specifically, we define a quantitative measure of randomness to assess how far
a process deviates from white noise. Our simulation results on various gene
sequences show that both the coding and non-coding regions are non-random.
However, coding sequences are "whiter" than non-coding sequences as
attested by a lower index of randomness. Conclusion: We demonstrate that the
proposed TD-ARMA model can be used to provide a stable time series tool for
the analysis of non-stationary genomic sequences. The estimated time-varying
coefficients are used to define an index of randomness, in order to assess the
statistical correlations in coding and non-coding DNA sequences. It turns out that
the statistical differences between coding and non-coding sequences are more
subtle than previously thought using stationary analysis tools: Both coding and
non-coding sequences exhibit statistical correlations, with the coding regions
being "whiter" than the non-coding regions. These results corroborate the
evolutionary periodogram analysis of genomic sequences and refute the
conclusion of stationary analyses that coding DNA behaves like random sequences.
Background
Understanding the statistical properties of genomic sequences helps
recreate the dynamical processes that led to the current DNA structure, and
determine gene-related diseases like cancer and Alzheimer's disease. For
instance, based on the view that non-coding DNA exhibits long-range
correlations [1-6], Li [7] proposed an expansion-modification model of gene
evolution. The model incorporates the two basic features of DNA evolution: (i)
sequence elongation due to gene duplication and (ii) mutations. It can be shown
that the limiting sequence created by this dynamical process exhibits a long-
range correlation structure, as attested by a $1/f^{\alpha}$ spectrum, where the exponent
$\alpha$ is a function of the probability of mutation. To understand the relationship
between the DNA correlation structure and possible gene aberrations, Dodin et
al. [8] designed a simple correlation function intended to visualize the regular
patterns encountered in DNA sequences. This function is used to revisit the
intriguing question of triplet repeats with the aim of providing a visual estimate of
the propensity of genes to be highly expressed and/or to lead to possible
aberrant structures formed upon strand slippage.
Statistical analysis of genomic sequences has, however, long relied
on signal processing techniques for stationary signals (correlation and
power spectrum) [2,4,9,10], and statistical tools for slowly-varying trends within
stationary signals (Detrended Fluctuation Analysis or DFA) [1,5,6]. Stationarity can
be argued to be a valid assumption for time series of short duration. However, such
an assumption rapidly loses its credibility in the enormous databases maintained
by various genome projects. Standard statistical tests (e.g., Priestley's test for
stationarity) have been used to verify that genomic sequences are not stationary
and that the nature of their non-stationarity varies and is often more complex than a
simple trend [11,12]. Subsequently, more recent analysis of genomic data [1] has
relied on time-varying power spectral methods (the evolutionary spectrum and
periodogram) to capture the statistical characteristics of genomic sequences
[11,12]. The main difficulty in using time-varying spectral methods is that they are
extremely unstable and very noisy. Typically, the power spectrum and the
evolutionary spectrum are averaged over time in order to obtain smoother and less
jittery curves. Moreover, as was pointed out in [13], the evolutionary spectrum is
restricted to the class of oscillatory processes. A stochastic process, $x[n]$, is
oscillatory if it has a representation of the form
Equation 1
$$x[n] = \int_{-\pi}^{\pi} A(n,\omega)\, e^{j\omega n}\, dZ(\omega)$$
where $Z(\omega)$ is an orthogonal increment process, and the evolutionary power
spectrum of the process is defined by $|A(n,\omega)|^2\, d\mu(\omega)$, with $d\mu(\omega) = E|dZ(\omega)|^2$. This definition of the
evolutionary power spectrum has the following disadvantages [13]:
i. It is not uniquely defined for a given non-stationary process.
ii. The estimation procedure for the evolutionary spectrum depends greatly
on the nature of the amplitude function $A(n,\omega)$, which is not known a priori.
iii. An increase in the number of observations does not provide added
information on the local behavior of the evolutionary spectrum, and thus
does not improve estimation accuracy.
We propose to model non-stationary genomic sequences by a time-
dependent autoregressive moving average (TD-ARMA) process. Cramer [14]
showed that a non-stationary process still possesses a Wold decomposition in
terms of its innovation and its generating system. However, the linear system
generating the non-stationary signal $x[n]$, when driven by the innovation, $e[n]$, is
no longer shift-invariant; the parameters of the impulse response, $h[n,k]$, of this
system are time-dependent so that
Equation 2
$$x[n] = \sum_{k=0}^{\infty} h[n,k]\, e[n-k]$$
The existence of a time-varying ARMA representation of this process is
ensured by two theorems due, independently, to Grenier [15] and Huang and
Aggarwal [16]. The uniqueness of the TD-ARMA representation is obtained by
constraining the ARMA model to be invertible, but this leads to conditions on the
time-varying impulse response and its inverse (namely to be absolutely
summable at any time instant), which cannot be easily expressed in terms of the time-
dependent coefficients of the ARMA model. In this chapter, we estimate the time-
dependent coefficients of the general TD-ARMA model using mean-squares,
least-squares and recursive least-squares algorithms. The mean-squares
estimation leads to generalized Yule-Walker type equations [15]. Once the non-
stationary parameters are estimated (as time series), we use them to provide a
basis for statistical inference by defining an index of randomness, which
quantitatively assesses how close the non-stationary signal is to white noise. Our
simulation results on various gene sequences show that (i) both the coding and
non-coding segments of a gene are not random, and (ii) the coding segments are
"closer" to random sequences than non-coding segments. Our results support
the view that both coding and non-coding sequences are not random
[11,12,9,17-20], and refute the stationary studies that maintain that non-coding
DNA sustains long-range correlations whereas coding DNA behaves like random
sequences [1-3,5,6,10].
Methods
Numerical representation of genomic sequences
Converting the DNA sequence into a digital signal offers the opportunity to apply
powerful signal processing methods for the handling and analysis of genomic
information. This is, however, not an easy task as the analysis results might depend
on the chosen map. Various numerical mappings have been adopted in the literature.
To cite a few, Peng et al. [1] construct a one-dimensional map of nucleotide sequences
onto a walk, $y(l) = \sum_{i=1}^{l} u(i)$, which they termed "DNA walk". The DNA walk is defined by the
rule that the walker steps up ($u(i) = +1$) if a pyrimidine resides at position i, and
steps down ($u(i) = -1$) otherwise. Voss [9] represents a DNA sequence by four
binary indicator sequences, which indicate the locations of the four nucleotides A,
T, C and G. Berthelsen et al. [21] proposed a two-dimensional representation of
DNA sequences, characterized by a Hausdorff dimension (also called fractal
dimension) that is invariant under (i) complementarity, (ii) reflection symmetry,
(iii) compatibility and (iv) substitution symmetry of AT and . The
corresponding embedding assignment is given by
. In this chapter, since we are interested in time-
dependent ARMA modeling of one-dimensional non-stationary genomic
sequences, we adopt the widely used "DNA walk" map proposed by Peng et al
[1]. The DNA walk provides a nice graphical representation for each gene. For
instance, Figure 2 shows the structure of the Human gene 276 located in
chromosome 1, and its DNA walk is displayed in Figure 3.
Time-dependent ARMA model
Grenier [22] showed that a discrete non-stationary signal, $x[n]$, can be
represented by finite-order time-varying ARMA processes of the form
Equation 3
$$x[n] + \sum_{i=1}^{p} a_i[n-i]\, x[n-i] = e[n] + \sum_{j=1}^{q} b_j[n-j]\, e[n-j], \qquad n = 0, \ldots, N-1$$
where $N$ is the length of the sequence, $a_i[n]$ and $b_j[n]$ are the time-
dependent model parameters, p and q are the model orders and $e[n]$ is a white
noise process. Observe that the coefficients $a_i$ and $b_j$ appear with an
argument depending on the lag ($n-i$ and $n-j$, respectively). This is purely arbitrary since any time origin can
be chosen, without restraining the generality of the model. We assume that the
time-dependent coefficients $a_i[n]$ and $b_j[n]$ can be expressed as linear
combinations of some basis functions $f_k[n]$,
Equation 4
$$a_i[n] = \sum_{k=0}^{m} a_{ik}\, f_k[n], \qquad b_j[n] = \sum_{k=0}^{m} b_{jk}\, f_k[n]$$
Figure 2 Gene Structure. Gene structure of the Human gene 276 located in chromosome 1: The
boxes correspond to the exons (coding regions), and the lines between them represent the
introns (non-coding regions). The total length of the gene is N=8208 bases, including
1536 coding and 6672 non-coding bases
Figure 3 DNA Walk. DNA walk of the Human gene 276
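The purine-pyrimidine DNA walk adopted in this chapter can be sketched in Python as follows. The function name `dna_walk` is an illustrative assumption, and the treatment of any symbol other than C or T as a downward step simply follows the "steps down otherwise" rule quoted in the text.

```python
import numpy as np

PYRIMIDINES = frozenset("CT")  # cytosine and thymine


def dna_walk(sequence: str) -> np.ndarray:
    """Map a nucleotide string to its DNA walk.

    Each pyrimidine (C or T) contributes a step of +1 and every other
    symbol a step of -1; the walk is the running sum of the steps.
    """
    steps = np.array(
        [1 if base in PYRIMIDINES else -1 for base in sequence.upper()]
    )
    return np.cumsum(steps)


if __name__ == "__main__":
    print(dna_walk("CTAG"))
```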
The advantage of the basis parameterization is clear from the fact that the
identification of the time-dependent coefficients $a_i[n]$ and $b_j[n]$ reduces to the
identification of the constant coefficients $a_{ik}$ and $b_{jk}$, and therefore
the linear non-stationary problem reduces to a linear time-invariant problem. The
basis functions $f_k[n]$ do not have to be limited to the standard choices of
Legendre, Fourier, or the prolate spheroidal basis but can also take advantage of
any prior information, such as the presence of a jump in the coefficients at a
known instant [22].
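The reduction to a time-invariant problem can be illustrated with a small Python sketch. It rests on several simplifying assumptions that are not taken from the original work: the model is autoregressive only (no moving-average part), of order one, written as x[n] = -a[n-1] x[n-1] + e[n]; the basis is the first two Legendre polynomials on a time axis rescaled to [-1, 1]; and all function names are hypothetical. Under those assumptions the constant basis weights are recovered by ordinary least squares.

```python
import numpy as np
from numpy.polynomial import legendre


def legendre_basis(n_samples: int, degree: int) -> np.ndarray:
    """Return an (n_samples, degree + 1) matrix whose column k is f_k[n]."""
    t = np.linspace(-1.0, 1.0, n_samples)  # rescaled time axis
    return legendre.legvander(t, degree)


def simulate_td_ar1(weights, n_samples, rng):
    """Simulate x[n] = -a[n-1] x[n-1] + e[n] with a[n] = sum_k w_k f_k[n]."""
    basis = legendre_basis(n_samples, len(weights) - 1)
    a = basis @ weights
    e = rng.standard_normal(n_samples)
    x = np.zeros(n_samples)
    for n in range(1, n_samples):
        x[n] = -a[n - 1] * x[n - 1] + e[n]
    return x


def fit_td_ar1(x, degree):
    """Estimate the constant basis weights by ordinary least squares."""
    basis = legendre_basis(len(x), degree)
    # Regressor at time n: -f_k[n-1] * x[n-1], one column per basis function.
    regressors = -basis[:-1] * x[:-1, None]
    weights, *_ = np.linalg.lstsq(regressors, x[1:], rcond=None)
    return weights


rng = np.random.default_rng(0)
true_w = np.array([-0.5, 0.3])  # a[n] drifts linearly from -0.8 to -0.2
x = simulate_td_ar1(true_w, 5000, rng)
w_hat = fit_td_ar1(x, degree=1)

if __name__ == "__main__":
    print("true weights:     ", true_w)
    print("estimated weights:", w_hat)
```

Although the coefficient a[n] changes at every sample, only two constants are estimated, which is the point of the basis parameterization.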
Estimation of the time-dependent ARMA coefficients
From Equation 4, the
unknown parameters of the TD-ARMA model are the weights of the linear
combinations defining each time-varying parameter. This linearity is the key to the
algorithms proposed in this chapter. We will derive mean-squares, least-squares
and recursive least-squares solutions to estimate the time-dependent coefficients
$a_i[n]$ and $b_j[n]$.
Mean-squares estimation
Define the process
Equation 5
∑ ∑
and define the vector
Equation 6
[ ]
It is possible to derive for this process orthogonality conditions similar to the
stationary ARMA model conditions [23]. Observe that the process , defined in
Eq. (6), is orthogonal to ; hence, it is
orthogonal to for all , and orthogonal to for all
[22]. This orthogonality condition leads to a generalized Yule-Walker
equation [22]
Equation 7
([ ] | | ) ([ ] )
Although the process is non-stationary, the stationarity and ergodicity
of the process , together with the linearity of the model, allow us to replace in
Eq. (8) the expectation by a summation. However, although consistent, the
above estimator is often considered a poor one [22].
Least-squares estimation
Equations (4) and (5) can be written in vector format as follows
Equation 8
Where
[ ]
[ ]
[ ]
define
Equation 9
Then we have
Equation 10
Using this vector notation, Eq. (3) can be written as
Equation 11
Or equivalently
Equation 12
Where is a row vector
Equation 13
And
Equation 14
[ ]
Observe that the vector contains all the unknowns of the TD-ARMA model.
Writing Eq. (10) for – leads to
Equation 15
where
[ ]
[ ]
[ ]
The least-squares solution of Equation 11 is given by
Equation 16
To overcome the computational complexity associated with the least-
squares estimation (involving in particular the inversion of a square
matrix), we opted for recursive least-squares estimation
as follows.
Recursive least-squares estimation
The recursive least squares algorithm
is summarized as [24]
Equation 17
̂ ̂ { ̂ }
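For orientation, a textbook recursive least-squares update can be sketched in Python as below. This is a generic sketch and not the implementation used in this work: the function name `rls`, the forgetting factor `lam`, the initial covariance scale `delta`, and the synthetic regression data are all assumptions made for illustration. The estimate is corrected at each step by a gain times the prediction error, so no large matrix needs to be inverted.

```python
import numpy as np


def rls(regressors, targets, lam=1.0, delta=1e4):
    """Generic recursive least squares with forgetting factor lam.

    regressors: (N, d) array, one regressor row per time step.
    targets:    (N,) array of observations.
    Returns the final estimate and the (N, d) history of estimates.
    """
    n_params = regressors.shape[1]
    theta = np.zeros(n_params)
    P = delta * np.eye(n_params)  # large initial covariance: weak prior
    history = np.empty_like(regressors, dtype=float)
    for n, (phi, target) in enumerate(zip(regressors, targets)):
        gain = P @ phi / (lam + phi @ P @ phi)
        theta = theta + gain * (target - phi @ theta)  # error-driven update
        P = (P - np.outer(gain, phi @ P)) / lam
        history[n] = theta
    return theta, history


# Synthetic check: with lam = 1 the recursion approaches batch least squares.
rng = np.random.default_rng(1)
Phi = rng.standard_normal((500, 3))
theta_true = np.array([0.5, -1.0, 2.0])
y = Phi @ theta_true + 0.1 * rng.standard_normal(500)

theta_rls, _ = rls(Phi, y, lam=1.0, delta=1e4)
theta_ls, *_ = np.linalg.lstsq(Phi, y, rcond=None)

if __name__ == "__main__":
    print("recursive:", theta_rls)
    print("batch:    ", theta_ls)
```

A forgetting factor below one would discount older samples, which is one common way to let the estimate follow slowly varying coefficients.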
Index of randomness
Over the past decade, there has been a flow of conflicting papers about
the long-range power-law correlations detected in eukaryotic DNA [1-3,5,6,9-
12,17-20]. The controversy is generated by conflicting views that either advocate
that non-coding DNA sustains long-range correlations whereas coding DNA
behaves like random sequences [1,10,2,3,5,6], or maintain that both coding and
non-coding DNA exhibit long-range power-law correlations [11,12,9,17-20].
Based on the analysis of the time-dependent power spectrum of genomic
sequences, Bouaynaya and Schonfeld [11,12] showed that the statistical
differences between coding and non-coding sequences are more subtle than
previously concluded using stationary analysis tools. In fact they found that both
coding and non-coding sequences are non-random. However, coding sequences
are "whiter" than non-coding sequences. We propose to quantitatively assess the
degree of randomness of both coding and non-coding sequences using the time-
dependent ARMA coefficients $a_i[n]$ and $b_j[n]$. Consider the system function,
$H(z)$, of a stationary ARMA model (whose coefficients $a_i$ and $b_j$ are constant,
i.e., independent of time). We know that
Equation 18
$$H(z) = \frac{1 + \sum_{j=1}^{q} b_j z^{-j}}{1 + \sum_{i=1}^{p} a_i z^{-i}} = \frac{\prod_{j=1}^{q} \left(1 - z_j z^{-1}\right)}{\prod_{i=1}^{p} \left(1 - p_i z^{-1}\right)}$$
where $z_j$ and $p_i$ are the zeros and poles of the system function,
respectively. From the fact that a stationary white noise process has a flat
spectrum, we observe that the closer (in L2 distance) the zeros and poles are,
the flatter is the spectrum of the process. Following the same reasoning locally
for non-stationary processes, we define the curve of randomness, CR[n], of the
non-stationary process $x[n]$ by
Equation 19
{
( ∑| | ∑ | |)
( ∑| | ∑ | |)
( ∑| |)
where the minimum is taken over all pairs . Observe that the case
is obtained from the case by interchanging the roles of and ,
and the indices and . The curve of randomness defined in Equation 19 is a
measure of how close the zeros and the poles of the system function are, and
therefore is a measure of how flat the system function is, and how close the
process is to white noise. The index of randomness, $IR$, of a TD-ARMA(p,q)
process is then defined as the average of the curve of randomness, i.e.,
Equation 20
$$IR = \frac{1}{N} \sum_{n=0}^{N-1} CR[n]$$
In particular, the index of randomness of a TD-ARMA(1,1) process
is given by
Equation 21
$$IR = \frac{1}{N} \sum_{n=0}^{N-1} \left| a_1[n] - b_1[n] \right|$$
Observe that the index of randomness of a white noise process is equal to
zero. We say that the sequence $x_1[n]$ with index of randomness $IR_1$ is more
random than the sequence $x_2[n]$ with index of randomness $IR_2$ if the index of
randomness of the former is lower than the index of randomness of the latter,
i.e., $IR_1 < IR_2$.
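For the first-order case, the index of randomness can be sketched in Python as below. The sketch assumes the convention x[n] + a[n-1] x[n-1] = e[n] + b[n-1] e[n-1], under which the pole and the zero at time n sit at -a[n] and -b[n], so that their distance is |a[n] - b[n]|; the function names and the coefficient trajectories are hypothetical and chosen only to show the two extremes.

```python
import numpy as np


def curve_of_randomness_11(a, b):
    """Pole-zero distance |a[n] - b[n]| of a TD-ARMA(1,1) model at each n."""
    return np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))


def index_of_randomness_11(a, b):
    """Time average of the curve of randomness (0 means white noise)."""
    return float(np.mean(curve_of_randomness_11(a, b)))


n = np.arange(200)

# Pole and zero coincide at every instant: they cancel, leaving white noise.
a_cancel = 0.3 * np.sin(2 * np.pi * n / 200)
ir_white = index_of_randomness_11(a_cancel, a_cancel)

# Strongly correlated case: pole far from the zero at every instant.
ir_correlated = index_of_randomness_11(np.full(200, -0.9), np.zeros(200))

if __name__ == "__main__":
    print("cancelling pole and zero:", ir_white)
    print("pole far from zero:      ", ir_correlated)
```

Under this convention, a lower value indicates a process closer to white noise, in line with the ordering defined above.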
Results
All genome sequences considered in this chapter have been extracted
from the NIH website http://www.ncbi.nlm.nih.gov. The algorithms were
implemented in MATLAB. The DNA sequences were mapped to numerical
sequences using the purine-pyrimidine DNA walk proposed in [1]. In our
simulations, the recursive least squares algorithm was found to best estimate the
time-dependent coefficients of the TD-ARMA model. We used the MATLAB
function rarmax, which implements the recursive least-squares algorithm for TD-
ARMA models. The orders p and q of the model were determined
experimentally as follows: For each genomic sequence, we computed 100 TD-
ARMA models corresponding to the orders (1, 1) up to (10, 10). The best model
was chosen to be the one that minimizes the average squared error between the
actual and the fitted sequences. Our extensive simulations on various DNA
sequences from different organisms show that most of the sequences are best
fitted with low-order TD-ARMA models, e.g., TD-ARMA(1,1), TD-ARMA(1,2) and
TD-ARMA(2,1). Figure 4 shows the DNA walk of the Human gene 276 and its TD-
ARMA(1,1) fitted sequence. Observe that the TD-ARMA(1,1) model accurately
fits this gene sequence. The estimated time-varying coefficients a[n] and b[n]
are displayed in Figure 5 for both the coding and non-coding regions of this gene.
Their statistical differences are not clear from the plot of the time-series
coefficients. The curves of randomness of the coding and non-coding regions are
displayed in Figure 6.
Table 1 shows the index of randomness of various gene sequences. For
concise representation, the column titles have been abbreviated as follows: "C.
length" (resp. "N.C. length") denotes the length (in base pairs) of the coding (resp.
non-coding) segment of the gene. The total length of the gene is the sum of the
lengths of its coding and non-coding regions. "C. (p, q)" (resp. "N.C. (p, q)")
denotes the optimal TD-ARMA orders (p, q) for the coding (resp. non-coding)
region of the gene. Finally, "C. IR" (resp. "N.C. IR") is the index of randomness of
the coding (resp. non-coding) segment of the gene. Observe that, in all
considered genes, the index of randomness of both the coding and non-coding
segments is strictly positive, and the index of randomness of the coding region
is consistently lower than the index of randomness of the non-coding region
(recall that the index of randomness of a white noise is zero). These observations
lead to two important conclusions: (i) Both the coding and non-coding
sequences are not random, as attested by an index of randomness greater than
zero. (ii) The coding sequences are "whiter" than the non-coding sequences.
This conclusion refutes previous work on statistical correlation of DNA
sequences, which, based on stationary time-series analysis, presumed that
coding DNA behaves like random sequences [1-3,5,6,10]; and supports the
opposing view that both coding and non-coding sequences are not random
[11,12,9,17-20]. In particular, our conclusion is in accordance with the
evolutionary periodogram analysis conducted in [11,12].
Figure 4 TD-ARMA modeling. TD-ARMA modeling of the Human gene 276: The blue signal is the
DNA walk, and the red signal is the TD-ARMA(1,1) fitted signal. The TD-ARMA(1,1) model
accurately fits the genomic signal
Figure 5 TD-ARMA coefficients estimation. Estimation of the TD-ARMA(1,1) coefficients of the
Human gene 276. The TD-ARMA(1,1) model is given by
$x[n] + a[n-1]\,x[n-1] = e[n] + b[n-1]\,e[n-1]$. The blue and black (resp. red and green) curves depict the time series
$a[n]$ (resp. $b[n]$) for the coding and non-coding regions of the gene, respectively.
Figure 6 Curve of randomness. The curves of randomness of the coding and non-coding regions
of the Human gene 276 are shown in blue and red, respectively. The index of
randomness of the coding sequence is equal to 1.0603, whereas its corresponding value
for the non-coding sequence is equal to 1.0627
Table 1 Index of randomness of the Coding and Non-Coding segments of Various Gene Sequences
Chapter 4 Two-Dimensional ARMA Modeling for Breast
Cancer Detection and Classification
Abstract
Computer aided diagnosis (CAD) paradigms have gained currency for
discriminating malignant from benign lesions in ultrasound breast images. But
even the most sophisticated investigators often rely on one-dimensional
representations of the image in terms of its scan lines. Such vector
representations are convenient because of the mathematical tractability of one-
dimensional time-series. However, they fail to take into account the spatial
correlations between the pixels, which is crucial in tumor detection and
classification in breast images. In this chapter, we propose a CAD system for
tumor detection and classification (cancerous vs. benign) in ultrasound breast
images based on a two-dimensional Auto-Regressive-Moving-Average (ARMA)
model of the breast image. First, we show, using the Wold decomposition
theorem, that ultrasound breast images can be accurately modeled by two-
dimensional ARMA random fields. As in the 1D case, the 2D ARMA parameter
estimation problem is much more difficult than its 2D AR counterpart, due to the
nonlinearity in estimating the 2D moving average (MA) parameters. We propose
to estimate the 2D ARMA parameters using a two-stage Yule-Walker Least-
Squares algorithm. The estimated parameters are then used as the basis for
statistical inference and biophysical interpretation of the breast image. We
evaluate the performance of the 2D ARMA vector features in real ultrasound
images using a k-means classifier. Our results suggest that the proposed CAD
system based on a two-dimensional ARMA model leads to parameters that can
accurately segment the ultrasound breast image into three regions: healthy
tissue, benign tumor, and cancerous tumor. Moreover, the specificity and
sensitivity of the proposed two-dimensional CAD system are superior to those of
its one-dimensional homologue.
Introduction
Breast cancer continues to be a significant public health problem in the
United States: It is the second leading cause of female mortality, and,
disturbingly, one out of eight women in the United States will be diagnosed with
breast cancer in her lifetime. Until the cause of this disease is fully understood,
early detection remains the only hope to improve breast cancer prognosis and
treatment. Breast cancer screening modalities are mainly based on clinical
examination, mammography, ultrasound imaging, magnetic resonance imaging
(MRI), and core biopsy. Mammography (breast x-ray imaging) is by far the
fastest and cheapest screening test for breast cancer. Unfortunately, it is also
among the most difficult of radiological images to interpret: mammograms are of
low contrast, and features indicative of breast disease are often very small. Many
studies have shown that ultrasound and MRI imaging techniques can help
supplement mammography by detecting small breast cancers that may not be
visible with mammography. However, these techniques often fail to determine if a
detected tumor is cancerous or benign, and a biopsy may be recommended.
Consequently, many unnecessary biopsies are often undertaken due to the high
false positive rate. Computer aided diagnosis (CAD) paradigms have recently
received great attention for lesion detection and discrimination in X-ray and
ultrasound breast images [25]–[28]. The large number of negative
biopsies encountered in clinical practice could be reduced if a computer system
were available to help radiologists screen breast images. Broadly, the CAD
systems proposed in the literature can be grouped into four major categories:
geometrical [24], artificial intelligence [25], pyramidal (or multi-resolution) [27],
and model-based techniques [28], [29]. Geometrical methods employ
morphological and other segmentation techniques to extract small specks of
calcium known as microcalcifications from breast images [25]. However, this
procedure usually requires a priori knowledge of the tumor pattern
characteristics. Moreover, these techniques also tend to rely on many stages of
heuristics attempting to eliminate false positives. Artificial intelligence techniques
include neural networks and fuzzy logic methods. The performance of these
systems is tied to the architecture of the network and the number of training data.
Breast cancer is a heterogeneous disease which includes several subtypes with
distinct prognosis. In particular, the variability associated with the appearances of
the breast cancer, ranging from relative uniformity to complex patterns of bright
streaks and blobs [26], makes the ANN require a large training data set to ensure
a certain level of reliability. Pyramidal or multi-resolution techniques refer mainly
to the wavelet transform [27], which can be seen from a signal decomposition
viewpoint. Specifically, a signal is decomposed onto a set of basis wavelet
functions. A very appealing feature of wavelet analysis is that it provides a
uniform resolution for all the scales. However, because the resolution is limited
by the size of the basic wavelet function, the downside of the uniform resolution
is that it may be uniformly poor. Model-based methods include linear, non-linear and finite-element
methods to build an accurate model of the breast [28], [29]. The model is
subsequently used for image matching, detection, and classification [29]. The
accuracy of the results is tied to the accuracy of the considered model. In this
work, we propose a new model-based CAD system for tumor detection and
classification. We show that (x-ray, ultrasound, and MRI) breast images can be
accurately modeled by two-dimensional autoregressive moving average (ARMA)
random fields. The model parameters, being the fingerprints of the image, serve
as the basis for statistical inference and biophysical interpretation of the breast
image. ARMA models are parametric representations of wide-sense stationary
(WSS) processes with rational spectra. The Wold decomposition theorem states
that any WSS process can be decomposed as the sum of a regular process,
whose spectrum is continuous, and a predictable process, whose spectrum
consists of impulses. Since rational spectra form a dense set in the class of
continuous spectra, the ARMA model renders accurately the regular part of the
WSS process. It is, therefore, surprising that very few researchers have
attempted to derive a general ARMA representation of the breast image, and use
it for tumor detection and classification. In [29], the authors use a one-
dimensional fractional differencing autoregressive moving average (FARMA)
process to model the ultrasound RF echo reflected from the breast tissue.
However, by considering separate scan lines, they do not take into account the
two-dimensional spatial correlation between the pixels in the image. In [30], an
autoregressive (AR) model is considered for improving the contrast of breast
cancer lesions in ultrasound images. ARMA models, however, provide a more
accurate model of a homogeneous random field than an AR model. As in the 1D
case, the 2D ARMA parameter estimation problem is much more difficult than its
2D AR counterpart, due to the non-linearity in estimating the 2D moving average
(MA) parameters.
2D-ARMA Modeling
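We represent the breast image as a 2D random field $\{x[n,m],\ (n,m) \in \mathbb{Z}^2\}$.
We define an order on the discrete lattice $\mathbb{Z}^2$ as follows
Equation 22
$$(i,j) \le (s,t) \iff i \le s \ \text{and}\ j \le t$$
The 2D-ARMA($p_1,p_2,q_1,q_2$) model is defined for the $N_1 \times N_2$ image
$\{x[n,m] : 0 \le n \le N_1-1,\ 0 \le m \le N_2-1\}$ by the following difference equation
Equation 23
$$x[n,m] + \sum_{i=0}^{p_1-1} \sum_{\substack{j=0 \\ (i,j) \neq (0,0)}}^{p_2-1} a_{ij}\, x[n-i,m-j] = w[n,m] + \sum_{i=0}^{q_1-1} \sum_{\substack{j=0 \\ (i,j) \neq (0,0)}}^{q_2-1} b_{ij}\, w[n-i,m-j]$$
where $\{w[n,m]\}$ is a stationary white noise field with variance $\sigma^2$, and the
coefficients $\{a_{ij}\}$, $\{b_{ij}\}$ are the parameters of the model. From Equation 23 the
image can be viewed as the output of the linear time-invariant causal
system $H(z_1,z_2)$ excited by a white noise input, where
Equation 24
$$H(z_1,z_2) = \frac{B(z_1,z_2)}{A(z_1,z_2)} = \frac{\sum_{i=0}^{q_1-1} \sum_{j=0}^{q_2-1} b_{ij}\, z_1^{-i} z_2^{-j}}{\sum_{i=0}^{p_1-1} \sum_{j=0}^{p_2-1} a_{ij}\, z_1^{-i} z_2^{-j}}$$
with $a_{00} = b_{00} = 1$.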
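As an illustration of the difference equation in Equation 23, the following Python sketch synthesizes a 2D ARMA field by driving the recursion with Gaussian white noise. It is only a sketch under stated assumptions: the function name `simulate_arma2d`, the zero boundary condition outside the image, and the nested-list layout of the coefficient arrays `a` and `b` (with `a[0][0] = b[0][0] = 1`) are illustrative choices and are not taken from the original implementation, which was written in Matlab.

```python
import random


def simulate_arma2d(a, b, rows, cols, sigma=1.0, seed=0):
    """Synthesize x from
        x[n,m] + sum_{(i,j)!=(0,0)} a[i][j] x[n-i,m-j] = sum_{i,j} b[i][j] w[n-i,m-j]
    with a[0][0] = b[0][0] = 1. Samples outside the image are treated as zero
    (an assumption made only for this sketch)."""
    rng = random.Random(seed)
    w = [[rng.gauss(0.0, sigma) for _ in range(cols)] for _ in range(rows)]
    x = [[0.0] * cols for _ in range(rows)]
    # Raster scan: every x[n-i][m-j] with i, j >= 0 is already computed.
    for n in range(rows):
        for m in range(cols):
            acc = 0.0
            for i, row in enumerate(b):          # moving-average part
                for j, bij in enumerate(row):
                    if n - i >= 0 and m - j >= 0:
                        acc += bij * w[n - i][m - j]
            for i, row in enumerate(a):          # autoregressive part
                for j, aij in enumerate(row):
                    if (i, j) != (0, 0) and n - i >= 0 and m - j >= 0:
                        acc -= aij * x[n - i][m - j]
            x[n][m] = acc
    return x, w
```

With `a = [[1.0]]` and `b = [[1.0]]` the field reduces to the white noise itself, which is a convenient sanity check before trying larger supports.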
Yule-Walker Least-Squares Parameter Estimation
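Assume first that the noise sequence $\{w[n,m]\}$ were known. Then the
problem of estimating the parameters in the ARMA model in Equation 23 would be
a simple input-output system parameter estimation problem, which could be
solved by several methods, the simplest of which is the least-squares (LS)
method. In the LS method, we express Equation 23 as
Equation 25
$$x[n,m] + \varphi^T[n,m]\,\theta = w[n,m]$$
where
Equation 26
$$\varphi^T[n,m] = \big[\, x[n,m-1], \ldots, x[n-p_1+1,m-p_2+1],\ -w[n,m-1], \ldots, -w[n-q_1+1,m-q_2+1] \,\big]$$
and
Equation 27
$$\theta = \big[\, a_{01}, \ldots, a_{p_1-1,\,p_2-1},\ b_{01}, \ldots, b_{q_1-1,\,q_2-1} \,\big]^T$$
Writing Equation 25 in matrix form for $n = L_1, \ldots, N_1-1$ and $m = L_2, \ldots, N_2-1$,
for some $L_1 \ge \max(p_1,q_1)$ and $L_2 \ge \max(p_2,q_2)$, gives
Equation 28
$$\mathbf{x} + \Phi\,\theta = \mathbf{w}$$
where
Equation 29
$$\mathbf{x} = \big[\, x[L_1,L_2], \ldots, x[N_1-1,N_2-1] \,\big]^T$$
$$\mathbf{w} = \big[\, w[L_1,L_2], \ldots, w[N_1-1,N_2-1] \,\big]^T$$
and $\Phi$ is displayed below. Assume we know $\Phi$; then we can obtain a least-
squares estimate of the parameter vector $\theta$ in Equation 28 as
Equation 30
$$\hat{\theta} = -\left(\Phi^T \Phi\right)^{-1} \Phi^T \mathbf{x}$$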
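The least-squares step can be sketched in a few lines of Python. The sketch below assumes a 2 x 2 coefficient support (three unknown AR and three unknown MA coefficients), forms one regressor row per pixel from past image and noise samples, and solves the normal equations for $\hat{\theta} = -(\Phi^T\Phi)^{-1}\Phi^T\mathbf{x}$. The helper names `solve`, `regressor_row` and `ls_estimate` are hypothetical, and the small Gaussian-elimination solver merely stands in for whatever linear solver is actually used.

```python
def solve(A, y):
    """Solve A z = y by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z


def regressor_row(x, w, n, m):
    """Row of Phi at pixel (n, m) for a 2 x 2 support (assumed here):
    past image samples followed by minus the past noise samples."""
    lags = [(0, 1), (1, 0), (1, 1)]
    return ([x[n - i][m - j] for i, j in lags] +
            [-w[n - i][m - j] for i, j in lags])


def ls_estimate(Phi, xvec):
    """theta_hat = -(Phi^T Phi)^{-1} Phi^T x, via the normal equations."""
    cols = len(Phi[0])
    G = [[sum(row[a] * row[b] for row in Phi) for b in range(cols)]
         for a in range(cols)]
    g = [sum(row[a] * xv for row, xv in zip(Phi, xvec)) for a in range(cols)]
    return [-t for t in solve(G, g)]
```

In use, `Phi` would be built by stacking `regressor_row(x, w_hat, n, m)` over the interior pixels, with `w_hat` the noise estimate, and `xvec` would hold the corresponding pixel values.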
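Observe that the input model noise $\{w[n,m]\}$ in $\Phi$ is unknown.
Nevertheless, it can be estimated by considering the noise process $\{w[n,m]\}$ as
the output of the linear filter $1/H(z_1,z_2) = A(z_1,z_2)/B(z_1,z_2)$ with input $\{x[n,m]\}$. From Nirenberg’s
proof of the division theorem in multi-dimensional spaces [32], we can write the
inverse ARMA filter as the infinite order AR filter
Equation 31
$$\frac{A(z_1,z_2)}{B(z_1,z_2)} = \sum_{i=0}^{\infty} \sum_{j=0}^{\infty} \alpha_{ij}\, z_1^{-i} z_2^{-j}$$
In the time domain we obtain
Equation 32
$$w[n,m] = \sum_{i=0}^{\infty} \sum_{j=0}^{\infty} \alpha_{ij}\, x[n-i,m-j]$$
Therefore, we can estimate $\{w[n,m]\}$ by first estimating the AR parameters
$\{\alpha_{ij}\}$ and next obtaining $\{w[n,m]\}$ by filtering $\{x[n,m]\}$ as in Equation 32.
The matrix $\Phi$ of Equation 28 stacks the regressor vectors of Equation 26:
Equation 33
$$\Phi = \begin{pmatrix} \varphi^T[L_1,L_2] \\ \vdots \\ \varphi^T[N_1-1,N_2-1] \end{pmatrix}$$
Since we cannot estimate
an infinite number of (independent) parameters from a finite number of samples,
we approximate the infinite AR model by one of finite order, say $(K_1,K_2)$. The
parameters in the truncated AR model can be estimated by using a 2D extension
of the Yule-Walker equations as follows
Equation 34
$$\sum_{i=0}^{K_1-1} \sum_{j=0}^{K_2-1} \alpha_{ij}\, r[k-i,l-j] = \sigma^2\, \delta[k,l], \qquad 0 \le k \le K_1-1,\ 0 \le l \le K_2-1$$
where $\alpha_{00} = 1$ and $r[k,l]$ are the autocorrelation values of the field $\{x[n,m]\}$, computed as
follows:
Equation 35
$$r[k,l] = \frac{1}{N_1 N_2} \sum_{n} \sum_{m} x[n,m]\, x[n-k,m-l]$$
(the sums run over all pixels for which both samples lie inside the image),
and $\delta[k,l]$ is the 2D Kronecker delta function. Equation 34 is a system of linear
equations that can be written in matrix form and solved for the coefficients $\{\alpha_{ij}\}$.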
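To make the 2D Yule-Walker step concrete, the Python sketch below solves the equations $\sum_{i,j} \alpha_{ij}\, r[k-i,l-j] = \sigma^2 \delta[k,l]$ on a 2 x 2 support with $\alpha_{00} = 1$. The support size, the function names (`solve`, `autocorr`, `yule_walker_2d`) and the biased autocorrelation estimator are assumptions made for illustration. A separable autocorrelation $r[k,l] = \rho_1^{|k|} \rho_2^{|l|}$ is a convenient check, since the exact solution is then known in closed form: $\alpha_{10} = -\rho_1$, $\alpha_{01} = -\rho_2$, $\alpha_{11} = \rho_1\rho_2$ and $\sigma^2 = (1-\rho_1^2)(1-\rho_2^2)$.

```python
def solve(A, y):
    """Solve A z = y by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z


def autocorr(x, k, l):
    """Biased sample autocorrelation (1/(N1*N2)) * sum x[n][m] * x[n-k][m-l]."""
    N1, N2 = len(x), len(x[0])
    total = 0.0
    for n in range(max(0, k), min(N1, N1 + k)):
        for m in range(max(0, l), min(N2, N2 + l)):
            total += x[n][m] * x[n - k][m - l]
    return total / (N1 * N2)


def yule_walker_2d(r):
    """r(k, l) returns the autocorrelation at lag (k, l).
    Returns the AR coefficients on a 2 x 2 support and the noise variance."""
    lags = [(0, 1), (1, 0), (1, 1)]          # unknowns; alpha_00 is fixed to 1
    A = [[r(k - i, l - j) for (i, j) in lags] for (k, l) in lags]
    y = [-r(k, l) for (k, l) in lags]
    alpha = dict(zip(lags, solve(A, y)))
    alpha[(0, 0)] = 1.0
    sigma2 = sum(c * r(-i, -j) for (i, j), c in alpha.items())
    return alpha, sigma2
```

For a real image `x`, the callable passed to `yule_walker_2d` would be `lambda k, l: autocorr(x, k, l)`; the noise field then follows by filtering the image with the estimated coefficients.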
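Finally, the Yule-Walker Least-Squares algorithm is summarized below
1. Estimate the parameters $\{\alpha_{ij}\}$ in an AR$(K_1,K_2)$ model of $\{x[n,m]\}$ by the
Yule-Walker method in Equation 34. Obtain an estimate of the noise field
$\{w[n,m]\}$ as
Equation 36
$$\hat{w}[n,m] = x[n,m] + \sum_{i=0}^{K_1-1} \sum_{\substack{j=0 \\ (i,j) \neq (0,0)}}^{K_2-1} \hat{\alpha}_{ij}\, x[n-i,m-j]$$
for $n = K_1, \ldots, N_1-1$ and $m = K_2, \ldots, N_2-1$.
2. Replace the $w[n,m]$ by $\hat{w}[n,m]$ computed in Step 1. Obtain $\hat{\theta}$ in Equation
30 with $L_1 = K_1 + q_1$ and $L_2 = K_2 + q_2$.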
Tumor detection and classification
The estimated ARMA parameters, $\{\hat{a}_{ij}\}$ and $\{\hat{b}_{ij}\}$, are used as a basis for
inference about the presence of a tumor and its nature: benign or cancerous. We
use the k-means algorithm to segment the breast image into 3 classes: healthy
tissue, benign tumor and cancerous tumor. Our method consists of representing
each pixel in the image by an ARMA model whose parameters are estimated by
using an appropriate neighborhood for the pixel. We make the assumption that
all pixels in the considered neighborhood belong to the same class, and hence,
for computational efficiency, we replace the entire neighborhood by the vector
value of the estimated ARMA parameters. This procedure is repeated for the
entire image, creating a new block by block vector-valued image, which will be
the input to the k-means classifier.
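The clustering step can be sketched with plain Lloyd iterations, which is the textbook form of k-means. The Python example below is a minimal illustration, not the classifier used in the experiments: the function name `kmeans`, the explicit initial centroids and the fixed number of iterations are assumptions, and the toy two-dimensional vectors merely stand in for the block-wise ARMA feature vectors.

```python
def kmeans(points, centroids, iters=20):
    """Plain Lloyd iterations. Returns (labels, centroids)."""
    centroids = [list(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid in squared Euclidean distance.
        labels = [
            min(range(len(centroids)),
                key=lambda c: sum((u - v) ** 2 for u, v in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(v) / len(members) for v in zip(*members)]
    return labels, centroids


# Toy feature vectors forming three well-separated groups (hypothetical data).
features = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0], [10.0, 9.9], [9.9, 10.0]]
labels, centers = kmeans(features, [features[0], features[2], features[4]])
```

Because k-means is unsupervised, mapping the three resulting clusters to healthy tissue, benign tumor and cancerous tumor still requires a separate labelling step, which this sketch does not attempt.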
Simulations
Although the proposed algorithm is independent of the imaging modality of
the breast, we perform our simulations on ultrasound images, collected from the
Radiology department, College of Medicine at the University of Illinois at
Chicago. Our database of cancerous images shows intraductal carcinoma, which
is the most common type of breast cancer in women. Intraductal carcinoma is
usually discovered through a mammogram or an ultrasound as
microcalcifications. Our benign tumor images show the Fibroadenoma of the
breast, which is a benign fibroepithelial tumor characterized by proliferation of
both glandular and stromal elements. Our extensive simulations indicate that
ARMA[2,2,2,2] is a sufficient model order, in terms of mean square error, to
accurately represent ultrasound breast images. Figure 7 shows two ultrasound
images, one with a cancerous tumor and one with a benign tumor, and their
respective 2D-ARMA[2,2,2,2] and 1D-ARMA[2,2] models. It is visually clear that
the 2D-ARMA model accurately represents both ultrasound images, whereas the
1D model fails to capture any image feature.
We estimate the 2D-ARMA parameters using a window of size .
The choice of the window size presents an inherent trade-off between the
accuracy of the representation and the accuracy of the classification. A large
window size would lead to a better representation of the 2D-ARMA model, but
might include pixels from different classes. We found that for 256 × 256 images, a
window size leads to a good segmentation performance. Each image is
therefore represented by a number of 2D-ARMA feature vectors, which
contain the 8 parameters $\{a_{ij}, b_{ij}\}$, $0 \le i,j \le 1$, for each sub-block
image. Without loss of generality, we chose $a_{00} = b_{00} = 1$. Therefore, the
size of the feature vectors reduces to 6 instead of 8. We decide that an image
has a cancerous (resp., benign) tumor if at least one of the sub-block images is
classified as a cancerous (resp., benign) tumor. Otherwise, we conclude that the
image is healthy and contains no tumors.
We conducted our simulations using 573 ARMA feature vectors of healthy,
benign and cancerous ultrasound breast images. The ARMA feature vectors
were used as the input to a k-means classifier. Figure 7(c) and Figure 7(g) show
the segmentation outputs of the cancerous and benign tumor images,
respectively. We can observe clear delineations of the tumors from the healthy
tissues in both cases. The accuracy, sensitivity and specificity of the 2D-ARMA
and 1D-ARMA k-means classifiers are shown in Table 2. It is clear that the 2D-
ARMA feature vectors are more selective than their one-dimensional homologue.
Figure 7 ARMA modeling and segmentation of ultrasound breast images: (a) cancerous
ultrasound image; (b) 2D-ARMA[2,2,2,2] representation of (a); (c) segmentation of (b)
using an appropriate threshold; (d) 1D-ARMA[2,2] representation of (a); (e) benign tumor
ultrasound image; (f) 2D-ARMA[2,2,2,2] representation of (e); (g) segmentation of (f)
using an appropriate threshold; (h) 1D-ARMA[2,2] representation of (e).
Table 2 Classification accuracy of cancerous and benign tumors
           Accuracy   Sensitivity   Specificity
2D-ARMA    92.87%     92.03%        94.14%
1D-ARMA    78.51%     59.54%        79.76%
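For reference, the three figures of merit reported in Table 2 follow from the entries of a confusion matrix in the usual way. The short Python sketch below shows the standard definitions; the function name `classification_metrics` and the counts in the usage line are hypothetical and are not the counts behind Table 2.

```python
def classification_metrics(tp, fn, tn, fp):
    """Standard definitions from confusion-matrix counts:
    tp/fn = diseased cases detected/missed, tn/fp = healthy cases cleared/flagged."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,      # fraction of all decisions that are correct
        "sensitivity": tp / (tp + fn),      # true positive rate
        "specificity": tn / (tn + fp),      # true negative rate
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=90, fn=10, tn=80, fp=20)
```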
Chapter 5 Statistical Sequential Analysis for Detection of
Microcalcifications in Digital Mammograms
Abstract
We formulate the problem of microcalcification detection in digital
mammograms as a statistical change detection problem in the local properties of
the image. First, we represent mammograms by two-dimensional autoregressive
moving-average (2D ARMA) fields; thus uniquely characterizing the images by
their reduced dimensionality 2D ARMA feature vectors. Texture changes in
mammograms are then modeled as an additive change in the mean parameter of
the PDF associated with the 2D ARMA feature vector sequence that describes
the image. A generalized likelihood ratio test is used to detect these changes and
estimate the exact time (or space) where they occur. Our simulation results on
the Digital Database for Screening Mammography hosted by the University of
South Florida show that the decision functions of cancerous images present high
peaks at the microcalcification locations, whereas they exhibit a uniform behavior
for healthy mammograms. The proposed algorithm achieves a sensitivity and
specificity of 96.9% and 97.8%, respectively.
Introduction
With the rapid expansion in the number and volume of digital mammograms and
the increasing demand for fast access to relevant medical data, the
interpretation and retrieval of medical information have become of paramount
importance [33]. Mammography is the current standard for breast cancer
diagnosis. Women 40 years of age or older are recommended to undergo a
screening mammogram to check for breast malignancies every 6 months.
Screening mammograms usually involve two x-rays of each breast. This process
generates a huge amount of data that needs to be processed, interpreted and
saved.
The presence of microcalcifications (tiny deposits of calcium) in the breast
is an important sign for the detection of early breast carcinoma. Accurate and
uniform evaluation of the enormous number of mammograms generated in
widespread screening is a difficult task. 10-30% of breast carcinomas are missed
by trained radiologists [34]. Mammograms are low contrast images, and the
breast malignancies present a great diversity in shape, size and location, and low
distinguishability from the surrounding healthy tissue.
In the last two decades, various computer-aided diagnosis (CAD) systems have
been proposed to help bring suspicious areas on the mammogram to the
radiologist’s attention. Many approaches were considered including denoising
[35], segmentation [36], filtering [37], machine learning [38], [39] and artificial
intelligence [37], mathematical morphology [40], time-frequency analysis and
multiresolution techniques, and neural networks [34]. Despite their technical
differences, these approaches share a common outline: they are all deterministic.
They usually assume a small region of interest as a subject of recognition.
Hence, their performance is contingent upon the natural variability of healthy and
cancerous mammography images.
In contrast to deterministic methods, statistical methods take into account
the noise in the digitized mammogram and the heterogeneity of its characteristics
by considering an underlying probability distribution of the image features. It is,
therefore, surprising that very few researchers have pursued this direction.
Statistical analysis of mammograms was mainly considered in the context of
textural information [41], [42]. In [41], the third and fourth order statistical
moments, skewness and kurtosis, were estimated from the bandpass filtered
mammogram. A region with high positive skewness and kurtosis is marked as a
region of interest. In [42], a statistical model of the mammographic image, termed
the “log-likelihood image”, is generated from the original mammogram image.
However, the method does not include any decision making, and the
log-likelihood image has the same resolution as the original mammogram.
The challenge in breast carcinoma localization is that the detection
algorithm must be able to handle all types of microcalcifications. Therefore, it is
necessary to formulate the detection problem beyond the use of empirical
observations about the type, shape, size or location of microcalcifications, which
may or may not hold in all cases. In order to address these challenges, we pose
the microcalcification detection problem in the context of statistical sequential
representation and analysis of mammograms. A mammogram image is
considered to be a realization of a stochastic process. We use statistical
analysis to detect parameter changes of the stochastic process, which will
indicate the presence of suspicious areas in the breast. In our approach, we
achieve a decision-making CAD system through use of dimensionality reduction
and sufficient statistics. We first show that mammograms can be accurately
modeled as 2D autoregressive moving-average (ARMA) fields, and thus each
image can be solely represented by its 2D ARMA coefficients.
In this chapter, we consider a change detection framework based on
additive modeling. Specifically, we detect changes of the mean parameter of the
PDF associated with the 2D ARMA feature vector sequence. The sufficient
statistic used is based on the generalized likelihood ratio. Thus, the main steps
used for detecting microcalcifications in mammograms are the 2D ARMA
dimensionality reduction of the original image followed by change detection on
the resulting feature vectors. In particular, no a priori assumptions are made
about the specific nature of the microcalcifications (e.g., circular, smooth, etc.).
2D-ARMA representation
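We represent the breast image as a 2D random field $\{x[n,m],\ (n,m) \in \mathbb{Z}^2\}$
[43]. We define an order on the discrete lattice as follows:
$(i,j) \le (s,t) \iff i \le s$ and $j \le t$ [11]. The 2D ARMA$(p_1,p_2,q_1,q_2)$ model is
defined for the $N_1 \times N_2$ image $\{x[n,m] : 0 \le n \le N_1-1,\ 0 \le m \le N_2-1\}$
by the following difference equation
Equation 37
$$x[n,m] + \sum_{i=0}^{p_1-1} \sum_{\substack{j=0 \\ (i,j) \neq (0,0)}}^{p_2-1} a_{ij}\, x[n-i,m-j] = w[n,m] + \sum_{i=0}^{q_1-1} \sum_{\substack{j=0 \\ (i,j) \neq (0,0)}}^{q_2-1} b_{ij}\, w[n-i,m-j]$$
where $\{w[n,m]\}$ is a stationary white noise field with variance $\sigma^2$, and the
coefficients $\{a_{ij}\}$, $\{b_{ij}\}$ are the parameters of the model.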
A Two-stage Yule-Walker Least Squares parameter estimation method
was proposed in [43]. First, the noise sequence is assumed to be
known. The ARMA parameter estimation problem is then reduced to a simple
input-output system identification problem, which is solved by a least-squares
(LS) method. The final estimate is then obtained by estimating the noise, using a
truncated autoregressive (AR) model, and plugging it back in the Least Squares
solution [43].
In practice the ARMA parameters are estimated using a window of size
$B_1 \times B_2$. The choice of the window size presents an inherent trade-off between
the accuracy of the ARMA representation and the reliability of the classification.
An image of size $N_1 \times N_2$ is therefore represented by $K = N_1 N_2 / (B_1 B_2)$
ARMA feature vectors. Let $Y_k$ be the ARMA feature vector of the k-th block,
formed by stacking the estimated coefficients $\{\hat{a}_{ij}\}$ and $\{\hat{b}_{ij}\}$. The
mammogram image is then represented by the sequence $\{Y_k\}$ rather than by the
raw pixels of the unprocessed image. The 2D ARMA
model presents a compressed representation of the image, which will lead to an
efficient implementation of the CAD system. For instance, for a 256 × 256 image,
a 16 × 16 window and an ARMA[2,2,2,2] model, the 2D ARMA model represents a
dimensionality reduction of more than 97% compared to the original image.
Figures 9(b) and 9(f) show the 2D ARMA[2,2,2,2] models of a healthy and a
cancerous mammogram, respectively. Subsection A of the Results section discusses in
detail the choice of the model degree parameters $(p_1, p_2, q_1, q_2)$.
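The quoted reduction can be checked with a line of arithmetic. The Python sketch below assumes a 256 × 256 image, non-overlapping 16 × 16 windows and 6 retained coefficients per window; these values are inferred from the Results section and the figure captions and are stated here as assumptions, and the function name `block_summary` is illustrative.

```python
def block_summary(height, width, block, n_params):
    """Number of non-overlapping blocks and the fraction of values saved when
    each block of pixels is replaced by n_params model coefficients."""
    blocks = (height // block) * (width // block)
    reduction = 1.0 - (blocks * n_params) / (height * width)
    return blocks, reduction


# Assumed setting: 256 x 256 image, 16 x 16 windows, 6 coefficients per window.
blocks, reduction = block_summary(256, 256, 16, 6)
```

Under these assumptions the image is summarized by 256 feature vectors and the reduction is 1 - 6/256, or about 97.7%, which is consistent with the figure of more than 97% stated in the text.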
The problem of tumor detection becomes one of detecting changes in the
parameters of the probability density function (PDF) associated with the ARMA
vector random process.
Change detection algorithm
The 2D ARMA feature vectors are assumed to form an i.i.d. (independent
and identically distributed) sequence of r-dimensional random vectors $\{Y_k\}_{k \ge 1}$,
with Gaussian distribution $\mathcal{N}(\theta, \Sigma)$
having PDF:
Equation 38
$$p_\theta(Y) = \frac{1}{\sqrt{(2\pi)^r \det \Sigma}} \exp\left( -\frac{1}{2} (Y-\theta)^T \Sigma^{-1} (Y-\theta) \right)$$
Observe that the ARMA feature vectors are assumed to be independent.
However, the components of each ARMA feature vector are correlated with
covariance matrix $\Sigma$. The independence of the ARMA feature vectors reflects an
independence assumption between pixels in different sub-blocks of an
image. The tumor detection is modeled as a change in the vector parameter $\theta$
of the PDF characterizing the feature vector random process. Let the
parameter $\theta_0$ be the value before the change, and $\theta_1$ the value after the
change. In general, we have minimal or no information about the parameter $\theta_1$
after change. Let us begin by discussing the scenario where there is a known
upper bound for $\theta_0$ and a known lower bound for $\theta_1$. In this case, the change
detection problem is equivalent to the following:
Equation 39
$$\theta(k) = \begin{cases} \theta : \|\theta - \theta_0\|_\Sigma^2 \le a^2 & \text{when } k < t_0 \\ \theta : \|\theta - \theta_0\|_\Sigma^2 \ge b^2 & \text{when } k \ge t_0 \end{cases}$$
where $\|\theta - \theta_0\|_\Sigma^2 = (\theta - \theta_0)^T \Sigma^{-1} (\theta - \theta_0)$, $t_0$ is the true change time and $a < b$. The case
of interest where $\theta_0$ is assumed to be known, and $\theta_1$ unknown can be obtained
as a limit case of the solution to the above problem.
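The solution to the detection problem formulated in Equation 39 can be
obtained by deriving the generalized likelihood ratio (GLR) test [44], where the
unknown parameters are replaced by their maximum likelihood estimates. Thus,
the generalized likelihood ratio for the sequence $Y_j, \ldots, Y_k$ is:
Equation 40
$$\Lambda_j^k = \frac{\sup_{\|\theta - \theta_0\|_\Sigma \ge b} \prod_{i=j}^{k} p_\theta(Y_i)}{\sup_{\|\theta - \theta_0\|_\Sigma \le a} \prod_{i=j}^{k} p_\theta(Y_i)}$$
where $p_\theta$ is the corresponding parameterized probability density function. The
sequential GLR algorithm is then given by
Equation 41
$$t_a = \min\{k : g_k \ge h\}, \qquad g_k = \max_{1 \le j \le k} \ln \Lambda_j^k$$
where $k$ is the discrete time index, $t_a$ is the alarm (detection) event, $g_k$ is the
test statistic, and $h$ is a threshold.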
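Given the i.i.d. Gaussian assumption, $\ln \Lambda_j^k$ can be written as
Equation 42
$$\ln \Lambda_j^k = \sup_{\|\theta - \theta_0\|_\Sigma \ge b} \left[ -\frac{1}{2} \sum_{i=j}^{k} (Y_i - \theta)^T \Sigma^{-1} (Y_i - \theta) \right] - \sup_{\|\theta - \theta_0\|_\Sigma \le a} \left[ -\frac{1}{2} \sum_{i=j}^{k} (Y_i - \theta)^T \Sigma^{-1} (Y_i - \theta) \right]$$
It can be shown that $\ln \Lambda_j^k$ can be rewritten as [44]
Equation 43
$$\frac{2}{k-j+1} \ln \Lambda_j^k = \begin{cases} -(\chi_j^k - b)^2 & \text{when } \chi_j^k < a \\ -(\chi_j^k - b)^2 + (\chi_j^k - a)^2 & \text{when } a \le \chi_j^k \le b \\ +(\chi_j^k - a)^2 & \text{when } \chi_j^k > b \end{cases}$$
where
Equation 44
$$\chi_j^k = \left[ (\bar{Y}_j^k - \theta_0)^T \Sigma^{-1} (\bar{Y}_j^k - \theta_0) \right]^{1/2}$$
Equation 45
$$\bar{Y}_j^k = \frac{1}{k-j+1} \sum_{i=j}^{k} Y_i$$
Observe that, for the current problem formulation, the data that are needed in
Equation 44 are the feature vectors $Y_i$, the covariance $\Sigma$, and the mean before the
change $\theta_0$.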
In the more realistic case where the parameter $\theta_0$ before the change is
assumed to be known but the parameter $\theta_1$ after the change is assumed to be
completely unknown, the change detection problem statement is as follows
Equation 46
$$\theta(k) = \begin{cases} \theta_0 & \text{when } k < t_0 \\ \theta \neq \theta_0 & \text{when } k \ge t_0 \end{cases}$$
Hence, the case where nothing is known about $\theta_1$ can be considered the
limit of the previous case when $a = b \to 0$.
Therefore, the GLR algorithm in Equation 41 becomes:
Equation 47
$$g_k = \max_{1 \le j \le k} \frac{k-j+1}{2} \left( \chi_j^k \right)^2$$
where $\chi_j^k$ is defined in Equation 44.
In the above study, $\theta_0$ is assumed to be known. In practice, $\theta_0$ can be
estimated using a number of feature vectors at the beginning of each
mammogram. The covariance $\Sigma$ is estimated using the same feature vectors.
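A direct, unoptimized evaluation of the GLR decision function for a known mean before the change and an unknown mean after it can be sketched as follows. The Python code assumes the standard form $g_k = \max_{1 \le j \le k} \frac{k-j+1}{2} (\bar{Y}_j^k - \theta_0)^T \Sigma^{-1} (\bar{Y}_j^k - \theta_0)$; the function names `glr_statistic` and `first_alarm`, the zero-based indexing and the requirement that the caller supplies the inverse covariance are illustrative assumptions. The cost is quadratic in the number of feature vectors, which is acceptable for a few hundred blocks per image.

```python
def glr_statistic(Y, theta0, sigma_inv):
    """Decision function g_k for k = 0, ..., len(Y)-1 (zero-based).
    Y: list of feature vectors, theta0: mean before the change,
    sigma_inv: inverse covariance matrix as a list of rows."""
    r = len(theta0)
    g = []
    for k in range(len(Y)):
        best = 0.0
        running = [0.0] * r                  # sum of Y[j..k], grown backwards in j
        for j in range(k, -1, -1):
            running = [s + y for s, y in zip(running, Y[j])]
            n = k - j + 1
            d = [s / n - t for s, t in zip(running, theta0)]
            quad = sum(d[a] * sigma_inv[a][b] * d[b]
                       for a in range(r) for b in range(r))
            best = max(best, 0.5 * n * quad)
        g.append(best)
    return g


def first_alarm(g, h):
    """Index of the first sample whose statistic reaches the threshold h."""
    for k, value in enumerate(g):
        if value >= h:
            return k
    return None


# Toy sequence: the mean jumps from (0, 0) to (2, 0) at the sixth vector.
Y = [[0.0, 0.0]] * 5 + [[2.0, 0.0]] * 3
g = glr_statistic(Y, [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
alarm = first_alarm(g, 3.0)
```

The maximizing index j also serves as an estimate of the change location, which is what allows a peak to be traced back to a block of the image.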
Results
A. 2D ARMA Model
We test the proposed algorithm using the University of South Florida
digital mammography library available online at: http://marathon.csee.usf.edu.
The Digital Database for Screening Mammography (DDSM) is a resource for use
by the mammographic image analysis research community. Each mammogram
image is 256 × 256 pixels. The ARMA parameters were estimated using a
window of size 16 × 16. Hence, each image is represented by 256 ARMA
feature vectors $Y_k$. We find the optimal ARMA degree model $(p_1,p_2,q_1,q_2)$ as
the degree that minimizes the average square error between the original image
and the predicted ARMA model. An exhaustive off-line search over the candidate
degrees reveals that ARMA[2,2,2,2] leads to the
smallest average square error for most mammogram images in the DDSM
library. Figure 8 shows 2D-ARMA models of an original healthy mammogram.
B. Change Detection Algorithm
We can estimate the value of $\theta_0$ (parameter before the change) as the
sample mean of the first 10 feature vectors $Y_k$. Another approach is to estimate
the value of $\theta_0$ from the entire mammogram image. This method yields an
estimation error not higher than the relative size of the microcalcifications in the
image, i.e. about 1%. For both methods, estimation of the parameter $\theta_0$ yielded
similar values. The detection algorithm is based on the value of the threshold $h$,
which was chosen experimentally. Figure 10 displays the decision function $g_k$ of
four sample mammograms including two cancerous and two normal. The
cancerous images exhibit peaks that are, on average, twice as high as those of
healthy images. Therefore, we found that a value of $h$ equal to the mean of the highest
cancerous peak and the highest normal peak achieves an optimal balance
between false and missed alarms. Figure 9 shows a plot of the average grey level
of the 16 × 16 sub-images of healthy and cancerous mammograms. It is seen
that a simple plot of the grey level values of the mammograms does not
discriminate between healthy and cancerous mammograms. However, the
proposed change detection algorithm leads to a decision function that is uniform
for healthy mammograms and spiky for cancerous mammograms, where the
spikes indicate the position of microcalcifications.
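The threshold rule described above amounts to taking the midpoint between the highest peak observed on normal images and the highest peak observed on cancerous images. The Python sketch below spells this out; the function name `midpoint_threshold` and the decision-function values are hypothetical and serve only to illustrate the rule.

```python
def midpoint_threshold(normal_curves, cancerous_curves):
    """Mean of the highest normal peak and the highest cancerous peak."""
    top_normal = max(max(g) for g in normal_curves)
    top_cancer = max(max(g) for g in cancerous_curves)
    return 0.5 * (top_normal + top_cancer)


# Hypothetical decision functions for two normal and two cancerous images.
normal = [[1.0, 2.0, 3.0], [2.0, 4.0, 1.0]]
cancerous = [[1.0, 8.0, 2.0], [3.0, 10.0, 1.0]]
h = midpoint_threshold(normal, cancerous)
flagged = [max(g) >= h for g in normal + cancerous]
```

A threshold chosen this way depends on the particular training images, so in practice it would need to be validated on mammograms that were not used to set it.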
By lexicographical ordering of the image and its feature vectors, we are
able to not only discriminate between normal and cancerous mammograms but
also pinpoint the exact location of microcalcifications in the cancerous image.
The peaks of the decision function can be easily traced back to the suspicious
areas. Figure 11 shows a radiologist’s marked area of suspicion, which is
successfully detected as cancerous by our algorithm. Table 3 displays the
performance of our algorithm based on 524 normal and cancerous digital
mammograms from the DDSM library. Based on this analysis, the sensitivity
and specificity of the proposed algorithm are 96.9% and 97.8%, respectively.
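Tracing a peak of the decision function back to the image only requires inverting the lexicographic ordering of the blocks. The Python sketch below assumes row-major ordering of non-overlapping 16 × 16 blocks in a 256-pixel-wide image; the ordering convention, the default sizes and the function name `block_to_pixels` are assumptions made for illustration.

```python
def block_to_pixels(k, width=256, block=16):
    """Map a row-major block index k to its pixel bounds
    (row_start, row_end, col_start, col_end), with exclusive ends."""
    per_row = width // block
    block_row, block_col = divmod(k, per_row)
    return (block_row * block, (block_row + 1) * block,
            block_col * block, (block_col + 1) * block)


# Block 17 is the second block of the second block row.
bounds = block_to_pixels(17)
```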
Table 3 PERFORMANCE OF THE CHANGE DETECTION ALGORITHM IN 524 NORMAL AND
CANCEROUS MAMMOGRAMS
            True    False
Positive    96%     4%
Negative    97%     3%
Figure 8 2D ARMA Modeling (a) Original (healthy) mammogram; (b) 2D ARMA[2,2,2,2]
model of (a); (c) 2D ARMA[3,3,3,3] model of (a); (d) 2D ARMA[4,4,4,4] model of
(a); (e) 2D ARMA[6,6,6,6] model of (a);
Figure 9 Change detection algorithm (a) A normal (healthy) mammogram; (b) 2D ARMA[2,2,2,2]
model of (a); (c) Plot of the average gray level of the 16x16 sub images in (a); (d) Plot of
the decision function for the image in (a); (e) original cancerous image; (f) 2D
ARMA[2,2,2,2] model of (e); (g) Plot of the average gray level of the 16x16 sub images in
(e); (h) Plot of the decision function for the image in (e);
Figure 10 The decision function for four mammograms; cancerous in red/magenta and normal
in blue/green. The value of the threshold is determined as the mean of the highest
normal peak and the highest cancerous peak.
Figure 11 Change detection algorithm: (a) radiologist-marked area of interest; (b) plot of the
decision function of the mammogram, arrows indicate the peaks above the threshold;
(c) marked 16x16 clusters that correspond to the detected peaks
Chapter 6 Automated Biofilm Region Recognition And
Morphology Quantification from Confocal Laser
Scanning Microscopy Imaging
Abstract
Staphylococcus aureus is an opportunistic human pathogen and a primary
cause of nosocomial infections. Its biofilm forming capability is an adaptation
strategy utilized by many species of bacteria to overcome stressful environmental
conditions and provides both resistance to antimicrobial treatments and
protection from the host immune system. This chapter addresses a growing
demand for an objective, fully automated method of biofilm structure description
with standardized parameters that are independent of user input. In this study,
we used watershed segmentation to analyze and compare confocal laser
scanning microscopy (CLSM) images of two S. aureus strains with different
biofilm-forming capabilities. Results are compared with manual calculations as
well as the commonly used COMSTAT software.
Introduction
Staphylococcus aureus is an opportunistic human pathogen responsible
for a wide range of diseases that vary in clinical presentation and severity. S.
aureus can cause diseases ranging from minor skin infections to life-threatening
conditions such as pneumonia, osteomyelitis and toxic shock syndrome. Recent
significant changes in health care delivery and antimicrobial resistance patterns
have caused a shift in the epidemiology of S. aureus. Recently, this has been
evidenced by a dramatic increase in methicillin-resistant S. aureus (MRSA)
infection rates which, at least in the United States, has led MRSA mortality rates
to be higher than those of HIV [45]. The public health concern caused by S.
aureus-related infections has led to extensive efforts put into improving the
efficacy of available therapies as well as introducing new pharmaceuticals. Both
strategies are challenged by the fact that S. aureus infections are associated with
formation of a biofilm, which limits the efficacy of therapy by creating a resistance
to antimicrobials and by protecting the bacteria from the host immune system. In
order to conduct studies on targeting biofilms therapeutically, it is necessary to
be able to quantitatively measure biofilm morphological characteristics like area,
biomass and thickness. In this chapter, we consider a clinical isolate (UAMS-1),
which forms robust, dense and uniformly distributed biofilm as well as its isogenic
variant carrying a mutation in the sarA gene, which limits its ability to form a biofilm.
For the assessment of biofilm structure, CLSM has been described as an
ideal technique [46]. By using several fluorescent stains or conjugated
antibodies in combination with multichannel CLSM, the 3D location of different
biofilm constituents can be recorded. Using these data sets, the
three-dimensional architecture of the biofilms can be reconstructed and quantified
with digital image analysis. There is a wide range of commercially available
software capable of analyzing biofilm morphology, including COMSTAT, ImageJ,
ISA3D and Volocity. However, they all rely on thresholding to segment the
biofilm. Specifically, the automated segmentation procedure is implemented in
two steps: (i) thresholding using user-dependent parameters [47], [48], followed
by (ii) connected volume filtration [49]. The purpose of this work is to create a
fully automated method of biofilm segmentation and quantification that does not
rely on user input or thresholding.
Quantification of biofilm structure
Quantitative parameters describing the biofilm physical structure have
been extracted from three-dimensional confocal laser scanning microscopy
images and used to compare different biofilm structures. Quantitative descriptive
parameters of biofilm chosen for this study are: (i) area occupied by biomass in
each cross section, (ii) biomass in biovolume, (iii) thickness distribution and (iv)
average thickness.
Morphology quantification parameters
The following parameters are used to describe the biofilm structure:
• area occupied by biomass in each cross-section, $A_z$: measured as the
total sum of all the unit areas (pixels) of each CLSM cross section
categorized as occupied area.
Equation 48
$$A_z = \oint_{C_z} dA = \sum_{i} c_i$$
where:
o $A_z$: occupied area in cross section z,
o $C_z$: closed contour of occupied area,
o $c_i$: cell of a cross section recognized as occupied area
• biomass in biovolume, V: measured from numeric integration of the area
of microbial colonization profiles, following a method previously described
in [50]
Equation 49
$$V = \int A_z \, dz \approx \Delta z \sum_{z=1}^{N} \Big[ \sum_{i} c_i \Big]$$
where:
o $N$: number of horizontal cross-sections,
o $\Delta z$: z-step in the image stack.
• thickness distribution: the number of occupied clusters in each cross-
section over the total number of clusters in a cross-section of the CLSM
3D image.
• average thickness: calculated as the average height to which the clusters
of the biofilm rise from the solid substratum in the z direction between
cross-sections.
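The four parameters above can be computed from a segmented (binary) image stack with a few sums. The Python sketch below is a minimal illustration under stated assumptions: the stack is a nested list indexed as `stack[z][y][x]` with slice 0 at the substratum, the thickness of a column is taken as the height of its topmost occupied voxel, and the average is taken over occupied columns only. The function name `biofilm_parameters` and these conventions are illustrative and are not taken from the original Matlab code or from COMSTAT.

```python
def biofilm_parameters(stack, pixel_area=1.0, dz=1.0):
    """stack[z][y][x] is 1 where a voxel was segmented as biomass, else 0.
    Returns (area per slice, biovolume, occupied fraction per slice,
    mean thickness over occupied columns)."""
    n_rows, n_cols = len(stack[0]), len(stack[0][0])
    # Area per cross-section: occupied pixels times the unit pixel area.
    areas = [pixel_area * sum(sum(row) for row in sl) for sl in stack]
    # Biovolume: rectangle-rule integration of the area profile along z.
    biovolume = dz * sum(areas)
    # Thickness distribution: occupied fraction of each cross-section.
    coverage = [a / (pixel_area * n_rows * n_cols) for a in areas]
    # Height of the topmost occupied voxel in every (y, x) column.
    heights = []
    for y in range(n_rows):
        for x in range(n_cols):
            top = max((z + 1 for z in range(len(stack)) if stack[z][y][x]),
                      default=0)
            heights.append(top * dz)
    occupied = [h for h in heights if h > 0]
    mean_thickness = sum(occupied) / len(occupied) if occupied else 0.0
    return areas, biovolume, coverage, mean_thickness


# Tiny 3-slice, 2 x 2 stack used only as an illustration.
stack = [
    [[1, 1], [1, 0]],
    [[1, 0], [0, 0]],
    [[0, 0], [0, 0]],
]
areas, volume, coverage, thickness = biofilm_parameters(stack, pixel_area=0.5, dz=2.0)
```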
Based on the four aforementioned “basic” parameters, other
characteristics of the biofilm can be calculated. For example, after the biomass is
segmented from the background, a number of features including roughness of
the film, porosity, thickness, etc. can be obtained. Those parameters can be used
together to uniquely describe the biofilm structure and to eventually differentiate
between different biofilm strains.
Image processing tool
The suite of image processing operations was implemented in the Matlab
programming environment (Matlab 2010a, The MathWorks, Inc.). Matlab was
chosen for the convenience it offers for matrix computation. To evaluate our
results, we used manually calculated data as a baseline and compared against
COMSTAT, a widely used Matlab program. Our approach uses the watershed
segmentation method based on Fernand Meyer’s algorithm [51].
Preprocessing and used methodology
Segmentation is one of the most difficult image processing operations.
The biofilm segmentation task is even harder because the biofilm is a
disconnected structure. This difficulty may explain the use of simple thresholding
in widely adopted biofilm analysis systems such as COMSTAT. Nonetheless,
after trying several segmentation algorithms, it became apparent that the
watershed transformation provides the most accurate segmentation of the biofilm
structure. The watershed transformation finds “catchment basins” and
“watershed ridge lines” in an image by treating it as a surface where light pixels
are high (area of interest) and dark pixels are low (background). Segmentation
using the watershed transformation works best if one can identify, or “mark,”
foreground objects and background locations. This marking process is done
automatically with reference to the black background on the CLSM image.
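One way such automatic marking could look is sketched below in Python, for illustration only. Pixels at or below a background level become the background marker, and each 4-connected group of bright pixels receives its own foreground label. The function name make_markers, the two thresholds bg_max and fg_min, and the choice of 4-connectivity are assumptions of this sketch. The text does not state how the markers were actually derived beyond the reference to the black background.

```python
from collections import deque


def make_markers(image, bg_max=0, fg_min=200):
    """Return a label image: 0 = undecided, 1 = background, 2.. = one label per bright blob."""
    h, w = len(image), len(image[0])
    # Background marker: every pixel at or below the (assumed) black level.
    markers = [[1 if image[y][x] <= bg_max else 0 for x in range(w)] for y in range(h)]
    next_label = 2
    for y in range(h):
        for x in range(w):
            if image[y][x] < fg_min or markers[y][x]:
                continue
            # Breadth-first flood fill over 4-connected bright pixels.
            markers[y][x] = next_label
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and not markers[ny][nx] and image[ny][nx] >= fg_min):
                        markers[ny][nx] = next_label
                        queue.append((ny, nx))
            next_label += 1
    return markers


# Hypothetical 8-bit cross-section: two bright blobs on a black background.
image = [
    [0, 0, 0, 0, 0],
    [0, 250, 90, 0, 0],
    [0, 240, 60, 220, 0],
    [0, 0, 0, 0, 0],
]
markers = make_markers(image)
for row in markers:
    print(row)
# [1, 1, 1, 1, 1]
# [1, 2, 0, 1, 1]
# [1, 2, 0, 3, 1]
# [1, 1, 1, 1, 1]
```

The pixels left at 0 (the dim values 90 and 60) are the ones a subsequent watershed step would have to assign to an object or to the background.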
Marker-controlled watershed segmentation follows this basic procedure:
1. Compute a segmentation function. This is an image whose dark regions
are the objects to be segmented.
2. Compute foreground markers. These are connected groups of pixels
within each of the objects.
3. Compute background markers, using the gradient magnitude as the
segmentation function. These are pixels that are not part of any object.
4. Modify the segmentation function so that it only has minima at the
foreground and background marker locations.
5. Compute the watershed transform of the modified segmentation function.
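The flooding idea behind steps 4 and 5 can be sketched in a few lines of Python. This is a simplified priority-flood variant in the spirit of Meyer's flooding algorithm, not the implementation used in this work. It floods only from the supplied markers, which has the same effect as restricting the minima to the marker locations, and it assigns every pixel to a region instead of producing ridge-line pixels. The function name marker_watershed, the 4-connectivity, and the first-in, first-out tie-breaking are assumptions of the sketch.

```python
import heapq


def marker_watershed(relief, markers):
    """Flood `relief` from the labeled pixels in `markers` (0 = unlabeled)."""
    h, w = len(relief), len(relief[0])
    labels = [row[:] for row in markers]  # work on a copy of the markers
    heap = []
    count = 0  # tie-breaker: pixels at equal level are processed first-in, first-out
    for y in range(h):
        for x in range(w):
            if labels[y][x]:
                heapq.heappush(heap, (relief[y][x], count, y, x))
                count += 1
    while heap:
        level, _, y, x = heapq.heappop(heap)
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and labels[ny][nx] == 0:
                # The neighbor joins the basin that reaches it at the lowest water level.
                labels[ny][nx] = labels[y][x]
                heapq.heappush(heap, (max(level, relief[ny][nx]), count, ny, nx))
                count += 1
    return labels


# Hypothetical one-row relief: two basins separated by a ridge of height 5.
relief = [[0, 1, 2, 5, 2, 1, 0]]
markers = [[1, 0, 0, 0, 0, 0, 2]]
labels = marker_watershed(relief, markers)
print(labels)  # [[1, 1, 1, 1, 2, 2, 2]]
```

In this toy case the ridge pixel is reached by both basins at the same level and goes to label 1 only because of the tie-breaking order. An implementation that keeps explicit ridge lines would mark it as a boundary instead.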
Growth and CLSM of static biofilm
Wells of Costar 3596 plates (Corning Life Sciences, Acton, MA) were coated
overnight at 4 °C with 20% human plasma (Sigma) in bicarbonate buffer.
  • 10. x Table of Contents List of Tables...................................................................................................................xiii List of Figures................................................................................................................. xiv Chapter 1 Introduction...................................................................................................1 Problem Statement ........................................................................................................1 Research Objectives......................................................................................................2 Motivation.......................................................................................................................4 Research Contributions .................................................................................................5 Organization...................................................................................................................6 Chapter 2 Biology Background ...................................................................................10 Genomics.....................................................................................................................11 Breast Cancer..............................................................................................................14 Biofilm ..........................................................................................................................15 Chapter 3 Time-Dependent ARMA Modeling of Genomic Sequences .......................18 Abstract........................................................................................................................18 Background..................................................................................................................20 Methods 
.......................................................................................................................23 Mean-squares estimation.............................................................................................26 Least-squares estimation..........................................................................................27 Index of randomness ................................................................................................31 Results.........................................................................................................................33
  • 11. xi Chapter 4 Two-Dimensional ARMA Modeling for Breast Cancer Detection and Classification .....................................................................................................................6 Abstract..........................................................................................................................6 Introduction ....................................................................................................................7 2D-ARMA Modeling..................................................................................................10 Yule-Walker Least-Squares Parameter Estimation ..................................................11 Tumor detection and classification...............................................................................15 Simulations ..................................................................................................................15 Chapter 5 Statistical Sequential Analysis for Detection of Microcalcifications in Digital Mammograms .................................................................................................................18 Abstract........................................................................................................................18 Introduction ..................................................................................................................18 2D-ARMA representation.............................................................................................21 Change detection algorithm.........................................................................................23 Results.........................................................................................................................27 A. 2D ARMA Model...................................................................................................27 B. 
Change Detection Algorithm ................................................................................28 Chapter 6 Automated Biofilm Region Recognition And Morphology Quantification from Confocal Laser Scanning Microscopy Imaging ...............................................................32 Abstract........................................................................................................................32 Introduction ..................................................................................................................32 Quantification of biofilm structure.................................................................................34
  • 12. Morphology quantification parameters
    Image processing tool
    Preprocessing and used methodology
    Growth and CLSM of static biofilm
    Results
  Chapter 7 Three Dimensional Morphology Quantification of Biofilm Structures from Confocal Laser Scanning Microscopy Images
    Abstract
    Introduction
    Average Run Length
    Aspect Ratio
    Average and Maximum Diffusion Distance
    Biomass
    Average Thickness
    Application to CLSM Biofilm Images
    Biofilm culture preparation and image acquisition
    Segmentation and parameter quantification results
  Conclusion and Recommendation
  References
  • 13. List of Tables
  Table 1 Index of randomness of the coding and non-coding segments of various gene sequences
  Table 2 Classification accuracy of cancerous and benign tumors
  Table 3 Performance of the change detection algorithm in 524 normal and cancerous mammograms
  Table 4 UAMS-1 sarA- results of biomass and average thickness calculations using the watershed algorithm and COMSTAT software
  Table 5 UAMS-1 results of biomass and average thickness calculations using the watershed algorithm and COMSTAT software
  Table 6 Average error, relative to manual calculations, across all layers in confocal imaging using the watershed algorithm and COMSTAT software
  Table 7 Results of biofilm parameter quantification for stack 1 for 3D and 2D segmentations in comparison with the ground truth
  Table 8 Results of biofilm parameter quantification for stack 2 for 3D and 2D segmentations in comparison with the ground truth
  Table 9 Results of biofilm parameter quantification for stack 3 for 3D and 2D segmentations in comparison with the ground truth
  • 14. List of Figures
  Figure 1 Three hypotheses explaining the formation of intrinsic levels of antibiotic resistance. Adapted from Stewart et al., 2001
  Figure 2 Gene Structure. Gene structure of the Human gene 276 located in chromosome I: The boxes correspond to the exons (coding regions), and the lines between them represent the introns (non-coding regions). The total length of the gene is N=8208 bases, including 1536 coding and 6672 non-coding bases
  Figure 3 DNA Walk. DNA walk of the Human gene 276
  Figure 4 TD-ARMA modeling. TD-ARMA modeling of the Human gene 276: The blue signal is the DNA walk, and the red signal is the TD-ARMA(1,1) fitted signal. The TD-ARMA(1,1) model accurately fits the genomic signal
  Figure 5 TD-ARMA coefficients estimation. Estimation of the TD-ARMA(1,1) coefficients of the Human gene 276. The TD-ARMA(1,1) model is given by . The blue and black (resp. red and green) curves depict the time series (resp ) for the coding and non-coding regions of the gene, respectively
  Figure 6 Curve of randomness. The curves of randomness of the coding and non-coding regions of the Human gene 276 are shown in blue and red, respectively. The index of randomness of the coding sequence is equal to 1.0603, whereas its corresponding value for the non-coding sequence is equal to 1.0627
  Figure 7 ARMA modeling and segmentation of ultrasound breast images: (a) cancerous ultrasound image; (b) 2D-ARMA[2,2,2,2] representation of (a); (c) segmentation of (b) using an appropriate threshold; (d) 1D-ARMA[2,2] representation of (a); (e) benign tumor ultrasound image; (f) 2D-ARMA[2,2,2,2] representation of (e); (g) segmentation of (f) using an appropriate threshold; (h) 1D-ARMA[2,2] representation of (e)
  • 15. Figure 8 2D ARMA Modeling: (a) original (healthy) mammogram; (b) 2D ARMA[2,2,2,2] model of (a); (c) 2D ARMA[3,3,3,3] model of (a); (d) 2D ARMA[4,4,4,4] model of (a); (e) 2D ARMA[6,6,6,6] model of (a)
  Figure 9 Change detection algorithm: (a) a normal (healthy) mammogram; (b) 2D ARMA[2,2,2,2] model of (a); (c) plot of the average gray level of the 16x16 sub-images in (a); (d) plot of the decision function for the image in (a); (e) original cancerous image; (f) 2D ARMA[2,2,2,2] model of (e); (g) plot of the average gray level of the 16x16 sub-images in (e); (h) plot of the decision function for the image in (e)
  Figure 10 The decision function for four mammograms: cancerous in red/magenta and normal in blue/green. The value of the threshold is determined as the mean of the highest normal peak and the highest cancerous peak
  Figure 11 Change detection algorithm: (a) radiologist-marked area of interest; (b) plot of the decision function of the mammogram, where arrows indicate the peaks above the threshold; (c) marked 16x16 clusters that correspond to the detected peaks
  Figure 12 Confocal images and their segmentations: A-D: images from the UAMS-1 sarA- mutant (section 1), their respective segmentations with the watershed algorithm (section 2), COMSTAT analysis (section 3)
  Figure 13 Confocal images and their segmentations: E-H: images from the UAMS-1 (section 1), their respective segmentations with the watershed algorithm (section 2), COMSTAT analysis (section 3)
  • 16. Figure 14 Mean Square Error of 2D gradient-based segmentation (black), 2D watershed segmentation (green) and 3D gradient-based segmentation (red) applied to a CLSM z-stack of images
  Figure 15 Combined Mean Square Error of 2D gradient-based (black), 2D watershed (green) and 3D gradient-based segmentations
  Figure 16 2D and 3D gradient-based segmentation: Column 1: the original CLSM images; Column 2: 3D gradient-based segmentation; Column 3: 2D gradient-based segmentation; Column 4: 2D watershed segmentation
  • 17. Chapter 1 Introduction
Biological and biomedical signals are acquired by a range of techniques across all biological scales, going far beyond the visible-light photographs and microscope images of the early 20th century. Techniques in use today include confocal scanning microscopy, X-ray microscopy, and electron microscopy, with extensive use of Digital Signal Processing (DSP) techniques and reconstruction algorithms in two, three, and higher dimensions. Modern medical images are geometrically arranged arrays of data samples. The broadening scope of imaging as a way to organize our observations of the biophysical world has led to a dramatic increase in our ability to apply new processing techniques and to combine multiple channels of data into sophisticated and complex mathematical models of physiological function and dysfunction. With the explosion in the amount of data produced in the field of biomedicine, it is crucial to be able to construct accurate mathematical models of the data at hand. The two main purposes of signal modeling are data size reduction and parameter extraction.
Problem Statement
Over the past century, the fields of microbiology and biomedicine have undergone a revolution. We have gone from the microscope to computerized, highly sophisticated acquisition methods that can scan the surrounding environment with accuracy and precision. The amount of data that is being
  • 18. produced in the scanning process is enormous and has become very difficult for a human to analyze efficiently. The main problem facing scientists confronted with a large data pool is translating digital information into meaningful data that physicians and biologists can use in their studies. Among the many tasks involved, the most important are:
  - Feature extraction, which is a special form of dimensionality reduction. It is used for images that are either too large to process or redundant in nature. In both cases the input data can be transformed into a reduced representation, a set of features (also called a feature vector).
  - Segmentation, which simplifies and/or changes the representation of an image and then divides it into more meaningful subsections.
  - Development of accurate and robust Computer Aided Diagnostic (CAD) systems for biomedical applications, which can deliver faster and more accurate results in biomedical imaging with minimal or no human involvement.
Research Objectives
The goal of this work is to research different ways of analyzing biological and biomedical signals with Digital Signal Processing, specifically to
  • 19. develop methods for parameter extraction that biologists and physicians can use in developing new patient treatment techniques. The ultimate goal is to develop accurate and robust models for Computer Aided Diagnostic systems in two areas: detection of microcalcifications and cancerous tissue in the breast from X-ray mammograms and ultrasound imaging, and segmentation of biofilms of Staphylococcus aureus colonies from Confocal Laser Scanning Microscopy (CLSM). The goal of this research is realized through the following objectives:
  1. Development of a non-stationary modeling technique for DNA data
  2. Development of a 2D-ARMA technique for image signal modeling
  3. Use of the extracted features of the ARMA model for biological and clinical classification of microcalcifications in X-ray breast mammography and ultrasound imaging
  4. Formulation of the problem of microcalcification detection as a change detection hypothesis testing problem
  5. Development of an accurate and robust segmentation technique for Confocal Laser Scanning Microscopy, leading to more accurate quantification of biofilm
  6. Development of three-dimensional (3D) segmentation techniques that take into consideration the spatial relationships between pixels
  7. Development of a fully automated method of quantifying staphylococcal biofilm images from confocal laser scanning microscopy (CLSM) with standardized parameters that are independent of user input
  • 20. Motivation
Breast cancer continues to be a significant public health problem in the United States: it is the second leading cause of cancer mortality among women, and, disturbingly, one out of eight women in the United States will be diagnosed with breast cancer in her lifetime. Until the cause of this disease is fully understood, early detection remains the only hope to improve breast cancer prognosis and treatment. Breast cancer screening modalities are mainly based on clinical examination, mammography, ultrasound imaging, magnetic resonance imaging (MRI), and core biopsy. Mammography (breast X-ray imaging) is by far the fastest and cheapest screening test for breast cancer. Unfortunately, mammograms are also among the most difficult radiological images to interpret: they are of low contrast, and features indicative of breast disease are often very small.
Staphylococcus aureus is an opportunistic human pathogen responsible for a wide range of diseases that vary in clinical presentation and severity. S. aureus can cause infections ranging from minor skin lesions to life-threatening conditions such as pneumonia, osteomyelitis and toxic shock syndrome. Recent significant changes in health care delivery and antimicrobial resistance patterns have caused a shift in the epidemiology of S. aureus. A recent report estimated the number of invasive infections caused by methicillin-resistant S. aureus (MRSA) in 2005 in the United States alone at 94,360, with 18,650 of these resulting in a fatal outcome. This means that S. aureus infection has now passed AIDS as a cause of death in the United States. Based on such considerations, S.
  • 21. aureus is arguably a greater clinical concern now than at any time since the pre-antibiotic era.
Research Contributions
This work contributes to the field of computational bioinformatics and biology through the application of information theory and communication theory to the study and analysis of genetic sequences. Further, this work presents novel techniques for the detection and recognition of breast cancer in its early stages of development, as well as for the classification of developed tumors from ultrasound and X-ray mammography imaging. This work also describes novel methods of parameter extraction from Confocal Laser Scanning Microscopy (CLSM) imaging of biofilm formation in bacterial development. All of the described techniques contribute semi-automated and fully automated tools that scientists and physicians can use to develop treatments for diseases with high mortality in developed countries. Specific contributions of this work to the field of biomedical science include:
  1. Derivation of a non-stationary TD-ARMA model, for which novel estimation algorithms for the TD-ARMA parameters were introduced, resulting in a more robust and precise model of non-stationary signals
  2. Formulation of TD-ARMA modeling as a non-stationary model of genomic sequences, which resulted in better understanding and better
  • 22. recognition of coding and non-coding regions of DNA sequences of highly evolved organisms
  3. Use of the Wold decomposition theorem, applied to ultrasound breast images treated as random fields, for two-dimensional ARMA modeling
  4. Derivation of a 2D-ARMA model that takes into consideration the distributed correlation between points in space and/or time, resulting in a better understanding of higher-dimensional modeling techniques
  5. Formulation of the problem of microcalcification detection in digital breast images as a statistical change detection problem in the local properties of the image. The solution not only detected the presence of microcalcifications but also gave an accurate estimate of their location within the breast
  6. Derivation of a Computer Aided Diagnostic system for the automatic quantification of biologically relevant parameters of biofilms formed by Staphylococcus aureus colonies, which were further used in microbiology studies of the virulence and treatment resistance of certain strains of bacteria
Organization
This thesis is organized as follows: In Chapter 2, we provide a cursory overview of the biology background, which includes a brief introduction to genomics with emphasis on the structure of DNA. Next, we describe the breast cancer types that are most common among the female
  • 23. population in developed countries. Finally, we present a description of biofilm genesis and the different areas where mathematical biofilm modeling can be used. Grasping the essence of the biological inspirations of this work is crucial to understanding its motivation, assumptions and theoretical results.
In Chapter 3 we model non-stationary genomic sequences by a time-dependent autoregressive moving average (TD-ARMA) model. By expressing the time-dependent coefficients as linear combinations of parametric basis functions, we were able to transform a linear non-stationary problem into a linear time-invariant problem. Subsequently, we proposed three methods to estimate the time-dependent coefficients: mean-square, least-squares, and recursive least-squares algorithms. Based on the estimated TD-ARMA coefficients, we defined an index of randomness to quantify the degree of randomness of both coding and non-coding sequences; the results are reported in that chapter.
In Chapter 4 we propose to exploit the high spatial correlation inherent in neighboring pixels to improve tumor detection and classification in ultrasound breast images. We achieve this goal by using a two-dimensional autoregressive moving average (ARMA) field model of the image. Current techniques often rely on one-dimensional representations of the image in terms of its scan lines in order to process it as a one-dimensional time series [5], [6]. Such one-dimensional projections are advocated solely on the basis of the simplicity of their mathematical formulations. The analysis of two-dimensional fields is more involved mathematically and computationally than the study of one-dimensional time series. In this work, we derive an efficient two-stage algorithm to estimate
  • 24. the parameters of the two-dimensional ARMA field model of the breast image. The estimated ARMA parameters are excellent discriminative features, which are used as the basis for statistical detection and classification of tumors in the breast image.
In Chapter 5, we introduce a new approach to the problem of malignancy detection in digital mammograms using statistical sequential analysis theory. The statistical approach inherently takes into account the noise in the image (from the imaging device and the digitization) and the great variety of healthy and cancerous mammograms by considering an underlying probability distribution of the image characteristics. For increased efficiency, the dimensionality of the original images is reduced using 2D-ARMA modeling, which is shown to accurately represent mammograms. The change detection algorithm is applied to the low-dimensional 2D-ARMA feature vectors rather than to the pixels of the raw image.
In Chapter 6 we propose an algorithm that efficiently segments and quantifies images without relying on the manual setup of a threshold. The average error of the results obtained with the watershed-based algorithm, calculated relative to manual analysis, was comparable to that obtained with the COMSTAT software.
In Chapter 7 we show the importance of 3D analysis of biofilm structures, which yields more accurate morphological parameter quantification for clinical and biological assessment of the biofilm than sophisticated 2D-based analysis such as watershed segmentation. Two-dimensional analysis of the biofilm morphology treats the CLSM images independently from each other, whereas
  • 25. the 3D analysis takes into account the temporal and spatial correlation between stacked images.
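The distinction drawn above between slice-by-slice and volumetric analysis can be illustrated with a small sketch. It is not the segmentation method of this thesis: it only compares a per-slice gradient magnitude with a volume gradient magnitude on a toy stack, assuming NumPy and a (z, y, x) array layout. The function names and the toy data are hypothetical.

```python
import numpy as np

def gradient_magnitude_2d(stack):
    """Per-slice gradient magnitude: each z-slice is processed independently."""
    out = np.empty(stack.shape, dtype=float)
    for z in range(stack.shape[0]):
        gy, gx = np.gradient(stack[z].astype(float))
        out[z] = np.hypot(gx, gy)
    return out

def gradient_magnitude_3d(stack):
    """Volume gradient magnitude: intensity changes along z are included."""
    gz, gy, gx = np.gradient(stack.astype(float))
    return np.sqrt(gx**2 + gy**2 + gz**2)

# Toy stack (assumed layout: z, y, x). Every slice is uniform, but the
# intensity jumps between slice 1 and slice 2.
stack = np.zeros((4, 5, 5))
stack[2:] = 1.0

g2 = gradient_magnitude_2d(stack)  # sees no edge: each slice is flat
g3 = gradient_magnitude_3d(stack)  # responds to the jump along z
print(g2.max(), g3.max())
```

In this toy case the boundary lies entirely between slices, so only the volumetric gradient responds to it. Real CLSM stacks usually have a coarser z spacing than in-plane spacing, which a fuller treatment would pass to `np.gradient` as sample distances.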
  • 26. Chapter 2 Biology Background
Biomedical signals are observations of physiological activities that can be obtained from a biological system. This diverse group of signals ranges from observations on a molecular level, such as gene and protein sequences, to macroscopic images of organs. The processing of biomedical signals aims at extracting only the significant information from an often overwhelming amount of data. What constitutes the information of interest depends on the specific application. The purpose of signal processing is therefore to selectively eliminate irrelevant information from the signal so as to make the information of interest more easily accessible to a human observer or a computer system. In the past, the primary application of signal processing was to filter signals and remove noise. Background noise, arising from either the imprecision of instruments or the biological systems themselves, was eliminated primarily by two methods. In the first, noise cancellation was achieved by analyzing the signal spectrum and suppressing the undesired frequency components. In the second, the data were treated as random signals and statistical characterizations of the signal were used to extract the desired components, e.g. by Wiener filtering or Kalman filtering. Since the introduction of new technologies and instruments, the applications of biomedical signal processing have expanded well beyond the removal of background noise. Segmentation, the process of partitioning a digital image to locate objects and boundaries, is extensively used in the analysis of medical images, including organ structure quantification and detection of local
  • 27. abnormalities such as tumors. Another signal processing technique, motion tracking, is widely used in molecular biology for visualizing dynamics in living cells. The same method can be used to track the distribution and growth of live cells tagged with fluorescent probes in biofilms or other biological formations. Sequence analysis is yet another processing method, born with the invention of automatic DNA sequencing. It has allowed scientists to create genetic maps based on the short DNA fragments analyzed by DNA sequencers. Finally, one of the most important applications of signal processing is pattern classification, which often depends on segmentation. Extensively used by clinicians and biologists, classification helps to automatically distinguish pathological formations from the normal background. Although experts are very often better at pattern classification than any automated method, classification of biomedical signals by humans faces several difficulties. It requires extensive knowledge combined with experience, it is labor intensive and time consuming, it can be challenging when the signal characteristics are not very prominent, and it is sensitive to human error and bias. Automated methods for classification and segmentation may therefore overcome these limitations and assist in screening large databases.
Genomics
There are two types of nucleic acid that are of key importance in cells: deoxyribonucleic acid (DNA) and ribonucleic acid (RNA). DNA is found as a
  • 28. double strand. The backbone of the molecule is composed of deoxyribose sugars linked by phosphate groups in a repeating polymer chain. Each sugar is linked to a molecule known as a base. In DNA, there are four types of base, called adenine, thymine, guanine and cytosine, usually referred to simply as A, T, G and C. The two distinct ends of a DNA sequence are known as the 5' end and the 3' end. The fundamental building block of a nucleic acid is called a nucleotide: this is the unit of one base plus one sugar plus one phosphate. We usually think of the length of a nucleic acid sequence as the number of nucleotides in the chain. The two strands of the DNA molecule are held together by hydrogen bonding between A and T and between C and G. The two strands run in opposite directions and are exactly complementary in sequence, so that where one has A the other has T, and where one has C the other has G. Therefore, naming the bases on the conventionally chosen strand is enough to describe the entire double-stranded sequence. The two strands are coiled around one another in the famous double helical structure elucidated by Watson and Crick in 1953. An RNA strand that is transcribed from a protein-coding region of DNA is called a messenger RNA (mRNA). The mRNA is used as a template for protein synthesis in the translation process discussed below. Eukaryotic gene sequences are composed of alternating sections called exons and introns. Exons are the pieces of the sequence that contain the information for protein coding. These pieces will be translated. Introns do not contain protein-coding information. The discovery of introns led to the Nobel Prize in Physiology or Medicine in 1993
  • 29. for Phillip Allen Sharp and Richard J. Roberts. The introns are cut out of the pre-mRNA and are not present in the mRNA after processing. When an intron is removed, the ends of the exons on either side of it are linked together to form a continuous strand. Splicing is carried out by the spliceosome, a complex of several types of RNA and proteins bound together and acting as a molecular machine. The spliceosome is able to recognize signals in the pre-mRNA sequence that tell it where the intron-exon boundaries are and hence which parts of the sequence to remove. As with promoter sequences for transcription, the signals for the splice sites are fairly short and variable, so that reliable identification of the intron-exon structure of a gene is a difficult problem in bioinformatics. Nevertheless, the spliceosome manages to do it. Introns that are spliced out by the spliceosome are called spliceosomal introns. These make up the majority of introns in most organisms. In addition, there are some interesting but fairly rare introns that are capable of catalyzing their own splicing out of the primary RNA transcript without the action of the spliceosome. There are surprisingly large numbers of introns in many eukaryotic genes: 10 or 20 in one gene is not uncommon. In this work we focus on research of RNA mathematical modeling and automated recognition of coding and non-coding regions.
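The two sequence operations described in this section, strand complementarity and the removal of introns, can be sketched in a few lines. This is an illustration only: the sequence and the exon coordinates are made up, the DNA alphabet is used throughout for simplicity (a transcript would carry U in place of T), and the exon boundaries are supplied by hand rather than recognized from splice signals, which is the hard problem noted above.

```python
# Watson-Crick pairing: A with T, C with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(strand):
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))

def splice(sequence, exons):
    """Join the exon segments and drop everything in between.

    `exons` is a list of (start, end) pairs, 0-based and half-open,
    given in order along the sequence (hypothetical annotation format).
    """
    return "".join(sequence[start:end] for start, end in exons)

# Made-up example: two 6-base exons separated by a 12-base intron.
gene = "ATGGCC" + "GTAAGTTTTCAG" + "GGATAA"
exons = [(0, 6), (18, 24)]

print(reverse_complement("ATGC"))
print(splice(gene, exons))
```

Because the two strands are exactly complementary, `reverse_complement` is enough to recover one strand from the other, which is why a single strand suffices to describe the double-stranded sequence.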
  • 30. Breast Cancer
Breast cancer is the most common type of cancer among women and the second leading cause of cancer mortality. Besides skin cancer, breast cancer is the most commonly diagnosed cancer among American women [1]. According to a statistical report by the National Cancer Institute of the United States, an estimated 230,480 women in the USA were expected to be diagnosed with breast cancer in 2011, and 39,520 were expected to die of it [1]. Screening mammography is the most widely used technique for the detection of breast cancer. Routine screening mammography is regarded as a viable means of detecting the earliest signs of cancerous growth [2]. The mortality rates of women under the age of 50 have been steadily decreasing since 1990. This decrease is surmised to be the result of advances in treatment and earlier detection through screening. Thus, early detection and the adoption of modern methods of treatment can significantly improve the survival rate of breast cancer patients. Currently, X-ray mammography is widely regarded as an efficient imaging modality for early detection of abnormalities. The earliest sign of breast cancer is microcalcification, which is nodular in structure with high intensity, and is either localized or broadly diffused across the breast. Microcalcifications are tiny calcium deposits in the breast tissue; they appear as clusters or in patterns associated with extra cell activity in the breast region. The detection of microcalcifications at an early stage is a challenging task for radiologists, and some clusters go undetected because of their very small size [3]. The detection sensitivity of radiologists in microcalcification detection is 70–90 %
  • 31. and sensitivity depends on their experience [4]. Therefore, Computer Aided Diagnosis (CAD) systems for breast cancer detection in mammograms have been developed to improve the diagnostic rate. By incorporating the expert knowledge of radiologists, a CAD system can improve the detection accuracy.
Biofilm
Staphylococcus aureus is an opportunistic human pathogen responsible for a wide range of diseases that vary in clinical presentation and severity. S. aureus can cause infections ranging from minor skin lesions to life-threatening conditions such as pneumonia, osteomyelitis and toxic shock syndrome. Recent significant changes in health care delivery and antimicrobial resistance patterns have caused a shift in the epidemiology of S. aureus. A recent report estimated the number of invasive infections caused by methicillin-resistant S. aureus (MRSA) in 2005 in the United States alone at 94,360, with 18,650 of these resulting in a fatal outcome. This means that S. aureus infection has now passed AIDS as a cause of death in the United States. Based on such considerations, S. aureus is arguably a greater clinical concern now than at any time since the pre-antibiotic era. S. aureus infections exhibit general characteristics common to many different types of infection. Strains that are resistant to methicillin already account for the majority of S. aureus nosocomial infection cases [5,6]. A more alarming concern is the emergence of MRSA as a cause of infection in the
  • 32. general community. In other words, patients who have never been hospitalized and who have no other known risk factors for MRSA infection are becoming ill. The treatment of such patients is made more challenging because S. aureus is capable of forming a biofilm, which shows intrinsic levels of antibiotic resistance. The familiar mechanisms of antibiotic resistance, such as efflux pumps, modifying enzymes, and target mutations [7], do not seem to be responsible for the protection of bacteria in a biofilm. In fact, the metabolism of cells within a biofilm is profoundly slower than that of dispersed cells, and it is more likely that this slowing reduces the effects of antibiotics that block metabolic processes, such as protein synthesis. Additionally, antibiotic-sensitive bacteria with no known genetic basis for resistance can have profoundly reduced susceptibility when growing in a biofilm. Such strains, when grown in a biofilm, have to be treated with much higher doses of antibiotics than those needed to eradicate free-floating bacteria [7]. Three hypotheses explaining the formation of intrinsic levels of antibiotic resistance are shown in Figure 1, adapted from [7].
  • 33. Figure 1 Three hypotheses explaining the formation of intrinsic levels of antibiotic resistance. Adapted from Stewart et al., 2001
For the reasons mentioned above, it is necessary to consider biofilm formation as an important weapon in S. aureus pathogenesis. Understanding the role of biofilm formation in virulence requires extensive studies of biofilm regulation and advanced methods of biofilm characterization and quantification.
  • 34. Chapter 3 Time-Dependent ARMA Modeling of Genomic Sequences
Abstract
Over the past decade, many investigators have used sophisticated time series tools for the analysis of genomic sequences. Specifically, the correlation of the nucleotide chain has been studied by examining the properties of the power spectrum. The main limitation of the power spectrum is that it is restricted to stationary time series. However, it has been observed over the past decade that genomic sequences exhibit non-stationary statistical behavior. Standard statistical tests have been used to verify that genomic sequences are indeed not stationary. More recent analysis of genomic data has relied on time-varying power spectral methods to capture the statistical characteristics of genomic sequences. Techniques such as the evolutionary spectrum and evolutionary periodogram have been successful in extracting the time-varying correlation structure. The main difficulty in using time-varying spectral methods is that they are extremely unstable: large deviations in the correlation structure result from very minor perturbations in the genomic data and experimental procedure. A fundamentally new approach is needed in order to provide a stable platform for the non-stationary statistical analysis of genomic sequences. Results: In this chapter, we propose to model non-stationary genomic sequences by a time-dependent autoregressive moving average (TD-ARMA) process. The model is based on a classical ARMA process whose coefficients are allowed to vary with time. A
  • 35. series expansion of the time-varying coefficients is used to form a generalized Yule-Walker-type system of equations. A recursive least-squares algorithm is subsequently used to estimate the time-dependent coefficients of the model. The estimated non-stationary parameters are used as a basis for statistical inference and biophysical interpretation of genomic data. In particular, we rely on the TD-ARMA model of genomic sequences to investigate the statistical properties of, and differentiate between, coding and non-coding regions in the nucleotide chain. Specifically, we define a quantitative measure of randomness to assess how far a process deviates from white noise. Our simulation results on various gene sequences show that both the coding and non-coding regions are non-random. However, coding sequences are "whiter" than non-coding sequences, as attested by their index of randomness. Conclusion: We demonstrate that the proposed TD-ARMA model can be used to provide a stable time series tool for the analysis of non-stationary genomic sequences. The estimated time-varying coefficients are used to define an index of randomness, in order to assess the statistical correlations in coding and non-coding DNA sequences. It turns out that the statistical differences between coding and non-coding sequences are more subtle than previously thought using stationary analysis tools: Both coding and non-coding sequences exhibit statistical correlations, with the coding regions being "whiter" than the non-coding regions. These results corroborate the evolutionary periodogram analysis of genomic sequences and contradict the conclusion, drawn from stationary analysis, that coding DNA behaves like random sequences.
  • 36. Background Understanding the statistical properties of genomic sequences helps recreate the dynamical processes that led to the current DNA structure, and helps characterize gene-related diseases such as cancer and Alzheimer's disease. For instance, based on the view that non-coding DNA exhibits long-range correlations [1-6], Li [7] proposed an expansion-modification model of gene evolution. The model incorporates the two basic features of DNA evolution: (i) sequence elongation due to gene duplication and (ii) mutations. It can be shown that the limiting sequence created by this dynamical process exhibits a long-range correlation structure, as attested by a $1/f^{\alpha}$ spectrum, where the exponent $\alpha$ is a function of the probability of mutation. To understand the relationship between the DNA correlation structure and possible gene aberrations, Dodin et al. [8] designed a simple correlation function intended to visualize the regular patterns encountered in DNA sequences. This function is used to revisit the intriguing question of triplet repeats with the aim of providing a visual estimate of the propensity of genes to be highly expressed and/or to lead to possible aberrant structures formed upon strand slippage. Statistical analysis of genomic sequences has, however, long relied on signal processing techniques for stationary signals (correlation and power spectrum) [2,4,9,10], and statistical tools for slowly-varying trends within stationary signals (Detrended Fluctuation Analysis or DFA) [1,5,6]. Stationarity can be argued to be a valid assumption for time series of short duration. However, such an assumption rapidly loses its credibility in the enormous databases maintained
  • 37. by various genome projects. Standard statistical tests (e.g., Priestley's test for stationarity) have been used to verify that genomic sequences are not stationary and that the nature of their non-stationarity varies and is often more complex than a simple trend [11,12]. Subsequently, more recent analysis of genomic data [1] has relied on time-varying power spectral methods (the evolutionary spectrum and periodogram) to capture the statistical characteristics of genomic sequences [11,12]. The main difficulty in using time-varying spectral methods is that they are extremely unstable and very noisy. Typically, the power spectrum and the evolutionary spectrum are averaged over time in order to obtain smooth and less jittery curves. Moreover, as was pointed out in [13], the evolutionary spectrum is restricted to the class of oscillatory processes. A stochastic process, $\{x[n]\}$, is oscillatory if it has a representation of the form Equation 1: $x[n] = \int_{-\pi}^{\pi} A(n,\omega)\, e^{i\omega n}\, dZ(\omega)$, where $Z(\omega)$ is an orthogonal increment process, and the evolutionary power spectrum of the process is defined by $|A(n,\omega)|^2$. This definition of the evolutionary power spectrum has the following disadvantages [13]: i. It is not uniquely defined for a given non-stationary process. ii. The estimation procedure for the evolutionary spectrum depends greatly on the nature of the amplitude function $A(n,\omega)$, which is not known a priori.
  • 38. iii. An increase in the number of observations does not provide added information on the local behavior of the evolutionary spectrum, and thus does not improve estimation accuracy. We propose to model non-stationary genomic sequences by a time-dependent autoregressive moving average (TD-ARMA) process. Cramer [14] showed that a non-stationary process still possesses a Wold decomposition in terms of its innovation and its generating system. However, the linear system generating the non-stationary signal $x[n]$, when driven by the innovation, $\epsilon[n]$, is no longer shift-invariant; the parameters of the impulse response, $h[n,k]$, of this system are time-dependent so that Equation 2: $x[n] = \sum_{k=-\infty}^{n} h[n,k]\,\epsilon[k]$. The existence of a time-varying ARMA representation of this process is ensured by two theorems due, independently, to Grenier [15] and Huang and Aggarwal [16]. The uniqueness of the TD-ARMA representation is obtained by constraining the ARMA model to be invertible, but this leads to conditions on the time-varying impulse response and its inverse (namely, to be absolutely summable at any time $n$), which cannot be easily expressed in terms of the time-dependent coefficients of the ARMA model. In this chapter, we estimate the time-dependent coefficients of the general TD-ARMA model using mean-squares, least-squares and recursive least-squares algorithms. The mean-squares
  • 39. estimation leads to generalized Yule-Walker type equations [15]. Once the non-stationary parameters are estimated (as time series), we use them to provide a basis for statistical inference by defining an index of randomness, which quantitatively assesses how close the non-stationary signal is to white noise. Our simulation results on various gene sequences show that (i) both the coding and non-coding segments of a gene are not random, and (ii) the coding segments are "closer" to random sequences than non-coding segments. Our results support the view that both coding and non-coding sequences are not random [11,12,9,17-20], and contradict the stationary studies that maintain that non-coding DNA sustains long-range correlations whereas coding DNA behaves like random sequences [1-3,5,6,10]. Methods Numerical representation of genomic sequences Converting the DNA sequence into a digital signal offers the opportunity to apply powerful signal processing methods for the handling and analysis of genomic information. This is, however, not an easy task, as the analysis results might depend on the chosen map. Various numerical mappings have been adopted in the literature. To cite a few, Peng et al. [1] construct a one-dimensional map of nucleotide sequences onto a walk, which they termed the "DNA walk". The DNA walk is defined by the rule that the walker steps up if a pyrimidine resides at position i, and steps down otherwise. Voss [9] represents a DNA sequence by four binary indicator sequences, which indicate the locations of the four nucleotides A,
  • 40. T, C and G. Berthelsen et al. [21] proposed a two-dimensional representation of DNA sequences, characterized by a Hausdorff dimension (also called fractal dimension) that is invariant under (i) complementarity, (ii) reflection symmetry, (iii) compatibility and (iv) substitution symmetry of AT and CG. In this chapter, since we are interested in time-dependent ARMA modeling of one-dimensional non-stationary genomic sequences, we adopt the widely used "DNA walk" map proposed by Peng et al. [1]. The DNA walk provides a convenient graphical representation for each gene. For instance, Figure 2 shows the structure of the Human gene 276 located in chromosome 1, and its DNA walk is displayed in Figure 3. Time-dependent ARMA model Grenier [22] showed that a discrete non-stationary signal, $x[n]$, can be represented by finite-order time-varying ARMA processes of the form Equation 3: $x[n] = -\sum_{k=1}^{p} a_k[n-k]\,x[n-k] + \sum_{k=0}^{q} b_k[n-k]\,\epsilon[n-k], \quad n = 0, \dots, N-1$, where $N$ is the length of the sequence, $a_k[n]$ and $b_k[n]$ are the time-dependent model parameters, $p$ and $q$ are the model orders and $\epsilon[n]$ is a white noise process. Observe that the coefficients $a_k$ and $b_k$ appear with an argument depending on $k$. This is purely arbitrary since any time origin can be chosen, without restricting the generality of the model. We assume that the
  • 41. time-dependent coefficients $a_k[n]$ and $b_k[n]$ can be expressed as linear combinations of some basis functions $\{f_j[n]\}_{j=0}^{m}$, Equation 4: $a_k[n] = \sum_{j=0}^{m} a_{kj}\, f_j[n], \quad b_k[n] = \sum_{j=0}^{m} b_{kj}\, f_j[n]$. Figure 2 Gene Structure. Gene structure of the Human gene 276 located in chromosome 1: The boxes correspond to the exons (coding regions), and the lines between them represent the introns (non-coding regions). The total length of the gene is N=8208 bases, including 1536 coding and 6672 non-coding bases. Figure 3 DNA Walk. DNA walk of the Human gene 276
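The purine-pyrimidine rule behind the DNA walk is simple enough to sketch in a few lines. The Python fragment below is only an illustration and not the implementation used in this work; the function name `dna_walk` and the decision to reject any symbol other than A, C, G and T are assumptions made for the example.

```python
PYRIMIDINES = {"C", "T"}
VALID_BASES = {"A", "C", "G", "T"}


def dna_walk(sequence):
    """Return the cumulative purine-pyrimidine walk of a DNA string.

    The walker steps up (+1) on a pyrimidine (C or T) and down (-1)
    on a purine (A or G), following the rule quoted in the text.
    """
    walk = []
    position = 0
    for base in sequence.upper():
        if base not in VALID_BASES:
            # Assumption for this sketch: ambiguous symbols are an error.
            raise ValueError("unexpected symbol in sequence: %r" % base)
        position += 1 if base in PYRIMIDINES else -1
        walk.append(position)
    return walk


if __name__ == "__main__":
    print(dna_walk("CTAG"))
```

For the toy input `CTAG` the walker goes up twice and then down twice, so the walk returns to zero. How ambiguous bases should be handled in real sequence records is a separate design choice that this sketch does not settle.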
  • 42. The advantage of the basis parameterization is clear from the fact that the identification of the time-dependent coefficients $a_k[n]$ and $b_k[n]$ reduces to the identification of the constant coefficients $a_{kj}$ and $b_{kj}$, and therefore the linear non-stationary problem reduces to a linear time-invariant problem. The basis functions do not have to be limited to the standard choices of Legendre, Fourier, or prolate spheroidal bases but can also take advantage of any prior information, such as the presence of a jump in the coefficients at a known instant [22]. Estimation of the time-dependent ARMA coefficients From Equation 4, the unknown parameters of the TD-ARMA model are the weights of the linear combinations defining each time-varying parameter. This linearity is the key to the algorithms proposed in this chapter. We will derive mean-squares, least-squares and recursive least-squares solutions to estimate the time-dependent coefficients $a_k[n]$ and $b_k[n]$. Mean-squares estimation Define the process Equation 5: $z[n] = x[n] + \sum_{k=1}^{p} a_k[n-k]\,x[n-k] = \sum_{k=0}^{q} b_k[n-k]\,\epsilon[n-k]$ and define the vector
  • 43. Equation 6: $X[n] = [\,f_0[n]\,x[n], \dots, f_m[n]\,x[n]\,]^T$. It is possible to derive for this process orthogonality conditions similar to those of the stationary ARMA model [23]. Observe that the process $z[n]$, defined in Equation 5, is orthogonal to $\epsilon[n-q-1]$ and to all earlier innovations; hence, it is orthogonal to $x[n-k]$ for all $k > q$, and orthogonal to $X[n-k]$ for all $k > q$ [22]. This orthogonality condition leads to a generalized Yule-Walker equation [22] Equation 7: $E\left( \begin{bmatrix} X[n-q-1] \\ \vdots \\ X[n-q-p] \end{bmatrix} \left[\, X^T[n-1] \mid \cdots \mid X^T[n-p] \,\right] \right) \begin{bmatrix} a_1 \\ \vdots \\ a_p \end{bmatrix} = -E\left( \begin{bmatrix} X[n-q-1] \\ \vdots \\ X[n-q-p] \end{bmatrix} x[n] \right)$, where $a_k = [a_{k0}, \dots, a_{km}]^T$ collects the basis weights of $a_k[n]$. Although the process $x[n]$ is non-stationary, the stationarity and ergodicity of the process $\epsilon[n]$, together with the linearity of the model, allow us to replace the expectation in Equation 7 by a summation. However, although consistent, the above estimator is often considered a poor one [22]. Least-squares estimation Equation 4 can be written in vector format as follows
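  • 44. Equation 8: $a_k[n] = F^T[n]\,a_k, \quad b_k[n] = F^T[n]\,b_k$, where $F[n] = [f_0[n], \dots, f_m[n]]^T$, $a_k = [a_{k0}, \dots, a_{km}]^T$ and $b_k = [b_{k0}, \dots, b_{km}]^T$. Define Equation 9: $X[n] = F[n]\,x[n], \quad E[n] = F[n]\,\epsilon[n]$. Then we have Equation 10: $a_k[n-k]\,x[n-k] = a_k^T X[n-k], \quad b_k[n-k]\,\epsilon[n-k] = b_k^T E[n-k]$.
  • 45. Using this vector notation, Equation 3 can be written as Equation 11: $x[n] = -\sum_{k=1}^{p} a_k^T X[n-k] + \sum_{k=0}^{q} b_k^T E[n-k]$ or equivalently Equation 12: $x[n] = \Phi[n]\,\theta$, where $\Phi[n]$ is a row vector Equation 13: $\Phi[n] = [\,-X^T[n-1], \dots, -X^T[n-p],\; E^T[n], \dots, E^T[n-q]\,]$ and Equation 14: $\theta = [\,a_1^T, \dots, a_p^T,\; b_0^T, \dots, b_q^T\,]^T$. Observe that the vector $\theta$ contains all the unknowns of the TD-ARMA model. Writing Equation 12 for $n = 0, \dots, N-1$ leads to
  • 46. Equation 15: $\mathbf{x} = \mathbf{\Phi}\,\theta$, where $\mathbf{x} = [x[0], \dots, x[N-1]]^T$ and $\mathbf{\Phi} = [\Phi^T[0], \dots, \Phi^T[N-1]]^T$. The least-squares solution of Equation 15 is given by Equation 16: $\hat{\theta} = (\mathbf{\Phi}^T \mathbf{\Phi})^{-1} \mathbf{\Phi}^T \mathbf{x}$. To overcome the computational complexity associated with the least-squares estimation (involving in particular the inversion of a square matrix), we opted for recursive least-squares estimation as follows. Recursive least-squares estimation The recursive least-squares algorithm is summarized as [24]
  • 47. Equation 17: $\hat{\theta}[n] = \hat{\theta}[n-1] + K[n]\,\{\, x[n] - \Phi[n]\,\hat{\theta}[n-1] \,\}$, $K[n] = \dfrac{P[n-1]\,\Phi^T[n]}{1 + \Phi[n]\,P[n-1]\,\Phi^T[n]}$, $P[n] = P[n-1] - K[n]\,\Phi[n]\,P[n-1]$. Index of randomness Over the past decade, there has been a stream of conflicting papers about the long-range power-law correlations detected in eukaryotic DNA [1-3,5,6,9-12,17-20]. The controversy is generated by conflicting views that either advocate that non-coding DNA sustains long-range correlations whereas coding DNA behaves like random sequences [1,10,2,3,5,6], or maintain that both coding and non-coding DNA exhibit long-range power-law correlations [11,12,9,17-20]. Based on the analysis of the time-dependent power spectrum of genomic sequences, Bouaynaya and Schonfeld [11,12] showed that the statistical differences between coding and non-coding sequences are more subtle than previously concluded using stationary analysis tools. In fact, they found that both coding and non-coding sequences are non-random. However, coding sequences are "whiter" than non-coding sequences. We propose to quantitatively assess the degree of randomness of both coding and non-coding sequences using the time-dependent ARMA coefficients $a_k[n]$ and $b_k[n]$. Consider the system function, $H(z)$, of a stationary ARMA model (whose coefficients $a_k$ and $b_k$ are constant, i.e., independent of time). We know that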
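  • 48. Equation 18: $H(z) = \dfrac{\sum_{k=0}^{q} b_k z^{-k}}{1 + \sum_{k=1}^{p} a_k z^{-k}} = b_0\, \dfrac{\prod_{k=1}^{q} (1 - z_k z^{-1})}{\prod_{k=1}^{p} (1 - p_k z^{-1})}$, where $z_k$ and $p_k$ are the zeros and poles of the system function, respectively. From the fact that a stationary white noise process has a flat spectrum, we observe that the closer (in $L^2$ distance) the zeros and poles are, the flatter is the spectrum of the process. Following the same reasoning locally for non-stationary processes, we define the curve of randomness, $CR[n]$, of the non-stationary process $x[n]$ by Equation 19: $CR[n] = \begin{cases} \min \left( \sum_{i=1}^{q} |z_i[n] - p_{k_i}[n]| + \sum_{j \notin \{k_1, \dots, k_q\}} |p_j[n]| \right), & p > q \\ \min \left( \sum_{i=1}^{p} |p_i[n] - z_{k_i}[n]| + \sum_{j \notin \{k_1, \dots, k_p\}} |z_j[n]| \right), & q > p \\ \min \left( \sum_{i=1}^{p} |z_i[n] - p_{k_i}[n]| \right), & p = q \end{cases}$ where the minimum is taken over all pairs $(z_i[n], p_{k_i}[n])$. Observe that the case $q > p$ is obtained from the case $p > q$ by interchanging the roles of $z_i$ and $p_i$, and the indices $p$ and $q$. The curve of randomness defined in Equation 19 is a measure of how close the zeros and the poles of the system function are, and therefore, is a measure of how flat the system function is, and how close is the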
  • 49. process from a white noise. The index of randomness, $IR$, of a TD-ARMA(p,q) process is then defined as the average of the curve of randomness, i.e., Equation 20: $IR = \frac{1}{N} \sum_{n=0}^{N-1} CR[n]$. In particular, the index of randomness of a TD-ARMA(1,1) is given by Equation 21: $IR = \frac{1}{N} \sum_{n=0}^{N-1} |z_1[n] - p_1[n]|$. Observe that the index of randomness of a white noise process is equal to zero. We say that the sequence $x$ with index of randomness $IR_x$ is more random than the sequence $y$ with index of randomness $IR_y$ if the index of randomness of the former is lower than the index of randomness of the latter, i.e., $IR_y > IR_x$. Results All genome sequences considered in this chapter have been extracted from the NIH website http://www.ncbi.nlm.nih.gov. The algorithms were implemented in MATLAB. The DNA sequences were mapped to numerical
  • 50. sequences using the purine-pyrimidine DNA walk proposed in [1]. In our simulations, the recursive least-squares algorithm was found to best estimate the time-dependent coefficients of the TD-ARMA model. We used the MATLAB function rarmax, which implements the recursive least-squares algorithm for TD-ARMA models. The orders p and q of the model were determined experimentally as follows: For each genomic sequence, we computed 100 TD-ARMA models corresponding to the orders (1, 1) up to (10, 10). The best model was chosen to be the one that minimizes the average squared error between the actual and the fitted sequences. Our extensive simulations on various DNA sequences from different organisms show that most of the sequences are best fitted with low-order TD-ARMA models, e.g., TD-ARMA(1,1), TD-ARMA(1,2) and TD-ARMA(2,1). Figure 4 shows the DNA walk of the Human gene 276 and its TD-ARMA(1,1) fitted sequence. Observe that the TD-ARMA(1,1) model accurately fits this gene sequence. The estimated time-varying coefficients a[n] and b[n] are displayed in Figure 5 for both the coding and non-coding regions of this gene. Their statistical differences are not clear from the plot of the time-series coefficients. The curves of randomness of the coding and non-coding regions are displayed in Figure 6.
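To make the recursive least-squares step more concrete, the sketch below implements a generic textbook RLS update in plain Python. It is not the MATLAB rarmax routine and makes no claim to match its internals; the function names `rls` and `simulate_ar1`, the forgetting factor, and the initial covariance scale `delta` are all assumptions chosen for illustration. The demo data is a synthetic AR(1) series, not a genomic sequence.

```python
import random


def rls(regressors, targets, forgetting=1.0, delta=1000.0):
    """Generic recursive least squares for targets[n] ~ dot(regressors[n], theta).

    Returns the list of parameter estimates obtained after each sample.
    """
    dim = len(regressors[0])
    theta = [0.0] * dim
    # Start from a large diagonal covariance, i.e. a weak prior on theta.
    P = [[delta if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    history = []
    for phi, y in zip(regressors, targets):
        # P_phi = P * phi (P stays symmetric throughout the recursion).
        P_phi = [sum(P[i][j] * phi[j] for j in range(dim)) for i in range(dim)]
        denom = forgetting + sum(phi[i] * P_phi[i] for i in range(dim))
        gain = [value / denom for value in P_phi]
        # Prediction error computed with the previous estimate.
        error = y - sum(phi[i] * theta[i] for i in range(dim))
        theta = [theta[i] + gain[i] * error for i in range(dim)]
        # Covariance update: P = (P - gain * phi^T * P) / forgetting.
        P = [[(P[i][j] - gain[i] * P_phi[j]) / forgetting for j in range(dim)]
             for i in range(dim)]
        history.append(list(theta))
    return history


def simulate_ar1(coefficient, length, seed=0):
    """Synthetic series x[n] = coefficient * x[n-1] + e[n], unit-variance noise."""
    rng = random.Random(seed)
    x = [0.0]
    for _ in range(length):
        x.append(coefficient * x[-1] + rng.gauss(0.0, 1.0))
    return x


if __name__ == "__main__":
    x = simulate_ar1(0.8, 2000)
    estimates = rls([[v] for v in x[:-1]], x[1:])
    print("final AR(1) estimate:", estimates[-1][0])
```

In a TD-ARMA setting each regressor row would instead hold the basis-weighted past samples and innovation estimates, so that the recursion tracks the constant basis weights; that wiring is left out here to keep the sketch short.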
  • 51. Table 1 shows the index of randomness of various gene sequences. For concise representation, the column titles have been abbreviated as follows: "C. length" (resp. "N.C. length") denotes the length (in base pairs) of the coding (resp. non-coding) segment of the gene. The total length of the gene is the sum of the lengths of its coding and non-coding regions. "C. (p, q)" (resp. "N.C. (p, q)") denotes the optimal TD-ARMA parameters (p, q) for the coding (resp. non-coding) region of the gene. Finally, "C. IR" (resp. "N.C. IR") is the index of randomness of the coding (resp. non-coding) segment of the gene. Observe that, in all considered genes, the indices of randomness of both the coding and non-coding segments are strictly positive, and the index of randomness of the coding region is consistently lower than the index of randomness of the non-coding region (recall that the index of randomness of a white noise is zero). These observations lead to two important conclusions: (i) Both the coding and non-coding sequences are not random, as attested by an index of randomness greater than zero. (ii) The coding sequences are "whiter" than the non-coding sequences. This conclusion contradicts previous work on statistical correlation of DNA sequences, which, based on stationary time-series analysis, presumed that coding DNA behaves like random sequences [1-3,5,6,10]; and supports the conflicting view that both coding and non-coding sequences are not random
  • 52. 2 [11,12,9,17-20]. In particular, our conclusion is in accordance with the evolutionary periodogram analysis conducted in [11,12]. Figure 4 TD-ARMA modeling. TD-ARMA modeling of the Human gene 276: The blue signal is the DNA walk, and the red signal is the TD-ARMA(1,1) fitted signal. The TD-ARMA(1,1) model accurately fits the genomic signal
  • 53. Figure 5 TD-ARMA coefficients estimation. Estimation of the TD-ARMA(1,1) coefficients of the Human gene 276. The TD-ARMA(1,1) model is given by $x[n] + a[n]\,x[n-1] = \epsilon[n] + b[n]\,\epsilon[n-1]$. The blue and black (resp. red and green) curves depict the time series $a[n]$ (resp. $b[n]$) for the coding and non-coding regions of the gene, respectively.
  • 54. 4 Figure 6 Curve of randomness. The curves of randomness of the coding and non-coding regions of the Human gene 276 are shown in blue and red, respectively. The index of randomness of the coding sequence is equal to 1.0603, whereas its corresponding value for the non-coding sequence is equal to 1.0627
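For a first-order model, the curve and index of randomness reduce to a pole-zero distance averaged over time, which can be sketched as follows. This is an illustrative fragment rather than the code used to produce the reported values; the function names are hypothetical, and it assumes that the time-varying zero and pole of the TD-ARMA(1,1) model have already been extracted from the estimated coefficients (how they map to a[n] and b[n] depends on the sign convention of the model).

```python
def curve_of_randomness_11(zeros, poles):
    """Pole-zero distance at each time step for a first-order model.

    zeros[n] and poles[n] are the (real or complex) zero and pole of the
    local system function at time n.
    """
    if len(zeros) != len(poles):
        raise ValueError("zeros and poles must have the same length")
    return [abs(z - p) for z, p in zip(zeros, poles)]


def index_of_randomness(curve):
    """Average of the curve of randomness over the whole sequence."""
    if not curve:
        raise ValueError("curve must not be empty")
    return sum(curve) / len(curve)


if __name__ == "__main__":
    # Toy values only: a zero that cancels the pole gives a flat spectrum.
    white = curve_of_randomness_11([0.5, -0.2], [0.5, -0.2])
    print(index_of_randomness(white))
```

When the zero cancels the pole at every time step the index is zero, matching the statement that a white noise process has an index of randomness of zero; larger values indicate a less flat local spectrum.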
  • 55. Table 1 Index of randomness of the Coding and Non-Coding Segments of Various Gene Sequences
  • 56. Chapter 4 Two-Dimensional ARMA Modeling for Breast Cancer Detection and Classification Abstract Computer aided diagnosis (CAD) paradigms have gained currency for discriminating malignant from benign lesions in ultrasound breast images. But even the most sophisticated investigators often rely on one-dimensional representations of the image in terms of its scan lines. Such vector representations are convenient because of the mathematical tractability of one-dimensional time series. However, they fail to take into account the spatial correlations between the pixels, which are crucial in tumor detection and classification in breast images. In this chapter, we propose a CAD system for tumor detection and classification (cancerous vs. benign) in ultrasound breast images based on a two-dimensional Auto-Regressive-Moving-Average (ARMA) model of the breast image. First, we show, using the Wold decomposition theorem, that ultrasound breast images can be accurately modeled by two-dimensional ARMA random fields. As in the 1D case, the 2D ARMA parameter estimation problem is much more difficult than its 2D AR counterpart, due to the nonlinearity in estimating the 2D moving average (MA) parameters. We propose to estimate the 2D ARMA parameters using a two-stage Yule-Walker Least-Squares algorithm. The estimated parameters are then used as the basis for statistical inference and biophysical interpretation of the breast image. We
  • 57. evaluate the performance of the 2D ARMA vector features in real ultrasound images using a k-means classifier. Our results suggest that the proposed CAD system based on a two-dimensional ARMA model leads to parameters that can accurately segment the ultrasound breast image into three regions: healthy tissue, benign tumor, and cancerous tumor. Moreover, the specificity and sensitivity of the proposed two-dimensional CAD system are superior to those of its one-dimensional homologue. Introduction Breast cancer continues to be a significant public health problem in the United States: It is the second leading cause of female mortality, and, disturbingly, one out of eight women in the United States will be diagnosed with breast cancer in her lifetime. Until the cause of this disease is fully understood, early detection remains the only hope to improve breast cancer prognosis and treatment. Breast cancer screening modalities are mainly based on clinical examination, mammography, ultrasound imaging, magnetic resonance imaging (MRI), and core biopsy. Mammography (breast x-ray imaging) is by far the fastest and cheapest screening test for breast cancer. Unfortunately, mammograms are also among the most difficult radiological images to interpret: they are of low contrast, and features indicative of breast disease are often very small. Many studies have shown that ultrasound and MRI imaging techniques can help supplement mammography by detecting small breast cancers that may not be visible with mammography. However, these techniques often fail to determine if a
  • 58. detected tumor is cancerous or benign, and a biopsy may be recommended. Consequently, many unnecessary biopsies are often undertaken due to the high false positive rate. Computer aided diagnosis (CAD) paradigms have recently received great attention for lesion detection and discrimination in X-ray and ultrasound breast images [25]–[28]. The large number of negative biopsies encountered in clinical practice could be reduced if a computer system were available to help the radiologists screen breast images. Broadly, the CAD systems proposed in the literature can be grouped into four major categories: geometrical [24], artificial intelligence [25], pyramidal (or multi-resolution) [27], and model-based techniques [28], [29]. Geometrical methods employ morphological and other segmentation techniques to extract small specks of calcium known as microcalcifications from breast images [25]. However, this procedure usually requires a priori knowledge of the tumor pattern characteristics. Moreover, these techniques also tend to rely on many stages of heuristics attempting to eliminate false positives. Artificial intelligence techniques include neural networks and fuzzy logic methods. The performance of these systems is tied to the architecture of the network and the amount of training data. Breast cancer is a heterogeneous disease which includes several subtypes with distinct prognoses. In particular, the variability in the appearance of breast cancer, ranging from relative uniformity to complex patterns of bright streaks and blobs [26], means that an artificial neural network (ANN) requires a large training data set to ensure a certain level of reliability. Pyramidal or multi-resolution techniques refer mainly to the wavelet transform [27], which can be seen from a signal decomposition
  • 59. viewpoint. Specifically, a signal is decomposed onto a set of basis wavelet functions. A very appealing feature of wavelet analysis is that it provides a uniform resolution for all the scales. However, because this resolution is limited by the size of the basic wavelet function, its downside is that it can be uniformly poor. Model-based methods include linear, non-linear and finite-element methods to build an accurate model of the breast [28], [29]. The model is subsequently used for image matching, detection, and classification [29]. The accuracy of the results is tied to the accuracy of the considered model. In this work, we propose a new model-based CAD system for tumor detection and classification. We show that (x-ray, ultrasound, and MRI) breast images can be accurately modeled by two-dimensional autoregressive moving average (ARMA) random fields. The model parameters, being the fingerprints of the image, serve as the basis for statistical inference and biophysical interpretation of the breast image. ARMA models are parametric representations of wide-sense stationary (WSS) processes with rational spectra. The Wold decomposition theorem states that any WSS process can be decomposed as the sum of a regular process, whose spectrum is continuous, and a predictable process, whose spectrum consists of impulses. Since rational spectra form a dense set in the class of continuous spectra, the ARMA model accurately renders the regular part of the WSS process. It is, therefore, surprising that very few researchers have attempted to derive a general ARMA representation of the breast image, and use it for tumor detection and classification. In [29], the authors use a one-dimensional fractional differencing autoregressive moving average (FARMA)
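  • 60. process to model the ultrasound RF echo reflected from the breast tissue. However, by considering separate scan lines, they do not take into account the two-dimensional spatial correlation between the pixels in the image. In [30], an autoregressive (AR) model is considered for improving the contrast of breast cancer lesions in ultrasound images. ARMA models, however, provide a more accurate model of a homogeneous random field than an AR model. As in the 1D case, the 2D ARMA parameter estimation problem is much more difficult than its 2D AR counterpart, due to the non-linearity in estimating the 2D moving average (MA) parameters. 2D-ARMA Modeling We represent the breast image as a 2D random field $\{x[n,m]\}$. We define a total order on the discrete lattice as follows Equation 22 The 2D-ARMA($p_1,p_2,q_1,q_2$) model is defined for the image $x[n,m]$ by the following difference equation Equation 23: $\sum_{i=0}^{p_1-1} \sum_{j=0}^{p_2-1} a_{ij}\, x[n-i,m-j] = \sum_{i=0}^{q_1-1} \sum_{j=0}^{q_2-1} b_{ij}\, w[n-i,m-j]$, where $\{w[n,m]\}$ is a stationary white noise field with variance $\sigma^2$, and the coefficients $\{a_{ij}\}$, $\{b_{ij}\}$ are the parameters of the model. From Equation 23 the
  • 61. image can be viewed as the output of the linear shift-invariant causal system $H(z_1,z_2)$ excited by a white noise input, where Equation 24: $H(z_1,z_2) = \dfrac{B(z_1,z_2)}{A(z_1,z_2)} = \dfrac{\sum_{i=0}^{q_1-1} \sum_{j=0}^{q_2-1} b_{ij}\, z_1^{-i} z_2^{-j}}{\sum_{i=0}^{p_1-1} \sum_{j=0}^{p_2-1} a_{ij}\, z_1^{-i} z_2^{-j}}$, with $a_{00} = b_{00} = 1$. Yule-Walker Least-Squares Parameter Estimation Assume first that the noise sequence $\{w[n,m]\}$ were known. Then the problem of estimating the parameters in the ARMA model in Equation 23 would be a simple input-output system parameter estimation problem, which could be solved by several methods, the simplest of which is the least-squares (LS) method. In the LS method, we express Equation 23 as Equation 25: $x[n,m] = \varphi^T[n,m]\,\theta + w[n,m]$, where Equation 26: $\varphi[n,m] = [\,-x[n,m-1], \dots, -x[n-p_1+1,m-p_2+1],\; w[n,m-1], \dots, w[n-q_1+1,m-q_2+1]\,]^T$ and
  • 62. Equation 27: $\theta = [\,a_{01}, \dots, a_{p_1-1,p_2-1},\; b_{01}, \dots, b_{q_1-1,q_2-1}\,]^T$. Writing Equation 25 in matrix form over the pixels $(n,m)$ considered gives Equation 28: $\mathbf{x} = \Phi\,\theta + \mathbf{w}$, where Equation 29: $\mathbf{x}$ and $\mathbf{w}$ stack the corresponding samples $x[n,m]$ and $w[n,m]$, and the rows of $\Phi$ are the vectors $\varphi^T[n,m]$. Assume we know $\Phi$; then we can obtain a least-squares estimate of the parameter vector $\theta$ in Equation 28 as Equation 30: $\hat{\theta} = (\Phi^T \Phi)^{-1} \Phi^T \mathbf{x}$. Observe that the input model noise $w[n,m]$ in $\Phi$ is unknown. Nevertheless, it can be estimated by considering the noise process $w[n,m]$ as the output of the linear filter $1/H(z_1,z_2)$ with input $x[n,m]$. From Nirenberg's proof of the division theorem in multi-dimensional spaces [32], we can write the inverse ARMA filter as the infinite order AR filter
  • 63. Equation 31: $\dfrac{1}{H(z_1,z_2)} = \dfrac{A(z_1,z_2)}{B(z_1,z_2)} = \sum_{i=0}^{\infty} \sum_{j=0}^{\infty} \alpha_{ij}\, z_1^{-i} z_2^{-j}$. In the spatial domain we obtain Equation 32: $w[n,m] = \sum_{i=0}^{\infty} \sum_{j=0}^{\infty} \alpha_{ij}\, x[n-i,m-j]$. Therefore, we can estimate $w[n,m]$ by first estimating the AR parameters $\{\alpha_{ij}\}$ and next obtaining $w[n,m]$ by filtering $x[n,m]$ as in Equation 32. Since we cannot estimate an infinite number of (independent) parameters from a finite number of samples, we approximate the infinite AR model by one of finite order, say Equation 33: $(K_1, K_2)$. The parameters in the truncated AR model can be estimated by using a 2D extension of the Yule-Walker equations as follows Equation 34: $\sum_{i=0}^{K_1} \sum_{j=0}^{K_2} \alpha_{ij}\, r[k-i,l-j] = \sigma^2\, \delta[k,l], \quad 0 \le k \le K_1,\; 0 \le l \le K_2$,
  • 64. where $r[k,l]$ are the autocorrelation values of the field $x[n,m]$, computed as follows: Equation 35: $r[k,l] = \frac{1}{NM} \sum_{n} \sum_{m} x[n,m]\, x[n+k,m+l]$, with $N \times M$ the image size and the sums taken over the pixels for which both samples lie inside the image, and $\delta[k,l]$ is the 2D Kronecker delta function. Equation 34 is a system of linear equations that can be written in matrix form and solved for the coefficients $\alpha_{ij}$. Finally, the Yule-Walker Least-Squares algorithm is summarized below 1. Estimate the parameters $\{\alpha_{ij}\}$ in an AR$(K_1,K_2)$ model of $x[n,m]$ by the Yule-Walker method in Equation 34. Obtain an estimate of the noise field as Equation 36: $\hat{w}[n,m] = \sum_{i=0}^{K_1} \sum_{j=0}^{K_2} \hat{\alpha}_{ij}\, x[n-i,m-j]$ for all pixels at which the sum is defined. 2. Replace the $w[n,m]$ in $\Phi$ by the $\hat{w}[n,m]$ computed in Step 1. Obtain $\hat{\theta}$ from Equation 30.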
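The first stage of the algorithm above needs sample autocorrelations of the image. The Python sketch below shows one way such values could be computed; it is an illustration under stated assumptions, not the author's code. In particular, the function name `autocorrelation_2d`, the biased normalization by the total number of pixels, and the absence of mean removal are choices made for this example.

```python
def autocorrelation_2d(image, k, l):
    """Biased sample autocorrelation of a 2D field at lag (k, l).

    image is a list of equally long rows. Products are summed only over
    pixel pairs that both fall inside the image, then divided by the
    total number of pixels (assumption: biased estimator, no mean removal).
    """
    rows = len(image)
    cols = len(image[0])
    total = 0.0
    for n in range(rows):
        for m in range(cols):
            if 0 <= n + k < rows and 0 <= m + l < cols:
                total += image[n][m] * image[n + k][m + l]
    return total / (rows * cols)


if __name__ == "__main__":
    toy = [[1.0, 1.0], [1.0, 1.0]]
    for lag in [(0, 0), (1, 0), (1, 1)]:
        print(lag, autocorrelation_2d(toy, *lag))
```

The biased form keeps the estimate well behaved at large lags, at the cost of shrinking it toward zero; an unbiased variant would divide by the number of overlapping pixels instead.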
  • 65. Tumor detection and classification The estimated ARMA parameters, $\{\hat{a}_{ij}\}$ and $\{\hat{b}_{ij}\}$, are used as a basis for inference about the presence of a tumor and its nature: benign or cancerous. We use the k-means algorithm to segment the breast image into 3 classes: healthy tissue, benign tumor and cancerous tumor. Our method consists of representing each pixel in the image by an ARMA model whose parameters are estimated by using an appropriate neighborhood for the pixel. We make the assumption that all pixels in the considered neighborhood belong to the same class, and hence, for computational efficiency, we replace the entire neighborhood by the vector value of the estimated ARMA parameters. This procedure is repeated for the entire image, creating a new block-by-block vector-valued image, which will be the input to the k-means classifier. Simulations Although the proposed algorithm is independent of the imaging modality of the breast, we perform our simulations on ultrasound images, collected from the Radiology department, College of Medicine at the University of Illinois at Chicago. Our database of cancerous images shows intraductal carcinoma, which is the most common type of breast cancer in women. Intraductal carcinoma is usually discovered through a mammogram or an ultrasound as microcalcifications. Our benign tumor images show the Fibroadenoma of the
  • 66. breast, which is a benign fibroepithelial tumor characterized by proliferation of both glandular and stromal elements. Our extensive simulations indicate that ARMA[2,2,2,2] is a sufficient model order, in terms of mean square error, to accurately represent ultrasound breast images. Figure 7 shows two ultrasound images, one with a cancerous tumor and one with a benign tumor, and their respective 2D-ARMA[2,2,2,2] and 1D-ARMA[2,2] models. It is visually clear that the 2D-ARMA model accurately represents both ultrasound images, whereas the 1D model fails to capture any image feature. We estimate the 2D-ARMA parameters using a local window. The choice of the window size presents an inherent trade-off between the accuracy of the representation and the accuracy of the classification. A large window size would lead to a better representation of the 2D-ARMA model, but might include pixels from different classes. We found that for 256 × 256 images, the chosen window size leads to a good segmentation performance. Each image is therefore represented by a number of 2D-ARMA feature vectors, which contain the 8 parameters for each sub-block image. Without loss of generality, we chose $a_{00} = b_{00} = 1$. Therefore, the size of the feature vectors reduces to 6 instead of 8. We decide that an image has a cancerous (resp., benign) tumor if at least one of the sub-block images is classified as a cancerous (resp., benign) tumor. Otherwise, we conclude that the image is healthy and contains no tumors. We conducted our simulations using 573 ARMA feature vectors of healthy, benign and cancerous ultrasound breast images. The ARMA feature vectors
  • 67. were used as the input to a k-means classifier. Figure 7(c) and Figure 7(g) show the segmentation outputs of the cancerous and benign tumor images, respectively. We can observe clear delineations of the tumors from the healthy tissues in both cases. The accuracy, sensitivity and specificity of the 2D-ARMA and 1D-ARMA k-means classifiers are shown in Table 2. It is clear that the 2D-ARMA feature vectors are more selective than their one-dimensional homologue. Figure 7 ARMA modeling and segmentation of ultrasound breast images: (a) cancerous ultrasound image; (b) 2D-ARMA[2,2,2,2] representation of (a); (c) segmentation of (b) using an appropriate threshold; (d) 1D-ARMA[2,2] representation of (a); (e) benign tumor ultrasound image; (f) 2D-ARMA[2,2,2,2] representation of (e); (g) segmentation of (f) using an appropriate threshold; (h) 1D-ARMA[2,2] representation of (e).
Table 2 Classification accuracy of cancerous and benign tumors
            Accuracy   Sensitivity   Specificity
  2D-ARMA   92.87%     92.03%        94.14%
  1D-ARMA   78.51%     59.54%        79.76%
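For readers who want to relate the three columns of Table 2 to raw classification counts, the following Python sketch computes accuracy, sensitivity and specificity from a binary confusion matrix using their standard definitions. The function name `classification_metrics` and the counts in the demo are hypothetical; they are not the counts behind Table 2, which are not given in the text.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, sensitivity, specificity) from confusion counts.

    tp, tn, fp, fn are the numbers of true positives, true negatives,
    false positives and false negatives, respectively.
    """
    total = tp + tn + fp + fn
    if total == 0 or tp + fn == 0 or tn + fp == 0:
        raise ValueError("each class needs at least one sample")
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)  # fraction of actual positives detected
    specificity = tn / (tn + fp)  # fraction of actual negatives rejected
    return accuracy, sensitivity, specificity


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(classification_metrics(tp=90, tn=80, fp=20, fn=10))
```

Because accuracy is a weighted average of sensitivity and specificity, it always lies between the two, which is a quick sanity check on any row of such a table.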
  • 68. Chapter 5 Statistical Sequential Analysis for Detection of Microcalcifications in Digital Mammograms
    Abstract
    We formulate the problem of microcalcification detection in digital mammograms as a statistical change detection problem in the local properties of the image. First, we represent mammograms by two-dimensional autoregressive moving-average (2D ARMA) fields, thus uniquely characterizing the images by their reduced-dimensionality 2D ARMA feature vectors. Texture changes in mammograms are then modeled as an additive change in the mean parameter of the PDF associated with the 2D ARMA feature vector sequence that describes the image. A generalized likelihood ratio test is used to detect these changes and estimate the exact time (or space) where they occur. Our simulation results on the Digital Database for Screening Mammography, hosted by the University of South Florida, show that the decision functions of cancerous images present high peaks at the microcalcification locations, whereas they exhibit a uniform behavior for healthy mammograms. The proposed algorithm achieves a sensitivity and specificity of 96.9% and 97.8%, respectively.
    Introduction
    With the rapid expansion in the number and volume of digital mammograms and the increasing demand for fast access to relevant medical data, the need for
  • 69. interpretation and retrieval of medical information has become of paramount importance [33]. Mammography is the current standard for breast cancer diagnosis. Women 40 years of age or older are recommended to undergo a screening mammogram to check for breast malignancies every 6 months. Screening mammograms usually involve two x-rays of each breast. This process generates a huge amount of data that needs to be processed, interpreted and saved. The presence of microcalcifications (tiny deposits of calcium) in the breast is an important sign for the detection of early breast carcinoma. Accurate and uniform evaluation of the enormous number of mammograms generated in widespread screening is a difficult task: 10-30% of breast carcinomas are missed by trained radiologists [34]. Mammograms are low-contrast images, and breast malignancies present a great diversity in shape, size and location, and low distinguishability from the surrounding healthy tissue.
    In the last two decades, various computer-aided detection (CAD) systems have been proposed to help bring suspicious areas on the mammogram to the radiologist’s attention. Many approaches were considered, including denoising [35], segmentation [36], filtering [37], machine learning [38], [39] and artificial intelligence [37], mathematical morphology [40], time-frequency analysis and multiresolution techniques, and neural networks [34]. Despite their technical differences, these approaches share a common outline: they are all deterministic. They usually assume a small region of interest as a subject of recognition.
  • 70. Hence, their performance is contingent upon the natural variability of healthy and cancerous mammography images. In contrast to deterministic methods, statistical methods take into account the noise in the digitized mammogram and the heterogeneity of its characteristics by considering an underlying probability distribution of the image features. It is, therefore, surprising that very few researchers have pursued this direction. Statistical analysis of mammograms was mainly considered in the context of textural information [41], [42]. In [41], the third and fourth order statistical moments, skewness and kurtosis, were estimated from the bandpass-filtered mammogram. A region with high positive skewness and kurtosis is marked as a region of interest. In [42], a statistical model of the mammographic image, termed the “log-likelihood image”, is generated from the original mammogram image. However, the method does not include any decision making, and the log-likelihood image has the same resolution as the original mammogram.
    The challenge in breast carcinoma localization is that the detection algorithm must be able to handle all types of microcalcifications. Therefore, it is necessary to formulate the detection problem beyond the use of empirical observations about the type, shape, size or location of microcalcifications, which may or may not hold in all cases. In order to address these challenges, we pose the microcalcification detection problem in the context of statistical sequential representation and analysis of mammograms. A mammogram image is considered to be a realization of a stochastic process. We use statistical analysis to detect parameter changes of the stochastic process, which will
  • 71. indicate the presence of suspicious areas in the breast. In our approach, we achieve a decision-making CAD system through the use of dimensionality reduction and sufficient statistics. We first show that mammograms can be accurately modeled as 2D autoregressive moving-average (ARMA) fields, and thus each image can be solely represented by its 2D ARMA coefficients. In this chapter, we consider a change detection framework based on additive modeling. Specifically, we detect changes of the mean parameter of the PDF associated with the 2D ARMA feature vector sequence. The sufficient statistic used is based on the generalized likelihood ratio. Thus, the main steps used for detecting microcalcifications in mammograms are the 2D ARMA dimensionality reduction of the original image, followed by change detection on the resulting feature vectors. In particular, no a priori assumptions are made about the specific nature of the microcalcifications (e.g., circular, smooth, etc.).
    2D-ARMA representation
    We represent the breast image as a 2D random field $x[n, m]$ [43] and define a total (lexicographic) order on the discrete lattice [11]. The 2D ARMA model is defined for the image by the following difference equation
  • 72. Equation 37: $$\sum_{i=0}^{p_1}\sum_{j=0}^{p_2} a_{ij}\, x[n-i, m-j] = \sum_{i=0}^{q_1}\sum_{j=0}^{q_2} b_{ij}\, w[n-i, m-j]$$
    where $x[n, m]$ is the image, $w[n, m]$ is a stationary white noise field with variance $\sigma^2$, and the coefficients $\{a_{ij}, b_{ij}\}$ are the parameters of the ARMA[$p_1, p_2, q_1, q_2$] model. A two-stage Yule-Walker least-squares parameter estimation method was proposed in [43]. First, the noise sequence is assumed to be known. The ARMA parameter estimation problem is then reduced to a simple input-output system identification problem, which is solved by a least-squares (LS) method. The final estimate is then obtained by estimating the noise, using a truncated autoregressive (AR) model, and plugging it back into the least-squares solution [43].
    In practice, the ARMA parameters are estimated over a window of fixed size. The choice of the window size presents an inherent trade-off between the accuracy of the ARMA representation and the reliability of the classification. An image is therefore represented by one ARMA feature vector per window (sub-block). Let $Y_k$ be the ARMA feature vector of the k-th block, formed from the estimated ARMA coefficients of that block. The mammogram image is then represented by the sequence of its feature vectors rather than by the raw pixels of the unprocessed image. The 2D ARMA model presents a compressed representation of the image, which will lead to an efficient implementation of the CAD system. For instance, for
  • 73. the model order and window size used here, the 2D ARMA model represents a dimensionality reduction of more than 97% compared to the original image. Figures 9(b) and 9(f) show the 2D ARMA models of a healthy and a cancerous mammogram, respectively. Subsection A of the Results section discusses in detail the choice of the model degree parameters. The problem of tumor detection becomes one of detecting changes in the parameters of the probability density function (PDF) associated with the ARMA vector random process.
    Change detection algorithm
    The 2D ARMA feature vectors are assumed to form an i.i.d. (independent and identically distributed) sequence of r-dimensional random vectors $(Y_k)_{k \ge 1}$, with Gaussian distribution having PDF:
    Equation 38: $$p_\theta(Y) = \frac{1}{\sqrt{(2\pi)^r \det \Sigma}} \exp\left(-\frac{1}{2}(Y - \theta)^T \Sigma^{-1} (Y - \theta)\right)$$
    Observe that the ARMA feature vectors are assumed to be independent. However, the components of each ARMA feature vector are correlated, with covariance matrix $\Sigma$. The independence of the ARMA feature vectors reflects an independence assumption between pixels in different sub-blocks of an
  • 74. image. The tumor detection is modeled as a change in the vector parameter $\theta$ of the PDF characterizing the feature vector random process. Let the parameter $\theta_0$ be the value before the change, and $\theta_1$ the value after the change. In general, we have minimal or no information about the parameter after the change. Let us begin by discussing the scenario where there is a known upper bound for $\theta_0$ and a known lower bound for $\theta_1$. In this case, the change detection problem is equivalent to the following:
    Equation 39: $$\theta(k) = \begin{cases} \theta : \|\theta - \theta_0\|_\Sigma \le a & \text{when } k < t_0 \\ \theta : \|\theta - \theta_0\|_\Sigma \ge b & \text{when } k \ge t_0 \end{cases}$$
    where $\|x\|_\Sigma = (x^T \Sigma^{-1} x)^{1/2}$, $t_0$ is the true change time and $a < b$. The case of interest, where $\theta_0$ is assumed to be known and $\theta_1$ unknown, can be obtained as a limit case of the solution to the above problem. The solution to the detection problem formulated in Equation 39 can be obtained by deriving the generalized likelihood ratio (GLR) test [44], where the unknown parameters are replaced by their maximum likelihood estimates. Thus, the generalized likelihood ratio for the sequence $Y_j, \ldots, Y_k$ is:
    Equation 40: $$\Lambda_j^k = \frac{\sup_{\theta : \|\theta - \theta_0\|_\Sigma \ge b} \prod_{i=j}^{k} p_\theta(Y_i)}{\sup_{\theta : \|\theta - \theta_0\|_\Sigma \le a} \prod_{i=j}^{k} p_\theta(Y_i)}$$
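  • 75. where $p_\theta$ is the corresponding parameterized probability density function. The sequential GLR algorithm is then given by
    Equation 41: $$t_a = \min\{k \ge 1 : g_k \ge h\}, \qquad g_k = \max_{1 \le j \le k} S_j^k, \qquad S_j^k = \ln \Lambda_j^k$$
    where $k$ is the discrete time index, $t_a$ is the alarm (detection) time, $g_k$ is the test statistic, $h$ is a threshold, and $\Lambda_j^k$ is the generalized likelihood ratio of Equation 40. Given the i.i.d. Gaussian assumption, $S_j^k$ can be written as
    Equation 42: $$S_j^k = \sup_{\|\theta - \theta_0\|_\Sigma \ge b} \sum_{i=j}^{k} \ln p_\theta(Y_i) - \sup_{\|\theta - \theta_0\|_\Sigma \le a} \sum_{i=j}^{k} \ln p_\theta(Y_i)$$
    It can be shown that $S_j^k$ can be rewritten as [44]
    Equation 43: $$S_j^k = \begin{cases} -\dfrac{k-j+1}{2}\,(\chi_j^k - b)^2 & \text{if } \chi_j^k < a \\ \dfrac{k-j+1}{2}\left[-(\chi_j^k - b)^2 + (\chi_j^k - a)^2\right] & \text{if } a \le \chi_j^k \le b \\ \dfrac{k-j+1}{2}\,(\chi_j^k - a)^2 & \text{if } \chi_j^k > b \end{cases}$$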
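  • 76. where
    Equation 44: $$\chi_j^k = \left[(\bar{Y}_j^k - \theta_0)^T \Sigma^{-1} (\bar{Y}_j^k - \theta_0)\right]^{1/2}$$
    Equation 45: $$\bar{Y}_j^k = \frac{1}{k-j+1} \sum_{i=j}^{k} Y_i$$
    Observe that, for the current problem formulation, the data needed in Equation 44 are the feature vectors $Y_i$, the covariance $\Sigma$, and the mean before the change $\theta_0$. In the more realistic case where the parameter before the change is assumed to be known but the parameter after the change is assumed to be completely unknown, the change detection problem statement is as follows:
    Equation 46: $$\theta(k) = \begin{cases} \theta_0 & \text{when } k < t_0 \\ \theta \ne \theta_0 & \text{when } k \ge t_0 \end{cases}$$
    Hence, the case where nothing is known about $\theta_1$ can be considered the limit of the previous case when $a \to 0$ and $b \to 0$. Therefore, the GLR algorithm for the problem in Equation 46 becomes: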
  • 77. Equation 47: $$g_k = \max_{1 \le j \le k} \frac{k-j+1}{2}\,(\chi_j^k)^2$$
    where $\chi_j^k$ is defined in Equation 44. In the above study, $\theta_0$ is assumed to be known. In practice, $\theta_0$ can be estimated using a number of feature vectors at the beginning of each mammogram. The covariance $\Sigma$ is estimated using the same feature vectors.
    Results
    A. 2D ARMA Model
    We test the proposed algorithm using the University of South Florida digital mammography library, available online at: http://marathon.csee.usf.edu. The Digital Database for Screening Mammography (DDSM) is a resource for use by the mammographic image analysis research community. Each mammogram image is 256 × 256 pixels. The ARMA parameters were estimated using a window of size 16 × 16. Hence, each image is represented by 256 ARMA feature vectors. We define the optimal ARMA model degree as the degree that minimizes the average square error between the original image and the predicted ARMA model. An exhaustive off-line search over a range of model degrees identifies the degree that leads to the
  • 78. smallest average square error for most mammogram images in the DDSM library. Figure 8 shows 2D-ARMA models of an original healthy mammogram.
    B. Change Detection Algorithm
    We can estimate the value of $\theta_0$ (the parameter before the change) as the sample mean of the first 10 feature vectors. Another approach is to estimate the value of $\theta_0$ from the entire mammogram image. This method yields an estimation error not higher than the relative size of the microcalcifications in the image, i.e., about 1%. Both methods yielded similar estimates of the parameter. The detection algorithm depends on the value of the threshold $h$, which was chosen experimentally. Figure 10 displays the decision function of four sample mammograms, two cancerous and two normal. The peaks of the cancerous images are, on average, twice as high as those of the healthy images. We therefore found that a value of $h$ equal to the mean of the highest cancerous peak and the highest normal peak achieves an optimal balance between false and missed alarms.
    Figure 9 shows a plot of the average grey level of the sub-images of healthy and cancerous mammograms. A simple plot of the grey level values does not discriminate between healthy and cancerous mammograms. However, the proposed change detection algorithm leads to a decision function that is uniform for healthy mammograms and spiky for cancerous mammograms, where the spikes indicate the position of microcalcifications.
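The procedure described in this subsection can be sketched in a few lines. The sketch below assumes the standard GLR statistic for a change in the mean of a Gaussian sequence with known pre-change mean, $g_k = \max_{1 \le j \le k} \frac{k-j+1}{2} (\chi_j^k)^2$, as given in [44]. All function names are hypothetical, the inverse covariance matrix is assumed to be supplied by the caller, and the code is an illustration rather than the implementation used for the reported results.

```python
def estimate_theta0(features, n_init=10):
    """Pre-change mean: sample mean of the first n_init feature vectors."""
    head = features[:n_init]
    r = len(head[0])
    return [sum(y[c] for y in head) / len(head) for c in range(r)]


def glr_decision_function(features, theta0, sigma_inv):
    """Return g_k for every block k (0-based list index).

    features  : list of r-dimensional feature vectors (lists of floats)
    theta0    : pre-change mean (list of r floats)
    sigma_inv : inverse covariance matrix as a list of r rows (assumed given)
    """
    r = len(theta0)
    # Prefix sums give every window mean without re-summing the window.
    prefix = [[0.0] * r]
    for y in features:
        prefix.append([p + v for p, v in zip(prefix[-1], y)])
    g = []
    for k in range(1, len(features) + 1):
        best = 0.0
        for j in range(1, k + 1):
            m = k - j + 1  # number of vectors in the window j..k
            d = [(prefix[k][c] - prefix[j - 1][c]) / m - theta0[c]
                 for c in range(r)]
            # Squared Mahalanobis distance between window mean and theta0.
            chi2 = sum(d[a] * sigma_inv[a][b] * d[b]
                       for a in range(r) for b in range(r))
            best = max(best, 0.5 * m * chi2)
        g.append(best)
    return g


def threshold_from_peaks(cancerous_curves, normal_curves):
    """Mean of the highest cancerous peak and the highest normal peak."""
    highest_cancerous = max(max(g) for g in cancerous_curves)
    highest_normal = max(max(g) for g in normal_curves)
    return 0.5 * (highest_cancerous + highest_normal)


def first_alarm(g, h):
    """0-based index of the first block whose statistic reaches h, or None."""
    for k, value in enumerate(g):
        if value >= h:
            return k
    return None
```

The double loop costs on the order of n² window evaluations for n feature vectors, which is negligible for the 256 vectors per image used here. A threshold chosen from labeled example curves, as in `threshold_from_peaks`, depends on the examples available and would need to be validated on held-out images.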
  • 79. By lexicographical ordering of the image and its feature vectors, we are able not only to discriminate between normal and cancerous mammograms but also to pinpoint the location of microcalcifications in the cancerous image. The peaks of the decision function can be easily traced back to the suspicious areas. Figure 11 shows a radiologist’s marked area of suspicion, which is successfully detected as cancerous by our algorithm. Table 3 displays the performance of our algorithm based on 524 normal and cancerous digital mammograms from the DDSM library. Based on this analysis, the sensitivity and specificity of the proposed algorithm are 96.9% and 97.8%, respectively.
    Table 3 Performance of the change detection algorithm in 524 normal and cancerous mammograms
                 True   False
    Positive     96%    4%
    Negative     97%    3%
    Figure 8 2D ARMA modeling: (a) original (healthy) mammogram; (b) 2D ARMA[2,2,2,2] model of (a); (c) 2D ARMA[3,3,3,3] model of (a); (d) 2D ARMA[4,4,4,4] model of (a); (e) 2D ARMA[6,6,6,6] model of (a).
  • 80. Figure 9 Change detection algorithm: (a) a normal (healthy) mammogram; (b) 2D ARMA[2,2,2,2] model of (a); (c) plot of the average gray level of the 16x16 sub-images in (a); (d) plot of the decision function for the image in (a); (e) original cancerous image; (f) 2D ARMA[2,2,2,2] model of (e); (g) plot of the average gray level of the 16x16 sub-images in (e); (h) plot of the decision function for the image in (e).
    Figure 10 The decision function for four mammograms; cancerous in red/magenta and normal in blue/green. The value of the threshold is determined as the mean of the highest normal peak and the highest cancerous peak.
  • 81. Figure 11 Change detection algorithm: (a) radiologist-marked area of interest; (b) plot of the decision function of the mammogram, where arrows indicate the peaks above the threshold; (c) marked 16x16 clusters that correspond to the detected peaks.
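Tracing a peak of the decision function back to a region of the image, as in Figure 11(c), amounts to inverting the ordering of the sub-blocks. The sketch below is a simplified illustration under stated assumptions: blocks are taken to be numbered in row-major (lexicographic) order starting from 0, the defaults of a 256-pixel image width and 16x16 blocks are inferred from the figure captions and the count of 256 feature vectors, and every block whose statistic reaches the threshold is marked. The function names are hypothetical.

```python
def block_to_pixel_box(k, image_width=256, block_size=16):
    """Map a 0-based block index to its pixel box (top, left, bottom, right).

    Assumes row-major ordering of non-overlapping square blocks;
    bottom and right are exclusive bounds.
    """
    blocks_per_row = image_width // block_size
    block_row, block_col = divmod(k, blocks_per_row)
    top = block_row * block_size
    left = block_col * block_size
    return (top, left, top + block_size, left + block_size)


def suspicious_regions(g, h, image_width=256, block_size=16):
    """Pixel boxes of all blocks whose decision statistic reaches h."""
    return [block_to_pixel_box(k, image_width, block_size)
            for k, value in enumerate(g) if value >= h]
```

Because the statistic accumulates evidence over a window of blocks, a peak may lag the block where the change begins; a more careful localization would use the window start that attains the maximum rather than the peak index itself.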
  • 82. Chapter 6 Automated Biofilm Region Recognition and Morphology Quantification from Confocal Laser Scanning Microscopy Imaging
    Abstract
    Staphylococcus aureus is an opportunistic human pathogen and a primary cause of nosocomial infections. Its biofilm-forming capability is an adaptation strategy utilized by many species of bacteria to overcome stressful environmental conditions, and it provides both resistance to antimicrobial treatments and protection from the host immune system. This chapter addresses a growing demand for an objective, fully automated method of biofilm structure description with standardized parameters that are independent of user input. In this study, we used watershed segmentation to analyze and compare confocal laser scanning microscopy (CLSM) images of two S. aureus strains with different biofilm-forming capabilities. Results are compared with manual calculations as well as with the commonly used COMSTAT software.
    Introduction
    Staphylococcus aureus is an opportunistic human pathogen responsible for a wide range of diseases that vary in clinical presentation and severity. S. aureus can cause diseases ranging from minor skin infections to life-threatening conditions such as pneumonia, osteomyelitis and toxic shock syndrome. Recent
  • 83. significant changes in health care delivery and antimicrobial resistance patterns have caused a shift in the epidemiology of S. aureus. This has been evidenced by a dramatic increase in methicillin-resistant S. aureus (MRSA) infection rates which, at least in the United States, has led MRSA mortality rates to be higher than those of HIV [45]. The public health concern caused by S. aureus-related infections has led to extensive efforts to improve the efficacy of available therapies as well as to introduce new pharmaceuticals. Both strategies are challenged by the fact that S. aureus infections are associated with the formation of a biofilm, which limits the efficacy of therapy by creating a resistance to antimicrobials and by protecting the bacteria from the host immune system. In order to conduct studies on targeting biofilms therapeutically, it is necessary to be able to quantitatively measure biofilm morphological characteristics such as area, biomass and thickness. In this chapter, we consider a clinical isolate (UAMS-1), which forms a robust, dense and uniformly distributed biofilm, as well as its isogenic variant carrying a mutation in the sarA gene, which limits its ability to form a biofilm.
    For the assessment of biofilm structure, CLSM has been described as an ideal technique [46]. By using several fluorescent stains or conjugated antibodies in combination with multichannel 3D CLSM, the location of different biofilm constituents can be recorded. Using these data sets, the three-dimensional architecture of the biofilms can be reconstructed and quantified with digital image analysis. There is a wide range of available software capable of analyzing biofilm morphology, including COMSTAT, ImageJ, ISA3D and Volocity. However, they all rely on thresholding to segment the
  • 84. biofilm. Specifically, the automated segmentation procedure is implemented in two steps: (i) thresholding using user-dependent parameters [47], [48], followed by (ii) connected volume filtration [49]. The purpose of this work is to create a fully automated method of biofilm segmentation and quantification that does not rely on user input or thresholding.
    Quantification of biofilm structure
    Quantitative parameters describing the physical structure of the biofilm have been extracted from three-dimensional confocal laser scanning microscopy images and used to compare different biofilm structures. The quantitative descriptive parameters chosen for this study are: (i) area occupied by biomass in each cross-section, (ii) biomass in biovolume, (iii) thickness distribution and (iv) average thickness.
    Morphology quantification parameters
    The following parameters are used to describe the biofilm structure:
    ▪ area occupied by biomass in each cross-section: measured as the total sum of all the unit areas (pixels) of each CLSM cross-section categorized as occupied area.
  • 85. Equation 48: $$A_z = \oint_{C_z} \mathrm{d}A = \sum_{i} a_{z,i}$$
    where:
    o $A_z$: occupied area in cross-section z,
    o $C_z$: closed contour of the occupied area,
    o $a_{z,i}$: cell of a cross-section recognized as occupied area.
    ▪ biomass in biovolume, V: measured by numeric integration of the area of the microbial colonization profiles, following a method previously described in [50].
    Equation 49: $$V = \int A(z)\, \mathrm{d}z \approx \left[\sum_{z=1}^{N} A_z\right] \Delta z$$
    where:
    o $N$: number of horizontal cross-sections,
    o $\Delta z$: z-step in the image stack.
    ▪ thickness distribution: the number of occupied clusters in each cross-section over the total number of clusters in a cross-section of the CLSM 3D image.
    ▪ average thickness: calculated as the average height to which the clusters of the biofilm rise from the solid substratum in the z direction, across cross-sections.
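The four parameters listed above can be computed directly from a segmented image stack. The following sketch is a minimal illustration under several assumptions that the text does not fix: the stack is a nested list `stack[z][y][x]` of booleans with `z = 0` at the substratum, the volume integral is approximated by a rectangle rule (sum of areas times the z-step), the height of a column is taken as the position of its highest occupied cell, and the average thickness is taken over columns that contain biomass. The function name and parameter names are hypothetical.

```python
def biofilm_metrics(stack, pixel_area=1.0, dz=1.0):
    """Basic morphology parameters of a segmented CLSM stack.

    stack[z][y][x] is truthy where the cell is classified as biomass.
    pixel_area is the area of one pixel and dz the z-step, in any
    consistent units.
    """
    n_slices = len(stack)
    height = len(stack[0])
    width = len(stack[0][0])
    cells_per_slice = height * width

    # (i) Occupied area in each cross-section: occupied pixels x pixel area.
    areas = [pixel_area * sum(1 for row in section for cell in row if cell)
             for section in stack]

    # (ii) Biovolume: rectangle-rule integration of the area profile along z.
    biovolume = dz * sum(areas)

    # (iii) Thickness distribution: occupied fraction of each cross-section.
    occupied_fraction = [a / (pixel_area * cells_per_slice) for a in areas]

    # (iv) Average thickness over columns that contain any biomass.
    heights = []
    for y in range(height):
        for x in range(width):
            occupied = [z for z in range(n_slices) if stack[z][y][x]]
            if occupied:
                heights.append((occupied[-1] + 1) * dz)
    average_thickness = sum(heights) / len(heights) if heights else 0.0

    return {
        "areas": areas,
        "biovolume": biovolume,
        "occupied_fraction": occupied_fraction,
        "average_thickness": average_thickness,
    }
```

Measuring the column height from the highest occupied cell ignores voids inside the biofilm; a definition based on the number of occupied cells per column would give a smaller value for porous structures, so the choice should be stated when results are compared across tools.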
  • 86. Based on the four aforementioned “basic” parameters, other characteristics of the biofilm can be calculated. For example, after the biomass is segmented from the background, a number of features, including roughness of the film, porosity and thickness, can be obtained. These parameters can be used together to uniquely describe the biofilm structure and eventually to differentiate between biofilms of different strains.
    Image processing tool
    The software suite of image processing operations was implemented in the Matlab programming environment (Matlab 2010a, The Mathworks, Inc.). Matlab was chosen for the convenience it offers for matrix calculus. In order to evaluate our results, we used manually calculated data as a baseline and the widely used Matlab software COMSTAT for comparison. In our approach, we use the watershed segmentation method based on Fernand Meyer’s algorithm [51].
    Preprocessing and methodology
    Segmentation is one of the most difficult image processing operations. The biofilm segmentation task is even harder because the biofilm is a disconnected structure. This difficulty may explain the use of simple thresholding in widely adopted biofilm analysis systems such as COMSTAT. Nonetheless, after trying several segmentation algorithms, it became apparent that the
  • 87. watershed transformation provides the most accurate segmentation of the biofilm structure. The watershed transformation finds “catchment basins” and “watershed ridge lines” in an image by treating it as a surface where light pixels are high (area of interest) and dark pixels are low (background). Segmentation using the watershed transformation works best if one can identify, or “mark,” foreground objects and background locations. This marking process is done automatically with reference to the black background of the CLSM image. Marker-controlled watershed segmentation follows this basic procedure:
    1. Compute a segmentation function. This is an image whose dark regions are the objects to be segmented.
    2. Compute foreground markers. These are connected groups of pixels within each of the objects.
    3. Compute background markers, using the gradient magnitude as the segmentation function. These are pixels that are not part of any object.
    4. Modify the segmentation function so that it only has minima at the foreground and background marker locations.
    5. Compute the watershed transform of the modified segmentation function.
    Growth and CLSM of static biofilm
    The wells of Costar 3596 plates (Corning Life Sciences, Acton, MA) were coated overnight at 4 °C with 20% human plasma (Sigma) in bicarbonate buffer.