DNA-based signatures defend against biological warfare agents and their makers
ARCHANA TIWARI, SUSMIT KOSTA, ROOPESH JAIN
With the end of the Cold War, the threat of nuclear holocaust faded, but
another threat emerged: attack by terrorists or even nations using biological
agents such as bacteria, viruses, biological toxins, and genetically altered
organisms. The former Soviet Union once had a formidable biological weapons
program. Now, several countries and extremist groups are believed to possess
or to be developing biological weapons that could threaten urban populations,
destroy livestock, and wipe out crops. Even terrorists with limited skills and
resources could make biological weapons without much difficulty: production is
not complex, it is not expensive, and it does not require a large facility. For
these reasons, biological weapons have been dubbed the poor man's atomic bomb.
Contributing to the ease of making and concealing biological weapons is the
dual-use nature of the materials needed to produce them, which are also found
in many legitimate medical research and agricultural activities.
The agents used in biological weapons are difficult to detect and to identify
quickly and reliably. Yet early detection and identification are crucial for
minimizing their potentially catastrophic human and economic cost.
A major objective of biological warfare defense is developing better equipment,
both fixed and portable, to detect biological agents. However, any detection
system depends on knowing the signatures of organisms likely to be used
in biological weapons. These signatures are telltale bits of DNA unique to
pathogens (disease-causing microbes). Without proper signatures, medical
authorities could lose hours or days trying to determine the cause of an
outbreak, or they could be treating victims with ineffective antibiotics. Because
of this importance, biological signatures are a key thrust of the effort to
improve response to terrorist attacks. Building on the work of the past several
years, scientific teams expect to produce species-level signatures for all the
most likely biological warfare pathogens. The teams also expect to have an
initial set of species-level signatures for likely agricultural pathogens,
because an attack on a nation's food supply could be just as disruptive as an
attack on the civilian population.
Modern health and security concerns have raised interest in the real-time
detection and identification of pathogenic microbes. Bacterial and viral
pathogens have always represented one of the greatest threats to human
health, and in recent times this threat has increased due to the possibility of
engineered biological agents. For these and other reasons, the genome
sequencing field has targeted and sequenced the complete genomes of
hundreds of bacteria and thousands of viruses over the past decade, with
many more sequences expected to appear in the near future. These sequences
now make it possible to develop probe-based assays capable of identifying
any of hundreds of organisms in environmental and clinical samples. Such
assays rely on detecting a DNA sequence that distinguishes the target organism
from all other known bacteria and viruses and from background material, which
could include DNA from humans, other animals, plants, or other species. A
probe that accurately distinguishes between a target genome (or set of
genomes) and all other background genomes is termed a signature sequence.
DNA signatures are nucleotide sequences that can be used to detect the
presence of an organism and to distinguish that organism from all other species.
Several Levels of Signature
The prime aim is to develop strain-level signatures for the top suspected
agents. Strains are a subset of a species, and their DNA may differ by about 0.1
percent within the species. A species, in turn, is a member of a larger related
group (genus), and its DNA may differ by a percent or so from that of other
members of the genus. Strain-level signatures are essential for determining
the native origin of a pathogen associated with an outbreak; such information
could help law enforcement identify the group or groups behind the attack.
The biological foundations work aims to provide validated signatures
useful to public health and law enforcement agencies as well as classified
signatures for the national security community. In developing these signatures,
biological foundations researchers are also shedding light on poorly understood
aspects of biology, microbiology, and genetics, such as immunology, evolution,
and virulence. Increased knowledge in these fields holds the promise of better
medical treatments, including new kinds of vaccines. The biological foundations
work is one element in DOE's Chemical and Biological Nonproliferation Program.
Livermore's component of this work is managed by its Nonproliferation, Arms
Control, and International Security Directorate. Other components of the overall
program include detection, modeling and prediction, decontamination, and
technology demonstration projects. Livermore researchers were among the
first to recognize, in the early 1990s, the tremendous potential of detectors
based on DNA signatures. "We knew that a lot of work was necessary to
An ideal signature:
• Has few short regions (less than 0.1% of total)
• Occurs only in the pathogen
  - No false positives
• Occurs in all variants (strains)
  - No false negatives
• Is necessary for virulence
  - "Unspoofable"
  - In engineered organisms
• Tests for antibiotic resistance
• Provides redundancy
Figure 1: Bacterial chromosomes (DNA) form loops, unlike human
chromosomes, which form strands. In the loop, between two and five million
bases of bacterial DNA are screened to locate unique regions (circled), which
are marked with primer pairs. The marked regions are amplified thousands
of times using polymerase chain reaction technology and then processed to
identify and characterize an organism.
classes of threats, such as agricultural pathogens. Two extremely virulent
pathogens head the list: B. anthracis and Y. pestis, which cause anthrax and
plague in humans, respectively. Bacillus anthracis has few detectable
differences among its strains, whereas Y. pestis strains can vary considerably
in genetic makeup. Unraveling the significant differences between the two
organisms will give national laboratory researchers experience vital for facing
the challenges of the next few years, as they develop signatures for a wide
spectrum of microbes.
of confidence requires several days; the goal again is to reduce the time to less
than 30 minutes. The final signature level, intended primarily for law enforcement
use, will permit detailed identification of a specific strain of a pathogen (for
example, Yersinia pestis KIM) and correlate that strain with other forensic
evidence. Such data will help to identify and prosecute attackers. The typical
time lag for results is currently a few weeks, and the goal is to reduce
that to a few days.
Biological scientists assemble a list of natural pathogens most likely to
be used in a domestic attack. The list includes bacteria, viruses, and other
Bioterrorism and Biological Warfare
develop the signatures the new detectors would need," says Weinstein. In
particular, the researchers recognized several pitfalls. For example, if signatures
are overly specific, they do not identify all strains of the pathogen and so can
give a false-negative reading. On the other hand, if signatures are based on
genes that are widely shared among many different bacteria, they can give a
false-positive reading. As a result, signatures must be able, for example, to
separate a nonpathogenic vaccine strain from an infectious one.
Several Levels of Identification
To enhance their detection development effort, researchers are exploring
advanced methods that distinguish slight differences in DNA, using a
multidisciplinary approach. DNA signature development involves a team of
microbiologists, molecular biologists, biochemists, geneticists, and computer
experts. Much of the work is focused on screening the two to five million bases
that make up a typical microbial genome to design unique DNA markers that will
identify the microbe. Researchers are also performing suppressive subtractive
hybridization to distinguish the DNA of various species of virulent organisms.
The markers, called primer pairs, typically contain about 30 bases each and
bracket specific regions of DNA that are a few hundred bases long (Figure 1).
The bracketed regions are replicated many thousands of times with a detector
that uses polymerase chain reaction (PCR) technology. Then they are processed
to unambiguously identify and characterize the organism of interest.

Figure: This phylogenetic tree is a simple representation of the bacterial
kingdom. All human bacterial pathogens belong to the Gram-positive (red) or
Proteobacteria (magenta) divisions. The other divisions consist of
nonpathogenic bacteria associated with diverse environments. Biological
signatures must be able to differentiate infectious bacteria from hundreds of
thousands of harmless ones. Each genus of bacteria has many species, and
each species can have thousands of different strains.

Different signatures will be needed for different levels of resolution. For
example, authorities trying to characterize an unknown material or respond to
a suspected act of bioterrorism will begin with fairly simple signatures that
flag potentially harmful pathogens within a few minutes. Typically, such a
signature would encompass one or two primer pairs and be sufficient for
identification at the genus level (Yersinia or Bacillus, for example) or below.
A signature in the next level of resolution is needed for unambiguously
identifying a pathogen at the species level (Yersinia pestis, for example).
This signature involves about 10 primer pairs. Currently, it takes several days
to obtain conclusive data for a species-level signature. The goal is to reduce
that time to less than 30 minutes.
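
The way a primer pair brackets a region of DNA can be illustrated with a small
in-silico sketch. It is a simplification that assumes exact primer matches on a
linear sequence; the toy template, the primers, and the function name
`find_amplicon` are hypothetical and are not taken from any real assay.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    complement = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(complement[base] for base in reversed(seq))


def find_amplicon(template, forward, reverse):
    """Return the region bracketed by a primer pair, or None if a primer is absent.

    Assumes exact matching on the given strand: the forward primer appears
    as-is, and the reverse primer binds the opposite strand, so its reverse
    complement appears downstream on the given strand.
    """
    start = template.find(forward)
    if start == -1:
        return None
    reverse_site = reverse_complement(reverse)
    end = template.find(reverse_site, start + len(forward))
    if end == -1:
        return None
    return template[start:end + len(reverse_site)]


# Hypothetical toy template and primers (real primers are longer).
template = "TTGACCGATAGGCTAACGTTAGCCATGGAATTCCGGA"
forward = "GACCGATA"
reverse = "GAATTCCA"  # its reverse complement, TGGAATTC, occurs in the template
amplicon = find_amplicon(template, forward, reverse)
```

If either primer site is missing, the function returns None, which corresponds
to a sample in which the bracketed region would not be amplified.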
The third signature level is used in pathogen characterization, identifying
any features that could affect medical response (for example, harmless vaccine
materials versus highly virulent or antibiotic-resistant pathogens). This
signature level involves some 20 to 30 primer pairs. Together, the primer pairs
offer a certainty of correct identification. Currently, providing such a high level
During the Cold War, the Soviet Union ran several offensive biowarfare
programs to develop so-called "Super Bugs." One such program, Project
Bonfire, worked to create bacteria that were resistant to about ten varieties of
antibiotics (Figure 2). This was done by identifying and cutting out genes that
conferred antibiotic resistance in many different strains of bacteria. By pasting
these genes into the DNA of the anthrax bacterium, the Project Bonfire
researchers created a strain of anthrax that resisted any existing cure, making it
impossible to treat.
The Hunter Program was another Soviet biological warfare research
program that focused on combining whole genomes of different viruses to
produce completely new hybrid viruses. These artificial viruses could cause
unpredictable symptoms that have no known treatment. In an innovative twist,
the Hunter Program researchers also created bacteria strains that carried
pathogenic viruses inside them (Figure 3).
Figure 3: Hunter Project
These strains would be double trouble: a person who contracted the
bacterial disease would likely be treated with an antibiotic, which would stop
the infection by disrupting the bacterial cells. This would release the virus,
resulting in an outbreak of viral disease. Such a scenario would confuse medical
personnel, making treatment very difficult.
Figure 2: Project Bonfire
Biological agents
In their natural state, bacteria, viruses, and fungi can make pretty good
biological weapons. Throw some genetic engineering into the mix, however,
and more harmful agents can emerge.
Each of these organisms maintains its genetic information in the form of
DNA or, in some viruses, RNA. This genetic material contains genes, which
encode all of the information the organism needs to survive and replicate.
Some of these genes govern the organism's pathogenicity, or its ability to
infect a cell of a plant or animal. Through genetic engineering, pathogenicity
genes may be manipulated to make the organism more infectious or more
resistant to a therapy or cure.
Figure 4: Idaho Technology's R.A.P.I.D. detection unit.
• The state health department conducts a full investigation to
  determine whether the incident is an act of biological warfare. To
  protect themselves from any potentially harmful biological or chemical
  agents, investigators at the scene are outfitted in protective suits
  and self-contained breathing apparatus (SCBA) respirators (similar
  to the "SCUBA" gear used underwater).
• Investigators collect samples from patients and the surrounding
  environment, then test them for the presence of harmful biological
  agents. In order to know which agents to test for, investigators
  evaluate all of the evidence they collect at the scene, including signs
  and symptoms shown by patients and patterns of disease
  transmission. Through a process of deduction, investigators can
  narrow the list of suspected pathogens to just a few candidates.
• While testing can take place in existing laboratories, it can be
  performed more quickly in temporary field laboratories, using
  compact, portable detection units such as Idaho Technology's
  R.A.P.I.D., which stands for "Ruggedized Advanced Pathogen
  Identification Device." The R.A.P.I.D. detection unit uses PCR to
  identify the unique DNA signatures of suspected pathogens (Figure 4).
  Genomic DNA or RNA extracted from collected samples is added to
  a cocktail of reagents that will amplify a particular pathogen's DNA
  signature. If that specific pathogen is present in the sample collected,
  it will be positively identified using this approach. The entire process,
  from sample preparation to detection, takes less than 60 minutes.
• If a biowarfare incident is confirmed or thought to be probable, the
  state investigators notify the FBI and local law enforcement agencies
  immediately. Law enforcement and health officials work together to
  implement a plan to contain the site of contamination, clean it up, and
  pinpoint the source of the attack.
Defensive biological warfare: vaccines and detection methods
In 1969, President Richard M. Nixon terminated the U.S. offensive
biological warfare program and ordered stockpiles destroyed. The biological
warfare research focus shifted from offensive to defensive techniques. Three
years later, at the 1972 Biological and Toxin Weapons Convention, more than
100 nations signed a treaty prohibiting the possession of deadly biological
agents, except for purposes of defensive research. Nations around the world
concentrated on developing vaccines as well as enhancing detection of
biological agents. The Soviet Union signed the treaty, but instead of dismantling
its offensive program, it stepped up the pace. The Soviet program was
not terminated until 1992, after the collapse of the Soviet Union, when Russian
President Boris Yeltsin banned all offensive biological weapons-related activity.
All biological weapons stockpiles were destroyed and research was
considerably downsized, but it is unknown whether Russia has completely
dissolved the old Soviet program.
Traditionally, vaccines consisted of a preparation of the infectious agent
itself, either living, weakened, or killed. Introducing the vaccine into the body
activates the immune system, resulting in the production of antibodies against
that particular agent. If a vaccinated person is later exposed to the infectious
agent, he or she will already have built up immunity against it. More recently,
researchers have started using fragments of the pathogen's DNA genome as a
vaccine, rather than the entire organism. This approach helps eliminate the risk
of infection that comes with using traditional vaccines.
While vaccination helps protect a population from known infectious
agents, rapid detection of a suspected act of biowarfare allows fast action to be
taken to control the spread of disease. Current detection methods take advantage
of the fact that each biological agent maintains its own unique DNA signature.
Rapid detection methods use a technique called Polymerase Chain Reaction
(PCR) to make a billion copies of a single DNA strand within minutes. This
method positively identifies an infectious agent, by means of its DNA signature,
using even the tiniest samples.
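
The "billion copies" figure follows from repeated doubling. As a rough check,
the sketch below assumes an idealized reaction in which each cycle multiplies
the copy number by a fixed factor (exact doubling when the efficiency is 1.0;
real reactions are less efficient). The function name and the `efficiency`
parameter are illustrative assumptions, not part of any instrument's software.

```python
import math


def cycles_needed(target_copies, start_copies=1.0, efficiency=1.0):
    """Smallest number of PCR cycles that reaches target_copies.

    Each cycle multiplies the copy number by (1 + efficiency);
    efficiency = 1.0 corresponds to ideal doubling.
    """
    growth = 1.0 + efficiency
    return math.ceil(math.log(target_copies / start_copies) / math.log(growth))


ideal = cycles_needed(1e9)        # ideal doubling from a single template
copies_after_ideal = 2 ** ideal   # copies produced after that many cycles
```

Under ideal doubling, 30 cycles are enough, since 2^30 is about 1.07 billion;
at lower efficiency more cycles are required.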
Putting It All Together
The Dark Winter project gave us a glimpse of how a biowarfare scenario
might unfold. But what is the government's planned response to such a
scenario?
• Although it may be difficult to confirm right away that an unusual
  illness in a community is caused by a biological attack, the local
  health officer is immediately notified. It is this person's responsibility
  to inform the state health department, which in turn notifies the
  federal Centers for Disease Control and Prevention (CDC).
simultaneously analyze 96 strains of DNA. The technique has also been used to
aid the poultry industry by providing a handy way to detect Salmonella
enteritidis. This bacterium can cause illness if eggs are eaten raw or
undercooked. Subtractive hybridization results have been so successful that the
signature can now be used to distinguish between subtypes of the salmonella
bacterium. In addition to the DNA-based pathogen detection methods, researchers
are developing detection capabilities using antibodies that can tag a pathogen
by attaching to a molecular-level physical feature of the organism. Antibody
assays are likely to play an important role in pathogen detection because they
are generally fast and easy to use (commercial home-use medical tests use this
form of assay). Researchers are working to improve these detection methods as
well. Studying a bacteriophage (bacteria-killing virus) that attacks only
Y. pestis and none of its cousins, researchers discovered that the virus
produces a unique protein component to attach to the bacterial cell wall at a
certain site and gain entry. Recognizing the distinct site could form the basis
of a foolproof antibody signature. If this can be achieved with Y. pestis, it
may be possible to do it with other pathogens.
As more information about pathogens and their disease mechanisms
becomes available and as genetic engineering tools to transplant genes become
cheaper and simpler to use, the threat of genetically engineered pathogens
increases. Biodetectors must be able to sense the virulence signatures of
genetically engineered pathogens, or they will be blind to an entire class of
threats. The ultimate objective is to identify several specific virulence factors
that might be used in engineered biological warfare organisms so that we can
detect these engineered organisms and break their virulence pathway. One key
factor useful for detecting engineered organisms is an antibiotic resistance
gene. When transplanted into an infectious microbe, the gene could greatly
increase the effectiveness of a biological attack and complicate medical
response. Some antibiotic resistance genes are widely shared among bacteria
and are easily transferred with elementary molecular biology methods. In fact,
a standard biotechnology research technique is introducing antibiotic resistance
genes into bacteria as an indicator of successful cloning. Such genes need to be
recognized rapidly so that the medical response is appropriate.
Another telltale indication of genetic tampering is the presence of virulence
genes in a microbe that should not contain them. Virulence genes are often
involved in producing toxins or molecules that cause harm or that simply
evade a host defense. If a series of such genes is made available to perform
their functions at the right time, they could cause real damage. If interfering
with the action of one of these genes or its proteins interrupts the virulence
pathway, the disease process can be halted. Identifying and characterizing
important virulence genes and determining their detailed molecular structure
will greatly aid the development of vaccines, drugs, and other medical
treatments. As an example, Y. pestis disables the immune system in humans by
injecting proteins into macrophages, one of the body's key defenders against
bacterial attack.
Figure 5: Two extremely virulent organisms head the list of pathogens most
likely to be used by terrorists: B. anthracis (left) and Y. pestis (right), which
cause anthrax and plague in humans, respectively.
Focus on Plague
The main focus is on Y. pestis, Francisella tularensis (a bacterium causing
a plague-like illness in humans), and several other microbes that threaten human
and animal health. Eleven species and many thousands of strains belong to the
Yersinia genus. The most notorious species, Y. pestis, causes bubonic plague
and is usually fatal unless treated quickly with antibiotics. The disease is
transmitted by rodents and their fleas to humans and other animals. The
seemingly subtle DNA differences among many Yersinia species mask important
differences: one species causes gastroenteritis, another is often fatal, and a
third is virtually harmless, yet all have very similar genetic makeup.
Researchers are using insertion sequence-based fingerprinting to understand
these slight genetic differences. Insertion sequences are mobile sections of DNA
that replicate on their own. Analyzing for their presence will not only help
refine signatures for Y. pestis but also shed light on how microorganisms evolve
into strains that produce lethal toxins. This understanding, in turn, should
give ammunition to researchers seeking an antidote or vaccine. To better
understand the genetic differences among species and strains, researchers are
comparing the genetic complement of Y. pestis with that of another member
of the Yersinia group (Y. pseudotuberculosis) that causes an intestinal disease.
The two are closely related, and yet they cause such different diseases.
Better and Faster, with More Uses
There are a number of methods that allow more rapid identification and
characterization of unique segments of DNA. Each method has advantages
and drawbacks, with some more applicable to one organism than another. In
addition to the insertion sequence method, another promising technique is
called suppressive subtractive hybridization. The method takes an organism
and its near neighbor, hybridizes the DNA from both, and determines the
fragments not in common as the basis of a signature. One goal is to
Because the protein acts as an immunosuppressant to disable the macrophage,
understanding its structure not only would help scientists fashion a drug that
physically blocks the protein but also would shed light on autoimmune diseases
such as arthritis and asthma.
Virulence Genes in Common
Virulence genes spread naturally among pathogens and thus are also
found in unrelated microbial species. Therefore, virulence genes alone are
not sufficient for species-specific DNA-based detection. To differentiate the
virulence genes in natural organisms from those in engineered organisms,
researchers are using different methods for picking out virulence genes from
among the thousands of genes comprising the genomes of pathogens. One technique
looks for genes that start making proteins at the internal temperatures of
mammals. For example, some genes of Y. pestis become much more active at
37°C. It seems a safe bet that many of these genes are associated with the
bacterium multiplying within a warm-blooded host. Researchers have also
determined the sequence of the three plasmids (bits of DNA located outside the
microorganism's circular chromosome) that contain most of the virulence genes
required for full development of bubonic plague in animals and humans. Plasmids
sometimes transfer their genes to neighboring bacteria in what is called lateral
evolution. (Antibiotic resistance genes are also located on plasmids.) The Y.
pestis strain that causes bubonic plague, for example, may have evolved
some 20,000 years ago. Such understanding is relevant to HIV, which may
not have become infectious for humans until the 20th century.
Working with End Users
There needs to be a strong relationship between the development of biological
signatures and detection technologies and their end uses. The aim is to make
diagnostic tools available to regional public health agencies and thus create a
national mechanism for responding quickly to bioterrorism threats. Currently,
many health agencies use detection methods that are not sufficiently sensitive,
selective, or fast. For example, one culture test for detecting anthrax takes two
days. Major damage and even death may have occurred in that time. DNA
signatures will be thoroughly validated before being released, because their
use might lead to evacuations of subways, airports, or sporting events, and
such evacuations cannot be undertaken lightly. As part of the validation effort,
researchers are characterizing natural microbial backgrounds to make sure that
the signatures are accurate under actual conditions. To that end, researchers are
collecting background microbial samples in air, water, and soil, as well as in
human blood, urine, and saliva. B. anthracis is related to B. thuringiensis, a
naturally occurring harmless microbe that lives in dirt and can give a
false-positive reading for anthrax if the signature used is not adequately
specific. The characterization effort is being aided by a device called the Gene
Chip. The device simultaneously monitors the expression of thousands of genes.
Equally important, the researchers envision a strong mechanism linking biomedical
scientists with public health and law enforcement officials to develop new
signatures speedily and cost-effectively to stay several steps ahead of
DNA signatures are nucleotide sequences that can be used to detect
the presence of an organism and to distinguish that organism from all other
species. Here we describe Insignia, a new, comprehensive system for the
rapid identification of signatures in the genomes of bacteria and viruses.
With the availability of hundreds of complete bacterial and viral genome
sequences, it is now possible to use computational methods to identify
signature sequences in all of these species, and to use these signatures as
the basis for diagnostic assays to detect and genotype microbes in both
environmental and clinical samples. The success of such assays critically
depends on the methods used to identify signatures that properly differentiate
between the target genomes and the sample background. We have used
Insignia to compute accurate signatures for most bacterial genomes and
made them available through our Web site. A sample of these signatures has
been successfully tested on a set of 46 Vibrio cholerae strains, and the
results indicate that the signatures are highly sensitive for detection as well
as specific for discrimination between these strains and their near relatives.
Our approach, whereby the entire genomic complement of organisms is
compared to identify probe targets, is a promising method for diagnostic
assay development, and it provides assay designers with the flexibility to
choose probes from the most relevant genes or genomic regions. The Insignia
system is freely accessible via a Web interface and has been released as
open source software at: http://insignia.cbcb.umd.edu.
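
The two properties reported above, sensitivity and specificity, can be made
concrete with a small sketch. It treats a signature as detected when it occurs
as an exact substring, which is a simplification of a real hybridization assay.
The sequences and the function name are hypothetical; this is not Insignia's
code or data.

```python
def evaluate_signature(signature, targets, background):
    """Return (sensitivity, specificity) of a signature under exact matching.

    sensitivity: fraction of target genomes that contain the signature
    specificity: fraction of background genomes that do not contain it
    """
    true_positives = sum(signature in genome for genome in targets)
    true_negatives = sum(signature not in genome for genome in background)
    return true_positives / len(targets), true_negatives / len(background)


# Hypothetical toy sequences standing in for target strains and near relatives.
targets = ["ACGTTGCAAGGCT", "TTACGTTGCAAGG", "ACGTTGCAATTTT", "GGGACGTAGCAAC"]
background = ["ACGTAGCTAGGCT", "TTTTGGGGCCCCA", "CACGTTGCAAGAA"]

sensitivity, specificity = evaluate_signature("ACGTTGCAA", targets, background)
```

In this toy panel the signature misses one target strain (a false negative) and
also appears in one near relative (a false positive), so neither value reaches 1.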
Occurrence and Expression of Insignia
anthracis whose 16S rRNA sequences are identical [Keim et al., 1999, 2000].
Although these methods are effective, they provide only a limited number of
signatures, which are not always sufficient to identify bacteria or viruses in a
new sample; in particular, if the sample contains an unknown strain, it might
contain genetic variability in precisely the region for which assays are designed.
Thus, in general, one would like to have as many assays available as possible.
Insignia addresses this by using the complete genome to generate all unique
signatures, from which the assay designer can choose those that are best
suited for a particular application.
Recent increases in the amount of available genomic sequence have made
it possible to largely automate the design and screening of probes via
computational search algorithms. Large-scale computational prediction of DNA
signatures was first undertaken for the Biological Aerosol Sentry and
Information System (BASIS), deployed at the Salt Lake City Olympic Games in
2002 [Fitch et al., 2002, 2003]. The related BioWatch project operates by
collecting and analyzing airborne microbial samples for known pathogens,
using PCR probe-based detection methods. Newer aerosol detection systems,
such as the Autonomous Pathogen Detection System (APDS) [McBride et al.,
2003], automate the process and can identify a known bioweapon in 0.5 to 1.5
hours [Brown, 2004]. Similar techniques are not limited to aerosols and can be
used in clinical or agricultural settings [Lim, 2005].
The success of these assays depends on both the available sequence
databases and the computational methods used to identify signatures that
differentiate the threat organisms from the background. Signature design for
both BASIS and BioWatch was handled by Lawrence Livermore National
Laboratory (LLNL), and what began as a simple proof-of-concept BLAST
search at LLNL evolved into the sophisticated KPATH signature pipeline [Slezak
et al., 2003]. KPATH identifies sequences shared by a collection of target
genomes, yet unique with respect to all other microbial genomes, and is notable
for its ability to handle such a large search space. Other methods for probe
selection more rigorously address hybridization efficiency (binding energy,
self-hybridization, etc.), but do not scale well for large target and background
sets [Kaderali and Schliep, 2002; Gordon and Sensen, 2004; Nordberg, 2005;
Li F, 2001]. Most notable are the approaches that promise the scalability of
KPATH combined with the hybridization considerations of the other methods
[Tembe et al., 2007; Rahmann, 2003].
Because of its history of use in real-world diagnostic systems, a more detailed description of KPATH is warranted. It consists of four major components. First, a whole-genome multi-alignment is performed on a set of target genomes. This produces a "consensus gestalt," which represents the sequences that are conserved in all the target genomes. Next, this consensus is matched against a database of background sequences using Vmatch [Kurtz, 2003]. This step computes all exact matches between the target consensus and
Bioterrorism and Biological Warfare
By definition, a signature sequence must be conserved among a set of target genomes and dissimilar to any sequence in the surrounding environment. To detect a target with existing technology such as qPCR assays, signatures must be relatively short; however, if they are too short, they will not be unique. For example, because there are only 4^10 ≈ 1 million 10-bp (base-pair) sequences, and a typical bacterial genome is more than 1 million bp in length, most 10-mers will be shared by many genomes and therefore make unsuitable signatures. Increasing the length, k, of the signature alleviates this problem, but if k is too large, it may not be possible to find a signature shared by a set of target genomes. Therefore, there is a tradeoff between signature sensitivity (the number of genomes that share the signature) and specificity (the number of genomes that do not possess the signature). For instance, a long signature may be highly specific to a particular strain or isolate, but it may not be sensitive enough to detect closely related strains that might cause the same disease or have other shared phenotypic characteristics. Because genomic sequence is nonrandom, and only a small sample of genomes has been sequenced, it is difficult to estimate an optimal signature length. In practice, signature length is usually determined by the constraints of the detection technology (e.g., ~20 bp for PCR primers).
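
The arithmetic behind the 10-mer example can be checked with a short calculation. The sketch below is only illustrative: the function names are hypothetical, and the probability estimate assumes independent, uniformly distributed bases on a single strand, which real genomes do not satisfy (the text itself notes that genomic sequence is nonrandom).

```python
def kmer_space(k: int) -> int:
    """Number of distinct DNA sequences of length k over the alphabet A, C, G, T."""
    return 4 ** k


def chance_kmer_present(k: int, genome_length: int) -> float:
    """Rough probability that one fixed k-mer occurs at least once in a genome.

    Assumption (not from the source): bases are independent and uniform, and
    overlaps between neighbouring positions are ignored.
    """
    positions = genome_length - k + 1
    return 1.0 - (1.0 - 1.0 / kmer_space(k)) ** positions


# Compare several signature lengths against a 1-million-bp genome.
for k in (10, 14, 18, 20):
    print(k, kmer_space(k), chance_kmer_present(k, 1_000_000))
```

Under these simplifying assumptions a given 10-mer is more likely than not to appear somewhere in a 1-million-bp genome, whereas a given 20-mer is very unlikely to appear by chance, which is consistent with the practical choice of roughly 20 bp for PCR primers.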
Current probe-based technologies are generally based on either PCR or microarray hybridization. These methods are beginning to replace traditional gel-based fingerprinting because they can more effectively differentiate between closely related microbes [Willse et al., 2004]. Microarray methods are particularly promising because of their ability to multiplex many probes on a single chip [Willse et al., 2004; Wang et al., 2002; Volokhov et al., 2004], improving both the redundancy and capabilities of the diagnostic. PCR does not multiplex as nicely; however, it remains popular because of its robustness, speed, and low cost [Slezak et al., 2003; O'Connell et al., 2006; Moser, 2006]. Unlike restriction fingerprinting, both PCR and microarray methods require explicit knowledge of the underlying DNA sequence, therefore necessitating probe design.
Traditional probe design strategies have focused on single genes or other loci that are determined a priori to be useful in distinguishing one target organism from another. Examples include genes that are associated with phylogenetic distance (e.g., 16S rRNA genes) and variable number tandem repeats (VNTRs). In the former case, where the gene or locus is conserved among target and nontarget organisms, gene sequence alignments would be used to aid in probe design. Probes would then be manually designed and screened for sensitivity and specificity to the target. Those assays failing to identify all target organisms, or producing false positives, would be invalidated and the design revised. This manual screening made diagnostic assay design expensive and only worth doing for a few select pathogens. Alternatively, variable number tandem repeats (VNTRs) have proven very useful in classifying and distinguishing many closely related strains of bacteria, such as Bacillus
DNA-Based Signatures Against Biological Warfare
the matches may take days to compute, the signatures can be extracted from this cached information in seconds.
The function of the match pipeline is to identify exact matches between all pairs of target and background sequences in the database. The size of the Insignia sequence database is currently about 60 billion nucleotides, and even with the linear-time algorithms described below, this is too large to search in real time. Some computational effort is saved by limiting targets to microbial genomes only, but the process of matching all pairs of target and background genomes remains expensive.
To complete the matching phase within a reasonable amount of time, all exact matches of 18 bp or longer are first identified using MUMmer [Delcher et al., 1999; Delcher et al., 2002; Kurtz et al., 2004], a linear time and space suffix tree matching algorithm. To expedite the process, MUMmer searches are partitioned across a 192-node Linux cluster. Even with the use of an efficient search algorithm, however, the size of the database and the high repeat content of many genomes cause the size of the output, that is, the number of matches between all pairs of genomes, to reach unmanageable levels (e.g., the number of matches can be quadratic with respect to the size of the genomes). To combat this problem, matches are converted to a minimalized "match cover" data structure, described next. This structure saves substantial space and later provides a convenient mechanism for computing signatures.
The match cover is not a lossless conversion, however, because it discards information about where a match occurred in the background. The information is nonetheless sufficient for signature computation, where it suffices to know which regions of a target are unique. Furthermore, by excluding irrelevant background match positions, large background databases can be accommodated without drastically increasing the match cover size, and draft quality genomic sequences can be incorporated without difficulty. As the next section will show, the match cover encapsulates all the necessary information for signature discovery and allows for the rapid construction of signatures for any set of target and background genomes in linear time.
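
One way to picture a match cover is as a list of target-side intervals in which no interval is contained in another. The sketch below is an illustration under stated assumptions, not Insignia's actual implementation: the function names, the half-open interval convention, and the example coordinates are all hypothetical. It keeps only where matches fall on the target (background positions are dropped, as described above) and then reports the k-mer start positions that lie inside no match interval.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on the target sequence


def match_cover(matches: List[Interval]) -> List[Interval]:
    """Keep only target-side match intervals not contained in another interval."""
    cover: List[Interval] = []
    # Sort by start ascending, then end descending, so containers come first.
    for start, end in sorted(set(matches), key=lambda iv: (iv[0], -iv[1])):
        if cover and end <= cover[-1][1]:
            continue  # fully inside an interval that was already kept
        cover.append((start, end))
    return cover


def unique_kmer_starts(target_length: int, cover: List[Interval], k: int) -> List[int]:
    """Start positions p whose k-mer [p, p + k) lies inside no cover interval."""
    starts: List[int] = []
    i = 0
    for p in range(target_length - k + 1):
        # Intervals ending before p + k cannot contain this or any later k-mer.
        while i < len(cover) and cover[i][1] < p + k:
            i += 1
        if not (i < len(cover) and cover[i][0] <= p):
            starts.append(p)
    return starts


# Hypothetical matches of a 30-bp target against some background.
cover = match_cover([(0, 20), (5, 15), (10, 30)])
print(cover)                              # [(0, 20), (10, 30)]
print(unique_kmer_starts(30, cover, 18))  # [3, 4, 5, 6, 7, 8, 9]
```

Note that overlapping intervals are deliberately not merged: an 18-mer starting at position 5 spans both matches but is contained in neither, so it remains a candidate. Because the kept intervals have strictly increasing starts and ends, the scan over k-mer positions is linear in the target length plus the number of intervals.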
For perspective, it is worth mentioning that the match cover is an equivalent, interval representation of matching statistics [Chang & Lawler, 1994; Gusfield, 1991]. Both formalizations represent the longest contiguous match beginning at any position of a sequence, but our interval representation is space-efficient and easier to interpret in the context of signature discovery. Rahmann also leverages the properties of matching statistics in describing a "jump list" for the discovery of DNA probes [Rahmann, 2003], and it is interesting to note that although the match cover and jump list were arrived at independently, they are analogous given their shared utilization of matching
the background. Matching sequences are masked out to create a "uniqueness gestalt," which represents all sequences that are shared between target genomes and unique with respect to the background. Third, signature sequences are supplied to the Primer3 program [Rozen and Skaletsky, 2000], which designs PCR assays based on those sequences. Primer3 produces a set of oligos suitable for testing by a TaqMan PCR assay: a forward primer, a reverse primer, and an intervening probe oligomer [Livak et al., 1995]. Finally, assay candidates are screened using BLAST [Altschul et al., 1990] for near matches that might disrupt the hybridization process, and ranked according to their satisfaction of PCR experimental constraints. The result of this four-stage process is a set of ranked, prescreened assays, which are then subjected to rigorous laboratory validation. The transition to these computational methods from previously manual design methods has resulted in greatly increased design efficiency by limiting the number of assays that fail during laboratory validation.
In addition to the computational restrictions, limitations of TaqMan PCR have been demonstrated for rapidly diverging target genomes, such as hepatitis and HIV viruses [Gardner et al., 2004; Gardner et al., 2003]. However, for typical bacterial targets, TaqMan assays remain one of the most rapid and sensitive methods for signature detection. In the case where TaqMan is inadequate, different detection technologies, such as chip-hybridization methods, could be used to remove the TaqMan requirement for three adjacent probes and to provide greater signature redundancy. Insignia would easily support the design of such assays.
Viruses pose significant challenges for all detection methods because of their small genomes and high mutation rates. The Insignia database contains thousands of viral genomes; however, for large target sets there are often no conserved signatures. To address highly divergent targets, future Insignia versions may include the ability to identify signatures with degenerate bases, for cases where no exact signature is shared between them. An alternative is to compute the minimum signature set, where each signature might not identify every target, but the set contains at least one identifying signature for each target. This approach is particularly suited for chip assays where signatures can be multiplexed. A related approach selects combinations of non-unique probes, such that certain viral strains can be identified by their hybridization pattern [Urisman et al., 2005]. Insignia support for specialized viral diagnostics is left for future work.
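
The minimum signature set idea is an instance of the set cover problem, which is NP-hard in general, so a greedy heuristic is a common stand-in. The sketch below is a hypothetical illustration rather than anything Insignia is stated to provide: the function name, signature labels, and strain names are invented for the example, and the greedy result is not guaranteed to be the smallest possible set.

```python
from typing import Dict, List, Set


def greedy_signature_set(coverage: Dict[str, Set[str]], targets: Set[str]) -> List[str]:
    """Greedily pick signatures until every target has at least one.

    coverage maps a signature name to the set of targets it identifies.
    """
    uncovered = set(targets)
    chosen: List[str] = []
    while uncovered:
        # Pick the signature identifying the most still-uncovered targets;
        # iterating in sorted order makes ties resolve alphabetically.
        best = max(sorted(coverage), key=lambda s: len(coverage[s] & uncovered), default=None)
        gained = coverage[best] & uncovered if best is not None else set()
        if not gained:
            raise ValueError(f"no signature identifies: {sorted(uncovered)}")
        chosen.append(best)
        uncovered -= gained
    return chosen


# Hypothetical signature-to-strain table.
coverage = {
    "sigA": {"strain1", "strain2"},
    "sigB": {"strain2", "strain3"},
    "sigC": {"strain3"},
    "sigD": {"strain4"},
}
print(greedy_signature_set(coverage, {"strain1", "strain2", "strain3", "strain4"}))
# ['sigA', 'sigB', 'sigD']
```

On a chip assay, the chosen signatures would be multiplexed together so that every target strain hybridizes to at least one probe.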
Insignia provides real-time signature retrieval for an arbitrary set of target and background genomes. This requires the vast majority of computational work to be done in advance and cached, so that a minimum amount of computation is necessary at the time of the query. To accommodate this, Insignia is designed as two separate components: the match pipeline and the signature pipeline. This distinction separates the computationally intensive matching step from the much simpler signature generation step, and allows sequence matches to be recomputed offline as new genomes become available. While
The function of the signature pipeline is to generate valid signatures for any set of target and background genomes. Because there are thousands of possible targets and many more backgrounds, combinatorics rules out the pre-computation of all signatures; however, it is possible to generate signatures from the match information with minimal overhead. The pipeline for doing so is divided into two parallel stages, corresponding to the two primary criteria a valid signature must meet:
1. a signature must be shared by all genomes in the target set; and
2. a signature must not exist in any genome in the background set.
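
The two criteria above can be stated directly as set operations on k-mers. The following brute-force sketch is for illustration only: it enumerates k-mers explicitly instead of using cached match information, it ignores reverse complements, and the function names and toy sequences are hypothetical.

```python
from typing import Iterable, Set


def kmers(seq: str, k: int) -> Set[str]:
    """All substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_signatures(targets: Iterable[str], backgrounds: Iterable[str], k: int) -> Set[str]:
    """k-mers present in every target (criterion 1) and in no background (criterion 2)."""
    target_sets = [kmers(t, k) for t in targets]
    if not target_sets:
        return set()
    shared = set.intersection(*target_sets)   # criterion 1: shared by all targets
    for b in backgrounds:
        shared -= kmers(b, k)                  # criterion 2: absent from background
    return shared


# Toy sequences, far shorter than real genomes.
print(sorted(candidate_signatures(["ACGTACGGA", "TTACGGAC"], ["GGTACGT"], 4)))
# ['ACGG', 'CGGA']
```

This approach does not scale to databases of billions of nucleotides, which is the motivation for precomputing matches and deriving signatures from them at query time.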
Occurrence and Expression of MannDB
MannDB is a relational database that organizes data resulting from fully automated, high-throughput protein-sequence analyses using open-source tools. Types of analyses provided include predictions of cleavage, chemical properties, classification, features, functional assignment, post-translational modifications, motifs, antigenicity, and secondary structure. Proteomes (lists of hypothetical and known proteins) are downloaded and parsed from GenBank and then inserted into MannDB, and annotations from SwissProt are downloaded when identifiers are found in the GenBank entry or when identical sequences are identified. Currently 36 open-source tools are run against MannDB protein sequences either on local systems or by means of batch submission to external servers. In addition, BLAST against protein entries in MvirDB, our database of microbial virulence factors, is performed. A web client browser enables viewing of computational results and downloaded annotations, and a query tool enables structured and free-text search capabilities. When available, links to external databases, including MvirDB, are provided. MannDB contains whole-proteome analyses for at least one representative organism from each category of biological threat organism listed by APHIS, CDC, HHS, NIAID, USDA, USFDA, and WHO.
MannDB comprises a large number of genomes and comprehensive protein sequence analyses representing organisms listed as high-priority agents on the websites of several governmental organizations concerned with bio-terrorism. MannDB provides the user with a BLAST interface for comparison of native and non-native sequences and a query tool for conveniently selecting proteins of interest. In addition, the user has access to a web-based browser that compiles comprehensive and extensive reports. Access to MannDB is freely available at http://manndb.llnl.gov/.
MannDB was created to meet a need for rapid, comprehensive sequence analysis with an emphasis on protein processing, surface characteristics, and functional classification to support selection of pathogen or virulence-associated proteins suitable as targets for driving the development of protein-based reagents (e.g., antibodies, non-natural amino-acid ligands, synthetic high affinity ligands) for pathogen detection [Slezak et al., 2003; Zhou et al., 2005]. Because comprehensive analyses of this type required using a large number of open-source tools, and because it was necessary to scale the computations for analysis of whole proteomes, we built a fully automated system for executing sequence analysis tools and for storage, integration, and display of protein sequence analysis and annotation data. In order to be able to rapidly examine and compare whole bacterial and viral proteomes for selection of suitable target proteins for bio-defense applications, we compiled data for whole proteomes from representative organisms from all categories of biological threat agents listed by several governmental agencies: APHIS, CDC, HHS, USDA, USFDA, NIAID, and WHO [APHIS Agricultural Select Agent Program select agent and toxin list; CDC bioterrorism agents/diseases list; HHS and USDA select agents and toxins list; USFDA Bad Bug Book; NIAID category A, B and C priority pathogens; WHO list of major zoonotic diseases; WHO list of diseases covered by the Epidemic and Pandemic Alert and Response (EPR)], as well as taxonomic near-neighbor species as appropriate. Therefore, the scope of MannDB is automated sequence analysis and evidence integration for proteins from all currently recognized bio-threat pathogens. Emphasis is placed upon analyses that are most useful in characterizing potential protein targets and surface motifs that could be exploited for development of detection reagents. The content of MannDB is updated on a regular basis.
In recent years several software systems and accompanying databases have been developed for microbial genome annotation, each with a particular emphasis [Andrade et al., 1999; Frishman et al., 2001; Gattiker et al., 2003; Goesmann et al., 2005; Markowitz et al., 2005; Meyer et al., 2003; Vallenet, 2006; Van Domselaar et al., 2005]. Some databases place an emphasis on gene prediction and DNA-based analyses vs. protein sequence-based analyses, or provide automated (primary) vs. curated (secondary) annotations. Although microbial annotation databases frequently include predictions of biological, chemical, structural, and physical properties of proteins (e.g., antigenicity, post-translational modifications, hydrophobicity, membrane helices), none currently offers the comprehensive suite of analyses (see MannDB website for complete list of tools) contained within MannDB for characterizing viral as well as bacterial proteins from human and agricultural/veterinary pathogens of interest to the bio-defense community and for rapidly identifying putative virulence-associated proteins for development of functional assays. The MannDB database was built and linked to MvirDB [MvirDB microbial virulence database] in order to meet these requirements. In addition, we focus on sequence analyses that assist in selection of protein features (e.g., surface characteristics) most suited for targeting detection reagent development.
Construction and Content
MannDB is implemented as an Oracle 10g relational database. The schema for MannDB data organization is available on the website. MannDB captures
results from our fully automated, high-throughput, whole-proteome sequence analysis process pipeline, depicted in Fig. 6. Proteomes (lists of hypothetical and known proteins) representing human bacterial and viral pathogens and near-neighbor species are downloaded from GenBank and parsed into MannDB. Whenever possible, we begin with gene calls on finished genomes. However, the system can be used to predict genes on draft genomes, and can be used to analyze arbitrary lists of protein sequences. Reference genomes are updated on a quarterly basis to ensure that the software tools are being run on current sequence data. Annotations from SwissProt are downloaded when GenBank entries contain SwissProt identifiers, or when identical sequences are detected by blasting MannDB entries against the SwissProt protein fasta database. MannDB contains at least one reference genome for each category of pathogen listed as a bio-threat organism on websites maintained by APHIS, CDC, HHS, USDA, USFDA, NIAID, and WHO. Open-source tools are run either on local systems or by means of batch submission to external servers. As of this writing the system executes 36 tools, which are listed on the MannDB web site. Automated sequence analyses include predictions of post-translational modifications, structural conformation, chemical properties, functional assignment, and antigenicity, as well as motif detection and pre-computed BLAST against protein and nucleic acid sequences in MvirDB, our database of microbial virulence factors, protein toxins, and antibiotic resistance genes [MvirDB microbial virulence database]. Tools that are run in-house are updated periodically to ensure that the system is running the most recent software versions against the most recent data sets. Tools are selected, and input parameters are set, according to the taxon of the organism from which the protein set is constructed. For example, some tools (e.g., NetPicoRNA [Blom et al., 1996]) are run only on specific organisms, whereas others (e.g., SignalP [Bendtsen et al., 2004]) have taxon-specific settings. In some cases we run more than one tool for a similar prediction. TMHMM and TopPred both predict membrane helices, but results may differ, for example, in the start and end residues for a given segment. Our strategy is to employ more than one tool, when available, so that conflicting results can be noted and evaluated by the user. In parsing results from each tool, data are inserted into one of nine tables (see schema on web site) depending on the type of prediction (e.g., protein chemistry); tools that make similar predictions tend to produce similarly structured output (although formatting differs considerably), which facilitates data storage and retrieval.
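
To make the point about conflicting predictions concrete, the sketch below pairs up helix segments reported by two tools and separates exact agreements from boundary disagreements. It is a hypothetical illustration: the function names, the inclusive residue-range convention, and the example coordinates are assumptions, and it does not reproduce the output format of TMHMM, TopPred, or MannDB.

```python
from typing import Dict, List, Tuple

Segment = Tuple[int, int]  # inclusive (start_residue, end_residue)


def overlaps(a: Segment, b: Segment) -> bool:
    """True when two inclusive residue ranges share at least one residue."""
    return a[0] <= b[1] and b[0] <= a[1]


def compare_predictions(tool_a: List[Segment], tool_b: List[Segment]) -> Dict[str, list]:
    """Pair overlapping segments from two tools and sort them by agreement."""
    result: Dict[str, list] = {"agree": [], "differ": [], "only_a": [], "only_b": []}
    unmatched_b = list(tool_b)
    for seg in tool_a:
        partner = next((other for other in unmatched_b if overlaps(seg, other)), None)
        if partner is None:
            result["only_a"].append(seg)   # no counterpart in the second tool
            continue
        unmatched_b.remove(partner)
        key = "agree" if seg == partner else "differ"
        result[key].append((seg, partner))
    result["only_b"] = unmatched_b
    return result


# Invented segment coordinates for one protein from two predictors.
report = compare_predictions([(12, 34), (50, 72), (90, 110)], [(12, 34), (48, 70)])
print(report)
```

A report of this kind leaves the judgement to the user, in keeping with the strategy described above of noting conflicts rather than resolving them automatically.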
A web client browser enables viewing of automated analysis results, annotations, and links to MvirDB. The user first selects a proteome, then a specific protein for which to view summary results, and finally selects the specific categories of analysis to be viewed. Only analyses returning results are displayed. Hyperlinks to external data sources are provided for additional information whenever external database identifiers are returned. The MannDB tool set includes a BLAST interface, which can be used to quickly identify an entry of interest by its sequence, when the gene name or locus tag is unknown, or to identify protein sequences related to a sequence of interest. A query tool allows the user to construct 3 types of searches: 1) free-text searches against database fields that contain descriptive information, including fields containing gene names or external database identifiers; 2) structured searches against specific analysis types; and 3) a search for proteins linked to entries in MvirDB either by common unique identifier or by pre-computed BLAST homology. Reports and results sets from the query tool can be downloaded into Excel.
Zhou et al., BMC Bioinformatics 2006, 7:459, doi:10.1186/1471-2105-7-459
MannDB provides users with pre-computed sequence analyses for complete proteomes of bacterial and viral pathogens from several governmental agencies' lists of bio-threat agents. The genomes and tools are maintained up to date, with predictions being re-run every 3 months. The user can browse proteomes, or can blast sequences against MannDB to pull up related entries and associated data. MannDB provides a convenient source of automated sequence analyses and downloaded annotation information for whole proteomes of human pathogenic bacteria and viruses and has a high degree of integration with external databases.
MannDB provides sequence analysis information of primary interest to researchers in the bio-defense community. We have been using MannDB for several years to "annotate" DNA signatures [Slezak et al., 2003] and more recently to assist collaborators in efforts to down-select from whole bacterial and viral genomes to identify suitable protein targets and protein features for driving the development of detection reagents [Zhou et al., 2005]. For example, a common requirement for a detection assay is that it be performed with minimal sample disruption. Therefore, an initial down selection for proteins expected to be on the surface of a bacterial particle might entail identification of proteins that are predicted to be secreted or membrane bound by using tools such as PSORT [Gardy et al., 2005; Nakai and Horton, 1999], TMHMM [Krogh et al., 2001], SignalP, TargetP [Emanuelsson et al., 2000], TopPred [Claros et al., 1994], and HMMTOP [Tusnady and Simon, 1998]. Having results from several tools that provide similar predictions but use different algorithms or slightly different approaches allows us to compare predictions and make selections with greater confidence. Identification of surface features for targeting of detection reagents is done primarily by means of additional sequence- and structure-based analyses [Zhou et al., 2005], although predictions pertaining to post-translational modifications (e.g., glycosylation, cleavage) are taken into consideration as they may affect protein recognition.
Availability and Requirements
MannDB is freely accessible at http://manndb.llnl.gov/. Although the software that populates and updates MannDB is not open-source, the user may request collaborative sequence analysis services by contacting
List of abbreviations
BLAST = Basic local alignment search tool.
APHIS = Animal and Plant Health Inspection Service.
CDC = Centers for Disease Control and Prevention.
HHS = Health and Human Services.
USDA = United States Department of Agriculture.
USFDA = United States Food and Drug Administration.
NIAID = National Institute of Allergy and Infectious Diseases.
WHO = World Health Organization.
Comparative genomics tools applied to bioterrorism defence
Rapid advances in the genomic sequencing of bacteria and viruses over the past few years have made it possible to consider sequencing the genomes of all pathogens that affect humans and the crops and livestock upon which our lives depend. Recent events make it imperative that full genome sequencing be accomplished as soon as possible for pathogens that could be used as weapons of mass destruction or disruption. This sequence information must be exploited to provide rapid and accurate diagnostics to identify pathogens and distinguish them from harmless near-neighbours and hoaxes. The Chem-Bio Non-Proliferation (CBNP) programme of the US Department of Energy (DOE) began a large-scale effort of pathogen detection in early 2000 when it was announced that the DOE would be providing bio-security at the 2002 Winter Olympic Games in Salt Lake City, Utah. Our team at the Lawrence Livermore National Lab (LLNL) was given the task of developing reliable and validated assays for a number of the most likely bioterrorist agents. The short timeline led us to devise a novel system that utilised whole-genome comparison methods to rapidly focus on parts of the pathogen genomes that had a high probability of being unique. Assays developed with this approach have been validated by the Centers for Disease Control (CDC). They were used at the 2002 Winter Olympics, have entered the public health system, and have been in continual use for non-publicised aspects of homeland defence since autumn 2001. Assays have been developed for all major threat list agents for which adequate genomic sequence is available, as well as for other pathogens requested by various government agencies. Collaborations with comparative genomics algorithm developers have enabled our LLNL team to make major advances in pathogen detection, since many of the existing tools simply did not scale well enough to be of practical use for this application. It is hoped that a discussion of a real-life practical application of comparative genomics
algorithms may help spur algorithm developers to tackle some of the many remaining problems that need to be addressed. Solutions to these problems will advance a wide range of biological disciplines, only one of which is pathogen detection. For example, exploration in evolution and phylogenetics, annotating gene coding regions, predicting and understanding gene function and regulation, and untangling gene networks all rely on tools for aligning multiple sequences, detecting gene rearrangements and duplications, and visualising genomic data. Two key problems currently needing improved solutions are: (1) aligning incomplete, fragmentary sequence (e.g., draft genome contigs or arbitrary genome regions) with both complete genomes and other fragmentary sequences; and (2) ordering, aligning and visualising non-colinear gene rearrangements and inversions in addition to the colinear alignments handled by current tools.
DNA-based signatures are needed to quickly and accurately identify biological warfare agents and their makers. DNA signatures are nucleotide sequences that can be used to detect the presence of an organism and to distinguish that organism from all other species. Insignia, a new, comprehensive system, is applicable for the rapid identification of signatures in the genomes of bacteria and viruses. With the availability of hundreds of complete bacterial and viral genome sequences, it is now possible to use computational methods to identify signature sequences in all of these species, and to use these signatures as the basis for diagnostic assays to detect and genotype microbes in both environmental and clinical samples. The success of such assays critically depends on the methods used to identify signatures that properly differentiate between the target genomes and the sample background. Insignia has been used to compute accurate signatures for most bacterial genomes, and these are available through the Web site. A sample of these signatures has been successfully tested on a set of 46 Vibrio cholerae strains, and the results indicate that the signatures are highly sensitive for detection as well as specific for discrimination between these strains and their near relatives. Comparing the entire genomic complement of organisms to identify probe targets is a promising method for diagnostic assay development, and it provides assay designers with the flexibility to choose probes from the most relevant genes or genomic regions. The Insignia system is freely accessible via a Web interface and has been released as open source software at: http://insignia.cbcb.umd.edu.
MannDB is a genome-centric database containing comprehensive automated sequence analysis predictions for protein sequences from organisms of interest to the bio-defense research community. Computational tools for the MannDB automated pipeline were selected based on customer needs in providing down selections from large sets of proteins (e.g., whole proteomes) to short lists of proteins most suitable for developing reagents to be used in field assays for detection of pathogens. For that reason we have focused our efforts on applying tools that would enable selection of proteins that meet assay requirements, such as cellular localization, that would assist in determining the value of a surface feature for targeting ligand binding, or that would identify antigenic sub-sequences of particular value in antibody development. As the goals of some of these assays have been to detect toxins or proteins associated with virulence, we constructed hard links between protein sequences in MannDB and entries in MvirDB in order to conveniently identify and characterize protein targets and features for these applications. We believe that MannDB will be of general use to the bio-defense and medical research communities as a resource for predictive sequence analyses and virulence information.
Altschul SF, GishW;Miller W, Myers EW and Lipman OJ (1990): BasiC local
alignment search tOI;>1.JMol Bioi, 2i5, 403-410. .
Aridrade MA, Brown NP, Leroy G,Hoersh S, de Daruvar A, Reigh C, Franchini
. A, Tamames J, Valencia A, Ousounis C and Sander C (1999) : Automated
. gen_omesequence analysis and annbtatlon. Bioinformatics, 15,391-412.
APHIS Agricultural Select Agent Prograin select agent and toxin list [http://
. www.aphis.uspa.gov/programs/atLselectagentlalLbioter't _toxinslisthtml]
BendtsenJD, Nielsen H, von Heijne G and Brunak S (2004) :Improved prediction
. of signal peptides: SignalP 3.0. Journal ofMolecularBiologj., 340, 783-
BlomN, Hansen J, Blaas D and Brunak S (1996): Cleavage site analysis in
picomaviral polyproteins: Discovering.cellular targets by neural networks ...
Protein Science,S, 2203-2216.
Brown K (2004): Biosecurity. Up in the air.Science, 305·, 1228-1229.
CDC bioterrorism agents/diseases list [http://www.bt.~dc.gov!agentlagentIist-
category.asplwebcite . j;;'
Chang WI and Lawler EL (1994): Sublinear expect~d time approximate string
matching and biological applications. Algorithmica, 12,327-344.
Claroi,-MG,vonHeijpe G: TopPred IT (1994) :An improved software for membrane
profein structure predictions. CABIOS, 10,685-686.
DeIcher AL, KasifS, Fleischniann RD, Peterson J, White °and et al. (1999):
. Alignmentofwholegenomes. NucleicAcids Re.s, 27,2369-2376.
Deicher AL, Phillippy A, Carlton J and Salzberg SL (2002): Fast algorithms for
. large-scale genome aHgrimentand comparison~Nucleic Acids Res; 30,2478-
Emanuelsson O, Nielsen H, Brunak S and von Heijne G (2000): Predicting
subcellular localization of proteins based on their N-terminal amino acid
sequence. Journal of Molecular Biology, 300, 1005-1016.
Keim P, Price LB, Klevytska AM, Smith KL and Schupp JM (2000): Multiple-
locus variable-number tandem repeat analysis reveals genetic relationships
within Bacillus anthracis. J Bacteriol, 182, 2928-2936.
Krogh A, Larsson B, von Heijne G and Sonnhammer ELL, Year: Predicting
transmembrane protein topology with a hidden Markov model: application
to complete genomes.
Kurtz S (2003): A time and space efficient algorithm for the substring matching
problem. Technical Report. Hamburg: Zentrum für Bioinformatik, Universität
Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M et al. (2004):
Versatile and open software for comparing large genomes. Genome Biol, 5,
Li F and Stormo GD (2001): Selection of optimal DNA oligos for gene expression
arrays. Bioinformatics, 17, 1067-1076.
Lim DV, Simpson JM, Kearns EA and Kramer MF (2005): Current and developing
technologies for monitoring agents of bioterrorism and biowarfare. Clin
Microbiol Rev, 18, 583-607.
Livak KJ, Flood SJ, Marmaro J, Giusti W and Deetz K (1995): Oligonucleotides
with fluorescent dyes at opposite ends provide a quenched probe system
useful for detecting PCR product and nucleic acid hybridization. PCR
Methods Appl, 4, 357-362.
McBride MT, Masquelier D, Hindson BJ, Makarewicz AJ and Brown S (2003):
Autonomous detection of aerosolized Bacillus anthracis and Yersinia
Markowitz VM, Korzeniewski F, Palaniappan K, Szeto P, Ivanova N and Kyrpides
NC (2005): The integrated microbial genomes (IMG) system: a case study in
biological data management. Proceedings of the 31st VLDB Conference:
2005; Trondheim, Norway. 2005, 1067-1078.
Meyer F, Goesmann A, McHardy AC, Bartels D, Bekel T, Clausen J, Kalinowski
J, Linke B, Rupp O, Giegerich R and Pühler A (2003): GenDB - an open
source genome annotation system for prokaryote genomes. Nucleic Acids
Peterson JD, Umayam LA, Dickinson TM, Hickey EK and White O (2001): The
comprehensive microbial resource. Nucleic Acids Research, 29, 123-125.
MvirDB microbial virulence database [http://mvirdb.llnl.gov]
Moser MJ, Christensen DR, Norwood D and Prudent JR (2006): Multiplexed
detection of anthrax-related toxin genes. J Mol Diagn, 8, 89-96.
Nakai K and Horton P (1999): PSORT: a program for detecting the sorting
signals of proteins and predicting their subcellular localization.
NIAID category A, B and C priority pathogens
[http://www3.niaid.nih.gov/biodefense/bandc_priority.htm]
Fitch JP, Gardner SN, Kuczmarski TA, Kurtz S, Myers R et al. (2002): Rapid
development of nucleic acid diagnostics. Proc IEEE, 90, 1708-1721.
Fitch JP, Raber E and Imbro DR (2003): Technology challenges in responding
to biological or chemical attacks in the civilian sector. Science, 302,
1350-1354.
Frishman D, Albermann K, Hani J, Heumann K, Metanomski A, Zollner A and Mewes
H-W (2001): Functional and structural genomics using PEDANT.
Bioinformatics, 17, 44-57.
Gardy JL, Laird MR, Chen F, Rey S, Walsh CJ, Ester M and Brinkman FSL
(2005): PSORTb v.2.0: expanded prediction of bacterial protein subcellular
localization and insights gained from comparative proteome analysis.
Bioinformatics, 21, 617-623.
Gardner SN, Lam MW, Mulakken NJ, Torres CL, Smith JR et al. (2004):
Sequencing needs for viral diagnostics. J Clin Microbiol, 42, 5472-5476.
Gardner SN, Kuczmarski TA, Vitalis EA and Slezak TR (2003): Limitations of
TaqMan PCR for detecting divergent viral pathogens illustrated by hepatitis
A, B, C, and E viruses and human immunodeficiency virus. J Clin Microbiol,
Gattiker A, Michoud K, Rivoire C, Auchincloss AH, Coudert E, Lima T, Kersey
P, Pagni M, Sigrist CJA, Lachaize C, Veuthey A-L, Gasteiger E and Bairoch
A (2003): Automated annotation of microbial proteomes in SWISS-PROT.
Computational Biology and Chemistry, 27, 49-58.
Goesmann A, Linke B, Bartels D, Dondrup M, Krause L, Neuweger H, Oehm S,
Paczian T, Wilke A and Meyer F (2005): BRIGEP - the BRIDGE-based
genome-transcriptome-proteome browser. Nucleic Acids Research, 33,
Gordon PM and Sensen CW (2004): Osprey: A comprehensive tool employing
novel methods for the design of oligonucleotides for DNA sequencing and
microarrays. Nucleic Acids Res, 32, e133.
Gusfield D (1997): Algorithms on strings, trees, and sequences: Computer
science and computational biology. New York: Cambridge University Press.
HHS and USDA select agents and toxins list [http://www.cdc.gov/od/sap/docs/
Höhl M, Kurtz S and Ohlebusch E (2002): Efficient multiple genome alignment.
Bioinformatics, 18, S312-S320.
Kaderali L and Schliep A (2002): Selecting signature oligonucleotides to identify
organisms using DNA arrays. Bioinformatics, 18, 1340-1349.
Keim P, Klevytska AM, Price LB, Schupp JM and Zinser G (1999): Molecular
diversity in Bacillus anthracis. J Appl Microbiol, 87, 215-217.
Nordberg EK (2005): YODA: Selecting signature oligonucleotides.
Bioinformatics, 21, 1365-1370.
O'Connell KP, Bucher JR, Anderson PE, Cao CJ, Khan AS et al. (2006):
Real-time fluorogenic reverse transcription-PCR assays for detection of
bacteriophage MS2. Appl Environ Microbiol, 72, 478-483.
Pruitt KD, Tatusova T and Maglott DR, year: NCBI reference sequence (RefSeq): a
curated non-redundant sequence database of genomes, transcripts and
proteins. Nucleic Acids Res, 35, D61-D65.
Pruitt KD, Tatusova T and Maglott DR (2007): NCBI reference sequences
(RefSeq): A curated non-redundant sequence database of genomes,
transcripts, and proteins. Nucleic Acids Res, 35, D61-D65.
Rahmann S (2003): Fast and sensitive probe selection for DNA chips using
jumps in matching statistics. Proc IEEE Comput Soc Bioinform Conf, 2, 57-
Rozen S and Skaletsky H (2000): Primer3 on the WWW for general users and
for biologist programmers. Methods Mol Biol, 132, 365-386.
Slezak T, Kuczmarski T, Ott L, Torres C, Medeiros D et al. (2003):
Comparative genomics tools applied to bioterrorism defense. Brief Bioinform
Slezak T, Kuczmarski T, Ott L, Torres C, Medeiros D, Smith J, Truitt B, Mulakken
N, Lam M, Vitalis E, Zemla A, Zhou C and Gardner S (2003): Comparative
genomics tools applied to bioterrorism defense. Briefings in Bioinformatics,
Tembe W, Zavaljevski N, Bode E, Chase C, Geyer J et al. (2007):
Oligonucleotide fingerprint identification for microarray-based pathogen
diagnostic assays. Bioinformatics, 23, 5-13.
Tusnady GE and Simon I, year?: Principles governing amino acid composition of
integral membrane proteins: applications to topology prediction.
Urisman A, Fischer KF, Chiu CY, Kistler AL, Beck S et al. (2005): E-Predict:
A computational strategy for species identification based on observed
DNA microarray hybridization patterns. Genome Biol, 6, R78.
US FDA Bad Bug Book [http://www.cfsan.fda.gov/~mow/intro.html]
Vallenet D, Labarre L, Rouy Z, Barbe V, Bocs S, Cruveiller S, Lajus A, Pascal G,
Scarpelli C and Medigue C (2006): MaGe: a microbial genome annotation
system supported by synteny results. Nucleic Acids Research, 34, 53-65.
Van Domselaar GH, Stothard P, Shrivastava S, Cruz JA, Guo A, Dong X, Lu P,
Szafron D, Greiner R and Wishart DS (2005): BASys: a web server for automated
bacterial genome annotation. Nucleic Acids Research, 33, W455-W459.
Volokhov D, Pomerantsev A, Kivovich V, Rasooly A and Chizhikov V (2004):
Identification of Bacillus anthracis by multiprobe microarray hybridization.
Diagn Microbiol Infect Dis, 49, 163-171.
Wang D, Coscoy L, Zylberberg M, Avila PC, Boushey HA et al. (2002):
Microarray-based detection and genotyping of viral pathogens. Proc Natl
Acad Sci USA, 99, 15687-15692.
WHO list of major zoonotic diseases
[http://www.who.int/zoonoses/diseases/en/]
WHO list of diseases covered by the Epidemic and Pandemic Alert and
Response (EPR) [http://www.who.int/csr/disease/en/]
Willse A, Straub TM, Wunschel SC, Small JA, Call DR et al. (2004):
Quantitative oligonucleotide microarray fingerprinting of Salmonella
enterica isolates. Nucleic Acids Res, 32, 1848-1856.
Zhou CEZ, Zemla A, Roe D, Young M, Lam M, Schoeniger J and Balhorn R
(2005): Computational approaches for identification of conserved/unique
binding pockets in the A chain of ricin.
[http://bioinformatics.oxfordjournals.org/cgi/reprint/21/14/3089]
Bioinformatics, 21, 3085-3096.