Validation of Mycobacteriophage Genome Annotation
by Mass Spectrometry
Yi Li1, 2, 3, Soo Jung Ha1, 2, 3, Zach T. Andor1, Hee Gun Eom1, Mokunfope O.
Fatukasi4, Nathaniel R. Hunnewell5, Jiewei Wu1, Victoria E. Hedrick2, Tiago J.P.
Sobreira2, Kari L. Clase1, 2, 3, 6§
1Department of Agricultural and Biological Engineering, Purdue University, 225 S.
University Street, West Lafayette, IN, 47907, USA
2Bindley Bioscience Center, Purdue University, 1203 W. State Street, West Lafayette,
IN, 47907, USA
3Biotechnology Innovation and Regulatory Science Center, Purdue University, 1203
W. State Street, West Lafayette, IN, 47907, USA
4Department of Biochemistry, Purdue University, 175 S. University Street, West
Lafayette, IN, 47907, USA
5Department of Biological Sciences, Purdue University, 915 W. State Street, West
Lafayette, IN, 47907, USA
6Department of Technology Leadership and Innovation, Purdue University, 155 S.
Grant Street, West Lafayette, IN, 47907, USA
§Corresponding author
Email addresses:
YL: li949@purdue.edu
SJH: has@purdue.edu
ZTA: zandor@purdue.edu
HGE: heom@purdue.edu
MOF: mfatukas@purdue.edu
NRH: nhunnewe@purdue.edu
JW: wu371@purdue.edu
VEH: vhedrick@purdue.edu
TJPS: sobreira@purdue.edu
KLC: klclase@purdue.edu
Abstract
Background
Mycobacteriophages are viruses that infect Mycobacterium, a genus of
Actinobacteria that includes members responsible for diseases such as tuberculosis and
leprosy. The biological activities of phages are mediated by host production of phage
proteins encoded in the genetic material of the phage and translated by the host upon
phage infection. Phage-host interactions are complex and are likely mediated through
proteins and their functional interactions. New phage genomes are often annotated
and classified into different clusters based on nucleotide similarity. Many annotated
putative proteins, however, remain uncharacterized. There is a need to validate and
improve the accuracy of phage genome annotation in order to further explore novel
proteins and their function. Proteins from phage have potential applications in broad
areas, including both the food industry and healthcare. In order to examine the
proteins produced during phage infection of its host and develop a method for rapid
validation of phage genome annotation, a set of diverse mycobacteriophages isolated
through the Howard Hughes Medical Institute (HHMI) Science Education Alliance
Phage Hunters Advancing Genomics and Evolutionary Science (SEA-PHAGES)
program were examined by high-performance liquid chromatography-tandem mass
spectrometry (HPLC-MS/MS).
Results
The mass spectra of peptides from each phage-bacteria mixture were searched against
a database consisting of a six-frame translation of the phage genomes, the annotated
proteins of Mycobacterium smegmatis (M. smegmatis) and UniProt using the software
X! Tandem. Peptides of annotated putative phage proteins were detected from nine
diverse phages. The total number of peptides identified per sample ranged from 75 to
715; these peptides matched either the host bacterium M. smegmatis or the phage and
included both functionally characterized and uncharacterized proteins. Mascot, as an
alternative software, was tested by searching mass spectra data of Giles and Omega
against three distinct databases. Analysis using Mascot was sufficient for peptide
identification and was also used to search the mass spectra against the annotated
putative proteins archived in the phage database, PhagesDB. The phage peptides
identified in the samples ranged from 24 to 414 and included peptides that matched
both proteins from the phage samples and homologous proteins from other phages in
the database. In addition to peptide identification and validation of genome
annotation, the Mascot-based analysis method was further evaluated for its utility to
confirm the cluster classification of phages, based upon the cluster with highest
number of peptides detected.
Conclusions
The method identified peptides that validated the previous genome annotations of
phages and putative phage proteins. Since many uncharacterized proteins were also
identified, this method could also be used to investigate the expression of phage
proteins in a dynamic phage-bacteria system and further adapted to examine native
phage proteins and changes in protein expression throughout the lifecycle of the
phage.
Keywords
mycobacteriophage, genome annotation, protein expression, mass spectrometry,
peptide identification, uncharacterized proteins, phage-host interactions
Background
Bacteriophages or phages are the viruses that infect bacteria. Similar to other viruses,
a phage is constructed of two components: a genome of nucleotide material and a
capsid made of proteins [1]. The phage genome can be either single- or double-
stranded DNA or RNA and is dependent upon the host for replication and the
production of new phage particles. Interactions between the host and the phage result
in either lytic or lysogenic phage life cycles. In a lytic cycle, the phage genome is
replicated, transcribed, and translated using the bacterial machinery. Phage particles
are then assembled and released out of the bacterium. In contrast, during the lysogenic
cycle, the phage genome is integrated into the bacterial chromosome. Under certain
conditions, the integrated phage genome can be released from the host chromosome
and subsequently enter a lytic cycle [2].
Mycobacteriophages, phages that can infect Mycobacteria, have been studied
extensively. Novel phages are isolated from environmental samples, and phage
genomes are sequenced and annotated [3]. Isolated mycobacteriophages can be
categorized into clusters through a four-step analysis at the nucleotide level: dot-plot
comparison of all genomes with one another, pairwise average nucleotide identities
(ANI), pairwise genome map comparisons, and gene content analysis [3]. Functions
of annotated gene products are predicted through bioinformatics analysis and
homologous proteins are assembled into different phamilies [4]. Wet lab
experimentation is still needed to validate the genome annotations and further
characterize the putative phage proteins and their function. Previous studies employed
mass spectrometry (MS) to examine phage proteins [5]; however, because the analysis
focused solely on purified phage particles, only virion and structural proteins were
examined. Exploration of the remaining phage proteins is challenging because protein
production depends upon complex interactions between the phage and its bacterial
host. Thus, the availability and concentration of the proteins in phage lysates are
unpredictable. In this research, proteins were extracted from the phage-bacteria
mixture during phage infection, and further analyzed by mass spectrometry.
Results
Sample Preparation
The overall experimental process is shown in Figure 1. Phages from diverse clusters
(Table A1) were obtained from Prof. Graham Hatfull’s research group through the
HHMI SEA-PHAGES program. Individual phage stocks were incubated separately
with M. smegmatis cell culture overnight to allow time for infection and
production of phage proteins by the infected bacteria. Non-infected M. smegmatis cell
culture was used as a negative control. Total proteins of the phage-bacteria mixture
were extracted using high pressure to break the cell membrane and optimize the yield
of proteins, digested with trypsin, and loaded onto the mass spectrometer for analysis.
Figure 1. Experimental Process
The MS method was designed and performed at Purdue University [6]. Data analysis
using X! Tandem was adapted from Pope et al. [7].
Identification of phage peptides by Mass Spectrometry
To identify the peptides that were expressed in the phage-bacteria mixture, the
software X! Tandem was used to search the mass spectra raw data against a database
containing the six-frame translation of the sample phage genomes, the annotated
proteins of M. smegmatis, and the proteins from the comprehensive protein database,
UniProt, as published previously [7]. In the output of X! Tandem, peptides from both
the corresponding phages and the host mycobacteria were detected. The number of
peptides from the host, however, was much larger than the number of peptides
detected from the phage (Figure 2). The total number of identified phage peptides in
all samples is listed in Table 1.
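As an illustration of how a six-frame translation database of the kind described above can be constructed, the following minimal Python sketch translates a nucleotide sequence in all six reading frames. It is not the pipeline used in this study: the function names, the frame labels (+1 to -3), and the choice of the standard codon table are assumptions made for illustration only.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumed here;
# the study does not state which translation table was used).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Translate both strands at offsets 0, 1, 2 (frames +1..+3 and -1..-3)."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


# Toy sequence, not taken from any phage genome.
frames = six_frame_translation("ATGGCCTAA")
```

Each translated frame could then be written as a FASTA record and combined with the host protein sequences to form a search database.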
Figure 2. Peptides identified by using X! Tandem.
The number of identified phage peptides was low in comparison to the number of
identified bacterial peptides. Additional software, Scaffold 4, was used for further
analysis and adjustment of false discovery rates (FDRs); the results obtained
using the “Stringent” setting are shown. All phage proteins identified based upon the
peptides detected from MS are listed in Tables A4-A12.
Identification of phage proteins
The phage peptides identified with X! Tandem were labeled in the genome map of the
phages to compare the expressed phage proteins with the predicted proteins from the
phage genome annotation. Four proteins were identified from phage Wee, more than
ten proteins were detected in Babsiella, Giles, Henry, LHTSCC, LinStu, Omega and
Papyrus, and more than thirty proteins were observed in Chah. Although only a
portion of the putative proteins predicted from the genome annotations were
identified, the detected peptides validated the existing genome annotations for the
corresponding genes. As expected, putative virion structure and assembly proteins
were detected in all samples, due to the presence and production of phage virions.
Non-structural proteins predicted by the genome annotations were also detected as
described in the following paragraphs.
Fifteen proteins were detected in Babsiella (Table A4), including LysA and LysB,
proteins that prompt host cell lysis. Other proteins were putative virion structure and
assembly proteins based upon homology with another member of Cluster I,
Che9c. The first twenty-two genes of Che9c encode virion structure and assembly
proteins [4] and, based upon a comparison to Babsiella using Phamerator, the putative
proteins have a similar structure and distribution and belong to conserved phamilies
(Figure 4). Therefore, although gp5 (gene product of gene 5), gp8, and gp13 of
Babsiella are functionally uncharacterized, their putative proteins may be associated
with virion structure and assembly based upon the comparison in Phamerator. The
remaining proteins identified, gp75 and gp76, have unknown functions.
Figure 4. Comparison between the genomes of Babsiella and Che9c
The genomes of Babsiella and Che9c were compared using Phamerator [9]. Most of
the first 22 genes in both genomes belong to the same phamilies (same color and
phamily number) and have similar distribution.
Thirty-one proteins were detected from Chah (Table A5), a phage from Subcluster B1.
Among the detected proteins, only seven had putative functions assigned based on
genome annotation. These included proteins involved in virion
structure and assembly, such as portal (gp9), capsid (gp12), and major tail subunit
(gp18); host cell lysis, such as LysA (gp50) and LysB (gp51); and DNA replication,
such as helicase type III restriction subunit (gp54) and RepA helicase (gp60).
Eleven proteins were detected from Giles (Table A6), one of five phages from Cluster
Q. Seven of the eleven proteins detected are putative virion structure and assembly
proteins. In addition, Clp protease (gp7), a protein involved in a protease-required
capsid assembly process [4], and LysB (gp32) were also detected along with
functionally uncharacterized proteins, gp37, gp58, and gp60.
Twenty-two proteins were detected from the phage sample Henry (Table A7), a
member of Cluster E. Putative proteins for virion structure and assembly, the lysis
cassette, and DNA replication were detected, similar to the results from the phage sample
Chah. Gp34 was also detected, but it is not functionally characterized in the
genome annotation. A comparative genome analysis with Phamerator,
however, suggests that gp34 from Henry is a holin, based upon homology with the
holin (gp33) of another phage from Cluster E, Cjw1 (Figure 5).
Figure 5. Comparison between the genomes of Henry and Cjw1
The comparison was obtained using Phamerator [9]. Gp34 of Henry and gp33 of
Cjw1 have an E-value of 0.0 and are from the same phamily. Therefore, gp34 of
Henry may also be a putative holin protein based upon homology with the putative
holin protein (gp33) of Cjw1.
Twenty-one proteins from LHTSCC (Table A8), a Cluster A4 phage, were detected,
including those with putative functions in DNA replication, virion structure and
assembly, and lysis, similar to observations in other phage samples. Proteins
not detected in other phage samples were also observed, including a putative
recombination directionality factor (RDF) and ThyX (gp50), a putative flavin-dependent
thymidylate synthase. Interestingly, a peptide of the tapemeasure protein
(gp26) covering an additional six amino acids upstream from the annotated sequence
was detected as well (Figure 6).
Figure 6. A peptide of LHTSCC tapemeasure protein (gp26) covering six amino acids
upstream from the annotated start site.
The green bars are annotated putative genes, and the pink bar is the detected peptide
matching six amino acids upstream from the annotated tapemeasure protein (gp26).
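The reasoning behind this observation can be sketched in a few lines of Python: a peptide that maps in the same reading frame as an annotated gene, but begins before the annotated start coordinate, extends the protein upstream by a whole number of codons. The function name and the coordinates below are hypothetical and are not taken from the LHTSCC annotation; the sketch assumes a forward-strand gene and 1-based nucleotide coordinates.

```python
def upstream_extension(peptide_start, annotated_start):
    """Number of amino acids by which a peptide begins upstream of an
    annotated start codon (forward strand, 1-based nucleotide coordinates).

    Returns 0 when the peptide starts at or after the annotated start.
    Raises ValueError when the peptide is not in the annotated reading frame.
    """
    offset = annotated_start - peptide_start
    if offset <= 0:
        return 0
    if offset % 3 != 0:
        raise ValueError("peptide is not in the annotated reading frame")
    return offset // 3


# Hypothetical coordinates: a gene annotated to start at nucleotide 1000 and a
# peptide whose first codon begins at nucleotide 982 (18 nt = 6 codons upstream).
extension = upstream_extension(peptide_start=982, annotated_start=1000)
```

A nonzero result indicates only that translation begins upstream of the annotated start; it does not by itself identify the true start codon.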
Similar to other phage samples, most of the peptides detected in the remaining phage
samples, LinStu (Cluster C1, Table A9), Omega (Cluster J, Table A10), Papyrus
(Cluster R, Table A11), and Wee (Cluster F1, Table A12), matched virion structure and
assembly proteins. Other proteins involved in DNA replication and lysis were also
detected in the phage Omega. The functions of the remaining proteins detected were
not characterized; in the phage Papyrus, for example, the putative functions of thirteen of
the eighteen proteins detected were unknown.
Identification of phage peptides using Mascot
Mascot is another software package commonly used to analyze raw MS peptide data [6].
To evaluate it as an alternative to the data analysis process with X! Tandem, both Mascot and
X! Tandem were used to search the raw mass spectra of Giles and Omega against
three distinct databases: (1) the six-frame translation of the sample phage and the
six-frame translation of M. smegmatis; (2) the public phage protein database PhagesDB and the
comprehensive protein database UniProt; and (3) the combination of the six-frame
translation of the sample phage, the six-frame translation of M. smegmatis, PhagesDB,
and UniProt. Based upon the results (Figure A1), Mascot identified more peptides
than X! Tandem and Scaffold 4 with the same database, and the larger database
resulted in more identified peptides. Thus, Mascot was used for cluster identification.
Mascot was used to search the mass spectra raw data against the public phage protein
database, PhagesDB [6]. As shown in Table 2, the range of peptides detected was
from 24 to 414 with over one hundred peptides obtained for phages Babsiella, Chah,
Giles, Henry, LHTSCC, LinStu, and Omega. Fewer than one hundred peptides were
detected in the remaining phage samples. Searching against the PhagesDB database
resulted in the identification of peptides from multiple phages, and in some samples
(Chah, Henry, LinStu, and Wee), only a small portion or even none of the peptides
belonged to the sample phage (Figure 3). In contrast, over half of the total peptides
identified from Babsiella and Papyrus belonged to the predicted putative proteins
from these phages.
Figure 3. Percentage of the peptides from the sample phages in the total peptides
identified. Only a small portion of the identified peptides belonged to the predicted
putative proteins from the sample phages.
Identification of phage cluster
Clustering phages as described earlier and characterizing the putative proteins in
novel phages currently rely on genomic DNA sequencing, which can be costly and
time-consuming. Previous studies have employed rapid protein extraction with mass
spectrometry to quickly identify microorganisms, such as bacteria and fungi, for
laboratory-based diagnostics [10, 11, 12]. Thus, it may be possible to categorize new
phages by analyzing the similarity between isolated phage peptides and annotated
putative phage proteins in the PhagesDB database [13], since phages are clustered based
upon nucleotide similarity [14] and putative phage proteins are predicted from the
genomic DNA.
Peptide identification using Mascot may result in a match to peptides of homologous
proteins in other phages since all of the archived putative proteins in PhagesDB are
searched. These homologous proteins may be from the same phamilies and include
proteins belonging to phages from various clusters [4]. Since the phages from the
same cluster have the most similar nucleic acid and predicted amino acid sequences, it
was predicted that most of the homologous proteins may belong to the phages from
the same cluster as the sample phage. Therefore, we decided to investigate if the
Mascot analysis results could be further utilized to categorize phages and predict their
cluster. The phage cluster that includes the largest number of detected peptides (the
most frequently detected cluster, the MFDC) in the output of Mascot analysis might
match the initial cluster assignment of the phage samples from genomic analysis.
For each phage, the MFDC accounted for more than 50% of all detected peptides (Table 3). Chah and
Henry had an MFDC that covered around 90% of the peptides detected, while LinStu
had an MFDC covering less than 70% of the peptides. Based upon this criterion, all the
phages listed in Table 3 had an MFDC that matched the initial cluster assignment
from nucleotide analysis.
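The MFDC criterion can be expressed as a simple counting step. The sketch below is a minimal illustration, not the ProteinCounter implementation: it assumes that each detected peptide has already been assigned to exactly one cluster label, and the function name and example labels are hypothetical.

```python
from collections import Counter


def most_frequently_detected_cluster(peptide_clusters):
    """Return (cluster, fraction) for the cluster with the most peptides.

    peptide_clusters: one cluster label per detected peptide (assumed input).
    Ties are not resolved specially; the first label counted wins.
    """
    if not peptide_clusters:
        raise ValueError("no peptides were provided")
    counts = Counter(peptide_clusters)
    cluster, n_peptides = counts.most_common(1)[0]
    return cluster, n_peptides / len(peptide_clusters)


# Hypothetical labels for five detected peptides.
mfdc, fraction = most_frequently_detected_cluster(["B", "B", "B", "A", "E"])
```

The returned fraction corresponds to the share of detected peptides covered by the MFDC, which can be compared against the cluster assigned from nucleotide analysis.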

Discussion
Validation of phage genome annotations
Most of the peptides detected from the MS experimental analysis validated the
existing phage genome annotations of putative proteins. A more thorough validation
was limited, however, by the number of peptides detected and the overall coverage of
the translated genomic sequence. The matching peptides did not overlap with
annotated start sites for putative proteins with one notable exception: the identified
peptide from the tapemeasure protein (gp26) of phage LHTSCC matched six amino
acids upstream from the predicted annotated start site. This suggests that the actual
start site of gp26 may be different from the current annotated start site. Future
applications will explore strategies to optimize coverage and the number of phage
peptides isolated and subsequently detected and analyzed, by changing variables such
as the titer of phage, the incubation time before sample isolation, and the mixture of
protease used to digest the isolated proteins prior to MS.
Studying uncharacterized phage proteins produced by native expression
Previous studies used recombinant protein expression to characterize the function of
non-structural proteins [15, 16, 17]. The recombinant expressed proteins, however,
may differ from the native phage protein and this could impact function. Native phage
proteins may undergo unique posttranslational modifications including the addition
or removal of specific amino acid residues [18, 19, 20]. Since phage proteins perform
specialized functions in host bacterial cells, it is likely that native phage proteins have
certain properties that facilitate functional interactions with host bacterial proteins. In
fact, recent research [21] revealed that the protein encoded by lysA of phage D29 is
active only in M. smegmatis and is inactive in Escherichia coli (E. coli). The
method designed in this research provides an effective way to study uncharacterized
phage proteins through native expression in the bacterial host cells.
Phage proteins, phage life cycles and bacterial physiology
The mechanisms that control phage protein expression are not well understood. Since
the phages do not have native metabolism, phage protein expression relies on the
machinery of the host bacteria. As the physiology of the host changes, the pattern of
protein expression for the phage is also likely to change in response.
Previous research showed that wildtype E. coli cells, infected with a single phage, or
with multiplicity of infection (MOI) of one, will predominantly be lysed; if MOI is
higher than two, the lysogenic cycle is preferred over the lytic cycle [22, 23]. It was
reported that the frequency of the lysogenic cycle would be increased in an
environment that was not favorable for bacterial proliferation [24].
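To make the role of MOI concrete, a common simplification is to treat the number of phages infecting each cell as Poisson-distributed with a mean equal to the MOI. This model is an assumption introduced here for illustration and is not stated in the studies cited above; the function names are hypothetical.

```python
import math


def poisson_infection_probability(moi, k):
    """Probability that a cell is infected by exactly k phages,
    assuming Poisson-distributed adsorption with mean equal to the MOI."""
    return math.exp(-moi) * moi ** k / math.factorial(k)


def fraction_infected_at_least(moi, k_min):
    """Fraction of cells infected by k_min or more phages."""
    return 1.0 - sum(poisson_infection_probability(moi, k) for k in range(k_min))


# At a bulk MOI of 1, compare uninfected, singly and multiply infected cells.
p_none = poisson_infection_probability(1.0, 0)
p_single = poisson_infection_probability(1.0, 1)
p_multiple = fraction_infected_at_least(1.0, 2)
```

Under this assumption, a culture at a bulk MOI of one still contains a mixture of uninfected, singly infected, and multiply infected cells (roughly 37%, 37%, and 26%, respectively), which is one reason the observed balance of lytic and lysogenic outcomes depends on the MOI.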
The studies above suggest that both M. smegmatis cell culture growth phases and
MOI could affect the phage life cycle. In order to obtain a high frequency of the
phage lysogenic cycle, St-Pierre and Endy grew E. coli cell culture for phage
infection to stationary phase [25]. Pope et al. investigated protein expression of the
mycobacteriophage Patience by infecting exponentially growing M. smegmatis and
found that the peptide abundance of some proteins changed within the first 2.5 hours after
infection [7]. Future studies investigating changes over time in phage proteins
isolated from infected bacterial cells could provide a better understanding
of the putative phage proteins beyond previous methods focusing solely on purified
phage virion particles. A better understanding of the phage protein expression pattern
could help expand the applications of phage. Diverse applications could include: a
phage-based biosystem with deterministic behavior based on the mechanism of lytic-
lysogenic lifecycle alteration, production of anti-bacterial phage agents with higher
efficiency, and creation of new functional cassettes and modules (BioBricks) for
synthetic biology as novel phage proteins are characterized.
Conclusions
In this research, we developed an MS-based method that can validate phage genome
annotation in a wet lab. This method also provides an effective way to study native
phage proteins. In the future, the method will be improved to increase the number of
phage peptides detected and to optimize phage peptide identification. Phage protein
expression will be further examined in a dynamic phage-bacteria system. A better
understanding of phage-host interactions could then be obtained and exploited in
applications of phages.
Methods
Phage stocks
Stocks of fifteen phages (Table A1) were obtained from Prof. Graham Hatfull’s
research group through the HHMI SEA-PHAGES program. The sample Kugel
& Avrafan appeared to be a mixture of two different phages, so results and further
analysis from this sample were not included. In the phage-bacteria mixtures of
Crossroads, Kamiyu, L5, Patience, and Pixie, no phage peptides were detected using
X! Tandem. Therefore, the results and further analysis of these six samples
were not included in this paper.
Protein Extraction
A portion of M. smegmatis 48-hour culture (5 ml) and 100 μl of phage stocks were
incubated together in a 250 ml sterile Erlenmeyer flask overnight at 37°C with
shaking. After the incubation, the phage-bacteria mixture was transferred into a 15 ml
conical tube and centrifuged for 10 minutes at 1,000 × g. The pellet was collected and
resuspended with 100 μl of 100 mM ammonium bicarbonate. A PCT microtube
containing 100 μl of the mixture was placed in Barocycler NEP 2320 (Pressure
BioSciences Inc.) for cell lysis at 35,000 psi for 30 to 80 cycles at room temperature.
Protein concentration of each sample was estimated using nitrocellulose paper and
Bradford assay. A portion (50 μl) of each sample from the barocycler was transferred
into a 1.5 ml microcentrifuge tube. The samples were mixed with 100 μl of
chloroform/methanol (2:1) and centrifuged for five minutes at maximum speed. The
lower phase of the sample was discarded and sterile water was added to bring the total
volume up to 100 μl. Four times the volume of -20°C acetone was added to the
sample. The sample was mixed by vortex and centrifuged for 5 minutes at maximum
speed. The supernatant was removed and the pellet was saved. The protein pellet was
resuspended with 10 μl of denaturation solution (8 M urea + 10 mM dithiothreitol in
10 ml water). This solution was incubated at 37°C for 0.5 to 1.5 hours. After that, 10
μl of freshly prepared cocktail (195 μl of acetonitrile, 1 μl triethylphosphine, and 4 μl
2-iodoethanol) was added to the sample and incubated at 37°C for 0.5 to 1.5 hours.
The sample was then completely dried by speed vacuum.
HPLC-MS/MS
Trypsin was added (1 μg trypsin per 50 μg protein) to digest the dry sample at 37°C
for at least 12 hours. Digestion was stopped by adding 1 μl of 10% trifluoroacetic acid
(TFA). Finally, 20 μl of 0.01% TFA was added to the sample to bring final
concentration to 1 μg/μl. Protein concentration was estimated using nitrocellulose
paper. The trypsin-digested proteins were then separated by HPLC and introduced
into a mass spectrometer for fragmentation and sequencing to identify the parent
proteins. The tryptic peptides were separated on a nanoLC system (1100 Series LC,
Agilent Technologies, Santa Clara, CA). The peptides were loaded on the Agilent
300SB-C18 enrichment column for concentration and the enrichment column was
switched into the nano-flow path after 5 minutes. Peptides were separated with the
C18 reversed phase ZORBAX 300SB-C18 analytical column (0.75 μm × 150 mm, 3.5
μm) from Agilent. The column was connected to the emission tip from New Objective
and coupled to the nano-electrospray ionization (ESI) source of the high resolution
hybrid ion trap mass spectrometer LTQ-Orbitrap XL (Thermo Scientific). The
peptides were eluted from the column using an acetonitrile (ACN)/0.1% formic acid
(FA, mobile phase B) linear gradient. For the first 5 minutes, the column was
equilibrated with 95% H2O/0.1% formic acid (mobile phase A) followed by the linear
gradient of 5% B to 40% B in 45 minutes at 0.3 μl/min, then from 40% B to 100% B in
an additional 5 minutes. The column was washed with 100% ACN/0.1% FA and
equilibrated with 95% H2O/0.1% FA before the next sample was injected. A blank
injection was run between samples to avoid carryover. The LTQ-Orbitrap mass
spectrometer was operated in the data-dependent positive acquisition mode in which
each full MS scan (30,000 resolving power) was followed by eight MS/MS scans
where the eight most abundant molecular ions were selected and fragmented by
collision induced dissociation (CID) using a normalized collision energy of 35%.
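The elution gradient described above can be written as a piecewise-linear function of time. The sketch below assumes that time is measured in minutes from the start of the 5-minute equilibration and that each gradient segment begins immediately after the previous one; these timing details and all names are assumptions made for illustration, not instrument method settings.

```python
# (time in minutes, % mobile phase B) breakpoints, inferred from the text:
# 5 min hold at 5% B, 5% to 40% B over 45 min, then 40% to 100% B over 5 min.
GRADIENT = [(0.0, 5.0), (5.0, 5.0), (50.0, 40.0), (55.0, 100.0)]


def percent_b(t_min):
    """Linearly interpolate the percentage of mobile phase B at time t_min."""
    if t_min <= GRADIENT[0][0]:
        return GRADIENT[0][1]
    for (t0, b0), (t1, b1) in zip(GRADIENT, GRADIENT[1:]):
        if t_min <= t1:
            return b0 + (b1 - b0) * (t_min - t0) / (t1 - t0)
    # After the last breakpoint the column is held at 100% B for washing.
    return GRADIENT[-1][1]


# Composition halfway through the 45-minute segment.
midpoint_b = percent_b(27.5)
```

Writing the gradient in this form makes it straightforward to relate a peptide's retention time to the approximate solvent composition at which it eluted.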
Data analysis
Raw data from the Orbitrap instrument were processed using X! Tandem and searched
against a database comprising the six-frame translations of the phage genomes and the annotated
proteins of M. smegmatis. Genomic DNA sequences (FASTA files) of the phages
were downloaded from PhagesDB.org. All the phage genomic DNA sequences were
converted to six-frame translations. Annotated protein sequences of M. smegmatis
(NC_018289.1) were downloaded from the National Center for Biotechnology
Information (NCBI). The output of X! Tandem was then analyzed with Scaffold 4
(Proteome Software Inc.) to adjust the peptide false discovery rate (FDR) to 1% and the
protein FDR to 5% for the “Relaxed” setting, and to a peptide FDR of 0.1% and a protein FDR
of 0.6% for the “Stringent” setting. The actual FDRs differed slightly from these targets and were
recorded. The spectrum-matched peptides were then marked in the six reading frames
of their own genome in Artemis (Wellcome Trust Sanger Institute). Alternatively, the raw
spectral data were analyzed using the Mascot Daemon (v.2.5.1) protein search engine
and searched against the PhagesDB database to identify proteins that were detected.
During the search, the MS/MS tolerance was set to 0.2 Da and the peptide tolerance was
set to 0.05 Da. A maximum of one missed cleavage was allowed in the search.
Ethanolyl of cysteine was set as a fixed modification and oxidation of methionine,
phosphorylation of serine, threonine, and tyrosine were set as variable modifications.
The peptide FDR was set to 1% for the “Relaxed” setting and 0.1% for the “Stringent”
setting. Unlike the workflow with X! Tandem and Scaffold 4, both peptide identification and
FDR adjustment could be accomplished in Mascot, but only the peptide FDR could be
adjusted. The actual FDRs differed slightly from these targets and were recorded. The output of
the Mascot search was used to perform spectral counting using the ProteinCounter
software (proteo.bbc.purdue.edu/proteincounter/). The phage cluster was
identified as the cluster that included the largest number of detected peptides, i.e., the
most frequently detected cluster (MFDC).
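The search parameter allowing one missed cleavage determines which candidate peptides a search engine considers for each protein. The following sketch enumerates such candidates in silico. It assumes the widely used rule that trypsin cleaves after lysine (K) or arginine (R) except when the next residue is proline (P); this rule and the function names are assumptions for illustration and do not reproduce Mascot's internal implementation.

```python
def cleavage_fragments(protein):
    """Split a protein after K or R, except when followed by P (assumed rule)."""
    fragments, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            fragments.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        fragments.append(protein[start:])
    return fragments


def tryptic_peptides(protein, max_missed=1):
    """Enumerate peptides containing at most max_missed missed cleavage sites."""
    fragments = cleavage_fragments(protein)
    peptides = []
    for i in range(len(fragments)):
        last = min(i + max_missed, len(fragments) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(fragments[i:j + 1]))
    return peptides


# Toy sequence, not a real phage protein.
peptides = tryptic_peptides("MKRPAKLG", max_missed=1)
```

Increasing the number of allowed missed cleavages enlarges the candidate list and therefore the search space, which is one reason this parameter is usually kept small.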
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
The methodology was designed and conducted by KLC. The wet lab experiments
were conducted by SJH. The clustering assignments were performed by SJH and YL.
Analysing mass spectra using X! Tandem and Scaffold 4 was performed by YL.
Labeling detected peptides in genome maps was done by YL, ZTA, HGE, MOF,
NRH and JW. VEH provided guidance for the data analysis with Mascot, and TJPS
installed X! Tandem and assisted in associated data analysis. The manuscript was
drafted by YL, SJH and KLC. All authors read and approved the final manuscript.
Acknowledgements
This program has received funding from Howard Hughes Medical Institute Science
Education Alliance, Purdue Polytechnic Institute, Department of Agricultural and
Biological Engineering, and Purdue Biotechnology Innovation and Regulatory
Science Center. We also thank Dr. Graham Hatfull, Dr. Welkin H. Pope, Debbie
Jacobs-Sera, and Dan Russell from University of Pittsburgh who provided expertise,
phage samples, and creative encouragement through the HHMI SEA Phage challenge
competition for this research project. We also thank the undergraduate students
enrolled in Biotechnology Lab II (IT22700) during spring semester 2013 who
performed a portion of the studies as a course undergraduate research experience, and
Haleigh Eppler and Julie Beth Gillespie who assisted with repeating the experiments
during summer 2013. We also thank the HHMI Science Education Alliance phage
community.
References
1. Marks T, Sharp R: Bacteriophages and biotechnology: a review. J Chem
Technol Biotechnol 2000, 75: 6-17
2. Herskowitz I, Hagen D: The lysis-lysogeny decision of phage λ: explicit
programming and responsiveness. Ann Rev Genet 1980, 14: 399-445
3. Pope WH, Bowman CA, Russell DA, Jacobs-Sera D, Asai DJ, Cresawn SG,
Jacobs WR Jr, Hendrix RW, Lawrence JG, Hatfull GF, SEA-PHAGES, PHIRE,
Mycobacterial Genetics Course: Whole genome comparison of a large
collection of mycobacteriophages reveals a continuum of phage genetic
diversity. eLife 2015, 4: e06416.
4. Hatfull GF: The secret lives of mycobacteriophages. Advances in Virus
Research 2012, 82: 179-288
5. Mageeney C, Pope WH, Harrison M, Moran D, Cross T, Jacobs-Sera D,
Hendrix RW, Dunbar D, Hatfull GF: Mycobacteriophage Marvin: a new
singleton phage with an unusual genome organization. J Virol 2012, 86:
4762-4775
6. Acosta J, Agee R, Alvarez M, Czyszczon EA, Gadberry S, Ha S, Harmon N,
Liston L, Mathis C, Meader MC, Menon N, Morgan C, Pizzato H,
Showalter GM, Thompson C, Waterstreet W, Yang X, Rickus J, Clase K:
Comparative analysis of unknown mycobacteriophage using mass
spectrometry and the proteome discovery pipeline. The 5th Annual SEA-
PHAGES Symposium 2013, Ashburn, VA
7. Pope WH, Jacobs-Sera D, Russell DA, Rubin DH, Kajee A, Msibi ZNP,
Larsen MH, Jacobs WR, Jr, Lawrence JG, Hendrix RW, Hatfull GF:
Genomics and proteomics of mycobacteriophage Patience, an accidental
tourist in the Mycobacterium neighborhood. mBio 2014, 5(6): e02145-14
8. Mascot. http://www.matrixscience.com/. Accessed on 04 June 2015
9. Cresawn SG, Bogel M, Day N, Jacobs-Sera D, Hendrix RW, Hatfull GF:
Phamerator: a bioinformatic tool for comparative bacteriophage
genomes. BMC Bioinformatics 2011, 12(395)
10. Vlek A, Kolecka A, Khayhan K, Theelen B, Groenewald M, Boel E,
Multicenter Study Group, Boekhout T: Interlaboratory comparison of
sample preparation methods, database expansions, and cutoff values for
identification of yeasts by matrix-assisted laser desorption ionization-time
of flight mass spectrometry using a yeast test panel. J Clin Microbiol 2014
52 (8): 3023-3029
11. Martinez RM, Bauerle ER, Fang FC, Butler-Wu SM: Evaluation of three
rapid diagnostic methods for direct identification of microorganisms in
positive blood cultures. J Clin Microbiol 2014 52(7): 2521-2529
12. Fedorko DP, Drake SK, Murray PR: Identification of clinical isolates of
anaerobic bacteria using matrix-assisted laser desorption ionization-time

- 22 -
of flight mass spectrometry. Eur J Clin Microbiol Infect Dis 2012 31: 2257-
2262
13. PhagesDB. http://phagesdb.org. Accessed on 04 June 2015
14. Hatfull GF, Jacobs-Sera D, Lawrence JG, Pope WH, Russell DA, Ko CC,
Weber RJ, Patel MC, Germane KL, Edgar RH, Hoyte NN, Bowman CA,
Tantoco AT, Paladin EC, Myers MS, Smith AL, Grace MS, Pham TT,
O’Brien MB, Vogelsberger AM, Hryckowian AJ, Wynalek JL, Donis-Keller
Helen, Bogel MW, Peebles CL, Cresawn SG, Hendrix RW: Comaparative
genomic analysis of 60 mycobacteriophage genome: genome clustering,
gene acquisition, and gene size. J Mol Biol 2010, 397: 119-143
15. Henry M, Begley M, Neve H, Maher F, Ross RP, McAuliffe O, Coffey A,
O’Mahony JM: Cloning and expression of a mureinolytic enzyme from the
mycobacteriophage TM4. FEMS Microbiol Lett 2010, 311: 126-132
16. van Kessel JC, Hatfull GF: Recombineering in Mycobacterium tuberculosis.
Nat Methods 2007, 4 (2): 147-152
17. Catalao MJ, Milho C, Gil F, Moniz-Pereira J, Pimentel M: A second
endolysin gene is fully embedded in-frame with the lysA gene of
mycobacteriophage Ms6. Plos ONE 2011, 6 (6): e20515
18. Xu M, Arulandu A, Struck DK, Swanson S, Sacchettini JC, Young R:
Disulfide isomerization after membrane release of its SAR domain
activates P1 lysozyme. Science 2005, 307: 113-117
19. Nelson D, Schuch R, Chahales P, Zhu S, Fischetti VA: PlyC: a multimeric
bacteriophage lysin. Proc Natl Acad Sci U.S.A. 2006, 103: 10765-10770

- 23 -
20. Kanamaru S, Ishiwata Y, Suzuki T, Rossman MG, Arisaka F: Control of
bacteriophage T4 tail lysozyme activity during the infection process. J Mol
Biol 2005, 346: 1013-1020
21. Pohane AA, Joshi H, Jain V: Molecular dissection of phage endolysin: an
interdomain interaction confers host specificity in Lysin A of
mycobacterium phage D29. J Biol Chem 2014, 289:12085-12095
22. Kourilsky P: Lysogenization by bacteriophage lambda. I. Multiple
infection and the lysogenic response. Mol Gen Genet 1973, 122: 183-195
23. Oppenheim AB, Kobiler O, Stavans J, Court DL, Adhya S: Switches in
bacteriophage lambda development. Annu Rev Genet 2005, 39: 409-429
24. Marsh P, Wellington EMH: Phage-host interactions in soil. FEMS Microbiol
Ecol 1994, 15: 99-108
25. St-Pierre F, Endy D: Determination of cell fate selection during phage
lambda infection. PNAS 2008, 105(52): 20705-20710

- 24 -
Figures
Figure 1 - Procedures of the research
[Flowchart with a wet-lab stage and an in-silico stage. Labels: phage stocks, phage
infection, phage-bacteria mixture, incubation overnight, protein extraction, proteins,
trypsin digestion, peptides, HPLC-MS/MS, mass spectra, search against database
(X! Tandem & Scaffold 4; Mascot), peptide identification.]
Figure 2 - Peptides identified using X! Tandem
[Bar chart "Peptides Identified". X-axis: Phage (Babsiella, Chah, Giles, Henry, LHTSCC,
LinStu, Omega, Papyrus, Wee). Y-axis: Number (0 to 700). Series: Phage Peptides,
Bacterial Peptides.]
Figure 3 - Percentage of sample phage peptides in the total peptides identified
[Bar chart "Percentage of the peptides of the sample phages". X-axis: Phage (Babsiella,
Chah, Giles, Henry, LHTSCC, LinStu, Omega, Papyrus, Wee). Y-axis: Percentage (%)
(0 to 70).]
Figure 4 - Comparison between the genomes of Babsiella and Che9c
Figure 5 - Comparison between the genomes of Henry and Cjw1
[Genome comparison maps. Labels recovered from Figures 4 and 5: Virion structure and
assembly proteins; Cjw1; Henry; Holin.]
Figure 6 - A peptide of LHTSCC gp26 covering six upstream amino acids

Tables
Table 1 - Peptide identification using X! Tandem
Phage      Relaxed (a)                       Stringent (a)
           No. of Peptides   No. of Spectra  No. of Peptides   No. of Spectra
Babsiella  46                60              36                48
Chah       94                131             81                107
Giles      26                38              21                28
Henry      57                86              51                72
LHTSCC     84                131             68                100
LinStu     64                90              45                55
Omega      61                86              50                72
Papyrus    36                51              27                38
Wee        15                22              8                 12
(a) The actual FDRs for the Relaxed and Stringent settings are listed in Table A2.

Table 2 - Peptide identification using Mascot
Phage      No. of Peptides   No. of Total  Percentage  No. of Spectra    No. of Total  Percentage
           of Sample Phages  Peptides      (%)         of Sample Phages  Spectra       (%)
Papyrus    53                89            60          78                141           55
Babsiella  55                102           54          82                148           55
Omega      54                134           40          94                236           40
Giles      33                166           20          64                259           25
LHTSCC     47                414           11          66                919           7
LinStu     12                178           7           15                282           5
Chah       13                318           4           19                490           4
Henry      5                 134           4           8                 257           3
Wee        0                 24            0           0                 54            0
Both Relaxed and Stringent settings led to the same result. The actual peptide FDRs
for Relaxed and Stringent settings are listed in Table A3.
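
The percentage columns in Table 2 can be reproduced from the counts. The sketch below is not part of the original analysis. It assumes each percentage is the sample-phage count divided by the corresponding total, rounded to a whole number, and it treats the fifth column as a spectra count because that reading reproduces the reported spectra percentages. The names `TABLE2`, `percent`, and `table2_percentages` are illustrative.

```python
# Counts copied from Table 2 (Mascot search).
# phage -> (sample-phage peptides, total peptides, sample-phage spectra, total spectra)
TABLE2 = {
    "Papyrus":   (53, 89, 78, 141),
    "Babsiella": (55, 102, 82, 148),
    "Omega":     (54, 134, 94, 236),
    "Giles":     (33, 166, 64, 259),
    "LHTSCC":    (47, 414, 66, 919),
    "LinStu":    (12, 178, 15, 282),
    "Chah":      (13, 318, 19, 490),
    "Henry":     (5, 134, 8, 257),
    "Wee":       (0, 24, 0, 54),
}


def percent(part: int, total: int) -> int:
    """Return part/total as a whole-number percentage (the rounding rule is an assumption)."""
    return round(100 * part / total)


def table2_percentages() -> dict:
    """Map each phage to its (peptide %, spectrum %) pair."""
    return {
        phage: (percent(pep, total_pep), percent(spec, total_spec))
        for phage, (pep, total_pep, spec, total_spec) in TABLE2.items()
    }


if __name__ == "__main__":
    for phage, (pep_pct, spec_pct) in table2_percentages().items():
        print(f"{phage:<10} peptides {pep_pct:>3}%   spectra {spec_pct:>3}%")
```

Under these assumptions the computed pairs agree with both percentage columns for all nine phages.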

Table 3 - Analysis of Most Frequently Detected Clusters
Phage      Initial cluster  MFDC (a)  No. of Proteins  No. of Peptides  No. of Spectra  MFDC Peptides (%)
Chah       B1               B1        66               310              479             97.48
Henry      E                E         31               130              245             97.01
Omega      J                J         32               117              210             87.31
LHTSCC     A4               A4        62               359              746             86.71
Papyrus    R                R         25               76               120             85.39
Wee        F1               F1        8                20               41              83.33
Babsiella  I1               I1        20               82               119             80.39
LinStu     C1               C1        29               117              193             65.73
(a) Most Frequently Detected Cluster
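
The MFDC peptide percentages in Table 3 match the Table 3 peptide counts divided by the total peptide counts that Mascot reports in Table 2. This relationship is inferred from the numbers and is not stated in the tables, so the sketch below treats it as an assumption. The names `MFDC_PEPTIDES`, `TOTAL_PEPTIDES`, and `mfdc_percent` are illustrative.

```python
# Peptides assigned to the most frequently detected cluster (Table 3, "No. of peptides").
MFDC_PEPTIDES = {
    "Chah": 310, "Henry": 130, "Omega": 117, "LHTSCC": 359,
    "Papyrus": 76, "Wee": 20, "Babsiella": 82, "LinStu": 117,
}

# Total peptides identified by Mascot (Table 2, "No. of Total Peptides").
TOTAL_PEPTIDES = {
    "Chah": 318, "Henry": 134, "Omega": 134, "LHTSCC": 414,
    "Papyrus": 89, "Wee": 24, "Babsiella": 102, "LinStu": 178,
}


def mfdc_percent(phage: str) -> float:
    """Assumed definition: MFDC peptides as a share of all identified peptides, in percent."""
    return round(100 * MFDC_PEPTIDES[phage] / TOTAL_PEPTIDES[phage], 2)


if __name__ == "__main__":
    for phage in MFDC_PEPTIDES:
        print(f"{phage:<10} {mfdc_percent(phage):6.2f}%")
```

For example, Chah gives 310 / 318, which rounds to the 97.48% listed in Table 3. Giles does not appear in Table 3, so it is omitted here as well.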

Additional files
Table A1 – Phage stocks
Number  Name
1       LinStu
2       Crossroads (a)
3       Pixie (a)
4       Wee
5       Kugel & Avrafan (b)
6       Chah
7       Babsiella
8       Kamiyu (a)
9       L5 (a)
10      Patience (a)
11      Giles
12      Henry
13      Omega
14      Papyrus
15      LHTSCC
(a) No phage peptides identified using X! Tandem and Scaffold 4
(b) The sample is a mixture of two phages

Table A2 – Actual thresholds and false discovery rates (FDRs) obtained by Scaffold 4
Phage      Relaxed                                   Stringent
           Protein              Peptide              Protein              Peptide
           Threshold  FDR       Threshold  FDR       Threshold  FDR       Threshold  FDR
           (%)        (%)       (%)        (%)       (%)        (%)       (%)        (%)
Babsiella  5.0        2.90      90.2       1.00      98.9       0.60      99.1       0.10
Chah       5.0        2.20      89.4       0.99      98.6       0.60      99.1       0.10
Giles      5.0        3.40      92.5       0.99      5.0        0.50      99.5       0.10
Henry      5.0        3.30      91.6       1.00      98.6       0.60      99.2       0.10
LHTSCC     5.0        3.20      92.0       0.99      98.5       0.60      99.4       0.09
LinStu     5.0        2.40      90.6       1.01      97.5       0.60      99.3       0.10
Omega      5.0        2.90      90.2       1.00      98.8       0.60      99.2       0.10
Papyrus    5.0        2.50      88.8       1.00      98.5       0.60      99.2       0.10
Wee        29.7       5.00      92.2       0.99      98.8       0.60      99.5       0.09

Table A3 – Actual peptide FDRs obtained by Mascot
Phage      Relaxed (%)  Stringent (%)
Wee        0.85         0
LinStu     0.91         0
Omega      0.89         0
Papyrus    0.98         0
Giles      0.89         0
Babsiella  0.82         0
Henry      0.99         0
Chah       0.93         0
LHTSCC     0.98         0

Table A4 – Babsiella annotated proteins identified by mass spectrometry
Product  Function                      Relaxed             Stringent
                                       Peptides  Spectra   Peptides  Spectra
gp5                                    1         1         1         1
gp6      protease                      2         3         2         3
gp7      capsid                        10        12        7         9
gp8                                    1         2         1         2
gp9      Hyp                           1         1         1         1
gp13                                   2         2         2         2
gp14     tail assembly chaperone       2         2         2         2
gp17     minor tail                    1         1         0         0
gp16     tapemeasure                   2         3         2         3
gp21     major tail                    2         3         2         3
gp26     LysA                          7         9         6         8
gp27     LysB                          2         2         2         2
gp50     FtsK domain                   2         3         1         2
gp54     DNA methylase domain          1         1         1         1
gp75                                   2         2         1         1
gp76                                   7         12        5         8

Table A5 – Chah annotated proteins identified by mass spectrometry
Product  Function                               Relaxed             Stringent
                                                Peptides  Spectra   Peptides  Spectra
gp1                                             1         1         0         0
gp7                                             2         3         2         3
gp9      portal                                 1         1         1         1
gp12     capsid                                 6         10        6         10
gp14                                            8         13        8         12
gp16                                            1         2         1         2
gp18     major tail subunit                     1         3         1         3
gp23                                            3         4         3         3
gp24                                            1         1         1         1
gp25                                            2         3         2         2
gp35                                            1         1         1         1
gp36                                            1         1         1         1
gp39                                            1         1         0         0
gp42                                            3         4         2         3
gp43                                            1         1         1         1
gp45                                            1         1         1         1
gp50     LysA                                   7         12        7         10
gp51     LysB                                   4         5         4         5
gp52                                            4         5         3         3
gp53                                            6         9         5         6
gp54     helicase type III restriction subunit  1         1         1         1
gp56                                            1         1         1         1
gp58                                            1         1         0         0
gp60     RepA helicase                          10        15        9         14
gp61                                            2         2         1         1
gp63                                            4         4         3         3
gp68                                            4         5         3         3
gp69                                            1         1         1         1
gp73                                            4         7         4         5
gp74                                            2         3         2         3
gp93                                            3         3         1         1
gp94                                            4         4         3         3
gp99                                            1         2         1         2
gp103                                           1         1         1         1

Table A6 – Giles annotated proteins identified by mass spectrometry
Product  Function                      Relaxed             Stringent
                                       Peptides  Spectra   Peptides  Spectra
gp3                                    1         1         0         0
gp7      protease                      2         3         2         3
gp8                                    2         4         2         4
gp9      capsid subunit                5         5         4         4
gp10     virion protein                4         7         3         5
gp15                                   1         1         1         1
gp16     virion protein                3         8         3         5
gp18                                   1         1         0         0
gp24     virion protein                1         1         1         1
gp32     LysB                          1         1         1         1
gp37                                   2         3         2         2
gp58                                   1         1         1         1
gp59                                   1         1         0         0
gp60                                   1         1         1         1

Table A7 – Henry annotated proteins identified by mass spectrometry
Product  Function                      Relaxed             Stringent
                                       Peptides  Spectra   Peptides  Spectra
gp13     capsid                        8         13        8         11
gp15                                   1         2         1         2
gp19     major tail subunit            2         3         2         3
gp20     tail assembly chaperone       1         2         1         1
gp23                                   2         4         1         2
gp26     minor tail subunit            3         5         3         5
gp27                                   3         7         3         5
gp30                                   1         2         1         2
gp33     LysA                          2         2         2         2
gp34                                   4         5         3         4
gp35                                   1         2         1         2
gp36     LysB                          1         1         1         1
gp37                                   1         1         1         1
gp49                                   1         1         1         1
gp57                                   1         2         1         2
gp64                                   1         1         1         1
gp89     RNA ligase                    5         6         5         5
gp94                                   2         3         2         3
gp98     Clp                           7         12        6         10
gp111                                  1         1         0         0
gp112    Pol II                        1         1         1         1
gp117    RecA                          3         4         3         4
gp134                                  1         1         0         0

Table A8 – LHTSCC annotated proteins identified by mass spectrometry
Product  Function                      Relaxed             Stringent
                                       Peptides  Spectra   Peptides  Spectra
gp3      DNA polymerase III            3         4         2         3
gp4                                    4         4         4         4
gp7                                    1         2         1         1
gp8      LysA                          5         5         3         3
gp9      holin                         1         2         1         2
gp10     LysB                          2         4         2         3
gp12     portal                        11        15        7         10
gp14     scaffold                      3         4         2         2
gp15     capsid                        14        34        13        27
gp17                                   3         4         3         4
gp18                                   1         1         1         1
gp24     tail assembly chaperone       3         6         2         4
gp46                                   1         3         1         2
gp50     ThyX                          2         3         2         3
gp51                                   3         5         3         4
gp52     ribonucleotide reductase      7         9         6         7
gp54     RDF                           3         3         3         3
gp57     DNA primase                   1         1         1         1
gp74                                   1         1         0         0

Table A9 – LinStu annotated proteins identified by mass spectrometry
Product  Function                      Relaxed             Stringent
                                       Peptides  Spectra   Peptides  Spectra
gp29                                   2         2         1         1
gp62                                   2         3         2         3
gp88                                   6         7         4         4
gp90                                   1         1         1         1
gp96     LysM domain                   5         7         3         4
gp97                                   2         4         2         4
gp98                                   2         3         2         3
gp99                                   10        17        8         5
gp117                                  2         2         2         2
gp125                                  6         10        4         7
gp126                                  2         3         2         3
gp128    tail assembly chaperone       1         2         1         2
gp138    baseplate J                   1         1         0         0
gp139                                  1         2         1         2
gp171                                  1         1         0         0
gp190                                  2         2         1         1
gp196                                  1         1         0         0
gp216                                  4         6         3         4
gp220                                  2         2         0         0
gp221                                  1         1         0         0
gp225                                  1         1         1         1
gp234                                  2         3         1         2
gp243    head decoration               4         4         3         3
gp248                                  3         5         3         3

Table A10 – Omega annotated proteins identified by mass spectrometry
Product  Function                      Relaxed             Stringent
                                       Peptides  Spectra   Peptides  Spectra
gp3      terminase small unit          1         2         1         1
gp15     capsid                        9         14        9         14
gp23                                   1         1         1         1
gp31     major tail unit               3         4         2         3
gp32     tail assembly chaperones      2         6         2         5
gp35                                   3         3         2         2
gp37     minor tail proteins           1         1         1         1
gp38                                   2         3         2         3
gp39     D-ala-D-ala-carboxypeptidase  3         3         3         3
gp42                                   2         6         2         6
gp50     LysA                          7         9         7         8
gp52                                   3         4         1         2
gp53     LysB                          2         2         2         2
gp65                                   1         1         1         1
gp115                                  1         1         1         1
gp123    AddA-like protein             1         2         1         2
gp127    DNA methylase                 1         1         0         0
gp144                                  4         5         3         4
gp149                                  1         1         0         0
gp160                                  1         1         1         1
gp162    RNA ligase                    3         5         1         3
gp163    Clp                           1         1         1         1
gp164    DnaQ                          2         3         2         3
gp173                                  3         4         2         3
gp174    Hyp                           1         1         1         1
gp198                                  1         1         1         1
gp211                                  1         1         0         0

Table A11 – Papyrus annotated proteins identified by mass spectrometry
Product  Function                      Relaxed             Stringent
                                       Peptides  Spectra   Peptides  Spectra
gp2                                    1         2         1         2
gp7      portal                        1         2         1         1
gp14                                   2         4         1         2
gp15     tail sheath                   1         1         1         1
gp16                                   6         6         3         3
gp17     DNA binding domain            2         5         2         5
gp22                                   3         3         2         2
gp24     fibronectin type III domain   1         3         1         3
gp25     tail assembly chaperone       4         6         3         5
gp33                                   2         3         2         3
gp34                                   2         2         2         2
gp36                                   2         3         1         1
gp52                                   1         1         1         1
gp53                                   1         1         1         1
gp68                                   2         2         1         1
gp69                                   1         1         0         0
gp71                                   1         1         1         1
gp85                                   2         3         2         3
gp96                                   1         2         1         1

Table A12 – Wee annotated proteins identified by mass spectrometry
Product  Function                      Relaxed             Stringent
                                       Peptides  Spectra   Peptides  Spectra
gp6      capsid                        5         6         3         4
gp11     major tail subunit            3         6         1         4
gp12     tail assembly chaperone       2         2         2         2
gp31     LysA                          2         2         2         2
gp32     LysB                          1         1         0         0
gp71                                   1         4         0         0
gp102                                  1         1         0         0

Figure A1 - Total peptides identified in Giles and Omega using Mascot and X!
Tandem & Scaffold against three distinct databases
In both Giles (A) and Omega (B), Mascot identified more peptides than X! Tandem
and Scaffold 4.
[Bar charts. Panel A: "Total peptides identified in Giles", y-axis Number of Peptides
(0 to 1400). Panel B: "Total peptides identified in Omega", y-axis Number of Peptides
(0 to 1200). X-axis (Databases): (1) the six frame translation of the phage + the six
frame translation of M. smegmatis; (2) PhagesDB + UniProt; (3) the six frame
translation of the phage + the six frame translation of M. smegmatis + PhagesDB +
UniProt. Series: X! Tandem and Scaffold 4; Mascot.]