USE OF BIO-INFORMATIC TOOLS TO STUDY IMPLICATIONS OF G-C CONTENT OF DNA ON THE PROTEIN.
DEBTANU CHAKRABORTY
Index
1) Note of Acknowledgement
2) Bio-informatics
3) G-C content
4) Classification tree of Bacteria
5) List of low G-C bacteria
6) List of high G-C bacteria
7) Introduction to Carbonic Anhydrase
8) Peptide Sequence and their analysis
9) Gene Sequences and their analysis
10) Codon usage plot
11) Conclusion
12) Future work-scope
Note of Acknowledgement
The project would have been incomplete without the help of a number of people. First, I would like to thank my mentor and guide, Prof. Chanchal K. Das Gupta, who gave me the idea and inspiration for the project and helped me at every step whenever I was in trouble. I would like to thank Prof. Punyasloke Bhadury, who introduced me to the NCBI website and showed me how to perform tasks such as alignment and BLAST online.
It would be remiss of me not to mention the names of my superiors Papri di, Amit da and Shimonti di, who also helped me with the project.
In my work I have made extensive use of the NCBI and UniProt websites.
Bioinformatics is the application of statistics and computer science to the field of molecular biology.
The term bioinformatics was coined by Paulien Hogeweg in 1979 for the study of informatic processes in biotic systems. Its primary use since at least the late 1980s has been in genomics and genetics, particularly in those areas of genomics involving large-scale DNA sequencing.
Bioinformatics now entails the creation and advancement of databases, algorithms, computational and statistical techniques and theory to solve formal and practical problems arising from the management and analysis of biological data.
Over the past few decades, rapid developments in genomic and other molecular research technologies, together with developments in information technologies, have combined to produce a tremendous amount of information related to molecular biology. Bioinformatics is the name given to the mathematical and computing approaches used to glean understanding of biological processes from this information.
Common activities in bioinformatics include mapping and analyzing DNA and protein sequences, aligning different DNA and protein sequences to compare them, and creating and viewing 3-D models of protein structures.
The primary goal of bioinformatics is to increase the understanding of biological processes. What sets it apart from other approaches, however, is its focus on developing and applying computationally intensive techniques (e.g., pattern recognition, data mining, machine learning algorithms, and visualization) to achieve this goal. Major research efforts in the field include sequence alignment, gene finding, genome assembly, drug design, drug discovery, protein structure alignment, protein structure prediction, prediction of gene expression and protein-protein interactions, genome-wide association studies and the modeling of evolution.
GC-content (or guanine-cytosine content), in molecular biology, is the percentage of nitrogenous bases on a DNA or RNA molecule which are either guanine or cytosine (from a possibility of four different ones, also including adenine and thymine). This may refer to a specific fragment of DNA or RNA, or to the whole genome. When it refers to a fragment of the genetic material, it may denote the GC-content of part of a gene (domain), a single gene, a group of genes (or gene cluster) or even a non-coding region. G (guanine) and C (cytosine) undergo specific hydrogen bonding with each other, whereas A (adenine) bonds specifically with T (thymine).
The GC pair is bound by three hydrogen bonds, while the AT pair is bound by two hydrogen bonds. DNA with high GC-content is more stable than DNA with low GC-content but, contrary to popular belief, the hydrogen bonds do not stabilize the DNA significantly; stabilization is mainly due to stacking interactions. In spite of the higher stability conferred to the genetic material, it is envisaged that cells with DNA of high GC-content undergo autolysis, thereby reducing the longevity of the cell per se. Because of the robustness endowed to the genetic material in high-GC organisms, it was commonly believed that GC content played a vital part in adaptation to temperature, a hypothesis which has recently been refuted.
In PCR experiments, the GC-content of primers is used to predict their annealing temperature to the template DNA. A higher GC-content indicates a higher melting temperature.
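The definition of GC-content, and its link to primer melting temperature, can be illustrated with a short Python sketch. The function names gc_content and wallace_tm are hypothetical, and the Wallace rule (2 degrees C per A or T, 4 degrees C per G or C) is a common rule of thumb for short primers that is assumed here for illustration; it is not necessarily the formula used by any particular PCR tool.

```python
def gc_content(seq: str) -> float:
    """Return the percentage of G and C among the A, C, G, T, U bases of seq.

    Ambiguous symbols such as N are ignored (an assumption of this sketch).
    """
    counted = [base for base in seq.upper() if base in "ACGTU"]
    if not counted:
        raise ValueError("sequence contains no A, C, G, T or U")
    gc = sum(1 for base in counted if base in "GC")
    return 100.0 * gc / len(counted)


def wallace_tm(primer: str) -> int:
    """Rough primer melting temperature in degrees C: 2*(A+T) + 4*(G+C)."""
    p = primer.upper()
    at = p.count("A") + p.count("T")
    gc = p.count("G") + p.count("C")
    return 2 * at + 4 * gc


if __name__ == "__main__":
    # Two made-up primers of equal length: the GC-rich one gets the higher estimate.
    for primer in ("AATTAATTAATT", "GGCCGGCCGGCC"):
        print(primer, gc_content(primer), wallace_tm(primer))
```

Both primers in the usage example are invented for the demonstration; they only show that, at equal length, the estimate rises with GC-content.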
THE EVOLUTIONARY TREE OF BACTERIA, WHERE G-C CONTENT STUDY IS AN ANALYTICAL TOOL.
The guanine plus cytosine (GC) content in bacteria ranges from ~20% to 75%, whereas eukaryotic genomes often have GC contents in a more restricted range of ~35-50% (about 40%-45% in vertebrates).
For our convenience, we chose Carbonic Anhydrase because it is present in bacteria across the entire G-C content spectrum.
The carbonic anhydrases (or carbonate dehydratases) form a family of enzymes that catalyze the rapid conversion of carbon dioxide and water to bicarbonate and protons, a reaction that occurs rather slowly in the absence of a catalyst. The active site of most carbonic anhydrases contains a zinc ion; they are therefore classified as metalloenzymes.
THE CARBONIC ANHYDRASE PROTEIN-
In our analysis, we chose the following bacteria-
1) Methanococcus voltae A3 (UI-A8TF20) (G-Cc=27%)
2) Staphylococcus carnosus (UI-B9DMU8_STACT) (G-Cc=34%)
3) Vibrio cholerae (UI-Q9KMP6_VIBCH) (G-Cc=47%)
4) Escherichia coli (UI-P61517) (G-Cc=50%)
5) Truepera radiovictrix DSM1703 (UI-ADI14363) (G-Cc=68.2%)
6) Salinispora arenicola (UI-A8MOD8) (G-Cc=69.2%)
7) Frankia CcI (UI-Q2JF50) (G-Cc=71%)
*UI stands for the Uniprot Accession number of the Carbonic Anhydrase protein of the respective bacteria.
We begin by analyzing the Carbonic Anhydrase protein from these bacteria. The peptide sequences are as follows-
>Methanococcus voltae Carbonic Anhydrase Protein
LN*LFNLASVNVNHKPFNFHIFRNCRVIFD*FDTFQHVFFFVIHFTHPSFKVWRKVWIYSSFNHFFSYLFNICSCHSTVGMTYDSYLFNI*TVYCNY*RP*YIVCNNITCVFDDFCVASF*THFFR*EIYESCIHTSYYC*FLFRFGFCSDSFTYTQ
>Staphylococcus carnosus Carbonic Anhydrase Protein
YPXXXMTLLESILAYNKDFVGNKEFENYTTSKKPDKKAVLFTCMDTRLQDLGTKALGFNNGDLKVVKNAGAIITHPYGSTIKSLLVGIYALGAEEIIIMAHKDCGMGCLDVSTVKDAMKERGVTEETFKIIEHSGVDVDSFLQGFKDAEENVRRNIDMVYNHPLFDKSVPIHGLVIDPHTGELDLIQDGYELAAQNK*
>Vibrio cholerae Carbonic Anhydrase Protein
MKKTTWVLAMVASMSFGVQASEWGYEGEHAPEHWGKVAPLCAEGKNQSPIDVAQSVEADLQPFTLNYQGQVVGLLNNGHTLQAIVRGNNPLQIDGKTFQLKQFHFHTPSENLLKGKQFPLEAHFVHADEQGNLAVVAVMYQVGSENPLLKVLTADMPTKGNSTQLTQGIPLADWIPESKHYYRFNGSLTTPPCSEGVRWIVLKEPAHLSNQQEQQLSAVMGHNNRPVQPHNARLVLQAD*
>1st Escherichia coli Carbonic Anhydrase Protein
LFVVGVFQLEVGDPVTVTLLKGFAVSRCDIQITQQAVVNAVGPAVNGDFLPAFPR*LHNSGVAQVIHLFHDVQFTQGIQTALLRHFAEQ*AMFEPDIADMQQPVVDKPQFRVFNCGLYAAATVV
>2 Frankia CcI3 Carbonic Anhydrase 2nd Protein
VDTDDHTAVDPVADVHADDVHADTVRPADTVSPVSGAATATELLLSYAAGHPARRREAGLPALPGARPRLGVAVVACMDVRIQVEALLGLVEGDAHILRNAGGVITPDVVRSLAVSQHVLGTTEIILLHHTGCGLERITDDGFRDQLECKTGVRPEWAVYSFPDVEEDVRKSVRVLRSSPFLQSTTSVRGFVYQVETGALVEVLP*
We have 3 protein sequences for E. coli and 2 sequences each for Salinispora and Frankia. We now compare them amongst themselves.
For E. coli-
The sequence marked Escherishsia is the 1st sequence.
The sequence Ecoli is the 2nd sequence.
The sequence Final is the 3rd sequence.
For Salinispora-
For Frankia-
After viewing the alignments of the suspected Carbonic Anhydrases within the same species, we align the proteins from all the sources; all the proteins from the same species are also included.
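The within-species and cross-species comparisons rely on sequence alignment. As a rough illustration of what a pairwise global alignment does, here is a minimal Needleman-Wunsch sketch in Python. It is not the program used for the alignments in this report (those came from online tools); the scoring values and function names are assumptions chosen for simplicity, and real protein alignments normally use a substitution matrix rather than a flat match/mismatch score.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align strings a and b; return (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns holding the same residue in both rows."""
    if not aligned_a:
        return 0.0
    same = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * same / len(aligned_a)


if __name__ == "__main__":
    # Short fragments taken from the Staphylococcus and Frankia (2nd protein) listings.
    top, bottom, s = needleman_wunsch("IIMAHKDCGMGC", "ILLHHTGCGLER")
    print(top)
    print(bottom)
    print("score:", s, "identity:", percent_identity(top, bottom))
```

The two fragments in the usage example are only a small window of each protein; aligning the full sequences from all seven organisms at once requires a multiple-alignment tool.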
The alignment of the sequences from all the bacteria is as follows-
Analysis- we can see two things from the above.
1) Bacteria with high G-C have two genes for Carbonic Anhydrase and consequently 2 proteins suspected to be Carbonic Anhydrase.
2) Bacteria with high G-C incorporate synonymous (chemically similar) amino acids that require G-C rich codons, to compensate in their proteins.
We will elaborate on the 2nd point later using the codon plot, which shows the corresponding codons in the DNA of the Carbonic Anhydrase gene of these bacteria.
Now we move to analyzing the DNA of the Carbonic Anhydrase genes. The DNA sequences are as follows-
>Methanococcus voltae Carbonic Anhydrase of 471 bases
ttaaattaactttttaatctcgccagtgttaatgtcaatcataagcccttcaacttccacatctttaggaattgcagggtgatttttgattaattcgacacctttcaacacgttttcttcttcgttatccattttacccatccaagcttcaaagtctggcgtaaagtatggatttactcctcttttaatcatttcttttcttatctcttcaatatctgctcctgccattccacagtcggtatgacctacgatagctatcttttcaacatctaaacagtatattgcaactactaacgaccttaatacatcgtctgtaataatattacctgcgtttttgatgacttttgcgtcgcctctttctaaacccattttttcaggtaagaaatttacgagtcttgtatccatacaagttattactgctaatttctttttcggtttggcttctgctccgatagtttcacctatactcaa
>Staphylococcus carnosus Carbonic Anhydrase of 594 bases
taccccancancanaatgacgttattagaaagcattttagcttataataaagattttgtcggcaacaaagaatttgaaaactatacaacaagtaaaaaaccagataaaaaagcagtgttatttacatgtatggatacacgtttgcaagatttaggtacaaaagcactcggttttaataatggtgacttgaaagttgttaaaaatgcaggtgcaattatcacgcacccatatggttcaactataaaaagcttactagtaggtatttatgcattaggtgctgaagaaattattattatggcacataaagattgcggaatgggttgtcttgatgtcagcactgttaaagacgcaatgaaagaacgtggcgtaacagaagaaacatttaaaatcatcgaacattctggtgtagatgtagacagctttttacaaggtttcaaagatgctgaagaaaatgtccgcagaaatatcgatatggtatataatcatcccttatttgataaatccgtacctattcacggcttagtcatcgatcctcatacgggggaattagatttaattcaagacggctatgaattagctgctcaaaataaataa
>Vibrio cholerae Carbonic Anhydrase of 720 basesatgaaaaagacaacgtgggtattagcgatggtagccagtatgagcttcggcgtacaggcttccgagtgggggtatgaaggagagcatgctccggagcattggggcaaagttgcccctctttgcgcagagggtaaaaatcaaagcccgattgatgtcgcgcaaagcgtagaagcggatctacagcctttcacgctcaattatcaagggcaagtggttgggctgctcaataacgggcacactttacaagcgatagtccgtggtaataacccactgcagatcgatggcaaaacgtttcagcttaagcagtttcattttcataccccttctgaaaatttgctaaaaggaaaacaattcccactggaagcgcattttgttcatgccgacgagcaaggcaatctggcggttgttgcggtgatgtaccaagtggggtcggaaaatccgctgcttaaggttctcacggcggatatgccgaccaaagggaattcgactcagctcacgcaagggatccctttggctgattggatcccagaatcgaagcactactatcgtttcaatggttcattgactacgccgccttgcagtgaaggtgtacgttggattgtgttaaaagagccagcacatttgtcgaatcaacaagagcagcagcttagtgccgtgatgggacacaataatcgacccgtacaaccgcataatgctcgtcttgtcttgcaagccgactaa>Escherichia coli Carbonic Anhydrase of 372 basesttatttgtggttggcgtgtttcagcttgaggttggagatcccgtgacggtaacgttgctcaagggtttcgcggttagtcgctgtgacatccagatcacgcagcaagccgtcgtgaatgccgtaggcccagccgtgaatggtgactttctgcccgcgtttccacgctgattgcataatagtggagtggcccaggttatacacctgttccatgacgttcagttcacacaaggtatccagacggcgctcttgcggcatttcgccgagcaatgagctatgtttgaaccagatatcgcggatatgcagcagccagttgttgataagccccagttccgggttttcaactgcggcttgtacgccgccgcaaccgtagtg>123 Escherichia coli carbonic Anhydrase Finalaagccccagttccgggttttcaactgcggcttgtacgccgccgcaaccgtagtggccacagataataatgtgttcaacttcgagtacatccactgcatactgaaccacggaaaggcagttcaggtcagtttatttgtggttggcgtgtttcagcttgaggttggaaatcccgtgacggtaacgttgctcaagggtttcgcggttggtggcggtaacatccagatcacgcagcaagccgtcgtgaatgccgtaggcccagccgtgaatggtaactttctgcccgcgtttccacgctgattgcataatggtggagtggcccaggttatacacctgttccatgacgttcagttcacacaaggtatccagacggcgctcttgcggcatttcgccgagcaatgagctatgtttgaaccagatatcgcggatatgcagcagccagttgttgatgtgaatgaccaggttagcaacattacggtgaacaaagagttcgcccggctcaagaccggttaaacgttctgcaggaacgcgactgtcggaacatccaatccatagaaagcgcggtttttgcgcttgtgccagtttctcaaaaaacccgggatcctcttccaccagcatttttgaccatagtgcattgttgctgatgagtgtatctatgtcttt cat>456 Escherichia coli Carbonic Anhydrase Final 
2gtgaaagagattattgatggattccttaaattccagcgcgaggcatttccgaagcgggaagccttgtttaaacagctggcgacacagcaaagcccgcgcacactttttatctcctgctccgacagccg
gaccagcgcatcaccgacagcgtcgcccaccacgcccaccgccgcattcgagaggctgatggagggcaaccagcggtgggtgcgtggagaccttcaacaacccaaccgggatccagctcggcgtcaagtcgtggcccacgaacagaagccctttggggcggtcctcgcatgcattgactcgcgggtgccgcctgaactcctcttcgacaccggcctgggtgatcttttcgtgacacgtacgggaggtgaggcgatcggcccagtggtcactggttctgtcgagtttggacctctgaccagtggcactccgctcatcgtggtccttgggcatcagcgttgcggcgccgtcaaggcggcgtacacctcccttcgtgagggcaagccgctgcccggcaacctaccggcgatcgttacggccctccagccggcgtatgaacaggtagcctcagcggggagcgccgacccgatcgacgccatggcccgagcccaggccgagctgatcgcaaacgacctgcgctccaacccggaactagccccactcgtggcgaagcgggaccttgccgtggtcagtgcctactattccctcgataccggcgcggtggaagtcctcagtggcagaccctcctga>Frankia CcI Carbonic Anhydrase 1 of 488 basestgtccgtcaccgacgactacctgaccaacaacgccgcctacgcgaagaccttcgccgggccgcttccgctgccgccgtccaagcacatcgccgccgtcgcctgcatggacgcacggctcaacgtctacgcgatccttggcctgggcgacggcgaggctcacgtcatccgcaacgccggcggcgtcgtcaccgacgacgagatccgttccctcgcgatcagccagcgcctgctcggcacccgcgagatcatcctgatccaccacaccgactgcggcatgctgaccttcaccgacgacgattttaaacgctcgatccaggacgagaccgggatcaaaccagaatgggccgtggagtcgtttaccgacctggccgaagacatacgccagtcgattgcgcggatcaaggcgagcccgttcatcccgcataccgacgccatccgcggcttcatcttcgatgttgccaccggactgctcaccgaagtcgcgtga>xyz Frankia CcI3 Carbonic Anhydrase 2 of 618 basesgtggacaccgatgaccacaccgctgtcgaccccgttgccgatgtccatgcagacgatgtccatgcggacaccgtgcgccccgcggatacggtgagcccggtgagcggcgctgccacggcgaccgaactcctgctgagctacgctgcaggtcaccccgcccggcggcgggaggccgggctacctgccctgcccggcgcgcggccgcgcctgggcgtcgcggtggttgcgtgtatggacgtgcggatccaggtggaggccttgctcggtcttgtcgaaggtgacgcccacatcctgcgcaacgccggtggtgtcatcaccccggatgtggtccgctcgctcgccgtgagccagcacgtgctgggaacgacggagatcattcttttgcatcacaccgggtgtggtctcgaaaggatcaccgacgacgggttccgggaccagttggagtgcaagacgggcgttcgtcccgaatgggccgtgtattcctttcccgatgtcgaggaggacgtgcgcaagtccgtcagggtgctgcgttcgtcgccgttcctgcagtccaccacctcggtacgcgggttcgtctaccaggtggagaccggggcactggtcgaggttctgccgtagWe will now proceed to compare the translation product of the ORF of the gene with theoriginal protein product. 
Methanococcus produces the protein in reading frame 1 of the reverse strand of the DNA segment. It does not start with ATG; the first amino acid is L in place of M. Staphylococcus and Vibrio produce the protein in frame 1 of the forward direction. The same is observed in Frankia and Salinispora.
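The frame-by-frame comparison described here can be sketched in Python as follows. This is a minimal illustration under stated assumptions: it uses the standard genetic code, writes any codon containing an ambiguous base (such as n) as X, and the function names are hypothetical. It is not the online tool used for this report.

```python
from itertools import product

# Standard genetic code, built in the conventional TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna: str) -> str:
    """Return the reverse strand read 5' to 3' (other symbols are left as they are)."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str, frame: int = 0) -> str:
    """Translate dna starting at offset frame (0, 1 or 2); '*' marks a stop codon."""
    seq = dna.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna: str) -> dict:
    """Return the translations of the three forward and three reverse frames."""
    rev = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna, offset)
        frames[f"-{offset + 1}"] = translate(rev, offset)
    return frames


if __name__ == "__main__":
    # First 18 bases of the Methanococcus voltae segment listed earlier.
    for name, peptide in six_frames("ttaaattaactttttaat").items():
        print(name, peptide)
```

Under these assumptions, forward frame 1 of the first 18 bases of the Methanococcus voltae segment (ttaaattaactttttaat) translates to LN*LFN, which matches the start of the peptide listing for that organism.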
The gene product is typically labeled ‘orf’.
1) Methanococcus voltae A3-
2) Staphylococcus carnosus-
3) Vibrio cholerae-
4) The comparison of the E. coli gene product and protein is as follows-
For the rest, we will compare only 1 suspected protein and gene product, for consistency.
For Truepera-
From the above, we conclude two things-
1) The codon plots of the different gene ORFs from the same organism are the same except at some minor points.
2) The codon plots of the organisms confirm the suspicion raised while analyzing the peptide sequences: the choice of codons differs to suit the G-C content of the organism.
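The kind of counting that underlies a codon usage plot can be sketched in Python as below. This is an illustration only, assuming the standard genetic code and an ORF that is already in frame; the function names are hypothetical and the sketch does not reproduce the plotting tool used for this report. The GC3 value (the share of codons ending in G or C) is one simple way to express codon choice as a single number.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, built in the conventional TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_counts(orf: str) -> Counter:
    """Count in-frame codons; codons with ambiguous bases are skipped."""
    seq = orf.upper()
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return Counter(codon for codon in codons if codon in CODON_TABLE)


def gc3(orf: str) -> float:
    """Percentage of counted codons whose third base is G or C."""
    counts = codon_counts(orf)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no complete unambiguous codons found")
    gc_third = sum(n for codon, n in counts.items() if codon[2] in "GC")
    return 100.0 * gc_third / total


def synonymous_fractions(orf: str) -> dict:
    """For each codon seen, its share among the codons for the same amino acid."""
    counts = codon_counts(orf)
    per_amino_acid = defaultdict(int)
    for codon, n in counts.items():
        per_amino_acid[CODON_TABLE[codon]] += n
    return {codon: n / per_amino_acid[CODON_TABLE[codon]]
            for codon, n in counts.items()}


if __name__ == "__main__":
    # First 30 bases of the Vibrio cholerae gene and of the second Frankia gene.
    for name, dna in (("Vibrio", "atgaaaaagacaacgtgggtattagcgatg"),
                      ("Frankia 2", "gtggacaccgatgaccacaccgctgtcgac")):
        print(name, gc3(dna), dict(codon_counts(dna)))
```

Thirty bases are far too few for a meaningful comparison; the full ORFs would be passed in practice.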
Corrections-
We undertake this because we noticed that the gene products of Methanococcus voltae and Frankia were not starting with the amino acid Methionine.
Methanococcus voltae corrections-
The mistake seems to be in the database from which the sequence was downloaded. The DNA sequence had ‘ata’ instead of ‘atg’.
Frankia sp. CcI3 corrections-
The mistake again seems to have been in the sequence. The DNA sequence began 27 bp earlier, and the claimed start site of the protein actually coded for Valine.
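One check that may be worth running before attributing a non-ATG start to a sequence error: in bacteria, GTG and TTG are also commonly used as start codons, and a start codon is read as Methionine at initiation whatever it encodes elsewhere in the gene. The Python sketch below classifies the first codon of an ORF on that basis. The function name and the set of alternative start codons are assumptions of this sketch; it covers only the commonly observed ones and does not decide whether a given database entry is correct.

```python
# Commonly observed bacterial start codons (an assumption of this sketch;
# rarer initiators exist and are not listed here).
CANONICAL_START = "ATG"
ALTERNATIVE_STARTS = {"GTG", "TTG"}


def classify_start(orf: str) -> str:
    """Classify the first codon of orf as 'canonical', 'alternative' or 'non-start'."""
    if len(orf) < 3:
        raise ValueError("ORF must contain at least one codon")
    first = orf[:3].upper()
    if first == CANONICAL_START:
        return "canonical"
    if first in ALTERNATIVE_STARTS:
        return "alternative"
    return "non-start"


if __name__ == "__main__":
    # Opening bases of the Vibrio cholerae gene and of the second Frankia gene.
    for name, dna in (("Vibrio", "atgaaaaagaca"), ("Frankia 2", "gtggacaccgat")):
        print(name, dna[:3], classify_start(dna))
```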
Conclusion:
After the three analyses we carried out on the protein, the DNA and the ORF codons, we conclude the following-
1) Bacteria choose codons based on their G-C composition to obtain the same amino acid when building a protein. G-C rich codons get preference in G-C rich bacteria; conversely, A-T rich codons get preference in G-C poor bacteria.
2) If the same amino acid is not there, a synonymous amino acid is used, having the same or nearly the same chemical properties.
3) High G-C content bacteria often employ two different genes for the same purpose. The finding of two possible Carbonic Anhydrase genes in their genomes supports this statement.
4) Most bacteria use Zinc at the metal site, yet a small number of bacteria use Cadmium and other metals.
5) Even though the proteins are of varied length, one may look for Serine and Glycine on the peptide chain and see that this region is conserved in all the proteins. This is because the protein domains must be similar for all the anhydrases.