Introduction to Bioinformatics

  1. Introduction to Bioinformatics
     V. K. Singh, Information Officer, Centre for Bioinformatics, Banaras Hindu University
  2. What is Bioinformatics?
     “The analysis of biological information using computers and statistical techniques; the science of developing and utilizing computer databases and algorithms to accelerate and enhance biological research”
     www.niehs.nih.gov/dert/trc/glossary.htm
  3. What can bioinformatics offer to biologists?
  4. 1: Introduction
     Computational biology: the in silico genome revolution at the turn of the century.
  5. In the very beginning
     • Life was classified as plants and animals.
     • When bacteria were discovered, they were initially classified as plants.
     • Ernst Haeckel (1866) placed all unicellular organisms in a kingdom called Protista, separate from Plantae and Animalia.
  6. 1: Introduction
  7. When electron microscopes were developed, it was found that Protista in fact includes cells both with and without a nucleus. Fungi were also found to differ from plants, since they are heterotrophs (they do not synthesize their own food).
     Thus, life was classified into 5 kingdoms: Procaryotes, Protists, Fungi, Plants, Animals.
  8. Later, plants, animals, protists and fungi were collectively called the Eucarya domain, and the procaryotes were moved from a kingdom to the Bacteria domain.
     Domains: Eucarya, Bacteria
     Kingdoms: Fungi, Plants, Animals, Protists
     Even later, a new domain was discovered…
  9. rRNA was sequenced from a great number of organisms to study phylogeny.
  10. Revolutionizing the classification of life: the rRNA phylogenetic tree
  11. From sequence analysis alone, it was thus established that life is divided into 3 domains: Bacteria, Archaea, Eucarya.
  12. • 1866: Gregor Mendel, laws of inheritance, “gene”
      • 1953: Watson and Crick, discovery of the structure of DNA
      • 2003: Genome Project
  13. Sequencing of Genomes
  14. Genomic sequencing: shotgun sequencing
      A single sequencing run usually yields ~700 bp. How can we sequence a whole genome?
  15. Genomic sequencing: walking
      1. Design a primer
      2. Sequence
      3. Design a new primer
      4. Sequence
      5. …
      One has to design new primers every time, and to do so one has to wait for the previous sequencing results.
  16. Genomic sequencing: shotgun sequencing
      1. Break the DNA into small pieces
      2. Sequence each piece
      3. Assemble
      Figure text (reads flattened in extraction): GAGGAGACGAACACCCGTATACAGTCGACGACCCCGAGGAGACGAACACCCGTATACAGTCGACGTTTATATATAGTATACAGTCGACGTTTATATATAACCCCGAGGAGACGA
  17. Shotgun sequencing: why isn’t it a trivial task?
      1. By chance, some parts are not sequenced even once!
      Figure text (reads flattened in extraction): GAGGAGACGAACACCCGTATACAGTCGACGACCCCGAGGAGACGA ? GTATACAGTCGACGTTTATATATAGTATACAGTCGACGTTTATATATAACCCCGAGGAGACGA
  18. Shotgun sequencing: why isn’t it a trivial task?
      2. Some pieces do not align because of sequencing errors.
      Figure text (reads flattened in extraction): GAGGTGAGGAACACCCGTATACAGTCGACGACCCCGAGG?GA?GAACACCCGTATACAGTCGACGTTTATATATAACCCCGAGGAGACGA
  19. Shotgun sequencing: why isn’t it a trivial task?
      3. Repetitive sequences: satellite DNA.
      Figure text (reads flattened in extraction): GGGGGGGGGGGGGGGGGGGGGGGGGGGGACCCCGGGGGGGGGGGGG????GGGGGGGGGGGGGAGGGGGGGGGGGGGGGGGGGGGGAACCCCGGGGG
  20. A contig: a section of the genome that could be reliably assembled.
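The three shotgun steps (break, sequence, assemble) can be illustrated with a toy greedy assembler. This is only a sketch under strong assumptions: the reads are error-free, no read is contained inside another, and there are no long repeats, which are exactly the complications listed on the slides above. The function names, the `min_len` threshold and the three example reads are illustrative choices and are not taken from the slides.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap.

    Returns the list of contigs left when no pair overlaps by at least min_len.
    """
    contigs = list(reads)
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # an unsequenced gap: the remaining contigs cannot be joined
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Hypothetical reads cut from one short example sequence.
GENOME = "ACCCCGAGGAGACGAACACCCGTATACAGTCGACGTTTATATATA"
READS = [GENOME[21:45], GENOME[0:15], GENOME[5:35]]

print(greedy_assemble(READS))
```

When two groups of reads share no overlap, the loop stops and returns more than one contig, which mirrors the "some parts are not sequenced even once" case.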
  21. BIOINFORMATICS DATABASES
  22. What’s in a database?
      • Sequences: genes, proteins, etc.
      • Full genomes
      • Expression data
      • Structures
      • Annotation: information about genes/proteins
        - function
        - cellular location
        - chromosomal location
        - introns/exons
        - phenotypes, diseases
      • Publications
  23. NCBI and Entrez
      • NCBI hosts one of the largest and most comprehensive collections of databases. It belongs to the NIH (National Institutes of Health), the primary federal agency for conducting and supporting medical research in the USA.
      • Entrez is the search engine of NCBI.
      • Search for: genes, proteins, genomes, structures, diseases, publications, and more.
      http://www.ncbi.nlm.nih.gov
  24. PubMed: NCBI’s database of biomedical articles
      Yang X, Kurteva S, Ren X, Lee S, Sodroski J. “Subunit stoichiometry of human immunodeficiency virus type 1 envelope glycoprotein trimers during virus entry into host cells”, J Virol. 2006 May;80(9):4388-95.
  25. Use fields!
      Yang[AU] AND glycoprotein[TI] AND 2006[DP] AND J virol[TA]
      For the full list of field tags, go to Help -> Search Field Descriptions and Tags.
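A fielded query like the one above is just terms tagged in square brackets and joined by a Boolean operator. The helper below is a minimal sketch that only builds the query string; `pubmed_query` is a hypothetical name, and nothing here contacts NCBI.

```python
def pubmed_query(*terms, op="AND"):
    """Join (value, field_tag) pairs into one fielded query string."""
    return f" {op} ".join(f"{value}[{tag}]" for value, tag in terms)


query = pubmed_query(
    ("Yang", "AU"),          # author
    ("glycoprotein", "TI"),  # word in the title
    ("2006", "DP"),          # date of publication
    ("J virol", "TA"),       # journal title abbreviation
)
print(query)
```

The resulting string can be pasted into the PubMed search box; other tags from the Search Field Descriptions and Tags help page can be passed in the same way.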
  26. Example
      • Retrieve all publications in which the first author is Davidovich C and the last author is Yonath A.
  27. Using limits
      Retrieve the publications of Yonath A in the journals Nature and Proc Natl Acad Sci U S A from the last 5 years.
  28. Searching NCBI for the protein human CD4 (search demonstration)
  29. [no text]
  30. Using field descriptions, qualifiers, and Boolean operators
      • Cd4[GENE] AND human[ORGN]
        or
        Cd4[gene name] AND human[organism]
      • List of field codes: http://www.ncbi.nlm.nih.gov/entrez/query/static/help/Summary_Matrices.html#Search_Fields_and_Qualifiers
      • Boolean operators: AND, OR, NOT
      Note: do not use the field Protein name [PROT], only [GENE]!
  31. This time we search directly in the protein database.
  32. RefSeq
      • A subcollection of NCBI databases containing only non-redundant, highly annotated entries (genomic DNA, transcripts (RNA), and protein products).
  33. [no text]
  34. An explanation of GenBank records
  35. FASTA format
      >gi|10835167|ref|NP_000607.1| CD4 antigen precursor [Homo sapiens]
      MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIKILGNQGSFLTKGPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEVQLLVFGLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSGTWTCTVLQNQKKVEFKIDIVVLAFQKASSIVYKKEGEQVEFSFPLAFTVEKLTGSGELWWQAERASSSKSWITFDLKNKEVSVKRVTQDPKLQMGKKLPLHLTLPQALPQYAGSGNLTLALEAKTGKLHQEVNLVVMRATQLQKNLTCEVWGPTSPKLMLSLKLENKEAKVSKREKAVWVLNPEAGMWQCLLSDSGQVLLESNIKVLPTWSTPVQPMALIVLGGVAGLLLFIGLGIFFCVRCRHRRRQAERMSQIKRLLSEKKTCQCPHRFQKTCSPI
      The first line is the header (ID/accession, then description); the lines after it are the sequence.
      Save accession numbers for future use (it makes searching quicker). RefSeq accession number: NP_000607.1
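The header/sequence layout of a FASTA record can be read with a few lines of code. The parser below is a minimal sketch, not a full implementation of any library's reader; `parse_fasta` is a hypothetical name, and the example record is the CD4 header from the slide with the sequence cut short for brevity.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence lines may be wrapped
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Sequence deliberately truncated; only the first residues are shown.
EXAMPLE = """> gi|10835167|ref|NP_000607.1| CD4 antigen precursor [Homo sapiens]
MNRGVPFRHLLLVLQLALLPAATQGKK
VVLGKKGDTVEL
"""

records = parse_fasta(EXAMPLE)
header, sequence = records[0]
accession = header.split("|")[3]  # fourth pipe-separated field in this header style
print(accession, len(sequence))
```

Splitting the header on `|` recovers the RefSeq accession that the slide recommends saving.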
  36. Downloading
  37. Homology Search Using Sequence Alignment
  38. Before we begin…
      MVNLTSDEKTAVLALWNKVDVEDCGGEALGRLLVVYPWTQRFFE…
      || ||  ||||| ||| || |   ||||||||||||||||||||
      MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFE…
      ATGGTGAACCTGACCTCTGACGAGAAGACTGCCGTCCTTGCCCTGTGGAACAAGGTGGACGTGGAAGACTGTGGTGGTGAGGCCCTGGGCAGGTTTGTATGGAGGTTACAAGGCTGCTTAAGGAGGGAGGATGGAAGCTGGGCATGTGGAGACAGACCACCTCCTGGATTTATGACAGGAACTGATTGCTGTCTCCTGTGCTGCTTTCACCCCTCAGGCTGCTGGTCGTGTATCCCTGGACCCAGAGGTTCTTTGAAAGCTTTGGGGACTTGTCCACTCCTGCTGCTGTGTTCGCAAATGCTAAGGTAAAAGCCCATGGCAAGAAGGTGCTAACTTCCTTTGGTGAAGGTATGAATCACCTGGACAACCTCAAGGGCACCTTTGCTAAACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAATTTCAAGGTGAGTCAATATTCTTCTTCTTCCTTCTTTCTATGGTCAAGCTCATGTCATGGGAAAAGGACATAAGAGTCAGTTTCCAGTTCTCAATAGAAAAAAAAATTCTGTTTGCATCACTGTGGACTCCTTGGGACCATTCATTTCTTTCACCTGCTTTGCTTATAGTTATTGTTTCCTCTTTTTCCTTTTTCTCTTCTTCTTCATAAGTTTTTCTCTCTGTATTTTTTTAACACAATCTTTTAATTTTGTGCCTTTAAATTATTTTTAAGCTTTCTTCTTTTAATTACTACTCGTTTCCTTTCATTTCTATACTTTCTATCTAATCTTCTCCTTTCAAGAGAAGGAGTGGTTCACTACTACTTTGCTTGGGTGTAAAGAATAACAGCAATAGCTTAAATTCTGGCATAATGTGAATAGGGAGGACAATTTCTCATATAAGTTGAGGCTGATATTGGAGGATTTGCATTAGTAGTAGAGGTTACATCCAGTTACCGTCTTGCTCATAATTTGTGGGCACAACACAGGGCATATCTTGGAACAAGGCTAGAATATTCTGAATGCAAACTGGGGACCTGTGTTAACTATGTTCATGCCTGTTGTCTCTTCCTCTTCAGCTCCTGGGCAATATGCTGGTGGTTGTGCTGGCTCGCCACTTTGGCAAGGAATTCGACTGGCACATGCACGCTTGTTTTCAGAAGGTGGTGGCTGGTGTGGCTAATGCCCTGGCTCACAAGTACCATTGA
  39. What is sequence alignment?
      Alignment: comparing two (pairwise) or more (multiple) sequences by searching for a series of identical or similar characters in the sequences.
      MVNLTSDEKTAVLALWNKVDVEDCGGE
      || ||  ||||| ||| || |   |||
      MVHLTPEEKTAVNALWGKVNVDAVGGE
  40. Why sequence alignment?
      Predict characteristics of a protein: use the structure or function information on known proteins with similar sequences available in databases to predict the structure or function of an unknown protein.
      Assumption: similar sequences produce similar proteins.
  41. Local vs. global
      • Global alignment finds the best alignment across the whole of the two sequences.
      • Local alignment finds regions of high similarity in parts of the sequences.
      Global alignment forces alignment in regions which differ:
      ADLGAVFALCDRYFQ
      ||||     |||| |
      ADLGRTQN-CDRYYQ
      Local alignment concentrates on regions of high similarity:
      ADLG CDRYFQ
      |||| |||| |
      ADLG CDRYYQ
  42. Sequence evolution
      In the course of evolution, sequences changed from the ancestral sequence by random mutations. Three types of changes:
      1. Insertion: an insertion of one or several letters into the sequence. AAGA -> AAGTA
  43. Sequence evolution
      In the course of evolution, sequences changed from the ancestral sequence by random mutations. Three types of changes:
      1. Insertion: an insertion of one or several letters into the sequence. AAGA -> AAGTA
      2. Deletion: a deletion of one or more letters from the sequence. AAGA -> AGA
  44. Evolutionary changes in sequences
      In the course of evolution, sequences changed from the ancestral sequence by random mutations. Three types of mutations:
      1. Insertion: an insertion of one or several letters into the sequence. AAGA -> AAGTA
      2. Deletion: a deletion of one or more letters from the sequence. AAGA -> AGA
      3. Substitution: a replacement of one (or more) sequence letters by another. AAGA -> AACA
      Insertion + Deletion -> Indel
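The three mutation types can be written as tiny string operations, which makes the slide examples (AAGA -> AAGTA, AAGA -> AGA, AAGA -> AACA) easy to reproduce. This is a minimal sketch; the function names and the 0-based position argument are illustrative choices.

```python
def insert(seq, pos, letters):
    """Insert one or more letters before 0-based position pos."""
    return seq[:pos] + letters + seq[pos:]


def delete(seq, pos, length=1):
    """Delete `length` letters starting at 0-based position pos."""
    return seq[:pos] + seq[pos + length:]


def substitute(seq, pos, letter):
    """Replace the letter at 0-based position pos."""
    return seq[:pos] + letter + seq[pos + 1:]


ancestor = "AAGA"
print(insert(ancestor, 3, "T"))      # insertion example from the slide
print(delete(ancestor, 0))           # deletion example from the slide
print(substitute(ancestor, 2, "C"))  # substitution example from the slide
```

From the two present-day sequences alone, an insertion in one lineage cannot be told apart from a deletion in the other, which is why alignments speak of indels.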
  45. Sequence alignment
      Two sequences:
      AAGCTGAATTCGAA
      AGGCTCATTTCTGA
      One possible alignment:
      AAGCTGAATT-C-GAA
      AGGCT-CATTTCTGA-
      This alignment includes:
      • 2 mismatches
      • 4 indels (gaps)
      • 10 perfect matches
  46. Choosing an alignment
      • Many different alignments are possible, for example:
      A-AGCTGAATTC--GAA
      AG-GCTCA-TTTCTGA-
      or:
      AAGCTGAATT-C-GAA
      AGGCT-CATTTCTGA-
      Which alignment is better?
  47. Scoring an alignment
      Example: a naïve scoring system
      • Match: +1
      • Mismatch: -2
      • Indel: -1
      AAGCTGAATT-C-GAA
      AGGCT-CATTTCTGA-
      Score = (+1)x10 + (-2)x2 + (-1)x4 = 2
      A-AGCTGAATTC--GAA
      AG-GCTCA-TTTCTGA-
      Score = (+1)x9 + (-2)x2 + (-1)x6 = -1
      Higher score -> better alignment
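The naïve scoring system can be checked column by column. The sketch below assumes the two alignment rows are split as 16 + 16 and 17 + 17 characters, which is how the flattened slide text divides evenly; `score_alignment` is a hypothetical helper name.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-2, indel=-1):
    """Sum per-column scores for two equal-length alignment rows ('-' marks a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += indel
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


first = score_alignment("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-")
second = score_alignment("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-")
print(first, second)  # 2 -1
```

Changing the three keyword arguments is enough to see how a different scoring system can change which alignment wins.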
  48. Scoring system
      • Different scoring systems can produce different optimal alignments.
      • Scoring systems implicitly represent a particular theory of similarity/dissimilarity between sequence characters: evolution based, or physico-chemical properties based.
        - Some mismatches are more plausible than others:
          • Transition vs. transversion
          • Lys->Arg ≠ Lys->Cys
        - Gap extension vs. gap opening
  49. Substitution matrices
      • Nucleic acids:
        - Transition-transversion
      • Amino acids:
        - Evolution (empirical data) based: PAM, BLOSUM
        - Physico-chemical properties based: Grantham, McLachlan
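For nucleic acids, a transition/transversion scheme scores a purine-purine or pyrimidine-pyrimidine change (transition) more mildly than a purine-pyrimidine change (transversion). The sketch below shows the idea; the numeric scores are arbitrary assumptions chosen for illustration, not values from any published matrix.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify(a, b):
    """Classify a pair of DNA letters as 'match', 'transition' or 'transversion'."""
    a, b = a.upper(), b.upper()
    if not {a, b} <= PURINES | PYRIMIDINES:
        raise ValueError("expected letters from A, C, G, T")
    if a == b:
        return "match"
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return "transition"
    return "transversion"


# Illustrative scores only: transitions are penalised less than transversions.
SCORES = {"match": 1, "transition": -1, "transversion": -2}


def substitution_score(a, b):
    return SCORES[classify(a, b)]


print(classify("A", "G"), classify("A", "C"))
```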
  50. Web server for pairwise alignment
  51. BLAST 2 Sequences (bl2seq) at NCBI
      Produces the local alignment of two given sequences using the BLAST (Basic Local Alignment Search Tool) engine.
      • Does not use an exact algorithm but a heuristic.
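For contrast with BLAST's heuristic, an exact local alignment score can be computed by dynamic programming (the Smith-Waterman algorithm, which is not named on the slide). The sketch below returns only the best score, uses a single linear gap penalty, and borrows the naïve +1/-2/-1 scoring as an assumption; it is not how BLAST itself works.

```python
def smith_waterman_score(s, t, match=1, mismatch=-2, gap=-1):
    """Best local alignment score of s and t (exact, O(len(s) * len(t)) time)."""
    prev = [0] * (len(t) + 1)
    best = 0
    for i in range(1, len(s) + 1):
        curr = [0]
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTTACGTAAA", "GGACGTCC"))
```

The exact method fills a table with one cell per pair of positions, which is too slow for searching a whole database and is the reason a heuristic is used there.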
  52. Back to NCBI
  53. BLAST: bl2seq
  54. bl2seq: query
      • blastn: nucleotide vs. nucleotide
      • blastp: protein vs. protein
  55. bl2seq results
  56. bl2seq results (labels on the figure): match, dissimilarity, gaps, similarity, low complexity
  57. bl2seq results:
      • Bit score: a score for the alignment according to the number of similarities, identities, etc.
      • Expected score (E-value): the number of alignments with an equal or better score that one can “expect” to see by chance when searching a database of a particular size. The closer the E-value is to zero, the greater the confidence that the hit is real.
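The link between the bit score and the E-value can be made concrete with the standard Karlin-Altschul relation E = m * n * 2^(-S'), where S' is the bit score and m and n are the (effective) query and database lengths. The formula is not given on the slide, and BLAST applies length corrections that this sketch ignores, so treat the numbers as rough.

```python
def evalue_from_bits(bit_score, query_len, db_len):
    """Approximate E-value: search space size times 2 to the power of minus the bit score."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Same alignment score, larger database -> more chance hits expected.
print(evalue_from_bits(40, 300, 1_000_000))
print(evalue_from_bits(40, 300, 100_000_000))
```

Each additional bit of score halves the expected number of chance hits, while a database that is 100 times larger multiplies it by 100.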
  58. BLAST programs
      Query: DNA or protein
      Database: DNA or protein
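The query/database grid maps onto the standard BLAST program names. The lookup below is a small sketch; only blastn and blastp appear on the earlier slide, and the translated-search programs (blastx, tblastn, tblastx) are added here from general knowledge of the BLAST suite.

```python
# (query type, database type) -> programs that can run that search
BLAST_PROGRAMS = {
    ("dna", "dna"): ["blastn", "tblastx"],  # tblastx translates both sides
    ("protein", "protein"): ["blastp"],
    ("dna", "protein"): ["blastx"],         # query translated in six frames
    ("protein", "dna"): ["tblastn"],        # database translated in six frames
}


def choose_program(query_type, db_type):
    """Return the candidate BLAST programs for a query/database combination."""
    return BLAST_PROGRAMS[(query_type.lower(), db_type.lower())]


print(choose_program("DNA", "protein"))
```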
  59. BLAST: blastp
  60. blastp: results
  61. blastp: results (cont.)
  62. blastp: acquiring sequences
  63. blastp: acquiring sequences (cont.)
  64. Multiple Sequence Alignment (MSA) and Phylogeny
  65. One of the options for getting a multiple-sequence FASTA file
  66. One of the options for getting a multiple-sequence FASTA file
  67. Input: multiple-sequence FASTA file
      >gi|21536452|ref|NP_002762.2| mesotrypsin preproprotein [Homo sapiens]
      MNPFLILAFVGAAVAVPFDDDDKIVGGYTCEENSLPYQVSLNSGSHFCGGSLISEQWVVSAAHCYKTRIQVRLGEHNIKVLEGNEQFINAAKIIRHPKYNRDTLDNDIMLIKLSSPAVINARVSTISLPTAPPAAGTECLISGWGNTLSFGADYPDELKCLDAPVLTQAECKASYPGKITNSMFCVGFLEGGKDSCQRDSGGPVVCNGQLQGVVSWGHGCAWKNRPGVYTKVYNYVDWIKDTIAANS
      >gi|114051746|ref|NP_001040585.1| protease, serine, 2 [Macaca mulatta]
      MNPLLILAFVGVAVAAPFDDDDKIVGGYTCEENSVPYQVSLNSGYHFCGGSLINEQWVVSAAHCYKTRIQVRLGEHNIEVLEGTEQFINAAKIIRHPDYDRKTLNNDILLIKLSSPAVINARVSTISLPTAPPAAGAEALISGWGNTLSSGADYPDELQCLEAPVLSQAECEASYPGKITSNMFCVGFLEGGKDSCQGDSGGPVVSNGQLQGIVSWGYGCAQKNRPGVYTKVYNYVDWIRDTIAANS
      >gi|6755891|ref|NP_035775.1| mesotrypsin [Mus musculus]
      MNALLILALVGAAVAFPVDDDDKIVGGYTCQENSVPYQVSLNSGYHFCGGSLINDQWVVSAAHCYKTRIQVRLGEHNINVLEGNEQFVNAAKIIKHPNFNRKTLNNDIMLLKLSSPVTLNARVATVALPSSCAPAGTQCLISGWGNTLSFGVSEPDLLQCLDAPLLPQADCEASYPGKITGNMVCAGFLEGGKDSCQGDSGGPVVCNRELQGIVSWGYGCALPDNPGVYTKVCNYVDWIQDTIAAN
      >gi|6981422|ref|NP_036861.1| protease, serine, 2 [Rattus norvegicus]
      MRALLFLALVGAAVAFPVDDDDKIVGGYTCQENSVPYQVSLNSGYHFCGGSLINDQWVVSAAHCYKSRIQVRLGEHNINVLEGNEQFVNAAKIIKHPNFDRKTLNNDIMLIKLSSPVKLNARVATVALPSSCAPAGTQCLISGWGNTLSSGVNEPDLLQCLDAPLLPQADCEASYPGKITDNMVCVGFLEGGKDSCQGDSGGPVVCNGELQGIVSWGYGCALPDNPGVYTKVCNYVDWIQDTIAAN
      >gi|27819626|ref|NP_777115.1| pancreatic anionic trypsinogen [Bos taurus]
      MHPLLILAFVGAAVAFPSDDDDKIVGGYTCAENSVPYQVSLNAGYHFCGGSLINDQWVVSAAHCYQYHIQVRLGEYNIDVLEGGEQFIDASKIIRHPKYSSWTLDNDILLIKLSTPAVINARVSTLALPSACASGSTECL
      . . .
  68. Input: multiple-sequence FASTA file
  69. Step 1: Load the sequences
  70. Sequences and conservation view
  71. Step 2: Perform alignment
  72. Sequences and conservation view
  73. Sequences and conservation view
  74. Step 3: Create tree
  75. Step 4: NJPlot
  76. How robust is our tree?
      • We need some statistical way to estimate the confidence in the tree topology.
      • But we don’t know anything about the distribution or parameters of the tree topology.
      • The only data source we have is our data (the MSA).
      • So we must rely on our own resources: “pull yourself up by your own bootstraps”.
  77. Bootstrap
      1. Resample the K alignment positions (columns) with replacement, n times.
      Original alignment (columns 1 2 3 4 5 … K):
        1: ATCTG…A
        2: ATCTG…C
        3: ACTTA…C
        N: ACCTA…T
      Resampled data set (columns 1 1 2 4 4 … K):
        1: AATTT…T
        2: AATTT…G
        3: AACTT…T
        N: AACTT…T
      Resampled data set (columns 4 7 7 8 9 … K):
        1: TTTAT…T
        2: TAACC…G
        3: TAACC…T
        N: TGGGA…T
      Resampled data set (columns 1 5 5 7 8 … K):
        1: AGGTA…T
        2: AGGAC…G
        3: AAAAC…A
        N: AAAGG…C
  78. Bootstrap
      2. Reconstruct a tree from each resampled data set, using the same method that was used to reconstruct the original tree.
      [Figure: one tree with leaves Sp1, Sp2, Sp3, Sp4 for each of the three resampled data sets]
  79. Bootstrap
      3. For each node in the original tree, count the number of times it appeared in the bootstrap trees.
      [Figure: trees with leaves Sp1, Sp2, Sp3, Sp4; node support values shown: 67%, 100%]
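Steps 1 and 3 of the bootstrap can be sketched without committing to any particular tree-building method. The code below resamples alignment columns with replacement and, given the clades found in each bootstrap tree, reports how often a clade of interest appears. Tree reconstruction (step 2) is deliberately left out; the function names, the seed, and the toy clade sets are assumptions for illustration, and the four rows are the first five columns of the alignment shown on the slide.

```python
import random


def bootstrap_alignment(msa, rng):
    """Resample the columns of an alignment with replacement (same width as the input)."""
    n_cols = len(msa[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(row[i] for i in picks) for row in msa]


def bootstrap_replicates(msa, n, seed=0):
    """Return n resampled alignments; the seed makes the sketch reproducible."""
    rng = random.Random(seed)
    return [bootstrap_alignment(msa, rng) for _ in range(n)]


def support(clade, replicate_clades):
    """Fraction of bootstrap trees (each given as a set of clades) that contain the clade."""
    hits = sum(1 for clades in replicate_clades if clade in clades)
    return hits / len(replicate_clades)


MSA = ["ATCTG", "ATCTG", "ACTTA", "ACCTA"]
replicates = bootstrap_replicates(MSA, n=3)

# Hypothetical clades recovered from three bootstrap trees.
trees = [
    {frozenset({"Sp1", "Sp2"}), frozenset({"Sp3", "Sp4"})},
    {frozenset({"Sp1", "Sp2"}), frozenset({"Sp3", "Sp4"})},
    {frozenset({"Sp1", "Sp3"}), frozenset({"Sp2", "Sp4"})},
]
print(round(100 * support(frozenset({"Sp1", "Sp2"}), trees)))  # percent support
```

In practice n is in the hundreds or thousands, and each replicate is passed to the same tree-building method used for the original alignment.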
  80. Step 3.5: Bootstrap
  81. Bootstrap values on NJPlot
      Note: ClustalX saves trees as .ph files; trees with bootstrap values are saved as .phb files. You might have to reopen the tree…
  82. Protein information resources
      • Swissprot
      • PDB
  83. Swissprot
      • A protein sequence database which strives to provide a high level of annotation regarding:
        * the function of a protein
        * domain structure
        * post-translational modifications
        * variants
      • One entry for each protein
      http://www.expasy.ch/sprot
  84. [no text]
  85. PDB: Protein Data Bank
      • Main database of 3D structures of macromolecules
      • Includes ~61,000 entries (proteins, nucleic acids, complex assemblies)
      • Is highly redundant
      http://www.rcsb.org
  86. Human CD4 in complex with HIV gp120 (PDB ID 1G9M)
  87. What do bioinformaticians study?
      • Bioinformatics today is part of almost all molecular biology research.
      • Just a few examples…
  88. Example 1
      • Compare proteins with similar sequences (for instance, kinases) and understand what the similarities and differences mean.
  89. Example 2
      • Look at the genome and predict where genes are (promoters; transcription factor binding sites; introns; exons).
  90. Example 3
      • Predict the 3-dimensional structure of a protein from its primary sequence.
      Ab initio prediction is extremely difficult!
  91. Example 4
      • Correlate gene expression with disease.
      A gene chip quantifies gene expression in different tissues under different conditions. It may be used for personalized medicine.
  92. Role of the Centre for Bioinformatics in the School of Biotechnology, BHU
  93. Online Journal of Bioinformatics 8 (1): 75-83, 2007
      In silico Cis-regulatory Elements Analysis of Seed Storage Protein Promoters Cloned from Different Cultivars of Wheat, Rice and Oat
      Yadav D (1), Singh VK (1), Singh NK (2)
      (1) Department of Molecular Biology and Genetic Engineering, College of Basic Sciences and Humanities, G.B. Pant University of Agriculture and Technology, Pantnagar (Uttarakhand); (2) National Research Center on Plant Biotechnology, Indian Agriculture Research Institute, New Delhi 110012
      ABSTRACT
      A total of 24 promoter sequences with assigned accession number EF393165 to EF393188 and representing major seed storage proteins of wheat namely High molecular weight glutenin subunit (HMW-GS), low molecular weight glutenin subunits (LMW-GS) alpha/beta gliadins, triticin along with rice glutelins and oat 12S globulins were cloned from indigenous cultivars of wheat, rice and oat and was subjected to in silico analysis using bioinformatic softwares for the presence of different cis-regulatory motifs. The phylogeny studies based on the multiple sequence alignment of these promoters revealed four distinct clusters showing major group of seed storage promoters. The presence of additional motifs like RY repeats, ABRE, AC-11, CAAT box, LTR, UTR, CCGTCC box, G box, GARE, MBS along with the common motifs present in seed storage promoters like Prolamin-box, TATA, CAAT provides a better option for multifarious uses.
      Keywords: Seed storage protein promoters, Cis-regulatory Elements, In silico.
      Table: Seed storage protein promoters (accession numbers, cultivars and lengths are listed in the order given on the slide)
      • HMW Glutenin (Triticum aestivum)
        Accession numbers: EF396165, EF396184, EF396166, EF396167, EF396168, EF396169, EF396170, EF396171, EF396172, EF396173
        Cultivars: UP-262, UP-262, UP-262, UP-262, UP-262, UP-301, UP-301, UP-301, UP-301, UP-301
        Length (bp): 402, 487, 397, 412, 385, 385, 393, 398, 392, 393
      • LMW Glutenin (Triticum aestivum)
        Accession number: EF396187
        Cultivar: HD-2329
        Length (bp): 551
      • α/β gliadin (Triticum aestivum)
        Accession numbers: EF396174, EF396175, EF396177, EF396178, EF396176, EF396182
        Cultivars: Kalyansona, Kalyansona, UP-262, UP-262, UP-262, UP-301
        Length (bp): 520, 564, 591, 521, 563, 548
      • Triticin (Triticum aestivum)
        Accession numbers: EF396181, EF396183, EF396185, EF396186
        Cultivars: HD-2329, HD-2329, HD-2329, Kalyansona
        Length (bp): 428, 370, 452, 343
      • 12S Globulin (Avena sativa)
        Accession number: EF396179
        Cultivar: UPO-94
        Length (bp): 549
      • Glutelins (Oryza sativa)
        Accession numbers: EF396180, EF396188
        Cultivars: PantDhan-12, Pusa Basmati
        Length (bp): 562, 487
  94. [Figure labels: 200 bp, 172 bp, Motif-1, Motif-2, Motif-3]
  95. Online Journal of Bioinformatics 8 (1): 75-83, 2007
      In silico Cis-regulatory Elements Analysis of Seed Storage Protein Promoters Cloned from Different Cultivars of Wheat, Rice and Oat
  96. http://www.insilicogenomics.in/cry-bt-search.asp
  97. CERCOSPORA LEAF SPOT DISEASE OF PIGEONPEA AND ITS MANAGEMENT
