Need & Emergence of the Field
                 Speaker
                 Shashi Shekhar
                 Head of computational Section
                 Biowits Life Sciences
   The marriage between computer science and
    molecular biology
    ◦ The algorithms and techniques of computer science
      are being used to solve the problems faced by
      molecular biologists

   ‘Information technology applied to the
    management and analysis of biological data’
    ◦ Storage and Analysis are two of the important
      functions – bioinformaticians build tools for each.
[Figure: Bioinformatics at the intersection of Biology, Chemistry, Computer Science and Statistics]
   The need for bioinformatics has arisen from the recent
    explosion of publicly available genomic information,
    such as that resulting from the Human Genome Project.
   To gain a better understanding of gene analysis,
    taxonomy, and evolution.
   To work efficiently on rational drug design and
    reduce the time that manual drug development
    takes.
   To uncover the wealth of Biological information hidden
    in the mass of sequence, structure, literature and
    biological data.
   It is being used now and in the foreseeable future in the
    areas of molecular medicine.
   It has environmental benefits, such as identifying
    bacteria that clean up waste.
   In agriculture, it can be used to produce high yield, low
    maintenance crops.
   Molecular Medicine
   Gene Therapy
   Drug Development
   Microbial genome applications
   Crop Improvement
   Forensic Analysis of Microbes
   Biotechnology
   Evolutionary Studies
   Bio-Weapon Creation
   In Experimental Molecular Biology
   In Genetics and Genomics
   In generating Biological Data
   Analysis of gene and protein expression
   Comparison of genomic data
   Understanding of evolutionary relationships
   Understanding biological pathways and networks in
    Systems Biology
   In Simulation & Modeling of DNA, RNA & Protein
e.g. homology searches
           Bioinformatics lecture
                  March 5, 2002
Organisation of knowledge (sequences, structures, functional data)
   Prediction of structure from sequence
    ◦ secondary structure
    ◦ homology modelling, threading
    ◦ ab initio 3D prediction
   Analysis of 3D structure
    ◦   structure comparison/ alignment
    ◦   prediction of function from structure
    ◦   molecular mechanics/ molecular dynamics
    ◦   prediction of molecular interactions, docking
   Structure databases (RCSB)
   Sequence Similarity
   Tools used for sequence similarity searching
   Their uses in biology and for us
   Databases
   Different types of databases
   One could align the sequence so that many
    corresponding residues match.
   Strong similarity between two sequences is a strong
    argument for their homology.
   Homology: Two (or more) sequences have a common
    ancestor.
   Similarity: Two (or more) sequences are similar by some
    criterion; the term does not refer to any historical process.
   Alignment is used to find out how proteins or genes are
    related, that is, whether they have a common ancestor or not.
   Mutations bring about changes, or divergence, in the
    sequences.
   Alignment can also reveal the part of the sequence that is
    crucial for the functioning of the gene or protein.
   Optimal Alignment: The alignment that is the best,
    given a defined set of rules and parameter values for
    comparing different alignments.
   Global Alignment: An alignment that assumes that the
    two proteins are basically similar over the entire length
    of one another. The alignment attempts to match them
    to each other from end to end.
   Local Alignment: An alignment that searches for
    segments of the two sequences that match well. There
    is no attempt to force entire sequences into an
    alignment, just those parts that appear to have good
    similarity.

   Gaps & Insertions: In an alignment, one may achieve much
    better correspondence between two sequences if one allows a
    gap to be introduced in one sequence. Equivalently, one
    could allow an insertion in the other sequence. Biologically
    this corresponds to a mutation event (an insertion or deletion).
   Substitution matrix: A substitution matrix describes how likely
    two residue types are to mutate to each other over evolutionary
    time. This is used to estimate how well two residues of given
    types would match if they were aligned in a sequence
    alignment.
   Gap Penalty: The gap penalty is used to help decide whether
    or not to accept a gap or insertion in an alignment when it is
    possible to achieve a good alignment residue to residue at
    some other neighboring point in the sequence.
   Similarity indicates conserved function
   Human and mouse genes are more than 80% similar at
    sequence level
   But these genes are a small fraction of the genome
   Most sequences in the genome are not recognizably similar
   Comparing sequences helps us understand function
    ◦ Locate a similar gene in another species to understand your new
      gene
   Match score:           +1
   Mismatch score:        +0
   Gap penalty:             –1
    ACGTCTGATACGCCGTATAGTCTATCT
        ||||| |||   || ||||||||
    ----CTGATTCGC---ATCGTCTATCT
   Matches:     18 × (+1)
   Mismatches: 2 × 0
   Gaps:         7 × (– 1)
                                   Score = +11
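The arithmetic above can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the function name `score_alignment` is illustrative, and it assumes the two rows are already aligned to equal length with `-` marking a gap.

```python
def score_alignment(top, bottom, match=1, mismatch=0, gap=-1):
    """Score two pre-aligned rows column by column; '-' marks a gap."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap          # every gap column costs the same (linear penalty)
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


top = "ACGTCTGATACGCCGTATAGTCTATCT"
bottom = "----CTGATTCGC---ATCGTCTATCT"
print(score_alignment(top, bottom))  # 18 matches, 2 mismatches, 7 gaps -> 11
```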
   We want to find alignments that are evolutionarily likely.
   Which of the following alignments seems more likely to
    you?
    ACGTCTGATACGCCGTATAGTCTATCT
    ACGTCTGAT-------ATAGTCTATCT    
    ACGTCTGATACGCCGTATAGTCTATCT
    AC-T-TGA--CG-CGT-TA-TCTATCT     
   We can achieve this by penalizing more for a new gap,
    than for extending an existing gap
   Match/mismatch score:       +1/+0
   Origination/length penalty:  –2/–1
    ACGTCTGATACGCCGTATAGTCTATCT
        ||||| |||   || ||||||||
    ----CTGATTCGC---ATCGTCTATCT
   Matches:     18 × (+1)
   Mismatches: 2 × 0
   Origination: 2 × (–2)
   Length:      7 × (–1)
                                   Score = +7
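The origination/length scheme can be sketched in the same way. The function name `affine_score` is illustrative and the rows are assumed to be pre-aligned with `-` for gaps. Each new run of gaps pays the origination penalty once, and every gap column pays the length penalty.

```python
def affine_score(top, bottom, match=1, mismatch=0, gap_open=-2, gap_extend=-1):
    """Score pre-aligned rows with an origination plus per-column gap penalty."""
    total = 0
    open_row = None  # row holding the currently open gap: "top", "bottom" or None
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            if row != open_row:
                total += gap_open   # a new gap starts here
            total += gap_extend     # charged for every gap column
            open_row = row
        else:
            total += match if a == b else mismatch
            open_row = None
    return total


ref = "ACGTCTGATACGCCGTATAGTCTATCT"
print(affine_score(ref, "----CTGATTCGC---ATCGTCTATCT"))  # 18 - 2*2 - 7 = 7
print(affine_score(ref, "ACGTCTGAT-------ATAGTCTATCT"))  # one long gap: 20 - 2 - 7 = 11
print(affine_score(ref, "AC-T-TGA--CG-CGT-TA-TCTATCT"))  # six short gaps: 20 - 12 - 7 = 1
```

With a plain per-column gap penalty of -1, the last two alignments would both score 13. The origination penalty is what separates the single long gap from the scattered short ones.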
   Alignment scoring and substitution matrices
   Aligning two sequences
    ◦ Dotplots
    ◦ The dynamic programming algorithm
    ◦ Significance of the results
   Heuristic methods
    ◦ FASTA
    ◦ BLAST
    ◦ Interpreting the output
   Examples:
   Staden: simple text file, lines <= 80 characters
   FASTA: simple text file, lines <= 80 characters, one line
    header marked by ">"
   GCG: structured format with header and formatted
    sequence

   Sequence format descriptions e.g. on
    http://www.infobiogen.fr/doc/tutoriel/formats.html
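As a small illustration of the FASTA format described above (a one-line header marked by ">" followed by sequence lines), here is a minimal reader. It is a sketch only: the function name `parse_fasta` and the example records are hypothetical, and real files have variations that this does not handle.

```python
def parse_fasta(text):
    """Return {header: sequence} for FASTA-formatted text."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            header = line[1:]             # header line, without the ">" marker
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


example = """>seq1 hypothetical example
ACGTCTGATACG
CCGTATAG
>seq2 hypothetical example
CTGATTCGC
"""
for name, seq in parse_fasta(example).items():
    print(name, len(seq))
```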
   Local sequence comparison:

   assumption of evolution by point mutations
    ◦ amino acid replacement (by base replacement)
    ◦ amino acid insertion
    ◦ amino acid deletion

   scores:
    ◦ positive for identical or similar
    ◦ negative for different
    ◦ negative for insertion in one of the two sequences
   Simple comparison without alignment

   Similarities between sequences show up in 2D diagram
[Dotplot figure: the main diagonal (i = j) marks identity; off-diagonal lines show similarity of the sequence with other parts of itself]
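A dotplot can be produced with a few lines of code. The sketch below is illustrative only (the function name `dotplot` and the `window` parameter are assumptions). It marks a cell when the windows starting at those two positions are identical, so comparing a sequence with itself gives an unbroken main diagonal, and repeats show up as off-diagonal marks.

```python
def dotplot(seq_x, seq_y, window=1):
    """Return text rows: '*' where a window of seq_y equals a window of seq_x."""
    rows = []
    for i in range(len(seq_y) - window + 1):
        row = ""
        for j in range(len(seq_x) - window + 1):
            same = seq_y[i:i + window] == seq_x[j:j + window]
            row += "*" if same else "."
        rows.append(row)
    return rows


# Comparing a sequence with itself: the diagonal i = j is always marked.
for line in dotplot("GATTACA", "GATTACA"):
    print(line)
```

Increasing `window` removes many of the isolated chance matches, at the cost of missing short similarities.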
   The 1st alignment: highly significant
   The 2nd: plausible
   The 3rd: spurious



   Distinguish by alignment score
   Similarities increase score, mismatches decrease score
    (substitution matrix)
   Gaps decrease score (gap penalties)
   Substitution matrix weights replacement of one residue
    by another:
    ◦ Similar -> high score (positive)
    ◦ Different -> low score (negative)
   Simplest is identity matrix (e.g. for nucleic acids)
                       A     C       G      T
               A       1     0       0      0
               C       0     1       0      0
               G       0     0       1      0
               T       0     0       0      1
   PAM matrix series (PAM1 ... PAM250):
    ◦ Derived from alignment of very similar sequences
    ◦ PAM1 = mutation events that change 1% of AA
    ◦ PAM2, PAM3, ... extrapolated by matrix multiplication
      e.g.: PAM2 = PAM1*PAM1; PAM3 = PAM2 * PAM1 etc

   Problems with PAM matrices:
    ◦ Incorrect modelling of long-term substitutions, since
      the conservative mutations seen in very similar sequences
      are dominated by single-nucleotide changes
    ◦ e.g.: L <–> I, L <–> V, Y <–> F;
      over long times, any amino acid change can occur
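The extrapolation PAM2 = PAM1*PAM1 is ordinary matrix multiplication of mutation-probability matrices. The sketch below shows the mechanics on a hypothetical 3-letter alphabet. The values in `toy_pam1` are invented for illustration and are not real PAM data, and the function names are assumptions. Each row gives the probability that a residue stays the same or changes after one PAM unit.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_n(pam1, n):
    """Extrapolate PAM-n by multiplying PAM1 with itself n - 1 times."""
    result = pam1
    for _ in range(n - 1):
        result = mat_mul(result, pam1)
    return result


# Hypothetical toy matrix: 99% chance a residue is unchanged, rows sum to 1.
toy_pam1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

pam2 = pam_n(toy_pam1, 2)
print(round(pam2[0][0], 5))              # chance of being unchanged after 2 PAM units
print(round(pam_n(toy_pam1, 250)[0][0], 3))
```

The unchanged fraction after 2 units is slightly above 0.99 squared, because a residue can mutate and then mutate back.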
positive and negative values
identity score depends on residue
   BLOSUM series (BLOSUM50, BLOSUM62, ...)
   Derived from alignments of distantly related sequences
   BLOCKS database:
    ◦ ungapped multiple alignments of protein families
      at a given identity

   BLOSUM50 better for gapped alignments
   BLOSUM62 better for ungapped alignments
Blosum62 substitution matrix
   Significance of alignment:
   Depends critically on gap penalty

   Need to adjust to given sequence

   Gap penalties influenced by knowledge of structure
    etc.

   Simple rules when nothing is known (linear or affine)
   Dynamic programming = build up optimal alignment
    using previous solutions for optimal alignments of
    subsequences.
   Dynamic programming relies on a principle of
    optimality. This principle states that in an optimal
    sequence of decisions or choices, each subsequence
    must also be optimal.
   The principle can be restated as follows: the optimal
    solution to a problem is a combination of optimal
    solutions to some of its sub-problems.
   Construct a two-dimensional matrix whose axes are the
    two sequences to be compared.
   The scores are calculated one row at a time. This starts
    with the first row of one sequence, which is used to
    scan through the entire length of the other sequence,
    followed by scanning of the second row.
   The scanning of the second row takes into account the
    scores already obtained in the first round. The best
    score is put into the bottom right corner of an
    intermediate matrix.
   This process is iterated until values for all the cells are
    filled.
   The results are traced back through the matrix in
    reverse order from the lower right-hand corner of the
    matrix toward the origin of the matrix in the upper left-
    hand corner.
   The best matching path is the one that has the
    maximum total score.
   If two or more paths reach the same highest score, one
    is chosen arbitrarily to represent the best alignment.
   The path can also move horizontally or vertically at a
    certain point, which corresponds to introduction of a
    gap or an insertion or deletion for one of the two
    sequences.
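The matrix fill and traceback described above can be sketched as follows. This is a minimal global alignment with a linear gap penalty, not a production implementation. The function name and the default scores (match +1, mismatch 0, gap -1) are illustrative choices, and ties are broken in a fixed order (diagonal first).

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # first column: a aligned against gaps only
        score[i][0] = i * gap
    for j in range(1, cols):          # first row: b aligned against gaps only
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill row by row; each cell reuses the three already-computed neighbours.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the lower right-hand corner to the upper left-hand corner.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1   # vertical move
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1   # horizontal move
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


print(needleman_wunsch("ACGT", "AGT"))
```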
   Global alignment (ends aligned)
    ◦ Needleman & Wunsch, 1970

   Local alignment (subsequences aligned)
    ◦ Smith & Waterman, 1981


   Searching for repetitions

   Searching for overlap
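Local alignment changes the global recurrence in two ways: a cell's score is never allowed to drop below zero, and the traceback starts at the highest-scoring cell and stops at a zero. The sketch below illustrates this. The function name and the default scores (match +2, mismatch -1, gap -2) are assumptions chosen for the example; local alignment needs a negative mismatch score so that poor regions are cut off.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_part_of_a, aligned_part_of_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(0,                                # start afresh
                              score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score falls to zero.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("TTTTGATTACATTTT", "CCCGATTACACCC"))
```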
   Multi-step approach to find high-scoring alignments

   Exact short word matches

   Maximal scoring ungapped extensions

   Identify gapped alignments
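The first step, finding exact short word matches, can be sketched with a lookup table. The code below is a simplified illustration rather than FASTA's actual implementation; the function name and the word length `k` are assumptions. It indexes every k-letter word of the target and counts hits per diagonal (query position minus target position), since many hits on one diagonal suggest a region worth extending.

```python
from collections import Counter, defaultdict


def word_hits_by_diagonal(query, target, k=3):
    """Count exact k-letter word matches on each diagonal (i - j)."""
    index = defaultdict(list)
    for j in range(len(target) - k + 1):
        index[target[j:j + k]].append(j)      # word -> positions in target

    diagonals = Counter()
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            diagonals[i - j] += 1             # same diagonal = same offset
    return diagonals


hits = word_hits_by_diagonal("GATTACA", "CCGATTACA", k=3)
print(hits.most_common(1))  # the diagonal with the most word hits
```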
   FASTA also uses E-values and bit scores. The FASTA output
    provides one more statistical parameter, the Z-score.
   This describes the number of standard deviations from the
    mean score for the database search.
   Because most of the alignments with the query sequence are with
    unrelated sequences, the higher the Z-score for a reported
    match, the further it lies from the mean of the score
    distribution and hence the more significant the match.
   For a Z-score > 15, the match can be considered extremely
    significant, with certainty of a homologous relationship.
   If Z is in the range of 5 to 15, the sequence pair can be
    described as highly probable homologs.
   If Z < 5, their relationship is described as less certain.
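The Z-score definition and the rule-of-thumb thresholds above can be written down directly. This is a simplified sketch: it assumes a plain mean and sample standard deviation over a list of background scores, whereas the FASTA program's own normalisation may be more involved. The function names and the example scores are hypothetical.

```python
from statistics import mean, stdev


def z_score(score, background_scores):
    """Number of standard deviations that score lies above the background mean."""
    return (score - mean(background_scores)) / stdev(background_scores)


def interpret_z(z):
    """Apply the thresholds quoted above."""
    if z > 15:
        return "extremely significant"
    if z >= 5:
        return "highly probable homologs"
    return "less certain"


background = [10, 12, 14]          # hypothetical scores against unrelated sequences
z = z_score(52, background)        # mean 12, standard deviation 2
print(z, interpret_z(z))
```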
   Multi-step approach to find high-scoring alignments

   List words of fixed length (3AA) expected to give score
    larger than threshold

   For every word, search database and extend ungapped
    alignment in both directions

   New versions of BLAST allow gaps
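The ungapped extension step can be illustrated with a simplified sketch. It is not BLAST's actual code: it starts from an exact word match instead of a scored word list, uses a plain match/mismatch score instead of a substitution matrix, and stops extending once the running score falls more than `x_drop` below the best seen. All names and parameter values are assumptions.

```python
def extend_seed(query, target, qi, ti, k, match=1, mismatch=-1, x_drop=2):
    """Extend an exact k-letter seed at query[qi], target[ti] in both directions."""

    def extend(q_pos, t_pos, step):
        best = run = 0
        best_len = length = 0
        while 0 <= q_pos < len(query) and 0 <= t_pos < len(target):
            run += match if query[q_pos] == target[t_pos] else mismatch
            length += 1
            if run > best:
                best, best_len = run, length   # remember the best extension so far
            if best - run > x_drop:
                break                          # score dropped too far: give up
            q_pos += step
            t_pos += step
        return best, best_len

    right_score, right_len = extend(qi + k, ti + k, +1)
    left_score, left_len = extend(qi - 1, ti - 1, -1)
    total = k * match + left_score + right_score
    return total, query[qi - left_len: qi + k + right_len]


# The seed "TTA" sits at index 4 in both sequences.
print(extend_seed("CCGATTACAGG", "TTGATTACATT", 4, 4, 3))
```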
   The E-value provides information about the likelihood that a
    given sequence match is purely by chance. The lower the E-
    value, the less likely the database match is a result of random
    chance and therefore the more significant the match is.
   If E < 1e-50 (or 1 × 10^-50), there should be an extremely
    high confidence that the database match is a result of
    homologous relationships.
   If E is between 1e-50 and 0.01, the match can be considered
    a result of homology.
   If E is between 0.01 and 10, the match is considered not
    significant, but may hint at a tentative remote homology
    relationship. Additional evidence is needed.
   If E > 10, the sequences under consideration are either
    unrelated or related by extremely distant relationships that fall
    below the limit of detection with the current method.
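These rules of thumb translate into a small lookup function. The sketch below only encodes the thresholds listed above; the function name and the wording of the returned labels are illustrative, and how to treat values that fall exactly on a boundary is an assumption.

```python
def interpret_e_value(e):
    """Map a BLAST E-value to the rule-of-thumb categories listed above."""
    if e < 1e-50:
        return "extremely high confidence of homology"
    if e < 0.01:
        return "homology"
    if e <= 10:
        return "not significant; possible remote homology, more evidence needed"
    return "unrelated or too distant to detect"


for e in (1e-80, 1e-5, 0.5, 50):
    print(e, "->", interpret_e_value(e))
```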
   Various versions:

   Blastn:    nucleotide query - nucleotide database
   Blastp:    protein query - protein database
   tBlastn:   protein query - translated nucleotide database
   Blastx:    translated nucleotide query - protein database
   tBlastx:   translated nucleotide query - translated nucleotide database
   Very fast growth of biological data
   Diversity of biological data:
    ◦ Primary sequences
    ◦ 3D structures
    ◦ Functional data
   Database entry usually required for publication
    ◦ Sequences
    ◦ Structures
   Database entry may replace primary publication
    ◦ Genomic approaches
Nucleic Acid     Protein
EMBL (Europe)    PIR (Protein Information Resource)
GenBank (USA)    MIPS
DDBJ (Japan)     SWISS-PROT (University of Geneva, now with EBI)
                 TrEMBL (a supplement to SWISS-PROT)
                 NRL-3D
   The three nucleic acid databanks exchange data on a daily basis
   Data can be submitted and accessed at any of the three locations

   GenBank
    ◦ www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html
   EMBL
    ◦ www.ebi.ac.uk/embl/index.html
   DNA Databank of Japan (DDBJ)
    ◦ www.nig.ac.jp/home.html
   With so many databases, which one should be searched? Each
    is strong in some aspects and weak in others.
   Composite databases are the answer: each draws its base data
    from several primary databases.
   Searches on these databases are indexed and streamlined
    so that the same stored sequence is not searched twice
    in different databases.
   OWL has these as its primary databases:
    ◦   SWISS PROT (top priority)
    ◦   PIR
    ◦   GenBank
    ◦   NRL-3D
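The idea of a composite database, merging several sources by priority so that an identical sequence is stored only once, can be sketched as below. This is an illustration of the principle, not OWL's actual build procedure. The function name, the entry identifiers and the sequences are all hypothetical.

```python
def build_composite(sources):
    """Merge (db_name, {entry_id: sequence}) pairs given in priority order.

    A sequence already contributed by a higher-priority database is skipped.
    """
    composite = {}
    for db_name, entries in sources:
        for entry_id, sequence in entries.items():
            if sequence not in composite:
                composite[sequence] = (db_name, entry_id)
    return composite


# Hypothetical toy entries; SWISS-PROT is listed first because it has top priority.
sources = [
    ("SWISS-PROT", {"sp1": "MKTAYIAK"}),
    ("PIR", {"pir1": "MKTAYIAK", "pir2": "GAVLIMFW"}),
]
for seq, origin in build_composite(sources).items():
    print(seq, origin)
```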
   Secondary databases store secondary structure info or results
    of searches of the primary databases.

      Secondary Database   Primary Source
      PROSITE              SWISS-PROT
      PRINTS               OWL
   We have sequenced and identified genes. So we
    know what they do.
   The sequences are stored in databases.
   So if we find a new gene in the human genome, we
    compare it with the already known genes
    stored in the databases.
   Since there is a large number of databases, we cannot
    do a sequence alignment for each and every sequence.
   So heuristics must be used again.
   Applications:-
    Bioinformatics joins mathematics, statistics, computer
    science and information technology to solve complex
    biological problems.

   Sequence Analysis:-
    Sequence analysis uses sequencing information to identify
    genes that encode peptides, as well as regulatory sequences.
    Computational tools also detect DNA mutations in an organism
    and identify sequences that are related. Special software
    is used to find overlapping fragments and assemble
    them.


   Prediction of Protein Structure:-
    It is easy to determine the primary structure of a protein,
    that is, its amino acid sequence, from the DNA sequence
    that encodes it, but it is difficult to determine the
    secondary, tertiary or quaternary structures of proteins.
    Tools of bioinformatics can be used to predict these
    complex protein structures.
   Genome Annotation:-
    In genome annotation, genomes are marked up to identify
    the regulatory sequences and protein-coding regions. It is a very
    important part of the Human Genome Project as it
    determines the regulatory sequences.
   Comparative Genomics:-
    Comparative genomics is the branch of bioinformatics
    which determines how genomic structure and function
    are related between different biological species. For this
    purpose, intergenomic maps are constructed which
    enable scientists to trace the processes of evolution
    that occur in the genomes of different species.

   Health and Drug discovery:-
    The tools of bioinformatics are also helpful in drug
    discovery, diagnosis and disease management.
    Complete sequencing of human genes has enabled
    scientists to make medicines and drugs which can
    target more than 500 genes.
Basics of bioinformatics

More Related Content

What's hot (20)

Tools of bioinforformatics by kk
Tools of bioinforformatics by kkTools of bioinforformatics by kk
Tools of bioinforformatics by kk
 
Swiss prot database
Swiss prot databaseSwiss prot database
Swiss prot database
 
blast bioinformatics
blast bioinformaticsblast bioinformatics
blast bioinformatics
 
Multiple sequence alignment
Multiple sequence alignmentMultiple sequence alignment
Multiple sequence alignment
 
Clustal
ClustalClustal
Clustal
 
History and scope in bioinformatics
History and scope in bioinformaticsHistory and scope in bioinformatics
History and scope in bioinformatics
 
Protein database
Protein databaseProtein database
Protein database
 
Introduction to Bioinformatics
Introduction to BioinformaticsIntroduction to Bioinformatics
Introduction to Bioinformatics
 
Gen bank databases
Gen bank databasesGen bank databases
Gen bank databases
 
sequence alignment
sequence alignmentsequence alignment
sequence alignment
 
Major databases in bioinformatics
Major databases in bioinformaticsMajor databases in bioinformatics
Major databases in bioinformatics
 
Dynamic programming and pairwise sequence alignment
Dynamic programming and pairwise sequence alignmentDynamic programming and pairwise sequence alignment
Dynamic programming and pairwise sequence alignment
 
Blast and fasta
Blast and fastaBlast and fasta
Blast and fasta
 
Ddbj
DdbjDdbj
Ddbj
 
Structural genomics
Structural genomicsStructural genomics
Structural genomics
 
protein sequence analysis
protein sequence analysisprotein sequence analysis
protein sequence analysis
 
Tools and database of NCBI
Tools and database of NCBITools and database of NCBI
Tools and database of NCBI
 
NCBI National Center for Biotechnology Information
NCBI National Center for Biotechnology InformationNCBI National Center for Biotechnology Information
NCBI National Center for Biotechnology Information
 
Introduction OF BIOLOGICAL DATABASE
Introduction OF BIOLOGICAL DATABASEIntroduction OF BIOLOGICAL DATABASE
Introduction OF BIOLOGICAL DATABASE
 
Sequence file formats
Sequence file formatsSequence file formats
Sequence file formats
 

Viewers also liked

Bioinformatics
BioinformaticsBioinformatics
BioinformaticsJTADrexel
 
Bioinformatics
BioinformaticsBioinformatics
Bioinformaticsbiinoida
 
Bioinformatics Final Presentation
Bioinformatics Final PresentationBioinformatics Final Presentation
Bioinformatics Final PresentationShruthi Choudary
 
Bioinformatics
BioinformaticsBioinformatics
BioinformaticsAmna Jalil
 
Aamir Javed ArticleMicrobial Degradation of Plastic (LDPE) & domestic waste b...
Aamir Javed ArticleMicrobial Degradation of Plastic (LDPE) & domestic waste b...Aamir Javed ArticleMicrobial Degradation of Plastic (LDPE) & domestic waste b...
Aamir Javed ArticleMicrobial Degradation of Plastic (LDPE) & domestic waste b...Aamir Javed
 
Introduction to bioinformatics
Introduction to bioinformaticsIntroduction to bioinformatics
Introduction to bioinformaticsMakarand Bhale
 
Introduction to bioinformatics
Introduction to bioinformaticsIntroduction to bioinformatics
Introduction to bioinformaticsHamid Ur-Rahman
 
Essence and nature of values
Essence and nature of valuesEssence and nature of values
Essence and nature of valuesJhunisa Agustin
 
Application of bioinformatics
Application of bioinformaticsApplication of bioinformatics
Application of bioinformaticsKamlesh Patade
 
The Nature of Value
The Nature of Value The Nature of Value
The Nature of Value Nick Gogerty
 
De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au ...
De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au ...De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au ...
De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au ...Torsten Seemann
 
Flow Cytometry Training : Introduction day 1 session 1
Flow Cytometry Training : Introduction day 1 session 1Flow Cytometry Training : Introduction day 1 session 1
Flow Cytometry Training : Introduction day 1 session 1Robert (Rob) Salomon
 
Ap Chapter 21
Ap Chapter 21Ap Chapter 21
Ap Chapter 21smithbio
 
Project report-on-bio-informatics
Project report-on-bio-informaticsProject report-on-bio-informatics
Project report-on-bio-informaticsDaniela Rotariu
 
Mapping Genotype to Phenotype using Attribute Grammar, Laura Adam
Mapping Genotype to Phenotype using Attribute Grammar, Laura AdamMapping Genotype to Phenotype using Attribute Grammar, Laura Adam
Mapping Genotype to Phenotype using Attribute Grammar, Laura Adammadalladam
 
Bioinformatics in the Era of Open Science and Big Data
Bioinformatics in the Era of Open Science and Big DataBioinformatics in the Era of Open Science and Big Data
Bioinformatics in the Era of Open Science and Big DataPhilip Bourne
 
DNA Markers Techniques for Plant Varietal Identification
DNA Markers Techniques for Plant Varietal Identification DNA Markers Techniques for Plant Varietal Identification
DNA Markers Techniques for Plant Varietal Identification Senthil Natesan
 

Viewers also liked (20)

Bioinformatics
BioinformaticsBioinformatics
Bioinformatics
 
Bioinformatics
BioinformaticsBioinformatics
Bioinformatics
 
Bioinformatics Final Presentation
Bioinformatics Final PresentationBioinformatics Final Presentation
Bioinformatics Final Presentation
 
Bioinformatics
BioinformaticsBioinformatics
Bioinformatics
 
Aamir Javed ArticleMicrobial Degradation of Plastic (LDPE) & domestic waste b...
Aamir Javed ArticleMicrobial Degradation of Plastic (LDPE) & domestic waste b...Aamir Javed ArticleMicrobial Degradation of Plastic (LDPE) & domestic waste b...
Aamir Javed ArticleMicrobial Degradation of Plastic (LDPE) & domestic waste b...
 
Introduction to bioinformatics
Introduction to bioinformaticsIntroduction to bioinformatics
Introduction to bioinformatics
 
Bioinformatics
BioinformaticsBioinformatics
Bioinformatics
 
Bioinformatics principles and applications
Bioinformatics principles and applicationsBioinformatics principles and applications
Bioinformatics principles and applications
 
Introduction to bioinformatics
Introduction to bioinformaticsIntroduction to bioinformatics
Introduction to bioinformatics
 
Essence and nature of values
Essence and nature of valuesEssence and nature of values
Essence and nature of values
 
Attitudes and values
Attitudes and valuesAttitudes and values
Attitudes and values
 
Application of bioinformatics
Application of bioinformaticsApplication of bioinformatics
Application of bioinformatics
 
The Nature of Value
The Nature of Value The Nature of Value
The Nature of Value
 
De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au ...
De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au ...De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au ...
De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au ...
 
Flow Cytometry Training : Introduction day 1 session 1
Flow Cytometry Training : Introduction day 1 session 1Flow Cytometry Training : Introduction day 1 session 1
Flow Cytometry Training : Introduction day 1 session 1
 
Ap Chapter 21
Ap Chapter 21Ap Chapter 21
Ap Chapter 21
 
Project report-on-bio-informatics
Project report-on-bio-informaticsProject report-on-bio-informatics
Project report-on-bio-informatics
 
Mapping Genotype to Phenotype using Attribute Grammar, Laura Adam
Mapping Genotype to Phenotype using Attribute Grammar, Laura AdamMapping Genotype to Phenotype using Attribute Grammar, Laura Adam
Mapping Genotype to Phenotype using Attribute Grammar, Laura Adam
 
Bioinformatics in the Era of Open Science and Big Data
Bioinformatics in the Era of Open Science and Big DataBioinformatics in the Era of Open Science and Big Data
Bioinformatics in the Era of Open Science and Big Data
 
DNA Markers Techniques for Plant Varietal Identification
DNA Markers Techniques for Plant Varietal Identification DNA Markers Techniques for Plant Varietal Identification
DNA Markers Techniques for Plant Varietal Identification
 

Similar to Basics of bioinformatics

Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...journal ijrtem
 
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...IJRTEMJOURNAL
 
4. sequence alignment.pptx
4. sequence alignment.pptx4. sequence alignment.pptx
4. sequence alignment.pptxArupKhakhlari1
 
Bioinformatics_Sequence Analysis
Bioinformatics_Sequence AnalysisBioinformatics_Sequence Analysis
Bioinformatics_Sequence AnalysisSangeeta Das
 
AI 바이오 (4일차).pdf
AI 바이오 (4일차).pdfAI 바이오 (4일차).pdf
AI 바이오 (4일차).pdfH K Yoon
 
Sequence Alignment
Sequence AlignmentSequence Alignment
Sequence AlignmentRavi Gandham
 
Introduction to sequence alignment
Introduction to sequence alignmentIntroduction to sequence alignment
Introduction to sequence alignmentKubuldinho
 
lecture4.ppt Sequence Alignmentaldf sdfsadf
lecture4.ppt Sequence Alignmentaldf sdfsadflecture4.ppt Sequence Alignmentaldf sdfsadf
lecture4.ppt Sequence Alignmentaldf sdfsadfalizain9604
 
Bioinformaatics for M.Sc. Biotecchnology.pptx
Bioinformaatics for M.Sc. Biotecchnology.pptxBioinformaatics for M.Sc. Biotecchnology.pptx
Bioinformaatics for M.Sc. Biotecchnology.pptxRanjan Jyoti Sarma
 
B.sc biochem i bobi u 3.1 sequence alignment
B.sc biochem i bobi u 3.1 sequence alignmentB.sc biochem i bobi u 3.1 sequence alignment
B.sc biochem i bobi u 3.1 sequence alignmentRai University
 
B.sc biochem i bobi u 3.1 sequence alignment
B.sc biochem i bobi u 3.1 sequence alignmentB.sc biochem i bobi u 3.1 sequence alignment
B.sc biochem i bobi u 3.1 sequence alignmentRai University
 

Similar to Basics of bioinformatics (20)

Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
 
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
 
4. sequence alignment.pptx
4. sequence alignment.pptx4. sequence alignment.pptx
4. sequence alignment.pptx
 
Bioinformatics_Sequence Analysis
Bioinformatics_Sequence AnalysisBioinformatics_Sequence Analysis
Bioinformatics_Sequence Analysis
 
AI 바이오 (4일차).pdf
AI 바이오 (4일차).pdfAI 바이오 (4일차).pdf
AI 바이오 (4일차).pdf
 
Sequence Alignment
Sequence AlignmentSequence Alignment
Sequence Alignment
 
Automatic Parallelization for Parallel Architectures Using Smith Waterman Alg...
Automatic Parallelization for Parallel Architectures Using Smith Waterman Alg...Automatic Parallelization for Parallel Architectures Using Smith Waterman Alg...
Automatic Parallelization for Parallel Architectures Using Smith Waterman Alg...
 
bioinformatic.pptx
bioinformatic.pptxbioinformatic.pptx
bioinformatic.pptx
 
Introduction to sequence alignment
Introduction to sequence alignmentIntroduction to sequence alignment
Introduction to sequence alignment
 
lecture4.ppt Sequence Alignmentaldf sdfsadf
lecture4.ppt Sequence Alignmentaldf sdfsadflecture4.ppt Sequence Alignmentaldf sdfsadf
lecture4.ppt Sequence Alignmentaldf sdfsadf
 
Bioinformatics
BioinformaticsBioinformatics
Bioinformatics
 
Seq alignment
Seq alignment Seq alignment
Seq alignment
 
Bioinformaatics for M.Sc. Biotecchnology.pptx
Bioinformaatics for M.Sc. Biotecchnology.pptxBioinformaatics for M.Sc. Biotecchnology.pptx
Bioinformaatics for M.Sc. Biotecchnology.pptx
 
Ga
GaGa
Ga
 
Sequence alignment.pptx
Sequence alignment.pptxSequence alignment.pptx
Sequence alignment.pptx
 
Sequence alignment
Sequence alignmentSequence alignment
Sequence alignment
 
B.sc biochem i bobi u 3.1 sequence alignment
B.sc biochem i bobi u 3.1 sequence alignmentB.sc biochem i bobi u 3.1 sequence alignment
B.sc biochem i bobi u 3.1 sequence alignment
 
B.sc biochem i bobi u 3.1 sequence alignment
B.sc biochem i bobi u 3.1 sequence alignmentB.sc biochem i bobi u 3.1 sequence alignment
B.sc biochem i bobi u 3.1 sequence alignment
 
Bioinformatics
BioinformaticsBioinformatics
Bioinformatics
 
Bioinformatica 08-12-2011-t8-go-hmm
Bioinformatica 08-12-2011-t8-go-hmmBioinformatica 08-12-2011-t8-go-hmm
Bioinformatica 08-12-2011-t8-go-hmm
 

Basics of bioinformatics

  • 1. Need & Emergence of the Field Speaker Shashi Shekhar Head of computational Section Biowits Life Sciences
  • 2. The marriage between computer science and molecular biology
      ◦ The algorithms and techniques of computer science are being used to solve the problems faced by molecular biologists
    ‘Information technology applied to the management and analysis of biological data’
      ◦ Storage and analysis are two of the important functions – bioinformaticians build tools for each.
  • 3. Biology Chemistry Computer Science Statistics Bioinformatics
  • 4. The need for bioinformatics has arisen from the recent explosion of publicly available genomic information, such as that resulting from the Human Genome Project.
    To gain a better understanding of gene analysis, taxonomy, and evolution.
    To work efficiently on rational drug design and reduce the time taken to develop drugs manually.
  • 5. To uncover the wealth of biological information hidden in the mass of sequence, structure, literature and biological data.
    It is being used now, and will be for the foreseeable future, in the areas of molecular medicine.
    It has environmental benefits in identifying waste clean-up bacteria.
    In agriculture, it can be used to produce high-yield, low-maintenance crops.
  • 6. Molecular Medicine  Gene Therapy  Drug Development  Microbial genome applications  Crop Improvement  Forensic Analysis of Microbes  Biotechnology  Evolutionary Studies  Bio-Weapon Creation
  • 7. In Experimental Molecular Biology
    In Genetics and Genomics
    In generating Biological Data
    Analysis of gene and protein expression
    Comparison of genomic data
    Understanding of evolutionary aspects
    Understanding biological pathways and networks in Systems Biology
    In Simulation & Modeling of DNA, RNA & Protein
  • 8. e.g. homology searches Bioinformatics lecture March 5, 2002 organisation of knowledge (sequences, structures, functional data)
  • 9. Prediction of structure from sequence
      ◦ secondary structure
      ◦ homology modelling, threading
      ◦ ab initio 3D prediction
    Analysis of 3D structure
      ◦ structure comparison/alignment
      ◦ prediction of function from structure
      ◦ molecular mechanics/molecular dynamics
      ◦ prediction of molecular interactions, docking
    Structure databases (RCSB)
  • 10.
  • 11. Sequence Similarity
    Tools used for sequence similarity searching
    Their uses in biology
    Databases
    Different types of databases
  • 12. One could align the sequences so that many corresponding residues match.
    Strong similarity between two sequences is a strong argument for their homology.
    Homology: Two (or more) sequences have a common ancestor.
    Similarity: Two (or more) sequences are similar by some criterion; the term does not refer to any historical process.
  • 13. To find how closely proteins or genes are related, i.e. whether or not they have a common ancestor.
    Mutations bring about changes, or divergence, in the sequences.
    Comparison can also reveal the part of the sequence that is crucial for the functioning of the gene or protein.
  • 14. Optimal Alignment: The alignment that is the best, given a defined set of rules and parameter values for comparing different alignments.
    Global Alignment: An alignment that assumes that the two proteins are basically similar over the entire length of one another. The alignment attempts to match them to each other from end to end.
    Local Alignment: An alignment that searches for segments of the two sequences that match well. There is no attempt to force entire sequences into an alignment, just those parts that appear to have good similarity. (contd.)
  • 15. Gaps & Insertions: In an alignment, one may achieve much better correspondence between two sequences if one allows a gap to be introduced in one sequence. Equivalently, one could allow an insertion in the other sequence. Biologically this corresponds to a mutation event.
    Substitution matrix: A substitution matrix describes how readily two residue types would mutate into each other over evolutionary time. It is used to estimate how well two residues of given types would match if they were aligned in a sequence alignment.
    Gap Penalty: The gap penalty is used to help decide whether or not to accept a gap or insertion in an alignment when it is possible to achieve a good residue-to-residue alignment at some other neighboring point in the sequence.
  • 16. Similarity indicates conserved function
    Human and mouse genes are more than 80% similar at the sequence level
    But these genes are a small fraction of the genome
    Most sequences in the genome are not recognizably similar
    Comparing sequences helps us understand function
      ◦ Locate a similar gene in another species to understand your new gene
  • 17. Match score: +1
    Mismatch score: +0
    Gap penalty: –1

        ACGTCTGATACGCCGTATAGTCTATCT
            ||||| |||   || ||||||||
        ----CTGATTCGC---ATCGTCTATCT

    Matches: 18 × (+1)
    Mismatches: 2 × 0
    Gaps: 7 × (–1)
    Score = +11
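The arithmetic on this slide can be checked with a few lines of code. The following is a minimal sketch, assuming the alignment is given as two equal-length strings with "-" marking a gap; the function name `score_alignment` is chosen here for illustration and does not come from the slides.

```python
def score_alignment(top, bottom, match=1, mismatch=0, gap=-1):
    """Score a finished pairwise alignment column by column (linear gap penalty)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            score += gap          # every gap position costs the same
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


top = "ACGTCTGATACGCCGTATAGTCTATCT"
bottom = "----CTGATTCGC---ATCGTCTATCT"
print(score_alignment(top, bottom))  # 18 matches, 2 mismatches, 7 gaps -> 11
```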
  • 18. We want to find alignments that are evolutionarily likely.
    Which of the following alignments seems more likely to you?

        ACGTCTGATACGCCGTATAGTCTATCT
        ACGTCTGAT-------ATAGTCTATCT

        ACGTCTGATACGCCGTATAGTCTATCT
        AC-T-TGA--CG-CGT-TA-TCTATCT

    We can achieve this by penalizing more for a new gap than for extending an existing gap
  • 19. Match/mismatch score: +1/+0
    Origination/length penalty: –2/–1

        ACGTCTGATACGCCGTATAGTCTATCT
            ||||| |||   || ||||||||
        ----CTGATTCGC---ATCGTCTATCT

    Matches: 18 × (+1)
    Mismatches: 2 × 0
    Origination: 2 × (–2)
    Length: 7 × (–1)
    Score = +7
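The origination/length scheme can be sketched in the same way. This is an illustrative sketch only: the name `affine_score` and the string representation of the alignment are assumptions, and a "new gap" is taken to mean a run of "-" characters starting in one of the two sequences.

```python
def affine_score(top, bottom, match=1, mismatch=0, gap_open=-2, gap_extend=-1):
    """Score an alignment with an origination penalty per gap run
    plus a length penalty per gap position."""
    score = 0
    prev_gap_side = None  # which sequence carried the gap in the previous column
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            side = "top" if x == "-" else "bottom"
            if side != prev_gap_side:
                score += gap_open      # a new gap starts here
            score += gap_extend        # every gap position adds to the length penalty
            prev_gap_side = side
        else:
            score += match if x == y else mismatch
            prev_gap_side = None
    return score


reference = "ACGTCTGATACGCCGTATAGTCTATCT"
slide_19 = "----CTGATTCGC---ATCGTCTATCT"
one_long_gap = "ACGTCTGAT-------ATAGTCTATCT"
scattered_gaps = "AC-T-TGA--CG-CGT-TA-TCTATCT"

print(affine_score(reference, slide_19))        # 18 - 2*2 - 7 = 7
print(affine_score(reference, one_long_gap))    # 20 - 1*2 - 7 = 11
print(affine_score(reference, scattered_gaps))  # 20 - 6*2 - 7 = 1
```

With a linear penalty the last two alignments would score the same (20 matches, 7 gap positions); the origination penalty is what makes the single long gap score higher.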
  • 20. Alignment scoring and substitution matrices
    Aligning two sequences
      ◦ Dotplots
      ◦ The dynamic programming algorithm
      ◦ Significance of the results
    Heuristic methods
      ◦ FASTA
      ◦ BLAST
      ◦ Interpreting the output
  • 21. Examples:
    Staden: simple text file, lines <= 80 characters
    FASTA: simple text file, lines <= 80 characters, one line header marked by ">"
    GCG: structured format with header and formatted sequence
    Sequence format descriptions e.g. on http://www.infobiogen.fr/doc/tutoriel/formats.html
  • 22. Local sequence comparison:
    assumption of evolution by point mutations
      ◦ amino acid replacement (by base replacement)
      ◦ amino acid insertion
      ◦ amino acid deletion
    scores:
      ◦ positive for identical or similar
      ◦ negative for different
      ◦ negative for insertion in one of the two sequences
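As an illustration of the FASTA format described on this slide (a header line marked by ">" followed by sequence lines), here is a minimal parsing sketch. The function name `parse_fasta` and the two example records are made up for illustration; real files may carry conventions this sketch ignores.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)           # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical records, for illustration only.
example = """>seq1 first example
ACGTCTGATACGCC
GTATAGTCTATCT
>seq2 second example
CTGATTCGCATCGTCTATCT
"""
for name, seq in parse_fasta(example):
    print(name, len(seq))
```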
  • 23. Simple comparison without alignment  Similarities between sequences show up in 2D diagram
  • 24. identity (i=j) similarity of sequence with other parts of itself
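A dotplot can be sketched as a grid of characters, with a mark wherever residue i of one sequence equals residue j of the other. This is a bare-bones illustration (the name `dotplot` is an assumption, and practical dotplots usually filter with a sliding window, which is omitted here). Comparing a sequence with itself shows the identity diagonal (i = j) plus off-diagonal marks for repeated residues.

```python
def dotplot(a, b):
    """Return one text row per residue of `a`; '*' marks a[i] == b[j]."""
    return ["".join("*" if x == y else "." for y in b) for x in a]


seq = "GATTACA"
for residue, row in zip(seq, dotplot(seq, seq)):
    print(residue, row)
```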
  • 25. The 1st alignment: highly significant
    The 2nd: plausible
    The 3rd: spurious
    Distinguish by alignment score
    Similarities increase score (substitution matrix)
    Mismatches decrease score (substitution matrix)
    Gaps decrease score (gap penalties)
  • 26. Substitution matrix weights replacement of one residue by another:
      ◦ Similar -> high score (positive)
      ◦ Different -> low score (negative)
    Simplest is identity matrix (e.g. for nucleic acids)

            A  C  G  T
        A   1  0  0  0
        C   0  1  0  0
        G   0  0  1  0
        T   0  0  0  1
  • 27. PAM matrix series (PAM1 ... PAM250):
      ◦ Derived from alignment of very similar sequences
      ◦ PAM1 = mutation events that change 1% of AA
      ◦ PAM2, PAM3, ... extrapolated by matrix multiplication, e.g.: PAM2 = PAM1 * PAM1; PAM3 = PAM2 * PAM1 etc.
    Problems with PAM matrices:
      ◦ Incorrect modelling of long time substitutions, since conservative mutations are dominated by single nucleotide changes
      ◦ e.g.: L <–> I, L <–> V, Y <–> F; long time: any amino acid change
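The extrapolation step (PAM2 = PAM1 * PAM1, and so on) is ordinary matrix multiplication of mutation probability matrices. The sketch below uses a made-up 2 × 2 "alphabet" purely to keep the numbers readable; a real PAM1 matrix is 20 × 20 and its values come from observed substitutions, and the scoring matrices used in alignment are log-odds values derived from these probabilities. The names `mat_mul`, `pam_n` and `toy_pam1` are hypothetical.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_n(pam1, n):
    """Extrapolate PAM-n by multiplying PAM1 with itself n times in total."""
    result = pam1
    for _ in range(n - 1):
        result = mat_mul(result, pam1)
    return result


# Toy 2-state matrix (hypothetical): each residue changes with probability 1%.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]

pam2 = pam_n(toy_pam1, 2)
print(pam2)                       # roughly [[0.9802, 0.0198], [0.0198, 0.9802]]
print(pam_n(toy_pam1, 250)[0])    # the diagonal shrinks as the distance grows
```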
  • 28. positive and negative values identity score depends on residue
  • 29. BLOSUM series (BLOSUM50, BLOSUM62, ...)
    derived from alignments of distantly related sequences
    BLOCKS database:
      ◦ ungapped multiple alignments of protein families at a given identity
    BLOSUM50 better for gapped alignments
    BLOSUM62 better for ungapped alignments
  • 31. Significance of alignment:  Depends critically on gap penalty  Need to adjust to given sequence  Gap penalties influenced by knowledge of structure etc.  Simple rules when nothing is known (linear or affine)
  • 32. Dynamic programming = build up optimal alignment using previous solutions for optimal alignments of subsequences.  The dynamic programming relies on a principle of optimality. This principle states that in an optimal sequence of decisions or choices, each subsequence must also be optimal.  The principle can be related as follows: the optimal solution to a problem is a combination of optimal solutions to some of its sub-problems.
  • 33. Construct a two-dimensional matrix whose axes are the two sequences to be compared.  The scores are calculated one row at a time. This starts with the first row of one sequence, which is used to scan through the entire length of the other sequence, followed by scanning of the second row.  The scanning of the second row takes into account the scores already obtained in the first round. The best score is put into the bottom right corner of an intermediate matrix.  This process is iterated until values for all the cells are filled.
  • 36. The results are traced back through the matrix in reverse order from the lower right-hand corner of the matrix toward the origin of the matrix in the upper left-hand corner.
    The best matching path is the one that has the maximum total score.
    If two or more paths reach the same highest score, one is chosen arbitrarily to represent the best alignment.
    The path can also move horizontally or vertically at a certain point, which corresponds to introduction of a gap or an insertion or deletion for one of the two sequences.
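The matrix fill and traceback described on the dynamic programming slides can be sketched as follows. This is a minimal global alignment in the style of Needleman and Wunsch with a linear gap penalty; the scoring values (match +1, mismatch –1, gap –1) and the tie-breaking order (diagonal first) are assumptions made for the example, not values taken from the slides.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # F[i][j] = best score for aligning a[:i] with b[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          F[i - 1][j] + gap,     # gap in b
                          F[i][j - 1] + gap)     # gap in a

    # Trace back from the lower right-hand corner to the origin.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                out_a.append(a[i - 1]); out_b.append(b[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = needleman_wunsch("ACGT", "AGT")
print(score)
print(top)
print(bottom)
```

When several paths reach the same score, this sketch simply prefers the diagonal move, which is one way of choosing a path arbitrarily.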
  • 37.
  • 38. Global alignment (ends aligned)
      ◦ Needleman & Wunsch, 1970
    Local alignment (subsequences aligned)
      ◦ Smith & Waterman, 1981
    Searching for repetitions
    Searching for overlap
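For contrast with the global case, a local alignment in the style of Smith and Waterman can be sketched with two changes to the dynamic programming: cell scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The scoring values below (match +2, mismatch –1, gap –1) and the example sequences are assumptions chosen for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Local alignment; returns (score, aligned_a_segment, aligned_b_segment)."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                      # start a fresh local alignment
                          H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Only the shared segment is reported; the unrelated flanks are ignored.
print(smith_waterman("TTTACGTAAA", "GGGACGTCCC"))
```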
  • 39.
  • 40. Multi-step approach to find high-scoring alignments  Exact short word matches  Maximal scoring ungapped extensions  Identify gapped alignments
  • 42.
  • 43. FASTA also uses E-values and bit scores. The FASTA output provides one more statistical parameter, the Z-score.
    This describes the number of standard deviations from the mean score for the database search.
    Because most of the alignments with the query sequence are with unrelated sequences, the higher the Z-score for a reported match, the further it lies from the mean of the score distribution and hence the more significant the match.
    For a Z-score > 15, the match can be considered extremely significant, with certainty of a homologous relationship.
    If Z is in the range of 5 to 15, the sequence pair can be described as highly probable homologs.
    If Z < 5, their relationship is described as less certain.
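The Z-score idea (number of standard deviations above the mean database score) and the interpretation bands from this slide can be sketched as below. This is a simplification under stated assumptions: the function names and the background scores are invented for illustration, and the way FASTA itself estimates the mean and standard deviation is more involved than a plain population standard deviation.

```python
import statistics


def z_score(score, background_scores):
    """Standard deviations between `score` and the mean of the background scores."""
    mean = statistics.mean(background_scores)
    sd = statistics.pstdev(background_scores)
    return (score - mean) / sd


def interpret_z(z):
    """Map a Z-score onto the bands given on the slide."""
    if z > 15:
        return "extremely significant"
    if z >= 5:
        return "highly probable homologs"
    return "less certain"


background = [8, 12, 8, 12]      # hypothetical scores against unrelated sequences
z = z_score(30, background)      # mean 10, standard deviation 2
print(z, interpret_z(z))
```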
  • 44. Multi-step approach to find high-scoring alignments  List words of fixed length (3AA) expected to give score larger than threshold  For every word, search database and extend ungapped alignment in both directions  New versions of BLAST allow gaps
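The word-based strategy can be illustrated with a much simplified sketch: index every word of length k in the subject, look up each query word, and extend each hit in both directions without gaps. Two loud assumptions: only exact word matches are used as seeds (the slide describes words expected to score above a threshold, which is a larger list), and the extension stops after the running score falls more than `drop` below the best seen so far. All names and parameter values here are illustrative, not BLAST's own.

```python
def extend(query, subject, qi, sj, step, match, mismatch, drop):
    """Ungapped extension from (qi, sj) in direction `step`; returns (gain, length)."""
    run = best = best_len = steps = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        run += match if query[qi] == subject[sj] else mismatch
        steps += 1
        if run > best:
            best, best_len = run, steps
        elif best - run > drop:
            break                      # score has dropped too far; give up
        qi += step
        sj += step
    return best, best_len


def seed_and_extend(query, subject, k=3, match=1, mismatch=-1, drop=2):
    """Return (score, query_start, subject_start, length) of the best ungapped hit."""
    index = {}
    for j in range(len(subject) - k + 1):
        index.setdefault(subject[j:j + k], []).append(j)
    best = (0, 0, 0, 0)
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            right_gain, right_len = extend(query, subject, i + k, j + k, 1,
                                           match, mismatch, drop)
            left_gain, left_len = extend(query, subject, i - 1, j - 1, -1,
                                         match, mismatch, drop)
            score = k * match + right_gain + left_gain
            if score > best[0]:
                best = (score, i - left_len, j - left_len,
                        left_len + k + right_len)
    return best


print(seed_and_extend("GGACGTACGG", "TTACGTACTT"))
```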
  • 46.
  • 47. The E-value provides information about the likelihood that a given sequence match is purely by chance. The lower the E-value, the less likely the database match is a result of random chance and therefore the more significant the match is.
    If E < 1e−50 (or 1 × 10^−50), there should be an extremely high confidence that the database match is a result of homologous relationships.
    If E is between 0.01 and 1e−50, the match can be considered a result of homology.
    If E is between 0.01 and 10, the match is considered not significant, but may hint at a tentative remote homology relationship. Additional evidence is needed.
    If E > 10, the sequences under consideration are either unrelated or related by extremely distant relationships that fall below the limit of detection with the current method.
  • 48. Various versions:
    Blastn: nucleotide sequences
    Blastp: protein sequences
    tBlastn: protein query - translated database
    Blastx: nucleotide query - protein database
    tBlastx: nucleotide query - translated database
  • 49. Very fast growth of biological data
    Diversity of biological data:
      ◦ Primary sequences
      ◦ 3D structures
      ◦ Functional data
    Database entry usually required for publication
      ◦ Sequences
      ◦ Structures
    Database entry may replace primary publication
      ◦ Genomic approaches
  • 50. Nucleic Acid
      ◦ EMBL (Europe)
      ◦ GenBank (USA)
      ◦ DDBJ (Japan)
    Protein
      ◦ PIR - Protein Information Resource
      ◦ MIPS
      ◦ SWISS-PROT (University of Geneva, now with EBI)
      ◦ TrEMBL (a supplement to SWISS-PROT)
      ◦ NRL-3D
  • 51. Three databanks exchange data on a daily basis
    Data can be submitted and accessed at either location
    GenBank
      ◦ www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html
    EMBL
      ◦ www.ebi.ac.uk/embl/index.html
    DNA Databank of Japan (DDBJ)
      ◦ www.nig.ac.jp/home.html
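The E-value bands on this slide translate directly into a small lookup. The sketch below only encodes the slide's rules of thumb; the function name `interpret_e_value` and the handling of values that fall exactly on a boundary are assumptions.

```python
def interpret_e_value(e):
    """Map an E-value onto the interpretation bands given on the slide."""
    if e < 1e-50:
        return "extremely high confidence of homology"
    if e <= 0.01:
        return "homology"
    if e <= 10:
        return "not significant; possible remote homology, more evidence needed"
    return "unrelated, or too distant to detect with this method"


for e in (1e-80, 1e-5, 0.5, 50):
    print(e, "->", interpret_e_value(e))
```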
  • 52. With so many databases, which one should be searched? Some are strong in some aspects and weak in others.
    Composite databases are the answer: each draws on several databases for its base data.
    Searches on these databases are indexed and streamlined so that the same stored sequence is not searched twice in different databases.
  • 53. OWL draws on these primary databases:
      ◦ SWISS-PROT (top priority)
      ◦ PIR
      ◦ GenBank
      ◦ NRL-3D
  • 54. Store secondary structure info or results of searches of the primary databases.
    Composite database: primary source
      ◦ PROSITE: SWISS-PROT
      ◦ PRINTS: OWL
  • 55. We have sequenced and identified genes, so we know what they do.
    The sequences are stored in databases.
    So if we find a new gene in the human genome, we compare it with the already known genes stored in the databases.
    Since there are a large number of databases, we cannot do a sequence alignment for each and every sequence.
    So heuristics must be used again.
  • 56. Applications:- Bioinformatics joins mathematics, statistics, computer science and information technology to solve complex biological problems.
    Sequence Analysis:- Sequence analysis uses sequencing information to determine which genes encode regulatory sequences or peptides. The same computational tools can also detect DNA mutations in an organism and identify sequences that are related. Special software is used to find overlaps between fragments and to assemble them. Contd.
  • 57. Prediction of Protein Structure:- It is easy to determine the primary structure of a protein, the sequence of amino acids encoded by the DNA molecule, but it is difficult to determine its secondary, tertiary or quaternary structure. Tools of bioinformatics can be used to determine these complex protein structures.
    Genome Annotation:- In genome annotation, genomes are marked up to identify the regulatory sequences and protein-coding regions. It is a very important part of the Human Genome Project, as it determines the regulatory sequences.
  • 58. Comparative Genomics:- Comparative genomics is the branch of bioinformatics which determines the relation in genomic structure and function between different biological species. For this purpose, intergenomic maps are constructed which enable scientists to trace the processes of evolution that occur in the genomes of different species.
    Health and Drug discovery:- The tools of bioinformatics are also helpful in drug discovery, diagnosis and disease management. Complete sequencing of human genes has enabled scientists to make medicines and drugs which can target more than 500 genes.