BIOINFORMATIC TOOLS
By
KAUSHAL KUMAR SAHU
Assistant Professor (Ad Hoc)
Department of Biotechnology
Govt. Digvijay Autonomous P. G. College
Raj-Nandgaon ( C. G. )
 INTRODUCTION
 DEFINITION OF BIOINFORMATICS
 HISTORY
 OBJECTIVE OF BIOINFORMATICS
 TOOLS OF BIOINFORMATICS
 PROCEDURE AND TOOLS OF BIOINFORMATICS
o BIOLOGICAL DATABASES
o HOMOLOGY AND SIMILARITY TOOLS (SEQUENCE
ALIGNMENT)
o PROTEIN FUNCTION ANALYSIS TOOLS
o STRUCTURAL ANALYSIS TOOLS
o SEQUENCE MANIPULATION TOOLS
o SEQUENCE ANALYSIS TOOLS
 APPLICATION
 CONCLUSION
 REFERENCES
 Bioinformatics is a recently emerged scientific discipline concerned with
the computational analysis and storage of biological data. The
word bioinformatics is derived from two words:
 Bio, meaning biology
 Informatics (from the French informatique), meaning ‘data processing’.
 Bioinformatics simplifies the work of biologists in handling and
analysing vast amounts of data.
 Its computational methods, databases and algorithms are used
across life science research, including agriculture, medicine and
pharmaceuticals.
 Bioinformatics can be defined as the storage, analysis, and
searching of biological data (e.g. nucleic acid sequences of genes and
RNAs, and amino acid sequences and structural information of
proteins).
 The Institut Pasteur, Paris (France) defined bioinformatics
more precisely as the mathematical, statistical and computing
methods that aim to solve biological problems using DNA and
amino acid sequences, and related information,
 i.e. converting “data” to “information”.
 1977 – Φ-X174 Phage Genome sequenced
 1990 – Paper published in the Journal of Molecular Biology
describes sequence alignment search algorithm
 1990s – Software used to find fragment overlap for the Human
Genome Project
 1992 – NCBI takes over GenBank DNA sequence database in
response to the growing number of gene patents
 1994 – “Entrez” Global Query Cross-Database Search System
allows users to search GenBank database
 1996 – NCBI-BLAST created to provide powerful searches against
the Gen Bank database
 To introduce the bioinformatics discipline
 To introduce the major tools used for sequence and structure
analysis and explain in general how they work
 Homology and Comparative Modeling
 Protein or gene homology refers to nucleotide or amino acid
sequences, or domains, shared between different genes or proteins,
whether they come from the same or from different organisms
 Gene or Protein Identification
 Searching databases for nucleotide or amino acid sequences that
match sequences in unknown samples
 Bioinformatics tools are software programs designed to extract
meaningful information from the mass of molecular
biology/biological databases and to carry out sequence and
structural analysis.
 Once the databases had been established, tools became available to
search the sequence databases.
 Bioinformatics tools can be grouped into the following
categories:
a) Biological databases
b) Homology and similarity tools (Sequence alignment tool)
c) Protein function analysis tools
d) Structural analysis tools
e) Sequence manipulation tools
f) Sequence analysis tools
 Biological databases usually contain genomic, proteomic
and metabolic data. The data include nucleotide sequences of
genes and amino acid sequences of proteins.
 Some of the major types of biological database are:
a) Major Nucleotide Sequence Databases.
b) Major Mutation Databases.
c) Major Gene Expression Databases.
d) Major Microbial Genomic Databases.
e) Major Organism-Specific Genome Databases.
f) Major Protein Databases.
 EMBL (European Molecular Biology Laboratory nucleotide sequence database at EBI,
Hinxton, UK)
 NDB (Nucleic Acid structure Database at Rutgers University, USA)
 Entrez/Genome (NCBI, USA)
 Homologous sequences are sequences that are related by divergence
from a common ancestor. The degree of similarity between two
sequences can be measured and used as evidence of such a relationship.
 This set of tools can be used to identify similarities between novel
query sequences of unknown structure and function and database
sequences whose structure and function have been elucidated.
o BLAST (Basic Local Alignment Search Tool) is a program for
sequence similarity searching developed at the NCBI.
o It helps to identify genes and genetic features.
o A BLAST search enables a researcher to compare a query
sequence with a database of sequences and to identify database
sequences that resemble the query sequence.
 Nucleotide-nucleotide BLAST (BLASTN):
 Basic nucleotide sequence searches
 The BLAST that we used for our sequences
 Protein-protein BLAST (BLASTP):
 Similar technology used to search amino acid sequences
 Position-Specific Iterated BLAST (PSI-BLAST):
 A more advanced protein BLAST, useful for analyzing relationships
between divergently evolved proteins
 BLASTX and TBLASTN variants:
 Use translation of the nucleotide query and of the nucleotide
database, respectively, in the search
 MegaBLAST:
 Used to BLAST several sequences at once to cut down on
processing load and server reporting time
 blastp compares an amino acid query sequence against a protein
sequence database
 blastn compares a nucleotide query sequence against a nucleotide
sequence database
 blastx compares a nucleotide query sequence translated in all
reading frames against a protein sequence database
 tblastn compares a protein query sequence against a nucleotide
sequence database dynamically translated in all reading frames
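The phrase "translated in all reading frames" can be illustrated with a small sketch. The code below is not how BLAST itself is implemented; it only shows, assuming the standard genetic code and a DNA string made of A, C, G and T, how one nucleotide sequence yields six candidate protein sequences (three frames on each strand). The function names `translate` and `six_frame_translation` are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: NCBI table 1).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")  # 'X' for codons with unknown bases
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return translations of the three forward and three reverse frames."""
    reverse = dna.upper().translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGGCCTAA").items():
        print(frame, protein)
```

A translated search such as blastx or tblastn conceptually compares each of these six translations with protein sequences, which is why it can find a coding region whose strand and frame are not known in advance.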
 Query Coverage
 The percent of the query sequence matched by the database entry
 Max Ident
 The percent identity, i.e. the percent that the genes match up within
the limits of the full match (e.g. deletions or additions reduce this
value)
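As a rough illustration of these two report columns, the sketch below computes them from a single gapped pairwise alignment. It assumes the alignment is given as two equal-length strings with "-" for gaps, that identity is counted over all alignment columns, and that coverage is the fraction of query residues included in the alignment. These are simplifying assumptions; the exact definitions used in a BLAST report may differ, and the function names are hypothetical.

```python
def percent_identity(aligned_query, aligned_subject):
    """Percentage of alignment columns in which both sequences carry the same residue."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    columns = len(aligned_query)
    if columns == 0:
        return 0.0
    matches = sum(
        q == s and q != "-"  # a gap column never counts as a match
        for q, s in zip(aligned_query, aligned_subject)
    )
    return 100.0 * matches / columns


def query_coverage(aligned_query, query_length):
    """Percentage of the full query that takes part in the alignment."""
    aligned_residues = sum(ch != "-" for ch in aligned_query)
    return 100.0 * aligned_residues / query_length


if __name__ == "__main__":
    query_row = "ACGT-ACGT"    # 8 query residues aligned, one gap
    subject_row = "ACGTTACCT"  # insertion at column 5, mismatch at column 8
    print(round(percent_identity(query_row, subject_row), 1))
    print(query_coverage(query_row, query_length=10))
```

In this example the gap and the mismatch each lower the identity, which matches the note above that deletions or additions reduce the value.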
 FASTA is a DNA and protein sequence alignment software
package.
 It is used for a fast protein or fast nucleotide comparison.
 This program achieves a high level of sensitivity for similarity
searching at high speed.
 EMBOSS:
EMBOSS (European Molecular Biology Open Software Suite) is a
sequence analysis software package. It can work with data in a range of
formats and can retrieve sequence data transparently from the Web. Extensive
libraries are also provided with this package, allowing other scientists
to release their software as open source. It provides a set of
sequence-analysis programs and supports all UNIX platforms.
 ClustalW:
It is a fully automated sequence alignment tool for DNA and protein
sequences. It returns the best match over the total length of the input
sequences, whether they are proteins or nucleic acids.
 RasMol:
It is a powerful research tool for displaying the structure of DNA,
proteins, and smaller molecules. Protein Explorer, a derivative of
RasMol, is an easier-to-use program.
 PROSPECT:
PROSPECT (PROtein Structure Prediction and Evaluation Computer
ToolKit) is a protein-structure prediction system that employs a
computational technique called protein threading to construct a
protein's 3-D model.
 DNA Sequencing
 Sequence Formats
 Sequence Homology Software Tools
 Aligning Tools
 Annotated Information
 Protein Folding
 Sanger Method
 New nucleotide chains of DNA being synthesized by
DNA polymerase are terminated when dideoxy
nucleotides (ddNTPs, added to the reaction mixture in a ~1/100
ratio) are incorporated into the chain
 Fluorescent dyes are bound to the ddNTPs,
allowing each terminated molecule to be detected when it is
excited by a laser
 Terminated DNA chains are run on a gel, and
fragments are resolved by size
 By combining the fluorescence readings from each
size of nucleotide chain, the DNA sequence is
computed
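The last step above can be sketched in a few lines. This is a toy model under stated assumptions, not a real base caller: each terminated fragment is represented as a hypothetical `(length, base)` pair, where `base` is the nucleotide reported by the dye on the terminating ddNTP, and there is exactly one clean reading per length.

```python
def read_sequence(fragments):
    """Rebuild the sequence by ordering terminated fragments from shortest to longest.

    fragments: iterable of (length, base) pairs; the base is the dye-labelled
    ddNTP that ended the chain of that length (toy representation).
    """
    ordered = sorted(fragments, key=lambda fragment: fragment[0])
    return "".join(base for _, base in ordered)


if __name__ == "__main__":
    # Readings arrive in arbitrary order; the gel resolves them by size.
    readings = [(3, "G"), (1, "A"), (4, "C"), (2, "T")]
    print(read_sequence(readings))
```

Real instruments must also deal with noisy peaks, uneven signal and dye mobility shifts, none of which are modelled here.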
 First Things First – Sequence File Formats:
 Most common for nucleotides: FASTA / Multi-FASTA
 “>” followed by any unicode text, entire line read as
sequence title
 >E. coli Globin-coupled chemotaxis sensory transducer (TM domain)
ATGGACCTGATCACAAATGCGATTTAGAGACCTG
ATCACAAATGCGATGACCTGATCACAAATGCGAT
GACCTGATCACAAATGCGATGTAAACCTGATCAC
AAATGCGATGACCTGATCACAAATGCGATCTAAA
CCTGATCACAAATGCGATGACCTGATCACAAATG
CGATTAA
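A minimal reader for this format might look like the sketch below. It assumes plain FASTA or Multi-FASTA text in which every record starts with a single ">" title line and the sequence may be wrapped over several lines; the function name `parse_fasta` is hypothetical, and established libraries offer more robust parsers.

```python
def parse_fasta(text):
    """Return a list of (title, sequence) tuples from FASTA / Multi-FASTA text."""
    records = []
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:].strip(), []  # whole line after '>' is the title
        elif title is not None:
            chunks.append(line)  # wrapped sequence lines are joined
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = ">seq1 demo\nATGG\nACCT\n>seq2\nTTAA\n"
    for name, sequence in parse_fasta(example):
        print(name, len(sequence), sequence)
```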
 Clustal (free)
 ClustalX – Software
 ClustalW – Web
 Functionality is similar, but difference is in interface, tools,
and speed of algorithms
 http://www.ebi.ac.uk/clustalw/
 Lowest energy state folding
 Distributed computing is used for mid-sized proteins
 Folding@Home
 Human Proteome Folding Project
 Rosetta@Home
 Predictor@Home
 This group of programs allows a protein sequence to be compared
with the secondary protein databases that contain information on
motifs, signatures and protein domains.
 InterProScan
Searches protein sequences against InterPro signatures.
 PPSearch
Searches for protein motifs.
 Radar
Detects protein repeats.
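To give a feel for what a motif search does, the sketch below converts a PROSITE-style pattern into a regular expression and scans a protein sequence with it. It is only an illustration and not the implementation of any tool named above. It assumes a simplified pattern syntax (elements separated by "-", "x" for any residue, "[...]" for allowed residues, "{...}" for forbidden residues, "(n)" or "(n,m)" for repeats) and ignores the "<" and ">" anchors. The example pattern is the PROSITE N-glycosylation site pattern N-{P}-[ST]-{P}; the test sequence is made up.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simplified PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # forbidden residues
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


def find_motif(sequence, pattern):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [m.start() for m in regex.finditer(sequence)]


if __name__ == "__main__":
    print(prosite_to_regex("N-{P}-[ST]-{P}"))
    print(find_motif("MNGSAANPSTQ", "N-{P}-[ST]-{P}"))
```

The lookahead in `find_motif` is a design choice that lets overlapping occurrences be reported, which a plain `finditer` over the pattern would skip.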
 These tools work with 3-dimensional structures of proteins, nucleic
acids, molecular complexes, etc.
 3-D data are available thanks to techniques such as NMR and
X-ray crystallography
COPIA (Consensus Pattern Identification and Analysis)
It is a protein structure analysis tool for discovering
motifs in a family of protein sequences. Such motifs can then
be used to determine whether new protein sequences belong to
the family, and to predict the secondary and tertiary structures
and functions of proteins.
 These are software programs for analyzing and formatting
DNA and protein sequences.
 RepeatMasker
It is a program that screens the DNA for interspersed
repeats.
 Webcut
It is an online tool for restriction analysis, silent mutation
analysis, and SNP analysis.
 Translate
It is a tool which allows the translation of a nucleotide
sequence to a protein sequence.
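Restriction analysis, one of the tasks mentioned above, comes down to locating short recognition sequences in DNA. The sketch below is a minimal illustration and not Webcut's implementation; it assumes three well-known enzymes and their recognition sites (EcoRI GAATTC, BamHI GGATCC, HindIII AAGCTT), reports where each site starts rather than the exact cut position, and uses a made-up test sequence.

```python
# Recognition sites of three commonly used restriction enzymes.
RECOGNITION_SITES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
}


def find_restriction_sites(dna, sites=RECOGNITION_SITES):
    """Map each enzyme name to the 0-based start positions of its site in dna."""
    dna = dna.upper()
    hits = {}
    for enzyme, site in sites.items():
        positions = []
        start = dna.find(site)
        while start != -1:
            positions.append(start)
            start = dna.find(site, start + 1)  # step by one to allow overlaps
        hits[enzyme] = positions
    return hits


if __name__ == "__main__":
    for enzyme, positions in find_restriction_sites("AAGAATTCGGGGATCCTT").items():
        print(enzyme, positions)
```

Because these three sites are palindromic, scanning one strand is enough; non-palindromic sites would also require a scan of the reverse complement.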
 This set of tools allows further, more detailed analysis of a
query sequence, including evolutionary analysis and
identification of mutations.
 Align
This tool is used to compare two sequences.
 DNA Scanner
It is a tool that scans DNA for a number of different
properties, such as biophysical properties and potential for
protein interaction.
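Comparing two sequences is usually framed as finding the best-scoring alignment between them. The sketch below computes the score of an optimal global alignment with the Needleman-Wunsch dynamic programming method. It is offered as one common technique, not necessarily what the Align tool uses, and the scoring values (match +1, mismatch -1, linear gap -1) are arbitrary assumptions chosen for the example.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score of the best global alignment of a and b (Needleman-Wunsch, linear gaps)."""
    # previous[j] holds the best score for aligning the processed prefix of a with b[:j].
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = previous[j] + gap        # gap in b
            left = current[j - 1] + gap   # gap in a
            current.append(max(diagonal, up, left))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(global_alignment_score("GATTACA", "GATTACA"))
    print(global_alignment_score("ACGT", "AGT"))
```

Only two rows of the scoring matrix are kept, which is enough for the score; recovering the alignment itself would require the full matrix or a traceback structure.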
 Data such as experimental microarray images-
gene expression data
 Proteomic data- protein expression data
 Metabolic pathways, protein-protein interaction
data, regulatory networks
 Each database contains specific information
 Like other biological systems, these databases
are interrelated
 GENOMIC DATA: GenBank, DDBJ, EMBL
 ASSEMBLED GENOMES: GoldenPath, WormBase, TIGR
 PROTEIN: PIR, SWISS-PROT
 STRUCTURE: PDB, MMDB, SCOP
 LITERATURE: PubMed
 PATHWAY: KEGG, COG
 DISEASE: LocusLink, OMIM, OMIA
 GENES: RefSeq, AllGenes, GDB
 SNPs: dbSNP
 ESTs: dbEST, UniGene
 MOTIFS: BLOCKS, Pfam, Prosite
 GENE EXPRESSION: Stanford MGDB, NetAffx, ArrayExpress
Some of the applications related to biological
information analysis are:
 Bioinformatics is used in primer design.
 Bioinformatics is used to attempt to predict the function of
actual gene products.
 Molecular modeling/structural biology is a growing field
which can be considered part of bioinformatics.
 There are other fields- for example, medical imaging/ image
analysis, that might be considered part of bioinformatics.
There is also a whole other discipline of biologically inspired
computation: genetic algorithms, etc.
 Bioinformatics is building on the recognition of the importance
of information transmission, accumulation and processing in
biological systems.
 Software tools for bioinformatics range from simple
command-line tools, to more complex graphical programs and
standalone web-services available from various bioinformatics
companies or public institutions.
 S. C. Rastogi – Bioinformatics: Concepts, Skills and
Applications (2003)
 C. S. V. Murthy – Bioinformatics, First Edition (2003)
 David W. Mount – Bioinformatics: Sequence and Genome Analysis,
Second Edition
 http://Bioinformatics%20-%20Tools,%20softwares%20&%20Programmes.htm
 http://Bioinformatics%20-%20Wikipedia,%20the%20free%20encyclopedia.htm
