bioinformatics
students educators
researchers institutions
What is Bioinformatics?
Conceptualizing biology in terms of molecules
and then applying “informatics” techniques
from math, computer science, and statistics
to understand and organize the
information associated with these
molecules on a large scale.
What is Bioinformatics?
Bioinformatics involves the technology that uses
computers for storage, retrieval, manipulation, and
distribution of information related to biological
macromolecules.
Definition of Bioinformatics
• A multidisciplinary subject.
‘…information technology applied to the management
and analysis of biological data’ (Attwood, T. K.)
Bioinformatics aims to…
• Collect,
• Organise,
• Store,
• Retrieve,
• Analyse
…biological data with the use of computers.
The term "bioinformatics" was coined by
Paulien Hogeweg and Ben Hesper in 1970
as "the study of informatic processes in
biotic systems".
Why do we use Bioinformatics?
• Store/retrieve biological information (databases)
• Retrieve/compare gene sequences
• Predict function of unknown genes/proteins
• Search for previously known functions of a gene
• Compare data with other researchers
• Compile/distribute data for other researchers
Subfields
The development of computational tools and
databases
The application of these tools and databases in
generating biological knowledge
Scope
Tools development:
• Writing software for sequence, structural, and
functional analysis
• Construction and curation of biological databases
Tools: Used in three areas
1. Molecular Sequence Analysis
2. Molecular Structural Analysis
3. Molecular Functional Analysis
Applications
Drug design
Forensic DNA analysis
Agricultural biotechnology
• Since Charles Darwin, the idea of a common origin of species
has been widely accepted; however, the degree of similarity at the
molecular level between distant species remained unclear until
the 1970s and 1980s.
• At that time, it was established that many DNA and, in particular,
protein molecules retain high similarity hundreds of millions of
years after separation from a common ancestor.
• This discovery, together with the practical need to search growing
databases (DBs), led to the development of effective methods of
similarity search.
• Two programs that greatly facilitated similarity searching
were developed: FASTA (Pearson and Lipman 1988) and BLAST
(Altschul et al. 1990).
DB search for similar sequences
• The basic step in any similarity search is an alignment of two or
more sequences.
• The search provides a list of DB sequences with which a
query sequence can be aligned.
• A scoring procedure is then applied to measure the degree of
similarity, from 100% identity down to loose similarity.
• A common reason for performing a DB search is to find a
related gene. A matched gene (or any other sequence) may
provide a clue as to function.
• Alternatively, a sequence with a known function or role can be
used as a query to search the genome of a species.
• The search must be fast and sensitive enough.
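To make the idea of a similarity score concrete, here is a minimal sketch that computes percent identity for two sequences that have already been aligned. The function name `percent_identity` and the use of `-` as the gap character are assumptions made for illustration. Real search tools use richer scoring schemes (substitution matrices and gap penalties), so this is only the simplest possible measure.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of aligned columns that hold the same residue.

    Assumes both strings come from the same alignment, so they have
    equal length. Columns where both sequences have a gap are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    # Keep every column except those that are gap-vs-gap.
    columns = [
        (a, b)
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if not (a == gap and b == gap)
    ]
    if not columns:
        return 0.0

    # A column counts as a match only if both residues are equal and not gaps.
    matches = sum(1 for a, b in columns if a == b and a != gap)
    return 100.0 * matches / len(columns)


identical = percent_identity("ACGT", "ACGT")      # every column matches
one_mismatch = percent_identity("ACGT", "ACGA")   # three of four columns match
```

A score of 100 corresponds to the "100% identity" end of the scale; lower values correspond to looser similarity.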
Basics of similarity searches
FASTA
• FASTA is a program for rapid alignment of pairs of
protein and DNA sequences.
• Exhaustively comparing every nucleotide or amino acid is
impractical, even for powerful computers. FASTA instead
searches for matching sequence patterns (“words”) called
k-tuples. Each pattern consists of k consecutive matches
between the compared sequences.
The ktup value
The ktup (for k-tuple) value is the length of the word
used to search for identity.
The higher the ktup value, the less likely you are to get a match
unless the sequences are identical over that stretch.
Lower ktup values give a slower but more sensitive search.
The higher the ktup value, the faster the analysis.
FASTA ktup values are shorter than BLAST words: 1-2 for
proteins and 4-6 for nucleic acids.
• Using k-tuples, FASTA builds a local alignment.
• Finally, FASTA scores this alignment and outputs
a list of sequences similar to the query, in
descending order of score.
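The word-matching step described above can be sketched as follows. This is a simplified illustration, not the FASTA program itself: it only finds shared k-tuples and ranks database entries by the number of hits, whereas the real program goes on to refine and rescore candidate regions. The function names (`ktuple_index`, `shared_ktuples`, `rank_database`) and the toy database are hypothetical.

```python
from collections import defaultdict


def ktuple_index(seq: str, k: int) -> dict:
    """Map every k-tuple (word of length k) in seq to its start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def shared_ktuples(query: str, subject: str, k: int) -> list:
    """Return (query_pos, subject_pos) pairs where the same k-tuple occurs."""
    index = ktuple_index(subject, k)
    hits = []
    for i in range(len(query) - k + 1):
        # Look up the query word in the subject index instead of
        # comparing every position against every other position.
        for j in index.get(query[i:i + k], ()):
            hits.append((i, j))
    return hits


def rank_database(query: str, database: dict, k: int) -> list:
    """Rank database sequences by number of shared k-tuples, best first."""
    scores = {
        name: len(shared_ktuples(query, seq, k))
        for name, seq in database.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy database (made-up names and sequences, for illustration only).
toy_db = {"seq_a": "TTTTTTTT", "seq_b": "GGACGTCC"}
ranking = rank_database("ACGT", toy_db, k=2)
```

Raising `k` can only reduce the number of hits, which is why a higher ktup value is faster but less sensitive.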
FASTA
ATCGAACCTGGATCGTGGCCATCGAACCTGGATCGTGGCCATCGAACCTGGA
GGCGAACCCCTATCGTGGCGTTACCGCCTTATTGACGGCCATCGAACCTGGA
[Figure: k-tuple matches (k = 6, k = 8, k = 14) between the two sequences above, separated by gaps]
FASTA Format
• A sequence in FASTA format begins with a single-line
description, followed by lines of sequence data.
• The description line is distinguished from the
sequence data by a greater-than (">") symbol in the
first column.
• It is recommended that all lines of text be shorter than
80 characters in length.
FASTA Format
>gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCM
NNSFNVATLPAEKMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTME
KRRVKVYLPQMKIEEKYNLTSVLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSE
DGIEMAGSTGVIEDIKHSPESEQFRADHPFLFLIKHNPTNTIVYFGRYWSP
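The format rules above are simple enough to parse by hand. The following is a minimal sketch of a FASTA reader; the function name `parse_fasta` and the short demo records are assumptions made for illustration. It is deliberately tolerant: it ignores blank lines and strips stray whitespace inside sequence lines.

```python
def parse_fasta(text: str) -> list:
    """Parse FASTA-formatted text into a list of (description, sequence) pairs."""
    records = []
    header = None   # description of the record currently being read
    chunks = []     # sequence lines collected for that record

    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence data: drop any internal whitespace before storing.
            chunks.append("".join(line.split()))

    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical two-record input, for illustration only.
demo = ">seq1 demo record\nACGT\nTTAA\n>seq2\nGG CC\n"
records = parse_fasta(demo)
```

Each record keeps the full description line (without the ">") and the sequence joined into a single string.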
What is an accession number?
An accession number is a label used to
identify a sequence. It is a string of letters
and/or numbers that corresponds to a
molecular sequence.
• Examples (all for retinol-binding protein, RBP4):
• X02775 GenBank genomic DNA sequence
• NM_006744 DNA sequence (from a transcript)
• AAC02945 GenBank protein record
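The three example accessions follow recognizable patterns, which the sketch below uses to guess the record type. This is a simplified illustration under stated assumptions: it covers only the classic GenBank formats (one letter plus five digits, or two letters plus six digits, for nucleotides; three letters plus five digits for proteins) and the RefSeq `NM_` prefix for transcripts. Newer and longer accession formats exist and are not handled. The function name `classify_accession` is hypothetical.

```python
import re

# Simplified patterns; they do not cover every accession format in use.
ACCESSION_PATTERNS = [
    ("RefSeq mRNA", re.compile(r"^NM_\d+$")),
    ("GenBank nucleotide", re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6})$")),
    ("GenBank protein", re.compile(r"^[A-Z]{3}\d{5}$")),
]


def classify_accession(accession: str) -> str:
    """Return a best-guess record type for an accession, or 'unknown'."""
    # Normalize case and drop a version suffix such as ".2" if present.
    core = accession.strip().upper().split(".")[0]
    for label, pattern in ACCESSION_PATTERNS:
        if pattern.match(core):
            return label
    return "unknown"


for acc in ("X02775", "NM_006744", "AAC02945"):
    print(acc, "->", classify_accession(acc))
```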
• Basic Local Alignment Search Tool (BLAST)
• Altschul et al. 1990
• Developed as a new way to perform sequence similarity searches.
• BLAST is faster than FASTA while being nearly as sensitive.
• The minimal “word” (k-tuple) length is slightly higher than
in FASTA: 3 for proteins and 11 for DNA.
BLAST
• BLAST is more than a tool for viewing aligned sequences
or calculating percent identity; it is a program for locating
regions of sequence similarity with a view to comparing
structure and function.
A Practical Example
• A researcher might start with a piece of DNA
rather than a literature citation.
• Here we will:
1. Search a DNA database using a piece of DNA
sequence.
2. Use the results of the search to identify relevant
literature.
The Experiment
1) Grow plants.
2) Extract the DNA.
3) Amplify the desired section of DNA.
4) Generate sequence.
A DNA Sequence
>G08_CHEV11Fed.seq
GTCGACGCGCAAATGGTTCTATATCCATACCAATAGCAGTATCGTTGCCA
TTATCACGAATGGAATTAAGTAAAGTTTTCATTCTATCAATAGACTCTAA
AACCACATCCATGATATCTGGAGTTATTTTTAACTCGCCATGTCTTGCTT
TGTTTAAAACATCCTCCATGTGGTGAGTTAACTTTGTTAAAACATCAAAA
TTTAAGAAGCTTGATGATCCTTTAACCGTATGTGCAACACGGAAAATTCT
ATTTAATAATTCTAAATCTTCTGGATTTGATTCAAGCTCTACTAAATCAT
GGTCGATTTGCTCAACAAGCTCAAAAGCTTCAACCAAAAAGTCTTCAAGT
ATTTCTTGCATATCTTCCATATTTTACCCCTGTTCTTGAGATTGATGTTT
TTTAATAACCTTTGCAATTTCATTGAAGAAATCGCTAGCGTTAAATTTGA
CAAGATAGCCTTCTCCACCAGCTTCTTGAACACCTTTCTCATTCATAAAT
TCATTTGATAAAGATGAGTTAAAGACTATAGGAATATCTTTAAATCCGGG
ATCTTCTTTAATGCGTGCAGCGGATCCCGGGTACCTGCAGAATTCAGCTG
CGCCCTTTAGTTCCTAAAGGGTTTTTATCAGTGCGACAAACTGGGATTTT
ATTTATTCAGCAAGTCTTGTAATTCATCCAAAAAACGGCAAACATGAAAG
CCGTCACAAACGGCATGATGCACTTGAATCGATAAGGGAATATAGTATTT
TCCGCCCTCCTCATAATACTTCCCAAACGTAAATATCGGCAGTAGATAGT
The sequence above is in FASTA format.
A BLAST Search
• Let's see how to submit a sequence query to the
GenBank database.
BLAST Search Screen
Enter sequence.
Select database.
Select BLAST type.
BLAST Results
BLAST programs
• BLASTp: compares an amino acid sequence
query to a protein sequence database
• BLASTn: compares a nucleotide sequence
query to a nucleotide sequence database
• BLASTx: a nucleotide sequence query is
translated and compared to a protein
sequence database
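The translation step that BLASTx performs on a nucleotide query can be sketched as a six-frame translation: three reading frames on the given strand and three on its reverse complement. This is an illustrative sketch using the standard genetic code, not BLAST's own implementation. The function names and the frame labels (`+1` to `-3`) are assumptions, and any codon containing a character other than A, C, G or T is shown as `X`.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna: str, frame: int = 0) -> str:
    """Translate DNA starting at offset frame (0, 1 or 2); '*' marks a stop."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna: str) -> dict:
    """Translate all three frames of both strands."""
    frames = {}
    rc = reverse_complement(dna)
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna, offset)
        frames[f"-{offset + 1}"] = translate(rc, offset)
    return frames


frames = six_frame_translations("ATGGCCTAA")
```

Each of the six protein strings can then be compared against a protein database, which is why a nucleotide query can find protein matches.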


Editor's Notes

  • #5 Bioinformatics is an interdisciplinary research area at the interface between computer science and biological science. A variety of definitions exist in the literature and on the world wide web; some are more inclusive than others. Here, we adopt the definition proposed by Luscombe et al. in defining bioinformatics as a union of biology and informatics: bioinformatics involves the technology that uses computers for storage, retrieval, manipulation, and distribution of information related to biological macromolecules such as DNA, RNA, and proteins. The emphasis here is on the use of computers because most of the tasks in genomic data analysis are highly repetitive or mathematically complex. The use of computers is absolutely indispensable in mining genomes for information gathering and knowledge building. Bioinformatics differs from a related field known as computational biology. Bioinformatics is limited to sequence, structural, and functional analysis of genes and genomes and their corresponding products and is often considered computational molecular biology. However, computational biology encompasses all biological areas that involve computation. For example, mathematical modeling of ecosystems, population dynamics, application of the game theory in behavioral studies, and phylogenetic construction using fossil records all employ computational tools, but do not necessarily involve biological macromolecules.
  • #10 Bioinformatics consists of two subfields: the development of computational tools and databases and the application of these tools and databases in generating biological knowledge to better understand living systems.
  • #12 The analyses of biological data often generate new problems and challenges that in turn spur the development of new and better computational tools.
  • #13 The three aspects of bioinformatics analysis are not isolated but often interact to produce integrated results (see Fig. 1.1). For example, protein structure prediction depends on sequence alignment data; clustering of gene expression profiles requires the use of phylogenetic tree construction methods derived in sequence analysis. Sequence-based promoter prediction is related to functional analysis of co-expressed genes.
  • #14 Bioinformatics has not only become essential for basic genomic and molecular biology research, but is having a major impact on many areas of biotechnology and biomedical sciences. It has applications, for example, in knowledge-based drug design, forensic DNA analysis, and agricultural biotechnology. Computational studies of protein–ligand interactions provide a rational basis for the rapid identification of novel leads for synthetic drugs. Knowledge of the three-dimensional structures of proteins allows molecules to be designed that are capable of binding to the receptor site of a target protein with great affinity and specificity. This informatics-based approach significantly reduces the time and cost necessary to develop drugs with higher potency, fewer side effects, and less toxicity than using the traditional trial-and-error approach. In forensics, results from molecular phylogenetic analysis have been accepted as evidence in criminal courts. Some sophisticated Bayesian statistics and likelihood-based methods for analysis of DNA have been applied in the analysis of forensic identity. It is worth mentioning that genomics and bioinformatics are now poised to revolutionize our healthcare system by developing personalized and customized medicine. The high speed genomic sequencing coupled with sophisticated informatics technology will allow a doctor in a clinic to quickly sequence a patient’s genome and easily detect potential harmful mutations and to engage in early diagnosis and effective treatment of diseases. Bioinformatics tools are being used in agriculture as well. Plant genome databases and gene expression profile analyses have played an important role in the development of new crop varieties that have higher productivity and more resistance to disease.