2. Outline
Workshop chronology in the handouts
Brief background information
Applications & role
Bioinformatics tools
Practical classes
Problem solving exercises
What’s expected of you?
Questions/comments are welcome at all
points
3. Aims
To introduce the concepts and language of
bioinformatics.
To provide an understanding of how nucleic acid
and protein sequence data is obtained and
analysed.
To develop skills in utilising online databases and
interpreting data.
To develop an understanding of how bioinformatics
can be applied to solve specific problems in
biomedical science.
To develop transferable IT and communications
skills.
4. In this workshop…
You will learn about how data is
generated and analysed
As well as what the generated data can
tell us about the molecular biology of
organisms
And various practical applications of
this knowledge
6. Why bioinformatics?
Over the past decade massive amounts
of sequence data have been generated
This has more recently been joined by
gene expression data obtained from
microarrays and proteomic technologies
This vast amount of data can only be
analysed using various specialised
computer algorithms
7. Main Topics (Review)
Genome organisation and analysis
Functional genomics
Advanced techniques in molecular biology
Archives, information retrieval and alignments:
Nucleic acid sequence databases; genome
databases; protein sequence databases; database
searching
Dot plots (similarity matrix) and sequence
alignments (PSI-BLAST)
Genome expression: Microarray analysis,
proteomics, eukaryotic genome expression
11. Five websites that all biologists
should know
NCBI (The National Center for Biotechnology Information)
http://www.ncbi.nlm.nih.gov/
EBI (The European Bioinformatics Institute)
http://www.ebi.ac.uk/
The Canadian Bioinformatics Resource
http://www.cbr.nrc.ca/
SwissProt/ExPASy (Swiss Bioinformatics Resource)
http://expasy.cbr.nrc.ca/sprot/
PDB (The Protein Databank)
http://www.rcsb.org/PDB/
12. Remember while using web
server-based tools
You are using someone else’s
computer
You are (probably) getting a reduced
set of options or capacity
Servers are great for sporadic or proof-
of-principle work, but for intensive work,
the software should be obtained and
run locally
13. Human Gene Index Database
HGI is a database of expressed DNA
sequences made up mostly of ESTs, which are
a type of partial cDNA
EST stands for Expressed Sequence Tag
These short sequences were created using
essentially the same method used to make
cDNAs
As such, they represent the expressed part of
a genome: they are made from mRNA, which is
ultimately expressed from genes
16. Similarity Searching
There are a variety of computer
programs that are used for making
comparisons between DNA sequences.
The most popular is known as BLAST
(Basic Local Alignment Search Tool)
BLAST is free at the NCBI website
17. BLAST is Complex
Similarity searching relies on the concepts of
alignment and distance between pairs of
sequences.
Distances can only be measured between
aligned sequences (match vs. mismatch at
each position).
A similarity search is a process of testing the
best alignment of a query sequence with
every sequence in a database.
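The match vs. mismatch idea above can be sketched in a few lines of Python. This is only an illustration of measuring distance between two sequences that are already aligned; the function names `count_mismatches` and `percent_identity` are made up for this example, and treating a gap character as a mismatch is an assumption.

```python
def count_mismatches(aligned_a: str, aligned_b: str) -> int:
    """Count aligned positions where the two sequences differ."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Assumption: a gap character ('-') simply counts as a mismatch.
    return sum(1 for x, y in zip(aligned_a, aligned_b) if x != y)


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of aligned positions that match exactly."""
    if not aligned_a:
        raise ValueError("cannot score an empty alignment")
    mismatches = count_mismatches(aligned_a, aligned_b)
    return 100.0 * (len(aligned_a) - mismatches) / len(aligned_a)


query = "GATTACA"
subject = "GACTATA"
print(count_mismatches(query, subject))            # 2 differing positions
print(round(percent_identity(query, subject), 1))  # 71.4
```

Both functions refuse sequences of unequal length, which reflects the point that a distance is only defined once the sequences have been aligned.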
18. Workshop 1 (database search & inference of possible
homology)
Please refer to getting started with bioinformatics
INTRO TO BLAST
Basic Local Alignment Search Tool
It compares a query sequence with those contained in
nucleotide databases by aligning the query sequence with
previously characterised genes, which helps to identify
genes.
The emphasis of this tool is to find regions of sequence
similarity between two different genes.
These sequence alignments can yield clues about the structure
and function of a novel sequence, and about its evolutionary
history and homology with other sequences in the database.
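Finding the best-matching region between two sequences is a local alignment problem. The sketch below scores it exactly with the Smith-Waterman dynamic programme, which is the method BLAST approximates, not what BLAST itself runs. The scoring values (match +1, mismatch -1, gap -2) and the function name `local_alignment_score` are assumptions chosen for illustration.

```python
def local_alignment_score(a: str, b: str,
                          match: int = 1,
                          mismatch: int = -1,
                          gap: int = -2) -> int:
    """Return the best Smith-Waterman local alignment score of a and b."""
    prev = [0] * (len(b) + 1)  # scores for the previous row of the matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Two otherwise unrelated sequences that share the region GATTACA.
print(local_alignment_score("AAAGATTACAAAA", "CCCGATTACACCC"))  # 7
```

The score of 7 comes entirely from the shared region; the dissimilar flanks are ignored, which is what "local" means here.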
19. BLAST has Automatic
Translation
BLASTX automatically translates your DNA
query sequence (in all 6 reading frames)
and compares it with protein
databanks
TBLASTN automatically translates an
entire DNA database and compares it with
your protein query sequence
Only make a DNA-DNA search if you are
working with a sequence that does not code
for protein.
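To make "all 6 reading frames" concrete, here is a minimal Python sketch of six-frame translation using the standard genetic code. It illustrates the idea only and is not BLASTX's own code; the function names and the frame labels (`+1` to `-3`) are assumptions for this example.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons become 'X', stops '*'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna: str) -> dict:
    """Translate three frames on each strand, keyed '+1'..'+3' and '-1'..'-3'."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frame_translation("ATGGCCTAA").items():
    print(name, protein)
```

Only one of the six frames normally corresponds to the real protein, which is why a translated search compares all of them against the protein databank.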
20. A typical sequence ready for
submission to BLAST
>THC2465887
GGCTGCGGAGGACCGACCGTCCCCACGCCTGCCGCCCCGCGACCCCGACCGCCAGCATGATCGCCGCGCAGCTCCTGGCC
TATTACTTCACGGAGCTGAAGGATGACCAGGTCAAAAAGATTGACAAGTATCTCTATGCCATGCGGCTCTCCGATGAAAC
TCTCATAGATATCATGACTCGCTTCAGGAAGGAGATGAAGAATGGCCTCTCCCGGGATTTTAATCCAACAGCCACAGTCA
AGATGTTGCCAACATTCGTAAGGTCCATTCCTGATGGCTCTGAAAAGGGAGATTTCATTGCCCTGGATCTTGGTGGGTCT
TCCTTTCGAATTCTGCGGGTGCAAGTGAATCATGAGAAAAACCAGAATGTTCACATGGAGTCCGAGGTTTATGACACCCC
AGAGAACATCGTGCACGGCAGTGGAAGCCAGCTTTTTGATCATGTTGCTGAGTGCCTGGGAGATTTCATGGAGAAAAGGA
AGATCAAGGACAAGAAGTTACCTGTGGGATTCACGTTTTCTTTTCCTTGCCAACAATCCAAAATAGATGAGGCCATCCTG
ATCACCTGGACAAAGCGATTTAAAGCGAGCGGAGTGGAAGGAGCAGATGTGGTCAAACTGCTTAACAAAGCCATCAAAAA
GCGAGGGGACTATGATGCCAACATCGTAGCTGTGGTGAA
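The sequence above is in FASTA format: a header line starting with `>` followed by lines of sequence. Before submitting, it can help to check that a record parses cleanly. The sketch below is a minimal reader under simple assumptions (the identifier is the first word of the header, and only A, C, G, T and N are accepted); `parse_fasta` and `is_valid_dna` are hypothetical helper names, and the example string is only the first 20 bases of the record above.

```python
def parse_fasta(text: str) -> list:
    """Return a list of (identifier, sequence) tuples from FASTA text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            words = line[1:].split()
            header = words[0] if words else ""  # identifier = first word
            chunks = []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def is_valid_dna(sequence: str) -> bool:
    """True if the sequence is non-empty and uses only A, C, G, T or N."""
    return bool(sequence) and set(sequence) <= set("ACGTN")


# Truncated excerpt of the record shown above, split over two lines.
example = ">THC2465887\nGGCTGCGGAG\nGACCGACCGT\n"
for identifier, sequence in parse_fasta(example):
    print(identifier, len(sequence), is_valid_dna(sequence))
```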
23. Understand the
Statistics!
BLAST produces an E-value for every match
This is the number of matches with an equal or better score expected by
chance; for small values it is nearly equal to the P value in a statistical test
A match is generally considered significant if the
E-value < 0.05 (smaller numbers are more significant)
Very low E-values (e.g. e-100) indicate homologs or
identical genes
Moderate E-values suggest related genes
Long regions of moderate similarity are more
important than short regions of high identity.
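The link between an E-value and a P value can be checked numerically. Under the Poisson model behind BLAST statistics, the probability of seeing at least one chance match as good as the one reported is P = 1 - exp(-E), so the two numbers nearly coincide when E is small and diverge as E grows. The function name `evalue_to_pvalue` is made up for this sketch.

```python
import math


def evalue_to_pvalue(evalue: float) -> float:
    """Convert an E-value to a P value using P = 1 - exp(-E)."""
    if evalue < 0:
        raise ValueError("an E-value cannot be negative")
    # expm1 keeps precision for very small E-values such as 1e-100.
    return -math.expm1(-evalue)


for e in (1e-100, 0.001, 0.05, 1.0, 10.0):
    print(f"E = {e:g}  ->  P = {evalue_to_pvalue(e):.6g}")
```

At E = 0.05 the P value is about 0.0488, while at E = 10 it is close to 1, which is why an E-value above 1 says little about a match.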
24. BLAST is Approximate
BLAST makes similarity searches very
quickly because it takes shortcuts.
looks for short, nearly identical “words” (11 bases)
It also makes errors
misses some important similarities
makes many incorrect matches
easily fooled by repeats or skewed composition
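The word shortcut can be sketched as follows: index every word of a database sequence, then look up each word of the query to find seed positions worth extending. This is a simplified illustration, not BLAST's implementation. It assumes exact word matches only, and the names `index_words` and `find_seeds` are hypothetical; the default word size of 11 follows the slide.

```python
from collections import defaultdict


def index_words(sequence: str, word_size: int = 11) -> dict:
    """Map every word of length word_size to its start positions."""
    index = defaultdict(list)
    for i in range(len(sequence) - word_size + 1):
        index[sequence[i:i + word_size]].append(i)
    return index


def find_seeds(query: str, subject: str, word_size: int = 11) -> list:
    """Return (query_pos, subject_pos) pairs where an exact word is shared."""
    index = index_words(subject, word_size)
    seeds = []
    for q in range(len(query) - word_size + 1):
        for s in index.get(query[q:q + word_size], []):
            seeds.append((q, s))
    return seeds


# Small word size so the example is easy to follow by eye.
print(find_seeds("ACGTACGT", "TTACGTAA", word_size=4))

# Eight of nine bases agree, but no exact 5-base word is shared,
# so the shortcut finds no seed and the similarity is missed.
print(find_seeds("AAAACAAAA", "AAAAGAAAA", word_size=5))
```

The second call shows the trade-off from the slide: skipping everything without a shared word is what makes the search fast, and also what makes it miss some real similarities.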
25. Bad Genome
Annotation
Gene finding is at best only 90%
accurate.
New sequences are automatically
annotated with BLAST scores.
Bad annotations propagate
It’s going to take us 10-20 years or more
to sort this mess out!
26. Conclusions
We have only touched small parts of
the elephant
Trial and error (intelligently) is often
your best tool
Keep up with the main five sites, and
you’ll have a pretty good idea of what is
happening and available