Bioinformatics MiRON

Welcome to BIOINFORMATICS
-MiRON

Outline
 Workshop chronology (on the handouts)
 Brief background information
 Applications & role
 Bioinformatics tools
 Practical classes
 Problem solving exercises
 What’s expected of you?
 Questions/comments are welcome at all
points

Aims
 To introduce the concepts and language of
bioinformatics.
 To provide an understanding of how nucleic acid
and protein sequence data is obtained and
analysed.
 To develop skills in utilising online databases and
interpreting data.
 To develop an understanding of how bioinformatics
can be applied to solve specific problems in
biomedical science.
 To develop transferable IT and communications
skills.

In this workshop…
 You will learn about how data is
generated and analysed
 As well as what the generated data can
tell us about the molecular biology of
organisms
 And various practical applications of
this knowledge

Why bioinformatics?
 Over the past decade massive amounts
of sequence data have been generated
 This has more recently been joined by
gene expression data obtained from
microarrays and proteomic technologies
 This vast amount of data can only be
analysed using various specialised
computer algorithms

Main Topics (Review)
 Genome organisation and analysis
 Functional genomics
 Advanced techniques in molecular biology
 Archives, information retrieval and alignments:
 Nucleic acid sequence databases; genome
databases; protein sequence databases; database
searching
 Dot plots (similarity matrix) and sequence
alignments (PSI-BLAST)
 Genome expression: Microarray analysis,
proteomics, eukaryotic genome expression

What bioinformaticians think
they are

Examples of Bioinformatics
 Database interfaces
 Genbank/EMBL/DDBJ, Medline, SwissProt, PDB,
…
 Sequence alignment
 BLAST, FASTA
 Multiple sequence alignment
 Clustal W, MultAlin, DiAlign
 Gene finding
 Genscan, GenomeScan, GeneMark, GRAIL
 Protein Domain analysis and identification
 pfam, BLOCKS, ProDom,
 Pattern Identification/Characterization
 Gibbs Sampler, AlignACE, MEME
 Protein Folding prediction
 PredictProtein, SwissModeler

Five websites that all biologists
should know
 NCBI (The National Center for Biotechnology Information)
 http://www.ncbi.nlm.nih.gov/
 EBI (The European Bioinformatics Institute)
 http://www.ebi.ac.uk/
 The Canadian Bioinformatics Resource
 http://www.cbr.nrc.ca/
 SwissProt/ExPASy (Swiss Bioinformatics Resource)
 http://expasy.cbr.nrc.ca/sprot/
 PDB (The Protein Databank)
 http://www.rcsb.org/PDB/

Remember while using web
server-based tools

 You are using someone else’s
computer
 You are (probably) getting a reduced
set of options or capacity
 Servers are great for sporadic or proof-
of-principle work, but for intensive work,
the software should be obtained and
run locally

Human Gene Index Database
 HGI is a database of expressed DNA
sequences, mostly made of ESTs, which are
a type of partial cDNA
 EST stands for Expressed Sequence Tag
 These short sequences were created using
essentially the same method used to make
cDNAs
 As such they represent the expressed part of
a genome and are made from mRNA which is
ultimately expressed from GENES

Similarity Searching
 There are a variety of computer
programs that are used for making
comparisons between DNA sequences.
 The most popular is known as BLAST
(Basic Local Alignment Search Tool)
 BLAST is free at the NCBI website

BLAST is Complex
 Similarity searching relies on the concepts of
alignment and distance between pairs of
sequences.
 Distances can only be measured between
aligned sequences (match vs. mismatch at
each position).
 A similarity search is a process of testing the
best alignment of a query sequence with
every sequence in a database.
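
The idea of distance between aligned sequences can be sketched in a few lines of Python. This is only an illustration, not how BLAST scores alignments: the function names are ours, every mismatch or gap column counts as 1, and the two strings are assumed to be already aligned to the same length. The example pair is the last five columns of the hexokinase 1 alignment shown later in these slides.

```python
def alignment_distance(seq_a: str, seq_b: str) -> int:
    """Count the columns where two aligned sequences differ.

    Assumes both strings are already aligned (same length).
    A gap ('-') or ambiguous base ('N') facing a base counts as a mismatch.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of aligned columns that match exactly."""
    if not seq_a:
        raise ValueError("cannot score an empty alignment")
    matches = len(seq_a) - alignment_distance(seq_a, seq_b)
    return 100.0 * matches / len(seq_a)


query = "TTTGT"
sbjct = "CTTNT"
print(alignment_distance(query, sbjct))  # 2 mismatched columns
print(percent_identity(query, sbjct))    # 60.0
```

A real similarity search repeats this kind of scoring, with a more refined scoring scheme, for the best alignment of the query against every database sequence.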

Workshop 1 (database search & inference of possible
homology)

Please refer to “Getting Started with Bioinformatics”

INTRO TO BLAST
 Basic Local Alignment Search Tool
 It is used to compare a query sequence with those contained in
nucleotide databases by aligning the query sequence with
previously characterised genes, therefore helping in identifying
genes.
 The emphasis of this tool is to find regions of sequence
similarity between two different genes.
 These sequence alignments can yield clues about the structure
and function of a novel sequence, and about its evolutionary
history and homology with other sequences in the database.

BLAST has Automatic
Translation
 BLASTX makes automatic translation (in all
6 reading frames) of your DNA query
sequence to compare with protein
databanks
 TBLASTN makes automatic translation of
an entire DNA database to compare with
your protein query sequence
 Only make a DNA-DNA search if you are
working with a sequence that does not code
for protein.
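
What "translation in all 6 reading frames" means can be shown with a short Python sketch. It is an illustration under stated assumptions, not BLASTX's own code: it uses the standard genetic code, marks stop codons as `*` and any codon containing an ambiguous base as `X`, and the function names are ours.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna: str) -> dict:
    """Translate the three forward and three reverse-strand reading frames."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for name, protein in six_frame_translation("ATGGCCTAA").items():
    print(name, protein)
```

For the toy input `ATGGCCTAA`, frame +1 reads `MA*` (Met, Ala, stop) and frame -1 reads `LGH`; only one of the six frames normally corresponds to the real protein.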

A typical sequence ready for
submission to BLAST
>THC2465887
GGCTGCGGAGGACCGACCGTCCCCACGCCTGCCGCCCCGCGACCCCGACCGCCAGCATGATCGCCGCGCAGCTCCTGGCC
TATTACTTCACGGAGCTGAAGGATGACCAGGTCAAAAAGATTGACAAGTATCTCTATGCCATGCGGCTCTCCGATGAAAC
TCTCATAGATATCATGACTCGCTTCAGGAAGGAGATGAAGAATGGCCTCTCCCGGGATTTTAATCCAACAGCCACAGTCA
AGATGTTGCCAACATTCGTAAGGTCCATTCCTGATGGCTCTGAAAAGGGAGATTTCATTGCCCTGGATCTTGGTGGGTCT
TCCTTTCGAATTCTGCGGGTGCAAGTGAATCATGAGAAAAACCAGAATGTTCACATGGAGTCCGAGGTTTATGACACCCC
AGAGAACATCGTGCACGGCAGTGGAAGCCAGCTTTTTGATCATGTTGCTGAGTGCCTGGGAGATTTCATGGAGAAAAGGA
AGATCAAGGACAAGAAGTTACCTGTGGGATTCACGTTTTCTTTTCCTTGCCAACAATCCAAAATAGATGAGGCCATCCTG
ATCACCTGGACAAAGCGATTTAAAGCGAGCGGAGTGGAAGGAGCAGATGTGGTCAAACTGCTTAACAAAGCCATCAAAAA
GCGAGGGGACTATGATGCCAACATCGTAGCTGTGGTGAA
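
The sequence above is in FASTA format: a header line starting with `>` followed by one or more lines of sequence. As a minimal sketch of how such text could be read before further analysis, the Python below joins the sequence lines under each header. The function name is ours, and the example record is shortened to the first 36 bases of THC2465887 to keep it readable.

```python
def parse_fasta(text: str) -> dict:
    """Return {identifier: sequence} for FASTA-formatted text.

    The identifier is the first word after '>'; sequence lines are
    concatenated and upper-cased. Lines before the first header are ignored.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


# Shortened example record (first 36 bases only)
example = """>THC2465887
GGCTGCGGAGGACCGACC
GTCCCCACGCCTGCCGCC
"""

records = parse_fasta(example)
for identifier, sequence in records.items():
    print(identifier, len(sequence), sequence)
```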

BLAST alignment of human vs. canine partial cDNAs for
hexokinase 1

Query: 3034 TGCATGGTTTGATTTTGACCTGGTC---C---CCC-ACGTGTGAAGTGTAGTGGCATCCA 3086
            |||||| | ||||||  ||||||||   |   ||| ||||||||||| |||||||| |||
Sbjct: 75   TGCATGATCTGATTTCAACCTGGTCGTACGCTCCCCACGTGTGAAGTTTAGTGGCACCCA 134

Query: 3087 TTTCTAATGTATGCATTCATCCAACAGAGTTATTTATTGGCTGGAGATGGAAAATCACAC 3146
            |||| | | | ||||||| ||  ||||||||||||||||||   ||||| ||| |||| |
Sbjct: 135  TTTCCAGTCTCTGCATTCGTCTGACAGAGTTATTTATTGGCCCAAGATGAAAAGTCACGC 194

Query: 3147 CACCTGACAGGCCTTCTGGG-CCTCCAAAGCCCATCCTTGGGGTTCCCCCTCCCTGTGTG 3205
            || | | |||||||| |||| ||||   ||||| |||||||||   |  | |||||||||
Sbjct: 195  CATCCGCCAGGCCTTATGGGGCCTCTGCAGCCCGTCCTTGGGGACACATC-CCCTGTGTG 253

Query: 3206 AAATGTATTATCACCAGCAGACACTGCCGGGCCTCC-C-TCCCGGGGGCACTGCCTGAAG 3263
            ||||||||||||||||||||||||||||||| |||| | |||| |||||| | | |
Sbjct: 254  AAATGTATTATCACCAGCAGACACTGCCGGGACTCCTCCTCCCAGGGGCA-T-CTTAGCT 311

Query: 3264 GCGAG-TGTGGGCATAGCATTAGCTGCTTCCTCCCCTCCTG-GCA-CCCACTGTGGCC-T 3319
            ||    |   | |  ||||    |||||  ||  | ||| | | | |||| |  || | |
Sbjct: 312  GCTTCCTCCCGTCCCAGCACCCACTGCTGTCTGGCGTCCCGAGGATCCCA-TCAGGACGT 370

Query: 3320 GGC-ATCGCATCGTGGTGTGTCAATGCCACAAAATCGTGTGTCCGTGGAACCAGTCCTAG 3378
            | | ||  ||  | |  ||||   | ||    || | || ||| | |  ||   || |
Sbjct: 371  GTCCATGCCACTGAGTCGTGTG--T-CCGTGGAA-C-TG-GTCAGAGCCACT--TCGTGA 422

Query: 3379 CCGCGTGTGACAGTCTTGCATTCTGTTTGTCTCGTGGGGGGAGGTGGACAG-TCCTGCGG 3437
            | |  | || || |||  |  |||  | |  | |  ||          ||  ||||| ||
Sbjct: 423  CAGTCT-TG-CATTCTGTCTGTCT--TGGGGTGGNNGGNAAGNNNNNCCANNTCCTGTGG 478

Query: 3438 -AAAT--GTGTCTTGTCTCCATTTGGA-TAAAA-GGAA-CCAA--CCAACAAACAATGCC 3489
             |||   | | |||| ||||||||||  ||||| |||| ||||  ||||||| || ||||
Sbjct: 479  GAAAAAGGGGCCTTGGCTCCATTTGGGGTAAAAAGGAAACCAAACCCAACAA-CAGTGCC 537

Query: 3490 A-TCACTGG-AATTTCCC-ACCG-CTTT--GTGAGCCGTG-TCGTATGA-CCTAGTAAAC 3541
              ||| ||| |||| ||| |  | ||||  ||||||| || | |||||| |||||  ||
Sbjct: 538  CCTCATTGGGAATTCCCCCATTGGCTTTTTGTGAGCCATGGTTGTATGAACCTAGGTAAA 597

Query: 3542 TTTGT 3546
             || |
Sbjct: 598  CTTNT 602

Understand the
Statistics!
 BLAST produces an E-value for every match
 This is the number of matches with an equal or better
score expected by chance in a database of that size; for
small values it is close to the P value of a statistical test
 A match is generally considered significant if the
E-value < 0.05 (smaller numbers are more significant)
 Very low E-values (e.g. 1e-100) indicate homologs or
identical genes
 Moderate E-values suggest related genes
 Long regions of moderate similarity are more
important than short regions of high identity.
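
How an E-value relates to a P value can be checked numerically. Under the Poisson model normally used for BLAST statistics, the probability of at least one chance hit scoring that well is P = 1 − exp(−E), where E is the expected number of such hits. The sketch below only evaluates that formula; the function name is ours.

```python
import math


def evalue_to_pvalue(e_value: float) -> float:
    """P = 1 - exp(-E): chance of at least one hit this good (Poisson model)."""
    if e_value < 0:
        raise ValueError("E-values cannot be negative")
    # expm1 keeps precision for very small E, where 1 - exp(-E) would round to 0
    return -math.expm1(-e_value)


for e in (1e-100, 0.05, 1.0, 10.0):
    print(f"E = {e:g}  ->  P = {evalue_to_pvalue(e):.6g}")
```

For E = 0.05 this gives P of about 0.0488, so the two numbers are nearly interchangeable for significant hits. They diverge for weak hits: an E-value can exceed 1, while a P value cannot.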

BLAST is Approximate
 BLAST makes similarity searches very
quickly because it takes shortcuts.
 looks for short, nearly identical “words” (11 bases)

 It also makes errors
 misses some important similarities
 makes many incorrect matches
 easily fooled by repeats or skewed composition
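
The word-matching shortcut can be sketched as follows. This is a simplified illustration and not BLAST's actual implementation: it indexes every 11-base word of one subject sequence and reports exact word matches only, leaving out the extension and scoring steps. The function names and the toy sequences are ours.

```python
from collections import defaultdict


def index_words(subject: str, word_size: int = 11) -> dict:
    """Map every word of length word_size to its start positions in subject."""
    index = defaultdict(list)
    for start in range(len(subject) - word_size + 1):
        index[subject[start:start + word_size]].append(start)
    return index


def find_seeds(query: str, subject: str, word_size: int = 11) -> list:
    """Return (query_pos, subject_pos) pairs where an exact word is shared."""
    index = index_words(subject, word_size)
    seeds = []
    for q_pos in range(len(query) - word_size + 1):
        word = query[q_pos:q_pos + word_size]
        for s_pos in index.get(word, []):
            seeds.append((q_pos, s_pos))
    return seeds


# Toy sequences sharing the single 11-base word ACGTACGTTAG
query = "CCACGTACGTTAGTT"
subject = "GGGACGTACGTTAGA"
print(find_seeds(query, subject))  # [(2, 3)]
```

Only regions around such seeds are examined further, which is why the search is fast. It is also why a real similarity with no shared word is missed, and why repeats, which share many words, produce spurious hits.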

Bad Genome
Annotation
 Gene finding is at best only 90%
accurate.
 New sequences are automatically
annotated with BLAST scores.
 Bad annotations propagate
 It’s going to take us 10-20 years or more
to sort this mess out!

Conclusions
 We have only touched small parts of
the elephant
 Trial and error (intelligently) is often
your best tool
 Keep up with the main five sites, and
you’ll have a pretty good idea of what is
happening and available
