What we are going to talk about
Why we are doing all this DNA sequencing
What genes look like and where they are
How we can compare sequences
between different species
How genes move between species
Bioinformatics is based on the fact that
DNA sequencing is cheap, and
becoming easier and cheaper very quickly.
The Human Genome Project cost roughly
$3 billion and took 13 years (1990-2003).
Sequencing James Watson’s genome in
2007 cost $2 million and took 2 months
Today, you could get your genome
sequenced for about $100,000 and it
would take a month.
The Archon X prize: you win $10 million if
you can sequence 100 human genomes in
10 days, at a cost of $10,000 per genome.
It is realistic to envision $100 per genome
within 10 years: everyone’s genome could
be sequenced if they wanted or needed
Why it’s useful
All of the information needed to build an
organism is contained in its DNA. If we
could understand it, we would know how
Preventing and curing diseases like cancer (which
is caused by mutations in DNA) and inherited diseases
Curing infectious diseases (everything from AIDS
and malaria to the common cold). If we
understand how a microorganism works, we can
figure out how to block it.
Understanding genetic and evolutionary
relationships between species
Understanding genetic relationships between
humans. Projects exist to understand human
genetic diversity. Also, sequencing the
Ancient DNA: currently it is thought that under ideal
conditions (continuously kept frozen), there is a limit of
about 1 million years for DNA survival. So, Jurassic Park
will probably remain fiction.
From DNA to Gene
But: extracting that information is difficult. How to convert a
string of ACGT’s into knowledge of how the organism works is
Most of the work is on the computer, with key confirming
experiments done in the “wet lab”.
The sequence below contains a gene critical for life: the
gene that initiates replication of the DNA. Can you spot it?
We are now going to spend some time on what genes look
like and how we can find them.
DNA is just a long string of 4
letters (nucleotides, or bases):
Adenine, Cytosine, Guanine, and Thymine,
which we will just refer to as A,
C, G, and T
and we are skipping lots of
Each DNA molecule has 2
strands, with the bases paired
in the center
A on one strand always pairs
with T on the other strand
G pairs with C.
the strands run in opposite
directions (like roads)
Since the two DNA strands are
complementary, there is no
need to write down both strands.
Chromosomes and Genes
each chromosome is a long piece of DNA
The B. megaterium genome is a circle (like most
bacterial genomes) of about 5 million bases.
Human chromosomes are roughly 50-250 million bases
long. We have 46 chromosomes (2 sets of 23, one
set from each parent).
Genes are just regions on that DNA. It is not
obvious where genes are if you look at a DNA sequence.
There is a lot of DNA that is not part of genes: in
humans only 2% at most of the DNA is part of any gene.
Bacteria use more of their DNA: 80% of the B. meg
chromosome is genes.
B. meg has about 1 gene per 1000 base pairs
(bp) of DNA. About 5000 genes
Humans have about 25,000 genes.
We are far more complicated than bacteria:
regulation of the genes is very complicated in
We use the same gene in different ways in different
Genes and Proteins
Most genes code for proteins: each gene
contains the information necessary to
make one protein.
Proteins are the most important type of
Structure: collagen in skin, keratin in hair,
crystallin in eye.
Enzymes: all metabolic transformations,
building up, rearranging, and breaking
down of organic compounds, are done by
enzymes, which are proteins.
Transport: oxygen in the blood is carried by
hemoglobin, everything that goes in or out
of a cell (except water and a few gases) is
carried by proteins.
Also: nutrition (egg yolk), hormones,
The Genetic Code
Proteins are long chains of amino acids.
There are 20 different amino acids coded in DNA.
There are only 4 DNA bases, so you need 3
DNA bases to code for the 20 amino acids
(2 bases would give only 4 x 4 = 16 combinations)
4 x 4 x 4 = 64 possible 3-base combinations (codons)
Each codon codes for one amino acid
Most amino acids have more than one possible codon
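The codon arithmetic above can be checked with a few lines of Python. This is only a sketch: the compact way of writing the standard genetic code (bases in T, C, A, G order, one letter per amino acid, `*` for stop) is a common convention that I am assuming here, not something taken from the slides.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon, codons in TCAG order; '*' = stop.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")

# All 4 x 4 x 4 three-base combinations: TTT, TTC, TTA, TTG, TCT, ...
codons = ["".join(c) for c in product(BASES, repeat=3)]
CODON_TABLE = dict(zip(codons, AMINO_ACIDS))

# How many codons code for each amino acid (and for "stop")?
codons_per_aa = Counter(CODON_TABLE.values())
stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")

print(len(codons))                 # 64
print(stop_codons)                 # ['TAA', 'TAG', 'TGA']
print(len(codons_per_aa) - 1)      # 20 amino acids (not counting stop)
print(codons_per_aa["L"], codons_per_aa["M"])  # Leu has 6 codons, Met has 1
```

The last line shows the degeneracy: Leucine can be written six ways, while Methionine has only ATG.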
Genes start at a start codon and end at a stop codon.
3 codons are stop codons: all genes end at a stop codon.
Start codons are a bit trickier, since they are
used in the middle of genes as well as at the beginning.
In eukaryotes, ATG is almost always the start codon,
making Methionine (Met) the first amino acid in
nearly all proteins (but in many proteins it is immediately removed).
In prokaryotes, ATG, GTG, or TTG can be used as
a start codon. B. meg prefers ATG, but about
30% of the genes start with GTG or TTG.
In bioinformatics, we generally
ignore the fact that RNA uses the
base uracil (U) in place of T.
How do you get a protein from a gene?
A two-step process (called the Central
Dogma of Molecular Biology).
First, the gene has to be copied (transcribed)
into an RNA form.
The RNA copy (messenger RNA) is exactly
like the gene itself, except RNA replaces T with U.
Most gene regulation (whether the gene is
“on” or “off”) happens here
Second, the RNA is translated into protein by
ribosomes, which are complex RNA/protein
With the help of transfer RNA molecules, which
have one end that matches the 3-base codon
and the other end that is attached to the proper amino acid.
The ribosome starts at the start codon and moves
down the messenger RNA, adding one amino
acid at a time to the growing chain. When the
ribosome reaches a stop codon, it falls off,
releasing the new protein.
Here we get a bit subtle.
Since codons consist of 3 bases,
there are 3 “reading frames”
possible on an RNA (or DNA),
depending on whether you start
reading from the first base, the
second base, or the third base.
The different reading frames give
entirely different proteins.
Consider ATGCCATC, and refer to
the genetic code. (X is junk)
Reading frame 1 divides this into ATG-
CCA-TC, which translates to Met-Pro-X
Reading frame 2 divides this into A-
TGC-CAT-C, which translates to X-Cys-His-X
Reading frame 3 divides this into AT-
GCC-ATC, which translates to X-Ala-Ile
Each gene uses a single reading
frame, so once the ribosome gets
started, it just has to count off
groups of 3 bases to produce the protein.
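The ATGCCATC example can be reproduced with a short Python sketch. The function name `reading_frames` is my own, and the output uses one-letter amino acid codes (M = Met, P = Pro, C = Cys, H = His, A = Ala, I = Ile), with X for leftover bases that do not make a full codon.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' = stop.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = dict(zip(("".join(c) for c in product(BASES, repeat=3)),
                       AMINO_ACIDS))


def reading_frames(dna):
    """Translate the three forward reading frames of a DNA string.

    Incomplete groups of 1 or 2 bases at either end are shown as 'X'.
    """
    results = []
    for offset in range(3):
        pieces = []
        if offset:
            pieces.append("X")  # leftover bases before the first full codon
        for i in range(offset, len(dna), 3):
            codon = dna[i:i + 3]
            pieces.append(CODON_TABLE[codon] if len(codon) == 3 else "X")
        results.append("-".join(pieces))
    return results


for number, frame in enumerate(reading_frames("ATGCCATC"), start=1):
    print(number, frame)
# 1 M-P-X
# 2 X-C-H-X
# 3 X-A-I
```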
Open Reading Frames
Ribosomes are very obedient to stop codons:
when a stop codon is reached, the protein is
finished. Thus, all genes end at the first stop
codon in their reading frame.
Since 3 out of the 64 codons are stop codons,
random DNA has stop codons very frequently.
However, genes do something necessary for survival,
so natural selection keeps stop codons out of the
middle of genes.
That is, if a mutation arises that creates a stop codon in the
middle of a gene, the organism dies and leaves no offspring.
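How frequent is "very frequently"? A rough estimate in Python, under the simplifying assumption (mine, not the slides') that all 64 codons are equally likely in random DNA and that neighboring codons are independent:

```python
P_STOP = 3 / 64  # chance that one random codon is a stop codon


def prob_open(n_codons):
    """Probability that n random codons in a row contain no stop codon."""
    return (1 - P_STOP) ** n_codons


# Average spacing between stop codons in one reading frame, in codons.
expected_gap = 1 / P_STOP

print(round(expected_gap, 1))
for n in (10, 50, 100, 300):
    print(n, prob_open(n))
```

Under these assumptions a stop codon turns up about once every 21 codons, and fewer than 1 in 100 random stretches of 100 codons stays open. That is why a long open stretch is a hint that selection is at work.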
Open reading frames (ORFs) are regions with no stop
codons. All genes reside in long open reading frames
Note that stop codons in other reading frames have
no effect on the gene.
The start codon must occur “upstream” in the
same reading frame as the stop codon. It is
usually near the beginning of the ORF, but not
necessarily the first possible start codon.
Determining the exact start codon is not easy or
But, the first start codon in an open reading frame is
always a reasonable guess
This is a map of the stop
codons in all 3 reading
frames in a stretch of DNA.
The long ORF in reading frame
1 is highlighted in black.
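A map like this is easy to compute. The sketch below lists the stop codon positions in each forward reading frame and picks out the longest stop-free stretch. The function names and the example sequence are made up for illustration; note that the code counts frames and positions from 0, so "frame 0" here is reading frame 1 on the slide.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def stop_positions(dna, frame):
    """0-based start positions of the stop codons in one forward frame."""
    return [i for i in range(frame, len(dna) - 2, 3)
            if dna[i:i + 3] in STOP_CODONS]


def longest_orf(dna):
    """Return (frame, start, end) of the longest stretch with no stop codon.

    Positions are 0-based and 'end' is exclusive, so the ORF is dna[start:end].
    """
    best = (0, 0, 0)
    for frame in range(3):
        # End of the last complete codon in this frame.
        last_full = frame + 3 * ((len(dna) - frame) // 3)
        start = frame
        for stop in stop_positions(dna, frame) + [last_full]:
            if stop - start > best[2] - best[1]:
                best = (frame, start, stop)
            start = stop + 3  # the next open stretch begins after this stop
    return best


dna = "ATGGCTAAATGGCTAACCTGA"  # made-up example sequence
for frame in range(3):
    print(frame, stop_positions(dna, frame))
print(longest_orf(dna))  # (0, 0, 18): frame 0 is open until the TGA at 18
```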
Genes can occur on either DNA strand.
If they are on the reverse strand, the DNA sequence needs to be
reversed and complemented
In bacteria, most of the DNA is part of a gene. Most long open
reading frames (say 100 bp or longer) that don’t overlap other
long ORFs contain genes
Most genes do not overlap each other.
Sometimes there are very short overlaps (50 bp or less), especially if
the two genes are functionally related.
In bacteria, genes that affect the same biochemical pathway
or function are sometimes adjacent to each other on the same
DNA strand (not necessarily the same reading frame), allowing
them to be co-regulated
This group of genes is called an “operon”
Operons are mainly a feature of bacteria (and archaea); they are rare in eukaryotes.
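Since genes can sit on either strand, a gene finder has to look at the reverse strand too. Reversing and complementing a sequence takes one line in Python; a minimal sketch, assuming the sequence contains only A, C, G, and T:

```python
# A pairs with T, and G pairs with C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the other strand, written in its own reading direction."""
    return dna.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGCCATC"))  # GATGGCAT
```

Searching the reverse complement for ORFs in its 3 reading frames, together with the 3 frames of the original strand, covers all 6 possible reading frames.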
First job is to find long ORFs, examining the longest ORFs first and
putting together a set with minimal overlaps.
It is also necessary to identify potential start codons, with the furthest
upstream start codon as the easiest choice.
Then, how do we know that the ORF contains a real gene? The
most definitive way is to match it with a gene known from other species.
Conservation of a sequence between species strongly suggests that
the sequence has a function that is being conserved by natural selection.
We compare protein sequences, not DNA, because protein is
more conserved in evolution than DNA
The organism’s survival depends on the protein being functional,
which means having the proper amino acid sequence
Since the genetic code is degenerate, many different DNA sequences
will give identical proteins.
The protein 3-dimensional structure is even more conserved, because
it is more closely related to enzyme activity than the amino acid sequence.
However, we don’t have good ways of determining 3-D structure
from a DNA sequence
So, we compare our ORF sequence to a database of
known protein sequences from many species.
BLAST is the standard sequence alignment tool (BLAST = Basic
Local Alignment Search Tool)
BLAST is based on the concept that if you compare the
same (that is, homologous) protein from many different
species, you can see that some amino acids readily
substitute for each other and others almost never do.
This is summarized in a substitution matrix, which gives a score for each
pair of amino acids that can occupy the same position in the proteins being compared.
BLAST itself is a bit of software that can be run on almost
any computer, but the database needed for a good cross-
species comparison is quite large
the database is called “nr” for “non-redundant”, and it contains
at least 20 Gb of sequence data
We are going to use the BLAST service at UniProt, an
international consortium that maintains a comprehensive
collection of protein sequences
Nearly all derived from DNA sequences: direct sequencing of
proteins is difficult
Terminology: your sequence, which you paste into the box
on the web site, is the query sequence. Sequences in the
database that match yours are called subject sequences.
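To make the substitution matrix idea concrete, here is a toy version in Python. The scores below are illustrative values for a handful of amino acid pairs only, chosen by me to show the pattern (identical or chemically similar pairs score high, unlikely swaps score low). They are not the matrix that BLAST actually uses, and a real matrix covers all 20 x 20 pairs.

```python
# Toy scores, for illustration only (not a real BLAST matrix).
TOY_SCORES = {
    ("L", "L"): 4, ("I", "I"): 4, ("D", "D"): 6, ("E", "E"): 5, ("W", "W"): 11,
    ("L", "I"): 2,   # both hydrophobic: readily substitute for each other
    ("D", "E"): 2,   # both acidic: readily substitute for each other
    ("L", "D"): -4,  # very different chemistry: almost never substituted
}
DEFAULT_MISMATCH = -3  # hypothetical score for any pair not listed above


def pair_score(a, b):
    """Look up the score for one aligned pair, in either order."""
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a), DEFAULT_MISMATCH))


def alignment_score(query, subject):
    """Sum the pair scores over two aligned sequences of equal length."""
    if len(query) != len(subject):
        raise ValueError("sequences must be aligned to the same length")
    return sum(pair_score(q, s) for q, s in zip(query, subject))


print(alignment_score("LDWE", "LDWE"))  # identical: 4 + 6 + 11 + 5 = 26
print(alignment_score("LDWE", "IEWD"))  # similar amino acids: 2 + 2 + 11 + 2 = 17
print(alignment_score("LDWE", "DLEW"))  # poor match: -4 - 4 - 3 - 3 = -14
```

The middle case is the important one: the subject has only one identical amino acid, but it still scores well because the substitutions are the kind that evolution tolerates.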
A Sequence to BLAST
This is a more-or-less
randomly chosen gene
from B. meg.
It is 174 amino acids long
It is written in “fasta”
format: the first line
starts with > and is
followed by an identifier
(ORF00135), and then
an optional description.
After that the sequence
is written without spaces
or other marks.
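Fasta format is simple enough to read with a few lines of Python. A minimal sketch: the function names are my own, and the sequence in the example is a short made-up placeholder, not the real ORF00135 protein.

```python
def _make_record(header, chunks):
    """Split a header into identifier and description; join sequence lines."""
    identifier, _, description = header.partition(" ")
    return identifier, description.strip(), "".join(chunks)


def parse_fasta(text):
    """Parse fasta text into (identifier, description, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append(_make_record(header, chunks))
            header, chunks = line[1:], []  # drop the '>' and start a new record
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is not None:
        records.append(_make_record(header, chunks))
    return records


example = """>ORF00135 placeholder description
MKTAYIAKQR
QISFVKSHFS
"""
for identifier, description, sequence in parse_fasta(example):
    print(identifier, len(sequence), sequence)
```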
Results are arranged with the best ones on top
The most important score is the Expect value, or E-value, which can be
defined as the number of hits this good that a random sequence (with the same length
as yours) would be expected to have in the database.
E-values for good hits are usually written something like: 3e-42, which is
the same as 3 x 10^-42, a very small number
Bad hits are very common, and they have e-values in a more familiar
form: for example, 0.004 or 1.2
A really good e-value is less than 1e-180, which underflows the
computer’s processing capabilities, so it is written as 0.0
E-values are affected by the length of the query sequence as well as
the size of the database, so even perfect matches with short
sequences give poor e-values
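The dependence on query length and database size can be seen from the usual textbook relationship between the E-value and the bit score: E = m x n x 2^(-S), where m is the query length, n is the total length of the database, and S is the bit score. The sketch below uses that formula directly. Real BLAST applies corrections to the lengths, and the database size and bit scores here are round numbers I picked for illustration, so treat the output as a rough guide only.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E-value: chance hits expected at this bit score or better."""
    return query_length * database_length * 2.0 ** (-bit_score)


DATABASE_LENGTH = 2e10  # hypothetical: 20 billion amino acids in the database

# A strong hit for a 174 amino acid query.
print(expect_value(200, 174, DATABASE_LENGTH))

# The same bit score against a database twice as large: E-value doubles.
print(expect_value(200, 174, 2 * DATABASE_LENGTH))

# A low bit score, which is all a short match can accumulate.
print(expect_value(30, 174, DATABASE_LENGTH))
```

A short sequence can only build up a small bit score even when every amino acid matches, so its E-value stays poor, as the last line shows.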
In this case we see many hits with good e-values, and the top e-values all
are quite similar.
Before we can conclude that our protein is a homologue of the proteins
BLAST matches it with, we would like them to have roughly the same
length and have a high percentage of identical amino acids.
the lengths of the query and subject sequences should be within 20%
of each other
There should be at least 30% identical amino acids
In this case we can be quite sure we have a good match
BLAST also returns a fourth value, the bit score, which we are going to
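The two rules of thumb above (lengths within 20%, at least 30% identity) are easy to turn into a quick check. A sketch, with one assumption of mine: "within 20%" is measured relative to the longer of the two sequences. The hit lengths and identities in the example are made up.

```python
def looks_homologous(query_len, subject_len, percent_identity,
                     max_length_diff=0.20, min_identity=30.0):
    """Apply the length and identity rules of thumb to one BLAST hit."""
    longer = max(query_len, subject_len)
    shorter = min(query_len, subject_len)
    if longer == 0:
        return False
    lengths_ok = (longer - shorter) / longer <= max_length_diff
    return lengths_ok and percent_identity >= min_identity


print(looks_homologous(174, 180, 62.0))  # True: similar length, high identity
print(looks_homologous(174, 400, 62.0))  # False: subject is far too long
print(looks_homologous(174, 180, 25.0))  # False: too few identical amino acids
```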
Genes are mostly named for the function of their protein.
At some point, some related genes had their function determined through
lab work: by examining the effects of mutations in the gene, by isolating
and studying the protein produced by the gene, etc.
Enzymes (end in –ase), transport across the cell membrane, genetic
information processing (DNA->RNA->protein), structural proteins, sporulation
and germination, and more!
Many genes (maybe 1/4 of them in a typical genome) have no known
function, although they are found in several different species: conserved
Every new genome has some genes that are unique: no matching BLAST hits in
Are they real genes? Sometimes there is evidence in the form of messenger
RNA, but usually we don’t know
call them hypothetical genes
“putative” means that we think we know the gene’s function but we aren’t
sure. Putative should be followed by the function name.
More Gene Names
One question of interest: do the names of the top
BLAST hits agree with each other? They should, but
there are always annotation errors, and our
knowledge of gene function increases over time.
With some sloppiness due to different naming conventions
practiced by different scientists
Here we have a classic case of mis-naming. Why is
the top hit ribosomal protein S2, with no other hit
having this name?
Ribosomal proteins are highly conserved in evolution
Some checking on my part showed that no homology
exists between this gene and the ribosomal protein S2
found in any other Bacillus species
The other names are similar, although not identical.
What is “PAP2”? A quick Google search shows that it
stands for “phosphatidic acid phosphatase”, which fits the
other names well.
There is probably some uncertainty about its exact
function, given the variety of names and the “family
protein” designation in several of them.
Horizontal and Vertical Gene Transfer
We are accustomed to thinking of genes
being passed from parent to offspring,
always staying within the species, with very
occasional splitting of one species into two.
This is called vertical gene transfer.
But, we know that some genes are
transferred across species lines, not by the
standard genetic mechanisms.
This is called horizontal gene transfer
It is rare in humans and other higher organisms
In bacteria 10% or more of genes have been
transferred in horizontally.
B meg genes that come from vertical
descent have other Bacillus species (or
another closely related species) as the
closest BLAST hit
Horizontally transferred genes can come
from almost anywhere: other bacteria,
Archaea, eukaryotes: plants, animals, fungi
The general mechanisms are well known,
including conjugation (direct transfer of DNA
between two bacteria), transduction (transfer
of DNA using a virus as a carrier), and
transformation (the bacteria pick up DNA
molecules from their environment).
“Kings Play Chess On
Fine Ground Sand”
Bacteria is the domain
Firmicutes is the phylum
Bacilli is the class
Bacillales is the order
Bacillaceae is the family
Bacillus is the genus.
Most of the top hits are from various Bacillus species: there is
little doubt that this gene is the result of normal, vertical descent.
What about “Anoxybacillus flavithermus”?
Click on the accession number to get more information,
including its phylogeny.
Taxonomic lineage = Bacteria > Firmicutes > Bacillales >
Bacillaceae > Anoxybacillus.
Same family as B meg.
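Comparing lineages by eye works for one hit; for many hits a small Python helper does the same job. A sketch: the B. meg and Anoxybacillus lineages are the ones given above, while the Escherichia lineage is my own addition as an example of a distant relative. Because databases sometimes skip a rank (the Anoxybacillus lineage above has no class), the code matches names rather than positions.

```python
B_MEG = ["Bacteria", "Firmicutes", "Bacilli", "Bacillales",
         "Bacillaceae", "Bacillus"]
ANOXYBACILLUS = ["Bacteria", "Firmicutes", "Bacillales",
                 "Bacillaceae", "Anoxybacillus"]
# Added for contrast; not from the BLAST results discussed here.
ESCHERICHIA = ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
               "Enterobacterales", "Enterobacteriaceae", "Escherichia"]


def shared_taxa(lineage_a, lineage_b):
    """Names found in both lineages, in the order they appear in lineage_a."""
    in_b = set(lineage_b)
    return [taxon for taxon in lineage_a if taxon in in_b]


def closest_shared_taxon(lineage_a, lineage_b):
    """The most specific group that contains both organisms, or None."""
    shared = shared_taxa(lineage_a, lineage_b)
    return shared[-1] if shared else None


print(closest_shared_taxon(B_MEG, ANOXYBACILLUS))  # Bacillaceae
print(closest_shared_taxon(B_MEG, ESCHERICHIA))    # Bacteria
```

A top hit that shares only the domain with B. meg, while the Bacillus relatives have no match at all, would be a candidate for horizontal transfer.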
You can see the aligned sequences by
clicking on the “Local alignment” diagrams
Query sequence on top, subject below
Identical amino acids are in the middle of the
alignment, and similar ones have a + sign.
Gaps (regions where one sequence has amino acids
not found in the other sequence) are indicated with
dashes.
This protein is very typical in that the best
matches are in the middle of the protein, with
fewer identical amino acids near the ends.
Also, the match doesn’t quite make it to the very
beginning of the proteins, although they are almost
identical in length.
The active site of most enzymes is in the middle
The ends of proteins are often not well conserved
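The middle line of an alignment display can be rebuilt from the two aligned sequences. A sketch in Python: the list of "similar" pairs is a small hand-picked set for illustration (real BLAST decides this from its substitution matrix), gaps are assumed to be written as dashes, percent identity is counted over the full alignment length including gaps, and the two sequences are made up.

```python
# A few chemically similar pairs, for illustration only.
SIMILAR_PAIRS = {frozenset("IL"), frozenset("IV"), frozenset("LV"),
                 frozenset("DE"), frozenset("KR"), frozenset("ST"),
                 frozenset("FY")}


def midline(query, subject):
    """Letter for identical positions, '+' for similar ones, blank otherwise."""
    marks = []
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            marks.append(" ")  # gap in one of the sequences
        elif q == s:
            marks.append(q)
        elif frozenset((q, s)) in SIMILAR_PAIRS:
            marks.append("+")
        else:
            marks.append(" ")
    return "".join(marks)


def percent_identity(query, subject):
    """Identical positions as a percentage of the alignment length."""
    if not query:
        return 0.0
    identical = sum(q == s and q != "-" for q, s in zip(query, subject))
    return 100.0 * identical / len(query)


query = "MKLV-DES"    # made-up aligned sequences
subject = "MRIVADQS"
print(query)
print(midline(query, subject))   # M++V D S
print(subject)
print(percent_identity(query, subject))  # 50.0
```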
Click on Graphical Overview (just
under the BLAST box on the left) to
get an overview of all the aligned sequences.
The extent of the matching region is
shown with the colored boxes, with
non-matching regions drawn as a
Color indicates percent of identical
You can see that mostly our query
and the various subjects (matches)
line up along almost all of their lengths.
This is a good way to check whether
our start site is reasonable.
A few odd ones lower down.
Genes, and pieces of genes, can
move to new locations in the
genome, fuse with other genes,
break apart, etc. Always subject to
natural selection: if the altered gene
doesn’t work, the organism will die
and we won’t see it.
And of course, sequencing and
annotation errors occur.
The Basic Points
1. DNA can be read in 3 different reading frames,
a consequence of the genetic code (3 bases
= 1 amino acid)
2. Genes are found in long open reading frames,
areas where there are no stop codons.
3. BLAST is the tool we use to compare sequences
• BLAST scores (e-values) estimate how many hits as good as
ours would turn up by chance in the database
4. Gene sequences are conserved between
species by natural selection
• DNA sequences outside of genes are much less conserved
5. Most genes are transferred vertically, from
parent to offspring, but a significant number
are transferred horizontally, from unrelated species
Email me : email@example.com