Fundamental concepts in genome
sequencing and genome
University of Adelaide
• A typical mammalian chromosome is between
20 and 100 x 10^6 bp long.
• Maximum length for current sequencing
• Typical approach is to use massively parallel
sequencing to sequence hundreds of millions
of genome fragments.
The Future: Single Molecule Sequencing
• Genome sequences are not fixed; they are
continually changing as a result of different
types of mutations.
• Mutations can occur at different scales:
– Single/several base(s)
– Hundreds/thousands of bases
• Can involve substitution or insertion/deletion
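The three basic operations above (substitution, insertion, deletion) can be sketched on a toy sequence. This is only an illustration: the function names, the 12 bp reference string and the chosen positions are made up for the example and do not come from any real genome.

```python
def substitute(seq, pos, base):
    """Replace the base at 0-based position pos with base."""
    return seq[:pos] + base + seq[pos + 1:]


def insert(seq, pos, bases):
    """Insert bases immediately before 0-based position pos."""
    return seq[:pos] + bases + seq[pos:]


def delete(seq, pos, length):
    """Remove length bases starting at 0-based position pos."""
    return seq[:pos] + seq[pos + length:]


# Hypothetical 12 bp reference, used only for illustration.
reference = "ATGGCCTTAGGC"

snp = substitute(reference, 3, "A")      # single-base substitution
small_ins = insert(reference, 6, "CAG")  # 3 bp insertion
small_del = delete(reference, 6, 3)      # 3 bp deletion

print(snp)        # ATGACCTTAGGC
print(small_ins)  # ATGGCCCAGTTAGGC
print(small_del)  # ATGGCCGGC
```

The same functions apply at larger scales (hundreds or thousands of bases); only the length of the inserted or deleted segment changes.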
Mutation vs Variation
• Many alterations to genome sequences have
no apparent phenotype and so are considered
variation rather than mutation. Mutation
implies an alteration of phenotype or
• Another way to describe variation is
Replication vs Recombination vs Transposition
• Replication errors contribute primarily to
nucleotide substitution, Simple Sequence
Repeat expansion and small indels.
• Recombination based errors contribute to
large scale indels/translocations.
• Transposition causes indels/rearrangements.
DNA replication errors can cause SNPs, but environment and
fundamental aspects of nucleotide chemistry contribute as well.
Keller I, Bensasson D, Nichols RA (2007) Transition-Transversion Bias Is Not Universal: A Counter Example from Grasshopper
Pseudogenes. PLoS Genet 3(2): e22. doi:10.1371/journal.pgen.0030022
Environmental causes of SNPs
Dimers can form between two adjacent
pyrimidines. Shown here are (A) a thymine-thymine
cyclobutane pyrimidine dimer and (B) a
thymine-cytosine dimer, and their
photoreactivation by the enzyme
photolyase in the presence of light.
If a dimer is not recognised in time, it
can cause a mutation.
Once the mutation has occurred it
cannot be detected or repaired.
Silent (neutral substitution)
Missense (alters the amino acid)
Nonsense (inserts a stop codon)
Splice (alters a splice site)
Promoter (alters a motif)
3’UTR (alters a miRNA target)
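The first three classes in this list can be told apart from the codon change alone. A minimal sketch, assuming the standard genetic code and a SNP that falls inside a known reading frame; `CODON_TABLE` and `classify_coding_snp` are names invented for this example. Splice, promoter and 3'UTR effects depend on gene annotation and are not covered here.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
# (TTT, TTC, TTA, TTG, TCT, ...). "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(ref_codon, alt_codon):
    """Classify a codon change as silent, nonsense or missense."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "silent"
    if alt_aa == "*":
        return "nonsense"
    # A stop-to-sense change would also come back as "missense" here;
    # the slide list has no separate class for it.
    return "missense"


print(classify_coding_snp("GAA", "GAG"))  # silent   (Glu -> Glu)
print(classify_coding_snp("GAG", "GTG"))  # missense (Glu -> Val)
print(classify_coding_snp("TAC", "TAA"))  # nonsense (Tyr -> stop)
```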
Some standard examples
Single base mutation rate
• It has been estimated that in humans and
other mammals, uncorrected errors (=
mutations) occur at the rate of about 1 in
every 50 million (5 x 10^7) nucleotides added to
• But with 6 x 10^9 base pairs in a human somatic
cell, that means that each new cell contains
some 120 new mutations.
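The figure of 120 follows directly from the two numbers on this slide. A short check of the arithmetic, using only those values; the variable names are illustrative.

```python
# Values taken from the slide.
NUCLEOTIDES_PER_ERROR = 5 * 10**7  # one uncorrected error per 50 million nucleotides
GENOME_SIZE_BP = 6 * 10**9         # base pairs in a human somatic (diploid) cell

# Expected new mutations each time the whole genome is copied once.
mutations_per_division = GENOME_SIZE_BP / NUCLEOTIDES_PER_ERROR

print(mutations_per_division)  # 120.0
```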
Larger scale changes
• Once you get beyond single nucleotide
alterations, changes tend to favour indel
events rather than substitution events.
• You can have indels of different sizes; small
ones are consequences of polymerase error.
– For example: Simple Sequence Repeats due to
• Can lead to formation of arrays of tandemly
duplicated genes or deletions via unequal crossing over.
• Can lead to inversion.
• Can lead to gene conversion.
• Recombination mediated mutation is thought to be
one of the largest sources of mutation in humans.
Repeated DNA sequences make up
much of the genome.
• Most DNA repeats are repeated because they
can copy themselves.
• They are called Transposons or
Retrotransposons, depending on how they
What are Transposons?
• Transposons are segments of DNA that can move
around to different positions in the genome of a
• These mobile segments of DNA are sometimes
called "jumping genes".
• ~45% of a typical mammalian genome is made up
of transposons, also referred to as repeats or
• Main type = Retrotransposons
Alu repeats and human disease
Deininger & Batzer, Molecular Genetics and Metabolism 67, 183-193 (1999), Article ID mgme.1999.2864
Global repeat correlations
Cut chromosomes into 1.5 Mbp bins.
For each bin, count the number of repeats for each of the repeat
groups, the number of genes that started in the bin (gene density),
and the percent of known bases that were G or C (G+C content).
All bins with at least 1 x 10^6 bp excluding Ns were used to calculate
Spearman’s rank correlations between each repeat group and the
other repeat groups, as well as gene density and G+C content.
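The binning and correlation steps above can be sketched in plain Python. This is a rough illustration under stated assumptions, not the analysis code behind the slide: the function names, the toy coordinates and the three-bin example are all invented, and a real analysis would more likely use a statistics library for the rank correlation.

```python
from collections import Counter

BIN_SIZE = 1_500_000  # 1.5 Mbp bins, as on the slide


def count_per_bin(positions, n_bins, bin_size=BIN_SIZE):
    """Count how many features (repeats or gene starts) begin in each bin."""
    counts = Counter(pos // bin_size for pos in positions)
    return [counts.get(i, 0) for i in range(n_bins)]


def gc_percent(seq):
    """Percent of known bases (A, C, G, T) that are G or C; Ns are ignored."""
    known = [b for b in seq.upper() if b in "ACGT"]
    return 100 * sum(b in "GC" for b in known) / len(known)


def has_enough_known_bases(seq, min_known=1_000_000):
    """Keep a bin only if it has at least min_known non-N bases."""
    return sum(b in "ACGT" for b in seq.upper()) >= min_known


def rank(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))


# Toy start coordinates on one hypothetical 4.5 Mbp chromosome (3 bins).
repeat_starts = [100, 200_000, 1_600_000, 3_100_000, 3_200_000, 3_300_000]
gene_starts = [50_000, 1_000_000, 1_700_000, 3_050_000,
               3_500_000, 4_000_000, 4_400_000]

repeats_per_bin = count_per_bin(repeat_starts, n_bins=3)  # [2, 1, 3]
genes_per_bin = count_per_bin(gene_starts, n_bins=3)      # [2, 1, 4]

print(spearman(repeats_per_bin, genes_per_bin))  # 1.0
```

In practice the same `spearman` call would be repeated for every pair of repeat groups, and for each repeat group against gene density and G+C content, using only the bins that pass the known-base filter.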
Major causes of variation
NAHR contributes to
one of the most
significant types of
CNV/SV has been
linked to heritable
and to cancer
Genomes are constantly changing.
Replication error drives bp level change.
Recombination drives gene level change.
Retrotransposons, which comprise ~40% of a
mammalian genome, drive whole genome change in
terms of insertion of new DNA, creation of new
genes and regulatory regions.
• The latter probably account for many of the
differences that arise during speciation.