Genome Bioinformatics
y.wurm@qmul.ac.uk
© Alex Wild & others
© National Geographic
Atta leaf-cutter ants
Oecophylla weaver ants
© ameisenforum.de
© forestryimages.org © wynnie@flickr
Tofilski et al 2008
Forelius pusillus hides the nest entrance at night
Before
Workers staying outside die
“preventive self-sacrifice”
Dorylus driver ants: ants with no home
© BBC
Animal biomass (Brazilian rainforest)
from Fittkau & Klinge 1973
Other insects
Amphibians
Reptiles
Birds
Mammals
Earthworms
Spiders
Soil fauna excluding earthworms, ants & termites
Ants & termites
Well-studied:
• behavior
• morphology
• evolutionary context
• ecology
Genetic basis?
This changes everything.
454
Illumina
SOLiD...
Any lab can sequence anything!
Genomics?
http://gregoryzynda.com/ncbi/genome/python/2014/03/31/ncbi-genome.html
Genomics is taking over biology
• Applied:
• Cancer genomics
• Biodiversity assessments
• Stool microbiome sequencing
• Personalized medicine
• Research:
• Development
• Evolution
• Ecology
• … how life works!
BIG Challenges
1. Unix
2. Bioinformatics algorithms
3. Databases
4. Sequencing technologies
5. DIY: assembly & identifying variants

Module structure
Unix & High Performance Computing (HPC)
- Introduction to Unix
- Using Apocrita HPC
- File transfer
- Installing things
- Visit to compute cluster?
Because Excel just ain’t enough.
Challenges
1. Getting up and running with Unix
2. Algorithms in Bioinformatics: strengths & weaknesses
3. Bioinformatics databases
4. Sequencing technologies
5. Genome assembly & identifying variants.

Algorithms for sequence alignment
• Dotplots
• The concept of distance: Euclidean, Hamming, Levenshtein
• Dynamic programming and the Smith-Waterman algorithm
• Local, global, semiglobal alignments
• Gap penalty models
• Basics of approximate methods (BLAST)
• Scoring matrices (PAM, BLOSUM)
• Profiles and PSI-BLAST
Dr. Fabrizio Smeraldi (EECS)
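To make the distance and dynamic-programming bullets above concrete, here is a minimal Python sketch. It is illustrative only: the function names and the scoring values (match=2, mismatch=-1, linear gap=-2) are assumptions chosen for the example, not values taken from the module.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # match or substitute
        prev = curr
    return prev[-1]


def smith_waterman_score(a: str, b: str, match=2, mismatch=-1, gap=-2) -> int:
    """Best local alignment score (Smith-Waterman, linear gap penalty).

    The scoring values are arbitrary illustrative defaults.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            # The 0 lets an alignment restart anywhere: this is what makes it local.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


print(hamming("GATTACA", "GACTATA"))
print(levenshtein("kitten", "sitting"))
print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Hamming distance only compares position by position, Levenshtein also allows insertions and deletions, and Smith-Waterman reuses the same table-filling idea but maximises a score and clamps it at zero.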
Take home message?
• Algorithms are approximate
• Results depend on:
• underlying biology
• approximations made by algorithms
• search and database size
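One way to see why results depend on search and database size is the Karlin-Altschul expectation used for BLAST-style E-values, E = K·m·n·e^(-λS), where m is the query length, n the database length and S the raw score. The sketch below is a rough illustration: the function name is hypothetical, and the default `lam` and `k` values are placeholders for illustration, since the real values depend on the scoring system in use.

```python
import math


def expected_chance_hits(score, query_len, db_len, lam=0.267, k=0.041):
    """Expected number of chance hits scoring at least `score`.

    E = K * m * n * exp(-lambda * S). The defaults for `lam` and `k`
    are illustrative placeholders, not authoritative parameters.
    """
    return k * query_len * db_len * math.exp(-lam * score)


small_db = expected_chance_hits(score=100, query_len=462, db_len=1_000_000)
large_db = expected_chance_hits(score=100, query_len=462, db_len=2_000_000)
print(large_db / small_db)  # the same hit looks half as significant in a database twice the size
```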
BLAST is unable to detect any similarity between these 2 sequences:
Gp-9    1 ATGAAGACGTTCGTATTGCATATTTTTATTTTTGCTCTCGTGGCTTTCGCTTCTGCATCT 60
          ||||||||||| |||||||||| |||||||||   |||||||| |||||||||| |||||
K2000   1 ATGAAGACGTTGGTATTGCATAATTTTATTTT---TCTCGTGGATTTCGCTTCTCCATCT 57
Gp-9   61 CGTGATAGCGCGAGGAAGATAGGATCCCAATATGACAATTACGCGACTTGCTTAGCCGAA 120
          ||||| ||||||| || ||| ||||||||| |||||| |||||| ||||||||| |||||
K2000  58 CGTGAGAGCGCGAAGACGATGGGATCCCAACATGACATTTACGCCACTTGCTTACCCGAA 117
Gp-9  121 CATAGTCTAACAGAGGATGACATCTTCTCGATTGGTGAAGTATCAAGTGGCCAGCACAAA 180
          |||| ||||| || |||| || | ||||||||| ||||||||| |||||||||| |||||
K2000 118 CATAATCTAAGAGGGGATAACGTTTTCTCGATTCGTGAAGTATAAAGTGGCCAGGACAAA 177
Gp-9  181 ACCAATCATGAAGATACCGAACTACACAAAAATGGTTGCGTCATGCAATGTTTGTTAGAA 240
          |||| ||||||||| |||||||| ||||||||| || ||||||| |||||||| ||||||
K2000 178 ACCAGTCATGAAGAAACCGAACTCCACAAAAATCGTCGCGTCATACAATGTTTATTAGAA 237
Gp-9  241 AAAGATGGACTGATGTCTGGAGCTGATTATGATGAAGAGAAAATGCGTGAGGACTATATC 300
           |||||||| |||||| ||| ||| ||||||||| ||| ||||||||||  |||||||||
K2000 238 TAAGATGGAATGATGTGTGGGGCTAATTATGATGGAGAAAAAATGCGTGCTGACTATATC 297
Gp-9  301 AAGGAA------ACAGGTGCTCAACCAGGAGATCAAAGGATAGAAGCTCTGAATGCCTGC 354
          | ||||      || |||| |||||||||| |||| |||| |||| |||||||||| | |
K2000 298 AGGGAATCAGGTACCGGTGGTCAACCAGGACATCAGAGGAGAGAACCTCTGAATGCGTAC 357
Gp-9  355 ATGCAAGAAACAAAAGACATGGAGGATAAATGTGACAAAAGCTTGCTCCTTGTAGCATGT 414
          ||||||||| ||||||| ||| ||| ||||||  |||||||||   | || ||| |||||
K2000 358 ATGCAAGAATCAAAAGATATGCAGGTTAAATGGCACAAAAGCT---TTCTAGTAACATGT 414
Gp-9  415 GTCTTAGCAGCTGAAGCTGTGCTCGCCGATTCTAACGAAGGAGCATAA 462
           | |||||||| | |||||| ||||| |||||| ||||||||| ||||
K2000 415 ATTTTAGCAGCGGGAGCTGTTCTCGCGGATTCTCACGAAGGAGAATAA 462
WTF?
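One plausible explanation, offered as an assumption because the slide does not state which BLAST program or settings were used: seed-and-extend heuristics only start an alignment from an exact word match of a minimum length (megablast in BLAST+ defaults to a word size of 28). The sketch below takes the first 60 alignment columns shown above and counts identical columns and the longest uninterrupted run of matches. The function name is hypothetical.

```python
# First 60 alignment columns of the Gp-9 / K2000 alignment shown above.
GP9 = "ATGAAGACGTTCGTATTGCATATTTTTATTTTTGCTCTCGTGGCTTTCGCTTCTGCATCT"
K2000 = "ATGAAGACGTTGGTATTGCATAATTTTATTTT---TCTCGTGGATTTCGCTTCTCCATCT"


def identity_and_longest_run(row_a: str, row_b: str):
    """Return (identical columns, longest run of consecutive identical columns)."""
    matches = longest = run = 0
    for x, y in zip(row_a, row_b):
        if x == y and x != "-":
            matches += 1
            run += 1
            longest = max(longest, run)
        else:
            run = 0  # a mismatch or a gap breaks any exact seed
    return matches, longest


matches, longest = identity_and_longest_run(GP9, K2000)
print(f"{matches}/{len(GP9)} identical columns, longest exact run = {longest}")
# prints "53/60 identical columns, longest exact run = 11"
```

High overall identity with mismatches sprinkled every ten bases or so leaves no exact stretch long enough for a long seed word, so a heuristic search can miss an alignment that dynamic programming would find.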
Databases for Bioinformatics
• Biological databases & access to the annotated genomes
• NCBI
• Ensembl
• UCSC
• Entrez & BioMart
• GenBank/UniProt
• Cancer resources and data portals
• TCGA, ICGC and COSMIC
Dr. Claude Chelala & Ajanthah Sangaralingam (Barts)
Genome assembly & variant calling
• Sequencing data quality assurance
• Processing raw reads
• Genome assembly
• Assembly quality assurance
• Variant calling
• Visualizing variants
• Gene prediction
• Gene curation
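As a small illustration of the assembly quality assurance step, the sketch below computes N50, one commonly used summary of assembly contiguity. The slides do not name N50, so treat it as an example metric, not the module's prescribed method. The function name and contig lengths are made up.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half of all assembled bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


# Hypothetical contig lengths: total 290, and the two largest (80 + 70) already cover half.
print(n50([80, 70, 50, 40, 30, 20]))
```

N50 rewards long contigs but says nothing about whether they are correct, so it is usually read alongside other checks.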
Bruno Vieira
Rodrigo Pracana
QMPlus Forum!
bio-informatics
(bi-oh-in-foh-shit-I-don’t-understand-how-that-works)
From the word bio, meaning “of or related to biology”
and informatics, meaning “absolutely anything your collaborators or
boss don’t understand about maths, statistics or computing, including
why they can’t print and how the internet works”
https://biomickwatson.wordpress.com/2015/09/07/finally-a-definition-for-bioinformatics/
2015 09-28 bio721 intro
