Genome Bioinformatics 
y.wurm@qmul.ac.uk
Genomics?
Genomics - Wikipedia 
Genomics is a discipline in genetics that applies recombinant DNA, DNA 
sequencing methods, and bioinformatics to sequence, assemble, and 
analyze the function and structure of genomes (the complete set of DNA 
within a single cell of an organism).[1][2] Advances in genomics have 
triggered a revolution in discovery-based research to understand even 
the most complex biological systems such as the brain.[3] The field includes 
efforts to determine the entire DNA sequence of organisms and fine-scale 
genetic mapping. The field also includes studies of intragenomic 
phenomena such as heterosis, epistasis, pleiotropy and other 
interactions between loci and alleles within the genome.[4]

In contrast, the investigation of the roles and functions of single genes is 
a primary focus of molecular biology or genetics and is a common topic 
of modern medical and biological research. Research of single genes 
does not fall into the definition of genomics unless the aim of this genetic, 
pathway, and functional information analysis is to elucidate its effect on, 
place in, and response to the entire genome's networks.[5][6]
Figure: number of prokaryotic genomes and sequencing costs. Estevezj - CC3 Wikimedia 
http://upload.wikimedia.org/wikipedia/commons/7/73/Number_of_prokaryotic_genomes_and_sequencing_costs.svg
• Genomics 
• Biodiversity assessments 
• Stool microbiome sequencing 
• Personalized medicine 
• Cancer genomics
Challenges 
1. Getting up and running with Unix 
2. Algorithms in Bioinformatics: strengths & weaknesses 
3. Bioinformatics databases 
4. DIY: genome assembly & identifying variants.
Getting up and running with Unix & High Performance Computing (HPC) 
ITS Research Team (Lukasz Zalewski): 
1. Install VirtualBox & Bio-Linux. 
2. Introduction to Unix 
3. Using Apocrita HPC = “the cluster” 
Algorithms for sequence alignment. 
- dotplots 
- the concept of distance: Euclidean, Hamming, Levenshtein 
- dynamic programming and the Smith-Waterman algorithm 
- local, global, semiglobal alignments 
- gap penalty models 
- basics of approximate methods (BLAST) 
- scoring matrices (PAM, BLOSUM) 
- profiles and PSI-BLAST
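
As a rough illustration of the distance measures and the dynamic programming idea listed on this slide, here is a minimal Python sketch. The function names, the scoring values (match=2, mismatch=-1, gap=-2) and the linear gap model are assumptions chosen for the example, not the course's reference implementation. It returns only the best local alignment score, not the alignment itself.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions; only defined for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # match or substitute
        prev = curr
    return prev[-1]


def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score (assumed scores, linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            # Flooring at 0 is what makes the alignment local rather than global.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


# First 12 bases of the Gp-9 and K2000 sequences shown later in these slides.
print(hamming("ATGAAGACGTTC", "ATGAAGACGTTG"))   # 1
print(levenshtein("kitten", "sitting"))          # 3
print(smith_waterman_score("TTACGTTT", "GGACGGG"))  # 6 (the shared "ACG")
```

Hamming distance only counts substitutions, Levenshtein distance also allows insertions and deletions, and the Smith-Waterman recurrence uses the same kind of table but maximises a similarity score instead of minimising a distance.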
Algorithms for sequence alignment. 
Take home message? 
• Algorithms are approximate 
• Results aren’t perfect 
• Computers can get it wrong
BLAST is unable to detect any similarity between 
these 2 sequences: 
Gp-9    1 ATGAAGACGTTCGTATTGCATATTTTTATTTTTGCTCTCGTGGCTTTCGCTTCTGCATCT 60
          ||||||||||| |||||||||| |||||||||   |||||||| |||||||||| |||||
K2000   1 ATGAAGACGTTGGTATTGCATAATTTTATTTT---TCTCGTGGATTTCGCTTCTCCATCT 57

Gp-9   61 CGTGATAGCGCGAGGAAGATAGGATCCCAATATGACAATTACGCGACTTGCTTAGCCGAA 120
          ||||| ||||||| || ||| ||||||||| |||||| |||||| ||||||||| |||||
K2000  58 CGTGAGAGCGCGAAGACGATGGGATCCCAACATGACATTTACGCCACTTGCTTACCCGAA 117

Gp-9  121 CATAGTCTAACAGAGGATGACATCTTCTCGATTGGTGAAGTATCAAGTGGCCAGCACAAA 180
          |||| ||||| || |||| || | ||||||||| ||||||||| |||||||||| |||||
K2000 118 CATAATCTAAGAGGGGATAACGTTTTCTCGATTCGTGAAGTATAAAGTGGCCAGGACAAA 177

Gp-9  181 ACCAATCATGAAGATACCGAACTACACAAAAATGGTTGCGTCATGCAATGTTTGTTAGAA 240
          |||| ||||||||| |||||||| ||||||||| || ||||||| |||||||| ||||||
K2000 178 ACCAGTCATGAAGAAACCGAACTCCACAAAAATCGTCGCGTCATACAATGTTTATTAGAA 237

Gp-9  241 AAAGATGGACTGATGTCTGGAGCTGATTATGATGAAGAGAAAATGCGTGAGGACTATATC 300
           |||||||| |||||| ||| ||| ||||||||| ||| ||||||||||  |||||||||
K2000 238 TAAGATGGAATGATGTGTGGGGCTAATTATGATGGAGAAAAAATGCGTGCTGACTATATC 297

Gp-9  301 AAGGAA------ACAGGTGCTCAACCAGGAGATCAAAGGATAGAAGCTCTGAATGCCTGC 354
          | ||||      || |||| |||||||||| |||| |||| |||| |||||||||| | |
K2000 298 AGGGAATCAGGTACCGGTGGTCAACCAGGACATCAGAGGAGAGAACCTCTGAATGCGTAC 357

Gp-9  355 ATGCAAGAAACAAAAGACATGGAGGATAAATGTGACAAAAGCTTGCTCCTTGTAGCATGT 414
          ||||||||| ||||||| ||| ||| ||||||  |||||||||   | || ||| |||||
K2000 358 ATGCAAGAATCAAAAGATATGCAGGTTAAATGGCACAAAAGCT---TTCTAGTAACATGT 414

Gp-9  415 GTCTTAGCAGCTGAAGCTGTGCTCGCCGATTCTAACGAAGGAGCATAA 462
           | |||||||| | |||||| ||||| |||||| ||||||||| ||||
K2000 415 ATTTTAGCAGCGGGAGCTGTTCTCGCGGATTCTCACGAAGGAGAATAA 462
Algorithms for sequence alignment. 
Take home message? 
• Algorithms are approximate 
• Results depend on: 
  • underlying biology 
  • approximations made by algorithms 
  • search and database size
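
One way to picture "approximations made by algorithms" is the exact-word seeding step used by BLAST-like heuristics: a hit is only extended if the two sequences share at least one exact word of a chosen length. The sketch below is a simplified illustration of that idea, not BLAST's actual implementation. The function name `shared_words` and the toy sequences are assumptions made up for the example.

```python
def shared_words(query: str, subject: str, word_size: int) -> set[str]:
    """Exact words of length word_size that occur in both sequences."""
    def words(seq: str) -> set[str]:
        return {seq[i:i + word_size] for i in range(len(seq) - word_size + 1)}
    return words(query) & words(subject)


# Two toy sequences that differ at a single position (index 4).
a = "ACGTACGGTCA"
b = "ACGTTCGGTCA"

print(sorted(shared_words(a, b, 4)))  # ['ACGT', 'CGGT', 'GGTC', 'GTCA']
print(shared_words(a, b, 7))          # set(): no seed, so nothing to extend
```

With a word size of 7 the two toy sequences share no seed even though 10 of their 11 positions are identical. In the Gp-9 / K2000 alignment shown earlier, the longest run of consecutive matching columns is 11, so a search that requires a longer exact word would find nothing to extend. How BLAST itself behaves depends on the program and parameters used, which the slide does not state.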
Databases for Bioinformatics 
• Biological databases & access to the annotated genomes 
• NCBI 
• Ensembl 
• UCSC 
• Entrez & BioMart 
• GenBank/UniProt 

• Cancer resources and data portals 
• TCGA, ICGC and COSMIC
Databases for Bioinformatics 
Take home message?
Genome Assembly & variant calling 
• Processing raw data 
• Genome assembly algorithms 
• Read mapping 
• Quality Assurance processes 
• Calling & visualising variants 
• Automated gene prediction 
• Doing things in the command-line
Bruno Vieira 
Rodrigo Pracana
Old & modern assembly algorithms 
• Overlap-layout-consensus 
• De Bruijn graph-based
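
To make the second approach concrete, here is a minimal sketch of a de Bruijn graph built from reads, under simplifying assumptions: error-free reads, a single strand, and no repeats longer than k-1. The names `de_bruijn_graph` and `walk_unbranched` and the toy reads are hypothetical. Real assemblers add error correction, reverse complements and graph simplification.

```python
from collections import defaultdict


def de_bruijn_graph(reads: list[str], k: int) -> dict[str, set[str]]:
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def walk_unbranched(graph: dict[str, set[str]], start: str) -> str:
    """Follow edges from start while each node has exactly one successor."""
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]
        if node in seen:  # stop on a cycle instead of looping forever
            break
        seen.add(node)
        contig += node[-1]
    return contig


reads = ["ACGTC", "GTCA"]   # toy overlapping reads
graph = de_bruijn_graph(reads, k=3)
print(graph)                         # edges AC->CG, CG->GT, GT->TC, TC->CA
print(walk_unbranched(graph, "AC"))  # ACGTCA
```

Unlike overlap-layout-consensus, this approach never compares reads to each other directly: the overlap between the two reads emerges from the k-mers they share.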
2014 09-29 2nd monday overview
