FBW 
30-09-2014 
Wim Van Criekinge
What is Bioinformatics ? 
• Application of information technology to the 
storage, management and analysis of biological 
information (Facilitated by the use of 
computers) 
– Sequence analysis? 
– Molecular modeling (HTX) ? 
– Phylogeny/evolution? 
– Ecology and population studies? 
– Medical informatics? 
– Image Analysis ? 
– Statistics ? AI ? 
– Heavy-current or light-current engineering?
Promises of genomics and bioinformatics 
• Medicine (Pharma) 
– Genome analysis allows the targeting of genetic 
diseases 
– The effect of a disease or of a therapeutic on RNA and 
protein levels can be elucidated 
– Knowledge of protein structure facilitates drug design 
– Understanding of genomic variation allows the tailoring 
of medical treatment to the individual’s genetic make-up 
• The same techniques can be applied to crop (Agro) and 
livestock improvement (Animal Health)
Bioinformatics: What’s in a name? 
• Early 1990s 
• “Bio-informatics”: 
– convergence of the explosive growth in 
biotechnology with the parallel explosive growth 
in information technology 
• Not new: people have used “computers” 
in biology for more than 30 years 
• In silico biology, database biology, ...
[Figure: computing power and GenBank size (log scale) versus time (years)]
Happy Birthday …
PCR + dye termination 
Suddenly, a flash of insight caused him to pull the car 
off the road and stop. He awakened his friend 
dozing in the passenger seat and excitedly 
explained to her that he had hit upon a solution - 
not to his original problem, but to one of even 
greater significance. Kary Mullis had just conceived 
of a simple method for producing virtually unlimited 
copies of a specific DNA sequence in a test tube - 
the polymerase chain reaction (PCR)
Bioinformatics, a scientific discipline … 
[Diagram: Bioinformatics at the intersection of Math, Informatics, Computer Science, Theoretical Biology, Computational Biology and (Molecular) Biology. 
Activities shown around it: Algorithm Development, Interface Design, AI, Image Analysis, structure prediction (HTX), Expert Annotation, Sequence Analysis, NP, Datamining. 
Related labels: Discovery Informatics – Computational Genomics]
Goal of the course 
• More than an introduction to ...: the course 
aims to provide an underlying insight into 
the various techniques. 
• Beyond the use of recipes, which can be 
found in parts of the syllabus, an 
understanding of 
– how databases work 
– and the underlying algorithms 
• makes it possible 
– to apply changing interfaces to new 
problems.
Lesson contents: Bioinformatics
Exam 
• Theory 
– A part on a publication of your own choice that is 
related to the course 
• e.g. Bioinformatics or Computational Biology 
– Three insight questions about the course (including  !!) 
• Practical (“open-book”) 
– About four exercises, most of which require writing a 
program 
• Marks are split 50/50
Course 
• Syllabus 25 Euro 
– Syllabus 
• V|Podcasts 
• Weblems – Screencasts
biobix 
wvcrieki 
biobix.be 
bioinformatics.be
• Timeline: Margaret 
Dayhoff …
Setting the stage …
[Image: Nature cover, “the human genome”]
Genome Size (bp) 
E. coli = 4.2 x 10^6 
Yeast = 18 x 10^6 
Arabidopsis = 80 x 10^6 
C. elegans = 100 x 10^6 
Drosophila = 180 x 10^6 
Human/Rat/Mouse = 3000 x 10^6 
Lily = 300 000 x 10^6 
With ... : 99.9 % 
To primates: 99% 
DOGS: Database Of Genome Sizes
Biological Research 
Adapted from John McPherson, OICR
And this is just the beginning …. 
Next Generation Sequencing is here
Basics of the “old” technology 
• Clone the DNA. 
• Generate a ladder of labeled (colored) molecules 
that are different by 1 nucleotide. 
• Separate mixture on some matrix. 
• Detect fluorochrome by laser. 
• Interpret peaks as string of DNA. 
• Strings are 500 to 1,000 letters long 
• 1 machine generates 57,000 nucleotides/run 
• Assemble all strings into a genome.
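The last step above, assembling strings into a genome, can be illustrated with a toy sketch. This is a greedy overlap merge, not the method used by any real assembler; the function names `overlap` and `greedy_assemble` and the `min_len` parameter are hypothetical, and the reads are assumed to be error-free and from one strand.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the longest such overlap is shorter than min_len.
    """
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two strings with the largest overlap (toy example)."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, -1, -1)  # (overlap length, index of left string, index of right string)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return contigs  # nothing left to merge
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]


if __name__ == "__main__":
    print(greedy_assemble(["ATGGCGT", "GCGTACA", "ACATTAG"]))
```

With these three made-up reads the sketch returns a single contig. Repeats longer than the reads would mislead a greedy merge like this one, which is one reason real assembly is much harder.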
Basics of the “new” technology 
• Get DNA. 
• Attach it to something. 
• Extend and amplify signal with some color 
scheme. 
• Detect fluorochrome by microscopy. 
• Interpret series of spots as short strings of DNA. 
• Strings are 30-300 letters long 
• Multiple images are interpreted as 0.4 to 1.2 
Gb/run (1,200,000,000 letters/day). 
• Map or align strings to one or many genomes.
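The final step, mapping short strings to a genome, can be sketched with a simple k-mer index. This is a minimal illustration assuming exact matches, a single strand and reads of at least `k` letters; real read mappers tolerate mismatches and use far more compact indexes. The names `build_index` and `map_read` are hypothetical.

```python
from collections import defaultdict


def build_index(genome: str, k: int) -> dict[str, list[int]]:
    """Record every start position of every k-mer in the genome."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def map_read(read: str, genome: str, index: dict[str, list[int]], k: int) -> list[int]:
    """Return all positions where the whole read matches the genome exactly.

    The first k letters of the read are used as a seed to look up candidates.
    """
    seed = read[:k]
    return [pos for pos in index.get(seed, []) if genome.startswith(read, pos)]


if __name__ == "__main__":
    genome = "ACGTACGTTAGC"  # made-up toy genome
    index = build_index(genome, k=4)
    print(map_read("ACGTT", genome, index, k=4))  # one candidate survives
    print(map_read("ACGT", genome, index, k=4))   # ambiguous: two positions
```

The second call shows why very short strings are a problem: the same four letters occur twice, so the read cannot be placed uniquely.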
Next Generation Technologies 
• 454 
–Emulsion PCR 
–Polymerase 
–Natural Nucleotides 
• 20-100Mb for 5-15k 
–1% error rate 
–Homopolymers
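Why homopolymers are the weak spot can be shown with a small sketch. It assumes the usual description of pyrosequencing: nucleotides are flowed one at a time in a fixed cycle and the light signal is roughly proportional to the number of identical bases incorporated. The function `decode_flowgram`, the flow order "TACG" and the signal values are assumptions for illustration, not taken from the slide.

```python
def decode_flowgram(signals: list[float], flow_order: str = "TACG") -> str:
    """Turn per-flow signal intensities into a base sequence by rounding.

    Each signal is read as 'how many copies of the flowed base were added'.
    """
    bases = []
    for i, signal in enumerate(signals):
        base = flow_order[i % len(flow_order)]
        bases.append(base * round(signal))  # a non-positive count adds nothing
    return "".join(bases)


if __name__ == "__main__":
    # Short runs: small noise is rounded away.
    print(decode_flowgram([1.02, 0.05, 2.1, 0.97]))
    # A run of six T's measured as 5.4 is called as five T's.
    print(decode_flowgram([5.4]))
```

The same absolute noise that is harmless for a signal near 1 or 2 can shift a long homopolymer call by one base, which matches the error type named on the slide.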
One additional insight ...
Read Length is Not As Important For Resequencing 
[Figure: % of paired k-mers with a uniquely assignable location (0% to 100%) versus length of k-mer reads (8 to 20 bp), for E. coli and human. Source: Jay Shendure]
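The idea behind this plot can be tried on any sequence. The sketch below is a simplification: it counts single (not paired) k-mers on one strand, and the function name `unique_kmer_fraction` and the toy sequence are hypothetical.

```python
from collections import Counter


def unique_kmer_fraction(sequence: str, k: int) -> float:
    """Fraction of k-mer positions whose k-mer occurs exactly once in the sequence."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    if not kmers:
        return 0.0
    counts = Counter(kmers)
    return sum(1 for kmer in kmers if counts[kmer] == 1) / len(kmers)


if __name__ == "__main__":
    toy = "ACGTACGTTAGC"  # made-up sequence
    for k in (2, 4, 5):
        print(k, unique_kmer_fraction(toy, k))
```

On the toy sequence the fraction rises with k and reaches 1.0 at k = 5, the same qualitative behaviour as the curves: beyond some length, longer reads add little for placing a read on a known genome.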
Two Short Read Technologies 
• Illumina GA 
• ABI SOLiD
Technology Overview: Solexa/Illumina Sequencing
ABI Solid 
Dressman 2003
Paired End Reads are Important! 
[Diagram: Read 1 and Read 2 separated by a known distance, placed on repetitive DNA and unique DNA] 
• Single read maps to multiple positions 
• Paired read maps uniquely
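A toy sketch of how the known distance resolves an ambiguous placement. It assumes exact matches and both reads on the same strand (real paired-end reads come from opposite strands); `find_all`, `map_pair`, the `tolerance` parameter and the sequences are hypothetical.

```python
def find_all(genome: str, read: str) -> list[int]:
    """All start positions of read in genome (overlapping matches included)."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions


def map_pair(genome: str, read1: str, read2: str,
             distance: int, tolerance: int = 0) -> list[tuple[int, int]]:
    """Keep only placements where read2 starts `distance` letters after read1 starts."""
    hits2 = find_all(genome, read2)
    return [
        (p1, p2)
        for p1 in find_all(genome, read1)
        for p2 in hits2
        if abs((p2 - p1) - distance) <= tolerance
    ]


if __name__ == "__main__":
    genome = "TTGACAGGCCGACATTTAGC"  # "GACA" occurs twice, "TTAG" once
    print(find_all(genome, "GACA"))                      # ambiguous on its own
    print(map_pair(genome, "GACA", "TTAG", distance=5))  # one placement remains
```

Read 1 alone has two possible locations; only one of them lies at the expected distance from the uniquely placed Read 2.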
Adapted from: Barak Cohen, Washington University, Bio5488 http://tinyurl.com/6zttuq http://tinyurl.com/6k26nh 
Single Molecule Sequencing 
Helicos Biosciences Corp. 
[Diagram: single DNA molecules with primer on a microscope slide; labeled nucleotides (dNTP-Cy3) detected with a super-cooled TIRF microscope]
Introducing 
NXT GNT DXS 
Next Generation Diagnostics
NXT GNT DXS 
• GNT 
– Dedicated Team & Network 
– Operational: Location 
– Professionalized 
• DXS 
– Content engine 
– Product 1 established 
– Pipeline for n+1 
• NXT 
– Workflow management 
– Bioinformatics 
– Epigenetics
Next next generation sequencing 
Third generation sequencing 
Now sequencing
Complete genomics
Pacific Biosciences: A Third Generation Sequencing Technology 
Eid et al 2008
Nanopore Sequencing
NCBI (educational resources)
Weblems 
• What? 
– Web-based problems (about the current lesson 
and/or preparation for the next lesson) 
• When? 
– At the end of every lesson 
• How? 
– Solutions online via screencasts 
– Practical sessions 
– Preparation for the practical exam ... 
Not all problems necessarily require 
program code ...
Weblems 
W1.1: To which phyla do the following species belong: (a) 
starfish (b) ginkgo tree (c) scorpion? 
W1.2: What are the common names for the following 
species: (a) Orycteropus afer (b) Beta vulgaris (c) 
Macrocystis pyrifera? 
W1.3: What species has the smallest known genome? And 
is genome size related to number of genes? 
W1.4: What are the 5 latest genomes published? How 
complete is “coverage”? 
W1.5: For approximately 10% of Europeans, the painkiller 
codeine is ineffective because the patients lack the 
enzyme that converts codeine into the active molecule, 
morphine. What is the most common mutation that 
causes this condition?
