DNA analysis on your laptop: Spot the differences
1. DNA analysis on your laptop:
Spot the differences
Tech Tuesday
11 October 2016
Barbera van Schaik
b.d.vanschaik@amc.uva.nl
2. Things I like
● During the day I'm a bioinformatician
● In my spare time I ...
– Go to concerts and festivals
– Cook (all cuisines)
– Read (fantasy, popular science/philosophy,
Dutch literature)
– Make things (sewing, electronics, laser
cutting, welding, 3d printing)
– Look into self-hosted cloud services
– Grow vegetables in my garden
5. How does molecule A interact with protein B?
A schematic visual model of the oxygen-binding process, showing all four
monomers and hemes, and protein chains only as diagrammatic coils, to
facilitate visualization into the molecule.
(http://en.wikipedia.org/wiki/Hemoglobin)
19. Labsession: Compare two DNA sequences
The game:
Place as many matching letters
as possible opposite each other
Introduce mutations, insertions
and deletions
Scoring scheme:
- Matching letters: +2
- Mismatching letters: -1
- Insertion or deletion: -1
Sum all matching letters,
mutations and indels
Get the maximum score
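The game above can also be played by a computer. The following is a minimal sketch, assuming the scoring scheme from this slide (+2, -1, -1) and a standard dynamic-programming approach (Needleman-Wunsch style global alignment). The function name and the example sequences are made up for illustration.

```python
def best_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the maximum score for placing sequences a and b opposite each other."""
    # prev[j] is the best score for the letters of a handled so far versus b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] opposite nothing: i deletions
        for j in range(1, len(b) + 1):
            same = a[i - 1] == b[j - 1]
            diagonal = prev[j - 1] + (match if same else mismatch)
            up = prev[j] + gap        # letter of a opposite a gap
            left = curr[j - 1] + gap  # letter of b opposite a gap
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


# Three matching letters (+6) and one deletion (-1)
print(best_alignment_score("ACGT", "AGT"))
```

The sketch only reports the maximum score. Recovering the alignment itself would need an extra traceback step, which is left out here to keep it short.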
30. Automated DNA sequencing
~400 sequences per run
Scale-up by using many
DNA sequencers in parallel
Sequence center at Whitehead institute
31. Next generation sequencing
Run: 24 hrs
Data: 0.7 GB
Run: 7-14 days
Data: 120 GB
Run: 3-10 days
Data: 600 GB
2005-now: Next generation sequencing
Millions to billions of sequences
38. The 100K genomes project
The project will focus on
patients with a rare disease and
their families and patients with
cancer. The first samples for
sequencing are being taken
from patients living in England
with discussions taking place
with Scotland, Wales and
Northern Ireland about
potential future involvement.
http://www.genomicsengland.co.uk/
40. Genome projects
• Human genome project (1 individual)
• Exome sequencing (~10 individuals)
• Genome of the Netherlands (770 individuals)
• 1000 genome project (1000 individuals)
• 10K UK project (10,000 individuals)
– Upgraded to 100,000 genomes
• Personal genomes project
… many centers have one or more high throughput
sequencers
http://omicsmaps.com/
Sign @ Wellcome-Sanger, Cambridge, UK
41. Data size challenges
COMPUTING
Sequencing rate is higher
than Moore's law
STORAGE
Sequencing costs lower
than data storage
Stein, Genome Biology, 2010
Hayden, Nature, 2014
48. GenBank
Release 200.0 (12 Feb 2014)
has 171,123,749 non-WGS, non-CON records containing
157,943,793,171 base pairs of sequence data. In addition, there are
139,725,795 WGS records containing 591,378,698,544 base pairs of
sequence data.
For downloading purposes, please keep in mind that the GenBank
flatfiles are approximately 625 GB (sequence files only). The ASN.1 data
are approximately 522 GB. https://www.ncbi.nlm.nih.gov/genbank/statistics/
49. Labsession: databases
● Go to: https://www.ncbi.nlm.nih.gov/genbank/
● Search for: NM_000518
● Take the first link, click on “fasta”
● Copy/paste the record in Notepad or Word
– The header line (starting with ">") and the sequence
– Store it on your desktop as HBB.txt
● Do the same for: M25113
– Store this as sickle.txt
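The saved files can also be read with a few lines of code. This is a minimal sketch, assuming the files are plain FASTA text: a header line starting with ">" followed by sequence lines. The function name and the toy record are hypothetical; the toy record is not a real GenBank entry.

```python
def read_fasta(text):
    """Split FASTA-formatted text into (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


toy_text = """>toy_record made-up example
ACGTAC
gtta
"""
for name, sequence in read_fasta(toy_text):
    print(name, len(sequence), sequence)
```

To use it on the lab files, read the file first, for example `read_fasta(open("HBB.txt").read())`, assuming HBB.txt was saved as plain text.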
55. Labsession: sequence alignment (1)
● Go to: https://blast.ncbi.nlm.nih.gov/
● Choose “Nucleotide Blast”
● Tick the box “Align two or more sequences”
● Copy/paste the HBB.txt sequence in the
first box, and the sickle.txt sequence in the
second box
● Scroll down and click “BLAST”
● Can you spot the differences between the
healthy and ill person?
56. Labsession: sequence alignment (2)
● Go to: https://blast.ncbi.nlm.nih.gov/
● Choose “Nucleotide Blast”
● Copy/paste the 'unknown' sequence in the box
– Sequence on meetup page
● Scroll down and click “BLAST”
60. Structural variation
● Not just mutations, insertions and deletions
● Larger 'blocks' of DNA differ
http://www.nature.com/nmeth/journal/v9/n2/full/nmeth.1858.html
61. How to determine variants
1) Extract DNA & amplify to get enough for measurement
2) Sequence the DNA
3) Map DNA fragments on human reference genome
4) Determine variants compared to reference genome
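The last step, determining variants compared to the reference, can be illustrated with a toy example. This is only a sketch under strong assumptions: the reads are already mapped (each read comes with its start position on the reference), there are no insertions or deletions, and a simple majority vote per position decides the base. Real variant callers are far more careful. All names and sequences below are made up.

```python
from collections import Counter


def call_variants(reference, mapped_reads, min_coverage=2):
    """Report positions where the most common read base differs from the reference."""
    # One counter per reference position: how often each base was seen there.
    pileup = [Counter() for _ in reference]
    for start, read in mapped_reads:
        for offset, base in enumerate(read):
            position = start + offset
            if 0 <= position < len(reference):
                pileup[position][base] += 1

    variants = []
    for position, counts in enumerate(pileup):
        coverage = sum(counts.values())
        if coverage < min_coverage:
            continue  # too few reads cover this position to decide
        base, _ = counts.most_common(1)[0]
        if base != reference[position]:
            variants.append((position, reference[position], base, coverage))
    return variants


reference = "ACGTACGTAC"
# (start position on the reference, read sequence); positions count from 0
mapped_reads = [(0, "ACGTTC"), (2, "GTTCGT"), (4, "TCGTAC")]
print(call_variants(reference, mapped_reads))
```

In this toy data all three reads show a T at position 4 where the reference has an A, so that position is reported. Positions covered by only one read are skipped by the `min_coverage` threshold.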
62. What could possibly
go wrong?
● Errors during the amplification step
● Errors during the DNA sequencing process
● Errors during mapping of the DNA
fragments to the reference genome
● Low genome coverage
● Reference genome not complete
● Etc, etc
63. Have your own DNA sequenced
● http://isogg.org/wiki/List_of_personal_genomics_
– Whole genome: $1799-$5000
– Whole exome (the protein coding part of
the genome): $850-$1000
– Mitochondrial DNA or Y-chromosome
– Only variants: ~$200
64. .. and then compare it with other data
● HapMap
● 1000 genomes and other genome projects
● Known (disease) variants
● Other animals
● Family members
65. Labsession: what do your variants
say about you?
● 23andme dataset as example
● Geographic location
● Neanderthal DNA
● Disease risks
Before you continue...
Try everything with
a public dataset first!
Why?
1) First take an outsider's look at the data
2) Verify what will happen with your data
when you send it to some website
66. Explore public data
● https://my.pgp-hms.org/
– Public data > Whole genome sequences
and other data
– Data type: 23andme (dropdown menu)
– Download one of the datasets
– Unzip the file
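Once unzipped, the dataset is a plain text file that can be inspected before sending it anywhere. The sketch below assumes the usual 23andme raw-data layout: comment lines starting with "#", then tab-separated columns for marker ID, chromosome, position and genotype. Check the header of the downloaded file to confirm this. The function name and the two markers are made up.

```python
def read_genotypes(lines):
    """Map each marker ID to (chromosome, position, genotype)."""
    genotypes = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # skip lines that do not have the expected four columns
        rsid, chromosome, position, genotype = fields[:4]
        genotypes[rsid] = (chromosome, int(position), genotype)
    return genotypes


# Made-up markers for illustration, not real data
sample_lines = [
    "# rsid\tchromosome\tposition\tgenotype",
    "rs0000001\t1\t12345\tAG",
    "rs0000002\tX\t67890\tCC",
]
genotypes = read_genotypes(sample_lines)
print(len(genotypes), genotypes["rs0000001"])
```

For a downloaded file, pass the open file instead of the list, for example `read_genotypes(open(path))`, where `path` is wherever the unzipped file was stored.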
More about this project:
http://personalgenomes.org/
67. Selection of DNA tools
● Interpretome
– http://esquilax.stanford.edu/
– Ancestry: PCA and Painting
● Promethease
– http://snpedia.com/index.php/Promethease
– Sample report: 23andme v4 (2014)
● Codegen
– https://codegen.eu/
– Try the demo
69. Your DNA
● What does 'risk' mean?
– It is a risk (most of the time), not (always) a
definitive destination
– Consult a doctor
● What is genetic, what is caused by environment?
● How accurate is the underlying data?
72. TED talk tips
● Svante Pääbo – DNA clues to our inner Neanderthal
– https://www.ted.com/talks/svante_paeaebo_dna_clues_to_our_inner_neanderthal
● Sebastian Kraves – The era of personal DNA testing is here
– https://www.ted.com/talks/sebastian_kraves_the_era_of_personal_dna_testing_is_here
● Jennifer Doudna – We can now edit our DNA, but let's do it wisely
– https://www.ted.com/talks/jennifer_doudna_we_can_now_edit_our_dna_but_let_s_do_
● Ellen Jorgensen – What you need to know about CRISPR
– https://www.ted.com/talks/ellen_jorgensen_what_you_need_to_know_about_crispr
● Juan Enriquez – We can reprogram life. How to do it wisely
– https://www.ted.com/talks/juan_enriquez_we_can_reprogram_life_how_to_do_it_wisely
74. Follow up questions
● How related are sequences?
● Tree of life (example: road trip Boston)
● Cancer
● Immunology
● Mutating viruses
75. Roadtrip USA
Kosakovsky Pond S, Wadhawan S, Chiaromonte F, Ananda G,
Chung WY, Taylor J, Nekrutenko A; Galaxy Team (2009)
Windshield splatter analysis with the Galaxy metagenomic
pipeline. Genome Research, 19(11), 2144-2153.
84. Count species
Counted the number of species on the bumper for route A and B
Determined the ratio of each species on route B compared to route A
Bacterium      Route A   Route B   Ratio: B/A
Citrobacter        668       212        0.317
Cronobacter         43        22        0.512
Dickeya              4         1        0.25
Enterobacter      4142      5507        1.33
Enterovibrio         3         1        0.333
Erwinia              2       240        120
Escherichia        811       299        0.369
Francisella          1         1        1
Haemophilus          3         1        0.333
Halomonas           10         4        0.4
Klebsiella       15121      1695        0.112
Kluyvera            14         1        0.071
Marinobacter         3         4        1.333
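The ratio column in the table above is simply the route B count divided by the route A count. A minimal sketch that recomputes it for three rows taken from the table (the function and variable names are made up):

```python
def ratio_b_over_a(count_a, count_b):
    """Return count_b / count_a rounded to three decimals, or None if count_a is 0."""
    if count_a == 0:
        return None  # ratio is undefined when nothing was counted on route A
    return round(count_b / count_a, 3)


# (route A count, route B count), copied from the table
counts = {
    "Citrobacter": (668, 212),
    "Erwinia": (2, 240),
    "Klebsiella": (15121, 1695),
}

ratios = {name: ratio_b_over_a(a, b) for name, (a, b) in counts.items()}

# Sort so that bacteria relatively more common on route B come first
for name in sorted(ratios, key=ratios.get, reverse=True):
    print(name, ratios[name])
```

A ratio above 1 means more counts on route B, a ratio below 1 means more counts on route A. Ratios based on very small counts, such as the 2 for Erwinia on route A, should be read with care.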
"Eukaryote DNA-en" by Eukaryote_DNA.svg: *Difference_DNA_RNA-EN.svg: *Difference_DNA_RNA-DE.svg: Sponk (talk)translation: Sponk (talk)Chromosome.svg: *derivative work: Tryphon (talk)Chromosome-upright.png: Original version: Magnus Manske, this version with upright chromosome: User:Dietzel65Animal_cell_structure_en.svg: LadyofHats (Mariana Ruiz)derivative work: Radio89derivative work: Radio89 - This file was derived from Eukaryote DNA.svg:. Licensed under CC BY-SA 3.0 via Commons - https://commons.wikimedia.org/wiki/File:Eukaryote_DNA-en.svg#/media/File:Eukaryote_DNA-en.svg
http://www.accessexcellence.org/AE/AEPC/NIH/gene14.html
Study of families with particular disease.
Some people are affected (in grey)
Search for mutations or genes which are involved (bioinformatics)
About chances that a particular gene is important (biostatistics)
https://www.nih.gov/news-events/news-releases/nih-human-microbiome-project-defines-normal-bacterial-makeup-body
10x more microorganism cells than human cells
Only 1-3% of body mass
Score: how similar the two sequences are
Expect: how many hits this good are expected by chance (lower is better)
Query: input sequence
Sbjct: database sequence
Numbers: the start and end positions of the similar region in each sequence
Vertical bars: matching nucleotides
No vertical bar: mismatch or gap
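To see how the vertical bars come about, the sketch below rebuilds that middle line for two sequences that are already aligned (same length, with "-" marking a gap). It only imitates the look of the BLAST output and is not how BLAST itself works. The function names and the two short sequences are made up.

```python
def midline(query, subject):
    """Build the line of vertical bars shown between two aligned sequences."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    # A bar for each matching nucleotide, a space for a mismatch or a gap.
    return "".join(
        "|" if q == s and q != "-" else " " for q, s in zip(query, subject)
    )


def show_alignment(query, subject):
    """Return a small BLAST-like text view of an alignment."""
    bars = midline(query, subject)
    return "\n".join([
        f"Identities: {bars.count('|')}/{len(query)}",
        f"Query  {query}",
        f"       {bars}",
        f"Sbjct  {subject}",
    ])


print(show_alignment("ACGT-A", "ACCTTA"))
```

Positions without a bar are the places to look at when spotting the differences between two sequences.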