Pathogenomics, from patient to bioinformatician   torsten seemann - uni melb - tue 1 may 2012
Published in: Health & Medicine
  1. Pathogenomics: From patient to bioinformatician
     Dieter Bulach, Torsten Seemann
     M.Sc(Bioinf) - University of Melbourne - Tue 1 May 2012
  2. The "rules"
     ● Conversation, not lecture
     ● Ask questions at any time
     ● Activities and quizzes are interspersed. These have a yellow background like this slide.
     ● Please turn your phones to silent.
     ● Let's start!
  3. Overview
     ● Medical issue
       ○ sample collection from patient
       ○ sample preparation
     ● Genome sequencing
       ○ experimental design
     ● Bioinformatics
       ○ identify SNPs - read mapping, to reference
         ■ phylogenomic tree
       ○ identify novel DNA - de novo assembly, no reference
         ■ annotation
     ● Biological interpretation
  4. What is a pathogen?
     ● infectious agent or "germ"
     ● microbe that causes disease in its host
       ○ organism
         ■ virus, bacterium, fungus, protozoan, parasite
         ■ most microbes are harmless or even beneficial
       ○ host
         ■ human, animal, plant
       ○ transmission
         ■ any "hole" in you - inhalation, ingestion, wound
       ○ virulence
         ■ how bad, how quick, mortality
  5. What type of pathogen are these?
     ● HIV
     ● Malaria
     ● Golden Staph
     ● Powdery mildew
     ● Black death (Plague)
     ● Glandular Fever
  6. How do they work?
     ● Adhesion
       ○ bind to host cell surface - interferes with normal processes
     ● Colonization
       ○ take over parts of the body - upsets processes
     ● Invasion
       ○ produce proteins to disrupt host cells, allow entry
     ● Immunosuppression
       ○ for example, produce proteins to bind to antibodies
     ● Toxins
       ○ proteins/metabolites that are poisonous to the host
  7. Patient scenario
     ● Hospital patient with indwelling catheter
       ○ risk of pathogens entering the bloodstream
       ○ this is not normal, and is called septicemia
       ○ sepsis is the whole body inflammatory response to it
     ● Need to defeat the pathogen
       ○ most likely bacterial in this case
       ○ need to identify the bacterium and characteristics of the bacterium
         ■ antibiotic resistance profile e.g. MRSA, VRE
         ■ might even want to know where it came from
  8. Sample collection
     ● Take patient blood
       ○ send to pathology
     ● Centrifuge
       ○ slow spin to remove human cells
       ○ fast spin to pellet bacterial cells
     ● Streak onto agar media
       ○ first emulsify the pellet to make it spreadable
       ○ grow for 24 hours, likely to be monoculture
  9. Traditional Microbiology
     ● Phenotype based:
       ○ look at cells under microscope
       ○ Gram staining - cell walls
       ○ biochemical tests
       ==> identification of the bacterium
         ■ genus and species
     ● PCR based testing:
       ○ 16S ribosomal RNA
         ■ common to all bacteria, differs slightly per strain
         ■ identifies genus in >90% of cases, species in >65%, unknown in 10%
       ○ Multi-locus sequence typing (MLST)
         ■ sequence ~8 conserved genes
         ■ each strain/genus has its own MLST pattern
       ==> faster but limited - need prior knowledge
  10. WGS for diagnostics
     ● Whole Genome Sequencing
       ○ fast and no prerequisite knowledge about the pathogen
       ○ Microbiologist won't be superseded!!
         ■ just different tools
         ■ will still do all the tests to identify and profile, but on the sequence data set
  11. Purify DNA
     ● DNA extraction kit
       ○ lyse cells and digest (proteinase K)
       ○ centrifuge to remove cell debris
       ○ pass lysate through column
         ■ DNA sticks to a DNA binding matrix
       ○ wash bound DNA
       ○ lower salt concentration - release bound DNA
       ○ precipitate: dubiously familiar stringy white pellet
         ■ salt and ethanol
     ● Extract DNA from strawberries at home!
       ○ detergent - breaks cells (octoploid genome)
       ○ strainer/pantyhose - remove particulate matter
       ○ salt - aids DNA precipitation
       ○ alcohol - precipitates DNA, keeps rest in solution
  12. Library preparation
     ● Enough DNA?
       ○ each technology requires different amounts
     ● Library type
       ○ shotgun, short paired, or long paired reads?
       ○ different construction methods e.g. circularization
     ● Size selection
       ○ nebulization, sonication, enzymatic methods
       ○ run on gel + scalpel, or fancier methods
     ● Amplification
       ○ lots of DNA : use PCR methods e.g. emulsion PCR
       ○ little DNA : multiple displacement amplification
         ■ random hexamers, high fidelity polymerase
         ■ whole genome amplification for single-cell seq
  13. High throughput DNA-Seq
     ● Lots of technologies on the market
       ○ 454, Illumina, SOLiD, Ion Torrent, PacBio
     ● Each has its ups and downs
       ○ speed, yield, length, price, quality, labour, reliability
     ● Technology trend
       ○ Illumina is currently the best choice
       ○ Most mature technology
       ○ Produces direct "base space" i.e. A,G,T,C
       ○ Easiest data to work with
  14. Current technology

      Method       Length  Length  Paired        Mate            Quality  Yield  "Space"
                   (now)   (soon)  end?          pairs?
      Illumina     150     250     Yes (→800bp)  Yes (→3kb)      +++++    ++++   base
      454          500     900     No            Yes (→8kb)      +++      ++     flow
      SOLiD        75      75      No            Yes (~4kb)      +++      +++++  colour
      Ion Torrent  100     200     Testing       Testing (~4kb)  ++       +++    flow
      PacBio       2000    6000+   No            No              +        +      base?
  15. Read types
     ● Single end, "shotgun"
       ○ ===>---------
       ○ sequence from one end of a fragment
     ● Paired end
       ○ ==>--------<==
       ○ sequence from both ends of the same fragment
       ○ space between mates is the "insert size" (< 800bp)
     ● Mate pair
       ○ ==>--~~~~--<==
       ○ sequence both ends of a pseudo-fragment
       ○ this allows us to use longer insert sizes (> 800bp)
  16. Read "spaces"
     ● Example read
       ○ ACTGGGTCC
     ● Base space
       ○ get native bases: A,C,T,G,G,G,T,C,C
     ● Flow space
       ○ get base flows: A*1, C*1, T*1, G*3, T*1, C*2
       ○ mis-count when n > 3 (homopolymers)
     ● Colour space
       ○ get di-base encoding: T:X,X,X,X,X,X,X
       ○ theoretically useful, but messy overall
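The flow-space view of the example read can be sketched as a run-length encoding of the base-space string. This is only an illustration: the function name `to_flow_space` is made up here, and real flowgrams follow a fixed nucleotide flow order (including zero-length flows) and report noisy signal intensities rather than exact integer counts.

```python
from itertools import groupby

def to_flow_space(read):
    """Collapse a base-space read into (base, run length) pairs."""
    return [(base, len(list(run))) for base, run in groupby(read)]

# The example read from the slide.
flows = to_flow_space("ACTGGGTCC")
print(", ".join(f"{base}*{count}" for base, count in flows))
# prints: A*1, C*1, T*1, G*3, T*1, C*2
```

The homopolymer problem is visible in this representation: the read length depends on getting each run length right, so one mis-counted flow shifts everything after it.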
  17. Read filtering
     ● Sequencing is a multi-step process
       ○ most steps are biological - so there will be errors!
     ● Bacterial genome sequencing
       ○ usually excess sequence, can afford to discard
     ● Why filter?
       ○ reduce size of data set to deal with
         ■ need less RAM and CPU
       ○ improve average read quality
         ■ better results
  18. What to filter on
     ● Phred base qualities
       ○ Q<20 still means >1% error!
     ● ambiguous bases i.e. "N"
       ○ these should have low Q scores anyway
     ● reads that are too short
       ○ too ambiguous to map, too short to assemble
     ● widowed reads
       ○ reads that, after filtering, no longer have a mate
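The filtering criteria above can be sketched in a few lines. A Phred score Q corresponds to an error probability of 10^(-Q/10), so Q20 is exactly 1% error. Everything else here is an assumption made for illustration: the function names, the Phred+33 quality encoding, the thresholds `min_q=20` and `min_len=30`, and the choice to reject a read if any single base falls below `min_q` (real tools more often trim low-quality ends or use a mean).

```python
def phred_to_error(q):
    """Error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)

def decode_quals(qual_string, offset=33):
    """Decode an ASCII quality string (assumed Phred+33) into integer scores."""
    return [ord(c) - offset for c in qual_string]

def keep_read(seq, quals, min_q=20, min_len=30):
    """Apply the three per-read filters: length, ambiguous bases, base quality."""
    if len(seq) < min_len:
        return False                 # too short to map or assemble
    if "N" in seq.upper():
        return False                 # ambiguous base
    return min(quals) >= min_q       # strict: every base must reach min_q

def filter_pairs(pairs, min_q=20, min_len=30):
    """Split read pairs into surviving pairs and widowed reads.

    pairs: list of ((seq1, quals1), (seq2, quals2)).
    """
    kept, widowed = [], []
    for r1, r2 in pairs:
        ok1 = keep_read(*r1, min_q, min_len)
        ok2 = keep_read(*r2, min_q, min_len)
        if ok1 and ok2:
            kept.append((r1, r2))
        elif ok1:
            widowed.append(r1)       # mate was discarded
        elif ok2:
            widowed.append(r2)
    return kept, widowed
```

Whether widowed reads are kept as single-end data or discarded depends on what the downstream mapper or assembler accepts.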
  19. Sequenced it, now what?
     ● How is it different?
       ○ compare to known closely related "reference" strain
     ● Types of differences
       ○ deleted DNA - in reference, not in ours
       ○ duplicated DNA - extra copies in ours
       ○ novel DNA - in ours, not in reference
       ○ SNPs - single nucleotide polymorphisms (1bp subst)
       ○ indels - short insertions or deletions (usually 1bp)
         ■ sometimes indels fall under "SNPs" banner
       ○ structural variation - rearrangements, inversions
         ■ small scale, large scale
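  20. Read mapping - large scale
     [figure: coverage across the reference, with regions labelled x1 coverage, conserved, deleted, x3, x2]
  21. Read mapping - medium scale
     [figure]
  22. Read mapping - small scale
     [figure: reads aligned to the reference sequence, labelled reference sequence, depth, errors]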
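The depth signal that the read mapping slides label (x1, x2, x3, deleted) can be sketched as a per-base count of how many mapped reads cover each reference position. This assumes the reads have already been placed by a mapper and are given as hypothetical `(start, length)` tuples with 0-based starts; the function names are illustrative only.

```python
def coverage_depth(ref_len, alignments):
    """Count how many reads cover each reference position.

    alignments: iterable of (start, read_length), 0-based start on the reference.
    """
    depth = [0] * ref_len
    for start, length in alignments:
        for pos in range(max(start, 0), min(start + length, ref_len)):
            depth[pos] += 1
    return depth

def zero_depth_regions(depth):
    """Return half-open (start, end) intervals that no read covers."""
    regions, start = [], None
    for pos, d in enumerate(depth):
        if d == 0 and start is None:
            start = pos
        elif d > 0 and start is not None:
            regions.append((start, pos))
            start = None
    if start is not None:
        regions.append((start, len(depth)))
    return regions

depth = coverage_depth(20, [(0, 5), (2, 5), (3, 5), (12, 5)])
print(zero_depth_regions(depth))  # candidate deletions relative to the reference
```

Regions with zero depth are candidates for DNA deleted in the sample, and regions at a multiple of the typical depth are candidates for duplications.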
  23. Are we seeing everything?
     ● Hmm, some of our reads didn't map
       ○ sequencing artifacts (some)
       ○ contamination (maybe - RA sneezed into sample?)
       ○ DNA in our sample but not in reference (yes)
         ■ need to de novo assemble
     ● Other comparisons more difficult
       ○ structural change, rearrangements
         ■ read length & insert size are limiting factors
       ○ read mapping is not the answer to every question
         ■ particularly with non-model organisms
  24. De novo genome assembly
     ● De novo
       ○ Latin - "from the beginning", "afresh", "anew"
       ○ Without reference to any other genomes
     ● "Genome assembly is impossible."
       ○ A/Prof. Mihai Pop - leading assembly researcher!
  25. Assembling bacteria
     ● Genomes
       ○ DNA, single organism, ~1 sequence
     ● Transcriptomes
       ○ RNA (cDNA), single organism, ~4000 sequences
     ● Meta-genomes
       ○ DNA, mix of organisms, >10 sequences
       ○ e.g. human gut microbiome, oral cavity
     ● Meta-transcriptomes
       ○ RNA (cDNA), mix of organisms, >40000 sequences!
  26. Types of assemblers
     ● Greedy
       ○ find two best matching reads, join them, iterate
     ● Overlap-Layout-Consensus
       ○ collate all overlaps into a graph and find a path
     ● de Bruijn graph (pronounced "brown")
       ○ break reads into k-mers, overlap is 100% identity over k-1 bases
     ● String graph
       ○ represents all that is inferable from the reads
       ○ encompasses OLC and DBGs
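The de Bruijn construction can be sketched as follows: every k-mer in a read becomes an edge from its first k-1 bases to its last k-1 bases, so adjacent k-mers overlap exactly by k-1. This is a minimal sketch that assumes error-free reads and ignores reverse complements and k-mer counts, which real assemblers have to handle; the function name is illustrative.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Nodes are (k-1)-mers; each k-mer in a read adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

graph = de_bruijn_graph(["ACTGG", "CTGGT"], k=3)
for node, successors in sorted(graph.items()):
    print(node, "->", sorted(successors))
```

Note that the two reads are never compared with each other directly: their overlap emerges because they contribute the same k-mers to the graph.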
  27. Assembly algorithm
     ● Find all overlaps between all reads
       ○ naively this is O(N^2) for N reads, but good heuristics exist
       ○ parameters are: min. overlap, min %identity
     ● Build a graph from these overlaps
       ○ nodes/arcs <=> reads/overlaps <=> vertices/edges
     ● Simplify the graph
       ○ because real-world reads have errors
     ● Trace a single path through the graph
       ○ Read off the consensus of bases as you go
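The first two steps, finding overlaps and building the graph, can be sketched with the naive all-against-all comparison. This sketch only accepts exact suffix-prefix matches, so the min %identity parameter is effectively fixed at 100%, and it uses none of the heuristics that make the search practical for real data sets. The function names are illustrative.

```python
def suffix_prefix_overlap(a, b, min_overlap):
    """Length of the longest suffix of a that exactly matches a prefix of b."""
    for length in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def overlap_graph(reads, min_overlap):
    """Naive O(N^2) overlap graph: edges[(i, j)] = overlap length of read i -> read j."""
    edges = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                olen = suffix_prefix_overlap(a, b, min_overlap)
                if olen:
                    edges[(i, j)] = olen
    return edges

reads = ["ACTGGG", "GGGTCC", "TCCAAA"]
print(overlap_graph(reads, min_overlap=3))
# prints: {(0, 1): 3, (1, 2): 3}
```

Following the path 0 -> 1 -> 2 and merging the overlapping ends gives the single contig ACTGGGTCCAAA.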
  28. The tyranny of repeats
     [figure: assembly graph]
     ● Assembler would output 7 contig sequences
       ○ path is broken at ambiguous decision points
       ○ read/pair length limits ability to resolve repeats
  29. How many contigs will be produced?
     [figure]
  30. More complex graph
     [figure: assembly graph, labelled contigs and connections]
  31. Reality bites
     Shared vertices are repeats.
  32. Scaffolding
     ● Use paired reads to join contigs
       ○ reads with their mates in different contigs in a consistent manner suggests adjacency
     ● A difficult constraint problem
       ○ distance between mates ("insert size") variable
       ○ repeats cause ambiguous mate placement
       ○ some assemblers do it, separate scaffolders exist
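The evidence a scaffolder starts from can be sketched as a count of read pairs whose two mates land in different contigs. This deliberately leaves out the hard parts named above: it ignores mate orientation, insert size, and repeats. The input format (a list of contig-name tuples, one per read pair), the function name, and the `min_links=2` threshold are assumptions for illustration.

```python
from collections import Counter

def contig_links(pair_placements, min_links=2):
    """Count read pairs that bridge two different contigs.

    pair_placements: iterable of (contig_of_read1, contig_of_read2).
    Returns {(contig_a, contig_b): n_pairs} for links seen at least min_links times.
    """
    links = Counter()
    for c1, c2 in pair_placements:
        if c1 != c2:                            # mates within one contig say nothing new
            links[tuple(sorted((c1, c2)))] += 1
    return {pair: n for pair, n in links.items() if n >= min_links}

placements = [("c1", "c2"), ("c2", "c1"), ("c1", "c1"), ("c2", "c3"), ("c1", "c2")]
print(contig_links(placements))
# prints: {('c1', 'c2'): 3}
```

Requiring several independent pairs per link is a simple guard against chimeric pairs and mis-mapped reads suggesting a false adjacency.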
  33. Contig ordering
     ● Optical maps
       ○ wet lab method, real experimental evidence
       ○ chromosome sized restriction site map
     ● Align to reference genome
       ○ fit contigs as well as possible against known reference
       ○ some contigs will fit if split (DNA rearrangement)
       ○ expect orphan contigs (novel DNA)
  34. Genome closure
     ● Finished genome
       ○ one contig per replicon in original sample
       ○ bacterial chromosomes/plasmids usually circular
     ● Labour intensive
       ○ design primers around gaps, PCR, Sanger
       ○ Fosmid/BAC libraries for larger inconsistencies
     ● Why bother?
       ○ no close reference exists
       ○ ensures you didn't miss anything
       ○ understand whole genome architecture
       ○ simplifies all downstream analysis
  35. Annotation
     ● Annotation is the process of identifying important features in a genome
       ○ gene - protein product, promoter, signal sequences
         ■ ~1000 per Mbp in bacteria, coding dense
       ○ tRNA - transfer RNA
         ■ ~30 per bacterium cover all codons (wobble base)
       ○ rRNA - ribosomal RNA locus
         ■ 1 to 9 per bacterium, fast vs slow growers
       ○ And many more...
         ■ small RNAs, ncRNA, binding sites, tx factors
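The simplest form of protein-coding gene finding is an open reading frame (ORF) scan over all six reading frames. The sketch below is not how production gene finders work (they also model codon usage, ribosome binding sites and so on). The start codon set ATG/GTG/TTG, the `min_codons` threshold, and the function names are assumptions for illustration.

```python
STARTS = {"ATG", "GTG", "TTG"}   # assumed common bacterial start codons
STOPS = {"TAA", "TAG", "TGA"}

def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_orfs(seq, min_codons=30):
    """Return (strand, start, end, orf_seq) for each start..stop ORF.

    Coordinates are 0-based, half-open, and relative to the scanned strand.
    min_codons counts codons from the start codon up to, not including, the stop.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in STARTS:
                    start = i                      # first start codon after a stop
                elif start is not None and codon in STOPS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None
    return orfs

toy = "CC" + "ATG" + "GCT" * 4 + "TAA" + "G"
print(find_orfs(toy, min_codons=3))
# prints: [('+', 2, 20, 'ATGGCTGCTGCTGCTTAA')]
```

At roughly 1000 genes per Mbp, a typical bacterial gene is on the order of 1 kb, so a real scan would use a much larger `min_codons` than this toy call.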
  36. Annotating proteins
     ● Homology vs. Similarity
       ○ homology means shared evolutionary ancestry, which usually implies similar biological function
       ○ we use sequence similarity as a proxy for homology
       ○ works well for most situations
     ● Sequence alignment methods
       ○ "Exact" - Needleman-Wunsch, Smith-Waterman
       ○ "Approx" - BLAST, FASTA, and many others!
       ○ Database is sequences: nr, RefSeq, UniProt
     ● Sequence profile methods
       ○ Build an HMM (model) of aligned sequence families
       ○ HMMER - scan profiles against query protein seq.
       ○ Database is profiles: Pfam, TIGRfams, FigFam
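As an illustration of the "exact" methods, here is a sketch of the Smith-Waterman local alignment score computed by dynamic programming. The scoring values (match +2, mismatch -1, linear gap -2) are arbitrary assumptions; real protein searches use a substitution matrix such as BLOSUM62 and affine gap penalties. Only the best score is returned, not the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman_score("ACTGGGTCC", "TTGGGTAA"))
# prints: 10  (the shared TGGGT, 5 matches x 2)
```

The table has len(a) x len(b) cells, which is why exact methods are too slow for searching whole databases and heuristics such as BLAST are used instead.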
  37. Curation
     ● Automatic annotation
       ○ more quality databases and models now
       ○ but still flawed
     ● Manual curation
       ○ Essential for a quality annotation
       ○ Find pseudo, missing, bogus, and broken genes
       ○ Discover mis-assemblies
       ○ Correct mis-annotated protein families
       ○ Fix incorrect start codons
         ■ Bacteria have 3-5 start codons, not just AUG (M)
  38. Practical Exercise
     Go to URL: dna.med.monash.edu.au/~torsten/tmp/msc
     Install: Artemis and MEGA
     Keep this URL open in a tab.
     Wait for further instructions.
  39. Annotation
     ● Start Artemis
       ○ open Hendra1994.fa
     ● What is it?
       ○ Hendra virus - 18 kb viral genome
       ○ single-stranded negative-sense RNA (not DNA!!)
       ○ has 6 protein coding regions ("genes")
     ● Task
       ○ find these genes using Artemis
       ○ use similarity searching to assign a name to the gene
  40. The official annotation
     ● In Artemis
       ○ download and open Hendra1994.gbk
     ● Task
       ○ compare to your annotations
     ● What did you find?
       ○ methionine (M) start codon (ATG)
  41. DNA vs Protein Similarity
     ● Examine relationships between Paramyxoviridae
       ○ includes Hendravirus already in Artemis
     ● Open BLAST: http://blast.ncbi.nlm.nih.gov/
     ● For the Hendravirus:
       ○ use blastn to search nr database for sequences related to the L gene (DNA)
       ○ use blastp to search nr database for sequences related to the L protein (amino acids)
       ○ Any observations??
  42. Phylogeny
     ● Start MEGA
       ○ download L_para.fas
       ○ multifasta with "L" proteins from 37 similar viruses
     ● Task:
       ○ Load L_para.fas
       ○ Align sequences (using MUSCLE)
       ○ Infer tree (minimum evolution method)
       ○ Examine relationships
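Distance-based tree methods such as minimum evolution start from pairwise distances between the aligned sequences. The sketch below computes the simplest such measure, the p-distance (proportion of differing sites), and stops there: it does not build a tree and it is not MEGA's implementation. The toy sequences and the names `v1` to `v3` are made up for illustration.

```python
def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    sites = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not sites:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in sites) / len(sites)

def distance_matrix(aligned):
    """aligned: {name: aligned sequence}. Returns {(name1, name2): p-distance}."""
    names = list(aligned)
    return {(m, n): p_distance(aligned[m], aligned[n]) for m in names for n in names}

# Hypothetical toy alignment, not taken from L_para.fas.
toy = {"v1": "MKT-LLA", "v2": "MKTALLA", "v3": "MRTALIA"}
for (m, n), d in distance_matrix(toy).items():
    if m < n:
        print(m, n, round(d, 3))
```

Raw p-distances underestimate the true amount of change between divergent sequences, which is why tree-building tools usually apply a substitution model to correct them.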
  43. Viral strain comparison
     ● Hendravirus
       ○ 11 complete genome sequences
       ○ different hosts (bats, horses) and times (1994 onwards)
     ● Task
       ○ Load hendra11.meg into MEGA
       ○ multiple alignment already done - examine SNPs
       ○ What is the impact of the nucleotide differences?
         ■ look at one SNP in detail
         ■ use Artemis to see if the SNP is in a gene
         ■ does the SNP change the encoded protein?
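The last question, whether a SNP changes the encoded protein, comes down to comparing the affected codon before and after the substitution. The sketch below uses the standard genetic code and assumes the input is an in-frame coding sequence written as DNA on the coding strand, with a 0-based SNP position inside it. The function name and the toy sequence are illustrative, not taken from the Hendra data.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def snp_effect(cds, pos, alt):
    """Classify a single-base substitution in an in-frame coding sequence.

    Returns (ref_codon, alt_codon, ref_aa, alt_aa, kind).
    """
    offset = pos % 3
    codon_start = pos - offset
    ref_codon = cds[codon_start:codon_start + 3]
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        kind = "synonymous"       # protein unchanged
    elif alt_aa == "*":
        kind = "nonsense"         # premature stop codon
    else:
        kind = "nonsynonymous"    # amino acid changed
    return ref_codon, alt_codon, ref_aa, alt_aa, kind

cds = "ATGGCTAAA"                 # toy sequence encoding M A K
print(snp_effect(cds, 5, "C"))    # ('GCT', 'GCC', 'A', 'A', 'synonymous')
print(snp_effect(cds, 6, "T"))    # ('AAA', 'TAA', 'K', '*', 'nonsense')
```

Synonymous changes like the first one are also why protein-level searches (blastp) tend to find more distant relatives than DNA-level searches (blastn): the DNA can drift while the protein stays the same.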