Pathogenomics, from patient to bioinformatician torsten seemann - uni melb - tue 1 may 2012
Pathogenomics - From patient to bioinformatician - Dieter Bulach, Torsten Seemann - M.Sc(Bioinf) - University of Melbourne - Tue 1 May 2012
The "rules" ● Conversation, not lecture ● Ask questions at any time ● Activities and quizzes are interspersed. These have a yellow background like this slide. ● Please turn your phones to silent. ● Let's start!
Overview● Medical issue ○ sample collection from patient ○ sample preparation● Genome sequencing ○ experimental design● Bioinformatics ○ identify SNPs - read mapping, to reference ■ phylogenomic tree ○ identify novel DNA - de novo assembly, no reference ■ annotation● Biological interpretation
What is a pathogen?● infectious agent or "germ"● microbe that causes disease in its host ○ organism ■ virus, bacterium, fungus, protozoa, parasite ■ most are harmless or even beneficial ○ host ■ human, animal, plant ○ transmission ■ any "hole" in you - inhalation, ingestion, wound ○ virulence ■ how bad, how quick, mortality
What type of pathogen are these? ● HIV ● Malaria ● Golden Staph ● Powdery mildew ● Black death (Plague) ● Glandular Fever
How do they work? ● Adhesion ○ bind to host cell surface - interferes with normal processes ● Colonization ○ take over parts of the body - upsets processes ● Invasion ○ produce proteins to disrupt host cells, allow entry ● Immunosuppression ○ for example, produce proteins to bind to antibodies ● Toxins ○ proteins/metabolites that are poisonous to the host
Patient scenario● Hospital patient with indwelling catheter ○ risk of pathogens entering the bloodstream ○ this is not normal, and is called septicemia ○ sepsis is the whole body inflammatory response to it● Need to defeat the pathogen ○ most likely bacterial in this case ○ need to identify the bacterium and characteristics of the bacterium ■ antibiotic resistance profile eg. MRSA, VRE ■ might even want to know where it came from
Sample collection● Take patient blood ○ send to pathology● Centrifuge ○ slow spin to remove human cells ○ fast spin to pellet bacterial cells● Streak onto agar media ○ first emulsify the pellet to make it spreadable ○ grow for 24 hours, likely to be monoculture
Traditional Microbiology ● Phenotype based: ○ look at cells under microscope ○ Gram staining - cell walls ○ biochemical tests ==> identification of the bacterium ■ genus and species ● PCR based testing: ○ 16S ribosomal RNA ■ common to all bacteria, differs slightly per strain ■ identify genus >90%, species >65%, unknown 10% ○ Multi-locus sequence typing (MLST) ■ sequence ~8 conserved genes ■ each strain/genus has its own MLST pattern ==> faster but limited - need prior knowledge
WGS for diagnostics ● Whole Genome Sequencing ○ fast and no prerequisite knowledge about the pathogen ○ Microbiologists won't be superseded!! ■ just different tools ■ sequence data set: will still do all the tests to identify and profile
Purify DNA ● DNA extraction kit ○ lyse cells and digest (proteinase K) ○ centrifuge to remove cell debris ○ pass lysate through column ■ DNA sticks to a DNA binding matrix ○ wash bound DNA ○ lower salt concentration - release bound DNA ○ precipitate: dubiously familiar stringy white pellet ■ salt and ethanol ● Extract DNA from strawberries at home! ○ detergent - breaks cells (octoploid genome) ○ strainer/pantyhose - remove particulate matter ○ salt - aids DNA precipitation ○ alcohol - precipitates DNA, keeps rest in solution
Library preparation ● Enough DNA? ○ each technology requires different amounts ● Library type ○ shotgun, short paired, or long paired reads? ○ different construction methods eg. circularization ● Size selection ○ nebulization, sonication, enzymatic methods ○ run on gel + scalpel, or fancier methods ● Amplification ○ lots of DNA: use PCR methods eg. emulsion PCR ○ little DNA: multiple displacement amplification ■ random hexamers, high fidelity polymerase ■ whole genome amplification for single-cell seq
High throughput DNA-Seq ● Lots of technologies on the market ○ 454, Illumina, SOLiD, Ion Torrent, PacBio ● Each has its ups and downs ○ speed, yield, length, price, quality, labour, reliability ● Technology trend ○ Illumina is currently the best choice ○ Most mature technology ○ Produces direct "base space" ie. A,G,T,C ○ Easiest data to work with
Current technology
Method       Length (now)  Length (soon)  Paired end?    Mate pairs?      Quality  Yield  "Space"
Illumina     150           250            Yes (→800bp)   Yes (→3kb)       +++++    ++++   base
454          500           900            No             Yes (→8kb)       +++      ++     flow
SOLiD        75            75             No             Yes (~4kb)       +++      +++++  colour
Ion Torrent  100           200            Testing        Testing (~4kb)   ++       +++    flow
PacBio       2000          6000+          No             No               +        +      base?
Read types ● Single end, "shotgun" ○ ===>--------- ○ sequence from one end of a fragment ● Paired end ○ ==>--------<== ○ sequence from both ends of the same fragment ○ space between mates is the "insert size" (< 800bp) ● Mate pair ○ ==>--~~~~--<== ○ sequence both ends of a pseudo-fragment ○ this allows us to use longer insert sizes (> 800bp)
Read "spaces"● Example read ○ ACTGGGTCC● Base space ○ get native bases: A,C,T,G,G,G,T,C,C● Flow space ○ get base flows: A*1, C*1, T*1, G*3, T*1, C*2 ○ mis-count when n > 3 (homopolymers)● Colour space ○ get di-base encoding: T:X,X,X,X,X,X,X ○ theoretically useful, but messy overall
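The flow-space example above can be reproduced with a short sketch. It simply run-length encodes the read, which matches the slide's notation; real flow-space data also depends on the instrument's fixed nucleotide flow order, which is not modelled here. The function name `to_flow_space` is made up for illustration.

```python
from itertools import groupby


def to_flow_space(read):
    """Collapse a base-space read into (base, run_length) flows."""
    return [(base, len(list(run))) for base, run in groupby(read)]


flows = to_flow_space("ACTGGGTCC")
# Print in the same notation as the slide, e.g. "G*3" for a run of three Gs.
print(", ".join(f"{base}*{count}" for base, count in flows))
```

Long runs (homopolymers) are where the count is most likely to be wrong, since the signal for n bases has to be distinguished from the signal for n+1.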
Read filtering ● Sequencing is a multi-step process ○ most steps are biological - so there will be errors! ● Bacterial genome sequencing ○ usually excess sequence, can afford to discard ● Why filter? ○ reduce size of data set to deal with ■ need less RAM and CPU ○ improve average read quality ■ better results
What to filter on ● Phred base qualities ○ Q<20 still means >1% error! ● ambiguous bases ie. "N" ○ these should have low Q scores anyway ● reads that are too short ○ too ambiguous to map, too short to assemble ● widowed reads ○ reads that, after filtering, no longer have a mate
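A minimal sketch of these filters, assuming each read is a `(sequence, qualities)` tuple with qualities already decoded to integers. A Phred score Q corresponds to an error probability of 10^(-Q/10), so Q20 is exactly 1%. The thresholds (`min_q=20`, `min_len=30`) and the "every base must pass" rule are illustrative choices, not values from the slides.

```python
def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def read_ok(seq, quals, min_q=20, min_len=30):
    """Apply the per-read filters (thresholds are illustrative)."""
    if len(seq) < min_len:      # too short to map or assemble
        return False
    if "N" in seq:              # ambiguous base call
        return False
    return min(quals) >= min_q  # strict: every base must reach min_q


def filter_pairs(pairs, min_q=20, min_len=30):
    """Keep a pair only if both mates pass, so no widowed reads remain."""
    return [
        (r1, r2)
        for r1, r2 in pairs
        if read_ok(*r1, min_q, min_len) and read_ok(*r2, min_q, min_len)
    ]
```

Real pipelines often trim low-quality ends instead of discarding the whole read; dropping the pair outright is just the simplest way to show the idea.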
Sequenced it, now what?● How is it different? ○ compare to known closely related "reference" strain● Types of differences ○ deleted DNA - in reference, not in ours ○ duplicated DNA - extra copies in ours ○ novel DNA - in ours, not in reference ○ SNPs - single nucleotide polymorphisms (1bp subst) ○ indels - short insertions or deletions (usually 1bp) ■ sometimes indels fall under "SNPs" banner ○ structural variation - rearrangements, inversions ■ small scale, large scale
Read mapping - small scale [figure; labels: Reference sequence, Depth, Errors]
Are we seeing everything? ● Hmm, some of our reads didn't map ○ sequencing artifacts (some) ○ contamination (maybe - RA sneezed into sample?) ○ DNA in our sample but not in reference (yes) ■ need to de novo assemble ● Other comparisons more difficult ○ structural change, rearrangements ■ read length & insert size are limiting factors ○ read mapping is not the answer to every question ■ particularly with non-model organisms
De novo genome assembly● De novo ○ Latin - "from the beginning", "afresh", "anew" ○ Without reference to any other genomes● "Genome assembly is impossible." ○ A/Prof. Mihai Pop - leading assembly researcher!
Assembling bacteria● Genomes ○ DNA, single organism, ~1 sequence● Transcriptomes ○ RNA (cDNA), single organism, ~4000 sequences● Meta-genomes ○ DNA, mix of organisms, >10 sequences ○ eg. human gut microbiome, oral cavity● Meta-transcriptomes ○ RNA (cDNA), mix of organisms, >40000 sequences!
Types of assemblers ● Greedy ○ find two best matching reads, join them, iterate ● Overlap-Layout-Consensus ○ collate all overlaps into a graph and find a path ● de Bruijn graph (pronounced "brown") ○ break reads into k-mers, adjacent k-mers overlap by k-1 bases at 100% identity ● String graph ○ represents all that is inferable from the reads ○ encompasses OLC and DBGs
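To make the de Bruijn idea concrete, here is a minimal sketch using one common formulation: nodes are (k-1)-mers and every k-mer in a read becomes an edge from its prefix to its suffix, so consecutive k-mers share exactly k-1 bases. The function name `de_bruijn` is illustrative, and error correction and graph simplification are left out.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a de Bruijn graph: (k-1)-mer prefix -> set of (k-1)-mer suffixes."""
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of width k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


graph = de_bruijn(["ACTGG", "CTGGA"], k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Note that the two reads never have to be compared with each other: their shared k-mers land on the same nodes automatically, which is why this approach scales to large numbers of short reads.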
Assembly algorithm ● Find all overlaps between all reads ○ naively this is O(N^2) for N reads, but good heuristics ○ parameters are: min. overlap, min %identity ● Build a graph from these overlaps ○ nodes/arcs <=> reads/overlaps <=> vertices/edges ● Simplify the graph ○ because real-world reads have errors ● Trace a single path through the graph ○ Read off the consensus of bases as you go
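The first two steps can be sketched naively as follows. This is the all-against-all comparison, quadratic in the number of reads, with no heuristics, and it only accepts exact (100% identity) suffix-prefix overlaps. The names `overlap` and `overlap_graph` are illustrative, and `min_len` is assumed to be at least 1.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b, or 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def overlap_graph(reads, min_len):
    """Map (i, j) -> overlap length for every ordered pair of reads that overlaps."""
    graph = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                n = overlap(a, b, min_len)
                if n:
                    graph[(i, j)] = n
    return graph


reads = ["ACTGGG", "GGGTCC", "TCCAAA"]
print(overlap_graph(reads, min_len=3))
```

Here reads are the nodes and overlaps are the edges; following the edges 0 -> 1 -> 2 and merging the overlapping bases would spell out a single contig.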
The tyranny of repeats● Assembler would output 7 contig sequences ○ path is broken at ambiguous decision points ○ read/pair length limits ability to resolve repeats
Scaffolding ● Use paired reads to join contigs ○ pairs whose mates consistently fall in different contigs suggest those contigs are adjacent ● A difficult constraint problem ○ distance between mates ("insert size") variable ○ repeats cause ambiguous mate placement ○ some assemblers do it, separate scaffolders exist
Contig ordering ● Optical maps ○ wet lab method, real experimental evidence ○ chromosome sized restriction site map ● Align to reference genome ○ fit contigs as best as possible against known reference ○ some contigs will fit if split (DNA rearrangement) ○ expect orphan contigs (novel DNA)
Genome closure ● Finished genome ○ one contig per replicon in original sample ○ bacterial chromosomes/plasmids usually circular ● Labour intensive ○ design primers around gaps, PCR, Sanger ○ Fosmid/BAC libraries for larger inconsistencies ● Why bother? ○ no close reference exists ○ ensures you didn't miss anything ○ understand whole genome architecture ○ simplifies all downstream analysis
Annotation ● Annotation is the process of identifying important features in a genome ○ gene - protein product, promoter, signal sequences ■ ~1000 per Mbp in bacteria, coding dense ○ tRNA - transfer RNA ■ ~30 per bacterium cover all codons (wobble base) ○ rRNA - ribosomal RNA locus ■ 1 to 9 per bacterium, fast vs slow growers ○ And many more... ■ small RNAs, ncRNA, binding sites, tx factors
Annotating proteins ● Homology vs. Similarity ○ homology means shared evolutionary ancestry, which usually implies similar biological function ○ we use sequence similarity as a proxy for homology ○ works well for most situations ● Sequence alignment methods ○ "Exact" - Needleman-Wunsch, Smith-Waterman ○ "Approx" - BLAST, FASTA, and many others! ○ Database is sequences: nr, RefSeq, UniProt ● Sequence profile methods ○ Build an HMM (model) of aligned sequence families ○ HMMER - scan profiles against query protein seq. ○ Database is profiles: Pfam, TIGRfams, FigFam
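As an illustration of the "exact" methods, here is a minimal Smith-Waterman sketch that returns only the best local alignment score. It assumes a simple match/mismatch scheme and a linear gap penalty; the parameter values are arbitrary, and real protein searches use a substitution matrix such as BLOSUM62 and usually affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

The table has len(a) x len(b) cells, which is why exact methods are too slow for searching whole databases and heuristics like BLAST are used instead.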
Curation● Automatic annotation ○ more quality databases and models now ○ but still flawed● Manual curation ○ Essential for a quality annotation ○ Find pseudo, missing, bogus, and broken genes ○ Discover mis-assemblies ○ Correct mis-annotated protein families ○ Fix incorrect start codons ■ Bacteria have 3-5 start codons, not just AUG (M)
Practical Exercise ● Go to URL: dna.med.monash.edu.au/~torsten/tmp/msc ● Install: Artemis and MEGA ● Keep this URL open in a tab. ● Wait for further instructions.
Annotation● Start Artemis ○ open Hendra1994.fa● What is it? ○ Hendra virus - 18 kb viral genome ○ single-stranded negative-sense RNA (not DNA!!) ○ has 6 protein coding regions ("genes")● Task ○ find these genes using Artemis ○ use similarity searching to assign a name to the gene
The official annotation● In Artemis ○ download and open Hendra1994.gbk● Task ○ compare to your annotations● What did you find? ○ methionine (M) start codon (ATG)
DNA vs Protein Similarity● Examine relationships between Paramyxoviridae ○ includes Hendravirus already in Artemis● Open BLAST: http://blast.ncbi.nlm.nih.gov/● For the Hendravirus: ○ use blastn to search nr database for sequences related to the L gene (DNA) ○ use blastp to search nr database for sequences related to the L protein (amino acids) ○ Any observations??
Phylogeny ● Start MEGA ○ download L_para.fas ○ multifasta with "L" proteins from 37 similar viruses ● Task: ○ Load L_para.fas ○ Align sequences (using MUSCLE) ○ Infer tree (minimum evolution method) ○ Examine relationships
Viral strain comparison ● Hendravirus ○ 11 complete genome sequences ○ different hosts (bats, horses) and times (1994 onwards) ● Task ○ Load hendra11.meg into MEGA ○ multiple alignment already done - examine SNPs ○ What is the impact of the nucleotide differences? ■ look at one SNP in detail ■ use Artemis to see if the SNP is in a gene ■ does the SNP change the encoded protein?
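The last question can also be checked by hand with a few lines of code. This sketch assumes the gene is given as an in-frame coding sequence in DNA letters, read in the coding orientation, and uses the standard genetic code; `snp_effect` and its 0-based `pos` argument are illustrative names, not part of Artemis or MEGA.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... (bases T, C, A, G).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def snp_effect(cds, pos, alt):
    """Return (ref_aa, alt_aa, kind) for a SNP at 0-based position pos of cds."""
    start = pos - pos % 3                 # first base of the affected codon
    ref_codon = cds[start:start + 3]
    offset = pos - start
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return ref_aa, alt_aa, kind


print(snp_effect("ATGGCTAAA", 5, "C"))  # third base of the GCT codon
print(snp_effect("ATGGCTAAA", 3, "A"))  # first base of the GCT codon
```

Changes at the third codon position are often synonymous because of the redundancy in the genetic code, while changes at the first or second position usually alter the amino acid.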