Marker Gene Analysis: Best Practices

Talk given by Susan Huse at the QIIME/VAMPS Workshop in Boulder, CO on October 17th, 2012.

Transcript

  • 1. Marker Gene Analysis Best Practices Susan Huse Marine Biological Laboratory / Brown University October 17, 2012
  • 2. Cleaning Data
      Filtering: remove reads that are likely to be low quality overall, with errors throughout the read.
      Quality Trimming: trim nucleotides from the end(s) of the read based on local quality values.
      Denoising: adjust nucleotides that are more likely to be a base-calling error (noise) than a true low-frequency variation (signal).
      Anchor Trimming: trim the end of long amplicons to a conserved location in the SSU alignment.
      Chimera Removal: remove hybrid sequences created during amplification.
  • 3. Recommended 454 Filtering
      •  Exact match to barcode and proximal primer
      •  Optional denoising (currently only 454)
      •  Remove sequences
          –  with Ns
          –  that are too short
          –  below an average or window quality threshold
      •  Trim to distal primer or anchor
          –  Remove sequences without anchor / primer
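The removal rules on this slide can be sketched as a single predicate. This is a minimal illustration, not the VAMPS or QIIME implementation: the function name `passes_filter` and the default thresholds (`min_len=50`, `min_avg_qual=30`) are assumptions chosen for the example, and barcode/primer matching is left out.

```python
def passes_filter(seq, quals, min_len=50, min_avg_qual=30):
    """Return True if a read survives the basic removal rules.

    seq   -- read sequence as a string
    quals -- list of per-base quality scores, same length as seq
    The default thresholds are illustrative assumptions, not recommended values.
    """
    if not quals or len(quals) != len(seq):
        return False                      # malformed record
    if "N" in seq.upper():
        return False                      # remove sequences with Ns
    if len(seq) < min_len:
        return False                      # remove sequences that are too short
    if sum(quals) / len(quals) < min_avg_qual:
        return False                      # remove reads below the average quality threshold
    return True
```

A window-based quality check can be added alongside the whole-read average, since a good start can hide a poor tail.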
  • 4. SSU rRNA Anchor Trimming
      Next-gen sequences often do not reach the distal primer, and reads may have a range of lengths.
      De novo OTU clustering and other sequence comparisons are more consistent if all tags are trimmed to the same start and stop positions in the rRNA alignment.
      Anchor trimming uses a highly conserved location situated within the read length and truncates all reads to that position. Be careful that the anchor is unique and present across all taxa.
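A minimal sketch of anchor trimming, assuming the anchor is given as an exact string and that the read is cut at the end of the anchor. Both choices are assumptions for illustration; a production tool would typically allow mismatches or locate the anchor through an alignment. The name `anchor_trim` is hypothetical.

```python
def anchor_trim(seq, anchor):
    """Truncate a read at the end of the first exact anchor match.

    Returns None when the anchor is absent, so the caller can drop the read
    (the filtering slide says to remove sequences without an anchor).
    """
    idx = seq.find(anchor)
    if idx == -1:
        return None
    return seq[: idx + len(anchor)]
```

Because `find` returns the first match, an anchor that occurs more than once in a read would cut it short, which is one reason the anchor has to be unique.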
  • 5. An Illumina HiSeq Error Distribution
      [Figure: Quality Scores for Error Positions, untrimmed data. x-axis: Quality Score (0 to 40); y-axis: Cumulative Percent of Errors (0% to 100%).]
      80% of error bases have a quality score <= 16.
      Before trimming, most errors have low Q scores.
  • 6. HiSeq Reads with Ns
      NTAGCACCAAACATAAATCACCTCACTTAAGTGGCTGGAGACAAATAATCTCTTTAATAACCTGATTCAGCGAAACCAATCCGCGGCATTTAGTAGCGGTA
      NTAATTACCCCAAAAAGAAAGGTATTAAGGATGAGTGTTCAAGATTGCTGGAGGCCTCCACTATGAAATCGCGTAGAGGCTTTGCTATTCAGCGTTTGATG
      NGCGCCAATATGAGAAGAGCCATACCGCTGATTCTGCGTTTGCTGATGAACTAAGTCAACCTCAGCACTAACCTTGCGAGTCATTTCTTTGATTTGGTCAT
      NGTAAAAATGTCTACAGTAGAGTCAATAGCAAGGCCACGACGCAATGGAGAAAGACGGAGAGCGCCAACGGCGTCCATCTCGAAGGAGTCGCCAGCGATAA
      NTCTATGTGGCTAAATACGTTAACAAAAAGTCAGATATGGACCTTGCTGCTAAAGGTCTAGGAGCTAAAGAATGGAACAACTCACTAAAAACCAAGCTGTC
      CAGTGGAATAGTCAGGTTAAATTTAATGTGACCGTNTNNNNNAATNNNNNNNNNNNNNNNNNNNNNNNCANNNNNTNGNNNNANNNNNTTGAGTGTGAGGT
      CGGATTGTTCAGTAACTTGACTCATGATTTCTTACCTATTAGTGGTTNAACANNNNNNNNNNNNNATAGTAATCCACGCTCTTNTAANATGTCAACAAGAG
      TATGCGCCAAATGCTTACTCAAGCTCAAACGGCTGGTCAGAATTTTACCAATGACCANNNCAAAGAAATGACTCGCAAGGTTAGTGCTGAGGTTGACTTAG
      TAGAAGTCGTCATTTGGCGAGAAAGCTCAGTCTCAGGAGGAAGCGGAGCAGTCCAAANNNTTTTGAGATGGCAGCAACGGAAACCATAACGAGCATCATCT
      TGCTGTTGAGTGGTCTCATGACAATAAAGTATGTCNCTGNNTTGAAGNNTNNNNNNNNNNNNNNNCTNATACAATCACGCNCANNNNNAAAAGTGTCGTGT
      CTACTGCGACTAAAGAGATTCAGTACCTTAACGCTAAAGGTGCTTTGNCTTANNNNNNNNNNNNTGGCGACCCTGTTTTGTATGGCANCTTGCCGCCGCGT
      CGGCAGAAGCCTGAATGAGCTTAATAGAGGCCAAAGCGGTCTGGAAACGTACGGATTNNNNAGTAACTTGACTCATGATTTCTTACCTATTAGTGGTTGAA
      GTGATTTATGTTTGGTGCTATTGCTGGCGGTATTGCTTCTGCTCTTGNTGGTNNCNNNNNNNNNAAATTGTTTGGAGGCGGTCAAAANGCCGCCTCCGGTG
      ATATCAACCACACCAGAAGCAGCATCAGTGACGACATTAGAAATATCCTTTGNAGTNNNNNNNNTATGAGAAGAGCCATACCGCTGATTCTGCGTTTGCTG
      In this dataset:
      •  68 reads contained at least 1 N; of these:
          –  14 (21%) could not be mapped to PhiX
          –  7 of those 14 (50%) had only 1 N
          –  24 (35%) contain more than 1 N
      [Illumina]
  • 7. Minoche Filtering for Illumina
      [Table 2: Expected error rates based on Q-scores (% of bases lost). Rows: No filter; Illumina Chastity (ChF); Low-Quality (B) tails; Ns; <1/3 of nt Q<30 in 1st half; avgQ < 30 in 1st 30% of nt; All filters.]
      Minoche A, et al. 2011. Genome Biology 12: R112, using Beta vulgaris, Arabidopsis thaliana, and PhiX
  • 8. Remaining Errors
      [Figure: Quality Scores for Error Positions, trimmed and untrimmed data. x-axis: Quality Score (0 to 40); y-axis: Pct of Errors (0% to 100%).]
      PCR errors? [Illumina]
  • 9. QIIME Illumina Pipeline
      •  Single mismatch to barcode
      •  Trim read to last position above quality threshold q
      •  Remove sequences less than length threshold p
      •  Remove sequences with more than n Ns
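The trimming and removal steps on this slide can be sketched as follows. This is one reading of the slide, not QIIME's code: the read is truncated just before the first base whose quality is not above `q`, and the function name `trim_and_filter` and the default values of `q`, `p` and `n` are assumptions for illustration. Barcode matching is omitted.

```python
def trim_and_filter(seq, quals, q=20, p=75, n=0):
    """Quality-truncate a read, then apply length and N-count checks.

    q -- quality threshold; the read is cut before the first base with quality <= q
    p -- minimum length after trimming
    n -- maximum number of Ns allowed after trimming
    Returns the trimmed sequence, or None if the read is rejected.
    """
    cut = len(seq)
    for i, qual in enumerate(quals):
        if qual <= q:
            cut = i                       # first base that fails the threshold
            break
    trimmed = seq[:cut]
    if len(trimmed) < p:
        return None                       # too short after trimming
    if trimmed.upper().count("N") > n:
        return None                       # too many ambiguous bases
    return trimmed
```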
  • 10. Paired-End Filtering
      A small insert size allows for sequence overlap.
      [Diagram: Read 1 (forward) and Read 2 (reverse), with the area of sequence overlap marked.]
      Keep only reads that match exactly throughout the region of overlap.
      Amplicons designed to completely overlap (e.g., V6) ensure the highest quality sequences.
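For the completely overlapping case (e.g., V6), the exact-match rule reduces to comparing read 1 with the reverse complement of read 2. The sketch below assumes both reads cover the same amplicon end to end with primers already removed; partial overlaps would need an alignment step that is not shown. The function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Reverse-complement a DNA string (A, C, G, T, N only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]

def merge_if_identical(read1, read2):
    """Keep a fully overlapping pair only if the two reads agree exactly.

    Returns the forward sequence when read 1 equals the reverse complement
    of read 2, otherwise None.
    """
    forward = read1.upper()
    if forward == reverse_complement(read2):
        return forward
    return None
```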
  • 11. But Variation Still Exists
      E. coli K-12 V6 paired end with complete perfect overlap
      [Figure: sequence logo of positions 1 to 59, 5′ to 3′ (weblogo.berkeley.edu).]
      Is this:
      1.  systematic bidirectional sequencing error (unlikely),
      2.  PCR error, or
      3.  natural variation?
  • 12. What are Chimeras and How do we find them?
  • 13. [Diagram: chimera formation during PCR]
      •  PCR primer anneals to complementary target
      •  Extension creates double-stranded amplicon
      •  But… premature dissociation terminates elongation
      •  The incomplete strand binds to a different template at a conserved region…
      •  …then extends to create a chimera
      •  The chimera can act as a template during the next PCR round.
  • 14. Chimera Detection
      1.  Look for the best match to the left (left parent) [Diagram: Parent A aligned to the left of the Chimeric Read]
      2.  Look for the best match to the right (right parent) [Diagram: Parent B aligned to the right of the Chimeric Read]
      3.  Compare the distance between the two parents: are they really different, or multiple entries for the same organism? [Diagram: Parent A vs. Parent B]
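The three steps above can be sketched in a few lines, under strong simplifying assumptions: the query and all candidate parents are already aligned to the same length, the breakpoint is supplied by the caller, and distance is a plain mismatch count. Programs such as UChime search for the breakpoint and score the model; none of that is shown. The names `chimera_check` and `min_parent_dist` are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def chimera_check(query, refs, breakpoint, min_parent_dist=3):
    """Find left and right parents of a query and decide whether it looks chimeric.

    refs -- dict mapping a parent name to its aligned sequence
    Returns (left_parent, right_parent, parent_distance, is_chimera).
    """
    # Step 1: best match to the left part of the query.
    left = min(refs, key=lambda name: hamming(query[:breakpoint], refs[name][:breakpoint]))
    # Step 2: best match to the right part of the query.
    right = min(refs, key=lambda name: hamming(query[breakpoint:], refs[name][breakpoint:]))
    # Step 3: are the two parents really different from each other?
    parent_dist = hamming(refs[left], refs[right])
    is_chimera = left != right and parent_dist >= min_parent_dist
    return left, right, parent_dist, is_chimera
```

With reference-based detection `refs` would come from a reference set; with de novo detection it would hold the more abundant sequences from the same amplification.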
  • 15. Detection methods differ by source of parents
      1.  Reference Comparison: check against known reference sequences
      2.  De novo detection: check all triplets in your amplification
  • 16. Reference Comparison: only as good as the Ref Set
      •  Can only find parents if they are in the Ref Set
      •  Any chimeras in the Ref Set are deleterious!
      •  Sparse Ref Set may not detect chimeras from closely related organisms (intra-genera, intra-species)
      •  Differential density of the Ref Set can create biases
      •  Poor matches to the Ref Set can be mistaken for chimeras
      •  Hard to detect if parents are similar, but may not matter
  • 17. De Novo Pros and Cons
      •  Can detect parents not in the Ref Set: novel, close neighbors, PCR errors, unexpected amplifications
      •  Must be run by amplification, i.e., by tube: all your parents, but only your parents
      •  Abundance profile can be tricky with a long tail
      •  Early False Positives (parent is lost to the Ref Set) and False Negatives (chimera is added to the Ref Set) will affect downstream calls
      We use both de novo and ref
  • 18. Rates of Chimera Formation in BPC Datasets
      Percent chimeric as a function of total reads (not unique sequences), across various datasets.
      [Figure: x-axis: Percent of Reads that are Chimeric (0% to 50%); y-axis: Percent of Datasets (0% to 70%); series: V6V4, V3V5.]
  • 19. Chimera detection programs optimized for short reads
      •  UChime (in USearch, QIIME and VAMPS)
      •  Perseus (in AmpliconNoise and mothur)
  • 20. Aggregating
      Downstream analytical techniques that compensate for inaccuracies in the remaining sequence data.
      Taxonomic assignments will generally remain the same despite a few mismatches, more so at coarser taxonomic levels (class vs. genus).
      OTU clustering can round out small percentages of errors, depending on the algorithm used. Clustering at 3% can (but does not always!) aggregate sequences with 1 – 2% errors.
      Note: “Aggregating” is not accepted terminology in the field.
  • 21. Taxonomic Filtering
      In addition to the knowledge base associated with taxonomic names:
      •  Can filter many unintended PCR amplification products.
      •  Reads too far from the tree can be classified as “Unknown” and examined further.
      •  Important to map reads to all domains, not just Bacteria; primers can amplify across domains and organelles.
  • 22. Amplification of other Domains
      SSU region   Total Reads   Archaea   Bacteria   Organelle   Unknown
      V6           529,359       0.02%     96%        4%          0.1%
      V6-V4        3,437,855     0.3%      87%        8%          4%
      Samples from Little Sippewissett Marsh. Organelles include mitochondria and chloroplasts.
  • 23. Non SSU rRNA Amplification
      [Figure: gene map with labels: Conserved inner membrane protein; cardiolipin synthase; DNA binding transcriptional dual regulator, tyrosine-binding; Predicted antibiotic transporter; 16S rRNA; Putative transport system permease protein; 16S rRNA; Predicted major pilin subunit.]
      Thank you, Hilary
  • 24. Taxonomy
      GAST: Global Alignment of Sequence Taxonomy
          Uses sequence alignment to compare against a RefSet
          Distance = alignment distance to nearest RefSet sequence
          (SILVA, Greengenes, Stajich Refs, UNITE, HOMD, etc.) (VAMPS)
      RDP: Ribosomal Database Project
          Uses k-mer matching to find nearest genus
          Bootstrap values reflect confidence in the assignment
          (RDP Training set, Greengenes, etc.) (QIIME, VAMPS)
  • 25. Sources of Error in Taxonomic Analyses
      •  Primer bias
      •  Chimeras
      •  Discovery of novel 16S
      •  Unrepresented in reference database
      •  Low-quality references
      •  Taxonomy not available
      •  Incorrect taxonomy in RefSet
      •  Ambiguous hypervariable sequence (>1 hit)
      •  RefSets often biased toward most studied
  • 26. Creating OTUs: Operational Taxonomic Units for taxonomy independent analyses
  • 27. OTUs vs Taxonomy
      •  Novel organisms
      •  Many unnamed organisms
      •  Some clades only defined to phylum or class
      •  Many species names based on phenotype rather than genotype
      •  Do not lump together all 16S “unknowns” or diverse, partially classified sequences.
  • 28. Clustering Algorithms
      Different clustering algorithms can have very different effects on the size and number of OTUs created…
  • 29. Clustering Methods
      De novo (open)
      •  greedy clusters - test sequentially and incorporate sequence into first qualifying OTU. Dependent on input order.
      •  average linkage - the average distance from a sequence to every other sequence in the OTU is less than the width. Dependent on input order. [complete and single linkage are other methods]
      Reference (closed)
      •  greedy - map each sequence to representative sequences defining prebuilt clusters
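A toy version of greedy de novo clustering makes the input-order dependence concrete. It assumes equal-length, pre-aligned sequences and uses the fraction of mismatching positions as the distance; real tools use alignment-based identity and usually sort by abundance first. The function names are hypothetical.

```python
def distance(a, b):
    """Fraction of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def greedy_cluster(seqs, width=0.03):
    """Assign each sequence to the first OTU whose seed is within `width`.

    Returns a list of (seed, members) tuples. The first sequence that fits
    no existing OTU becomes the seed of a new one, so the result depends on
    the order of `seqs`.
    """
    clusters = []
    for seq in seqs:
        for seed, members in clusters:
            if distance(seq, seed) <= width:
                members.append(seq)       # first qualifying OTU wins
                break
        else:
            clusters.append((seq, [seq])) # no OTU qualified: start a new one
    return clusters

# Three toy sequences: b is one mismatch from both a and c; a and c are two apart.
a, b, c = "AAAAAAAAAA", "AAAAAAAAAC", "AAAAAAAACC"
otus_abc = greedy_cluster([a, b, c], width=0.1)   # a seeds first
otus_bac = greedy_cluster([b, a, c], width=0.1)   # b seeds first
```

With `a` first, `c` is too far from the seed and starts its own OTU; with `b` first, all three fall into one OTU.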
  • 30. The Problem of OTU Inflation
      De novo clustering algorithms return more OTUs than predicted for mock communities.
      OTU inflation leads to:
      •  alpha diversity inflation
      •  beta diversity inflation
      Where does this inflation come from?
      •  residual sequencing errors,
      •  chimeras,
      •  multiple sequence alignments,
      •  clustering algorithms
  • 31. Rarefaction, Sample Size under OTU Inflation
      [Figure: PML rarefaction, M2FN PML MS-CL. x-axis: Number of Sequences Sampled (0 to 120,000); y-axis: OTUs (0 to 7,000); curves: 5K, 10K, 15K, 20K, 50K, 100K.]
  • 32. Rarefaction, Sample Size with minimal OTU Inflation
      [Figure: PML SLP-PW-AL.]
  • 33. Cluster to Reference
      1.  Create a comprehensive set of Cluster Representatives (e.g., new Greengenes) representing the breadth of Bacteria
      2.  Assign each sequence to ClusterRep <= W
      3.  If Seq is not a member of any cluster, set aside
      4.  Cluster de novo the set of extra-cluster sequences
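Steps 2 and 3 can be sketched as below, assuming equal-length pre-aligned sequences, a mismatch-fraction distance, and assignment to the nearest representative within width W. The names `assign_to_reference` and `cluster_reps` are hypothetical; step 4 would pass the returned leftovers to any de novo clusterer.

```python
def distance(a, b):
    """Fraction of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def assign_to_reference(seqs, cluster_reps, width=0.03):
    """Map each sequence to its nearest cluster representative within `width`.

    cluster_reps -- dict mapping a static cluster ID to its representative sequence
    Returns (assigned, leftovers): a list of (sequence, cluster_id) pairs and the
    list of sequences that matched no cluster and are set aside.
    """
    assigned, leftovers = [], []
    for seq in seqs:
        best_id = min(cluster_reps, key=lambda cid: distance(seq, cluster_reps[cid]))
        if distance(seq, cluster_reps[best_id]) <= width:
            assigned.append((seq, best_id))
        else:
            leftovers.append(seq)         # set aside for de novo clustering
    return assigned, leftovers
```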
  • 34. Advantages of clustering to full-length reference
      •  Not as prone to OTU inflation
      •  Can add new data as available
      •  Provides static Cluster IDs
          –  Can be used to compare short reads from different regions (V3-V5 and V6)
          –  Can compare with other projects using the same Ref Set
  • 35. Oligotyping
      •  Further differentiation within closely related organisms (e.g., genus)
      •  Rather than blanket 3% clustering, select sequence positions with the most information (Shannon Entropy)
      [Figure: Fusobacterium oligotypes across oral sites: supragingival plaque, subgingival plaque, hard palate, keratinized gingiva, buccal mucosa, tongue dorsum, tonsils, saliva, throat.]
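The position-selection idea can be illustrated by computing Shannon entropy for each alignment column and keeping the highest-scoring ones. This is only a sketch of that single step under the assumption of equal-length aligned reads; it is not the oligotyping pipeline, and the function names are hypothetical.

```python
import math
from collections import Counter

def column_entropy(aligned_seqs):
    """Shannon entropy (in bits) of each column of an alignment."""
    entropies = []
    for column in zip(*aligned_seqs):
        total = len(column)
        counts = Counter(column)
        h = -sum((c / total) * math.log2(c / total) for c in counts.values())
        entropies.append(h)
    return entropies

def informative_positions(aligned_seqs, k):
    """Indices of the k columns with the highest entropy."""
    ent = column_entropy(aligned_seqs)
    return sorted(range(len(ent)), key=lambda i: ent[i], reverse=True)[:k]
```

Columns where every read agrees score zero and carry no information for separating closely related organisms.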
  • 36. “But I’m not interested in the rare biosphere, only the major players. Can’t I just remove the low abundance OTUs?”
  • 37. [Figure: rank-abundance curves, Count in OTU vs. OTU Rank. Annotations: a small number of highly abundant organisms; a large number of low abundance organisms (Rare Biosphere).]
      Consistent community profile across samples and environments.
      Sogin et al., 2006. Microbial diversity in the deep sea and the underexplored “rare biosphere”. PNAS 103: 12115-12120
  • 38. Distribution of OTU relative abundances across 210 HMP stool samples Huse et al. (2012) PLoS ONE
  • 39. Distribution of OTU Absolute Abundances in English Channel Water Samples
      [Figure: OTU frequency in PML samples, binned as Absent, Singleton, Doubleton, 3-5, 6-10, 11-50, 51-500, >500.]
  • 40. Everything may not be everywhere, but everything is rare somewhere!
      If you feel you must remove low abundance OTUs, don’t do it until you have clustered ALL of your samples.
  • 41. Alpha and Beta Diversity: Impacts of Sampling Depth and Diversity Algorithm
  • 42. Alpha Diversity - Richness
      [Figure: richness estimate (0 to 1,800) vs. sampling depth (0 to 40,000). Series: CL - ACE, SLP - ACE, CL - Chao, SLP - Chao, 1 in 5000, 1 in 2500, 1 in 1000, 1 in 500.]
      Alpha diversity metrics are sensitive to cluster method, sequencing depth and rare OTUs.
  • 43. Sampling Depth and Alpha Diversity
      [Figure: Diversity (0 to 5) vs. Sampling Depth (0 to 45,000). Series: SLP - NPShannon, SLP - Simpson, CL - NPShannon, Simpson.]
      Robust to both singletons and depth.
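To see why richness estimators react to rare OTUs while evenness-weighted indices barely move, it helps to write the formulas out. The sketch below uses the bias-corrected Chao1 estimator, the plain Shannon index (not the non-parametric NPShannon estimator named in the figure), and Simpson diversity as 1 minus the sum of squared proportions; these particular forms are assumptions for illustration, and other tools define Simpson differently.

```python
import math

def chao1(counts):
    """Bias-corrected Chao1 richness: S_obs + F1*(F1-1) / (2*(F2+1))."""
    observed = sum(1 for c in counts if c > 0)
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))

def shannon(counts):
    """Shannon index H = -sum(p * ln p) over OTUs with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def simpson_diversity(counts):
    """Simpson diversity as 1 - sum(p^2)."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)
```

Chao1 is driven directly by the singleton and doubleton counts, which is where residual errors and chimeras accumulate; Simpson is dominated by the abundant OTUs.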
  • 44. Comparing Different Sampling Depths
      The “population” is a set of 50,000 reads from one sample.
      The “samples” are randomly-selected subsets of sizes: 1,000; 5,000; 7,500; 10,000; 15,000; 20,000; 25,000.
      Calculate subsample diversity estimates across subsample depths, all of which represent the same population.
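The subsampling experiment can be sketched with the standard library, assuming the population is given as OTU counts and that reads are drawn without replacement. The function name `subsample` and the fixed seed are assumptions made for reproducibility of the example.

```python
import random
from collections import Counter

def subsample(otu_counts, depth, seed=0):
    """Draw `depth` reads without replacement from a table of OTU counts.

    otu_counts -- dict mapping an OTU ID to its read count in the full sample
    Returns a Counter of OTU counts in the subsample.
    """
    pool = [otu for otu, n in otu_counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("depth exceeds the number of reads in the sample")
    rng = random.Random(seed)
    return Counter(rng.sample(pool, depth))
```

Calling this at each depth (and with several seeds for replicates) yields the subsample tables on which the diversity estimates are compared.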
  • 45. Community Distance of Subsamples
      [Figure: Community Distance (0 to 0.12) across replicates. Series: Bray-Curtis (1K), Bray-Curtis (5K), Morisita-Horn (1K), Morisita-Horn (5K).]
      Subsample 1,000 and 5,000 reads from a sample of 50,000 reads; pairwise distances for replicates at a single depth.
  • 46. Effect of Sample Depth - Bray Curtis
      [Figure: pairwise distance (0 to 1.0) between subsample depths of 1,000 to 25,000. Annotation: Nearly 100% Different.]
      Bray Curtis uses absolute counts; intra-community distances are high as depths diverge.
  • 47. Effect of Sample Depth - Morisita Horn
      [Figure: pairwise distance (0 to 0.009) between subsample depths of 1,000 to 25,000. Annotation: Nearly 0.5% Different.]
      Beta diversity metric that uses relative abundances and compensates for different sample sizes. Distances are low across depths above the minimum sampling depth.
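The contrast between the two metrics follows directly from their formulas. The sketch below uses the usual definitions (Bray-Curtis on raw counts; Morisita-Horn expressed as one minus the similarity), with a toy pair of count vectors that share identical proportions but differ tenfold in depth. The vectors are made-up illustrative values, not data from the talk.

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity on absolute counts."""
    shared = sum(min(a, b) for a, b in zip(x, y))
    return 1 - 2 * shared / (sum(x) + sum(y))

def morisita_horn(x, y):
    """Morisita-Horn dissimilarity (1 - similarity)."""
    nx, ny = sum(x), sum(y)
    dx = sum(a * a for a in x) / nx ** 2
    dy = sum(b * b for b in y) / ny ** 2
    similarity = 2 * sum(a * b for a, b in zip(x, y)) / ((dx + dy) * nx * ny)
    return 1 - similarity

# Same community profile sampled at two depths (illustrative counts).
shallow = [100, 50, 10]
deep = [1000, 500, 100]
bc = bray_curtis(shallow, deep)     # large, driven purely by the depth difference
mh = morisita_horn(shallow, deep)   # zero, because the proportions are identical
```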
  • 48. SLP Clustering and Bray-Curtis
      [Figure: PCoA, PC 1 vs. PC 2; points for depths 1,000, 2,000, 5,000, 7,500, 10,000, 15,000, 20,000, 25,000, 30,000, 40,000.]
      Bray-Curtis PCoA clusters entirely on depth (each point represents 10 atop one another).
  • 49. [Slide title and plot garbled in extraction]
      Minimum sample depth here of 10,000, but will be a function of the diversity of the sample.
  • 50. Acknowledgements
      The Josephine Bay Paul Center for Comparative Molecular Biology and Evolution
      Mitch Sogin, Andy Voorhis, Anna Shipunova, David Mark Welch, A. Murat Eren, Hilary Morrison, Joe Vineis, Sharon Grim
  • 51. Why filter infrequent errors?
      Ns          Average Error Rate   Errors / 400nt   Percent of Reads
      0 or more   0.40%                1.6              100%
      0           0.40%                1.6              99.3%
      If we include all reads, with or without Ns, we have an overall error rate of 0.4%.
      If, however, we remove the <1% of sequences with Ns, we still have an overall error rate of 0.4%.
      Why bother?? [454]
  • 52. Why filter infrequent errors?
      Ns   Average Error Rate   Errors / 400nt   Percent of Reads
      0    0.40%                1.6              99.3%
      1    1.11%                3.1              0.57%
      2    3.81%                8.7              0.1%
      3    7.26%                16.5             0.0%
      4    8.40%                19.2             0.0%
      5    10.46%               25.1             0.0%
      It’s not just about improving the overall error rate, but about removing spurious data.
      Low-quality reads can be interpreted as unique organisms: 0.7% of 500,000 reads = 3,500 “unique organisms”
  • 53. 454 Error Distribution
      [Figure: Distribution of errors in short reads (<100nt). Annotation: Most reads contain no errors at all.]
      454 errors are not evenly distributed among reads: many reads have only a small number of errors, and a small number of reads have many errors. [454]
  • 54. A good beginning can mask a bad end
      For a 450 nt read:
      if the first 400 nt average 35 and the last 50 average 0: avg qual = ((400*35) + (50*0)) / 450 = 31
      if the first 350 nt average 35 and the last 100 average 25: avg qual = ((350*35) + (100*25)) / 450 = 33
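The first case on this slide can be checked numerically, together with a sliding-window minimum that exposes the bad tail the whole-read average hides. The window size of 50 and the function names are assumptions for illustration.

```python
def mean_quality(quals):
    """Average quality over the whole read."""
    return sum(quals) / len(quals)

def min_window_quality(quals, window=50):
    """Lowest mean quality over any run of `window` consecutive bases."""
    if len(quals) <= window:
        return mean_quality(quals)
    running = sum(quals[:window])
    lowest = running
    for i in range(window, len(quals)):
        running += quals[i] - quals[i - window]   # slide the window by one base
        lowest = min(lowest, running)
    return lowest / window

# 450 nt read: first 400 nt at Q35, last 50 nt at Q0.
quals = [35] * 400 + [0] * 50
overall = mean_quality(quals)          # about 31, passes an average >= 30 filter
worst = min_window_quality(quals, 50)  # 0, fails any window threshold
```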
  • 55. Longer reads, pushing the limits
  • 56. 454 Filter Summary
      Filter              Percent of Reads   Average Error Rate   Average Errors / 400 nt
      N=0                 99%                0.40%                1.6
      N>=1                1%                 0.91%                3.6
      Exact Primer        95%                0.38%                1.5
      Not Exact Primer    5%                 0.84%                3.4
      Average Qual >=30   98%                0.90%                3.6
      Average Qual <30    2%                 1.3%                 5.2
      [454]
  • 57. 454 Filter Summary (cont)
      Filter                         Percent of Reads   Average Error Rate   Average Errors / 400 nt
      Read Length (500 - 600nt)      99+%               0.39%                1.6
      Read Length (<500, >600 nt)    0.1%               1.8%                 7.2
      Filtered                       93%                0.36%                1.4
      Unfiltered                     7%                 0.64%                2.6
      [454]
  • 58. Evaluating Chimeras (USearch)
      [Alignment display with rows: Parent A, Query, Parent B]
      Diffs: A,B: Q matches expected P; a,b: Q matches other P; p: A=B!=Q
      Votes: + for Model, 0 neutral, ! against Model
      Model: shows extent of Parent A and Parent B; xxxx is overlap matching A&B
  • 59. Initial Length: 277
      [Screenshot labels: Extent of your sequence; Extent of your match; Click on the bar to see the alignment]
  • 60. Check for left and right parents:
      BLAST the left (1 - 175)
      BLAST the right (175 - 277)
  • 61. Positions 1 - 175: 100% match to Fusobacterium
      Positions 175 - 277: 100% match to Pseudomonas
  • 62. Taxonomic Names
      •  Bergey’s Taxonomic Outline – manual of taxonomic names for bacteria
      •  List of Prokaryotic names with Standing in Nomenclature (vetting process)
      •  NCBI – similar taxonomy, but multiple “subs” (subclass, suborder, subfamily, tribe)
      •  Archaea – a work in progress…
      •  Fungi – another work in progress…
  • 63. Cluster “Width”
      Diameter: sequences are never more than D apart. (CL)
      Radius: sequences are never more than R from seed. (SL, AL, Gr)
  • 64. Average Linkage collapses errors
      [Figure: Cluster Count: 1]
      Clusters tend to be heavily dominated by their most abundant sequence, which strongly weights the average and smooths the noise.
  • 65. Still lose outlier sequencing errors
      Multiple sequencing errors still not clustered
  • 66. Inflation in Action: Multiple Sequence Alignment and Complete Linkage clustering
      1,042 is a few more than the expected 2
  • 67. Example MSA
      Regardless of clustering algorithm, an MSA cannot fully align tags whose sequences are too divergent.
      [Figure: alignment of 18,156 sequences and 392 positions]
  • 68. Relative Inflation
      Absolute number of errant OTUs will increase with sample size.
      Relative number of errant OTUs will decrease with sample complexity.
  • 69. The Magical 3%
      3% SSU OTUs = Species and 6% SSU OTUs = Genera: NOT!
  • 70. Clustering Questions
      •  How meaningful are clusters functionally?
      •  When is a rare sequence truly rare and when is it an error?
      •  Should it be included in an existing cluster or start its own?
      •  How to place sequences if OTUs overlap?
      •  What is the effect of residual low quality data or chimeras?
      •  How sensitive are alpha and beta diversity estimates to clustering results?