Amplicon sequencing slides - Trina McMahon - MEWE 2013

  • Decision point: which criteria to use for clustering?

    1. 1. MEWE Workshop: Principles, potential, and limitations of novel molecular methods in water engineering; from amplicon sequencing to omics methods
        Programme
        9:00 Introduction, Per Halkjær Nielsen, Aalborg University
        9:10 Amplicon sequencing, Trina McMahon, University of Wisconsin-Madison
        10:10 Importance of a curated 16S database, Aaron Saunders, Aalborg University
        10:40 Break
        11:00 DNA extraction and primer selection, Søren Karst, Aalborg University
        11:30 Discussion in groups/questions
        12:15 Lunch
    2. 2. MEWE Workshop: Principles, potential, and limitations of novel molecular methods in water engineering; from amplicon sequencing to omics methods
        12:15 Lunch
        13:15 Metagenomics, principles, potential and problems, Mads Albertsen, Aalborg University
        14:30 Metatranscriptomics, principles, potential and problems, Rohan Williams, SCELSE, Singapore
        15:30 Break
        15:45 Informatics and data management, Trina McMahon, University of Wisconsin-Madison
        16:15 Discussion in groups/questions
        17:00 Closing, Per Halkjær Nielsen
    3. 3. Amplicon Sequencing Trina McMahon University of Wisconsin – Madison (standing in for Pat Schloss)
    4. 4. What is amplicon sequencing? Anything that requires PCR-based amplification of a specific target gene (locus)
    5. 5. First things first • What is your question or hypothesis? • How can you answer your question or test your hypothesis using the smallest amount of resources? – Replication – Treatments/controls – Time series – Collection effort (depth of sampling)
    6. 6. Principles • Choice of locus – SSU/16S rRNA gene – “Functional” genes (amoA, ppk1, narG, napA, nifH) • Choice of sequencing approach – Clone libraries and Sanger sequencing – Barcoded/multiplexed 454 pyrosequencing – Barcoded/multiplexed Illumina • Choice of primers – Depends on the above two choices! • Choice of data analysis pipeline – Software – Taxonomy training set
    7. 7. >SBR1A21 GGCTACCTTGTTACGACTTCACCCCAGTCACGAACCCTGCCGTGGTAATCGCC CTCCTTGCGGTTGGCTAACTACTTCTGGCAGAACCCGCTCCCATGGTGTGACG GGCGGTGTGTACAAGACCCGGGAACGTATTCACCGCGACATGCTGATCCGCG ATTACTAGCGATTCCGACTTCACGCAGTCGAGTTGCAGACTGCGATCCGGACT ACGATCGGCTTTCTGAGATTGGCTCCCCCTCGCGGGTTGGCAACCCTCTGGAC CGACCATTGTATGACGTGTGAAGCCCTACCCATAAGGGCCATGAGGACTTGA CGTCATCCCCACCTTCCTCCGGTTTGTCACCGGCAGTCTCATTAAAGTGCCCA ACTGAATGATGGCAATTAATGACAAGGGTTGCGCTCGTTGCGGGACTTAACC CAACATCTCACGACACGAGCTGACGACAGCCATGCAGCACCTGTGTTCAGGC TCTCTTGCGAGCACTCCCAAATCTCTTCAGGATTCCTGACATGTCAAGGGTAG GTAAGGTTTTTCGCGTTGCATCGAATTAATCCACATCATCCACCGCTTGTGCG GGTCCCCGTCAATTCCTTTGAGTTTTAGCCTTGCGGCCGTACTCCCCAGGCGG TCAACTTCACGCGTTAGCTACGGCACTAAAAGGTTTAACCCTCCCAACACCTA GTTGACATCGTTTAGGGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCCC CACGCTTTCGTGCATGAGCGTCAGTATTGGCCCAGGGGGCTGCCTTCGCCATC GGTGTTCCTCCACATCTCTACGCATTTCACTGCTACACGTGGAATTCCACCCC CCTCTGCCAAACTCCAGCCTGGCAGTCTCAAATGCAGTTCCCAGGTTGAGCCC GGGGATTTCACATCTGACTTACCAAACCGCCTGCGCACGCTTTACGCCCAGTA ATTCCGATTAACGCTCGCACCCTACGTATTACCGCGGCTGCTGGCACGTAGTT AGCCGGTGCTTCTTATTCGGGTACCGTCATCTACACAGGGTATTAACCCGTGC AATTTCTTCCCCGCCGAAAGAGCTTTACAACCCGAAGGCCTTCTTCACTCACG CGGCATGGCTGGATCAGGCTTCCGCCCATTGTCCAAAATTCCCCACTGCTGCC TCCCGTAGGAGTCTGGGCCGTGTCTCAGTCCCAGTGTGGCGGATCATCCTCTC AGACCCGCTACGGATCGTCGCCTTGGTAGGCCTTTACCCCACCAACTAGCTA ATCCGACATCGGCCGCTCCCAGAGCGCAAGGTCTTGCGATCCCCTGCTTTCCT GCTCACAGAATATGCGGTATTAGCGTAACTTTCGCTACGTTATCCCCCACTCC AGGATACGTTCCGATGCTTTACTCACCCGTCCGCCACTCGCCACCAGGGTTGC CCCCGTGCTGCCGTTCGACTTGCATGTGTAAGGCATGCCGCCAGCGTTTAATC TGAGCCAGGATCAAACTCT ~ 1400 bases of SSU rDNA from EBPR reactor
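    8. 8. Unaligned sequences
        Seq 1  ..AGCCCUGGUCGCA..
        Seq 2  ..ACCCCUGGACUGUCGGA..
    9. 9. Aligned sequences (gap inserted in Seq 1)
        Seq 1  ..AGCCCUG----GUCGCA..
        Seq 2  ..ACCCCUGGACUGUCGGA..
    10. 10. Aligned sequences with matches (|), mismatches (x), and gaps (-) marked
        Seq 1  ..AGCCCUG----GUCGCA..
               ..|x|||||----||||x|..
        Seq 2  ..ACCCCUGGACUGUCGGA..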
    11. 11. Sample alignment
    12. 12. Distance (or “difference”) matrix
        Upper triangle: fractional difference. Lower triangle: fractional identity.
             A    B    C    D    E
        A    --   0.1  0.2  0.2  0.4
        B    0.9  --   0.2  0.2  0.4
        C    0.8  0.8  --   0.1  0.4
        D    0.8  0.8  0.9  --   0.4
        E    0.6  0.6  0.6  0.6  --
        Note: difference = 1 - (identity)
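The relationship difference = 1 - (identity) can be checked on the two-sequence alignment shown on slide 10. The following is a minimal sketch, not the method of any particular package: the function name `fractional_identity` is made up for illustration, and it assumes that columns containing a gap are skipped rather than counted as differences (other conventions exist and give different numbers).

```python
def fractional_identity(aligned_a, aligned_b, gap="-"):
    """Return (matches, compared, identity) for two aligned sequences.

    Assumption: columns where either sequence has a gap are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip gapped columns
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return matches, compared, matches / compared


# The aligned pair from slide 10.
seq1 = "AGCCCUG----GUCGCA"
seq2 = "ACCCCUGGACUGUCGGA"

matches, compared, identity = fractional_identity(seq1, seq2)
difference = 1 - identity
print(matches, compared, round(identity, 3), round(difference, 3))
```

With this gap convention the pair has 11 matches over 13 compared columns. Counting the four gapped columns as differences instead would lower the identity, which is one reason distance matrices from different tools are not always comparable.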
    13. 13. Distance matrix (upper triangle: fractional difference; lower triangle: fractional identity)
             A    B    C    D    E
        A    --   0.1  0.2  0.2  0.4
        B    0.9  --   0.2  0.2  0.4
        C    0.8  0.8  --   0.1  0.4
        D    0.8  0.8  0.9  --   0.4
        E    0.6  0.6  0.6  0.6  --
    14. 14. The Big Tree Pace, 1997, Science, 276:734
    15. 15. Ashelford K E et al. Appl. Environ. Microbiol. 2005;71:7724-7736 PMID: 12692101 Certain regions of the 16S rRNA vary more in sequence than others So-called “hyper-variable regions” are targeted by tag sequencing primer sets
    16. 16. Regions of interest within 16S rRNA gene (V3, V4, V5) • Amplicon lengths: V4: 253 bp, V34: 429 bp, V45: 375 bp • Amount of overlap for 2x250 bp reads: V4: 247 bp, V34: 71 bp, V45: 125 bp
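The overlap figures on this slide follow from simple arithmetic: two 250 bp reads cover 500 bp in total, so whatever exceeds the amplicon length is sequenced twice. A small sketch of that calculation is below; the function name `paired_end_overlap` is illustrative, and it assumes both reads are full length and that the quoted amplicon lengths are what the reads actually span.

```python
def paired_end_overlap(amplicon_bp, read_bp=250):
    """Bases covered by both reads of a pair spanning one amplicon.

    Assumes two full-length reads starting from opposite ends.
    """
    overlap = 2 * read_bp - amplicon_bp
    # Overlap cannot be negative, and cannot exceed the amplicon itself.
    return max(0, min(amplicon_bp, overlap))


# Amplicon lengths taken from the slide.
amplicons = {"V4": 253, "V34": 429, "V45": 375}
overlaps = {region: paired_end_overlap(bp) for region, bp in amplicons.items()}
print(overlaps)
```

A larger overlap means more positions are read twice, which is the usual argument for the V4 region when error correction from overlapping reads matters.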
    17. 17. Sample gDNA → amplified PCR product with barcode → sequencer → ~10^6 – 10^9 barcoded reads → sequences sorted by sample of origin
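The last step on this slide, sorting barcoded reads by sample of origin, can be sketched as follows. This is a toy illustration only: the barcodes, sample names, and reads are invented, the barcode is assumed to sit at the start of each read, and only exact matches are accepted (real pipelines typically tolerate some barcode errors).

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample):
    """Group reads by sample using an exact-match barcode at the read start."""
    lengths = {len(barcode) for barcode in barcode_to_sample}
    if len(lengths) != 1:
        raise ValueError("this sketch assumes equal-length barcodes")
    barcode_len = lengths.pop()

    by_sample = defaultdict(list)
    unassigned = []
    for read in reads:
        sample = barcode_to_sample.get(read[:barcode_len])
        if sample is None:
            unassigned.append(read)  # barcode not recognised
        else:
            by_sample[sample].append(read[barcode_len:])  # strip the barcode
    return dict(by_sample), unassigned


# Hypothetical barcodes and reads, for illustration only.
barcodes = {"ACGT": "sample_1", "TGCA": "sample_2"}
reads = ["ACGTGGCTACCTTG", "TGCAAATGGTACCC", "ACGTCTGGGCCGTG", "GGGGTTGGACCGTG"]

by_sample, unassigned = demultiplex(reads, barcodes)
print(by_sample)
print(unassigned)
```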
    18. 18. Overview workflow (generic)
    19. 19. >GQY1XT001A6MUA AATGGTACCCGTCAATTCATTTGAGTTTCATTCTTGCGAACGTACTCCCCAGGTGG ATCACTTACTGCGTTTGCTGCGGCACCGGAGGTTCTTGAACCCCCGACACCTAGT GATCATCGTTTACGGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCCCCACG CTTTCGAGCCTCAACGTCAGTTACAGTCCAGTAAGCCGCCTTCGCCACTGGTGTT CCTCCTAATATCTACGCATTTCACCGCTACACTAGGAATTCCACTTACCTCTCCTGC ACTCCAGTCATACAGTTTCCAATG >GQY1XT001BTRWS AATGGTACCCGTCAATTCCTTTGAGTTTCATTCTTGCGAACGTACTCCCCAGGTGG ATTACTTAATGCGTTTGCGGCGGCACCGGAGGGCCTTGGCCCCCCGACACCTAG TAATCATCGTTTACGGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCCCCACG CTTTCGAGCCTCAACGTCAGTTACAGTCCAGTAAGCCGCCTTCGCCACTGGTGTT CCTCCTAATATCTACGCATTTCACCGCTACACTAGGAATTCCGCTTACCTCTCCTG CACTCGAGCTGCACAGTTTCCAAAGCAGTTCCGGGGTTGGG >GQY1XT001BBPBR AATGGTACCCGTCAATTCATTTGAGTTTCACCGTTGCCGGCGTACTCCCCAGGTG GGATGCTTAACGCTTTCGCTTTGCCACCCAGGCCCCATTCGGCCCGGACAGCTG GCATCCATCGTTTACTGTGCGGACTACCAGGGTATCTAATCCTGTTCGATCCCCGC ACTTTCGTGCCTCAGCGTCAGTAGGGCGCCGGAAGGCTGCCTTCGCAATCGGG GTTCTGCGTGATATCTATGCATTTCACCGCTACACCACGCATTCCGCCTTCTTCTC GCCCACTCAAGGCCCCCAGTTTCAACGG >GQY1XT001BDDE9 AATGGTACCCGTCAATTCCTTTAAGTTTCATTCTTGCGAACGTACTCCCCAGGTGG ATCACTTACTGCGTTTGCTGCGGCACCGATGGGTCCATACCCACCCACACCTAGT AATCATCGTTTACGGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCCCCACG CTTTCGAGCCTCAACGTCAGTTACAGTCCAGCAGGCCGCCTTCGCCACTGGTGT TCCTCCTAATATCTACGCATTTCACCGCTACACTAGGAATTCCGCCTGCCTCTCCT GCACTCCAGTTACACAGTTTCCAGAG >GQY1XT001CIUF3 AATGGTACCCGTCAATTCCTTTGAGTTTCATTCTTGCGAACGTACTCCCCAGGCG GAATACTTACTGCGTTTGCTGCGGCACCGGCGGGCCGTGCCCGCCGACACCTG Example 454 data
    20. 20. Clustering (and picking OTUs) singletons
    21. 21. Clustering (and picking OTUs)
    22. 22. Clustering (and picking OTUs)
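To make the clustering decision concrete, the sketch below clusters the five sequences A to E from the distance matrix on slide 12 into OTUs at a chosen cutoff. It is a minimal agglomerative clusterer written for illustration, not the algorithm of mothur or QIIME; the function name `cluster_otus` and the `linkage` argument are assumptions of this example. Passing `max` gives furthest-neighbor clustering and `min` gives nearest-neighbor clustering.

```python
from itertools import combinations


def cluster_otus(names, dist, cutoff, linkage=max):
    """Merge the closest pair of clusters until none is within the cutoff."""
    clusters = [frozenset([name]) for name in names]

    def cluster_distance(c1, c2):
        return linkage(dist[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        if cluster_distance(c1, c2) > cutoff:
            break  # nothing left that is close enough to merge
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return sorted(sorted(c) for c in clusters)


# Fractional differences from the upper triangle of the slide 12 matrix.
names = ["A", "B", "C", "D", "E"]
upper = {
    ("A", "B"): 0.1, ("A", "C"): 0.2, ("A", "D"): 0.2, ("A", "E"): 0.4,
    ("B", "C"): 0.2, ("B", "D"): 0.2, ("B", "E"): 0.4,
    ("C", "D"): 0.1, ("C", "E"): 0.4,
    ("D", "E"): 0.4,
}
dist = {frozenset(pair): value for pair, value in upper.items()}

print(cluster_otus(names, dist, cutoff=0.1))
print(cluster_otus(names, dist, cutoff=0.2))
```

At a cutoff of 0.1 the five sequences form three OTUs (A with B, C with D, and E alone); at 0.2 they collapse to two. The cutoff and the linkage rule are exactly the criteria the clustering decision point refers to.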
    23. 23. Assigning taxonomies >378462 GATGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGAACAGATAAGGAGCTTGCT CCTTTGACGTTAGCGGCGGACGGGTGAGTAACACGTGGGTAACCTACCTATAAGACTGGA ... >186233 AGAGTTTGATCCTGGCTCAGGATGAACACTAGCTACAGGCTTAACACATGCAAGTCGAGG GGCATCAGTTTGGTTTGCTTGCAAACCAAAGCTGGCGACCGGCGCACGGGTGAGTAACAC ... >260529 AGAGTTTGATCCTGGCTCAGGATGAACGCTGGCGGCGTGCCTAACACATGCAAGTCGAAC GAAGCATAAGGGAAGGAAGATTCGTCTGACGGAACTTATGACTGAGTGGCGGACGGGTGA ... >256122 CCTGGCTCACAATCACGAAGGAGAGGCGTGCGTAACACATGCAAGTCGACACGGGAGAGC GTGAGGCAACTCCGCAAGTATAGTGGCAGACGGGTGAGTAACACGTGAACAACCTACCCT ... >312796 AGTGGCGAACGGGTGAGTAACGCGTGAGGAACCTGCCTTTCAGAGGGGGACAACAGTTGG AAACGACTGCTAATACCGCATAATACGGTCTGACCGCATGATCGGATCGTCAAAGATTTA ... >574086 CCGCAAGGGGAGTGGCAGACGGGTGAGTAACGCGTGGGAACCTTCCCAGTGGTACGGAAT AACCCAGGGAAACCTGAGCTAATACCGTATACGCCCGAAAGGGGAAAGATTTATCGCCAT ...
    24. 24. Assigning taxonomies 378462 k__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Staphylococcaceae;g__Staphylococcus;s__; 186233 k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Porphyromonadaceae;g__Parabacteroides;s__Par 260529 k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Lachnospiraceae;g__Clostridium;s__; 256122 k__Bacteria;p__Acidobacteria;c__MVS-40;o__;f__;g__;s__; 312796 k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Ruminococcaceae;g__;s__; 574086 k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__Hyphomicrobiaceae;g__;s__;
    25. 25. Assigning taxonomies 378462 k__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Staphylococcaceae;g__Staphylococcus;s__; 186233 k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Porphyromonadaceae;g__Parabacteroides;s__Parabacteroidesdistasonis; 260529 k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Lachnospiraceae;g__Clostridium;s__; 256122 k__Bacteria;p__Acidobacteria;c__MVS-40;o__;f__;g__;s__; 312796 k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Ruminococcaceae;g__;s__; 574086 k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__Hyphomicrobiaceae;g__;s__; >378462 GATGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGAACAGATAAGGAGCTTGCT CCTTTGACGTTAGCGGCGGACGGGTGAGTAACACGTGGGTAACCTACCTATAAGACTGGA ... >186233 AGAGTTTGATCCTGGCTCAGGATGAACACTAGCTACAGGCTTAACACATGCAAGTCGAGG GGCATCAGTTTGGTTTGCTTGCAAACCAAAGCTGGCGACCGGCGCACGGGTGAGTAACAC ... >260529 AGAGTTTGATCCTGGCTCAGGATGAACGCTGGCGGCGTGCCTAACACATGCAAGTCGAAC GAAGCATAAGGGAAGGAAGATTCGTCTGACGGAACTTATGACTGAGTGGCGGACGGGTGA ... >256122 CCTGGCTCACAATCACGAAGGAGAGGCGTGCGTAACACATGCAAGTCGACACGGGAGAGC GTGAGGCAACTCCGCAAGTATAGTGGCAGACGGGTGAGTAACACGTGAACAACCTACCCT ... >312796 AGTGGCGAACGGGTGAGTAACGCGTGAGGAACCTGCCTTTCAGAGGGGGACAACAGTTGG AAACGACTGCTAATACCGCATAATACGGTCTGACCGCATGATCGGATCGTCAAAGATTTA ... >574086 CCGCAAGGGGAGTGGCAGACGGGTGAGTAACGCGTGGGAACCTTCCCAGTGGTACGGAAT AACCCAGGGAAACCTGAGCTAATACCGTATACGCCCGAAAGGGGAAAGATTTATCGCCAT ...
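The taxonomy strings on these slides use a rank prefix (k__, p__, c__, o__, f__, g__, s__) followed by a name, with an empty name where no assignment was made. A small sketch of how such a string might be parsed is below; the helper names `parse_taxonomy` and `deepest_assigned_rank` are invented for this example and are not part of any pipeline.

```python
# Rank prefixes as they appear in the taxonomy strings on the slides.
RANKS = {"k": "kingdom", "p": "phylum", "c": "class", "o": "order",
         "f": "family", "g": "genus", "s": "species"}


def parse_taxonomy(lineage):
    """Split a 'k__...;p__...;' string into a rank -> name dictionary."""
    parsed = {}
    for field in lineage.strip().strip(";").split(";"):
        prefix, _, name = field.strip().partition("__")
        parsed[RANKS[prefix]] = name or None  # empty name means unassigned
    return parsed


def deepest_assigned_rank(parsed):
    """Return the last rank that has a name before the first unassigned one."""
    deepest = None
    for rank in RANKS.values():
        if not parsed.get(rank):
            break
        deepest = rank
    return deepest


staph = parse_taxonomy(
    "k__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;"
    "f__Staphylococcaceae;g__Staphylococcus;s__;"
)
acido = parse_taxonomy("k__Bacteria;p__Acidobacteria;c__MVS-40;o__;f__;g__;s__;")

print(staph["genus"], deepest_assigned_rank(staph))
print(acido["class"], deepest_assigned_rank(acido))
```

The two examples show why the depth of assignment matters: sequence 378462 is resolved to genus, while 256122 is only resolved to class.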
    26. 26. Pyrosequencing • Next generation sequencing technology • Ability to generate ~500,000 sequences in an afternoon • Can barcode sequences to sequence many samples in a single run • Reads are getting longer • $10,000-15,000 per run Schloss et al. (2011) PLoS ONE 6:e27310
    27. 27. Caporaso et al 2012 ISMEJ 6:1621-1624
    28. 28. Other methods… • IonTorrent – Tons of short crappy reads – Not worth the effort • PacBio – Modest number of long reads – Not worth the effort • Stick with 454 or MiSeq (preferred)
    29. 29. Costs are falling • Very cheap – Schloss lab sequenced ~30 plates by 454 for $4000 per plate ~ $120,000 – Could re-do everything on MiSeq in 8 runs for $1500 per run ~ $12,000 • Cost is in DNA extraction and analysis – ~$8.00 per sample to get DNA – ~$5.00 per sample to sequence
    30. 30. Data analysis pipelines
    31. 31. The Major Players (for 16S-tag amplicons) • Pat Schloss, UMichigan – mothur – Command line – Coded in C++ but distributed as compiled – Excellent documentation • Rob Knight and friends, UColorado – QIIME – Command line – Coded in Python – Can run as a “Virtual Box” – Pretty good documentation • Ribosomal Database Project, MSU – RDP – Web interface – Pretty good documentation
    32. 32. Others • Victor Kunin and Phil Hugenholtz, JGI – Pyrotagger • Eric Triplett and friends, UFlorida - PANGEA • Kumar and friends, UOslo – CLOTU • Fricke and friends, UMaryland - CloVR • Schloetterer, Austria – CANGS • Sogin and friends, MBL - VAMPS • Quince/Curtis/Sloan, UGlasgow – AmpliconNoise/Pyronoise • Greg Hannon, CSHL - FASTX-Toolkit • Claros and friends, Malaga Spain - SeqTrim
    33. 33. Discussion questions 1. How do you think the choice of sequencing technology affects the results? 2. How do you think the choice of primers affects the results? 3. Which data analysis tools do you use and why? What differences do you perceive between mothur, QIIME, RDP, etc? 4. Which kinds of questions can you answer using amplicon sequencing, and which can you not? 5. Which part of the amplicon sequencing process intimidates you the most and why?
    34. 34. Which microbial organisms are represented by the rRNA gene sequences in each sample? >PC.634_1 FLP3FBN01ELBSX CTGGGCCGTGTCTCAGTCCCAATGTGGCCGTTTACCCTCTC AGGCCGGCTACGCATCATCGCCTTGGTGGGCCGTTACCTCA CCAACTAGCTAATGCGCCGCAGGTCCATCCATGTTCACGCC TTGATGGGCGCTTTAATATACTGAGCATGCGCTCTGTATACC TATCCGGTTTTAGCTACCGTTTCCAGCAGTTATCCCGGACAC ATGGGCTAGG >PC.634_2 FLP3FBN01EG8AX TTGGACCGTGTCTCAGTTCCAATGTGGGGGCCTTCCTCTCA GAACCCCTATCCATCGAAGGCTTGGTGGGCCGTTACCCCGC CAACAACCTAATGGAACGCATCCCCATCGATGACCGAAGTT CTTTAATAGTTCTACCATGCGGAAGAACTATGCCATCGGGTA TTAATCTTTCTTTCGAAAGGCTATCCCCGAGTCATCGGCAGG TTGGATACGTGTTACTCACCCGTGCGCCGGT >PC.354_3 FLP3FBN01EEWKD TTGGGCCGTGTCTCAGTCCCAATGTGGCCGATCAGTCTCTT AACTCGGCTATGCATCATTGCCTTGGTAAGCCGTTACCTTAC CAACTAGCTAATGCACCGCAGGTCCATCCAAGAGTGATAGC AGAACCATCTTTCAAACTCTAGACATGCGTCTAGTGTTGTTAT CCGGTATTAGCATCTGTTTCCAGGTGTTATCCCAGTCTCTTG GG rRNA reference database (sequences are available for each ‘tip’ in the tree) Search against reference sequences
    35. 35. Search against reference sequences RefSeq 1 RefSeq 2 RefSeq 3 RefSeq 4 RefSeq 5 RefSeq 6 RefSeq 7 RefSeq 8 RefSeq 9 RefSeq 10 >PC.634_1 FLP3FBN01ELBSX CTGGGCCGTGTCTCAGTCCCAATGTGGCCGTTTACCCTCTC AGGCCGGCTACGCATCATCGCCTTGGTGGGCCGTTACCTCA CCAACTAGCTAATGCGCCGCAGGTCCATCCATGTTCACGCC TTGATGGGCGCTTTAATATACTGAGCATGCGCTCTGTATACC TATCCGGTTTTAGCTACCGTTTCCAGCAGTTATCCCGGACAC ATGGGCTAGG >PC.634_2 FLP3FBN01EG8AX TTGGACCGTGTCTCAGTTCCAATGTGGGGGCCTTCCTCTCA GAACCCCTATCCATCGAAGGCTTGGTGGGCCGTTACCCCGC CAACAACCTAATGGAACGCATCCCCATCGATGACCGAAGTT CTTTAATAGTTCTACCATGCGGAAGAACTATGCCATCGGGTA TTAATCTTTCTTTCGAAAGGCTATCCCCGAGTCATCGGCAGG TTGGATACGTGTTACTCACCCGTGCGCCGGT >PC.354_3 FLP3FBN01EEWKD TTGGGCCGTGTCTCAGTCCCAATGTGGCCGATCAGTCTCTT AACTCGGCTATGCATCATTGCCTTGGTAAGCCGTTACCTTAC CAACTAGCTAATGCACCGCAGGTCCATCCAAGAGTGATAGC AGAACCATCTTTCAAACTCTAGACATGCGTCTAGTGTTGTTAT CCGGTATTAGCATCTGTTTCCAGGTGTTATCCCAGTCTCTTG GG Which microbial organisms are represented by the rRNA gene sequences in each sample?
    36. 36. Assign millions of sequences from thousands of samples to reference Compare samples statistically and visually www.qiime.org Assign reads to samples >GCACCTGAGGACAGGCATGAGGAA… >GCACCTGAGGACAGGGGAGGAGGA… >TCACATGAACCTAGGCAGGACGAA… >CTACCGGAGGACAGGCATGAGGAT… >TCACATGAACCTAGGCAGGAGGAA… >GCACCTGAGGACACGCAGGACGAC… >CTACCGGAGGACAGGCAGGAGGAA… >CTACCGGAGGACACACAGGAGGAA… >GAACCTTCACATAGGCAGGAGGAT… >TCACATGAACCTAGGGGCAAGGAA… >GCACCTGAGGACAGGCAGGAGGAA… RefSeq 1 RefSeq 2 RefSeq 3 RefSeq 4 RefSeq 5 RefSeq 6 RefSeq 7 RefSeq 8 RefSeq 9 RefSeq 10
    37. 37. OTU picking • De Novo – Reads are clustered based on similarity to one another. • Reference-based – Closed reference: any reads which don’t hit a reference sequence are discarded – Open reference: any reads which don’t hit a reference sequence are clustered de novo http://qiime.org/tutorials/otu_picking.html
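The difference between the three strategies is easiest to see on a toy data set. The sketch below is an illustration under strong simplifying assumptions and is not how QIIME implements OTU picking: `identity` is a stand-in that compares positions directly without alignment, de novo clustering is a greedy first-match scheme, and all sequences, names, and the 0.9 threshold are invented so the short example reads behave interestingly.

```python
def identity(a, b):
    """Toy identity: matching positions over the shorter length, no alignment."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def cluster_de_novo(reads, threshold):
    """Greedy clustering: a read joins the first centroid it matches."""
    centroids = {}
    otus = {}
    for read_id, seq in reads.items():
        for otu_id, centroid in centroids.items():
            if identity(seq, centroid) >= threshold:
                otus[otu_id].append(read_id)
                break
        else:  # no centroid matched, so this read founds a new OTU
            otu_id = f"denovo{len(centroids)}"
            centroids[otu_id] = seq
            otus[otu_id] = [read_id]
    return otus


def assign_to_references(reads, references, threshold):
    """Assign each read to its best reference; return hits and failures."""
    otus, failures = {}, {}
    for read_id, seq in reads.items():
        best = max(references, key=lambda r: identity(seq, references[r]),
                   default=None)
        if best is not None and identity(seq, references[best]) >= threshold:
            otus.setdefault(best, []).append(read_id)
        else:
            failures[read_id] = seq
    return otus, failures


def pick_otus(reads, references, threshold=0.97, strategy="open"):
    """Return (otus, discarded_read_ids) for the chosen strategy."""
    if strategy == "de_novo":
        return cluster_de_novo(reads, threshold), []
    otus, failures = assign_to_references(reads, references, threshold)
    if strategy == "closed":
        return otus, sorted(failures)  # reads with no reference hit are dropped
    if strategy == "open":
        otus.update(cluster_de_novo(failures, threshold))
        return otus, []
    raise ValueError(f"unknown strategy: {strategy}")


# Invented 10-base sequences; a 0.9 threshold allows one mismatch.
references = {"ref1": "ACGTACGTAC", "ref2": "TTTTGGGGCC"}
reads = {"r1": "ACGTACGTAC", "r2": "ACGTACGTAA", "r3": "TTTTGGGGCC",
         "r4": "GGGGGGGGGG", "r5": "GGGGGGGGGA"}

for strategy in ("de_novo", "closed", "open"):
    print(strategy, pick_otus(reads, references, 0.9, strategy))
```

In this toy run, closed-reference picking discards r4 and r5 because they match no reference, while open-reference picking keeps them as a new de novo OTU alongside the reference-defined OTUs.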
    38. 38. De novo OTU picking • Pros – All reads are clustered • Cons – Not parallelizable – OTUs may be defined by erroneous reads pick_de_novo_otus.py http://qiime.org/tutorials/tutorial.html
    39. 39. De novo OTU picking • You must use if: – You do not have a reference sequence collection to cluster against, for example because you're working with an infrequently used marker gene. • You cannot use if: – You are comparing non-overlapping amplicons, such as the V2 and the V4 regions of the 16S rRNA. – You are working with very large data sets, like a full HiSeq 2000 run. (Technically you can, but it will be really slow.) pick_de_novo_otus.py http://qiime.org/tutorials/tutorial.html
    40. 40. Closed-reference OTU picking • Pros – Built-in quality filter – Easily parallelizable – OTUs are defined by high-quality, trusted sequences • Cons – Reads that don’t hit reference dataset are excluded, so you can never observe new OTUs pick_closed_reference_otus.py
    41. 41. Closed-reference OTU picking • You must use if: – You are comparing non-overlapping amplicons, such as the V2 and the V4 regions of the 16S rRNA. Your reference sequences must span both of the regions being sequenced. • You cannot use if: – You do not have a reference sequence collection to cluster against, for example because you're working with an infrequently used marker gene. pick_closed_reference_otus.py
    42. 42. Percentage of reads that do not hit the reference collection, by environment type.
    43. 43. Open-reference OTU picking • Pros – All reads are clustered – Partially parallelizable • Cons – Only partially parallelizable – Mix of high quality sequences defining OTUs (i.e., the database sequences) and possible low quality sequences defining OTUs (i.e., the sequencing reads) pick_open_reference_otus.py http://qiime.org/tutorials/illumina_overview_tutorial.html http://qiime.org/tutorials/open_reference_illumina_processing.html http://qiime.org/tutorials/fungal_its_analysis.html
    44. 44. Open-reference OTU picking • You cannot use if: – You are comparing non-overlapping amplicons, such as the V2 and the V4 regions of the 16S rRNA. – You do not have a reference sequence collection to cluster against, for example because you're working with an infrequently used marker gene. pick_open_reference_otus.py http://qiime.org/tutorials/illumina_overview_tutorial.html http://qiime.org/tutorials/open_reference_illumina_processing.html http://qiime.org/tutorials/fungal_its_analysis.html
    45. 45. Subsampled open-reference OTU picking workflow (flowchart)
        Parameters:
        – (p): percent sequence identity threshold used for pre-filtering of sequences (default: 60%)
        – (s): percent sequence identity threshold used when clustering sequences either de novo or closed-reference (default: 97%)
        – (n): percentage of sequences that are randomly subsampled from sequences that failed to hit reference OTUs (default: 0.1%)
        – (c): minimum observation count for an OTU to be accepted during post-OTU picking processing (default: 2)
        Steps:
        – Quality filtering: does a query sequence q match a reference OTU (e.g., derived from Greengenes) at greater than or equal to (p) percent identity? No: discard sequence. Yes: keep as a high quality query sequence.
        – Closed-reference OTU picking: does a query sequence q match a reference OTU at greater than or equal to (s) percent identity? Yes: record sequence hit for reference OTU. No: the sequence failed to hit the reference OTUs.
        – Randomly subsample (n) percent of the query sequences that failed to hit the reference OTUs. Cluster subsampled query sequences de novo at (s) percent identity. Cluster centroids are new reference OTUs.
        – Closed-reference OTU picking on the remaining query sequences: does a query sequence q match a new reference OTU at greater than or equal to (s) percent identity? Yes: record sequence hit for new reference OTU. No: cluster sequences de novo at (s) percent identity. Cluster centroids are clean-up OTUs.
        – Does an OTU o have an observation count of at least c? Yes: accept OTU. No: exclude OTU.
        Legend: data file (input, intermediate, or output); decision; process; output OTUs.
        Subsampled open reference OTU picking scales to billions of reads
        pick_open_reference_otus.py http://qiime.org/tutorials/open_reference_illumina_processing.html
    46. 46. Read assignment is different for shotgun data, but not that different. In general, the bottleneck is identifying/compiling a reference database. map_reads_to_reference.py parallel_map_reads_to_reference.py http://qiime.org/tutorials/shotgun_analysis.html http://qiime.org/scripts/map_reads_to_reference.html
