QIIME Workshop

   Get started by opening:
http://bit.ly/mbe-qiime2012
       and read up at:
       www.qiime.org
        Greg Caporaso
   gregcaporaso@gmail.com
www.qiime.org
   •   Extract DNA and amplify marker gene with barcoded primers
   •   Pool amplicons and sequence
   •   Assign reads to samples
   •   Assign millions of sequences from thousands of samples to OTUs
   •   Compute UniFrac distances and compare samples

[Figure: example reads (>GCACCTGAGGACAGGCATGAGGAA…, >TCACATGAACCTAGGCAGGACGAA…, >CTACCGGAGGACAGGCATGAGGAT…, …) matched against reference sequences RefSeq 1 through RefSeq 10]
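
The "assign reads to samples" step can be sketched in a few lines. This is only a minimal illustration, assuming every read starts with an exact, fixed-length sample barcode; the barcodes, sample names, and reads below are made up, and this is not how QIIME itself implements demultiplexing.

```python
def demultiplex(reads, barcode_to_sample):
    """Assign each read to a sample by its leading barcode, then strip the barcode."""
    # Assumption: all barcodes have the same length.
    barcode_len = len(next(iter(barcode_to_sample)))
    assigned = {}
    unassigned = []
    for read in reads:
        barcode, rest = read[:barcode_len], read[barcode_len:]
        sample = barcode_to_sample.get(barcode)
        if sample is None:
            unassigned.append(read)  # barcode not in the mapping
        else:
            assigned.setdefault(sample, []).append(rest)
    return assigned, unassigned


# Hypothetical barcodes and reads, for illustration only.
barcodes = {"ACGT": "gut.day1", "TGCA": "tongue.day1"}
reads = ["ACGTGCACCTGAGG", "TGCATCACATGAAC", "ACGTCTACCGGAGG", "GGGGGAACCTTCAC"]
assigned, unassigned = demultiplex(reads, barcodes)
print(assigned)
print(unassigned)
```

Real barcodes usually need error correction and quality filtering, which this sketch leaves out.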
>5000 samples in analysis pipeline
   •   Stream and lake water
   •   Marine water, sediment and reef
   •   Soil (forest, farm, peatland, tundra, …)
   •   Air
   •   Coalbed
   •   Arctic ice core
   •   Insect-associated
   •   Human-associated (gut, mouth, skin)



http://www.earthmicrobiome.org/
>5000 samples analyzed
to date
Alpha diversity by environment type
Where do we look for new diversity?




* As determined by no hit to Greengenes database.
QIIME workflow overview (www.QIIME.org)

   •   Sequencing output (454, Illumina, Sanger): fastq, fasta, qual, or sff/trace files
   •   Metadata: mapping file
   •   Pre-processing: e.g., remove primer(s), demultiplex, quality filter
   •   Denoise 454 Data: PyroNoise, Denoiser
   •   Pick OTUs and representative sequences
       –   Reference based: BLAST, UCLUST, USEARCH
       –   De novo: e.g., UCLUST, CD-HIT, MOTHUR, USEARCH
   •   Assign taxonomy: BLAST, RDP Classifier
   •   Align sequences: e.g., PyNAST, INFERNAL, MUSCLE, MAFFT
   •   Build phylogenetic tree: e.g., FastTree, RAxML, ClearCut
   •   Build 'OTU table', i.e., sample by observation matrix
   •   Database Submission (In development)
   •   OTU (or other sample by observation) table
   •   Phylogenetic Tree: evolutionary relationship between OTUs
   •   α-diversity and rarefaction: e.g., Phylogenetic Diversity, Chao1, Observed Species
   •   β-diversity and rarefaction: e.g., Weighted and unweighted UniFrac, Bray-Curtis, Jaccard
   •   Interactive visualizations: e.g., PCoA plots, distance histograms, taxonomy charts, rarefaction plots, network visualization, jackknifed hierarchical clustering.

Legend
   •   Currently supported for marker-gene data only (i.e., 'upstream' step)
   •   Currently supported for general sample by observation data (i.e., 'downstream' step)
   •   Required step or input
   •   Optional step or input
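
To make the non-phylogenetic diversity measures named above concrete, here is a minimal sketch of Observed Species, Chao1, Bray-Curtis, and Jaccard on per-sample OTU count vectors. The count vectors are invented, the Chao1 shown is the bias-corrected form, and the Jaccard shown is the presence/absence form; none of this is taken from QIIME's own code.

```python
def observed_species(counts):
    """Number of OTUs seen at least once in a sample."""
    return sum(1 for c in counts if c > 0)


def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1))."""
    singles = sum(1 for c in counts if c == 1)  # OTUs seen exactly once
    doubles = sum(1 for c in counts if c == 2)  # OTUs seen exactly twice
    return observed_species(counts) + singles * (singles - 1) / (2 * (doubles + 1))


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two count vectors over the same OTUs."""
    return sum(abs(x - y) for x, y in zip(a, b)) / (sum(a) + sum(b))


def jaccard(a, b):
    """Presence/absence Jaccard distance between two samples."""
    present_a = {i for i, c in enumerate(a) if c > 0}
    present_b = {i for i, c in enumerate(b) if c > 0}
    return 1 - len(present_a & present_b) / len(present_a | present_b)


# Hypothetical counts for five OTUs in two samples.
s1 = [10, 1, 1, 2, 0]
s2 = [5, 0, 3, 2, 4]
print(observed_species(s1), chao1(s1))
print(bray_curtis(s1, s2), jaccard(s1, s2))
```

Phylogenetic Diversity and UniFrac also need the phylogenetic tree, so they are not covered here.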
http://analytics.google.com
Running QIIME
   •   Native installation on OS X or Linux (laptops through 16,416-core compute cluster*)
   •   Ubuntu Linux Virtual Box
   •   Amazon Web Services (EC2)

   * http://ncar.janus.rc.colorado.edu/
IPython notebook
Moving Pictures of the Human
             Microbiome
• Two subjects sampled daily, one for six
  months, one for 18 months
• Four body sites: tongue, palm of left
  hand, palm of right hand, and gut (via fecal
  swabs).
Moving Pictures of the Human
             Microbiome
• Investigate the relative temporal variability of
  body sites.
• Is there a temporal core microbiome?
• Technical points: do we observe the same
  conclusions on 454 and Illumina data?
Moving Pictures of the Human
      Microbiome: QIIME tutorial
• A small subset of the full data set to facilitate
  short run time: ~0.1% of the full sequence
  collection.
• Sequenced across six Illumina GAIIx
  lanes, with a subset of the samples also
  sequenced on 454.
• The online tutorial contains details on all of
  the steps: go back and read that text.
Key QIIME files

• Mapping file: per-sample metadata, user-defined
• Input sequence file
• OTU table: sample x OTU matrix, central to
  downstream analyses [now in biom format]
• Parameters file: defines analyses, for use
  with the ‘workflow’ scripts (optional)
Mapping file
Mapping file: always run
             check_id_map.py




 = required field
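
As a rough idea of what a mapping file check looks for, here is a minimal sketch. It assumes a tab-delimited file whose header starts with #SampleID, BarcodeSequence, LinkerPrimerSequence and ends with Description, and sample IDs made up of letters, digits, and periods only. The function name and example rows are hypothetical, and this does not replace running check_id_map.py.

```python
import re

REQUIRED_LEADING = ["#SampleID", "BarcodeSequence", "LinkerPrimerSequence"]
REQUIRED_LAST = "Description"


def check_mapping(lines):
    """Return a list of human-readable problems found in a mapping file."""
    errors = []
    rows = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    header, data = rows[0], rows[1:]
    if header[:3] != REQUIRED_LEADING:
        errors.append("header must start with " + ", ".join(REQUIRED_LEADING))
    if header[-1] != REQUIRED_LAST:
        errors.append("last header field must be " + REQUIRED_LAST)
    seen = set()
    for row in data:
        sample_id = row[0]
        if sample_id.startswith("#"):
            continue  # comment line
        if not re.fullmatch(r"[A-Za-z0-9.]+", sample_id):
            errors.append("invalid characters in sample ID: " + sample_id)
        if sample_id in seen:
            errors.append("duplicate sample ID: " + sample_id)
        seen.add(sample_id)
        if len(row) != len(header):
            errors.append("wrong number of fields for sample: " + sample_id)
    return errors


# Hypothetical mapping file contents.
good = [
    "#SampleID\tBarcodeSequence\tLinkerPrimerSequence\tBodySite\tDescription\n",
    "gut.day1\tACGT\tGTGCCAGC\tgut\tfecal swab\n",
    "tongue.day1\tTGCA\tGTGCCAGC\ttongue\ttongue swab\n",
]
bad = good + ["gut_day2\tAAAA\tGTGCCAGC\tgut\n"]
print(check_mapping(good))
print(check_mapping(bad))
```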
Sequences file
>[sampleID_seqID] description

Barcodes have been removed!!
Sequences file: can be user-provided, or
    generated by split_libraries.py
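
The sampleID_seqID label convention makes it easy to recover the sample for every sequence. A minimal sketch, assuming sample IDs contain no underscores and that the label is the first whitespace-delimited token after ">"; the FASTA records below are made up.

```python
from collections import Counter


def seqs_per_sample(fasta_lines):
    """Count sequences per sample from '>sampleID_seqID description' headers."""
    counts = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            seq_label = line[1:].split()[0]          # e.g. "gut.day1_0"
            sample_id = seq_label.rsplit("_", 1)[0]  # drop the trailing seqID
            counts[sample_id] += 1
    return counts


# Hypothetical demultiplexed sequences file.
fasta = [
    ">gut.day1_0 read_A orig_bc=ACGT",
    "GCACCTGAGGACAGGCATGAGGAA",
    ">gut.day1_1 read_B orig_bc=ACGT",
    "CTACCGGAGGACAGGCATGAGGAT",
    ">tongue.day1_2 read_C orig_bc=TGCA",
    "TCACATGAACCTAGGCAGGACGAA",
]
counts = seqs_per_sample(fasta)
print(counts)
```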
OTU table (classic format)
sample x OTU matrix
   •   OTU identifiers
   •   Sample identifiers
   •   Optional per OTU taxonomic information
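
A minimal sketch of reading a classic-format table, assuming a tab-delimited layout with a "#OTU ID" header row, one row per OTU, one column per sample, and an optional final "Consensus Lineage" column for taxonomy. The table contents and the function name are illustrative only.

```python
def parse_classic_otu_table(lines):
    """Parse a classic tab-delimited OTU table into ids, counts, and taxonomy."""
    sample_ids, otu_ids, counts, taxonomy = [], [], [], []
    has_taxonomy = False
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        if fields[0] == "#OTU ID":
            # Header row: sample identifiers, optionally followed by taxonomy.
            has_taxonomy = fields[-1] == "Consensus Lineage"
            sample_ids = fields[1:-1] if has_taxonomy else fields[1:]
        elif line.startswith("#"):
            continue  # other comment lines
        else:
            otu_ids.append(fields[0])
            values = fields[1:-1] if has_taxonomy else fields[1:]
            counts.append([float(v) for v in values])
            taxonomy.append(fields[-1] if has_taxonomy else None)
    return sample_ids, otu_ids, counts, taxonomy


# Hypothetical table with two OTUs and two samples.
table_lines = [
    "# example OTU table",
    "#OTU ID\tS1\tS2\tConsensus Lineage",
    "0\t5\t0\tk__Bacteria; p__Firmicutes",
    "1\t2\t7\tk__Bacteria; p__Bacteroidetes",
]
sample_ids, otu_ids, counts, taxonomy = parse_classic_otu_table(table_lines)
print(sample_ids, otu_ids, counts, taxonomy)
```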
OTU tables are now in biological observation matrix (.biom) format (QIIME 1.4.0-dev and later)
   •   Google: “biom format”
   •   http://biom-format.org
   •   See convert_biom.py for translating between classic and biom OTU tables
sample x observation contingency matrix
Every table holds observation counts; only the meaning of the two axes changes:
   •   Samples x OTUs: marker gene (e.g., 16S) surveys
   •   Samples x Taxa: marker gene (e.g., 16S) surveys
   •   Genomes x Ortholog groups: comparative genomics
   •   Metagenomes x Functions: metagenomics, metatranscriptomics
   •   Samples x Metabolites: metabolomics
   •   ...
The Biological Observation Matrix (BIOM) Format
  or: How I Learned To Stop Worrying and
  Love the Ome-ome

    JSON-based format for
    representing arbitrary
    sample x observation
    contingency tables with
    optional metadata




McDonald et al., GigaScience (2012).
                                       http://www.biom-format.org
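
To give a feel for the JSON layout, here is a small hand-written table and a helper that expands its sparse data into a dense matrix. The field names follow my reading of the BIOM 1.0 description and should be checked against http://biom-format.org; the IDs, counts, and the to_dense helper are made up for illustration.

```python
import json

# Hypothetical BIOM-style document: 2 observations (rows) x 3 samples (columns).
biom_json = """
{
  "id": null,
  "format": "Biological Observation Matrix 1.0.0",
  "format_url": "http://biom-format.org",
  "type": "OTU table",
  "generated_by": "hand-written example",
  "date": "2012-10-18T00:00:00",
  "rows": [
    {"id": "OTU_0", "metadata": {"taxonomy": ["k__Bacteria", "p__Firmicutes"]}},
    {"id": "OTU_1", "metadata": null}
  ],
  "columns": [
    {"id": "S1", "metadata": null},
    {"id": "S2", "metadata": null},
    {"id": "S3", "metadata": null}
  ],
  "matrix_type": "sparse",
  "matrix_element_type": "int",
  "shape": [2, 3],
  "data": [[0, 0, 5], [0, 2, 1], [1, 1, 7]]
}
"""


def to_dense(table):
    """Expand sparse [row, column, value] triples into a dense list of lists."""
    n_rows, n_cols = table["shape"]
    dense = [[0] * n_cols for _ in range(n_rows)]
    for row, col, value in table["data"]:
        dense[row][col] = value
    return dense


table = json.loads(biom_json)
dense = to_dense(table)
print([r["id"] for r in table["rows"]], [c["id"] for c in table["columns"]])
print(dense)
```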
Comparative genomic (B) and metagenome
analysis (C) with QIIME
Working with OTU tables
• single_rarefaction.py: even sampling (very important if you
  have different numbers of seqs/sample!)
• filter_otus_from_otu_table.py
• filter_samples_from_otu_table.py
• per_library_stats.py
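
Even sampling (rarefaction) means drawing the same number of sequences from every sample, without replacement, and dropping samples that are too shallow. A minimal sketch of the idea, not the single_rarefaction.py implementation; the function name and counts are hypothetical.

```python
import random


def rarefy(counts, depth, seed=None):
    """Subsample one sample's OTU counts to `depth` sequences without replacement.

    Returns None when the sample has fewer than `depth` sequences.
    """
    if sum(counts) < depth:
        return None
    rng = random.Random(seed)
    # One pool entry per observed sequence, labelled by its OTU index.
    pool = [otu for otu, c in enumerate(counts) for _ in range(c)]
    rarefied = [0] * len(counts)
    for otu in rng.sample(pool, depth):
        rarefied[otu] += 1
    return rarefied


# Hypothetical samples with very different sequencing depths.
deep = [400, 80, 15, 5]    # 500 sequences
shallow = [30, 15, 5, 0]   # 50 sequences
print(rarefy(deep, 50, seed=0))
print(rarefy(shallow, 50, seed=0))
print(rarefy(shallow, 100))
```

A sample that already has exactly `depth` sequences comes back unchanged, and one below `depth` is dropped (None here).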
OTU picking: terminology
OTU picking
• De Novo
  – Reads are clustered based on similarity to one
    another.
• Reference-based
  – Closed reference: any reads which don’t hit a
    reference sequence are discarded
  – Open reference: any reads which don’t hit a
    reference sequence are clustered de novo
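
The three strategies differ only in where cluster seeds come from and in what happens to reads that match nothing. The toy sketch below makes that explicit, assuming equal-length reads, a naive position-by-position identity, and greedy assignment to the best seed. The sequences, threshold, and function names are made up, and real OTU pickers such as UCLUST work very differently internally.

```python
def identity(a, b):
    """Fraction of matching positions (toy measure for equal-length reads)."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


def best_hit(read, seeds, threshold):
    """Return the id of the most similar seed at or above threshold, else None."""
    best_id, best_score = None, threshold
    for otu_id, seq in seeds.items():
        score = identity(read, seq)
        if score >= best_score:
            best_id, best_score = otu_id, score
    return best_id


def pick_otus(reads, reference=None, mode="de_novo", threshold=0.9):
    """mode is 'de_novo', 'closed' (closed-reference) or 'open' (open-reference)."""
    seeds = {} if mode == "de_novo" else dict(reference or {})
    otus, discarded, n_new = {}, [], 0
    for read in reads:
        hit = best_hit(read, seeds, threshold)
        if hit is None:
            if mode == "closed":
                discarded.append(read)  # no reference hit: read is thrown away
                continue
            hit = "denovo%d" % n_new    # the read seeds a new OTU
            n_new += 1
            seeds[hit] = read
        otus.setdefault(hit, []).append(read)
    return otus, discarded


# Hypothetical reference collection and reads.
reference = {"ref1": "AAAAAAAAAA", "ref2": "CCCCCCCCCC"}
reads = ["AAAAAAAAAA", "AAAAAAAAAT", "GGGGGGGGGG", "GGGGGGGGGT", "CCCCCCCCCC"]
closed_otus, closed_discarded = pick_otus(reads, reference, mode="closed")
open_otus, open_discarded = pick_otus(reads, reference, mode="open")
denovo_otus, denovo_discarded = pick_otus(reads, mode="de_novo")
print(closed_otus, closed_discarded)
print(open_otus)
print(denovo_otus)
```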
De novo OTU picking
• Pros
  – All reads are clustered
• Cons
  – Not parallelizable
  – OTUs may be defined by erroneous reads
Closed-reference OTU picking
• Pros
  – Built-in quality filter
  – Easily parallelizable
  – OTUs are defined by high-quality, trusted
    sequences
• Cons
  – Reads that don’t hit reference dataset are
    excluded, so you can never observe new OTUs
Percentage of reads
that do not hit the
reference
collection, by
environment type.
Open-reference OTU picking
• Pros
  – All reads are clustered
  – Partially parallelizable
• Cons
  – Only partially parallelizable
  – Mix of high quality sequences defining OTUs
    (i.e., the database sequences) and possible low
    quality sequences defining OTUs (i.e., the
    sequencing reads)
Considerations in analysis
Variation in sampling depth is an
important consideration




Human skin, colored by individual, at 500 sequences/sample

Image/analysis credit: Justin Kuczynski

Data reference:
Forensic identification using skin bacterial communities. Fierer N, Lauber CL, Zhou N, McDonald D, Costello EK, Knight R.
Proc Natl Acad Sci U S A. 2010 Apr 6;107(14):6477-81.
Variation in sampling depth is an
important consideration




                                                                                       Human skin, colored by
                                                                                       sampling depth, at
                                                                                       either 50 (blue) or 500
                                                                                       (red) sequences/sample
How deep is deep enough?
It depends on the question…
  – Differences between community types: not many
    sequences.
  – Rare biosphere: more (but be careful about
    sequencing noise!)
How deep is deep enough?

[Figure: three plots with PC1, PC2 and PC3 axes, one per sampling depth]
   •   100 sequences/sample: PC1 (8.6%), PC2 (8.4%), PC3 (6.2%)
   •   10 sequences/sample: PC1 (13%), PC2 (11%), PC3 (8.1%)
   •   1 sequence/sample: PC1 (24%), PC2 (17%), PC3 (9.7%)

Direct sequencing of the human microbiome readily reveals community differences.
J Kuczynski et al. Genome Biology (2011).
[Figure 1: panels (A), (B) and (C), with labels 100, 10 and 1]
Can we get accurate taxonomic
 assignment from short reads?
Extra slides
Elizabeth K. Costello, et al. Science 2009.
Bacterial Community Variation in Human Body Habitats Across Space and Time.
This work is licensed under the Creative Commons Attribution 3.0 United States License. To view a
copy of this license, visit http://creativecommons.org/licenses/by/3.0/us/ or send a letter to
Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.

Feel free to use or modify these slides, but please credit me by placing the following attribution
information where you feel that it makes sense: Greg Caporaso, www.caporaso.us.

Caporaso sloan qiime_workshop_slides_18_oct2012
