
GLBIO/CCBC Metagenomics Workshop


This presentation compares 16S and metagenomic sequencing and walks through two major approaches for taxonomically and functionally annotating metagenomic data.

Published in: Science


  1. 1. GLBIO/CCBC Microbiome Analysis Workshop: Metagenomics Morgan G.I. Langille Assistant Professor Dalhousie University May 16, 2016
  2. 2. Learning Objectives • Contrast 16S and metagenomic sequencing • Taxonomy from metagenomes • Function from metagenomes • Applicability of assembling and gene calling with metagenomic data • Metagenomic inference and limitations • Tutorial on processing metagenomic data to determine functional and taxonomic profiles
  3. 3. 16S vs Metagenomics • 16S is targeted sequencing of a single gene which acts as a marker for identification • Pros – Well established – Sequencing costs are relatively cheap (~50,000 reads/sample) – Only amplifies what you want (no host contamination) • Cons – Primer choice can bias results towards certain organisms – Usually not enough resolution to identify to the strain level – Different primers are needed for archaea & eukaryotes (18S) – Doesn’t identify viruses
  4. 4. 16S vs Metagenomics • Metagenomics: sequencing all the DNA in a sample • Pros – No primer bias – Can identify all microbes (euks, viruses, etc.) – Provides functional information (“What are they doing?”) • Cons – More expensive (millions of sequences needed) – Host/site contamination can be significant – May not be able to sequence “rare” microbes – Complex bioinformatics
  5. 5. TAXONOMIC PROFILES Who is there?
  6. 6. Metagenomics: Who is there? • Goal: Identify the relative abundance of different microbes in a given sample using metagenomics • Problems: – Reads are all mixed together – Reads can be short (~100bp) – Lateral gene transfer • Two broad approaches 1. Binning Based 2. Marker Based
  7. 7. Binning Based • Attempts to group or “bin” reads into the genome from which they originated • Composition-based – Uses sequence composition such as GC%, k-mers (e.g. Naïve Bayes Classifier) – Generally not very precise • Sequence-based – Compare reads to large reference database using BLAST (or some other similarity search method) – Reads are assigned based on “Best-hit” or “Lowest Common Ancestor” approach
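The composition features named on this slide (GC% and k-mer counts) can be computed directly from a read. The following is a minimal sketch; the function names and the default k = 4 are illustrative assumptions, not part of any tool mentioned in the slides.

```python
from collections import Counter


def gc_fraction(seq: str) -> float:
    """Fraction of G/C bases among the unambiguous A/C/G/T bases of a read."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def kmer_counts(seq: str, k: int = 4) -> Counter:
    """Count overlapping k-mers, skipping any k-mer with an ambiguous base."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


read = "ACGTACGT"
print(gc_fraction(read))
print(kmer_counts(read, k=4).most_common(1))
```

A composition-based classifier would compare such feature vectors against profiles built from reference genomes; that comparison step is not shown here.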
  8. 8. LCA: Lowest Common Ancestor • Use all BLAST hits above a threshold and assign taxonomy at the lowest level in the tree which covers these taxa. • Notable Examples: – MEGAN: http://ab.inf.uni-tuebingen.de/software/megan5/ • One of the first metagenomic tools • Does functional profiling too! – MG-RAST: https://metagenomics.anl.gov/ • Web-based pipeline (might need to wait a while for results) – Kraken: https://ccb.jhu.edu/software/kraken/ • Fastest binning approach to date and very accurate. • Large computing requirements (e.g. >128GB RAM)
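The LCA rule on this slide can be sketched as taking the longest shared prefix of the lineages of all hits that pass a score threshold. This is a hedged illustration only: the `(bitscore, lineage)` input shape, the function names, and the threshold of 50 are assumptions, and tools such as MEGAN or Kraken implement their own variants.

```python
def lowest_common_ancestor(lineages):
    """Return the longest root-to-leaf prefix shared by all lineages."""
    if not lineages:
        return []
    shared = []
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared


def assign_read(hits, min_bitscore=50.0):
    """Assign a read from (bitscore, lineage) hits; an empty list means unassigned."""
    kept = [lineage for score, lineage in hits if score >= min_bitscore]
    return lowest_common_ancestor(kept)


ENTERO = ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
          "Enterobacterales", "Enterobacteriaceae"]
hits = [
    (210.0, ENTERO + ["Escherichia", "Escherichia coli"]),
    (205.0, ENTERO + ["Shigella", "Shigella flexneri"]),
    (30.0, ["Bacteria", "Firmicutes", "Bacilli", "Bacillales",
            "Bacillaceae", "Bacillus", "Bacillus subtilis"]),
]
print(assign_read(hits))
```

Here the two strong hits disagree at the genus level, so the read is assigned to their shared family; the weak third hit falls below the threshold and does not pull the assignment up to the domain level.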
  9. 9. Marker Based • Single Gene • Identify and extract reads hitting a single marker gene (e.g. 16S, cpn60, or other “universal” genes) • Use existing bioinformatics pipeline (e.g. QIIME, etc.) • Multiple Gene • Several universal genes – PhyloSift (Darling et al., 2014) » Uses 37 universal single-copy genes • Clade specific markers – MetaPhlAn2 (Truong et al., 2015)
  10. 10. Marker or Binning? • Binning approaches – Similarity search is computationally intensive – Varying genome sizes and LGT can bias results • Marker approaches – Doesn’t allow functions to be linked directly to organisms – Genome reconstruction/assembly is not possible – Dependent on choice of markers
  11. 11. MetaPhlAn2 • Uses “clade-specific” gene markers • A clade represents a set of genomes that can be as broad as a phylum or as specific as a species • Uses ~1 million markers derived from 17,000 genomes – ~13,500 bacterial and archaeal, ~3,500 viral, and ~110 eukaryotic • Can identify down to the species level (and possibly even strain level) • Can handle millions of reads on a standard computer within a few minutes
  12. 12. MetaPhlAn Marker Selection
  13. 13. MetaPhlAn Marker Selection
  14. 14. Using MetaPhlAn • MetaPhlAn uses Bowtie2 for sequence similarity searching (nucleotide sequences vs. nucleotide database) • Paired-end data can be used directly • Each sample is processed individually and then multiple samples can be combined at the last step • Output is relative abundances at different taxonomic levels
  15. 15. Absolute vs. Relative Abundance • Absolute abundance: Numbers represent real abundance of thing being measured (e.g. the actual quantity of a particular gene or organism) • Relative abundance: Numbers represent proportion of thing being measured within sample • In almost all cases microbiome studies are measuring relative abundance – This is due to DNA amplification during sequencing library preparation not being quantitative
  16. 16. Relative Abundance Use Case • Sample A: – Has 10^8 bacterial cells (but we don’t know this from sequencing) – 25% of the microbiome from this sample is classified as Shigella • Sample B: – Has 10^6 bacterial cells (but we don’t know this from sequencing) – 50% of the microbiome from this sample is classified as Shigella • “Sample B contains twice as much Shigella as Sample A” – WRONG! (If we quantified it, we would find that Sample A has more Shigella) • “Sample B contains a greater proportion of Shigella compared to Sample A” – Correct!
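The arithmetic behind this use case can be checked in a few lines, taking the cell counts to be 10^8 and 10^6. The function name is illustrative; in practice the total cell counts would have to come from a separate quantitative measurement, since sequencing alone does not provide them.

```python
def absolute_count(total_cells: float, relative_abundance: float) -> float:
    """Absolute number of cells of a taxon, given the total load and its proportion."""
    return total_cells * relative_abundance


shigella_a = absolute_count(1e8, 0.25)  # Sample A: 25% of 10^8 cells
shigella_b = absolute_count(1e6, 0.50)  # Sample B: 50% of 10^6 cells

print(shigella_a, shigella_b, shigella_a / shigella_b)
```

Sample B has the larger proportion, yet Sample A carries 2.5 x 10^7 Shigella cells against 5 x 10^5 in Sample B, a 50-fold difference in the opposite direction.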
  17. 17. FUNCTIONAL COMPOSITION What are they doing?
  18. 18. What do we mean by function? • General categories – Photosynthesis – Nitrogen metabolism – Glycolysis • Specific gene families – nifH – EC: 1.1.1.1 (alcohol dehydrogenase) – K00929 (butyrate kinase)
  19. 19. Various Functional Databases • COG – Well known but original classification (not updated since 2003) • SEED – Used by the RAST and MG-RAST systems • PFAM – Focused more on protein domains • EggNOG – Very comprehensive (~190k groups) • UniRef – Has clustering at different levels (e.g. UniRef100, UniRef90, UniRef50) – Most comprehensive and is constantly updated • KEGG – Very popular, each entry is well annotated, and often linked into “Modules” or “Pathways” – Full access now requires a license fee • MetaCyc – Becoming more widely used. – More microbe focused than KEGG
  20. 20. KEGG • We will focus on using the KEGG database during this workshop • KEGG Orthologs (KOs) – Most specific. Thought to be homologs that perform exactly the same “function” – ~12,000 KOs in the database – These can be linked into KEGG Modules and KEGG Pathways – Identifiers: K01803, K00231, etc.
  21. 21. KEGG (cont.) • KEGG Modules – Manually defined functional units – Small groups of KOs that function together – ~750 KEGG Modules – Identifiers: M00002, M00011, etc.
  22. 22. KEGG (cont.) • KEGG Pathways – Groups KOs into large pathways (~230) – Each pathway has a graphical map – Individual KOs or Modules can be highlighted within these maps – Pathways can be collapsed into very general functional terms (e.g. Amino Acid Metabolism, Carbohydrate Metabolism, etc.)
  23. 23. Metagenomic Annotation Systems • Web-based – Provide functional and taxonomic analysis, and host your data. – EBI Metagenomics Server – MG-RAST – IMG/M • GUI based – MEGAN • Taxonomy and functional annotation – ClovR • Virtual Machine based, contains SOP, hasn’t been updated recently • Command-line based – MetAMOS • Built in assembly, highly customizable, some features can be buggy – Humann • Functional annotation – DIY • Set up your own in-house custom computational pipeline
  24. 24. Humann (Abubucker et al. 2012)
  25. 25. Humann Step 1 • Reads are searched against a protein database (e.g. KEGG) – Can use BLASTX, but much faster methods are now available (e.g. BLAT, USEARCH, RapSearch2, DIAMOND; Buchfink et al., 2015)
  26. 26. Humann (Abubucker et al. 2012)
  27. 27. Humann Step 2 • Normalize and weight search results • The relative abundance of each KO is calculated: – Number of reads mapping to a gene sequence in that KO – Weighted by the inverse p-value of each mapping – Normalized by the average length of the KO
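The three ingredients listed on this slide (read counts, p-value weighting, length normalization) can be combined as in the rough sketch below. It is not the tool's actual code: reading “inverse p-value” as a weight of 1 - p is an assumption, as are the function name and the `(ko_id, p_value)` input shape.

```python
from collections import defaultdict


def ko_relative_abundance(hits, avg_ko_length):
    """Relative abundance per KO from (ko_id, p_value) read mappings.

    avg_ko_length maps each KO to the average length of its gene sequences.
    """
    weighted = defaultdict(float)
    for ko, p_value in hits:
        # ASSUMPTION: "inverse p-value" is taken to mean a weight of 1 - p.
        weighted[ko] += 1.0 - p_value
    # Longer genes attract more reads, so divide by the average KO length.
    per_length = {ko: w / avg_ko_length[ko] for ko, w in weighted.items()}
    total = sum(per_length.values())
    if total == 0:
        return {ko: 0.0 for ko in per_length}
    return {ko: value / total for ko, value in per_length.items()}


hits = [("K00001", 0.0), ("K00001", 0.0), ("K00002", 0.0)]
lengths = {"K00001": 1000, "K00002": 500}
print(ko_relative_abundance(hits, lengths))
```

In this toy input K00001 receives twice as many reads as K00002 but is also twice as long, so both KOs end up with the same relative abundance after length normalization.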
  28. 28. Humann (Abubucker et al. 2012)
  29. 29. Humann Step 3 • Reduce number of pathways • A KO can map to one or more KEGG Pathways – Just because a KO is found in a pathway doesn’t mean that complete pathway exists in the community – If a pathway has 20 KOs and only 2 KOs are observed in the community (but at high abundances) what should be the abundance of the pathway? – MinPath (Ye, 2009) attempts to estimate the abundance of these pathways and remove spurious noise
  30. 30. Humann (Abubucker et al. 2012)
  31. 31. Humann Step 4 • Reduce false positive pathways further and normalize by KO copy number • Using the organism information from the KEGG hits – Pathways that are not found to be in any of the observed organisms AND are made up mostly of KOs mapping to a different pathway are removed – KO abundance can be divided by the estimated copy number of that KO as observed from the KEGG organism database
  32. 32. Humann
  33. 33. Humann Step 5 • Smoothing pathways by gap filling – Sequencing depth or poor sequence searches could lead to some KOs within pathways being absent or in low abundance – KOs more than 1.5 interquartile ranges below the pathway median are raised to the pathway median
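The gap-filling rule on this slide can be sketched with the standard library. This is an illustration under assumptions: the function name is hypothetical, and the quartiles use Python's "inclusive" method, which may differ from the quartile definition the tool itself uses.

```python
import statistics


def smooth_pathway(ko_abundances):
    """Raise KOs lying more than 1.5 IQR below the pathway median up to the median."""
    values = list(ko_abundances.values())
    if len(values) < 2:
        return dict(ko_abundances)  # too few KOs to estimate quartiles
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    floor = median - 1.5 * (q3 - q1)
    return {ko: (median if value < floor else value)
            for ko, value in ko_abundances.items()}


pathway = {"K1": 10, "K2": 11, "K3": 12, "K4": 13, "K5": 0}
print(smooth_pathway(pathway))
```

In this example the median is 11 and the interquartile range is 2, so the cutoff is 8; only the missing K5 falls below it and is raised to 11.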
  34. 34. Humann (Abubucker et al. 2012)
  35. 35. What about assembly? • Assembly is often used in genomics to join raw reads into longer contigs and scaffolds • Figure (“Genome assembly stitches together a genome”, Michael Schatz, Cold Spring Harbor): 1. Fragment DNA and sequence 2. Find overlaps between reads (…AGCCTAGACCTACAGGATGCGCGACACGT and GGATGCGCGACACGTCGCATATCCGGT…) 3. Assemble overlaps into contigs 4. Assemble contigs into scaffolds
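The "find overlaps between reads" step can be illustrated with the two read fragments visible in the figure. This is a toy sketch with hypothetical function names and an arbitrary minimum overlap of 5 bases; real assemblers work on graphs of many reads and tolerate sequencing errors, which exact suffix-prefix matching does not.

```python
def suffix_prefix_overlap(left, right, min_len=5):
    """Length of the longest suffix of `left` that exactly matches a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left, right, min_len=5):
    """Join two reads into one contig if they overlap; otherwise return None."""
    size = suffix_prefix_overlap(left, right, min_len)
    if size == 0:
        return None
    return left + right[size:]


read1 = "AGCCTAGACCTACAGGATGCGCGACACGT"
read2 = "GGATGCGCGACACGTCGCATATCCGGT"
print(suffix_prefix_overlap(read1, read2))
print(merge_reads(read1, read2))
```

The two fragments share the 15-base stretch GGATGCGCGACACGT, so they merge into a single 41-base contig.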
  36. 36. Assembly for Metagenomics? • Pros – Less computation time for similarity search (sequences are collapsed) – Can allow annotation when reads are too short (<100bp) – Can sometimes (partially) reconstruct genomes • Cons – Assembly is computationally intensive (high memory machines needed) – Collapsed reads must be added back to get relative abundances (not all assemblers do this natively) – Low read depth and high diversity can cause assemblers to fail – Reads are not all from the same genome so chimeras are possible – Some organisms/genes will assemble easier (e.g. more abundant) which could lead to annotation bias
  37. 37. What about gene calling? • In genomics, normally you would predict the start and stop positions of genes using a gene prediction program before annotating the genes • In metagenomics: – Pros: • May result in fewer false positives from annotating “non-real” genes • Lowers the number of similarity searches – Cons: • Computationally intensive • No good learning dataset • Raw reads will not cover an entire gene • Often requires assembled data – Possible tools: FragGeneScan, MetaGeneAnnotator – Alternative: Do 6 frame-translation (e.g. BLASTX)
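The six-frame translation alternative can be sketched as follows: translate the read in three offsets on the forward strand and three on the reverse complement. The sketch assumes the standard genetic code and uses illustrative function names; a search tool such as BLASTX performs this step internally before comparing against protein sequences.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translation(seq):
    """Return the protein translation for frames +1..+3 and -1..-3."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


for frame, protein in six_frame_translation("ATGGCCTAA").items():
    print(frame, protein)
```

Each of the six translations is then searched against the protein database, which avoids a separate gene-calling step at the cost of more searches.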
  38. 38. Community Function Potential • Important that this is metagenomics, not metatranscriptomics, and not metaproteomics • These annotations suggest the functional potential of the community • The presence of these genes/functions does not mean that they are biologically active (e.g. may not be transcribed)
  39. 39. PICRUST Predicting function from 16S profiles
  40. 40. Figure: tools and example tables. 16S rRNA gene: QIIME; Shotgun Metagenomics: MetaPhlAn, HUMAnN; PICRUSt; STAMP.
        Example OTU table
          OTU      Sample 1  Sample 2  Sample 3
          OTU 1    4         0         2
          OTU 2    1         0         0
          OTU 3    2         4         2
        Example KO table
          KO       Sample 1  Sample 2  Sample 3
          K00001   20        15        18
          K00002   1         2         0
          K00003   4         5         4
  41. 41. PICRUSt • Phylogenetic Investigation of Communities by Reconstruction of Unobserved States • http://picrust.github.com
  42. 42. PICRUSt: How does it work?
  43. 43. Predicting the abundance of a single function Known gene abundance Ancestral gene abundance Predicted gene abundance
  44. 44. Predicting the abundance of a single function Known gene abundance Ancestral gene abundance Predicted gene abundance Repeat for each function (~8000X) Repeat for all unknown tips (>100,000)
  45. 45. PICRUSt: Predicting Metagenomes
        OTU Table
          OTU     S1  S2  S3
          12345   10  0   5
          67890   1   0   0
          66666   4   8   2
        PICRUSt 16S Predictions (16S Copy Number)
          12345   5
          67890   1
          66666   2
        Normalized OTU Table
          OTU     S1  S2  S3
          12345   2   0   1
          67890   1   0   0
          66666   2   4   1
  46. 46. PICRUSt: Predicting Metagenomes
        OTU Table
          OTU     S1  S2  S3
          12345   10  0   5
          67890   1   0   0
          66666   4   8   2
        PICRUSt 16S Predictions (16S Copy Number)
          12345   5
          67890   1
          66666   2
        Normalized OTU Table
          OTU     S1  S2  S3
          12345   2   0   1
          67890   1   0   0
          66666   2   4   1
        PICRUSt KEGG Predictions
          OTU     K0001  K0002  K0003
          12345   4      0      2
          67890   1      0      0
          66666   2      4      2
        Metagenome Prediction
          KO      S1  S2  S3
          K0001   13  8   6
          K0002   8   16  4
          K0003   8   8   4
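The numbers on this slide can be reproduced in two steps: divide each OTU's counts by its predicted 16S copy number, then multiply the normalized abundances by the predicted KO copies per OTU and sum over OTUs. The sketch below uses the slide's values; the function names and dictionary layout are illustrative assumptions, not the PICRUSt API.

```python
def normalize_by_copy_number(otu_table, copy_number):
    """Divide each OTU's read counts by its predicted 16S copy number."""
    return {otu: [count / copy_number[otu] for count in counts]
            for otu, counts in otu_table.items()}


def predict_metagenome(normalized, ko_per_otu):
    """Sum, over OTUs, normalized abundance times predicted KO copies per genome."""
    n_samples = len(next(iter(normalized.values())))
    kos = sorted({ko for genome in ko_per_otu.values() for ko in genome})
    prediction = {ko: [0.0] * n_samples for ko in kos}
    for otu, abundances in normalized.items():
        for ko, copies in ko_per_otu[otu].items():
            for sample, abundance in enumerate(abundances):
                prediction[ko][sample] += abundance * copies
    return prediction


# Values taken from the slide (samples S1, S2, S3).
otu_table = {"12345": [10, 0, 5], "67890": [1, 0, 0], "66666": [4, 8, 2]}
copy_number = {"12345": 5, "67890": 1, "66666": 2}
ko_per_otu = {
    "12345": {"K0001": 4, "K0002": 0, "K0003": 2},
    "67890": {"K0001": 1, "K0002": 0, "K0003": 0},
    "66666": {"K0001": 2, "K0002": 4, "K0003": 2},
}

normalized = normalize_by_copy_number(otu_table, copy_number)
metagenome = predict_metagenome(normalized, ko_per_otu)
for ko, row in metagenome.items():
    print(ko, row)
```

For example, K0001 in S1 is 2 x 4 + 1 x 1 + 2 x 2 = 13, matching the Metagenome Prediction table.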
  47. 47. PICRUSt predictions across body sites (Langille et al., 2013, Nature Biotechnology)
  51. 51. VISUALIZATION AND STATISTICS What is important?
  52. 52. Visualization and Statistics • Various tools are available to determine statistically significant taxonomic differences across groups of samples – Excel – SigmaPlot – Past – R (many libraries) – Python (matplotlib) – STAMP
  53. 53. STAMP
  54. 54. STAMP Plots
  55. 55. STAMP • Input 1. “Profile file”: Table of features (samples by OTUs, samples by functions, etc.) • Features can form a hierarchy (e.g. Phylum, Class, Order, etc.) to allow data to be collapsed within the program 2. “Group file”: Contains different metadata for grouping samples • Can be two groups (e.g. Healthy vs Sick) or multiple groups (e.g. Water depth at 2M, 4M, and 6M) • Output – PCA, heatmap, box, and bar plots – Tables of significantly different features
  56. 56. METAGENOMICS WORKFLOW Putting it all together
  57. 57. Microbiome Helper • Standard Operating Procedures (SOPs) – 16S – Shotgun Metagenomics • Scripts to wrap and integrate existing tools – Available as an Ubuntu VirtualBox image • Tutorials/Walkthroughs • https://github.com/mlangill/microbiome_helper/wiki
  58. 58. IMR: Integrated Microbiome Resource • Offers sequencing and bioinformatics for microbiome projects (http://cgeb-imr.ca)
  59. 59. QUESTIONS?
  60. 60. Tutorial
