A metagenome is the entire genetic information of the microorganisms at a specific site and time. Metagenomic data can be analysed by two approaches: 1) amplicon (16S rRNA gene) data analysis and 2) whole-genome (shotgun) metagenomics data analysis. Here we focus on 16S rRNA amplicon analysis of metagenomic data using the Mothur pipeline.
16S rRNA Analysis using Mothur Pipeline
1. 16S rRNA analysis using the Mothur pipeline
Eman Abdelrazik
Bioinformatics Research Assistant, Center of Informatics Science, Nile University
H3ABioNet Teaching Assistant
2. Before we start!
● Slides reproduced from Galaxy tutorials & H3ABioNet tutorials
● For questions: https://bit.ly/2N4mlv2
● Make sure you have a Galaxy account:
https://usegalaxy.org.au/
https://usegalaxy.org
3. Our Journey ^^
1) Theoretical:
a) Introduction
b) Analysis pipelines
2) Practical
a) File formats
b) Introduction to Galaxy
c) Mothur workflow
d) Let’s do it together
e) Do it by yourself
5. What is the difference?
● Microbiome: the entire set of microorganisms at a given site and time
● Metagenome: the entire genetic information of the microorganisms at a specific site and time
● Meta-transcriptome
● Meta-proteome
6. Why study the microbiome?
1) Health care research
● Humans are full of microorganisms: skin, gut, oral cavity, nasal cavity, eyes, ...
● The microbiome affects health, drug efficacy, etc., and is often referred to as your second genome
● ~10 times more cells than you
● ~100 times more genes than you
● ~1000s of different species
13. Shotgun vs. Amplicon!
Amplicon:
● Sequences only a specific gene
● No functional information
● Less complex to analyse
● Cheaper
Shotgun:
● Sequences all DNA
● More information
● Higher complexity
● Higher cost
14. Amplicon (16S rRNA gene)
● Targeted approach (e.g. 16S/18S rRNA gene)
● Amplifies bacteria, not the host or environmental fungi and plants
● Present in all living organisms (viruses?!)
(Figure: 16S rRNA secondary structure)
16. Amplicon
With variable regions it can distinguish between genera.
● Pros
○ Well-established
○ Inexpensive
● Cons
○ V-region choice can bias results
○ Based on a very well conserved gene, making it hard to resolve species and strains
17. Shotgun metagenomics
Aims to sequence the "whole" metagenome.
● Pros:
○ Not biased by an amplicon primer set
○ Not limited by conservation of the amplicon
○ Can also provide functional information
● Cons:
○ Environmental contamination, including host DNA
○ More expensive
○ Complex data analysis
○ Requires high-performance computing and high memory
25. 2. Chimera Removal
● During PCR, multiple sequences can combine to form a hybrid
● Chimeras must be removed from your data for better results
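In Mothur, this step can be sketched as a pair of batch commands; the file names below follow the MiSeq SOP naming conventions and are placeholders for your own outputs (a sketch, not the exact commands from this talk):

```text
# flag chimeras de novo with VSEARCH, using the more abundant sequences as reference
chimera.vsearch(fasta=stability.trim.contigs.good.unique.fasta, count=stability.trim.contigs.good.count_table, dereplicate=t)
# drop the flagged sequences from the dataset
remove.seqs(fasta=stability.trim.contigs.good.unique.fasta, accnos=stability.trim.contigs.good.unique.denovo.vsearch.accnos)
```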
26. 3. OTU Clustering
● Operational Taxonomic Unit: a cluster of similar sequences, represented by a single consensus sequence, corresponding roughly to one species.
● OTU clusters are defined by a 97% identity threshold of the 16S gene sequence variants at the genus level; 98% or 99% identity is suggested for species separation.
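A minimal Mothur sketch of this clustering step at the 0.03 distance (~97% identity) cutoff; file names are placeholders following common Mothur conventions:

```text
# pairwise distances between aligned sequences, up to the cutoff
dist.seqs(fasta=final.fasta, cutoff=0.03)
# cluster into OTUs at 0.03 distance (~97% identity)
cluster(column=final.dist, count=final.count_table, cutoff=0.03)
# build the sample-by-OTU table used for downstream diversity analysis
make.shared(list=final.opti_mcc.list, count=final.count_table, label=0.03)
```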
48. CIGAR (Compact Idiosyncratic Gapped Alignment Report) strings
The CIGAR string is the result of a sequence alignment, describing the sequence of matches/mismatches and deletions (or gaps) compared to the reference sequence.
CIGAR strings, together with the allele sequences, are used to generate a visualization of the locus alignment.
https://samtools.github.io/hts-specs/SAMv1.pdf
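As an illustrative aside (not from the slides), here is a minimal Python parser for CIGAR strings that reports how many bases an alignment consumes on the read and on the reference, following the operation table in the SAMv1 specification:

```python
import re

# Which CIGAR operations consume query (read) bases and reference bases
# (per the SAMv1 specification's operation table)
CONSUMES_QUERY = set("MIS=X")
CONSUMES_REF = set("MDN=X")

def cigar_lengths(cigar):
    """Return (query_bases, reference_bases) consumed by a CIGAR string."""
    query = ref = 0
    for count, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(count)
        if op in CONSUMES_QUERY:
            query += n
        if op in CONSUMES_REF:
            ref += n
    return query, ref

# 10 matches, a 2-base insertion, a 4-base deletion, then 30 matches
print(cigar_lengths("10M2I4D30M"))  # (42, 44)
```

Insertions lengthen the read relative to the reference and deletions do the opposite, which is why the two totals differ.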
49. SAM vs. BAM
● BAM is the binary form of SAM
● Better than a FASTQ file for data storage, especially for reads from different samples, as it adds extra annotation to each read (where does it come from?); unmapped reads can be stored the same way in a uBAM (unmapped BAM) file
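Conversion between SAM and BAM is typically done with samtools; a minimal command sketch (file names are placeholders):

```shell
# compress a SAM file to BAM
samtools view -b aln.sam -o aln.bam
# view a BAM file as human-readable SAM, including the header
samtools view -h aln.bam | less
```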
50. BIOM format
● The Biological Observation Matrix (BIOM) format
● A general-use format for representing biological sample-by-observation contingency tables
● A command-line interface (CLI) is available for working with BIOM files, including converting between file formats, adding metadata to BIOM files, and summarizing BIOM files
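The biom-format CLI mentioned above can be sketched as follows (file names are placeholders):

```shell
# convert a BIOM table to tab-separated text for inspection
biom convert -i table.biom -o table.tsv --to-tsv
# summarize per-sample observation counts
biom summarize-table -i table.biom -o table-summary.txt
```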
54. What is Galaxy?
● Web-based platform for biological data analysis
● Command-line tools >> wrapped >> Galaxy
● Retains histories of analyses: re-run and share
61. How to get data?
● The maximum size limit is 50 GB (uncompressed).
● Most individual file compression formats are supported, but multi-file archives (.tar, .zip) are not.
ENA ID: PRJEP5480
69. Mothur
● A collection of tools combined together.
● The Mothur project was initiated by Dr. Patrick Schloss at the University of Michigan.
● One of the most cited bioinformatics tools for analyzing 16S rRNA gene sequences.
● Processes data generated by Sanger, PacBio, IonTorrent, 454, and Illumina (MiSeq/HiSeq).
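Mothur can be run interactively, from a batch file, or inline from the shell; a minimal sketch (the batch file and input file names are placeholders):

```shell
# run a batch file of mothur commands
mothur stability.batch
# or pass commands inline from the shell
mothur "#make.contigs(file=stability.files); summary.seqs(fasta=current)"
```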