BRN Symposium 03/06/16 How can we determine/analyse it? Technical issues
1. THE MICROBIOME IN RESPIRATORY MEDICINE:
How can we determine/analyse it? Technical issues
Vicente Pérez Brocal, FISABIO, Valencia, Spain.
June 3rd, 2016 | Barcelona Biomedical Research Park (PRBB)
2. The microbiome in the respiratory tract: a complex ecosystem. How can we determine it?
Sources: National Institutes of Health, Scientific American, Human Microbiome Project
Source: American Museum of Natural History, http://tumblr.amnh.org/post/137107629989/its-microbiome-monday-a-few-fast-facts-about
http://www.coloringchaos.cc/microbiome-portraiture/
3. Depending on the microbiome component and the analysis, there are different -omics approaches
• Bacteria & Archaea
• Viruses (virome)
• Fungi (mycobiome) & other eukaryotes
Determining the microbial...
• Diversity: specific genes (e.g. 16S rRNA gene)
• Gene potential: DNA (metagenomics)
• Gene expression: RNA (metatranscriptomics)
• Function: Proteins (metaproteomics)
• Activity: Metabolites (metabolomics)
4. All steps are critical to determine the microbiome
Experimental design
Sample collection
Sample preservation and manipulation
Nucleic acids extraction: method
(16S/18S rRNA amplification: primers, PCR conditions)
Sequencing: platform
Bioinformatic analysis
Interpretation of results
Experimental phase
Computational phase (e.g. LEfSe)
5. Databases
• Nucleotide/protein sequence databases:
• nt: Partially non-redundant nucleotide sequences from GenBank, EMBL, and DDBJ, excluding GSS, STS, PAT, EST, HTG, and WGS.
• nr: Non-redundant protein sequences from GenPept, Swissprot, PIR, PRF, PDB, and NCBI RefSeq, etc.
• Ensembl Genomes (http://ensemblgenomes.org/), release 31 (March 2016). Genome sequences: EnsemblBacteria (29,777 entries), EnsemblFungi (589 entries), EnsemblProtists (158 entries), etc.
• Small and large subunit rRNA gene sequence databases:
• SILVA (http://www.arb-silva.de/): small (16S/18S) and large subunit (23S/28S) rRNA sequences for Bacteria, Archaea and Eukarya. Current release: 123 (July 23, 2015). No. of sequences: Parc: 4,985,791; Ref: 1,575,088 bacteria; Ref NR 99: 513,311 bacteria.
• RDP (https://rdp.cme.msu.edu/): bacterial and archaeal 16S rRNA sequences, and fungal 28S rRNA sequences. Current release: 11.4 (May 26, 2015). 3,224,600 aligned and annotated 16S rRNA sequences.
• EzTaxon (http://www.ezbiocloud.net/eztaxon): 64,329 species/phylotypes (60,384 Bacteria, 2,662 Archaea). Current release: August 1, 2015.
• Greengenes (http://greengenes.secondgenome.com/): curated database of near-full-length SSU sequences from Bacteria and Archaea. Current release: May 2013.
• Protein family databases:
• Pfam (http://pfam.xfam.org/, 16,295 entries, 12/2015), TIGRFAMs (http://www.jcvi.org/cgi-bin/tigrfams/index.cgi, 4,488 families, 09/2014), EggNOG (http://eggnogdb.embl.de/#/app/home, 190k orthologous groups, 10/2015), etc.
6. Initial processing
• Download and uncompress the sample files containing sequences in FASTA or FASTQ format.
• Sort reads into samples by their barcodes/indices (demultiplexing) and trim the primer regions.
• Quality assessment: filter out or trim reads with low quality, short length, low complexity or ambiguous bases (‘N’). E.g. the prinseq-lite program, http://prinseq.sourceforge.net/.
• Paired-end sequences: join both ends (R1 and R2) with fastq-join (from the ea-utils suite) http://code.google.com/p/ea-utils/wiki/FastqJoin, COPE http://sourceforge.net/projects/coperead/, PANDAseq https://github.com/neufeld/pandaseq, SeqPrep https://github.com/jstjohn/SeqPrep, etc.
• Conversion from FASTQ to FASTA format (optional). E.g. prinseq-lite, FASTX-Toolkit, etc.
• (Metagenomes) Removal of non-microbial sequences (e.g. human) by mapping the query sequences against that genome’s database: bowtie2 http://bowtie-bio.sourceforge.net/bowtie2/index.shtml.
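The quality-assessment and FASTQ-to-FASTA steps can be illustrated with a minimal Python sketch. It is not how prinseq-lite or the FASTX-Toolkit work internally. The function names, the thresholds (min_len, min_mean_q, max_n) and the Phred+33 quality encoding are assumptions made for the example, and low-complexity filtering is left out.

```python
from io import StringIO

def read_fastq(handle):
    """Yield (name, sequence, quality) from a FASTQ stream with 4 lines per record."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            return
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip()
        yield header[1:], seq, qual

def mean_quality(qual, offset=33):
    """Mean Phred score; ASSUMPTION: qualities are encoded with ASCII offset 33."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)

def passes_filter(seq, qual, min_len=50, min_mean_q=20, max_n=0):
    """Keep reads that are long enough, have few 'N' and a good mean quality.
    The default thresholds are illustrative, not recommendations."""
    return (len(seq) >= min_len
            and seq.upper().count("N") <= max_n
            and mean_quality(qual) >= min_mean_q)

def fastq_to_fasta(handle, **thresholds):
    """Return the reads that pass the filter as FASTA-formatted text."""
    return "".join(
        f">{name}\n{seq}\n"
        for name, seq, qual in read_fastq(handle)
        if passes_filter(seq, qual, **thresholds)
    )

# Hypothetical reads: only read1 should survive the filter.
demo = StringIO(
    "@read1\nACGTACGTAC\n+\nIIIIIIIIII\n"  # Q40, no N: kept
    "@read2\nACGTNCGTAC\n+\nIIIIIIIIII\n"  # contains an N: dropped
    "@read3\nACGTACGTAC\n+\n##########\n"  # Q2: dropped
    "@read4\nACGT\n+\nIIII\n"              # too short: dropped
)
fasta = fastq_to_fasta(demo, min_len=8)
print(fasta)
```

In a real run the thresholds would be chosen after inspecting the quality profile of the sequencing run.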
7. Chimeras removal in amplicon-based analyses
• Chimeras are DNA sequences made from two or more parent sequences.
• They are artifacts of the PCR process.
• Tools:
• Web-based Decipher chimera detection tool http://decipher.cee.wisc.edu/FindChimeras.html (only for files < 10Mb).
• UCHIME: http://drive5.com/uchime/.
• ChimeraSlayer, usearch61 http://qiime.org/scripts/identify_chimeric_seqs.html.
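The idea behind chimera detection can be shown with a toy Python sketch: a read is suspicious when a model built from two parents joined at a breakpoint explains it much better than any single parent. This is a deliberate simplification, not the algorithm of UCHIME, ChimeraSlayer or Decipher. It assumes equal-length, pre-aligned sequences, and the min_gain threshold and all names are hypothetical.

```python
from itertools import permutations

def mismatches(a, b):
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def best_single_parent(query, parents):
    """Return (mismatches, parent name) for the closest single parent."""
    return min((mismatches(query, seq), name) for name, seq in parents.items())

def best_two_parent_model(query, parents):
    """Try every ordered pair of parents and every breakpoint.
    Returns (mismatches, left parent, right parent, breakpoint) or None."""
    best = None
    for (name_a, seq_a), (name_b, seq_b) in permutations(parents.items(), 2):
        for bp in range(1, len(query)):
            d = (mismatches(query[:bp], seq_a[:bp])
                 + mismatches(query[bp:], seq_b[bp:]))
            if best is None or d < best[0]:
                best = (d, name_a, name_b, bp)
    return best

def looks_chimeric(query, parents, min_gain=3):
    """Flag the query if the two-parent model saves at least min_gain mismatches.
    ASSUMPTION: min_gain=3 is an arbitrary illustrative threshold."""
    single, _ = best_single_parent(query, parents)
    model = best_two_parent_model(query, parents)
    return model is not None and single - model[0] >= min_gain

# Hypothetical, artificially simple parent sequences.
parents = {
    "parent_a": "AAAAAAAAAACCCCCCCCCC",
    "parent_b": "GGGGGGGGGGTTTTTTTTTT",
}
chimera = "AAAAAAAAAATTTTTTTTTT"  # left half of parent_a + right half of parent_b
noisy = "AAAAAAAAAACCCCCCCCCG"    # parent_a with a single sequencing error

print(looks_chimeric(chimera, parents), best_two_parent_model(chimera, parents))
print(looks_chimeric(noisy, parents))
```

Real tools work on unaligned reads, weigh abundance or reference databases, and score candidates statistically, so this sketch only conveys the principle.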
8. Shortcomings of 16S rRNA gene analysis approach
Many bacterial entries from uncultured bacteria in 16S rRNA gene databases are tentative, and reference entries (representative sequences) are picked by clustering at a certain percentage of identity (97-99%, depending on the database).
Many bacterial entries in the databases do not meet the identity criteria to be categorized as species, or even at higher taxonomic levels. Therefore, query sequences cannot always be assigned to the species level because of limitations in the databases.
• Ex: A query read has its highest similarity (>99%) to a database record whose taxonomy only reaches the family level (no genus or species). The query cannot be characterized to the species level.
Sometimes the problem is the low similarity of the query to the entries in the database.
• Ex: A query read matches its first hit in a database with low similarity (≤ 80%). Even if the hit is well characterized and has a complete taxonomy, the query cannot reliably reach the species level.
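Both limitations can be sketched in a few lines of Python: the assigned lineage is cut short either because the database record is incomplete or because the identity to the hit is too low. The identity cut-offs per rank are assumptions chosen for illustration (the slides only state that ≤ 80% is not enough for species), and the function names are hypothetical.

```python
RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]

# ASSUMPTION: illustrative identity cut-offs (%), not taken from the slides.
IDENTITY_CUTOFFS = [
    (97.0, "species"),
    (95.0, "genus"),
    (90.0, "family"),
    (85.0, "order"),
    (80.0, "class"),
    (0.0, "phylum"),
]

def deepest_supported_rank(identity):
    """Deepest rank that the given identity (%) is allowed to support."""
    for cutoff, rank in IDENTITY_CUTOFFS:
        if identity >= cutoff:
            return rank
    return "phylum"

def assign_lineage(hit_lineage, identity):
    """Truncate the hit's lineage to the rank supported by the identity.
    hit_lineage is ordered like RANKS and may be shorter than RANKS
    when the database record itself is incomplete."""
    max_depth = RANKS.index(deepest_supported_rank(identity)) + 1
    return hit_lineage[:max_depth]

full = ["Bacteria", "Firmicutes", "Bacilli", "Lactobacillales",
        "Streptococcaceae", "Streptococcus", "Streptococcus pneumoniae"]

# Case 1: >99% identity, but the record is only annotated to family level.
case1 = assign_lineage(full[:5], 99.5)
# Case 2: complete record, but only 80% identity.
case2 = assign_lineage(full, 80.0)

print(RANKS[len(case1) - 1], case1[-1])
print(RANKS[len(case2) - 1], case2[-1])
```

In the first case the database is the limiting factor, in the second the query itself; neither reaches the species level.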
9. OTU table
What can we do to reliably assign taxonomy? We work on clusters and OTUs
Workflow (figure):
• Filtered sequences: 21 reads from three samples (S1-1 to S1-9, S2-1 to S2-8, S3-1 to S3-4).
• Clustering, e.g. at 97% id (uclust), into six clusters (Clust1 to Clust6).
• Pick a representative per cluster (longest, most abundant, random, first hit, ...): S1-1, S1-5, S2-3, S2-5, S1-9, S3-4.
• Global alignment of the representatives (query) against the database (clustered at 99%) at 80-100% id (usearch v8.0), or BLAST.
• Taxonomy assignment (first hit, LCA, no selection, ...) to the database hits (A to L, ?), giving OTU1 to OTU6.
• OTU table (reads per sample):
OTU (hits)         S1  S2  S3  Reads
OTU1 (A,B,C,D)     4   1   0   S1-1, S1-2, S1-3, S1-4, S2-1
OTU2 (F,G,H,I)     2   1   2   S1-5, S1-6, S2-2, S3-1, S3-2
OTU3 (C,D,E)       2   2   0   S1-7, S1-8, S2-3, S2-4
OTU4 (I)           0   3   1   S2-5, S2-6, S2-7, S3-3
OTU5 (J,K,L)       0   1   1   S2-8, S3-4
OTU6 (?)           1   0   0   S1-9
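The last step of this slide, going from clustered reads to the OTU table, can be sketched in Python as follows. The read-to-OTU membership is read off the slide's figure and should be taken as an interpretation of it; the function name and the convention that the sample is the part of the read name before the hyphen are assumptions of the example.

```python
from collections import Counter

# Read membership per OTU, as interpreted from the slide's figure.
otu_members = {
    "OTU1": ["S1-1", "S1-2", "S1-3", "S1-4", "S2-1"],
    "OTU2": ["S1-5", "S1-6", "S2-2", "S3-1", "S3-2"],
    "OTU3": ["S1-7", "S1-8", "S2-3", "S2-4"],
    "OTU4": ["S2-5", "S2-6", "S2-7", "S3-3"],
    "OTU5": ["S2-8", "S3-4"],
    "OTU6": ["S1-9"],
}
samples = ["S1", "S2", "S3"]

def build_otu_table(otu_members, samples):
    """Count, for every OTU, how many of its reads come from each sample.
    ASSUMPTION: read names look like '<sample>-<number>'."""
    table = {}
    for otu, reads in otu_members.items():
        counts = Counter(read.split("-")[0] for read in reads)
        table[otu] = [counts.get(sample, 0) for sample in samples]
    return table

otu_table = build_otu_table(otu_members, samples)

print("OTU", *samples, sep="\t")
for otu, counts in otu_table.items():
    print(otu, *counts, sep="\t")
```

The resulting counts match the table on the slide, which is a useful check that the membership was read correctly.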
10. Metagenomics
- Gene potential: taxonomy and potential functions from WGS (bacteria, viruses, eukaryotes, etc.) genomic DNA
- Taxonomy: unassembled reads:
• BLASTx against a reference protein database (e.g. NR) and LCA strategy (e.g. https://github.com/emepyc/Blast2lca).
• Kaiju (http://kaiju.binf.ku.dk/) “is a program for the taxonomic assignment of high-throughput sequencing reads, e.g., Illumina or Roche/454, from whole-genome sequencing of metagenomic DNA. Reads are directly assigned to taxa using the NCBI taxonomy and a reference database of protein sequences from bacterial and archaeal genomes (optionally also including fungi and microbial eukaryotes). Menzel, P. et al. (2016) Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat. Commun. 7:11257 (open access).”
- Functions: assembled reads:
• Reads are assembled into contigs, and genes are predicted (e.g. Glimmer http://ccb.jhu.edu/software/glimmer/index.shtml, Prodigal http://prodigal.ornl.gov/) to identify coding regions. The coverage of the contigs is calculated by mapping the reads against the contigs, the predicted ORFs are aligned against a protein family sequence database (e.g. Pfam, TIGRFAMs) using HMMER (http://hmmer.org/), and tables of abundances at different levels (e.g. role, sub-role, enzyme) are built.
• ShotMAP (https://github.com/sharpton/shotmap) “is a software workflow that functionally annotates and compares shotgun metagenomes. It compares unassembled or assembled metagenomic sequences to a protein family database and calculates metagenome functional abundance.”
• Others: MG-RAST (http://metagenomics.anl.gov/), etc.
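The LCA (lowest common ancestor) strategy mentioned for unassembled reads can be illustrated with a short Python sketch: when a read has several hits, it is assigned to the deepest taxon that all hits share. This is a minimal version of the idea, not the implementation used by Blast2lca or Kaiju; the function name and the example hits are hypothetical.

```python
def lowest_common_ancestor(lineages):
    """Return the shared prefix of several lineages.
    Each lineage is a list of taxon names ordered from domain downwards."""
    if not lineages:
        return []
    lca = []
    for names in zip(*lineages):  # walk rank by rank, stops at the shortest lineage
        if len(set(names)) != 1:
            break
        lca.append(names[0])
    return lca

# Hypothetical hits of one read against a reference database.
hit_pneumoniae = ["Bacteria", "Firmicutes", "Bacilli", "Lactobacillales",
                  "Streptococcaceae", "Streptococcus", "Streptococcus pneumoniae"]
hit_mitis = ["Bacteria", "Firmicutes", "Bacilli", "Lactobacillales",
             "Streptococcaceae", "Streptococcus", "Streptococcus mitis"]
hit_lactococcus = ["Bacteria", "Firmicutes", "Bacilli", "Lactobacillales",
                   "Streptococcaceae", "Lactococcus", "Lactococcus lactis"]

two_hits = lowest_common_ancestor([hit_pneumoniae, hit_mitis])
three_hits = lowest_common_ancestor([hit_pneumoniae, hit_mitis, hit_lactococcus])

print(two_hits[-1])    # deepest taxon shared by the two Streptococcus hits
print(three_hits[-1])  # deepest taxon shared by all three hits
```

The more divergent the hits, the shallower the assignment, which trades resolution for reliability.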
11. What can we do next?
• Analysis of abundance
• Heatmaps
• Analysis of composition
• Analysis of diversity
• LDA analysis and biomarkers
• Statistical analysis
• Correlation networks
Microbial diversity analysis:
• MLST http://www.mlst.net/
• MOTHUR http://www.mothur.org/
• EstimateS http://viceroy.eeb.uconn.edu/EstimateS/
• QIIME http://qiime.org/install/virtual_box.html
• PHACCS http://phaccs.sourceforge.net/
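As one example of a diversity analysis, the Python sketch below computes the observed richness and the Shannon index per sample from the OTU table of slide 9. The choice of the Shannon index is an assumption made for illustration (the slides only name "analysis of diversity"), and packages such as MOTHUR or QIIME offer many more estimators.

```python
import math

# Reads per OTU (OTU1 to OTU6) for each sample, taken from the OTU table of slide 9.
otu_counts = {
    "S1": [4, 2, 2, 0, 0, 1],
    "S2": [1, 1, 2, 3, 1, 0],
    "S3": [0, 2, 0, 1, 1, 0],
}

def richness(counts):
    """Number of OTUs observed at least once in the sample."""
    return sum(1 for c in counts if c > 0)

def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln(p_i)) over the observed OTUs."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

diversity = {
    sample: (richness(counts), shannon_index(counts))
    for sample, counts in otu_counts.items()
}

for sample, (obs, h) in diversity.items():
    print(f"{sample}: observed OTUs = {obs}, Shannon H' = {h:.3f}")
```

With so few reads per sample these values are only a toy illustration; real comparisons usually require rarefaction or another normalization of sequencing depth first.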
12. Take home message
• The complexity of the microbiome in the respiratory tract can be approached by a series of multi-omics analyses, involving DNA, RNA, proteins, and metabolites.
• All steps (experimental and computational) are critical to the final outcome.
• For bioinformatic analyses, the database we use is as important as our data and the procedure.
• Processing the data requires initial steps of separation by sample, filtering by quality and length, joining of pairs, and sometimes removal of ‘contaminant’ sequences and chimeras.
• Limitations in the databases and the query sequences (incomplete taxonomic allocation, tentative entries based on clustering, low query-subject identity, several hits showing identical similarity, etc.) restrict our capacity to obtain unambiguous results and to reach discrete bacterial species.
• Instead of working with traditional taxonomic levels (i.e. species), it may be advisable to cluster both the database entries and the queries and work with OTUs. In many cases we will not reach the full taxonomic hierarchy of the bacteria, but the result will be more realistic.
• Metagenomic analyses allow functional (and to some extent taxonomic) analyses based on genomic DNA, in contrast to the use of universal markers (e.g. 16S rRNA gene).
• After reaching a consistent OTU-functional table, there is no single pathway for the subsequent analyses: they vary according to the questions the researchers need to answer.
13. Acknowledgements:
• Andrés Moya, Professor of Genetics, University of Valencia (UV); Genomics and Health Area Chair, Institutional Professorship FISABIO-UV
• Rodrigo García López, PhD student, FISABIO-UV