Paprica course
1. Welcome!
Universidade de São Paulo
PAthway PRediction by phylogenetIC plAcement (paprica)
short course
Jeff Bowman, bowmanjs@ldeo.columbia.edu
30 March 2016
2. Introduction and Logistics
Schedule (tentative)
0900 – 0915: Introductions and logistics
0915 – 1015: Task 1: Troubleshoot installations; Task 2: Tutorial 1
1015 – 1030: Break
1030 – 1100: Discussion: The paprica workflow
1100 – 1130: Discussion: Tutorial 1 results
1130 – 1200: Troubleshooting installation for custom build of paprica database
1200 – 1300: Lunch
1300 – 1330: Tutorial 2: Building the paprica database
1330 – 1400: Discussion: The paprica database workflow
1400 – 1430: Demonstration: Metagenomic analysis with paprica (break during module)
1430 – 1630: Your analysis with paprica. If you don’t have a set of libraries that you’d like to work with, we will help you find some.
Objectives
1. Install paprica and dependencies, and learn how to use it to analyze a set of 16S rRNA gene sequences
2. Install the dependencies for building the paprica database, and learn how to build a custom database
3. What is paprica, and what can I do with it?
paprica is a pipeline to estimate the metabolic pathways, enzymes (EC numbers), and genome
parameters associated with 16S rRNA gene sequences.
• Designed for NGS data
• Also applicable to small libraries or even single 16S rRNA gene sequences (e.g. isolates)
Bowman and Ducklow, 2015; Bowman, 2015
4. Bowman, 2015
Function | Pathway | Sanger studies | Hatam et al. (2014) | Bowman et al. (2012)
CO2 fixation | CO2 fixation into oxaloacetate (anaplerotic) | Pseudoalteromonas haloplanktis TAC125 | Polaribacter MED152, Acidimicrobiales YM16-304 | Psychrobacter cryohalolentis K5, Polaribacter MED152
Antibiotic resistance | Triclosan resistance | Pelagibacter ubique HTCC1062, Polaribacter MED152 | Polaribacter MED152, Leadbetterella byssophila DSM17132, Thiomicrospira spp., Gloeocapsa PCC7428, Acidimicrobiales YM16-304, Janthinobacterium spp. | P. cryohalolentis K5, Polaribacter MED152, GSOS
C1 metabolism | Formaldehyde oxidation II (glutathione-dependent) | Colwellia psychrerythraea 34H | Gloeocapsa PCC7428, Marinobacter BSs20148, Glaciecola nitratireducens FR1064 | Octadecabacter antarcticus 307
Choline degradation | Choline degradation 1 | C. psychrerythraea 34H | Acidimicrobiales YM304 | P. cryohalolentis K5, O. antarcticus 307
Glycine betaine production | Glycine betaine biosynthesis I (Gram-negative bacteria) | C. psychrerythraea 34H | Acidimicrobiales YM304 | P. cryohalolentis K5, O. antarcticus 307
Halocarbon degradation | 2-chlorobenzoate degradation | P. cryohalolentis K5 | Polaromonas naphthalenivorans CJ2 | P. cryohalolentis K5
Mercury conversion | Phenylmercury acetate degradation | Marinobacter BSs20148, P. haloplanktis TAC125, Octadecabacter arcticus 238 | Belliella baltica DSM15883, Bordetella petrii | O. antarcticus 307
Nitrogen fixation | Nitrogen fixation | Coraliomargarita akajimensis DSM45221 | C. akajimensis DSM45221, Methylomonas methanica MC09, Aeromonas spp. | C. akajimensis DSM45221
Sulfite oxidation | Sulfite oxidation II/III | Pelagibacter ubique HTCC1062 | Cellvibrio japonicus UEDA107 | GSOS
Sulfate reduction | Sulfate reduction IV/V | Halomonas elongata DSM2581, Psychrobacter arcticum 273 | Vibrio vulnificus YJ016 | GSOS
Denitrification | Nitrate reduction I/VII | C. psychrerythraea 34H | C. japonicus UEDA107 | -
6. Troubleshoot installation and conduct basic analysis
Tutorial 1 – Initial analysis with paprica
• Finish downloading and installing all remaining dependencies; let me know if you need assistance
• Archaeopteryx:
sudo apt-get install default-jre
wget https://googledrive.com/host/0BxMokdxOh-JRM1d2azFoRnF3bGM/download/forester_1038.jar
mv forester_1038.jar archaeopteryx.jar
chmod a+x archaeopteryx.jar
## create a bash script named archaeopteryx containing these lines (no indentation):
## #!/bin/bash
## java -cp archaeopteryx.jar org.forester.archaeopteryx.Archaeopteryx
## make this script executable
chmod a+x archaeopteryx
• R and RStudio
• Remove the existing paprica directory, then download the latest version of paprica:
rm -r paprica
git clone https://github.com/bowmanjeffs/paprica.git
• Start working through the tutorial located here: http://www.polarmicrobes.org/?p=1473
• Start at “Testing the Installation”
7. The paprica workflow
Workflow diagram steps:
• 16S sequence library, the bigger the better!
• Obtain all completed genomes (Genbank)
• Predict metabolic pathways (ptools)
• Construct 16S rRNA gene tree (Infernal, RAxML)
• Place reads on reference tree (Infernal, pplacer)
• Extract pathways for each placement
• Generate confidence score for sample
• Find pathways shared across all members of all clades
• Calculate confidence for each node
• Evaluate genomic plasticity for terminal nodes
• Evaluate relative core genome size
Diagram groupings: Analysis, Database Construction, Confidence Scoring
Three components to metabolic inference:
1. Database construction
2. Analysis
3. Confidence scoring
Caveats:
Metabolic inference is only as good as…
• Our genome annotations
• The diversity of completed genomes
• Our knowledge of metabolic pathways
And is further limited by…
• Genomic plasticity
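As a rough illustration of the “find pathways shared across all members of all clades” step, the sketch below walks a toy reference tree and takes the set intersection of the pathways found in the genomes below each internal node. The tree, genome names, and pathway sets are hypothetical, and paprica's own scripts may organize this step differently.

```python
# Toy reference tree: internal nodes map to their children; leaves are genomes.
# Every name and pathway set here is made up for illustration.
tree = {
    "root": ["node_a", "genome_3"],
    "node_a": ["genome_1", "genome_2"],
}
genome_pathways = {
    "genome_1": {"glycolysis I", "TCA cycle I", "nitrogen fixation"},
    "genome_2": {"glycolysis I", "TCA cycle I"},
    "genome_3": {"glycolysis I", "sulfite oxidation II"},
}


def core_pathways(node, tree, genome_pathways, out):
    """Return the pathways shared by every genome below `node`.

    The result for each internal node is also stored in `out`.
    """
    if node not in tree:  # leaf: a single genome
        return set(genome_pathways[node])
    shared = None
    for child in tree[node]:
        child_set = core_pathways(child, tree, genome_pathways, out)
        shared = child_set if shared is None else shared & child_set
    out[node] = shared
    return shared


core_by_node = {}
core_pathways("root", tree, genome_pathways, core_by_node)
for node in sorted(core_by_node):
    print(node, sorted(core_by_node[node]))
```

A pathway that is missing from even one member of a clade drops out of that clade's core set, which is one way the caveat about genomic plasticity shows up in practice.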
8. The paprica workflow
• Data preparation
• Read QC – basic steps
• Overlap if PE
• Trim for quality
• Remove chloroplasts, mitochondria, anything else that looks weird
• Methods
• Mothur (preferred)
• Qiime
• paprica/utilities/read_qc.py
• Test run on single sample
• Set up run for multiple samples
while read f; do ./paprica-run.sh "$f" bacteria; done < samples.txt
• where samples.txt contains a list of the sample files without their extension
• Let’s take a look at paprica-run.sh…
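For anyone who prefers to drive the multi-sample run from Python, the sketch below builds the same command as the shell loop for each line of samples.txt. It is a dry run that only prints the commands; the sample name "winter" and the helper name build_commands are hypothetical.

```python
import shlex


def build_commands(sample_names, domain="bacteria", script="./paprica-run.sh"):
    """Build one paprica-run.sh command per non-empty sample name."""
    return [[script, name.strip(), domain] for name in sample_names if name.strip()]


# Hypothetical contents of samples.txt: file names without their extension.
samples_txt = "summer\nwinter\n\n"

commands = build_commands(samples_txt.splitlines())
for cmd in commands:
    # Dry run: print the command. To execute it instead, pass the list to
    # subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Blank lines in samples.txt are skipped, which avoids calling paprica-run.sh with an empty sample name.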
9. Tutorial 1 results
Files initially provided or created by paprica
summer.fasta
summer.sub.fasta
summer.sub.clean.fasta
Files produced for or during infernal/pplacer
summer.sub.combined_16S.bacteria.tax.clean.align.phyloxml
summer.sub.combined_16S.bacteria.tax.clean.align.csv
summer.sub.combined_16S.bacteria.tax.clean.align.sto
summer.sub.combined_16S.bacteria.tax.clean.align.fasta
summer.sub.combined_16S.bacteria.tax.clean.align.jplace
paprica output files
summer.bacteria.ec.csv
summer.bacteria.sum_ec.csv
summer.bacteria.pathways.csv
summer.bacteria.sum_pathways.csv
summer.bacteria.edge_data.csv
summer.bacteria.sample_data.txt
11. Tutorial 1 results
origin,name,multiplicity,edge_num,like_weight_ratio,post_prob,likelihood,marginal_like,distal_length,pendant_length,classification,map_ratio,map_overlap
summer.sub.combined_16S.bacteria.tax.clean.align,SRR584344.1832,1,2568,0.497633,0.769127,-42222.2,-42226,0.457927,0.317102,NA,NA,NA
summer.sub.combined_16S.bacteria.tax.clean.align,SRR584344.4354,1,2253,0.840252,0.915613,-41188,-41192.1,7.3661e-06,0.263113,NA,NA,NA
summer.sub.combined_16S.bacteria.tax.clean.align,SRR584344.3662,1,2422,0.614939,0.615935,-42880.8,-42884.1,6.32695e-06,0.17298,NA,NA,NA
summer.sub.combined_16S.bacteria.tax.clean.align,SRR584344.2443,1,242,0.557322,0.787045,-43458.2,-43459.3,9.2618e-06,0.0380588,NA,NA,NA
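A minimal sketch of how the placement table could be summarized: it parses the four example rows above with the standard csv module and counts reads per edge, keeping only placements above a posterior probability cutoff. The 0.7 cutoff is an arbitrary value chosen for the example and is not a paprica default.

```python
import csv
import io
from collections import Counter

# The four example rows shown above, copied from the placement CSV.
placements_csv = """\
origin,name,multiplicity,edge_num,like_weight_ratio,post_prob,likelihood,marginal_like,distal_length,pendant_length,classification,map_ratio,map_overlap
summer.sub.combined_16S.bacteria.tax.clean.align,SRR584344.1832,1,2568,0.497633,0.769127,-42222.2,-42226,0.457927,0.317102,NA,NA,NA
summer.sub.combined_16S.bacteria.tax.clean.align,SRR584344.4354,1,2253,0.840252,0.915613,-41188,-41192.1,7.3661e-06,0.263113,NA,NA,NA
summer.sub.combined_16S.bacteria.tax.clean.align,SRR584344.3662,1,2422,0.614939,0.615935,-42880.8,-42884.1,6.32695e-06,0.17298,NA,NA,NA
summer.sub.combined_16S.bacteria.tax.clean.align,SRR584344.2443,1,242,0.557322,0.787045,-43458.2,-43459.3,9.2618e-06,0.0380588,NA,NA,NA
"""

MIN_POST_PROB = 0.7  # hypothetical cutoff, used only for this example

rows = list(csv.DictReader(io.StringIO(placements_csv)))

reads_per_edge = Counter()
for row in rows:
    if float(row["post_prob"]) >= MIN_POST_PROB:
        reads_per_edge[int(row["edge_num"])] += int(row["multiplicity"])

print(dict(reads_per_edge))
```

With this cutoff the read placed on edge 2422 (post_prob 0.615935) is dropped and the other three reads are counted once each.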
12. Tutorial 1 results
,15,37,51,142,242,243,552,649,678,739,796,802,805,1030,1050,1075,1106,1107,2139…
1.-.-.-,0.0,0.0,0.0,35.25,90.0,14.0,0.0,0.0…
1.1.-.-,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0…
1.1.1.-,0.0,0.0,0.0,35.25,0.0,0.0,0.0,0…
1.1.1.1,0.0,0.0,0.0,23.5,135.0,21.0,0.333333333333…
Columns: edge number for each CCG (closest completed genome) and CEG (closest estimated genome)
Rows: EC number
13. Tutorial 1 results
1.-.-.-,175.159090909
1.1.-.-,0.333333333333
1.1.1.-,44.0984848485
1.1.1.1,192.475757576
1.1.1.10,0.0
1.1.1.100,1168.89333799
1.1.1.102,0.333333333333
Rows: EC number
Values: sum (normalized) across all CCG and CEG
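One simple way to work with the summed EC table is to rank EC numbers by their total and drop those that never occur in the sample. The sketch below does that for the seven example rows above; it assumes the file is a two-column CSV with no header, as shown on the slide.

```python
import csv
import io

# The seven example rows shown above (EC number, normalized sum).
sum_ec_csv = """\
1.-.-.-,175.159090909
1.1.-.-,0.333333333333
1.1.1.-,44.0984848485
1.1.1.1,192.475757576
1.1.1.10,0.0
1.1.1.100,1168.89333799
1.1.1.102,0.333333333333
"""

ec_totals = {ec: float(total) for ec, total in csv.reader(io.StringIO(sum_ec_csv))}

# Most abundant first; sorted() is stable, so ties keep their file order.
ranked = sorted(ec_totals.items(), key=lambda item: item[1], reverse=True)

# EC numbers with a non-zero total in this sample.
present = [ec for ec, total in ranked if total > 0]

for ec, total in ranked[:3]:
    print(ec, total)
```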
14. Tutorial 1 results
,15,37,51,142,242,243,552,649,678,739,796,802,805,1030,1050,1075,1106,1107,2139…
"(1,3)-beta-D-xylan degradation",0.0,0.0,0.0,0.0,0.0,0.0,0.0…
(KDO)2-lipid A biosynthesis I,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.6…
(R)-acetoin biosynthesis I,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0…
Columns: edge number for each CCG and CEG
Rows: pathway
15. Tutorial 1 results
summer.bacteria.edge_data.csv (transposed and put in table)
edge_num 242 243
taxon GCF_000012345.1_Candidatus Pelagibacter ubique HTCC1062_strain=HTCC1062
nedge 53 5
n16S 1 1
nedge_corrected 53 5
nge 1 1
ncds 1333 1355.5
genome_size 1308759 1325981
GC 29.68308145 29.15748
phi 0.478821295 0.480875
clade_size 1 2
branch_length 0.0189682 0.246143
npaths_terminal 119.5
npaths_actual 116 144
confidence 0.478821295 0.625556
post_prob 0.789555434 0.814622
nec_actual 369 461
nec_terminal 315.5
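The nedge, n16S, and nedge_corrected rows are consistent with a copy-number correction in which the number of reads placed on an edge is divided by the number of 16S rRNA gene copies for that edge. That relationship is an assumption here: both edges in the table have n16S = 1, so the table alone cannot confirm it. The sketch below applies the assumed correction to the two edges from the table plus one hypothetical edge with three 16S copies.

```python
# nedge and n16S for edges 242 and 243 are taken from the table above.
# Edge 9999 is hypothetical and only shows the effect of multiple 16S copies.
edges = {
    242: {"nedge": 53, "n16S": 1},
    243: {"nedge": 5, "n16S": 1},
    9999: {"nedge": 12, "n16S": 3},
}


def corrected_abundance(nedge, n16s):
    """Assumed correction: reads placed on the edge divided by 16S copy number."""
    return nedge / n16s


corrected = {
    edge: corrected_abundance(values["nedge"], values["n16S"])
    for edge, values in edges.items()
}

for edge, value in corrected.items():
    print(edge, value)
```

Under this assumption an organism with three 16S copies contributes three reads per genome, so its 12 reads count as 4 genome equivalents.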
16. Tutorial 1 results
name summer.bacteria
sample_confidence 0.49199211424
npathways 572
ppathways 1007
nreads 1000
database_created_at 2016-03-03T00:59:34.792240
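To use these per-sample values in a script, the key and value on each line can be read into a dictionary. The sketch below assumes whitespace-separated pairs, as they appear on the slide; the delimiter in the actual file may differ.

```python
# The sample_data values shown above, as whitespace-separated key/value lines.
sample_data_txt = """\
name summer.bacteria
sample_confidence 0.49199211424
npathways 572
ppathways 1007
nreads 1000
database_created_at 2016-03-03T00:59:34.792240
"""


def parse_value(text):
    """Convert to int or float where possible; otherwise keep the string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


sample_data = {}
for line in sample_data_txt.splitlines():
    key, value = line.split(None, 1)  # split on the first run of whitespace
    sample_data[key] = parse_value(value)

print(sample_data["sample_confidence"], sample_data["nreads"])
```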
17. Tutorial 2
• Download the remaining dependencies
• RAxML
git clone https://github.com/stamatak/standard-RAxML.git
cd standard-RAxML
make -f Makefile.AVX2.PTHREADS.gcc
rm *.o
• add to PATH
• What if your CPU doesn’t support AVX2? Cheat.
• pathway-tools
• follow GUI instructions
• taxtastic
• make sure that system Python is Anaconda (or alternate distro), then:
pip install taxtastic
• Follow the tutorial here: http://www.polarmicrobes.org/?p=1543
• Only complete the “Test paprica-build.sh” section!
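If the AVX2 build fails on your machine, one option is to check which instruction sets the CPU reports and build RAxML with a matching Makefile from the standard-RAxML repository. The sketch below is not necessarily the “cheat” mentioned above. It assumes a Linux system where /proc/cpuinfo lists CPU flags (SSE3 is reported there as "pni"), and the helper names are hypothetical.

```python
def choose_raxml_makefile(cpu_flags):
    """Pick the most capable PTHREADS Makefile the CPU flags allow."""
    flags = set(cpu_flags)
    if "avx2" in flags:
        return "Makefile.AVX2.PTHREADS.gcc"
    if "avx" in flags:
        return "Makefile.AVX.PTHREADS.gcc"
    if "pni" in flags or "sse3" in flags:
        return "Makefile.SSE3.PTHREADS.gcc"
    return "Makefile.PTHREADS.gcc"


def read_cpu_flags(path="/proc/cpuinfo"):
    """Return the CPU flag list from /proc/cpuinfo, or [] if unavailable."""
    try:
        with open(path) as handle:
            for line in handle:
                if line.startswith("flags"):
                    return line.split(":", 1)[1].split()
    except OSError:
        pass
    return []


makefile = choose_raxml_makefile(read_cpu_flags())
print("make -f " + makefile)
```

The printed command replaces the make -f line in the build steps; the binary name produced by each Makefile differs, so adjust what you add to PATH accordingly.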
19. Discussion: The paprica database workflow
paprica-make_ref.py
• Downloads all completed genomes from Genbank
• Counts 16S genes in each genome and pulls a representative
• Calculates other genome parameters
• Constructs 16S alignment and distance matrix
• Constructs genome distance matrix (compositional vector based)
• Calculates phi from 16S distance matrix and genome distance matrix
• Finds 16S genes in user genomes (if present)
• Adds user 16S genes to previous alignment
paprica-place_it.py
• Constructs reference tree and reference package from 16S alignment
paprica-build_core_genomes.py
• Predicts metabolic pathways for each genome
• Tallies up EC numbers for each genome
• For each internal node on the reference tree, determines mean parameters and the fraction of occurrence of EC numbers and metabolic pathways
• Exports all of this information as csv files
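The internal-node summary described for paprica-build_core_genomes.py can be pictured with a small example: for the genomes in one clade, average the numeric genome parameters and compute the fraction of genomes in which each EC number occurs. The genomes, parameter values, and the function name summarize_clade below are hypothetical; this is a sketch of the idea, not the script's actual code.

```python
from collections import Counter
from statistics import mean

# Hypothetical clade of two genomes under one internal node.
clade = {
    "genome_1": {"genome_size": 1300000, "ncds": 1330, "ec": {"1.1.1.1", "2.7.7.7"}},
    "genome_2": {"genome_size": 1350000, "ncds": 1380, "ec": {"1.1.1.1"}},
}


def summarize_clade(clade, parameters=("genome_size", "ncds")):
    """Return (mean parameters, fraction of genomes containing each EC number)."""
    genomes = list(clade.values())
    mean_parameters = {p: mean(g[p] for g in genomes) for p in parameters}
    counts = Counter(ec for g in genomes for ec in g["ec"])
    fraction = {ec: n / len(genomes) for ec, n in counts.items()}
    return mean_parameters, fraction


mean_parameters, ec_fraction = summarize_clade(clade)
print(mean_parameters)
print(ec_fraction)
```

An EC number with a fraction of 1.0 is present in every genome of the clade; lower fractions flag functions that only some members carry.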
20. Demonstration: paprica-mg.py
• If you’re on a server you can follow the tutorial at http://www.polarmicrobes.org/?p=1596
• test.annotation.csv: The number of hits in the metagenome, by EC number. This is probably the most useful file to you. The columns are:
• index: The accession of a representative protein from the database
• genome: Genome the representative protein comes from
• domain: Domain of this genome
• EC_number: The EC number
• product: A sensible name for the gene product
• start: Start position of the gene in the genome
• end: End position of the gene in the genome
• n_occurences: The number of occurrences of this EC number in the database
• nr_hits: The number of reads that matched this EC number. Each read is allowed only one hit.
• test.paprica-mg.nr.daa: The DIAMOND format results file. Only one hit per read is reported.
• test.paprica-mg.nr.txt: A text file of the DIAMOND results. Only one hit per read is reported.
• test_mg.pathologic (for -pathways T only): A directory containing .gbk files for each genome in the paprica database that received a hit, with each EC number that got a hit for that genome.
• test.pathways.txt: A simple list of all the pathways that were predicted for the metagenome.
paprica-mg_run.py -i ERR318619_1.qc.fasta.gz -o demo -ref_dir ref_genome_database -pathways F
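A short sketch of one way to summarize the annotation file: total the nr_hits column for each EC number. The column names follow the list above, including the file's own spelling n_occurences; the three data rows are invented for illustration, and the real file may have only one row per EC number.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows using the column names listed above.
annotation_csv = """\
index,genome,domain,EC_number,product,start,end,n_occurences,nr_hits
protein_A,genome_1,bacteria,1.1.1.1,alcohol dehydrogenase,100,1200,40,12
protein_B,genome_2,bacteria,1.1.1.1,alcohol dehydrogenase,500,1600,40,3
protein_C,genome_1,bacteria,2.7.7.7,DNA-directed DNA polymerase,2000,5000,90,7
"""

hits_by_ec = defaultdict(int)
for row in csv.DictReader(io.StringIO(annotation_csv)):
    hits_by_ec[row["EC_number"]] += int(row["nr_hits"])

# Most-hit EC numbers first.
for ec, hits in sorted(hits_by_ec.items(), key=lambda item: item[1], reverse=True):
    print(ec, hits)
```

To run this on real output, replace the io.StringIO wrapper with an open file handle for the annotation CSV.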