Welcome!
Universidade de São Paulo
PAthway PRediction by phylogenetIC plAcement (paprica)
short course
Jeff Bowman, bowmanjs@ldeo.columbia.edu
30 March 2016
Introduction and Logistics
Schedule (tentative)
0900 – 0915: Introductions and logistics
0915 – 1015: Task 1: Troubleshoot installations; Task 2: Tutorial 1
1015 – 1030: Break
1030 – 1100: Discussion: The paprica workflow
1100 – 1130: Discussion: Tutorial 1 results
1130 – 1200: Troubleshooting installation for custom build of paprica database
1200 – 1300: Lunch
1300 – 1330: Tutorial 2: Building the paprica database
1330 – 1400: Discussion: The paprica database workflow
1400 – 1430: Demonstration: Metagenomic analysis with paprica (break during module)
1430 – 1630: Your analysis with paprica. If you don’t have a set of libraries that you’d like to work with, we will help you find some.
Objectives
1. Install paprica and its dependencies, and learn how to use it to analyze a set of 16S rRNA gene sequences
2. Install the dependencies for building the paprica database, and learn how to build a custom database
What is paprica, and what can I do with it?
paprica is a pipeline to estimate the metabolic pathways, enzymes (EC numbers), and genome parameters associated with 16S rRNA gene sequences.
• Designed for NGS data
• Also applicable to small libraries or even single 16S rRNA gene sequences (e.g. isolates)
Bowman and Ducklow, 2015 Bowman, 2015
Introduction and Logistics
Bowman, 2015
Function | Pathway | Sanger studies | Hatam et al. (2014) | Bowman et al. (2012)
CO2 fixation | CO2 fixation into oxaloacetate (anaplerotic) | Pseudoalteromonas haloplanktis TAC125 | Polaribacter MED152, Acidimicrobiales YM16-304 | Psychrobacter cryohalolentis K5, Polaribacter MED152
Antibiotic resistance | Triclosan resistance | Pelagibacter ubique HTCC1062, Polaribacter MED152 | Polaribacter MED152, Leadbetterella byssophila DSM17132, Thiomicrospira spp., Gloeocapsa PCC7428, Acidimicrobiales YM16-304, Janthinobacterium spp. | P. cryohalolentis K5, Polaribacter MED152, GSOS
C1 metabolism | Formaldehyde oxidation II (glutathione-dependent) | Colwellia psychrerythraea 34H | Gloeocapsa PCC7428, Marinobacter BSs20148, Glaciecola nitratireducens FR1064 | Octadecabacter antarcticus 307
Choline degradation | Choline degradation 1 | C. psychrerythraea 34H | Acidimicrobiales YM304 | P. cryohalolentis K5, O. antarcticus 307
Glycine betaine production | Glycine betaine biosynthesis I (Gram-negative bacteria) | C. psychrerythraea 34H | Acidimicrobiales YM304 | P. cryohalolentis K5, O. antarcticus 307
Halocarbon degradation | 2-chlorobenzoate degradation | P. cryohalolentis K5 | Polaromonas naphthalenivorans CJ2 | P. cryohalolentis K5
Mercury conversion | Phenylmercury acetate degradation | Marinobacter BSs20148, P. haloplanktis TAC125, Octadecabacter arcticus 238 | Belliella baltica DSM15883, Bordetella petrii | O. antarcticus 307
Nitrogen fixation | Nitrogen fixation | Coraliomargarita akajimensis DSM45221 | C. akajimensis DSM45221, Methylomonas methanica MC09, Aeromonas spp. | C. akajimensis DSM45221
Sulfite oxidation | Sulfite oxidation II/III | Pelagibacter ubique HTCC1062 | Cellvibrio japonicus UEDA107 | GSOS
Sulfate reduction | Sulfate reduction IV/V | Halomonas elongata DSM2581, Psychrobacter arcticum 273 | Vibrio vulnificus YJ016 | GSOS
Denitrification | Nitrate reduction I/VII | C. psychrerythraea 34H | C. japonicus UEDA107 | -
Introduction and Logistics
Bowman et al, in revision
Troubleshoot installation and conduct basic analysis
Tutorial 1 – Initial analysis with paprica
• Finish downloading and installing all remaining dependencies; let me know if you need assistance
• Archaeopteryx
• R and RStudio
• Remove the existing paprica directory, then download the latest version of paprica:
• Start working through the tutorial located here: http://www.polarmicrobes.org/?p=1473
• Start at “Testing the Installation”
# Install Java, then download Archaeopteryx
sudo apt-get install default-jre
wget https://googledrive.com/host/0BxMokdxOh-JRM1d2azFoRnF3bGM/download/forester_1038.jar
mv forester_1038.jar archaeopteryx.jar
chmod a+x archaeopteryx.jar
# Create a bash script named archaeopteryx containing these two lines (no indentation):
#   #!/bin/bash
#   java -cp archaeopteryx.jar org.forester.archaeopteryx.Archaeopteryx
# Then make the script executable:
chmod a+x archaeopteryx
# Remove the existing paprica directory and download the latest version
rm -r paprica
git clone https://github.com/bowmanjeffs/paprica.git
The paprica workflow
[Workflow diagram; boxes are grouped under Database Construction, Analysis, and Confidence Scoring]
• 16S sequence library, the bigger the better!
• Obtain all completed genomes (Genbank)
• Predict metabolic pathways (ptools)
• Construct 16S rRNA gene tree (Infernal, RAxML)
• Place reads on reference tree (Infernal, pplacer)
• Extract pathways for each placement
• Generate confidence score for sample
• Find pathways shared across all members of all clades
• Calculate confidence for each node
• Evaluate genomic plasticity for terminal nodes
• Evaluate relative core genome size
Three components to metabolic inference:
1. Database construction
2. Analysis
3. Confidence scoring
Caveats:
Metabolic inference is only as good as…
• Our genome annotations
• The diversity of completed genomes
• Our knowledge of metabolic pathways
And is further limited by…
• Genomic plasticity
The paprica workflow
• Data preparation
• Read QC – basic steps
• Overlap if PE
• Trim for quality
• Remove chloroplasts, mitochondria, anything else that looks weird
• Methods
• Mothur (preferred)
• Qiime
• paprica/utilities/read_qc.py
• Test run on a single sample
• Let’s take a look at paprica-run.sh…
• Set up a run for multiple samples, where samples.txt contains a list of the sample files without their extensions:
while read -r f; do ./paprica-run.sh "$f" bacteria; done < samples.txt
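The loop above reads one sample name per line from samples.txt, with the file extension removed. A minimal Python sketch for producing that list is given here; the helper names `sample_names` and `write_samples` and the example file names are assumptions for illustration, not part of paprica.

```python
import os


def sample_names(filenames, extension=".fasta"):
    """Return sorted, de-duplicated file names with the extension removed."""
    names = set()
    for filename in filenames:
        base = os.path.basename(filename)
        if base.endswith(extension):
            # Strip only the final extension, so summer.sub.fasta -> summer.sub
            names.add(base[: -len(extension)])
    return sorted(names)


def write_samples(filenames, path="samples.txt"):
    """Write one sample name per line, the layout the while loop reads."""
    with open(path, "w") as handle:
        for name in sample_names(filenames):
            handle.write(name + "\n")


# Hypothetical input; in practice this could be os.listdir(".")
example = ["summer.fasta", "winter.fasta", "notes.txt"]
print(sample_names(example))
```

Files that do not end in the chosen extension are skipped, so stray notes or logs in the same directory do not end up in the run.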
Tutorial 1 results
Files initially provided or created by paprica
summer.fasta
summer.sub.fasta
summer.sub.clean.fasta
Files produced for or during infernal/pplacer
summer.sub.combined_16S.bacteria.tax.clean.align.phyloxml
summer.sub.combined_16S.bacteria.tax.clean.align.csv
summer.sub.combined_16S.bacteria.tax.clean.align.sto
summer.sub.combined_16S.bacteria.tax.clean.align.fasta
summer.sub.combined_16S.bacteria.tax.clean.align.jplace
paprica output files
summer.bacteria.ec.csv
summer.bacteria.sum_ec.csv
summer.bacteria.pathways.csv
summer.bacteria.sum_pathways.csv
summer.bacteria.edge_data.csv
summer.bacteria.sample_data.txt
Tutorial 1 results: summer.sub.combined_16S.bacteria.tax.clean.align.csv
origin,name,multiplicity,edge_num,like_weight_ratio,post_prob,likelihood,marginal_like,distal_length,pendant_length,classification,map_ratio,map_overlap
summer.sub.combined_16S.bacteria.tax.clean.align,SRR584344.1832,1,2568,0.497633,0.769127,-42222.2,-42226,0.457927,0.317102,NA,NA,NA
summer.sub.combined_16S.bacteria.tax.clean.align,SRR584344.4354,1,2253,0.840252,0.915613,-41188,-41192.1,7.3661e-06,0.263113,NA,NA,NA
summer.sub.combined_16S.bacteria.tax.clean.align,SRR584344.3662,1,2422,0.614939,0.615935,-42880.8,-42884.1,6.32695e-06,0.17298,NA,NA,NA
summer.sub.combined_16S.bacteria.tax.clean.align,SRR584344.2443,1,242,0.557322,0.787045,-43458.2,-43459.3,9.2618e-06,0.0380588,NA,NA,NA
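As a rough illustration of how this placement table can be used, the sketch below counts reads per edge and optionally drops low-confidence placements. The function name `reads_per_edge` and the 0.7 cutoff are assumptions chosen for the example, not values used by paprica; the rows are the four shown above, trimmed to the columns needed.

```python
import csv
import io
from collections import Counter

# The four placement rows shown above, trimmed to the columns used here.
EXAMPLE = """name,multiplicity,edge_num,post_prob
SRR584344.1832,1,2568,0.769127
SRR584344.4354,1,2253,0.915613
SRR584344.3662,1,2422,0.615935
SRR584344.2443,1,242,0.787045
"""


def reads_per_edge(handle, min_post_prob=0.0):
    """Count reads placed on each edge, keeping placements at or above the cutoff."""
    counts = Counter()
    for row in csv.DictReader(handle):
        if float(row["post_prob"]) >= min_post_prob:
            counts[int(row["edge_num"])] += int(row["multiplicity"])
    return counts


# 0.7 is an arbitrary illustrative cutoff
counts = reads_per_edge(io.StringIO(EXAMPLE), min_post_prob=0.7)
print(sorted(counts.items()))
```

The same function should work on the full file, since `csv.DictReader` ignores the extra columns.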
Tutorial 1 results: summer.bacteria.ec.csv
,15,37,51,142,242,243,552,649,678,739,796,802,805,1030,1050,1075,1106,1107,2139…
1.-.-.-,0.0,0.0,0.0,35.25,90.0,14.0,0.0,0.0…
1.1.-.-,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0…
1.1.1.-,0.0,0.0,0.0,35.25,0.0,0.0,0.0,0…
1.1.1.1,0.0,0.0,0.0,23.5,135.0,21.0,0.333333333333…
Columns: edge number for each closest completed genome (CCG) and closest estimated genome (CEG). Rows: EC number.
Tutorial 1 results: summer.bacteria.sum_ec.csv
1.-.-.-,175.159090909
1.1.-.-,0.333333333333
1.1.1.-,44.0984848485
1.1.1.1,192.475757576
1.1.1.10,0.0
1.1.1.100,1168.89333799
1.1.1.102,0.333333333333
Rows: EC number. Values: sum (normalized) across all CCG and CEG.
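The relationship between a per-edge table and its summed counterpart can be sketched as a row total. The example below assumes the summed value is simply the sum across edge columns; any normalization paprica applies is not reproduced, and the numbers are made up because the real rows are truncated on the slide.

```python
import csv
import io

# Hypothetical values: the first cell of the header is empty, the rest are edge numbers.
EXAMPLE = """,242,243,552
1.1.1.1,135.0,21.0,0.5
1.1.1.100,10.0,0.0,2.5
"""


def row_sums(handle):
    """Return {row label: total across all edge columns}."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row of edge numbers
    totals = {}
    for row in reader:
        if not row:
            continue  # ignore blank lines
        totals[row[0]] = sum(float(value) for value in row[1:])
    return totals


totals = row_sums(io.StringIO(EXAMPLE))
print(totals)
```

Because the `csv` module honours quoting, the same function should also handle pathway names that contain commas.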
Tutorial 1 results: summer.bacteria.pathways.csv
,15,37,51,142,242,243,552,649,678,739,796,802,805,1030,1050,1075,1106,1107,2139…
"(1,3)-beta-D-xylan degradation",0.0,0.0,0.0,0.0,0.0,0.0,0.0…
(KDO)2-lipid A biosynthesis I,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.6…
(R)-acetoin biosynthesis I,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0…
Columns: edge number for each CCG and CEG. Rows: pathway.
Tutorial 1 results: summer.bacteria.edge_data.csv (transposed and put in table)
edge_num 242 243
taxon GCF_000012345.1_Candidatus Pelagibacter ubique HTCC1062_strain=HTCC1062
nedge 53 5
n16S 1 1
nedge_corrected 53 5
nge 1 1
ncds 1333 1355.5
genome_size 1308759 1325981
GC 29.68308145 29.15748
phi 0.478821295 0.480875
clade_size 1 2
branch_length 0.0189682 0.246143
npaths_terminal 119.5
npaths_actual 116 144
confidence 0.478821295 0.625556
post_prob 0.789555434 0.814622
nec_actual 369 461
nec_terminal 315.5
Tutorial 1 results: summer.bacteria.sample_data.txt
name summer.bacteria
sample_confidence 0.49199211424
npathways 572
ppathways 1007
nreads 1000
database_created_at 2016-03-03T00:59:34.792240
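A small sketch for reading these sample-level values into a dictionary follows. It assumes each line is a key followed by whitespace and a value, as displayed above; the real file's delimiter may differ. Treating npathways / ppathways as "pathways predicted over pathways possible" is likewise an interpretation, not something the slide states.

```python
# The sample-level values shown above
EXAMPLE = """name summer.bacteria
sample_confidence 0.49199211424
npathways 572
ppathways 1007
nreads 1000
database_created_at 2016-03-03T00:59:34.792240
"""


def parse_sample_data(text):
    """Split each line into a key and a value on the first run of whitespace."""
    data = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            data[parts[0]] = parts[1].strip()
    return data


sample = parse_sample_data(EXAMPLE)
# Assumed interpretation: fraction of possible pathways that were predicted
fraction = int(sample["npathways"]) / int(sample["ppathways"])
print(round(fraction, 3))
```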
Tutorial 2
• Download the remaining dependencies
• RAxML
git clone https://github.com/stamatak/standard-RAxML.git
cd standard-RAxML
make -f Makefile.AVX2.PTHREADS.gcc
rm *.o
• add to PATH
• What if CPU can’t support AVX2? Cheat.
• pathway-tools
• follow GUI instructions
• taxtastic
• make sure that system Python is Anaconda (or alternate distro), then:
pip install taxtastic
• Follow the tutorial here: http://www.polarmicrobes.org/?p=1543
• Only complete the “Test paprica-build.sh” section!
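On the AVX2 question for RAxML, one way to see which build a machine can support is to read the CPU flags. The sketch below assumes a Linux system with /proc/cpuinfo and picks among the PTHREADS makefiles shipped in standard-RAxML; the function names are made up for illustration, and this is not necessarily the "cheat" meant on the slide.

```python
def pick_raxml_makefile(cpu_flags):
    """Choose the most capable PTHREADS makefile the CPU flags allow."""
    flags = {flag.lower() for flag in cpu_flags}
    if "avx2" in flags:
        return "Makefile.AVX2.PTHREADS.gcc"
    if "avx" in flags:
        return "Makefile.AVX.PTHREADS.gcc"
    # Linux lists SSE3 as "pni" in /proc/cpuinfo
    if "sse3" in flags or "pni" in flags:
        return "Makefile.SSE3.PTHREADS.gcc"
    return "Makefile.PTHREADS.gcc"


def read_cpu_flags(path="/proc/cpuinfo"):
    """Return the CPU flag list on Linux, or an empty list if unavailable."""
    try:
        with open(path) as handle:
            for line in handle:
                if line.startswith("flags"):
                    return line.split(":", 1)[1].split()
    except OSError:
        pass
    return []


print(pick_raxml_makefile(read_cpu_flags()))
```

Each makefile produces a differently named binary (for example raxmlHPC-PTHREADS-AVX rather than raxmlHPC-PTHREADS-AVX2), so check which name the paprica scripts call before relying on a non-AVX2 build.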
Discussion: The paprica database workflow
[Directory tree diagram of the paprica database; labels recoverable from the figure:]
Top level: ref_genome_database, ptools-local
Within ref_genome_database: user, bacteria, archaea (user also contains bacteria and archaea)
Within bacteria and archaea: refseqcomb…refpkg, terminal_paths.csv, terminal_ec.csv, internal_probs.csv, internal_ec_probs.csv, internal_ec_n.csv, internal_data.csv, genome_data_final.csv, genome_data.csv, combined_16S.bacteria.tax.database_info.txt or combined_16S.archaea.tax.database_info.txt
Genome directories (GCF…*) for bacteria and archaea: *.fasta, *.hits, *.sto, *.5mer_bints.txt.gz, *.genomic.fna, *.genomic.gbff, *.protein.faa
Genome directories (GCF…*) for user bacteria and archaea: *.fasta, *.hits, *.sto, *.genomic.fna, *protein.gbk, plus draft.combined_16S.fasta
Metagenome database files: paprica-mg.dmnd, paprica-mg.prot.csv.gz
Reference package files: combined_16S.[domain].tax.clean.align.fasta, combined_16S.[domain].tax.clean.align.sto, CONTENTS.json, phylo_modeleSi5_T.json, RAxML_fastTreeSH_Support.conf.root.ref.tre, RAxML_info.ref.tre
Discussion: The paprica database workflow
paprica-make_ref.py
• Downloads all completed genomes from Genbank
• Counts 16S genes in each genome and pulls a representative
• Calculates other genome parameters
• Constructs 16S alignment and distance matrix
• Constructs genome distance matrix (compositional vector based)
• Calculates phi from 16S distance matrix and genome distance matrix
• Finds 16S genes in user genomes (if present)
• Adds user 16S genes to previous alignment
paprica-place_it.py
• Constructs reference tree and reference package from 16S alignment
paprica-build_core_genomes.py
• Predicts metabolic pathways for each genome
• Tallies up EC numbers for each genome
• For each internal node on the reference tree, determines mean parameters and the fraction of occurrence of EC numbers and metabolic pathways
• Exports all of this information as csv files
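The "fraction of occurrence" step for an internal node can be sketched as follows: for each pathway, count the clade members that carry it and divide by the clade size. This is a toy illustration with made-up genome names and a hypothetical function `pathway_fractions`, not the code in paprica-build_core_genomes.py.

```python
def pathway_fractions(clade_members, terminal_paths):
    """Return {pathway: fraction of clade members whose genome has it}."""
    counts = {}
    for genome in clade_members:
        for pathway in terminal_paths[genome]:
            counts[pathway] = counts.get(pathway, 0) + 1
    n_members = len(clade_members)
    return {pathway: count / n_members for pathway, count in counts.items()}


# Hypothetical terminal nodes and their predicted pathways
terminal_paths = {
    "genome_A": {"glycolysis I", "nitrogen fixation"},
    "genome_B": {"glycolysis I"},
}

fractions = pathway_fractions(["genome_A", "genome_B"], terminal_paths)
# Pathways present in every member of the clade
core = sorted(p for p, f in fractions.items() if f == 1.0)
print(fractions, core)
```

A pathway with a fraction of 1.0 is shared by every member of the clade; lower fractions indicate pathways present in only some of the genomes below that node.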
Demonstration: paprica-mg.py
• If you’re on a server you can follow the tutorial at http://www.polarmicrobes.org/?p=1596
• test.annotation.csv: The number of hits in the metagenome, by EC number. This is probably the most useful file to you. The columns are:
• index: The accession of a representative protein from the database
• genome: Genome the representative protein comes from
• domain: Domain of this genome
• EC_number: The EC number
• product: A sensible name for the gene product
• start: Start position of the gene in the genome
• end: End position of the gene in the genome
• n_occurences: The number of occurrences of this EC number in the database
• nr_hits: The number of reads that matched this EC number. Each read is allowed only one hit.
• test.paprica-mg.nr.daa: The DIAMOND format results file. Only one hit per read is reported.
• test.paprica-mg.nr.txt: A text file of the DIAMOND results. Only one hit per read is reported.
• test_mg.pathologic (for -pathways T only): A directory containing .gbk files for each genome in the paprica database that received a hit, with each EC number that got a hit for that genome.
• test.pathways.txt: A simple list of all the pathways that were predicted for the metagenome.
paprica-mg_run.py -i ERR318619_1.qc.fasta.gz -o demo -ref_dir ref_genome_database -pathways F
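Once the annotation file exists, a quick first look is to rank EC numbers by the number of reads that hit them. The sketch below uses the column names listed above with made-up rows, and assumes one row per EC number; the function name `top_hits` is hypothetical.

```python
import csv
import io

# Made-up rows using the column names described for the annotation file
EXAMPLE = """index,genome,domain,EC_number,product,start,end,n_occurences,nr_hits
prot_1,genome_A,bacteria,1.1.1.1,alcohol dehydrogenase,100,1200,3,12
prot_2,genome_B,bacteria,2.7.7.7,DNA polymerase,2000,5000,1,0
prot_3,genome_A,bacteria,3.1.1.1,carboxylesterase,50,1100,2,5
"""


def top_hits(handle, n=10):
    """Return up to n (EC_number, nr_hits) pairs with at least one hit, highest first."""
    hits = []
    for row in csv.DictReader(handle):
        count = int(row["nr_hits"])
        if count > 0:
            hits.append((row["EC_number"], count))
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits[:n]


ranked = top_hits(io.StringIO(EXAMPLE))
print(ranked)
```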
• Evaluations
On to your own analysis!

More Related Content

Similar to Paprica course

LifeZoneWellnessSNPIntro.pptx
LifeZoneWellnessSNPIntro.pptxLifeZoneWellnessSNPIntro.pptx
LifeZoneWellnessSNPIntro.pptxssuserebe2aa
 
CCBC tutorial beiko
CCBC tutorial beikoCCBC tutorial beiko
CCBC tutorial beikobeiko
 
2015 Bioc4010 lecture1and2
2015 Bioc4010 lecture1and22015 Bioc4010 lecture1and2
2015 Bioc4010 lecture1and2Dan Gaston
 
Open Science and Data Sharing - CERF
Open Science and Data Sharing - CERFOpen Science and Data Sharing - CERF
Open Science and Data Sharing - CERFKaitlin Thaney
 
Diversity Diversity Diversity Diversity ....
Diversity Diversity Diversity Diversity ....Diversity Diversity Diversity Diversity ....
Diversity Diversity Diversity Diversity ....Jonathan Eisen
 
Bioinformatics for Computer Scientists.ppt
Bioinformatics for Computer Scientists.pptBioinformatics for Computer Scientists.ppt
Bioinformatics for Computer Scientists.pptAbdullah Yousafzai
 
Introduction-to-Bioinformatics-1.ppt
Introduction-to-Bioinformatics-1.pptIntroduction-to-Bioinformatics-1.ppt
Introduction-to-Bioinformatics-1.pptRichardEstradaC
 
Bioinformatics final
Bioinformatics finalBioinformatics final
Bioinformatics finalRainu Rajeev
 
BM405 Lecture Slides 21/11/2014 University of Strathclyde
BM405 Lecture Slides 21/11/2014 University of StrathclydeBM405 Lecture Slides 21/11/2014 University of Strathclyde
BM405 Lecture Slides 21/11/2014 University of StrathclydeLeighton Pritchard
 
Giab jan2016 intro and update 160128
Giab jan2016 intro and update 160128Giab jan2016 intro and update 160128
Giab jan2016 intro and update 160128GenomeInABottle
 
AdamAmeur_SciLife_Bioinfo_course_Nov2015.ppt
AdamAmeur_SciLife_Bioinfo_course_Nov2015.pptAdamAmeur_SciLife_Bioinfo_course_Nov2015.ppt
AdamAmeur_SciLife_Bioinfo_course_Nov2015.pptRuthMWinnie
 
AdamAmeur_SciLife_Bioinfo_course_Nov2015.ppt
AdamAmeur_SciLife_Bioinfo_course_Nov2015.pptAdamAmeur_SciLife_Bioinfo_course_Nov2015.ppt
AdamAmeur_SciLife_Bioinfo_course_Nov2015.pptEdizonJambormias2
 
Flow Cytometry Training : Introduction day 1 session 1
Flow Cytometry Training : Introduction day 1 session 1Flow Cytometry Training : Introduction day 1 session 1
Flow Cytometry Training : Introduction day 1 session 1Robert (Rob) Salomon
 
The Human Genome Project - Part I
The Human Genome Project - Part IThe Human Genome Project - Part I
The Human Genome Project - Part Ihhalhaddad
 
Saha UC Davis Plant Pathology seminar Infrastructure for battling the Citrus ...
Saha UC Davis Plant Pathology seminar Infrastructure for battling the Citrus ...Saha UC Davis Plant Pathology seminar Infrastructure for battling the Citrus ...
Saha UC Davis Plant Pathology seminar Infrastructure for battling the Citrus ...Surya Saha
 
Evaluation of the impact of error correction algorithms on SNP calling.
Evaluation of the impact of error correction algorithms on SNP calling.Evaluation of the impact of error correction algorithms on SNP calling.
Evaluation of the impact of error correction algorithms on SNP calling.Nathan Olson
 
Microbiome studies using 16S ribosomal DNA PCR: some cautionary tales.
Microbiome studies using 16S ribosomal DNA PCR: some cautionary tales.Microbiome studies using 16S ribosomal DNA PCR: some cautionary tales.
Microbiome studies using 16S ribosomal DNA PCR: some cautionary tales.jennomics
 

Similar to Paprica course (20)

LifeZoneWellnessSNPIntro.pptx
LifeZoneWellnessSNPIntro.pptxLifeZoneWellnessSNPIntro.pptx
LifeZoneWellnessSNPIntro.pptx
 
CCBC tutorial beiko
CCBC tutorial beikoCCBC tutorial beiko
CCBC tutorial beiko
 
2015 Bioc4010 lecture1and2
2015 Bioc4010 lecture1and22015 Bioc4010 lecture1and2
2015 Bioc4010 lecture1and2
 
Introduction to 16S Microbiome Analysis
Introduction to 16S Microbiome AnalysisIntroduction to 16S Microbiome Analysis
Introduction to 16S Microbiome Analysis
 
Open Science and Data Sharing - CERF
Open Science and Data Sharing - CERFOpen Science and Data Sharing - CERF
Open Science and Data Sharing - CERF
 
Robert T. Dunn, II, Ph.D., DABT, SLAS ADMET Special Interest Group Meeting p...
 Robert T. Dunn, II, Ph.D., DABT, SLAS ADMET Special Interest Group Meeting p... Robert T. Dunn, II, Ph.D., DABT, SLAS ADMET Special Interest Group Meeting p...
Robert T. Dunn, II, Ph.D., DABT, SLAS ADMET Special Interest Group Meeting p...
 
Diversity Diversity Diversity Diversity ....
Diversity Diversity Diversity Diversity ....Diversity Diversity Diversity Diversity ....
Diversity Diversity Diversity Diversity ....
 
Bioinformatics for Computer Scientists.ppt
Bioinformatics for Computer Scientists.pptBioinformatics for Computer Scientists.ppt
Bioinformatics for Computer Scientists.ppt
 
Introduction-to-Bioinformatics-1.ppt
Introduction-to-Bioinformatics-1.pptIntroduction-to-Bioinformatics-1.ppt
Introduction-to-Bioinformatics-1.ppt
 
Bioinformatics final
Bioinformatics finalBioinformatics final
Bioinformatics final
 
Genome Assembly
Genome AssemblyGenome Assembly
Genome Assembly
 
BM405 Lecture Slides 21/11/2014 University of Strathclyde
BM405 Lecture Slides 21/11/2014 University of StrathclydeBM405 Lecture Slides 21/11/2014 University of Strathclyde
BM405 Lecture Slides 21/11/2014 University of Strathclyde
 
Giab jan2016 intro and update 160128
Giab jan2016 intro and update 160128Giab jan2016 intro and update 160128
Giab jan2016 intro and update 160128
 
AdamAmeur_SciLife_Bioinfo_course_Nov2015.ppt
AdamAmeur_SciLife_Bioinfo_course_Nov2015.pptAdamAmeur_SciLife_Bioinfo_course_Nov2015.ppt
AdamAmeur_SciLife_Bioinfo_course_Nov2015.ppt
 
AdamAmeur_SciLife_Bioinfo_course_Nov2015.ppt
AdamAmeur_SciLife_Bioinfo_course_Nov2015.pptAdamAmeur_SciLife_Bioinfo_course_Nov2015.ppt
AdamAmeur_SciLife_Bioinfo_course_Nov2015.ppt
 
Flow Cytometry Training : Introduction day 1 session 1
Flow Cytometry Training : Introduction day 1 session 1Flow Cytometry Training : Introduction day 1 session 1
Flow Cytometry Training : Introduction day 1 session 1
 
The Human Genome Project - Part I
The Human Genome Project - Part IThe Human Genome Project - Part I
The Human Genome Project - Part I
 
Saha UC Davis Plant Pathology seminar Infrastructure for battling the Citrus ...
Saha UC Davis Plant Pathology seminar Infrastructure for battling the Citrus ...Saha UC Davis Plant Pathology seminar Infrastructure for battling the Citrus ...
Saha UC Davis Plant Pathology seminar Infrastructure for battling the Citrus ...
 
Evaluation of the impact of error correction algorithms on SNP calling.
Evaluation of the impact of error correction algorithms on SNP calling.Evaluation of the impact of error correction algorithms on SNP calling.
Evaluation of the impact of error correction algorithms on SNP calling.
 
Microbiome studies using 16S ribosomal DNA PCR: some cautionary tales.
Microbiome studies using 16S ribosomal DNA PCR: some cautionary tales.Microbiome studies using 16S ribosomal DNA PCR: some cautionary tales.
Microbiome studies using 16S ribosomal DNA PCR: some cautionary tales.
 

Recently uploaded

A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfnehabiju2046
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )aarthirajkumar25
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...anilsa9823
 
Types of different blotting techniques.pptx
Types of different blotting techniques.pptxTypes of different blotting techniques.pptx
Types of different blotting techniques.pptxkhadijarafiq2012
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxAleenaTreesaSaji
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Lokesh Kothari
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...Sérgio Sacani
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTSérgio Sacani
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡anilsa9823
 
Work, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE PhysicsWork, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE Physicsvishikhakeshava1
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptxanandsmhk
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfSwapnil Therkar
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |aasikanpl
 
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 
G9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.pptG9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.pptMAESTRELLAMesa2
 
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxAnalytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxSwapnil Therkar
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisDiwakar Mishra
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 

Recently uploaded (20)

A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdf
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
 
Types of different blotting techniques.pptx
Types of different blotting techniques.pptxTypes of different blotting techniques.pptx
Types of different blotting techniques.pptx
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptx
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...

Paprica course

  • 1. Welcome! Universidade de São Paulo PAthway PRediction by phylogenetIC plAcement (paprica) short course Jeff Bowman, bowmanjs@ldeo.columbia.edu 30 March 2016
  • 2. Introduction and Logistics
      Schedule (tentative)
      0900 – 0915: Introductions and logistics
      0915 – 1015: Task 1: Troubleshoot installations; Task 2: Tutorial 1
      1015 – 1030: Break
      1030 – 1100: Discussion: The paprica workflow
      1100 – 1130: Discussion: Tutorial 1 results
      1130 – 1200: Troubleshooting installation for custom build of paprica database
      1200 – 1300: Lunch
      1300 – 1330: Tutorial 2: Building the paprica database
      1330 – 1400: Discussion: The paprica database workflow
      1400 – 1430: Demonstration: Metagenomic analysis with paprica (break during module)
      1430 – 1630: Your analysis with paprica. If you don’t have a set of libraries that you’d like to work with, we will help you find some.
      Objectives
      1. Install paprica and its dependencies, and learn how to use it to analyze a set of 16S rRNA gene sequences
      2. Install the dependencies for building the paprica database, and learn how to build a custom database
  • 3. What is paprica, and what can I do with it?
      paprica is a pipeline to estimate the metabolic pathways, enzymes (EC numbers), and genome parameters associated with 16S rRNA gene sequences.
      • Designed for NGS data
      • Also applicable to small libraries or even single 16S rRNA gene sequences (e.g. isolates)
      Bowman and Ducklow, 2015; Bowman, 2015
  • 4. Bowman, 2015
      Function | Pathway | Sanger studies | Hatam et al. (2014) | Bowman et al. (2012)
      CO2 fixation | CO2 fixation into oxaloacetate (anaplerotic) | Pseudoalteromonas haloplanktis TAC125 | Polaribacter MED152, Acidimicrobiales YM16-304 | Psychrobacter cryohalolentis K5, Polaribacter MED152
      Antibiotic resistance | Triclosan resistance | Pelagibacter ubique HTCC1062, Polaribacter MED152 | Polaribacter MED152, Leadbetterella byssophila DSM17132, Thiomicrospira spp., Gloeocapsa PCC7428, Acidimicrobiales YM16-304, Janthinobacterium spp. | P. cryohalolentis K5, Polaribacter MED152, GSOS
      C1 metabolism | Formaldehyde oxidation II (glutathione-dependent) | Colwellia psychrerythraea 34H | Gloeocapsa PCC7428, Marinobacter BSs20148, Glaciecola nitratireducens FR1064 | Octadecabacter antarcticus 307
      Choline degradation | Choline degradation 1 | C. psychrerythraea 34H | Acidimicrobiales YM304 | P. cryohalolentis K5, O. antarcticus 307
      Glycine betaine production | Glycine betaine biosynthesis I (Gram-negative bacteria) | C. psychrerythraea 34H | Acidimicrobiales YM304 | P. cryohalolentis K5, O. antarcticus 307
      Halocarbon degradation | 2-chlorobenzoate degradation | P. cryohalolentis K5 | Polaromonas naphthalenivorans CJ2 | P. cryohalolentis K5
      Mercury conversion | Phenylmercury acetate degradation | Marinobacter BSs20148, P. haloplanktis TAC125, Octadecabacter arcticus 238 | Belliella baltica DSM15883, Bordetella petrii | O. antarcticus 307
      Nitrogen fixation | Nitrogen fixation | Coraliomargarita akajimensis DSM45221 | C. akajimensis DSM45221, Methylomonas methanica MC09, Aeromonas spp. | C. akajimensis DSM45221
      Sulfite oxidation | Sulfite oxidation II/III | Pelagibacter ubique HTCC1062 | Cellvibrio japonicus UEDA107 | GSOS
      Sulfate reduction | Sulfate reduction IV/V | Halomonas elongata DSM2581, Psychrobacter arcticum 273 | Vibrio vulnificus YJ016 | GSOS
      Denitrification | Nitrate reduction I/VII | C. psychrerythraea 34H | C. japonicus UEDA107 | -
  • 5. Bowman et al., in revision
  • 6. Troubleshoot installation and conduct basic analysis
      Tutorial 1 – Initial analysis with paprica
      • Finish downloading and installing all remaining dependencies; let me know if you need assistance
        • Archaeopteryx
        • R and RStudio
      • Remove existing paprica directory, then download latest version of paprica
      • Start working through the tutorial located here: http://www.polarmicrobes.org/?p=1473
        • Start at “Testing the Installation”
      Commands:
          sudo apt-get install default-jre
          wget https://googledrive.com/host/0BxMokdxOh-JRM1d2azFoRnF3bGM/download/forester_1038.jar
          mv forester_1038.jar archaeopteryx.jar
          chmod a+x archaeopteryx.jar
          ## create bash script archaeopteryx containing these lines (no indentation):
          ## #!/bin/bash
          ## java -cp archaeopteryx.jar org.forester.archaeopteryx.Archaeopteryx
          ## make this script executable
          chmod a+x archaeopteryx
          rm -r paprica
          git clone https://github.com/bowmanjeffs/paprica.git
  • 7. The paprica workflow
      Workflow diagram (box labels in slide order):
      • 16S sequence library, the bigger the better!
      • Obtain all completed genomes (GenBank)
      • Predict metabolic pathways (ptools)
      • Construct 16S rRNA gene tree (Infernal, RAxML)
      • Place reads on reference tree (Infernal, pplacer)
      • Extract pathways for each placement
      • Generate confidence score for sample
      • Find pathways shared across all members of all clades
      • Calculate confidence for each node
      • Evaluate genomic plasticity for terminal nodes
      • Evaluate relative core genome size
      Diagram sections: Analysis, Database Construction, Confidence Scoring
      Three components to metabolic inference:
      1. Database construction
      2. Analysis
      3. Confidence scoring
      Caveats: Metabolic inference is only as good as…
      • Our genome annotations
      • The diversity of completed genomes
      • Our knowledge of metabolic pathways
      And is further limited by…
      • Genomic plasticity
  • 8. The paprica workflow
      • Data preparation
        • Read QC – basic steps
          • Overlap reads if paired-end (PE)
          • Trim for quality
          • Remove chloroplasts, mitochondria, anything else that looks weird
        • Methods
          • mothur (preferred)
          • QIIME
          • paprica/utilities/read_qc.py
      • Test run on single sample
      • Set up run for multiple samples
          while read f; do ./paprica-run.sh "$f" bacteria; done < samples.txt
        • where samples.txt contains a list of the sample files without their extension
      • Let’s take a look at paprica-run.sh…
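The multi-sample loop on slide 8 can also be driven from Python. The following is a minimal sketch, not part of paprica itself: the helper names `read_samples`, `build_commands`, and `run_all` are hypothetical, and it assumes `paprica-run.sh` sits in the current directory and takes a sample name followed by a domain, as in the shell loop.

```python
import subprocess
from pathlib import Path


def read_samples(list_path):
    """Return the non-empty sample names from a text file, one name per line."""
    lines = Path(list_path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def build_commands(samples, domain="bacteria", script="./paprica-run.sh"):
    """Build one argument list per sample, mirroring `./paprica-run.sh $f bacteria`."""
    return [[script, sample, domain] for sample in samples]


def run_all(list_path="samples.txt", domain="bacteria"):
    """Run the samples one after another and stop at the first failure."""
    for command in build_commands(read_samples(list_path), domain):
        # check=True raises CalledProcessError if the script exits non-zero.
        subprocess.run(command, check=True)
```

Calling `run_all()` would then do the same work as the one-line shell loop, with the added behaviour that a failed sample halts the batch instead of being silently skipped.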
  • 9. Tutorial 1 results
      Files initially provided or created by paprica
      • summer.fasta
      • summer.sub.fasta
      • summer.sub.clean.fasta
      Files produced for or during infernal/pplacer
      • summer.sub.combined_16S.bacteria.tax.clean.align.phyloxml
      • summer.sub.combined_16S.bacteria.tax.clean.align.csv
      • summer.sub.combined_16S.bacteria.tax.clean.align.sto
      • summer.sub.combined_16S.bacteria.tax.clean.align.fasta
      • summer.sub.combined_16S.bacteria.tax.clean.align.jplace
      paprica output files
      • summer.bacteria.ec.csv
      • summer.bacteria.sum_ec.csv
      • summer.bacteria.pathways.csv
      • summer.bacteria.sum_pathways.csv
      • summer.bacteria.edge_data.csv
      • summer.bacteria.sample_data.txt
  • 11. Tutorial 1 results (file list as on slide 9)
      Excerpt, apparently from summer.sub.combined_16S.bacteria.tax.clean.align.csv:
      origin,name,multiplicity,edge_num,like_weight_ratio,post_prob,likelihood,marginal_like,distal_length,pendant_length,classification,map_ratio,map_overlap
      summer.sub.combined_16S.bacteria.tax.clean.align,SRR584344.1832,1,2568,0.497633,0.769127,-42222.2,-42226,0.457927,0.317102,NA,NA,NA
      summer.sub.combined_16S.bacteria.tax.clean.align,SRR584344.4354,1,2253,0.840252,0.915613,-41188,-41192.1,7.3661e-06,0.263113,NA,NA,NA
      summer.sub.combined_16S.bacteria.tax.clean.align,SRR584344.3662,1,2422,0.614939,0.615935,-42880.8,-42884.1,6.32695e-06,0.17298,NA,NA,NA
      summer.sub.combined_16S.bacteria.tax.clean.align,SRR584344.2443,1,242,0.557322,0.787045,-43458.2,-43459.3,9.2618e-06,0.0380588,NA,NA,NA
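To get a feel for the placement table on slide 11, it may help to tally reads per edge. The sketch below uses only the four rows shown on the slide; the function name `reads_per_edge` and the 0.75 cutoff on `post_prob` are illustrative assumptions, not thresholds used by paprica.

```python
import csv
import io
from collections import Counter

# The header and four rows copied from the slide 11 excerpt.
PLACEMENTS_CSV = """origin,name,multiplicity,edge_num,like_weight_ratio,post_prob,likelihood,marginal_like,distal_length,pendant_length,classification,map_ratio,map_overlap
summer.sub.combined_16S.bacteria.tax.clean.align,SRR584344.1832,1,2568,0.497633,0.769127,-42222.2,-42226,0.457927,0.317102,NA,NA,NA
summer.sub.combined_16S.bacteria.tax.clean.align,SRR584344.4354,1,2253,0.840252,0.915613,-41188,-41192.1,7.3661e-06,0.263113,NA,NA,NA
summer.sub.combined_16S.bacteria.tax.clean.align,SRR584344.3662,1,2422,0.614939,0.615935,-42880.8,-42884.1,6.32695e-06,0.17298,NA,NA,NA
summer.sub.combined_16S.bacteria.tax.clean.align,SRR584344.2443,1,242,0.557322,0.787045,-43458.2,-43459.3,9.2618e-06,0.0380588,NA,NA,NA
"""


def reads_per_edge(csv_text, min_post_prob=0.0):
    """Count reads placed on each edge, keeping placements at or above the cutoff."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if float(row["post_prob"]) >= min_post_prob:
            # multiplicity is the number of reads represented by this row.
            counts[int(row["edge_num"])] += int(row["multiplicity"])
    return counts


all_edges = reads_per_edge(PLACEMENTS_CSV)
# 0.75 is an arbitrary illustrative cutoff.
confident_edges = reads_per_edge(PLACEMENTS_CSV, min_post_prob=0.75)
print(sorted(confident_edges))
```

With the cutoff applied, the read placed on edge 2422 (post_prob 0.615935) drops out while the other three remain.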
  • 12. Tutorial 1 results (file list as on slide 9)
      Excerpt, apparently from summer.bacteria.ec.csv:
      ,15,37,51,142,242,243,552,649,678,739,796,802,805,1030,1050,1075,1106,1107,2139…
      1.-.-.-,0.0,0.0,0.0,35.25,90.0,14.0,0.0,0.0…
      1.1.-.-,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0…
      1.1.1.-,0.0,0.0,0.0,35.25,0.0,0.0,0.0,0…
      1.1.1.1,0.0,0.0,0.0,23.5,135.0,21.0,0.333333333333…
      Columns: edge number for each CCG and CEG
      Rows: EC number
  • 13. Tutorial 1 results (file list as on slide 9)
      Excerpt, apparently from summer.bacteria.sum_ec.csv:
      1.-.-.-,175.159090909
      1.1.-.-,0.333333333333
      1.1.1.-,44.0984848485
      1.1.1.1,192.475757576
      1.1.1.10,0.0
      1.1.1.100,1168.89333799
      1.1.1.102,0.333333333333
      First column: EC number
      Second column: sum (normalized) across all CCG and CEG
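A quick way to read the summed EC table on slide 13 is to rank the fully specified EC numbers by abundance. This is a sketch over the seven rows shown on the slide; the function name `top_ec_numbers` and the choice to skip partial EC numbers (those containing `-`) are assumptions made for illustration.

```python
import csv
import io

# The seven rows copied from the slide 13 excerpt (no header row is shown).
SUM_EC_CSV = """1.-.-.-,175.159090909
1.1.-.-,0.333333333333
1.1.1.-,44.0984848485
1.1.1.1,192.475757576
1.1.1.10,0.0
1.1.1.100,1168.89333799
1.1.1.102,0.333333333333
"""


def top_ec_numbers(csv_text, n=2):
    """Return the n most abundant fully specified EC numbers as (ec, abundance)."""
    rows = []
    for record in csv.reader(io.StringIO(csv_text)):
        if not record:
            continue
        ec_number, abundance = record[0], float(record[1])
        # Skip partial classifications such as 1.1.-.- or 1.1.1.-
        if "-" not in ec_number:
            rows.append((ec_number, abundance))
    rows.sort(key=lambda item: item[1], reverse=True)
    return rows[:n]


top_two = top_ec_numbers(SUM_EC_CSV)
print(top_two)
```

On this excerpt the ranking puts 1.1.1.100 first and 1.1.1.1 second.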
  • 14. Tutorial 1 results (file list as on slide 9)
      Excerpt, apparently from summer.bacteria.pathways.csv:
      ,15,37,51,142,242,243,552,649,678,739,796,802,805,1030,1050,1075,1106,1107,2139…
      "(1,3)-beta-D-xylan degradation",0.0,0.0,0.0,0.0,0.0,0.0,0.0…
      (KDO)2-lipid A biosynthesis I,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.6…
      (R)-acetoin biosynthesis I,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0…
      Columns: edge number for each CCG and CEG
      Rows: pathway
  • 15. Tutorial 1 results (file list as on slide 9)
      Excerpt, apparently from summer.bacteria.edge_data.csv (transposed and put in table); values are listed in slide order, and some fields show only one value:
      edge_num | 242 | 243
      taxon | GCF_000012345.1_Candidatus Pelagibacter ubique HTCC1062_strain=HTCC1062
      nedge | 53 | 5
      n16S | 1 | 1
      nedge_corrected | 53 | 5
      nge | 1 | 1
      ncds | 1333 | 1355.5
      genome_size | 1308759 | 1325981
      GC | 29.68308145 | 29.15748
      phi | 0.478821295 | 0.480875
      clade_size | 1 | 2
      branch_length | 0.0189682 | 0.246143
      npaths_terminal | 119.5
      npaths_actual | 116 | 144
      confidence | 0.478821295 | 0.625556
      post_prob | 0.789555434 | 0.814622
      nec_actual | 369 | 461
      nec_terminal | 315.5
  • 16. Tutorial 1 results (file list as on slide 9)
      Excerpt, apparently from summer.bacteria.sample_data.txt:
      name summer.bacteria
      sample_confidence 0.49199211424
      npathways 572
      ppathways 1007
      nreads 1000
      database_created_at 2016-03-03T00:59:34.792240
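The sample summary on slide 16 is a short list of name and value pairs, which is easy to load into a dictionary. The sketch below assumes each line holds a key and a value separated by whitespace, as the slide suggests; the actual delimiter in the file may differ, and `parse_sample_data` is a hypothetical helper.

```python
# The six fields copied from the slide 16 excerpt.
SAMPLE_DATA = """name summer.bacteria
sample_confidence 0.49199211424
npathways 572
ppathways 1007
nreads 1000
database_created_at 2016-03-03T00:59:34.792240
"""


def convert(value):
    """Turn a string into int or float where possible, else leave it as text."""
    for caster in (int, float):
        try:
            return caster(value)
        except ValueError:
            continue
    return value


def parse_sample_data(text):
    """Parse 'key value' lines into a dictionary (whitespace delimiter assumed)."""
    parsed = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, value = line.split(None, 1)
        parsed[key] = convert(value.strip())
    return parsed


sample = parse_sample_data(SAMPLE_DATA)
print(sample["name"], sample["sample_confidence"], sample["nreads"])
```

Collecting one such dictionary per sample makes it simple to compare `sample_confidence` across a set of libraries.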
  • 17. Tutorial 2
      • Download the remaining dependencies
        • RAxML
          • add to PATH
          • What if CPU can’t support AVX2? Cheat.
        • pathway-tools
          • follow GUI instructions
        • taxtastic
          • make sure that system Python is Anaconda (or alternate distro), then:
              pip install taxtastic
      • Follow the tutorial here: http://www.polarmicrobes.org/?p=1543
        • Only complete the “Test paprica-build.sh” section!
      Commands (RAxML):
          git clone https://github.com/stamatak/standard-RAxML.git
          cd standard-RAxML
          make -f Makefile.AVX2.PTHREADS.gcc
          rm *.o
  • 18. Discussion: The paprica database workflow
      Directory tree diagram (labels in slide order; most labels appear once for bacteria and once for archaea):
      ref_genome_database
      ptools-local
      user
      bacteria, archaea
      bacteria, archaea
      refseqcomb…refpkg (×2)
      terminal_paths.csv, terminal_ec.csv, internal_probs.csv, internal_ec_probs.csv, internal_ec_n.csv, internal_data.csv, genome_data_final.csv, genome_data.csv, combined_16S.bacteria.tax.database_info.txt
      terminal_paths.csv, terminal_ec.csv, internal_probs.csv, internal_ec_probs.csv, internal_ec_n.csv, internal_data.csv, genome_data_final.csv, genome_data.csv, combined_16S.archaea.tax.database_info.txt
      GCF…* (×2), each followed by: *.fasta, *.hits, *.sto, *.5mer_bints.txt.gz, *.genomic.fna, *.genomic.gbff, *.protein.faa
      GCF…* (×2)
      draft.combined_16S.fasta (×2)
      *.fasta, *.hits, *.sto, *.genomic.fna, *protein.gbk (×2)
      paprica-mg.dmnd
      paprica-mg.prot.csv.gz
      combined_16S.[domain].tax.clean.align.fasta
      combined_16S.[domain].tax.clean.align.sto
      CONTENTS.json
      phylo_modeleSi5_T.json
      RAxML_fastTreeSH_Support.conf.root.ref.tre
      RAxML_info.ref.tre
      * * *
  • 19. Discussion: The paprica database workflow
      paprica-make_ref.py
      • Downloads all completed genomes from GenBank
      • Counts 16S genes in each genome and pulls a representative
      • Calculates other genome parameters
      • Constructs 16S alignment and distance matrix
      • Constructs genome distance matrix (compositional vector based)
      • Calculates phi from 16S distance matrix and genome distance matrix
      • Finds 16S genes in user genomes (if present)
      • Adds user 16S genes to previous alignment
      paprica-place_it.py
      • Constructs reference tree and reference package from 16S alignment
      paprica-build_core_genomes.py
      • Predicts metabolic pathways for each genome
      • Tallies up EC numbers for each genome
      • For each internal node on the reference tree, determines mean parameters and the fraction of occurrence of EC numbers and metabolic pathways
      • Exports all of this information as csv files
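The internal-node step described for paprica-build_core_genomes.py (mean parameters plus fraction of occurrence across the genomes in a clade) can be illustrated with a toy calculation. This is a sketch of the idea only, not paprica's implementation: the function `summarize_clade`, the dictionary layout, and the two toy genomes are all hypothetical.

```python
from statistics import mean


def summarize_clade(members):
    """Summarize a clade from its terminal genomes.

    members: list of dicts with 'genome_size' (number) and 'pathways' (set of names).
    Returns the clade size, mean genome size, and the fraction of members
    in which each pathway occurs.
    """
    if not members:
        raise ValueError("a clade needs at least one member genome")
    clade_size = len(members)
    all_pathways = set().union(*(m["pathways"] for m in members))
    pathway_fraction = {
        pathway: sum(pathway in m["pathways"] for m in members) / clade_size
        for pathway in sorted(all_pathways)
    }
    return {
        "clade_size": clade_size,
        "mean_genome_size": mean(m["genome_size"] for m in members),
        "pathway_fraction": pathway_fraction,
    }


# Toy data (hypothetical genomes, not taken from the paprica database).
toy_clade = [
    {"genome_size": 1000000, "pathways": {"pathway A", "pathway B"}},
    {"genome_size": 2000000, "pathways": {"pathway A"}},
]
summary = summarize_clade(toy_clade)
print(summary)
```

A pathway found in every member gets a fraction of 1.0, while one found in half the members gets 0.5, which is the kind of per-node value the internal_probs.csv and internal_ec_probs.csv file names suggest.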
  • 20. Demonstration: paprica-mg.py
      • If you’re on a server you can follow the tutorial at http://www.polarmicrobes.org/?p=1596
      • test.annotation.csv: The number of hits in the metagenome, by EC number. This is probably the most useful file to you. The columns are:
        • index: The accession of a representative protein from the database
        • genome: Genome the representative protein comes from
        • domain: Domain of this genome
        • EC_number: The EC number
        • product: A sensible name for the gene product
        • start: Start position of the gene in the genome
        • end: End position of the gene in the genome
        • n_occurences: The number of occurrences of this EC number in the database
        • nr_hits: The number of reads that matched this EC number. Each read is allowed only one hit.
      • test.paprica-mg.nr.daa: The DIAMOND format results file. Only one hit per read is reported.
      • test.paprica-mg.nr.txt: A text file of the DIAMOND results. Only one hit per read is reported.
      • test_mg.pathologic (for -pathways T only): A directory containing .gbk files for each genome in the paprica database that received a hit, with each EC number that got a hit for that genome.
      • test.pathways.txt: A simple list of all the pathways that were predicted for the metagenome.
      Command:
          paprica-mg_run.py -i ERR318619_1.qc.fasta.gz -o demo -ref_dir ref_genome_database -pathways F
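Given the column descriptions for test.annotation.csv, one simple summary is the total number of read hits per EC number. The sketch below runs on three invented rows; the column names follow the slide, but the exact header of the real file (in particular how the index column is labelled) and the possibility of several rows per EC number are assumptions, and `hits_by_ec` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Invented rows for illustration only; column names follow slide 20.
ANNOTATION_CSV = """index,genome,domain,EC_number,product,start,end,n_occurences,nr_hits
prot_001,genome_A,bacteria,1.1.1.1,alcohol dehydrogenase,100,1100,12,7
prot_002,genome_B,bacteria,1.1.1.1,alcohol dehydrogenase,2000,3050,12,3
prot_003,genome_A,bacteria,2.7.7.7,DNA-directed DNA polymerase,5000,8000,40,5
"""


def hits_by_ec(csv_text):
    """Sum nr_hits over all rows that share an EC number."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # int(float(...)) tolerates values written as 7 or 7.0
        totals[row["EC_number"]] += int(float(row["nr_hits"]))
    return dict(totals)


ec_hits = hits_by_ec(ANNOTATION_CSV)
print(ec_hits)
```

To use it on real output, read the file contents first, for example `hits_by_ec(open("test.annotation.csv").read())`, after checking that the header matches the names assumed here.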
  • 21. On to your own analysis!
      • Evaluations