Nephele 2.0 Webinar: How to get the most out of your Nephele results
16 November 2018
Bioinformatics and Computational Biosciences Branch
Poorani Subramanian, Ph.D.
Mariam Quiñones, Ph.D.
2
Overview
Nephele 2.0 – What's new?
§ New site
§ Under the hood: new
infrastructure framework and
performance improvements
§ Resubmit a job with the job ID
§ Interactive mapping file
submission
§ Updated and New Pipelines
• NEW: 16S DADA2
• NEW: Pre-processing QC
• Updated: 16S mothur
3
https://nephele.niaid.nih.gov/details_dada2/
Nephele 2.0 – New DADA2 Pipeline
§ Uses the DADA2 v1.6 R package
§ Instead of clustering OTUs, denoises/error-corrects reads to produce sequence variants
§ Taxonomic assignment with the RDP classifier algorithm and the SILVA database
§ benjjneb.github.io/dada2/index.html
7
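For orientation, here is a minimal sketch of the core DADA2 workflow that the pipeline wraps, using functions from the dada2 R package; the file paths, primer length, SILVA training-set file name, and parameter values are illustrative assumptions, not Nephele's exact settings:

library(dada2)

# Hypothetical paired-end FASTQ files; adjust paths and patterns for your own data
fnFs <- sort(list.files("fastq", pattern = "_R1.fastq.gz", full.names = TRUE))
fnRs <- sort(list.files("fastq", pattern = "_R2.fastq.gz", full.names = TRUE))
filtFs <- file.path("filtered", basename(fnFs))
filtRs <- file.path("filtered", basename(fnRs))

# Quality filter and trim the reads
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, trimLeft = 20, maxEE = 5, truncQ = 4)

# Learn error rates, dereplicate, and denoise into sequence variants
errF <- learnErrors(filtFs, multithread = TRUE)
errR <- learnErrors(filtRs, multithread = TRUE)
derepF <- derepFastq(filtFs)
derepR <- derepFastq(filtRs)
ddF <- dada(derepF, err = errF, multithread = TRUE)
ddR <- dada(derepR, err = errR, multithread = TRUE)

# Merge read pairs, build the sequence table, and remove chimeras
merged <- mergePairs(ddF, derepF, ddR, derepR, verbose = TRUE)
seqtab <- makeSequenceTable(merged)
seqtab.nochim <- removeBimeraDenovo(seqtab, method = "consensus", verbose = TRUE)

# Assign taxonomy with the RDP naive Bayesian classifier against a SILVA training set
# (hypothetical file name; use the reference file appropriate for your SILVA release)
taxa <- assignTaxonomy(seqtab.nochim, "silva_nr_train_set.fa.gz", multithread = TRUE)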
8
§ Uploading files
§ Quality check of your data
Uploading Files
§ File upload page – upload from
local
• Sometimes you may see an
error
• File size > 450 MB limit
§ Can upload via ftp instead
• Upload data to any public ftp
server; NIH provides
ftp://helix.nih.gov/pub
• Use the URL of the folder with your FASTQ files
https://nephele-prod-resources.s3.amazonaws.com/How_to_load_files_to_Helix_Public_FTP.pdf
12
13
§ Uploading files
§ Quality check of your data
Why should we care about data quality?
§ Best practices include doing a
series of Quality Control steps to
verify and sometimes improve
data quality
§ Sequence analysis and results
are highly dependent on data
quality!
14
Why should we care about data quality?
§ Many (most?) of the parameters
for Nephele's pipelines relate to
quality
§ Defaults don't always work well
for every dataset
§ Everyone's data is different
§ Get To Know Your Data
15
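One quick way to get to know your data before submitting a job is to look at per-base quality profiles locally; a minimal sketch using dada2::plotQualityProfile (the same function that appears in Nephele's DADA2 log), with an assumed local folder of FASTQ files:

library(dada2)

# Hypothetical folder of demultiplexed FASTQ files
fq <- sort(list.files("fastq", pattern = "fastq", full.names = TRUE))

# Per-base quality profiles for the first few files
plotQualityProfile(fq[1:4])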
Pre-processing QC: Get to Know Your Data
§ Nephele's Pre-processing Quality
Check Pipeline
• Designed to be run before you
do microbiome analysis
• Same input data and map file
used for microbiome pipelines
§ Getting Started: Run without any
options!
https://nephele.niaid.nih.gov/details_qc
18
Pre-processing QC: FastQC
§ MultiQC aggregates results into
multiqc_report.html
§ Num reads in each file
• Do R1 & R2 have same num
reads?
§ Average per base quality for each
sample
• Colored according to FastQC
defaults
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/
20
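To answer the "do R1 & R2 have the same number of reads?" question yourself, outside of the MultiQC report, you can count records directly (a FASTQ record is 4 lines); the folder and filename patterns below are assumptions:

# Paired FASTQ files; gzipped files are read transparently by readLines()
r1 <- sort(list.files("fastq", pattern = "_R1.*fastq", full.names = TRUE))
r2 <- sort(list.files("fastq", pattern = "_R2.*fastq", full.names = TRUE))

# reads = lines / 4
n1 <- sapply(r1, function(f) length(readLines(f)) / 4)
n2 <- sapply(r2, function(f) length(readLines(f)) / 4)

data.frame(R1 = basename(r1), reads_R1 = n1,
           R2 = basename(r2), reads_R2 = n2,
           equal = n1 == n2, row.names = NULL)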
Pre-processing QC: Primer & Adapter Trimming
§ QIIME 2 cutadapt plugin
§ For amplicon primers, trim as a front (5') adapter
§ For other adapters, usually trim as a 3' adapter
§ CHECK with the sequencing center for adapter and primer info
§ MultiQC graphs
https://docs.qiime2.org/2018.6/plugins/available/cutadapt/
24
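If your primers sit at the 5' end of every read and have a fixed length, one alternative worth knowing about (an option, not the pipeline's required route) is to drop them with the trimLeft parameter of DADA2's filterAndTrim instead of running cutadapt; the file paths and primer lengths below are hypothetical:

library(dada2)

# Hypothetical inputs and primer lengths: 20 nt forward primer, 18 nt reverse primer
fnFs <- sort(list.files("fastq", pattern = "_R1.fastq.gz", full.names = TRUE))
fnRs <- sort(list.files("fastq", pattern = "_R2.fastq.gz", full.names = TRUE))
filtFs <- file.path("filtered", basename(fnFs))
filtRs <- file.path("filtered", basename(fnRs))

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     trimLeft = c(20, 18),  # bases removed from the start of R1 and R2
                     maxEE = 5, truncQ = 4, compress = TRUE, multithread = TRUE)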
Pre-processing QC: Other Steps
§ Quality trimming with
Trimmomatic
• Trim with sliding window
• Filter poor quality reads
§ Paired-end read merging with
FLASh
• May be more robust than read
mergers included in QIIME,
mothur, and DADA2
• https://www.researchgate.net/publication/303288211_Evaluating_Paired-End_Read_Mergers
https://nephele.niaid.nih.gov/details_qc
25
26
§ Important Files and Troubleshooting
§ Visualizations
Outputs: DADA2 Example
§ DADA2 results in main outputs
folder
§ graphs folder – output of the 16S visualizations
27
Important files: logfile.txt
§ The story of your data's analysis
§ Messages start with the date, and then INFO, WARNING, or ERROR
§ At the top, a list of the pipeline parameters
29
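Because every message carries a date and a level, pulling just the problems out of a long logfile.txt is easy; a minimal sketch in R (the file name matches the Nephele output, the rest is generic):

log <- readLines("logfile.txt")

# Keep only the WARNING and ERROR messages
log[grepl("WARNING|ERROR", log)]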
Important files: logfile.txt
§ The story of your data's analysis
§ Individual commands/programs run
https://nephele.niaid.nih.gov/details_dada2/#pipeline-steps
30
[Mon Jul 30 16:11:25 2018] Paired End
[Mon Jul 30 16:11:25 2018] pqp <- lapply(readslist, FUN = function(x) { ppp <-
plotQualityProfile(file.path(datadir, x)); ppp$facet$params$ncol <- 4; ppp })
[Mon Jul 30 16:11:37 2018] Saving quality profile plots to
quality_Profile_R*.pdf
[Mon Jul 30 16:11:40 2018] out <-
filterAndTrim(fwd=file.path(datadir,readslist$R1),
filt=file.path(filt.dir,trimlist$R1),rev=file.path(datadir,readslist$R2),
filt.rev=file.path(filt.dir,trimlist$R2), maxEE=5, trimLeft=list(20L, 20L),
truncQ=4, truncLen = list(0L, 0L), rm.phix=TRUE, compress=TRUE, verbose=TRUE,
multithread=nthread, minLen=50)
Creating output directory:
/mnt/EFS/user_uploads/c82b2a9c0e40/outputs/filtered_data
Troubleshooting: logfile.txt
§ Example dummy dataset
§ Get an error email
'Input must be a valid sequence table. '
indicates sequence table is empty
because no sequence variants were
produced after denoising and merging
reads (for PE). You may want to
examine the dataset quality and modify
your filterAndTrim or mergePairs (for
PE) parameters. Please refer to
logfile.txt for more information.
§ When something goes wrong, look for the ERROR messages
31
[2018-10-03 19:00:24.543] dd <- sapply(nameslist, function(x) dada(derep[[x]], err=err[[x]],
multithread=nthread, verbose=1), USE.NAMES=TRUE, simplify=FALSE)
Sample 1 - 99 reads in 54 unique sequences.
Sample 1 - 99 reads in 54 unique sequences.
[2018-10-03 19:00:24.594] mergePairs(dd$R1, derep$R1, dd$R2, derep$R2, verbose=TRUE,
minOverlap=12, trimOverhang=FALSE, maxMismatch=0, justConcatenate=FALSE)
0 paired-reads (in 0 unique pairings) successfully merged out of 99 (in 9 pairings) input.
[2018-10-03 19:00:24.605] derep <- lapply(trimlist, function(x) derepFastq(x[sample],
verbose=TRUE))
Dereplicating sequence entries in Fastq file:
/mnt/EFS/user_uploads/f6c21d383553/outputs/filtered_data/74S74R1_trim.fastq.gz
Encountered 54 unique sequences from 99 total sequences read.
Dereplicating sequence entries in Fastq file:
/mnt/EFS/user_uploads/f6c21d383553/outputs/filtered_data/74S74R2_trim.fastq.gz
Encountered 54 unique sequences from 99 total sequences read.
[2018-10-03 19:00:24.661] dd <- sapply(nameslist, function(x) dada(derep[[x]], err=err[[x]],
multithread=nthread, verbose=1), USE.NAMES=TRUE, simplify=FALSE)
Sample 1 - 99 reads in 54 unique sequences.
Sample 1 - 99 reads in 54 unique sequences.
[2018-10-03 19:00:24.711] mergePairs(dd$R1, derep$R1, dd$R2, derep$R2, verbose=TRUE,
minOverlap=12, trimOverhang=FALSE, maxMismatch=0, justConcatenate=FALSE)
0 paired-reads (in 0 unique pairings) successfully merged out of 99 (in 9 pairings) input.
[2018-10-03 19:00:24.722] seqtab <- makeSequenceTable(sampleVariants)
[2018-10-03 19:00:24.740] seqtabnochimera <- removeBimeraDenovo(seqtab, verbose=TRUE,
multithread=nthread)
Warning in is.na(colnames(unqs[[i]])) :
is.na() applied to non-(list or vector) of type 'NULL'
As of the 1.4 release, the default method changed to consensus (from pooled).
Error:
Input must be a valid sequence table.
Call: isBimeraDenovoTable(unqs[[i]], ..., verbose = verbose), Pipeline Step:
dada2::removeBimeraDenovo, Pipeline: dada2compute
[2018-10-03 19:00:24,759 - ERROR] R Pipeline Error:
[2018-10-03 19:00:24,759 - ERROR] ('Input must be a valid sequence table. ', 'f6c21d383553')
[2018-10-03 19:00:24,866 - INFO] 1
Troubleshooting: logfile.txt
§ Get an error email
…You may want to examine the dataset
quality and modify your filterAndTrim or
mergePairs (for PE) parameters…
§ Check output of filterAndTrim
• 99/100 reads passed filter
32
[2018-10-03 19:00:17.445] out <- filterAndTrim(fwd=file.path(datadir,readslist$R1),
filt=file.path(filt.dir,trimlist$R1),rev=file.path(datadir,readslist$R2),
filt.rev=file.path(filt.dir,trimlist$R2), maxEE=5, trimLeft=list(20L, 20L), truncQ=4, truncLen
= list(0L, 0L), rm.phix=TRUE, compress=TRUE, verbose=TRUE, multithread=nthread, minLen=50)
Creating output directory: /mnt/EFS/user_uploads/f6c21d383553/outputs/filtered_data
reads.in reads.out
1S1R1.fastq 100 99
2S2R1.fastq 100 99
3S3R1.fastq 100 99
4S4R1.fastq 100 99
5S5R1.fastq 100 99
6S6R1.fastq 100 99
73S73R1.fastq 100 99
7S7R1.fastq 100 99
74S74R1.fastq 100 99
[2018-10-03 19:00:19.004] Checking that trimmed files exist.
[2018-10-03 19:00:19.023] err <- lapply(trimlist, function(x) learnErrors(x,
multithread=nthread, nreads=1000000,randomize=TRUE))
Initializing error rates to maximum possible estimate.
Sample 1 - 99 reads in 54 unique sequences.
Sample 2 - 99 reads in 54 unique sequences.
Sample 3 - 99 reads in 54 unique sequences.
Sample 4 - 99 reads in 54 unique sequences.
Sample 5 - 99 reads in 54 unique sequences.
Sample 6 - 99 reads in 54 unique sequences.
Sample 7 - 99 reads in 54 unique sequences.
Sample 8 - 99 reads in 54 unique sequences.
Sample 9 - 99 reads in 54 unique sequences.
selfConsist step 2
selfConsist step 3
Convergence after 3 rounds.
Total reads used: 891
Initializing error rates to maximum possible estimate.
Sample 1 - 99 reads in 54 unique sequences.
Sample 2 - 99 reads in 54 unique sequences.
Sample 3 - 99 reads in 54 unique sequences.
Sample 4 - 99 reads in 54 unique sequences.
Sample 5 - 99 reads in 54 unique sequences.
Sample 6 - 99 reads in 54 unique sequences.
Sample 7 - 99 reads in 54 unique sequences.
Troubleshooting: logfile.txt
§ Get an error email
…You may want to examine the dataset
quality and modify your filterAndTrim or
mergePairs (for PE) parameters…
§ Check messages from mergePairs (see the log excerpt on the error slide above)
§ None of the samples had reads that merged!
33
Troubleshooting: logfile.txt
§ Check messages from mergePairs
§ None of the samples had reads that merged!
§ How to fix?
• Change max mismatch for mergePairs in DADA2
• Or use the FLASh read merger in QC pipeline
– Submit merged reads to SE pipeline
34
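If you rerun DADA2 locally to test this, here is a sketch of what "change max mismatch" looks like, continuing from the dada()/derepFastq() objects in the workflow sketch on the New DADA2 Pipeline slide above; the parameter values are illustrative only:

library(dada2)

# ddF/derepF/ddR/derepR as produced by dada() and derepFastq() in the earlier sketch
merged <- mergePairs(ddF, derepF, ddR, derepR,
                     minOverlap = 8,      # require a shorter overlap
                     maxMismatch = 2,     # tolerate a couple of mismatches in the overlap
                     verbose = TRUE)

# If the reads genuinely do not overlap, justConcatenate = TRUE joins pairs with Ns instead

On Nephele itself, the corresponding mergePairs options are set on the DADA2 job submission form, per the error email's suggestion.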
Troubleshooting: logfile.txt
§ Check messages from filterAndTrim
§ Suppose very few reads pass filter
§ How to fix?
• Change truncLen, truncQ, maxEE for filterAndTrim in
DADA2
• Or use Trimmomatic in QC pipeline
35
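Similarly, a sketch of adjusting the filter step; on Nephele these map onto the DADA2 job options, and the truncation lengths and error caps below are only illustrative (pick them from your quality profiles):

library(dada2)

# fnFs/filtFs/fnRs/filtRs as defined in the earlier workflow sketch
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     truncLen = c(240, 180),  # cut reads where quality drops off
                     truncQ = 2,              # truncate at the first base below Q2
                     maxEE = c(5, 7),         # relax the expected-error filter
                     rm.phix = TRUE, compress = TRUE, multithread = TRUE)
out   # compare reads.in vs reads.out per sample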
Important files: otu_summary_table.txt
Num samples: 10
Num observations: 508
Total count: 161,156
Table density (fraction of non-zero values): 0.167
Counts/sample summary:
Min: 13,516.000
Max: 18,349.000
Median: 15,938.500
Mean: 16,115.600
Std. dev.: 1,566.865
Sample Metadata Categories: None provided
Observation Metadata Categories: taxonomy
Counts/sample detail:
7pRecSw478.1: 13,516.000
A22145: 14,505.000
A22350: 14,814.000
A22833: 15,550.000
A22349: 15,571.000
A22831: 16,306.000
A22061: 16,377.000
A22057: 17,932.000
A22187: 18,236.000
A22192: 18,349.000
36
§ Summary of the final biom file
after taxonomic ID – BUT before
any downstream analysis
§ Num observations: total # of
distinct seq variants or OTUs
§ Compare the counts/sample to:
• # reads in the input file (logfile
or QC report)
• sampling depth (default 10k)
Important files: graphs/samples_being_ignored.txt
§ When is downstream analysis (graphs, diversity, etc.) run?
• Only when at least 3 samples have counts above the sampling depth
§ Lists the samples ignored for downstream analysis
§ These samples do not appear in the plots or in the QIIME 1 core diversity plots and statistics
§ If this file is not in the graphs/ folder, then no samples were ignored
37
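To anticipate which samples will land in samples_being_ignored.txt, compare per-sample totals in the final sequence table to the sampling depth; a sketch continuing from seqtab.nochim in the earlier workflow sketch, with Nephele's default depth of 10,000:

depth <- 10000                    # Nephele's default sampling depth

# Rows of the DADA2 sequence table are samples, so rowSums() gives reads per sample
counts <- rowSums(seqtab.nochim)

counts[counts <= depth]           # samples that would be ignored downstream
sum(counts > depth) >= 3          # downstream analysis runs only with at least 3 samples above depth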
38
§ Important Files and Troubleshooting
§ Visualizations
Morpheus Heatmaps
nephele.niaid.nih.gov/user_guide_tutorials/#heatmap
software.broadinstitute.org/morpheus
39
Plotly Graphs – Simple Edits (videos)
40
Plotly Graphs – Change colors
§ Bigger edits: use Plotly Chart Studio
help.plot.ly/tutorials
41
Try It!
§ Example Graphs
§ Tutorials page
§ https://nephele.niaid.nih.gov/user_guide_tutorials/#example-files
42
Thank You!
Further Help & Info – Nephele Team
§ Frequently Asked Questions:
nephele.niaid.nih.gov/faq
§ Tutorials:
nephele.niaid.nih.gov/user_guide_tutorials
§ Details Pages:
nephele.niaid.nih.gov/user_guide_pipes
• Individual Pipelines Links
§ nephelesupport@niaid.nih.gov
43
