Best practices for data analysis when using UMI adapters to improve variant detection

Wendy Lee, PhD
Staff Scientist

Outline
• Overview of NGS workflow that includes sample multiplexing
• Overview of workflow with xGen® Dual Index UMI Adapters—Tech
Access
• Discussion of data analysis steps:
– Extracting UMIs from sequencing reads
– Constructing consensus reads within UMI families
• Improving variant calling accuracy using consensus reads
UMI: unique molecular identifier

NGS workflow with xGen Dual Index UMI Adapters
[Figure: NGS workflow diagram; labeled components include xGen Universal Blockers.]

xGen Dual Index UMI Adapters—Tech Access
3-in-1 design
• Designed for Illumina sequencers
• Compatible with standard end-repair and A-tailing library
construction, including PCR-free library methods
• Dual unique sample indices reduce sample cross-talk
• Degenerate 9-base UMI is incorporated for error correction and/or
counting applications

Consensus calling reduces artifacts in sequencing data
[Figure series: true positives (TP) shown for total reads, for reads deduplicated by start/stop positions, and for consensus reads (Min3); one panel highlights a UMI family.]

Extracting UMIs within sample index reads during
demultiplexing

Assumptions and requirements
• Sequencing data are generated from the Illumina platform
• The following tools are installed in a Linux environment:
– Picard, version 2.9.0
– Burrows-Wheeler Aligner (BWA), version 0.7.15-r1140
– Fgbio, version 0.5.0
– VarDict Java
• Access to the raw basecall data output from the sequencer

Data analysis guidelines on IDT website
www.idtdna.com/UMI-techaccess

Overall workflow
Inputs: Illumina basecalls and sample sheet
• Steps D1–6: Convert base calls to short reads with UMI information during demultiplexing of the NGS run. Output: short-read files with UMI info
• Steps C1–4: Call consensus reads using UMIs
• Steps P1–4: Post-consensus calling analysis. Output: variant calls

Extract UMIs from sample index reads
through Illumina demultiplexing workflow
Input: sample sheet
• Step D1: Create the sample barcode input file (Barcode_file.txt)
• Step D2: Create an output directory for storing the barcodes extracted from the sample index reads
• Step D3: Determine the read structure (100T8B9M8B100T)
• Step D4: Run ExtractIlluminaBarcodes (Picard). Output: extracted barcode files
• Step D5: Create an input file that specifies the output BAM file associated with each sample (Library_param.txt)
• Step D6: Run IlluminaBasecallsToSam (Picard). Output: unmapped BAM files

Steps D1,2: Create a barcode file containing the sample barcode
information for each sample.
• UMI bases are represented as Ns in the barcode sequence
• This is a tab-delimited file
• In this example, the file is saved as /mnt/demodata/barcode_file.txt
• In this example, the output directory (Step D2) is created at /mnt/demodata/barcodes
barcode_name         library_name   barcode_sequence_1   barcode_sequence_2
20180326-BN573-S1    Mix1_Rep1      CTGATCGTNNNNNNNNN    GCGCATAT
20180326-BN573-S2    Mix1_Rep2      ACTCTCGANNNNNNNNN    CTGTACCA
20180326-BN573-S3    Mix1_Rep3      TGAGCTAGNNNNNNNNN    GAACGGTT
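The barcode file above could be generated from a sample list rather than typed by hand. The following is a minimal Python sketch, not part of the original workflow: the function names are hypothetical, and it assumes the 9-base UMI always follows the first 8-base sample index, as in the rows shown.

```python
import csv
import io

UMI_LENGTH = 9  # assumption: 9-base degenerate UMI, as described for these adapters


def barcode_rows(samples):
    """Yield barcode-file rows; UMI positions after index 1 are written as Ns."""
    for name, library, index1, index2 in samples:
        yield [name, library, index1 + "N" * UMI_LENGTH, index2]


def write_barcode_file(samples, handle):
    """Write a tab-delimited barcode file with the header used in Step D1."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["barcode_name", "library_name",
                     "barcode_sequence_1", "barcode_sequence_2"])
    writer.writerows(barcode_rows(samples))


# Sample names and index sequences copied from the example table.
samples = [
    ("20180326-BN573-S1", "Mix1_Rep1", "CTGATCGT", "GCGCATAT"),
    ("20180326-BN573-S2", "Mix1_Rep2", "ACTCTCGA", "CTGTACCA"),
    ("20180326-BN573-S3", "Mix1_Rep3", "TGAGCTAG", "GAACGGTT"),
]

buffer = io.StringIO()  # swap for open("barcode_file.txt", "w") to write to disk
write_barcode_file(samples, buffer)
print(buffer.getvalue(), end="")
```

Writing to an in-memory buffer keeps the sketch free of side effects; the real file path is whatever was chosen in Step D1.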
Steps 1 and 2 of 6 in demultiplexing

Step D3: Determine the read structure for running
ExtractIlluminaBarcodes.
For xGen Dual Index UMI Adapters—Tech Access with DNA insert of 100 bp,
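use the following read structure:
100T8B9M8B100T
T – template (insert)
B – sample barcode
M – molecular index (UMI)
Read from left to right: 100 template bases, an 8-base sample barcode, the 9-base UMI, a second 8-base sample barcode, and 100 template bases.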
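To make the read structure notation concrete, here is a small Python sketch that splits a structure string into its segments. It is illustrative only: the helper name is hypothetical, and it handles just the three segment types listed above (T, B, M).

```python
import re

# Segment types taken from the legend above.
SEGMENT_TYPES = {
    "T": "template (insert)",
    "B": "sample barcode",
    "M": "molecular index (UMI)",
}


def parse_read_structure(structure):
    """Split a read structure such as 100T8B9M8B100T into (length, type) pairs."""
    segments = re.findall(r"(\d+)([A-Z])", structure)
    # Reject strings that contain anything other than <number><letter> pairs.
    if "".join(length + code for length, code in segments) != structure:
        raise ValueError(f"malformed read structure: {structure!r}")
    for _, code in segments:
        if code not in SEGMENT_TYPES:
            raise ValueError(f"unsupported segment type: {code!r}")
    return [(int(length), code) for length, code in segments]


segments = parse_read_structure("100T8B9M8B100T")
for length, code in segments:
    print(f"{length:>3} cycles: {SEGMENT_TYPES[code]}")
print("total cycles:", sum(length for length, _ in segments))
```

A quick check like this can catch a mistyped segment length before a long demultiplexing run is started.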

Step D4: Run Picard ExtractIlluminaBarcodes to extract sample barcodes.
Input:
• BARCODE_FILE: barcode file created in Step D1
• BASECALLS_DIR: directory with the sequencing basecall files
• READ_STRUCTURE: 100T8B9M8B100T from Step D3
• LANE: lane number (ExtractIlluminaBarcodes processes one lane at a time)
Output:
1. A metrics file with the barcode extraction summary
2. Extracted barcodes in the output directory created in Step D2
java -Xmx4g -jar picard-2.9.0.jar ExtractIlluminaBarcodes \
    BARCODE_FILE=/mnt/demodata/barcode_file.txt \
    BASECALLS_DIR=/mnt/runs/BN573/Data/Intensities/BaseCalls \
    READ_STRUCTURE=100T8B9M8B100T \
    LANE=1 \
    OUTPUT_DIR=/mnt/demodata/barcodes \
    METRICS_FILE=/mnt/demodata/barcode_metrics.txt

Step D5: Create a tab-delimited file to specify the BAM file for each sample in
the sequencing run with the corresponding barcode sequence(s).
• In this example, the file is saved as /mnt/demodata/library_param.txt
• Be sure to create the output directory for the BAM files before running Step D6. In this example, the output directory is /mnt/bam/L001
OUTPUT                                SAMPLE_ALIAS        LIBRARY_NAME   BARCODE_1           BARCODE_2
/mnt/bam/L001/BN573-S1_unmapped.bam   20180326-BN573-S1   Mix1_Rep1      CTGATCGTNNNNNNNNN   GCGCATAT
/mnt/bam/L001/BN573-S2_unmapped.bam   20180326-BN573-S2   Mix1_Rep2      ACTCTCGANNNNNNNNN   CTGTACCA
/mnt/bam/L001/BN573-S3_unmapped.bam   20180326-BN573-S3   Mix1_Rep3      TGAGCTAGNNNNNNNNN   GAACGGTT
/mnt/bam/L001/Unmatched.bam           Unmatched           Unmatched      N

Step D6: Run IlluminaBasecallsToSam to convert sequencing
base calls to short reads in BAM files.
java -Xmx4g -jar picard-2.9.0.jar IlluminaBasecallsToSam \
    BASECALLS_DIR=/mnt/runs/BN573/Data/Intensities/BaseCalls \
    BARCODES_DIR=/mnt/demodata/barcodes \
    LANE=1 \
    READ_STRUCTURE=100T8B9M8B100T \
    RUN_BARCODE=180326_BN573 \
    LIBRARY_PARAMS=/mnt/demodata/library_param.txt \
    TMP_DIR=/mnt/tmp \
    MOLECULAR_INDEX_TAG=RX \
    ADAPTERS_TO_CHECK=INDEXED \
    READ_GROUP_ID=BN573-S1 \
    NUM_PROCESSORS=8
Parameter notes:
• BARCODES_DIR: output directory from Step D4
• LANE: lanes are processed one at a time
• READ_STRUCTURE: from Step D3
• RUN_BARCODE: prefixed to the read names in the output
• LIBRARY_PARAMS: file created in Step D5
• MOLECULAR_INDEX_TAG: BAM tag that stores the UMI sequence

BAM file created by IlluminaBasecallsToSam
• The reads in the BAM file generated by IlluminaBasecallsToSam are
not yet aligned to the reference genome.
• UMI sequence is in the RX tag.
• UMI sequence quality is in the QX tag.
• Sequencing adapter location is in the XT tag. Adapter sequence can
be trimmed using SamToFastq in Picard tools.
An example record from the BAM file:
180326_BN573:1:1101:10008:4281 77 * 0 0 * * 0 0
ACAACGCTCCACGGGAGACCCACCCATCCCTGCCAGGTGAGCCAGACAGTGGCCAAGGGTCTCTAGGTCGAGGCAG
CDDDDCCCDDFFGGGGGGGGGGGGGGHHHHHHHHHHHGHHHHHHHHGHHHHHGHHHHGGHHHHHHHHHHGEFGGGG
RG:Z:BN573-S1 XT:i:114 QX:Z:FFFFGGGG RX:Z:GGTAAAATG
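The optional fields of such a record follow the SAM TAG:TYPE:VALUE layout, so the UMI can be pulled out with a few lines of Python. This is a hedged sketch with a hypothetical helper name; it works on the text form of the tags copied from the example record and handles only the Z (string) and i (integer) types that appear there.

```python
def parse_sam_tags(fields):
    """Convert TAG:TYPE:VALUE strings into a dict, casting 'i' values to int."""
    tags = {}
    for field in fields:
        tag, type_code, value = field.split(":", 2)
        tags[tag] = int(value) if type_code == "i" else value
    return tags


# Optional fields copied from the example record.
optional_fields = "RG:Z:BN573-S1 XT:i:114 QX:Z:FFFFGGGG RX:Z:GGTAAAATG".split()
tags = parse_sam_tags(optional_fields)

print("UMI sequence (RX):", tags["RX"])
print("UMI qualities (QX):", tags["QX"])
print("Adapter location (XT):", tags["XT"])
```

For real BAM files, a dedicated library or samtools view would normally be used to read records; the sketch only shows how the tags are laid out.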

Calling consensus using UMIs

Workflow for consensus calling
Input: unmapped BAM with UMI tags (UMIs extracted from the sample index reads during demultiplexing)
• Step C1: Align reads to the reference genome. Output: mapped BAM without UMI tags
• Step C2: Include UMI tags from the unmapped BAM in the mapped BAM. Output: mapped BAM with UMI tags
• Step C3: Group reads by UMIs. Output: mapped BAM with UMI family tags
• Step C4: Call consensus. Output: unmapped BAM with consensus reads

Step C1,2: Aligning reads from unmapped BAM files to reference
genome, and including the UMI tags
The following command consists of three steps:
1. Convert BAM to FASTQ
2. Align reads using BWA-MEM
3. Include UMI tags from the unmapped BAM in the mapped BAM
Steps 1 and 2 of 4 in consensus calling
java -Xmx4g -jar picard-2.9.0.jar SamToFastq \
    I=BN573-S1_unmapped.bam \
    F=/dev/stdout INTERLEAVE=true \
  | bwa mem -p -t 8 hg38.fa /dev/stdin \
  | java -Xmx4g -jar picard-2.9.0.jar MergeBamAlignment \
    UNMAPPED=BN573-S1_unmapped.bam ALIGNED=/dev/stdin \
    O=BN573-S1_mapped.bam R=hg38.fa \
    SORT_ORDER=coordinate MAX_GAPS=-1 \
    ORIENTATIONS=FR

Step C3: Grouping reads by UMIs
Reads are grouped into families that share the same UMI. With --edits=1, UMIs that differ by at most one base can be assigned to the same family.
java -Xmx4g -jar fgbio.jar GroupReadsByUmi \
    --input=BN573-S1_mapped.bam --output=BN573-S1_grouped.bam \
    --strategy=adjacency --edits=1 --min-map-q=20 \
    --assign-tag=MI
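The idea of grouping UMIs while tolerating one mismatch can be illustrated with a toy Python sketch. This is a deliberately simplified single-linkage clustering on Hamming distance, not fgbio's adjacency implementation; it also assumes all reads passed in already share the same mapping position and that all UMIs have equal length. Function names are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))


def group_umis(umis, edits=1):
    """Map each distinct UMI to a family id (toy single-linkage clustering)."""
    counts = Counter(umis)
    # Visit the most abundant UMIs first so they seed the families.
    ordered = sorted(counts, key=lambda umi: (-counts[umi], umi))
    family = {}
    next_id = 0
    for seed in ordered:
        if seed in family:
            continue
        family[seed] = next_id
        stack = [seed]
        while stack:
            current = stack.pop()
            for other in ordered:
                if other not in family and hamming(current, other) <= edits:
                    family[other] = next_id
                    stack.append(other)
        next_id += 1
    return family


# Hypothetical UMIs: one sequencing error in the last base of the fourth read.
umis = ["GGTAAAATG"] * 3 + ["GGTAAAATC"] + ["ACGTACGTA"] * 2
families = group_umis(umis, edits=1)
for umi, family_id in sorted(families.items(), key=lambda item: item[1]):
    print(family_id, umi)
```

Here the UMI with a single mismatch joins the family of its abundant neighbor, which is the behavior --edits=1 is intended to allow.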

Step C4: Calling consensus
Consensus reads are generated with fgbio's CallMolecularConsensusReads.
java -Xmx4g -jar fgbio.jar CallMolecularConsensusReads \
    --input=BN573-S1_grouped.bam \
    --output=BN573-S1_ssConsensus_unmapped.bam \
    --min-reads=1 \
    --rejects=BN573-S1_ssConsensus_rejected.bam \
    --min-input-base-quality=30 \
    --read-group-id=BN573-S1
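To show what building a consensus within a UMI family means, here is a toy Python sketch based on a per-position majority vote. It is a simplification for illustration: fgbio uses a likelihood-based model rather than a plain vote, and the function name, the tie handling, and the example reads are all assumptions. The parameter names mirror the --min-reads and --min-input-base-quality options above.

```python
from collections import Counter


def call_consensus(reads, min_reads=1, min_input_base_quality=30):
    """Majority-vote consensus for one UMI family.

    reads: list of (sequence, qualities) pairs of equal length.
    Returns the consensus string, or None if the family is too small.
    """
    if len(reads) < min_reads:
        return None
    consensus = []
    for i in range(len(reads[0][0])):
        # Only bases at or above the quality threshold get a vote.
        votes = Counter(seq[i] for seq, quals in reads
                        if quals[i] >= min_input_base_quality)
        if not votes:
            consensus.append("N")
            continue
        (base, top), *rest = votes.most_common(2)
        tied = bool(rest) and rest[0][1] == top
        consensus.append("N" if tied else base)
    return "".join(consensus)


# Hypothetical family of three reads; the second read has an error at the last base.
family = [
    ("ACGT", [40, 40, 40, 40]),
    ("ACGA", [40, 40, 40, 40]),
    ("ACGT", [40, 40, 40, 40]),
]
print(call_consensus(family))
print(call_consensus(family, min_reads=4))
```

The isolated error is outvoted by the other two reads, which is the in silico error correction that consensus calling provides.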

Workflow for post-consensus calling analysis
Input: unmapped BAM with consensus reads
• Step P1: Align reads to the reference genome. Output: mapped BAM without UMI tags
• Step P2: Include UMI tags from the unmapped BAM in the mapped BAM. Output: mapped BAM with UMI tags
• Step P3: Filter consensus reads. Output: filtered consensus BAM
• Step P4: Variant calling. Output: VCF

Step P1,2: Aligning reads from unmapped BAM files to reference
genome and merging the UMI tags
The following command consists of three steps:
1. Converting BAM to FASTQ
2. Aligning reads using bwa mem
3. Including UMI tags from the unmapped BAM in the mapped BAM
Steps 1 and 2 of 4 in post-consensus calling analysis
java -Xmx4g -jar picard-2.9.0.jar SamToFastq \
    I=BN573-S1_ssConsensus_unmapped.bam \
    F=/dev/stdout INTERLEAVE=true \
  | bwa mem -p -t 8 hg38.fa /dev/stdin \
  | java -Xmx4g -jar picard-2.9.0.jar MergeBamAlignment \
    UNMAPPED=BN573-S1_ssConsensus_unmapped.bam \
    ALIGNED=/dev/stdin \
    O=BN573-S1_ssConsensus_mapped.bam R=hg38.fa \
    SORT_ORDER=coordinate MAX_GAPS=-1 \
    ORIENTATIONS=FR

Step P3: Filtering consensus reads
Consensus reads are filtered in two ways:
1. Masking or filtering individual bases in reads
2. Filtering whole reads (i.e., not writing them to the output BAM file)
java -Xmx4g -jar fgbio.jar FilterConsensusReads \
    --input=BN573-S1_ssConsensus_mapped.bam \
    --output=BN573-S1_ssConsensus_mapped_filtered.bam \
    --ref=hg38.fa \
    --min-reads=3 \
    --min-base-quality=50 \
    --max-no-call-fraction=0.05
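The two kinds of filtering can be sketched in Python as follows. This is an illustrative approximation, not fgbio's code: the function name is hypothetical, the thresholds simply reuse the values from the command above, and the exact order and semantics of the checks in fgbio may differ.

```python
def filter_consensus_read(bases, quals, depth,
                          min_reads=3, min_base_quality=50,
                          max_no_call_fraction=0.05):
    """Return the masked read, or None if the whole read is filtered out.

    bases: consensus sequence; quals: per-base consensus qualities;
    depth: number of raw reads that contributed to the consensus.
    """
    # Read-level filter: too few raw reads behind this consensus.
    if depth < min_reads:
        return None
    # Base-level filter: mask low-quality bases as no-calls (N).
    masked = "".join(base if qual >= min_base_quality else "N"
                     for base, qual in zip(bases, quals))
    # Read-level filter: too many no-calls after masking.
    if masked.count("N") / len(masked) > max_no_call_fraction:
        return None
    return masked


# Hypothetical 40-base consensus read with one low-quality base.
bases = "ACGT" * 10
quals = [60] * 40
quals[5] = 40

print(filter_consensus_read(bases, quals, depth=3))  # kept, one base masked
print(filter_consensus_read(bases, quals, depth=2))  # dropped: depth below min_reads
```

Masking keeps the rest of a good read usable, while the read-level checks remove consensus reads that have too little support.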

Step P4: Variant calling
• Variant calling can be performed with the variant caller of your choice
• The following example shows how to use VarDictJava to generate a VCF file
VarDictJava/bin/VarDict \
    -G hg38.fa \
    -N tumor \
    -f 0.01 \
    -b BN573-S1_ssConsensus_mapped_filtered.bam \
    -z -c 1 -S 2 -E 3 -g 4 -th 4 target_regions.bed \
  | VarDictJava/VarDict/teststrandbias.R \
  | VarDictJava/VarDict/var2vcf_valid.pl -N tumor -E -f 0.01 \
  | awk '{if ($1 ~ /^#/) print; else if ($4 != $5) print}' \
  > BN573-S1.ssConsensus.VarDict.vcf

Tumor model system for benchmarking
• 25 ng of a 1% mixture (0.5% minimum allelic frequency) was used to
assess sensitivity and positive predictive value (PPV)
• Libraries were captured with a set of custom xGen Lockdown Probes
covering a total target area of ~35 kb
• Variant calling was performed with VarDict

Consensus analysis increases variant calling accuracy
[Figure panels: "All expected variants" and "Positive predictive value (PPV)"; annotation: 0.2% variant calling threshold.]

Take-home messages
• Building consensus sequences enables in silico error correction,
dramatically increasing variant calling specificity
• Due to the prevalence of artifacts arising from sample degradation,
PCR amplification and sequencing, consensus analysis is necessary
to accurately detect variants present below 1%
• xGen Dual Index UMI Adapters mitigate index switching and can
accurately assign rare variants in multiplexing studies
www.idtdna.com/UMI-techaccess

Sensitivity and specificity (PPV)
TP: true positive
FP: false positive
FN: false negative
PPV: positive predictive value
Sensitivity = TP / (TP + FN)
PPV (used here as the measure of specificity) = TP / (TP + FP)
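The two formulas translate directly into code. The Python sketch below uses made-up counts purely for illustration; they are not results from the benchmarking experiment.

```python
def sensitivity(tp, fn):
    """Fraction of expected variants that were called: TP / (TP + FN)."""
    return tp / (tp + fn)


def ppv(tp, fp):
    """Fraction of calls that are real variants: TP / (TP + FP)."""
    return tp / (tp + fp)


# Hypothetical counts: 18 expected variants found, 2 missed, 6 false calls.
tp, fn, fp = 18, 2, 6
print(f"Sensitivity: {sensitivity(tp, fn):.2f}")
print(f"PPV:         {ppv(tp, fp):.2f}")
```

Removing false positives, as consensus calling aims to do, raises PPV without changing sensitivity as long as the true positives are retained.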
