Advances in next-generation sequencing (NGS) enable the detection of variants at exceptionally low frequencies. Accurate detection of low-frequency variants is challenging, due in part to errors introduced during sample preparation, target enrichment, and sequencing. After individual DNA library molecules are tagged with adapters containing unique molecular identifiers (UMIs), bioinformatic filters can be applied to identify and correct errors introduced during the sequencing workflow. In this presentation, we walk through the analytical workflows developed at IDT for processing data containing UMIs. We highlight methods to extract UMI information, correct errors, and build consensus among multiple observations of an original source molecule.
Best practices for data analysis when using UMI adapters to improve variant detection
1. Best practices for data analysis when using UMI adapters to improve variant detection
Wendy Lee, PhD
Staff Scientist
2. Outline
• Overview of NGS workflow that includes sample multiplexing
• Overview of workflow with xGen® Dual Index UMI Adapters—Tech Access
• Discussion of data analysis steps:
– Extracting UMIs from sequencing reads
– Constructing consensus reads within UMI families
• Improving variant calling accuracy using consensus reads
UMI: unique molecular identifier
3. NGS workflow with xGen Dual Index UMI Adapters
[Figure: NGS workflow diagram; labeled components include xGen Universal Blockers]
4. xGen Dual Index UMI Adapters—Tech Access
3-in-1 design
• Designed for Illumina sequencers
• Compatible with standard end-repair and A-tailing library
construction, including PCR-free library methods
• Dual unique sample indices reduce sample cross-talk
• Degenerate 9-base UMI is incorporated for error correction and/or
counting applications
10. Assumptions and requirements
• Sequencing data are generated from the Illumina platform
• The following tools are installed in a Linux environment:
– Picard, version 2.9.0
– Burrows-Wheeler Aligner (BWA), version 0.7.15-r1140
– Fgbio, version 0.5.0
– VarDict Java
• Access to the raw basecall data output from the sequencer
12. Overall workflow
Input: Illumina basecalls and sample sheet
Steps D1–6: Convert basecalls to short reads with UMI information while demultiplexing the NGS run (output: short-read files with UMI information)
Steps C1–4: Call consensus reads using UMIs
Steps P1–4: Post-consensus calling analysis (output: variant calls)
13. Extract UMIs from sample index reads through the Illumina demultiplexing workflow
Input: sample sheet
Step D1: Create the sample barcode input file (output: barcode_file.txt)
Step D2: Create an output directory for storing the barcodes extracted from the sample index reads
Step D3: Determine the read structure (100T8B9M8B100T)
Step D4: Run ExtractIlluminaBarcodes (Picard) (output: extracted barcode files)
Step D5: Create an input file that specifies the output BAM file associated with each sample (output: library_param.txt)
Step D6: Run IlluminaBasecallsToSam (Picard) (output: unmapped BAM files)
17. Steps D1,2: Create a barcode file containing the sample barcode information for each sample
• UMI bases are written as Ns in the barcode sequence
• The barcode file is tab-delimited
• In this example, we saved the file as /mnt/demodata/barcode_file.txt
• In this example, we created the output directory /mnt/demodata/barcodes (Step D2)
barcode_name        library_name   barcode_sequence_1   barcode_sequence_2
20180326-BN573-S1   Mix1_Rep1      CTGATCGTNNNNNNNNN    GCGCATAT
20180326-BN573-S2   Mix1_Rep2      ACTCTCGANNNNNNNNN    CTGTACCA
20180326-BN573-S3   Mix1_Rep3      TGAGCTAGNNNNNNNNN    GAACGGTT
Steps 1 and 2 of 6 in demultiplexing
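The barcode file described above can also be generated programmatically. The following is a minimal Python sketch, assuming the sample names, library names, and index sequences from this example; the names SAMPLES, UMI_LENGTH, and barcode_file_text are hypothetical and chosen only for illustration.

```python
import csv
import io

# Sample rows taken from the example on this slide:
# (barcode_name, library_name, index 1 sample barcode, index 2 sample barcode)
SAMPLES = [
    ("20180326-BN573-S1", "Mix1_Rep1", "CTGATCGT", "GCGCATAT"),
    ("20180326-BN573-S2", "Mix1_Rep2", "ACTCTCGA", "CTGTACCA"),
    ("20180326-BN573-S3", "Mix1_Rep3", "TGAGCTAG", "GAACGGTT"),
]

# The adapters carry a degenerate 9-base UMI, written as Ns after index 1.
UMI_LENGTH = 9


def barcode_file_text(samples, umi_length=UMI_LENGTH):
    """Return the tab-delimited barcode file contents as a string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(
        ["barcode_name", "library_name", "barcode_sequence_1", "barcode_sequence_2"]
    )
    for name, library, index1, index2 in samples:
        # Append one N per UMI base to the first barcode sequence.
        writer.writerow([name, library, index1 + "N" * umi_length, index2])
    return buffer.getvalue()


if __name__ == "__main__":
    print(barcode_file_text(SAMPLES), end="")
```

Writing the returned text to /mnt/demodata/barcode_file.txt would reproduce the file used in this example.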
19. Step D3: Determine the read structure for running ExtractIlluminaBarcodes
Step 3 of 6 in demultiplexing
For xGen Dual Index UMI Adapters—Tech Access with a DNA insert of 100 bp, use the following read structure:
100T8B9M8B100T
T – template (insert)
B – sample barcode
M – molecular index (UMI)
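To make the notation concrete, here is a small Python sketch that splits a read structure into its segments and sums the sequencing cycles. It handles only the T, B, and M operators listed on this slide, and the function names are hypothetical.

```python
import re

# Operators described on this slide; any other operator is rejected here.
SEGMENT_NAMES = {
    "T": "template (insert)",
    "B": "sample barcode",
    "M": "molecular index (UMI)",
}


def parse_read_structure(read_structure):
    """Split a string such as '100T8B9M8B100T' into (length, operator) pairs."""
    segments = re.findall(r"(\d+)([TBM])", read_structure)
    # Re-join the matches to make sure no characters were silently skipped.
    rebuilt = "".join(length + operator for length, operator in segments)
    if not segments or rebuilt != read_structure:
        raise ValueError(f"Unrecognized read structure: {read_structure!r}")
    return [(int(length), operator) for length, operator in segments]


def total_cycles(read_structure):
    """Return the number of sequencing cycles the read structure accounts for."""
    return sum(length for length, _ in parse_read_structure(read_structure))


if __name__ == "__main__":
    for length, operator in parse_read_structure("100T8B9M8B100T"):
        print(f"{length:>3} cycles: {SEGMENT_NAMES[operator]}")
    print("total cycles:", total_cycles("100T8B9M8B100T"))
```

For the read structure on this slide, the segments sum to 225 cycles (100 + 8 + 9 + 8 + 100).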
21. Step D4: Run Picard ExtractIlluminaBarcodes to extract sample barcodes
Step 4 of 6 in demultiplexing
Input:
BARCODE_FILE: barcode file created in Step D1
BASECALLS_DIR: directory with the sequencing basecall files
READ_STRUCTURE: 100T8B9M8B100T from Step D3
LANE: lane to process (ExtractIlluminaBarcodes processes one lane at a time)
Output:
1. A metrics file with the barcode extraction summary
2. Extracted barcodes in the output directory created in Step D2
java -Xmx4g -jar picard-2.9.0.jar ExtractIlluminaBarcodes \
  BARCODE_FILE=/mnt/demodata/barcode_file.txt \
  BASECALLS_DIR=/mnt/runs/BN573/Data/Intensities/BaseCalls \
  READ_STRUCTURE=100T8B9M8B100T \
  LANE=1 \
  OUTPUT_DIR=/mnt/demodata/barcodes \
  METRICS_FILE=/mnt/demodata/barcode_metrics.txt
23. Step D5: Create a tab-delimited file to specify the BAM file for each sample in the sequencing run with the corresponding barcode sequence(s)
Step 5 of 6 in demultiplexing
• In this example, we saved this file as /mnt/demodata/library_param.txt
• Be sure to create the output directory for the BAM files. In this example, the output directory is /mnt/bam/L001
OUTPUT                                SAMPLE_ALIAS        LIBRARY_NAME   BARCODE_1           BARCODE_2
/mnt/bam/L001/BN573-S1_unmapped.bam   20180326-BN573-S1   Mix1_Rep1      CTGATCGTNNNNNNNNN   GCGCATAT
/mnt/bam/L001/BN573-S2_unmapped.bam   20180326-BN573-S2   Mix1_Rep2      ACTCTCGANNNNNNNNN   CTGTACCA
/mnt/bam/L001/BN573-S3_unmapped.bam   20180326-BN573-S3   Mix1_Rep3      TGAGCTAGNNNNNNNNN   GAACGGTT
/mnt/bam/L001/Unmatched.bam           Unmatched           Unmatched      N                   N
25. Step D6: Run IlluminaBasecallsToSam to convert sequencing basecalls to short reads in BAM files
Step 6 of 6 in demultiplexing
Notes on the arguments:
BARCODES_DIR: output directory from Step D4
LANE: basecalls are processed one lane at a time
READ_STRUCTURE: read structure from Step D3
RUN_BARCODE: prefixed to the read names in the output
LIBRARY_PARAMS: file created in Step D5
MOLECULAR_INDEX_TAG: BAM tag that stores the UMI sequence
java -Xmx4g -jar picard-2.9.0.jar IlluminaBasecallsToSam \
  BASECALLS_DIR=/mnt/runs/BN573/Data/Intensities/BaseCalls \
  BARCODES_DIR=/mnt/demodata/barcodes \
  LANE=1 \
  READ_STRUCTURE=100T8B9M8B100T \
  RUN_BARCODE=180326_BN573 \
  LIBRARY_PARAMS=/mnt/demodata/library_param.txt \
  TMP_DIR=/mnt/tmp \
  MOLECULAR_INDEX_TAG=RX \
  ADAPTERS_TO_CHECK=INDEXED \
  READ_GROUP_ID=BN573-S1 \
  NUM_PROCESSORS=8
26. BAM file created by IlluminaBasecallsToSam
• The reads in the BAM file generated by IlluminaBasecallsToSam are
not yet aligned to the reference genome.
• UMI sequence is in the RX tag.
• UMI sequence quality is in the QX tag.
• Sequencing adapter location is in the XT tag. Adapter sequence can
be trimmed using SamToFastq in Picard tools.
An example record from the BAM file:
180326_BN573:1:1101:10008:4281 77 * 0 0 * * 0 0
ACAACGCTCCACGGGAGACCCACCCATCCCTGCCAGGTGAGCCAGACAGTGGCCAAGGGTCTCTAGGTCGAGGCAG
CDDDDCCCDDFFGGGGGGGGGGGGGGHHHHHHHHHHHGHHHHHHHHGHHHHHGHHHHGGHHHHHHHHHHGEFGGGG
RG:Z:BN573-S1 XT:i:114 QX:Z:FFFFGGGG RX:Z:GGTAAAATG
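The tags listed on this slide can be read directly from the SAM text form of a record. The sketch below is a minimal, standard-library Python illustration that assumes a tab-delimited SAM line (for example, as printed by a BAM viewer); the function names are hypothetical, and a dedicated BAM library would normally be used instead.

```python
def parse_sam_tags(sam_line):
    """Return the optional TAG:TYPE:VALUE fields of a SAM line as a dict."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    # The first 11 columns are mandatory SAM fields; tags follow them.
    for field in fields[11:]:
        tag, value_type, value = field.split(":", 2)
        tags[tag] = int(value) if value_type == "i" else value
    return tags


def is_unmapped(flag):
    """True when the SAM flag has the 'read unmapped' bit (0x4) set."""
    return bool(flag & 0x4)


# The example record from this slide, rebuilt with tab separators.
EXAMPLE_RECORD = "\t".join([
    "180326_BN573:1:1101:10008:4281", "77", "*", "0", "0", "*", "*", "0", "0",
    "ACAACGCTCCACGGGAGACCCACCCATCCCTGCCAGGTGAGCCAGACAGTGGCCAAGGGTCTCTAGGTCGAGGCAG",
    "CDDDDCCCDDFFGGGGGGGGGGGGGGHHHHHHHHHHHGHHHHHHHHGHHHHHGHHHHGGHHHHHHHHHHGEFGGGG",
    "RG:Z:BN573-S1", "XT:i:114", "QX:Z:FFFFGGGG", "RX:Z:GGTAAAATG",
])

if __name__ == "__main__":
    tags = parse_sam_tags(EXAMPLE_RECORD)
    print("UMI (RX):", tags["RX"])
    print("UMI qualities (QX):", tags["QX"])
    print("Adapter start (XT):", tags["XT"])
    print("Unmapped:", is_unmapped(int(EXAMPLE_RECORD.split("\t")[1])))
```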
28. Workflow for consensus calling
Input: unmapped BAM with UMI tags (UMIs extracted from the sample index reads during demultiplexing)
Step C1: Align reads to the reference genome (output: mapped BAM without UMI tags)
Step C2: Include UMI tags from the unmapped BAM in the mapped BAM (output: mapped BAM with UMI tags)
Step C3: Group reads by UMIs (output: mapped BAM with UMI family tags)
Step C4: Call consensus (output: unmapped BAM with consensus reads)
30. Steps C1,2: Aligning reads from unmapped BAM files to the reference genome, and including the UMI tags
Steps 1 and 2 of 4 in consensus calling
The following command consists of three steps:
1. Convert BAM to FASTQ
2. Align reads using BWA-MEM
3. Include UMI tags from the unmapped BAM in the mapped BAM
java -Xmx4g -jar picard-2.9.0.jar SamToFastq \
  I=BN573-S1_unmapped.bam \
  F=/dev/stdout INTERLEAVE=true \
  | bwa mem -p -t 8 hg38.fa /dev/stdin \
  | java -Xmx4g -jar picard-2.9.0.jar MergeBamAlignment \
  UNMAPPED=BN573-S1_unmapped.bam ALIGNED=/dev/stdin \
  O=BN573-S1_mapped.bam R=hg38.fa \
  SORT_ORDER=coordinate MAX_GAPS=-1 \
  ORIENTATIONS=FR
32. Step C3: Grouping reads by UMIs
Step 3 of 4 in consensus calling
The reads are grouped into families that share the same UMI.
java -Xmx4g -jar fgbio.jar GroupReadsByUmi \
  --input=BN573-S1_mapped.bam --output=BN573-S1_grouped.bam \
  --strategy=adjacency --edits=1 --min-map-q=20 \
  --assign-tag=MI
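The idea of grouping UMIs while tolerating one sequencing error can be illustrated with a short Python sketch. This is a simplified clustering for illustration only, not fgbio's adjacency implementation: it assumes all reads already share the same mapping position, measures differences as mismatches between equal-length UMIs, and uses hypothetical function names.

```python
from collections import Counter


def mismatches(a, b):
    """Count positions at which two equal-length UMIs differ."""
    if len(a) != len(b):
        raise ValueError("UMIs must have the same length")
    return sum(x != y for x, y in zip(a, b))


def group_umis(umis, max_edits=1):
    """Map each distinct UMI to a family ID.

    UMIs are visited from most to least abundant; each unassigned UMI seeds
    a family, and any unassigned UMI within max_edits mismatches of a family
    member is pulled into that family.
    """
    counts = Counter(umis)
    # Sort by descending count, then alphabetically so results are repeatable.
    ordered = sorted(counts, key=lambda umi: (-counts[umi], umi))
    family_of = {}
    next_family = 0
    for seed in ordered:
        if seed in family_of:
            continue
        family_of[seed] = next_family
        pending = [seed]
        while pending:
            current = pending.pop()
            for other in ordered:
                if other not in family_of and mismatches(current, other) <= max_edits:
                    family_of[other] = next_family
                    pending.append(other)
        next_family += 1
    return family_of


if __name__ == "__main__":
    observed = ["GGTAAAATG"] * 5 + ["GGTAAAATC"] + ["ACGTACGTA"] * 3
    for umi, family in sorted(group_umis(observed).items()):
        print(umi, "-> family", family)
```

In this toy input, GGTAAAATC differs from GGTAAAATG at one position and joins its family, while ACGTACGTA forms a separate family.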
34. Step C4: Calling consensus
Step 4 of 4 in consensus calling
Consensus reads are generated using fgbio's CallMolecularConsensusReads.
java -Xmx4g -jar fgbio.jar CallMolecularConsensusReads \
  --input=BN573-S1_grouped.bam \
  --output=BN573-S1_ssConsensus_unmapped.bam \
  --min-reads=1 \
  --rejects=BN573-S1_ssConsensus_rejected.bam \
  --min-input-base-quality=30 \
  --read-group-id=BN573-S1
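As a rough illustration of what building a consensus within a UMI family means, the Python sketch below takes a per-position majority vote over the reads in one family. It is a deliberately simplified stand-in, not fgbio's consensus model: the function name is hypothetical, reads are assumed to be the same length and already aligned to each other, and the parameters only loosely mirror --min-reads and --min-input-base-quality.

```python
from collections import Counter


def call_consensus(reads, min_base_quality=30, min_reads=1):
    """Return a majority-vote consensus for one UMI family, or None.

    reads is a list of (sequence, qualities) pairs, where qualities is a
    list of integer Phred scores with one entry per base.
    """
    if not reads or len(reads) < min_reads:
        return None
    length = len(reads[0][0])
    if any(len(seq) != length or len(quals) != length for seq, quals in reads):
        raise ValueError("all reads and quality lists must have the same length")
    consensus = []
    for position in range(length):
        # Only bases that meet the input quality threshold get a vote.
        votes = Counter(
            seq[position]
            for seq, quals in reads
            if quals[position] >= min_base_quality
        )
        ranked = votes.most_common(2)
        if not ranked:
            consensus.append("N")  # no usable observations at this position
        elif len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            consensus.append("N")  # tie between the two most common bases
        else:
            consensus.append(ranked[0][0])
    return "".join(consensus)


if __name__ == "__main__":
    family = [
        ("ACGT", [40, 40, 40, 40]),
        ("ACGT", [40, 40, 40, 40]),
        ("ACTT", [40, 40, 40, 40]),  # one read carries an error at position 3
    ]
    print(call_consensus(family))
```

The single discordant base in the third read is outvoted, which is the error-correction effect that consensus calling relies on.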
35. Workflow for post consensus-calling analysis
Input: unmapped BAM with single-strand consensus reads
Step P1: Align reads to the reference genome (output: mapped BAM without UMI tags)
Step P2: Include UMI tags from the unmapped BAM in the mapped BAM (output: mapped BAM with UMI tags)
Step P3: Filter consensus reads (output: filtered consensus BAM)
Step P4: Variant calling (output: VCF)
37. Steps P1,2: Aligning reads from unmapped BAM files to the reference genome and merging the UMI tags
Steps 1 and 2 of 4 in post-consensus calling analysis
The following command consists of three steps:
1. Convert BAM to FASTQ
2. Align reads using bwa mem
3. Include UMI tags from the unmapped BAM in the mapped BAM
java -Xmx4g -jar picard-2.9.0.jar SamToFastq \
  I=BN573-S1_ssConsensus_unmapped.bam \
  F=/dev/stdout INTERLEAVE=true \
  | bwa mem -p -t 8 hg38.fa /dev/stdin \
  | java -Xmx4g -jar picard-2.9.0.jar MergeBamAlignment \
  UNMAPPED=BN573-S1_ssConsensus_unmapped.bam \
  ALIGNED=/dev/stdin \
  O=BN573-S1_ssConsensus_mapped.bam R=hg38.fa \
  SORT_ORDER=coordinate MAX_GAPS=-1 \
  ORIENTATIONS=FR
39. Step P3: Filtering consensus reads
Step 3 of 4 in post-consensus calling analysis
There are two kinds of filtering of consensus reads:
1. Masking or filtering individual bases in reads
2. Filtering reads (i.e., not writing them to the output BAM file)
java -Xmx4g -jar fgbio.jar FilterConsensusReads \
  --input=BN573-S1_ssConsensus_mapped.bam \
  --output=BN573-S1_ssConsensus_mapped_filtered.bam \
  --ref=hg38.fa \
  --min-reads=3 \
  --min-base-quality=50 \
  --max-no-call-fraction=0.05
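The two kinds of filtering can be sketched in a few lines of Python. This is an illustrative simplification rather than fgbio's logic: the function name is hypothetical, a single supporting-read count is assumed per consensus read, and the defaults simply echo the values in the command above.

```python
def filter_consensus_read(
    bases,
    quals,
    supporting_reads,
    min_reads=3,
    min_base_quality=50,
    max_no_call_fraction=0.05,
):
    """Mask low-quality bases, then decide whether to keep the read.

    Returns the masked sequence, or None when the read should be dropped.
    """
    if len(bases) != len(quals):
        raise ValueError("bases and quals must have the same length")
    # Read-level filter: too few raw reads contributed to this consensus.
    if not bases or supporting_reads < min_reads:
        return None
    # Base-level filter: replace low-quality consensus bases with N.
    masked = "".join(
        base if qual >= min_base_quality else "N"
        for base, qual in zip(bases, quals)
    )
    # Read-level filter: too many no-calls after masking.
    if masked.count("N") / len(masked) > max_no_call_fraction:
        return None
    return masked


if __name__ == "__main__":
    bases = "ACGT" * 10
    quals = [60] * 40
    quals[5] = 20  # one low-quality consensus base
    print(filter_consensus_read(bases, quals, supporting_reads=4))
```

Here the single low-quality base is masked to N (a no-call fraction of 0.025), so the read is kept; three such bases in a 40-base read would exceed the 0.05 limit and the read would be dropped.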
41. Step P4: Variant calling
Step 4 of 4 in post-consensus calling analysis
• Variant calling can be accomplished with the variant caller of your choice
• The following example shows how to use VarDictJava to generate a VCF file
VarDictJava/bin/VarDict \
  -G hg38.fa \
  -N tumor \
  -f 0.01 \
  -b BN573-S1_ssConsensus_mapped_filtered.bam \
  -z -c 1 -S 2 -E 3 -g 4 -th 4 target_regions.bed \
  | VarDictJava/VarDict/teststrandbias.R \
  | VarDictJava/VarDict/var2vcf_valid.pl -N tumor -E -f 0.01 \
  | awk '{if ($1 ~ /^#/) print; else if ($4 != $5) print}' \
  > BN573-S1.ssConsensus.VarDict.vcf
42. Tumor model system for benchmarking
• 25 ng of a 1% mixture (0.5% minimum allelic frequency) was used to
assess sensitivity and positive predictive value (PPV)
• Libraries were captured with a set of custom xGen Lockdown Probes
covering a total target area of ~35 kb
• Variant calling was performed with VarDict
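For a sense of scale, the input mass can be converted to an approximate number of genome copies. This is a back-of-the-envelope Python sketch that assumes human DNA at roughly 3.3 pg per haploid genome copy (an assumption not stated in the presentation) and ignores any losses during library preparation and capture; the function names are hypothetical.

```python
# Assumption: approximate mass of one haploid human genome copy, in picograms.
HAPLOID_GENOME_MASS_PG = 3.3


def haploid_copies(input_ng, genome_mass_pg=HAPLOID_GENOME_MASS_PG):
    """Approximate number of haploid genome copies in input_ng of DNA."""
    return input_ng * 1000.0 / genome_mass_pg  # 1 ng = 1000 pg


def expected_variant_copies(input_ng, allele_frequency):
    """Approximate number of input copies that carry the variant allele."""
    return haploid_copies(input_ng) * allele_frequency


if __name__ == "__main__":
    print(f"haploid copies in 25 ng: {haploid_copies(25):.0f}")
    print(f"variant copies at 0.5%: {expected_variant_copies(25, 0.005):.0f}")
```

Under these assumptions, 25 ng corresponds to roughly 7,600 haploid genome copies, so a variant at 0.5% allelic frequency is represented by only about 38 input molecules.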
43. Consensus analysis increases variant calling accuracy
[Figure: panels for detection of all expected variants and positive predictive value (PPV); 0.2% variant calling threshold]
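Sensitivity and PPV are computed from the overlap between the expected variants and the called variants. A minimal Python sketch follows; the function name is hypothetical, and the variant tuples are made-up placeholders, not results from this study.

```python
def sensitivity_and_ppv(expected, called):
    """Return (sensitivity, PPV) for two collections of variants.

    sensitivity = true positives / expected variants
    PPV         = true positives / called variants
    """
    expected, called = set(expected), set(called)
    true_positives = len(expected & called)
    sensitivity = true_positives / len(expected) if expected else float("nan")
    ppv = true_positives / len(called) if called else float("nan")
    return sensitivity, ppv


if __name__ == "__main__":
    # Hypothetical variants written as (chromosome, position, ref, alt).
    expected = {
        ("chr1", 100, "A", "T"),
        ("chr1", 200, "G", "C"),
        ("chr2", 300, "T", "G"),
        ("chr3", 400, "C", "A"),
    }
    called = {
        ("chr1", 100, "A", "T"),
        ("chr1", 200, "G", "C"),
        ("chr5", 999, "A", "G"),  # a false positive
    }
    sens, ppv = sensitivity_and_ppv(expected, called)
    print(f"sensitivity={sens:.2f} PPV={ppv:.2f}")
```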
45. Take-home messages
• Building consensus sequences enables in silico error correction,
dramatically increasing variant calling specificity
• Due to the prevalence of artifacts arising from sample degradation,
PCR amplification and sequencing, consensus analysis is necessary
to accurately detect variants present below 1%
• xGen Dual Index UMI Adapters mitigate index switching and can
accurately assign rare variants in multiplexing studies
www.idtdna.com/UMI-techaccess