This document describes genotype calling strategies for large population studies: single sample calling, batch calling, and joint (population) calling. It presents a case study on the 3,000 rice genomes dataset that compares the Genalice population calling module with a Genalice stepwise approach on a 131-accession subset, and with published BWA/GATK calls. Population calling was more than 30 times faster than the stepwise approach, and analysis time scaled linearly with the number of samples. It reported fewer variants than BWA/GATK (8,137,366 vs 14,838,819 for 131 samples), and 83.5% of the shared positions had identical genotypes. The document concludes that population calling is efficient for large datasets, but that the differences from other methods need further analysis.
Novel mutation detection in large sample pools using Population Calling
1. Population Calling: a powerful tool for novel mutation detection in larger sample pools
January 12, 2016
Antoine Janssen
antoine.janssen@keygene.com
June 2015 / Antoine Janssen
The crop innovation company
2. The crop innovation company
Overview
• Genotype calling strategies
• Infrastructure setup
• Show case 3,000 rice genomes
• Genalice stepwise vs Population calling
• BWA/GATK joint genotyping vs Population calling
• Conclusions
3. The crop innovation company
Overview
Genotype calling strategies
• Single sample calling
• Batch calling
• Joint calling
Pros
• Distinction between homozygous reference and missing data
• Greater sensitivity for low-frequency variants
• Greater ability to filter out false positives
Cons
• Scaling and infrastructure
• Incremental analysis
4. The crop innovation company
Single sample calling
Genotype calling strategies
5. The crop innovation company
The classic route: BWA/GATK
Stages: preprocessing, read mapping, genotyping, postprocessing
• python script
• sickle
• BWA
• samtools
• Picard MarkDuplicates
• GATK IndelRealigner
• GATK UnifiedGenotyper
• several python scripts
Genotype calling strategies
6. The crop innovation company
Genotype calling strategies
Stepwise approach (1)
• Map per sample: mapping of all samples against the reference
• Genotype per sample: genotyping of all samples
• Store variant positions: create a BED file of all variation over all samples
7. The crop innovation company
Genotype calling strategies
Stepwise approach (2)
• Genotype per sample: genotyping of all samples with a BED file, forcing calls on all positions
• Adjust VCF per sample: adjust the VCF files so that every sample has the same number of positions, forcing unknown (./.)
• Merge VCFs & Post process: merge the VCFs and post-process all samples
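The bookkeeping behind the stepwise route can be sketched as taking the union of variant positions over all samples and then giving every sample an entry at each of those positions, with unknown (./.) where it has no call. The Python below is a minimal sketch under stated assumptions: per-sample calls are already parsed into dictionaries keyed by (chromosome, position), the forced re-genotyping step is left out, and the function names and toy data are illustrative and not part of any tool named in the slides.

```python
from typing import Dict, List, Tuple

Position = Tuple[str, int]   # (chromosome, 1-based position)
Calls = Dict[Position, str]  # position -> genotype string, e.g. "0/1"


def union_positions(samples: Dict[str, Calls]) -> List[Position]:
    """Collect every variant position seen in any sample (the 'store variant positions' step)."""
    seen = set()
    for calls in samples.values():
        seen.update(calls)
    return sorted(seen)


def pad_sample(calls: Calls, positions: List[Position]) -> List[str]:
    """Give one sample a genotype at every position, using './.' when it has no call."""
    return [calls.get(pos, "./.") for pos in positions]


def merge_samples(samples: Dict[str, Calls]):
    """Return the shared position list and one equally long genotype row per sample."""
    positions = union_positions(samples)
    matrix = {name: pad_sample(calls, positions) for name, calls in samples.items()}
    return positions, matrix


# Hypothetical toy input: two accessions with partly different variant positions.
toy = {
    "acc1": {("chr1", 100): "0/1", ("chr1", 250): "1/1"},
    "acc2": {("chr1", 250): "0/1", ("chr2", 40): "1/1"},
}
positions, matrix = merge_samples(toy)
for name, row in matrix.items():
    print(name, row)
```

Every row has the same length as the position list, which is the property that lets the per-sample results be merged column by column.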
8. The crop innovation company
Genalice Population Calling
Genotype calling strategies
• Map per sample: map all samples against the reference
• Genotype per sample: genotype all samples
• Pop call & post process: population calling and post-processing of all samples
9. The crop innovation company
Genalice Population Calling: Process
Genotype calling strategies
[Diagram: fastq files -> gaMap -> GAR files -> gaPopulation add -> gaPopulation merge -> gaPopulation commit -> GVM -> gaPopulation extract -> VCF. The diagram also carries an XML label.]
10. The crop innovation company
Infrastructure setup
• Data: 3,000 rice accessions, ~17 TB, fastq.gz format
• Sequencing platform: Illumina HiSeq2500
• Data access: streaming fastq.gz over NFS
• Compute: Intel Xeon 2620 (12 cores, 2 GHz), 96 GB RAM
• Desktop client (VM), SSH connection
11. The crop innovation company
3,000 rice genome project
Show case
• Rice (Oryza sativa L.)
• 3,000 rice accessions
• 89 countries
• Avg seq depth of 14X
• Mapped to Nipponbare (IRGSP-1.0)
o 374 Mbp
o 12 chromosomes
• BWA and GATK (DNAnexus)
• Data on
o GigaScience
o EBI
o Amazon
12. The crop innovation company
Test set 1: 131 rice accessions
Show case
• Subset of 131 accessions (out of 3,000) selected
• All major rice types are represented
• Mapped to Nipponbare (IRGSP-1.0)
Type                  # accessions
Aus/boro                         5
Basmati/sadri                    1
Indica                          19
Intermediate type                7
Japonica                        11
Temperate japonica              71
Tropical japonica               17
Grand Total                    131
13. The crop innovation company
Genalice ‘stepwise’ vs. Population calling
Show case
Stepwise (total time: 136:30h; 8,227,780 variants)
• Steps: Map per sample, Genotype per sample, Store variant positions, Genotype per sample, Adjust VCF per sample, Merge VCFs & Post process
• Step times on the slide: 9h, 1h, 0.5h, 3h, 7h, 116h
Pop call (total time: 4:05h; 8,137,366 variants)
• Map per sample: 3h
• Genotype per sample: 1h
• Pop call & post process: 5m
Per-sample time
• Map: 1:22m / sample
• Total: 1:43m / sample
Overlap
• 6,147,072 positions shared (75% / 76%)
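The headline figures on this slide can be rechecked from the reported totals. The Python below is a small worked calculation, not part of the original analysis; the variable names are illustrative, and the only inputs are the totals, variant counts, and shared-position count stated on the slide. The speed-up it gives is about 33, close to the rounded factor quoted in the deck, and the two overlap fractions round to the 75% and 76% shown.

```python
def to_hours(h: int, m: int = 0) -> float:
    """Convert an h:mm duration into decimal hours."""
    return h + m / 60.0


# Totals as reported on the slide.
stepwise_hours = to_hours(136, 30)
popcall_hours = to_hours(4, 5)
stepwise_variants = 8_227_780
popcall_variants = 8_137_366
shared_positions = 6_147_072

# Speed-up of population calling over the stepwise approach.
speedup = stepwise_hours / popcall_hours

# Fraction of each call set that is also found in the other one.
shared_of_stepwise = shared_positions / stepwise_variants
shared_of_popcall = shared_positions / popcall_variants

print(f"speed-up: {speedup:.1f}x")
print(f"shared / stepwise: {shared_of_stepwise:.1%}")
print(f"shared / pop call: {shared_of_popcall:.1%}")
```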
14. The crop innovation company
Genalice ‘stepwise’ vs. Population calling
Show case
• The Genalice population calling route is straightforward and does not require external tools
• Major performance increase from the stepwise to the population calling approach (factor 34)
• The two approaches overlap on ~75% of positions
• Further qualitative research required
15. The crop innovation company
BWA/GATK vs. Genalice Population calling
Show case
Analysis of 3,000 rice accessions
BWA/GATK
• Calls taken from https://aws.amazon.com/public-data-sets/3000-rice-genome
• No details on compute available
Pop call
• Steps: Map per sample, Genotype per sample, Pop call & post process
• gaMap: 75h
• gaPopulation add: 25h
• gaPopulation merge: 1.5h
• gaPopulation commit: 0.5h
• gaPopulation extract: 1h
• Map: 1:30m / sample
• Total: 2:04m / sample
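The per-sample times follow directly from the stage durations, and comparing them with the 131-accession run gives a rough check on how the mapping time scales. The Python below is a sketch of that arithmetic only; the helper name is illustrative, and the inputs are the stage hours reported for 3,000 accessions plus the 3h mapping time reported for the 131-accession subset.

```python
# Stage durations in hours for the 3,000-accession run, as reported on the slide.
stage_hours = {
    "gaMap": 75.0,
    "gaPopulation add": 25.0,
    "gaPopulation merge": 1.5,
    "gaPopulation commit": 0.5,
    "gaPopulation extract": 1.0,
}


def per_sample(hours: float, n_samples: int):
    """Return (minutes, seconds) per sample, rounded to the nearest second."""
    total_seconds = round(hours * 3600 / n_samples)
    return divmod(total_seconds, 60)


map_3000 = per_sample(stage_hours["gaMap"], 3000)
total_3000 = per_sample(sum(stage_hours.values()), 3000)

# Mapping time for the 131-accession subset (3h) for comparison.
map_131 = per_sample(3.0, 131)

print("map, 3000 samples:  %d:%02d per sample" % map_3000)
print("total, 3000 samples: %d:%02d per sample" % total_3000)
print("map, 131 samples:   %d:%02d per sample" % map_131)
```

The mapping time per sample is nearly the same at 131 and at 3,000 accessions, which is what a roughly linear relation between analysis time and sample count would look like.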
16. The crop innovation company
BWA/GATK vs. Genalice Population calling (2)
Show case
• The format of the VCF files on Amazon differed from that of Genalice MAP
o --output_mode EMIT_ALL_SITES
o One gVCF file per accession (~2 GB each)
o Calls for all positions of the reference
o Multiple calls for the same genomic position
• The datasets were cleaned and merged
• Variant count for 131 samples: 14,838,819 (BWA/GATK) vs 8,137,366 (Genalice population calling)
• Minimum allele depth: 2 vs 5
• 6,031,002 positions shared (74% of the population calling set) -> 26% novel
• 83.5% of shared positions have identical genotypes
• Further analysis required
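A comparison of this kind can be sketched in two steps: reduce each call set to one call per genomic position, then count the shared positions and how many of them carry the same genotype. The slides do not say how the cleaning or the comparison was actually done, so everything in the Python below is an assumption: the keep-the-first-call policy, the treatment of 0/1 and 1/0 as equal, the single-sample scope, the function names, and the toy records.

```python
from typing import Dict, Iterable, Tuple

Key = Tuple[str, int]  # (chromosome, position)


def first_call_per_position(records: Iterable[Tuple[str, int, str]]) -> Dict[Key, str]:
    """Keep only the first call seen for each position (assumed cleaning policy)."""
    calls: Dict[Key, str] = {}
    for chrom, pos, genotype in records:
        calls.setdefault((chrom, pos), genotype)
    return calls


def normalise(genotype: str) -> Tuple[str, ...]:
    """Treat 0/1, 1/0 and phased 0|1 as the same unphased genotype."""
    return tuple(sorted(genotype.replace("|", "/").split("/")))


def compare(a: Dict[Key, str], b: Dict[Key, str]) -> Tuple[int, int]:
    """Return (number of shared positions, number of those with identical genotypes)."""
    shared = a.keys() & b.keys()
    identical = sum(1 for key in shared if normalise(a[key]) == normalise(b[key]))
    return len(shared), identical


# Hypothetical toy records for one sample from two pipelines.
set_a = first_call_per_position([
    ("chr1", 100, "0/1"),
    ("chr1", 100, "1/1"),  # duplicate position, dropped by the cleaning step
    ("chr1", 250, "1/1"),
    ("chr2", 40, "0/1"),
])
set_b = first_call_per_position([
    ("chr1", 100, "1/0"),
    ("chr1", 250, "0/1"),
    ("chr3", 7, "1/1"),
])
n_shared, n_identical = compare(set_a, set_b)
print(f"{n_shared} shared positions, {n_identical} with identical genotype")
```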
17. The crop innovation company
Conclusions
• The new Population Calling module is extremely fast
• More than 30 times faster than the stepwise approach
• Analysis time scales linearly with the number of samples
• Differences in content should be analyzed further
• The module is highly flexible
o Incremental addition of samples
• Extracting variants from the GVM is very efficient
• The VCF output conforms to the standard but lacks details required for follow-up research (depth / quality)
• This feature has been requested and will be added soon
18. The crop innovation company
Thank you
Cueleneare
Koen
Nijbroek Hans
Karten
Tim
Rudie Antonise Bas Tolhuis