Genome in a Bottle Consortium
Progress Update
January 27, 2014
Justin Zook, Marc Salit, and the Genome in a
Bottle Consortium
Whole Genome RMs vs.
Current Validation Methods
• Sanger confirmation
– Limited by number of sites (and sometimes it’s wrong)

• High depth NGS confirmation
– May have same systematic errors

• Genotyping microarrays
– Limited to known (easier) variants
– Problems with neighboring “complex” variants, duplications

• Mendelian inheritance
– Can’t account for some systematic errors

• Simulated data
– Generally not very representative of errors in real data

• Ti/Tv
– Varies by region of genome, and only gives overall statistic
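As a rough illustration of why Ti/Tv "only gives overall statistic", the ratio collapses an entire call set into one number. The sketch below is a minimal example, assuming SNVs are supplied as hypothetical (ref, alt) base pairs; the function names and the example data are not from the slides.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    ref, alt = ref.upper(), alt.upper()
    pair = {ref, alt}
    return ref != alt and (pair <= PURINES or pair <= PYRIMIDINES)


def ti_tv_ratio(snvs):
    """Return transitions / transversions for an iterable of (ref, alt) pairs."""
    ti = tv = 0
    for ref, alt in snvs:
        if ref.upper() == alt.upper():
            continue  # not a substitution, skip
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions; ratio undefined")
    return ti / tv


# Hypothetical call set: 4 transitions, 2 transversions
example_snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"), ("A", "C"), ("G", "T")]
print(ti_tv_ratio(example_snvs))
```

A single ratio like this says nothing about which individual calls are wrong, which is the limitation the slide points out.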
Goals for Data to Accompany RM
• ~0 false positive AND false negative calls in
confident regions
• Include as much of the genome as possible in
the confident regions (i.e., don’t just take the
intersection)
• Avoid bias towards any particular platform
– take advantage of strengths of each platform

• Avoid bias towards any particular
bioinformatics algorithms
Integrate 14 Datasets from 5
platforms

Integration of Data to
Form Highly Confident Genotype Calls
Candidate variants

Find all possible variant sites

Concordant variants

Find concordant sites across multiple datasets

Find characteristics
of bias

Identify sites with atypical characteristics signifying
sequencing, mapping, or alignment bias

Arbitrate using
evidence of bias

For each site, remove datasets with decreasingly atypical
characteristics until all datasets agree

Confidence Level

Even if all datasets agree, identify them as uncertain if
few have typical characteristics, or if they fall in known
segmental duplications, SVs, or long repeats
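The arbitration and confidence steps above can be sketched in a few lines. This is a simplified illustration, not the consortium's actual pipeline: the per-dataset `atypicality` score, the dictionary keys, and the `min_supporting` threshold are all assumptions made for the example.

```python
def arbitrate_site(calls):
    """Drop the most atypical dataset repeatedly until the rest agree.

    calls: list of dicts with hypothetical keys
           'dataset', 'genotype', 'atypicality' (higher = more atypical).
    Returns (genotype, names of agreeing datasets), or (None, []) if empty.
    """
    remaining = sorted(calls, key=lambda c: c["atypicality"])
    while remaining:
        genotypes = {c["genotype"] for c in remaining}
        if len(genotypes) == 1:
            return genotypes.pop(), [c["dataset"] for c in remaining]
        remaining.pop()  # remove the most atypical remaining dataset
    return None, []


def is_confident(supporting, in_difficult_region, min_supporting=2):
    """Mark a site uncertain if few datasets support it or it lies in a
    known difficult region (segmental duplication, SV, long repeat)."""
    return len(supporting) >= min_supporting and not in_difficult_region


# Hypothetical site: dataset C disagrees and looks atypical
site_calls = [
    {"dataset": "A", "genotype": "0/1", "atypicality": 0.1},
    {"dataset": "B", "genotype": "0/1", "atypicality": 0.3},
    {"dataset": "C", "genotype": "1/1", "atypicality": 0.9},
]
genotype, supporting = arbitrate_site(site_calls)
print(genotype, supporting, is_confident(supporting, in_difficult_region=False))
```

Note that a site can end with agreement from a single dataset; the confidence step is what keeps such sites out of the high-confidence set.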
Verification of “Highly Confident”
Genotype accuracy
• Sanger sequencing
– 100% accuracy but only 100s of sites

• X Prize Fosmid sequencing
– Sometimes call only part of a complex variant

• Microarrays
– Differences appear to be FP or FN in arrays

• Broad 250bp HaplotypeCaller
– Very highly concordant

• Platinum genomes pedigree SNPs
– Some systematic errors are inherited; different representations
of complex variants

• Real Time Genomics SNPs and indels
– Some interesting sites called by RTG complex caller
GCAT – Interactive Performance
Metrics
• NIST is working with
GCAT to use our highly
confident variant calls
• Assess performance of
many combinations of
mappers and variant
callers
• www.bioplanet.com/gcat

Improvement of FreeBayes over 1 year with indels

Why do calls differ from our highly
confident genotypes?
Apparent False Positives
• Platform-specific systematic
sequencing errors for SNPs
• Analysis-specific
• Difficult to map regions
• Indels in long
homopolymers

Apparent False Negatives
• Different complex variant
representation
• Near indels
• Inside repeats

Complex variants have multiple correct
unphased representations
[Figure: alignments from BWA, ssaha2, CGTools, and Novoalign shown against the reference (Ref); labels: “T insertion”, “TCTCT insertion”]

                              FP SNPs        FP MNPs       FP indels
Traditional comparison        0.38% (610)    100% (915)    6.5% (733)
Comparison with realignment   0.15% (249)    4.2% (38)     2.6% (298)

• ~225,000 highly confident
variants are within 10bp of
another variant
• FPs and FNs are significantly
enriched for complex variants
• RTG vcfeval can fix this issue!
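One way to see why differing representations cause apparent FPs and FNs is to compare the sequences that the calls produce rather than the calls themselves. The sketch below is a minimal haploid illustration with a made-up reference and made-up variants; it is not how RTG vcfeval is implemented, and `apply_variants` is a hypothetical helper.

```python
def apply_variants(reference, variants):
    """Apply non-overlapping variants to a reference string.

    variants: list of (pos, ref_allele, alt_allele) with 0-based pos.
    Variants are applied right to left so earlier coordinates stay valid.
    """
    seq = reference
    for pos, ref_allele, alt_allele in sorted(variants, reverse=True):
        if seq[pos:pos + len(ref_allele)] != ref_allele:
            raise ValueError(f"reference allele mismatch at {pos}")
        seq = seq[:pos] + alt_allele + seq[pos + len(ref_allele):]
    return seq


reference = "AGTCTCTG"  # hypothetical reference containing a TC repeat

# The same 2 bp insertion, placed at two different positions in the repeat
insertion_left = [(1, "G", "GTC")]
insertion_right = [(2, "T", "TCT")]

# One MNP versus two adjacent SNPs
as_mnp = [(3, "CT", "GA")]
as_snps = [(3, "C", "G"), (4, "T", "A")]

print(apply_variants(reference, insertion_left) == apply_variants(reference, insertion_right))
print(apply_variants(reference, as_mnp) == apply_variants(reference, as_snps))
```

A position-by-position comparison would count each of these pairs as one false positive plus one false negative, even though the resulting sequences are identical.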
Reasons we exclude regions from high-confidence set
Structural variant analytical approach
• Depth of coverage (DOC)
– Control-FREEC
– CnD
• Paired-end mapping (PEM)
– Breakdancer
• Split read (SR)
– Pindel
• Assembly based (AS)
– Velvet
– ABySS
• Combination
– Genome-STRiP
• SVMerge
– List of structural variant calls
Validation parameters for each SV
• Coverage (mean and standard deviation)
• Paired-end distance/insert size (mean and
standard deviation)
• # of discordant paired-ends
• Soft clipping of the reads (mean and standard
deviation)
• Mapping quality (mean and standard deviation)
• # of heterozygous and homozygous SNP
genotype calls
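The parameters listed above could be gathered per candidate SV roughly as follows. This is only a sketch over in-memory records: the read fields (`insert_size`, `mapq`, `soft_clipped`, `discordant`), the genotype strings, and the function names are assumptions, and a real pipeline would read these values from alignment and variant files.

```python
from statistics import mean, stdev


def mean_sd(values):
    """Return (mean, sample standard deviation); sd is 0.0 for one value."""
    values = list(values)
    return mean(values), (stdev(values) if len(values) > 1 else 0.0)


def summarize_sv(coverage, reads, snp_genotypes):
    """Summarize evidence around one candidate SV.

    coverage:      per-base depth values across the region
    reads:         dicts with hypothetical keys described above
    snp_genotypes: genotype strings such as '0/1' or '1/1'
    """
    return {
        "coverage": mean_sd(coverage),
        "insert_size": mean_sd(r["insert_size"] for r in reads),
        "soft_clipped_bases": mean_sd(r["soft_clipped"] for r in reads),
        "mapping_quality": mean_sd(r["mapq"] for r in reads),
        "discordant_pairs": sum(1 for r in reads if r["discordant"]),
        "het_snps": sum(1 for g in snp_genotypes if g == "0/1"),
        "hom_snps": sum(1 for g in snp_genotypes if g == "1/1"),
    }


# Hypothetical evidence for one region
example_reads = [
    {"insert_size": 300, "mapq": 60, "soft_clipped": 0, "discordant": False},
    {"insert_size": 310, "mapq": 60, "soft_clipped": 0, "discordant": False},
    {"insert_size": 900, "mapq": 20, "soft_clipped": 40, "discordant": True},
]
summary = summarize_sv([30, 32, 28, 30], example_reads, ["0/1", "0/1", "1/1"])
print(summary)
```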
Challenges with assessing
performance
• All variant types are not
equal
• All regions of the genome
are not equal
– Homopolymers, STRs, duplications
– Can be similar or different
in different genomes

• Labeling difficult variants
as uncertain leads to
higher apparent accuracy
when assessing
performance
• Genotypes fall in 3+
categories (not
positive/negative)
– standard diagnostic
accuracy measures not
well posed
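Because genotypes fall into more than two classes, a genotype-by-genotype confusion table is often easier to reason about than a single sensitivity or specificity figure. The sketch below assumes hypothetical class labels (`hom_ref`, `het`, `hom_alt`, `uncertain`) and treats sites missing from either call set as `hom_ref`; both choices are assumptions for illustration.

```python
from collections import Counter


def genotype_confusion(truth, test, default="hom_ref"):
    """Count (truth_class, test_class) pairs over all sites in either set."""
    counts = Counter()
    for site in truth.keys() | test.keys():
        counts[(truth.get(site, default), test.get(site, default))] += 1
    return counts


def concordance(counts, exclude="uncertain"):
    """Fraction of scored sites where the classes match, skipping
    any pair that involves the excluded class."""
    scored = {k: n for k, n in counts.items() if exclude not in k}
    total = sum(scored.values())
    agree = sum(n for (t, q), n in scored.items() if t == q)
    return agree / total if total else float("nan")


# Hypothetical calls keyed by site position
truth_calls = {1: "het", 2: "hom_alt", 3: "uncertain", 4: "het"}
test_calls = {1: "het", 2: "het", 3: "het", 5: "het"}

table = genotype_confusion(truth_calls, test_calls)
print(table)
print(concordance(table))
```

Site 3 is dropped from the score because the truth set marks it uncertain, which shows in miniature how labeling difficult variants as uncertain changes the apparent accuracy.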
Pedigree calls
• RTG and Illumina Platinum
Genomes working on this
• Sequence
NA12878, husband, and 11
children to identify high
confidence variants
– Identify cross-over events
– Determine if genotypes are
consistent with inheritance

• Should we integrate these
with the NIST high-confidence
genotypes?
• Should we find larger families
for future genomes?
• See afternoon presentations!

Source: Mike Eberle, Illumina
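The check "determine if genotypes are consistent with inheritance" can be illustrated for a single autosomal diploid site. This is a minimal sketch that ignores de novo mutation, phasing, and cross-over detection; the genotype encoding (tuples of allele indices) and function names are assumptions.

```python
from itertools import product


def mendelian_consistent(mother, father, child):
    """True if the child could inherit one allele from each parent.

    Genotypes are 2-tuples of allele indices, e.g. (0, 1) for heterozygous.
    """
    possible = {tuple(sorted(pair)) for pair in product(mother, father)}
    return tuple(sorted(child)) in possible


def count_inconsistent(mother, father, children):
    """Number of children whose genotype violates Mendelian inheritance."""
    return sum(not mendelian_consistent(mother, father, c) for c in children)


# Hypothetical site: mother het, father hom-ref, three children
kids = [(0, 1), (0, 0), (1, 1)]
print(count_inconsistent((0, 1), (0, 0), kids))
```

As noted on the verification slide, some systematic errors are inherited, so a site that passes this check is not necessarily correct.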

Pedigree Calls in Uncertain Regions
GIAB Characterization of pilot RM
• NIST – 300x 150x150bp HiSeq (from 6 vials)
• NIST – 100x 75bp ECC SOLiD 5500W
• Illumina – 50x 100x100bp HiSeq
• Complete Genomics – Normal and LFR (non-RM)
• Garvan Institute – Illumina exome
• NCI – Ion Proton whole genome
• INOVA – Infinium SNP/CNV array
Homogeneity and Stability
Homogeneity
• Multiplex First and last vial
– 3 libraries x 33x HiSeq each

• Multiplex 4 Random vials
– 2 libraries x 12.5x HiSeq each

• Compare variability due to:
– vial
– library
– day
– flow cell
– lane
– sampling

• Run PFGE on each vial for size

Stability
• Run PFGE to detect DNA
degradation
• Freeze-thaw 2 and 5 times
• Vortex for 10s
• 4°C for 2 and 8 weeks
• 37°C for 2 and 8 weeks
FTP site and Amazon S3
• NCBI is hosting fastq, bam, and vcf files on the
giab ftp site
• These data are mirrored to Amazon S3, so we
encourage you to take advantage of this!
Pilot Reference Material
• High-confidence calls are available on the ftp
site and are already being used
• NIST plans to release this as a NIST Reference
Material in the next couple months
Future Directions
• Characterize more
“difficult” regions/variants
• Structural variants
• Compare to pedigree calls
• Examine potentially
clinically relevant
regions/variants in RMs
• Use long-read technologies
– Moleculo
– CG LFR
– PacBio
– BioNano Genomics
– future technologies??

• Use glia/platypus to realign
reads to candidate variants

• Analyze interlaboratory
study data
• Characterize PGP genomes
– Ashkenazim trio
– son in Asian trio
– DNA at NIST in Jan-Feb 2014
– Volunteers to sequence?

• Select future genomes
• Tumor-normal?
Topic #1: Moving beyond the easy
regions/variants
Presentations
• Emerging Technologies
– PacBio
– Complete Genomics LFR
– Moleculo
– BioNano Genomics

• Structural Variants
– Bina Technologies

Topics
• Structural Variants
• Phasing
• Validation
• Where should we set the
threshold(s) for confidence?
Topic #2: Cancer and Future Genomes
Cancer
• Spike-ins
• Mixtures of normal cell lines
• Tumor-normal cell line pair
• Transcriptome controls

Priorities for Future Genomes
• Diverse ancestry groups
• Larger families
• Recruitment with consent
for commercialization
• How many genomes?
• Should the parents be NIST
Reference Materials, or only
the child?
Working Group Questions
RM Selection & Design
• Spike-in controls
• FFPE
• Commercial RMs
• ABRF interlaboratory study
• Should we prioritize one or
two genomes?

RM Characterization
• Production mode for new
trios
– Pilot was characterized by
Illumina, SOLiD, Ion
Proton, and Complete
Genomics
– What resources should we
invest in measurements for
each new family?
Working Group Questions
Bioinformatics
• Storing data/pipelines
– Suggestions for ftp structure
– Data submission/accessioning
process
– Data model for genomic data
– Archiving pipelines and
reproducible research

• GRCh38
• How to use pedigree calls for pilot
genome?
• Clones for targeted regions (hard
regions if not whole genome)
• In which difficult regions should
we focus our characterization?

Performance Metrics
• Target audience
• Requirements for user
interface
– Establishing truth set(s)
– Inputs/Outputs
– Visualization

• Integration with GeT-RM

140127 GIAB update and NIST high-confidence calls


Editor's Notes

  • #9: Meeting Notes (5/28/13 17:05): ask Heng for decoy