Sample Characterization

Michael A. Eberle
GiaB, January 2014
Pedigree including NA12878

[Figure: pedigree diagram. Sample IDs shown: 12889, 12890, 12891, 12892; 12877, 12878 (NA12878); 12879, 12880, 12881, 12882, 12883, 12884, 12885, 12886, 12887, 12888, 12893]

•  All 17 members sequenced to at least 50x depth (PCR-free protocol)
•  Variants are called across the pedigree using different software and technologies
•  Inheritance information provides high-confidence, direct validation of variant calls

Why sequence a pedigree?

[Figure: genotypes at six sites for each sample in the pedigree, with one haplotype highlighted in blue]

With a sufficiently large pedigree the transmission of the parental chromosomes can be determined unambiguously.

Error: T in blue haplotype should be G

Why sequence a pedigree?

[Figure: genotypes at six sites for each sample in the pedigree]

Either parent could also be TT. If only the trio were sequenced this error would not be detected.

When sequencing a trio we can never eliminate alternative genotypes in some of the samples.

[Figure: trio genotypes, with one genotype annotated "Could also be GG or GT"]
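The point about trios can be illustrated with a minimal Mendelian-consistency check. This is a sketch only: the function name and the example genotypes are hypothetical and are not read from the figure. It shows a miscalled parent genotype that still passes the trio check, so the error would go unnoticed without additional siblings.

```python
from itertools import product


def mendelian_consistent(father: str, mother: str, child: str) -> bool:
    """True if the child's unphased genotype can be formed by taking
    one allele from the father and one from the mother."""
    child_alleles = sorted(child)
    return any(sorted(f + m) == child_alleles for f, m in product(father, mother))


# Hypothetical trio at a single site; the true genotypes are all GT.
truth = ("GT", "GT", "GT")
# A genotyping error calls the father TT instead of GT.
with_error = ("TT", "GT", "GT")

truth_ok = mendelian_consistent(*truth)
error_ok = mendelian_consistent(*with_error)  # still consistent: error is invisible
```

A trio check only flags calls that are impossible under inheritance, such as a GT child of two GG parents; any wrong call that remains a possible transmission passes silently.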
A large pedigree identifies most errors

•  Can identify a single error in >99.7% of the variant positions (11 sibs)
•  “Perfectly constrained” means that the genotype information of any one sample could be removed and imputed from the phasing and the other samples' genotypes
•  More sibs add confidence to more variant calls
•  2 sibs allow phasing and identify errors in 25% of variant positions
•  A trio never positively identifies the genotypes in every sample

[Figure: % sites perfectly constrained (0 to 100) versus number of siblings (1 to 11)]
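One simple model reproduces the 25% (2 sibs) and >99.7% (11 sibs) figures quoted on this slide. Assume that at any locus each parent passes either of its two haplotypes to each child with probability 1/2, independently, and count the locus as covered when all four parental haplotypes appear in at least one child. This is an assumption made for illustration, not a calculation stated on the slide, and the function name is hypothetical.

```python
def fraction_all_haplotypes_transmitted(n_sibs: int) -> float:
    """Probability that, at a given locus, both haplotypes of both parents
    are each transmitted to at least one of n_sibs children.

    Assumed model: independent 50/50 transmission per parent per child.
    """
    # For one parent, the only failures are "all children got haplotype 1"
    # or "all children got haplotype 2": 2 * (1/2)**n_sibs = 2**(1 - n_sibs).
    per_parent = 1.0 - 2.0 ** (1 - n_sibs)
    # The two parents transmit independently.
    return per_parent ** 2


for n in (1, 2, 11):
    print(n, fraction_all_haplotypes_transmitted(n))
```

Under this model a single child gives 0, two siblings give exactly 0.25, and eleven siblings give a value just above 0.998.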
Cost to add more siblings

[Figure: % sites perfectly constrained (0 to 100) versus number of siblings (1 to 11), annotated "1 Trio of Sequencing" and "2 Trios of Sequencing / 4 sibs"]
Understanding conflicts in the pedigree

Somatic/cell-line deletions on chr22

[Figure: three panels along chr22 (scale bar 1Mb): errors per 50kb (0 to 300), errors in NA12878 & NA12893 (0 to 300), and normalized depth (0 to 4)]

None of the other children carry this deletion (though noise may indicate mosaic)
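The "errors per 50kb" track can be sketched as a fixed-window count of inheritance conflicts. The coordinates below are hypothetical and are not taken from the slides; the function name is also an assumption made for illustration.

```python
from collections import Counter

WINDOW = 50_000  # 50 kb bins, matching the "Errors per 50kb" panel


def conflicts_per_window(positions, window=WINDOW):
    """Count conflict positions in each fixed-size window.

    Returns a dict keyed by window start coordinate, in ascending order.
    """
    counts = Counter((pos // window) * window for pos in positions)
    return dict(sorted(counts.items()))


# Hypothetical chr22 coordinates of inheritance conflicts.
example_positions = [22_310_000, 22_312_500, 22_349_999, 22_350_000, 22_401_000]
binned = conflicts_per_window(example_positions)
```

A run of adjacent windows with elevated counts, restricted to one or two samples, is the kind of pattern the slide attributes to a cell-line deletion rather than to scattered genotyping errors.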
Read counts for the haplotypes inferred in NA12878 at location of cell line deletion (200x depth)

•  Inferred the two haplotypes in NA12878 based on the other samples
•  Counts represent the predicted heterozygous locations

[Figure: distribution of allele counts (0 to 200) for the maternal haplotype (NA12892) and the paternal haplotype (NA12891); y-axis: fraction (0.00 to 0.10)]
Technical replicates validate de novo SNVs

•  1843 (~96%) replicate original call
•  82 (~4%) did not replicate (FPs?)

[Figure: total conflicts (0 to 4000) by result in the technical replicate]
Thoughts on selecting the next samples for sequencing

•  Identify and sequence pedigrees with multiple siblings
–  WGS every individual in the pedigree to identify haplotype transmission vectors
–  One “high quality” family (2 parents & 4 sibs) provides a “better” reference than two lower quality trios for the same amount of sequencing
–  Technical replicates allow alternative validation of biologically interesting calls, e.g. de novo mutations, gene conversion, etc.

•  Choose one or two samples to target for long reads if sequencing-limited
–  Sequencing both parents will provide 100% of the variants in the pedigree, though with four children only ~75% will be validated in the children
–  Sequencing a child guarantees that every variant has also been sequenced in at least one of the parents, though the child will contain only ~50% of the variants in the family

•  Quality of the DNA is important
–  CEPH pedigree shows many cell line artifacts that are correctly genotyped but deviate from inheritance
–  Cell line artifacts complicate the analysis

140127 platinum genomes pedigree analyses
