3. Why sequence a pedigree?
[Figure: pedigree diagram showing each sample's genotypes at consecutive variant positions, with haplotypes color-coded]
With a sufficiently large pedigree, the transmission of the parental chromosomes can be determined unambiguously.
Error: T in blue haplotype should be G
4. Why sequence a pedigree?
[Figure: pedigree genotype diagram compared with a trio-only view. Annotations: “Either parent could also be TT”; “Could also be GG or GT”]
If only the trio were sequenced, this error would not be detected.
When sequencing a trio, we can never eliminate alternative genotypes in some of the samples.
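The trio ambiguity can be illustrated with a minimal sketch. It assumes a single site with two alleles (G and T) and a simple Mendelian check with no de novo events; the function names `can_produce` and `alternative_genotypes` are hypothetical and not taken from the slides.

```python
from itertools import combinations_with_replacement, product


def genotypes(alleles):
    """All unordered diploid genotypes over the given alleles."""
    return ["".join(p) for p in combinations_with_replacement(alleles, 2)]


def can_produce(father, mother, child):
    """True if the child can inherit one allele from each parent."""
    child_sorted = "".join(sorted(child))
    return any("".join(sorted(f + m)) == child_sorted
               for f, m in product(father, mother))


def alternative_genotypes(father, mother, children, target, alleles="GT"):
    """Genotypes for `target` ('father' or 'mother') that keep every
    child Mendelian-consistent, holding the other parent fixed."""
    result = []
    for g in genotypes(alleles):
        f, m = (g, mother) if target == "father" else (father, g)
        if all(can_produce(f, m, c) for c in children):
            result.append(g)
    return result


# Trio: one GT child cannot rule out any paternal genotype.
trio_options = alternative_genotypes("GT", "GT", ["GT"], "father")
# Larger sibship: GG and TT children force the father to be GT.
sibship_options = alternative_genotypes("GT", "GT", ["GT", "GG", "TT"], "father")
print(trio_options, sibship_options)
```

In this toy case the trio leaves GG, GT and TT all possible for the father, while three siblings with different genotypes leave only GT.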
5. A large pedigree identifies most errors
Can identify a single error in >99.7% of the variant positions (11 sibs)
[Figure: % Sites Perfectly Constrained (y-axis: Percent, 0 to 100) vs. # Siblings (x-axis: 1 to 11)]
“Perfectly constrained” means we could remove the genotype information of any sample and impute it based on the phasing and other sample genotypes
More sibs add confidence to more variant calls
2 sibs allow phasing & identify errors in 25% of variant positions
Trio never positively identifies the genotypes in every sample
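The “perfectly constrained” idea can be sketched as a brute-force leave-one-out check at a single site. This is an illustration under stated assumptions, not the method used to produce the figure: the site is biallelic, the inheritance vectors (which paternal and maternal haplotype each child received) are already known from phasing, and the names `possible_genotypes` and `perfectly_constrained` are hypothetical.

```python
from itertools import product


def _norm(genotype):
    """Unordered genotype as a sorted two-character string."""
    return "".join(sorted(genotype))


def possible_genotypes(held_out, observed, vectors, alleles="AG"):
    """Genotypes the held-out sample could have, given all other samples.

    observed: dict sample -> genotype string, with parents named
              "father" and "mother"
    vectors:  dict child -> (paternal haplotype index, maternal haplotype index)
    """
    candidates = set()
    # Try every assignment of alleles to the four parental haplotypes.
    for f0, f1, m0, m1 in product(alleles, repeat=4):
        father, mother = (f0, f1), (m0, m1)
        predicted = {"father": _norm(f0 + f1), "mother": _norm(m0 + m1)}
        for child, (p, m) in vectors.items():
            predicted[child] = _norm(father[p] + mother[m])
        # Keep assignments that match every sample except the held-out one.
        if all(predicted[s] == _norm(g)
               for s, g in observed.items() if s != held_out):
            candidates.add(predicted[held_out])
    return sorted(candidates)


def perfectly_constrained(observed, vectors, alleles="AG"):
    """True if every sample's genotype is uniquely implied by the others."""
    return all(len(possible_genotypes(s, observed, vectors, alleles)) == 1
               for s in observed)


trio = {"father": "AG", "mother": "AA", "child1": "AG"}
trio_vectors = {"child1": (0, 0)}

family = {"father": "AG", "mother": "AA", "child1": "AG",
          "child2": "AA", "child3": "AG", "child4": "AA"}
family_vectors = {"child1": (0, 0), "child2": (1, 0),
                  "child3": (0, 1), "child4": (1, 1)}

print(possible_genotypes("father", trio, trio_vectors))
print(perfectly_constrained(trio, trio_vectors),
      perfectly_constrained(family, family_vectors))
```

In the trio the father could be AG or GG once his own call is removed, so the site is not perfectly constrained. In the four-sibling example every sample is uniquely implied by the rest. An empty candidate list would indicate that the remaining genotypes are mutually inconsistent, which points to an error at that site.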
6. Cost to add more siblings
[Figure: % Sites Perfectly Constrained (y-axis: Percent, 0 to 100) vs. # Siblings (x-axis: 1 to 11). Annotations: “1 Trio of Sequencing”; “2 Trios of Sequencing / 4 sibs”]
8. Somatic/cell-line deletions on chr22
[Figure: error-count tracks (# Errors, errors per 50kb, 0 to 300), including “Errors in NA12878 & NA12893”, above a Normalized Depth track (0 to 4); scale bar: 1Mb]
None of the other children carry this deletion (though noise may indicate mosaicism)
9. Read counts for the haplotypes inferred in NA12878 at location of cell line deletion (200x depth)
• Inferred the two haplotypes in NA12878 based on the other samples
• Counts represent the predicted heterozygous locations
[Figure: Fraction (y-axis, 0.00 to 0.10) vs. Allele Counts (x-axis, 0 to 200) for the maternal haplotype (NA12892) and the paternal haplotype (NA12891)]
10. Technical replicates validate de novo SNVs
1843 (~96%) replicate original call
82 (~4%) did not replicate
[Figure: Total Errors / TotalConflicts (y-axis, 0 to 4000) vs. Results in Tech. Rep. (x-axis); value labels as extracted: NA, 128, 0, 82; annotation: “FPs?”]
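The replication percentages follow directly from the two counts on the slide. A minimal sketch of the arithmetic, assuming the 1843 and 82 calls together make up all de novo SNV calls that were tested (the helper name `replication_rates` is hypothetical):

```python
def replication_rates(replicated, not_replicated):
    """Return (percent replicated, percent not replicated)."""
    total = replicated + not_replicated
    return 100 * replicated / total, 100 * not_replicated / total


pct_replicated, pct_failed = replication_rates(1843, 82)
print(f"{pct_replicated:.1f}% replicated, {pct_failed:.1f}% did not")
```

Under that assumption the denominator is 1925 calls, which rounds to the ~96% and ~4% shown above.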
11. Thoughts on selecting the next samples for sequencing
Identify and sequence pedigrees with multiple siblings
– WGS every individual in the pedigree to identify haplotype transmission vectors
– One “high quality” family (2 parents & 4 sibs) provides a “better” reference than two lower quality trios for the same amount of sequencing
– Technical replicates allow alternative validation of biologically interesting calls, e.g. de novo mutations, gene conversion, etc.
Choose one or two samples to target for long reads if sequencing-limited
– Sequencing both parents will provide 100% of the variants in the pedigree, though with four children only ~75% will be validated in the children
– Sequencing a child will guarantee that every variant has been sequenced in at least one of the parents, though it will only contain ~50% of the variants in the family
Quality of the DNA is important
– CEPH pedigree shows many cell line artifacts that are correctly genotyped but deviate from inheritance
– Cell line artifacts complicate the analysis