Autism exome sequencing:
design, data processing and analysis
Benjamin Neale
Analytic and Translational Genetics Unit, MGH
&
Medical & Population Genetics Program, Broad Institute
Direct Sequencing has
Enormous Potential…
• Ng, Shendure: Miller syndrome, 4 cases
– exome sequenced reveals causal mutations in DHODH
• Lifton: Undiagnosed congenital chloride diarrhoea (consanguineous)
– Exome seq reveals homozygous SLC26A3 chloride ion transporter mutation
– Return diagnosis of CLD (gi) not suspected Bartter syndrome (renal)
• Worthey, Dimmock: 4-year old, severe unusual IBD
– exome seq reveals XIAP mutation (at a highly conserved aa)
– proimmune dysregulation: opt for bone marrow transplant over chemo
• Jones, Marra: Secondary lung carcinoma unresponsive to erlotinib
– Genome and transcriptome sequencing reveals defects
– directs alternative sunitinib therapy
• Mardis, Wilson: acute myelocytic leukaemia but not classical translocation
– Genome sequencing (1 week + analysis) reveals PML-RARA translocation
– Directs ATRA (all trans retinoic acid) treatment decision
…and tremendous challenges
• Managing and processing vast quantities of
data into variation
• Interpreting millions of variants per individual
– An individual’s genome harbors
• ~80 point nonsense mutations
• ~100-200 frameshift mutations
• Tens of splice mutants, CNV induced gene disruptions
For very few of these do we have any conclusive understanding
of their medical impact in the population
Successes to date rely on factors that may
not apply generally to common endpoints
• Mendelian disorders
– Single family rare autosomal recessive (linkage
may target 1% of genome, 2 ‘hits’ in the same
gene very unlikely)
– Single (or ‘near single’) gene disorders where
nearly all families carry mutations in the same
underlying gene
• Somatic or de novo mutations
– Extremely rare background rate
Autism exome sequencing
• In progress – ARRA supported by NIMH &
NHGRI
• Collaboration between sequencing centers
(Baylor & Broad) and Y2 follow-up in autism
genetics labs (Buxbaum, Daly, Devlin,
Schellenberg, Sutcliffe)
• Targeted production by year's end of 1000
cases and 1000 controls (500/500 from each
site)
Exome production plan
• Baylor: 1000 samples (Nimblegen capture, SOLiD
sequencing)
• Broad: 1000 samples (Agilent capture, Illumina
sequencing)
• Predominantly cases and controls pairwise
matched with GWAS data (one batch of 50 trios
currently being run)
• All samples are available from NIMH repository
Broad Exome Production
• ~700 exomes completely sequenced and
recently completed variant calling
• ~300 completed earlier in the Summer and
fully analyzed (basis of later analysis slides)
• Main production conducted with matched
case-control pairs traveling together through
the sequencing lab and computational runs of
variant calling
From unmapped reads to true genetic variation
in next-generation sequencing data
Raw short reads
A single run of a sequencer (Solexa, SOLiD, 454) generates
~50M ~75bp short reads for analysis
Mapping and alignment
The origin of each read from the
human genome sequence is found
Quality calibration and annotation
The quality of each read is calibrated
and additional information annotated
for downstream analyses
Identifying genetic variation
SNPs and indels from the reference
are found where the reads collectively
provide evidence of a variant
[Figure: reads aligned to Region 1 and Region 2 of the human reference genome at each step, with a SNP marked at the final step]
Partnership: Genome Sequencing and
Analysis (GSA) team @ Broad
• Genome Sequencing and Analysis
(GSA) develops core capabilities for
genetic analysis
– Data processing and analysis methods
– Technology development
– High-end software engineering
– High-throughput data processing for
MPG exome projects with MPG-
Firehose
• Staffed by full-time research scientists
in MPG
– PhDs and BAs in biochemistry,
engineering, computer science,
mathematics, and genetics
Group Leader
Mark A. DePristo
Analysis Team
Kiran Garimella [Lead]
Chris Hartl
Corin Boyko
Development Team
Eric Banks [Lead]
Guillermo del Angel
Menachem Fromer
Ryan Poplin
Software Engineering
Matthew Hanna
Khalid Shakir
Aaron McKenna
Developing cutting-edge data
processing and analysis methods
Local realignment
Base quality score recalibration
Variation discovery and genotyping
Read-backed phasing
VariantEval
Adaptive error modeling
Novel SNPs found
Challenges
• Mapping/alignment
• Quality score recalibration
• Calling variants
• Evaluating set of variant sites
• Estimating genotypes
Finding the true origin of each read is a
computationally demanding and important first step
Solexa: BWA
• Robust, accurate ‘gold standard’ aligner for NGS
• Developed by Li and Durbin
• Recently replaced MAQ, also by Li and Durbin, used for last 2 years
454: SSAHA
• Hash-based aligner with high sensitivity and specificity with longer reads
SOLiD: Corona
• ABI-designed tool for aligning in color-space
[Figure: an enormous pile of short reads from NGS and the reference genome feed the mapping and alignment algorithm, which writes SAM/BAM files. Across Regions 1 to 3 it detects the correct origin of reads and flags them with high certainty, and detects ambiguity in the origin of reads and flags them as uncertain.]
The SAM file format
• Data sharing was a major issue with the 1000 genomes
– Each center, technology and analysis tool used its own
idiosyncratic file formats – no one could exchange data
• The Sequence Alignment and Mapping (SAM) file
format was designed to capture all of the critical
information about NGS data in a single indexed and
compressed file
– Becoming a standard and is now used by production
informatics, MPG, and cancer analysis groups at the Broad
• Has enabled sharing of data across centers and the
development of tools that work across platforms
• More info at http://samtools.sourceforge.net/
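As a rough illustration of what a single SAM alignment line carries, the sketch below splits one tab-separated record into its mandatory fields and decodes a few of the bit flags. The example read and the `SamRecord` class name are made up for illustration; real projects would normally use an existing SAM/BAM library rather than hand parsing.

```python
from dataclasses import dataclass

# Bit flags defined by the SAM specification.
FLAG_UNMAPPED = 0x4
FLAG_REVERSE = 0x10
FLAG_FIRST_IN_PAIR = 0x40
FLAG_SECOND_IN_PAIR = 0x80


@dataclass
class SamRecord:
    """A subset of the 11 mandatory SAM fields (hypothetical helper class)."""
    qname: str   # read name
    flag: int    # bitwise flags
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality (Phred scaled)
    cigar: str   # alignment description
    seq: str     # read bases
    qual: str    # base qualities, ASCII encoded

    @property
    def is_unmapped(self) -> bool:
        return bool(self.flag & FLAG_UNMAPPED)

    @property
    def is_reverse(self) -> bool:
        return bool(self.flag & FLAG_REVERSE)

    @property
    def is_first_in_pair(self) -> bool:
        return bool(self.flag & FLAG_FIRST_IN_PAIR)


def parse_sam_line(line: str) -> SamRecord:
    """Parse one alignment line (not a '@' header line)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has 11 mandatory fields")
    return SamRecord(
        qname=fields[0], flag=int(fields[1]), rname=fields[2],
        pos=int(fields[3]), mapq=int(fields[4]), cigar=fields[5],
        seq=fields[9], qual=fields[10],
    )


# Invented example: flag 83 = paired + proper pair + reverse strand + first in pair.
example = "read1\t83\tchr1\t1000\t60\t8M\t=\t900\t-108\tACGTACGT\tIIIIIIII"
rec = parse_sam_line(example)
print(rec.rname, rec.pos, rec.is_reverse, rec.is_first_in_pair)
```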
Local realignment around indels
[Figure: panels for SLX GA, 454, SOLiD, HiSeq and Complete Genomics, with first-of-pair and second-of-pair reads plotted separately]
Base Quality Score
Recalibration
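The idea behind recalibration can be sketched as follows: group bases by the quality the machine reported, count how often they mismatch the reference at sites not known to vary, and convert that observed error rate back to a Phred score (Q = -10 log10 p). This is a minimal sketch, not the GATK implementation; it bins on reported quality only, and the add-one smoothing and function names are assumptions made here for illustration.

```python
import math
from collections import defaultdict


def phred(p_error: float) -> float:
    """Convert an error probability to a Phred-scaled quality."""
    return -10.0 * math.log10(p_error)


def empirical_qualities(observations):
    """Map each reported quality to an empirically observed quality.

    observations: iterable of (reported_quality, is_mismatch) pairs taken
    from bases at sites that are not known polymorphisms.
    """
    counts = defaultdict(lambda: [0, 0])  # reported Q -> [mismatches, bases]
    for reported_q, is_mismatch in observations:
        counts[reported_q][0] += int(is_mismatch)
        counts[reported_q][1] += 1
    # Add-one smoothing (an assumption here) avoids log10(0) for clean bins.
    return {q: phred((mm + 1) / (n + 2)) for q, (mm, n) in counts.items()}


# Toy data: 1000 bases reported as Q30, of which 10 mismatch the reference.
obs = [(30, False)] * 990 + [(30, True)] * 10
table = empirical_qualities(obs)
print(round(table[30], 1))
```

In this toy input the machine claims Q30 (1 error in 1000) while the data show roughly 1 error in 100, so the recalibrated quality comes out near Q20.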
How do indel realignment and base quality
recalibration affect SNP calling?
http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration
http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels
6.5% of calls on raw
reads are likely false
positives due to indels
The process doesn’t
remove real SNPs
Sensitivity and specificity improved by simultaneous
variant calling in 50-100 individuals
• Sensitivity
– Greater statistical evidence compiled for true variants
seen in more than one individual
• Specificity
– Deviations in metrics that flag false positive sites
become much more statistically significant
• Allele balance: Departure from 50:50 (r:nr) in heterozygotes
• Strand bias: non-reference allele preferentially seen on one
of the two DNA strands
• Proportion of reads with low mapping quality
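To see why calling many individuals together sharpens the allele-balance metric, the sketch below runs an exact two-sided binomial test of the reference read count against a 50:50 expectation, first for one heterozygote and then for reads pooled over ten. The read counts are invented, and simple pooling of reads is an assumption made here for illustration rather than the method used in the pipeline.

```python
from math import comb


def binomial_two_sided_p(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value.

    Sums the probability of every outcome that is no more likely than
    observing k successes in n trials.
    """
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-9)  # tolerance for floating-point ties
    return min(1.0, sum(x for x in pmf if x <= threshold))


def allele_balance(ref_reads: int, nonref_reads: int) -> float:
    """Fraction of reads carrying the reference allele."""
    return ref_reads / (ref_reads + nonref_reads)


# One heterozygous call with 14 reference and 6 non-reference reads.
p_single = binomial_two_sided_p(14, 20)

# The same 70:30 skew seen across 10 heterozygotes (140 vs 60 reads pooled).
p_pooled = binomial_two_sided_p(140, 200)

print(allele_balance(14, 6), p_single, p_pooled)
```

The same test could be applied to strand bias by counting non-reference reads on the forward versus reverse strand.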
The Genome Analysis Toolkit (GATK) enables rapid
development of efficient and robust analysis tools
Genome Analysis Toolkit
(GATK) infrastructure
Analysis
tool
Traversal engine
Implemented by user
Provided by framework
More info: http://www.broadinstitute.org/gsa/wiki/
• All of these tools
have been developed
in the GATK
• They are memory
and CPU efficient,
cluster friendly and are
easily parallelized
• They are now
publicly available and are
being used at many
sites around the world
• Supports any BAM-
compatible aligner
Initial alignment
MSA realignment
Q-score
recalibration
Multi-sample
genotyping
SNP filtering
M. DePristo
Additional key advance
• Correcting alignment artifacts and machine-
specific biases in read base calling and quality
score assignment enables machine-
independent variant identification and
genotype calling
• 1000 Genomes data even contains individuals
with data merged from multiple sequencing
platforms!
For our project this is key
• With two centers generating data via distinct
experimental and sequencing procedures,
harmonizing data is integral to downstream
analysis
Stratified analyses
• Because both processes will not afford
equivalent coverage of all targets:
– Critical that case-control balance and individual
pairings are preserved within center
– Final analysis will be stratified by center such that
rare technical differences, lack of coverage on one
or the other platform, etc can be managed
robustly
Secondary data cleaning is critical
• Identify a quality set of individuals
• Identify a quality set of targets
• Identify a quality set of variants
Primary cleaning
• Identify a quality set of individuals
• Identify a quality set of targets
• Identify a quality set of variants
Sample composition thus far
Batch   Case   Control   Total
1       25     25        50
2       22     21        43
3       41     12        53
4       25     25        50
5       25     25        50
6       25     25        50
Total   164    132       296
Individual Cleaning
• Mean depth of coverage for all targets
• Concordance rate with ‘super clean’ SNP Chip
– Contamination checks
Mean Coverage per Sample
Exclude this one
Concordance and contamination
checks
• 1/296 samples fails concordance check (genotype
call discrepancy) with SNP chip data (sample
swap)
• 1/295 samples fails contamination check
(proportion of reads calling non-reference alleles
at SNP chip homozygous sites indicates >5% DNA
from another individual)
• Advance a fully validated set of individuals to
downstream analysis
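The two per-sample checks above can be sketched roughly as follows: concordance is the share of chip sites where the sequencing genotype matches, and a contamination proxy is the fraction of reads showing a non-reference allele at sites the chip calls homozygous reference. Genotypes are coded as the number of non-reference alleles (0, 1, 2). The data, the function names and the restriction to homozygous-reference sites are assumptions made here, and the read fraction is only a proxy for the true proportion of foreign DNA.

```python
def concordance_rate(seq_genotypes: dict, chip_genotypes: dict) -> float:
    """Share of sites typed on both platforms with identical genotypes."""
    shared = [site for site in seq_genotypes if site in chip_genotypes]
    if not shared:
        raise ValueError("no overlapping sites")
    matches = sum(seq_genotypes[s] == chip_genotypes[s] for s in shared)
    return matches / len(shared)


def nonref_read_fraction(read_counts: dict, chip_genotypes: dict) -> float:
    """Fraction of non-reference reads at chip homozygous-reference sites.

    read_counts maps site -> (reference reads, non-reference reads).
    """
    ref_total = nonref_total = 0
    for site, (ref, nonref) in read_counts.items():
        if chip_genotypes.get(site) == 0:
            ref_total += ref
            nonref_total += nonref
    if ref_total + nonref_total == 0:
        raise ValueError("no reads at homozygous-reference sites")
    return nonref_total / (ref_total + nonref_total)


# Invented toy sample.
chip = {"s1": 0, "s2": 0, "s3": 1, "s4": 2}
seq = {"s1": 0, "s2": 0, "s3": 1, "s4": 1}
reads = {"s1": (95, 5), "s2": (90, 10), "s3": (50, 50)}

conc = concordance_rate(seq, chip)
contam = nonref_read_fraction(reads, chip)
flagged = contam > 0.05  # threshold taken from the >5% figure on the slide
print(conc, contam, flagged)
```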
Number of non-reference
genotypes per individual
1500 high frequency sites
Primary cleaning
• Identify a quality set of individuals
• Identify a quality set of targets
• Identify a quality set of variants
Distribution of mean target coverage
99.86% targets
Depth vs GC Content
>95% of the targeted exons sequenced between 10 and 500x depth –
Defined as successfully evaluated exome
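The target-level filter can be sketched as a simple depth window: a target counts as successfully evaluated when its mean depth falls between 10x and 500x. The depths below are invented, and the function name is hypothetical.

```python
def fraction_well_covered(mean_depths, low=10.0, high=500.0):
    """Fraction of targets whose mean depth lies within [low, high]."""
    if not mean_depths:
        raise ValueError("no targets supplied")
    kept = [d for d in mean_depths if low <= d <= high]
    return len(kept) / len(mean_depths)


# Invented mean depths for ten targets.
depths = [5, 12, 80, 150, 600, 40, 220, 30, 75, 90]
frac = fraction_well_covered(depths)
print(frac)
```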
SNP rate as a function of GC Bin
[Figure: SNP rate (y-axis, 0 to 8) by GC-content bin (x-axis): (30,35], (35,40], (40,45], (45,50], (50,55], (55,60], (60,65], (65,~]]
% discovered variants that are singletons
50-250x - half the data, median coverage,
singleton plateau at 34%
Lowest bin badly deficient in
singletons – but higher rate of
called variants overall…
------- 95% of targets covered between 10 and 500x ----------
Primary cleaning
• Identify a quality set of individuals
• Identify a quality set of targets
• Identify a quality set of variants
Defining the set of variant sites
• Define the technical parameters of true polymorphisms
using a core set of ‘gold-standard’ true positive variant
sites:
– Are sites contained in a reference sample (e.g., dbSNP,
another exome or genome study)
– High quality target depth range (50-250) – no true sites
should be missed and no excess coverage suggesting
mapping concerns
• Define distribution of technical properties (balance of
reference/non-reference alleles; balance of non-
reference alleles on +/- strand; read mapping quality)
– Filter non-dbSNP, non-ideal coverage calls based on these
distributions
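One way to turn the gold-standard distributions into a filter is sketched below: take empirical lower and upper quantiles of a metric (allele balance here) over the gold-standard sites, then keep a candidate site only if its value falls inside that range. The 1% and 99% cut-offs, the toy values and the helper names are assumptions made for illustration.

```python
def empirical_quantile(values, q):
    """Nearest-rank style quantile of a list of numbers (0 <= q <= 1)."""
    ordered = sorted(values)
    idx = round(q * (len(ordered) - 1))
    idx = min(len(ordered) - 1, max(0, idx))
    return ordered[idx]


def passing_range(gold_values, lower_q=0.01, upper_q=0.99):
    """Bounds covering the central mass of the gold-standard distribution."""
    return (empirical_quantile(gold_values, lower_q),
            empirical_quantile(gold_values, upper_q))


def passes(value, bounds):
    """True when a candidate site's metric lies inside the bounds."""
    low, high = bounds
    return low <= value <= high


# Invented allele-balance values for gold-standard heterozygous sites.
gold_allele_balance = [0.40 + 0.002 * i for i in range(101)]
bounds = passing_range(gold_allele_balance)

print(bounds, passes(0.50, bounds), passes(0.85, bounds))
```

The same pattern would apply to strand balance and read mapping quality, each with its own gold-standard distribution.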
Allele Balance Example
[Figure: allele balance distribution with the 1%, 5%, 95% and 99% points marked]
In testing now: Variant Quality Score Recalibration enables
definition of data set with user defined sensitivity, specificity
...moving towards posterior estimate for each site
http://www.broadinstitute.org/gsa/wiki/index.php/Variant_quality_score_recalibration
On to analysis…
• 204,123 variants pass all filters across 294
samples…number of all variants & singletons
continue to increase as data is added
• How do we assess how this QC process
performed downstream? Is the experiment
working?
Has the matching worked?
• Matched samples based on MDS distance
from combined GWAS data
• Consider the set of doubletons (two copies in
the dataset)
• Overall, we should see that there are
comparable numbers of variants seen in 2
cases or 2 controls versus 1 case and 1 control;
and we should see an excess of 1 case:1
control variants in matched pairs
Do we see appropriate case:control
statistics for rare variants?
Visual Representation
[Figure: 118 matched case-control pairs (1, 2, 3, 4, …, 118); doubletons whose two copies fall within one matched pair vs. doubletons whose copies fall in different pairs]
Expectation: 1/235 chance that both copies of a doubleton fall within one matched pair; 15,829 doubletons -> ~67.4 expected
Observed: 163 within-pair doubletons
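The arithmetic behind this expectation can be reproduced in a few lines. The sketch assumes the 118 matched pairs shown on the slide (236 individuals) and that the two copies of a doubleton sit in two different people, so the second carrier is the matched partner with probability 1/235 under random placement.

```python
n_pairs = 118                      # matched case-control pairs (from the slide)
n_samples = 2 * n_pairs            # 236 individuals
n_doubletons = 15_829              # variants seen exactly twice
observed_within_pair = 163         # doubletons shared by a matched pair

# Given one carrier, 1 of the remaining 235 individuals is the matched partner.
p_within_pair = 1 / (n_samples - 1)

expected_within_pair = n_doubletons * p_within_pair
enrichment = observed_within_pair / expected_within_pair

print(round(expected_within_pair, 1), round(enrichment, 2))
```

An observed count well above the expectation is what successful ancestry matching should produce, since matched individuals are more likely to share rare alleles.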
X:0 missense mutations
While one is intrigued by variants seen in 5 or more copies in cases and
not at all in controls – no evidence using permutation that there is a
significant excess of such variants…
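A permutation check of this kind could look like the sketch below: count variants with at least five copies in cases and none in controls, then shuffle case-control labels many times to see how often chance alone gives a count at least as large. The toy genotypes are invented, and free shuffling of labels is an assumption; a matched design might instead swap labels within pairs.

```python
import random


def count_case_only(genotypes, is_case, min_copies=5):
    """Number of variants with >= min_copies in cases and 0 in controls.

    genotypes: one list per variant of non-reference allele counts per sample.
    is_case: one boolean per sample.
    """
    n = 0
    for counts in genotypes:
        case_copies = sum(c for c, case in zip(counts, is_case) if case)
        control_copies = sum(c for c, case in zip(counts, is_case) if not case)
        if case_copies >= min_copies and control_copies == 0:
            n += 1
    return n


def permutation_p_value(genotypes, is_case, n_perm=1000, seed=0):
    """Observed count and its one-sided permutation p-value."""
    rng = random.Random(seed)
    observed = count_case_only(genotypes, is_case)
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if count_case_only(genotypes, labels) >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return observed, (hits + 1) / (n_perm + 1)


# Toy data: 10 cases then 10 controls, two variants.
is_case = [True] * 10 + [False] * 10
genotypes = [
    [1] * 5 + [0] * 15,            # five copies, all in cases
    [1] + [0] * 11 + [1] + [0] * 7,  # one case and one control carrier
]
observed, p_value = permutation_p_value(genotypes, is_case)
print(observed, p_value)
```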
Specific nonsense mutations in cases
or controls only
Impossible to pick out which might be relevant
Aggregations of rare nonsense mutations in
single genes, all in cases only
• Encouraging – many
genes where 1-3% of
cases and no controls
harbor nonsense
mutations – best case
scenario?
Genes with multiple nonsense mutations
seen only in cases or controls
Many genes with 2 or more nonsense mutations seen only in cases –
not appreciably different from rate in controls suggesting vast majority is
simply the background rate at which such variants occur…
Challenges of connecting rare variation
to complex phenotype
• Variation must be considered in aggregate per gene (or pathway…) rather than
individually
• In phenotypically relevant genes, many rare variants will be neutral (i.e.,
background rate is high)
• Many documented cases exist where gain and loss of function mutations in same
gene have opposite effects on phenotype
• Polygenicity will not be our friend here…realistically, no reason to think much
smaller samples than in GWAS will be required
• At this point, the best case scenarios of highly penetrant rare mutations (that
would have escaped prior large linkage and GWAS studies), aggregations of very
rare alleles that explain 1% of the overall variance in risk, etc – cannot be
distinguished from the background distribution of test statistics
• Some opportunistic models (.1%-.5% variants with OR~5-10, high penetrance
recessive subtypes) may be able to be detected and confirmed through follow-up
soon…no reason to have assurance these exist however
Parallel analysis tracks will be taken
• High MZ/DZ ratio suggests potential recessive component: search for
excess autozygosity, IBD=2 sharing with affected sibs then homozygosity
for rare allele; compound heterozygosity for rare alleles
• Highly deleterious alleles: Identify all non-synonymous/nonsense/splice
variants unique to the study, not seen in 1000 Genomes or external
control exome data (mostly singletons, very rare and de novo variants –
perhaps ranked by predicted impact/PolyPhen) and compare burden in
cases versus controls (testing a severe Mendelian mutation model)
• Heritable low frequency: Take all standing variation, observed two or more
times in the study and perform sensitive test of gene-wide variation using
C-alpha test of overdispersion (testing for effect of rare and low frequency
variants of modest/intermediate penetrance across the gene)
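For the third track, a minimal sketch of a C-alpha style statistic is given below. For each variant with n copies in the study, y of them in cases, and p0 the proportion of cases, it sums (y - n p0)^2 - n p0 (1 - p0) across the gene and scales by the binomial variance of that quantity. This reflects one reading of the published statistic, not the PLINK/Seq code; singleton handling and permutation-based p-values are left out, and the variant counts are invented.

```python
from math import comb, erfc, sqrt


def c_alpha(variant_counts, p0):
    """C-alpha style overdispersion test for one gene.

    variant_counts: list of (copies_in_cases, total_copies) per variant,
        restricted to variants seen two or more times.
    p0: expected share of copies in cases under the null.
    Returns (T, Z, one-sided p-value from a normal approximation).
    """
    t = sum((y - n * p0) ** 2 - n * p0 * (1 - p0) for y, n in variant_counts)

    # Variance of T: for each variant, the expected squared deviation of
    # its contribution under Binomial(n, p0).
    c = 0.0
    for _, n in variant_counts:
        null_var = n * p0 * (1 - p0)
        for u in range(n + 1):
            prob = comb(n, u) * p0**u * (1 - p0) ** (n - u)
            c += ((u - n * p0) ** 2 - null_var) ** 2 * prob
    if c == 0:
        raise ValueError("variance is zero (e.g. only singletons supplied)")

    z = t / sqrt(c)
    p_value = 0.5 * erfc(z / sqrt(2))  # upper tail of the standard normal
    return t, z, p_value


# Invented gene: variants seen only in cases or only in controls.
counts = [(4, 4), (0, 4), (3, 3), (0, 3), (2, 2)]
t_stat, z_score, p_val = c_alpha(counts, p0=0.5)
print(t_stat, z_score, p_val)
```

Because deviations are squared, variants concentrated in cases and variants concentrated in controls both add to the statistic, which is why the test suits genes where rare alleles may act in opposite directions.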
Flexible, extensible data representation (variants, genotypes and
meta-data)
A number of ways to use the library
command line, R, web, C/C++
Efficient random-access for very large datasets
Key reference datasets
dbSNP, 1000G, PolyPhen2, gene transcript and sequence information
Large suite of up-to-date methods available
Madsen-Browning (+/- variable threshold), Li-Leal, C-alpha, etc.
Tools for analysis of variation data from next-generation sequencing platforms: the
PLINK/Seq library
http://pngu.mgh.harvard.edu/purcell/pseq/
Shaun Purcell
PSEQ Features
• Individual statistics
• Variant statistics
• Single-locus association
• Regional association
• Incorporation of annotation information
Data Sharing
• Autism exome data made available
– Providing the gold-standard calibration for variant
calling in other contemporaneous Broad exome
studies
– Control variable sites and counts made available
for comparison with other Broad exome studies
– First batch of raw data deposited to dbGAP
exchange area – NIMH controls broadly consented
for general medical use
Data Sharing
• Summary database of sites discovered, non-
reference allele counts and target by target
coverage provide invaluable cross-study
information:
– Technical validation of low frequency sites
– Ability to evaluate whether specific sites or categories
of variants in certain genes are commonly, rarely or
never seen
– Greatly enhance selection for follow-up
– No individual level data or phenotype information
need be exchanged
Already in use across autism, schizophrenia, released 1000G studies -
would love to collaborate with NHLBI & cancer studies at this level
Credits
• ARRA Autism sequencing
– Mark Daly
– Christine Stevens
– Stacey Gabriel
– Broad Sequencing Platform
• Jen Baldwin, Jane Wilkinson
Joe Buxbaum, Bernie Devlin,
Richard Gibbs, Jerry
Schellenberg, Jim Sutcliffe
NIMH, NHGRI
• Broad GSA team
– Mark DePristo
– Eric Banks
– Kiran Garimella
– Ryan Poplin
• Exome analysis
– Mark Daly
– Manny Rivas
– Jared Maguire
– Ben Voight
– Shaun Purcell
– Kathryn Roeder & Bernie Devlin

Giab agbt SVs_2019
 
Charles River Pathology Associates Capabilities
Charles River Pathology Associates CapabilitiesCharles River Pathology Associates Capabilities
Charles River Pathology Associates Capabilities
 

Autism exome sequencing design, analysis and data processing

  • 1. Autism exome sequencing: design, data processing and analysis Benjamin Neale Analytic and Translational Genetics Unit, MGH & Medical & Population Genetics Program, Broad Institute
  • 2. Direct Sequencing has Enormous Potential…
      • Ng, Shendure: Miller syndrome, 4 cases
          – Exome sequencing reveals causal mutations in DHODH
      • Lifton: Undiagnosed congenital chloride diarrhoea (consanguineous)
          – Exome sequencing reveals homozygous SLC26A3 chloride ion transporter mutation
          – Returns a diagnosis of CLD (GI), not the suspected Bartter syndrome (renal)
      • Worthey, Dimmock: 4-year-old with severe, unusual IBD
          – Exome sequencing reveals XIAP mutation (at a highly conserved amino acid)
          – Proimmune dysregulation: opt for bone marrow transplant over chemo
      • Jones, Marra: Secondary lung carcinoma unresponsive to erlotinib
          – Genome and transcriptome sequencing reveals defects
          – Directs alternative sunitinib therapy
      • Mardis, Wilson: Acute myelocytic leukaemia, but not the classical translocation
          – Genome sequencing (1 week + analysis) reveals PML-RARA translocation
          – Directs ATRA (all-trans retinoic acid) treatment decision
  • 3. …and tremendous challenges
      • Managing and processing vast quantities of data into variation
      • Interpreting millions of variants per individual
          – An individual’s genome harbors
              • ~80 point nonsense mutations
              • ~100-200 frameshift mutations
              • Tens of splice mutants, CNV induced gene disruptions
      For very few of these do we have any conclusive understanding of their medical impact in the population
  • 4. Successes to date rely on factors that may not apply generally to common endpoints
      • Mendelian disorders
          – Single family rare autosomal recessive (linkage may target 1% of genome, 2 ‘hits’ in the same gene very unlikely)
          – Single (or ‘near single’) gene disorders where nearly all families carry mutations in the same underlying gene
      • Somatic or de novo mutations
          – Extremely rare background rate
  • 5. Autism exome sequencing
      • In progress: ARRA supported by NIMH & NHGRI
      • Collaboration between sequencing centers (Baylor & Broad) and Y2 follow-up in autism genetics labs (Buxbaum, Daly, Devlin, Schellenberg, Sutcliffe)
      • Targeted production by year's end of 1000 cases and 1000 controls (500/500 from each site)
  • 6. Exome production plan
      • Baylor: 1000 samples (Nimblegen capture, SOLiD sequencing)
      • Broad: 1000 samples (Agilent capture, Illumina sequencing)
      • Predominantly cases and controls pairwise matched with GWAS data (one batch of 50 trios currently being run)
      • All samples are available from the NIMH repository
  • 7. Broad Exome Production
      • ~700 exomes completely sequenced, with variant calling recently completed
      • ~300 completed earlier in the summer and fully analyzed (basis of later analysis slides)
      • Main production conducted with matched case-control pairs traveling together through the sequencing lab and the computational runs of variant calling
  • 8. From unmapped reads to true genetic variation in next-generation sequencing data
      • Raw short reads (Solexa, SOLiD, 454): a single run of a sequencer generates ~50M ~75bp short reads for analysis
      • Mapping and alignment: the origin of each read from the human genome sequence is found
      • Quality calibration and annotation: the quality of each read is calibrated and additional information annotated for downstream analyses
      • Identifying genetic variation: SNPs and indels from the reference are found where the reads collectively provide evidence of a variant
      [Figure: reads aligned to Region 1 and Region 2 of the human reference genome at each stage, with a SNP highlighted in the final stage]
  • 9. Partnership: Genome Sequencing and Analysis (GSA) team @ Broad
      • Genome Sequencing and Analysis (GSA) develops core capabilities for genetic analysis
          – Data processing and analysis methods
          – Technology development
          – High-end software engineering
          – High-throughput data processing for MPG exome projects with MPG-Firehose
      • Staffed by full-time research scientists in MPG
          – PhDs and BAs in biochemistry, engineering, computer science, mathematics, and genetics
      • Group Leader: Mark A. DePristo
      • Analysis Team: Kiran Garimella [Lead], Chris Hartl, Corin Boyko
      • Development Team: Eric Banks [Lead], Guillermo del Angel, Menachem Fromer, Ryan Poplin
      • Software Engineering: Matthew Hanna, Khalid Shakir, Aaron McKenna
  • 10. Developing cutting-edge data processing and analysis methods
      • Local realignment
      • Base quality score recalibration
      • Variation discovery and genotyping
      • Read-backed phasing
      • VariantEval
      • Adaptive error modeling
      [Figure label: Novel SNPs found]
  • 11. Challenges
      • Mapping/alignment
      • Quality score recalibration
      • Calling variants
      • Evaluating set of variant sites
      • Estimating genotypes
  • 12. Mapping and alignment algorithm
      Finding the true origin of each read is a computationally demanding and important first step
      • Solexa: BWA
          – Robust, accurate ‘gold standard’ aligner for NGS
          – Developed by Li and Durbin
          – Recently replaced MAQ, also by Li and Durbin, used for last 2 years
      • 454: SSAHA
          – Hash-based aligner with high sensitivity and specificity with longer reads
      • SOLiD: Corona
          – ABI-designed tool for aligning in color-space
      [Figure: an enormous pile of short reads from NGS is aligned against the reference genome (Regions 1 to 3) and written to SAM/BAM files. The aligner detects the correct read origin and flags those reads with high certainty; it detects ambiguity in the origin of reads and flags them as uncertain]
  • 13. The SAM file format
      • Data sharing was a major issue with the 1000 Genomes Project
          – Each center, technology and analysis tool used its own idiosyncratic file formats; no one could exchange data
      • The Sequence Alignment/Map (SAM) file format was designed to capture all of the critical information about NGS data in a single indexed and compressed file
          – Becoming a standard and is now used by production informatics, MPG, and cancer analysis groups at the Broad
      • Has enabled sharing of data across centers and the development of tools that work across platforms
      • More info at http://samtools.sourceforge.net/
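To make the format concrete, here is a minimal sketch of reading one SAM alignment line in Python. It relies only on the 11 mandatory tab-separated columns and two FLAG bits from the SAM specification; the `SamRecord` class, the `parse_sam_line` helper and the example read are illustrative names and data, not part of samtools or of the original slides.

```python
from dataclasses import dataclass


@dataclass
class SamRecord:
    """A subset of the 11 mandatory SAM columns (hypothetical container)."""
    qname: str   # read name
    flag: int    # bitwise FLAG
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    seq: str     # read bases
    qual: str    # per-base qualities (ASCII encoded)

    @property
    def is_unmapped(self) -> bool:
        return bool(self.flag & 0x4)    # FLAG bit 0x4: segment unmapped

    @property
    def is_reverse(self) -> bool:
        return bool(self.flag & 0x10)   # FLAG bit 0x10: reverse strand


def parse_sam_line(line: str) -> SamRecord:
    """Parse one alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has 11 mandatory fields")
    # Columns 7-9 (RNEXT, PNEXT, TLEN) are skipped in this sketch.
    return SamRecord(
        qname=fields[0], flag=int(fields[1]), rname=fields[2],
        pos=int(fields[3]), mapq=int(fields[4]), cigar=fields[5],
        seq=fields[9], qual=fields[10],
    )


# Made-up example read aligned to the reverse strand of chr1.
example = "read1\t16\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII"
rec = parse_sam_line(example)
```

Because every aligner writes the same columns, downstream tools only need a reader like this once, which is the cross-platform benefit the slide describes.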
  • 15. Base Quality Score Recalibration
      [Figure panels: SLX GA, 454, SOLiD, HiSeq, Complete Genomics; each panel shows first-of-pair reads and second-of-pair reads]
  • 16. How do indel realignment and base quality recalibration affect SNP calling?
      • 6.5% of calls on raw reads are likely false positives due to indels
      • The process doesn’t remove real SNPs
      http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration
      http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels
  • 17. Sensitivity and specificity improved by simultaneous variant calling in 50-100 individuals
      • Sensitivity
          – Greater statistical evidence compiled for true variants seen in more than one individual
      • Specificity
          – Deviations in metrics that flag false positive sites become much more statistically significant
              • Allele balance: departure from 50:50 (reference:non-reference) in heterozygotes
              • Strand bias: non-reference allele preferentially seen on one of the two DNA strands
              • Proportion of reads with low mapping quality
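The first two specificity metrics can be sketched from pooled read counts. The slides do not say which statistics were used, so the choices below are assumptions: allele balance as the reference-read fraction in heterozygotes, and strand bias as a two-sided Fisher exact test on a 2x2 table of allele by strand. Function names are hypothetical.

```python
from math import comb


def allele_balance(ref_reads: int, nonref_reads: int) -> float:
    """Fraction of reference reads at heterozygous calls; ~0.5 is expected."""
    total = ref_reads + nonref_reads
    if total == 0:
        raise ValueError("no reads at this site")
    return ref_reads / total


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    For strand bias (an assumed layout): a, b = reference reads on the
    +/- strand; c, d = non-reference reads on the +/- strand.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as observed.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))
```

Pooling counts across 50-100 samples before computing these values is what gives a small per-sample deviation enough statistical weight to flag a site.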
  • 18. The Genome Analysis Toolkit (GATK) enables rapid development of efficient and robust analysis tools
      • All of these tools have been developed in the GATK
      • They are memory and CPU efficient, cluster friendly and easily parallelized
      • They are now publicly available and are being used at many sites around the world
      • Supports any BAM-compatible aligner
      [Figure: GATK infrastructure. The traversal engine is provided by the framework; the analysis tool is implemented by the user. Pipeline: Initial alignment -> MSA realignment -> Q-score recalibration -> Multi-sample genotyping -> SNP filtering. Credit: M. DePristo]
      More info: http://www.broadinstitute.org/gsa/wiki/
  • 19. Additional key advance
      • Correcting alignment artifacts and machine-specific biases in read base calling and quality score assignment enables machine-independent variant identification and genotype calling
      • 1000 Genomes data even contains individuals with data merged from multiple sequencing platforms!
  • 20. For our project this is key
      • With two centers generating data via distinct experimental and sequencing procedures, harmonizing data is integral to downstream analysis
  • 21. Stratified analyses
      • Because both processes will not afford equivalent coverage of all targets:
          – Critical that case-control balance and individual pairings are preserved within center
          – Final analysis will be stratified by center such that rare technical differences, lack of coverage on one or the other platform, etc., can be managed robustly
  • 22. Secondary data cleaning is critical
      • Identify a quality set of individuals
      • Identify a quality set of targets
      • Identify a quality set of variants
  • 23. Primary cleaning
      • Identify a quality set of individuals
      • Identify a quality set of targets
      • Identify a quality set of variants
  • 24. Sample composition thus far
      Batch   Case   Control   Total
      1         25        25      50
      2         22        21      43
      3         41        12      53
      4         25        25      50
      5         25        25      50
      6         25        25      50
      Total    164       132     296
  • 25. Individual Cleaning
      • Mean depth of coverage for all targets
      • Concordance rate with ‘super clean’ SNP chip
          – Contamination checks
  • 26. Mean Coverage per Sample
      [Figure: per-sample mean coverage, with one outlier sample annotated "Exclude this one"]
  • 27. Concordance and contamination checks
      • 1 of 296 samples fails the concordance check (genotype call discrepancy) with SNP chip data (sample swap)
      • 1 of 295 samples fails the contamination check (proportion of reads calling non-reference alleles at SNP chip homozygous sites indicates >5% DNA from another individual)
      • Advance a fully validated set of individuals to downstream analysis
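The two per-sample checks can be sketched as follows. This is a simplified illustration under stated assumptions: genotypes are coded as non-reference allele counts (0, 1, 2), read counts are `(reference, non-reference)` tuples per site, and the raw non-reference read proportion at chip homozygous-reference sites is used as the contamination signal. The function names and toy data are hypothetical; the slides do not give the actual estimator.

```python
def concordance_rate(seq_genotypes: dict, chip_genotypes: dict) -> float:
    """Fraction of sites called on both platforms where the genotypes agree."""
    shared = [site for site in seq_genotypes if site in chip_genotypes]
    if not shared:
        raise ValueError("no overlapping sites")
    agree = sum(seq_genotypes[s] == chip_genotypes[s] for s in shared)
    return agree / len(shared)


def nonref_read_fraction(read_counts: dict, chip_genotypes: dict) -> float:
    """Share of non-reference reads at sites the chip calls homozygous reference.

    read_counts maps site -> (ref_reads, nonref_reads). A clean sample should
    show only sequencing error here; a high value suggests foreign DNA.
    """
    ref_total = nonref_total = 0
    for site, (ref, nonref) in read_counts.items():
        if chip_genotypes.get(site) == 0:   # homozygous reference on the chip
            ref_total += ref
            nonref_total += nonref
    if ref_total + nonref_total == 0:
        raise ValueError("no reads at homozygous-reference sites")
    return nonref_total / (ref_total + nonref_total)


# Toy data: four chip sites, one discordant call (s4).
chip = {"s1": 0, "s2": 1, "s3": 2, "s4": 0}
seq = {"s1": 0, "s2": 1, "s3": 2, "s4": 1}
reads = {"s1": (95, 5), "s2": (50, 50), "s4": (90, 10)}

concordance = concordance_rate(seq, chip)          # 3 of 4 sites agree
nonref_frac = nonref_read_fraction(reads, chip)    # 15 of 200 reads
```

A very low concordance rate points to a sample swap, while a normal concordance rate combined with an elevated non-reference read fraction points to contamination, which is why the slide treats them as separate checks.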
  • 28. Number of non-reference genotypes per individual
      [Figure annotation: 1500 high frequency sites]
  • 29. Primary cleaning
      • Identify a quality set of individuals
      • Identify a quality set of targets
      • Identify a quality set of variants
  • 30. Distribution of mean target coverage
      [Figure annotation: 99.86% targets]
  • 31. Depth vs GC Content
      • >95% of the targeted exons sequenced between 10 and 500x depth
          – Defined as the successfully evaluated exome
  • 32. SNP rate as a function of GC bin
      [Figure: SNP rate (y-axis, 0 to 8) by GC-content bin, from (30,35] to (65,~]]
  • 33. % discovered variants that are singletons
      • 50-250x: half the data, median coverage, singleton plateau at 34%
      • Lowest bin badly deficient in singletons, but higher rate of called variants overall…
      • 95% of targets covered between 10 and 500x
  • 34. Primary cleaning
      • Identify a quality set of individuals
      • Identify a quality set of targets
      • Identify a quality set of variants
  • 35. Defining the set of variant sites
      • Define the technical parameters of true polymorphisms using a core set of ‘gold-standard’ true positive variant sites:
          – Sites contained in a reference sample (e.g., dbSNP, another exome or genome study)
          – High quality target depth range (50-250): no true sites should be missed, and no excess coverage suggesting mapping concerns
      • Define the distribution of technical properties (balance of reference/non-reference alleles; balance of non-reference alleles on +/- strand; read mapping quality)
          – Filter non-dbSNP, non-ideal coverage calls based on these distributions
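One way to read the procedure above is: learn the acceptable range of each technical metric from the gold-standard sites, then keep only the remaining calls that fall inside every range. The sketch below assumes a simple trimmed min/max range per metric; the metric names, the 1% trim and the helper names are illustrative choices, not the thresholds used in the study.

```python
def central_range(values, trim=0.01):
    """Range covering the central (1 - 2*trim) share of the values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim)          # number of values dropped per tail
    return ordered[k], ordered[len(ordered) - 1 - k]


def build_filters(gold_sites, metrics, trim=0.01):
    """Learn one acceptance interval per metric from gold-standard sites."""
    return {m: central_range([site[m] for site in gold_sites], trim)
            for m in metrics}


def passes(site, filters):
    """True if the candidate site lies inside every learned interval."""
    return all(lo <= site[m] <= hi for m, (lo, hi) in filters.items())


# Made-up gold-standard sites (e.g. dbSNP sites at 50-250x depth).
METRICS = ("allele_balance", "strand_balance", "mapq")
gold = [{"allele_balance": 0.4 + 0.002 * i,
         "strand_balance": 0.3 + 0.004 * i,
         "mapq": 50 + i % 10} for i in range(100)]
filters = build_filters(gold, METRICS)

good_call = {"allele_balance": 0.50, "strand_balance": 0.50, "mapq": 55}
bad_call = {"allele_balance": 0.90, "strand_balance": 0.50, "mapq": 55}
```

Hard per-metric cutoffs like these are the simplest version of the idea; combining the metrics into a single score per site is the direction the next slide describes.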
  • 37. In testing now: Variant Quality Score Recalibration
      • Enables definition of a data set with user-defined sensitivity and specificity
      • ...moving towards a posterior estimate for each site
      http://www.broadinstitute.org/gsa/wiki/index.php/Variant_quality_score_recalibration
  • 38. On to analysis…
      • 204,123 variants pass all filters across 294 samples…number of all variants & singletons continues to increase as data is added
      • How do we assess how this QC process performed downstream? Is the experiment working?
  • 39. Has the matching worked?
      • Matched samples based on MDS distance from combined GWAS data
      • Consider the set of doubletons (two copies in the dataset)
      • Overall, we should see comparable numbers of variants seen in 2 cases or 2 controls versus 1 case and 1 control; and we should see an excess of 1 case:1 control variants in matched pairs
  • 40. Do we see appropriate case:control statistics for rare variants?
  • 41. Visual Representation
      [Figure: 118 matched case-control pairs (1, 2, 3, 4, …, 118); a doubleton falling within one matched pair vs. a doubleton shared across two different pairs]
      • Expectation: 1/235 for within-pair doubletons; 15,829 doubletons -> ~67.4
      • Observed: 163 instances
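The expectation on this slide can be reproduced with a few lines of arithmetic. The sketch assumes each doubleton is carried by two different individuals and that, if matching had no effect, the second carrier would be equally likely to be any of the other 235 individuals; variable names are illustrative.

```python
n_pairs = 118                       # matched case-control pairs
n_individuals = 2 * n_pairs         # 236 samples
n_doubletons = 15_829               # variants seen exactly twice
observed_within_pair = 163          # doubletons whose carriers are a matched pair

# Given the first carrier, exactly one of the remaining 235 individuals
# is that person's matched partner.
p_within_pair = 1 / (n_individuals - 1)

expected_within_pair = n_doubletons * p_within_pair         # about 67.4
fold_enrichment = observed_within_pair / expected_within_pair   # about 2.4

print(f"expected {expected_within_pair:.1f}, observed {observed_within_pair}, "
      f"fold {fold_enrichment:.2f}")
```

Matched partners share rare variants about 2.4 times more often than chance would predict, which is the signature of matching on ancestry having worked.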
  • 42. X:0 missense mutations
      • Variants seen in 5 or more copies in cases and not at all in controls are intriguing, but permutation gives no evidence of a significant excess of such variants…
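A permutation check of this kind can be sketched as follows: count the variants with at least 5 copies in cases and none in controls, then recount after shuffling the case/control labels many times. The slides do not describe the permutation scheme, so shuffling labels freely across all samples (rather than swapping within matched pairs) is an assumption here, as are the function names and the data layout.

```python
import random


def count_case_only(variants, is_case, min_copies=5):
    """Number of variants with >= min_copies in cases and 0 in controls.

    variants: list of dicts mapping individual index -> non-reference copies.
    is_case:  list of booleans, one per individual.
    """
    n = 0
    for carriers in variants:
        case_copies = sum(c for i, c in carriers.items() if is_case[i])
        control_copies = sum(c for i, c in carriers.items() if not is_case[i])
        if case_copies >= min_copies and control_copies == 0:
            n += 1
    return n


def permutation_p_value(variants, is_case, n_perm=1000, min_copies=5, seed=0):
    """Empirical p-value for an excess of case-only variants."""
    rng = random.Random(seed)
    observed = count_case_only(variants, is_case, min_copies)
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)   # keeps the number of cases fixed
        if count_case_only(variants, labels, min_copies) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)
```

Because the same variant set is rescored under every shuffle, the test accounts for the large number of rare variants examined, which is why individual 5:0 variants can look striking without the overall count being unusual.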
  • 43. Specific nonsense mutations in cases or controls only
      • Impossible to pick out which might be relevant
  • 44. Aggregations of rare nonsense mutations in single genes, all in cases only
      • Encouraging: many genes where 1-3% of cases and no controls harbor nonsense mutations. Best case scenario?
  • 45. Genes with multiple nonsense mutations seen only in cases or controls
      • Many genes with 2 or more nonsense mutations seen only in cases, but not appreciably different from the rate in controls, suggesting the vast majority is simply the background rate at which such variants occur…
  • 46. Challenges of connecting rare variation to complex phenotype
      • Variation must be considered in aggregate per gene (or pathway…) rather than individually
      • In phenotypically relevant genes, many rare variants will be neutral (i.e., background rate is high)
      • Many documented cases exist where gain and loss of function mutations in the same gene have opposite effects on phenotype
      • Polygenicity will not be our friend here…realistically, no reason to think much smaller samples than in GWAS will be required
      • At this point, the best case scenarios (highly penetrant rare mutations that would have escaped prior large linkage and GWAS studies, aggregations of very rare alleles that explain 1% of the overall variance in risk, etc.) cannot be distinguished from the background distribution of test statistics
      • Some opportunistic models (0.1%-0.5% variants with OR~5-10, high penetrance recessive subtypes) may be able to be detected and confirmed through follow-up soon…no reason to have assurance these exist, however
  • 47. Parallel analysis tracks will be taken
      • High MZ/DZ ratio suggests potential recessive component: search for excess autozygosity, IBD=2 sharing with affected sibs then homozygosity for rare allele; compound heterozygosity for rare alleles
      • Highly deleterious alleles: identify all non-synonymous/nonsense/splice variants unique to the study, not seen in 1000 Genomes or external control exome data (mostly singletons, very rare and de novo variants, perhaps ranked by predicted impact/PolyPhen) and compare burden in cases versus controls (testing a severe Mendelian mutation model)
      • Heritable low frequency: take all standing variation observed two or more times in the study and perform a sensitive test of gene-wide variation using the C-alpha test of overdispersion (testing for effect of rare and low frequency variants of modest/intermediate penetrance across the gene)
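The C-alpha test mentioned in the last track can be sketched for a single gene. The form used below is the commonly published one: for each variant with n copies, y of them in cases, compare (y - n*p0)^2 with its binomial expectation n*p0*(1 - p0), where p0 is the proportion of cases in the sample. The function name and input layout are hypothetical, and this is an illustration rather than the PLINK/Seq implementation.

```python
from math import comb, erfc, sqrt


def c_alpha(case_copies, total_copies, p0):
    """C-alpha overdispersion test for one gene.

    case_copies[i]  : copies of variant i observed in cases
    total_copies[i] : copies of variant i observed in cases + controls
    p0              : proportion of cases in the sample
    Returns (T, Z, one-sided p-value).
    """
    t = 0.0
    var = 0.0
    for y, n in zip(case_copies, total_copies):
        expected_sq = n * p0 * (1 - p0)
        # Excess of the squared deviation over its binomial expectation.
        t += (y - n * p0) ** 2 - expected_sq
        # Variance of that term under the binomial null, summed over all
        # possible case counts u = 0..n.
        for u in range(n + 1):
            pmf = comb(n, u) * p0 ** u * (1 - p0) ** (n - u)
            var += ((u - n * p0) ** 2 - expected_sq) ** 2 * pmf
    if var == 0:
        raise ValueError("no informative variants (e.g. only singletons)")
    z = t / sqrt(var)
    p_value = 0.5 * erfc(z / sqrt(2))   # upper tail of the standard normal
    return t, z, p_value
```

For example, `c_alpha([4, 0, 5, 0], [4, 4, 5, 5], 0.5)` scores a gene whose variants sit entirely in cases or entirely in controls; the statistic is large because it rewards dispersion in either direction, so risk and protective variants in the same gene do not cancel out.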
  • 48. Tools for analysis of variation data from next-generation sequencing platforms: the PLINK/Seq library
      • Flexible, extensible data representation (variants, genotypes and meta-data)
      • A number of ways to use the library: command line, R, web, C/C++
      • Efficient random-access for very large datasets
      • Key reference datasets: dbSNP, 1000G, PolyPhen2, gene transcript and sequence information
      • Large suite of up-to-date methods available: Madsen-Browning (+/- variable threshold), Li-Leal, C-alpha, etc.
      http://pngu.mgh.harvard.edu/purcell/pseq/
      Shaun Purcell
  • 49. PSEQ Features
      • Individual statistics
      • Variant statistics
      • Single-locus association
      • Regional association
      • Incorporation of annotation information
  • 50. Data Sharing
      • Autism exome data made available
          – Providing the gold-standard calibration for variant calling in other contemporaneous Broad exome studies
          – Control variable sites and counts made available for comparison with other Broad exome studies
          – First batch of raw data deposited to dbGaP exchange area
          – NIMH controls broadly consented for general medical use
  • 51. Data Sharing
      • Summary database of sites discovered, non-reference allele counts and target-by-target coverage provides invaluable cross-study information:
          – Technical validation of low frequency sites
          – Ability to evaluate whether specific sites or categories of variants in certain genes are commonly, rarely or never seen
          – Greatly enhances selection for follow-up
          – No individual level data or phenotype information need be exchanged
      Already in use across autism, schizophrenia, released 1000G studies. Would love to collaborate with NHLBI & cancer studies at this level
  • 52. Credits
      • ARRA Autism sequencing
          – Mark Daly
          – Christine Stevens
          – Stacey Gabriel
          – Broad Sequencing Platform
              • Jen Baldwin, Jane Wilkinson
      Joe Buxbaum, Bernie Devlin, Richard Gibbs, Jerry Schellenberg, Jim Sutcliffe
      NIMH, NHGRI
      • Broad GSA team
          – Mark DePristo
          – Eric Banks
          – Kiran Garimella
          – Ryan Poplin
      • Exome analysis
          – Mark Daly
          – Manny Rivas
          – Jared Maguire
          – Ben Voight
          – Shaun Purcell
          – Kathryn Roeder & Bernie Devlin