Long oligos (DNA) synthesized on arrays. RNA baits synthesized from DNA oligo template. RNA baits hybridized to DNA sequencing library. Targets captured using beads and biotin-labeled baits. RNA bait degraded, leaving sequencing library enriched for target regions.
Data Flow: FASTQ files generated by Illumina pipeline. Aligned to reference genome (hg18, excluding _random, unmapped, and hap) using Novoalign. SAM/BAM used extensively. Follow Broad Institute GATK pipeline for exome capture. Use Picard Java library for quality assessment. Processed BAM files available via local http for browsing.
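The reference filtering step above can be sketched as a simple name filter. This is a minimal illustration, not the pipeline's actual code: it assumes UCSC-style sequence names, and the `keep_contig` helper, the `chrUn` prefix used to stand in for "unmapped", and the example names are all hypothetical.

```python
def keep_contig(name):
    """Return True if a reference sequence should stay in the alignment target.

    Assumption: names follow UCSC conventions, so unplaced sequences contain
    "_random" and alternate haplotypes contain "_hap". Treating names that
    start with "chrUn" as "unmapped" is a guess made for illustration only.
    """
    lowered = name.lower()
    if "_random" in lowered:
        return False
    if "_hap" in lowered:
        return False
    if lowered.startswith("chrun"):
        return False
    return True


# Hypothetical sequence names, used only to exercise the filter.
contigs = ["chr1", "chr1_random", "chr6_cox_hap1", "chrX", "chrM", "chrUn_example"]
kept = [c for c in contigs if keep_contig(c)]
print(kept)
```

Filtering by name before building the aligner index keeps reads from being split between a primary chromosome and a near-duplicate alternate sequence.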
Realignment around Indels. The problem: Aligners align each read independently. Potentially leads to increased error rates around indels. A potential solution: Locally realign reads in regions that might harbor an indel. Goal is to align reads overlying indels more accurately, reducing errors in each read and, in turn, reducing SNV call error rates.
Quality Recalibration: Since most SNV callers will rely on quality scores to estimate error probabilities, having the best possible estimates for error rates is important. Reported error rates from the Illumina sequencer generally reflect technical parameters of the base call process, but not other systematic biases. Quality recalibration can include covariates to account for systematic biases: cycle count, dinucleotide context, original quality, and sample/library variables.
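The idea of recalibrating against covariates can be sketched as an empirical error-rate table. This is a simplified illustration and not the GATK implementation: it assumes each base observation has already been labeled as a mismatch or match against the reference (with likely true variants excluded beforehand), and the `recalibration_table` function and its add-one smoothing are choices made here for the example.

```python
import math
from collections import defaultdict


def recalibration_table(observations):
    """Build an empirical quality table keyed by covariates.

    observations: iterable of (reported_q, cycle, dinucleotide, is_mismatch).
    Returns a dict mapping (reported_q, cycle, dinucleotide) to an empirical
    Phred-scaled quality, Q = -10 * log10(error_rate).
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [mismatches, total]
    for reported_q, cycle, dinucleotide, is_mismatch in observations:
        key = (reported_q, cycle, dinucleotide)
        counts[key][0] += int(is_mismatch)
        counts[key][1] += 1

    table = {}
    for key, (mismatches, total) in counts.items():
        # Add-one smoothing (an assumption of this sketch) avoids log10(0)
        # for bins with no observed mismatches.
        error_rate = (mismatches + 1) / (total + 2)
        table[key] = -10 * math.log10(error_rate)
    return table


# Hypothetical bin: 98 bases reported as Q30 at cycle 1 after "AC", 9 mismatches.
obs = [(30, 1, "AC", i < 9) for i in range(98)]
table = recalibration_table(obs)
print(table[(30, 1, "AC")])
```

In this made-up bin the sequencer reported Q30, but the observed mismatch rate corresponds to a much lower empirical quality, which is the kind of discrepancy recalibration is meant to correct.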
Variant Calling and Evaluation: A developing art
Complete Genome Sequencing: Complete Genomics Data
Data Delivery: results via USB. Storage: Sizes are LARGE, 400 GB per sample as delivered with raw reads included. Should use 2-location backed-up storage. Not trivial to find such storage, so might resort to multiple USB drives. Minimize: data movement; keeping multiple copies indefinitely.
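A quick back-of-the-envelope calculation shows how fast the storage requirement grows. The 400 GB per sample and the two storage locations come from the slide; the sample count and the `storage_needed_gb` helper are hypothetical.

```python
def storage_needed_gb(n_samples, gb_per_sample=400, n_locations=2):
    """Total storage in GB when every sample is kept at each location."""
    return n_samples * gb_per_sample * n_locations


# Hypothetical project size: 10 samples.
total_gb = storage_needed_gb(10)
print(f"{total_gb} GB, about {total_gb / 1000:.0f} TB")
```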
Data Delivery, Storage, Processing. Data are typically tab-delimited text files, so Excel can be useful for examining individual small files. Generally, command-line tools needed. MacOS and Linux only supported operating systems, but Windows might work.... Some analyses (snpdiff) require large memory.
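For files too large for Excel, a tab-delimited file can be streamed one row at a time so that memory use stays small. The sketch below is a generic illustration: the column names, the `#` comment convention, and the `count_by_column` helper are assumptions, not the vendor's documented file layout.

```python
import csv
import io


def count_by_column(handle, column):
    """Stream a tab-delimited file and count the values in one column.

    Lines that are blank or start with "#" are skipped (an assumed comment
    convention); the first remaining line is treated as the header row.
    """
    data_lines = (
        line for line in handle if line.strip() and not line.startswith("#")
    )
    reader = csv.DictReader(data_lines, delimiter="\t")
    counts = {}
    for row in reader:
        counts[row[column]] = counts.get(row[column], 0) + 1
    return counts


# Hypothetical file contents; a real run would pass an open file handle.
example = io.StringIO(
    "#hypothetical header comment\n"
    "chromosome\tbegin\tend\tvarType\n"
    "chr1\t100\t101\tsnp\n"
    "chr1\t250\t252\tdel\n"
    "chr2\t300\t301\tsnp\n"
)
result = count_by_column(example, "varType")
print(result)
```

Because rows are read lazily, the same function works on a multi-gigabyte file opened with `open(path, newline="")`.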
Workflows. Tumor/Normal: Copy Number; Structural Variation; Annotated Somatic Variants. Germline: List of annotated genotypes per individual, summarized into a single file that can be used for filtering.
Frequent genetic alterations in three critical signalling pathways. The Cancer Genome Atlas Research Network. Nature 000, 1-8 (2008). doi:10.1038/nature07385
Chromatin: Chromatin is the complex of protein and DNA that makes up the chromosomes. It is not a static structure.
DNAse is an enzyme that cuts DNA at locations where DNA is accessible. These “accessible” regions have been associated with open chromatin. Regions of open chromatin are necessary for transcriptional and regulatory machinery to have access to gene neighborhoods and facilitate transcription.
DNAse Hypersensitivity: Method for finding regions of “open” chromatin. In data published with the ENCODE consortium, DNAse hypersensitive (HS) sites were shown to be correlated with: histone modification; transcription start sites; early replicating regions; transcription factor binding sites (experimentally determined by ChIP/chip, etc.). Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. The ENCODE Consortium. Nature, 2007.
DNAse HS Sites and Gene Expression: DNAse HS sites near transcription start sites are associated with actively transcribed genes.
Nucleosome Positioning: Distances between sequences in non-DNAse HS regions have an oscillating pattern with a period that corresponds to a single turn of the double helix. DNAse is known to cut preferentially in the minor groove, which is exposed every 10.4 bases when wrapped around a nucleosome. A nucleosome is wrapped by 147 base pairs when complexed with DNA. Implication: Nucleosomes are positioned in a highly organized, precise manner.
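The numbers above can be tied together with a short sketch: the number of helical turns in nucleosomal DNA, and a histogram of pairwise distances between cut positions, which is where an oscillation of roughly 10 bases would show up. The cut positions are synthetic and the function names are hypothetical; this is not the analysis used to produce the slide.

```python
from collections import Counter

HELICAL_REPEAT_BP = 10.4   # bases per exposed minor groove, from the slide
NUCLEOSOME_DNA_BP = 147    # base pairs wrapped around one nucleosome


def helical_turns(dna_bp=NUCLEOSOME_DNA_BP, repeat_bp=HELICAL_REPEAT_BP):
    """Approximate number of helical turns in the wrapped DNA."""
    return dna_bp / repeat_bp


def distance_histogram(positions, max_distance):
    """Count pairwise distances between cut positions, up to max_distance."""
    ordered = sorted(positions)
    histogram = Counter()
    for i, left in enumerate(ordered):
        for right in ordered[i + 1:]:
            distance = right - left
            if distance > max_distance:
                break  # positions are sorted, so later ones are farther away
            histogram[distance] += 1
    return histogram


# Synthetic cut positions spaced about one helical turn apart.
cuts = [0, 10, 21, 31, 42, 52]
hist = distance_histogram(cuts, 35)
print(round(helical_turns(), 1), sorted(hist.items()))
```

With real data, peaks in the histogram at multiples of roughly 10 bases would reflect the periodic exposure of the minor groove on the nucleosome surface.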