Bioc strucvariant seattle_11_09
 


This is an overview of using Bioconductor tools to identify structural variation from next-generation sequencing data.

  • Since Knudson’s famous hypothesis proposing the two-hit model, our understanding of cancer as a genetic disease has progressed to the realization that cancer is not often a function of a single gene gone awry, but probably represents a complex interaction of multiple processes in the genome including altered copy number, gene expression, transcriptional regulation, chromatin modification, sequence variation, and DNA methylation. It is vital to the goal of producing better patient outcomes to understand not only what genes are involved in a certain type of cancer, but also how these other processes affect gene regulation. In short, an integrated view of the cancer genome is necessary and is now becoming possible.
  • The first karyotypes were produced in 1956. Shown here is a comparison of a karyotype from a normal female and one from a tumor. By 1960, a karyotype of a cancer genome had revealed the presence of the Philadelphia chromosome. Although it is now known to produce the BCR-ABL fusion protein, it was not until 2001, 41 years later, that a drug, Gleevec, became available that targeted the fusion product. By applying high-throughput microarray technologies, the Cancer Genetics Branch is striving to make observations of the cancer genome that will provide a deeper understanding of the biology of cancer, to develop prognostic and diagnostic markers to improve patient-specific treatments, and to find promising targets for directed drug therapy.
  • Zooming out to look at the whole genome at once, the normal genome, with normal female DNA in red and normal male DNA in green, shows the expected differences on the X and Y chromosomes. Comparing that to a single breast cancer genome reveals the richness of the data that we are producing. Nearly every chromosome shows some copy number alteration that can be mapped to the genome to produce lists of candidate genes. But with so many alterations, it is helpful to consider multiple genomes at once, because copy number changes that occur in multiple samples are more likely to be of biological importance and not simply a product of an unstable cancer genome.

Bioc strucvariant seattle_11_09 Presentation Transcript

  • Using R and Bioconductor to Find Structural Variants in Short Read Sequencing Data. Sean Davis, MD, PhD, National Cancer Institute, National Institutes of Health, Bethesda, MD
  • Why use R and Bioconductor?
  • Phenotype: Gene Copy Number, Sequence Variation, Chromatin Structure and Function, Gene Expression, Transcriptional Regulation, DNA Methylation, Patient and Population Characteristics
  • Why structural variation?
  • Overview
    • What is structural variation and why might it be important in biology?
    • What is paired-end sequencing?
    • How can paired-end sequencing be used for finding structural variants?
    • How can the existing tools within R and Bioconductor be leveraged to find structural variants in the genome?
  • What is a structural variation?
    • Insertions
    • Deletions
    • Translocations
      • Intrachromosomal
      • Interchromosomal
    • Inversions
    • [copy number variation]
  • Importance of Structural Variation
    • Can alter gene expression both directly and indirectly
      • Deletion, insertion, or translocation that disrupts or removes transcript(s)
      • Translocation that alters regulatory environment
      • Can place two distant functional elements in proximity to each other (gene fusion events are an example)
    • Possibly change chromatin structure
  • Normal Karyotype vs. Tumor Karyotype
  • Redon et al., Nature 2006
  • A Genome View of Copy Number
  • Paired-end fragment diagram: an insert with a read at each end
  • Medvedev et al., Nature 2009
  • Paired-end reads and SV
    • For the paired-end data, determine the distribution of insert sizes
    • Find read pairs that show an unusually high or low insert size (outside mean ± 3 SD, for example)
    • Cluster these abnormally related pairs by position
    • Where there is significant clustering, there may be evidence for a structural variant
    • The type of the structural variant can be determined using the relationships between clusters
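The first four steps above can be sketched in plain Python. This is a minimal illustration, not the R/Bioconductor code used in the talk: the `(start, insert_size)` tuple layout, the function names, and the `max_gap` and `min_pairs` thresholds are all assumptions chosen for the example.

```python
import statistics

def abnormal_pairs(pairs, n_sd=3.0):
    """Keep pairs whose insert size lies more than n_sd SDs from the mean.

    `pairs` is a list of (start, insert_size) tuples (an assumed layout).
    """
    sizes = [size for _, size in pairs]
    mean = statistics.mean(sizes)
    sd = statistics.stdev(sizes)
    return [(start, size) for start, size in pairs
            if abs(size - mean) > n_sd * sd]

def cluster_pairs(pairs, max_gap=100, min_pairs=3):
    """Group pairs whose sorted start positions are within max_gap of each other.

    Returns (first_start, last_start, n_pairs) for each group that has at
    least min_pairs members.
    """
    clusters, current = [], []
    for start, size in sorted(pairs):
        if current and start - current[-1][0] > max_gap:
            if len(current) >= min_pairs:
                clusters.append((current[0][0], current[-1][0], len(current)))
            current = []
        current.append((start, size))
    if len(current) >= min_pairs:
        clusters.append((current[0][0], current[-1][0], len(current)))
    return clusters

# Toy data: 200 ordinary pairs plus 4 pairs with a very large insert size.
normal = [(i * 50, 180 + i % 41) for i in range(200)]
odd = [(50000 + offset, 5000) for offset in (0, 10, 25, 40)]
outliers = abnormal_pairs(normal + odd)
clusters = cluster_pairs(outliers)
```

Because the outliers themselves inflate the mean and SD, a robust estimate (median and MAD, for instance) may be a safer cutoff on real data; the mean ± 3 SD rule is kept here only because it is the example given on the slide.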
  • Experimental Setup
    • Choose 1Mb region on chromosome 17 (from 40Mb to 41Mb) as “reference” sequence
    • Make a new sequence that has several structural variants in it and use that as the basis for our “sequencing”
    • Simulate 100k paired-end reads using MAQ simulate
      • Simulated with mean insert size 200, sd 20
      • 35 bp reads
      • Allow errors according to error model from real data
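To make the simulation parameters concrete, here is a rough Python sketch of drawing paired-end reads from a sequence. It is not MAQ's simulator: it applies no error model, it treats the insert size as the full fragment length, and the function names and the random toy reference are assumptions made for illustration.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def simulate_pairs(sequence, n_pairs, read_len=35,
                   mean_insert=200, sd_insert=20, seed=0):
    """Draw error-free paired-end reads from `sequence`.

    Each result is (start, insert, read1, read2), where read1 is the first
    read_len bases of the fragment and read2 is the reverse complement of
    the last read_len bases.
    """
    rng = random.Random(seed)
    pairs = []
    while len(pairs) < n_pairs:
        insert = round(rng.gauss(mean_insert, sd_insert))
        if insert < read_len or insert > len(sequence):
            continue  # skip fragments that cannot hold a read or do not fit
        start = rng.randint(0, len(sequence) - insert)
        fragment = sequence[start:start + insert]
        read1 = fragment[:read_len]
        read2 = reverse_complement(fragment[-read_len:])
        pairs.append((start, insert, read1, read2))
    return pairs

# Toy reference: 5,000 random bases standing in for the 1Mb region.
rng = random.Random(1)
reference = "".join(rng.choice("ACGT") for _ in range(5000))
pairs = simulate_pairs(reference, n_pairs=1000)
```

Fixing the seed keeps the simulated read set reproducible, which helps when comparing detection runs against the same known variants.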
  • Experimental Setup, continued
    • Align paired-end data to the human reference genome using BWA
    • Convert output of BWA to sorted and indexed BAM
    • Use R and Bioconductor tools to try to rediscover the structural variants in the simulation
  • The Sample Sequence
  • The Sample Sequence, continued
    • The segment between 40.4 and 40.5Mb is translocated to sit between 40.04Mb and 40.05Mb.
    • The segment between 40.1 and 40.11Mb is tandemly duplicated five times (a copy number variation).
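The two edits to the sample sequence can be mimicked on a toy string. The helper names below are hypothetical, and the slide does not say exactly where the moved segment is inserted or whether "five times" counts the original copy, so both are left as parameters.

```python
def tandem_duplicate(seq, start, end, copies):
    """Replace seq[start:end] with `copies` back-to-back copies of itself."""
    return seq[:start] + seq[start:end] * copies + seq[end:]

def move_segment(seq, start, end, dest):
    """Cut seq[start:end] and reinsert it at `dest`, upstream of the cut."""
    if dest > start:
        raise ValueError("this sketch only moves a segment upstream")
    segment = seq[start:end]
    return seq[:dest] + segment + seq[dest:start] + seq[end:]

# Toy reference with four easily recognised blocks of five bases each.
reference = "AAAAACCCCCGGGGGTTTTT"
moved = move_segment(reference, start=15, end=20, dest=5)
duplicated = tandem_duplicate(reference, start=5, end=10, copies=3)
```

Both helpers use 0-based, half-open coordinates on the input string. When several edits are applied to the same sequence one after another, later coordinates have to be shifted to account for earlier edits.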
  • Bioconductor and R Tools Used
    • Rsamtools
    • IRanges
      • A RangedData object is used to store paired-end reads and to subset abnormally mapped pairs
      • Calculate coverage on abnormally mapped pairs
    • R graphics for making plots
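The coverage step can be illustrated without Bioconductor. The sketch below computes per-base depth from a list of intervals and reports runs above a threshold. It is plain Python rather than the IRanges implementation, and it uses 0-based, half-open intervals, which is an assumption that differs from the 1-based closed ranges used by IRanges.

```python
def coverage(intervals, length):
    """Per-base depth for half-open [start, end) intervals.

    Assumes 0 <= start <= end <= length for every interval.
    """
    diff = [0] * (length + 1)
    for start, end in intervals:
        diff[start] += 1   # depth rises where an interval begins
        diff[end] -= 1     # and falls just past where it ends
    depth, running = [], 0
    for delta in diff[:length]:
        running += delta
        depth.append(running)
    return depth

def covered_regions(depth, min_depth):
    """Return [start, end) runs where depth is at least min_depth."""
    regions, run_start = [], None
    for pos, d in enumerate(depth):
        if d >= min_depth and run_start is None:
            run_start = pos
        elif d < min_depth and run_start is not None:
            regions.append((run_start, pos))
            run_start = None
    if run_start is not None:
        regions.append((run_start, len(depth)))
    return regions

# Toy example: three overlapping abnormal pairs and one isolated pair.
abnormal = [(2, 6), (4, 8), (5, 9), (15, 18)]
depth = coverage(abnormal, length=20)
peaks = covered_regions(depth, min_depth=2)
```

Regions where several abnormal pairs pile up are the candidates worth inspecting; the isolated pair in the toy data never reaches the threshold.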
  • Get data into R
    • Rsamtools -> RangedData
  • Using the SAM Flag Field: 41 = 1 0 1 0 0 1 and 8 = 0 0 1 0 0 0, so 41 & 8 = 0 0 1 0 0 0 (nonzero, meaning the bit with value 8, "mate unmapped", is set)
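The bitwise test on the flag field can be written out as a short Python sketch. The bit values follow the SAM specification; the descriptive names and the helper functions are chosen here for illustration.

```python
# Bit values from the SAM specification; the names are illustrative.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
}

def has_flag(flag, bit):
    """True if `bit` is set in the SAM flag value."""
    return (flag & bit) != 0

def decode_flag(flag):
    """List the names of all known bits set in `flag`, lowest bit first."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

flag = 41
binary = format(flag, "06b")           # six-digit binary representation
mate_unmapped = has_flag(flag, 0x8)    # the 41 & 8 test from the slide
names = decode_flag(flag)
```

A flag of 41 is 1 + 8 + 32, so the read is paired, its mate is unmapped, and the mate is flagged as reverse strand.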
  • Insert Size Distribution
  • Future Work
    • Build infrastructure for dealing more easily with paired-end reads
    • Refine the workflow for finding and clustering abnormally related pairs
    • Define or implement algorithms for converting raw clustering results into biologically meaningful descriptions of the structural variants
    • Lots, lots, lots more
  • A couple of final thoughts
    • Public data
      • 1000 genomes data
      • SRA (NCBI Sequence Read Archive, formerly the Short Read Archive)
      • NCBI GEO—not just for microarrays
    • Interactive visualization
      • UCSC genome browser (rtracklayer)
      • Integrated Genome Browser (IGB, available from Affymetrix and at SourceForge)
      • Integrative Genomics Viewer (IGV, available from the Broad Institute)