Genome & Exome Sequencing
Read Mapping
Xiaole Shirley Liu
STAT115, STAT215, BIO298, BIST520
Whole Genome Sequencing
• Usually need 30-50X coverage (~3 lanes of 100bp PE HiSeq2000 sequencing)
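The coverage figure can be turned into a read budget with the usual relation coverage = sequenced bases / genome size. The sketch below is only illustrative: the ~3 Gb genome size and the function name are assumptions, not part of the slides.

```python
import math

def read_pairs_needed(coverage, genome_size_bp, read_length_bp, paired=True):
    """Estimate fragments needed so that sequenced bases / genome size = coverage."""
    bases_per_fragment = read_length_bp * (2 if paired else 1)
    return math.ceil(coverage * genome_size_bp / bases_per_fragment)

GENOME_SIZE = 3_000_000_000  # assumed ~3 Gb human genome

for cov in (30, 50):
    pairs = read_pairs_needed(cov, GENOME_SIZE, 100)
    print(f"{cov}X with 100bp PE reads: {pairs:,} read pairs")
```

Under these assumptions 30X needs about 450 million read pairs, so "~3 lanes" would correspond to roughly 150 million pairs per lane.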
Exome Sequencing
• 2011
Exome Sequencing
• Solution Hybrid Selection: Probes in solution can capture all exons (exome) for high throughput sequencing
• 1-2% of whole genome seq
• Easily multiplex 20 samples in one lane
Comparative Sequencing
• Somatic mutation detection between normal / cancer pairs
• WGS or WES
• Higher mutation yield and better causal gene identification than for Mendelian disorders
Meyerson et al, Nat Rev Genet 2010
Hallmark of Mendelian Disease Gene Discovery
Gilissen, Genome Biol 2011
Mutation Targets vs Disorder Frequency
Rarer disorders are focused on fewer mutated genes
Gilissen, Genome Biol 2011
Whole Genome or Exome Seq?
• Enabling technologies: NGS machines, open-source algorithms, capture reagents, falling costs, big sample collections
• Exomes more cost effective: Sequence patient DNA and filter common SNPs; compare parent-child trios; compare paired normal / cancer samples
• Challenges:
– Still can’t interpret many Mendelian disorders
– Rare variants need large sample sizes
– Exome might miss regions (e.g. novel non-coding genes)
– Unsuccessful at using exome-seq to interpret clinical data
Shendure, Genome Biol 2011
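The filtering strategies listed above (remove common SNPs, compare trios, compare paired normal and cancer samples) amount to set operations on variant calls. A minimal sketch, assuming each variant is a hypothetical (chrom, pos, ref, alt) tuple; the function names and data are illustrative only, and real pipelines also weigh genotype quality and inheritance model.

```python
def trio_candidates(child, mother, father, common_snps):
    """Child variants absent from both parents and from a common-SNP catalogue."""
    return sorted(child - mother - father - common_snps)

def somatic_candidates(tumor, normal, common_snps):
    """Tumor variants absent from the matched normal and from common SNPs."""
    return sorted(tumor - normal - common_snps)

# Hypothetical calls, keyed as (chrom, pos, ref, alt)
child = {("chr1", 100, "A", "G"), ("chr2", 200, "C", "T"), ("chr3", 300, "G", "A")}
mother = {("chr1", 100, "A", "G")}
father = set()
common = {("chr2", 200, "C", "T")}

print(trio_candidates(child, mother, father, common))
```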
Read Mapping
• Mapping hundreds of millions of reads back to the reference genome is CPU and RAM intensive, and slow
• Read quality decreases with read length (single nucleotide mismatches or small indels)
• Very few mappers deal with indels, and they often allow ~2 mismatches within the first 30bp (4^28 could still uniquely identify most 30bp sequences in a 3 Gb genome)
• Mapping output: SAM (BAM) or BED
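The 4^28 argument can be checked with a few lines of arithmetic: with 2 of 30 bases allowed to mismatch, 28 bases still have to match exactly. The sketch assumes a ~3 Gb reference searched on both strands and treats the genome as random sequence, which is why the slide says "most" (real genomes contain repeats).

```python
GENOME_SIZE = 3_000_000_000           # assumed ~3 Gb reference
distinct_28mers = 4 ** 28             # possible sequences of 28 exactly matched bases
positions = 2 * GENOME_SIZE           # candidate start positions on both strands

# Expected number of chance matches per read in a random genome
expected_chance_hits = positions / distinct_28mers

print(f"4^28 = {distinct_28mers:,}")
print(f"expected chance hits per read: {expected_chance_hits:.1e}")
```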
Spaced seed alignment
• Tags and tag-sized pieces of reference are cut into small “seeds.”
• Pairs of spaced seeds are stored in an index.
• Look up spaced seeds for each tag.
• For each “hit,” confirm the remaining positions.
• Report results to the user.
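The steps above can be sketched as follows. This is a toy version under stated assumptions, not any particular mapper's implementation: each tag is cut into 4 equal seeds, every pair of seeds is indexed, and a tag with at most 2 mismatches must then have at least 2 intact seeds, so one seed pair is guaranteed to hit. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def seed_pair_keys(window, n_seeds=4):
    """Yield ((i, j), bases) for every pair of seeds cut from the window."""
    size = len(window) // n_seeds
    seeds = [window[k * size:(k + 1) * size] for k in range(n_seeds)]
    for i, j in combinations(range(n_seeds), 2):
        yield (i, j), seeds[i] + seeds[j]

def build_index(reference, tag_len, n_seeds=4):
    """Index every seed pair of every tag-sized piece of the reference."""
    index = defaultdict(list)
    for pos in range(len(reference) - tag_len + 1):
        for pair, key in seed_pair_keys(reference[pos:pos + tag_len], n_seeds):
            index[(pair, key)].append(pos)
    return index

def map_tag(tag, reference, index, max_mismatches=2, n_seeds=4):
    """Look up each seed pair, then confirm the remaining positions of each hit."""
    hits = set()
    for pair, key in seed_pair_keys(tag, n_seeds):
        for pos in index.get((pair, key), []):
            window = reference[pos:pos + len(tag)]
            mismatches = sum(a != b for a, b in zip(tag, window))
            if mismatches <= max_mismatches:
                hits.add((pos, mismatches))
    return sorted(hits)

reference = "TTGACCAGTAGGCATCGATTACAGGCTA"   # made-up toy reference
index = build_index(reference, tag_len=12)
print(map_tag("CTAGTAGACATC", reference, index))  # tag with two substitutions
```

The index here must be built with the same tag length as the tags being mapped; indels are not handled at all.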
Burrows-Wheeler
• Store entire reference genome.
• Align tag base by base from the end.
• When tag is traversed, all active locations are reported.
• If no match is found, then back up and try a substitution.
Trapnell & Salzberg, Nat Biotech 2009
Burrows-Wheeler Transform
• Reversible permutation used originally in compression
• Once BWT(T) is built, all else shown here is discarded
– Matrix will be shown for illustration only
[Figure: T, its Burrows-Wheeler Matrix, and the Last column BWT(T)]
Burrows M, Wheeler DJ: A block sorting lossless data compression algorithm. Digital Equipment Corporation, Palo Alto, CA, Technical Report 124; 1994
Slides from Ben Langmead
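A minimal sketch of building BWT(T) by sorting all rotations, for illustration only. It assumes a "$" terminator that sorts before every base; real indexes are built from suffix arrays rather than by materializing the whole matrix.

```python
def bwt(text, terminator="$"):
    """Return the last column of the sorted rotation matrix of text + terminator."""
    t = text + terminator
    matrix = sorted(t[i:] + t[:i] for i in range(len(t)))  # Burrows-Wheeler Matrix
    return "".join(row[-1] for row in matrix)

print(bwt("acaacg"))
```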
Burrows-Wheeler Transform
• Property that makes BWT(T) reversible is “LF Mapping”
– The i-th occurrence of a character in the Last column is the same text occurrence as the i-th occurrence in the First column
[Figure: T, BWT(T) and the Burrows-Wheeler Matrix, with matching characters labeled Rank: 2 in both columns]
Slides from Ben Langmead
Burrows-Wheeler Transform
• To recreate T from BWT(T), repeatedly apply rule:
T = BWT[ LF(i) ] + T; i = LF(i)
– Where LF(i) maps row i to the row whose first character corresponds to i’s last per LF Mapping
[Figure: Final T]
Slides from Ben Langmead
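The reconstruction rule can be sketched as below. The indexing convention is an assumption: the loop starts at row 0 (the row beginning with the "$" terminator, assumed to sort first), prepends that row's last character, then follows LF. This is equivalent in effect to the rule on the slide but not a literal transcription of it.

```python
def lf_mapping(bwt_string):
    """LF[i]: row whose first character is the same text occurrence as row i's last."""
    counts = {}
    for ch in bwt_string:
        counts[ch] = counts.get(ch, 0) + 1
    first_row, total = {}, 0
    for ch in sorted(counts):            # first row of each character's block
        first_row[ch] = total
        total += counts[ch]
    seen, lf = {}, []
    for ch in bwt_string:
        rank = seen.get(ch, 0)           # i-th occurrence in Last == i-th in First
        lf.append(first_row[ch] + rank)
        seen[ch] = rank + 1
    return lf

def inverse_bwt(bwt_string, terminator="$"):
    """Rebuild T right to left by walking the LF mapping from row 0."""
    lf = lf_mapping(bwt_string)
    i, text = 0, ""
    while bwt_string[i] != terminator:
        text = bwt_string[i] + text
        i = lf[i]
    return text

print(inverse_bwt("gc$aaac"))
```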
Exact Matching with FM Index
• To match Q in T using BWT(T), repeatedly apply rule:
top = LF(top, qc); bot = LF(bot, qc)
– Where qc is the next character in Q (right-to-left) and LF(i, qc) maps row i to the row whose first character corresponds to i’s last character as if it were qc
Slides from Ben Langmead
Exact Matching with FM Index
• In progressive rounds, top & bot delimit the range of rows beginning with progressively longer suffixes of Q
Slides from Ben Langmead
Exact Matching with FM Index
• If range becomes empty (top = bot), the query suffix (and therefore the query) does not occur in the text
Slides from Ben Langmead
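A minimal sketch of the top/bot update, assuming a half-open row range [top, bot) and a naive occurrence table stored for every row (real FM indexes sample this table to save memory). The helper names are hypothetical.

```python
def build_fm(bwt_string):
    """Return (C, occ): C[c] = first row starting with c; occ[c][i] = count of c in bwt[:i]."""
    alphabet = sorted(set(bwt_string))
    c_table, total = {}, 0
    for ch in alphabet:
        c_table[ch] = total
        total += bwt_string.count(ch)
    occ = {ch: [0] for ch in alphabet}
    for last_char in bwt_string:
        for ch in alphabet:
            occ[ch].append(occ[ch][-1] + (ch == last_char))
    return c_table, occ

def backward_search(query, bwt_string):
    """Return the half-open row range [top, bot) of rows that begin with query."""
    c_table, occ = build_fm(bwt_string)
    top, bot = 0, len(bwt_string)
    for qc in reversed(query):           # consume Q right to left
        if qc not in c_table:
            return 0, 0
        top = c_table[qc] + occ[qc][top]  # LF(top, qc)
        bot = c_table[qc] + occ[qc][bot]  # LF(bot, qc)
        if top >= bot:                    # empty range: suffix absent from T
            return top, top
    return top, bot

BWT_T = "gc$aaac"                         # BWT of "acaacg$"
for q in ("aac", "ac", "agc"):
    top, bot = backward_search(q, BWT_T)
    print(q, "occurrences:", bot - top)
```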
Backtracking
• Consider an attempt to find Q = “agc” in T = “acaacg”:
– After matching “c” and then “g”, the range is empty: “gc” does not occur in the text
• Instead of giving up, try to “backtrack” to a previous position and try a different base (much slower)
• For 50bp reads, need to have ~25bp perfect match
Slides from Ben Langmead
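One way to picture backtracking is a depth-first search that, at each query position, tries every base and charges one mismatch for a substitution. The sketch below is a plain exhaustive search under a mismatch budget, which is an assumption made for clarity; it is not the quality-aware, greedy strategy a production aligner would use, and the helper names are hypothetical.

```python
def build_fm(bwt_string):
    """C[c] = first row starting with c; occ[c][i] = count of c in bwt[:i]."""
    alphabet = sorted(set(bwt_string))
    c_table, total = {}, 0
    for ch in alphabet:
        c_table[ch] = total
        total += bwt_string.count(ch)
    occ = {ch: [0] for ch in alphabet}
    for last_char in bwt_string:
        for ch in alphabet:
            occ[ch].append(occ[ch][-1] + (ch == last_char))
    return c_table, occ

def search_with_mismatches(query, bwt_string, max_mismatches=1):
    """Return sorted (matched_text, top, bot) for hits within the mismatch budget."""
    c_table, occ = build_fm(bwt_string)
    bases = [ch for ch in c_table if ch != "$"]   # "$" terminator is assumed
    results = set()

    def extend(pos, top, bot, budget, matched):
        if pos < 0:                                # whole query consumed
            results.add((matched, top, bot))
            return
        for ch in bases:
            cost = 0 if ch == query[pos] else 1    # substitution costs one mismatch
            if cost > budget:
                continue
            new_top = c_table[ch] + occ[ch][top]
            new_bot = c_table[ch] + occ[ch][bot]
            if new_top < new_bot:                  # only follow non-empty ranges
                extend(pos - 1, new_top, new_bot, budget - cost, ch + matched)

    extend(len(query) - 1, 0, len(bwt_string), max_mismatches, "")
    return sorted(results)

print(search_with_mismatches("agc", "gc$aaac", max_mismatches=1))
```

With no mismatches allowed the search for “agc” comes back empty; with one substitution it recovers “aac”.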
Seq Files
• Raw FASTQ
– Sequence ID, sequence
– Quality ID, quality score
• Mapped SAM
– FLAG: 0 mapped OK (forward strand), 4 unmapped, 16 mapped to reverse strand
– XA (mapper-specific)
– MD: mismatch info
– NM: number of mismatches (edit distance)
• Mapped BED
– Chr, start, end, strand
@HWI-EAS305:1:1:1:991#0/1
GCTGGAGGTTCAGGCTGGCCGGATTTAAACGTAT
+HWI-EAS305:1:1:1:991#0/1
MVXUWVRKTWWULRQQMMWWBBBBBBBBBBBBBB
@HWI-EAS305:1:1:1:201#0/1
AAGACAAAGATGTGCTTTCTAAATCTGCACTAAT
+HWI-EAS305:1:1:1:201#0/1
PXX[[[[XTXYXTTWYYY[XXWWW[TMTVXWBBB
HWUSI-EAS366_0112:6:1:1298:18828#0/1    16      chr9  
 98116600        255     38M     *       0       0      
TACAATATGTCTTTATTTGAGATATGGATTTTAG
GCCG  Y]bc^dab
[_UU`^`LbTUTccLbbYaY`cWLYW^  XA:i:1
 MD:Z:3C30T3     NM:i:2
HWUSI-EAS366_0112:6:1:1257:18819#0/1    4       *       0       0
      *       *       0       0      
AGACCACATGAAGCTCAAGAAGAAGGAAGACA
AAAGTG  ece^dddTcT^c`a`ccdKc^^__]Yb_cKS^_W
 XM:i:1
HWUSI-EAS366_0112:6:1:1315:19529#0/1    16      chr9  
 102610263       255     38M     *       0       0      
GCACTCAAGGGTACAGGAAAAGGGTCAGAAGT
GTGGCC  ^c_YcLcb`bbYdTadd`dda`cddYddd^cT`
 XA:i:0  MD:Z:38 NM:i:0
chr1 123450 123500 +
chr5 28374615 28374615 -
http://samtools.sourceforge.net/SAM1.pdf
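The FLAG and MD fields in the SAM records above can be decoded with a few lines. This sketch only looks at the two FLAG bits mentioned on the slide (0x4 unmapped, 0x10 reverse strand) and counts substituted bases in an MD string; the function names are hypothetical, and a real pipeline would use a SAM/BAM library instead.

```python
import re

def describe_flag(flag):
    """Interpret the unmapped (0x4) and reverse-strand (0x10) FLAG bits."""
    if flag & 0x4:
        return "unmapped"
    if flag & 0x10:
        return "mapped, reverse strand"
    return "mapped, forward strand"

def md_mismatch_count(md):
    """Count substituted reference bases in an MD value such as 3C30T3."""
    without_deletions = re.sub(r"\^[A-Z]+", "", md)   # drop deleted bases (^ACG)
    return len(re.findall(r"[A-Z]", without_deletions))

for flag, md in ((16, "3C30T3"), (16, "38")):
    print(describe_flag(flag), "| mismatches:", md_mismatch_count(md))
print(describe_flag(4))
```

For the first record above, MD:Z:3C30T3 gives two substitutions, which agrees with its NM:i:2 because the 38M alignment has no indels.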
Data Analysis
• Heuristic filtering to identify novel genes for Mendelian disorders
Stitziel et al, Genome Biol 2011
Genomic Structural Variation
Baker et al, Nat Meth 2012
altered genome found in a sample is shown at the bottom. B) Inversion (INV) has reciprocal join
in opposite orientations. C) Intra-chromosome translocation (ITX) has unilateral join in opposite
orientation. D) Deletion (DEL) has two breakpoints joined in ascending order of genomic
coordinates in the same orientation. E) Insertion (INS) has two breakpoints joined in descending
order of genomic coordinates in the same orientation.
Structural Variation Detection
• BreakDancer (Chen et al, Nat Meth 2009)
– Only look at anomalous read pairs
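What counts as an "anomalous" read pair can be illustrated with a simple classifier on chromosome, strand and mapped distance. This is a hypothetical sketch, not BreakDancer's actual code: the ReadPair fields, the 3-standard-deviation cutoff and the category labels are all assumptions, and a real caller would also cluster many supporting pairs before reporting a variant.

```python
from dataclasses import dataclass

@dataclass
class ReadPair:
    chrom1: str
    pos1: int
    strand1: str
    chrom2: str
    pos2: int
    strand2: str

def classify_pair(pair, mean_insert, sd_insert, n_sd=3.0):
    """Label a read pair by how it deviates from the expected library layout."""
    if pair.chrom1 != pair.chrom2:
        return "inter-chromosomal"
    if pair.strand1 == pair.strand2:
        return "inversion-like"          # mates should map to opposite strands
    span = abs(pair.pos2 - pair.pos1)
    if span > mean_insert + n_sd * sd_insert:
        return "deletion-like"           # mates map farther apart than expected
    if span < mean_insert - n_sd * sd_insert:
        return "insertion-like"          # mates map closer than expected
    return "normal"

pairs = [
    ReadPair("chr1", 1000, "+", "chr1", 1300, "-"),
    ReadPair("chr1", 1000, "+", "chr1", 9000, "-"),
    ReadPair("chr1", 1000, "+", "chr1", 1300, "+"),
    ReadPair("chr1", 1000, "+", "chr5", 1300, "-"),
]
for p in pairs:
    print(classify_pair(p, mean_insert=300, sd_insert=30))
```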
Structural Variation Detection
• CREST (Wang et al, Nat Meth 2011)
– Use soft-clipped reads, kind of like bidir-blast
Copy Number Variation Detection
• Change in read coverage
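A change in read coverage is usually expressed as a log2 ratio of read counts per genomic bin between a sample and a control. The sketch below assumes counts have already been tallied per bin; the pseudocount and the library-size normalization are illustrative choices, not a specific tool's method.

```python
import math

def log2_ratios(sample_counts, control_counts, pseudocount=0.5):
    """Per-bin log2(sample / control) after scaling each library by its total reads."""
    sample_total = sum(sample_counts)
    control_total = sum(control_counts)
    ratios = []
    for s, c in zip(sample_counts, control_counts):
        s_norm = (s + pseudocount) / sample_total
        c_norm = (c + pseudocount) / control_total
        ratios.append(math.log2(s_norm / c_norm))
    return ratios

# Hypothetical bin counts: the third bin has about twice the reads in the sample
sample = [100, 100, 200, 100]
control = [100, 100, 100, 100]
print([round(r, 2) for r in log2_ratios(sample, control)])
```

Bins with a ratio well above the rest suggest a gain and bins well below suggest a loss; in practice neighboring bins are segmented together before calling.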
Representation: VCF Format
• http://www.1000genomes.org/node/101
Summary
• Whole genome and whole exome sequencing
– Solution hybrid selection
– Specific locus for rare diseases
• Bioinformatics issues:
– Read mapping
– SNP, indel detection
– Heuristic filtering
– Structural variation detection

More Related Content

What's hot

Ngs presentation
Ngs presentationNgs presentation
Ngs presentation
Chakradhar Reddy
 
Functional genomics
Functional genomicsFunctional genomics
Functional genomics
ajay301
 
Microarrays;application
Microarrays;applicationMicroarrays;application
Microarrays;applicationFyzah Bashir
 
Physical maps and their use in annotations
Physical maps and their use in annotationsPhysical maps and their use in annotations
Physical maps and their use in annotations
Sheetal Mehla
 
Protein docking
Protein dockingProtein docking
Protein docking
Saramita De Chakravarti
 
Sanger sequencing method of DNA
Sanger sequencing method of DNA Sanger sequencing method of DNA
Sanger sequencing method of DNA
Dr. Dinesh C. Sharma
 
DNA Sequencing
DNA SequencingDNA Sequencing
DNA Sequencing
Surender Rawat
 
SNP Detection Methods and applications
SNP Detection Methods and applications SNP Detection Methods and applications
SNP Detection Methods and applications
Aneela Rafiq
 
Introduction to sequence alignment partii
Introduction to sequence alignment partiiIntroduction to sequence alignment partii
Introduction to sequence alignment partii
SumatiHajela
 
Sequence Assembly
Sequence AssemblySequence Assembly
Sequence Assembly
Meghaj Mallick
 
Ion torrent and SOLiD Sequencing Techniques
Ion torrent and SOLiD Sequencing Techniques Ion torrent and SOLiD Sequencing Techniques
Ion torrent and SOLiD Sequencing Techniques
fikrem24yahoocom6261
 
Gene regulatory networks
Gene regulatory networksGene regulatory networks
Gene regulatory networks
Madiheh
 
Cancer genome
Cancer genomeCancer genome
Cancer genome
Kundan Singh
 
DNA Methylation: An Essential Element in Epigenetics Facts and Technologies
DNA Methylation: An Essential Element in Epigenetics Facts and TechnologiesDNA Methylation: An Essential Element in Epigenetics Facts and Technologies
DNA Methylation: An Essential Element in Epigenetics Facts and Technologies
QIAGEN
 
NEXT GENERATION SEQUENCING
NEXT GENERATION SEQUENCINGNEXT GENERATION SEQUENCING
NEXT GENERATION SEQUENCING
Bilal Nizami
 
Sequence assembly
Sequence assemblySequence assembly
Sequence assembly
Dr. Naveen Gaurav srivastava
 
cBioPortal Webinar Slides (2/3)
cBioPortal Webinar Slides (2/3)cBioPortal Webinar Slides (2/3)
cBioPortal Webinar Slides (2/3)
Pistoia Alliance
 
Ion Torrent Sequencing
Ion Torrent SequencingIon Torrent Sequencing
Ion Torrent Sequencing
USD Bioinformatics
 
Genome annotation
Genome annotationGenome annotation
Genome annotation
Rezwana Nishat
 

What's hot (20)

Ngs presentation
Ngs presentationNgs presentation
Ngs presentation
 
Functional genomics
Functional genomicsFunctional genomics
Functional genomics
 
Microarrays;application
Microarrays;applicationMicroarrays;application
Microarrays;application
 
Physical maps and their use in annotations
Physical maps and their use in annotationsPhysical maps and their use in annotations
Physical maps and their use in annotations
 
Protein docking
Protein dockingProtein docking
Protein docking
 
Sanger sequencing method of DNA
Sanger sequencing method of DNA Sanger sequencing method of DNA
Sanger sequencing method of DNA
 
DNA Sequencing
DNA SequencingDNA Sequencing
DNA Sequencing
 
Est database
Est databaseEst database
Est database
 
SNP Detection Methods and applications
SNP Detection Methods and applications SNP Detection Methods and applications
SNP Detection Methods and applications
 
Introduction to sequence alignment partii
Introduction to sequence alignment partiiIntroduction to sequence alignment partii
Introduction to sequence alignment partii
 
Sequence Assembly
Sequence AssemblySequence Assembly
Sequence Assembly
 
Ion torrent and SOLiD Sequencing Techniques
Ion torrent and SOLiD Sequencing Techniques Ion torrent and SOLiD Sequencing Techniques
Ion torrent and SOLiD Sequencing Techniques
 
Gene regulatory networks
Gene regulatory networksGene regulatory networks
Gene regulatory networks
 
Cancer genome
Cancer genomeCancer genome
Cancer genome
 
DNA Methylation: An Essential Element in Epigenetics Facts and Technologies
DNA Methylation: An Essential Element in Epigenetics Facts and TechnologiesDNA Methylation: An Essential Element in Epigenetics Facts and Technologies
DNA Methylation: An Essential Element in Epigenetics Facts and Technologies
 
NEXT GENERATION SEQUENCING
NEXT GENERATION SEQUENCINGNEXT GENERATION SEQUENCING
NEXT GENERATION SEQUENCING
 
Sequence assembly
Sequence assemblySequence assembly
Sequence assembly
 
cBioPortal Webinar Slides (2/3)
cBioPortal Webinar Slides (2/3)cBioPortal Webinar Slides (2/3)
cBioPortal Webinar Slides (2/3)
 
Ion Torrent Sequencing
Ion Torrent SequencingIon Torrent Sequencing
Ion Torrent Sequencing
 
Genome annotation
Genome annotationGenome annotation
Genome annotation
 

Viewers also liked

Genome Browsing, Genomic Data Mining and Genome Data Visualization with Ensem...
Genome Browsing, Genomic Data Mining and Genome Data Visualization with Ensem...Genome Browsing, Genomic Data Mining and Genome Data Visualization with Ensem...
Genome Browsing, Genomic Data Mining and Genome Data Visualization with Ensem...
VHIR Vall d’Hebron Institut de Recerca
 
Examining gene expression and methylation with next gen sequencing
Examining gene expression and methylation with next gen sequencingExamining gene expression and methylation with next gen sequencing
Examining gene expression and methylation with next gen sequencing
Stephen Turner
 
Rare Variant Analysis Workflows: Analyzing NGS Data in Large Cohorts
Rare Variant Analysis Workflows: Analyzing NGS Data in Large CohortsRare Variant Analysis Workflows: Analyzing NGS Data in Large Cohorts
Rare Variant Analysis Workflows: Analyzing NGS Data in Large Cohorts
Golden Helix Inc
 
Next-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plotNext-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plot
Li Shen
 
Data analysis pipelines for NGS applications
Data analysis pipelines for NGS applicationsData analysis pipelines for NGS applications
Data analysis pipelines for NGS applications
Vall d'Hebron Institute of Research (VHIR)
 

Viewers also liked (6)

Genome Browsing, Genomic Data Mining and Genome Data Visualization with Ensem...
Genome Browsing, Genomic Data Mining and Genome Data Visualization with Ensem...Genome Browsing, Genomic Data Mining and Genome Data Visualization with Ensem...
Genome Browsing, Genomic Data Mining and Genome Data Visualization with Ensem...
 
Examining gene expression and methylation with next gen sequencing
Examining gene expression and methylation with next gen sequencingExamining gene expression and methylation with next gen sequencing
Examining gene expression and methylation with next gen sequencing
 
Rare Variant Analysis Workflows: Analyzing NGS Data in Large Cohorts
Rare Variant Analysis Workflows: Analyzing NGS Data in Large CohortsRare Variant Analysis Workflows: Analyzing NGS Data in Large Cohorts
Rare Variant Analysis Workflows: Analyzing NGS Data in Large Cohorts
 
Next-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plotNext-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plot
 
RNA-Seq
RNA-SeqRNA-Seq
RNA-Seq
 
Data analysis pipelines for NGS applications
Data analysis pipelines for NGS applicationsData analysis pipelines for NGS applications
Data analysis pipelines for NGS applications
 

Similar to Exome Sequencing

Ashg2014 grc workshop_schneider
Ashg2014 grc workshop_schneiderAshg2014 grc workshop_schneider
Ashg2014 grc workshop_schneider
Genome Reference Consortium
 
Blast fasta 4
Blast fasta 4Blast fasta 4
Blast fasta 4
Er Puspendra Tripathi
 
Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic...
Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic...Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic...
Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic...
Elia Brodsky
 
ChIP-seq - Data processing
ChIP-seq - Data processingChIP-seq - Data processing
ChIP-seq - Data processing
Sebastian Schmeier
 
lecture4.ppt Sequence Alignmentaldf sdfsadf
lecture4.ppt Sequence Alignmentaldf sdfsadflecture4.ppt Sequence Alignmentaldf sdfsadf
lecture4.ppt Sequence Alignmentaldf sdfsadf
alizain9604
 
Bioinformatica t4-alignments
Bioinformatica t4-alignmentsBioinformatica t4-alignments
Bioinformatica t4-alignments
Prof. Wim Van Criekinge
 
2015 Bioc4010 lecture1and2
2015 Bioc4010 lecture1and22015 Bioc4010 lecture1and2
2015 Bioc4010 lecture1and2
Dan Gaston
 
Bioinformatics t4-alignments wim_vancriekingev2013
Bioinformatics t4-alignments wim_vancriekingev2013Bioinformatics t4-alignments wim_vancriekingev2013
Bioinformatics t4-alignments wim_vancriekingev2013
Prof. Wim Van Criekinge
 
Lopez-Bigas talk at the EBI/EMBL Cancer Genomics Workshop
Lopez-Bigas talk at the EBI/EMBL Cancer Genomics WorkshopLopez-Bigas talk at the EBI/EMBL Cancer Genomics Workshop
Lopez-Bigas talk at the EBI/EMBL Cancer Genomics WorkshopNuria Lopez-Bigas
 
Topological associated domains- Hi-C
Topological associated domains- Hi-CTopological associated domains- Hi-C
Topological associated domains- Hi-C
Mohamed Nadhir Djekidel
 
Metro nome agbt-poster
Metro nome agbt-posterMetro nome agbt-poster
Metro nome agbt-poster
Toby Bloom
 
Church_GenomeAccess_2013_genome2013
Church_GenomeAccess_2013_genome2013Church_GenomeAccess_2013_genome2013
Church_GenomeAccess_2013_genome2013
Deanna Church
 
Phylogenetics1
Phylogenetics1Phylogenetics1
Phylogenetics1
Sébastien De Landtsheer
 
Bits protein structure
Bits protein structureBits protein structure
Bits protein structureBITS
 
Generating high-quality human reference genomes using PromethION nanopore seq...
Generating high-quality human reference genomes using PromethION nanopore seq...Generating high-quality human reference genomes using PromethION nanopore seq...
Generating high-quality human reference genomes using PromethION nanopore seq...
Miten Jain
 
STR IMP.pptx
STR IMP.pptxSTR IMP.pptx
STR IMP.pptx
GOURAVOSTWAL
 
Bioinformatica 20-10-2011-t3-scoring matrices
Bioinformatica 20-10-2011-t3-scoring matricesBioinformatica 20-10-2011-t3-scoring matrices
Bioinformatica 20-10-2011-t3-scoring matrices
Prof. Wim Van Criekinge
 
Bioinformatica t3-scoring matrices
Bioinformatica t3-scoring matricesBioinformatica t3-scoring matrices
Bioinformatica t3-scoring matrices
Prof. Wim Van Criekinge
 

Similar to Exome Sequencing (20)

_BLAST.ppt
_BLAST.ppt_BLAST.ppt
_BLAST.ppt
 
Ashg2014 grc workshop_schneider
Ashg2014 grc workshop_schneiderAshg2014 grc workshop_schneider
Ashg2014 grc workshop_schneider
 
Blast fasta 4
Blast fasta 4Blast fasta 4
Blast fasta 4
 
Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic...
Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic...Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic...
Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic...
 
ChIP-seq - Data processing
ChIP-seq - Data processingChIP-seq - Data processing
ChIP-seq - Data processing
 
lecture4.ppt Sequence Alignmentaldf sdfsadf
lecture4.ppt Sequence Alignmentaldf sdfsadflecture4.ppt Sequence Alignmentaldf sdfsadf
lecture4.ppt Sequence Alignmentaldf sdfsadf
 
Bioinformatica t4-alignments
Bioinformatica t4-alignmentsBioinformatica t4-alignments
Bioinformatica t4-alignments
 
2015 Bioc4010 lecture1and2
2015 Bioc4010 lecture1and22015 Bioc4010 lecture1and2
2015 Bioc4010 lecture1and2
 
Bioinformatics t4-alignments wim_vancriekingev2013
Bioinformatics t4-alignments wim_vancriekingev2013Bioinformatics t4-alignments wim_vancriekingev2013
Bioinformatics t4-alignments wim_vancriekingev2013
 
Lopez-Bigas talk at the EBI/EMBL Cancer Genomics Workshop
Lopez-Bigas talk at the EBI/EMBL Cancer Genomics WorkshopLopez-Bigas talk at the EBI/EMBL Cancer Genomics Workshop
Lopez-Bigas talk at the EBI/EMBL Cancer Genomics Workshop
 
Topological associated domains- Hi-C
Topological associated domains- Hi-CTopological associated domains- Hi-C
Topological associated domains- Hi-C
 
Metro nome agbt-poster
Metro nome agbt-posterMetro nome agbt-poster
Metro nome agbt-poster
 
Church_GenomeAccess_2013_genome2013
Church_GenomeAccess_2013_genome2013Church_GenomeAccess_2013_genome2013
Church_GenomeAccess_2013_genome2013
 
Phylogenetics1
Phylogenetics1Phylogenetics1
Phylogenetics1
 
Bits protein structure
Bits protein structureBits protein structure
Bits protein structure
 
Ch06 alignment
Ch06 alignmentCh06 alignment
Ch06 alignment
 
Generating high-quality human reference genomes using PromethION nanopore seq...
Generating high-quality human reference genomes using PromethION nanopore seq...Generating high-quality human reference genomes using PromethION nanopore seq...
Generating high-quality human reference genomes using PromethION nanopore seq...
 
STR IMP.pptx
STR IMP.pptxSTR IMP.pptx
STR IMP.pptx
 
Bioinformatica 20-10-2011-t3-scoring matrices
Bioinformatica 20-10-2011-t3-scoring matricesBioinformatica 20-10-2011-t3-scoring matrices
Bioinformatica 20-10-2011-t3-scoring matrices
 
Bioinformatica t3-scoring matrices
Bioinformatica t3-scoring matricesBioinformatica t3-scoring matrices
Bioinformatica t3-scoring matrices
 

Recently uploaded

Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
g2nightmarescribd
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 

Recently uploaded (20)

Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 

Exome Sequencing

  • 1. Genome & Exome Sequencing: Read Mapping. Xiaole Shirley Liu. STAT115, STAT215, BIO298, BIST520
  • 2. Whole Genome Sequencing • Usually need 30-50X coverage (~3 lanes of 100 bp PE HiSeq2000 sequencing)
  • 4. Exome Sequencing • Solution Hybrid Selection: probes in solution can capture all exons (the exome) for high-throughput sequencing • 1-2% of whole genome seq • Easily multiplex 20 samples in one lane
  • 5. Comparative Sequencing • Somatic mutation detection between normal / cancer pairs • WGS or WES • More mutation yield and better causal gene identification than Mendelian disorders (Meyerson et al, Nat Rev Genet 2010)
  • 6. Hallmark of Mendelian Disease Gene Discovery (Gilissen, Genome Biol 2011)
  • 7. Hallmark of Mendelian Disease Gene Discovery (Gilissen, Genome Biol 2011)
  • 8. Mutation Targets vs Disorder Frequency • Rarer disorders are focused on fewer mutated genes (Gilissen, Genome Biol 2011)
  • 9. Whole Genome or Exome Seq? • Enabling technologies: NGS machines, open-source algorithms, capture reagents, lowering cost, big sample collections • Exomes are more cost effective: sequence patient DNA and filter common SNPs; compare parent-child trios; compare paired normal/cancer samples • Challenges: – Still can’t interpret many Mendelian disorders – Rare variants need large sample sizes – Exome might miss regions (e.g. novel non-coding genes) – Unsuccessful at using exome-seq to interpret clinical data (Shendure, Genome Biol 2011)
  • 10. Read Mapping • Mapping hundreds of millions of reads back to the reference genome is CPU and RAM intensive, and slow • Read quality decreases along the length of the read (single-nucleotide mismatches or small indels) • Very few mappers handle indels; most allow ~2 mismatches within the first 30 bp (the remaining 28 matched bases give 4^28 ≈ 7.2 × 10^16 combinations, which could still uniquely identify most 30 bp sequences in a 3 Gb genome) • Mapping output: SAM (BAM) or BED
  • 11. Spaced seed alignment • Tags and tag-sized pieces of reference are cut into small “seeds.” • Pairs of spaced seeds are stored in an index. • Look up spaced seeds for each tag. • For each “hit,” confirm the remaining positions. • Report results to the user.
  • 12. Burrows-Wheeler • Store entire reference genome. • Align tag base by base from the end. • When tag is traversed, all active locations are reported. • If no match is found, then back up and try a substitution. (Trapnell & Salzberg, Nat Biotech 2009)
  • 13. Burrows-Wheeler Transform • Reversible permutation used originally in compression • Once BWT(T) is built, all else shown here is discarded – Matrix will be shown for illustration only • [Figure: T, its Burrows-Wheeler matrix, and the last column BWT(T)] • Burrows M, Wheeler DJ: A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, Palo Alto, CA; 1994. Slides from Ben Langmead
  • 14. Burrows-Wheeler Transform • Property that makes BWT(T) reversible is “LF Mapping” – The ith occurrence of a character in the Last column is the same text occurrence as the ith occurrence in the First column • [Figure: T, BWT(T), and the Burrows-Wheeler matrix, with “Rank: 2” labels on the First and Last columns] Slides from Ben Langmead
  • 15. Burrows-Wheeler Transform • To recreate T from BWT(T), repeatedly apply rule: T = BWT[ LF(i) ] + T; i = LF(i) – Where LF(i) maps row i to the row whose first character corresponds to i’s last character, per the LF Mapping • [Figure: final T] Slides from Ben Langmead
  • 16. Exact Matching with FM Index • To match Q in T using BWT(T), repeatedly apply rule: top = LF(top, qc); bot = LF(bot, qc) – Where qc is the next character in Q (right-to-left) and LF(i, qc) maps row i to the row whose first character corresponds to i’s last character as if it were qc Slides from Ben Langmead
  • 17. Exact Matching with FM Index • In progressive rounds, top & bot delimit the range of rows beginning with progressively longer suffixes of Q Slides from Ben Langmead
  • 18. Exact Matching with FM Index • If range becomes empty (top = bot) the query suffix (and therefore the query) does not occur in the text Slides from Ben Langmead
  • 19. Backtracking • Consider an attempt to find Q = “agc” in T = “acaacg”: “c” matches, but extending to “g” fails because “gc” does not occur in the text • Instead of giving up, try to “backtrack” to a previous position and try a different base (much slower) • For 50 bp reads, need to have a ~25 bp perfect match Slides from Ben Langmead
  • 20. Seq Files
      • Raw FASTQ – Sequence ID, sequence – Quality ID, quality score
      • Mapped SAM – Map (FLAG field): 0 OK, 4 unmapped, 16 mapped to reverse strand – XA (mapper-specific) – MD: mismatch info – NM: number of mismatches
      • Mapped BED – Chr, start, end, strand
      • Reference: http://samtools.sourceforge.net/SAM1.pdf
      FASTQ example:
        @HWI-EAS305:1:1:1:991#0/1
        GCTGGAGGTTCAGGCTGGCCGGATTTAAACGTAT
        +HWI-EAS305:1:1:1:991#0/1
        MVXUWVRKTWWULRQQMMWWBBBBBBBBBBBBBB
        @HWI-EAS305:1:1:1:201#0/1
        AAGACAAAGATGTGCTTTCTAAATCTGCACTAAT
        +HWI-EAS305:1:1:1:201#0/1
        PXX[[[[XTXYXTTWYYY[XXWWW[TMTVXWBBB
      SAM example (one record per line):
        HWUSI-EAS366_0112:6:1:1298:18828#0/1    16    chr9    98116600    255    38M    *    0    0    TACAATATGTCTTTATTTGAGATATGGATTTTAGGCCG    Y]bc^dab[_UU`^`LbTUTccLbbYaY`cWLYW^    XA:i:1    MD:Z:3C30T3    NM:i:2
        HWUSI-EAS366_0112:6:1:1257:18819#0/1    4    *    0    0    *    *    0    0    AGACCACATGAAGCTCAAGAAGAAGGAAGACAAAAGTG    ece^dddTcT^c`a`ccdKc^^__]Yb_cKS^_W    XM:i:1
        HWUSI-EAS366_0112:6:1:1315:19529#0/1    16    chr9    102610263    255    38M    *    0    0    GCACTCAAGGGTACAGGAAAAGGGTCAGAAGTGTGGCC    ^c_YcLcb`bbYdTadd`dda`cddYddd^cT`    XA:i:0    MD:Z:38    NM:i:0
      BED example:
        chr1    123450    123500    +
        chr5    28374615    28374615    -
  • 21. Data Analysis • Heuristic filtering to identify novel genes for Mendelian disorders (Stitziel et al, Nat Genome Biol 2011)
  • 22. Genomic Structural Variation (Baker et al, Nat Meth 2012) • Figure caption (beginning missing): [...] altered genome found in a sample is shown at the bottom. B) Inversion (INV) has reciprocal join in opposite orientations. C) Intra-chromosome translocation (ITX) has unilateral join in opposite orientation. D) Deletion (DEL) has two breakpoints joined in ascending order of genomic coordinates in the same orientation. E) Insertion (INS) has two breakpoints joined in descending order of genomic coordinates in the same orientation.
  • 23. Structural Variation Detection • BreakDancer (Chen et al, Nat Meth 2009) – Only looks at anomalous read pairs
  • 24. Structural Variation Detection • CREST (Wang et al, Nat Meth 2011) – Uses soft-clipped reads, somewhat like a bidirectional BLAST
  • 25. Copy Number Variation Detection • Change in read coverage
  • 26. Representation: VCF Format • http://www.1000genomes.org/node/101
  • 27. Summary • Whole genome and whole exome sequencing – Solution hybrid selection – Specific locus for rare diseases • Bioinformatics issues: – Read mapping – SNP, indel detection – Heuristic filtering – Structural variation detection
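
As a rough illustration of slides 13-18, the Python sketch below builds BWT(T) for the slides' example T = "acaacg" (with a "$" sentinel appended), rebuilds T with the LF mapping, and counts exact matches by backward search. The function names (`bwt`, `fm_tables`, `lf`, `inverse_bwt`, `count_matches`), the naive rotation sort, and the full occurrence table are assumptions chosen for readability; they are not taken from the slides or from any particular mapper, and a production index would be built far more economically.

```python
from collections import Counter


def bwt(text):
    """Return BWT(text). Assumes text ends with a unique sentinel '$'."""
    # Naive construction: sort all rotations and take the last column.
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)


def fm_tables(last):
    """Build the two lookup tables used by the LF mapping.

    c_table[ch]: index of the first row whose first character is ch.
    occ[ch][i]:  number of times ch occurs in last[:i].
    """
    counts = Counter(last)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    occ = {ch: [0] * (len(last) + 1) for ch in counts}
    for i, ch in enumerate(last):
        for key in occ:
            occ[key][i + 1] = occ[key][i] + (1 if key == ch else 0)
    return c_table, occ


def lf(i, ch, c_table, occ):
    """LF(i, ch): row reached from boundary i when stepping over ch."""
    return c_table[ch] + occ[ch][i]


def inverse_bwt(last):
    """Recreate T from BWT(T) by repeatedly applying the LF mapping."""
    c_table, occ = fm_tables(last)
    i, text = 0, ""  # row 0 is the rotation that starts with '$'
    for _ in range(len(last) - 1):
        ch = last[i]
        text = ch + text  # prepend: T is recovered right to left
        i = lf(i, ch, c_table, occ)
    return text + "$"


def count_matches(query, last):
    """Count exact occurrences of query using backward search."""
    c_table, occ = fm_tables(last)
    top, bot = 0, len(last)  # half-open row range [top, bot)
    for qc in reversed(query):  # consume the query right to left
        if qc not in c_table:
            return 0
        top = lf(top, qc, c_table, occ)
        bot = lf(bot, qc, c_table, occ)
        if top == bot:  # empty range: this suffix does not occur
            return 0
    return bot - top


last_column = bwt("acaacg$")
print(last_column)                        # gc$aaac
print(inverse_bwt(last_column))           # acaacg$
print(count_matches("aac", last_column))  # 1
print(count_matches("agc", last_column))  # 0
```

The final query reproduces the failure on slide 19: after "c" is matched, the range for "gc" is empty, so an exact matcher reports no hit. A mapper that tolerates mismatches would backtrack at that point and try a different base.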