Statistics for K-mer Based
Splicing Event Analysis
Data Learner Miner Practitioner
Ruofei Du, Hao Li, Hui Miao, Shangfu Peng
Alternative Splicing Events
Image from: "Alternative Splicing Event" Wikipedia: The Free Encyclopedia. Wikimedia Foundation, Inc. 2 Apr. 2014.
<http://en.wikipedia.org/wiki/Alternative_splicing>
● Alternative splicing is used to describe
any case in which a primary transcript
can be spliced in more than one pattern
to generate multiple and distinct
mRNAs.
● 5 traditional basic modes; most
common: exon skipping.
● It is a widespread mechanism for
generating protein diversity and
regulating protein expression.
● Improve understanding of cell differentiation and classify disease types
Image from: Sammeth, Michael, Sylvain Foissac, and Roderic Guigó. "A General Definition and Nomenclature for Alternative Splicing
Events." PLoS Computational Biology 4.8 (2008): e1000147.
Alternative Splicing Events
● Different species tend to have different
splicing event patterns.
● Different splicing events can also indicate abnormal cell activity, such as cancer.
Image from: Sammeth, Michael, Sylvain Foissac, and Roderic Guigó. "A General Definition and Nomenclature for Alternative Splicing
Events." PLoS Computational Biology 4.8 (2008): e1000147.
Abundance Estimation for
Alternative Splicing Events
● Given RNA-Seq samples, estimate the abundance and
the relative proportion of every alternative transcription
path
Image from: Hu, Yin, et al. "DiffSplice: the genome-wide detection of differential splicing events with RNA-seq." Nucleic acids research 41.2 (2013): e39-e39.
Abundance Estimation for Isoforms
● The Standard Paradigm
o Read alignment step can be very computationally
intensive.
● Sailfish
o Far faster than the standard paradigm
o Replace the step of read mapping with the much
faster and simpler process of k-mer counting
Sailfish: Alignment-free Isoform Quantification from RNA-seq Reads using Lightweight Algorithms Rob Patro, Stephen M. Mount, and
Carl Kingsford. Manuscript Submitted (2013) http://www.cs.cmu.edu/~ckingsf/class/02714-f13/Lec05-sailfish.pdf
K-mer
● A fixed-size sequence of length K
Sailfish: Alignment-free Isoform Quantification from RNA-seq Reads using Lightweight
Algorithms Rob Patro, Stephen M. Mount, and Carl Kingsford. Manuscript Submitted
(2013) http://www.cs.cmu.edu/~ckingsf/class/02714-f13/Lec05-sailfish.pdf
● A string of length N contains N-K+1 k-mers
● One can build a K-mer index to represent a string
1-mers: A C G T
2-mers: AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
7-mer    ID  N
ATTCGAC  1   1
TTCGACA  2   1
TCGACAG  3   1
...
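The N-K+1 count and the 7-mer table above can be reproduced with a short sketch. The function names and the example string are illustrative assumptions (the string is chosen so that its first three 7-mers match the table), and the N column is read here as an occurrence count.

```python
from collections import Counter

def kmers(seq, k):
    """Return the k-mers of seq in order; there are len(seq) - k + 1 of them."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_index(seq, k):
    """Map each distinct k-mer to (id, count); ids follow first appearance, starting at 1."""
    counts = Counter(kmers(seq, k))  # Counter keeps insertion order
    return {kmer: (i, n) for i, (kmer, n) in enumerate(counts.items(), start=1)}

seq = "ATTCGACAG"  # hypothetical example string, N = 9
index = build_index(seq, 7)
for kmer, (kmer_id, n) in index.items():
    print(kmer, kmer_id, n)
```

With N = 9 and K = 7 the string yields 9 - 7 + 1 = 3 k-mers, one per row of the table.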
Sailfish Workflow
● Indexing
o Build a K-mer index for the known isoform transcripts
● Quantification
o Count the number of times each K-mer occurs in the reads.
o Estimate abundances via an EM algorithm
Sailfish: Alignment-free Isoform Quantification from RNA-seq Reads using Lightweight Algorithms Rob Patro, Stephen M. Mount, and Carl Kingsford.
Sailfish Workflow: Indexing
● Perfect Hashing
o Maps each K-mer in the domain D to a distinct integer in the range [0, |D|-1]
http://www.cs.cmu.edu/~ckingsf/class/02714-f13/Lec05-sailfish.pdf
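The property that matters here is that every K-mer in the domain D receives its own integer in [0, |D|-1]. The sketch below only illustrates that property with an ordinary Python dict as a stand-in; it is not a minimal perfect hash construction and is not how Sailfish implements its index. The toy transcripts and the function name are assumptions.

```python
def build_perfect_index(transcripts, k):
    """Assign every distinct k-mer in the transcripts a unique integer in [0, |D|-1]."""
    domain = sorted({t[i:i + k] for t in transcripts for i in range(len(t) - k + 1)})
    return {kmer: idx for idx, kmer in enumerate(domain)}

transcripts = ["ATTCGACAG", "TTCGACATT"]  # hypothetical toy transcripts
index = build_perfect_index(transcripts, 7)

# A flat array of size |D| can then hold one counter per k-mer.
counts = [0] * len(index)
counts[index["TTCGACA"]] += 1
```

Because no two K-mers share a slot, counting a K-mer from a read reduces to one lookup and one array increment.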
Sailfish Workflow: Quantification
1. Read Data K-mer Counting
2. K-mer Allocation to Transcripts
http://www.cs.cmu.edu/~ckingsf/class/02714-f13/Lec05-sailfish.pdf
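The two quantification steps can be sketched on toy data. This is a deliberately simplified mixture-model EM, assuming each observed k-mer is drawn from transcript t with probability theta[t] and then uniformly from that transcript's k-mer positions. It is not Sailfish's actual procedure, the estimates are shares of k-mers rather than length-normalized transcript abundances, and all names and sequences are hypothetical.

```python
from collections import Counter

def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def count_read_kmers(reads, known, k):
    """Step 1: count k-mers in the reads, keeping only those present in the transcripts."""
    counts = Counter()
    for read in reads:
        for km in kmers(read, k):
            if km in known:
                counts[km] += 1
    return counts

def em_allocate(transcripts, counts, k, iters=100):
    """Step 2: allocate k-mer counts to transcripts with a simple mixture EM."""
    names = list(transcripts)
    # P(k-mer | transcript) = occurrences in the transcript / number of k-mer positions
    emit = {}
    for name in names:
        c = Counter(kmers(transcripts[name], k))
        total = sum(c.values())
        emit[name] = {km: n / total for km, n in c.items()}
    theta = {name: 1.0 / len(names) for name in names}
    total_count = sum(counts.values())
    if total_count == 0:
        return theta  # nothing observed; keep the uniform start
    for _ in range(iters):
        assigned = dict.fromkeys(names, 0.0)
        for km, n in counts.items():
            # E-step: split the n observations of km by posterior responsibility
            weights = {t: theta[t] * emit[t].get(km, 0.0) for t in names}
            z = sum(weights.values())
            for t in names:
                assigned[t] += n * weights[t] / z
        # M-step: new mixing proportions
        theta = {t: assigned[t] / total_count for t in names}
    return theta

transcripts = {"T1": "AAAACCCCGGGG", "T2": "AAAACCCCTTTT"}  # hypothetical isoforms
reads = ["CCCGGGG", "AAAACCCC", "CCGGGG"]                   # hypothetical reads
k = 4
known = {km for seq in transcripts.values() for km in kmers(seq, k)}
counts = count_read_kmers(reads, known, k)
theta = em_allocate(transcripts, counts, k)
```

In this toy input the only k-mers unique to one transcript come from T1, so the shared k-mers are progressively reassigned to T1 as the iterations proceed.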
Our Proposal
● We propose to investigate a scalable statistical method that uses k-mers and a k-mer index to estimate the abundance of alternative splicing events.
● We will focus on the most frequent event type: the exon skipping event
o The approach can be extended naturally to other event types
Shen, Shihao, et al. "MATS: a Bayesian framework for Flexible Detection of Differential Alternative Splicing from RNA-Seq Data."
Nucleic Acids Research 40.8 (2012): e6
Initial Idea
● Build a k-mer index for a specific gene, e.g. one with exons A, B, C, D, E
● On the reads side, aggregate k-mer counts as in Sailfish
● Variables for abundance:
Class I: Each exon i (A, B, C, D, E)
Class II: Each exon-exon junction (non-spliced) (AB, BC, CD, DE)
Class III: Each spliced junction (AC, BD, CE)
● Use EM to do maximum likelihood estimation
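One possible way to turn the three classes into k-mer sets is sketched below. The exon sequences, the value of K, and the helper names are hypothetical, and the slides do not specify how junction k-mers are defined; here a junction k-mer is assumed to be any k-mer containing at least one base from each of the two joined exons (K >= 2).

```python
def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def junction_kmers(left, right, k):
    """k-mers spanning the boundary of two joined exons (assumes k >= 2)."""
    # Up to k-1 bases from each side, so every window crosses the boundary.
    joined = left[-(k - 1):] + right[:k - 1]
    return kmers(joined, k)

# Hypothetical exon sequences for a gene with exons A..E
exons = {
    "A": "ACGTACGTAC",
    "B": "GGATCCGGAT",
    "C": "TTGACCTTGA",
    "D": "CATGCATGCA",
    "E": "AGCTTAGCTT",
}
order = "ABCDE"
k = 4

features = {}
for name in order:                                # Class I: each exon
    features[name] = kmers(exons[name], k)
for a, b in zip(order, order[1:]):                # Class II: AB, BC, CD, DE
    features[a + b] = junction_kmers(exons[a], exons[b], k)
for a, c in zip(order, order[2:]):                # Class III: AC, BD, CE
    features[a + c] = junction_kmers(exons[a], exons[c], k)
```

A k-mer can occur in more than one feature, so raw counts do not map one-to-one onto the abundance variables; that ambiguity is one plausible motivation for estimating them with EM.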
Advantage
● Does not require knowledge of the isoform space.
● Replaces the read-mapping step and provides a faster approach for splicing event analysis.
Thank you
Questions
1. What is the drawback of the straightforward method: get the Pi of each isoform using EM first, and then calculate the frequency of events?
2. Why do we have to use EM? Why not solve the equations directly?
3. Is it required to know the frequency of the five event types?


Editor's Notes

  1. Good morning everyone, we're the Data Learner Miner Practitioner team. Today we're going to talk about our project proposal: statistics for k-mer based splicing event analysis.
  2. So what is alternative splicing, and what are alternative splicing events? Alternative splicing is a regulated process during gene expression that results in a single gene coding for multiple proteins. There are five traditional basic modes of alternative splicing events: exon skipping, mutually exclusive exons, alternative donor site, alternative acceptor site, and intron retention. Exon skipping: an exon (the yellow one in the figure) may be skipped from the primary transcript. This is the most common mode in mammalian pre-mRNAs. Mutually exclusive exons: one of the two yellow exons is retained in mRNAs after splicing, but not both. Alternative donor site: an alternative 5' splice junction (donor site) is used, changing the 3' boundary of the upstream exon. Alternative acceptor site: an alternative 3' splice junction (acceptor site) is used, changing the 5' boundary of the downstream exon. Intron retention: a subsequence in one exon may be spliced out as an intron or simply retained. This is distinguished from exon skipping because the retained sequence is not flanked by introns. Alternative splicing is a widespread mechanism for generating protein diversity and regulating protein expression. The term alternative splicing is used in biology to describe any case in which a primary transcript can be spliced in more than one pattern to generate multiple, distinct mRNAs. AS events are available for the following model organisms: Caenorhabditis elegans, Danio rerio, Drosophila melanogaster, Homo sapiens, Mus musculus, Rattus norvegicus.
  3. So why are we interested in splicing events? For one thing, different species tend to have different splicing event patterns. For example, for each of the 12 compared species, a pie diagram shows the distribution of splicing events across 5 structurally different classes. It’s clear from the figure that mammals have more exon skipping and complex events and fewer retained introns than invertebrates. For another, different splicing events also indicate abnormal cell activities, such as cancer.
  4. The splicing event analysis could improve our understanding of cell differentiation and help classify disease types. Next, Hui will introduce abundance estimation for alternative splicing events.
  5. The method estimates the abundance and the relative proportion of every alternative transcription path. Subsequently, the estimators for the expression of each alternative splicing module (ASM) are propagated to derive an estimator for the overall gene expression.
  6. Ambiguously mapped reads are shuffled around, usually with the goal of uniform coverage.
  7. K-mers are robust to errors. Longer k-mers may result in less ambiguity but may be more affected by errors in the reads. Shorter k-mers, though more ambiguous, may be more robust to errors in the reads.
  8. Sailfish works in two phases: indexing and quantification. The most important data structure in the index is the minimal perfect hash function that maps each k-mer in the reference transcripts to an index between 0 and the number of different k-mers in the transcripts, such that no two k-mers share an index. This allows us to quickly index and count any k-mer from the reads that also appears in the transcripts. Sailfish then applies an expectation maximization (EM) procedure to determine maximum likelihood estimates for the relative abundance of each transcript. This procedure is similar to the EM algorithm used by RSEM [5], except that k-mers rather than fragments are probabilistically assigned to transcripts.
  10. Sailfish counts the number of times each k-mer occurs in the reads, then applies EM to determine maximum likelihood estimates for the abundance of each transcript. By working with k-mers, we can replace the computationally intensive step of read mapping with the much faster and simpler process of k-mer counting. We also avoid any dependence on read mapping parameters. Two k-mers are equivalent from the perspective of the EM algorithm if they occur in the same set of transcript sequences at the same rate. This reduction in the number of active variables substantially reduces the computational requirements of the EM procedure.
  11. The basic idea is to focus on the frequency of exon-exon junctions. In this picture, we name the exons exon 1, exon 2 and exon 3. If we detect a 1-3 junction in the reads, we know an exon skipping event has occurred. So our task is to estimate the frequency of each exon-exon junction.
  12. Recall that Sailfish estimates mu_i for each isoform. Similarly, here we introduce three classes of variables on the gene side to be estimated. The first class is mu_i for each single exon. The second and third classes are the mu for all exon-exon junctions: the second class is for non-spliced junctions and the third class is for spliced junctions. For example, take the gene sequence ABCDE, where A, B, C, D and E are exons. The first class is the mu for A, B, C, D, E. The second class is the mu for AB. The third class is the mu for the spliced junctions. If we know all the mu values, the sum of the mu of the third class is exactly the frequency of the exon skipping event. To estimate mu, we build a k-mer index for each variable on the gene side. On the reads side, we aggregate k-mer counts as Sailfish does.
  13. which is a hard task in biology. Similar to Sailfish.
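
The indexing, k-mer counting and EM steps described in the notes above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not Sailfish's actual implementation: a plain dict stands in for the minimal perfect hash, the EM is a simple mixture model without the bias corrections or k-mer equivalence classes a real tool would use, and the function names (build_index, count_read_kmers, em_abundance) are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(references, k):
    """Give each distinct reference k-mer its own integer id.

    Assumption: a dict replaces the minimal perfect hash; it keeps the
    "no two k-mers share an index" property but not the memory savings.
    Also returns, per reference, how often each k-mer occurs in it.
    """
    index = {}
    occurrences = {}
    for name, seq in references.items():
        counts = Counter(kmers(seq, k))
        occurrences[name] = counts
        for kmer in counts:
            index.setdefault(kmer, len(index))
    return index, occurrences


def count_read_kmers(reads, index, k):
    """Count only those read k-mers that also appear in the references."""
    counts = Counter()
    for read in reads:
        for kmer in kmers(read, k):
            if kmer in index:
                counts[kmer] += 1
    return counts


def em_abundance(occurrences, read_counts, iterations=100):
    """Estimate the fraction of counted k-mers drawn from each reference."""
    names = list(occurrences)
    lengths = {t: sum(occurrences[t].values()) for t in names}
    total = sum(read_counts.values())
    theta = {t: 1.0 / len(names) for t in names}
    if total == 0:
        return theta  # no usable k-mers: keep the uniform starting guess
    for _ in range(iterations):
        assigned = {t: 0.0 for t in names}
        for kmer, n in read_counts.items():
            # E-step: split this k-mer's count among references in
            # proportion to theta[t] * P(kmer | t).
            weights = {t: theta[t] * occurrences[t][kmer] / lengths[t]
                       for t in names if lengths[t]}
            norm = sum(weights.values())
            if norm == 0.0:
                continue
            for t, w in weights.items():
                assigned[t] += n * w / norm
        # M-step: renormalize the fractional counts.
        theta = {t: assigned[t] / total for t in names}
    return theta
```

A possible usage: `index, occ = build_index({"T1": "AAACCC", "T2": "GGGTTT"}, 3)`, then `counts = count_read_kmers(reads, index, 3)` and `em_abundance(occ, counts)`. Following note 12, the reference entries could just as well be exon and exon-exon junction sequences instead of whole isoforms, in which case summing the estimates for the spliced junctions gives a rough exon skipping frequency.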