Biotechnology and genomics deal with sensitive information and intellectual property. Seven Bridges Genomics will protect the confidentiality of your data and
proprietary approaches. Similarly, we look to you to protect our interests in our intellectual property. Seven Bridges Genomics does not accept any liability for
information contained in this document. All information provided in this document is subject to change without notice. sevenbridges.com
Dimensionality reduction and visualization
techniques for high-dimensional genomic data
Dusan Ranđelović
Bioinformatics Analyst, Seven Bridges
DATA SCIENCE CONFERENCE 3.0
© 2017 Seven Bridges sevenbridges.com
AGENDA
Genomic data science
● Specifics of genomics
● Just enough cell biology
Dimensionality reduction
● Curse of dimensionality
● Use-case: Population genomics (PCA)
● Use-case: Cell populations (IsoMap)
● Use-case: Tissue expression profiles (t-SNE)
Genomic data science
Dusan Randjelovic / DSC3.0
General data scientist:
Person who is better at statistics than any
software engineer and better at software
engineering than any statistician
Genomics vs. general data science
Source: Emmert-Streib, Moutari and Dehmer, 2016
Specifics of genomics:
- domain is crucial
- multi-omics approach
- population scale and per-sample studies equally uncharted
Cell biology
Eukaryotic cell
Complex interplay between millions of molecules
Features: millions of variations along 3*10^9 positions
Features: tens of thousands of gene expression values
Sequencing → Genomics
Higher dimensions
Featuring lots of features
The more the merrier?
Complex biological processes in a cell can be characterized by measuring the properties of thousands or millions of molecules at a time (the birth of genomics)
We are FORTUNATE to be able to measure so many features at once
However, when we compare measurements or estimate any function of the measured features, difficulties arise
There is a CURSE!
Curse of dimensionality
Imagine 1, 2 or 3 dimensional feature-space...
Source: Parsons et al. KDD Explorations 2004
Curse of dimensionality
Imagine 1, 2 or 3 dimensional feature-space...
Source: Clarke R, et al.: The properties of high-dimensional data spaces: implications for exploring gene and protein expression data. Nat Rev Cancer 8: 37-49
10 features: 0.24% !
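The 0.24% figure is consistent with a common illustration of the curse: the fraction of a hypercube's volume that lies inside its inscribed hypersphere. The slide image is not preserved here, so that reading is an assumption. A minimal sketch of the calculation under that assumption (the function name is illustrative):

```python
import math

def inscribed_ball_fraction(d: int) -> float:
    """Fraction of the cube [-1, 1]^d occupied by the unit ball."""
    ball = math.pi ** (d / 2) / math.gamma(d / 2 + 1)  # volume of the unit d-ball
    cube = 2.0 ** d                                    # volume of the enclosing cube
    return ball / cube

for d in (1, 2, 3, 10):
    print(f"{d:>2} features: {100 * inscribed_ball_fraction(d):.3f}%")
```

For 10 features the fraction is roughly 0.249%, which matches the slide's value up to rounding: almost all of the volume sits in the corners of the cube.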
Curse of dimensionality
Now imagine 10, 20, 1000… dimensional space
- sparsity introduced
- locality broken
- # samples needed grows exponentially with # features
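One way to see "locality broken" is to check how the gap between the nearest and farthest neighbor shrinks as features are added. The sketch below uses uniformly random points, which is an assumption for illustration and not genomic data; the function name is hypothetical.

```python
import numpy as np

def relative_contrast(n_samples: int, n_features: int, seed: int = 0) -> float:
    """(d_max - d_min) / d_min for distances from one query point to random points."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_samples, n_features))   # uniform points in the unit cube
    query = rng.random(n_features)
    dist = np.linalg.norm(points - query, axis=1)  # Euclidean distance to each point
    return float((dist.max() - dist.min()) / dist.min())

for d in (2, 10, 100, 1000):
    print(d, round(relative_contrast(500, d), 3))
```

As the number of features grows, the contrast drops toward zero, so "nearest" and "farthest" neighbors become hard to tell apart.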
Dimensionality reduction
Reduction of dimensionality – the Why?
Reduce # of features for further (un)supervised learning
- feature selection or feature engineering
- detecting intrinsic dimensionality
Lower computational demand
- lower memory footprint
- compression, scalability
Exploratory data analysis technique
Projections that improve signal-to-noise ratio for specific effect
Example: pixel values (e.g. 64x64 images) → 2D: scale + rotation
Reduction of dimensionality – the How?
Dimensionality reduction:
…which retains geometry of the data as much as possible (van der Maaten, 2009).
Reduction of dimensionality – the How?
Taxonomy of methods:
- Properties of data / nature of mapping: Linear vs. non-linear
- Objective function properties: convex vs. non-convex
- Properties to preserve: global vs. local
As in classification or clustering, we need:
- Similarity measure between datapoints
Similarity: neighborhood and distances
Source: doi=10.1.1.154.8446
Distance is metric when:
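The conditions were shown on the slide image and are not preserved in this transcript. Presumably they are the standard metric axioms, given here as an assumption for a distance function d and any points x, y, z:

```latex
\begin{aligned}
d(x, y) &\ge 0 && \text{(non-negativity)} \\
d(x, y) &= 0 \iff x = y && \text{(identity of indiscernibles)} \\
d(x, y) &= d(y, x) && \text{(symmetry)} \\
d(x, z) &\le d(x, y) + d(y, z) && \text{(triangle inequality)}
\end{aligned}
```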
Non-linear reduction: Manifold learning
Common techniques
+ SNE, t-SNE
Source: van der Maaten, 2009: Dimensionality Reduction: A Comparative Review
Genomics use-cases
Population variations
Infer cell populations
Tissue classification
Source: 2D Representation of Transcriptomes by t-SNE Exposes Relatedness between Human Tissues
Source: Simons dataset @ SBG Platform
Principal component analysis (PCA)
Use-case:
Population variations – Simons Diversity dataset
Simons Diversity dataset
300 genomes
142 diverse populations
35TB raw + processed
Sample analysis @SBG →
Simons Diversity PCA
SNPRelate 1.10.1 Bioconductor tool
PCA done on non-African samples,
on chromosome 6 only, SNPs only
→ different populations have variations in the genome with similar frequencies
Principal component analysis (PCA)
Linear technique that finds directions along which the variance of the data is maximized (eigenvectors of the covariance matrix)
Algorithm: computes the components of the linear mapping M that maximize variance or, equivalently, minimize reconstruction error, usually via SVD
Related: ICA, MDS, other generalizations of PCA
Drawback: retains only global dissimilarities
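The SVD route to PCA can be sketched in a few lines of NumPy. This is a minimal illustration on hypothetical toy data, not the SNPRelate pipeline used for the Simons dataset; the rows of `components` play the role of the mapping M.

```python
import numpy as np

def pca_svd(X: np.ndarray, n_components: int = 2):
    """Project rows of X onto the top principal components using SVD."""
    Xc = X - X.mean(axis=0)                        # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]                 # directions of maximal variance
    scores = Xc @ components.T                     # low-dimensional coordinates
    explained = (S ** 2) / np.sum(S ** 2)          # variance fraction per component
    return scores, components, explained[:n_components]

# Toy data (hypothetical): 200 samples, 50 features, driven by 2 latent directions
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2)) * [5.0, 2.0]
X = latent @ rng.normal(size=(2, 50)) + 0.1 * rng.normal(size=(200, 50))

scores, components, explained = pca_svd(X)
print(scores.shape, explained.round(3))
```

Because the toy data has only two strong latent directions, the first two components should account for nearly all of the variance.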
ISOMap – nonlinear mapping,
preserves geodesic distances
Use-case:
Infer cell populations from single-cell RNA-seq
Single-cell RNA-seq
Assess the relative abundance of RNA molecules from hundreds of cells
NOTE: cells have the same DNA, but express different genes (transcribe different RNAs)
Expression profiles should correspond to cell types
Shalek, Satija et al. 2014
FastProject: framework built on scikit-learn to run multiple projections and test for correspondence with known molecular pathways
ISOMap
Dynamics of gene expression and gene regulatory networks are non-linear
PCA and even Euclidean distances may not reflect true similarity
Geodesic distance along the manifold -> better data similarity
Algorithm: 1. kNN + weighted graph, 2. Shortest path, 3. MDS
Related: MDS, other spectral nonlinear techniques
Drawback: Topological instability
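The three algorithm steps can be sketched directly in NumPy. This is a minimal illustration, not the implementation used in the talk; the spiral test data and all parameter values are assumptions, and the dense Floyd-Warshall step only suits small inputs.

```python
import numpy as np

def isomap(X: np.ndarray, n_neighbors: int = 5, n_components: int = 2) -> np.ndarray:
    n = X.shape[0]
    # 1. kNN graph weighted by Euclidean distance
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    graph = np.full((n, n), np.inf)
    nearest = np.argsort(dist, axis=1)[:, 1:n_neighbors + 1]  # index 0 is the point itself
    rows = np.repeat(np.arange(n), n_neighbors)
    graph[rows, nearest.ravel()] = dist[rows, nearest.ravel()]
    graph = np.minimum(graph, graph.T)                         # make the graph undirected
    np.fill_diagonal(graph, 0.0)
    # 2. Shortest paths (Floyd-Warshall) approximate geodesic distances
    for k in range(n):
        graph = np.minimum(graph, graph[:, k, None] + graph[None, k, :])
    if np.isinf(graph).any():
        raise ValueError("kNN graph is disconnected; increase n_neighbors")
    # 3. Classical MDS on the geodesic distance matrix
    J = np.eye(n) - np.ones((n, n)) / n                        # centering matrix
    B = -0.5 * J @ (graph ** 2) @ J
    eigval, eigvec = np.linalg.eigh(B)
    top = np.argsort(eigval)[::-1][:n_components]
    return eigvec[:, top] * np.sqrt(np.maximum(eigval[top], 0.0))

# Toy manifold (hypothetical): a 1D spiral embedded in 2D
t = np.linspace(0, 3 * np.pi, 150)
X = np.column_stack([t * np.cos(t), t * np.sin(t)])
Y = isomap(X, n_neighbors=5, n_components=1)
print(Y.shape)
```

On the spiral, the single recovered coordinate should track position along the curve, which straight-line Euclidean distances alone would not reveal. The "topological instability" drawback shows up when n_neighbors is large enough to connect neighboring arms of the spiral.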
t-Distributed Stochastic Neighbor Embedding
(t-SNE)
Use-case:
Tissue expression profiles – GTEx dataset
GTEx dataset
● Genotype-Tissue Expression (DNA+RNA)
● V7 data: 53 tissues, 714 donors, 11688 samples
● > 50,000 quantified RNA molecules (features)
Source: http://www.gtexportal.org/home/documentationPage
GTEx analysis
Original study: Science, 2015
t-SNE reanalysis: PLOS, 2016
t-SNE
Non-convex technique (different random initializations can produce different results)
Similarity between data points is modeled as a conditional probability
In the 2D/3D embedding, similarities are again expressed as probabilities, but based on a t-distribution rather than a normal distribution
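For reference, the usual formulation of these similarities is given below. It follows the standard t-SNE definition by van der Maaten and Hinton and is not taken from the slides; x denotes the n original data points, y their low-dimensional images, and σ_i a per-point bandwidth set through the perplexity parameter.

```latex
\begin{gathered}
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n} \\
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}} \\
C = \mathrm{KL}(P \parallel Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
\end{gathered}
```

The heavy tails of the t-distribution in q allow moderately distant points to sit farther apart in the embedding, and the cost C is non-convex in y, which is why different initializations can give different maps.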
Some thoughts for takeaway
Dimensionality reduction implementations
Standard sklearn’s fit_transform paradigm
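A minimal sketch of that paradigm, applying the three techniques from this talk through the same call. The random matrix is a hypothetical stand-in for an expression matrix, and the parameter values are illustrative assumptions, not recommendations.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, TSNE

# Hypothetical stand-in for an expression matrix: 120 samples x 500 features,
# generated from 3 latent factors plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(120, 3))
X = latent @ rng.normal(size=(3, 500)) + 0.5 * rng.normal(size=(120, 500))

reducers = {
    "PCA": PCA(n_components=2),
    "Isomap": Isomap(n_neighbors=10, n_components=2),
    "t-SNE": TSNE(n_components=2, perplexity=30, init="pca", random_state=0),
}

# Every estimator exposes the same fit_transform(X) -> embedding interface
embeddings = {name: model.fit_transform(X) for name, model in reducers.items()}
for name, Y in embeddings.items():
    print(name, Y.shape)
```

Since the interface is shared, swapping one method for another is a one-line change, which makes side-by-side comparison cheap.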
Get to know your data
Even better → learn about data-generation processes
Make hypotheses about relations in dataset
Even better → test them and incorporate learned relations
Compare methods and measure fitness
Even better → Visualize
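One possible way to "measure fitness" is a neighborhood-preservation score such as scikit-learn's trustworthiness. The talk does not name a specific measure, so this choice, the swiss-roll toy data and the parameter values are all assumptions for illustration.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, trustworthiness

# Toy non-linear manifold in 3D (not genomic data)
X, _ = make_swiss_roll(n_samples=400, random_state=0)

models = {
    "PCA": PCA(n_components=2),
    "Isomap": Isomap(n_neighbors=10, n_components=2),
}

scores = {}
for name, model in models.items():
    Y = model.fit_transform(X)
    # Score in [0, 1]: how well local neighborhoods of X are kept in Y
    scores[name] = trustworthiness(X, Y, n_neighbors=10)

print(scores)
```

A score like this complements the plots: it gives a number to compare across methods and parameter settings, while the visualization shows what the structure actually looks like.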
Have fun & thank you!
Questions?
Dimensionality reduction and visualization techniques for high-dimensional genomic data - Data Science Conference DSC3.0 talk - Dusan Randjelovic

  • 1. Biotechnology and genomics deal with sensitive information and intellectual property. Seven Bridges Genomics will protect the confidentiality of your data and proprietary approaches. Similarly, we look to you to protect our interests in our intellectual property. Seven Bridges Genomics does not accept any liability for information contained in this document. All information provided in this document is subject to change without notice. sevenbridges.com Dimensionality reduction and visualization techniques for high-dimensional genomic data Dusan Ranđelović Bioinformatics Analyst, Seven Bridges DATA SCIENCE CONFERENCE 3.0
  • 2. © 2017 Seven Bridges sevenbridges.com Genomic data science ● Specifics of genomics ● Just enough cell biology AGENDA DSC3.0 Dimensionality reduction ● Curse of dimensionality ● Use-case: Population genomics (PCA) ● Use-case: Cell populations (IsoMap) ● Use-case: Tissue expression profiles (tSNE)
  • 3. © 2017 Seven Bridges sevenbridges.com Genomic data science
  • 4. © 2017 Seven Bridges sevenbridges.comDusan Randjelovic / DSC3.0 General data scientist: Person who is better at statistics than any software engineer and better at software engineering than any statistician DSC3.0
  • 5. © 2017 Seven Bridges sevenbridges.com Genomics vs. general data science Dusan Randjelovic / DSC3.0 Source: Moutari and Dehmer. Emmert-Streib, 2016 Specifics of genomics: - domain is crucial - multi-omics approach - population scale and per-sample studies equally uncharted DSC3.0
  • 6. © 2017 Seven Bridges sevenbridges.com Cell biology Eukaryotic cell Dusan Randjelovic / DSC3.0DSC3.0
  • 7. © 2017 Seven Bridges sevenbridges.com Complex interplay between millions of molecules Dusan Randjelovic / DSC3.0DSC3.0
  • 8. © 2017 Seven Bridges sevenbridges.com Features: Millions of variations along 3*10^9 positions Features: 10s of thousands of gene expression values Dusan Randjelovic / DSC3.0DSC3.0
  • 9. © 2017 Seven Bridges sevenbridges.com Sequencing → Genomics Dusan Randjelovic / DSC3.0DSC3.0
  • 10. © 2017 Seven Bridges sevenbridges.com Higher dimensions Dusan Randjelovic / DSC3.0 Featuring lots of features
  • 11. © 2017 Seven Bridges sevenbridges.com The more the merrier? Complex biological processes in a cell could be characterized by measuring thousands or millions of molecules’ properties at a time (birth of genomics) We are FORTUNATE to be able to measure so many features at once However, when we compare measurements, or estimate any function of measured features, there are difficulties There is a CURSE! Dusan Randjelovic / DSC3.0DSC3.0
  • 12. © 2017 Seven Bridges sevenbridges.com Curse of dimensionality Imagine 1, 2 or 3 dimensional feature-space... Source: Parsons et al. KDD Explorations 2004 Dusan Randjelovic / DSC3.0DSC3.0
  • 13. © 2017 Seven Bridges sevenbridges.com Curse of dimensionality Imagine 1, 2 or 3 dimensional feature-space... Source: Clarke R, et. al: The properties of high dimensional data spaces: implications for exploring gene and protein expression data. Nat Rev Cancer 8: 37-49 Dusan Randjelovic / DSC3.0 10 features: 0.24% ! DSC3.0
  • 14. © 2017 Seven Bridges sevenbridges.com Curse of dimensionality Now imagine 10, 20, 1000… dimensional space - sparsity introduced - locality broken - # samples needed grows exp. to # features Dusan Randjelovic / DSC3.0DSC3.0
  • 15. © 2017 Seven Bridges sevenbridges.com Dimensionality reduction Dusan Randjelovic / DSC3.0
  • 16. © 2017 Seven Bridges sevenbridges.com Reduction of dimensionality – the Why? Reduce # of features for further (un)supervised learning - feature selection or feature engineering - detecting intrinsic dimensionality Lower computational demand - lower memory footprint - compression, scalability Exploratory data analysis technique Projections that improve signal-to-noise ratio for specific effect pixel values (ex. 64x64) 2D: scale + rotation Dusan Randjelovic / DSC3.0DSC3.0
  • 17. © 2017 Seven Bridges sevenbridges.com Reduction of dimensionality – the How? Dimensionality reduction: …which retains geometry of the data as much as possible (van der Maaten, 2009). Dusan Randjelovic / DSC3.0DSC3.0
  • 18. © 2017 Seven Bridges sevenbridges.com Reduction of dimensionality – the How? Taxonomy of methods: - Properties of data / nature of mapping: Linear vs. non-linear - Objective function properties: convex vs. non-convex - Properties to preserve: global vs. local As in classification or clustering, we need: - Similarity measure between datapoints Dusan Randjelovic / DSC3.0DSC3.0
  • 19. © 2017 Seven Bridges sevenbridges.com Similarity: neighborhood and distances Source: doi=10.1.1.154.8446 Distance is metric when: Dusan Randjelovic / DSC3.0DSC3.0
  • 20. © 2017 Seven Bridges sevenbridges.com Non-linear reduction: Manifold learning Dusan Randjelovic / DSC3.0DSC3.0
  • 21. © 2017 Seven Bridges sevenbridges.com Common techniques + SNE, t-SNE Source: van der Maaten, 2009: Dimensionality Reduction: A Comparative Review Dusan Randjelovic / DSC3.0DSC3.0
  • 22. © 2017 Seven Bridges sevenbridges.com Genomics use-cases Population variations Infer cell populations Tissue classification Source: 2D Representation of Transcriptomes by t-SNE Exposes Relatedness between Human Tissues Source: Simons dataset @ SBG Platform Dusan Randjelovic / DSC3.0DSC3.0
  • 23. © 2017 Seven Bridges sevenbridges.com Principal component analysis (PCA) Dusan Randjelovic / DSC3.0 Use-case: Population variations – Simons Diversity dataset
  • 24. © 2017 Seven Bridges sevenbridges.com Simons Diversity dataset 300 genomes 142 diverse populations 35TB raw + processed Sample analysis @SBG → Dusan Randjelovic / DSC3.0DSC3.0
  • 25. © 2017 Seven Bridges sevenbridges.com Simons Diversity PCA SNPRelate 1.10.1 Bioconductor tool PCA done on non-African samples, on chromosome 6 only, SNPs only → different populations have variations in the genome with similar frequencies Dusan Randjelovic / DSC3.0DSC3.0
  • 26. © 2017 Seven Bridges sevenbridges.com Principal component analysis (PCA) Dusan Randjelovic / DSC3.0 Linear technique that finds directions along which variance of the data is maximized (eigenvectors) Algorithm: iteratively updates M’s components to maximize variance or minimize reconstruction error, usually via SVD Related: ICA, MDS, other generalizations of PCA Drawback: retains only global disimilarities DSC3.0
  • 27. ISOMap – nonlinear mapping, preserves geodesic distances Use-case: Infer cell populations from single-cell RNA-seq
  • 28. Single-cell RNA-seq Assess relative abundance of RNA molecules from 100s of cells NOTE: cells have same DNA, but express different genes (transcribe different RNAs) Expression profiles should correspond to cell types
  • 29. Shalek, Satija et al. 2014 FastProject: Framework on scikit-learn to do multiple projections and test for correspondence with known molecular pathways
  • 30. ISOMap Dynamics of gene expression and gene regulatory networks are non-linear; PCA and even Euclidean distances do not hold Geodesic distance along the manifold -> better data similarity Algorithm: 1. kNN + weighted graph, 2. Shortest path, 3. MDS Related: MDS, other spectral nonlinear techniques Drawback: Topological instability
  • 31. t-Distributed Stochastic Neighbor Embedding (t-SNE) Use-case: Tissue expression profiles – GTEx dataset
  • 32. GTEx dataset ● Genotype-tissue expression (DNA+RNA) ● V7 data: 53 tissues, 714 donors, 11688 samples ● > 50,000 quantified RNA molecules (features) Source: http://www.gtexportal.org/home/documentationPage
  • 33. GTEx analysis Original study: Science, 2015 t-SNE reanalysis: PLOS, 2016
  • 34. t-SNE Non-convex technique (random initializations could produce different results) Similarity between data points is conditional probability In 2D/3D preserves probability, but on t-distribution rather than normal
  • 35. Some thoughts for takeaway
  • 36. Dimensionality reduction implementations Standard sklearn’s fit_transform paradigm
  • 37. Get to know your data Even better → learn about data-generation processes Make hypotheses about relations in dataset Even better → test them and incorporate learned relations Compare methods and measure fitness Even better → Visualize
  • 38. Have fun & thank you!
  • 39. Questions?
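
The "fit_transform paradigm" named on slide 36 could look roughly like the sketch below. It assumes scikit-learn and NumPy are installed; the random matrix `X`, the dictionary name `reducers`, and all parameter values (`n_neighbors=10`, `perplexity=20`) are placeholders chosen for illustration, not settings used in the talk.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, TSNE

# Hypothetical data: 100 samples x 20 features (stand-in for a real dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))

# Every reducer exposes the same interface, so methods are easy to swap.
reducers = {
    "PCA": PCA(n_components=2),
    "Isomap": Isomap(n_neighbors=10, n_components=2),
    "t-SNE": TSNE(n_components=2, perplexity=20, random_state=0),
}

# fit_transform learns the mapping and returns the low-dimensional embedding.
embeddings = {name: reducer.fit_transform(X) for name, reducer in reducers.items()}

for name, Y in embeddings.items():
    print(name, Y.shape)
```

Because all three estimators share one call signature, comparing methods on the same data (as the takeaway slide suggests) is mostly a matter of looping over them.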

Editor's Notes

  1. Good morning! Thank you all for coming to this talk. My name is Dusan and for the past 2 years I have been occasionally playing with some interesting genomics datasets. I work as a Bioinformatics Analyst at a company called “Seven Bridges Genomics”, so by now you are probably guessing that this talk will be more on the science side of data science. I will, however, focus on some particular methods for the analysis of biological data, which are quite interesting and widely applicable in general data science: primarily methods of dimensionality reduction. I hope you’ll find this overview helpful, and if you come to like computational biology and genomics after this talk, that’s just a bonus :).
  2. First, I’ll talk about some specifics of genomics and introduce just enough cell biology for you to understand the use-cases. Then I will talk about dimensionality reduction and showcase some high-dimensional datasets in genomics.
  3. There is a saying that a data scientist is a “Person who is better at programming than any… ”, but in the case of genomics data science I would like to argue that those two disciplines are not enough. A genomics data scientist depends heavily on domain knowledge.
  4. In order to understand the results of an analysis, to pose the right questions, or even to recognise features in a dataset, you need some basic knowledge about the processes in a cell and about how the data is generated. So the domain is actually crucial here. Apart from this, there is usually not only one approach to each study, but several experiments on different levels (DNA, RNA level, some biomedical measurements, phenotype quantification, etc.), and this so-called multi-omics approach, which is so powerful for clinicians, gives headaches to genomics data scientists. Another difficulty is that population-scale studies and per-sample studies deal with equally unknown phenomena: there are so many associations and correlations, but the discipline is so young that there are not many theoretical models to help guide new analyses.
  5. As I said, we will need some basic cell biology here, but I’ll try to be brief :) - you probably remember most of this from high school anyway, right? Cells of complex organisms have a nucleus in which there are some long molecules called DNA, packed in chromosomes. DNA molecules serve as a blueprint for making other molecules that are involved in every function of a cell, like RNA molecules or, indirectly, proteins. The whole DNA material is called a genome, and some small regions of it are the famous genes. What is important here is that DNA is structured and is the same in every cell. It is composed of billions of smaller molecules (adenine, thymine, cytosine and guanine) and can be represented as a long sequence of the letters ACTG, unique for each individual.
  6. Overall, complex interactions of DNA, RNA, proteins and the environment make what we call a phenotype: some physical characteristic, like eye color or a disease.
  7. If we were somehow able to digitize these molecules, we would get a picture of the processes in a cell that cause diseases or are responsible for some phenotype. That digitization is possible and is called sequencing. From a data science perspective: when you sequence a genome, the end result is a dataset that says at which positions among the 3 billion letters of your genome there is a variation or mutation, something different from some reference genome. You could see these millions of differences as features of your dataset to explore! Another common digitization is to count RNA molecules transcribed from genes, which usually gives you datasets with 10s of thousands of features.
  8. Sequencing is what gave birth to genomics, the study of the whole genome, and since 2003 this technology has been outpacing Moore’s law and is currently one of the greatest sources of big data.
  9. In order to profile complex biological processes we measure as much as we can. It’s fortunate that we can measure all those features of some process at once, but with complexity and lots of features comes a Curse.
  10. And it is called the Curse of dimensionality. What is meant by this are some geometric and probabilistic consequences of dealing with a high-dimensional feature space. For example: if we have some number of samples and we measure 1, 2, or 3 features of those same samples, what we notice is that by taking more features into account we are sampling from an ever larger feature space, and our samples are less and less representative of that feature space. The core of the curse of dimensionality is that the number of samples we need grows exponentially with the number of features.
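
A minimal sketch of this effect, assuming (purely for illustration) that each feature is cut into 10 bins: the number of cells in the resulting grid grows as 10^d, so a fixed number of samples can touch only a shrinking fraction of the feature space. The function name and the choice of 1000 samples are hypothetical.

```python
def max_occupied_fraction(n_samples: int, n_features: int, bins_per_feature: int = 10) -> float:
    """Upper bound on the fraction of grid cells that n_samples points can occupy."""
    n_cells = bins_per_feature ** n_features
    return min(1.0, n_samples / n_cells)


# With 1000 samples, coverage is complete up to 3 features and collapses afterwards.
for d in (1, 2, 3, 6, 10):
    print(f"{d:>2} features: at most {max_occupied_fraction(1000, d):.2e} of the cells occupied")
```

Keeping the coverage constant would require multiplying the number of samples by 10 for every added feature, which is the exponential growth mentioned above.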
  11. On the other hand, even if we have enough samples, there are some geometrical consequences of going higher in dimensionality. In machine learning, whether unsupervised or supervised, we are usually interested in finding distances between datapoints in order to establish some similarity metric. And distances and neighbourhoods in a high-dimensional feature space are problematic. If you look at the right image you’ll see that the largest circle inside a square covers 78% of it, which could also be seen as the largest neighbourhood inside a 2-dimensional feature space. If we increase the dimensionality, we see that a sphere covers 52% of a cube, and going further, a 10-dimensional hypersphere covers only about 0.25% of a 10-dimensional hypercube. This really changes the meaning of near and far, since most data points are far away in the corners.
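
The coverage figures quoted above can be checked with the closed-form volume of a d-dimensional ball, V_d(r) = π^(d/2) r^d / Γ(d/2 + 1). This is a small sketch under the assumption that the hypersphere is inscribed in a unit hypercube (radius 1/2); the function name is made up for the example.

```python
import math


def inscribed_ball_fraction(d: int) -> float:
    """Volume of the d-ball of radius 1/2 divided by the volume of the unit d-cube (which is 1)."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * 0.5 ** d


# Fraction of the cube covered by its largest inscribed ball, per dimensionality.
for d in (2, 3, 10, 100):
    print(f"{d:>3} dims: {inscribed_ball_fraction(d):.3e}")
```

The fraction falls off faster than exponentially, so in 100 or 1000 dimensions practically all of the cube's volume sits in its corners.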
  12. So if we go to 100 or 1000 dimensions, we have sparsity introduced merely by the geometry of such a space. Locality is also broken, and the number of samples needed grows rapidly. Additionally, most algorithms have some optimal number of features to work with, as seen on the classifier performance curve on the right.
  13. So what do we do to avoid the Curse? - We try to reduce the dimensions.
  14. We could and should always try to reduce the dimensionality of a dataset if we suspect that its intrinsic dimensionality, the one that completely describes the effect we are measuring, is lower than the number of features. For example, on the image here you see small images of the letter A, and since images are usually described by their pixel values, we could imagine this dataset having, for example, 64x64 features. But if we are interested in the transformations of A present in these pictures, we can see that only 2 transformations are applied, scaling and rotation, so instead of 64x64 features we could have only 2 to represent the whole dataset. Reduction of dimensionality is also what we do in feature selection (keeping only the interesting features) and in feature engineering (constructing new, better features from the ones we measure). It is sometimes imperative to reduce dimensions merely because of computational complexity or to compress the dataset. But the most interesting purpose of dimensionality reduction is visualization and exploratory analysis.
  15. Dimensionality reduction techniques are unsupervised machine learning techniques that learn an embedding of a high-dimensional dataset in lower dimensions, usually 2 or 3 if we aim to explore and visualize the dataset. The last note from the more formal explanation on the slide is about keeping the geometry intact as much as possible. That part is the hardest, since datasets could have some weird topological or metric properties.
  16. The number of dimensionality reduction methods and techniques is rising rapidly, especially non-linear ones. Methods can be divided by the nature of the relationships among features into linear and non-linear. They can be divided into convex and non-convex methods, depending on whether the objective function being optimized is convex or not. An important distinction is between global and local methods (ones that preserve global dissimilarities in a dataset and ones that keep local similarity better). No matter which method, as in clustering or classification, what we need among datapoints is some established similarity measure.
  17. When we speak about similarity, two other terms usually come in: neighborhood and distance. The neighborhood of a datapoint could be defined by all the points that fall within some radius (left image on the slide) or by a specified number of nearest points. But to measure distances, whether to calculate neighborhoods or to set some similarity measure, we have many choices. Euclidean distance is the most common one in higher dimensions, but in some cases others are used, like geodesic distance or even some non-metric distances, as we will see in the examples.
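
The two neighborhood definitions can be sketched in a few lines of NumPy. This is only an illustration with Euclidean distance on four made-up 2D points; the helper names (`pairwise_euclidean`, `radius_neighbors`, `k_nearest_neighbors`) are hypothetical and not taken from any library.

```python
import numpy as np


def pairwise_euclidean(X: np.ndarray) -> np.ndarray:
    """Matrix of Euclidean distances between all rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def radius_neighbors(D: np.ndarray, i: int, radius: float) -> np.ndarray:
    """Indices of all points within `radius` of point i (excluding i itself)."""
    idx = np.flatnonzero(D[i] <= radius)
    return idx[idx != i]


def k_nearest_neighbors(D: np.ndarray, i: int, k: int) -> np.ndarray:
    """Indices of the k points closest to point i (excluding i itself)."""
    order = np.argsort(D[i], kind="stable")
    return order[order != i][:k]


# Four toy points: three close to the origin, one far away.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
D = pairwise_euclidean(X)

print(radius_neighbors(D, 0, radius=2.0))   # neighbors of point 0 within radius 2
print(k_nearest_neighbors(D, 0, k=1))       # single nearest neighbor of point 0
```

A radius-based neighborhood can be empty or very large depending on local density, while a k-nearest neighborhood always has the same size; which one is preferable depends on the dataset.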
  18. Most non-linear techniques can be seen as finding a lower-dimensional manifold on which the datapoints lie. Similarity between datapoints is then measured along this manifold. Manifold learning is sometimes used as a synonym for dimensionality reduction.
  19. As you saw, there are different criteria for dividing techniques, and only one of the taxonomies is given here on the slide. Going further, I’ll go into more detail about PCA, ISOMap and t-SNE.
  20. And as promised, I’ll showcase these methods on some real genomics datasets.
  21. The first one is principal component analysis (PCA), applied to the Simons Diversity dataset to explore how the human population is stratified by genomic variants or mutations.
  22. The Simons Diversity dataset contains 300 genomes from 142 populations. It’s 35TB of raw sequencing data and processed data. On Seven Bridges, we host this complete dataset. We have also reproduced published studies.
  23. We know that different human populations have common phenotypes, which should also be visible on the genome level. If we take all mutations/variations from the Simons dataset and reduce them to 2 dimensions based on how variable they are across samples, we should get similar samples clustered together. We use global dissimilarity between datapoints here. What we actually did: we took only non-African samples, only one type of mutation, and only one chromosome, and we still got some nice separation in 2 dimensions! On the plot you can see different colors for different populations, but that is only for plotting; the reduction itself was done unsupervised by PCA. This was done on the SBG platform with the R/Bioconductor tool SNPRelate.
  24. PCA is a linear technique that finds the directions along which the variance of the data is maximized (the so-called eigenvectors). The eigenvectors form the basis of the matrix M here on the slide, are all mutually orthogonal/normal, and can be found as solutions to the second equation. By decomposing the initial data matrix X this way we get ordered principal components: uncorrelated features in decreasing order of contribution to the overall variance. How many principal components we take for further analysis can be determined from the second plot here: proportion of variance explained.
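
As a rough sketch of PCA via SVD (not the SNPRelate implementation used for the Simons analysis), the snippet below centers a synthetic matrix, takes its singular value decomposition, and reads off the components and the proportion of variance explained. The function name `pca_svd` and the toy data, 50 features generated from 2 hidden factors plus a little noise, are assumptions made for illustration.

```python
import numpy as np


def pca_svd(X: np.ndarray, n_components: int):
    """PCA of X (samples x features) through the SVD of the centered data."""
    Xc = X - X.mean(axis=0)                      # center every feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]               # orthonormal directions of maximal variance
    scores = Xc @ components.T                   # samples projected onto those directions
    explained_ratio = S ** 2 / (S ** 2).sum()    # proportion of variance per component
    return scores, components, explained_ratio[:n_components]


# Synthetic data with intrinsic dimensionality 2 embedded in 50 features.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 50))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 50))

scores, components, ratio = pca_svd(X, n_components=2)
print("proportion of variance explained:", ratio)
```

On data like this, the first two components carry almost all of the variance, which is the kind of elbow one looks for on a variance-explained plot.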
  25. The second case is a bit more complex, since it deals with highly non-linear data. The goal is to infer cell populations from single-cell RNA-seq data.
  26. In a single-cell RNA-seq experiment we are looking at RNA molecules and counting them in each cell. What you get as a result after processing the raw sequencing data is a matrix like the one in the lower right corner: cells as columns and the different molecules that correspond to genes as rows. As I said earlier, cells have the same DNA, but depending on cell type and function different cells have different RNAs, so we should be able to cluster cells by their expression profiles. Clustering is usually easier if done in a reduced space.
  27. Multiple-hypothesis testing -> if you torture your data enough, it will confess
  28. I have used this scikit-learn-based framework to look at different projections of one particular single-cell study. What was challenging here was to confirm cell types after reduction and clustering, and for that I looked at how well each cell’s expression profile correlated with some known molecular pathway (blue-to-red scale on the right).
  29. One non-linear technique capable of dealing with this kind of data is ISOMap, which uses geodesic distances along the manifold to model similarity between points. A vivid example of how this distance is more useful can be seen on the artificial dataset in the left image: x and y are near in the Euclidean sense, but far apart if you measure along the manifold.
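
The three ISOMap steps listed on the slide (kNN graph, shortest paths, MDS) can be sketched from scratch in NumPy. This is a simplified illustration, not the FastProject or scikit-learn implementation: it assumes no duplicate points, uses a plain Floyd-Warshall pass that only suits small datasets, and the toy data (points along a circular arc) and the function name `isomap` are chosen just for the example.

```python
import numpy as np


def isomap(X: np.ndarray, n_neighbors: int = 5, n_components: int = 2) -> np.ndarray:
    n = X.shape[0]
    # Step 1: Euclidean distances and a symmetric, weighted kNN graph.
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    nearest = np.argsort(D, axis=1)[:, 1:n_neighbors + 1]   # column 0 is the point itself
    for i in range(n):
        G[i, nearest[i]] = D[i, nearest[i]]
        G[nearest[i], i] = D[i, nearest[i]]
    # Step 2: shortest paths through the graph approximate geodesic distances.
    for k in range(n):
        G = np.minimum(G, G[:, k:k + 1] + G[k:k + 1, :])
    if np.isinf(G).any():
        raise ValueError("neighborhood graph is disconnected; increase n_neighbors")
    # Step 3: classical MDS on the geodesic distance matrix.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (G ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    top = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))


# Toy manifold: points along three quarters of a unit circle.
t = np.linspace(0.0, 1.5 * np.pi, 60)
X = np.column_stack([np.cos(t), np.sin(t)])
Y = isomap(X, n_neighbors=4, n_components=1)

euclid_ends = np.linalg.norm(X[0] - X[-1])    # straight-line distance between the arc ends
geodesic_ends = abs(Y[0, 0] - Y[-1, 0])       # distance between them after unrolling
print(euclid_ends, geodesic_ends)
```

The two ends of the arc are close in the Euclidean sense but far apart along the curve, and the one-dimensional embedding recovers that along-the-manifold ordering. The topological instability noted on the slide shows up here as sensitivity to `n_neighbors`: too large a value adds shortcut edges across the gap.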
  30. And the last one is t-SNE, used to visualize tissue-specific expression profiles.
  31. The dataset is somewhat similar to the single-cell RNA-seq one, but in this case we have RNAs from different tissues of 100s of people. We expect to see tissues separated in some lower-dimensional space.
  32. The original study used PCA and k-means in low dimensions and hierarchical clustering in the high-dimensional space, but the t-SNE reanalysis showed better separation of tissues in t-SNE space.
  33. t-SNE is a non-convex method, which means that different random initializations can give different results. Similarity between data points is not even a metric; it is the conditional probability that one point would pick another as its neighbour. In the lower-dimensional space we try to keep those probabilities, but with a t-distribution instead of a normal one. t-SNE preserves local similarity between datapoints and is a very effective and popular technique, but it is difficult to interpret.
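
The two kinds of similarity described above can be written down directly. The sketch below is a simplification: it uses one fixed Gaussian bandwidth `sigma` for every point, whereas real t-SNE tunes a bandwidth per point from the perplexity setting, and it only evaluates the cost for a random 2D layout rather than optimizing it. All function names and the two-cluster toy data are hypothetical.

```python
import numpy as np


def squared_distances(X: np.ndarray) -> np.ndarray:
    diff = X[:, None, :] - X[None, :, :]
    return (diff ** 2).sum(axis=-1)


def conditional_p(X: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """p(j|i): Gaussian-based probability that point i picks point j as its neighbor."""
    logits = -squared_distances(X) / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)                 # a point never picks itself
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)


def joint_q(Y: np.ndarray) -> np.ndarray:
    """q(i,j): heavy-tailed Student-t (1 degree of freedom) similarities in the embedding."""
    W = 1.0 / (1.0 + squared_distances(Y))
    np.fill_diagonal(W, 0.0)
    return W / W.sum()


def kl_divergence(P: np.ndarray, Q: np.ndarray, eps: float = 1e-12) -> float:
    """Cost that t-SNE minimizes: mismatch between high- and low-dimensional similarities."""
    mask = P > 0
    return float((P[mask] * np.log(P[mask] / (Q[mask] + eps))).sum())


# Toy data: two clusters in 5 dimensions and a random (unoptimized) 2D layout.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 5)), rng.normal(4, 1, (10, 5))])
P_cond = conditional_p(X)
P = (P_cond + P_cond.T) / (2 * len(X))     # symmetrized joint probabilities
Y = rng.normal(size=(len(X), 2))
Q = joint_q(Y)
cost = kl_divergence(P, Q)
print("p(j|i) symmetric?", np.allclose(P_cond, P_cond.T), "| KL cost:", cost)
```

Because p(j|i) generally differs from p(i|j), the similarity is not a metric, and because the cost is non-convex in `Y`, gradient descent from different random layouts can settle in different embeddings.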