Dimensionality reduction and visualization techniques for high-dimensional genomic data - Data Science Conference DSC3.0 talk - Dusan Randjelovic

Dimensionality reduction and visualization
techniques for high-dimensional genomic data
Dusan Ranđelović
Bioinformatics Analyst, Seven Bridges
DATA SCIENCE CONFERENCE 3.0

© 2017 Seven Bridges sevenbridges.com
AGENDA
Genomic data science
● Specifics of genomics
● Just enough cell biology
Dimensionality reduction
● Curse of dimensionality
● Use-case: Population genomics (PCA)
● Use-case: Cell populations (Isomap)
● Use-case: Tissue expression profiles (t-SNE)

Genomic data science

General data scientist:
Person who is better at statistics than any
software engineer and better at software
engineering than any statistician

Genomics vs. general data science
Source: Emmert-Streib, Moutari and Dehmer, 2016
Specifics of genomics:
- domain is crucial
- multi-omics approach
- population-scale and per-sample studies equally uncharted

Cell biology
Eukaryotic cell

Complex interplay between millions of molecules

Features:
Millions of variations
along 3*10^9 positions
Features:
10s of thousands of
gene expression values

Sequencing → Genomics

Higher dimensions
Featuring lots of features

The more the merrier?
Complex biological processes in a cell could be characterized by measuring
thousands or millions of molecules’ properties at a time (birth of genomics)
We are FORTUNATE to be able to measure so many features at once
However, when we compare measurements, or
estimate any function of measured features, there are difficulties
There is a CURSE!

Curse of dimensionality
Imagine 1, 2 or 3 dimensional feature-space...
Source: Parsons et al. KDD Explorations 2004

Imagine 1, 2 or 3 dimensional feature-space...
Source: Clarke R, et al.: The properties of high dimensional data spaces: implications for exploring gene and protein expression data. Nat Rev Cancer 8: 37-49
10 features: 0.24% !

Now imagine 10, 20, 1000… dimensional space
- sparsity introduced
- locality broken
- # samples needed grows exponentially with # features
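
The "locality broken" point can be illustrated with a small experiment: as the number of features grows, the nearest and the farthest neighbour of a query point end up at almost the same distance. This is a minimal sketch on uniform random data (an assumption, not genomic data); `relative_contrast` is a hypothetical helper name.

```python
import numpy as np

def relative_contrast(n_features, n_samples=500, seed=0):
    """(max - min) / min Euclidean distance from one random query point
    to n_samples uniform random points in the unit hypercube."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_samples, n_features))
    query = rng.random(n_features)
    dist = np.linalg.norm(points - query, axis=1)
    return (dist.max() - dist.min()) / dist.min()

# The contrast between nearest and farthest neighbour collapses with dimension.
for d in (2, 10, 100, 1000):
    print(d, round(relative_contrast(d), 2))
```

When the contrast approaches zero, "nearest neighbour" stops being a meaningful notion, which is why methods built on local neighbourhoods struggle in the raw feature space.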

Dimensionality reduction

Reduction of dimensionality – the Why?
Reduce # of features for further (un)supervised learning
- feature selection or feature engineering
- detecting intrinsic dimensionality
Lower computational demand
- lower memory footprint
- compression, scalability
Exploratory data analysis technique
Projections that improve signal-to-noise ratio for specific effect
Example: images of 64x64 pixel values (4096 features) that vary along just 2 intrinsic dimensions: scale + rotation

Reduction of dimensionality – the How?
Dimensionality reduction:
…which retains geometry of the data as much as possible (van der Maaten, 2009).

Reduction of dimensionality – the How?
Taxonomy of methods:
- Properties of data / nature of mapping: Linear vs. non-linear
- Objective function properties: convex vs. non-convex
- Properties to preserve: global vs. local
As in classification or clustering, we need:
- Similarity measure between datapoints

Similarity: neighborhood and distances
Source: doi=10.1.1.154.8446
Distance is metric when:
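
The standard metric axioms are non-negativity, identity of indiscernibles (distance is zero only between identical points), symmetry, and the triangle inequality. The sketch below checks them numerically on a small set of points; `is_metric` and the toy points are hypothetical, and a finite check of this kind can refute a candidate distance but cannot prove that it is a metric.

```python
import itertools
import math

def is_metric(dist, points, tol=1e-12):
    """Check the four metric axioms for `dist` on a finite set of points."""
    for x, y in itertools.product(points, repeat=2):
        d = dist(x, y)
        if d < 0:                          # non-negativity
            return False
        if (d <= tol) != (x == y):         # zero distance only for identical points
            return False
        if abs(d - dist(y, x)) > tol:      # symmetry
            return False
    for x, y, z in itertools.product(points, repeat=3):
        if dist(x, z) > dist(x, y) + dist(y, z) + tol:   # triangle inequality
            return False
    return True

def euclidean(x, y):
    return math.dist(x, y)

def squared_euclidean(x, y):
    return math.dist(x, y) ** 2

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 1.0)]
print(is_metric(euclidean, points))          # Euclidean distance passes
print(is_metric(squared_euclidean, points))  # squared distance breaks the triangle inequality
```

Squared Euclidean distance is a useful counterexample: it is a perfectly good dissimilarity, but three collinear points at 0, 1 and 2 give 4 > 1 + 1.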

Non-linear reduction: Manifold learning

Common techniques
+ SNE, t-SNE
Source: van der Maaten, 2009: Dimensionality Reduction: A Comparative Review

Genomics use-cases
Population variations
Infer cell populations
Tissue classification
Source: 2D Representation of Transcriptomes by t-SNE Exposes Relatedness between Human Tissues
Source: Simons dataset @ SBG Platform

Principal component analysis (PCA)
Use-case:
Population variations – Simons Diversity dataset

Simons Diversity dataset
300 genomes
142 diverse populations
35TB raw + processed
Sample analysis @SBG →

Simons Diversity PCA
SNPRelate 1.10.1 Bioconductor tool
PCA done on non-African samples, on chromosome 6 only, SNPs only
→ different populations have variations in the genome with similar frequencies

Principal component analysis (PCA)
Linear technique that finds directions along which variance of the
data is maximized (eigenvectors)
Algorithm: finds the components of the linear mapping M that
maximize variance or, equivalently, minimize reconstruction
error, usually via SVD
Related: ICA, MDS, other generalizations of PCA
Drawback: retains only global dissimilarities
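
The SVD route to PCA can be sketched in a few lines of NumPy. This is a minimal illustration, not the SNPRelate implementation used for the Simons analysis; `pca_svd` is a hypothetical name and the toy matrix (two groups of samples that differ in their first 10 features) is made up for the example.

```python
import numpy as np

def pca_svd(X, n_components=2):
    """Project the rows of X onto the top principal components via SVD."""
    X_centered = X - X.mean(axis=0)            # PCA requires centered features
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]             # directions of maximal variance
    scores = X_centered @ components.T         # low-dimensional coordinates
    explained_variance = S ** 2 / (X.shape[0] - 1)
    return scores, components, explained_variance[:n_components]

# Hypothetical toy data: 100 samples x 50 features, two groups of 50 samples.
rng = np.random.default_rng(0)
group_a = rng.normal(size=(50, 50))
group_b = rng.normal(size=(50, 50))
group_b[:, :10] += 3.0                         # group B is shifted in 10 features
X = np.vstack([group_a, group_b])

scores, components, variance = pca_svd(X)
print(scores.shape, variance)
```

Because the group difference is the largest source of variance in this toy matrix, the first component separates the two groups, which is the same effect that separates populations in a genotype PCA.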

Isomap – nonlinear mapping,
preserves geodesic distances
Use-case:
Infer cell populations from single-cell RNA-seq

Single-cell RNA-seq
Assess relative abundance of RNA molecules from 100s of cells
NOTE: cells have same DNA, but express different genes (transcribe different RNAs)
Expression profiles should correspond to cell types

Shalek, Satija et al. 2014
FastProject: Framework built on scikit-learn to do multiple projections and
test for correspondence with known molecular pathways

Isomap
Dynamics of gene expression and gene regulatory networks are non-linear
PCA and even Euclidean distances are not adequate for such data
Geodesic distance along the manifold -> better data similarity
Algorithm: 1. kNN + weighted graph, 2. Shortest path, 3. MDS
Related: MDS, other spectral nonlinear techniques
Drawback: Topological instability
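
The three algorithm steps listed above (kNN graph, shortest paths, MDS) can be sketched directly in NumPy. This is a minimal, unoptimized illustration under stated assumptions: the function name `isomap`, the spiral toy data and the choice of 6 neighbours are all hypothetical, and the dense Floyd-Warshall loop is only practical for a few hundred points.

```python
import numpy as np

def isomap(X, n_neighbors=6, n_components=2):
    n = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))           # pairwise Euclidean distances

    # 1. kNN graph weighted by Euclidean distance (inf = no edge)
    graph = np.full((n, n), np.inf)
    np.fill_diagonal(graph, 0.0)
    neighbors = np.argsort(dist, axis=1)[:, 1:n_neighbors + 1]   # skip self
    rows = np.repeat(np.arange(n), n_neighbors)
    graph[rows, neighbors.ravel()] = dist[rows, neighbors.ravel()]
    graph = np.minimum(graph, graph.T)                  # make the graph undirected

    # 2. Shortest paths (Floyd-Warshall) approximate geodesic distances
    for k in range(n):
        graph = np.minimum(graph, graph[:, k, None] + graph[None, k, :])
    if np.isinf(graph).any():
        raise ValueError("kNN graph is disconnected; increase n_neighbors")

    # 3. Classical MDS on the geodesic distance matrix
    J = np.eye(n) - np.ones((n, n)) / n                 # centering matrix
    B = -0.5 * J @ (graph ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))

# Hypothetical toy data: points along one turn of a spiral (a 1D manifold in 2D).
t = np.linspace(np.pi, 3 * np.pi, 150)
X = np.column_stack([t * np.cos(t), t * np.sin(t)])
embedding = isomap(X, n_neighbors=6, n_components=1)

# Arc length along the spiral, for comparison with the recovered coordinate.
arc = np.concatenate([[0.0], np.cumsum(np.linalg.norm(np.diff(X, axis=0), axis=1))])
print(abs(np.corrcoef(embedding[:, 0], arc)[0, 1]))
```

The single recovered coordinate tracks position along the spiral rather than straight-line position in the plane. The topological instability noted above shows up when `n_neighbors` is large enough for the graph to jump between arms of the spiral, which shortcuts the geodesics.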

t-Distributed Stochastic Neighbor Embedding
(t-SNE)
Use-case:
Tissue expression profiles – GTEx dataset

GTEx dataset
● Genotype-Tissue Expression (DNA+RNA)
● V7 data: 53 tissues, 714 donors, 11688 samples
● > 50,000 quantified RNA molecules (features)
Source: http://www.gtexportal.org/home/documentationPage

GTEx analysis
Original study: Science, 2015
t-SNE reanalysis: PLOS, 2016

t-SNE
Non-convex technique (random initializations could produce different results)
Similarity between data points is modeled as a conditional probability
The 2D/3D embedding preserves these probabilities, but models them with a t-distribution rather than a normal distribution
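
The quantities t-SNE works with can be written down compactly: Gaussian conditional probabilities in the original space, Student-t similarities in the embedding, and the KL divergence between the two as the cost. The sketch below only evaluates that cost for a random embedding. It assumes a single fixed `sigma` for all points (real t-SNE tunes one per point from the perplexity) and performs no optimization; all function names are hypothetical.

```python
import numpy as np

def squared_distances(Z):
    sq = (Z ** 2).sum(axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2 * Z @ Z.T, 0.0)

def conditional_p(X, sigma=3.0):
    """p(j|i): Gaussian similarity of point j as seen from point i."""
    logits = -squared_distances(X) / (2 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)               # a point is not its own neighbour
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

def joint_q(Y):
    """q(i,j): heavy-tailed Student-t (1 degree of freedom) similarity in the embedding."""
    W = 1.0 / (1.0 + squared_distances(Y))
    np.fill_diagonal(W, 0.0)
    return W / W.sum()

def kl_divergence(P, Q):
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 20))        # hypothetical high-dimensional data
Y = rng.normal(size=(30, 2))         # random 2D embedding, not yet optimized

P_cond = conditional_p(X)
P = (P_cond + P_cond.T) / (2 * X.shape[0])   # symmetrized joint probabilities
cost = kl_divergence(P, joint_q(Y))
print(cost)
```

t-SNE then moves the points in `Y` by gradient descent to reduce this cost. Since the cost is non-convex, different random starting embeddings can end in different layouts.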

Some thoughts for takeaway

Dimensionality reduction implementations
Standard sklearn’s fit_transform paradigm
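
A minimal sketch of that paradigm, using the three techniques covered in the talk. The data matrix is a hypothetical stand-in for an expression matrix (60 samples, 50 features, 3 groups), and the parameter values are illustrative rather than recommendations.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, TSNE

# Hypothetical toy data: 3 groups of 20 samples around different centers.
rng = np.random.default_rng(0)
centers = rng.normal(scale=5.0, size=(3, 50))
X = np.vstack([center + rng.normal(size=(20, 50)) for center in centers])

# Every reducer exposes the same fit_transform(X) interface.
reducers = {
    "PCA": PCA(n_components=2),
    # n_neighbors exceeds the group size so the neighbour graph stays connected
    "Isomap": Isomap(n_neighbors=25, n_components=2),
    "t-SNE": TSNE(n_components=2, perplexity=10, init="pca", random_state=0),
}
embeddings = {name: reducer.fit_transform(X) for name, reducer in reducers.items()}

for name, embedding in embeddings.items():
    print(name, embedding.shape)
```

Because the interface is shared, swapping one method for another is a one-line change, which makes side-by-side comparison of projections cheap.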

Get to know your data
Even better → learn about data-generation processes
Make hypotheses about relations in dataset
Even better → test them and incorporate learned relations
Compare methods and measure fitness
Even better → Visualize

Have fun & thank you!

Questions?
