Gene Expression Data Analysis

Analysis of Gene Expression Data

Jhoirene B. Clemente
Algorithms and Complexity Lab
University of the Philippines Diliman

Overview

● Definitions
● Clustering of Gene Expression Data
● Visualizations of Gene Expression Data

Definitions
Gene
Basic unit of heredity in a living organism.
It is normally a stretch of DNA that codes
for a type of protein or for an RNA chain
that has a function in the organism.

Gene Expression Data
Expression levels of the genes in an individual,
measured with a microarray.

Definitions

[Table: columns "Gene" and "Gene Expression"; rows a, b, c, ..., n]

Definitions
Gene Expression Data: 1 Sample

[Table: columns "Gene" and "Gene Expression"; rows a, b, c, ..., n; side label "n Samples"]

Definitions
(n x m) Data Matrix: m Samples

[Table: column "Gene" followed by one column per sample, Sample 1 through Sample m; rows a, b, c, ..., n; side label "n Samples"]
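
As a minimal sketch, the (n x m) data matrix can be held as a list of n rows, one per gene, each with m expression values, one per sample. The gene names and values below are made up for illustration and do not come from the data set used later.

```python
# Hypothetical toy data: 3 genes (rows) measured in 4 samples (columns).
genes = ["a", "b", "c"]
data = [
    [0.2, 0.8, -0.1, 0.5],   # expression of gene a in samples 1..4
    [-0.6, 0.1, 0.3, 0.9],   # gene b
    [0.0, -0.4, 0.7, -0.2],  # gene c
]

n = len(data)     # number of genes
m = len(data[0])  # number of samples

# Expression profile of one gene across all samples (a row).
profile_b = data[genes.index("b")]

# Expression of every gene in one sample (a column); sample 1 is index 0.
sample_1 = [row[0] for row in data]
```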

Clustering

Clustering is the unsupervised classification of
patterns (observations, data sets, or feature
vectors) into groups called clusters, such that
objects in the same cluster are as similar to each
other as possible, while objects in different
clusters are as dissimilar as possible.

Cluster Analysis
Preprocessing
● Filtering
● Normalization

Clustering

Analysis

Clustering
Partitional
● K-means Algorithm
● X-means Algorithm

Hierarchical

Clustering
Given the (n x m) data matrix, we can

● Cluster the set of genes
● Cluster the set of samples
● Cluster the set of genes and samples
simultaneously.
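
One way to read the first two options, sketched under the assumption that the matrix is stored as a list of rows with one row per gene: clustering genes treats each row as a data point, while clustering samples treats each column as a data point, which is the same as clustering the rows of the transposed matrix. The `transpose` helper and the toy values are illustrative only, and simultaneous clustering of genes and samples is not covered here.

```python
def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]

# Hypothetical toy matrix: 2 genes (rows) x 3 samples (columns).
gene_profiles = [
    [0.1, 0.5, -0.3],
    [0.4, -0.2, 0.6],
]

# Clustering genes: each data point is a row of length m.
points_for_gene_clustering = gene_profiles

# Clustering samples: each data point is a column of length n,
# i.e. a row of the transposed matrix.
points_for_sample_clustering = transpose(gene_profiles)
```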

Data Set
The data set is time-series gene expression data from
a synchronized population of yeast.

Preprocessing
Filtering
● Removed genes not involved in cell cycle regulation
● Removed genes belonging to more than one group

Normalization
● All gene expression values range from -1.0 to 1.0.
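
The slides do not say which normalization was applied, only that the resulting values lie between -1.0 and 1.0. As one possible sketch, assuming each gene is scaled independently by its largest absolute value, the step could look like this. The function names and the toy values are hypothetical.

```python
def scale_to_unit_range(profile):
    """Divide a gene's profile by its largest absolute value.

    The result lies in [-1.0, 1.0]. An all-zero profile is returned unchanged.
    """
    peak = max(abs(value) for value in profile)
    if peak == 0:
        return list(profile)
    return [value / peak for value in profile]

def normalize(matrix):
    """Scale every gene (row) independently."""
    return [scale_to_unit_range(row) for row in matrix]

# Hypothetical raw expression values for 2 genes in 3 samples.
raw = [
    [2.0, -4.0, 1.0],
    [0.5, 0.25, -0.1],
]
normalized = normalize(raw)
```

Other schemes, such as centering each gene before scaling, would also produce values in this range; the choice here is only for illustration.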

Data Set
Data matrix (384 genes and 17 samples) with 5
classifications.
Groupings are based on cell cycle phase activation.

Data Set
● Group 1: Resting Phase
● Group 2: First Growth Phase
● Group 3: Synthesis Phase
● Group 4: Second Growth Phase
● Group 5: Cell Division

Clustering of genes
K-means Algorithm

Given n data points in R^d
1. Choose k initial centers, one for each of the k clusters
2. Assign every data point to the cluster with the nearest center
(Euclidean distance, Manhattan distance, etc.)
3. Recompute the k centers
4. Repeat steps 2 and 3 until convergence
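
The four steps can be sketched in plain Python as follows. This is an illustrative version, not the implementation used for the results in these slides: it assumes Euclidean distance, takes the first k points as initial centers, and treats "convergence" as the assignment no longer changing. The toy points are made up.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(coords) / len(points) for coords in zip(*points)]

def k_means(points, k, max_iter=100):
    # Step 1: pick k initial centers (here simply the first k points).
    centers = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every point to the cluster with the nearest center.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centers[c]))
            for p in points
        ]
        # Step 4: stop once the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = mean(members)
    return labels, centers

# Hypothetical 2-D toy data with two obvious groups.
points = [[0.0, 0.0], [5.0, 5.0], [0.0, 1.0], [5.0, 6.0], [1.0, 0.0]]
labels, centers = k_means(points, k=2)
```

For gene clustering, each point would be one gene's expression profile, that is, one row of the data matrix.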

Clustering of genes
K-means Algorithm

k = 5, since we want to approximate the 5
classifications in the data set.

Clustering of genes
Initialization

1. Choose the first k centers that will maximize the
distance between the clusters
2. Sort the distances between all the data points
and then choose the k initial points at constant
intervals from the sorted list
3. Use the first k points in the data set as the first k
centers
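
Two of these options can be sketched as follows. `first_k` is option 3 as stated. `farthest_first` is a common greedy heuristic that spreads the centers apart; it is offered only as one possible reading of option 1, not as the method used in these slides. Option 2 is left out because the slides do not say which distances are sorted. All names and toy points are hypothetical.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def first_k(points, k):
    """Option 3: use the first k points of the data set."""
    return [list(p) for p in points[:k]]

def farthest_first(points, k):
    """Greedy heuristic in the spirit of option 1 (an assumption).

    Start from the first point, then repeatedly add the point whose
    distance to its nearest already-chosen center is largest.
    """
    centers = [list(points[0])]
    while len(centers) < k:
        next_center = max(
            points,
            key=lambda p: min(euclidean(p, c) for c in centers),
        )
        centers.append(list(next_center))
    return centers

# Hypothetical 2-D toy data.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [5.0, 5.0]]
spread_centers = farthest_first(points, 3)
simple_centers = first_k(points, 2)
```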

Clustering of genes
Using k-means clustering with k = 5

Clustering of genes
● Clustering may suggest possible roles for genes
with unknown functions.
● Clustering the samples or experiments may shed
light on new subtypes of diseases.
● It may help identify which type of treatment is
suited for a specific type of cancer.
● It may support building genetic networks.

Visualization
Vector Fusion
Non-metric Multidimensional Scaling (nMDS)
Principal Components Analysis (PCA)

Vector fusion
Visualization technique that uses the Single point
broken line parallel algorithm

nMDS visualization
Input: dissimilarity matrix Δ = [δ_ij] (actual distance)
● In nMDS, only the rank order of the entries is
assumed to contain the significant information.
● Thus, the purpose of the non-metric MDS
algorithm is to find a configuration of points
whose distances reflect the rank order of the
data as closely as possible.
● The transformation uses a nonparametric
function f (monotone regression).

d̂_ij = f(δ_ij) (pseudo-distance)
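
The monotone regression step can be sketched with the pool-adjacent-violators algorithm, a standard way to compute a least-squares non-decreasing fit. The slides do not name a specific procedure, so this choice is an assumption. The sketch follows the usual Kruskal-style formulation, in which the pseudo-distances are fitted to the distances of the current point configuration while respecting the rank order of the input dissimilarities. Function names and numbers are illustrative.

```python
def monotone_regression(values):
    """Least-squares non-decreasing fit via pool-adjacent-violators."""
    # Each block stores [sum of values, count]; its fitted value is the mean.
    blocks = []
    for v in values:
        blocks.append([v, 1])
        # Merge while the previous block's mean exceeds the last block's mean.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted

def pseudo_distances(dissimilarities, distances):
    """Fit values that follow the rank order of the dissimilarities.

    Both arguments are flat lists over the same pairs (i, j).
    """
    order = sorted(range(len(dissimilarities)), key=lambda i: dissimilarities[i])
    fitted = monotone_regression([distances[i] for i in order])
    result = [0.0] * len(distances)
    for rank, i in enumerate(order):
        result[i] = fitted[rank]
    return result

# Hypothetical example with four pairs of points.
dissimilarities = [1.0, 2.0, 3.0, 4.0]  # input dissimilarities per pair
distances = [1.0, 3.0, 2.0, 4.0]        # distances in the current configuration
d_hat = pseudo_distances(dissimilarities, distances)
```

In a full nMDS procedure this fit would alternate with moving the points to reduce the mismatch between distances and pseudo-distances; only the regression step is shown here.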

References
Clemente J., Salido J.A. (2010). "Non-Metric Multidimensional Scaling and
Vector Fusion Visualization of Cell Cycle Independent Gene Expressions for
Gene Function Analysis". Proceedings of the National Conference on
Information Technology for Education (NCITE) 2010, and Philippine IT
Journal, February 2011 issue.

Clemente J. (2010). "Cluster Analysis for Identifying Genes Highly
Correlated with a Phenotype". Undergraduate thesis, Department of Computer
Science, University of the Philippines Diliman.
