Analysis of
Gene Expression Data
     _______________________

            Jhoirene B. Clemente
       Algorithms and Complexity Lab
     University of the Philippines Diliman
Overview

● Definitions
● Clustering of Gene Expression Data
● Visualizations of Gene Expression Data
Definitions
Gene
Basic unit of heredity in a living organism.
It is normally a stretch of DNA that codes
for a type of protein or for an RNA chain
that has a function in the organism.

Gene Expression Data
Expression levels of the genes of an individual,
measured using a microarray.
Gene Expression Data (1 sample)

    Gene    Gene Expression
    a
    b
    c
    ...
    n

One sample gives one expression value for each of the n genes.
(n x m) Data Matrix: n genes (rows), m samples (columns)

    Gene   Sample 1   Sample 2   .....   Sample m
    a
    b
    c
    ...
    n
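A minimal sketch of how such an (n x m) matrix might be held in memory, assuming plain Python lists. The gene names, sample names, values, and the helper names `gene_profile` and `sample_profile` are made up for illustration and do not come from the data set used in this talk.

```python
# Hypothetical toy data: n = 3 genes (rows), m = 4 samples (columns).
genes = ["a", "b", "c"]
samples = ["Sample 1", "Sample 2", "Sample 3", "Sample 4"]
matrix = [
    [0.2, 0.5, -0.1, 0.9],   # expression of gene a in each sample
    [-0.7, 0.0, 0.3, 0.4],   # gene b
    [0.6, -0.2, -0.8, 0.1],  # gene c
]

n, m = len(matrix), len(matrix[0])


def gene_profile(name):
    """Return the row of expression values of one gene across all samples."""
    return matrix[genes.index(name)]


def sample_profile(j):
    """Return the column of expression values of sample j (0-based) across all genes."""
    return [row[j] for row in matrix]
```

Each row is the profile of one gene across the samples, and each column is the profile of one sample across the genes.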
Clustering

Clustering is the unsupervised classification of
patterns (observations, data items, or feature
vectors) into groups called clusters,
such that objects in the same cluster are similar to
each other while objects in different clusters are
as dissimilar as possible.
Cluster Analysis
1. Preprocessing
    ● Filtering
    ● Normalization
2. Clustering
3. Analysis
Clustering
Partitional
●   K-means Algorithm
●   X-means Algorithm



Hierarchical
Clustering
Given the (n x m) data matrix, we can

●   Cluster the set of genes
●   Cluster the set of samples
●   Cluster the set of genes and samples
    simultaneously.
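A small sketch of the first two options, assuming the matrix is stored as a list of rows with one row per gene. Clustering samples instead of genes then only requires transposing the matrix before running the same algorithm. The helper name `transpose` and the toy values are hypothetical. Clustering genes and samples simultaneously needs a dedicated method and is not shown here.

```python
def transpose(matrix):
    """Swap rows and columns so that the samples become the objects to cluster."""
    return [list(col) for col in zip(*matrix)]


# Hypothetical 3 x 2 matrix: 3 genes (rows), 2 samples (columns).
by_gene = [[0.1, 0.9], [0.2, 0.8], [-0.5, 0.4]]

# 2 x 3 matrix: each row is now one sample described by its 3 gene values.
by_sample = transpose(by_gene)
```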
Data Set
The data set is time-series gene expression data from
a synchronized population of yeast.
Preprocessing
Filtering
 ● Removed genes not involved in cell cycle regulation
 ● Removed genes belonging to more than one group

Normalization
 ● All gene expression values range from -1.0 to 1.0.
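The slides do not state how the values were brought into the range -1.0 to 1.0. One common option, shown here purely as an assumption, is a linear min-max rescaling of each gene's profile. The function name `scale_to_unit_range` and the sample values are hypothetical.

```python
def scale_to_unit_range(values):
    """Linearly rescale numbers so the minimum maps to -1.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant profile has no variation to scale; map it to 0.0.
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


raw = [10.0, 20.0, 15.0, 30.0]     # hypothetical raw expression values of one gene
scaled = scale_to_unit_range(raw)  # [-1.0, 0.0, -0.5, 1.0]
```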
Data Set
Data matrix (384 genes and 17 samples) with 5
classifications.
Groupings based from cell cycle phase activation.
Data Set
Group 1: Resting Phase
Group 2: First Growth Phase
Group 3: Synthesis Phase
Group 4: Second Growth Phase
Group 5: Cell Division
Clustering of genes
K-means Algorithm

Given n data points in R^d
1. Choose k initial centers, one for each of the k clusters
2. Assign each data point to the cluster with the nearest center
   (Euclidean distance, Manhattan distance, etc.)
3. Recompute each of the k centers as the mean of the points assigned to it
4. Repeat steps 2 and 3 until convergence (the assignments no longer change)
k = 5, since we want to approximate the 5
classifications of the data set.
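The four steps of the k-means algorithm can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code used for the results in this talk: it uses Euclidean distance, takes the initial centers as an argument, and stops when the assignments no longer change. The names `kmeans` and `euclidean` and the toy points are hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, centers, max_iter=100):
    """Basic k-means; `centers` holds the k initial centers (step 1)."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to the cluster with the nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda c: euclidean(p, centers[c]))
            for p in points
        ]
        # Step 4: converged when no assignment changed.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each center to the mean of its assigned points.
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Toy example with k = 2, using the first two points as initial centers.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centers = kmeans(points, centers=points[:2])
```

For the data set in this talk, `points` would hold the 384 gene profiles of length 17 and `centers` would hold 5 initial centers.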
Clustering of genes
Initialization

Possible ways to choose the k initial centers:
1. Choose the k centers that maximize the
   distance between the clusters
2. Sort the distances between all the data points
   and then choose the k initial points at constant
   intervals from the sorted list
3. Use the first k points in the data set as the k
   centers
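The three initialization options can be sketched as below. The slides leave the details open, so two readings here are assumptions: option 1 is approximated with a farthest-first traversal, and option 2 sorts the points by their distance from the overall mean before sampling at constant intervals. All function names are hypothetical, and each function assumes k is at most the number of points.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def init_farthest_first(points, k):
    """Option 1 (assumed reading): greedily pick centers that are far apart."""
    centers = [list(points[0])]
    while len(centers) < k:
        # Take the point whose distance to its closest chosen center is largest.
        nxt = max(points, key=lambda p: min(euclidean(p, c) for c in centers))
        centers.append(list(nxt))
    return centers


def init_sorted_intervals(points, k):
    """Option 2 (assumed reading): sort by distance from the mean, sample evenly."""
    mean = [sum(col) / len(points) for col in zip(*points)]
    ordered = sorted(points, key=lambda p: euclidean(p, mean))
    step = len(ordered) // k
    return [list(ordered[i * step]) for i in range(k)]


def init_first_k(points, k):
    """Option 3: use the first k points of the data set."""
    return [list(p) for p in points[:k]]


points = [[0.0, 0.0], [1.0, 0.0], [9.0, 0.0], [10.0, 0.0]]
```

Since k-means only converges to a local optimum, different initial centers can lead to different final clusters.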
Clustering of genes
Using k-means clustering, with k = 5
Clustering of genes
●   Clustering genes may suggest possible roles for
    genes with unknown functions.
●   Clustering the samples or experiments may shed
    light on new subtypes of diseases.
●   It may help identify which type of treatment is
    suited to a specific type of cancer.
●   It can support building genetic networks.
Visualization
● Vector Fusion
● Non-metric Multidimensional Scaling (nMDS)
● Principal Components Analysis (PCA)
Vector Fusion
A visualization technique that uses the single point
broken line parallel algorithm.
nMDS visualization
Input: dissimilarity matrix $[\delta_{ij}]$ (actual distances)
 ● In nMDS, only the rank order of the entries is
   assumed to contain the significant information.
 ● Thus, the purpose of the non-metric MDS
   algorithm is to find a configuration of points
   whose distances reflect as closely as possible
   the rank order of the data.
 ● The transformation uses a non-parametric
   monotone function f (monotone regression):

             $\hat{d}_{ij} = f(\delta_{ij})$   (pseudo-distance)
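The monotone regression step can be illustrated with the pool-adjacent-violators procedure, a standard way to compute a monotone least-squares fit. This is a sketch under assumptions, not the implementation behind the figures in this talk. Given the input dissimilarities and the distances of the current point configuration, it returns pseudo-distances that follow the rank order of the dissimilarities. The function names and toy values are hypothetical, and the `stress` function uses Kruskal's stress-1 as one common measure of fit.

```python
def monotone_regression(dissimilarities, distances):
    """Fit pseudo-distances that are non-decreasing in the rank order of the dissimilarities."""
    order = sorted(range(len(dissimilarities)), key=lambda i: dissimilarities[i])
    blocks = []  # each block stores [sum of distances, number of entries]
    for i in order:
        blocks.append([distances[i], 1])
        # Pool adjacent blocks while their means violate the rank order.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted_sorted = []
    for total, count in blocks:
        fitted_sorted.extend([total / count] * count)
    # Put the fitted values back in the original order of the input pairs.
    fitted = [0.0] * len(distances)
    for rank, i in enumerate(order):
        fitted[i] = fitted_sorted[rank]
    return fitted


def stress(distances, pseudo):
    """Kruskal's stress-1: how far the distances are from the pseudo-distances."""
    num = sum((d - p) ** 2 for d, p in zip(distances, pseudo))
    den = sum(d ** 2 for d in distances)
    return (num / den) ** 0.5


# Toy example: four point pairs, listed by increasing dissimilarity.
dissimilarities = [1.0, 2.0, 3.0, 4.0]
distances = [1.0, 3.0, 2.0, 4.0]  # distances in the current configuration
pseudo = monotone_regression(dissimilarities, distances)  # [1.0, 2.5, 2.5, 4.0]
```

An nMDS algorithm alternates between this step and moving the points to reduce the stress. A stress of 0 means the configuration distances already respect the rank order of the data.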
PCA visualization (figure)
Vector fusion visualization (figure)
nMDS visualization (figures)
References
Clemente J., Salido J.A. (2010). "Non-Metric Multidimensional
Scaling and Vector Fusion Visualization of Cell Cycle
Independent Gene Expressions for Gene Function Analysis".
Proceedings of the National Conference on Information
Technology for Education (NCITE) 2010, and Philippine IT
Journal, Feb 2011 issue.

Clemente J. (2010). "Cluster Analysis for Identifying Genes
Highly Correlated with a Phenotype". Undergraduate thesis,
Department of Computer Science,
University of the Philippines Diliman.
Thank you for listening
