Data preprocessing

Data preprocessing in the normalization of gene expression data and the selection of cancer
related genes for clustering analysis
Introduction
The analysis of proteins and messenger RNA is commonly used to compare gene expression patterns
in tissues or cells of different types and under distinct conditions. In gene expression analysis,
normalization is a critical step because the validity of downstream analyses, and the accuracy of
the inferences drawn from them, depend on it. Data preprocessing is therefore an indispensable step
in the extraction and normalization of microarray gene expression data. A number of normalization
methods are also employed in high-throughput sequencing studies. Preprocessing begins with a
careful analysis of the gene expression data and usually involves the summarization of many raw
signal intensities into one expression value. The Robust Multiarray Average (RMA) is a
normalization approach for microarrays that involves background correction, normalization and
summarization of probe-level information without using mismatch (MM) probes (Lim et al., 2007).
It is one of the most commonly used preprocessing algorithms for creating an expression matrix
from Affymetrix data. Raw intensity values are first background corrected and log2 transformed,
and are then normalized. To generate an expression measure for each probe set on each array, a
linear model is fitted to the normalized data.
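The log2 transformation and normalization steps described above can be sketched in a few lines of Python. The sketch uses quantile normalization, which the Discussion names as one of the key elements of RMA. The function names and the toy intensities are illustrative assumptions, background correction is omitted, and ties are broken by row order rather than averaged, so this is not a substitute for an established RMA implementation.

```python
import math

def log2_transform(matrix):
    """Log2-transform a probes x arrays matrix of positive intensities."""
    return [[math.log2(v) for v in row] for row in matrix]

def quantile_normalize(matrix):
    """Force every array (column) to share the same empirical distribution."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[matrix[r][c] for r in range(n_rows)] for c in range(n_cols)]
    # Sort each array, then average across arrays at each rank.
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [sum(sc[i] for sc in sorted_cols) / n_cols for i in range(n_rows)]
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for c, col in enumerate(columns):
        # Replace each value with the mean of the values that share its rank.
        order = sorted(range(n_rows), key=lambda r: col[r])
        for rank, r in enumerate(order):
            result[r][c] = rank_means[rank]
    return result

# Hypothetical raw intensities: 3 probes (rows) x 3 arrays (columns).
raw = [[4.0, 8.0, 16.0],
       [2.0, 32.0, 4.0],
       [16.0, 2.0, 64.0]]
normalized = quantile_normalize(log2_transform(raw))
```

After this step every column of `normalized` contains the same set of values, only in a different order, which is what makes the arrays directly comparable.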
Methods
It is important to note that RMA processes all arrays simultaneously and therefore uses a lot of
memory. Another commonly used normalization method is MAS 5.0, an Affymetrix algorithm designed
to produce gene expression signal values. This method involves a series of procedures including
background correction, the calculation of probe summaries and scaling. Gene expression data from
biological samples are a critical building block in solving a wide range of problems in
bioinformatics, such as the study of cancer. In the selection of cancer related genes, a wide
range of clustering methods are used.
Deep gene selection (DGS) is a procedure for selecting cancer related genes for clustering. DGS is
an effective gene-selection algorithm that can perform well in terms of both computational cost
and classification accuracy. Human microbial clustering and the quantification of microbiome
composition can be achieved by using 16S rRNA technology to generate sequence data
(Pascal et al., 2017).
Discussion
Given the popularity of Affymetrix microarrays, it is very important that normalization is applied
appropriately. The RMA method has been shown to be well suited to datasets that cover a wide range
of groups of biological samples. It has two key elements: quantile normalization and median
polish. Median polish is an approach for extracting row and column effects from a two-way table
by making use of medians. Both the summarization and the quantile normalization steps of RMA are
cohort-based, that is, they are computed across all arrays in the dataset (Likas et al., 2003).
Performing RMA with quantile normalization has, however, been found to mix biological signals
between groups.
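The median polish step mentioned above can be sketched as follows. This is a simplified version of Tukey's procedure, assuming a small table of already normalized log2 intensities with probes as rows and arrays as columns. The function name, the iteration limit and the toy values are illustrative assumptions and are not taken from any RMA implementation.

```python
from statistics import median

def median_polish(table, max_iter=10, tol=1e-9):
    """Split table[i][j] into overall + row[i] + col[j] + residual[i][j]."""
    n_rows, n_cols = len(table), len(table[0])
    resid = [list(row) for row in table]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(max_iter):
        # Sweep the row medians out of the residuals.
        row_med = [median(r) for r in resid]
        for i in range(n_rows):
            row_eff[i] += row_med[i]
            resid[i] = [v - row_med[i] for v in resid[i]]
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        # Sweep the column medians out of the residuals.
        col_med = [median([resid[i][j] for i in range(n_rows)]) for j in range(n_cols)]
        for j in range(n_cols):
            col_eff[j] += col_med[j]
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
        if max(abs(v) for v in row_med + col_med) < tol:
            break
    return overall, row_eff, col_eff, resid

# Hypothetical probe set: 3 probes x 3 arrays, with one outlying probe value (50.0).
probes = [[10.0, 20.0, 30.0],
          [11.0, 21.0, 31.0],
          [12.0, 22.0, 50.0]]
overall, row_eff, col_eff, resid = median_polish(probes)
expression = [overall + c for c in col_eff]  # one summary value per array
# → [11.0, 21.0, 31.0]
```

Because medians are used instead of means, the outlying value ends up in the residuals (here `resid[2][2]` is 18.0) and does not distort the per-array summary.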
Clustering methods
Hierarchical clustering

An example of a gene selection technique based on clustering is the hierarchical clustering method.
The most popular form is agglomerative hierarchical clustering, which groups objects into clusters
using a bottom-up approach. The algorithm begins with each data point in its own cluster and then
gradually merges these clusters into larger ones, which makes it suitable for the selection of
cancer related genes for clustering analysis.
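The bottom-up merging described above can be illustrated with a minimal sketch. It assumes Euclidean distance and average linkage, neither of which is specified in the text, and the function name and the toy expression profiles are hypothetical. The implementation favours readability over speed and is only meant for small inputs.

```python
import math

def agglomerative(points, n_clusters):
    """Bottom-up clustering with average linkage; returns lists of point indices."""
    clusters = [[i] for i in range(len(points))]  # every point starts alone

    def linkage(a, b):
        # Average pairwise Euclidean distance between two clusters.
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge it.
        _, x, y = min(
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

# Hypothetical two-condition expression profiles for five genes.
profiles = [(1.0, 1.1), (0.9, 1.0), (5.0, 5.2), (5.1, 4.9), (9.0, 0.1)]
clusters = agglomerative(profiles, 3)
# → [[0, 1], [2, 3], [4]]
```

Stopping the loop at a different `n_clusters` corresponds to cutting the same hierarchy at a different level.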
Advantages
Agglomerative techniques provide a series of nested partitions that begins with the trivial
clustering in which each object is in its own cluster and concludes with the trivial clustering in
which all objects are within the same cluster (Na, Xumin & Yong, 2010). In the divisive variant,
each cluster is instead split into two clusters until the desired number of clusters is achieved
(Pascal et al., 2017).
Limitations
This technique is less robust because its results depend on assumptions about the structure of the
clusters and about the number of clusters at which the hierarchy is cut.
K-means clustering
This is an algorithm designed for grouping samples or genes into K groups. Aggregation is achieved
by minimizing the sum of the squared distances between each data point in the microarray data and
its corresponding cluster centroid. As a result, this model of clustering is aimed at classifying
array data on the basis of similar expression.
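The minimization described above is usually carried out by alternating an assignment step and a centroid update step. The following is a minimal sketch of that procedure, assuming Euclidean distance. The function names, the optional `init` parameter and the toy profiles are illustrative assumptions.

```python
import math
import random

def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Alternate assignment and update steps; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Start from the given centroids, or from k randomly chosen data points.
    centroids = [tuple(c) for c in (init or rng.sample(points, k))]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # no point changed cluster, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, labels

def within_cluster_ss(points, centroids, labels):
    """Sum of squared distances between each point and its cluster centroid."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))

# Hypothetical two-condition expression profiles for six genes.
profiles = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
            (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = kmeans(profiles, 2, init=[profiles[0], profiles[3]])
# labels → [0, 0, 0, 1, 1, 1]
```

The value returned by `within_cluster_ss` is the quantity that the algorithm reduces at every iteration.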
Advantages

It is faster and easier to use than other clustering methods, such as hierarchical clustering,
when processing large datasets.
Limitations
One major disadvantage of k-means clustering is that it can only handle numerical data. In
addition, users must specify k, the number of clusters, right from the beginning. Lastly, it is
sometimes difficult to choose an appropriate value of k.
Dynamical partitioning clustering
It is a partitional algorithm that uses a predefined number of clusters and optimizes the fit
between the clusters and their representations. Beginning with prototypes selected from random
individuals, this technique alternates between two steps: a representation step, where a prototype
is created for each cluster, and an allocation step, which assigns each individual to the cluster
whose prototype has the lowest dissimilarity to it (Zhao et al., 2015).
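The two alternating steps can be sketched as follows. The text does not say how a prototype is built, so this sketch assumes that each prototype is the member of the cluster with the smallest total dissimilarity to the other members (a medoid). The function name, the use of Euclidean distance as the dissimilarity and the toy profiles are likewise illustrative assumptions.

```python
import math
import random

def dynamic_partition(items, k, dissim, seed=0, max_iter=50):
    """Alternate allocation and representation steps.

    Returns (prototypes, assignment), where prototypes holds item indices.
    """
    rng = random.Random(seed)
    # Initial prototypes are randomly selected individuals.
    prototypes = rng.sample(range(len(items)), k)
    assignment = None
    for _ in range(max_iter):
        # Allocation step: each individual joins the cluster whose
        # prototype is least dissimilar to it.
        new_assignment = [
            min(range(k), key=lambda c: dissim(items[i], items[prototypes[c]]))
            for i in range(len(items))
        ]
        if new_assignment == assignment:
            break  # the partition is stable
        assignment = new_assignment
        # Representation step (assumed here): the new prototype is the
        # member with the lowest total dissimilarity to its cluster.
        for c in range(k):
            members = [i for i, a in enumerate(assignment) if a == c]
            if members:
                prototypes[c] = min(
                    members,
                    key=lambda m: sum(dissim(items[m], items[j]) for j in members),
                )
    return prototypes, assignment

# Hypothetical two-condition expression profiles for six genes.
profiles = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
            (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
prototypes, assignment = dynamic_partition(profiles, 2, math.dist)
```

Changing `seed` changes the initial prototypes, which is a simple way to observe how the result depends on the starting partition.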
Advantages
Partitioning techniques are the most fundamental and simple mode of clustering as they use simple
algorithms to group a database into k clusters.
Limitations
One major challenge posed by this algorithm is that it is highly sensitive to the selection of the
initial partition. As a result, the algorithm is likely to converge to a local minimum.
Self-organizing maps (SOM)

This is a model-based method widely used with gene expression data, where a certain gene may have
high correlation with two different clusters.
Advantages
Perhaps the most important element of SOM is its applicability to large datasets, making it suitable
for selecting cancer related genes for clustering analysis. It is also suitable for gene expression
data analyses as it is based on a single-layered neural network. This facilitates the generation of
an appealing map representing a high-dimensional dataset in 2D or 3D space. This method places
similar clusters close to each other, enabling easy visualization and interpretation.
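How a single-layered network produces such a map can be illustrated with a minimal sketch of SOM training on a small 2D grid. The grid size, the linear decay of the learning rate and radius, the Gaussian neighbourhood function and all names are assumptions chosen for brevity, and the samples are hypothetical profiles scaled to the range 0 to 1.

```python
import math
import random

def best_matching_unit(weights, x):
    """Return the grid coordinate whose weight vector is closest to x."""
    return min(weights, key=lambda node: math.dist(weights[node], x))

def train_som(data, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a rows x cols map; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    radius0 = max(rows, cols) / 2
    for epoch in range(epochs):
        decay = 1 - epoch / epochs
        lr = lr0 * decay                    # learning rate shrinks over time
        radius = max(radius0 * decay, 0.5)  # so does the neighbourhood
        for x in rng.sample(data, len(data)):  # shuffled pass over the data
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                # Nodes near the best matching unit on the grid move most.
                grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2 * radius ** 2))
                for d in range(dim):
                    w[d] += lr * h * (x[d] - w[d])
    return weights

# Hypothetical three-condition expression profiles scaled to [0, 1].
samples = [(0.1, 0.2, 0.1), (0.2, 0.1, 0.15), (0.9, 0.8, 0.95), (0.85, 0.9, 0.9)]
weights = train_som(samples, rows=2, cols=2)
positions = [best_matching_unit(weights, s) for s in samples]
```

Each entry of `positions` is the grid cell a sample maps to, so samples with similar profiles tend to land in the same or neighbouring cells.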
Limitations
Groupings found in the map may not be completely accurate when there is limited information in an
SOM.
Statistical information grid-based (STING) algorithm
This clustering method divides the data space into a grid of cells and uses statistical information
about each cell to estimate and establish expected query results. By comparing the density of
candidate cluster areas to previously established density values, it enables those areas to be
classified.
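The idea of classifying areas by density can be illustrated with a deliberately simplified sketch. It uses a single grid level, whereas STING itself works with a hierarchy of grids, and it assumes 2D points, a fixed cell size and a minimum count per cell as the density value. All names and data are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def grid_statistics(points, cell_size):
    """Summarize 2D points per grid cell (count and mean position)."""
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    return {
        cell: {"count": len(members),
               "mean": (mean([p[0] for p in members]),
                        mean([p[1] for p in members]))}
        for cell, members in cells.items()
    }

def dense_cells(stats, min_count):
    """Keep only cells whose count reaches the density threshold."""
    return {cell for cell, s in stats.items() if s["count"] >= min_count}

def connected_regions(cells):
    """Group relevant cells that share an edge into clusters."""
    remaining, regions = set(cells), []
    while remaining:
        stack = [remaining.pop()]
        region = set(stack)
        while stack:
            r, c = stack.pop()
            for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if nb in remaining:
                    remaining.remove(nb)
                    region.add(nb)
                    stack.append(nb)
        regions.append(region)
    return regions

# Hypothetical 2D points: two dense areas plus one isolated point.
points = [(0.1, 0.2), (0.3, 0.4), (0.2, 0.1), (1.1, 0.3), (1.4, 0.2),
          (1.2, 0.6), (5.5, 5.5), (8.1, 8.2), (8.3, 8.4), (8.6, 8.1)]
stats = grid_statistics(points, cell_size=1.0)
relevant = dense_cells(stats, min_count=3)
regions = connected_regions(relevant)
```

The isolated point at (5.5, 5.5) falls in a cell with a count of 1, so that cell is discarded as not relevant and does not appear in any region.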
Advantages
This approach is suitable for clustering cancer related genes because it designates low-density
cluster areas as not relevant; as a result, the effect of noise is minimized considerably. In
addition, it is very efficient for large datasets. Lastly, minimal computational resources are
required to group large spatial datasets using this clustering approach since the I/O cost is
relatively low.
References

Pascal, V., Pozuelo, M., Borruel, N., Casellas, F., Campos, D., Santiago, A., ... & Vermeire, S.
(2017). A microbial signature for Crohn's disease. Gut, 66(5), 813-822.
Likas, A., Vlassis, N., & Verbeek, J. J. (2003). The global k-means clustering algorithm. Pattern
Recognition, 36(2), 451-461.
Na, S., Xumin, L., & Yong, G. (2010, April). Research on k-means clustering algorithm: An
improved k-means clustering algorithm. In 2010 Third International Symposium on Intelligent
Information Technology and Security Informatics (pp. 63-67). IEEE.
Lim, W. K., Wang, K., Lefebvre, C., & Califano, A. (2007). Comparative analysis of microarray
normalization procedures: effects on reverse engineering gene networks. Bioinformatics, 23(13),
i282-i288.
Zhao, X. M., Liu, K. Q., Zhu, G., He, F., Duval, B., Richer, J. M., ... & Chen, L. (2015). Identifying
cancer-related microRNAs based on gene expression data. Bioinformatics, 31(8), 1226-1234.
