Sample work for
LITERATURE REVIEW
Copyright © 2018 Phdassistance. No part of this document may be published without permission of the author.
CHAPTER 2
LITERATURE REVIEW
2.1 INTRODUCTION
The rapid technological development in the field of genomics has created an
unprecedented situation in biology. This is mainly due to the large volume of genes with no clear
sequence homology to previously characterized genes. Understanding how these genes drive
physiology is a major challenge for the coming years, especially in data analysis, statistical
modelling and interpretation of results. High-density macro- and microarrays have acquired a
special role in this challenging field, as they consist of ordered collections of thousands of
different Deoxyribonucleic Acid (DNA) sequences that can be used to measure DNA and
Ribonucleic Acid (RNA) variation (Lipschultz et al 1999). In particular, microarrays are
considered a breakthrough technology in biology, facilitating the quantitative study of thousands
of genes simultaneously from a single sample of cells. They have been utilized in many
applications but are most commonly used in expression profiling (Bowtell 1999). Seo Young et al
(2007) noted that the microarray has emerged as an effective and extensively used tool to address
a broad range of problems, such as the categorization of disease subtypes and tumors in
biological and medical research. One of the major objectives in analyzing gene expression data
has been the identification of samples or genes with identical expression patterns, and several
statistical techniques exist for analyzing and organizing these complex data into useful
information. A handful of research studies on clustering microarray gene expression data are
reviewed below. Kim et al (2005) likewise describe the microarray as the most effective and
broadly used tool for categorizing disease subtypes and tumors in biological and medical
research. They used data preprocessing, such as normalization strategies or the degree of noise
(clearness) in the data, as the basis for comparing the performance of diverse clustering methods.
Using validation measures, they evaluated these clustering methods on both simulated data and
real gene expression data. They found that normalization and the extent of noise and clearness
of the datasets affect the clustering methods that are normally used in microarray data analysis.
Valarmathie et al (2009)
recommended a hybrid fuzzy c-means method to determine the precise number of clusters and
interpret them efficiently. The challenging issue in the microarray technique is to analyze
and interpret the large volume of data, which can be achieved by clustering techniques from data
mining. In hard clustering, such as hierarchical and k-means clustering, data are divided
into distinct clusters where each data element belongs to exactly one cluster, so the outcome
of the clustering may often be incorrect. The problems of hard clustering can
be addressed by fuzzy clustering. Among fuzzy clustering methods, fuzzy c-means (FCM)
was considered the most suitable for microarray gene expression data. The problem associated
with fuzzy c-means is that the number of clusters to be generated for the given dataset must be
specified in advance. The proposed system addresses this by combining FCM with the popular
probabilistic Expectation Maximization (EM) algorithm, which provides a statistical framework to
model the cluster structure of gene expression data.
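
The contrast between hard and fuzzy assignment can be made concrete with a minimal sketch of
the standard fuzzy c-means updates. This is not the hybrid FCM/EM system of Valarmathie et al
(2009): the function name, the fuzzifier m = 2, the random initialisation and the stopping
tolerance are all assumptions chosen for illustration, and the number of clusters c must still
be supplied by the caller.

```python
import math
import random

def fuzzy_c_means(points, c, m=2.0, iters=100, tol=1e-6, seed=0):
    """Return (centres, memberships); m > 1 is the fuzzifier (assumed 2.0)."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships, each row normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])
    centres = []
    for _ in range(iters):
        # Centre update: mean of the points weighted by u_ik ** m.
        centres = []
        for k in range(c):
            w = [u[i][k] ** m for i in range(n)]
            total = sum(w)
            centres.append([sum(w[i] * points[i][d] for i in range(n)) / total
                            for d in range(dim)])
        # Membership update: u_ik is proportional to d_ik ** (-2 / (m - 1)).
        shift = 0.0
        for i in range(n):
            dist = [math.dist(points[i], centres[k]) for k in range(c)]
            if min(dist) == 0.0:
                # The point coincides with a centre: give it full membership there.
                raw = [1.0 if d == 0.0 else 0.0 for d in dist]
            else:
                raw = [d ** (-2.0 / (m - 1.0)) for d in dist]
            total = sum(raw)
            new_row = [v / total for v in raw]
            shift = max(shift, max(abs(a - b) for a, b in zip(new_row, u[i])))
            u[i] = new_row
        if shift < tol:
            break
    return centres, u
```

Each row of the returned membership matrix sums to one, so a gene can belong partly to several
clusters; taking the largest entry in each row recovers a hard assignment when one is needed.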
The microarray technique has enabled the concurrent measurement of the expression levels of
thousands of messenger RNAs (mRNAs). By mining these data, it is feasible to recognize the
dynamics of a gene expression time series. Layana and Diambra (2007) reduced
the dimensionality of the data set by employing Principal Component
Analysis (PCA). Examination of the components provided insight into the underlying
factors measured in the experiments, and their results showed that
all rhythmic content of the data can be reduced to three main components
(Layana, C. and Diambra 2007).
Hereditary inclusion body myopathy (HIBM), an adult-onset, slowly progressive distal and
proximal myopathy, has also been studied (Eisenberg et al 2008). After examining the
expression profile data sets by the overlap of three statistical methods (Student's t-test, TNoM
and Info score), it was found that the HIBM-specific transcriptome contains 374 differentially
expressed genes. The subtle involvement of mitochondrial processes revealed in HIBM is an
unexpected feature of HIBM pathophysiology. This finding could help
explain the slow progression of the disorder and provide insight
into its disease mechanism.
Gene Expression (GE) profiling helps in understanding the fundamental causes of
gene behaviour and the growth of genes, in identifying new ailments such as cancer, and in
analysing their molecular pharmacology. The main objective of gene expression analysis is to
comprehend the processes of regulatory networks and the pathways that are restricted during
inter-cellular and intra-cellular activities. Currently, microarray datasets are broadly used for
this purpose.
Identifying meaningful information patterns and dependencies in Gene
Expression (GE) data, to provide a basis for hypothesis testing, is non-trivial. An initial step is to
cluster or “group” genes with similar changes in expression. However, lack of a priori
knowledge means that unsupervised clustering techniques, where data are unlabeled (un-
annotated), are common in GE work. These are exploratory techniques and assume that there is
an unknown mapping that assigns a group “label” to each gene, where the goal is to estimate this
mapping. However, it has been noted that common clustering approaches do not always translate
well to GE data, and may fail significantly to account for data profile (Kerr et al 2008). D'Souza
et al (2009) have demonstrated that their algorithm can detect gene networks with reasonable
ease by applying it to a yeast sporulation dataset.
Hence, with this background, the present review critically analyses the previous literature on
the application of clustering algorithms to different microarray gene expression data sets.
Accuracy of GE data strongly depends on experimental design and minimisation of
technical variation, which may be due to instruments, observer or pre-processing
(Zakharkin et al 2005). Image corruption and/or slide impurities may lead to incomplete data
(Troyanskaya et al 2001). The development of data analysis strategies and tools to cope with the
complexity of the data is a sizeable task. Current methods for analysis are based on comparison
rules and pattern recognition algorithms such as cluster analysis.
As cluster analysis is usually exploratory, lack of a priori knowledge on gene groups
or their number, K, is common. Arbitrary selection of this number may undesirably bias the
search, as pattern elements may be ill defined unless signals are strong. Meta-data can guide
choice of correct K, e.g. genes with common promoter sequence are likely to be expressed
together and thus are likely to be placed in the same group. Methods for determining optimal
number of groups, K, are discussed in Fridlyand and Dudoit (2001); Milligan and Cooper (1985).
Clustering a GE matrix can be achieved in two ways: (i) genes can form a group which show
similar expression across conditions and (ii) samples can form a group which show similar
expression across all genes. Both (i) and (ii) lead to global clusters, where a gene or sample is
grouped across all dimensions. However, genes and samples can be clustered simultaneously,
with their inter-relationship represented by bi-clusters. These are defined over a subset of genes
and a subset of samples thus capturing local structure in the dataset. This is a major strength of
bi-clustering as cellular processes are found to rely on subsets of genes, which are co-regulated
and co-expressed under certain conditions and behave independently (Ben-Dor, Chor, Karp and
Yakhini 2003).
Most clustering algorithms can be classified into two groups: hierarchical and
partitional clustering. The hierarchical techniques produce a nested sequence of partitions, with a
single, all-inclusive cluster at the top and single clusters of individual objects at the bottom (leaf
nodes) (divisive hierarchical clustering) or a set of singleton clusters at the top and one single
partition at the bottom (agglomerative hierarchical clustering). Examples of the hierarchical
clustering are the Principal Direction Divisive Partitioning (PDDP) (Boley 1998), Bisecting K-
Means (BKM) (Savaresi and Boley 2001), Hierarchical Agglomerative Clustering (HAC) (Jain,
Murty and Flynn 1999), Collaborative Document Clustering (CDC) (Hammouda and Kamel
2006). The partitional clustering approaches partition a collection of objects into a set of groups,
so as to maximize the quality of clustering. The K-means (KM) (Hartigan and Wong 1979) and
fuzzy c-means (Bezdek, Ehrlich and Full 1984) algorithms are members of the family of
partitional clustering algorithms (Kashef and Kamel 2009).
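
As a minimal sketch of the partitional approach, the following Python function implements
Lloyd's iteration for K-means on plain lists of coordinates. Seeding the centroids with the
first k points and capping the number of iterations are assumptions made for brevity; the
algorithm of Hartigan and Wong (1979) differs in detail and is not reproduced here.

```python
import math

def k_means(points, k, iters=100):
    """Lloyd's algorithm; the first k points seed the centroids (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids
```

Every point receives exactly one label, which is the hard-partition behaviour that the fuzzy
methods discussed earlier relax.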
Approaches to gene expression data analysis rely heavily on results from cluster
analysis (e.g., k-means, self-organizing maps and trees), supervised learning (e.g., recursive
partitioning), classification and regression trees (Pollard and van der Laan 2002). Lopamudra
Dey et al (2011) described the clustering analysis of microarray gene expression data. Many
clustering algorithms, such as K-means, FCM and hierarchical techniques, have been used for
gene expression data clustering. The authors proposed a Particle Swarm Optimization
(PSO)-based K-means clustering algorithm for microarray gene expression data and reported
that it gave better accuracy than those existing algorithms.
Jessica Mar et al (2011) developed an 'informativeness metric' based on a simple
analysis of variance statistic that identifies the number of clusters which best separates
phenotypic groups. The performance of the informativeness metric was tested on both
experimental and simulated datasets, and the researchers compared these results with those
obtained using alternative methods such as the gap statistic.
Liping Jing et al (2010) presented a Stable Gene Selection (SGS) and efficient cancer
prediction framework. The framework first identifies gene groups in which the genes
have a high correlation coefficient by means of a clustering algorithm. It then
employs Bayesian Lasso and group Lasso to select significant genes in each group and
important gene groups, respectively. Finally, a prediction model is constructed on the
shrunken gene space using an efficient classification algorithm (such as Support Vector
Machine (SVM), Single Nearest Neighbour (1NN), or regression). Experimental results obtained
on real-world data show that the framework frequently outperforms available feature selection
and prediction methods, such as Significance Analysis of Microarrays (SAM), Information Gain
(IG) and the Lasso-type prediction model.
Further, many clustering algorithms require a complete matrix of input values, so
imputation (missing data estimation) techniques need to be considered before clustering. GE data
are intrinsically noisy, resulting in outliers, typically managed by: (i) robust statistical
estimation/testing (when extreme values are not of primary interest) or (ii) identification (when
outlier information is of intrinsic importance) (Liu, Cheng and Wu 2002).
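
To illustrate what imputation before clustering can look like, the sketch below fills each
missing entry of a genes-by-samples matrix with the average of that sample's values in the k
most similar genes. It is a simplified nearest-neighbour scheme written for this review, not
the procedure of any cited study; the function name, the use of None for missing values and
the default k = 2 are assumptions.

```python
import math

def knn_impute(matrix, k=2):
    """Fill None entries of a genes x samples matrix from the k most similar genes."""
    filled = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for r, other in enumerate(matrix):
                if r == i or other[j] is None:
                    continue
                # Compare the two genes only on samples observed in both.
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if shared:
                    d = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                    candidates.append((d, other[j]))
            candidates.sort(key=lambda t: t[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:  # leave the entry missing if no neighbour is usable
                filled[i][j] = sum(nearest) / len(nearest)
    return filled
```

The output is a complete matrix wherever at least one comparable gene exists, which is the
input form that most of the clustering algorithms reviewed here require.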
The following section reviews previous studies on the application of clustering algorithms to
different data sets. Pollard and Van der Laan (2002) proposed a statistical
framework for two-way clustering, in which genes and samples are considered simultaneously
and complex patterns can be identified. In this study, a simultaneous clustering parameter
is defined as a function S(P) of the true data generating distribution P, and an estimate S(P_n)
is obtained by applying this function to the empirical distribution P_n. The authors show that
a wide range of clustering procedures, including generalized hierarchical methods, can be
defined as parameters which are compositions of individual mappings for clustering patients and
genes. This framework allows one to assess classical properties of clustering methods, such as
consistency, and to formally study statistical inference regarding the clustering parameter. The
authors present results of simulations designed to assess the asymptotic validity of different
bootstrap methods for estimating the distribution of S(P_n).
Mendez et al (2002) presented a procedure that combines classical statistical methods
to assess the confidence of gene clusters identified by hierarchical clustering of expression data.
This approach was applied to a publicly released Drosophila metamorphosis data set (White et al
1999). The study can produce reliable classifications of gene groups and genes within the groups
by applying unsupervised (cluster analysis), dimension reduction (principal component analysis)
and supervised methods (linear discriminant analysis) in a sequential form. This procedure
provides a means to select relevant information from microarray data, reducing the number of
genes and clusters that require further biological analysis.
Xu et al (2002) proposed three Minimum Spanning Tree (MST)-based algorithms:
removing long MST-edges, a center-based iterative algorithm, and a representative-based global
optimal algorithm. But for a specific dataset, users do not know which algorithm is suitable.
Most clustering algorithms become ineffective when provided with unsuitable parameters or
applied to datasets which are composed of clusters with diverse shapes, sizes, and densities.
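
The first of these ideas, removing long MST edges, can be sketched as follows. The code
builds a minimum spanning tree over the complete Euclidean graph with Prim's algorithm, cuts
the k - 1 longest tree edges and labels the remaining connected components. It is an
illustration written for this review rather than the implementation of Xu et al (2002); the
function name and the choice of Euclidean distance are assumptions.

```python
import math

def mst_clusters(points, k):
    """Cluster by cutting the k-1 longest edges of a minimum spanning tree."""
    n = len(points)
    # Prim's algorithm on the complete Euclidean graph.
    in_tree = [False] * n
    best = [math.inf] * n      # cheapest known edge linking each vertex to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        in_tree[u] = True
        if parent[u] != -1:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    # Keep all but the k-1 longest edges, then label connected components.
    edges.sort()
    kept = edges[:len(edges) - (k - 1)]
    label = list(range(n))

    def find(x):
        while label[x] != x:
            label[x] = label[label[x]]
            x = label[x]
        return x

    for _, a, b in kept:
        label[find(a)] = find(b)
    return [find(i) for i in range(n)]
```

Points that share a returned label belong to the same cluster. Because only edge lengths are
used, the result depends heavily on k and on how well separated the groups are, which is the
parameter sensitivity noted above.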
Du and Lin (2004) suggested an alternative parallelized algorithm of hierarchical
clustering to solve the problem of traditional hierarchical clustering which cannot handle large
data sets within reasonable time and memory resources. The algorithm was implemented on a
Multiple Instruction Multiple Data (MIMD) architecture and showed a considerable reduction in
computational time and inter-node communication overhead, especially for large data sets. The
authors used the standard message passing library, Message Passing Interface (MPI), so that it
can be used on any MIMD system.
Seal, Komarina and Aluru (2005) studied clustering algorithms used on
gene expression data to find co-regulated genes, addressing the high run time of hierarchical
clustering based on the Pearson correlation coefficient. The study notes that the run time can
be reduced to O(N^2) by applying known hierarchical clustering algorithms [Proc. 9th Annual
ACM-SIAM Symposium on Discrete Algorithms, 1998, pp. 619–628]. The study presents an algorithm
which runs in O(N log N) time using a geometrical reduction and shows that it is optimal.
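
One standard identity shows why a correlation-based problem can be treated geometrically. It
is given here as background only; the exact reduction used by Seal, Komarina and Aluru (2005)
is not described in this review and may differ. For two expression profiles x and y measured
over n conditions (n is notation introduced here, distinct from the number of genes N),
centring each profile and scaling it to unit length gives:

```latex
\tilde{x}_i = \frac{x_i - \bar{x}}{\sqrt{\sum_{j=1}^{n} (x_j - \bar{x})^2}}, \qquad
\tilde{y}_i = \frac{y_i - \bar{y}}{\sqrt{\sum_{j=1}^{n} (y_j - \bar{y})^2}}

r(x, y) = \sum_{i=1}^{n} \tilde{x}_i \tilde{y}_i, \qquad
\|\tilde{x}\|^2 = \|\tilde{y}\|^2 = 1

\|\tilde{x} - \tilde{y}\|^2
  = \|\tilde{x}\|^2 + \|\tilde{y}\|^2 - 2 \sum_{i=1}^{n} \tilde{x}_i \tilde{y}_i
  = 2 \bigl( 1 - r(x, y) \bigr)
```

Ranking gene pairs by decreasing Pearson correlation is therefore the same as ranking the
standardised profiles by increasing Euclidean distance, so geometric nearest-neighbour
techniques become applicable.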
The study by He, Pan and Lin (2006) presented multivariate normal mixture model
based clustering analyses to detect differential gene expression between two conditions.
Deviating from the general mixture model and model-based clustering, mixture
models with specific mean and covariance structures that account for special features of
two-condition microarray experiments were proposed. Explicit updating formulas in the
Expectation-Maximization (EM) algorithm for three such models are derived. The methods are
applied to a real dataset to compare the expression levels of 1176 genes of rats with and without
pneumococcal middle-ear infection to illustrate the performance and usefulness of this approach.
About 10 genes and 20 genes are found to be differentially expressed in a six-dimensional
modelling and a bivariate modelling, respectively. Two simulation studies are conducted to
compare the performance of univariate and multivariate methods. Depending on data, neither
method can always dominate the other. The results suggest that multivariate normal mixture
models can be useful alternatives to univariate methods to detect differential gene expression in
exploratory data analysis.
Linag (2007) proposed a method to overcome the difficulties of
mixture-Gaussian model-based clustering of gene expression profiles by using the probit
transformation in conjunction with the Singular Value Decomposition (SVD). SVD reduces the
dimensionality of the data, and the probit transformation converts the scaled eigensamples,
which can be interpreted as correlation coefficients, into
Gaussian random variables. The results show that the SVD-based probit transformation
enhances the ability of the mixture-Gaussian model-based clustering method to identify
prominent patterns in the data. As a by-product, the author reported that the SVD-based probit
transformation also improves the performance of model-free clustering methods, such as
hierarchical, K-means and Self-Organizing Maps (SOM), for data sets containing scattered
genes. The study also proposed a run test-based rule for selecting the eigensamples used
for clustering.
Delibasic, Vukicevic, Jovanovic, Kirchner, Ruhland and Suknovic (2012) proposed
an architecture for the design of representative-based clustering algorithms based on reusable
components. These components were derived from K-means-like algorithms and their
extensions. With the suggested clustering design architecture, it is not only possible to
reconstruct popular algorithms, but also to build new algorithms by exchanging components
from original algorithms and their improvements. In this way, the design of a myriad of
representative-based clustering algorithms and their fair comparison and evaluation are possible.
In addition to the architecture, the study showed the usefulness of the proposed approach by
providing experimental evaluation. However, the study recommends meta-learning as a better
approach for intelligent algorithm selection, which remains a relatively new and unexplored
topic in the area of clustering (de Souto, Prudencio, Soares, Araujo, Costa, Ludermir and
Schliep 2008).
In addition, clustering algorithms that combine the advantages of hierarchical and
partitional clustering have been proposed in the literature (Cheng, Kannan, Vempala and Wang
2006; Kaukoranta, Fränti and Nevalainen 1998; Lee and Olafsson 2011; Lin and Chen 2005; Liu,
Jiang and Kot, 2009). This kind of hybrid algorithm analyzes the dataset in two stages. In the
first stage, the dataset is split into a number of subsets with a partitioning criterion. In the second
stage, the produced subsets are merged in terms of a similarity measure. Different split and
merge approaches have been designed in several hybrid algorithms. Cohesion Self-Merging
(CSM) (Lin and Chen 2005) first applies K-means to partition the dataset into K0 subsets, where
K0 is an input parameter. Afterwards, single linkage, which uses a dedicated cohesion function
as the similarity measure, is utilized to iteratively merge the K0 subsets until K subsets are
achieved. In the split stage, as K-means may produce different partitions in different runs, the
final results may be unstable.
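
The two-stage structure of such hybrid algorithms can be sketched as follows. The split stage
runs plain K-means to obtain K0 subsets and the merge stage joins the closest pair of subsets
until K remain. Ordinary single linkage on Euclidean distance stands in for the dedicated
cohesion function of CSM, which is not reproduced here; the function name, the seeding with
the first K0 points and the fixed iteration count are assumptions.

```python
import math

def split_and_merge(points, k0, k, iters=50):
    """Split with K-means into k0 subsets, then merge the closest pairs down to k."""
    # Split stage: plain Lloyd iterations seeded with the first k0 points.
    centroids = [list(p) for p in points[:k0]]
    groups = []
    for _ in range(iters):
        groups = [[] for _ in range(k0)]
        for idx, p in enumerate(points):
            j = min(range(k0), key=lambda c: math.dist(p, centroids[c]))
            groups[j].append(idx)
        for j, members in enumerate(groups):
            if members:
                centroids[j] = [sum(points[i][d] for i in members) / len(members)
                                for d in range(len(points[0]))]
    subsets = [g for g in groups if g]

    # Merge stage: single linkage, i.e. distance between the closest members.
    def linkage(a, b):
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(subsets) > k:
        pairs = [(linkage(subsets[a], subsets[b]), a, b)
                 for a in range(len(subsets)) for b in range(a + 1, len(subsets))]
        _, a, b = min(pairs)
        subsets[a] = subsets[a] + subsets[b]
        del subsets[b]
    return subsets
```

The function returns lists of point indices. Since the split stage depends on how K-means is
seeded, different seeds can give different final clusters, which is the instability noted
above.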
CHAMELEON (Karypis, Han and Kumar 1999) is another example of a hybrid
clustering algorithm. It constructs a K-nearest neighbour graph, and employs a graph cut scheme
to partition the graph into K0 subsets. Relative inter-connectivity and relative closeness are
defined to merge the subsets. Liu et al (2009) proposed a multi-prototype clustering algorithm,
which can also be considered as a hybrid method. The method uses a convergence mechanism,
and repeatedly performs split and merge operations until the prototypes remain unchanged.
However, many empirical parameters are involved. Kaukoranta et al (1998) proposed a
split-and-merge algorithm, where the objective function is to minimize the mean squared error. A
Minimum Spanning Tree (MST) is a useful graph structure, which has been employed to capture
perceptual grouping (Jain and Dubes 1998). Zahn (1971) defined several criteria of edge
inconsistency for detecting clusters of different shapes. However, for datasets consisting of
differently shaped clusters, the method lacks an adaptive selection of the criteria.
To handle clusters with diverse shapes, sizes and densities,
Zhond, Miao and Franti (2011) proposed a novel split-and-merge hierarchical clustering method
in which an MST and an MST-based graph are employed to guide the splitting and merging
process. In the splitting process, vertices with high degrees in the MST-based graph are selected
as initial prototypes, and K-means is used to split the dataset. In the merging process, subgroup
pairs are filtered and only neighbouring pairs are considered for merge. The proposed method
requires no parameter except the number of clusters. Experimental results demonstrate its
effectiveness both on synthetic and real datasets.
Analysis of large GE datasets is a relatively new task, although pattern recognition of
complex data is well established in a number of fields. Many common generic algorithms have,
in consequence, been adopted for GE data (e.g. hierarchical (Eisen and Spellman 1998), SOMs
(Kohonen 1990)), but not all perform well. A good method must deal with noisy high
dimensional data, be insensitive to the order of input, have moderate time and space complexity
(i.e. allow increased data load without breakdown or requirement of major changes), require few
input parameters, incorporate meta-data knowledge (an extended range of attributes) and produce
results, which are interpretable in the biological context.
2.2 RESEARCH GAPS IN EXISTING METHODS
It has been concluded from the previous studies that cluster analysis applied to GE
data aims to highlight meaningful patterns for gene co-regulation. The evidence suggests that,
while commonly applied, agglomerative and partitive techniques are insufficiently powerful
given the high dimensionality and nature of the data. While further testing on non-standard and
diverse data sets is required, comparative assessment and numerical evidence, to date, support
the view that bi-clustering methods, although computationally expensive, offer better
interpretation in terms of data features and local structure. While the limitations of commonly
used algorithms are well documented in the literature, adoption by the bioinformatics community
of new (and hybrid) techniques developed specifically for GE analysis has been slow, mainly
due to the increased algorithmic complexity involved. Adoption would be catalysed by more
transparent guidelines and increased availability of specialised software and public dataset
repositories.
In comparison to other methods, hierarchical clustering has been used extensively
by biologists in microarray data analysis. It discovers groupings by repeatedly combining
pairs of data points, or sets of points, that are adjacent to each other in the feature
space until all data points form a single set. The objective of hierarchical clustering is to
obtain the best clustering that represents a set of patterns in the context of a given distance
metric (Jin Hwan Do and Dong-Kug Choi 2007). The method is preferred
among biologists because it permits users to visualize global expression patterns in DNA
microarray data through a graphic representation of the clustering results.
Generally, it is classified as agglomerative (bottom-up) or divisive (top-down), and is based
on a similarity or distance measure of the data, such as correlation, Euclidean, squared
Euclidean, or city-block (Manhattan) distance. The hierarchical tree is constructed from
the distances calculated between pairs of objects in the correlation matrix.
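
The distance measures named above have simple closed forms, sketched here for two expression
profiles given as equal-length lists of numbers. The function names are illustrative, and the
correlation distance is taken as one minus the Pearson correlation, which is a common
convention assumed here rather than one stated in the cited work.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def manhattan(x, y):
    # City-block distance: sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(x, y))

def correlation_distance(x, y):
    """1 - Pearson r: near 0 for co-varying profiles, near 2 for opposite ones.

    Undefined (division by zero) when either profile is constant.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)
```

Correlation distance ignores differences in magnitude and compares only the shape of two
profiles, whereas the Euclidean and Manhattan distances are sensitive to absolute expression
levels. The choice of measure therefore changes which genes are merged first in the tree.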
In previous studies, data are partitioned by clustering algorithms in which each gene
belongs to only one cluster (Zhaohui Qin 2006; Minsoo Lee et al 2007). The limitations of these
algorithms include high sensitivity to noise, outliers and non-linearity, lack of validity,
difficulty in handling clusters of different sizes and shapes, added time (He, Pan and Lin 2006),
inability to detect small sets, increased algorithmic complexity (Linag 2007), and lack of
statistical tests and interpretation of results (He, Pan and Lin 2006; Liu et al 2009). In
addition, these methods have disadvantages that give rise to biological complexity when working
with microarray gene expression data (Zhond, Miao and Franti 2011). The nature of proteins and
their interactions is the major reason for this.
The genes that generate proteins are expected to co-express with more than one group of genes
because proteins generally perform diverse biological functions by interacting with different
groups of proteins. This explains the inclusion of a gene in more than one cluster of microarray
gene expression data. Further, a good method must deal with noisy high dimensional data, be
insensitive to the order of input, have moderate time and space complexity (i.e. allow increased
data load without breakdown or requirement of major changes), require few input parameters,
incorporate meta-data knowledge (an extended range of attributes) and produce results, which
are interpretable in the biological context.
Unlike data from model organisms and cell lines that have uniform genetic
background, and where experiments are conducted under controlled conditions, disease samples
are typically much more heterogeneous. Differences in the genetic background of the subjects,
disease stage, progression, and severity as well as the presence of disease subtypes contribute to
the overall heterogeneity. Discovering genes or features that are most relevant to the disease in
question and identifying disease subtypes from such heterogeneous data remains an open
problem. Owing to the large variability in gene mutations and gene expression, especially in
cancer populations, patients do not all respond to therapy in the same way, which poses a major
challenge for physicians.
Hence, against this background, improved clustering models are proposed in this
thesis. The first and third models, based on semi-supervised and two-dimensional hierarchical
clustering, are proposed to represent the existence of genes in one or more clusters, consistent
with the nature of the gene and its attributes, and to prevent biological complexities by means
of a hybrid distance-based similarity measure. The second model is based on the Quad Tree, which
enhances the speed of the clustering process and finds the closest pair in the shortest time.
2.3 OBJECTIVES OF THIS WORK
In order to gain a better insight into the problem of cancer classification,
systematic approaches based on global gene expression analysis have been presented.
The main challenge with the existing algorithms is that each gene belongs to only one
cluster, and the processing time is high. The study aims to overcome these
challenges using microarray gene expression data. Specifically, it evaluates
microarray gene expression data of acute human leukemia, where the target is to distinguish
between acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML), a typical cancer
classification problem that is not well solved despite many years of research.
This research addresses the classification (prediction) problem (Zhang and
Ke 2000) using the two standard leukemia datasets for training and testing obtained from the
ALL/AML datasets, and demonstrates the performance of the hierarchical technique in clustering
the ground truth data of the cancer classes, namely AML and ALL.
The study has the following specific research objectives:
• To develop an enhanced clustering model that analyzes the presence of
a gene in more than one cluster, using the microarray gene expression data
of acute human leukemia.
• To develop an enhanced model that reduces the processing time of
clustering and of finding the closest pair of elements using the hybrid similarity
measure. The clustering elements are selected from the microarray gene
expression database by means of the index matrix. The best 'K' clusters are
identified using fitness evaluation.
• To determine an optimum number of clusters for a given dataset.
• To analyze the performance of the enhanced model using Precision, Recall
and the F-measure, through the following comparisons:
  - The novel semi-supervised hierarchical clustering is compared with the
    unsupervised techniques.
  - The quad tree based hierarchical clustering is compared with the
    semi-supervised hierarchical clustering without the quad tree.
  - The two-dimensional hierarchical clustering with the hybrid similarity
    measures is compared with the semi-supervised hierarchical clustering.
• To validate the enhanced clustering technique by applying the
evaluation metric to the clustering results.
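
For reference, Precision, Recall and the F-measure named in these objectives can be computed
as in the sketch below, which treats one class as positive and compares predicted labels with
ground truth labels. The function name and the example labels in the usage note are
hypothetical and do not come from the thesis data.

```python
def precision_recall_f(true_labels, predicted_labels, positive):
    """Return (precision, recall, F-measure) for one class treated as positive."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Each ratio is defined as 0.0 when its denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return precision, recall, f_measure
```

For example, calling precision_recall_f(truth, predicted, "ALL") on two lists of "ALL" and
"AML" labels scores how well the predicted assignment recovers the ALL samples. Swapping the
positive class to "AML" gives the complementary view.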
From the review, it is evident that the clustering algorithms used in previous research
partition the data in such a way that each gene belongs to only one cluster. Examples of
clustering algorithms that assign a gene to only one cluster, and that are used on gene
expression data, are the K-means algorithm, the hierarchical clustering algorithm, the
biclustering algorithm, the fuzzy k-means algorithm and SOM. However, these methods have
disadvantages when working with microarray gene expression data because of its biological
complexity. The nature of proteins and their interactions is the major reason for this.
Because proteins generally perform diverse biological roles by interacting with diverse
groups of proteins, the genes that encode them are expected to co-express with more than
one group of genes. This explains the inclusion of a gene in more than one cluster of
microarray gene expression data. In this research, a novel two dimensional hierarchical
clustering is proposed to represent the existence of genes in one or more clusters,
consistent with the nature of the gene and its attributes, together with methods to
address biological complexity.
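The difference between assigning a gene to exactly one cluster and allowing it to belong to several can be shown with a small sketch. This is not the two dimensional hierarchical clustering proposed in the thesis; it is a swapped-in, simpler illustration that assumes each cluster is summarized by a centroid expression profile and that a gene joins every cluster whose centroid it correlates with above a threshold. The names `pearson`, `hard_membership`, `overlapping_membership`, the threshold of 0.8 and all expression values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y) if norm_x and norm_y else 0.0


def hard_membership(profile, centroids):
    """Single-cluster assignment: keep only the best-correlated cluster."""
    return max(centroids, key=lambda name: pearson(profile, centroids[name]))


def overlapping_membership(profile, centroids, threshold=0.8):
    """Multi-cluster assignment: keep every cluster above the threshold."""
    return [name for name, centroid in centroids.items()
            if pearson(profile, centroid) >= threshold]


# Hypothetical centroid profiles over four conditions.
centroids = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [1.0, 3.0, 2.0, 4.0],
    "C": [4.0, 3.0, 2.0, 1.0],
}
gene = [1.0, 2.4, 2.6, 4.0]  # co-expresses with both group A and group B

single = hard_membership(gene, centroids)
shared = overlapping_membership(gene, centroids)
print("hard assignment:", single)
print("overlapping assignment:", shared)
```

With a hard assignment the gene's relationship to the second group is discarded, whereas the overlapping assignment keeps both memberships, which is the behaviour the paragraph above argues for.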
Hence, an architecture for a two dimensional hierarchical clustering is developed in
this thesis, which provides three different analyses including semi-supervised hierarchical
End of the Sample Work
See Other Sample in www.phdassistance.com
 
Language Across the Curriculm LAC B.Ed.
Language Across the  Curriculm LAC B.Ed.Language Across the  Curriculm LAC B.Ed.
Language Across the Curriculm LAC B.Ed.
 
Cambridge International AS A Level Biology Coursebook - EBook (MaryFosbery J...
Cambridge International AS  A Level Biology Coursebook - EBook (MaryFosbery J...Cambridge International AS  A Level Biology Coursebook - EBook (MaryFosbery J...
Cambridge International AS A Level Biology Coursebook - EBook (MaryFosbery J...
 
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
 
PART A. Introduction to Costumer Service
PART A. Introduction to Costumer ServicePART A. Introduction to Costumer Service
PART A. Introduction to Costumer Service
 
ESC Beyond Borders _From EU to You_ InfoPack general.pdf
ESC Beyond Borders _From EU to You_ InfoPack general.pdfESC Beyond Borders _From EU to You_ InfoPack general.pdf
ESC Beyond Borders _From EU to You_ InfoPack general.pdf
 
The Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official PublicationThe Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official Publication
 

Sample Work For Engineering Literature Review and Gap Identification

CHAPTER 2
LITERATURE REVIEW
2.1 INTRODUCTION
The rapid technological development in the field of genomics has created an unprecedented situation in biology. This is mainly due to the large volume of genes with no clear sequence homology to previously characterized genes. Understanding how these genes drive physiology is a major challenge for the coming years, especially in data analysis, statistical modelling and interpretation of results. High-density macro and micro arrays have acquired a special role in this challenging field, as they consist of ordered collections of thousands of different Deoxyribonucleic Acid (DNA) sequences that can be used to measure DNA and Ribonucleic Acid (RNA) variation (Lipschultz et al 1999). In particular, microarrays are considered a breakthrough technology in biology, facilitating the quantitative study of thousands of genes simultaneously from a single sample of cells. They have been used in many applications, most commonly in expression profiling (Bowtell 1999). Seo Young et al (2007) discussed how the microarray has emerged as an effective and extensively used tool for addressing a broad range of problems, such as the categorization of disease subtypes and tumors in biological and medical research. One of the major objectives in analyzing gene expression data has been the identification of samples or genes with similar expression patterns, and several statistical techniques exist for analyzing and organizing these complex data into useful information. A handful of research studies on clustering microarray gene expression data are reviewed below. Kim et al (2005) address a wide range of problems such as categorization of disease subtypes and tumors in biological and medical research.
The researchers describe the microarray, which has emerged as the most effective and broadly used tool for this categorization. The main objective of analyzing gene expression data has been to identify samples or genes with similar expression patterns, and statistical techniques exist to analyze and organize these complex data in a meaningful way. The researchers discovered that normalization, the extent of noise and the cleanness of the datasets affect the clustering methods that
are most commonly used in the analysis of microarray data. Data preprocessing choices, such as the normalization strategy and the level of noise, were used as the basis for comparing the performance of diverse clustering methods. Using validation measures, they evaluated all these clustering methods on both simulated data and real gene expression data. Valarmathie et al (2009) proposed a hybrid fuzzy c-means method to determine the number of clusters and interpret them efficiently. The challenging issue in microarray analysis is interpreting the large volume of data, which can be addressed with clustering techniques from data mining. In hard clustering techniques such as hierarchical and k-means clustering, data are divided into distinct clusters and each data element belongs to exactly one cluster, so the outcome of the clustering is often incorrect. Fuzzy clustering addresses this problem. Among fuzzy clustering methods, fuzzy c-means (FCM) was found to be the most suitable for microarray gene expression data. The drawback of FCM is that the number of clusters to generate for a given dataset must be specified in advance. The proposed system solves this by combining FCM with the popular probability-based Expectation Maximization (EM) algorithm, which provides a statistical framework for modelling the cluster structure of gene expression data. Microarray techniques have enabled the concurrent measurement of the expression levels of thousands of messenger RNAs (mRNAs).
By mining these data, it is feasible to recognize the dynamics of a gene expression time series. The researchers reduced the dimensionality of the data set using Principal Component Analysis (PCA). Examination of the components provided insight into the underlying factors measured in the experiments. Their results showed that all the rhythmic content of the data can be reduced to three main components (Layana and Diambra 2007). Hereditary inclusion body myopathy (HIBM), an adult-onset, slowly progressive distal and proximal myopathy, has also been studied (Eisenberg et al 2008). After examining the
expression profile data sets using the overlap of three statistical methods (Student's t-test, TNoM and Info score), it has been found that the HIBM-specific transcriptome contains 374 differentially expressed genes. The subtle involvement of mitochondrial processes revealed an unexpected feature of HIBM pathophysiology. This could help to explain the slow progression of the disorder and to provide insight into its disease mechanism. Gene Expression (GE) profiling helps in understanding the fundamental causes of gene behaviour and development, in identifying diseases such as cancer, and in analysing their molecular pharmacology. The main objective of gene expression analysis is to comprehend the processes of regulatory networks and the pathways that are restricted during inter-cellular and intra-cellular activities. Currently, microarray datasets are broadly used for this purpose. Identifying meaningful information patterns and dependencies in GE data, to provide a basis for hypothesis testing, is non-trivial. An initial step is to cluster or “group” genes with similar changes in expression. However, lack of a priori knowledge means that unsupervised clustering techniques, where data are unlabeled (unannotated), are common in GE work. These are exploratory techniques and assume that there is an unknown mapping that assigns a group “label” to each gene, where the goal is to estimate this mapping. However, it has been noted that common clustering approaches do not always translate well to GE data, and may fail significantly to account for data profile (Kerr et al 2008). D'Souza et al (2009) demonstrated that their algorithm can detect gene networks with reasonable ease by applying it to a yeast sporulation dataset.
Hence, with this background, the present review critically analyses the previous literature on the application of clustering algorithms to different microarray gene expression data sets. Accuracy of GE data strongly depends on experimental design and minimisation of technical variation, which may be due to instruments, observer or pre-processing (Zakharkin et al 2005). Image corruption and/or slide impurities may lead to incomplete data (Troyanskaya et al 2001). The development of data analysis strategies and tools to cope with the
complexity of the data is a sizeable task. Current methods for analysis are based on comparison and pattern-recognition algorithms such as cluster analysis. As cluster analysis is usually exploratory, lack of a priori knowledge of the gene groups or their number, K, is common. Arbitrary selection of this number may undesirably bias the search, as pattern elements may be ill defined unless signals are strong. Meta-data can guide the choice of the correct K, e.g. genes with a common promoter sequence are likely to be expressed together and thus are likely to be placed in the same group. Methods for determining the optimal number of groups, K, are discussed in Fridlyand and Dudoit (2001) and Milligan and Cooper (1985). Clustering a GE matrix can be achieved in two ways: (i) genes can form a group if they show similar expression across conditions and (ii) samples can form a group if they show similar expression across all genes. Both (i) and (ii) lead to global clusters, where a gene or sample is grouped across all dimensions. However, genes and samples can also be clustered simultaneously, with their inter-relationship represented by bi-clusters. These are defined over a subset of genes and a subset of samples, thus capturing local structure in the dataset. This is a major strength of bi-clustering, as cellular processes are found to rely on subsets of genes which are co-regulated and co-expressed under certain conditions and otherwise behave independently (Ben-Dor, Chor, Karp and Yakhini 2003). Most clustering algorithms can be classified into two groups: hierarchical and partitional clustering.
The hierarchical techniques produce a nested sequence of partitions. Divisive hierarchical clustering starts from a single, all-inclusive cluster at the top and ends with singleton clusters of individual objects at the bottom (leaf nodes), while agglomerative hierarchical clustering starts from a set of singleton clusters and merges them until a single cluster remains. Examples of hierarchical clustering are Principal Direction Divisive Partitioning (PDDP) (Boley 1998), Bisecting K-Means (BKM) (Savaresi and Boley 2001), Hierarchical Agglomerative Clustering (HAC) (Jain, Murty and Flynn 1999) and Collaborative Document Clustering (CDC) (Hammouda and Kamel 2006). The partitional clustering approaches partition a collection of objects into a set of groups so as to maximize the quality of clustering. The K-means (KM) (Hartigan and Wong 1979) and fuzzy c-means (Bezdek, Ehrlich and Full 1984) algorithms are members of the family of partitional clustering algorithms (Kashef and Kamel 2009).
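As a rough illustration of the agglomerative (bottom-up) strategy described above, the following minimal Python sketch starts from singleton clusters and repeatedly merges the two closest ones until K clusters remain. The single-linkage merge rule, the Euclidean distance, the function names and the toy expression profiles are all assumptions made for the example; none of them is taken from the cited studies.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerative(points, k, dist=euclidean):
    """Bottom-up clustering: begin with one cluster per point and merge
    the two closest clusters (single linkage) until k clusters remain.
    Returns clusters as sorted lists of point indices."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Single linkage: distance between the closest members.
            d = min(dist(points[i], points[j])
                    for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # a < b, so index a stays valid
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical two-condition expression profiles for five genes.
profiles = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (9.0, 0.0)]
print(agglomerative(profiles, 3))
```

The naive pair search is quadratic in the number of clusters at every step, which is exactly the cost that the faster hierarchical algorithms surveyed in this chapter try to avoid.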
Approaches to gene expression data analysis rely heavily on results from cluster analysis (e.g., k-means, self-organizing maps and trees), supervised learning (e.g., recursive partitioning), classification and regression trees (Pollard and van der Laan 2002). Lopamudra Dey et al (2011) described the clustering analysis of microarray gene expression data. In that paper, a Particle Swarm Optimization (PSO)-based K-means clustering algorithm was proposed for clustering microarray gene expression data, and it gave better accuracy than existing algorithms such as K-means, FCM and hierarchical techniques. Jessica Mar et al (2011) developed an 'informativeness metric' based on a simple analysis of variance statistic that identified the number of clusters which best separated phenotypic groups. The performance of the informativeness metric was tested on both experimental and simulated datasets, and the researchers contrasted these results with those obtained using alternative methods such as the gap statistic. A stable gene selection and efficient cancer prediction framework called SGS has been introduced. This framework first identifies gene groups in which the genes have a high correlation coefficient by means of a clustering algorithm. Finally, a prediction model is constructed on the shrinkage gene space using an efficient classification algorithm (such as Support Vector Machine (SVM), 1-nearest neighbor (1NN), or regression). On experimental results obtained from real-world data, the framework has been shown to frequently outperform available feature selection and prediction methods, such as Significance Analysis of Microarrays (SAM), Information Gain (IG) and the Lasso-type prediction model (Jing et al 2010).
Liping Jing et al (2010) presented a Stable Gene Selection (SGS) and efficient cancer prediction framework. The proposed framework has first identified the gene groups where the genes in each group have a high correlation coefficient by means of a clustering algorithm, and then it has employed Bayesian Lasso and group Lasso to select significant genes in each group and important gene groups, respectively, and finally based on shrinkage gene space with efficient classification algorithm (like Support Vector Machine (SVM), Single Nearest Neighbour (1NN),
Regression, etc.) to construct the prediction model. On experimental results obtained from real-world data, the proposed framework has been shown to frequently outperform available feature selection and prediction methods, such as Significance Analysis of Microarrays (SAM), Information Gain (IG) and the Lasso-type prediction model. Further, many clustering algorithms require a complete matrix of input values, so imputation (missing data estimation) techniques need to be considered before clustering. GE data are intrinsically noisy, resulting in outliers, which are typically managed by: (i) robust statistical estimation/testing (when extreme values are not of primary interest) or (ii) identification (when outlier information is of intrinsic importance) (Liu, Cheng and Wu 2002). The following section reviews previous studies on the application of clustering algorithms to different data sets. Pollard and van der Laan (2002) proposed a statistical framework for two-way clustering, in which genes and samples are considered simultaneously and complex patterns can be identified. In this study, a simultaneous clustering parameter is defined as a function $\Phi(P)$ of the true data generating distribution $P$, and an estimate is obtained by applying this function to the empirical distribution $P_n$. The authors illustrate that a wide range of clustering procedures, including generalized hierarchical methods, can be defined as parameters which are compositions of individual mappings for clustering patients and genes. This framework allows one to assess classical properties of clustering methods, such as consistency, and to formally study statistical inference regarding the clustering parameter. Simulation results are presented to assess the asymptotic validity of different bootstrap methods for estimating the distribution of $\Phi(P_n)$.
Mendez et al (2002) presented a procedure that combines classical statistical methods to assess the confidence of gene clusters identified by hierarchical clustering of expression data. This approach was applied to a publicly released Drosophila metamorphosis data set (White et al 1999). The study can produce reliable classifications of gene groups and genes within the groups by applying unsupervised (cluster analysis), dimension reduction (principal component analysis) and supervised methods (linear discriminant analysis) in a sequential form. This procedure
provides a means to select relevant information from microarray data, reducing the number of genes and clusters that require further biological analysis. Xu et al (2002) proposed three Minimum Spanning Tree (MST)-based algorithms: removing long MST edges, a center-based iterative algorithm, and a representative-based globally optimal algorithm. For a specific dataset, however, users do not know which algorithm is suitable. Most clustering algorithms become ineffective when provided with unsuitable parameters or applied to datasets composed of clusters with diverse shapes, sizes and densities. Du and Lin (2004) suggested a parallelized hierarchical clustering algorithm to solve the problem that traditional hierarchical clustering cannot handle large data sets within reasonable time and memory resources. The algorithm was implemented on a Multiple Instruction Multiple Data (MIMD) architecture and showed a considerable reduction in computational time and inter-node communication overhead, especially for large data sets. The authors used the standard message passing library, Message Passing Interface (MPI), which is available on any MIMD system. Seal, Komarina and Aluru (2005) studied clustering algorithms on gene expression data for finding co-regulated genes, and addressed the high run time of hierarchical clustering based on the Pearson correlation coefficient. The run time can be reduced to O(N^2) by applying known hierarchical clustering algorithms [Proc. 9th Annual ACM-SIAM Symposium on Discrete Algorithms, 1998, pp. 619–628]. The study presents an algorithm which runs in O(N log N) time using a geometrical reduction and shows that it is optimal. He, Pan and Lin (2006) presented multivariate normal mixture model based clustering analyses to detect differential gene expression between two conditions.
Deviating from the general mixture model and model-based clustering, mixture models with specific mean and covariance structures that account for special features of two-condition microarray experiments were proposed. Explicit updating formulas in the Expectation-Maximization (EM) algorithm for three such models are derived. The methods are applied to a real dataset to compare the expression levels of 1176 genes of rats with and without
pneumococcal middle-ear infection to illustrate the performance and usefulness of this approach. About 10 genes and 20 genes are found to be differentially expressed in a six-dimensional modelling and a bivariate modelling, respectively. Two simulation studies are conducted to compare the performance of univariate and multivariate methods. Depending on the data, neither method always dominates the other. The results suggest that multivariate normal mixture models can be useful alternatives to univariate methods for detecting differential gene expression in exploratory data analysis. Liang (2007) proposed a method to overcome the difficulties of mixture-Gaussian model-based clustering of gene expression profiles by using the probit transformation in conjunction with the Singular Value Decomposition (SVD). SVD reduces the dimensionality of the data, and the probit transformation converts the scaled eigensamples, which can be interpreted as correlation coefficients, into Gaussian random variables. The results show that the SVD-based probit transformation enhances the ability of the mixture-Gaussian model-based clustering method to identify prominent patterns in the data. As a by-product, the author reported that the SVD-based probit transformation also improves the performance of model-free clustering methods, such as hierarchical, K-means and Self-Organizing Maps (SOM), for data sets containing scattered genes. The study also proposed a run test-based rule for selecting the eigensamples used for clustering. Delibasic, Vukicevic, Jovanovic, Kirchner, Ruhland and Suknovic (2012) proposed an architecture for the design of representative-based clustering algorithms based on reusable components. These components were derived from K-means-like algorithms and their extensions.
With the suggested clustering design architecture, it is not only possible to reconstruct popular algorithms, but also to build new algorithms by exchanging components from original algorithms and their improvements. In this way, the design of a myriad of representative-based clustering algorithms and their fair comparison and evaluation are possible. In addition to the architecture, the study showed the usefulness of the proposed approach by providing experimental evaluation. However, this study recommends meta-learning as a better approach for intelligent algorithm selection particularly in the area of clustering and also this is a
relatively new and unexplored topic (de Souto, Prudencio, Soares, Araujo, Costa, Ludermir and Schliep 2008). In addition, clustering algorithms that combine the advantages of hierarchical and partitional clustering have been proposed in the literature (Cheng, Kannan, Vempala and Wang 2006; Kaukoranta, Fränti and Nevalainen 1998; Lee and Olafsson 2011; Lin and Chen 2005; Liu, Jiang and Kot 2009). This kind of hybrid algorithm analyzes the dataset in two stages. In the first stage, the dataset is split into a number of subsets with a partitioning criterion. In the second stage, the produced subsets are merged in terms of a similarity measure. Different split and merge approaches have been designed in several hybrid algorithms. Cohesion Self-Merging (CSM) (Lin and Chen 2005) first applies K-means to partition the dataset into K0 subsets, where K0 is an input parameter. Afterwards, single linkage, which uses a dedicated cohesion function as the similarity measure, is utilized to iteratively merge the K0 subsets until K subsets are achieved. In the split stage, as K-means may produce different partitions in different runs, the final results may be unstable. CHAMELEON (Karypis, Han and Kumar 1999) is another example of a hybrid clustering algorithm. It constructs a K-nearest neighbour graph, and employs a graph cut scheme to partition the graph into K0 subsets. Relative inter-connectivity and relative closeness are defined to merge the subsets. Liu et al (2009) proposed a multi-prototype clustering algorithm, which can also be considered as a hybrid method. The method uses a convergence mechanism, and repeatedly performs split and merge operations until the prototypes remain unchanged. However, many empirical parameters are involved.
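The two-stage split-and-merge idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the split stage is a plain K-means with a deterministic initialisation (the first K0 points), and the merge stage uses the ordinary minimum point-to-point distance as a stand-in for the dedicated cohesion function of CSM, which is not specified here. The function names and the toy data are hypothetical.

```python
import math


def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans_split(points, k0, iters=20):
    """Split stage: plain K-means producing up to k0 subsets.
    Assumption: centres are initialised with the first k0 points."""
    centres = [points[i] for i in range(k0)]
    groups = []
    for _ in range(iters):
        groups = [[] for _ in range(k0)]
        for p in points:
            nearest = min(range(k0), key=lambda c: dist(p, centres[c]))
            groups[nearest].append(p)
        # Recompute each centre as the mean of its group (keep it if empty).
        centres = [tuple(sum(col) / len(g) for col in zip(*g)) if g else centres[i]
                   for i, g in enumerate(groups)]
    return [g for g in groups if g]


def merge_subsets(groups, k):
    """Merge stage: repeatedly join the two closest subsets until k remain.
    Closeness is the minimum pairwise distance, not CSM's cohesion function."""
    groups = [list(g) for g in groups]
    while len(groups) > k:
        pairs = [(min(dist(p, q) for p in groups[a] for q in groups[b]), a, b)
                 for a in range(len(groups)) for b in range(a + 1, len(groups))]
        _, a, b = min(pairs)
        groups[a] += groups.pop(b)  # a < b, so index a is unaffected by the pop
    return groups


# Two elongated toy clusters that a K-means with K = 2 could easily cut badly.
data = [(0, 0), (3, 0), (0, 10), (3, 10), (1, 0), (2, 0), (1, 10), (2, 10)]
subsets = kmeans_split(data, k0=4)
final = merge_subsets(subsets, k=2)
print(sorted(sorted(g) for g in final))
```

Because the split stage depends on the initial centres, a different initialisation can give a different set of K0 subsets, which is the source of the instability noted above.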
Kaukoranta et al (1998) proposed a split-and-merge algorithm, where the objective function is to minimize the mean squared error. A Minimum Spanning Tree (MST) is a useful graph structure, which has been employed to capture perceptual grouping (Jain and Dubes 1998). Zahn (1971) defined several criteria of edge inconsistency for detecting clusters of different shapes. However, for datasets consisting of differently shaped clusters, the method lacks an adaptive selection of the criteria. To handle clusters with diverse shapes, sizes and densities, Zhong, Miao and Fränti (2011) proposed a novel split-and-merge hierarchical clustering method
in which an MST and an MST-based graph are employed to guide the splitting and merging process. In the splitting process, vertices with high degrees in the MST-based graph are selected as initial prototypes, and K-means is used to split the dataset. In the merging process, subgroup pairs are filtered and only neighbouring pairs are considered for merging. The proposed method requires no parameter except the number of clusters. Experimental results demonstrate its effectiveness on both synthetic and real datasets. Analysis of large GE datasets is a relatively new task, although pattern recognition of complex data is well established in a number of fields. Many common generic algorithms have, in consequence, been adopted for GE data (e.g. hierarchical clustering (Eisen and Spellman 1998) and SOMs (Kohonen 1990)), but not all perform well. A good method must deal with noisy high-dimensional data, be insensitive to the order of input, have moderate time and space complexity (i.e. allow increased data load without breakdown or requirement of major changes), require few input parameters, incorporate meta-data knowledge (an extended range of attributes) and produce results which are interpretable in the biological context.
2.2 RESEARCH GAPS IN EXISTING METHODS
It has been concluded from the previous studies that cluster analysis applied to GE data aims to highlight meaningful patterns for gene co-regulation. The evidence suggests that, while commonly applied, agglomerative and partitive techniques are insufficiently powerful given the high dimensionality and nature of the data. While further testing on non-standard and diverse data sets is required, comparative assessment and numerical evidence, to date, support the view that bi-clustering methods, although computationally expensive, offer better interpretation in terms of data features and local structure.
While the limitations of commonly used algorithms are well documented in the literature, adoption by the bioinformatics community of new (and hybrid) techniques developed specifically for GE analysis has been slow, mainly due to the increased algorithmic complexity involved. This would be catalysed by more transparent guidelines and increased availability in specialised software and public dataset repositories.
In comparison to other methods, hierarchical clustering has been used extensively by biologists in microarray data analysis. It discovers groupings by repeatedly combining pairs of data points, or sets of points, that are adjacent to each other in the feature space until all data points are combined into a single set. The objective of hierarchical clustering is to obtain the clustering that best represents a set of patterns under a given distance metric (Jin Hwan Do and Dong-Kug Choi 2007). This method is preferred among biologists because it permits users to visualize global expression patterns in DNA microarray data through a graphic representation of the clustering results. Generally, it is classified as agglomerative (bottom-up) or divisive (top-down), and is based on a similarity or distance measure of the data, such as correlation, Euclidean, squared Euclidean, or city-block (Manhattan) distance. The hierarchical tree is constructed from the distances between pairs of objects in the correlation matrix. In previous studies, data were partitioned by clustering algorithms in which each gene belonged to only one cluster (Zhaohui Qin 2006; Minsoo Lee et al 2007). The limitations of these methods include high sensitivity to noise, outliers and non-linearity, lack of validity, difficulty in handling clusters of different sizes and shapes, added time (He, Pan and Lin 2006), inability to detect small sets, increased algorithmic complexity (Liang 2007), and lack of statistical tests and interpretation of results (He, Pan and Lin 2006; Liu et al 2009). In addition, these methods have disadvantages when working with microarray gene expression data, giving rise to biological complexity (Zhong, Miao and Fränti 2011). The nature of proteins and their interactions is the major reason for this.
The genes that generate proteins are expected to co-express with more than one group of genes because proteins generally perform diverse biological functions by interacting with different groups of proteins. This explains the inclusion of a gene in more than one cluster of microarray gene expression data. Further, a good method must deal with noisy high-dimensional data, be insensitive to the order of input, have moderate time and space complexity (i.e. allow increased data load without breakdown or requirement of major changes), require few input parameters, incorporate meta-data knowledge (an extended range of attributes) and produce results which are interpretable in the biological context.
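One simple way to picture a gene belonging to more than one cluster is through graded memberships, as in fuzzy c-means. The Python sketch below computes the standard FCM membership of a single expression profile in each cluster for fixed centres, and then reports every cluster whose membership reaches a cut-off. It is not the model proposed in this thesis; the fixed centres, the fuzzifier m = 2, the 0.3 cut-off and all function names are assumptions chosen for the example.

```python
import math


def fuzzy_memberships(profile, centres, m=2.0):
    """FCM-style membership of one profile in every cluster:
    u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1)),
    where d_i is the distance to centre i and m > 1 is the fuzzifier."""
    d = [math.dist(profile, c) for c in centres]
    zeros = [x == 0.0 for x in d]
    if any(zeros):
        # Profile coincides with one or more centres: share membership among them.
        return [1.0 / sum(zeros) if z else 0.0 for z in zeros]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dk) ** power for dk in d) for di in d]


def clusters_of(profile, centres, threshold=0.3):
    """Indices of all clusters whose membership reaches the (hypothetical) cut-off."""
    u = fuzzy_memberships(profile, centres)
    return [i for i, ui in enumerate(u) if ui >= threshold]


centres = [(0.0, 0.0), (4.0, 0.0)]           # hypothetical cluster centres
print(fuzzy_memberships((1.0, 0.0), centres))  # close to the first centre
print(clusters_of((1.0, 0.0), centres))        # belongs to one cluster
print(clusters_of((2.0, 0.0), centres))        # midway: belongs to both clusters
```

The memberships of a profile always sum to one, so the cut-off controls how readily a gene is reported in several clusters at once.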
Unlike data from model organisms and cell lines that have a uniform genetic background, and where experiments are conducted under controlled conditions, disease samples are typically much more heterogeneous. Differences in the genetic background of the subjects, disease stage, progression and severity, as well as the presence of disease subtypes, contribute to the overall heterogeneity. Discovering the genes or features that are most relevant to the disease in question and identifying disease subtypes from such heterogeneous data remain open problems. Owing to the large variability in gene mutations and gene expression, especially in cancer populations, not all patients have the same response to therapy, which poses a major challenge to physicians in treatment. Hence, with this background, an improved clustering model is proposed in this thesis. The first and third models, based on semi-supervised and two-dimensional hierarchical clustering, are proposed to represent the presence of a gene in one or more clusters, consistent with the nature of the gene and its attributes, and to prevent biological complexities by means of a hybrid distance-based similarity measure. The second model is based on the quad tree, which speeds up the clustering process and finds the closest pair quickly.
2.3 OBJECTIVES OF THIS WORK
In order to gain a better insight into the problem of cancer classification, systematic approaches based on global gene expression analysis have been presented. The main challenges in the existing algorithms are that each gene belongs to only one cluster and that the processing time is high. The study aimed to overcome these challenges using microarray gene expression data.
Thus, the study aims to evaluate microarray gene expression data of acute human leukemia, with the target of distinguishing between ALL and AML, a typical cancer classification problem that is not well solved despite many years of research. This research addresses the classification (prediction) problem (Zhang and Ke 2000) using the two standard leukemia datasets for training and testing obtained from the ALL/AML datasets, and demonstrates the performance of this hierarchical technique in clustering the ground truth data of the cancer classes, namely AML and ALL. The study has the following specific research objectives:
• To develop an enhanced clustering model that analyzes the presence of a gene in more than one cluster, using the microarray gene expression data of acute human leukemia.
• To develop an enhanced model that reduces the processing time of clustering and of finding the closest pair of elements using the hybrid similarity measure. The clustering elements are selected from the microarray gene expression database by means of the index matrix, and the best 'K' clusters are identified using fitness evaluation.
• To determine an optimum number of clusters for a given dataset.
• To analyze the performance of the enhanced model, as presented below, using Precision, Recall and the F-measure: the novel semi-supervised hierarchical clustering is compared with the unsupervised techniques.
• To compare the quad tree based hierarchical clustering with the semi-supervised hierarchical clustering without the quad tree.
• To compare the two-dimensional hierarchical clustering with the hybrid similarity measures against the semi-supervised hierarchical clustering.
• To validate the enhanced clustering technique by applying the evaluation metric to the clustering results.
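The Precision, Recall and F-measure evaluation named in the objectives can be illustrated with a minimal sketch. It assumes, purely for illustration, that each cluster is labelled with the majority ground truth class of its members before the scores are computed; the function name `cluster_prf` and the toy sample labels are hypothetical and are not taken from the thesis.

```python
from collections import Counter


def cluster_prf(true_labels, cluster_ids, positive):
    """Precision, recall and F-measure for one class of a clustering result.

    Assumption: every cluster is labelled with the most frequent true
    class among its members (majority vote), which turns the clustering
    into a prediction that can be scored against the ground truth.
    """
    # Collect the true labels of the samples that fall in each cluster.
    members = {}
    for label, cid in zip(true_labels, cluster_ids):
        members.setdefault(cid, []).append(label)

    # Majority-vote label per cluster, then a predicted label per sample.
    majority = {cid: Counter(labels).most_common(1)[0][0]
                for cid, labels in members.items()}
    predicted = [majority[cid] for cid in cluster_ids]

    pairs = list(zip(true_labels, predicted))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Toy example (hypothetical): six samples, two clusters.
truth = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
clusters = [0, 0, 0, 0, 1, 1]  # one AML sample falls in the ALL-dominated cluster
print(cluster_prf(truth, clusters, positive="ALL"))
print(cluster_prf(truth, clusters, positive="AML"))
```

Majority-vote labelling is only one of several ways to align clusters with classes; pairwise counting over sample pairs is a common alternative when clusters should not be forced onto class labels.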
From the review, it is evident that the clustering algorithms used in previous research partition the data in such a way that each gene belongs to only one cluster. Examples of clustering algorithms applied to gene expression data that assign a gene to only one cluster are the K-means algorithm, the hierarchical clustering algorithm, the biclustering algorithm, the fuzzy k-means algorithm and SOM. However, these methods have disadvantages when working with microarray gene expression data, which gives rise to biological complexity. The nature of proteins and their interactions is the major reason for this. Genes that encode proteins are expected to co-express with more than one group of genes, because proteins generally perform diverse biological roles by interacting with diverse groups of proteins. This explains the inclusion of a gene in more than one cluster of microarray gene expression data. In this research, a novel two-dimensional hierarchical clustering is proposed to represent the existence of genes in one or more clusters, consistent with the nature of the gene and its attributes, together with methods to prevent biological complexities. Hence, an architecture for two-dimensional hierarchical clustering is developed in this thesis, which provides three different analyses including semi-supervised hierarchical