As a B.Tech project, I developed a faster search method that reduces both space and time requirements, thereby enhancing the data-mining process across various application domains.
-> Achieved lossless data compression with a high compressibility ratio
-> Performed clustering in the compressed domain for classification and pattern matching
-> Applied the method to the human genome to identify cancer signatures
Abstract
Many big datasets exhibit a lot of redundancy that can be exploited to design faster query-search tools. This report exploits such redundancy to design entropy-scaling search tools on well-defined, structured datasets. Frameworks for similarity search based on characterizing the dataset's entropy and fractal dimension are used. These search techniques are shown to have lower time and space complexity: time scales with the metric entropy (the number of covering hyperspheres) when the fractal dimension is low, and space scales with the sum of the metric entropy and the information-theoretic entropy. The approach also accelerates standard tools with no loss in specificity and little loss in sensitivity. In the later part, we estimate the weight of each mutational signature in order to cluster the DNA sequences hierarchically using a genetic algorithm (GA).
Methods
Entropy scaling search for massive biological data
Gokulakannan Selvam, Raghunathan Rengaswamy *
Department of Chemical Engineering, Indian Institute of Technology Madras
The reduction in time and space in the entropy-scaling search is achieved using the following methods:
1. Every DNA sequence is transformed with the Burrows-Wheeler transform (BWT), which forms the basis for further compression.
2. The BWT output is passed through a move-to-front (MTF) transform, which skews the symbol distribution toward small values and makes the data more compressible.
3. The transformed data is then encoded with a Huffman code (for high compression).
4. Dissimilarities between every pair of compressed DNA sequences are computed (Hamming distance).
5. The sequences are clustered hierarchically and appropriate cluster centers are estimated.
6. The weights of all six mutational signature classes (C>A, C>G, C>T, T>A, T>C, T>G) are then estimated using a genetic algorithm, whose fitness function applies simple evolutionary principles to find the optimum values.
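The BWT -> MTF -> Huffman stages of steps 1-3 can be sketched as follows. This is an illustrative toy, not the project's implementation: the BWT here uses naive sorted rotations, and the Huffman routine only computes code lengths (enough to estimate the compressed size); the sample sequence is made up.

```python
import heapq
from collections import Counter

def bwt(s, sentinel="$"):
    """Burrows-Wheeler transform via sorted rotations (naive, O(n^2 log n))."""
    s += sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def move_to_front(s):
    """Move-to-front coding: runs of equal symbols become runs of small integers."""
    alphabet = sorted(set(s))
    out = []
    for ch in s:
        i = alphabet.index(ch)
        out.append(i)
        alphabet.insert(0, alphabet.pop(i))  # move the symbol to the front
    return out

def huffman_lengths(symbols):
    """Huffman code length per symbol, built by merging the two lightest nodes."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    heap = [(n, i, {sym: 0}) for i, (sym, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tick = len(heap)  # tie-breaker so dicts are never compared
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**c1, **c2}.items()}
        heapq.heappush(heap, (n1 + n2, tick, merged))
        tick += 1
    return heap[0][2]

seq = "ATGGTGCATCTGACTCCTGAGGAGAAG"  # toy DNA fragment
mtf = move_to_front(bwt(seq))
lengths = huffman_lengths(mtf)
bits = sum(lengths[s] for s in mtf)
print(f"{len(seq) * 8} bits raw -> ~{bits} bits Huffman-coded after BWT+MTF")
```

BWT groups similar contexts so MTF emits mostly small integers, which Huffman then codes with short codewords; this is why the three stages are applied in this order.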
[Figure: Entropy-scaling framework. A: finding dissimilarities, B: selecting cluster centers, C: coarse search, D: fine search]
Flow chart
start
Initialize the weights
Cluster hierarchically and then estimate the
total number of mis-match
If change in objective function is less
than tolerance
stop
Yes
NO
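The loop in the flow chart can be sketched as a minimal GA with the same tolerance-based stopping rule. This is an assumption-laden sketch, not the project's code: the objective below is a hypothetical stand-in for the clustering mismatch count, and the selection, crossover, and mutation operators are generic choices.

```python
import random

def evolve(objective, n_weights=6, pop_size=100, tol=1e-6, max_gen=500):
    # Initialize the weights: pop_size candidates, each a vector of
    # n_weights values drawn from the initial range [0, 100000].
    pop = [[random.uniform(0, 1e5) for _ in range(n_weights)]
           for _ in range(pop_size)]
    best_prev = float("inf")
    for _ in range(max_gen):
        pop.sort(key=objective)            # evaluate and rank
        best = objective(pop[0])
        # Stop when the change in the objective falls below the tolerance.
        if abs(best_prev - best) < tol:
            break
        best_prev = best
        parents = pop[: pop_size // 2]     # selection: keep the better half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, n_weights)        # one-point crossover
            child = a[:cut] + b[cut:]
            child[random.randrange(n_weights)] += random.gauss(0, 100.0)  # mutation
            children.append(child)
        pop = parents + children
    best_ind = min(pop, key=objective)
    return best_ind, objective(best_ind)

# Hypothetical stand-in for "total number of mismatches" after clustering:
target = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
objective = lambda w: sum((wi - ti) ** 2 for wi, ti in zip(w, target))
best_w, best_val = evolve(objective)
```

In the project, `objective` would instead re-cluster the sequences for the candidate weights and return the total mismatch count.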
[Plot 2: fitness value and average distance vs. generations, for population size = 100 and initial range [0, 100000]]
Results
For a 100x100 set of randomly generated sequences:
Total size of original data = 12,288 bytes
Total size of compressed data = 3,919 bytes
Compressibility ratio ~ 3.14
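A quick check of the reported ratio from the two sizes above:

```python
# Sizes reported in the Results section.
original_bytes = 12288
compressed_bytes = 3919
ratio = original_bytes / compressed_bytes
print(f"compressibility ratio = {ratio:.2f}")  # prints 3.14
```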
[Plot 1: comparison of results without compression vs. with compression]
For the 10 files {file 1, file 2, .., file 10}, the observed clusters are:
Cluster 1: {file 1}
Cluster 2: {file 2}
Cluster 3: {file 3, file 4, file 5, .., file 9, file 10}
Estimated signature weights and objective value:
w1        w2        w3        w4       w5        w6        objective function
-144552   2310314   -119340   114300   217846.2  191830.1  -0.03597
Theory
The local fractal dimension around a data point can be computed by counting the other data points within radii r1 and r2 of that point; given those counts n1 and n2,
d = log(n2 / n1) / log(r2 / r1)
Time complexity measures how long the process takes, while space complexity measures how much memory is used during the process.
Order = O(k + |B_D(q, r)| * ((r + 2r_c) / r)^d)
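The local fractal dimension formula above can be computed directly by counting points inside the two radii. A minimal sketch, using made-up uniform 2-D data (for which d should come out close to 2):

```python
import math
import random

def local_fractal_dimension(points, center, r1, r2):
    """d = log(n2/n1) / log(r2/r1), where n_i counts points within radius r_i."""
    def within(r):
        return sum(1 for p in points if math.dist(p, center) <= r)
    n1, n2 = within(r1), within(r2)
    return math.log(n2 / n1) / math.log(r2 / r1)

# Points spread uniformly over a 2-D square: counts grow like r^2,
# so the estimated local dimension should be near 2.
random.seed(1)
pts = [(random.random(), random.random()) for _ in range(20000)]
d = local_fractal_dimension(pts, (0.5, 0.5), 0.1, 0.2)
print(round(d, 2))
```

For real sequence data the "points" would be the compressed sequences under the chosen dissimilarity measure, and a low d is what lets the metric entropy dominate the search time.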
References
1. Y. W. Yu, N. M. Daniels, D. C. Danko, "Entropy-scaling search of massive biological data", Cell Systems.
2. D. Adjeroh, Y. Zhang, A. Mukherjee, T. Bell, "DNA sequence compression using the Burrows-Wheeler Transform".
3. P. Ferragina and G. Manzini, "Opportunistic data structures with applications".
Conclusion
In this project we introduced an entropy-scaling framework for accelerating approximate search on dynamic omics data.
This approach bounds both time and space as functions of the dataset's entropy (metric entropy bounds time, while information-theoretic entropy bounds space).
A low fractal dimension ensures that the run time is dominated by the metric entropy.
The estimated weights of all six mutational signatures depend on many factors, such as the GA population size and initial range.
Better results can be obtained by tuning the GA and by considering additional features and larger datasets.