DNA Visual And Analytic Data Mining
Patrick Hoffman (1), Georges Grinstein (1), Ken Marx (2), Ivo Grosse (3), Eugene Stanley (3)

(1) Institute for Visualization and Perception Research
Department of Computer Science
University of Massachusetts Lowell
Lowell, MA 01854

(2) Center for Intelligent Biomaterials
Department of Chemistry
University of Massachusetts Lowell
Lowell, MA 01854

(3) Center for Polymer Studies and Department of Physics

Abstract

The current approach for accurately finding gene locations is mainly experimental and thus consumes lengthy laboratory time. This is costly, and any small increase in the accuracy of computer classification can result in substantial cost savings. In this paper we describe our recent experience in harnessing data exploration techniques to classify DNA sequences. Several visualization and data mining techniques were used to validate and attempt to discover new methods for distinguishing coding DNA sequences, or exons, from non-coding DNA sequences, or introns. A new statistic for DNA sequences, called the Average Mutual Information (AMI) and discovered by one of the authors, is based on nucleotide frequency distributions extracted from a small segment of a DNA sequence and is used as a predictor for distinguishing exons from introns. The goal of the data mining was to see whether some other, possibly non-linear, combination of these values could be a better predictor than the AMI. We tried many different classification techniques, including rule-based classifiers and neural networks. We also used visualization of both the original data and the results of the data mining to help verify patterns and to understand the distinction between the different types of data and classifications. In the conclusion, we discuss the interactions between the visualizations and data mining and suggest an integration approach.

1.0 Introduction

The current approach for finding genes (protein coding sequences) is mainly experimental. Thus the procedure is costly and consumes lengthy laboratory time. Any small increase in the accuracy of computer classification can therefore result in substantial cost savings. In this paper we describe our experience in harnessing data exploration techniques to classify DNA sequences.

Several visualization and data mining techniques were used to validate and attempt to discover new methods for distinguishing coding DNA sequences, or exons, from non-coding DNA sequences, or introns. A new statistic for DNA sequences, called the Average Mutual Information (AMI) and discovered by one of the authors, is based on nucleotide frequency distributions extracted from a small segment of a DNA sequence and is used as a predictor for distinguishing exons from introns. The goal of the data mining was to see whether some possibly non-linear combination of these values could be a better predictor than the AMI.

After a brief introduction to genetics and to the various data mining techniques, we discuss our analytic approach, the visualizations we used and developed, and the results of these analyses. We also briefly suggest how visualization and data mining could be integrated in the future.

1.1 Genetic Background

The international effort called the Human Genome Project is rapidly sequencing the complete DNA sequences of all 23 human chromosomes. The chromosomes of a number of other organisms are also being entirely sequenced. The DNA components of chromosomes are long linear molecules composed of strings of the four nucleotides (A, C, T, G), the information-bearing chemical units. Along the chromosomes, non-coding sequences (introns) are interspersed with coding sequences (exons), whose information encodes protein structures. Transcription of the coding DNA sequence into mRNA, which is then translated into proteins in the cell, constitutes the general flow of information. This process is responsible for normal cellular functions as diverse as development into multicellular organisms, organ development, and the immune system, to name a few, as well as abnormal functions such as cancer and birth defects.

1.2 Data Mining Tools Used

1.2.1 Clementine

Clementine [2] is a data mining suite based on the data flow visual programming paradigm, similar to AVS or IBM's Data Explorer. It provides four machine learning modules: two rule-based algorithms, a standard neural net (multi-layer perceptron), and a Kohonen neural net for clustering, each with default settings. Elaborate tuning is possible but not necessary to get some early results. One rule-based classifier in Clementine is the C4.5 algorithm by Quinlan [11]. There was no information on the algorithm used by Clementine's other rule-based classifier.

1.2.2 Tooldiag

Tooldiag [12], a set of pattern recognition tools, provides several classifiers, including K-Nearest Neighbor and Quadratic Gaussian, as well as Principal Component Analysis. It can also output data and setup files for the Stuttgart Neural Network Simulator.

1.2.3 Stuttgart Neural Network Simulator

SNNS [15,17] is a comprehensive X Windows-based simulator for a large number of different types of pruned and unpruned neural networks, including back propagation, counter propagation, quick propagation, back percolation 1, generalized radial basis functions (RBF), Rprop, ART1 and ART2, as well as time-delay networks.

2.0 Data Mining DNA Sequences

Fickett [4] developed databases consisting of sequences of known exons or introns and described several classification methods. Some methods depend on knowing the particular starting and ending sequences. The Mutual Information function has also been developed and studied. It provides a measure of the influence of a particular nucleotide on another nucleotide n positions away.

For biochemical reasons, coding sequences possess a triplet codon information structure. Nucleotides that are 3, 6, 9, and generally 3n positions away from each other therefore have higher correlations. Thus we only need to look at the frequencies of A, C, T, G extracted from a small segment of DNA at codon positions 1, 2, and 3, which gives 12 values (4 nucleotides at each of 3 positions). The AMI is defined as a particular combination of these 12 values and is used as a predictor for distinguishing exons from introns. It classifies DNA sequences with a high degree of accuracy (76 to 80%).

Is it possible that there are other functions of these values that can distinguish exons from introns even better? We decided to use data mining to help find these functions.

In our initial study we examined several thousand of Fickett's exon and intron sequences of various lengths. Our first task was to divide the data into training and test sets. The training sets were used to build a classifier, similar to the rule-based ID3 [10], and a neural net. The test sets were used to evaluate the accuracy of the classifier.

To date, classification programs have reached accuracies between 73 and 81 percent. This provided the baseline that we compared against. Since there are only two classes of data, any classifier should reach at least a minimum accuracy of 50%.

3.0 Results

3.1 Initial results

Initially, only 200 points (100 exons and 100 introns) were used to train Clementine's two rule-based classifiers and neural net. The default NN used 12 input nodes, 4 hidden layers and 1 output node. After training, the rule-based classifiers were correct 93 and 94 percent of the time, while the NN was only about 80% accurate on the training data. With 800 non-training samples (400 exons and 400 introns) we obtained the following accuracy:

                   Correct          Wrong
Neural Net         638 (79.55%)     164 (20.45%)
C4.5 Rule          551 (68.70%)     251 (31.30%)
Clementine Rule    573 (71.45%)     229 (28.55%)

These initial results were very promising for the neural network, since we knew we could potentially tune the network for better performance, and we were still using a small training set.

At this point we were ready to explore other packages, but more pressing was trying to understand in greater detail the structure of the data we had generated. Unfortunately, neural networks and classification rules do not easily reveal their insights into the data. We wanted to "see" these differences! Visualization was [...]

3.2 Adding Visualization

We used several visualization approaches to look at both the original Fickett data and the processed data.

3.2.1 Radial Visualizations

Spring constants can be used to represent relational values between points [1,9]. We developed a radial visualization, similar in spirit to parallel coordinates (a lossless visualization), in which the n dimensions of the data are laid out as points equally spaced around the perimeter of a circle. One end of each of n springs is attached to one of these n perimeter points. The other ends of the springs are attached to a data point. The spring constant Ki equals the value of the i-th coordinate of the data point. Each data point is then displayed where the sum of the spring forces equals 0. All the data point values are normalized to lie between 0 and 1.

For example, if all n coordinates have the same value, the data point will lie exactly in the center of the circle. If the point is a unit vector, then that point will lie exactly at the fixed point on the edge of the circle (where the spring for that dimension is fixed). Many points can map to the same position, as in the Exvis displays. This represents a non-linear transformation of the data which preserves certain symmetries and which produces an intuitive display. Some features of this visualization include:

• points with approximately equal coordinate values will lie close to the center
• points with similar values whose dimensions are opposite each other on the circle will lie near the center
• points which have one or two coordinate values greater than the others lie closer to those dimensions

Figure 1 displays 2000 points using Radviz: red points are exons and blue points are introns. Most points lie close to the center, implying equal forces. Figure 2 shows 500 exons, zoomed by a factor of 10 with the point size increased. In this picture we discovered a "symmetry" of the data around a line drawn between dimensions 5 & 6 and 0 & 11. This mirror image is a consequence of the complementary pair nature of Fickett DNA sequences: consecutive lines of the sequences were the reverse of the previous lines. This was known by one member of our research group, who assumed everyone knew this! This symmetry needed to be corrected for the data mining.
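
The spring layout described in this section can be sketched in a few lines of Python. This is only a minimal illustration, not the Radviz implementation used for the figures. It assumes zero-rest-length springs, so the zero-force position is the spring-constant-weighted mean of the perimeter points. The function names, the nucleotide order "ACTG", the reading frame starting at index 0, and the placement of dimension i at angle 2*pi*i/n are all assumptions made for the example.

```python
import math

NUCLEOTIDES = "ACTG"  # assumed ordering of the four nucleotides


def codon_position_frequencies(seq):
    """Return 12 values: the frequency of A, C, T, G at codon positions 1, 2, 3.

    Assumes the reading frame starts at index 0 of `seq`.
    """
    counts = [[0] * 4 for _ in range(3)]
    totals = [0, 0, 0]
    for i, base in enumerate(seq.upper()):
        k = NUCLEOTIDES.find(base)
        if k < 0:
            continue  # skip symbols other than A, C, T, G
        counts[i % 3][k] += 1
        totals[i % 3] += 1
    return [counts[p][k] / totals[p] if totals[p] else 0.0
            for p in range(3) for k in range(4)]


def radviz_position(values):
    """Zero-force position of one data point inside a unit circle.

    Dimension i is anchored at angle 2*pi*i/n. A spring with constant
    values[i] pulls the point toward anchor i, so the forces cancel at the
    weighted mean of the anchor positions.
    """
    n = len(values)
    total = sum(values)
    if total == 0:
        return (0.0, 0.0)  # no spring pulls: put the point at the center
    x = sum(v * math.cos(2 * math.pi * i / n) for i, v in enumerate(values)) / total
    y = sum(v * math.sin(2 * math.pi * i / n) for i, v in enumerate(values)) / total
    return (x, y)


if __name__ == "__main__":
    features = codon_position_frequencies("ATGGCCATTGTAATGGGCCGC")
    print(radviz_position(features))
```

With this sketch, a point whose coordinates are all equal lands at the center, and a unit vector lands on the perimeter point of its own dimension, matching the two cases described in the text.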

Figure 3 displays 2000 points zoomed by a factor of 5. In this picture we can see that the exons (red) are more spread out, and the introns (blue) are closer to the center of the circle. Figure 4 represents the same data with a zoom factor of 8 and larger points. Notice that zooming produces points well outside the circle, which is not possible with real springs. The explanation for the spreading of exons is that they are not as random as introns. A random frequency distribution tends to make the forces balance, which places the point closer to the center.

3.2.2 Parallel Coordinates

As a comparison, we used the Parallel Coordinate implementation in Xmdv [16]. Figures 5 and 6 display the same data. In Figure 5 introns are highlighted in blue, while in Figure 6 exons are highlighted in blue. Notice some spreading of introns is still apparent. Also notice the symmetry line can still be seen between c2 and g2 (ignoring the last two values).

3.2.3 Sammon Plots

Another interesting visualization is a Sammon plot [14]. We used Tooldiag to produce Figure 7. The Sammon plot reduces dimensions by trying to preserve the distances between data points. Notice that the spreading of exons is readily apparent but that the symmetry is difficult to observe.

4.0 Further Analysis

The previous visualizations helped us cleanse the Fickett datasets by eliminating the redundancies. They also reinforced insights about exons and introns which helped us refine the directions we wanted to go with the neural networks, thereby improving the [...]

So far the most promising results come from an SNNS network using three values that were analytically extracted from the 12 values: the AMI and 2 other similar functions. The network combined the three values to get over 82% accuracy. Since each value by itself classifies with about 80% accuracy, this is by far our best result to date. We display the data used in the SNNS computations in Figures 8 and 9.

5.0 Conclusion & Future directions

The case study described here seems to show the need for integrating multi-dimensional visualization tools with data mining tools. Many packages provide standard scatter plots, and some have 3-d plots. However, reducing many dimensions to 2 or 3 is a difficult task (which dimensions to select?) and one which always produces the feeling that something is missing (is the other dimension more important?). Analytic tools do help. Neural nets and classifying and clustering algorithms are clearly powerful, but they still need to be guided by human insight. However, when the only output of analytic results presented is a few numbers such as "82% accuracy", or some 500-line rule of nested "if statements", the user is left stranded. Does the user understand the rules? Can the user believe the accuracy?

There is a need to integrate the analytic with the visual. Such integration with intelligent visualizations which automatically map data dimensions to the "best" display parameters such as color, texture, or the coordinate systems will prove attractive [e.g., 13].

Acknowledgments

This work is funded in part by a grant from Pfizer to the Center for Intelligent Biomaterials and the Institute for Visualization and Perception Research at the University of Massachusetts Lowell and Boston University's Center for Polymer Studies and the Department of Physics; and in part by a grant from the National Institute for Standards and [...]

References

[1] M. Ankerst, D. A. Keim, H.-P. Kriegel. Circle Segments: A Technique for Visually Exploring Large Multidimensional Data Sets. IEEE Visualization '96 Proceedings, Hot Topic, San Francisco, CA, 1996.
[2] Clementine. http://www.isl.co.uk/clem.html
[3] R. Erbacher and G. Grinstein. Issues in the development of 3D Icons. Proceedings of the Fifth Eurographics Workshop on Visualization in Scientific Computing, Springer-Verlag Publishers, pp. 109-131, 1994.
[4] J.W. Fickett and Chang-Shung Tung. Nucl. Acids Res. 20 (1992) 6441.
[5] G. Grinstein. Harnessing the Human in Knowledge Discovery. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, August 1996, Portland, Simoudis, Han, and Fayyad (eds), pp. 384-385.
[6] I. Grosse, K. Marx, S. Buldyrev, G. Grinstein, H. Herzel, P. Hoffman, A. Li, C. Meneses, and H.E. Stanley. Data Mining of Large Gene Datasets Using the Mutual Information Function. To appear in Journal of Biomolecular Structure and Dynamics.
[7] I. Grosse, H. Herzel, S. Buldyrev, and H.E. Stanley. Mutual information of coding and noncoding DNA. To appear in Nature.
[8] P. Hoffman. Radviz. http://www.cs.uml.edu/~phoffman/viz
[9] K.A. Olsen, R.R. Korfhage, K.M. Sochats, M.B. Spring and J.G. Williams. Visualisation of a Document Collection: The VIBE System. Information Processing and Management, Vol. 29, No. 1, pp. 69-81, Pergamon Press Ltd, 1993.
[10] J.R. Quinlan. Induction of decision trees. In Machine Learning, volume 1, pages 81-106. Kluwer Academic Publishers, 1986.
[11] J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan [...]
[12] T.W. Rauber. Tooldiag. Universidade de Lisboa, Dept. of Electrical [...]
[13] S.F. Roth. The SAGE Project. http://www.cs.cmu.edu/Web/Groups/sage/sage.html
[14] J.W. Sammon, Jr. A nonlinear mapping for data structure analysis. IEEE Transactions on Computers, C-18(5):401-409, May 1969.
[15] SNNS. http://www.informatik.uni-stuttgart.de/ipvr/bv/projekte/snns/snns.html
[16] M. Ward, A. Martin. High Dimensional Brushing for Interactive Exploration of Multivariate Data. Visualization '95, Atlanta, GA, 1995.
[17] A. Zell, G. Mamier, M. Vogt, N. Mache A. Der Stuttgarter Neuronale Netze Simulator, in G. Dorffner, K. Möller, G. Paaß, S. Vogel (Hrsg.): Konnektionismus und Neuronale Netze, GMD-Studien Nr. 272, Okt. 1995, pp. 335-350.