Finding concepts for organizing HERO data to enable
within site and between site comparisons to be made
Mark Gahegan, 10/20/01
I thought it might be useful to circulate some preliminary thoughts regarding the
gathering and use of variables that could form the basis of understanding and
representing some of the concepts of interest, and hence that might provide a means for
tracking the state of a study site through time, for comparing sites and also predicting
trajectories of less-studied places, via induction.
Obviously, each team has specific data interests and expertise. The variables gathered
are likely to reflect these interests and expertise, and so be different from place to place in
aspects of collection, meaning and scale, and therefore not readily comparable. Also, at
present, there is no top-down imposed list of data that each site is required to gather. The
first task, then, needs to be the identification of all data that are held by all sites, to see
where exact matches are possible (perhaps variables such as height, eco-region, and many
population variables are directly commensurate, whereas others, such as soil loss, flood
risk and economic viability may not be).
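As a rough illustration of this first task, the sketch below intersects per-site variable inventories to list which variables every site holds and which are held by only some sites. The site names and variable names are invented for illustration and do not come from any actual HERO inventory.

```python
# Hypothetical inventories: site and variable names are illustrative only.
site_variables = {
    "site_a": {"height", "eco_region", "population", "soil_loss"},
    "site_b": {"height", "eco_region", "population", "flood_risk"},
    "site_c": {"height", "eco_region", "population", "soil_loss",
               "economic_viability"},
}


def common_variables(inventories):
    """Return the variables present in every site's inventory."""
    sets = list(inventories.values())
    return set.intersection(*sets) if sets else set()


def partial_variables(inventories):
    """Map each variable not held by all sites to the sites that hold it."""
    shared = common_variables(inventories)
    partial = {}
    for site, variables in inventories.items():
        for name in variables - shared:
            partial.setdefault(name, set()).add(site)
    return partial


shared = common_variables(site_variables)
partial = partial_variables(site_variables)
print("held by all sites:", sorted(shared))
for name in sorted(partial):
    print(name, "held only by", sorted(partial[name]))
```

A matching name is only a starting point: two sites may both record 'soil_loss' yet measure it in ways that are not commensurate, so each apparent match would still need to be checked by the teams concerned.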
Imposing order on such a problem is obviously difficult, but as Sir Francis Bacon said:
“Truth emerges more readily from error than from confusion.” By imposing structure on
these data it should become apparent whether the structure aids our understanding, or
hinders it, and eventually what to do to improve such structures so they better serve our
Exploration and the Process of Science
To begin searching for useful structure, we use exploratory visualization and knowledge
discovery tools (Yuan et al., 2001; Valdez-Perez, 1999; MacEachren et al., 1999). If we
can construct useful concepts, i.e. ones that have meaning across sites, then they could
form the basis on which a quantitative analysis is later constructed. Figure 1 shows an
overview of the scientific process as conceptualized here, starting with exploration and
synthesis, and progressing via analysis to evaluation and presentation (Feyerabend, 1975;
Hanson, 1958; Kuhn, 1962; Popper, 1959; Feist and Gorman, 1998; Zimmerman, 2000;
Langley, 2000; Shrager and Langley, 1990; Thagard, 1988; Fayyad et al., 1996). The
roles played by different forms of inference are also shown (Peirce, 1878; Suppes, 1962;
GeoVISTA Studio (www.geovistastudio.psu.edu) will provide the functionality for
[Figure 1 image not reproduced; surviving labels: Communication, Learning, Induction.]
Figure 1. An overview of the scientific process as a number of stages, each one
predicated on the outcomes of the previous stage, by which meaning is constructed, used
Some Exploratory Tools
At this point, it might be useful to switch perspectives and approach the problem from the
opposite end to give you an idea where we are aiming—what kinds of exploratory
analysis might we use, what might we learn from them, and how might they help us
overcome the data heterogeneity problems outlined above?
Figure 2. A parallel coordinate plot; one possible visualization tool for exploring high
Figure 2 shows a parallel coordinate plot (from Studio) displaying 10 separate bio-
physical variables (related to environmental analysis) and colored according to possible
landcover classes. The collected data for over 1700 sample sites are shown as a series of
‘strings’, with each string representing one site. Clustering in the display suggests
commonalities between samples that might be further examined as a basis for imposing
high-level structures: the categories and concepts that many forms of quantification
require a priori. Figure 3 similarly shows 13 demographic variables defined for each
state in the USA, and Figure 4 shows three different views onto the same underlying data.
Figure 3. Socio-demographic data; the selected string (for California, which is an outlier
in most respects!) is shown in green.
Figure 4. Three views onto the same dataset.
A few properties of displays such as these are worth pointing out:
(1) Other views onto the data, such as those provided by maps and spreadsheets, can
be used together with visualization tools. The views can be linked such that, for
example, selected items appear highlighted in all views (MacEachren et al.,
(2) The selected axis (a single variable) provides a color palette that is applied to all
strings. The user can change the associated classification scheme, the axis
selected, the drawing order and many other visual properties.
(3) Data can be sorted according to correlation, or by principal components, to the
(4) The axes themselves provide a visual indicator of the spread of values for each
variable. For data to be visually comparable, they do not need to share the same
units and dynamic range, nor be represented on the interval or ratio scale. In
other words, variables can be gathered in a variety of ways, so long as an ordinal
comparison of their values makes sense. The plot is really a device for exploring
how neighboring axes relate to each other, to form a ‘signature’.
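Point (4) can be made concrete with a small sketch. Each variable is scaled along its own axis, independently of the others, so that variables with different units and ranges can sit side by side; each record then becomes one 'string' across the axes. This is only an illustration of the idea, not a description of how Studio draws its plots, and the variable names and values are invented.

```python
def axis_positions(values):
    """Scale one variable to [0, 1] along its own axis (min-max scaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant variable carries no ordering; place it mid-axis.
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def signatures(table):
    """Turn a column-oriented table into one polyline ('string') per record."""
    names = list(table)
    scaled = {name: axis_positions(table[name]) for name in names}
    n_records = len(table[names[0]])
    return [[scaled[name][i] for name in names] for i in range(n_records)]


# Hypothetical sample sites; units differ from axis to axis.
table = {
    "elevation_m": [120.0, 450.0, 980.0],
    "rainfall_mm": [900.0, 650.0, 400.0],
    "soil_class": [1, 3, 2],  # ordinal code, not a measured quantity
}
strings = signatures(table)
for s in strings:
    print([round(p, 2) for p in s])
```

For a strictly ordinal variable such as the soil class above, only the order of the positions along the axis is meaningful; the spacing between them is an artifact of the coding.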
This last point provides a possible approach: can sets of variables be defined across all
sites, such that an ordinal comparison across sites makes sense? For example, one
variable might be soil moisture: it might be measured differently from place to place, and
indeed even concepts such as ‘dry’ or ‘wet’ might not be equivalent from site to site. But
where we should expect to see structures emerge is how these kinds of measures relate to
other variables. It is then not necessary that the variables be exactly the same, but that
they speak to the same (agreed and shared) notion. Let us call these more flexibly-
defined variables ‘data dimensions’, to make a clear distinction.
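One possible way to build such a data dimension, sketched here under assumptions of our own, is to replace each site's raw values by their rank within that site and then ask whether the dimension relates to other variables in a similar way at each site (a Spearman rank correlation). The measurements, coding schemes and variable names below are invented for illustration.

```python
def percentile_ranks(values):
    """Rank values within one site, scaled to [0, 1]; ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    top = len(values) - 1
    return [r / top if top else 0.5 for r in ranks]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(percentile_ranks(x), percentile_ranks(y))


# Site A records soil moisture as volumetric water content (%);
# site B records it as a field class from 1 (dry) to 4 (wet).
moisture_a, cover_a = [12.0, 30.0, 21.0, 45.0], [0.2, 0.5, 0.4, 0.9]
moisture_b, cover_b = [1, 3, 3, 4], [10, 40, 35, 80]

rho_a = spearman(moisture_a, cover_a)
rho_b = spearman(moisture_b, cover_b)
print(round(rho_a, 3), round(rho_b, 3))
```

The raw moisture values at the two sites cannot be compared directly, but the relationship between moisture and vegetation cover can be compared, since both sites support an ordinal reading of the same shared notion.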
Then, can sets of data dimensions be grouped according to the underlying themes or
research questions they help elucidate? If so, then we can hope to create visualizations
that will allow us to make comparisons. For example, by using pairs of parallel
coordinate displays, it is possible to make a number of important visual comparisons,
such as: are these two concepts ‘separable’ in the data? Is this place ‘similar’ to that
place? Has this place changed from time-1 to time-2? Does this place respond to an
event (such as a drought or economic recession) in the same way as another? See Figure
Figure 5. Comparing two concepts. The signatures show substantial differences in their
underlying data for certain dimensions. The rightmost figures show how these two
concepts cluster in a self-organizing neural network as a further measure of their likely
utility (Gahegan et al., 2001; Kohonen, 1997). The large colored dots (mainly green)
show closeness of the original data samples in feature space.
If we can place data from all sites into such a framework, where comparisons can be
readily made and are meaningful to make, then we will have provided the infrastructure
from which synthesis can begin, followed in turn by quantitative analysis (as described in
One possible way to impose order on these otherwise non-commensurate data is to start
from a set of high-level themes (such as those provided by the recent Delphi exercise)
that all sites agree to adopt. We can easily build parallel coordinate plots and other
linked exploratory tools that represent three to twenty variables that might together
describe a particular theme of interest. Each theme would then have a ‘form’ that could
be compared, both within and between sites.
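To give a feel for what comparing the 'form' of a theme could mean numerically, the sketch below summarizes a theme at each site as the median of each data dimension and then differences two such summaries. This is one simple choice among many, not a settled method; the dimension names and values are hypothetical, and the values are assumed to be already scaled to [0, 1] within each site.

```python
from statistics import median


def theme_signature(records, dimensions):
    """Median of each data dimension over one site's records."""
    return {d: median(r[d] for r in records) for d in dimensions}


def signature_difference(sig_a, sig_b):
    """Per-dimension differences and their mean absolute value."""
    diffs = {d: sig_a[d] - sig_b[d] for d in sig_a}
    mean_abs = sum(abs(v) for v in diffs.values()) / len(diffs)
    return diffs, mean_abs


# Hypothetical theme made of three data dimensions.
dimensions = ["soil_moisture", "vegetation_cover", "population_pressure"]
site_a = [
    {"soil_moisture": 0.2, "vegetation_cover": 0.3, "population_pressure": 0.8},
    {"soil_moisture": 0.4, "vegetation_cover": 0.5, "population_pressure": 0.6},
    {"soil_moisture": 0.6, "vegetation_cover": 0.4, "population_pressure": 0.7},
]
site_b = [
    {"soil_moisture": 0.7, "vegetation_cover": 0.8, "population_pressure": 0.2},
    {"soil_moisture": 0.5, "vegetation_cover": 0.6, "population_pressure": 0.4},
    {"soil_moisture": 0.9, "vegetation_cover": 0.7, "population_pressure": 0.3},
]

sig_a = theme_signature(site_a, dimensions)
sig_b = theme_signature(site_b, dimensions)
diffs, mean_abs = signature_difference(sig_a, sig_b)
print(sig_a)
print(sig_b)
print({d: round(v, 2) for d, v in diffs.items()}, round(mean_abs, 3))
```

The same two functions would serve for a within-site comparison if the two record sets were taken from the same site at two different times.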
What we need to do
(1) Find out what variables are (or can easily be) common across all sites, since these
might be best candidates to form data dimensions.
(2) Relate the themes gathered in the recent Delphi exercise to a set of ‘data
dimensions’ such that each dimension can be represented at each site by some
available variable, or calculated, estimated or otherwise invented as necessary.
(3) From the previous point, it is possible that a set of recurrent data dimensions
might emerge, that together represent a cross-section of all themes. If so, we
should use them as a kind of ‘site overview’ description.
(4) Build visualizations of the themes (and possibly the site overview) for our own
(5) Experiment with the robustness of the displays and concepts emerging from them
as we change the way that data dimensions are defined.
(6) Test our ability to perceive changes, by comparing themes and emerging concepts
across times and ultimately across other HERO sites.
(7) Iterate as necessary. (Figure XXX is a circle, after all!)
(8) Begin to construct analytical models from our findings.
(1) The broad-brush techniques required to make data comparable may themselves
hide important variations and structures. We cannot hope to address this problem
directly, though of course we can experiment with different ways to ‘make’ each
data dimension and see what effects that has on outcomes (as mentioned above).
(2) If we rely on too narrow a set of key data dimensions to define themes, then
themes will likely demonstrate a high degree of correlation that will adversely
affect later analysis. We should probably aim to keep themes as separate as
possible, which will be difficult.
(3) Some data may prove very difficult, perhaps impossible, to transform into a single
dimensional axis, even on an ordinal scale.
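The experiment mentioned in point (1) above, trying different ways to 'make' a data dimension, could start from something as simple as checking how far two candidate definitions agree on the ordering of places. The sketch below uses Kendall's tau-a for this; the two definitions and their values are invented for illustration.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Pairs tied in either sequence count as neither concordant nor discordant.
    """
    pairs = list(combinations(range(len(x)), 2))
    if not pairs:
        raise ValueError("need at least two observations")
    concordant = discordant = 0
    for i, j in pairs:
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / len(pairs)


# Two hypothetical ways of deriving a 'flood risk' dimension for five places:
# a modeled score, and an ordinal class assigned by a local expert.
definition_a = [0.10, 0.40, 0.35, 0.80, 0.60]
definition_b = [1, 2, 3, 5, 4]

tau = kendall_tau(definition_a, definition_b)
print(tau)
```

A value near 1 suggests that the choice between the two definitions matters little for ordinal comparison; a low or negative value suggests that outcomes built on that dimension are sensitive to how it is made.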
References
Baker, V. (1996). Hypotheses and geomorphological reasoning. In Rhoads, B.L. and Thorn, C.E.
(Eds.) The scientific nature of geomorphology. Wiley, New York, 57-86.
Fayyad, U., Piatetsky-Shapiro, G. and Smyth, P. (1996). From data mining to knowledge
discovery in databases. AI Magazine, Fall 1996, pp. 37-54.
Feist, G. J. and Gorman, M. E. (1998) The psychology of science: review and integration
of a nascent discipline. Review of general psychology, 2(1), 3-47.
Feyerabend, P. (1975) Against Method, London: Verso.
Gahegan, M., Wachowicz, M., Harrower, M. and Rhyne, T.-M. (2001). The integration of
geographic visualization with knowledge discovery in databases and geocomputation.
Cartography and Geographic Information Systems (special issue on the ICA research
Hanson, N. (1958). Patterns of discovery, Cambridge University Press, Cambridge.
Haslett, J., Bradley, R., Craig, P., Unwin, A. and Wills, G. (1991). Dynamic graphics for
exploring spatial data with application to locating global and local anomalies. The American
Statistician, Vol. 45, No. 3, pp. 234-242.
Hinneburg, A., Keim, D. and Wawryniuk, M. (1999). HD-Eye: Visual mining of high
dimensional data. IEEE Computer Graphics and Applications, September/October 1999, pp.
Jankowski, P., Andrienko, N., and Andrienko, G. (2001) “Map-centred exploratory approach to
multiple criteria spatial decision making”, International Journal of Geographical Information Science,
Kohonen, T. (1997). Self-organizing maps. Berlin, New York.
Kuhn, T. S. (1962) The structure of scientific revolutions. University of Chicago Press, Chicago.
Langley, P. (2000). The computational support of scientific discovery. Int. Journal of Human-
Computer Studies, 53, 393-410.
Peirce, C. S. (1878). "Deduction, induction and hypothesis." Popular Science Monthly, 13,
Popper, K. (1959). The logic of scientific discovery, Basic Books: New York, 479pp.
Shrager, J. (1990) “Commonsense Perception and the psychology of theory formation”, in
Shrager, J. and Langley, P. (Eds.) Computational Models of Scientific Discovery and Theory
Formation, San Mateo: Morgan Kaufmann, 437-470.
Shrager, J. and Langley, P. (1990) (Eds.) Computational Models of Scientific Discovery and
Theory Formation, San Mateo: Morgan Kaufmann.
Suppes, P. (1960) “A comparison of the meaning and uses of models in mathematics and the
empirical sciences”, in P. Suppes (Ed.), Studies in the Methodology and Foundations of
Science, Reidel, Dordrecht.
Suppes, P. (1962). Models of Data. In Nagel, E., Suppes, P. and Tarski, A. (Eds.), Logic,
methodology and the philosophy of science: proceedings of the 1960 International Congress,
Stanford University Press, Stanford, CA, 252-61.
Thagard, P. (1988) Computational philosophy of science, Cambridge, Mass.: MIT Press.
Valdez-Perez, R. E. (1999). Principles of human computer collaboration for knowledge discovery
in science. Artificial Intelligence, Vol. 107, No. 2, pp. 335-346.
Yuan, M., Buttenfield, B., Gahegan, M. and Miller, H. (2001) Geospatial Data Mining and
Knowledge Discovery. A UCGIS White Paper on Emergent Research Themes. URL: http://
Zimmerman, C. (2000). The development of scientific reasoning skills. Developmental Review,
Gahegan, M., Takatsuka, M. and Dai, X. (2001). An exploration into the definition,
operationalization and evaluation of geographical categories. In Sixth International
Conference on GeoComputation, Brisbane, Australia, Sep. 2001, CD-ROM.
MacEachren, A.M., Hardisty, F., Gahegan, M., Wheeler, M., Dai, X., Guo, D. and Takatsuka, M.
(2001). Supporting visual integration and analysis of geospatially-referenced statistics
through web-deployable, cross-platform tools, Proceedings, dg.o.2001, National Conference
for Digital Government Research, Los Angeles, CA, May 21-23, pp. 17-24.