Finding concepts for organizing HERO data to enable within ...



Finding concepts for organizing HERO data to enable within-site and between-site comparisons to be made (Draft)

Mark Gahegan, 10/20/01

Introduction

I thought it might be useful to circulate some preliminary thoughts on gathering and using variables that could form the basis for understanding and representing some of the concepts of interest. Such variables might provide a means for tracking the state of a study site through time, for comparing sites, and for predicting, via induction, the trajectories of less-studied places.

Obviously, each team has specific data interests and expertise. The variables gathered are likely to reflect these, and so will differ from place to place in collection, meaning and scale, and will therefore not be readily comparable. Also, at present, there is no top-down imposed list of data that each site is required to gather. The first task, then, needs to be the identification of all data held by all sites, to see where exact matches are possible (perhaps variables such as height, eco-region, and many population variables are directly commensurate, whereas others, such as soil loss, flood risk and economic viability, may not be).

Imposing order on such a problem is obviously difficult, but as Sir Francis Bacon said: “Truth emerges more readily from error than from confusion.” By imposing structure on these data it should become apparent whether the structure aids our understanding or hinders it, and eventually what to do to improve such structures so they better serve our interests.

Exploration and the Process of Science

To begin searching for useful structure, we use exploratory visualization and knowledge discovery tools (Yuan et al., 2001; Valdés-Pérez, 1999; MacEachren et al., 1999). If we can construct useful concepts, i.e. ones that have meaning across sites, then they could form the basis on which a quantitative analysis is later constructed. Figure 1 shows an overview of the scientific process as conceptualized here, starting with exploration and synthesis, and progressing via analysis to evaluation and presentation (Feyerabend, 1975; Hanson, 1958; Kuhn, 1962; Popper, 1959; Feist and Gorman, 1998; Zimmerman, 2000; Langley, 2000; Shrager and Langley, 1990; Thagard, 1988; Fayyad et al., 1996). The roles played by different forms of inference are also shown (Peirce, 1878; Suppes, 1962; Baker, 1996).
GeoVISTA Studio will provide the functionality for exploratory analysis.

[Figure 1 diagram. Stages: Exploration, Synthesis, Analysis, Evaluation, Presentation. Inference labels: abduction, induction, deduction. Other labels: data, map, concept, theory, model-based map, explanation.]

Figure 1. An overview of the scientific process as a number of stages, each one predicated on the outcomes of the previous stage, by which meaning is constructed, used and shared.

Some Exploratory Tools

At this point, it might be useful to switch perspectives and approach the problem from the opposite end, to give you an idea of where we are aiming: what kinds of exploratory analysis might we use, what might we learn from them, and how might they help us overcome the data heterogeneity problems outlined above?

Figure 2. A parallel coordinate plot; one possible visualization tool for exploring high dimensional data.
Figure 2 shows a parallel coordinate plot (from Studio) displaying 10 separate biophysical variables (related to environmental analysis), colored according to possible landcover classes. The collected data for over 1700 sample sites are shown as a series of ‘strings’, with each string representing one site. Clustering in the display suggests commonalities between samples that might be further examined as a basis for imposing high-level structures: the categories and concepts that many forms of quantification require a priori. Figure 3 similarly shows 13 demographic variables defined for each state in the USA, and Figure 4 shows three different views onto the same underlying data.

Figure 3. Socio-demographic data; the selected string (for California, which is an outlier in most respects!) is shown in green.

Figure 4. Three views onto the same dataset.

A few properties of displays such as these are worth pointing out:

(1) Other views onto the data, such as those provided by maps and spreadsheets, can be used together with visualization tools. The views can be linked such that, for example, selected items appear highlighted in all views (MacEachren et al., 2001).
(2) The selected axis (a single variable) provides a color palette that is applied to all strings. The user can change the associated classification scheme, the axis selected, the drawing order and many other visual properties.

(3) Data can be sorted according to correlation with the selected axis, or by principal components.

(4) The axes themselves provide a visual indicator of the spread of values for each variable.

For data to be visually comparable, they do not need to share the same units and dynamic range, nor be represented on the interval or ratio scale. In other words, variables can be gathered in a variety of ways, so long as an ordinal comparison of their values makes sense. The plot is really a device for exploring how neighboring axes relate to each other, to form a ‘signature’.

This last point provides a possible approach: can sets of variables be defined across all sites, such that an ordinal comparison across sites makes sense? For example, one variable might be soil moisture: it might be measured differently from place to place, and indeed even concepts such as ‘dry’ or ‘wet’ might not be equivalent from site to site. But where we should expect to see structures emerge is in how these kinds of measures relate to other variables. It is then not necessary that the variables be exactly the same, only that they speak to the same (agreed and shared) notion. Let us call these more flexibly defined variables ‘data dimensions’, to make a clear distinction.

Then, can sets of data dimensions be grouped according to the underlying themes or research questions they help elucidate? If so, then we can hope to create visualizations that will allow us to make comparisons. For example, by using pairs of parallel coordinate displays, it is possible to make a number of important visual comparisons, such as: are these two concepts ‘separable’ in the data? Is this place ‘similar’ to that place? Has this place changed from time-1 to time-2? Does this place respond to an event (such as a drought or economic recession) in the same way as another? See Figure 5.
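The idea of ordinal ‘data dimensions’ can be made concrete with a small sketch. The Python below is illustrative only and is not how GeoVISTA Studio works internally: the function names, the variable names and the sample values are all hypothetical. It converts each variable to ranks scaled to the range 0 to 1, so that units and dynamic range no longer matter, and then orders the axes by the strength of their rank correlation with a selected axis (the Pearson correlation of the ranks, i.e. a Spearman-style measure), in the spirit of property (3) above.

```python
from statistics import mean


def ordinal_ranks(values):
    """Return ranks scaled to [0, 1]; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2  # mean 0-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    top = len(values) - 1
    return [r / top if top else 0.0 for r in ranks]


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0


def order_axes(dimensions, selected):
    """Rank-transform every dimension, then sort the other axes by the
    absolute rank correlation with the selected axis (strongest first)."""
    ranked = {name: ordinal_ranks(vals) for name, vals in dimensions.items()}
    ref = ranked[selected]
    others = [name for name in ranked if name != selected]
    others.sort(key=lambda n: abs(pearson(ranked[n], ref)), reverse=True)
    return [selected] + others, ranked


# Hypothetical site: four samples, variables in unrelated units.
site = {
    "soil_moisture": [12.0, 30.5, 18.2, 25.0],  # percent
    "rainfall": [300, 900, 500, 700],           # mm per year
    "pop_density": [50, 10, 80, 20],            # people per km^2
    "elevation": [100, 120, 400, 90],           # metres
}
axis_order, ranked = order_axes(site, "soil_moisture")
print(axis_order)
```

Because only the ordering of values within each variable is used, two sites that measure soil moisture in different ways could still be placed on the same axis, provided the teams agree that the ordering expresses the same notion.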
Figure 5. Comparing two concepts. The signatures show substantial differences in their underlying data for certain dimensions. The rightmost figures show how these two concepts cluster in a self-organizing neural network as a further measure of their likely utility (Gahegan et al., 2001; Kohonen, 1997). The large colored dots (mainly green) show closeness of the original data samples in feature space.

If we can place data from all sites into such a framework, where comparisons can be readily made and are meaningful to make, then we will have provided the infrastructure from which synthesis can begin, followed in turn by quantitative analysis (as described in Figure 1).

One possible way to impose order on these otherwise non-commensurate data is to start from a set of high-level themes (such as those provided by the recent Delphi exercise) that all sites agree to adopt. We can easily build parallel coordinate plots and other linked exploratory tools that represent three to twenty variables that might together describe a particular theme of interest. Each theme would then have a ‘form’ that could be compared, both within and between sites.

What we need to do

(1) Find out what variables are (or can easily be) common across all sites, since these might be the best candidates to form data dimensions.

(2) Relate the themes gathered in the recent Delphi exercise to a set of ‘data dimensions’ such that each dimension can be represented at each site by some available variable, or calculated, estimated or otherwise invented as necessary.

(3) From the previous point, it is possible that a set of recurrent data dimensions might emerge that together represent a cross-section of all themes. If so, we should use them as a kind of ‘site overview’ description.

(4) Build visualizations of the themes (and possibly the site overview) for our own study area.

(5) Experiment with the robustness of the displays, and of the concepts emerging from them, as we change the way that data dimensions are defined.

(6) Test our ability to perceive changes, by comparing themes and emerging concepts across times and ultimately across other HERO sites.

(7) Iterate as necessary. (Figure XXX is a circle, after all!)

(8) Begin to construct analytical models from our findings.

Dangers

(1) The broad-brush techniques required to make data comparable may themselves hide important variations and structures. We cannot hope to address this problem directly, though of course we can experiment with different ways to ‘make’ each data dimension and see what effects that has on outcomes (as mentioned above).
(2) If we rely on too narrow a set of key data dimensions to define themes, then themes will likely demonstrate a high degree of correlation that will adversely affect later analysis. We should probably aim to keep themes as separate as possible, which will be difficult.

(3) Some data may prove very difficult, perhaps impossible, to transform into a single dimensional axis, even on an ordinal scale.

References

Baker, V. (1996). Hypotheses and geomorphological reasoning. In Rhoads, B.L. and Thorn, C.E. (Eds.), The scientific nature of geomorphology. Wiley, New York, 57-86.

Fayyad, U., Piatetsky-Shapiro, G. and Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, Fall 1996, pp. 37-54.

Feist, G. J. and Gorman, M. E. (1998). The psychology of science: review and integration of a nascent discipline. Review of General Psychology, 2(1), 3-47.

Feyerabend, P. (1975). Against Method. London: Verso.

Gahegan, M., Wachowicz, M., Harrower, M. and Rhyne, T.-M. (2001). The integration of geographic visualization with knowledge discovery in databases and geocomputation. Cartography and Geographic Information Systems (special issue on the ICA research agenda).

Hanson, N. (1958). Patterns of discovery. Cambridge University Press, Cambridge.

Haslett, J., Bradley, R., Craig, P., Unwin, A. and Wills, G. (1991). Dynamic graphics for exploring spatial data with application to locating global and local anomalies. The American Statistician, Vol. 45, No. 3, pp. 234-242.

Hinneburg, A., Keim, D. and Wawryniuk, M. (1999). HD-Eye: Visual mining of high dimensional data. IEEE Computer Graphics and Applications, September/October 1999, pp. 22-31.

Jankowski, P., Andrienko, N. and Andrienko, G. (2001). Map-centred exploratory approach to multiple criteria spatial decision making. International Journal of Geographical Information Science, 15(2), 101-127.

Kohonen, T. (1997). Self-organizing maps. Berlin, New York.

Kuhn, T. S. (1962). The structure of scientific revolutions. University of Chicago Press, Chicago.

Langley, P. (2000). The computational support of scientific discovery. Int. Journal of Human-Computer Studies, 53, 393-410.

Peirce, C. S. (1878). Deduction, induction and hypothesis. Popular Science Monthly, 13, 470-482.

Popper, K. (1959). The logic of scientific discovery. Basic Books, New York, 479pp.

Shrager, J. (1990). Commonsense perception and the psychology of theory formation. In Shrager, J. and Langley, P. (Eds.), Computational Models of Scientific Discovery and Theory Formation. San Mateo: Morgan Kaufmann, 437-470.

Shrager, J. and Langley, P. (Eds.) (1990). Computational Models of Scientific Discovery and Theory Formation. San Mateo: Morgan Kaufmann.

Suppes, P. (1960). A comparison of the meaning and uses of models in mathematics and the empirical sciences. In P. Suppes (Ed.), Studies in the Methodology and Foundations of Science. Reidel, Dordrecht.

Suppes, P. (1962). Models of data. In Nagel, E., Suppes, P. and Tarski, A. (Eds.), Logic, methodology and the philosophy of science: proceedings of the 1960 International Congress. Stanford University Press, Stanford, CA, 252-61.

Thagard, P. (1988). Computational philosophy of science. Cambridge, Mass.: MIT Press.

Valdés-Pérez, R. E. (1999). Principles of human computer collaboration for knowledge discovery in science. Artificial Intelligence, Vol. 107, No. 2, pp. 335-346.

Yuan, M., Buttenfield, B., Gahegan, M. and Miller, H. (2001). Geospatial Data Mining and Knowledge Discovery. A UCGIS White Paper on Emergent Research Themes. URL: http://

Zimmerman, C. (2000). The development of scientific reasoning skills. Developmental Review, 20, 99-149.

Gahegan, M., Takatsuka, M. and Dai, X. (2001). An exploration into the definition, operationalization and evaluation of geographical categories. In Sixth International Conference on GeoComputation, Brisbane, Australia, Sep. 2001, CD-ROM.

MacEachren, A.M., Hardisty, F., Gahegan, M., Wheeler, M., Dai, X., Guo, D. and Takatsuka, M. (2001). Supporting visual integration and analysis of geospatially-referenced statistics through web-deployable, cross-platform tools. Proceedings, dg.o.2001, National Conference for Digital Government Research, Los Angeles, CA, May 21-23, pp. 17-24.