
# FuzzyDataMining.doc

## on May 10, 2010

## FuzzyDataMining.doc Document Transcript

NSF Topic 4(c): MATHEMATICAL SCIENCES: Statistical Methods
NSF Sol. 97-64
Proposal Title: “Fuzzy Data Mining”
PIN: SFS-97-22

B. Cover Page

NSF Topic 4(c) MATHEMATICAL SCIENCES: Statistical Methods

“FUZZY DATA MINING”

Submitted to: Solicitation 97-64 (SBIR Program), National Science Foundation, PPU, 4201 Wilson Blvd, Room P60, Arlington VA 22230, 703/306-1391
C. Project Summary

SUMMARY

With the proliferation of data, data mining tools are becoming available to meet the market demand for ways to find useful information within that data. One drawback of data mining, particularly data mining of spatial data, is the difficulty of representing vastly different data values and inferring missing data. This is especially evident in data mining applications that seek relationships between biological and environmental parameters. Current data mining approaches that use neural networks, genetic algorithms, or statistical techniques do not inherently allow for such common data inadequacies. A methodology is needed that can properly represent and process data with a large amount of uncertainty.

Scientific Fishery Systems, Inc. (SciFish) proposes the development of a fuzzy data mining methodology that uses fuzzy set theory in two key steps of the data mining process. First, fuzzy membership functions are used to represent each data attribute. This allows the data mining practitioner to properly represent each parameter, defining the ranges for low, medium, high, and so on. Second, fuzzy set operations are used during the data mining process, providing different fuzzy correlations that can then be examined to reveal strong trends that traditional correlation techniques might have missed.

COMMERCIAL POTENTIAL

The commercial potential of the proposed fuzzy data mining approach will depend on SciFish’s ability to convince GIS and data mining users that incorporating fuzzy techniques will let them extract more information from their data than they currently can. The best way to make this happen is through a successful demonstration of fuzzy data mining in an application of significant interest to a large community. One such area is the fisheries, where the interactions and relationships between various species and their environment are largely unknown. From this foundation, it will then be possible to extend such applications into other areas, such as oil exploration, forest management, wildlife management, retail site exploration, and local zoning and planning.
D. Identification and Significance of the Problem or Opportunity

With the proliferation of data, data mining tools are becoming available to meet the market demand for ways to find useful information within that data. Data mining is an automated search for new and valuable information in a set of data. The ultimate objective of data mining is knowledge discovery: data mining methodology extracts hidden predictive information from large databases.

The Problem. One drawback of data mining, particularly data mining of spatial data, is the difficulty of representing vastly different data values and inferring missing data. This is especially evident in data mining applications that seek relationships between biological and environmental parameters. As an example, data mining can be a valuable tool if applied to the fisheries. With fisheries data, however, there is a tremendous difference in value ranges, spatial extent, temporal extent, and data validity. Current data mining approaches that use neural networks, genetic algorithms, or statistical techniques do not inherently allow for such common data inadequacies. A methodology is needed that can properly represent and process data with a large amount of uncertainty.

The Opportunity. Scientific Fishery Systems, Inc. (SciFish) proposes the development of a fuzzy data mining methodology that uses fuzzy set theory in two key steps of the data mining process. First, fuzzy membership functions are used to represent each data attribute. This allows the data mining practitioner to properly represent each parameter, defining the ranges for low, medium, high, and so on. Second, fuzzy set operations are used during the data mining process, providing different fuzzy correlations that can then be examined to reveal strong trends that traditional correlation techniques might have missed.

As an example, it is quite possible that a high concentration of young fish is strongly correlated with high water temperatures. Such a result would be immediately available using the proposed technique. Using existing techniques, the same result would not be revealed, because high correlations would be biased toward older fish and higher temperatures, the upper end of both value ranges. The entire fuzzy data mining approach is outlined in Figure 1.

The Benefits. The proposed fuzzy data mining approach will allow the practitioner to partition the parameter space into a set of membership functions that are germane to the task. A large Walleye Pollock has a very different length and weight than a large Pacific Halibut. The proposed approach allows those different ranges to be compared equitably.

In addition to applying fuzzy set technology to the data mining process, the proposed approach emphasizes the exploration of spatial data sets. Although geographic information systems (GIS) are intended to provide analyses of spatial data, such analysis is almost entirely application specific, intended to answer questions such as: Where is the watershed? How much area is covered by trees? Where is the best spot to look for oil? The proposed fuzzy data mining approach will be a significant new tool in that arsenal, providing answers to a whole new set of questions, such as: What parameters have the greatest impact on young fish? What is the relationship between depth and fish size? What other species are most strongly correlated with Walleye Pollock?

Prior Experience. SciFish is an innovative technology company with a proven track record of taking concepts into working field prototypes, and prototypes into the marketplace.
Current fisheries-related products include a broadband sonar fish identification system, a broadband sonar temperature profiler, and a fisheries geographic information system entitled Fisherman’s Associate that integrates several data sources to help fishers optimize their operations. This last product is currently being sold commercially. The sonar fish identification system will begin manufacturing and sales in late 1998. The temperature profiler recently completed Phase I development. All of this technology is the result of SBIR-funded projects. Although the proposed fuzzy data mining product is not specifically a fisheries-related application, SciFish will be using a fisheries data set to develop the approach. In addition, SciFish’s prior experience in developing a software product provides this project with valuable insights that can enhance the overall probability of commercial success.

[Figure 1. Outline of Fuzzy Data Mining Approach. Box labels: Data Base, Parameter Extraction, Fuzzy Representation, Spatial Representation, Fuzzy Correlation, Analysis, Report.]

During Phase I, SciFish will develop a fuzzy data mining software product that can be applied to a myriad of spatial problems. To accomplish this goal, SciFish will develop the fuzzy data mining methodology through its application to the fisheries. Several software product specifications will be created for different commercialization opportunities. A detailed market analysis will be conducted with SciFish funding. A final report will be produced that describes the details of each stage of this development process. During Phase II, SciFish will produce at least one of the software products and will extend the fuzzy data mining methodology from local spatial analysis to global and spatiotemporal analysis.

The Commercial Potential. The commercial potential of the proposed fuzzy data mining approach will depend on SciFish’s ability to convince GIS and data mining users that incorporating fuzzy techniques will let them extract more information from their data than they currently can. The best way to make this happen is through a successful demonstration of fuzzy data mining in an application of significant interest to a large community. One such area is the fisheries, where the interactions and relationships between various species and their environment are largely unknown. From this foundation, it will then be possible to extend such applications into other areas, such as oil exploration, forest management, wildlife management, retail site exploration, and local zoning and planning.
E. Background and Technical Approach

The following three sections provide background (§E.1), describe the technical approach (§E.2), and review related research in the proposed area (§E.3).

E.1 Background

The following background sections lay the groundwork for the Phase I Work Plan that follows. Three areas are reviewed. First (§E.1.1), a set of ten steps for data mining is outlined. Next (§E.1.2), the key environmental factors that influence fishes are reviewed. Finally (§E.1.3), the motivation for using the Walleye Pollock as a test case during product development is provided.

E.1.1 The Ten Steps of Data Mining

A recent PC AI article described a set of 10 steps for data mining. These are summarized here to provide an overview of the current data mining methodology. In the following sections, the steps that will be modified are steps 7 and 8, which deal with model construction and validating the findings.

1. Identify the Objective. Clearly define the intent of the analysis.
2. Select the Data. Select the data available for achieving the goal.
3. Prepare the Data. Determine which attributes and parameters within the selected data should be used for the analysis, and format the selected parameters.
4. Audit the Data. Evaluate the resulting data to determine whether the data from the various sources have the same level of confidence, range of values, time extent, and spatial extent. Discard all parameters and attributes that are deemed insufficient.
5. Select the Tools. Decide which tool is best for meeting the objective. The emphasis of the proposed approach is to use a fuzzy systems approach for those data elements with widely varying ranges in value, time, and space.
6. Format the Solution. Determine the format of the solution. With the fuzzy systems approach, this step includes the creation of fuzzy membership functions for each of the parameters and attributes. For the application presented herein as an example, the format of the data will consist of fuzzy membership values defined for cells of a predefined resolution.
7. Construct the Model. Apply the selected data mining tool, in this instance a fuzzy correlation approach, to the formatted data. The result of the proposed fuzzy data mining approach will be the identification of strong correlations between variables.
8. Validate the Findings. Share the results of the data mining with the client. Determine whether the results are valid. Make corrections as needed and repeat step 7 if needed.
9. Deliver the Findings. Provide a report to the client summarizing the results.
10. Integrate the Solution. Apply the findings as appropriate.

E.1.2 Key Environmental Factors Affecting Fishes

Several environmental factors influence different aspects of a fish's life. Some environmental factors, such as current, affect the transportation of larvae, while others are related to food. The following sections outline many of the key environmental factors that affect fishes. With this information, it is then possible to determine which MTPE-derived data products can be used to measure these environmental factors.

Sea and Swell. Waves in the sea, generated by local and distant wind fields (wind waves and swell, respectively), are the most significant phenomena at sea affecting safety, comfort, fishing operations, and fish behavior and availability. Three different effects of waves on the sea below the surface are of concern to the fisheries:

1. Vertical mixing by wave action and turbulence caused by breaking waves. This wave mixing can deepen the surface mixed layer and sharpen the thermocline gradient. Furthermore, it can affect fish directly by making them seasick and inducing them to move deeper, where the orbital movement of water caused by waves is absent.
2. Waves cause currents (mass transport by waves) in addition to surface wind drag.
3. Breaking waves cause wave noise, which can affect fish behavior.

Surface Currents. Fish sense currents with the rheotactic organ located on the lateral line. Generally, fish head into the current even when they let themselves be carried with it. The swimming speed of fish depends on their size and is affected by temperature, being slower at lower temperatures. Fish eggs, larvae, and small juveniles are carried with currents and dispersed by them. Japanese fisheries scientists as well as fishermen have long recognized that pelagic fish tend to aggregate at current boundaries, where good catches are made. [1] The reasons for this are considered to be threefold:

1. Food supply (micronekton) accumulates at current convergences;
2. The current boundary acts as an environmental boundary; and
3. Migrating fish dynamically aggregate at current boundaries.

Salinity and Basic Nutrients. The chemistry of sea water that might affect fish is little influenced by weather and by climatic changes. Two aspects should, however, be considered: salinity and basic nutrient salts such as phosphates and nitrates. These elements can be limiting factors in basic organic production (phytoplankton production) in the sea. Changes in salinity are small and usually indicative of advective changes and mixing. Among other chemical properties, changes in nutrient salts are indicative of productivity changes and eutrophication.

Light. Changes in light conditions (cloudiness, turbidity) might affect basic organic production as well as fish behavior. [2]

Sea Surface Temperature. Heat exchange between the atmosphere and the ocean takes place through the sea surface, so the sea surface temperature (SST) plays a major role in these processes. SST is the most frequently observed parameter in the sea, and it is also a good indicator of various processes that have occurred in the surface layer in the past.
At times, temperature itself might not be the factor directly affecting the fish, but it can serve as an indicator of other changes and conditions in the sea. Examples of indirect uses are the estimation of upwelling intensities and the computation of current and surface water type boundaries. Correlations between temperature and the behavior and occurrence of fish have been sought and found. An extensive summary of this subject is given by Laevastu & Hayes. [3]

Surface Pressure. The climatological mean surface pressure systems are created by the summation of synoptic surface processes, which depict the movement of the surface lows. These storm tracks can vary considerably from year to year in space and time. The consequences of these variations are manifold. First, they cause changes in the surface layers of the oceans, especially in currents and mixing, which in turn affect some components of the marine ecosystem. Second, they affect fishing operations.

Wind Speed and Direction. Monthly surface wind anomalies can be computed by deriving daily surface winds from the surface pressure distribution, forming long-term monthly means, and taking the difference between a given month's mean wind and the long-term mean for that month. These surface wind anomalies can have various effects on the ocean besides the creation of anomalies in surface currents. Cushing [4] described the Great Salinity Anomaly of the 1970’s as having been formed by the stress of northerly winds off East Greenland in winter during the 1960’s, after which it drifted across the North Atlantic for nearly twenty years. Upwellings along the coasts are also created by prevailing wind systems and thus are sensitive to local wind anomalies.

Water Type. The association of a stock of fish with a water type (mass) has been described by Seckel & Waldron. [5] Different water types have different plankton contents, both in abundance and in species dominance. It is possible that pelagic fish are associated with different plankton as food in these water types, or with different food abundance, and are advected with these preferred water types.

[1] Burbank, A. & Douglas, R. (1969). Fisheries forecasting systems - a study of the Japanese fisheries forecasting system. Final report for IR&D MJO 9843-39. TRW Systems, 99900-6865-RO-00, 55 pp.
[2] Laevastu, T. (1993). Marine Climate, Weather and Fisheries, Halstead Press, New York, 204 pp.
Large-scale changes in the types of water masses, caused by circulation changes, have been assumed to cause changes in fish distribution.

E.1.3 Fuzzy Data Mining the North Pacific Fisheries: A Test Case

Five themes emerged from a recent workshop entitled “Changing Oceans and Changing Fisheries: Environmental Data for Fisheries Research and Management”. [6] Two of those themes were: (1) to develop baseline time-series of the most important parameters related to the fisheries, and (2) to apply new environmental data technologies to fisheries problems. Other themes dealt with sharing information and demonstrating the applicability of the identified data sources. Each of these themes points toward a need for a tool that can identify key relationships between environmental parameters and fish stocks, as well as identify trends among those key parameters. Data mining can be an important asset here.

The North Pacific fisheries would serve as an excellent test case for the fuzzy data mining approach. The vastly different values of fish catch, environmental data, and geographic information pose a true challenge to traditional data mining approaches. Specifically, the Walleye Pollock will be examined using the fuzzy data mining approach described herein. There are several reasons for selecting this species:

1. Commercially Important. Walleye Pollock supports the largest single-species fishery in the world. Fished with various trawls from the Sea of Japan to the Gulf of Alaska, world-wide harvests averaged 5.6 million mt annually during the 1980s. It is fished commercially in the Pacific from the Bering Sea to Oregon. The largest harvests have come from the southeastern Bering Sea, averaging 975,000 mt annually during 1981-83, followed by the western Gulf of Alaska and the Aleutian Islands, averaging 270,000 mt combined. The annual catch off British Columbia averages about 2,000 mt. Although pollock was historically the focus of foreign fleets, by 1986 joint-venture and wholly domestic fisheries accounted for 75% of its harvest in the U.S. EEZ. In 1988, the entire pollock harvest in U.S. waters was 'Americanized', as no directed foreign fishing was permitted off Alaska: U.S. fishermen caught more than 1.4 million mt, valued at over \$200 million. The flesh is soft and is therefore marketed in processed form, domestically as fish sticks and overseas as animal feed. It is also marketed as fish meal and minced fish (surimi), and is often exported in such forms as artificial crab legs. The roe is also an important export.

[3] Laevastu, T. & Hayes, M. (1981). Fisheries Oceanography and Ecology, Fishing News Books, Oxford, 199 pp.
[4] Cushing, D. (1982). Climate and Fisheries, Academic Press, London, 373 pp.
[5] Seckel, G. & Waldron, K. (1960). Oceanography and the Hawaiian skipjack fishery, Pacific Fisherman, 58(2), 11-13.
[6] On-Line at http://upwell.pfeg.noaa.gov/workshop
2. Biologically Important. Walleye Pollock is an extremely important prey species for larger fishes, birds, and mammals, and an important predator on pelagic organisms.
3. Politically Important. Walleye Pollock are a highly migratory species, with annual routes that pass through U.S., Canadian, Russian, and international waters. The harvest of Pollock at different points along their annual path is becoming an issue of great concern to the nations involved.
4. Prior Research Available. Because of the tremendous commercial, biological, and political value of Walleye Pollock, there has been a great deal of research on various aspects of the species. As such, it makes an excellent candidate for validating the results of the fuzzy data mining approach. Excellent examples of such work include papers by Swartzman, Silverman, & Williamson. [7]
5. Data Availability. There is a tremendous amount of oceanographic, atmospheric, biological, and geographical information available for the North Pacific. This data will allow analysis of interactions between species, as well as between species and environment.

E.2 Technical Approach

The proposed fuzzy data mining approach will be developed, demonstrated, and evaluated using Walleye Pollock as the primary target. The objective of this analysis is to determine those environmental variables that are locally and globally correlated with Walleye Pollock. The core elements of the proposed fuzzy data mining methodology are as follows:

1. Fuzzy Membership Representation. Create fuzzy set membership functions for each of the parameters of interest.
2. Creation of Spatial Representations. Create a spatial layer for each fuzzy set membership function that defines the degree of membership for each cell in a predefined grid.
3. Fuzzy Correlation of Spatial Data. Using the spatial representation, perform fuzzy correlation analyses, both locally and globally, to determine which parameters are strongly correlated. Condense the results of this analysis into a set of highly correlated parameters, both in value and space.

[7] Swartzman, G., Silverman, E., & Williamson, N. (1995). Relating trends in walleye pollock (Theragra chalcogramma) abundance in the Bering Sea to environmental factors, Can. J. Aquat. Sci., Vol. 52, pp. 369-380. Pelletier, D. & Parma, A. (1994). Spatial distribution of Pacific Halibut (Hippoglossus stenolepis): An application of geostatistics to longline survey data, Can. J. Aquat. Sci., Vol. 51, pp. 1506-1518.
Table 1 Here (landscape view)

The first three steps are organized into separate tasks in the following sections. In addition, tasks for data collection, software product definition, market analysis, and technology transfer are included. The schedule for these tasks is provided in a later section (§G).

E.2.1 Task 1: Data Collection for the North Pacific

During Phase I, the data sources that are available for the North Pacific will be collected and validated. Data sources currently available for this region are listed in Table 1. SciFish has immediate access to all of these data sources. The data will be transferred from the various forms of storage to a single database of tables covering the years 1980 - 1996.

E.2.2 Task 2: Develop Fuzzy Representation Methodology

As an example, some of the parameters and their ranges for the North Pacific are shown in Table 2. These data attributes would be relevant to a fuzzy data mining objective of looking for fuzzy correlations between fish species and environmental parameters. As just this sample shows, the values can range from as small as -2 to as large as 1 million, and can also include textual data. Looking for correlations across such vastly different ranges in value is difficult, which illustrates one way in which the fuzzy data mining approach will be beneficial.

Table 2. Parameters and Attributes for North Pacific Walleye Pollock

| No. | Group | Parameter | Range |
|---|---|---|---|
| 1 | Fisheries Statistics | Length (m) | [0 - 2] |
| 2 | Fisheries Statistics | Weight (kg) | [0 - 200] |
| 3 | Fisheries Statistics | Catch Per Unit Effort (kg/hr) | [0 - 1,000,000] |
| 4 | Fisheries Statistics | Count | [0 - 100,000] |
| 5 | Sea Surface | Temperature (°C) | [-2 - 20] |
| 6 | Sea Surface | Pressure (mB) | [20 - 40] |
| 7 | Sea Surface | Current (km/hr) | [0 - 10] |
| 8 | Wind | Speed (km/hr) | [0 - 150] |
| 9 | Wind | Direction (deg) | [0 - 360] |
| 10 | Bathymetry | Depth (m) | [0 - 8,000] |
| 11 | Seasons | Dates (months) | [Jan - Dec] |

Using a fuzzy representation would result in a collection of fuzzy sets for each parameter. One possible set of fuzzy representations is shown in Figure 2 for Walleye Pollock length, Walleye Pollock age, and water depth. The length representation reflects that the majority of Pollock are around 30 cm in length, with large Pollock exceeding 50 cm and small Pollock below 10 cm. The number, value ranges, and shape of the fuzzy sets can be fine-tuned to the parameter. Fuzzy sets will be derived for each parameter. The resulting fuzzy sets will be used for the fuzzy spatial correlations that are to follow. The range for each fuzzy set is shown in Table 3, with the corresponding fuzzy membership functions shown in Figure 2.

[Figure 2. Illustration of Fuzzy Set Representations for Length of Walleye Pollock. Three panels, each with membership from 0.0 to 1.0 on the vertical axis: Walleye Pollock Length (small, medium, large; ticks at 10 cm, 30 cm, 50 cm), Walleye Pollock Age (young, middle aged, old; ticks at 1 yr, 3 yrs, 5 yrs), and Bottom Depth (very shallow, shallow, deep, very deep; ticks at 50 m, 250 m, 500 m, 5,000 m).]

Table 3. Illustrating Fuzzy Membership Function Ranges for Some Parameters

| No. | Parameter | Fuzzy Set | Range |
|---|---|---|---|
| 1 | Walleye Pollock Length (cm) | Short | [0 - 25] |
| 2 | Walleye Pollock Length (cm) | Medium | [15 - 45] |
| 3 | Walleye Pollock Length (cm) | Long | [> 35] |
| 4 | Walleye Pollock Age (yr) | Young | [0 - 2.5] |
| 5 | Walleye Pollock Age (yr) | Middle-Aged | [1.5 - 4.5] |
| 6 | Walleye Pollock Age (yr) | Old | [> 3.5] |
| 7 | Bottom Depth (m) | Very Shallow | [0 - 200] |
| 8 | Bottom Depth (m) | Shallow | [75 - 400] |
| 9 | Bottom Depth (m) | Deep | [300 - 4,000] |
| 10 | Bottom Depth (m) | Very Deep | [> 750] |

Next, the data must be organized into a grid. The spatial resolution of the grid will depend on the application. For the North Pacific example developed here, let us assume the grid cell size is 0.5 degree by 0.5 degree, with a spatial range from 159 W to 164 W Longitude and 52 N to 55 N Latitude, resulting in the grid shown in Figure 3. Each grid cell’s fuzzy membership value is illustrated by the intensity of its color, where darker colors represent lower membership values. Mathematically, this is expressed as

$\mathrm{cell}_{ijk} = \mu_k(x)$

where $i$ is the longitude cell number, $j$ is the latitude cell number, $k$ is the membership function number, $\mathrm{cell}_{ijk}$ is the cell value, and $x$ is the value being applied to the cell. If a grid cell does not have a value, that cell is given zero membership. If multiple values reside in the same grid cell, either a centroid weighting will be used to determine the membership value, similar to centroid defuzzification in fuzzy control applications, [8] or the median of all the values falling within the grid cell will be calculated prior to applying the fuzzy membership function.

[Figure 3. Illustration of Grid Cells Filled with Fuzzy Memberships. Grid spanning 52N to 55N and 164W to 160W.]

A layer of grid cells, like that shown in Figure 3, will be constructed for each fuzzy membership function that is created. An illustration of the result of this processing can be seen in Figure 4. During Phase I, the fuzzy representation approach described herein will be applied to each parameter of interest, resulting in a set of fuzzy membership layers that will be used in the next step of processing.
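The representation and gridding steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the proposal's implementation: the proposal does not specify the shapes of the membership functions, so the sketch assumes piecewise-linear shoulders and a triangle whose supports come from Table 3 and whose peaks come from the Figure 2 tick marks (10, 30, and 50 cm). All function and variable names are hypothetical, and only the median option for multiple values in a cell is shown, not the centroid weighting.

```python
import math
from statistics import median

def left_shoulder(full, zero):
    """Membership 1 up to `full`, falling linearly to 0 at `zero`."""
    def mu(x):
        if x <= full:
            return 1.0
        if x >= zero:
            return 0.0
        return (zero - x) / (zero - full)
    return mu

def triangle(lo, peak, hi):
    """Membership 0 at `lo` and `hi`, rising linearly to 1 at `peak`."""
    def mu(x):
        if x <= lo or x >= hi:
            return 0.0
        if x <= peak:
            return (x - lo) / (peak - lo)
        return (hi - x) / (hi - peak)
    return mu

def right_shoulder(zero, full):
    """Membership 0 up to `zero`, rising linearly to 1 at `full`."""
    def mu(x):
        if x <= zero:
            return 0.0
        if x >= full:
            return 1.0
        return (x - zero) / (full - zero)
    return mu

# Assumed shapes: supports from Table 3, peaks from the Figure 2 ticks (cm).
LENGTH_SETS = {
    "short": left_shoulder(10.0, 25.0),
    "medium": triangle(15.0, 30.0, 45.0),
    "long": right_shoulder(35.0, 50.0),
}

# Example grid from the text: 164 W to 159 W, 52 N to 55 N, 0.5-degree cells.
LON_WEST, LAT_SOUTH, CELL_DEG = 164.0, 52.0, 0.5
N_LON, N_LAT = 10, 6

def cell_index(lon_w, lat_n):
    """Map (degrees W, degrees N) to cell (i, j), or None if off the grid."""
    i = math.floor((LON_WEST - lon_w) / CELL_DEG)
    j = math.floor((lat_n - LAT_SOUTH) / CELL_DEG)
    if 0 <= i < N_LON and 0 <= j < N_LAT:
        return i, j
    return None

def membership_layer(observations, mu):
    """Build one layer: cell_ijk = mu_k(x), with x the median value in the cell."""
    buckets = {}
    for lon_w, lat_n, value in observations:
        idx = cell_index(lon_w, lat_n)
        if idx is not None:
            buckets.setdefault(idx, []).append(value)
    # Cells without any value keep zero membership.
    layer = [[0.0] * N_LAT for _ in range(N_LON)]
    for (i, j), values in buckets.items():
        layer[i][j] = mu(median(values))
    return layer

def length_layers(observations):
    """One membership layer per length fuzzy set."""
    return {name: membership_layer(observations, mu)
            for name, mu in LENGTH_SETS.items()}
```

Calling `length_layers` with a list of `(longitude W, latitude N, length in cm)` tuples yields one 10 by 6 layer per fuzzy set, which is the input the correlation step expects.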
A complete description of the parameters and their corresponding fuzzy set membership functions will be included in the Phase I final report.

E.2.3 Task 3: Develop Fuzzy Correlation Methodology

There are at least three types of fuzzy correlation that can be considered: local correlations (as shown in Figure 4), global correlations, and spatiotemporal correlations. The processing requirements differ dramatically among them: local correlations have a relatively modest computational cost, while global and spatiotemporal correlations are far more expensive. The local correlations will be the subject of the Phase I effort, with the development of the global and spatiotemporal correlations left for Phase II.

[8] Eberhart, R., Simpson, P. & Dobbins, R. (1996). Computational Intelligence PC Tools, Academic Press, Boston, MA.
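The cost gap between local and global correlations can be made concrete with a rough count of matrix entries. The sketch below is an illustration only. It assumes the local approach stores one N x N matrix per grid cell and the global approach stores a single (N x M) x (N x M) matrix, as the proposal describes for each, with N fuzzy layers and M grid cells. The example figures (N = 10 from the ten fuzzy sets in Table 3, M = 60 from the 0.5-degree example grid) and the function names are assumptions made for illustration.

```python
def local_entries(n_layers, n_cells):
    """Entries stored by the local approach: one N x N matrix per cell."""
    return n_cells * n_layers ** 2

def global_entries(n_layers, n_cells):
    """Entries in one global (N*M) x (N*M) correlation matrix."""
    return (n_layers * n_cells) ** 2

# Example grid: 5 degrees of longitude by 3 degrees of latitude at 0.5 degree.
n_cells = int(5 / 0.5) * int(3 / 0.5)   # 10 * 6 = 60 cells
n_layers = 10                           # ten fuzzy sets listed in Table 3

local = local_entries(n_layers, n_cells)     # 6,000 entries
glob = global_entries(n_layers, n_cells)     # 360,000 entries
ratio = glob // local                        # equals n_cells
```

Under these assumptions the global matrix holds M times as many entries as all the local matrices combined, so the gap widens in direct proportion to the number of grid cells.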
Local Fuzzy Correlations. Local fuzzy correlation is illustrated in Figure 4, where correlations are performed for the same grid cell across all fuzzy layers. The appropriate fuzzy correlation function is a subject of research during Phase I. Immediate candidates are the fuzzy union (often the MAX function) and the fuzzy intersection (often the MIN function). Other possible fuzzy correlation functions include taking the product.

[Figure 4. Illustration of Local Fuzzy Spatial Correlations. Stacked layers of 1/2-degree cells (164W to 160W, 50N to 53N) for Short Length Pollock, Medium Length Pollock, Long Length Pollock, and so on through Very Deep, with the local fuzzy correlation taken across all layers for the same spatial cell.]

The fuzzy correlations will be performed across all combinations of fuzzy layers. Assuming there is a total of N fuzzy layers, the corresponding N x N fuzzy correlation matrix, L, for $\mathrm{cell}_{ij}$ would be constructed using the expression

$$L^{ij}_{kl} = \mathrm{fuzzy\_op}\left(\mathrm{cell}_{ijk}, \mathrm{cell}_{ijl}\right)$$

where $L^{ij}_{kl}$ represents the correlation between layer $k$ and layer $l$ for $\mathrm{cell}_{ij}$, $\mathrm{cell}_{ijk}$ is the fuzzy layer value for $\mathrm{cell}_{ij}$ in layer $k$, and fuzzy_op is the fuzzy correlation operator. At the end of this operation, each cell position will have a corresponding fuzzy correlation matrix. An example of this matrix is shown in Figure 5 using the fuzzy layers shown in Figure 4.
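A minimal Python sketch of the local correlation expression above follows. It is an illustration, not the proposal's implementation: it assumes each layer is stored as a nested list indexed `[i][j]`, and it offers the three candidate operators named in the text (MIN, MAX, and product) as interchangeable functions. All names are hypothetical.

```python
def fuzzy_intersection(a, b):
    """Fuzzy intersection, taken here as MIN."""
    return min(a, b)

def fuzzy_union(a, b):
    """Fuzzy union, taken here as MAX."""
    return max(a, b)

def fuzzy_product(a, b):
    """Product of the two membership values."""
    return a * b

def local_correlation(layers, i, j, fuzzy_op=fuzzy_intersection):
    """N x N matrix for one cell: L[k][l] = fuzzy_op(cell_ijk, cell_ijl)."""
    values = [layer[i][j] for layer in layers]
    return [[fuzzy_op(vk, vl) for vl in values] for vk in values]

def all_local_correlations(layers, fuzzy_op=fuzzy_intersection):
    """Map every cell position (i, j) to its local correlation matrix."""
    n_i, n_j = len(layers[0]), len(layers[0][0])
    return {(i, j): local_correlation(layers, i, j, fuzzy_op)
            for i in range(n_i) for j in range(n_j)}

# Three toy layers on a 1 x 2 grid (membership values are made up).
toy_layers = [
    [[0.2, 0.0]],   # layer 0, e.g. short length
    [[0.8, 0.5]],   # layer 1, e.g. medium length
    [[0.6, 1.0]],   # layer 2, e.g. very deep
]
by_min = all_local_correlations(toy_layers, fuzzy_intersection)
by_max = all_local_correlations(toy_layers, fuzzy_union)
```

Because all three operators are commutative, each matrix is symmetric, and with MIN or MAX the diagonal simply repeats the cell's own membership values, so only the off-diagonal entries carry information about pairs of layers.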
| | Short Length Pollock | Medium Length Pollock | Long Length Pollock | ... | Very Deep |
|---|---|---|---|---|---|
| Short Length Pollock | $L_{11}$ | $L_{12}$ | $L_{13}$ | ... | $L_{1N}$ |
| Medium Length Pollock | $L_{21}$ | $L_{22}$ | $L_{23}$ | ... | $L_{2N}$ |
| Long Length Pollock | $L_{31}$ | $L_{32}$ | $L_{33}$ | ... | $L_{3N}$ |
| ... | ... | ... | ... | $L_{kl}$ | ... |
| Very Deep | $L_{N1}$ | $L_{N2}$ | $L_{N3}$ | ... | $L_{NN}$ |

Figure 5. Illustration of a Fuzzy Correlation Matrix Produced for Each Grid Cell

Data analysis of the resulting fuzzy correlation matrices will require searching for the largest correlation values within each matrix. These values can be listed for each matrix, and then trends can be sought. Alternatively, further reduction in data can be achieved by computing the median across all matrix locations to produce a summary report for the entire area under examination.

During Phase I, fuzzy correlation matrices will be created using both Union and Intersection operations. The resulting fuzzy correlation matrices will be analyzed to determine which correlations are strongest and where they occur. The results of this fuzzy correlation analysis will be compared against the existing literature to determine whether known relationships were revealed and whether new relationships were captured. The results of this comparison will be reported in the Phase I Final Report.

Global Fuzzy Correlations. The local fuzzy correlations extend directly to a global correlation of data elements. In the global correlation, the correlation matrix is extended to include correlations between all cells of all layers, resulting in an (N x M) x (N x M) fuzzy correlation matrix, where N is the number of layers and M is the total number of grid cells. Clearly this is a tremendous computational extension, but it is thought that it might reveal spatial correlations beyond those found in the local approach. During Phase II, this approach will be examined.

Spatiotemporal Fuzzy Correlations. An underlying assumption is that the fuzzy data mining approach is applied to a single snapshot in time. It is likely that many other relevant correlations can be found by considering the change of fuzzy membership values from one time slice to the next. This approach would require the formation of local fuzzy correlation matrices for each increment of time, followed by a second analysis that would track correlations over time. This approach will also be considered during Phase II.

E.2.4 Task 4: Specifying the Fuzzy Data Mining Software Product

It is proposed that the development of the fuzzy data mining approach, using a North Pacific fish stock as a test case, will define the methodology to the point that a software product specification can be developed. It is felt that the best opportunity for a software product of this type is as a third-party add-in to an existing Geographic Information System (GIS), spreadsheet, or database package. The software produced during Phase I will be written in Visual Basic 5.0 to allow maximum flexibility in developing an add-in package. Products will be explored for each of the existing software product types:

1. GIS Add-In. Both ArcView and MapInfo have a large market share in GIS products. A plug-in fuzzy data mining module will be explored with each company.