Agri Data Mining/Warehousing:



Agri Data Mining/Warehousing: Innovative Tools for Analysis of Integrated Agricultural & Meteorological Data

Ahsan Abdullah
National University of Computers & Emerging Sciences, Islamabad, Pakistan

Stephen Brobst
Teradata Division, NCR, Dayton, OH, USA

Ijaz Pervaiz
Directorate of Pest Warning & Quality Control of Pesticides, Punjab

Muhammad Umer, Azhar Nisar
National University of Computers & Emerging Sciences, Islamabad, Pakistan

Abstract

Every year significant yield loss occurs in Pakistan due to pest attacks on cash crops. Although pesticides have been used, the desired correlation between yield and pesticide usage has not been observed. Different government departments and agencies have the task of monitoring dynamic agricultural situations all around Pakistan, but the data collected has never been integrated and standardized to give a complete picture and answer several pressing questions. In this paper we discuss a pilot project implementation of a data warehouse for the analysis of the above-mentioned data in an integrated fashion. Indigenously developed Data Mining and OLAP (Online Analytical Processing) tools were used to analyze the data.

1. Introduction

Every year significant yield loss occurs in Pakistan due to pest attacks on cash crops. To monitor and ultimately counter these attacks, different government departments and agencies have the task of keeping an eye on dynamic agricultural situations all around Pakistan. As a result, thousands of digital and non-digital data files are generated from hundreds of pest-scouting and yield surveys, agro-meteorological data collection and other such undertakings. The collected data, due to its multivariate nature and disparate origins, has never been integrated and thus does not provide a complete picture. The lack of data integration (and standardization) contributes to an under-utilization of historical data, and inevitably results in a limited ability to perform even simple analysis.

Data such as pest scouting, pesticide usage and agro-meteorological recordings holds huge analytical potential in at least two major respects: firstly, short-term forecasting and day-to-day tactical handling of issues related to crop and pest management, and secondly, long-term forecasting, strategic planning and policy making, where a whole history of events is required to be synthesized. The latter is the point where we, as a nation, seriously lack today, even though we possess the basic input for such an undertaking, i.e. historic monitoring data covering more than two decades. Integration of this data in a single standardized format, and a set of automated tools that may complement the task of exploratory analysis of this massive data, is what we do not possess.

In this paper we discuss a pilot project implementation of a data warehouse for the analysis of the above-mentioned data in an integrated fashion. Such a data warehouse can best support a new breed of analytical tools including Data Mining and Online Analytical Processing (OLAP; details in Section 3). No such work has ever been undertaken in the agriculture sector of Pakistan [2]. Data warehouses are quite popular in industries such as telecommunication, retail sale, manufacturing and scientific research, but an application in the agriculture sector is a novel idea, and we have been able to show its strength through an actual implementation.

The rest of this paper is organized as follows. Section 2 gives the motivation behind this work. Section 3 gives the necessary technical background and an explanation of the techniques and methods we have employed. Sections 4 and 5 discuss the construction of the Agri Data Warehouse, while Section 6 describes various analytical operations performed over it. Section 7 gives a roundup of related work. Lessons learnt and conclusions are summarized in Sections 8 and 9.

2. Motivation

Pakistan's agricultural sector contributes more than 24% of GDP, employs about 44% of the labor force, and
directly sustains 75% of the population and accounts for 30% of exports [29]. More importantly, it accounts for about 60% of total foreign exchange earnings [31]. Textile exports comprise more than 60% of Pakistan's total exports, thus the success or failure of the cotton crop has a direct bearing on textile exports. Cotton production is the inherent comparative advantage of the textile sector of Pakistan [30].

Punjab is the main producer of agricultural commodities in Pakistan, producing 83% of the cotton, 72% of the wheat, 95% of the rice, 56% of the sugarcane and 35% of the maize [32]. For this reason, the Punjab is commonly known as the bread basket of Pakistan [33]. After the crash of the cotton crop in 1983, the Government of Punjab decided to enhance crop monitoring facilities and established the Directorate of Pest Warning and Quality Control of Pesticides (DPWQCP) in 1984.

Pest scouting is a systematic field sampling process that provides field-specific information on pest pressure and crop injury [21]. The motivation for this work arose from the need for better insight into the dynamics of crop growth, using the data being generated constantly by the pest scouting program implemented by DPWQCP in the province of Punjab.

DPWQCP has over the years perfected the activity of pest scouting, such that the data it generates gives true and unbiased coverage of the whole acreage. Scouts move from field to field in their area of jurisdiction, collect statistics on the pest situation from various fixed and random sampling points, and keep a check on pest population dynamics. Data regarding farmer demographics (acreage, variety sown, date of sowing etc.) and pesticide usage history (amount of pesticide used, spray dates etc.) is an essential part of the data collected by the Directorate. Table 1 lists the attributes recorded at each point.

Sr. no   Attribute
1        Date of Visit
2        Farmer Name and Address
3        Acreage
4        Variety(ies) Sown
5        Plant Population
6        Pest Population
7        Predator Population
8        Pesticide Spray Dates
9        Pesticide(s) Used

Table 1: Attributes Recorded by DPWQCP Surveyors

This data is collected weekly from a hundred-odd points in every district, which comes to nearly 3400 recordings per week for the whole province.

This exercise has been in place with more or less the same vigor and shape for the last two decades. The amount of data accumulated until now is enormous both horizontally (scores of factors for which recordings are made) and vertically (a coarse estimate stands at more than 3 million records). Moreover, for a detailed analysis, scouting data needs to be integrated with other data elements such as crop yield and prices over the years and, most importantly, weather data for the same period.

Counting on the human brain alone to synthesize the information contained in this data is not only impractical but unjust too. Our motivation is exactly this: complementing knowledge discovery in this massive data using modern information management tools, specifically Data Warehouse, Data Mining and OLAP.

3. Technological Background

Taking into account the diverse audience that this paper caters to, we present a brief introduction to some technological concepts from the IT domain that need to be understood for an appropriate appreciation of our work.

3.1. Data Warehouse

A data warehouse is an integrated and time-varying collection of data primarily used for the support of management decision-making [14, 9]. A data warehouse often integrates heterogeneous data from multiple and distributed information sources and contains historical and aggregated data.

A major misconception is assuming a "data warehouse to be a warehouse of data". It is true that a data warehouse normally contains a large amount of data, but that is not a requirement for building one. The major requirements attributed to a data warehouse are its ability to complement the process of analytical querying through a simple schema, efficient implementation and optimized performance.

3.2. Data Mining

Data mining is the exploration and analysis of extremely large quantities of multivariate data by
automatic means in order to discover meaningful patterns and rules [4].

Data mining is regarded as a knowledge discovery process, i.e. no prior assumption or hypothesis is made about the data to be proved or disproved through mining. Furthermore, it operates as undirected knowledge discovery, where one attempts to find patterns or similarities among groups of records without the use of a particular target field or collection of predefined classes.

3.3. OLAP

Aggregate queries (such as sum, average etc.) are used frequently in decision support applications, where the basic goal is to collect information from detail tables [12]. OLAP tools capitalize on these aggregate queries by generating and then storing answers to all possible queries in advance, and they provide a powerful and intuitive Graphical User Interface (GUI). The most popular among OLAP features are drill-down and roll-up aggregates [11].

OLAP tools are powerful and fast tools for reporting on data. Fundamentally, though, they depend upon human intelligence coupled with domain expertise for the extraction of valuable information [4].

3.4. The Connection between Data Mining, OLAP & Data Warehousing

Data mining algorithms require data as input, which may not necessarily come from a data warehouse. Still, a data warehouse simplifies the job of the data miner [4]. In a scenario where data mining is to be performed on data coming in huge volumes, from multiple sources, with inconsistent representation and with an inherent time disparity (monthly vs. weekly vs. daily data), a single and consistent source of truth may be the only solution, hence the data warehouse.

OLAP and data mining are complementary; both are important parts of exploiting data. OLAP is a presentation tool that can enable manual knowledge discovery, while data mining is an automated knowledge discovery process. Quite often the OLAP tool is used to explore in more detail the results/findings generated by data mining.

4. Pest-Pesticide-Metrology Data Warehouse (PPM-DWH): The Pilot Project

A pilot project strategy is highly recommended in data warehouse construction [23]. As the construction of a full-sized data warehouse requires a huge amount of capital, effort and resources, it must be attempted only after a thorough understanding of the domain and a valid proof of concept. A small-scale project in this regard serves many purposes, such as (i) providing a valid proof of concept, (ii) establishing blueprint processes for the later full-blown project, (iii) identifying problem areas and (iv) revealing true data demographics. Agri data warehousing is unexplored territory, requiring knowledge of a multitude of domains, hence we deem building a small-scale version in the first iteration to be the best strategy. A detailed proposal for the full-blown project has already been submitted to the Pakistan Agriculture Research Council (PARC), Islamabad, for funding under the Agriculture Linkages Program, and was under review by PARC when this paper went to press. The full-blown project will cover all 34 districts of Punjab, 10 years of pest scouting data and meteorological data of 53 elements recorded at seven observatories.

For the pilot project, we limited ourselves to the pest scouting data of the cotton crop recorded in District Multan (Figure 1) during the cotton growing seasons 2000-01 and 2001-02. This data was recorded weekly from more than 100 fields of District Multan, as per the normal practice of DPWQCP. Meteorological data for the same dates was arranged through a different source. The following sections give the details of our Pest-Pesticide-Metrology Data Warehouse (PPM-DWH), issues in its construction and a discussion of the results that it generated after the processes of mining and iterative analyses.

Figure 1: Area under study for pilot project - three Tehsils of District Multan (Multan, Shujabad, Jalalpur Pirwala)
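The roll-up and drill-down aggregates described in Section 3.3 can be illustrated with a minimal Python sketch. It precomputes sums at several levels of a location hierarchy, in the spirit of OLAP tools that store aggregate answers in advance. The records, the field layout and the `build_cube` helper are hypothetical and chosen only for illustration; they are not taken from PPM-DWH or from any particular OLAP tool.

```python
from collections import defaultdict

# Hypothetical detail records: (district, tehsil, week, pest_count).
# The values are made up for illustration only.
RECORDS = [
    ("Multan", "Multan", 1, 4),
    ("Multan", "Multan", 2, 6),
    ("Multan", "Shujabad", 1, 3),
    ("Multan", "Shujabad", 2, 5),
    ("Multan", "Jalalpur Pirwala", 1, 2),
]

def build_cube(records):
    """Precompute pest-count sums at three levels of detail."""
    by_district = defaultdict(int)      # most aggregated level (rolled up)
    by_tehsil = defaultdict(int)        # one level of detail down
    by_tehsil_week = defaultdict(int)   # most detailed aggregate
    for district, tehsil, week, count in records:
        by_district[district] += count
        by_tehsil[(district, tehsil)] += count
        by_tehsil_week[(district, tehsil, week)] += count
    return by_district, by_tehsil, by_tehsil_week

by_district, by_tehsil, by_tehsil_week = build_cube(RECORDS)

# Roll up: a single total per district.
print(by_district["Multan"])
# Drill down: the same measure split by tehsil, then by tehsil and week.
print(by_tehsil[("Multan", "Shujabad")])
print(by_tehsil_week[("Multan", "Shujabad", 2)])
```

Because every level is stored ahead of time, answering a roll-up or drill-down request in this sketch is a dictionary lookup rather than a scan of the detail records, which is the trade-off (storage for speed) that the paragraph on OLAP describes.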
5. Development Life Cycle for Pilot Project

Figure 2 gives a panoramic view of the life cycle that we undertook for this study. The overall process can be divided into four major phases: Requirement Analysis, Input Data Acquisition, Implementation and Analytical Operations. We look at each of these in the subsequent sections. Readers interested in the technical details of the implementation are referred to [1b].

Figure 2: The Overall Process

5.1. Input Data Acquisition and ETL Phase

5.1.1. Pest Scouting Data

Field-level acquisition of pest scouting data can be summarized as follows: trained surveyors from DPWQCP visit a point and note the recordings against the attributes given in Table 1. These readings are later typed on a standard sheet and stored in hard copy.

For the PPM-DWH implementation these sheets were digitized by data entry operators. During the data acquisition phase, standard procedures of ETL (Extract, Transform, Load) were applied. ETL is a primary step in any data warehouse implementation and concerns acquiring, integrating, cleansing and standardizing data from the source(s).

Data cleansing and standardization is probably the largest part of an ETL exercise. In our case, the major data cleansing issues in the scouting data arose from its processing at three levels by different individuals: firstly, recording by the surveyors at the field level, secondly, typing into data sheets at the DPWQCP office, and lastly, digitization by data entry operators. To maintain full compliance between the data sheets and their digitized copies, a double-check strategy was adopted. Two individuals entered every row of data separately. In case of conflict, the data sheet was consulted for final reconciliation.

Variations in farmer names were removed through certain heuristics so that records of the same farmer could be identified and grouped together. Similar variations found in pesticide names and cotton variety names were removed by comparing them to the actual names, using the standard pesticide list of the National Agriculture Research Center [15] and the standard crop varieties list of the Cotton Research Institute [10].

5.1.2. Meteorological Data

Digitally formatted daily weather recordings from over seventy observatories throughout Pakistan, covering more than the last three decades, are available with the Pakistan Meteorological Department (PMD). This data has an inherent weakness, as a very large area is represented by one meteorological recording.
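As a brief illustration of the two cleansing steps described in Section 5.1.1 (double entry with reconciliation, and standardizing names against a reference list), consider the following Python sketch. The paper does not state which heuristics were used, so the fuzzy matching via the standard library's `difflib.get_close_matches` is an assumption, as are the `reconcile` and `standardize` helpers, the cutoff value and the sample rows.

```python
import difflib

# Hypothetical reference list; the paper uses the NARC pesticide list [15].
STANDARD_PESTICIDES = ["Imidacloprid", "Endosulfan", "Cypermethrin"]

def reconcile(entry_a, entry_b):
    """Compare two independent keyings of the same sheet, row by row.

    Returns the agreed rows and the indices of conflicting rows, which
    would then be checked by hand against the paper data sheet.
    """
    agreed, conflicts = [], []
    for i, (row_a, row_b) in enumerate(zip(entry_a, entry_b)):
        if row_a == row_b:
            agreed.append(row_a)
        else:
            conflicts.append(i)
    return agreed, conflicts

def standardize(name, standard_names, cutoff=0.8):
    """Map a possibly misspelled name to the closest standard name.

    Returns None when nothing is similar enough, so the value can be
    reviewed manually instead of being silently changed.
    """
    lowered = {s.lower(): s for s in standard_names}
    match = difflib.get_close_matches(
        name.strip().lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[match[0]] if match else None

# Invented sample rows: (visit date, pesticide name) as keyed by two operators.
operator_1 = [("2001-07-02", "Endosulfan"), ("2001-07-09", "Imidaclopird")]
operator_2 = [("2001-07-02", "Endosulfan"), ("2001-07-09", "Imidacloprid")]

agreed, conflicts = reconcile(operator_1, operator_2)
print(conflicts)
print(standardize("imidaclopird", STANDARD_PESTICIDES))
print(standardize("unknown product", STANDARD_PESTICIDES))
```

Returning `None` for weak matches is a deliberate choice in this sketch: a name that cannot be matched confidently is better flagged for a person to resolve than mapped to the wrong pesticide.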
It is a common observation that meteorological elements vary within a range of even a few kilometers, hence using the same figure for thousands of kilometers brings with it a strong element of estimation.

As no local readings for the past were available in our case, we had to use meteorological recordings taken at district level. The second weakness discovered was the cost of meteorological data as charged by PMD, which is too high for a research group to pay, literally running into millions of rupees. As a last resort, daily weather estimates for the years 2001 and 2002, including minimum and maximum temperatures, humidity and outlook, were downloaded from the website of the newspaper daily Dawn.

5.2. Implementation Phase

5.2.1. Dimensional Model

Dimensional modeling is a technique used to model databases for analytical applications. It yields a simpler design and hence efficient retrievals, a prime requirement for large data warehouses [6, 18, 19]. The primary output of a dimensional model is the identification of the Dimensions and Facts present in a given scenario.

A Dimension is a collection of conceptually related entities with an inherent hierarchy. For example, a Time dimension would consist of entities representing temporal intervals, such as day, week, month and year. Facts, on the other hand, are the metrics associated with (and reported for) dimensions. For example, minimum and maximum temperature are facts recorded for the Time dimension.

Though we omit the details of the dimensional model here, a brief introduction to the dimensions involved is as follows.

• Location dimension: It corresponds to the administrative hierarchy of a province. It starts with the Division, which is divided into districts. Each district consists of three to four Tehsils, which are further divided into a number of Markaz.

• Farmer dimension: For a detailed study, farms are categorized into different sets depending upon acreage [Pak 01], such as 0.5 to under 1.0 acres, 1.0 to under 2.0 acres, 2.0 to under 3.0 acres, 3.0 to under 5.0 acres and so on.

• Insect dimension: Insects surveyed in the scouting process are grouped on an entomological basis. In the case of the cotton crop, for example, the groups are Bollworm Complex (Pink Bollworm, Spotted Bollworm, Army Bollworm), Sucking pests (Whitefly, Jassid etc.), Viruses (CLCV) and Predators.

• Pesticide dimension: Numerous pesticide solutions are used by farmers depending upon infestation, price and availability. These solutions may differ in their trade names but belong to some generic chemical class and cure group, as listed in [15].

5.2.2. Schema Design

Looking at the whole scouting process, the number of data elements, the type and frequency of data generation and the type of questions likely to be faced by the final implementation, we propose a modified star schema for PPM-DWH (not shown here). It has been persistently reported in the literature that a star schema best supports DSS due to its simplified nature [19, 20]. Due to its technical nature, we omit the details of the schema here.

5.2.3. Coding and Hardware/Software Platform

PPM-DWH was implemented on a commercially available server with dual Intel 950 MHz Xeon processors and 1 GB of RAM. The total internal hard disk capacity of the server amounts to 36 GB, while the external RAID controller supports 8 additional SCSI disks of 18 GB each.

5.2.4. Data Validation

The quality and validity of the underlying data is the key to meaningful and authentic analyses. After ensuring a satisfactory level of data quality, it is extremely important to somehow judge the validity of the data that a data warehouse holds. We applied some very natural checks for this purpose.

The relationship between pesticide spraying and the predator (insects that destroy pests) population is a fact that has been discussed by many agriculturists [James02], [Relyea01]. The predator population decreases with the first pesticide spray and then continues to decrease. We recovered this fact in the same form from our data as well, as can be seen in Figure 3.
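The "natural check" of Section 5.2.4, that the predator population drops with the first pesticide spray and keeps falling afterwards, can be sketched in Python as below. This is only one way such a check might be phrased; the visit records, spray dates and both helper functions are hypothetical and are not the procedure actually run on PPM-DWH.

```python
from datetime import date
from statistics import mean

# Hypothetical weekly scouting visits for one field: (visit date, predator count).
# The numbers are invented to illustrate the check, not taken from PPM-DWH.
VISITS = [
    (date(2001, 7, 2), 9),
    (date(2001, 7, 9), 8),
    (date(2001, 7, 16), 5),
    (date(2001, 7, 23), 3),
    (date(2001, 7, 30), 2),
]
SPRAY_DATES = [date(2001, 7, 12), date(2001, 7, 26)]

def predator_means_around_first_spray(visits, spray_dates):
    """Return (mean count before, mean count after) the first recorded spray."""
    first_spray = min(spray_dates)
    before = [count for day, count in visits if day < first_spray]
    after = [count for day, count in visits if day >= first_spray]
    return mean(before), mean(after)

def is_non_increasing(values):
    """True when each value is no larger than the one before it."""
    return all(later <= earlier for earlier, later in zip(values, values[1:]))

before, after = predator_means_around_first_spray(VISITS, SPRAY_DATES)
post_spray_counts = [
    count for day, count in sorted(VISITS) if day >= min(SPRAY_DATES)
]

print(before, after)
print(is_non_increasing(post_spray_counts))
```

A field whose records fail both tests (no drop after the first spray, or counts rising afterwards) would be a candidate for a closer look at data entry, which is the sense in which a known agronomic fact doubles as a validity check.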
Figure 3: Y2001 pesticide usage vs. predators

6. Analytical Operations

Once the underlying structure (the data warehouse) is in place, there is no end to the exploration that one can perform. The task of knowledge discovery is probably limited only by the imagination and, to some extent, the domain expertise of the explorer, provided that any such undertaking is supported by appropriate automated or semi-automated tools. We applied such tools to the pilot version of our PPM data warehouse and the results are more than promising.

As described above, given a standardized and integrated data set coupled with an efficient storage structure and data exploration tools, any aspect of the data can be investigated, resulting in numerous findings. Hence PPM-DWH and the tools involved are by no means limited to the findings we report here. We give these findings with the sole aim of demonstrating the potential of the framework we proposed (and partially implemented).

The following subsections describe the analytical operations that we performed on PPM-DWH (data mining, OLAP and statistical analysis) and the results these operations yielded.

6.1. Data Mining Operations

As described in Section 3.2, data mining is a process of automated discovery. In the wake of the data explosion in almost every domain, data mining techniques and algorithms have received huge acclaim. Numerous data mining techniques, frameworks and algorithms have been reported in the literature and are currently used successfully in a multitude of domains.

The data mining technique that we applied to PPM-DWH is called Recursive Noise Removal (RNR), first proposed and used on gene expression data by [1]. Due to the non-technical nature of this paper, we omit the details of this method, but a brief overview may be necessary in order to appreciate the results. Interested readers are referred to [1]. Readers interested in the details of the application of RNR to agriculture data are referred to [1a].

6.1.1. Method

Clustering is a data mining technique that assigns data elements to various classes/clusters. A cluster is a collection of data elements that are highly similar to one another within the same cluster, but only weakly similar to the data elements in other clusters.

Clustering falls into two main classes: (i) unsupervised, when the size, number and/or demographics of the clusters are not known in advance, and (ii) supervised, popularly known as classification, which applies to the situation where cluster (or class) properties are known a priori and unclassified data elements are assigned to known clusters. An obvious edge that unsupervised clustering has is its data-driven and domain-independent nature. An unsupervised clustering technique fit for identifying patterns in medical images may be equally applicable to analyzing seismic data.

The RNR algorithm is an unsupervised clustering method that can run on any database table containing alphanumeric values. Domain expertise is required not for extracting the clusters but for understanding the implications of the clusters that RNR has identified. RNR works by repeatedly/recursively applying a crossing minimization technique and dropping "noise" until a desired level of cluster quality is obtained. We demonstrate its use through the following experiment.

6.1.2. Data Mining Experiment

The initiative behind this experiment was a common farmer question: which pesticide should be bought, and when should it be bought? We modeled this question as finding the relationship between pest population and meteorological data elements, and finding (if possible) the temperature and humidity thresholds at which the population of a certain pest booms.

Figure 4 gives the random input. The pre-processing method for preparing this input is omitted here due to its mathematical and technical nature; see [1] for details. Two distinct clusters were identified by the RNR heuristic (Figure 5).
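To make the idea of unsupervised clustering from Section 6.1.1 concrete, here is a minimal Python sketch that separates records into two groups without any predefined classes. It is not RNR: the details of RNR are given only in [1], so plain k-means with k = 2 is swapped in as a stand-in. The data points, the seeding rule and the `two_means` helper are all assumptions made for illustration.

```python
from math import dist  # available from Python 3.8

# Hypothetical per-visit pest counts (jassid, thrips, spotted bollworm).
# Invented for illustration; they are not records from PPM-DWH.
POINTS = [
    (0.1, 2.0, 0.8), (0.2, 2.4, 1.0), (0.0, 2.1, 0.9),
    (0.7, 4.0, 2.5), (0.6, 4.3, 2.3), (0.8, 4.1, 2.6),
]

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def two_means(points, iterations=20):
    """Plain k-means with k=2 (a stand-in, not RNR), seeded with the
    first and last point so that the result is deterministic."""
    centers = [points[0], points[-1]]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [min((0, 1), key=lambda c: dist(p, centers[c])) for p in points]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c in (0, 1):
            members = [p for p, label in zip(points, labels) if label == c]
            new_centers.append(centroid(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

labels, centers = two_means(POINTS)
print(labels)
print(centers)
```

As with RNR in the paper, the algorithm only proposes the grouping; deciding that one group corresponds to "low" and the other to "high" pest populations is left to someone with domain knowledge.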
Matching the two RNR clusters (Figure 5) with the detailed data showed a clear grouping on the basis of pest populations, where cluster 1 has low pest populations and cluster 2 has quite high pest populations. Average values are shown in Table 2.

Figure 4: Input similarity matrix

Figure 5: Clusters identified by RNR (clusters C1 and C2)

Cluster   Jassid   Thrips   SBW*
C1        0.1      2.22     0.88
C2        0.65     4.11     2.44
ETL**     1        8-10     3

Table 2: Cluster Demographics
* Spotted Boll Worm
** Economic Threshold Level

These clusters provide us with a good starting point, and next we try to establish certain rules on the basis of this clustering. For each record, we look back in time for seven days and note the meteorological recordings of minimum and maximum temperature and humidity. Figure 6 shows the resulting graph of average values. On the basis of these graphs we establish two simple rules:

• If Temp > 29 AND Humidity > 70 then pest incidence will be high.

• If Temp < 27 AND Humidity < 67 then pest incidence will be low.

Figure 6: Meteorological recordings for the two clusters (humidity, Tmax and Tmin against days before visit, for high and low pest populations)

Checking these rules against the data (376 matching records retrieved out of 2,000+) shows some very exciting results, as shown in Figure 7.

Figure 7: Experiment 2 findings (% incidence over and under threshold for Thrips, Jassid and SBW)

This experiment presents a very credible case that common farmer questions can be modeled through this data mining technique, and that answers can be given, based on evidence present in the data, before the pest attack occurs. The strength of this method lies in clustering evidence that is scattered in the data and hidden from the bare human mind.

6.2. OLAP and Statistical Operations

OLAP and statistical analysis operations are related in the sense that both are user-driven, unlike the data mining method described above, which is data-driven. An analyst has particular questions in mind, so exploration through both of these methods is performed with a certain bias towards answering those questions.

PPM-DWH contains integrated data and hence can be probed along any dimension. Generally an iterative analysis technique proves most beneficial. Iterative analysis starts with a broad question resulting in a large set of records. The analyst then capitalizes on these records and asks a more specific question. The process of iteratively building on the previous question continues until a result of significant importance is achieved.

We performed this analysis with various initiatives, and a few of the interesting results are reported below.

6.2.1. Working Behaviors at Field Level

Cultural practices are one way of controlling pests; we were interested in exploring this behavior of our farmers. Results of probing for sowing dates in Y2001 and Y2002 are shown in Figure 8.
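The iterative analysis described in Section 6.2 amounts to asking a broad question and then regrouping the same records along a different level of the Time dimension. A minimal Python sketch of two such successive questions about sowing dates follows. The sowing dates are invented for illustration, and the use of `collections.Counter` is an assumption about how one might prototype the probe, not a description of the OLAP tool used on PPM-DWH.

```python
from collections import Counter
from datetime import date

# Fixed English names so the output does not depend on the system locale.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical sowing dates, one per farm; invented for illustration only.
SOWING_DATES = [
    date(2001, 5, 20), date(2001, 5, 20), date(2001, 5, 25),
    date(2001, 5, 25), date(2001, 5, 25), date(2001, 6, 2),
    date(2001, 5, 31),
]

# Broad question: how many sowings fall on each calendar date?
per_date = Counter(SOWING_DATES)

# More specific question on the same records: how are the sowings
# distributed over the days of the week?
per_weekday = Counter(WEEKDAYS[d.weekday()] for d in SOWING_DATES)

print(per_date.most_common(1))
print(per_weekday)
```

Each answer suggests the next grouping to try, which is why the analyst's domain knowledge, rather than the tool, steers this kind of exploration.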
Note the surprising finding that most sowings occurred on 20 and 25 May and 2 June in both years (Figures 8(a) and 8(b)).

Figure 8(a): Sowing vs. day of year 2001 (number of sowings per day, 1 May to 30 June 2001)

Figure 8(b): Sowing vs. day of year 2002 (number of sowings per day, 1 May to 30 June 2002)

Further drilling down on a day-of-the-week basis, as shown in Figures 9 and 10, resulted in an even more surprising finding: the least number of sowings occurred on Thursdays in each year. Last but not least, except for 2002, the least number of pesticide sprayings also occurred on Thursdays. Thus Thursday is the day on which noticeably less work is performed than on other days.

Figure 9: Number of sowings against week days (panels: Y2001 and Y2002, sowings vs. weekdays)

This finding was later confirmed by agriculturalists: in the social setup of District Multan, Thursdays are usually associated with religious activities, such as visiting shrines, hence there is a tendency to do less work on this day.

Figure 10: Number of sprayings against week days (panels: Y2001 and Y2002, sprayings vs. weekdays)

7. Related Work

Data warehousing is very popular in domains such as telecommunication, retail sale, manufacturing and scientific research [23]. An agriculture data warehouse is a rather new concept with very few parallels. Probably the closest among these is the USDA-NASS (United States Department of Agriculture, National Agriculture Statistics Service) data warehouse. Established in 1997, the basic goal behind its construction was to standardize and integrate survey data generated by NASS [26]. Our work differs in principle from [26] in that (i) our data is generated by multiple sources and (ii) the goal behind our data warehouse is the construction of a foundation on which analytical exploration can take place.

Other than [26], the world has yet to see a full-blown agricultural data warehouse implementation, though
there have been a number of proposals in this regard, such as [25]. No such work has ever been undertaken in the agriculture sector of Pakistan [2].

Data mining applications have quite recently found their way into agricultural research, and a lot of activity can be seen in this area, such as [5, 10, 13, 24, 28]. In [26] details of the GIMMI project are provided, which aims at providing one-stop, integrated access to the assessment of pesticide leaching into soil and groundwater. Several IT tools including data mining are to be implemented as part of this project. In [3] a new approach for the acquisition and pre-processing of data for agricultural data mining is described.

In [7] simple numerical methods are used to establish the relationship between 10 soil characteristic variables and corn yield. In [28] remote sensing techniques are used in conjunction with AI neural networks to identify weeds in cornfields. In [22] data mining techniques on images are used to identify trash in ginned cotton. In [24] a case study approach is used to help understand how data mining could be used in the manufacturing of textiles using SAS. In [13] spatio-temporal knowledge discovery techniques are integrated into a Geo-Spatial Decision Support System (GDSS), using a combination of data mining techniques to find relationships between user-specified target episodes and other climatic events and to predict the target episodes.

8. Lessons

During the construction and subsequent utilization of the data warehouse, the following lessons proved extremely important.

(i) ETL of agricultural data is a big issue. There are no digitized operational databases, so one has to resort to data available in typed (or handwritten) sheets. Typing of these sheets is very expensive, slow and prone to errors.

(ii) Particular to the pest scouting data, farmer individualization is critical, as a farmer is visited a number of times by the extension people.

(iii) Scouting data includes pesticide names, which are complex and not easy to remember or pronounce, and require extra effort to learn and type correctly.

(iv) Not all pests are present all the time, and most of the time a second spray is not done (or not recorded), hence the tables are sparse. We had to split tables to decrease header size and table space.

(v) Unlike a traditional data warehouse, where the end users are decision makers, here the end users include the farmers as well, thus the decision-making goes all the way "down" to the extension level. This presents a challenge to the designer of the analytical operations, as the findings must be fairly simple to understand.

9. Conclusions

Analytical exploration of a vast amount of agricultural data can best be supported by an appropriate application of data warehousing and OLAP technologies. A data warehouse provides a flexible yet efficient and reliable storage structure for a vast amount of data, while OLAP techniques provide mechanisms for ad hoc and in-depth analysis of this data. Traditional analytical tools and database techniques may not succeed here due to their rigid nature. The techniques used in this work are equally applicable at any geographic location, provided that the related data is available. The paradigms are quite different from a traditional business application of a data warehouse.

10. References

[1] Abdullah, A., and Brobst, S., "Clustering by recursive noise removal", Proc. Atlantic Symposium on Computational Biology and Genome Informatics, USA, Sep. 2003.

[1a] Abdullah, A., Brobst S., Pervaiz I., Umer M., and Nisar A., "Learning Dynamics of Pesticide Abuse through Data Mining", to appear in proceedings of Australasian Workshop on Data Mining and Web Intelligence 2004 (AWDM&WI2004), Dunedin, New Zealand, January 2004.

[1b] Abdullah, A., Brobst S., and Umer M., "The Case for an Agri Data Warehouse: Enabling Analytical Exploration of Integrated Agricultural Data", to appear in proceedings of The IASTED International Conference on Databases and Applications (DBA 2004), Innsbruck, Austria, Feb. 2004.

[2] Ahmed, Mumtax & Joseph G. Nagy, "Private Investment in Agriculture Research: Pakistan", Economic Research Services, U.S. Department of Agriculture, January 2001.

[3] Avesani, P., E. Olivetti, and A. Susi, "Feeding Data Mining", IRST Technical Report #0207-01, Istituto Trentino di Cultura, Povo (Trento), Italy, July 2002.

[4] Berry, Michael J. A. & Gordon Linoff, "Data Mining Techniques", John Wiley and Sons Inc., 1997.
[5] Bertis, B., Walter L. Johnston et al., “Data Mining in U.S. Corn Fields”, Proceedings of the First SIAM International Conference on Data Mining, Fall 2001.

[6] Brobst, Stephen, “Perfect Dimensions”, Intelligent Enterprise, June 1999.

[7] Christensen, W. F., and Di Cook, “Data Mining Soil Characteristics Affecting Corn Yield”, 1998.

[8] S. Chaudhuri and U. Dayal, “An Overview of Data Warehousing and OLAP Technology”, ACM SIGMOD Record, 26:65-74, 1997.

[9] “Cotton Production Technology”, Cotton Research Institute, Faisalabad, Pakistan. _fbd.htm

[10] Cunningham, Sally Jo and Geoffrey Holmes, “Developing innovative applications in agriculture using data mining”, Department of Computer Science, University of Waikato, Hamilton, New Zealand, 2001.

[11] Gray, Jim, Surajit Chaudhuri, Adam Bosworth, et al., “Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals”, J. Data Mining and Knowledge Discovery, 1997.

[12] Gupta, A., Venky Harinarayan and Dallan Quass, “Aggregate-Query Processing in Data Warehousing Environments”, Proc. of 21st Conf. on Very Large Databases, Zurich, Switzerland, 1995.

[13] Harms, Sherri K., et al., “Data Mining in a Geospatial Decision Support System for Drought Risk Management”, U.S. Department of Agriculture, Risk Management Agency, Fall 2001.

[14] W. H. Inmon, “Building the Data Warehouse”, John Wiley & Sons, Chichester, second edition, 1996.

[15] Irshad, M., Ehsan-ul-Haq and Javed Iqbal, “Catalogue of Insecticides for Agricultural Pests of Pakistan”, Integrated Pest Management Institute, National Agriculture Research Center (NARC), Islamabad, 2001.

[16] James, David G., and Tanya S. Price, “Imidacloprid Boosts TSSM Egg Production”, Agriculture and Environment News, Issue No. 189, Washington State University, USA, July 2002.

[17] Johnston, Doug, “Data Mining for Site-specific Agriculture”, Illinois Council on Food and Agriculture Research (C-FAR), Illinois, January 2000.

[18] R. Kimball, “A Dimensional Modeling Manifesto”, DBMS and Internet Systems, August 1997.

[19] R. Kimball, L. Reeves, M. Ross, and W. Thornthwaite, “The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing and Deploying Data Warehouses”, John Wiley & Sons, Chichester, 1998.

[20] Levene, Mark and George Loizou, “Why is the Snowflake Schema a Good Data Warehouse Design?”, 1999.

[21] “Introduction to Crop Scouting”, Plant Protection Program, College of Agriculture, Food and Natural Resources, MU Extension, University of Missouri-Columbia, 2001.

[22] Nguyen, H. T., N. R. Prasad, V. Kreinovich, and H. Gassoumi, “Some Practical Applications of Soft Computing and Data Mining”, in: A. Kandel, H. Bunke, and M. Last (eds.), Data Mining and Computational Intelligence, Springer-Verlag, Berlin, pp. 273-307, 2001.

[23] Poe, Vidette, Patricia Klauer and Stephen Brobst, “Building a Data Warehouse for Decision Support”, 2nd Edition, Prentice Hall, 1998.

[24] Scherte, S. L., PhD dissertation, “Data Mining and Its Potential Use in Textiles: A Spinning Mill”, North Carolina State University, 2002.

[25] Sharma, S. D., Randhir Singh and Anil Rai, “Integrated National Agricultural Resources Information System (INARIS)”, Indian Agricultural Statistics Research Institute, New Delhi, 2000.

[26] Voss, H., et al., “Simulation, Visualization, and Decision Support in GIMMI”, 9th EC GI & GIS Workshop, ESDI Serving the User, A Coruña, Spain, June 2003.

[27] Yost, Mickey, and Jack Nealon, “Using a Dimensional Data Warehouse to Standardize Survey and Census Metadata”, National Agricultural Statistics Service, U.S. Department of Agriculture, Fall 1999.

[28] Yang, C.-C., S. O. Prasher and J.-A. Landry, “Use of artificial neural networks to recognize weeds in a corn field”, Journée d'information scientifique et technique en génie agroalimentaire, Saint-Hyacinthe QC, Canada, pp. 60-65, Mar. 1999.

[29] Government of Pakistan, “Economic Survey of Pakistan 2000”, Islamabad, Pakistan.

[30] “Cotton and Ginning”, ginning.pdf, viewed on 20 Sep. 2003.

[31] United Nations Food and Agriculture Organization, “Agricultural Sector in Pakistan”, 1998.

[32] Government of Punjab, Agriculture Department, “Future agricultural extension strategy”, Lahore.

[33] Davidson, Andrew P. (2000), ‘Soil salinity, a major constraint to irrigated agriculture in the Punjab Region of Pakistan: contributing factors and strategies for amelioration’, American Journal of Alternative Agriculture No. 15, pp. 154–9.
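
Lesson (iii) in Section 8 notes that pesticide names in the scouting sheets are hard to type correctly. One possible way to catch such typing errors during ETL is approximate string matching against a reference list of names. The following Python sketch illustrates the idea only and is not the procedure used in the pilot project. The function name normalize_pesticide, the four-entry catalogue and the 0.8 similarity cutoff are all assumptions made for the example; a real catalogue could be built from a source such as [15].

```python
import difflib

# Illustrative catalogue only (an assumption for this sketch); a real list
# would be loaded from a reference source such as an insecticide catalogue.
CATALOGUE = ["imidacloprid", "cypermethrin", "endosulfan", "methamidophos"]


def normalize_pesticide(raw, catalogue=CATALOGUE, cutoff=0.8):
    """Map a typed pesticide name to its closest catalogue entry.

    Returns None when no entry is similar enough, so that the record
    can be set aside for manual review instead of being guessed.
    """
    cleaned = raw.strip().lower()
    matches = difflib.get_close_matches(cleaned, catalogue, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    # Two mistyped entries and one unrecognizable entry.
    for typed in [" Imidaclorpid ", "cypermetrin", "xyz"]:
        print(repr(typed), "->", normalize_pesticide(typed))
```

Returning None instead of the nearest name keeps doubtful entries out of the warehouse until a person has checked them, which matters when a wrong pesticide name would distort later analysis. The cutoff would need tuning against real scouting sheets.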