NSF IDM 2001 Workshop

             Data Mining, Data Warehousing & OLAP
                        and E-Commerce

Breakout Group Report

Co-Chairs: Lawrence Holder (University of Texas at Arlington) and Meral Ozsoyoglu (Case Western Reserve University)

1. Introduction

The goals of the Data Mining, Data Warehousing and E-commerce breakout group were to identify recent accomplishments, success stories, new research directions and challenges in these areas, and to provide recommendations to NSF for supporting continued progress in these areas. The group consisted of 35 participants, primarily from data mining and from data warehousing & OLAP, and none from E-commerce.

We began the discussion by determining the problem scope of the group, which was again primarily one of data mining and data warehousing. Because no one in the group was dealing exclusively with E-commerce issues, the group quickly agreed that E-commerce should be considered an important application of data mining and data warehousing, but would not be discussed as a major research direction of the group. Certainly there are many overlapping issues between E-commerce, data mining and data warehousing (e.g., mining and storing E-commerce data, managing real-time maintenance of E-commerce data, and security of E-commerce data), and many of these issues were treated in two other breakout groups: “Innovative Applications of Information Retrieval and Data Mining and Data Security” and “Quality of Service for the Information Age.”

At the same time, we realize that E-applications in general, of which E-commerce is but one instance, represent an important class of applications that are closely related to research in data mining and data warehousing. Other such E-applications include E-science and E-education/E-learning, that is, the warehousing and mining of data and processes specific to the scientific and education communities.

The data mining and data warehousing problem was then identified to be the extraction and management of useful and interesting patterns and anomalies from large datasets. The group also identified the following important specific issues of this problem.

• Novel algorithms for identifying and extracting patterns
• Methods for measuring “interestingness” and “usefulness” of patterns
• Efficiency, performance, scalability
• Novel environments to make data mining possible
• Data warehousing and integration with data mining processes and systems
• Data mining techniques for schema and database integration
• Visualization and presentation of patterns and discoveries
• Incorporation of user feedback
• Use of extracted patterns to iteratively mine for additional discoveries

2. Recent Accomplishments and Success Stories

Much progress has been made toward the data mining and data warehousing issues listed above. Several algorithms have been developed for mining association rules and sequences. Classification and clustering algorithms are now available that scale to large datasets and fit within a theoretical framework for guaranteeing performance. Developments in data warehousing and on-line analytical processing (OLAP) have enabled improvements in data models, architectures, queries and query processing. Data compression and reduction techniques (e.g., histograms, wavelet transforms) are becoming more prevalent in the knowledge discovery process. In terms of applications, data mining and data warehousing have made significant progress in analyzing and managing Web, E-commerce and biological databases.

Indicators of these accomplishments can be found in numerous data mining and data warehousing success stories. We now see widespread use of and confidence in data mining techniques. The successful use of data mining for customization is exemplified by customer personalization (e.g., amazon.com) and market data collection (e.g., grocery store discount cards). Data mining and data warehousing are now deployed on an enterprise scale at companies such as Walmart. On-line information extraction services, such as MySimon.com, use data mining techniques to provide relevant information to the user. To support these applications, a new sector of service companies has emerged to provide data mining and data warehousing expertise.
The above activities have also caught on at a global level, prompting world-wide demand for data mining and data warehousing research and development. Many other accomplishments and success stories demonstrate the importance of continued progress in data mining and data warehousing. Although progress has been significant, there are many problems left to be solved.

3. New Research Directions and Challenges

Despite our progress in the areas of data mining and data warehousing, many new research issues must be addressed in order to overcome the challenging problems faced in this area. Our group identified a number of new research directions, primarily for dealing with new types of data: unstructured data, streaming data, image data, high-dimensional data, heterogeneous data, dynamic data, multi-resolution data, geo-spatial and temporal data, compressed data, graph-based data, and encapsulated data with user-defined types. In terms of new directions in the knowledge discovery process, there is a need to develop new approaches for interactive, semi-automatic data mining. The group also identified four challenges that can be addressed only by significant progress in several of the aforementioned research directions.

• Building systems that work in real applications. Although many instances have been demonstrated of specific data mining and data warehousing systems applied to specific data, many of these applications do not scale to the complexity of the task. For example, several data mining algorithms have been applied to specific components of available bioinformatics data, but no integrated system has been developed allowing management and analysis of a significant subset of bioinformatics databases. A similar case can be made for other tasks, e.g., video collections of the BBC and CNN, analysis of information sources for financial institutions, Web click-stream data, associations between medical and social databases, integration of geo-spatial databases, and other E-applications. Providing an integrated system that can store and analyze large amounts of heterogeneous data is a significant research challenge.

• Building systems that real people can use. Most data mining systems are operated by computer scientists specially trained in the domain of interest, or by domain scientists specially trained in the operation of the data mining system. In order to reduce the additional training necessary to mine data in different domains, data mining systems must be easier to use for domain scientists, industrial users, and individual users. Specific challenges in this area include the automated preparation of the data, semi-automatic discovery, and the evaluation, visualization and interpretation of the results. Next generation data mining and data warehousing tools must improve upon these capabilities.

• Determining which data mining methods to use with a given problem and application. Numerous successful and unsuccessful applications of data mining methods have provided important data on the applicability of various methods to different tasks. Research is needed to collect and analyze these results, develop general mappings between methods and tasks, and eventually develop a theory providing performance bounds on potential method/task pairings, in order to make more systematic decisions regarding which existing methods are appropriate and which new methods are needed.

• Non-traditional data mining and data warehousing tasks. With the advent of numerous, heterogeneous sources of structured, semi-structured and unstructured data, new data mining and data warehousing methods are necessary to integrate this data and to support the extraction of knowledge relating different information sources. Many of these new data types are mentioned above as new research directions, but an even greater challenge exists in the integration of several such data types. Issues including representation, scalability, real-time collection, and dynamic schema change must be addressed in future development of data mining and data warehousing methods in order to perform in these non-traditional areas.

4. Recommendations

The NSF IDM program has been instrumental in achieving the accomplishments and successes described above and will continue to be a major resource for supporting future progress and meeting the challenges of next generation data mining and data warehousing. The group discussed ways in which the NSF IDM program might best support these efforts and outlined several recommendations.

First, NSF should continue to support effective and productive collaboration between data mining/warehousing researchers and domain scientists. Such support would include efforts to promote collaboration with researchers in other closely related areas (e.g., statistics, algorithms and scientific computation) and continued efforts to facilitate cross-disciplinary education.

Second, NSF should focus more effort on establishing analysis techniques, metrics and benchmarks for evaluating the performance of data mining/warehousing systems, especially on the non-traditional data formats mentioned earlier. This effort would not only provide mechanisms for comparing different methods, but also help develop objective measures of the interestingness of the discovered knowledge and the value of storing information and knowledge.

Finally, the group stressed that NSF should continue to view its primary role as one of supporting long-term research on core problems in computer science and technology rather than short-term, market-influenced projects. NSF should provide the foundation for, but not try to compete with, short-term industry technology developments. Maintaining a continued focus on such research will allow the NSF IDM program to support significant progress on the difficult challenges awaiting the next generation of data mining and data warehousing tasks.
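
As an illustration of what an objective measure of interestingness might look like, the following minimal Python sketch computes three widely used measures for association rules: support, confidence and lift. The report does not name these measures or prescribe any particular one; they are chosen here only as a familiar example. The transaction data and the function names are hypothetical and were made up for this sketch.

```python
# Illustrative sketch only: support, confidence and lift are common
# association-rule measures, but the report itself does not specify them.
# The transactions below are hypothetical toy data.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated probability of the consequent given the antecedent."""
    base = support(antecedent, transactions)
    if base == 0:
        raise ValueError("antecedent never occurs in the transactions")
    both = support(set(antecedent) | set(consequent), transactions)
    return both / base


def lift(antecedent, consequent, transactions):
    """Confidence relative to the consequent's overall support.

    A value near 1.0 suggests the two sides are roughly independent;
    a value above 1.0 suggests a positive association.
    """
    target = support(consequent, transactions)
    if target == 0:
        raise ValueError("consequent never occurs in the transactions")
    return confidence(antecedent, consequent, transactions) / target


if __name__ == "__main__":
    rule = ({"diapers"}, {"beer"})
    print(f"support    = {support(rule[0] | rule[1], TRANSACTIONS):.2f}")
    print(f"confidence = {confidence(*rule, TRANSACTIONS):.2f}")
    print(f"lift       = {lift(*rule, TRANSACTIONS):.2f}")
```

Measures of this kind are easy to compute and compare across methods, which is one reason they are often used in benchmarks; whether they capture what a domain scientist actually finds interesting is the open question the recommendation points to.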