NSF IDM 2001 Workshop
Data Mining, Data Warehousing & OLAP
Breakout Group Report
Lawrence Holder, University of Texas at Arlington
Meral Ozsoyoglu, Case Western Reserve University
The goals of the Data Mining, Data Warehousing and E-commerce breakout group were
to identify recent accomplishments, success stories, new research directions and
challenges in these areas and provide recommendations to NSF for supporting continued
progress in these areas. The group consisted of 35 participants primarily from areas in
data mining and data warehousing & OLAP and none from E-commerce.
We began the discussion by determining the problem scope of the group, which was
again primarily one of data mining and data warehousing. Because no one in the group
was exclusively dealing with E-commerce issues, the group quickly agreed that
E-commerce should be considered an important application of data mining and data
warehousing, but would not be discussed as a major research direction of the group.
Certainly there are many overlapping issues between E-commerce, data mining and data
warehousing (e.g., mining and storing E-commerce data, managing real-time
maintenance of E-commerce data, and security of E-commerce data), and many of these
issues were treated in two other breakout groups: “Innovative Applications of
Information Retrieval and Data Mining and Data Security and “Quality of Service for
the Information Age.” At the same time, we realize that E-applications in general, of
which E-commerce is but one instance, represent an important class of applications that
are closely related to research in data mining and data warehousing. Other such
E-applications include E-science and E-education/E-learning, that is, the warehousing and
mining of data and processes specific to the scientific and education communities.
The data mining and data warehousing problem was then identified to be the extraction and
management of useful and interesting patterns and anomalies from large datasets. The
group also identified the following important specific issues of this problem.
• Novel algorithms for identifying and extracting patterns
• Methods for measuring “interestingness” and “usefulness” of patterns
• Efficiency, performance, scalability
• Novel environments to make data mining possible
• Data warehousing and integration with data mining processes and systems
• Data mining techniques for schema and database integration
• Visualization and presentation of patterns and discoveries
• Incorporation of user feedback
• Use of extracted patterns to iteratively mine for additional discoveries
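As one illustration of the “interestingness” issue listed above, the following is a minimal sketch that scores a single association rule by support, confidence and lift. The report does not name specific measures; these three are standard ones from the association rule literature and are used here as an assumption. The function name and the toy basket data are hypothetical.

```python
def rule_measures(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    transactions: list of sets of items.
    The choice of measures is an assumption, not taken from the report.
    """
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    n = len(transactions)
    # Count transactions containing the antecedent, the consequent, and both.
    n_a = sum(1 for t in transactions if antecedent <= t)
    n_c = sum(1 for t in transactions if consequent <= t)
    n_ac = sum(1 for t in transactions if (antecedent | consequent) <= t)

    support = n_ac / n                           # fraction of transactions containing both sides
    confidence = n_ac / n_a if n_a else 0.0      # estimate of P(consequent | antecedent)
    lift = confidence / (n_c / n) if n_c else 0.0  # ratio to the consequent's base rate
    return support, confidence, lift


# Hypothetical toy basket data, for illustration only.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]
support, confidence, lift = rule_measures(transactions, {"bread"}, {"milk"})
print(support, confidence, lift)
```

A lift near 1 suggests the two sides of the rule are close to independent, which is one reason frequency alone is usually not treated as a sufficient measure of interestingness.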
2. Recent Accomplishments and Success Stories
Much progress has been made toward the data mining and data warehousing issues listed
above. Several algorithms have been developed for mining association rules and
sequences. Classification and clustering algorithms are now available that scale to large
datasets and fit within a theoretical framework for guaranteeing performance.
Developments in data warehousing and on-line analytical processing (OLAP) have
enabled improvements in data models, architectures, queries and query processing. Data
compression and reduction techniques (e.g., histograms, wavelet transforms) are
becoming more prevalent in the knowledge discovery process. In terms of applications,
data mining and data warehousing have made significant progress in analyzing and
managing Web, E-commerce and biological databases.
Indicators of these accomplishments can be found in numerous data mining and data
warehousing success stories. We now see widespread use and confidence in data mining
techniques. The successful use of data mining for customization is exemplified by
customer personalization (e.g., amazon.com) and market data collection (e.g., grocery
store discount cards). Data mining and data warehousing are now deployed on an
enterprise scale at companies such as Walmart. On-line information extraction services,
such as MySimon.com, use data mining techniques to provide relevant information to the
user. To support these applications, a new sector of service companies has emerged to
provide data mining and data warehousing expertise. The above activities have also
caught on at a global level, prompting world-wide demand for data mining and data
warehousing research and development.
Many other accomplishments and success stories demonstrate the importance of
continued progress in data mining and data warehousing. Although progress has been
significant, there are many problems left to be solved.
3. New Research Directions and Challenges
Despite our progress in the areas of data mining and data warehousing, many new
research issues must be addressed in order to overcome the challenging problems faced in
this area. Our group identified a number of new research directions, primarily for dealing
with new types of data: unstructured data, streaming data, image data, high-dimensional
data, heterogeneous data, dynamic data, multi-resolution data, geo-spatial and temporal
data, compressed data, graph-based data, encapsulated data with user-defined types. In
terms of new directions in the knowledge discovery process, there is a need to develop
new approaches for interactive, semi-automatic data mining.
The group also identified four challenges that can be addressed only by significant
progress in several of the aforementioned research directions.
• Building systems that work in real applications. Although many instances have
been demonstrated of specific data mining and data warehousing systems applied
to specific data, many of these applications do not scale to the complexity of the task.
For example, several data mining algorithms have been applied to specific
components of available bioinformatics data, but no integrated system has been
developed allowing management and analysis of a significant subset of
bioinformatics databases. A similar case can be made for other tasks, e.g., video
collections of the BBC and CNN, analysis of information sources for financial
institutions, Web click-stream data, associations between medical and social
databases, integration of geo-spatial databases, and other E-applications.
Providing an integrated system that can store and analyze large amounts of
heterogeneous data is a significant research challenge.
• Building systems that real people can use. Most data mining systems are operated
by computer scientists specially trained in the domain of interest, or domain
scientists specially trained in the operation of the data mining system. In order to
reduce the extra-curricular training necessary to mine data in different domains,
data mining systems must be easier to use for domain scientists, industrial users,
and individual users. Specific challenges in this area include the automated
preparation of the data, semi-automatic discovery, and the evaluation,
visualization and interpretation of the results. Next generation data mining and
data warehousing tools must improve upon these capabilities.
• Determining which data mining methods to use with a given problem and
application. Numerous successful and unsuccessful applications of data mining
methods have provided important data on the applicability of various methods to
different tasks. Research is needed to collect and analyze these results, develop
general mappings between methods and tasks, and eventually develop a theory
providing performance bounds on potential method/task pairings in order to make
more systematic decisions regarding appropriate existing methods and needed
• Non-traditional data mining and data warehousing tasks. With the advent of
numerous, heterogeneous sources of structured, semi-structured and unstructured
data, new data mining and data warehousing methods are necessary to integrate
this data to support the extraction of knowledge relating different information
sources. Many of these new data types are mentioned above as new research
directions, but an even greater challenge exists in the integration of several such
data types. Issues including representation, scalability, real-time collection, and
dynamic schema change must be addressed in future development of data mining
and data warehousing methods in order to perform in these non-traditional areas.
The NSF IDM program has been instrumental in achieving the accomplishments and
successes described above and will continue to be a major resource for supporting future
progress and meeting the challenges of next generation data mining and data
warehousing. The group discussed ways in which the NSF IDM program might best
support these efforts and outlined several recommendations. First, NSF should continue
to support effective and productive collaboration between data mining/warehousing
researchers and domain scientists. Such support would include efforts to promote
collaboration with researchers in other closely related areas (e.g., statistics, algorithms
and scientific computation) and continued efforts to facilitate cross disciplinary
Second, NSF should focus more effort on establishing analysis techniques, metrics and
benchmarks for evaluating the performance of data mining/warehousing systems,
especially in the non-traditional data formats mentioned earlier. This effort would not
only provide mechanisms for comparing different methods, but also help develop
objective measures of the interestingness of the discovered knowledge and the value of
storing information and knowledge.
Finally, the group stressed that NSF should continue to view its primary role as one of
supporting long-term research on core problems in computer science and technology
rather than short-term, market-influenced projects. NSF should provide the foundation
for, but not try to compete with, short-term industry technology developments.
Maintaining a continued focus on such research will allow the NSF IDM program to
support significant progress on the difficult challenges awaiting the next generation of
data mining and data warehousing tasks.