2004 Dagstuhl Workshop
Data Mining: The Next Generation
Johann Christoph Freytag
Data Mining has enjoyed great popularity in recent years, with advances in both research
and commercialization. The first generation of data mining research and development has
yielded several commercially available systems, both stand-alone and integrated with
database systems; produced scalable versions of algorithms for many classical data mining
problems; and introduced novel pattern discovery problems.
In recent years, research has tended to be fragmented into several distinct pockets without a
comprehensive framework. Researchers have continued to work largely within the
parameters of their parent disciplines, building upon existing and distinct research
methodologies. Even when they address a common problem (for example, how to cluster a
dataset) they apply different techniques, bring different perspectives on what the important
issues are, and use different evaluation criteria. While different approaches can be
complementary, and such diversity is ultimately a strength of the field, better communication across
disciplines is required if Data Mining is to forge a distinct identity with a core set of
principles, perspectives, and challenges that differentiate it from each of the parent
disciplines.
Further, while the amount and complexity of data continue to grow rapidly, and the task
of distilling useful insight continues to be central, serious concerns have emerged about
the social implications of data mining. Addressing these concerns will require advances in our
theoretical understanding of the principles that underlie Data Mining algorithms, as well as
an integrated approach to security and privacy in all phases of data management and
We believe that it is timely to bring together researchers from a variety of backgrounds to
re-assess the current directions of the field, to identify critical problems that require
attention, and to discuss ways to increase the flow of ideas across the different disciplines
that Data Mining has brought together. We propose a workshop to foster such a discussion.
The success of Data Mining depends on many constituencies (e.g., academia, tool vendors,
policy advocates and regulators), each with their own agendas and concerns, and some
focus is desirable to ensure good interactions. We will focus the workshop on research
directions, and specifically, directions that will lead to increased use of techniques and
perspectives drawn from the different disciplines involved in KDD. The workshop
participants will be asked to identify promising research problems for the next 5 years,
using three criteria:
• Is this problem real? Will the practice of data mining be significantly improved by
advancing the state of the art?
• Does the problem have sufficient depth and breadth to engage the research
community?
• Does the problem cut across boundaries of traditional disciplines like Database
Systems, Machine Learning, and Statistics? Will it lead to increased collaborations
and cross-disciplinary research?
Some candidate problems are listed below, and are intended to serve as a seed for further
discussion.
1. Compositional Data Mining: Can we develop compositional approaches and
optimization of multi-step mining "queries" to efficiently explore a large
space of candidate models using high-level input from an analyst? The goal
is to reduce the time taken to explore a large and complex dataset.
• Examples of real applications that made use of more than one data mining
operation. How was the composition achieved? How could it have been different?
What was missing?
• Illustrative examples of how compositional use of mining techniques can be useful.
• Thoughts on primitive operations, algebra of composition, opportunities for
optimization, incorporation of domain knowledge.
2. Query-Centric vs. Data-Centric Data Mining: Techniques arising in Database Systems
are typically query-centric, and seek to retrieve patterns from data that match patterns
specified by a query. In contrast, techniques arising in Machine Learning and Statistics are
typically data-driven, and seek to generate patterns or data descriptions that characterize
(interesting or large) subsets of data.
• Are the two approaches reconcilable? What could be the meeting grounds?
• Examples of applications where the two approaches have been, or can be, used
3. Designing for security and privacy: How can we enable effective mining while
controlling access to data according to specific privacy and security policies?
• What are the limits to what we can learn, given a set of governing policies?
• Issues in mining across enterprises? Issues in mining in a service-provider
4. Tight integration of mining with relational database systems: How can we improve data
mining environments to store data mining results and their provenance in a secure,
searchable, sharable, scalable manner? Given a set of ongoing mining objectives, how
should the data in a warehouse be organized, indexed, and archived?
• Do we need to extend SQL to support mining operations? What is the appropriate
granularity? Operations such as clustering or light-weight operations that can be
used to implement clustering and other higher-level operations? Examples of the
• Do we need to extend SQL to store and reason about mining algorithms and
• Design principles for mining environments.
Participants will be invited to make a case for other problems as well. However, the
workshop will seek to discuss a small number (say, 3-4) of problems in depth. In addition,
we hope that the workshop will lead to a better understanding of the structure of the field.
KDD has brought together the machine learning, statistics, and database communities.
Increasingly, other communities have also focused on mining activities. Examples include
text, natural language, and multimedia mining. However, the sheer breadth of tasks and
techniques has led to relatively little communication across the subgroups. Is this likely to
continue as the norm? Are there useful synergies between these diverse groups?
The participant list includes well-known researchers as well as young scientists from both
industry and academia. It is our hope that the seminar will improve the understanding of
this rapidly growing and changing field, and stimulate new collaborations between the
The workshop will run Monday through Friday, and will emphasize informal presentations
and discussions, and provide opportunities for participants to work on ideas in small groups.
Monday
All participants will make short presentations, explaining their backgrounds and recent
research activity related to the Workshop, in order for everyone to get acquainted.
We will also solicit feedback on the specific problems and topics to be discussed during the
remainder of the workshop.
Tuesday through Thursday
This will be the working period of the workshop, and will feature selected presentations in
the mornings, followed by loosely structured panels and discussions in the afternoon.
Evenings will be left open for small groups or individuals to work on their own.
Friday
The final day of the workshop will feature a morning plenary session in which we take
stock of the workshop discussions, and determine an agenda for follow-up work. We
expect that many ideas that arise during the workshop will need some discussion in
preparation for extended collaborations, and so the afternoon will be left open for small