DATA MINING


Our ability to capture and store data far outpaces our ability to process and exploit it.

EVOLVING DATA MINING INTO SOLUTIONS FOR INSIGHTS

BY USAMA FAYYAD AND RAMASAMY UTHURUSAMY, GUEST EDITORS
COMMUNICATIONS OF THE ACM, August 2002/Vol. 45, No. 8

The capacity of digital data storage worldwide has doubled every nine months for at least a decade, at twice the rate predicted by Moore’s Law for the growth of computing power during the same period [5]. This less familiar but noteworthy phenomenon, which we call Storage Law, is among the reasons for the increasing importance and rapid growth of the field of data mining.

The aggressive rate of growth of disk storage and the gap between Moore’s Law and Storage Law growth trends represent a very interesting pattern in the state of technology evolution. Our ability to capture and store data has far outpaced our ability to process and utilize it. This growing challenge has produced a phenomenon we call the data tombs, or data stores that are effectively write-only; data is deposited to merely rest in peace, since in all likelihood it will never be accessed again.

Data tombs also represent missed opportunities. Whether the data might support exploration in a scientific activity or commercial exploitation by a business organization, the data is potentially valuable information. Without next-generation data mining tools, most will stay unused; hence most of the opportunity to discover, profit, improve service, or optimize operations will be lost. Data mining, one of the most general approaches to reducing data in order to explore, analyze, and understand it, is the focus of this special section.

Data mining is defined as the identification of interesting structure in data. Structure designates patterns, statistical or predictive models of the data, and relationships among parts of the data. Each of these terms (patterns, models, and relationships) has a concrete definition in the context of data mining. A pattern is a parsimonious summary of a subset of the data (such as people who own minivans have children). A model of the data can be a model of the entire data set and can be predictive; it can be used to, say, anticipate future customer behavior (such as the likelihood a customer is or is not happy, based on historical data of interaction with a particular company). It can also be a general model (such as a joint probability distribution on the set of variables in the data). However, the concept of interesting is much more difficult to define.

What structure within a particular data set is likely to be interesting to a user or task? An algorithm could easily enumerate lots of patterns from a finite database. Identifying interesting structure and useful patterns among the plethora of possibilities is what a data mining algorithm must do, and it must
do it quickly over very large databases.

For example, frequent item sets (variable values occurring together frequently in a database of transactions) could be used to answer, say, which items are most frequently bought together in the same supermarket. Such an algorithm could also discover a pattern in a demographics database with exceptionally high confidence that, say, all husbands are males. While true, however, this particular association is unlikely to be interesting. This same method did uncover in the set of transactions representing physicians billing the Australian Government’s medical insurance agency a correlation deemed extremely interesting by the agency’s auditors. Two billing codes were highly correlated; they were representative of the same medical procedure and hence had created the potential for double-billing fraud. This nugget of information represented millions of dollars of overpayment.

The quest for patterns in data has been studied for a long time in many fields, including statistics, pattern recognition, and exploratory data analysis [6]. Data mining is primarily concerned with making it easy, convenient, and practical to explore very large databases for organizations and users with lots of data but without years of training as data analysts [1, 3, 4]. The goals uniquely addressed by data mining fall into certain categories:

Scaling analysis to large databases. What can be done with large data sets that cannot be loaded and manipulated in main memory? Can abstract data-access primitives embedded in database systems provide mining algorithms with the information to drive a search for patterns? How might we avoid having to scan an entire very large database while reliably searching for patterns?

Scaling to high-dimensional data and models. Classical statistical data analysis relies on humans to formulate a model, then use the data to assess the model’s fit to data. But humans are ineffective at formulating hypotheses when data sets have large numbers of variables (possibly thousands in cases involving demographics and hundreds of thousands in cases involving retail transactions, Web browsing, or text document analysis). A model derived from this automated discovery and search process can be used to find lower-dimensional subspaces where people find it easier to understand aspects of the problem that are interesting.

Automating search. Instead of relying solely on a human analyst to enumerate and create hypotheses, the algorithms perform much of this tedious and data-intensive work automatically.

Finding patterns and models understandable and interesting to users. Classical methodologies for scoring models focus on notions of accuracy (how well the model predicts data) and utility (how to measure the benefit of the derived pattern, such as money saved). While these measures are well understood in decision analysis, the data mining community is also concerned with new measures, such as the understandability of a model or the novelty of a pattern and how to simplify a model for interpretability. It is particularly important that the algorithm help end users gain insight from data by focusing on the extraction of patterns that are easily understood or can be turned into meaningful reports and summaries by trading off complexity for understandability.

Trends and Challenges

Among the most important trends in data mining is the rise of “verticalized,” or highly specialized, solutions, rather than the earlier emphasis on building new data mining tools. Web analytics, customer behavior analysis, and customer relationship management all reflect the new trend; solutions to business problems increasingly embed data mining technology, often in a hidden fashion, into the application. Hence, data mining applications are increasingly targeted and designed specifically for end users. This is an important and positive departure from most of the field’s earlier work, which tended to focus on building mining tools for data mining experts.

Transparency and data fusion represent two major challenges for the growth of the data mining market
and technology development. Transparency concerns the need for an end-user-friendly interface, whereby the data mining is transparent as far as the user is concerned. Embedding vertical applications is a positive step toward addressing this problem, since it is easier to generate explanations from models built in a specific context. Data fusion concerns a more pervasive infrastructure problem: Where is the data that has to be mined? Unfortunately, most efforts at building the decision-support infrastructure, including data warehouses, have proved to be big, complicated, and expensive. Industry analysts report the failure of a majority of enterprise data warehousing efforts. Hence, even though the data accumulates in stores, it is not being organized in a format that is easy to access for mining or even for general decision support.

Much of the problem involves data fusion. How can a data miner consistently reconcile a variety of data sources? Often labeled as data integration, warehousing, or IT initiatives, the problem is also often the unsolved prerequisite to data mining. The problem of building and maintaining useful data warehouses remains one of the great obstacles to successful data mining. The sad reality today is that before users get around to applying a mining algorithm, they must spend months or years bringing together the data sources. Fortunately, new disciplined approaches to data warehousing and mining are emerging as part of the vertical solutions approach.

Emphasizing Targeted Applications

The six articles in this special section reflect the recent emphasis on targeted applications, as well as data characterization and standards.

Padhraic Smyth et al. explore the development of new algorithms and techniques in response to changing data forms and streams, covering the influence of the data form on the evolution of mining algorithms.

Paul Bradley et al. sample the effort to make data mining algorithms scale to very large databases, especially those in which one cannot assume the data is easily manipulated outside the database system or even scanned more than a few times.

Ron Kohavi et al. look into emerging trends in the vertical solutions arena, focusing on business analytics, which is driven by business value measured as progress toward bridging the gap between the needs of business users and the accessibility and usability of analytic tools.

Specific applications have always been an important aspect of data mining practice. Two overview articles cover mature and emerging applications. Chidanand Apte et al. examine industrial applications where these techniques supplement, sometimes supplant, existing human-expert-intensive analytical techniques for significantly improving the quality of business decision making. Jiawei Han et al. outline a number of data analysis and discovery challenges posed by emerging applications in the areas of bioinformatics, telecommunications, geospatial modeling, and climate and Earth ecosystem modeling.

Data mining also represents a step in the process of knowledge discovery in databases (KDD) [2]. The recent rapid increase in KDD tools and techniques for a growing variety of applications needs to follow a consistent process. The business requirement that any KDD solution must be seamlessly integrated into an existing environment makes it imperative that vendors, researchers, and practitioners all adhere to the technical standards that make their solutions interoperable, efficient, and effective. Robert Grossman et al. outline the various standards efforts under way today for dealing with the numerous steps in data mining and the KDD process.

Providing a realistic view of this still young field, these articles should help identify the opportunities for applying data mining tools and techniques in any area of research or practice, now and in the future. They also reflect the beginning of a still new science and the foundation for what will become a theory of effective inference from and exploitation of all those massive (and growing) databases.

References

1. Fayyad, U., Grinstein, G., and Wierse, A., Eds. Information Visualization in Data Mining. Morgan Kaufmann Publishers, San Francisco, 2002.
2. Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R., Eds. Advances in Knowledge Discovery and Data Mining. MIT Press, Cambridge, MA, 1996.
3. Han, J. and Kamber, M. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, San Francisco, 2000.
4. Hand, D., Mannila, H., and Smyth, P. Principles of Data Mining. MIT Press, Cambridge, MA, 2001.
5. Porter, J. Disk Trend 1998 Report; www.disktrend.com/pdf/portrpkg.pdf.
6. Tukey, J. Exploratory Data Analysis. Addison-Wesley, Reading, MA, 1977.

Usama Fayyad (fayyad@digimine.com) is the president and chief executive officer of digiMine, Inc., Seattle, WA.
Ramasamy Uthurusamy (samy@gm.com) is General Director of Emerging Technologies in the Information Systems and Services Division of General Motors Corp., Detroit, MI.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

© 2002 ACM 0002-0782/02/0800 $5.00
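
The frequent item set example in the article (items bought together in a supermarket, and the true but uninteresting rule that all husbands are males) can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: the function names frequent_pairs and confidence, the 0.5 support threshold, and the toy data are all invented here and do not come from the article. It counts item pairs by brute force in memory, which is exactly what would not scale to the very large databases the article is concerned with.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return {pair: support} for item pairs whose support meets min_support.

    Support is the fraction of transactions that contain both items.
    (Hypothetical helper written for this illustration.)
    """
    n = len(transactions)
    counts = Counter()
    for basket in transactions:
        # Sort the de-duplicated items so each pair has one canonical order.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent present | antecedent present) from the data."""
    with_antecedent = [t for t in transactions if antecedent in t]
    if not with_antecedent:
        return 0.0
    hits = sum(consequent in t for t in with_antecedent)
    return hits / len(with_antecedent)


# Toy supermarket baskets, invented for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "chips"},
    {"bread", "butter"},
]
pairs = frequent_pairs(baskets, min_support=0.5)

# Toy demographic records, also invented: the rule husband -> male holds
# with confidence 1.0, yet tells an analyst nothing new.
people = [
    {"husband", "male"},
    {"husband", "male"},
    {"wife", "female"},
]
husband_rule = confidence(people, "husband", "male")
```

On this toy data the pairs (bread, milk) and (bread, butter) each reach the 0.5 support threshold, and the husband rule has confidence 1.0. High confidence alone therefore does not separate the trivial rule from a valuable one such as the correlated billing codes; deciding which patterns are interesting needs the additional measures the article describes.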
