“Data Mining is the process of discovering actionable and meaningful patterns, profiles, and trends by sifting through your data using pattern recognition technologies (…) is a hot new technology about one of the oldest processes of human endeavour: pattern recognition (…) It is an iterative process of extracting knowledge from business transactions (…) DM is the automatic discovery of usable knowledge from your stored data.”
Jesús Mena: Data Mining Your Website (Digital Press, 1999)
“Data Mining, by its simplest definition, automates the detection of relevant patterns in a database (…) DM is not magic. For many years, statisticians have manually “mined” databases (…) DM uses well-established statistical and machine learning techniques to build models that predict customer behaviour. Today, technology automates the mining process, integrates it with commercial data warehouses, and presents it in a relevant way for business users (…) the leading DM products address the broader business and technical issues, such as their integration into complex IT environments.”
Berson, Smith, & Thearling: Building Data Mining Applications for CRM (McGraw-Hill, 2000)
WIKIPEDIA SAYS: “Data mining has been defined as "The nontrivial extraction of implicit, previously unknown, and potentially useful information from data" (1) and "The science of extracting useful information from large data sets or databases" (2). Although it is usually used in relation to analysis of data, data mining, like artificial intelligence, is an umbrella term and is used with varied meaning in a wide range of contexts.”
(1) W. Frawley, G. Piatetsky-Shapiro, and C. Matheus: Knowledge Discovery in Databases: An Overview. AI Magazine, 1992, 213-228.
(2) D. Hand, H. Mannila, P. Smyth: Principles of Data Mining. MIT Press, 2001.
WIKIPEDIA ’06 ALSO SAYS: “Data mining (DM), also called Knowledge-Discovery in Databases (KDD) or Knowledge-Discovery and Data Mining, is the process of automatically searching large volumes of data for patterns such as association rules. It is a fairly recent topic in computer science but applies many older computational techniques from statistics, information retrieval, machine learning and pattern recognition.”
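To make "patterns such as association rules" concrete, here is a minimal sketch of how a rule of the form "A implies B" is usually scored by support and confidence. The transactions, the function names, and the thresholds are illustrative assumptions, not something taken from the quoted sources, and the brute-force search over item pairs is not a full algorithm such as Apriori.

```python
from itertools import combinations

# Hypothetical market-basket transactions (illustrative data only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def pair_rules(transactions, min_support=0.4, min_confidence=0.7):
    """Brute-force search for single-item rules lhs -> rhs above both thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        for lhs, rhs in ((a, b), (b, a)):
            s = support({lhs, rhs}, transactions)
            if s < min_support:
                continue
            c = confidence({lhs}, {rhs}, transactions)
            if c >= min_confidence:
                rules.append((lhs, rhs, s, c))
    return rules


for lhs, rhs, s, c in pair_rules(transactions):
    print(f"{lhs} -> {rhs}: support={s:.2f}, confidence={c:.2f}")
```

Support measures how often the items occur together, while confidence measures how often the consequent appears when the antecedent does. Both thresholds are needed because a rule can be reliable yet too rare to be actionable.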
In 1996, in the proceedings of the 1st International Conference on KDD, Fayyad gave one of the best-known definitions of Knowledge Discovery from Data:
“ The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.”
KDD quickly gathered strength as an interdisciplinary research field in which a combination of advanced techniques from Statistics, Artificial Intelligence, Information Systems, and Visualization is used to tackle knowledge acquisition from large databases. The term Knowledge Discovery from Data appeared in 1989, referring to the:
“ [...] overall process of finding and interpreting patterns from data, typically interactive and iterative, involving repeated application of specific data mining methods or algorithms and the interpretation of the patterns generated by these algorithms.”
From “Special Issue on Applications of Machine Learning and the Knowledge Discovery Process”: Kohavi, R., Provost, F. (1998)
Data mining: The term data mining is somewhat overloaded. It sometimes refers to the whole process of knowledge discovery and sometimes to the specific machine learning phase.
Knowledge discovery: The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
Machine learning: In Knowledge Discovery, machine learning is most commonly used to mean the application of induction algorithms, which is one step in the knowledge discovery process. Machine Learning is the field of scientific study that concentrates on induction algorithms and on other algorithms that can be said to “learn”.
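As an illustration of what an "induction algorithm" does, the sketch below generalizes from labeled examples to a simple rule, here a single threshold on one numeric attribute (a decision stump). The function name, the data, and the fixed rule direction ("predict 1 above the threshold") are assumptions made for this example; they do not come from the glossary.

```python
def induce_stump(examples):
    """Induce the rule 'predict 1 if value > threshold' from (value, label) pairs.

    Labels are 0 or 1. Returns (threshold, training_errors) for the threshold
    with the fewest training errors; ties keep the lowest threshold.
    """
    values = sorted({v for v, _ in examples})
    if len(values) < 2:
        raise ValueError("need at least two distinct values to place a threshold")
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for t in candidates:
        errors = sum((v > t) != bool(y) for v, y in examples)
        if best is None or errors < best[1]:
            best = (t, errors)
    return best


# Hypothetical training data: (attribute value, class label).
examples = [(1.0, 0), (2.0, 0), (3.0, 0), (6.0, 1), (7.0, 1), (8.0, 1)]
threshold, errors = induce_stump(examples)
print(threshold, errors)  # 4.5 0
```

The induced rule can then be applied to unseen values, which is the sense in which the algorithm "learns" from the examples rather than memorizing them.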
15-17 September ’04: Wessex Institute of Technology (W.I.T.)
What to find in a DM conference…
Session 1: Text Mining (I)
Session 2: Text Mining (II)
Session 3: Web Mining
Session 4: Clustering Techniques
Session 5: Data Preparation Techniques
Session 6: Applications in Business, Industry and Government (I)
Session 7: Applications in Business, Industry and Government (II)
Session 8: Customer Relationship Management (CRM)
Session 9: Applications in Science and Engineering (I)
Session 10: Applications in Science and Engineering (II)
What to find in a DM conference… (2)
Sessions 1 & 2: Text Mining
  Stock Broker P – sentiment extraction for the stock market
  An XML-based semantic protein map
  Automated text mining comparison of Japanese and USA multi-robot research (*)
Session 3: Web Mining
  The influence of caching on web usage mining
  Clickstreams, the basis to establish user navigation patterns on web sites
(*) U.S. Army Tank-Automotive and Armaments Command
“The Total Information Awareness (TIA) program may have been killed by congressional decree, but key elements of the program have survived at other intelligence agencies, according to congressional, federal, and research officials. TIA's goal was to employ data-mining to sift through public and private databases to track terrorists, which stirred up fears that the program would be used to spy on millions of innocent Americans.”
“Congressional officials have not disclosed which TIA programs were eliminated and which were retained, but insiders report that TIA's Evidence and Extraction and Link Discovery projects, collectively encompassing 18 data-mining initiatives, are among the surviving components. The continuance of certain projects originally falling under the auspices of TIA has led Steve Aftergood of the Federation of American Scientists to conclude that Congress' decision to disband TIA was nothing more than "a shell game."”
“Despite the death of TIA, Capitol Hill is still paying for the development of software designed to collect foreign intelligence on terrorists: a $64 million research program run by the Advanced Research and Development Activity (ARDA), which has employed some of the same researchers as TIA, was left untouched by Congress.”
Business understanding: study of the objectives and requirements from a business perspective, and their conversion into a DM problem definition.
Data understanding: collection of the data and familiarization with it, looking for quality defects and detecting data of potential interest.
Data preparation: construction, from the original data, of the dataset that will be modeled. An iterative and exploratory process. Selection of tables, variables, observations… data cleaning.
Modeling: analysis of the data using modeling techniques suited to the problem at hand. Calibration of the parameters of these models.
Evaluation: all the previous steps are evaluated together, and it must be decided whether the results are adequate for the problem posed.
Deployment: all the knowledge acquired must be organized and presented to the client in a form the client can use. In many cases the client will be responsible for this phase.
Use of DM methodologies
Enterprise Miner™: SEMMA
The acronym SEMMA -- sample, explore, modify, model, assess -- refers to the core process of conducting data mining. Beginning with a statistically representative sample of your data, SEMMA makes it easy to apply exploratory statistical and visualization techniques, select and transform the most significant predictive variables, model the variables to predict outcomes, and confirm a model's accuracy.
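The five SEMMA steps can be sketched end to end on a toy problem. This is only an illustration in plain Python, not the Enterprise Miner interface: the synthetic records, the `semma` function, the 70/30 split, and the nearest-centroid model are all assumptions chosen to keep the example short.

```python
import random
import statistics


def semma(records, sample_size=60, seed=0):
    """Run a toy sample/explore/modify/model/assess pass over (x, label) records."""
    rng = random.Random(seed)

    # Sample: draw a random subset, then hold part of it out for assessment.
    sample = rng.sample(records, sample_size)
    cut = int(0.7 * len(sample))
    train, holdout = sample[:cut], sample[cut:]

    # Explore: summary statistics of the input variable.
    xs = [x for x, _ in train]
    mean, stdev = statistics.mean(xs), statistics.stdev(xs)

    # Modify: standardize the variable using the training statistics.
    def scale(x):
        return (x - mean) / stdev

    # Model: nearest-centroid classifier on the standardized variable.
    labels = {y for _, y in train}
    centroids = {
        label: statistics.mean(scale(x) for x, y in train if y == label)
        for label in labels
    }

    def predict(x):
        z = scale(x)
        return min(centroids, key=lambda label: abs(z - centroids[label]))

    # Assess: accuracy on the held-out part of the sample.
    accuracy = sum(predict(x) == y for x, y in holdout) / len(holdout)
    return {"mean": mean, "stdev": stdev, "centroids": centroids, "accuracy": accuracy}


# Synthetic data (assumption): label is 1 when x is at least 50.
records = [(x, int(x >= 50)) for x in range(100)]
result = semma(records)
print(result["accuracy"])
```

Holding out part of the sample before modeling keeps the "assess" step honest: accuracy is measured on records the model never saw.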