  • Explain some of these. Link to T.I.A.
  • Digression on confidential DM research (USA)
  • Phases: the most general level applies to any process. Process instances: specialization to specific problems.

    1. 1. An Introduction to Mining (1) Análisis Inteligente de Datos y Data Mining Alfredo Vellido
    2. 2. <ul><li>www.lsi.upc.edu/~avellido/teaching/data_mining.htm </li></ul><ul><li>… /~belanche/docencia/aiddm/aiddm.html </li></ul>
    3. 3. Contents of the course (hopefully) <ul><li>1. Introduction & methodologies </li></ul><ul><li>2. Exploratory DM through visualization </li></ul><ul><li>3. Pattern recognition 1 </li></ul><ul><li>4. Pattern recognition 2 </li></ul><ul><li>5. Feature extraction </li></ul><ul><li>6. Feature selection </li></ul><ul><li>7. Error estimation </li></ul><ul><li>8. Linear classifiers, kernels and SVMs </li></ul><ul><li>9. Probability in Data Mining </li></ul><ul><li>10. Latency, generativity, manifolds and all that </li></ul><ul><li>11. Application of GTM: from medicine to ecology </li></ul><ul><li>12. DM Case studies </li></ul>Sorry guys! … no fuzzy systems …
    4. 4. What is DATA MINING?
    5. 5. What is DATA MINING? (1) <ul><li>“Data Mining is the process of discovering actionable and meaningful patterns, profiles, and trends by sifting through your data using pattern recognition technologies (…) is a hot new technology about one of the oldest processes of human endeavour: pattern recognition (…) It is an iterative process of extracting knowledge from business transactions (…) DM is the automatic discovery of usable knowledge from your stored data.” </li></ul><ul><li>Jesús Mena: Data Mining Your Website (Digital Press, 1999) </li></ul>
    6. 6. What is DATA MINING? (2) <ul><li>“Data Mining, by its simplest definition, automates the detection of relevant patterns in a database (…) DM is not magic. For many years, statisticians have manually “mined” databases (…) DM uses well-established statistical and machine learning techniques to build models that predict customer behaviour. Today, technology automates the mining process, integrates it with commercial data warehouses, and presents it in a relevant way for business users (…) the leading DM products address the broader business and technical issues, such as their integration into complex IT environments.” </li></ul><ul><li>Berson, Smith, & Thearling: Building Data Mining Applications for CRM (McGraw-Hill, 2000) </li></ul>
    7. 7. What is DATA MINING? (3) <ul><li>WIKIPEDIA DIXIT: “Data mining has been defined as &quot;The nontrivial extraction of implicit, previously unknown, and potentially useful information from data&quot; (1) and &quot;The science of extracting useful information from large data sets or databases&quot; (2). Although it is usually used in relation to analysis of data, data mining, like artificial intelligence, is an umbrella term and is used with varied meaning in a wide range of contexts.” </li></ul><ul><li>(1) W. Frawley and G. Piatetsky-Shapiro and C. Matheus, Knowledge Discovery in Databases: An Overview. AI Magazine, 1992, 213-228. </li></ul><ul><li>(2) D. Hand, H. Mannila, P. Smyth: Principles of Data Mining. MIT Press, 2001. </li></ul><ul><li>wikipedia 2005: en.wikipedia.org/wiki/Data_mining </li></ul>
    8. 8. What is DATA MINING? (4) <ul><li>WIKIPEDIA’06 also DIXIT: “Data mining (DM), also called Knowledge-Discovery in Databases (KDD) or Knowledge-Discovery and Data Mining, is the process of automatically searching large volumes of data for patterns such as association rules. It is a fairly recent topic in computer science but applies many older computational techniques from statistics, information retrieval, machine learning and pattern recognition.” </li></ul><ul><li>wikipedia 2006: en.wikipedia.org/wiki/Data_mining </li></ul>
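The definition above names association rules as a typical pattern. As a minimal sketch of what such a rule measures, the snippet below computes the support and confidence of a rule over a handful of transactions. The function names and the toy basket data are illustrative assumptions, not part of the original slides.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent.

    Assumes the antecedent occurs at least once (otherwise this divides by zero).
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical toy data: each transaction is a set of purchased items.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(baskets, {"bread", "milk"})
rule_confidence = confidence(baskets, {"bread"}, {"milk"})
```

Here the rule "bread -> milk" holds in half of all baskets (support) and in two of the three baskets that contain bread (confidence). Mining algorithms search for rules whose support and confidence exceed chosen thresholds.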
    9. 9. What is DATA MINING? (5) <ul><li>In 1996, in the proceedings of the 1st International Conference on KDD, Fayyad gave one of the best-known definitions of Knowledge Discovery from Data: </li></ul><ul><ul><li>“The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.” </li></ul></ul><ul><li>KDD quickly gathered strength as an interdisciplinary research field in which a combination of advanced techniques from Statistics, Artificial Intelligence, Information Systems, and Visualization is used to tackle knowledge acquisition from large databases. The term Knowledge Discovery from Data appeared in 1989, referring to the: </li></ul><ul><ul><li>“[...] overall process of finding and interpreting patterns from data, typically interactive and iterative, involving repeated application of specific data mining methods or algorithms and the interpretation of the patterns generated by these algorithms.” </li></ul></ul>
    10. 10. DM: a glossary <ul><li>From “Special Issue on Applications of Machine Learning and the Knowledge Discovery Process”: Kohavi, R., Provost, F. (1998) </li></ul><ul><li>Data mining </li></ul><ul><ul><li>The term data mining is somewhat overloaded. It sometimes refers to the whole process of knowledge discovery and sometimes to the specific machine learning phase. </li></ul></ul><ul><li>Knowledge discovery </li></ul><ul><ul><li>The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. </li></ul></ul><ul><li>Machine learning </li></ul><ul><ul><li>In Knowledge Discovery, machine learning is most commonly used to mean the application of induction algorithms, which is one step in the knowledge discovery process. Machine Learning is the field of scientific study that concentrates on induction algorithms and on other algorithms that can be said to “learn”. </li></ul></ul>
    11. 11. What to expect from a DM conference… <ul><li>15-17 September 2004: Wessex Institute of Technology (W.I.T.) </li></ul>
    12. 12. What to find in a DM conference… Session 1: Text Mining (I); Session 2: Text Mining (II); Session 3: Web Mining; Session 4: Clustering Techniques; Session 5: Data Preparation Techniques; Session 6: Applications in Business, Industry and Government (I); Session 7: Applications in Business, Industry and Government (II); Session 8: Customer Relationship Management (CRM); Session 9: Applications in Science and Engineering (I); Session 10: Applications in Science and Engineering (II)
    13. 13. What to find in a DM conference… (2) Sessions 1 & 2: Text Mining: Stock Broker P – sentiment extraction for the stock market; An XML-based semantic protein map; Automated text mining comparison of Japanese and USA multi-robot research (*). Session 3: Web Mining: The influence of caching on web usage mining; Clickstreams, the basis to establish user navigation patterns on web sites. (*) U.S. Army Tank-Automotive and Armaments Command
    14. 14. What to find in a DM conference… (3) <ul><li>Sessions 4-10 </li></ul><ul><li>Exploration of the ecological status of Mediterranean rivers: clustering, visualizing and reconstructing streams data using Generative Topographic Mapping </li></ul><ul><li>A visual tool for mining macroeconomics data </li></ul><ul><li>An intelligent real-time particle monitoring application for polymer film manufacturing </li></ul><ul><li>Mining call center dialog data </li></ul><ul><li>Data mining in publishing: a nice feature or a necessity? </li></ul><ul><li>Data mining highly multiple time series of astronomical observations </li></ul><ul><li>Data mining approach to study Quality of Voice over IP applications </li></ul>
    15. 15. Do you need money ?
    16. 16. The T.I.A. <ul><li>“The Total Information Awareness (TIA) program may have been killed by congressional decree, but key elements of the program have survived at other intelligence agencies, according to congressional, federal, and research officials. TIA's goal was to employ data-mining to sift through public and private databases to track terrorists, which stirred up fears that the program would be used to spy on millions of innocent Americans.” </li></ul><ul><li>“Congressional officials have not disclosed which TIA programs were eliminated and which were retained, but insiders report that TIA's Evidence and Extraction and Link Discovery projects, collectively encompassing 18 data-mining initiatives, are among the surviving components. The continuance of certain projects originally falling under the auspices of TIA has led Steve Aftergood of the Federation of American Scientists to conclude that Congress' decision to disband TIA was nothing more than &quot;a shell game.&quot;” </li></ul><ul><li>“Despite the death of TIA, Capitol Hill is still paying for the development of software designed to collect foreign intelligence on terrorists: a $64 million research program run by the Advanced Research and Development Activity (ARDA), which has employed some of the same researchers as TIA, was left untouched by Congress.” </li></ul>
    17. 17. DATA MINING as a methodology
    18. 18. CRISP: a DM methodology <ul><li>CRoss-Industry Standard Process for Data Mining: neutral methodology from the point of view of industry, tool and application (free & non-proprietary) </li></ul><ul><li>Pete Chapman, Randy Kerber (NCR); Julian Clinton, Thomas Khabaza, Colin Shearer (SPSS); Thomas Reinartz, Rüdiger Wirth (DaimlerChrysler) </li></ul><ul><li>CRISP-DM was conceived in 1996 </li></ul><ul><li>DaimlerChrysler: leaders in industrial application; SPSS: leaders in product development (Clementine, 1994); NCR: owners of large (huge!) databases (Teradata) </li></ul><ul><li>Financed by the EU. Version 1.0 released officially in 1999 </li></ul>
    19. 19. CRISP: Hierarchic structure of the methodology
    20. 20. CRISP: Methodology phases
    21. 21. CRISP: Description of the phases <ul><li>Business understanding: study of the objectives and requirements from the business perspective, and their conversion into a DM problem definition </li></ul><ul><li>Data understanding: data collection and familiarization with the data, looking for quality defects and detecting data of potential interest </li></ul><ul><li>Data preparation: construction of the data set to be modeled, starting from the original data. An iterative and exploratory process. Selection of tables, variables, observations… data cleaning </li></ul><ul><li>Modeling: data analysis through modeling techniques suited to the problem at hand. Calibration of the parameters of these models </li></ul><ul><li>Evaluation: all previous steps are evaluated jointly, and it must be decided whether the results are adequate for the problem posed </li></ul><ul><li>Deployment: all the knowledge acquired must be organized and presented to the client in a way the client can use. In many cases the client will be responsible for this phase </li></ul>
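The six phases can be pictured as a loop that only reaches the final phase once the evaluation is satisfactory. The sketch below is a toy illustration of that control flow and is not taken from the CRISP-DM documentation. The `Phase` enum, the `run_crisp_dm` function, and the rule of restarting from the first phase after a failed evaluation are all assumptions made for the example.

```python
from enum import Enum


class Phase(Enum):
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


# Every phase except the last; Enum iteration follows definition order.
PRE_DEPLOYMENT = [p for p in Phase if p is not Phase.DEPLOYMENT]


def run_crisp_dm(results_adequate, max_iterations=3):
    """Return the sequence of phases visited.

    `results_adequate(iteration)` is a hypothetical callback standing in for
    the evaluation decision: True means the results fit the business problem.
    """
    trace = []
    for iteration in range(1, max_iterations + 1):
        trace.extend(PRE_DEPLOYMENT)
        if results_adequate(iteration):
            trace.append(Phase.DEPLOYMENT)
            break
        # Assumption: a failed evaluation restarts the cycle from the top.
    return trace


# Example: the first pass is judged inadequate, the second is accepted.
trace = run_crisp_dm(lambda iteration: iteration >= 2)
```

In practice the process can also step back between neighbouring phases (for instance, from modeling to data preparation). The single outer loop is a deliberate simplification.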
    22. 22. Use of DM methodologies Enterprise Miner™: SEMMA The acronym SEMMA (sample, explore, modify, model, assess) refers to the core process of conducting data mining. Beginning with a statistically representative sample of your data, SEMMA makes it easy to apply exploratory statistical and visualization techniques, select and transform the most significant predictive variables, model the variables to predict outcomes, and confirm a model's accuracy.
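The five SEMMA steps can be sketched as five small functions chained together. This is a toy illustration using only the Python standard library and does not reflect Enterprise Miner's actual interface. The function names, the synthetic data (y = 2x + 1), the standardization step, and the least-squares line are all assumptions chosen to keep the example short.

```python
import random
import statistics


def sample(rows, fraction, seed=0):
    """Sample: draw a random subset (without replacement) of the data."""
    rng = random.Random(seed)
    k = max(2, int(len(rows) * fraction))
    return rng.sample(rows, k)


def explore(rows):
    """Explore: summary statistics of the input variable."""
    xs = [x for x, _ in rows]
    return {"mean_x": statistics.mean(xs), "stdev_x": statistics.stdev(xs)}


def modify(rows, stats):
    """Modify: standardize the input variable using the explored statistics."""
    return [((x - stats["mean_x"]) / stats["stdev_x"], y) for x, y in rows]


def model(rows):
    """Model: fit y = slope * x + intercept by ordinary least squares."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in rows) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx


def assess(rows, params):
    """Assess: mean squared error of the fitted line on the given rows."""
    slope, intercept = params
    return statistics.mean((y - (slope * x + intercept)) ** 2 for x, y in rows)


# Synthetic, noise-free data (an assumption made for the example).
data = [(float(x), 2.0 * x + 1.0) for x in range(100)]
subset = sample(data, 0.2)
prepared = modify(subset, explore(subset))
params = model(prepared)
error = assess(prepared, params)
```

Because the toy data is noise-free, the error is essentially zero. A realistic assessment would score the model on data held out from the fitting step.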