Published in: Technology
  • Metrofang example…
  • Comment on some areas: bioinformatics, banking and similar…
  • Talk about METROFANG
  • PMML: Predictive Model Markup Language (ORACLE)

    1. 1. An Introduction to Mining (2). Intelligent Data Analysis and Data Mining. Alfredo Vellido
    2. 2. DATA MINING as a methodology
    3. 3. CRISP: a DM methodology. CRoss-Industry Standard Process for Data Mining: a methodology that is neutral with respect to industry, tool and application (free and non-proprietary). Authors: Pete Chapman, Randy Kerber (NCR); Julian Clinton, Thomas Khabaza, Colin Shearer (SPSS); Thomas Reinartz, Rüdiger Wirth (DaimlerChrysler). CRISP-DM was conceived in 1996. DaimlerChrysler: leaders in industrial application; SPSS: leaders in product development (Clementine, 1994); NCR: owners of large (huge!) databases (Teradata). Financed by the EU. Version 1.0 was officially released in 1999.
    4. 4. CRISP: Methodology phases
    5. 5. Use of DM methodologies. What main methodology are you using for data mining? CRISP-DM (72) 42%; SEMMA (17) 10%; My organization's (11) 6%; My own (48) 28%; Other (10) 6%; None (12) 7%. Enterprise Miner™: SEMMA. The acronym SEMMA (sample, explore, modify, model, assess) refers to the core process of conducting data mining. Beginning with a statistically representative sample of your data, SEMMA makes it easy to apply exploratory statistical and visualization techniques, select and transform the most significant predictive variables, model the variables to predict outcomes, and confirm a model's accuracy.
    6. 6. CRISP: Phases: Problem understanding. Phases: business understanding, data understanding, data preparation, modelling, evaluation, deployment. Tasks and their outputs: determine business objectives (background, business objectives, business success criteria); assess the situation (inventory of resources; requirements, assumptions and constraints; risks and contingencies; terminology; costs and benefits); determine DM goals (DM goals, DM success criteria); produce project plan (project plan, initial selection of tools).
    7. 7. DM application areas (’06). Industries/fields where you applied data mining in the past 12 months [111 voters, 278 votes total]: CRM (43) 39.1%; Fraud Detection (24) 21.8%; Direct Marketing/Fundraising (22) 20.0%; Credit Scoring (21) 19.1%; Biotech/Genomics (17) 15.5%; Web content mining/Search (15) 13.6%; Other (15) 13.6%; Telecom (14) 12.7%; Web usage mining (12) 10.9%; Science (12) 10.9%; Insurance (12) 10.9%; Retail (11) 10.0%; Investment/Stocks (11) 10.0%; Medical/Pharma (8) 7.3%; Manufacturing (7) 6.4%; Government/Military (7) 6.4%; e-commerce (6) 5.5%; Travel/Hospitality (5) 4.5%; Security/Anti-terrorism (5) 4.5%; Health care/HR (5) 4.5%; Junk email/Anti-spam (2) 1.8%; Entertainment/Music (2) 1.8%; Banking (1) 0.9%.
    8. 8. CRISP: Phases: Data understanding. Phases: business understanding, data understanding, data preparation, modelling, evaluation, deployment. Tasks and their outputs: collect initial data (initial data report); describe data (data description report); explore data (data exploration report); verify data quality (data quality report).
    9. 9. METROFANG: a real story about data understanding (1)
    10. 10. METROFANG: a real story about data understanding (2). [Two time-series plots: inflow rate ("caudal entrada", vertical axis 0 to 350) and dryer A motor torque ("Par motor Secador A", vertical axis 0 to 140), each over samples 1 to 17671.] Annotations: missing data, seasonality, outliers, time series, weekend? FORUM???
    11. 11. What data format do we use? What are your preferred methods for storing data for data mining? [403 votes total]: Text, CSV (comma-separated) (72) 18%; Text, space or tab separated (55) 14%; Excel (38) 9%; SAS (57) 14%; SPSS (31) 8%; S-Plus/R (15) 4%; Weka ARFF (23) 6%; Other data mining tool format (11) 3%; In a database system (93) 23%; Other - please comment (8) 2%.
    12. 12. CRISP: Phases: Data preparation. Phases: business understanding, data understanding, data preparation, modelling, evaluation, deployment. Tasks and their outputs: select data (rationale for the selection); clean data (data cleaning report); construct data (derived variables, generated observations); integrate data (integrated data); format data (reformatted data).
    13. 13. Is data preparation that important? What % of time in your data mining project(s) is spent on data cleaning and preparation? [187 votes total]: over 80% (46) 25%; 61 to 80% (73) 39%; 41 to 60% (46) 25%; 21 to 40% (7) 4%; 20% or less (15) 8%.
    14. 14. Common data types …(’05). Types of data you analyzed/mined in last 12 months [126 voters, 391 choices]: table data, fixed # of columns (103) 82% (26%); time series (51) 40% (13%); text, free-form (42) 33% (11%); transactions (association rules) (38) 30% (10%); web clickstream (21) 17% (5%); spatial data (2-D, 3-D) (20) 16% (5%); web content (19) 15% (5%); email (16) 13% (4%); XML data (16) 13% (4%); links or networks (14) 11% (4%); anonymized data (14) 11% (4%); images/video (8) 6% (2%); music and audio (8) 6% (2%); other (21) 17% (5%).
    15. 15. Common data types …(’06): table data - fixed # of columns (75) 70.8%; time series (36) 34.0%; text - free-form (35) 33.0%; transactions (association rules) (30) 28.3%; anonymized data (27) 25.5%; spatial data (2-D, 3-D) (15) 14.2%; other (15) 14.2%; email (11) 10.4%; web clickstream (9) 8.5%; links or networks (9) 8.5%; images/video (8) 7.5%; XML data (7) 6.6%; web content (6) 5.7%; music/audio (5) 4.7%. Compared to the 2005 KDnuggets Poll on “Types of data you analyzed/mined in last 12 months”, the biggest increase was in anonymized data (perhaps an indicator of the increasing importance of privacy issues).
    16. 16. How large is it and where do you store it? …(’06). What did you use for data storage for significant data mining projects in the past year? [142 voters, 284 votes]: Text files (e.g. tab or comma delim) (75) 52.8%; Data mining system format (SAS, SPSS, arff) (57) 40.1%; Excel (28) 19.7%; Oracle (25) 17.6%; SQL Server (15) 10.6%; mySQL (12) 8.5%; other format (10) 7.0%; other commercial DBMS (7) 4.9%; other free DBMS (4) 2.8%. Largest database or dataset you data-mined was [181 votes total]: less than 1 MB (5) 3%; 1.1 to 10 MB (11) 6%; 11 to 100 MB (27) 15%; 101 MB to 1 GB (22) 12%; 1.1 to 10 GB (45) 25%; 11 to 100 GB (22) 12%; 101 GB to 1 Terabyte (28) 15%; over 1 Terabyte (21) 12%.
    17. 17. CRISP: Phases: Modelling. Phases: business understanding, data understanding, data preparation, modelling, evaluation, deployment. Tasks and their outputs: select modelling technique (selected technique); create test design (test design); build model (parameter settings, model, model description); validate model (model validation).
    18. 18. CRISP: Typology of DM problems (for each problem: description, examples, techniques)
        Data description and summarization. Description: concise description of data, elementary and aggregated; exploratory analysis. Examples: almost any problem includes elements of description. Techniques: ERPs, statistics, OLAP, EIS, dashboards.
        Segmentation. Description: separation of data into meaningful subgroups (unsupervised); the terms segmentation, clustering and classification are used ambiguously. Examples: market segmentation, shopping basket analysis. Techniques: clustering, NNs (SOM, GTM), visualization.
        Concept description. Description: intelligible and useful description of concepts / classes / groups; understanding takes priority over accuracy; linked to classification and segmentation. Examples: description of customer groups by loyalty; segment profiling, e.g. if SEX=male and age>45 then CUST=loyal. Techniques: rule induction, conceptual clustering.
        Classification. Description: observations, defined by variables, are assumed to belong to given classes (supervised). Examples: bankruptcy, credit scoring. Techniques: discriminant analysis, rule induction, decision trees, NNs, case-based reasoning, GAs.
        Prediction (regression, forecasting). Description: continuous dependent variable; given the values of the predictor variables, find the predicted value (supervised). Examples: stock market, company profits, market share. Techniques: regression analysis, regression trees, NNs, Box-Jenkins, GAs.
        Dependency analysis. Description: search for dependencies between variables (supervised or unsupervised); often combined with segmentation. Examples: basket analysis, e.g. 30% of those who bought peanuts also bought beer. Techniques: correlation analysis, association rules, Bayesian networks, inductive logic programming.
    19. 19. CRISP: Selection of techniques. Universe of techniques (defined by tools); techniques suited to the problem; political requirements (business, executive); constraints (type of data, knowledge; time, money, human resources); selected tool(s).
    20. 20. Commonly used models (‘05)… Data mining/analytic techniques you use frequently [784 votes total]: Decision Trees/Rules (107) 14%; Clustering (101) 13%; Regression (90) 11%; Statistics (80) 10%; Visualization (63) 8%; Neural Nets (61) 8%; Association rules (54) 7%; Nearest Neighbor (34) 4%; SVM (Support vector machine) (31) 4%; Bayesian (30) 4%; Sequence/Time series analysis (26) 3%; Boosting (25) 3%; Hybrid methods (23) 3%; Bagging (20) 3%; Genetic algorithms (19) 2%; Other (20) 3%.
    21. 21. Commonly used models (‘06)… Data mining/analytic methods you used frequently in the last year [176 voters]: Decision Trees/Rules (90) 51.1%; Clustering (70) 39.8%; Regression (67) 38.1%; Statistics (64) 36.4%; Association rules (54) 30.7%; Visualization (38) 21.6%; SVM (31) 17.6%; Neural Nets (31) 17.6%; Sequence/Time series analysis (24) 13.6%; Bayesian (24) 13.6%; Nearest Neighbor (20) 11.4%; Boosting (17) 9.7%; Hybrid methods (14) 8.0%; Bagging (13) 7.4%; Genetic algorithms (12) 6.8%; Other (4) 2.3%.
    22. 22. CRISP: Phases: Evaluation. Phases: business understanding, data understanding, data preparation, modelling, evaluation, deployment. Tasks and their outputs: evaluate results (assessment of DM results, approved models); review process (process review); determine next steps (list of possible actions, decisions).
    23. 23. CRISP: Phases: Deployment. Phases: business understanding, data understanding, data preparation, modelling, evaluation, deployment. Tasks and their outputs: plan deployment (deployment plan); plan monitoring and maintenance (monitoring and maintenance plan); produce final report (final report, final presentation); review project (documentation of the experience).
    24. 24. How do you deploy it? (’06). How do you usually deploy data mining results? (Choose all that apply) [95 voters]: Publish research papers (37) 38.9%; Use findings to change business rules (42) 44.2%. Deploy in production and ... (46) 48.4%; Use data mining tool for scoring (47) 49.5%; Convert model to SQL (20) 21.1%; Convert model to another language (16) 16.8%; Convert model to C or Java (16) 16.8%; Convert model to PMML (4) 4.2%. Deploy in batch mode (48) 50.5%; Deploy in real-time mode (21) 22.1%.
    25. 25. Software popularity (‘05). Data mining/analytic tools you used in 2005 [376 voters, 860 votes total]: SPSS Clementine 135; SPSS 96; Excel 78; CART/MARS/TreeNet/RF 69; SAS 53; SAS Enterprise Miner 49; Your own code 39; Other free tools 34; Insightful Miner/S-Plus 32; Statsoft Statistica 30; Weka 30; ThinkAnalytics 26; C4.5/C5.0/See5 25; R 25; Microsoft SQL Server 23; Other commercial tools 23; MATLAB 16; Mineset (PurpleInsight) 16; Xelopes 16; Oracle Data Mining 10.
    26. 26. Software popularity (‘06). Data mining/analytic tools you used in 2006 [561 voters]: CART/MARS/TreeNet/RF 159 (72 alone); SPSS Clementine 127 (47 alone, 46 with SPSS); SPSS 100 (5 alone, 46 with Clementine); Excel 100 (3 alone); KXEN 90 (75 alone); your own code 77 (1 alone); SAS 72 (3 alone, 13 with E-Miner); Weka 62 (7 alone); R 53 (5 alone); MATLAB 41 (5 alone); other free tools 39 (3 alone); SAS E-Miner 37 (9 alone, 13 with SAS); SQL Server 32 (3 alone); other commercial tools 31 (7 alone); Oracle Data Mining 20 (13 alone); Insightful Miner/S-Plus 20 (0 alone); C4.5/C5.0/See5 18 (0 alone); Megaputer 16 (14 alone); Statsoft Statistica 13 (2 alone); Angoss 8 (2 alone); Mineset (PurpleInsight) 5 (3 alone); Eudaptics Viscovery 5 (3 alone); Xelopes 4 (4 alone); Visumap 3 (2 alone); IBM I-miner 3 (0 alone); Equbits 3 (3 alone).
    27. 27. SPSS webinars. Online seminar: Data Cleaning with SPSS. Friday, 6 October 2006, 10:00. Duration: 30 minutes. Improve data validation to obtain more accurate results. The new SPSS Data Validation module lets you easily identify suspicious or invalid cases, variables or values, and view patterns of missing data and summarize variable distributions. Knowing this, you can determine the validity of the data and remove or correct the suspicious cases as you see fit before analysis. Online seminar: Clementine Desktop. Friday, 20 October 2006, 10:00. Duration: 30 minutes. Data mining is a technology that brings considerable, quantifiable value to your company. By uncovering hidden connections and patterns in data, data mining enables your organization to improve its business processes and make the best decisions at the right time. SPSS Inc. now offers Clementine Desktop to help small and medium-sized companies, and business units within larger organizations, enjoy the advantages of data mining. Like Clementine, our industry-leading data mining solution, Clementine Desktop combines advanced analysis techniques with a visual and highly intuitive interface. It is also compatible with CRISP-DM (Cross-Industry Standard Process for Data Mining), the standard data mining methodology. REGISTRATION: https://spssevents.webex.com/
    28. 28. Show me the money!
    29. 29. Miners’ salaries around the globe (’05-’06)…
    30. 30. Mining jobs… Company: Microsoft. Position: Research SDE. Location: Redmond, WA. As a Research Developer, you will be working side by side with applied researchers in the adCenter Labs to incubate and build research prototypes in the areas of data mining, information retrieval and online auction. Realize the algorithms in the form of research prototypes to functional production components… The ideal candidate should have: excellence in algorithms, data structures, discrete math, databases and data warehousing; production coding experience in web scripting, C/C++, .NET framework, Perl, SQL/MDX. (02/10/06) Company: Waterfront International Ltd. Position: Data Mining Analyst. Location: Toronto, Canada. Waterfront International is a Toronto-based financial consulting firm, specializing in developing computer-based statistical trading strategies. Primary responsibilities: perform financial market data research and analysis to identify and resolve data issues using advanced data mining techniques; develop proprietary data mining tools and applications, and predictive models. Requirements: PhD or Masters in mathematics, statistics or computer science specializing in data mining; must possess expert-level C/C++ programming skills; some financial experience desired but not required. (25/09/06)
    31. 31. Mining jobs… Company: Yahoo! Position: Data Mining Researcher. Location: Sunnyvale, CA. Each day Yahoo! collects approximately ten terabytes of data, more than the entire Library of Congress. We analyze this data and act on it to constantly better our user experience, while building the world’s best consumer-centric data platform. Yahoo! DATA MINING and RESEARCH GROUP (DMR) is looking for an outstanding data mining researcher who wants to work on real problems leading to solutions that make a direct and measurable business impact. This individual should enjoy formulating problems based on customer needs, selecting, modifying and/or building appropriate tools or methodologies, and providing true end-to-end solutions for diversified and challenging data mining and data research projects. Requirements: experience in exploratory data analysis and the data mining process; proven knowledge of data mining and machine learning methods and tools; Ph.D. in Data Mining, Machine Learning, Information Retrieval, Statistics, Artificial Intelligence, or a related field; software development skills (C/C++, Perl, Java, SQL, etc.). (10/09/06)
    32. 32. Mining jobs… Company: Siemens Corporate Research. Position: Research Scientist - Semantic Analysis. Location: Princeton, NJ. The Semantic Analysis Group of Siemens Corporate Research, Princeton, NJ, has an opening for a Research Scientist in the area of semantic modeling and analysis. The ideal candidate will have a background in statistical learning, natural language processing, and ontological knowledge formalism technologies. In particular, candidates with experience in using machine learning techniques for semantic analysis of unstructured data in a range of applications such as data cleaning, clustering, summarization, question/answering and topic detection will be preferred. Familiarity with the Semantic Web and knowledge representation is also expected. We will only consider candidates holding a Ph.D. in Computer Science, Electrical Engineering or Applied Math. (25/09/06)
    33. 33. Miner jobs!… Company: NOVAQUALITY CONSULTING. Position: Data Mining Consultant with SAS. Location: Madrid. We are looking for someone with extensive experience in data mining projects using the various SAS solutions. The role covers both technical work (analysis of database structures, definition of processes, generation of matrices, etc.) and functional work, including liaison with business managers. At least 3 years' experience in data mining projects, implementing predictive and descriptive models from conception through to production. Strong knowledge of the various SAS tools, for basic data handling (SAS BASE) as well as statistical analysis and data mining (SAS Stat and EM). (04/10/06) Company: DMR Consulting. Position: Senior Business Intelligence Consultants. Location: Barcelona. We need to hire Senior Business Intelligence Consultants to take part in defining and delivering Business Intelligence solutions such as dashboards, Balanced Scorecard, reporting, analytical marketing, data mining, simulation environments, etc. Graduate engineer or similar, with experience in designing and implementing business solutions based on Business Intelligence, Data Warehousing and/or Data Mining. Knowledge of several of the following products is required. Relational databases: Oracle, SQL Server, DB2, Informix, etc. Reporting: Oracle Reports, Dynasight, OnVision, Crystal Reports, etc. Analytical tools: Cognos, Business Objects, MicroStrategy, Analysis Services. (03/10/06)
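
Slide 18 illustrates dependency analysis with the statement that 30% of those who bought peanuts also bought beer. In association-rule terms that figure is the confidence of the rule peanuts → beer. The following is a minimal sketch of how support and confidence could be computed. The basket data and the function names `support` and `confidence` are hypothetical, chosen only for illustration, and are not taken from the slides.

```python
from typing import Iterable, List, Set


def support(baskets: List[Set[str]], items: Iterable[str]) -> float:
    """Fraction of all baskets that contain every item in `items`."""
    wanted = set(items)
    if not baskets:
        return 0.0
    return sum(wanted <= basket for basket in baskets) / len(baskets)


def confidence(baskets: List[Set[str]],
               antecedent: Iterable[str],
               consequent: Iterable[str]) -> float:
    """Among baskets containing the antecedent, the fraction that also
    contain the consequent. Returns 0.0 if the antecedent never occurs."""
    lhs = set(antecedent)
    both = lhs | set(consequent)
    n_lhs = sum(lhs <= basket for basket in baskets)
    if n_lhs == 0:
        return 0.0
    return sum(both <= basket for basket in baskets) / n_lhs


# Hypothetical toy data: 10 baskets with peanuts, 3 of which also have beer.
BASKETS: List[Set[str]] = (
    [{"peanuts", "beer"}] * 3
    + [{"peanuts", "chips"}] * 7
    + [{"beer"}] * 5
    + [{"milk"}] * 5
)

if __name__ == "__main__":
    print(f"{confidence(BASKETS, {'peanuts'}, {'beer'}):.0%}")  # 30%
    print(f"{support(BASKETS, {'peanuts', 'beer'}):.0%}")       # 15%
```

Confidence alone can mislead when the consequent is very common, which is why it is usually reported together with support.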

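
Slide 10 flags missing data and outliers in the METROFANG sensor series. The sketch below shows one possible form for that kind of data-quality check. The readings are made up, and the median/MAD rule (modified z-score with a 3.5 cut-off) is an assumption chosen for illustration, not the method used in the slides.

```python
import math
import statistics
from typing import Dict, List, Optional


def quality_report(series: List[Optional[float]],
                   threshold: float = 3.5) -> Dict[str, object]:
    """Count missing readings (None or NaN) and flag outliers using the
    modified z-score, 0.6745 * |x - median| / MAD."""
    present = [(i, x) for i, x in enumerate(series)
               if x is not None and not math.isnan(x)]
    values = [x for _, x in present]
    outliers: List[int] = []
    if values:
        med = statistics.median(values)
        mad = statistics.median([abs(x - med) for x in values])
        if mad > 0:  # with MAD == 0 the score is undefined; flag nothing
            outliers = [i for i, x in present
                        if 0.6745 * abs(x - med) / mad > threshold]
    return {
        "n": len(series),
        "missing": len(series) - len(values),
        "outlier_indices": outliers,
    }


# Hypothetical sensor readings with two gaps, one spike and one drop-out.
READINGS: List[Optional[float]] = [
    100.0, 102.0, None, 98.0, 101.0, 350.0, 99.0, float("nan"), 100.0, 0.0,
]

if __name__ == "__main__":
    print(quality_report(READINGS))
```

A rule like this does not account for seasonality or weekend effects, which the slide also lists. For a series with a strong periodic pattern, the same check would more sensibly be applied to residuals after removing that pattern.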